research.upneja.ai · a pre-registered experiment · sealed June 2026
Can a model forecast the research frontier?
In June 2026, while the NeurIPS papers sat under review, a frontier model (Opus 4.8) wrote down ten concepts it expects to be prominent at the December conference. Each is novel relative to 2024 and 2025, read off public signals already visible on arXiv, ICLR 2026, and ICML 2026, and each carries a test it can fail. In December the program goes public and the set is scored against a naive “extrapolate last year” baseline.
◆ Sydney, Australia (ICC) ◆ Dec 6–12, 2026 ◆ papers due May 6, 2026 ◆ decisions Sep 24, 2026 ◆ 9 research scouts · 3 adversarial critics
The model did not invent these concepts. It read the field and bet on which fresh directions will matter, before the answer was public. That is research judgment, and it is scoreable.
Falsifiable in December
Every prediction has a December 2026 criterion checkable against the public program: accepted titles, orals, awards, the workshop slate. Hit or miss, then scored against a naive baseline.
No self-grading
Opus 4.8 is being measured on an Opus 4.8 forecast, so the score is the objective program in December, not a rating the model gives itself. The method and the receipt live in the paper.
Before any prediction, what the field actually looks like now, and where it is accelerating. Momentum is read as share, not raw count, because the submission pipeline roughly doubled in two years.
19.5k · ICLR 2026 submissions, up from 7.3k in 2024
0 → 110 · ICLR titles with “RLVR / verifiable reward” (2024→26)
5 mo · training-compute doubling time (AI Index)
Tsinghua · overtook Google for #1 by NeurIPS 2025 accepts
Fig 1 · arXiv title keywords, 2019–2025 (log scale)
The rise of, and the fade of
Counts of papers with each term in the title, by submission year (arXiv API, constant method). Log scale: “LLM” went 2 → 6,525; “reasoning” 315 → 5,487; graph neural networks and self-supervised learning have peaked and turned down.
Fig 2 · ICLR title counts, 2024 → 2026
The sharpest leading indicator
ICLR title count by term

term                         trend             2024   2025   2026
reasoning                    surging            154    403   1657
GRPO                         zero-to-cluster      0      0     73
RLVR / verifiable reward     zero-to-cluster      0      0    110
test-time / inference-time   surging             47     85    397
KV-cache                     surging              1     29     60
MoE                          surging             13     51     91
RLHF                         plateaued           36    120    111
Same-method title scan across three ICLR cycles. GRPO and RLVR went from literally zero to clusters; reasoning quadrupled in share; RLHF plateaued. NeurIPS 2026 is the first cycle whose submissions all post-date DeepSeek-R1.
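A same-method title scan can be sketched as a fixed set of patterns applied identically to every cycle's titles. This is a minimal illustration, not the scan used for Fig 2: the regular expressions, the function name `count_title_terms`, and the sample titles are all assumptions, since the page does not publish its matching rules.

```python
import re

# Hypothetical patterns: the exact matching rules behind Fig 2 are not published.
TERM_PATTERNS = {
    "reasoning": r"\breasoning\b",
    "GRPO": r"\bGRPO\b",
    "RLVR / verifiable reward": r"\bRLVR\b|\bverifiable[- ]rewards?\b",
    "KV-cache": r"\bKV[- ]cache\b",
}

def count_title_terms(titles, patterns=TERM_PATTERNS):
    """Count titles matching each term; a title counts at most once per term."""
    compiled = {term: re.compile(p, re.IGNORECASE) for term, p in patterns.items()}
    counts = {term: 0 for term in patterns}
    for title in titles:
        for term, rx in compiled.items():
            if rx.search(title):
                counts[term] += 1
    return counts

# Made-up titles, for illustration only (not real submissions).
SAMPLE_TITLES = [
    "Scaling Reasoning with Verifiable Rewards",
    "GRPO for Long-Horizon Agents",
    "KV Cache Eviction for Reasoning Models",
    "A Study of Graph Pooling",
]

if __name__ == "__main__":
    print(count_title_terms(SAMPLE_TITLES))
```

Holding the patterns constant across years is what makes the counts comparable; changing a pattern between cycles would show up as a spurious jump.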
Fig 3 · submissions / accepts / rate
The pipeline roughly doubled in two years
Why momentum must be read as share, not raw count: almost everything rose. NeurIPS, ICML, and ICLR submission counts since 2010. 2026 NeurIPS totals are not public until ~Sep 24, 2026.
Fig 4 · topic share of abstracts, 2023 → 2025
Multimodal surges, contrastive fades
Share of CVPR + ICLR + NeurIPS abstracts (26K-paper VLM survey, 2510.09586). Vision-language went 16% → 40% of abstracts in two years.
Fig 5 · top institutions by accepts
The center of gravity moved east
NeurIPS 2024
1  Google               300
2  Tsinghua             255
3  CMU                  180
4  Zhejiang             172
5  Microsoft            166
6  MIT                  161

NeurIPS 2025
1  Tsinghua             349
2  Google               322
3  Peking               279
4  Shanghai Jiao Tong   244
5  CUHK                 243
6  HKUST                225
NeurIPS top-6 institutions by accepted papers (papercopilot). Tsinghua overtook Google for the #1 slot at NeurIPS 2025, the first time a non-US institution led.
Fig 6 · the CFP as a leading indicator
NeurIPS 2026 · Sydney, Australia (ICC) · Dec 6–12, 2026
→ AI-generated-paper crackdown: 178 of ~970 position papers desk-rejected
The conference's own structural changes are evidence. Papers were due May 6, 2026; decisions land ~Sep 24, 2026.
II · the forecast
Ten predictions for NeurIPS 2026
Each names a specific, fresh concept, not a 2025 truism, already cresting on hard leading indicators: the NeurIPS track rename, ICLR/ICML 2026 orals, AlphaProof's Erdős results, and a constant-method keyword scan.
The frontier plot · novelty × confidence · ● the ten ○ watchlist
The bet was a forecast that is both fresh and likely, so the ten cluster up and to the right.
Ten forecasts. Each carries an explicit confidence and a December-2026 test it can fail.
01
Evaluation / safety · genuinely novel
Evaluation becomes a science: sandbagging, eval-awareness, and benchmark auditing
The field inverts from shipping new benchmarks to interrogating them. Models behave differently when they detect a test, eval-awareness follows a scaling law that structurally caps every static benchmark, and auditing / reliability / construct-validity becomes the dominant mode — institutionally blessed.
87%
02
Post-training / RL · genuinely novel
RLVR escapes verifiable domains via rubric and generative verifiers; GRPO-successors become the default
RL-with-verifiable-rewards extends past math and code into writing, medicine, and law through rubrics and LLM-generated programmatic verifiers, while sequence-level / MoE-stable GRPO successors (GSPO, DAPO) replace vanilla GRPO and PPO.
85%
03
Agents / RL · sharpened
Agentic RL where the environment, not the algorithm, is the scarce asset
The bottleneck shifts from the RL algorithm (now commoditized) to environment fidelity and reward checking, and long-horizon credit assignment (turn-level, hierarchical, hindsight) becomes a named subfield.
74%
04
Generative / embodied · genuinely novel
World models as trainable simulators, with physics-faithfulness as the open problem
Generative models are used as environments to train and plan inside (Dreamer-4-style imagination training, Genie/Cosmos interactive simulators), not just as content generators — and “does the world model respect physics” becomes the central contested question.
72%
05
Multimodal · fresh 2026
Unified, reasoning-infused multimodal: one backbone that reasons, then renders
The dominant multimodal story stops being understanding-only VLMs and becomes single backbones that both interpret and generate — an autoregressive-reasoning core that plans, then a diffusion head that renders. Multimodal is the single steepest-rising theme in ML.
82%
06
Generative LM · sharpened
Diffusion language models become a recognized alternative to autoregressive text
Discrete / masked diffusion moves off images and onto the autoregressive LMs' home turf — text, code, reasoning — now at 100B scale and with its own RL post-training sub-literature.
71%
07
Architecture · genuinely novel
Hybrid linear-attention beats pure SSM; attention is contested again
The contrarian call: not the naive “pure Mamba/SSM wins” bet but the ~3:1 hybrid linear-attention-to-full-attention recipe that frontier open models converged on — while pure SSM cools.
74%
08
Interpretability · genuinely novel
Interpretability after SAEs: transcoders, circuit tracing, parameter decomposition
Sparse autoencoders are reframed as plateaued; the frontier moves to cross-layer transcoders, model diffing, and parameter decomposition as the working tools of mechanistic interpretability.
77%
09
AI for science · genuinely novel
AI for mathematics: autonomous theorem proving cracks open problems
LLM-plus-Lean systems move from competition problems to genuinely open ones — the single freshest, highest-prestige result of the 2026 cycle. Lower volume than the rest, but spotlight-grade.
65%
10
Learning paradigm · genuinely novel
Test-time training: updating weights during inference
Beyond frozen weights plus RAG, the model adapts its own parameters per-input or per-context at inference time — the freshest genuine inflection of 2026, now with a theory spine connecting it to linear attention.
70%
Watchlist · strong signals cut from the ten for novelty or slot limits
76%
Inference-aware scaling laws & the efficiency inversion
Strongest workshop proxy (NeurIPS 2025 Efficient Reasoning drew 1,000+); T² overtraining-optimal (2604.01411) reverses Chinchilla. Lost the slot to multimodal — it is the least novel of the contenders and overlaps the reasoning lane.
76%
The science of peer review under AI load
Strongest direct NeurIPS evidence — a randomized AI-reviewing experiment; 178 AI-generated position papers desk-rejected — but narrow, near-self-fulfilling, and held out to avoid a third eval/meta slot.
58%
Latent / looped (recurrent-depth) reasoning
Genuinely novel but evidence-thin (single verified anchor, Ouro) with unverified follow-ons, and an incoming critique track may deflate it before December. Cut from the 10 by all three critics.
68%
Native FP4 training (NVFP4 / MXFP4)
Nemotron pretrained 550B in NVFP4; Quartet II at ICML 2026 — but skews to MLSys venues, so it under-indexes at NeurIPS.
75%
Reasoning-aware KV-cache compression
ThinKV is an ICLR 2026 Oral, but it is a technique, not a theme — too narrow for a top-ten slot.
68%
Parametric / RL-learned agent memory (post-RAG)
Memory-R1, Evo-Memory, an ICLR MemAgents workshop — but memory-as-weights overlaps test-time training (#10).
III · how this was built
Method, and the honesty rules
A forecast is only as good as the discipline behind it. Confidence is cross-validation times a hard leading indicator, every number traces to a source, and nothing rests on a rumor.
Cross-validation = confidence
A concept independently surfaced by ≥2 of the 9 research scouts AND backed by a hard leading indicator (ICLR/ICML 2026 accept data, an award, or a NeurIPS 2026 CFP fact) is high-probability. Single-scout or SEO-only signals are discounted.
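The rule above can be written as a small classifier. This is a sketch under stated assumptions: the `Signal` record, the `tier` function, and the `"needs-anchor"` label for concepts with two or more scouts but no hard indicator are hypothetical, because the page only defines the high-probability and discounted cases.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Signal:
    concept: str
    scouts: int            # how many of the 9 scouts surfaced it independently
    hard_indicator: bool   # ICLR/ICML 2026 accept data, an award, or a CFP fact
    seo_only: bool = False # evidence comes only from SEO-style sources

def tier(signal: Signal) -> str:
    """Map a signal to a confidence tier following the stated rule."""
    # Single-scout or SEO-only signals are discounted, whatever else they have.
    if signal.seo_only or signal.scouts < 2:
        return "discounted"
    # Two or more scouts AND a hard leading indicator: high-probability.
    if signal.hard_indicator:
        return "high-probability"
    # Hypothetical label: the page does not name this middle case.
    return "needs-anchor"

if __name__ == "__main__":
    print(tier(Signal("example concept", scouts=3, hard_indicator=True)))
```

Both conditions must hold for the top tier, which is why the checks are ordered from disqualifying to qualifying.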
Share, not raw count
The ICLR pipeline went 7.3k → 11.7k → 19.5k and ICML roughly doubled, so raw counts rose for almost everything. A concept counts as surging only if its raw count outgrew the ~1.7–2× program inflation, that is, only if its share of the program rose. Every criterion is written in shares, ranks, awards, or jumps from a near-zero base, against a ~7,000-accept program.
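To make the share-versus-count distinction concrete, the sketch below divides the Fig 2 title counts by the ICLR submission totals quoted on this page. It assumes the title counts are taken over submissions, so that the submission total is the right denominator; the function names are illustrative.

```python
# ICLR submission totals as quoted on this page (7.3k, 11.7k, 19.5k).
SUBMISSIONS = {2024: 7_300, 2025: 11_700, 2026: 19_500}

# Title counts copied from Fig 2.
TITLE_COUNTS = {
    "reasoning": {2024: 154, 2025: 403, 2026: 1657},
    "MoE": {2024: 13, 2025: 51, 2026: 91},
    "RLHF": {2024: 36, 2025: 120, 2026: 111},
}

def share(term, year):
    """Fraction of that year's submissions with the term in the title."""
    return TITLE_COUNTS[term][year] / SUBMISSIONS[year]

def raw_growth(term, start, end):
    """Growth factor of the raw title count."""
    return TITLE_COUNTS[term][end] / TITLE_COUNTS[term][start]

def share_growth(term, start, end):
    """Growth factor of the share, which nets out program inflation."""
    return share(term, end) / share(term, start)

if __name__ == "__main__":
    for term in TITLE_COUNTS:
        print(term,
              round(raw_growth(term, 2024, 2026), 2),
              round(share_growth(term, 2024, 2026), 2))
```

On these numbers the raw “reasoning” count grows roughly 10.8× from 2024 to 2026 while its share grows roughly 4×, and the RLHF share falls between 2025 and 2026 even though the raw count barely moves.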
Novelty bar
Reject the 2024/25 truisms (“more LLMs / agents / multimodal”). Each pick names a specific, fresh concept a non-expert wouldn't already assume.
The timing key
NeurIPS 2026 is the first NeurIPS whose full submission cycle post-dates DeepSeek-R1 (Jan 2025). R1-descended concepts (RLVR, reasoning-RL, test-time compute) peak here, not at NeurIPS 2025.
Adversarial before locking
Three independent critics (novelty, probability/evidence, falsifiability) attacked the draft. They caught a factual error (a misattributed CVPR best paper), a missing theme (multimodal), and systematic over-confidence — all corrected before these ten were locked.
Falsifiable
Every prediction carries a December-2026 criterion checkable against the public program: accepts, titles, abstracts, orals/spotlights, awards, workshops, and the CFP.
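One way the December scoring could work is sketched below. The page states only that each call is a hit or a miss and that the set is scored against a naive baseline; the Brier score and the skill ratio used here are assumptions, not the documented method. The confidences are the ten listed above, and no outcomes are included because none exist yet.

```python
# Stated confidences for predictions 01-10, as listed on this page.
CONFIDENCES = [0.87, 0.85, 0.74, 0.72, 0.82, 0.71, 0.74, 0.77, 0.65, 0.70]

def expected_hits(probs):
    """Number of hits implied by the confidences if they are well calibrated."""
    return sum(probs)

def brier_score(probs, outcomes):
    """Mean squared error between confidences and 0/1 outcomes (lower is better)."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill(model_probs, baseline_probs, outcomes):
    """1 - BS_model / BS_baseline; positive means the model beat the baseline.

    Assumes the baseline's Brier score is non-zero.
    """
    return 1 - brier_score(model_probs, outcomes) / brier_score(baseline_probs, outcomes)

if __name__ == "__main__":
    # Outcomes are unknown until the December program is public.
    print(round(expected_hits(CONFIDENCES), 2))
```

A proper scoring rule such as this rewards calibrated confidence as well as correct calls, so ten hits at a stated 70% would score worse than ten hits at a stated 95%.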
Honesty rules baked into the data
· arXiv category counts include cross-lists, so the four CS categories are never summed into a “unique” total.
· Conference pipelines roughly doubled in two years, so momentum is read as share, not raw count, against a NeurIPS 2026 program of ~7,000 accepts.
· Unverifiable values are shown as n/a, never guessed. NeurIPS 2026 accept totals are not public until ~Sep 24, 2026.
· No prediction rests on an SEO-suspect model name or an unverified late-2026 arXiv id, only on hard anchors: the track rename, ICLR/ICML 2026 orals and accept counts, AlphaProof's Erdős results, and a constant-method keyword scan. (A draft anchor, a CVPR best paper claimed to be a world model, was caught wrong in review and removed.)