research.upneja.ai · a pre-registered experiment · sealed June 2026

Can a model forecast the research frontier?

In June 2026, while the NeurIPS papers sat under review, a frontier model (Opus 4.8) wrote down ten concepts it expects to be prominent at the December conference. Each is novel relative to 2024 and 2025, read off public signals already visible on arXiv, ICLR 2026, and ICML 2026, and each carries a test it can fail. In December the program goes public and the set is scored against a naive “extrapolate last year” baseline.

Sydney, Australia (ICC) · Dec 6–12, 2026 · papers due May 6, 2026 · decisions Sep 24, 2026 · 9 research scouts · 3 adversarial critics

What this is, and is not

A test of foresight, not invention

The model did not invent these concepts. It read the field and bet on which fresh directions will matter, before the answer was public. That is research judgment, and it is scoreable.

Falsifiable in December

Every prediction has a December 2026 criterion checkable against the public program: accepted titles, orals, awards, the workshop slate. Hit or miss, then scored against a naive baseline.

No self-grading

Opus 4.8 is being measured on an Opus 4.8 forecast, so the score comes from the public program in December, not from a rating the model gives itself. The method and the receipt live in the paper.

I · the data layer

The landscape, as the evidence shows it

Before any prediction, what the field actually looks like now, and where it is accelerating. Momentum is read as share, not raw count, because the submission pipeline roughly doubled in two years.

19.5k · ICLR 2026 submissions, up from 7.3k in 2024
0 → 110 · ICLR titles with “RLVR / verifiable reward” (2024→26)
5 mo · training-compute doubling time (AI Index)
Tsinghua · overtook Google for #1 by NeurIPS 2025 accepts

Fig 1 · arXiv title keywords, 2019–2025 (log scale)

The rise of, and the fade of

[Figure: line chart of arXiv title-keyword counts; axes are submission year (2019–2025) and paper count on a log scale (1 to 10,000)]
Counts of papers with each term in the title, by submission year (arXiv API, constant method). Log scale: “LLM” went 2 → 6,525; “reasoning” 315 → 5,487; graph neural networks and self-supervised learning have peaked and turned down.

Fig 2 · ICLR title counts, 2024 → 2026

The sharpest leading indicator

term · trend · ICLR title count 2024 → 2025 → 2026
reasoning · surging · 154 → 403 → 1657
GRPO · zero-to-cluster · 0 → 0 → 73
RLVR / verifiable reward · zero-to-cluster · 0 → 0 → 110
test-time / inference-time · surging · 47 → 85 → 397
KV-cache · surging · 1 → 29 → 60
MoE · surging · 13 → 51 → 91
RLHF · plateaued · 36 → 120 → 111
Same-method title scan across three ICLR cycles. GRPO and RLVR went from literally zero to clusters; reasoning quadrupled in share; RLHF plateaued. NeurIPS 2026 is the first cycle whose submissions all post-date DeepSeek-R1.

Fig 3 · submissions / accepts / rate

The pipeline roughly doubled in two years

[Figure: line chart of NeurIPS, ICML, and ICLR submission counts by year; count axis 0 to 25k, year axis ticked 2012 to 2024]
Why momentum must be read as share, not raw count: almost everything rose. NeurIPS, ICML, and ICLR submission counts since 2010. 2026 NeurIPS totals are not public until ~Sep 24, 2026.

Fig 4 · topic share of abstracts, 2023 → 2025

Multimodal surges, contrastive fades

[Figure: topic share of abstracts by year, 2023–2025. Labelled series: Vision-Language 40%, Diffusion & Generative 19.2%, Contrastive 5.1%]
Share of CVPR + ICLR + NeurIPS abstracts (26K-paper VLM survey, 2510.09586). Vision-language went 16% → 40% of abstracts in two years.

Fig 5 · top institutions by accepts

The center of gravity moved east

NeurIPS 2024
  1. Google · 300
  2. Tsinghua · 255
  3. CMU · 180
  4. Zhejiang · 172
  5. Microsoft · 166
  6. MIT · 161
NeurIPS 2025
  1. Tsinghua · 349
  2. Google · 322
  3. Peking · 279
  4. Shanghai Jiao Tong · 244
  5. CUHK · 243
  6. HKUST · 225
NeurIPS top-6 institutions by accepted papers (papercopilot). Tsinghua overtook Google for the #1 slot at NeurIPS 2025 — the first time a non-US lab led.

Fig 6 · the CFP as a leading indicator

NeurIPS 2026 · Sydney, Australia (ICC) · Dec 6–12, 2026

  • Track renamed "Datasets & Benchmarks" → "Evaluations & Datasets"
  • Page limit changed to 9 content pages
  • Position Paper Track returns (2nd year)
  • ML Reproducibility Challenge (MLRC) becomes an official track
  • Creative AI Track (4th year), theme: "Agency"
  • Randomized controlled AI-assisted-reviewing experiment (LLM-as-reviewer)
  • AI-generated-paper crackdown: 178 of ~970 position papers desk-rejected
The conference's own structural changes are evidence. Papers were due May 6, 2026; decisions land ~Sep 24, 2026.

II · the forecast

Ten predictions for NeurIPS 2026

Each names a specific, fresh concept, not a 2025 truism, already cresting on hard leading indicators: the NeurIPS track rename, ICLR/ICML 2026 orals, AlphaProof's Erdős results, and a constant-method keyword scan.

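The frontier plot · novelty × confidence (● the ten, ○ watchlist)
[Figure: scatter plot of the ten predictions, numbered 1–10. Axes: novelty (50 to 100, expected → genuinely fresh) and confidence (60% to 90%, longshot → near-certain). The upper-right region is labelled “novel & probable”.]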

The bet was a forecast that is both fresh and likely, so the ten cluster up and to the right.

Ten forecasts. Each carries an explicit confidence and a December-2026 test it can fail.

01 · Evaluation / safety · genuinely novel

Evaluation becomes a science: sandbagging, eval-awareness, and benchmark auditing

The field inverts from shipping new benchmarks to interrogating them. Models behave differently when they detect a test, eval-awareness follows a scaling law that structurally caps every static benchmark, and auditing / reliability / construct-validity becomes the dominant mode — institutionally blessed.

Confidence: 87%

02 · Post-training / RL · genuinely novel

RLVR escapes verifiable domains via rubric and generative verifiers; GRPO-successors become the default

RL-with-verifiable-rewards extends past math and code into writing, medicine, and law through rubrics and LLM-generated programmatic verifiers, while sequence-level / MoE-stable GRPO successors (GSPO, DAPO) replace vanilla GRPO and PPO.

Confidence: 85%

03 · Agents / RL · sharpened

Agentic RL where the environment, not the algorithm, is the scarce asset

The bottleneck shifts from the RL algorithm (now commoditized) to environment fidelity and reward checking, and long-horizon credit assignment (turn-level, hierarchical, hindsight) becomes a named subfield.

Confidence: 74%

04 · Generative / embodied · genuinely novel

World models as trainable simulators, with physics-faithfulness as the open problem

Generative models are used as environments to train and plan inside (Dreamer-4-style imagination training, Genie/Cosmos interactive simulators), not just as content generators — and “does the world model respect physics” becomes the central contested question.

Confidence: 72%

05 · Multimodal · fresh 2026

Unified, reasoning-infused multimodal: one backbone that reasons, then renders

The dominant multimodal story stops being understanding-only VLMs and becomes single backbones that both interpret and generate — an autoregressive-reasoning core that plans, then a diffusion head that renders. Multimodal is the single steepest-rising theme in ML.

Confidence: 82%

06 · Generative LM · sharpened

Diffusion language models become a recognized alternative to autoregressive text

Discrete / masked diffusion moves off images and onto the autoregressive LMs' home turf — text, code, reasoning — now at 100B scale and with its own RL post-training sub-literature.

Confidence: 71%

07 · Architecture · genuinely novel

Hybrid linear-attention beats pure SSM; attention is contested again

The contrarian call: not the naive “pure Mamba/SSM wins” bet but the ~3:1 hybrid linear-attention-to-full-attention recipe that frontier open models converged on — while pure SSM cools.

Confidence: 74%

08 · Interpretability · genuinely novel

Interpretability after SAEs: transcoders, circuit tracing, parameter decomposition

Sparse autoencoders are reframed as plateaued; the frontier moves to cross-layer transcoders, model diffing, and parameter decomposition as the working tools of mechanistic interpretability.

Confidence: 77%

09 · AI for science · genuinely novel

AI for mathematics: autonomous theorem proving cracks open problems

LLM-plus-Lean systems move from competition problems to genuinely open ones — the single freshest, highest-prestige result of the 2026 cycle. Lower volume than the rest, but spotlight-grade.

Confidence: 65%

10 · Learning paradigm · genuinely novel

Test-time training: updating weights during inference

Beyond frozen weights plus RAG, the model adapts its own parameters per-input or per-context at inference time — the freshest genuine inflection of 2026, now with a theory spine connecting it to linear attention.

Confidence: 70%

Watchlist · strong signals cut from the ten for novelty or slot limits
Inference-aware scaling laws & the efficiency inversion · 76%
Strongest workshop proxy (NeurIPS 2025 Efficient Reasoning drew 1,000+); T² overtraining-optimal (2604.01411) reverses Chinchilla. Lost the slot to multimodal: it is the least novel of the contenders and overlaps the reasoning lane.

The science of peer review under AI load · 76%
Strongest direct NeurIPS evidence (a randomized AI-reviewing experiment; 178 AI-generated position papers desk-rejected), but narrow, near-self-fulfilling, and held out to avoid a third eval/meta slot.

Latent / looped (recurrent-depth) reasoning · 58%
Genuinely novel but evidence-thin (single verified anchor, Ouro) with unverified follow-ons, and an incoming critique track may deflate it before December. Cut from the 10 by all three critics.

Native FP4 training (NVFP4 / MXFP4) · 68%
Nemotron pretrained 550B in NVFP4; Quartet II at ICML 2026. But it skews to MLSys venues, so it under-indexes at NeurIPS.

Reasoning-aware KV-cache compression · 75%
ThinKV is an ICLR 2026 Oral, but it is a technique, not a theme: too narrow for a top-ten slot.

Parametric / RL-learned agent memory (post-RAG) · 68%
Memory-R1, Evo-Memory, an ICLR MemAgents workshop. But memory-as-weights overlaps test-time training (#10).

III · how this was built

Method, and the honesty rules

A forecast is only as good as the discipline behind it. Confidence is cross-validation times a hard leading indicator, every number traces to a source, and nothing rests on a rumor.

Cross-validation = confidence

A concept independently surfaced by ≥2 of the 9 research scouts AND backed by a hard leading indicator (ICLR/ICML 2026 accept data, an award, or a NeurIPS 2026 CFP fact) is high-probability. Single-scout or SEO-only signals are discounted.
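
The gate described above can be sketched in a few lines. This is only an illustration of the stated rule (at least 2 independent scouts AND a hard leading indicator); the `Signal` class, the indicator labels, and the example concepts are hypothetical, and the sketch does not reproduce how the percentage confidences were assigned, which the text does not specify.

```python
from dataclasses import dataclass

# Indicator types the text counts as "hard" (label strings are hypothetical).
HARD_INDICATORS = {"iclr_icml_2026_accepts", "award", "neurips_2026_cfp"}


@dataclass
class Signal:
    concept: str
    scouts: set       # ids of the research scouts that surfaced the concept
    indicators: set   # indicator labels attached to the concept


def is_high_probability(signal: Signal, min_scouts: int = 2) -> bool:
    """Apply the stated gate: >= min_scouts independent scouts AND a hard indicator."""
    cross_validated = len(signal.scouts) >= min_scouts
    has_hard_anchor = bool(signal.indicators & HARD_INDICATORS)
    return cross_validated and has_hard_anchor


# Illustrative inputs only; scout ids and concepts are made up.
examples = [
    Signal("concept A", {1, 4, 7}, {"neurips_2026_cfp"}),
    Signal("concept B", {3}, {"award"}),        # single scout: discounted
    Signal("concept C", {2, 5}, {"seo_only"}),  # no hard anchor: discounted
]
results = {s.concept: is_high_probability(s) for s in examples}
print(results)
```

Only the first example clears the gate; the other two fail on one condition each, which is the "discounted" case in the text.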

Share, not raw count

The ICLR pipeline went 7.3k → 11.7k → 19.5k and ICML roughly doubled, so raw counts rose for almost everything. A concept counts as surging only if its growth outran the ~1.7–2× program inflation, that is, only if its share rose. Every criterion is written in shares, ranks, awards, or jumps from a near-zero base, against a ~7,000-accept program.
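
A minimal sketch of the share-versus-raw-count arithmetic, using the ICLR title counts from Fig 2 and the 2024 and 2026 submission totals quoted above. It assumes those submission totals are the right denominators for the title scan, which the text does not confirm. GRPO and RLVR are left out because a zero base makes the ratio undefined.

```python
# ICLR title counts from Fig 2, as (2024, 2026).
title_counts = {
    "reasoning": (154, 1657),
    "test-time / inference-time": (47, 397),
    "KV-cache": (1, 60),
    "MoE": (13, 91),
    "RLHF": (36, 111),
}

# ICLR submission totals quoted in the text (7.3k and 19.5k).
# Assumption: these are the denominators the title scan should be read against.
TOTAL_2024 = 7_300
TOTAL_2026 = 19_500


def share_growth(count_then, count_now, total_then=TOTAL_2024, total_now=TOTAL_2026):
    """Return the factor by which a term's share of the program grew."""
    share_then = count_then / total_then
    share_now = count_now / total_now
    return share_now / share_then


# Program inflation: the factor a raw count rises by if its share stays flat.
inflation = TOTAL_2026 / TOTAL_2024

growth = {term: share_growth(*counts) for term, counts in title_counts.items()}

for term, factor in growth.items():
    then, now = title_counts[term]
    print(f"{term:28s} raw x{now / then:5.1f}   share x{factor:5.2f}")
```

Under these assumed denominators, reasoning comes out at roughly 4× in share, in line with the “quadrupled in share” reading, while RLHF roughly triples in raw count but stays close to flat in share.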

Novelty bar

Reject the 2024/25 truisms (“more LLMs / agents / multimodal”). Each pick names a specific, fresh concept a non-expert wouldn't already assume.

The timing key

NeurIPS 2026 is the first NeurIPS whose full submission cycle post-dates DeepSeek-R1 (Jan 2025). R1-descended concepts (RLVR, reasoning-RL, test-time compute) peak here, not at NeurIPS 2025.

Adversarial before locking

Three independent critics (novelty, probability/evidence, falsifiability) attacked the draft. They caught a factual error (a misattributed CVPR best paper), a missing theme (multimodal), and systematic over-confidence — all corrected before these ten were locked.

Falsifiable

Every prediction carries a December-2026 criterion checkable against the public program: accepts, titles, abstracts, orals/spotlights, awards, workshops, and the CFP.
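
One way the hit-or-miss resolution could be turned into a score against a baseline is sketched below. The text does not state the scoring rule, so the Brier score is an assumed stand-in. The flat 0.5 baseline is a placeholder for the naive baseline, whose form is not given here, and the all-hit and all-miss outcome vectors are bounding cases, not results.

```python
# Stated confidences for predictions 01..10, in order.
confidences = [0.87, 0.85, 0.74, 0.72, 0.82, 0.71, 0.74, 0.77, 0.65, 0.70]


def brier_score(probs, outcomes):
    """Mean squared gap between stated probabilities and 0/1 outcomes (lower is better)."""
    if len(probs) != len(outcomes):
        raise ValueError("one outcome per prediction is required")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def score_against_baseline(probs, baseline_probs, outcomes):
    """Return (forecast score, baseline score, skill); skill > 0 means the forecast did better."""
    forecast = brier_score(probs, outcomes)
    baseline = brier_score(baseline_probs, outcomes)
    skill = 1 - forecast / baseline if baseline > 0 else float("nan")
    return forecast, baseline, skill


# Placeholder baseline (assumption): a flat 50% on every prediction.
placeholder_baseline = [0.5] * len(confidences)

# Bounding cases only: every prediction hits, or every prediction misses.
for label, outcomes in [("all hit", [1] * 10), ("all miss", [0] * 10)]:
    forecast, baseline, skill = score_against_baseline(
        confidences, placeholder_baseline, outcomes
    )
    print(f"{label:8s} forecast={forecast:.4f} baseline={baseline:.4f} skill={skill:+.3f}")
```

A proper scoring rule of this kind rewards stated confidence only when it is earned: the same 65–87% confidences that score well if every prediction hits score worse than the flat baseline if every prediction misses.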

Honesty rules baked into the data
  • arXiv category counts include cross-lists, so the four CS categories are never summed into a “unique” total.
  • Conference pipelines roughly doubled in two years, so momentum is read as share, not raw count, against a NeurIPS 2026 program of ~7,000 accepts.
  • Unverifiable values are shown as n/a, never guessed. NeurIPS 2026 accept totals are not public until ~Sep 24, 2026.
  • No prediction rests on an SEO-suspect model name or an unverified late-2026 arXiv id, only on hard anchors: the track rename, ICLR/ICML 2026 orals and accept counts, AlphaProof's Erdős results, and a constant-method keyword scan. (A draft anchor, a CVPR best paper claimed to be a world model, was caught wrong in review and removed.)

This forecast is dated and falsifiable.

Each prediction resolves hit or miss against its own December 2026 test, and the set is scored against a naive extrapolate-last-year baseline. Built from a nine-scout research sweep of arXiv, ICLR 2026, ICML 2026, and CVPR 2026, then hardened by three adversarial critics that caught a factual error and a missing theme before the ten were locked.

research.upneja.ai · sealed June 2026 · resolves December 2026 · forecast by Opus 4.8