The scorecard · resolves December 2026

The December scorecard

This is the answer key, published empty. When the NeurIPS 2026 program goes public, each prediction resolves hit or miss against its own test, with no partial credit and no moving the line. The set is then scored two ways: by hit rate, measured against a naive baseline that just predicts last year's themes will repeat, and by calibration, meaning whether the higher-confidence calls came true more often than the lower-confidence ones.
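The two scoring passes can be sketched in a few lines of Python. This is a minimal sketch, not the scoring code from the paper: the confidences are the ten "called" percentages on this page, in page order, while the function names, the median split used for the calibration check, and the example outcome vector are assumptions made for illustration. No real outcomes exist until the program is public.

```python
from statistics import median

# Stated confidences for the ten predictions, in page order.
CONFIDENCES = [0.87, 0.85, 0.74, 0.72, 0.82, 0.71, 0.74, 0.77, 0.65, 0.70]


def hit_rate(outcomes):
    """Fraction of predictions that resolved as hits (True)."""
    return sum(outcomes) / len(outcomes)


def calibration_split(confidences, outcomes):
    """Compare hit rates of higher- vs lower-confidence calls.

    Assumption: "higher" means strictly above the median confidence.
    Returns (high_rate, low_rate); the calls look calibrated in the
    sense used on this page when high_rate > low_rate.
    """
    cut = median(confidences)
    high = [o for c, o in zip(confidences, outcomes) if c > cut]
    low = [o for c, o in zip(confidences, outcomes) if c <= cut]
    return sum(high) / len(high), sum(low) / len(low)


# Hits implied by the stated confidences if they were perfectly calibrated.
expected_hits = sum(CONFIDENCES)  # about 7.57 of 10

# HYPOTHETICAL outcomes, for illustration only; nothing has resolved yet.
example_outcomes = [True, True, True, False, True, False, True, True, False, False]

if __name__ == "__main__":
    print(f"expected hits: {expected_hits:.2f}")
    print(f"hit rate: {hit_rate(example_outcomes):.2f}")
    print("high vs low:", calibration_split(CONFIDENCES, example_outcomes))
```

Summing the stated confidences gives about 7.57, so a perfectly calibrated forecaster would expect seven or eight of the ten calls to hit.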

◆ sealed 21 June 2026 ◆ decisions ~24 Sep 2026 ◆ conference Sydney · 6–12 December 2026 ◆ status: awaiting program

Evaluation becomes a science: sandbagging, eval-awareness, and benchmark auditing

87% called

The Evaluations & Datasets track ships, AND ≥20 accepted papers' primary contribution is auditing existing benchmarks / contamination detection / eval-awareness / construct validity (out of a ~7,000-paper program), AND ≥1 gets a spotlight, oral, or award.

pending · resolves Dec 2026
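Each test on this page is a conjunction of clauses, so it resolves as a hit only when every clause holds. A minimal sketch of that rule for the evaluation prediction, assuming hypothetical field names and placeholder counts (the real tallies can only come from the published program):

```python
from dataclasses import dataclass


@dataclass
class EvalScienceEvidence:
    """Hypothetical tallies gathered from the accepted-paper list."""
    track_shipped: bool      # the Evaluations & Datasets track ran
    auditing_papers: int     # papers whose primary contribution qualifies
    recognized_papers: int   # of those, how many got a spotlight, oral, or award


def resolve_eval_science(e: EvalScienceEvidence) -> bool:
    """Hit only if all three clauses hold; there is no partial credit."""
    return e.track_shipped and e.auditing_papers >= 20 and e.recognized_papers >= 1


# Placeholder numbers, not real data: 19 papers misses the >=20 clause.
near_miss = EvalScienceEvidence(track_shipped=True, auditing_papers=19, recognized_papers=2)
clear_hit = EvalScienceEvidence(track_shipped=True, auditing_papers=20, recognized_papers=1)

if __name__ == "__main__":
    print(resolve_eval_science(near_miss), resolve_eval_science(clear_hit))
```

Under this reading, a program with 19 qualifying papers and two orals still resolves as a miss, which is what "no moving the line" implies.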

RLVR escapes verifiable domains via rubric and generative verifiers; GRPO-successors become the default

85% called

≥80 accepted papers carry RLVR / GRPO / “verifiable reward” or a named successor (GSPO/DAPO) in title or abstract (near-zero 2024 baseline), AND ≥5 center rubric / generative verifiers for non-verifiable domains, AND an RLVR or reasoning-RL paper wins an oral / award.

pending · resolves Dec 2026

Agentic RL where the environment, not the algorithm, is the scarce asset

74% called

≥40 accepted papers train LLM agents end-to-end with RL in interactive environments (beyond preference tuning), ≥6 foreground long-horizon credit assignment as a named problem, AND ≥1 agentic-RL oral / spotlight or a dedicated workshop.

pending · resolves Dec 2026

World models as trainable simulators, with physics-faithfulness as the open problem

72% called

“World model” appears in ≥50 accepted titles/abstracts with ≥10 using it as a trainable simulator for agent training/planning (not just video generation), AND ≥1 world-model spotlight/oral or a World Models workshop runs, AND ≥3 papers explicitly study physics-faithfulness / controllability. (VLA tracked separately — it under-indexes at NeurIPS vs CVPR/CoRL.)

pending · resolves Dec 2026

Unified, reasoning-infused multimodal: one backbone that reasons, then renders

82% called

Multimodal / VLM is the largest single topic cluster among accepted papers (consistent with the 16%→40% trajectory), AND ≥15 accepted papers center unified understanding-and-generation or reasoning-infused image generation (BAGEL / Janus-Pro lineage), AND ≥1 takes a spotlight / oral or a dedicated workshop.

pending · resolves Dec 2026

Diffusion language models become a recognized alternative to autoregressive text

71% called

≥20 accepted main-proceedings papers on diffusion / masked-diffusion language models (text/code, not image diffusion), AND (≥1 diffusion-LM oral / spotlight OR a dedicated workshop / tutorial).

pending · resolves Dec 2026

Hybrid linear-attention beats pure SSM; attention is contested again

74% called

≥30 accepted papers on hybrid / linear-attention / sub-quadratic architectures, AND ≥2 of {Mamba-3, Kimi Linear/KDA, Olmo Hybrid, Gated DeltaNet-2, Qwen3-Next} appear as named baselines, AND among new-backbone papers, hybrids outnumber pure-SSM ones.

pending · resolves Dec 2026
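The architecture test mixes a count threshold, a set-membership clause, and a strict comparison. A minimal sketch of how the three could be checked, assuming hypothetical argument names and placeholder inputs; deciding which papers count as hybrid or pure-SSM is a manual labeling step this sketch does not attempt:

```python
# The five named baselines listed in the test.
NAMED_BASELINES = {
    "Mamba-3", "Kimi Linear/KDA", "Olmo Hybrid", "Gated DeltaNet-2", "Qwen3-Next",
}


def resolve_hybrid(subquadratic_papers, baselines_seen, hybrid_backbones, pure_ssm_backbones):
    """Return True only if all three clauses of the test hold.

    subquadratic_papers: count of accepted hybrid / linear-attention papers
    baselines_seen: names observed as baselines in accepted papers
    hybrid_backbones, pure_ssm_backbones: counts among new-backbone papers
    """
    enough_papers = subquadratic_papers >= 30
    enough_baselines = len(NAMED_BASELINES & set(baselines_seen)) >= 2
    # "Outnumber" is read as strictly greater, so a tie resolves as a miss.
    hybrids_lead = hybrid_backbones > pure_ssm_backbones
    return enough_papers and enough_baselines and hybrids_lead


if __name__ == "__main__":
    # Placeholder inputs, not real data.
    print(resolve_hybrid(30, ["Mamba-3", "Qwen3-Next"], 5, 5))  # tie on backbones
    print(resolve_hybrid(30, ["Mamba-3", "Qwen3-Next"], 6, 5))
```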

Interpretability after SAEs: transcoders, circuit tracing, parameter decomposition

77% called

≥10 accepted papers center post-SAE interpretability (transcoders / CLTs / model diffing / parameter decomposition), AND SAE-replacement is an explicit framing in ≥2 of them, AND ≥1 interpretability spotlight / oral or named mech-interp workshop.

pending · resolves Dec 2026

AI for mathematics: autonomous theorem proving cracks open problems

65% called

≥1 newly machine-verified mathematics result (open or research-level) is presented as an oral / keynote / invited talk, OR an AI-for-math / formal-reasoning workshop runs, AND FrontierMath / Lean-prover benchmarks are cited across ≥15 accepted papers.

pending · resolves Dec 2026

Test-time training: updating weights during inference

70% called

≥12 accepted main-proceedings papers center test-time training / weight-adaptation / continual-at-inference for sequence models or LLMs specifically (excluding classic vision test-time augmentation), AND ≥1 spotlight / oral or a TTT-adjacent workshop.

pending · resolves Dec 2026

The baseline it has to beat

A forecast that just echoes the prior year is the null hypothesis. On the two NeurIPS years that already resolved, that naive extrapolator was strong: it would have landed roughly 6 of 11 defining themes in 2024 and 5 of 11 in 2025 (see the backtest, including the audit that corrected those numbers down). So the bar is high. The number that matters in December is not how many of the ten hit, but how many of the five beyond-baseline calls hit: the fresh ones a naive extrapolator would have missed.
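The baseline figures reduce to simple arithmetic. A minimal sketch using only the backtest counts quoted above; the variable names and the helper for scoring the five beyond-baseline calls are assumptions for illustration, and the helper takes outcomes that do not exist yet:

```python
from fractions import Fraction

# Naive-extrapolator backtest counts quoted above: (themes landed, themes total).
BACKTEST = {2024: (6, 11), 2025: (5, 11)}

# Per-year baseline hit rates: 6/11 is about 55%, 5/11 is about 45%.
baseline_rates = {year: Fraction(hit, total) for year, (hit, total) in BACKTEST.items()}

# Pooled over both years: 11 of 22 themes, exactly one half.
pooled = Fraction(
    sum(hit for hit, _ in BACKTEST.values()),
    sum(total for _, total in BACKTEST.values()),
)


def beyond_baseline_score(fresh_outcomes):
    """Count hits among the five calls a naive extrapolator would have missed."""
    if len(fresh_outcomes) != 5:
        raise ValueError("expected exactly five beyond-baseline calls")
    return sum(fresh_outcomes)


if __name__ == "__main__":
    for year, rate in baseline_rates.items():
        print(year, f"{float(rate):.1%}")
    print("pooled", pooled)
```

Pooled across the two resolved years, the naive extrapolator lands about half of the defining themes, which is why the five fresh calls are scored on their own.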

research.upneja.ai · the full method is in the paper · sealed 21 June 2026