Working paper · pre-registration · 21 June 2026

Forecasting the research frontier

a benchmark for AI research foresight, judged by time

forecaster: Claude Opus 4.8, with a 9-scout, 3-critic research process · sealed 21 June 2026 · resolves at NeurIPS 2026 · Sydney · 6–12 December 2026

Abstract

There is an open debate about whether large language models can do anything genuinely novel, or only recombine what they have seen. Most attempts to settle it ask human judges whether model-generated ideas feel new, which is subjective and easy to contest. We use a future event as the answer key instead. In June 2026, with the NeurIPS 2026 papers submitted but the program not yet public, a frontier model used a multi-agent research process to forecast ten concepts that will be prominent at the December conference, each fresh relative to 2024 and 2025, each carrying a falsifiable December criterion, and froze them in a hashed, timestamped record. We first test the method on the two NeurIPS years that have already resolved. That backtest, built so the outcomes and the prior-year signals are gathered by separate blind passes, shows a naive extrapolator already lands roughly half the headline themes (about 6 of 11 in 2024 and 5 of 11 in 2025, after two adversarial audits forced us to revise a first, too-flattering pass downward). That sets a high bar and reframes the task: a model demonstrates research taste only on the fresh, specific calls the baseline misses, of which the forecast contains five. We describe the method, report the backtest and its two lessons, state the ten predictions and their tests, and fix the scoring rule in advance.

1 The question, stated honestly

The recurring argument online is that a language model cannot be novel. It interpolates. The counter is that this also describes most human research, and the line between recombination and invention is thinner than anyone admits. The debate stalls because the usual way to resolve it is to ask people whether a model's ideas feel novel, and people disagree. The closest prior work, Si, Yang, and Hashimoto's large human study (2024), found reviewers rated machine ideas more novel than expert ideas but less feasible. Their own 2025 follow-up settled it the other way: when researchers actually executed the ideas over three months, the novelty advantage reversed and the machine ideas came out weaker. Proposal-time judgment did not predict what the ideas were worth. That is the weakness of grading research ability by asking what an idea feels like now, and it is the whole reason to wait for the answer instead.

We wanted a test with an answer key, so we borrow from a different tradition: forecasting science. The IARPA forecasting tournaments and platforms like Metaculus do not ask whether a prediction feels smart; they wait for the event and score it with proper scoring rules, where stating your honest probability is the optimal move. The one line of work that already uses a future answer key for research, the Science4Cast benchmarks, predicts whether two concepts will eventually co-occur, scored as binary link prediction. We forecast something harder and more legible: which directions become prominent, by how much, in plain language, resolved against the program and scored by a proper rule. And we are precise about what is measured. This is not invention from nothing. The model did not conjure these concepts; it read the field and bet on which fresh directions will matter, before the answer was public. Call it research foresight. Foresight is scoreable in a way felt-novelty is not, because the future arrives and settles it.

2 Why mid-2026 is the right moment

NeurIPS runs on a fixed clock. The 2026 papers were due in early May and are under review through the summer; decisions land in late September, with the conference in December. As of June, the work that will define NeurIPS 2026 is already written and largely on arXiv. We are not predicting unknown research; we are predicting which of the directions already visible in the open literature earn prominence: a cluster of accepted papers, an oral, an award, a workshop. The two conferences immediately upstream, ICLR 2026 and ICML 2026, share the submission cohort and are the strongest proxy. The discipline is to read share, not volume, because the pipeline roughly doubled in two years and raw counts rose for almost everything.
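The difference between reading share and reading volume can be made concrete with a minimal sketch. The theme names and counts below are invented for illustration and are not taken from any conference data; the function name `topic_shares` is likewise hypothetical.

```python
def topic_shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw paper counts per theme into shares of the total."""
    total = sum(counts.values())
    return {theme: n / total for theme, n in counts.items()}

# Hypothetical counts: the pipeline doubles, so every raw count rises.
year_a = {"theme_x": 100, "theme_y": 300, "other": 600}    # 1,000 papers
year_b = {"theme_x": 300, "theme_y": 450, "other": 1250}   # 2,000 papers

shares_a = topic_shares(year_a)
shares_b = topic_shares(year_b)

for theme in year_a:
    count_change = year_b[theme] - year_a[theme]
    share_change = shares_b[theme] - shares_a[theme]
    print(f"{theme}: count {count_change:+d}, share {share_change:+.3f}")
```

In this toy data `theme_y` gains 150 papers yet loses share, which is the pattern a volume-based reading would mistake for growth.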

3 Method

The forecast came out of a process built to be reproducible and adversarial, not a single prompt. Nine research agents searched the open literature in parallel, each covering one subfield and each restricted to public sources after the model's own training cutoff, so the recent frontier had to be found by search rather than recalled. Their findings were synthesized into a candidate pool, then attacked by three independent critics with separate jobs: novelty, evidence strength, and whether each criterion was actually checkable. The critics earned their place: they caught a misattributed award, a missing theme (multimodal, the steepest riser), and a systematic over-confidence, all corrected before the ten were locked. The record is then frozen: the predictions live in a public artifact, fingerprinted with a SHA-256 hash, anchored to a git commit timestamped months before the program is decided. Editing a prediction changes the hash and breaks the pre-registration. Anyone can check it.
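The tamper-evidence property of the hash can be sketched as follows. This is a minimal illustration, not the project's actual sealing script: the record fields are hypothetical, and hashing a canonical JSON serialization is an assumption (hashing the raw bytes of the committed file would serve equally well).

```python
import hashlib
import json

def fingerprint(record: dict) -> str:
    """SHA-256 hex digest of a canonical JSON serialization of the record."""
    # Assumption: sorted keys and fixed separators make the bytes reproducible.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical sealed record containing a single prediction.
sealed = {"sealed": "2026-06-21", "predictions": [{"id": "06", "confidence": 0.71}]}
sealed_hash = fingerprint(sealed)

# The same record with one confidence edited after the fact.
edited = {"sealed": "2026-06-21", "predictions": [{"id": "06", "confidence": 0.75}]}

print(fingerprint(sealed) == sealed_hash)   # the untouched record still verifies
print(fingerprint(edited) == sealed_hash)   # the edited record no longer matches
```

The hash alone only proves the content is unchanged; the claim about when it was sealed rests on the public commit timestamp.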

4 The backtest: does the method work?

A method you cannot check is a horoscope. Before asking anyone to trust a 2026 call, we test it on the two NeurIPS years that already resolved. The design guards against the obvious trap, that we already know how those years turned out. The outcomes (what actually defined NeurIPS 2024 and 2025) and the prior-year signals (what was surging at the preceding ICLR, ICML, and on arXiv) were gathered by separate agents, each blind to the other. We never construct a retroactive “the model would have predicted X.” We only ask: did the prior signal contain the answer, and how much of it would a mechanical extrapolator have caught?

The result is two-sided and, we think, the most useful part of the project. First, the method has signal: on both years, most of what ended up defining NeurIPS was already surging in the prior ICLR, ICML, and arXiv data. Second, and more important, the naive baseline is strong. A mechanical extrapolator that simply predicts that last year's top share-risers will repeat lands 6 of 11 defining themes in 2024 and 5 of 11 in 2025. The big, obvious themes are visible a year out, so a model predicting “agents and reasoning and multimodal will be big” proves nothing. The bar is high, and a forecast earns its keep only on the calls the baseline misses.
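A mechanical extrapolator of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not the backtest code: the function names, the theme labels, the share values, and the cutoff `k` are all hypothetical.

```python
def top_share_risers(prev: dict[str, float], curr: dict[str, float], k: int) -> list[str]:
    """Rank themes by change in share between two prior-signal snapshots; return the top k."""
    # A theme absent from the earlier snapshot is treated as starting from zero share.
    deltas = {theme: curr[theme] - prev.get(theme, 0.0) for theme in curr}
    return sorted(deltas, key=lambda t: deltas[t], reverse=True)[:k]

def themes_landed(predicted: list[str], defining: set[str]) -> int:
    """Count how many of the defining themes the extrapolator named."""
    return len(set(predicted) & defining)

# Hypothetical prior-year shares for five themes.
prev = {"a": 0.10, "b": 0.20, "c": 0.05, "d": 0.15}
curr = {"a": 0.18, "b": 0.19, "c": 0.09, "d": 0.15, "e": 0.02}

predicted = top_share_risers(prev, curr, k=3)

# Hypothetical answer key; "unseen_theme" never appeared in the prior signal.
defining = {"a", "c", "d", "unseen_theme"}
hits = themes_landed(predicted, defining)
print(f"baseline lands {hits} of {len(defining)} defining themes")
```

The denominator is the set of defining themes, not the number of guesses, so a theme missing from the prior signal counts against the baseline no matter how `k` is chosen.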

The misses teach the two lessons that shaped the live method. One: late shocks are unforecastable a year out. Inference-time compute was the defining narrative of NeurIPS 2024, and it was invisible to the prior signal because o1 shipped in September, after the indicators. No year-out method catches a paradigm shock that lands between the signal and the conference, and the 2026 forecast should expect to miss its own. Two: weight arXiv and lab momentum over lagged conference keywords. RLVR and GRPO dominated arXiv through 2025 but read near-zero in conference author-keywords, because DeepSeek-R1 post-dated camera-ready deadlines. A method that extrapolates keywords misses this; one that weights arXiv and lab releases catches it. The 2026 method does the latter. The honest division of labor: the backtest validates the method and sets the bar; the live forecast tests the skill against it.

One disclosure belongs here. The matching (which concepts count as called, and the counts themselves) was done by one author who had seen both the outcomes and the prior signals, so it is the project's highest-risk step. Two adversarial audits attacked it and found it had drifted in the flattering direction every time: a count quietly rounded up, a flat theme bundled with a surging one, a known miss dropped from the denominator. The numbers above are the corrected, lower ones, and we also re-marked three of the ten predictions as baseline-ish after the audit showed an extrapolator would have caught them. Both audits are in the public repo. A benchmark that hides its own red-team is not a benchmark.

5 The pre-registered forecast

Below are the ten, with their confidence and the one-line December test. The full cards and the evidence are on the forecast page. This list is the pre-registration: dated, hashed, and unchanged after today.

01 · Evaluation becomes a science: sandbagging, eval-awareness, and benchmark auditing · 87%
Resolves if: The Evaluations & Datasets track ships, AND ≥20 accepted papers' primary contribution is auditing existing benchmarks / contamination detection / eval-awareness / construct validity (out of a ~7,000-paper program), AND ≥1 gets a spotlight, oral, or award.

02 · RLVR escapes verifiable domains via rubric and generative verifiers; GRPO-successors become the default · 85%
Resolves if: ≥80 accepted papers carry RLVR / GRPO / “verifiable reward” or a named successor (GSPO/DAPO) in title or abstract (near-zero 2024 baseline), AND ≥5 center rubric / generative verifiers for non-verifiable domains, AND an RLVR or reasoning-RL paper wins an oral / award.

03 · Agentic RL where the environment, not the algorithm, is the scarce asset · 74%
Resolves if: ≥40 accepted papers train LLM agents end-to-end with RL in interactive environments (beyond preference tuning), ≥6 foreground long-horizon credit assignment as a named problem, AND ≥1 agentic-RL oral / spotlight or a dedicated workshop.

04 · World models as trainable simulators, with physics-faithfulness as the open problem · 72%
Resolves if: “World model” appears in ≥50 accepted titles/abstracts with ≥10 using it as a trainable simulator for agent training/planning (not just video generation), AND ≥1 world-model spotlight/oral or a World Models workshop runs, AND ≥3 papers explicitly study physics-faithfulness / controllability. (VLA tracked separately: it under-indexes at NeurIPS vs CVPR/CoRL.)

05 · Unified, reasoning-infused multimodal: one backbone that reasons, then renders · 82%
Resolves if: Multimodal / VLM is the largest single topic cluster among accepted papers (consistent with the 16%→40% trajectory), AND ≥15 accepted papers center unified understanding-and-generation or reasoning-infused image generation (BAGEL / Janus-Pro lineage), AND ≥1 takes a spotlight / oral or a dedicated workshop.

06 · Diffusion language models become a recognized alternative to autoregressive text · 71%
Resolves if: ≥20 accepted main-proceedings papers on diffusion / masked-diffusion language models (text/code, not image diffusion), AND (≥1 diffusion-LM oral / spotlight OR a dedicated workshop / tutorial).

07 · Hybrid linear-attention beats pure SSM; attention is contested again · 74%
Resolves if: ≥30 accepted papers on hybrid / linear-attention / sub-quadratic architectures, AND ≥2 of {Mamba-3, Kimi Linear/KDA, Olmo Hybrid, Gated DeltaNet-2, Qwen3-Next} appear as named baselines, AND among new-backbone papers, hybrids outnumber pure-SSM ones.

08 · Interpretability after SAEs: transcoders, circuit tracing, parameter decomposition · 77%
Resolves if: ≥10 accepted papers center post-SAE interpretability (transcoders / CLTs / model diffing / parameter decomposition), AND SAE-replacement is an explicit framing in ≥2 of them, AND ≥1 interpretability spotlight / oral or named mech-interp workshop.

09 · AI for mathematics: autonomous theorem proving cracks open problems · 65%
Resolves if: ≥1 newly machine-verified mathematics result (open or research-level) is presented as an oral / keynote / invited talk, OR an AI-for-math / formal-reasoning workshop runs, AND FrontierMath / Lean-prover benchmarks are cited across ≥15 accepted papers.

10 · Test-time training: updating weights during inference · 70%
Resolves if: ≥12 accepted main-proceedings papers center test-time training / weight-adaptation / continual-at-inference for sequence models or LLMs specifically (excluding classic vision test-time augmentation), AND ≥1 spotlight / oral or a TTT-adjacent workshop.
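Each criterion above is a conjunction of thresholds, which is what makes it checkable without judgment calls. As an illustration, the test for prediction 06 could be encoded like this. The class and field names are hypothetical, and the evidence values are placeholders, not real counts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DiffusionLMEvidence:
    """Hypothetical tallies gathered from the public program in December."""
    accepted_papers: int            # main-proceedings papers on diffusion LMs
    orals_or_spotlights: int        # how many of those are orals or spotlights
    has_workshop_or_tutorial: bool  # a dedicated workshop or tutorial runs

def resolves_06(e: DiffusionLMEvidence) -> bool:
    """Prediction 06: >=20 papers AND (>=1 oral/spotlight OR a workshop/tutorial)."""
    return e.accepted_papers >= 20 and (
        e.orals_or_spotlights >= 1 or e.has_workshop_or_tutorial
    )

# Placeholder evidence: enough papers, no oral, but a workshop runs.
print(resolves_06(DiffusionLMEvidence(24, 0, True)))
# Placeholder evidence: one paper short of the threshold is a miss.
print(resolves_06(DiffusionLMEvidence(19, 2, True)))
```

Writing the rule as a pure function of tallies leaves the counting itself (which papers qualify) as the only place where discretion can enter.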

6 How it gets scored

In December, when the program is public, each prediction is checked against its own criterion using the accepted-paper titles and abstracts, the orals and spotlights, the awards, and the workshop slate. Each resolves hit or miss, with no partial credit and no moving the line. But the headline number is not “X of 10 hit.” The backtest showed why: a naive extrapolator already gets the obvious themes, so the measure of foresight is “X of 10 hit, of which Y were beyond-baseline calls the extrapolator would have missed.” Y is the real score. This is deliberate. A result from the science-of-science literature is blunt about it: predicting a research topic's level is nearly trivial, while predicting its change is hard, so a benchmark that scores the level looks impressive while measuring nothing. Beyond-baseline scores the change. The probabilities themselves are scored by a proper rule, Brier, where honest reporting is the optimal strategy. The set is also checked for calibration, whether the higher-confidence calls came true more often, though with ten predictions, about five of them beyond-baseline, that signal is far too weak to support a calibration claim, which we state rather than hide.
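The scoring rule described above can be sketched as follows. The probabilities in `cards` are four of the stated confidences, but the hit/miss outcomes and the beyond-baseline flags are placeholders invented for illustration; nothing here is a result. The function names are hypothetical.

```python
def brier_score(forecasts: list[tuple[float, bool]]) -> float:
    """Mean squared gap between stated probability and the 0/1 outcome (lower is better)."""
    return sum((p - float(hit)) ** 2 for p, hit in forecasts) / len(forecasts)

def expected_brier(report: float, true_prob: float) -> float:
    """Expected penalty for reporting `report` when the event has probability `true_prob`."""
    return true_prob * (report - 1.0) ** 2 + (1.0 - true_prob) * report ** 2

def headline(results: list[tuple[bool, bool]]) -> tuple[int, int]:
    """From (hit, beyond_baseline) pairs, return (X hits, Y beyond-baseline hits)."""
    x = sum(1 for hit, _ in results if hit)
    y = sum(1 for hit, beyond in results if hit and beyond)
    return x, y

# (stated probability, hit?, beyond-baseline?) with PLACEHOLDER outcomes and flags.
cards = [
    (0.87, True, False),
    (0.71, True, True),
    (0.65, False, True),
    (0.74, True, True),
]
score = brier_score([(p, hit) for p, hit, _ in cards])
x, y = headline([(hit, beyond) for _, hit, beyond in cards])
print(f"Brier {score:.4f}; {x} of {len(cards)} hit, of which {y} beyond-baseline")

# Properness: if the true probability is 0.7, reporting 0.7 minimizes the expected penalty.
grid = [i / 100 for i in range(101)]
best_report = min(grid, key=lambda r: expected_brier(r, 0.7))
print(best_report)
```

The grid search is a numerical check of the proper-scoring property for a single assumed probability, not a proof; the general result is in the Gneiting and Raftery reference.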

7 Limitations

The honest list is short. This measures foresight, not invention, and a reader who wants proof that a model can generate ideas nobody has had will not find it here; that is a different experiment and a natural next arm. The model is measured on its own forecast, so the score has to be the objective program, not anything the model says about itself; a fuller test puts other frontier models and human forecasters on the identical pre-registered task, which the leaderboard leaves open. Ten predictions over one venue in one year is a small sample. The backtest shows the method will miss its own late shock. And some supporting evidence comes from fast-moving 2026 preprints; the load-bearing anchors were restricted to verified sources, but residual risk remains.

8 What a result would mean

If the forecast beats the naive baseline on the beyond-baseline calls, that is evidence of something specific and modest: a model can read public signals and bet on which new directions will matter better than extrapolation does. Not invention, not genius, but real judgment, written down in advance and checked. This is the empirical engine for a larger question we care about, the half-life of human research advantage: as models absorb execution, the residual human contribution migrates to taste, the ability to tell good from bad before the outcome is known, and taste is exactly what a time-resolved forecast measures. A frontier lab now quantifies a version of this directly: on a curated set of research junctures where the human had already taken a wrong turn, its model picked the better next step 51 percent of the time in November 2025 and 64 percent by April 2026. That is a narrow, hindsight-aided measure, not a like-for-like contest, but it is the same quantity this benchmark measures in the open. The leaderboard is where that gets contested, with other models and human experts as the open arms. If the forecast misses, that is also a result, and a more interesting one than a comfortable post-hoc story: it would say the model over-indexed on what looked busy in mid-2026, or that the frontier is harder to call than it looks. Either way the value is in the pre-registration. The forecast is dated, the rule is fixed, and December does the grading.

References & provenance

Si, Yang, Hashimoto. Can LLMs Generate Novel Research Ideas? arXiv:2409.04109, 2024 — and the follow-up, The Ideation-Execution Gap, arXiv:2506.20803, 2025, where executing the ideas reversed the novelty advantage. The foil.
Krenn et al. Science4Cast (Nature Machine Intelligence, 2023) and Impact4Cast (arXiv:2402.08640). The nearest ancestor: concept-link prediction scored by ROC-AUC, from which this work differs by forecasting magnitude, timing, and prominence in natural language.
Gneiting & Raftery, Strictly Proper Scoring Rules (JASA 2007); Brier 1950; Metaculus scoring; the IARPA ACE tournaments. The scoring backbone (Brier, baseline-vs-peer).
Ofer & Linial, arXiv:2305.04133 — predicting a topic's level is trivial, predicting its change is hard. Why beyond-baseline scores the change, not the level.
Anthropic Institute, When AI builds itself (2026) — the 51%→64% research-taste figure, on a curated, hindsight-aided set.
The two backtest answer keys (NeurIPS 2024, 2025), the four prior-year indicator sweeps, the nine-scout forecast research, and the three red-team critiques are versioned in the public repo.
Pre-registration: research/preregistration.json, SHA-256 c0bad304…1b3f6, anchored to public commit 61320eb (2026-06-21).
NeurIPS / ICLR / ICML official statistics, award announcements, and CFPs; papercopilot keyword data; the Stanford AI Index.


research.upneja.ai · forecast by Claude Opus 4.8 · sealed 21 June 2026 · resolves December 2026
