The Eval Trap
On getting nowhere with AI
Ilya Sutskever gave a rare interview recently, and what a relief it was. One of OpenAI’s co-founders was finally saying out loud what plenty of people working in AI have been thinking for a while: we might be going in circles.
The specific thing he questioned was reinforcement learning (RL), the technique that’s supposed to be pushing these models toward artificial general intelligence (AGI). Billions of dollars have gone into RL this year alone. Companies are hiring armies of human evaluators, building simulated versions of popular apps so models can practice tasks hundreds of times, running the same experiments with slightly different parameters. Sutskever’s take is that this isn’t getting us where we think it’s getting us.
The problem, as he describes it, is about generalization. Models have gotten really good at acing evaluations, but those high scores don’t mean the models can actually handle things outside the scope of what they were tested on. Generalization is supposed to mean the model learns something transferable, something it can apply in new contexts. Like when a teenager learns to drive in their neighborhood and can then, reasonably, drive on a highway or in the rain or in a different city. The underlying skill transfers. These models don’t do that, however much the marketing suggests they do. Instead, they pattern-match their way through the specific scenarios they’ve seen before, and when something falls outside that pattern, they either hallucinate their way through it or collapse entirely.
So if Sutskever’s right (and he probably is), the whole evaluation apparatus we’ve built is measuring the wrong thing. We’ve created an industrial complex around making models good at tests, when what we actually need is models that can learn. And maybe the reason we keep reaching for more compute, more RL, more of the same methods is because we don’t know how to measure the thing we’re actually trying to build. Or less graciously, that we don’t know what on earth we’re building at all.
Teaching to the Test
The eval system works like this: companies create massive test suites with thousands of prompts to pummel the model. The model gets scored based on how often its outputs match the gold-standard answers. Teams run these evals constantly, tweaking the model, adjusting the training data, tuning the reinforcement learning to push those scores higher. When the numbers go up, everyone celebrates. Progress!
Except what’s actually happening is the model is getting better at that specific test, not at the underlying capability the test was meant to measure. It’s memorizing the shape of the exam, not learning the subject. And because these evals are how progress gets measured, the entire development process becomes oriented around optimizing for them.
This creates a feedback loop. RL takes the model’s outputs on eval-style tasks, gets human feedback on which responses are better, and uses that signal to adjust the model’s behavior. Do this enough times and the model becomes incredibly good at producing outputs that score well on evals. But real-world tasks don’t look like evals. They’re messier, more ambiguous, full of context that wasn’t in the training data. The model that aced every benchmark still faceplants when someone asks it to do something slightly outside the parameters of what it practiced.
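The loop described above can be caricatured in a few lines of code. This is a deliberately crude sketch, not any lab’s actual pipeline: the “model” is just a lookup table, the “RL update” simply memorizes the graded answer, and every name here is invented for illustration. But it makes the gap concrete: perfect scores on the fixed suite, zero transfer to prompts the model hasn’t seen.

```python
# Toy sketch of the eval-optimization feedback loop (illustrative only).
# The "model" is a dict of memorized prompt -> answer pairs, which is
# exactly the failure mode in question: it saturates the fixed eval
# suite without learning anything transferable.

EVAL_SUITE = {f"prompt_{i}": f"gold_{i}" for i in range(1000)}

def train_against_evals(rounds=5):
    model = {}  # memorized prompt -> answer pairs
    for _ in range(rounds):
        for prompt, gold in EVAL_SUITE.items():
            if model.get(prompt, "guess") != gold:
                # "RL update": nudge the model toward the output that
                # scores well on this exact prompt, and nothing more.
                model[prompt] = gold
    return model

def eval_score(model, suite):
    # Fraction of prompts where the model's output matches the gold answer.
    return sum(model.get(p, "guess") == g for p, g in suite.items()) / len(suite)

model = train_against_evals()
print(eval_score(model, EVAL_SUITE))  # 1.0 -- the benchmark is saturated

held_out = {f"new_prompt_{i}": f"gold_{i}" for i in range(100)}
print(eval_score(model, held_out))    # 0.0 -- nothing transferred
```

Real systems fail less starkly than a lookup table, of course, but the shape of the problem is the same: the optimization target is the suite, so that is what improves.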
Sutskever’s observation is that companies have spent billions this year perfecting this loop, and it’s not actually moving us toward AGI. It’s moving us toward really sophisticated test-taking machines.
The Fear of Failure
This system is utterly broken because, weirdly, nobody wants to see failure in their evals. The entire apparatus is designed to reassure everyone that we’re on the right track. Data scientists and product managers get nervous when the numbers dip. Leadership wants to see improvement quarter over quarter. So teams optimize for stability, for consistent performance, for not breaking things.
But failure is the actual valuable data. When a model breaks in an interesting way, when it gets confused or makes a weird inference or produces something totally off-base, that’s where you learn something. That’s where you find out what the model doesn’t understand, what patterns it’s relying on that don’t actually hold, where the boundaries of its capabilities really are. Those moments of failure are the ones that should shape the next experiment. And they’re genuinely interesting problems to solve!
Instead, the system smooths them out. RL is particularly good at this. It takes the jagged, unpredictable edges of model behavior and files them down into something more reliable and consistent. Which sounds great until you realize that those edges were where the learning happened. A model that fails predictably on evals would actually tell you more about what it’s missing than a model that passes everything.
What Evals Can’t Measure
Sutskever talks about things like emotions, value functions, and judgment: all the qualities that let humans navigate ambiguous situations and make decisions when there’s no clear answer. These are the things that would actually let a model generalize, and they’re exactly the things that evals can’t measure.
How do you benchmark whether a model has something resembling a value function? How do you quantify judgment in a way that can be scored and tracked over time? The whole eval framework is built for pass/fail. But most of the capabilities we actually need from these systems don’t fit that frame.
This is why the industry keeps throwing money at scale instead of trying something weirder. Scale is quantifiable. Trying to teach a model something like emotional intelligence or contextual judgment (for real, not just faking it) would require admitting that some of the most important things can’t be reduced to metrics. Or, dare I say it, aren’t worth training into AI at all.
So instead we get more of the same. More RL on the same kinds of tasks. More simulated environments where models practice the same narrow set of actions. More human evaluators scoring responses on the same rubrics. The whole system is designed to avoid the kind of fundamental rethinking that Sutskever seems to be suggesting we need.
The Age of Research
Sutskever describes this moment as an “age of research” rather than an “age of scaling.” What he means is that throwing more compute at existing methods probably isn’t going to get us to AGI. We need smaller-scale experiments, new approaches, and willingness to try things that might not work. This sounds reasonable, even obvious. Alas, nobody’s really incentivized to do it.
Even Sutskever acknowledges this tension. He points out that even if the current approaches stall out, companies can still make “stupendous revenue” along the way. Which means there’s not actually much pressure to change course. As long as the models are good enough to sell, why take the risk?
The answer, if there is one, is that the current path might not actually lead where everyone says it’s leading. And at some point the gap between what these models can do and what we keep promising they’ll be able to do is going to become too obvious to ignore. Sutskever’s estimate is that real AGI could take anywhere from five to twenty years, if it’s possible at all. That’s a long time to keep running the same experiments and expecting different results.
We’re at an inflection point. Either we figure out how to actually teach these models to learn, or we keep building increasingly sophisticated systems that are still fundamentally limited in the same ways.
Getting Weird
So what would the weird experiments look like? Probably smaller, stranger, and harder to explain. Maybe training models in environments where failure is expected and analyzed rather than smoothed away. Maybe building evaluation frameworks that can capture something like learning capacity rather than just performance on fixed tasks. Maybe bringing in researchers from fields that have been thinking about intelligence in non-computational terms and letting them mess around without the pressure to produce quantifiable results.
The challenge, as usual, is that this type of humanist work is not fast, it’s not scalable, and it doesn’t produce the kind of dramatic progress that keeps the funding flowing. But if Sutskever’s right that we’re entering an age of research, then maybe there’s space for approaches that don’t fit the mold.