Senior Research Scientist @Google DeepMind

Benchmarks Matter

A realization I've come to recently is that the most consequential design document in modern AI research isn't the model architecture file, but the benchmark spec. The spec is crystallized intent, and it quietly steers the field, often for years.

We often treat benchmarks as objective ground truth, but they are deeply opinionated artifacts. The choice of what to measure (e.g., accuracy vs. creativity), how to measure it (e.g., multiple choice vs. open-ended generation), and which edge cases to include or ignore implicitly defines what "progress" means for a given task. A benchmark that over-weights tricky math problems will incentivize the field to build giant calculators; one that rewards stylistic mimicry will incentivize literary chameleons.
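To make this concrete, here is a minimal sketch of the idea. The names (`BenchmarkSpec`, `exact_match`, `keyword_coverage`) are hypothetical, not any real eval harness; the point is only that the scoring function the spec blesses, as much as the task items themselves, decides what counts as progress. The same model output can look like a win or a failure depending on that one choice.

```python
# Hypothetical sketch: a benchmark spec is task items plus one scoring choice,
# and that choice encodes what "progress" means.

from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkSpec:
    """Crystallized intent: the items, plus the single number that counts."""
    name: str
    items: list[tuple[str, str]]            # (prompt, reference answer)
    score_fn: Callable[[str, str], float]   # (prediction, reference) -> [0, 1]


def exact_match(prediction: str, reference: str) -> float:
    # Rewards "giant calculators": only the canonical string earns credit.
    return float(prediction.strip().lower() == reference.strip().lower())


def keyword_coverage(prediction: str, reference: str) -> float:
    # Rewards open-ended generation: partial credit for covering the
    # reference's key terms, ignoring exact phrasing entirely.
    keys = set(reference.lower().split())
    hits = sum(1 for k in keys if k in prediction.lower())
    return hits / max(len(keys), 1)


def evaluate(spec: BenchmarkSpec, model: Callable[[str], str]) -> float:
    scores = [spec.score_fn(model(prompt), ref) for prompt, ref in spec.items]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    items = [
        ("What is 12 * 12?", "144"),
        ("Name a prime below 10.", "7"),
    ]
    # A toy "model" that answers the second question in a full sentence.
    model = lambda prompt: "144" if "12 * 12" in prompt else "7 is a prime below 10"

    strict = BenchmarkSpec("toy-strict", items, exact_match)
    lenient = BenchmarkSpec("toy-lenient", items, keyword_coverage)

    # Same model, same items, different definition of progress:
    print(evaluate(strict, model))   # 0.5
    print(evaluate(lenient, model))  # 1.0
```

Nothing in the model changed between the two runs; only the spec's opinion about what to count did.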

This is why creating a new, widely adopted benchmark can have more leverage than publishing a dozen papers with incremental SOTA gains. It’s like discovering a new continent versus mapping a known coastline in slightly more detail. The former opens up entirely new avenues for exploration, while the latter is a race in a fixed direction.

The danger is that we can get stuck at a local optimum, over-optimizing for a flawed or outdated definition of progress. The most impactful researchers I know have a healthy skepticism of existing evals. They don't just ask, "How can my model solve this benchmark?" They ask, "Is this benchmark even asking the right question?"