Training Models in Our Own Image
There’s a reason frontier model evals look a lot like math and coding competitions: the people building them grew up in that world. You build what you know, and you measure what you can count. It’s a simple case of a culture remaking itself in its own image.
What we choose to measure, however, has enormous downstream effects. It sets the research agenda for the entire field, funneling billions of dollars and thousands of bright minds toward a single finish line. This creates a powerful feedback loop, making us progressively better at the narrow set of problems we’ve already crowned as important. The risk is that we end up with AIs that are brilliant, but only in the same ways we are. Our very definition of success introduces a more fundamental bias than any training data ever could.