Behind every discussion of AI progress, from the leaderboards to the product launches to the breathless announcements, lurks a quieter and more awkward question: how do you actually determine whether a new model is better than the one before it?
Answering that question turns out to be expensive, surprisingly so. A benchmark can involve hundreds of thousands of questions, and each response has to be reviewed by a human to confirm that it is correct and to flag anything suspect. The process can cost as much as training the model in the first place, which is a substantial sum. That bottleneck has become a significant problem for an industry that seems to produce a new frontier model every few months, and a Stanford team is now making a serious attempt to address it.
Their method is borrowed from educational testing, a field that has nothing to do with AI. For decades, standardized tests have relied on item response theory, or IRT; a version of it underpins the SAT. The core idea is that when you assess someone’s ability, the difficulty of the questions matters as much as whether the answers are correct. Getting a hard question right tells you something different from getting an easy question right. A student who handles the difficult questions is in a different category from one who breezes through the easy ones, even if their raw scores look similar on paper.
| Field | Details |
|---|---|
| Topic | Stanford’s new framework for evaluating AI language models using Item Response Theory (IRT) |
| Lead Researcher | Sanmi Koyejo, Assistant Professor of Computer Science, Stanford School of Engineering |
| Co-Author | Sang Truong, Doctoral Candidate, Stanford Artificial Intelligence Lab (SAIL) |
| Published At | International Conference on Machine Learning (ICML) |
| Core Method | Item Response Theory (IRT) — borrowed from standardized education testing |
| Cost Reduction | More than 80% compared to traditional human-reviewed benchmarking (when used selectively) |
| Models Tested | 172 language models across 22 datasets |
| Knowledge Domains Covered | Medicine, mathematics, law, and more |
| Key Finding | Accounting for question difficulty yields fairer, more accurate model comparisons |
| Additional Feature | AI-generated question banks; automatic removal of “contaminated” benchmark questions |
| Affiliated Institutions | Stanford, UC Berkeley, University of Illinois Urbana-Champaign (UIUC) |
| Funding | MacArthur Foundation, Stanford HAI, Google Inc. |

The same reasoning applies to AI, according to Sanmi Koyejo, an assistant professor of computer science at Stanford. “Some models may do better or worse just by luck of the draw,” Koyejo said, summarizing the observation that motivated the research. He and his team argue that the existing evaluation process ignores question difficulty entirely, so a model tested on easier questions can look more capable than one tested on harder ones. This is not a speculative concern. It is a systematic source of error built into the way the field measures its own progress.
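To make the point concrete, here is a minimal sketch of the two-parameter logistic (2PL) model that standardized testing typically uses. The article does not say which IRT variant the Stanford framework adopts, so the parameterization, function names, and numbers below are illustrative assumptions rather than the team’s actual implementation. Two models with identical raw accuracy come out with very different ability estimates once the difficulty of their questions is taken into account.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p_correct(theta, difficulty, discrimination=1.0):
    """2PL IRT: probability that a respondent with ability `theta`
    answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))

def estimate_ability(responses, difficulties):
    """Maximum-likelihood estimate of ability from 0/1 responses
    to items with known (pre-calibrated) difficulties."""
    responses = np.asarray(responses, dtype=float)
    difficulties = np.asarray(difficulties, dtype=float)

    def neg_log_likelihood(theta):
        p = np.clip(p_correct(theta, difficulties), 1e-9, 1 - 1e-9)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

    return minimize_scalar(neg_log_likelihood, bounds=(-4, 4), method="bounded").x

# Hypothetical example: both models answer 3 of 4 questions correctly,
# but model A faced much harder items than model B.
hard_items = [0.5, 1.5, 1.8, 2.5]    # calibrated difficulties (logits)
easy_items = [-2.0, -1.5, -1.0, 0.5]

model_a = estimate_ability([1, 1, 1, 0], hard_items)
model_b = estimate_ability([1, 1, 1, 0], easy_items)
print(f"Model A ability: {model_a:.2f}, Model B ability: {model_b:.2f}")
# Same raw score (75%), very different ability estimates.
```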
The Stanford team’s solution uses AI to analyze questions and score their difficulty, removing the need for a human to evaluate every response from scratch. That alone cuts costs roughly in half in many cases. Applied selectively, feeding models harder questions when difficulty matters and easier ones when it doesn’t, the method can reduce evaluation costs by more than 80% while keeping comparisons accurate. Co-author Sang Truong, a doctoral candidate at SAIL, described building an infrastructure that lets researchers “adaptively select subsets of questions based on difficulty.” The framing is technical, but the implication is plain: fairer comparisons at a far lower cost.
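The article describes the adaptive selection only at a high level, so the sketch below uses a standard computerized-adaptive-testing heuristic, choosing whichever remaining question is most informative at the current ability estimate, to show the general shape of the idea. It reuses the `estimate_ability` helper from the previous sketch; the `ask_model` callable, the item-bank format, and the fixed-length stopping rule are hypothetical.

```python
import numpy as np

def item_information(theta, difficulty, discrimination=1.0):
    """Fisher information of a 2PL item at ability `theta`:
    highest when the item's difficulty is close to theta."""
    p = 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))
    return discrimination**2 * p * (1.0 - p)

def adaptive_evaluation(item_bank, ask_model, n_items=50):
    """Evaluate a model by repeatedly asking the not-yet-used question
    whose difficulty is most informative at the current ability estimate.

    item_bank : list of (question, calibrated_difficulty) pairs
    ask_model : callable(question) -> 1 if the model answers correctly else 0
    """
    theta = 0.0                      # start from an average ability estimate
    asked, responses, difficulties = set(), [], []

    for _ in range(min(n_items, len(item_bank))):
        # Pick the most informative remaining item for the current estimate.
        idx = max(
            (i for i in range(len(item_bank)) if i not in asked),
            key=lambda i: item_information(theta, item_bank[i][1]),
        )
        asked.add(idx)
        question, difficulty = item_bank[idx]
        responses.append(ask_model(question))
        difficulties.append(difficulty)
        theta = estimate_ability(responses, difficulties)  # re-fit after each answer

    return theta
```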
The approach also tackles a second problem that gets too little attention in public discussions of AI benchmarking: question banks become contaminated. Once a benchmark question leaks into a model’s training data, it tests the model’s memory rather than its reasoning. The Stanford framework uses AI-generated question banks to automatically filter out compromised questions while continuously replenishing the pool with fresh, calibrated ones. It amounts to routine maintenance for evaluation infrastructure, something the field has badly lacked and badly needed.
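The article does not describe how contaminated questions are detected. A common heuristic in the decontamination literature is to flag questions whose long word n-grams appear verbatim in text a model may have trained on; the sketch below illustrates that idea. The function names, the 8-gram window, and the 50% threshold are arbitrary assumptions, not the Stanford framework’s method.

```python
def ngram_overlap(question, corpus_text, n=8):
    """Fraction of the question's word n-grams that also appear verbatim
    in a reference corpus (e.g. scraped training or benchmark text)."""
    words = question.lower().split()
    grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    if not grams:
        return 0.0
    corpus = corpus_text.lower()
    return sum(g in corpus for g in grams) / len(grams)

def filter_contaminated(questions, corpus_text, threshold=0.5):
    """Drop questions whose long n-grams overlap heavily with the corpus,
    keeping the question bank fresh."""
    return [q for q in questions if ngram_overlap(q, corpus_text) < threshold]
```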
The results of applying the technique at scale are striking. The team tested it on 172 language models across 22 datasets spanning law, mathematics, medicine, and other domains. In one case, the system tracked changes in GPT-3.5’s safety performance across several releases in 2023, catching an initial improvement followed by a regression, the kind of fine-grained shift in a model’s behavior over time that traditional benchmarking rarely exposes. With conventional approaches, the cost and logistics of repeated full-scale evaluations make that sort of longitudinal tracking practically impossible.
What this research suggests is that the field has been operating with a sizable gap between how confident AI developers sound when they claim a new model is better and how rigorous the evidence behind that claim actually is. The benchmarks that shape public opinion, inform policy debates, and steer purchasing decisions rest on a foundation that has received far less scrutiny than the models themselves. Some of the performance gains reported in recent years may owe something to question selection, with models tested on easier material without anyone noticing.
That is an uncomfortable thought. But the Stanford work takes it seriously, and the approach the team has built at least offers a path toward more honest accounting. Cutting costs by 80% while increasing rigor changes what is practically possible, even if it does not solve the evaluation problem; given how quickly the models themselves are changing, that problem is unlikely to be fully solved. What it does enable is more frequent testing, a wider range of question types, and greater clarity about what these systems can and cannot do. For anyone trying to understand the current state of AI, that matters more than it might seem.

