On Tuesday, startup Anthropic released a family of generative AI models that it claims achieve best-in-class performance. Just a few days later, rival Inflection AI unveiled a model that it claims comes close in quality to some of the most capable models on the market, including OpenAI’s GPT-4.

Anthropic and Inflection are by no means the first AI firms to claim that their models match or objectively beat the competition. Google made the same claim when it released its Gemini models, and OpenAI said the same of GPT-4 and its predecessors GPT-3, GPT-2 and GPT-1. The list goes on.

But what metrics are they talking about? When a vendor says a model is state-of-the-art in performance or quality, what exactly does that mean? Perhaps more to the point: will a model that technically “performs” better than another model actually feel noticeably improved?

On that last question: unlikely.

The reason – or rather, the problem – lies in the benchmarks that AI companies use to quantify a model’s strengths and weaknesses.

Esoteric measures

Today’s most commonly used benchmarks for AI models – particularly chatbot models like OpenAI’s ChatGPT and Anthropic’s Claude – do a poor job of capturing how the average person interacts with the models being tested. For example, one benchmark cited by Anthropic in its recent announcement, GPQA (“A Graduate-Level Google-Proof Q&A Benchmark”), contains hundreds of graduate-level biology, physics, and chemistry questions – yet most people use chatbots for tasks like answering emails, writing cover letters, and talking about their feelings.

Jesse Dodge, a scientist at the Allen Institute for AI, the nonprofit AI research organization, says the industry has reached an “evaluation crisis.”

“Benchmarks tend to be static and narrowly focused on evaluating a single capability, such as a model’s factuality in a single domain, or its ability to solve multiple-choice mathematical reasoning questions,” Dodge told TechCrunch in an interview. “Many benchmarks used for evaluation are more than three years old, from a time when AI systems were mostly used for research and didn’t have many real users. In addition, people use generative AI in a wide range of ways – they’re very creative.”

The wrong metrics

It’s not that the most commonly used benchmarks are completely useless. Someone, somewhere, is undoubtedly asking ChatGPT Ph.D.-level math questions. But as generative AI models are increasingly positioned as mass-market “do-it-all” systems, the old benchmarks are becoming less applicable.

David Widder, a postdoctoral researcher at Cornell University who studies AI and ethics, notes that many of the abilities tested by common benchmarks – from solving elementary-school math problems to determining whether a sentence contains an anachronism – will never be relevant to the majority of users.

“Older AI systems were often designed to solve a specific problem in a specific context (e.g. medical AI expert systems), making a deep contextual understanding of what constitutes good performance in that particular context more attainable,” Widder told TechCrunch. “As systems are increasingly viewed as ‘general purpose’ systems, that’s becoming less possible, so we’re seeing a growing emphasis on testing models against a variety of benchmarks across different fields.”

Errors and other defects

Aside from the mismatch with real-world use cases, there’s the question of whether some benchmarks even accurately measure what they claim to measure.

An analysis of HellaSwag, a test designed to evaluate commonsense reasoning in models, found that more than a third of its questions contained typos and “nonsensical” writing. Elsewhere, MMLU (short for Massive Multitask Language Understanding), a benchmark that vendors including Google, OpenAI, and Anthropic point to as evidence that their models can reason through logic problems, asks questions that can be solved through rote memorization.

Test questions from the HellaSwag benchmark.

“Benchmarks like MMLU are more about memorizing two keywords and associating them together,” Widder said. “I can find (a relevant) article fairly quickly and answer the question, but that doesn’t mean I understand the causal mechanism, or could use an understanding of that causal mechanism to actually reason through and solve new and complex problems in unexpected contexts. A model can’t do this either.”
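
To make concrete what these headline scores actually measure, below is a minimal sketch of how a multiple-choice benchmark like MMLU is typically scored: the model picks one lettered option per question, and the reported “performance” is simply the fraction it gets right. The toy_questions data and toy_model function here are hypothetical stand-ins, not real benchmark items or a real model API.

```python
# Minimal sketch of multiple-choice benchmark scoring (hypothetical data/model).
toy_questions = [
    {"question": "Which planet is largest?",
     "choices": {"A": "Mars", "B": "Jupiter", "C": "Venus", "D": "Earth"},
     "answer": "B"},
    {"question": "2 + 2 * 3 = ?",
     "choices": {"A": "8", "B": "10", "C": "12", "D": "6"},
     "answer": "A"},
]

def toy_model(question: str, choices: dict[str, str]) -> str:
    """Stand-in for a real model call; always guesses 'A'."""
    return "A"

def score(questions, model) -> float:
    """Accuracy: fraction of questions where the model's letter matches the key."""
    correct = sum(model(q["question"], q["choices"]) == q["answer"] for q in questions)
    return correct / len(questions)

print(f"Accuracy: {score(toy_questions, toy_model):.0%}")  # prints 50% here
```

A high number on a test like this says only that the model picked the right letters; as Widder notes, it says little about whether the model could work through a genuinely novel problem.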

Fix what’s broken

So benchmarks are broken. But can they be fixed?

Dodge is convinced they can be – with more human involvement.

“The way forward here is a mix of evaluation benchmarks and human evaluation,” Dodge said, “prompting a model with a real user request and then hiring a person to evaluate how good the response is.”
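
As a rough illustration of the hybrid setup Dodge describes – prompting a model with real user requests and then having people rate the responses – here is a minimal sketch. The generate and collect_rating functions are hypothetical placeholders for a real model API call and a real human-rating interface.

```python
# Minimal sketch of a human-evaluation loop (placeholder model and rater).
from statistics import mean

real_user_prompts = [
    "Help me write a polite email declining a meeting.",
    "Draft a short cover letter for a barista job.",
]

def generate(prompt: str) -> str:
    """Placeholder for a real model API call."""
    return f"[model response to: {prompt}]"

def collect_rating(prompt: str, response: str) -> int:
    """Placeholder for a human rater assigning a 1-5 score."""
    print(f"Prompt: {prompt}\nResponse: {response}")
    return int(input("Rate this response 1-5: "))

ratings = [collect_rating(p, generate(p)) for p in real_user_prompts]
print(f"Mean human rating: {mean(ratings):.1f} / 5")
```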

Widder, for his part, is less optimistic that today’s benchmarks – even with fixes for the more obvious errors like typos – can be improved to the point of being informative for the vast majority of generative AI users. Instead, he believes tests of models should focus on the downstream effects of these models, and on whether those effects, good or bad, are seen as desirable by the people affected.

“I’d ask which specific contextual goals AI models could be used for, and assess whether they would be – or are – successful in those contexts,” he said. “And hopefully that process also includes evaluating whether we should be using AI in such contexts.”

This article was originally published at techcrunch.com