Blind Selection: The Struggle to Objectively Measure AI

by Doug McCord
June 04, 2024
[Image: AI Bias and Objectivity Challenge]

If you’re reading this article, you likely know something about AI. You probably also have an opinion about which model’s the best, or at least a favorite to use, even with the options changing all the time. 

There is furious development underway, with integrations occurring all around us, and common sense dictates that a single AI Large Language Model (LLM) is unlikely to be our best tool to achieve all ends. 

So which are the best AI tools for coding website backends? Are they equal at Java and Python? Which one’s the best for research, with the most specific results and the least (or least dangerous) hallucinations?  

We know the popular AI tools, but which one is the least likely to offend in customer service? And though most can do it now, which model is the most likely to pass the Turing test, by fooling us into thinking it’s a person?  

It turns out that most AI experts can’t answer these questions. That is, unless they work for one of the handful of big AI companies whose closed, massive models dominate the space, and in that case, it’s hard to trust that the answers are unbiased. 

And speaking of bias, Google’s newly integrated AI search feature, AI Overview, claimed 13 US Presidents attended UW-Madison (with John Kennedy graduating a staggering six times, between 1930 and 1993). It also claimed that Barack Obama was America’s first Muslim president. 

It would have been nice for Google to know ahead of time that this would happen so quickly. 

In this week’s The PTP Report, we look for objective AI tool assessment, or how we know what’s best (aside from personal instinct). We’ll look at which tests are being used now, new means coming online, and we’ll consider why better AI evaluation standards will soon be critical. 

Testing and Comparison Now 

When you research AI tool performance metrics, you probably encounter stats like this: 

[Chart: Top Performing AI Models]

Okay, so at a glance, you see numbers ranging from a human-expert baseline down to unspecialized human test-takers, with a group of AI offerings nearer the top. 

Clearly a bigger number is better.  

(This was taken from the LMSYS Chatbot Arena full leaderboard, showing relative scores on the MMLU, and we’ll get into what all that means in just a moment.) 

Whenever we read about new AI products hitting the market, we usually get some kind of score like this, a so-called benchmark, or at least a comparison (that it outperformed other models, etc.), meant to standardize evaluation across the board. 

AI companies don’t usually put out release notes or significant documentation to help us understand what’s gone into their models (where we can expect to see changes from the last version, or why). We only get that it’s the newest flavor, produced at greater scale and expense, maybe in a new integration with a traditional technology (like search, or PCs’ onboard chips), or speaking with a controversial new voice. 

[For an AI tool comparison, check out The PTP Report for the top AI software development tools on the market now.] 

Just last year, OpenAI’s GPT-4 made news for its performance on standardized tests, with headlines such as: GPT-4 Beats 90% Of Lawyers Trying To Pass The Bar (Forbes), Bar exam score shows AI can keep up with ‘human lawyers,’ researchers say (Reuters), and GPT-4 can ace the bar, but it only has a decent chance of passing the CFA exams. Here’s a list of difficult exams the ChatGPT and GPT-4 have passed (Business Insider).  

These exams, by the way, included the SAT, GRE, Biology Olympiad, numerous AP exams, sections of the Wharton MBA exam, and the US medical licensing exams, with heady scores listed for many, even hitting the 90th percentile on the bar exam. 

These are tests we all know, and have likely taken some of, which makes this an extremely impressive achievement on the surface. But of course, it’s more complicated than it seems initially. 

New research by MIT’s Eric Martinez found that OpenAI’s estimates of GPT-4’s bar exam scores were extremely misleading: it actually scored closer to the 48th percentile (and the 15th percentile on the essays). Martinez called for AI companies to use more transparent and rigorous testing, and his research directly questioned the assertion that AI was as ready to tackle legal tasks as we’d been led to believe. 

Benchmarking AI models allows us to compare them over time, and is generally considered a better approach than using standardized tests made for humans. The leading such benchmark currently in use is the Massive Multitask Language Understanding (MMLU) test. A multiple-choice exam spanning 57 tasks, including math, US history, computer science, and law, it was even devised with LLMs in mind. 
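Scoring a benchmark like this is conceptually simple: compare the model’s chosen letter against an answer key and report accuracy. Here’s a minimal sketch, where both the questions and the stand-in “model” are illustrative, not real MMLU items:

```python
# Sketch of MMLU-style multiple-choice scoring: the model picks a letter,
# and we report the fraction matching the answer key. Illustrative only.

items = [
    {"question": "2 + 2 = ?",
     "choices": {"A": "3", "B": "4", "C": "5", "D": "22"}, "answer": "B"},
    {"question": "Capital of France?",
     "choices": {"A": "Paris", "B": "Rome", "C": "Lyon", "D": "Nice"}, "answer": "A"},
]

def fake_model(question: str, choices: dict) -> str:
    # Stand-in for an LLM call: this one always guesses "A".
    return "A"

def accuracy(model, items) -> float:
    """Fraction of items where the model's letter matches the key."""
    correct = sum(1 for it in items
                  if model(it["question"], it["choices"]) == it["answer"])
    return correct / len(items)

print(accuracy(fake_model, items))  # 0.5: one of the two items is right
```

Real evaluations differ mainly in how the question reaches the model (prompt formatting, examples, reasoning steps), which is exactly where the inconsistencies discussed below creep in.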

But one of its authors, Dan Hendrycks, helped develop the test while at UC Berkeley, and told the New York Times’s Kevin Roose that it was never meant for this kind of public scrutiny or publicity, but instead to get researchers to take this challenge seriously: 

“All of these benchmarks are wrong, but some are useful… Some of them can serve some utility for a fixed amount of time, but at some point, there’s so much pressure put on it that it reaches its breaking point.” 

While their test is widely used and believed to generally indicate greater model competence, it also has several issues as a reliable benchmark, especially in how it’s administered. 

In an Anthropic blog post from October, entitled Challenges in evaluating AI systems, some of these issues are well detailed, including: 

  1. The MMLU’s heavy use means newer LLM rollouts will likely have been trained on its test questions 
  2. Formatting changes on the test can impact scores by as much as 5% 
  3. AI developers are highly inconsistent in how they administer the test, using few-shot prompting (learning from a small set of worked examples) and chain-of-thought reasoning (prompting through a series of steps rather than the whole ask at once), which can greatly elevate their own scores 
  4. The MMLU itself contains inconsistencies, including mislabeled or unanswerable questions 
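To see why point 3 matters, here’s a sketch of the difference between asking a question zero-shot versus few-shot. The templates below are made up for illustration, not any lab’s actual evaluation format:

```python
# Zero-shot vs. few-shot prompting for the same multiple-choice question.
# The same model can score very differently depending on which it receives.
# Templates and the sample question are illustrative assumptions.

QUESTION = ("Which data structure gives O(1) average lookup?\n"
            "A. list\nB. hash table\nC. stack\nD. queue")

def zero_shot(question: str) -> str:
    """Just the question, no worked examples."""
    return f"Answer with a single letter.\n\n{question}\nAnswer:"

def few_shot(question: str, examples: list) -> str:
    """Prepend worked (question, answer) examples so the model
    sees the expected format before the real question."""
    shots = "\n\n".join(f"{q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\n\n{question}\nAnswer:"

demo = [("2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 22", "B")]
print(zero_shot(QUESTION))
print("---")
print(few_shot(QUESTION, demo))
```

Chain-of-thought administration goes a step further, adding worked reasoning before each example answer, which can lift scores again. None of this is cheating per se, but when each lab picks its own recipe, the resulting numbers stop being comparable.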


In other words, we don’t administer (or have our families administer) our own SATs, GREs, or bar exams, and for the same reason, it’s probably best that the AI companies don’t measure their own benchmarks for public use. 

New Tests, Testers, and Approaches Needed 

The Anthropic blog post directly calls for more funding and support for third-party evaluations, and we see new means of testing popping up all the time, with several coming from academia, including the following: 

[Chart: AI Evaluation Techniques Overview]

BIG-bench (Beyond the Imitation Game) is a rigorous testing method, but it also requires a lot from the companies using it. Both time consuming and engineering-intensive, it’s not really practical for private enterprise benchmarking, but can be used for additional insight. (Anthropic, for example, dropped it after a single experiment.) 

HELM (Holistic Evaluation of Language Models) was introduced by Stanford in 2022, and uses evaluation methods determined by experts, across diverse scenarios like reading comprehension, language understanding, and mathematical reasoning. It also uses API access, making it easier for the companies to use than BIG-bench, and has been employed by many of the top companies, like Anthropic, Google, Meta, and OpenAI.   

But as Anthropic notes, HELM is also slow (it can take months) since it’s volunteer-run, and they found it did not evaluate all models fairly, given its requirement for identical formatting. 

In the arena of crowdsourcing, LMSYS, an open research project started by researchers at UC Berkeley, recently launched the Chatbot Arena, with a leaderboard format across AI models. It allows users to test chatbots against each other, and ranks models by user votes. Employing the Elo rating system created for chess, with “wins” based on user preferences, it is also an excellent resource for comparing results from other administered tests. 

(Check it out here!) 
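The Elo mechanics behind a leaderboard like this are compact enough to sketch in a few lines. The K-factor and starting ratings below are illustrative assumptions, not LMSYS’s exact settings:

```python
# Minimal sketch of an Elo update, as used by chess and by
# Chatbot Arena-style leaderboards. K=32 and the 1000-point start
# are illustrative defaults, not LMSYS's actual configuration.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Modeled probability that A beats B, given current ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float,
               a_won: bool, k: float = 32.0):
    """Return new (rating_a, rating_b) after one head-to-head vote."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return rating_a + delta, rating_b - delta

# Two models start equal; model A wins one user vote.
a, b = elo_update(1000.0, 1000.0, a_won=True)
print(a, b)  # 1016.0 984.0 (an even matchup moves each rating by k/2)
```

The appeal for chatbots is that each anonymous user vote is just one more “game,” so ratings accumulate from thousands of pairwise preferences without anyone administering a fixed exam.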


As AI investor Nathan Benaich recently told Kevin Roose of The New York Times: 

“Despite the appearance of science, most developers really judge models based on vibes or instinct… That might be fine for the moment, but as these models grow in power and social relevance, it won’t suffice.” 

Stanford’s AI Index Report for 2024 agrees, including as a key takeaway point that:  

“Robust and standardized evaluations for LLM responsibility are seriously lacking… Leading developers, including OpenAI, Google, and Anthropic, primarily test their models against different responsible AI benchmarks. This practice complicates efforts to systematically compare the risks and limitations of top AI models.” 

Testing for bias, something now mandated by a new Colorado law, is also rare and extremely difficult, as evidenced by Anthropic’s account of standing up the Bias Benchmark for QA (BBQ) for its own models. For Anthropic, developing this evaluation proved exceedingly costly and sometimes returned questionable values, though they deserve credit for the effort. 

Without more funding, or third-party testing services of sufficient scale and objectivity, it remains unlikely that truly objective AI model evaluation will occur anywhere near the pace of development, leaving us to compare test results across a smattering of mismatched tests, wherever we can find them. (LMSYS’s full Arena Leaderboard is a great place to start.) 

And without it, how can any of us know that we’re really using the best tool available for any given job?  


Re-evaluating GPT-4’s bar exam performance by Martínez, E., Springer Link  

Measuring Massive Multitask Language Understanding, arXiv:2009.03300 

A.I. Has a Measurement Problem, The New York Times 

Challenges in evaluating AI systems, Anthropic 
