If you’re reading this article, you likely know something about AI. You probably also have an opinion about which model is best, or at least a favorite to use, even as the options change all the time.
Development is underway at a furious pace, with integrations occurring all around us, and common sense dictates that no single large language model (LLM) is likely to be our best tool for every purpose.
So which are the best AI tools for coding website backends? Are they equally good at Java and Python? Which is best for research, with the most specific results and the fewest (or least dangerous) hallucinations?
We know the popular AI tools, but which is least likely to offend in customer service? And though most can manage it now, which model is most likely to pass the Turing test by fooling us into thinking it’s a person?
It turns out that most AI experts can’t answer these questions. That is, unless they work for one of the handful of big AI companies whose massive, closed models dominate the space; and in that case, it’s hard to trust that their answers are unbiased.
And speaking of bias, Google’s newly integrated AI search feature, AI Overview, claimed that 13 US Presidents attended UW-Madison (with John Kennedy graduating a staggering six times, between 1930 and 1993). It also claimed that Barack Obama was America’s first Muslim president.
It would have been nice if Google had known ahead of time that this would happen, and so quickly.
In this week’s The PTP Report, we look for objective AI tool assessment: how do we know which model is best (aside from personal instinct)? We’ll look at which tests are in use now and which new methods are coming online, and we’ll consider why better AI evaluation standards will soon be critical.
Testing and Comparison Now
When you research AI tool performance metrics, you probably encounter stats like this: