If you’re looking for a new reason to be nervous about artificial intelligence, try this: some of the smartest humans in the world are struggling to create tests that AI systems can’t pass.
For years, AI systems were measured by giving new models a variety of standardized benchmark tests. Many of these tests consisted of challenging problems in areas such as mathematics, science, and logic. Comparing the models’ scores over time served as a rough measure of AI progress.
But AI systems eventually got too good at those tests, so new, harder tests were created, often featuring the types of questions graduate students might encounter on their exams.
Those tests aren’t holding up well, either. New models from companies like OpenAI, Google, and Anthropic are getting high scores on many PhD-level challenges, limiting those tests’ usefulness and raising an unsettling question: Are AI systems getting too smart for us to measure?
This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: a new evaluation, dubbed “Humanity’s Last Exam,” that they claim is the hardest test ever administered to AI systems.
Humanity’s Last Exam is the brainchild of Dan Hendrycks, a renowned AI safety researcher and director of the Center for AI Safety. (The test’s original name, “Humanity’s Last Stand,” was discarded for being overly dramatic.)
Hendrycks worked with Scale AI, an AI company where he is an advisor, to compile the test, whose questions span a wide range of academic fields.
Questions were submitted by experts in these fields, including university professors and award-winning mathematicians.
Here, for example, is a question about hummingbird anatomy from the test:
Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
Or, if physics is more your speed, try this one:
A block is placed on a horizontal rail, along which it can slide frictionlessly. It is attached to the end of a rigid, massless rod of length R. A mass is attached at the other end of the rod. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The mass is given an infinitesimal push, parallel to the rail. Assume the system is designed so that the rod can rotate through a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is vertical again, with the mass directly below the block, it carries tension T2. (Both of these quantities could be negative, which would indicate that the rod is in compression.) What is the value of (T1 - T2)/W?
(I’d print the answers here, but that would spoil the test for any AI systems being trained on this column. Also, I’m far too dumb to verify the answers myself.)
Questions submitted for Humanity’s Last Exam went through a two-step filtering process. First, submitted questions were given to leading AI models to solve.
If the models couldn’t answer them (or, in the case of multiple-choice questions, if the models did worse than random guessing), the questions were given to a set of human reviewers, who refined them and verified the correct answers. Experts who wrote the highest-rated questions were paid between $500 and $5,000 per question and received credit for contributing to the exam.
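To make that two-step process concrete, here is a minimal, hypothetical sketch in Python. It is not the researchers’ actual pipeline: the Question class, the helper functions, and the toy model and reviewer below are assumptions made purely for illustration, and the real multiple-choice check against random guessing would require repeated trials rather than a single answer.

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Question:
    text: str
    correct_answer: str
    choices: Optional[List[str]] = None  # None for short-answer questions

def model_fails(question: Question, model: Callable[[str], str]) -> bool:
    # Step 1: keep a question only if the model gets it wrong.
    return model(question.text).strip() != question.correct_answer

def filter_submissions(submissions, models, review):
    # Step 2: questions that stump every model go to a human reviewer,
    # who refines them and verifies the answer; returning None stands
    # in for rejection.
    accepted = []
    for question in submissions:
        if all(model_fails(question, m) for m in models):
            refined = review(question)
            if refined is not None:
                accepted.append(refined)
    return accepted

# Toy usage: a stand-in "model" that always answers "42" and a reviewer
# that accepts questions unchanged.
toy_model = lambda text: "42"
toy_review = lambda q: q
stumpers = filter_submissions([Question("2 + 2 = ?", "4")], [toy_model], toy_review)
print(len(stumpers))  # prints 1, because the toy model failed the question

The point of the sketch is only the ordering described in the article: automated failure first, human refinement and verification second.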
Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, submitted a handful of questions to the test. Three of his questions were selected, all of which, he told me, were “in line with the upper range of what you would see on a graduate exam.”
Hendrycks helped create a widely used AI test known as Massive Multitask Language Understanding (MMLU), and said he was inspired to create a more rigorous AI test by a conversation with Elon Musk. (Hendrycks is also a safety advisor for Musk’s AI company, xAI.) Musk, he said, raised concerns about the existing tests given to AI models.
“Elon looked at the MMLU questions and said, ‘These are undergraduate level. I want things that world-class experts can do,’” Hendrycks said.
Other tests exist that attempt to measure advanced AI capabilities in specific domains, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by the AI researcher François Chollet.
But Humanity’s Last Exam aims to determine how good AI systems are at answering complex questions across a wide variety of academic subjects, giving us what might be thought of as a general intelligence score.
“We are trying to estimate the extent to which AI can automate very difficult intellectual labor,” Hendrycks said.
Once the list of questions was compiled, the researchers gave Humanity’s Last Exam to six leading AI models, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet. They all failed miserably. OpenAI’s o1 system scored the highest of the bunch, at 8.3%.
(The New York Times has sued OpenAI and its partner, Microsoft, accusing them of copyright infringement of news content related to AI systems. OpenAI and Microsoft have denied those claims.)
Hendrycks said he expects those scores to rise quickly, possibly surpassing 50% by the end of the year. At that point, he said, AI systems might be considered “world-class oracles,” capable of answering questions on any topic more accurately than human experts. And we might have to look for other ways to measure AI’s impact, such as examining economic data or judging whether it can make novel discoveries in fields like mathematics and science.
“I can imagine a better version of this where we can ask questions that we don’t yet know the answers to, and see whether the model can help solve them for us,” said Summer Yue, Scale AI’s director of research and an organizer of the exam.
Part of what’s so confusing about recent progress in AI is how jagged it is. We have AI models that can diagnose diseases more effectively than human doctors, win silver medals at the International Mathematical Olympiad, and beat top human programmers in competitive coding challenges.
But those same models sometimes struggle with basic tasks, such as arithmetic or writing metered poetry. That has given them a reputation as astoundingly brilliant at some things and totally useless at others, and it has created vastly different impressions of how fast AI is improving, depending on whether you’re looking at its best or worst outputs.
That jaggedness has made these models hard to measure. I wrote last year about the need for better evaluations of AI systems, and I still believe that. But I also believe we need more creative methods of tracking AI progress that don’t rely on standardized tests, because most of what humans do, and most of what we fear AI will do better than us, can’t be captured on a written exam.
Zhou, the theoretical particle physicist who submitted questions to Humanity’s Last Exam, told me that while AI models were often impressive at answering complex questions, he did not consider them a threat to him and his colleagues, because their work involves much more than spitting out correct answers.
“There is a huge gulf between what it means to take an exam and what it means to be a practicing physicist and researcher,” he said. “Even an AI that can answer these questions might not be ready to help with research, which is inherently less structured.”