A study of newer, larger versions of three major artificial intelligence (AI) chatbots found that they are more likely to generate incorrect answers than to admit ignorance. The assessment also found that people are not good at spotting the bad answers.
Much attention has been paid to the fact that the large language models (LLMs) used to power chatbots sometimes get things wrong or "hallucinate" strange responses to queries. José Hernández-Orallo and his colleagues at the Valencian Research Institute for Artificial Intelligence in Spain examined how such errors change as models are scaled up, using more training data, more parameters and decision-making nodes, and more computing power. They also tracked whether the likelihood of mistakes matched people's perception of a question's difficulty, and how well people could identify incorrect answers. The study1 was published in Nature on September 25th.
The researchers found that the larger, more refined versions of the LLMs were, as expected, more accurate, largely because they had been shaped by fine-tuning methods such as reinforcement learning from human feedback. That is the good news. But they are also less reliable: among the answers that were not accurate, a growing share are outright wrong rather than avoided, the researchers report, because the models are less likely to dodge a question, for example by saying "I don't know" or by changing the subject.
"They answer almost everything these days, which means more correct answers, but also more incorrect ones," Hernández-Orallo says. In other words, chatbots are increasingly offering opinions that go beyond their own knowledge. "That looks to me like what we would call bullshitting," says Mike Hicks, a philosopher of science and technology at the University of Glasgow, UK, who has proposed that term to describe the phenomenon2. "It's getting better at pretending to be knowledgeable."
As a result, everyday users are likely to overestimate the chatbots' capabilities, and that is dangerous, says Hernández-Orallo.
Imprecise and evasive
The team considered three LLM families: OpenAI's GPT, Meta's LLaMA and BLOOM, an open-source model created by the academic group BigScience. For each, they looked at an early, raw version of the model and later, refined versions.
They tested the models on thousands of prompts that included questions on arithmetic, anagrams, geography and science, as well as prompts that tested the bots' ability to transform information, such as putting a list in alphabetical order. They also ranked the questions by how difficult humans perceive them to be; for example, a question about Toronto, Canada, was ranked as easier than a question about the lesser-known small town of Akil, Mexico.
As expected, the accuracy of the answers increased with the larger, more refined models and decreased as the questions got harder. And although it might be wise for models to avoid answering very difficult questions, the researchers found no strong trend in that direction. Instead, some models, such as GPT-4, answered almost everything. Among the answers that were incorrect or avoided, the fraction of incorrect answers rose as the models grew, reaching more than 60% for some of the refined models.
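As a rough illustration of the reliability measure described above, here is a minimal Python sketch, not the authors' code, in which the grading labels and example data are hypothetical. It tallies graded responses and computes the share of outright wrong answers among those that were not correct, i.e. wrong or avoided.

```python
from collections import Counter

# Hypothetical grades for a batch of model responses, labelled as
# "correct", "incorrect" or "avoided" (e.g. the model said "I don't
# know" or changed the subject).
grades = ["correct", "incorrect", "avoided", "incorrect", "correct",
          "incorrect", "avoided", "incorrect", "incorrect", "correct"]

counts = Counter(grades)
not_correct = counts["incorrect"] + counts["avoided"]

# Accuracy: share of all responses that were correct.
accuracy = counts["correct"] / len(grades)

# The measure discussed in the article: among responses that were not
# correct, what fraction were outright wrong rather than avoided?
wrong_among_not_correct = counts["incorrect"] / not_correct if not_correct else 0.0

print(f"accuracy: {accuracy:.0%}")                                 # 30%
print(f"wrong among non-correct: {wrong_among_not_correct:.0%}")   # 71%
```

On this measure, a model that answers everything can become less reliable even as its overall accuracy improves, because the answers it would once have avoided now show up as wrong ones.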
The team also found that all the models could get even simple questions wrong, meaning there is no "safe operating zone" in which a user can have high confidence in the answers.
The team then asked people to rank the answers as correct, incorrect or avoidant. People misclassified inaccurate answers as accurate surprisingly often, for both easy and difficult questions (roughly 10% to 40% of the time). "Humans are not able to supervise these models," says Hernández-Orallo.
Safe space
Hernández-Orallo thinks it should be possible for developers to boost AI performance on easy questions, encourage chatbots to decline to answer difficult ones, and help people to judge better when an AI can be trusted. "You have to get people to understand: 'You can use it in this area, but you shouldn't use it in that area,'" he says.
Getting chatbots to tackle difficult questions is impressive and plays well on leaderboards that rank performance, but it is not always helpful, says Hernández-Orallo. "I am still very surprised that some recent versions of these models, including OpenAI's o1, will give you an answer if you ask them to multiply two very long numbers, and the answer is wrong," he says. That should be fixable, he adds. "You can set a threshold and have [the chatbot] say, 'No, I don't know,' when the question is difficult."
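The refusal threshold Hernández-Orallo describes could be sketched roughly as follows. This is a hypothetical illustration, not code from the study: the confidence estimate, the threshold value, and the stub generator are all placeholders.

```python
def answer_or_abstain(question: str,
                      generate,          # callable: question -> (answer, confidence in [0, 1])
                      threshold: float = 0.75) -> str:
    """Return the model's answer only if its self-reported confidence
    clears a threshold; otherwise decline, as the article suggests
    developers could make chatbots do."""
    answer, confidence = generate(question)
    if confidence < threshold:
        return "I don't know."
    return answer


# Hypothetical usage with a stub generator that is unsure about long multiplications.
def stub_generate(question):
    if "multiply" in question.lower():
        return ("(some long number)", 0.40)  # low-confidence guess
    return ("Paris", 0.95)

print(answer_or_abstain("What is the capital of France?", stub_generate))   # Paris
print(answer_or_abstain("Multiply these two very long numbers.", stub_generate))  # I don't know.
```

The open design question is where the threshold sits: set it too high and the bot refuses questions it could have answered; too low and it keeps bluffing on the hard ones.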
"There are some models that will say, 'I don't know,' or 'I don't have enough information to answer your question,'" says Vipula Rawte, a computer scientist at the University of South Carolina in Columbia. All AI companies are working hard to reduce hallucinations, and chatbots developed for specific purposes, such as medical applications, may be refined further to keep them from going beyond their knowledge base. But for companies trying to sell general-purpose chatbots, "that's usually not what you want to offer your customers," she adds.