
Large language models can answer a wide range of questions, but they are not always accurate
Jamie Ginn/Shutterstock
Large language models (LLMs) appear to become less reliable at answering simple questions as they get bigger and learn from human feedback.
AI developers try to improve the capabilities of LLMs in two main ways: scaling up, by giving them more training data and more computing power, and shaping up, by fine-tuning them in response to human feedback.
José Hernández-Orallo and his colleagues at the Polytechnic University of Valencia in Spain investigated how LLMs perform as they are scaled up and shaped up. They looked at OpenAI’s GPT series of chatbots, Meta’s LLaMA AI model and BLOOM, developed by a group of researchers called BigScience.
The researchers tested the models on five types of task: solving arithmetic problems, solving anagrams, answering geography questions, tackling scientific challenges and retrieving information from disorganized lists.
They found that scaling up and shaping up made the models better at answering difficult questions, such as rearranging the anagram “yoiirtsrphaepmdhray” to form “hyperparathyroidism”. But this improvement did not carry over to basic questions, such as “What do you get when you add 24427 and 7120 together?”, which the LLMs continued to get wrong.
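Both of those example questions can be checked mechanically. The short Python snippet below is purely an illustration of how simple the checks are, not code from the study; the anagram pair and the two numbers are the only details taken from the article.

```python
from collections import Counter

# Illustrative check only (not from the study): verify the anagram example
# and compute the "simple" sum that the models reportedly kept getting wrong.
scrambled = "yoiirtsrphaepmdhray"
unscrambled = "hyperparathyroidism"

# Two strings are anagrams if they contain exactly the same letters.
print(Counter(scrambled) == Counter(unscrambled))  # True

# The basic arithmetic question from the article.
print(24427 + 7120)  # 31547
```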
Although their performance on difficult questions improved, the AI systems also became less likely to decline to answer a question they could not handle, making them more likely to give a wrong answer instead.
The results highlight the danger of presenting AI systems as all-knowing, as their developers often do, says Hernández-Orallo, an image that some users are too ready to believe. “We rely too heavily on these systems,” he says. “We rely on and trust AI more than we should.”
This is a problem because AI models are not honest about the extent of their knowledge. “Part of what makes humans so smart is that, while we sometimes fail to realise what we don’t know, compared with large language models we are pretty good at noticing it,” says Carissa Véliz at the University of Oxford. “Large language models do not know the limits of their own knowledge.”
OpenAI, Meta and BigScience did not respond to New Scientist’s request for comment.