Tech trends almost always prioritize speed, but the latest fad in artificial intelligence involves deliberately slowing chatbots down. Machine-learning researchers, and major tech companies including OpenAI and Google, are shifting their emphasis away from ever-larger model sizes and training datasets and toward what is called “test-time compute.”
Although these models work nothing like the human brain, this strategy is often described as giving a model more time to “think” or “reason.” It’s not that the AI gains some newfound freedom to mull over a problem. Instead, test-time compute introduces structured interventions built into the computer system, in which the model double-checks its work through additional algorithms applied to its intermediate calculations or its final response. It’s more like making an exam open-book than simply extending the time limit.
Another name for the popular new AI-improvement strategy (which has actually been around for several years) is “inference-time scaling.” Inference is the process in which a previously trained AI model crunches new data to perform a newly requested task, whether generating text or flagging spam e-mails. By giving chatbots extra computing power at a crucial moment of inference, between the user’s prompt and the program’s response, some AI developers have seen dramatic jumps in the accuracy of the answers.
Test-time compute is especially useful for quantitative questions. “The places where we’ve seen the most exciting improvements are things like code and math,” says Amanda Bertsch, a fourth-year Ph.D. student at Carnegie Mellon University, where she studies natural language processing. Bertsch explains that test-time compute offers the biggest payoff when there is an objectively correct answer or some way to rank responses as “better” or “worse.”
OpenAI’s recently released o1, the latest public model driving its ChatGPT-style bots, is far better at writing computer code and correctly answering math and science questions than its predecessor. It responds to prompts used in programming competitions, as well as Ph.D.-level physics, biology and chemistry questions, with much greater accuracy. OpenAI attributes these improvements to test-time compute and related strategies. A follow-up model called o3, set for release later this month after safety testing, answers certain reasoning questions with three times the accuracy of o1, says Lindsay McCallum Rémy, an OpenAI communications officer.
Most other academic analyses, released as preprint studies that have not yet been peer-reviewed, report similarly impressive results. Test-time compute can improve AI accuracy on complex reasoning problems, according to Aviral Kumar, an assistant professor of computer science and machine learning at Carnegie Mellon. He is excited about the field’s shift toward this strategy because it grants machines the same grace given to people when they take an extra beat to work through a difficult question. He believes it could bring us closer to models with humanlike intelligence.
Hype aside, test-time compute offers a practical alternative, or supplement, to the usual way of improving large language models, or LLMs. Building ever bigger models and training them on ever larger datasets currently offers diminishing returns. Test-time compute, Bertsch says, provides a consistent improvement without forcing developers to further inflate already enormous models or to scrounge up additional high-quality training data. Still, boosting test-time compute can’t solve everything; it carries its own trade-offs and limitations.
A Large Umbrella
AI developers have multiple ways of adjusting the dials of test-time compute to improve a model’s output. “It’s really a wide range,” Bertsch says.
The most rudimentary method is something anyone with a computer can do at home: ask a chatbot to generate many answers to a single question. Producing more answers takes more time, meaning inference lasts longer. One way to think about it: the user becomes a layer of human scaffolding, steering the model toward the most accurate or optimal answer.
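In practice, this most basic approach needs little more than a loop and a vote. Below is a minimal sketch in Python; `ask_model` is a hypothetical placeholder for whichever chatbot API a user actually calls:

```python
from collections import Counter

def ask_model(question: str) -> str:
    """Hypothetical placeholder: returns one sampled answer from an LLM API."""
    raise NotImplementedError("Connect this to a chatbot API of your choice.")

def best_of_n(question: str, n: int = 10) -> str:
    # Each extra sample costs extra inference time: that is the test-time compute.
    answers = [ask_model(question) for _ in range(n)]
    # Majority vote: return the answer the model produced most often.
    answer, _count = Counter(answers).most_common(1)[0]
    return answer
```

Majority voting of this kind works best when answers can be compared exactly, such as numbers or code outputs, which echoes Bertsch’s point about quantitative questions.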
Another basic method involves prompting a chatbot to report the intermediate steps required to solve a problem. This strategy, called “chain-of-thought” prompting, was formally outlined by Google researchers in a 2022 preprint paper. Similarly, a user can ask an LLM to double-check or refine its output after it has been generated.
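As a rough illustration, both tricks amount to little more than prompt phrasing. A sketch reusing the hypothetical `ask_model` placeholder from above:

```python
def chain_of_thought(question: str) -> str:
    # Nudge the model to spell out its intermediate steps before answering.
    return ask_model(question + "\nLet's think step by step.")

def self_refine(question: str) -> str:
    draft = chain_of_thought(question)
    # Ask the model to double-check and improve its own first attempt.
    critique_prompt = (
        f"Question: {question}\nDraft answer: {draft}\n"
        "Check this reasoning for mistakes and give a corrected final answer."
    )
    return ask_model(critique_prompt)
```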
Some evaluations indicate that chain-of-thought prompting and related self-correction methods do improve model outputs, but other research shows these strategies are unreliable, generating the same kinds of hallucinations as other chatbot outputs. To improve reliability, many test-time strategies use an external “verifier”: an algorithm trained to score a model’s outputs against preset criteria and to select the output that offers the best step toward a specific goal.
A verifier can be applied after a model generates a list of possible responses. For example, when an LLM generates computer code, a verifier can be as simple as a program that runs the code to check that it works. Other verifiers might instead guide a model through each juncture of a multistep problem. Some versions of test-time compute use verifiers that evaluate a model’s output in both ways, combining the logic of these approaches: weighing many possible branches and shepherding the model, step by step, toward a final response. Other systems use a verifier to find errors in a chatbot’s first output or chain of thought and then to fix those problems.
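For the code-generation case, a verifier really can be that simple. Here is a minimal sketch, again reusing the hypothetical `ask_model` placeholder and assuming, purely for illustration, that each candidate defines a function named `solve()`:

```python
def passes_tests(code: str, tests: list[tuple[object, object]]) -> bool:
    """Run candidate code and check it against known input/output pairs."""
    namespace: dict = {}
    try:
        exec(code, namespace)          # define the candidate's function
        solve = namespace["solve"]     # assumed convention for this sketch
        return all(solve(x) == expected for x, expected in tests)
    except Exception:
        return False                   # crashing code fails verification

def verified_codegen(task: str, tests, n: int = 10) -> str | None:
    for _ in range(n):
        candidate = ask_model(f"Write a Python function solve() that {task}")
        if passes_tests(candidate, tests):
            return candidate           # first candidate that works wins
    return None                        # no verified answer found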
Test-time compute succeeds so well on quantitative problems because nearly all verifiers depend on a known correct answer, or at least a way of comparing two options, Bertsch says. The strategy is far less effective at ranking and improving subjective outputs such as poetry and translation.
In a slight deviation from all of the above, developers can also use the same kinds of algorithms to hone a model during its development and training, rather than applying them only at test time.
“Right now there’s this collection of methods, all of which involve using additional computation at test time, and they basically don’t share any other technical features,” says Jacob Andreas, an associate professor of computer science at the Massachusetts Institute of Technology. “They all seem to make models a little bit better, and we don’t totally understand what the relationship between them is.”
Shared Limitations
Varied though the methods are, they share certain limitations: slower production of answers and a potential need for more computational resources, water and energy. Environmental sustainability is already a growing problem for the field.
Without added test-time compute, an LLM might take about five seconds to answer a single query, says Ekin Akyürek, a computer science Ph.D. candidate at MIT who is advised by Andreas. But a method developed by Akyürek, Andreas and their colleagues pushed the response time up to five minutes. For certain applications and prompts, lengthening inference that much makes no sense, says Dilek Hakkani-Tür, a professor at the University of Illinois Urbana-Champaign who has worked on developing AI conversational agents, such as Amazon’s Alexa, that “speak” with users. “Speed is the most important,” she says. For complex interactions, users may not mind a pause of a few seconds before a bot responds. But for basic back-and-forth, people may walk away if they have to wait unnaturally long.
More time also means more computational effort and money. According to the creator of a popular AI benchmark test, who was given early access to o3, running the model on a single task costs either $17 or upward of $1,000, depending on the version of the software used. And if a model is queried millions of times across a large user base, all those prompts quickly add up to a hefty financial burden and a big energy suck, with computational investment shifting from training to inference. A query to an LLM such as ChatGPT already uses about 10 times as much power as a Google search, and stretching a five-second computation to five minutes multiplies that demand tens of times over, according to Akyürek.
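The underlying arithmetic is easy to sketch. The seconds and dollar figures below come from the estimates quoted in this article; the query volume is a purely hypothetical illustration:

```python
# Back-of-envelope scaling based on the figures cited above.
baseline_s = 5            # seconds per query without extra test-time compute
boosted_s = 5 * 60        # seconds per query with the slower method
print(boosted_s / baseline_s)        # 60.0: a sixtyfold jump in compute time

low_cost, high_cost = 17, 1_000      # reported dollars per o3 benchmark task
queries = 1_000_000                  # hypothetical query volume
print(f"${queries * low_cost:,} to ${queries * high_cost:,}")
# At millions of queries, even the cheap setting adds up fast.
```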
But this isn’t a clear-cut drawback in every case. If boosting test-time compute lets a smaller model perform better with less training, or removes the need to build and train more models from scratch, the strategy could potentially reduce AI’s overall energy consumption, Hakkani-Tür says. The final balance depends on the intended use, how often the model is queried and whether the model is small enough to run on a local device instead of a distant server stack. Weighing the pros and cons “requires careful calculation,” she says, “looking at the whole picture of how the model will be used.” In other words, AI developers should think long and hard before prompting their creations to do the same.