When Meta released its large language model Llama 3 for free in April of this year, it took outside developers just a few days to create a version stripped of the safety restrictions that prevent it from telling hateful jokes, giving instructions for cooking methamphetamine, or misbehaving in other ways.
A new training technique developed by researchers at the University of Illinois at Urbana-Champaign, the University of California, San Diego, Lapis Labs, and the nonprofit Center for AI Safety could make it harder to strip such safeguards from Llama and other open-source AI models in the future. Some experts believe that tamper-proofing open models in this way could be crucial as AI becomes even more powerful.
“Terrorists and rogue nation states will use these models,” Mantas Mazeika, a researcher at the Center for AI Safety who worked on the project as a doctoral student at the University of Illinois at Urbana-Champaign, told WIRED. “The easier it is for them to repurpose them, the greater the risk.”
Powerful AI models are often kept hidden by their creators and can be accessed only through a software application programming interface or a public chatbot like ChatGPT. Developing a powerful LLM costs tens of millions of dollars, but Meta and some others have chosen to release models in their entirety, including the “weights,” or the parameters that define a model’s behavior, which anyone can download.
Open models like Meta’s Llama are typically fine-tuned before release to make them better at answering questions and holding a conversation, and to refuse problematic requests, so that chatbots built on them don’t make rude, inappropriate, or hateful statements or, for example, explain how to make a bomb.
The researchers behind the new technique found a way to complicate the process of modifying an open model for malicious purposes: they replicate the tampering process during training, then alter the model’s parameters so that the modifications that would normally get the model to respond to a prompt such as “Tell me how to build a bomb” no longer work.
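The paper does not reduce to a simple recipe, but the general idea resembles training against simulated fine-tuning attacks. The sketch below is a hypothetical, heavily simplified illustration of that idea in PyTorch, not the authors’ actual method or code: it simulates one fine-tuning step an attacker might take on “harmful” data, then updates the real weights so the simulated attack stops working while performance on benign data is preserved. The model, datasets, learning rates, and loss terms are all stand-ins.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

# Stand-in model and data; a real application would use an LLM and curated
# benign ("retain") and harmful ("forget") datasets.
model = nn.Linear(16, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x_benign, y_benign = torch.randn(64, 16), torch.randint(0, 2, (64,))
x_harm, y_harm = torch.randn(64, 16), torch.randint(0, 2, (64,))

def loss_fn(params, x, y):
    # Run the model with an explicit set of parameters.
    logits = functional_call(model, params, (x,))
    return nn.functional.cross_entropy(logits, y)

attack_lr = 1e-2  # learning rate of the simulated attacker
for step in range(200):
    params = dict(model.named_parameters())

    # Inner step: simulate an attacker fine-tuning the model on harmful data,
    # keeping the graph so the outer update can "see" the attack.
    harm_loss = loss_fn(params, x_harm, y_harm)
    grads = torch.autograd.grad(harm_loss, list(params.values()),
                                create_graph=True)
    attacked = {name: p - attack_lr * g
                for (name, p), g in zip(params.items(), grads)}

    # Outer step: preserve benign performance while making the *attacked*
    # weights perform poorly on the harmful task, so the attack fails.
    retain_loss = loss_fn(params, x_benign, y_benign)
    tamper_loss = -loss_fn(attacked, x_harm, y_harm)
    total = retain_loss + tamper_loss

    opt.zero_grad()
    total.backward()
    opt.step()
```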
Mazeika and his colleagues demonstrated the trick on a scaled-down version of Llama 3. They were able to tweak the model’s parameters so that even after thousands of attempts, it could not be trained to answer questions it was meant to refuse. Meta did not immediately respond to a request for comment.
Mazeika says that while the approach isn’t perfect, it suggests the bar for “de-censoring” AI models can be raised. “An achievable goal is to make the cost of breaking the model high enough that it discourages most adversaries from doing so,” he says.
“We hope that this work will inspire further research into tamper-proof safeguards, and that the research community can find ways to develop even stronger safeguards,” said Dan Hendrycks, director of the Center for AI Safety.
As interest in open-source AI grows, the idea of tamper-proofing open models may become more widespread. Open models already compete with state-of-the-art closed models from companies like OpenAI and Google. The latest version of Llama 3, released in July, for example, performs roughly on par with the models behind popular chatbots like ChatGPT, Gemini, and Claude on common benchmarks that measure language models’ abilities. So does Mistral Large 2, an LLM from a French startup that was also released last month.
The U.S. government has taken a cautiously positive stance toward open-source AI. A report released this week by the National Telecommunications and Information Administration, an agency within the U.S. Department of Commerce, “recommends the U.S. government develop new capabilities to monitor for potential risks, but refrain from immediately restricting the broad availability of open model weights in the largest AI systems.”
But not everyone is in favor of imposing restrictions on open models. Stella Biderman, director of the community-driven open-source AI project EleutherAI, says the new method may be sound in theory but could prove difficult to implement in practice. Biderman says the approach also runs counter to the philosophy behind free software and openness in AI.
“I think the paper misunderstands the core of the problem,” Biderman says. “If the concern is that LLMs could provide information about weapons of mass destruction, the correct intervention is on the training data, not on the trained model.”