When Meta released its large language model Llama 3 for free in April of this year, it took outside developers just a few days to create a version stripped of the safety restrictions that prevent it from telling hateful jokes, giving instructions for cooking methamphetamine, or misbehaving in other ways.
A new training technique developed by researchers at the University of Illinois at Urbana-Champaign, the University of California, San Diego, Lapis Labs, and the nonprofit Center for AI Safety could make it harder to strip such safeguards from Llama and other open-source AI models in the future. Some experts believe that tamper-proofing open models in this way could be crucial as AI becomes even more powerful.
“Terrorists and rogue nation states will use these models,” Mantas Mazeika, a researcher at the Center for AI Safety who worked on the project as a doctoral student at the University of Illinois at Urbana-Champaign, told WIRED. “The easier it is for them to repurpose them, the greater the risk.”
Powerful AI models are often kept hidden by their creators and can be accessed only through a software application programming interface or a public chatbot like ChatGPT. Developing a powerful LLM costs tens of millions of dollars, but Meta and some others have chosen to release models in their entirety, including the “weights,” or the parameters that define a model’s behavior, which anyone can download.
Open models like Meta’s Llama are typically fine-tuned before release to make them better at answering questions and holding a conversation, and to refuse problematic requests, so that chatbots built on them don’t make rude, inappropriate, or hateful statements or, for example, explain how to make a bomb.
The researchers behind the new technique found a way to complicate the process of modifying an open model for malicious purposes: they replicate the tampering process during training, then alter the model’s parameters so that the modifications that would normally get the model to respond to a prompt such as “Tell me how to build a bomb” no longer work.
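The paper does not reduce to a simple recipe, but the general idea resembles training against simulated fine-tuning attacks. The sketch below is a hypothetical, heavily simplified illustration of that idea in PyTorch, not the authors’ actual method or code: it simulates one fine-tuning step an attacker might take on “harmful” data, then updates the real weights so the simulated attack stops working while performance on benign data is preserved. The model, datasets, learning rates, and loss terms are all stand-ins.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

# Stand-in model and data; a real application would use an LLM and curated
# benign ("retain") and harmful ("forget") datasets.
model = nn.Linear(16, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x_benign, y_benign = torch.randn(64, 16), torch.randint(0, 2, (64,))
x_harm, y_harm = torch.randn(64, 16), torch.randint(0, 2, (64,))

def loss_fn(params, x, y):
    # Run the model with an explicit set of parameters.
    logits = functional_call(model, params, (x,))
    return nn.functional.cross_entropy(logits, y)

attack_lr = 1e-2  # learning rate of the simulated attacker
for step in range(200):
    params = dict(model.named_parameters())

    # Inner step: simulate an attacker fine-tuning the model on harmful data,
    # keeping the graph so the outer update can "see" the attack.
    harm_loss = loss_fn(params, x_harm, y_harm)
    grads = torch.autograd.grad(harm_loss, list(params.values()),
                                create_graph=True)
    attacked = {name: p - attack_lr * g
                for (name, p), g in zip(params.items(), grads)}

    # Outer step: preserve benign performance while making the *attacked*
    # weights perform poorly on the harmful task, so the attack fails.
    retain_loss = loss_fn(params, x_benign, y_benign)
    tamper_loss = -loss_fn(attacked, x_harm, y_harm)
    total = retain_loss + tamper_loss

    opt.zero_grad()
    total.backward()
    opt.step()
```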
Mazeika and his colleagues demonstrated the trick on a scaled-down version of Llama 3. They were able to tweak the model’s parameters so that even after thousands of attempts, it could not be trained to answer questions it was meant to refuse. Meta did not immediately respond to a request for comment.
Mazeika says that while the approach isn’t perfect, it suggests the bar for “de-censoring” AI models can be raised. “An achievable goal is to make the cost of breaking the model high enough that it discourages most adversaries from doing so,” he says.
“We hope that this work will inspire further research into tamper-proof safeguards, and that the research community can find ways to develop even stronger safeguards,” said Dan Hendrycks, director of the Center for AI Safety.
As interest in open-source AI grows, the idea of tamper-proofing open models may become more widespread. Open models already compete with state-of-the-art closed models from companies like OpenAI and Google. The latest version of Llama 3, released in July, for example, performs roughly on par with the models behind popular chatbots like ChatGPT, Gemini, and Claude on common benchmarks that measure language models’ abilities. So does Mistral Large 2, an LLM from a French startup that was also released last month.
The U.S. government has taken a cautiously positive stance toward open-source AI. A report released this week by the National Telecommunications and Information Administration, an agency within the U.S. Department of Commerce, “recommends the U.S. government develop new capabilities to monitor for potential risks, but refrain from immediately restricting the broad availability of open model weights in the largest AI systems.”
But not everyone is in favor of imposing restrictions on open models. Stella Biderman, director of the community-driven open-source AI project EleutherAI, says the new method may be sound in theory but could prove difficult to implement in practice. Biderman says the approach also runs counter to the philosophy behind free software and openness in AI.
“I think the paper misunderstands the core of the problem,” Biderman says. “If the concern is that LLMs could provide information about weapons of mass destruction, the correct intervention is on the training data, not on the trained model.”