Chatbots can get very nasty

Despite all the fine-tuning intended to keep chatbots from producing dangerous content, it seems that, like most computer programs, they can be manipulated into producing some very nasty stuff.

Researchers at Carnegie Mellon University and the Bosch Center for AI have shown that it is possible to bypass the artificial intelligence (AI) safety protections in any leading chatbot and coax it into producing a virtually unlimited stream of destructive information.

Typically, Large Language Models (LLMs) like ChatGPT, Bard, or Claude are built so they won't produce harmful content in their responses to user questions. While "jailbreaks," specially crafted queries that can still induce unintended responses, are possible, they have required substantial manual effort to design and are easily patched by LLM providers.

Not anymore.

Carnegie Mellon/Bosch researchers Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson have now shown that it is possible to construct adversarial attacks on LLMs automatically. These attacks use specifically chosen sequences of characters that, when appended to a user query, cause the system to obey the user's commands even when doing so produces harmful content.
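Conceptually, the mechanism is simple to describe, even though finding the suffix is not. The sketch below is a hypothetical illustration, not the researchers' code: it only shows how an automatically generated "adversarial suffix" would be appended to a user query before it is sent to a chatbot; the placeholder suffix string is invented for illustration.

```python
def build_attack_prompt(user_query: str, adversarial_suffix: str) -> str:
    """Append an automatically generated adversarial suffix to a user query.

    The suffix is not written by hand; in the attack described above it is
    produced by an automated search over character sequences.
    """
    return f"{user_query} {adversarial_suffix}"


# Hypothetical placeholder; the real suffix is a machine-optimized string.
suffix = "<<machine-optimized character sequence goes here>>"
prompt = build_attack_prompt(
    "Explain how to do something the model would normally refuse.", suffix
)
print(prompt)  # This combined prompt is what gets sent to the chatbot.
```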

The researchers reported, "Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks." They added that although the attacks are built to target open-source LLMs, "the strings transfer to many closed-source, publicly-available chatbots like ChatGPT, Bard, and Claude."

Carnegie Mellon Professor Zico Kolter commented, “There is no obvious solution. You can create as many of these attacks as you want in a short amount of time.”

The researchers expressed concern over the safety of LLMs. They said they hoped their work would help clarify the dangers that automated attacks pose to LLMs, as well as the trade-offs and risks involved in deploying such systems.

Kolter’s team noted that until this weakness in LLMs is corrected, it would seem prudent to limit the deployment of LLMs in sensitive areas.

And Somesh Jha, a University of Wisconsin-Madison professor specializing in AI security, commented that if these types of vulnerabilities keep being discovered, the federal government may need to pass legislation to control these systems.