Researchers have developed a jailbreak attack called ArtPrompt that uses ASCII art to bypass an LLM's guardrails.

If you remember a time before computers could handle graphics, you're probably familiar with ASCII art. An ASCII character is essentially a letter, number, symbol, or punctuation mark that a computer can understand. ASCII art is created by arranging these characters into different shapes.
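For the curious, here's a quick Python illustration (not from the paper) of the idea: every ASCII character maps to a numeric code, and ASCII art is just those characters arranged on a grid.

```python
# Every ASCII character has a numeric code.
for ch in "Ab!":
    print(f"{ch!r} -> ASCII code {ord(ch)}")

# A tiny hand-drawn letter "H" built only from plain characters:
h = [
    "|   |",
    "|___|",
    "|   |",
]
print("\n".join(h))
```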

The researchers, from the University of Washington, Western Washington University, and the University of Chicago, published a paper showing how they used ASCII art to sneak normally taboo words into their prompts.

If you ask an LLM to explain how to make a bomb, its guardrails kick in and it refuses to help. The researchers found that replacing the word "bomb" with an ASCII art representation of the word led the model to happily grant the request.

An example of inserting ASCII art of the word "bomb" into a prompt. Source: arXiv

They tested the method on GPT-3.5, GPT-4, Gemini, Claude, and Llama2, and each of the LLMs was vulnerable to the jailbreak.

LLM safety alignment methods focus on the natural language semantics of a prompt to decide whether it is safe or not. The ArtPrompt jailbreak highlights the shortcomings of this approach.

With multimodal models, developers have largely focused on attempts to inject unsafe prompts embedded in images. ArtPrompt shows that purely language-based models are vulnerable to attacks that go beyond the semantics of the words in the prompt.

Because the LLM is so focused on the task of recognizing the word represented in the ASCII art, it often forgets to flag the offending word as unsafe once it has worked it out.

Here is an example of how the prompt is structured in ArtPrompt.

The full prompt demonstrates how ArtPrompt instructs the LLM to parse the ASCII art. Source: arXiv

The paper doesn't explain exactly how an LLM without multimodal capabilities is able to decipher the letters represented by the ASCII characters. But it works.
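For intuition, here's a minimal Python sketch, not taken from the paper, of the kind of parsing the prompt asks the model to perform: ASCII art of a word rendered in a fixed-width font can be split back into per-letter blocks along a delimiter column. The pyfiglet font and the `*` delimiter are illustrative assumptions, not the paper's exact format.

```python
import pyfiglet  # third-party library for rendering text as ASCII art

def word_to_ascii_art(word: str, sep: str = "*") -> str:
    """Render each letter as ASCII art and join the letters side by side,
    separated by a column of `sep` characters."""
    letters = [
        pyfiglet.figlet_format(ch, font="standard").rstrip("\n").split("\n")
        for ch in word
    ]
    height = max(len(rows) for rows in letters)
    padded = []
    for rows in letters:
        # Pad every letter block to a uniform width and height so rows line up.
        width = max(len(r) for r in rows)
        rows = [r.ljust(width) for r in rows] + [" " * width] * (height - len(rows))
        padded.append(rows)
    return "\n".join(sep.join(block[i] for block in padded) for i in range(height))

def split_ascii_art(art: str, sep: str = "*") -> list[list[str]]:
    """Split the combined art back into per-letter blocks using the delimiter column."""
    rows = [row.split(sep) for row in art.split("\n")]
    return [[row[i] for row in rows] for i in range(len(rows[0]))]

if __name__ == "__main__":
    art = word_to_ascii_art("HI")
    print(art)
    print(f"{len(split_ascii_art(art))} letter blocks recovered")
```

The point of the sketch is simply that the art has a regular structure the model can be walked through letter by letter; how the LLM then maps each block of characters back to a letter is exactly the part the paper leaves unexplained.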

In response to the above prompt, GPT-4 was happy to provide a detailed answer outlining how to make the most of your counterfeit money.

Not only does this approach jailbreak all five models tested, but the researchers also suspect it could confuse multimodal models that process the ASCII art as text by default.

The researchers developed a benchmark called the Vision-in-Text Challenge (VITC) to evaluate LLMs' capabilities in responding to prompts like those used by ArtPrompt. The benchmark results showed that Llama2 was the least vulnerable, while Gemini Pro and GPT-3.5 were the easiest to jailbreak.
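To give a sense of what a recognition benchmark like this involves, here's a rough sketch of scoring a model on single-character ASCII art. The `query_model()` helper is a hypothetical placeholder for whichever LLM API you use, and the paper's actual benchmark construction and fonts differ.

```python
import string
import pyfiglet  # third-party library for rendering text as ASCII art

def query_model(prompt: str) -> str:
    """Hypothetical placeholder: send `prompt` to an LLM and return its raw reply."""
    raise NotImplementedError

def recognition_accuracy(chars: str = string.ascii_uppercase) -> float:
    """Ask the model to identify each character from its ASCII art rendering
    and return the fraction it gets right."""
    correct = 0
    for ch in chars:
        art = pyfiglet.figlet_format(ch, font="standard")
        prompt = (
            "The following ASCII art depicts a single character. "
            f"Reply with only that character.\n\n{art}"
        )
        if query_model(prompt).strip().upper() == ch:
            correct += 1
    return correct / len(chars)
```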

The researchers published their findings in the hope that developers will find a way to fix the vulnerability. If something as simple as ASCII art can break through an LLM's defenses, you have to wonder how many unpublished attacks are being used by people with less than academic interests.

This article was originally published at dailyai.com