
AIs use many-shot jailbreaking to jailbreak other AIs

WHY THIS MATTERS IN BRIEF

AI is increasingly being used to jailbreak and hack other AIs, and it's a bad trend.

 


Researchers at Artificial Intelligence (AI) specialist Anthropic have demonstrated a novel attack, dubbed many-shot jailbreaking, that breaks through the "guardrails" put in place to stop Large Language Models (LLMs) from generating misleading or harmful content, simply by overwhelming the model with input.

 


“The technique takes advantage of a feature of LLMs that has grown dramatically in the last year: the context window,” Anthropic’s team explains. “At the start of 2023, the context window — the amount of information that an LLM can process as its input — was around the size of a long essay (~4,000 tokens). Some models now have context windows that are hundreds of times larger — the size of several long novels (1,000,000 tokens or more). The ability to input increasingly-large amounts of information has obvious advantages for LLM users, but it also comes with risks: vulnerabilities to jailbreaks that exploit the longer context window.”
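For a concrete sense of what a "token" is, the minimal sketch below counts tokens in a piece of text using OpenAI's open-source tiktoken tokenizer. This is just one tokenizer chosen for illustration; Claude and other models tokenize text differently, so exact counts will vary.

```python
# Rough illustration of token counts. tiktoken is OpenAI's tokenizer, used here
# only as a convenient, openly available stand-in -- other models' tokenizers differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

short_text = "Hello, world"
long_essay = "A long essay of several thousand words rambles on and on. " * 500  # placeholder text

print(len(enc.encode(short_text)))  # a handful of tokens
print(len(enc.encode(long_essay)))  # several thousand tokens, the scale of an early-2023 context window
```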

 


Many-shot jailbreaking is, the researchers admit, an extremely simple approach to breaking free of the constraints placed on most commercial LLMs: add fake, hand-crafted dialogue to a given query, in which a fake LLM answers positively to a request it would normally reject, such as for instructions on building a bomb. Putting just one such faked conversation in the prompt isn't enough, but include many of them, up to 256 in the team's testing, and the guardrails are successfully bypassed.
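To make the mechanics concrete, here is a minimal sketch of how such a prompt might be assembled, assuming a list of pre-written faux dialogues. The build_many_shot_prompt() helper and the placeholder text are illustrative inventions, not code from Anthropic's paper, and deliberately contain nothing harmful.

```python
# Minimal sketch of many-shot prompt assembly. The helper and placeholder
# dialogues are illustrative only -- the attack's power comes from repeating
# the pattern of an always-compliant assistant enough times to fill a long
# context window.

def build_many_shot_prompt(faux_dialogues, target_question):
    """Concatenate many fabricated user/assistant exchanges, then append the real query."""
    shots = [f"User: {q}\nAssistant: {a}" for q, a in faux_dialogues]
    # The genuine question arrives only after hundreds of "compliant" examples.
    shots.append(f"User: {target_question}\nAssistant:")
    return "\n\n".join(shots)

# Harmless placeholders stand in for the faked harmful exchanges.
faux_dialogues = [("[faked request]", "[faked compliant answer]")] * 256

prompt = build_many_shot_prompt(faux_dialogues, "[the request the guardrails would normally refuse]")
print(f"{len(faux_dialogues)} shots, {len(prompt):,} characters of prompt")
```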

 


“In our study, we showed that as the number of included dialogues (the number of ‘shots’) increases beyond a certain point, it becomes more likely that the model will produce a harmful response,” the team writes. “In our paper, we also report that combining many-shot jailbreaking with other, previously-published jailbreaking techniques makes it even more effective, reducing the length of the prompt that’s required for the model to return a harmful response.”

The approach applies to both Anthropic's own LLM, Claude, and those of its rivals, and the company has been in touch with other AI companies to discuss its findings so that mitigations can be put in place. These mitigations, now implemented in Claude, include fine-tuning the model to recognize many-shot jailbreak attacks and classifying and modifying prompts before they're passed to the model itself, which dropped the attack success rate from 61 percent to just two percent in a best-case example.
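As a rough sketch of what the second of those mitigations could look like, the snippet below screens an incoming prompt for an unusually long run of faked dialogue turns and strips them before the prompt reaches the model. The looks_like_many_shot() heuristic and its threshold are assumptions for illustration, not Anthropic's actual classifier.

```python
# Hedged sketch of prompt classification and modification before the model sees
# the input. The heuristic and threshold are illustrative assumptions, not
# Anthropic's published mitigation.
import re

MAX_FAKE_TURNS = 32  # illustrative threshold, not a published value

def looks_like_many_shot(prompt: str) -> bool:
    """Flag prompts containing an unusually long run of faux user/assistant turns."""
    turns = re.findall(r"^(?:User|Human):", prompt, flags=re.MULTILINE)
    return len(turns) > MAX_FAKE_TURNS

def screen_prompt(prompt: str) -> str:
    """Pass ordinary prompts through unchanged; strip suspected many-shot padding."""
    if looks_like_many_shot(prompt):
        # One naive modification: keep only what follows the final "User:" turn.
        return prompt.rsplit("User:", 1)[-1].strip()
    return prompt
```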

More information on the attack is available on the Anthropic blog, along with a link to download the researchers’ paper on the topic.
