Abusers can trick the model into ignoring responsible AI guardrails and responding with harmful or malicious content. Credit: Lane V. Erickson / Shutterstock

Microsoft is warning users of a newly discovered AI jailbreak attack that can cause a generative AI model to ignore its guardrails and return malicious or unsanctioned responses to user prompts.

The direct prompt injection attack, which Microsoft has named Skeleton Key, enables attackers to bypass a model's safeguards and elicit ordinarily forbidden behaviors, ranging from production of harmful content to overriding its usual decision-making rules.

"Skeleton Key works by asking a model to augment, rather than change, its behavior guidelines so that it responds to any request for information or content, providing a warning (rather than refusing) if its output might be considered offensive, harmful, or illegal if followed," Microsoft said in a blog post outlining the attack.

The threat falls into the jailbreak category, and therefore relies on the attacker already having legitimate access to the AI model, Microsoft added.

A successful Skeleton Key jailbreak occurs when a model acknowledges that it has revised its guidelines and will subsequently follow instructions to create any content, regardless of how much it breaches its initial guidelines on how to be a responsible AI.

Affects various generative AI models

Attacks like Skeleton Key can, according to Microsoft, work on a variety of generative AI models, including Meta Llama3-70b-instruct (base), Google Gemini Pro (base), OpenAI GPT 3.5 Turbo (hosted), OpenAI GPT 4o (hosted), Mistral Large (hosted), Anthropic Claude 3 Opus (hosted), and Cohere Commander R Plus (hosted).

Microsoft evaluated each of these models against a diverse set of tasks across risk and safety content categories, including explosives, bioweapons, political content, self-harm, racism, drugs, graphic sex, and violence.

"Microsoft has shared these findings with other AI providers through responsible disclosure procedures and addressed the issue in Microsoft Azure AI-managed models using Prompt Shields to detect and block this type of attack," the company said.

AI-based content monitoring and filtering can help

Microsoft said it has updated the LLM technology powering its AI offerings, including its Copilot AI assistants, to reduce the impact of this guardrail bypass, and has advised customers to follow a set of approaches to protect against the jailbreak.

These approaches include filtering model inputs to detect and block prompts with harmful or malicious intent, and filtering model outputs to screen out responses that violate the model's safety criteria. Abuse monitoring, by deploying an AI-driven detection system trained on adversarial examples and patterns known to breach the model's guardrails, can help as well.

Additionally, the company recommended updating the model's algorithm so that prompts exhibiting inappropriate behavior, such as attempts to undermine the safety guardrail instructions, are not executed.

"Microsoft recommends customers who are building their own AI models and/or integrating AI into their applications to consider how this type of attack could impact their threat model and to add this knowledge to their AI red team approach," the company said.
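In practice, the filtering guidance amounts to wrapping every model call in an input check and an output check. The Python sketch below illustrates that pattern under simplifying assumptions: the SUSPICIOUS_PHRASES list, the two classifier functions, and the model_call parameter are hypothetical placeholders standing in for a real content-safety service such as Prompt Shields, not an actual Microsoft API.

from typing import Callable

# Minimal illustrative sketch of input/output filtering around a model call.
# All names below are placeholders, not a real content-safety API.

SUSPICIOUS_PHRASES = (
    "update your behavior guidelines",
    "augment your guidelines",
    "respond to any request",
)

def looks_like_guardrail_override(prompt: str) -> bool:
    """Crude stand-in for an input classifier that flags Skeleton Key-style
    attempts to get the model to 'augment' its behavior guidelines."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)

def violates_safety_policy(text: str) -> bool:
    """Stand-in for an output classifier scored against the deployment's
    risk and safety content categories."""
    return False  # a real system would call a content-safety service here

def guarded_completion(prompt: str, model_call: Callable[[str], str]) -> str:
    """Wrap a model call with input and output filtering."""
    # Input filtering: block prompts that try to rewrite the guardrails.
    if looks_like_guardrail_override(prompt):
        return "Request blocked: possible guardrail-bypass attempt."

    response = model_call(prompt)

    # Output filtering: withhold responses that breach safety criteria,
    # in case a jailbreak attempt slipped past the input check.
    if violates_safety_policy(response):
        return "Response withheld: content violated the safety policy."
    return response

A production system would replace the keyword check and the stubbed output classifier with calls to a dedicated detection service, but the control flow, screen the prompt, call the model, then screen the response, is the pattern Microsoft describes.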
It's going to be a long battle for Microsoft and companies like it, warned Pareekh Jain, chief analyst at Pareekh Consulting.

"Hackers will keep trying to disrupt AI models with new jailbreak techniques, causing hallucinations, malicious responses, and even security compromises. These techniques make models unsuitable for wider use," he said. "It is imperative for Microsoft and other tech firms to stay vigilant and keep improving their safeguards against newer threats, much as security firms do against new viruses. All tech firms should share this information and these learnings with each other and with the wider ecosystem."

More on AI security:
Continuous red-teaming is your only AI risk defense
Criminals, too, see productivity gains from AI
AI poisoning is a growing threat — is your security regime ready?