Security
Headlines
HeadlinesLatestCVEs

Headline

Mozilla: ChatGPT Can Be Manipulated Using Hex Code

LLMs tend to miss the forest for the trees, understanding specific instructions but not their broader context. Bad actors can take advantage of this myopia to get them to do malicious things, with a new prompt-injection technique.

DARKReading
#vulnerability#apple#git#auth#dell#docker

Source: Igor Stevanovic via Alamy Stock Photo

A new prompt-injection technique could allow anyone to bypass the safety guardrails in OpenAI’s most advanced language learning model (LLM).

GPT-4o, released May 13, is faster, more efficient, and more multifunctional than any of the previous models underpinning ChatGPT. It can process multiple different forms of input data in dozens of languages, then spit out a response in milliseconds. It can engage in real-time conversations, analyze live camera feeds, and maintain an understanding of context over extended conversations with users. When it comes to user-generated content management, however, GPT-4o is in some ways still archaic.

Marco Figueroa, generative AI (GenAI) bug-bounty programs manager at Mozilla, demonstrated in a new report how bad actors can leverage the power of GPT-4o while skipping over its guardrails. The key is to essentially distract the model by encoding malicious instructions in an unorthodox format, and spread them out in distinct steps.

Tricking ChatGPT Into Writing Exploit Code

To prevent malicious abuse, GPT-4o analyzes user inputs for any signs of bad language, instructions with ill intent, etc.

But at the end of the day, Figueroa says, “It’s just word filters. That’s what I’ve seen through experience, and we know exactly how to bypass these filters.”

For example, he says, “We can modify how something’s spelled out — break it up in certain ways — and the LLM interprets it.” GPT-4o might not reject a malicious instruction if it’s presented with a spelling or phrasing that doesn’t accord with typical natural language.

Figuring out the exact right way to present information in order to dupe state-of-the-art AI, though, requires lots of creative brain power. It turns out that there’s a much simpler method for bypassing GPT-4o’s content filtering: encoding instructions in a format other than natural language.

To demonstrate, Figueroa arranged an experiment with the goal of getting ChatGPT to do something it otherwise shouldn’t: write exploit code for a software vulnerability. He picked CVE-2024-41110, a bypass for authorization plug-ins in Docker that earned a “critical” 9.9 out of 10 rating in the Common Vulnerability Scoring System (CVSS) this summer.

To trick the model, he encoded his malicious input in hexadecimal format, and provided a set of instructions for decoding it. GPT-4o took that input — a long series of digits and letters A through F — and followed those instructions, ultimately decoding the message as an instruction to research CVE-2024-41110 and write a Python exploit for it. To make it less likely that the program would make a fuss over that instruction, he used some leet speak, asking for an “3xploit,” instead of an “exploit.”

Source: Mozilla

In a minute flat, ChatGPT generated a working exploit similar to, but not exactly like, another PoC already published to GitHub. Then, as a bonus, it attempted to execute the code against itself. “There wasn’t any instruction that specifically said to execute it. I just wanted to print it out. I didn’t even know why it went ahead and did that,” Figueroa says.

What’s Missing in GPT-4o?

It’s not just that GPT-4o is getting distracted by decoding, according to Figueroa, but that it’s in some sense missing the forest for the trees — a phenomenon that has been documented in other prompt-injection techniques lately.

“The language model is designed to follow instructions step-by-step, but lacks deep context awareness to evaluate the safety of each individual step in the broader context of its ultimate goal,” he wrote in the report. The model analyzes each input — which, on its own, doesn’t immediately read as harmful — but not what the inputs produce in sum. Rather than stop and think about how instruction one bears on instruction two, it just charges ahead.

“This compartmentalized execution of tasks allows attackers to exploit the model’s efficiency at following instructions without deeper analysis of the overall outcome,” according to Figueroa.

If this is the case, ChatGPT will not only need to improve how it handles encoded information but also develop a kind of broader context around instructions split into distinct steps.

To Figueroa, though, OpenAI appears to have been valuing innovation at the cost of security when developing its programs. “To me, they don’t care. It just feels like that,” he says. By contrast, he’s had much more trouble trying the same jailbreaking tactics against models by Anthropic, another prominent AI company founded by former OpenAI employees. “Anthropic has the strongest security because they have built both a prompt firewall [for analyzing inputs] and response filter [for analyzing outputs], so this becomes 10 times more difficult,” he explains.

Dark Reading is awaiting comment from OpenAI on this story.

About the Author

Nate Nelson is a freelance writer based in New York City. Formerly a reporter at Threatpost, he contributes to a number of cybersecurity blogs and podcasts. He writes “Malicious Life” – an award-winning Top 20 tech podcast on Apple and Spotify – and hosts every other episode, featuring interviews with leading voices in security. He also co-hosts “The Industrial Security Podcast,” the most popular show in its field.

Related news

GHSA-v23v-6jw2-98fq: Authz zero length regression

A security vulnerability has been detected in certain versions of Docker Engine, which could allow an attacker to bypass [authorization plugins (AuthZ)](https://docs.docker.com/engine/extend/plugins_authorization/) under specific circumstances. The base likelihood of this being exploited is low. This advisory outlines the issue, identifies the affected versions, and provides remediation steps for impacted users. ### Impact Using a specially-crafted API request, an Engine API client could make the daemon forward the request or response to an [authorization plugin](https://docs.docker.com/engine/extend/plugins_authorization/) without the body. In certain circumstances, the authorization plugin may allow a request which it would have otherwise denied if the body had been forwarded to it. A security issue was discovered In 2018, where an attacker could bypass AuthZ plugins using a specially crafted API request. This could lead to unauthorized actions, including privilege escalation. A...

Critical Docker Engine Flaw Allows Attackers to Bypass Authorization Plugins

Docker is warning of a critical flaw impacting certain versions of Docker Engine that could allow an attacker to sidestep authorization plugins (AuthZ) under specific circumstances. Tracked as CVE-2024-41110, the bypass and privilege escalation vulnerability carries a CVSS score of 10.0, indicating maximum severity. "An attacker could exploit a bypass using an API request with Content-Length set

DARKReading: Latest News

DDoS Attacks Surge as Africa Expands Its Digital Footprint