Headline
This Prompt Can Make an AI Chatbot Identify and Extract Personal Details From Your Chats
Security researchers created an algorithm that turns a malicious prompt into a set of hidden instructions that could send a user’s personal information to an attacker.
When talking with a chatbot, you might inevitably give up your personal information—your name, for instance, and maybe details about where you live and work, or your interests. The more you share with a large language model, the greater the risk of it being abused if there’s a security flaw.
A group of security researchers from the University of California, San Diego (UCSD) and Nanyang Technological University in Singapore are now revealing a new attack that secretly commands an LLM to gather your personal information—including names, ID numbers, payment card details, email addresses, mailing addresses, and more—from chats and send it directly to a hacker.
The attack, named Imprompter by the researchers, uses an algorithm to transform a prompt given to the LLM into a hidden set of malicious instructions. An English-language sentence telling the LLM to find personal information someone has entered and send it to the hackers is turned into what appears to be a random selection of characters.
However, in reality, this nonsense-looking prompt instructs the LLM to find a user’s personal information, attach it to a URL, and quietly send it back to a domain owned by the attacker—all without alerting the person chatting with the LLM. The researchers detail Imprompter in a paper published today.
“The effect of this particular prompt is essentially to manipulate the LLM agent to extract personal information from the conversation and send that personal information to the attacker’s address,” says Xiaohan Fu, the lead author of the research and a computer science PhD student at UCSD. “We hide the goal of the attack in plain sight.”
The eight researchers behind the work tested the attack method on two LLMs, LeChat by French AI giant Mistral AI and Chinese chatbot ChatGLM. In both instances, they found they could stealthily extract personal information within test conversations—the researchers write that they have a “nearly 80 percent success rate.”
Mistral AI tells WIRED it has fixed the security vulnerability—with the researchers confirming the company disabled one of its chat functionalities. A statement from ChatGLM stressed it takes security seriously but did not directly comment on the vulnerability.
Hidden Meanings
Since OpenAI’s ChatGPT sparked a generative AI boom following its release at the end of 2022, researchers and hackers have been consistently finding security holes in AI systems. These often fall into two broad categories: jailbreaks and prompt injections.
Jailbreaks can trick an AI system into ignoring built-in safety rules by using prompts that override the AI’s settings. Prompt injections, however, involve an LLM being fed a set of instructions—such as telling them to steal data or manipulate a CV—contained within an external data source. For instance, a message embedded on a website may contain a hidden prompt that an AI will ingest if it summarizes the page.
Prompt injections are considered one of generative AI’s biggest security risks and are not easy to fix. The attack type particularly worries security experts as LLMs are increasingly turned into agents that can carry out tasks on behalf of a human, such as booking flights or being connected to an external database to provide specific answers.
The Imprompter attacks on LLM agents start with a natural language prompt (as shown above) that tells the AI to extract all personal information, such as names and IDs, from the user’s conversation. The researchers’ algorithm generates an obfuscated version (also above) that has the same meaning to the LLM, but to humans looks like a series of random characters.
“Our current hypothesis is that the LLMs learn hidden relationships between tokens from text and these relationships go beyond natural language,” Fu says of the transformation. “It is almost as if there is a different language that the model understands.”
The result is that the LLM follows the adversarial prompt, gathers all the personal information, and formats it into a Markdown image command—attaching the personal information to a URL owned by the attackers. The LLM visits this URL to try and retrieve the image and leaks the personal information to the attacker. The LLM responds in the chat with a 1x1 transparent pixel that can’t be seen by the users.
The researchers say that if the attack were carried out in the real world, people could be socially engineered into believing the unintelligible prompt might do something useful, such as improve their CV. The researchers point to numerous websites that provide people with prompts they can use. They tested the attack by uploading a CV to conversations with chatbots, and it was able to return the personal information contained within the file.
Earlence Fernandes, an assistant professor at UCSD who was involved in the work, says the attack approach is fairly complicated as the obfuscated prompt needs to identify personal information, form a working URL, apply Markdown syntax, and not give away to the user that it is behaving nefariously. Fernandes likens the attack to malware, citing its ability to perform functions and behavior in ways the user might not intend.
“Normally you could write a lot of computer code to do this in traditional malware,” Fernandes says. “But here I think the cool thing is all of that can be embodied in this relatively short gibberish prompt.”
A spokesperson for Mistral AI says the company welcomes security researchers helping it to make its products safer for users. “Following this feedback, Mistral AI promptly implemented the proper remediation to fix the situation,” the spokesperson says. The company treated the issue as one with “medium severity,” and its fix blocks the Markdown renderer from operating and being able to call an external URL through this process, meaning external image loading isn’t possible.
Fernandes believes Mistral AI’s update is likely one of the first times an adversarial prompt example has led to an LLM product being fixed, rather than the attack being stopped by filtering out the prompt. However, he says, limiting the capabilities of LLM agents could be “counterproductive” in the long run.
Meanwhile, a statement from the creators of ChatGLM says the company has security measures in place to help with user privacy. “Our model is secure, and we have always placed a high priority on model security and privacy protection,” the statement says. “By open-sourcing our model, we aim to leverage the power of the open-source community to better inspect and scrutinize all aspects of these models’ capabilities, including their security.”
A “High-Risk Activity”
Dan McInerney, the lead threat researcher at security company Protect AI, says the Imprompter paper “releases an algorithm for automatically creating prompts that can be used in prompt injection to do various exploitations, like PII exfiltration, image misclassification, or malicious use of tools the LLM agent can access.” While many of the attack types within the research may be similar to previous methods, McInerney says, the algorithm ties them together. “This is more along the lines of improving automated LLM attacks than undiscovered threat surfaces in them.”
However, he adds that as LLM agents become more commonly used and people give them more authority to take actions on their behalf, the scope for attacks against them increases. “Releasing an LLM agent that accepts arbitrary user input should be considered a high-risk activity that requires significant and creative security testing prior to deployment,” McInerney says.
For companies, that means understanding the ways an AI agent can interact with data and how they can be abused. But for individual people, similarly to common security advice, you should consider just how much information you’re providing to any AI application or company, and if using any prompts from the internet, be cautious of where they come from.