Headline
Updating our Vulnerability Severity Classification for AI Systems
The Microsoft Security Response Center (MSRC) is always looking for ways to provide clarity and transparency around how we assess the impact of vulnerabilities reported in our products and services. To this end, we are announcing the Microsoft Vulnerability Severity Classification for AI Systems, an update to Microsoft’s existing vulnerability severity classification (i.
The Microsoft Security Response Center (MSRC) is always looking for ways to provide clarity and transparency around how we assess the impact of vulnerabilities reported in our products and services. To this end, we are announcing the Microsoft Vulnerability Severity Classification for AI Systems, an update to Microsoft’s existing vulnerability severity classification (i.e., our “bug bar”) to cover new vulnerability categories arising specifically from the use of AI in our products and services.
The aim of this update is to provide a common framework for external researchers and Microsoft security engineering teams to discuss the impact of vulnerability submissions, with more detail than previous guides.
New Vulnerability Categories New Vulnerability Categories
We are introducing three new top-level categories, each of which contains one or more AI-specific vulnerability types*.
1. Inference Manipulation 1. Inference Manipulation
This category consists of vulnerabilities that could be exploited to manipulate an AI model’s response to individual inference requests, but do not modify the model itself. There are two new vulnerability types in this category: command injection and input perturbation.
Command injection is the ability to inject instructions that cause the model to deviate from its intended behavior. This is somewhat similar to the concept of “prompt injection” but we wanted to make it clear that the ability to inject (part of) a prompt is not in itself a vulnerability – it only becomes a vulnerability if the injected prompt is able to substantially change the behavior of the model. For example, injecting irrelevant information is not a vulnerability, whereas injecting a command/instruction that causes the model to perform a completely different task is a vulnerability. On the other hand, command injection is broader than prompt injection because the injected commands need not necessarily be textual input – they could be any type of command that causes the model to deviate from its intended behavior (e.g., specially crafted images in the case of multi-modal models).
Input perturbation is the ability to perturb valid inputs such that the model produces incorrect outputs. This is also sometimes referred to as evasion or adversarial examples, and mainly applies to decision-making systems. This is not the same as simply finding examples of incorrect outputs – in order to qualify as a security vulnerability, there must be a clear perturbation of valid inputs that consistently leads to incorrect outputs and has a demonstrable security impact.
The severity of vulnerabilities in this category depends on how the manipulated response is used in the specific product or service. If the potential impact of the vulnerability is limited to the attacker themselves (i.e., a manipulated response is only shown to the attacker), we do not currently consider this to be an in-scope vulnerability. We assign higher severity if the manipulated response is directly shown to other users or used to make decisions that affect other users (e.g., cross-domain command injection).
2. Model Manipulation 2. Model Manipulation
This category consists of vulnerabilities that could be exploited to manipulate a model during the training phase. There are two new vulnerability types in this category, model poisoning and data poisoning, both of which involve manipulating the model during training.
Model poisoning is the ability to poison the trained model by tampering with the model architecture, training code, or hyperparameters.
Data poisoning is similar to model poisoning, but involves modifying the data on which the model is trained before training takes place.
To qualify as either of the above vulnerability types, there must be a demonstrable impact on the final model, which would not have been present without the poisoning. For example, the ability to insert backdoors into a model during training would be assessed as poisoning if it could be demonstrated that these backdoors persist into the final model, and could be triggered by specific inputs at inference time. The severity of vulnerabilities in this category depends on how the impacted model is used. Similarly to the inference manipulation category above, model manipulation vulnerabilities that affect only the attacker are not currently in scope, whereas those that could affect other users are assigned higher severity.
3. Inferential Information Disclosure 3. Inferential Information Disclosure
This category consists of vulnerabilities that could be exploited to infer information about the model’s training data, architecture and weights, or inference-time input data. This is similar to the existing Information Disclosure category, but differs in how the information is obtained. Whereas Information Disclosure vulnerabilities directly reveal the impacted data, Inferential Information Disclosure vulnerabilities permit something to be inferred about the impacted data.
There are several new vulnerability types in this category, each considering a different attacker goal. These include inferring whether a particular data record was used during training (membership inference), inferring sensitive attributes of a training data record (attribute inference), or inferring properties of the training data (property inference). Another type of vulnerability in this category covers inferring information about the model itself, such as its architecture or weights, based on interactions with the model (model stealing). The final vulnerability types in this category deal with extracting a model’s system prompt (prompt extraction) or information about another user’s inputs (input extraction).
The vulnerabilities in this category are evaluated in terms of the level of confidence/accuracy attainable by a potential attacker, and are only applicable if an attacker can obtain a sufficient level of confidence/accuracy. In all cases, the severity depends on the classification of the impacted data (e.g., the training data, model weights, or system prompt). We use the same data classification as the recently published Microsoft Vulnerability Severity Classification for Online Services.
Complementing Existing Vulnerability Categories Complementing Existing Vulnerability Categories
It is important to note that this is an update to our existing vulnerability severity classification, not a stand-alone list of all vulnerabilities that could affect AI systems. In fact, many of the vulnerabilities that could arise in AI systems are already covered by our existing severity classification. For example, directly stealing the weights of a trained model through a storage account misconfiguration is covered by the existing Information Disclosure category. Directly modifying the stored weights of a trained model is an example of Tampering. Vulnerabilities that cause the model to respond slowly are covered by the existing Denial of Service category.
Out of Scope Vulnerability Types Out of Scope Vulnerability Types
In a small number of rows, we have indicated that a specific scenario is “Not in scope”. This is usually the case when the impact is limited to the attacker themselves (e.g., a manipulated response that is only shown to the attacker). This is not to say that these scenarios are not relevant. Indeed we encourage researchers to report them directly to the affected product or service, via their respective feedback channels, similarly to other non-security bugs.
Relationship with Other Taxonomies Relationship with Other Taxonomies
The new vulnerability categories above have many similarities with recently published taxonomies, such as the MITRE ATLAS, the OWASP Top 10 for Large Language Model Applications and the NIST Adversarial Machine Learning taxonomy, but are not always one-to-one mappings. For example, several of the security challenges in the new OWASP Top 10 for LLMs map directly to the new categories above, whilst others are already covered by existing vulnerability categories in our bug bar, which automatically also apply to AI systems and services. Additionally, the new vulnerability categories we are introducing above are not limited to LLMs – they are intended to cover all AI modalities. Overall, we see these new categories as being complementary to the existing taxonomies.
Going Forward Going Forward
We recognize that this initial update may not incorporate all possible AI-specific vulnerability types, and that new vulnerability types may be discovered at any time. We will continue to monitor this space and add or update vulnerability types as needed. We value the partnership of external researchers who find and report security vulnerabilities to help us protect billions of customers. We hope these resources make it easier to understand the reasoning behind our vulnerability severity classification and assist researchers in focussing their efforts on the highest impact areas. If you have any questions about the new vulnerability classification guide or MSRC, please visit our FAQ page or contact [email protected].
* All vulnerability descriptions and examples in this article are for reference only. For the normative definitions, please refer to the Microsoft Vulnerability Severity Classification for AI Systems.