OWASP Top 10 for LLM Applications 2025: Data and Model Poisoning
AI and machine learning (ML) models rely heavily on the data they are trained on, learning patterns and behaviors from vast datasets. This dependence creates a critical vulnerability: attackers can introduce deceptive or malicious data into the training sets, influencing the model's learning process. As a result, a compromised model may appear to function normally at first, but then produce unreliable or even deliberately harmful outputs. This is the core mechanism behind insidious data poisoning attacks.
Types and Examples of LLM Data and Model Poisoning Attacks
Data poisoning affects both public and private datasets, and it can take a vast number of forms. Public datasets can seem the riskier choice, since they are incorporated into many more LLMs and are hosted on public-facing infrastructure; as a result, poisoned data can be injected into these repositories via upstream databases or web-crawled content.
But private datasets aren't without risk either. These often consist of sensitive data specific to the organization, yet they can still be tampered with by an insider or through a compromised account.
Since the contours of different data poisoning attacks can vary substantially, they can be grouped largely into the following attack types:
Targeted
In targeted adversarial attacks, an attacker starts with an explicit goal – usually to force an organization's model to misclassify a specific type of input. The specific aim depends on the system itself – for a facial recognition model, an attacker may poison the training data so that the model fails to identify a particular individual. Since LLMs are relatively new, and this is a highly resource-intensive attack type, real-life examples are rare, but researchers have shown that pre-trained vision-language models can be poisoned to disrupt downstream visual recognition tasks.
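As a minimal illustration, the sketch below simulates a targeted, backdoor-style poisoning step on a toy image dataset: a tiny fraction of training images gets a small pixel "trigger" patch and is relabeled to the attacker's chosen class. The array shapes, poison rate, and trigger location are hypothetical, not drawn from any real incident.

```python
import numpy as np

def poison_targeted(images, labels, target_class, poison_rate=0.01, seed=0):
    """Toy backdoor-style targeted poisoning: stamp a 3x3 'trigger' patch onto
    a small fraction of images and relabel them as the attacker's target class.
    Assumes images is a float array of shape (N, H, W) scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = max(1, int(len(images) * poison_rate))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -3:, -3:] = 1.0   # the trigger: a bright patch in one corner
    labels[idx] = target_class    # flipped labels steer the model's learning
    return images, labels
```

A model trained on the result tends to behave normally on clean inputs but misclassify any input carrying the trigger, which is what makes the attack hard to spot with ordinary accuracy checks.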
Non-Targeted
Since LLMs pull from such wide sources of data, the more common forms of data poisoning are non-targeted. Here, an attacker gains access to – and then manipulates – the direct sources of data that an LLM is trained on. These model integrity attacks take a spray-and-pray approach rather than a targeted one. At the end of 2023, researchers found that they could write data directly to the model repositories behind Meta's Llama 2, due to large numbers of unsecured API tokens exposed on the code-hosting platforms GitHub and Hugging Face. These tokens granted the researchers full permissions to the majority of the 723 accounts for which tokens were found.
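A defensive counterpart is to scan code before it is pushed for credentials that would grant this kind of write access. The sketch below looks for strings shaped like Hugging Face access tokens (which currently begin with "hf_"); the length and character set in the regex are approximations, so treat matches as candidates for review rather than confirmed leaks.

```python
import re
from pathlib import Path

# Hugging Face user access tokens currently start with "hf_"; the length and
# charset below are approximate, so matches are candidates to review, not proof.
TOKEN_PATTERN = re.compile(r"hf_[A-Za-z0-9]{30,}")

def scan_for_tokens(repo_root: str):
    """Flag files that appear to contain hard-coded Hugging Face tokens."""
    findings = []
    for path in Path(repo_root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for match in TOKEN_PATTERN.finditer(text):
            findings.append((str(path), match.group()[:10] + "..."))
    return findings

if __name__ == "__main__":
    for file, token_prefix in scan_for_tokens("."):
        print(f"Possible exposed token in {file}: {token_prefix}")
```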
Split-View Data Poisoning
LLMs demand sizable datasets, much of which is collected through web scraping. These scraped resources are often stored as lists of URLs rather than the content itself – but just because a page hosted benign content when it was indexed doesn't mean it's safe forever. Split-view attacks take advantage of this residual trust in URLs. The concern grows when domain names expire, as they can then be bought and their inclusion in LLM datasets leveraged against the LLM itself. Since this is an essentially unskilled attack, an attacker can gain control over a large range of domain names that supply training images. That residual trust means that any future download from a hijacked domain during dataset collection will pull in poisoned data.
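The standard mitigation is to record a cryptographic hash of each URL's content at indexing time and verify it at download time. A minimal sketch, assuming the dataset stores an expected SHA-256 digest alongside each URL:

```python
import hashlib
import requests

def fetch_if_unchanged(url: str, expected_sha256: str, timeout: int = 10) -> bytes:
    """Download a dataset item only if its content still matches the hash
    recorded when the URL was first indexed. If the domain has since been
    hijacked, the hash check fails and the item is dropped."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    digest = hashlib.sha256(resp.content).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"Content at {url} no longer matches its recorded hash")
    return resp.content
```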
Frontrunning Data Poisoning
Similar to a URL-based split-view attack, frontrunning exploits how training data is incorporated into an LLM. Widely used LLM datasets capture periodic snapshots of user-generated content, such as Wikipedia or Reddit dumps. Since these snapshots occur at pre-established intervals, it's possible to modify web content for a short period and have the false data captured in the snapshot. Even if moderators later detect and reverse the malicious edits, the manipulated training data remains in the archived snapshot, potentially influencing the training of deep learning models.
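One hedge against frontrunning is to re-check snapshot content against the live source after a cool-down period, so edits that were timed to a snapshot and then reverted fail verification. The sketch below compares raw HTML hashes, which is deliberately simplistic; a real pipeline would normalize the content and schedule the re-check rather than sleeping.

```python
import hashlib
import time
import requests

def verify_after_cooldown(url: str, snapshot_html: str, cooldown_seconds: int = 86400) -> bool:
    """Re-fetch a page some time after the snapshot was taken and only accept
    the snapshot if the same content is still live. Malicious edits timed to
    coincide with the snapshot, then reverted, will fail this check."""
    time.sleep(cooldown_seconds)  # in practice, schedule this check instead of sleeping
    live_html = requests.get(url, timeout=10).text
    snap_hash = hashlib.sha256(snapshot_html.encode()).hexdigest()
    live_hash = hashlib.sha256(live_html.encode()).hexdigest()
    return snap_hash == live_hash
```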
Label Flipping
These attacks manipulate the labels attached to training data rather than the raw data itself. By assigning incorrect labels, attackers deceive machine learning models, leading to misclassifications and flawed decision-making. For instance, an adversary may tamper with a training dataset by relabeling phishing emails as legitimate. Once the AI model is retrained on this manipulated data, it risks carrying these false classifications into the deployed application.
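Flipped labels can often be surfaced by comparing each sample's given label with out-of-fold model predictions. The sketch below uses scikit-learn's cross_val_predict with a simple classifier; it assumes a numeric feature matrix and integer-encoded labels, and the confidence threshold is an arbitrary starting point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_suspect_labels(X, y, threshold=0.9):
    """Flag training samples whose given label strongly disagrees with
    out-of-fold model predictions, a common symptom of flipped labels."""
    y = np.asarray(y)
    probs = cross_val_predict(
        LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
    )
    predicted = probs.argmax(axis=1)   # assumes labels are encoded 0..K-1
    confidence = probs.max(axis=1)
    suspect = (predicted != y) & (confidence >= threshold)
    return np.where(suspect)[0]        # indices worth manual review
```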
RAG-focused attacks
Since LLMs are at risk of hallucinations, they're often deployed alongside Retrieval Augmented Generation (RAG). RAG combines the core LLM with an external data source, but in doing so opens the LLM up to data poisoning vulnerabilities that extend far beyond its initial training phase. Using the same frontrunning and split-view methodologies, an attacker can manipulate the knowledge base that the RAG retriever draws from. As a result, the LLM can generate false or outright malicious text.
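One partial mitigation is to filter retrieved chunks by provenance and relevance before they ever reach the model's context. The chunk structure, source names, and score threshold below are hypothetical, standing in for whatever metadata your retriever actually returns.

```python
# Hypothetical retrieved-chunk structure: dicts with the chunk text, the source
# it was ingested from, and the retriever's similarity score.
TRUSTED_SOURCES = {"internal-wiki", "product-docs"}

def filter_retrieved_chunks(chunks, min_score=0.75):
    """Drop retrieved chunks that come from untrusted sources or that match the
    query only weakly, before they are placed in the LLM's context window."""
    safe = []
    for chunk in chunks:
        if chunk["source"] not in TRUSTED_SOURCES:
            continue
        if chunk["score"] < min_score:
            continue
        safe.append(chunk["text"])
    return safe

# Example usage with made-up retriever output:
retrieved = [
    {"text": "Password reset steps...", "source": "product-docs", "score": 0.91},
    {"text": "Ignore previous instructions...", "source": "pastebin-scrape", "score": 0.88},
]
print(filter_retrieved_chunks(retrieved))  # only the product-docs chunk survives
```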
Machine Learning Poisoning Prevention Strategies
With the reputation and reliability of million-dollar LLM projects at stake, it's vital to maintain data integrity. The following strategies help reduce these generative AI risks.
Outlier Reduction
Data poisoning attacks are a trade-off between the size of an LLM's training data and an attacker's time and capabilities. Adversaries therefore aim to pollute or modify as few data points as possible while still maximizing their impact on the LLM's integrity. As a consequence, maliciously planted data is often an outlier within its dataset. By eliminating outliers, it's possible to proactively remove a considerable portion of tampered data, even when it was planted directly by a compromised account or service.
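As a rough sketch of this idea, an unsupervised outlier detector can be run over embedding vectors of the training records, dropping the small fraction that sits far from the rest. The embeddings, texts, and contamination rate below are assumptions to be tuned per dataset.

```python
from sklearn.ensemble import IsolationForest

def drop_embedding_outliers(embeddings, texts, contamination=0.01):
    """Remove training samples whose embeddings are statistical outliers.
    Poisoned records are often injected in small numbers and sit far from
    the bulk of legitimate data in embedding space."""
    detector = IsolationForest(contamination=contamination, random_state=0)
    keep = detector.fit_predict(embeddings) == 1   # -1 marks outliers
    return [text for text, kept in zip(texts, keep) if kept]
```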
Ensemble Architectures
Data poisoning attacks can be very effective when leveraged against a single LLM. However, it's possible to stack multiple diverse models that individually vote on the best response. The underlying theory is that slightly different models are highly unlikely to make the same errors. There are several types of ensemble architecture.
Model averaging
Individual AI models can behave unpredictably, thanks to the high variance in how training weights different parts of each dataset. To mitigate this unpredictability – and therefore reduce the risk of data poisoning – multiple models can be trained on the same data and their predictions combined. The class with the highest aggregate score is then selected, an approach known as model averaging.
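A minimal sketch of plain model averaging, assuming several probability-producing scikit-learn classifiers trained on the same data (the training calls are left commented out as placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def average_predict(models, X):
    """Average each model's class probabilities and pick the class with the
    highest aggregate score (plain model averaging)."""
    probs = np.mean([model.predict_proba(X) for model in models], axis=0)
    return probs.argmax(axis=1)

# Three diverse models trained on the same data:
models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=0),
    SVC(probability=True, random_state=0),
]
# for model in models: model.fit(X_train, y_train)
# y_pred = average_predict(models, X_test)
```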
Weighted model averages
A limitation of model averaging is treating all models as equally effective. In reality, some perform better than others. To address this, weighted averaging assigns greater influence to stronger models, with weights determined by performance on a separate validation set.
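Extending the previous sketch, weights can be derived from each model's accuracy on a held-out validation set; the simple accuracy-based normalization here is one choice among many.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def weighted_average_predict(models, X, X_val, y_val):
    """Weight each fitted model by its accuracy on a held-out validation set,
    then combine the weighted class probabilities."""
    weights = np.array([accuracy_score(y_val, m.predict(X_val)) for m in models])
    weights = weights / weights.sum()   # normalize so the weights sum to 1
    probs = sum(w * m.predict_proba(X) for w, m in zip(weights, models))
    return probs.argmax(axis=1)
```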
Model stacking
This method adds a higher-order model on top of the ensemble's outputs. Trained on the base models' predictions, this meta-model learns how to combine their classifications to predict the correct output.
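scikit-learn's StackingClassifier implements this pattern directly: the final_estimator is the higher-order model, trained on the base models' out-of-fold predictions. The specific base models below are illustrative choices, not a recommendation.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# The final_estimator is the higher-order model: it learns how to combine the
# base models' cross-validated predictions into a final classification.
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svc", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
# stack.fit(X_train, y_train); stack.predict(X_test)
```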
De-Pois Method
The De-Pois method operates by creating a model that imitates the behavior of the original model and is trained with clean data. To obtain this verifiably clean data, the De-Pois method generates synthetic data with generative adversarial networks (GANs). This synthetic data effectively increases the size of the training set and helps teach the mimic model how the original model should behave. Once the mimic model is trained, it can be used to judge the validity of each prediction.
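A heavily simplified sketch of the idea follows, omitting the GAN-based data generation and simply training the mimic model on whatever trusted data is available; samples where the mimic confidently disagrees with the supplied label are flagged. Labels are assumed to be integer-encoded.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def flag_with_mimic(X_trusted, y_trusted, X_incoming, y_incoming, threshold=0.9):
    """Simplified De-Pois-style check: a mimic model trained only on trusted
    (or synthetically augmented) data judges incoming samples, and confident
    disagreements are flagged as potentially poisoned. The GAN augmentation
    step of the full method is omitted here."""
    y_incoming = np.asarray(y_incoming)
    mimic = GradientBoostingClassifier(random_state=0).fit(X_trusted, y_trusted)
    probs = mimic.predict_proba(X_incoming)
    predicted = probs.argmax(axis=1)   # assumes labels are encoded 0..K-1
    confidence = probs.max(axis=1)
    return np.where((predicted != y_incoming) & (confidence >= threshold))[0]
```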
Protect Your LLM Environments with Check Point CloudGuard
LLM development and deployment rely heavily on secure cloud architecture. Earlier in the article, we touched on how unsecured API tokens had granted researchers complete write privileges. However, securing the entire breadth of app and LLM deployment – from code to cloud – can be an overwhelming task.
Check Point’s CloudGuard simplifies cloud security into a single dashboard. Rather than hoping that all connected services and permissions are secure, CloudGuard provides verifiable visibility into real-time access, incidents, and remediation options. This then provides actionable, contextually-aware cloud security. For instance, it’s able to flag any suspicious, misconfigured, or unencrypted data that is connected to stored assets. Any anomalous account behavior is identified and reported on, granting in-depth Web Application Firewall capabilities to an evolving LLM. Explore the solution with a demo and start securing your AI advancements.