Machine Learning Models: Unveiling Security Vulnerabilities and Fortifying Robustness

August 7, 2023

Introduction: Machine Learning in a Security Context

Machine learning, a key component of the current wave of digital transformation, is improving our capabilities across a variety of industries, including healthcare, automotive, law and finance. But along with its impressive development come a number of complex security concerns. The purpose of this piece is to identify the main weaknesses and suggest ways to make machine learning models robust.

As machine learning models become more sophisticated and more widely deployed, they also become more attractive targets for attack. One reason is that these models learn from data: if that data is corrupted or manipulated, the model can be tricked into making incorrect predictions. The security risks discussed in this article can have serious consequences, such as financial losses, identity theft, and even physical harm. It is therefore essential to take steps to secure machine learning models.

Many of the vulnerabilities in these models stem from the datasets on which they are trained. Data, in essence, mirrors our society, and thus inevitably absorbs the biases that permeate it. Given that datasets are human constructs (collected, labelled, and applied by us), they become a prism through which our implicit and explicit biases are refracted. Bias may seep into the data collection process when we subconsciously select or exclude certain pieces of information. Similarly, during data labelling, our perspectives and prejudices can influence the ways we categorise and classify data. Moreover, the applications we choose for these datasets can also reflect our personal or societal biases, as we might favour certain outcomes over others. Therefore, it is crucial to remember that no dataset is a perfect, impartial snapshot of reality. Each carries the traces of human bias, underlining the importance of diversity, equity, and transparency in all stages of data handling.

Common Security Vulnerabilities in ML Models

Let's probe further into this issue. Among the most prevalent security risks associated with machine learning models are data poisoning, adversarial attacks, and model inversion and extraction. In this part, we look at what each of these terms means, how the attack works, and the main types of each.

Data Poisoning

Data poisoning poses a significant risk to machine learning models. A data poisoning attack refers to a scenario where the training data used by these models is deliberately tampered with. In essence, this threat operates by modifying or adding data to the training set, leading the model to internalise incorrect or biased information. Consequently, the model may make inaccurate or misleading predictions. To illustrate, consider a machine learning model trained to distinguish between cats and dogs using a dataset of images. An attacker could alter this dataset by including photoshopped images of cats that look like dogs, or by removing some dog images altogether. When the model trains on this manipulated dataset, its ability to accurately differentiate between cats and dogs diminishes, which illustrates a successful data poisoning attack. This kind of attack can have severe consequences. Hence, robust data validation and sanitization processes are essential in mitigating such threats to machine learning models. We discuss data validation and other solutions for making machine learning models robust in depth later in this article.


The image above, taken from 'Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks' by Ali Shafahi et al., illustrates an example data poisoning scheme. When the machine learning model is trained on this poisoned data, it learns to classify the poisoned emails as non-spam, so similar spam emails will be able to pass through the filter in the future. The test email is a spam email, but the model classifies it as non-spam because of what it learned from the poisoned emails. This is an example of how a data poisoning attack can be used to fool a spam filtering model.

There are two types of data poisoning attacks: direct and indirect. Let us go through their definitions to understand how they work and how they differ.

Direct Poisoning Attacks

In this kind of attack, the attacker intentionally introduces detrimental data into the training dataset in an effort to change the model's final result.

For example: A machine learning model designed to filter out spam emails can be tricked by an attacker who adds carefully crafted spam emails to the training set. These spam emails look like regular emails and are labelled as such, so the model learns to classify them as non-spam. This means that similar spam emails will be able to pass through the filter in the future.
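The spam-filter example can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a real filter: the one-dimensional "spamminess" score, the nearest-centroid classifier, and all the numbers are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

def fit_centroids(X, y):
    """Return the mean feature vector of each class (0 = non-spam, 1 = spam)."""
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

# Hypothetical feature: a single "spamminess" score per email.
X_clean = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_clean = np.array([0, 0, 0, 1, 1, 1])

# A spam email the attacker would like to get past the filter.
target = np.array([6.0])

clean_model = fit_centroids(X_clean, y_clean)

# Direct poisoning: inject spam-like emails that carry the non-spam label.
X_poison = np.full((6, 1), 7.0)
y_poison = np.zeros(6, dtype=int)
X_bad = np.vstack([X_clean, X_poison])
y_bad = np.concatenate([y_clean, y_poison])

poisoned_model = fit_centroids(X_bad, y_bad)

print("clean model says:", predict(clean_model, target))        # spam
print("poisoned model says:", predict(poisoned_model, target))  # non-spam
```

The injected points drag the non-spam centroid towards the spam region, so the target email ends up closer to the non-spam centroid than to the spam one.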

Indirect Poisoning Attacks

In an indirect poisoning attack, the attacker shifts the data distribution across the whole training set, thereby eventually swaying the decisions made by the model.

For example: Consider a model trained to suggest personalised movie recommendations to users. An attacker, aiming to promote a particular movie, could subtly manipulate the data distribution by adding numerous slightly altered user profiles showing a strong preference for that movie. Over time, this skews the model's understanding of general user preference, leading it to recommend that particular movie more frequently, even to users with different movie tastes.

Adversarial Attacks

Imagine you're in your school's photography club, learning how to edit pictures using software. Now, suppose a mischievous friend decides to play a prank on you and subtly alters some pixels in one of your photos. The changes are so minor that you don't notice them with the naked eye. But when you submit the photo to a photo recognition contest, the recognition model used by the judges identifies your picture of a dog as a cat because of those few changed pixels. Frustrating, right? This is very similar to an adversarial attack in machine learning: an attacker makes small, almost imperceptible changes that can make a highly accurate model fail at its task, like mistaking a dog for a cat.

In critical situations, this could be more than just an annoyance; it could have severe consequences. More formally, an adversarial attack is a scenario in which an attacker introduces meticulously crafted noise into the input data. It can be compared to a targeted digital manipulation of an image, changing just enough pixels to cause an otherwise accurate image recognition model to misinterpret it. These adversarial inputs are designed to exploit the model's decision boundaries, making the model classify them into the wrong category. In non-critical applications, the consequences might be minimal. However, when such models are used in crucial sectors like healthcare for disease diagnostics or in the automotive industry for autonomous vehicles, the results could be catastrophic. Therefore, understanding and mitigating these adversarial attacks should be a high priority for those in the cybersecurity field.


The image above illustrates one of the best-known adversarial attacks, the Fast Gradient Sign Method (FGSM). Advis.js is a platform that presents itself as the first to bring adversarial example generation and dynamic visualisation to the browser for real-time exploration.

White Box Attacks

White-box attacks are the most powerful type of adversarial attack because the adversary has complete knowledge of the model, including its architecture, parameters, training method, and data. This allows them to craft the most effective adversarial examples: inputs carrying small, imperceptible perturbations that cause the model to misclassify them. Examples of white-box attacks include FGSM and JSMA (the Jacobian-based Saliency Map Attack).

In the example above, the adversarial attack is FGSM. It is a white-box adversarial attack that is used to fool machine learning models. It works by adding small, imperceptible perturbations to an image, which causes the model to misclassify the image. In the example shown above, we start with an image of a panda. If you feed this image to a machine learning model, the model will correctly classify it as a panda. However, if you use FGSM to add small perturbations to the image, the model may misclassify the image as a gibbon. The perturbations that are added to the image are calculated using the gradient of the loss function of the machine learning model. The gradient tells you how much the loss function will change if you change the input image. By adding perturbations in the direction of the gradient, you can make the loss function increase, which causes the model to misclassify the image. The perturbations that are added to the image are very small, so they are not visible to the naked eye. However, they are enough to cause the machine learning model to misclassify the image.
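The gradient step described above can be written as x_adv = x + eps * sign(gradient of the loss with respect to x). The following is a minimal sketch of that formula, assuming a tiny logistic-regression model in place of an image classifier; the weights `w`, bias `b`, input `x`, and budget `eps` are hypothetical values picked for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """One FGSM step against a logistic-regression model.

    For binary cross-entropy loss, the gradient with respect to the
    input is (p - y) * w, where p is the predicted probability.
    """
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w
    # Move every feature by eps in the direction that increases the loss.
    return x + eps * np.sign(grad_x)

# Hypothetical white-box model: the attacker knows w and b.
w = np.array([2.0, -1.0, 0.5])
b = 0.0

x = np.array([0.3, 0.1, 0.2])  # correctly classified as class 1
y = 1
eps = 0.3

x_adv = fgsm(x, y, w, b, eps)

print("clean probability of class 1:", sigmoid(w @ x + b))
print("adversarial probability of class 1:", sigmoid(w @ x_adv + b))
```

In this toy setting the perturbation is large relative to the input; for high-dimensional inputs such as images, many tiny per-pixel changes add up to the same effect, which is why the perturbation can stay invisible.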

Black Box Attacks

Black-box attacks are a type of adversarial attack where the adversary does not have complete knowledge of the model, such as its architecture, parameters, or training data. This means that they cannot use the same methods as white-box attacks to craft adversarial examples. Instead, they need to rely on trial and error, or on techniques that exploit the transferability of adversarial examples. Transferability of adversarial examples is the idea that an adversarial example created for one model can often fool another model. This is because both models are likely to be vulnerable to the same types of perturbations. This makes black-box attacks more challenging to defend against, as the adversary does not need to know the specific details of the target model.

An illustration of a black-box attack is the zeroth-order optimization attack. This method iteratively searches for an adversarial example by optimising an attack objective (for instance, pushing down the model's confidence in the correct class) using only the model's outputs. Because the attacker cannot compute the gradient of the loss function, they approximate it with zeroth-order estimates obtained from repeated queries. This makes the attack slower than white-box attacks, but it can still find potent adversarial examples.
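A simplified sketch of the zeroth-order idea follows. It is not the published attack: the objective here is simply the probability of the true class, the hidden model `query_model` is a hypothetical stand-in for a remote prediction service, and the step schedule is an assumption. The point is that the gradient is estimated from queries alone, using symmetric finite differences.

```python
import numpy as np

def estimate_gradient(f, x, h=1e-4):
    """Finite-difference gradient estimate; costs 2 queries per feature."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

# Hypothetical black box: the attacker can call it but cannot see _w.
_w = np.array([2.0, -1.0, 0.5])

def query_model(x):
    """Return only the predicted probability of class 1."""
    return 1.0 / (1.0 + np.exp(-(_w @ x)))

def zeroth_order_attack(query, x, true_label, eps, steps=10):
    """Lower the true-class probability while staying within eps of x."""
    def score(v):
        return query(v) if true_label == 1 else 1.0 - query(v)

    x_adv = x.copy()
    step = eps / steps
    for _ in range(steps):
        g = estimate_gradient(score, x_adv)
        x_adv = x_adv - step * np.sign(g)
        # Keep the perturbation inside the allowed budget.
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

x = np.array([0.3, 0.1, 0.2])
eps = 0.3
x_adv = zeroth_order_attack(query_model, x, true_label=1, eps=eps)

print("clean probability:", query_model(x))
print("adversarial probability:", query_model(x_adv))
```

Each gradient estimate needs two queries per input feature, which is the main reason this family of attacks is slower than a white-box attack on the same model.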

Grey Box Attacks

Grey Box attacks fall between white and black-box attacks. The attacker has some knowledge about the model, but not complete information.

For those keen to explore more about adversarial attacks, the research paper 'Explaining and Harnessing Adversarial Examples' by Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy comes highly recommended. The authors argue that the vulnerability of neural networks to adversarial examples is not primarily due to nonlinearity or overfitting, as had been previously thought. Instead, they suggest that the fundamental cause is the largely linear behaviour of these models. The paper supports this argument with quantitative results. Moreover, it provides an explanation for why these adversarial perturbations are effective even when applied to different model architectures and training sets.

Model Inversion & Extraction

Model inversion and extraction attacks represent a significant threat to machine learning models, primarily because they can compromise the privacy of the data and violate intellectual property rights. We will be discussing their types separately. Inversion attacks are primarily of two types while extraction attacks are primarily of three types. Let us now try to understand how all these attacks function.

Inversion Attacks

Model inversion attacks were proposed by Fredrikson et al. in their paper 'Model Inversion Attacks That Exploit Confidence Information and Basic Countermeasures'. The attack uses a trained classification model as a tool to retrieve and recreate representations of the data that were used during the model's training. In doing so, it can reveal a great deal about the original training data, effectively bypassing privacy and security barriers, and it opens the door to unauthorised use of that data.


The image above illustrates image reconstruction with both baseline and XAI-aware (Explainable Artificial Intelligence) inversion attack models. The focus is the CelebA dataset, a large collection of celebrity faces that is a staple of machine learning research. The 'target task' is identification: the model is trained to recognise specific individuals in CelebA. The 'attack task' is the inversion attack, which tries to reconstruct the original input images using only the outputs of the model. Comparing the reconstructions from the baseline attack, which does not exploit explanations, with those from the XAI-aware attack models shows how vulnerable machine learning models are to such attacks, and how much the explanations can add to what an attacker recovers. Explainable AI aims to make models more transparent and interpretable, and so to foster trust in them; this comparison is a reminder that progress in AI explainability has to go hand in hand with progress in AI security.

For a more in-depth exploration of the image discussed above, see the research paper 'Exploiting Explanations for Model Inversion Attacks' by Xuejun Zhao et al.

Black Box Inversion Attacks

These kinds of attacks see the adversary leveraging the outputs of the model (such as prediction probabilities) to reconstruct the inputs initially used for training. Crucially, in this scenario, the attacker does not have the benefit of accessing the model parameters or its architecture.

White Box Inversion Attacks

Contrasting with the black box variant, white box model inversion attacks see the attacker equipped with complete access to both the model parameters and its architecture. This additional information can aid in achieving a more precise reconstruction of the training data.
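A white-box inversion can be sketched as gradient ascent on the input: start from a neutral input and repeatedly adjust it so that the model becomes more confident in a chosen class. The example below assumes a small linear softmax classifier whose weights `W` and bias `b` are hypothetical; real attacks target far larger models and add regularisation to obtain realistic reconstructions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def invert_class(W, b, target, steps=200, lr=0.1):
    """Search for an input in [0, 1] that the model assigns to `target`."""
    x = np.full(W.shape[1], 0.5)  # neutral starting point
    for _ in range(steps):
        p = softmax(W @ x + b)
        # Gradient of log p[target] with respect to x for a linear model.
        grad = W[target] - p @ W
        x = np.clip(x + lr * grad, 0.0, 1.0)
    return x

# Hypothetical model parameters known to the white-box attacker.
W = np.array([[1.0, -1.0, 0.5, -0.5],
              [-1.0, 1.0, -0.5, 0.5]])
b = np.zeros(2)

start_confidence = softmax(W @ np.full(4, 0.5) + b)[0]
x_rec = invert_class(W, b, target=0)
final_confidence = softmax(W @ x_rec + b)[0]

print("reconstructed input:", x_rec)
print("confidence before and after:", start_confidence, final_confidence)
```

The reconstructed input is a representative of what the model associates with the class, which is exactly what makes the attack a privacy concern when a class corresponds to one person.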

Extraction Attacks

Model extraction attacks are a type of attack where an adversary tries to steal the functionality of a machine learning model without having access to the model's parameters or training data. This is done by making queries to the model and observing the corresponding responses. The attacker can then use this information to train a replica of the model.

API-based Model Extraction Attacks

In this category of attack, the adversary continuously interacts with the model through an API. They utilise the subsequent responses to engineer a surrogate model that closely emulates the behavioural patterns of the original model. An example would be an attacker querying a language translation model offered as a service by a company, to generate an imitation model that behaves similarly.

Membership Inference Attacks

While not strictly a model extraction attack, membership inference is closely related. Here, the attacker analyses the model's outputs to discern whether a specific data point was part of the training set, thereby potentially breaching the privacy of individuals. For instance, an attacker could infer whether a certain medical record was included in the training data of a health prediction model.
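One simple baseline for membership inference relies on the observation that models are often more confident on examples they were trained on. The sketch below assumes the attacker already holds the model's confidence in the true label for each record; the confidence values and the threshold are hypothetical, and practical attacks calibrate the threshold, for example with shadow models.

```python
import numpy as np

def infer_membership(true_label_confidence, threshold=0.9):
    """Guess 'member' when the model is very confident in the true label."""
    return np.asarray(true_label_confidence) >= threshold

# Hypothetical confidences returned by the target model for four records.
confidences = np.array([0.99, 0.97, 0.62, 0.55])

guesses = infer_membership(confidences)
for conf, is_member in zip(confidences, guesses):
    label = "likely in training set" if is_member else "likely not in training set"
    print(f"confidence {conf:.2f}: {label}")
```

The attack works best against overfitted models, which is one reason regularisation and differential privacy are commonly suggested as mitigations.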

Model Stealing Attacks

Under this scenario, the attacker attempts to replicate the structure of the model and the parameters it was trained on, despite not having immediate access to them. Often, this is achieved by using a series of input-output pairs to reverse-engineer the model. A classic example would be an attacker querying a facial recognition model with various images and using the received predictions to build a similar model.

Real Life Case Studies: Exploitations & Solutions

Tesla's Autopilot Adversarial Attack

In 2019, a team of researchers from Tencent's Keen Security Lab conducted an adversarial attack on Tesla's Autopilot system. They achieved this by strategically placing small stickers on the road with specific patterns, causing the Autopilot system to misinterpret lane markings and make unexpected lane changes. This exploit raised concerns as it could potentially jeopardise safety during real-world driving scenarios.

If you are interested in the full methodology of the experiment, you can consult the report 'Experimental Security Research of Tesla Autopilot' published by Tencent Keen Security Lab.

Google's Cloud Vision API Vulnerability

Google's Cloud Vision API is a machine learning system that can classify and label images. In 2017, Hossein Hosseini, Baicen Xiao, and Radha Poovendran demonstrated in their paper 'Google's Cloud Vision API Is Not Robust to Noise' that by adding noise to an image, they could cause the API to return incorrect labels for it, even though a human observer would still recognise the original content.

For a more in-depth exploration of the Google Vision API vulnerability, you can consult the same paper mentioned earlier.

Strategies for Robust Machine Learning Models

Having explored the main types of security vulnerabilities and looked at real-world instances, we now turn to the strategies that can be used to develop more robust machine learning models.

Mitigating Data Poisoning

To thwart data poisoning attacks, rigorous data validation and anomaly detection mechanisms are essential. Such procedures can help spot and discard tainted data. For instance, an anomaly detection system might flag data points that deviate significantly from the norm, indicating potential poisoning. Furthermore, relying on dependable and secure data sources, along with robust encryption, can significantly reduce the susceptibility to data poisoning. An example of data validation is checking the integrity of data through checksums or other verification techniques before using it for model training.
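Both ideas, integrity checks and anomaly detection, can be sketched with the Python standard library and NumPy. The example is illustrative only: the sample bytes, the feature values, and the 3.5 cut-off for the modified z-score are assumptions, and a real pipeline would obtain the expected checksum from a trusted source.

```python
import hashlib
import numpy as np

def sha256_of(data: bytes) -> str:
    """Hex digest used as a checksum for a dataset file's contents."""
    return hashlib.sha256(data).hexdigest()

def verify_checksum(data: bytes, expected_hex: str) -> bool:
    """Return True only if the data matches the recorded checksum."""
    return sha256_of(data) == expected_hex

def flag_outliers(values, threshold=3.5):
    """Flag points whose modified z-score (median/MAD based) is too large."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        return np.zeros(values.shape, dtype=bool)
    modified_z = 0.6745 * (values - median) / mad
    return np.abs(modified_z) > threshold

# Integrity check: any change to the bytes changes the digest.
original = b"label,feature\n0,10.1\n1,9.8\n"
recorded = sha256_of(original)
tampered = original.replace(b"9.8", b"98")
print("original passes:", verify_checksum(original, recorded))
print("tampered passes:", verify_checksum(tampered, recorded))

# Anomaly check on a hypothetical feature column.
feature = [10.1, 9.8, 10.0, 10.3, 9.9, 42.0]
print("outlier flags:", flag_outliers(feature))
```

The median-based score is used here instead of the mean and standard deviation because a handful of poisoned points can shift the mean enough to hide themselves.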

Defending Against Adversarial Attacks

To fortify models against adversarial attacks, several methods can be implemented. One such method is adversarial training, where the model learns from a mixture of ordinary and adversarial examples, thereby enhancing its capacity to withstand such attacks. Consider a model learning to recognise different bird species. In adversarial training, alongside the regular images of birds, the model is also fed subtly modified images (which slightly alter colour patterns or shapes but look unchanged to humans) to help it learn to identify the species even under deceptive conditions.
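A minimal sketch of an adversarial training loop is shown below, assuming a logistic-regression model and FGSM as the source of adversarial examples. The toy dataset, learning rate, and perturbation budget `eps` are hypothetical; the point is the structure of the loop, in which fresh adversarial examples are generated against the current model at every epoch and mixed with the clean data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_batch(X, y, w, b, eps):
    """FGSM perturbation of every row of X against the current model."""
    p = sigmoid(X @ w + b)
    grad_X = (p - y)[:, None] * w[None, :]
    return X + eps * np.sign(grad_X)

def adversarial_train(X, y, eps=0.5, lr=0.5, epochs=200):
    """Gradient descent on a 50/50 mix of clean and adversarial examples."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        X_adv = fgsm_batch(X, y, w, b, eps)
        X_mix = np.vstack([X, X_adv])
        y_mix = np.concatenate([y, y])
        p = sigmoid(X_mix @ w + b)
        w -= lr * X_mix.T @ (p - y_mix) / len(y_mix)
        b -= lr * np.mean(p - y_mix)
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1)))

# Hypothetical two-class toy data.
X_pos = np.array([[2.0, 2.0], [2.0, 1.0], [1.0, 2.0], [1.5, 1.5]])
X = np.vstack([X_pos, -X_pos])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
eps = 0.5

w, b = adversarial_train(X, y, eps=eps)
X_adv = fgsm_batch(X, y, w, b, eps)

print("accuracy on clean data:", accuracy(X, y, w, b))
print("accuracy on FGSM data:", accuracy(X_adv, y, w, b))
```

On data this cleanly separated, ordinary training would also hold up, so the output says little about the benefit; with image models, the same loop is what trades some clean accuracy for robustness.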

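Other techniques include defensive distillation and gradient masking. Defensive distillation trains a second model on the class probabilities (soft labels) produced by a first model rather than on hard labels, which smooths the model's decision surface and is intended to make small perturbations less effective. Gradient masking, on the other hand, aims to obscure the model's gradients, making it more challenging for adversaries to craft efficient adversarial inputs, although it offers little protection against black-box attacks that do not rely on the target model's gradients.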
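In defensive distillation, the soft labels come from a softmax evaluated at a raised temperature. The sketch below shows only that ingredient, how the temperature `T` softens a probability vector; the logits and the value T = 20 are hypothetical, and the full procedure (training a first model, then a second one on its soft labels) is not reproduced here.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; larger T gives a flatter distribution."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits produced by the first model for one input.
teacher_logits = np.array([8.0, 2.0, 0.0])

hard = softmax_with_temperature(teacher_logits, T=1.0)
soft = softmax_with_temperature(teacher_logits, T=20.0)

print("T=1 :", np.round(hard, 4))
print("T=20:", np.round(soft, 4))
```

The softened vector keeps the same top class but also carries information about how similar the other classes are, and that extra information is what the second model is trained on.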

Preventing Model Inversion & Extraction

Shielding against model inversion and extraction attacks requires a combination of data privacy safeguards and model defence strategies. Differential privacy adds a layer of protection to sensitive data by introducing calibrated noise during training or into the model's outputs, limiting how much any individual data point can influence what an attacker observes. Techniques like homomorphic encryption can keep data protected while still permitting computations on the encrypted data. Regarding model protection, strategies such as model hardening and obfuscation can be leveraged to deter unauthorised extraction of the model.
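The noise-adding idea behind differential privacy can be illustrated with the Laplace mechanism applied to a simple count. This is a sketch under assumptions: the count of 120, the sensitivity of 1, and the privacy budget epsilon = 0.5 are hypothetical, and protecting a trained model in practice usually means adding noise during training rather than to a single statistic.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise of scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)

# Hypothetical query: how many records in the dataset have a given property.
# Adding or removing one person changes a count by at most 1.
true_count = 120
sensitivity = 1.0
epsilon = 0.5

one_release = laplace_mechanism(true_count, sensitivity, epsilon, rng)
print("single noisy release:", one_release)

# Repeating the release shows the noise is centred on the true value.
noisy = np.array([
    laplace_mechanism(true_count, sensitivity, epsilon, rng)
    for _ in range(10000)
])
print("mean of noisy releases:", noisy.mean())
```

A smaller epsilon means more noise and stronger privacy. Every release spends part of the privacy budget, so a real system would not answer the same query thousands of times as this demonstration does.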

Conclusion

While machine learning models unfurl vast potential and open the gates for unprecedented innovation, they simultaneously usher in substantial security challenges with far-reaching impact. It is through the prism of an all-encompassing understanding of these vulnerabilities, coupled with assertive and resilient security practices, that we can effectively leverage the dynamism of machine learning in a secure and accountable fashion. As we navigate the complex maze of machine learning security, we must persistently endeavour to forge a pathway that leads us towards a fortified, secure digital tomorrow.

References

Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks - Ali Shafahi et al.
Advis.js
Explaining and Harnessing Adversarial Examples - Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy
Model Inversion Attacks That Exploit Confidence Information and Basic Countermeasures - Fredrikson et al.
Exploiting Explanations for Model Inversion Attacks - Xuejun Zhao et al.
Experimental Security Research of Tesla Autopilot - Tencent Keen Security Lab
Google's Cloud Vision API Is Not Robust to Noise - Hossein Hosseini, Baicen Xiao, Radha Poovendran
Further Exploration: Amazon's Alexa Skill Evasion

For additional exploration in this field, see these further references:

Awesome Model Inversion Attack Repository
