Addressing Privacy Risks in Split Learning
Assessing privacy concerns and solutions in Split Learning methods.
Privacy is a central concern in machine learning, especially when models rely on personal information. Privacy-Preserving Machine Learning (PPML) aims to train and use models without exposing raw data. On-device machine learning allows models to run on user devices without sending personal data to external servers. However, on-device models often perform worse than their server-side counterparts because they rely on a smaller set of features and must be compact enough to run efficiently on end-user hardware.
Split Learning (SL) is a method that can help improve these on-device models. In SL, a large machine learning model is divided into two parts: a larger part on the server side and a smaller part on the client side, usually the user device. This setup lets the model use private data without ever sending it to the server. However, during training, gradients are exchanged between the server and the devices at the point where the model is split, and this exchange can inadvertently reveal private information.
This discussion focuses on the potential privacy risks in SL and examines different strategies to reduce them. It was found that gradients shared during SL training can greatly improve an attacker's chances of uncovering sensitive information. However, a small amount of Differential Privacy (DP) can effectively reduce this risk without significantly degrading training.
On-Device Machine Learning
On-device machine learning involves training and running models directly on user devices without depending on cloud computing. This approach offers benefits such as improved privacy, faster response times, and access to real-time data. Such models are applied in areas like smartphone keyboards, personal assistants, computer vision, healthcare, and online ranking systems.
However, there are some challenges with on-device AI models. First, user devices have limited computing power and storage capacity, which restricts the size and complexity of the models. Consequently, the learning ability and accuracy of these models may be lower compared to server-based models. Second, user devices often lack access to large datasets and cannot process extensive features that require significant storage.
Despite the privacy benefits of on-device AI, not every feature is sensitive; examples include e-commerce item suggestions, word embeddings from language models, or advertising-related signals. Training a small, purely on-device model is therefore not the best choice for every situation.
Split Learning as a Solution
Split Learning (SL) offers a solution to some of the challenges of on-device machine learning. A large model is split into two parts: the main part runs on the server, and the smaller part runs on the user device. This allows collaborative training using both private and public data while minimizing information transfer between the two sides.
During the prediction phase, the server runs its part of the model over its full feature set up to the point where the model is split, known as the cut layer. It then sends the compact cut-layer activations to the device, which uses its part of the model to complete the computation with its private features. The device model is typically simplified and uses fewer feature types due to the limitations of client-side hardware.
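To make the data flow concrete, below is a minimal sketch of split inference in PyTorch. The module names, layer sizes, and feature tensors are illustrative assumptions, not the architecture used in the paper.

    import torch
    import torch.nn as nn

    # Server-side sub-model: processes the server features up to the cut layer.
    # All sizes below are placeholders.
    server_model = nn.Sequential(nn.Linear(64, 32), nn.ReLU())

    # Client-side sub-model: combines the cut-layer activations with the
    # device's private features to produce the final prediction.
    client_model = nn.Sequential(nn.Linear(32 + 8, 16), nn.ReLU(), nn.Linear(16, 1))

    def split_inference(server_features, private_features):
        # 1. Server forward pass up to the cut layer.
        cut_activations = server_model(server_features)
        # 2. Only these activations are transmitted to the device
        #    (in a real deployment this step crosses the network).
        received = cut_activations.detach()
        # 3. Client forward pass, mixing in its private on-device features.
        client_input = torch.cat([received, private_features], dim=1)
        return torch.sigmoid(client_model(client_input))

    # Example call with random stand-in data.
    score = split_inference(torch.randn(1, 64), torch.randn(1, 8))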
SL generally involves two parties, but Federated Split Learning (FSL) expands this to include many user devices working with a central server. However, when training these models, gradients are exchanged, which can reveal private information.
Privacy Risks in Split Learning
In this discussion, we look into the risks of data leaks during SL training. An exhaustive reconstruction attack was developed to recover private information from gradients. It combines the information available at the cut layer, such as model parameters and gradients, to recover private features or labels.
The results show that gradients significantly enhance an attacker's ability to recover sensitive data: in the tests, labels and some private features could be reconstructed with near-perfect accuracy. However, adding noise to the gradients during training counteracts this risk, at the cost of only a minor drop in model performance.
Background on Related Work
SL allows for the training of deep learning models among several parties without sharing raw data. While Federated Learning can also be used, it may not work for all situations, especially in industries like e-commerce, where large and complex models are needed. These models can become too big to run on mobile devices, while sensitive client-side data may not be safe to store on a server.
In SL, the model is split: the server processes data up to the cut layer and sends its intermediate activations to the user device, whose model continues the forward pass with its private data. During back-propagation, gradients at the cut layer are computed and sent back to the server, and these gradients may still encode sensitive information.
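The sketch below, which reuses the same illustrative two-part model as before, shows where this gradient exchange happens during one training step; the optimizers and loss function are assumptions.

    import torch
    import torch.nn as nn

    server_model = nn.Sequential(nn.Linear(64, 32), nn.ReLU())
    client_model = nn.Sequential(nn.Linear(32 + 8, 16), nn.ReLU(), nn.Linear(16, 1))
    server_opt = torch.optim.SGD(server_model.parameters(), lr=0.01)
    client_opt = torch.optim.SGD(client_model.parameters(), lr=0.01)
    loss_fn = nn.BCEWithLogitsLoss()

    def split_training_step(server_features, private_features, labels):
        # Server forward pass; the cut-layer activations are sent to the client.
        cut = server_model(server_features)
        received = cut.detach().requires_grad_(True)  # what the client sees

        # Client forward and backward pass on its private features and labels.
        logits = client_model(torch.cat([received, private_features], dim=1))
        loss = loss_fn(logits, labels)
        client_opt.zero_grad()
        loss.backward()
        client_opt.step()

        # The gradient with respect to the cut-layer activations is sent back
        # to the server -- this is the signal an attacker can inspect.
        grad_at_cut = received.grad

        # Server continues back-propagation from the cut layer onwards.
        server_opt.zero_grad()
        cut.backward(grad_at_cut)
        server_opt.step()
        return loss.item()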
Several studies have focused on attacks aimed at revealing private data, including membership inference attacks and reconstruction attacks. The latter seeks to recover data points or other attributes using model access.
Mitigation Strategies
One well-known method of reducing the impact of these attacks is DP. DP adds calibrated random noise to the gradients, making it harder for attackers to extract private data. Its strength is quantified by a privacy budget: more noise gives a stronger privacy guarantee, but usually at some cost to model performance.
While Label DP focuses on protecting the labels used during training, traditional DP adds noise to the gradients themselves. Both methods can reduce information leakage but require careful tuning to balance privacy and performance.
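As an illustration, the sketch below shows a DP-SGD style protection of the cut-layer gradient (clip, then add Gaussian noise) and a simple randomized-response mechanism for Label DP. The clipping norm, noise multiplier, and flip probability are placeholder values, not the settings evaluated in the paper.

    import torch

    def dp_protect_cut_gradient(grad, clip_norm=1.0, noise_multiplier=0.5):
        # Clip each per-sample gradient to a maximum L2 norm, then add
        # Gaussian noise calibrated to that norm (DP-SGD style).
        norms = grad.norm(dim=1, keepdim=True).clamp(min=1e-12)
        clipped = grad * (clip_norm / norms).clamp(max=1.0)
        noise = torch.randn_like(clipped) * noise_multiplier * clip_norm
        return clipped + noise

    def randomize_label(label, num_classes=2, flip_prob=0.1):
        # Label DP via randomized response: with probability flip_prob,
        # replace the true label with a uniformly random class.
        if torch.rand(1).item() < flip_prob:
            return torch.randint(0, num_classes, label.shape)
        return label

    # In the training step sketched earlier, the client would send
    # dp_protect_cut_gradient(received.grad) instead of the raw gradient.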
Attack Methodology
In this study, an attack method known as EXACT (Exhaustive Attack for Split Learning) was developed to evaluate privacy risks. This approach assumes that the client holds private features that must not be shared. By inspecting the gradients exchanged between the client and the server, an attacker can reconstruct that sensitive data.
The attacker enumerates the possible configurations of private features and labels. For each sample, the adversary computes the gradient that every configuration would produce and selects the one that most closely matches the gradient actually observed.
Because this does not require any complex optimization of its own, relevant private features can be reconstructed efficiently: in tests, the method reconstructed a sample's features in 16.8 seconds on average.
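A rough sketch of this exhaustive gradient-matching idea is shown below: enumerate candidate combinations of private features and labels, recompute the gradient each would produce at the cut layer, and keep the closest match to the gradient actually observed. It assumes the attacker knows the client model parameters, consistent with using cut-layer model parameters as described above; the candidate lists, distance metric, and function names are assumptions for illustration, not the exact procedure from the paper.

    import itertools
    import torch

    def exhaustive_gradient_match(observed_grad, cut_activations, client_model,
                                  loss_fn, feature_candidates, label_candidates):
        # feature_candidates: tensors enumerating possible private-feature vectors
        # label_candidates:   tensors enumerating possible labels
        best, best_dist = None, float("inf")
        for feats, label in itertools.product(feature_candidates, label_candidates):
            received = cut_activations.detach().requires_grad_(True)
            logits = client_model(torch.cat([received, feats.unsqueeze(0)], dim=1))
            loss = loss_fn(logits, label.view(1, 1))
            grad = torch.autograd.grad(loss, received)[0]
            dist = (grad - observed_grad).norm()  # L2 distance to the real gradient
            if dist < best_dist:
                best, best_dist = (feats, label), dist
        return best  # reconstructed (private features, label) guess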
Experimental Setup
Experiments were conducted on three datasets: Adult Income, Bank Marketing, and Taobao ad display/click data. The Adult Income dataset is used to predict whether an individual's income exceeds $50K based on census data. The Bank Marketing dataset covers direct marketing campaigns run by a Portuguese bank. The Taobao dataset contains millions of ad-display and click interactions.
Different models and configurations were tested to evaluate how the attack's performance varied. This included comparing the results of normal SL training to scenarios that used DP or Label DP to see how effectively each mitigated attacks.
Results Overview
In an unmitigated setup, the results indicated that attackers were able to accurately reconstruct labels and many private features. However, when introducing DP, the attack performance dropped significantly, suggesting that adding noise to the gradients can effectively protect private information.
Label DP provided some protection, but it did not shield the private features as effectively as gradient DP. This highlights the importance of choosing mitigation measures that match the data being protected.
Conclusion
This analysis highlights the need to be aware of privacy risks in split learning. By examining how gradients can leak sensitive information, we can explore and implement measures to safeguard private data. Future work can expand on these findings by looking into other forms of data and different machine learning tasks, ensuring that privacy is maintained without compromising model performance.
Title: Evaluating Privacy Leakage in Split Learning
Abstract: Privacy-Preserving machine learning (PPML) can help us train and deploy models that utilize private information. In particular, on-device machine learning allows us to avoid sharing raw data with a third-party server during inference. On-device models are typically less accurate when compared to their server counterparts due to the fact that (1) they typically only rely on a small set of on-device features and (2) they need to be small enough to run efficiently on end-user devices. Split Learning (SL) is a promising approach that can overcome these limitations. In SL, a large machine learning model is divided into two parts, with the bigger part residing on the server side and a smaller part executing on-device, aiming to incorporate the private features. However, end-to-end training of such models requires exchanging gradients at the cut layer, which might encode private features or labels. In this paper, we provide insights into potential privacy risks associated with SL. Furthermore, we also investigate the effectiveness of various mitigation strategies. Our results indicate that the gradients significantly improve the attackers' effectiveness in all tested datasets reaching almost perfect reconstruction accuracy for some features. However, a small amount of differential privacy (DP) can effectively mitigate this risk without causing significant training degradation.
Authors: Xinchi Qiu, Ilias Leontiadis, Luca Melis, Alex Sablayrolles, Pierre Stock
Last Update: 2024-01-19
Language: English
Source URL: https://arxiv.org/abs/2305.12997
Source PDF: https://arxiv.org/pdf/2305.12997
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.