Table of Contents
- 1.0 Overview
- 2.0 Goals and Non-Goals
- 2.0.1 Goals:
- 2.0.2 Non-Goals:
- 3.0 Framework / Design:
- 3.0.1 Why DP being chosen?
- 3.0.2 The basic of DP
- 3.0.3 Privacy
- 3.0.4 Noise 𝜺 term
- 3.0.5 Other terminologies
- 3.0.6 DP-SGD (Differentially Private Stochastic Gradient Descent)
- 3.0.7 DP in FL settings: DP-FedAvg:
- 4.0 Experimental design
- Reference
- Results of Privacy-Accuracy trade-off
- Assumptions of the framework
- 2.0 Chosen data and framework for FL training
- 3.0 𝜺,𝜹 Differentially Private (DP) strategies
- 3.0 Privacy-accuracy trade-off
- 4.0 Understanding 𝜺
1.0 Overview
Federated Learning (FL) offers an alternative solution to traditional centralized machine learning (ML) approaches by allowing a server to learn a ML model across multiple decentralized clients without outsourcing their private data to a server. In each training epoch, the server calculates a global update of the model parameters upon receiving local updates from clients who get the local updates by using their data in a private setting. Though FL increases data privacy, as clients’ raw private data never leave their device, it is not completely immune to different kinds of adversarial attacks.
Adversarial attacks can be either from security standpoints or from privacy grounds. Regarding security, a single or multiple malicious clients can either force the model to stop from converging or force to predict wrongly by poisoning their own private dataset. Regarding privacy, client-supplied models or updates can leak certain attributes from the client’s private data (attribute inference attacks). Different privacy attacks, like membership attacks, malicious client(s) can learn whether a particular data point or sample data points (reconstruction attacks) has/have been used in the training process. We already addressed possible security attacks and defence mechanism against those kinds of attacks which have severe affects on the accuracy of the models belong to other benign clients. It is crucial to investigate the extent to which defence mechanism placed to combat against privacy attacks can maintain a overall acceptable accuracy of the global model.
Our distributed ML platform, OctaiPipe, combines FL, automated ML and fault tolerant Edge MLOps to automate the entire ML lifecycle on device or connected machine in a secure and privacy preserving way. Hence it is extremely important for us to tackle both security and privacy attacks regarding FL in order to deliver a platform that is more private and secure.
In this design template, we are describing a framework for accessing privacy accuracy trade-off in FL, starting from the goal to testing procedures and open questions.
2.0 Goals and Non-Goals
2.0.1 Goals:
- Developing FL frameworks/sandbox with different defence mechanisms against privacy attacks and see how that affects the accuracy of the global model or local models belong to other benign clients.
- We will not be using FL implemented OctaiPipe platform to test the attacked FL framework and possible solutions, rather will use an independent python environment to test the framework.
- We will couple the defence mechanisms regarding privacy and security attacks together in order to see the combined affects on the overall accuracy.
- The future goal is to implement the counterattack measurement to the OctaiPipe platform to make it more secure and private.
2.0.2 Non-Goals:
- Unlike the security attacks and mitigation analysis, we will not conduct any kind of privacy attacks, rather develop the countermeasures and access how the privacy-accuracy balance pays off.
- The FL architecture will be strictly a server-client based architecture which means we will not be considering a peer-to-peer FL or any other kind of hierarchical architecture.
- The coding language will be Python and we will be using Flower package for FL. No other coding language or FL packages will be considered.
3.0 Framework / Design:
As explicitly stated in the goals section we will not conduct any privacy attacks rather will build a defence mechanism against privacy attacks and its effect on the model accuracy. We have chosen ‘Differential Privacy (DP)’ as a defence mechanism against privacy attacks.
3.0.1 Why did we choose differential privacy (DP)?
- Some of the initial approaches used a variety of approaches like suppression, aggregation, and transformation to anonymize the data to remove identifying information from the data source. These cryptographic ciphers can obfuscate the original source, but a dedicated attacker would be able to infer or extrapolate more information.
- Others like removal of certain columns or attributes or aggregation of certain row attributes or mixing the row attributes from multiple rows so that no single row has its original values, but the overall distribution of values is unchanged. These do not guarantee privacy as release of any data has a potential for a re-identification attack or information gain.
- DP has shifted focus from ‘how to anonymize data’ to ‘How can we measure the loss of privacy when the data is released? ’ or ‘How much privacy loss am I willing to create in order to answer the question? ’ DP defines a limit or bounds on the amount of privacy loss one can have while releasing information. Compared to other methods, it focuses on the process rather than the results.
3.0.2 The basic of DP
DP incorporates random noise into the mix so that everything an adversary receives becomes noisy and imprecise and so it is much more difficult to breach privacy.
Let two datasets 𝐷1 and 𝐷2 with one row difference between them and 𝐷1 and 𝐷2 be called adjacent dataset (this is called example-level DP). One uses an algorithm (𝑀) to release the query which can be ensured satisfying the following definition:
𝑃[𝑀(𝐷1) ∈𝑆] ≤exp(𝜀) ×𝑃[𝑀(𝐷2) ∈𝑆]
The probability 𝑃 that an attacker can determine based on the response 𝑀(𝐷1) or 𝑀(𝐷2) from the expected database, 𝑆 , if they are interacting with the first or second database is limited by a small value exp(𝜀). The value exp(𝜀) represents the amount of information gained by a dedicated adversary investigating the algorithm in a worst-case scenario. The goal of 𝑀 is to keep the accuracy as high as possible but to still provide as much privacy as possible by setting an upper bound giving individual privacy using 𝜀.
3.0.3 Privacy Loss:
The attacker will try to guess whether the database is either 𝐷1 and 𝐷2 depends on the information gather from 𝑀(𝐷1)=𝑂 where 𝑂 is the actual response the attacker sees. We would like to quantify how much information an attacker can gain and want to keep it in a probabilistic range. From Bayes theorem, 𝑃(𝐷=𝐷1|𝑀(𝐷1)=𝑂) lies within a range which is a function of exp(𝜀).
By varying 𝜀 a best-case and worst-case scenario can be drawn which will shed light on how much more a certain attack with a given prior information can get from a query which uses DP. It is often recommended to choose a small 𝜀, close to 1 or even less than 1.
Implementing those boundaries can be done by adding calibrated noise adding to the query which will guarantee information leakage within the range.
3.0.4 Noise 𝜺 term
𝜖 can be drawn either from Laplace distribution or Gaussian distribution.
The difference between these two distributions is the peak of Laplace distribution is sharper than Gaussian meaning a greater number of samples around zero is more than Gaussian. But it is obvious that the variance of residual is minimized if Gaussian prior is assumed.
The previous DP equation has considered Laplace distribution. Gaussian distribution allows a relaxation of the probability bounds by adding a small delta to the original definition.
𝑃[𝑀(𝐷1) ∈𝑆] ≤exp(𝜀) ×𝑃[𝑀(𝐷2) ∈𝑆]+ 𝛿
Where 𝛿 = margin of error means that ,𝛿 allows an implementation to, essentially, release all information about a small number of rows with no noise added, it that fits within the delta.
A good choice of delta is somewhere between 1/n, n = number of rows and cryptographically small (2−30≈9 × 10−10). A smaller delta values adds more guarantees.
A Gaussian mechanism is ( 𝜀,𝛿) differentially private if it follows the criteria and 𝜀<1.
𝐹(𝑥)=𝑓(𝑥)+𝑁( 𝜎2) where 𝜎2= 2𝑠2 log( 1.25𝛿)𝜀2⁄
𝑁( 𝜎2) samples a Gaussian distribution centred at 0 with variance 𝜎2 and 𝑠 is sensitivity.
3.0.5 Other terminologies
- Sensitivity: Measures the maximum change in the query result based on the change in the underlying dataset.
- Privacy budget: With each new query, additional information about the sensitive data is released. To ensure a meaningful privacy guarantee, data curators can enforce a maximum privacy loss. The maximum privacy loss is called the privacy budget. If the number of queries exceeds the threshold, then the privacy guarantee becomes too weak the curator stops answering queries.
- Privacy accounting: Each query as a privacy expense which incurs an incremental privacy loss. The strategy of using budgets, expenses and losses is fittingly known as privacy accounting.
3.0.6 DP-SGD (Differentially Private Stochastic Gradient Descent)
- The difficulty in applying DP is determining what the sensitivity of a given training tensor is. Hence DP-SGD uses the gradient itself as a measurement of a single person’s contribution. This comes with the assumption that each person is seen only once.
- It mainly consists of two steps: gradient clipping and adding noise for DP.
- Gradient clipping: At each step of SGD, we compute the gradient for a loss function for a random subset of examples. Due to presence of unbounded gradient in SGD, one must clamp the amount that any one same can contribute to control the amount of privacy leakage from the unbounded gradients. One uses 𝐿2 norm for clipping purposes.
- Adding noise: Once the per-sample gradient is clipped and scaled, combine those gradients back to the batch update and add noise, appropriately scaled on the bounded gradients, to protect privacy.
- Privacy accounting: ‘accountant’ procedure that computes the privacy cost at each access to the training data and accumulates this cost as the training progress.
- Privacy Loss: DP-SGD gives better, tighter bound on privacy loss by tracking detailed information (higher moments) of the privacy loss. For MNIST dataset with epoch 1000 with Gaussian variance 𝜎=4,𝛿= 10−5 , one can has 𝜀≈1.26 using DP-SGD method compared to 𝜀≈9.34 using strong composition theorem.
3.0.7 DP in FL settings: DP-FedAvg:
Next question: How DP-SGD can be implemented in federated settings?
- User-level DP: When users can contribute multiple examples to the training dataset, an example-level DP (depends on a single training example) is not necessarily strong enough to ensure the users’ data is not memorized. User-level DP which requires that the output distribution of models does not change even if we add/remove all the training examples from any one user (or all the examples from any one device in our application). As FL summarizes all a user’s training data as a single model update, federated algorithms are well-suited to offering user-level DP guarantees.
- Amplification-via-sampling: Means amplifying the privacy guarantee by leveraging the randomness in the sampling training examples.
- Flow-chart of DP-FedAvg: o Using random-sized batches where one selects users independently with probability q, rather than always selecting a fixed number of users.
- Enforcing clipping of per-user updates so that the total update has bounded 𝐿2 norm.
- Using different estimators for the average update to control sensitivity.
- Adding Gaussian noise 𝑁(0,𝜎2) to the final average update where 𝜎=𝑧.𝑆 where 𝑧 is a parameter and 𝑆 is the sensitivity of the query. Varying 𝑧 and 𝛿>0, the accountant gives an 𝜀 for which this mechanism satisfies ( 𝜀,𝛿) differentially private.
- Note: DP-FedAvg has been deployed in practice by Google in 2020 to mobile device and this algorithm has been implemented assuming a dataset with a sufficiently large number of users and claimed to be easily met by even small internet-scale dataset.
4.0 Experimental design
- The FL architecture design is strictly server-client based. No hierarchical structure has been considered.
- Baseline: We will distribute the chosen dataset (MNIST) into 5 to 7 different clients and there are no malicious clients or server directly conducting privacy attacks. We will get the overall accuracy of the model.
- Privacy implementing scenario 1:
- Implement DP-FedAvg where the server is adding noise on the aggregated parameters to incorporate privacy and varying different parameters (𝑧,𝑆,𝛿) to estimate how much privacy (𝜀) we can guarantee. Also keeping track of privacy accounting in the overall training time and its relationship with accuracy. In each iteration, either we can get local parameter updates from all the clients or a selected few to see how privacy-accuracy trade-off.
- Privacy implementing scenario 2:
- There are different ways to add noise for DP purposes. In the above-described method, noise has been added at the server side. Users should have flexibility to set up the training such that each client independently adds a small amount of noise to the clipped update.
- To be precise, if we let 𝑚 be the number of clients sampled each round and 𝜎 be the scale of the total Gaussian noise that needs to be added to the sum of the model updates, we can use simple maths to show that this is equivalent to each client adding noise with scale 𝜎/√𝑚.
- We will get clients flexibility to add noise and access the privacy/accuracy trade-off in that architecture.
- Security–privacy-accuracy scenario 3:
- During security defence mitigation, we have considered other aggregation methods apart from FedAvg, such as FedMedian and FedTrimmedAvg and some of them with proper hyperparameter settings perform better than FedAvg.
- We can let the clients add noises in their local updates and see how the other aggregation methods behave in order of privacy-accuracy tradeoff scenario either in presence of security attacks or not.
REFERENCE:
1) https://doi.org/10.1016/j.engappai.2021.104468
2) https://doi.org/10.1016/j.dcan.2021.11.006
3) Practical Data Privacy-Katharine Jarmul
4) DP-definition: https://medium.com/@shaistha24/differential-privacy-definition-bbd638106242#:~:text=(%CE%B5%2C%20%CE%B4)%2Ddifferential%20privacy%20says%20that%20for%20every%20pair,when%20the%20database%20is%20y.
5) DP-definition: https://towardsdatascience.com/understanding-differential-privacy-85ce191e198a
6) DP: https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf
7) Privacy-accuracy trade-off: https://medium.com/multiple-views-visualization-research-explained/visualizing-the-accuracy-privacy-trade-off-to-improve-budget-decisions-with-differential-privacy-66fc3efb34a
8) Google privacy with DP: https://ai.googleblog.com/2022/02/federated-learning-with-formal.html
9) DP-SGD: https://arxiv.org/abs/1607.00133
10) FedSGD and FedAvg: https://shreyansh26.github.io/post/2021-12-18_federated_optimization_fedavg/
11) DP-FedAvg: https://arxiv.org/abs/1710.06963
12) DP-FedAvg_clipped: http://www.omthakkar.com/papers/DPLAC.pdf
13) DP-FTFL: https://arxiv.org/abs/2103.00039
14) DP-FedAvg from flwr: https://flower.dev/docs/differential-privacy-wrappers.html#andrew
15) DP-engineering from Meta: https://engineering.fb.com/2022/06/14/production-engineering/federated-learning-differential-privacy/
16) https://arxiv.org/abs/2206.00807
17) DP with Flwr and Opacus: https://towardsdatascience.com/differentially-private-federated-learning-with-flower-and-opacus-e14fb0d2d229
18) RDP DP : https://arxiv.org/pdf/1702.07476.pdf
RESULTS OF PRIVACY-ACCURACY TRADE-OFF
ASSUMPTIONS OF THE FRAMEWORK
- The FL architecture design is strictly server-client based. No hierarchical structure has been considered. The server is honest which means it is not trying to extract clients’ private and sensitive information (privacy attack) neither it is trying to manipulate the global model ‘s parameters in order to change the accuracy of the model or predict wrongly.
- Clients are not malicious meaning not actively trying to attack other clients’ server in order to conduct a privacy attack. There are no malicious client(s) who is trying to extract private and sensitive information from other clients or from the server.
- Server is providing an ( 𝜀,𝛿) differentially private FL setting to the clients in order to provide more privacy while compromising accuracy of the model. In this report, we will dive deep into this privacy-accuracy trade off under this above assumption of the server and clients.
2.0 CHOSEN DATA AND FRAMEWORK FOR FL TRAINING
1. Data:
- Our experiments use the MNIST handwritten digits data set with number of labels is 10, digit 0 to digit 9. This is a classification problem, means the global model, after training, will predict the label correctly from the digit image.
- We have used MNIST dataset in the adversarial attack and mitigation project and in future we want to explore security-privacy-accuracy trade off. Hence, we have chosen the same data set for privacy-accuracy project as well.
2. Model and Framework:
- We have used the exact same model for the classification problem.
- As we have explored PyTorch framework for the adversarial framework project. PyTorch has DP library Opacus which handles differentially private mechanism with selected models.
- In our project, we will use Opacus library as DP in FL settings to incorporate privacy mechanism in FL.
3.0 ( 𝜺,𝜹) DIFFERENTIALLY PRIVATE (DP) STRATEGIES
As mentioned already we have used Opacus library for DP settings in FL. Opacus uses PrivacyEngine to incorporate basic DP strategies like the followings:
1. Implement gradient clipping and ad Gaussian noise 𝑁(0,𝜎2).
2. Incorporate Poisson sampling into the training process in order to add a certain probability to each sample from the dataset.
Now the noise term 𝜎 is a function of 𝜀,𝛿. In PrivacyEngine, there are two sub libraries:
- PrivacyEngine without fixing final 𝜀 : sub library make_private with input 𝜎. The engine itself finds the final 𝜀 value using 𝜎. In each training session, 𝜀 will increase and the final value will depend how many epoch each training session has run.
- PrivacyEngine with fixing final 𝜀 : sub library make_private_with_epsilon with inputs like target_epsilon and target_delta. This has more control about the final 𝜀 values or the privacy term by fixing it in the beginning.
Nomenclatures:
- Epoch: No of times each chosen client is running its local training on its local private data.
- NRuns: No of time server is asking local model parameters from the chosen clients and passing the aggregated global model parameters to the clients.
Fig 1: (Top) Loss centralized function (bottom) Accuracy centralized means loss or accuracy functions calculated at the server’s side using the global aggregated model parameters on the test data set. Due to the DP induced noise term being added at the model parameters, both the loss and accuracy functions are very different from the loss and accuracy of the global model without the DP. With the noise term 𝜎=1.2, 𝜀 is reaching different final values depending on the no of Epochs as expected. The difference between the accuracy curves with final 𝜀=1.28 and 𝜀=1.50 are small but still significant with the small changes in 𝜀 values.
Take home points:
- By varying the 𝜎 parameter, we can change the 𝜀 values but we have sufficiently less control over the evolution of privacy term 𝜀. In that case, we can increase the number of epochs to increase the value of 𝜀 in order to improve the accuracy.
- As our data set is MNIST dataset, distributed equally among clients following proper i.i.d, the differences between two scenarios are less prominent. In future, we want to incorporate the above analysis in real life noisy dataset to see how different the two scenarios can be.
3.0 Privacy-accuracy trade-off
Next, we will focus on the fixed 𝜀 situation only to see the effect of changing 𝜀 on the overall accuracy of the model. Increasing 𝜀 means adding lesser and lesser noise to the model which in turn will increase the overall accuracy of the model but will offer less privacy. We have chosen 𝜀 values to be 0.5 (very noisy local model parameter), 1.5 ( 𝜀≈1 generally considered to give highest privacy in the data set), 5 and 10 (way less noise added). We have explored the loss function and the accuracy, both centralized functions to see how those two are getting affected with the varying 𝜀 values.
Fig 2. (Top) Evolution of Loss centralized (bottom) Accuracy centralized with varying 𝜀 values. For 𝜀=0.5, due to the presence of high noise term, both accuracy and loss function are severely affected. Surprisingly, the accuracy and loss functions are not improving with marginal differences for cases 𝜀=5.0 and 𝜀=10.0 from the scenario when 𝜀=1.5.
Take home points:
- As the accuracy and loss functions at scenario 𝜀=1.5 is not significantly lesser than the scenarios 𝜀=5.0 and 𝜀=10.0, it is logical to fixed 𝜀 at 1.5 values as it gives the maximum privacy guarantees.
- In future, we want to explore this situation further with real life noisy data set to see how accuracy and loss functions at scenarios 𝜀=1.5 performing with comparison with scenarios with much high 𝜀 values.
4.0 Understanding 𝜺
In this section, we want to explore how the noise term 𝜀 is evolving with the global epoch. In the privacy engine set up, we have not initialized the starting process of 𝜀 rather let the engine decide. In future, we will explore further in this direction whether it is possible to initialize the DP process and if possible, how it is different than the current scenarios and whether the initialization will add any benefit in improving either accuracy or privacy.
Fig 3. (Top) The evolution of 𝜀 over the NRuns. In each epoch, the 𝜀 starts from the same initial values and ends up with the same final values for a fixed no of epochs. This is independent of the two scenarios of starting DP with and without fixed 𝜀. (Middle) Just proving the above with further changing the number of epochs. With increasing epoch, the 𝜀 does not change shape (this is the case of privacy engine without fixed 𝜀 values. (Bottom) Trajectories remains the same only the initial starting point of 𝜀 are changing to reach the final 𝜀 values as the no of epochs are same.
Take home points:
- PrivacyEngine with fixed 𝜀 and fixed number of epochs must calculate the initial 𝜀 values as it must reach the final value after a certain time. It is better to increase the number of epochs in order to give the optimizer enough steps to reach to the global optimum. So far, we have fixed the learning rate of the optimizer, for more complex system, we can let the learning rate to be high in the beginning of the epochs and later goes to a smaller value while reaching the optimum.
- Opacus Privacy engine return a dataloader after passing the original PyTorch dataloader. Opacus dataloader has variable batch size as it must add a Poisson probability to attach each dataset. Hence, each time, Flwr calls a certain client, Privacy engine sees new dataset and re-starts the 𝜀 value for the same client. In future, we want to explore whether we can initialize the Opacus optimizer in such a way so that it does not re-start the 𝜀 values in each global run rather start from the last run’s end 𝜀 value. We want to see how this new method, if possible, affects the accuracy and privacy trade-off.
- As our dataset MNIST is equally divided into clients in an i.i.d fashion, variation of 𝜀 in each client are exactly like each other.
Future goals:
- We want to explore the privacy-accuracy trade off in relation to real life noisy data set with large number of clients and a fraction of the clients are being called in each global run.
- So far, we have set the same target 𝜀 and noise term for each client. In future, we want to give the clients privacy to set their own privacy target meaning limit of the DP parameters. We want to simulate that situation and see how that affect the global loss and accuracy functions.
- DP is a complicated system with a significantly large number of highly correlated parameters. In future, we want to increase our understanding of the important relationship between the key parameters in order to provide our customer best privacy while maintaining an acceptable high level of accuracy.