Table of Contents
- Explainability in Edge MLOps
- Abstract
- 1. Introduction
- 2. Experimental setup
- Computational resources
- Federated learning
- Datasets
- Models
- Model explainability methods
- 3. Procedure
- 4. Main results
- Model performance: FL vs. central training
- Time consumption
- Distribution of background datasets
- Local explanations
- Global explanations
- Feature correlations
- 5. Conclusion and Outlook
- References
Explainability in Edge MLOps
Abstract
We implement a prototype explainable AI framework for edge/IoT devices in the horizontal federated learning setting, in which computational efficiency is a major consideration. Within explainable AI, we focus on machine learning model explainability: explaining both a model's prediction on a given data instance (local explainability) and the model's overall behaviour (global explainability). We use LIME and KernelSHAP explainers for these purposes. Model explainers often require fitting with a background dataset. It has been pointed out in the literature that, due to the localised nature of data in federated learning, explainers on different FL clients generally produce different explanations that may exhibit inconsistencies. We investigate this possibility in simulated federated learning scenarios, and report our findings. Finally, we propose future directions to be undertaken on the explainable edge MLOps front.
1. Introduction
As the use of artificial intelligence (AI) becomes ubiquitous in every aspect of human society, being able to explain an AI solution, under the 'explainable AI' (XAI) paradigm, has become increasingly important. The ability to explain every component of an AI solution elevates its trustworthiness, as it allows for cross-checking its decision-making logic against human intuition and experience, and for uncovering possible biases (gender, demographics, etc.) in the solution. Explainability of machine learning/deep learning models ('models' henceforth) is especially important, as they are often the most obscure component of a machine learning solution. This is in contrast to some other components such as feature engineering, which are typically established by human-driven exploratory data analysis and readily explainable, unless they are also automated, e.g. by an AutoML pipeline.
Machine learning/deep learning models are often perceived as 'black boxes': they somehow produce predictions, often with high performance, from the input features, but their inner workings are often opaque. The higher the complexity of the model, the less manifest the model's general behaviour (global explainability) and the way the prediction on a particular data point depends on the features (local explainability). Examples of 'black-box' models that are not intrinsically explainable include deep neural network models, as well as some machine learning models such as support-vector machines. In contrast, 'white-box' models, such as linear regression (explainable by the weights and biases) or decision trees (by impurity measures), are intrinsically explainable.
In the framework of federated learning (FL) over edge/IoT devices, there are multiple points/entities where explainability may be required. In the horizontal FL setting, explainability may be required on the FL server or on the FL clients. For example, consider the FL clients being industrial machines from different (possibly rival) manufacturing companies, federated with an FL server set up by the manufacturer of those machines. The manufacturer (server) may want to explain the model in order to evaluate the overall behaviour of their machines across the companies, while each manufacturing company (client) may want to explain how the model arrived at a particular prediction. In the vertical FL setting, the 'guests' (e.g. banks) and the 'host' (e.g. a clearing house) may want explainability for different reasons. Explainability in FL also applies to components such as client selection, e.g. explaining why the FL algorithm in use consistently excludes certain FL clients from the federation.
There are well-established model explainability methods ('explainers' henceforth) in the literature that provide explanations by assigning feature importances to a model (global) or to a particular prediction (local). These methods can be model-specific (e.g. applicable to tree-based models only) or model-agnostic, e.g. LIME (Local Interpretable Model-Agnostic Explanations) and KernelSHAP. Model-agnostic explainers often require fitting with a background dataset, and some make simplifying assumptions about the data, such as feature independence or the absence of feature interactions in relation to the target variables. Some explainers are available as part of a cloud computing solution, e.g. the Responsible AI Dashboard in Azure Machine Learning. In the context of edge/IoT computing, extra considerations must be made. Importantly, edge devices are often equipped with limited computational resources, which necessitates a careful choice of efficient model explainers and their implementation, often with a trade-off against the accuracy of the explanations.
In passing, we mention that it is possible to abuse model explainability methods to launch adversarial attacks; see references.
In federated learning, there are additional subtleties to model explainability, mainly due to the localisation of data on the FL clients and the possibly non-independent-and-identically-distributed (non-i.i.d.) nature of the local data. More concretely, in horizontal FL, where clients hold data of the same feature space, the variations among the datasets on the FL clients used to fit local model explainers may lead to the explainers producing different, or even inconsistent, results. In vertical FL, each entity only possesses a subset of the feature space, making the use of conventional explainability techniques challenging. In either case, explainability in FL is less discussed in the literature (to our knowledge) compared to other components of Trustworthy AI such as differential privacy.
In this investigation, we implement a prototype explainable AI framework for edge/IoT devices, confining our scope to model explainability, in the horizontal federated learning setting. Our main objectives in this investigation are:
- a) Implement a selection of efficient global and local model-agnostic explainers for tabular datasets,
- b) Evaluate local and global model explanations in horizontal FL, explaining the global aggregated model (‘FL model’ henceforth), and in the centralised training setting as a control,
- c) Compare model explanations obtained from various points in the FL (server, clients and the aggregate thereof), and compare against control results in the centralised training setting, to find if inconsistencies exist as alluded to in the literature.
In Sections 2 and 3, we specify respectively the experimental setup and procedure. In Section 4, we report and discuss the main results. We conclude in Section 5 with an outlook on future directions to be taken.
2. Experimental setup
In this section, we describe the experimental setup; it is illustrated in Figure 5 at the end of this document.
Computational resources
Our experiments are run on a 'Standard D8s v3' virtual machine on Azure, with 8 vCPUs and 32 GiB of RAM.
Federated learning
We use Flower to simulate horizontal federated learning with 5 clients (and 1 server). In each FL round, each local model (see below) is trained for 1 epoch; 5 FL rounds are run. A minimal sketch of such a simulation is given below.
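The following is a minimal sketch of the simulation, assuming the flwr (~1.x) simulation API and a Keras model; `build_model` and `load_client_data` are hypothetical helpers standing in for the model and data-loading code described in the following subsections, not the exact code used in our experiments.

```python
# Sketch of the horizontal FL simulation, assuming the flwr (~1.x) simulation API
# and a Keras model. build_model() and load_client_data() are hypothetical helpers.
import flwr as fl

NUM_CLIENTS, NUM_ROUNDS, LOCAL_EPOCHS = 5, 5, 1

class TabularClient(fl.client.NumPyClient):
    def __init__(self, cid: str):
        self.model = build_model()                                # hypothetical model factory
        (self.x_train, self.y_train,
         self.x_val, self.y_val) = load_client_data(cid)          # hypothetical per-client data loader

    def get_parameters(self, config):
        return self.model.get_weights()

    def fit(self, parameters, config):
        # Receive the global weights, train locally for one epoch, return updated weights
        self.model.set_weights(parameters)
        self.model.fit(self.x_train, self.y_train, epochs=LOCAL_EPOCHS, verbose=0)
        return self.model.get_weights(), len(self.x_train), {}

    def evaluate(self, parameters, config):
        self.model.set_weights(parameters)
        loss, acc = self.model.evaluate(self.x_val, self.y_val, verbose=0)
        return loss, len(self.x_val), {"accuracy": acc}

history = fl.simulation.start_simulation(
    client_fn=lambda cid: TabularClient(cid),
    num_clients=NUM_CLIENTS,
    config=fl.server.ServerConfig(num_rounds=NUM_ROUNDS),
    strategy=fl.server.strategy.FedAvg(),                         # standard FedAvg aggregation
)
```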
Datasets
We use scikit-learn's synthetic tabular (classification and regression) datasets. For each dataset, there are 20 features in total, with 2 informative features (column names '0' and '1'). For the classification dataset, there are also 2 redundant features, i.e. linear combinations of the informative features (column names '2' and '3'). The remaining feature columns are white noise.
Each FL client has a training set of 9,000 samples and a validation set of 1,000 samples. The FL server has a held-out test set of 1,000 samples. During central training (control), the training sets of the clients are combined into one training set, and similarly for the validation sets. The same test set as on the FL server is used.
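As a concrete illustration, a sketch of how such a dataset could be generated and partitioned with scikit-learn is shown below (classification case); the random seeds and the exact splitting logic are illustrative assumptions.

```python
# Sketch of dataset generation and partitioning. Dataset parameters mirror the text;
# random_state values and the splitting logic are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

N_CLIENTS = 5
X, y = make_classification(
    n_samples=51_000, n_features=20,
    n_informative=2, n_redundant=2,    # columns '0'/'1' informative, '2'/'3' redundant
    random_state=42,
)

# Hold out a 1,000-sample test set for the FL server (also used by the central control)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=1_000, random_state=42)

# 10,000 samples per client, each split into 9,000 training / 1,000 validation samples
client_splits = [
    train_test_split(Xc, yc, test_size=1_000, random_state=42)
    for Xc, yc in zip(np.array_split(X_rest, N_CLIENTS), np.array_split(y_rest, N_CLIENTS))
]
```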
Models
We consider the following model types readily supported in Flower:
- scikit-learn’s logistic regression (for classification)
- a dense neural network model. TensorFlow is used (author's personal preference); a sketch of this model appears after the list:
  - Three blocks of [dense layer (64 units with tanh activation) + batch normalisation + dropout (rate = 0.2)] are used, followed by
  - An output layer (sigmoid activation is used for classification).
  - Adam is used as the optimisation algorithm, with a learning rate of 0.0003. The loss functions for classification and regression are respectively 'binary_crossentropy' and 'mean_squared_error'.
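The dense network described above could be built along the following lines; this is a sketch under the stated hyperparameters, and details not given in the text (e.g. the metric choices) are assumptions.

```python
# Sketch of the dense network described above (TensorFlow/Keras). Architecture and
# hyperparameters follow the text; everything else (e.g. metrics) is an assumption.
import tensorflow as tf

def build_model(n_features: int = 20, task: str = "classification") -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(n_features,))
    x = inputs
    for _ in range(3):                                     # three [Dense -> BatchNorm -> Dropout] blocks
        x = tf.keras.layers.Dense(64, activation="tanh")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Dropout(0.2)(x)
    if task == "classification":
        outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
        loss, metrics = "binary_crossentropy", ["accuracy"]
    else:
        outputs = tf.keras.layers.Dense(1)(x)              # linear output for regression
        loss, metrics = "mean_squared_error", ["mse"]
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
                  loss=loss, metrics=metrics)
    return model
```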
Model explainability methods
The model-agnostic explainers LIME (for local explainability) and KernelSHAP (for both local and global explainability) are considered. For a given data instance, LIME computes feature weights contributing to the model prediction, whereas KernelSHAP computes the SHAP values (estimates of the Shapley values), which quantify the contribution of each feature to the model prediction compared to the average prediction for the dataset. Global feature importance is obtained from KernelSHAP by explaining a background dataset and taking the mean-squared value over the instances. Note that both LIME and KernelSHAP assume feature independence, which is violated for our classification dataset (it has two redundant features). There are explainers that make no such assumption, e.g. TreeSHAP and the Linear SHAP explainer, but they are model-specific.
We fit the explainers separately on the FL server, on the FL clients, and in the centralised training case. In FL, the FL model is explained but not the local models. Both kinds of explainers require fitting with a background dataset. On the FL server, the background dataset used is the held-out test set. On each FL client, the validation set is used. For the central training case, the validation sets and the test set are used separately ('central_1' and 'central_2' in what follows). In all cases, the background dataset is resampled to size 100 to improve computational efficiency, with edge computing applications in mind. For KernelSHAP, each explainer is fitted with the median value of the respective background dataset to further improve computational efficiency.
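A sketch of this KernelSHAP setup is given below, assuming the shap package and a Keras-style `model`; the resampling details are assumptions, and `X_val` stands for whichever dataset supplies the background (test set on the server, validation set on a client).

```python
# Sketch of fitting KernelSHAP with a single median background point and deriving a
# global summary. Assumes the shap package; `model` and X_val are stand-ins for the
# objects produced in the steps above, and the resampling details are illustrative.
import numpy as np
import shap

rng = np.random.default_rng(0)
background = X_val[rng.choice(len(X_val), size=100, replace=False)]   # resampled background (size 100)
median_ref = np.median(background, axis=0, keepdims=True)             # single median reference point

explainer = shap.KernelExplainer(
    lambda data: model.predict(data).flatten(),                       # model prediction function
    median_ref,                                                       # background used to fit the explainer
)

# SHAP values for the background instances; one common global summary is the
# mean absolute SHAP value per feature (as reported in the results section)
shap_values = explainer.shap_values(background)
global_importance = np.abs(shap_values).mean(axis=0)
```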
In the local explainability experiment, we pick two instances from the held-out test set and have each local explainer produce feature importances for each model prediction. In the global explainability experiment, each (KernelSHAP) explainer explains its respective background dataset.
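For the local explanations, a LIME sketch along the same lines is shown below; the wrapper exposing class probabilities to the explainer is an assumption about how the sigmoid output is handled, and `background`, `model` and `X_test` follow from the steps above.

```python
# Sketch of a local LIME explanation for one test instance (classification case).
# Assumes the lime package; the predict_proba wrapper is an illustrative assumption.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

lime_explainer = LimeTabularExplainer(
    training_data=background,                          # background dataset used to fit the explainer
    feature_names=[str(i) for i in range(20)],
    mode="classification",
)

def predict_proba(data):
    # LIME expects class probabilities; wrap the single sigmoid output into two columns
    p = model.predict(data).flatten()
    return np.column_stack([1.0 - p, p])

explanation = lime_explainer.explain_instance(X_test[0], predict_proba, num_features=20)
print(explanation.as_list())                           # (feature, weight) pairs for this prediction
```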
3. Procedure
- Prepare the dataset. For each dataset:
  - A test dataset is held out, as a test set for the server and for the centralised training (control)
  - For the remaining data:
    - Partition the data for the FL clients, each of which will perform a further train-validation split
    - For the centralised training (control), the training set is obtained by combining the training data from the FL clients, and similarly for the validation sets
- For each model type considered:
  - Train an aggregated FL global model ('FL model') using Flower
  - Train a centralised model (control)
  - Evaluate the model's scores on the held-out test set; for a fair comparison, the FL model and the central model should have similar performance
  - Benchmark local feature importance on the first two rows of the test dataset; all explainers explain the same two rows of data for a fair comparison:
    - Run the local explainer on the FL model, separately with the background datasets on the FL server and on the FL clients. Explain those rows of data
    - Control: run the local explainer on the centralised model. Explain those rows of data
    - Compare the results
  - Benchmark global feature importance:
    - Run the global explainer on the FL model, separately with the background datasets on the FL server and on the FL clients, explaining the background datasets
    - Aggregate the global explanation results from the FL clients (a sketch of one possible aggregation follows this list)
    - Control: run the global explainer on the centralised model, with the validation set, and separately with the held-out test dataset
    - Compare the results
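As a concrete illustration of the aggregation step above, one simple rule is an unweighted per-feature mean across clients, matching the 'client mean' shown in the figures of Section 4; the exact aggregation rule is an assumption here, not a specification of our implementation.

```python
# Sketch of aggregating per-client global feature importances (assumption: an
# unweighted per-feature mean across clients, consistent with the 'client mean'
# shown in the result figures).
import numpy as np

def aggregate_global_importance(client_importances):
    """client_importances: list of arrays of shape (n_features,), one per FL client."""
    return np.stack(client_importances).mean(axis=0)

# imp_client_0 ... imp_client_4 are hypothetical per-client global importance vectors
client_mean_importance = aggregate_global_importance(
    [imp_client_0, imp_client_1, imp_client_2, imp_client_3, imp_client_4]
)
```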
4. Main results
The results from the experiments are similar across model types and the ML problem (classification vs. regression). In what follows, we report the general observations and include example plots for illustration.
Model performance: FL vs. central training
For the classification problem, for both model types, the FL model and the central model achieve similar test accuracy and F1-scores: accuracy is at least 0.851, and the score differences are less than 2%. For regression, the FL and central models have mean squared errors of 3.4 and 2.9 respectively, i.e. a difference of about 18%. These similar scores allow for the fair comparisons discussed in what follows.
Time consumption
Model explanations were computed in a Jupyter notebook on the virtual machine. We find that it takes more time to fit an explainer on the dense neural networks than on logistic regression, likely due to the higher model complexity. For each local explanation, it takes less than 2 seconds for the explainer to fit and explain. A global explanation with KernelSHAP, obtained by fitting with the median value of a background set and explaining the same set, takes approximately 30 seconds. Note that, had we fitted KernelSHAP with the whole background set, it would have taken hours to produce a global explanation.
Distribution of background datasets
The background datasets used to fit the explainers are of relatively small size (100) compared to the total dataset size (51,000). As a result, the background datasets follow varied distributions; an example is plotted in Figure 1 for the classification problem.
Figure 1. Histogram of background datasets: informative feature columns ‘0’ (left) and ‘1’ (right) in the classification problem.
We expect such variation may lead to different, if not inconsistent, explanation results, as pointed out in the literature.
Local explanations
From the explanations on the first two rows from the test dataset, we arrive at the following main observations.
- The feature importance (SHAP values and LIME feature weights) generally agrees with expectation: the informative and redundant features have the highest importance in magnitude. White-noise features generally have small importance, with a few exceptions.
For example, in the classification problem, the KernelSHAP explanations (SHAP values) for the data point X_test[0] from the FL clients, and from the server and central cases, are as follows:
Figure 2. KernelSHAP values for X_test[0] on the neural network models in the classification problem. The horizontal axis is the SHAP value and the vertical axis is the feature. Upper: SHAP values among FL clients on the FL model. Lower: SHAP values for the central cases, the server, and the client mean.
- Depending on the data point, the local model explanations may vary among the FL clients, and relative to the server and central training cases; a feature may be assigned a positive contribution by one and a negative contribution by another. Such inconsistency across the FL clients is independent of the choice of explanation method (KernelSHAP or LIME). The inconsistency is likely due to the different, small (size 100) background datasets used to fit the explainers, which follow different distributions from the population, as illustrated in Figure 1. For example, in the classification problem, the KernelSHAP explanations for the data point X_test[0] from the FL clients, and from the server and central cases, are quite consistent, as shown in Figure 2 above. But the same cannot be said of X_test[1]:
Figure 3. KernelSHAP values for X_test[1] on the neural network models in the classification problem. Features '0' and '2' receive SHAP values of different signs from the FL clients, and from the server and central cases.
The LIME feature importance shows qualitatively the same behaviour as the KernelSHAP values (not shown here to avoid clutter). In particular, the sign differences are the same as in KernelSHAP.
Global explanations
For all datasets and models considered, the global explanations from KernelSHAP in all cases (FL server, FL clients and central) are consistent and sensible: the informative and redundant features are given the largest feature importance. The sign inconsistency across the FL clients that we observed in the local explanations does not appear in the global importance, as it is computed by taking the mean of the absolute SHAP values over the background dataset for each feature. See e.g. Figure 4 for the classification problem.
Figure 4. KernelSHAP global summary on the neural network models in the classification problem. The informative features '0' and '1' consistently have the largest feature importance, followed by the redundant features '2' and '3'.
Feature correlations
In the classification problem, features '2' and '3' are redundant, i.e. linear combinations of the informative features '0' and '1'; this violates the feature independence assumption made by both KernelSHAP and LIME. While it is known in the literature that SHAP values can be misleading when some features are correlated, in our experiments we did not find obvious contradictions or inconsistencies in our results, other than the ones mentioned above.
5. Conclusion and Outlook
In this investigation, we have achieved the following outcomes:
- Demonstrated a computationally efficient implementation of model explainability methods in the horizontal federated learning setting
- Performed FL simulations on tabular datasets, and extracted local and global model explanations from explainers at different entities: the FL server and the FL clients
- Empirically re-discovered the existence of inconsistencies in model explanations across different entities, which can be attributed to the different distributions of the local background data used to fit the explainers. This corroborates the same observation made in the literature, as one of the challenges of explainable AI in federated learning.
Based on these outcomes, looking forward, there are various avenues to explore regarding explainable edge MLOps, including but not limited to:
- The relationship and integration of explainable edge MLOps infrastructure with other pillars of Trustworthy AI: for example, explainability infrastructure as a possible point of adversarial attack, and the prevention and mitigation thereof; and applications of explainability infrastructure to the detection and mitigation of adversarial attacks
- Benchmarking and selecting suitable global and local model explainability methods for prototypical edge devices, from the generalisability (on model types and assumptions on data behaviour) and computational standpoints
- Investigating explainability methods specific to image and text data
- Explainability methods in vertical federated learning, and federated transfer learning
- Non-independent-and-identically-distributed (non-i.i.d.) local data among FL clients.
References
- Linardatos et al. Explainable AI: A Review of Machine Learning Interpretability Methods: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7824368/
- Christoph Molnar. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable: https://christophm.github.io/interpretable-ml-book/
- Responsible AI Dashboard on Azure Machine Learning: https://learn.microsoft.com/en-us/azure/machine-learning/concept-responsible-ai-dashboard?view=azureml-api-2
- Shokri et al. On the Privacy Risks of Model Explanations: https://arxiv.org/abs/1907.00164
- Slack et al. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods: https://arxiv.org/abs/1911.02508
- Li et al. Towards Interpretable Federated Learning: https://arxiv.org/abs/2302.13473
- Sánchez Sánchez et al. FederatedTrust: A Solution for Trustworthy Federated Learning: https://arxiv.org/abs/2302.09844
- Jelena Fiosina. Explainable Federated Learning for Taxi Travel Time Prediction: https://link.springer.com/chapter/10.1007/978-3-031-17098-0_20
- Guan Wang. Interpret Federated Learning with Shapley Values: https://arxiv.org/abs/1905.04519
- SHAP package: https://shap-lrjball.readthedocs.io/en/latest/index.html
- LIME package: https://lime-ml.readthedocs.io/en/latest/index.html
- Aas et al. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values: https://arxiv.org/abs/1903.10464
- Haffar et al. Explaining predictions and attacks in federated learning via random forests: https://link.springer.com/article/10.1007/s10489-022-03435-1
Figure 5. Model explanation in horizontal federated learning.