Abstract
The use of deep learning techniques in medical applications holds great promise for advancing health care. However, there are growing privacy concerns about what information these deep models may reveal about individual data contributors (i.e., patients in the training set) when shared with external users. In this work, we first investigate the membership privacy risks of sharing deep learning models for cancer genomics tasks, and then study the applicability of privacy-protecting strategies for mitigating these risks.
Keywords: Privacy, Deep Learning, Genomic Data
I. Introduction
Deep learning technology is becoming widely deployed in many genomic data-driven applications with the goal of advancing medical research, facilitating early diagnosis, and delivering personalized medicine. However, the sensitive nature of genomic data poses significant privacy concerns, as privacy violations can have severe consequences for data donors and their blood relatives [1], [2]. While privacy risks have long been investigated in the context of genomic data sharing (e.g., GWAS [3], [4]), limited privacy research has been conducted on the use of genomic data in deep learning applications. Recent studies have shown that an adversary with access to a shared model may be able to perform powerful inference attacks [5], [6]. For example, an adversary may exploit a pre-trained model shared by a cancer research center to learn whether a target individual participated in the training set (i.e., membership inference), thus inferring her predisposition to cancer. To enable a broad and responsible use of machine learning in genomic data-driven applications, it is imperative to study these privacy risks and design effective mitigation strategies.
The goal of this study is twofold. First, we study the privacy risk of deep learning approaches for cancer genomics. Our evaluations of a recent deep learning method for cancer sub-typing show that the trained model may disclose information about the membership of individual data contributors. Second, we investigate the applicability of several privacy-protecting methods for mitigating membership inference. Among recent strategies, we consider the approach proposed by Abadi et al. [7], which is the state-of-the-art framework for training deep learning models under differential privacy (DP) [8]. However, due to the high dimensionality of genomic data and the large number of epochs required to train the model, our results show that it is challenging to preserve accuracy while providing meaningful privacy protection in our application setting. To bridge the gap between usability and privacy, we propose a data sanitization approach based on dimensionality reduction, in which genomic data are first sanitized via a compact representation and then used to train the model without additional privacy cost. Overall, our initial evaluations demonstrate the effectiveness of the proposed approach.
II. Methods
In this section, we describe the privacy mitigation strategies considered in this work.
Baseline.
The differentially private stochastic gradient descent approach (DP-SGD) proposed by Abadi et al. [7] is widely used to train deep learning models while providing strong privacy protection under the differential privacy model [8]. In this technique, the model is trained on the original data, and privacy is achieved by perturbing the clipped gradient of the loss function during training (i.e., each gradient is shrunk within a clipping norm and then noise is injected). To quantify the overall privacy protection, the privacy loss is accumulated over all training epochs using the moments accountant.
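To make the mechanism concrete, the following is a minimal NumPy sketch of a single DP-SGD step. All values are illustrative; in practice training uses a DP optimizer such as the one provided by tensorflow-privacy.

```python
# Minimal sketch of one DP-SGD step on per-example gradients (illustrative values).
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=0.1, noise_multiplier=1.0):
    """Clip each per-example gradient to clip_norm, add Gaussian noise, average."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))  # shrink within clip_norm
    summed = np.sum(clipped, axis=0)
    # Noise scale is proportional to the clipping norm (the sensitivity of the sum).
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    return (summed + noise) / len(per_example_grads)  # noisy average gradient

grads = [np.random.randn(5) for _ in range(32)]  # toy batch of per-example gradients
print(dp_sgd_step(grads))
```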
Our Data Sanitization Approaches.
Here, we propose two data sanitization approaches that generate a privacy-protecting version of the original data, which is then used to train the deep learning model without incurring additional privacy cost (Figure 1). In our sanitization approaches, we first transform the original genomic profiles into a compact representation, then use randomized techniques to protect privacy, and finally reverse the transformation to obtain profiles in the original domain space. Similar to our previous study [9], we use the Discrete Cosine Transform (DCT) to generate a compact representation of the original data. To protect privacy in the DCT domain, we consider two perturbation approaches: (1) we perturb each embedded vector using Laplace noise, and (2) we design a random sampling approach that satisfies the metric privacy model [10], an extension of differential privacy to generic metric spaces. A minimal sketch of the Laplace variant is given after Figure 1.
Fig. 1: Overview of the data sanitization framework. The data are first sanitized, and the model is trained on the sanitized data.
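The sketch below illustrates the Laplace variant of the pipeline (transform, perturb, invert) using SciPy's DCT. The values of k and ϵ and the per-coefficient sensitivity are illustrative assumptions, not the exact calibration used in our experiments; the metric-privacy sampling variant is not shown.

```python
# Minimal sketch of DCT-based sanitization with Laplace perturbation.
# NOTE: k, eps, and the per-coefficient sensitivity are illustrative assumptions;
# the exact noise calibration depends on the data bounds and the privacy analysis.
import numpy as np
from scipy.fft import dct, idct

def sanitize_profile(x, k=50, eps=100.0, sensitivity=1.0):
    """Keep the first k DCT coefficients, add Laplace noise, invert the transform."""
    coeffs = dct(x, norm="ortho")          # compact frequency-domain representation
    noisy = np.zeros_like(coeffs)
    noisy[:k] = coeffs[:k] + np.random.laplace(scale=sensitivity / eps, size=k)
    return idct(noisy, norm="ortho")       # sanitized profile in the original domain

profile = np.random.rand(25160)            # toy expression profile (25,160 genes)
print(sanitize_profile(profile)[:5])
```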
III. Results
Deep Learning Method.
In this study, we considered DeepType [11], a recently proposed deep learning method for advancing cancer sub-type classification. In our evaluation, we used the default settings, with 20,000, 1,024, 512, and 6 nodes in the input layer, the two hidden layers, and the output layer, respectively. Additionally, the learning rate was 10⁻³, with a total of 500 training epochs.
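For reference, a Keras sketch of this architecture is shown below. The layer sizes follow the default settings above, while the activations, optimizer, and loss are assumptions for illustration; DeepType's full objective, which includes sparsity and clustering terms, is not reproduced here.

```python
# Keras sketch of the DeepType-style network used in our evaluation.
# Layer sizes follow the defaults above; activations and loss are assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20000,)),                  # input layer (expression features)
    tf.keras.layers.Dense(1024, activation="relu"),  # first hidden layer
    tf.keras.layers.Dense(512, activation="relu"),   # second hidden layer
    tf.keras.layers.Dense(6),                        # output layer: 6 subtypes (logits)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # learning rate 10^-3
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.summary()
```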
Data.
We conducted our preliminary evaluations on the breast cancer dataset obtained from the METABRIC study [12], as in the original DeepType paper [11]. This dataset contains roughly 2,000 breast tissue samples, each comprising the expression of 25,160 genes.
Training with DP-SGD.
We utilized DP-SGD to privately train DeepType, using the tensorflow-privacy framework. Two important parameters impact utility: the gradient clipping norm and the noise multiplier, while ϵ represents the overall level of privacy protection. Intuitively, larger values of the noise multiplier (σ) increase the amount of noise in the gradient, thus providing stronger privacy (smaller ϵ) but reducing utility. Clipping norms that are too small or too large can also adversely affect training utility. We performed a grid search over clipping norms ranging from 0.1 to 0.001 and noise multipliers ranging from 0.01 to 1. Model accuracy and the resulting values of the overall privacy parameter are reported in Table I. Our evaluations show that DP-SGD leads to poor accuracy, even for weak privacy protection.
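As an illustration of how the overall ϵ values in Table I are accumulated over epochs, the following sketch uses the accounting helper shipped with tensorflow-privacy. The import path and signature vary across library versions, and the training-set size, batch size, and δ below are assumptions rather than our exact settings.

```python
# Sketch: overall epsilon for DP-SGD via the moments accountant.
# Import path and exact signature depend on the tensorflow-privacy version;
# n, batch_size, and delta below are assumptions, not our exact settings.
from tensorflow_privacy.privacy.analysis.compute_dp_sgd_privacy import (
    compute_dp_sgd_privacy,
)

eps, opt_order = compute_dp_sgd_privacy(
    n=1600,                # assumed training-set size (~80% of ~2,000 samples)
    batch_size=32,         # assumed batch size
    noise_multiplier=1.0,  # sigma from the grid search
    epochs=500,            # DeepType default
    delta=1e-4,            # assumed delta (< 1/n)
)
print(f"epsilon = {eps:.2f} at RDP order {opt_order}")
```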
TABLE I:
Training Evaluations using DP-SGD.
| Noise Multiplier (σ) | Privacy Budget (ϵ) | Clipping Norm | Accuracy |
|---|---|---|---|
| 1.0 | 45.98 | 0.1 | 0.39 |
| | | 0.01 | 0.32 |
| | | 0.001 | 0.45 |
| 0.1 | 255,130 | 0.1 | 0.36 |
| | | 0.01 | 0.35 |
| | | 0.001 | 0.35 |
| 0.01 | 65,580,612 | 0.1 | 0.29 |
| | | 0.01 | 0.33 |
| | | 0.001 | 0.37 |
Training with sanitized data.
We compared the Laplace and random sampling methods for generating sanitized data in the DCT domain, varying the dimensionality of the embedded vectors (K) and the overall privacy parameter (ϵ). First, we evaluated how well the sanitized data retain the usefulness of the original data in terms of Euclidean distance (Figure 2). In our results, the random sampling method (i.e., our metric privacy model) outperformed the Laplace perturbation mechanism. Second, we assessed the applicability of the sanitized data obtained with our metric privacy model for training DeepType, measuring both utility and privacy (Table II). To measure utility, we split the data into 80% training and 20% testing. Specifically, we compared the accuracy on the same test set between DeepType trained on the original training data and on the sanitized training data obtained with our metric privacy model, for increasing values of the privacy parameter. We observed that larger values of the privacy parameter are needed to obtain useful predictive results with the DeepType model. To measure the membership inference risk, we performed our evaluations using the attack models proposed in [13]. From our evaluations, we observed that our proposed sanitization method can significantly mitigate the privacy risk for DeepType, reducing the performance of the best attack (measured using AUC) from 0.71 to 0.5.
Fig. 2:
Data usability for the sanitized data produced by our methods (random sampling and noise perturbation) for different values of the privacy parameter ϵ and dimensionality K in the DCT domain.
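As a sketch of how the usability measurement behind Fig. 2 can be computed, the snippet below averages the Euclidean distance between original and sanitized profiles. The data are toy values, and it reuses the hypothetical sanitize_profile function from the earlier sketch in Section II.

```python
# Sketch of the data-usability measurement behind Fig. 2: average Euclidean
# distance between original and sanitized profiles (toy data; reuses the
# hypothetical sanitize_profile function from the earlier sketch).
import numpy as np

def mean_reconstruction_distance(X, sanitize, **params):
    """Average L2 distance between each profile and its sanitized version."""
    return float(np.mean([np.linalg.norm(x - sanitize(x, **params)) for x in X]))

X = np.random.rand(100, 25160)  # toy cohort of 100 expression profiles
print(mean_reconstruction_distance(X, sanitize_profile, k=50, eps=100.0))
```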
TABLE II:
Utility and privacy evaluations using DeepType. Sanitized data were obtained with metric privacy and K = 50.
| Privacy Budget (ϵ) | Test Accuracy | Best Membership Attack AUC |
|---|---|---|
| 50.0 | 0.58 | 0.5 |
| 100.0 | 0.69 | 0.52 |
| 200.0 | 0.67 | 0.55 |
| Non-private | 0.89 | 0.71 |
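For context on the attack-side evaluation, the sketch below computes a membership inference AUC from per-example losses, in the spirit of the loss-based attacks of [13]. The loss arrays are illustrative assumptions, not outputs of our experiments.

```python
# Sketch of a loss-based membership inference evaluation in the spirit of [13].
# Lower per-example loss on a sample suggests it was in the training set.
import numpy as np
from sklearn.metrics import roc_auc_score

def membership_auc(loss_members, loss_nonmembers):
    """AUC of predicting membership from (negated) per-example loss."""
    scores = -np.concatenate([loss_members, loss_nonmembers])  # lower loss -> higher score
    labels = np.concatenate([np.ones(len(loss_members)), np.zeros(len(loss_nonmembers))])
    return roc_auc_score(labels, scores)

# Toy illustration: members tend to have lower loss than non-members.
members = np.random.gamma(shape=1.0, scale=0.3, size=1000)
nonmembers = np.random.gamma(shape=2.0, scale=0.5, size=1000)
print(f"attack AUC = {membership_auc(members, nonmembers):.2f}")
```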
IV. Conclusion
In this work, we showed that deep learning models trained on cancer genomic data may expose data contributors to membership inference risks, and that our DCT-based sanitization approach can mitigate these risks while retaining predictive utility. In future research, we plan to further study the trade-off between privacy and usability in emerging applications of deep learning technology for genomic data-driven research.
Acknowledgment
This work was supported in part by the National Human Genome Research Institute grant R00HG010493.
Contributor Information
Chonghao Zhang, Dept. of Computer Science and Engineering, University of California, San Diego, La Jolla, CA.
Luca Bonomi, Dept. of Biomedical Informatics, Vanderbilt University, Nashville, TN.
References
- [1] Erlich Y and Narayanan A, "Routes for breaching and protecting genetic privacy," Nature Reviews Genetics, vol. 15, no. 6, pp. 409–421, 2014.
- [2] Bonomi L, Huang Y, and Ohno-Machado L, "Privacy challenges and research opportunities for genomic data sharing," Nature Genetics, vol. 52, no. 7, pp. 646–654, 2020.
- [3] Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, Pearson JV, Stephan DA, Nelson SF, and Craig DW, "Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays," PLoS Genetics, vol. 4, no. 8, p. e1000167, 2008.
- [4] Sankararaman S, Obozinski G, Jordan MI, and Halperin E, "Genomic privacy and limits of individual detection in a pool," Nature Genetics, vol. 41, no. 9, pp. 965–967, 2009.
- [5] Shokri R, Stronati M, Song C, and Shmatikov V, "Membership inference attacks against machine learning models," in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 3–18.
- [6] Fredrikson M, Jha S, and Ristenpart T, "Model inversion attacks that exploit confidence information and basic countermeasures," in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 2015, pp. 1322–1333.
- [7] Abadi M, Chu A, Goodfellow I, McMahan HB, Mironov I, Talwar K, and Zhang L, "Deep learning with differential privacy," in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 308–318.
- [8] Dwork C, "Differential privacy," in Automata, Languages and Programming, Bugliesi M, Preneel B, Sassone V, and Wegener I, Eds. Berlin, Heidelberg: Springer, 2006, pp. 1–12.
- [9] Fan L and Bonomi L, "Time series sanitization with metric-based privacy," in 2018 IEEE International Congress on Big Data (BigData Congress). IEEE, 2018, pp. 264–267.
- [10] Chatzikokolakis K, Andrés ME, Bordenabe NE, and Palamidessi C, "Broadening the scope of differential privacy using metrics," in International Symposium on Privacy Enhancing Technologies. Springer, 2013, pp. 82–102.
- [11] Chen R, Yang L, Goodison S, and Sun Y, "Deep-learning approach to identifying cancer subtypes using high-dimensional genomic data," Bioinformatics, vol. 36, no. 5, pp. 1476–1483, 2020.
- [12] Curtis C, Shah SP, Chin S-F, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y, et al., "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups," Nature, vol. 486, no. 7403, pp. 346–352, 2012.
- [13] Song L and Mittal P, "Systematic evaluation of privacy risks of machine learning models," in 30th USENIX Security Symposium (USENIX Security 21), 2021.

