Abstract
Objective
Differential privacy is a relatively new method for protecting data privacy that has seen growing use due to the strong protections it provides through added noise. This study assesses the extent of its awareness, development, and usage in health research.
Materials and Methods
A scoping review was conducted by searching for [“differential privacy” AND “health”] in major health science databases, with additional articles obtained via expert consultation. Relevant articles were classified according to subject area and focus.
Results
A total of 54 articles met the inclusion criteria. Nine articles provided descriptive overviews, 31 focused on algorithm development, 9 presented novel data sharing systems, and 8 discussed appraisals of the privacy-utility tradeoff. The most common areas of health research where differential privacy has been discussed are genomics, neuroimaging studies, and health surveillance with personal devices. Algorithms were most commonly developed for the purposes of data release and predictive modeling. Studies on privacy-utility appraisals have considered economic cost-benefit analysis, low-utility situations, personal attitudes toward sharing health data, and mathematical interpretations of privacy risk.
Discussion
Differential privacy remains at an early stage of development for applications in health research, and accounts of real-world implementations are scant. There are few algorithms for explanatory modeling and statistical inference, particularly with correlated data. Furthermore, diminished accuracy in small datasets is problematic. Some encouraging work has been done on decision making with regard to epsilon. The dissemination of future case studies can inform successful appraisals of privacy and utility.
Conclusions
More development, case studies, and evaluations are needed before differential privacy can see widespread use in health research.
Keywords: differential privacy, privacy, confidentiality, data sharing, statistical disclosure limitation
INTRODUCTION
Background and significance
Data sharing is vital for the acceleration of health research. It enables transparency and facilitates discovery through secondary analysis. Several institutions have issued policies that mandate or encourage data sharing: the U.S. Office of Science and Technology Policy, for example, has directed federal agencies to increase public access to scientific data.1 Both the National Institutes of Health and the National Science Foundation require data sharing plans to be submitted with grant proposals,2,3 and more recently, the International Committee of Medical Journal Editors has begun requiring the same for reports of clinical trials.4
In the United States, the sharing of personal health information is subject to the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA). This requires Institutional Review Board approval and the signing of a data use agreement for access by researchers who are not part of covered entities (eg, health care providers or academic medical centers). If individual-level health data are to be distributed publicly, then HIPAA permits de-identification through either the Safe Harbor method (the removal of 18 specific types of identifiers) or the Expert Determination method, in which a qualified individual ensures that the risk of reidentification is very small.5 If needed, this may involve the application of statistical disclosure limitation techniques, including information suppression, high-level aggregation of detail, rounding, noise addition, or synthetic data.6,7 However, the Safe Harbor method does not always protect against reidentification, as unique combinations of nonprivate attributes can be linked to auxiliary data, like public voter registration records.8–10 Also, there are no specific standards for the Expert Determination method—in what constitutes an expert, in the disclosure limitation techniques used, or in the definition of “very small” when it comes to the risk of reidentification. Given the multitude of privacy models and risk assessment criteria, privacy protection measures can be largely ad hoc in nature, and thus there have been calls for universal standards for controlling privacy risk.6,11,12
Differential privacy
One of the strongest methods of controlling disclosure risk in recent years is differential privacy. It involves replacing statistical queries with comparable algorithms that add random noise. The amount and type of noise adheres to a mathematical definition of database privacy formalized by Dwork et al.13 In simple terms, an algorithm is differentially private if the addition or removal of a single individual from the database changes the likelihood of any output by an imperceptible amount. Differential privacy provides guaranteed protections against attacks that many other disclosure limitation techniques are vulnerable to, including linkage, differencing, and database reconstruction attacks.14
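Formally, a randomized algorithm M is ε-differentially private if, for every pair of datasets D and D′ that differ in a single record and every set of possible outputs S, Pr[M(D) ∈ S] ≤ e^ε × Pr[M(D′) ∈ S], where ε ≥ 0 is the privacy parameter.13,14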
As an example of how it works, consider a database with an indicator of disease status (1 = yes, 0 = no) for a list of individuals. To prevent individual-level information from being disclosed, a nondifferentially private query system might restrict queries to aggregate statistics (eg, sum or average) over a large number of records only. However, an attacker may circumvent this by computing the total number of disease cases with and without the record of a known individual. By examining the change in the total, the disease status of this individual is revealed (this is known as a differencing attack). Randomly perturbing a statistic through a differentially private mechanism inhibits this sort of traceback by occluding the influence of any single individual. Although the query results are inexact, they ideally converge on average around the original, nonperturbed value. Because the scale of noise depends on the maximum impact of only 1 individual, accuracy is typically less compromised when dealing with large numbers of records.
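To make this concrete, the following minimal Python sketch (our illustration only; the toy dataset, function names, and the use of the standard Laplace mechanism with sensitivity 1 are assumptions rather than features of any reviewed system) contrasts a differencing attack on exact counts with the same attack on noisy counts:

```python
# Minimal sketch: a differencing attack on exact counts vs Laplace-perturbed counts.
# Assumes a counting query (sensitivity 1) and the standard Laplace mechanism.
import numpy as np

rng = np.random.default_rng(0)

# Toy database: disease status (1 = case, 0 = non-case) for 10 individuals.
disease_status = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])

def exact_count(records):
    return int(records.sum())

def dp_count(records, epsilon):
    """epsilon-differentially private count: add Laplace noise with scale 1/epsilon."""
    return records.sum() + rng.laplace(scale=1.0 / epsilon)

# Differencing attack on exact counts: query the total with and without the
# last individual; the difference reveals that person's disease status.
print("Revealed status:", exact_count(disease_status) - exact_count(disease_status[:-1]))

# The same attack on noisy counts no longer isolates one person's contribution.
eps = 0.5
print("Noisy difference:", dp_count(disease_status, eps) - dp_count(disease_status[:-1], eps))
```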
Privacy is controlled by the parameter ε, which quantifies the information leakage resulting from the addition or removal of a single participant. Generally, the smaller the value of ε, the more noise an algorithm adds to a query, and the more “private” it is. However, as in most methods of statistical disclosure limitation, there is an inverse relationship between privacy and utility. Large amounts of noise may virtually guarantee privacy, but the costs to accuracy will be high. Determining acceptable values of ε is context specific and continues to be an open question.15–18
ε is additive, meaning that if 2 queries are performed that are each ε-differentially private, then the entire analysis is 2ε-differentially private. Because the uncertainty introduced by an algorithm can be diminished via repeated sampling, a “privacy budget” of a total amount of ε can be assigned to an analyst prior to being granted data access. In this sense, privacy of a database is an exhaustible resource, and data custodians are afforded control over how much and to whom information is given. Alternatively, one may release an entire dataset sanitized through a differentially private mechanism. This may take the form of summarized, perturbed data, or synthetic data generated from a model. Because differentially private output retains its degree of privacy even after being made public, any subsequent queries on such a dataset would not count toward a privacy budget.
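A minimal Python sketch of this bookkeeping follows (our own illustration under the basic sequential composition rule; all names and parameter values are assumptions). Each answered query spends ε from the analyst's budget, and because averaging repeated noisy answers tends to concentrate around the true value, a custodian stops answering once the budget is exhausted:

```python
# Minimal sketch of a privacy budget under basic sequential composition:
# k queries, each epsilon-differentially private, together spend k * epsilon.
import numpy as np

rng = np.random.default_rng(1)
true_count = 120          # the statistic being queried repeatedly
per_query_eps = 0.1       # privacy cost of each query
budget = 1.0              # total privacy budget granted to the analyst

answers, spent = [], 0.0
while spent + per_query_eps <= budget:
    # Each answer uses the Laplace mechanism for a count (sensitivity 1).
    answers.append(true_count + rng.laplace(scale=1.0 / per_query_eps))
    spent += per_query_eps

print(f"Queries answered: {len(answers)}; budget spent: {spent:.1f} of {budget}")
print(f"True count: {true_count}")
print(f"First noisy answer: {answers[0]:.1f}")
print(f"Mean of all noisy answers: {np.mean(answers):.1f}")  # noise shrinks as answers accumulate
```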
Objective
Differential privacy’s robust protections have made it an increasingly popular option in the realm of big data.19–22 Research on variants, algorithms, and implementations has grown exponentially since its 2006 introduction, and interest is likely to continue to grow (Figure 1). However, the question of whether it is a true Swiss Army knife deserves scrutiny when it comes to the sprawling and multifaceted fields that make up health science. Even within fields like epidemiology or public health, there is wide variation in research traditions, analytic goals, and the amount and nature of available data. An administrator considering implementing a system of differential privacy would need to know if existing algorithms are suitable for researchers’ intended aims, whether they be health surveillance, predictive modeling, comparative effectiveness research, exploratory analysis, or others. Algorithms must be evaluated in context to inform expectations about their privacy-utility tradeoff. Furthermore, the exact implications of differential privacy for the confidentiality of health data must be appropriately appraised and understood. Data custodians and Institutional Review Boards would need to know what values of ε (or other parameters) constitute an acceptable amount of risk to patients, and what suggestions can be learned from prior findings.18 The nature of the risk must also be interpretable to individuals prior to giving consent to the use of their health information. Case studies of successful applications of differential privacy are needed.
Figure 1.
Yearly count of publications on differential privacy found on EBSCOhost. Obtained from a keyword-only search of “differential privacy” (full-text search results excluded). For comparison, a similar search of “k-anonymity,” another popular statistical disclosure limitation method, yields 49 publications in the year 2006, with an increase to 349 publications in the year 2020. A keyword search of “de-identification” yields 116 publications in the year 2006, with an increase to 490 publications in the year 2020.
Reviews of the literature can provide guidance to these questions, yet preliminary searches at the outset of this study have indicated that few to none exist.23 This may speak to the novelty of differential privacy to the health and biomedical sciences. Its particularly robust protections, however, as well as its snowballing popularity in other sectors, suggest that it is time for serious examinations of what it can offer. This scoping review24 is a step in that direction, by providing a broad overview of how differential privacy has been leveraged and discussed in health research. By mapping how differential privacy is being applied in the field, administrators and research practitioners can learn in which subfields it may have been used successfully, what algorithms address current needs, and in what circumstances it might pose challenges. The identification of gaps in evaluation and usage will also help scholars prioritize where to focus future development efforts and utility assessments. The review is structured around the following questions: (1) To what extent and in what areas of health research is differential privacy being applied? (here, the term health research encompasses quantitative research, informatics, and analytics pertinent to the health sciences, health care, and related fields; we broadly define an “application” as a practical usage for purposes related to health research, and this is distinguished from the more specific definitions of “computer” or “software” application); (2) For what analytical purposes are algorithms being developed?; and (3) What can be said about the privacy-utility tradeoff in specific health research contexts?
MATERIALS AND METHODS
A scoping review24 was conducted to assess the published status of research and applications with regard to differential privacy in the health sciences. A full text search of the query [“differential privacy” AND “health”] was conducted in PubMed, CINAHL, Embase, Ovid, and PsycINFO. The same query was used to search for grey literature from the New York Academy of Medicine Grey Literature Report, HSRProj, and the Technology Assessment Program at the Agency for Healthcare Research and Quality. The specified range for the year of publication was 2006 (the year differential privacy was formally introduced13) to 2020; the date of access was January 14, 2021. Articles in all languages were considered, provided that an English translation was available. Additional articles were sought through consultation with a biostatistical expert familiar with the topic.
The search yielded 36 articles from PubMed, 7 from CINAHL, 36 from Embase, 133 from Ovid, 2 from PsycINFO, and 1 from HSRProj. Three articles were obtained via contact with an expert. After the removal of duplicates, a total of 117 unique articles were retrieved. Following this, the full text of articles was screened for 2 criteria: (1) context specific to health or biomedicine and (2) differential privacy as a primary topic of focus. A total of 62 articles did not meet these criteria and hence were excluded. A flowchart of the selection process is provided in Figure 2.
Figure 2.
Flow diagram of study selection.
The remaining 54 articles were reviewed and grouped by application area, if specified. They were also classified into 5 categories based on their topic of focus: introductory overview of differential privacy, privacy risk appraisal, algorithm evaluation, algorithm development, and data system (defined here as a system for storing, processing, sharing, and/or querying data, usually with added security and encryption protections). Articles relating to algorithm evaluation, algorithm development, or data systems were further classified based on the type of queries or analyses they addressed.
RESULTS
Article focus and application areas
The 54 articles that met the inclusion criteria were classified into groupings based on article focus, intended application (if specified), and analytic purpose of the algorithm or data system. All articles and their classifications are listed in Table 1. Regarding focus, 9 (17%) articles provided a descriptive overview of differential privacy for the purpose of informational awareness, 8 (15%) discussed a privacy-utility appraisal, 31 (57%) focused on the development of novel algorithms, and 9 (17%) described data system designs. One article discussed algorithms and a privacy-utility appraisal concurrently, and 2 described both algorithms and data systems concurrently.
Table 1.
Classification of health research articles with focus on differential privacy
Study | Article focus | Intended application | Analytic purpose |
---|---|---|---|
Dennis et al25 | Introductory overview | Behavioral research | – |
Jiang et al26 | Introductory overview | Comparative effectiveness research | – |
Al Aziz et al27 | Introductory overview | Genomic data sharing | – |
Shi and Wu28 | Introductory overview | Genomic data sharing | – |
Wang et al29 | Introductory overview | Genomic data sharing | – |
Mehta et al30 | Introductory overview | Genomic data sharing | – |
Yakubu and Chen31 | Introductory overview | GWAS | – |
Dankar and Emam23 | Introductory overview | Unspecified | – |
Dwork and Pottenger32 | Introductory overview | Unspecified | – |
Khokhar et al33 | Privacy-utility appraisal | Health data publishing | – |
Santos-Lozada et al34 | Privacy-utility appraisal | Health disparities research | – |
Krieger et al35 | Privacy-utility appraisal | Health disparities research | – |
Xu and Zhang36 | Privacy-utility appraisal | Health disparities research | – |
Calero Valdez and Ziefle37 | Privacy-utility appraisal | Health recommender systems | – |
Matthews and Harel38 | Privacy-utility appraisal | Unspecified | – |
Matthews et al39 | Privacy-utility appraisal | Unspecified | – |
Vu and Slavkovic40 | Algorithm development, privacy-utility appraisal | Clinical trials | Hypothesis testing |
Liu et al41 | Algorithm development | Coronary heart disease diagnosis | Predictive modeling |
Niinimäki et al42 | Algorithm development | Drug sensitivity prediction in cancer | Predictive modeling |
Honkela et al43 | Algorithm development | Drug sensitivity prediction in cancer | Predictive modeling |
Bonomi et al44 | Algorithm development | Epidemiology | Nonparametric survival analysis |
Beaulieu-Jones et al45 | Algorithm development | Clinical data sharing | Data release |
Lee et al46 | Algorithm development | Clinical data sharing | Data release |
Almadhoun et al47 | Algorithm development | Genomic data sharing | Data release |
Simmons and Berger48 | Algorithm development | GWAS | Outputting top M ranked statistics |
Wang et al49 | Algorithm development | GWAS | Outputting top M ranked statistics |
Yu et al50 | Algorithm development | GWAS | Outputting top M ranked statistics |
Kim et al51 | Algorithm development | Health surveillance with personal devices | Data release |
Lin et al52 | Algorithm development | Health surveillance with personal devices | Data release |
Wu et al53 | Algorithm development | Health surveillance with personal devices | Data release |
Ren et al54 | Algorithm development, data system | Health surveillance with personal devices | Data release |
Saleheen et al55 | Algorithm development, data system | Health surveillance with personal devices | Data release |
Ukil et al56 | Algorithm development | Health surveillance with personal devices | Predictive modeling |
Li et al57 | Algorithm development | Medical phenotyping | Predictive modeling |
Ma et al58 | Algorithm development | Medical phenotyping | Predictive modeling |
Baker et al59 | Algorithm development | Neuroimaging | Data release |
Le et al60 | Algorithm development | Neuroimaging | Predictive modeling |
Plis et al61 | Algorithm development | Neuroimaging | Predictive modeling |
Li et al62 | Algorithm development | Neuroimaging | Predictive modeling |
Cho et al63 | Algorithm development | Clinical data sharing | Data release |
Vinterbo et al64 | Algorithm development | Clinical data sharing | Data release |
Mohammed et al65 | Algorithm development | Clinical data sharing | Data release |
Li et al66 | Algorithm development | Unspecified | Predictive modeling |
Wang et al67 | Algorithm development | Unspecified | Predictive modeling |
Krall et al68 | Algorithm development | Unspecified | Predictive modeling |
Parvandeh et al69 | Algorithm development | Unspecified | Predictive modeling |
Shao et al70 | Algorithm development | Unspecified | Predictive modeling |
Gardner et al71 | Data system | Clinical data sharing | Data release |
Xiong72 | Data system | Clinical data sharing | Data release |
Froelicher et al73 | Data system | Clinical data sharing | Data release |
Raisaro et al74 | Data system | Clinical data sharing | Data release |
Raisaro et al75 | Data system | Clinical data sharing | Data release |
Huang et al76 | Data system | GWAS | Data quality control |
Eicher et al77 | Data system | Unspecified | Predictive modeling |
GWAS: genome-wide association study.
Although all articles can broadly be considered under the purview of health or biomedical informatics, 37 (67%) articles emphasized a specific research or clinical application. The most common was research in genomics—genome-wide association studies (GWASs), genomic data sharing, or drug sensitivity prediction in cancer cell lines—which together comprised 13 articles (24% of all articles found). The next most common application (11%) was health surveillance with personal devices (ie, bodily sensors or mobile devices), followed by the analysis of neuroimaging data (7%). Three (6%) articles discussed differential privacy in the context of health disparity research. The remaining categories appear in Table 1.
Algorithms and data systems
The 38 articles relating to the development of an algorithm or data system were further categorized by analytic purpose. These are also cross-classified by intended health application in Table 2. Of these 38 articles, 17 (45%) discussed algorithms or systems for data release, ie, the release of perturbed aggregate statistics (most often counts or multivariate histograms) or synthetic data.45,46 Although such articles were generally intended for sharing clinical data, 6 (16%) specialized in personal data streams originating from wearable sensors or mobile devices,51–56 1 (3%) in genomic data sharing,47 and 1 (3%) in neuroimaging data.59
Table 2.
Frequency of analytic purpose and intended research application among algorithms
Analytic purpose | Application | Count |
---|---|---|
Data release | Clinical data sharing | 10 |
Data release | Health surveillance with personal devices | 5 |
Data release | Genomic data sharing | 1 |
Data release | Neuroimaging | 1 |
Predictive modeling | Unspecified | 6 |
Predictive modeling | Neuroimaging | 3 |
Predictive modeling | Medical phenotyping | 2 |
Predictive modeling | Drug sensitivity prediction | 2 |
Predictive modeling | Coronary heart disease diagnosis | 1 |
Predictive modeling | Health surveillance with personal devices | 1 |
Outputting top M ranked statistics | Genome-wide association studies | 3 |
Hypothesis testing | Clinical trials | 1 |
Nonparametric survival analysis | Epidemiology | 1 |
Data quality control | Genome-wide association studies | 1 |
Fifteen (39%) of the 38 technical articles discussed differentially private algorithms for predictive modeling. Several of these algorithms were part of federated learning systems using data distributed at multiple sites.57,58,61,62,70 Two (5%) articles proposed methods of leveraging public data to improve the accuracy of prediction algorithms.66,67 As for specific health applications, 3 (8%) algorithms were developed for disease classification using neuroimaging data.60–62 Medical phenotyping using electronic health records was the explicit aim of 2 (5%) articles,57,58 and 2 (5%) others explicitly developed algorithms for drug sensitivity prediction in cancer.42,43 Two (5%) articles described processes for privately monitoring cardiac health—one was for diagnosing coronary heart disease using medical records41 and the other was for detecting abnormalities in phonocardiogram signals.56
Three (8%) articles developed algorithms for ranking sets of genetic variants by their statistical associations with disease phenotypes in GWASs.48–50 One article discussed hypothesis testing in the context of clinical trials.40 Differentially private versions of the binomial proportion test and Pearson’s chi-square test were proposed, along with sample size adjustments needed to achieve desired levels of power. One (3%) article proposed a differentially private approach to nonparametric survival analysis,44 and 1 (3%) article discussed a protocol for private data quality control in GWASs.76
Appraisals of privacy-utility tradeoff
Of the 8 articles that touched on privacy-utility appraisal, 5 (62%) provided guidance for decision making with regard to the parameter ε. Khokhar et al33 developed an economic cost-benefit model for a data release algorithm.78 Among the inputs to their model were the costs of information distortion (a function of ε), personal damage and liability costs due to privacy breaches, the likelihood of a privacy breach, and the value of the data. Matthews et al39 and Matthews and Harel38 proposed the area under the receiver-operating characteristic curve as a familiar and interpretable means of comparing privacy risks for different values of ε. To accompany their algorithm for a differentially private version of Pearson’s chi-square test, Vu and Slavkovic40 calculated the sample size inflation factor needed to maintain a nominal level of statistical power for different values of ε. Considering the perspectives of database participants, Calero Valdez and Ziefle37 used focus groups and conjoint analysis to assess attitudes toward sharing data in different usage scenarios (eg, commercial vs scientific) under k-anonymity and differential privacy.
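To give a feel for this type of calculation, the sketch below is a back-of-the-envelope simplification of our own (not Vu and Slavkovic's actual derivation40): it solves for the sample size at which a Laplace-perturbed one-sample proportion estimate has the same variance as an unperturbed estimate from a study of a planned size, with all function names and parameter values chosen purely for illustration.

```python
# Rough sketch: sample size inflation needed so that a Laplace-perturbed
# proportion estimate matches the variance of an unperturbed study of size n0.
# Perturbing the count adds variance 2/eps^2, so Var(p_hat) = p(1-p)/n + 2/(n*eps)^2.
import math

def inflated_n(n0, p, eps):
    v0 = p * (1 - p) / n0                      # target variance without noise
    a, b, c = v0, -(p * (1 - p)), -2.0 / eps**2
    # positive root of v0*n^2 - p(1-p)*n - 2/eps^2 = 0
    return (-b + math.sqrt(b**2 - 4 * a * c)) / (2 * a)

for eps in (0.1, 0.5, 1.0):
    print(f"eps = {eps}: inflate n from 500 to about {inflated_n(500, 0.3, eps):.0f}")
```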
Three (38%) articles evaluated the utility of differential privacy for health disparity research. Krieger et al35 found perturbed population counts, as implemented by the U.S. Census Bureau, to have minimal impact on mortality rate measurements stratified by race/ethnicity, gender, and levels of economic segregation in the state of Massachusetts. At county levels nationwide, however, Santos-Lozada et al34 found concerning effects on the accuracy of mortality rates among minorities in nonurban sectors. In a de-identified health dataset from Pennsylvania, Xu and Zhang36 demonstrated greater privacy vulnerabilities among ethnicities with lower population sizes, in the absence of additional statistical disclosure limitation techniques. At the same time, a differentially private statistical test79 was subject to a severe loss of power for some of these same ethnic groups (by virtue of their smaller sample size), revealing yet another potential source of disparity.
DISCUSSION
The strong protections offered by differential privacy, and its growing ubiquity, call for an examination of its potential for widespread adoption in health research. This review described the state of published health literature with regard to developments in and applications of differential privacy, with the aim of assessing how well it has met varied analytic needs in the 14 years since its inception. The results suggest that differential privacy is still only burgeoning in the health sector. Most differential privacy-related research has focused on the development of predictive algorithms and systems for data release, but case reports of real-world implementations in health research are notably scant. Important gaps have been identified that warrant future investigation.
To date, the most common purposes for which differential privacy has been developed are to enable private queries, publish sanitized data for subsequent analysis, and train machine learning algorithms for diagnostic prediction. Specific areas that have seen the most attention are genomics, neuroimaging studies, and analytics on health data streams originating from personal devices. Significant gaps exist, however, for applications involving explanatory modeling and statistical inference,80 which are particularly important in epidemiology and clinical research. These typically involve estimating the effect of an exposure or intervention (eg, relative risk or odds ratio) on a particular health outcome, with statistical uncertainty measured by standard errors and confidence intervals. Statistical hypothesis tests are also critical tools for exploratory and confirmatory investigation (eg, to assess the efficacy of a novel health intervention). Only one study40 proposed a test that accounts for additional statistical uncertainty due to differentially private noise, though this was for an extremely simple model. Much has been contributed in the machine learning literature for differentially private chi-square tests of contingency tables in GWASs,50,79,81,82 though developments for other statistical tests and models have been scant, even in nonhealth publications.83–87 Conceivably, one could apply standard inferential statistics to perturbed or synthetic data, but the validity of the resulting estimates would need to be verified. More attention should be directed toward algorithms for statistical inference as practiced in the health sciences.
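As an illustration of why such verification matters, the following simulation sketch (our own, not drawn from any reviewed study; the test choice, names, and parameter values are assumptions) applies a standard two-proportion z-test to Laplace-perturbed counts while ignoring the added noise, and the type I error rises well above its nominal 5% level:

```python
# Sketch: naively running a standard two-proportion z-test on Laplace-perturbed
# counts (ignoring the injected noise) inflates the type I error under the null.
import math
import numpy as np

rng = np.random.default_rng(2)
n, p, eps, alpha, reps = 200, 0.3, 0.2, 0.05, 20_000

def naive_p_value(x1, x2, n):
    """Two-proportion z-test that treats the noisy counts as exact."""
    p1, p2 = x1 / n, x2 / n
    pooled = (x1 + x2) / (2 * n)
    z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * 2 / n)
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value

rejections = 0
for _ in range(reps):
    # Null hypothesis is true: both groups share the same proportion p.
    x1 = rng.binomial(n, p) + rng.laplace(scale=1 / eps)
    x2 = rng.binomial(n, p) + rng.laplace(scale=1 / eps)
    if naive_p_value(x1, x2, n) < alpha:
        rejections += 1

print(f"Empirical type I error: {rejections / reps:.3f} (nominal {alpha})")
```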
None of the publications surveyed explicitly addressed differential privacy for correlated data, which are ubiquitous in health datasets. Common examples might be follow-up time points in longitudinal trials, or clustered observations originating from the same individual or community unit. Information leakage with dependent data is one known weakness of standard definitions of differential privacy,88 so recent extensions have begun to address this.89–92 Future algorithm development efforts and assessments should take these into account.
Some encouraging research has aided in the appraisal of differential privacy’s privacy-utility tradeoff. Matthews et al39 and Matthews and Harel38 showed how the parameter ε translates to a definition of risk based on the area under the receiver-operating characteristic curve, depending on the inferential attack an adversary might use. Khokhar et al33 suggested a framework for quantifying the monetary costs and benefits of data utility and privacy risk, and Vu and Slavkovic40 provided a specific example of how clinical trial enrollment costs can scale with differing values of ε. The work of Calero Valdez and Ziefle37 is also a starting point for understanding individuals’ decision making when it comes to sharing health data under differential privacy.
However, perhaps the greatest need for the responsible usage of differential privacy is experimental deployment and the dissemination of case studies that characterize the privacy-utility outcomes of real-world implementations. Part of this might take the form of an Epsilon Registry, as suggested by Dwork et al,18 in which institutions make informational contributions regarding the values of ε used, variants of differential privacy chosen, the justification processes that led to these, experiences with the privacy budget “burn rate,” and protocol when the privacy budget is exhausted. In addition to informing decision making, publicly registering ε values fosters ethical transparency and accountability, particularly when such values are deemed too large to afford any reasonable degree of privacy. Concrete examples of ε and their implications may also help to inform more narrowed HIPAA guidelines regarding the use of differential privacy for shared personal health data.
Finally, some studies of differential privacy implementations in health data have demonstrated adverse consequences for utility in cases of limited data or small populations.34,36 This may warrant research into more efficient differentially private algorithms, or consideration of alternative data sharing paradigms altogether. Data custodians may benefit from simple guidelines as to when datasets, or potential subsets of interest, are too small for differential privacy to be useful (some guidance for the case of count queries is provided in the Supplementary Appendix).
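As a rough illustration in the same spirit (our own arithmetic, not the guidance in the Supplementary Appendix): for a sensitivity-1 count, the expected absolute Laplace noise is 1/ε regardless of the count's size, so the expected relative error of an ε-differentially private count of N records is roughly 1/(εN).

```python
# Rough rule of thumb: expected relative error of an eps-DP Laplace count of size N.
def expected_relative_error(true_count, eps):
    return 1.0 / (eps * true_count)   # E|Laplace(1/eps)| = 1/eps

for true_count in (10, 100, 1_000, 10_000):
    err = expected_relative_error(true_count, eps=0.1)
    print(f"count = {true_count:>6}: expected relative error ~ {err:.1%}")
```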
CONCLUSION
There is burgeoning awareness of differential privacy in the health sciences, with numerous developments for data release, predictive algorithms, genomic research, and personal health surveillance. However, there are few differentially private algorithms available for inferential statistics commonly used in health research. Compromises to accuracy are also of concern in cases of limited data. Unfortunately, real-world evaluations of differential privacy in the health sector are extremely scant. More experimental deployments and case studies are needed to assess the actual privacy-utility implications of current algorithms, processes, and decisions regarding ε. Future studies should address these concerns and continue to explore how differential privacy and other privacy-preserving technologies can best be leveraged to serve the needs of health research.
FUNDING
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
AUTHOR CONTRIBUTIONS
JF performed the literature search and drafted and revised the manuscript. WW conceived of the project. WW, HC, GD, and ED reviewed the manuscript and suggested edits.
SUPPLEMENTARY MATERIAL
Supplementary Material is available at Journal of the American Medical Informatics Association online.
DATA AVAILABILITY STATEMENT
There are no new data associated with this article.
CONFLICT OF INTEREST STATEMENT
None declared.
REFERENCES
- 1.Holdren J. Memorandum for the Heads of Executive Departments and Agencies: Increasing Access to the Results of Federally Funded Scientific Research. 2013. https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf. Accessed April 5, 2019.
- 2.National Institutes of Health. Final NIH Statement on Sharing Research Data. 2003. https://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html. Accessed April 8, 2019.
- 3.National Science Foundation. Proposal & Award Policies & Procedures Guide. 2019. https://www.nsf.gov/pubs/policydocs/pappg19_1/pappg_11.jsp#XID4. Accessed April 8, 2019.
- 4.Taichman DB, Sahni P, Pinborg A, et al. Data sharing statements for clinical trials: a requirement of the International Committee of Medical Journal Editors. Ann Intern Med 2017; 167 (1): 63–5.
- 5.U.S. Department of Health and Human Services. Guidance Regarding Methods for De-Identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. 2012. https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html. Accessed April 18, 2019.
- 6.O'Keefe CM, Rubin DB. Individual privacy versus public good: protecting confidentiality in health research. Stat Med 2015; 34 (23): 3081–103.
- 7.Matthews GJ, Harel O. Data confidentiality: a review of methods for statistical disclosure limitation and methods for assessing privacy. Statist Surv 2011; 5: 1–29.
- 8.Sweeney L. Weaving technology and policy together to maintain confidentiality. J Law Med Ethics 1997; 25 (2–3): 98–100.
- 9.Benitez K, Malin B. Evaluating re-identification risks with respect to the HIPAA privacy rule. J Am Med Inform Assoc 2010; 17 (2): 169–77.
- 10.Malin B, Benitez K, Masys D. Never too old for anonymity: a statistical standard for demographic data sharing via the HIPAA Privacy Rule. J Am Med Inform Assoc 2011; 18 (1): 3–10.
- 11.Skinner C. Statistical disclosure risk: separating potential and harm. Int Stat Rev 2012; 80 (3): 349–68.
- 12.Taylor L, Zhou X-H, Rise P. A tutorial in assessing disclosure risk in microdata. Stat Med 2018; 37 (25): 3693–706.
- 13.Dwork C, McSherry F, Nissim K, et al. Calibrating noise to sensitivity in private data analysis. In: Halevi S, Rabin T, eds. Theory of Cryptography TCC 2006. Berlin, Heidelberg: Springer; 2006: 265–284.
- 14.Dwork C, Roth A. The algorithmic foundations of differential privacy. Fnt Theor Comput Sci 2013; 9 (3–4): 211–407.
- 15.Lee J, Clifton C. How much is enough? Choosing epsilon for differential privacy. Inf Secur 2011; 7001: 325–40.
- 16.Hsu J, Gaboardi M, Haeberlen A, et al. Differential privacy: an economic method for choosing epsilon. In: 2014 IEEE 27th Computer Security Foundations Symposium (CSF); 2014.
- 17.Naldi M, D’Acquisto G. Differential privacy: an estimation theory-based method for choosing epsilon. arXiv, doi: https://arxiv.org/abs/1510.00917, 4 Oct 2015, preprint: not peer reviewed.
- 18.Dwork C, Kohli N, Mulligan D. Differential privacy in practice: expose your Epsilons! J Priv Confid 2019; 9 (2): 1–22.
- 19.Kapelke C. Using differential privacy to harness big data and preserve privacy. 2020. https://www.brookings.edu/techstream/using-differential-privacy-to-harness-big-data-and-preserve-privacy/. Accessed May 7, 2021.
- 20.Jain P, Gyanchandani M, Khare N. Differential privacy: its technological prescriptive using big data. J Big Data 2018; 5 (1): 15.
- 21.Jain P, Gyanchandani M, Khare N. Big data privacy: a technological perspective and review. J Big Data 2016; 3 (1): 25.
- 22.Yao X, Zhou X, Ma J. Differential privacy of big data: an overview. In: 2016 IEEE 2nd International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS); 2016.
- 23.Dankar F, El Emam K. Practicing differential privacy in health care: a review. Trans Data Priv 2013; 5: 35–67.
- 24.Arksey H, O'Malley L. Scoping studies: towards a methodological framework. Int J Soc Res Methodol 2005; 8 (1): 19–32.
- 25.Dennis S, Garrett P, Yim H, et al. Privacy versus open science. Behav Res Methods 2019; 51 (4): 1839–48.
- 26.Jiang X, Sarwate AD, Ohno-Machado L. Privacy technology to support data sharing for comparative effectiveness research: a systematic review. Med Care 2013; 51 (8 Suppl 3): S58–65.
- 27.Al Aziz M, Sadat M, Alhadidi D, et al. Privacy-preserving techniques of genomic data—a survey. Brief Bioinform 2019; 20 (3): 887–95.
- 28.Shi X, Wu X. An overview of human genetic privacy. Ann N Y Acad Sci 2017; 1387 (1): 61–72.
- 29.Wang S, Jiang X, Singh S, et al. Genome privacy: challenges, technical approaches to mitigate risk, and ethical considerations in the United States. Ann N Y Acad Sci 2017; 1387 (1): 73–83.
- 30.Mehta SR, Vinterbo SA, Little SJ. Ensuring privacy in the study of pathogen genetics. Lancet Infect Dis 2014; 14 (8): 773–7.
- 31.Yakubu A, Chen Y-P. Ensuring privacy and security of genomic data and functionalities. Brief Bioinform 2020; 21 (2): 511–26.
- 32.Dwork C, Pottenger R. Toward practicing privacy. J Am Med Inform Assoc 2013; 20 (1): 102–8.
- 33.Khokhar RH, Chen R, Fung BCM, et al. Quantifying the costs and benefits of privacy-preserving health data publishing. J Biomed Inform 2014; 50: 107–21.
- 34.Santos-Lozada AR, Howard JT, Verdery AM. How differential privacy will affect our understanding of health disparities in the United States. Proc Natl Acad Sci U S A 2020; 117 (24): 13405–12.
- 35.Krieger N, Nethery RC, Chen JT, et al. Impact of differential privacy and census tract data source (decennial census versus American Community Survey) for monitoring health inequities. Am J Public Health 2021; 111 (2): 265–8.
- 36.Xu H, Zhang N. Privacy in health disparity research. Med Care 2019; 57 (Suppl 2): S172–75.
- 37.Calero Valdez A, Ziefle M. The users’ perspective on the privacy-utility trade-offs in health recommender systems. Int J Human Comput Stud 2019; 121: 108–21.
- 38.Matthews GJ, Harel O. Assessing the privacy of randomized vector-valued queries to a database using the area under the receiver operating characteristic curve. Health Serv Outcomes Res Method 2012; 12 (2–3): 141–55.
- 39.Matthews GJ, Harel O, Aseltine RH Jr. Assessing database privacy using the area under the receiver-operator characteristic curve. Health Serv Outcomes Res Method 2010; 10 (1-2): 1–15.
- 40.Vu D, Slavkovic A. Differential privacy for clinical trial data: preliminary evaluations. In: 2009 IEEE International Conference on Data Mining Workshops (ICDMW); 2009.
- 41.Liu X, Zhou P, Qiu T, et al. Blockchain-enabled contextual online learning under local differential privacy for coronary heart disease diagnosis in mobile edge computing. IEEE J Biomed Health Inform 2020; 24 (8): 2177–88.
- 42.Niinimäki T, Heikkila M, Honkela A, et al. Representation transfer for differentially private drug sensitivity prediction. Bioinformatics 2019; 35 (14): i218–24.
- 43.Honkela A, Das M, Nieminen A, et al. Efficient differentially private learning improves drug sensitivity prediction. Biol Direct 2018; 13 (1): 1.
- 44.Bonomi L, Jiang X, Ohno-Machado L. Protecting patient privacy in survival analyses. J Am Med Inform Assoc 2020; 27 (3): 366–75.
- 45.Beaulieu-Jones B, Wu Z, Williams C, et al. Privacy-preserving generative deep neural networks support clinical data sharing. Circ Cardiovasc Qual Outcomes 2019; 12 (7): e005122.
- 46.Lee D, Yu H, Jiang X, et al. Generating sequential electronic health records using dual adversarial autoencoder. J Am Med Inform Assoc 2020; 27 (9): 1411–9.
- 47.Almadhoun N, Ayday E, Ulusoy O. Differential privacy under dependent tuples-the case of genomic privacy. Bioinformatics 2020; 36 (6): 1696–703.
- 48.Simmons S, Berger B. Realizing privacy preserving genome-wide association studies. Bioinformatics 2016; 32 (9): 1293–300.
- 49.Wang M, Ji Z, Wang S, et al. Mechanisms to protect the privacy of families when using the transmission disequilibrium test in genome-wide association studies. Bioinformatics 2017; 33 (23): 3716–25.
- 50.Yu F, Fienberg SE, Slavković AB, et al. Scalable privacy-preserving data sharing methodology for genome-wide association studies. J Biomed Inform 2014; 50: 133–41.
- 51.Kim JW, Jang B, Yoo H. Privacy-preserving aggregation of personal health data streams. PLoS One 2018; 13 (11): e0207639.
- 52.Lin C, Song Z, Song H, et al. Differential privacy preserving in big data analytics for connected health. J Med Syst 2016; 40 (4): 97.
- 53.Wu X, Khosravi MR, Qi L, et al. Locally private frequency estimation of physical symptoms for infectious disease analysis in Internet of Medical Things. Comput Commun 2020; 162: 139–51.
- 54.Ren H, Li H, Liang X, et al. Privacy-enhanced and multifunctional health data aggregation under differential privacy guarantees. Sensors 2016; 16 (9): 1463.
- 55.Saleheen N, Chakraborty S, Ali N, et al. mSieve: differential behavioral privacy in time series of mobile sensor data. Proc ACM Int Conf Ubiquitous Comput 2016; 2016: 706–17.
- 56.Ukil A, Jara AJ, Marin L. Data-driven automated cardiac health management with robust edge analytics and de-risking. Sensors 2019; 19 (12): 2733.
- 57.Li Z, Roberts K, Jiang X, et al. Distributed learning from multiple EHR databases: contextual embedding models for medical events. J Biomed Inform 2019; 92: 103138.
- 58.Ma J, Zhang Q, Lou J, et al. Privacy-preserving tensor factorization for collaborative health data analysis. Proc ACM Int Conf Inf Knowl Manag 2019; 2019: 1291–300.
- 59.Baker B, Abrol A, Silva R, et al. Decentralized temporal independent component analysis: leveraging fMRI data in collaborative settings. Neuroimage 2019; 186: 557–69.
- 60.Le TT, Simmons WK, Misaki M, et al. Differential privacy-based evaporative cooling feature selection and classification with relief-F and random forests. Bioinformatics 2017; 33 (18): 2906–13.
- 61.Plis S, Sarwate A, Turner J, et al. From private sites to big data without compromising privacy: a case of neuroimaging data classification. Value Health 2014; 17 (3): A190.
- 62.Li X, Gu Y, Dvornek N, et al. Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results. Med Image Anal 2020; 65: 101765.
- 63.Cho H, Simmons S, Kim R, et al. Privacy-preserving biomedical database queries with optimal privacy-utility trade-offs. Cell Syst 2020; 10 (5): 408–16.e9.
- 64.Vinterbo S, Sarwate A, Boxwala A. Protecting count queries in study design. J Am Med Inform Assoc 2012; 19 (5): 750–7.
- 65.Mohammed N, Jiang XQ, Chen R, et al. Privacy-preserving heterogeneous health data sharing. J Am Med Inform Assoc 2013; 20 (3): 462–9.
- 66.Li H, Xiong L, Ohno-Machado L, et al. Privacy preserving RBF kernel support vector machine. BioMed Res Int 2014; 2014: 827371.
- 67.Wang M, Ji Z, Kim HE, et al. Selecting optimal subset to release under differentially private M-estimators from hybrid datasets. IEEE Trans Knowl Data Eng 2018; 30 (3): 573–84.
- 68.Krall A, Finke D, Yang H. Gradient mechanism to preserve differential privacy and deter against model inversion attacks in healthcare analytics. Annu Int Conf IEEE Eng Med Biol Soc 2020; 2020: 5714–7.
- 69.Parvandeh S, Yeh H-W, Paulus M, et al. Consensus features nested cross-validation. Bioinformatics 2020; 36 (10): 3093–8.
- 70.Shao R, He H, Chen Z, et al. Stochastic channel-based federated learning with neural network pruning for medical data privacy preservation: model development and experimental validation. JMIR Form Res 2020; 4 (12): e17265.
- 71.Gardner J, Xiong L, Xiao YH, et al. SHARE: system design and case studies for statistical health information release. J Am Med Inform Assoc 2013; 20 (1): 109–16.
- 72.Xiong L. Building data registries with privacy and confidentiality for patient-centered outcomes research (PCOR). 2018. https://hsrproject.nlm.nih.gov/view_hsrproj_record/20152272. Accessed February 12, 2019.
- 73.Froelicher D, Misbach M, Troncoso-Pastoriza JR, et al. MedCo2: privacy-preserving cohort exploration and analysis. Stud Health Technol Inform 2020; 270: 317–21.
- 74.Raisaro J, Marino F, Troncoso-Pastoriza J, et al. SCOR: a secure international informatics infrastructure to investigate COVID-19. J Am Med Inform Assoc 2020; 27 (11): 1721–6.
- 75.Raisaro JL, Troncoso-Pastoriza JR, Misbach M, et al. MedCo: enabling secure and privacy-preserving exploration of distributed clinical and genomic data. IEEE/ACM Trans Comput Biol Bioinform 2019; 16 (4): 1328–41.
- 76.Huang Z, Lin H, Fellay J, et al. SQC: secure quality control for meta-analysis of genome-wide association studies. Bioinformatics 2017; 33 (15): 2273–80.
- 77.Eicher J, Bild R, Spengler H, et al. A comprehensive tool for creating and evaluating privacy-preserving biomedical prediction models. BMC Med Inform Decis Mak 2020; 20 (1): 29.
- 78.Mohammed N, Chen R, Fung B, et al. Differentially private data release for data mining. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2011.
- 79.Gaboardi M, Lim H-W, Rogers RM, et al. Differentially private chi-squared hypothesis testing: goodness of fit and independence testing. Proc Mach Learn Res 2016; 48: 2111–20.
- 80.Shmueli G. To explain or to predict? Statist Sci 2010; 25 (3): 289–310.
- 81.Kakizaki K, Fukuchi K, Sakuma J. Differentially private chi-squared test by unit circle mechanism. Proc Mach Learn Res 2017; 70: 1761–70.
- 82.Rogers R, Kifer D. A new class of private chi-square hypothesis tests. Proc Mach Learn Res 2017; 54: 991–1000.
- 83.Awan J, Slavkovic A. Differentially private uniformly most powerful tests for binomial data. arXiv, doi: https://arxiv.org/abs/1805.09236, 23 May 2018, preprint: not peer reviewed.
- 84.Couch S, Kazan Z, Shi K, et al. A differentially private Wilcoxon signed-rank test. arXiv preprint arXiv, doi: https://arxiv.org/abs/1809.01635, 5 Sep 2018, preprint: not peer reviewed.
- 85.Ding B, Nori H, Li P, et al. Comparing population means under local differential privacy: with significance and power. arXiv, doi: https://arxiv.org/abs/1803.09027, 24 Mar 2018, preprint: not peer reviewed.
- 86.Barrientos AF, Reiter JP, Machanavajjhala A, et al. Differentially private significance tests for regression coefficients. J Comput Graph Stat 2019; 28 (2): 440–24.
- 87.Solea E. Differentially Private Hypothesis Testing for Normal Random Variables [PhD thesis]. State College, Pennsylvania, Department of Statistics, Pennsylvania State University; 2014.
- 88.Kifer D, Machanavajjhala A. No free lunch in data privacy. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data; 2011: 193–204.
- 89.Kifer D, Machanavajjhala A. Pufferfish: a framework for mathematical privacy definitions. ACM Trans Database Syst 2014; 39 (1): 1–36.
- 90.Yang B, Sato I, Nakagawa H. Bayesian differential privacy on correlated data. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data; 2015: 747–62.
- 91.Zhang T, Zhu T, Liu R, et al. Correlated data in differential privacy: definition and analysis. Concurr Comp Pract Exp 2020 Sep 19 [E-pub ahead of print].
- 92.Zhao J, Zhang J, Poor HV. Dependent differential privacy for correlated data. In: 2017 IEEE Globecom Workshops (GC Wkshps); 2017.