Published in final edited form as: Nat Med. 2019 Jan 7;25(1):37–43. doi: 10.1038/s41591-018-0272-7

Privacy in the Age of Medical Big Data

W. Nicholson Price II, I. Glenn Cohen
PMCID: PMC6376961  NIHMSID: NIHMS1005787  PMID: 30617331

Abstract

Big data has become the ubiquitous watchword of medical innovation. The rapid development of machine-learning techniques and artificial intelligence, in particular, has promised to revolutionize medical practice from the allocation of resources to the diagnosis of complex diseases. But with big data come big risks and challenges, among them significant questions about patient privacy. This article outlines the legal and ethical challenges big data brings to patient privacy. It discusses, among other things, how best to conceive of health privacy; the importance of equity, consent, and patient governance in data collection; discrimination in data uses; and how to handle data breaches. It closes by sketching possible ways forward for the regulatory system.

Keywords: privacy, big data, AI, machine learning, health information, HIPAA, electronic health records

Introduction

Big data has come to medicine. Its advocates promise increased accountability, quality, efficiency, and innovation. Most recently, the rapid development of machine-learning techniques and artificial intelligence has promised to wring even more useful applications from big data, from resource allocation to complex disease diagnosis.1 But with big data come big risks and challenges, among them significant questions about patient privacy. In this article, we examine the host of ethical concerns raised and the legal responses to them. At the same time, attempts to reduce privacy risks bring their own costs that must be considered, both for current patients and for the system as a whole.

We begin by discussing the benefits big data may bring to health science and practice, before turning to the concerns big data raises in these contexts. We focus on a prominent (but not the only) worry: privacy violations. We present a basic theory of health privacy and examine how privacy concerns play out in two phases of the life cycle of big data’s application to health care: data collection and data use. We ground these concerns in a discussion of relevant U.S. law, a key feature of the health data world faced by innovators in this space, and make some regulatory recommendations. We argue, counter to the current zeitgeist, that while too little privacy raises concerns, too much privacy in this area can also pose problems.

Why do we need big data in health?

Big data has long promised to substantially improve health care. But what is big data and why does it matter? Big data is often defined by “three Vs”: volume (large amounts of data), velocity (high speed of access and analysis), and variety (substantial data heterogeneity across individuals and data types), all of which appear in medical data.2 We can organize the research applications of big data into two rough groups: long-practiced analysis approaches, and newer methods using machine learning and artificial intelligence.

Big data enables more powerful evaluations of health care quality and efficiency, which can themselves be used to promote care improvement.3 Currently, much care remains relatively untracked and under-analyzed; amid persistent evidence of ineffective treatment, substantial waste, and medical error,4 understanding what works and what doesn’t is crucial to systemic improvement. Big data can help: it can be leveraged to measure hospital quality, as in the Centers for Medicare and Medicaid Services’ Hospital Inpatient Quality Reporting program;5 to develop scientific hypotheses, as with proliferating genome-wide association studies;6 to compare the effectiveness of different interventions, as in the Patient-Centered Outcomes Research Institute;7 and to monitor drug and device safety, as with the FDA’s Sentinel system.8

A new, rapidly developing set of tools uses artificial intelligence techniques to find patterns in big health data that can then be used to make predictions and recommendations in care.9, 10 The best-known of these tools, beginning to enter clinical use, involves image analysis. Algorithms have been able to identify cancerous skin lesions from images as accurately as trained dermatologists,11 and EyeDiagnosis has recently received FDA approval for image-based AI diagnosis of diabetic retinopathy. Further afield, AI can be used for prognostic purposes—to predict when trauma patients are about to suffer a catastrophic hemorrhage and need immediate intervention12 or when patients are very likely to die within a year and therefore might consider shifting from traditional care to palliative care.13

AI algorithms could also make recommendations for treatment (see Box 1). Finally, and somewhat controversially, AI algorithms could help make resource allocation decisions (see Box 1).1 All of these uses require very large sets of data from health care: how patients have been treated; how they have responded; and data about those patients themselves, including genetic data, family history, health behavior, and vital signs.14 Without these data, algorithms cannot be trained or—once trained—evaluated on how they perform.15
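To make concrete why algorithm development depends on patient-level data at both the training and the evaluation stage, the sketch below is a purely illustrative example in Python: the records are synthetic, the variables (age, heart rate, lactate) are invented stand-ins, and no real clinical model is this simple.

```python
# Purely illustrative sketch: a toy "deterioration risk" model trained on
# synthetic patient-level data. Variable names and the outcome are invented;
# no real clinical model is this simple.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000

# Synthetic stand-ins for patient-level variables.
age = rng.normal(60, 15, n)
heart_rate = rng.normal(80, 12, n)
lactate = rng.gamma(2.0, 1.0, n)
X = np.column_stack([age, heart_rate, lactate])

# Synthetic outcome: probability of deterioration rises with each variable.
logit = -8 + 0.05 * age + 0.03 * heart_rate + 0.8 * lactate
y = rng.random(n) < 1 / (1 + np.exp(-logit))

# Both steps consume patient data: records to fit the model, and held-out
# records to say anything about how it performs on patients it has not seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Held-out AUROC: {roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]):.2f}")
```

The held-out test split makes the point in the text explicit: without additional patient records that the model has not seen, there is no way to assess how it will perform in practice.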

Box 1. Vignettes illustrating possible uses of big data:

  • Scott suffers from liver cancer. Anita, his physician, is deciding which chemotherapeutics to administer. She turns to the CancerChoice module in the hospital’s electronic health record (EHR) system. This module pulls data from Scott’s EHR—his medical history, family history, and genetic sequence—but also automatically links to large collections of commercially collected data to acquire additional information about Scott’s shopping, eating, and exercise habits, which can help inform treatment choice. The module then makes a recommendation by combining all the data it has gleaned about Scott with similar data— both health-care and health-related lifestyle data—from millions of patients across the country.

  • Samantha presents at Chicago Hope Hospital with moderate organ dysfunction; the physician is trying to decide whether to send Samantha to a specialized ICU; Samantha might benefit, but beds are limited and other patients might benefit more. In traditional medicine, assessing a patient’s risk for cardiopulmonary arrest or other preventable serious adverse events might take hours, the assessment has limited prognostic accuracy, and the risk may change during that period. Imagine that an alternative assessment mechanism is available. CorazonAI has developed a predictive analytics engine, based on data from millions of U.S. patients’ electronic health records, that could ascertain the risk accurately for hundreds of patients with real-time updates to help the physician evaluate who should be admitted to the ICU.1 The physician uses the system, which advises that Samantha be admitted.

In these vignettes, has patients’ privacy been violated? Are these violations unethical? Are they ones the law should prohibit? And how do these concerns stack up against the benefits gleaned from using big data in the health context?

The next evolution of big data in health care—an evolution slowly gaining momentum—lies in the development of learning health systems.16 In learning health systems, the traditional boundary between clinical research and care is eroded—although even in more traditional health system designs, there is significant fuzziness, doubt, and gamesmanship as to the line between “research” and “quality improvement” or “innovation,” with important ramifications for regulatory review.17, 18,19, 20, 21 In learning health systems, data are collected routinely in the process of care, with the explicit aim that those data be used for the purpose of analyzing and improving care. Just as data are continuously collected, they are continuously analyzed for patterns in the process of care, procedures that can be improved, and other underlying patterns such as differential patient response to different treatment.22 Finally, these new insights are routinely incorporated back into the clinical care pathway, whether explicitly (in practice guidelines or publications) or implicitly (in the context of recommendations or procedures automatically embedded into electronic health record systems). The concept of a learning health system can be applied either through explicit learning mechanisms or through artificial intelligence algorithms, though at least for the foreseeable future we would expect humans to remain embedded firmly within the loop of learning-analysis-implementation.

I. How to Think About Health Privacy

The concept of privacy is notoriously difficult to define. One currently prominent view connects privacy to context. There are contextual rules about how information can flow that depend on the actors involved, the process by which information is accessed, the frequency of the access, and the purpose of that access.23, 24 When these contextual rules are contravened, we say there has been a privacy violation. Such violations can occur because the wrong actor gains access to the information, because the process by which information may be accessed is violated, or because the purpose of access is inappropriate. When we think of normative reasons why such violations are problematic, they can be divided (with some simplification) into two categories—consequentialist and deontological concerns. Two caveats are in order: first, some privacy violations raise issues in both categories. Second, some concerns we discuss are also present for “small data” collection. Big data settings, however, have a tendency to increase the number of persons affected, the severity of the effects, and the difficulty for aggrieved individuals to engage in preventive or self-help measures.

Consequentialist concerns

Consequentialist concerns result from negative consequences that affect the person whose privacy has been violated. These can be tangible negative consequences—for example, one’s long-term-care insurance premium goes up because of information exposed through a privacy breach, one experiences employment discrimination, or one’s HIV status becomes known to those in one’s social circle—or they can be the emotional distress associated with knowing that private medical information is “out there” and potentially exploited by others—consider the potential for increased anxiety if one believed one was now susceptible to identity theft, even before any misuse of one’s identity has occurred.

Deontological concerns

Deontological concerns do not depend on experiencing negative consequences. In this category, the concern from a privacy violation manifests even if no one uses a person’s information against that person, or the person never even becomes aware that a breach has occurred. One may be wronged by a privacy breach even if one has not been harmed. For example, suppose that an organization unscrupulously or inadvertently gains access to data you store on your smartphone as part of a larger data dragnet. After reviewing it, including photos you have taken of an embarrassing personal ailment, the organization realizes your data are valueless to them and destroys the record. You never find out this happened. Those reviewing your data live abroad and will never encounter you or anyone who knows you. It is hard to say you have been harmed in a consequentialist sense, but many think the loss of control over your data, the invasion, is itself ethically problematic even absent harm. This is a deontological concern.

Gathering data

Custodian-specific vs. blanket provisions

The gathering of medical data raises many legal and ethical privacy questions. We focus here on the treatment of health data in the U.S., but it is worth comparing the U.S. approach to the E.U. approach.25 Health data come from many different sources: electronic health records, insurance claims, Internet of Things devices, social media posts, to name but a few. U.S. privacy law treats health data differently depending on how they are created and who is handling the data—that is, who is the custodian. By contrast, the E.U. General Data Protection Regulation sets out a single broadly defined regime for health data (as well as other data), no matter what format, how it is collected, or who the custodian is.26 It defines the category of “data concerning health” broadly to mean “personal data related to the physical or mental health of a natural person, including the provision of health care services, which reveal information about his or her health status.”27

The custodians that U.S. law focuses on are physicians, health systems, and their business associates. The major U.S. federal law governing health data privacy is the Privacy Rule created under the Health Insurance Portability and Accountability Act (HIPAA)—there are also state-specific privacy laws, and the federal Common Rule, which protects research subjects, but they are not our focus here.28

Under the HIPAA Privacy Rule, “covered entities” are prohibited from using or disclosing “protected health information” (PHI) except in a specified list of circumstances; “business associates” face similar limitations under required contracts with covered entities.29 The definition of PHI is broad, including most individually identifiable health information; “covered entities” includes most health care providers, health insurance companies, and “health information clearinghouses.”29

HIPAA creates a set of rules that are arguably both overprotective and underprotective of privacy (HIPAA also directly protects information security through a separate Security Rule30). On the overprotective side, while HIPAA allows use of PHI for health care treatment (including “quality improvement”), operations, payment, public health, and law enforcement, it does not allow the use of PHI for research (that is, the systematic production of generalizable knowledge) without IRB waiver or patient authorization.31

As to health data covered by HIPAA, the rule also has gaps. One of HIPAA’s most important strategies for protecting patients from privacy breaches while enabling data sharing is deidentifying their data by removing a set of 18 specified identifiers like names and email addresses.32 However, deidentified data may become re-identifiable through data triangulation from other data sets (see Text Box 2).25, 33, 34, 35 Moreover, HIPAA focuses its regulation on particular actors and their activities, not the data themselves. For instance, patients have the right under HIPAA to request their own health data, and some concerted efforts encourage them to do so;36, 37 but once patients give those data to another individual, HIPAA does not restrict the use or disclosure of those data (unless the recipient is another covered entity or a business associate).25
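As a purely hypothetical illustration of this identifier-removal approach (the field names below are invented; the actual rule enumerates 18 categories of identifiers, including names, most geographic subdivisions, and dates more specific than the year), deidentification can be pictured as dropping direct identifiers from each record:

```python
# Hypothetical sketch of HIPAA Safe Harbor-style deidentification: dropping
# direct identifiers from a patient record. Field names are invented; the real
# Safe Harbor list specifies 18 categories of identifiers.
SAFE_HARBOR_FIELDS = {
    "name", "street_address", "email", "phone", "ssn",
    "medical_record_number", "birth_date", "zip",
}

def deidentify(record: dict) -> dict:
    """Return a copy of a patient record with direct identifiers removed."""
    return {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}

patient = {
    "name": "Jane Q. Patient",
    "email": "jane@example.com",
    "birth_date": "1962-02-14",
    "zip": "02139",
    "year_of_birth": 1962,
    "diagnosis": "asthma",
}
print(deidentify(patient))  # retains only year of birth and diagnosis
```

As Text Box 2 illustrates, the residual fields that survive this kind of stripping can still act as quasi-identifiers when combined with other data sets.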

Text Box 2. The challenge of multiple data sets for re-identifiability.

Many assume that “anonymized” data cannot be used to re-identify the subject of the data. Unfortunately, as data sets proliferate, the ability to combine multiple data sets may defeat the deidentification strategy. The most famous example, which preceded HIPAA, was demonstrated by Latanya Sweeney, then a graduate student. In the 1990s, the state of Massachusetts purchased health insurance for state employees and subsequently released records summarizing every state employee’s hospital visits at no cost to any researcher who requested the data. Then-Governor William Weld assured the public the data had been scrubbed to defeat re-identification, by removing information such as names, addresses, and Social Security numbers. Unfortunately, many patient attributes were not scrubbed. Sweeney knew Weld resided in the city of Cambridge and purchased the complete voter rolls from the city, which contained the name, address, ZIP code, birth date, and sex of every voter in the city. She paired those data with the state health insurance data to demonstrate that one could re-identify Weld’s prescriptions, diagnoses, and medical history. 74, 75

A more recent example of the same problem outside of medicine pertains to the prize offered by Netflix in the mid-2000s to improve its movie recommendation algorithm. To enable the competition, Netflix publicly released one hundred million records revealing the movie ratings of hundreds of thousands of users from 1999 to 2005. Netflix stripped identifying information, but added unique user numbers to group ratings by users. Two researchers from the University of Texas, Arvind Narayanan and Vitaly Shmatikov, showed that one could nonetheless re-identify Netflix users by linking to other data sets. In particular, they drew on the publicly available data from the Internet Movie Database (IMDb), wherein users also rate movies but do so publicly, to offer a proof of concept. They showed that: “Given a user’s public IMDb ratings, which the user posted voluntarily to reveal some of his … movie likes and dislikes, we discover all ratings that he entered privately into the Netflix system.” Their reidentification strategy took advantage of ratings for more obscure movies in both systems and also the times at which reviews were posted. To be sure, neither of these examples is meant to show that deidentification is never possible or that reidentification will always be easy. Instead, they are meant to show how the increase in the number of datasets and linking of information makes reidentification more plausible even for data that had otherwise been thought deidentified. 75, 76
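To make the mechanics of such linkage attacks concrete, the following sketch is a purely illustrative example in Python using pandas: every record, name, and column label is invented, and real attacks involve far larger tables, but the core operation is the same join on quasi-identifiers that Sweeney performed.

```python
# Illustrative sketch of a linkage attack in the spirit of the Sweeney example.
# Every record, name, and column label here is invented.
import pandas as pd

deidentified_visits = pd.DataFrame({
    "zip": ["02138", "02139", "02138"],
    "birth_date": ["1945-07-31", "1962-02-14", "1950-01-05"],
    "sex": ["M", "F", "F"],
    "diagnosis": ["hypertension", "asthma", "diabetes"],
})

public_voter_roll = pd.DataFrame({
    "name": ["A. Able", "B. Baker", "C. Clark"],
    "zip": ["02138", "02139", "02141"],
    "birth_date": ["1945-07-31", "1962-02-14", "1980-09-09"],
    "sex": ["M", "F", "F"],
})

# If a combination of ZIP code, birth date, and sex is unique in both tables,
# the join re-attaches a name to a record that was treated as deidentified.
reidentified = deidentified_visits.merge(
    public_voter_roll, on=["zip", "birth_date", "sex"], how="inner"
)
print(reidentified[["name", "diagnosis"]])
```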

But the more fundamental problem is that the majority of health data are not covered by HIPAA at all. When Congress enacted HIPAA in 1996, it envisioned a regime where most health data would be held in health records and accordingly focused on health care providers and other covered entities. In the big-data world, the types of data sources covered by HIPAA are but a small part of a larger health data ecosystem. HIPAA does not cover health care data generated outside of covered entities and business associates, such as health care related information recorded by life insurance companies. Nor does it cover health (as opposed to health care) data generated by a myriad of people or products other than the patient. It does not cover user-generated information about health, such as the use of a blood-sugar-tracking smartphone app or a set of Google searches about particular symptoms and insurance coverage for serious disorders. And it certainly does not cover the huge volume of data that is not about health at all, but permits inferences about health—such as when the information about a shopper’s Target purchases famously revealed her pregnancy.35, 38, 39 This focus on data specifically arising from health care contrasts with European regulation of “data concerning health” more generally.

We are already entering a future where traditional health care actors, HIPAA’s “covered entities,” are being supplanted in the health data space by behemoths such as Google, Apple, and IBM—all of which operate outside of HIPAA’s regime. While, as we discuss below, some laws may protect particular uses of those data, at the moment there is little in the U.S. to protect patients from these threats to their health privacy.

Equitable data collection

Another concern is not that too much data is taken from patients, but that data collection is not occurring equitably. As an ethical matter, data collection is best justified as a kind of “bargain” struck between data sources and data users—provide us your data, recognizing this may encroach in some ways on your privacy, because it will permit us to provide advances in health care that will improve your life. When this balance is off, the bargain may break down. As has been shown for predictive analytics in policing, existing bias can reappear in data mining, as when racial disparities in policing patterns result in racially biased predictions of criminal activity.40 Unfortunately, health data have many of the same problems. Marginalized populations that are missing from non-health data such as credit card use or internet history—leading to biases in credit scores or consumer profiles—may also be absent from big health data, such as genomic databases or electronic health records, due in part to lack of health insurance, the inability to access healthcare, and a number of other reasons.41 The distributional consequences of this lack of inclusion in “big data” are complex; in some instances it may favor but in other instances disfavor those whose data are missing. For example, consider an allocation decision between multiple patients as to a scarce medical resource. If a particular minority group actually responds less well to the medical intervention than other groups, failure to collect information on the minority group might lead the algorithm to give the minority patient more priority than it would have had the data been included. If the minority group responds better than other groups, the opposite effect might result. Whichever way it cuts, though, the result will be that the system’s prediction will be problematically biased. This is a hard nut to crack, among other reasons because of contested and incompatible definitions of “fairness” in the predictive analytics space.42, 43 One solution in our contexts, of course, is better access to health care for underserved populations, but if that goal is reached it will not be because of big data needs. Statistical adjustment for data gaps may help mitigate the problem somewhat, but this is an area to which funders (especially public funders) should be attuned. The All of Us program, for instance, aims to develop a nationally representative sample for its genomic work.44 While that ambition will not be realizable for all big data research, funders should consider asking applicants to explicitly address their strategies for making their data sets more inclusive and take that into consideration in allocating grants.

The role of the patient in data collection and access

To what extent should an individual’s data be available for use in predictive analytics without her consent (for example, the use of electronic health records without consent to build the proprietary CancerChoice model discussed in Box 1)? Especially for deontological concerns with health privacy, the loss of control over who accesses one’s data and for what purpose matters, even if there are no material consequences for the individual or the individual does not even know.

Should some health data be seen as a kind of public good that can be conscripted for some potentially publicly minded uses? Here the notion of privacy as stemming from contextual rules, discussed above, is particularly helpful. The ethical analysis will depend heavily on the type of data, including its identifiability; who will be accessing it; and for what purpose. Take one data source, EHR data stripped of the 18 HIPAA identifiers. One might feel differently about the CDC accessing it for flu tracking purposes, compared to a hospital system using it to reevaluate its staffing and workflow to improve both cost efficiency and patient experience, compared to a pharmaceutical company using it for product development. Even if privacy is violated, it may be that, all things considered, the violation is outweighed by equitably distributed benefits in some cases. As a guiding principle for this analysis, one might think that individually unconsented use is more appropriate (especially for relatively deidentified data) the more the contributing patient will benefit from the data use—a principle of reciprocity—and where the risks to the patient (including the consequentialist risks discussed below) are low, such that the “ask” of patients is small compared to the benefit—a principle of proportionality.17, 45

Second, whether or not patients consent for their data to be included within a set, what role should they have in deciding what kind of uses of their data are permissible? This is a question of designing a governance regime—and it matters to patient privacy because, as discussed below, many of the privacy harms of big health data arise not merely in the collection of data, but in their eventual use. On one extreme, one could imagine enabling every patient to approve every access to every piece of data individually after a purpose has been stated—a regime that would maximize patient autonomy but could eliminate most work using big health data.46

On the other extreme, one could treat data as completely “alienable,” such that the patient retains no rights of control, whether by external mandate or by “broad consent,” as has been proposed in the biobank context.47, 48 As noted, our conception of privacy is contextual and the analysis will depend on the specifics of who seeks access to what data in what way for what purpose. For many cases, though, the optimal governance regime may lie somewhere in the middle. This might involve, for example, chartering a steering board that includes representative patients in deciding which requests for data to permit and under what circumstances. One analogy would be the Independent Review Panels that have been used to approve or deny requests for the sharing of clinical trial data.49 A slightly different approach would be to actually put the data in a charitable trust, with trustees (some of whom would be patient representatives) making decisions about access conditions and approved uses while owing fiduciary duties to the patients whose data are used, a model championed by some for biobanks.50

Still another approach is what Barbara Evans, a law professor at the University of Houston, calls “consumer driven data commons,” “institutions that enable groups of consenting individuals to collaborate to assemble powerful, large-scale health data resources for use in scientific research, on terms the group members themselves would set.”51 There are many other governance possibilities,52 including so-called “citizen juries” that have been used in the UK in these domains,53 but especially where individualized patient consent will not be used it is important to have patient representatives involved in the crucial decisions about how their data will be used.

While approaches built on any of these models may be feasible at the current moment, they may be less feasible in a future where data sets—containing not only huge amounts but huge varieties of data—are used for multiple different analyses. Such cross-context datasets and data-uses—using collections of consumer data to make health predictions, or collections of health data to target advertising, or joint collections to do both—would make it harder to meaningfully set one governance regime for consumer data and another for health data. And to the extent that policymakers today require context-specific regimes, they may limit exactly that future development of cross-context datasets, for good and ill.

II. Data Uses

In this section, we outline major legal and ethical privacy issues raised by using already-collected patient data, especially in AI-driven systems, and approaches for addressing them.

Discrimination based upon health data

The use of patient-derived big data in medicine can lead to consequentialist privacy concerns. One well-characterized set of objective harms comes from the possibility of discrimination: if employers or insurers learn of sensitive patient information from medical data, such as a debilitating or expensive disease, they may wish not to employ or insure that person, especially since in the U.S. health insurance is typically tied to employment.54 Some would argue that this type of discrimination is justified under a principle of “actuarial fairness,” where everyone should pay or be paid according to their risk as precisely as possible55—an enterprise which big data could make much easier. This raises some very fundamental questions about whether to favor a notion of “to each according to his risk” as opposed to a more solidaristic view of insurance, whereby to some extent we redistribute through insurance pooling.56 In any event, our existing laws in health insurance and employment contexts have favored the latter view, prohibiting some but not all of this sort of discrimination:57

The Genetic Information Nondiscrimination Act (GINA) prohibits discrimination by health insurers or employers on the basis of genetic information, the Americans with Disabilities Act (ADA) prohibits discrimination in employment and insurance based on medical conditions that are disabilities, and the Patient Protection and Affordable Care Act (PPACA) prohibits health insurance discrimination and sharply limits medical underwriting. These laws represent an attempt to limit consequentialist privacy harms by limiting consequences of access to data, rather than focusing on protecting data themselves (though GINA does also include some limits on data acquisition). But these laws have important limits. The ADA, for example, will not limit uses of big data to adversely treat “people who are currently healthy but are perceived as being at high risk of becoming sick in the future.”58 Neither GINA nor the ADA reaches life insurance. And even when these laws do apply, they can be hard to enforce because it is often hard to know when discrimination has occurred. Moreover, other kinds of consequentialist harms are hard to address through law at all, such as stigma that can arise from others knowing about a sexually transmitted infection or learning that a child’s parent is not the child’s biological parent.

A recent survey of clinical trial participants on the sharing of participant-level clinical trial data beyond genomic information found that 6.6% were “very concerned” and 14.9% were “somewhat concerned” that “I could be discriminated against if the information was linked back to me,” but as the authors acknowledge, specific characteristics of that study population, especially the fact that they had already decided to participate in clinical trials, may make it a poor predictor of general public attitudes on these questions. 59, 60

Sharing of private information

A second set of consequentialist privacy harms involves more subjective injuries. Patients whose private health information becomes available can suffer embarrassment, paranoia or mental pain. Even though these injuries may not have measurable external effects—the patients may suffer no financial injury or encounter no stigma from others—they are still injuries.61 Laws like GINA, the ADA, or the PPACA have little purchase on this type of injury.

Big data also raises the possibility of more dignitary harms. In order to live a flourishing life, it is important that there be part of our lives that is ours alone, that others do not know unless we share it with them. Facts about our health are particularly sensitive and private. In some instances, big data permits direct knowledge of our health by those we would not want to access the information— whether through inadvertent disclosure or malicious activities such as hacking. Most people are woefully unaware of the uses to which their data may be put; a particularly salient example comes from use of the GEDmatch genetic database to help identify the Golden State Killer.62 This example also helpfully illustrates the problem that information shared about one individual may reveal information about other individuals—here, genetic relatives—who are unaware that potentially revealing information has been shared, and who have not consented to the sharing.

A more subtle and more difficult issue raised by predictive analytics is whether our privacy is breached when others make inferences about us, as opposed to knowing things about us.63 Jeff Skopek, a law professor at Cambridge University, argues that “data mining often generates knowledge about people through the process of inference rather than direct observation or access, and there are both legal and normative grounds for rejecting the notion that inferences can violate privacy.”64 To put the question another way, consider pregnancy. If I were to come to believe that you are pregnant by stealing your OB/GYN’s records or tapping your phone, that would clearly represent a privacy violation. However, if we are friends and I reach a belief that you are pregnant by seeing that you stop drinking when we go out for dinner, change your diet, and have put on some weight, it is hard to argue I have violated your privacy. The question is whether big data analysis is more like the former or more like the latter. Of course, big data enables us to make many more inferences with much more confidence than do the friendly observations of pregnancy, but is the deontological analysis about the amount we believe we know or the route by which we believe we know it?

III. A path forward

One reaction to the health privacy violations described above, both deontological and consequentialist, is to sharply limit access to patient data. Particularly if deontological and consequentialist concerns are difficult to decrease ex post, decreasing access to data ex ante seems like an attractive solution.65 Under this approach, if consequential harms are difficult to limit, perhaps data sharing should be limited to the minimal amount necessary in all contexts, data should be retained only for a limited time, or data should be intentionally obfuscated.66 Nevertheless, we argue that limits on data access can bring their own harms.

The basic harm of privacy overprotection is the brakes it puts on data-driven innovation.67 Privacy protections limit both data aggregation, whether in the creation of longitudinal records or in the collation of data from different sources at the same time, and innovative data use. As a straightforward example, data de-identification is a common way to comply with HIPAA requirements—but de-identified data are much harder to link together when a patient sees different providers, gets insurance through different payers over time, or moves state-to-state.31, 51 Patchy, fragmented health data make data-driven innovation hard, imposing both technological and economic hurdles.

Some approaches can protect privacy while minimizing the cost to innovation, and these should be pursued. In some contexts, researchers could use techniques involving pseudonymized data or differential privacy rather than identified data.68, 69, 70 Privacy audits can ensure appropriate use, and security standards should guard against unauthorized use. Data holders should be stewards of data, not privacy-agnostic intermediaries. But in many contexts a privacy/innovation tradeoff will still exist.
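As a minimal illustration of one of these techniques, the sketch below (in Python; the cohort, counts, and parameter values are all hypothetical) applies the Laplace mechanism, the basic building block of differential privacy, to a simple counting query over patient records:

```python
# Minimal sketch of the Laplace mechanism for a differentially private count.
# Epsilon values and the synthetic cohort below are purely illustrative.
import numpy as np

rng = np.random.default_rng(42)

def dp_count(records, epsilon):
    """Return a noisy count of True records satisfying epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one patient changes
    the count by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(records)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: report how many patients in a synthetic cohort have a condition,
# with enough noise that no single patient's inclusion can be inferred.
cohort_has_condition = [True] * 120 + [False] * 880
print(dp_count(cohort_has_condition, epsilon=0.5))  # more noise, more privacy
print(dp_count(cohort_has_condition, epsilon=5.0))  # less noise, less privacy
```

Smaller values of epsilon add more noise and thus more privacy at the cost of accuracy, which is precisely the kind of privacy/innovation balance discussed in the text.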

Privacy also interacts problematically with secrecy. As described above, there are many potential innovations that can arise from data, and some of these may be very lucrative, such as an algorithm that accurately selects cancer drugs. Innovators have incentives to keep data secret to maintain a competitive advantage in development and deployment of such valuable innovations.71 But we might prefer as a society to have access to the data on which such innovations are based: others can use those data to create better predictors from the same data, to aggregate data to find more subtle patterns, or to validate and verify that the original innovator’s research was accurate.

Myriad Genetics’ maintenance of a proprietary database of the genetic sequences and medical history of women who sought BRCA1/2 breast and ovarian cancer predisposition tests exemplifies these concerns; non-Myriad tests returned variants of unknown significance more frequently because Myriad’s data were unavailable, and the data could not be aggregated to provide even better tests.72 Privacy concerns can provide a shield—rhetorical or not—for this type of practice; to the extent that firms can justify keeping proprietary data on the basis that they are protecting patients’ privacy, data sharing is harder to demand.

Privacy-justified secrecy can erode trust in already-opaque big-data innovations. When big data yields surprising insights about how to provide care, providers and patients need to trust the results to implement them. This already creates challenges when the insights come from explicit analyses of big data; when machine-learning and opaque algorithms are involved, trust may be even harder to engender. To the extent that data and algorithms are kept secret under a potentially disingenuous veil of privacy protection, providers and patients will have even less cause for trust in the results.73 To be sure, there are many medical processes whose inner workings are shrouded by trade secrecy and very opaque to patients, but the media attention to and newness of big data and artificial intelligence may make patients particularly nervous about their integration into care.

On the other hand, to the extent that patients concerned about privacy refuse to participate in a data-driven system, those algorithms may not even be developed in the first place. Striking the right balance—protecting privacy so that patients are comfortable providing their data, but not allowing privacy to drive secrecy that reduces validation and trust in the potential benefits arising from those data—will be a tricky challenge for proponents of big data, machine learning, and learning health systems. What is more, the answer will not be uniform. The future of big data privacy will be sensitive to data source, data custodian, and type of data, as well as the importance of data triangulation from multiple sources. But it is important that we not assume privacy maximalism across the board is the way to go. Privacy underprotection and overprotection each create cognizable harms to patients both today and tomorrow.

Acknowledgements:

Thanks to Nicolas Terry and Kayte Spector-Bagdady.

Disclosures: Price and Cohen’s research reported in this publication was done with the support of CeBIL – Collaborative Research Program for Biomedical Innovation Law. CeBIL is a scientifically independent collaborative research program supported by a Novo Nordisk Foundation Grant (Grant number NNF17SA0027784). Price’s work was also supported by the National Cancer Institute (Grant number 1-R01-CA-214829–01-A1; The Lifecycle of Health Data: Policies and Practices). Cohen has served as a consultant for Otsuka Pharmaceuticals on their Abilify MyCite product.

References

  • 1.Cohen IG, Amarasingham R, Shah A, Xie B & Lo B The legal and ethical concerns that arise from using complex predictive analytics in health care. Health Aff 33, 1139–1147 (2014).
  • 2.Executive Office of the President. Big data: seizing opportunities, preserving values https://bigdatawg.nist.gov/pdf/big_data_privacy_report_may_1_2014.pdf (2014).
  • 3.Hoffman S Electronic Health Records and Medical Big Data (Cambridge Univ. Press, New York, 2016).
  • 4.Institute of Medicine. Committee on Quality of Health Care in America, the National Academies. To Err is Human: Building a Safer Health System (eds. Kohn LT, Corrigan JM & Donaldson MS, National Academies Press, Washington, D.C., 2000).
  • 5.Centers for Medicare and Medicaid Services. Hospital inpatient quality reporting program https://www.cms.gov/Medicare/Quality-Initiatives-Patient-Assessment-Instruments/HospitalQualityInits/HospitalRHQDAPU.html.
  • 6.Kohane IS Using electronic health records to drive discovery in disease genomics. Nature Reviews Genetics 12, 417–428 (2011).
  • 7.PCORI, Patient Centered Outcome Research Institute, www.pcori.org.
  • 8.Behrman RE et al. Developing the sentinel system—a national resource for evidence development. N. Engl. J. Med 364, 498–499 (2011).
  • 9.Price WN II Black-box medicine. Harv. J.L. & Tech 28, 419–467 (2016).
  • 10.Terry NP Appification, AI, & healthcare’s new iron triangle Preprint at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3020784 (2018).
  • 11.Esteva A et al., Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
  • 12.Liu NT et al. Development and validation of a machine learning algorithm and hybrid system to predict the need for life-saving interventions in trauma patients. Med. Biol. Eng. Comput 52, 193–203 (2014).
  • 13.Avati A et al. Improving palliative care with deep learning Preprint at https://arxiv.org/pdf/1711.06402.pdf (2018).
  • 14.Spector-Bagdady K & Shuman A Reg-ENT within the learning health system. Otolaryngol. Head Neck Surg 158, 405–406 (2018).
  • 15.Price WN II Regulating black-box medicine. Mich. L. Rev 116, 421–474 (2017).
  • 16.Institute of Medicine. The Learning Healthcare System: Workshop Summary (eds. Olsen LA, Aisner D & McGinnis JM, National Academies Press, Washington, D.C., 2007).
  • 17.Faden RR et al. An ethics framework for a learning health care system: a departure from traditional research ethics and clinical ethics. Hastings Ctr. Rep 43, S16–S27 (2013).
  • 18.Kass NE The research-treatment distinction: a problematic approach for determining which activities should have ethical oversight. Hastings Ctr. Rep 43, S4–S15 (2013).
  • 19.Raval MV, Sakran JV, Medbery RL, Angelos P & Hall BL Distinguishing QI projects from human subjects research: ethical and practical considerations. Bull. Am. Coll. Surg http://bulletin.facs.org/2014/07/distinguishing-qi-projects-from-human-subjects-research-ethical-and-practical-considerations/ (1 July 2014).
  • 20.Miller FG & Emanuel EJ Quality-improvement research and informed consent. N. Engl. J. Med 358, 765–767 (2008).
  • 21.Morreim H Research versus innovation: real differences. Am. J. Bioeth 5, 42–43 (2005).
  • 22.Friedman CP, Wong AK & Blumenthal D Achieving a nationwide learning health system. Sci. Translat. Med 2, 57cm29 (2010).
  • 23.Nissenbaum H Privacy in Context: Technology, Policy, and the Integrity of Social Life (Stanford Univ. Press, Stanford, 2010).
  • 24.Konnoth C An expressive theory of privacy intrusions. Iowa L. Rev 102, 1533–1581 (2017).
  • 25.Terry NP Regulatory disruption and arbitrage in health-care data protection. Yale J. Health Pol’y L. & Ethics 17, 143–208 (2017).
  • 26.Terry NP Existential challenges for healthcare data protection in the United States. Ethics, Med., & Pub. Health 3, 19–27 (2017).
  • 27.Commission Regulation 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with regard to the Processing of Personal Data and on the Free Movement of such Data, and Repealing Directive, 95/46/EC, 2016 O.J. (L 119) 1, 34 (General Data Protection Regulation), available at http://www.privacy-regulation.eu/en/article-4-definitions-GDPR.htm.
  • 28.Spector-Bagdady K, Prince AER, Yu JH & Appelbaum PS, Analysis of state laws on informed consent for clinical genetic testing in the era of genomic sequencing. Am. J. Med. Genet. C. Semin. Med. Genet 178, 81–88 (2018).
  • 29. 45 C.F.R. §§ 160.103, 164.504.
  • 30. 45 C.F.R. Part 160; 45 C.F.R. Part 164, Subparts A & C.
  • 31.Eisenberg RS & Price WN II Promoting healthcare innovation on the demand side. J.L. & Biosciences 4, 3–49 (2017).
  • 32. 45 C.F.R. § 164.514.
  • 33.Gymrek M et al. Identifying personal genomes by surname inference. Science 339, 321–324 (2013).
  • 34.National Committee on Vital and Health Statistics and its Privacy, Security, and Confidentiality Subcommittee, U.S. Department of Health and Human Services. Health information privacy beyond HIPAA: a 2018 environmental scan of major trends and challenges https://ncvhs.hhs.gov/wp-content/uploads/2018/05/NCVHS-Beyond-HIPAA_Report-Final-02-08-18.pdf (2017).
  • 35.Philibert RA et al. Methylation array data can simultaneously identify individuals and convey protected health information: an unrecognized ethical concern. Clin. Epigenetics 6, 28 (2014).
  • 36.Centers for Medicare and Medicaid Services. Blue Button® 2.0: improving medicare beneficiary access to their health information https://www.cms.gov/Research-Statistics-Data-and-Systems/CMS-Information-Technology/Blue-Button/index.html/
  • 37.Couzin-Frankel J After a prominent gene-testing firm declined to give patients their complete data, ACLU filed a complaint. Science http://www.sciencemag.org/news/2016/05/after-prominent-gene-testing-firm-declined-give-patients-their-complete-data-aclu-file (19 May 2016).
  • 38.Riley MF Big data, HIPAA, and the common rule: time for a big change?, in Big Data, Health Law, and Bioethics (eds. Cohen IG, Fernandez Lynch H, Vayena E & Gasser U, Cambridge Univ. Press, New York, 2018).
  • 39.Hoffman S Citizen science: the law and ethics of public access to medical big data. Berkeley Tech. L.J 30, 1741–1805 (2015).
  • 40.Barocas S & Selbst AD, Big data’s disparate impact. Calif. L Rev 104, 671–732 (2016).
  • 41.Malanga SE, Loe JD, Robertson CT & Ramos KS Who’s Left Out of Big Data? How Big Data Collection, Analysis, and Use Neglects Populations Most in Need of Medical and Public Health Research and Interventions, in Big Data, Health Law, and Bioethics (eds. Cohen IG, Fernandez Lynch H, Vayena E & Gasser U, Cambridge Univ. Press, New York, 2018).
  • 42.Chen I, Johansson FD & Sontag D Why is my classifier discriminatory? Preprint at https://arxiv.org/pdf/1805.12002.pdf (2018).
  • 43.Kleinberg J, Mullainathan S & Raghavan M Inherent trade-offs in the fair determination of risk scores Preprint at https://arxiv.org/pdf/1609.05807.pdf (2016).
  • 44.NIH. All of Us: about us, https://allofus.nih.gov/about/about-all-us-research-program.
  • 45.Cohen IG Is there a duty to share health care data?, in Big Data, Health Law, and Bioethics (eds. Cohen IG, Fernandez Lynch H, Vayena E & Gasser U, Cambridge Univ. Press, New York, 2018).
  • 46.Kaye J et al., Dynamic consent: a patient interface for twenty-first century research networks. Eur. J. Hum. Genet 23, 141–146 (2015).
  • 47.Grady C et al. Broad consent for research with biological samples: workshop conclusions. Am. J. Bioeth 15, 34–42 (2015).
  • 48.Mayer-Schönberger V & Ingelsson E Big data and medicine: a big deal? (Review Symposium). J. Intern. Med 283, 418–429 (2018).
  • 49.Rockhold F, Nisen P & Freeman A Data sharing at a crossroads. N. Engl. J. Med 375, 1115–1117 (2016).
  • 50.Winickoff D & Winickoff M The charitable trust as a model for genomic biobanks. N. Engl. J. Med 349, 1180–1184 (2003).
  • 51.Evans BJ Big data and individual autonomy in a crowd, in Big Data, Health Law, and Bioethics (eds. Cohen IG, Fernandez Lynch H, Vayena E & Gasser U, Cambridge Univ. Press, New York, 2018).
  • 52.Maschke KJ Governance Issues for Biorepositories and Biospecimen Research 299, in Specimen Science: Ethics and Policy Implications (eds. Lynch HF, Bierer BE, Cohen IG & Rivera SM, MIT Press, Cambridge, MA, 2017).
  • 53.Connected Health Cities. Citizens’ juries report https://www.connectedhealthcities.org/what-is-a-chc/public-engagment/citizens-juries-chc/citizens-juries/ (2017).
  • 54.Calo MR The boundaries of privacy harm. Indiana L.J 86, 1131–1162 (2011).
  • 55.Epstein RA The legal regulation of genetic discrimination: old responses to new technology. B.U. L. Rev 74, 1–23 (1994).
  • 56.Stone DA The struggle for the soul of health insurance. J. Health Polit. Policy & L 18, 287–317 (1993).
  • 57.Hoffman AK Three models of health insurance: the conceptual pluralism of the Patient Protection and Affordable Care Act. U. Penn. L. Rev 159, 1873–1954 (2011).
  • 58.Hoffman S Big Data’s New Discrimination Threats: Amending the Americans with Disabilities Act to Cover Discrimination Based on Data-Driven Predictions of Future Disease, in Big Data, Health Law, and Bioethics (eds. Cohen IG, Fernandez Lynch H, Vayena E & Gasser U, Cambridge Univ. Press, New York, 2018).
  • 59.Mello MM, Lieou V & Goodman SN Clinical trial participants’ views of the risks and benefits of data sharing. N. Engl. J. Med 378, 2202–2211 (2018).
  • 60.Grande D et al. Public preferences about secondary uses of electronic health information. JAMA Intern. Med 173, 1798–1806 (2013).
  • 61.Ford RA & Price WN II Privacy and accountability in black-box medicine. Mich. Telecomm. & Tech. L. Rev 23, 1–43 (2016).
  • 62.May T Sociogenetic risks—ancestry DNA testing, third-party identity, and protection of privacy. N. Engl. J. Med 379, 410–412 (2018).
  • 63.Crawford K & Schultz J Big data and due process: toward a framework to redress predictive privacy harms. B.C. L. Rev 55, 93–128 (2014).
  • 64.Skopek JM Big Data’s Epistemology and Its Implications for Precision Medicine and Privacy, in Big Data, Health Law, and Bioethics (eds. Cohen IG, Fernandez Lynch H, Vayena E & Gasser U, Cambridge Univ. Press, New York, 2018).
  • 65.Terry NP Protecting patient privacy in the age of big data, U.M.K.C. L. Rev 81, 1–34 (2012).
  • 66.Goldacre B How to get all trials reported: audit, better data, and individual accountability. PLOS Medicine 12, e1001821 http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001821 (14 April 2015).
  • 67.Price WN II. Drug approval in a learning health system. Preprint at https://papers.ssrn.com/abstract_id=3152570.
  • 68.Beaulieu-Jones BK et al. Privacy-preserving generative deep neural networks support clinical data sharing Preprint at https://www.biorxiv.org/content/early/2018/06/05/159756 (2018).
  • 69.Dwork C & Roth A The algorithmic foundations of differential privacy. Found. & Trends in Theoretical Comput. Sci 9, 211–407 (2014).
  • 70.Moussa M & Demurjian SA Differential Privacy Approach for Big Data Privacy in Healthcare, in Privacy and Security Policies in Big Data (eds. Tamane S, Solanki VK & Dey N, IGI Global, Hershey, PA, 2017).
  • 71.Price WN II Big data, patents, and the future of medicine. Cardozo L. Rev 37, 1401–1453 (2016).
  • 72.Cook-Deegan R et al., The next controversy in genetic testing: clinical data as trade secrets?. Eur. J. Hum. Genetics 21, 585–588 (2013).
  • 73.Spector-Bagdady K “The Google of Healthcare:” enabling the privatization of genetic bio/databanking. Ann. Epidemiol 26, 515–519 (2016).
  • 74.Greely HT The uneasy ethical and legal underpinnings of large-scale genomic biobanks. Annu. Rev. Genomics Hum. Genet 8, 343–346 (2007).
  • 75.Ohm P Broken promises of privacy: responding to the surprising failure of anonymization. UCLA L. Rev 57, 1738–1777 (2010).
  • 76.Narayanan A & Shmatikov V Robust deanonymization of large sparse datasets (how to break the anonymity of the Netflix prize database). IEEE Sec’y & Privacy Symposium (2008), available at http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf.
