HHS Author Manuscripts. Author manuscript; available in PMC: 2015 Dec 19.
Published in final edited form as: Science. 2014 Nov 28;346(6213):1054–1055. doi: 10.1126/science.aaa2709

Big data meets public health

Human well-being could benefit from large-scale data if large-scale noise is minimized

Muin J Khoury 1,2, John P A Ioannidis 3
PMCID: PMC4684636  NIHMSID: NIHMS743582  PMID: 25430753

In 1854, as cholera swept through London, John Snow, the father of modern epidemiology, painstakingly recorded the locations of affected homes. After long, laborious work, he implicated the Broad Street water pump as the source of the outbreak, even without knowing that a Vibrio organism caused cholera. “Today, Snow might have crunched Global Positioning System information and disease prevalence data, solving the problem within hours” (1). That is the potential impact of “Big Data” on the public’s health. But the promise of Big Data is also accompanied by claims that “the scientific method itself is becoming obsolete” (2), as next-generation computers, such as IBM’s Watson (3), sift through the digital world to provide predictive models based on massive information. Separating the true signal from the gigantic amount of noise is neither easy nor straightforward, but it is a challenge that must be tackled if information is ever to be translated into societal well-being.

The term “Big Data” refers to large volumes of complex, linkable information (4). Beyond genomics and other “omic” fields, Big Data includes medical, environmental, financial, geographic, and social media information. Most of this digital information was unavailable a decade ago. This swell of data will continue to grow, stoked by sources that are currently unimaginable. Big Data stands to improve health by providing insights into the causes and outcomes of disease, better drug targets for precision medicine, and enhanced disease prediction and prevention. Moreover, citizen-scientists will increasingly use this information to promote their own health and wellness. Big Data can improve our understanding of health behaviors (smoking, drinking, etc.) and accelerate the knowledge-to-diffusion cycle (5).

But “Big Error” can plague Big Data. In 2013, when influenza hit the United States hard and early, analysis of flu-related Internet searches drastically overestimated peak flu levels (6) relative to those determined by traditional public health surveillance. Even more problematic is the potential for many false alarms triggered by large-scale examination of putative associations with disease outcomes. Paradoxically, the proportion of false alarms among all proposed “findings” may increase when one can measure more things (7). Spurious correlations and ecological fallacies may multiply. Numerous such examples exist (8), including the finding that “honey-producing bee colonies inversely correlate with juvenile arrests for marijuana.”
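The arithmetic behind this multiplicity problem can be sketched in a few lines. This is a hypothetical illustration, not an analysis from the article; the test counts and the per-test error rate are assumed round numbers.

```python
# Hypothetical illustration: suppose every candidate association
# examined is truly null. At a conventional per-test false-positive
# rate (alpha = 0.05), the expected number of false alarms scales
# linearly with the number of things measured.
alpha = 0.05

for n_tests in (10, 1_000, 100_000):
    expected_false_alarms = alpha * n_tests
    print(f"{n_tests:>7,} null associations tested -> "
          f"~{expected_false_alarms:,.0f} expected false alarms")
```

If the number of true signals in a data set is roughly fixed while the number of measured variables explodes, these expected false alarms come to dominate the pool of proposed “findings,” which is the paradox the passage describes.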

The field of genomics has addressed this problem of signal and noise by requiring replication of study findings and by asking for much stronger signals in terms of statistical significance. This requires the use of collaborative large-scale epidemiologic studies. For nongenomic associations, false alarms due to confounding variables or other biases are possible even with very large-scale studies, extensive replication, and very strong signals (9). Big Data’s strength is in finding associations, not in showing whether these associations have meaning. Finding a signal is only the first step.
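The “much stronger signals” demanded in genomics can be illustrated with a Bonferroni-style correction, a hedged sketch in which the one-million-variant count is an assumed round number rather than a figure from the article.

```python
# A Bonferroni-style correction divides the study-wide error rate by
# the number of independent tests performed. With roughly one million
# common variants tested (an assumed round number), the conventional
# 0.05 threshold shrinks to the genome-wide significance level widely
# used in genomic studies.
study_wide_alpha = 0.05
n_variants = 1_000_000  # assumed number of independently tested variants
per_test_threshold = study_wide_alpha / n_variants
print(f"per-test threshold: {per_test_threshold:.0e}")
```

This six-orders-of-magnitude tightening, combined with independent replication, is how the genomics community kept its false-alarm rate manageable.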

Even John Snow needed to start with a plausible hypothesis to know where to look, i.e., choose what data to examine. If all he had was massive amounts of data, he might well have ended up with a correlation as spurious as the honey bee–marijuana connection. Crucially, Snow “did the experiment.” He removed the handle from the water pump and dramatically reduced the spread of cholera, thus moving from correlation to causation and effective intervention.

How can we improve the potential for Big Data to improve health and prevent disease? One priority is a stronger epidemiological foundation. Big Data analysis is currently based largely on convenience samples of people or information available on the Internet. When associations are probed between perfectly measured data (e.g., a genome sequence) and poorly measured data (e.g., administrative claims health data), research accuracy is dictated by the weakest link. Big Data are observational in nature and fraught with biases such as selection, confounding, and lack of generalizability. Big Data analysis should be embedded in epidemiologically well-characterized and representative populations. This epidemiologic approach has served the genomics community well (10) and can be extended to other types of Big Data.

There also must be a means to integrate knowledge, based on a highly iterative process of interpreting what we know and don’t know from within and across scientific disciplines. This requires knowledge management, knowledge synthesis, and knowledge translation (11). Curation can be aided by machine learning algorithms. An example is the ClinGen project (12), which will create centralized resources of clinically annotated genes to improve interpretation of genomic variation and optimize the use of genomics in practice. And new funding, such as the Big Data to Knowledge (BD2K) awards of the U.S. National Institutes of Health, will develop new tools and training in this arena.

Another important issue to address is that Big Data is a hypothesis-generating machine, but even after robust associations are established, evidence of health-related utility (i.e., assessing balance of health benefits versus harms) is still needed. Documenting the utility of genomics and Big Data information will necessitate the use of randomized clinical trials and other experimental designs (13). Emerging treatments based on Big Data signals need to be tested in intervention studies. Predictive tools also should be tested. In other words, we should embrace (and not run away from) principles of evidence-based medicine. We need to move from clinical validity (confirming robust relationships between Big Data and disease) to clinical utility (answering the “who cares?” health impact questions).

As with genomics, an expanded translational research agenda (14) for Big Data is needed that goes beyond an initial research discovery. In genomics, most published research consists of either basic scientific discoveries or preclinical research designed to develop health-related tests and interventions. What happens after that in the bench-to-bedside journey is a “road less traveled” with <1% of published research (15) dealing with validation, evaluation, implementation, policy, communication, and outcome research in the real world. Reaping the benefits of Big Data requires a “Big Picture” view.

Bringing Big Data to bear on public health is where the rubber meets the road. The combination of a strong epidemiologic foundation, robust knowledge integration, principles of evidence-based medicine, and an expanded translation research agenda can put Big Data on the right course.

From validity to utility. Big Data can improve tracking and response to infectious disease outbreaks, discovery of early warning signals of disease, and development of diagnostic tests and therapeutics.

Contributor Information

Muin J. Khoury, Email: muk1@cdc.gov.

John P. A. Ioannidis, Email: jioannid@stanford.edu.
