2019 Jun 28;25(24):2990–3008. doi: 10.3748/wjg.v25.i24.2990

Table 1.

Advantages and shortcomings of Big Data analysis (with proposed solutions)

Advantages
Clinical data readily available with minimal resources required
Can study rare exposures
Can study rare events
Can study long-term effects
Real-world data
Large sample size
Subgroup analysis
Sensitivity analysis
Interaction of different variables
Adjustment of outcome to a multitude of risk factors
Precise estimation of effect size
Reliable capture of small variations in incidence or disease flare
No selection bias if n = all
| Shortcomings specific to Big Data analysis | Solution |
| --- | --- |
| Data validity | Cross-reference with medical records in a subset of the sample |
| Missing data | Statistical methods to deal with missing data (e.g., multiple imputation); text mining or natural language processing of unstructured data |
| Incomplete capture of variables or unavailability of certain diagnosis codes | Surrogate markers (e.g., COPD for smoking, alcohol-related diseases for alcoholism); inclusion of a large set of measured variables; text mining or natural language processing of unstructured data |
| Privacy | De-identification of individuals; review of study plan by local ethics committee |
| Hypothesis-free predictive models | Validation in prospective studies or randomized controlled trials |

| Shortcomings of all observational studies, including Big Data analysis | Solution |
| --- | --- |
| Residual and/or unmeasured confounding | Inclusion of a large set of measured variables; inclusion of RCT datasets with extensive collection of data and outcomes for trial participants, or linkage with other data sources; fulfilment of Bradford Hill criteria |
| Reverse causality/protopathic bias (outcome of interest leads to exposure of interest). Example: early symptoms of undiagnosed GC lead to PPI use, rather than PPIs causing GC | Cohort study design instead of case-control study design; excluding prescriptions of drugs of interest (e.g., PPIs) within a certain period (e.g., 6 mo) before development of the outcome of interest (e.g., gastric cancer) |
| Selection bias | Encompassing entire study population (n = all) |
| Indication bias (or confounding by indication/disease severity) | Balance of patient characteristics, in particular comorbidities that are indications for a certain treatment (e.g., PS matching of a large set of measured variables); negative control exposure |
| Confounding by functional status and cognitive impairment | Balance of patient characteristics, in particular comorbidities that can affect functional and cognitive status (e.g., PS matching) |
| Healthy user bias / adherer bias (individuals who are more health conscious tend to have better health outcomes) | Adjustment for other lifestyle factors (via text mining or natural language processing of unstructured data) |
| Immortal time bias (arises when the study outcome cannot occur during a period of follow-up due to study design) | Landmark analysis; analysis using time-varying covariates |
| Ascertainment bias / surveillance bias / detection bias (differential degree of surveillance or screening for the outcome among exposed and unexposed individuals). Example: PPI users may undergo upper endoscopy more frequently than non-PPI users, and hence more GC is detected in PPI users | Selection of an unexposed group with a similar likelihood of screening/testing; selection of an outcome that is likely to be diagnosed equally in exposed and control groups; adjustment for the surveillance rate |
| Access to healthcare | Stratified analysis according to patients’ residential regions (e.g., rural vs urban), socioeconomic status, immigration status, race/ethnicity, institutional factors (e.g., restrictive formularies) |
| Selective prescription and treatment in frail and very sick patients | PS methodology (trimming of areas of non-overlap, PS matching, PS-by-treatment interaction) |

COPD: Chronic obstructive pulmonary disease; RCT: Randomized controlled trial; GC: Gastric cancer; PPI: Proton pump inhibitor; PS: Propensity score.
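
One solution the table lists for reverse causality/protopathic bias is to exclude prescriptions issued within a certain period (e.g., 6 mo) before the outcome. The following is a minimal sketch of that lag-window idea, not the method of any particular study. The function names, the 183-day window, and the use of a single reference date per patient (outcome date, or end of follow-up for patients without the outcome) are illustrative assumptions.

```python
from datetime import date, timedelta

# Assumption: a lag window of roughly 6 months, expressed in days.
LAG_DAYS = 183


def prescriptions_outside_lag(prescription_dates, reference_date, lag_days=LAG_DAYS):
    """Return prescriptions dated before the start of the lag window.

    Prescriptions issued within `lag_days` before `reference_date` are
    dropped, since they may have been prompted by early symptoms of the
    still-undiagnosed outcome. Prescriptions on or after the reference
    date are dropped as well.
    """
    cutoff = reference_date - timedelta(days=lag_days)
    return [d for d in prescription_dates if d < cutoff]


def is_exposed(prescription_dates, reference_date, lag_days=LAG_DAYS):
    """Classify a patient as exposed only if a prescription survives the lag."""
    return len(prescriptions_outside_lag(prescription_dates, reference_date, lag_days)) > 0


# Hypothetical patients: the outcome is diagnosed on 2018-06-01 in both.
long_term_user = [date(2015, 1, 10), date(2018, 3, 1)]
recent_starter = [date(2018, 4, 15)]

print(is_exposed(long_term_user, date(2018, 6, 1)))   # True
print(is_exposed(recent_starter, date(2018, 6, 1)))   # False
```

The second patient started the drug only weeks before diagnosis, so the sketch treats that patient as unexposed. A real analysis would typically repeat the classification with several lag lengths as a sensitivity analysis.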