Table 1.
Advantages | |
Clinical data readily available with minimal resources required | |
Can study rare exposures | |
Can study rare events | |
Can study long-term effects | |
Real-world data | |
Large sample size | |
Subgroup analysis | |
Sensitivity analysis | |
Interaction of different variables | |
Adjustment of outcome to a multitude of risk factors | |
Precise estimation of effect size | |
Reliable capture of small variations in incidence or disease flare | |
No selection bias if n = all | |
Shortcomings specific to Big Data analysis | Solution |
Data validity | Cross reference with medical records in a subset of the sample |
Missing data | Statistical methods to deal with missing data (e.g., multiple imputation); text mining or natural language processing of unstructured data |
Incomplete capture of variables or unavailability of certain diagnosis codes | Surrogate markers (e.g., COPD for smoking, alcohol-related diseases for alcoholism); inclusion of a large set of measured variables; text mining or natural language processing of unstructured data |
Privacy | De-identification of individuals; review of study plan by local ethics committee |
Hypothesis-free predictive models | Validation in prospective studies or randomized controlled trials |
Shortcomings of all observational studies, including Big Data analysis | Solution |
Residual and/or unmeasured confounding | Inclusion of a large set of measured variables; inclusion of RCT datasets with extensive collection of data and outcomes for trial participants, or linkage with other data sources; fulfilment of Bradford Hill criteria |
Reverse causality/protopathic bias (outcome of interest leads to exposure of interest) Example: Early symptoms of undiagnosed GC lead to PPI use, rather than PPIs causing GC | Cohort study design instead of case-control study design; exclusion of prescriptions of the drugs of interest (e.g., PPIs) within a certain period (e.g., 6 mo) before development of the outcome of interest (e.g., GC) |
Selection bias | Encompassing entire study population (n = all) |
Indication bias (or confounding by indication/disease severity) | Balance of patient characteristics, in particular comorbidities that are indications for a certain treatment (e.g., PS matching of a large set of measured variables); negative control exposure |
Confounding by functional status and cognitive impairment | Balance of patient characteristics, in particular comorbidities that can affect functional and cognitive status (e.g., PS matching) |
Healthy user bias / adherer bias (individuals who are more health conscious tend to have better health outcomes) | Adjustment for other lifestyle factors – text mining or natural language processing of unstructured data |
Immortal time bias (arises when the study outcome cannot occur during a period of follow-up due to study design) | Landmark analysis; analysis using time-varying covariates |
Ascertainment bias / surveillance bias / detection bias (differential degree of surveillance or screening for the outcome among exposed and unexposed individuals) Example: PPI users may undergo upper endoscopy more frequently than non-PPI users, and hence more GC is detected in PPI users | Selection of an unexposed group with a similar likelihood of screening/testing; selection of an outcome that is likely to be diagnosed equally in exposed and control groups; adjustment for the surveillance rate |
Access to healthcare | Stratified analysis according to patients’ residential regions (e.g., rural vs urban), socioeconomic status, immigration status, race/ethnicity, institutional factors (e.g., restrictive formularies) |
Selective prescription and treatment in frail and very sick patients | PS methodology (trimming of areas of non-overlap, PS matching, PS by treatment interaction) |
COPD: Chronic obstructive pulmonary disease; RCT: Randomized controlled trial; GC: Gastric cancer; PPI: Proton pump inhibitor; PS: Propensity score.
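
As an illustration of one solution listed in the table (excluding prescriptions within a lag period before the outcome, to limit protopathic bias), the following is a minimal sketch rather than a validated method. The function names, the example patients, and the 183-day approximation of a 6-mo window are all assumptions made for illustration; a real analysis would define the index date and lag period in its study protocol.

```python
from datetime import date, timedelta


def prescriptions_outside_lag(prescription_dates, index_date, lag_days=183):
    """Return prescriptions issued strictly before the lag window.

    The lag window is the `lag_days` days leading up to `index_date`
    (assumed here to be the date the outcome was diagnosed).
    """
    window_start = index_date - timedelta(days=lag_days)
    return [d for d in prescription_dates if d < window_start]


def is_exposed(prescription_dates, index_date, lag_days=183):
    """Count a patient as exposed only if a prescription predates the lag window."""
    return bool(prescriptions_outside_lag(prescription_dates, index_date, lag_days))


# Hypothetical patients, both diagnosed on 1 July 2020.
diagnosis = date(2020, 7, 1)
recent_user = [date(2020, 3, 1), date(2020, 5, 10)]     # PPI started shortly before diagnosis
long_term_user = [date(2018, 6, 1), date(2020, 5, 10)]  # PPI use well before the window

print(is_exposed(recent_user, diagnosis))
print(is_exposed(long_term_user, diagnosis))
```

Under this sketch, a patient whose only PPI prescriptions fall inside the lag window is treated as unexposed, since those prescriptions may have been prompted by early symptoms of the undiagnosed outcome.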