Author manuscript; available in PMC: 2013 Nov 1.
Published in final edited form as: Cancer. 2012 Apr 19;118(21):5186–5197. doi: 10.1002/cncr.27552

Table 2.

Existing and secondary data sources: Illustrative examples, strengths, limitations and applicability/utility to CER.

Examples | Strengths | Limitations | Applicability/Utility to Cancer CER
Existing and Fixed Data Sources
1) Experimental studies (trials) data (e.g., clinical trials, pragmatic trials):
Examples:
■ ACCENT/ PS2
■ Women's Health Initiative (WHI)
■ Physicians’ Health Study (PHS)
■ Women's Health Study (WHS)
Strengths:
■ Detailed and unbiased information on treatment and important clinical covariates
■ Enormous breadth and diversity of data (across 12 NCI cooperative groups)
Limitations:
■ Limited generalizability
■ Expensive to conduct; requires lengthy follow-up for many outcomes
■ Limited sample sizes
■ Highly specific (i.e., usually a single treatment/intervention)
■ Limited in essential/important covariates
Applicability/Utility to Cancer CER:
■ Utility for CER depends on type of experimental study
■ Broadly defined (or population-based) trials can be useful for CER, but require extensive inclusion of covariates outside of the main trial aims
■ Secondary use of experimental studies for CER could be improved through investments in 1) pragmatic clinical trials and 2) methods/design development
2) Non-experimental (observational) studies data
Examples:
■ North Carolina-Louisiana Prostate Cancer Project
■ Cancer Care Outcomes Research and Surveillance Consortium (CanCORS)
■ Health Professionals Follow-Up Study (HPFS)
■ Nurses' Health Study (NHS)
■ American Cancer Society Cohort
Strengths:
■ Extensive data on diagnosis, procedures and outcomes
■ Rich in covariates (risk factors, important confounders)
■ Often include patient medical records
■ Can be population-based
Limitations:
■ Expensive to develop and maintain
■ Logistics of study development limit data availability and addition of new hypotheses
■ Several biases may exist: selection; information; recall; and response
■ Unclear event temporality between data collection waves
■ Limited in scope and statistical power beyond initial study aims
■ Proprietary data requiring extensive protocols and procedures
Applicability/Utility to Cancer CER:
■ Can be leveraged for comparative effectiveness depending on data quality and extent of biases
■ Utility for CER also depends on study design, quality/completeness of measures and broad inclusion of covariates
■ Can be strengthened through potential data linkages to claims or EHR data, which can augment or offset biases/limitations (can provide temporality of events, verification of treatment/outcomes, etc.)
3) Registry data
Examples:
■ Surveillance, Epidemiology, and End Results (SEER)
■ National Program of Cancer Registries (NPCR)
■ National Cancer Data Base (NCDB)
■ National Oncologic Positron Emission Tomography Registry (NOPR)
Strengths:
■ Rich disease information
■ Clinical information at point of care or diagnosis
■ Simultaneously collected with diagnosis and treatment
■ Opportunity for recruitment into cohorts or trials
■ Can link with administrative data
Limitations:
■ Potential sampling biases (selection, inclusion, etc.)
■ Questionable generalizability
■ Primarily limited to first occurrence of event or disease and limited inclusion of covariates
■ Unknown response, toxicity, patient-reported outcomes
■ Challenging for longitudinal data capture
■ Sparse patient identifiers
■ Challenging for selecting controls/comparator populations
Applicability/Utility to Cancer CER:
■ Do not provide enough complete data for rigorous CER
■ Linkages to additional data are necessary to provide missing information
■ Dearth of literature on solutions/methods for inherent biases, interoperable study design, and evaluation/application of comparator populations
4) Administrative and claims data
Examples:
■ Most health insurance programs: Medicare; Medicaid; Blue Cross/Blue Shield, etc.
■ Medstat/MarketScan
■ United Health
Strengths:
■ Represents large proportion of US population
■ Rich patient-level data: demographics, procedures, treatments
■ Includes temporality of events
■ Some include organizational/provider characteristics
■ Most have unique identifiers enabling linkage to other data
Limitations:
■ Design/structure often impacts data sensitivity/specificity
■ Missing important clinical etiologic information
■ Includes date or type of testing procedures, but no results (e.g., pathology, tumor response, genetics, vital stats, etc.)
■ High patient turnover
■ Complicated data structure requires a significant learning curve and programming resources
■ Burdensome and prohibitive data use agreements (DUAs)
■ Expensive to obtain
■ Untimely data releases – significant time lags
Applicability/Utility to Cancer CER:
■ Missing key CER components including vital tumor and disease information
■ Linkages can supplement missing information – but costs and/or DUAs often inhibit additional linkages
■ Utility for CER would be greatly improved through institutional and governmental policies that overcome limitations (i.e., funding, training, collaboration)
5) Electronic health records
Examples:
■ Health care systems: Veterans Administration (VA); HMO-network; Kaiser; Mayo; Geisinger; US Oncology; UK General Practice Research Database (GPRD)
■ Large vendors: GE Health; Allscripts/Misys; Epic; McKesson; NextGen
Strengths:
■ Includes multiple data components (practice management, electronic patient record, patient portal)
■ Fully integrated EHRs provide clinical information, claims, tumor specifics, longitudinal follow-up, objectively measured events
■ Allows for studies of toxicity, quality of life, natural history
Limitations:
■ Populations are not generalizable
■ Lack of standardization of patient information and clinical measures between systems (technology, data structure, and coding)
■ Missing or insufficient data elements necessary for CER
■ Imperfect record keeping/follow-up: patients are not consistently maintained within a single system/EHR
■ Enormous expense to obtain data from private sector/vendors
Applicability/Utility to Cancer CER:
■ Currently there is limited utility for EHR data from private vendors
■ However, examples from the VA and universal/national systems (UK, Canada) exemplify the potential of EHR sources
■ Future utility depends on: standardization of measures and data systems/interoperability; standard linkage variables; public and private institutional data governance and stewardship
6) Other Data
Examples:
■ Genetic and genomic data
■ Geospatial data
■ Environmental monitoring data
■ Over-the-counter drug purchasing
■ Health seeking on internet
■ Patient-networking sites [66]
■ Syndromic surveillance
Strengths:
■ Data at both patient and ecological level
■ Information on behavioral and environmental risks
■ Can provide information on disease determinants
■ Self-reported experiences, exposures, outcomes
Limitations:
■ Unclear how to identify, define and utilize these data
Applicability/Utility to Cancer CER:
■ Utility to CER dependent on integration into other data, specifically clinical care data
Hybrid Data Sources
7) Linked clinical and claims data
Examples:
■ SEER-Medicare
■ State Cancer Registry – Medicare/Medicaid
■ WHI-Medicare
Strengths:
■ Includes clinical and health services data
■ Provides temporality of events
■ Large population samples; ability to study rare events/treatments
■ Provides access to controls or comparison populations
■ Allows for adjudication/validation of events (i.e., self-reported)
■ Can detect recurrence
Limitations:
■ Missing information (e.g., HMO or supplemental insurance); often highly specific populations (>65, disabled, etc.)
■ Non-covered services are excluded (e.g., prescription drugs, long-term care, free screenings)
■ Missing vital clinical information (tumor response)
■ Treatment rationale and test results are unknown
■ Complicated algorithms needed to characterize treatment
■ Large, complex data require advanced training/experience
■ Delay in research access
Applicability/Utility to Cancer CER:
■ Powerful for CER studies because of large, generalizable populations
■ Large number of covariates and clinical information
■ Lengthy follow-up available including information on temporality of treatment and events
■ Could be strengthened by linkages to laboratory and clinical results
8) Validation study data
Examples:
■ Internal validation studies [49, 67]
■ External validation studies [47]
Strengths:
■ Rich disease information
■ Used to minimize limitations of other data
■ Can give estimates of associations not discernible within data
Limitations:
■ Few validation studies exist for CER
■ Methodologic limitations and lack of model transportability to CER
Applicability/Utility to Cancer CER:
■ To be useful for CER, an investment in methodologic work is required, similar to P01 CA 142538 “Statistical Methods for Cancer Clinical Trials” (PI, Kosorok)
■ Validation studies could lead to immediate return on investment with regard to leveraging existing data for CER