Author manuscript; available in PMC: 2021 Sep 1.

Published in final edited form as: Diabetologia. 2020 Jul 15;63(9):1694–1705. doi: 10.1007/s00125-020-05217-1

Table 1.

Examples of RWD sources and applications to diabetes research

RWD source	Merits	Caveats	Potential areas for application in diabetes research
Administrative claims data: Insurance claims for pharmacy prescriptions and medical inpatient and outpatient visits submitted for billing purposes by government or commercial payers. Include cost information, date/place of service and patient demographics, all linked by a common patient identifier.	Longitudinally follow patients as they navigate through the healthcare system. Reliable for studying important medical encounters, diagnoses and treatment using variables that are captured for reimbursement purposes. Provide information on large samples of patients and their families, considered to be representative of the target population (commercially insured/populations under public health insurance programs). Demographically and geographically diverse, relatively low cost and time-efficient vs RCTs.	Primary purpose for data collection is administrative rather than for research. Key clinical variables (e.g. severity), medications for which patients pay out-of-pocket, patient-reported outcomes, lifestyle variables and laboratory results are typically not captured. Loss of follow-up, particularly in commercial claims data when patients switch employers/health plans (known censoring date due to availability of enrolment file). Identification of disease/treatment depends on accuracy of billing codes used, and data require validation prior to use, particularly for hard-to-diagnose rare conditions.	Can be used in real-world studies to compare the effectiveness and safety of glucose-lowering therapies using an active comparator new-user design [42, 59], patient characterisation, treatment utilisation [91] and health policy/cost [92] research, as well as burden of illness [93] studies. Can be used to estimate basic prevalence or incidence measures of conditions within diabetes populations given large sample sizes and representativeness.
EHR data: Data from patients’ electronic medical records. Data typically include information on medical diagnoses, procedures, medications, free text with physician notes, vital signs at each visit, laboratory results and clinical variables.	Data collected to capture clinical care and contain rich data on clinical variables or other important confounders. May provide rationale for treatment decisions depending on the quality of free text.	Variability in the quality of data, as clinical variables are often missing and may be recorded differently by different physicians. Follow-up only available as long as patients remain in the healthcare system and seek care (unknown censoring date since no enrolment file). Typically, data from only one place of service are available and capture of information from other types of practices is often unreliable (e.g. in a general practice system, specialist data may not be accurately captured for all patients; hospitalisations for acute problems outside the system may not be captured).	Assessing comparative effectiveness or safety, treatment patterns and patient characterisation. Typically less useful for cost assessments or prevalence/incidence estimation. Analyses of EHR data have been shown to improve glycaemic control, reduce emergency department visits and non-elective hospitalisations [94, 95].
Patient-generated data: Data from surveys, questionnaires, smartphone apps and social media that allow continuous data capture. Information is provided mainly by patients, rather than by providers.	Questionnaire/survey data sources provide data on quality-of-life measures, which are hard to find in other data sources. Can be used as external validation datasets. May find particular relevance in pharmacovigilance, particularly rare adverse events associated with treatments, and factors predicting patients’ adherence, behaviours and attitudes. Some data include real-time monitoring to allow tracking of selected measures and symptoms.	Use of these sources implies reliance on self-reported variables, leading to recall bias, selective reporting and missing data on important patient characteristics and medical variables. Limited generalisability and internal validity, as the clinical outcomes reported are often not validated and authenticity is often unverifiable. Utility only in specific settings after careful evaluation and vetting.	The FDA-approved WellDoc BlueStar System is a healthcare app that provides secure capture of blood glucose data and aids in diabetes self-management [96].
Patient registries: Repositories of rich information on a specific disease or treatment.	Include data on patients’ characteristics and medical variables, including rich clinical information on the disease or treatments of interest. Allow long patient follow-up. Useful in areas where richness of information related to a specific disease/treatment is desirable (e.g. rare tumours) and in unique populations (e.g. pregnancy registries).	Validity depends heavily on which types of patients are selected into the registry (voluntary vs mandatory enrolment). Expensive to maintain. May not contain information on other comorbidities or concurrent treatment; more potential for missing data.	The Diabetes Collaborative Registry, organised by the leading societies in diabetes research, provides RWD on diabetes patient care and treatment [17].
Data linkages: Data from two or more sources are linked to bring together the information needed, assuming appropriate safeguards are applied.	Bring together data from disparate sources, allowing capture of comprehensive information needed in a particular research setting (e.g. linking administrative claims with EHRs would enable combination of longitudinal follow-up, cost information that may be lacking in EHRs, with clinical variables that are incomplete in claims). Help minimise missing data on key variables, reducing misclassification.	Validity of results depends on the quality of linkage. Expensive to link and maintain linked data sources. Challenges in linking data due to different purposes of data collection, discrepancies in data recording, legal/confidentiality issues.	Several studies using linked data are being conducted in diabetes patients, predicting hospital admissions [97], cancer outcomes [98] and weight gain with diabetes treatments [99].

EHR, electronic health record