. 2016 Oct 13;1(4):260–271. doi: 10.1016/j.adro.2016.10.001

Table 1.

Key data element categories and a summary of our experience with the challenges of extracting, transforming, and loading (ETL) data from source systems to the aggregation tier.

Key element category Demand ranking ETL difficulty Typical source systems Access Multiple source systems Use or used free text entry Missing data Data accuracy Lack of standardization PHI constraints limit access Legacy formats or systems Require process changes Extensive transformation Other
Demographics ● 1 L EHR × E
Health status factors 2 L EHR × E
Pathology ⊙ 3 M to H EHR × × × × × E, X
Surgery ⊙ 2 M to H EHR × × × × × E, X
Chemotherapy ● 2 M EHR, ODB × E
Encounter details (office, emergency room, hospitalization) ● 3 L EHR × R
Diagnosis ●,Inline graphic,Inline graphic 1 M EHR, ROIS × × × × R, E
Staging ●,Inline graphic,Inline graphic 1 H EHR, ROIS × × × × × E
Prescription Inline graphic,♦ 1 H ROIS, ODB × E, X, R
As-treated plan details ● 1 M ROIS ×
DVH ●,Inline graphic,♦ 1 M TPS × × × × ATPS
Survival ● 1 M EHR, XLS, ODB × UD, E
Recurrence Inline graphic,Inline graphic 1 H EHR × × × × × E, X
Toxicity ●,Inline graphic 1 H EHR, ROIS × × × × × E, X
Patient-reported outcomes Inline graphic 2 H EHR, P × × × × E, X
Laboratory values ● 2 M EHR × × E
Medications ● 2 M EHR × × E
Height, weight, BMI ● 2 M EHR × × E
Treatment imaging: Timeline details ● 3 H ROIS × R
Diagnostic imaging details ⊙ 3 M ODB × × ×
Radiomics ⊙,♦ 3 L XLS ×
Genomics ⊙ 3 L XLS ×
Charges ● 3 L ROIS
Research datasets ⊙ 4 H XLS × × × E
Registry data ⊙ 4 M ODB × × × UD

Demand ranking ranges from most (1) to least (4) frequently needed as part of queries. A range in ETL difficulty is specified when significant variation among institutions is anticipated; extensive transformation indicates the need to construct sophisticated algorithms that process raw data from source systems to provide the needed information.

ATPS, special manual effort needed to construct as-treated plan sums; BMI, body mass index; DVH, dose-volume histogram; E, manual entry without process-corrected curation, susceptible to random or system-related systematic errors; EHR, electronic health records; ETL, extract, transform, and load; H, extensive process changes needed, data typically in unstructured free text fields; L, little modification required; M, changes to clinical processes required, interactions across different groups in the institution, significant computational processing; M-ROAR, Michigan Radiation Oncology Analytics Resource; NLP, natural language processing; ODB, other database systems; P, paper records; PHI, Patient Health Information; R, missing detail on key relationships to other data items; ROIS, radiation oncology information system; TPS, treatment planning system; UD, data values not being up to date; X, manual effort required to extract data; XLS, spreadsheet.

M-ROAR–specific ETL status for all patients: ●, current processes enable capture for all; ⊙, developing new extractions; Inline graphic, exploring NLP-based process; Inline graphic, piloting new clinical process; ♦, developing new software applications to improve availability or accuracy; Inline graphic, developing extractions for legacy data with differing formats. The current database includes 17,956 patients treated since 2002. Records per patient vary with time period and key data element category.

×, specific ETL challenges; ⊠, the primary issue for enabling automated extractions for multiple issues.
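
As a rough illustration of how the demand ranking and ETL difficulty columns might be read together, the Python sketch below encodes a handful of rows from Table 1 and lists the categories that are both frequently needed and hard to extract. The `DataElement` class, the `hard_but_needed` helper, and the ordering L < M < H used for comparison are assumptions made for this example, not part of M-ROAR. Rows whose difficulty is given as a range (for example, M to H) are left out for simplicity.

```python
from dataclasses import dataclass

# Assumed ordering of the ETL difficulty codes from the footnotes: L < M < H.
DIFFICULTY_ORDER = {"L": 0, "M": 1, "H": 2}


@dataclass(frozen=True)
class DataElement:
    """Hypothetical record for one row of Table 1."""
    category: str
    demand_rank: int     # 1 = most frequently needed, 4 = least
    etl_difficulty: str  # "L", "M", or "H"
    sources: tuple       # typical source systems, e.g. ("EHR", "ROIS")


# Illustrative subset of rows transcribed from Table 1 (not the full table).
ROWS = [
    DataElement("Demographics", 1, "L", ("EHR",)),
    DataElement("Staging", 1, "H", ("EHR", "ROIS")),
    DataElement("Prescription", 1, "H", ("ROIS", "ODB")),
    DataElement("DVH", 1, "M", ("TPS",)),
    DataElement("Laboratory values", 2, "M", ("EHR",)),
    DataElement("Genomics", 3, "L", ("XLS",)),
    DataElement("Research datasets", 4, "H", ("XLS",)),
]


def hard_but_needed(rows, max_rank=1, min_difficulty="H"):
    """Return categories with demand rank <= max_rank and difficulty >= min_difficulty."""
    threshold = DIFFICULTY_ORDER[min_difficulty]
    return [
        row.category
        for row in rows
        if row.demand_rank <= max_rank
        and DIFFICULTY_ORDER[row.etl_difficulty] >= threshold
    ]


if __name__ == "__main__":
    print(hard_but_needed(ROWS))
```

Loosening `max_rank` or `min_difficulty` widens the selection, which may help when deciding where extraction effort is likely to pay off first.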