Table 1.
Key data element categories and a summary of the challenges we experienced in extracting, transforming, and loading (ETL) data from source systems to the aggregation tier.
| Key element category | Demand ranking | ETL difficulty | Typical source systems | Access | Multiple source systems | Use or used free text entry | Missing data | Data accuracy | Lack of standardization | PHI constraints limit access | Legacy formats or systems | Require process changes | Extensive transformation | Other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Demographics ● | 1 | L | EHR | × | E | |||||||||
| Health status factors | 2 | L | EHR | × | E | |||||||||
| Pathology ⊙ | 3 | M to H | EHR | × | × | × | × | × | ⊠ | E, X | ||||
| Surgery ⊙ | 2 | M to H | EHR | × | × | × | × | × | ⊠ | E, X | ||||
| Chemotherapy ● | 2 | M | EHR, ODB | × | E | |||||||||
| Encounter details ● Office, emergency room, hospitalization | 3 | L | EHR | ⊠ | × | R | ||||||||
| Diagnosis ●, , | 1 | M | EHR, ROIS | × | × | × | × | ⊠ | R, E | |||||
| Staging ●, , | 1 | H | EHR, ROIS | × | × | × | × | × | ⊠ | E | ||||
| Prescription ,♦ | 1 | H | ROIS, ODB | ⊠ | × | E, X, R | ||||||||
| As-treated plan details ● | 1 | M | ROIS | × | ||||||||||
| DVH ●, ,♦ | 1 | M | TPS | × | × | × | ⊠ | × | ATPS | |||||
| Survival ● | 1 | M | EHR, XLS, ODB | × | ⊠ | UD, E | ||||||||
| Recurrence , | 1 | H | EHR | × | × | × | × | × | ⊠ | E, X | ||||
| Toxicity ●, | 1 | H | EHR, ROIS | × | × | × | × | × | ⊠ | E, X | ||||
| Patient-reported outcomes | 2 | H | EHR, P | × | × | × | × | ⊠ | E, X | |||||
| Laboratory values ● | 2 | M | EHR | ⊠ | × | × | E | |||||||
| Medications ● | 2 | M | EHR | ⊠ | × | × | E | |||||||
| Height, weight, BMI ● | 2 | M | EHR | ⊠ | × | × | E | |||||||
| Treatment imaging: Timeline details ● | 3 | H | ROIS | × | R | |||||||||
| Diagnostic imaging details ⊙ | 3 | M | ODB | ⊠ | × | × | × | |||||||
| Radiomics ⊙,♦ | 3 | L | XLS | × | ⊠ | |||||||||
| Genomics ⊙ | 3 | L | XLS | × | ⊠ | |||||||||
| Charges ● | 3 | L | ROIS | |||||||||||
| Research datasets ⊙ | 4 | H | XLS | × | ⊠ | × | × | E | ||||||
| Registry data ⊙ | 4 | M | ODB | ⊠ | × | × | × | UD | ||||||
Demand ranking ranges from most (1) to least (4) frequently needed as part of queries. A range in ETL difficulty is specified when significant variation among institutions is anticipated. Extensive transformation indicates the need to construct sophisticated algorithms that process raw data from source systems to provide the needed information.
ATPS, special manual effort needed to construct as-treated plan sums; BMI, body mass index; DVH, dose-volume histogram; E, manual entries without process-corrected curation are susceptible to random or system-related systematic errors; EHR, electronic health records; ETL, extract, transform, and load; H, extensive process changes needed, data typically in unstructured free text fields; L, little modification required; M, changes to clinical processes required, interactions across different groups in the institution, significant computational processing; M-ROAR, Michigan Radiation Oncology Analytics Resource; NLP, natural language processing; ODB, other database systems; P, paper records; PHI, Patient Health Information; R, missing detail on key relationships to other data items; ROIS, radiation oncology information system; TPS, treatment planning system; UD, data values not being up to date; X, manual effort required to extract data; XLS, spreadsheet.
M-ROAR–specific ETL status for all patients: ●, current processes enable capture for all; ⊙, developing new extractions; , exploring NLP-based process; , piloting new clinical process; ♦, developing new software applications to improve availability or accuracy; , developing extractions for legacy data with differing formats. The current database includes 17,956 patients treated since 2002. Records per patient vary with time period and key data element category.
×, specific ETL challenge; ⊠, the primary issue limiting automated extraction where multiple issues apply.