Table 1.
Summary of methods in each step by their use.
| Module | Step | Use | Methods |
| --- | --- | --- | --- |
| Data harmonization | Concept identification | Identify medical concepts from RCT^a documents | MetaMap [46], HPO^b [47], NILE^c [48], cTAKES^d [49] |
| Data harmonization | Concept matching | Grouping of structured EHR^e data | PheWAS^f catalog [32], CCS^g [50], RxNorm [51], and LOINC^h [52] |
| Data harmonization | Concept matching | Expansion and selection of relevant features using knowledge source or co-occurrence | Expert curation [33,53], knowledge sources [54-58], and EHR data [31,34,59-62] |
| Cohort construction | Data mart | Filter patients with diagnosis codes relevant to the disease of interest | PheWAS catalog [32] or HPO [47] |
| Cohort construction | Disease cohort | Identify patients with the disease of interest through phenotyping | Unsupervised: anchor and learn [63], XPRESS^i [64], APHRODITE^j [65], PheNorm [66], MAP^k [36], and sureLDA^l [67]; semisupervised: AFEP^m [57], SAFE^n [58], PSST^o [68], likelihood approach [69], and PheCAP [70] |
| Cohort construction | Indication and treatment arm | Identify indication conditions before treatment | Phenotyping with temporal input [37] |
| Variable curation | Extraction of baseline variables or end points | Extraction of binary variables through phenotyping | Phenotyping methods listed under Cohort construction, Disease cohort |
| Variable curation | Extraction of baseline variables or end points | Extraction of numerical variables through NLP^p | EXTEND^q [71] and NICE^r [38] |
| Variable curation | Extraction of baseline variables | Extraction of radiological characteristics through medical AI^s | For organs [72], blood vessels [73], neural systems [74,75], nodule detection [76,77], cancer staging [78], and fractional flow reserve [79,80] |
| Variable curation | Extraction of end points | Extraction of event time through incidence phenotyping | Unsupervised [81,82], semisupervised [83,84], and supervised [85,86] |
| Downstream analysis | Causal inference for ATE^t | Efficient and robust estimation of treatment effect with partially annotated noisy data | SMMAL^u [87] |
^a RCT: randomized controlled trial.
^b HPO: human phenotype ontology.
^c NILE: narrative information linear extraction.
^d cTAKES: clinical text analysis and knowledge extraction system.
^e EHR: electronic health record.
^f PheWAS: phenome-wide association study.
^g CCS: clinical classifications software.
^h LOINC: logical observation identifiers names and codes.
^i XPRESS: extraction of phenotypes from records using silver standards.
^j APHRODITE: automated phenotype routine for observational definition, identification, training and evaluation.
^k MAP: multimodal automated phenotyping.
^l sureLDA: surrogate-guided ensemble latent Dirichlet allocation.
^m AFEP: automated feature extraction for phenotyping.
^n SAFE: surrogate-assisted feature extraction.
^o PSST: phenotyping through semisupervised tensor factorization.
^p NLP: natural language processing.
^q EXTEND: extraction of EMR numerical data.
^r NICE: natural language processing interpreter for cancer extraction.
^s AI: artificial intelligence.
^t ATE: average treatment effect.
^u SMMAL: semisupervised multiple machine learning.