2023 May 25;25:e45662. doi: 10.2196/45662

Table 1.

Summary of methods in each step by their use.

| Module | Step | Use | Methods |
| --- | --- | --- | --- |
| Data harmonization | Concept identification | Identify medical concepts from RCT^a documents | MetaMap [46], HPO^b [47], NILE^c [48], cTAKES^d [49] |
| Data harmonization | Concept matching | Grouping of structured EHR^e | PheWAS^f catalog [32], CCS^g [50], RxNorm [51], and LOINC^h [52] |
| Data harmonization | Concept matching | Expansion and selection of relevant features using knowledge source or co-occurrence | Expert curation [33,53], knowledge sources [54-58], and EHR data [31,34,59-62] |
| Cohort construction | Data mart | Filter patients with diagnosis codes relevant to the disease of interest | PheWAS catalog [32] or HPO [47] |
| Cohort construction | Disease cohort | Identify patients with the disease of interest through phenotyping | Unsupervised: anchor and learn [63], XPRESS^i [64], APHRODITE^j [65], PheNorm [66], MAP^k [36], and sureLDA^l [67]; semisupervised: AFEP^m [57], SAFE^n [58], PSST^o [68], likelihood approach [69], and PheCAP [70] |
| Cohort construction | Indication and treatment arm | Identify indication conditions before treatment | Phenotyping with temporal input [37] |
| Variable curation | Extraction of baseline variables or end points | Extraction of binary variables through phenotyping | Phenotyping methods as listed out under cohort construction: disease cohort |
| Variable curation | Extraction of baseline variables or end points | Extraction of numerical variables through NLP^p | EXTEND^q [71] and NICE^r [38] |
| Variable curation | Extraction of baseline variables | Extraction of radiological characteristics through medical AI^s | For organs [72], blood vessels [73], neural systems [74,75], nodule detection [76,77], cancer staging [78], and fractional flow reserve [79,80] |
| Variable curation | Extraction of baseline end points | Extraction of event time through incidence phenotyping | Unsupervised [81,82], semisupervised [83,84], and supervised [85,86] |
| Downstream analysis | Causal inference for ATE^t | Efficient and robust estimation of treatment effect with partially annotated noisy data | SMMAL^u [87] |

^a RCT: randomized controlled trial.

^b HPO: human phenotype ontology.

^c NILE: narrative information linear extraction.

^d cTAKES: clinical text analysis and knowledge extraction system.

^e EHR: electronic health record.

^f PheWAS: phenome-wide association scans.

^g CCS: clinical classification software.

^h LOINC: logical observation identifier names and codes.

^i XPRESS: extraction of phenotypes from records using silver standards.

^j APHRODITE: automated phenotype routine for observational definition, identification, training and evaluation.

^k MAP: multimodal automated phenotyping.

^l sureLDA: surrogate-guided ensemble latent Dirichlet allocation.

^m AFEP: automated feature extraction for phenotyping.

^n SAFE: surrogate-assisted feature extraction.

^o PSST: phenotyping through semisupervised tensor factorization.

^p NLP: natural language processing.

^q EXTEND: extraction of EMR numerical data.

^r NICE: natural language processing interpreter for cancer extraction.

^s AI: artificial intelligence.

^t ATE: average treatment effect.

^u SMMAL: semisupervised multiple machine learning.
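
As a rough illustration of the "Data mart" row in Table 1 (filtering patients with diagnosis codes relevant to the disease of interest), the following minimal Python sketch groups diagnosis codes into phenotype groups and keeps patients with enough relevant codes. The mapping `CODE_TO_GROUP`, the function `build_data_mart`, the `min_count` threshold, and the sample records are all hypothetical and exist only for illustration. A real pipeline would load a curated code grouping such as the PheWAS catalog [32] rather than a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical, hand-written mapping from diagnosis codes to phenotype groups.
# Assumption: a real pipeline would load a curated catalog instead.
CODE_TO_GROUP = {
    "M05.9": "rheumatoid arthritis",
    "M06.9": "rheumatoid arthritis",
    "E11.9": "type 2 diabetes",
}


def build_data_mart(records, target_group, min_count=1):
    """Return sorted IDs of patients with at least min_count codes in target_group.

    records: iterable of (patient_id, diagnosis_code) pairs.
    """
    counts = defaultdict(int)
    for patient_id, code in records:
        # Codes missing from the mapping are ignored.
        if CODE_TO_GROUP.get(code) == target_group:
            counts[patient_id] += 1
    return sorted(pid for pid, n in counts.items() if n >= min_count)


# Made-up example records, not real patient data.
records = [
    ("p1", "M05.9"),
    ("p1", "E11.9"),
    ("p2", "E11.9"),
    ("p3", "M06.9"),
    ("p3", "M05.9"),
]

data_mart = build_data_mart(records, "rheumatoid arthritis")
```

The data mart is deliberately inclusive: it is meant to capture every patient who might have the disease, and the phenotyping methods listed under "Disease cohort" then narrow it down. Raising `min_count` trades sensitivity for specificity at this first filtering step.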