. 2023 May 25;25:e45662. doi: 10.2196/45662

Table 1.

Summary of methods in each step by their use.

Module	Step	Use	Methods
Data harmonization	Concept identification	Identify medical concepts from RCT^a documents	MetaMap [46], HPO^b [47], NILE^c [48], cTAKES^d [49]
Data harmonization	Concept matching	Grouping of structured EHR^e data	PheWAS^f catalog [32], CCS^g [50], RxNorm [51], and LOINC^h [52]
Data harmonization	Concept matching	Expansion and selection of relevant features using knowledge source or co-occurrence	Expert curation [33,53], knowledge sources [54-58], and EHR data [31,34,59-62]
Cohort construction	Data mart	Filter patients with diagnosis codes relevant to the disease of interest	PheWAS catalog [32] or HPO [47]
Cohort construction	Disease cohort	Identify patients with the disease of interest through phenotyping	Unsupervised: anchor and learn [63], XPRESS^i [64], APHRODITE^j [65], PheNorm [66], MAP^k [36], and sureLDA^l [67]; semisupervised: AFEP^m [57], SAFE^n [58], PSST^o [68], likelihood approach [69], and PheCAP [70]
Cohort construction	Indication and treatment arm	Identify indication conditions before treatment	Phenotyping with temporal input [37]
Variable curation	Extraction of baseline variables or end points	Extraction of binary variables through phenotyping	Phenotyping methods listed under Cohort construction, Disease cohort
Variable curation	Extraction of baseline variables or end points	Extraction of numerical variables through NLP^p	EXTEND^q [71] and NICE^r [38]
Variable curation	Extraction of baseline variables	Extraction of radiological characteristics through medical AI^s	For organs [72], blood vessels [73], neural systems [74,75], nodule detection [76,77], cancer staging [78], and fractional flow reserve [79,80]
Variable curation	Extraction of baseline end points	Extraction of event time through incidence phenotyping	Unsupervised [81,82], semisupervised [83,84], and supervised [85,86]
Downstream analysis	Causal inference for ATE^t	Efficient and robust estimation of treatment effect with partially annotated noisy data	SMMAL^u [87]

^aRCT: randomized controlled trial.

^bHPO: human phenotype ontology.

^cNILE: narrative information linear extraction.

^dcTAKES: clinical text analysis and knowledge extraction system.

^eEHR: electronic health record.

^fPheWAS: phenome-wide association scans.

^gCCS: clinical classifications software.

^hLOINC: logical observation identifiers names and codes.

^iXPRESS: extraction of phenotypes from records using silver standards.

^jAPHRODITE: automated phenotype routine for observational definition, identification, training and evaluation.

^kMAP: multimodal automated phenotyping.

^lsureLDA: surrogate-guided ensemble latent Dirichlet allocation.

^mAFEP: automated feature extraction for phenotyping.

^nSAFE: surrogate-assisted feature extraction.

^oPSST: phenotyping through semisupervised tensor factorization.

^pNLP: natural language processing.

^qEXTEND: extraction of EMR numerical data.

^rNICE: natural language processing interpreter for cancer extraction.

^sAI: artificial intelligence.

^tATE: average treatment effect.

^uSMMAL: semisupervised multiple machine learning.
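
The data mart step in Table 1 (filtering patients with diagnosis codes relevant to the disease of interest) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `CODE_TO_GROUP` mapping, the `build_data_mart` function, the `min_count` threshold, and the toy records are hypothetical and are not taken from the PheWAS catalog, HPO, or any cited method. A real pipeline would load the code groupings from a curated resource.

```python
from collections import defaultdict

# Hypothetical mapping from ICD-10 diagnosis codes to a phenotype group.
# In practice this would come from a curated grouping resource.
CODE_TO_GROUP = {
    "M05.79": "rheumatoid arthritis",
    "M06.9": "rheumatoid arthritis",
    "E11.9": "type 2 diabetes",
}


def build_data_mart(records, target_group, min_count=1):
    """Return sorted IDs of patients with at least `min_count`
    diagnosis codes that map to `target_group`."""
    counts = defaultdict(int)
    for patient_id, code in records:
        if CODE_TO_GROUP.get(code) == target_group:
            counts[patient_id] += 1
    return sorted(pid for pid, n in counts.items() if n >= min_count)


# Toy (patient_id, diagnosis_code) rows, invented for illustration.
records = [
    ("p1", "M06.9"),
    ("p1", "E11.9"),
    ("p2", "E11.9"),
    ("p3", "M05.79"),
    ("p3", "M06.9"),
]

data_mart = build_data_mart(records, "rheumatoid arthritis")
print(data_mart)
```

The filter is deliberately inclusive (one relevant code suffices by default), since the data mart only narrows the population and the disease cohort is then identified by the phenotyping methods in the following row of the table.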