Journal of the American Medical Informatics Association (JAMIA)
2024 May 15;31(8):1785–1796. doi:10.1093/jamia/ocae121

A general framework for developing computable clinical phenotype algorithms

David S Carrell 1, James S Floyd 2,3, Susan Gruber 4, Brian L Hazlehurst 5, Patrick J Heagerty 6, Jennifer C Nelson 7, Brian D Williamson 8, Robert Ball 9
PMCID: PMC11258420  PMID: 38748991

Abstract

Objective

To present a general framework providing high-level guidance to developers of computable algorithms for identifying patients with specific clinical conditions (phenotypes) through a variety of approaches, including but not limited to machine learning and natural language processing methods that incorporate rich electronic health record data.

Materials and Methods

Drawing on extensive prior phenotyping experience and insights derived from 3 algorithm development projects conducted specifically for this purpose, our team, with expertise in clinical medicine, statistics, informatics, pharmacoepidemiology, and healthcare data science methods, conceptualized stages of development and corresponding sets of principles, strategies, and practical guidelines for improving the algorithm development process.

Results

We propose 5 stages of algorithm development and corresponding principles, strategies, and guidelines: (1) assessing fitness-for-purpose, (2) creating gold standard data, (3) feature engineering, (4) model development, and (5) model evaluation.

Discussion and Conclusion

This framework is intended to provide practical guidance and serve as a basis for future elaboration and extension.

Keywords: computable algorithms, recommended practices, health outcomes, modeling methods

Background and motivation

Computable clinical phenotype algorithms are used to identify a wide variety of health conditions for epidemiologic, clinical, and health services research and for medical product safety surveillance. With appropriate privacy safeguards and stakeholder approval, such algorithms use electronic healthcare data to identify patients with specific clinical conditions (phenotypes). Both the data inputs and the modeling methods used in developing algorithms range from simple to complex. Some phenotypes such as acute pancreatitis1 and HIV infection2–4 can be accurately identified by relatively simple, manually crafted rules applied to structured claims data (eg, diagnosis, procedure, and medication codes). Identifying more clinically complex phenotypes may require marshalling more diverse and subtle information from electronic health records (EHRs), introducing additional measurement and modeling challenges. Anaphylaxis is an example of a difficult phenotype,5,6 requiring synthesis of information about exposures and subjectively assessed symptoms during a short time period.7

The emergence of rich and abundant EHR data, powerful natural language processing (NLP) tools,8 and data-driven machine learning (ML) methods9,10 has expanded the quantity, types, and granularity of information available for phenotyping, but better data and methods alone do not eliminate algorithm development challenges. To the contrary, they introduce or exacerbate certain challenges. Which phenotypes that cannot be accurately identified by simpler data and methods are good candidates for algorithm development? What types of data and modeling approaches are minimally necessary to achieve desired performance? Under what conditions is engineering of NLP-derived predictors likely to improve performance? Can algorithms incorporating cutting-edge data mining and machine learning methods be comparably implemented in multiple settings, including those with modest computing infrastructure and expertise?

In our experience, these and myriad other questions arising during algorithm development tend to be resolved parochially with heavy reliance on the habits, intuition, expertise, and tools available to local developers. The informatics literature offers some guidance on selected aspects of algorithm development (eg, cataloging commonly used methods and tools11–13 and sources of ambiguity14) but lacks a characterization of key development stages and systematic presentation of strategies, principles, and practical guidelines intended to make algorithm development more successful and efficient.

Our objective is to present a general framework that offers high-level guidance to developers of computable clinical phenotyping algorithms using a wide variety of methods—from simple manually crafted rules to machine learning applied to high-dimensional or unstructured data. This framework is organized around key development stages and articulates broadly applicable principles and practical guidelines focusing on common phenotyping challenges and strategies for addressing them. It is agnostic to phenotype, scientific objective, data sources, and implementation setting.

We derived the content of this framework from 2 sources. First was our collective experience in several multi-site national consortia where computable phenotyping played central roles, including the Electronic Medical Records and Genomics Network,15 the Strategic Health IT Advanced Research Projects Program,16,17 the Health Care Systems Research Network Collaboratory,18 the Mental Health Research Network,19,20 the Vaccine Safety Datalink,21 and the US Food and Drug Administration’s (FDA) Sentinel Initiative.22–25 The second and most formative source was a set of projects we conducted to develop algorithms for anaphylaxis,26 acute pancreatitis,27 and COVID-19 disease28 during 2020-2023. This work, supported by the Sentinel Innovation Center,29 was undertaken in part to prospectively identify and articulate principles, strategies, and guidelines that developers of computable algorithms could use to normalize and streamline the development process itself. These are priorities for the Sentinel Active Risk Identification and Analysis (ARIA) system, the backbone of FDA automated medical product safety surveillance.30,31

We organize algorithm development into 5 stages (Figure 1). The first is assessing fitness-for-purpose, a “pre-mortem”32 exercise, completed before development begins, to assess the feasibility of developing an algorithm that can deliver desired performance. Next is resource-intensive manual chart review to create gold standard data, sometimes required for algorithm training and always required for evaluation. Stage 3 involves the resource-intensive task of engineering features (predictor variables) from various data sources to be used during modeling. Stage 4, model development, applies statistical methods to estimate true functional relationships between predictors and phenotype status. Stage 5 is model evaluation and reporting. All stages address topics of “scalable development”—approaches that enhance efficiency by avoiding high-risk/low-reward efforts, minimizing use of costly/rare expertise, and optimizing methods and algorithms for reusability.

Figure 1. Flow diagram of 5 stages of computable phenotype algorithm development.

Assessing fitness-for-purpose

The first stage in any phenotyping effort is assessing fitness-for-purpose. Such assessments produce, on a short timeline, judgements regarding the likelihood that the clinical condition in question can be accurately identified by a computable algorithm.

Assessing fitness for purpose is a team effort, drawing on expertise in clinical practice, medicine, statistics, EHR data and systems, NLP, and chart review. It begins with thoroughly understanding the scientific objective of the phenotyping effort, the clinical characteristics and case definition of the phenotype, and the literature on approaches to identifying necessary variables using available data, typically administrative claims and EHR data (including electronic text). For example, Sentinel's scientific objectives for phenotyping include identifying outcomes for safety studies of potential adverse effects of medical products and evaluating disease incidence or prevalence.33 Clinical outcomes and estimated incidence, study inclusion and exclusion criteria, data setting(s), types of data available, minimum model performance requirements, and a study timeline should be clearly specified. Discussions with those commissioning an algorithm can resolve ambiguities and identify negotiable specifications. Next, a development team with relevant expertise identifies potential barriers to success, a forward-looking type of assessment that has been referred to elsewhere as a project premortem.32 For the next 4 development stages, team members list challenges to project success and strategies for mitigating them. Tools for estimating how much gold standard training data is needed for modeling may be useful.34 Identified challenges may merit a limited, exploratory chart review to confirm or resolve concerns. Throughout assessment, we recommend attentiveness to clinical and data complexity.

Clinical complexity refers to the inherent ambiguity of a clinical condition. Sources include lack of definitive diagnostic tests or consensus about diagnostic criteria, competing diagnoses, and limitations in clinical knowledge and skill, time, and technology (eg, unavailability of advanced medical imaging) (Table 1). Generally, high clinical complexity leads to greater error in diagnosis, limiting the accuracy of gold standard data and complicating feature engineering by necessitating more data capture, from more data types (eg, imaging, labs, clinical text), and using more sophisticated methods (eg, NLP, ML). This increases the time and expertise needed for development.

Table 1. Sources of clinical complexity and data complexity impacting development of computable phenotype algorithms.

Sources of clinical complexity
1. Lack of definitive diagnostic tests: Definitive diagnostic tests are unavailable for many conditions. Symptoms such as shortness of breath and pain are subjective, reported differently by patients, and present for various health outcomes. In contrast, elevated serum lipase levels are easily and accurately measured, making diagnosis of acute pancreatitis less clinically complex.1
2. Lack of consensus about diagnostic criteria: Disagreement in the medical community about the defining characteristics of a condition creates ambiguity about the definition of the phenotype to be modeled and variability in its documentation (fibromyalgia is an example).35 The need to select or adapt one set of clinical criteria contributes to the complexity of the development effort, and the use of different criteria may impact the accuracy and reproducibility of a computable phenotyping effort.
3. Competing diagnoses: Competing diagnoses exist when multiple clinical conditions have overlapping symptoms. For example, angioedema and anaphylaxis can present similarly. Some events coded as angioedema meet clinical criteria for anaphylaxis.36
4. Limited knowledge, time, or technology: Resource scarcity may introduce time, skill, technology, and/or clinical knowledge constraints that make accurate diagnosis difficult. Examples are non-urgent, non-respiratory chronic health conditions that may be obscured by a focus on acute COVID-19 disease during a pandemic.37–39

Sources of data complexity

5. Data heterogeneity: Heterogeneity of data across settings impacts the availability and meaning of data. Relevant settings include healthcare organizations, time periods, and/or cohorts of patients. Sources of difference include variation in clinical care practice and in data capture priorities and technologies. An example is the use of different diagnosis codes for eye conditions in different healthcare settings.40
6. Data obscurity: Relevant data may be obscured when recorded among voluminous or temporally dispersed observations, or when clinical conditions have signs and symptoms that partly overlap with other conditions (eg, the clinical overlap of heart failure and chronic obstructive pulmonary disease). Obscurity may cause facts to be overlooked or relationships between facts to go unrecognized, even when records are complete (eg, patient histories of suicide attempt mentioned in clinical notes but not documented by standard diagnosis codes).41–43
7. Data imprecision: Imprecision in clinical data may be caused by inherent subjectivity, measurement error, and imprecise or missing documentation. It results in non-specific or non-sensitive data, rendering its relevance to a particular phenotype ambiguous. Examples include imprecisely reported pain location/severity and the use of medication orders instead of fills.
8. Data irregularity: Data irregularity results from the use of locally defined coding schemes or variability across settings and time in the use of standardized coding systems. Shi and colleagues found substantial variation in the use of ICD-10 ophthalmology diagnosis codes across 2 relatively similar healthcare settings.36,44 Valid, alternative forms of documentation of depressive symptoms include diagnosis codes, depression scale scores, and clinical notes.
9. Data instability: Data instability results from evolution in clinical terminology and/or coding schemes, and shifts in care delivery practice. Examples include the 2015 switch from ICD-9 to ICD-10 coding, and the rise of virtual care (accelerated by the COVID-19 pandemic).
10. High dimensionality: High dimensionality occurs when information is represented by discrete data elements that are large in number relative to sample size. It makes feature engineering and modeling more complicated, and measurement error more likely. An example is using 843 NLP-derived measures to model anaphylaxis in a cohort of 239 patients.26
11. Lack of structure: Unstructured data, such as narrative text or images, are inherently complex and require special processing—by humans or software—increasing opportunities for error. Examples include the NLP challenge of distinguishing patient clinical history from family history in clinical notes.

Abbreviations: NLP = natural language processing.

Anaphylaxis is illustrative. Diverse combinations of symptoms satisfy diagnostic criteria,7 complicating feature engineering and modeling. Diagnostic ambiguity is exacerbated by sporadic EHR documentation of key information (eg, symptom timing) and/or exposure to known allergens (historical and current), and by frequent preemptive epinephrine treatment, which can suppress key symptoms. Clinical complexity can lead to difficulty in assigning accurate gold standard labels, for example, when independent expert reviewers reach discordant determinations. A process for resolving such discordances to produce definitive case status labels cannot eliminate the underlying clinical ambiguity of the diagnosis, which may impose upper bounds on algorithm performance. In our anaphylaxis study, 20% of potential events were assigned discordant labels or judged "difficult to determine" by independent physician adjudicators.26

Data complexity, which may overlap with or be independent of clinical complexity, refers to ambiguity, lack of structure, missingness, or error in the documentation of clinical conditions. It may hinder each development stage and reduce model performance. Sources include data heterogeneity, obscurity, imprecision, irregularity, instability, high dimensionality, and lack of structure (Table 1).

Anaphylaxis has high data complexity, involving documentation of numerous signs, symptoms, and symptom severity for multiple organ systems. Our NLP dictionary for anaphylaxis contained >450 entries for a single symptom category, “skin/mucosal involvement.”26 Symptom timing is also a key clinical criterion but is often sparsely documented or difficult to extract using NLP. In contrast, acute pancreatitis has low data complexity. Although diagnostic criteria include subjective and imprecisely documented pain, simple diagnosis codes and serum lipase labs identify acute pancreatitis with 92% positive predictive value (PPV) (95% CI 86%-95%) and 85% sensitivity (95% CI 79%-89%) (Figure 2).1

Figure 2. Relationship between clinical complexity, data complexity, and increasing phenotyping difficulty with illustrative phenotypes.

As clinical and data complexity increase, the anticipated difficulty of phenotyping rises (Figure 2). Anaphylaxis, having high clinical and data complexity, required extensive model development effort and yielded modest model accuracy.26 Acute pancreatitis has high clinical complexity but low data complexity, as noted. Opioid overdose has low clinical complexity, especially in emergency departments (EDs) where overdose symptoms are obvious and laboratory testing confirms opioid exposure, but incomplete clinical documentation often creates high data complexity.45 Diabetes is easily diagnosed and typically well-documented.46

Fitness-for-purpose assessments conclude with decisions about whether development should proceed. A clear “go” decision allows phenotyping to proceed with a better understanding of the tasks. A clear “no-go,” though disappointing, may avoid expensive, failed development. If a clear decision is elusive, the team should consider whether the algorithm’s scientific objectives may be altered to mitigate feasibility challenges. If the lack of a clear clinical case definition is the problem, the project might be delayed until a sufficiently clear definition is identified or developed. If the problem results from data complexity, options include substituting a comparable phenotype with lower data complexity, targeting a phenotype surrogate (eg, severe allergic reaction instead of anaphylaxis), or restricting outcomes to care settings (outpatient, inpatient, or ED) with more complete phenotype data.

Creating gold standard data

Gold standard data indicating whether a phenotype is present are created by applying vetted clinical events criteria47,48 during manual review of medical records. These data may be used in feature engineering or model training and are essential for evaluating all algorithms, regardless of their complexity. Medical record review may also yield insights regarding a phenotype's clinical and data complexities, sources of potential predictors (eg, labs, imaging), and documentation practices.

The quantity of gold standard data needed depends on a phenotype's clinical complexity, case/non-case proportions (≥10 minority-class observations per predictor are recommended),49 and whether any of the data will be used to inform feature engineering (precluding their reuse for modeling).
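As a rough, back-of-envelope illustration of this sizing rule, the sketch below (with assumed, not study-specific, inputs) converts a count of candidate predictors and an expected minority-class prevalence into an approximate number of charts to review.

```python
# Back-of-envelope sizing from the ">=10 minority-class observations per
# predictor" rule of thumb cited above; the inputs are illustrative assumptions.

def approx_charts_to_review(n_predictors: int, minority_prevalence: float,
                            obs_per_predictor: int = 10) -> int:
    """Charts needed so the expected minority-class count reaches the target."""
    required_minority = obs_per_predictor * n_predictors
    return round(required_minority / minority_prevalence)

# Example: 20 candidate predictors and ~30% expected case prevalence among
# sampled charts imply roughly 667 reviewed charts.
print(approx_charts_to_review(n_predictors=20, minority_prevalence=0.30))
```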

Inherent challenges in creating gold standard data make it time and resource intensive (and, along with feature engineering, a bottleneck in computable phenotyping). Treating clinicians may struggle to diagnose clinically complex conditions in real time; retrospectively creating gold standard determinations for complex conditions based on EHR data that may lack clinical information or contain errors can be even harder. EHR information can also be impacted by local care and documentation practices and EHR interfaces, sometimes complicating data interpretations. While a gold standard review process may yield unambiguous case/non-case labels, it cannot eliminate underlying documentation inadequacies or clinical ambiguity.

To create high-quality gold standard data typically required for model training and/or evaluation, we recommend structuring review tasks according to established clinical criteria, performing periodic quality assurance evaluations, and using full-featured data collection systems (Table 2). Monitoring for extreme case/non-case imbalance and adapting sampling as warranted can enhance the utility of the data generated. For multi-site studies where physician adjudicators may not be available at some sites, or when the number of cases to be validated is large, there may be advantages to having reviews conducted by non-clinician abstractors trained and supervised by clinicians.

Table 2. Guiding principles for efficiently creating high-quality gold standard data via manual chart review with illustrative examples.

Principle Description/illustration
1. Structure reviews around established clinical criteria: Structure gold standard chart review tasks and written abstraction guidelines around established, widely accepted clinical criteria (eg, from the Brighton Collaboration).48 If suitable criteria are unavailable, modify existing or develop new criteria to facilitate consistent determinations of phenotype status. The Atlanta criteria for diagnosing acute pancreatitis,50 for example, require any 2 of the following: (1) abdominal pain consistent with acute pancreatitis, (2) serum lipase at least 3 times greater than the upper limit of normal, or (3) characteristic imaging findings.
2. Provide comprehensive training to chart reviewers: Training develops a shared understanding among all chart reviewers of the EHR interface, terminology, process, and clinical events criteria. It requires considerable time and effort, including review of actual charts. Reviewer training for a study of acute pancreatitis included independent review and group discussion of 16 actual charts with clinician guidance.
3. Conduct periodic quality assurance evaluations: Periodic inter-reviewer reliability (IRR) analyses51 based on dual, blind reviews of random 5%-10% samples can detect "drift" in outcome determinations and foster communication among reviewers about potential threats to consistency (a sketch of one such agreement check follows this table). Alternatively, random samples of reviewer determinations may be verified by clinician reviews, as was done in an acute pancreatitis study.27
4. Use a full-featured data collection system: Data collection systems such as Research Electronic Data Capture (REDCap)52,53 can streamline and secure the review process by embedding review guidelines in data capture forms, pre-populating data fields, managing IRR reviews, and recording determinations and supporting evidence (eg, a REDCap acute pancreatitis form).54
5. Strive for balance in cases and non-cases: Equal proportions of validated cases and non-cases are optimal for model training.55 Designing a sampling scheme that mitigates class imbalance can help avoid challenges due to small minority class size. Training data in our anaphylaxis algorithm study were severely imbalanced, having 153 true cases and 83 true non-cases.26
6. Train local, non-clinician reviewers to conduct chart reviews: An acute pancreatitis study trained local, non-clinician reviewers to conduct 300 chart reviews, reducing the need for relatively scarce, more costly clinician reviewers.56 According to the principles of distributed cognition,57–59 knowledge sharing and discussion during training may enhance chart review quality by (1) formalizing a clear, reproducible decision-making process, (2) identifying where quality assurance is needed and how to monitor it, and (3) revealing local knowledge of the EHR, healthcare operations, and local practices that impact clinical data and its interpretation. Blind, duplicate review by a physician in the acute pancreatitis study achieved 100% agreement with trained reviewers in a 15% random sample (31/205), and 100% agreement in 10 randomly selected determinations reviewed by other physicians.60

Abbreviation: EHR = electronic health record.
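As a minimal illustration of the quality assurance checks in principle 3, the sketch below computes percent agreement and Cohen's kappa for a hypothetical dual-review sample; the reviewer labels are toy values, not data from the cited studies.

```python
# Hypothetical dual-review IRR check: percent agreement and Cohen's kappa for
# two reviewers' case/non-case determinations on the same sampled charts.
from sklearn.metrics import cohen_kappa_score

reviewer_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # 1 = case, 0 = non-case (toy labels)
reviewer_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

agreement = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / len(reviewer_a)
kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.2f}")  # 0.90 and 0.80
```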

Feature engineering

Features (also predictor variables or covariates) are structured data representations of information used as inputs in all phenotype algorithms. Features may be based on structured data, such as medical claims codes, or unstructured data, such as indicators derived from chart notes via NLP. Feature engineering uses clinical and informatics domain knowledge (subject-area expertise) to construct operational definitions of features from these “raw” sources. Feature engineering is an art,60 though automation has shown early success.61 Model development may require dozens to hundreds of features.

Feature engineering is an expertise-intensive, time-consuming task62 and a major bottleneck in algorithm development. The necessary expertise is expensive and sometimes scarce. For efficiency, initial feature engineering and model development may occur in one institutional setting, extending to others if successful. Although predicting which features will most improve performance is difficult, excessive effort creating or honing features may be wasted effort.

When algorithms will be implemented in multiple settings, incorporating locally idiosyncratic data can also be wasteful because performance gains may not generalize. Engineering for reusability requires favoring simplicity over complexity and using data available across settings.

Table 3 summarizes principles to enhance efficiency, transportability (ease of implementation across settings), and generalizability of feature engineering. Including standard features, engineering diverse sets of features, and leveraging domain expertise (principles 1-3) are common sense but important. Limiting the number of features engineered is neither beneficial nor necessary; determining which features help predict a phenotype occurs during modeling. Engineering generalizable features requires knowledge of clinical diagnostic criteria, informatics, and how EHR systems influence data capture. Useful domain knowledge may also be derived from published clinical knowledge sources7,50,66,67 or scientific reports of phenotyping efforts (eg, modeling anaphylaxis6,68).

Table 3. Guiding principles of feature engineering with illustrative examples.

Principle Description/illustration
1. Include standard features: Standard features capture information about demographics, how cohort inclusion criteria were met, presentation setting, and time period (eg, the ICD-10 era or COVID-19 era) and may impact model performance. For example, diagnosis codes for anaphylaxis had a PPV of 78% for patients aged <20 years vs 57% for patients aged ≥60 years.
2. Engineer diverse features to capture diverse signals: Diverse features represent information about the same thing in different ways or from different sources (eg, NLP-derived concepts may be engineered as raw or normalized counts, the latter adjusted for chart length). Diverse features may improve model performance by capturing complementary signals (ie, alternative forms of evidence that a phenotype is present).
3. Feature engineering is enhanced by domain knowledge: Feature engineering relies on clinical domain, informatics, and statistical knowledge to identify information that is relevant, extractable, and represented in a way useful for modeling. We used expertise from all 3 domains to define a key predictor of acute pancreatitis: a patient's maximum serum lipase value, normalized by 3 times the upper limit of normal, within ±14 days of a diagnosis of acute pancreatitis (a sketch of this feature follows the table).
4. Consider automated feature engineering approaches: Automated feature engineering uses formulaic, data-driven methods to define features for specific clinical phenotypes. One such approach mines relevant medical concepts from published clinical knowledge articles on the phenotype, then defines one feature per concept as the count of its mentions in a patient's chart.63 We used this approach to create and use 158 features in a model identifying symptomatic COVID-19 patients.28
5. Use expert-guided manual feature engineering sparingly: Expert-guided manual feature engineering is potentially very costly and may embody idiosyncrasies stemming from local data and/or expert knowledge. It is therefore best used to identify promising feature categories rather than fashion complex operational definitions. Of 45 structured and 468 NLP features we manually engineered for anaphylaxis, our best models retained only 16 structured and 32 NLP features. We are now investigating whether lower cost automated approaches may perform comparably.
6. Design for transportability across settings: Transportability refers to ease of implementation across settings. Feature definitions that are more general (vs specific) are more transportable (eg, NLP of all imaging reports in a time window vs reports with setting-specific titles). Engineering blind to gold standard labels avoids "baking in" feature associations that vary across settings.64
7. Standardize required tailoring: Tailoring refers to changes in algorithm definitions or logic required to accommodate idiosyncratic data. Tailoring is standardized when software code used to engineer features anticipates and accommodates specification of setting-specific lists or selection of options (eg, custom term lists for NLP features, or software options to include/exclude patient after-visit summaries as clinical text for NLP processing).
8. Share code and tools publicly: Use of publicly available software-sharing services such as GitHub vastly improves transportability by simplifying sharing of software code/tools and version management. For example, code used in our anaphylaxis algorithm development projects is available on GitHub.65

Abbreviation: PPV = positive predictive value.
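The expert-defined lipase feature in principle 3 might be implemented roughly as follows; the column names, inputs, and helper function are assumptions for illustration, not the project's actual code.

```python
# Hypothetical implementation of the normalized-lipase feature from principle 3:
# a patient's maximum serum lipase within +/-14 days of an acute pancreatitis
# diagnosis, divided by 3x the upper limit of normal (ULN).
import pandas as pd

def max_normalized_lipase(labs: pd.DataFrame, dx_date: pd.Timestamp,
                          upper_limit_normal: float) -> float:
    """labs is assumed to have columns 'lab_date' (datetime64) and 'lipase_value'."""
    in_window = labs[(labs["lab_date"] >= dx_date - pd.Timedelta(days=14)) &
                     (labs["lab_date"] <= dx_date + pd.Timedelta(days=14))]
    if in_window.empty:
        return 0.0  # assumption: no qualifying lab yields a zero-valued feature
    return in_window["lipase_value"].max() / (3 * upper_limit_normal)
```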

Automated engineering (principle 4) may produce features that perform comparably to manually engineered features with dramatically less effort and expertise and fewer operator-dependent idiosyncrasies. Both structured69 and unstructured data9,61,63,70 have been successfully used to automate feature engineering for identifying chronic conditions and an acute health condition.71 The low cost of automated features makes them an attractive phenotyping starting point.
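The count-of-mentions idea at the core of such approaches can be sketched in a few lines; the term lists and naive string matching below are toy assumptions, whereas the cited pipelines use a clinical NLP engine to detect concept mentions.

```python
# Toy sketch of count-of-mentions feature engineering: one feature per concept,
# valued as the number of times the concept is mentioned in a patient's notes.
import re
from collections import Counter

concept_terms = {                      # hypothetical concepts mined from a knowledge source
    "cough": ["cough", "coughing"],
    "fever": ["fever", "febrile", "pyrexia"],
    "anosmia": ["anosmia", "loss of smell"],
}

def concept_counts(note_text: str) -> Counter:
    """Count mentions of each concept in one patient's concatenated notes."""
    counts = Counter()
    lowered = note_text.lower()
    for concept, terms in concept_terms.items():
        counts[concept] = sum(len(re.findall(r"\b" + re.escape(t) + r"\b", lowered))
                              for t in terms)
    return counts

notes = "Patient reports fever and a dry cough; denies loss of smell."
print(concept_counts(notes))  # each of the three concepts is counted once
```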

Expert-guided manual feature engineering should be used sparingly, focusing on strategic guidance rather than fine tuning (principle 5). Engineering effort that more precisely represents the true facts about a patient but does not improve model performance is wasted. Avoiding overengineering can reduce costs and threats to model generalizability.

Principles 6-8 simplify implementing features across settings (transportability). These include designing features to accommodate heterogeneous data inputs, anticipating and accommodating unavoidable local tailoring, and making software tools publicly available.72–74

Model development

Constructing a useful prediction model depends on (1) the underlying strength of relationships between available features (predictor variables) X and outcome Y; (2) the complexity of the true underlying functional form of these relationships; and (3) how closely the algorithms under consideration can approximate the true prediction function. We cannot know in advance which ML or parametric modeling approach will most accurately learn the prediction function from the set of features, so we recommend V-fold cross-validation to simultaneously evaluate many algorithms. Modeling choices should be tailored to the prediction task and data characteristics.75 The guiding principles below (Table 4) can improve the scalability of developing computable phenotype models.

ML is enhanced by incorporating domain knowledge. Domain expertise can focus manual engineering on features known to distinguish cases from non-cases (eg, serum lipase measures for acute pancreatitis, Table 4), caution against using information that is poorly captured or non-specific to the outcome, and determine whether heterogeneous feature-outcome associations across sub-populations pose challenges (eg, anaphylaxis in adults vs children). Automated feature engineering also exploits domain knowledge to identify candidate predictors; Yu and colleagues engineered features for clinical concepts mined from clinical knowledge base articles for several clinical conditions.63

Table 4. Guiding principles of model development and illustrative examples.

Principle Description/illustration
1. Incorporate domain knowledge in modeling: ML is enhanced by incorporating domain knowledge to define the target population and to select candidate features. When clinical presentations are known to vary across demographic subgroups (eg, anaphylaxis in adults vs children), stratified modeling may improve performance even at small sample sizes; cross-validated (cv) sensitivity in our adults-only anaphylaxis models improved to 70% from 65% in our all-age-group models (at cv-PPV of 79% in both models).26 Elevated serum lipase labs ≥3 times the upper limit of normal are known to be clinically significant in diagnosing acute pancreatitis1; if available, such labs should be included as candidate predictors.
2. Pre-process the data to reduce dimensionality: Outcome-blind pre-processing of the set of candidate features to remove those that are redundant or highly correlated can reduce the complexity of a prediction task without sacrificing predictive ability. In developing acute pancreatitis models, we created a binary variable indicating whether "necrosis" was ever mentioned in any of a patient's imaging reports, and a count variable for the total number of mentions in all reports; because most patients had only one mention of necrosis, the empirical correlation between these variables was 0.998, and we randomly excluded one of them from the dataset.
3. Consider many algorithms and dimension reduction strategies: Because the true functional form of the relationship between candidate features and the outcome is unknown, considering a diverse set of candidate prediction algorithms is advantageous. A main-terms logistic regression model for anaphylaxis with 136 covariates and 236 observations (153 cases, 83 non-cases) achieved a cv-AUC of only 0.49 because it overfit the data. Coupling the same logistic model with least absolute shrinkage and selection operator (LASSO)-based dimension reduction retained only 48 covariates, increasing the model's cv-AUC to 0.64.
4. Use V-fold cross-validation to evaluate performance: The cross-validation scheme should accurately estimate performance of a model trained on all available data. With a modest sample of 236 observations with gold standard labels (153 cases, 83 non-cases), we used V-fold cross-validation (V = 20) to develop an anaphylaxis model. We randomly assigned observations to 20 disjoint validation sets, stratifying on the gold standard outcome, each set having 7 or 8 cases and 4 or 5 non-cases. Thus, each training set contained 95% of all observations with a similar ratio of cases to non-cases, and performance metrics were averaged over all validation sets, ie, over all observations in the data.
5. Specify performance metrics relevant to the use case: Cross-validation of performance metric(s) relevant for downstream use is vital. Models we developed for anaphylaxis, acute pancreatitis, and symptomatic COVID-19 disease were intended for use in identifying outcomes for FDA safety studies. Beyond AUC, which measures the ability to discriminate between cases and non-cases, FDA studies often benefit from high PPV (0.80 or above) while maximizing sensitivity; our performance metrics thus included cv-sensitivity and cv-PPV at multiple candidate classification thresholds. In contrast, for a screening task, high sensitivity is the most relevant metric. If the training sample is not a simple random sample from the population, use appropriate weighting in the evaluation.
6. Final algorithm choice is guided by downstream considerations: Although cross-validated model performance is the primary consideration in algorithm selection, replicability and generalizability should also be considered. Consider hypothetical models A, based on easily obtained structured data covariates with cv-AUC of 0.89 (95% CI: 0.87, 0.91), and B, based on the same structured covariates plus NLP-derived covariates with cv-AUC of 0.90 (95% CI: 0.88, 0.92). The small estimated performance gain of B may not be worth the time and effort of using NLP. However, if model B offers better classification accuracy in a region of the ROC curve relevant to the use case, B may be worth the additional effort. Models based on fewer and more easily operationalized and interpreted covariates may be favored due to their ease of implementation and transparency.
7. Train the selected algorithm on the entire dataset: In developing a model to identify patients with acute pancreatitis, a LASSO dimension reduction approach coupled with an elastic net algorithm achieved the best tradeoff between cv-PPV (0.90) and cv-sensitivity (0.92) by setting the classification threshold at the 37% quantile. This threshold corresponded to a predicted probability of 0.39 based on a model trained on all observations.

Abbreviation: PPV = positive predictive value.

Outcome-blind pre-processing of the set of candidate features to remove those that are redundant or highly correlated can reduce the complexity of a prediction task without sacrificing predictive ability. Recommended procedures include omitting features with little to no variation in values (eg, 98% identical) and retaining only one of several highly correlated features (eg, duplicate values, or absolute correlation > 0.98). Features unlikely to be available in other settings should also be avoided, for the sake of model generalizability.
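A minimal sketch of this kind of outcome-blind screening follows, assuming the engineered features sit in a numeric pandas DataFrame; the thresholds mirror the examples in the text.

```python
# Outcome-blind pre-processing: drop near-constant features, then keep only one
# of each pair of highly correlated features. Thresholds follow the text.
import pandas as pd

def screen_features(X: pd.DataFrame,
                    max_dominant_share: float = 0.98,
                    max_abs_corr: float = 0.98) -> pd.DataFrame:
    # Drop features whose single most common value covers >98% of observations.
    dominant_share = X.apply(lambda col: col.value_counts(normalize=True).iloc[0])
    X = X.loc[:, dominant_share <= max_dominant_share]

    # For each pair with absolute correlation above 0.98, drop the second feature.
    corr = X.corr().abs()
    cols = list(corr.columns)
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > max_abs_corr:
                to_drop.add(cols[j])
    return X.drop(columns=sorted(to_drop))
```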

Because the true functional form of relationships between candidate features and the outcome is unknown, considering a diverse set of candidate prediction algorithms is advantageous. These can include multiple variations of parametric models, tree-based algorithms, and neural networks. If features are continuous, consider multivariate adaptive regression splines, generalized additive models, and tree-based algorithms that avoid imposing monotonic linear relationships between predictor and outcome. An ensemble Super Learner that combines predictions from multiple algorithms can consider up to a polynomial number of algorithms simultaneously, far more than are typically investigated in practice76,77 ("multiple testing" is not a concern because P-values are not evaluated). With high-dimensional feature data, coupling candidate algorithms with dimension reduction screening strategies that consider feature-outcome associations may be advantageous. Each algorithm-screener combination defines a custom ML approach.

Cross-validation of performance metric(s) relevant for downstream use is vital. For phenotyping, relevant loss functions are 1 − AUC (one minus the area under the receiver operating characteristic curve) and the negative log likelihood (NLL). 1 − AUC captures the ability to discriminate between cases and non-cases but ignores the accuracy of the predicted probabilities. A similar metric, 1 − AUC-PR (one minus the area under the precision-recall curve), is sometimes recommended when the minority class is rare.78 NLL favors well-calibrated algorithms with high predictive accuracy but potentially less accurate classification.

We recommend V-fold cross-validation over other types because an appropriate V-fold scheme makes the most efficient use of the data.76,77 V-fold cross-validation begins by randomly assigning observations to V disjoint validation sets. Each validation set has a corresponding independent training set containing the remaining observations. Algorithms are fitted/trained on training set data, and loss is evaluated on observations in the independent validation set. Cross-validated risk is the average loss over all V validation sets. Stratifying assignment to validation folds on the outcome and choosing a large V (eg, 20) ensures that the training data closely approximate the entire dataset, particularly with sparse data or rare outcomes.
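The cross-validation scheme described above can be sketched as follows, assuming X is an engineered feature matrix and y the gold standard labels; the two candidate learners stand in for the larger library of algorithm-screener combinations a real project would compare.

```python
# Stratified V-fold cross-validation (V = 20) over candidate learners, scored by
# 1 - AUC and negative log likelihood; X and y are assumed to exist already.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

candidates = {
    "lasso_logistic": make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=0),
}

cv = StratifiedKFold(n_splits=20, shuffle=True, random_state=0)  # stratified on outcome

for name, learner in candidates.items():
    # Out-of-fold predicted probabilities for every observation.
    p = cross_val_predict(learner, X, y, cv=cv, method="predict_proba")[:, 1]
    print(name,
          "1 - cv-AUC:", round(1 - roc_auc_score(y, p), 3),
          "cv-NLL:", round(log_loss(y, p), 3))
```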

Although cross-validated model performance is the primary consideration in algorithm selection, transportability and generalizability (comparable performance across settings) should be considered, particularly when using data from multiple healthcare settings (Table 4, Figure 3).

Figure 3. Selecting a final model from all models developed based on considerations of model performance, model transportability, and model generalizability.

The production model is obtained by training the selected algorithm on all observations. The classification threshold is chosen to optimize cross-validated performance with respect to PPV, sensitivity, or a summary F-measure, discussed further in the next section. The probability threshold is set to the predicted probability at the optimal quantile of the predicted values.
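A minimal sketch of this final step, assuming best_learner is the algorithm selected by cross-validation and X, y are the full gold standard data (the 37% quantile carries over from the acute pancreatitis example above):

```python
# Train the production model on all gold standard observations and set the
# classification threshold at the chosen quantile of its predicted probabilities.
import numpy as np

production_model = best_learner.fit(X, y)       # assumed: learner chosen in the prior stage
probs = production_model.predict_proba(X)[:, 1]

threshold = np.quantile(probs, 0.37)            # probability at the chosen quantile
predicted_case = probs > threshold              # binary phenotype indicator Y*(c)
```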

Model evaluation and reporting

Model evaluation aims to estimate the performance of a prediction model applied to novel data from the same setting or a new setting. Three key questions are as follows: What performance metrics are relevant to the potential downstream use case(s)? Where and when will the predictions be applied, considering generalizability to new candidate locations and specific time frames (past, present)? How will the model be used in epidemiologic studies, and how will trade-offs in performance be tailored to application goals?

Criteria for what to measure in evaluating a clinical prediction model are well established and focus on measures such as sensitivity (the proportion of true cases identified, or recall) and PPV (the proportion of observations classified as cases that are true cases, or precision).79 These rates depend upon cut-offs or thresholds that dichotomize model predictions into "positive" and "negative" outcomes and allow evaluation at multiple candidate thresholds. Specifically, a binary phenotype generated this way is defined as Y*(c), an indicator of whether the model prediction, M, is larger than the cut-off value, c, ie, [M > c]. Performance measures are then defined in terms of true outcomes, Y, and dichotomous predictions, Y*(c), such as sensitivity, P[Y*(c) = 1 | Y = 1], and PPV, P[Y = 1 | Y*(c) = 1]. Graphical summaries can plot error metrics as a function of the threshold, c, across a range of candidate thresholds. In the ML literature, a precision-recall curve plots PPV (precision) versus sensitivity (recall) for all possible thresholds.78 The F-score combines sensitivity and PPV into a single summary, F1, their harmonic mean, which is sometimes recommended as a metric to optimize in model building or evaluation.80 When the model defines a safety outcome, both sensitivity and PPV are key to understanding the implications of measurement errors associated with using a computable phenotype compared to a gold-standard outcome. When sensitivity is more (or less) important than PPV, a weighted F-score can be evaluated. For example, a screening test might be evaluated with F2, weighting sensitivity twice as heavily as PPV, and cohort inclusion criteria might be evaluated with F0.5, weighting PPV twice as heavily as sensitivity.
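These threshold-based metrics can be computed directly from gold standard labels y and predicted probabilities p at a candidate cut-off c; all three are assumed inputs in this sketch.

```python
# Sensitivity, PPV, and (weighted) F-scores at one candidate threshold c.
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_star = (p > c).astype(int)             # dichotomized prediction Y*(c) = [M > c]

sensitivity = recall_score(y, y_star)    # P[Y*(c) = 1 | Y = 1]
ppv = precision_score(y, y_star)         # P[Y = 1 | Y*(c) = 1]
f1 = fbeta_score(y, y_star, beta=1)      # harmonic mean of sensitivity and PPV
f2 = fbeta_score(y, y_star, beta=2)      # weights sensitivity more heavily
f05 = fbeta_score(y, y_star, beta=0.5)   # weights PPV more heavily
```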

Considering where model predictions will be used directly relates to generalizability and whether performance may vary by context.81 Clearly defining settings in which predictions are applied aids in interpreting validation assessments. A setting similar to the development data source evaluates reproducibility (internal validity). A setting with different context elements such as time periods for data capture, geographic or healthcare settings, or population characteristics simultaneously evaluates reproducibility, transportability, and generalizability. Honest performance assessment requires evaluation using data independent of the training data.

For downstream epidemiologic analyses using a prediction model, many applications use a simple binary indicator, Y*(c). The primary drivers for using a derived binary outcome include simplicity of presentation and interpretation, and a potential relaxing of the need for local calibration of predictions. However, binary indicators lead to measurement error in computable phenotypes that can impact both the bias and precision of estimated association measures between the phenotype and exposures.82 Understanding the implications of measurement error is key to choosing a threshold to define the phenotype. For example, higher PPV leads to lower bias in relative risk estimates, and higher sensitivity leads to lower bias in absolute risk difference estimates.83 Furthermore, if discrimination is poor, higher thresholds for Y*(c) tend to decrease the overall prevalence of the computable phenotype and inflate the variance of association estimates due to lower in-sample information.

The implications for bias and standard errors should therefore be carefully considered, evaluating potential trade-offs and linking them to downstream analytical goals. For example, narrowly focusing on high PPV may increase statistical uncertainty and adversely impact power to detect non-null associations. With large samples, for which bias dominates variance, high PPV is paramount; smaller samples require a more nuanced bias-variance trade-off. Analytical expressions for the bias and variance of common regression estimators can be useful guides,83 and simulations can summarize bias, variance, and power for alternative choices of Y*(c). Bias may also be reduced by using the predicted probability, rather than a binary indicator, as the outcome when estimating effects. Calibration is a key contributor to success.
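To make these trade-offs concrete, the following illustrative calculation (assumed true risks, sensitivity, and specificity, not values from the cited studies) shows non-differential outcome misclassification attenuating a relative risk while shrinking the risk difference by the factor (sensitivity + specificity − 1).

```python
# Illustrative, assumed numbers showing how outcome misclassification biases
# relative risk (RR) and risk difference (RD) estimates toward the null.
sens, spec = 0.85, 0.995                    # assumed phenotype sensitivity and specificity
risk_exposed, risk_unexposed = 0.02, 0.01   # assumed true risks: RR = 2.0, RD = 0.01

def observed_risk(true_risk: float) -> float:
    """Expected prevalence of the computable phenotype: true + false positives."""
    return sens * true_risk + (1 - spec) * (1 - true_risk)

rr_obs = observed_risk(risk_exposed) / observed_risk(risk_unexposed)
rd_obs = observed_risk(risk_exposed) - observed_risk(risk_unexposed)
print(f"observed RR = {rr_obs:.2f}, observed RD = {rd_obs:.4f}")
# Observed RR is about 1.63 (attenuated from 2.0 by false positives);
# observed RD = (sens + spec - 1) * true RD = 0.845 * 0.01, about 0.0085.
```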

Discussion

This general framework offers guidance to teams with varying levels of expertise and resources in developing computable phenotype algorithms of many types. It is particularly relevant for phenotypes with high clinical and data complexity, and where multi-site implementation is needed. It is intended to encourage efficiency in all aspects of development and to enhance the performance and transportability of algorithms applied in multi-site settings.

Several points merit emphasis. First, fitness-for-purpose assessment before development work begins is critical. Halting when failure is likely is essential for achieving overall scalability and optimizing the impact of precious, sometimes scarce expertise.

Two major bottlenecks in development are creating gold standard data and feature engineering. Careful attention to the recommendations for these stages will enhance the overall efficiency of phenotype development efforts.

Computable phenotypes developed in one setting for reuse in other settings should be guided by awareness of potential data heterogeneity, computing resources, and availability of relevant expertise across settings. Designing phenotype algorithms based on “lowest common denominators” across settings reduces risks of uneven cross-site performance. When setting-specific feature engineering or algorithm tailoring is unavoidable, code and software should be designed to accommodate them. “One-off” solutions that boost performance in one setting may not be transportable to or perform comparably in other settings.

Automated approaches to feature engineering or developing entire phenotyping models should be considered because they require relatively modest effort, may fully achieve performance requirements or, if not, provide a useful baseline for manual curation.

Aspects of methods development that should be prioritized for future research include:

  1. Disseminating findings of and lessons learned by applying fitness-for-purpose assessments to many different phenotypes.

  2. Assessing feasibility and efficiency gains of machine-assisted chart review interfaces for creating gold standard data.

  3. Defining a systematic approach to incorporating automated feature engineering and silver-label model training,28 coupled with sampling strategies to minimize the number of chart reviews required for evaluation.

  4. Defining conditions under which reusing high-performing models without additional gold-standard evaluation is reasonable.

Several limitations merit attention. First, general frameworks inevitably contain flaws; we hope this attempt will elicit useful discussion and future elaboration. Second, the framework we present is limited by our own experiences and knowledge. Third, our illustrative models were appropriate for our data and phenotype relationships; other phenotypes and use cases will involve different data and relationships.

Conclusion

This general framework is intended to provide an overall structure and guidance useful to those considering undertaking development of computable clinical phenotype algorithms. Once commitment to development occurs, this framework provides an integrated set of principles, strategies, and practical guidelines intended to enhance the overall efficiency and transparency of the effort as well as the performance, transportability, and reusability of the algorithms developed.

Acknowledgments

The authors, all members of the Sentinel Phenotyping Project, would like to acknowledge the contributions of other members of the Sentinel Phenotyping Project whose input and assistance greatly contributed to the success of this work (listed alphabetically): Adebola Ajao, Maralyssa A. Bann, Cosmin Bejan, David J. Cronkite, Vina F. Graham, Kara Haugen, Sara Karami, Yong Ma, Denis B. Nyongesa, Daniel S. Sapp, Mary Shea, Xu Shi, Mayura Shinde, Matthew T. Slaughter, and Shamika More. The views expressed in this article represent those of the authors and do not necessarily represent the official views of the U.S. FDA.

Contributor Information

David S Carrell, Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States.

James S Floyd, Department of Medicine, School of Medicine, University of Washington, Seattle, WA 98195, United States; Department of Epidemiology, School of Public Health, University of Washington, Seattle, WA 98195, United States.

Susan Gruber, Putnam Data Sciences, LLC, Cambridge, MA 02139, United States.

Brian L Hazlehurst, Center for Health Research, Kaiser Permanente Northwest, Portland, OR 97227, United States.

Patrick J Heagerty, Department of Biostatistics, School of Public Health, University of Washington, Seattle, WA 98195, United States.

Jennifer C Nelson, Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States.

Brian D Williamson, Kaiser Permanente Washington Health Research Institute, Seattle, WA 98101, United States.

Robert Ball, Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, MD 20993, United States.

Author contributions

All authors contributed to the conceptualization, organization, and presentation of the principles and guidelines presented in this paper. David S. Carrell, James S. Floyd, Susan Gruber, Patrick J. Heagerty, Brian L. Hazlehurst, Jennifer C. Nelson, and Brian D. Williamson drafted original content for various sections of the paper. All authors critically reviewed and edited the combined manuscript and contributed intellectual value to the article.

Funding

This work was supported by the U.S. Food and Drug Administration via task order number 75F40119F19002 under master agreement number 75F40119D10037. The FDA approved the study protocol and reviewed and approved this manuscript. Coauthors from the FDA participated in the preparation of and decision to submit the manuscript for publication. The FDA had no role in data collection, management, or analysis.

Conflicts of interest

R.B. is an author on US Patent 9,075,796, “Text mining for large medical text datasets and corresponding medical text classification using informative feature selection.” At present, this patent is not licensed and does not generate royalties. All other authors have no competing interests to declare.

Data availability

This work did not entail collection, management, or analysis of any original or secondary data.

References

  1. Floyd JS, Bann MA, Felcher AH, et al. Validation of acute pancreatitis among adults in an integrated healthcare system. Epidemiology. 2023;34(1):33-37. doi:10.1097/ede.0000000000001541
  2. Liu Y, Siddiqi KA, Cook RL, et al. Optimizing identification of people living with HIV from electronic medical records: computable phenotype development and validation. Methods Inf Med. 2021;60(3-4):84-94. doi:10.1055/s-0041-1735619
  3. Paul DW, Neely NB, Clement M, et al. Development and validation of an electronic medical record (EMR)-based computed phenotype of HIV-1 infection. J Am Med Inform Assoc. 2018;25(2):150-157. doi:10.1093/jamia/ocx061
  4. Goetz MB, Hoang T, Kan VL, Rimland D, Rodriguez-Barradas M. Development and validation of an algorithm to identify patients newly diagnosed with HIV infection from electronic health records. AIDS Res Hum Retroviruses. 2014;30(7):626-633. doi:10.1089/aid.2013.0287
  5. Walsh KE, Cutrona SL, Foy S, et al. Validation of anaphylaxis in the Food and Drug Administration's Mini-Sentinel. Pharmacoepidemiol Drug Saf. 2013;22(11):1205-1213. doi:10.1002/pds.3505
  6. Ball R, Toh S, Nolan J, Haynes K, Forshee R, Botsis T. Evaluating automated approaches to anaphylaxis case classification using unstructured data from the FDA Sentinel System. Pharmacoepidemiol Drug Saf. 2018;27(10):1077-1084. doi:10.1002/pds.4645
  7. Sampson HA, Munoz-Furlong A, Campbell RL, et al. Second symposium on the definition and management of anaphylaxis: summary report—Second National Institute of Allergy and Infectious Disease/Food Allergy and Anaphylaxis Network symposium. J Allergy Clin Immunol. 2006;117(2):391-397. doi:10.1016/j.jaci.2005.12.1303
  8. Kreimeyer K, Foster M, Pandey A, et al. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. J Biomed Inform. 2017;73:14-29. doi:10.1016/j.jbi.2017.07.012
  9. Zhang Y, Cai T, Yu S, et al. High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP). Nat Protoc. 2019;14(12):3426-3444. doi:10.1038/s41596-019-0227-6
  10. Wong J, Prieto-Alhambra D, Rijnbeek PR, Desai RJ, Reps JM, Toh S. Applying machine learning in distributed data networks for pharmacoepidemiologic and pharmacovigilance studies: opportunities, challenges, and considerations. Drug Saf. 2022;45(5):493-510. doi:10.1007/s40264-022-01158-3
  11. Rasmussen LV, Thompson WK, Pacheco JA, et al. Design patterns for the development of electronic health record-driven phenotype extraction algorithms. J Biomed Inform. 2014;51:280-286. doi:10.1016/j.jbi.2014.06.007
  12. Xu J, Rasmussen LV, Shaw PL, et al. Review and evaluation of electronic health records-driven phenotype algorithm authoring tools for clinical and translational research. J Am Med Inform Assoc. 2015;22(6):1251-1260. doi:10.1093/jamia/ocv070
  13. Peissig PL, Rasmussen LV, Berg RL, et al. Importance of multi-modal approaches to effectively identify cataract cases from electronic health records. J Am Med Inform Assoc. 2012;19(2):225-234. doi:10.1136/amiajnl-2011-000456
  14. Yu J, Pacheco JA, Ghosh AS, et al. Under-specification as the source of ambiguity and vagueness in narrative phenotype algorithm definitions. BMC Med Inform Decis Mak. 2022;22(1):23. doi:10.1186/s12911-022-01759-z
  15. Gottesman O, Kuivaniemi H, Tromp G, eMERGE Network, et al. The Electronic Medical Records and Genomics (eMERGE) Network: past, present, and future. Genet Med. 2013;15(10):761-771. doi:10.1038/gim.2013.72
  16. Rea S, Pathak J, Savova G, et al. Building a robust, scalable and standards-driven infrastructure for secondary use of EHR data: the SHARPn project. J Biomed Inform. 2012;45(4):763-771. doi:10.1016/j.jbi.2012.01.009
  17. Office of the National Coordinator for Health Information Technology. Strategic Health IT Advanced Research Projects (SHARP) program. 2011. Accessed June 25, 2024. https://www.healthit.gov/data/quickstats/strategic-health-it-advanced-research-projects-sharp-program
  18. Weinfurt KP, Hernandez AF, Coronado GD, et al. Pragmatic clinical trials embedded in healthcare systems: generalizable lessons from the NIH Collaboratory. BMC Med Res Methodol. 2017;17(1):144. doi:10.1186/s12874-017-0420-7
  19. Mental Health Research Network. About MHRN. Accessed June 25, 2024. https://mhresearchnetwork.org/
  20. HCSRN. Mental Health Research Network overview. Accessed June 25, 2024. https://hcsrn.org/collaboration/cornerstone-projects/mhrn/
  21. Baggs J, Gee J, Lewis E, et al. The Vaccine Safety Datalink: a model for monitoring immunization safety. Pediatrics. 2011;127 Suppl 1:S45-S53. doi:10.1542/peds.2010-1722H
  22. Behrman RE, Benner JS, Brown JS, McClellan M, Woodcock J, Platt R. Developing the Sentinel System—a national resource for evidence development. N Engl J Med. 2011;364(6):498-499. doi:10.1056/NEJMp1014427
  23. Ball R, Robb M, Anderson SA, Dal Pan G. The FDA's Sentinel Initiative—a comprehensive approach to medical product surveillance. Clin Pharmacol Ther. 2016;99(3):265-268. doi:10.1002/cpt.320
  24. Platt R, Brown JS, Robb M, et al. The FDA Sentinel Initiative—an evolving national resource. N Engl J Med. 2018;379(22):2091-2093. doi:10.1056/NEJMp1809643
  25. Food and Drug Administration. Food and Drug Administration Amendments Act of 2007. Accessed June 25, 2024. https://www.govinfo.gov/content/pkg/PLAW-110publ85/pdf/PLAW-110publ85.pdf
  26. Carrell DS, Gruber S, Floyd JS, et al. Improving methods of identifying anaphylaxis for medical product safety surveillance using natural language processing and machine learning. Am J Epidemiol. 2023;192(2):283-295. doi:10.1093/aje/kwac182
  27. Sentinel. Validation of acute pancreatitis using machine learning and multi-site adaptation for anaphylaxis. Accessed June 25, 2024. https://www.sentinelinitiative.org/methods-data-tools/methods/validation-acute-pancreatitis-using-machine-learning-and-multi-site
  28. Smith JC, Williamson BD, Cronkite DJ, et al. Data-driven automated classification algorithms for acute health conditions: applying PheNorm to COVID-19 disease. J Am Med Inform Assoc. 2024;31(3):574-582. doi:10.1093/jamia/ocad241
  29. Brown JS, Maro JC, Nguyen M, Ball R. Using and improving distributed data networks to generate actionable evidence: the case of real-world outcomes in the Food and Drug Administration's Sentinel system. J Am Med Inform Assoc. 2020;27(5):793-797. doi:10.1093/jamia/ocaa028
  30. Sentinel. Assessing the ARIA system's ability to evaluate a safety concern. Accessed June 25, 2024. https://sentinelinitiative.org/studies/drugs/assessing-arias-ability-evaluate-safety-concern
  31. Sentinel. Drug studies. Accessed June 25, 2024. https://www.sentinelinitiative.org/studies/drugs
  32. Klein G. Performing a Project Premortem. 2007. Accessed June 25, 2024. https://hbr.org/2007/09/performing-a-project-premortem
  33. Desai RJ, Wang SV, Sreedhara SK, et al. Process guide for inferential studies using healthcare data from routine clinical practice to evaluate causal effects of drugs (PRINCIPLED): considerations from the FDA Sentinel Innovation Center. BMJ. 2024;384:e076460. doi:10.1136/bmj-2023-076460
  34. Fang X, Saha S, Song J, Dharmarajan S. planningML: a sample size calculator for machine learning applications in healthcare. Accessed June 25, 2024. https://cran.r-project.org/web/packages/planningML/index.html
  35. Galvez-Sánchez CM, Reyes Del Paso GA. Diagnostic criteria for fibromyalgia: critical review and future perspectives. J Clin Med. 2020;9(4):1219-1235. doi:10.3390/jcm9041219
  36. Bann MA, Carrell DS, Gruber S, et al. Identification and validation of anaphylaxis using electronic health data in a population-based setting. Epidemiology. 2021;32(3):439-443. doi:10.1097/ede.0000000000001330
  37. Fekadu G, Bekele F, Tolossa T, et al. Impact of COVID-19 pandemic on chronic diseases care follow-up and current perspectives in low resource settings: a narrative review. Int J Physiol Pathophysiol Pharmacol. 2021;13(3):86-93.
  • 37. Fekadu G, Bekele F, Tolossa T, et al.  Impact of COVID-19 pandemic on chronic diseases care follow-up and current perspectives in low resource settings: a narrative review. Int J Physiol Pathophysiol Pharmacol. 2021;13(3):86-93. [PMC free article] [PubMed] [Google Scholar]
  • 38. Muhrer JC.  Risk of misdiagnosis and delayed diagnosis with COVID-19: a syndemic approach. Nurse Pract. 2021;46(2):44-49. 10.1097/01.Npr.0000731572.91985.98 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Van den Bulck S, Crèvecoeur J, Aertgeerts B, et al.  The impact of the Covid-19 pandemic on the incidence of diseases and the provision of primary care: a registry-based study. PLoS One. 2022;17(7):e0271049. 10.1371/journal.pone.0271049 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Shi X, Zhai Y, Yu X, et al. Harmonizing electronic health record data across FDA Sentinel Initiative data partners using privacy-protecting unsupervised learning: case study and lessons learned. Under review 2024.
  • 41. Saini P, Chantler K, Kapur N.  General practitioners' perspectives on primary care consultations for suicidal patients. Health Soc Care Community. 2016;24(3):260-269. 10.1111/hsc.12198 [DOI] [PubMed] [Google Scholar]
  • 42. Bajaj P, Borreani E, Ghosh P, Methuen C, Patel M, Joseph M.  Screening for suicidal thoughts in primary care: the views of patients and general practitioners. Ment Health Fam Med. 2008;5(4):229-235. [PMC free article] [PubMed] [Google Scholar]
  • 43. Schulberg HC, Bruce ML, Lee PW, Williams JW Jr., Dietrich AJ.  Preventing suicide in primary care patients: the primary care physician's role. Gen Hosp Psychiatry. 2004;26(5):337-345. 10.1016/j.genhosppsych.2004.06.007 [DOI] [PubMed] [Google Scholar]
  • 44. Food and Drug Administration, HHS. Guidance for industry: for the submission of chemistry, manufacturing and controls and establishment description information for human blood and blood components intended for transfusion or for further manufacture and for the completion of the form FDA 356h, “Application to market a new drug, biologic or an antibiotic drug for human use”. Notice. Fed Regist. 1999;64(89):25049-25050. [PubMed] [Google Scholar]
  • 45. Yang LH, Wong LY, Grivel MM, Hasin DS.  Stigma and substance use disorders: an international phenomenon. Curr Opin Psychiatry. 2017;30(5):378-388. 10.1097/YCO.0000000000000351 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Lipscombe LL, Hwee J, Webster L, Shah BR, Booth GL, Tu K.  Identifying diabetes cases from administrative data: a population-based validation study. BMC Health Serv Res. 2018;18(1):316. 10.1186/s12913-018-3148-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Ives DG, Fitzpatrick AL, Bild DE, et al.  Surveillance and ascertainment of cardiovascular events. The Cardiovascular Health Study. Ann Epidemiol. 1995;5(4):278-285. 10.1016/1047-2797(94)00093-9 [DOI] [PubMed] [Google Scholar]
  • 48. Brighton Collaboration. Brighton Collaboration case definition. Accessed June 25, 2024. https://brightoncollaboration.org/category/pubs-tools/case-definitions/
  • 49. Vittinghoff E, McCulloch CE.  Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epidemiol. 2007;165(6):710-718. 10.1093/aje/kwk052 [DOI] [PubMed] [Google Scholar]
  • 50. Banks PA, Bollen TL, Dervenis C, Acute Pancreatitis Classification Working Group, et al.  Classification of acute pancreatitis—2012: revision of the Atlanta classification and definitions by international consensus. Gut. 2013;62(1):102-111. 10.1136/gutjnl-2012-302779 [DOI] [PubMed] [Google Scholar]
  • 51. Kottner J, Audigé L, Brorson S, et al.  Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. J Clin Epidemiol. 2011;64(1):96-106. 10.1016/j.jclinepi.2010.03.002 [DOI] [PubMed] [Google Scholar]
  • 52. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG.  Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42(2):377-381. 10.1016/j.jbi.2008.08.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Van Bulck L, Wampers M, Moons P.  Research Electronic Data Capture (REDCap): tackling data collection, management, storage, and privacy challenges. Eur J Cardiovasc Nurs. 2022;21(1):85-91. 10.1093/eurjcn/zvab104 [DOI] [PubMed] [Google Scholar]
  • 54. Github. RedCAP form for acute pancreatitis chart review. Accessed June 25, 2024. https://github.com/kpwhri/Sentinel-Acute-Pancreatitis
  • 55. Japkowicz N, Stephen S.  The class imbalance problem: a systematic study. IDA. 2002;6(5):429-449. [Google Scholar]
  • 56. Newton KM, Peissig PL, Kho AN, et al.  Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J Am Med Inform Assoc. 2013;20(e1):e147-e154. 10.1136/amiajnl-2012-000896 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Hazlehurst B, Gorman PN, McMullen CK.  Distributed cognition: an alternative model of cognition for medical informatics. Int J Med Inform. 2008;77(4):226-234. 10.1016/j.ijmedinf.2007.04.008 [DOI] [PubMed] [Google Scholar]
  • 58. Hazlehurst B, McMullen C, Gorman P, Sittig D.  How the ICU follows orders: care delivery as a complex activity system. AMIA Annu Symp Proc. 2003;2003:284-288. [PMC free article] [PubMed] [Google Scholar]
  • 59. Hazlehurst B, McMullen CK, Gorman PN.  Distributed cognition in the heart room: how situation awareness arises from coordinated communications during cardiac surgery. J Biomed Inform. 2007;40(5):539-551. 10.1016/j.jbi.2007.02.001 [DOI] [PubMed] [Google Scholar]
  • 60. Shekhar A. What is feature engineering for machine learning? 2018. Accessed June 25, 2024. https://medium.com/mindorks/what-is-feature-engineering-for-machine-learning-d8ba3158d97a
  • 61. Yu S, Ma Y, Gronsbell J, et al.  Enabling phenotypic big data with PheNorm. J Am Med Inform Assoc. 2018;25(1):54-60. 10.1093/jamia/ocx111 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62. Press G. Cleaning big data: most time-consuming, least enjoyable data science task, survey says. 2016. Accessed June 25, 2024. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=167897986f63
  • 63. Yu S, Liao KP, Shaw SY, et al.  Toward high-throughput phenotyping: unbiased automated feature extraction and selection from knowledge sources. J Am Med Inform Assoc. 2015;22(5):993-1000. 10.1093/jamia/ocv034 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Denny JC, Choma NN, Peterson JF, et al. Natural language processing improves identification of colorectal cancer testing in the electronic medical record. Med Decis Making. 2012;32(1):188-197. 10.1177/0272989X11400418
  • 65. Kaiser Permanente Washington Health Research Institute. Sentinel anaphylaxis. 2024. GitHub repository. Accessed June 25, 2024. https://github.com/kpwhri/Sentinel-Anaphylaxis
  • 66. MedlinePlus. Anaphylaxis. Accessed January 23, 2022. https://medlineplus.gov/ency/article/000844.htm
  • 67. Fernandez J. Anaphylaxis. 2022. Accessed June 25, 2024. https://www.merckmanuals.com/professional/immunology-allergic-disorders/allergic,-autoimmune,-and-other-hypersensitivity-disorders/anaphylaxis?query=anaphylaxis
  • 68. Yu W, Zheng C, Xie F, et al. The use of natural language processing to identify vaccine-related anaphylaxis at five health care systems in the Vaccine Safety Datalink. Pharmacoepidemiol Drug Saf. 2020;29(2):182-188. 10.1002/pds.4919
  • 69. Sinnott JA, Cai F, Yu S, et al. PheProb: probabilistic phenotyping using diagnosis codes to improve power for genetic association studies. J Am Med Inform Assoc. 2018;25(10):1359-1365. 10.1093/jamia/ocy056
  • 70. Liao KP, Sun J, Cai TA, et al. High-throughput multimodal automated phenotyping (MAP) with application to PheWAS. J Am Med Inform Assoc. 2019;26(11):1255-1262. 10.1093/jamia/ocz066
  • 71. Smith JC, Williamson BD, Cronkite DJ, et al. Data-driven automated classification algorithms for acute health conditions: applying PheNorm to COVID-19 disease. J Am Med Inform Assoc. 2024;31(3):574-582. 10.1093/jamia/ocad241
  • 72. GitHub. NLP system for acute pancreatitis. Accessed June 25, 2024. https://github.com/kpwhri/Sentinel-Acute-Pancreatitis
  • 73. GitHub. NLP system for COVID-19 disease. Accessed June 25, 2024. https://github.com/kpwhri/Sentinel-Scalable-NLP
  • 74. GitHub. NLP system for anaphylaxis. Accessed June 25, 2024. https://github.com/kpwhri/Sentinel-Anaphylaxis
  • 75. Phillips RV, van der Laan MJ, Lee H, Gruber S. Practical considerations for specifying a super learner. Int J Epidemiol. 2023;52(4):1276-1285. 10.1093/ije/dyad023
  • 76. van der Laan MJ, Polley EC, Hubbard AE. Super learner. Stat Appl Genet Mol Biol. 2007;6:Article 25. 10.2202/1544-6115.1309
  • 77. van der Laan M, Dudoit S, van der Vaart A. The cross-validated adaptive epsilon-net estimator. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 142; 2004.
  • 78. Davis J, Goadrich M. The relationship between precision-recall and ROC curves. 2006. Accessed June 25, 2024. https://ftp.cs.wisc.edu/machine-learning/shavlik-group/davis.icml06.pdf
  • 79. Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128-138. 10.1097/EDE.0b013e3181c30fb2
  • 80. Lipton ZC, Elkan C, Naryanaswamy B. Optimal thresholding of classifiers to maximize F1 measure. Mach Learn Knowl Discov Databases. 2014;8725:225-239.
  • 81. Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med. 1999;130(6):515-524. 10.7326/0003-4819-130-6-199903160-00016
  • 82. Funk MJ, Landi SN. Misclassification in administrative claims data: quantifying the impact on treatment effect estimates. Curr Epidemiol Rep. 2014;1(4):175-185. 10.1007/s40471-014-0027-z
  • 83. Neuhaus J. Bias and efficiency loss due to misclassified responses in binary regression. Biometrika. 1999;86(4):843-855.

Data Availability Statement

This work did not entail collection, management, or analysis of any original or secondary data.

