ABSTRACT
Electronic health records (EHRs) provide a rich data source for building prediction models to improve the quality of care. However, EHR data are prone to measurement error, and outcomes used to evaluate prediction model performance may be misclassified. Comparing risk predictions to misclassified outcomes will result in unreliable estimates of a prediction model's performance. We propose a method that leverages a smaller chart review sample with gold‐standard outcome measurements to adjust validation for outcome misclassification and provide more accurate prediction model evaluation. We derive formulae to estimate true positive rate (TPR), false positive rate (FPR), positive predictive value (PPV), negative predictive value (NPV), and area under the receiver operating characteristic curve (AUC) in the presence of outcome misclassification. Different scenarios of misclassification are explored, including when the misclassification is independent or dependent on features, and when misclassification is unidirectional (e.g., only missed diagnoses) or bidirectional. In simulation studies, we compare the bias and 95% confidence interval coverage of performance estimates obtained using our proposed method to those estimated with misclassified outcomes (without accounting for misclassification) or in the smaller chart review sample. Simulation results indicate that, across all misclassification scenarios examined, our proposed estimates have good accuracy and improved precision. Outcome misclassification should be considered when evaluating a prediction model's performance in order to accurately inform decision‐making about whether and how to use a clinical prediction model.
Keywords: classification accuracy, clinical prediction model, machine learning, measurement error, ROC curve
Abbreviations
- AUC: area under the curve
- EHR: electronic health record
- FPR: false positive rate
- NPV: negative predictive value
- PPV: positive predictive value
- ROC: receiver operating characteristic
- TPR: true positive rate
1. Introduction
Electronic health record (EHR) data provide an information‐rich resource for developing, validating, and implementing clinical prediction models [1, 2]. These prediction models can inform clinical decision‐making in a myriad of ways, including delivering personalized patient care [3, 4], detecting disease conditions early [5, 6, 7], and stratifying patients for care pathways based on their predicted risk [8, 9], which can potentially lead to better health outcomes [4, 10] and more efficient health care resource allocation [11]. Our work is motivated by our collaboration in estimating suicide risk‐prediction models used to guide suicide prevention care. Consider a prediction model for the risk of self‐harm or suicide attempt following an outpatient mental health visit. These prediction models can be used to inform providers that a patient is at higher risk for self‐harm. In response, providers then may conduct additional risk assessment or initiate safety planning or other self‐harm risk mitigating strategies [12].
Evaluation of a clinical prediction model's expected performance in the population of interest, known as validation, is needed to inform decision‐making about whether and how to use the model in clinical care. The area under the receiver operating characteristic (ROC) curve (AUC) is a commonly used metric of discrimination that measures how well a prediction model orders observations by risk of the event of interest. In the context of predicting risk of self‐harm, a poor AUC would indicate the prediction model is not effectively distinguishing between visits that are and are not followed by self‐harm events; such a model would not be recommended for clinical use.
In addition to AUC, model validation frequently also examines the accuracy of using specific risk thresholds to classify those at higher risk of the event of interest. Classification accuracy measures such as true positive rate (TPR), false positive rate (FPR), positive predictive value (PPV), and negative predictive value (NPV) may better guide prediction model implementation, as they reflect a model's impact when a threshold of the risk score is used to distinguish between patients at higher and lower risk of, for example, self‐harm. In our motivating example, TPR quantifies the potential maximum impact implementing a prediction model may have on reducing self‐harm events. Only those patients who are flagged as high‐risk and receive the intervention will have their self‐harm risk potentially reduced by the model implementation. FPR quantifies how frequently a visit would unnecessarily be recommended for the risk‐reducing intervention. FPR must be considered in the context of the potential harms of an intervention. While a brief risk assessment during a visit is relatively inexpensive, unnecessarily conducting this intervention takes away from limited visit time, could damage the therapeutic relationship between patient and provider, and could escalate to more intrusive interventions like involuntary hospitalization. PPV, the probability of an event among those visits classified as high‐risk, also informs the selection of an appropriate risk‐reducing intervention. Existing self‐harm prediction models have a PPV of 5% or less [13]. Given that more than 95% of the visits flagged as high‐risk will not be followed by self‐harm, accompanying interventions must be inexpensive, safe, and minimally disruptive [14]. NPV corresponds to the probability of no event among visits classified as low‐risk. Achieving high NPV is crucial in settings such as self‐harm risk prediction, in which the consequences of a false negative classification are significant; high NPV facilitates more confident management of lower risk patients. Overall, selection of an appropriate risk threshold must balance the benefits and harms of subsequent interventions.
While EHR data provide an excellent resource for rich data on a well‐defined population of individuals that can be leveraged for prediction model development and validation, EHR data are prone to measurement error [15, 16, 17], including misclassification of binary outcomes, like self‐harm [18, 19]. Patients may not seek care following self‐harm or may be hesitant to report injuries or poisonings as self‐inflicted, leading to missed diagnoses in our data [20]. It is also possible for accidental injuries or poisonings to be incorrectly documented as intentional self‐harm [21, 22]. Additionally, in many situations, outcome misclassification can be associated with patient characteristics [23, 24, 25]. For example, a key predictor of future self‐harm and suicide is a prior incident of self‐harm [26, 27]. Past self‐harm could also be associated with the risk of outcome misclassification, as self‐harm may be more likely to be recognized and diagnosed in those with a documented history of self‐harm.
Outcome misclassification in EHR data can lead to incorrectly estimating relationships between predictors and outcomes and inaccurate model validation [28, 29]. In the presence of binary outcome misclassification, naive model evaluation with the observed, potentially misclassified, outcomes will not reflect model performance for true (not misclassified) outcomes. Consider, for example, AUC: a prediction model's ability to discriminate between misclassified outcomes is typically not of scientific or clinical interest, and a prediction model that performs well in distinguishing misclassified outcomes may have limited discriminatory power for true outcomes. The impact of outcome misclassification is particularly concerning as accurate validation is necessary to correctly inform decisions about whether and how to use a prediction model. Consider PPV of a self‐harm prediction model: providers would take very different action for patients classified as high‐risk if the estimated PPV, or probability of a subsequent event, in that high‐risk stratum was 20% compared to 2%. If misclassification leads to biased estimates of PPV, clinical leaders may choose to pair the model with an intervention that is inappropriate for the true risk posed to patients classified as high‐risk.
While there are well‐established methods to adjust for outcome misclassification in an inferential setting [30, 31, 32, 33], few studies focus on the correct validation of prediction models in the presence of outcome misclassification. Wang et al. [28] derived formulas to estimate corrected ROC curves in the situation where misclassification is not related to risk factors or predictions, that is, misclassification is not risk‐differential. They demonstrated their approach in a setting where some events are not captured, but did not apply their method to the setting where events may be incorrectly present in the EHR. Zawistowski et al. [29] proposed a misclassification‐adjusted ROC procedure to account for misclassification in observed outcomes in both prediction model estimation and validation. This method relies on a likelihood‐based model for correcting logistic regression parameter estimates and, as a result, it is not applicable when a likelihood cannot be written explicitly, as is the case with common machine learning approaches, such as random forest. Demonstration of the proposed method of Zawistowski et al. [29] was also limited to a setting with only uncaptured events (underdiagnosis) with misclassification rates independent of the features. Thus, there is a need for validation approaches that account for outcome misclassification in complex misclassification scenarios and when flexible prediction modeling approaches are used.
Another limitation of existing literature is that validation methods have been demonstrated assuming misclassification rates are known [28, 29]. In reality, it is unlikely that researchers conducting model validation have prior knowledge of the true, underlying misclassification rates. And, naively assuming (estimated) misclassification rates elicited from earlier studies are precisely known will underestimate the variance of performance measures.
Manual review of clinical records is a common approach to quantifying misclassification in EHR data. In these chart review studies, trained annotators carefully examine patient records, including free text notes and other indicators in the data, to determine “gold standard” outcome measures. For example, a prior chart review study of EHR‐documented injuries or poisonings among patients reporting suicidal ideation found that nearly 90% of events with self‐harm diagnoses included documentation of self‐harm intent [19]. Some injuries or poisonings without a diagnosis of self‐harm had accompanying information in free text or other EHR documentation indicating self‐harm intent, including 8% of events coded as accidental, 28% coded as undetermined, and 11% without a coding of intent [19]. While a prediction model can be accurately validated within a chart review sample with gold‐standard outcomes, these samples are frequently quite small, as individual patient record review is expensive and time‐consuming to conduct. Validation of clinical prediction models in small chart review samples will yield estimates with high variability that, as a result, will be insufficient to inform decision‐making about whether and how to implement a prediction model. Ideally, information on model performance from much larger (unannotated) EHR datasets could also be leveraged to accurately evaluate model performance.
In this paper, we present a validation method that accommodates misclassification of binary outcomes for the AUC and classification measures (including TPR, FPR, PPV, and NPV) for true outcomes. Our proposed approach addresses the gaps in existing work as it is applicable to any prediction modeling approach and to complex misclassification scenarios, reflective of those encountered in real‐world EHR data. Instead of assuming misclassification rates are known, we illustrate how a model of the relationship between inaccessible true outcomes, observed misclassified outcomes, and risk predictors can be elicited using an external chart review data set. We further demonstrate how uncertainty in this estimation can be reflected in our proposed method to accurately quantify variability in performance measures. When the available EHR sample (with misclassified outcomes) is much larger than the chart review sample, as is typically the case, our proposed estimators are more efficient than those obtained by limiting validation to observations with gold‐standard outcomes.
We present results evaluating the proposed method in situations in which only events are misclassified (unidirectional misclassification) and for those in which both events and nonevents may be misclassified (bidirectional misclassification). We also consider scenarios when misclassification is independent of risk factors (non‐risk differential) as well as those when misclassification is associated with risk factors (risk differential) [34]. The method is evaluated through plasmode simulations leveraging data from a suicide and self‐harm prediction study [26, 35].
In Section 2, we present notation, define terms, and illustrate the impact of outcome misclassification on prediction model validation. In Section 3, we describe our proposed method to account for outcome misclassification. Section 4 contains details of our simulations, and Section 5 shows simulation results. We discuss how our methods could be useful in real‐world problems and further research directions in Section 6.
2. Prediction Model Validation with True and Misclassified Outcomes
2.1. Notation and Definitions
Consider a prediction model $f$. We would like to evaluate the true performance of $f$ in a validation sample, which may have misclassified outcomes. The methods presented in this paper are agnostic to the modeling strategy of $f$. Further, we make no assumptions about the presence or absence of outcome misclassification in the training data used to estimate $f$ and focus solely on accurately estimating the performance of the already estimated prediction model.
Let $X_i$ and $Y_i^*$ represent the features and the observed, potentially misclassified binary outcome, respectively, for each observation in the validation sample, $i = 1, \dots, n$. We calculate $f(X_i)$ for all observations to obtain predicted risk scores. We use $c$ to denote some risk score threshold such that observations with $f(X_i) > c$ would be labeled as positive predicted outcomes (that is, high‐risk for self‐harm in our motivating example) while observations with $f(X_i) \le c$ would be considered as negative predicted outcomes (not classified as high‐risk). We would like to evaluate the performance of the risk model in this validation set, but cannot easily do so because the true, correctly classified outcomes, denoted $Y_i$, are not observed. In our motivating example, $X_i$ represents risk factors for self‐harm, $Y_i$ is a binary indicator of whether a self‐harm event occurred, and $Y_i^*$ is a binary indicator of whether a self‐harm event was documented in the EHR.
Suppose we also have a chart review sample containing $(X_j, Y_j^*, Y_j)$, where $X_j$ denotes the features, $Y_j^*$ denotes the observed binary outcome, and $Y_j$ denotes the gold‐standard determination (referred to throughout this paper as the “true” outcome) for each observation $j = 1, \dots, m$ in our $m$‐size chart review sample, where $m < n$. We assume observations in our validation and chart review sample are independently and identically distributed. (We consider the special case in which chart reviewed observations are a subset of the validation sample in Section 3.2.)
We next define several misclassification scenarios. Misclassification may be unidirectional, where only true events or only true nonevents are possibly mislabeled, but not both. The scenario in which some self‐harm events that truly occurred are not documented in the EHR, while no nonevents are erroneously recorded, can be expressed as $P(Y^* = 0 \mid Y = 1) > 0$ and $P(Y^* = 1 \mid Y = 0) = 0$. The alternate scenario, in which all true events are correctly labeled and some nonevents are incorrectly documented as events, can be expressed as $P(Y^* = 0 \mid Y = 1) = 0$ and $P(Y^* = 1 \mid Y = 0) > 0$. We also consider bidirectional misclassification, in which both events and nonevents may potentially be mislabeled. Bidirectional misclassification can be written as $P(Y^* = 0 \mid Y = 1) > 0$ and $P(Y^* = 1 \mid Y = 0) > 0$.
We use nondifferential outcome misclassification to describe a scenario in which the likelihood of misclassification is independent of $X$ such that $P(Y^* \mid Y, X) = P(Y^* \mid Y)$, while differential outcome misclassification indicates that misclassification rates are associated with at least some elements of $X$, that is, $P(Y^* \mid Y, X) \ne P(Y^* \mid Y)$. In our motivating example, the correct recording of self‐harm in the EHR may be more likely for patients with other risk factors present, such as prior self‐harm or suicidal ideation. We note that both nondifferential and differential misclassification scenarios can be uni‐ or bidirectional.
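The scenarios defined above can be made concrete with a small simulation. The following is a minimal sketch, not the authors' simulation design: the function `observe`, the single feature `x`, the 20% event rate, and all misclassification rates are hypothetical choices made only for illustration. Here `y` plays the role of the true outcome and `y_star` the recorded, possibly misclassified outcome.

```python
import random

def observe(y, x, p_miss, p_false, rng):
    """Draw the recorded outcome given the true outcome y and a feature x.

    p_miss(x)  = P(recorded 0 | true 1, x)  (missed events)
    p_false(x) = P(recorded 1 | true 0, x)  (nonevents recorded as events)
    """
    if y == 1:
        return 0 if rng.random() < p_miss(x) else 1
    return 1 if rng.random() < p_false(x) else 0

# Hypothetical rates; a constant function of x means nondifferential.
scenarios = {
    "unidirectional, nondifferential": (lambda x: 0.3, lambda x: 0.0),
    # Higher x -> event more likely to be recorded (differential).
    "unidirectional, differential": (lambda x: 0.4 - 0.3 * x, lambda x: 0.0),
    "bidirectional, nondifferential": (lambda x: 0.3, lambda x: 0.02),
}

rng = random.Random(0)
data = {}
for name, (p_miss, p_false) in scenarios.items():
    rows = []
    for _ in range(1000):
        x = rng.random()                      # one feature in [0, 1)
        y = 1 if rng.random() < 0.2 else 0    # true outcome
        y_star = observe(y, x, p_miss, p_false, rng)
        rows.append((x, y, y_star))
    data[name] = rows
```

In the two unidirectional scenarios no true nonevent is ever recorded as an event, because `p_false` is identically zero; only the bidirectional scenario can produce such records.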
2.2. ROC and Classification Accuracy Analysis With Gold‐Standard Outcomes
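The ROC curve for a prediction model $f$ is defined by the TPR and FPR of $f$ across all possible risk score thresholds $c$. With $Y$ denoting the true outcome, the targets of interest are
$$\mathrm{TPR}_f(c) = P(f(X) > c \mid Y = 1), \tag{1}$$
$$\mathrm{FPR}_f(c) = P(f(X) > c \mid Y = 0). \tag{2}$$
We will use the shorthand $\mathrm{TPR}(c)$ and $\mathrm{FPR}(c)$ below to refer to these quantities, since in all cases the prediction performance metric depends on the fixed $f$.
In a chart review sample, we have access to true gold‐standard outcomes $Y_j$ for $m$ observations, enabling us to obtain the following estimators for TPR and FPR at each threshold $c$:
$$\widehat{\mathrm{TPR}}(c) = \frac{\sum_{j=1}^{m} I(f(X_j) > c)\, I(Y_j = 1)}{\sum_{j=1}^{m} I(Y_j = 1)}, \qquad \widehat{\mathrm{FPR}}(c) = \frac{\sum_{j=1}^{m} I(f(X_j) > c)\, I(Y_j = 0)}{\sum_{j=1}^{m} I(Y_j = 0)},$$
where $I(\cdot)$ is a binary indicator function. The above estimators evaluated in the chart review sample are consistent for our target parameter of interest, the true population TPR and FPR of $f$ at threshold $c$.
The ROC curve is defined as the plot of TPR against FPR over all thresholds. The AUC is defined as the area under the ROC curve: $\mathrm{AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(u))\, du$. A plug‐in estimator of $\mathrm{AUC}$ can be obtained using $\widehat{\mathrm{TPR}}(c)$ and $\widehat{\mathrm{FPR}}(c)$.
We also consider two related measures of classification accuracy, PPV, the probability of an event among observations with a risk score above a threshold $c$, and NPV, the probability of a nonevent among observations with a risk score no greater than a threshold $c$. These targets can be written as:
$$\mathrm{PPV}(c) = P(Y = 1 \mid f(X) > c), \tag{3}$$
$$\mathrm{NPV}(c) = P(Y = 0 \mid f(X) \le c). \tag{4}$$
Similar to the estimators of TPR and FPR, gold‐standard outcomes can be used to obtain estimates of $\mathrm{PPV}(c)$ and $\mathrm{NPV}(c)$ that are consistent for the targets:
$$\widehat{\mathrm{PPV}}(c) = \frac{\sum_{j=1}^{m} I(f(X_j) > c)\, I(Y_j = 1)}{\sum_{j=1}^{m} I(f(X_j) > c)}, \qquad \widehat{\mathrm{NPV}}(c) = \frac{\sum_{j=1}^{m} I(f(X_j) \le c)\, I(Y_j = 0)}{\sum_{j=1}^{m} I(f(X_j) \le c)}.$$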
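For readers who prefer code, the gold‐standard estimators described in this subsection can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names and the toy data are hypothetical, the risk scores are assumed to be precomputed, and the AUC is computed with the rank‐based (Mann‐Whitney) form, which equals the area under the empirical ROC curve. Both outcome classes, and both sides of the threshold, are assumed to be non‐empty.

```python
def classification_metrics(scores, y, c):
    """TPR, FPR, PPV, NPV at threshold c from risk scores and true outcomes."""
    tp = fp = fn = tn = 0
    for s, yi in zip(scores, y):
        if s > c:                      # classified as high-risk
            if yi == 1:
                tp += 1
            else:
                fp += 1
        else:                          # classified as not high-risk
            if yi == 1:
                fn += 1
            else:
                tn += 1
    return {
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }

def auc(scores, y):
    """Share of (event, nonevent) pairs ranked correctly; ties count one half."""
    events = [s for s, yi in zip(scores, y) if yi == 1]
    nonevents = [s for s, yi in zip(scores, y) if yi == 0]
    wins = sum(
        1.0 if e > n else 0.5 if e == n else 0.0
        for e in events
        for n in nonevents
    )
    return wins / (len(events) * len(nonevents))

# Toy chart review sample (hypothetical values).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y_true = [0, 0, 1, 0, 0, 1, 1, 1]
metrics = classification_metrics(scores, y_true, c=0.5)
area = auc(scores, y_true)
```

With this toy sample and a threshold of 0.5, three of the four events and one of the four nonevents score above the threshold, so all four classification measures can be read directly off the 2×2 table.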
2.3. ROC and Classification Accuracy Analysis With Observed, Potentially Misclassified Outcomes in a Validation Sample
In this section, we review the impact of unidirectional, nondifferential binary outcome misclassification on TPR, FPR, and AUC estimates, as provided in Wang et al. [28], and extend this presentation to consider the unidirectional differential scenario as well as bias in PPV and NPV estimates.
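Outcome misclassification interferes with accurate ROC analysis. When outcome misclassification is present, true outcomes $Y_i$ are not observed; only potentially incorrect outcome observations $Y_i^*$ are available. Having only access to $Y_i^*$, we could estimate TPR and FPR using
$$\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c) = \frac{\sum_{i=1}^{n} I(f(X_i) > c)\, I(Y_i^* = 1)}{\sum_{i=1}^{n} I(Y_i^* = 1)}, \tag{5}$$
$$\widehat{\mathrm{FPR}}_{\mathrm{mis}}(c) = \frac{\sum_{i=1}^{n} I(f(X_i) > c)\, I(Y_i^* = 0)}{\sum_{i=1}^{n} I(Y_i^* = 0)}, \tag{6}$$
where we use the subscript “mis” to highlight that the possibly‐misclassified outcomes are used. $\widehat{\mathrm{AUC}}_{\mathrm{mis}}$ can be obtained by plugging $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ and $\widehat{\mathrm{FPR}}_{\mathrm{mis}}(c)$ into the formula for AUC: $\widehat{\mathrm{AUC}}_{\mathrm{mis}} = \int_0^1 \widehat{\mathrm{TPR}}_{\mathrm{mis}}(\widehat{\mathrm{FPR}}_{\mathrm{mis}}^{-1}(u))\, du$. Additionally, the estimators for PPV and NPV when only $Y_i^*$ are observed are
$$\widehat{\mathrm{PPV}}_{\mathrm{mis}}(c) = \frac{\sum_{i=1}^{n} I(f(X_i) > c)\, I(Y_i^* = 1)}{\sum_{i=1}^{n} I(f(X_i) > c)}, \qquad \widehat{\mathrm{NPV}}_{\mathrm{mis}}(c) = \frac{\sum_{i=1}^{n} I(f(X_i) \le c)\, I(Y_i^* = 0)}{\sum_{i=1}^{n} I(f(X_i) \le c)}.$$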
Naively using the observed, potentially misclassified outcomes available in a validation sample may result in biased performance estimates. Below, we describe how misclassification of binary outcomes can introduce bias in the ROC analysis by first describing bias in each of its components. We focus our initial presentation on the setting of unidirectional misclassification in which only true events are missed before considering unidirectional misclassification in which only nonevents are incorrectly diagnosed.
The direction of bias for bidirectional misclassification cannot be as clearly outlined across all settings since total bias in performance metrics reflects bias from the rates of both false positives (nonevents incorrectly classified as events) and false negatives (events incorrectly classified as nonevents). However, in most cases, we expect the bias to be nonzero.
2.3.1. Estimation of the True Positive Rate
We start by considering the estimation of TPR. The target of $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$, the estimator for TPR at threshold $c$ evaluated with potentially misclassified $Y^*$, is $P(f(X) > c \mid Y^* = 1)$. We can express this target in terms of $Y$ and $Y^*$ in order to compare the target of $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ to (1):
$$P(f(X) > c \mid Y^* = 1) = P(f(X) > c \mid Y^* = 1, Y = 1) = \frac{P(Y^* = 1 \mid f(X) > c, Y = 1)\, P(f(X) > c \mid Y = 1)}{P(Y^* = 1 \mid Y = 1)}. \tag{7}$$
The first equality in (7) follows from our misclassification scenario; here, we are considering misclassification that is unidirectional such that only some true events are incorrectly labeled, so that those with $Y^* = 1$ must also have $Y = 1$. The second equality in (7) follows from Bayes' theorem. We can now examine the amount of bias that will occur by using $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ to estimate $\mathrm{TPR}(c)$.
When misclassification is nondifferential, $P(Y^* = 1 \mid f(X) > c, Y = 1) = P(Y^* = 1 \mid Y = 1)$, so equation (7) simplifies to $P(f(X) > c \mid Y = 1)$, the same as (1). Thus, in this scenario, the target of $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ is the parameter of interest $\mathrm{TPR}(c)$; estimates of the TPR using $Y^*$ will be consistent for the true TPR when unidirectional misclassification is nondifferential. We can also follow this reasoning intuitively in the context of our motivating example: when only some self‐harm events are undiagnosed and the probability of missed diagnosis is not associated with any risk factors, the distribution of risk factors (and the risk score $f(X)$) is the same among all the patients with a true self‐harm event ($Y = 1$) and the subset with self‐harm diagnosed ($Y^* = 1$).
However, when outcome misclassification is differential, the target of $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ does not equal $\mathrm{TPR}(c)$ but instead can be written as below, following from equation (7) above:
$$P(f(X) > c \mid Y^* = 1) = \frac{P(Y^* = 1 \mid f(X) > c, Y = 1)}{P(Y^* = 1 \mid Y = 1)}\, \mathrm{TPR}(c).$$
Here, we can see that comparing $P(Y^* = 1 \mid f(X) > c, Y = 1) / P(Y^* = 1 \mid Y = 1)$ to 1 indicates whether $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ is asymptotically biased upwards or downwards. When patients with higher risk scores are more likely to have true events correctly classified, $P(Y^* = 1 \mid f(X) > c, Y = 1) \ge P(Y^* = 1 \mid Y = 1)$. As a result, the asymptotic bias of $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ will be nonnegative, and the estimated sensitivity of the model will be optimistic. We may expect this misclassification scenario under the assumption that patients with self‐harm risk factors like prior self‐harm or suicidal ideation are more likely to have future self‐harm events recognized and diagnosed in the EHR.
Alternatively, one could also hypothesize a scenario in which patients with lower predicted risk of self‐harm may be more likely to have self‐harm events correctly diagnosed, perhaps if those with lower predicted risk had better access to mental health care and were more likely to receive care following a self‐harm event. In this situation, $P(Y^* = 1 \mid f(X) > c, Y = 1) \le P(Y^* = 1 \mid Y = 1)$ and $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ will underestimate $\mathrm{TPR}(c)$ for most $c$, asymptotically.
Considering these two basic scenarios, we can conclude that using misclassified outcomes to naively estimate the TPR can have positive or negative bias. One could also imagine more complex risk‐differential scenarios in which the probability of correctly diagnosing a true event is not monotone in $f(X)$ such that TPR is not uniformly over‐ or underestimated across thresholds $c$.
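We next consider the alternate unidirectional misclassification setting in which outcome misclassification is restricted to observations without an event, that is, observations with $Y = 0$ may be misdiagnosed or misrecorded as $Y^* = 1$, and quantify the bias in TPR estimation for both nondifferential and differential scenarios. We start by writing $P(f(X) > c \mid Y^* = 1)$ as a weighted average of $P(f(X) > c \mid Y^* = 1, Y = y)$, $y \in \{0, 1\}$, with weights equal to $P(Y = 1 \mid Y^* = 1)$ and $P(Y = 0 \mid Y^* = 1)$:
$$P(f(X) > c \mid Y^* = 1) = \omega_1\, \mathrm{TPR}(c) + (1 - \omega_1)\, P(f(X) > c \mid Y = 0, Y^* = 1),$$
where $\omega_1 = P(Y = 1 \mid Y^* = 1)$, and where we have used that all true events are correctly recorded, so that $P(f(X) > c \mid Y = 1, Y^* = 1) = \mathrm{TPR}(c)$. When the misclassification is nondifferential, $P(f(X) > c \mid Y = 0, Y^* = 1) = \mathrm{FPR}(c)$ and $\mathrm{TPR}(c)$ is underestimated whenever $\mathrm{TPR}(c) > \mathrm{FPR}(c)$, as shown below:
$$P(f(X) > c \mid Y^* = 1) = \mathrm{TPR}(c) - (1 - \omega_1)\, \{\mathrm{TPR}(c) - \mathrm{FPR}(c)\}.$$
In the nondifferential setting, the asymptotic bias direction of $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ depends on how well the prediction model distinguishes those with $Y = 1$ and $Y = 0$ at the specific $c$, that is, the sign of $\mathrm{TPR}(c) - \mathrm{FPR}(c)$. At threshold values $c$ where $\mathrm{TPR}(c) < \mathrm{FPR}(c)$, $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ will overestimate $\mathrm{TPR}(c)$; at threshold values $c$ where $\mathrm{TPR}(c) > \mathrm{FPR}(c)$, $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ will underestimate $\mathrm{TPR}(c)$; for $c$ satisfying $\mathrm{TPR}(c) = \mathrm{FPR}(c)$, $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ will correctly recover $\mathrm{TPR}(c)$. For the remainder of this manuscript, for simplicity, we assume the ROC curve is never below the diagonal line, that is, $\mathrm{TPR}(c) \ge \mathrm{FPR}(c)$ for all $c$, which implies the model performs no worse than random chance at all possible thresholds in distinguishing between those with and without true events.
For the differential case, the direction of bias depends on the difference between $P(f(X) > c \mid Y = 0, Y^* = 1)$ and $\mathrm{TPR}(c)$. For example, when patients with higher predicted risk scores are less likely to have their true nonevent status misclassified as $Y^* = 1$, $P(f(X) > c \mid Y = 0, Y^* = 1) \le \mathrm{FPR}(c)$ and, following our assumption that the ROC curve of $f$ is never below the diagonal line, $P(f(X) > c \mid Y = 0, Y^* = 1) < \mathrm{TPR}(c)$ such that $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ will underestimate $\mathrm{TPR}(c)$.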
2.3.2. Estimation of the False Positive Rate
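We next consider estimation of the FPR in the presence of unidirectional binary outcome misclassification in which only true events may be missed. Analogous to the presentation above, we compare the target for $\widehat{\mathrm{FPR}}_{\mathrm{mis}}(c)$, $P(f(X) > c \mid Y^* = 0)$, to $\mathrm{FPR}(c)$:
$$\begin{aligned} P(f(X) > c \mid Y^* = 0) &= (1 - \omega_0)\, P(f(X) > c \mid Y = 0, Y^* = 0) + \omega_0\, P(f(X) > c \mid Y = 1, Y^* = 0) \\ &= (1 - \omega_0)\, \mathrm{FPR}(c) + \omega_0\, P(f(X) > c \mid Y = 1, Y^* = 0) \\ &= \mathrm{FPR}(c) + \omega_0\, \{P(f(X) > c \mid Y = 1, Y^* = 0) - \mathrm{FPR}(c)\}, \end{aligned} \tag{8}$$
where $\omega_0 = P(Y = 1 \mid Y^* = 0)$. The last line of (8) quantifies the deviation of the target of $\widehat{\mathrm{FPR}}_{\mathrm{mis}}(c)$ from the target of interest, $\mathrm{FPR}(c)$. In the nondifferential case, $P(f(X) > c \mid Y = 1, Y^* = 0) = \mathrm{TPR}(c)$, and provided that our ROC curve of $f$ is never below the diagonal line across all possible thresholds, the asymptotic bias of $\widehat{\mathrm{FPR}}_{\mathrm{mis}}(c)$ is $\omega_0\, \{\mathrm{TPR}(c) - \mathrm{FPR}(c)\} \ge 0$. Given that a higher FPR indicates poor model performance (prediction models with lower FPR at a given $c$ are preferred), overestimating the FPR is equivalent to underestimating model performance.
When unidirectional misclassification is risk‐differential, the direction of bias of $\widehat{\mathrm{FPR}}_{\mathrm{mis}}(c)$ depends on the relationship between $P(f(X) > c \mid Y = 1, Y^* = 0)$ and $\mathrm{FPR}(c)$. For example, if those with higher predicted risk scores are less likely to have true events recorded in the EHR, perhaps due to more barriers to accessing care, $P(f(X) > c \mid Y = 1, Y^* = 0) \ge \mathrm{TPR}(c) \ge \mathrm{FPR}(c)$ and $\widehat{\mathrm{FPR}}_{\mathrm{mis}}(c)$ will overestimate $\mathrm{FPR}(c)$.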
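Under the alternate unidirectional misclassification scenario in which all events are correctly labeled but nonevents may be mislabeled, $\widehat{\mathrm{FPR}}_{\mathrm{mis}}(c)$ will recover the desired estimand, $\mathrm{FPR}(c)$ in (2), if misclassification is nondifferential. To see this, we expand $P(f(X) > c \mid Y^* = 0)$ as follows:
$$P(f(X) > c \mid Y^* = 0) = P(f(X) > c \mid Y^* = 0, Y = 0) = r_0(c)\, \mathrm{FPR}(c),$$
where $r_0(c) = P(Y^* = 0 \mid f(X) > c, Y = 0) / P(Y^* = 0 \mid Y = 0)$. When misclassification is nondifferential, $r_0(c) = 1$, so the target of $\widehat{\mathrm{FPR}}_{\mathrm{mis}}(c)$ is $\mathrm{FPR}(c)$ in (2). When the misclassification is risk‐differential, the direction of bias depends on the comparison between $r_0(c)$ and 1. For example, if patients with higher predicted risk scores are more likely to have their nonevents correctly classified, then $r_0(c) \ge 1$ and $\widehat{\mathrm{FPR}}_{\mathrm{mis}}(c)$ will overestimate $\mathrm{FPR}(c)$. If, instead, patients with higher predicted scores are more likely to have their true nonevent status misclassified as $Y^* = 1$, then we have $r_0(c) \le 1$, leading to underestimation of FPR.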
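The bias results for TPR and FPR when only true events can be missed can be checked with exact arithmetic at a single threshold. The sketch below is illustrative only: the function `naive_targets` and every numeric input (5% prevalence, TPR of 0.6, FPR of 0.1, and the miss rates) are hypothetical, and differential misclassification is simplified so that the miss rate depends only on whether the risk score is above or below the threshold.

```python
def naive_targets(prev, tpr, fpr, miss_hi, miss_lo):
    """Large-sample targets of the naive TPR and FPR estimators.

    prev             : P(true event)
    tpr, fpr         : true TPR and FPR at the threshold
    miss_hi, miss_lo : P(event not recorded | true event), for scores
                       above / at-or-below the threshold
    Nonevents are never recorded as events (unidirectional setting).
    """
    # Joint probabilities of (true outcome, score above/below threshold).
    y1_hi, y1_lo = prev * tpr, prev * (1 - tpr)
    y0_hi, y0_lo = (1 - prev) * fpr, (1 - prev) * (1 - fpr)
    # Recorded events are the true events that were not missed.
    obs1_hi, obs1_lo = y1_hi * (1 - miss_hi), y1_lo * (1 - miss_lo)
    # Recorded nonevents are true nonevents plus missed events.
    obs0_hi, obs0_lo = y0_hi + y1_hi * miss_hi, y0_lo + y1_lo * miss_lo
    tpr_mis = obs1_hi / (obs1_hi + obs1_lo)
    fpr_mis = obs0_hi / (obs0_hi + obs0_lo)
    return tpr_mis, fpr_mis

# Nondifferential: the same miss rate above and below the threshold.
nondiff = naive_targets(0.05, 0.6, 0.1, miss_hi=0.3, miss_lo=0.3)
# Differential: events in high-score patients are recorded more often.
diff = naive_targets(0.05, 0.6, 0.1, miss_hi=0.2, miss_lo=0.4)
```

Under the nondifferential inputs the naive TPR target equals the true TPR of 0.6, while the naive FPR target exceeds 0.1 by the share of missed events among recorded nonevents times (TPR − FPR). Under the differential inputs the naive TPR target rises above 0.6, matching the optimistic direction described in the text.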
2.3.3. Estimation of the AUC
We next consider estimation of the AUC. When unidirectional (only true events possibly missed) binary outcome misclassification is nondifferential, we have shown that, for all possible $c$, $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ are consistent estimates of $\mathrm{TPR}(c)$, but $\widehat{\mathrm{FPR}}_{\mathrm{mis}}(c)$ can overestimate the target $\mathrm{FPR}(c)$ asymptotically. Given that the ROC curve plots TPR on the y‐axis against FPR on the x‐axis for all possible thresholds $c$, overestimation of $\mathrm{FPR}(c)$ will shift the curve to the right as illustrated in Figure 1. Thus, the $\widehat{\mathrm{AUC}}_{\mathrm{mis}}$ for the ROC curve obtained by $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ and $\widehat{\mathrm{FPR}}_{\mathrm{mis}}(c)$ will underestimate the target AUC, underestimating the discriminative ability of the prediction model.
FIGURE 1. Effect of unidirectional, nondifferential outcome misclassification on ROC plot and the estimated AUC when (a) only true events are incorrectly labeled as nonevents and (b) only nonevents are incorrectly labeled as events.
As shown in Figure 1, on average, if the FPR is always overestimated while the TPR remains unbiased, the ROC curve shifts to the right (toward higher FPR values) for a given TPR. This shift moves the curve away from the ideal top‐left corner, which corresponds to better performance. The magnitude of this bias will increase as the misclassification gets worse because the proportion of observations with $Y = 1$ among those with $Y^* = 0$ will be greater, and thus the positive deviation of $P(f(X) > c \mid Y^* = 0)$ from $\mathrm{FPR}(c)$ will get larger.
When the direction of unidirectional‐nondifferential misclassification is flipped, that is, the binary outcome misclassification only happens within patients with true nonevents, and provided that the ROC curve of $f$ is never below the diagonal line, $\widehat{\mathrm{AUC}}_{\mathrm{mis}}$ will also underestimate the target AUC, thereby underestimating the discriminative ability of the model, since $\widehat{\mathrm{FPR}}_{\mathrm{mis}}(c)$ is asymptotically unbiased, while $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ will underestimate $\mathrm{TPR}(c)$.
We cannot make a general conclusion about the direction of bias for AUC estimates when misclassification is differential. As presented above, the direction of bias for $\widehat{\mathrm{TPR}}_{\mathrm{mis}}(c)$ and $\widehat{\mathrm{FPR}}_{\mathrm{mis}}(c)$ depends on the association between the risk score and misclassification risk for true events and nonevents, respectively. The asymptotic bias direction for $\widehat{\mathrm{AUC}}_{\mathrm{mis}}$ is then a balance between the bias from the estimates of each, and the resulting AUC estimate may over‐ or underestimate the model's discriminatory ability for true outcomes.
2.3.4. Estimation of the Positive and Negative Predictive Value
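A similar approach can be taken to evaluate the potential impact of unidirectional binary outcome misclassification for other classification accuracy measures. We first compare the target of the estimate $\widehat{\mathrm{PPV}}_{\mathrm{mis}}(c)$, $P(Y^* = 1 \mid f(X) > c)$, to the target measure evaluated in true outcomes, $\mathrm{PPV}(c)$, in the setting where only some true events may be mislabeled and incorrectly classified as $Y^* = 0$ in the EHR:
$$P(Y^* = 1 \mid f(X) > c) = P(Y^* = 1 \mid Y = 1, f(X) > c)\, P(Y = 1 \mid f(X) > c) = P(Y^* = 1 \mid Y = 1, f(X) > c)\, \mathrm{PPV}(c).$$
Since $P(Y^* = 1 \mid Y = 1, f(X) > c) \le 1$, $\widehat{\mathrm{PPV}}_{\mathrm{mis}}(c)$ tends to underestimate $\mathrm{PPV}(c)$; this holds when misclassification is nondifferential or differential. In our motivating example of self‐harm risk prediction, PPV obtained using misclassified outcomes would underestimate the rate of self‐harm at a threshold selected for clinical intervention.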
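When only nonevents are incorrectly classified as having the event, $\widehat{\mathrm{PPV}}_{\mathrm{mis}}(c)$ tends to overestimate the desired estimand, $\mathrm{PPV}(c)$. To see this, we expand $P(Y^* = 1 \mid f(X) > c)$ as follows:
$$P(Y^* = 1 \mid f(X) > c) = \mathrm{PPV}(c) + P(Y^* = 1 \mid Y = 0, f(X) > c)\, \{1 - \mathrm{PPV}(c)\}.$$
Since $P(Y^* = 1 \mid Y = 0, f(X) > c)\, \{1 - \mathrm{PPV}(c)\} \ge 0$, in this situation, $\widehat{\mathrm{PPV}}_{\mathrm{mis}}(c)$ will have nonnegative bias in the case of both nondifferential and risk‐differential misclassification.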
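Similarly, $\widehat{\mathrm{NPV}}_{\mathrm{mis}}(c)$ will generally overestimate $\mathrm{NPV}(c)$ in the unidirectional misclassification setting in which only true events may be misclassified as $Y^* = 0$, regardless of whether the misclassification is nondifferential or risk‐differential:
$$P(Y^* = 0 \mid f(X) \le c) = \mathrm{NPV}(c) + P(Y^* = 0 \mid Y = 1, f(X) \le c)\, \{1 - \mathrm{NPV}(c)\} \ge \mathrm{NPV}(c).$$
Under the alternate unidirectional binary outcome misclassification in which only nonevents are incorrectly labeled as $Y^* = 1$,
$$P(Y^* = 0 \mid f(X) \le c) = P(Y^* = 0 \mid Y = 0, f(X) \le c)\, \mathrm{NPV}(c).$$
Given $P(Y^* = 0 \mid Y = 0, f(X) \le c) \le 1$, in this case, $\widehat{\mathrm{NPV}}_{\mathrm{mis}}(c)$ tends to underestimate $\mathrm{NPV}(c)$, in both nondifferential and differential misclassification settings.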
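As a purely illustrative calculation of the size of these effects, suppose only true events can be missed, that $\mathrm{PPV}(c) = 0.05$ and $\mathrm{NPV}(c) = 0.99$ at the chosen threshold, and that 70% of true events are recorded both above and below the threshold. These numbers are hypothetical and are chosen only to echo the PPV of roughly 5% cited for existing self‐harm models; $Y$ denotes the true outcome, $Y^*$ the recorded outcome, and $f(X)$ the risk score. The naive estimators would then target

$$P(Y^* = 1 \mid f(X) > c) = 0.7 \times 0.05 = 0.035, \qquad P(Y^* = 0 \mid f(X) \le c) = 0.99 + 0.3 \times (1 - 0.99) = 0.993,$$

so the naive PPV would understate the true event rate among flagged visits by 30% in relative terms, while the naive NPV would be slightly too high.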
3. Leveraging Chart Review to Improve Model Validation When There is Outcome Misclassification
We now consider how gold‐standard outcomes obtained through chart review can be used in prediction model validation to produce better performance estimates under outcome misclassification. As a reminder, we are interested in validating a prediction model $f$ in a validation set with misclassified binary outcomes $Y^*$; the true outcomes $Y$ are not observed on the full sample.
We do, however, have a smaller chart review sample that is selected from the same population as the validation set and contains gold‐standard true outcome determinations $Y$, in addition to risk predictions $f(X)$ and observed outcomes $Y^*$. While the prediction model could be accurately validated in this chart review sample (as it is drawn from the same population as the validation set, estimates would be unbiased), the sample size is small, and thus uncertainty around the performance estimates in the chart review sample is likely to be large. Ideally, one would like to take advantage of the considerably larger sample size in the validation set to more precisely estimate performance.
Toward that goal, with access to the true outcomes in the chart review set, we can estimate a model that relates true outcomes $Y$ to the misclassified observed outcomes $Y^*$ and risk factors $X$. Specifically, we can estimate $P(Y = 1 \mid Y^*, X)$ within the independent chart review set by $\hat{g}(Y^*, X)$, which can be any function of $(Y^*, X)$ that constrains fitted values to the interval [0,1], including a logistic regression model. This function can then be used to obtain fitted values for $P(Y_i = 1 \mid Y_i^*, X_i)$ in the validation sample to approximate the unobserved true outcomes.
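Our proposed method is then to substitute the unobserved true outcomes $Y_i$ with $\hat{g}_i = \hat{g}(Y_i^*, X_i)$, the chart‐review‐based estimate of $P(Y_i = 1 \mid Y_i^*, X_i)$. Proposed estimators for $\mathrm{TPR}(c)$ and $\mathrm{FPR}(c)$ are then defined as follows:
$$\widetilde{\mathrm{TPR}}(c) = \frac{\sum_{i=1}^{n} I(f(X_i) > c)\, \hat{g}_i}{\sum_{i=1}^{n} \hat{g}_i}, \tag{9}$$
$$\widetilde{\mathrm{FPR}}(c) = \frac{\sum_{i=1}^{n} I(f(X_i) > c)\, (1 - \hat{g}_i)}{\sum_{i=1}^{n} (1 - \hat{g}_i)}. \tag{10}$$
$\widetilde{\mathrm{AUC}}$ can be obtained as
$$\widetilde{\mathrm{AUC}} = \int_0^1 \widetilde{\mathrm{TPR}}(\widetilde{\mathrm{FPR}}^{-1}(u))\, du.$$
This approach can also be used for other measures of classification accuracy, such as PPV and NPV:
$$\widetilde{\mathrm{PPV}}(c) = \frac{\sum_{i=1}^{n} I(f(X_i) > c)\, \hat{g}_i}{\sum_{i=1}^{n} I(f(X_i) > c)}, \tag{11}$$
$$\widetilde{\mathrm{NPV}}(c) = \frac{\sum_{i=1}^{n} I(f(X_i) \le c)\, (1 - \hat{g}_i)}{\sum_{i=1}^{n} I(f(X_i) \le c)}. \tag{12}$$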
If as , then, the proposed estimators in Equations ((9), (10), [Link], (11), (12)) are consistent to the targets of interest in Equations ((1), (2), [Link], [Link], (3), (4)). We note that this is a sufficient condition, but there may be weaker conditions under which consistency holds. For example, for , both the numerator and denominator converge in probability to their targets, as shown below by applying the weak law of large numbers,
and, thus,
Provided that , by Slutsky's theorem,
Similar calculations can be used for the other estimated classification measures, including the FPR, and thus the AUC.
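The argument for TPR can be written out as follows. This is a sketch in assumed notation, which may differ from the notation used elsewhere in the paper: $T$ is the true outcome, $Y$ the observed outcome, $X$ the features, $\hat{p}$ the risk prediction, $c$ the threshold, and $n$ the validation sample size. It further assumes that $\hat{p}$ is a fixed function of $X$ and that the fitted model has already converged to $f(Y, X) = P(T = 1 \mid Y, X)$, so that estimation error in the fitted model is ignored.

```latex
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{\hat{p}_i > c\}\, f(Y_i, X_i)
  &\xrightarrow{\;p\;} E\big[\mathbf{1}\{\hat{p} > c\}\, f(Y, X)\big]
   = E\big[\mathbf{1}\{\hat{p} > c\}\, E(T \mid Y, X)\big]
   = P(\hat{p} > c,\; T = 1), \\
\frac{1}{n}\sum_{i=1}^{n} f(Y_i, X_i)
  &\xrightarrow{\;p\;} E\big[f(Y, X)\big] = E(T) = P(T = 1), \\
\frac{\sum_{i=1}^{n} \mathbf{1}\{\hat{p}_i > c\}\, f(Y_i, X_i)}{\sum_{i=1}^{n} f(Y_i, X_i)}
  &\xrightarrow{\;p\;} \frac{P(\hat{p} > c,\; T = 1)}{P(T = 1)} = \mathrm{TPR}(c),
  \qquad \text{provided } P(T = 1) > 0 .
\end{align*}
```

The first two lines use the weak law of large numbers and the law of iterated expectations; the last line combines them through Slutsky's theorem. Replacing $f$ with $1 - f$ gives the analogous statement for FPR.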
3.1. Estimating Uncertainty in Performance Measures
We next propose a method for accurately quantifying variability in the proposed performance measures by accounting for both sampling variability in the validation set and variability in the model approximating the true outcomes, which is derived from chart review data. To capture sampling variability in the validation sample, one can repeatedly sample with replacement from the validation set (i.e., nonparametric bootstrap) to generate multiple bootstrap samples [36].
Since the method we proposed involves estimating this model, we also need to account for the uncertainty in this estimation. Otherwise, we might be overly optimistic about the efficiency of the performance estimates. If we have access to individual observations in the chart review sample, we can also use a nonparametric bootstrap to obtain a sample in which to re-estimate the model. Alternatively, the parametric bootstrap (i.e., resampling from some assumed distribution) can be used, depending on the form of the model (e.g., fitted with a parametric model or a nonparametric model), to obtain bootstrap replicates of the fitted model in both settings, that is, whether we have access to the chart review data or only to the fitted model.
Bootstrap procedures for the validation sample and the fitted model must be integrated to obtain bootstrapped estimates of the performance metrics. For example, for TPR estimation, for each bootstrap replication, we obtain an estimate from the bootstrapped validation sample and the bootstrapped fitted model. Then, one can use standard approaches to aggregate estimates across bootstrap samples to derive quantile‐ or Wald‐based 95% confidence intervals for each measure. We demonstrate one possible approach to uncertainty quantification when the model is estimated with a logistic regression in our simulation study (see Section 4.2 for details).
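The integration of the two bootstrap procedures can be sketched as follows for the case in which individual chart review observations are available. This is an illustrative assumption-laden example rather than the authors' implementation: to stay short, the model for the true outcome is a deliberately simple stand-in (the share of true events within each observed-outcome stratum), and all names and simulated data are hypothetical.

```python
import random


def fit_by_observed_outcome(chart):
    """Stand-in model: share of true events within each observed-outcome stratum.

    chart holds tuples (observed outcome, risk, true outcome). An empty
    stratum falls back to 0.0, which is a simplification for this sketch.
    """
    probs = {}
    for y in (0, 1):
        stratum = [t for (yy, _, t) in chart if yy == y]
        probs[y] = sum(stratum) / len(stratum) if stratum else 0.0
    return probs


def plug_in_tpr(validation, probs, threshold):
    """Plug-in TPR; validation holds tuples (observed outcome, risk)."""
    soft = [probs[y] for (y, _) in validation]
    flagged_mass = sum(s for s, (_, p) in zip(soft, validation) if p > threshold)
    return flagged_mass / sum(soft)


def bootstrap_ci(chart, validation, threshold, n_boot=500, seed=0):
    """Percentile interval from resampling BOTH samples in every replication."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        chart_b = rng.choices(chart, k=len(chart))              # refit the model
        valid_b = rng.choices(validation, k=len(validation))    # resample validation
        estimates.append(plug_in_tpr(valid_b, fit_by_observed_outcome(chart_b), threshold))
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]


def simulate(n, rng):
    """Toy data generator (assumed rates: 0.7 for events, 0.05 for nonevents)."""
    rows = []
    for _ in range(n):
        p = rng.uniform(0.02, 0.4)
        t = 1 if rng.random() < p else 0
        y = 1 if rng.random() < (0.7 if t else 0.05) else 0
        rows.append((y, p, t))
    return rows


data_rng = random.Random(7)
chart = simulate(300, data_rng)
validation = [(y, p) for (y, p, _) in simulate(3000, data_rng)]
ci_low, ci_high = bootstrap_ci(chart, validation, threshold=0.2)
```

Resampling only the validation set while holding the fitted model fixed would omit the second source of variability and would tend to produce intervals that are too narrow.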
3.2. Pooling Chart Review and Validation Sample Data
The estimators proposed above in Section 3 can be used when the investigators only have access to the fitted model and do not require access to individual observations in the chart review sample. If individual‐level chart review data are accessible, then we can extend the proposed estimator to pool the chart review and validation samples when estimating performance metrics, for example, TPR:
Under the additional assumption , this pooled estimator is also consistent for the target. Details can be found in the Supporting Information. Variance estimates for the pooled estimators can be assessed via a nonparametric bootstrapping approach in which performance estimates are obtained in a series of bootstrap samples of size from the validation set (excluding any observations in the chart review set) and of size from the chart review set. This approach will account for variability in the (pooled) validation and chart review sets used for estimating prediction model performance and variability in the chart review set used to obtain . In the remainder of the manuscript, we mimic the more general setting in which we only have access to the fitted model and the chart review and validation samples are not pooled.
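One natural way to pool the two samples is sketched below. The exact form of the pooled estimator is given in the paper's equation and Supporting Information; this example only assumes that chart review observations contribute their gold-standard outcomes while validation observations contribute fitted values. The function name and toy inputs are hypothetical.

```python
def pooled_tpr(valid_risk, valid_soft, chart_risk, chart_true, threshold):
    """Pooled TPR: fitted values for validation rows, true outcomes for chart rows."""
    numerator = sum(s for p, s in zip(valid_risk, valid_soft) if p > threshold)
    numerator += sum(t for p, t in zip(chart_risk, chart_true) if p > threshold)
    denominator = sum(valid_soft) + sum(chart_true)
    return numerator / denominator


# Hypothetical toy input: two validation rows and two chart review rows.
pooled_example = pooled_tpr(
    valid_risk=[0.9, 0.2],
    valid_soft=[0.5, 0.5],
    chart_risk=[0.8, 0.1],
    chart_true=[1, 1],
    threshold=0.5,
)
```

Here the numerator is 0.5 + 1 = 1.5 and the denominator is 1.0 + 2 = 3.0, so the pooled estimate is 0.5.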
4. Simulation
4.1. Data Generation
We evaluated the proposed method using a plasmode simulation study with real‐world observations from clinical records data [37]. Our plasmode simulation follows the motivating example of risk prediction for self‐harm. The simulation dataset was obtained from the population of all outpatient mental health visits for patients 11 years old and older at 7 health systems (HealthPartners, Henry Ford Health, and the Colorado, Hawaii, Northwest, Southern California, and Washington regions of Kaiser Permanente) in the United States during 2009–2017 [26, 35]. Predictor data associated with each visit include more than 100 baseline risk factors (such as prior diagnoses, medications, utilization, and self‐harm) and a binary indicator of whether the person had a nonfatal self‐harm diagnosis or suicide death in the 90 days following the visit. These event indicators were considered to be the “true” outcome status for the purposes of this simulation study. In the original dataset, complete outcome observation was not available for people with non‐suicide death or disenrollment from the participating health systems during the 90‐day follow‐up period. The sample used for plasmode simulations was limited to visits with complete 90‐day outcome observation. Misclassified outcomes were simulated under different settings, following the scenarios described in Table 1. From the existing dataset with repeated visits per person, we created a dataset to use for simulation with 3 million independent observations by randomly sampling one visit per person from the entire population. For people with a self‐harm event following any visit, a visit with an event was sampled.
TABLE 1.
Misclassification scenarios explored in plasmode simulation studies. Data were generated with an event rate of 0.1 for all scenarios. Event rates of 0.05 and 0.2 were also explored for scenarios marked with an asterisk (*).
| Misclassification type | Description | Probability that a true event is observed as an event | Probability that a true nonevent is observed as an event | Scenario label |
|---|---|---|---|---|
| Unidirectional nondifferential | | 0.9; 0.7; 0.5 | 0; 0; 0 | (a); (b)*; (c) |
| Unidirectional differential | Probability that a true event is observed varies with “true” risk score, from one value for those with the lowest scores to another for those with the highest scores | | 0; 0; 0 | (d)*; (e); (f) |
| Bidirectional nondifferential | | 0.9; 0.7; 0.5 | 0.01; 0.05; 0.075 | (g); (h)*; (i) |
| Bidirectional differential | Both probabilities vary with “true” risk score, each from one value for those with the lowest scores to another for those with the highest scores | | | (j)*; (k); (l) |
All misclassification scenarios (presented in Table 1 and described below) were explored with 1000 simulation iterations. For each iteration, we randomly sampled (without replacement) 50 000 observations for training and validation sets and 1000 observations for the chart review set (independent of the validation set), that is, and , with stratified sampling to obtain a 10% event rate in training, validation, and chart review sets. Additional simulations considered smaller training and validation sets () and chart review sets as well as event rates of 5% and 20%.
We generated misclassified outcome data under various scenarios in which we manipulated the misclassification types (uni‐ and bidirectional, differential and nondifferential), misclassification rates, and event rates. We explored a variety of potential misclassification rates, ranging from minimal to more substantive misclassification (e.g., the probability that a true event is correctly observed ranges from 50% to 90%). We generated misclassified outcomes in the chart review set and validation set using Bernoulli distributions with the specified misclassification rates listed in Table 1. Misclassification rates were selected to explore a range of plausible clinical scenarios. For nondifferential misclassification, the probability of misclassification depended on the true event status only. For differential misclassification, the probability of misclassification depended on the “true” self‐harm risk, that is, risk scores that were estimated using a random forest prediction model trained with true (not misclassified) outcomes. For example, in the first unidirectional risk‐differential scenario listed in Table 1, we defined parameters for the probability of correctly observing events such that the probability of correctly diagnosing the event ranged from a minimum of 0.7 for observations with the lowest true risk of an event to a maximum of 0.95 for those at the highest risk. More details on the data‐generating mechanism are provided in the Supporting Information.
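The Bernoulli draws described above can be sketched as follows. The precise data‐generating mechanism is in the Supporting Information; this example assumes, purely for illustration, that in differential scenarios the probability of observing a true event changes linearly with the true risk score between the lowest and highest scores, and that the false positive probability is constant. The function name `misclassify` and its arguments are hypothetical.

```python
import random


def misclassify(t_true, risk, sens_low, sens_high, false_pos=0.0, seed=0):
    """Draw observed outcomes from Bernoulli distributions given true outcomes.

    sens_low / sens_high: probability that a true event is observed, at the
    lowest / highest true risk score (set them equal for nondifferential
    misclassification). false_pos: probability that a true nonevent is
    observed as an event (0 gives unidirectional misclassification).
    Linear interpolation on the risk scale is an assumption of this sketch.
    """
    rng = random.Random(seed)
    lo, hi = min(risk), max(risk)
    observed = []
    for t, r in zip(t_true, risk):
        frac = (r - lo) / (hi - lo) if hi > lo else 0.0
        sens = sens_low + frac * (sens_high - sens_low)
        prob_observed_event = sens if t == 1 else false_pos
        observed.append(1 if rng.random() < prob_observed_event else 0)
    return observed


# Hypothetical toy input.
t_toy = [1, 0, 1, 0, 1, 1, 0, 1]
r_toy = [0.10, 0.05, 0.30, 0.20, 0.50, 0.40, 0.15, 0.25]

# Unidirectional, risk-differential: 0.7 at lowest risk up to 0.95 at highest risk.
y_unidirectional = misclassify(t_toy, r_toy, sens_low=0.7, sens_high=0.95, seed=3)

# Bidirectional, nondifferential: 0.7 for events, 0.05 for nonevents.
y_bidirectional = misclassify(t_toy, r_toy, 0.7, 0.7, false_pos=0.05, seed=3)
```

In the unidirectional case no nonevent is ever recorded as an event, so every observed event corresponds to a true event.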
While our proposed approach can be used to validate any existing prediction model (regardless of whether the model was trained with correct or error‐prone data), in this simulation, we estimated prediction models with misclassified outcomes following the data‐generating specification of each misclassification scenario. In each simulation iteration, we estimated a random forest prediction model in the training set with misclassified outcomes. We then obtained fitted values (risk predictions) from that prediction model for all observations in the validation and chart review sets. More details on how the random forest prediction models were estimated (e.g., tuning parameter value selection) in each simulation iteration are provided in the Supporting Information.
4.2. Estimation of Performance Measures
In each simulation iteration, we calculated 3 estimators for each target performance measure (TPR, FPR, PPV, NPV, and AUC):
- those estimated using chart review data alone;
- those estimated by naively using potentially misclassified outcomes; and
- the proposed estimators presented above.
The proposed estimators were obtained through 3 steps:
- Estimate a logistic regression model for true outcomes from the chart review sample. Within the chart review sample, we estimated a regression model for the true outcomes given misclassified outcomes and predicted risk. We note that the “true” form of this model was not specified directly for these plasmode simulations, as we instead specified misclassification probabilities to generate misclassified outcomes based on true outcomes sampled from our real‐world dataset. For simplicity, a logistic regression model was selected. We saved the coefficient estimates and the corresponding coefficient covariance matrix. While we used a relatively simple model in our experiment, in practice and sample size allowing, one can use more flexible models, including splines, as the relationship would be unknown and could be complex.
- Plug fitted values into proposed formulae. We calculated the proposed estimates for all performance measures, including TPR, FPR, PPV, and NPV at all possible thresholds, in the validation set by plugging the fitted values obtained in step 1 into Equations (9)–(12). We also obtained proposed estimates for AUC.
- Estimate uncertainty using bootstrap sampling. We used parametric and nonparametric bootstrapping to obtain interval estimates for all performance measure estimates. Within each simulation iteration, we obtained 2000 bootstrap samples of size 50 000 with replacement from the validation set, retaining the misclassified outcomes and features for each bootstrap sample. We also captured uncertainty associated with estimating the model for true outcomes via parametric bootstrap sampling. For each bootstrap sample, we randomly drew coefficients from a multivariate normal distribution defined by the saved coefficient estimates and covariance matrix, and obtained fitted values under the model with the drawn coefficients. These predictions were then used to obtain proposed performance measures for each bootstrap sample. From the completion of 2000 bootstrap replications, we evaluated the empirical distribution of our estimated performance measures to derive 95% quantile‐based CIs.
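The parametric bootstrap in the third step can be sketched as follows. The coefficient estimates and covariance matrix below are made-up placeholder values, not results from the study, and the validation data are simulated; the sketch only shows how a coefficient draw from a multivariate normal distribution (via a Cholesky factor) can be paired with a resampled validation set in each replication. All function names are hypothetical.

```python
import math
import random


def cholesky(cov):
    """Lower-triangular factor of a symmetric positive definite matrix."""
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(cov[i][i] - s)
            else:
                low[i][j] = (cov[i][j] - s) / low[j][j]
    return low


def draw_coefficients(beta_hat, chol, rng):
    """One draw from a multivariate normal with mean beta_hat and the given factor."""
    z = [rng.gauss(0.0, 1.0) for _ in beta_hat]
    return [b + sum(chol[i][k] * z[k] for k in range(i + 1))
            for i, b in enumerate(beta_hat)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def plug_in_tpr(sample, beta, threshold):
    """Plug-in TPR; sample holds tuples (observed outcome, risk prediction)."""
    soft = [sigmoid(beta[0] + beta[1] * y + beta[2] * p) for y, p in sample]
    flagged_mass = sum(s for s, (_, p) in zip(soft, sample) if p > threshold)
    return flagged_mass / sum(soft)


def parametric_bootstrap_ci(validation, beta_hat, cov, threshold, n_boot=200, seed=0):
    """95% percentile interval combining validation resampling and coefficient draws."""
    rng = random.Random(seed)
    chol = cholesky(cov)
    draws = []
    for _ in range(n_boot):
        sample = rng.choices(validation, k=len(validation))
        beta_b = draw_coefficients(beta_hat, chol, rng)
        draws.append(plug_in_tpr(sample, beta_b, threshold))
    draws.sort()
    return draws[int(0.025 * n_boot)], draws[int(0.975 * n_boot) - 1]


# Placeholder inputs (assumptions for illustration only).
beta_hat = [-2.5, 3.5, 4.0]
cov = [[0.04, -0.01, -0.05],
       [-0.01, 0.06, 0.00],
       [-0.05, 0.00, 0.30]]
data_rng = random.Random(42)
validation = []
for _ in range(2000):
    p = data_rng.uniform(0.02, 0.4)
    validation.append((1 if data_rng.random() < 0.75 * p else 0, p))

ci_low, ci_high = parametric_bootstrap_ci(validation, beta_hat, cov, threshold=0.2)
```

This variant needs only the saved coefficients and covariance matrix, so it applies when individual chart review records are not available.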
We also estimated performance measures in the validation set using misclassified outcomes as described in Section 2.3 and in the chart review sample using gold‐standard outcomes. For each, we accounted for the sampling uncertainty via the 2000 bootstrap resamples from the validation set (for estimates using misclassified outcomes) and from the chart review sample (for gold‐standard estimates) and derived the estimates naively using the potentially misclassified outcomes and the estimates using the chart review sample alone in each bootstrapped sample. The 95% CIs were obtained for each estimate by using the 2.5% and 97.5% percentiles of these 2000 estimates.
4.3. Evaluation of Proposed Method
In each iteration, we calculated the empirical “true” performance measures for each metric in the validation set using true outcomes known to the investigators in this plasmode simulation scenario. Bias, mean squared errors (MSE), and 95% CI coverage with respect to the empirical targets (estimated with true outcomes) and average CI width were calculated for estimates naively using the potentially misclassified outcomes in the validation set, using the chart review sample alone, and our proposed estimates over 1000 simulation iterations. We computed the Monte Carlo standard errors (MCSE) for all bias, MSE, and 95% CI coverage and average CI width estimates [38].
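The summary measures named above can be computed as in the following sketch. It uses commonly cited Monte Carlo standard error formulas (sample standard deviation over the square root of the number of iterations for means, and the binomial formula for coverage); whether these match the exact formulas in [38] is an assumption. The function name `summarize` and the four-iteration input are hypothetical.

```python
import math


def sample_sd(values):
    """Sample standard deviation (denominator n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def summarize(estimates, ci_lows, ci_highs, targets):
    """Bias, MSE, CI coverage, and mean CI width, each with a Monte Carlo SE.

    targets holds the iteration-specific empirical target (the measure
    computed with true outcomes in that iteration's validation set).
    """
    n = len(estimates)
    errors = [e - t for e, t in zip(estimates, targets)]
    squared = [d * d for d in errors]
    covered = [1.0 if lo <= t <= hi else 0.0
               for lo, hi, t in zip(ci_lows, ci_highs, targets)]
    widths = [hi - lo for lo, hi in zip(ci_lows, ci_highs)]
    coverage = sum(covered) / n
    return {
        "bias": sum(errors) / n,
        "bias_mcse": sample_sd(errors) / math.sqrt(n),
        "mse": sum(squared) / n,
        "mse_mcse": sample_sd(squared) / math.sqrt(n),
        "coverage": coverage,
        "coverage_mcse": math.sqrt(coverage * (1.0 - coverage) / n),
        "width": sum(widths) / n,
        "width_mcse": sample_sd(widths) / math.sqrt(n),
    }


# Hypothetical results from four simulation iterations.
summary = summarize(
    estimates=[0.80, 0.82, 0.78, 0.81],
    ci_lows=[0.75, 0.79, 0.74, 0.77],
    ci_highs=[0.85, 0.86, 0.79, 0.84],
    targets=[0.80, 0.80, 0.80, 0.80],
)
```

In this toy input three of the four intervals contain the target, so the estimated coverage is 0.75.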
5. Results
Random forest prediction models trained with misclassified outcomes had similar or slightly lower performance than prediction models trained with correct outcomes. The AUC of prediction models trained with correct outcomes (averaged over all simulation replications) at an event rate of 0.10 and a training set size of 50 000 was 0.871. The AUC of prediction models trained with misclassified outcomes at that same event rate ranged from 0.853 to 0.871 across the misclassification scenarios examined, with lower AUC in settings with larger misclassification rates.
We present the results for our evaluation of estimators for AUC under different misclassification scenarios and an event rate of 0.1 in Figures 2, 3, 4 and Table S1 in the Supporting Information. Figure 2 shows the average bias across simulation replications for the three AUC estimators, where bias is defined as the difference between each estimate of AUC and the AUC as evaluated in the validation set with true outcomes observed. Across the misclassification scenarios examined, the proposed estimator showed negligible bias that was smaller in magnitude than that of the AUC estimate calculated with misclassified outcomes. For example, in the nondifferential unidirectional misclassification scenario in which 90% of true events were correctly observed (scenario [a]), the proposed method had an average bias (MCSE) of , while estimating AUC with misclassified outcomes showed an average bias of . For all scenarios, MCSEs were sufficiently small in magnitude to support making conclusions about comparative properties of the estimation methods. MCSEs for all evaluations are presented in the Supporting Information. The magnitude and direction of bias for the estimate calculated with misclassified outcomes follow the behavior outlined in Section 2.3 for the unidirectional misclassification scenarios (scenarios a–f). The directions of bias for the corresponding TPR and FPR estimates are also as expected; tables detailing TPR and FPR estimators separately are in Supporting Information: Tables S2–S5. In the bidirectional misclassification scenarios examined (scenarios g–l), the estimate calculated with misclassified outcomes underestimated the true AUC in the validation set, and the magnitude of the bias was larger than in the unidirectional examples. Recall that the data generation process for bidirectional misclassification used the same misclassification rates as the unidirectional scenarios for misclassifying events and added misclassification of nonevents as well. As expected, the estimate of AUC calculated with true outcomes identified in the gold‐standard set was unbiased in all scenarios.
FIGURE 2.

Absolute bias of AUC estimates under different misclassification scenarios, event rate = 0.1, with training and validation sets of 50 000 observations and a chart review set of 1000 observations. Scenario labels and descriptions are provided in Table 1. For nondifferential cases ([a–c] and [g–i]), the x‐axis corresponds to worsening misclassification. For differential cases ([d–f] and [j–l]), the direction of association between risk and misclassification rates switches for the farthest right scenario. Shapes and colors denote the estimator using potentially misclassified outcomes (circles), the estimator using only chart reviewed data (triangles), and our proposed estimator accounting for misclassification (squares).
FIGURE 3.

The coverage of 95% CIs for AUC under different misclassification scenarios, event rate = 0.1, with training and validation sets of 50 000 observations and a chart review set of 1000 observations. Scenario labels and descriptions are provided in Table 1. For nondifferential cases ([a–c] and [g–i]), the x‐axis corresponds to worsening misclassification. For differential cases ([d–f] and [j–l]), the direction of association between risk and misclassification rates switches for the farthest right scenario. Shapes and colors denote the estimator using potentially misclassified outcomes (circles), the estimator using only chart reviewed data (triangles), and our proposed estimator accounting for misclassification (squares).
FIGURE 4.

Average 95% CI width under different misclassification scenarios, event rate = 0.1, with training and validation sets of 50 000 observations and a chart review set of 1000 observations. Scenario labels and descriptions are provided in Table 1. For nondifferential cases ([a–c] and [g–i]), the x‐axis corresponds to worsening misclassification. For differential cases ([d–f] and [j–l]), the direction of association between risk and misclassification rates switches for the farthest right scenario. Shapes and colors denote the estimator using potentially misclassified outcomes (circles), the estimator using only chart reviewed data (triangles), and our proposed estimator accounting for misclassification (squares).
Figure 3 displays the coverage, defined here as the percentage of simulations for which each estimator's 95% CI contained the AUC of the prediction model as evaluated in the validation set using true observations. Coverage of 95% CIs for the proposed and gold‐standard estimators was close to 95%, while coverage for the estimators calculated with misclassified outcomes was nearly 0 for all misclassification settings except scenario (a).
Figure 4 presents the average width of 95% CIs for each AUC estimator across the 1000 simulations. In all scenarios, the interval estimates for the proposed estimator are wider than those for the estimator calculated with misclassified outcomes; this is expected, as the proposed method accounts for the additional uncertainty introduced by estimating the model approximating the true outcomes in the chart review set when constructing intervals. Despite this additional variability, we see that 95% CIs for the proposed estimator are narrower than those of the AUC estimate obtained in the smaller chart review set, indicating that evaluating performance in the larger validation set did improve precision under these simulation settings. As seen in Figure 3, the coverage of 95% CIs for the estimator calculated with misclassified outcomes is poor; this is due to the considerable bias in these estimates as well as the narrower CI width. Our proposed estimator has appropriate coverage due to minimal bias and an appropriate 95% CI width that reflects uncertainty.
We present evaluation metrics for PPV and NPV, including assessed bias, MSE, coverage, and 95% CI width, in Tables S6–S9. Comparisons between estimation methods are consistent with those described above for AUC. Bias in PPV and NPV estimates obtained using misclassified outcomes follows the behavior described in Section 2.3.
We present additional results in the Supporting Information, including bias, MSE, coverage, and 95% CI width for AUC and other performance measures in settings with 5% and 20% event rates (Tables S10–S18), smaller chart review sets (Tables S19–S27), and smaller training and validation sets (Tables S28–S36). Conclusions about the behavior of each estimator at these different event rates and sample sizes are consistent with those presented and discussed above. Notably, as the event rate increases, the absolute bias of PPV and NPV estimators calculated with misclassified outcomes is amplified due to the influence event rate has on PPV (independent of the discriminatory ability of the model). As expected, variability of all performance measure estimates was greater for all methods at smaller validation and chart review sample sizes. All estimated MSEs and constructed 95% CIs accurately reflected this increased sampling variability. At the smallest chart review size examined, 95% CI width was still narrower when estimating performance with the proposed method than when using only the chart review set with gold‐standard outcome measures.
6. Discussion
Accurately estimating a prediction model's performance is important as it informs decision‐making about whether and how to use the model in clinical care [39, 40]. In the presence of outcome misclassification, validation estimates obtained by using the observed outcomes quantify how well a model predicts those misclassified outcomes, which, as we show, may not reflect predictive validity for true outcomes and may mislead decision‐makers. Our proposed method, which uses a small chart review sample to train a model predicting the true outcome given the observed outcome and risk factors, improved the evaluation of predictive performance across the outcome misclassification scenarios examined. Estimates based on this approach showed small bias and confidence intervals with close to nominal coverage (86%–97% for AUC estimates).
The method we proposed is applicable to any type of prediction modeling strategy and is computationally efficient to implement. The key idea is leveraging a chart review set to develop a model for the relationship between the true outcome status, which we do not have access to in our validation set, and the potentially misclassified outcomes and the features of our observations, which we do have full access to in our validation set. With this relationship model, one can obtain fitted values for all the validation observations and use this approximation as a substitute for the true outcome status instead of naively relying on the misclassified outcomes to calculate performance estimates. This approach elicits information on the misclassification rate from a representative chart review sample rather than requiring previous knowledge of those rates. One can fit any appropriate model to the chart review set for deriving the proxy estimates. Uncertainty in estimating the model can be accounted for using standard bootstrapping methods; the resulting confidence intervals around proposed estimates provide better coverage than those based on the naive estimates. Rather than using a bootstrap, one could instead use imputation to account for uncertainty; our procedure can be viewed as a single imputation using the fitted values. Multiple imputation could proceed as follows: in each of several imputation rounds, impute the true outcomes, then estimate TPR, FPR, or AUC using the imputed outcomes in place of the fitted values in our proposed estimators. Then Rubin's rules [41] can be used for variance estimation: the within‐imputation variance can be estimated using the asymptotic variance of the corresponding estimator [42, 43, 44, 45], while the across‐imputation variance can be estimated using the standard estimator. The adjustment method we proposed can be applied to any prediction model, regardless of whether it was developed with or without misclassified outcomes, since our objective is accurately validating a given prediction model in a validation set of interest.
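The multiple imputation alternative can be sketched as follows. This is a hedged illustration rather than a procedure evaluated in the paper: it assumes that true outcomes are imputed by Bernoulli draws with probability equal to the fitted value, and it uses a simple binomial approximation for the within‐imputation variance of TPR in place of the asymptotic variance estimators cited above. The combining step follows Rubin's rules, with total variance equal to the within‐imputation variance plus (1 + 1/M) times the between‐imputation variance. All names and data are hypothetical.

```python
import random


def imputed_tpr(risk, soft_label, threshold, rng):
    """TPR from one imputed set of true outcomes, with a binomial variance.

    Each true outcome is drawn as Bernoulli(fitted value); this imputation
    model and the variance approximation are assumptions of the sketch.
    """
    t_imp = [1 if rng.random() < s else 0 for s in soft_label]
    n_pos = sum(t_imp)
    tpr = sum(t for t, p in zip(t_imp, risk) if p > threshold) / n_pos
    return tpr, tpr * (1.0 - tpr) / n_pos


def rubins_rules(estimates, within_vars):
    """Pooled estimate and total variance across M imputations."""
    m = len(estimates)
    pooled = sum(estimates) / m
    within = sum(within_vars) / m
    between = sum((e - pooled) ** 2 for e in estimates) / (m - 1)
    return pooled, within + (1.0 + 1.0 / m) * between


# Hypothetical validation data: risk predictions and fitted values.
rng = random.Random(0)
risk = [rng.random() for _ in range(500)]
soft_label = [0.8 * p for p in risk]

results = [imputed_tpr(risk, soft_label, threshold=0.5, rng=rng) for _ in range(20)]
pooled_tpr, total_var = rubins_rules([r[0] for r in results], [r[1] for r in results])
```

Note that this sketch holds the fitted values fixed across imputations; reflecting uncertainty in the fitted model itself would require additionally varying its parameters between imputation rounds.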
The proposed method can also be applied across all prediction modeling approaches, including parametric and nonparametric models, as the approach does not require any knowledge of or adjustment to how the prediction model was developed.
As shown in Section 3, the targets of our proposed estimates are not always guaranteed to be the targets of interest (i.e., the true TPR, FPR, PPV, or AUC). We provided a sufficient condition under which our proposed estimates will be asymptotically unbiased: as . However, this condition may not typically be met and cannot be verified in the available data. When this assumption does not hold, the method we proposed may be asymptotically biased because the proposed estimator is not consistent for the target. However, a lack of consistency does not preclude improvement over naive estimates, as demonstrated in our simulations. The data‐generating approach used in our plasmode simulations did not include specifying a model for the true outcomes given the observed outcomes and features; instead, we sampled true outcomes from existing data and specified a misclassification probability. As such, the model fitted in the chart review set in our simulations is likely misspecified. We found that the proposed estimates have less bias and better coverage than estimates of performance calculated with misclassified outcomes, and this improvement holds across various misclassification scenarios. Although our proposed estimator is not guaranteed to be asymptotically unbiased for the target, the proposed estimators were more efficient than estimates obtained using the chart review set alone while maintaining expected coverage. Even when inconsistent, the fitted model may approximate the true outcomes sufficiently well to better accommodate binary outcome misclassification. The next steps of this research include theoretical development to more clearly delineate the scenarios in which the proposed method reduces bias and designing approaches for examining the method's assumptions in a dataset.
The proposed method was demonstrated with a relatively large chart review sample randomly drawn from the same population as the validation set. Our simulation study used a chart review sample of 1000 observations, which is not typically practical due to the cost and time required for expert annotation. Decreasing the size of the chart review sample will increase the variability in modeling the true outcomes and, in turn, the MSE of the proposed performance estimates. In practice, chart review samples are relatively small. Chart review samples may also not be representative of the target population, as researchers typically attempt to optimize chart review resources by oversampling observations with higher risk scores or with certain features predictive of the outcome or misclassification risk [19, 46]. Also, when the outcome of interest is a rare event, a small random sample may not contain enough events to be informative. The ideal method would enable accurate estimation of performance measures in the larger population of interest, even with more efficient stratified sampling of a smaller chart review set. To address this limitation, further research is needed to extend the proposed method so that it is robust under distribution shift, that is, when the joint distribution of the data differs between the chart review and validation sets. This is especially common in the context of data fusion, where, for example, chart review data and validation data may be collected from multiple sources. Exploration of sampling strategies, including two‐phase designs, may uncover more effective approaches for selecting chart review samples [47, 48].
This paper focused on improving estimates of prediction model discrimination and classification accuracy and did not consider measures of calibration. Chart review samples with gold‐standard outcome determinations could also be leveraged to improve prediction calibration in the presence of outcome misclassification. As with the performance measures explored here, outcome misclassification will bias model calibration when observed (misclassified) event rates are not similar to true event rates. The methods presented here could potentially be extended to accommodate binary outcome misclassification for calibration measures like Brier score, calibration‐in‐the‐large, and regression‐based intercept and slope calibration estimates.
The practical impact of outcome misclassification on prediction model validation was not severe in all scenarios examined. For example, in the unidirectional‐nondifferential setting where 10% of true events were not captured, the average bias in AUC was less than 0.01. Such minor biases in performance measures may not be clinically relevant, and in these cases, the additional time and resources required to collect chart review data and obtain adjusted estimates could be avoided. Health providers can leverage their expertise on the likely degree and magnitude of misclassification to determine whether the additional time and resources required to collect chart review data and obtain proposed estimates are necessary. The simulation framework we presented here could be used to explore how robust performance validation with potentially misclassified outcomes may be across reasonable misclassification scenarios.
Our method focuses on evaluating the performance of any given prediction model in a validation set. There is also a need for methods to improve prediction model estimation when outcomes are misclassified. Zawistowski et al. [29] proposed a method that simultaneously adjusts for outcome misclassification in model estimation and validation. However, their method relies on a likelihood‐based prediction model, which is not applicable to machine learning prediction models. Further research in this area would improve the performance of clinical prediction models estimated with real‐world data prone to outcome misclassification.
7. Conclusion
Our proposed method uses a chart review sample with gold‐standard outcomes to accommodate binary outcome misclassification when evaluating a prediction model's performance. Across the misclassification scenarios examined, our proposed estimators empirically reduced bias compared to naively evaluating the prediction model with misclassified outcomes and provided reliable uncertainty quantification with satisfactory efficiency. The resulting point and interval estimates for a prediction model's performance provide information about the targets of interest, namely how well a model predicts true (if unobserved) outcomes, and can guide decision‐making about whether and how to use the prediction model.
Funding
This work was supported by the National Institute of Mental Health (Grant Nos. R01‐MH125821, U19‐MH099201, and U19‐MH121738).
Conflicts of Interest
Dr. Shortreed has worked on grants awarded to Kaiser Permanente Washington Health Research Institute (KPWHRI) by Bristol Myers Squibb and by Pfizer. She was also a co‐investigator on grants awarded to KPWHRI from Syneos Health, who represented a consortium of pharmaceutical companies carrying out FDA‐mandated studies regarding the safety of extended‐release opioids.
Supporting information
Data S1. Supporting Information.
Data Availability Statement
The dataset used in this study is not publicly available, as it contains detailed information from patient clinical records data.
References
- 1. Goldstein B. A., Navar A. M., Pencina M. J., and Ioannidis J. P. A., “Opportunities and Challenges in Developing Risk Prediction Models With Electronic Health Records Data: A Systematic Review,” Journal of the American Medical Informatics Association 24, no. 1 (2017): 198–208.
- 2. Xiao C., Choi E., and Sun J., “Opportunities and Challenges in Developing Deep Learning Models Using Electronic Health Records Data: A Systematic Review,” Journal of the American Medical Informatics Association 25, no. 10 (2018): 1419–1428.
- 3. Rajkomar A., Oren E., Chen K., et al., “Scalable and Accurate Deep Learning With Electronic Health Records,” npj Digital Medicine 1, no. 18 (2018).
- 4. Parikh R. B., Kakad M., and Bates D. W., “Integrating Predictive Analytics Into High‐Value Care: The Dawn of Precision Delivery,” Journal of the American Medical Association 315, no. 7 (2016): 651–652, 10.1001/jama.2015.19417.
- 5. Swinckels L., Bennis F. C., Ziesemer K. A., et al., “The Use of Deep Learning and Machine Learning on Longitudinal Electronic Health Records for the Early Detection and Prevention of Diseases: Scoping Review,” Journal of Medical Internet Research 26 (2024): e48320, 10.2196/48320.
- 6. Ng K., Steinhubl S. R., deFilippi C., Dey S., and Stewart W. F., “Early Detection of Heart Failure Using Electronic Health Records,” Circulation. Cardiovascular Quality and Outcomes 9, no. 6 (2016): 649–658, 10.1161/CIRCOUTCOMES.116.002797.
- 7. Zheng L., Wang O., Hao S., et al., “Development of an Early‐Warning System for High‐Risk Patients for Suicide Attempt Using Deep Learning and Electronic Health Records,” Translational Psychiatry 10, no. 1 (2020): 72.
- 8. Eapen Z. J., Liang L., Fonarow G. C., et al., “Validated, Electronic Health Record Deployable Prediction Models for Assessing Patient Risk of 30‐Day Rehospitalization and Mortality in Older Heart Failure Patients,” JACC. Heart Failure 1, no. 3 (2013): 245–251.
- 9. Verhoeff M., de Groot J., Burgers J. S., and van Munster B. C., “Development and Internal Validation of Prediction Models for Future Hospital Care Utilization by Patients With Multimorbidity Using Electronic Health Record Data,” PLoS One 17, no. 3 (2022): e0260829.
- 10. Rothman B., Leonard J. C., and Vigoda M. M., “Future of Electronic Health Records: Implications for Decision Support,” Mount Sinai Journal of Medicine: A Journal of Translational and Personalized Medicine 79, no. 6 (2012): 757–768.
- 11. Bates D. W., Saria S., Ohno‐Machado L., Shah A., and Escobar G., “Big Data in Health Care: Using Analytics to Identify and Manage High‐Risk and High‐Cost Patients,” Health Affairs 33, no. 7 (2014): 1123–1131.
- 12. Yarborough B. J. H., Stumbo S. P., Schneider J., Richards J. E., Hooker S. A., and Rossom R., “Clinical Implementation of Suicide Risk Prediction Models in Healthcare: A Qualitative Study,” BMC Psychiatry 22, no. 1 (2022): 789.
- 13. Belsher B. E., Smolenski D. J., Pruitt L. D., et al., “Prediction Models for Suicide Attempts and Deaths: A Systematic Review and Simulation,” JAMA Psychiatry 76, no. 6 (2019): 642–651.
- 14. Simon G. E., Shortreed S. M., and Coley R. Y., “Positive Predictive Values and Potential Success of Suicide Prediction Models,” JAMA Psychiatry 76, no. 8 (2019): 868–869.
- 15. Wu H., Yamal J. M., Yaseen A., and Maroufy V., Statistics and Machine Learning Methods for EHR Data: From Data Extraction to Data Analytics (CRC Press, 2020).
- 16. Duan R., Cao M., Wu Y., et al., “An Empirical Study for Impacts of Measurement Errors on EHR Based Association Studies,” in AMIA Annual Symposium Proceedings, vol. 2016 (American Medical Informatics Association, 2016), 1764.
- 17. Nissen F., Quint J. K., Morales D. R., and Douglas I. J., “How to Validate a Diagnosis Recorded in Electronic Health Records,” Breathe 15, no. 1 (2019): 64–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Walkup J. T., Townsend L., Crystal S., and Olfson M., “A Systematic Review of Validated Methods for Identifying Suicide or Suicidal Ideation Using Administrative or Claims Data,” Pharmacoepidemiology and Drug Safety 21 (2012): 174–182. [DOI] [PubMed] [Google Scholar]
- 19. Simon G. E., Shortreed S. M., Boggs J. M., et al., “Accuracy of ICD‐10‐CM Encounter Diagnoses From Health Records for Identifying Self‐Harm Events,” Journal of the American Medical Informatics Association 29, no. 12 (2022): 2023–2031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Mars B., Cornish R., Heron J., et al., “Using Data Linkage to Investigate Inconsistent Reporting of Self‐Harm and Questionnaire Non‐Response,” Archives of Suicide Research 20, no. 2 (2016): 113–141. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Ball J. R., Miller B. T., and Balogh E. P., Improving Diagnosis in Health Care (National Academies Press, 2016). [PubMed] [Google Scholar]
- 22. Haase C. B., Brodersen J., and Bulow J., “8 the Lack of Ontological Awareness in Evidence‐Based Medicine Allows Overdiagnosis,” BMJ Evidence‐Based Medicine 24, no. Suppl 1 (2019): A5, 10.1136/bmjebm-2019-EBMLive.8. [DOI] [Google Scholar]
- 23. Chen Y., Wang J., Chubak J., and Hubbard R. A., “Inflation of Type I Error Rates due to Differential Misclassification in EHR‐Derived Outcomes: Empirical Illustration Using Breast Cancer Recurrence,” Pharmacoepidemiology and Drug Safety 28, no. 2 (2019): 264–268. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Desai R. J., Levin R., Lin K. J., and Patorno E., “Bias Implications of Outcome Misclassification in Observational Studies Evaluating Association Between Treatments and All‐Cause or Cardiovascular Mortality Using Administrative Claims,” Journal of the American Heart Association 9, no. 17 (2020): e016906. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Edwards J. K., Cole S. R., Shook‐Sa B. E., Zivich P. N., Zhang N., and Lesko C. R., “When Does Differential Outcome Misclassification Matter for Estimating Prevalence?,” Epidemiology 34, no. 2 (2023): 192–200. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Simon G. E., Johnson E., Lawrence J. M., et al., “Predicting Suicide Attempts and Suicide Deaths Following Outpatient Visits Using Electronic Health Records,” American Journal of Psychiatry 175, no. 10 (2018): 951–960. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Chan M. K., Bhatti H., Meader N., et al., “Predicting Suicide Following Self‐Harm: Systematic Review of Risk Factors and Risk Scales,” British Journal of Psychiatry 209, no. 4 (2016): 277–283. [DOI] [PubMed] [Google Scholar]
- 28. Wang L., Shaw P. A., Mathelier H. M., Kimmel S. E., and French B., “Evaluating Risk‐Prediction Models Using Data From Electronic Health Records,” Annals of Applied Statistics 10, no. 1 (2016): 286–304. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Zawistowski M., Sussman J. B., Hofer T. P., Bentley D., Hayward R. A., and Wiitala W. L., “Corrected ROC Analysis for Misclassified Binary Outcomes,” Statistics in Medicine 36, no. 13 (2017): 2148–2160. [DOI] [PubMed] [Google Scholar]
- 30. Beesley L. J. and Mukherjee B., “Statistical Inference for Association Studies Using Electronic Health Records: Handling Both Selection Bias and Outcome Misclassification,” Biometrics 78, no. 1 (2022): 214–226. [DOI] [PubMed] [Google Scholar]
- 31. Webb K. A. H. and Wells M. T., “Statistical Inference for Association Studies in the Presence of Binary Outcome Misclassification,” (2023). arXiv preprint arXiv:2303.10215. [DOI] [PubMed]
- 32. Lyles R. H., Williamson J. M., Lin H. M., and Heilig C. M., “Extending McNemar's Test: Estimation and Inference When Paired Binary Outcome Data Are Misclassified,” Biometrics 61, no. 1 (2005): 287–294. [DOI] [PubMed] [Google Scholar]
- 33. Shu D. and Yi G. Y., “Causal Inference With Noisy Data: Bias Analysis and Estimation Approaches to Simultaneously Addressing Missingness and Misclassification in Binary Outcomes,” Statistics in Medicine 39, no. 4 (2020): 456–468. [DOI] [PubMed] [Google Scholar]
- 34. Höfler M., “The Effect of Misclassification on the Estimation of Association: A Review,” International Journal of Methods in Psychiatric Research 14, no. 2 (2005): 92–101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Shortreed S. M., Walker R. L., Johnson E., et al., “Complex Modeling With Detailed Temporal Predictors Does Not Improve Health Records‐Based Suicide Risk Prediction,” npj Digital Medicine 6, no. 1 (2023): 47. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Efron B. and Tibshirani R. J., An Introduction to the Bootstrap (Chapman and Hall/CRC, 1994). [Google Scholar]
- 37. Franklin J. M., Schneeweiss S., Polinski J. M., and Rassen J. A., “Plasmode Simulation for the Evaluation of Pharmacoepidemiologic Methods in Complex Healthcare Databases,” Computational Statistics & Data Analysis 72 (2014): 219–226. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Morris T. P., White I. R., and Crowther M. J., “Using Simulation Studies to Evaluate Statistical Methods,” Statistics in Medicine 38, no. 11 (2019): 2074–2102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Altman D. G. and Royston P., “What Do We Mean by Validating a Prognostic Model?,” Statistics in Medicine 19, no. 4 (2000): 453–473. [DOI] [PubMed] [Google Scholar]
- 40. Sperrin M., Riley R. D., Collins G. S., and Martin G. P., “Targeted Validation: Validating Clinical Prediction Models in Their Intended Population and Setting,” Diagnostic and Prognostic Research 6, no. 1 (2022): 24. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41. Rubin D. B., Multiple Imputation for Nonresponse in Surveys, vol. 81 (John Wiley & Sons, 2004). [Google Scholar]
- 42. Pepe M. S., The Statistical Evaluation of Medical Tests for Classification and Prediction (Oxford university press, 2003). [Google Scholar]
- 43. Gu W. and Pepe M., “Measures to Summarize and Compare the Predictive Capacity of Markers,” International Journal of Biostatistics 5, no. 1 (2009): 27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. LeDell E., Petersen M., and van der Laan M., “Computationally Efficient Confidence Intervals for Cross‐Validated Area Under the ROC Curve Estimates,” Electronic Journal of Statistics 9, no. 1 (2015): 1583. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Steingrimsson J. A., Wen L., Voter S., and Dahabreh I. J., “Interpretable Meta‐Analysis of Model or Marker Performance,” (2024). arXiv preprint arXiv:2409.13458.
- 46. Yin Z., Tong J., Chen Y., Hubbard R. A., and Tang C. Y., “A Cost‐Effective Chart Review Sampling Design to Account for Phenotyping Error in Electronic Health Records (EHR) Data,” Journal of the American Medical Informatics Association 29, no. 1 (2022): 52–61. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Legg J. C. and Fuller W. A., “Two‐Phase Sampling,” in Handbook of Statistics, vol. 29 (Elsevier, 2009), 55–70. [Google Scholar]
- 48. Amorim G., Tao R., Lotspeich S., Shaw P. A., Lumley T., and Shepherd B. E., “Two‐Phase Sampling Designs for Data Validation in Settings With Covariate Measurement Error and Continuous Outcome,” Journal of the Royal Statistical Society: Series A (Statistics in Society) 184, no. 4 (2021): 1368–1389. [DOI] [PMC free article] [PubMed] [Google Scholar]
Supplementary Materials
Data S1. Supporting Information.
Data Availability Statement
The dataset used in this study is not publicly available because it contains detailed information from patient clinical records.
