American Journal of Epidemiology. 2021 Feb 1;190(9):1830–1840. doi: 10.1093/aje/kwab010

Addressing Measurement Error in Random Forests Using Quantitative Bias Analysis

Tammy Jiang, Jaimie L Gradus, Timothy L Lash, Matthew P Fox
PMCID: PMC8408353  PMID: 33517416

Abstract

Although variables are often measured with error, the impact of measurement error on machine-learning predictions is seldom quantified. The purpose of this study was to assess the impact of measurement error on the performance of random-forest models and variable importance. First, we assessed the impact of misclassification (i.e., measurement error of categorical variables) of predictors on random-forest model performance (e.g., accuracy, sensitivity) and variable importance (mean decrease in accuracy) using data from the National Comorbidity Survey Replication (2001–2003). Second, we created simulated data sets in which we knew the true model performance and variable importance measures and could verify that quantitative bias analysis was recovering the truth in misclassified versions of the data sets. Our findings showed that measurement error in the data used to construct random forests can distort model performance and variable importance measures and that bias analysis can recover the correct results. This study highlights the utility of applying quantitative bias analysis in machine learning to quantify the impact of measurement error on study results.

Keywords: machine learning, measurement error, misclassification, noise, quantitative bias analysis, random forests

Abbreviations

AUC, area under the receiver operating characteristic curve; NCS-R, National Comorbidity Survey Replication; NPV, negative predictive value; PPV, positive predictive value; SI, simulation interval.

Editor’s note: An invited commentary on this article appears on page 1841, and the authors’ response appears on page 1844.

Machine learning is a branch of computer science that aims to construct programs that improve with more data (1). Machine learning is increasingly used to predict health outcomes and detect novel risk factors for diseases (2, 3). However, the validity of the data in machine learning is often overlooked (4). Measurement error is a common source of bias in epidemiologic studies. It arises when information is not correctly captured in the study database due to faulty instruments, inaccurate information from study respondents, or mistakes in the recording of data (5). It is well known that measurement error in traditional epidemiologic studies can lead to biased effect estimates (6–8). For example, differential exposure misclassification may lead to upward or downward bias in effect estimates (6, 9, 10). However, few studies quantify the impact of measurement error on machine-learning models used to predict health outcomes (2, 4).

Understanding the impact of measurement error on machine learning is important because there may be consequences for patient care decision-making. A central motivation for using machine learning is to develop clinical tools that identify high-risk patients so that resources can be targeted (11). However, measurement error might lead to poor prediction accuracy, and consequently, high-risk patients in need of care may not receive interventions, or scarce resources may be spent on persons who are not truly at risk. In addition, variable importance is used to assess which predictors are important for accurately predicting an outcome. For example, random-forest variable importance rankings have been used to identify a small set of genes that can be used for diagnostic purposes (12–15). Incorrect variable rankings due to measurement error might lead to investigation of false-positive risk factors or lack of investigation of false-negative risk factors.

In the machine-learning literature, measurement error is referred to as “label noise” (16). Label noise has been found to decrease prediction performance (17–23), increase the complexity of models (20, 24), and affect feature selection and ranking (25–27). Current approaches to addressing label noise in machine learning include 1) label-noise–robust models, 2) data-cleansing methods, and 3) label-noise–tolerant learning algorithms (16, 28–33). Label-noise–robust models are algorithms that are supposedly robust against measurement error but rarely are, except in simple situations (32, 34). Data-cleansing methods involve excluding observations that 1) are measured with error (35), 2) increase model complexity (36, 37), 3) are inaccurately predicted by the classifier (38), or 4) have an abnormally large influence (e.g., outliers) (39). A major challenge with data cleansing is that it can remove a substantial amount of data (40, 41). Label-noise–tolerant learning algorithms involve using machine-learning methods to detect observations that are probably mismeasured (42). These methods are considerably different from traditional epidemiologic approaches to handling measurement error (5). In a comprehensive survey on classification in the presence of label noise in machine learning, Frénay and Verleysen (16) found that most studies artificially induced measurement error. They called for more research into development of realistic models of how machine-learning algorithms may be affected by measurement error using real-world data sets in which measurement error is clearly identified (16).

Our purpose in this study was to examine the impact of misclassification on the performance of random-forest models and variable importance using quantitative bias analysis, which is a traditional epidemiologic approach to quantifying the influence of systematic error on an epidemiologic study’s results (5). We focused on random forests because it is one of the most popular machine-learning methods (43–45). We used real data and simulations to assess the direction and magnitude of bias from misclassification of predictor and outcome variables in random forests. We first used data from the National Comorbidity Survey Replication (NCS-R) to predict suicide attempts. Suicide attempts are a leading cause of injury (46) and may lead to increased risk of disability (47), psychological distress (48), and suicide death (49). Accurate prediction of suicide attempts may facilitate preventive interventions among high-risk persons. We used mental and substance use disorders as predictors because they are strong risk factors for suicide attempts (50–52). We assessed the impact of misclassification of mental and substance use disorders on random-forest model performance and variable importance using NCS-R validation data for these disorders and quantitative bias analysis. We then conducted simulations to assess the impact of different misclassification mechanisms and verify that quantitative bias analysis was recovering the truth in misclassified versions of the data sets.

METHODS

Example 1: predicting suicide attempts in the NCS-R before and after adjusting for misclassification

Study design and participants.

The NCS-R was a study of a nationally representative sample of English-speaking US residents aged ≥18 years (53, 54). The NCS-R included 9,282 respondents (55). Kessler et al. (55) have provided a detailed description of the methodology and sampling procedures of the NCS-R. For this study, we excluded persons with missing data on suicide attempts (n = 19) and persons missing the date of their last suicide attempt if they had reported making a suicide attempt (n = 58). Our analytical sample included 9,205 participants.

Measures.

NCS-R participants were administered the Composite International Diagnostic Interview (56). We examined the following mental and substance use disorders as predictors: panic disorder, agoraphobia, specific phobia, social phobia, posttraumatic stress disorder, major depressive disorder, alcohol abuse, alcohol dependence, drug abuse, and drug dependence. We examined these diagnoses because they were validated in a study (54, 57) conducted in a subset of NCS-R participants (n = 325) who were reinterviewed using the Structured Clinical Interview for DSM-IV Axis I Disorders (58), which is based on the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, and is considered the diagnostic gold standard in psychiatric research. Participants were reinterviewed by trained clinical interviewers who were blinded to the original diagnoses. Web Table 1 (available online at https://doi.org/10.1093/aje/kwab010) shows the sensitivity and specificity for each diagnosis. The Suicidality Module of the Composite International Diagnostic Interview was used to assess the lifetime occurrence of suicide attempts (56). Temporal precedence of mental and substance use disorders was examined using age-of-onset reports.

Random forests.

We implemented a random-forest classifier to predict suicide attempts using the original NCS-R data ignoring measurement error. Each random forest was built with 500 trees, and 3 variables were selected as split candidates at each node (i.e., the square root of the total number of predictors). Each individual tree was built using all suicide attempt observations and an equal number of randomly selected nonsuicide attempt observations to address class imbalance (4.5% of the sample made a suicide attempt; Table 1) and to increase the overall classification accuracy and positive predictive value (PPV) (59, 60). We evaluated model performance using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, PPV, and negative predictive value (NPV). We used a cutoff of greater than 0.5 for the predicted probabilities to determine who was predicted to have the outcome. We assessed variable importance using mean decrease in accuracy, which represents the reduction in accuracy that would occur if a variable were permuted (61). The larger the mean decrease in the accuracy of a variable, the more important it is for prediction accuracy. We used the R package randomForest (R Foundation for Statistical Computing, Vienna, Austria) (62).
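As a concrete illustration, the R sketch below shows one way to fit a random forest of this form with the randomForest package. The data frame (ncsr), the outcome column (suicide_attempt), and the use of stratified sampling through the strata and sampsize arguments are our assumptions for illustration, not the authors' code.

```r
# A minimal sketch, assuming a data frame `ncsr` whose columns are a 0/1
# suicide_attempt indicator and the 10 binary disorder predictors.
library(randomForest)

set.seed(1)
ncsr$suicide_attempt <- factor(ncsr$suicide_attempt)
n_cases <- sum(ncsr$suicide_attempt == "1")

rf <- randomForest(
  suicide_attempt ~ .,
  data       = ncsr,
  ntree      = 500,                 # 500 trees
  mtry       = 3,                   # ~ square root of the 10 predictors per split
  strata     = ncsr$suicide_attempt,
  sampsize   = c(n_cases, n_cases), # case-sized samples from both classes per tree
  importance = TRUE                 # enables permutation importance
)

p_hat <- predict(rf, type = "prob")[, "1"]   # out-of-bag predicted probabilities
predicted_attempt <- as.integer(p_hat > 0.5) # cutoff of 0.5

importance(rf, type = 1)            # mean decrease in accuracy for each predictor
```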

Table 1.

Characteristics (%) of Suicide Attempt Cases and Persons Who Did Not Make a Suicide Attempt in the National Comorbidity Survey Replication, 2001–2003

Mental Disorder                   Suicide Attempt (n = 411), %    No Attempt (n = 8,794), %
Panic disorder                    12                              4
Agoraphobia                       7                               2
Specific phobia                   30                              12
Social phobia                     33                              11
Posttraumatic stress disorder     21                              5
Major depressive disorder         35                              16
Alcohol abuse                     27                              10
Alcohol dependence                15                              4
Drug abuse                        21                              6
Drug dependence                   11                              2

Bias analysis.

We conducted probabilistic bias analysis for nondifferential misclassification of predictors using bias parameters derived from the NCS-R validation study of mental and substance use disorders (54). We use the term “nondifferential misclassification of predictors” to refer to systematic error in classification of predictors that is independent of other variables. Our bias analysis involved 1) modeling the bias parameters, 2) conducting record-level correction for exposure misclassification, and 3) estimating model performance and variable importance using the bias-adjusted data. We then compared the results of the bias-adjusted analysis with those of the original data ignoring measurement error.

We used the observed data, sensitivity, and specificity (Web Table 1) to calculate the PPV and NPV for each mental and substance use disorder. The PPV represents the probability that a person classified as having a disorder was correctly classified. The NPV is the probability that a person classified as not having a disorder was correctly classified. Web Figure 1 displays the formulas for these calculations. First, we used the observed data to calculate the number of exposed (e.g., had a disorder) and unexposed (e.g., did not have a disorder) individuals at each level of the outcome. Second, we used the observed data and the sensitivity and specificity values drawn from beta distributions to calculate the expected values for the 2 × 2 table that would exist if there had been no misclassification. We used beta distributions to model the probability density functions and incorporate binomial error in the sensitivity and specificity to reflect uncertainty regarding their true values (5, 63). Third, we used the expected 2 × 2 table values, sensitivity, and specificity to calculate the expected number of true-positive, true-negative, false-positive, and false-negative exposures at each level of the outcome, which were used to calculate the PPV and NPV for each disorder (Web Figure 1).
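The calculation for a single disorder within one level of the outcome can be sketched in R as follows. The observed counts and the beta-distribution parameters below are hypothetical placeholders; in the full analysis, the predictive values are computed separately among cases and noncases.

```r
# A minimal sketch of steps 1-3 (not the authors' code), for one disorder and
# one level of the outcome. a_obs and b_obs are the observed numbers classified
# as having and not having the disorder; the beta parameters are placeholders.
set.seed(1)
a_obs <- 120
b_obs <- 291
N     <- a_obs + b_obs

se <- rbeta(1, shape1 = 60, shape2 = 15)  # sensitivity drawn from its beta distribution
sp <- rbeta(1, shape1 = 95, shape2 = 5)   # specificity drawn from its beta distribution

# Expected counts had there been no misclassification
A_true <- (a_obs - (1 - sp) * N) / (se + sp - 1)  # expected truly exposed
B_true <- N - A_true                              # expected truly unexposed

# Expected cross-classification of observed vs. true exposure status
TP <- se * A_true         # classified exposed and truly exposed
FP <- (1 - sp) * B_true   # classified exposed but truly unexposed
TN <- sp * B_true         # classified unexposed and truly unexposed
FN <- (1 - se) * A_true   # classified unexposed but truly exposed

ppv <- TP / (TP + FP)     # P(truly exposed | classified exposed)
npv <- TN / (TN + FN)     # P(truly unexposed | classified unexposed)
```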

We conducted a record-level correction for exposure misclassification by applying the predictive values to each record in the data to simulate whether an individual was correctly classified (5). To model whether an individual observation was correctly or incorrectly classified, we conducted a Bernoulli trial with a probability equal to the corresponding PPV for persons classified as having a disorder and the corresponding 1-NPV for those classified as not having a disorder. This reclassification process for each record in the data set yields a single new, bias-adjusted data set. We implemented random forests in the bias-adjusted data set and evaluated model performance and variable importance.
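A sketch of this reclassification step, again with placeholder names, and with ppv and npv taken from the calculation above for the record's outcome level:

```r
# Simulate each record's "true" disorder status from its observed classification.
reclassify <- function(classified, ppv, npv) {
  p_truly_exposed <- ifelse(classified == 1, ppv, 1 - npv)
  rbinom(length(classified), size = 1, prob = p_truly_exposed)  # Bernoulli trial per record
}

# Example for one disorder (hypothetical column name)
ncsr_adj <- ncsr
ncsr_adj$panic_disorder <- reclassify(ncsr$panic_disorder, ppv, npv)
```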

The entire process (i.e., sampling from the beta distributions for sensitivity and specificity, calculating PPV and NPV, reclassifying individual records in the data set to generate a new data set, conducting random-forest analyses, and assessing model performance and variable importance) was repeated 10,000 times to create a distribution of model performance metrics and variable importance values. We then calculated the median values and 95% simulation intervals (2.5th and 97.5th percentiles of the 10,000 iterations, describing where 95% of the simulated estimates lie) for the bias-adjusted model performance and variable importance measures. We compared the results from the original analysis, which ignored measurement error, with those of the bias analysis.
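The outer simulation loop and its summary might look like the following, where run_once() is a hypothetical wrapper around one full iteration (draw sensitivity and specificity, compute the predictive values, reclassify the records, refit the random forest, and return the performance metrics and importance values).

```r
# Repeat the full bias-analysis iteration 10,000 times and summarize.
n_iter  <- 10000
results <- replicate(n_iter, run_once())  # matrix: one row per metric, one column per iteration

# Median and 95% simulation interval (2.5th and 97.5th percentiles) for each metric
summary_stats <- apply(results, 1, quantile, probs = c(0.025, 0.5, 0.975))
```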

Example 2: simulation studies for assessing the impact of different types of misclassification mechanisms on model performance and variable importance

Simulating the original data sets.

We simulated 3 data sets with 10,000 participants in each: 1) a data set in which all of the predictors were associated with the outcome, 2) a data set in which only 1 of the predictors was associated with the outcome, and 3) a data set in which none of the predictors were associated with the outcome.

The first data set had 4 binary predictors (X1, X2, X3, and X4) and 1 binary outcome (Y). Web Figure 2A shows the directed acyclic graph for this data set, as well as the risk differences for the associations between variables. The distribution of the outcome depended on the values of X1, X2, X3, and X4. We refer to this data set as “informative data set 1.”

For the second and third data sets, we simulated 5 binary predictors (X1, X2, X3, X4, and X5) and 1 binary outcome (Y). In the second data set, we assigned a 20% disease prevalence, 20% prevalence of X1 among cases, 10% prevalence of X1 among noncases, and 10% prevalence of X2, X3, X4, and X5 among both cases and noncases. Thus, the distribution of the outcome depended only on the value of X1, not on the values of X2, X3, X4, and X5. We refer to this data set as “informative data set 2,” and the directed acyclic graph for it is shown in Web Figure 2B.

In the third data set, we assigned a 20% outcome prevalence and a 10% prevalence of all predictors among all records. The distribution of the outcome did not depend on the value of any of the predictors. We refer to this data set as the “uninformative data set” (i.e., no associations present in the data), and the directed acyclic graph for it is shown in Web Figure 2C.
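The sketch below shows how the second and third data sets could be generated from the prevalences given above (informative data set 1 would additionally require the risk differences shown in Web Figure 2A, so it is omitted here); variable and object names are ours, not the authors'.

```r
set.seed(1)
n <- 10000

# Informative data set 2: 20% outcome prevalence; X1 prevalence of 20% among
# cases and 10% among noncases; X2-X5 prevalence of 10% regardless of outcome
y  <- rbinom(n, 1, 0.20)
x1 <- rbinom(n, 1, ifelse(y == 1, 0.20, 0.10))
x2 <- rbinom(n, 1, 0.10)
x3 <- rbinom(n, 1, 0.10)
x4 <- rbinom(n, 1, 0.10)
x5 <- rbinom(n, 1, 0.10)
informative2 <- data.frame(y, x1, x2, x3, x4, x5)

# Uninformative data set: 20% outcome prevalence; all predictors have 10%
# prevalence and are independent of the outcome
uninformative <- data.frame(
  y  = rbinom(n, 1, 0.20),
  x1 = rbinom(n, 1, 0.10),
  x2 = rbinom(n, 1, 0.10),
  x3 = rbinom(n, 1, 0.10),
  x4 = rbinom(n, 1, 0.10),
  x5 = rbinom(n, 1, 0.10)
)
```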

Creating misclassified data sets.

We created 16 copies of each simulated data set and introduced 4 misclassification mechanisms: 1) nondifferential misclassification of a predictor, 2) differential misclassification of a predictor, 3) nondifferential outcome misclassification, and 4) differential outcome misclassification. For each mechanism, we examined 4 combinations of sensitivity and specificity (Web Table 2). We created a misclassified version of a predictor/outcome by simulating a Bernoulli trial with probability equal to the sensitivity of classification for persons with the predictor/outcome and 1 − specificity for those without the predictor/outcome. When examining misclassification of a predictor(s), we misclassified all of the predictors in informative data set 1, predictor X1 in informative data set 2, and predictor X5 in the uninformative data set. For differential outcome misclassification, we misclassified the outcome with respect to X1 in informative data sets 1 and 2 and X5 in the uninformative data set.
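The following sketch shows the Bernoulli-trial step for the nondifferential case, continuing the simulated data from the sketch above; the sensitivity and specificity values (0.45 and 0.99) correspond to one of the combinations described in the figure legends.

```r
# Create a misclassified copy of a binary variable given sensitivity (se) and
# specificity (sp). For the differential scenarios, se and sp would be chosen
# per record according to the value of the other variable.
misclassify <- function(truth, se, sp) {
  rbinom(length(truth), size = 1, prob = ifelse(truth == 1, se, 1 - sp))
}

# Nondifferential misclassification of X1 in informative data set 2
informative2_mis    <- informative2
informative2_mis$x1 <- misclassify(informative2$x1, se = 0.45, sp = 0.99)
```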

Bias analysis.

To create the bias-adjusted data sets, we used the observed data, sensitivity, and specificity values to calculate the PPV and NPV for the predictors/outcome. Web Figures 1 and 3 display the formulas for these calculations. We conducted a record-level correction for each misclassification mechanism by applying the predictive values to each record to simulate whether an individual was correctly classified (5). We implemented random forests using the bias-adjusted data set. The entire process was repeated 10,000 times. We then calculated the median values and 95% simulation intervals of the bias-adjusted model performance metrics for informative data set 1 and the bias-adjusted variable importance values for informative data set 2 and the uninformative data set.

We carried out 10,000 iterations of the random-forest analysis in 1) the 3 original data sets that were correctly classified, 2) the 16 misclassified versions of each data set, and 3) the bias-adjusted data sets. To assess the impact of misclassification on model performance and variable importance, we compared the random-forest results of the misclassified data sets with those of the original data sets without misclassification. To assess whether we could recover the correct random-forest results in the misclassified data sets using quantitative bias analysis, we compared the random-forest results of the bias-adjusted data sets with the results of the original data sets that were correctly classified.

Analyses were carried out in R, version 3.4.1. Software code and documentation are available on GitHub (64).

RESULTS

Example 1: predicting suicide attempts in the NCS-R before and after adjusting for misclassification

Figure 1 displays the AUC, accuracy, sensitivity, specificity, PPV, and NPV of the random forests before and after probabilistic bias analysis adjusting for nondifferential predictor misclassification. The sensitivity of the random forests in detecting suicide attempts was 69% in the original data and increased to 90% (95% simulation interval (SI): 85, 97) after bias adjustment. Bias adjustment also led to increases in specificity (original: 70%; adjusted: 93% (95% SI: 88, 98)), AUC (original: 72%; adjusted: 97% (95% SI: 94, 100)), PPV (original: 10%; adjusted: 37% (95% SI: 26, 69)), NPV (original: 98%; adjusted: 100% (95% SI: 99, 100)), and accuracy (original: 70%; adjusted: 93% (95% SI: 88, 98)). The low bias-adjusted PPV is a function of the low prevalence of suicide attempts (65).

Figure 1.


Performance of random forests in predicting suicide attempts before and after adjustment for nondifferential predictor misclassification in the National Comorbidity Survey Replication, 2001–2003. The solid line (top line) represents the negative predictive value. The dashed line (second from top) represents the area under the receiver operating characteristic curve. The long-dashed line (third from top) represents specificity; it overlaps with the 2-dashed line representing accuracy. The dotted line (second from bottom) represents sensitivity. The dotted-dashed line (bottom line) represents the positive predictive value.

Figure 2 shows the variable importance values of each disorder in the original data set. Figure 3 shows the variable importance values of each disorder after conducting probabilistic bias analysis adjusting for nondifferential predictor misclassification. Before bias adjustment, the disorders with the largest mean decrease in accuracy values (i.e., more important for prediction accuracy) were posttraumatic stress disorder, major depressive disorder, social phobia, and drug abuse. After adjustment for misclassification of disorders, the most important predictors were specific phobia and social phobia. Specific phobia and social phobia were identified as the most important predictors in 62% and 28% of the 10,000 simulations, respectively. Web Table 3 shows the median values and 95% simulation intervals for the mean decrease in the accuracy of each variable.

Figure 2.


Variable importance of random forests in predicting suicide attempts in the original National Comorbidity Survey Replication data set, 2001–2003. The mean decrease in accuracy represents the reduction in the overall accuracy of the random-forest model when a predictor is permuted. “Depression” refers to major depressive disorder. PTSD, posttraumatic stress disorder.

Figure 3.


Variable importance of random forests in predicting suicide attempts in the bias-adjusted data set adjusting for nondifferential misclassification of predictors, National Comorbidity Survey Replication, 2001–2003. The mean decrease in accuracy represents the reduction in the overall accuracy of the random-forest model when a predictor is permuted. “Depression” refers to major depressive disorder. PTSD, posttraumatic stress disorder.

Example 2: simulation studies for assessing the impact of different types of misclassification mechanisms on variable importance and model performance

We examined whether misclassification of all predictors or the outcome in a simulated data set in which all of the predictors were associated with the outcome would affect model performance. Nondifferential and differential predictor misclassification led to reductions in model accuracy, AUC, sensitivity, PPV, and NPV, whereas specificity increased slightly (Figure 4 and Web Figures 4 and 5). Nondifferential and differential outcome misclassification had little to no impact on most measures, except for PPV, which decreased (Web Figures 6 and 7).

Figure 4.


Performance of random forests in simulated informative data set 1 after inducing misclassification. Each point indicates the median performance of the model across 10,000 simulations. The bars indicate the 95% simulation interval (2.5th and 97.5th percentiles of the 10,000 iterations describing where 95% of the simulated estimates lie). A) Model performance in the original simulated informative data set 1 without misclassification; B) model performance under the scenario of nondifferential misclassification of all predictors in which the sensitivity of each predictor was 0.45 and specificity was 0.99; C) model performance under the scenario of differential misclassification of all predictors in which the sensitivity of each predictor among persons with the outcome was 0.50, specificity of each predictor among those with the outcome was 0.90, the sensitivity of each predictor among those without the outcome was 0.45, and specificity of each predictor among those without the outcome was 0.95; D) model performance under the scenario of nondifferential outcome misclassification in which the sensitivity of the outcome was 0.85 and specificity was 0.99; E) model performance under the scenario of differential outcome misclassification in which the sensitivity of the outcome among persons with predictor X1 was 0.9, specificity among those with predictor X1 was 0.95, the sensitivity among those without predictor X1 was 0.85, and specificity among those without predictor X1 was 0.99. AUC, area under the receiver operating characteristic curve; NPV, negative predictive value; PPV, positive predictive value.

We also examined whether misclassification of a predictor or outcome in a simulated data set in which only 1 of the predictors (X1) was associated with the outcome would affect variable importance measures. Nondifferential misclassification of predictor X1 led to reductions in the mean decrease in accuracy for X1 (Figure 5 and Web Figure 8). In other words, predictor X1 appeared to be less important for accurate prediction of the outcome than it truly was. When predictor X1 was differentially misclassified by outcome status such that the specificity of X1 classification was higher among cases than among noncases, X1 appeared not to contribute to prediction accuracy (i.e., the mean decrease in accuracy was approximately 0) (Web Figure 9). There was no impact of differential exposure misclassification when the specificity was higher among noncases than among cases. There was also no impact when we induced nondifferential outcome misclassification (Web Figure 10). Differential outcome misclassification by predictor X1 status led to minor changes in the variable importance value for X1 (Web Figure 11).

Figure 5.


Variable importance of random forests in simulated informative data set 2 after inducing misclassification. Each point indicates the median importance value of the random-forest variable across 10,000 simulations. The bars indicate the 95% simulation interval (2.5th and 97.5th percentiles of the 10,000 iterations describing where 95% of the simulated estimates lie). A) Variable importance in the original simulated informative data set 2 without misclassification; B) variable importance under the scenario of nondifferential misclassification of X1 in which the sensitivity of predictor X1 was 0.45 and specificity was 0.99; C) variable importance under the scenario of differential misclassification of X1 in which the sensitivity of predictor X1 among persons with the outcome was 0.50, specificity of predictor X1 among those with the outcome was 0.95, the sensitivity of predictor X1 among those without the outcome was 0.45, and specificity of predictor X1 among those without the outcome was 0.90; D) variable importance under the scenario of nondifferential outcome misclassification in which the sensitivity of the outcome was 0.90 and specificity was 0.99; E) variable importance under the scenario of differential outcome misclassification in which the sensitivity of the outcome among persons with predictor X1 was 0.9, specificity among those with predictor X1 was 0.99, the sensitivity among those without predictor X1 was 0.85, and specificity among those without predictor X1 was 0.95.

We examined whether misclassification of a predictor or outcome in a simulated data set in which none of the predictors were associated with the outcome would affect variable importance. Nondifferential misclassification of a predictor (X5) and nondifferential outcome misclassification did not affect variable importance (Figure 6 and Web Figures 12 and 13). However, differential misclassification of predictor X5 led to spuriously increased importance of this predictor (Web Figure 14). In other words, predictor X5 appeared to be important for accurate prediction of the outcome in the misclassified data set even though it was truly not associated with the outcome in the original correctly classified data. Similarly, differential outcome misclassification by predictor X5 also led to increased variable importance for X5 (Web Figure 15).

Figure 6.


Variable importance of random forests in the simulated uninformative data set after inducing misclassification. Each point indicates the median importance value of the random-forest variable across 10,000 simulations. The bars indicate the 95% simulation interval (2.5th and 97.5th percentiles of the 10,000 iterations describing where 95% of the simulated estimates lie). A) Variable importance in the original simulated uninformative data set without misclassification; B) variable importance under the scenario of nondifferential misclassification of X5 in which the sensitivity of predictor X5 was 0.45 and specificity was 0.99; C) variable importance under the scenario of differential misclassification of X5 in which the sensitivity of predictor X5 among persons with the outcome was 0.45, specificity of predictor X5 among those with the outcome was 0.99, the sensitivity of predictor X5 among those without the outcome was 0.40, and specificity of predictor X5 among those without the outcome was 0.95; D) variable importance under the scenario of nondifferential outcome misclassification in which the sensitivity of the outcome was 0.85 and specificity was 0.99; E) variable importance under the scenario of differential outcome misclassification in which the sensitivity of the outcome among persons with predictor X5 was 0.9, specificity among those with predictor X5 was 0.99, the sensitivity among those without predictor X5 was 0.85, and specificity among those without predictor X5 was 0.95.

Quantitative bias analysis recovered the true model performance and variable importance measures in all scenarios in which misclassification had an impact on results (Figure 7 and Web Figures 16–24).

Figure 7.


Performance of random forests in simulated informative data set 1 after conducting bias adjustment for misclassification. Each point indicates the median performance of the model across 10,000 simulations. The bars indicate the 95% simulation interval (2.5th and 97.5th percentiles of the 10,000 iterations describing where 95% of the simulated estimates lie). The misclassification probabilities for each misclassification scenario are the same as those described in the legend of Figure 4. A) Model performance in the original simulated informative data set 1 without misclassification; B) model performance after conducting bias adjustment for nondifferential misclassification of all predictors in which the sensitivity of each predictor was 0.45 and specificity was 0.99; C) model performance after conducting bias adjustment for differential misclassification of all predictors in which the sensitivity of each predictor among persons with the outcome was 0.50, specificity of each predictor among those with the outcome was 0.90, the sensitivity of each predictor among those without the outcome was 0.45, and specificity of each predictor among those without the outcome was 0.95; D) model performance after conducting bias adjustment for nondifferential outcome misclassification in which the sensitivity of the outcome was 0.85 and specificity was 0.99; E) model performance after conducting bias adjustment for differential outcome misclassification in which the sensitivity of the outcome among persons with predictor X1 was 0.9, specificity among those with predictor X1 was 0.95, the sensitivity among those without predictor X1 was 0.85, and specificity among those without predictor X1 was 0.99. AUC, area under the receiver operating characteristic curve; NPV, negative predictive value; PPV, positive predictive value.

DISCUSSION

Many machine-learning studies include predictor and outcome variables that are vulnerable to measurement error, yet measurement error is rarely addressed quantitatively (4). Using data from an epidemiologic study and extensive simulations, we found that misclassification can decrease random-forest model performance and distort variable importance. The magnitude and direction of the impact of measurement error are not easily predictable; thus, bias analyses are necessary for understanding how measurement error may affect any particular machine-learning analysis. Using quantitative bias analysis (5), we were able to obtain bias-adjusted model performance and variable importance measures.

If the goal of a machine-learning study is to predict a patient’s future outcome (66), improve treatment decision-making (67), and target interventions to persons in greatest need (68), then measurement error should be assessed using quantitative bias analysis (5). In our applied example using NCS-R data to predict suicide attempts, we found that adjusting for misclassification of predictors led to substantial increases in all model performance measures. This is crucial when there is a high cost associated with failing to recognize persons who are likely to make a suicide attempt (69, 70). In addition, measurement error can lead to biased variable importance. Findings from our simulations showed that measurement error can lead to spuriously increased variable importance, which may lead to investment of resources to investigate false-positive leads, or it could obscure the true importance of a predictor, which may lead to lack of investigation of an important, understudied risk factor.

Limitations of this study include the fact that we examined a small number of predictors and the possibility that the NCS-R validation study’s results were biased, since the diagnostic interview may not accurately capture all mental and substance use disorders. To address this limitation, we assigned beta distributions to the bias parameters to account for uncertainty. In future studies, researchers could examine the impact of misclassification of a larger number of predictors and assess the impact of measurement error of continuous variables.

With the growing popularity of machine learning and the availability of big data, it is crucial to assess and mitigate measurement error. A key challenge in using measurement error correction methods is obtaining the necessary classification probabilities. Credible values and distributions assigned to bias parameters should reflect data from internal and external validation studies and expert judgment (71). The bias analysis we used (5) and other methods such as multiple imputation (72, 73), regression calibration (74, 75), and propensity score calibration require validation data. If validation data are not available, modified maximum likelihood (76, 77) and Bayesian (78) methods may be preferable. Additional work is needed to develop tools with which to automate quantitative bias analysis by incorporating bias parameters in the development of prediction models so that future data measured with error could automatically be reclassified and used to construct a bias-adjusted prediction model. The development of a data-validation system that continuously monitors data quality to detect errors early and corrects them instantaneously may lead to time savings and increases in model quality (79).

Supplementary Material

Web_Material_kwab010

ACKNOWLEDGMENTS

Author affiliations: Department of Epidemiology, School of Public Health, Boston University, Boston, Massachusetts, United States (Tammy Jiang, Jaimie L. Gradus, Matthew P. Fox); Department of Psychiatry, School of Medicine, Boston University, Boston, Massachusetts, United States (Jaimie L. Gradus); Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta, Georgia, United States (Timothy L. Lash); and Department of Global Health, School of Public Health, Boston University, Boston, Massachusetts, United States (Matthew P. Fox).

This work was supported by National Institute of Mental Health grants 1R01 MH109507 and 1R01 MH110453-01A1 to J.L.G. and US National Library of Medicine grant R01LM013049 to T.L.L.

This study used National Comorbidity Survey Replication data from the Inter-University Consortium for Political and Social Research (Ann Arbor, Michigan). We thank the researchers who conducted the survey for making the data publicly available.

Conflict of interest: none declared.

REFERENCES

1. Mitchell TM. Machine Learning. 1st ed. New York, NY: McGraw-Hill; 1997.
2. Gianfrancesco MA, Tamang S, Yazdany J, et al. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern Med. 2018;178(11):1544–1547.
3. Simon GE. Big data from health records in mental health care: hardly clairvoyant but already useful. JAMA Psychiatry. 2019;76(4):349–350.
4. Whittle R, Peat G, Belcher J, et al. Measurement error and timing of predictor values for multivariable risk prediction models are poorly reported. J Clin Epidemiol. 2018;102:38–49.
5. Lash TL, Fox MP, Fink AK. Applying Quantitative Bias Analysis to Epidemiologic Data. New York, NY: Springer-Verlag; 2009.
6. Flegal KM, Keyl PM, Nieto FJ. Differential misclassification arising from nondifferential errors in exposure measurement. Am J Epidemiol. 1991;134(10):1233–1244.
7. Kristensen P. Bias from nondifferential but dependent misclassification of exposure and outcome. Epidemiology. 1992;3(3):210–215.
8. Brenner H, Savitz DA, Gefeller O. The effects of joint misclassification of exposure and disease on epidemiologic measures of association. J Clin Epidemiol. 1993;46(10):1195–1202.
9. Wacholder S, Dosemeci M, Lubin J. Blind assignment of exposure does not always prevent differential misclassification. Am J Epidemiol. 1991;134(4):433–437.
10. Drews C, Greenland S. The impact of differential recall on the results of case-control studies. Int J Epidemiol. 1990;19(4):1107–1112.
11. Obermeyer Z, Emanuel EJ. Predicting the future—big data, machine learning, and clinical medicine. N Engl J Med. 2016;375(13):1216–1219.
12. Nguyen T-T, Huang J, Wu Q, et al. Genome-wide association data classification and SNPs selection using two-stage quality-based random forests. BMC Genomics. 2015;16(suppl 2):Article S5.
13. Diaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7:Article 3.
14. Briones N, Dinu V. Data mining of high density genomic variant data for prediction of Alzheimer's disease risk. BMC Med Genet. 2012;13:Article 7.
15. Yan Z, Li J, Xiong Y, et al. Identification of candidate colon cancer biomarkers by applying a random forest approach on microarray data. Oncol Rep. 2012;28(3):1036–1042.
16. Frénay B, Verleysen M. Classification in the presence of label noise: a survey. IEEE Trans Neural Netw Learn Syst. 2014;25(5):845–869.
17. Lachenbruch PA. Discriminant analysis when the initial samples are misclassified. Technometrics. 1966;8(4):657–662.
18. Okamoto S, Nobuhiro Y. An average-case analysis of the k-nearest neighbor classifier for noisy domains. In: Pollack ME, ed. Proceedings of the 15th International Joint Conference on Artificial Intelligence. Vol. 1. Nagoya, Japan: Morgan Kaufmann Publishers; 1997:238–243.
19. Wilson DR, Martinez TR. Reduction techniques for instance-based learning algorithms. Mach Learn. 2000;38(3):257–286.
20. Quinlan JR. Induction of decision trees. Mach Learn. 1986;1(1):81–106.
21. Nettleton DF, Orriols-Puig A, Fornells A. A study of the effect of different types of noise on the precision of supervised learning techniques. Artif Intell Rev. 2010;33(4):275–306.
22. Dietterich TG. Ensemble methods in machine learning. In: Kittler J, Roli F, eds. Multiple Classifier Systems. Cagliari, Italy: Springer Berlin Heidelberg; 2000:1–15.
23. Lachenbruch PA. Discriminant analysis when the initial samples are misclassified II: non-random misclassification models. Technometrics. 1974;16(3):419–424.
24. Brodley CE, Friedl MA. Identifying mislabeled training data. J Artif Intell Res. 1999;11(1):131–167.
25. Gerlach R, Stamey J. Bayesian model selection for logistic regression with misclassified outcomes. Stat Model. 2007;7(3):255–273.
26. Frénay B, Doquire G, Verleysen M. Estimating mutual information for feature selection in the presence of label noise. Comput Stat Data Anal. 2014;71:832–848.
27. Shanab AA, Khoshgoftaar TM, Wald R. Robustness of threshold-based feature rankers with data sampling on noisy and imbalanced data [abstract]. Presented at the Twenty-Fifth International FLAIRS Conference, Marco Island, Florida, May 22–25, 2012.
28. Frénay B, Kabán A. A comprehensive introduction to label noise [abstract]. Presented at the Twenty-Second European Symposium on Artificial Neural Networks, Bruges, Belgium, April 23–25, 2014.
29. Teng C-M. A comparison of noise handling techniques [abstract]. Presented at the Fourteenth International FLAIRS Conference, Key West, Florida, May 21–23, 2001.
30. Golzari S, Doraisamy S, Sulaiman MN, et al. The effect of noise on RWTSAIRS classifier. Eur J Sci Res. 2009;31:632–641.
31. Bouveyron C, Girard S. Robust supervised classification with mixture models: learning from data with uncertain labels. Pattern Recognit. 2009;42(11):2649–2658.
32. Manwani N, Sastry PS. Noise tolerance under risk minimization. IEEE Trans Cybern. 2013;43(3):1146–1151.
33. Yin H, Dong H. The problem of noise in classification: past, current and future work. Presented at the IEEE 3rd International Conference on Communication Software and Networks, Xi'an, China, May 27–29, 2011.
34. Sastry PS, Nagendra GD, Manwani N. A team of continuous-action learning automata for noise-tolerant learning of half-spaces. IEEE Trans Syst Man Cybern Part B Cybern. 2010;40(1):19–28.
35. Sun J, Zhao F, Wang C, et al. Identifying and correcting mislabeled training instances. In: 2007 International Conference on Future Generation Communication and Networking (FGCN 2007). Washington, DC: Institute of Electrical and Electronics Engineers; 2008:244–250.
36. Gamberger D, Lavrač N. Conditions for Occam's razor applicability and noise elimination. In: van Someren M, Widmer G, eds. Machine Learning: ECML-97. Berlin, Germany: Springer-Verlag; 1997:108–123.
37. Gamberger D, Boskovic R, Lavrac N, et al. Experiments with noise filtering in a medical domain. In: Bratko I, Dzeroksi S, eds. Proceedings of the 16th International Conference on Machine Learning. Burlington, MA: Morgan Kaufmann Publishers; 1999:143–151.
38. Jeatrakul P, Wong KW, Fung CC. Data cleaning for classification using misclassification analysis. JACIII. 2010;14(3):297–302.
39. Zhang C, Wu C, Blanzieri E, et al. Methods for labeling error detection in microarrays based on the effect of data perturbation on the regression model. Bioinformatics. 2009;25(20):2708–2714.
40. Matic N, Guyon I, Bottou L, et al. Computer aided cleaning of large databases for character recognition. In: Proceedings. 11th IAPR International Conference on Pattern Recognition. Vol. II. Conference B: Pattern Recognition Methodology and Systems. Washington, DC: Institute of Electrical and Electronics Engineers; 1992:330–333.
41. Guyon I, Matic N, Vapnik V. Discovering informative patterns and data cleaning. In: KDD-94: AAAI-94 Workshop on Knowledge Discovery in Databases. (AAAI technical report WS-94-03). Palo Alto, CA: Association for the Advancement of Artificial Intelligence; 1996:145–156.
42. Lawrence ND, Schölkopf B. Estimating a kernel Fisher discriminant in the presence of label noise. In: Brodley CA, Danyluk AP, eds. ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann Publishers; 2001:306–313.
43. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
44. Rose S. Mortality risk score prediction in an elderly population using machine learning. Am J Epidemiol. 2013;177(5):443–452.
45. Karim ME, Platt RW, BeAMS Study Group. Estimating inverse probability weights using super learner when weight-model specification is unknown in a marginal structural Cox model context. Stat Med. 2017;36(13):2032–2047.
46. World Health Organization. Mental health and substance use. Suicide data. http://www.who.int/mental_health/prevention/suicide/suicideprevent/en/. Accessed August 20, 2019.
47. Biering-Sørensen F, Pedersen W, Müller PG. Spinal cord injury due to suicide attempts. Paraplegia. 1992;30(2):139–144.
48. Stanley IH, Hom MA, Boffa JW, et al. PTSD from a suicide attempt: an empirical investigation among suicide attempt survivors. J Clin Psychol. 2019;75(10):1879–1895.
49. Han B, Kott PS, Hughes A, et al. Estimating the rates of deaths by suicide among adults who attempt suicide in the United States. J Psychiatr Res. 2016;77:125–133.
50. Harris EC, Barraclough B. Suicide as an outcome for mental disorders: a meta-analysis. Br J Psychiatry. 1997;170:205–228.
51. Pokorny AD. Prediction of suicide in psychiatric patients: report of a prospective study. Arch Gen Psychiatry. 1983;40(3):249–257.
52. Gradus JL, Rosellini AJ, Horváth-Puhó E, et al. Prediction of sex-specific suicide risk using machine learning and single-payer health care registry data from Denmark. JAMA Psychiatry. 2020;77(1):25–34.
53. Kessler RC, Berglund P, Chiu WT, et al. The US National Comorbidity Survey Replication (NCS-R): design and field procedures. Int J Methods Psychiatr Res. 2004;13(2):69–92.
54. Kessler RC, Berglund P, Demler O, et al. Lifetime prevalence and age-of-onset distributions of DSM-IV disorders in the National Comorbidity Survey Replication. Arch Gen Psychiatry. 2005;62(6):593–602.
55. Kessler RC, Merikangas KR. The National Comorbidity Survey Replication (NCS-R): background and aims. Int J Methods Psychiatr Res. 2004;13(2):60–68.
56. Kessler RC, Üstün TB. The World Mental Health (WMH) Survey Initiative version of the World Health Organization (WHO) Composite International Diagnostic Interview (CIDI). Int J Methods Psychiatr Res. 2004;13(2):93–121.
57. Kessler RC, Abelson J, Demler O, et al. Clinical calibration of DSM-IV diagnoses in the World Mental Health (WMH) version of the World Health Organization (WHO) Composite International Diagnostic Interview (WMH-CIDI). Int J Methods Psychiatr Res. 2004;13(2):122–139.
58. First M, Spitzer R, Gibbon M, et al. Structured Clinical Interview for DSM-IV Axis I Disorders. New York, NY: Biometrics Research Department, New York State Psychiatric Institute; 1998.
59. Chen C, Liaw A, Breiman L. Using Random Forest to Learn Imbalanced Data. (Technical report 110). Berkeley, CA: Department of Statistics, University of California, Berkeley; 2004.
60. Huang BFF, Boutros PC. The parameter sensitivity of random forests. BMC Bioinformatics. 2016;17(1):Article 331.
61. Strobl C, Malley J, Tutz G. An introduction to recursive partitioning: rationale, application and characteristics of classification and regression trees, bagging and random forests. Psychol Methods. 2009;14(4):323–348.
62. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2(3):18–22.
63. Fox MP, Lash TL, Greenland S. A method to automate probabilistic sensitivity analyses of misclassified binary variables. Int J Epidemiol. 2005;34(6):1370–1376.
64. Jiang T. Quantitative-Bias-Analysis. https://github.com/jiangtammy/Quantitative-Bias-Analysis. Published November 30, 2020. Accessed March 30, 2021.
65. Wong HB, Lim GH. Measures of diagnostic accuracy: sensitivity, specificity, PPV and NPV. Proc Singapore Healthc. 2011;20(4):316–318.
66. Kourou K, Exarchos TP, Exarchos KP, et al. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J. 2014;13:8–17.
67. Steyerberg E. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, NY: Springer-Verlag New York; 2009.
68. Krumholz HM. Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system. Health Aff (Millwood). 2014;33(7):1163–1170.
69. Shepard DS, Gurewich D, Lwin AK, et al. Suicide and suicidal attempts in the United States: costs and policy implications. Suicide Life Threat Behav. 2016;46(3):352–362.
70. Kinchin I, Doran CM. The economic cost of suicide and non-fatal suicide behavior in the Australian workforce and the potential impact of a workplace suicide prevention strategy. Int J Environ Res Public Health. 2017;14(4):Article 347.
71. Lash TL, Fox MP, MacLehose RF, et al. Good practices for quantitative bias analysis. Int J Epidemiol. 2014;43(6):1969–1985.
72. Cole SR, Chu H, Greenland S. Multiple-imputation for measurement-error correction. Int J Epidemiol. 2006;35(4):1074–1081.
73. Edwards JK, Cole SR, Troester MA, et al. Accounting for misclassified outcomes in binary regression models using multiple imputation with internal validation data. Am J Epidemiol. 2013;177(9):904–912.
74. Spiegelman D, McDermott A, Rosner B. Regression calibration method for correcting measurement-error bias in nutritional epidemiology. Am J Clin Nutr. 1997;65(4 suppl):1179S–1186S.
75. Bang H, Chiu Y-L, Kaufman JS, et al. Bias correction methods for misclassified covariates in the Cox model: comparison of five correction methods by simulation and data analysis. J Stat Theory Pract. 2013;7(2):381–400.
76. Magder LS, Hughes JP. Logistic regression when the outcome is measured with uncertainty. Am J Epidemiol. 1997;146(2):195–203.
77. Lyles RH, Tang L, Superak HM, et al. Validation data-based adjustments for outcome misclassification in logistic regression: an illustration. Epidemiology. 2011;22(4):589–597.
78. Hubbard RA, Huang J, Harton J, et al. A Bayesian latent class approach for EHR-based phenotyping. Stat Med. 2019;38(1):74–87.
79. Polyzotis N, Zinkevich M, Roy S, et al. Data validation for machine learning. In: Talwalkar A, Smith V, Zaharia M, eds. Proceedings of Machine Learning and Systems 1 (MLSys 2019). Indio, CA: Systems and Machine Learning Foundation; 2019:334–347.
