Author manuscript; available in PMC: 2022 Oct 1.
Published in final edited form as: Biom J. 2021 May 24;63(7):1375–1388. doi: 10.1002/bimj.202000199

Clinical risk prediction models and informative cluster size: assessing the performance of a suicide risk prediction algorithm

R Yates Coley 1,2,*, Rod L Walker 1, Maricela Cruz 1, Gregory E Simon 1, Susan M Shortreed 1,2
PMCID: PMC9134927  NIHMSID: NIHMS1801511  PMID: 34031916

Abstract

Clinical data are clustered within people, complicating prediction modeling. Cluster size is often informative; people receiving more care are less healthy and at higher risk of poor outcomes. We compared four sampling frameworks for estimating prediction models for suicide attempt within 90 days following 1,518,968 outpatient mental health visits by 207,915 people: (i) visit-level training/test split and cross-validation, observed cluster analysis for prediction model estimation; (ii) visit-level training/test split, person-level cross-validation, observed cluster analysis for model estimation; (iii) person-level training/test split and cross-validation, observed cluster analysis for model estimation; (iv) person-level training/test split and cross-validation, model estimation using within cluster resampling of one visit per person. We used two prediction methods: logistic regression with LASSO and random forest. Prediction models’ true performance was evaluated using a prospective validation set of 4,286,495 visits. Random forest models using visit-level training/test splits overestimated discrimination (AUC=0.91-0.95 in testing vs. 0.84-0.85 in validation) and classification accuracy (sensitivity at 99th percentile=0.48 in testing vs. 0.19 in validation for visit-level cross-validation and 0.23 vs. 0.17, respectively, for person-level cross-validation). Logistic regression using visit-level splitting was less optimistic: AUC=0.86-0.87 in testing vs. 0.85 in validation, sensitivity=0.18-0.19 in testing vs. 0.18 in validation. Using person-level train/test splits for both methods accurately estimated prospective discrimination and classification: AUC=0.85-0.86 in testing vs. 0.85 in validation, sensitivity=0.15-0.20 in testing vs. 0.17-0.19 in validation. Within cluster resampling did not improve performance. Our case study suggests person-level splits for clustered data, rather than visit-level, may be preferable to accurately estimate and optimize prospective performance.

Keywords: Correlated data, Electronic health records, Machine learning, Nonignorable cluster size, Predictive analytics

1. Introduction

Clinical risk prediction models to guide point-of-care decision-making are growing in popularity. These prediction models use information on patient demographics and longitudinal health data available in the electronic health record (EHR) to generate predictions that can be shared and acted upon in a clinical setting. Frequently, it is of interest to identify patients at high risk of a harmful event following a clinic visit so that the provider can intervene during the visit to reduce that risk. Current clinical risk prediction models are often estimated using historical data and implemented in the clinic with a flag that alerts a patient’s healthcare provider to increased risk so the provider can further assess risk and offer appropriate interventions. A natural complication of using EHR data to estimate visit-level prediction models is that people may have multiple visits. The number of visits, or cluster size, may be related to a person’s risk, as people who are sick typically seek more care than those who are healthy and are also more likely to have the predicted outcome. Informative cluster size, also known as nonignorable cluster size, occurs when the number of observations or visits within a cluster is related to the outcome of interest.

The influence of clustering and informative cluster size on inference is well-documented, and methods exist for valid inference on marginal parameters in a variety of settings (Benhin, Rao, & Scott, 2005; Hoffman, Sen, & Weinberg, 2001; Huang & Leroux, 2011; Seaman, Pavlou, & Copas, 2014; Shen & Chen, 2018; Williamson, Datta, & Satten, 2003; Williamson, Kim, Manatunga, & Addiss, 2008). By contrast, the impact of clustering and informative cluster size in prediction modeling has not been closely examined. While inference requires unbiased estimation of parameters, statistical aims vary for prediction. With clinical prediction, the goal is to optimize risk discrimination, typically among those at higher risk, and to accurately assess future prediction performance. Here, informative cluster size is a concern if it affects risk discrimination, calibration (i.e., agreement between observed event rates and predictions) (Steyerberg et al., 2010), or generalizability. Specifically, if a prediction model better captures the relationship between predictors and risk for visits in larger clusters, as happens with marginal inference when informative cluster size is present, the resulting prediction model may perform poorly once implemented in practice for visits in smaller clusters. An ideal prediction method will adjust for informative cluster size in order to improve performance in smaller clusters without sacrificing performance in larger clusters where visits at highest risk are expected.

Clustering and informative cluster size also raise concerns if they impact assessment of a prediction model’s performance. Estimates of prediction model performance (including discrimination, classification accuracy, and calibration) are intended to reflect future performance and guide decisions about model implementation. How clustering is handled in model development and assessment may affect optimism, or overestimation of how a model will perform once implemented (Efron, 1983). Furthermore, if cluster size is informative, estimates of model performance may be incorrect for visits in smaller or larger clusters.

Clustering and informative cluster size have implications for how the sampling framework should be defined for training and validating a prediction model. We must consider the population in which a prediction model will be applied when deciding how to handle multiple observations per person throughout the prediction modeling process. These decisions include how to identify a training dataset, estimate the prediction model, and evaluate its accuracy in a test set. For clinical prediction models intended for point-of-care use, the population of interest is typically all eligible visits, not individual people. When risk varies over time for a person, each unique visit is relevant to model training and performance evaluation. Since multiple observations within a person are correlated and common prediction methods assume independence, additional consideration for handling clustered data is needed.

This paper considers predicting a person’s risk of attempting suicide in the 90 days following a mental health visit using data gathered from electronic health records. In this example, each person may have contributed multiple visits to the data used for model development. Covariates and outcomes were correlated within a person, and the time frame for ascertaining covariate information (at or before the visit) and outcome information (90 days following the visit) overlapped for visits close in time. We expected cluster size to be informative because people who seek mental health care more often are known to be at higher risk for suicide attempt and death (Simon et al., 2019). Another important statistical consideration for estimating risk prediction models is having sufficient sample size and event rate to both accurately estimate risk and precisely evaluate prediction model performance (Hirsch, 1991). Risk prediction models are often developed to predict rare events; for example, a suicide attempt was observed within 90 days of a visit in 0.67% of visits in our sample. Even in our setting, which included data from over 1.5 million visits, it was important to maximize our sample size to be able to build an accurate risk model and precisely assess its performance.

Our goal was to examine, across several sampling approaches for selecting training and validation sets and estimating models, the impact of informative cluster size on how a suicide risk prediction model would perform if implemented in a clinical setting. Visit data from 2012 to 2015 were split into training and test datasets. The training dataset was used to fit prediction models and the test dataset was used to estimate future performance of each model. Data from the subsequent period (October 2015 through September 2017) were then used to examine how the estimated models would have truly performed if implemented in clinical care, and how closely our expectations of model performance based on assessment in the test dataset matched performance in this later time period. We evaluated two prediction modeling methods: a parametric approach, logistic regression with variables selected using LASSO, and a non-parametric method, random forest.

2. Methods

2.1. Data

We randomly sampled 40% of people with at least one outpatient visit to a mental health specialty provider at age 13 years or older between January 1, 2012 and June 30, 2015 in seven health systems (HealthPartners; Henry Ford Health System; and the Colorado, Hawaii, Northwest, Southern California, and Washington regions of Kaiser Permanente). A random subset of people was used, rather than all people, to enable examination of several sampling frameworks and prediction models; using the entire sample would have been computationally prohibitive. All visits in this sample of people were included in a dataset that was divided into a training dataset for model development and a test dataset for estimating future model performance. Information available at the time of the outpatient visit, including current and past diagnoses, prescriptions, and mental health care encounters, was used to estimate prediction models. We refer to this dataset as our development dataset, as it includes all data that would be available during initial work to estimate and assess performance of a risk prediction model. Our prospective validation dataset includes data on all outpatient mental health visits for people 13 years and older from the same health systems between October 1, 2015 and September 30, 2017. We used this dataset for temporal validation and to mimic prospective evaluation of the clinical performance of a prediction model developed and tested using prior data from the same setting. Note, no suicide risk prediction models were in clinical use at any of the health systems during any of the study years. Use of health system data for this research was approved by each site’s institutional review board.

Potential predictors for building the risk prediction model collected at each visit included demographics (age, sex, race, and insurance status), comorbidity burden, and history of mental health and substance use diagnoses, prescription fills for psychiatric medications, mental health encounters (inpatient, outpatient, and emergency department), and past suicide attempts (Charlson, Szatrowski, Peterson, & Gold, 1994). Clinical history was summarized using binary indicators of history of each diagnosis, prescription, encounter, or suicide attempt in three overlapping time periods: 90 days, 1 year, and 5 years before the visit. Observation of clinical history was incomplete for people without 5 years of health insurance enrollment prior to the study visit, such that recorded prior diagnoses, prescriptions, encounters, and comorbidities were under-counted. Our analysis included visits without 5 years of prior enrollment because we wanted to estimate a prediction model that could be used to guide care for all patients and because lack of continuous long-term health insurance enrollment is associated with other social determinants of health (e.g., stable employment) that may be risk factors for suicide. Duration of health plan coverage at the time of the study visit was included as a predictor.
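
For illustration, a minimal R sketch of how such lookback indicators could be constructed is shown below. The data frames `visits` and `dx` and their column names are hypothetical stand-ins; the study's actual variable definitions are listed in Table S1 and the supplementary code.

```r
library(dplyr)

## Hypothetical long-format inputs (names are illustrative):
##   visits: person_id, visit_id, visit_date
##   dx:     person_id, dx_date, dx_category (e.g., "depression", "anxiety")
history_flags <- visits %>%
  left_join(dx, by = "person_id") %>%
  mutate(days_before = as.numeric(visit_date - dx_date)) %>%
  group_by(visit_id, dx_category) %>%
  summarise(
    hx_90d = any(days_before >= 0 & days_before <= 90,   na.rm = TRUE),
    hx_1y  = any(days_before >= 0 & days_before <= 365,  na.rm = TRUE),
    hx_5y  = any(days_before >= 0 & days_before <= 1826, na.rm = TRUE),  # ~5 years
    .groups = "drop"
  )
## history_flags would then be pivoted to one row per visit, with one 0/1 indicator
## per diagnosis category and lookback window.
```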

Some visits also had information from the Patient Health Questionnaire (PHQ-9), a patient-reported measure of depressive symptoms (Kroenke, Spitzer, & Williams, 2001). Rates of PHQ-9 response changed during the study period. Earlier in the study period, PHQ-9 use was at provider discretion; later, some health systems contributing data to this sample recommended use of the PHQ-9 for all outpatient mental health visits. When available, the distribution of PHQ-9 responses was summarized for the visit day and the prior 90, 183, and 365 days. The first eight items of the PHQ-9 were treated as a depressive symptom scale and the 9th item, which specifically asks about thoughts of death or self-harm, was treated as its own measure of suicidal ideation. The resulting dataset included 149 predictors and an additional 164 interactions. A full list of predictors is given in Table S1.

Suicide outcomes were identified in health system records and state death certificate data. Non-fatal suicide attempt was defined by a diagnosis of self-harm or undetermined intent accompanying any injury or poisoning diagnosis captured through the EHR or insurance claims data. Suicide deaths were identified from death certificates indicating definite or probable suicides (Bakst, Braun, Zucker, Amitai, & Shohat, 2016; Cox et al., 2017). We have previously reported that “undetermined intent” injuries and poisonings account for approximately 25% of all suicide attempts (Simon et al., 2016; Simon et al., 2018). In Simon et al. (2018), exclusion of those events had no effect on prediction model performance for mental health patients. About 0.2% of visits were excluded from our sample because a non-suicide death followed within 90 days. The follow-up windows for outcome ascertainment overlapped for people with multiple visits within 90 days. As a result, a single suicide attempt may have been attributed to more than one visit. We did not censor follow-up at the time of a person’s next visit because censoring would be informative as people with more mental health visits are at higher risk of suicide attempt.

More information on data collection and variable definitions is available in Simon et al. (2018) and in this article’s online supporting information.

2.2. Descriptive analyses

We conducted descriptive analyses to evaluate variation in cluster size and the relationship between cluster size and risk of suicide attempt in the development dataset. We also summarized the number of visits attributed to unique events to quantify outcome correlation due to overlap of follow-up windows.

2.3. Defining the sampling framework

Our analysis explored four sampling frameworks, each offering a distinct approach to accommodating clustering and informative cluster size when estimating prediction models and their future performance. The sampling frameworks are defined by how the development dataset is divided into training and test sets (visit- or person-level), how training data are divided for 10-fold cross-validation (visit- or person-level) to select model tuning parameters, and how data are sampled for model estimation (observed cluster analysis or within cluster resampling). The four sampling frameworks are listed and described in detail below.

  1. Visit-level training/test split, visit-level cross-validation, and observed cluster analysis for prediction model estimation

  2. Visit-level training/test split, person-level cross-validation, and observed cluster analysis for prediction model estimation

  3. Person-level training/test split, person-level cross-validation, and observed cluster analysis for prediction model estimation

  4. Person-level training/test split, person-level cross-validation, and within cluster resampling of one visit per person for prediction model estimation

We considered two approaches to dividing the development dataset into training and test sets: a person-level split (i.e., all visits for any person are included together in either the training or test set), or a visit-level split (i.e., visits from any person may appear in both the training and test set). Because outcome windows may have overlapped for people with multiple visits, using a person-level split precluded training the prediction model on outcomes that were also in the test set. The visit-level split, however, better reflected both the clinical context for model use—predicting suicide risk following a visit—and the time-varying nature of suicide risk within a person as risk-related covariates and outcomes may change for people with multiple visits over the time period examined. Moreover, the visit-level split made more unique people and events available for prediction model training, which could improve power. For each approach, we sampled 65% of the development sample for the training dataset and 35% for the test set used to assess model performance before deployment. No information from the prospective validation set was used in these samples, as these approaches represent options available to an analyst for building and assessing performance prior to deployment of a risk prediction model in a clinic.
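
A minimal R sketch of the two splitting schemes, assuming only a data frame `visits` with a `person_id` column (the study's actual sampling code is provided in the online supplement):

```r
set.seed(20)

## Visit-level split: individual visits are assigned to training (65%) or test (35%),
## so visits from the same person can land in both sets.
visits$split_visit <- sample(c("train", "test"), nrow(visits),
                             replace = TRUE, prob = c(0.65, 0.35))

## Person-level split: 65% of people (and all of their visits) go to training;
## with many people this yields approximately, not exactly, 65% of visits.
ids <- unique(visits$person_id)
train_ids <- sample(ids, size = round(0.65 * length(ids)))
visits$split_person <- ifelse(visits$person_id %in% train_ids, "train", "test")
```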

Folds for tuning parameter selection within the training set can also be defined at the person or visit level. We compared a visit- and person-level split for 10-fold cross-validation within the visit-level training and test split approach. While a visit-level split for cross-validation followed naturally from the visit-level training/test division, we also considered a person-level split to reduce over-fitting due to visits with correlated outcomes appearing across folds. For the person-level training and test split, we only examined person-level cross-validation, as sampling folds at the visit level after selecting the training set at the person level does not follow a logical sampling scheme.
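
Continuing the sketch above, person-level folds can be created by assigning each person, rather than each visit, to one of the 10 folds; the resulting `foldid` vector (a hypothetical name) can then be passed to the cross-validation routines described in Section 2.4.

```r
K <- 10
train <- subset(visits, split_person == "train")   # training visits from the sketch above

## Person-level folds: every visit from a person shares that person's fold.
train_people <- unique(train$person_id)
person_fold  <- sample(rep(seq_len(K), length.out = length(train_people)))
names(person_fold) <- train_people
foldid <- person_fold[as.character(train$person_id)]

## Visit-level folds, by contrast, ignore person membership:
foldid_visit <- sample(rep(seq_len(K), length.out = nrow(train)))
```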

Finally, we considered two approaches for model estimation. Using all observations from all clusters for model estimation (as is typical) is known as observed cluster analysis. In this approach, the contribution of a cluster is proportional to its size and the estimated model may not reflect the relationship between predictors and outcomes for visits in small clusters well. In our particular application, using all observations from a cluster also means that the outcome window is overlapping for visits that occur within 90 days of each other.

Alternatively, within cluster resampling can be used to balance the information contributed by sampling one observation per cluster to construct a “resampled” dataset (Hoffman et al., 2001). In our suicide risk prediction case study, using only one observation per cluster also eliminates overlap of the outcome follow-up window for visits within 90 days of each other. Within cluster resampling repeats the process of randomly selecting one visit per cluster several times; the final model estimates are averaged across resampled datasets in which cluster size is not associated with the outcome. For cross-validation, we selected tuning parameters with only a single resampled dataset because we found that using multiple resampled datasets did not affect tuning parameter selection but did impose considerable computational burden. For model estimation in the complete training set, we performed model estimation on 20 resampled datasets using selected tuning parameters. Predictions for the development test set and prospective validation set were obtained for each of the 20 models fit and were averaged to obtain a single prediction for each visit. Results for model estimation using within cluster resampling were examined for the person-level training and test set split only because within cluster resampling after a visit-level split does not uniformly reduce analysis to one visit per cluster when visits can be distributed across folds and between the training and test sets.
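
A sketch of within cluster resampling under the same assumptions; the helpers `fit_model()` and `predict_model()` are hypothetical placeholders for either estimation method described in Section 2.4.

```r
library(dplyr)

## Draw one visit per person to form a resampled training set.
resample_one_visit <- function(data) {
  data %>% group_by(person_id) %>% slice_sample(n = 1) %>% ungroup()
}

## Fit the chosen model on each of 20 resampled datasets ...
fits <- lapply(1:20, function(r) fit_model(resample_one_visit(train)))

## ... and average the 20 sets of predictions to obtain one prediction per test visit.
pred_matrix <- sapply(fits, function(f) predict_model(f, newdata = test))
test$pred_wcr <- rowMeans(pred_matrix)
```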

2.4. Estimation methods

We estimated prediction models for the four sampling frameworks described above using two methods: logistic regression with the least absolute shrinkage and selection operator (LASSO) and random forest. Logistic regression with LASSO uses the L1 penalty to select stronger predictors of the outcome while shrinking the coefficients for weaker predictors towards zero (Hastie, Tibshirani, & Friedman, 2009; Tibshirani, 1996). The degree of shrinkage, determined by the tuning parameter λ, was selected using 10-fold cross-validation within the training dataset.
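
A hedged sketch of this step with the glmnet package, assuming `x`/`y` hold the training predictors (as a matrix) and the 90-day outcome indicator, `x_test` the test-set predictors, and `foldid` the person-level fold assignment from above; the use of `lambda.min` is illustrative.

```r
library(glmnet)

## 10-fold cross-validation over lambda, maximizing visit-level AUC.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                    foldid = foldid, type.measure = "auc")

## Refit at the selected lambda and predict risk for test-set visits.
lasso_fit <- glmnet(x, y, family = "binomial", alpha = 1, lambda = cv_fit$lambda.min)
pred_lasso <- predict(lasso_fit, newx = x_test, type = "response")[, 1]
```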

Random forest is a non-parametric ensemble learning method in which a model is composed of many decision trees, each estimated on a bootstrap sample of the training dataset (Breiman, 2001). Random selection of the predictors considered at each split in a tree reduces correlation between trees. We estimated trees using the Gini index, a measure of node impurity, as the splitting rule (Breiman, 1996). For each tree, suicide attempt risk predictions were returned equal to the proportion of events among visits in each terminal node. Predicted risk of a suicide attempt for a visit was obtained by averaging the predictions across trees. We selected values for two tuning parameters—minimum terminal node size and the number of predictors considered at each split—using cross-validation. We estimated random forest models with 200 trees to limit the number of tuning parameter combinations evaluated; initial random forest models estimated with this dataset showed that improvement in prediction plateaued after 100 trees.
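
A comparable sketch with the ranger package; the data frames and the outcome column name `event_90d` are hypothetical, and the tuning values shown correspond to one of the selected combinations reported in Table 3.

```r
library(ranger)

## Probability forest with 200 trees; the outcome is assumed to be a factor
## with levels "0"/"1".
rf_fit <- ranger(
  dependent.variable.name = "event_90d",
  data          = train_model_data,     # predictors plus factor outcome
  num.trees     = 200,
  mtry          = 17,                   # predictors considered at each split
  min.node.size = 5000,                 # minimum terminal node size
  splitrule     = "gini",
  probability   = TRUE,
  seed          = 20
)

## Predicted risk = average terminal-node event proportion across trees.
pred_rf <- predict(rf_fit, data = test_model_data)$predictions[, "1"]
```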

Prediction model estimation for both logistic regression with LASSO and random forest and across all sampling frameworks used a visit-level (rather than a person-level) loss function because risk of suicide varies over time within a person and because our goal was to estimate a prediction model that could be used at the time of a mental health visit to prevent suicidal behavior for those at high risk. Similarly, tuning parameters were selected via cross-validation to optimize the visit-level area under the receiver operating characteristic curve (AUC).

2.5. Performance measures

The development test set was used to estimate the performance of prediction models fit under the eight modeling options (logistic regression with LASSO and random forest, across four sampling frameworks). This performance was then compared to the true performance observed in the prospective validation set. The development test set represented the data an analyst would have in-hand for model development and evaluation. The prospective validation set demonstrated model performance for future visits; such prospective data are not typically available to the analyst during development and evaluation of a risk prediction model. Comparison of performance estimates between the development test set and the prospective validation set indicated the degree of overfitting or optimism in the original model estimation process.

We evaluated risk discrimination using area under the curve (AUC) (Hanley & McNeil, 1982). Classification accuracy was assessed for the highest risk visits. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated using cut points of the risk scores associated with the 95th, 99th, and 99.5th percentile in the training set. Bootstrapped 95% confidence intervals (CIs) were calculated for all measures (Efron, 1979). Bootstrapping was done at the visit level, rather than the person level, to reflect the expected variability in the population of visits if implementing the prediction model in a clinical setting. Cluster size is not fixed in real-life clinical settings, and the number of visits a person makes contributes to variability in the population.
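
For concreteness, a sketch of these computations with the pROC package, assuming vectors `p_train`, `p_val`, and `y_val` for training-set predictions, validation-set predictions, and observed outcomes; the number of bootstrap replicates shown is illustrative.

```r
library(pROC)

## Discrimination: visit-level AUC in the (test or prospective) validation data.
auc_val <- as.numeric(auc(roc(y_val, p_val, quiet = TRUE)))

## Classification at a cut point set at the 99th percentile of training-set risk scores.
cut99 <- quantile(p_train, 0.99)
flag  <- p_val >= cut99
sens  <- sum(flag & y_val == 1) / sum(y_val == 1)
ppv   <- sum(flag & y_val == 1) / sum(flag)

## Visit-level bootstrap 95% CI for sensitivity (visits, not people, are resampled).
boot_sens <- replicate(500, {
  i <- sample(length(y_val), replace = TRUE)
  sum(p_val[i] >= cut99 & y_val[i] == 1) / sum(y_val[i] == 1)
})
ci_sens <- quantile(boot_sens, c(0.025, 0.975))
```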

Additionally, we estimated the amount of information associated with cluster size after adjusting for predicted suicide attempt risk for each of the estimated models. Within the prospective validation set, we fit a logistic regression model with observed 90-day suicide attempt status regressed on predicted risk from each estimated model and cluster size categorized into 1, 2-5, 6-10, 11-20, and >20 visits per person. Estimated odds ratios (ORs) indicated whether observed event rates varied by cluster size after conditioning on model predictions, that is, whether cluster size remained informative of risk.
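
A sketch of this check, assuming a validation data frame `val` with observed outcome `event_90d`, model prediction `pred_risk`, and the person's total visit count `n_visits` (all names are illustrative):

```r
## Categorize cluster size with clusters of size 1 as the reference group.
val$size_cat <- cut(val$n_visits, breaks = c(0, 1, 5, 10, 20, Inf),
                    labels = c("1", "2-5", "6-10", "11-20", ">20"))

## Observed outcome regressed on predicted risk and cluster-size category.
cal_fit <- glm(event_90d ~ pred_risk + size_cat, family = binomial(), data = val)

## Odds ratios for each size category relative to size 1, with Wald 95% CIs.
or_table <- exp(cbind(OR = coef(cal_fit), confint.default(cal_fit)))
or_table[grep("size_cat", rownames(or_table)), ]
```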

All analyses were conducted in R version 3.4.4. (“R Core Team. R: A language and environment for statistical computing,” 2019). Logistic regression with LASSO was estimated using the CRAN package glmnet (version 2.0.16) (Friedman et al., 2010) and random forest models were estimated using the CRAN package ranger (version 0.9.0) (Wright & Ziegler, 2017). Statistical code for sampling procedures is provided in the online supplementary materials.

3. Results

3.1. Descriptive analyses

The development dataset included 1,518,968 outpatient mental health visits made by 207,915 people (Table 1). There were 2,444 unique suicide attempts observed, and 10,171 visits had a suicide attempt within 90 days (6.7 events per 1,000 visits). 1,881 people had any suicide attempt recorded during the study period (0.9% of all people in the sample). 1,579 people had a single suicide attempt (65% of all suicide attempts), 220 people had two suicide attempts (18%), and 82 people had 3 or more unique suicide attempts in the development dataset (17%).

Table 1.

Characteristics of cohort used for prediction model development

All visits Visits per person
1 2 3-5 6-10 11-20 21+

Sample and outcome characteristics N (row %) N (row %) N (row %) N (row %) N (row %) N (row %) N (row %)

Total number of visits 1,518,968 (100) 62,660 (4) 67,306 (4) 175,850 (12) 221,286 (15) 294,202 (19) 697,664 (46)
Total number of unique people 207,915 (100) 62,660 (30) 33,653 (16) 46,588 (22) 28,996 (14) 20,355 (10) 15,663 (8)
90-day suicide attempts (visit-level) 10,171 (100) 104 (1) 123 (1) 529 (5) 836 (8) 1,488 (15) 7,091 (70)
  Event rate (per 1,000 visits) 6.7 1.7 1.8 3.0 3.8 5.1 10.2
Unique 90-day suicide attempts (person-level) 2,444 (100) 104 (4) 88 (4) 299 (12) 376 (15) 484 (20) 1,093 (45)
  Rate of unique events (per 1,000 people) 11.8 1.7 2.6 6.4 13.0 23.8 69.8
Visit characteristics N (column %) N (column %) N (column %) N (column %) N (column %) N (column %) N (column %)

Female 955,894 (63) 37,988 (61) 41,176 (61) 109,163 (62) 140,545 (64) 189,708 (64) 437,314 (63)
Age
 13 – 17 157,605 (10) 5,923 (9) 7,461 (11) 21,808 (12) 28,670 (13) 35,937 (12) 57,806 (8)
 18 – 29 262,965 (17) 13,085 (21) 14,811 (22) 36,623 (21) 41,613 (19) 50,460 (17) 106,373 (15)
 30 – 44 390,789 (26) 15,818 (25) 17,129 (25) 46,287 (26) 59,898 (27) 78,127 (27) 173,530 (25)
 45 – 64 538,274 (35) 17,593 (28) 18,959 (28) 51,281 (29) 67,175 (30) 99,071 (34) 284,195 (41)
 65 or older 169,335 (11) 10,241 (16) 8,946 (13) 19,851 (11) 23,930 (11) 30,607 (10) 75,760 (11)
Race
 White 1,034,027 (68) 38,391 (61) 42,020 (62) 113,819 (65) 145,588 (66) 195,701 (67) 498,508 (71)
 Asian 69,952 (5) 3,638 (6) 3,596 (5) 9,070 (5) 10,923 (5) 14,411 (5) 28,314 (4)
 Black 142,153 (9) 6,407 (10) 6,664 (10) 16,188 (9) 21,170 (10) 28,763 (10) 62,961 (9)
 Hawaiian / Pacific Islander 17,034 (1) 768 (1) 824 (1) 2,249 (1) 2,648 (1) 3,552 (1) 6,993 (1)
 Native American 15,538 (1) 524 (1) 684 (1) 1,660 (1) 2,146 (1) 3,079 (1) 7,445 (1)
 More than one or Other 9,454 (1) 501 (1) 532 (1) 1,444 (1) 1,815 (1) 2,119 (1) 3,043 (0)
 Not Recorded 230,810 (15) 12,431 (20) 12,986 (19) 31,420 (18) 36,996 (17) 46,577 (16) 90,400 (13)
Ethnicity
 Hispanic 343,132 (23) 16,719 (27) 17,614 (26) 44,096 (25) 53,507 (24) 69,381 (24) 141,815 (20)
Insurance type§
 Commercial group 1,113,389 (73) 43,287 (69) 48,397 (72) 129,149 (73) 163,331 (74) 220,082 (75) 509,143 (73)
 Self-funded 51,177 (3) 1,848 (3) 2,243 (3) 6,911 (4) 8,836 (4) 10,653 (4) 20,686 (3)
 Medicare 233,409 (15) 10,545 (17) 9,593 (14) 23,031 (13) 28,510 (13) 39,393 (13) 122,337 (18)
 Medicaid 75,581 (5) 3,785 (6) 4,016 (6) 10,520 (6) 12,634 (6) 15,940 (5) 28,686 (4)
 Private pay 198,591 (13) 8,630 (14) 8,529 (13) 21,594 (12) 25,169 (11) 32,977 (11) 101,692 (15)
 State subsidized 3,355 (0) 178 (0) 177 (0) 581 (0) 645 (0) 658 (0) 1,116 (0)
 High deductible 124,365 (8) 6,277 (10) 6,483 (10) 15,272 (9) 18,890 (9) 22,464 (8) 54,979 (8)
PHQ-9 Item 9 score§§ recorded
 At index visit 203,763 (13) 10,553 (17) 11,723 (17) 32,684 (19) 38,957 (18) 44,559 (15) 65,287 (9)
 At any visit in past year (excluding index) 330,784 (22) 3,130 (5) 8,378 (12) 33,885 (19) 52,788 (24) 73,750 (25) 158,853 (23)
 At any visit in past 5 years (excluding index) 414,467 (27) 8,985 (14) 14,195 (21) 48,196 (27) 68,153 (31) 91,156 (31) 183,782 (26)
Length of enrollment prior to visit
 1 year or more 1,275,487 (84) 47,203 (75) 51,165 (76) 138,025 (78) 177,749 (80) 244,739 (83) 616,606 (88)
 5 years or more 843,134 (56) 30,230 (48) 31,748 (47) 86,692 (49) 111,637 (50) 158,585 (54) 424,242 (61)
Diagnoses in prior 5 years (including index visit)*
 Depression 1,119,160 (74) 31,496 (50) 39,713 (59) 115,121 (65) 155,777 (70) 222,537 (76) 554,516 (79)
 Anxiety 981,688 (65) 25,960 (41) 32,775 (49) 96,454 (55) 132,595 (60) 195,734 (67) 498,170 (71)
 Bipolar dx 190,926 (13) 2,340 (4) 3,421 (5) 11,591 (7) 19,219 (9) 29,453 (10) 124,902 (18)
 Schizophrenia 54,770 (4) 523 (1) 818 (1) 2,803 (2) 4,287 (2) 6,756 (2) 39,583 (6)
 Other MH condition 76,862 (5) 1,291 (2) 1,626 (2) 4,756 (3) 7,795 (4) 11,639 (4) 49,755 (7)
 Dementia 64,914 (4) 3,038 (5) 3,018 (4) 6,424 (4) 7,770 (4) 10,957 (4) 33,707 (5)
 Attention deficit disorder 171,413 (11) 5,258 (8) 6,948 (10) 21,341 (12) 26,648 (12) 33,118 (11) 78,100 (11)
 Autism spectrum disorder 19,404 (1) 428 (1) 439 (1) 1,597 (1) 2,691 (1) 4,362 (1) 9,887 (1)
 Personality disorder 85,829 (6) 700 (1) 1,019 (2) 3,728 (2) 5,587 (3) 11,177 (4) 63,618 (9)
 Alcohol use disorder 196,139 (13) 1,934 (3) 2,289 (3) 7,166 (4) 11,878 (5) 21,880 (7) 150,992 (22)
 Drug dependence disorder 394,513 (26) 6,048 (10) 7,473 (11) 22,512 (13) 33,109 (15) 55,724 (19) 269,647 (39)
 PTSD 140,247 (9) 1,655 (3) 2,427 (4) 8,552 (5) 13,678 (6) 24,258 (8) 89,677 (13)
 Eating disorder 50,347 (3) 495 (1) 691 (1) 2,542 (1) 4,121 (2) 7,089 (2) 35,409 (5)
 Traumatic brain injury 43,063 (3) 1,251 (2) 1,552 (2) 4,097 (2) 5,584 (3) 7,831 (3) 22,748 (3)
Medication fills in prior 5 years
 Antidepressants 943,017 (62) 21,792 (35) 27,193 (40) 83,871 (48) 120,402 (54) 183,700 (62) 506,059 (73)
 Benzodiazepines (anxiety/sedative) 668,960 (44) 15,839 (25) 18,896 (28) 55,752 (32) 79,243 (36) 122,967 (42) 376,263 (54)
 Hypnotics 199,503 (13) 4,271 (7) 5,133 (8) 15,916 (9) 22,060 (10) 35,221 (12) 116,902 (17)
 2nd generation antipsychotics 302,322 (20) 2,675 (4) 3,864 (6) 14,481 (8) 25,105 (11) 43,914 (15) 212,283 (30)
Encounters in prior 5 years with psychiatric diagnoses
 Any inpatient encounter 349,572 (23) 6,604 (11) 7,639 (11) 22,331 (13) 32,281 (15) 52,631 (18) 228,086 (33)
 Any outpatient encounter 1,388,497 (91) 18,970 (30) 44,370 (66) 147,093 (84) 204,523 (92) 283,071 (96) 690,470 (99)
 Any emergency department encounter 518,831 (34) 11,626 (19) 13,438 (20) 38,366 (22) 54,514 (25) 85,945 (29) 314,942 (45)
 Any suicide attempt 68,442 (5) 692 (1) 877 (1) 3,061 (2) 5,087 (2) 9,015 (3) 49,710 (7)

Row percentages correspond to the percent of visits, people, and suicide attempts in clusters of a particular size. Row percentages across all cluster sizes sum to 100%.

Visit-level event rate is calculated as the proportion of visits with 90-day suicide attempts x 1,000. A single event may occur in the 90 days following multiple visits that are close in time.

§ Insurance types at a visit are not mutually exclusive. A person may have insurance in more than one category at the time of visit.

§§ The 9th item of the Patient Health Questionnaire (PHQ-9) asks patients to report levels of suicidal ideation in the two weeks preceding their visit.

* Summary diagnosis, medication, and encounter indicators include information available at the index visit and clinical history up to 5 years prior. Clinical history is left-censored at the time of health plan disenrollment.

Cluster size (number of visits per person) ranged from 1 visit (30% of people in the sample) to over 20 visits (8% of people). Cluster size was related to risk of suicide attempt. Visits by people who had only one visit in our development set accounted for only 1% of the visits with suicide attempts and the event rate for clusters of size one was 1.7 per 1,000 visits. By comparison, 70% of visits with attempts were observed in people with more than 20 visits in our sample, and the event rate was 10.2 per 1,000 visits. Characteristics related to suicide risk also varied by cluster size. People with more visits were more likely to be older, White non-Hispanic, and have a previous PHQ-9 response recorded. Mental health diagnoses and prescription orders for psychotropic medications were also more common for people with more visits.

The prospective validation set included 4,286,495 visits made by 660,659 people (Table S2 in online supporting information). Visits in the prospective sample had higher rates of available PHQ-9 information, as was expected given change in health systems’ implementation of the PHQ-9 during the study period. Visits in the prospective validation set also had higher rates of people with Medicaid insurance and of diagnoses of depression and anxiety.

Because our outcome window extended 90 days after a visit, it was possible for multiple visits to be associated with the same suicide attempt. Across all suicide attempts observed in our development sample, the median number of visits associated with an event was 2 with an interquartile range of (1-5). Although people with more visits were more likely to have an event recorded in our sample, most of their visits either occurred more than 90 days before the event or following the event (Table S3 in online supporting information), which meant that each suicide attempt was usually only associated with a small proportion of their total number of visits. This was true for both the development dataset and prospective validation data.

When the development data was split at the visit level, 46% of the 2,444 unique suicide attempts were associated with visits such that these attempts appeared in both the training and test sample (Table 2). The visit-level split included more unique people and events for model training than the person-level split.

Table 2.

Distribution of visits, people, and events using visit-level and person-level splits for training and test sets

All development data Visit-level split Person-level split
Training Testing Training Testing
Number of visits 1,518,968 987,329 531,639 987,038 531,930
   Number of visits with event 10,171 6,614 3,557 6,638 3,533
   Event rate (per 1,000 visits) 6.7 6.7 6.7 6.7 6.6
Number of people 207,915 180,508 141,968 135,144 72,771
   Number of people with any event 1,949 1,669 1,271 1,278 671
   Event rate (per 1,000 people) § 9.4 9.2 9.0 9.5 9.2
Number of unique events §§ 2,444 2,061 1,517 1,603 841
 Number of unique events occurring in both training and testing sets n/a 1134 (46%) 0 (0%)

For the visit-level training/testing split, the sum of the number of people, people with any event, and unique events across the training and test sets will be greater than the number of people, people with any event, and unique events in the combined dataset because visits from the same person (or associated with the same event) may appear in both training and test sets.

Number of people with one or more visits with a suicide attempt within 90 days.

§ Event rate is calculated as the proportion of people with any 90-day suicide attempt x 1,000.

§§ Number of unique events may include multiple events per person if separate suicide attempts occurred in the 90 days following a visit in the development dataset.

3.2. Prediction model estimation

Prediction models were estimated for the four sampling frameworks using logistic regression with LASSO and random forest. Tuning parameters selected by cross-validation are reported in Table 3. Using a visit-level split for both training/testing and cross-validation resulted in more complex prediction models, i.e., more non-zero coefficients selected by LASSO for the logistic regression and smaller minimum terminal node size (indicating deeper trees) for the random forest, compared to using a person-level split. Prediction models estimated in datasets with one visit randomly sampled per cluster were also less complex because the total sample size available for estimating the model was smaller. Out-of-fold AUC results of cross-validation for tuning parameter selection are given in Tables S4–S5 in online supporting information.

Table 3.

Selected tuning parameters for prediction models

Sampling Framework Logistic regression with LASSO Random forest
Training/ test split Cross-validation split Model estimation λ # Non-zero coefficients Minimum terminal node size # predictors sampled at each split
Visit Visit Observed cluster analysis 2.5 x 10−6 250 500 34
Visit Person Observed cluster analysis 4.5 x 10−5 86 5,000 8
Person Person Observed cluster analysis 4.5 x 10−5 105 5,000 17
Person Person Within cluster resampling 5.5 x 10−5 64 1,000 17

λ controls the degree of shrinkage for variable selection. A larger value of λ corresponds to more shrinkage, and a smaller value of λ, less shrinkage and more non-zero coefficients.

The default recommendation for number of predictors randomly sampled for consideration at each split is square root of the total number of predictors, equal to 17 for our dataset. We also examined twice this default (34 predictors) and half of the default (8).

§ Average number of non-zero coefficients across logistic regression with LASSO models fit on 20 within cluster resampled datasets.

3.3. Discrimination

Table 4 shows the AUC in the development test and prospective validation sets for each prediction model. Prediction models estimated using logistic regression with LASSO performed similarly in development test sets for all sampling frameworks examined; the AUCs ranged from 0.854 to 0.867. The optimism of discrimination estimates, that is, the difference between the AUC in the test and prospective validation sets, was small. The AUCs for logistic regression prediction models ranged from 0.847 to 0.854 in the prospective validation set.

Table 4.

Discrimination in development and prospective validation sets measured by AUC (95% CIs).

Sampling Framework Logistic regression with LASSO Random forest
Training/ test split Cross-validation split Model estimation Development test set Prospective validation set Development test set Prospective validation set
Visit Visit Observed cluster analysis 0.867 (0.860, 0.873) 0.849 (0.846, 0.851) 0.950 (0.946, 0.954) 0.836 (0.833, 0.838)
Visit Person Observed cluster analysis 0.862 (0.856, 0.868) 0.853 (0.850, 0.855) 0.907 (0.901, 0.912) 0.853 (0.850, 0.855)
Person Person Observed cluster analysis 0.854 (0.847, 0.861) 0.847 (0.845, 0.850) 0.856 (0.849, 0.862) 0.847 (0.844, 0.849)
Person Person Within cluster resampling 0.863 (0.857, 0.869) 0.854 (0.852, 0.856) 0.857 (0.851, 0.864) 0.847 (0.845, 0.849)

Development test set includes 531,639 visits (141,968 people, 1,517 unique events) for the visit-level training/test split and 531,930 visits (72,771 people, 841 unique events) for the person-level training/test split.

Prospective validation set includes 4,286,495 visits (660,659 people, 6,678 unique events).

Optimism was greatest for the random forest model using a visit-level split for training/test division and cross-validation; the AUC in the development test set was 0.950 (95% CI: 0.946, 0.954) but only 0.836 (95% CI: 0.835, 0.838) in the prospective validation set. Using a person-level split for cross-validation after dividing the training/test set at the visit level had lower discrimination in the test dataset (AUC=0.907, 95% CI: 0.902, 0.912) but lower optimism and better performance in the prospective validation set (AUC=0.853, 95% CI: 0.850, 0.855). Among random forest models, optimism was smallest for models using a person-level training/test split. Using a person-level training/test split and observed cluster analysis yielded an AUC of 0.856 (95% CI: 0.849, 0.862) in the test set and 0.847 (95% CI: 0.844, 0.849) in the prospective validation set; estimation via within cluster resampling yielded nearly identical results.

3.4. Classification accuracy

Figures 1 and 2 show sensitivity and PPV among visits with the highest predicted risk from each model for the development test and prospective validation sets (Tables S3–S4). Patterns in optimism for classification accuracy were similar to what was observed for discrimination. While the eight prediction models showed similar rates of sensitivity and PPV in the prospective validation set, these measures were overestimated in the development test set by random forest models with a visit-level training/test split. Using a person-level split for cross-validation reduced but did not eliminate this optimism. Prediction models estimated with logistic regression with LASSO and random forest models that used a person-level training/test split returned estimates of classification accuracy in the test set that closely matched prospective performance. Because suicide attempt is a rare outcome, estimated specificity was close to 0.995, 0.99, and 0.95 at risk thresholds defined at the 99.5th, 99th, and 95th percentiles of the risk score distribution (Table S5) and estimated NPV was near 1 at all thresholds (Table S6) for all prediction models.

Figure 1.


Sensitivity of prediction models for visits in the highest risk percentiles. Prediction models estimated for all sampling frameworks with (a) logistic regression with LASSO and (b) random forest in the development testing and prospective validation sets. Plotting symbols indicate the sampling framework used and vertical lines represent 95% confidence intervals.

Figure 2.


Positive predictive value (PPV) of prediction models for visits in the highest risk percentiles. Prediction models estimated for all sampling frameworks with (a) logistic regression with LASSO and (b) random forest in the development testing and prospective validation sets. Plotting symbols indicate the sampling framework used and vertical lines represent 95% confidence intervals.

3.5. Calibration given cluster size

Figure 3 displays OR estimates from the prospective validation set for the probability of suicide attempt for visits in clusters of size 2-5, 6-10, 11-20, and >20 visits compared to clusters with only 1 visit per person. ORs were adjusted for predicted risk from the modeling approaches considered. An OR greater than 1 indicated visits in a larger cluster had higher observed rates of suicide attempt than visits with the same predicted risk in clusters of size one, that is, that cluster size was informative of risk.

Figure 3.


Calibration of risk predictions by cluster size. Plot shows odds ratios (OR) estimated in the prospective validation set for visits in larger clusters relative to visits in clusters of size one. The horizontal dashed line shows no difference in risk (OR=1) compared to clusters of size one, and estimates above OR=1 indicate increased risk of a suicide attempt for larger clusters after conditioning on predicted risk. Sampling frameworks are denoted by plotting symbols, and estimation method (logistic regression with LASSO and random forest) is indicated along the x-axis.

Observed event rates at a given risk prediction were comparable between clusters with 2-5 visits per person and those with only one visit. For larger clusters, however, there was some evidence of informative cluster size for most modeling approaches. As the number of visits per cluster increased, so did the estimated OR for cluster size. For example, for the random forest prediction model with a person-level training/test split and observed cluster analysis (represented by a triangle plotting symbol in Figure 3), the estimated OR for visits in clusters of 6-10 visits was 1.0005 (95% CI: 1.0001, 1.0009) and increased to 1.0019 (95% CI: 1.0015, 1.0022) for clusters with more than 20 visits. Two sampling frameworks estimated with random forest demonstrated better calibration of predictions across cluster sizes: first, a visit-level training/test split and person-level cross-validation (circle) and, second, a person-level training/test split and within cluster resampling for estimation (diamond). These two approaches showed no significant difference for clusters with 6-10 and 11-20 visits compared to clusters with only 1 visit and had the lowest ORs for clusters with more than 20 visits.

While this analysis showed statistically significant evidence of informative cluster size, the magnitude of error in calibration was relatively small. For example, consider the impact on predictive accuracy for a visit around the 95th risk percentile for the prediction models in our study. For a visit with a predicted 2.5% chance of suicide attempt (250 events per 10,000 visits), an OR of 1.005 indicates an increase in risk of only 1.2 additional suicide attempts per 10,000 visits. By comparison, the largest OR for miscalibration was 1.0038 (95% CI: 1.0034, 1.0042) for the LASSO prediction model using a visit-level split for training/testing and cross-validation.
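
The conversion behind this figure is straightforward; the few lines below simply reproduce the arithmetic using the values quoted in the text.

```r
p0   <- 0.025                 # predicted risk near the 95th percentile
or   <- 1.005                 # calibration odds ratio
odds <- p0 / (1 - p0) * or    # odds after applying the OR
p1   <- odds / (1 + odds)     # corresponding absolute risk
(p1 - p0) * 10000             # ~1.2 additional suicide attempts per 10,000 visits
```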

4. Discussion

Clustering and informative cluster size are common when using EHR data to build risk prediction tools designed for clinical care. In this study, we compared different approaches to accommodating both clustering and informative cluster size in the development of a prediction model for estimating risk of a suicide attempt in the 90 days following a mental health outpatient visit. In this application, data for training the prediction model were clustered because people had multiple mental health care visits, and the 90-day outcome window following a visit overlapped for visits close in time. Cluster size was also related to risk because people who sought more mental health care had higher rates of suicide attempt. As such, this was a natural application in which to evaluate the impact of clustering and informative cluster size.

Our study explored several approaches for prediction model development including four sampling frameworks for handling clustered data and two prediction modeling methods. We considered the impact of clustering and informative cluster size on estimating performance in the development test set. We used a prospective validation set to evaluate model performance in future visits and to assess optimism of each of these sampling techniques. An ideal prediction model is characterized by strong performance in prospective use and accurate estimation of future performance at the time of model development. Our unique data source allowed us to estimate the true optimism of each of these sampling frameworks, because we had a large number of visits from future years in which the risk prediction model was not yet implemented into clinical care at these health systems. The clinical risk prediction models we have developed as part of this work are now being implemented in several health systems.

We compared splitting the development data into training and test sets at the visit or person level. Using a person-level split prevents observations from the same cluster from appearing in both training and test sets. Importantly, this approach avoids training the prediction model on suicide attempts that also appear in the test set, which may increase optimism. We found this to be true in our setting. In our dataset, a median of two visits was associated with each suicide attempt, and, when a visit-level split was used, 46% of events had visits in the preceding 90 days in both the training and test sets.

There are also arguments in favor of using a visit-level split for dividing the training and test sets. This approach reflects the time-varying nature of suicide risk. Suicide risk does not change monotonically, and people may have many visits in the development dataset and possibly multiple events, perhaps separated by days, weeks, or years. A visit-level split for training and test sets also matches the clinical purpose of the model to intervene at the time of a visit if estimated risk of suicide attempt is high. Moreover, people with mental health visits during the time covered by the development dataset may also have future mental health visits when the model is in clinical use.

Having sufficient sample size, which depends on the event rate, to accurately estimate risk is another important factor when deciding whether to use a person- or visit-level split for the training and test sets. While the training dataset included nearly 1 million visits, the event rate was very low (7 per 1,000 visits). Using a visit-level split increases the number of unique people and events available for model training, which can reduce bias and increase stability in prediction model estimation. The number of unique people and events in the testing set also affects precision when evaluating model performance. When smaller sample size and low event rate are a concern, an alternative to the split-sample approach used here is to train the prediction model with the entire sample and, rather than holding out a random subset of the data for testing, use cross-validation to estimate model optimism (Steyerberg et al., 2001). Estimating model performance in the entire sample, rather than in a separate validation set, also increases the precision of internal validation estimates. Using the entire sample for model development and validation maximizes power but can be computationally challenging with large datasets and memory-intensive modeling methods.

Our results show that using a visit-level split to divide training and test sets led to overestimating model performance in the development test set and did not improve performance in the prospective validation set relative to a person-level split. Optimism was particularly large for models fit with random forest, as the flexibility to capture many interactions enabled overfitting for events associated with visits in both the training and test set. Overfitting was likely also exacerbated in this example by the overlapping time frames for outcomes (90 days) and predictors (up to 5 years), such that many visits close in time may have had nearly identical predictor and outcome values. Overestimation of the sensitivity and PPV among the highest risk visits (Figures 1 and 2) was particularly alarming given the practical importance of accurately anticipating the performance of prediction models intended for clinical use. For example, a provider’s decisions about if and how to implement a clinical prediction model depend on correctly projecting its impact on patient outcomes. Accurate estimates of projected sensitivity and PPV are also key information when calculating sample size for a randomized trial to evaluate the impact of using a risk prediction model to guide care; the trial will be underpowered if classification accuracy is greatly overestimated. While the discrimination and classification accuracy of visit-level-split models in the prospective validation set were comparable to those of models using a person-level training/test split, a person-level sampling framework avoids overfitting and, thus, is recommended, particularly if using random forest.

We explored using a person-level split to define folds for cross-validation after dividing the training and test sets on the visit-level. This approach reduced overfitting and optimism because tuning parameters were selected that maximized performance in out-of-cluster predictions rather than training on observations correlated across folds. Person-level cross-validation selected a higher level of shrinkage for LASSO and larger minimum node size for random forest. Discrimination and classification accuracy of this approach in the prospective validation set was similar to sampling frameworks using a person-level training/test split for both random forest and LASSO.

We also considered within cluster resampling to limit the impact of clustering and informative cluster size on predictive performance. Within cluster resampling eliminated clustering from model training by performing estimation with a series of datasets sampling only one visit per person. Our study found that within cluster resampling provided only small benefits over observed cluster analysis. For LASSO, within cluster resampling improved discrimination in the prospective validation set. Within cluster resampling with random forest minimized differences in model calibration across cluster sizes. It is likely that the flexible, non-parametric approach offered by random forest aided calibration compared to models estimated with LASSO. While improvements seen with within cluster resampling were statistically significant, these differences are unlikely to be clinically meaningful. Prediction model training using within cluster resampling is computationally demanding but feasible in a research environment. Unfortunately, the technical demands of deploying a prediction model estimated with within cluster resampling in a clinical setting would likely be prohibitive, as it requires estimating and averaging predictions from many models. The benefits of within cluster resampling over standard methods would need to be substantial to warrant such an effort. As an alternative, one could consider inverse cluster size reweighting, which Williamson et al. (2003) demonstrated to be equivalent to within cluster resampling for regression models estimated with generalized estimating equations (Liang and Zeger, 1986). This approach is less computationally demanding for parametric regression models, though not all statistical software packages designed for computationally efficient estimation of regression models on large datasets accommodate weights. Inverse cluster size reweighting has not, to our knowledge, been examined for machine learning techniques like random forest, but could be explored as more prediction methods allow for weights in model estimation.
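
As a rough illustration of the reweighting alternative, per-visit weights of one over cluster size can be supplied directly to glmnet's `weights` argument. This is a sketch under the same assumptions as the earlier LASSO example, not the approach evaluated in this study.

```r
## Each visit is weighted by the inverse of its person's visit count,
## so every person contributes equal total weight to estimation.
cluster_size <- ave(rep(1, nrow(train)), train$person_id, FUN = sum)
w <- 1 / cluster_size

library(glmnet)
cv_fit_w <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                      weights = w, foldid = foldid, type.measure = "auc")
```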

In our application, cluster size was informative because most suicide attempts were observed in people with many visits. Our study found that accurate identification of highest risk visits was not diminished by using a person-level training and validation set split. When larger clusters are associated with increased risk of an event, improvements in predictions for smaller clusters should not come at the expense of performance in larger clusters. At the same time, ensuring adequate performance of a prediction model in smaller clusters is necessary to develop fair and equitable prediction models. While a smaller number of visits in this example may be due to less severe mental illness, fewer visits may also be more common for people with insufficient access to care. Statistical methods like those explored here are an important tool for creating prediction models that reduce, rather than perpetuate, health disparities.

Researchers typically use all available data for prediction model training and testing, rather than leaving out an additional prospective validation set to guide model selection. Our case study used a prospective validation set to compare sampling frameworks, and temporal validation sets can be useful to understand how a prediction model’s performance may change over time. However, using a prospective validation set is generally not recommended when estimating a prediction model intended for clinical use. Since clinical practice evolves over time, prediction models should be estimated using the most up-to-date data available. In situations where researchers do not use a prospective validation set to compare sampling frameworks, as done here, we recommend using a person-level split for dividing the training and test sets, especially when using a non-parametric method like random forest for estimation. This approach reduces optimism so that measures of model performance in the development test set better reflect future model performance.

This paper provides a roadmap for estimating and comparing prediction modeling approaches in the presence of clustering and informative cluster size. As this is a case study, determinations about the ideal sampling framework or prediction method made after following the steps outlined here may vary for other prediction targets and datasets. The impact of informative cluster size would likely be greater with an outcome that is rarer (for example, suicide death instead of suicide attempt) or with larger correlation within clusters, as would be expected for health care settings with more frequent visits. In these cases, there would be fewer unique events, particularly in smaller clusters, resulting in less statistical power to identify risk factors also associated with cluster size. With more common events or less correlation of outcomes within clusters, predictors associated with both event risk and cluster size (such as predictors measuring mental health care utilization) will be more likely to be selected for model inclusion and, as a result, the residual association between cluster size and event risk after conditioning on predictors will be smaller. Future simulation studies in this area could provide a more thorough examination of the relationship between sample size, event rate, cluster size, and the association between cluster size and outcome probability.

Supplementary Material

Supporting Information for Suicide Prediction and Informative Cluster Size

Acknowledgements

This project was supported by National Institute of Mental Health grant number U19MH092201. Dr. Coley was supported by grant number K12HS026369 from the Agency for Healthcare Research and Quality. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Agency for Healthcare Research and Quality.

Conflict of Interest

Dr. Simon has received research grant support from Novartis. Dr. Shortreed has been co-Investigator on grants awarded to Kaiser Permanente Washington Health Research Institute from Syneos Health, who is representing a consortium of pharmaceutical companies carrying out FDA-mandated studies regarding the safety of extended-release opioids. Dr. Coley, Mr. Walker, and Dr. Cruz report no financial relationships with commercial interests.

Footnotes

Supporting Information for this article is available from the author or on the WWW under http://dx.doi.org/10.1002/bimj.XXXXXXX

References

  1. Bakst SS, Braun T, Zucker I, Amitai Z, & Shohat T (2016). The accuracy of suicide statistics: are true suicide deaths misclassified? Social Psychiatry and Psychiatric Epidemiology, 51, 115–123.
  2. Benhin E, Rao JNK, & Scott AJ (2005). Mean estimating equation approach to analysing cluster-correlated data with nonignorable cluster sizes. Biometrika, 92, 435–450.
  3. Breiman L (1996). Some properties of splitting criteria. Machine Learning, 24, 41–47.
  4. Breiman L (2001). Random forests. Machine Learning, 45, 5–32.
  5. Charlson M, Szatrowski TP, Peterson J, & Gold J (1994). Validation of a combined comorbidity index. Journal of Clinical Epidemiology, 47, 1245–1251.
  6. Cox KL, Nock MK, Biggs QM, Bornemann J, Colpe LJ, Dempsey CL, … Schoenbaum M (2017). An examination of potential misclassification of army suicides: results from the Army Study to Assess Risk and Resilience in Servicemembers. Suicide and Life-Threatening Behavior, 47, 257–265.
  7. Efron B (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics, 7, 1–26.
  8. Efron B (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78, 316–331.
  9. Friedman J, Hastie T, & Tibshirani R (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1.
  10. Hanley JA, & McNeil BJ (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29–36.
  11. Hastie T, Tibshirani R, & Friedman J (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer-Verlag New York.
  12. Hirsch RP (1991). Validation samples. Biometrics, 47, 1193.
  13. Hoffman EB, Sen PK, & Weinberg CR (2001). Within-cluster resampling. Biometrika, 88, 1121–1134.
  14. Huang Y, & Leroux B (2011). Informative cluster sizes for subcluster-level covariates and weighted generalized estimating equations. Biometrics, 67, 843–851.
  15. Kroenke K, Spitzer RL, & Williams JB (2001). The PHQ-9: validity of a brief depression severity measure. Journal of General Internal Medicine, 16, 606–613.
  16. Liang K-Y, & Zeger SL (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
  17. Pavlou M, Seaman SR, & Copas AJ (2013). An examination of a method for marginal inference when the cluster size is informative. Statistica Sinica, 23, 791–808.
  18. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Available at http://www.R-project.org/.
  19. Seaman SR, Pavlou M, & Copas AJ (2014). Methods for observed-cluster inference when cluster size is informative: a review and clarifications. Biometrics, 70, 449–456.
  20. Shen CW, & Chen YH (2018). Model selection for semiparametric marginal mean regression accounting for within-cluster subsampling variability and informative cluster size. Biometrics, 74, 934–943.
  21. Simon GE, Shortreed SM, Johnson E, Rossom RC, Lynch FL, Ziebell R, … Penfold RB (2019). What health records data are required for accurate prediction of suicidal behavior? Journal of the American Medical Informatics Association, 26, 1458–1465.
  22. Simon GE, Coleman KJ, Rossom RC, Beck A, Oliver M, Johnson E, … Rutter C (2016). Risk of suicide attempt and suicide death following completion of the Patient Health Questionnaire depression module in community practice. The Journal of Clinical Psychiatry, 77, 221–227.
  23. Simon GE, Johnson E, Lawrence JM, Rossom RC, Ahmedani B, Lynch FL, … Shortreed SM (2018). Predicting suicide attempts and suicide deaths following outpatient visits using electronic health records. American Journal of Psychiatry, 175, 951–960.
  24. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, … Kattan MW (2010). Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology, 21, 128–138.
  25. Tibshirani R (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58, 267–288.
  26. Williamson JM, Datta S, & Satten GA (2003). Marginal analyses of clustered data when cluster size is informative. Biometrics, 59, 36–42.
  27. Williamson JM, Kim HY, Manatunga A, & Addiss DG (2008). Modeling survival data with informative cluster size. Statistics in Medicine, 27, 543–555.
  28. Wright M, & Ziegler A (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77.
