Author manuscript; available in PMC: 2021 Jul 12.
Published in final edited form as: Int J Eat Disord. 2021 Apr 2;54(7):1250–1259. doi: 10.1002/eat.23510

Prediction of eating disorder treatment response trajectories via machine learning does not improve performance versus a simpler regression approach

Hallie Espel-Huynh 1,2,3, Fengqing Zhang 1, J Graham Thomas 2,3, James F Boswell 4, Heather Thompson-Brenner 5, Adrienne S Juarascio 1, Michael R Lowe 1
PMCID: PMC8273095  NIHMSID: NIHMS1695553  PMID: 33811362

Abstract

Objective:

Patterns of response to eating disorder (ED) treatment are heterogeneous. Advance knowledge of a patient’s expected course may inform precision medicine for ED treatment. This study explored the feasibility of applying machine learning to generate personalized predictions of symptom trajectories among patients receiving treatment for EDs, and compared model performance to a simpler logistic regression prediction model.

Method:

Participants were adolescent girls and adult women (N = 333) presenting for residential ED treatment. Self-report progress assessments were completed at admission, discharge, and weekly throughout treatment. Latent growth mixture modeling previously identified three latent treatment response trajectories (Rapid Response, Gradual Response, and Low-Symptom Static Response) and assigned a trajectory type to each patient. Machine learning models (support vector, k-nearest neighbors) and logistic regression were applied to these data to predict a patient’s response trajectory using data from the first 2 weeks of treatment.

Results:

The best-performing machine learning model (evaluated via the area under the receiver operating characteristic curve [AUC]) was the radial-kernel support vector machine (AUC = 0.94). However, the more computationally intensive machine learning models did not improve predictive power beyond that achieved by logistic regression (AUC = 0.93). Logistic regression significantly improved upon chance prediction (mean null AUC = 0.50, SD = 0.01; p < .001).

Discussion:

Prediction of ED treatment response trajectories is feasible and achieves excellent performance; however, machine learning added little benefit over logistic regression. We discuss the need to explore how advance knowledge of expected trajectories may be used to plan treatment and deliver individualized interventions to maximize treatment effects.

Keywords: feasibility studies, feeding and eating disorders, health services research, machine learning, precision medicine, statistical methodology, support vector machine, treatment outcome

1 |. INTRODUCTION

Eating disorders (EDs) are associated with marked psychosocial impairment, high rates of medical morbidity, and elevated mortality risk (Arcelus, Mitchell, Wales, & Nielsen, 2011; Mitchell, 2016; Mitchell & Crow, 2006; Preti et al., 2009). Relative to other psychiatric conditions, EDs are characterized by poorer treatment response and prolonged chronicity (Abbate-Daga, Amianto, Delsedime, De-Bacco, & Fassino, 2013; Delinsky et al., 2010; Fichter, Quadflieg, & Hedlund, 2006, 2008). Further, patients exhibit heterogeneous outcomes in response to empirically supported ED treatments. In randomized controlled trials, existing treatments fail to provide clinically meaningful improvement and/or full symptom remission by end of treatment for at least 30% of adult patients across diagnostic groups (Fairburn et al., 2015; Linardon & Wade, 2018; Zipfel et al., 2014). Remission rates are even lower in community and private treatment settings (Linardon, Messer, & Fuller-Tyszkiewicz, 2018). Thus, a sizeable minority of patients do not experience clinically meaningful relief from symptoms, even after receiving costly and intensive treatments.

Patterns of symptom change during ED treatment also tend to be heterogeneous. For example, Hilbert et al. (2018) examined early treatment response trajectories (operationalized as binge eating frequency) among patients receiving outpatient cognitive behavioral therapy for binge ED. Four early change trajectories emerged, including low-level binge eating that remained stable throughout treatment, and low-, medium-, and high-level binge eating that decreased over time (Hilbert et al., 2018). In another study, Jennings, Gregas, and Wolfe (2018) examined latent treatment response patterns among inpatients with anorexia nervosa, operationalizing response via weight change. Outcome was best captured by a four-class model, which included groups experiencing “weight gain,” “treatment resistant,” “weight plateau,” and “weight fluctuate” patterns. In both studies, patients in different trajectory groups differed clinically at baseline and on symptom change at end of treatment. Thus, differential patterns of change on at least two ED symptom dimensions appear to be associated with differential clinical outcomes. However, these studies focused on single-diagnosis patient samples and utilized symptom dimensions pertinent only to certain subtypes of EDs.

Our group has examined latent trajectories of symptom change during treatment in a transdiagnostic sample of patients receiving residential care for EDs (Espel-Huynh, Zhang, et al., 2020). In this case, symptoms were measured via a self-report progress monitoring assessment, the Progress Monitoring Tool for EDs (PMED). The PMED is tailored to the unique aspects of structured, 24-h treatment provided in residential care for EDs and assesses cognitive and behavioral symptoms aggregated across five domains: Weight and Shape Concern, ED Urges and Behaviors, Emotion Avoidance, Adaptive Coping (reverse-scored), and Relational Connection (reverse-scored). In our group’s latent trajectory analysis involving N = 360 patients who completed the assessment weekly throughout treatment, a three-class trajectory model emerged: “Gradual Response” (58%), characterized by high-severity symptoms at admission that improved in a linear fashion through discharge; “Rapid Response” (24%), with steep, rapid improvements early in treatment that were maintained through discharge; and “Low-Symptom Static Response” (18%), characterized by nearly nonclinical symptoms reported at admission that remained stable throughout treatment (Espel-Huynh, Zhang, et al., 2020). Some have discussed the limitations of categorizing patients into discrete groups, given that treatment response may ultimately occur on a continuum of varying patterns (Bauer, 2007; Sher, Jackson, & Steinley, 2011). In this case, however, the latent groups differed on important outcomes, supporting the notion that they are clinically distinct. Patients in the Low-Symptom Static group entered treatment with significantly lower ED psychopathology and frequency of laxative misuse/vomiting than the other two groups, and were also more likely to have a low BMI. Regarding clinical treatment outcomes, all groups differed on length of stay (Rapid Response < Low-Symptom < Gradual Response) and total reduction in ED symptoms during treatment (Rapid Response > Gradual Response > Low-Symptom). Together, these findings suggest that patients receiving transdiagnostic care for a primary ED across a range of symptom profiles exhibit heterogeneous treatment response trajectories, which are associated with clinically meaningful differences in presenting concerns and outcomes.

Patients in these groups may also have unique clinical needs during treatment. For example, patients following a Rapid Response course may require greater therapeutic emphasis on maintenance of early gains and/or may be ready for discharge earlier than patients with other response patterns. Patients in the Low-Symptom group may benefit from a greater focus on increased symptom insight and/or enhanced motivation. Providing clinicians with information on a patient’s expected trajectory early on could allow them to tailor treatment using an empirically-driven, “precision medicine” approach.

However, tailoring treatment requires that a patient’s expected trajectory be known well before the end of treatment, and our ability to predict outcomes in ED treatment is currently limited. The only concrete outcome prediction model we know of comes from family-based therapy for adolescents, in which early weight gain thresholds reliably predict treatment outcome (Le Grange, Accurso, Lock, Agras, & Bryson, 2014). Vall and Wade (2015) identified several predictors related to ED treatment outcome, but all were evaluated continuously and yielded relatively small effect sizes. While potentially informative about mechanisms, these results provide no concrete means of anticipating differential outcomes or trajectories in advance, and thus cannot be used in an applied setting to inform treatment planning and augment care.

Discrete predictions could be achieved with machine learning. Machine learning refers to a powerful family of mathematical algorithms that can predict complex behavioral outcomes with high precision (James, Witten, Hastie, & Tibshirani, 2013). These empirically derived methods tend to outperform both clinically derived cutoff scores and more traditional inferential statistical approaches (Kessler et al., 2016; Lutz, Lambert, et al., 2006). Data-driven prediction via machine learning has shown promise in forecasting clinical trajectories in other areas of psychopathology, including post-traumatic stress disorder symptom trajectories following a traumatic event (Galatzer-Levy, Karstoft, Statnikov, & Shalev, 2014), persistence of depressive symptoms over time (Kessler et al., 2016), and treatment nonresponse in general outpatient psychotherapy (Lutz, Lambert, et al., 2006; Lutz et al., 2005). More recently, Lutz, Rubel, Schwartz, Schilling, and Deisenhofer (2019) developed a machine learning-based “treatment navigator” program that predicts patients’ risk for poor outcomes in general outpatient psychotherapy and provides decision support to clinicians to facilitate personalized patient care and improved outcomes. Given this demonstrated utility in other domains of psychopathology, we expected that machine learning might also prove useful for predicting symptom trajectories in residential ED treatment. However, we are aware of no prior research that has applied machine learning to predict such outcomes.

2 |. STUDY OBJECTIVES

The aim of this study was to explore the potential of machine learning to accurately predict a patient’s expected response trajectory in residential ED treatment using only early outcome data (i.e., within the first 2 weeks of treatment), and compare its performance to simpler predictive approaches. Ultimately, the goal was to identify a model with good predictive performance and maximal parsimony. Once identified, this model may be used in the future to provide clinicians with real-time information on patients’ expected response patterns.

We selected one primary machine learning approach, the support vector machine (SVM), based upon its prior success with trajectory prediction in PTSD (Galatzer-Levy et al., 2014) and its potential clinical utility for the patient population of interest. SVM performs especially well when separating groups with overlapping distributions (Guyon, Weston, Barnhill, & Vapnik, 2002; James et al., 2013), which were present in the latent trajectory data (Espel-Huynh, Zhang, et al., 2020). Importantly, complex predictive methods should be used only when simpler approaches are insufficient. This is particularly true for models designed to support patient care among ED clinicians, who have limited time to interpret such models and carry a high cognitive and emotional burden in their work (Warren, Schafer, Crowley, & Olivardia, 2012). For this reason, we also evaluated two simpler models: (a) k-nearest neighbors, a simpler machine learning approach used by Lutz et al. (2019) for outcome prediction and treatment planning, and (b) multinomial logistic regression, which does not involve machine learning. We aimed to select the least complex model that still optimized predictive performance.

3 |. HYPOTHESES

We hypothesized that SVM would produce a model whose prediction accuracy was significantly better than chance (when validated on test data not used to train the model), that it would outperform the k-nearest neighbors and logistic regression approaches, and that it would perform comparably to other models developed to predict general psychotherapy outcome (Lutz, Lambert, et al., 2006; Lutz, Saunders, et al., 2006).

4 |. METHODS

4.1 |. Participants & procedures

Participants were adolescent girls and adult women (N = 333) presenting for residential treatment of a primary ED at one of two facilities in the United States. At these facilities, data are collected from all patients as part of a routine assessment battery for internal quality assurance. The assessment protocol consists of a longer self-report assessment battery at admission and discharge and a brief progress assessment administered weekly throughout treatment (see Espel-Huynh, Zhang, et al., 2020, for more detail).

At admission, all patients are approached by facility research staff and led through an informed consent process that includes the option to opt in to having their deidentified data included in research projects. Only participants who consented to research participation were included here. All research activities were approved by the Institutional Review Boards of the treatment program and Drexel University.

4.1.1 |. Data preparation

The data used here were drawn from the dataset used for trajectory analysis in Espel-Huynh, Zhang, et al. (2020). Data were gathered from all consenting patients admitted on or after February 1, 2016, and discharged before June 29, 2017. In total, 1,055 patients were admitted to one of the residential treatment programs during the data collection time frame. Of these, 1,009 consented to have their deidentified data included in research, and deidentified data were available from 920 admissions during the data collection period. To ensure data adequacy for trajectory analysis in the prior study (Espel-Huynh, Zhang, et al., 2020), patients were excluded if they had a length of stay of less than 21 days (n = 139; required for trajectory analyses), failed to complete the admission assessment (n = 167), failed to complete the discharge assessment (n = 55), had fewer than four observations available for curvilinear trajectory analyses (n = 188; a known trajectory was required for machine learning prediction), or were readmitted during the data collection period (n = 11). In addition, two cases were excluded because a technical survey software error caused more than 10% of item-level responses to be missing at one of the two time points (T0, the baseline assessment, or T1, the second assessment). To facilitate early trajectory prediction, cases were also required to have completed the T1 assessment within 5–14 days of the admission assessment; 25 cases were excluded for this reason.

This resulted in a total of N = 333 cases included for analysis. These 333 cases were then randomly split into training (n = 250; used for model optimization) and test (n = 83; used for a single validation of performance on “unseen” data not used to train the models) subsets. Of note, excluded cases did not differ from those included on age, race/ethnicity, diagnostic distribution, or ED symptom severity at baseline (ps > .22).
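For illustration, a minimal R sketch of such a split, assuming a hypothetical data frame pmed_data with a trajectory factor and an arbitrary seed (the paper does not report the splitting function used; caret's createDataPartition stratifies by class, whereas a purely random split could instead use sample()):

```r
library(caret)

set.seed(123)  # arbitrary seed, for illustration only
# Stratified random split into ~250 training and ~83 test cases;
# 'pmed_data' and its 'trajectory' column are hypothetical names
train_idx <- createDataPartition(pmed_data$trajectory, p = 250 / 333,
                                 list = FALSE)
train <- pmed_data[train_idx, ]   # model optimization
test  <- pmed_data[-train_idx, ]  # held-out, single validation
```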

4.2 |. Measures

4.2.1 |. Time

Time in weeks since admission was measured from the date on which each assessment was completed. The date of admission assessment completion was set as T = 0 (on average 1.56 days postadmission; SD = 1.18). The last assessment was completed an average of 1.17 days before discharge (SD = 1.41).

4.2.2 |. Transdiagnostic symptom severity

Symptom severity was assessed using the Progress Monitoring Tool for EDs (PMED), a 26-item measure of symptoms related to treatment outcome in residential treatment across five domains: Weight and Shape Concern, ED Behaviors/Urges, Emotion Avoidance, Adaptive Coping, and Relational Connection (Espel-Huynh, Thompson-Brenner, et al., 2020). Patients rate the extent to which each item has applied to them in the past week on a 5-point Likert scale (1 = “Never”; 5 = “Always”). Total scale scores range from 26 to 130, with higher scores indicating greater symptom severity. The PMED was administered at admission, discharge, and approximately every 7 days during treatment to capture symptom progress throughout the full course of a patient’s stay. Scale reliability was in the good to excellent range at admission (α = .87, 95% CI [0.86, 0.89]) and discharge (α = .92, 95% CI [0.91, 0.93]) in this sample.

4.2.3 |. Treatment trajectory class

Patients were classified into one of three latent trajectory classes based on the posterior predicted probability of class membership generated from prior latent growth mixture modeling analyses (Espel-Huynh, Zhang, et al., 2020). Patients were assigned to the class for which their predicted probability was highest. Trajectory groups included Rapid Response (24%), Gradual Response (58%), and Low-Symptom Static Response (18%). Given that strength of membership in these latent groups occurred on a continuum (i.e., some patients had a very high predicted probability for only one group, whereas others had lower overall predicted probabilities or similar probabilities for more than one group), we considered regression-based prediction of the predicted probability values versus classification of the assigned trajectory class. However, given the desire to generate a prediction model with maximal interpretability and utility for clinicians, we concluded that the classification model was preferable for practical utility.
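Modal class assignment of this kind reduces to a row-wise argmax over the posterior probability matrix; a brief sketch, assuming a hypothetical N × 3 matrix post_probs with one column per latent class:

```r
# 'post_probs' (hypothetical): posterior class-membership probabilities
# from the latent growth mixture model, one column per class
classes  <- c("Gradual Response", "Rapid Response", "Low-Symptom Static Response")
assigned <- factor(classes[max.col(post_probs)], levels = classes)  # modal class
```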

4.2.4 |. Other patient variables

Other pertinent data were obtained from patients’ medical charts and admission assessment. Variables included age and primary ED diagnosis (as assigned by patients’ treating psychiatrist). Race and ethnicity were obtained via self-report.

4.3 |. Analyses

4.3.1 |. Machine learning feature generation

A set of input variables, or model features, was derived from data collected at admission (T0) and the first weekly assessment (T1). Features included item-level PMED responses, PMED subscale scores, aggregate sum totals of the psychopathology-oriented subscales (Weight/Shape Concern, ED Behaviors, and Emotion Avoidance), aggregate sums of the adaptive functioning scales (Adaptive Coping and Relational Connection), and PMED total scores at each time point. Slopes and percent change from T0 to T1 were also calculated. SVM allows for inclusion of feature sets with high intercorrelations; related but distinct features may be complementary, enhancing one another’s predictive power in the classification model (Guyon & Elisseeff, 2003). Data were then randomly split into training (n = 250) and test (n = 83) subsamples, and the training data were standardized and centered.
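A sketch of the change-score features and standardization step, assuming hypothetical column names (pmed_t0, pmed_t1, weeks_elapsed) and a hypothetical vector feature_cols naming all features:

```r
# Early-change features from the first two assessments
train$slope      <- (train$pmed_t1 - train$pmed_t0) / train$weeks_elapsed
train$pct_change <- 100 * (train$pmed_t1 - train$pmed_t0) / train$pmed_t0

# Center and scale using training data only, then apply the same
# transformation to the test set to avoid information leakage;
# 'feature_cols' (hypothetical) includes the three features above
pp        <- preProcess(train[, feature_cols], method = c("center", "scale"))
train_std <- predict(pp, train[, feature_cols])
test_std  <- predict(pp, test[, feature_cols])
```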

4.3.2 |. Machine learning analysis

Multiclass SVM was applied, with all features included, using the caret package in R. Models were trained with linear, polynomial, and radial basis function kernel algorithms, and each model type was trained with varying tuning parameter values to optimize the model (Kuhn, 2008; see Supporting Information for a hyperparameter training summary). Model fitting on the training data involved 10-fold cross-validation, repeated five times, with 20 random starts to select optimal tuning parameters. Severely imbalanced class frequencies are known to compromise classification accuracy (He & Garcia, 2009), and a slight imbalance in case frequencies was present in our data (i.e., 24% Rapid Response, 58% Gradual Response, and 18% Low-Symptom Static Response). To account for this imbalance in the training data, oversampling was applied to balance the number of cases in each latent class, using the “upSample” function in caret (Akbani, Kwek, & Japkowicz, 2004; He & Garcia, 2009; Kuhn, 2008). For each of the two less-frequent classes (i.e., Rapid Response and Low-Symptom), the oversampling approach involved random sampling with replacement from the training cases until the number of observations equaled that of the most common class (Gradual Response). Given the relatively minor degree of class imbalance present, models were also run without oversampling. No oversampling was applied to the test data. The best-performing machine learning model was also re-run with only three predictors to ensure that performance differences between logistic regression and machine learning models were not solely attributable to the use of different model features (see Section 4.3.3).
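A hedged sketch of this caret workflow (the exact tuning grids and the 20 random starts are omitted here; the actual hyperparameters appear in the authors' Appendix S1):

```r
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# Oversample the two less-frequent classes in the training data only
train_up <- upSample(x = train_std, y = train$trajectory, yname = "trajectory")

# Radial basis function SVM; method = "svmLinear", "svmPoly", or "knn"
# would fit the other models reported. tuneLength sets the size of the
# grid searched over sigma and C.
svm_radial <- train(trajectory ~ ., data = train_up,
                    method = "svmRadial", trControl = ctrl, tuneLength = 10)
```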

4.3.3 |. Logistic regression analysis

Given the desire to select a model with optimal predictive power, parsimony, and interpretability in clinical practice, multiclass logistic regression was applied to the training data as a simpler comparison approach (nnet package; Ripley & Venables, 2020). Gradual Response was identified as the reference class. Three predictors were selected, informed primarily by prior research on early rapid response as a predictor of outcome (Linardon, Brennan, & de la Piedad Garcia, 2016; MacDonald, Trottier, McFarlane, & Olmsted, 2015): baseline severity on the PMED (T0), percent change between T0 and T1, and slope of change from T0 to T1. The logistic model was also run with all 80 SVM features to ensure that observed differences in model performance were not solely attributable to differences in predictors included. As with the machine learning models, the logistic regression model was “trained” using training data only, and model performance was evaluated using the test dataset.
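A minimal sketch of the three-predictor multinomial model via nnet::multinom, carrying over the hypothetical variable names from the feature sketch above:

```r
library(nnet)

# Add the outcome to the standardized features and set Gradual Response
# as the reference class
train_fit <- transform(train_std,
                       trajectory = relevel(train$trajectory,
                                            ref = "Gradual Response"))

logit_fit <- multinom(trajectory ~ pmed_t0 + pct_change + slope,
                      data = train_fit, trace = FALSE)

# Predictions on the held-out test data only
pred_class <- predict(logit_fit, newdata = test_std)                  # labels
pred_probs <- predict(logit_fit, newdata = test_std, type = "probs")  # probabilities
```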

4.3.4 |. Model evaluation and selection

Model performance was evaluated via the area under the receiver operating characteristic curve (AUC), which is generally robust to class imbalance (Ferri, Hernández-Orallo, & Modroiu, 2009; Haixiang et al., 2017). AUC was computed using the pROC package in R, with the generalization for multiclass problems outlined by Hand and Till (2001). To provide context for each classifier’s performance within each trajectory class, we examined precision (for a given class, the proportion of “positive” predictions that were correct, reflecting the absence of false positives) and recall (for a given class, the proportion of cases truly belonging to that class that were identified correctly).
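A sketch of these metrics, assuming the objects from the previous sketches (recent versions of pROC accept a matrix of class probabilities for the Hand-and-Till multiclass AUC):

```r
library(pROC)

# Hand & Till (2001) multiclass AUC from the class-probability matrix
mc_auc <- multiclass.roc(response = test$trajectory, predictor = pred_probs)
as.numeric(mc_auc$auc)

# Per-class precision and recall from the test-set confusion matrix
cm <- caret::confusionMatrix(data = pred_class, reference = test$trajectory)
cm$byClass[, c("Precision", "Recall")]
```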

The best-performing model was then evaluated further to compare its performance to chance prediction. A permutation test was used to determine whether the AUC achieved was significantly greater than what the model would likely achieve if patient trajectory classes were randomly labeled and the data resubmitted for analysis (Golland & Fischl, 2003; Pereira, Mitchell, & Botvinick, 2009). An AUC exceeding 95% of the values from the 1,000 permuted models indicates that the model performed significantly better than chance at an alpha level of .05. The permutation test does not require the assumption that data in each class are normally distributed, nor that the populations from which the sample was drawn have equal variances (Golland & Fischl, 2003).
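A sketch of the permutation test under these assumptions: trajectory labels are shuffled, the model is refit, and the test AUC is recorded 1,000 times (observed_auc is a hypothetical name for the AUC obtained with the true labels):

```r
set.seed(123)  # arbitrary
null_auc <- replicate(1000, {
  perm <- train_fit
  perm$trajectory <- sample(perm$trajectory)  # sever the label-feature link
  fit <- nnet::multinom(trajectory ~ pmed_t0 + pct_change + slope,
                        data = perm, trace = FALSE)
  probs <- predict(fit, newdata = test_std, type = "probs")
  as.numeric(pROC::multiclass.roc(test$trajectory, probs)$auc)
})

# One-sided permutation p value: share of null AUCs at or above observed
p_perm <- mean(null_auc >= observed_auc)
```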

5 |. RESULTS

5.1 |. Patient characteristics

Patients’ average age was 25.68 years (SD = 10.92), with a mean self-reported illness duration of 10.82 years (SD = 10.69). The sample was primarily White (90.0%), with the remainder identifying as Asian (2.4%), African American (1.2%), or Other/Multiracial (6.4%). The proportion of the sample identifying their ethnicity as Hispanic/Latina was 4.2%. The average length of stay was 38.07 days (SD = 13.06).

Most patients were treated for bulimia nervosa (28.4%), anorexia nervosa, restricting type (26.8%), anorexia nervosa, binge/purge type (13.9%), or other specified feeding or ED (23.8%). A small proportion were treated for binge ED (4.8%) or avoidant-restrictive food intake disorder (1.8%).

For included cases, T1 was completed an average of 28.44 (SD = 13.42) days or 4.06 weeks prior to discharge. Time spent in treatment leading up to and including completion of the T1 assessment accounted for an average of 33.4% (SD = 13.5%) of patients’ total length of stay.

5.2 |. Model performance and selection

Table 1 summarizes performance metrics for all final models. Oversampling of less-common classes in the training data did not improve performance, so those models are omitted; interested readers may contact the corresponding author for results. Optimized parameters for all final models, plus tuning parameters tested during model training, are summarized in Supporting Information (S1). Contrary to our hypotheses, multinomial logistic regression with three predictors outperformed all machine learning models that included the full feature set (AUC = 0.93). See Table 1 for a full summary of performance. To ensure the improved performance was not simply due to noise reduction with fewer features, we then retested the best-performing SVM model (radial basis function kernel) with the same three predictors as in the final logistic model. This model performed comparably to the logistic model (AUC = 0.94; optimized at σ = 0.01456 and C = 11.32814). Although confidence intervals are not available for multiclass AUC in R, we examined confidence intervals for the accuracy of the two models and found them to be overlapping (SVM accuracy 95% CI [0.80, 0.95]; logistic regression accuracy 95% CI [0.79, 0.94]); thus, it is unlikely that the difference in performance between logistic regression and SVM was statistically significant. To prioritize parsimony and interpretability, the logistic model was selected as the final, optimal model. In a permutation test with 1,000 repetitions, model AUC exceeded that of a multinomial logistic regression model built with randomly labeled data in all 1,000 repetitions (null distribution mean AUC = 0.50, SD = 0.01, versus 0.93 with the actual data; p < .001). These results suggest that model performance was significantly better than expected by chance.
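The overlapping accuracy intervals can be reproduced from caret's confusionMatrix output, which reports an exact binomial 95% CI for overall accuracy; a sketch, assuming hypothetical predicted-class vectors pred_svm and pred_logit from the two models on the test set:

```r
# Exact binomial (Clopper-Pearson) 95% CIs for overall accuracy
cm_svm   <- confusionMatrix(pred_svm,   test$trajectory)
cm_logit <- confusionMatrix(pred_logit, test$trajectory)
cm_svm$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")]
cm_logit$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")]
```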

TABLE 1.

Test prediction accuracy statistics for all models evaluated

| Statistic | Linear SVM | Polynomial SVM | Radial SVM, all features | Radial SVM, three features | kNN | Logistic regression, all features | Logistic regression, three features (final model) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ROC AUC (a) | .85 | .82 | .86 | .94 | .76 | .71 | .93 |
| Accuracy | .78 | .83 | .86 | .89 | .83 | .66 | .88 |
| Prec./Rec. (GR) | .85/.82 | .88/.86 | .88/.90 | .94/.88 | .82/.94 | .81/.70 | .92/.90 |
| Prec./Rec. (RR) | .61/.70 | .71/.85 | .73/.80 | .81/.85 | .79/.75 | .52/.60 | .80/.80 |
| Prec./Rec. (LS) | .83/.77 | .90/.69 | 1.00/.77 | .87/1.00 | 1.00/.54 | .47/.62 | .86/.92 |

Note: Logistic regression with three features (rightmost column) was the final model selected for optimal performance via AUC.

(a) Used as the final determining metric for model performance evaluation.

Abbreviations: GR, Gradual Response; kNN, k-nearest neighbors; LS, Low-Symptom Static Response; Prec., precision; Rec., recall; ROC AUC, area under the receiver operating characteristic curve; RR, Rapid Response; SVM, support vector machine.

In the logistic model, precision was highest for the Gradual Response group (0.92), indicating that very few model-predicted Gradual Response cases actually belonged to a different group (i.e., few false positives). False positives occurred for one-fifth or fewer of the Rapid Response and Low-Symptom predictions (precision = 0.80 and 0.86, respectively). Recall was highest for the Low-Symptom group (0.92), indicating that the model produced proportionally fewer false negatives for this group than for the Rapid Response (0.80) and Gradual Response (0.90) groups. See Table 2 for a confusion matrix summarizing test data misclassifications for the logistic model. Table 3 summarizes the regression weights for each of the three model predictors tested. The final logistic regression prediction model is included as a supplemental file with this manuscript and can be implemented on new test data in R software. Readers may contact the corresponding author for instructions on model use.

TABLE 2.

Confusion matrix for the logistic regression model (test data; rows are predicted classes, columns are true classes)

| Predicted class | Gradual Response (true) | Rapid Response (true) | Low-Symptom Static Response (true) |
| --- | --- | --- | --- |
| Gradual Response | 45 | 4 | 0 |
| Rapid Response | 3 | 16 | 1 |
| Low-Symptom Static Response | 2 | 0 | 12 |
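The precision and recall values above follow directly from Table 2; for example, precision for Gradual Response is 45/(45 + 4 + 0) = 0.92. A short sketch of the arithmetic:

```r
# Table 2 counts: rows = predicted class, columns = true class
cm <- matrix(c(45,  4,  0,
                3, 16,  1,
                2,  0, 12),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("GR", "RR", "LS"),
                             true      = c("GR", "RR", "LS")))

precision <- diag(cm) / rowSums(cm)  # GR .92, RR .80, LS .86
recall    <- diag(cm) / colSums(cm)  # GR .90, RR .80, LS .92
```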

TABLE 3.

Parameter estimates for the final logistic regression model

| Independent variable | Rapid Response, b (SE) | Rapid Response, RR | Low-Symptom Static Response, b (SE) | Low-Symptom Static Response, RR |
| --- | --- | --- | --- | --- |
| Intercept | −1.28 (0.29)*** | | −4.53 (0.91)*** | |
| Baseline severity (PMED) | −1.87 (0.46)*** | 0.15 | −6.81 (1.18)*** | 0.001 |
| Slope of change, week 1 → week 2 | −0.24 (0.57) | 0.79 | −1.53 (1.36) | 0.22 |
| % change in severity, week 1 → week 2 | −2.72 (0.65)*** | 0.07 | −0.24 (1.18) | 0.78 |

Note: The Gradual Response group was the reference group. *p < .05; **p < .01; ***p < .0001.

Abbreviations: PMED, Progress Monitoring Tool for Eating Disorders; RR, relative risk.

6 |. DISCUSSION

This study explored the potential utility of machine learning to predict symptom trajectories among patients receiving residential treatment for primary EDs and considered its merits relative to a simpler multinomial logistic regression approach. We hypothesized that an SVM model would demonstrate optimal performance beyond that achieved by simpler approaches. All SVM models performed similarly well, with AUCs ranging from 0.82 to 0.94 (good to excellent range; Hosmer, Lemeshow, & Sturdivant, 2013). This translates to approximately 80–90% of test cases correctly classified. Contrary to our hypotheses, however, logistic regression outperformed nearly all machine learning models tested, including k-nearest neighbors, linear SVM, polynomial SVM, and radial basis function SVM when all 80 model features were included. When the best-performing machine learning model, radial SVM, was retested with only three predictors, it only slightly surpassed the performance of logistic regression. Given the immense added complexity and reduced interpretability of SVM, we conclude that logistic regression is the model of choice. Model performance observed here is similar to or better than that of other machine learning models used to predict treatment outcome and premature termination in outpatient psychotherapy settings, which have ranged in overall accuracy from 59% to 73% (Lutz, Lambert, et al., 2006; Lutz, Saunders, et al., 2006). Results therefore indicate that, although machine learning has the potential to produce high-performing predictive models of ED treatment outcome, in this case it is not preferred over a simpler logistic regression approach. Thus, future work seeking to use machine learning to enhance prediction in clinical ED treatment should first consider simpler comparison models.

In this analysis, a model with only three predictors (baseline severity, percent change in the first 2 weeks, and slope of change in the first 2 weeks) performed better than a model with 80 features. Although including several potentially redundant features in SVM is common and can add predictive “signal,” it can also increase “noise,” and the latter appears to have occurred here. There may be rapidly diminishing returns from adding complexity to the models with additional predictors.

It is perhaps notable that three early treatment response variables, all relatively easy to compute, are largely sufficient to predict the general trajectory a patient is likely to follow in treatment. Many other candidate predictors of outcome have been tested extensively in prior research, including duration of illness and comorbidity (Vall & Wade, 2015). Such predictors provide crucial guidance on possible mechanisms underlying differential treatment response. However, when the goal is simply to anticipate expected response, one can achieve excellent accuracy without these variables, using low-burden self-report data alone. This result is therefore promising for future clinical applications of trajectory prediction to guide personalized interventions.

In this sample, the model predictions were made when patients had, on average, approximately 4 weeks of treatment remaining. If applied in clinical practice and integrated into a decision support tool similar to a “treatment navigator” (Delgadillo et al., 2018; Lutz et al., 2019), this model could yield feedback early enough in a treatment episode to allow clinicians to tailor treatment based upon the patient’s expected trajectory. Adjustments to the plan of care could likely be incorporated well before discharge.

6.1 |. Strengths and limitations

This study is the first to apply advanced predictive analytic techniques to model ED treatment response trajectories and highlights the importance of considering simpler alternatives to complex machine learning approaches. This study is strengthened by a large sample size for the patient population and setting. Further, the weekly data allowed for prediction of treatment response trajectories using very early treatment data. This is important in intensive ED treatment, where treatment durations are typically briefer (Thompson-Brenner, Boswell, et al., 2018). Given the short treatment time frame, very early trajectory predictions are needed to facilitate feedback upon which clinicians can act before a patient is discharged, and this can only be achieved with frequent assessment.

Important limitations must be considered. First, the prediction model relied upon self-report data to predict self-reported outcomes; the degree to which patients’ perceptions of symptom severity align with their treating clinicians’ evaluations is therefore unknown. Predictive models for other, objectively measured outcomes, such as adherence to prescribed meal plans during treatment and/or weight change (if a patient is underweight or weight-suppressed upon admission), could improve clinical utility. Second, patients were from a female-only facility that serves a primarily White and affluent patient population presenting with severe, chronic eating pathology. Tailoring to this specific treatment center was prioritized to maximize future clinical utility in this care setting, and the model’s readiness for use in this context may be considered a strength. However, model generalizability to other treatment settings and more diverse patient populations is unknown and must be tested in future research. Our analytic approach also required exclusion of a large portion of the total patient population, including those with shorter lengths of stay; thus, performance for patients not meeting these criteria is unknown. Finally, the small test subsample limited our ability to examine whether the model performs more accurately for patients with certain diagnoses and symptom profiles than others. Future analyses with larger samples should explore this possibility.

6.2 |. Clinical implications

Before this model is used in clinical practice, the potential risks and benefits of using predictive models to tailor and modify treatment must be considered carefully. Although the final model was highly accurate, current performance estimates suggest that approximately 1 in 10 patients would receive an incorrect trajectory prediction, and treatment could be tailored inappropriately for those patients. For example, providing Rapid Response treatment to a patient who is a true Gradual Responder could lead to an overly ambitious progression of skills practice that overwhelms the patient. To address this concern, model predictions could be accompanied by a confidence estimate (e.g., predicted probability) conveying the likelihood that the prediction accurately captures a patient’s expected course, along with guidance for clinicians to treat the prediction as one of many sources of clinical data informing treatment. In addition, it is important to note that the class membership predicted in this study represents a latent phenomenon that may occur on a continuum rather than in discrete categories. Patients inevitably vary in the extent to which their symptom trajectories align with one of the three latent classes; even some accurately classified patients may have trajectories that deviate substantially from the class “averages,” and those misclassified by the models may simply be patients whose trajectories varied more widely than others’.
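Attaching such a confidence estimate is straightforward with a probabilistic classifier; a sketch using the multinomial model from Section 4.3.3, assuming a hypothetical one-row data frame new_pt containing the three predictors:

```r
# Class probabilities for one new patient; the maximum probability
# serves as a rough confidence estimate to report alongside the
# predicted class ('new_pt' is hypothetical)
probs <- predict(logit_fit, newdata = new_pt, type = "probs")
data.frame(predicted_class = names(which.max(probs)),
           confidence      = round(max(probs), 2))
```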

Finally, this model predicts expected trajectory but gives no information about how to tailor treatment based upon these predictions. For example, a patient classified as a Rapid Responder may be discharged earlier than a Gradual Responder. However, rather than being ready for early discharge, it is possible Rapid Response patients require a different therapeutic focus during treatment—perhaps one emphasizing postdischarge planning and more in vivo practice in the home environment prior to discharge. Future randomized controlled trials should explore whether interventions tailored to a patient’s expected trajectory type enhance outcomes. For example, one might test whether earlier treatment stepdown versus more in vivo practice improves outcomes for Rapid Response patients versus treatment as usual, or whether motivation enhancement versus an emphasis on emotional awareness and reducing avoidance is more effective for Low-Symptom Static Response patients.

7 |. CONCLUSIONS AND FUTURE DIRECTIONS

This project was the first to evaluate the utility of a predictive algorithm that is closely tailored to the symptoms and concerns of patients in intensive ED treatment. This is also one of the first studies to examine the utility of such a prediction model in a residential treatment center, and the first to use novel machine learning techniques to predict one of multiple heterogeneous trajectories a patient may follow during treatment.

Future work may examine whether the model’s performance can be improved further. For example, other machine learning models (e.g., regularized regression, boosting, naïve Bayes) with additional model refinement methods (e.g., recursive feature elimination) could incrementally improve performance. One might also examine whether similar performance could be achieved with only 1 week of data rather than two. It is also important to examine whether the model is needed at all, or whether clinicians’ own predictions of patient treatment trajectories perform comparably (although results from Hannan et al., 2005, suggest this is unlikely).

Results from this project may be used to directly improve the quality of clinical care provided in residential ED treatment facilities. Though the details are beyond the scope of this manuscript, this project was designed in partnership with clinical leadership at the treatment facility under study and was intended to have high relevance to the manualized protocol currently used at the program’s treatment sites (Thompson-Brenner, Brooks, et al., 2018). The items in the self-report PMED measure used here were developed with theory and methodology tailored to the unique characteristics of patients with EDs in this setting (see Espel-Huynh, Thompson-Brenner, et al., 2020, for details on measure development considerations). In the future, successful integration into clinical practice has the potential to facilitate more personalized treatment planning and improved outcomes, with important implications for improving efficiency and reducing costs of residential ED treatment. In sum, results from this project have the potential to inform and enhance personalized treatment in a pragmatically focused setting, and to advance the literature in the growing field of progress monitoring and treatment response prediction.

Supplementary Material

Appendix S1

ACKNOWLEDGMENTS

The authors wish to thank the patients, clinicians, and administrative staff at the Renfrew Centers for their participation. This research would not have been possible without their partnership. This study was funded by the National Heart, Lung, and Blood Institute (T32 HL076134).

Funding information

National Heart, Lung, and Blood Institute, Grant/Award Number: T32 HL076134

Footnotes

CONFLICT OF INTEREST

Drs. Boswell, Espel-Huynh, Lowe, and Thompson-Brenner served as research consultants to The Renfrew Centers during the time this research was conducted. Dr. Thompson-Brenner continued to serve as a research consultant at the time of manuscript submission. Dr. Thomas has no conflicts of interest to disclose.

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of this article.

DATA AVAILABILITY STATEMENT

Data are not publicly available.

REFERENCES

  1. Abbate-Daga G, Amianto F, Delsedime N, De-Bacco C, & Fassino S (2013). Resistance to treatment in eating disorders: A critical challenge. BMC Psychiatry, 13, 294. 10.1186/1471-244X-13-294
  2. Akbani R, Kwek S, & Japkowicz N (2004). Applying support vector machines to imbalanced datasets. Paper presented at the European Conference on Machine Learning.
  3. Arcelus J, Mitchell AJ, Wales J, & Nielsen S (2011). Mortality rates in patients with anorexia nervosa and other eating disorders: A meta-analysis of 36 studies. Archives of General Psychiatry, 68(7), 724–731. 10.1001/archgenpsychiatry.2011.74
  4. Bauer DJ (2007). Observations on the use of growth mixture models in psychological research. Multivariate Behavioral Research, 42(4), 757–786. 10.1080/00273170701710338
  5. Delgadillo J, de Jong K, Lucock M, Lutz W, Rubel J, Gilbody S, … Nevin J (2018). Feedback-informed treatment versus usual psychological treatment for depression and anxiety: A multisite, open-label, cluster randomised controlled trial. The Lancet Psychiatry, 5(7), 564–572.
  6. Delinsky S, Germain SS, Thomas J, Craigen KE, Fagley W, Weigel T, … Becker A (2010). Naturalistic study of course, effectiveness, and predictors of outcome among female adolescents in residential treatment for eating disorders. Eating and Weight Disorders-Studies on Anorexia, Bulimia and Obesity, 15(3), 127–135.
  7. Espel-Huynh HM, Thompson-Brenner H, Boswell JF, Zhang F, Juarascio AS, & Lowe MR (2020). Development and validation of a progress monitoring tool tailored for use in intensive eating disorder treatment. European Eating Disorders Review, 28(2), 223–236. 10.1002/erv.2718
  8. Espel-Huynh HM, Zhang F, Boswell JF, Thomas JG, Thompson-Brenner H, Juarascio AS, & Lowe MR (2020). Latent trajectories of eating disorder treatment response among female patients in residential care. International Journal of Eating Disorders, 53(10), 1647–1656. 10.1002/eat.23369
  9. Fairburn CG, Bailey-Straebler S, Basden S, Doll HA, Jones R, Murphy R, … Cooper Z (2015). A transdiagnostic comparison of enhanced cognitive behaviour therapy (CBT-E) and interpersonal psychotherapy in the treatment of eating disorders. Behaviour Research and Therapy, 70, 64–71. 10.1016/j.brat.2015.04.010
  10. Ferri C, Hernández-Orallo J, & Modroiu R (2009). An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30(1), 27–38. 10.1016/j.patrec.2008.08.010
  11. Fichter MM, Quadflieg N, & Hedlund S (2006). Twelve-year course and outcome predictors of anorexia nervosa. International Journal of Eating Disorders, 39(2), 87–100. 10.1002/eat.20215
  12. Fichter MM, Quadflieg N, & Hedlund S (2008). Long-term course of binge eating disorder and bulimia nervosa: Relevance for nosology and diagnostic criteria. International Journal of Eating Disorders, 41(7), 577–586. 10.1002/eat.20539
  13. Galatzer-Levy IR, Karstoft K-I, Statnikov A, & Shalev AY (2014). Quantitative forecasting of PTSD from early trauma responses: A machine learning application. Journal of Psychiatric Research, 59, 68–76. 10.1016/j.jpsychires.2014.08.017
  14. Golland P, & Fischl B (2003). Permutation tests for classification: Towards statistical significance in image-based studies. Paper presented at IPMI.
  15. Guyon I, & Elisseeff A (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
  16. Guyon I, Weston J, Barnhill S, & Vapnik V (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1–3), 389–422.
  17. Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, & Bing G (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220–239. 10.1016/j.eswa.2016.12.035
  18. Hand DJ, & Till RJ (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2), 171–186. 10.1023/A:1010920819831
  19. Hannan C, Lambert MJ, Harmon C, Nielsen SL, Smart DW, Shimokawa K, & Sutton SW (2005). A lab test and algorithms for identifying clients at risk for treatment failure. Journal of Clinical Psychology, 61(2), 155–163.
  20. He H, & Garcia EA (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
  21. Hilbert A, Herpertz S, Zipfel S, Tuschen-Caffier B, Friederich H-C, Mayr A, … de Zwaan M (2018). Early change trajectories in cognitive-behavioral therapy for binge-eating disorder. Behavior Therapy, 50, 115–125.
  22. Hosmer DW, Lemeshow S, & Sturdivant RX (2013). Applied logistic regression (3rd ed.). Hoboken, NJ: John Wiley & Sons, Inc.
  23. James G, Witten D, Hastie T, & Tibshirani R (2013). An introduction to statistical learning. New York: Springer.
  24. Jennings KM, Gregas M, & Wolfe B (2018). Trajectories of change in body weight during inpatient treatment for anorexia nervosa. Journal of the American Psychiatric Nurses Association, 24(4), 306–313. 10.1177/1078390317726142
  25. Kessler RC, van Loo HM, Wardenaar KJ, Bossarte RM, Brenner LA, Cai T, … Zaslavsky AM (2016). Testing a machine-learning algorithm to predict the persistence and severity of major depressive disorder from baseline self-reports. Molecular Psychiatry, 21(10), 1366–1371. 10.1038/mp.2015.198
  26. Kuhn M (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1–26. 10.18637/jss.v028.i05
  27. Le Grange D, Accurso EC, Lock J, Agras S, & Bryson SW (2014). Early weight gain predicts outcome in two treatments for adolescent anorexia nervosa. International Journal of Eating Disorders, 47(2), 124–129. 10.1002/eat.22221
  28. Linardon J, Brennan L, & de la Piedad Garcia X (2016). Rapid response to eating disorder treatment: A systematic review and meta-analysis. International Journal of Eating Disorders, 49(10), 905–919. 10.1002/eat.22595
  29. Linardon J, Messer M, & Fuller-Tyszkiewicz M (2018). Meta-analysis of the effects of cognitive-behavioral therapy for binge-eating-type disorders on abstinence rates in nonrandomized effectiveness studies: Comparable outcomes to randomized, controlled trials? International Journal of Eating Disorders, 51(12), 1303–1311. 10.1002/eat.22986
  30. Linardon J, & Wade TD (2018). How many individuals achieve symptom abstinence following psychological treatments for bulimia nervosa? A meta-analytic review. International Journal of Eating Disorders, 51(4), 287–294. 10.1002/eat.22838
  31. Lutz W, Lambert MJ, Harmon SC, Tschitsaz A, Schürch E, & Stulz N (2006). The probability of treatment success, failure and duration—What can be learned from empirical data to support decision making in clinical practice? Clinical Psychology & Psychotherapy, 13(4), 223–232. 10.1002/cpp.496
  32. Lutz W, Leach C, Barkham M, Lucock M, Stiles WB, Evans C, … Iveson S (2005). Predicting change for individual psychotherapy clients on the basis of their nearest neighbors. Journal of Consulting and Clinical Psychology, 73(5), 904–913. 10.1037/0022-006x.73.5.904
  33. Lutz W, Rubel JA, Schwartz B, Schilling V, & Deisenhofer A-K (2019). Towards integrating personalized feedback research into clinical practice: Development of the Trier Treatment Navigator (TTN). Behaviour Research and Therapy, 120, 103438. 10.1016/j.brat.2019.103438
  34. Lutz W, Saunders SM, Leon SC, Martinovich Z, Kosfelder J, Schulte D, … Tholen S (2006). Empirically and clinically useful decision making in psychotherapy: Differential predictions with treatment response models. Psychological Assessment, 18(2), 133–141. 10.1037/1040-3590.18.2.133
  35. MacDonald DE, Trottier K, McFarlane T, & Olmsted MP (2015). Empirically defining rapid response to intensive treatment to maximize prognostic utility for bulimia nervosa and purging disorder. Behaviour Research and Therapy, 68, 48–53. 10.1016/j.brat.2015.03.007
  36. Mitchell JE (2016). Medical comorbidity and medical complications associated with binge-eating disorder. International Journal of Eating Disorders, 49(3), 319–323. 10.1002/eat.22452
  37. Mitchell JE, & Crow S (2006). Medical complications of anorexia nervosa and bulimia nervosa. Current Opinion in Psychiatry, 19(4), 438–443. 10.1097/01.yco.0000228768.79097.3e
  38. Pereira F, Mitchell T, & Botvinick M (2009). Machine learning classifiers and fMRI: A tutorial overview. NeuroImage, 45(1), S199–S209.
  39. Preti A, de Girolamo G, Vilagut G, Alonso J, de Graaf R, Bruffaerts R, … Morosini P (2009). The epidemiology of eating disorders in six European countries: Results of the ESEMeD-WMH project. Journal of Psychiatric Research, 43(14), 1125–1132. 10.1016/j.jpsychires.2009.04.003
  40. Ripley B, & Venables W (2020). Package ‘nnet’. Retrieved from http://www.stats.ox.ac.uk/pub/MASS4/
  41. Sher KJ, Jackson KM, & Steinley D (2011). Alcohol use trajectories and the ubiquitous cat’s cradle: Cause for concern? Journal of Abnormal Psychology, 120(2), 322–335. 10.1037/a0021813
  42. Thompson-Brenner H, Boswell JF, Espel-Huynh HM, Brooks GE, & Lowe MR (2018). Implementation of transdiagnostic treatment for emotional disorders in residential eating disorder programs: A preliminary pre-post evaluation. Corrected version. Psychotherapy Research, 29(8), 1045–1061. 10.1080/10503307.2018.1446563
  43. Thompson-Brenner H, Brooks GE, Boswell JF, Espel-Huynh HM, Dore R, Franklin DR, … Lowe MR (2018). Evidence-based implementation practices applied to the intensive treatment of eating disorders: A research summary and examples from one case. Clinical Psychology: Science and Practice, 25(1), e12221. 10.1111/cpsp.12221
  44. Vall E, & Wade TD (2015). Predictors of treatment outcome in individuals with eating disorders: A systematic review and meta-analysis. International Journal of Eating Disorders, 48(7), 946–971. 10.1002/eat.22411
  45. Warren CS, Schafer KJ, Crowley ME, & Olivardia R (2012). A qualitative analysis of job burnout in eating disorder treatment providers. Eating Disorders, 20(3), 175–195. 10.1080/10640266.2012.668476
  46. Zipfel S, Wild B, Groß G, Friederich H-C, Teufel M, Schellberg D, … Herzog W (2014). Focal psychodynamic therapy, cognitive behaviour therapy, and optimised treatment as usual in outpatients with anorexia nervosa (ANTOP study): Randomised controlled trial. The Lancet, 383(9912), 127–137. 10.1016/S0140-6736(13)61746-8
