Abstract
Comorbid anxiety disorders are common among patients with major depressive disorder (MDD), but their impact on outcomes of digital and smartphone-delivered interventions is not well understood. This study is a secondary analysis of a randomized controlled effectiveness trial (n=638) that assessed three smartphone-delivered interventions: Project EVO (a cognitive training app), iPST (a problem-solving therapy app), and Health Tips (an active control). We applied classical machine learning models (logistic regression, support vector machines, decision trees, random forests, and k-nearest-neighbors) to identify baseline predictors of MDD improvement at 4 weeks after trial enrollment. Our analysis produced a decision tree model indicating that a baseline GAD-7 questionnaire score of 11 or higher, a threshold consistent with at least moderate anxiety, strongly predicts lower odds of MDD improvement in this trial. Our exploratory findings suggest that depressed individuals with comorbid anxiety have reduced odds of substantial improvement in the context of smartphone-delivered interventions, as the association was observed across all three intervention groups. Our work highlights a methodology that can identify interpretable clinical thresholds, which, if validated, could predict symptom trajectories and inform treatment selection and intensity.
Keywords: Mental health, machine learning, mood disorders, major depressive disorder, anxiety disorders, comorbidity
1. Introduction
Major depressive disorder (MDD) affects roughly 322 million people and is a leading cause of disability, with large impacts on quality of life (Moreno-Agostino et al., 2021). Less than 20% of people with MDD receive minimally adequate treatment (Thornicroft et al., 2017). Approximately half of patients who receive MDD treatment experience minimal or no improvement, and a subset of these do not respond to multiple treatment attempts (Gaynes et al., 2020). Digital and smartphone-delivered psychotherapy interventions are one promising avenue to increase access to evidence-based MDD treatments (Linardon et al., 2024). Knowledge of the factors that predict which patients are likely to improve could inform personalized care, such as identifying individuals who may require additional support. However, studies attempting to identify these predictors have yielded inconsistent results (Sextl-Plötz et al., 2024).
In this study, we investigated predictors of clinical MDD improvement by conducting a secondary analysis of a large, publicly available clinical trial dataset (Arean et al., 2016). The Brighten MDD trial was an online, fully remote effectiveness trial with a four-week primary intervention period, in which participants were randomized to one of three conditions: Project EVO, a serious game designed to bolster cognitive skills related to MDD; iPST, an app based on problem-solving therapy for MDD; and Health Tips, an information app that suggested strategies to improve health and served as an active control. Consistent with an effectiveness trial framework, some participants endorsed simultaneously receiving other treatments (e.g., medication, seeing a therapist or psychiatrist) while in the trial. The original clinical trial analysis found that, for participants with moderate MDD, the active apps resulted in higher remission rates than the control intervention at the 12-week follow-up (Arean et al., 2016). Although the original study compared effectiveness across the three groups, the participant-level factors related to the likelihood of MDD improvement have, to the best of our knowledge, not been explored. We applied interpretable machine learning techniques, coupled with a forward feature selection approach, to identify variables measured at baseline that predict greater or lesser odds of clinical improvement during the treatment period.
2. Methods
2.1. Original clinical trial
This study is an independent secondary analysis of open-access data from the Brighten study, a randomized controlled effectiveness trial. The original study evaluated the effectiveness of two smartphone-delivered interventions for depression—Project EVO (a cognitive training app) and iPST (a problem-solving therapy app)—against an active control intervention, Health Tips (Arean et al., 2016). Consolidated Standards of Reporting Trials (CONSORT) diagrams detailing participant enrollment and flow are available in the original publications, which also provide comprehensive details about the interventions and trial design (Anguera et al., 2016; Arean et al., 2016). The primary analysis of the original trial found that, among participants with moderate MDD at baseline, both the Project EVO and iPST interventions resulted in higher remission rates at the 12-week follow-up compared to the Health Tips active control (Arean et al., 2016).
2.2. Models and variables
Our study predicted a binary MDD improvement outcome, measured 4 weeks after trial enrollment in alignment with the main intervention period of the original study (Arean et al., 2016). This outcome was defined using established criteria for MDD treatment response: a Patient Health Questionnaire-9 (PHQ-9) score that is both < 10 and reduced by ≥ 50% relative to baseline (Kroenke et al., 2001). We chose a machine learning framework for this analysis because of its ability to systematically identify predictive patterns using both linear and non-linear models, an approach well-suited to contexts where prior evidence for predictors is inconsistent. Our methodology was designed to compare multiple algorithm types, rigorously account for missing data and sampling variability, and select a parsimonious set of predictors to generate robust and interpretable findings. We selected five commonly used algorithms to represent distinct modeling approaches: logistic regression and support vector machines as standard linear classifiers; decision trees and random forests to capture non-linear relationships and interactions; and k-nearest-neighbors as a non-parametric, instance-based method. We were specifically interested in decision trees for their ease of interpretability in clinical settings (Banerjee et al., 2019). We considered the following variables in the dataset as “features” that the models could use for prediction (percentages of missing data are shown in brackets after variable names where relevant):
- Demographics:
- Age
- Gender (collected as binary male/female)
- Race/ethnicity (categories “African-American/Black,” “Asian,” “Hispanic/Latino,” “Multiracial/other,” and “Non-Hispanic White”)
- Working/employment (binary yes/no)
- Marital status (binarized to married/partnered or not)
- Education (binarized to education beyond high school or not)
- Satisfaction with level of income (binarized to responses of “can’t make ends meet” vs. responses indicating higher satisfaction) [53% missing]
- Questionnaire scores:
- PHQ-9 at baseline (Kroenke et al., 2001) [0% missing]
- Generalized Anxiety Disorder-7 (GAD-7), used as a global measure of anxiety symptoms (Spitzer et al., 2006; Kroenke et al., 2007; Beard and Björgvinsson, 2014) [47% missing]
- Sheehan Disability Scale (SDS) (Sheehan, 1983) [47% missing]
- AUDIT alcohol consumption questions (AUDIT-C) (Bush et al., 1998) [47% missing]
- Treatment group (binary feature for each)
- Project EVO
- iPST
- Health Tips (active control)
The set of predictor variables was chosen to include all available baseline demographic and clinical questionnaire data for an exploratory analysis. Marital status, education, and satisfaction with level of income were binarized. For race/ethnicity, response options “Native Hawaiian/other Pacific Islander,” “American Indian/Alaskan Native,” and “More than one” were combined to form the category “Multiracial/other.” These modifications were designed to prevent issues arising from categories with few participants. To maintain a parsimonious model and avoid multicollinearity from redundant predictors, we selected one of the two available income-related variables for inclusion. We chose “satisfaction with level of income,” hypothesizing that this measure of subjective financial distress may have a more direct relationship with mental health outcomes than absolute income brackets. We also considered using responses from the IMPACT mania and psychosis screening questionnaire (Unützer et al., 2002; Arean et al., 2016) as predictors. However, we excluded participants who endorsed a history consistent with bipolar disorder (see Section 2.3), and only 2 of the remaining participants reported any history of psychosis, precluding a meaningful analysis of psychosis history as a predictor of MDD outcomes. The most detailed source of information regarding the variables collected during the original trial is the Brighten Study Public Researcher Portal (Sage Bionetworks, 2018).
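For concreteness, the improvement outcome defined at the start of this subsection can be computed directly from baseline and week-4 PHQ-9 scores. The following is a minimal sketch with hypothetical variable names, not the study's actual source code:

```python
import pandas as pd

def mdd_improvement(phq9_baseline: pd.Series, phq9_week4: pd.Series) -> pd.Series:
    """Binary MDD improvement: week-4 PHQ-9 < 10 AND reduced by >= 50% from baseline."""
    return (phq9_week4 < 10) & (phq9_week4 <= 0.5 * phq9_baseline)
```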
2.3. Data preparation
Data from the Brighten Version 1 study (Anguera et al., 2016; Arean et al., 2016) were downloaded from the Brighten Study Public Researcher Portal (Sage Bionetworks, 2018). We used all available records in the Brighten Version 1 dataset (Anguera et al., 2016). While demographics and baseline PHQ-9 scores were collected upon enrollment, several other questionnaire scores were collected in the days following enrollment, notably GAD-7, SDS, AUDIT-C, and mania and psychosis history. For these questionnaires, we used the earliest available response from each participant as the “baseline” measurement. We did not consider any such responses that were made more than 5 days after enrollment. We excluded participants with a baseline PHQ-9 score below 10. To focus our analysis on unipolar depression, we also excluded participants who endorsed a history of (i) lithium prescription, (ii) prescription of medication for mania symptoms, or (iii) diagnosis of bipolar disorder on the IMPACT questionnaire. These criteria left us with 638 participants.
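The baseline-selection rule for the questionnaires collected after enrollment can be expressed as a short pandas sketch; the column names here are illustrative assumptions, not the dataset's actual field names:

```python
import pandas as pd

def baseline_scores(responses: pd.DataFrame) -> pd.DataFrame:
    """Take each participant's earliest questionnaire response, discarding
    responses made more than 5 days after enrollment.

    Assumed columns: participant_id, days_since_enrollment, score.
    """
    eligible = responses[responses["days_since_enrollment"] <= 5]
    # Earliest eligible response per participant serves as "baseline"
    return (eligible.sort_values("days_since_enrollment")
                    .groupby("participant_id", as_index=False)
                    .first())
```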
The PHQ-9 outcome data required for our analysis had substantial missingness, with 52%, 58%, 59%, and 60% missing at weeks 1-4 post-enrollment, respectively. Moreover, 41% of the included participants had only demographic data and baseline PHQ-9 (which were collected together), with no GAD-7, SDS, IMPACT, or AUDIT-C scores and no post-baseline PHQ-9. We checked for differences in demographics and baseline PHQ-9 scores between this group and participants who had additional data, and did not find statistically significant evidence of any differences (p > 0.05; t-tests for age and PHQ-9, χ2 tests for gender, race/ethnicity, employment, marital status, and education). We used a random forest-based multiple imputation strategy with predictive mean matching to handle missing data (baseline variables and follow-up PHQ-9 at weeks 1-4), implemented with the miceRanger package in R (Wilson, 2020). We produced 100 imputed versions of the dataset. On average across imputations, 41% of the participants met the definition for MDD improvement. Before training each machine learning model, all non-categorical features in the dataset were rescaled to have a mean of 0 and a standard deviation of 1.
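The group comparison described above could be run along the following lines; this is an illustrative sketch using standard SciPy tests and assumed column names (including the hypothetical boolean flag `minimal_data_only`), not the authors' exact code:

```python
import pandas as pd
from scipy.stats import ttest_ind, chi2_contingency

def check_group_differences(df: pd.DataFrame) -> None:
    """Compare participants with only demographics + baseline PHQ-9 to the rest."""
    minimal, rest = df[df["minimal_data_only"]], df[~df["minimal_data_only"]]
    # Continuous baseline variables: t-tests
    for col in ["age", "phq9_baseline"]:
        stat, p = ttest_ind(minimal[col].dropna(), rest[col].dropna())
        print(f"{col}: t = {stat:.2f}, p = {p:.3f}")
    # Categorical baseline variables: chi-squared tests on contingency tables
    for col in ["gender", "race_ethnicity", "working", "married", "education"]:
        table = pd.crosstab(df["minimal_data_only"], df[col])
        chi2, p, dof, _ = chi2_contingency(table)
        print(f"{col}: chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
```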
2.4. Feature selection and model fitting
To minimize overfitting due to the high number of features in the dataset, we used a forward selection procedure to identify a minimal set of input features for each machine learning model. AUC estimates were first obtained for univariate models on each feature, and the feature that resulted in the highest AUC was selected. Then, all possible bivariate models including the first chosen feature were tested, and features were added one by one in this fashion until adding another feature did not significantly increase AUC. Each AUC estimate was obtained using a Monte Carlo cross-validation procedure with 10,000 iterations. For each iteration, we first randomly selected 1 of the 100 imputed versions of the dataset, and then randomly sampled 80% of the participants to form a training set, leaving the remaining 20% as a validation set. For each candidate variable, we took the 10,000 differences (∆AUC) between the AUC estimates of the current candidate model and those of the previous best model (or AUC = 0.5 when evaluating the first variable). We computed both the mean ∆AUC value and its 95% confidence interval from this empirical distribution, which accounts for uncertainty due to both missingness and recruitment sampling. Statistical significance of the model’s improvement due to the added variable was assessed by calculating the probability p = P(∆AUC ≤ 0) from the empirically sampled distribution. The Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) was applied to control the false discovery rate at 0.05 across all comparisons for a given model type. To balance the goals of maximizing prediction accuracy and limiting model complexity, we also required an AUC improvement of ≥ 0.02 for each new variable (Greenberg et al., 2024). For decision trees and random forests, we ran variable selection separately for each maximum tree depth (the maximum number of binary decisions allowed per tree, ranging here from 1 to 5 inclusive), keeping the tree depth that produced the highest AUC estimate overall. While the forward selection process was guided exclusively by AUC, we also calculated classification accuracy for the final models to aid in their interpretation and comparison. All models were implemented using the scikit-learn Python library with default hyperparameters (Pedregosa et al., 2011).
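A condensed sketch of one forward-selection step is shown below, using logistic regression as the example model type. The data structures and column names are assumptions; in particular, `imputed` stands in for the list of 100 imputed DataFrames, and `improved` for the binary outcome column:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def delta_auc(imputed, base_feats, cand_feats, outcome="improved", n_iter=10_000):
    """Empirical ΔAUC distribution (candidate vs. previous best model) from
    Monte Carlo cross-validation over multiply-imputed datasets."""
    deltas = np.empty(n_iter)
    for i in range(n_iter):
        df = imputed[rng.integers(len(imputed))]          # random imputed dataset
        train, val = train_test_split(df, test_size=0.2)  # random 80/20 split
        def auc(feats):
            model = LogisticRegression().fit(train[feats], train[outcome])
            return roc_auc_score(val[outcome], model.predict_proba(val[feats])[:, 1])
        # The first variable is compared against chance-level AUC = 0.5
        deltas[i] = auc(cand_feats) - (auc(base_feats) if base_feats else 0.5)
    return deltas

# Example: does adding SDS to GAD-7 significantly improve AUC?
# deltas = delta_auc(imputed, ["gad7"], ["gad7", "sds"])
# p = (deltas <= 0).mean()  # compared against the Benjamini-Hochberg threshold
```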
2.5. Sensitivity analyses
We conducted two alternate versions of our analysis to assess the sensitivity of our findings to certain methodological assumptions.
First, instead of predicting MDD outcomes 4 weeks post-enrollment (corresponding to the main intervention period in the original trial; Arean et al., 2016), we predicted outcomes at 12 weeks.
Second, of the 638 participants who met our screening criteria (baseline PHQ-9 ≥ 10, no reported history of bipolar disorder), 279 (41%) had only baseline PHQ-9 and demographic information recorded, with no GAD-7, SDS, AUDIT-C, or mania/psychosis history responses and no post-baseline PHQ-9. While our primary analysis includes these participants and imputes their data in accordance with intention-to-treat principles, we excluded them in an alternative version of the analysis (n = 359). Notably, missing data from these participants account for a large proportion of the overall missingness in the dataset; more information can be found in the supplementary material.
3. Results
Logistic regression, support vector machines, random forests, and decision trees demonstrated AUC values significantly above the 0.5 chance level in predicting MDD improvement, while k-nearest-neighbors did not (Table 1). All selected models used only one feature, baseline GAD-7: in no configuration did adding any additional variable increase AUC with statistical significance. Notably, predictive performance was highly similar across these top four model types, with AUC and accuracy point estimates falling within narrow ranges (0.728-0.739 and 0.694-0.702, respectively) and widely overlapping confidence intervals. All model types other than k-nearest-neighbors also predicted significantly above the chance level using SDS as the sole predictor (indicating a negative relationship between functional disability and depression improvement), but these SDS models were not selected by the forward process due to their marginally lower AUC estimates.
Table 1: GAD-7 was the only predictor of MDD improvement across all selected models, and adding any second variable did not significantly increase AUC.
K-nearest-neighbors failed to predict significantly above chance (N.S. = not significant). 95% confidence intervals are included in parentheses. “Interp.” refers to whether or not each model type is considered interpretable. Logistic regression and support vector machine coefficients are for standardized features (mean = 0, std = 1). “Depth” refers to maximum tree depth, and is only applicable for random forests and decision trees. A selection of alternative models that predicted significantly above chance, but were not selected by the forward process, is shown in italics: we provide results for (i) the best model with an alternative predictor (SDS) and (ii) a depth = 1 decision tree model using GAD-7, which is of special interest for interpretability. A complete set of results for the main analysis and all sensitivity analyses can be found in the supplementary material.
| Model Type | Interp. | AUC | Accuracy | Depth | Pred. | Coefficient |
|---|---|---|---|---|---|---|
| Logistic Regression | Yes | 0.74 (0.65, 0.82) | 0.69 (0.61, 0.77) | - | GAD-7 | −0.96 (−1.13, −0.80) |
| | | *0.71 (0.62, 0.79)* | *0.66 (0.58, 0.74)* | - | *SDS* | *−0.75 (−0.93, −0.57)* |
| Support Vector Machine | Yes | 0.74 (0.65, 0.82) | 0.70 (0.62, 0.78) | - | GAD-7 | −1.01 (−1.15, −0.88) |
| | | *0.70 (0.61, 0.79)* | *0.66 (0.57, 0.74)* | - | *SDS* | *−0.81 (−0.98, −0.58)* |
| Random Forest | No | 0.74 (0.65, 0.82) | 0.70 (0.62, 0.78) | 2 | GAD-7 | - |
| | | *0.70 (0.61, 0.79)* | *0.66 (0.58, 0.74)* | *2* | *SDS* | - |
| Decision Tree | Yes | 0.73 (0.64, 0.81) | 0.70 (0.61, 0.77) | 3 | GAD-7 | - |
| | | *0.70 (0.61, 0.77)* | *0.69 (0.61, 0.77)* | *1* | *GAD-7* | - |
| | | *0.69 (0.60, 0.77)* | *0.65 (0.57, 0.73)* | *3* | *SDS* | - |
| K-Nearest-Neighbors | No | N.S. | N.S. | - | N.S. | - |
While all model types had similar AUC values and near-identical accuracy, decision trees are arguably the most straightforward to apply in clinical settings. The depth = 1 model provides a simple clinical heuristic, a binary threshold on a single variable, that can be applied instantly without computation. Although the depth = 3 decision tree yielded a marginally higher AUC value (0.728 vs. 0.696), the simpler and more interpretable depth = 1 tree achieved nearly identical classification accuracy (0.697 vs. 0.692, respectively; see Table 1). We thus focus our primary interpretation on the depth = 1 model, as it maximizes ease of interpretability with no meaningful loss in predictive performance.
Re-fitting a depth = 1 decision tree on the entire dataset (taking the median threshold across imputations) yields a model that classifies participants with a baseline GAD-7 score ≥ 11 as unlikely to experience improvement, and those with GAD-7 < 11 as likely to experience improvement (Fig. 1). Notably, this threshold is similar (but not identical) to the Spitzer et al. (2006) threshold (GAD-7 ≥ 10) for “moderate” to “severe” generalized anxiety disorder.
Figure 1: A decision tree fitted to the multiply-imputed dataset predicts MDD improvement using baseline GAD-7 scores.

Participants who reported a GAD-7 score of 11 or higher had less than one-fifth the odds of experiencing significant MDD improvement compared with those with a score below 11.
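For reference, the re-fitting step can be sketched as follows, with hypothetical column names. Note that scikit-learn places split points between observed values, so a root-node threshold near 10.5 on integer GAD-7 scores corresponds to the reported GAD-7 ≥ 11 rule:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def median_gad7_threshold(imputed, outcome="improved"):
    """Fit a depth-1 tree (decision stump) on each imputed dataset and
    take the median root-node split threshold across imputations."""
    thresholds = [
        DecisionTreeClassifier(max_depth=1)
        .fit(df[["gad7"]], df[outcome])
        .tree_.threshold[0]  # split value at the root node
        for df in imputed
    ]
    return float(np.median(thresholds))
```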
To better quantify the association between GAD-7 ≥ 11 and MDD outcomes, we calculated an odds ratio of 0.18 for improvement vs. non-improvement given GAD-7 ≥ 11, with a 95% confidence interval of [0.12, 0.28] (p < 0.0001); odds ratio values were combined across the multiple imputations using Rubin’s rules (Rubin, 1987). These findings indicate, with statistical significance, that participants with a GAD-7 score of 11 or higher had odds of clinical MDD improvement slightly less than one-fifth those of participants with lower scores. To explore the consistency of this finding across treatment assignment groups, we calculated the odds ratio for a GAD-7 score ≥ 11 predicting improvement within each group separately. The association was statistically significant in all three groups: Project EVO (0.14 [0.07, 0.28]), iPST (0.12 [0.04, 0.39]), and the Health Tips active control (0.27 [0.14, 0.51]).
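A sketch of the pooling step follows: the odds ratio is estimated on each imputed dataset, log-transformed, and combined with Rubin’s rules. This version takes assumed 2×2 counts as input and uses a normal approximation for the confidence interval rather than the t reference distribution of the full procedure:

```python
import numpy as np
from scipy.stats import norm

def pooled_or(tables):
    """Rubin's rules pooling of log odds ratios across imputations.

    tables: list of (a, b, c, d) counts, one per imputed dataset, where
    a = improved with GAD-7 >= 11, b = not improved with GAD-7 >= 11,
    c = improved with GAD-7 < 11,  d = not improved with GAD-7 < 11.
    """
    m = len(tables)
    log_or = np.array([np.log(a * d / (b * c)) for a, b, c, d in tables])
    var = np.array([1/a + 1/b + 1/c + 1/d for a, b, c, d in tables])  # Woolf variance
    q_bar = log_or.mean()                                    # pooled log-OR
    total_var = var.mean() + (1 + 1/m) * log_or.var(ddof=1)  # within + between
    half_width = norm.ppf(0.975) * np.sqrt(total_var)        # normal approximation
    return np.exp(q_bar), np.exp([q_bar - half_width, q_bar + half_width])
```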
Viewed as a test that predicts MDD non-improvement, the proposed GAD-7 threshold has a positive predictive value of 76% and a negative predictive value of 63% (Table 2). In other words, our results suggest that the probability that a patient with GAD-7 < 11 will show clinically significant improvement in MDD is 0.63, but the probability that a patient with GAD-7 ≥ 11 will improve is only 0.24.
Table 2: Prognostic performance of the GAD-7 ≥ 11 decision tree threshold for predicting MDD non-improvement.
Sensitivity is the probability that a participant who does not experience MDD improvement is correctly identified by the model (true positive rate). Specificity is the probability that a participant who does experience MDD improvement is correctly identified by the model (true negative rate). Positive Predictive Value is the probability that a participant with a GAD-7 score ≥ 11 truly does not experience MDD improvement. Negative Predictive Value is the probability that a participant with a GAD-7 score < 11 truly does experience MDD improvement. Point estimates and confidence intervals for each of these metrics are pooled from the multiply-imputed dataset using Rubin’s rules.
| Statistic | Pooled Value (%) | 95% Confidence Interval (%) |
|---|---|---|
| Sensitivity | 73 | 68 - 78 |
| Specificity | 67 | 59 - 73 |
| Positive Predictive Value | 76 | 70 - 81 |
| Negative Predictive Value | 63 | 56 - 70 |
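The four metrics reduce to simple ratios of a 2×2 confusion table. A minimal sketch, computed per imputed dataset before pooling with Rubin’s rules, might look like this:

```python
def prognostic_metrics(tp, fn, fp, tn):
    """GAD-7 >= 11 viewed as a test for MDD NON-improvement.

    tp: GAD-7 >= 11, no improvement    fn: GAD-7 < 11, no improvement
    fp: GAD-7 >= 11, improvement       tn: GAD-7 < 11, improvement
    """
    return {
        "sensitivity": tp / (tp + fn),  # non-improvers correctly flagged
        "specificity": tn / (tn + fp),  # improvers correctly cleared
        "ppv": tp / (tp + fp),          # P(no improvement | GAD-7 >= 11)
        "npv": tn / (tn + fn),          # P(improvement | GAD-7 < 11)
    }
```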
The results of both sensitivity analyses are consistent overall with those of the main analysis in that (i) GAD-7 is the single best predictor of MDD improvement, and (ii) GAD-7 ≥ 11 is the most informative threshold for a depth = 1 decision tree, with statistically significant odds ratios. In the sensitivity analysis excluding participants with substantial missing data, the reduced statistical power meant that only a random forest model identified GAD-7 as a significant predictor during variable selection. However, an exploratory analysis of the depth = 1 decision tree in this subgroup yielded the same GAD-7 ≥ 11 threshold with a statistically significant odds ratio. Detailed results for all sensitivity analyses are available in the supplementary material.
While GAD-7 was the single most informative predictor of MDD improvement, SDS was also a statistically significant predictor in alternative univariate models. Yet, bivariate models combining GAD-7 and SDS did not yield significant improvements. In a post-hoc exploratory analysis, a Spearman’s rank-order correlation revealed a strong, positive association between GAD-7 and SDS at baseline (ρ = 0.67, 95% CI [0.61, 0.71], p < 0.001).
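This correlation can be reproduced with SciPy on each imputed dataset; the variable names are again hypothetical:

```python
from scipy.stats import spearmanr

# df is one imputed dataset; repeat across imputations and pool as needed
rho, p = spearmanr(df["gad7"], df["sds"])  # baseline rank-order correlation
```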
4. Discussion
We predicted MDD improvement in a large cohort of participants receiving smartphone-delivered interventions. Our decision tree analysis identified a clear and clinically meaningful relationship: depressed individuals with baseline GAD-7 scores of 11 or higher had roughly one-fifth the odds of MDD improvement compared with those with lower GAD-7 scores. In terms of predictive value, there is a large difference in the probability of MDD improvement when GAD-7 ≥ 11 (24%) and when GAD-7 < 11 (63%). If replicated beyond the present study, this result suggests that additional support or more intensive treatment is warranted for individuals with MDD and GAD-7 ≥ 11 compared to people with GAD-7 < 11. This simple decision rule would be readily applicable in the clinic, as the GAD-7 is widely administered to assess a common set of MDD comorbidities. While originally developed to screen for generalized anxiety disorder (Spitzer et al., 2006), the GAD-7 is now well-established as a screening tool for multiple anxiety disorders (Kroenke et al., 2007) and as a transdiagnostic measure of global anxiety symptoms (Beard and Björgvinsson, 2014).
Our findings highlight the utility of decision trees as a readily interpretable non-linear modeling approach. While decision trees demonstrated similar AUC scores and near-identical accuracy to logistic regression, support vector machine, and random forest models, these four model types offer different levels of interpretability: the ability to understand and explain how a trained model makes predictions. While logistic regression and support vector machines produce coefficients that can be interpreted as the importance of each predictor, the meanings of the actual coefficient values are often unintuitive: logistic regression coefficients represent log-odds, while support vector machine coefficients define a class-separating hyperplane. Random forests are effectively uninterpretable “black boxes”: we cannot practicably explain how our random forests use GAD-7 to predict outcomes. In contrast, decision trees provided a GAD-7 threshold (GAD-7 ≥ 11) above which MDD improvement is markedly less likely. While decision trees are commonly overlooked in both modern machine learning and traditional statistical analyses, they can generate predictive rules that are easily interpreted and applied by clinicians and can thus directly inform clinical decision-making (Banerjee et al., 2019).
In addition to the interpretability of decision trees, our machine learning-based pipeline, while computationally intensive, offers distinct advantages over traditional statistical approaches. Our framework systematically evaluated both linear models (such as logistic regression) and non-linear models (such as decision trees and random forests), allowing for the discovery of complex relationships that a single, simpler model might miss. Our methodology is also inherently robust: by combining random forest-based multiple imputation with Monte Carlo cross-validation, we rigorously accounted for uncertainty arising from both missing data and participant sampling without making strong distributional assumptions. Finally, our forward variable selection process, which included stringent criteria for model improvement and correction for multiple comparisons, provided a defense against overfitting and identified a parsimonious model while effectively handling collinearity. While baseline GAD-7 and SDS scores were correlated, the selection algorithm identified GAD-7 as the most powerful single predictor and determined that adding SDS offered no significant improvement in predictive performance. This approach avoids the unstable coefficient estimates and interpretation challenges that collinearity creates in standard multivariate regression, resulting in a more parsimonious and robust final model.
It is important to note two limitations of this study. The first is a substantial proportion of missing data, as would be expected from a remotely recruited national sample in an effectiveness trial. We address this limitation by using a rigorous multiple imputation approach, and with a sensitivity analysis that reduces the proportion of missing data by excluding participants with only minimal baseline data. Furthermore, in a supplementary complete-case analysis, we showed that GAD-7 predicts MDD improvement in a logistic regression model, confirming a statistically significant association in the raw, unimputed data (see Supplementary Section 3 for details). While this simpler analysis corroborates our findings, our primary machine learning-based methodology offers several distinct advantages. First, our analysis mitigates the risk of attrition bias, a significant concern given that the complete-case analysis excludes the majority of study participants. Second, multiple imputation increases statistical power while accounting for uncertainty introduced by missing data. Third, our approach is hypothesis-free, identifying GAD-7 as the best predictor of MDD outcomes without a priori assumptions. Finally, our analysis yielded a clinically intuitive threshold (GAD-7 ≥ 11) rather than a less interpretable odds ratio for an ordinal predictor.
A second limitation is that our analysis cannot establish causal associations. One possible explanation for the association between baseline GAD-7 and MDD improvement is that anxiety hinders participants’ engagement with smartphone-delivered interventions. Given findings from previous studies that comorbid anxiety reduces pharmacological treatment response in MDD (e.g., Fava et al. (2008); Saveanu et al. (2015); Dold et al. (2017)), it is also conceivable that comorbid anxiety hinders treatment response through a mechanism that is independent of treatment modality. A third possibility, which our results support most strongly, is that anxiety is a prognostic factor for poorer short-term MDD outcomes, an effect that is independent of both treatment assignment and treatment adherence. The original clinical trial analysis found higher MDD remission rates for moderately depressed participants in the Project EVO and iPST groups compared to the Health Tips control group, despite also reporting that the majority of participants in the active treatment groups did not download the assigned intervention app (Arean et al., 2016). Our analysis, which used different inclusion criteria and was designed to identify outcome predictors (with corrections for multiple comparisons) rather than to assess treatment effectiveness, did not identify treatment assignment as a significant predictor. This is consistent with our finding that GAD-7 ≥ 11 predicted lower odds of improvement within both the active treatment groups and the Health Tips control group, supporting the interpretation that baseline anxiety is a general prognostic factor for poorer short-term MDD outcomes in this context, rather than a predictor of response to a specific treatment. While the point estimates for the odds ratios are consistent with a stronger association between baseline anxiety and MDD improvement in the active treatment groups than in the control group, their wide and overlapping confidence intervals preclude a definitive conclusion about an interaction between baseline anxiety and treatment assignment. Ultimately, our study cannot distinguish between treatment-related, treatment-unrelated, or combined mechanisms for the observed association between baseline GAD-7 scores and PHQ-9 score reductions.
The link between higher baseline anxiety and lower odds of MDD improvement in this setting may have clinically actionable implications. For example, if this association is partly due to patients with comorbid anxiety struggling to engage with smartphone-delivered interventions, these patients might need additional support or different therapeutic approaches. Regardless of the underlying mechanisms, one practical implication of our results is that future randomized controlled trials of smartphone-delivered interventions for MDD should consider stratifying by baseline anxiety levels.
Supplementary Material
Acknowledgements
Morgan B. Talbot’s time was partially supported by the National Institute of General Medical Sciences under Award T32GM144273, and partially supported by Massachusetts Institute of Technology through the David and Beatrice Yamron Fellowship. Dr. Lipschitz’s time was partially supported by the National Institute of Mental Health (NIMH) under Grant MH120324. Dr. Costilla-Reyes’ time was partially supported by the National Science Foundation (NSF) under Award 1918839. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of any of the above organizations.
Footnotes
Source code for data analysis: https://github.com/MorganBDT/brighten-mdd-outcome-predict
References
- Anguera JA, Jordan JT, Castaneda D, Gazzaley A, Areán PA, 2016. Conducting a fully mobile and randomised clinical trial for depression: access, engagement and expense. BMJ Innovations 2.
- Arean PA, Hallgren KA, Jordan JT, Gazzaley A, Atkins DC, Heagerty PJ, Anguera JA, 2016. The use and effectiveness of mobile apps for depression: results from a fully remote clinical trial. Journal of Medical Internet Research 18, e330.
- Banerjee M, Reynolds E, Andersson HB, Nallamothu BK, 2019. Tree-based analysis: a practical approach to create clinical decision-making tools. Circulation: Cardiovascular Quality and Outcomes 12, e004879.
- Beard C, Björgvinsson T, 2014. Beyond generalized anxiety disorder: psychometric properties of the GAD-7 in a heterogeneous psychiatric sample. Journal of Anxiety Disorders 28, 547–552.
- Benjamini Y, Hochberg Y, 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 289–300.
- Bush K, Kivlahan DR, McDonell MB, Fihn SD, Bradley KA, Ambulatory Care Quality Improvement Project (ACQIP), 1998. The AUDIT alcohol consumption questions (AUDIT-C): an effective brief screening test for problem drinking. Archives of Internal Medicine 158, 1789–1795.
- Dold M, Bartova L, Souery D, Mendlewicz J, Serretti A, Porcelli S, Zohar J, Montgomery S, Kasper S, 2017. Clinical characteristics and treatment outcomes of patients with major depressive disorder and comorbid anxiety disorders: results from a European multicenter study. Journal of Psychiatric Research 91, 1–13.
- Fava M, Rush AJ, Alpert JE, Balasubramani G, Wisniewski SR, Carmin CN, Biggs MM, Zisook S, Leuchter A, Howland R, et al., 2008. Difference in treatment outcome in outpatients with anxious versus nonanxious depression: a STAR*D report. American Journal of Psychiatry 165, 342–351.
- Gaynes BN, Lux L, Gartlehner G, Asher G, Forman-Hoffman V, Green J, Boland E, Weber RP, Randolph C, Bann C, et al., 2020. Defining treatment-resistant depression. Depression and Anxiety 37, 134–145.
- Greenberg JL, Weingarden H, Hoeppner SS, Berger-Gutierrez RM, Klare D, Snorrason I, Costilla-Reyes O, Talbot M, Daniel KE, Vanderkruik RC, et al., 2024. Predicting response to a smartphone-based cognitive-behavioral therapy for body dysmorphic disorder. Journal of Affective Disorders 355, 106–114.
- Kroenke K, Spitzer RL, Williams JB, 2001. The PHQ-9: validity of a brief depression severity measure. Journal of General Internal Medicine 16, 606–613.
- Kroenke K, Spitzer RL, Williams JB, Monahan PO, Löwe B, 2007. Anxiety disorders in primary care: prevalence, impairment, comorbidity, and detection. Annals of Internal Medicine 146, 317–325.
- Linardon J, Torous J, Firth J, Cuijpers P, Messer M, Fuller-Tyszkiewicz M, 2024. Current evidence on the efficacy of mental health smartphone apps for symptoms of depression and anxiety: a meta-analysis of 176 randomized controlled trials. World Psychiatry 23, 139–149.
- Moreno-Agostino D, Wu YT, Daskalopoulou C, Hasan MT, Huisman M, Prina M, 2021. Global trends in the prevalence and incidence of depression: a systematic review and meta-analysis. Journal of Affective Disorders 281, 235–243.
- Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al., 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
- Rubin DB, 1987. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York.
- Sage Bionetworks, 2018. Brighten study public researcher portal. https://www.synapse.org/Synapse:syn10848316. Accessed 2025-07-22. Synapse ID: syn10848316.
- Saveanu R, Etkin A, Duchemin AM, Goldstein-Piekarski A, Gyurak A, Debattista C, Schatzberg AF, Sood S, Day CV, Palmer DM, et al., 2015. The international study to predict optimized treatment in depression (iSPOT-D): outcomes from the acute phase of antidepressant treatment. Journal of Psychiatric Research 61, 1–12.
- Sextl-Plötz T, Steinhoff M, Baumeister H, Cuijpers P, Ebert DD, Zarski AC, 2024. A systematic review of predictors and moderators of treatment outcomes in internet- and mobile-based interventions for depression. Internet Interventions, 100760.
- Sheehan D, 1983. The Anxiety Disease. Scribner, New York.
- Spitzer RL, Kroenke K, Williams JB, Löwe B, 2006. A brief measure for assessing generalized anxiety disorder: the GAD-7. Archives of Internal Medicine 166, 1092–1097.
- Thornicroft G, Chatterji S, Evans-Lacko S, Gruber M, Sampson N, Aguilar-Gaxiola S, Al-Hamzawi A, Alonso J, Andrade L, Borges G, et al., 2017. Undertreatment of people with major depressive disorder in 21 countries. British Journal of Psychiatry 210, 119–124.
- Unützer J, Katon W, Callahan CM, Williams JW Jr, Hunkeler E, Harpole L, Hoffing M, Della Penna RD, Noël PH, Lin EH, et al., 2002. Collaborative care management of late-life depression in the primary care setting: a randomized controlled trial. Journal of the American Medical Association (JAMA) 288, 2836–2845.
- Wilson S, 2020. miceRanger: multiple imputation by chained equations with random forests. https://cran.r-project.org/web/packages/miceRanger/index.html. CRAN package (R).