Author manuscript; available in PMC: 2024 Nov 8.
Published in final edited form as: Nat Ment Health. 2024 Aug 8;2(9):1111–1119. doi: 10.1038/s44220-024-00291-5

Mood instability metrics to stratify individuals and measure outcomes in bipolar disorder

Sarah H Sperry 1,2, Anastasia K Yocum 1, Melvin G McInnis 1
PMCID: PMC11545575  NIHMSID: NIHMS2015613  PMID: 39526287

Abstract

Clinical care for bipolar disorder (BD) has a narrow focus on prevention and remission of episodes, with pre/post-treatment reductions in symptom severity as the ‘gold standard’ for outcomes in clinical trials and measurement-based care strategies. The study aim was to provide an innovative method for measuring outcomes in BD that has clinical utility and can stratify individuals with BD based on mood instability. Participants were 603 individuals with BD (n=385), other or non-affective disorders (n=71), or no psychiatric history (n=147) enrolled for at least 10 years in a longitudinal cohort that collects patient-reported outcome measures (PROMs) assessing depression, (hypo)mania, anxiety, and functioning every two months. Mood instability was calculated as the intraindividual standard deviation (S.D.) of PROMs over one-year rolling windows and stratified into low, moderate, and high thresholds. Individuals with BD had significantly higher one-year rolling SDs for depression, (hypo)mania, and anxiety compared to psychiatric comparisons (small to moderate effects) and healthy controls (large effects). A significantly greater proportion of scores for those with BD fell into the moderate (depression: 50.6%; anxiety: 36.5%; (hypo)mania: 52.1%) and high thresholds (depression: 9.4%; anxiety: 6.1%; (hypo)mania: 10.1%) compared to psychiatric comparisons (moderate: 32.3%–42.9%; high: 2.6%–6.6%) and healthy controls (moderate: 11.5%–31.7%; high: 0.4%–5.8%). Being in the high or moderate threshold predicted worse mental health functioning (small to large effects). Mood instability, as measured in commonly used PROMs, characterized the course of illness over time, correlated with functional outcomes, and significantly differentiated those with BD from healthy controls and psychiatric comparisons. Results suggest a paradigm shift in monitoring outcomes in BD, by measuring intraindividual SDs as a primary outcome index.

Introduction

Bipolar disorders (BD) are among the leading causes of disability worldwide due to early onset, high chronicity, and comorbidity rates (1). Premature mortality associated with BD equals or surpasses that of several common risk conditions, including smoking and cardiovascular disease (2). Despite this, accurate diagnosis is often delayed and, even when diagnosed properly, efficient treatment options remain stagnant (3). Traditional nosology and classification describe BD as relapsing disorders during which distinct episodes of mania, hypomania, and depression vacillate with remitted periods. These remitted periods have long distinguished BD from other primary psychotic disorders or borderline personality disorder, and are based on a return to normal mood or “euthymia” in between episodes (4). However, the recent emphasis on longitudinal designs in BD research has begun to paint a sobering picture of these patterns, challenging traditional nosology and definitions of euthymia (5, 6).

Efforts to model the course of BD across both micro (hourly) and macro (monthly to yearly) timescales have proliferated over the past decade. Studies using ecological momentary assessment (EMA), longitudinal cohort designs, time series analysis, and mathematical modeling find that individuals with BD, as well as those at risk for BD, experience notable instability (fluctuations and deviations away from one’s average state) in emotions and mood even outside the context of mood episodes (7–12). Importantly, mood instability is associated with risk and outcomes. In one study, day-to-day mood instability predicted the development of bipolar but not unipolar disorders three years later (13). Furthermore, several studies have established that mood instability, outside the context of mood episodes, is associated with worse outcomes and poorer functioning (14–16).

Despite converging evidence that mood instability is a core phenotypic feature of BD, clinical care continues to have a narrow focus on prevention and remission of episodes. Symptom severity reduction is the ‘gold standard’ for outcomes in clinical trials of novel interventions; the ‘measurement-based’ care strategy is likewise motivated towards achieving a specific integer threshold on validated measures of mood. Success is typically defined by simple linear decreases in scores from two (pre-post treatment) to three time points (pre-treatment, mid-treatment, post-treatment). We, along with others (5, 17), argue that for substantial improvements in diagnosis, care, and treatment development, alternative ways to measure change in BD based on mood instability are needed.

In this study, a unique cohort of individuals from the Prechter Longitudinal Study of Bipolar Disorder (PLS-BD) (18, 19) was leveraged. The PLS-BD includes individuals with BD, a psychiatric comparison (PC) group, and individuals with no psychiatric history (HC) who complete the Patient Health Questionnaire (PHQ-9) (20), Altman Self-Rating Mania Scale (ASRM) (21), Generalized Anxiety Scale (GAD-7) (22), and SF-12 Quality of Life Scale (23) every two months while enrolled. Importantly, these Patient Reported Outcomes Measures (PROMs) are widely administered across medical systems and are a part of standard of care. The current study had three primary goals: (1) to develop a clinically meaningful instability score that is simple to calculate and easy to interpret, (2) to identify low, moderate, and high instability thresholds based on these scores, and (3) to determine whether these thresholds predict level of mental and physical health functioning over time. We hypothesized that individuals with BD would have higher instability scores for the PHQ-9, ASRM, and GAD-7 and be over-represented in the moderate and high thresholds compared to HC and PC, and that being in the moderate or high threshold would predict worse mental and physical functioning on the SF-12.

Results

Mood Instability Metrics

Mood (in)stability can be measured in different ways, with debates in the field about the construct validity of different statistical measures (24, 25). A key distinction across proposed methods is whether the statistical index of instability simply measures deviations from one’s average (e.g., intraindividual standard deviation, S.D.) regardless of the temporal nature of such deviations, whether it measures the temporal dependency between measurements (e.g., autocorrelation, AR), or whether it combines the standard deviation and temporal dependency (e.g., mean square of successive differences, MSSD). Furthermore, the extent to which these measures represent (in)stability rather than flexibility depends on several factors including sampling frequency and other characteristics of the time series and phenomenon of interest (24, 26, 27). Yet, prior research highlights that a higher S.D. is associated with negative outcomes and greater BD risk (13, 28, 29) and accounts for more variance in outcomes than other more complex affect dynamics (30).
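The distinction between the three indices can be illustrated with a minimal Python sketch. It is not the study code (the analyses were run in R), and the function names, the biased lag-1 autocorrelation estimator, and the two score series are assumptions made purely for illustration. Both hypothetical series contain the same values, so their S.D. is identical, while the AR and MSSD differ because they depend on the order of the observations.

```python
from statistics import mean, stdev


def intraindividual_sd(scores):
    """Sample standard deviation of one person's scores; ignores their order."""
    return stdev(scores)


def lag1_autocorrelation(scores):
    """Lag-1 autocorrelation: dependency between consecutive scores."""
    m = mean(scores)
    denom = sum((x - m) ** 2 for x in scores)
    if denom == 0:
        return 0.0  # constant series; a convention assumed for this sketch
    numer = sum((a - m) * (b - m) for a, b in zip(scores, scores[1:]))
    return numer / denom


def mssd(scores):
    """Mean square of successive differences between consecutive scores."""
    return mean([(b - a) ** 2 for a, b in zip(scores, scores[1:])])


# Hypothetical PHQ-9 totals: same values, different temporal ordering.
alternating = [2, 12, 2, 12, 2, 12]
drifting = [2, 2, 2, 12, 12, 12]

for name, series in (("alternating", alternating), ("drifting", drifting)):
    print(name, intraindividual_sd(series), lag1_autocorrelation(series), mssd(series))
```

In this toy case the alternating series has a negative AR and a large MSSD, whereas the drifting series has a positive AR and a smaller MSSD, even though the S.D. cannot tell them apart.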

We calculated the rolling intraindividual standard deviation (S.D.) of the PHQ-9, ASRM, and GAD-7 for each participant using three window widths: 3, 6, and 12, corresponding to 3, 6, and 12 independent measures captured over six months, one year, and two years, respectively. Visual inspection with a LOESS (Locally Weighted Scatterplot Smoothing) line was completed along with calculation of the Mean Squared Error (MSE) and standard deviation to examine responsiveness of each window width. The one-year rolling S.D. was selected as it had the lowest MSE but highest standard deviation across 20 randomly selected participants representing a 12:4:4 ratio of BD:PC:HC. To enhance the clinical utility of this score, we created categorical thresholds of low, moderate, and high, independent of diagnostic group, using ranked percentiles (n = 100). A score at or below the 60th percentile was considered “low”, a score in the 61st–94th percentile was “moderate”, and a score at or above the 95th percentile was considered “high” for each PROM. As such, each participant had continuous scores and categorical thresholds for each PROM for each one-year rolling window. Person-level means were also created for descriptive and group comparison purposes and reflected an individual’s average of their one-year rolling S.D. values.
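The rolling calculation and thresholding described above can be sketched as follows. This is a hedged illustration rather than the study's implementation: the function names, the percentile-rank definition (share of pooled values at or below a score), and the example scores are all assumptions. A window of 6 corresponds to six bimonthly assessments, that is, one year.

```python
from statistics import stdev


def rolling_sd(scores, window=6):
    """S.D. of every run of `window` consecutive scores (6 bimonthly scores = 1 year)."""
    return [stdev(scores[i:i + window]) for i in range(len(scores) - window + 1)]


def percentile_rank(value, pooled):
    """Percentage of pooled rolling S.D. values that are <= value (assumed definition)."""
    return 100.0 * sum(v <= value for v in pooled) / len(pooled)


def threshold(value, pooled):
    """Map a rolling S.D. to low (<= 60th), moderate (61st-94th), or high (>= 95th)."""
    rank = percentile_rank(value, pooled)
    if rank >= 95:
        return "high"
    if rank > 60:
        return "moderate"
    return "low"


# Hypothetical participant: two years of bimonthly PHQ-9 totals (12 scores).
phq9 = [4, 5, 4, 6, 5, 4, 15, 3, 14, 4, 5, 4]
windows = rolling_sd(phq9)  # 12 - 6 + 1 = 7 one-year windows

# In practice `pooled` would hold rolling S.D. values from all participants,
# regardless of diagnostic group; here the participant's own values stand in.
labels = [threshold(sd, windows) for sd in windows]
print(list(zip([round(sd, 2) for sd in windows], labels)))
```

Pooling across all diagnostic groups before ranking mirrors the statement that thresholds were created independent of diagnosis; how ties and non-integer ranks are handled is a design choice that the text does not specify.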

To examine whether individuals with BD had higher instability scores than HC and PC, we examined whether diagnostic group (HC = 0, BD = 1, PC = 2) was associated with continuous one-year rolling S.D. scores using linear mixed-effects models fit with the lmer function (31) in R Studio version 09.1+494 (32). Next, we examined whether diagnostic group was associated with one-year rolling S.D. thresholds using chi-square tests. We then examined whether diagnostic group predicted participant rolling S.D. averages to assess, at the population level, whether individuals with BD have higher average rolling S.D. scores over the entire duration of the study. We applied thresholds identified from the continuous one-year rolling S.D. to determine whether, across the entire duration of enrollment, participants fell into low, moderate, or high thresholds. Lastly, to test whether one-year rolling S.D. thresholds predicted longitudinal quality of life (mental and physical health functioning from the SF-12), over and above other covariates (diagnosis, gender identity, race, age), we ran linear mixed-effects models.
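As an illustration of the chi-square step, the sketch below computes a Pearson chi-square statistic for a diagnostic group by threshold table. It is written in Python for illustration only, and the counts are hypothetical, not study data. With three groups and three thresholds the test has (3 − 1) × (3 − 1) = 4 degrees of freedom, which is why the chi-square results are reported with 4 degrees of freedom.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group and threshold were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts of rolling windows (rows: HC, BD, PC; columns: low, moderate, high).
example_counts = [
    [80, 18, 2],
    [40, 50, 10],
    [60, 35, 5],
]
statistic, dof = chi_square_independence(example_counts)
print(f"chi-square({dof}) = {statistic:.2f}")
```

Converting the statistic to a p value requires the chi-square distribution function, which is omitted here to keep the sketch dependency-free.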

Selection of participants and final sample sizes are provided in a consort diagram (Figure 1) and summary statistics for the final sample are provided in Table 1. Models for S.D. are detailed below with zero-order Spearman rho within- and between-person correlations presented in Supplemental Figure 1. General results with AR and MSSD are provided in Appendix 2. Descriptive statistics for participant and instance level rolling S.D., AR, and MSSD are provided in Supplemental Tables 1–4. Models examining the association between AR and MSSD metrics with longitudinal SF-12 scores are presented in Supplemental Tables 5–15. Visual plots of results for AR and MSSD metrics are available in Supplemental Figures 2 and 3.

Figure 3.


Relationship between One-Year Rolling S.D. Thresholds and Mental and Physical Health Functioning

Violin plots showing A. MCS T Scores by low, moderate, and high thresholds for One-year Rolling PHQ-9 S.D. scores. B. PCS T Scores by low, moderate, and high thresholds for One-year Rolling PHQ-9 S.D. scores. C. MCS T Scores by low, moderate, and high thresholds for One-year Rolling ASRM S.D. scores. D. PCS T Scores by low, moderate, and high thresholds for One-year Rolling ASRM S.D. scores. E. MCS T Scores by low, moderate, and high thresholds for One-year Rolling GAD-7 S.D. scores. F. PCS T Scores by low, moderate, and high thresholds for One-year Rolling GAD-7 S.D. scores. MCS T score = Mental Component Summary from the SF-12. PCS T score = Physical Component Summary from SF-12. T Scores have a mean of 50 (dashed black line) and standard deviation of 10. Higher T scores indicate better functioning. Black diamond reflects the group mean.

Table 3.

One-Year Rolling S.D. Thresholds Predicting Longitudinal Physical Health Functioning

Models: PHQ-9 predicting PCS | ASRM predicting PCS | GAD-7 predicting PCS
Predictor | PHQ-9 Estimate | PHQ-9 Conf. Int (95%) | PHQ-9 P-Value | ASRM Estimate | ASRM Conf. Int (95%) | ASRM P-Value | GAD-7 Estimate | GAD-7 Conf. Int (95%) | GAD-7 P-Value
(Intercept) | 61.00 | 58.28 – 63.71 | <0.001 | 61.40 | 58.60 – 64.21 | <0.001 | 60.55 | 57.67 – 63.42 | <0.001
Medium vs. Low | −0.34 | −0.60 – −0.08 | 0.011 | 0.51 | 0.25 – 0.77 | <0.001 | −0.04 | −0.25 – 0.18 | 0.742
High vs. Low | 0.02 | −0.48 – 0.52 | 0.926 | 1.56 | 1.07 – 2.04 | <0.001 | −0.38 | −0.80 – 0.04 | 0.077
BD v HC | −6.00 | −7.71 – −4.29 | <0.001 | −6.67 | −8.43 – −4.90 | <0.001 | −6.32 | −8.07 – −4.56 | <0.001
PC v HC | −3.08 | −5.54 – −0.62 | 0.014 | −3.05 | −5.58 – −0.53 | 0.018 | −3.31 | −5.92 – −0.71 | 0.013
BD v PC | −2.86 | −5.37 – −0.34 | 0.026 | −3.47 | −5.99 – −0.95 | 0.007 | −2.85 | −5.56 – −0.14 | 0.039
Is Female | −1.38 | −2.92 – 0.17 | 0.080 | −1.54 | −3.11 – 0.04 | 0.056 | −1.15 | −2.81 – 0.52 | 0.179
Is Non-White | −0.86 | −2.69 – 0.98 | 0.361 | −1.14 | −3.03 – 0.74 | 0.235 | −0.59 | −2.52 – 1.34 | 0.550
Age at Enrollment | −0.21 | −0.26 – −0.16 | <0.001 | −0.21 | −0.26 – −0.16 | <0.001 | −0.21 | −0.26 – −0.15 | <0.001

Note. Three linear mixed-effects models with random intercepts were used without adjustment for multiple comparisons. By default, the lmer function in R provides two-sided p values using Satterthwaite’s degrees of freedom method. PCS = SF-12 Physical Health Functioning T Score (M = 50, S.D. = 10). Estimates are unstandardized betas for interpretation of +/− T score values. Medium vs. Low = Medium vs. Low Threshold; High vs. Low = High vs. Low Threshold. BD = bipolar disorders; HC = healthy control; PC = psychiatric comparison.

One-year Rolling S.D. for PHQ-9

On average, individuals had 64 (Range = 11–101) PHQ-9 S.D. scores.

The BD group had significantly higher one-year rolling S.D. scores than HC (β = 1.01, 95% confidence interval [0.99 – 1.25], p<.001, large effect) and PC (β = 0.65 [0.46 – 0.84], p<.001, large effect). A significantly greater proportion of individuals with BD fell above the moderate and high thresholds compared to HC and PC, and a significantly lower proportion fell below the low threshold, at both the repeated measure level (χ²(4) = 5639.3, p<.001; Figure 2A) and participant average level (χ²(4) = 197.56, p<.001; Figure 2B). One-year rolling S.D. thresholds were significantly associated with mental health functioning; those above the moderate threshold had an average T score 3.47 less than those in the low group (moderate effect) and those above the high threshold had an average T score 5.21 less (moderate effect) than those in the low group (Table 2; Figure 3). S.D. thresholds for the PHQ-9 were not significantly associated with physical health functioning.

Figure 2.


Thresholding of One-Year Rolling S.D. Scores Across Diagnostic Groups

A. Mosaic plot showing proportion of one-year rolling PHQ-9 S.D. scores in low, moderate, and high thresholds by diagnosis. B. Scatterplot showing individual participant thresholds based on the average of their one-year rolling PHQ-9 S.D. scores. C. Mosaic plot showing proportion of one-year rolling ASRM S.D. scores in low, moderate, and high thresholds by diagnosis. D. Scatterplot showing individual participant thresholds based on the average of their one-year rolling ASRM S.D. scores. E. Mosaic plot showing proportion of one-year rolling GAD-7 S.D. scores in low, moderate, and high thresholds by diagnosis. F. Scatterplot showing individual participant thresholds based on the average of their one-year rolling GAD-7 S.D. scores. BD = Bipolar Disorders; HC = Healthy Control; PC = Psychiatric Comparison. Solid lines represent the cutoff for moderate instability, dashed lines represent the cutoff for high instability.

Table 2.

One-Year Rolling S.D. Thresholds Predicting Longitudinal Mental Health Functioning

Models: PHQ-9 predicting MCS | ASRM predicting MCS | GAD-7 predicting MCS
Predictor | PHQ-9 Estimate | PHQ-9 Conf. Int (95%) | PHQ-9 P-Value | ASRM Estimate | ASRM Conf. Int (95%) | ASRM P-Value | GAD-7 Estimate | GAD-7 Conf. Int (95%) | GAD-7 P-Value
(Intercept) | 51.63 | 49.17 – 54.08 | <0.001 | 51.58 | 48.88 – 54.29 | <0.001 | 51.06 | 48.34 – 53.79 | <0.001
Medium vs. Low | −3.47 | −3.80 – −3.13 | <0.001 | −0.37 | −0.72 – −0.02 | 0.038 | −0.91 | −1.19 – −0.63 | <0.001
High vs. Low | −5.21 | −5.86 – −4.57 | <0.001 | −1.03 | −1.68 – −0.38 | 0.002 | −1.58 | −2.14 – −1.02 | <0.001
BD v HC | −12.27 | −13.82 – −10.72 | <0.001 | −14.29 | −16.00 – −12.59 | <0.001 | −14.53 | −16.20 – −12.87 | <0.001
PC v HC | −5.61 | −7.84 – −3.38 | <0.001 | −6.30 | −8.73 – −3.86 | <0.001 | −7.37 | −9.85 – −4.90 | <0.001
BD v PC | −6.70 | −8.96 – −4.45 | <0.001 | −7.95 | −10.36 – −5.54 | <0.001 | −7.18 | −9.72 – −4.64 | <0.001
Is Female | −0.17 | −1.56 – 1.23 | 0.817 | −0.51 | −2.03 – 1.00 | 0.508 | −0.63 | −2.21 – 0.96 | 0.439
Is Non-White | 0.34 | −1.32 – 2.01 | 0.688 | 0.52 | −1.29 – 2.34 | 0.573 | 0.15 | −1.69 – 1.99 | 0.872
Age at Enrollment | 0.07 | 0.02 – 0.12 | 0.003 | 0.08 | 0.02 – 0.13 | 0.004 | 0.09 | 0.04 – 0.15 | <0.001

Note. Three linear mixed-effects models with random intercepts were used without adjustment for multiple comparisons. By default, the lmer function in R provides two-sided p values using Satterthwaite’s degrees of freedom method. MCS = SF-12 Mental Health Functioning T Score (M = 50, S.D. = 10). Estimates are unstandardized betas for interpretation of +/− T score values. Medium vs. Low = Medium vs. Low Threshold; High vs. Low = High vs. Low Threshold. BD = bipolar disorders; HC = healthy control; PC = psychiatric comparison.

One-year Rolling S.D. for ASRM

On average, individuals had 55 (Range = 11–100) ASRM S.D. scores. The BD group had significantly higher ASRM S.D. than HC (β = 0.87 [0.73 – 1.00], p<.001, large effect) and PC (β = 0.45 [0.27 – 0.64], p<.001, moderate effect). A significantly greater proportion of individuals with BD were above the moderate and high instability thresholds for the ASRM compared to HC and PC and a significantly lower proportion of individuals with BD fell below the low instability threshold at the repeated measure level (χ²(4) = 3442.8, p<.001; Figure 2C) and the participant average level (χ²(4) = 115.90, p<.001; Figure 2D). Being in the high threshold was associated with poorer mental health functioning with an average T score 1.09 less than those in the low group, holding all other variables constant (Table 2; Figure 3). In contrast, being in the moderate or high threshold was associated with better physical health functioning; those in the moderate threshold had an average T score 0.51 higher (small effect) than those in the low group and those above the high threshold had a T score 1.56 higher (small effect) than those in the low group, holding all other variables constant (Table 3).

One-Year Rolling S.D. for GAD-7

On average, individuals had 47 (Range = 3–68) GAD-7 S.D. scores.

Within-person observations for the GAD-7 were fewer because the GAD-7 was not administered until year 5 of the study, resulting in a much lower number of measured instances. The BD group had significantly higher GAD-7 S.D. than HC (β = 0.19 [0.13 – 0.26], p<.001, small effect) and PC (β = 0.10 [0.01 – 0.20], p=.037, small effect). At the repeated measures level, individuals with BD had higher representation in the moderate threshold, but not the high threshold (χ²(4) = 52.603, p<.001; Figure 2E). At the participant average level, a significantly greater proportion of individuals with BD were above the moderate threshold (χ²(4) = 18.21, p<.001; Figure 2F) with no participants falling into the high threshold across diagnostic groups. Rolling thresholds significantly predicted mental health functioning; those above the moderate threshold had an average T score 0.91 less than those in the low group (small effect); those above the high instability threshold had an average T score 1.58 less than those in the low group (small effect), holding all other variables constant (Table 2; Figure 3). Rolling thresholds for the GAD-7 were not significantly associated with physical health functioning (Table 3).

Discussion

Mood instability is recognized as a core phenotype of BD, yet there are no established methods to measure and index this instability that can be easily and efficiently adapted to clinical trials and/or the delivery of routine clinical care, such as with PROMs. In a unique longitudinal cohort with deep phenotyping, the PLS-BD, instability was characterized based on the intraindividual S.D. of PROMs including the PHQ-9, ASRM, and GAD-7 over one-year rolling windows. To enhance clinical utility of this continuous measure, we then stratified individuals using low, moderate, and high thresholds based on one-year rolling S.D. scores.

Across all measures, those with BD had significantly higher one-year rolling S.D. scores compared to HC and PC with moderate to large effect sizes for depression and (hypo)mania and small effect sizes for anxiety. Continuous scores and thresholds differed significantly for BD compared to PC, suggesting that higher intraindividual S.D. scores are not simply a function of psychopathology in general but rather are reflective of the trajectory of BD. Based on participant averages over 10 years or more, no individuals in the HC or PC group fell into the high threshold for the PHQ-9, ASRM, or GAD-7. At the repeated measures level for the PHQ-9 rolling SDs, there were 19 instances where a HC participant fell into the high threshold and 71 instances where a PC participant fell into the high threshold compared to 1,104 instances for individuals with BD. For the ASRM rolling SDs, there were 87 instances where a HC participant fell into the high threshold, 598 instances for PC participants, and 3,771 for BD participants. For the GAD-7 rolling SDs, there were 242 instances where a HC participant fell into the high threshold, 94 for PC participants, and 609 for BD participants. These numbers suggest that the intraindividual S.D. thresholds have high specificity to BD, both across one-year windows and over longer periods.
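As a worked arithmetic check on the instance counts reported above, the following Python sketch computes the share of high-threshold instances contributed by participants with BD for each measure. The variable and function names are illustrative assumptions. These raw shares are not formal specificity estimates, since the groups differ in size (385 BD, 71 PC, 147 HC) and in the number of rolling windows contributed.

```python
# High-threshold instance counts as reported in the text, by measure and group.
high_instances = {
    "PHQ-9": {"HC": 19, "PC": 71, "BD": 1104},
    "ASRM": {"HC": 87, "PC": 598, "BD": 3771},
    "GAD-7": {"HC": 242, "PC": 94, "BD": 609},
}


def bd_share(counts):
    """Percentage of high-threshold instances that come from BD participants."""
    return 100.0 * counts["BD"] / sum(counts.values())


for measure, counts in high_instances.items():
    print(f"{measure}: {bd_share(counts):.1f}% of high-threshold instances are from BD")
```

On these counts the BD share is roughly 92.5% for the PHQ-9, 84.6% for the ASRM, and 64.4% for the GAD-7.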

Higher one-year rolling S.D. based on the PHQ-9 was associated with adverse outcomes for those with BD. Those in the moderate and high thresholds had both lower mental (large effect) and physical health functioning (small effect). This is consistent with findings from the Global Bipolar Cohort (n = 5,882 individuals with BD), which reported that subsyndromal symptoms of depression are among the strongest predictors of poor functioning in BD (6). One-year rolling S.D. thresholds based on the ASRM were also associated with lower mental health functioning (small effect). In contrast, they were associated with better physical health functioning (small effect). The ASRM can capture higher positive affectivity and sociability that are not necessarily indicative of (hypo)mania. General positive affectivity is associated with a host of adaptive outcomes, particularly related to physical health (33). As such, high one-year rolling SDs for the ASRM may represent worse mental health functioning (due to the presence of significant deviations from one’s average to clinical levels) but better physical health functioning due to higher levels of positive affectivity. However, these were very small effects and the difference in T score is likely clinically insignificant, so further interpretation is cautioned. For the GAD-7, being in the moderate and high threshold was associated with worse mental health functioning but was not associated with physical health functioning. Taken together, one-year rolling PHQ-9 SDs are most indicative of poor mental and physical health functioning and clearly represent mood instability in BD.

To examine whether the intraindividual S.D. was the ideal metric to capture mood instability, we also calculated the AR and MSSD over the same one-year rolling windows, as these measures account for temporal dependency, an important component of instability (25, 34). Overall, results for MSSD replicated those of the rolling SDs but were attenuated in strength with smaller effect sizes. One-year rolling AR was much less specific: there were no group differences in AR values between BD, HC, and PC, and its association with mental and physical health functioning was mixed and difficult to interpret. The most consistent results indicated that being in the high threshold on AR was associated with better mental and physical health functioning, which, conceptually, indicates that higher levels of stability are associated with better outcomes. These results support other extant research highlighting that the intraindividual S.D. captures more variance in psychopathology and well-being outcomes than other more complex affective dynamics (30). They also highlight that, rather than the temporal dependency of scores being critical to instability, the strength of shifts in individuals’ PROM scores is most indicative of instability and poor functioning in BD.

Clinically, these results provide guidelines for practical clinical monitoring in the daily patient care setting and offer an innovative strategy for outcomes assessments in clinical treatment trials for BD by calculating the intraindividual S.D. of PROMs as an index of mood instability, particularly for the PHQ-9. The PHQ-9 and GAD-7 are among the most validated and most widely used PROMs across both primary care and psychiatric settings (20, 35). Given their short length (9 and 7 items, respectively), accessibility, and translations into numerous languages, they are easy to integrate into standard operating procedures in primary care and psychiatric settings, research studies, and clinical trials. The ASRM is less widely administered and is used specifically for BD monitoring; the reliability of this measure (and of most self-report measures of mania) is generally lower than that of the PHQ-9 and GAD-7, but it remains useful in the longitudinal setting (36). Traditional methods of scoring the PHQ-9, GAD-7, and ASRM focus on a sequential decrease in scores (e.g., PHQ-9 score from 15 to 8 over 12 months) as evidence of a treatment effect. For example, the reliable change index (RCI) in clinical trials calculates the decrease in a score that should be expected based on one’s starting value and regression to the mean (e.g., (37)). However, in the study of BD it is arguably of greater clinical relevance to monitor mood instability as measured by the intraindividual S.D. in common PROMs such as PHQ-9 scores over rolling windows of time. Future investigations must consider that non-linear change (reduction in instability or a shift to a lower S.D. threshold) may be critical for understanding the nature, trajectory, and treatment response in BD (8, 9).
Once norms for mood instability indices are fully established, future research should investigate the impact that reducing mood instability (defined here as reducing the strength of deviations from baseline) has on primary outcomes of interest (e.g., physical and mental health functioning, well-being, cognition, interpersonal relationships, occupational outcomes).

Although results provide preliminary thresholds and guidelines for common PROMs that stratify individuals into the low, moderate, and high thresholds, next steps include identifying larger samples from electronic health record data or large global research collaboratives such as the Global Bipolar Cohort (GBC) (6, 38) and National Network of Depression Centers (39) to establish norms based on larger and more diverse samples.

Our proposed method for capturing mood instability is currently based on self-reported PROMs given that they are widely administered in research studies, clinical trials, and standard measurement-based care. However, self-report measures have limitations and can be biased by level of insight. Future directions are to examine whether rolling intraindividual S.D. of passively collected digital phenotyping data can also stratify individuals into low, moderate, and high thresholds that are associated with functioning. In fact, some of our preliminary work has shown that 7-day windows of intraindividual SDs of emotional valence and arousal, extracted from speech, are associated with mood severity (40). Others have explored whether passively collected movement and sleep data can predict depression symptom variability in those with major depressive disorder (41) and in medical interns (42). This is an exciting area of future research.

Limitations

A significant limitation of the PLS-BD is its relatively small size and limited ethnic and racial diversity, having been ascertained in a small geographical area. Effect sizes were large for diagnosis; those with BD and PC had significantly lower mental and physical health functioning than HC. While instability is likely intrinsic to the diagnosis, it is difficult to tease out the effect of the BD illness vs. the instability. A further limitation is the limited number of PROM assessments; participants completed PROMs every two months. This was done to minimize participant burden over an extended period of time; however, it is possible that the fluctuation of the PROMs is greater than what is picked up by an every-two-months cadence, and instability scores could be even higher if measured more frequently. Indeed, several studies over different timescales (e.g., hours, days, weeks) have shown that intraindividual S.D. is a meaningful metric for measuring outcomes and risk in BD (7, 11–13, 15, 28). While in this study there was no hypothesized sinusoidal or periodic variation due to the cadence of measurements, future research should examine this metric of (in)stability over varying timescales, such as through daily or weekly ecological momentary assessment, to identify the extent to which intraindividual S.D. is heightened after accounting for such variations (43). Although the one-year rolling S.D. scores and thresholds seem to represent a level of instability in this data and population, it is possible that results could differ in other populations or in data with different measures administered at a different cadence. These results align most closely with real-world clinically administered PROMs and treatment lengths. Nonetheless, future investigations should aim to investigate the generalizability of the one-year rolling S.D. measure.
Lastly, this study establishes associations between mood instability metrics and functioning; however, future research is needed to establish causal mechanisms underlying mood instability and its impact on the course and symptoms in BD.

Interpretation

This study outlines a paradigm shift in monitoring outcomes in BD, by measuring intraindividual S.D. as a primary outcome index. Higher instability (higher one-year rolling SDs or moderate or high thresholds) in commonly used PROMs, particularly the PHQ-9, characterized the course of illness over time, correlated with functional outcomes, and significantly differentiated those with BD from HC and PC. With the growing datasets emerging worldwide, there will be sufficient information from common clinical outcomes measures to assess instability indices of BD globally.

Methods

Study design and participants

Participants were drawn from the PLS-BD, an ongoing cohort study of BD that has been continuously gathering phenotypic and biological data over the naturalistic course of BD beginning in 2006 (n=1,445 participants). Participants are recruited via advertisements, psychiatric clinics, mental health centers, and community outreach events in Michigan in a purposive sampling approach. The goal of the PLS-BD is to have 1,000 active participants engaged at any time. A priori power analysis was not conducted given that this secondary-data analysis project depended on an existing cohort with longitudinal data already collected. Participants are not enrolled if diagnosed with neurological disease or alcohol or substance use that would interfere with the ability to complete research. The present study included 603 participants (Demographics, Table 1; Consort diagram, Figure 1) who had completed at least 10 years of follow-up in the study. Participants in the BD group had a DSM-IV-TR diagnosis of BD I (n = 258), BD II (n = 80), BD Not Otherwise Specified (n = 30), or Schizoaffective BD (n = 17). Participants in the PC group had a DSM-IV-TR diagnosis of major depressive disorder (n = 20), non-affective diagnoses (n = 23), or other affective diagnoses (n = 28). The HC group included individuals (n = 147) who had no psychiatric history and no first-degree relatives with Axis I diagnoses. Diagnosis was assessed using the Diagnostic Interview for Genetic Studies, version 4 (44). A team of at least two doctoral-level psychologists or psychiatrists confirmed diagnosis using criteria from the DSM-IV-TR and all available medical history. Participants self-reported their race and ethnicity, and the sample was generally representative of the study catchment area. Written informed consent was obtained from participants and all study procedures were approved by the Institutional Review Board at Michigan Medicine (HUM00000606).
Participants were financially compensated for their time, receiving $100 for the baseline assessment and yearly compensation of $150 for the duration of their involvement. The current study followed the requirements of the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline statement.

Table 1.

Participant Demographics

Characteristic | HC (N=147) | BD (N=385) | PC (N=71)
Age at Enrollment (years)
 Mean (S.D.) | 35.6 (15.4) | 39.7 (12.8) | 38.5 (15.1)
 Median [Min, Max] | 30.0 [18.0, 77.0] | 39.0 [18.0, 69.0] | 39.0 [18.0, 80.0]
Duration Enrolled (years)
 Mean (S.D.) | 13.0 (1.92) | 12.9 (2.16) | 13.0 (2.04)
 Median [Min, Max] | 13.0 [10.0, 17.0] | 13.0 [10.0, 17.0] | 13.0 [10.0, 17.0]
Gender Identity
 Female | 92 (62.6%) | 264 (68.6%) | 40 (56.3%)
 Male | 55 (37.4%) | 120 (31.2%) | 31 (43.7%)
 Other | 0 (0%) | 1 (0.3%) | 0 (0%)
Ethnicity
 Hispanic or Latino | 7 (4.8%) | 16 (4.2%) | 2 (2.8%)
 Not Hispanic or Latino | 137 (93.2%) | 362 (94.0%) | 67 (94.4%)
 Unknown or not reported | 3 (2.0%) | 7 (1.8%) | 2 (2.8%)
Race
 African-American | 20 (13.6%) | 31 (8.1%) | 15 (21.1%)
 Asian | 17 (11.6%) | 4 (1.0%) | 4 (5.6%)
 More Than One Race | 5 (3.4%) | 14 (3.6%) | 3 (4.2%)
 Native American/Alaskan Native | 1 (0.7%) | 3 (0.8%) | 2 (2.8%)
 Unknown or not reported | 2 (1.4%) | 8 (2.1%) | 0 (0%)
 White or Caucasian | 102 (69.4%) | 325 (84.4%) | 47 (66.2%)

Note. BD = bipolar disorders; HC = healthy controls; PC = psychiatric comparison; NA = Not applicable.

Figure 1. CONSORT diagram showing final sample size based on selection criteria.

Choice of primary measures

The full longitudinal protocol and procedures of the PLS-BD are outlined elsewhere (19). The current investigation includes mood and functioning measures that are administered every two months via REDCap electronic data capture tools hosted at the Michigan Institute for Clinical and Health Research at the University of Michigan (45, 46). The repeated measures were completed independently by participants wherever they had access to the internet. Demographics including gender identity, race, ethnicity, and age of onset are collected annually, whereas age was calculated as of November 1, 2023 (the date of data extraction).

Self-reported depression symptoms over the past two weeks were measured using the PHQ-9. Nine items are answered on a Likert scale from 0 (Not at all) to 3 (Nearly every day), with total scores ranging from 0 to 27 (5–9: mild, 10–14: moderate, 15–19: moderately severe, 20–27: severe depression). Self-reported manic symptoms over the past two weeks were measured using the five-item ASRM. Items are answered on a scale from 1 to 5, with total scores ranging from 5 to 25 (≥ 6: concern for (hypo)mania). Self-reported anxiety symptoms were measured using the seven-item GAD-7. Items are answered on a scale from 0 (Not at all) to 3 (Nearly every day), with total scores ranging from 0 to 21 (0–4: minimal, 5–9: mild, 10–14: moderate, ≥ 15: severe anxiety). The GAD-7 was included because anxiety disorders are highly comorbid with BD and anxiety tends to fluctuate with depression in BD (47). General health and quality of life were measured using the short form of the SF-36, the SF-12 (23). The SF-12 yields two T scores (mean = 50, S.D. = 10), the mental component summary (MCS) and the physical component summary (PCS), with higher scores indicating better-than-average functioning. Internal consistency for all measures was good to excellent (Cronbach’s α = 0.82–0.92).
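As an illustration of the scoring rules described above, the following minimal Python sketch sums item responses and maps a total score to a severity band for the PHQ-9 and GAD-7. It is not the study's scoring code: the function and constant names are hypothetical, and the "minimal" label for PHQ-9 totals of 0–4 is an assumption, since the text above does not name that band.

```python
from bisect import bisect_right

# Lower bounds of each severity band, taken from the cut-offs stated in the text.
# ASSUMPTION: the PHQ-9 label for totals of 0-4 ("minimal") is not given in the text.
PHQ9_BANDS = ([5, 10, 15, 20],
              ["minimal", "mild", "moderate", "moderately severe", "severe"])
GAD7_BANDS = ([5, 10, 15],
              ["minimal", "mild", "moderate", "severe"])


def total_score(items, n_items, max_item=3):
    """Sum item responses after checking the item count and the 0..max_item range."""
    if len(items) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(items)}")
    if any(not 0 <= x <= max_item for x in items):
        raise ValueError(f"item responses must lie between 0 and {max_item}")
    return sum(items)


def severity(total, bands):
    """Return the label of the band whose lower bound is the largest one <= total."""
    cutoffs, labels = bands
    return labels[bisect_right(cutoffs, total)]


# Hypothetical responses for one PHQ-9 administration (nine items, each 0-3).
phq9_total = total_score([1, 1, 1, 1, 1, 1, 1, 1, 1], n_items=9)
print(phq9_total, severity(phq9_total, PHQ9_BANDS))
```

The ASRM is left out of the sketch because the text gives only a single threshold for it rather than a set of severity bands.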

Supplementary Material

Supplementary Materials

Acknowledgements

This research was supported by the Heinz C. Prechter Bipolar Research Fund at the University of Michigan (MGM), the Richard Tam Foundation (MGM), the Eisenberg Family Depression Center (MGM, SHS) and is based upon work supported by the Brain and Behavior Research Foundation Young Investigator Award 30719 (SHS), NIMH L30MH127613 (SHS), and NIMH K23MH131601 (SHS). With gratitude, we acknowledge the University of Michigan Prechter Bipolar Longitudinal Research participants and thank the research team of the Prechter Bipolar Research Program for their contributions in the collection and stewardship of the data used in this publication.

Footnotes

Competing Interests Statement

Melvin G. McInnis has received consulting and research support from Janssen Pharmaceuticals and has two US patents to the University of Michigan (US Patent #9,685,174; US Patent #11,545,173). AKY and SHS have no disclosures to report.

Code Availability

Code to calculate the one-year rolling S.D., MSSD, and AR for a mood measure is openly available on an Open Science Framework (OSF) page: https://osf.io/mj3zk/.
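For orientation only, the three indices named above can be sketched in a few lines of Python using the standard library. This is not the authors' OSF code. It assumes that MSSD denotes the mean squared successive difference and AR the lag-1 autocorrelation, that a one-year window corresponds to six consecutive bimonthly scores with no missing assessments, and that the sample (n − 1) standard deviation is used. All function names are hypothetical.

```python
from statistics import mean, stdev


def rolling_sd(scores, window=6):
    """Sample SD over each run of `window` consecutive scores.

    ASSUMPTION: six bimonthly assessments stand in for a one-year window.
    """
    return [stdev(scores[i:i + window])
            for i in range(len(scores) - window + 1)]


def mssd(scores):
    """Mean of the squared differences between successive scores."""
    return mean((b - a) ** 2 for a, b in zip(scores, scores[1:]))


def lag1_autocorr(scores):
    """Lag-1 autocorrelation; returns NaN when the series has no variance."""
    m = mean(scores)
    denom = sum((x - m) ** 2 for x in scores)
    if denom == 0:
        return float("nan")
    num = sum((a - m) * (b - m) for a, b in zip(scores, scores[1:]))
    return num / denom


# Hypothetical PHQ-9 totals for one participant, one score every two months.
series = [4, 6, 4, 6, 4, 6, 4, 6]
print(rolling_sd(series))
print(mssd(series), lag1_autocorr(series))
```

A series that alternates between two values, as in this example, has a modest rolling S.D. but a strongly negative lag-1 autocorrelation, which is one reason the indices are reported separately.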

Data Availability

Data collected at the University of Michigan require a fully executed Data Use Agreement to be shared outside of the institution. Longitudinal and outcomes data used in the present study, along with data dictionaries, are available subject to review of the proposed analyses and acceptance of a Data Use Agreement. Enquiries can be addressed at http://www.prechterprogram.org/data.

References

1. He H, Hu C, Ren Z, Bai L, Gao F, Lyu J. Trends in the incidence and DALYs of bipolar disorder at global, regional, and national levels: Results from the global burden of Disease Study 2017. Journal of psychiatric research. 2020;125:96–105.
2. Yocum AK, Friedman E, Bertram HS, Han P, McInnis MG. Comparative mortality risks in two independent bipolar cohorts. Psychiatry Res. 2023;330:115601.
3. Bauer M, Andreassen OA, Geddes JR, Kessing LV, Lewitzka U, Schulze TG, et al. Areas of uncertainties and unmet needs in bipolar disorders: clinical and research perspectives. The Lancet Psychiatry. 2018;5(11):930–9.
4. Henry C, Mitropoulou V, New AS, Koenigsberg HW, Silverman J, Siever LJ. Affective instability and impulsivity in borderline personality and bipolar II disorders: similarities and differences. Journal of psychiatric research. 2001;35(6):307–12.
5. Bauer M, Glenn T, Grof P, Schmid R, Pfennig A, Whybrow PC. Subsyndromal mood symptoms: a useful concept for maintenance studies of bipolar disorder? Psychopathology. 2009;43(1):1–7.
6. Burdick KE, Millett CE, Yocum AK, Altimus CM, Andreassen OA, Aubin V, et al. Predictors of functional impairment in bipolar disorder: Results from 13 cohorts from seven countries by the global bipolar cohort collaborative. Bipolar disorders. 2022;24(7):709–19.
7. Sperry SH, Kwapil TR. Bipolar spectrum psychopathology is associated with altered emotion dynamics across multiple timescales. Emotion. 2022;22(4):627.
8. Cochran AL, Schultz A, McInnis MG, Forger DB. A comparison of mathematical models of mood in bipolar disorder. Computational neurology and psychiatry. 2017:315–41.
9. Bonsall MB, Wallace-Hadrill SM, Geddes JR, Goodwin GM, Holmes EA. Nonlinear time-series approaches in characterizing mood stability and mood instability in bipolar disorder. Proceedings of the Royal Society B: Biological Sciences. 2012;279(1730):916–24.
10. Henry C, Van den Bulke D, Bellivier F, Roy I, Swendsen J, M’Baïlara K, et al. Affective lability and affect intensity as core dimensions of bipolar disorders during euthymic period. Psychiatry Res. 2008;159(1–2):1–6.
11. Faurholt-Jepsen M, Busk J, Bardram JE, Stanislaus S, Frost M, Christensen EM, et al. Mood instability and activity/energy instability in patients with bipolar disorder according to day-to-day smartphone-based data–An exploratory post hoc study. Journal of Affective Disorders. 2023;334:83–91.
12. Faurholt-Jepsen M, Frost M, Busk J, Christensen EM, Bardram JE, Vinberg M, et al. Differences in mood instability in patients with bipolar disorder type I and II: a smartphone-based study. International Journal of Bipolar Disorders. 2019;7(1):1–8.
13. Sperry SH, Walsh MA, Kwapil TR. Emotion dynamics concurrently and prospectively predict mood psychopathology. Journal of Affective Disorders. 2020;261:67–75.
14. Gershon A, Eidelman P. Inter-episode affective intensity and instability: predictors of depression and functional impairment in bipolar disorder. Journal of Behavior Therapy and Experimental Psychiatry. 2015;46:14–8.
15. Faurholt-Jepsen M, Frost M, Busk J, Christensen EM, Bardram JE, Vinberg M, et al. Is smartphone-based mood instability associated with stress, quality of life, and functioning in bipolar disorder? Bipolar Disorders. 2019;21(7):611–20.
16. Strejilevich S, Martino DJ, Murru A, Teitelbaum J, Fassi G, Marengo E, et al. Mood instability and functional recovery in bipolar disorders. Acta Psychiatrica Scandinavica. 2013;128(3):194–202.
17. Kessing LV, Faurholt-Jepsen M. Mood instability-A new outcome measure in randomised trials of bipolar disorder? European Neuropsychopharmacology: the Journal of the European College of Neuropsychopharmacology. 2022;58:39–41.
18. McInnis MG, Assari S, Kamali M, Ryan K, Langenecker SA, Saunders EFH, et al. Cohort Profile: The Heinz C. Prechter Longitudinal Study of Bipolar Disorder. Int J Epidemiol. 2018;47(1):28-n.
19. Yocum AK, Anderau S, Bertram H, Burgess HJ, Cochran AL, Deldin PJ, et al. Cohort Profile Update: The Heinz C. Prechter Longitudinal Study of Bipolar Disorder. International Journal of Epidemiology. 2023:dyad109.
20. Kroenke K, Spitzer RL, Williams JBW. The PHQ-9: validity of a brief depression severity measure. 2001. p. 606–13.
21. Altman EG, Hedeker D, Peterson JL, Davis JM. The Altman Self-Rating Mania Scale. Biological Psychiatry. 1997;42(10):948–55.
22. Spitzer RL, Kroenke K, Williams JBW, Löwe B. A brief measure for assessing generalized anxiety disorder: the GAD-7. Archives of internal medicine. 2006;166(10):1092–7.
23. Ware JE, Kosinski M, Keller SD. A 12-Item Short-Form Health Survey: construction of scales and preliminary tests of reliability and validity. Medical care. 1996;34(3):220–33.
24. Fisher AJ, Newman MG. Reductions in the diurnal rigidity of anxiety predict treatment outcome in cognitive behavioral therapy for generalized anxiety disorder. Behaviour research and therapy. 2016;79:46–55.
25. Jahng S, Wood PK, Trull TJ. Analysis of affective instability in ecological momentary assessment: Indices using successive difference and group comparison via multilevel modeling. Psychological methods. 2008;13(4):354.
26. Bos EH, de Jonge P, Cox RF. Affective variability in depression: Revisiting the inertia–instability paradox. British Journal of Psychology. 2019;110(4):814–27.
27. Hu D, Kalokerinos EK, Tamir M. Flexibility or instability? Emotion goal dynamics and mental health. Emotion. 2023.
28. Sperry SH, Kwapil TR. Affective dynamics in bipolar spectrum psychopathology: Modeling inertia, reactivity, variability, and instability in daily life. Journal of Affective Disorders. 2019;251:195–204.
29. Sperry SH, Kwapil TR. Bipolar spectrum psychopathology is associated with altered emotion dynamics across multiple timescales. Emotion. 2020.
30. Dejonckheere E, Mestdagh M, Houben M, Rutten I, Sels L, Kuppens P, et al. Complex affect dynamics add limited information to the prediction of psychological well-being. Nature human behaviour. 2019;3(5):478–91.
31. Bates D, Mächler M, Bolker B, Walker S. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software. 2015;67(1):1–48.
32. Posit team. RStudio: Integrated Development Environment for R. Boston, MA: Posit Software, PBC; 2023.
33. Pressman SD, Jenkins BN, Moskowitz JT. Positive affect and health: What do we know and where next should we go? Annual review of psychology. 2019;70:627–50.
34. Fisher AJ, Woodward SH. Cardiac stability at differing levels of temporal analysis in panic disorder, post-traumatic stress disorder, and healthy controls. Psychophysiology. 2014;51(1):80–7.
35. Kroenke K, Spitzer RL, Williams JB, Löwe B. The patient health questionnaire somatic, anxiety, and depressive symptom scales: a systematic review. Gen Hosp Psychiatry. 2010;32(4):345–59.
36. Altman E. Rating scales for mania: is self-rating reliable? Journal of Affective Disorders. 1998;50(2–3):283–6.
37. Hageman WJ, Arrindell WA. A further refinement of the reliable change (RC) index by improving the pre-post difference score: introducing RCID. Behaviour research and therapy. 1993;31(7):693–700.
38. Singh B, Yocum AK, Strawbridge R, Burdick KE, Millett CE, Peters AT, et al. Patterns of pharmacotherapy for bipolar disorder: a GBC survey. Bipolar disorders. 2023.
39. Zandi PP, Wang Y-H, Patel PD, Katzelnick D, Turvey CL, Wright JH, et al. Development of the national network of depression centers mood outcomes program: a multisite platform for measurement-based care. Psychiatr Serv. 2020;71(5):456–64.
40. Mower Provost E, Sperry SH, Tavernor J, Anderau S, Yocum AK, McInnis MG. Emotion Recognition in the Real-World: Passively Collecting and Estimating Emotions from Natural Speech Data of Individuals with Bipolar Disorder. Transactions on Affective Computing. under review post revision.
41. Price GD, Heinz MV, Song SH, Nemesure MD, Jacobson NC. Using digital phenotyping to capture depression symptom variability: detecting naturalistic variability in depression symptoms across one year using passively collected wearable movement and sleep data. Translational Psychiatry. 2023;13(1):381.
42. Fang Y, Forger DB, Frank E, Sen S, Goldstein C. Day-to-day variability in sleep parameters and depression risk: a prospective cohort study of training physicians. NPJ digital medicine. 2021;4(1):28.
43. Lazarus G, Song J, Crawford CM, Fisher AJ. A close look at the role of time in affect dynamics research. Affect dynamics. 2021:95–116.
44. Nurnberger JI Jr., Blehar MC, Kaufmann CA, York-Cooler C, Simpson SG, Harkavy-Friedman J, et al. Diagnostic interview for genetic studies. Rationale, unique features, and training. NIMH Genetics Initiative. Arch Gen Psychiatry. 1994;51(11):849–59; discussion 63–4.
45. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support. Journal of biomedical informatics. 2009;42(2):377–81.
46. Harris PA, Minor B, Elliott V, Fernandez M, O’Neal L, McLeod L, et al. The REDCap consortium: Building an international community of software partners. J Biomed Inform. 2019;95:103208.
47. Kim H, McInnis M, Sperry SH. Longitudinal dynamics between anxiety and depression in bipolar spectrum disorders. Journal of Psychopathology and Clinical Science. 2023;in press.
