Abstract
A growing body of research suggests that standard group-based models might provide little insight regarding individuals. In the current study, we sought to compare group-based and individual predictors of bothersome tinnitus, illustrating how researchers can use dynamic structural equation modeling (DSEM) for intensive longitudinal data to examine whether findings from analyses of the group apply to individuals. A total of 43 subjects with bothersome tinnitus responded to up to 200 surveys each. In multi-level DSEM models, survey items loaded on three factors (tinnitus bother, cognitive symptoms, and anxiety) and results indicated a reciprocal relationship between tinnitus bother and anxiety. In fully idiographic models, the three-factor model fit poorly for two individuals, and the multilevel model did not generalize to most individuals, possibly due to limited power. Research examining heterogeneous conditions such as tinnitus bother may benefit from methods such as DSEM that allow researchers to model dynamic relationships.
Keywords: Tinnitus, multilevel analyses, idiographic models, intensive longitudinal data
Most models presented across psychology and medicine focus on groups of individuals and imply that knowledge about the group will generalize to individuals within that group. That is, when researchers (and the public) read that a study of a group of participants found Factor X to be related to obesity, anxiety, or depressive relapse, most readers infer that Factor X is important for most or at least many individuals in that group. However, there are multiple logical problems with such inferences that have long been noted (Fisher et al., 2018; Hamaker, 2012; Molenaar, 2004, 2007, 2013) but not thoroughly grappled with in most empirical studies.
To summarize arguments presented by Molenaar (2004, 2007, 2013), group-based analyses cannot support the frequently-made inference that what is true about the group is likely to be true about many, or even any, individuals within the group. The methods we typically use to examine Factor X’s impact within a group tell us how variation between individuals on Factor X relates to variation between individuals on the outcome of interest (thus, in multilevel modeling, these findings are often referred to as between-level findings). What researchers—and, indeed, the public—typically want to know, however, is what we can reasonably expect regarding the relationship between variations on Factor X and the outcome of interest for a single individual. Unfortunately, between-level findings do not necessarily generalize to individuals over time. Fortunately, with sufficiently intensive longitudinal data, modeling prospective predictors within individuals is possible (see Molenaar et al., 1992, for an early example).
Completely individualized (i.e., idiographic) models, which we will primarily refer to as N = 1 models, have lately become more available to researchers through a variety of statistical packages. Such models allow a test of whether variations in Factor X over time reliably precede variations in the outcome of interest for a given individual, using only data from that individual. However, these models provide significant challenges in feasibility and interpretability because each individual becomes a single-case study. Therefore, large amounts of data per person are required to adequately model the dynamic relationships specific to that one individual.
Recent developments in dynamic structural equation modeling have led to a multilevel extension of dynamic structural equation modeling (ML-DSEM), in which both the entire group and each individual’s model within the context of that group can be examined within a structural equation modeling implementation, using a Bayesian estimator (Muthén, 2017). ML-DSEM can be viewed as an extension of multilevel modeling (MLM) more generally. For example, the within-level effects reported in ML-DSEM correspond to Level 1 effects in MLM (which can be specified as random); for our purposes, these will always be effects estimated within individuals over time. The effects referred to as “between-level” in ML-DSEM correspond to MLM Level 2 effects and, for our purposes, refer to effects estimated between individuals. Initial tests suggest that ML-DSEM produces less biased results than more familiar techniques such as standard multilevel modeling (e.g., by efficiently avoiding Nickell bias through latent centering; see Asparouhov, Hamaker, & Muthén, 2018, for further information). Further, the fully idiographic, N = 1 version of DSEM, in which only a single individual’s data are used, allows an examination of a specific individual based on that individual’s data in isolation. In this way, ML-DSEM and DSEM combined allow us to examine at least three estimates for each individual. First, a model based on the full group in which effects are allowed to vary across individuals yields a single estimate for the entire sample on average at the between level (i.e., a Level 2 fixed effect in standard MLM terms). This between-level effect can be taken as an estimate for each individual, even though we suspect it may be a poor one (which is why we allowed the effect to vary across individuals in the first place). Second, ML-DSEM models can produce an estimate for each individual when effects are specified as random, with those estimates informed by the entire dataset.
Finally, N = 1 DSEM models can estimate each individual’s parameters completely independently of other individuals in the group.
In this paper, we provide an example of how researchers can use ML-DSEM and N = 1 DSEM in concert to investigate how well findings from a group apply to individuals. In addition to providing new knowledge about our topic, we hope this investigation serves as a useful accompaniment to a helpful primer (McNeish & Hamaker, 2020) on DSEM and ML-DSEM that introduces the techniques more thoroughly but does not address some of the issues that we think will be of particular interest to clinical scientists. We do not provide full technical details regarding ML-DSEM, which can be obtained from existing papers (Asparouhov, Hamaker, & Muthén, 2018; Hamaker et al., 2018; Hamaker et al., in press; McNeish & Hamaker, 2020). We also want to emphasize that we do not mean to imply that ML-DSEM is the only means to conduct such investigations (indeed, we discuss some alternatives). Our intent in providing this illustration is not specifically to promote the use of ML-DSEM over these other available methods, but rather to promote the use of such methods in clinical science more generally. However, we expect that for many clinical scientists, ML-DSEM may be an efficient way to examine within-person processes. To improve the usefulness of this illustration, our Results section is expanded into an Analytic Method and Results section to include details that should be useful to clinical scientists intending to use this technique, with each set of results alongside an accounting of how the analysis was done and what we considered when conducting the analysis. Further, rather than attempt to provide only results that are easily digestible, we provide information about challenges of using the ML-DSEM approach.
Current Focus: Bothersome Tinnitus
Our focus is bothersome tinnitus for several reasons. First, tinnitus, defined as an auditory percept in the absence of stimulation (e.g., ringing in the ears), is associated with significant suffering and disability. For example, among US military veterans, tinnitus is the most prevalent service-connected disability, accounting for payments to one out of nine new veterans in 2020 (Veterans’ Affairs Administration, 2021). Second, heterogeneity in the etiology and maintenance of tinnitus is widely acknowledged (Baguley et al., 2013) and might be the source of the difficulty in treating the condition (Hall, 2013; Hoare et al., 2011; Landgrebe et al., 2012; Tyler et al., 2006). There is thus every reason to believe that factors that maintain tinnitus bother may differ across individuals. We therefore chose the combination of ML-DSEM and N = 1 DSEM to explore this possibility.
We assessed tinnitus bother, anxiety symptoms, and cognitive symptoms based on our experience with people bothered by tinnitus. We expected some prospective associations between anxiety, cognitive symptoms, and tinnitus bother and developed items that we expected to measure these constructs. However, we also tested whether these items varied between people and across time as expected, such that, for example, anxiety items tended to vary together. Further, we hypothesized a priori that although (a) there would be significant paths between the three constructs of interest for individuals on average, (b) there would also be significant variation reflecting individual differences. We evaluated the extent to which estimates for individuals based on fully idiographic models correlated with estimates for individuals drawn from an ML-DSEM model. That is, we tested to what extent ML-DSEM produces estimates that are similar to running separate models for each individual. This question is of interest because when the number of data points per person is limited, ML-DSEM may provide acceptable power because of the number of people assessed (Schultzberg and Muthen, 2018), yet it is unclear whether the resulting estimates are similar to those that would have been obtained from fully idiographic models. Hamaker and colleagues report moderate to very high correlations between such estimates in one dataset, suggesting that we should expect similar findings here (Hamaker et al., in press). Where individual participants had profiles that clearly varied from the between-level model, we also examined their fully idiographic model in more detail. Finally, because we did not have an a priori hypothesis as to the timescale between tinnitus bother and its predictors, we examined multiple lag lengths.
Method
Power
Both N = 1 and multilevel DSEM rely on having enough repeated measures to reliably estimate relationships within individuals over time; in addition, ML-DSEM also requires enough people to properly estimate the mean and variance of such relationships across the group. Some guidance regarding ML-DSEM is available in a simulation study (Schultzberg and Muthen, 2018). More generally, ML-DSEM derives power from both participants (here, N = 43) and occasions (here, overall T = 5079 for group analyses, with individuals varying between t = 25 and t = 197, M = 118, SD = 50.52). Schultzberg and Muthen’s (2018) simulation work suggests that this number of time points should provide close to adequate power for simple models, but future work is needed to carefully determine what is appropriate for more complex models. Ideally, with each instance of modeling, a research team would use a simulation study to determine the preferred sample size for the next study of the same phenomena. Here, as an example, we demonstrate how to use estimates from ML-DSEM models to determine power in N = 1 models.
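To convey the logic of such a simulation study, the following sketch (in Python rather than the Mplus environment used for our analyses; all values are hypothetical) estimates power to detect an autoregressive effect in a single person’s AR(1) series via Monte Carlo simulation. This is a deliberately simplified stand-in for a full DSEM power analysis:

```python
import math
import random

def simulate_ar1(T, phi, sigma=1.0):
    """Generate one person's AR(1) series: y_t = phi * y_{t-1} + e_t."""
    y = [random.gauss(0, sigma / math.sqrt(1 - phi ** 2))]
    for _ in range(T - 1):
        y.append(phi * y[-1] + random.gauss(0, sigma))
    return y

def ar1_significant(y, cutoff=1.96):
    """OLS of y_t on y_{t-1}; True if the slope's |t| exceeds the cutoff."""
    x, z = y[:-1], y[1:]
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    b = sxz / sxx
    a = mz - b * mx
    rss = sum((zi - a - b * xi) ** 2 for xi, zi in zip(x, z))
    se = math.sqrt(rss / (n - 2) / sxx)
    return abs(b / se) > cutoff

def power(T, phi, reps=1000, seed=1):
    """Proportion of simulated series in which the effect is detected."""
    random.seed(seed)
    return sum(ar1_significant(simulate_ar1(T, phi)) for _ in range(reps)) / reps
```

Running `power(T, phi)` across a grid of plausible effect sizes and numbers of occasions gives a rough sense of how many surveys per person a design requires; a full study would instead simulate from the fitted ML-DSEM model itself.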
To the detriment of power for analyses, it seems inevitable that most participants asked to complete over 100 surveys will miss some of them. ML-DSEM is extremely flexible in accounting for missing data, but when participants do not overlap enough on the items they respond to, or on the lags over which they respond to prompts, power to detect effects can be limited. Thus, for ML-DSEM models, all participants must provide data on at least some of the items to be analyzed, following a schedule that at least partially overlaps. The need for overlapping schedules stems from ML-DSEM requiring the specification of how long a lag will be. Lag length can either be represented directly in the data (e.g., one can set up the data in person-period format such that rows of data correspond to time points with roughly even spacing), or it can be set via the tinterval command, which allows one to specify how much time a lag represents. Although a model can be run on data with unequal time separation between rows without specifying the lag length, it is hard to know how to interpret results from such an analysis. Fully randomized prompts can result in data with low power, because enough data would need to be available that are spaced approximately one lag length apart. In our case, for example, we examined individuals who completed the same assessments either every two hours or every four hours; we thus focus primarily on intervals that are a multiple of four hours (e.g., 4, 8, or 12 hours), because all participants have (roughly) these intervals.
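Conceptually, specifying the lag length amounts to placing observations on an evenly spaced grid and treating empty cells as missing. The following sketch is a hypothetical Python stand-in for that idea (not Mplus’s actual tinterval algorithm):

```python
def to_grid(times_hours, values, interval=4.0):
    """Place time-stamped observations on an evenly spaced grid,
    inserting None (missing) for bins with no observation."""
    start = times_hours[0]
    n_bins = int((times_hours[-1] - start) // interval) + 1
    grid = [None] * n_bins
    for t, v in zip(times_hours, values):
        idx = int(round((t - start) / interval))
        # Keep the first observation that falls in a given bin.
        if 0 <= idx < n_bins and grid[idx] is None:
            grid[idx] = v
    return grid

# A prompt at hour 8 was missed, so that bin becomes missing:
print(to_grid([0, 4, 12], [1, 2, 3]))  # → [1, 2, None, 3]
```

After such gridding, "one lag" always means the same amount of elapsed time for every participant, which is what makes lagged coefficients interpretable.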
More specifically, in the first study from which we draw data, we specified that surveys would be given every 2 hours. A minority of participants indicated that this frequency was burdensome (Gerull et al., 2019) and attempts to fit models across this time frame were often unsuccessful, potentially because of excessive autocorrelation. When collecting additional data, therefore, we specified that surveys would be given every 4 hours instead. We expected data across lags of hours (rather than days) to be useful based on our clinical experience with people with bothersome tinnitus, who often reported symptom fluctuation across the course of days.
Participants
Participants (N = 43) were drawn from two studies. Study 1 participants (n = 27) were recruited from the Washington University Otolaryngology Research Participant Registry and enrolled in an open trial of a brain training program. Data from these participants were previously reported by Gerull and colleagues (see Gerull et al., 2019, who provide additional detail about the procedures). These participants completed at least some EMA prior to the intervention. Study 2 participants (n = 16) were recruited using similar methods to Study 1 but did not receive an intervention. Protocols were approved by the university’s institutional review board and conducted according to the World Medical Association Declaration of Helsinki.
All participants experienced bothersome non-pulsatile tinnitus for 6 months or longer. Participants were also required to have access to a smartphone and computer for the duration of the study. All participants were required to read, write, and understand English. Exclusion criteria in both studies included moderate or worse depression, as measured by a score of 10 or greater on the Patient Health Questionnaire – 9, due to concern about confounding effects of depressive symptoms on the study results. Candidates with any unstable psychiatric condition were excluded, defined as not yet having achieved a stable medication dose or being on their most recent psychiatric medication for less than 1 month. Additional exclusion criteria included a history of brain surgery, current participation in a workers’ compensation claim or litigation-related event involving tinnitus, and active substance use problems in the past year, all based on responses to screening questions. In Study 1, which included an intervention, participants were also excluded for hearing loss not addressable with hearing aids or an inability to use headphones (i.e., due to the requirements of the intervention).
Most participants (n = 29, 67.44%) were women. The median age of participants was 58 years (range 25 – 70 years). All participants identified as White, except one who declined to report race. Ethnicity, education, and employment data were collected in Study 1 but not Study 2. In Study 1, no participants reported being Hispanic, with one preferring not to identify ethnicity. These participants were generally highly educated: three reported an associate’s degree or some college, 10 a bachelor’s degree, 9 a master’s degree or equivalent, and 5 a PhD, MD, JD, or similar degree. Most reported working full time (n = 16, 59%), with 7 reporting being retired, 3 reporting part-time work, and 1 reporting being unemployed.
EMA Items
We selected 10 items based on the emotional and functional difficulties related to tinnitus observed by the authors during work with patients. Items are listed in Table 1 along with their average factor loadings (see also below). All items were rated using a sliding scale ranging from 0 (Not at all) to 100 (Extremely). Reliability is presented below. Descriptions of all ratings and reliability estimates of non-EMA items (e.g., traditional self-report) are provided in the supplemental materials.
Table 1.
Average Factor Loadings from P-Technique Factor Analyses
| Item | Factor 1 | Factor 2 | Factor 3 |
|---|---|---|---|
| How loud is your tinnitus? | **.85/.80** | .01/.01 | −.02/.04 |
| How bothered are you by your tinnitus? | **.90/.92** | .05/.02 | .01/.02 |
| How anxious or stressed do you feel? | .06/.18 | **.71/.54** | .09/.16 |
| How difficult is it for you to concentrate on what you are doing? | .01/.14 | .00/.14 | **.68/.55** |
| How difficult is it for you to avoid thinking about your tinnitus? | **.68/.63** | −.01/.06 | .19/.17 |
| How difficult is it for you to follow a conversation? | .06/.12 | −.02/.09 | **.55/.51** |
| How difficult is it for you to focus? | −.06/.09 | .03/.12 | **.79/.64** |
| How interested are you in activities or people you normally enjoy? | −.13/−.09 | −.15/−.23 | −.19/−.19 |
| How worried do you feel? | −.05/.08 | **.80/.71** | −.02/.01 |
| How overwhelmed do you feel? | −.01/.04 | **.80/.66** | .01/.13 |
Note. Loadings greater than .35 in bold. The first estimate is from multilevel structural equation modeling (with nontrustworthy standard errors) and the second is from the average of p-technique factor analyses for the entire group of participants. In the multilevel version of the model, Factor 3 was actually Factor 2; in the p-technique analyses which factor was which varied. Two-factor solutions converged for 39 participants but did not tend to fit well across participants (mean indices: CFI = .91, TLI = .85). We therefore focused on a three-factor structure. A three-factor solution converged for 31 participants and tended to fit participants well (mean indices: CFI = .98, TLI = .96).
Time Variables for Stationarity
A crucial aspect of modeling data over time is the need to consider including time-related variables in models. For example, anxiety might be expected to vary systematically across the course of the day for many individuals. Further, some individuals might be generally increasing or decreasing in their anxiety during the study. Such trends over time violate stationarity, one of the assumptions of ML-DSEM (and indeed almost all time series modeling techniques; see Piccirillo & Rodebaugh, 2019 for a review). Therefore, variables reflecting the survey and day number should be included in models to directly account for these effects. We recorded day in the study and survey of the day for each survey completed. Notably, changes in autoregressive paths or cross-lagged paths over time also violate stationarity and are not accounted for merely by including time variables (cf. Bringmann et al., 2015; Bringmann et al., 2017).
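As a simplified illustration of why time variables matter, removing a linear trend amounts to regressing the series on time and retaining the residuals. In our models this is handled by including the day and survey-number covariates directly; the following hypothetical Python sketch shows the single-trend version of the idea:

```python
def detrend(y):
    """Remove a linear time trend from a series via OLS,
    returning the residuals (a simplified stand-in for including
    day/survey-number covariates in a model)."""
    n = len(y)
    t = list(range(n))
    mt, my = sum(t) / n, sum(y) / n
    b = (sum((ti - mt) * (yi - my) for ti, yi in zip(t, y))
         / sum((ti - mt) ** 2 for ti in t))
    a = my - b * mt
    return [yi - (a + b * ti) for ti, yi in zip(t, y)]
```

A series that is purely trend (e.g., steadily rising anxiety) detrends to approximately zero everywhere; whatever remains is the within-person fluctuation that lagged models are meant to explain.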
Procedure
All tinnitus registry members were sent an email with a description of the study and a link to an electronic REDCap screening survey (REDCap version 7.3.5 survey tool, Vanderbilt University). Participants provided informed consent before enrolling in the study, and the study protocols were approved by the Washington University Institutional Review Board. Participants completed a baseline assessment of self-report measures and then began the EMA protocol the following day. Participants in Study 1 were sent a text message survey every two hours between 8:00am and 8:00pm for the two weeks prior to beginning the intervention (7 text message surveys per day). Further details regarding the intervention in Study 1 can be found in the report by Gerull and colleagues (2019), which focuses on evaluation of the intervention itself. In Study 2, we decreased the frequency of our assessments to every four hours to address participant burden. Participants in Study 2 responded to between 129 and 200 prompts because the intent was to determine the optimal number of prompts for individual models.
Analytic Methods and Results
Dimensionality
We expected a priori that the items used for the individual models would fall into roughly three factors: (a) tinnitus bother; (b) anxiety; and (c) cognitive symptoms. We thus wanted to test whether our expectations were correct for participants overall, but also whether any single individuals showed clear departures from this factor structure. Although we expect DSEM might eventually allow us to investigate these issues fully, there are two limitations that currently stand in the way. First, DSEM does not offer a fully exploratory mode (e.g., exploratory factor analysis); nor does Mplus otherwise offer a Bayesian multilevel exploratory factor analysis more generally. Second, DSEM does not offer fit statistics that we can use in all cases to determine whether a factor model fits well or not. DSEM modeling does allow comparison between some models using the Deviance Information Criterion (DIC), but it can be complicated to determine whether DICs are comparable across models, particularly when latent variables are involved (Asparouhov, Hamaker, & Muthén, 2018). To be more precise, although some guidance is available regarding which models are comparable, that guidance is incomplete and not always clear to users without extensive statistical backgrounds. If DICs are comparable across models, the model with the smaller DIC value is preferred. Because the DIC is the existing tool for comparing DSEM and ML-DSEM models, we discuss it further below and in the supplemental material. For now, though, the main point is that the DIC cannot tell us whether a three-factor model fits well in comparison to models we might prefer to compare it to.
Fortunately, there are other methods that can help us determine what factor structures we should consider fitting in DSEM. Using Mplus, we estimated a multilevel structural equation model (MSEM) with robust maximum likelihood (i.e., the estimator Mplus documentation refers to as MLR) to conduct an exploratory factor analysis on the within-level data (the variation across time), leaving the between-level data unstructured. We did this in part because this was the best correspondence to the model we expected to fit in DSEM models, in which items would have a factor structure across time, but no particular structure across people. Although it would be useful to determine whether the measure had a similar structure at both levels, we did not expect to be able to fit a between-level model at all in the current data (see below).
The two-factor model did not fit well (CFI = .80, TLI = .31, SRMR = .05), but the better-fitting three-factor model (CFI = .97, TLI = .85, SRMR = .01) also produced an error message indicating that its standard errors should not be trusted. This error message was likely produced by the relatively small sample size in terms of number of participants. To supplement these findings, we conducted p-technique factor analysis with each individual participant, extracting two- and three-factor solutions using Mplus 8. The p-technique factor analysis is a method of determining the factor structure for an individual participant based on the variation in their responses across time (see Lee & Little, 2012, for more detail). For our study, we were interested both in each participant’s factor structure and in what factor structure might be preferred for the group as a whole. We examined the average loadings for participants (as a group) for the number of factors that appeared necessary for most participants to show good fit. Details on our procedures, including follow-up tests in individuals, can be found in the supplemental materials.
On average, the factor structure provided in Table 1 was observed across both methods; see the table note for some analysis details. One factor was defined primarily by tinnitus loudness and bother, in addition to difficulty in avoiding thinking about tinnitus. A second factor was defined by anxiety, worry, and feeling overwhelmed. A third factor had generally lower average loadings but was characterized by difficulty with cognition (concentration, following conversations, and focusing). For the ML-DSEM models, we removed the “How interested are you. . .” variable from consideration because it failed to load clearly on a factor in either method. Details about individual departures from this structure are provided in the supplementary material.
With an idea of how the items might load onto factors, DSEM can be used to check whether the three-factor structure has good local fit compared to other models (e.g., by examining factor loadings or constraining parameters across models). In the supplemental material, we used DSEM and N = 1 models to test whether any individuals had factor structures that appeared to be a better fit than the overall three-factor model (e.g., in terms of convergence and factor loadings); in brief, we found evidence that, for two participants, data were better explained by a separate N = 1 factor structure; for two more participants there was modest evidence that the idiographic structure might be superior.
We also used ML-DSEM models to test whether the three-factor model is superior to a simpler one-factor model. To do this, we estimated a model in which the three-factor structure is specified, and items are regressed upon our two time variables (day in study and number of the prompt of the day). We also specified that each factor was regressed upon itself. None of this would normally be done, or even be possible, in two-level factor analyses, so the reader might ask what the purpose of doing so would be. One example of why we might want to run such a model is if we suspected that participants were reactive to the measurement, such that the only reason they had any difficulty avoiding thinking about tinnitus was because we were asking them about it repeatedly, leading to both increased bother and difficulty avoiding thinking about tinnitus. That is, such a model can help to arbitrate between the hypothesis that items load on a factor and the hypothesis that their association is an artifact of shared trajectories over time.
Because this is the first presentation of an ML-DSEM model in this paper, we should note a few issues. First, an even more complex model is possible here, although perhaps not advisable given that this model is already quite complex, especially with a limited sample size. We specified that factor loadings, autoregressive parameters, residuals, and time-related effects were all fixed. However, if fewer items were being examined, or more participants included in the dataset, it would be possible for all of those effects to be treated as random (i.e., varying across participants). Indeed, we allowed such effects to be treated as random in the simpler models presented below. Unfortunately, we found that even allowing the time effects to vary appeared to be too complex for the current data (in that the results included a negative estimate of the number of parameters, a nonsensical result that we find is often returned when models are either too complex for the data or are ill-considered on a fundamental level). This result may not be surprising given that even the model with fewer random effects included nine items, three latent factors, and two timing variables. Nevertheless, ML-DSEM at least allows us to compare the one-factor structure to the three-factor structure by constraining parameters. Specifically, we ran the three-factor model as described above, and then ran the same model while testing the constraint that the correlations between the factors were set to one and the autoregressive parameters for the factors were equal. That is, by making this constraint, we specified that the three factors were acting as a single factor.
Researchers used to non-Bayesian analyses might expect that, to compare these models, we should simply run each model, make adjustments if needed for convergence, and examine the results of constraining the parameters. The general pattern for running ML-DSEM models is a bit different than that, as will be familiar to researchers who have conducted similar Bayesian analyses. After determining, based on its initial convergence, that the three-factor model with largely fixed effects appeared reasonable, we set the number of iterations at more than twice the number at which the model initially converged to ensure that the potential scale reduction (PSR) value continued to go down as expected, or at least did not begin trending upward. It is worth noting that the frequent need to run multiple models (magnified when there is a need to test multiple random seeds if one is attempting to compare DICs) makes the features of the MplusAutomation package in R particularly useful (Hallquist & Wiley, 2018). Using this package, the researcher can run all input files in a given folder in sequence.
The three-factor model converged without incident, with each item loading significantly on its respective factor, and each factor predicting itself over time (judged by no parameter having 0 within its 95% credible interval). Many items displayed time effects, including the difficulty avoiding thinking about tinnitus item, although this item tended to be rated lower across the study (perhaps suggesting reactivity, although the opposite of what we hypothesized informally above). Similarly, two of the cognitive symptom variables showed a linear trend downward over time, but because each still loaded on the cognitive factor, it was not their shared trajectory alone that led to their loadings. The constrained model also converged, and the model constraints were statistically significant (p < .0001), indicating that three factors were preferable to one.¹
Reliability of Subscales Derived from Factors
Now that we have evidence that the nine remaining items represent three factors, we investigated their internal consistency. We evaluated how items related to each other in two ways: Across people (e.g., are people high in one anxiety item high in other anxiety items?) and across time (e.g., when a person is high in one anxiety item, is the person high in other anxiety items?).
MSEM method.
A method to determine reliability on both levels using an MSEM model was described by Geldhof and colleagues (Geldhof et al., 2014), and a correction for overestimation was provided by van Alphen and colleagues (2022). We first used the MLR estimator, due to its ability to handle nonnormal data. Notably, we had to examine each subscale separately because of the sample size (i.e., the number of participants). Using syntax provided by van Alphen et al. for a model in which the within- and between-level loadings were constrained to be equal (their Step 5 model), we found that reliability was at least good for each subscale on both the within and between levels as measured by coefficient omega (ωs > .77). This result indicates that the items on each factor were internally consistent with each other both across individuals and across time. We then specified the Bayes estimator for the MSEM model, which allowed us to examine all factors at once and returned the same substantive result for omega (ωs > .77).
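For readers unfamiliar with coefficient omega, it is computed from a factor’s loadings and residual variances. A minimal sketch, using hypothetical values rather than our estimates:

```python
def omega(loadings, resid_vars):
    """Coefficient omega for a single factor:
    (sum of loadings)^2 / ((sum of loadings)^2 + sum of residual variances)."""
    num = sum(loadings) ** 2
    return num / (num + sum(resid_vars))

# Hypothetical standardized loadings and residual variances
# for a three-item subscale (not our actual estimates):
print(round(omega([0.8, 0.8, 0.8], [0.36, 0.36, 0.36]), 2))  # → 0.84
```

The within- and between-level versions differ only in which covariance matrix (across time vs. across people) supplies the loadings and residual variances.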
ML-DSEM method.
The syntax provided by van Alphen et al. (2022) is transferable to an ML-DSEM model. Two differences were notable versus MSEM with the Bayes estimator. First, we could allow latent factors to predict themselves over time. Second, because van Alphen et al.’s Step 5 model is relatively constrained, we could estimate the model with the effect of day of study allowed to be random across participants. This is useful if we suspect that apparent internal consistency might be distorted by linear trends varying across participants (a variation on the issue we discussed regarding factor loadings above). In this case, although the linear effect of days in the study did show clear signs of varying across participants, internal consistency remained good (ωs > .72). We supply output for this model in an additional supplement (see https://osf.io/zyfa8/). For those investigators who wish to examine measurement error in more depth in ML-DSEM, initial methods focusing on one-item indices (Schuurman & Hamaker, 2019) and trait-state-occasion models (Castro-Alvarez et al., 2021) are also available.
Measurement Invariance
Notably, McNeish and colleagues (McNeish et al., 2021) recently demonstrated how ML-DSEM can be used to test whether factor loadings are invariant across people and time. Although we highly encourage testing for such invariance, in our case our data proved insufficient to allow these models to converge properly. Notably, this appeared to be due in part to the differing number of surveys completed (e.g., ranging from 25 to 200) across participants in addition to the number of participants, so investigators should not take our experience as evidence that they will be unable to explore this issue in similarly-sized samples with a more consistent set of time points across participants.
Lag Length Determination
Having determined that the items used resulted in three factors that were relatively homogeneous across participants, we turned to examining models of how these factors were related over time. We first examined what lag lengths would be important to model. In intensive longitudinal data in general, we often do not know over what time frame, or over what lag length, important effects might occur. The default in psychology is to assume that the effects of most interest occur over whatever time frame happens to separate the observations (e.g., whether that is four hours or six months). Fortunately, time-series methods for analyzing intensive longitudinal data allow us the opportunity to test this assumption (Jacobson et al., 2019). For example, when data are collected with four hours between observations, it is possible to examine whether variables predict each other across four, eight, or 12 hours, and beyond. This kind of examination is rare in the literature, despite the fact that there are now multiple packages available for examining this issue (in R, as described by Jacobson et al. 2019, as well as in SAS, as described by Li et al., 2017).
To examine whether a four hour lag length adequately captured the larger effects in the data, we used the DTVEM package in R (Jacobson et al., 2019). This package uses a state-space framework to examine multiple lag lengths and determine at what lengths effects are strongest. We examined whether this package recommended that additional lag lengths be included; further, we examined whether additional lag lengths seemed to have stronger effects than a Lag 1 of four hours, even if they were not statistically significant. Thus, we were able to determine whether reliance on a single lag length might be obscuring strong relationships between variables.
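Although DTVEM uses a state-space framework, the core logic of screening lag lengths can be illustrated with an ordinary regression of a series on lagged copies of itself. The sketch below is purely illustrative (simulated data with assumed true effects at Lags 1 and 3, loosely mimicking the pattern found for tinnitus bother; it is not DTVEM's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a toy series with a true effect at Lag 1 (.5) and Lag 3 (.2)
T = 2000
y = np.zeros(T)
for t in range(3, T):
    y[t] = 0.5 * y[t - 1] + 0.2 * y[t - 3] + rng.normal()

# Regress y_t on Lags 1-4 and inspect which coefficients are non-negligible
max_lag = 4
X = np.column_stack([y[max_lag - k : T - k] for k in range(1, max_lag + 1)])
coefs, *_ = np.linalg.lstsq(X, y[max_lag:], rcond=None)
```

In this toy case, the Lag 1 and Lag 3 coefficients are recovered as clearly nonzero while Lags 2 and 4 are near zero, which is the kind of pattern that would suggest modeling additional lags beyond Lag 1.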
Study 1 data only were used because these data provided a more extensive range of lag lengths to examine (e.g., starting with a lag length of 2 hours as opposed to 4 hours for Study 2). Visualizations of each outcome variable, provided in the supplemental materials Appendix S1, were examined for a gradual decline of strength of effect with increasing lag lengths. When anxiety and cognitive symptoms were the outcome variables, the final model did not suggest that additional lag lengths beyond 2 to 4 hours (where, again, 4 hours is Lag 1 for the combined data) were warranted for the data. For tinnitus bother, the final model indicated that additional autoregressive lags for tinnitus bother should be considered for lag lengths up to six, which corresponded to Lag 3 (12 hours) in the combined data.
Although the originators of the DTVEM package provide data suggesting it tends to identify important effects, there is no guarantee that the effects transfer across methods (DTVEM employs a state-space framework, whereas all variants of DSEM employ a Bayesian implementation of vector autoregression). We therefore examined whether these additional lags had support in the final model as well.
Normal Distribution Assumptions
The DSEM models assume normal distributions across people (in multilevel models) and across time. Little is currently known about how robust the models are to violations of this assumption (e.g., as noted by Hamaker et al., in press). Further, no alternative estimators (versus the Bayesian estimation used in DSEM) are available to cope with minor violations, although alternative estimators exist for standard structural equation modeling; there are, however, some options available for floor effects and categorical items (see Hamaker et al., in press). We thus used transformations to approximate normality.
The normal distribution assumption was tested for the items and subscale scores within each participant's data and across participants using histograms and the one-sample Kolmogorov-Smirnov test. On the item level, we were not convinced that transformations were successful in creating normal distributions, which dissuaded us from relying consistently on ML-DSEM and ML-RDSEM to directly model latent variables. This fact also means that special care must be taken in interpreting our N = 1 results regarding factor structure in the supplemental materials. To improve normality, we focused most analyses on subscale scores instead. A square-root transformation was applied to each subscale due to violation of the normality assumption, and the transformed subscale scores met the assumption for the majority of individuals (and produced normal distributions across participants overall).
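We performed these checks with histograms and Kolmogorov-Smirnov tests in our statistical software; the minimal sketch below (using simulated right-skewed data, not our subscale scores) illustrates why a square-root transform can help with positively skewed ratings:

```python
import numpy as np

def skewness(x):
    """Sample skewness: mean cubed z-score."""
    z = (np.asarray(x, float) - np.mean(x)) / np.std(x)
    return float(np.mean(z ** 3))

rng = np.random.default_rng(1)
# Right-skewed toy "subscale scores" (simulated; not our data)
raw = rng.chisquare(df=4, size=5000)
transformed = np.sqrt(raw)
```

The transformed scores are markedly less skewed than the raw scores; in practice, each participant's transformed distribution would then be re-checked, as we describe above.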
The reader may note that we reported some models above before reporting on normality; do these models return the same results if items are rendered more normal? We re-ran the factor structure tests using the square-root transformation to approximate normality, and all substantive conclusions remained the same. As mentioned in the supplement, however, estimates for parameters routinely differ beyond a rounding error across DSEM models when variables are more normally distributed versus when they are less normally distributed.
Multilevel and Idiographic Models
For all multilevel models below, we included random slopes, intercepts, and residuals on the within level (i.e., corresponding to Level 1 in a standard MLM). Random covariances were also allowed, with the constraint that the covariances were positive (see Hamaker et al., 2018 for the reasons for this constraint, which reflect a limitation of the current implementation). In addition to the main constructs of interest, day of the study (e.g., Day 1, Day 2, etc.) and time of day (e.g., Surveys: 1–7 for Study 1, coded as 1, 3, 5, 7 for Study 2) were included as predictors for each variable. Including these timing variables allowed us to partial out any linear effects related to days in the study or time of day.2
ML-DSEM versus ML-RDSEM
McNeish and Hamaker (2020) have recommended the use of the residual ML-DSEM (ML-RDSEM) model rather than the standard ML-DSEM model when time trends are partialled out in the model itself. In brief, the ML-RDSEM model may be preferred because it first partials out the time trends and then models the other relationships on the residuals of the variables, whereas in ML-DSEM all of the relationships are estimated simultaneously (Asparouhov & Muthén, 2020). The ML-RDSEM procedure is thus more computationally intensive, but also may provide more readily-interpretable time trends. Asparouhov and Muthén (2020) used the DIC to test whether the underlying process that generated the data was better modeled as a DSEM or residual DSEM model. However, the model they tested was very simple (one outcome variable and one time variable). It was thus unclear whether we should expect to be able to use the DIC in our intended models, which were more complex.
We expected to examine the full three-variable model in both ML-DSEM and ML-RDSEM, but this was complicated by the fact that the ML-RDSEM model would not converge appropriately (it either did not converge at all or had a negative estimate of the number of parameters). While investigating this problem, we received feedback that it is best to begin modeling with simpler models, such as two-variable models (e.g., Hamaker, personal communication, April 8, 2021; see also Asparouhov & Muthén, 2022). We therefore compared ML-DSEM models with ML-RDSEM models using three models, each containing a pair of variables.
In brief, it did not appear that the DICs from our models were comparable. First, they were much further apart in value than published examples of comparisons from the Mplus team (e.g., Asparouhov & Muthén, 2020). Second, the pD values, which are an estimate of the number of parameters in the model, were also obviously discrepant despite the fact that the DSEM and RDSEM models would appear to have the same number of actual parameters.3 The only means we found to yield similar pD values was to eliminate all random effects in the model, whereas we intended to examine random effects. Accordingly, we defaulted to using the RDSEM model to the extent that it would converge (see below) because it is the conceptually preferable model.
Compiled ML-RDSEM Model
The compiled models tested our a priori hypotheses that (a) some paths would be statistically significant for individuals on average (i.e., these paths represent group trends, or fixed effects in standard MLM parlance) and (b) paths would show statistically significant variation across individuals. Figure 1 presents the ML-RDSEM between-level paths whose credible intervals did not contain zero; when a path appeared in more than one model, it is shown only if this held in both models, and its estimate is averaged across models. Credible intervals are given because DSEM uses a Bayesian framework. Because we did not use informative priors, the 95% credible interval is comparable to the confidence intervals that many researchers are more familiar with (Morey et al., 2016). Importantly, we could use the results of the current model as priors for future tests of the same model. Full information for the models can be obtained from the Mplus output included in the supplemental material Appendix S2; each output contains the input statement, as well as several illustrative features. One such feature concerns the Wald test: when questions about models are relatively simple, such as whether a set of paths improves fit, a Wald test can be used to test model constraints (Asparouhov, Hamaker, & Muthén, 2018), as we described above. We thus used the Wald test to determine whether the additional lag lengths for tinnitus bother improved the models, which they did.
Figure 1.
Multilevel Residual Dynamic Structural Equation Model (ML-RDSEM) standardized estimates. Coefficients are drawn from three models including each possible pair of variables, and when those paths differ across models they are averaged. Only paths for which the credible interval did not contain zero (in any of the models) are shown. Thus, for example, although all variables were regressed on Day in Study and Time of Survey, only two significant effects were found. The path in gray was not supported by a robustness check (see main text for full explanation). Because all paths were estimated at the same time, all directed paths represent partial effects (e.g., any lag paths are significant above and beyond any other paths shown). Loops represent autoregressive paths, with the progressively larger loops for tinnitus representing longer lags (4, 8, and 12 hours). Standardized estimates are shown, averaged across clusters. Correlations were drawn from Mplus residual output.
As noted above, paths included in multiple models, including autoregressive effects for tinnitus bother and time, are provided as averages across the two models. Researchers may be accustomed to multiple regression, in which two regressions often return quite different coefficients when the competing predictors differ across models, as is true here (i.e., one model included anxiety, whereas the other included cognitive symptoms). To test whether the tinnitus-focused paths (i.e., the most substantively important paths in this case) were similar across models, we examined the correlation between individual estimates obtained from the ML-RDSEM models. All tinnitus paths were relatively consistent across models, in that they were correlated at > .90, with the exception of the second and third autoregressive lags of tinnitus, which correlated > .7. Thus, paths were similar, although not identical, across these models.
Model Interpretation
Turning to the interpretation of the model itself, several prospective paths were statistically significant. Tinnitus bother tended to predict itself over each lag length, suggesting notable inertia: when participants were highly bothered by their tinnitus, this predicted increased bother, relative to the average, four, eight, and twelve hours later, although the influence over longer lags was far smaller than over Lag 1. All other prospective relationships occurred across four hours (Lag 1). Anxiety and tinnitus bother, as well as cognitive symptoms and tinnitus bother, displayed positive feedback loops in which each variable tended to predict the other. Coupled with the autoregressive lags for tinnitus, the model suggests a situation in which an increase in any of the variables would tend to produce at least transient increases in all of the variables.
All of the paths had significant variance, with none of the variance estimates including zero in their credible intervals. However, it is worth noting that the Bayesian estimator does not permit negative estimates of variance, which means it is very easy for variances to be estimated as being definitely above zero (i.e., their credible intervals cannot have a tail extending below zero). Due to this issue, Asparouhov and Muthén (2022) suggest that variances that exceed three times their standard error should be considered clearly meaningful. The following paths met this standard: the Lag 1 and Lag 2 tinnitus autoregressive paths and the autoregressive path for anxiety. Thus, there was considerable evidence that the paths varied in a meaningful way across participants, supporting our a priori hypotheses. Notably, Asparouhov and Muthén assert that smaller variances for parameters that are left to vary randomly can slow model estimation and that researchers should consider fixing these paths. Our opinion is that, because we expect such paths to vary, we should allow them to vary unless fixing them is necessary for model estimation (which was not the case here). Further, we find below some evidence that even paths with smaller variances show other signs of differing across individuals in a meaningful way.
Robustness Check
Recall that RDSEM and DSEM are expected to be equivalent in the absence of time trends. While fitting N = 1 DSEM models (see below) with time trends partialled out, we observed that this did not appear to be the case for our models. We confirmed this by running ML-DSEM models on the variables with the time trends partialled out. In this alternative set of models, tinnitus bother predicted anxiety and cognitive symptoms, although across different lag lengths: Tinnitus bother predicted anxiety across Lag 2, and cognitive symptoms across Lag 3. Anxiety still predicted tinnitus bother, but cognitive symptoms did not predict tinnitus bother. We thus have less confidence in the pathway from cognitive symptoms to tinnitus bother, which is displayed in gray in Figure 1 accordingly.
Correspondence of ML-DSEM Estimates to N = 1 Model Parameters
Given the evidence of variation across participants, some researchers and theorists might question the usefulness of the individual estimates from the ML-RDSEM models if N = 1 models produce different results. We examined the correspondence between the parameter estimates from one of the multilevel models (between anxiety and tinnitus) and the N = 1 models (see below) using Spearman correlations because several estimates were not normally distributed. Estimates, excluding intercepts, were highly correlated (rs range .69 to .95, M = .83, SD = .09), suggesting considerable overlap between these two estimation methods. The correlations are on the diagonal of Table S2, which shows all intercorrelations of parameter estimates. We detected no obvious pattern aside from intercepts showing lower correlations.
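This kind of comparison can be reproduced in a few lines. In the sketch below, the per-participant estimates are simulated stand-ins for illustration (we used the actual model estimates), and the rank-based function is a minimal Spearman implementation:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation via Pearson correlation of ranks
    (assumes no ties, reasonable for continuous parameter estimates)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(2)
# Hypothetical per-participant estimates of one path from the two methods
ml_estimates = rng.normal(0.30, 0.10, size=43)
n1_estimates = ml_estimates + rng.normal(0.0, 0.05, size=43)  # noisy agreement
rho = spearman(ml_estimates, n1_estimates)
```

Rank-based correlation is preferable here because, as noted above, several of the estimate distributions departed from normality.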
Estimation of Idiographic Models
We attempted to fit N = 1 RDSEM models that paralleled the ML-RDSEM model for tinnitus and anxiety, but most of these would not converge. Because RDSEM partials out the time trends prior to estimating other parameters, we were able to ease estimation by first partialling out day in study and time of day for each participant using linear regression in R and then running a standard DSEM model on the residuals. These models all converged properly but left us without estimates of time trends (i.e., because time trends were already partialled out, we did not include them in the analyses). We focused on N = 1 models paralleling the anxiety-tinnitus model using detrended data because this pairing appeared to be of most substantive interest, given that the model using all participants suggested a feedback loop between anxiety and tinnitus.
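The detrend-then-model step can be sketched as follows (in Python for illustration; we used linear regression in R). The timing structure below is hypothetical:

```python
import numpy as np

def detrend(y, day, time_of_day):
    """Residualize y on linear effects of day in study and time of day,
    so a DSEM model without time-trend terms can be fit to the residuals."""
    X = np.column_stack([np.ones(len(y)), day, time_of_day])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

rng = np.random.default_rng(5)
day = np.repeat(np.arange(25), 8).astype(float)   # e.g., 25 days, 8 surveys/day
tod = np.tile(np.arange(8), 25).astype(float)
y = 0.1 * day - 0.2 * tod + rng.normal(size=200)  # toy series with linear trends
resid = detrend(y, day, tod)
```

By construction, the residuals are uncorrelated with both timing variables, so any remaining dynamics are above and beyond linear time trends.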
We can expect that the resulting idiographic models will differ from the ML-DSEM model for at least two reasons. First, the idiographic models will have reduced power compared to the ML-DSEM model. Second, there may be substantive differences between the idiographic models (which treat each person separately) and the multilevel model (which treats each person as part of an overall distribution). The challenge to the researcher is that there are as many N = 1 models as participants, which makes it difficult to determine whether these models generally converge with the multilevel model in terms of substantive findings.
In this case, the feedback loop suggested by the multilevel model for anxiety and tinnitus bother appears to be of most interest clinically. We might expect, based on the multilevel model, that the idiographic models would frequently show this feedback loop. However, focusing on significant paths in the idiographic models, no participants displayed the feedback loop. Tinnitus bother predicting anxiety was a frequent finding across the three lags (Lag 1: one positive path, two negative; Lag 2: five positive paths, one negative; Lag 3: two positive, two negative; note that two individuals showed a positive estimate at Lag 2 and a negative one at Lag 3). None of these individuals also showed positive prediction of tinnitus bother by past anxiety (i.e., two individuals did show such a path, but neither of them showed significant positive prediction of anxiety by tinnitus bother). We therefore investigated how much power the idiographic models had to detect the feedback loop seen in the ML-DSEM model.
Post-hoc Power Analyses
We can determine whether the above results could be due to power alone by running simulations for N = 1 models using the estimates from the multilevel model. Mplus can run Monte Carlo power analyses for any model type, which allows researchers to plan for data collection as well as conduct post-hoc examinations as we do here. A useful tool for researchers not intending to use Mplus has been described by Lafit et al. (2021), although it does not offer the ability to simulate the model we examined here. A simulation with 10,000 simulated datasets revealed that even with one thousand time points from a single individual, power was low to detect the impact of tinnitus on anxiety seen in the ML-DSEM model (estimate = .047; 53% of datasets detected this effect) as well as the impact of anxiety on tinnitus (estimate = .04; 15% of datasets detected this effect). In contrast, power to detect the strongest effects (autoregressive effects of anxiety and tinnitus bother at Lag 1) was reasonable-to-strong for 125 observations or more (>76% of datasets detected these paths). Notably, even increasing the size of the cross-construct parameters by .05 would improve power considerably in 1000 time points (to 99% for tinnitus predicting anxiety and 56% for the converse path), although a more realistic 200 time points still had poor power for both paths (< 50%). Detection of the feedback loop seen in the multilevel model should thus not be expected in many idiographic model results due to power alone.
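The logic of such a simulation can be sketched outside Mplus as well. The toy Monte Carlo below uses a simple bivariate lag-1 model with assumed coefficients (not our actual model or estimates) to show how sharply power depends on the size of a cross-lagged effect at a fixed number of time points:

```python
import numpy as np

rng = np.random.default_rng(3)

def cross_lag_power(b, T=200, reps=300, crit_z=1.96):
    """Monte Carlo power to detect a cross-lagged effect b of x on y in a
    toy bivariate lag-1 model (an illustration, not the Mplus setup)."""
    hits = 0
    for _ in range(reps):
        e = rng.normal(size=(T, 2))
        x = np.zeros(T)
        y = np.zeros(T)
        for t in range(1, T):
            x[t] = 0.5 * x[t - 1] + e[t, 0]
            y[t] = 0.5 * y[t - 1] + b * x[t - 1] + e[t, 1]
        # OLS of y_t on (1, y_{t-1}, x_{t-1}); test the x_{t-1} coefficient
        X = np.column_stack([np.ones(T - 1), y[:-1], x[:-1]])
        beta, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
        resid = y[1:] - X @ beta
        sigma2 = resid @ resid / (T - 1 - X.shape[1])
        se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[2, 2])
        hits += abs(beta[2] / se) > crit_z
    return hits / reps

power_small = cross_lag_power(b=0.05)  # near the small cross-lag estimates
power_large = cross_lag_power(b=0.30)  # a much larger hypothetical effect
```

Even in this simplified setting, an effect near the size we observed is rarely detected at 200 time points, whereas a substantially larger effect is detected almost always.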
Feedback loop plausibility.
The idiographic models thus had limited power, which makes it predictable that no individual displayed significant pathways for the full loop. However, individual participants might still tend to show these paths being higher in tandem, even if neither path was statistically significant. We therefore examined the correlation across the idiographic models between a participant's parameter for anxiety predicting tinnitus bother and the average of their three parameters for tinnitus bother predicting anxiety (i.e., across the three lags used for tinnitus). This correlation was small and not significant, r(43) = .12, p = .436, which suggests there is no systematic tendency toward having such a feedback loop, even at values below statistical significance. Thus, although some participants might show such a feedback loop, having one part of the proposed loop does not increase the chances that a participant will show the other path in N = 1 models. In contrast, the same correlation based on the ML-DSEM model was moderately large, r(43) = .45, p = .002. Thus, the ML-DSEM model implied that a feedback loop should be common, but the N = 1 models did not.
Fully Idiographic Models
The examination given above still assumes that participants had the same factor structure for their items. However, any given individual might have a different factor structure for their items. For example, perhaps what we describe as anxiety and cognitive symptoms above are better understood as a single factor for some participants. As explained in detail in the supplement, we examined whether any participants clearly had a different factor structure than the three-factor structure favored for the whole group by comparing their p-technique-derived factor structure to the group factor structure to see which offered superior fit. Two participants showed evidence of superior idiographic factor structures. These participants were both from Study 2, which means they provided considerably more longitudinal data than participants in Study 1; in addition, they were not enrolled in the treatment study. In both cases, a two-factor model was superior, and items tended to correlate across factors or cross-load. Tinnitus bother was not predicted by another variable in either case.
Exploratory Tests: Do Baseline Model Parameters Relate to Baseline Scores and Change?
For both the multi-level and N = 1 models, estimates for each individual’s value on each parameter can be obtained. Thus, for example, how much a participant’s tinnitus bother is predicted by anxiety (ranging from a strong negative association to a strong positive one) can be treated like another variable or measure in the study, and participants’ level on this variable can be correlated with other measures. This fact opens the possibility of directly testing novel hypotheses. For example, we speculated that anxiety would lead to future tinnitus bother; this was true in terms of the ML-DSEM between-level effect as well as for some specific individuals in N = 1 models. It might seem plausible that the degree to which anxiety predicts tinnitus bother for an individual might itself predict important things about an individual (e.g., at baseline) or how an individual responds (e.g., to treatment). With enough participants, it is possible to examine such questions fully within ML-DSEM. That is, the correlation of between-participant measures and within-participant paths can be observed on the between level. Because of our modest sample, we examined estimates of parameters instead.
Further, as explained in full in the supplement, correlating parameters from even one model with all the self-report measures we had available would lead to a very high number of tests on highly correlated variables (e.g., measures of distress from various sources correlated highly); accordingly, we used data reduction techniques for the self-report data to reduce the number of tests. The supplemental material gives details on our examination of correlation matrices of model parameters and both baseline measures and change across an intervention. In brief, we saw some evidence that parameters related to anxiety (and especially how much tinnitus predicted anxiety) were related to change during the intervention.
A Note on Convergence
Both above and in the supplement we note occasional trouble obtaining convergence in ML-DSEM models. Guidance exists for troubleshooting (Asparouhov & Muthén, 2022), and we can add some observations that elaborate on issues identified in that guidance. First, as noted by Asparouhov and Muthén (2022), the Bayesian estimator is sensitive to differences in variance between variables; in our experience, standardizing variables, or at least rescaling them so that their variances are more similar, can make a large difference in estimation time and likelihood of convergence. Second, we have noted in Bayesian analyses of all sorts in Mplus that the estimator is very sensitive to models that fit the data extremely poorly. That is, maximum likelihood will often have no trouble converging for a model that includes a typographical error, whereas a Bayesian estimator may show signs that it would never converge. We do not report above the many instances in which model nonconvergence tipped us off to the fact that we had a typographical error in our model statement. In an instance of nonconvergence, we were generally able to obtain convergence by checking our model statement, linearly transforming variables to have more similar variances, making our models simpler, or, if needed, collecting more data. Asparouhov and Muthén (2022) might add that we would have had even fewer problems had we taken their advice and fixed as many parameters as possible; our preference, however, was to leave as many parameters as possible free to vary randomly, because we found that more plausible theoretically.
Discussion
In this study, we used multilevel and N = 1 methods to examine bothersome tinnitus, describing some of the considerations that are relevant to clinical scientists interested in using DSEM. We found, as expected, that although there were between-level effects in multilevel models (i.e., what would more usually be termed fixed effects in standard MLM), every parameter showed variation across individuals, with some parameters showing clearly meaningful differences. Examining the possibility of multiple lag lengths—a question that is typically not asked and difficult to answer without intensive longitudinal data—we found that we might have missed important findings by examining only one lag length. However, we also found that aspects of the model involving multiple lag lengths appeared to be the least robust across methods. We also tested whether ML-DSEM estimates of individual models were related to N = 1 model estimates. Estimates across methods were moderately to highly consistent. This result is reassuring in that, when sample size is high, ML-DSEM can offer some ability to examine individual models even with relatively few time points (i.e., about 25 observations; Schultzberg & Muthén, 2018). When we searched for individuals with highly distinct models, we found evidence that a minority of individuals had poorer fit for a group-based model than an N = 1 model. Several of these findings will benefit from further discussion.
First, the implications of the final baseline model and its idiographic variants may be informative for both tinnitus and applications of models like the ML-DSEM model to other areas in psychology. One unexpected finding was that tinnitus bother had additional significant autoregressive lags, both in the ML-DSEM model and for many individuals. On a fundamental level, interpreting this finding is no different than interpreting any other regression coefficient. However, researchers are often vulnerable to misinterpretations of regression coefficients. For example, in our experience, researchers often interpret positive autoregressive coefficients in these models as meaning "the variable is increasing." However, it cannot be true that tinnitus bother is going up systematically. We know this because, if tinnitus bother were going up, this tendency would have been detected as a time effect (e.g., of day in study and survey of the day): the autoregressive effect is by definition above and beyond any time effects. Autoregressive paths above one, which would indicate an increasing trend, are removed from the iterations of the model in DSEM, because such a trend violates assumptions of stationarity. Even if an autoregressive parameter were very nearly one, the finding would mean that the value at one time point is similar to the value at the next; because of error, we should not expect the later value to be higher than the current one, but rather near the same value, plus or minus error.
If the findings do not mean that tinnitus bother is increasing, what do they mean? As noted above, the correct interpretation is no different than for any other regression parameter. Tinnitus bother at the current time is a significant predictor of tinnitus bother 4, 8, and 12 hours from now, such that when it is higher now (relative to the mean), it tends to be higher later (again, relative to the mean). In other words, imagining a person's distribution of tinnitus bother ratings across the entire study, their score now, relative to the distribution, improves our guess at their score later relative to the distribution. The fact that anxiety and cognitive symptoms had only one notable autoregressive lag means that once you know the level of anxiety or cognitive symptoms 4 hours ago, knowing their levels before that point adds nothing. In contrast, the past 12 hours of tinnitus bother tell you something about current tinnitus bother, although you still learn much more from the value 4 hours ago than from the earlier values. With such small values for higher-order autoregressive lags, one can question whether all of this has an impact on the experience of tinnitus bother for individuals. However, the finding does seem to go along with what we have heard from some patients: that they have bad days for their tinnitus bother, when it seems to persist without clear explanation. If this interpretation is correct, however, we might expect higher-order symptom lags whenever we focus on a population of participants who struggle with that symptom, given that our experience with internalizing problems suggests that people often struggle with bad days.
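The mean-reverting interpretation can be demonstrated with a short simulation: with an autoregressive coefficient below one (here an assumed value of .9 in a toy AR(1) process, not an estimate from our models), scores above the mean predict later scores that remain elevated but are, on average, closer to the mean:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate an AR(1) process with coefficient .9: strong inertia, no trend
phi, T = 0.9, 20000
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.normal()

# When the series is well above its mean (0 here), the next value tends to
# remain elevated but lies closer to the mean on average: mean reversion
above = y[:-1] > 1.0
now_mean = y[:-1][above].mean()
next_mean = y[1:][above].mean()
```

The series shows no systematic upward drift, yet high values are reliably followed by values that are still above average: exactly the "similar to the last value, plus or minus error" pattern described above.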
The ML-RDSEM model also suggests that tinnitus bother has a reciprocal relationship with anxiety, although at what lag length tinnitus contributed to anxiety varied across models. Further examination of N = 1 models, however, demonstrated that clearly reciprocal relationships were rare in individuals. Instead, for example, participants who showed clear evidence of anxiety predicting their tinnitus bother tended not to show evidence of their tinnitus bother predicting future anxiety. One reason we might have found this result is power alone: idiographic models have much less power than multilevel models run in the same sample, by definition. Indeed, when we ran a power analysis via simulations, we found that it was unrealistic to expect that the idiographic models could detect the average effects seen on the between level for the multilevel models; even doubling the size of these numerically small estimates did not result in good power at a realistic number of time points. To provide greater clarity, we examined whether the two components of the proposed feedback loop tended to correlate. Again, the multilevel and N = 1 models differed here, with the multilevel model suggesting these paths did positively correlate, whereas the individual models did not. Our findings are thus entirely consistent with long-sounded warnings that group-based models may be limited in their usefulness for determining what is true for individuals (e.g., Molenaar, 2004).
It may seem obvious to some readers that estimates from multilevel models should be preferred because they, by definition, offer more power. A contrasting point of view is that the multilevel model provides a misleading read of the data by cobbling together a significant fixed effect out of widely dispersed effects that are generally nonsignificant on the individual level. One can bolster this point of view by noting that multilevel models, including ML-DSEM, tend to draw observations toward the mean, a phenomenon known as shrinkage (Liu, 2017; Liu et al., 2021; see Piccirillo & Rodebaugh, 2022, supplemental material, for a demonstration within DSEM). As Liu et al. (2021) note, shrinkage can be argued to be a good conservative move, particularly for participants with fewer data points: It may be a good bet that people with fewer data are closer to the mean than their data indicate, provided that the mean is reasonable. On the other hand, one could argue that someone who is missing more data may be having unusual experiences (indeed, clinical scientists are often particularly interested in unusual experiences), making the assumption that their experiences are normative problematic. That is, whether shrinkage is useful depends on there being a distribution with a meaningful central tendency in the population, which we generally do not know, but rather assume.
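The logic of shrinkage can be sketched with a precision-weighted average, the building block of empirical-Bayes-style estimation. All numbers below (the group mean, the between-person SD, and the standard errors) are hypothetical, not values from the study:

```python
# Minimal sketch of shrinkage: a noisy person-specific estimate is pulled
# toward the group mean, and the pull is stronger when the person
# contributed fewer observations (i.e., has a larger standard error).
group_mean = 0.30  # hypothetical fixed effect for an autoregressive path
group_sd = 0.10    # hypothetical between-person SD of that path

def shrink(estimate, se, mu=group_mean, tau=group_sd):
    """Precision-weighted compromise between a person's estimate and the group mean."""
    w = (1 / se**2) / (1 / se**2 + 1 / tau**2)  # weight on the person's own data
    return w * estimate + (1 - w) * mu

raw = 0.60  # an unusual person-specific estimate
for n_obs, se in [(200, 0.07), (50, 0.14)]:
    print(f"n = {n_obs:3d}: raw {raw:.2f} -> shrunken {shrink(raw, se):.2f}")
```

The person with fewer observations ends up closer to the group mean, which is conservative if the population truly has a meaningful central tendency and problematic if that person's experiences are genuinely unusual.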
Someone favoring an idiographic point of view could even argue that the limited power of the idiographic model is a strength, in that only effects likely to be important to the individual (i.e., very large ones) will be observed. Important effects that are smaller might be missed, but at least we are at less risk of emphasizing small effects that individuals are unlikely to notice themselves. A counterargument here is that we do not really know how large an effect needs to be before an individual would notice it. What is missing from these arguments, we believe, is a clear arbiter of which paths are important to recover. Determining which model is more useful, a model based on a group or a person’s own individual model, will require a test of the real-world consequences of making decisions based on each model (i.e., the external validity of each model).
Across methods of analysis, we found evidence that anxiety symptoms impact tinnitus bother for at least some individuals, suggesting possible interventions. Clinically, results suggest that long-term reduction of tinnitus bother might be productively addressed, on the population level, by interventions to reduce anxiety symptoms. Perhaps not surprisingly, reducing anxiety and improving ability to focus are two of the aims of cognitive behavioral interventions for tinnitus (Andersson, 2002). Notably, however, reducing cognitive symptoms (primarily, here, distractibility) would not necessarily produce positive effects on tinnitus bother for all individuals according to the model results. That is, for some individuals, the negative relationship between cognitive symptoms and tinnitus bother might suggest that increasing attention would also increase tinnitus bother. On the other hand, psychological interventions designed specifically to improve cognitive focus (e.g., meditative practices) would also generally be expected to reduce anxiety, which is itself a beneficial outcome and could thus lead to better outcomes overall even if improved concentration does not directly improve tinnitus bother for selected individuals. Future work could combine intensive longitudinal data with treatment and examine the extent to which successful treatment changes participant model pathways (e.g., does successful treatment eliminate paths from anxiety to tinnitus?).
We can offer several practical implications for researchers who are beginning their work with intensive longitudinal data. First, although our inclination was to model all variables at once, we found that simpler models offered more robust and interpretable results. We suspect that many researchers will have intuitions about what makes for a useful model based on a long history of using between-participant techniques, such as standard multiple regression. These intuitions may need updating for models focusing on intensive longitudinal data. Second, although McNeish and Hamaker (2020) advise the use of the RDSEM model when working with time trends, we were not always able to use this model due to convergence issues. We also found that, despite Asparouhov and Muthén’s (2020) success in using the DIC to compare across models, our data did not allow us to do the same. Following McNeish and Hamaker, until more routinely useful fit indices are offered for DSEM, we recommend using the RDSEM model on a conceptual basis, but we also note that this model may lead to convergence problems, in which case DSEM models, potentially with time trends removed ahead of time, can be used instead. Further, we recommend that virtually all intensive longitudinal studies include variables related to time course, both to account for time trends and to better understand them.
Indeed, one important implication of the results for future work with intensive longitudinal data involves the time effects observed. There was a significant effect of day in the study, for the group as a whole, on cognitive symptoms (which tended to decrease across days) and a time-of-day effect for anxiety symptoms (which tended to decrease across the day). It is not difficult to speculate as to why this might be (e.g., initial EMA prompts might lead to slight increases in distraction that then reduce over time). The more important point is that such time effects were observed despite not being hypothesized. Stationarity is a required assumption for interpreting most types of models used for intensive longitudinal data (cf. Bringmann et al., 2017, or McNeish & Hamaker, 2020, for a review). When either the individual or the group shows a linear trend in scores over time, that trend violates stationarity unless it is either partialled out or modeled (Bringmann et al., 2017). Here, the group trends indicate an overall tendency toward change that could produce spurious results if not modeled. Violations of stationarity have only occasionally been discussed regarding psychological data (although see, e.g., Bringmann et al., 2015; Bringmann et al., 2013; McNeish & Hamaker, 2020). The current data suggest that time effects may be common in psychological data and should therefore be examined whenever intensive longitudinal data are analyzed.
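The danger of leaving a trend unmodeled can be demonstrated in a few lines. The series below is a sketch with an arbitrary true autoregressive value and trend slope (both hypothetical), using linear detrending as one simple way to restore stationarity:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate a weakly autocorrelated symptom series with a linear drift
# across the study (e.g., symptoms decreasing over days).
T = 400
true_phi = 0.2
e = np.zeros(T)
for t in range(1, T):
    e[t] = true_phi * e[t - 1] + rng.normal()
trend = 0.02 * np.arange(T)  # the unmodeled time trend
y = e + trend

def lag1_coef(series):
    """OLS lag-1 autoregressive coefficient (with intercept via centering)."""
    x, target = series[:-1], series[1:]
    x = x - x.mean()
    return (x @ (target - target.mean())) / (x @ x)

# Remove the fitted linear trend before estimating the dynamics
time = np.arange(T)
detrended = y - np.polyval(np.polyfit(time, y, 1), time)

print(f"AR(1) estimate with trend left in: {lag1_coef(y):.2f}")
print(f"AR(1) estimate after detrending:   {lag1_coef(detrended):.2f}")
```

With the trend left in, the autoregressive estimate is badly inflated, because the trend itself makes adjacent observations similar; detrending recovers an estimate near the true value.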
The estimation of individual parameters based on a multilevel model is also of interest. The stability of parameters from fully idiographic to ML-DSEM estimation methods suggests that ML-DSEM can provide individual parameter estimates that are relatively consistent with those of N = 1 models. Our findings converge with very similar findings from Hamaker et al. (in press) in a separate dataset. This consistency is important because researchers have cautioned that multilevel models are not idiographic, and thus the individual estimates they can provide may not be informative (Molenaar, 2004). Part of the reason for this caution is that individuals might have different factor structures than what is true on average (cf. Fisher, 2015). We found evidence for consistency in factor structure across participants overall, but we did find two individuals who clearly had better-fitting, fully idiographic latent structures. These results provide some reassurance that ML-DSEM models can be useful in estimating individual-level paths, but they also warn that even in a relatively modest sample, a clinical scientist should expect that some individuals will not be consistent with any given model based on a group of individuals. We speculate that the focused nature of the questions we asked participants, coupled with the fact that we examined tinnitus bother only (e.g., as opposed to a variety of anxiety disorders), produced more consistency in factor structure than might be expected in other samples.
The results of this study should be interpreted in light of its limitations. Although the sample size was reasonably large in terms of number of observations, a larger, more diverse sample with clear assessment of demographic factors (including race, ethnicity, and income) would clearly be preferable. Study 2 had less information available regarding demographics, but the information from Study 1 suggests our sample was, in addition to largely white, also highly educated and relatively high in socioeconomic status; future work should examine a wider range of individuals. The limited simulation work available suggests that although our model likely has reasonable estimates, they may still be subject to bias. A sample size closer to 100 would be preferable for future studies with similar numbers of time points, with 200 or more participants preferable when fewer time points are available (Schultzberg & Muthén, 2018). Importantly, our current work will allow us to estimate power for future studies more effectively by using a simulation based on our obtained parameters.
Nevertheless, the current results suggest cautious optimism regarding the possibility of using ML-DSEM, or similar techniques such as Group Iterative Multiple Model Estimation (cf. Wright et al., 2017, for an extended example with psychological data), to examine group- and individual-level models for participants with bothersome tinnitus. In our exploratory analyses, we examined the possibility that such models would predict response to intervention and found several promising paths. If such models are shown consistently to predict treatment outcome, they could ultimately be useful in assigning patients to available treatments. We believe the use of ML-DSEM is particularly promising in this regard. Although fully idiographic models have a number of philosophically appealing features, generating one currently requires a researcher familiar with these analyses, because issues of model instability and interpretation demand expertise even when the models themselves can be run without it. In contrast, a multilevel model, provided enough participants have contributed data, could become stable enough that a single participant’s data could be added to the group and their individual parameters estimated without the help of a specialist. Whether multilevel models will ultimately be precise enough on the individual level to provide the long-promised revolution in precision interventions is a question for further research.
Tinnitus is not the only heterogeneous health-related condition that has come to the attention of clinical scientists. We conceptualize tinnitus as one of many so-called functional somatic disorders that are affected by anxiety, stress, and disrupted attention (Barsky & Borus, 1999; Geisser et al., 2008; Kanaan et al., 2007). We found evidence, from both methods used, that patients with bothersome tinnitus vary in terms of whether, and to what extent, their bothersome tinnitus is maintained by affect (e.g., anxiety, stress, and worry) and attention dysregulation (e.g., cognitive symptoms). This conceptualization is consistent with cognitive behavioral conceptualizations of medically unexplained symptoms (Deary et al., 2007), and could be expanded to include other factors identified in such conceptualizations (e.g., avoidance of anticipated tinnitus, catastrophic interpretation of symptoms; cf. Deary et al., 2007). We encourage further study of the heterogeneity of the experiences of people with bothersome tinnitus and similar conditions, with an emphasis on determining whether these disparate experiences provide insight into treatment outcomes.
Author Note
This research was supported by the National Institute of Mental Health (NIMH), National Research Service Award (NRSA) F31 MH115641 and National Institute on Alcohol Abuse and Alcoholism Training Award T32AA007455 and K99AA029459 to Marilyn Piccirillo, the NIMH, NRSA F31 MH124291 to Madelyn Frumkin, National Institute of Deafness and Other Communication Disorders (NIDCD) award 1R01DC017451-01 to Jay F. Piccirillo and Thomas L. Rodebaugh, and the NIDCD, Development of Clinician/Researchers in Academic ENT Training Program, award T32DC000022 to Katherine Gerull. Thanks to Nicholas C. Jacobson for providing information regarding his R package DT-VEM, Adrienne Beltz for information regarding Group Iterative Multiple Model Estimation capabilities, Daniel McNeish for guidance regarding testing factorial invariance, and Ellen Hamaker and Tihomir Asparouhov for guidance on dynamic structural equation modeling more generally.
Footnotes
In discussion, an anonymous statistician at Mplus raised doubts about our approach here because of the complexity of the model; the reader might also find it unsatisfying, as we do, that this approach still cannot tell us how good the fit is overall. One solution is to return to a two-level confirmatory factor model in MSEM, which we also investigated. Here, although warnings indicated, as expected, that the model did not converge properly (due to the small number of participants), we saw reasonable to excellent fit indices for the three-factor model and inferior fit on every index for the one-factor model. Notably, every approach examined suggested the three-factor model was preferable, so we concluded it was best to use it moving forward, although a test in a larger sample would be advisable.
During the review process it was noted that starting a time variable with 1 would normally be avoided in other analysis methods. We investigated and found that whether these variables started with 1 or 0 made no substantive difference for any of our models. Nevertheless, the parameters in Figure 1 and all output we include in supplements were revised to reflect the results when time of day and day of study start with 0 instead of 1.
When we inquired about this issue with anonymous Mplus staff, they confirmed our expectation that pD values should be very close in value for DICs to be comparable between DSEM and RDSEM models.
Contributor Information
Thomas L. Rodebaugh, Department of Psychological and Brain Sciences, Washington University in St Louis
Marilyn L. Piccirillo, Department of Psychological and Brain Sciences, Washington University in St Louis
Madelyn R. Frumkin, Department of Psychological and Brain Sciences, Washington University in St Louis
Dorina Kallogjeri, Department of Otolaryngology, Washington University School of Medicine in St Louis.
Katherine M. Gerull, Department of Otolaryngology, Washington University School of Medicine in St Louis
Jay F. Piccirillo, Department of Otolaryngology, Washington University School of Medicine in St Louis
References
- Andersson G (2002). Psychological aspects of tinnitus and the application of cognitive-behavioral therapy. Clinical Psychology Review, 22(7). 10.1016/s0272-7358(01)00124-6
- Asparouhov T, Hamaker EL, & Muthén B (2018). Dynamic structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 25(3), 359–388. 10.1080/10705511.2017.1406803
- Asparouhov T, & Muthén B (2022). Practical aspects of dynamic structural equation models. Retrieved June 6, 2022, from http://www.statmodel.com/download/PDSEM.pdf
- Asparouhov T, & Muthén B (2020). Comparison of models for the analysis of intensive longitudinal data. Structural Equation Modeling: A Multidisciplinary Journal, 27(2), 275–297.
- Baguley D, McFerran D, & Hall D (2013). Tinnitus. Lancet, 382(9904), 1600–1607. 10.1016/S0140-6736(13)60142-7
- Barsky AJ, & Borus JF (1999). Functional somatic syndromes. Annals of Internal Medicine, 130(11), 910–921.
- Bringmann L, Ferrer E, Hamaker E, Borsboom D, & Tuerlinckx F (2015). Modeling nonstationary emotion dynamics in dyads using a semiparametric time-varying vector autoregressive model. Multivariate Behavioral Research, 50(6). 10.1080/00273171.2015.1120182
- Bringmann LF, Hamaker EL, Vigo DE, Aubert A, Borsboom D, & Tuerlinckx F (2017). Changing dynamics: Time-varying autoregressive models using generalized additive modeling. Psychological Methods, 22(3). 10.1037/met0000085
- Bringmann LF, Vissers N, Wichers M, Geschwind N, Kuppens P, Peeters F, Borsboom D, & Tuerlinckx F (2013). A network approach to psychopathology: New insights into clinical longitudinal data. PLoS ONE, 8(4), e60188. 10.1371/journal.pone.0060188
- Castro-Alvarez S, Tendeiro JN, de Jonge P, Meijer RR, & Bringmann LF (2021). Mixed-effects trait-state-occasion model: Studying the psychometric properties and the person-situation interactions of psychological dynamics. Structural Equation Modeling. 10.1080/10705511.2021.1961587
- Deary V, Chalder T, & Sharpe M (2007). The cognitive behavioural model of medically unexplained symptoms: A theoretical and empirical review. Clinical Psychology Review, 27(7). 10.1016/j.cpr.2007.07.002
- Fisher AJ (2015). Toward a dynamic model of psychological assessment: Implications for personalized care. Journal of Consulting and Clinical Psychology, 83(4). 10.1037/ccp0000026
- Fisher AJ, Medaglia JD, & Jeronimus BF (2018). Lack of group-to-individual generalizability is a threat to human subjects research. Proceedings of the National Academy of Sciences of the United States of America, 115(27), E6106–E6115. 10.1073/pnas.1711978115
- Geisser ME, Glass JM, Rajcevska LD, Clauw DJ, Williams DA, Kileny PR, & Gracely RH (2008). A psychophysical study of auditory and pressure sensitivity in patients with fibromyalgia and healthy controls. Journal of Pain, 9(5), 417–422.
- Geldhof GJ, Preacher KJ, & Zyphur MJ (2014). Reliability estimation in a multilevel confirmatory factor analysis framework. Psychological Methods, 19(1). 10.1037/a0032138
- Gerull KM, Kallogjeri D, Piccirillo ML, Rodebaugh TL, Lenze EJ, & Piccirillo JF (2019). Feasibility of intensive ecological sampling of tinnitus in intervention research. Otolaryngology–Head and Neck Surgery, 161(3), 485–492.
- Hall DA, Mohamad N, Firkins L, Fenton M, & Stockdale D (2013). Identifying and prioritizing unmet research questions for people with tinnitus: The James Lind Alliance Tinnitus Priority Setting Partnership. Clinical Investigation, 3(1), 21–28. 10.4155/cli.12.129
- Hallquist MN, & Wiley JF (2018). MplusAutomation: An R package for facilitating large-scale latent variable analyses in Mplus. Structural Equation Modeling, 25, 1–18.
- Hamaker EL (2012). Why researchers should think "within-person": A paradigmatic rationale.
- Hamaker EL, Asparouhov T, Brose A, Schmiedek F, & Muthén B (2018). At the frontiers of modeling intensive longitudinal data: Dynamic structural equation models for the affective measurements from the COGITO study. Multivariate Behavioral Research, 53(6), 820–841. 10.1080/00273171.2018.1446819
- Hamaker EL, Asparouhov T, & Muthén B (in press). Dynamic structural equation modeling as a combination of time series modeling, multilevel modeling, and structural equation modeling. In Hoyle RH (Ed.), The Handbook of Structural Equation Modeling (2nd ed.). Guilford.
- Hoare DJ, Kowalkowski VL, Kang S, & Hall DA (2011). Systematic review and meta-analyses of randomized controlled trials examining tinnitus management. Laryngoscope, 121(7), 1555–1564.
- Jacobson NC, Chow S-M, & Newman MG (2019). The Differential Time-Varying Effect Model (DTVEM): A tool for diagnosing and modeling time lags in intensive longitudinal data. Behavior Research Methods, 51(1), 295–315.
- Kanaan RA, Lepine JP, & Wessely SC (2007). The association or otherwise of the functional somatic syndromes. Psychosomatic Medicine, 69(9), 855–859.
- Lafit G, Adolf JK, Dejonckheere E, Myin-Germeys I, Viechtbauer W, & Ceulemans E (2021). Selection of the number of participants in intensive longitudinal studies: A user-friendly shiny app and tutorial for performing power analysis in multilevel regression models that account for temporal dependencies. Advances in Methods and Practices in Psychological Science, 4(1). 10.1177/2515245920978738
- Landgrebe M, Azevedo A, Baguley D, Bauer C, Cacace A, Coelho C, Dornhoffer J, Figueiredo R, Flor H, Hajak G, Van de Heyning P, Hiller W, Khedr E, Kleinjung T, Koller M, Lainez JM, Londero A, Martin WH, Mennemeier M, Piccirillo J, De Ridder D, Rupprecht R, Searchfield G, Vanneste S, Zeman F, & Langguth B (2012). Methodological aspects of clinical trials in tinnitus: A proposal for an international standard. Journal of Psychosomatic Research, 73(2), 112–121.
- Lee IA, & Little TD (2012). P-technique factor analysis. In Laursen B, Little TD, & Card NA (Eds.), Handbook of Developmental Research Methods (pp. 350–363). Guilford.
- Li R, Dziak JJ, Tan X, Huang L, Wagner AT, & Yang J (2017). TVEM (time-varying effect model) SAS macro users’ guide (Version 3.1.1). The Methodology Center, Penn State. http://methodology.psu.edu
- Liu S (2017). Person-specific versus multilevel autoregressive models: Accuracy in parameter estimates at the population and individual levels. British Journal of Mathematical and Statistical Psychology, 70(3), 480–498. 10.1111/bmsp.12096
- Liu S, Kuppens P, & Bringmann L (2021). On the use of empirical Bayes estimates as measures of individual traits. Assessment, 28(3), 845–857. 10.1177/1073191119885019
- McNeish D, & Hamaker EL (2020). A primer on two-level dynamic structural equation models for intensive longitudinal data in Mplus. Psychological Methods, 25(5), 610–635. 10.1037/met0000250
- McNeish D, Mackinnon DP, Marsch LA, & Poldrack RA (2021). Measurement in intensive longitudinal data. Structural Equation Modeling, 28(5), 807–822. 10.1080/10705511.2021.1915788
- Molenaar PC, de Gooijer JG, & Schmitz B (1992). Dynamic factor analysis of nonstationary multivariate time series. Psychometrika, 57(3). 10.1007/bf02295422
- Molenaar PCM (2004). A manifesto on psychology as idiographic science: Bringing the person back into scientific psychology, this time forever. Measurement: Interdisciplinary Research and Perspectives, 2(4). 10.1207/s15366359mea0204_1
- Molenaar PCM (2007). Psychological methodology will change profoundly due to the necessity to focus on intra-individual variation. Integrative Psychological & Behavioral Science, 41(1). 10.1007/s12124-007-9011-1
- Molenaar PCM (2013). On the necessity to use person-specific data analysis approaches in psychology. European Journal of Developmental Psychology, 10(1). 10.1080/17405629.2012.747435
- Morey RD, Hoekstra R, Rouder JN, Lee MD, & Wagenmakers EJ (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1), 103–123. 10.3758/s13423-015-0947-8
- Piccirillo ML, & Rodebaugh TL (2019). Foundations of idiographic methods in psychology and applications for psychotherapy. Clinical Psychology Review. 10.1016/j.cpr.2019.01.002
- Piccirillo ML, & Rodebaugh TL (2022). Personalized networks of social anxiety disorder and depression and implications for treatment. Journal of Affective Disorders, 298(Part A), 262–276. 10.1016/j.jad.2021.10.034
- Schultzberg M, & Muthén B (2018). Number of subjects and time points needed for multilevel time-series analysis: A simulation study of dynamic structural equation modeling. Structural Equation Modeling, 25(4). 10.1080/10705511.2017.1392862
- Schuurman NK, & Hamaker EL (2019). Measurement error and person-specific reliability in multilevel autoregressive modeling. Psychological Methods, 24(1), 70–91. 10.1037/met0000188
- Tyler RS, Coelho C, & Noble W (2006). Tinnitus: Standard of care, personality differences, genetic factors. ORL: Journal of Oto-Rhino-Laryngology and Its Related Specialties, 68(1), 14–19.
- van Alphen T, Jak S, Jansen in de Wal J, Schuitema J, & Peetsma T (2022). Determining reliability of daily measures: An illustration with data on teacher stress. Applied Measurement in Education, 1–17.
- Wright A, Gates K, Arizmendi C, Lane S, Woods W, & Edershile E (2017). Focusing personality assessment on the person: Modeling general, shared, and person specific processes in personality and psychopathology. Psychological Assessment, 31(4), 502–515.