Abstract
The present simulation study indicates that a method where the regression effect of a predictor (X) on an outcome at follow-up (Y1) is calculated while adjusting for the outcome at baseline (Y0) can give spurious findings, especially when there is a strong correlation between X and Y0 and when the test–retest correlation between Y0 and Y1 is relatively weak. Researchers wishing to avoid spurious findings and Type 1 errors should be aware of this phenomenon and are advised to verify any effect found by also estimating the unadjusted effect of X on the Y1–Y0 difference.
Keywords: adjusting for baseline, correlation, follow-up, regression effect, Type 1 error
Introduction
It is quite popular to analyze the effect of a predictor measured at baseline on an outcome measured at follow-up while controlling for the outcome at baseline. Studies employing this method have, for example, concluded that while controlling for degree of depression symptoms at baseline, a higher degree of depression symptoms at follow-up can be predicted from higher levels of C-reactive protein and interleukin-6 (Gimeno et al., 2009), underweight (Kim, Noh, Park, & Kwon, 2014), higher body mass index (Dearborn, Robbins, & Elias, 2018), lower self-esteem (Johnson, Meyer, Winett, & Small, 2000), less cognitive control over sad stimuli (Vanderlind et al., 2014), more life stressors (Moos, Schutte, Brennan, & Moos, 2005), higher job demands (Paterniti, Niedhammer, Lang, & Consoli, 2002), poorer relationship quality (Eberhart & Hammen, 2006), and separation (O’Connor, Cheng, Dunn, & Golding, 2005).
A couple of points about the “X on Y1 while adjusting for Y0” effect, which do not always seem to be appreciated, are worth making. First, this effect is usually not the same as the effect of X on the change in Y, that is, the Y1–Y0 difference (Eriksson & Häggström, 2014; Lord, 1967; Pearl, 2016; van Breukelen, 2013). In Figure 1, we see that the total effect of X on Y1–Y0 equals b − a(1 − c) and, hence, equals b only if a(1 − c) = 0, that is, if a = 0 or if c = 1. To borrow a fictitious example from Sorjonen, Farioli, Hemmingsson, and Melin (2017): If a group of professional dart players throws an average of 9 on a first throw (Y0) and an average of 9 on a second throw (Y1) and a group of amateur dart players throws an average of 4 on a first throw (Y0) and an average of 4 on a second throw (Y1), the average Y1–Y0 difference is zero in both groups and there is no effect of “professionalism” on this change. However, if we included Y0 as a covariate in the model, we would probably see a positive effect of professionalism on Y1 (b > 0). This is because if the effect of professionalism on Y1–Y0 equals zero, then b − a(1 − c) = 0, that is, b = a(1 − c). Because there is a positive association between professionalism and Y0 (a > 0), and because we can expect a less than perfect association between Y0 and Y1 even when adjusting for professionalism (c < 1), we would expect a positive association between professionalism and Y1 when adjusting for Y0 (b > 0). For example, if the effect of professionalism on Y0 (a) equals 5 and the effect of Y0 on Y1 while adjusting for professionalism (c) equals 0.7, we would expect the effect of professionalism on Y1 while adjusting for Y0 (b) to equal 5(1 − 0.7) = 1.5.
Figure 1.
Effects between a predictor X, an outcome at two time-points (Y0 and Y1), and the Y1–Y0 difference. Adapted from Pearl (2016).
In more generic terms, we can predict the effect of X on Y1 while adjusting for Y0 (bx,y1·y0 corresponds to effect b in Figure 1) from the effect of X on the Y1–Y0 difference (bx,y1–y0), the effect of X on Y0 (bx,y0 corresponds to effect a in Figure 1), and the effect of Y0 on Y1 while adjusting for X (by0,y1·x corresponds to effect c in Figure 1) according to the following:
$$b_{x,y_1 \cdot y_0} = b_{x,y_1 - y_0} + b_{x,y_0}\left(1 - b_{y_0,y_1 \cdot x}\right) \tag{1}$$
or, using the letters in Figure 1: b = bx,y1–y0 + a(1 − c)
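The dart example and Equation (1) can be checked numerically. The following is a minimal sketch, not the authors' script: the helper names `slope` and `slopes2`, the group sizes, and the split of within-group variance into 0.7 (stable skill) and 0.3 (throw-to-throw noise) are assumptions chosen so that a is about 5 and c is about 0.7. For ordinary least squares estimates, the identity in Equation (1) should hold exactly in the sample, up to rounding.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    """Sample covariance of two equally long lists."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def slope(x, y):
    """Slope from the simple regression of y on x."""
    return cov(x, y) / cov(x, x)


def slopes2(x1, x2, y):
    """Slopes for x1 and x2 from the regression of y on both predictors."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det


random.seed(1)
n_per_group = 5000  # assumed group size, large enough for stable estimates
x, y0, y1 = [], [], []
for professional in (0, 1):
    for _ in range(n_per_group):
        # Stable true skill: amateurs average 4, professionals average 9.
        skill = 4 + 5 * professional + random.gauss(0, 0.7 ** 0.5)
        x.append(professional)
        # Two throws that differ from the true skill only by random noise.
        y0.append(skill + random.gauss(0, 0.3 ** 0.5))
        y1.append(skill + random.gauss(0, 0.3 ** 0.5))

a = slope(x, y0)                       # effect of X on Y0
b, c = slopes2(x, y0, y1)              # X and Y0 effects on Y1, mutually adjusted
b_diff = slope(x, [v1 - v0 for v0, v1 in zip(y0, y1)])  # X on Y1 - Y0

print(f"a = {a:.3f}, c = {c:.3f}, b = {b:.3f}, b_diff = {b_diff:.3f}")
print(f"b_diff + a(1 - c) = {b_diff + a * (1 - c):.3f}")
```

In this sketch no player's true skill changes, so the effect on the difference score stays near zero while the baseline-adjusted effect is clearly positive.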
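Second, the effect of X on the Y1–Y0 difference while adjusting for Y0 is the same as the effect of X on Y1 while adjusting for Y0 (van Breukelen, 2013). This is easy enough to see, because subtracting Y0 from both sides of the regression equation for Y1 changes only the coefficient for Y0:
$$Y_1 = b_0 + b_1 X + b_2 Y_0 + e$$
$$Y_1 - Y_0 = b_0 + b_1 X + (b_2 - 1)\,Y_0 + e$$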
Hence, if predicting the Y1–Y0 difference rather than Y1, the coefficient for Y0 (b2) decreases by one, but the intercept (b0) and the coefficient for X (b1) remain the same.
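This second point can also be illustrated with a small numerical check. The sketch below uses an arbitrary, assumed data-generating model (the coefficients 0.5, 0.3, and 0.6 carry no meaning) and a hypothetical helper `fit2` that solves the two-predictor least squares problem in closed form.

```python
import random


def fit2(x, z, y):
    """OLS of y on x and z; returns (intercept, coef_x, coef_z)."""
    n = len(y)
    mx, mz, my = sum(x) / n, sum(z) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((a - mz) ** 2 for a in z)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    szy = sum((a - mz) * (b - my) for a, b in zip(z, y))
    det = sxx * szz - sxz ** 2
    bx = (sxy * szz - szy * sxz) / det
    bz = (szy * sxx - sxy * sxz) / det
    return my - bx * mx - bz * mz, bx, bz


random.seed(2)
# Arbitrary illustrative data; the identity does not depend on these values.
x = [random.gauss(0, 1) for _ in range(500)]
y0 = [0.5 * xi + random.gauss(0, 1) for xi in x]
y1 = [0.3 * xi + 0.6 * y0i + random.gauss(0, 1) for xi, y0i in zip(x, y0)]
diff = [after - before for before, after in zip(y0, y1)]

b0, b1, b2 = fit2(x, y0, y1)    # outcome: Y1
d0, d1, d2 = fit2(x, y0, diff)  # outcome: Y1 - Y0

print(f"Y1 model:      b0 = {b0:.4f}, b1 = {b1:.4f}, b2 = {b2:.4f}")
print(f"Y1 - Y0 model: b0 = {d0:.4f}, b1 = {d1:.4f}, b2 = {d2:.4f}")
```

Only the Y0 coefficient should differ between the two fits, and by exactly one.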
It has been demonstrated algebraically (Huck & McLean, 1975) and through simulations (Petscher & Schatschneider, 2011) that analyzing the effect of X on Y1 while adjusting for Y0 is more powerful than analyzing the crude effect of X on the Y1–Y0 difference, and researchers have been discouraged from using the latter analysis (e.g., by Cronbach & Furby, 1970). However, these demonstrations have assumed an experimental design with subjects randomly assigned to conditions, that is, X-values, at baseline. Eriksson and Häggström (2014) have shown that the “X on Y1 while adjusting for Y0” analysis can give biased results if there is a correlation between X and Y0 and error in the measurement of Y0. It has also been argued, contrary to common assumptions, that the reliability of simple difference scores can be quite adequate and also that lack of reliability would not automatically mean lack of precision in the measurement of change (Rogosa, Brandt, & Zimowski, 1982; Rogosa & Willett, 1983; Zumbo, 1999).
Although Equation (1) above would give us an analytic prediction of the effect of X on Y1 while adjusting for Y0, it would probably not be very usable in practice if one wishes to evaluate effects in published research, as the effect of X on the Y1–Y0 difference (not adjusting for Y0) is seldom reported. The objective of the present simulation study was, therefore, to analyze and illustrate how the effect of X on Y1 while controlling for Y0 is associated with the more commonly reported correlation between X and Y0 as well as the test–retest correlation between Y0 and Y1. We employ a simulation that is, in our opinion, realistic, in which the observed values X, Y0, and Y1 are treated as indicators of individuals’ scores on nonobservable “true” variables. We also assume that all observed changes from Y0 to Y1 are only due to random fluctuations around individuals’ true scores and that there is no direct association between X and Y1, that is, any association between X and Y1 is completely due to confounding influences from other variables. Consequently, the present study could be seen to indicate a realistic null hypothesis of a completely spurious effect when analyzing the effect of X on Y1 while adjusting for Y0.
Method
Using R 3.5.0 statistical software (R Core Team, 2018), data were simulated through the following steps (Figure 2; script available as supplementary material in the online version of the article or from https://osf.io/ngm2s/). Virtual subjects were assigned (1) a True X (TrX) value from a random normal distribution, (2) a True Y (TrY) value from a random normal distribution with a defined population correlation with TrX, (3) two observed Y values, Y0 and Y1, with a defined population correlation with TrY (same for both), and (4) an observed X value. These steps were repeated 50,000 times with different combinations of sample size and correlations.
Figure 2.
Illustration of the steps in the simulation (solid lines) and analyzed effects (dashed lines).
The sample sizes used were 50, 100, 200, 400, 800, 1,600, 3,200, 6,400, 12,800, and 25,600 (i.e., nine doublings), which were judged to cover the full range from a very small to a very large sample size.
The correlation between TrX and TrY was selected from a random uniform distribution between −1 and 1, while the correlation between TrY and Y0/Y1 was selected from a random uniform distribution between 0 and 1. However, in order to get the full range of correlations between X and Y0, the population correlation between TrX and X was always set at a high 0.99. This degree of reliability in the measurement of X might seem unrealistically high. However, it has no effect on the results over and above its influence on the correlation between X and Y0. If we instead let the correlation between TrX and X vary between −1 and 1 and set the correlation between TrX and TrY to 0.99, we would get results similar to those in the present case. If we, as in the present case, let the correlation between TrX and TrY vary between −1 and 1 but set the correlation between TrX and X to a more realistic 0.7, we would also get similar results, with the difference that the correlation between X and Y0 would be limited to a range from −0.7 to 0.7.
In each simulation, the standardized regression effects of X on Y1 while adjusting for Y0 and on the Y1–Y0 difference (not adjusting for Y0) were calculated, as well as the correlations between X and Y0 and between Y0 and Y1. It should be emphasized that the data simulate a situation with no change in individuals’ true scores, so all observed changes from Y0 to Y1 are due to random fluctuations, and that any associations between X and Y1 are completely due to confounding influences of TrX and TrY. So, if the goal is to avoid spurious findings and Type 1 errors, the analyses should result in nonsignificant effects close to zero.
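The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' R script (which is available from the OSF link): the function names `indicator` and `simulate_once` are hypothetical, correlated normal variables are built as r × true score + sqrt(1 − r²) × noise, and a single data set is drawn with fixed correlations (0.8 and 0.7) instead of sampling them from uniform distributions over 50,000 repetitions.

```python
import math
import random


def corr(u, v):
    """Pearson correlation between two equally long lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def indicator(true_scores, r, rng):
    """Unit-variance variable with population correlation r to the true scores."""
    noise_sd = math.sqrt(1 - r * r)
    return [r * t + noise_sd * rng.gauss(0, 1) for t in true_scores]


def simulate_once(n, r_trx_try, r_try_y, r_trx_x=0.99, seed=None):
    rng = random.Random(seed)
    trx = [rng.gauss(0, 1) for _ in range(n)]  # step 1: true X
    try_ = indicator(trx, r_trx_try, rng)      # step 2: true Y
    y0 = indicator(try_, r_try_y, rng)         # step 3: observed baseline
    y1 = indicator(try_, r_try_y, rng)         # step 3: observed follow-up
    x = indicator(trx, r_trx_x, rng)           # step 4: observed X
    diff = [after - before for before, after in zip(y0, y1)]

    r_xy0, r_xy1, r_y0y1 = corr(x, y0), corr(x, y1), corr(y0, y1)
    # Standardized effect of X on Y1 adjusting for Y0 (two-predictor formula).
    beta_adj = (r_xy1 - r_xy0 * r_y0y1) / (1 - r_xy0 ** 2)
    # Standardized unadjusted effect of X on the difference is a correlation.
    beta_diff = corr(x, diff)
    return {"r_x_y0": r_xy0, "r_y0_y1": r_y0y1,
            "beta_adj": beta_adj, "beta_diff": beta_diff}


result = simulate_once(n=3200, r_trx_try=0.8, r_try_y=0.7, seed=42)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With these assumed settings the population correlation between X and Y0 is 0.99 × 0.8 × 0.7 ≈ 0.55 and the test–retest correlation is 0.7² = 0.49, so the adjusted effect should come out clearly positive while the effect on the difference score should stay near zero, even though no true score changes.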
Results
The standardized effects of the predictor X on the outcome at follow-up (Y1) while controlling for the outcome at baseline (Y0) and of X on the Y1–Y0 difference as functions of the correlation between X and Y0, separately for four different ranges of the test–retest correlation between Y0 and Y1 and three different sample sizes, are presented in Figure 3. We see that analyses of the effect of X on the Y1–Y0 difference seem good at avoiding spurious findings. Analyses of the effect of X on Y1 while adjusting for Y0, on the other hand, tend to result in strong effects if there is a strong correlation between X and Y0, unless this strong correlation is accompanied by a strong test–retest correlation between Y0 and Y1, in which case the effect can become anything between zero and close to unity.
Figure 3.
Standardized effects of a predictor X on an outcome at follow-up (Y1) while controlling for the outcome at baseline (Y0) (rows 2–4) and of X on the Y1–Y0 difference (row 1) as functions of the correlation between X and Y0, separately for four different ranges of the test–retest correlation between Y0 and Y1 and three different sample sizes. The solid lines in rows 2–4 indicate predicted effects according to Equations (2) and (3) (see text) and the dashed lines indicate 95% CI (±2 × SE as given by Equations 4 and 5). CI = confidence interval; SE = standard error.
Figure 3 indicates an inverse sigmoid association between the correlation between X and Y0 (rx,y0) and the effect of X on Y1 while controlling for Y0 (bx,y1·y0). A sigmoid curve is given by the following formula:
In this formula, d = the distance between the floor and the ceiling of the curve, s = the steepness of the curve, and c = the ceiling of the curve. In the present case, c = 1 and the function is symmetrical on both sides of zero, giving that d = 2c = 2, and the function can be simplified. After inversion, we get that the effect of X on Y1 while controlling for Y0 (bx,y1·y0) can be predicted by
| (2) |
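As a hedged reconstruction of the inversion step described above, assume the sigmoid is the standard logistic curve rescaled to run from c − d (floor) to c (ceiling). This parameterization is an assumption, and the published formula may be written differently. With c = 1 and d = 2, the curve and its inverse would be:

```latex
\begin{align}
r_{x,y_0} &= \frac{d}{1 + e^{-s\, b_{x,y_1 \cdot y_0}}} - (d - c)
           = \frac{2}{1 + e^{-s\, b_{x,y_1 \cdot y_0}}} - 1, \\
b_{x,y_1 \cdot y_0} &= -\frac{1}{s}\,\ln\!\left(\frac{2}{r_{x,y_0} + 1} - 1\right)
           = \frac{1}{s}\,\ln\!\left(\frac{1 + r_{x,y_0}}{1 - r_{x,y_0}}\right).
\end{align}
```

Under this form the predicted effect is zero when the correlation between X and Y0 is zero and grows without bound as that correlation approaches ±1, with a smaller s giving a steeper rise.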
Using the nonlinear least squares (nls) function in R, the s-parameter in Equation (2), in its turn, was found to be a function of the test–retest correlation between Y0 and Y1 (ry0,y1):
| (3) |
Equations (2) and (3) have been used to predict the effect of X on Y1 while controlling for Y0 from the correlation between X and Y0, separately for the four ranges of test–retest correlation between Y0 and Y1, and these predictions have been included as solid lines in Figure 3.
The standard error of the effect of X on Y1 while controlling for Y0, SE(bx,y1·y0), was calculated for various combinations of sample size, the correlation between X and Y0, and the test–retest correlation between Y0 and Y1, and this standard error was found to be affected by sample size, by these two correlations, and by their interactions (function available in the supplementary script, see supplementary material available in the online version of the article):
| (4) |
$$SE(b_{x,y_1 \cdot y_0}) = e^{\ln SE(b_{x,y_1 \cdot y_0})} \tag{5}$$
Equation (4) gives the natural logarithm of the standard error of the effect of X on Y1 while controlling for Y0 (the logarithm is used so that a negative standard error is never predicted), and by entering this predicted value into Equation (5) we get the predicted standard error. Equation (4) was obtained from a full factorial model including the predictors square root of sample size (√N), absolute value of the correlation between X and Y0 (|rx,y0|), test–retest correlation between Y0 and Y1 (ry0,y1), and all of their interactions. By adding/subtracting two predicted standard errors to/from the predicted effect, we get a 95% prediction interval, and these intervals have been added as dashed lines in Figure 3. The prediction interval seems to fit the data quite well.
A simulation with skewed variables X, Y0, and Y1 and N = 59 (the sample size in Horwitz, Berona, Czyz, Yeguez, & King, 2017, see the Discussion section) was run and yielded results similar to those of the simulations with normally distributed variables presented above. In Figure 4, the lines indicate predicted effects with 95% confidence interval based on the functions from the simulation with normally distributed variables, that is, Equations (2) to (5), and they seem to fit quite well also when the variables are skewed (observe that in Figure 4, the axes are limited to a range from 0 to 1 while in Figure 3 they cover the range from −1 to 1).
Figure 4.
Standardized effects of a predictor X on an outcome at follow-up (Y1) while controlling for the outcome at baseline (Y0) as functions of the correlation between X and Y0, separately for four different ranges of the test–retest correlation between Y0 and Y1 and for negatively (first row) and positively (second row) skewed variables. The solid lines indicate predicted effects according to Equations (2) and (3) (see text) and the dashed lines indicate 95% CI (±2 × SE as given by Equations 4 and 5). N = 59. CI = confidence interval; SE = standard error.
Discussion
Results from the present simulation indicate that with a strong correlation between X and Y0 but a not too strong test–retest correlation between Y0 and Y1 you should, according to a null hypothesis of a completely spurious effect, still expect to get a strong effect of X on Y1 while adjusting for Y0 (see also Eriksson & Häggström, 2014). However, if the test–retest correlation between Y0 and Y1 is also strong, you can expect anything. Hence, assuming adequate power, a research hypothesis about a nonspurious effect of X on Y1 while adjusting for Y0 can be corroborated mainly if the correlation between X and Y0 is relatively weak and falsified (but not corroborated) if the correlation between X and Y0 is strong. However, if both the correlations between X and Y0 and between Y0 and Y1 are strong, the research hypothesis can be neither corroborated nor falsified.
Horwitz et al. (2017) found that lack of positive expectations at baseline (X) predicted degree of depression 2 to 4 years later (Y1) among adolescents at elevated risk for suicide (N = 59) when controlling for degree of depression at baseline (Y0). The correlation between lack of positive expectations (X) and depression at baseline (Y0) was 0.402 and the depression test–retest correlation equaled 0.073. Entering these values into Equations (2) to (5) gives a predicted effect = 0.348 with a 95% prediction interval from 0.116 to 0.580 under the null hypothesis of a completely spurious effect. As the effect observed by Horwitz et al. (β = 0.421) was well within this prediction interval, one should be cautious about concluding, based on their findings, that lack of positive expectations (X) has any nonspurious effect on future degree of depression (Y1) when adjusting for present degree of depression (Y0), and a conclusion about a nonspurious effect would not be saved by reference to skewed variables (Figure 4). Given the positive correlation between lack of positive expectations (X) and depression at baseline (Y0), we can probably assume a negative correlation between lack of positive expectations and measurement error in depression given the same measured depression score. In other words, if subjects had the same measured depression at baseline, those with less positive expectations probably had a higher true degree of depression, on average, and had received a lower score than they should have (i.e., negative measurement error), while those with more positive expectations had received a higher score than they should have (i.e., positive measurement error).
Because of regression toward the mean (and the low test–retest correlation for depression indicates that this enfant terrible had a lot of room to wreak havoc), we can expect those with negative measurement errors, and consequently with less positive expectations, to have increased in degree of depression by follow-up, while those with positive measurement errors, and more positive expectations, can be expected to have decreased. Eriksson and Häggström (2014) have shown that the “X on Y1 while adjusting for Y0” method gives increasingly biased results with an increase in the correlation between X and Y0 and in the measurement error of Y0. Researchers wishing to avoid spurious findings and Type 1 errors should be aware of this phenomenon, and we recommend that any finding of an effect of X on Y1 while adjusting for Y0 be verified by an unadjusted effect of X on the Y1–Y0 difference.
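The comparison with Horwitz et al. (2017) can be retraced from the reported numbers alone. In the sketch below, the standard error is not taken from Equations (4) and (5), whose coefficients are not reproduced here, but is backed out from the reported interval on the assumption that the interval is the predicted effect ± 2 SE. The function name `prediction_interval` is hypothetical.

```python
import math

# Values reported in the text.
predicted_effect = 0.348                       # predicted under the spurious-effect null
lower_reported, upper_reported = 0.116, 0.580  # reported 95% prediction interval
observed_beta = 0.421                          # effect observed by Horwitz et al. (2017)

# Implied standard error: the interval spans 4 SE (predicted effect +/- 2 SE).
implied_se = (upper_reported - lower_reported) / 4
# Log standard error that Equation (4) would have to return for this case.
implied_log_se = math.log(implied_se)


def prediction_interval(effect, log_se, k=2.0):
    """Back-transform a predicted log SE and return effect -/+ k * SE."""
    se = math.exp(log_se)
    return effect - k * se, effect + k * se


lower, upper = prediction_interval(predicted_effect, implied_log_se)
inside = lower <= observed_beta <= upper
distance_in_se = (observed_beta - predicted_effect) / implied_se

print(f"implied SE = {implied_se:.3f}, interval = [{lower:.3f}, {upper:.3f}]")
print(f"observed effect inside interval: {inside} ({distance_in_se:.2f} SE from prediction)")
```

The observed effect lies well under one implied standard error above the null prediction, which is why it cannot be distinguished from a completely spurious effect.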
The present simulation study has mainly employed continuous and normally distributed variables and the presented formulas and functions can probably not be trusted to always give extremely accurate predictions with real data. The main point has not been to introduce impeccable prediction tools but rather to expose some problems with the evaluated analysis method. Somebody might want to argue that because the present simulation does not cover the full complexity of an often-messy reality, the problems it highlights are not real and it is safe to continue using the “X on Y1 while adjusting for Y0” method as has been done numerous times before. However, this would be like arguing that it is safe to consume a certain poison at home as long as its lethality has been demonstrated only in controlled laboratory settings. There does not seem to be any reason to believe that messy skewed data would pose a graver threat against the validity of the present simulation study than the validity of studies employing the “X on Y1 while adjusting for Y0” method with real data. Still, there are aspects that might be interesting to explore further, for example, how the effect of X on Y1 while adjusting for Y0 behaves in more complex models, including more covariates and also in logistic models with a dichotomous outcome.
Conclusions
The present simulation study indicates, in accordance with earlier findings, that a method where the effect of a predictor (X) on an outcome at follow-up (Y1) is calculated while adjusting for the outcome at baseline (Y0) can give spurious findings, especially when there is a strong correlation between X and Y0 and when the test–retest correlation between Y0 and Y1 is relatively weak.
Supplemental Material
Supplemental material, ConBase_Script for Predicting the Effect of a Predictor When Controlling for Baseline by Kimmo Sorjonen, Bo Melin and Michael Ingre in Educational and Psychological Measurement
Supplemental material, dforg for Predicting the Effect of a Predictor When Controlling for Baseline by Kimmo Sorjonen, Bo Melin and Michael Ingre in Educational and Psychological Measurement
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material: Supplemental material for this article is available online.
References
- Cronbach L. J., Furby L. (1970). How we should measure “change” - or should we? Psychological Bulletin, 74, 68-80.
- Dearborn P. J., Robbins M. A., Elias M. F. (2018). Challenging the “jolly fat” hypothesis among older adults: High body mass index predicts increases in depressive symptoms over a 5-year period. Journal of Health Psychology, 23, 48-58.
- Eberhart N. K., Hammen C. L. (2006). Interpersonal predictors of onset of depression during the transition to adulthood. Personal Relationships, 13, 195-206.
- Eriksson K., Häggström O. (2014). Lord’s paradox in a continuous setting and a regression artifact in numerical cognition research. PLoS One, 9, e95949.
- Gimeno D., Kivimäki M., Brunner E. J., Elovainio M., De Vogli R., Steptoe A., . . . Ferrie J. E. (2009). Associations of C-reactive protein and interleukin-6 with cognitive symptoms of depression: 12-Year follow-up of the Whitehall II study. Psychological Medicine, 39, 413-423.
- Horwitz A. G., Berona J., Czyz E. K., Yeguez C. E., King C. A. (2017). Positive and negative expectations of hopelessness as longitudinal predictors of depression, suicidal ideation, and suicidal behavior in high-risk adolescents. Suicide and Life-Threatening Behavior, 47, 168-176.
- Huck S. W., McLean R. A. (1975). Using a repeated measures ANOVA to analyze the data from a pretest-posttest design: A potentially confusing task. Psychological Bulletin, 82, 511-518.
- Johnson S. L., Meyer B., Winett C., Small J. (2000). Social support and self-esteem predict changes in bipolar depression but not mania. Journal of Affective Disorders, 58, 79-86.
- Kim J., Noh J.-W., Park J., Kwon Y. D. (2014). Body mass index and depressive symptoms in older adults: A cross-lagged panel analysis. PLoS One, 9, e114891.
- Lord F. M. (1967). A paradox in the interpretation of group comparisons. Psychological Bulletin, 68, 304-305.
- Moos R. H., Schutte K. K., Brennan P. L., Moos B. S. (2005). The interplay between life stressors and depressive symptoms among older adults. Journals of Gerontology Series B, Psychological Sciences and Social Sciences, 60, P199-P206.
- O’Connor T. G., Cheng H., Dunn J., Golding J. (2005). Factors moderating change in depressive symptoms in women following separation: Findings from a community study in England. Psychological Medicine, 35, 715-724.
- Paterniti S., Niedhammer I., Lang T., Consoli S. M. (2002). Psychosocial factors at work, personality traits and depressive symptoms: Longitudinal results from the GAZEL study. British Journal of Psychiatry, 181, 111-117.
- Pearl J. (2016). Lord’s paradox revisited—(Oh Lord! Kumbaya!). Journal of Causal Inference, 4(2). doi: 10.1515/jci-2016-0021
- Petscher Y., Schatschneider C. (2011). A simulation study on the performance of the simple difference and covariance-adjusted scores in randomized experimental designs. Journal of Educational Measurement, 48, 31-43.
- R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; Retrieved from http://www.R-project.org/
- Rogosa D., Brandt D., Zimowski M. (1982). A growth curve approach to the measurement of change. Psychological Bulletin, 92, 726-748.
- Rogosa D. R., Willett J. B. (1983). Demonstrating the reliability of the difference score in the measurement of change. Journal of Educational Measurement, 20, 335-343.
- Sorjonen K., Farioli A., Hemmingsson T., Melin B. (2017). Refractive state, intelligence, education, and Lord’s paradox. Intelligence, 61, 115-119.
- van Breukelen G. J. P. (2013). ANCOVA versus CHANGE from baseline in nonrandomized studies: The difference. Multivariate Behavioral Research, 48, 895-922.
- Vanderlind W. M., Beevers C. G., Sherman S. M., Trujillo L. T., McGeary J. E., Matthews M. D., . . . Schnyer D. M. (2014). Sleep and sadness: Exploring the relation among sleep, cognitive control, and depressive symptoms in young adults. Sleep Medicine, 15, 144-149.
- Zumbo B. D. (1999). The simple difference score as an inherently poor measure of change: Some reality, much mythology. In Thompson B. (Ed.), Advances in social science methodology (Vol. 5, pp. 269-304). Greenwich, CT: JAI Press.