Abstract
The purpose of the present study was to apply the methodology developed by Raykov on modeling item-specific variance to the measurement of internal consistency reliability with longitudinal data. Participants were a randomly selected sample of 500 individuals who took a professional qualifications test in Saudi Arabia on four different occasions. Data were analyzed by use of confirmatory factor analysis, and item error variance was corrected for item specificity. Reliability was estimated with the composite index omega. Results indicated that the initially low and unacceptable levels of internal consistency reliability approached acceptable levels after accounting for item-specific variance. Findings were verified by testing whether the difference estimates of internal consistency reliability deviated from a zero-mean distribution using 10,000 replicated samples, assuming a known (symmetric) or unknown (asymmetric) population distribution of the difference reliability coefficients. Percentage improvement indices for the reliability estimates were also computed along with their 95% confidence intervals. Two appendices provide annotated Mplus syntax files for future use.
Keywords: internal consistency reliability, specific error variance, item specificity, omega composite reliability, error variance
Reliability is one of the most important qualities of every psychometric tool, including tests, scales, inventories, and so on (Furr & Bacharach, 2014). Attempts to improve the reliability of our measurements are always worthwhile and should be a goal of measurement scientists, since every attempt to measure a construct via a measure contains “error.” Error can be considered as those factors that intervene in an assessment process and influence a person’s score beyond his or her true ability. In classical test theory (CTT), a true score is defined as the hypothetical expected value of repeated measures on the same individual, and the error score is simply the difference between the observed and true scores. Thus, for any individual i who is measured on occasion j, his or her observed score Yij is the sum of his or her true score Tij on that occasion and the error (of measurement) Eij on that occasion:
| $Y_{ij} = T_{ij} + E_{ij}$ | (1) |
There are two types of error sources: systematic and random (Schmidt, Le, & Ilies, 2003). Systematic error is the difference between an observed value and the true value due to all causes other than sampling variability (Waller, Thompson, & Wenk, 2000). It is related to all predictable sources of error that occur in an assessment process and always has the same direction and the same magnitude (Portney & Watkins, 2000). Systematic error may occur for a variety of reasons (Wright & Feinstein, 1992), including (a) the characteristics of the tool, (b) the measurement process, (c) the participants’ characteristics, and (d) combinations of the three main sources (Campbell & Russo, 2001). This type of error has been termed transient error (Becker, 2000). As Schmidt et al. (2003) stated, “Transient errors are defined as longitudinal variations in responses to measures that are produced by random variations in respondents’ psychological states across time” (p. 206). Because systematic error is related to all predictable sources of error, it is usually more closely related to the concept of validity (Portney & Watkins, 2000).
Random error is neither predictable nor consistent in terms of direction and magnitude. It occurs for a variety of reasons, including environmental conditions (e.g., noise, uncomfortable seats, low lighting, etc.), limitations of the instrument (e.g., lack of clarity on specific items, leading questions, etc.), variations in the procedure (e.g., wrong directions) or idiosyncratic characteristics of the individual (e.g., headache, low motivation, distraction, etc.). Due to its random nature, it does not have consistent effects across the entire sample, and it is assumed that it is distributed normally across the entire sample. This means that if we could see all of the random errors in a distribution, they would have to sum to 0. It is worth noting that because random sources of error tend to contribute more than systematic error to the total error (Atkinson & Nevill, 1998), reliability estimates tend to focus more on determining the degree of random error in measurements (Portney & Watkins, 2000).
According to Perron and Gillespie (2015), error is closely related to reliability, since reliability is the degree to which measurement is free of error. Being “free of error” means that when a reliable measure is used in an assessment process and differences in scores over two or more occasions are detected, these differences reflect a change in ability rather than measurement error. Thus, the more error is introduced into the observed score, the lower the reliability, and vice versa. However, it is almost impossible to achieve error-free measures, and as a result, all measurements (M) obtained from a psychometric tool will differ from their true values to some degree (Rothstein, 1985). This model has been extensively described in CTT and has been termed true score theory (Nunnally & Bernstein, 1994). According to this notion, all obtained scores are the sum of a true value (T) and error (E) (see Equation 1).
We know from the early work on factor analysis (e.g., Spearman, 1904; Thurstone, 1947) that the variance of an item or a scale can be decomposed into three different parts: common, specific,¹ and measurement error, although not all authors agree with that conceptualization (see McCrae, 2015, for a discussion). In a classic factor analytic model, the items in each factor share common variance, which is due to the conceptual overlap among the items that constitute the factor. However, there is also a component of variance for each item that does not relate to the content of the rest of the items (because there is no conceptual overlap among the items). This variance is called unique variance. The unique variance for each item is split into two parts: random error variance and specific variance. This specific variance is also true variance, as is common variance, with the only difference being that it is not shared with other variables (Bentler, 2017). Thus, true variance is the sum of common variance and specific variance, whereas unique variance is a summative term that includes item-specific variance not accounted for by the sampling of item content or time and is thus conventionally considered error variance (Robinson, Shaver, & Wrightsman, 1991).
In most psychometric procedures, such as exploratory factor analysis, confirmatory factor analysis, structural equation modeling, and internal consistency reliability, specificity variance is treated as part of error variance rather than as part of true variance (Raykov & Marcoulides, 2016; Raykov, Marcoulides, & Gabler, 2017). For example, reliability is defined as the ratio of true to total variance. However, in all the well-known methods of estimating internal consistency reliability (e.g., alpha, omega, maximal), the models compute the ratio of common to total variance. As explained earlier, true variance is the sum of common variance and specific variance, so by omitting a distinct component of true variance, namely specificity variance, we obtain lower-bound reliability estimates and introduce a downward bias into our reliability estimation (see Bentler, 2017).
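The downward bias described above can be illustrated with a small numerical sketch. The variance components used here (0.40 common, 0.20 specific, 0.40 random error) are hypothetical values chosen only for illustration, and the function name is likewise an assumption rather than anything taken from the study.

```python
def reliability(common: float, specific: float, error: float,
                specific_is_true: bool) -> float:
    """Ratio of 'true' to total variance for one variance decomposition.

    When specific_is_true is False, specific variance is counted as error,
    which mirrors the conventional common-to-total ratio.
    """
    total = common + specific + error
    true_part = common + (specific if specific_is_true else 0.0)
    return true_part / total


# Hypothetical variance components (not estimates from the study).
conventional = reliability(0.40, 0.20, 0.40, specific_is_true=False)
corrected = reliability(0.40, 0.20, 0.40, specific_is_true=True)
print(round(conventional, 2), round(corrected, 2))
```

With these made-up components, treating specific variance as error understates the true-to-total ratio by exactly the share of specific variance in the total.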
Item Error Variance and Item-Specific Variance
Internal consistency reliability, being only one form of reliability estimation, and the use of Cronbach’s alpha or other related indices suggest that the relationship between items on a single occasion suffices to evaluate scale reliability. This assumption has recently been challenged, especially with regard to item-specific information in measurements over time (Raykov, 2007). The proposition that item-specific information should comprise a meaningful component of measurement and not random error is not new (Bentler, 1968; Cronbach, 1951; Smith, 1974). Cronbach himself was well aware of the presence of specific variance in his seminal 1951 paper when he stated that alpha “treats the specific content of an item as error, but the coefficient of precision treats it as part of the thing being measured” (p. 307), but he did not specifically account for it (see also McCrae, 2015).
When looking at item relationships versus items’ lack of relatedness, the former constitute the assessment of internal consistency. So, the greater the item heterogeneity, the lower the internal consistency reliability, and vice versa. However, when looking at an instrument’s reliability over time (in test–retest or longitudinal designs), item heterogeneity actually contributes meaningful variance and should complement true variance rather than error variance, as it does in one-time assessments. In other words, item-specific variance is expected to lower internal consistency reliability in cross-sectional designs, but because the same item-specific variance will be present in repeated measures designs, reliability of measurements over time will be enhanced. This proposition challenges the traditional view of internal consistency reliability estimation, in which reliability is a function of true score plus some form of random error that is expected to be uncorrelated across items, and it suggests a modification to the effect that reliability assessment should also account for item-specific variance (Sv) as part of true score variance (Tv), as shown below:
| (2) |
The above recommendation involves splitting the error term into two components: true random error and a variance component that is specific to the item (and thus should be construed as part of true variance, not error variance).
Figure 1 displays two Venn diagrams that provide a visual representation of common, unique, and specific item variances in longitudinal designs. The left panel includes item-specific variances and is therefore described in more detail. The three ellipses in the lower part of each panel refer to items within a one-factor model having three indicators (Y1, Y2, Y3) at the first time point, and the mirrored ellipses above them refer to the same model at a second time point. Overlapping areas refer to either common variance shared by items or item-specific variance shared by an item over two time points; nonoverlapping areas refer to true error variances.² When looking at the left panel and the first time point, there are three common variance areas: those shared between Items 1 and 2 [CVT1Y12],³ Items 2 and 3 [CVT1Y23], and Items 1, 2, and 3 [CVT1Y123]. The unshaded areas reflect error variances for Item 1 [Y1T1E],⁴ Item 2 [Y2T1E], and Item 3 [Y3T1E]. Of great interest for the purposes of the present study is the modeling of item-specific information for Items 1 [Y1SV], 2 [Y2SV], and 3 [Y3SV] as overlapping areas of the items due to time only (same content). Labeling follows the same conventions for Time 2 data. The right panel is identical to the left one except that it omits item-specific variances; it thus depicts the traditional way of measuring internal consistency reliability, which critically understates true reliability (McCrae, 2015) by ignoring item-specific information.
Figure 1.
Left panel: Venn diagram demonstrating item true variance, item error variance, and item-specific variance (a component of true variance) for a three-item latent construct across two time points (shown with three indicators and two time points for parsimony and clarity). Variance estimates are divided into “True variance estimates,” denoted in the figure as common or shared variance between items (two or more), and error variances, which are item components not shared with any other item. The additional component, initially thought of as error variance, is now shown as “Item-specific variance” and describes the part of item information that is common between an item at Time 1 and the same item at Time 2, thus accounting for true variance due to content. Accounting for this component, earlier thought of as unknown variance, is the primary thesis of modeling item-specific variances using item-specific information (e.g., Gabler & Raykov, 2017) or specificity demonstrated using correlations with relevant auxiliary variables (e.g., Bentler, 2017). The horizontal dashed line separates Time 1 from Time 2 measurements. Right panel: The same one-factor model measured at two time points without item-specific variance estimates. It represents the traditional way of assessing error variance and reliability estimates within classical test theory.
Goals of the Present Study
The purpose of the present study was to illustrate the modeling of item-specific variance and the estimation of internal consistency reliability after accounting for item specificity in longitudinal designs, using real rather than simulated data, as per Raykov’s (2007) and Raykov and Marcoulides’s (2016, 2017) recommendations. The importance lies in the fact that undetected forms of variance that currently “load” onto error variance may in fact need to be modeled as part of true variance (Burt, 1976). Furthermore, the study attempted to contribute tests of significance for the difference between the two composite reliability coefficients using bootstrapping (with and without assuming knowledge of the underlying distributions of the difference statistics). Last, percentage improvement indices (PII) were developed and furnished with 95% confidence intervals to evaluate the magnitude of improvement in standardized form, as point estimates of reliability likely differ from population values (Kelley & Cheng, 2012; Padilla & Divers, 2015). All of the above were modeled by use of the popular statistical package Mplus (8.2), and annotated routines are shared in Appendices A and B for ease of use.
Methodology
Participants and Procedures
Participants were 500 individuals who took a teacher professional qualifications test over four time periods as part of the licensure process in Saudi Arabia. The tests were administered during 4 consecutive years by the National Center for Assessment (NCA) in Higher Education in Saudi Arabia. The NCA administers over 300,000 tests over the course of a year, and many participants take the test more than once. The current sample of 500 participants is a random sample from a much larger number of examinees (more than 100,000). Random sampling was employed so that inferential statistics would not be greatly inflated due to excessive power.⁵ Over the 4-year period, person scores were equated by use of the D-score method (Dimitrov, 2016, 2018), as participants were exposed to parallel forms of the same instrument. There were 108 males (21.6%) and 392 females (78.4%). Most of the participants were full-time university students (93.9%), with only 6.1% studying on a part-time basis. They came from 28 public and private institutions representing all 13 regions of the Kingdom of Saudi Arabia. No other demographic information was available to the authors because personal information is not disclosed. The study met the ethical guidelines of the relevant professional authority in Saudi Arabia.
Measure: Professional Skills and Competencies Test
The current measure contained four domains, namely, Professional knowledge (36 items), Enhancing learning (18 items), Supporting learning (10 items), and Professional responsibility (11 items). Item level content was not available to avoid item exposure. For mock items one can visit the website of the authority to be fully informed of the test and the administration procedures (https://qiyas.sa/en/Pages/default.aspx). Items were originally dichotomously coded but for the present purposes of the study and to evaluate internal consistency reliability of the full instrument, scale aggregate estimates were used and scores over time were equated by use of the D-scoring method.
Data Analyses
Several analytical procedures are described in this section, ancillary and focal in relation to the goals of the study. Ancillary analyses involved the confirmatory factor analysis model, ensuring that the data are well described by the model,⁶ and applications of the measurement invariance protocol, with the latter being a prerequisite step to assessing item-specific variance over time. Other additions involved furnishing 95% confidence intervals of improved reliability estimates, difference reliability coefficients (DRCs), and percentage improvement coefficients (Raykov & Zinbarg, 2011). The level of significance was set to 1% due to the excessive levels of power associated with a sample size of 500 participants in order to avoid potential Type-I errors.
Longitudinal Measurement Invariance
The concept of measurement invariance pertains to the statistical property of constancy of measurement over time or across groups, with the former being of interest in the present study (Byrne, 1994; Meredith, 1993). In other words, in the process of testing increasingly more restricted models, one evaluates whether the measurement of a construct(s) over time is equivalent in terms of the instrument’s simple structure (configural), the items’ functionality (metric), the items’ intercept levels (scalar), the items’ errors of measurement (strict-error variance equivalence⁷), and the latent variable(s) variance over time (strict-factor variance equivalence⁸).⁹ The protocol can be extended to multiple groups and multiple occasions (see the extensive protocol of Marsh, Parker, & Morin, 2016). The above protocol was applied for a full exposition of measurement and structural invariance evaluation, although for the purposes of modeling internal consistency reliability with item specificity, meeting scalar invariance suffices.
Internal Consistency Reliability: Composite Reliability Omega With and Without Item-Specific Variance
Omega composite reliability¹⁰ (McDonald, 1970, 1999; Raykov, 1997), despite its similarity to Cronbach’s alpha (Cronbach, 1951), possesses the advantage of allowing for heterogeneous item-latent variable correlations, thus being suitable for congeneric measures in which items contribute differential information to the latent trait (as well as error variances¹¹) (Joreskog, 1971). It is estimated as follows:
| $\omega = \dfrac{\left(\sum_{i} \lambda_i\right)^2}{\left(\sum_{i} \lambda_i\right)^2 + \sum_{i} \mathrm{Var}(e_i)}$ | (3) |
With λi being the factor loading of Item i and ∑Var(e) the sum of the respective error variances of the items. This formula ignores the likelihood that a correlated structure in the residuals is present, in which case reliability needs to be adjusted accordingly (Westfall, Henning, & Howell, 2012).
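As a minimal sketch of the usual composite-reliability form of omega (squared sum of loadings over squared sum of loadings plus summed error variances), the following function assumes a single factor with uncorrelated residuals. The loadings are hypothetical standardized values, not estimates from the present data, and the `factor_variance` argument is an added convenience for models in which the factor variance is not fixed to 1.

```python
from typing import Sequence


def omega(loadings: Sequence[float],
          error_variances: Sequence[float],
          factor_variance: float = 1.0) -> float:
    """Composite reliability omega for a one-factor model.

    Assumes uncorrelated residuals; with correlated residuals the
    denominator would need additional covariance terms.
    """
    if len(loadings) != len(error_variances):
        raise ValueError("one error variance is required per loading")
    common = factor_variance * sum(loadings) ** 2
    return common / (common + sum(error_variances))


# Hypothetical standardized loadings for four indicators.
loadings = [0.5, 0.5, 0.5, 0.5]
error_variances = [1.0 - lam ** 2 for lam in loadings]  # 0.75 each
omega_classic = omega(loadings, error_variances)
print(round(omega_classic, 3))
```

Because the loadings enter through their sum, unequal loadings are handled without any change to the function, which is the property that distinguishes omega from alpha for congeneric measures.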
The information reported herein on composite reliability and item-specific variance is a summary of the relevant review and proposals put forth by Raykov and Tisak (2004), Bentler (2017), and Raykov (2007). The major idea behind the present conceptualization is that measurement error is composed of item-specific error plus true random error, as in the following equation:
| $\varepsilon_{ij} = s_{ij} + e_{ij}$ | (4) |
which states that the error term ε of item i on occasion j is the sum of “pure” measurement error e (Gorsuch, 1983) and measurement error attributed to item-specific information s. Based on CTT, reliability is estimated as:
| (5) |
However, one needs to separate from the numerator the part of error variance that is specific to the item (i.e., Var(sij)). Assuming invariant estimates of item-specific variance over time:
| (6) |
With SP being the item-specific variance of Item 1 through q and measurement occasion j1 through jk. An improved estimate of internal consistency reliability based on omega can be obtained by:
| (7) |
With δj being the item-specific variance. The reader is advised to consult the original source for detailed information and derivations. For extensions to maximal reliability estimations see Raykov (2005) and Hancock and Mueller (2001).
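The logic of the adjustment can be sketched numerically. The sketch below is one plausible reading of the verbal description (item-specific variance is moved from the error part to the true part of the composite while total variance is unchanged); it is not a transcription of Equation 7, and the function name, loadings, and variance values are all hypothetical.

```python
from typing import Sequence


def omega_with_specificity(loadings: Sequence[float],
                           residual_variances: Sequence[float],
                           specific_variances: Sequence[float]) -> float:
    """Omega with item-specific variance counted as true variance.

    Assumption: each residual variance splits into a specific part and a
    pure-error part, and specific factors are uncorrelated with the common
    factor and with each other.
    """
    for resid, spec in zip(residual_variances, specific_variances):
        if not 0.0 <= spec <= resid:
            raise ValueError("specific variance must lie within the residual")
    common = sum(loadings) ** 2            # factor variance fixed to 1
    total = common + sum(residual_variances)
    return (common + sum(specific_variances)) / total


# Hypothetical values: four items, residual variance 0.75, of which 0.20
# is assumed to be item-specific.
lam = [0.5, 0.5, 0.5, 0.5]
resid = [0.75, 0.75, 0.75, 0.75]
spec = [0.20, 0.20, 0.20, 0.20]
omega_unadjusted = sum(lam) ** 2 / (sum(lam) ** 2 + sum(resid))
omega_adjusted = omega_with_specificity(lam, resid, spec)
print(round(omega_unadjusted, 3), round(omega_adjusted, 3))
```

Under this reading the adjusted coefficient can never fall below the unadjusted one, because the numerator gains a nonnegative term while the denominator stays fixed.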
Difference Reliability Coefficients, Percentage Improvement Indices, and 95% Confidence Intervals
As several researchers have noted, one can compare and contrast reliability coefficients to probe for significant differences between them and/or estimate approximate confidence intervals for the same purpose (Deng & Chan, 2017; McDonald, 1999; Raykov, 2009).
| (8) |
Or in other words test the model:
| (9) |
Which reflects the composite reliability estimate using item specific information (i.e., Equation 7) minus the classical formula of composite reliability in which the error variance term includes item-specific variance as a form of error. Similarly, transforming the DRC to the percentage scale as shown below:
| (10) |
can provide for a standardized form evaluation of the magnitude of reliability improvement due to accounting for item specificity and may represent a useful effect size index of the need to model item-specific variance. This proposition goes along with Bentler’s (2017, p. 533) recommendation to report improvements in reliability estimation in percentage form.
There are several methods that are suitable for furnishing confidence intervals, namely, parametric bootstrapping (Efron & Tibshirani, 1993; Goldstein, 2003; Kuk, 1995), the delta method (Oehlert, 1992; Raykov, 2002; Padilla & Divers, 2013a, 2013b), the permutation method that also uses the bootstrap distribution (Bishara & Hittner, 2012; Pituch & Stapleton, 2008), Bayesian methodologies (Yuan & MacKinnon, 2009), asymptotically distribution free methods (ADF; Maydeu-Olivares, Coffman, & Hartmann, 2007), and others. In the present study, we use parametric bootstrapping, the delta method, and an ADF method as a means of also comparing and contrasting the different estimation methods (Raykov & Shrout, 2002). Researchers have also employed various software such as Mplus and R (e.g., Dunn, Baguley, & Brunsden, 2014; Kelley, 2007; Kelley & Lai, 2010). The parametric bootstrap involves the assumption of normally distributed parameter estimates in the population, which may or may not hold for the DRC (with a lower bound value of zero) and similarly for the PII recommended herein, but we implemented it here for consistency with the relevant literature, along with two additional approaches that did not assume normality (Kelley, 2005). Thus, confidence intervals were estimated using the monotone transformation approach (Browne, 1984; Oehlert, 1992) illustrated in Raykov, Marcoulides, and Akaeze (2017) as follows. First, the DRC or PII indices were logit transformed so that the transformed estimate is approximately normally distributed. Then a confidence interval can be furnished of the form (Casella & Berger, 2002; Raykov & Marcoulides, 2016; Raykov, Rodenberg, & Narayanan, 2015):
| (11) |
With the logit transformation of DRC or PII (shown for DRC below) being:
| (12) |
And its estimate of standard error:
| (13) |
With zα/2 being the standard normal critical value that corresponds to the two-sided significance level for a given level of confidence (95% or 99%). This asymptotic theory estimation produces “approximately correct” intervals as sample sizes approach infinity (Kelley & Cheng, 2012).
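The monotone (logit) transformation approach can be sketched as follows. The sketch assumes the standard delta-method standard error for a logit-transformed proportion-like estimate, namely SE divided by the estimate times one minus the estimate; the function name is hypothetical, and the inputs are the point estimate and standard error reported for the first row of Table 3. The output is not expected to reproduce the tabled intervals exactly, since those were obtained with the authors' own routines and unrounded inputs.

```python
import math
from statistics import NormalDist


def logit_ci(estimate: float, se: float, level: float = 0.95):
    """Asymmetric CI for a coefficient bounded in (0, 1) via the logit.

    Steps: transform to the logit scale, build a symmetric interval there
    using a delta-method standard error, then back-transform the limits.
    """
    if not 0.0 < estimate < 1.0:
        raise ValueError("estimate must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    kappa = math.log(estimate / (1.0 - estimate))
    se_kappa = se / (estimate * (1.0 - estimate))
    lo, hi = kappa - z * se_kappa, kappa + z * se_kappa
    return 1.0 / (1.0 + math.exp(-lo)), 1.0 / (1.0 + math.exp(-hi))


lower, upper = logit_ci(0.559, 0.020)
print(round(lower, 3), round(upper, 3))
```

The back-transformed limits always stay inside the unit interval, which is the practical reason for preferring this interval to a symmetric one when an estimate lies near 0 or 1.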
We also used the ADF method, which involves estimating the bootstrap distribution of the DRC and PII coefficients using the variance estimates and distribution shape of the items (Browne, 1984). The advantage of the ADF methodology is that normality of the estimates is not presumed (Cheung, 2009; Kelley & Pornprasertmanit, 2016; Olsson, Foss, Troye, & Howell, 2000) but is estimated based on the data.
Results
Simple Structure of Professional Test Measure: Measurement Model
A one-factor model (see Figure 2) was fit to the data and provided excellent model fit, as denoted by a nonsignificant chi-square statistic, χ2(2) = 0.395, p = .821, pointing to the presence of exact model fit (MacCallum, Browne, & Sugawara, 1996). Descriptive fit indices as well as residual values also pointed to the presence of excellent model fit for the current one-factor model (comparative fit index [CFI] = 1.00; Tucker–Lewis index [TLI] = 1.00; root mean square error of approximation [RMSEA] < 0.001; RMSEA CI [0.000, 0.075]). Subsequent models tested the invariance of the one-factor model over time using a restrictive protocol of measurement invariance.
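For readers who wish to check an RMSEA point estimate against a reported chi-square, a minimal sketch follows. It assumes the common textbook formula with N − 1 in the denominator; software packages may use N instead or apply scaling corrections, so small discrepancies from printed output are to be expected. The function name is hypothetical.

```python
import math


def rmsea(chi_square: float, df: int, n: int) -> float:
    """RMSEA point estimate from a chi-square statistic (N - 1 convention)."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# One-factor measurement model reported above: chi-square(2) = 0.395, N = 500.
rmsea_one_factor = rmsea(0.395, 2, 500)
# Configural invariance model reported in the next section:
# chi-square(94) = 108.762, N = 500.
rmsea_configural = rmsea(108.762, 94, 500)
print(rmsea_one_factor, round(rmsea_configural, 3))
```

Whenever the chi-square statistic is smaller than its degrees of freedom, the estimate is truncated at zero, which is consistent with the reported RMSEA < 0.001 for the one-factor model.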
Figure 2.

Measurement one-factor model for the assessment of professional skills and competencies using a single measurement occasion.
Measurement Invariance of Professional Skills Measure Over Time
Figure 3 displays the hypothesized model, in which four factors (η1, η2, η3, η4), one per measurement occasion, were each defined by four indicators. Furthermore, item specificity was modeled using four latent variables (S1, S2, S3, S4) with the goal of defining the variance estimates attributed to each item. This model (see Figure 4) provided excellent fit to the data, producing a nonsignificant chi-square value and acceptable fit indices (see Table 1, scalar model). Table 1 describes the findings on measurement and structural invariance using a series of nested models. Models were contrasted by use of a loglikelihood difference test. First, a configural model was fit to the data and provided acceptable model fit, χ2(94) = 108.762, p = .142; CFI = 0.989; TLI = 0.986; RMSEA = 0.018; RMSEA CI [0.000, 0.031], followed by a fixed-slopes (metric invariance) model, which fit the data equally well, χ2(103) = 117.811, p = .151; CFI = 0.989; TLI = 0.987; RMSEA = 0.078; RMSEA CI [0.000, 0.030]. When fitting the fixed-slopes, fixed-intercepts model (scalar invariance), results indicated that the latter model was not inferior to the metric invariance model, χ2(112) = 129.623, p = .122; CFI = 0.987; TLI = 0.986; RMSEA = 0.018; RMSEA CI [0.000, 0.030], and thus both prerequisite assumptions for testing longitudinal reliability (equal factor loadings and equal intercepts) were met. Table 1 shows the difference tests, computed from the loglikelihoods and scaling correction factors in relation to the number of free parameters for each model; none was significant. Although not required, two more restrictive models were tested, one positing equal factor variances (M4) and one fixing item residual variances to be equivalent (M5). Both of these models provided acceptable model fit and were not inferior to the scalar model. Consequently, the prerequisite assumption of equal slopes and equal intercepts over time was met with these data.
Figure 3.
Hypothesized longitudinal model for the assessment of professional skills and competencies using a one-factor model with four indicators across four time points. Factor loadings are shown as lambda estimates (λ), item error variances as epsilon (ε), and item thresholds as tau (τ). The latent variables η1 to η4 represent the one factor solution over the four measurement occasions. The latent variables S1-S4 represent item specific, true, variances. Measurement model factors were freely correlated. Specific factors are hypothesized to have zero relationships with the measurement model latent factor and with each other as well (Raykov, 2007; Raykov & Marcoulides, 2016, 2017).
Figure 4.
Estimated longitudinal one-factor model with four indicators. For clarity, item error variances are not shown. Estimates in the figure are standardized.
Table 1.
Longitudinal Measurement Invariance: Tests of Competing Nested Models.
| Model (measurement invariance comparison) | H0 LL | H0 LL scale factor | No. of free parameters | Difference in LL × −2 | Scaling correction for difference test | Scaled difference in −2LL | df difference | p |
|---|---|---|---|---|---|---|---|---|
| 2. Full metric invariance model (M2) | −15821.789 | 0.9997 | 49 | – | – | – | – | – |
| 1. Configural model (M1) | −15817.366 | 0.9965 | 58 | – | – | – | – | – |
| Model comparison (M2 vs. M1) | – | – | – | 8.846 | 0.9791 | 9.035 | 9 | .4340 (ns) |
| 3. Scalar invariance model (M3) | −15827.662 | 1.0007 | 40 | – | – | – | – | – |
| 2. Full metric invariance model (M2) | −15821.789 | 0.9997 | 49 | – | – | – | – | – |
| Model comparison (M3 vs. M2) | – | – | – | 11.746 | 0.9953 | 11.802 | 9 | .2247 (ns) |
| 4. Strict (factor variance equivalence) (M4) | −15828.381 | 0.9917 | 37 | – | – | – | – | – |
| 3. Scalar invariance model (M3) | −15827.662 | 1.0007 | 40 | – | – | – | – | – |
| Model comparison (M4 vs. M3) | – | – | – | 1.438 | 1.1117 | 1.294 | 3 | .7307 (ns) |
| 5. Strict (residual variance equivalence) (M5) | −15834.073 | 0.9917 | 25 | – | – | – | – | – |
| 3. Scalar invariance model (M3) | −15827.662 | 1.0007 | 40 | – | – | – | – | – |
| Model comparison (M5 vs. M3) | – | – | – | 12.822 | 1.0157 | 12.624 | 15 | .6313 (ns) |
Note. LL = log likelihood; H0 = null hypothesis; df = degrees of freedom; ns = nonsignificant.
p < .01.
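The entries in the comparison rows of Table 1 follow the usual scaled loglikelihood difference computation for nested models estimated with scaling correction factors. A minimal sketch is given below; the function and argument names are hypothetical, and the p value is omitted because it requires a chi-square distribution function that is not in the Python standard library.

```python
def scaled_ll_difference(ll_nested: float, c_nested: float, p_nested: int,
                         ll_comparison: float, c_comparison: float,
                         p_comparison: int):
    """Scaled difference test from loglikelihoods and scaling factors.

    Returns the difference-test scaling correction, the scaled difference
    in -2LL, and the difference in degrees of freedom.
    """
    cd = ((p_nested * c_nested - p_comparison * c_comparison)
          / (p_nested - p_comparison))
    trd = -2.0 * (ll_nested - ll_comparison) / cd
    return cd, trd, p_comparison - p_nested


# Metric (M2, nested) versus configural (M1, comparison), values from Table 1.
cd, trd, df_diff = scaled_ll_difference(-15821.789, 0.9997, 49,
                                        -15817.366, 0.9965, 58)
print(round(cd, 4), round(trd, 3), df_diff)
```

The scaled difference is then referred to a chi-square distribution with `df_diff` degrees of freedom; the same call with the remaining model pairs yields the other comparison rows.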
Composite Reliability Estimation of Unidimensional Professional Skills Measure
Prior to estimating composite reliability accounting for item-specific variance, it was important to establish that there were nonzero amounts of item-specific variance. Table 2 shows those findings, with estimates of specific variance (SV) for Items 1 through 4 being significant at p < .001: Item 1 SV = 0.580, Z = 6.816, p < .001; Item 2 SV = 0.558, Z = 6.184, p < .001; Item 3 SV = 0.512, Z = 6.076, p < .001; Item 4 SV = 0.504, Z = 6.176, p < .001. These findings are graphically depicted in Figure 5, with a zero-mean distribution being modeled to provide a visual reference for the distributions of item-specific variances. As shown in the figure, the point estimates of the item-specific variance distributions were far to the right of the zero-mean distribution, supporting the alternative hypotheses that the amounts of specific variance were substantial.
Table 2.
Point Estimates, Standard Errors, and Symmetric and Asymmetric 95% Confidence Intervals of the Specificity Variance Estimates.
| Item | Specific variance | SE | Symmetric 95% CI | D-Asymmetric 95% CI | ADF 95% CI |
|---|---|---|---|---|---|
| Item 1 | 0.580*** | 0.085 | [0.413, 0.746] | [0.435, 0.772] | [0.413, 0.746] |
| Item 2 | 0.558*** | 0.090 | [0.381, 0.735] | [0.407, 0.765] | [0.373, 0.734] |
| Item 3 | 0.512*** | 0.084 | [0.347, 0.677] | [0.371, 0.706] | [0.347, 0.676] |
| Item 4 | 0.504*** | 0.082 | [0.344, 0.664] | [0.366, 0.693] | [0.336, 0.664] |
Note. D-Asymmetric = Delta method estimated confidence intervals; ADF = asymptotic distribution free; SE = standard error. 95% confidence intervals were estimated using simulated data with 10,000 participants.
*p < .05. **p < .01. ***p < .001.
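The Z statistics and symmetric intervals in Table 2 are consistent with ordinary Wald-type computations (estimate divided by its standard error, and estimate plus or minus a normal critical value times the standard error). A minimal sketch using the rounded Item 1 values from Table 2 follows; the function name is hypothetical, and results differ slightly from the tabled ones because the published standard errors are rounded.

```python
from statistics import NormalDist


def wald_summary(estimate: float, se: float, level: float = 0.95):
    """Wald z statistic, two-sided p value, and symmetric confidence interval."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(1.0 - (1.0 - level) / 2.0)
    z = estimate / se
    p_value = 2.0 * (1.0 - normal.cdf(abs(z)))
    return z, p_value, (estimate - z_crit * se, estimate + z_crit * se)


# Item 1 specific variance and standard error as printed in Table 2.
z_stat, p_value, (ci_low, ci_high) = wald_summary(0.580, 0.085)
print(round(z_stat, 2), round(ci_low, 3), round(ci_high, 3))
```

Because a variance cannot be negative, a symmetric interval of this kind is only a rough guide near zero, which is one motivation for the asymmetric and ADF intervals reported alongside it.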
Figure 5.
Bootstrap distributions of specific item variance estimates and their relationship to a simulated distribution of zero mean and variance estimates equal to the mean variance estimates of the specific item variance distributions. Note that there is no overlap between distributions either using the point estimates or the 95% confidence intervals. Dashed lines indicate point estimates of population distributions.
Furthermore, it was important to test the assumption of specificity constancy. To this end, a model with freely estimated specificity variance estimates over time was compared with a nested model in which those variance estimates were fixed to equality. When contrasting the fixed versus free specific variance models, results indicated that the restricted model was not associated with a significant decrement in model fit, difference χ2(12) = 23.488, p > .01 (critical value of 26.22 at an alpha level of .01). Consequently, the assumption of specificity constancy was met.
Composite reliability estimates across the four time points were equal to 0.559, 0.556, 0.558, and 0.530 (see Table 3, first four rows). By conventional standards in the literature, these estimates would be considered unacceptably low. Significant improvements in composite reliability, however, were evident after modeling item-specific variance as true variance, with composite reliability estimates rising to values of 0.688, 0.676, 0.688, and 0.688. These latter estimates may be considered borderline but nevertheless acceptable. Table 3 provides lower and upper bounds of those internal consistency estimates by use of symmetric and asymmetric 95% confidence intervals using the computational procedures described above.
Table 3.
Point Estimates, Standard Errors, and Symmetric and Nonsymmetric 95% Confidence Intervals of the Composite Reliability Estimates With and Without Item-Specific Variance Accounted for.
| Parameter | Omega reliability | SE | Symmetric 95% CI | D-Asymmetric 95% CI | ADF 95% CI |
|---|---|---|---|---|---|
| CRNISV Time 1 | 0.559*** | 0.020 | [0.520, 0.598] | [0.521, 0.600] | [0.520, 0.598] |
| CRNISV Time 2 | 0.556*** | 0.020 | [0.518, 0.595] | [0.518, 0.597] | [0.515, 0.595] |
| CRNISV Time 3 | 0.558*** | 0.020 | [0.520, 0.597] | [0.520, 0.599] | [0.519, 0.597] |
| CRNISV Time 4 | 0.530*** | 0.020 | [0.491, 0.569] | [0.492, 0.571] | [0.489, 0.569] |
| CRISV Time 1 | 0.688*** | 0.024 | [0.640, 0.735] | [0.643, 0.737] | [0.641, 0.735] |
| CRISV Time 2 | 0.676*** | 0.025 | [0.627, 0.725] | [0.629, 0.727] | [0.625, 0.725] |
| CRISV Time 3 | 0.688*** | 0.024 | [0.641, 0.735] | [0.643, 0.737] | [0.641, 0.735] |
| CRISV Time 4 | 0.688*** | 0.024 | [0.641, 0.735] | [0.643, 0.737] | [0.646, 0.736] |
Note. CRNISV = composite reliability without accounting for item-specific variance; CRISV = composite reliability accounting for item-specific variance; D-Asymmetric = delta method estimated confidence intervals; ADF = asymptotic distribution free; SE = standard error. 95% confidence intervals were estimated using simulated data with 10,000 participants.
*p < .05. **p < .01. ***p < .001.
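As an illustration of how the symmetric intervals in Table 3 relate to the point estimates and standard errors, the following Python sketch applies the usual normal-theory formula, estimate ± 1.96 × SE. It is a simplified check that assumes the rounded table values as inputs, so agreement with the published bounds can only be expected to within rounding; the helper name `wald_ci` is hypothetical.

```python
def wald_ci(estimate, se, z=1.96):
    """Symmetric normal-theory interval: estimate +/- z * SE."""
    return estimate - z * se, estimate + z * se


# Rounded Time 1 values taken from Table 3 (point estimate, standard error).
rows = {
    "Time 1, specificity ignored": (0.559, 0.020),
    "Time 1, specificity modeled": (0.688, 0.024),
}

intervals = {label: wald_ci(est, se) for label, (est, se) in rows.items()}

for label, (lower, upper) in intervals.items():
    print(f"{label}: [{lower:.3f}, {upper:.3f}]")
```

Small discrepancies in the third decimal are expected because the standard errors in the table are themselves rounded.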
Table 4 introduces tests of two additional indices that may further inform the modeling of item-specific variance. First, there are tests of the difference composite reliability (DCR) estimates, as per earlier recommendations on testing differences between reliability coefficients (Deng & Chan, 2017; Raykov, 2009). These estimates were nonzero according to both a Z test and inspection of the 95% confidence intervals. Across all tests and comparisons, zero was not included in any of the confidence intervals, suggesting that the improvement in composite reliability was unambiguously nonzero. These findings are further illustrated in Figure 6, which shows the point estimates and distributions of the DCRs at each time point. Relative to the zero-mean reference distribution on the left, it is clear that neither any point estimate nor any value in the 2.5% rejection area belongs to the core of the zero-mean distribution. Consequently, the improvement in composite reliability was unequivocal. Further evidence in Table 4 comes from the percentage improvement index (PII), which expresses the improvement in the reliability coefficient in percentage units. Those estimates ranged between 18% and 23% improvement after accounting for item-specific variance and greatly exceed earlier findings (e.g., Bentler, 2017, reported an 8% improvement).
Table 4.
Point Estimates, Standard Errors, and Symmetric and Asymmetric 95% Confidence Intervals of the Difference Composite Reliability Estimates After Accounting for Item-Specific Variance and Percentage Improvement Indices.
| Parameter | D-omega reliability | SE | Symmetric 95% CI | D-Asymmetric 95% CI | ADF 95% CI |
|---|---|---|---|---|---|
| DCR Time 1 | 0.129*** | 0.024 | [0.081, 0.176] | [0.090, 0.186] | [0.090, 0.186] |
| DCR Time 2 | 0.120*** | 0.025 | [0.071, 0.169] | [0.080, 0.181] | [0.080, 0.181] |
| DCR Time 3 | 0.130*** | 0.024 | [0.083, 0.177] | [0.091, 0.187] | [0.091, 0.187] |
| DCR Time 4 | 0.158*** | 0.024 | [0.111, 0.205] | [0.117, 0.213] | [0.117, 0.213] |
| PII Time 1 | 0.187*** | 0.029 | [0.130, 0.243] | [0.138, 0.253] | [0.138, 0.253] |
| PII Time 2 | 0.177*** | 0.030 | [0.118, 0.237] | [0.127, 0.247] | [0.127, 0.247] |
| PII Time 3 | 0.189*** | 0.028 | [0.133, 0.244] | [0.141, 0.253] | [0.141, 0.253] |
| PII Time 4 | 0.230*** | 0.027 | [0.177, 0.283] | [0.183, 0.290] | [0.183, 0.290] |
Note. DCR = difference estimate in composite reliability coefficients from ignoring or accounting for item-specific variance; PII = percentage improvement index over the four time points; D-Asymmetric = delta method estimated confidence intervals; ADF = asymptotic distribution free; SE = standard error.
*p < .05. **p < .01. ***p < .001.
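The DCR and PII point estimates in Table 4 can be reproduced from the composite reliability estimates in Table 3. The Python sketch below does so using the definitions given in Appendix A (DCR as the difference between the two coefficients, PII as one minus their ratio) and adds a simple Z ratio of each DCR to its standard error. Inputs are the rounded published values, so results agree with the table only to within rounding.

```python
# Composite reliability at Times 1-4, taken from Table 3.
omega_without = [0.559, 0.556, 0.558, 0.530]   # item specificity ignored
omega_with = [0.688, 0.676, 0.688, 0.688]      # item specificity modeled
dcr_se = [0.024, 0.025, 0.024, 0.024]          # DCR standard errors, Table 4

# Difference in composite reliability at each time point.
dcr = [w - wo for w, wo in zip(omega_with, omega_without)]
# Percentage improvement index: proportional gain relative to the new coefficient.
pii = [1 - wo / w for w, wo in zip(omega_with, omega_without)]
# Z ratio of each difference to its standard error.
z_ratio = [d / se for d, se in zip(dcr, dcr_se)]

for t, (d, p, z) in enumerate(zip(dcr, pii, z_ratio), start=1):
    print(f"Time {t}: DCR = {d:.3f}, PII = {p:.3f}, Z = {z:.2f}")
```

All four Z ratios exceed 3.29, the two-tailed critical value at the .001 level, in line with the significance markers reported in Table 4.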
Figure 6.
Bootstrap distributions of difference estimates between composite reliability coefficients prior to and after accounting for item specificity for each one of the four factors (F1 through F4). The distribution to the left represents a simulated distribution of zero mean and variance equal to the most commonly encountered variance estimate from the other distributions. None of the 95% confidence intervals of the difference omega distributions included zero, pointing to significantly enhanced composite reliability estimates once item-specific variance was accounted for. Dashed lines indicate point estimates of difference distributions.
Discussion
The purpose of the present study was to illustrate the modeling of item-specific error variance and the estimation of internal consistency reliability after accounting for item specificity in longitudinal designs, using real rather than simulated data, as per the recommendations of Raykov (2007) and Raykov and Marcoulides (2016, 2017). Furthermore, the study attempted to contribute tests of significance for the difference between the two composite reliability coefficients using bootstrapping (with and without assuming knowledge of the underlying distribution). Last, percentage improvement indices (PII) were developed and reported along with 95% confidence intervals to evaluate the magnitude of improvement in standardized form.
The most important finding was that the amount of specific item variance was statistically and practically significant and substantial, representing a union rather than an intersection of true variance components (McCrae, 2015). That is, inferential statistics and confidence intervals indicated that these estimates were significantly different from zero. Furthermore, they translated into an improvement of the reliability coefficients of 18% to 23%, a salient amount that moved the instrument from unreliable to internally consistent. The observed amounts are extremely large when one considers that specific item variance does not cumulate linearly as the number of scale components increases (Bentler, 2017), as is true of stochastic variance. The current conceptualization calls into question the original bifurcation of true and error variance, opening up new avenues for discovering underexplored sources of error variance that may further adjust the measurement of true score variance.
A second important finding is that tests of significance can be formed to evaluate the magnitude of item-specific variances and their usefulness for measurement purposes by modeling population distributions and estimating confidence intervals around those point estimates. The pioneering work of Raykov and Marcoulides (2016, 2017) on the delta method for estimating confidence intervals proves to be extremely useful for that purpose. In the present study, it was clear that none of the point estimates of the DCR or the percentage improvement indices fell within the region of acceptance of a zero-mean distribution. Thus, further evidence in support of modeling specificity variance was provided.
A third important finding relates to the implications of the present findings for test construction and scale evaluation. Instrument development should carefully include sources of specific variance, either in the form of item content (as in the present study) or using alternate content sources. To this end, Bentler (2017) recommended expanding the facets of content when designing an instrument and supplementing them with content from relevant constructs. However, one has to be cautious neither to confound the true content of the construct, as specified in its operational definition, nor to narrow the content of the instrument unduly.
Limitations and Future Directions
There are several limitations imposed by the current analytical methodology. For example, little is currently known about the estimation of reliability with categorical indicators or when data do not meet distributional assumptions. Furthermore, very little is known about the power of the statistical tests for the DCR or the newly recommended PII. Estimation of internal consistency reliability after teasing out method variance will also be important, as current estimates likely inflate reliability (McCrae, 2015). Last, there are no studies that have shown how to account for item-specific variance in complex designs. Consequently, more research is clearly needed to address those issues and complexities.
There are several avenues for new research in this intriguing line of work, which was initiated by Bentler (2017), Raykov and Marcoulides (2016, 2017), and Raykov (2007). First, the modeling of item-specific variance can be extended by supplementing the present approach with a set of auxiliary variables that may be predictive of additional amounts of unique variance, as suggested by Bentler (2017). Second, in the absence of exact scalar invariance, one can attain “approximate scalar invariance” by use of Bayesian priors with small variances that allow for minor deviations from zero in the intercept differences over time (Seddig & Leitgob, 2018); the idea of implementing Bayesian priors can be generalized to any level of invariance that does not hold (e.g., metric). The latter may be both a viable and reasonable proposition, as measurement error, particularly transient error, will be accounted for to a large extent by allowing intercepts to be equivalent over time within an expected margin of measurement error. Third, other sources of variance that are currently treated as error variance (e.g., response time patterns, approaches and strategies in test-taking, emotionality, motivation, etc.) can potentially be identified and accounted for in the measurement of internal consistency reliability. Last, the estimation of internal consistency reliability coefficients accounting for item specificity can be expanded to include Guttman’s λ coefficients and other indices that are currently not available in commercial software (see Bentler, 2017, Table 1) and can also be extended using corrective procedures for attenuation (Bentler, 2015). In this innovative line of research, examining sources of currently unique variance that are part of common variance will greatly enhance estimates of reliability and hence the utility and validity of these scores.
Appendix A
Mplus 8.2 annotated syntax file for model of Figure 3. See Raykov and Marcoulides (2016) for estimation of confidence intervals using the delta method by use of an R-function they developed for that purpose.
TITLE: Internal consistency reliability accounting for item specificity;
DATA: FILE IS <name of data file>; ! Preferred data file forms are .dat and .csv
VARIABLE: NAMES ARE Y1-Y16; ! Four indicators for 4 time points, 4x4=16
MODEL: F1 BY Y1 ! Item slope fixed to unity for identification
Y2 (L2) ! Item slope labeled so that it will be equivalent
Y3 (L3) ! over time as per metric invariance assumption
Y4 (L4); ! last item of factor at time 1 end in ;
F2 BY Y5 ! 2nd factor defines 2nd measurement point
Y6 (L2) ! Item slope labeled so that it will be equivalent
Y7 (L3) ! over time as per metric invariance assumption
Y8 (L4); ! last item of factor at time 2 end in ;
F3 BY Y9 ! 3d factor defines 3d measurement point
Y10 (L2) ! Item slope labeled so that it will be equivalent
Y11 (L3) ! over time as per metric invariance assumption
Y12 (L4); ! last item of factor at time 3 end in ;
F4 BY Y13 ! 4th factor defines 4th measurement point
Y14 (L2) ! Item slope labeled so that it will be equivalent
Y15 (L3) ! over time as per metric invariance assumption
Y16 (L4); ! last item of factor at time 4 end in ;
[Y1 Y5 Y9 Y13] (INT1); ! Item intercepts are equivalent over 4 times (scalar invariance)
[Y2 Y6 Y10 Y14] (INT2); ! Item intercepts are equivalent over 4 times (scalar invariance)
[Y3 Y7 Y11 Y15] (INT3); ! Item intercepts are equivalent over 4 times (scalar invariance)
[Y4 Y8 Y12 Y16] (INT4); ! Item intercepts are equivalent over 4 times (scalar invariance)
SP1 BY Y1* Y5 Y9 Y13 (D1); ! Factor defining specific variance of item 1 over time and invariance
SP2 BY Y2* Y6 Y10 Y14 (D2); ! Factor defining specific variance of item 2 over time and invariance
SP3 BY Y3* Y7 Y11 Y15 (D3); ! Factor defining specific variance of item 3 over time and invariance
SP4 BY Y4* Y8 Y12 Y16 (D4); ! Factor defining specific variance of item 4 over time and invariance
Y1-Y16 (E1-E16); ! Error variances as per CFA Model
F1-F4 (Fv1-Fv4); ! Factor variances of latent dimension over 4 time points
SP1-SP4@1; ! Item Specificity Factor Variances fixed to unity for identification
SP1 WITH SP2-SP4@0; ! Specific factors constrained to be uncorrelated with one another
SP2 WITH SP3-SP4@0; ! Specific factors constrained to be uncorrelated with one another
SP3 WITH SP4@0; ! Specific factors constrained to be uncorrelated with one another
SP1 WITH F1-F4@0; ! Specific factors constrained to be uncorrelated with the time-point factors
SP2 WITH F1-F4@0; ! Specific factors constrained to be uncorrelated with the time-point factors
SP3 WITH F1-F4@0; ! Specific factors constrained to be uncorrelated with the time-point factors
SP4 WITH F1-F4@0; ! Specific factors constrained to be uncorrelated with the time-point factors
[F1@0 F2-F4*]; ! Factor means free to estimate except the first one that is fixed to zero
MODEL CONSTRAINT: ! For defining new variables
NEW(OSR1 OSR2 OSR3 OSR4 ! Omega reliability coefficients accounting for item specificity
SPV1 SPV2 SPV3 SPV4 ! Estimating item-specific variances
DCRF1 DCRF2 DCRF3 DCRF4 ! Contrasting omega coeffs. with those accounting for specific variance
PII1 PII2 PII3 PII4); ! Percentage improvement indices
OSR1=((1+L2+L3+L4)^2*Fv1+D1^2+D2^2+D3^2+D4^2)
/((1+L2+L3+L4)^2*Fv1+D1^2+D2^2+D3^2+D4^2+E1+E2+E3+E4); ! Omega 1 reliability with item specificity
OSR2=((1+L2+L3+L4)^2*Fv2+D1^2+D2^2+D3^2+D4^2)
/((1+L2+L3+L4)^2*Fv2+D1^2+D2^2+D3^2+D4^2+E5+E6+E7+E8); ! Omega 2 reliability with item specificity
OSR3=((1+L2+L3+L4)^2*Fv3+D1^2+D2^2+D3^2+D4^2)
/((1+L2+L3+L4)^2*Fv3+D1^2+D2^2+D3^2+D4^2+E9+E10+E11+E12); ! Omega 3 reliability with item specificity
OSR4=((1+L2+L3+L4)^2*Fv4+D1^2+D2^2+D3^2+D4^2)
/((1+L2+L3+L4)^2*Fv4+D1^2+D2^2+D3^2+D4^2+E13+E14+E15+E16); ! Omega 4 reliability with item specificity
SPV1=D1^2; !Item 1 specific variance estimate
SPV2=D2^2; !Item 2 specific variance estimate
SPV3=D3^2; !Item 3 specific variance estimate
SPV4=D4^2; !Item 4 specific variance estimate
DCRF1=OSR1-(0.559); ! Difference between Omega 1 with specific variance and original estimate
DCRF2=OSR2-(0.556); ! Difference between Omega 2 with specific variance and original estimate
DCRF3=OSR3-(0.558); ! Difference between Omega 3 with specific variance and original estimate
DCRF4=OSR4-(0.530); ! Difference between Omega 4 with specific variance and original estimate
PII1=1-(0.559/OSR1); ! Percentage improvement index for Time 1
PII2=1-(0.556/OSR2); ! Percentage improvement index for Time 2
PII3=1-(0.558/OSR3); ! Percentage improvement index for Time 3
PII4=1-(0.530/OSR4); ! Percentage improvement index for Time 4
OUTPUT: TECH1 CINTERVAL STDYX; ! Request symmetric confidence intervals using bootstrapping
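To make the arithmetic of the MODEL CONSTRAINT section easier to follow outside Mplus, the following Python sketch computes a composite reliability coefficient with and without item specificity counted as true variance. It is an illustration only: all parameter values are hypothetical placeholders rather than estimates from the study, the function names are introduced here for convenience, and the sketch assumes that the squared specificity loadings of all items in the composite enter the sum and that a conventional omega treats specificity as part of the residual.

```python
def omega_with_specificity(loadings, factor_var, spec_loadings, error_vars):
    """Composite reliability counting item-specific variance as true variance."""
    common = sum(loadings) ** 2 * factor_var
    specific = sum(d ** 2 for d in spec_loadings)  # specificity factor variances fixed at 1
    return (common + specific) / (common + specific + sum(error_vars))


def omega_ignoring_specificity(loadings, factor_var, spec_loadings, error_vars):
    """Composite reliability when specificity is left in the error term."""
    common = sum(loadings) ** 2 * factor_var
    specific = sum(d ** 2 for d in spec_loadings)
    return common / (common + specific + sum(error_vars))


# Hypothetical values for one time point (NOT estimates from the study).
loadings = [1.0, 0.9, 1.1, 0.8]          # first loading fixed to unity
factor_var = 0.5                          # plays the role of Fv1
spec_loadings = [0.70, 0.60, 0.65, 0.70]  # play the role of D1..D4
error_vars = [1.2, 1.1, 1.3, 1.0]         # play the role of E1..E4

omega_spec = omega_with_specificity(loadings, factor_var, spec_loadings, error_vars)
omega_plain = omega_ignoring_specificity(loadings, factor_var, spec_loadings, error_vars)
dcr = omega_spec - omega_plain            # difference in composite reliability
pii = 1 - omega_plain / omega_spec        # percentage improvement index

print(f"omega with specificity: {omega_spec:.3f}")
print(f"omega ignoring specificity: {omega_plain:.3f}")
print(f"DCR = {dcr:.3f}, PII = {pii:.3f}")
```

Because the denominator is the same in both functions, moving the specific variance into the numerator can only raise the coefficient, which is the mechanism behind the improvement reported in the Results.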
Appendix B
Mplus 8.2 code for estimating asymmetric confidence intervals using point estimates and variances from the original analyses and by use of the Monte Carlo module. Example code involves the point and interval estimation of specific variance for Item 1 of the scale.
TITLE: Monte Carlo simulation for estimation of asymmetric confidence intervals;
MONTECARLO: NAMES ARE m1; ! Monte Carlo simulation
NOBSERVATIONS = 10000; ! Population distribution with 10000 cases
NREPS = 1; ! One generated set of 10,000 cases
SEED = 12345; ! Seed number to replicate exact estimates
SAVE = SV1.dat; ! Save one dataset
MODEL POPULATION: ! Population model statement
[m1*.625]; ! Mean estimate of specific variance, item1
m1*.003721; ! Variance estimate of specific variance, item1
MODEL: ! Model statement
[m1]; ! Mean of item 1 from bootstrap distribution
m1; ! Variance of item 1 from bootstrap distribution
The text data file saved by the above code can be imported into Excel, where the point estimate and the 95% confidence interval of the simulated distribution can be computed.
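The same simulate-then-summarize steps can also be scripted instead of being carried out in a spreadsheet. The Python sketch below is an assumed equivalent rather than the authors' procedure: it draws 10,000 values from a normal distribution with the mean and variance used in the MODEL POPULATION statement and reads off the 2.5th and 97.5th percentiles. The random number generator differs from the one in Mplus, so the draws will not match the saved file even with the same seed value.

```python
import random
import statistics

rng = random.Random(12345)            # fixed seed for reproducibility of this sketch only

mean, variance = 0.625, 0.003721      # item 1 values from the MODEL POPULATION statement
sd = variance ** 0.5
n_draws = 10_000

# Simulate the population distribution and sort it to read off percentiles.
draws = sorted(rng.gauss(mean, sd) for _ in range(n_draws))

point_estimate = statistics.fmean(draws)
lower = draws[int(0.025 * n_draws)]       # approximately the 2.5th percentile
upper = draws[int(0.975 * n_draws) - 1]   # approximately the 97.5th percentile

print(f"point estimate: {point_estimate:.3f}")
print(f"95% percentile interval: [{lower:.3f}, {upper:.3f}]")
```

With these inputs the interval should lie close to 0.625 ± 1.96 × 0.061, since the simulated distribution is normal by construction.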
For example, Schmidt et al. (2003) consider this part of variance as transient error rather than true variance. Cronbach (1951) excluded that part of variance considering it random measurement error.
Recently, Bentler (2017) challenged the notion that the part of nonoverlapping variance within the one-factor model does not represent shared variance with components outside the factor. He further went on to model that part of error variance using auxiliary variables capturing components related to the construct. The present work drew heavily from the ideas of that paper as well as the early work of Raykov (2007) and Raykov and Tisak (2004).
Referring to common variance, Time 1, for Items Y1 and Y2 [CVT1Y12].
Referring to Item’s Y1, Time 1, Error variance estimate [Y1T1E].
Due to confidentiality, the data provided in the supplementary file involve a simulated data file with parameter estimates drawn from the original data file.
In the absence of a properly fitted model, estimates of internal consistency can be greatly biased (Green & Hershberger, 2000; Kelley & Cheng, 2012; Komaroff, 1997; Zimmerman, Zumbo, & Lalonde, 1993).
Also termed full uniqueness invariance.
Also termed structural invariance.
The increasingly restrictive model testing factor covariance invariance was not applicable in the present one-factor simple structure. Similarly, latent means were not tested, as such a test is not meaningful for the evaluation of longitudinal measurement invariance.
We purposefully used composite reliability and not the popular alpha reliability coefficient in light of ample evidence pointing to the shortcomings of the statistic (Dunn et al., 2014; Green & Yang, 2009; Zhang & Yuan, 2016).
For tau-equivalent measures one can use the alpha coefficient as well.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: Georgios D. Sideridis
https://orcid.org/0000-0002-4393-5995
References
- Atkinson G., Nevill A. M. (1998). Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Medicine, 26, 217-238.
- Becker G. (2000). How important is transient error in estimating reliability? Going beyond simulation studies. Psychological Methods, 5, 370-379.
- Bentler P. M. (1968). Alpha-maximized factor analysis (alphamax): Its relation to alpha and canonical factor analysis. Psychometrika, 33, 335-345.
- Bentler P. M. (2015, October). Spearman–Brown prophesy and correction for attenuation with specificity. Paper presented at Society of Multivariate Experimental Psychology annual meeting, Redondo Beach, CA.
- Bentler P. M. (2017). Specificity-enhanced reliability coefficients. Psychological Methods, 22, 527-540.
- Bishara A. J., Hittner J. B. (2012). Testing the significance of a correlation with nonnormal data: Comparison of Pearson, Spearman, transformation, and resampling approaches. Psychological Methods, 17, 399-417.
- Brown T. A. (2015). Confirmatory factor analysis for applied research. New York, NY: Guilford Press.
- Browne M. W. (1984). Asymptotic distribution free methods in the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 24, 445-455.
- Burt R. S. (1976). Interpretational confounding of unobserved variables in structural equation models. Sociological Methods & Research, 5, 3-52.
- Byrne B. M. (1994). Testing for the factorial validity, replication, and invariance of a measuring instrument: A paradigmatic application based on the Maslach Burnout Inventory. Multivariate Behavioral Research, 29, 289-311.
- Campbell D. T., Russo M. J. (2001). Social measurement. Thousand Oaks, CA: Sage.
- Casella G., Berger R. L. (2002). Statistical inference (2nd ed.). Pacific Grove, CA: Duxbury Press.
- Cheung M. W.-L. (2009). Constructing approximate confidence intervals for parameters with structural equation models. Structural Equation Modeling, 16, 267-294.
- Cronbach L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
- Deng L., Chan W. (2017). Testing the difference between reliability coefficients alpha and omega. Educational and Psychological Measurement, 77, 185-203.
- Dimitrov D. M. (2016). An approach to scoring and equating tests with binary items: Piloting with large-scale assessments. Educational and Psychological Measurement, 76, 954-975.
- Dimitrov D. M. (2018). The delta scoring method of tests with binary items: A note on true score estimation and equating. Educational and Psychological Measurement, 78, 805-825.
- Dunn T. J., Baguley T., Brunsden V. (2014). From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation. British Journal of Psychology, 105, 399-412.
- Efron B., Tibshirani R. J. (1993). An introduction to the bootstrap. London, England: Chapman & Hall.
- Furr M. R., Bacharach V. R. (2014). Psychometrics: An introduction. Thousand Oaks, CA: Sage.
- Gabler S., Raykov T. (2017). Evaluation of maximal reliability for unidimensional measuring instruments with correlated errors. Structural Equation Modeling, 24, 104-111. doi: 10.1080/10705511.2016.1159916
- Goldstein H. (2003). Multilevel statistical models (3rd ed.). London, England: Arnold.
- Gorsuch R. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
- Green S. B., Hershberger S. L. (2000). Correlated errors in true score models and their effect on coefficient alpha. Structural Equation Modeling, 7, 251-270.
- Green S. B., Yang Y. (2009). Commentary on coefficient alpha: A cautionary tale. Psychometrika, 74, 169-173.
- Hancock G. R., Mueller R. O. (2001). Rethinking construct reliability within latent variable systems. In Cudeck R., du Toit S. H. C., Sörbom D. (Eds.), Structural equation modelling: Past and present [a festschrift in honor of Karl G. Jöreskog] (pp. 195-221). Chicago, IL: Scientific Software International.
- Jöreskog K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109-133.
- Kelley K. (2005). The effects of nonnormal distributions on confidence intervals around the standardized mean difference: Bootstrapping as an alternative to parametric confidence intervals. Educational and Psychological Measurement, 65, 51-69.
- Kelley K. (2007). Methods for the behavioral, educational, and social sciences: An R package. Behavior Research Methods, 39, 979-984.
- Kelley K., Cheng Y. (2012). Estimation of and confidence interval formation for reliability coefficients of homogeneous measurement instruments. Methodology, 8, 39-50.
- Kelley K., Lai K. (2010). MBESS 3.0 [Computer software and manual]. Retrieved from http://www.cran.r-project.org/
- Kelley K., Pornprasertmanit S. (2016). Confidence intervals for population reliability coefficients: Evaluation of methods, recommendations, and software for composite measures. Psychological Methods, 21, 69-92.
- Komaroff E. (1997). Effect of simultaneous violations of essential τ-equivalence and uncorrelated error on coefficient α. Applied Psychological Measurement, 21, 337-348.
- Kuk A. Y. C. (1995). Asymptotically unbiased estimation in generalized linear models with random effects. Journal of the Royal Statistical Society, 57, 395-407.
- Marsh H. W., Parker P. D., Morin A. (2016). Invariance testing across samples and time: Cohort-sequence analysis of perceived body composition. In Ntoumanis N., Myers N. D. (Eds.), An introduction to intermediate and advanced statistical analyses for sport and exercise scientists (pp. 121-149). Chichester, England: Wiley.
- Maydeu-Olivares A., Coffman D. L., Hartmann W. M. (2007). Asymptotically distribution-free (ADF) interval estimation of coefficient alpha. Psychological Methods, 12, 157-176.
- McDonald R. P. (1970). The theoretical foundations of principal factor analysis, canonical factor analysis, and alpha factor analysis. British Journal of Mathematical and Statistical Psychology, 23, 1-21.
- McDonald R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum.
- McCrae R. R. (2015). A more nuanced view of reliability: Specificity in the trait hierarchy. Personality and Social Psychology Review, 19, 97-112.
- Meredith W. (1993). Measurement invariance, factor analysis, and factorial invariance. Psychometrika, 58, 525-543.
- Nunnally J. C., Bernstein I. H. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill.
- Oehlert G. W. (1992). A note on the delta method. The American Statistician, 46, 27-29.
- Olsson U. H., Foss T., Troye S. V., Howell R. D. (2000). The performance of ML, GLS, and WLS estimation in structural equation modeling under conditions of misspecification and nonnormality. Structural Equation Modeling, 7, 557-595.
- Padilla M. A., Divers J. (2013a). Bootstrap interval estimation of reliability via coefficient omega. Journal of Modern Applied Statistical Methods, 12, 78-89.
- Padilla M. A., Divers J. (2013b). Coefficient omega bootstrap confidence intervals: Nonnormal distributions. Educational and Psychological Measurement, 73, 956-972.
- Padilla M. A., Divers J. (2015). A comparison of composite reliability estimators: Coefficient omega confidence intervals in the current literature. Educational and Psychological Measurement, 76, 436-453.
- Perron B. E., Gillespie D. F. (2015). Key concepts in measurement. New York, NY: Oxford University Press.
- Pituch K. A., Stapleton L. M. (2008). The performance of methods to test upper level mediation in the presence of nonnormal data. Multivariate Behavioral Research, 43, 237-267.
- Portney L. G., Watkins M. P. (2000). Foundations of clinical research: Applications to practice. Upper Saddle River, NJ: Prentice Hall.
- Raykov T. (1997). Scale reliability, Cronbach’s coefficient alpha, and violations of essential tau-equivalence for fixed congeneric components. Multivariate Behavioral Research, 32, 329-354.
- Raykov T. (2004). Estimation of maximal reliability: A note on a covariance structure modelling approach. British Journal of Mathematical and Statistical Psychology, 57, 21-27.
- Raykov T. (2005). Studying group and time invariance in maximal reliability for multiple-component measuring instruments via covariance structure modelling. British Journal of Mathematical and Statistical Psychology, 58, 301-317.
- Raykov T. (2006). Interval estimation of optimal scores from multiple-component measuring instruments via SEM. Structural Equation Modeling, 13, 252-263.
- Raykov T. (2007). Reliability of multiple-component measuring instruments: Improved evaluation in repeated measure designs. British Journal of Mathematical and Statistical Psychology, 60, 119-136.
- Raykov T. (2009). Interval estimation of revision effect on scale reliability via covariance structure modeling. Structural Equation Modeling, 16, 539-555.
- Raykov T., Marcoulides G. A. (2016). On examining specificity in latent construct indicators. Structural Equation Modeling, 23, 845-855.
- Raykov T., Marcoulides G. A. (2017). Evaluation of true criterion validity for unidimensional multi-component measuring instruments in longitudinal studies. Structural Equation Modeling, 24, 599-606.
- Raykov T., Marcoulides G. A., Gabler S. (2017). Improved estimation of maximal reliability for unidimensional multicomponent measuring instruments in repeated measure studies. Structural Equation Modeling, 24, 755-767.
- Raykov T., Marcoulides G., Akaeze H. O. (2017). Comparing between- and within-group variances in a two-level study: A latent variable modeling approach to evaluating their relationship. Educational and Psychological Measurement, 77, 351-361.
- Raykov T., Rodenberg C., Narayanan A. (2015). Optimal shortening of multiple-component measuring instruments: A latent variable modeling procedure. Structural Equation Modeling, 22, 227-235.
- Raykov T., Shrout P. E. (2002). Reliability of scales with general structure: Point and interval estimation using a structural equation modeling approach. Structural Equation Modeling, 9, 195-212.
- Raykov T., Tisak J. (2004). Examining time-invariance in reliability in multi-wave, multi-indicator models: A covariance structure analysis approach accounting for indicator specificity. British Journal of Mathematical and Statistical Psychology, 57, 253-263.
- Raykov T., Zinbarg R. E. (2011). Proportion of general factor variance in a hierarchical multiple-component measuring instrument: A note on a confidence interval estimation procedure. British Journal of Mathematical and Statistical Psychology, 64, 193-207.
- Revelle W., Zinbarg R. E. (2009). Coefficients alpha, beta, omega, and the GLB: Comments on Sijtsma. Psychometrika, 74, 145-154.
- Robinson J. P., Shaver P. R., Wrightsman L. S. (1991). Criteria for scale selection and evaluation. In Robinson J. P., Shaver P. R., Wrightsman L. S. (Eds.), Personality and social psychological attitudes (Vol. 1, pp. 1-16). San Diego, CA: Academic Press.
- Rothstein J. (1985). Measurement and clinical practice: Theory and application. New York, NY: Churchill Livingstone.
- Schmidt F. L., Le H., Ilies R. (2003). Beyond alpha: An empirical examination of the effects of different sources of measurement error on reliability estimates for measures of individual differences constructs. Psychological Methods, 8, 206-224.
- Seddig D., Leitgob H. (2018). Approximate measurement invariance and longitudinal confirmatory factor analysis: Concept and application with panel data. Survey Research Methods, 12, 29-41.
- Smith K. W. (1974). On estimating the reliability of composite indexes through factor analysis. Sociological Methods & Research, 2, 485-510.
- Spearman C. (1904). “General intelligence,” objectively determined and measured. American Journal of Psychology, 15, 72-101.
- Thurstone L. L. (1947). Multiple factor analysis. Chicago, IL: University of Chicago Press.
- Waller N., Thompson J., Wenk E. (2000). Using IRT to separate measurement bias from true group differences on homogeneous and heterogeneous scales: An illustration with the MMPI. Psychological Methods, 5, 125-146.
- Westfall P. H., Henning K. S. S., Howell R. D. (2012). The effect of error correlation on interfactor correlation in psychometric measurement. Structural Equation Modeling, 19, 99-117.
- Wright J. G., Feinstein A. R. (1992). Improving the reliability of orthopaedic measurements. Journal of Bone and Joint Surgery: British Volume, 74, 287-291.
- Yuan Y., MacKinnon D. P. (2009). Bayesian mediation analysis. Psychological Methods, 14, 301-322.
- Zhang Z. Y., Yuan K.-H. (2016). Robust coefficients alpha and omega and confidence intervals with outlying observations and missing data: Methods and software. Educational and Psychological Measurement, 76, 387-411.
- Zimmerman D. W., Zumbo B. D., Lalonde C. (1993). Coefficient alpha as an estimate of test reliability under violations of two assumptions. Educational and Psychological Measurement, 53, 33-49.





