Abstract
We examined the performance of coefficient alpha and its potential competitors (ordinal alpha, omega total, Revelle’s omega total [omega RT], omega hierarchical [omega h], greatest lower bound [GLB], and coefficient H) with continuous and discrete data having different types of non-normality. Results showed the estimation bias was acceptable for continuous data with varying degrees of non-normality when the scales were strong (high loadings). This bias, however, became quite large with moderate-strength scales and increased with increasing non-normality. For Likert-type scales, other than omega h, most indices were acceptable with non-normal data having at least four points, and more points were better. For data with different exponential distributions, omega RT and GLB were robust, whereas the bias of the other indices was generally large for the binomial-beta distribution. An examination of an authentic large-scale international survey suggested that its items were at worst moderately non-normal; hence, non-normality was not a big concern. We recommend that (a) the demand for continuous and normally distributed data for alpha may be unnecessary for less severely non-normal data; (b) for severely non-normal data, scales should have at least four points, and more points are better; and (c) there is no single gold standard for all data types; other issues such as scale loading, model structure, and scale length are also important.
Keywords: reliability, non-normality, continuous data, discrete data, coefficient alpha
In social and behavioral sciences, researchers often use Likert-type scales of observed variables to operationalize unobserved constructs (e.g., personality or attitudes). Whenever we measure the reliability of these constructs with several related items, coefficient alpha (e.g., in Guttman, 1945; popularized by Cronbach, 1951; abbreviated as alpha below) remains the most common estimator despite alternatives proposed by other researchers. Although alpha remains popular, debates on its use continue (Bentler, 2021; Cho, 2021, 2022; McNeish, 2018; Sijtsma & Pfadt, 2021).
Some major concerns have been the conditions and assumptions to be met before using alpha. A comprehensive review of alpha and its alternatives suggested that the items in the scale should be continuous and normally distributed for the proper use of alpha (McNeish, 2018). Raykov and Marcoulides (2019), however, pointed out these assumptions were unnecessary. For example, their review of Cronbach’s 1951 work showed no evidence for the necessity of these two assumptions, and the only requirement was that the individual difference on the sum score should be greater than zero, that is, var(Y) > 0. Furthermore, KR-20, a special case of alpha for binary data, did not rest on the assumption of normality.
Research on the reliability of discrete scales and data with different distributions continues to be active (Chalmers, 2018; Foster, 2021; Kim et al., 2020; Olvera Astivia et al., 2020; Zumbo & Kroc, 2019). The present research approached the issues from an applied user’s perspective to compare how alpha and its potential competitors (five indices from McNeish, 2018 and ordinal alpha) were affected by the degree of data non-normality. To understand the seriousness of such issues in applied studies, we also examined the performance of these indices in a large-scale international survey.
Alpha and Its Alternatives
The reliability of a scale can be defined as the correlation between two scores from two parallel tests (Lord & Novick, 1968). In the population, this correlation is equal to the squared correlation between the observed score and the true score, or the ratio of the true score variance to the observed score variance:

$$\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}$$
Alpha is calculated by the following formula:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right)$$

where k is the number of test items, $\sigma_i^2$ is the variance of the ith item, and $\sigma_X^2$ is the total variance of the test (Cronbach, 1951). The estimate typically ranges from 0.00 to 1.00, but it can be negative when the items are not positively correlated. Obviously, at least two items are needed to form a scale, that is, k > 1. Furthermore, $\sigma_X^2 > 0$ is implicitly assumed; otherwise alpha will be meaningless.
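The formula above can be sketched directly from a matrix of item scores. The following is a minimal illustration (not the authors' code; `coefficient_alpha` is a hypothetical helper name):

```python
import numpy as np

def coefficient_alpha(scores):
    """Coefficient alpha for an (n_persons, k_items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # sigma_i^2 for each item
    total_var = scores.sum(axis=1).var(ddof=1)  # sigma_X^2 of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)
```

With two identical items the estimate is exactly 1; with uncorrelated items it is near 0 and can go negative in small samples.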
As discussed above, although normality is not required by the formula for alpha, the effects of non-normality on alpha are still debated, not to mention the effects of non-normality on the potential alternatives to alpha. This research, therefore, contributed by comparing the performance of alpha and some of its recently proposed popular alternatives with different types of non-normal data (McNeish, 2018; see formulae in Table 1 and details in the appendix).
Table 1.
Summary on Alternative Indices to Alpha (Edwards et al., 2021; McNeish, 2018).
| Alternative | Mathematical formula | Explanation of symbols |
|---|---|---|
| Omega total | $\omega_t = \frac{(\sum_{i=1}^{k}\lambda_i)^2}{(\sum_{i=1}^{k}\lambda_i)^2 + \sum_{i=1}^{k}\theta_i}$ | $\lambda_i$ represents the factor loading of item i on the single common factor; $\theta_i$ is the unique variance of item i; and k is the total number of items |
| Revelle’s omega total | $\omega_{RT} = \frac{(\sum_{i=1}^{k}\lambda_i)^2 + \sum_{f=1}^{F}(\sum_{i=1}^{k}\lambda_{if})^2}{\sigma_X^2}$ | $\lambda_i$ represents the factor loading of item i on the general factor; $\lambda_{if}$ represents the standardized factor loading of item i on the fth group factor; k is the total number of items in the scale; F is the total number of group factors; and $\sigma_X^2$ is the total variance with the Schmid–Leiman rotation |
| Omega hierarchical | $\omega_h = \frac{(\sum_{i=1}^{k}\lambda_i)^2}{\sigma_X^2}$ | $\lambda_i$ denotes the general factor loading of the ith item; and $\sigma_X^2$ is the total variance after the Schmid–Leiman rotation |
| GLB | $\mathrm{GLB} = 1 - \frac{\operatorname{tr}(C_E)}{\sigma_X^2}$ | $\sigma_X^2$ is the test variance and $\operatorname{tr}(C_E)$ is the largest trace of the interitem error covariance matrix |
| Coefficient H | $H = \left(1 + \left(\sum_{i=1}^{k}\frac{l_i^2}{1-l_i^2}\right)^{-1}\right)^{-1}$ | k is the number of items in the scale and $l_i^2$ is the squared standardized loading of the ith item. |
Note. GLB = greatest lower bound.
Three alternatives are from the omega family: omega total, Revelle’s omega total (omega RT), and omega hierarchical (omega h). Omega total represents the ratio of a scale’s estimated true score variance to its total variance, computed under a factor analysis framework (McDonald, 1999). Omega RT, though a member of the omega family, is quite different from omega total because omega RT is based on a more complex variance decomposition of the principal and additional factors (McNeish, 2018; Revelle, 2020). Typically, omega RT yields a higher reliability estimate than omega total. Omega h, being a model-based estimate of reliability, quantifies the strength of a general factor after controlling for the group factors. Omega h is like omega RT in that both are based on the Schmid–Leiman rotation. Omega h differs, however, in involving only the general factor but not the other factors (i.e., group or specific factors) in the calculation (Kelley & Pornprasertmanit, 2016; Murray et al., 2019).
We also examined two other indices, the greatest lower bound (GLB) and coefficient H. GLB is based on the Classical Test Theory approach to reliability developed by Jackson and Agunwamba (1977). It reflects the maximal value of the interitem error covariance matrix and is a lower bound of reliability. Coefficient H is the reliability of an optimally weighted composite of a scale (Geldhof et al., 2014; Hancock & Mueller, 2011).
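Under a unidimensional model with standardized loadings (so each unique variance is $1-\lambda_i^2$), omega total and coefficient H can be computed directly from the loadings. A minimal sketch, with hypothetical helper names and formulas as given in Table 1:

```python
import numpy as np

def omega_total(loadings):
    """Omega total from standardized loadings of a unidimensional scale."""
    l = np.asarray(loadings, dtype=float)
    theta = 1 - l**2                              # unique variances
    return l.sum()**2 / (l.sum()**2 + theta.sum())

def coefficient_h(loadings):
    """Coefficient H: reliability of the optimally weighted composite."""
    l = np.asarray(loadings, dtype=float)
    s = np.sum(l**2 / (1 - l**2))
    return s / (1 + s)
```

With equal loadings the two coincide; with unequal loadings coefficient H exceeds omega total, reflecting its optimal (rather than unit) weighting of items.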
In addition, as most surveys involved Likert-type scales (e.g., 4-point scales), in this research, we examined how the number of scale points in Likert-type scales would affect the performance of various reliability indices. We also included in our comparison the ordinal alpha index introduced by Zumbo et al. (2007) for Likert-type scales, whose use was still debatable (Chalmers, 2018; Zumbo & Kroc, 2019). Conceptually, ordinal alpha is equivalent to coefficient alpha, except the former is based on the polychoric rather than the Pearson correlation (Gadermann et al., 2012; Zumbo et al., 2007).
Effects of Non-Normality on Reliability Estimate
Data distribution and the normality assumption are often important issues in empirical work (Foldnes & Grønneberg, 2019). Quite a few simulation studies investigated how the violation of the normality assumption might affect alpha (Bandalos & Enders, 1996; Bay, 1973; Foster, 2021; Green & Yang, 2009; Kim et al., 2020; Olvera Astivia et al., 2020; Sheng & Sheng, 2012; Shultz, 1993; Trinchera et al., 2018; Trizano-Hermosilla & Alvarado, 2016; Zimmerman et al., 1993).
On the effects of different types of non-normal distributions on alpha, findings were inconsistent. For example, Bay (1973) found that alpha underestimated the reliability of leptokurtic true score distributions. Similarly, Sheng and Sheng (2012) found that alpha was affected by leptokurtic true score distributions and by skewed and/or kurtotic error score distributions. In general, large sample sizes helped improve the accuracy, bias, and precision of reliability estimates, whereas an increase in skewed items led to a more substantial bias (Trizano-Hermosilla & Alvarado, 2016). Olvera Astivia et al. (2020) further showed that the distribution shapes of the items likely capped the upper bound of alpha, as evaluated by relative bias, root mean square error (RMSE), and efficiency.
In contrast, Zimmerman et al. (1993) and Shultz (1993) found that alpha was quite robust with the violation of the normality assumption. More recently, Foster (2021) noticed that alpha did not change much and performed consistently with different distributions (i.e., the Poisson-gamma [P-G], gamma-inverse gamma [G-IG], negative binomial-F, and binomial-beta [BB] distributions). These results were conflicting probably because of their different operationalizations of non-normality.
The effects of non-normality on alpha alternatives have received attention recently. For example, Trizano-Hermosilla and Alvarado (2016) investigated the performance of alpha, omega coefficient, and GLB of one-dimensional models differing in skewness and non-tau-equivalence. They found that the omega coefficient was always a better choice than alpha. In the presence of skewed items, they recommended omega and GLB even with small samples. Rather than relying on alternative indices, Trinchera et al. (2018) proposed a new asymptotic distribution alpha and a new interval estimate that do not require tau equivalence and multivariate normal distribution. Kim et al. (2020) proposed a reliability index for scale items having different numbers of ordered categories and found that the new index performed better than alpha for bifactor scales. More recently, Foster (2021) compared omega h and omega RT for data with different exponential distributions and found that omega h had the worst performance, while the bias of omega RT was small but positive. On ordinal alpha specifically, research showed that the precision of polychoric estimates was affected by non-normality (distribution shape and response categories) in ordinal structural equation model analyses (Foldnes & Grønneberg, 2019, 2020). We would expect, therefore, non-normality and scale points might affect ordinal alpha which was based on the polychoric correlation.
As scale items in most questionnaires had a limited number of scale points, the reliability of discrete data became a special topic of studies relevant to non-normal distribution (Kim et al., 2020). Bandalos and Enders (1996) found that alpha increased with the degree of similarity between the underlying and observed distributions and the number of response categories. They showed that reliability estimates became quite robust when response categories reached five or more. In contrast, Green and Yang (2009) found that reliability was affected by the ordered categorical items and proposed a nonlinear structural equation model method to assess their reliability. Non-normality due to discrete point distribution has been typical in empirical studies (Micceri, 1989) and was, hence, examined in the present research.
The Present Research
From the above review, previous studies considered only a few reliability indices with respect to non-normality, and few systematic studies compared the most common reliability indices together. Recently, Edwards et al. (2021) assessed the accuracy of alpha and four of the five alternatives proposed by McNeish over 140 conditions varying in factors such as the violation of tau equivalence and sample size. A comprehensive reanalysis of 30 reliability indices in various simulation studies also suggested that no single index satisfied all data types (Cho, 2022). Despite these attempts, none of them examined the effects of non-normality, which remained the primary objective of the present research.
In the present research with three simulation studies, Study 1 focused on continuous data varying in skewness and kurtosis, while Studies 2 and 3 concentrated on discrete data varying in distribution conditions. Specifically, non-normality due to categorization was examined in Study 2, while non-normality in the form of exponential distribution models was examined in Study 3. Other than non-normality, other important factors (scale strength, sample size, and scale points) were also considered in these studies. Details of the simulation design factors are shown in their respective studies below.
Study 1: Effects of the Degree of Non-Normality, Scale Strength, and Sample Size
Simulation Design
We examined the effects of the degree of non-normality and scale strength on various reliability indices with unidimensional scales using a 5 Degree of Non-Normality × 2 Scale Strength × 3 Sample Size design.
Degree of Non-Normality
The five levels of item skewness and kurtosis were: multivariate normal (skewness = 0, kurtosis = 0), slightly non-normal (0.5, 0.5), moderately non-normal (1.0, 1.5), very non-normal (1.5, 3.25), and extremely non-normal (1.75, 3.75; Hau & Marsh, 2004). The four non-normal conditions were obtained by transforming the originally normally distributed data into non-normal distributions using the procedure described by Fleishman (1978) and Vale and Maurelli (1983).
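The Fleishman (1978) procedure underlying this transformation maps a standard normal variate $Z$ to a non-normal variate $Y$ through a cubic polynomial:

```latex
Y = a + bZ + cZ^{2} + dZ^{3}, \qquad a = -c,
```

where $b$, $c$, and $d$ are solved numerically so that $Y$ has mean 0, variance 1, and the target skewness and kurtosis; the Vale–Maurelli (1983) extension adjusts the intermediate correlation matrix so that the multivariate data retain the intended correlations after transformation.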
Scale Strength
Standardized factor loadings in the range of .4 to .9 were common in most empirical research and simulation studies (Li, 2016). In the present study, the standardized loadings for the strong and medium-strength scale were set at .8 (strong) and .5 (medium), respectively, while the factor variances were all set at 1 in the population.
Sample Size
The smallest sample size was commonly set at N = 50 (Cho, 2022). Replicates of N = 50, 200, and 500 were simulated, representing small, moderate, and large sample sizes, respectively. The largest sample size was set at N = 500, beyond which our small-scale trials suggested results would stabilize.
A set of six-item scales with multivariate normally distributed responses was first generated to produce 1,000 replicates for each of the 30 cells in the simulation design using the R program. Then, the six reliability indices were obtained for each replicate with the scaleStructure function in the userfriendlyscience package.
Evaluation Criteria
Three evaluation criteria were used, namely, (a) average relative bias, (b) average mean squared error, and (c) efficiency (the SD of the sample estimates from the average estimate in each condition; a small SD indicated an efficient estimation).
Relative Bias
Bias of the parameter estimates was typically calculated in relative terms, defined as the average deviation of the sample estimate from its population value:

$$\text{Relative bias} = \frac{1}{N_r}\sum_{r=1}^{N_r}\frac{\hat{\rho}_r - \rho}{\rho} \times 100\%$$
RMSE
The mean squared error measured both the amount of bias and the sampling variability of parameter estimates, as defined by the following two formulae:

$$\mathrm{MSE} = \frac{1}{N_r}\sum_{r=1}^{N_r}(\hat{\rho}_r - \rho)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$

where $\rho$ is the true population value of reliability, $\hat{\rho}_r$ is the reliability estimate of the rth replicate of a certain reliability index, and $N_r$ is the number of properly converged replicates in each design condition. As recommended by Hoogland and Boomsma (1998), values of RMSE < .05 and relative bias within ±5% were considered trivial and acceptable. Furthermore, following Curran et al. (1996), values between 5% and 10% were classified as moderately biased, while those greater than 10% were substantial.
Efficiency
The efficiency of parameter estimates was measured by the standard deviation of the sample estimates from the average value, which served as a proxy for the standard error of the population value within a particular cell. The efficiency could be computed as follows:

$$\text{Efficiency} = \sqrt{\frac{\sum_{r=1}^{N_r}(\hat{\rho}_r - \bar{\hat{\rho}})^2}{N_r - 1}}$$
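The three criteria can be computed over the replicates in one pass. A sketch with a hypothetical helper name, assuming `estimates` collects the properly converged replicate values for one design cell:

```python
import numpy as np

def evaluate(estimates, rho):
    """Relative bias (%), RMSE, and efficiency (SD of estimates) over replicates."""
    est = np.asarray(estimates, dtype=float)
    rel_bias = 100 * np.mean((est - rho) / rho)   # average relative deviation
    rmse = np.sqrt(np.mean((est - rho) ** 2))     # bias plus sampling variability
    efficiency = est.std(ddof=1)                  # spread around the average estimate
    return rel_bias, rmse, efficiency
```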
Analyses
After calculating the six reliability indices for all replicates in the 30 design cells, replicates with unacceptable omega total values larger than one were first removed (Edwards et al., 2021). Such improper omega total values occurred in 1.7% of replicates, and only in the cell with extremely non-normal data (S = 1.75, K = 3.75), .5 loading, and N = 50. The three evaluation criteria (relative bias, RMSE, and efficiency) were then computed. Finally, multivariate analysis of variance (MANOVA) tests were conducted to examine the design factors, with partial η2 showing their effect sizes.
Results
Accuracy of Parameter Estimates
Other than omega h, the relative bias was in the acceptable range (<5%) for strong scales regardless of their degree of non-normality (see Appendix C, Table C1 in Supplementary Material). Alpha, omega total, omega RT, and omega h underestimated the reliability of medium-strength scales (.5 loading) with severe non-normality. In contrast, GLB and coefficient H overestimated reliability with medium-strength scales and small sample sizes. Omega h underestimated the population values in most conditions with a relatively large bias. The efficiency and RMSE of strong scales were good (i.e., low values), suggesting the estimation of reliability based on strong scales was relatively stable. As expected, the efficiency and RMSE increased as the degree of non-normality increased.
Effects of Scale Strength and Degree of Non-Normality
MANOVA analyses indicated that all indices were affected by scale strength, with partial η2 ranging from .858 to .909 (Table 2, Row 1). This reaffirms the common knowledge that reliability increases with scale strength (Xiao & Hau, 2021). In contrast, all six indices were moderately affected by the degree of non-normality, with partial η2 ranging from .103 to .195 (Table 2, Row 2). In terms of sample size, alpha and omega total were not affected, whereas the other four indices were moderately affected by sample size (partial η2 = .122–.262; Table 2, Row 3).
Table 2.
Study 1: Effects of Scale Strength, Degree of Non-Normality, and Sample Size, and Their Interactions on Different Reliability Indices (Six-Item Scale).
| Design factor | α | ω total | ω RT | ω h | GLB | H |
|---|---|---|---|---|---|---|
| Scale strength (loading) | .906 | .909 | .873 | .858 | .898 | .890 |
| Degree of non-normality (Deg NonN) | .195 | .192 | .125 | .136 | .141 | .103 |
| Sample size | .002 | .001 | .122 | .140 | .262 | .201 |
| Loading × Deg NonN | .051 | .049 | .034 | .013 | .036 | .012 |
| Deg NonN ( .8 loading) | .033 | .033 | .018 | .038 | .022 | .025 |
| Deg NonN ( .5 loading) | .208 | .204 | .138 | .116 | .152 | .092 |
| Deg NonN × Sample Size | .001 | .001 | .002 | .001 | .007 | .031 |
| Deg NonN (N = 50) | .073 | .064 | .032 | .040 | .028 | .003 |
| Deg NonN (N = 200) | .078 | .079 | .049 | .061 | .059 | .061 |
| Deg NonN (N = 500) | .074 | .078 | .057 | .050 | .075 | .073 |
| Loading × Sample Size | — | .002 | .050 | .014 | .120 | .157 |
| Loading × Deg NonN × Sample Size | — | .001 | .001 | .009 | .002 | .021 |
Note. Partial η2 equal to zero is indicated by “—.” GLB = greatest lower bound.
Examination of the interaction effects showed that effects due to non-normality were minimal for strong scales (Table 2, Row 5), but larger for medium scales (Table 2, Row 6). Specifically, the indices were more sensitive to the degree of non-normality for medium-strength scales, indicating that non-normality affected reliability estimates only when scales were medium-strength (vs. strong; see also Figure 1 for alpha). In addition, the effects due to non-normality were consistent for different sample sizes (Table 2, see Rows 8–10 for details).
Figure 1.

Alpha Values of Scales With Varying Degree of Non-Normality (Normal Vs. Severely Non-Normal) and Scale Strength (Medium Vs. Strong Scales With Moderate Vs. High Loadings).
Study 2: Effects of Non-Normality Distribution and Number of Scale Points in Likert-Type Scale
In this study, we further examined the effects of both non-normality distribution and the number of scale points in Likert-type scales. A 3 Distribution × 5 Number of Scale Points experimental design was used (Li, 2016, 2021). In addition to the indices used in Study 1, ordinal alpha was included.
Distribution
Following Li (2016), three distributions were adopted, namely, symmetric, slightly asymmetric, and moderately asymmetric distributions (Table 3), with different distributions obtained by manipulating the percentage of response at each scale point (details below).
Table 3.
Response Probabilities of Different Scale Points in Three Distribution Conditions.
| Distribution | Scale points | ||||
|---|---|---|---|---|---|
| 2-point | 4-point | 5-point | 6-point | 7-point | |
| Symmetric | 50/50 | 10/40/40/10 | 10/20/40/20/10 | 5/16/29/29/16/5 | 5/12/18/30/18/12/5 |
| Slightly asymmetric | 29/71 | 5/9/52/34 | 4/5/21/46/24 | 4/5/5/36/31/19 | 4/5/6/12/42/22/9 |
| Moderately asymmetric | 17/83 | 5/9/26/60 | 4/6/10/32/48 | 4/4/5/16/32/39 | 4/4/5/6/10/32/39 |
Number of Scale Points
We compared scales with the commonly found 4, 5, 6, and 7 points (based on Li, 2016, 2021, which covered 157 psychometric measures and 647 instruments from 84 empirical studies). In addition, as alpha is a generalization of the KR-20 formula for dichotomous indicators (coded as 0 or 1; Cronbach, 1951; Raykov & Marcoulides, 2019), we also included binary data (2-point scale) in the design (Table 3).
Operationally, 2-, 4-, 5-, 6-, and 7-point scales were generated from the respective scales with three degrees of non-normality (distribution conditions). We simulated 1,000 data sets with multivariate normal responses for six items and sample size N = 500. Factor loading of .5 was chosen to magnify the possible RMSEs and bias differences among the indices. Each data set was then transformed into three degrees of non-normality (distribution conditions) according to the response probabilities (Table 3). Specifically, the simulated data were discretized using the respective thresholds according to the response probabilities. For example, for the symmetric distribution, the response probabilities were 50% and 50% for the 2-point scale, with the threshold set at 0. Both data generation and reliability analyses were conducted with the R program.
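The threshold discretization described above can be sketched with the standard-normal inverse CDF. A minimal version (hypothetical helper name; `probs` are the response probabilities from Table 3):

```python
import numpy as np
from statistics import NormalDist

def discretize(z, probs):
    """Collapse continuous standard-normal responses into scale points 1..len(probs)."""
    cum = np.cumsum(probs)[:-1]                        # interior cumulative probabilities
    thresholds = [NormalDist().inv_cdf(p) for p in cum]
    return np.digitize(z, thresholds) + 1              # category labels 1..k
```

For the symmetric 2-point condition (50/50), the single threshold is `inv_cdf(0.5) = 0`, matching the example in the text.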
In addition, to calculate the population reliability, two parallel tests sharing identical underlying continuous true scores were generated with N = 1,000,000. The correlation between the observed sum scores from the two parallel tests was taken as the population reliability (Kim et al., 2020).
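This parallel-test approach can be sketched under a one-factor model with standardized loadings (an illustration under assumed parameters, not the authors' code; a smaller N is used here for speed):

```python
import numpy as np

def population_reliability(loading=0.5, k=6, n=200_000, seed=1):
    """Approximate population reliability as the correlation between sum scores
    of two parallel tests sharing the same underlying true scores."""
    rng = np.random.default_rng(seed)
    f = rng.normal(size=n)                     # common factor (true score)
    unique_sd = np.sqrt(1 - loading**2)        # unique SD for standardized loadings
    x1 = loading * f[:, None] + rng.normal(size=(n, k)) * unique_sd
    x2 = loading * f[:, None] + rng.normal(size=(n, k)) * unique_sd
    return np.corrcoef(x1.sum(axis=1), x2.sum(axis=1))[0, 1]
```

For loading = .5 and k = 6, the analytic value is $(k\lambda)^2 / [(k\lambda)^2 + k(1-\lambda^2)] = 9/13.5 \approx .667$, which the simulated correlation approaches.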
Results
Accuracy of Parameter Estimates
No case with an unacceptable omega total >1 was found in any simulation condition. Generally, other than omega RT, most indices underestimated reliability across different conditions, likely because information was lost in collapsing continuous data into discrete values. In contrast, omega RT overestimated the population reliability in most conditions, probably due to its overfitting of the model, which caused a subsequent positive bias in reliability estimation (see also Foster, 2021). The bias of ordinal alpha was the smallest because its calculation was based on the polychoric covariance matrix and rested on an underlying normal distribution assumption. Thus, all the ordinal data were assumed to have the same underlying normal distribution despite their differences in scale points.
The relative bias became smaller as the number of scale points increased, and more scale points were required to reduce the bias of more severely non-normal scales. Specifically, for symmetrically distributed scales, the relative bias of alpha and omega total reduced to an acceptable range with 4-point scales or above. When the distribution became more asymmetric, more scale points were needed to reduce the bias (see Appendix C, Table C2 in Supplementary Material). The patterns of efficiency and RMSE were similar to those of relative bias, and all indices showed large relative bias and RMSE with 2-point scales.
Effects of Distribution and Scale Points
MANOVA on the reliability indices showed that ordinal alpha was not affected by the kind of distribution (first design factor) or the number of scale points (second design factor), with partial η2 close to zero for both design factors. In contrast, all other indices were affected by these two design factors (Table 4, Rows 1 and 2). Specifically, partial η2 ranged from .416 (omega RT) to .718 (alpha) for the number of scale points and from .087 (omega RT) to .285 (alpha) for the kind of distribution. These results suggested that, other than ordinal alpha, omega RT was least affected by the kind of distribution and the number of scale points compared with the other alpha alternatives examined here. Analyses of the interactions indicated that all indices (except ordinal alpha) were affected by the degree of non-normality, especially when there were fewer scale points (e.g., 2-point; Table 4, Row 4; Figure 2).
Table 4.
Study 2: Effects of Distribution and Number of Scale Points (Six-Item Scale) on Different Reliability Indices.
| Design factor | α | ω total | ω RT | ω h | GLB | H | Ordinal alpha |
|---|---|---|---|---|---|---|---|
| Scale points | .718 | .716 | .416 | .539 | .633 | .701 | .004 |
| Distribution | .285 | .282 | .087 | .199 | .185 | .256 | .001 |
| Scale points × Distribution | .036 | .035 | .007 | .015 | .017 | .023 | .001 |
| Distribution (2-point) | .163 | .159 | .042 | .092 | .092 | .130 | — |
| Distribution (4-point) | .070 | .069 | .015 | .047 | .041 | .062 | — |
| Distribution (5-point) | .054 | .053 | .015 | .035 | .032 | .049 | — |
| Distribution (6-point) | .038 | .039 | .013 | .027 | .027 | .039 | — |
| Distribution (7-point) | .065 | .064 | .016 | .045 | .037 | .059 | — |
Note. Partial η2 equal to zero is indicated by “—.” GLB = greatest lower bound.
Figure 2.
Values of Alpha and Its Alternatives on Scales With Different Distributions and Number of Scale Points.
Note. GLB = greatest lower bound.
Study 3: Performance With Exponential Distribution Models
In applied studies, researchers often have data sets with exponential distributions. As an extension of Studies 1 and 2, we simulated data with three types of exponential distributions following procedures described by Foster (2020, 2021; see Appendix B in Supplementary Material). For these three types of exponential distribution data, namely, the P-G, G-IG, and B-B models, reliability ρ was set at .8 for the six-item scale. For each distribution type, 1,000 replicates of sample sizes 50, 200, and 500 were simulated. Alpha, omega RT, and omega h were estimated with the omega function in the R psych package; omega total with the ci.reliability function in the MBESS package; GLB with the glb.fa function in the psych package (Revelle, 2020); and coefficient H with the scaleStructure function in the userfriendlyscience package. As the estimation of coefficient H failed for the chosen distributions, no results for it are shown.
Results
Relative bias, RMSE, and efficiency are shown in Appendix C, Table C3 in Supplementary Material. Results showed that omega RT and GLB performed well regardless of the sample size, whereas omega h performed the worst.
It was noted that all indices performed worse with the B-B distribution than with the P-G or the G-IG distribution. For exponential data, alpha underestimated reliability, particularly for B-B distribution data. In contrast, omega RT and GLB provided better estimates with smaller biases for most conditions.
Study 4: Performance in an Authentic Large-Scale Study
Finally, we further compared the performance of alpha and its alternatives in an authentic large-scale assessment. The purpose was to see whether commonly used large-scale assessment items were so severely non-normal as to warrant special attention in the use of various reliability indices. The well-known Programme for International Student Assessment (PISA; OECD, 2019a) was chosen as the target for examination. It measured 15-year-olds’ knowledge and skills in reading, mathematics, and science, as well as related background and attitudes. A total of 612,004 representative students from 79 countries (economies) participated in PISA 2018 (OECD, 2019b). We used the scales with the overall most severe non-normality and compared the performance of various reliability indices between the worst country (the one with the most severe non-normality) and the best country (the one with the least severe non-normality).
Selected Scales and Countries
The PISA 2018 student questionnaire contained 74 derived constructs (OECD, 2019b). We excluded indices based on a single item (e.g., gender) or derived from different scale formats. This left us with 28 scales with a total of 108 items (see item statistics in Table D1 in Supplementary Material). Their skewness ranged from −1.94 to 16.60, while their kurtosis ranged from −2.00 to 273.65. After removing the two extremely non-normal items, the ranges of skewness and kurtosis narrowed to −1.94 to 2.74 and −1.63 to 7.75, respectively. These values were close to normal or moderately non-normal (Blanca et al., 2013), suggesting most PISA items were not severely non-normal.
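The screening above relies on sample skewness and excess kurtosis (both 0 under normality). A minimal version of these descriptive statistics (a hypothetical helper, not the authors' code):

```python
import numpy as np

def skew_kurtosis(x):
    """Sample skewness and excess kurtosis (both 0 for a normal distribution)."""
    z = (np.asarray(x, dtype=float) - np.mean(x)) / np.std(x)
    return float(np.mean(z**3)), float(np.mean(z**4) - 3)
```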
Following the above analyses, seven constructs with the most severe non-normality and covering a wide range of different scale points were selected. They were home educational resources (HEDRES; seven items, 2-point scale), general fear of failure (GFOFAIL; three items, 4-point scale), student perceived teacher support in test language lessons (TEACHSUP; four items, 4-point scale), students’ resilience (RESILIENCE; five items, 4-point scale), mastery goal orientation (MASTGOAL, three items, 5-point scale), meta-cognition: assess credibility (METASPAM; five items, 6-point scale), and meta-cognition: understanding and remembering (UNDREM; six items, 6-point scale).
To understand how the reliability indices performed in the potentially most severely affected countries, for each construct selected above, we first chose the country with the largest kurtosis (as each scale had multiple items, we compared the median absolute kurtosis of the items in each scale). Then, we compared this most severely affected country against the least affected one (smallest absolute kurtosis), used as a benchmark. Basically, the standardized loadings across scales were similar, and the scales were unidimensional as indicated by the model-fit indices (see Appendix D, Tables D2–D8 in Supplementary Material).
Results
We had several observations (see Appendix D, Table D9 in Supplementary Material): (a) As exemplified by PISA, items in large-scale surveys were not severely non-normal. (b) The various reliability indices differed substantially for the same set of non-normal data (the country with the most severe non-normality) and close-to-normal data (the country with the least severe non-normality). This suggested that the characteristics of individual reliability indices and factors other than non-normality (e.g., items in the scale not being tau-equivalent) might make the values of the reliability indices substantially different. (c) The reliability indices were similar, and non-normality was of lesser concern, when the scales had more points (4-, 5-, 6-point scales); 2-point scales were particularly problematic. (d) Alpha was relatively robust and less affected by non-normality in some situations (certain kurtosis and skewness values).
The authentic data set also reaffirmed a few observations in this and other research. (a) Ordinal alpha was systematically larger than alpha, particularly when the scales were non-normal and had fewer points (e.g., 2-point scales). (b) Omega h was smaller than alpha in all conditions (Zinbarg et al., 2005), while omega RT and GLB tended to give higher estimates than alpha. Understandably, the above findings were limited to the seven chosen scales in the PISA data and should be generalized cautiously.
Discussion and Conclusion
Reliability is the most common indicator of the quality of a scale, with alpha being the most popular, despite the continuing debates on its use (McNeish, 2018; Raykov & Marcoulides, 2019). We examined the performance of different reliability indices for varying types of non-normal data. For continuous data, all indices were affected by non-normality and scale strength. Effects of non-normality were more severe when the scales were weak (low loadings).
Generally, the relative bias was within the acceptable range for different degrees of non-normality if the scales were strong. When the scale strength (loading) was moderate, however, the bias became quite large and increased with the degree of non-normality. For discrete data, most indices underestimated reliability; the exception was omega RT, which might overfit the proposed model (Kim et al., 2020; Savalei & Reise, 2019). Our results showed that most indices were substantially biased with severely non-normal data or scales with fewer points. The bias with binary or few-scale-point data was in line with previous research (Bonanomi et al., 2015; Lissitz & Green, 1975; Nunnally, 1978). We concurred with a recent study that reliability would increase and become less biased as scale points increased, and would stabilize upon reaching five scale points (Kim et al., 2020). Importantly, for applied users, alpha and omega total could still provide relatively accurate reliability estimates when the data did not severely violate the normality assumption or the scales consisted of many scale points. Increasing the number of scale points could compensate for the deviation from normality in reliability estimation.
Ordinal alpha, purposely designed for ordinal data using the polychoric covariance matrix, was shown to be unaffected by non-normality and the number of scale points (see Zumbo et al., 2007). The polychoric covariance matrix assumes an underlying normal distribution for the discrete responses. Thus, all the ordinal data transformed from the same measurement are supposed to have the same underlying normal distribution despite having different scale points.
Considerations on Non-Normality for Empirical Researchers
Empirical researchers would like to know whether violating the continuity and normality assumptions leads to inaccurate reliability estimates and which reliability index should be preferred. Our results, disappointingly, suggested that no single index fit all data types. Nevertheless, a few take-home messages can be drawn.
First, after comparing alpha with its various proposed alternatives, we still recommend alpha when the distribution is relatively normal or not severely non-normal. As found in the present research, most reliability indices behaved appropriately when the skewness and kurtosis of the data were not severe. That is, the demand for continuous and normally distributed data in reliability estimation is not strictly necessary, particularly when the scales are strong (high factor loadings) or have many scale points. When developing research instruments, a routine normality check is useful (e.g., the χ2 test, Kuiper’s test, or the Shapiro–Wilk test; Yazici & Yolacan, 2007). If the data severely violate the normality assumption, alpha is not recommended. An alternative is to increase the number of scale points to at least four to obtain stable reliability.
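As an illustration of such a routine check, here is a minimal sketch using SciPy's Shapiro–Wilk test on simulated item scores. The two "items" and the sample size are illustrative assumptions, not data from the study:

```python
# Routine normality screening before reliability estimation.
# The two simulated "items" below are for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_item = rng.normal(size=500)       # roughly normal responses
skewed_item = rng.exponential(size=500)  # severely non-normal responses

for name, x in [("normal", normal_item), ("skewed", skewed_item)]:
    w, p = stats.shapiro(x)              # Shapiro-Wilk test of normality
    print(f"{name}: W={w:.3f}, p={p:.4g}, "
          f"skew={stats.skew(x):.2f}, excess kurtosis={stats.kurtosis(x):.2f}")
```

A small p (and W well below 1) flags a severely non-normal item, which per the recommendation above would argue against relying on alpha for that scale.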
In comparing the omega family indices, omega total performs similarly to alpha and, therefore, can be regarded as a more general form of alpha, as confirmed in previous studies (McNeish, 2018). We also recommend omega RT for discrete data, as its bias is reasonable and acceptable even with moderately asymmetric distributions. However, we do not recommend omega h, as it generally underestimates the population values. In both the simulation studies and the authentic, more complex data, omega h took on the smallest values compared with the other indices, and sometimes it could be close to zero (see Zinbarg et al., 2005). Based on its definition, omega h would be a good choice for a potentially multidimensional scale with a strong general factor or a strictly unidimensional scale with equal factor loadings (see also Green & Yang, 2009; Trizano-Hermosilla et al., 2021). Thus, if we have a clear construct in the scale, omega h can be a suitable choice. Otherwise, we suggest using other indices such as omega total and omega RT.
Finally, high reliability values should be interpreted cautiously if the model fit is not acceptable. For example, with the authentic PISA data, ordinal alpha, omega RT, and GLB had values greater than .7 for the home possessions scale, although the model-fit indices were poor (see Appendix D, Table D2 in Supplementary Material). The high values of these indices may simply reflect that the scale has many items rather than being high in quality. For the same reason, we found that the reliability of UNDREM was in the acceptable range despite having some poor items (see Appendix D, Table D8 in Supplementary Material).
Here, omega RT might also provide a higher value because it was obtained from the exploratory bifactor solution (general and group factors). It was always larger than alpha in the present authentic data and in previous studies (McNeish, 2018; Savalei & Reise, 2019; Xiao & Hau, 2021). We should be cautious, however: the high value does not necessarily mean omega RT is better than alpha. As the group factors represent irrelevant sources of variability, it is unclear whether their variance should count toward improved reliability.
Interestingly, we found that omega h for the home possessions scale was appropriately low, reaffirming its use alongside other indices to judge the quality of scales (Xiao & Hau, 2021). In brief, the performance of the reliability indices is heavily affected by other features of the scale, such as strength (loading), model specification, and scale length. These factors are probably more influential on reliability than non-normality in large-scale surveys.
Limitations and Future Research
There are several limitations to this research. First, we confined our analyses to scales with all items sharing the same distribution. In other words, Likert-type scale data were generated by applying the same set of response probabilities to each continuous variable. Such distributions are rarely encountered in real-world data sets. Further research should explore more diverse item distributions within each scale.
Moreover, real-world data sets are often non-normal and incomplete with missing values (Raykov & Marcoulides, 2016; Savalei, 2010). Future studies can extend to other conditions, such as misspecified models and clustered or bi/multi-modal data. Finally, we considered only scales with a unidimensional structure in the current research. Some of the indices are more suitable for scales with multidimensional structures, so explorations with more diverse structures would be helpful (e.g., Fu et al., 2021; Trizano-Hermosilla et al., 2021).
Supplemental Material
Supplemental material, sj-docx-1-epm-10.1177_00131644221088240 for Performance of Coefficient Alpha and Its Alternatives: Effects of Different Types of Non-Normality by Leifeng Xiao and Kit-Tai Hau in Educational and Psychological Measurement
Appendix
Calculation of Population Reliability and Various Reliability Indices
Population Reliability
The reliability of test scores is defined as the correlation between two scores from two parallel tests (Lord & Novick, 1968). In the population, the correlation between two such test scores equals the squared correlation between the observed score and the true score. Therefore, it equals the ratio of the true-score variance to the observed-score variance:

$$\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}.$$
The observed test variance and the error variance are obtained from the covariance matrix. For example, to calculate the population reliability of a six-item scale with all loadings equal to .8, the variance of observed scores is $\sigma_X^2 = (6 \times 0.8)^2 + 6 \times (1 - 0.8^2) = 25.2$, while the error variance is $\sigma_E^2 = 6 \times (1 - 0.8^2) = 2.16$, and therefore, its reliability is $(25.2 - 2.16)/25.2 \approx .914$.
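The worked example above can be verified in a few lines of Python; this is a sketch assuming, as in the text, six standardized items all loading .8:

```python
# Population reliability for a 6-item scale, all standardized loadings = .8
import numpy as np

loadings = np.full(6, 0.8)
true_var = loadings.sum() ** 2           # (6 * .8)^2 = 23.04
error_var = (1 - loadings ** 2).sum()    # 6 * (1 - .64) = 2.16
observed_var = true_var + error_var      # 25.2
reliability = true_var / observed_var
print(round(reliability, 3))  # 0.914
```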
An equivalent formulation of population reliability is given by Raykov and Marcoulides (2016).
Omega Total
McDonald (1999) proposed omega total within a factor-analytic framework, which can be represented as

$$\omega_t = \frac{\left(\sum_{i=1}^{k} \lambda_i\right)^2}{\left(\sum_{i=1}^{k} \lambda_i\right)^2 + \sum_{i=1}^{k} \theta_{ii}},$$
where $\lambda_i$ represents the factor loading of item $i$ on the single common factor, $\theta_{ii}$ is the unique variance of item $i$, and $k$ is the total number of items.
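Under a standardized single-factor solution (so that $\theta_{ii} = 1 - \lambda_i^2$), omega total can be sketched as follows; the loading values are hypothetical:

```python
import numpy as np

def omega_total(loadings):
    """McDonald's omega total from standardized single-factor loadings.
    Assumes unit item variances, so unique variance is 1 - loading^2."""
    lam = np.asarray(loadings, dtype=float)
    common = lam.sum() ** 2          # (sum of loadings)^2
    unique = (1.0 - lam ** 2).sum()  # sum of unique variances
    return common / (common + unique)

print(round(omega_total([0.8] * 6), 3))  # 0.914, matching the population value
```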
Revelle’s Omega Total (Omega RT)
Omega RT is another member of the omega family. It is quite different from omega total because it is based on a more complex variance decomposition of the main and additional factors. Typically, omega RT results in a higher reliability estimation than omega total. Omega RT is calculated by (Revelle, 2020):
$$\omega_{RT} = \frac{\left(\sum_{i=1}^{k} \lambda_{gi}\right)^2 + \sum_{f=1}^{F}\left(\sum_{i=1}^{k} \lambda_{if}\right)^2}{\sigma_X^2},$$

where $\lambda_{gi}$ represents the factor loading of item $i$ on the general factor, $\lambda_{if}$ represents the standardized loading of item $i$ on the $f$th group factor, $k$ is the total number of items in the scale, $F$ is the total number of group factors, and $\sigma_X^2$ is the total variance under the Schmid–Leiman rotation (Schmid & Leiman, 1957). Here, $\sigma_X^2$ consists of variances from four parts: $g$ (general factor), $f$ (group factors), $s$ (specific factors), and $e$ (random error), with the variance of the specific factors confounded with that of the error. Unlike omega total, omega RT is a total reliability estimate based on an exploratory bifactor solution for scales with more than one dimension (McNeish, 2018; Revelle, 2020).
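Given loadings from an orthogonal (Schmid–Leiman style) bifactor pattern, omega RT can be sketched as below. The loading values are hypothetical; in practice they would come from an exploratory bifactor solution (e.g., the psych package in R):

```python
import numpy as np

def omega_rt(general, group):
    """Revelle's omega total from an orthogonal bifactor pattern.
    general: general-factor loadings (length k);
    group: list of group-factor loading vectors (zeros where an item
    does not load). Standardized items are assumed, so residual item
    variance is 1 minus the sum of squared loadings."""
    g = np.asarray(general, dtype=float)
    grp = [np.asarray(f, dtype=float) for f in group]
    common = g.sum() ** 2 + sum(f.sum() ** 2 for f in grp)
    resid = (1.0 - g ** 2 - sum(f ** 2 for f in grp)).sum()
    return common / (common + resid)

# Hypothetical pattern: 6 items, general loadings .6,
# two group factors of 3 items each with loadings .4
g = [0.6] * 6
groups = [[0.4, 0.4, 0.4, 0, 0, 0], [0, 0, 0, 0.4, 0.4, 0.4]]
print(round(omega_rt(g, groups), 3))  # 0.846
```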
Omega Hierarchical (Omega h)
Omega h, a member of the omega family of model-based reliability estimates, quantifies the strength of a general factor after controlling for the group factors. Omega h is like omega RT in that it is also based on the Schmid–Leiman rotation. It differs, however, in that its calculation involves the general but not the other factors (i.e., group or specific factors; Kelley & Pornprasertmanit, 2016; Murray et al., 2019). The formula estimates the extent to which the general factor dimension contributes to the reliability of observed scores:
$$\omega_h = \frac{\left(\sum_{i=1}^{k} \lambda_{gi}\right)^2}{\sigma_X^2},$$

where $\lambda_{gi}$ denotes the general factor loading of the $i$th item, $\sigma_X^2$ is the total variance after the Schmid–Leiman rotation, $\lambda_{i1}$ to $\lambda_{iF}$ represent the group factor loadings of the $i$th item on group factors $1$ to $F$, and $\theta_{ii}$ represents the residual variance.
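Under the same assumptions as for omega RT, omega h keeps only the general-factor variance in the numerator. A sketch with hypothetical loadings:

```python
import numpy as np

def omega_h(general, group):
    """Omega hierarchical: proportion of total score variance due to the
    general factor (orthogonal Schmid-Leiman pattern, standardized items)."""
    g = np.asarray(general, dtype=float)
    grp = [np.asarray(f, dtype=float) for f in group]
    common = g.sum() ** 2 + sum(f.sum() ** 2 for f in grp)
    resid = (1.0 - g ** 2 - sum(f ** 2 for f in grp)).sum()
    return g.sum() ** 2 / (common + resid)

# Hypothetical pattern: 6 items, general loadings .6,
# two group factors of 3 items each with loadings .4
g = [0.6] * 6
groups = [[0.4, 0.4, 0.4, 0, 0, 0], [0, 0, 0, 0.4, 0.4, 0.4]]
print(round(omega_h(g, groups), 3))  # 0.692
```

Because the group-factor variance is excluded from the numerator but kept in the total variance, omega h is necessarily smaller than omega RT for the same pattern, consistent with the findings reported above.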
Maximal Reliability and Coefficient H
Coefficient H is the reliability of an optimally weighted composite of a scale (Geldhof et al., 2014; Hancock & Mueller, 2011) estimated by the following:
$$H = \left[1 + \left(\sum_{i=1}^{k} \frac{\lambda_i^2}{1 - \lambda_i^2}\right)^{-1}\right]^{-1},$$

where $k$ is the number of items in the scale and $\lambda_i^2$ is the squared standardized loading of the $i$th item. Different from other estimates of composite reliability, each standardized factor loading here is first squared and then aggregated.
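A sketch of coefficient H from standardized loadings (the loading values are hypothetical). With equal loadings H coincides with the tau-equivalent value; with unequal loadings the optimal weighting pushes H higher:

```python
import numpy as np

def coefficient_h(loadings):
    """Coefficient H (maximal reliability) from standardized loadings."""
    lam = np.asarray(loadings, dtype=float)
    # each squared loading enters via lambda^2 / (1 - lambda^2)
    s = np.sum(lam ** 2 / (1.0 - lam ** 2))
    return s / (1.0 + s)

print(round(coefficient_h([0.8] * 6), 3))  # 0.914 (equal loadings)
print(round(coefficient_h([0.9, 0.8, 0.7, 0.6, 0.5, 0.4]), 3))  # unequal loadings
```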
Greatest Lower Bound (GLB)
The GLB is based on the classical test theory approach to reliability developed by Jackson and Agunwamba (1977). With the greatest possible covariance matrix of error, according to the reliability formula, GLB reliability is defined as
$$\mathrm{GLB} = 1 - \frac{\operatorname{tr}(\mathbf{C}_e)}{\sigma_X^2},$$

where $\sigma_X^2$ is the test variance and $\operatorname{tr}(\mathbf{C}_e)$ is the largest possible trace of the interitem error covariance matrix. Minimum rank factor analysis, developed by Ten Berge and his colleague (Shapiro & Ten Berge, 2002), was used to compute the GLB.
Footnotes
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The authors received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: Leifeng Xiao
https://orcid.org/0000-0001-7125-1067
Supplemental Material: Supplemental material for this article is available online.
References
- Bandalos D. L., Enders C. K. (1996). The effects of nonnormality and number of response categories on reliability. Applied Measurement in Education, 9(2), 151–160. 10.1207/s15324818ame0902_4
- Bay K. S. (1973). The effect of non-normality on the sampling distribution and standard error of reliability coefficient estimates under an analysis of variance model. British Journal of Mathematical & Statistical Psychology, 26, 45–57. 10.1111/j.2044-8317.1973.tb00505.x
- Bentler P. M. (2021). Alpha, FACTT, and beyond. Psychometrika, 86, 861–868. 10.1007/s11336-021-09797-8
- Blanca M. J., Arnau J., López-Montiel D., Bono R., Bendayan R. (2013). Skewness and kurtosis in real data samples. Methodology, 9, 78–84. 10.1027/1614-2241/a000057
- Bonanomi A., Cantaluppi G., Ruscone M. N., Osmetti S. (2015). A new estimator of Zumbo’s ordinal alpha: A copula approach. Quality & Quantity: International Journal of Methodology, 49, 941–953. 10.1007/s11135-014-0114-8
- Chalmers R. P. (2018). On misconceptions and the limited usefulness of ordinal alpha. Educational and Psychological Measurement, 78(6), 1056–1071. 10.1177/0013164417727036
- Cho E. (2021). Neither Cronbach’s alpha nor McDonald’s omega: A commentary on Sijtsma and Pfadt. Psychometrika, 86, 877–886. 10.1007/s11336-021-09801-1
- Cho E. (2022). The accuracy of reliability coefficients: A reanalysis of existing simulations. Psychological Methods. Advance online publication. 10.1037/met0000475
- Cronbach L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334. 10.1007/BF02310555
- Curran P. J., West S. G., Finch J. F. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychological Methods, 1(1), 16–29.
- Edwards A. A., Joyner K. J., Schatschneider C. (2021). A simulation study on the performance of different reliability estimation methods. Educational and Psychological Measurement, 81(6), 1089–1117. 10.1177/0013164421994184
- Fleishman A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43(4), 521–532.
- Foldnes N., Grønneberg S. (2019). On identification and non-normal simulation in ordinal covariance and item response models. Psychometrika, 84(4), 1000–1017. 10.1007/s11336-019-09688-z
- Foldnes N., Grønneberg S. (2020). Pernicious polychorics: The impact and detection of underlying non-normality. Structural Equation Modeling, 27(4), 525–543. 10.1080/10705511.2019.1673168
- Foster R. C. (2021). KR-20 and KR-21 for some non-dichotomous data (It’s not just Cronbach’s alpha). Educational and Psychological Measurement, 81(6), 1172–1202. 10.1177/0013164421992535
- Fu Y., Wen Z., Wang Y. (2021). A comparison of reliability estimation based on confirmatory factor analysis and exploratory structural equation models. Educational and Psychological Measurement, 82(2), 205–224. 10.1177/00131644211008953
- Gadermann A. M., Guhn M., Zumbo B. D. (2012). Estimating ordinal reliability for Likert-type and ordinal item response data: A conceptual, empirical, and practical guide. Practical Assessment, Research & Evaluation, 17(1), 1–13.
- Geldhof G. J., Preacher K. J., Zyphur M. J. (2014). Reliability estimation in a multilevel confirmatory factor analysis framework. Psychological Methods, 19(1), 72–91. 10.1037/a0032138
- Green S. B., Yang Y. (2009). Reliability of summed item scores using structural equation modeling: An alternative to coefficient alpha. Psychometrika, 74(1), 155–167. 10.1007/s11336-008-9099-3
- Guttman L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10(4), 255–282.
- Hancock G. R., Mueller R. O. (2011). The reliability paradox in assessing structural relations within covariance structure models. Educational and Psychological Measurement, 71(2), 306–324. 10.1177/0013164410384856
- Hau K. T., Marsh H. W. (2004). The use of item parcels in structural equation modelling: Non-normal data and small sample sizes. British Journal of Mathematical & Statistical Psychology, 57(2), 327–351. 10.1111/j.2044-8317.2004.tb00142.x
- Hoogland J. J., Boomsma A. (1998). Robustness studies in covariance structure modeling: An overview and a meta-analysis. Sociological Methods and Research, 26(3), 329–367. 10.1177/0049124198026003003
- Jackson P. H., Agunwamba C. C. (1977). Lower bounds for the reliability of the total score on a test composed of non-homogeneous items: I. Algebraic lower bounds. Psychometrika, 42(4), 567–578.
- Kelley K., Pornprasertmanit S. (2016). Confidence intervals for population reliability coefficients: Evaluation of methods, recommendations, and software for composite measures. Psychological Methods, 21(1), 69–92. 10.1037/a0040086
- Kim S., Lu Z., Cohen A. S. (2020). Reliability for tests with items having different numbers of ordered categories. Applied Psychological Measurement, 44(2), 137–149. 10.1177/0146621619835498
- Li C. H. (2016). The performance of ML, DWLS, and ULS estimation with robust corrections in structural equation models with ordinal variables. Psychological Methods, 21(3), 369–387. 10.1037/met0000093.supp
- Li C. H. (2021). Statistical estimation of structural equation models with a mixture of continuous and categorical observed variables. Behavior Research Methods, 53, 2191–2213. 10.3758/s13428-021-01547-z
- Lissitz R. W., Green S. B. (1975). Effect of the number of scale points on reliability: A Monte Carlo approach. Journal of Applied Psychology, 60(1), 10–13. 10.1037/h0076268
- Lord F. M., Novick M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.
- McDonald R. P. (1999). Test theory: A unified approach. Erlbaum.
- McNeish D. (2018). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23(3), 412–433. 10.1037/met0000144
- Micceri T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156–166.
- Murray A. L., Booth T., Eisner M., Obsuth I., Ribeaud D. (2019). Quantifying the strength of general factors in psychopathology: A comparison of CFA with maximum likelihood estimation, BSEM, and ESEM/EFA bifactor approaches. Journal of Personality Assessment, 101(6), 631–643. 10.1080/00223891.2018.1468338
- Nunnally J. C. (1978). Psychometric theory (2nd ed.). McGraw-Hill.
- OECD. (2019a). PISA 2018 assessment and analytical framework. OECD Publishing.
- OECD. (2019b). PISA 2018 technical report. OECD Publishing.
- Olvera Astivia O. L., Kroc E., Zumbo B. D. (2020). The role of item distributions on reliability estimation: The case of Cronbach’s coefficient alpha. Educational and Psychological Measurement, 80(5), 825–846. 10.1177/0013164420903770
- Raykov T., Marcoulides G. A. (2016). Scale reliability evaluation under multiple assumption violations. Structural Equation Modeling, 23(1–2), 302–313.
- Raykov T., Marcoulides G. A. (2019). Thanks coefficient alpha, we still need you! Educational and Psychological Measurement, 79(1), 200–210. 10.1177/0013164417725127
- Revelle W. (2020). Using R and the psych package to find ω [Computer software]. https://personality-project.org/r/tutorials/HowTo/omega.tutorial/omega.pdf
- Savalei V. (2010). Small sample statistics for incomplete nonnormal data: Extensions of complete data formulae and a Monte Carlo comparison. Structural Equation Modeling, 17(2), 241–264. 10.1080/10705511003659375
- Savalei V., Reise S. P. (2019). Don’t forget the model in your model-based reliability coefficients: A reply to McNeish (2018). Collabra: Psychology, 5(1), 36. 10.1525/collabra.247
- Schmid J., Leiman J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22(1), 53–61.
- Shapiro A., Ten Berge J. M. (2002). Statistical inference of minimum rank factor analysis. Psychometrika, 67(1), 79–94.
- Sheng Y., Sheng Z. (2012). Is coefficient alpha robust to non-normal data? Frontiers in Psychology, 3, 1–13. 10.3389/fpsyg.2012.00034
- Shultz G. S. (1993). A Monte Carlo study of the robustness of coefficient alpha [Master’s thesis]. University of Ottawa.
- Sijtsma K., Pfadt J. M. (2021). Part II: On the use, the misuse, and the very limited usefulness of Cronbach’s alpha: Discussing lower bounds and correlated errors. Psychometrika, 86, 843–860. 10.1007/s11336-021-09789-8
- Trinchera L., Marie N., Marcoulides G. A. (2018). A distribution free interval estimate for coefficient alpha. Structural Equation Modeling, 25(6), 876–887. 10.1080/10705511.2018.1431544
- Trizano-Hermosilla I., Alvarado J. M. (2016). Best alternatives to Cronbach’s alpha reliability in realistic conditions: Congeneric and asymmetrical measurements. Frontiers in Psychology, 7, 1–8. 10.3389/fpsyg.2016.00769
- Trizano-Hermosilla I., Gálvez-Nieto J. L., Alvarado J. M., Saiz J. L., Salvo-Garrido S. (2021). Reliability estimation in multidimensional scales: Comparing the bias of six estimators in measures with a bifactor structure. Frontiers in Psychology, 12, 508287.
- Vale C. D., Maurelli V. A. (1983). Simulating multivariate nonnormal distributions. Psychometrika, 48(3), 465–471. 10.1007/BF02293687
- Xiao L. F., Hau K. T. (2021). Alternatives to Cronbach’s alpha: Bias, sensitivity to mismatched item, scale strength, scale length, and sample size [Unpublished manuscript]. Faculty of Education, The Chinese University of Hong Kong.
- Yazici B., Yolacan S. (2007). A comparison of various tests of normality. Journal of Statistical Computation and Simulation, 77(2), 175–183. 10.1080/10629360600678310
- Zimmerman D. W., Zumbo B. D., Lalonde C. (1993). Coefficient alpha as an estimate of test reliability under violation of two assumptions. Educational and Psychological Measurement, 53, 33–49. 10.1177/07399863870092005
- Zinbarg R., Revelle W., Yovel I., Wen L. (2005). Cronbach’s α, Revelle’s β, and McDonald’s ωh: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1), 123–133. 10.1007/s11336-003-0974-7
- Zumbo B. D., Gadermann A. M., Zeisser C. (2007). Ordinal versions of coefficients alpha and theta for Likert rating scales. Journal of Modern Applied Statistical Methods, 6(1), 21–29. 10.22237/jmasm/1177992180
- Zumbo B. D., Kroc E. (2019). A measurement is a choice and Stevens’ scales of measurement do not help make it: A response to Chalmers. Educational and Psychological Measurement, 79(6), 1184–1197. 10.1177/0013164419844305