Abstract
In this article, the second of a series on rating scale translation, adaptation, and psychometric testing, we focus on reliability testing of a rating scale. Reliability refers to the consistency of results when the scale is reapplied to or completed by the same individual again under the same conditions. We discuss three key types of reliability: internal consistency, test–retest reliability, and inter-rater reliability testing. The appropriate measure for reporting internal consistency is Cronbach’s alpha (α); for test–retest reliability, it is the intraclass correlation coefficient (ICC) for continuous variables and intraclass kappa for categorical variables. For inter-rater reliability, the preferred measure is either Cohen’s kappa (κ) in case of categorical variables with two raters or the ICC for continuous variables; depending on the randomness in the selection of raters, different statistical models are used for computing the ICC. This article presents these concepts with simple, non-technical explanations. We also address practical considerations for conducting reliability tests, explain how to choose the right statistical index for each type of reliability, and clarify common misapplications. Finally, we offer guidance on interpreting and reporting reliability test results in a manuscript, along with instructions on conducting these analyses using IBM SPSS Statistics.
Keywords: Internal consistency, inter-rater reliability, intraclass correlation coefficient, psychometric testing, reliability testing, split-half reliability
In a previous article in this series, we described the process of translation and adaptation of a rating scale.[1] Any rating scale, whether developed de novo or adapted from an existing one, must undergo reliability testing after the contents are finalized. Reliability of a test, tool, or observation refers to the degree to which the test results are consistent and reproducible. In other words, reliability refers to how stable the results would be if the test were repeated under the same conditions on the same individual, whether administered by different raters or clinicians or completed by the patient on more than one occasion. In scale psychometrics, we consider a tool reliable if reapplying it to the same individual several times yields highly similar results each time. Readers must understand an important nuance here: reliability does not guarantee validity. A tool may be reliable but not valid if it measures a construct consistently, but consistently measures the wrong construct.
There are three essential types of reliability for any tool. The first one is termed test–retest reliability; as the name suggests, we are assessing the consistency of test results over time. A good tool should also measure a construct consistently regardless of who administers it; this is referred to as inter-rater reliability (IRR). Finally, a tool can be expected to measure a construct consistently only if the items within the tool measure the same underlying construct or characteristic; in other words, items in a tool should be intercorrelated. This aspect of reliability is referred to as internal consistency.
Below, we explain how to measure each of the above types of reliability, how to interpret and report them in a research paper, and outline practical issues and common pitfalls in this process to illustrate the importance of choosing the correct indices of reliability testing.
INTERNAL CONSISTENCY
High internal consistency is desirable for any test; this implies that all items within a test are internally correlated and that all items measure the same broader construct. Conversely, poor internal consistency may mean that the items are poorly related to each other or measure different constructs. Typically, internal consistency is calculated using an index called Cronbach’s alpha (α), named after the American psychologist Lee Cronbach. The calculations involve dividing the average shared variance (covariance) by the average total variance; in other words, a high Cronbach’s alpha means that the item covariance is high relative to the item variance.[2,3]
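The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS routine: it uses the common variance-based form of the formula, α = k/(k − 1) × (1 − sum of item variances / variance of total scores), which is algebraically equivalent to the covariance-based description. The function name `cronbach_alpha` and the five-respondent, four-item data set are hypothetical.

```python
from statistics import variance

def cronbach_alpha(scores):
    """scores: list of respondents, each a list of k item scores."""
    k = len(scores[0])
    # Sample variance of each item across respondents
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total score
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 respondents x 4 items (not from any real study)
data = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [4, 4, 4, 3],
]
alpha = cronbach_alpha(data)
print(round(alpha, 2))
```

For these made-up scores, the items rise and fall together across respondents, so the variance of the total score is large relative to the summed item variances and alpha is high.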
Conventional thresholds for interpreting Cronbach’s alpha are shown in Table 1; values below .50 are considered unacceptable. What about a Cronbach’s alpha value >.95? Such a very high value has been flagged as a concern because it suggests excessive overlap and possible redundancy among scale items; that is, several items may be measuring the same thing.[2] Readers must note that the Cronbach’s alpha value for a tool is specific to a study sample; hence, authors must report this value for every tool they use in the study. Further, Cronbach’s alpha can also be reported separately for each sub-domain of a tool, if the sub-domains are independent [Table 1].
Table 1.
Interpreting and reporting common reliability indices
| Index | Interpretation thresholds | Sample reporting statement in a manuscript | How to run the test using IBM SPSS Statistics |
|---|---|---|---|
| Cronbach’s alpha (α) | <.50 = unacceptable; .51 to .60 = poor; .61 to .70 = questionable; .71 to .80 = acceptable; .81 to .90 = good; .91 to .95 = excellent | For a tool with eight sub-domains, a sample reporting statement for Cronbach’s alpha may read thus: “Internal consistencies of the Suicidal Narrative Inventory (SNI)-38 total and subscale scores were as follows: total scores (α=.87), thwarted belongingness (α=.85), perceived burdensomeness (α=.88), goal reengagement (α=.91), goal disengagement (α=.89), entrapment (α=.67), perfection (α=.86), and fear of humiliation (α=.88).” | Analyse→Scale→Reliability analysis→Statistics (dialog box)→select Inter-item Correlations. The Cronbach’s α will be shown in the “Reliability Statistics” output box. |
| Intraclass correlation coefficient (ICC) | <.50 = poor; .50 to .75 = moderate; .76 to .90 = good; >.90 = excellent | For test–retest reliability, the ICC can be reported thus: “The tool ICC estimates with 95% confidence intervals (CI) were calculated using IBM SPSS Statistics for Windows, version 23 (IBM Corp., Armonk, N.Y., USA), based on a 2-way mixed-effects model, absolute-agreement type, and mean-rating (k=2 for two repeated measurements). The ICC value was .87 (95% CI, .84 to .90).” | Analyse→Scale→Reliability analysis→Statistics (dialog box)→select Intra-class Correlation Coefficient. The output box will be titled “Intraclass correlation coefficient” and will depict two types of ICC: single measure and average measure. The single-measure ICC indicates the reliability of ratings based on a single (typical) rater; the average-measure ICC indicates the reliability of ratings based on multiple raters averaged together. Researchers can choose the appropriate one for their study. |
| Cohen’s kappa (κ) | 0 to .20 = none; .21 to .39 = minimal; .40 to .59 = weak; .60 to .79 = moderate; .80 to .90 = strong; >.90 = almost perfect agreement | For a tool with five items, reporting of Cohen’s κ for inter-rater reliability can be done thus: “Cohen’s κ was run to check the agreement between the two raters on categorical ratings of individual tool items. The average value of Cohen’s κ coefficient was .85 (95% CI, .76 to .94); the item-wise values ranged from .78 (for item 3; 95% CI, .75 to .81) to .89 (for item 4; 95% CI, .86 to .92). Thus, the overall inter-rater reliability ratings were found to be strong.” | Analyse→Descriptive statistics→Crosstabs (dialog box). From the Crosstabs dialog box, two options should be selected: (1) Statistics (dialog box)→select Kappa; (2) Cells (dialog box)→select Observed Counts. The value of Cohen’s kappa will be shown in the output box titled “Symmetric measures”. |
There are several limitations to Cronbach’s alpha, as outlined in a widely quoted commentary.[4] A central limitation is that the measure is overly sensitive to the number of tool items; a small tool (e.g., <5 items) may yield a falsely low value of Cronbach’s alpha; conversely, the alpha may be falsely high for a large tool. This has led researchers to question the utility of a measure whose value may have little relation to the content of the tool; some have even concluded that Cronbach’s alpha is not a measure of internal consistency.[4] Notwithstanding these limitations, the measure continues to be widely used, perhaps because of the vast body of research linking alpha to the factor structure of a test.[5,6] Another speculative reason may be the growing gap between psychology and psychometrics, precluding the development of better psychometric measures that more accurately reflect the tool’s item correlatedness.[4]
As a parting note in this section, a split-half reliability test is preferred over Cronbach’s alpha to estimate internal consistency when a tool has many items (e.g., 100 or more); this circumvents an important limitation of the Cronbach’s alpha statistic, namely its sensitivity to the number of items. In practice, the original tool is split into two halves, and the correlation between the two parallel forms thus created is computed using Pearson’s correlation coefficient. Disadvantages of the split-half method include possible arbitrariness in splitting the tool, which can affect the value of the test statistic, and underestimation of the true reliability (as only half the tool is being tested). Finally, Cronbach’s alpha is appropriate for continuous data; for tools with dichotomous item data, the Kuder-Richardson Formula 20 (KR-20), which is equivalent to Cronbach’s alpha computed on dichotomous items, is used.[7]
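The split-half procedure can be illustrated with a short Python sketch. The odd-even split used here is only one of many possible ways to halve a tool, which is exactly the arbitrariness noted above. The Spearman-Brown step-up, 2r/(1 + r), is an addition not discussed in this article; it is the conventional adjustment for the underestimation that arises from correlating half-length forms. The function names and the data are hypothetical.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def split_half(scores):
    """Odd-even split; returns (half-test r, Spearman-Brown adjusted r)."""
    odd_half = [sum(row[0::2]) for row in scores]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_half, even_half)
    r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown step-up (assumed addition)
    return r_half, r_full

# Hypothetical data: 5 respondents x 4 items (not from any real study)
data = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [4, 4, 4, 3],
]
r_half, r_full = split_half(data)
print(round(r_half, 2), round(r_full, 2))
```

A different split (for example, first half versus second half) would generally give a different coefficient for the same data.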
TEST–RETEST RELIABILITY
As explained before, test–retest reliability is a statistical measure of how consistent the test results are over time. In practice, the test–retest reliability of a tool is estimated by administering the same tool to the same individual at two different points in time and calculating the correlation coefficient or agreement between the scores. One question arises here: How far should the second administration be spaced out? The answer would depend on the nature of the underlying construct being measured; specifically, how stable it is and what the impact of the first administration is on the second one.
To illustrate further, if the underlying construct being measured is highly dynamic (e.g., blood pressure), the retesting time window should be in minutes. However, for more stable constructs such as personality traits, re-administration can happen after a month, or even longer. If a cognitive construct is being measured, it may be advisable to give a sufficient gap between administrations to prevent memory and practice effects from influencing re-administration scores.[8]
Which correlation measure should be used for test–retest reliability testing? Can we use conventional Pearson’s or Spearman’s correlation? The answer is no. Why? Because Pearson’s correlation (r) measures only the linear relationship between two sets of scores, that is, whether the second score changes in step with the first; if the changes are of a similar magnitude and direction, the correlation will be high even when every score has shifted. For example, we test a personality construct on five individuals and obtain scores of 30, 35, 40, 45, and 50. Two weeks later, we retest the same individuals; this time, we get the following scores: 40, 45, 50, 55, and 60. If we run a Pearson’s r as a measure of test–retest reliability, the value will be 1; that is, perfect correlation. But, for a stable construct like personality, the test obviously cannot be considered reliable because every score rose by 10 points on readministration, and Pearson’s r does not account for this systematic shift.
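The worked example can be checked directly. The sketch below computes Pearson’s r by hand for the five pairs of scores given in the text and also reports the mean shift between administrations; the helper name `pearson_r` is illustrative.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

test = [30, 35, 40, 45, 50]    # first administration (from the text)
retest = [40, 45, 50, 55, 60]  # second administration (from the text)

r = pearson_r(test, retest)
mean_shift = mean(retest) - mean(test)
print(r, mean_shift)  # r is 1.0 although every score moved by 10 points
```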
In this scenario, can we run a paired t-test and look for significant differences between the two sets of scores? Again, the answer is no. Why? The paired t-test looks for significant differences between means. If the individual scores on re-administration vary widely but in opposite directions, the means for the first and second administrations would still be similar, and the scores would not differ significantly. Once again, we will incorrectly conclude that the test is reliable.
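A small sketch makes this pitfall concrete. The retest scores below are hypothetical and deliberately extreme: each individual’s score changes, but the changes cancel out. Only the paired t statistic is computed (mean difference divided by its standard error); a p-value would need a statistics library and is omitted.

```python
from math import sqrt
from statistics import mean, stdev

test = [30, 35, 40, 45, 50]
retest = [50, 45, 40, 35, 30]  # hypothetical: changes in opposite directions

diffs = [b - a for a, b in zip(test, retest)]

# Paired t statistic: mean difference over its standard error
t_stat = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Average size of the individual change, ignoring direction
mean_abs_change = mean([abs(d) for d in diffs])

print(t_stat, mean_abs_change)
```

Here the t statistic is 0 because the two means are identical, yet the average individual score changed by 12 points; a non-significant paired t-test therefore says nothing about whether individuals reproduced their own scores.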
So, which is the correct measure for test–retest reliability? The answer is the intraclass correlation coefficient (ICC). This measure checks for agreement or correlation between values within a class of data (e.g., correlations within repeated measurements of weight). It is also sensitive to the extent to which subjects preserve their ranking order across repeated measurements. ICC values typically range from 0 to 1; higher values indicate greater reliability. There are different forms of ICC for test–retest and inter-rater reliability; a discussion on when to use which is outside the scope of this article, and interested readers are referred to a popular guideline on the topic.[9]
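To show how the ICC behaves on the personality example above (scores of 30 to 50 at test and 40 to 60 at retest), the following sketch computes two-way ICCs from the ANOVA mean squares. It is an illustration under stated assumptions, not a substitute for the SPSS output: the formulas are the standard two-way single-measurement and average-measurement forms as tabulated in the guideline cited above,[9] and the function name `icc_two_way` and dictionary keys are hypothetical.

```python
def icc_two_way(ratings):
    """ratings: list of n subjects, each a list of k measurements."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects (rows), occasions (columns), error
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))

    return {
        # Consistency: ignores a systematic shift between occasions
        "consistency_single": (msr - mse) / (msr + (k - 1) * mse),
        # Absolute agreement: penalises a systematic shift
        "agreement_single": (msr - mse)
        / (msr + (k - 1) * mse + (k / n) * (msc - mse)),
        "agreement_average": (msr - mse) / (msr + (msc - mse) / n),
    }

# Test and retest scores from the worked example in the text
scores = [[30, 40], [35, 45], [40, 50], [45, 55], [50, 60]]
icc = icc_two_way(scores)
for name, value in icc.items():
    print(name, round(value, 2))
```

For these data, the consistency ICC equals 1 (like Pearson’s r), whereas the absolute-agreement single-measurement ICC is about .56, because the 10-point shift between administrations is counted against reliability.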
Interpreting the ICC in a research paper is simple. However, complete reporting of the ICC is slightly more complex and should include the model (one-way or two-way; random effects or mixed effects), type (single rater/measurement or mean of k raters/measurements), and definition of relationship (absolute agreement or consistency) considered to be appropriate [see Table 1].[9]
The ICC works for continuous variables. What if the variable in question is categorical, such as treatment response (yes vs no)? In such cases, the appropriate statistic for test–retest reliability is Cohen’s kappa (κ), with a more recent trend towards using the intraclass kappa.[10,11] Cohen’s kappa is discussed in the following section on inter-rater reliability.
Finally, readers may note that a Bland-Altman plot of the difference between two measurements against their average values has also been used as a measure of test–retest reliability. However, much like the paired t-test, the Bland-Altman plot represents a method of analysing agreement between raters or measurement methods. An appropriate measure of reliability should capture the extent of correlation and agreement between observations or measurements; this is where the ICC scores over the paired t-test and the Bland-Altman plot.[12]
INTER-RATER RELIABILITY
The IRR measures the consistency and agreement between two or more raters. Why is the IRR necessary in research? Because different outcome raters collecting data may experience and interpret the phenomena being measured differently; this can introduce bias in outcome ratings. A tip for researchers is that the IRR check needs to be carried out separately for every tool used in a study, if there are multiple outcome raters involved, regardless of the extent of involvement of each rater; even if a rater scored only 10% of the study sample, an IRR exercise should still be performed. Another important caveat here is that the IRR exercise must be completed before study commencement and should involve the same patient being rated at the same time by all outcome raters involved in the IRR exercise.
Many statistics have been proposed to measure inter-rater reliability; the choice depends on the type of construct being measured. For continuous constructs such as depression or quality of life ratings, an ICC (as explained in the earlier section, but now based on the mean of k raters rather than k measurements) can be used. For categorical constructs such as depression (yes/no) or cognitive impairment (yes/no), and when two raters are involved, Cohen’s kappa (κ) is widely preferred; its values may range from −1 to +1, similar to most correlation statistics. A value of 0 indicates the amount of agreement that can be expected by random chance, while 1 represents perfect agreement. For brevity, we only discuss the Cohen’s κ statistic here; interested readers are referred elsewhere for a discussion on other IRR statistics or modifications of kappa when rating ordinal variables or when more than two raters are involved.[13]
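Cohen’s κ compares the observed proportion of agreement with the proportion expected by chance from each rater’s marginal frequencies: κ = (p_o − p_e)/(1 − p_e). The sketch below applies this to hypothetical yes/no depression ratings of ten patients by two raters; the function name `cohens_kappa` and the ratings are illustrative only.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical ratings of the same subjects."""
    n = len(rater_a)
    # Observed agreement: proportion of subjects rated identically
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, per category
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of 10 patients (depression: yes/no)
rater_1 = ["yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no", "no"]
rater_2 = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "yes"]

kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 2))
```

In this example, the raters agree on 8 of 10 patients (80%), but half of that agreement would be expected by chance, so κ is .60, noticeably lower than the raw percentage agreement.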
Interpretation thresholds and reporting of Cohen’s κ are described in Table 1. However, these thresholds have been critiqued for being too lenient; in health care settings, a value of at least .60, and preferably .80, has been suggested for greater confidence in study results, given the implications for practice. Values lower than this may indicate the need to retrain raters or reconsider the outcome tool.[13] This is easily understood when considering the following fact: if Cohen’s κ is <0.6, it means over 40% of the data under consideration are erroneous; with so much error in results, even a statistically significant finding may hold little meaning.
Box 1 summarises key considerations in the reliability testing of a tool.
Box 1.
Important considerations in reliability testing of a tool/instrument/rating scale
• Good reliability does not imply validity.
• Three types of reliability metrics exist for any tool: internal consistency, test–retest reliability, and inter-rater reliability.
• Internal consistency is expressed using Cronbach’s alpha (α). It is specific to each tool as applied to the study sample. Investigators must, therefore, report sample-specific α for every tool in their research.
• The time window for test–retest reliability depends on the nature of the underlying construct being measured; particularly, its temporal stability and the effect of the first administration on the second one.
• The appropriate measure for test–retest reliability is the intraclass correlation coefficient (ICC) for continuous variables and intraclass kappa for categorical variables.
• If there is more than one rater in a study, an inter-rater reliability (IRR) exercise must be carried out separately for every tool used, regardless of how involved (in terms of proportion of ratings done) and experienced each rater is.
• The appropriate measure of inter-rater agreement for categorical variables is Cohen’s kappa (κ). For continuous variables, the ICC for multiple raters can be used.[9]
CONCLUSION
Reliability analysis is a critical component of tool psychometrics and involves testing internal consistency, test–retest reliability, and inter-rater reliability. Choosing the right indices for each reliability subtype is important to draw valid conclusions. Proper interpretation and reporting of reliability indices are necessary in any paper; manuscripts are liable to get rejected if the tool reliability indices are suboptimal because the findings are likely to be compromised. We encourage researchers to discuss the approach to reliability testing and selection of reliability indices with a biostatistician and involve them at the protocol development stage itself. This will support study design, data collection, analysis, interpretation, and reporting; all these aspects are key to the internal validity of any study.
Conflicts of interest
There are no conflicts of interest.
Acknowledgements
The authors acknowledge discussions on the electronic Journal Club (eJC) platform, moderated by Prof. Chittaranjan Andrade, which informed the contents of this manuscript.
Funding Statement
Nil.
REFERENCES
1. Grover S, Menon V, Gupta S, Vidhukumar K, Indu PV, Chacko D. Translation and adaptation of rating scales. Indian J Psychiatry. 2025;67:643–7. doi: 10.4103/indianjpsychiatry_532_25.
2. Tavakol M, Dennick R. Making sense of Cronbach’s alpha. Int J Med Educ. 2011;2:53–5. doi: 10.5116/ijme.4dfb.8dfd.
3. Taber KS. The use of Cronbach’s alpha when developing and reporting research instruments in science education. Res Sci Educ. 2018;48:1273–96.
4. Sijtsma K. On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika. 2009;74:107–20. doi: 10.1007/s11336-008-9101-0.
5. Cronbach LJ. Internal consistency of tests: Analyses old and new. Psychometrika. 1988;53:63–70.
6. Cortina JM. What is coefficient alpha? An examination of theory and applications. J Appl Psychol. 1993;78:98–104.
7. Kuder GF, Richardson MW. The theory of the estimation of test reliability. Psychometrika. 1937;2:151–60.
8. Chiu EC, Koh CL, Tsai CY, Lu WS, Sheu CF, Hsueh IP, et al. Practice effects and test-re-test reliability of the five digit test in patients with stroke over four serial assessments. Brain Inj. 2014;28:1726–33. doi: 10.3109/02699052.2014.947618.
9. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15:155–63. doi: 10.1016/j.jcm.2016.02.012.
10. Chmura Kraemer H, Periyakoil VS, Noda A. Kappa coefficients in medical research. Stat Med. 2002;21:2109–29. doi: 10.1002/sim.1180.
11. Fisher DG, Reynolds GL, Neri E, Noda A, Kraemer HC. Measuring Test-Retest Reliability: The Intraclass Kappa. Available from: https://proceedings.wuss.org/2019/65_Final_Paper_PDF.pdf.
12. Sainani KL. Reliability statistics. PM and R. 2017;9:622–8. doi: 10.1016/j.pmrj.2017.05.001.
13. McHugh ML. Interrater reliability: The kappa statistic. Biochem Med (Zagreb). 2012;22:276–82.
