Abstract
An essential question when computing test–retest and alternate forms reliability coefficients is how many days there should be between tests. This article uses data from reading and math computerized adaptive tests to explore how the number of days between tests impacts alternate forms reliability coefficients. Results suggest that the highest alternate forms reliability coefficients were obtained when the second test was administered at least 2 to 3 weeks after the first test. Although reliability coefficients after this amount of time were often similar, results suggested a potential tradeoff in waiting longer to retest because student ability tended to grow with time. These findings indicate that, if keeping student ability similar is a concern, the best time to retest is shortly after 3 weeks have passed since the first test. Additional analyses suggested that alternate forms reliability coefficients were lower when tests were shorter and that narrowing the first test ability distribution of examinees also impacted estimates. Results did not appear to be largely impacted by differences in first test average ability, student demographics, or whether the student took the test under standard or extended time. For math and reading tests like the ones analyzed in this article, the optimal retest interval therefore appears to be shortly after 3 weeks have passed since the first test.
Keywords: reliability, alternate forms, test–retest, time interval, computerized adaptive testing
One key part of developing a quality assessment is ensuring that the assessment yields sufficiently reliable scores. There are many methods for estimating reliability, including internal consistency, test–retest, alternate forms, generalizability theory, interrater, item response theory (IRT), and other factor analytic model-based methods (see Haertel, 2006; Raykov & Marcoulides, 2011). The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014) stress the importance of using assessments with high reliability and provide some factors to consider with different coefficients. When estimating test–retest or alternate forms reliability coefficients, a frequently asked question is how many days there should be between tests. Well-known measurement texts (e.g., Crocker & Algina, 1986; Haertel, 2006; Mehrens & Lehmann, 1991; Popham, 2006; Raykov & Marcoulides, 2011), as well as the Standards for Educational and Psychological Testing, do not define an ideal timeline for these coefficients. They instead offer general guidance that the timeframe should be long enough to avoid recall effects but short enough to mitigate changes due to increases in ability. A review of the research literature also finds that recommended timeframes are not always consistent. For example, in measuring personality, Cattell et al. (1970) argued for a 2-month retesting interval, whereas in later work Cattell (1986) suggested a 1-month retesting interval. Deyo et al. (1991) suggested a 1- to 2-week testing interval for patient-reported outcome measurement, while Nunnally and Bernstein (1994) suggested a 2-week interval for achievement-type tests. Bardhoshi and Erford (2017) also suggested a 2-week interval for instruments used in counseling and development.
A small number of studies have directly investigated how the number of days between tests impacts test–retest and alternate forms reliability coefficients. Watson (2004) examined how a 2-month versus 2½-year retesting period impacted test–retest coefficients for the Big Five and trait affectivity instruments using data from 392 to 462 psychology students. He found that coefficients tended to be higher with a shorter time interval. Chmielewski and Watson (2009) examined how a 2-week versus a 2-month retest interval impacted test–retest reliability coefficients for instruments measuring the Big Five, trait affectivity, and personality disorders using data from 447 to 465 psychology students. They found that reliability estimates were very similar for both testing intervals. Similarly, Marx et al. (2003) compared a 2-day retest interval versus a 2-week interval for four knee rating scales and the eight subscales of the Short Form-36 using 70 patients with knee disorders and did not find statistically significant differences in reliability. Backhaus et al. (2002) compared a 2-day time interval with a longer interval that averaged 45 days for the Pittsburgh Sleep Quality Index in 80 patients with primary insomnia and found that test–retest reliability was higher with the shorter interval. Yeo et al. (2012) used a latent growth curve model to examine how reliability changed over time for a set of reading curriculum-based measures and observed that reliability sometimes went up and down as the number of weeks between tests changed. Liao and Qu (2010) looked at alternate forms reliability for the TOEIC speaking and writing tests using time intervals of 1 to 30, 31 to 60, 61 to 90, 91 to 180, and 181 to 365 days between tests. They found that reliability increased as the time interval between tests increased.
Besides the studies that have formally looked at how test–retest and alternate forms reliability estimates are impacted by the number of days between tests, many meta-analyses have been performed that can provide some indication of how the time interval between tests may impact test–retest reliability for various instruments. For example, meta-analyses have examined test–retest coefficients for the Big Five (Gnambs, 2014; Viswesvaran & Ones, 2000), the Beck Depression Inventory (Erford et al., 2016; Yin & Fan, 2000), the Beck Anxiety Inventory (de Ayala et al., 2005), the Minnesota Multiphasic Personality Inventory (Vacha-Haase et al., 2001), the Myers-Briggs Type Indicator (Capraro & Capraro, 2002), and the Hamilton Rating Scale for Depression (Trajković et al., 2011). These studies report a range of days between tests and show that reliability coefficients change with different test intervals, often, but not always, finding that reliability decreases as the number of days between tests increases.
The results from prior research suggest that the primary focus has been on test–retest reliability rather than alternate forms reliability, that a very small number of studies have focused on educational tests, and that reliability coefficients may go up, go down, or remain similar as the time interval is increased depending on the trait being measured, the test instrument utilized, and the respondents used in the analyses. Prior research also suggests that most studies examining how the number of days between tests impacts reliability coefficients have used small to moderate samples and have compared estimates from two different time intervals. The study by Liao and Qu (2010) is a notable exception in that it used a larger sample and five different time intervals. However, even in this study, the authors did not present results with coefficients estimated for each day between tests, presumably because sample sizes would not support these analyses.
Prior studies that have looked at how test–retest and alternate forms reliability coefficients are impacted by the time interval between tests have also not included any assessments that are computerized adaptive tests (CATs). There are additional complications when calculating these types of coefficients for CATs. Since CATs dynamically select items based on examinee performance, an examinee is unlikely to see the same items when retesting, and the test forms often differ in difficulty due to changes in student ability. Despite these complications, alternate forms reliability coefficients can be computed for CATs if data are collected from two testing occasions. CATs also differ from fixed-form tests in that they are often shorter and have more consistent, and usually lower, measurement error across the range of ability measured by the assessment. Lower measurement error across assessment occasions can reduce the impact of regression to the mean for high and low scorers, which may result in different reliability estimates for fixed-form tests and CATs. The reliability index most often computed for CATs tends to be an IRT model-based estimate, computed as 1 minus the ratio of the average squared estimated measurement error across individual test takers to the variance of the test scores. However, some CAT programs also compute and report alternate forms reliability as a traditional reliability estimate alongside IRT model-based estimates when more than one reliability estimate is reported. External reviewers for the U.S. Department of Education, the National Center on Intensive Intervention, and the Buros Center for Testing typically look for more than one reliability estimate when they review submitted assessments. Alternate forms reliability is a commonly used second estimate of reliability because other traditional measures of reliability, such as Cronbach’s (1951) coefficient alpha, can be difficult to compute for CATs given that examinees usually receive different test items. In contrast, if the test is administered on more than one occasion, it is easy to correlate scores from the two administrations to estimate alternate forms reliability.
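In our notation, a common form of this model-based estimate can be written as

$$
\hat{\rho}_{\mathrm{IRT}} = 1 - \frac{\tfrac{1}{N}\sum_{i=1}^{N} \widehat{SE}_i^{\,2}}{\hat{\sigma}^2_{\hat{\theta}}},
$$

where $\widehat{SE}_i$ is the estimated standard error of measurement for examinee $i$, $N$ is the number of examinees, and $\hat{\sigma}^2_{\hat{\theta}}$ is the variance of the estimated scores. The alternate forms coefficient, in contrast, is simply the Pearson correlation between the scores from the two administrations.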
The purpose of this article is to explore how the number of days between tests impacts alternate forms reliability coefficients for K-12 reading and math CATs. The specific question guiding our research is the following:
Research Question 1: How does the number of days between tests impact alternate forms reliability estimates for reading and math CATs?
In the next section of this article, we provide an overview of some factors that can impact test–retest and alternate forms reliability coefficients to help provide additional context on the use and interpretation of these types of coefficients. We then describe the data and methods that we used to investigate our research question. The following section presents the results of our analyses examining how the number of days between tests impacts alternate forms reliability estimates. We also present results from several analyses examining the sensitivity of our findings to some factors that have been shown to impact these coefficients. The study concludes with a discussion of what these results may mean for others who calculate alternate forms reliability coefficients to estimate the precision of test scores.
Factors That Influence Reliability Coefficients
There are a variety of well-known factors that can influence test–retest and alternate forms reliability coefficients besides the time interval between tests. These include test length, item types and scoring methods, the group of examinees used in the analyses, test administration conditions, test content, and the construct measured by the assessment (see Crocker & Algina, 1986; Frisbie, 1988; Kieffer & MacDonald, 2011; Traub & Rowley, 1991; Yin & Fan, 2000). We briefly describe how each factor can influence reliability estimates.
Test Length
A general finding when estimating reliability is that tests with more items are more reliable than tests with fewer items (Crocker & Algina, 1986; Frisbie, 1988; Kieffer & MacDonald, 2011; Traub & Rowley, 1991; Yin & Fan, 2000). In fact, a common criticism of some reliability coefficients, like Cronbach’s (1951) coefficient alpha, is that they can be increased simply by adding items.
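The Spearman-Brown prophecy formula (see, e.g., Crocker & Algina, 1986) expresses this relationship: lengthening a test by a factor of $k$ with comparable items is expected to change reliability from $\rho$ to

$$
\rho_k = \frac{k\rho}{1 + (k - 1)\rho}.
$$

For example, lengthening a test with reliability 0.80 by a factor of 1.5 would be expected to yield a reliability of about $1.5(0.80)/[1 + 0.5(0.80)] \approx 0.86$, all else being equal.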
Item Types and Scoring Methods
Another important finding when measuring reliability is that there can be tradeoffs between the type of items used on an assessment and the method used to score those items (Haladyna & Rodriguez, 2013; Kieffer & MacDonald, 2011; Traub & Rowley, 1991). Generally, when items are objectively scored there is less error in scoring, which means that objective scoring tends to lead to higher reliability than subjective scoring by raters (Traub & Rowley, 1991). However, one typically can get more information from a polytomously scored item (Yen & Fitzpatrick, 2006), which means that if two tests have the same number of items but one test has dichotomously scored items and the other has polytomously scored items, the test with the polytomously scored items often will have higher reliability.
Test Administration Conditions
Test administration conditions can also impact reliability: when the administration of the test is more standardized, reliability will tend to be higher than when the administration is less standardized (Frisbie, 1988; Traub & Rowley, 1991). For example, if the instructions given to examinees are consistent, the locations and times that examinees take the assessment are similar, and the people administering the tests have the same training, reliability estimates will tend to be higher than when these factors are more varied. Likewise, if a test has a time limit, this can influence reliability, especially if the time provided is not enough for many examinees to complete the assessment (Crocker & Algina, 1986; Frisbie, 1988; Traub & Rowley, 1991). In these cases, reliability estimates may be lower since the test measures not only examinee ability but also examinee test-taking speed.
Group of Examinees Used in the Analyses
Another common finding when estimating reliability is that reliability tends to be higher when a group of examinees is more heterogeneous than when the group is more homogeneous (Crocker & Algina, 1986; Frisbie, 1988; Kieffer & MacDonald, 2011; Traub & Rowley, 1991; Yin & Fan, 2000). The heterogeneity of the group of examinees being tested matters because reliability is a function of the variability of test scores; when test score variability is small, reliability tends to be lower. It is also possible to get different reliability estimates if the samples of examinees used in the analyses differ in other ways. For example, one may get different reliability estimates if the sample includes only males, only females, or a mixture of the two groups, since the ability distributions of these samples may not be the same. That reliability can change with different samples is why reliability is a property of test scores and not a property of the test itself (Yin & Fan, 2000). This is important to keep in mind when computing test–retest and alternate forms reliability, since these coefficients tend to be based on convenience samples that may differ in important ways from the full distribution of test takers.
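To illustrate this point, the following minimal simulation sketch (our own illustration, not an analysis of the study data) correlates two hypothetical parallel forms in a full sample and in a sample restricted to a narrow band of first-form scores:

```r
# Illustrative simulation: two parallel forms built to have reliability near 0.85
set.seed(123)
n     <- 100000
theta <- rnorm(n)                      # true ability
form1 <- theta + rnorm(n, sd = 0.42)   # error SD chosen so the correlation is ~0.85
form2 <- theta + rnorm(n, sd = 0.42)

cor(form1, form2)                      # heterogeneous (full) group: about 0.85

narrow <- abs(form1) < 0.5             # keep only a narrow band of form 1 scores
cor(form1[narrow], form2[narrow])      # homogeneous group: noticeably lower
```

The same underlying scores yield a visibly smaller coefficient once score variability is restricted.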
Test Content
Test content can also influence reliability. If the content of the test ends up being overly hard or easy, then reliability will be lower than when test content has moderate difficulty (Frisbie, 1988). Likewise, examinees’ exposure to the content can influence reliability, as reliability tends to be lower when examinees do not have a sufficient opportunity to learn test content. Differences in test content also introduce another factor to consider when computing alternate forms reliability, as score differences may be a function of content differences and a lack of form equivalence when using alternate forms (Crocker & Algina, 1986; Traub & Rowley, 1991). However, if alternate forms are available and equivalent, then using alternate forms has advantages over repeating the same form across administrations, because examinees may remember the items from the first administration and give the same item responses on the second administration (Crocker & Algina, 1986; Traub & Rowley, 1991). When examinees remember questions and give the same answers, the errors of measurement can be correlated across administrations. Correlated errors of measurement are undesirable because test–retest and alternate forms analyses assume independent administrations. How long examinees can recall test content therefore influences the time interval that one should use when computing test–retest reliability coefficients, as one wants to minimize recall effects if forms are identical across administrations.
Construct Measured by the Assessment
The construct measured by the assessment can also influence reliability as some constructs are easier to measure and are more stable than other constructs (see Bardhoshi & Erford, 2017; Cattell, 1986; Cattell et al., 1970; Deyo et al., 1991; Nunnally & Bernstein, 1994). For example, an instrument measuring achievement may be less stable and change more quickly than something like personality. Tests focused on constructs that are harder to measure tend to have lower reliability than tests focused on constructs that are easier to measure. Likewise, tests that focus on constructs that are less stable tend to have lower test–retest and alternate forms reliability than tests focused on more stable constructs. How quickly the ability being measured changes is a key factor to consider when thinking about the time interval to use when computing test–retest and alternate forms reliability coefficients. Typically, one wants the time interval to be short enough that the ability being measured would not change between administrations.
Data and Methods
Data for this study came from two K-12 reading and math CATs. We use data from both the reading and math CATs since Frisbie (1988) points out that test content may influence reliability estimates. These CATs are used to measure student progress and growth in schools throughout the United States. Each 34-item, Rasch (1960)-based CAT has a grade-specific test blueprint, which the CAT algorithm uses as content constraints. The reading CATs consist of vocabulary-in-context items and short reading selections with associated multiple-choice items. The math CATs consist of multiple-choice items measuring four different domains. In addition to the content constraints, the CATs use a version of the randomesque technique (Kingsbury & Zara, 1989) to control item exposure. Items are selected during the CATs so that examinees have an approximate 67% chance of answering each item correctly, subject to the content and item exposure constraints. The decision to use a 67% correct rule was made to give examinees a reasonable chance to answer each item correctly, to minimize the potential impact of guessing, and to help improve student motivation and reduce test anxiety. The CATs are scored using maximum a posteriori (MAP) estimation with a normal prior whose mean equals the grade-specific starting point and whose variance is 2 until an examinee gets at least one item right and one item wrong. Once an examinee gets one item right and one item wrong, maximum likelihood estimation is used. Each CAT is terminated when it reaches 34 items.
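To make the 67% correct rule concrete, the sketch below (our illustration of the implied arithmetic under the Rasch model, not the operational item selection code) shows the item difficulty that yields a 0.67 probability of a correct response for a given interim ability estimate:

```r
# Rasch probability of a correct response for ability theta and item difficulty b
rasch_p <- function(theta, b) 1 / (1 + exp(-(theta - b)))

# Difficulty implied by a target probability correct; with p_target = 0.67 the
# selected item sits about 0.71 logits below the current ability estimate
target_difficulty <- function(theta, p_target = 0.67) {
  theta - log(p_target / (1 - p_target))
}

b <- target_difficulty(theta = 0.5)
rasch_p(theta = 0.5, b = b)   # approximately 0.67
```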
An important feature of these assessments is that teachers can administer them whenever they choose. The most common administration schedule is taking the assessment once in the fall, once in the winter, and once in the spring, but other administration schedules with varying numbers of days between tests are also used. We estimate alternate forms reliability by correlating the final Rasch ability estimates from the first and second tests that students took, computed separately for each number of days between tests from 0 to 150. Our analyses focus on Grades 1 through 8 since these grades had enough data to estimate coefficients for each number of days between tests. All analyses were performed using R (R Core Team, 2020). Table 1 provides a summary of the number of students who took the two CATs at least twice in each grade under standard timing conditions.
Table 1.
Number of Students Testing at Least Twice for Reading and Math CATs Under Standard Timing Conditions.
Grade | Reading CAT | Math CAT |
---|---|---|
1 | 575,629 | 420,637 |
2 | 929,731 | 571,507 |
3 | 987,671 | 599,962 |
4 | 940,983 | 575,237 |
5 | 889,221 | 554,019 |
6 | 709,556 | 485,757 |
7 | 610,959 | 431,746 |
8 | 577,579 | 409,629 |
Total | 6,221,329 | 4,048,494 |
Note. CAT = computerized adaptive test.
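A minimal sketch of the per-day correlation computation in R is shown below, assuming a hypothetical data frame named scores with one row per student and columns days_between, theta_test1, and theta_test2 (these names are ours for illustration, not the operational data's):

```r
library(dplyr)

# Alternate forms reliability estimated separately for each number of days
# between tests (0 to 150), along with the sample size at each day
rel_by_day <- scores %>%
  filter(days_between >= 0, days_between <= 150) %>%
  group_by(days_between) %>%
  summarize(
    n   = n(),
    rel = cor(theta_test1, theta_test2)
  )
```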
Because alternate forms reliability may be impacted by the factors mentioned above, we performed several additional analyses to evaluate the sensitivity of our results to some of these factors. We can examine the impact of test length because shorter versions of the two CATs are also offered for the purposes of progress monitoring. For reading, the shorter version of the CAT is 25 items in length; for math, it is 24 items in length. We can also look at the impact of test administration conditions since some administrations are given under standard time and others under extended time. For the reading CATs, students are allotted 60 seconds in Grades 1 and 2 and 45 seconds in Grades 3 through 8 to answer vocabulary-in-context items under standard time. Students receive 120 seconds in Grades 1 and 2 and 90 seconds in Grades 3 through 8 to answer the multiple-choice questions with the short reading selections under standard time. For the math CATs, students are given 180 seconds to answer each question under standard time. Under extended time, students receive three times the standard time for reading and two times the standard time for math. We also look at alternate forms reliability for several groups of examinees that took the test. These groups include female students, male students, White students, non-White students, students who scored in the bottom decile of the full first administration test score distribution, and students who scored in the top decile of the full first administration test score distribution. In addition, since the samples are convenience samples that may not fully represent the full first administration test score distribution, we classified students into deciles based on their first administration test scores and then selected stratified random samples at each number of days such that there were equal numbers of examinees from each decile. We then computed alternate forms reliability estimates for these samples. We were not able to perform any analyses to investigate how item types and scoring methods impacted results because all items were multiple-choice items. We also were not able to perform any analyses to look at the impact of test content, other than performing separate analyses for math and reading. All the CATs followed the same grade-specific test blueprint for that grade level and content area.
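The decile-stratified samples could be drawn along the following lines (again a sketch using the hypothetical scores data frame from the previous sketch; the number drawn per decile, n_per_cell, is an arbitrary illustrative value that should not exceed the smallest decile-by-day cell):

```r
library(dplyr)

set.seed(2020)
n_per_cell <- 200   # illustrative number of examinees drawn per decile per day

rel_stratified <- scores %>%
  mutate(decile = ntile(theta_test1, 10)) %>%   # deciles of first test scores
  group_by(days_between, decile) %>%
  slice_sample(n = n_per_cell) %>%              # equal numbers from each decile
  group_by(days_between) %>%
  summarize(rel = cor(theta_test1, theta_test2))
```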
Another important consideration is how much student ability changes at various time points, as previous research suggests that one typically wants to select a retest interval that is long enough to avoid recall effects but short enough that student ability does not change dramatically. We calculate the standardized mean difference in ability using Hedges’ g (Hedges & Olkin, 1985) to provide an effect size measure of the change in ability between the first and second test administrations at different time points.
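In our notation, the standard form of this effect size (Hedges & Olkin, 1985) for the first- and second-test scores at a given number of days between tests is

$$
g = J \times \frac{\bar{X}_2 - \bar{X}_1}{s_p}, \qquad
s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}, \qquad
J = 1 - \frac{3}{4(n_1 + n_2) - 9},
$$

where $\bar{X}_1$ and $\bar{X}_2$ are the mean ability estimates on the first and second tests, $s_1^2$ and $s_2^2$ are the corresponding variances, $n_1 = n_2$ is the number of students retested at that interval, and $J$ is the small-sample bias correction. Positive values indicate higher average ability on the second test.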
Results
Reading Results
Figure 1 displays plots of the alternate forms reliability coefficients, the standardized mean difference in ability (as measured by Hedges’ g), the sample size of students, and the first test average ability as a function of the days since the first test for Grade 3 reading. The lines in the figure were created using lowess smoothing to help show general trends in the results. The figure shows that the alternate forms reliability coefficients were lower when testing within the first 2 to 3 weeks and leveled off in the mid-0.80s after that point in time. One can also see that students grew in ability over time, such that as time went on students taking the second form showed increasingly positive standardized mean differences in ability. One can also see fluctuations in the number of students testing on different days. In addition, the samples of students that were retested more quickly had lower average first test ability. These results indicate that the highest alternate forms reliabilities were achieved when waiting at least 2 to 3 weeks before retesting. However, there appears to be a tradeoff in waiting to retest, as the samples of students showed increasing differences in ability over time. In addition, there might be selection effects in the data, such that samples with fewer days between tests had lower first test average ability. We report on additional analyses we performed to try to remove some of these selection effects in the sensitivity analyses section.
Figure 1.
Alternate forms reliability, standardized mean difference in ability, sample size, and first test average ability versus days since first test for Grade 3 reading.
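The lowess trend lines in figures such as Figure 1 can be reproduced with base R’s lowess() function; a minimal sketch using the hypothetical rel_by_day summary from the earlier sketch:

```r
# Scatter of per-day alternate forms reliability estimates with a lowess trend line
plot(rel_by_day$days_between, rel_by_day$rel,
     xlab = "Days since first test", ylab = "Alternate forms reliability")
lines(lowess(rel_by_day$days_between, rel_by_day$rel), lwd = 2)
```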
Figure 2 displays plots of alternate forms coefficients for all grades for reading, while Table 2 provides alternate forms reliability estimates at weekly time intervals for every grade so that the reader can see specific numerical estimates. The pattern for Grade 3 was also observed for most of the other grades. That is, alternate forms reliability was lower when retesting students within the first 2 to 3 weeks, with higher coefficients found after that amount of time. Grade 1 exhibited a slightly different pattern from the other grades in that lower alternate forms reliability coefficients were found in the first 2 to 3 weeks, peak coefficients were found for the next 3 weeks after that, and then alternate forms reliability coefficients slightly decreased as the number of days between tests increased. We also investigated the standardized mean difference in ability and the first test average ability at grades other than Grade 3 and found similar results, with student ability growing over time and the samples of students that were retested more quickly having lower first test average ability. The main difference in results across grades was in the amount of growth in ability over time and the observed first test average ability estimates; generally, smaller amounts of growth and higher first test average ability were found at higher grades. We do not show figures with these results in the text of the article because of space considerations.
Figure 2.
Alternate forms reliability coefficients for all grades for reading.
Table 2.
Alternate Forms Reliability Estimates for Different Time Intervals for Reading.
Time interval | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|
0-7 days | 0.72 | 0.77 | 0.78 | 0.77 | 0.77 | 0.78 | 0.79 | 0.80 |
8-14 days | 0.83 | 0.83 | 0.83 | 0.83 | 0.84 | 0.82 | 0.82 | 0.82 |
15-21 days | 0.85 | 0.86 | 0.86 | 0.85 | 0.85 | 0.85 | 0.83 | 0.83 |
22-28 days | 0.87 | 0.88 | 0.87 | 0.87 | 0.86 | 0.86 | 0.84 | 0.85 |
29-35 days | 0.87 | 0.88 | 0.88 | 0.87 | 0.87 | 0.87 | 0.86 | 0.86 |
36-42 days | 0.87 | 0.87 | 0.88 | 0.87 | 0.87 | 0.87 | 0.87 | 0.86 |
43-49 days | 0.86 | 0.87 | 0.88 | 0.87 | 0.87 | 0.87 | 0.87 | 0.86 |
50-56 days | 0.86 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 |
57-63 days | 0.86 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 |
64-70 days | 0.85 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.86 | 0.87 |
71-77 days | 0.85 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 |
78-84 days | 0.85 | 0.87 | 0.88 | 0.88 | 0.88 | 0.87 | 0.87 | 0.87 |
85-91 days | 0.84 | 0.86 | 0.88 | 0.88 | 0.87 | 0.87 | 0.87 | 0.87 |
92-98 days | 0.83 | 0.86 | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 0.87 |
99-105 days | 0.82 | 0.86 | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 |
106-112 days | 0.81 | 0.86 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.88 |
113-119 days | 0.82 | 0.86 | 0.86 | 0.87 | 0.88 | 0.87 | 0.87 | 0.87 |
120-126 days | 0.81 | 0.86 | 0.87 | 0.88 | 0.87 | 0.87 | 0.87 | 0.88 |
127-133 days | 0.80 | 0.86 | 0.88 | 0.88 | 0.88 | 0.87 | 0.87 | 0.87 |
134-140 days | 0.78 | 0.86 | 0.88 | 0.88 | 0.87 | 0.87 | 0.87 | 0.87 |
141-150 days | 0.78 | 0.85 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 |
Math Results
Figure 3 shows plots of the alternate forms reliability coefficients, standardized mean difference in ability, the sample size of students, and the first test average ability as a function of the days since the first test for Grade 3 math. There are many similarities between the math and reading results. In particular, student ability grew over time, students with lower first test average ability tended to be retested more quickly, and alternate forms reliability was lower in the first two or so weeks of retesting and leveled off at a value above 0.80 after that. The main differences between the reading and math results are that the alternate forms reliability coefficients leveled off at a slightly lower value for math, the sample sizes were lower for math, and the amount of growth and first test average ability differed, due in part to performance and content differences and to the math and reading CATs being on different scales. It again appears that there is a tradeoff in waiting to retest, as the samples of students increasingly differed in ability as the number of days between tests increased. There were also potential selection effects, with samples with fewer days between tests having lower first test average ability.
Figure 3.
Alternate forms reliability, standardized mean difference in ability, sample size, and first test average ability versus days since first test for Grade 3 math.
As with reading, Figure 4 shows alternate forms coefficients for all grades for math, while Table 3 provides numerical estimates of the alternate forms reliability coefficients at weekly time intervals. The patterns for Grade 3 and for reading were largely found at the other grades for math. Alternate forms reliability coefficients again tended to be lower in the first couple of weeks of retesting. In addition, Grade 1 had slightly different results than the other grades, with lower reliability coefficients in the first couple of weeks, peak coefficients for the next 3 weeks after that, and then decreasing coefficients after that point in time. Similar to reading, we looked at the standardized mean differences in ability and the first test average ability at grades other than Grade 3 for math and found that student ability grew over time and that the samples of students that were retested more quickly tended to have lower first test average ability. The main difference in results across grades was again that higher grades showed smaller amounts of growth and higher first test average ability. We do not include figures with these results in the main text of the article due to space considerations.
Figure 4.
Alternate forms reliability coefficients for all grades for math.
Table 3.
Alternate Forms Reliability Estimates for Different Time Intervals for Math.
Time interval | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Grade 5 | Grade 6 | Grade 7 | Grade 8 |
---|---|---|---|---|---|---|---|---|
0-7 days | 0.73 | 0.74 | 0.75 | 0.76 | 0.76 | 0.79 | 0.79 | 0.77 |
8-14 days | 0.77 | 0.79 | 0.79 | 0.80 | 0.82 | 0.82 | 0.82 | 0.79 |
15-21 days | 0.80 | 0.80 | 0.82 | 0.82 | 0.84 | 0.84 | 0.83 | 0.82 |
22-28 days | 0.80 | 0.82 | 0.82 | 0.82 | 0.84 | 0.84 | 0.82 | 0.82 |
29-35 days | 0.80 | 0.82 | 0.82 | 0.83 | 0.84 | 0.84 | 0.84 | 0.82 |
36-42 days | 0.79 | 0.82 | 0.82 | 0.83 | 0.84 | 0.84 | 0.85 | 0.83 |
43-49 days | 0.77 | 0.81 | 0.82 | 0.82 | 0.84 | 0.84 | 0.84 | 0.83 |
50-56 days | 0.77 | 0.82 | 0.82 | 0.82 | 0.84 | 0.85 | 0.85 | 0.84 |
57-63 days | 0.76 | 0.81 | 0.81 | 0.82 | 0.84 | 0.84 | 0.84 | 0.84 |
64-70 days | 0.77 | 0.81 | 0.81 | 0.82 | 0.84 | 0.84 | 0.84 | 0.84 |
71-77 days | 0.77 | 0.81 | 0.81 | 0.83 | 0.84 | 0.84 | 0.84 | 0.84 |
78-84 days | 0.77 | 0.81 | 0.81 | 0.82 | 0.84 | 0.85 | 0.84 | 0.84 |
85-91 days | 0.77 | 0.81 | 0.81 | 0.83 | 0.84 | 0.85 | 0.84 | 0.84 |
92-98 days | 0.76 | 0.81 | 0.80 | 0.83 | 0.84 | 0.84 | 0.84 | 0.84 |
99-105 days | 0.76 | 0.80 | 0.80 | 0.83 | 0.85 | 0.84 | 0.84 | 0.84 |
106-112 days | 0.75 | 0.79 | 0.80 | 0.82 | 0.84 | 0.84 | 0.84 | 0.84 |
113-119 days | 0.75 | 0.80 | 0.80 | 0.81 | 0.84 | 0.84 | 0.84 | 0.83 |
120-126 days | 0.75 | 0.80 | 0.80 | 0.81 | 0.83 | 0.84 | 0.83 | 0.82 |
127-133 days | 0.74 | 0.80 | 0.80 | 0.82 | 0.83 | 0.83 | 0.84 | 0.82 |
134-140 days | 0.76 | 0.80 | 0.80 | 0.82 | 0.84 | 0.84 | 0.84 | 0.84 |
141-150 days | 0.73 | 0.79 | 0.81 | 0.82 | 0.84 | 0.83 | 0.84 | 0.84 |
Sensitivity Analyses
Figures 5 and 6 show results from the sensitivity analyses that we ran for Grade 3 reading and Grade 3 math, respectively. The figures display plots for the 34-item tests for students taking tests under standard time, shorter tests taken under standard time, analyses that used stratified random samples with equal numbers of students from each decile, students that scored in the bottom decile of the first test ability distribution, students that scored in the top decile of the first test ability distribution, students that completed tests with extended time, female students, male students, White students, and non-White students. Although we performed analyses at other grades for both subjects, we only present results for Grade 3 for each subject because of space considerations and because results at other grades were very similar to those shown in Figures 5 and 6. Many of the plots show strikingly similar patterns with lower alternate forms reliability coefficients found when retesting in the first 2 to 3 weeks, and coefficients leveling off above 0.80 after that point in time.
Figure 5.
Alternate forms reliability coefficients for Grade 3 reading for different groups of students.
Figure 6.
Alternate forms reliability coefficients for Grade 3 math for different groups of students.
There are a couple of noteworthy exceptions that warrant additional discussion. First, when the subsample of students came from the bottom or top decile of the first test ability distribution, the lines did not level off as they did in the other plots. For the bottom decile, alternate forms reliability appeared to continue to increase with additional days between tests; for the top decile, reliability was less than 0.2 with very few days between tests, and the lines appeared to be much more jagged. In addition, the alternate forms reliability lines for the bottom decile eventually reached values above those obtained in other analyses as the time period between tests continued to increase, while alternate forms reliability only rose to values in the 0.30s or 0.40s for the top decile analyses. Consistent with prior research, these results suggest that narrowing the subsample of students used in the analyses based on student ability can change estimated reliability. One can also see that when test length was shorter, reliability coefficients were lower for a period of around 30 days for reading and, as expected, coefficients leveled off at a slightly lower value for these shorter tests. For math, the shorter tests led to lower alternate forms reliability than the longer tests, but the points were much more scattered, the lowess fit line was more jagged, and reliability estimates were not clearly lower in the first couple of weeks as seen in many other plots. The increased scatter and inconsistent trend seem to be due in part to the fact that these analyses were based on a much smaller sample than many of the other plots. A few other plots for both reading and math also appeared more scattered, including the analyses that focused on students with extended time, White students, and the top and bottom deciles. These differences in the scatter of points also appeared to be a function of the smaller sample sizes used in these analyses.
A key finding shown in Figures 5 and 6 is that the shape and magnitude of the lines for 34-item tests taken under standard time were very similar to those from the analyses that used stratified random samples with equal numbers of students from each decile. This finding is important because it suggests that the first test ability differences observed in Figures 1 and 3 are not the sole factor driving the patterns in the alternate forms reliability coefficients displayed in those figures. It also does not appear that results were related to student demographics or to whether tests were taken under standard or extended time, since the shapes of these lines were similar to the lines for 34-item tests taken under standard time. Taken together, the results in the figures suggest that the lower alternate forms reliability coefficients observed when retesting in the first few weeks could not be explained by differences in first test average ability, student demographics, or whether tests were taken under standard or extended time.
Discussion
This article uses a unique set of data to help provide an answer to a long-asked question: How many days should there be between tests when estimating alternate forms reliability coefficients? Our results suggest that alternate forms reliability estimates were lower when the second test was administered within the first 2 to 3 weeks after the first test. After this amount of time, coefficients tended to be similar and higher despite student ability increasing over time. Additional sensitivity analyses suggested that the lower alternate forms reliability coefficients observed when retesting in the first 2 or 3 weeks could not simply be explained by student demographics, the amount of time a student was allotted on the test, or differences in first test average ability. Our sensitivity analyses did suggest that alternate forms reliability tended to be lower when the tests were shorter, and that different alternate forms reliability estimates can be observed when narrowing the first test ability distribution of the examinees.
Many of the factors that we investigated in our sensitivity analyses, factors that prior work has shown to impact test–retest and alternate forms reliability coefficients, produced results consistent with prior research. The exceptions were the student demographic comparisons and the standard versus extended time comparisons, where results did not appear to be markedly different across demographic groups or timing conditions. Prior research suggests that the population used in the analysis can impact reliability estimates and that reliability estimates can change with different samples. Part of the difference in our analyses of students with different demographic characteristics could be due to the larger sample sizes of the groups compared and to the fact that the tests included in our analyses were CATs, which had more consistent measurement error for different groups across the ability scale. In terms of timing conditions, prior research suggests that reliability coefficients tend to be lower with less standardized test administrations or when a test may be speeded (see Crocker & Algina, 1986; Traub & Rowley, 1991). However, it should be noted that extended time is often viewed as a test accommodation, and by definition a test accommodation is not supposed to change the construct measured by the assessment. This may help explain in part the similarity of the results for standard and extended time in our analyses, as extended time is designed to function as a test accommodation on these assessments.
An important question to ask is whether the patterns found in our analyses would hold for other tests. The simple answer is that we do not know. To date, prior research has not collected or reported data like those shown in this article, which makes it hard to know how consistent results would be for other testing programs. Instead, prior research has tended to compare results from two different time points and, as noted in the introduction, results from these studies have sometimes shown increasing, similar, or decreasing reliability estimates depending on the trait being measured, the test instrument utilized, and the respondents used in the analyses. Future research should explore whether the patterns reported in this study hold for other tests using data like those we used in our analyses. In particular, future research examining tests that measure constructs besides achievement, such as personality, motivation, or emotions, would be valuable since our analyses only focused on achievement tests. However, it is important to bear in mind that collecting data like those used in this study may be cost prohibitive and difficult in many other testing programs.
It is also important to note that our analyses focused on CATs. It is entirely possible that some of the patterns we observed may look different for other types of tests. For example, some of the leveling off we observed after the first 2 or 3 weeks may not be found with fixed-form tests because of differences in item selection and what that may mean for measuring ability. Fixed-form tests have one set of items, tend to target items to the middle of the ability scale, and can exhibit floor and ceiling effects. A consequence of this test design is that there can be time periods where some or many students are not measured with much precision when the test is used across multiple time points. If the fixed-form test is well targeted to students when it is initially given, then one might expect that over time more students would reach the ceiling of the test and be measured with less precision because of growth in student ability and the test becoming increasingly easier for students. For these types of tests, one might expect alternate forms reliability to drop with time due to an increasing number of students being measured with less precision and reaching the ceiling of the test. In contrast, CATs are not subject to the same ceiling effects as fixed-form tests, so that as students grow in ability, they should be measured with approximately the same level of precision as they were initially. The fact that measurement precision stays fairly constant over time may explain some of the differences between our analyses and some prior research studies where reliability has been shown to decrease over time. Future research could compare alternate forms reliability using different time intervals for a testing program that uses both CAT and fixed-form tests to see if the mode of testing does in fact have an impact on reliability estimates.
It is also important to point out that in all our analyses teachers decided on the schedule with which to test their students. The fact that teachers decided the testing schedule implies that the samples in our analyses were convenience samples, which are common in test–retest and alternate forms reliability research. We did perform sensitivity analyses to see whether differences in reliability may be related to various factors, such as test length, differences in the first test ability distribution, student demographics, or whether the test was given under standard or extended time. For the most part, the patterns of lower reliability in the first 2 or 3 weeks were robust to these factors. Future research could consider looking at other examinee characteristics beyond those we considered to see if those characteristics may help to explain the lower reliability in the first couple of weeks. For example, we were not able to directly measure student anxiety, motivation, or effort on the first test. It may be that students who are retested in the first 2 or 3 weeks have more test anxiety or less motivation, or did not try as hard on their first test. We focused our sensitivity analyses on the factors that we did because we could directly measure them and because prior research had suggested that some of these factors may contribute to differences in alternate forms reliability. Future research that investigates why teachers decided to retest students when they did would be helpful to more fully understand some of the patterns in alternate forms reliability that we observed.
An essential question when calculating test–retest and alternate forms reliability for different types of tests is how many days there should be between tests. Based on our results, it appears that for the reading and math tests we studied, the highest alternate forms reliability coefficients were obtained when examinees were retested after waiting 2 to 3 weeks. After this amount of time, coefficients were often similar. However, our results suggested a potential tradeoff in waiting longer to retest, as student ability tended to grow with time. These findings indicate that if one wants student ability to be as similar as possible between the two testing occasions and still obtain the highest estimates of alternate forms reliability, then one would generally want to retest shortly after 3 weeks have passed since the first test. This timeframe is a week longer than those suggested by Deyo et al. (1991) for patient-reported outcome measurement, Nunnally and Bernstein (1994) for achievement-type tests, and Bardhoshi and Erford (2017) for instruments used in counseling and development.
Acknowledgments
The authors would like to thank Catherine Close, Tyler Hefty, Tom Rue, James Olsen, and James McBride for discussion and comments they provided on this work during its development. Any opinions, findings, conclusions, or recommendations expressed in this article are those of the authors and are not necessarily the official position of Renaissance.
Footnotes
Author’s Note: Paper presented at the 2020 Meeting of the National Council on Measurement in Education.
Declaration of Conflicting Interests: The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: Adam E. Wyse https://orcid.org/0000-0002-1719-9461
References
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
- Backhaus J., Junghanns K., Broocks A., Riemann D., Hohagen F. (2002). Test-retest reliability and validity of the Pittsburgh Sleep Quality Index in primary insomnia. Journal of Psychosomatic Research, 53(3), 737-740. 10.1016/S0022-3999(02)00330-6
- Bardhoshi G., Erford B. T. (2017). Processes and procedures for estimating score reliability and precision. Measurement and Evaluation in Counseling and Development, 50(4), 256-263. 10.1080/07481756.2017.1388680
- Capraro R. M., Capraro M. M. (2002). Myers-Briggs Type Indicator score reliability across studies: A meta-analytic reliability generalization study. Educational and Psychological Measurement, 62(4), 590-602. 10.1177/0013164402062004004
- Cattell R. B. (1986). The psychometric properties of tests: Consistency, validity, and efficiency. In Cattell R. B., Johnson R. C. (Eds.), Functional psychological testing (pp. 54-78). Brunner/Mazel.
- Cattell R. B., Eber H. W., Tatsuoka M. M. (1970). Handbook for the Sixteen Personality Factor Questionnaire (16PF). Institute for Personality and Ability Testing.
- Chmielewski M., Watson D. (2009). What is being assessed and why it matters: The impact of transient error on trait research. Journal of Personality and Social Psychology, 97(1), 186-202. 10.1037/a0015618
- Crocker L., Algina J. (1986). Introduction to classical and modern test theory. Harcourt Brace Jovanovich.
- Cronbach L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334. 10.1007/BF02310555
- de Ayala R. J., Vonderharr-Carlson D. J., Kim D. (2005). Assessing the reliability of the Beck Anxiety Inventory scores. Educational and Psychological Measurement, 65(5), 742-756. 10.1177/0013164405278557
- Deyo R. A., Diehr P., Patrick D. L. (1991). Reproducibility and responsiveness of health status measures: Statistics and strategies for evaluation. Controlled Clinical Trials, 12(4 suppl), 142S-158S. 10.1016/S0197-2456(05)80019-4
- Erford B. T., Johnson E., Bardhoshi G. (2016). Meta-analysis of the English version of the Beck Depression Inventory–Second Edition. Measurement and Evaluation in Counseling and Development, 49(1), 3-33. 10.1177/0748175615596783
- Frisbie D. A. (1988). Reliability of scores from teacher-made tests. Educational Measurement: Issues and Practice, 7(1), 25-33. 10.1111/j.1745-3992.1988.tb00422.x
- Gnambs T. (2014). A meta-analysis of dependability coefficients (test-retest reliabilities) for measures of the Big Five. Journal of Research in Personality, 52(1), 20-28. 10.1016/j.jrp.2014.06.003
- Haertel E. H. (2006). Reliability. In Brennan R. L. (Ed.), Educational measurement (4th ed., pp. 65-110). American Council on Education.
- Haladyna T. M., Rodriguez M. C. (2013). Developing and validating test items. Routledge. 10.4324/9780203850381
- Hedges L. V., Olkin I. (1985). Statistical methods for meta-analysis. Academic Press.
- Kieffer K. M., MacDonald G. (2011). Exploring factors that affect score reliability and validity in the Ways of Coping questionnaire reliability coefficients: A meta-analytic reliability generalization study. Journal of Individual Differences, 32(1), 26-38. 10.1027/1614-0001/a000031
- Kingsbury G. G., Zara A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2(4), 359-375. 10.1207/s15324818ame0204_6
- Liao C.-W., Qu Y. (2010). Alternate forms test–retest reliability and test score changes for the TOEIC® speaking and writing tests (TOEIC Compendium TC-10-10). www.ets.org/Media/Research/pdf/TC-10-10.pdf
- Marx R. G., Menezes A., Horovitz L., Jones E. C., Warren R. F. (2003). A comparison of two time intervals for test-retest reliability of health status instruments. Journal of Clinical Epidemiology, 56(8), 730-735. 10.1016/S0895-4356(03)00084-2
- Mehrens W. A., Lehmann I. J. (1991). Measurement and evaluation in education and psychology. Harcourt Brace Jovanovich.
- Nunnally J., Bernstein I. H. (1994). Psychometric theory (3rd ed.). McGraw-Hill.
- Popham W. J. (2006). Assessment for educational leaders. Pearson.
- R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. http://www.R-project.org/
- Rasch G. (1960). Probabilistic models for some intelligence and attainment tests. Danmarks Paedagogiske Institut.
- Raykov T., Marcoulides G. A. (2011). Introduction to psychometric theory. Routledge. 10.4324/9780203841624
- Trajković G., Starčević V., Latas M., Leštarević M., Ille T., Bukumirić Z., Marinković J. (2011). Reliability of the Hamilton Rating Scale for Depression: A meta-analysis over a period of 49 years. Psychiatry Research, 189(1), 1-9. 10.1016/j.psychres.2010.12.007
- Traub R. E., Rowley G. L. (1991). Understanding reliability. Educational Measurement: Issues and Practice, 10(1), 37-45. 10.1111/j.1745-3992.1991.tb00183.x
- Vacha-Haase T., Kogan L. R., Tani C. R., Woodall R. A. (2001). Reliability generalization: Exploring variation of reliability coefficients of MMPI clinical scale scores. Educational and Psychological Measurement, 61(1), 45-59. 10.1177/00131640121971059
- Viswesvaran C., Ones D. (2000). Measurement error in “Big Five Factors” personality assessment: Reliability generalization across studies and measures. Educational and Psychological Measurement, 60(2), 224-235. 10.1177/00131640021970475
- Watson D. (2004). Stability versus change, dependability versus error: Issues in the assessment of personality over time. Journal of Research in Personality, 38(4), 319-350. 10.1016/j.jrp.2004.03.001
- Yeo S., Kim D., Branum-Martin L., Wayman M. M., Espin C. A. (2012). Assessing the reliability of curriculum-based measurement: An application of latent growth modeling. Journal of School Psychology, 50(2), 275-292. 10.1016/j.jsp.2011.09.002
- Yen W. M., Fitzpatrick A. R. (2006). Item response theory. In Brennan R. L. (Ed.), Educational measurement (4th ed., pp. 111-153). American Council on Education.
- Yin P., Fan X. (2000). Assessing the reliability of Beck Depression Inventory scores: Reliability generalization across studies. Educational and Psychological Measurement, 60(2), 201-223. 10.1177/00131640021970466