Published in final edited form as: Assess Eff Interv. 2011 Feb 2;36(3):158–166. doi: 10.1177/1534508410396698

The Importance of Predictive Power in Early Screening Assessments: Implications for Placement in the Response to Intervention Framework

Yaacov Petscher 1, Young-Suk Kim 1, Barbara R Foorman 1

Abstract

As schools implement response to intervention to identify and serve students with learning difficulties, it is critical for educators to know how to evaluate screening measures. In the present study, Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency was used to compare the differential decisions that might occur in screening accuracy when predicting two reading comprehension measures (i.e., Stanford Achievement Test–10th Edition and Gates-MacGinitie Reading Test–Fourth Edition) at the end of second grade. The results showed that the Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency tended to have higher sensitivity and negative predictive power for the Stanford Achievement Test–10th Edition and higher specificity and positive predictive power for the Gates-MacGinitie Reading Test–Fourth Edition. Furthermore, attempting to achieve a criterion of positive predictive power for a given reading comprehension outcome (the Stanford Achievement Test–10th Edition, in this study) appears to render a favorable balance compared to other indices of diagnostic accuracy. These results are discussed in light of trade-offs and the need to consider the specific contexts of schools and districts.

Keywords: screening accuracy, predictive power, response to intervention, Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency, reading comprehension


Assessment is at the center of a response to intervention (RTI) framework. As a preventive service delivery model, the early identification of students who are at risk for future reading failure is the key to appropriately placing them into interventions. Thus, universal screening is a critical first step in most RTI models (Jenkins, Hudson, & Johnson, 2007). Effective screening measures are typified by brevity and ease of use, and they should demonstrate high accuracy in predicting whether students will succeed on a criterion outcome of interest (e.g., standardized reading assessment, state achievement test). When scores from screening assessments are validated, they are typically designed to maximize a particular statistical outcome (Streiner, 2003; e.g., correct classification according to a gold standard outcome, thereby reducing the number of underidentified students). Thus, it is imperative that educators and researchers are informed about the trade-offs of maximizing different outcomes. This article discusses the important statistical and methodological components that should be considered when choosing a screening assessment, and it highlights such considerations with an illustration using the Dynamic Indicators of Basic Early Literacy Skills (DIBELS) Oral Reading Fluency task (Good, Wallin, Simmons, Kame’enui, & Kaminski, 2002) and two widely used reading comprehension tests to draw attention to the differences in at-risk identification and to provide general guidelines for comparing assessments for early screening.

Background and Context

An ideal screening assessment requires several features. First, practical utility is an important criterion such that it should be inexpensive; brief; easy to administer, score, and interpret; and, ideally, easily linked to instruction (Schatschneider, Petscher, & Williams, 2008). Another critical and often assumed quality of a screener is high screening accuracy and discrimination so that it distinguishes with precision students who will develop difficulties in a target area (e.g., reading) from those who will not (Glover & Albers, 2007). Ultimately, all these features serve students by accurately identifying them with respect to risk status and enabling appropriate allocation of resources for effective intervention.

When evaluating new screening assessments, two considerations must be made that affect screening accuracy—namely, what measure will be used as the gold standard outcome and which psychometric properties of screening accuracy are maximized. Both decisions are equally important to the evaluation process—that is, the former informs the assessor about what the screener is predicting, and the latter is informative about the goal of the screener. These two components interact to provide information about the extent to which scores from a screener are valid and meaningful for identifying students who are potentially at risk in the area of interest (i.e., reading difficulties).

The outcome to which the developer wishes to predict is a critical decision because it serves as the criterion for how risk is operationally defined, and it is the foundation on which all the statistical indices are based. For example, the accuracy of the Phonological Awareness Literacy Screening in kindergarten (Invernizzi, Meier, Swank, & Juel, 1999) was evaluated by using students’ fall performance to predict their spring performance. The cut points for the fall kindergarten DIBELS Letter Naming Fluency and Initial Sound Fluency tasks were chosen on the basis of students’ performances on several of the DIBELS measures administered in the spring of kindergarten and the fall of first grade. Conversely, the authors of the Test of Silent Word Reading Fluency (Mather, Hammill, Allen, & Roberts, 2004) were most interested in evaluating what percentage of students who performed poorly on the screener also performed poorly on a series of standardized, norm-referenced measures, such as the Letter Word Identification and Passage Comprehension tasks on the Woodcock-Johnson Psycho-Educational Battery–Revised (Woodcock & Johnson, 1990).

These three approaches represent commonly used techniques for selecting the gold standard outcome. One may choose to use a later time point of the same measure or battery, as in the case of the Phonological Awareness Literacy Screening, or one may choose to use a different measure within the same test battery, as with DIBELS. Alternatively, a different outcome may be selected (e.g., in the case of the Test of Silent Word Reading Fluency) so that practitioners and researchers may evaluate how well the screener predicts risk on an external measure such as a state achievement test.

Although the psychometrics of the scores may appear to be strong when using a particular outcome, it is important that the practitioner understand for which outcome that risk is defined. In the above examples, the early identification of students at risk will be specific to only that assessment, regardless of what outcome each screener predicts. Thus, the outcome selected operationally defines what risk status truly is. It therefore behooves researchers and practitioners to select screening and outcome measures that result in reliable data and valid decisions and are conceptually defensible measures of reading success. Educators should be aware that even when outcome measures meet all the psychometric requirements, such measures vary in random and systematic measurement errors and the extent or emphasis of areas in the target construct. This is particularly true for reading comprehension because it is not a unitary construct but draws on multiple processes (Cain, Oakhill, & Bryant, 2004; Davis, 1944). Thus, the extent to which reading comprehension tests tap into subskills of reading such as word recognition, working memory, and language comprehension varies (Andreassen & Braten, 2010; Keenan, Betjemann, & Olson, 2008). For example, the Stanford Achievement Test–10th Edition (SAT-10; Harcourt Brace, 2004) and Gates-MacGinitie Reading Test–Fourth Edition (GMRT-4; MacGinitie & MacGinitie, 2006), two frequently used reading comprehension measures, are purported to measure the construct of reading comprehension and have excellent psychometric properties. However, these two tests may vary in the extent to which they measure different subprocesses of reading (e.g., word reading and language comprehension; inference making) and various types of texts (e.g., expository versus narrative texts).

When a group of individuals are administered a screener and a gold standard outcome, a resulting contingency matrix may be generated (Table 1). From this matrix, four types of classifications may occur (Schatschneider et al., 2008): students who are identified as at risk on the screen and either failed the outcome (Cell A; true positive) or passed the outcome (Cell B; false positive) and students who were identified as not at risk on the screen and either failed the outcome (Cell C; false negative) or passed the outcome (Cell D; true negative). In general, most screening and diagnostic measures try to maximize what are considered to be either population- or sample-based indices. Population-based indices are statistical proportions that describe the population level of risk according to the gold standard outcome chosen and describe the sensitivity and specificity of the scores. The sensitivity of a screener is the proportion of individuals who failed the outcome and were identified as at risk on the screener; from Table 1, sensitivity may be calculated as A/(A + C). Specificity, D/(D + B), is the proportion of individuals who pass the outcome test in the population who are not at risk on the screening assessment. Sensitivity has been an important index in the RTI framework because it is the percentage of children correctly identified by a screener as needing further assessments and/or intervention. Several recommendations have been provided about appropriate thresholds for sensitivity and specificity; however, many researchers attempt to have levels of at least .80, with some recommending minimum values of .90 (Compton, Fuchs, Fuchs, & Bryant, 2006; Jenkins, 2003). A useful summative measure may be used to describe the proportion of students who are correctly identified as either at risk or not at risk. The overall correct classification index (OCC) may be calculated as follows: (A + D)/(A + B + C + D).

Table 1.

Sample 2 × 2 Contingency Matrix

                          Outcome
Screen            Fail                     Pass
At risk           A: True positive         B: False positive
Not at risk       C: False negative        D: True negative
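The population-based indices defined above can be computed directly from the cells of Table 1. The following minimal Python sketch is ours, not part of the original study; the cell labels A through D follow Table 1.

# Minimal sketch: population-based indices from the 2 x 2 contingency matrix in Table 1.
# Cell labels follow the table: A = true positive, B = false positive,
# C = false negative, D = true negative.

def sensitivity(a: int, c: int) -> float:
    """Proportion of students who failed the outcome and were flagged at risk: A/(A + C)."""
    return a / (a + c)

def specificity(b: int, d: int) -> float:
    """Proportion of students who passed the outcome and were not flagged: D/(D + B)."""
    return d / (d + b)

def overall_correct_classification(a: int, b: int, c: int, d: int) -> float:
    """Proportion of all students classified correctly: (A + D)/(A + B + C + D)."""
    return (a + d) / (a + b + c + d)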

Positive predictive power and negative predictive power are the two primary sample-based indices. Predictive power describes the proportion of students screened who ultimately perform successfully or poorly on the gold standard outcome. Positive predictive power is the percentage of students identified as at risk on the screen who fail the outcome test, calculated as A/(A + B), while negative predictive power, D/(C + D), is the percentage of students identified as not at risk on the screen who pass the outcome test. These sample-based indices differ from the population-based indices because they are based on the makeup of the sample. Whereas sensitivity and specificity are properties of the test itself (Streiner, 2003), sample-based indices depend on the proportion of students in the sample who are at risk (i.e., the base rate). Thus, if a screener were used in two separate samples where one was higher achieving than the other, similar estimates of sensitivity and specificity could be obtained while different values for the positive and negative predictive power would be calculated.

Consider an example where a screener is used in two schools, each with 2,000 students. In School A, 50% of the students were at risk according to the state achievement test, while 15% were at risk on the same test in School B, and for the screener selected, the reported sensitivity was .95 and the specificity .90. With this information, the contingency tables in Table 2 were constructed. In Schools A and B, the sensitivity was .95—that is, 950/(950 + 50) in School A and 285/(285 + 15) in School B—and the specificity was .90: 900/(900 + 100) in School A and 1,530/(1,530 + 170) in School B. As expected, these population-based indices are identical in both schools. However, when the sample-based indices are calculated, very different findings are observed. The positive predictive power in School A is .90—that is, 950/(950 + 100)—compared to .63 in School B: 285/(285 + 170). However, the negative predictive power in School A is .95—that is, 900/(900 + 50)—compared to .99 in School B: 1,530/(1,530 + 15). This illustration demonstrates the critical importance of understanding and attending to the sample-based statistics, which should be given greater credence than the population-based indices when screening accuracy is evaluated and when screens are used to predict to distal outcome performance.

Table 2.

Base Rate Comparison for Sample-Based Indices: State Achievement Test

                          Outcome
Screen              Fail        Pass        Total
School A
  At risk           950         100         1,050
  Not at risk       50          900         950
  Total             1,000       1,000       2,000
School B
  At risk           285         170         455
  Not at risk       15          1,530       1,545
  Total             300         1,700       2,000
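The counts in Table 2 can be used to reproduce the illustration above. The short Python sketch below is our verification rather than part of the original study; the cell order follows Table 1, and it confirms that the population-based indices are identical across schools while the sample-based predictive power values diverge.

# Minimal sketch reproducing the School A / School B illustration in Table 2.
# Counts are (A, B, C, D) = (true positive, false positive, false negative, true negative).
schools = {
    "School A": (950, 100, 50, 900),
    "School B": (285, 170, 15, 1530),
}

for name, (a, b, c, d) in schools.items():
    sens = a / (a + c)   # identical across schools (.95)
    spec = d / (d + b)   # identical across schools (.90)
    ppv = a / (a + b)    # differs with the base rate (.90 vs. .63)
    npv = d / (c + d)    # differs with the base rate (.95 vs. .99)
    print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}, PPV={ppv:.2f}, NPV={npv:.2f}")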

Although perfect screening (i.e., 100% screening accuracy) is desirable, it is elusive because of the inherent measurement error associated with assessments, as well as the difficulties in measuring developing skills in children (Jenkins et al., 2007). In practice, educators and researchers need to identify their needs and consider trade-offs of the statistical outcomes described above to determine the screener that best fits the needs of a school, district, or research project. For instance, if a school uses a screener to identify and provide interventions to as many students as possible who may fail the outcome, despite demands on resources, a screen with high sensitivity may be more appropriate. In contrast, if a school uses a screen to identify children who may need further monitoring, a screening device with a high negative predictive power may be better suited because such a screener would do a better job of identifying students with a low chance of developing a problem and thus not needing intervention (Schatschneider et al., 2008).

In summary, when either the gold standard outcome changes or varying psychometric properties are maximized for screening accuracy, the identification of individuals who are likely to fail the outcome will vary. The following research questions were tested in the current investigation:

  • To what extent do indices of sensitivity, specificity, positive and negative predictive power, and the overall rate of correct classification change when using a screener to predict failure on two gold standard measures of reading comprehension (SAT-10 and GMRT-4)?

  • To what extent do indices of sensitivity, specificity, positive and negative predictive power, and the overall rate of correct classification change when manipulating cut points to achieve .80 sensitivity (Method 1) or positive predictive power (Method 2) when predicting failure on one gold standard measure of reading comprehension (SAT-10)?

METHOD

Participants and Data Source

The participants were 17,778 second-grade students who attended a Reading First school during the 2005–2006 school year. According to school records, this cohort reflected the diversity found in Florida: 50% were female, 40% were identified as White, 31% as Black, 22% as Latino, 4% as multiracial, and < 1% as either Native American or Asian. Across the sample, 77% were eligible for free or reduced-price lunch, and 11% were served on an individual education plan for a disability. Programs for limited English proficiency served 15% of students. Table 3 provides a summary of the student demographics and the demographics for all students in the state.

Table 3.

Demographic Characteristics for the Full Sample, the Population (State), and Students With Missing Data (in Percentages)

Demographics                         Full Sample    Population (State)    Missing Data
Girl                                 50             52                    51
White                                40             30                    38
Black                                31             38                    30
Latino                               22             26                    23
Asian                                1              1                     1
Multiracial                          4              4                     4
Native American                      <1             <1                    <1
Free and/or reduced-price lunch      77             76                    76
English-language learners            15             17                    14
Speech impaired                      5              5                     5
Language impaired                    2              3                     2
Specific learning disability         4              5                     4
Other                                4              4                     4

Measures

Oral reading fluency

DIBELS Oral Reading Fluency (Good, Kaminski, Smith, Laimon, & Dill, 2001) is a measure that assesses oral reading rate in grade-level connected text. Students are asked to read three consecutive passages out loud (1 minute per passage), and they are given the prompt “Be sure to do your best reading” (Good et al., 2001, p. 30). Words omitted, words substituted, and hesitations of more than 3 seconds are scored as errors, although errors that are self-corrected within 3 seconds are scored as correct. Errors are noted by the assessor, and the score produced is the number of words correctly read per minute. The median score of the three passages is the score used for decision making about level of risk and level of intervention needed. Information about how the risk levels for Oral Reading Fluency benchmarks were developed and what ranges of scores correspond to various levels of risk is available in several technical reports by the DIBELS authors (e.g., Good et al., 2002). Speece and Case (2001) reported parallel-form reliability of .94, and strong interrater reliability (.96) has been observed in Florida (Progress Monitoring and Reporting Network, 2005). Research has demonstrated adequate to strong predictive validity of DIBELS Oral Reading Fluency for reading comprehension outcomes—that is, .65 to .80 (Barger, 2003; Good et al., 2001; Roehrig, Petscher, Nettles, Hudson, & Torgesen, 2008; Shapiro, Solari, & Petscher, 2008; Wilson, 2005). One of the guiding principles that Good et al. (2002) utilized in developing the original cut scores was to retain intervals for low-risk levels that resulted in at least 80% of students meeting the end-of-year goal. Additionally, they wanted to set an interval for high risk whereby 20% or fewer of students met the third-grade goal. Good et al. also outlined that students in the some-risk range should have a 50% probability of meeting the end-of-year goal. Data for the current study consisted of the number of words read correctly per minute.
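To make the scoring rule concrete, here is a small hypothetical Python sketch (ours, not part of the DIBELS materials): it takes the three 1-minute passage scores, uses the median as the decision score, and compares it to the fall second-grade low-risk benchmark of 44 words correct per minute reported later in this article. The boundary between moderate and high risk is not reproduced here.

from statistics import median

# Hypothetical illustration of how the Oral Reading Fluency decision score is formed:
# the median words-correct-per-minute score of three 1-minute passages is compared to
# a benchmark (44 wcpm for fall of second grade, as reported in this study).
FALL_GRADE2_LOW_RISK_CUT = 44  # words correct per minute (assumed from this study)

def orf_decision_score(passage_wcpm: list[int]) -> int:
    """Return the median of the three passage scores used for risk decisions."""
    return median(passage_wcpm)

def is_low_risk(decision_score: int) -> bool:
    """Scores at or above the benchmark are considered low risk."""
    return decision_score >= FALL_GRADE2_LOW_RISK_CUT

print(is_low_risk(orf_decision_score([38, 47, 52])))  # median 47 -> True (low risk)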

SAT-10

The SAT-10 is a group-administered, untimed, standardized measure of reading comprehension. Students answer a total of 54 multiple-choice items that assess their initial understanding, interpretation, critical analysis, and awareness and usage of various reading strategies. The internal consistency for the SAT-10 on a nationally representative sample of students was .88. Validity was established with other standardized assessments of reading comprehension, providing strong evidence of content, criterion, and construct validity (coefficients > .70; Harcourt Brace, 2004). For the present analyses, the percentile rank associated with the scale score on the total reading comprehension domain was used.

GMRT-4

The Reading Comprehension subtest of the GMRT-4 consists of 40 single-sentence and short three- to four-sentence passages of narrative and expository text, each followed by several multiple-choice questions. The questions are purported to tap understanding of details and the ability to make inferences and integrate information in the passages. Internal consistency estimates of .96 and test–retest reliability of .85 to .90 were reported for the 2006 standardization sample, with construct validity estimates of .79 to .81 also reported. For the present analyses, the percentile rank associated with the scale score on the Reading Comprehension subtest was used.

Procedures

Data in this study were drawn from the Progress Monitoring and Reporting Network, an archival data source that houses student performance data on reading measures. The network was maintained by the Florida Center for Reading Research as part of its role in providing support for schools and districts throughout the state under Reading First. The Progress Monitoring and Reporting Network is a centralized data collection and reporting system through which schools in Florida report reading data and receive reports of the data for instructional decision making. The participants were administered the DIBELS assessments according to the state of Florida’s assessment plan in the fall, winter, and spring. In the present study, we used data from the fall assessment period. The SAT-10 and GMRT-4 were administered at the end of the school year.

Data Analysis

To answer our first research question, the accuracy of the screen at the fall was tested by creating a series of 2 × 2 contingency tables, similar to those presented in Table 1, which describe the number of students who were identified as at risk or not at risk on both the screen (DIBELS Oral Reading Fluency) and the two gold standard outcome variables (SAT-10 and GMRT-4). Scores on the measures were recoded into dichotomous variables according to the cut points on each measure that corresponded to risk. Across the measures, scores at or above the respective cut points for low risk were coded as 1 (i.e., success), and scores corresponding to either moderate or high risk were coded as 0 to indicate that students did not meet the threshold for low risk.

At the fall assessment period, Good et al. (2001) reported that Oral Reading Fluency scores at or above 44 are considered to be low risk. Students with scores less than 44 may be identified as either moderate or high risk. Pertaining to the outcome tests, performance at or above the 40th percentile is often used to denote students who are low risk on state achievement tests (American Institutes for Research, 2007), while scores below this value are reflective of moderate- or high-risk performance. No universally agreed-on threshold exists for risk designation in education sciences; thus, it is important to consider commonly used practices for such score transformations. Although the 40th percentile cut point is often utilized for state achievement outcomes (American Institutes for Research, 2007), we opted to use the more conservative value of the 50th percentile to identify students as at risk on the SAT-10 and GMRT-4 for illustrative purposes.
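As a concrete illustration of this recoding, the following Python sketch (ours, with hypothetical column names and made-up scores) dichotomizes fall Oral Reading Fluency at 44 words correct per minute and the SAT-10 percentile rank at the 50th percentile used in this study, then builds a 2 × 2 table analogous to Table 1.

import pandas as pd

# Hypothetical data; column names and values are illustrative only.
df = pd.DataFrame({
    "orf_fall_wcpm": [22, 51, 37, 68, 44, 12],      # DIBELS ORF, words correct per minute
    "sat10_percentile": [31, 62, 48, 75, 55, 20],   # SAT-10 national percentile rank
})

ORF_LOW_RISK_CUT = 44   # fall, grade 2 (Good et al., 2001, as cited in this study)
OUTCOME_CUT = 50        # 50th percentile used in this study

# Flag students below each cut point (risk on the screen, failure on the outcome).
df["screen_at_risk"] = df["orf_fall_wcpm"] < ORF_LOW_RISK_CUT
df["outcome_fail"] = df["sat10_percentile"] < OUTCOME_CUT

# 2 x 2 contingency table (rows: screen status, columns: outcome status).
table = pd.crosstab(df["screen_at_risk"], df["outcome_fail"])
print(table)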

Based on the reported cut points for DIBELS Oral Reading Fluency, the sensitivity, specificity, positive and negative predictive power, and overall percentage of correctly classified students were calculated. Although other aspects of diagnostic efficiency may be tested (e.g., likelihood ratio, odds ratio), these five indices are more commonly found in technical reports and research papers to describe classification accuracy (Streiner, 2003).

The second research question was addressed by using receiver-operating characteristic curve analysis to determine the cut points of DIBELS Oral Reading Fluency that corresponded to a maximized screener property. Although several methods exist to evaluate the appropriateness of developed cut scores (e.g., equipercentile equating and discriminant analysis), receiver-operating characteristic curve analysis has been demonstrated as having greater flexibility with regard to estimated screening accuracy and determining the balance between type I and II errors (Silberglitt & Hintze, 2005). Optimal cut scores for differentially maximizing sensitivity and positive predictive power in the sample were determined by examining the values in the receiver-operating characteristic curve and subsequently using selected values in a 2 × 2 contingency table to evaluate the indices previously described. Using previously discussed recommendations for index thresholds, we sought to use two methods to establish cut points that achieved either .80 sensitivity (Method 1) or .80 positive predictive power (Method 2).
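A rough sketch of this cut-score examination is shown below. It is our illustration rather than the authors’ code, and it assumes an array of fall Oral Reading Fluency scores and a binary outcome-failure indicator; it simply tabulates the five indices at every candidate cut so that a value meeting a target (e.g., .80 sensitivity for Method 1 or .80 positive predictive power for Method 2) can be selected.

import numpy as np

# Sketch under assumed inputs: `orf` holds fall Oral Reading Fluency scores and
# `fail_outcome` equals 1 when a student scored below the outcome cut point
# (here, the 50th percentile on the SAT-10).
def indices_at_cut(orf: np.ndarray, fail_outcome: np.ndarray, cut: int) -> dict:
    at_risk = orf < cut                               # the screen flags students below the cut
    tp = int(np.sum(at_risk & (fail_outcome == 1)))   # true positives
    fp = int(np.sum(at_risk & (fail_outcome == 0)))   # false positives
    fn = int(np.sum(~at_risk & (fail_outcome == 1)))  # false negatives
    tn = int(np.sum(~at_risk & (fail_outcome == 0)))  # true negatives
    return {
        "cut": cut,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
        "npv": tn / (tn + fn) if (tn + fn) else float("nan"),
        "occ": (tp + tn) / (tp + fp + fn + tn),
    }

def cut_score_table(orf: np.ndarray, fail_outcome: np.ndarray) -> list[dict]:
    """Tabulate the indices at every candidate cut so a cut meeting a target
    (e.g., .80 sensitivity or .80 positive predictive power) can be chosen."""
    return [indices_at_cut(orf, fail_outcome, c)
            for c in range(int(orf.min()) + 1, int(orf.max()) + 1)]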

RESULTS

Missing Data Analysis

According to the descriptive data analyses, 6% of unique data points were missing across all studied variables and time points. The demographic makeup of students with missing data was examined to determine whether the data were missing at random or whether missingness was systematically related to demographics. Frequency distributions suggested that the data were not missing in any discernible pattern (Table 3). Moreover, students with missing data approximated the students with complete data with regard to demographic frequencies. Although the prevalence of missingness was low, Little’s (1988) test of data missing completely at random indicated that the data were not missing completely at random, χ²(4) = 54.11, p < .001. To correct for the unbalanced design and for potential biases in parameter estimation, multiple imputation was conducted with SAS PROC MI, using the free or reduced-price lunch, minority status, and item score variables and Markov chain Monte Carlo estimation with 10 imputations.
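The imputation itself was run in SAS PROC MI. For readers working in Python, a rough analogue is sketched below; it uses chained equations with posterior sampling rather than the MCMC approach reported here, and the variable names are hypothetical.

# Rough Python analogue of generating m imputed datasets (not the authors' SAS procedure).
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the estimator)
from sklearn.impute import IterativeImputer

def impute_m_datasets(df: pd.DataFrame, m: int = 10) -> list[pd.DataFrame]:
    """Return m imputed copies of df; varying the seed with posterior sampling
    yields distinct plausible completions, as in multiple imputation."""
    completed = []
    for seed in range(m):
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        completed.append(pd.DataFrame(imputer.fit_transform(df), columns=df.columns))
    return completed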

Descriptive and Correlation Data

Table 4 reports the descriptive statistics for students’ performance on the selected measures. On average, students read 56 words correct per minute (SD = 31.66) and had developmental scale scores of 599 on the SAT-10 and 442 on the GMRT-4. A better contextualization of the developmental scale scores is to provide the associated percentile rank of each score: A score of 599 on the SAT-10 corresponded to the 50th percentile according to its norming sample, whereas a score of 442 on the GMRT-4 was associated with performance at the 48th percentile of its norming sample. Moderate to strong correlations were observed across all measures, ranging from .64 (between Oral Reading Fluency and SAT-10) to .73 (between SAT-10 and GMRT-4). Forty-four percent of students were identified as failing the SAT-10, compared with 56% for the GMRT-4.

Table 4.

Descriptive Statistics and Correlations for Observed Variables

                                   Oral Reading    Stanford            Gates-MacGinitie
                                   Fluency^a       Achievement Test    Reading Test
Oral Reading Fluency^a             1.00
Stanford Achievement Test          .64             1.00
Gates-MacGinitie Reading Test      .68             .76                 1.00
M                                  56.11           598.98              442.39
SD                                 31.66           39.48               38.51

a. Dynamic Indicators of Basic Early Literacy Skills, fall.

Research Question 1: Screening Accuracy With Varying Outcomes

Using the identified scores to define risk on the Oral Reading Fluency, SAT-10, and GMRT-4 measures, we constructed 2 × 2 contingency tables to evaluate the screening accuracy of the screen. The results of these calculations are reported in Table 5. As can be observed from the fall indices, when Oral Reading Fluency was used to predict the two gold standard reading comprehension outcomes, differential classification occurred relative to risk identification. When Oral Reading Fluency predicted failure on the SAT-10, 66% of the students who failed the SAT-10 (i.e., scored below the 50th percentile) scored below 44 words correct per minute on DIBELS Oral Reading Fluency, compared with 60% when failure on the GMRT-4 was predicted. This 6% difference in sensitivity in favor of the SAT-10 was counterbalanced by a 6% difference in specificity in favor of the GMRT-4. That is, although 81% of students who passed the SAT-10 were fluent at or above 44 words correct per minute, the corresponding rate for the GMRT-4 was 87%. A similar pattern of differential advantages between the outcomes was observed for positive and negative predictive power. Although the negative predictive power of the Oral Reading Fluency–SAT-10 relationship (75%) was greater than that of the Oral Reading Fluency–GMRT-4 relationship (63%), a similar 12% discrepancy was estimated for positive predictive power in favor of the GMRT-4 (86%) compared to the SAT-10 (74%).

Table 5.

Diagnostic Efficiency Results (in Percentages)

                                                                                        Predictive Power        Overall Correct
Variable                                               Sensitivity    Specificity    Positive    Negative       Classification
Research Question 1
  Oral Reading Fluency–Stanford Achievement Test       66             81             74          75             74
  Oral Reading Fluency–Gates-MacGinitie Reading Test   60             87             86          63             72
Research Question 2
  Method 1                                             80             64             64          81             72
  Method 2                                             52             90             80          70             73

Research Question 2: Screening Accuracy With Varying Cut Scores

Cut scores were recalibrated for DIBELS Oral Reading Fluency predicting failure on the SAT-10. The resulting receiver-operating characteristic curve suggested that an Oral Reading Fluency cut score of 48 would be appropriate to achieve a sensitivity value of .80 (Method 1). Similarly, to attain a positive predictive power estimate of .80 (Method 2), a fluency cut score of 36 was needed. By using the respective points to maximize each screening goal, the resulting accuracy indices were calculated and reported in Table 5. Comparisons between the two methods for each index demonstrate the nature of the discrepancies that occur when focusing on specific screening targets. Compared to Method 2, Method 1 yielded sensitivity that was 28% higher and negative predictive power that was 11% higher but specificity that was 26% lower and positive predictive power that was 16% lower. Thus, Method 1 produces higher levels of sensitivity and negative predictive power, whereas Method 2 results in higher positive predictive power and specificity.

DISCUSSION

As RTI is increasingly required as a way to identify children at risk for future reading difficulties, schools are expected to implement it. One critical element of effective RTI placement is assessment, including screening, to identify students’ instructional needs. Thus, it is important that educators understand the trade-offs of choosing different gold standard outcomes and of the statistical properties to maximize, in order to meet the needs of districts and schools. In the present study, a widely used screen, DIBELS Oral Reading Fluency, was examined for screening accuracy when predicting two standardized measures of reading comprehension. In particular, we focused on delineating population-based indices versus sample-based indices of screening accuracy. Sensitivity and specificity are examples of frequently reported population-based indices. In contrast, positive predictive power and negative predictive power are sample-based indices because they are influenced by students’ performance level in the sample. Because schools and districts differ in their demographic composition and students’ performance levels, these sample-based indices are likely to provide more relevant and useful information when schools and districts adopt a screen.

At the heart of the decision, an educator must contend with the question, Which is the greater perceived evil: to identify too many students for Tier 2 or Tier 3 intervention or to miss students who are in need of services? The answer is not as simple as it appears, given that multiple elements factor into the decision. The amount of funding that a school or district has for interventions could preclude a desire to maximize the identification of at-risk students. The key consideration for educators, in light of resource allocations and priorities, is weighing the trade-off between providing intervention for those who do not need it and providing no intervention for those who do.

The results of the present study showed that the DIBELS Oral Reading Fluency measure had varying levels of diagnostic accuracy, depending on the outcome. Oral Reading Fluency tended to have higher sensitivity and negative predictive power for the SAT-10 than for the GMRT-4. In contrast, Oral Reading Fluency had higher specificity and positive predictive power for the GMRT-4 than for the SAT-10. These results suggest that the same screening measure serves somewhat different functions in terms of diagnostic accuracy, depending on the gold standard outcome. Although both the SAT-10 and the GMRT-4 are purported to measure the same construct (reading comprehension), and although a gold standard is presumably free from error, the notion of an error-free gold standard primarily stems from the origins of screening analyses, which are derived from signal detection theory and medical models, where outcomes tend to be more dichotomous (e.g., the patient has cancer or does not). In education sciences, where all instruments are affected by random and/or systematic measurement error, this assumption is less tenable. These results imply that the choice of gold standard outcome is an important consideration for schools and districts, especially given that a differential proportion of students in the same sample failed the SAT-10 (44%) compared to the GMRT-4 (56%).

In the attempt to achieve a criterion of either positive predictive power or sensitivity for a given reading comprehension outcome (the SAT-10 in this study), focusing on positive predictive power (i.e., Method 2) appears to render a more favorable balance. In other words, when .80 positive predictive power is the aim, the loss in sensitivity is comparable to the loss of specificity when sensitivity is set at .80 (Method 1). However, Method 2 does not lose as much on negative predictive power (11%) as Method 1 loses on specificity (26%) and positive predictive power (16%). There is strong evidence that students who start at a low level in reading rarely catch up in later grades (Francis, Shaywitz, Stuebing, Shaywitz, & Fletcher, 1996; Juel, 1988; Torgesen & Burgess, 1998) and that remediating students later (e.g., in second grade) takes much more time and resources but is less successful (Foorman, Breier, & Fletcher, 2003; Torgesen, 2000). Thus, ensuring that fewer children are misidentified as not at risk and that those identified as at risk receive intervention corresponds to the goal and axiom of the RTI framework of allocating resources for early identification and prevention.

The findings from the present study should be interpreted in light of limitations pertaining to the context of the study—that is, the specific sample characteristics, the gold standard outcomes chosen, and the presence of ongoing interventions. Our sample slightly overrepresented White students (40% sample, 30% state) and underrepresented Black students (31% sample, 38% state). Finally, schools in Florida are required to provide appropriate interventions to students on the basis of their performance on the screening measures at the beginning of the year. However, the extent to which the students in this study were receiving Tier 2 or Tier 3 interventions, and how this affected the results, is unknown.

IMPLICATIONS FOR RESEARCH AND PRACTICE

Evaluating a screening assessment requires the educator’s awareness about multiple factors pertaining to both the psychometric elements of the screen and the practical needs of schools and districts. Schatschneider et al. (2008) provided initial guidelines about what to focus on in a screening process at the school or district level: (a) Identify what “at risk” means; (b) establish the goal for the screening process; (c) study how the screen was developed; (d) determine the base rate in the school, district, or state; (e) attend to the positive and negative predictive power; and (f) collect local data to evaluate how well the screening process is working. The first critical step is defining what “at risk” means (e.g., what outcome is used). Risk can take on a host of meanings, describing performance on a concurrently administered standardized reading assessment, benchmark performance on a progress-monitoring measure, the passing of an end-of-year state assessment test within the present year, or even success on the state test in a future grade.

Without first delineating the type of risk for which to screen, the choice of an assessment and its utility will be undermined.

Second, establishing the goal for the screening process is imperative because it will not only narrow down the list of potential screens but also help determine the amount of time that could be spent assessing, identifying, and ultimately placing students into appropriate interventions. For example, if the goal is to choose an assessment that will identify the students who have a low chance of developing a problem (i.e., reduce underidentification errors), then it is important to maximize negative predictive power. Conversely, if it is more important that a high percentage of all students be correctly identified as at risk or not at risk, then the eligible screening assessments should have a high percentage of overall correct classification of students. For example, in the context of the present study, it is possible that by simply adjusting the cut points for risk designation on the screen, as Hintze, Ryan, and Stone (2003) and Roehrig et al. (2008) did in their studies, a different level of classification will be observed to meet the needs of the school or district. Once this goal is outlined, evaluating a screen against it will assist in determining whether the screen’s definition of risk is the same as one’s own. Even if the outcome is similar, it is important to check the specifications of that outcome and how skills on that measure were assessed. Skills such as reading comprehension are complex and multidimensional, and they can be assessed in ways that tap into lower- or higher-level skills. Even the response format (e.g., multiple choice versus cloze or short answer) will have an impact on the screening accuracy of scores (Jenkins, Johnson, & Hileman, 2004).

The fourth and fifth steps are tied together because the base rate of the problem in a school or district will effectively determine the extent to which a screen can reasonably be applied. If the screen selected was normed on a sample with a similar base rate, then it may be used with little apprehension because the norming information indicates how it will likely work in the selected sample. Last, collecting local data within the school or district will provide the best gauge of how the screening process is working and of the extent to which a different definition of risk, goal, or choice of screen is warranted. This will ensure the fit between the selected screening measure and the needs of local schools and districts. In summary, this research highlights that screening measures vary in which psychometric properties of screening accuracy are maximized and that any one measure is not likely to meet the needs and priorities of all schools and districts. Thus, it is researchers’ and practitioners’ responsibility to be aware of these characteristics and to utilize screening measures as they were intended, to best serve students.

Acknowledgments

Funding

The authors disclosed receipt of the following financial support for the research and/or authorship of this article: This work was supported by the Institute of Education Sciences (R305A100301).

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interests with respect to the authorship and/or publication of this article.

References

  1. American Institutes for Research. Reading First state APR data. Washington, DC: Author; 2007.
  2. Andreassen R, Braten I. Examining the prediction of reading comprehension on different multiple-choice tests. Journal of Research in Reading. 2010;33:263–283.
  3. Barger J. Comparing the DIBELS oral reading fluency indicator and the North Carolina end of grade reading assessment. Asheville, NC: North Carolina Teacher Academy; 2003.
  4. Cain K, Oakhill J, Bryant P. Children’s reading comprehension ability: Concurrent prediction by working memory, verbal ability, and component skills. Journal of Educational Psychology. 2004;96:31–42.
  5. Compton DL, Fuchs D, Fuchs LS, Bryant JD. Selecting at-risk readers in first grade for early intervention: A two-year longitudinal study of decision rules and procedures. Journal of Educational Psychology. 2006;98:394–409.
  6. Davis FB. Fundamental factors of comprehension of reading. Psychometrika. 1944;9:185–197.
  7. Foorman BR, Breier JI, Fletcher JM. Interventions aimed at improving reading success: An evidence-based approach. Developmental Neuropsychology. 2003;24:613–639. doi: 10.1080/87565641.2003.9651913.
  8. Francis DJ, Shaywitz SE, Stuebing KK, Shaywitz BA, Fletcher JM. Developmental lag versus deficit models of reading disability: A longitudinal, individual growth curves analysis. Journal of Educational Psychology. 1996;88:3–17.
  9. Glover TA, Albers CA. Considerations for evaluating universal screening assessments. Journal of School Psychology. 2007;45:117–135.
  10. Good RH, Kaminski RA, Shinn M, Bratten J, Shinn M, Laimon L, et al. Technical adequacy and decision making utility of DIBELS (Technical Report No. 7). Eugene: University of Oregon; 2004.
  11. Good RH, Kaminski RA, Smith S, Laimon D, Dill S. Dynamic Indicators of Basic Early Literacy Skills. 5th ed. Eugene: University of Oregon; 2001.
  12. Good RH, Wallin J, Simmons DC, Kame’enui EJ, Kaminski RA. System-wide percentile ranks for DIBELS benchmark assessment (Technical Report No. 9). Eugene: University of Oregon; 2002.
  13. Harcourt Brace. Stanford Achievement Test: Technical data report. 10th ed. Orlando, FL: Author; 2004.
  14. Hintze JM, Ryan AL, Stone G. Concurrent validity and diagnostic accuracy of the Dynamic Indicators of Basic Early Literacy Skills and the Comprehensive Test of Phonological Processing. School Psychology Review. 2003;32:541–556.
  15. Invernizzi M, Meier JD, Swank L, Juel C. Phonological Awareness Literacy Screening. Charlottesville: University of Virginia; 1999.
  16. Jenkins JR. Candidate measures for screening at-risk students. Paper presented at the National Research Center on Learning Disabilities’ Responsiveness-to-Intervention Symposium; Kansas City, MO; 2003, December. Retrieved from www.nrcld.org/symposium2003/jenkins/index.html
  17. Jenkins JR, Hudson RF, Johnson ES. Screening for service delivery in an RTI framework: Candidate measures. School Psychology Review. 2007;36:582–599.
  18. Jenkins JR, Johnson E, Hileman J. When is reading also writing: Sources of individual differences on the new reading performance assessments. Scientific Studies of Reading. 2004;8:125–151.
  19. Juel C. Learning to read and write: A longitudinal study of 54 children from first through fourth grades. Journal of Educational Psychology. 1988;80:437–447.
  20. Keenan JM, Betjemann RS, Olson RK. Reading comprehension tests vary in the skills they assess: Differential dependence on decoding and oral comprehension. Scientific Studies of Reading. 2008;12:281–300.
  21. Little RJA. A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association. 1988;83:1198–1202.
  22. MacGinitie W, MacGinitie R. Gates-MacGinitie Reading Tests. 4th ed. Iowa City, IA: Houghton Mifflin; 2006.
  23. Mather N, Hammill DD, Allen EA, Roberts R. Test of Silent Word Reading Fluency. Austin, TX: Pro-Ed; 2004.
  24. Progress Monitoring and Reporting Network. Database psychometric reporting. Tallahassee, FL: Author; 2005.
  25. Roehrig AD, Petscher Y, Nettles SM, Hudson RF, Torgesen JK. Not just speed reading: Accuracy of the DIBELS oral reading fluency measure for predicting high-stakes third grade reading comprehension outcomes. Journal of School Psychology. 2008;46:343–366. doi: 10.1016/j.jsp.2007.06.006.
  26. Schatschneider C, Petscher Y, Williams KM. How to evaluate a screening process: The vocabulary of screening and what educators need to know. In: Justice L, Vukelich C, editors. Achieving excellence in preschool literacy instruction. New York: Guilford Press; 2008. pp. 304–316.
  27. Shapiro E, Solari E, Petscher Y. Use of an assessment of reading comprehension in addition to oral reading fluency on the state high stakes assessment for students in Grades 3 through 5. Learning and Individual Differences. 2008;18:316–328. doi: 10.1016/j.lindif.2008.03.002.
  28. Silberglitt B, Hintze JM. Formative assessment using CBM-R cut scores to track progress toward success on state-mandated achievement tests: A comparison of methods. Journal of Psychoeducational Assessment. 2005;23:304–325.
  29. Streiner DL. Diagnosing tests: Using and misusing diagnostic and screening tests. Journal of Personality Assessment. 2003;81:209–219. doi: 10.1207/S15327752JPA8103_03.
  30. Torgesen J. Individual differences in response to early interventions in reading: The lingering problem of treatment resisters. Learning Disabilities Research and Practice. 2000;15:55–64.
  31. Torgesen JK, Burgess SR. Consistency of reading-related phonological processes throughout early childhood: Evidence from longitudinal-correlational and instructional studies. In: Metsala J, Ehri L, editors. Word recognition in beginning reading. Hillsdale, NJ: Erlbaum; 1998. pp. 148–172.
  32. Wilson J. The relationship of Dynamic Indicators of Basic Early Literacy Skills (DIBELS) oral reading fluency to performance on Arizona Instrument to Measure Standards (AIMS) (Technical Report). Tempe, AZ: Assessment and Evaluation Department, Tempe School District No. 3; 2005.
  33. Woodcock RW, Johnson MB. Woodcock-Johnson Psycho-Educational Battery–Revised: Examiner’s manual. Chicago: Riverside; 1990.
