Abstract
The purpose of this study was to explore the utility of a dynamic assessment (DA) of algebraic learning in predicting third graders’ development of mathematics word-problem difficulty. In the fall, 122 third-grade students were assessed on a test of math word-problem skill and DA of algebraic learning. In the spring, they were assessed on word-problem performance. Logistic regression was conducted to contrast two models. One relied exclusively on the fall test of math word-problem skill to predict word-problem difficulty on the spring outcome (less than the 25th percentile). The second model relied on a combination of the fall test of math word-problem skill and the fall DA to predict the same outcome. Holding sensitivity at 87.5%, the universal screener alone resulted in a high proportion of false positives, which was practically reduced when DA was included in the prediction model. Findings are discussed in terms of a two-stage process for screening students within a responsiveness-to-intervention prevention model.
Keywords: Mathematics, word problems, screening, dynamic assessment, elementary
In the present study, we investigated the potential for a two-stage screening process to enhance the precision with which students are screened for math word-problem difficulty. The first stage involves a universal screen, the purpose of which is to eliminate high-scoring students from further consideration as at risk. In the second stage of screening, the remaining students complete a dynamic assessment to discriminate true positives (children who fail the universal screen and are truly at risk for poor math problem-solving outcome) from false positives (children who fail the universal screen but whose math problem-solving skills would develop adequately without special intervention). The importance of accurate screening has grown over the past decade as responsiveness-to-intervention (RTI) prevention systems are implemented in schools. In this introduction, we briefly explain the RTI prevention framework within which the present study is contextualized. Then, we clarify the role of screening within RTI, explaining the potential for dynamic assessment within the screening process. We then review prior research on dynamic assessment and finally explain how the present study extends previous work.
The RTI Prevention Framework
As presently practiced, RTI prevention systems are designed in dramatically different ways, with schools incorporating two to seven tiers of intervention (Berkeley, Bender, Gregg Peaster, & Saunders, 2009). One school's Tier 2 may be identical in intensity and instructional design to another school's Tier 6. This creates confusion as schools struggle to conceptualize, design, and communicate about their RTI prevention systems. For this reason, we begin this introduction by describing an RTI prevention framework that incorporates three levels of prevention services. The present study is contextualized within such a three-level RTI prevention system, within which each level is distinctive in terms of instructional intensity. We note that any RTI system can be conceptualized within this framework; that is, schools can incorporate more than one tier of intervention within any of the three levels of the framework.
The first level of the RTI prevention framework is primary prevention. It comprises the instructional practices that general education teachers conduct with all students: (a) the core instructional program along with (b) classroom routines that provide opportunities for instructional differentiation, (c) accommodations that permit access to the primary prevention program for all students, and (d) problem-solving strategies designed to address students’ motivational problems that interfere with them performing the academic skills they possess. Most core programs are designed using instructional principles derived from research, but few are validated because of the challenges associated with conducting controlled studies of complex, multicomponent programs.
By contrast, secondary prevention of the RTI prevention framework involves small-group instruction that relies on a validated tutoring protocol. The validated protocol specifies instructional procedures and dictates its duration (typically 10 to 15 weeks of 20- to 40-min sessions) and frequency (three or four times per week). The intensity of secondary prevention differs from primary prevention in three ways. First, secondary prevention is empirically validated whereas primary prevention is research principled. Second, secondary prevention relies entirely on adult-led small-group tutoring whereas primary prevention relies heavily on whole-class instruction. Third, because secondary prevention involves a clearly articulated standard protocol, it does not require as much professional judgment as primary prevention. For this reason, some (not all) schools rely on paraprofessionals to implement secondary prevention to make RTI more doable. In either case, secondary prevention is not the responsibility of the general education teacher; rather, professional support staff (e.g., reading and math coaches) implement secondary prevention, sometimes directly and other times by training and supervising paraprofessionals to serve as tutors. Schools may design their RTI prevention system so that students receive just one or a series of tutoring protocols before proceeding to the third level of the prevention system.
When a validated tutoring protocol is used at secondary prevention with fidelity, the large majority of students are expected to benefit. In this way, validation provides a basis for two critical, interrelated assumptions. First, a student's unresponsiveness to validated tutoring, which involves a standard program that has been proven effective for the vast majority of students, is not the result of poor instruction but rather student characteristics (i.e., a possible disability). Second, students who do not benefit from validated tutoring demonstrate a need for nonstandard instruction. As written into federal law, students who have a disability and display a need for nonstandard instruction are entitled to special education. Hence, a comprehensive evaluation follows to confirm the presence of a disability, making it possible (although not necessary) for special education resources to fuel tertiary prevention of the RTI prevention framework.
Tertiary prevention differs from secondary prevention in two important ways. First, in tertiary prevention, teachers establish clear, individual, and ambitious year-end goals in instructional material that matches the student's needs. This material may or may not be the student's grade-appropriate curriculum (instead addressing foundational skills necessary for successful performance in grade-appropriate material and, in this way, representing appropriate content standards). Second, because the student has demonstrated insufficient response to standard forms of instruction at primary prevention and secondary prevention, tertiary prevention is individualized. The teacher begins with a more intensive version of the standard protocol (e.g., longer sessions, smaller group size) but does not presume it will meet the student's needs. Rather, frequent progress monitoring quantifies the effects of the protocol using rate of improvement (slope). When slope forecasts that goal attainment is unlikely, the teacher experiments by modifying components of the protocol, while using progress monitoring to assess the effects of those modifications. In this way, the teacher inductively designs an effective, individualized instructional program.
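To illustrate this slope-based decision rule, here is a minimal sketch (our own illustration, not drawn from any specific progress-monitoring system; the probe scores, goal, and timeline are invented) that estimates a student's rate of improvement from weekly probes and compares the forecasted year-end score against the goal:

```python
# Hypothetical sketch of the slope-based decision rule described above.
# Weekly probe scores, the year-end goal, and the timeline are invented for illustration.

def ols_slope(scores):
    """Least-squares slope (rate of improvement per week) for equally spaced probes."""
    n = len(scores)
    weeks = range(n)
    mean_w = sum(weeks) / n
    mean_s = sum(scores) / n
    num = sum((w - mean_w) * (s - mean_s) for w, s in zip(weeks, scores))
    den = sum((w - mean_w) ** 2 for w in weeks)
    return num / den

probes = [8, 9, 9, 11, 10, 12]   # hypothetical weekly progress-monitoring scores
goal = 30                        # hypothetical, ambitious year-end goal
weeks_remaining = 20             # hypothetical weeks left in the school year

slope = ols_slope(probes)
forecast = probes[-1] + slope * weeks_remaining
if forecast < goal:
    print(f"Forecast {forecast:.1f} < goal {goal}: modify a component of the protocol.")
else:
    print(f"Forecast {forecast:.1f} >= goal {goal}: keep the current protocol.")
```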
Screening Within RTI and the Potential of Dynamic Assessment
Within this RTI prevention framework, a first order issue is identifying students who are at risk for the serious and long-term negative consequences associated with poor academic achievement and who therefore need to enter secondary prevention. To identify these students, schools administer tests that forecast academic achievement and apply cut points to the resulting scores to distinguish students who are and are not at risk. The traditional approach involves administering a general measure of intelligence (e.g., Raven Progressive Matrices; Raven, 1960) or a test of specific ability or skill presumed to underlie future academic performance (e.g., phonological processing for development of word-reading performance or fast recognition of small quantities for development of early calculation skill). In these conventional testing situations, students respond without examiner assistance.
Such assessments of intelligence or precursor abilities and skills capture varying amounts of variance in forecasting academic development. For example, a measure of quantity discrimination, when used as a screener at the beginning of first grade, accounts for 25% to 63% of the variance in end-of-year math outcomes, depending on the study (e.g., Chard et al., 2005; Clarke & Shinn, 2004; Lembke & Foegen, 2006). Research does however show that such measures can produce large percentages of false positives: children who fail the universal screen but would develop adequately without secondary prevention (L. S. Fuchs et al., 2007; Seethaler & Fuchs, 2010). This is problematic for RTI prevention systems because identifying large numbers of false positives to enter secondary prevention, when these students would do fine without such expensive services, stresses the resources available in schools to provide remediation for the students who truly need it.
Because these conventional assessments are imperfect predictors of academic learning, they have long been the target of scrutiny and criticism. One concern is that these “static” estimates of performance reveal only two states: unaided success or failure. By contrast, as Vygotsky (e.g., 1934/1962) proposed, children may function somewhere between these states: unable to perform a task independently but able to succeed with assistance. This has implications for making distinctions among students at the lower end of the distribution. For example, when two children earn the same low score on a measure of quantity discrimination, they may not have the same potential to develop computational skill. One may develop well with only minimal assistance, suggesting that the initially low performance on the static assessment stems from inadequate learning opportunity in the child's present environment. The other child, by contrast, may struggle with computation even when provided highly explicit instruction, revealing that the initially low performance stems from important weaknesses that need to be addressed with intervention.
If the goal is to forecast learning potential, then the question is why not assess the student's capacity to learn rather than what the student presently knows. This alternative form of assessment, which measures a student's learning potential, is known as dynamic assessment (DA). It involves structuring a learning task, providing instruction to help the student learn the task, and indexing responsiveness to that instruction as a measure of learning potential. Research is needed to understand whether DA improves the prediction of student achievement outside of the DA.
Prior Work on DA as a Predictor of Academic Achievement
Research on DA varies as a function of the DA's structure and design and in terms of research questions and the methodological features of studies. In terms of structure and design, DAs vary along three dimensions (see Campione, 1989): index, style of interaction, and the nature of the skills assessed. Index refers to the way in which DAs quantify responsiveness to the assisted phase of learning. One approach involves indexing the amount of change from an unassisted pretest to an unassisted posttest (with the assisted learning phase intervening between the pre- and posttest) or scoring students’ unaided performance following the assisted phase of assessment (e.g., Ferrara, Brown, & Campione, 1986). Another approach is to quantify the amount of scaffolding required during the assisted phase of assessment to reach criterion performance (e.g., Murray, Smith, & Murray, 2000; Spector, 1992). These alternative methods for indexing DA performance serve the same purpose: to predict whether students require extra attention to learn adequately. DAs also vary in terms of the style of interaction. Some DAs (e.g., Ferrara et al., 1986) are standardized, such that the tester administers a fixed series of prompts. Other DAs (e.g., Tzuriel & Haywood, 1992) are individualized, such that the tester addresses the student's specific obstacles as revealed by that student's responses. The third dimension along which DAs vary is the nature of the skills assessed. Early work (e.g., Budoff, 1967; Feuerstein, 1979) tended to focus on domain-general skills associated with cognitive ability. More recent DAs tend to be more academically grounded (e.g., Bransford, Delclos, Vye, Burns, & Hasselbring, 1987; Campione, 1989; Campione & Brown, 1987; Spector, 1992).
In terms of research questions and methods, some studies focus on the amount of learning that accrues on the DA task as a function of student characteristics or the structure of DA (e.g., Tzuriel & Haywood, 1992). This approach dominated DA research in the 1970s, 1980s, and 1990s. Alternatively, researchers consider the contribution of DA in explaining academic performance outside the DA. This second type of study can be categorized further in terms of whether studies account for competing predictors of outcome and whether academic outcome is assessed concurrently with DA or at a later time. Compared to studies that do not control for competing predictors of outcome, studies that exert such control impose a more stringent criterion for considering the value of DA. In terms of the timing of the outcome, studies that measure academic performance at a later time enhance external validity, given that the purpose of DA is to forecast future academic achievement. To contextualize the present study, we focused on the subset of studies that explored DA's contribution in predicting academic performance while controlling for competing predictors. We did however consider studies that predicted concurrent as well as future academic performance. We also included DAs of varying structure and design (for a comprehensive review of DA, see Grigorenko & Sternberg, 1998).
Using our inclusion criteria, we identified four relevant studies (see Note 1). Speece, Cooper, and Kibler's (1990) DA task involved a domain-general skill associated with overall cognitive ability: solving matrices from intelligence tests. Using a standardized style of interaction, they indexed first graders’ learning potential via the number of prompts required during the assisted phase of assessment, and they examined the contribution of DA over verbal IQ, pre-DA matrices performance, and language ability for predicting performance on a concurrently administered math achievement test. Although the result was statistically significant, DA accounted for less than 2% of the variance in concurrent math performance.
Swanson and Howard (2005) also predicted concurrent academic performance using a standardized form of DA. They extended the work of Speece et al. (1990) by centering DA on cognitive abilities presumed to underlie academic performance: phonological working memory (i.e., rhyming tasks that required recall of acoustically similar words) and semantic working memory (i.e., digit or sentence tasks that required recall of numerical information embedded in short sentences). Testers selected among four standardized hints to correspond with student errors, choosing the least obvious, relevant hint. Performance was scored as a gain score (highest score obtained with assistance), maintenance score (stability of the gain score after assistance was removed), and probe score (number of hints to achieve highest level). DA scores for phonological working memory were combined into a factor score; the same was done for DA semantic working memory. The researchers classified students, whose average age was 10 to 12 years, as skilled readers, poor readers, reading disabled, or math and reading disabled. To predict concurrent performance on a reading and arithmetic achievement test, verbal IQ and pre-DA working memory were entered first into multiple regression analyses. In predicting reading, the unique contribution of the semantic DA factor, which was the better of the two DA factor scores, was 6%. In predicting arithmetic, the unique contribution of the semantic DA factor was 25%.
Although Speece et al. (1990) found more limited support for DA's added value as related to math performance when DA addressed a domain-general task of cognitive ability, Swanson and Howard's (2005) work, which centered DA on cognitive abilities presumed to underlie academic performance, was more encouraging. Yet neither study assessed academic performance at a later time. Spector (1992) centered DA on cognitive abilities presumed to underlie academic performance (as did Swanson & Howard, 2005) and, in contrast to Speece et al. (1990) and Swanson and Howard (2005), delayed assessment of academic achievement to later in the school year. In November of kindergarten, she administered a standardized DA of phonemic awareness, indexing the number of prompts to achieve criterion performance; in May, she assessed word-level reading skill. In predicting end-of-year word-reading skill, DA substantially enhanced predictive validity beyond initial verbal ability and initial, static phonological awareness performance, explaining an additional 21% of the variance. In fact, November DA was the only significant predictor of May word-reading skill. These results provide evidence that DA may enhance the prediction of student learning.
L. S. Fuchs, Compton, et al. (2008) extended the framework for considering type of DA task by employing actual math content that was not a precursor or foundational skill for the predicted outcome. The question was whether a student's potential to learn algebra at the beginning of third grade forecasted development of word-problem skill over the course of the academic year, when controlling for the nature of classroom instruction and for student skills and characteristics previously documented as important to word-problem performance. The student skills and characteristics, which were measured at the beginning of the year when DA was also administered, were computation skill, word-problem skill, language ability, attentive behavior, and nonverbal reasoning. Structural equation measurement models showed that DA measured a distinct dimension of beginning-of-the-year ability. Also, although instruction (conventional vs. validated) and pretreatment computational skill were sufficient to account for math word-problem outcome proximal to classroom instruction, DA, in addition to language and pretreatment word-problem skill, was needed to forecast learning on word-problem outcomes more distal to classroom instruction.
Purpose of the Present Study
L. S. Fuchs, Compton, et al. (2008) showed that an algebra DA was a distinct dimension of ability at the beginning of third grade when competing with a host of static measures and that algebra DA was necessary to account for learning on word-problem outcomes distal from classroom instruction. This suggests the potential for DA to capture important information about students’ capacity to profit from school instruction—information that stands apart from what students already know, which is influenced by culture, socioeconomics, and previous learning opportunity. In these ways, L. S. Fuchs, Compton, et al. suggest that DA might be useful in screening students’ risk for poor learning outcomes within an RTI framework.
This was the focus of the present study, in which we reanalyzed the L. S. Fuchs, Compton, et al. (2008) database to consider DA within the context of a two-stage screening process for enhancing the precision with which students are screened for math word-problem difficulty. In the first stage, an easy-to-administer and low-cost universal screen would be administered to all students, with the cut point set to minimize missing true positives. The purpose of this first-stage screening would be to eliminate high-scoring students from further consideration, which requires more costly assessment. In the second stage of screening, a more in-depth measure would be used to discriminate the remaining students in terms of false positives versus true positives. Research shows that screening with a one-stage universal screen in the primary grades results in high percentages of false positives (e.g., D. Fuchs, Compton, Fuchs, & Davis, 2008; Seethaler & Fuchs, 2010). If a second-stage screening process reduces these errors, then the efficiency and effectiveness of RTI prevention systems should improve: fewer false positives would enter secondary prevention they do not need, permitting schools to allocate those resources to the students who truly need them. In the present study, we used DA heuristically to gain insight into such a two-stage screening process and to understand DA's specific potential as a second-stage screener.
Method
Participants
The data described in this article were collected as part of a prospective 4-year study assessing the effects of mathematical problem-solving instruction and examining the developmental course and cognitive predictors of mathematical problem solving. The present study represents a reanalysis of the L. S. Fuchs, Compton, et al. (2008) database, involving the 4th-year cohort from the larger study at the fall and spring assessment waves of third grade. Students were sampled from 30 participating classrooms in five Title I and three non-Title I schools, with 2 to 6 students per class. From the 510 students with parental consent, we randomly sampled 150 for participation, blocking within instructional condition (half were randomly assigned to conventional math problem-solving instruction and half to schema-broadening math problem-solving instruction; see L. S. Fuchs, Compton, et al., 2008, for more information), within classroom, and within three strata: (a) 25% of students with scores 1 SD below the mean of the entire distribution on the Test of Computational Fluency (L. S. Fuchs, Hamlett, & Fuchs, 1990), (b) 50% of students with scores within 1 SD of the mean of the entire distribution on the same measure, and (c) 25% of students with scores 1 SD above the mean of the entire distribution on the same measure. Of these 150 students, we had complete data for 122 children, who were comparable to the 150 initially sampled students on all variables. See Table 1 for the performance of these 122 students, of whom 67 (54.9%) were male and 80 (67.0%) received subsidized lunch. In terms of race/ethnicity, 57 (47.5%) were African American, 52 (42.6%) were European American, 10 (8.3%) were Hispanic, and 3 (2.5%) were Other. Two students (1.7%) were English language learners.
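As an illustration of this sampling scheme, here is a rough sketch (our own construction, with invented data and hypothetical function names; the study's actual sampling code is not described) of assigning students to strata based on the full Test of Computational Fluency distribution and then drawing a roughly 25/50/25 sample within a class:

```python
# Hypothetical sketch of the stratified sampling described above; all data are invented.
import random
import statistics

def assign_strata(all_scores):
    """Assign every consented student to a low/middle/high stratum relative to the
    mean and SD of the entire Test of Computational Fluency distribution."""
    mean = statistics.mean(all_scores.values())
    sd = statistics.stdev(all_scores.values())
    strata = {}
    for student, score in all_scores.items():
        z = (score - mean) / sd
        strata[student] = "low" if z <= -1 else "high" if z >= 1 else "middle"
    return strata

def sample_within_class(roster, strata, n=4, seed=1):
    """Draw roughly 25% low, 50% middle, and 25% high students from one class."""
    random.seed(seed)
    targets = {"low": round(0.25 * n), "middle": round(0.50 * n), "high": round(0.25 * n)}
    sampled = []
    for level, k in targets.items():
        pool = [s for s in roster if strata[s] == level]
        sampled += random.sample(pool, min(k, len(pool)))
    return sampled

# Example with invented scores for one class of six students:
scores = {"s1": 12, "s2": 25, "s3": 30, "s4": 33, "s5": 38, "s6": 52}
print(sample_within_class(list(scores), assign_strata(scores)))
```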
Table 1.

| Variable | Raw Score M | Raw Score SD | Standard Score M | Standard Score SD |
|---|---|---|---|---|
| Descriptive measures | | | | |
| WASI IQ | NA | NA | 96.10 | 12.55 |
| TCAP reading | NA | NA | 54.74 | 14.55 |
| TCAP math | NA | NA | 60.70 | 17.90 |
| WJ–applied prob. | 30.04 | 4.15 | 106.97 | 12.58 |
| WRMT-WID | 56.66 | 11.51 | 100.94 | 11.20 |
| Screeners | | | | |
| AWP | 8.00 | 5.45 | NA | NA |
| DA | 9.98 | 4.13 | NA | NA |
| Outcome | | | | |
| Iowa | 15.74 | 4.36 | 188.05 | 20.60 |
Note: N = 122. WASI = Wechsler Abbreviated Scale of Intelligence (Wechsler, 1999); TCAP = Tennessee Comprehensive Assessment Program (CTB/McGraw-Hill, 1997); WJ = Woodcock–Johnson III Tests of Cognitive Abilities (Woodcock, McGrew, & Mather, 2001); WRMT = Woodcock Reading Mastery Tests (Woodcock, 1998); WID = Word Identification; AWP = Algorithmic Word Problems (L. S. Fuchs, Hamlett, & Powell, 2003); DA = dynamic assessment of algebraic learning (L. S. Fuchs, Compton, et al., 2008); Iowa = Iowa Test of Basic Skills (Hoover, Dunbar, & Frisbie, 2001).
Screeners
Math word-problem skill. Algorithmic Word Problems (L. S. Fuchs, Hamlett, & Powell, 2003) comprises 10 word problems, each of which requires one to four steps. The measure, which is group administered, samples four problem types that are part of the third-grade curriculum: shopping list, half, buying bags, and pictograph problems. The tester reads each item aloud while students follow along on their own copies of the problems; the tester progresses to the next problem when all but one or two students have their pencils down, indicating they are finished. Students can ask for rereading(s) as needed. The maximum score is 44. We used two alternate forms; the problems in both forms required the same operations, incorporated the same numbers, and presented text with the same number or length of words. We used Form A in half of the classes and Form B in the other half. For the representative sample, Cronbach's alpha was .85, and criterion validity with the previous spring's TerraNova (CTB/McGraw-Hill, 1997) Total Math score was .58 for the 844 students for whom we had TerraNova scores. Interscorer agreement, computed on 20% of protocols by two independent scorers, was .984.
DA. We selected basic algebra skills as our DA (L. S. Fuchs, Compton, et al., 2008) content for the following reasons. First, we could safely assume that these skills were unfamiliar to third graders (i.e., the skills had not been introduced in school to students by the beginning of third grade and were not familiar from everyday life experiences). Second, the algebra skills were of sufficient difficulty that most third graders would not be able to solve the problems without assistance but could learn the skills with varying amounts of teaching. Third, at the beginning of third grade, students should have mastered the simple calculation skills incorporated in the three algebra skills. Fourth, we could delineate rules underlying the algebra skills, rules that could be used to construct clear explanations within a graduated sequence of prompts. Fifth, the graduated sequences of prompts for the three skills could be constructed in an analogous hierarchy, thereby promoting equal interval scaling of the DA scoring system. Sixth, the three skills were increasingly difficult (as established in pilot work), and later skills appeared to build on earlier skills; therefore, we hoped that transfer across the three algebra skills might facilitate better DA scores. Seventh, although linguistic content is absent from the algebra skills, algebra does require understanding of the relations among quantities, as is the case for solving word problems. Hence, a potentially important connection does exist between algebraic learning, as assessed on the DA content, and word-problem learning.
The three algebra skills are (a) finding the missing variable in the first or second position in addition equations (e.g., x + 5 = 11 or 6 + x = 10), (b) finding x in multiplication equations (e.g., 3x = 9), and (c) finding the missing variable in equations with two missing variables, but with one variable then defined (e.g., x + 2 = y - 1; y = 9). We refer to these three skills, respectively, as DA Skill A, DA Skill B, and DA Skill C.
Mastery of each DA skill is assessed before instructional scaffolding occurs, and mastery testing recurs after each level of instructional scaffolding is completed. The mastery test comprises six items representing the skill targeted for mastery, with mastery defined as at least five items correct. The items on the test are not used for instruction but are parallel instructional items; each time the six-item test is readministered for a given skill, a different form is used, although some items recur across forms. If, after 5 s, the student has not written anything and does not appear to be working, the tester asks, “Can you try this?”; if, after another 15 s, the student still has not written anything and does not appear to be working, the tester asks, “Are you still working or are you stuck?” If the student responds that he or she is stuck, the tester initiates the first (or next) level of instructional scaffolding. If the student responds that he or she is still working but another 30 s passes without the student writing anything or working, the tester then initiates the first (or next) level of instructional scaffolding. If the student masters the skill (i.e., at least five items are correct), the tester administers a generalization problem (i.e., for Skill A: 3 + 6 + x = 11; for Skill B: 14 = 7x; for Skill C: 3 + x = y + y; y = 2) and moves to the next DA Skill. If the student does not master the skill (i.e., fewer than five items are answered correctly), the tester provides the first (or next) level of instructional scaffolding, which is followed by the six-item test. The levels of instructional scaffolding gradually increase instructional explicitness and concreteness. If a student fails to answer at least five items correctly after the tester provides all five scaffolding levels for a given skill, the DA is terminated.
Scores range from 0 to 21, where 0 is the worst score (i.e., student never masters any of the three skills) and 21 is the best score (i.e., student masters each of the three skills on the pretest and gets every generalization problem correct). So, for each skill, there is a maximum of 7 points, awarded as follows: student masters skill on pretest = 6 points, student masters skill after Scaffolding Level 1 = 5 points, student masters skill after Scaffolding Level 2 = 4 points, student masters skill after Scaffolding Level 3 = 3 points, student masters skill after Scaffolding Level 4 = 2 points, student masters skill after Scaffolding Level 5 = 1 point, student never shows mastery = 0 points. In addition, if the student gets the generalization problem correct, 1 bonus point is added, for the maximum score of 7 points for that DA skill.
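To make this scoring rule concrete, the following sketch (our illustration; the function names are hypothetical) converts each skill's outcome into the 0-to-7 skill score and 0-to-21 total described above:

```python
def skill_score(mastery_level, generalization_correct):
    """Score one DA skill on the 0-7 scale described above.

    mastery_level: 0 if the skill was mastered on the pretest, 1-5 if it was mastered
    after that level of instructional scaffolding, or None if never mastered.
    generalization_correct: True if the generalization problem was solved (it is
    administered only when the skill is mastered).
    """
    if mastery_level is None:
        return 0                      # student never shows mastery
    base = 6 - mastery_level          # pretest mastery = 6, ..., after Level 5 = 1
    bonus = 1 if generalization_correct else 0
    return base + bonus

def da_total(skill_outcomes):
    """Sum the three skill scores (DA Skills A, B, C) into the 0-21 total."""
    return sum(skill_score(level, gen) for level, gen in skill_outcomes)

# Example: mastered Skill A on the pretest with a correct generalization problem (7),
# mastered Skill B after Scaffolding Level 2 but missed the generalization item (4),
# never mastered Skill C (0) -> total DA score of 11.
print(da_total([(0, True), (2, False), (None, False)]))
```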
Scaffolding levels, which range from incidental to explicit, are provided for DA Skill A in L. S. Fuchs, Compton, et al. (2008). For the scaffolding levels for DA Skills B and C, contact the first author. The DA's correlation with the previous year's math composite score on the state assessment (CTB/McGraw-Hill, 1997) was .49.
Math Word-Problem Outcome and Definition of Difficulty
With the Iowa Test of Basic Skills: Problem Solving and Data Interpretation (Iowa; Hoover, Dunbar, & Frisbie, 2001), students solve 24 word problems, including problems that require interpreting data presented in tables and graphs. At Grades 1 to 5, KR-20 ranges from .83 to .87. In this study, coefficient alpha was .86. We defined word-problem difficulty (WPD) as scoring below the 25th percentile at spring of third grade.
Data Analysis
We used logistic regression to evaluate the utility of two contrasting screening models for classifying WPD: one based entirely on the group-administered Algorithmic Word Problems measure and the other based on the Algorithmic Word Problems measure plus DA. Within the context of RTI, we were interested in maximizing identification of true positives (i.e., students who truly require secondary prevention services) while limiting the identification of false positives (i.e., students who are identified as at risk for WPD at the time of screening but who complete third grade above our criterion for WPD). Therefore, within both screening models, we held sensitivity at .875 (i.e., at least 87.5% of students who complete third grade with WPD are identified as at risk for WPD in the fall of third grade) and then observed how the competing models affected specificity, hoping that the addition of DA to the model would reduce false positives.
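As a hedged sketch of this general procedure (not the study's actual analysis code), the example below uses simulated data with invented parameter values and assumes numpy and scikit-learn are available. It fits the two competing models, lowers the classification cutoff until sensitivity reaches 87.5%, and then reads off specificity, mirroring the contrast described above.

```python
# Illustrative sketch only: the data are simulated and the original study's software and
# exact cutoff-selection rule are not specified here. Assumes numpy and scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 122
awp = rng.normal(8.0, 5.45, n)                        # fall Algorithmic Word Problems (simulated)
da = rng.normal(10.0, 4.13, n)                        # fall DA score (simulated)
risk = 1 / (1 + np.exp(0.4 * awp + 0.2 * da - 3.8))   # invented data-generating model
wpd = (rng.uniform(size=n) < risk).astype(int)        # 1 = word-problem difficulty in spring

def specificity_at_fixed_sensitivity(X, y, min_sensitivity=0.875):
    """Fit a logistic regression, then lower the probability cutoff until at least
    87.5% of true WPD cases are flagged as at risk; report the resulting specificity."""
    p = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    for cutoff in np.sort(np.unique(p))[::-1]:        # highest cutoff first
        flagged = p >= cutoff
        sens = flagged[y == 1].mean()
        if sens >= min_sensitivity:
            return cutoff, sens, (~flagged)[y == 0].mean()
    return 0.0, 1.0, 0.0

for label, X in [("AWP only", awp.reshape(-1, 1)),
                 ("AWP + DA", np.column_stack([awp, da]))]:
    cutoff, sens, spec = specificity_at_fixed_sensitivity(X, wpd)
    print(f"{label}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```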
We used measures of sensitivity, specificity, overall hit rate, and area under the receiver operating characteristic (ROC) curve to contrast the utility of the two competing logistic regression models. Sensitivity, the proportion of children with WPD whom the model correctly predicts to have WPD, is computed by dividing the number of true positives by the sum of true positives and false negatives. Specificity, the proportion of children without WPD whom the model correctly predicts to not have WPD, is computed by dividing the number of true negatives by the sum of true negatives and false positives. Overall hit rate, the proportion of children correctly classified as either having or not having WPD, represents the overall accuracy of the prediction model. Finally, the ROC curve plots the true positive rate against the false positive rate across the possible cut points of a test, and the area under the ROC curve (AUC) summarizes discrimination across those cut points. To contrast the predictive accuracy of the logistic regression models, we used AUC as a measure of discrimination (see Swets, 1992). To illustrate, if we had already placed children into their correct WPD groups and then selected one child at random from each group, we would assume that the child scoring higher on the screener or screeners would be the child without WPD. The AUC represents the proportion of such randomly chosen pairs of students for which the screener or screeners correctly classify the children with and without WPD. It ranges from .50 to 1.00; the greater the AUC, the less likely classification is the result of chance. AUC less than .70 indicates a poor predictive model, .70 to .80 is fair, .80 to .90 is good, and greater than .90 is excellent (Swets, 1992). Output from ROC analysis includes confidence intervals for the AUC, and a lack of overlap between the models' confidence intervals indicates a significant difference in the predictive accuracy of the models.
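To make the pairwise interpretation of AUC concrete, here is a small sketch of our own (the screener scores in the example are invented) that computes AUC as the proportion of (WPD, non-WPD) pairs in which the child without WPD has the higher screener score, counting ties as half:

```python
def pairwise_auc(scores_wpd, scores_no_wpd):
    """AUC as the proportion of pairs in which the child without WPD scores higher
    on the screener than the child with WPD (ties count as half a correct pair)."""
    pairs = [(pos, neg) for pos in scores_wpd for neg in scores_no_wpd]
    wins = sum(1.0 if neg > pos else 0.5 if neg == pos else 0.0 for pos, neg in pairs)
    return wins / len(pairs)

# Hypothetical fall screener scores for children who do (left list) and do not
# (right list) go on to show word-problem difficulty in the spring.
print(pairwise_auc([3, 5, 6], [7, 9, 4, 10]))   # 0.83, in the "good" range
```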
Results
In Table 1, we show means and standard deviations to describe the sample on the Algorithmic Word Problems and DA screeners and on the Iowa outcome. The correlation between the two screeners (Algorithmic Word Problems and DA) was .52, between the Algorithmic Word Problems screener and the Iowa outcome was .43, and between DA and the Iowa outcome was .26. The prevalence of WPD in this sample was 19.7%. In Table 2, we report the results of the logistic regression analyses for classifying WPD status in the spring of third grade. Classifying WPD based on Algorithmic Word Problems while holding sensitivity at 87.5% resulted in specificity of 48.0%; the hit rate was 55.7%. Adding DA to the prediction model, while again holding sensitivity at 87.5%, improved specificity to 70.4%; the hit rate was 73.8%. The AUC values for the two models (Algorithmic Word Problems alone vs. Algorithmic Word Problems plus DA) were .834 and .860, respectively, which are deemed good. Confidence intervals for the AUCs overlapped, indicating that the models were not significantly different. While permitting no more than 3 students with WPD to be missed via screening (see the FN column), sole reliance on the Algorithmic Word Problems resulted in 51 students identified as at risk who did not in fact meet the spring criterion for WPD (see the FP column). The addition of DA to the model, again permitting no more than 3 students with WPD to be missed via screening, reduced the number of false positives to 29, a reduction of 43%.
Table 2.

| Model | B | SE | Wald | p | TN | FN | TP | FP | Hit Rate | Sensitivity | Specificity | AUC | SE | CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model 1 | | | | | | | | | | | | | | |
| AWP | –0.337 | 0.080 | 17.593 | .000 | 47 | 3 | 21 | 51 | 55.7 | 87.5 | 48.0 | .834 | .043 | .751–.918 |
| Constant | 0.777 | 0.476 | 2.658 | .103 | | | | | | | | | | |
| Model 2 | | | | | | | | | | | | | | |
| AWP | –0.265 | 0.084 | 9.979 | .002 | 69 | 3 | 21 | 29 | 73.8 | 87.5 | 70.4 | .860 | .041 | .780–.940 |
| DA | –0.195 | 0.081 | 5.743 | .017 | | | | | | | | | | |
| Constant | 1.971 | 0.721 | 7.435 | .006 | | | | | | | | | | |
Note: N = 122. TN = true negatives; FN = false negatives; TP = true positives; FP = false positives; hit rate = (TP + TN)/N; sensitivity = TP/(TP + FN); specificity = TN/(TN + FP); ROC = receiver operating characteristic curve; AUC = area under the curve; AWP = Algorithmic Word Problems; DA = dynamic assessment of algebraic learning. Hit rate, sensitivity, and specificity are expressed as percentages.
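As a quick arithmetic check on Table 2 (our own addition, included for transparency), the reported hit rate, sensitivity, and specificity follow directly from the TN, FN, TP, and FP counts:

```python
# Recomputing Table 2's classification indices from its confusion-matrix counts.
models = {"AWP only": dict(tn=47, fn=3, tp=21, fp=51),
          "AWP + DA": dict(tn=69, fn=3, tp=21, fp=29)}
for name, m in models.items():
    n = sum(m.values())                            # 122 students in both models
    hit = 100 * (m["tp"] + m["tn"]) / n            # 55.7 and 73.8
    sens = 100 * m["tp"] / (m["tp"] + m["fn"])     # 87.5 in both models
    spec = 100 * m["tn"] / (m["tn"] + m["fp"])     # 48.0 and 70.4
    print(f"{name}: hit rate {hit:.1f}%, sensitivity {sens:.1f}%, specificity {spec:.1f}%")
```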
Discussion
In the present study, we reanalyzed the L. S. Fuchs, Compton, et al. (2008) database to consider DA within the context of a two-stage screening process for enhancing the precision with which students are screened for WPD. In the first stage, a relatively easy-to-administer and low-cost universal screen would be administered to all students, with the cut point set to minimize missing true positives. The purpose of this first-stage screening would be to eliminate as many high-scoring students as possible from further consideration, which requires more costly, additional assessment. In the second stage of screening, a more in-depth measure would be used to sort the remaining students into false positives and true positives. Research shows that screening with a one-stage universal screen in the primary grades results in high percentages of false positives (e.g., L. S. Fuchs et al., 2007; Seethaler & Fuchs, 2010). If a second-stage screening process reduces such error, then the efficiency and effectiveness of RTI prevention systems should improve by limiting the false positives who enter secondary prevention and instead permitting schools to allocate those resources to the students who truly need them. In the present study, we used DA heuristically to gain insight into such a two-stage screening process and to understand DA's specific potential as a second-stage screener.
Results support the heuristic of a two-stage screening process. Our first model, which relied entirely on a group-administered screen that might be used universally within a two-stage screening process, resulted in unacceptably low specificity of 48.0% (while holding sensitivity to 87.5%) and an inadequate hit rate of 55.7%. By contrast, our second model, which simultaneously considered the more efficient group-administered test along with DA, increased specificity to 70.4%, even as sensitivity was held at 87.5%. The hit rate improved to 73.8%. Therefore, although there was no reliable difference between the AUCs for the two models, the second-stage screener resulted in a practically important reduction in the identification of false positives: 22 fewer (of 122) students were incorrectly identified as at risk for WPD. These students would be unnecessarily tutored at the secondary prevention level if screening and resulting intervention decisions were to rely entirely on the group-administered universal screen.
In thinking about the practical value of such a two-stage screening process, we pose the following hypothetical cost analysis, in which the cost side involves administering an additional screener to a subset of students and the saving side involves reducing the number of students incorrectly identified for secondary preventive tutoring. To conduct this cost analysis, we first determined the cut point on the first-stage screener, Algorithmic Word Problems, that would permit 87.5% sensitivity. Setting a cut point of 10 on Algorithmic Word Problems eliminated 50 not-at-risk students from second-stage screening while missing 3 at-risk students, to yield sensitivity of 87.5%. This left 72 students for second-stage screening with DA, only some of whom were truly at risk.
We then made the following assumptions (which, we note, represent only one of several possible scenarios). First, we assumed that screening would occur at third grade, in a school with 120 third graders. Second, we assumed that secondary preventive tutoring would be conducted in triads, with a cost of $100 per hour per professional tutor. Third, we assumed that the DA required, on average, .75 hr to administer to one student and that the tester would be paid $100 per hour. With these assumptions, a two-stage screening process in which all 72 students receive a .75-hr DA would require 54 hr of testing at a cost of $100 per hour, an added cost over a one-stage screening process of $5,400. At the same time, however, a two-stage screening process based on these assumptions would save 22 false positives from unnecessary tutoring within the RTI system, tutoring they otherwise would have received within a one-stage screening process. In this scenario, seven triads of students would avoid unnecessary tutoring, which would require 34 hr per triad, or 238 hr, at a cost of $100 per hour. This would yield a savings of $23,800. Therefore, this two-stage screening process adds $5,400 in expenses even as it saves $23,800, leaving a net reduction of $18,400 in RTI costs for a school building.
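The arithmetic behind this scenario can be made explicit. The following sketch (ours) simply recomputes the hypothetical costs and savings under the stated assumptions:

```python
# Recomputing the hypothetical cost analysis under the assumptions stated above.
hourly_rate = 100              # dollars per hour for tester or tutor
da_hours_per_student = 0.75
students_screened_stage2 = 72
false_positives_avoided = 22
triad_size = 3
tutoring_hours_per_triad = 34

added_cost = students_screened_stage2 * da_hours_per_student * hourly_rate   # 72 * .75 * 100 = $5,400
triads_avoided = false_positives_avoided // triad_size                        # 7 triads
savings = triads_avoided * tutoring_hours_per_triad * hourly_rate             # 7 * 34 * 100 = $23,800
print(f"Added screening cost: ${added_cost:,.0f}")
print(f"Tutoring savings:     ${savings:,.0f}")
print(f"Net reduction:        ${savings - added_cost:,.0f}")                  # $18,400
```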
The cost efficiency of this two-stage screening process highlights the heuristic value of following up universal screening with additional assessment to narrow the pool of students identified as truly at risk for academic difficulty. At the same time, it is important to note that the classification accuracy of the specific screening model tested in the present study, which incorporated Algorithmic Word Problems as well as DA, leaves room for improvement. While holding sensitivity to a high standard (87.5%), specificity remained an unacceptably low 70.4% even after adding DA to the prediction model. Although this rate of false positives was practically superior to the base model that relied exclusively on Algorithmic Word Problems, other approaches to second-stage screening may produce superior classification models. Other candidates for second-stage screening include 5 to 8 weeks of progress monitoring, which Compton, Fuchs, Fuchs, and Bryant (2006) showed to be useful in predicting reading difficulty; a measure of language, which has been associated with word-problem skill (e.g., Fuchs, Stuebing, et al., 2008); more in-depth, individually administered static measures of word-problem skill; and variations on DA. In addition, there are few if any studies with the purpose of identifying risk for WPD at third grade or at any grade level. It is possible that other universal screeners, used at Stage 1 and combined with the present DA or some other second-stage screening tool, would result in superior screening. A research program dedicated to these screening questions is clearly warranted.
Funding
This research was supported in part by Grant 1 R01 HD46154 and Core Grant HD 15052 from the National Institute of Child Health and Human Development to Vanderbilt University and by Grant R324A090039 from the U.S. Department of Education to Vanderbilt University. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Child Health and Human Development, the National Institutes of Health, or the U.S. Department of Education.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
We excluded Byrne's work (e.g., Byrne, Fielding-Barnsley, & Ashley, 2000) because it conceptualizes dynamic assessment (DA) as the student's rate of acquisition in response to schooling (i.e., it is not an assessment conducted to predict responsiveness to schooling). Therefore, as an assessment paradigm, Byrne's work is more similar to responsiveness to intervention than to DA.
References
- Berkeley S, Bender WN, Gregg Peaster L, Saunders L. Implementation of response to intervention: A snapshot of progress. Journal of Learning Disabilities. 2009;42:85–95. doi: 10.1177/0022219408326214.
- Bransford JC, Delclos VR, Vye NJ, Burns MS, Hasselbring TS. State of the art and future directions. In: Lidz CS, editor. Dynamic assessment: An interactional approach to evaluating learning potential. Guilford; New York, NY: 1987. pp. 479–496.
- Budoff M. Learning potential among institutionalized young adult retardates. American Journal of Mental Deficiency. 1967;72:404–411.
- Byrne B, Fielding-Barnsley R, Ashley L. Effects of preschool phoneme identity training after six years: Outcome level distinguished from rate of response. Journal of Educational Psychology. 2000;92:659–667.
- Campione JC. Assisted testing: A taxonomy of approaches and an outline of strengths and weaknesses. Journal of Learning Disabilities. 1989;22:151–165. doi: 10.1177/002221948902200303.
- Campione JC, Brown AL. Linking dynamic assessment with school achievement. In: Lidz C, editor. Dynamic assessment: An interactional approach to evaluating learning potential. Guilford; New York, NY: 1987. pp. 82–115.
- Chard DJ, Clarke B, Baker S, Otterstedt J, Braun D, Katz R. Using measures of number sense to screen for difficulties in mathematics: Preliminary findings. Assessment for Effective Intervention. 2005;30:3–14.
- Clarke B, Shinn MR. A preliminary investigation into the identification and development of early mathematics curriculum-based measurement. School Psychology Review. 2004;33:234–248.
- Compton DL, Fuchs D, Fuchs LS, Bryant JD. Selecting at-risk readers in first grade for early intervention: A two-year longitudinal study of decision rules and procedures. Journal of Educational Psychology. 2006;98:394–409.
- CTB/McGraw-Hill. TerraNova technical manual. Author; Monterey, CA: 1997.
- Ferrara RA, Brown AL, Campione JC. Children's learning and transfer of inductive reasoning rules: Studies of proximal development. Child Development. 1986;57:1087–1099. doi: 10.1111/j.1467-8624.1986.tb00438.x.
- Feuerstein R. The dynamic assessment of retarded performers: The learning potential assessment device, theory, instruments, and techniques. University Park Press; Baltimore, MD: 1979.
- Fuchs D, Compton DL, Fuchs LS, Davis GC. Making “secondary intervention” work in a three-tier responsiveness-to-intervention model: Findings from the first-grade longitudinal study at the National Research Center on Learning Disabilities. Reading and Writing: A Contemporary Journal. 2008a;21:413–436.
- Fuchs LS, Fuchs D, Stuebing K, Fletcher JM, Hamlett CL, Lambert W. Problem solving and computational skill: Are they shared or distinct aspects of mathematical cognition? Journal of Educational Psychology. 2008b;100:30–47. doi: 10.1037/0022-0663.100.1.30.
- Fuchs LS, Compton DL, Fuchs D, Hollenbeck KN, Craddock C, Hamlett CL. Dynamic assessment of algebraic learning in predicting third graders’ development of mathematical problem solving. Journal of Educational Psychology. 2008c;100:829–850. doi: 10.1037/a0012657.
- Fuchs LS, Fuchs D, Compton DL, Bryant JD, Hamlett CL, Seethaler PM. Mathematics screening and progress monitoring at first grade: Implications for responsiveness-to-intervention. Exceptional Children. 2007;73:311–330.
- Fuchs LS, Hamlett CL, Fuchs D. Test of Computational Fluency. Vanderbilt University; Nashville, TN: 1990. Unpublished test. Available from L. S. Fuchs, 228 Peabody 37203.
- Fuchs LS, Hamlett CL, Powell SR. Grade 3 Math Battery. Vanderbilt University; Nashville, TN: 2003. Unpublished test. Available from L. S. Fuchs, 228 Peabody 37203.
- Fuchs LS, Seethaler PM, Powell SR, Fuchs D, Hamlett CL, Fletcher JM. Effects of preventative tutoring on the mathematical problem solving of third-grade students with math and reading difficulties. Exceptional Children. 2008;74:155–173. doi: 10.1177/001440290807400202.
- Grigorenko EL, Sternberg RJ. Dynamic testing. Psychological Bulletin. 1998;124:85–111.
- Hoover HD, Dunbar SB, Frisbie DA. Iowa Test of Basic Skills. Riverside; Rolling Meadows, IL: 2001.
- Lembke E, Foegen A. Monitoring student progress in early math. Paper presented at the 14th annual Pacific Coast Research Conference; San Diego, CA; February 2006.
- Murray BA, Smith KA, Murray GG. The test of phoneme identities: Predicting alphabetic insight in pre-alphabetic readers. Journal of Literacy Research. 2000;32:421–477.
- Raven JC. Standard Progressive Matrices, Sets A, B, C, D, and E. Lewis; Cambridge, UK: 1960.
- Seethaler PM, Fuchs LS. The predictive utility of kindergarten screening for math difficulty. Exceptional Children. 2010;77:37–59.
- Spector JE. Predicting progress in beginning reading: Dynamic assessment of phonemic awareness. Journal of Educational Psychology. 1992;84:353–363.
- Speece DL, Cooper DH, Kibler JM. Dynamic assessment, individual differences, and academic achievement. Learning and Individual Differences. 1990;2:113–127.
- Swanson HL, Howard CB. Children with reading disabilities: Does dynamic assessment help in classification? Learning Disability Quarterly. 2005;28:17–34.
- Swets JA. The science of choosing the right decision threshold in high-stakes diagnostics. American Psychologist. 1992;47:522–532. doi: 10.1037/0003-066X.47.4.522.
- Tzuriel D, Haywood HC. The development of interactive-dynamic approaches for assessment of learning potential. In: Haywood HC, Tzuriel D, editors. Interactive assessment. Springer-Verlag; New York, NY: 1992. pp. 3–37.
- Vygotsky LS. Thought and language. MIT Press; Cambridge, MA: 1962. (Original work published 1934)
- Wechsler D. Wechsler Abbreviated Scale of Intelligence. Psychological Corporation; San Antonio, TX: 1999.
- Woodcock RW. Woodcock Reading Mastery Tests–Revised. American Guidance Service; Circle Pines, MN: 1998.
- Woodcock RW, McGrew KS, Mather N. Woodcock–Johnson III Tests of Cognitive Abilities. Riverside; Itasca, IL: 2001.