Educational and Psychological Measurement. 2025 Jan 30;85(5):1000–1031. doi: 10.1177/00131644241313212

Examining the Instructional Sensitivity of Constructed-Response Achievement Test Item Scores

Anne Traynor 1, Cheng-Hsien Li 2, Shuqi Zhou 3
PMCID: PMC11783420  PMID: 39896146

Abstract

Inferences about student learning from large-scale achievement test scores are fundamental in education. For achievement test scores to provide useful information about student learning progress, differences in the content of instruction (i.e., the implemented curriculum) should affect test-takers’ item responses. Existing research has begun to identify patterns in the content of instructionally sensitive multiple-choice achievement test items. To inform future test design decisions, this study identified instructionally (in)sensitive constructed-response achievement items, then characterized features of those items and their corresponding scoring rubrics. First, we used simulation to evaluate an item step difficulty difference index for constructed-response test items, derived from the generalized partial credit model. The statistical performance of the index was adequate, so we then applied it to data from 32 constructed-response eighth-grade science test items. We found that the instructional sensitivity (IS) index values varied appreciably across the category boundaries within an item as well as across items. Content analysis by master science teachers allowed us to identify general features of item score categories that show high, or negligible, IS.

Keywords: instructional sensitivity, validity, test content


Inferences about student learning from large-scale achievement test scores are fundamental in education. For test scores to provide information about student learning to educators, parents, and policymakers, differences in the content of instruction (i.e., the implemented curriculum) should affect test-takers’ item responses (e.g., Burstein, 1989; Naumann et al., 2020). Test item scores are “instructionally sensitive” if they are affected by variation in the content or quality of instruction received by test-takers (B. O. Muthén et al., 1991). When achievement test scores will be interpreted as measuring learning by students or trainees in an education context, evidence that the test-takers have had adequate opportunity to learn (OTL) the tested content is a crucial type of validity evidence (AERA, APA, & NCME, 2014, pp. 56–57, 72, 145, 197). Without such evidence, the scores can be argued to measure test-takers’ current achievement status or predict their future outcomes, but achievement score differences across individuals or groups cannot be directly attributed to their schooling. Test item instructional sensitivity (IS) indices connect the educational policy concept of OTL (e.g., Daus & Braeken, 2018) to empirical differences in test item difficulty.

Existing research about test item IS has focused on multiple-choice items (e.g., Li et al., 2017; Mehrens & Phillips, 1987; Naumann et al., 2019), with the exception of Ruiz-Primo et al.’s (2002) study of science laboratory assessment tasks for upper elementary students. They proposed a detailed framework describing how item IS should decrease as the item content becomes more distal from the content representations that teachers use during instruction. Then, they confirmed this hypothesis by scoring written responses in students’ lab notebooks during several instructional units. Despite this advance, we still lack general principles that would allow large-scale achievement test developers to consistently write constructed-response items that can demonstrably measure students’ learning from school instruction. This is a validity issue (Popham et al., 2014).

As constructed-response test items become more innovative and are used at larger scale (Russell & Moncaleano, 2019), identifying patterns in the effect of classroom instruction on test-takers’ item performance can inform future test design decisions. We anticipated that IS indices for dichotomous item scores (Polikoff, 2010) may be extensible to ordered polytomous item scores from a rubric. To compile IS evidence for constructed-response items, this study appraised an effect size index and applied it to science test item response data. Then, we examined shared features of items that show either high or negligible IS. This line of research is intended to inform test item writing and rubric design guidelines about IS, and perhaps to improve control of the IS of test forms during their assembly.

IS of Constructed-Response Test Item Scores

Constructed-response test items that are scored as ordered categories include objectively scored items, as well as items assessed using a rubric (i.e., marking key) by content experts or an automated scoring engine. We assume that each item’s scoring system consists of ordered performance levels that are separable and clearly described (Brookhart, 2018). We conceptualize IS as a property of the scores yielded by a certain test item when test-takers’ responses are rated using a particular rubric. An item with instructionally sensitive scores has item difficulty statistics that vary for test-taker groups who have documented differences in OTL about the content topic (Floden, 2002) tested by that item, but are otherwise equal. Although we define IS as a feature of item scores in a test-taker population rather than of the items themselves, we will use the term item IS throughout to emphasize that our definition pertains to individual items, not only to total test scores (e.g., D’Agostino et al., 2007; Ing, 2018).

Schmidt et al. (1999) reported suggestive evidence that constructed-response item scores might be particularly sensitive to differences in the implemented curriculum across school systems. However, constructed-response items have not been a focus of earlier studies about item IS. The mathematics item IS study by Li et al. (2017) used multinomial logistic regression to examine three polytomous items in a 34-item set; none of those items showed significant IS. A science item IS study by Naumann et al. (2019) found students’ responses to two polytomous items about buoyancy force showed no relationship to classroom instructional quality ratings. (The study modeled responses from each polytomous item as a set of dichotomous scores.) That finding was characterized as unexpected by the investigators, particularly because those items were closely tied to the instructional content.

Since existing evidence about IS of constructed-response items is minimal, we rely on conclusions about multiple-choice items to extract tentative principles for writing instructionally sensitive items. A consistent finding is the small proportion of items from educational tests that display IS. Multiple-choice mathematics and science items that primarily require disciplinary factual or procedural knowledge tend to show greater IS over time (Naumann et al., 2019) and across student groups (Li et al., 2017; B. O. Muthén et al., 1991) than items that require evaluation or synthesis. As expected, items framed in contexts that were not used during instruction tend to exhibit lower IS (Ruiz-Primo et al., 2012). Li and collaborators (p. 11) also judged that mathematics items requiring “common knowledge,” which is often learned outside of school, seem to have lower IS. Collectively, these conclusions about IS of multiple-choice items constituted a hypothesis for our second research question. We investigated the extent to which these principles hold for rubric-scored constructed-response items.

Proposed Item Difficulty Difference IS Index

Item IS analysis can be conceptualized as a special case of uniform differential item functioning (DIF) analysis that compares groups with varying OTL the tested content. During the analysis, item difficulty statistics may be compared for test-taker groups before and after an instructional intervention, or groups whose instructional history is known to differ (Naumann et al., 2016). One important distinction between item IS and DIF in general is that items exhibiting differential functioning are viewed as undesirable, potentially indicating problems with construct representation in the test scores (e.g., Gómez-Benito et al., 2018), while items with IS favoring the group that had more learning opportunity are desirable if the test scores are claimed to measure students’ learning during school (Naumann et al., 2016; Ruiz-Primo et al., 2012).

Among DIF detection methods for polytomous items, to the best of our knowledge, only multinomial logistic regression has previously been used to examine test item IS (Li et al., 2017). Some advantages of the itemwise logistic regression model as an IS detection method are that it can incorporate multiple covariates and can produce stable effect size estimates when the total person sample size is very small (e.g., less than 100; Belzak, 2020). Its limitations as an IS detection method arise from its estimation method, which does not use information from the multivariate distribution of all test item responses, and typically treats test-takers with a missing item response by omitting them. Many simulation studies (e.g., Finch, 2016; Kristjansson et al., 2005) have established that logistic regression methods are among the best observed variable methods for DIF detection, and those conclusions should extend directly to item IS analysis.

Historically, statistical methods to quantify the IS of dichotomous items have tended to treat each test-taker’s total score as an observed variable (Polikoff, 2010), which is also the approach taken in logistic regression models. However, in a recent series of studies, Naumann and collaborators (2016, 2020) have successfully used item response models that treat each test-taker’s total score as a latent variable to evaluate the IS of dichotomous items. We believed that general approach was promising to extend to IS analysis of constructed-response item scores, and returned to the DIF literature to guide our model selection.

Among item response models for polytomous items, the generalized partial credit model seemed to be well suited for item IS analysis because its single overall item difficulty parameter can form a readily interpretable IS index, and it is often used for achievement test scoring. Muraki (1999) proposed testing group differences in an item’s difficulty parameter, estimated from a multiple-group generalized partial credit model as a statistical test for DIF. The multiple-group model parameterization that we planned to implement is represented in Equation 1:

P(Y_{j,g} = k \mid \theta) = \frac{\exp\left[\sum_{i=1}^{k} a_j\left(\theta - b_{j,g} + d_{i,j,g}\right)\right]}{\sum_{h=0}^{m-1} \exp\left[\sum_{i=1}^{h} a_j\left(\theta - b_{j,g} + d_{i,j,g}\right)\right]}, \quad d_{0,j} = 0, \quad \sum_{i=1}^{m_j - 1} d_{i,j,g} = 0   (1)

where P(Y_{j,g} = k | θ) = conditional probability of a response (Y_{j,g}) on Item j in Category k (k = 0, 1, . . ., m − 1; m = number of observable values) for test-takers in OTL group g (g = 0, 1), θ = latent ability status for an individual test-taker, a_j = discrimination parameter for Item j, which can in principle be allowed to vary across groups, b_{j,g} = difficulty (location) parameter for Item j in Group g, and d_{i,j,g} = threshold parameter for Category i of Item j in Group g. The “group step difficulty,” b_{i,j,g}, for a given Category i can be obtained as b_{j,g} − d_{i,j,g}. We intended to model any groupwise differences in Item j discrimination by allowing the category threshold parameters for an item to vary across groups.
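For concreteness, the following minimal Python sketch evaluates the category probabilities in Equation 1 for one item within one OTL group. The function name and the illustrative parameter values are ours; they are not estimates from any data set analyzed here.

```python
import numpy as np

def gpcm_probs(theta, a, b, d):
    """Category response probabilities for one item and one OTL group under the
    generalized partial credit model in Equation 1.

    theta : latent ability value(s)
    a     : item discrimination, a_j
    b     : item difficulty (location) for the group, b_{j,g}
    d     : threshold parameters d_{1,j,g}, ..., d_{m-1,j,g}; d_{0,j} = 0 is
            implicit, and the remaining thresholds are assumed to sum to zero
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    d = np.asarray(d, dtype=float)
    # a_j * (theta - b_{j,g} + d_{i,j,g}) for i = 1, ..., m - 1
    steps = a * (theta[:, None] - b + d[None, :])
    # cumulative sums for k = 0 (empty sum), 1, ..., m - 1
    csum = np.concatenate([np.zeros((theta.size, 1)), np.cumsum(steps, axis=1)], axis=1)
    numerators = np.exp(csum)
    return numerators / numerators.sum(axis=1, keepdims=True)

# Illustrative (not estimated) values for a three-category item:
print(gpcm_probs(theta=0.5, a=1.0, b=0.2, d=[0.6, -0.6]).round(3))
```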

We hypothesized that the difference between generalized partial credit item difficulty or step difficulty parameters in test-taker groups that have (1) and have not (0) experienced relevant topical instruction, BDIFF, might be a useful IS index (Equation 2):

B_{DIFF} = b_{j,0} - b_{j,1}   (2)

with positive values indicating IS favoring the group with more OTL, as expected in theory, but negative values also possible. The group indicator variable, g, is allowed to vary across items if distinct subsets of students have had OTL about particular content topics.
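Given group-specific parameter estimates, Equation 2 and the corresponding category-boundary (“step”) differences reduce to simple arithmetic. The sketch below uses invented placeholder values and assumes the step-level index is the difference in group step difficulties, b_{j,g} − d_{i,j,g}, as defined above.

```python
import numpy as np

# Hypothetical group-specific estimates for one three-category item
b_no_otl, d_no_otl = 0.45, np.array([0.30, -0.30])   # group g = 0 (no OTL)
b_otl,    d_otl    = 0.15, np.array([0.25, -0.25])   # group g = 1 (OTL)

# Overall item difficulty difference (Equation 2)
bdiff = b_no_otl - b_otl                              # approximately .30

# Step (category-boundary) difficulties in each group, b_{j,g} - d_{i,j,g},
# and their between-group differences (one value per category boundary)
step_bdiff = (b_no_otl - d_no_otl) - (b_otl - d_otl)
print(bdiff, step_bdiff)                              # .30, [.25, .35]
```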

Evaluating Test Item IS in Cross-Sectional Observational Studies

In educational intervention studies, students are randomly assigned into instructional groups (e.g., Naumann et al., 2020), which equalizes the groups’ average background characteristics as their sample sizes increase. Since the student groups are then likely to have similar mean achievement before the intervention is implemented, subsequent group differences in test item performance may be interpreted as evidence of item IS (Polikoff, 2010). Lacking random assignment, observational studies allow us to interpret student group differences in test item performance as IS evidence only if we use statistical methods to balance the groups on measured variables that are likely to be associated with both their test item responses and their available learning opportunities. Despite this additional requirement, conclusions about item IS from international large-scale assessments may be particularly useful to inform item writing guidelines that could be relevant across national education systems, rather than bounded by the context of a particular educational intervention.

Research about educational stratification across countries indicates that allocation of educational opportunities to students is not random—socioeconomically advantaged students tend to have access to more advanced instruction than other students do (e.g., Hannum & Buchmann, 2005; Kalogrides & Loeb, 2013). Previous studies of mathematics item IS in the United States demonstrate that students’ current total achievement scores are a strong predictor of their performance on individual test items (Li et al., 2017; B. O. Muthén et al., 1991). However, adjusting for student OTL group differences in other background variables seldom changes the conclusions about an item’s IS (Li et al., 2017). These results are consistent with the logic of DIF methods that account for current student achievement differences, but not necessarily covariates other than the group variable of interest, in the detection model. To overcome the lack of random assignment of students to “opportunity” or “no opportunity” to learn groups in international assessments, the multiple-group generalized partial credit model (Muraki, 1999) embeds control for group total achievement score differences.

In DIF analyses, items that display large DIF index values, perhaps caused by a source of bias, are often excluded from the total achievement score used to match groups. In studies of item IS, excluding instructionally sensitive items from the scoring model would remove items with desirable properties from the achievement matching variable. Thus while examining item IS in this study, we consistently retained all administered items in the total test score, but allowed the item difficulty parameters to vary across the student OTL groups.

Computer simulation allowed us to examine the BDIFF index’s statistical optimality, and its application to item response data from an international large-scale assessment program let us evaluate its utility. We asked:

  1. Does the proposed effect size index to characterize the instructional sensitivity of ordered polytomous items, BDIFF, have adequate statistical properties, even under measurement conditions that are marginal for latent variable modeling?

  2. What are the characteristics of constructed-response science achievement test item and rubric combinations that yield instructionally sensitive scores?

Methods for the IS Effect Size Index Simulation

To address Question 1, we used Monte Carlo simulation of responses to a short test composed of ordinal items that follow the generalized partial credit model. The first test item had IS (BDIFF) of varying degrees, operationalized as group differences in the step difficulty parameters for Item 1, and the other items had fixed, group-invariant parameters. The first item had fixed item discrimination of a = 1.00 to maintain a comparable BDIFF effect size index across conditions.

BDIFF Index Effect Size

A key decision for the simulation was determining OTL group differences in item difficulty that might represent “low,” “moderate,” and “high” test item IS. BDIFF values that we computed from B. O. Muthén et al.’s (1991, pp. 14–15) item IS results, obtained using an extension of the two-parameter logistic model, ranged between −.45 and 1.95, with a median of 0.09 and third quartile of 0.35. As best we know, these were the only relevant existing empirical IS effect sizes. Roussos et al. (1999) established that the item difficulty difference for two test-taker groups whose item responses fit the two-parameter logistic model can be transformed to ETS’s delta effect size metric for evaluating DIF as 4a_j(b_{j,0} − b_{j,1}), where a_j denotes Item j discrimination, assumed to be equal across groups, and b_{j,g} is the difficulty parameter in each group on a particular Item j. Item difficulty differences equivalent to the delta values that have been utilized as boundaries demarcating “negligible” from “moderate” (1.0) and “moderate” from “large” DIF (1.5) will then be .250 and .375 (in absolute value), respectively. The item difficulty parameter has the same meaning in the generalized partial credit and two-parameter logistic models, so it seemed reasonable to propose BDIFF values of .30 and .60 as representing “moderate” and “high” item IS in our simulation, respectively.
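The origin of the .250 and .375 cutoffs can be verified with a line of arithmetic (a sketch of the transformation described above, with the discrimination fixed at 1.00 as in our simulation):

```python
# delta = 4 * a_j * (b_{j,0} - b_{j,1})  (Roussos et al., 1999), so a delta
# boundary maps back to an item difficulty difference of delta / (4 * a_j).
a_j = 1.0
for delta_boundary in (1.0, 1.5):              # negligible/moderate, moderate/large
    print(delta_boundary, delta_boundary / (4 * a_j))   # -> 0.25 and 0.375
```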

Simulation Conditions

We simulated test item response data under 36 experimental conditions with combinations of the following features: number of scoring categories (2, 3, 5); number of items (5, 10); numbers of test-takers in the no OTL and OTL groups (300-300, 500-500, 300-500); and population value of the BDIFF IS index (“low-moderate,” .30, and “moderate-high,” .60). Both test-taker groups had simulated latent achievement scores drawn from a standard normal distribution, N(0, 1). For simplicity, we presumed that the two test-taker groups differed only in their OTL about the specific topic of Item 1, not in their overall content area achievement. Our simulation conditions were intentionally chosen to probe situations in which generalized partial credit model use is questionable, since 500 test-takers have been suggested as a minimum for calibration in single-group samples (DeMars, 2010). IS detection has low direct stakes for individual test-takers, and it may be desirable to conduct exploratory analyses in small samples if they are believed to be representative of a defined population. Short test lengths are increasingly common in school achievement testing, particularly when constructed-response items are administered.

To set realistic item parameter values, estimates from five-category 2011 U.S. National Assessment of Educational Progress (NAEP) Grade 8 Science test items were assigned as fixed parameter values in the simulation models. Three-category item conditions were modified from the five-category conditions by retaining the NAEP item difficulty and lowest threshold parameter for each item, and creating the upper (of two) threshold parameters as the negative of the lower threshold parameter value. In the two-category conditions, we treated each NAEP item difficulty as a population value. Low-moderate and moderate-high IS conditions for Item 1 were constructed by adding and subtracting a fixed amount (.15 or .30) from the NAEP item difficulty value to produce simulated “no OTL” and “OTL” test-taker groups with higher and lower true item difficulty values, respectively.
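A condensed Python sketch of this data-generation step is shown below; it reuses gpcm_probs from the earlier sketch, and the item parameter values are placeholders rather than the NAEP-based values we actually used.

```python
import numpy as np

rng = np.random.default_rng(2011)

def simulate_item(theta, a, b, d):
    """Draw one GPCM response per test-taker by inverting the category CDF."""
    p = gpcm_probs(theta, a, b, d)                 # from the earlier sketch
    u = rng.uniform(size=(len(theta), 1))
    return (u > np.cumsum(p, axis=1)).sum(axis=1)  # category index 0, ..., m-1

# Placeholder generating parameters for a three-category Item 1 (not NAEP values)
b_item1, d_lower = 0.10, 0.55
d_item1 = [d_lower, -d_lower]          # upper threshold = negative of the lower

shift = 0.15                           # .15 gives BDIFF = .30; .30 gives .60
theta_no_otl = rng.normal(0.0, 1.0, 300)   # both groups drawn from N(0, 1)
theta_otl    = rng.normal(0.0, 1.0, 300)

y_no_otl = simulate_item(theta_no_otl, a=1.0, b=b_item1 + shift, d=d_item1)
y_otl    = simulate_item(theta_otl,    a=1.0, b=b_item1 - shift, d=d_item1)
```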

We generated 500, 1,000, and finally 1,500 item response datasets for two simulation conditions. For each number of replications, we varied the random seed number that initiates the data generation process three times, finding that the results were stable to at least two decimal places when 1,500 replications were used. We then generated 1,500 datasets for each of the remaining 34 conditions. Finally, using the simulated item responses, we computed BDIFF effect sizes for Item 1. All analyses were conducted using Mplus 8.11 software (L. K. Muthén & Muthén, 2017).

Two evaluation criteria—average percentage bias and 95% confidence interval coverage of the true population value by the index estimates—were used to judge the performance of BDIFF as an IS effect size index. A third outcome, power to detect that the IS effect size differs from zero, is also reported to illustrate the limits of significance testing as an approach to detect test item IS in small samples. We checked the credibility of the simulation by inspecting the results for inadmissible model solutions, finding none among 10 results sets randomly drawn for each condition, and no nonconverged solutions.
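The three evaluation criteria can be summarized as in the sketch below, where estimates and ses would hold the BDIFF point estimate and its standard error from each replication of a condition; the function is a generic outline, not the post-processing of Mplus output that we actually ran.

```python
import numpy as np

def summarize_replications(estimates, ses, true_value, z=1.959964):
    """Average percentage bias, 95% CI coverage, and power across replications."""
    estimates, ses = np.asarray(estimates, float), np.asarray(ses, float)
    bias_pct = 100.0 * (estimates.mean() - true_value) / true_value
    lower, upper = estimates - z * ses, estimates + z * ses
    coverage = np.mean((lower <= true_value) & (true_value <= upper))
    power = np.mean((lower > 0.0) | (upper < 0.0))   # 95% CI excludes zero
    return bias_pct, coverage, power
```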

Results From the IS Effect Size Index Simulation

Results to evaluate the performance of the BDIFF IS effect size index are displayed in Table 1. Since multiple item step difficulty BDIFF values were estimated in the three- and five-category conditions, Table 1 summarizes by reporting the “worst” results that we observed (i.e., maximum percentage bias, minimum confidence interval coverage). Across all simulation conditions, we found systematic small, usually positive biases in the estimated value of BDIFF. Maximum percentage bias tended to be largest when Item 1 displayed a low-to-moderate (.30) IS effect or the group sample sizes were unbalanced, but was less than 4% in most conditions, which might be deemed acceptable for this relatively low-stakes data use. Bias of the BDIFF index was less acceptable for item category boundaries that lay far from the bulk of the test-taker score distribution, with percentage bias reaching a maximum of 11.7% for the highest step difficulty BDIFF in the condition with 10 items having five response categories. The BDIFF index estimate appeared to have greater bias when tests had 10 items than when they had 5 items, but several of the added items contained one very low or high step difficulty parameter (absolute value ≥ 2.0), obscuring any effect of the number of items on bias. We thus would not rely on adding items to a test to reduce bias in this index. Overall, we conclude that bias in the IS index becomes appreciable for item step difficulty parameters with extreme values, and as the group sample sizes become smaller and/or more unequal.

Table 1.

Bias and Variability of Item Difficulty Difference (BDIFF) Effect Size Estimates Across Simulation Conditions

Group sample sizes (no OTL-OTL) head the three rightmost columns; each cell reports Bias % / 95% CI coverage / Power range.

True effect size     No. of items  No. of categories  300-300                  500-500                  300-500
Low-moderate (.30)   5             2                  2.3 / .97 / .21          2.0 / .96 / .38          3.3 / .97 / .23
                     5             3                  3.3 / .97 / [.11, .21]   3.7 / .96 / [.19, .33]   3.7 / .96 / [.10, .22]
                     5             5                  2.0 / .95 / [.07, .18]   5.3 / .95 / [.09, .29]   4.3 / .96 / [.08, .19]
                     10            2                  2.7 / .97 / .26          1.0 / .96 / .43          4.0 / .97 / .30
                     10            3                  3.3 / .96 / [.12, .23]   2.3 / .95 / [.20, .35]   6.3 / .95 / [.13, .26]
                     10            5                  10.0 / .95 / [.09, .19]  7.3 / .95 / [.12, .29]   11.7 / .95 / [.08, .24]
Moderate-high (.60)  5             2                  1.7 / .97 / .71          2.0 / .96 / .92          2.5 / .96 / .81
                     5             3                  1.8 / .97 / [.34, .66]   2.2 / .96 / [.58, .88]   2.7 / .96 / [.37, .77]
                     5             5                  3.0 / .94 / [.13, .60]   3.3 / .96 / [.33, .77]   5.8 / .95 / [.20, .64]
                     10            2                  2.0 / .97 / .79          0.8 / .96 / .95          2.8 / .96 / .88
                     10            3                  2.2 / .96 / [.39, .71]   1.7 / .95 / [.63, .90]   4.2 / .95 / [.44, .79]
                     10            5                  3.2 / .95 / [.33, .78]   6.2 / .96 / [.19, .57]   5.8 / .94 / [.21, .65]

Note. Bias % = maximum absolute value of percentage bias (across parameters of the focal item, when two- or four-step difficulty BDIFF values were estimated). All the maximum bias values for any step difficulty BDIFF parameter within a given item were positive, but negative bias of less than 3% was observed for 16 of the 84 BDIFF parameters estimated, most often in the five-category item conditions. 95% CI coverage = minimum 95% confidence interval coverage (across parameters of the focal item).

Generally, the 95% confidence interval for the BDIFF index included the true population value in at least 95% of replications, with minimum confidence interval coverage dipping to 94% for two (out of 84 total) estimated step difficulty BDIFF indices, both in five-category item conditions. Maximum power to detect that a step difficulty BDIFF index differed from zero exceeded .80 in the 500-500 OTL group size condition when the IS effect was moderate-to-large (.60) and an item had two categories. However, power was lower in the other conditions, suggesting that significance testing of OTL group differences in item difficulty could not be relied on for IS detection with group sample sizes of 500 or less, given that the magnitude of IS is not known a priori in practice. This result points to the need for effect size indices to examine IS. Overall, we conclude from the simulation that the BDIFF index seems to perform well enough that it may be useful to evaluate test item IS when several hundred test-takers have completed each item.

Methods for the Constructed-Response Item IS Analysis

Question 2 sought to evaluate the IS of constructed-response item–rubric scores from a science achievement test. Previous item IS studies have analyzed data from the United States. Because we hope to inform test item writing principles that generalize beyond a single country, we sought a broader sample of item response data from eighth-grade students who had taken the Trends in International Mathematics and Science Study 2011 science assessment (TIMSS; Martin & Mullis, 2012) in English, the most widely administered language version of the test. In 12 countries (Australia, Bahrain, Botswana, England, Ghana, New Zealand, Oman, Qatar, Singapore, South Africa, the United Arab Emirates, and the United States), some or all test-takers took the English language version. Countries that used the same test language version were encouraged to make dialect adaptations of the items, but not modifications that would alter the science content assessed. A systematic translation verification process was used to qualify any item language adaptations (Martin & Mullis, 2012). Two other countries in which the English language version test was administered were excluded from our analysis: Lebanon, where omission rates on the teacher questionnaire items about students’ OTL exceeded 50%, and Saudi Arabia, where fewer than 100 students tested in English. We intentionally selected data from 2011, rather than the most recent study cycle (2019), because interpreting item IS results requires examining the item content. Approximately two-thirds of the 2011 constructed-response items have been released for restricted use, while smaller proportions of the 2015 and 2019 item sets are currently available for research use.

Composing the OTL Group Variables

A distinguishing feature of the TIMSS data collection is the availability of instructional information at the individual student level. Previous research suggests teachers’ retrospective self-reports about content topic coverage are at least moderately reliable (Polikoff et al., 2020). Science teachers completed a questionnaire indicating, for each of 20 assessed topics (e.g., classification of living organisms, properties of matter), whether students had received instruction during a previous or current school year, or never. Up to three teachers of each student completed the questionnaire, so students in the same classroom could have distinct instructional histories reported. Like Li et al. (2017) and B. O. Muthén et al. (1991), we treated the first two rating scale categories as indicating “OTL” and the last category as indicating “no OTL.” A student was classified as reported to have experienced topic instruction if at least one teacher indicated they had opportunity. OTL reports conflicted for 2.5% of the students or fewer, depending on the specific science content topic. As shown in Table 2, the proportion of individual students reported by their teachers to have received instruction ranged between .16 and 1.00 across topics and countries, which suggests that examining test item IS in this analytic sample is reasonable.
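The grouping rule described above amounts to a simple aggregation over teacher reports. The sketch below assumes a long-format table with hypothetical column names (student_id, topic, otl_report); the data values are invented for illustration.

```python
import pandas as pd

# Hypothetical teacher reports: one row per student-teacher report for a topic
reports = pd.DataFrame({
    "student_id": [101, 101, 102, 103, 103],
    "topic":      ["properties of matter"] * 5,
    "otl_report": ["this year", "never", "never", "previous year", "this year"],
})

# "OTL" if any teacher reports instruction this or a previous school year;
# "no OTL" only if every reporting teacher chose "never"
reports["otl"] = reports["otl_report"].isin(["this year", "previous year"])
student_otl = reports.groupby(["student_id", "topic"])["otl"].any().astype(int)
print(student_otl)   # 1 = reported to have had the opportunity to learn the topic
```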

Table 2.

Summary Statistics for Student Science Achievement and Teacher Instructional Questionnaire Items, by Country

Country   Proportion of students tested in English   Number tested in English   Range of mean item scores (0–2)   Range in proportions of students in OTL category for any content topic   Maximum proportion of OTL item missing data
Australia 1 6,453 .21–1.56 .23–.87 .26
Bahrain .25 837 .11–1.38 .62–.97 .04
Botswana 1 4,107 .04–.93 .26–.99 .06
England 1 3,335 .09–1.63 .58–.96 .15
Ghana 1 5,725 .03–.91 .49–1.00 .02
New Zealand 1 5,046 .04–1.44 .16–.91 .16
Oman .11 838 .11–1.53 .31–.97 .04
Qatar .49 1,447 .04–1.21 .69–.95 .08
Singapore 1 5,830 .33–1.59 .24–.90 .22
South Africa .86 8,126 .04–1.02 .42–.96 .06
United Arab Emirates .45 5,198 .15–1.32 .46–.92 .23
United States 1 9,930 .08–1.51 .56–.96 .31

Note. OTL = opportunity to learn.

Sampling the Science Achievement Item Domain

TIMSS implemented a two-stage stratified sampling design, selecting schools with probability proportional to size, followed by intact science classrooms within schools to enlist student participants, and their teachers (Martin & Mullis, 2012). In total, 75,273 eighth graders from 1,045 schools in the 12 countries took the English-version science test. On each country’s assessment date, more than 200 science items, approximately half of which were constructed-response items, were administered to students. The items covered four major content areas: biology, chemistry, earth science, and physics. A balanced incomplete block test booklet design assured that each student took only a fraction of the test item set. Test booklets were distributed in a spiraling manner, so the item responses will represent each country’s population once the sampling weights are applied to the response data. In our analysis, we used sampling weights that treated all countries as equal in size (i.e., “Senate” weights) so the results would not be driven by instruction–item performance relationships in the United States, which has a much larger youth population than the other 11 countries. Scorers of the constructed-response items underwent formal rater training and interrater reliability checks as they marked students’ item responses using the corresponding scoring rubric (Martin & Mullis, 2012).

To compose our item set for the IS evaluation, from among the released constructed-response items, we selected all 13 items that had three score categories, excluded subitems that referred to the same task prompt, table, or illustration, and then drew a stratified random sample of dichotomous items to assemble a total of eight items from each content area category. Interpreting IS results requires examining the test item content, so 32 items were the maximum number we could analyze, given the project resources. The 32 items were dispersed across nine test booklets. To ensure that all content areas were represented in each booklet, from among the remaining available items in each booklet, we again drew a random sample so 10 items per booklet, 90 in total (32 evaluation plus 58 scaling items), would be included in the scoring.

Descriptive statistics for the student achievement test item and teacher instructional questionnaire data by country and overall are summarized in Tables 2 and 3. As anticipated, student performance on the item set varied by country (see Table 2). Because the number of students in one of the OTL groups who answered an item was often less than 500 within a country, given our simulation results, we planned to analyze all responses to the English-version science test as one sample, and not to undertake within-country analyses. This analytic decision was also intended to maximize variation in students’ OTL for each test item. At least 5,100 students in the analytic sample produced written responses to each science item, as shown in Table 3. Between 58% and 92% of students had reported OTL about each item’s content topic, which seems consistent with TIMSS’ purpose to assess common curricular objectives. Fit statistics for a generalized partial credit model applied to the 10-item sets from each booklet suggested the data could be modeled as unidimensional (Tucker–Lewis index = [.95, .99]; root mean square error of approximation = [.02, .05]).

Table 3.

Descriptive Statistics for 32 Released Constructed-Response TIMSS English-Version Grade 8 Science Test Items

Content topic Label Number of score points Weighted mean score Number of test-taker responses Weighted proportion of test-takers in OTL category
Biology
Adaptation and variation in survival of species 32451 2 1.26 5,692 .62
Major organs and organ systems of organisms 32306 2 .91 5,787 .83
Reproduction and heredity 32530Z 2 .78 5,485 .69
Reproduction and heredity 42297 2 .67 5,767 .69
Cells and their functions 52263A 1 .19 5,186 .92
Ecosystems 42300A 1 .46 6,189 .78
Reproduction and heredity 42319 1 .38 5,270 .69
Reproduction and heredity 52265 1 .38 6,435 .68
Chemistry
Classification and composition of matter 42305 2 .52 6,422 .85
Chemical change 42100 2 1.16 5,530 .68
Physical and chemical properties of matter 52049 2 .59 6,356 .86
Acids and bases 52043Z 1 .25 6,332 .85
Chemical change 32679 1 .21 5,295 .68
Chemical change 42104 1 .21 6,348 .69
Classification and composition of matter 52145 1 .31 6,227 .86
Mixtures and solutions 42088 1 .42 6,391 .85
Physics
Light and sound energy 32369 2 .56 5,735 .75
Physical states and changes in matter 42173Z 2 1.20 6,271 .84
Physical states and changes in matter 42404 2 .47 5,745 .83
Electricity and magnetism 42195 1 .10 5,442 .62
Energy transformation and transfer 42400 1 .20 5,645 .86
Forces and motion 52233 1 .15 6,256 .79
Light and sound energy 42273 1 .44 5,450 .75
Physical states and changes in matter 42094 1 .37 6,000 .85
Earth science
Earth’s processes, cycles, and history 42317 2 1.20 5,524 .60
Earth’s processes, cycles, and history 52116 2 1.00 5,663 .58
Earth’s structure and physical features 32650Z 2 1.17 5,352 .58
Earth in the solar system 52110 1 .35 6,130 .65
Earth’s processes, cycles, and history 42301 1 .61 5,460 .59
Earth’s processes, cycles, and history 32126 1 .54 5,631 .59
Earth’s structure and physical features 42406 1 .31 5,421 .59
Earth’s structure and physical features 52289C 1 .37 5,209 .58

Note. OTL = opportunity to learn. “Weighted” statistics were computed using the SENWGT sampling probability weight.

Computing the Item IS Index

Given the operationalization of students’ content topic OTL in the teacher questionnaire, their science item response data could be analyzed using an IS index appropriate for an instruction–no instruction design (Naumann et al., 2016). To address Question 2, we computed the BDIFF IS effect size index for all 32 ordinal science achievement items, treating data from the 12 countries as a single sample. OTL groups for the 14 content topics assessed by the item set varied. Thus, we substituted different OTL grouping variables into a multiple-group generalized partial credit model to compute each item’s BDIFF effect size. (More formally, we obtained the BDIFF effect sizes from 14 models, using the same strategy to identify and anchor each model, as described subsequently.)

The mean latent science achievement score for each OTL group was constrained to 0, while that for the no OTL group was freely estimated in the models (see Table 4). This parameterization allowed us to distinguish between group differences in item difficulty that could be attributable to topic-specific OTL and differences in overall science achievement. To fully identify each model, we kept one item category threshold in the no OTL group fixed (to its single-group model value) across the models. In our formulation, all items had group-specific (multiple-group) parameters, except one scale anchor item whose threshold parameters were held constant across the groups, following a “free baseline” strategy (e.g., Nye & Drasgow, 2011). The anchor item was chosen, through an iterative process akin to Cao et al.’s (2017) study, to have item location differences less than |0.02| for all OTL groups. After selecting the one anchor item, which was dichotomous, we fixed its discrimination and threshold parameters to their single-group model value across all models. Then, we proceeded to compute each item’s BDIFF effect size.

Table 4.

Item Difficulty Difference Instructional Sensitivity Index Values for 32 Released Constructed-Response TIMSS English-Version Grade 8 Science Assessment Items

Content topic   Label   Number of score categories   Item discrimination coefficient   Latent θ mean in no OTL group   Step 0 → 1 BDIFF   Step 1 → 2 BDIFF   Effect size category
Biology
Adaptation and variation in survival of species 32451 3 1.04 −.05 .18 −.09 N, N
Major organs and organ systems of organisms 32306 3 .77 .26 .15 .79 N, L
Reproduction and heredity 32530Z 3 .82 .28 −.21 .50 N, L
Reproduction and heredity 42297 3 .97 .28 .49 −.16 L, N
Cells and their functions 52263A 2 1.75 .38 .07 N
Ecosystems 42300A 2 1.91 .04 −.06 N
Reproduction and heredity 42319 2 1.87 .28 −.10 N
Reproduction and heredity 52265 2 1.81 .28 .09 N
Chemistry
Classification and composition of matter 42305 3 .87 −.37 .85 .38 L, M
Chemical change 42100 3 .97 −.17 .18 .08 N, N
Physical and chemical properties of matter 52049 3 1.17 −.37 .31 .49 M, L
Acids and bases 52043Z 2 .83 −.37 .64 L
Chemical change 32679 2 .84 −.17 .16 N
Chemical change 42104 2 1.38 −.17 .14 N
Classification and composition of matter 52145 2 2.11 −.37 .45 L
Mixtures and solutions 42088 2 .86 −.37 .09 N
Physics
Light and sound energy 32369 3 .67 .04 .41 .34 M, N
Physical states and changes in matter 42173Z 3 .75 −.21 −.03 .17 N, N
Physical states and changes in matter 42404 3 1.89 −.21 .14 −.19 M, M
Electricity and magnetism 42195 2 1.12 −.04 1.24 L
Energy transformation and transfer 42400 2 1.78 .08 .46 L
Forces and motion 52233 2 .95 .08 −.05 N
Light and sound energy 42273 2 1.86 .04 .40 L
Physical states and changes in matter 42094 2 1.26 −.21 .38 L
Earth science
Earth’s processes, cycles, and history 42317 3 1.13 −.21 .11 .20 N, N
Earth’s processes, cycles, and history 52116 3 1.45 −.18 −.02 −.05 N, N
Earth’s structure and physical features 32650Z 3 1.45 −.18 −.08 −.04 N, N
Earth in the solar system 52110 2 1.55 −.43 .35 L
Earth’s processes, cycles, and history 42301 2 1.53 −.21 .27 L
Earth’s processes, cycles, and history 32126 2 1.46 −.21 −.02 N
Earth’s structure and physical features 42406 2 1.84 −.21 −.04 N
Earth’s structure and physical features 52289C 2 1.01 −.18 .18 N

Note. Effect sizes printed in boldface are significantly different from 0. Effect size category classifications: L = large (minimum value = .375/item discrimination coefficient), M = moderate (minimum value = .250/item discrimination coefficient), and N = negligible. Item discrimination coefficients are reported from a single-group model. OTL = opportunity to learn; BDIFF = difference in item difficulty parameters between “no OTL” and “OTL” groups.

Missing data were handled by using full-information maximum likelihood estimation with robust, classroom cluster-adjusted standard errors. If a student was tested in a language other than English (e.g., Afrikaans, Arabic), their data were retained so accurate standard errors could be computed, but were assigned zero weight in computing the model parameter estimates (e.g., Heeringa et al., 2017, p. 117). A consequence of allowing the discrimination parameter to vary across items is that our criterion values for “moderate” and “large” item difficulty differences across groups (i.e., IS), .250 and .375, needed to be scaled by the estimated discrimination value for each item; then we could interpret the magnitude of each BDIFF value.
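Concretely, the classification rule from the Table 4 note can be written as the small helper below (the function name is ours; the criterion values .250 and .375 are divided by the item’s estimated discrimination):

```python
def classify_is_effect(bdiff, a_item, moderate=0.250, large=0.375):
    """Label a BDIFF value as negligible (N), moderate (M), or large (L) after
    scaling the criterion values by the item discrimination estimate."""
    if abs(bdiff) >= large / a_item:
        return "L"
    if abs(bdiff) >= moderate / a_item:
        return "M"
    return "N"

# Example: item 42305 (discrimination .87) with step BDIFF values .85 and .38
print([classify_is_effect(b, a_item=0.87) for b in (0.85, 0.38)])   # ['L', 'M']
```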

To identify patterns in the content characteristics of constructed-response items and rubrics that yield scores with “large” or “negligible” IS, we are currently conducting item content analysis interviews with master Grade 8 science teachers (i.e., educators who have a master’s degree in science education or science and at least 10 years of teaching experience) from public and private schools in several national education systems. The educators have been recruited and consented specifically for participation in this research; they were not previously known to the investigators. Two independent coders have applied direct semantic coding to the interview transcripts for each item, informed by earlier research findings about test item features that affect item score IS. Previous IS studies have relied on learning scientists or psychometricians to analyze the items’ content. Here, we utilize final codes of transcripts from the two teacher interviews in a country that administered the English-version assessment, the United States, to facilitate interpretation of our item IS index results. (The participant recruitment process, interview protocol, reflexive thematic analysis methods, and results of the semistructured interviews are reported in Altintaş et al. [2024].)

Results From the Constructed-Response Item IS Analysis

We applied each two-group generalized partial credit model to the TIMSS Grade 8 English-version science constructed-response item set, and computed the BDIFF step difficulty IS index for each adjacent score category boundary (0,1 and 1,2 if applicable; e.g., Penfield et al., 2008). Originally, we conceptualized BDIFF as the difference in overall item difficulty location for the OTL and no OTL student groups. When items had more than two score categories, we found that the magnitude of BDIFF values tended to vary appreciably across the category boundaries within an item, as well as across items, as shown in Table 4. Among the 13 items with three score categories, 2 category boundaries had large negative, 1 had moderate negative, 16 had negligible, 4 had moderate positive, and 3 had large positive step difficulty BDIFF values. One item, 42404, had score categories with nonnegligible step difficulty BDIFF values in opposite directions (i.e., positive, negative), so presenting a single value would have oversimplified the IS of that item’s score. Six item response functions had disordered category boundaries when the middle “partially correct” category contained the smallest proportion of responses for one or both OTL groups, such that it did not have the highest probability for students at any location in the science achievement score distribution. Figure 1 illustrates this phenomenon in Item 32369. Although ideally items should not exhibit such “step reversals” (Muraki, 1999, citing personal communication with H. Wainer), we retained them in the analytic sample. The item response functions in Figure 1 also exhibit a downward shift in category difficulty for students who had OTL about this item’s content topic, sound energy. The BDIFF values for this item’s 1,2 and 0,1 category boundaries are .34 and .41, respectively, in Table 4.

Figure 1.

Response Functions of Student Opportunity-to-Learn Groups on One Constructed-Response Science Assessment Item With Step BDIFF Values = .41, .34

Note. Category response functions for OTL and no OTL groups are displayed in yellow and blue tones, respectively.

The IS rate was higher for items with two than with three score categories. Among the 19 dichotomous items, 11 had negligible and 8 had large positive BDIFF values. Figure 2 summarizes the step difficulty BDIFF values for all 45 score category boundaries of the 32 constructed-response items. Overall, we found that constructed-response science item scores had varying IS, consistent with earlier results about mathematics items (Li et al., 2017; B. O. Muthén et al., 1991). We also observed patterns in item score IS by science domain, with chemistry and physics items tending to show higher BDIFF values than biology and earth science items. Most of the items that showed large positive IS had specific content topics about matter and energy (e.g., classification of matter, physical and chemical properties, changes in physical state, energy transfer, electricity, light and sound energy), although two had earth and space science topics (i.e., earth in the solar system, earth’s processes). Both large negative BDIFF indices pertained to biology item responses.

Figure 2.

Distribution of BDIFF Item Step Difficulty Instructional Sensitivity Index Values for 32 Constructed-Response TIMSS Grade 8 Science English-Version Assessment Items

We asked two master science teachers to examine 18 items and corresponding rubrics that yielded scores with large positive or near-zero IS. Each teacher’s content analysis interview was 3 to 4 hours in duration. The teachers made predictions about the range of cognitive response processes that they expected eighth-grade students to display, and the difficulty of each item score category for students who had not had relevant content topic instruction in school. They substantially agreed about students’ likely response processes and made step difficulty predictions (i.e., likelihood of a correct response as “very low,” “low,” “moderate,” or “high”) for students with and without OTL that reasonably matched our statistical results for nine items. 1 Predicting the difficulty of attaining a particular item score is complex (Hambleton & Jirka, 2006), even for experienced teachers, so we consider this to be a conservative criterion for item content analysis interview results that may be generalizable. For some other items, the two teachers’ predictions had similarities, but we limited our further interpretation to those items with the most consistent qualitative and statistical results.

The teachers each explained that items containing scientific vocabulary, or rubric performance level descriptions requiring students to use scientific vocabulary in their responses, would produce item scores that show IS. They both characterized chemistry items 42305, 52043Z, and 52145 this way, along with one biology item, 32530Z, that had a relatively high BDIFF value. One physics item, 42195, which required students to read an electrical circuit diagram, recall a resistance equation, and solve it, had a very large BDIFF value (1.24), as the teachers anticipated. Students’ responses on those items needed to demonstrate specialized disciplinary knowledge (Samarapungavan, 2018) to be scored as fully or partially correct. The teachers believed that eighth graders who had not learned about these items’ broad content topics in school were unlikely to generate even a partially correct response. Item 42305 and its rubric, which had large step difficulty BDIFF values of .85 and .38, are displayed in Figure 3 as an example of high IS.

Figure 3.

Classification of Matter Item (A) and Scoring Rubric (B) With Moderate-to-Large Step BDIFF Values = .85, .38

Source. Reprinted from TIMSS 2011 Assessment. Copyright © 2013 International Association for the Evaluation of Educational Achievement (IEA). TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College. Reprinted with permission.

Among the items with near-zero IS, the teachers independently judged that eighth graders could likely draw on their everyday life experience outside of school to produce partially correct responses to items 32451, 42100, and 52116 about animals using camouflage, evidence of chemical reactions, and functions of plant roots, respectively. Both teachers noted that a simple visual description of a picture of two beakers was accepted as a correct response for item 42088. Item 42088 and its rubric, which had a near-zero BDIFF value of .09, are displayed in Figure 4 as an example of negligible IS. Two biology items showed large negative difficulty differences for one score category, favoring the test-taker group with no OTL about the item’s broad topic. The rubric for item 32306, which asked students about the functioning of human eyes, indicated a simple visual description of the item picture would be scored as fully correct. The partially correct response for item 42297 (“grow both plants together”), which asked students to design an experiment, could be derived from everyday life and may have been unlikely for students who had instruction about reproduction and heredity. B. O. Muthén et al. (1991) similarly reported that a few TIMSS mathematics multiple-choice items had significant negative IS index values.

Figure 4.

Mixtures and Solutions Item (A) and Scoring Rubric (B) With Negligible BDIFF Value = .09

Source. Reprinted from TIMSS 2011 Assessment. Copyright © 2013 International Association for the Evaluation of Educational Achievement (IEA). TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College. Reprinted with permission. Highlighting added to the original.

Conclusions

Item score IS is an important psychometric property for any test item set that is used to assess test-takers’ learning or achievement of school curriculum objectives. When a test is claimed to measure curricular achievement, evidence that items become easier for students (i.e., trainees) after they experience relevant content topic instruction can confirm that score interpretation (Rodriguez, 2017). It is essential validity evidence. We argue that detectable item score IS is a desirable property for locally developed and large-scale educational achievement tests, whether they are used for classroom assessment or research purposes. It is particularly consequential when the scores will be used to make long-term decisions about individual students. Developing approaches to evaluate the IS of constructed-response items can improve the quality of item analysis. Overall, our results suggest the BDIFF IS index has adequate statistical properties, and is useful for IS detection in international large-scale educational assessment items. Supplementing the statistical analysis by item content analysis interviews with master science teachers is allowing us to extract principles for writing instructionally sensitive items and rubrics. We believe these principles can guide test item development and form assembly.

For constructed-response items that have more than two (i.e., polytomous) scores, the performance level descriptions in each rubric that operationalize fully and partially correct responses appear to drive IS. In real test item response data, we found information value in computing IS of the item score categories, not only the item as a whole. Thus, the IS of test item scores should be conceptualized as a characteristic of a particular item–rubric combination.

The magnitude of any relationship between content exposure and student achievement outcomes will depend on whether instructional content is described at the narrow item, or broader topic, level. To encourage challenging curriculum-aligned instruction, rather than instruction about incidental features of the test items (Koretz, 2017), when IS is evaluated, educators should be asked to report about broad content topic instruction (Burstein, 1989). Then, the item IS effect sizes will depend on the precision of the content topic statements, their conceptual “distance” from the item content (Ruiz-Primo et al., 2002), the consistency with which teachers interpret them, and students’ range of content exposure (Kuger, 2016). Ruiz-Primo et al. (2012) demonstrated that we should expect small IS indices when items are intended to assess students’ transfer of learning to new contexts (e.g., Harris et al., 2019). Among students who have experienced relevant content instruction, IS effect sizes might further increase with instructional quality if the test items become easier for them (Ing, 2018), but such content by instructional quality interactions are difficult to observe and measure reliably (Casabianca et al., 2015; Mikeska et al., 2019). 2

Our results indicate item score categories that require either comprehension or use of scientific vocabulary (e.g., Lazaroff & Vlach, 2022) are sensitive to content topic coverage during instruction. These items often showed detectable step difficulty differences between the student groups who had, and did not have, OTL about the item’s broad content topic. However, the magnitude of the item IS effect sizes was seldom very large. Students struggle to learn about energy and the particulate structure of matter as abstract concepts (Stevens et al., 2010; Zhou & Traynor, 2022), so it seems plausible that items requiring this knowledge would be particularly sensitive to differences in school instruction. Other TIMSS items integrate multiple science content topics, though, so may show little IS when the OTL indicator represents a single topic. The teachers who we interviewed systematically attributed any item difficulty differences favoring the OTL group to disciplinary content features of the item and rubric, which supports that these science assessment items can yield data about students’ learning in school—it is a form of validity evidence (Leighton, 2019).

We found that some science items had negligible IS due to scoring rubric categories that gave credit for knowledge from everyday life experience (Cowie et al., 2011; Eberbach & Crowley, 2009). The step difficulty of those items presumably changed little, even if students received more sophisticated instruction about the topic at school. Items asking about everyday life experience may give students who have had less OTL access to the test, and positively affect the testing experience for those students. This consideration might encourage use of such items when the target construct extends beyond curricular science achievement. Looking across all the rubrics for one test form, however, the proportion of performance level descriptions that assign points for everyday life experience should be intentional, and maintained across forms and years to prevent construct shift (Strachan et al., 2021) if score comparisons will be made. This principle should apply to adaptive test forms, also. Other science items had IS index values that were unexpectedly low, given master teachers’ predictions about how eighth graders would use academic science knowledge to develop correct responses. A practical recommendation might be to reexamine those items’ scoring rubric categories, then retain the items if they have acceptable item discrimination values, although they do not have detectable IS.

Naumann et al. (2016) proposed that item difficulty differences can only be attributed to item content if they favor the test-taker group who had relevant instruction. Our study is not the first to report a few items with appreciable negative IS, though (B. O. Muthén et al., 1991; Naumann et al., 2019). We allow that negative item IS could be caused by problematic item or scoring rubric content. In the items where we observed large negative IS, the source seemed to be one rubric category. In our view, items with negative score IS are undesirable for most conceivable score uses, unless we intend to measure science misconceptions. Just as test-takers typically are not assigned credit for selecting a distractor option in a multiple-choice item, we suppose that a negative IS index suggests a rubric performance level description is capturing something outside the target construct, which probably should not be assigned partial credit.

To examine the mechanisms of item IS, our study design capitalized on variation in the enacted curriculum across countries and classrooms. We interpreted patterns in our statistical results—the IS effect size categories—juxtaposed against descriptions of the items’ content features from our master teacher interviews. Given these study results, we evaluate counterexplanations. Since the test item response data came from students in 12 countries, which were weighted equally in the analysis, it seems unlikely that the item IS effect sizes were systematically driven by the curriculum in one country. However, the relatively small number of students who took each item meant that we were not able to conduct within-country analyses to exclude that possibility. Because the study design was observational, the student groups with and without OTL about the broad content topic of each item presumably differed on other unmeasured characteristics. Thus, the group item difficulty differences that we interpret as item IS to learning opportunities during school may be partly attributable to varying home educational supports, as an immediate or distal cause that we did not model. From our perspective, the most likely source of bias in our item IS effect size estimates may be systematic overreporting about coverage of certain content topics in the TIMSS teacher questionnaires. In instances when multiple science teachers reported about the content exposure history of a particular student, agreement between them was quite high, more than 97%, as we reported earlier. However, for certain content topics, mostly in biology, we found that the student group with no OTL had higher science achievement scores (i.e., latent mean) than the group with opportunity, after allowing for group differences in the item (step) difficulty parameters. We would not expect that result if the two groups were otherwise similar. We suppose that a tendency by the TIMSS teachers to overreport content coverage, which would affect the accuracy of the OTL grouping variable for certain topics, could explain that result. If some students in an “OTL” group actually had no exposure to the content topic in school (or the instructional quality was low), then the item difficulty estimates for that group may be too high, and those BDIFF item IS effect sizes could be underestimated. This occurrence could explain the low proportion of instructionally sensitive items that we detected.

Limitations

Students’ OTL in school can be defined expansively to encompass their learning environment and resources (e.g., Moss et al., 2008), but under any definition, their exposure to particular content topics is necessary for their learning about those topics. Our item IS index operationalized “no OTL in school” in an unambiguous way, as “this broad topic has not been taught.” However, the “OTL in school” category was operationalized more coarsely, as “this topic has been taught,” without capturing information about the amount of instructional time or emphasis devoted to the topic, at variance with recommendations about measuring OTL (Kuger, 2016; Schmidt & Maier, 2012). While binary representations of OTL have been shown to be useful in previous item IS studies, we acknowledge that surveying educators about their content topic emphasis might be preferable, if it were judged feasible given the time cost and political considerations in participating countries.

Our BDIFF index intentionally does not capture differences in item discrimination (or item information) that could be caused by instruction, consistent with IS definitions in the literature (Naumann et al., 2016; Polikoff, 2010). If the discrimination of the investigated item varies across OTL groups, the model misspecification caused by constraining discrimination to a single value could bias the step difficulty BDIFF estimates. Although one previous study found little evidence that differences in students’ OTL affected the item discrimination of mathematics items (Li et al., 2017), an alternative effect size index could be the expected item score difference from a multiple-group generalized partial credit model that allows the discrimination parameter of the investigated item to vary across groups (e.g., Meade, 2010); a sketch of this alternative appears below. Our real data application used moderate sample sizes: 10 items per test-taker and more than 5,000 persons per item, including at least 70 persons per item score by OTL group category. Under these conditions, IS effect sizes tended to be significantly different from zero only if they were large. If the person sample size is smaller, if the proportion of IS item scores is believed to be low, and if statistical significance testing will be the main method of IS detection, then a logistic regression-based IS index (e.g., Li et al., 2017) could be a viable alternative.
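To make that alternative concrete, the following minimal sketch in Python (illustrative only, not code from our study) computes an expected item score difference between OTL groups for a single polytomous item under the generalized partial credit model, letting both the discrimination and the step difficulties differ by group and averaging the difference over a standard normal ability distribution. All parameter values shown are hypothetical.

import numpy as np

def gpcm_probs(theta, a, b):
    # Category probabilities for one GPCM item at ability theta;
    # b holds the step difficulties (length = number of score categories - 1).
    z = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b, dtype=float)))))
    num = np.exp(z - z.max())  # subtract the maximum for numerical stability
    return num / num.sum()

def expected_score(theta, a, b):
    p = gpcm_probs(theta, a, b)
    return np.dot(np.arange(len(p)), p)

def expected_score_difference(a_otl, b_otl, a_no, b_no, n_points=61):
    # Average expected-score difference between the OTL and no-OTL groups,
    # integrating over a standard normal ability distribution on a grid.
    grid = np.linspace(-4.0, 4.0, n_points)
    weights = np.exp(-0.5 * grid**2)
    weights /= weights.sum()
    diffs = [expected_score(t, a_otl, b_otl) - expected_score(t, a_no, b_no)
             for t in grid]
    return float(np.dot(weights, diffs))

# Hypothetical parameters for a three-category (0/1/2) item:
print(expected_score_difference(a_otl=1.1, b_otl=[-0.2, 0.6],
                                a_no=0.9, b_no=[0.1, 1.0]))

In this sketch, positive values indicate that, at the same ability level, students with OTL are expected to earn more score points on the item than students without OTL.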

Future Directions

Our study and others have compiled evidence about content characteristics of instructionally sensitive achievement test items that are distinct from general best practices for writing items and rubrics (e.g., Albano & Rodriguez, 2018). Our research team is continuing educator interviews and structured thematic analysis of their results to articulate principles for writing instructionally sensitive constructed-response items more precisely. IS has been conceptualized as a “property” of total test scores or items (e.g., Naumann et al., 2020), which implies some degree of stability; before using information about item IS features during test development, we should therefore appraise the replicability of item IS across time, or across samples from a population, when the curriculum is invariant. Ongoing research by our team is also examining the stability of item IS indices. As educators and researchers work together to understand instructionally sensitive constructed-response items and rubrics (how to write them, when to deploy them, and how to interpret item IS index magnitude relative to the distance between each item and instructional content), we envision advances in achievement test development.

Acknowledgments

The authors gratefully acknowledge the thoughtful contributions of the participating master science teachers; their efforts were essential to the quality of the study. The authors also wish to thank two anonymous reviewers for sharing their ideas about item instructional sensitivity, which have improved this work.

1. Among the other nine items/rubric sets that the teachers reviewed: For four items, only Teacher 2’s predictions agreed with the IS effect size results. For one item, only Teacher 1’s prediction agreed with the IS effect size results. For two items, the teachers’ predictions agreed with each other but conflicted with the statistical results. For the remaining two items, there was no agreement among the teachers’ predictions and the statistical results.

2. In an auxiliary analysis, we regressed the item response probability of students who had OTL a given content topic on a standardized variable from the teacher survey questionnaire indicating the frequency with which teachers emphasize science investigation during lessons, which may serve as a proxy for instructional quality. Notably, this “science investigation emphasis” variable (label: BTBSESI), which referred to all science lessons, had a higher level of generality than the content topic OTL reports that were used to demarcate student groups for the IS index computation. The odds ratio effect sizes from mixture multiple-group generalized partial credit models (i.e., ordinal logistic regression in the OTL category only) for all 32 items ranged between .94 and 1.03. The magnitudes of the item BDIFF values changed by 0.00 to 0.08 units when the instructional quality variable was added to the model, which we interpreted as small adjustments. We concluded that the instructional quality variable was not an important predictor of test item category response probabilities among students who had OTL the item’s content topic, while accounting for their overall content area achievement and the item difficulty.
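For readers who wish to run a comparable check on their own data, the following is a minimal, hypothetical illustration (not our analysis code) of a proportional-odds ordinal logistic regression of one item’s partial-credit score on a standardized instructional emphasis covariate and an overall achievement control, fit to students in the OTL group only. The data frame, variable names, and simulated values are placeholders.

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Simulated stand-in for the OTL-group students on one item (placeholder data).
rng = np.random.default_rng(0)
n = 500
otl_students = pd.DataFrame({
    "item_score": rng.integers(0, 3, n),        # 0/1/2 partial-credit score
    "btbsesi_z": rng.standard_normal(n),        # standardized emphasis covariate
    "achievement_z": rng.standard_normal(n),    # overall achievement control
})
otl_students["item_score"] = pd.Categorical(otl_students["item_score"],
                                            categories=[0, 1, 2], ordered=True)

model = OrderedModel(otl_students["item_score"],
                     otl_students[["btbsesi_z", "achievement_z"]],
                     distr="logit")
result = model.fit(method="bfgs", disp=False)

# Odds ratio for the instructional-quality proxy; the value printed here
# reflects random placeholder data only.
print(np.exp(result.params["btbsesi_z"]))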

Footnotes

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The project was supported in part by the International Association for the Evaluation of Educational Achievement (IEA) Research & Development Fund. The authors are duly responsible for the content of this manuscript.

Ethical Approval: This study has been approved by the Purdue University Institutional Review Board.

Informed Consent: Classroom teachers gave written informed consent to participate in their item content analysis interview.

Data Availability: The Trends in International Mathematics and Science Study data are publicly available for download from the TIMSS & PIRLS International Study Center: https://timssandpirls.bc.edu/databases-landing.html

References

1. Albano A. D., Rodriguez M. C. (2018). Item development research and practice. In Elliott S. N., Kettler R. J., Beddow P. A., Kurz A. (Eds.), Handbook of accessible instruction and testing practices: Issues, innovations, and applications (pp. 181–198). Springer.
2. Altıntaş Ö., Chang Y.-H., Koloi-Keaikitse S., Traynor A. (2024). Educators’ analysis of the instructional sensitivity of constructed-response test items. Paper presented at the NCME Classroom Assessment Conference, Chicago, IL.
3. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
4. Belzak W. C. (2020). Testing differential item functioning in small samples. Multivariate Behavioral Research, 55(5), 722–747.
5. Brookhart S. M. (2018). Appropriate criteria: Key to effective rubrics. Frontiers in Education, 3, Article 22.
6. Burstein L. (1989). Conceptual considerations in instructionally sensitive assessments (CSE Technical Report 333). Center for Research on Evaluation, Standards, and Student Testing, University of California, Los Angeles.
7. Cao M., Tay L., Liu Y. (2017). A Monte Carlo study of an iterative Wald test procedure for DIF analysis. Educational and Psychological Measurement, 77(1), 104–118.
8. Casabianca J. M., Lockwood J. R., McCaffrey D. F. (2015). Trends in classroom observation scores. Educational and Psychological Measurement, 75(2), 311–337.
9. Cowie B., Jones A., Otrel-Cass K. (2011). Re-engaging students in science: Issues of assessment, funds of knowledge and sites for learning. International Journal of Science and Mathematics Education, 9, 347–366.
10. D’Agostino J. V., Welsh M. E., Corson N. M. (2007). Instructional sensitivity of a state’s standards-based assessment. Educational Assessment, 12(1), 1–22.
11. Daus S., Braeken J. (2018). The sensitivity of TIMSS country rankings in science achievement to differences in opportunity to learn at classroom level. Large-Scale Assessments in Education, 6, 1–31.
12. DeMars C. (2010). Item response theory. Oxford University Press.
13. Eberbach C., Crowley K. (2009). From everyday to scientific observation: How children learn to observe the biologist’s world. Review of Educational Research, 79(1), 39–68.
14. Finch W. H. (2016). Detection of differential item functioning for more than two groups: A Monte Carlo comparison of methods. Applied Measurement in Education, 29(1), 30–45.
15. Floden R. E. (2002). The measurement of opportunity to learn. In Porter A. C., Gamoran A. (Eds.), Methodological advances in cross-national surveys of educational achievement (pp. 231–266). National Academy Press.
16. Gómez-Benito J., Sireci S. G., Padilla García J. L., Hidalgo Montesinos M. D., Benítez Baena I. (2018). Differential item functioning: Beyond validity evidence based on internal structure. Psicothema, 30(1), 104–109.
17. Hambleton R. K., Jirka S. J. (2006). Anchor-based methods for judgmentally estimating item statistics. In Downing S. M., Haladyna T. M. (Eds.), Handbook of test development (pp. 413–434). Erlbaum.
18. Hannum E., Buchmann C. (2005). Global educational expansion and socio-economic development: An assessment of findings from the social sciences. World Development, 33(3), 333–354.
19. Harris C. J., Krajcik J. S., Pellegrino J. W., DeBarger A. H. (2019). Designing knowledge-in-use assessments to promote deeper learning. Educational Measurement: Issues and Practice, 38(2), 53–67.
20. Heeringa S. G., West B. T., Berglund P. A. (2017). Applied survey data analysis (2nd ed.). Chapman and Hall/CRC Press.
21. Ing M. (2018). What about the “instruction” in instructional sensitivity? Raising a validity issue in research on instructional sensitivity. Educational and Psychological Measurement, 78(4), 635–652.
22. Kalogrides D., Loeb S. (2013). Different teachers, different peers: The magnitude of student sorting within schools. Educational Researcher, 42(6), 304–316.
23. Koretz D. (2017). The testing charade: Pretending to make schools better. University of Chicago Press.
24. Kristjansson E., Aylesworth R., McDowell I., Zumbo B. D. (2005). A comparison of four methods for detecting differential item functioning in ordered response items. Educational and Psychological Measurement, 65(6), 935–953.
25. Kuger S. (2016). Curriculum and learning time in international school achievement studies. In Kuger S., Klieme E., Jude N., Kaplan D. (Eds.), Assessing contexts of learning (pp. 395–422). Springer.
26. Lazaroff E., Vlach H. A. (2022). Children’s science vocabulary uniquely predicts individual differences in science knowledge. Journal of Experimental Child Psychology, 221, 105427.
27. Leighton J. P. (2019). The risk–return trade-off: Performance assessments and cognitive validation of inferences. British Journal of Educational Psychology, 89(3), 441–455.
28. Li H., Qin Q., Lei P. W. (2017). An examination of the instructional sensitivity of the TIMSS math items: A hierarchical differential item functioning approach. Educational Assessment, 22(1), 1–17.
29. Martin M. O., Mullis I. V. S. (Eds.). (2012). Methods and procedures in TIMSS and PIRLS 2011. TIMSS & PIRLS International Study Center, Boston College.
30. Meade A. W. (2010). A taxonomy of effect size measures for the differential functioning of items and scales. Journal of Applied Psychology, 95(4), 728–743.
31. Mehrens W. A., Phillips S. E. (1987). Sensitivity of item difficulties to curricular validity. Journal of Educational Measurement, 24(4), 357–370.
32. Mikeska J. N., Holtzman S., McCaffrey D. F., Liu S., Shattuck T. (2019). Using classroom observations to evaluate science teaching: Implications of lesson sampling for measuring science teaching effectiveness across lesson types. Science Education, 103(1), 123–144.
33. Moss P. A., Pullin D. C., Gee J. P., Haertel E. H., Young L. J. (Eds.). (2008). Assessment, equity, and opportunity to learn. Cambridge University Press.
34. Muraki E. (1999). Stepwise analysis of differential item functioning based on multiple-group partial credit model. Journal of Educational Measurement, 36(3), 217–232.
35. Muthén B. O., Kao C. F., Burstein L. (1991). Instructionally sensitive psychometrics: Application of a new IRT-based detection technique to mathematics achievement test items. Journal of Educational Measurement, 28(1), 1–22.
36. Muthén L. K., Muthén B. O. (2017). Mplus user’s guide (8th ed.).
37. Naumann A., Hochweber J., Klieme E. (2016). A psychometric framework for the evaluation of instructional sensitivity. Educational Assessment, 21(2), 89–101.
38. Naumann A., Musow S., Katstaller M. (2020). Instructional sensitivity as a prerequisite for determining the effectiveness of interventions in educational research. In Astleitner H. (Ed.), Intervention research in educational practice: Alternative theoretical frameworks and application problems (pp. 147–170). Waxmann.
39. Naumann A., Rieser S., Musow S., Hochweber J., Hartig J. (2019). Sensitivity of test items to teaching quality. Learning and Instruction, 60, 41–53.
40. Nye C. D., Drasgow F. (2011). Effect size indices for analyses of measurement equivalence: Understanding the practical importance of differences between groups. Journal of Applied Psychology, 96(5), 966–980.
41. Penfield R. D., Alvarez K., Lee O. (2008). Using a taxonomy of differential step functioning to improve the interpretation of DIF in polytomous items: An illustration. Applied Measurement in Education, 22(1), 61–78.
42. Polikoff M. S. (2010). Instructional sensitivity as a psychometric property of assessments. Educational Measurement: Issues and Practice, 29(4), 3–14.
43. Polikoff M. S., Gasparian H., Korn S., Gamboa M., Porter A. C., Smith T., Garet M. S. (2020). Flexibly using the surveys of enacted curriculum to study alignment. Educational Measurement: Issues and Practice, 39(2), 38–47.
44. Popham W. J., Berliner D. C., Kingston N. M., Fuhrman S. H., Ladd S. M., Charbonneau J., Chatterji M. (2014). Can today’s standardized achievement tests yield instructionally useful data? Quality Assurance in Education, 22(4), 300–316.
45. Rodriguez M. C. (2017, April). Item and test design considering instructional sensitivity [Paper presentation]. National Council on Measurement in Education Annual Meeting, San Antonio, TX.
46. Roussos L. A., Schnipke D. L., Pashley P. J. (1999). A generalized formula for the Mantel-Haenszel differential item functioning parameter. Journal of Educational and Behavioral Statistics, 24(3), 293–322.
47. Ruiz-Primo M. A., Li M., Wills K., Giamellaro M., Lan M. C., Mason H., Sands D. (2012). Developing and evaluating instructionally sensitive assessments in science. Journal of Research in Science Teaching, 49(6), 691–712.
48. Ruiz-Primo M. A., Shavelson R. J., Hamilton L., Klein S. (2002). On the evaluation of systemic science education reform: Searching for instructional sensitivity. Journal of Research in Science Teaching, 39(5), 369–393.
49. Russell M., Moncaleano S. (2019). Examining the use and construct fidelity of technology-enhanced items employed by K-12 testing programs. Educational Assessment, 24(4), 286–304.
50. Samarapungavan A. (2018). Construing scientific evidence: The role of disciplinary knowledge in reasoning with and about evidence in scientific practice. In Fischer F., Chinn C. A., Engelmann K., Osborne J. (Eds.), Scientific reasoning and argumentation (pp. 56–76). Routledge.
51. Schmidt W. H., Maier A. (2012). Opportunity to learn. In Sykes G., Schneider B., Plank D. N. (Eds.), Handbook of education policy research (pp. 541–559). Routledge.
52. Schmidt W. H., McKnight C. C., Cogan L. S., Jakwerth P. M., Houang R. T. (1999). Facing the consequences: Using TIMSS for a closer look at U.S. mathematics and science education. Kluwer.
53. Stevens S. Y., Delgado C., Krajcik J. S. (2010). Developing a hypothetical multi-dimensional learning progression for the nature of matter. Journal of Research in Science Teaching, 47(6), 687–715.
54. Strachan T., Cho U. H., Kim K. Y., Willse J. T., Chen S. H., Ip E. H., Weeks J. P. (2021). Using a projection IRT method for vertical scaling when construct shift is present. Journal of Educational Measurement, 58(2), 211–235.
55. Zhou S., Traynor A. (2022). Measuring students’ learning progressions in energy using cognitive diagnostic models. Frontiers in Psychology, 13, 892884. 10.3389/fpsyg.2022.892884
