Abstract
Models of word reading that simultaneously take into account item-level and person-level fixed and random effects are broadly known as explanatory item response models (EIRM). Although many variants of the EIRM are available, the field has generally focused on the doubly explanatory model for modeling individual differences in item responses. Moreover, the historical application of the EIRM has been a Rasch version of the model in which item discrimination values are fixed at 1.0 and the random or fixed item effects pertain only to the item difficulties. The statistical literature has advanced to allow for more robust testing of observed or latent outcomes, as well as more flexible parameterizations of the EIRM. The purpose of the present study was to compare four types of Rasch-based EIRMs (i.e., doubly descriptive, person explanatory, item explanatory, doubly explanatory) and, more broadly, to compare Rasch and 2PL EIRMs when including person-level and item-level predictors. Results showed not only that the error variance was smaller in the unconditional 2PL EIRM than in the Rasch EIRM due to the inclusion of the item discrimination random effect, but also that patterns of unique item-level explanatory variables differed between the two approaches. Results are interpreted within the context of what each statistical model affords for describing and explaining differences in word-level performance.
Keywords: explanatory item response model, crossed random effects, 2PL IRT, reading
Scholarly literature is replete with theories that explain how children learn to read, and basic reading science tests hypotheses related to such theories. For example, Perfetti and Hart’s (2002) lexical quality hypothesis (LQH) postulates that there are word features (e.g., orthographic, phonological, and semantic) and child characteristics (i.e., variation in mental lexicon) that contribute to a child’s accurate word reading ability. Studies of individual differences of how word features relate to child-level outcomes (e.g., Garlock et al., 2001; Sosa & Stoel-Gammon, 2012) primarily use a single-level multiple regression as a statistical means by which independent variables of word and child characteristics predict child-level word-reading. From a statistical perspective, this approach can lead to a mismatch between the theory and analytic hypothesis test whereby LQH considers word-level and child-level characteristics, but the single-level regression only considers explaining individual differences in child-level variance of word reading.
One of the most well-known aphorisms in statistics is that all models are wrong, but some are useful (Box & Draper, 1987), and through that lens we agree that single-level regression applied to questions about word-level and person-level differences has been useful. At the same time, there exist greater opportunities to leverage alternative statistical models that provide a stronger correspondence between theory and hypothesis testing. The goal of the present paper in this special issue is to revisit the explanatory item response model (EIRM) in its statistical underpinnings, explain its conceptual application and the interpretation of coefficients across software packages, and introduce an emerging extension of the commonly used Rasch EIRM to a latent 2PL EIRM.
Explanatory Item Response Modeling (EIRM)
Explanatory item response models (EIRMs) are a broad range of statistical models that can be treated as synonymous with generalized linear mixed models (McCulloch & Searle, 2001), cross-classified models (e.g., Kim, Petscher, Foorman, & Zhou, 2010), crossed random effect models (e.g., Gilbert, Compton, & Kearns, 2011), and cross-classified, multilevel logistic models (Van den Noortgate, De Boeck, & Meulders, 2003). The general, core functionality of these models is to summarize a matrix of item responses to understand: 1) the odds of item-level accuracy, 2) the extent to which there are individual differences in item-level accuracy, 3) how much of the variance in item-level accuracy is due to between-item differences compared to between-person differences, and 4) whether selected person predictors, item predictors, and combinations of the two explain the respective variances. In this way, EIRMs reflect a merging of multilevel and psychometric traditions that have historically been treated as separate means of statistical inquiry.
The bridging of multilevel and psychometric models in the EIRM is novel, respective to either approach, largely due to the allowable mechanisms of fixed and random effects for persons and items (De Boeck & Wilson, 2004). Just as a regression analysis can be called single level or multilevel based on the treatment of cluster level effects as random or fixed (e.g., classroom or school), so too are EIRMs given different labels according to how the item-level and person-level effects are treated. When there is an absence of person and item predictors in the EIRM, such that each person and item has its own unique estimate (i.e., a threshold for each item and a trait-level estimate for each person), the model is considered to be a doubly descriptive model (also known as a Rasch model). A well-known type of doubly descriptive model is a Rasch item response theory model that can be expressed as1
$$\pi_{pi} = \frac{\exp(\theta_p - \beta_i)}{1 + \exp(\theta_p - \beta_i)} \qquad (1)$$
where πpi is the probability of person p correctly responding to item i; θp is the ability level of person p, where negative values reflect low ability and positive values reflect high ability, and is distributed N(0, σ²); βi are fixed parameters that represent the difficulty level of item i, where negative values reflect easy items and positive values reflect hard items; and exp denotes the exponential function with base e, Euler’s number (i.e., ≈2.718).
The doubly descriptive nature of Equation 1 is that there is a fixed estimate of each person’s ability level based on item responses (i.e., θp) and a fixed estimate of each item’s difficulty level based on the item responses (i.e., βi). A facet of the Rasch model is that whenever a given value of θp is equal to βi, the probability of a correct response will be .50. That is, when the ability of an individual matches the difficulty of the item, the exponentiated value of the numerator (i.e., 0) is 1 [i.e., exp(0) = 1] and the denominator is 2 [i.e., 1+exp(0) = 2], resulting in a probability value of 1/2 = .50. More broadly, Eq. 1 may be expressed as
$$\eta_{pi} = \theta_p - \beta_i \qquad (2)$$
where ηpi is a linear predictor that is related to the probability in Eq. 1 through a link function, usually a logit or probit link. The reason for introducing Eq. 2 as a linear predictor rather than solely presenting the exponentiated version in Eq. 1 is that Eq. 2 allows for a more seamless presentation of the three other types of EIRMs. Eq. 2 further expresses that there is an interpretative link between the ability of the individual and the difficulty of the item. For a given mean intercept value in the EIRM, the interpretation of the ability is in the direction of the intercept; however, the interpretation of the difficulty of the item is the negative of the intercept.
As an example, if an unconditional EIRM using a logit link estimated an intercept log-odds value of 1.50, we could reparameterize that value to a person-level probability of a correct response of .82. We can simultaneously interpret the 1.50 as a type of item difficulty value by applying the –βi in Eq. 2. This tandem of θp and –βi yields an alignment in interpretation whereby the average probability of a correct response of .82 aligns with an average item difficulty of −1.50, denoting an easy item. This interplay between ability and difficulty allows for consistency in interpretation in the Rasch EIRM (see Wilson, 2008 for more information on the formal relations between Eqs. 1 and 2).
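These conversions can be verified directly; a minimal R sketch using only the values from the example above is:

```r
# Worked check of the example above: an intercept log-odds of 1.50
intercept_logit <- 1.50

# person-side interpretation: probability of a correct response
plogis(intercept_logit)       # ~.82

# item-side interpretation: the difficulty is the negative of the intercept
-intercept_logit              # -1.50, i.e., an easy item

# when ability equals difficulty (theta - beta = 0), the probability is .50
exp(0) / (1 + exp(0))         # 0.50
```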
When person-level covariates are included in the model, but item-level covariates are not, the EIRM is referred to as a person explanatory model (also known as the latent regression Rasch model). Eq. 2 can be easily extended to the person explanatory model by
$$\eta_{pi} = \sum_{j=1}^{J} \vartheta_j Z_{pj} + \varepsilon_p - \beta_i \qquad (3)$$
Notice the explicit connection between the Eq. 2 portion of Eq. 3 and that the person explanatory portion of the model is substitutionary for θp. In this case, θp is defined by Σj ϑjZpj + εp, where Zpj is a value for person p on covariate j (j = 1, …, J; e.g., a score of 100 on a standardized measure of vocabulary for a given individual in a sample), ϑj is the regression weight for the person-level covariate, and εp is the remaining person effect (i.e., random effect) not explained by Zpj and is N(0, σ²). When item-level covariates are included, but person-level covariates are not, the EIRM is an item explanatory model (also known as the linear logistic test model; LLTM) and Eq. 2 may be extended to this model as
$$\eta_{pi} = \theta_p - \beta_i^*, \qquad \beta_i^* = \sum_{k=1}^{K} \beta_k X_{ik} + \varepsilon_i \qquad (4)$$
Notice here the explicit connection between the Eq. 2 portion of Eq. 4 and that the item explanatory portion of the model is substitutionary for βi. In Eq. 4, βi* is defined by Σk βkXik + εi, where Xik is a value for item i on item covariate k (k = 1, …, K; e.g., a grapheme-phoneme correspondence score for a given item), βk is the regression weight for the item-level covariate, and εi is the random effect for items not explained by the included covariates. The asterisk on βi* in Eq. 4 denotes that the linear function is not equal to βi because the prediction is not perfect (Wilson et al., 2008).
When both item covariates and person covariates are included, Eqs. 2–4 are combined for a doubly explanatory model (also known as the latent regression LLTM) as

$$\eta_{pi} = \sum_{j=1}^{J} \vartheta_j Z_{pj} + \varepsilon_p - \left(\sum_{k=1}^{K} \beta_k X_{ik} + \varepsilon_i\right) \qquad (5)$$
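To make the four specifications concrete, a hedged lme4 sketch (not the authors' code) is shown below, assuming a long-format data frame `long` with one row per person-item response and hypothetical columns `correct` (0/1), `person`, `item`, a person covariate `vocab`, and an item covariate `freq`:

```r
library(lme4)

# Doubly descriptive (Eq. 2): crossed random person and item intercepts
m_dd <- glmer(correct ~ 1 + (1 | person) + (1 | item),
              family = binomial("logit"), data = long)

# Person explanatory (Eq. 3): a person covariate explains person variance
m_pe <- glmer(correct ~ vocab + (1 | person) + (1 | item),
              family = binomial("logit"), data = long)

# Item explanatory (Eq. 4): an item covariate explains item variance
m_ie <- glmer(correct ~ freq + (1 | person) + (1 | item),
              family = binomial("logit"), data = long)

# Doubly explanatory (Eq. 5): both sets of covariates
m_de <- glmer(correct ~ vocab + freq + (1 | person) + (1 | item),
              family = binomial("logit"), data = long)
```

In each case the crossed random intercepts for `person` and `item` carry the person and item effects described above, with any residual person or item variance remaining after the covariates are included.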
EIRM in Literacy
The importance of laying out the foundation of the prominent expressions of EIRMs is to both see their applications in the literature and to evaluate extensions that may be used for more comprehensive explanation of item-level and person-level variances in item-level accuracy. The doubly descriptive model abounds in literacy research (e.g., Bowles, Skibbe, & Justice, 2011; Chaves-Sousa et al., 2017; Freitas et al., 2018; Piasta, Groom, Khan, Skibbe, & Bowles, 2018) for estimating the relative difficulty of items across tests of word reading and letter name knowledge, and is primarily observed in its application for psychometric validation purposes. Person explanatory models are less frequently found in literacy research and are more generally found in areas of applied psychometrics. For example, Wilson and Moore (2011) used the person explanatory model to explain differences in a language assessment using student characteristics, and also applied the model to tests of reading comprehension using gender and English language learner characteristics as explanatory variables of between-person differences in item-level performance (Wilson & Moore, 2012). Item explanatory models have a longer historical tradition of application, especially in the context of cognitive diagnostic models (e.g., Baghaei & Ravand, 2015; Medina-Diaz, 1993; Ravand, 2016; Sheehan & Mislevy, 1990), but as with the previous two models they are largely used in areas of applied psychometrics research.
Only the doubly explanatory model finds application in the study of individual differences rather than psychometrics, which is unsurprising given that the model is inclusive of two random effects that researchers use as means of explaining variance. Gilbert, Compton, and Kearns (2011) used the doubly explanatory model to look at how person-level phonemic awareness, working memory, and rapid naming skills, as well as word frequency, word size, and other word characteristics, related to the probability of word-level accuracy. The authors observed that a substantial portion of the variance could be attributed to between-word differences (i.e., 20.03%) and between-person differences (i.e., 14.36%), with remaining variance attributed to classrooms, schools, and scale-level variance (i.e., π²/3). Further, phonemic awareness and rapid naming skills significantly explained student variance in word-level accuracy; rime frequency, grapheme complexity, and an interaction between rime frequency and rime size significantly explained word-level variance. Kearns (2015) used the same methodology to explore how person characteristics (i.e., grapheme-phoneme correspondence, phonological awareness, phonogram reading, vocabulary knowledge, affix reading, morphological awareness, and root word reading) and word characteristics (i.e., bigram, word, and OLD frequency; number of letters; regularity; transparency; root family size; and root word frequency) related to polysyllabic and polymorphemic word reading. Nearly 70% of the person-level and word-level variance in polymorphemic reading was explained, and approximately 50% of the person- and word-level variance in polysyllabic accuracy was explained. Other applications of the doubly explanatory model have been used with both cross-sectional item-level data (e.g., Goodwin, Gilbert, & Cho, 2013; Puranik, Petscher, & Lonigan, 2014; Steacy et al., 2017) and longitudinal item-level data (e.g., Kim, Petscher, & Park, 2016).
Regardless of the outcome, nature of explanatory variables, or form of the EIRM that is used, there is one salient feature that is shared across all applications – the use of the Rasch model. Recall that the Rasch model from Eq. 2 indicates that the odds of a correct item response for an individual are a function of the ability level of the individual and the difficulty level of the item. The Rasch model is a form of what is also known as a one-parameter model (1PL) in item response theory because only one item parameter is estimated (i.e., the item difficulty). The 1PL model can be extended to include other item parameters such as item discriminations (akin to item-to-total correlations), item guessing, and item dependency effects. When both an item difficulty and an item discrimination are estimated, the 1PL model becomes a 2PL model and is expressed as

$$\eta_{pi} = a_i(\theta_p - \beta_i)$$
where ai is the discrimination of item i. Item discriminations in item response theory quantify the slope of the item and describe how well performance on an item is able to separate, or discriminate, a high ability individual from a low ability individual. Large, positive values reflect informative items that discriminate well (i.e., between 0.80 and 2.00; De Ayala, 2013), small values near 0 indicate items that are uninformative and do not discriminate well between high and low ability individuals, and negative values are misinformative and suggest that those with high ability tend to do poorly on the item. The relation between the 2PL and Rasch model is that when ai = 1 (i.e., fixed at 1.0 for all items) the 2PL model reduces to Eq. 2. This inherent flexibility of moving from one type of item response model to another has implications for EIRMs such that, even as one chooses among explanatory and descriptive EIRMs, there is the potential to move from a Rasch-based EIRM to a 2PL EIRM (Rijmen & Briggs, 2004).
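A small numeric illustration of the role of ai, using hypothetical values rather than estimates from this study, is:

```r
theta <- seq(-3, 3, by = 1)   # person abilities
beta  <- 0                    # item difficulty

p_rasch <- plogis(1.0 * (theta - beta))   # a_i fixed at 1.0, as in Eq. 2
p_2pl   <- plogis(2.0 * (theta - beta))   # a_i = 2.0, a more discriminating item

round(cbind(theta, p_rasch, p_2pl), 2)
# the 2PL curve is steeper around beta, so the item separates low- and
# high-ability examinees more sharply than the Rasch (a_i = 1) curve
```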
Just as the Rasch EIRM decomposes the variance in item responses into between-person and between-item differences, the 2PL EIRM (Fox, 2010) provides specificity in the item-level decomposition by attributing it to variance in the difficulties as well as variance in the discriminations. Where EIRMs to date have been useful is in looking at differential predictors of person and item differences, such that different conclusions may be reached concerning the malleable nature of the predictors and outcome. Gilbert et al. (2011) concluded that words with complex vowel grapheme-phoneme correspondences were harder for students with poor phonemic awareness. Goodwin et al. (2013) found that derived-word frequency and root-word frequency are key contributors to word reading for adolescent readers, such that educators should be aware that students may need more learning experiences when working with low frequency words. When extended to the 2PL EIRM, it is plausible to look at how word characteristics are important not only for explaining word-level difficulty but also word-level discrimination, so that educators may better understand the types of words that may separate high ability and low ability individuals.
Without an understanding of how word features explain the discrimination of items, an incomplete picture of word differences exists. Moreover, because the Rasch version of the EIRM apportions item response variance only to difficulties, it is likely that the 2PL version provides more comprehensive and accurate information about the proportion of variance that is due to items. The current study was designed to address this limitation through a comparative analysis of Rasch and 2PL EIRMs with three research questions in mind. First, what are the differences in fit among the four Rasch EIRMs in the present sample? Second, to what extent do the doubly explanatory Rasch and doubly explanatory 2PL EIRMs provide similar decompositions of item response variance across item and person levels? Third, to what extent do item predictors explain variance in item thresholds (i.e., the approximate difficulty of the item) and item slopes (i.e., the approximate item discrimination) in the doubly explanatory Rasch and 2PL EIRMs? To answer these research questions, we use student responses on item-level data from combined word lists that included exception words (49 items), multisyllabic words (30 items), and root words (30 items). Selected item-level covariates (i.e., word frequency, number of letters, age of acquisition, decodability, and valence) and a child-level covariate of whether the student was identified as at-risk for dyslexia (i.e., < 85 standard score on the Woodcock-Johnson-III Word Identification subtest) were used in the iterative model building process. It is important to note that the primary thrust of this work is to illustrate processes of EIRMs, interpretations of coefficients, differences in model specifications, and advantages of one form over another. Consequently, our choice of sample, data, and measures was more heavily informed by aspects of statistical illustration that would be salient to the readership of the journal and have theoretical relevance, but with less attention to specific hypotheses and theory that would traditionally motivate primary research.
Method
Participants and Procedures
Participants were 173 fifth-grade students who were part of a longitudinal study (see Compton, Fuchs, & Fuchs, 2010 for details). Fifty-four percent of participants were female, 67.4% of students received free or reduced-price lunch, and 21% of students were served on an individual educational plan. Students were predominantly Black (48.3%), followed by White (37.8%), Asian (4.1%), Hispanic (2.9%), Kurdish (2.9%), Biracial (2.9%), and Other (1.1%); 6.7% were diagnosed with Attention Deficit Disorder. Students were primarily assessed in their school with a small number assessed in the second author’s research laboratory in the case that principals did not allow for in-school testing. Participants were from 45 schools and 102 classrooms.
Measures
Person measures.
Word Identification.
Word identification was measured with the Word Identification subtest (Woodcock, 1987). For this task, children were asked to read words aloud one at a time. The test was not timed, but children were encouraged to move to the next item after a 5-second silence. Correct pronunciations were scored as correct, and the total score was the sum of correct items. Basal and ceiling rules were applied. The examiner’s manual reports the split-half reliability for fifth grade students as .91 (Woodcock, 1998). Standard scores on this task were used to create a student-level independent variable for the EIRMs via a dichotomous variable indicating whether students were at-risk for dyslexia (i.e., standard score < 85) or not (standard score ≥ 85).
Exception word reading.
Orthographic processing was measured by the exception word list from Adams and Huggins (1985) and included 49 words. All words had irregular spelling-to-sound correspondences. The frequencies ranged from 0.12/million to 134.1/million. Words were presented from most to least frequent. Children were asked to read the words aloud as quickly and accurately as possible. They were given the option to say “skip” if they could not read a word. Correct pronunciations were awarded 1 point; otherwise a score of 0 was assigned. Coefficient alpha for our sample was .94. Interrater agreement for the test was 95.6% and fidelity of implementation procedures was 97.0%, based on at least 20% of the test sessions for each tester.
Multisyllabic word list.
Multisyllabic word reading was assessed with a 30-item experimenter-created list of words. All words were morphologically complex, meaning they all contained more than one unit of meaning (i.e., root word plus suffix). Data from the participants and words described in Carlisle and Katz (2006) served as a basis for the present word list; percent correct was calculated for each word as an indication of word difficulty. To ensure the word list would produce variability in student performance, we selected words that fit into a 2 × 2 matrix: opaque vs. transparent relation to a corresponding root word and regular vs. irregular body of the basic orthographic syllabic structure (BOB; Taft, 1992). Within each cell, we controlled for base word frequency and chose words with a range of difficulties. We added five words to the irregular BOB category because not enough were available on the Carlisle and Katz list. Students were presented the list of words and asked to read the words aloud one at a time. Correct pronunciations were scored 1 and incorrect pronunciations were scored 0. The internal consistency reliability (coefficient alpha) for our sample was .94. Interrater agreement for item scoring was .95. Fidelity of implementation was 97.2% based on at least 20% of the test sessions for each tester.
Root word list.
Each word on the experimenter-created multisyllabic word list (intensity) comprised a root word (intense) and a suffix (-ity) for 30 items total. In the testing session after reading the multisyllabic word list, children were asked to read each of the root words associated with the multisyllabic words. Items were presented in a single column on two pages. Each item was scored 1 for a correct pronunciation or 0 for an incorrect pronunciation. Coefficient alpha for our sample was .94. Interrater agreement for item scoring was .98. Fidelity of implementation was 98.8% based on at least 20% of the test sessions for each tester.
Word measures.
Word frequency.
Word frequency was coded for each item using the Brysbaert and New (2009) measure of the number of times the word appears in a corpus of 51 million words. Due to the large variance observed in the database, the log of word frequency was taken and used in the present study. Word frequency has been shown to explain variance in lexical-decision time (e.g., 41% in Brysbaert et al., 2011).
Number of letters.
The number of letters in the word was included as an important measure for word naming and related to lexical decisions and demasking (Ferrand et al., 2011). As the length of the word increases so does word processing time; thus, accounting for word length as a predictor of item-level accuracy is important.
Age of Acquisition.
The average age of acquisition (AoA), drawn from the Kuperman, Stadthagen-Gonzalez, and Brysbaert (2012) norms for 30,121 English content words, was included due to its unique prediction of lexical decision times beyond word frequency (Brysbaert & Cortese, 2011). Further, AoA reflects the potential order in which words are learned and thus may be related to the speed with which word representations are activated.
Decodability.
Decodability was measured with Compton, Appleton, and Hosp’s (2004) 9-level decodability scale, adapted from Menton and Hiebert (1999), which assigns a level to each word based on the linguistic difficulty of its decoding pattern. Level 1 is inclusive of vowel sounds and CV patterns; Level 2 includes CVC and VC patterns; Level 3 includes CCV, VCC[C], CC[C]VC, CVCC[C], and CC[C]VCC[C] patterns; Level 4 includes CC[C]VCe patterns; Level 5 includes C[C]VV[C][C] and VVC[C] patterns, both including vowel digraphs; Level 6 includes C[C]Vr, [C][C]VrC, [C][C]Vll, C[C]VLC, and C[C]VVLC patterns; Level 7 includes diphthongs; Level 8 includes multisyllabic words; and Level 9 includes non-decodable monosyllabic words. Interrater reliability for decodability was estimated at .98 for the words in the item pool.
Valence.
Crowd-sourced ratings of valence for 13,915 words were collected by Warriner, Kuperman, and Brysbaert (2013) to understand the sentiments associated with the way words are produced and received. Valence, or level of pleasantness, reflects the feeling evoked by a word, ranging from unhappy to happy, and is measured on a scale from 1 to 9. Higher scores reflect greater associations of the word with happiness and pleasantness, whereas lower scores indicate annoyance, dissatisfaction, boredom, and melancholy. Valence was calibrated on a sample of 303,539 observations from 1,827 individuals.
Data Analyses
A series of explanatory item response models were estimated using a combination of the lme4 package in R (Bates, Maechler, Bolker, & Walker, 2015) for the Rasch-based EIRMs and Mplus 8.1 (Muthén & Muthén, 1998–2019) for the 2PL EIRM. The lme4 package uses a maximum likelihood estimator with a default logit link, whereas Mplus uses a Bayes estimator with a probit link for its EIRM. Model fit comparisons for the unconditional doubly descriptive, person explanatory, item explanatory, and doubly explanatory models were conducted using the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), where lower values are preferred in model comparison. The deviance statistic and the log likelihood were estimated and reported to facilitate likelihood ratio tests between selected pairs of the Rasch models.
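Continuing the hedged lme4 sketch introduced earlier (hypothetical objects `m_dd`, `m_pe`, `m_ie`, and `m_de`, not the authors' code), these comparisons can be obtained as follows:

```r
# Information criteria for the four Rasch EIRMs (lower is preferred)
AIC(m_dd, m_pe, m_ie, m_de)
BIC(m_dd, m_pe, m_ie, m_de)

# Likelihood ratio test between a selected pair of nested models,
# e.g., doubly descriptive vs. person explanatory
anova(m_dd, m_pe)
```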
Following these unconditional model comparisons, four conditional models were specified: 1) the impact of a student covariate via a dummy code for dyslexia on the odds of word-level accuracy (i.e., Person Predictor model); 2) the impact of item covariates (i.e., the log-frequency of the word, number of letters, age of acquisition, decodability, and valence) on the odds of word-level accuracy (i.e., Item Predictor model); 3) the impact of both student and item covariates on the odds (i.e., Person + Item Predictor model); and 4) the effect of person by item interactions on the odds of accuracy (i.e., Person X Item model). Of primary interest in these initial comparisons were the magnitude and directions of the coefficients, the decomposition of variance according to the random effect clusters, and the proportion of variance explained across the various models by the covariates. A variance decomposition index (VDI) was calculated for person- and item-effects for each of the Rasch and 2PL unconditional models. In many instances, a VDI is equivalent to an intraclass correlation (ICC) that computes the correlation among responses due to multiple cluster units (e.g., students nested within classrooms). EIRMs introduce mild complexities into the ICC such that the person and item correlations are adjusted for scale units (i.e., σ²person/(σ²person + σ²item + σ²scale) and σ²item/(σ²person + σ²item + σ²scale)), where the scale variance in the logit model = π²/3 and the scale variance in the probit model = 1.0 (Rodriguez & Elo, 2003). The 2PL unconditional random effects model includes a random slope to represent item discriminations, which presents an interpretative challenge when using an ICC. A VDI was instead used in our EIRMs to provide an index of the proportion of variance due to the respective random effects and was computed as VDIperson = σ²person/(σ²person + σ²item) and VDIitem = σ²item/(σ²person + σ²item). The total item variance in the Rasch model is just the threshold variance, and the total item variance in the 2PL model is the sum of the threshold and loading variances.
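A sketch of the VDI computation, assuming the unconditional lme4 model `m_dd` from the earlier sketch with grouping factors named `person` and `item`, is:

```r
vc <- as.data.frame(VarCorr(m_dd))          # random-effect variance components
var_person <- vc$vcov[vc$grp == "person"]
var_item   <- vc$vcov[vc$grp == "item"]

# VDI: proportion of the person + item random-effect variance due to each source
vdi_person <- var_person / (var_person + var_item)
vdi_item   <- var_item   / (var_person + var_item)

# For an ICC-type index, the scale variance would also enter the denominator:
# pi^2 / 3 for the logit link, 1.0 for the probit link
```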
The final set of EIRMs were estimated in the context of a doubly explanatory 2PL model where item covariates served as explanatory variables for both the item thresholds and the item slopes. We opted to only use the doubly explanatory 2PL model instead of robust testing across the doubly descriptive, person explanatory, item explanatory and doubly explanatory models due to the lack of information criteria that are presently available in Mplus software. The Deviance information criterion (DIC) is available for single-level models using Bayes estimator but is not presently provided for models with multiple random effects such as the EIRM. In this way, the 2PL will serve as an illustration of a new frontier of possible EIRMs with the caveat noted here that robust testing would ultimately include some version of comparisons among the EIRMs with regards to the person- and item-level random effects.
In our presentation of the results, it is important to be mindful of the difference between the result of the hypothesis test and the practical importance of the coefficient value. A psychometric perspective might suggest that hypothesis tests hold less intuitive value for item thresholds and person ability because they merely represent whether the difficulty of the item or ability of the person is different from zero. Psychometric models estimate item parameters that range in their difficulty from approximately −3.0 (easy) to 3.0 (hard) and person ability estimates that range from approximately −3.0 (low ability) to 3.0 (high ability). As such, whether values demonstrate statistical significance may be less relevant at the individual item and person level than the magnitude and direction of the estimate itself. A random effects perspective holds that there is inherent value in hypothesis testing to guide prediction and understanding in a given regression analysis, where rejecting or failing to reject the null hypothesis lets the researcher say something about whether what has been estimated is due to chance. The tension between the traditions of psychometric and random effects modeling in the EIRM means that hypothesis testing for thresholds may have less value for understanding; yet, because of the blending of these traditions, we opt to use the statistical testing to guide our explication of results.
Results
Descriptive Statistics and Preliminary Analyses
A preliminary review of the data showed that no data was missing for the word-level variables, nor was there missing data for the child measures. Descriptive statistics and correlations are reported in Table 1 for word-level attributes. The mean proportion correct across all items was 60.47% (SD = 27.48) indicating that items tended to be easier for the sample. The average age of acquisition for the words in this sample was 7.45 (SD = 2.17) and corresponded to the easier nature of the items given that students in this sample were in grade 5. The mean valence of 5.52 suggested that individuals tended to feel neutral about the words, but the range pointed to some words eliciting more positive emotions (e.g., heavenly = 7.89, freedom = 7.72) and others eliciting negative emotions (e.g., drought = 2.79, confusion = 3.32). Averaged decodability of items was 7.01 (SD = 1.88) such that items tended to have more complex decodability patterns.
Table 1.
Word-level descriptive statistics and correlations.
| Measure | Percent Correct | Log Freq. | N-letters | AoA | Valence | DEC |
|---|---|---|---|---|---|---|
| Percent Correct | 1.00 | | | | | |
| Log Frequency | .35** | 1.00 | | | | |
| N-letters | −.05 | −.48** | 1.00 | | | |
| AoA | −.36* | −.65** | .46** | 1.00 | | |
| Valence | .06 | .21* | .11 | −.22* | 1.00 | |
| DEC | −.04 | −.44** | .55** | .30** | .08 | 1.00 |
| Mean | 60.47 | 2.83 | 5.97 | 7.45 | 5.52 | 7.01 |
| SD | 27.48 | 0.79 | 1.74 | 2.17 | 1.25 | 1.88 |
| Min | 6 | 1.30 | 3 | 3.11 | 2.79 | 2 |
| Max | 100 | 2.83 | 10 | 12.95 | 7.89 | 9 |
Note. N-letters = number of letters, AoA = Age of Acquisition, DEC = decodability.
**p < .01, *p < .05.
Correlations ranged from −.65 (p < .01) between AoA and log frequency to .55 (p < .01) between decodability and the length of the word (i.e., N-letters). Only two of the word-level variables correlated with the average percent correct. Log frequency was positively correlated with accuracy (r = .35, p < .01), indicating that the more frequent the word in the corpus, the more likely an individual was to know the word. AoA was negatively correlated with accuracy (r = −.36, p < .01), with a lower age associated with higher accuracy. The correlations among all three word reading tasks were approximately .90 (i.e., r = .903 between exception and root words, r = .899 between exception and multisyllabic words, and r = .898 between root and multisyllabic words).
EIRM Comparisons
As a preliminary analysis of the data, we specified a unidimensional, confirmatory factor analysis model of student performance across the items in the three word reading tasks using a weighted least squares multivariate estimator to appropriately treat the categorical nature of the items. The comparative fit and Tucker-Lewis indexes (CFI and TLI) were used along with the root mean square error of approximation (RMSEA) to judge model fit. Values of at least .95 for the CFI and TLI are suggestive of acceptable model fit, as are RMSEA values < .05. Results indicated that the one-factor, 2PL model provided reasonable fit to the data, χ2(8384) = 9441.26, CFI = .95, TLI = .95, RMSEA = .022 (90% Confidence Interval = .020, .025), and fit better than a one-factor, 1PL model, χ2(8515) = 38171.19, Δχ2 = 28,729.93, Δdf = 131, p < .001. Additionally, a non-parametric, confirmatory factor analysis using DETECT (Zhang, 2007) evaluated the essential unidimensionality of scores from the items when considering the possibility of additional factors. That is, it was plausible that the three task-types (i.e., exception word reading, multisyllabic word reading, and root word reading) formed individual factors. DETECT values range from 0–1.00, with scores < .20 reflecting a unidimensional structure, .20–.40 indicating weak multidimensionality, and scores > .40 reflecting at least moderate multidimensionality. Results produced a DETECT value of .15, indicating unidimensionality of scores.
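The authors fit this categorical CFA in Mplus; as one hedged illustration of a comparable check in R, the lavaan package could be used as follows (a sketch assuming a hypothetical wide-format data frame `wide` whose columns are the 0/1 item responses):

```r
library(lavaan)

# one factor loading on every item column; declaring the items as ordered
# invokes lavaan's robust weighted least squares (WLSMV) estimation
model <- paste("reading =~", paste(names(wide), collapse = " + "))
fit <- cfa(model, data = wide, ordered = names(wide), std.lv = TRUE)

fitMeasures(fit, c("cfi", "tli", "rmsea", "rmsea.ci.lower", "rmsea.ci.upper"))
```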
Research Question 1: Rasch EIRM comparison results.
Table 2 provides the goodness of fit statistics via the AIC, BIC, deviance, and log likelihood. The doubly descriptive model had relatively worse fit based on the AIC (30072) and BIC (30080) compared to the person explanatory model (AIC = 27546, BIC = 27562; ΔAIC = 2526, ΔBIC = 2518), where AIC and BIC differences of at least 5 are suggestive of practically important model differences (Raftery, 1995). The likelihood ratio test (LRT) showed a significant difference between the two models in favor of the person explanatory model, χ2(1) = 2528, p < .001. The item explanatory model provided improved fit compared to the person explanatory model, χ2(1) = 4208, p < .001, and the doubly explanatory model fit better than the item explanatory model, χ2(1) = 4084, p < .001. The totality of these results was that the doubly explanatory model provided the best fit to the data, with both word-level and person-level effects treated as random.
Table 2.
Model fit comparisons among the Rasch explanatory item response models (EIRM).
| Likelihood Ratio Test | |||||||
|---|---|---|---|---|---|---|---|
| Model | Deviance | Log likelihood | AIC | BIC | χ2 | df | p-value |
| Doubly Descriptive | 30070 | −15035 | 30072 | 30080 | |||
| Person Explanatory | 27542 | −13771 | 27546 | 27562 | 2528a | 1 | <.001 |
| Item Explanatory | 23333 | −11667 | 23337 | 23353 | 6737b | 1 | <.001 |
| Doubly Explanatory | 19250 | −9626 | 19256 | 19280 | 4083c | 1 | <.001 |
Note.
a Test statistic comparing the person explanatory model to the doubly descriptive model; b test statistic comparing the item explanatory model to the doubly descriptive model; c test statistic comparing the doubly explanatory model to the item explanatory model.
Research Question 2: Unconditional Rasch and 2PL model results.
Unconditional model results (i.e., no predictors) are reported in Table 3. The intercept value of 0.62 in the Rasch model is the log-odds of word accuracy and corresponds to an estimated probability value of .65; note that this value approximates the average percent correct observed in Table 1 (60.47%). When the same model is estimated using the 2PL model in Mplus, the mean intercept was 0.62, identical to the Rasch estimate. The mean threshold value in the 2PL model was 0.17 (p = .330), indicating that items had a mean threshold, or approximate difficulty value, that was “average”. The mean slope value, which approximates item discrimination, was 1.07 (p < .001).
Table 3.
Unconditional model results for Rasch and 2PL doubly explanatory item response models.
| Model | Variable | Estimate | SE (PSD) | p | VDI |
|---|---|---|---|---|---|
| Rasch | Intercept | 0.62 | 0.20 | 0.002 | |
| | Word Threshold Variance | 3.65 | - | - | 0.66 |
| | Person Variance | 1.89 | - | - | 0.34 |
| | Scale Variance | 3.29 | - | - | - |
| 2PL | Intercept | 0.62 | (0.15) | <.001 | |
| | Threshold | 0.17 | (0.17) | .330 | |
| | Slope | 1.07 | (0.89) | <.001 | |
| | Word Threshold Variance | 1.49 | 0.21 | <.001 | 0.49 |
| | Word Slope Variance | 0.53 | 0.10 | <.001 | 0.18 |
| | Person Variance | 1.00 | - | - | 0.33 |
| | Scale Variance | 1.00 | - | - | - |
Note. VDI = variance decomposition index. Scale variance for the lme4 model with the logit link = π²/3; scale variance for the Mplus model with the probit link = 1.0. PSD = the posterior standard deviation that is estimated for the coefficients in the 2PL Bayes models.
An important distinction between the unconditional model results is that the Rasch specification in lme4 provides a single coefficient to capture the magnitude and direction of both the item-level threshold and the person-level log-odds ability (i.e., Eq. 2 shows that the item threshold is interpreted as the negative of the person ability and the person ability is interpreted as the negative of the item threshold). The 2PL model configuration provides separate estimates for thresholds and intercepts. The reason that the 2PL model produces separate estimates for intercepts, thresholds, and slopes is that the intercepts are represented in the 2PL model as a latent variable; thus, there are both person-level estimates and item-level estimates in the same model. A further consideration is that the estimates from Mplus were probit; thus, to provide interpretative parity between the fixed effects in the Rasch and 2PL models, a conversion factor was needed such that multiplying the probit estimates by a factor of 1.6 produces a logit-scale value. Applying this principle to the 2PL threshold value of 0.17 results in a reparameterized, average item log-odds of 0.27 (i.e., 0.17 * 1.6 = 0.27) and an average person log-odds of 0.99 (i.e., 0.62 * 1.6 = 0.99). Substituting the average person-level coefficient (0.99) and item-level coefficient (0.27) into Eq. 1, with the difference θp − βi multiplied by the 1.07 loading, yields an average estimated probability of .65 for item accuracy, identical to the Rasch-based estimated probability.
The variance decomposition indexes (VDI) from the unconditional models are reported in Table 3. The Rasch model VDIs showed that 66% of the variance was due to word differences and 34% was due to person differences. VDIs in the 2PL model showed that the proportion of variance due to person differences (i.e., 33%) approximated the between-person differences in the Rasch model (i.e., 34%). The word-level slope VDI in the 2PL model was 18%, and the word-level threshold VDI in the 2PL model (49%) was lower than the Rasch model threshold VDI (i.e., 66%). Note here that even though the threshold VDI was smaller in the 2PL model compared to the Rasch model, the combined VDI for thresholds and slopes in the 2PL model (67%) was approximately equal to the word variance in the Rasch model (66%). The difference in threshold-specific variance and the approximately equal total word variances between the Rasch and 2PL models highlight the potential advantage of the 2PL model in providing further context for the word variance that is due to the difficulty and the discrimination of the items.
Research Question 3: Relation of person and item predictors.
Rasch model.
Table 4 presents the results for the four conditional, doubly explanatory EIRMs. The unconditional model, as reported in Table 3, is provided for relative comparison when including the person and item predictors. The person predictor model included the dummy-coded covariate of whether students were classified as dyslexic and tested the extent to which students with dyslexia differed in their odds of word-level accuracy compared to typically performing students. The intercept value in the person predictor model was 1.02 and represents the mean log-odds of word-level accuracy for typically developing students. When reparameterized to a predicted probability of .73, the finding indicates that typically developing students had an average 73% chance of a correct response to any item. Conversely, the log-odds of a correct response for students classified as dyslexic was 2.91 units lower than for typically developing students (p < .001), and the reparameterized, fitted log-odds was −1.89 (i.e., 1.02 − 2.91 = −1.89), which corresponded to a .13 probability of success. As such, students with dyslexia had significantly lower odds of success compared to typically developing students. Variance components showed a reduction of variance at the person level from 1.89 in the unconditional model to 0.85 in the person predictor model, indicating that the inclusion of the dyslexia indicator resulted in 55% of the person variance explained [i.e., (1.89 − 0.85) / 1.89 = .550].
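The arithmetic behind these interpretations can be checked directly in R with the values reported in Table 4:

```r
plogis(1.02)          # ~.73 probability of a correct response for typically developing students
plogis(1.02 - 2.91)   # ~.13 probability for students classified as dyslexic
(1.89 - 0.85) / 1.89  # ~.55 of the person variance explained by the dyslexia indicator
```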
Table 4.
Rasch conditional EIRM results.
| Model | Variable | Estimate | SE | p |
|---|---|---|---|---|
| Unconditional | Intercept | 0.62 | 0.20 | 0.002 |
| | Word Variance | 3.65 | - | - |
| | Person Variance | 1.89 | | |
| | Scale Variance | 3.29 | - | - |
| Person Predictor | Intercept/Threshold | 1.02 | 0.19 | <.001 |
| | Dyslexia | −2.91 | 0.21 | <.001 |
| | Word Variance | 3.65 | - | - |
| | Person Variance | 0.85 | | |
| | Scale Variance | 3.29 | - | - |
| Item Predictors | Intercept/Threshold | 0.98 | 0.19 | <.001 |
| | Log-Frequency | 0.33 | 0.24 | 0.097 |
| | N-letters | 0.21 | 0.13 | 0.103 |
| | AoA | −0.30 | 0.11 | 0.006 |
| | DEC | 0.04 | 0.11 | 0.746 |
| | Valence | −0.13 | 0.15 | 0.389 |
| | Word Variance | 3.02 | - | - |
| | Person Variance | 1.89 | | |
| | Scale Variance | 3.29 | - | - |
| Person + Item Predictors | Intercept/Threshold | 0.98 | 0.19 | <.001 |
| | Log-Frequency | 0.33 | 0.14 | 0.017 |
| | N-letters | 0.21 | 0.13 | 0.103 |
| | AoA | −0.30 | 0.11 | 0.006 |
| | DEC | 0.04 | 0.11 | 0.746 |
| | Valence | −0.13 | 0.15 | 0.389 |
| | Dyslexia | −2.93 | 0.22 | <.001 |
| | Word Variance | 3.02 | - | - |
| | Person Variance | 0.85 | | |
| | Scale Variance | 3.29 | - | - |
| Person x Item Predictors | Intercept/Threshold | 0.96 | 0.18 | <.001 |
| | Log-Frequency | 0.33 | 0.14 | 0.016 |
| | N-letters | 0.23 | 0.13 | 0.073 |
| | AoA | −0.30 | 0.11 | 0.006 |
| | DEC | 0.08 | 0.11 | 0.466 |
| | Valence | −0.18 | 0.14 | 0.209 |
| | Dyslexia | −2.98 | 0.22 | <.001 |
| | Dyslexia*Log F | −0.03 | 0.06 | 0.641 |
| | Dyslexia*N-letters | −0.22 | 0.06 | <.001 |
| | Dyslexia*AoA | −0.01 | 0.05 | 0.830 |
| | Dyslexia*DEC | −0.28 | 0.05 | <.001 |
| | Dyslexia*Valence | 0.45 | 0.06 | <.001 |
| | Word Variance | 2.95 | - | - |
| | Person Variance | 0.89 | - | - |
| | Scale Variance | 3.29 | | |
Note. N-letters = number of letters, AoA = Age of Acquisition, Log Freq. = Log frequency, DEC = decodability.
The Item Predictors model (Table 4) showed that age of acquisition uniquely predicted between-item variance. The average log-odds of a correct response was 0.98 (predicted probability = .73; p < .001), conditional on the grand-mean centered item predictors in the model. As the age of acquisition (AoA) increased, the person log-odds of a correct response decreased by 0.30 units (p = .006). To interpret the item-predictor model in terms of item difficulty differences requires the application of the –βi to the intercept and all predictors in the model, such that the mean item difficulty is −0.98 (i.e., an easy item), and as the age of acquisition increases by one unit the item difficulty increases by 0.30 units (i.e., the item becomes harder). No other item-level predictors were statistically significant. Variance components showed a reduction of variance at the item level from 3.65 in the unconditional model to 3.02 in the item predictor model, indicating that the inclusion of the item predictors resulted in 17.3% of the item variance explained [i.e., (3.65 − 3.02) / 3.65 = .173].
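As with the person predictor model, the item-side interpretation can be verified with the values in Table 4:

```r
plogis(0.98)           # ~.73 average probability of a correct response
-0.98                  # mean item difficulty under the -beta_i reading of Eq. 2 (an easy item)
(3.65 - 3.02) / 3.65   # ~.17 of the item variance explained by the item predictors
```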
The Person + Item Predictors model (Table 4) yielded a similar pattern of results, where both AoA and dyslexia were significant predictors of the odds of success, as was word frequency, and the variance components of this model reflected what was observed in the separate Person and Item Predictor models. Interactions between the dyslexia indicator and item predictors in the Person x Item model (Table 4) showed that the relation between word-level accuracy and whether the person was classified as dyslexic or not was moderated by item characteristics. The number of letters, decodability, and valence of the word were all significant moderators of word accuracy, and the main effect of word frequency remained significant. The inclusion of interaction terms resulted in a slight reduction in word variance (i.e., 3.02 to 2.95) and an increase in the person variance (0.85 to 0.89).
2PL model.
Person and item predictors were included in a conditional 2PL random effects model (Table 5) to evaluate whether person- and item-level predictors explained item-level performance differences between students, differences among items in their threshold values, and differences among items in their slope values. The inclusion of the person and item predictors yielded new insights in item-level predictions. In the Person + Item Predictors model, the person-level predictors were consistent with the Rasch EIRM models such that students classified as dyslexic had odds of word-level accuracy 2.86 units lower than typically developing students (i.e., intercept = 1.86, p < .001). AoA was a significant predictor of item thresholds, as was observed in the Rasch EIRM (Table 4). The prediction of item slopes showed that both decodability (0.09, p = .004) and valence (−0.11, p = .010) uniquely explained variance in item discriminations. As decodability values increased (i.e., as the decodability rating increased in complexity), the items increased in their ability to separate high and low ability individuals. Additionally, as the valence of the item decreased, the items increased in their ability to discriminate between high and low ability individuals (i.e., items that elicited more negative emotions discriminated better).
Table 5.
2PL conditional EIRM results.
| | | Person Prediction | | | Threshold Prediction | | | Slope Prediction | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Model | Variable | Estimate | SE | p | Estimate | SE | p | Estimate | SE | p |
| Person + Item Predictors | Threshold/Intercept | 1.86 | 0.22 | <.001 | −1.39 | 2.08 | 0.526 | 4.49 | 1.18 | 0.018 |
| | Log Freq. | - | - | - | 0.35 | 0.36 | 0.35 | 0.65 | 0.55 | 0.238 |
| | N-letters | - | - | - | 0.07 | 0.08 | 0.392 | 0.03 | 0.04 | 0.386 |
| | AoA | - | - | - | −0.30 | 0.06 | <.001 | 0.02 | 0.03 | 0.458 |
| | DEC | - | - | - | −0.12 | 0.07 | 0.072 | 0.09 | 0.03 | 0.004 |
| | Valence | - | - | - | 0.12 | 0.09 | 0.18 | −0.11 | 0.04 | 0.010 |
| | Dyslexia | −2.86 | 0.3 | <.001 | - | - | - | - | - | - |
| | Word Threshold Variance | 1.04 | 0.18 | <.001 | - | - | - | - | - | - |
| | Word Slope Variance | 0.24 | 0.18 | <.001 | - | - | - | - | - | - |
| | Person Variance | 1.00 | - | - | - | - | - | - | - | - |
| | Scale Variance | 1.00 | - | - | - | - | - | - | - | - |
| Person x Item Predictors | Intercept | 1.76 | 0.25 | <.001 | −1.45 | 2.09 | 0.486 | 4.24 | 3.02 | 0.153 |
| | Log Freq. | - | - | - | 0.35 | 0.36 | 0.348 | 0.60 | 0.53 | 0.276 |
| | N-letters | - | - | - | 0.09 | 0.08 | 0.244 | 0.02 | 0.04 | 0.571 |
| | AoA | - | - | - | −0.32 | 0.06 | <.001 | 0.03 | 0.03 | 0.274 |
| | DEC | - | - | - | −0.11 | 0.07 | 0.1 | 0.09 | 0.03 | 0.006 |
| | Valence | - | - | - | 0.11 | 0.09 | 0.212 | −0.10 | 0.04 | 0.010 |
| | Dyslexia | −3.02 | 0.31 | <.001 | - | - | - | - | - | - |
| | Dyslexia*Log Freq. | −0.02 | 0.02 | 0.204 | - | - | - | - | - | - |
| | Dyslexia*N-letters | −0.08 | 0.04 | 0.094 | - | - | - | - | - | - |
| | Dyslexia*AoA | 0.06 | 0.03 | 0.100 | - | - | - | - | - | - |
| | Dyslexia*DEC | 0.01 | 0.04 | 0.722 | - | - | - | - | - | - |
| | Dyslexia*Valence | −0.01 | 0.05 | 0.802 | - | - | - | - | - | - |
| | Word Threshold Variance | 1.04 | 0.18 | <.001 | - | - | - | - | - | - |
| | Word Slope Variance | 0.24 | 0.05 | <.001 | - | - | - | - | - | - |
| | Person Variance | 1.00 | - | - | - | - | - | - | - | - |
| | Scale Variance | 1.00 | - | - | - | - | - | - | - | - |
Note. N-letters = number of letters, AoA = Age of Acquisition, Log Freq. = Log frequency, DEC = decodability.
The Person x Item model in Table 5 reports main effect coefficients for the person- and word-level predictors with respect to person-level differences, item-threshold differences, and item-slope differences, just as in the Person + Item model. It is important to note that in the Person x Item Predictors model, the interaction terms are reported in the person prediction portion of the table, but the estimated coefficients can be used to produce fitted estimates at both the person level and the item level. That is, if one wanted to compute estimated fitted log-odds at the person level, the intercept value (i.e., 1.76) and the associated main effect and interaction coefficients for the person- and item-predictors could be used to produce such values. Similarly, if it were of interest to compute fitted threshold values at the item level, the threshold (i.e., −1.45) and the associated main effect and interaction coefficients could be used to produce such values. The interactions in the 2PL model cannot be used in the prediction of differences among item slopes, as this requires a latent variable interaction that is complex and time-intensive to estimate given the current capabilities of the software. Results from the interaction model in the 2PL showed that the three significant interactions in the Rasch model (Table 4) were not significant in the 2PL model [i.e., dyslexia x word length (−0.08, p = .094), dyslexia x decodability (0.01, p = .722), and dyslexia x valence (−0.01, p = .802)]. An important finding as well is that word frequency was not significant in the 2PL EIRM, unlike in the Rasch EIRM. The inclusion of the person and word-level predictors resulted in 30% of the item threshold variance and 55% of the item slope (or discrimination) variance explained.
Discussion
A significant body of research in the last decade has leaned into the explanatory item response model (EIRM) as a tool to illuminate the confluence of how person-level and item-level covariates can explain person-level and item-level differences. Many researchers have used the EIRM to explore word reading differences in young children (e.g., Steacy et al., 2017; Gilbert et al., 2014; Kearns, 2016) and all of these studies have provided salient recommendations to the scientific community regarding what is malleable about word reading based on respective level characteristics and how instruction and intervention might be formed or reformed in light of the evidence. The purpose of the present study was to revisit the root structures of EIRMs, given their wide utility, and test the extent to which new advances in the models can further inform and enhance our use of these models.
One of the more noteworthy findings from this study was that expanding from a Rasch-based doubly explanatory model to a 2PL doubly explanatory model resulted in less error variance being estimated in the unconditional model. The comparison of the Rasch to the 2PL model showed that the VDI for total word variance was approximately equal between the two models, but that the 2PL model further decomposed that variance into difficulty and discrimination differences among items. An implication of this finding is that the conventional approaches to EIRM via Rasch modeling, though not incorrect, potentially miss an opportunity to further understand and explain variance due to another parameter that may be estimated in the item response model. A second noteworthy finding was that the nature of the explanatory mechanisms changed. For example, decodability and valence were not significantly related to item-level variance in the Rasch model; however, they were found to be significant predictors of item discriminations. What this differential finding demonstrates is that the totality of what makes an item predictor meaningful may depend on what portion of the item variance it is predicting. Even as valence and decodability may not be important in understanding why items differ in their difficulty, they mattered in this sample of data and items for understanding why items differ in their discrimination of high and low ability individuals.
Suggestions for Best Practices in Using EIRMs
As researchers continue to embrace EIRMs in their work, we believe there is value in offering suggestions to the field as to best practices. Considering that EIRMs blend psychometric and random effects models, there are three specific considerations for best practices in their application that are attentive to particular aspects of each analytic framework. First, because EIRMs are psychometric in nature, it is important to consider the assumptions of item response theory (IRT) prior to model testing. Unidimensionality and local item independence of item responses are core tenets of IRT. Preliminary testing and reporting of EIRM results should include attention to the essential unidimensionality (Stout, 1990) of scores in the sample and the reporting of fit indexes according to the test of dimensionality conducted (e.g., CFI, TLI, and RMSEA in a confirmatory, parametric analysis; the DETECT index in a confirmatory, nonparametric analysis). We recognize that sample sizes at the person level in applications of the EIRM may be smaller than what might be conventionally recommended for factor analysis. In such circumstances, we recommend that scientists cover this as part of their reporting and note that results in an EIRM may be confounded by multidimensionality that was not tested in an initial factor analysis of items.
A second suggestion is that researchers should consider including model comparison tests among variants of the EIRM in justifying their selection of the doubly explanatory model. A cursory review of published EIRMs, including those by the authors of this manuscript, would reveal that rarely, if ever, do individuals compare EIRMs; instead, they focus on the doubly explanatory model. Both random effects modeling and measurement modeling include model comparison steps to evaluate the appropriateness of a selected model for parsimony and usefulness. Deviance statistics are used in random effects models to guide whether a cluster-level random effect should be fixed or estimated. Incremental fit statistics are used in measurement models to assist in judging whether a one-factor model fits better or worse compared to alternative specifications (e.g., a correlated-traits model). We therefore recommend that, as doubly explanatory models are used for individual differences testing, they be accompanied by comparisons to the other presented EIRMs.
Our last suggestion is that scientists, including us, should be specifically thoughtful about the presentation of random effects as ICCs or VDIs. Should the former be used, it is helpful to remember that the scale variance is fixed at π²/3 (i.e., ~3.29) in logit models and 1.0 in probit models and to use the appropriate computations (e.g., Cho & Rabe-Hesketh, 2011). Conventional software programs frequently omit the scale variance from model results; thus, it behooves analysts and authors to include this value when reporting both variance components and ICCs.
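As one hedged sketch of this reporting suggestion, an ICC that includes the scale variance could be computed from an lme4 logit model (a hypothetical fitted object `m` with grouping factors `person` and `item`) as:

```r
vc <- as.data.frame(VarCorr(m))
var_person <- vc$vcov[vc$grp == "person"]
var_item   <- vc$vcov[vc$grp == "item"]
scale_var  <- pi^2 / 3   # ~3.29 for the logit link; use 1.0 for a probit link

icc_person <- var_person / (var_person + var_item + scale_var)
icc_item   <- var_item   / (var_person + var_item + scale_var)
```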
Limitations
The present study is limited by several factors pertaining to both child and item considerations. The sample of students was predominantly from higher poverty areas and predominantly minority; thus, future research is needed to evaluate the extent to which the findings in the present study replicate with other diverse samples. The sample of items in this study was a convenience sample of item pools from a larger study and was not inclusive of the totality of representative word reading item-types, which could lead to different explanatory variables relating to item difficulty and discrimination in different ways. As well, the types of item-level predictors were chosen based on a sampling of varied phonological, orthographic, and semantic features, and the inclusion of other or different sets of predictors may yield different results and conclusions. Moreover, the distribution of decodability showed the presence of a minor ceiling effect that likely has bearing on its coefficient strength in the study. It is also important to note that much of the published EIRM work in literacy research uses packages or software that leverage the mixed effect model framework (e.g., the lme4 package in R, proc glimmix in SAS). Variation across packages often introduces differences in the types of estimators, optimizers, integration points, and parameterizations that can masquerade as substantively interesting differences. Future research can use both Monte Carlo simulations and applied studies to provide clarity to the field as to what types of results manifest as meaningful differences in theory versus choice of estimation in software.
Conclusions
In conclusion, this study has shown not only that multiple types of EIRMs can be estimated with descriptive or explanatory mechanisms, but also that EIRMs are robust to the inclusion of multiple item-level parameters. Moving from Rasch to 2PL EIRMs holds great opportunity for further studying individual differences in item-level performance and for identifying the features of words that may lead to better instructional recommendations.
Footnotes
Wilson et al. (2008) contend that these are Rasch models by virtue of the following. The expression $\eta_{pi} = \theta_p - \beta_i$ can be rewritten in terms of the odds, $\pi_{pi}/(1 - \pi_{pi})$, which are obtained by exponentiating the first equation: $\exp(\eta_{pi}) = \exp(\theta_p - \beta_i)$. Substituting the exponentiated form into the odds expression gives $\exp(\theta_p)/\exp(\beta_i)$, which is itself the exponential form of the Rasch model. It then follows that $\pi_{pi} = \exp(\theta_p - \beta_i)/[1 + \exp(\theta_p - \beta_i)]$, which is the probabilistic form of the Rasch model.
References
- Adams MJ, & Huggins AWF (1985). The growth of children’s sight vocabulary: A quick test with educational and theoretical implications. Reading Research Quarterly, 20(3), 262–281. [Google Scholar]
- Baghaei P, & Ravand H (2015). A cognitive processing model of reading comprehension in English as a foreign language using the linear logistic test model. Learning and Individual Differences, 43, 100–105. [Google Scholar]
- Bates D, Maechler M, Bolker B, & Walker S (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. [Google Scholar]
- Bowles RP, Skibbe LE, & Justice LM (2011). Analysis of letter name knowledge using Rasch measurement. Journal of Applied Measurement, 12(4), 387–398. [PubMed] [Google Scholar]
- Box GEP; Draper NR (1987), Empirical model-building and response surfaces. John Wiley & Sons. [Google Scholar]
- Brysbaert M, & Cortese MJ (2011). Do the effects of subjective frequency and age of acquisition survive better word frequency norms? Quarterly Journal of Experimental Psychology, 64, 545–559. [DOI] [PubMed] [Google Scholar]
- Brysbaert M, & New B (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. [DOI] [PubMed] [Google Scholar]
- Brysbaert M, Buchmeier M, Conrad M, Jacobs AM, Bölte J, & Böhl A (2011). The word frequency effect. Experimental Psychology. [DOI] [PubMed] [Google Scholar]
- Carlisle JF, Katz LA (2006). Effects of word and morpheme familiarity on reading of derived words. Reading and Writing, 19, 669–693. [Google Scholar]
- Chaves-Sousa S, Santos S, Viana FL, Vale AP, Cadime L, Prieto G, & Ribeiro I (2017). Development of a word reading test: identifying students at-risk for reading problems. Learning and Individual Differences, 56, 159–166. [Google Scholar]
- Cho SJ, & Rabe-Hesketh S (2011). Alternating imputation posterior estimation of models with crossed random effects. Computational Statistics & Data Analysis, 55(1), 12–25. [Google Scholar]
- Compton DL, Appleton AC, & Hosp MK (2004). Exploring the relationship between text-leveling systems and reading accuracy and fluency in second-grade students who are average and poor decoders. Learning Disabilities Research & Practice, 19(3), 176–184. [Google Scholar]
- De Ayala RJ (2013). The theory and practice of item response theory. Guilford Publications. [Google Scholar]
- De Boeck P, & Wilson M (2004). Explanatory item response models. Springer. [Google Scholar]
- Ferrand L, Brysbaert M, Keuleers E, New B, Bonin P, Méot A, & Pallier C (2011). Comparing word processing times in naming, lexical decision, and progressive demasking: Evidence from Chronolex. Frontiers in Psychology, 2, 1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fox JP (2010). Bayesian item response modeling: Theory and applications. Springer Science & Business Media. [Google Scholar]
- Freitas S, Prieto G, Simoes MR, Nogueira J, Santana F, Martins C, & Alves L (2018). Using the Rasch analysis for the psychometric validation of the Irregular Word Reading Test (TeLPI): A Portuguese test for the assessment of premorbid intelligence. The Clinical Neuropsychologist, 32(supl), 60–76. [DOI] [PubMed] [Google Scholar]
- Garlock VM, Walley AC, & Metsala JL (2001). Age-of-acquisition, word frequency, and neighborhood density effects on spoken word recognition by children and adults. Journal of Memory and Language, 45(3), 468–492. [Google Scholar]
- Gilbert JK, Compton DL, & Kearns DM (2011). Word and person effects on decoding accuracy: A new look at an old question. Journal of Educational Psychology, 103(2), 489. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Goodwin AP, Gilbert JK, & Cho SJ (2013). Morphological contributions to adolescent word reading: An item response approach. Reading Research Quarterly, 48(1), 39–60. [Google Scholar]
- Kearns DM (2015). How elementary-age children read polysyllabic polymorphemic words. Journal of Educational Psychology, 107(2), 364. [Google Scholar]
- Kim YS, Petscher Y, Foorman BR, & Zhou C (2010). The contributions of phonological awareness and letter-name knowledge to letter-sound acquisition—a cross-classified multilevel model approach. Journal of Educational Psychology, 102(2), 313. [Google Scholar]
- Kim YSG, Petscher Y, & Park Y (2016). Examining word factors and child factors for acquisition of conditional sound-spelling consistencies: A longitudinal study. Scientific Studies of Reading, 20(4), 265–282. [Google Scholar]
- Kuperman V, Stadthagen-Gonzalez H, & Brysbaert M (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978–990. [DOI] [PubMed] [Google Scholar]
- McCulloch CE, & Searle SR (2001). Generalized, linear, and mixed models (Wiley Series in Probability and Statistics). New York, NY: Wiley. [Google Scholar]
- Medina-Díaz M (1993). Analysis of cognitive structure using the linear logistic test model and quadratic assignment. Applied Psychological Measurement, 17(2), 117–130. [Google Scholar]
- Menton S, & Hiebert EH (1999). Literature Anthologies: The Task for First-Grade Readers. [Google Scholar]
- Muthén LK, & Muthén BO (1998–2019). Mplus User's Guide (8th ed.). Los Angeles, CA: Muthén & Muthén. [Google Scholar]
- Perfetti CA, & Hart L (2002). The lexical quality hypothesis. In Verhoeven L, Elbro C, & Reitsma P (Eds.), Precursors of functional literacy (pp. 189–213). Amsterdam: John Benjamins. [Google Scholar]
- Piasta SB, Groom LJ, Khan KS, Skibbe LE, & Bowles RP (2018). Young children’s narrative skill: concurrent and predictive associations with emergent literacy and early word reading skills. Reading and Writing, 31(7), 1479–1498. [Google Scholar]
- Puranik CS, Petscher Y, & Lonigan CJ (2014). Learning to write letters: Examination of student and letter factors. Journal of Experimental Child Psychology, 128, 152–170. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Raftery AE (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–163. [Google Scholar]
- Ravand H (2016). Application of a cognitive diagnostic model to a high-stakes reading comprehension test. Journal of Psychoeducational Assessment, 34(8), 782–799. [Google Scholar]
- Rijmen F, & Briggs D (2004). Multiple person dimensions and latent item predictors In Explanatory Item Response Models (pp. 247–265). Springer, New York, NY. [Google Scholar]
- Rodriguez G, & Elo I (2003). Intra-class correlation in random-effects models for binary data. The Stata Journal, 3(1), 32–46. [Google Scholar]
- Sheehan K, & Mislevy RJ (1990). Integrating cognitive and psychometric models to measure document literacy. Journal of Educational Measurement, 27(3), 255–272. [Google Scholar]
- Sosa AV, & Stoel-Gammon C (2012). Lexical and phonological effects in early word production. Journal of Speech, Language, and Hearing Research, 55, 596–608. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Steacy LM, Kearns DM, Gilbert JK, Compton DL, Cho E, Lindstrom ER, & Collins AA (2017). Exploring individual differences in irregular word recognition among children with early-emerging and late-emerging word reading difficulty. Journal of Educational Psychology, 109(1), 51. [Google Scholar]
- Stout WF (1990). A new item response theory modeling approach with applications to unidimensionality assessment and ability estimation. Psychometrika, 55(2), 293–325. [Google Scholar]
- Taft M (1992). The body of the BOSS: Subsyllabic units in the lexical processing of polysyllabic words. Journal of Experimental Psychology: Human Perception and Performance, 18, 1004–1014. doi: 10.1037/0096-1523.18.4.1004. [DOI] [Google Scholar]
- Van den Noortgate W, De Boeck P, & Meulders M (2003). Cross-classification multilevel logistic models in psychometrics. Journal of Educational and Behavioral Statistics, 28(4), 369–386. [Google Scholar]
- Warriner AB, Kuperman V, & Brysbaert M (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4), 1191–1207. [DOI] [PubMed] [Google Scholar]
- Wilson M, De Boeck P, & Carstensen CH (2008). Explanatory item response models: A brief introduction. In Hartig J, Klieme E, & Leutner D (Eds.), Assessment of competencies in educational contexts (pp. 91–120). Cambridge, MA: Hogrefe & Huber Publishers. [Google Scholar]
- Wilson M, & Moore S (2011). Building out a measurement model to incorporate complexities of testing in the language domain. Language Testing, 28(4), 441–462. [Google Scholar]
- Wilson M, & Moore S (2012). An explanative modeling approach to measurement of reading comprehension. Reaching an understanding: Innovations in how we view reading assessment, 147–168. [Google Scholar]
- Woodcock RW (1987). Woodcock Reading Mastery Tests - Revised. Circle Pines, MN: American Guidance Service. [Google Scholar]
- Woodcock RW (1998). Woodcock Reading Mastery Tests - Revised Normative Update: Examiner’s Manual. Circle Pines, MN: American Guidance Service. [Google Scholar]
- Zhang J (2007). Conditional covariance theory and detect for polytomous items. Psychometrika, 72(1), 69–91. [Google Scholar]
