Author manuscript; available in PMC: 2021 Oct 7.
Published in final edited form as: Psychol Assess. 2020 Jun 25;32(10):928–942. doi: 10.1037/pas0000883

Using Item Response Theory to Evaluate the Children’s Behavior Questionnaire: Considerations of General Functioning and Assessment Length

D Angus Clark 1, M Brent Donnellan 2, C Emily Durbin 2, Rebecca J Brooker 3, Tricia K Neppl 4, Megan Gunnar 5, Stephanie M Carlson 5, Lucy Le Mare 6, Grazyna Kochanska 7, Philip A Fisher 8, Leslie D Leve 9, Mary K Rothbart 10, Samuel P Putnam 11
PMCID: PMC8497017  NIHMSID: NIHMS1729593  PMID: 32584073

Abstract

Although the Children’s Behavior Questionnaire (CBQ; Rothbart et al., 2001) is the most popular assessment for childhood temperament, its psychometric qualities have yet to be examined using Item Response Theory (IRT) methods. These methods highlight in detail the specific contributions of individual items for measuring different facets of temperament. Importantly, with 16 scales for tapping distinct aspects of child functioning (195 items total), the CBQ’s length can be prohibitive in many contexts. The detailed information about item functioning provided by IRT methods is therefore especially useful. The current study used IRT methods to analyze the CBQ’s 16 temperament scales and identify potentially redundant items. An abbreviated “IRT form” was generated based on these results, and evaluated across four independent validation samples. The IRT form was compared to the original and short CBQ forms (Putnam & Rothbart, 2006). Results provide fine-grained detail on the CBQ’s psychometric functioning, and suggest it is possible to remove up to 39% of the original form’s items while largely preserving the measurement precision and content coverage of each scale. This study provides considerable psychometric information about the CBQ’s items and scales, and highlights future avenues for creating even more efficient high-quality temperament assessments.

Keywords: Children’s Behavior Questionnaire, Item Response Theory, Temperament, Parent Report, Short Form

Introduction

Children vary widely along a variety of observable dimensions, such as activity level, self-control, reaction to novelty, and tolerance of frustration. These early emerging individual differences reflect variations in children’s temperament (Rothbart & Derryberry, 1981; Shiner & DeYoung, 2013; Shiner & Caspi, 2012). Individual differences in child temperament are increasingly recognized as a critical component of lifespan development (Clark, Durbin, Hicks, Iacono, & McGue, 2017; Rothbart, 2011; Zentner & Shiner, 2012). Indeed, child temperament lays the foundation for adolescent and adult personality (Shiner & DeYoung, 2013), and is relevant for many consequential short- and long-term outcomes, such as psychopathology (Caspi, Moffitt, Newman, & Silva, 1996; Klein, Dyson, Kujawa, & Kotov, 2012; Tackett, Martel, & Kushner, 2012), academic performance (Duckworth & Allred, 2012), and substance use (e.g., Clark, Donnellan, Robins, & Conger, 2015; Creemers et al., 2010; Stautz & Cooper, 2013). Accordingly, a growing number of researchers and practitioners are incorporating measures of temperament into their work (Carey & McDevitt, 1989; McClowry & Collins, 2012; Zentner & Shiner, 2012). The Children’s Behavior Questionnaire (CBQ; Rothbart, Ahadi, Hershey, & Fisher, 2001; Kotelnikova, Olino, Klein, Kryski, & Hayden, 2015) is among the most widely used temperament questionnaires. Despite the CBQ’s widespread use, little is known about how its scales function when evaluated with the methods of Item Response Theory (IRT; Embretson & Reise, 2000; Revicki & Reise, 2014).

IRT provides a richer description of psychometric functioning at both the item and test level than classical psychometric methods, and can be used to reduce surveys in a principled manner that maximizes psychometric properties. The CBQ is fairly lengthy, with 16 individual temperament scales and close to 200 items, which can be restrictive in many circumstances. Short and very short forms of the CBQ exist (Putnam & Rothbart, 2006), but users may perceive the loss of precision and content coverage typically associated with short forms (e.g., Crede, Harms, Niehorster, & Gaye-Valentine, 2012) to be too great to move away from the original scales. IRT methods are particularly useful for addressing these concerns because they provide detailed information to balance scale length and quality. Accordingly, the aims of the present study were to 1) provide a thorough IRT analysis of the original 16 CBQ temperament scales, 2) use this information to construct an abbreviated “IRT form” of the CBQ scales by identifying and trimming potentially unnecessary items, and 3) compare the IRT form scales to the original and short CBQ form scales across several independent validation samples. Collectively, this work provides fine-grained information about each of the CBQ’s 16 specific temperament scales and offers shorter scales for temperament researchers who wish to assess various dimensions of child temperament with the CBQ.

The Children’s Behavior Questionnaire

Research on child temperament is often based on parent report questionnaires (Clark et al., 2017; Goldsmith & Gagne, 2012; Lo, Vroman, & Durbin, 2014). The CBQ (Rothbart et al., 2001) is currently the most widely used inventory (Kotelnikova et al., 2015). The foundational CBQ paper (Rothbart et al., 2001) has been cited over 2,400 times since its release (Google Scholar search, April 24, 2020). The CBQ has also been translated into over 20 non-English languages (e.g., Arabic, Chinese, Italian), giving it a broad international and cross-cultural presence (“The Children’s Behavior Questionnaire”, 2017). Indeed, the CBQ is arguably the standard in the field.

The CBQ is grounded in Rothbart and colleagues’ psychobiological model of early childhood temperament (Rothbart & Derryberry, 1981). It includes 195 individual items, with content derived both through theory-based approaches and extensive parental interviews. Items tend to emphasize specific behaviors (e.g., “gets mad when only mildly criticized”) rather than more global judgements. For each item, respondents rate the degree to which a target behavior was characteristic of the child over the past six months. Responses range from “1: extremely untrue of your child” to “7: extremely true of your child”. Each item of the CBQ is associated with one of 16 distinct conceptual facets of temperament: Activity Level, Anger/Frustration, Attentional Focusing, Attentional Shifting¹, Discomfort, Fear, High Intensity Pleasure, Impulsivity, Inhibitory Control, Low Intensity Pleasure, Perceptual Sensitivity, Positive Anticipation/Approach, Sadness, Shyness, Smiling/Laughter, and Soothability (see https://osf.io/r7mdf/ for conceptual definitions).

Item Response Theory and the CBQ

The CBQ is a carefully constructed questionnaire, and many aspects of its psychometric functioning have been studied (e.g., psychometric equivalence across child sex, consistency across early childhood, cross-cultural generalizability; Clark et al., 2016; Kotelnikova et al., 2015; Rothbart et al., 2001; Sulik et al., 2010). However, the properties of the CBQ scales have not yet been investigated within an IRT framework. Indeed, although IRT methods can offer many useful insights into psychometric functioning at the item, scale, and dimensional levels, they have rarely been used in the temperament literature, despite calls to adopt them more widely (Goldsmith & Gagne, 2012; but see Peterson et al., 2017).

IRT is a measurement framework including a wide range of latent variable models that provide information about psychometric functioning at both the item and test or scale level (de Ayala, 2009; Embretson & Reise, 2000). IRT techniques model the likelihood of a given response to an item as a probabilistic function of the target’s overall score on the latent trait of interest, often referred to as theta (Kamata & Bauer, 2008). This latent trait is typically connected to item responses through discrimination and location parameters. Discrimination values are analogous to factor loadings, and denote how strongly an item is related to theta. Higher values reflect strong associations or “better” items. Location parameters are analogous to indicator intercepts, and capture the value of theta at which targets are more likely to warrant a higher response category. Measurement precision, or reliability, is represented in IRT models using the concept of information (Thissen & Orlando, 2001). Information values are computed across theta, for both items and the full test or scale, as logit units that can easily be converted into standard errors and estimates of reliability (Thissen & Orlando, 2001).
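The conversion from information to standard errors and reliability can be sketched as follows. This is a minimal illustration that assumes theta is scaled to unit variance, so that the conditional standard error is 1/sqrt(I) and reliability is approximately 1 − 1/I; under that assumption 4 logits of information corresponds to a reliability of .75, the benchmark used later in this study. The function names are illustrative and do not belong to any IRT package.

```python
import math


def standard_error(information: float) -> float:
    """Conditional standard error of theta at a given information value."""
    if information <= 0:
        raise ValueError("information must be positive")
    return 1.0 / math.sqrt(information)


def reliability(information: float, theta_variance: float = 1.0) -> float:
    """Approximate reliability, assuming theta has the stated variance."""
    error_variance = standard_error(information) ** 2
    return 1.0 - error_variance / theta_variance


if __name__ == "__main__":
    # Illustrative information values, not estimates from the CBQ data.
    for info in (3.0, 4.0, 5.0, 10.0):
        se = standard_error(info)
        rel = reliability(info)
        print(f"I = {info:5.2f}  SE = {se:.3f}  reliability = {rel:.3f}")
```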

IRT offers many benefits, from sample-invariant parameter estimates (assuming no differential item functioning across populations) to metrics of reliability at both the test/scale and item level that are conditional on the trait being assessed (de Ayala, 2009; Hambleton & Swaminathan, 1985; Markus & Borsboom, 2013). The detail provided by IRT models identifies the specific contributions of individual items, highlighting when and where additional items may or may not be necessary (de Ayala, 2009; Hambleton & Swaminathan, 1985). Indeed, lengthy questionnaires often contain many similar items, whereas small sets of targeted items may provide adequate reliability across a desired range of theta (e.g., Schroder, Clark, & Moser, 2017). This strength of IRT models is particularly relevant in the context of the CBQ.

The CBQ is comprehensive in terms of its content, but this depth of assessment comes at a cost of respondent time given its length. Short and very short CBQ forms were developed using traditional psychometric methods to address length concerns (Putnam & Rothbart, 2006). Short form development typically entails sacrificing a non-trivial degree of measurement precision and content coverage as, by definition, psychometrically useful items from the full form will be discarded (i.e., in theory, if the discarded items were unnecessary there would be no need for a full form; Smith, McCarthy, & Anderson, 2000). Although the conflict between length and psychometric quality will never fully disappear, IRT methods can help minimize these concerns by precisely highlighting the specific contributions of individual items to identify weaker or even superfluous items (Hambleton & Swaminathan, 1985; Schroder et al., 2017). Thus, IRT methods can be used to identify items that can be omitted without significantly impacting the original form’s reliability and content coverage.

Present Study

IRT techniques were used to analyze the psychometric properties of the 16 CBQ temperament scales with an eye toward identifying a shorter set of items for each scale. Parents’ ratings of the 16 facets of temperament across one calibration sample and four independent validation samples were considered. Three major goals motivated this study. The first was to provide a comprehensive reference for researchers on the CBQ’s general functioning based on an IRT analysis. The second goal was to identify items that appeared to be redundant in psychometric terms, and examine the psychometric consequences of removing these items through the fashioning of an abbreviated “IRT Form”. The third goal was to compare the psychometric properties of the original, IRT, and short CBQ form scales across 5 independent samples of children to evaluate the properties of the shorter scales so researchers can make informed decisions in future work.

Method

Participants

Calibration Sample.

The calibration sample was a sample of children and their parents (both mothers and fathers) that was previously used to study measurement invariance in the CBQ (Clark et al., 2016). The calibration sample included 605 children with CBQ data (47% girls) aged 3 to 7 years. The mean child age was 52 months (SD = 10.79; range = 36 – 95), or 4.3 years. The average age of mothers was 31.49 years (SD = 7.66; range = 18 – 49). The average age of fathers was 33.44 years (SD = 7.49; range = 18 – 57). Family incomes ranged from below $10,000 annually, to over $100,000; 27% of all households included reported income of less than $41,000 yearly. In all there were 588 maternal reports of child temperament, and 479 paternal reports of child temperament. As both parents reported on the same child (i.e., maternal and paternal ratings are not independent) only maternal ratings were used in the actual calibration analyses. Paternal ratings were treated as an additional sample to confirm and extend findings from the calibration analyses. Unlike the samples described below, however, these paternal ratings do not represent a truly independent validation sample.

Validation Sample 1.

The first validation sample comes from a longitudinal study of temperament development (Gartstein, Putnam, & Rothbart, 2012; Gartstein & Rothbart, 2003; Putnam, Rothbart, & Gartstein, 2008). Validation sample 1 included 187 children with CBQ data (54% girls) and their primary caregivers (90% mothers). The mean age of the children at the time of CBQ administration was 49.5 months (SD = 4.81). Caregivers had a mean age of approximately 33.10 years (SD = 5.30). Most participating caregivers were White, married, and middle class (mean family income = $41,798; SD = $19,154).

Validation Sample 2.

The second validation sample was used in the development of the CBQ short and very short forms (see “Scale Construction” section in Putnam & Rothbart, 2006). Data for this sample were collected by Kochanska and colleagues (1994; N = 171), Fagot and Leve (1998; N = 174), and Fisher (1994; N = 123). The full sample contained 468 children (45% girls) who ranged in age from 21 to 101 months (average age = 65 months). Participants were mostly White, and the sample was diverse with respect to socioeconomic status.

Validation Sample 3.

The third validation sample was also used in the original development of the CBQ short and very short forms (see “Study 1” in Putnam & Rothbart, 2006). The data for this sample were collected by Carlson (N = 245), Le Mare (N = 129), Kochanska (N = 99), Gunnar (N = 60), and Rothbart (N = 57). The full sample contained 590 children (48% girls) with an average age of 54.42 months (SD = 13.57). Participants were mostly White, and came from middle and upper socioeconomic backgrounds.

Validation Sample 4.

The fourth validation sample consisted of maternal respondents collected from members of online panels maintained by Qualtrics, a market research company (Qualtrics, Provo, UT). Mothers were only administered the IRT form to ensure there was one validation sample in which respondents only completed the reduced item set (as opposed to deriving the IRT form from the full form). The ratings included here come from the participants who passed a quality control check item included in the questionnaire (select “agree a little”), answered in the affirmative that they responded honestly, and that they took part in the survey seriously (see Aust et al., 2013). This sample included 420 mothers who reported about one of their children (62% girls) with an average age of around 59.77 months (SD = 16.39). The majority of maternal participants were White (77%), and respondents were generally between 25 and 34 years old (58%). Family incomes ranged from below $15,000 annually, to over $100,000; 12% of all households included reported income of less than $15,000 yearly.

Data Analytic Strategy

Data analysis proceeded in three phases. The first was the initial evaluation of the CBQ scales using IRT methods. The second was the identification of potentially superfluous items and the creation of the IRT form scales. The third was the comparison of the full, IRT, and short form temperament scales. The first two phases exclusively relied on maternal reports from the calibration sample, whereas all raters and samples were used in the third. Phases one and two were performed for each individual scale in turn given the conceptual centrality of the facet scales in the development and application of the CBQ. IRT models were run using the flexMIRT software and estimated via full information maximum likelihood (Cai, 2012). All other analyses were conducted using Mplus version 8.0 (Muthen & Muthen, 1998–2017) and estimated using either mean and variance adjusted weighted least squares (WLSMV; for the categorical latent variable models; Wirth & Edwards, 2007), or full information maximum likelihood (for models based on continuous data).

Phase 1.

Scale dimensionality was first analyzed as the IRT models used here assume a unidimensional (or at least essentially unidimensional; Slocum-Gori & Zumbo, 2011) construct. Dimensionality was assessed using both item factor analyses (IFA; Wirth & Edwards, 2007) and bi-factor models (restricted or unrestricted, based on the IFAs; Stucky, Thissen, & Edelen, 2013). Item level “explained common variance” (I-ECV) was computed from the bi-factor models; this reflects the percentage of an item’s total shared variance that is explained by the general factor of interest as opposed to any specific factors that emerged during the IFA (e.g., Hansen et al., 2014; Rodriguez et al., 2016). When bi-factor models included specific factors indicated by only two items, factor loadings on that specific factor were fixed to equality for identification. Notably, even when constrained, models including specific factors with only two indicators are more likely to encounter estimation problems, so if these models did not converge properly, the stronger of the two items from the specific factor was retained in the IRT scale on the basis of the other selection criteria (see below). The original CBQ scales were then analyzed with the graded response model (GRM), an IRT model for polytomous items (Samejima, 1969), using the maternal ratings from the calibration sample.
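As an illustration of the I-ECV described above, the sketch below uses the common definition based on standardized loadings from an orthogonal bi-factor model: the squared general-factor loading divided by the sum of the squared general and specific loadings. The function name and the loadings are hypothetical, and the exact computation used in this study may differ in detail.

```python
def item_ecv(general_loading, specific_loadings=()):
    """Share of an item's common variance attributable to the general factor.

    Assumes standardized loadings from an orthogonal bi-factor model.
    """
    general = general_loading ** 2
    specific = sum(loading ** 2 for loading in specific_loadings)
    common = general + specific
    if common == 0:
        raise ValueError("item has no common variance")
    return general / common


if __name__ == "__main__":
    # Hypothetical loadings, not estimates from the CBQ data.
    examples = {
        "mostly general": (0.60, [0.40]),
        "mostly specific": (0.30, [0.60]),
        "no specific factor": (0.50, []),
    }
    for label, (general, specific) in examples.items():
        print(f"{label}: I-ECV = {item_ecv(general, specific):.2f}")
```

An item whose I-ECV falls well below one half is driven more by a specific factor than by the construct the scale is meant to measure.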

Phase 2.

The original CBQ scales were trimmed by identifying potentially unnecessary items. Items that failed to load meaningfully on any factor in the IFA (defined as λ < .40), had I-ECV values below 35%, and/or demonstrated low initial discrimination (defined as α < .80) were flagged for potential elimination. At minimum, however, at least one item per IFA dimension was included in the IRT scale to maintain content coverage, with priority going to those items with I-ECV values above 35%. If every scale item was exemplary by all of these standards, relatively weaker items were trimmed given the diminishing psychometric returns of including multiple items with similar properties. Notably, these criteria represented general guidelines, and flexibility was sometimes necessary to meet scale targets. Those scale targets for the IRT form were twofold. The first target was to maintain content coverage. This was accomplished by including at least one item in the IRT scale from each dimension identified in the IFA. The second target was for the IRT scale to provide at least 4 logits of information (i.e., reliability of roughly .75) within 2 standard deviations of theta’s mean. If the original scale had failed to meet this second target, the goal was to at least maintain a comparable level of precision in the shorter scale.
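The flagging guidelines above can be sketched as a simple screening function. The three thresholds come from the text; the class, field names, and example values are hypothetical. Note that a flag only marks an item as a candidate for removal, since the final decisions also weighed content coverage and the information targets.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_LOADING = 0.40          # largest IFA loading must reach this value
MIN_IECV = 0.35             # I-ECV expressed as a proportion
MIN_DISCRIMINATION = 0.80   # initial GRM discrimination


@dataclass
class ItemStats:
    label: str
    max_loading: float        # largest loading on any IFA factor
    iecv: Optional[float]     # None when no bi-factor model was needed
    discrimination: float


def flag_reasons(stats: ItemStats) -> List[str]:
    """Return the guideline(s) an item fails; an empty list means no flag."""
    reasons = []
    if stats.max_loading < MIN_LOADING:
        reasons.append("low loading")
    if stats.iecv is not None and stats.iecv < MIN_IECV:
        reasons.append("low I-ECV")
    if stats.discrimination < MIN_DISCRIMINATION:
        reasons.append("low discrimination")
    return reasons


if __name__ == "__main__":
    # Hypothetical items, not values from the CBQ analyses.
    items = [
        ItemStats("A", 0.55, 0.60, 1.20),
        ItemStats("B", 0.30, 0.20, 0.50),
        ItemStats("C", 0.45, None, 0.75),
    ]
    for item in items:
        print(item.label, flag_reasons(item) or "retain")
```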

After trimming items, the psychometric properties of the new IRT scales were assessed using maternal ratings from the calibration sample. The dimensionality of the IRT scales was analyzed with item factor analysis, and the psychometric properties of the scales were assessed with the graded response model. Bi-factor models were not used in the dimensionality assessment of the IRT scales as the scale reduction procedure typically reduced any multidimensionality present in the original scales to the point where there were only 1 or 2 items per specific factor (if any multidimensionality even remained), which would lead to unstable models that struggle to converge. With the reduced multidimensionality, however, the GRM discrimination values are a reasonable proxy for the I-ECV values (i.e., higher discrimination values indicate a stronger relation to the general factor).
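For readers unfamiliar with the graded response model, the sketch below shows how item information at a given theta follows from a discrimination value and a set of ordered thresholds. It assumes the logistic discrimination/threshold parameterization; estimation software may report a slope-intercept parameterization instead. All parameter values are hypothetical, and the function names are illustrative.

```python
import math


def cumulative_probs(theta, a, thresholds):
    """P(response at or above each category), padded with 1 and 0."""
    inner = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    return [1.0] + inner + [0.0]


def category_probs(theta, a, thresholds):
    """Probability of each response category under the GRM."""
    cum = cumulative_probs(theta, a, thresholds)
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def item_information(theta, a, thresholds):
    """Item information: sum over categories of (dP/dtheta)^2 / P."""
    cum = cumulative_probs(theta, a, thresholds)
    info = 0.0
    for k in range(len(cum) - 1):
        p = cum[k] - cum[k + 1]
        dp = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p > 0:
            info += dp * dp / p
    return info


def scale_information(theta, items):
    """Scale information is the sum of the item information values."""
    return sum(item_information(theta, a, bs) for a, bs in items)


if __name__ == "__main__":
    # Hypothetical 7-category items (six ordered thresholds each).
    items = [
        (1.8, [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]),
        (0.9, [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0]),
    ]
    for theta in (-2, -1, 0, 1, 2):
        print(theta, round(scale_information(theta, items), 2))
```

With a single threshold the model reduces to a two-parameter logistic item, for which information is the squared discrimination times P(1 − P).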

Phase 3.

The IRT form scales were compared to both the original and short CBQ form scales across the four validation samples, and maternal and paternal reports (parents were analyzed separately) in the calibration sample. First, measurement precision was compared across forms using marginal reliability estimates from the GRMs. Marginal reliability (Thissen, Nelson, & Swygert, 2001) is a single number that roughly reflects the “average” reliability (information) across theta. Second, correlations between forms were calculated both with and without corrections for potential within-sample error variance (Levy, 1967). Third, correlations with two criterion variables (externalizing and internalizing problems) and inter-parent agreement were examined. These criteria were selected given 1) enduring interest in how different temperamental traits serve as either risk or protective factors in predicting adjustment across childhood (Klein et al., 2012; Tackett, Martel, & Kushner, 2012), and 2) the longstanding concern in developmental research with improving (or at least understanding and maintaining) agreement between different raters of child behavior, which is typically between .2 and .4 (Clark et al., 2017; Duhig, Renk, Epstein, & Phares, 2000; Richters, 1992). Externalizing and internalizing problems were measured using a composite score of mothers’ and fathers’ ratings on the Child Behavior Checklist (CBCL; Achenbach & Ruffle, 2000) total externalizing and total internalizing problem scales. This final set of analyses was only conducted with the maternal and paternal reports from the calibration sample².
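One common way to think about marginal reliability is as one minus the expected error variance, where the expectation is taken over a standard normal distribution of theta. The sketch below approximates that quantity on a grid. It is an assumption-laden illustration rather than the exact computation performed by the estimation software, and the information function in the example is hypothetical.

```python
import math


def marginal_reliability(information_fn, lower=-6.0, upper=6.0, steps=1201):
    """Approximate 1 - E[1 / I(theta)] for theta ~ N(0, 1) on a grid."""
    width = (upper - lower) / (steps - 1)
    weighted_error = 0.0
    total_weight = 0.0
    for i in range(steps):
        theta = lower + i * width
        weight = math.exp(-0.5 * theta * theta)  # unnormalized normal density
        weighted_error += weight / information_fn(theta)
        total_weight += weight
    return 1.0 - weighted_error / total_weight


if __name__ == "__main__":
    # Hypothetical information curve that peaks at the mean of theta.
    def example_information(theta):
        return 2.0 + 3.0 * math.exp(-theta * theta / 8.0)

    print(round(marginal_reliability(example_information), 3))
```

When information is constant across theta, this reduces to the simple 1 − 1/I conversion, so a flat curve at 4 logits yields .75.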

As this study was initially based entirely on the secondary analysis of existing data, it was exempted from ethics committee review (Texas A&M University IRB#: 2017–0477). The subsequent collection of Validation Sample 4 was also exempted from IRB review given the nature of the data collection method and assurances of participant anonymity (Michigan State University IRB#: 00002636).

Results

We first discuss the psychometric analysis of the original CBQ scales and construction of the IRT form scales, followed by the comparison of the IRT, original, and short form scales. The dimensional structure of the full CBQ is then explored in a brief series of supplemental analyses. As a concise reference for the findings presented here, a list of every item, its content, basic psychometric performance in the calibration sample, and whether it is included in the IRT and/or short form, can be found on the Open Science Framework (OSF) at: https://osf.io/r7mdf/ .

Psychometric Evaluation of the CBQ and IRT Form Development

The results of the dimensionality assessments and GRMs for the original scales can be found on the OSF (https://osf.io/r7mdf/); these results provide a useful reference regarding the functioning of the original CBQ scales, including the weaker items that were removed when generating the IRT scales. Again, these were based on the analysis of maternal ratings from the calibration sample. The results of the dimensionality assessments and GRMs for the IRT scales are presented in Table 1. These results are also based on maternal ratings from the calibration sample; results based on paternal ratings from the calibration sample can be found on the OSF at: https://osf.io/r7mdf/. Results for each individual scale are presented in turn, alphabetically.

Table 1.

Graded Response Model and Dimensionality Results for IRT Form Scales

Information at Different Trait Levels
Scale/Item Discrimination −2 −1 0 1 2
Activity Level α = .79 3.50 4.57 4.92 5.03 5.07
25¹ 1.17 .36 .42 .43 .44 .42
41¹ 1.23 .22 .39 .44 .46 .48
102¹ 1.82 .60 .91 .99 1.00 1.02
123¹ 1.22 .16 .33 .43 .45 .48
126¹ 1.04 .33 .34 .34 .34 .34
145¹ 1.06 .25 .33 .35 .36 .36
172¹ 1.40 .23 .48 .56 .59 .62
192¹ 1.10 .35 .37 .38 .37 .36
Anger/Frustration α = .83 6.11 5.88 5.88 5.81 4.83
2¹ .84 .22 .23 .23 .22 .22
19¹ .91 .26 .26 .26 .26 .25
34¹ 1.52 .69 .73 .70 .69 .64
62¹ 2.25 1.54 1.25 1.38 1.35 .58
78¹ 1.33 .54 .56 .55 .54 .52
128¹ 1.17 .43 .41 .41 .39 .32
173¹ 1.43 .63 .64 .60 .61 .57
181¹ 1.31 .53 .54 .51 .51 .49
193¹ .92 .27 .26 .25 .25 .24
Attentional Focusing α = .77 4.09 4.30 4.43 4.33 4.27
16¹ .97 .29 .30 .30 .30 .29
38¹ 1.67 .74 .79 .86 .85 .82
47¹ 1.83 .93 1.02 1.04 .98 .98
125¹ .85 .19 .21 .22 .23 .23
171¹ 1.40 .54 .58 .61 .59 .58
195¹ 1.12 .39 .40 .39 .38 .37
Attentional Shifting α = .71 3.36 3.49 3.40 3.38 3.29
6¹ 1.41 .59 .63 .59 .58 .54
29¹ .89 .23 .23 .24 .24 .25
95¹ 1.69 .84 .86 .82 .83 .81
184¹ 1.57 .72 .77 .74 .72 .69
Discomfort α = .87 6.72 8.08 8.49 8.15 5.96
5¹ 1.23 .46 .48 .48 .45 .42
61¹ 3.27 1.98 2.93 3.19 3.05 1.50
101¹ 2.69 1.86 2.06 2.17 2.09 1.61
115² .44 .06 .06 .06 .06 .06
132¹ 2.22 1.27 1.47 1.51 1.42 1.30
178² .51 .08 .08 .08 .08 .08
Fear α = .78 3.70 4.45 4.62 4.53 4.28
40² 1.42 .32 .57 .65 .63 .61
50¹ .64 .12 .13 .13 .13 .13
91² 1.63 .56 .82 .85 .83 .73
130² 1.80 .80 .98 1.03 .98 .88
138 1.14 .40 .41 .42 .41 .39
161¹ .62 .12 .12 .12 .13 .12
176² .95 .26 .28 .29 .29 .28
189¹ .64 .12 .13 .13 .13 .13
High Intensity Pleasure α = .83 7.01 7.24 6.68 5.20 2.55
8¹ 2.70 2.12 2.22 2.06 1.40 .16
22¹ .81 .21 .21 .21 .21 .21
30¹ .99 .32 .31 .31 .30 .27
51¹ 1.19 .45 .45 .45 .43 .36
60¹ 2.61 2.01 2.17 1.86 1.22 .14
67¹ .94 .26 .24 .18 .10 .05
124¹ .78 .18 .18 .17 .17 .14
139¹ 1.19 .46 .45 .44 .38 .22
Impulsivity α = .84 5.25 6.03 6.43 6.33 6.25
13¹ .62 .12 .12 .13 .12 .12
46² .88 .22 .23 .23 .24 .24
59³ 1.87 .97 1.03 1.09 1.07 1.00
71³ 2.41 1.30 1.52 1.71 1.63 1.66
79¹ .60 .11 .12 .12 .11 .11
169³ 1.70 .55 .79 .85 .91 .88
183³ 2.05 .98 1.22 1.31 1.25 1.23
Inhibitory Control α = .85 6.79 6.80 6.63 6.27 6.13
4¹ 1.00 .32 .32 .31 .29 .24
32¹ 1.74 .89 .90 .93 .83 .80
75¹ 1.49 .69 .70 .66 .64 .61
93¹ 1.28 .47 .49 .50 .51 .49
108¹ 1.40 .57 .59 .59 .62 .58
136¹ 2.32 1.62 1.58 1.41 1.18 1.26
147¹ 1.03 .33 .33 .32 .31 .29
168¹ 1.25 .45 .45 .48 .47 .47
185¹ 1.19 .43 .44 .44 .42 .39
Low Intensity Pleasure α = .76 5.22 5.06 4.74 3.43 2.10
54¹ .85 .22 .21 .20 .16 .10
66² 1.31 .54 .52 .38 .17 .05
76³ 1.54 .75 .70 .62 .29 .08
86² .56 .10 .10 .10 .10 .10
113³ 1.44 .65 .62 .60 .46 .19
133² 1.74 .88 .86 .81 .38 .09
146³ 1.45 .67 .65 .63 .50 .21
164³ 1.13 .40 .39 .39 .37 .29
Perceptual Sensitivity α = .91 10.34 20.56 14.23 17.29 2.91
28¹ 2.44 1.77 1.89 1.74 1.60 .68
31¹ 2.89 1.69 2.65 2.43 2.29 .59
65¹ 6.90 5.11 14.24 8.31 11.68 .04
98¹ .80 .20 .19 .19 .18 .16
105¹ .84 .22 .23 .23 .23 .22
170¹ 1.05 .35 .35 .34 .31 .23
Positive Anticipation α = .81 6.58 5.87 5.73 5.12 2.64
24¹ .86 .22 .22 .22 .21 .17
82¹ 1.12 .38 .36 .34 .23 .11
96¹ 1.80 1.01 .94 .92 .86 .37
117¹ 2.90 2.46 1.83 1.83 1.55 .14
131¹ .88 .25 .25 .24 .24 .23
148¹ 1.49 .57 .59 .54 .55 .27
166¹ 1.21 .44 .42 .39 .24 .10
191¹ .91 .26 .26 .26 .25 .25
Sadness α = .70 3.19 3.38 3.42 3.38 3.35
18² .83 .22 .22 .22 .21 .20
39¹ .86 .16 .20 .22 .23 .24
44² 1.12 .38 .40 .39 .38 .37
55² 1.41 .53 .62 .64 .61 .61
81² .97 .29 .30 .30 .29 .28
94 .85 .23 .23 .22 .22 .22
109 .34 .04 .04 .04 .04 .04
127¹ .83 .16 .20 .21 .22 .22
149² .76 .18 .18 .18 .18 .17
Shyness α = .94 15.92 18.23 17.82 16.66 8.28
17¹ 2.19 1.41 1.45 1.42 1.31 .68
23¹ 2.58 1.87 2.05 1.99 1.71 1.11
45¹ 1.97 1.21 1.19 1.16 .96 .29
57¹ 2.94 2.25 2.65 2.37 2.32 .78
106¹ 2.66 1.04 2.02 2.09 2.12 1.66
129¹ 3.96 3.97 4.34 4.44 3.97 .50
143¹ 2.19 1.23 1.37 1.41 1.49 1.26
158¹ 2.62 1.93 2.15 1.94 1.78 1.01
Smiling and Laughter α = .77 5.40 5.09 5.00 3.81 2.40
110¹ 1.50 .65 .63 .59 .60 .37
121¹ 1.21 .45 .42 .40 .32 .15
135¹ 1.50 .69 .64 .60 .33 .10
163¹ 2.22 1.41 1.23 1.28 .50 .07
165¹ 1.30 .54 .53 .50 .46 .29
179 .72 .17 .16 .16 .16 .15
194¹ 1.27 .49 .48 .47 .44 .27
Soothability α = .84 6.65 6.66 6.42 5.83 5.00
14¹ .62 .12 .12 .12 .12 .12
68¹ 1.48 .66 .70 .69 .64 .51
92² 1.24 .48 .46 .45 .43 .40
118² 1.51 .71 .63 .65 .62 .55
134¹ 2.31 1.60 1.62 1.48 1.23 .98
150¹ 2.08 1.28 1.33 1.27 1.07 .91
177¹ 1.59 .79 .80 .76 .72 .53

Note. Item and test information presented at five levels of the latent trait, −2, −1, 0, 1, and 2. Total test information presented in row with scale name. α = marginal reliability of scale. Superscripts denote factor structure supported by IFAs; identical superscripts denote that the items loaded on the same factor (loadings above .4). All results are based on maternal ratings from the calibration sample; results based on paternal ratings can be found on the OSF (https://osf.io/r7mdf/).

Activity Level.

The dimensionality analysis suggested that a single factor solution was optimal. The original 13-item activity level scale consistently provided around 5.50 logits of information above −1.0 (α = .82). Five items were subsequently removed from this scale for failing to load on any factor in the dimensionality analysis (Items 1, 48, 153, and 187), inadequate initial discrimination (Items 1, 48, 153, and 187), and/or providing only marginal gains in information compared to the remaining items (Item 88). The 8-item IRT activity level scale appeared unidimensional and consistently provided between 4.50 and 5.00 logits of information above −1.0 (S6; α = .79). No items had a discrimination value below .80. The IRT scale was slightly less informative on average when paternal ratings were analyzed (α = .77). Overall, the IRT scale represented a reduction in length of 38% over the original (Table 2).

Table 2.

Number of Items in Original, IRT, and Short Forms

Original IRT Short O ➔I S ➔I
Activity Level 13 8 7 −38% +14%
Anger/Frustration 13 9 6 −31% +50%
Attentional Focusing 9 6 6 −33% 0%
Attentional Shifting 5 4 - −20% -
Discomfort 12 6 6 −50% 0%
Fear 12 8 6 −33% +33%
High Intensity Pleasure 13 8 6 −38% +33%
Impulsivity 13 7 6 −46% +17%
Inhibitory Control 13 9 6 −31% +50%
Low Intensity Pleasure 13 8 8 −38% 0%
Perceptual Sensitivity 12 6 6 −50% 0%
Positive Anticipation 13 8 6 −38% +33%
Sadness 12 9 7 −25% +28%
Shyness 13 8 6 −38% +33%
Smiling and Laughter 13 7 6 −46% +17%
Soothability 13 7 6 −46% +17%
Total 195 118 94 −39% +26%

Note. O ➔I = percent difference in length from the original form to the IRT form; S ➔I = percent difference in length from the short form to the IRT form. Although the full CBQ contains 195 items, only 192 items were actually included in the analyses here. Items 3, 33, and 49 were not incorporated into any scale in the CBQ scoresheet. As such, these three items by default are not included in the IRT form.
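The percent differences in Table 2 are simple relative changes in item counts. The sketch below reproduces a few of them from the counts reported in the table, assuming rounding to the nearest whole percent (the rounding convention is an assumption, and the function name is illustrative). The total reduction rounds to 39% whether the base is the 192 scored items or all 195 items.

```python
def percent_change(old_count: int, new_count: int) -> int:
    """Relative change in length, rounded to the nearest whole percent."""
    return round(100.0 * (new_count - old_count) / old_count)


if __name__ == "__main__":
    # (original, IRT, short) item counts taken from Table 2.
    scales = {
        "Activity Level": (13, 8, 7),
        "Anger/Frustration": (13, 9, 6),
        "Discomfort": (12, 6, 6),
    }
    for name, (original, irt, short) in scales.items():
        print(name, percent_change(original, irt), percent_change(short, irt))

    # Totals: 118 IRT items versus 192 scored items, 195 items, and 94 short items.
    print(percent_change(192, 118), percent_change(195, 118), percent_change(94, 118))
```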

Anger/Frustration.

The dimensionality analysis suggested a two factor solution was optimal. The first factor was the largest and represented the general anger/frustration construct, whereas the second factor emphasized bed-time related anger. The original 13-item anger/frustration scale provided the most information at −2.0 (6.87 logits), with information values decreasing gradually as scores increased (5.70 at 2.00) (α = .85). Four items were subsequently removed from this scale for failing to load on any factor in the dimensionality assessment (Item 156), being weakly associated with the general factor (Item 120) (the higher initial discrimination of the other item on this factor -- Item 2 -- implies a stronger relation to the general factor), inadequate initial discrimination (Item 73), and/or providing only marginal gains in information (Item 140). The 9-item IRT anger/frustration scale appeared unidimensional, and its information curve followed a pattern similar to the original’s, falling from 6.11 logits at −2.00 to 4.83 logits at 2.00 (α = .83). All items had discrimination values above .80. The IRT scale was similarly reliable when paternal ratings were used (α = .83). Overall, the IRT scale represented a reduction in length of 31% over the original (Table 2).

Attentional Focusing.

The dimensionality analysis suggested two factors, though these factors were conceptually ambiguous (the second factor did, however, seem based more strongly on specific behaviors/activities). The original 9-item attentional focusing scale consistently provided slightly more than 4.00 logits of information (α = .79). Three items were subsequently trimmed from this scale for either loading on a secondary factor while being weakly associated with the general factor in the bi-factor analysis (Items 144, 160, and 186), and/or inadequate initial discrimination (Items 160 and 186). The 6-item IRT attentional focusing scale appeared unidimensional, and its information curve was similar to the original’s (α = .77). All items had discrimination values above .80. The IRT scale was slightly less reliable when paternal ratings were used, generally providing between 3.30 and 3.50 logits of information (α = .72). Overall, the IRT scale represented a reduction in length of 33% over the original (Table 2).

Attentional Shifting.

The dimensionality analysis suggested that a single factor solution was optimal. The original 5-item attentional shifting scale consistently provided slightly less than 4.00 logits of information (α = .73). Given the original length and relatively low amount of information provided, only the item that provided the least information (Item 180) was removed. The 4-item IRT attentional shifting scale appeared unidimensional and provided a similar amount of information as the original (Table 2; α = .71). All items had discrimination values above .80. The IRT scale was equally reliable when paternal ratings were used (α = .71). Overall, the IRT scale represented a reduction in length of 20% over the original (Table 2).

Discomfort.

The dimensionality analysis suggested a two factor solution. The factors appeared to correspond to sensitivity to physical pain, and more mild stimuli. The original 12-item discomfort scale consistently provided above 6.00 logits of information, though precision was strongest between −1.00 and 1.00 (α = .88). Six items were subsequently removed from this scale for either failing to load on any factor in the dimensionality assessment (Items 21, 141, 157, and 190), being weakly associated with the general factor (Items 87 and 97), and/or inadequate initial discrimination (Items 21, 87, 97, 141, 157, and 190). The 6-item IRT discomfort scale had a large single factor and one minor factor (Items 115 and 178), and consistently provided over 5.00 logits of information (α = .87). Two items (Items 115 and 178) had discrimination values below .80 and were retained to ensure that each factor was adequately represented. The IRT scale provided slightly less information when paternal ratings were used (α = .85). Overall, the IRT scale represented a reduction in length of 50% over the original (Table 2).

Fear.

The dimensionality analysis suggested that a two factor solution was optimal. The first factor captured fear related to darkness and sleep, while the second factor captured fear of potentially dangerous or startling stimuli. The original 12-item fear scale provided the most information (up to 7.00+ logits) between −1.00 and 1.00 (α = .87). Four items were subsequently removed from this scale for either failing to load on any factor in the dimensionality analysis (Item 58), being weakly associated with the general factor (Items 15 and 70), and/or inadequate initial discrimination (Items 15, 70, and 80). The 8-item IRT fear scale had a two-factor dimensional structure similar to the original scale, and provided the most information between −1.00 and 1.00 (α = .78). Three items (Items 50, 161, and 189) had discrimination values below .80; these items were retained to ensure that each factor was adequately represented, and maintain an acceptable degree of precision. The IRT scale was slightly less informative when paternal ratings were used (α = .75). Overall, the IRT scale represented a reduction in length of 33% over the original (Table 2).

High Intensity Pleasure.

The dimensionality analysis suggested a single factor solution was optimal. The original 13-item high intensity pleasure scale consistently provided around 6.00 logits of information until above 1.0, where it fell to 3.50 by 2.00 (α = .83). Five items were subsequently removed from this scale for either failing to load on any factor in the dimensionality assessment (Items 100, 107, and 159), inadequate initial discrimination (Items 77, 100, 107, and 159), and/or providing only marginal gains in information compared to the remaining items (Item 182). The 8-item IRT high intensity pleasure scale was characterized by a single strong factor (there was evidence for some potential multidimensionality, but factors were not conceptually coherent, and were highly intercorrelated), and provided at least 5.00 logits of information below 1.0; above 1.0 the IRT scale provided slightly less than 3.00 logits (α = .83). Only one item had a discrimination value slightly below .80 (Item 124 with a discrimination of .78). The IRT scale was slightly more informative when paternal ratings were used (α = .85). Overall, the IRT scale represented a reduction in length of 38% over the original (Table 2).

Impulsivity.

The dimensionality analysis suggested that a two factor solution was optimal. The first factor captured specific impulsive behaviors, while the second factor captured the propensity to approach novel stimuli. The original 13-item impulsivity scale consistently provided between 5.50 and 6.50 logits of information (α = .84). Six items were subsequently removed from this scale for either failing to load on any factor in the dimensionality assessment (Items 90 and 137), being weakly associated with the general factor in the bi-factor model (Items 26, 104, 114, and 155), and/or inadequate initial discrimination (Items 26, 104, 114, 137, and 155). The 7-item IRT impulsivity scale was characterized by a single large factor (three items loaded on two extra minor factors), and consistently provided between 5.00 and 6.50 logits of information (α = .84). Two items had discrimination values below .80 (Items 13 and 79) but were retained to ensure both original factors were adequately represented. The IRT scale was somewhat less informative when paternal ratings were used (α = .82). Overall, the IRT scale represented a reduction in length of 46% over the original (Table 2).

Inhibitory Control.

The dimensionality analysis suggested that a single factor solution was optimal. The original 13-item inhibitory control scale consistently provided between 6.50 and 8.00 logits of information (α = .86). Four items were subsequently trimmed from this scale for either failing to load on any factor in the dimensionality assessment (Items 63 and 116), inadequate initial discrimination (Items 63 and 116), and/or providing only marginal gains in information (Items 20 and 162). The 9-item IRT inhibitory control scale was unidimensional and consistently provided between 6.00 and 7.00 logits of information (α = .85). All items had discrimination values above .80. The IRT scale was similarly reliable when paternal ratings were used (α = .84). Overall, the IRT scale represented a reduction in length of 31% over the original (Table 2).

Low Intensity Pleasure.

The dimensionality analysis suggested that a three factor solution was optimal, with the two conceptually clearest factors centering on closeness to parents, and enjoyment of verbal stimuli. The original 13-item low intensity pleasure scale provided the most (7.43) information at −2.0, but information declined as scores increased, dropping to 2.84 by 2.0 (α = .83). Five items were subsequently trimmed from this scale for either failing to load on any factor in the dimensionality assessment (Items 12 and 11), being only weakly associated with the general factor (Items 36, 151, and 174; item 174 was the weaker item in a two item factor), and/or inadequate initial discrimination (Items 12, 36, and 11). The 8-item IRT low intensity pleasure scale was best characterized by a factor solution similar to the original scale, and provided information in a similar pattern, ranging from 5.22 logits (−2.00) to 2.10 logits (2.00) (α = .76). All items except item 86 had discrimination values above .80; item 86 was retained to ensure representation of the second factor from the IFA (as it appeared slightly stronger than its counterpart, Item 36). The IRT scale provided slightly more information when paternal ratings were used, especially above the mean (α = .77). Overall, the IRT scale represented a reduction in length of 38% over the original (Table 2).

Perceptual Sensitivity.

The dimensionality analysis suggested that a two factor solution was optimal, with a factor each for both social and non-social stimuli. The original 12-item perceptual sensitivity scale provided exceptionally high levels of information (up to 21 logits) from −2.0 to 1.0, before dropping precipitously (to 3.37 logits) by 2.0 (α = .92). This dramatic information function is largely due to the inclusion of item 65, which had a discrimination value of 6.91. Six items were removed from this scale for failing to load on any factor in the dimensionality assessment (Items 122 and 142), being only weakly associated with the general factor (Items 9, 52, and 154), and/or inadequate initial discrimination (Items 9, 52, 84, 122, 142, and 154). The 6-item IRT perceptual sensitivity scale appeared unidimensional and also provided exceptionally high levels of information from −2.0 to 1.0, before dropping off by 2.0 (α = .91). All items had discrimination values above .80. The IRT scale was similarly reliable when paternal ratings were used (α = .90). Overall, the IRT scale represented a reduction in length of 50% over the original (Table 2).
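To illustrate why a single highly discriminating item such as item 65 can dominate a scale's information function, the following is a minimal sketch of item information under Samejima's graded response model. It assumes a graded response model for the 7-point items, which this section does not state explicitly, and the threshold values are hypothetical rather than estimates from the paper; only the discrimination of 6.91 comes from the text.

```python
import numpy as np

def grm_item_information(theta, a, thresholds):
    """Item information for a graded response model item (illustrative sketch).

    theta: latent trait value(s); a: discrimination; thresholds: ordered
    category boundaries (six boundaries for a 7-point item).
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(thresholds, dtype=float)
    # Cumulative probabilities of responding in category k or higher,
    # padded with 1 (lowest category) and 0 (beyond the highest category).
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    p_star = np.hstack([np.ones((theta.size, 1)), p_star, np.zeros((theta.size, 1))])
    # Category probabilities and their derivatives with respect to theta.
    p_cat = p_star[:, :-1] - p_star[:, 1:]
    d_star = a * p_star * (1.0 - p_star)
    d_cat = d_star[:, :-1] - d_star[:, 1:]
    # Sum of squared derivatives over category probabilities; skip empty categories.
    ratio = np.divide(d_cat**2, p_cat, out=np.zeros_like(p_cat), where=p_cat > 0)
    return ratio.sum(axis=1)

theta_grid = np.linspace(-2.0, 2.0, 5)
hypothetical_thresholds = [-2.0, -1.4, -0.8, -0.2, 0.4, 1.0]  # assumed, not from the paper
info_strong = grm_item_information(theta_grid, 6.91, hypothetical_thresholds)  # item 65's reported discrimination
info_weak = grm_item_information(theta_grid, 0.80, hypothetical_thresholds)    # the .80 retention cutoff
```

Because scale information is the sum of item information, an item with a discrimination this large contributes far more than items near the .80 cutoff wherever its thresholds fall, and very little beyond them.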

Positive Anticipation.

The dimensionality analysis suggested that a two factor solution was optimal, though the two factors were not clearly distinguishable conceptually. The original 13-item positive anticipation scale provided the most information at −2.00 (6.77 logits), with values decreasing as scores increased to 3.50 logits by 2.00 (α = .83). Five items were subsequently removed from this scale for either failing to load on any factor in the dimensionality assessment (Items 69, 175, and 188), being weakly associated with the general factor in the bi-factor model (Item 10), and/or inadequate initial discrimination (Items 10, 35, 69, and 175). The 8-item IRT positive anticipation scale appeared largely unidimensional and provided between 5.00 and 6.00 logits of information until after 1.0 (α = .81). None of the items had a discrimination value below .80. The IRT scale was slightly more reliable when paternal ratings were used (α = .82). Overall, the IRT scale represented a reduction in length of 38% over the original (Table 2).

Sadness.

The dimensionality analysis suggested that a two factor solution was optimal. Most items loaded on the first factor, which appeared to represent the general sadness construct. The second factor centered specifically on sadness related to stories or television shows. The original 12-item sadness scale consistently provided around 3.50 logits of information (α = .72). Three items were subsequently removed from this scale for failing to load on any major factor (Items 64 and 72), being weakly associated with the general factor in the bi-factor model (Item 112), and/or inadequate initial discrimination (Items 64, 72, and 112). The 9-item IRT sadness scale was best characterized by a factor solution similar to the original scale, and consistently provided around 3.20 logits of information (α = .70). The IRT scale was slightly less informative when paternal ratings were used (α = .69). Overall, the IRT scale represented a reduction in length of 25% over the original (Table 2).

Shyness.

The dimensionality analysis suggested that a two factor solution was optimal, though the two factors demonstrated considerable conceptual overlap. The original 13-item shyness scale provided consistently high levels of information (10.00+ logits), though relatively less information was provided above 1.0 (α = .95). Five items were subsequently removed from this scale. Importantly, this scale was particularly strong, with only one or two items meeting any of the criteria for potential removal. Thus, the items that were removed tended to be the weakest from an altogether strong set. Items were removed for being relatively weaker items associated with the smaller second factor (Items 37, 74, and 89), and/or relatively low initial discrimination (Items 7, 37, 74, 89, and 119). The 8-item IRT shyness scale appeared unidimensional and provided above 15.00 logits of information between −2.0 and 1.0, before dropping to 8.28 logits by 2.00 (α = .94). No items had discrimination values below .80. The IRT scale was slightly less informative when paternal ratings were used (α = .92). Overall, the IRT scale represented a reduction in length of 38% over the original (Table 2).

Smiling and Laughter.

The dimensionality analysis suggested that a two factor solution was optimal, with the factors roughly corresponding to smiling, and laughing. The original 13-item smiling and laughter scale provided above 5.00 logits of information until above −1.0, though only 2.86 logits were provided at −2.0 (α = .86). Six items were subsequently removed from this scale for being weakly associated with the general factor in the bi-factor model (Items 43 and 83), inadequate initial discrimination (Items 11, 43, 83, 99, and 152), and/or demonstrating consistent estimation issues across analyses (Item 56). The 7-item IRT smiling and laughter scale appeared unidimensional and provided a substantial amount of information (at least 5.00 logits) across most of the range considered, but dropped out at the extreme ends (α = .77) (the differential direction of the information curves across the original and IRT scales is an arbitrary consequence of the valence of the latent factor during estimation). Only Item 179 had a discrimination value slightly below .80, but it was retained in order to preserve an acceptable degree of precision. The IRT scale was slightly more informative when paternal ratings were used (α = .82). Overall, the IRT scale represented a reduction in length of 46% over the original (Table 2).

Soothability.

The dimensionality analysis suggested that a two factor solution was optimal. The two factors appeared to capture the ability to calm down after an exciting activity, or a negative event. The original 13-item soothability scale consistently provided more than 6.00 logits of information between −2.0 and 1.0 (α = .85). Six items were subsequently removed from this scale for either failing to load on any factor in the dimensionality assessment (Items 85, 103, and 167), being weakly associated with the general factor (Items 42 and 53), and/or inadequate initial discrimination (Items 27, 42, 53, 85, 103, and 167). The 7-item IRT soothability scale was best characterized by a factor solution similar to the original scale, and consistently provided at least 5.00 logits of information, though information did decline above the mean (α = .84). Only item 14 had a discrimination value below .80, and was retained to ensure that its factor was represented. The IRT scale was slightly more reliable with paternal ratings (α = .85). Overall, the IRT scale represented a reduction in length of 46% over the original (Table 2).

Comparing the Original, IRT, and Short Forms

The numbers of items in the original, IRT, and short forms are presented in Table 2. In all, 77 (of 195) items were removed from the original form for the IRT form (Table 2). The IRT form (118 items) was 39% shorter than the original form (195 items), and 26% longer than the short form (94 items). On average, the original form scales contained 12 items, whereas the IRT form scales contained 7 items, and the short form scales contained 6 items.
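The length comparisons above follow directly from the total item counts in Table 2. The short sketch below reproduces the arithmetic; the function and variable names are illustrative only. Note that the percentage depends on which form is the reference: the IRT form is about 26% longer than the short form, whereas the short form is about 20% shorter than the IRT form.

```python
def percent_difference(new_length: int, reference_length: int) -> float:
    """Percent change in length going from the reference form to the new form."""
    return 100.0 * (new_length - reference_length) / reference_length

original, irt, short = 195, 118, 94  # total item counts from Table 2

removed = original - irt                         # items dropped for the IRT form
orig_to_irt = percent_difference(irt, original)  # negative: IRT form is shorter
short_to_irt = percent_difference(irt, short)    # positive: IRT form is longer
irt_to_short = percent_difference(short, irt)    # short form relative to the IRT form

print(removed, round(orig_to_irt), round(short_to_irt), round(irt_to_short))
# prints: 77 -39 26 -20
```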

Reliability.

Marginal reliability estimates for the scales across forms and samples are presented in Table 3. Across samples, the original scales tended to provide the most information, whereas the short form scales tended to provide the least. These differences were modest, however. Across samples and scales, the average marginal reliability estimate was .85 for the original form, .82 for the IRT form, and .81 for the short form. The rank order of the individual scales also tended to be preserved across forms such that certain scales (e.g., shyness) consistently provided the most information, while others (e.g., sadness) provided the least.

Table 3.

Marginal Reliability Estimates Across Forms and Samples

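Columns, left to right: Calibration M (O, I, S); Calibration F (O, I, S); Validation 1 (O, I, S); Validation 2 (O, I, S); Validation 3 (O, I, S); Validation 4 (I); Average α (O, I, S)
Scale O I S O I S O I S O I S O I S I O I S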
Activity Level .82 .79 .74 .81 .77 .71 .84 .77 .79 .83 .76 .77 .86 .82 .80 .79 .83 .78 .76
Anger/Frustration .85 .83 .80 .86 .83 .80 .86 .84 .83 .85 .82 .87 .85 .82 .84 .87 .85 .84 .83
Attentional Focusing .79 .77 .75 .77 .72 .75 .81 .79 .76 .79 .76 .76 .81 .78 .78 .84 .79 .78 .76
Attentional Shifting .73 .71 - .73 .71 - .68 .63 - .79 .79 - .68 .63 - .67 .72 .69 -
Discomfort .88 .87 .87 .86 .85 .86 .89 .88 .89 .89 .89 .89 .90 .90 .90 .86 .88 .88 .88
Fear .87 .78 .89 .87 .75 .89 .90 .77 .89 .87 .80 .87 .88 .77 .88 .79 .88 .78 .88
High Intensity Pleasure .83 .83 .69 .86 .85 .72 .87 .87 .78 .87 .85 .77 .86 .85 .78 .81 .86 .84 .75
Impulsivity .84 .84 .78 .83 .82 .76 .88 .86 .83 .84 .82 .82 .85 .84 .82 .76 .85 .82 .80
Inhibitory Control .86 .85 .77 .86 .84 .74 .87 .86 .75 .86 .83 .76 .88 .86 .79 .87 .87 .85 .76
Low Intensity Pleasure .83 .76 .81 .84 .77 .81 .84 .69 .81 .81 .73 .79 .79 .72 .75 .76 .82 .74 .79
Perceptual Sensitivity .92 .91 .88 .89 .90 .88 .90 .90 .86 .91 .91 .88 .90 .90 .86 .89 .90 .90 .87
Positive Anticipation .83 .81 .79 .85 .82 .76 .88 .85 .80 .83 .82 .80 .84 .82 .78 .81 .85 .82 .79
Sadness .72 .70 .62 .71 .69 .63 .72 .67 .60 .76 .73 .70 .73 .70 .76 .77 .73 .71 .66
Shyness .95 .94 .90 .94 .92 .90 .96 .95 .93 .96 .95 .91 .95 .95 .91 .93 .95 .94 .91
Smiling and Laughter .86 .77 .79 .87 .82 .81 .88 .81 .84 .85 .75 .80 .86 .79 .79 .76 .86 .78 .81
Soothability .85 .84 .82 .86 .85 .84 .86 .85 .83 .84 .84 .84 .85 .84 .83 .84 .85 .84 .83
Average α .85 .82 .79 .85 .81 .79 .86 .82 .81 .85 .82 .82 .85 .82 .82 .81 .85 .82 .81

Note. Calibration - M = mothers’ ratings from the calibration sample; Calibration - F = fathers’ ratings from the calibration sample; O = original form; I = IRT form; S = short form. Average marginal reliability across forms presented in bottom row. Average scores exclude the attentional shifting scale as this scale is not included in the short form.

Interform Correlations.

Correlations between the different forms are presented in Table 4. All of the correlations in Table 4 have been adjusted for shared within-sample error variance using Levy’s long-to-short correction (Levy, 1967); this correction accounts for the unsystematic measurement error shared across long and short forms. The unadjusted correlations can be found on the OSF at: https://osf.io/r7mdf/ (most of these correlations were above r = .80). Correlations were highest between the original and IRT forms, and lowest between the IRT and short forms. Across scales and samples, the average adjusted correlation was r = .80 (the unadjusted average was r = .93) for the original and IRT forms, r = .79 (unadjusted average r = .88) for the original and short forms, and r = .71 (unadjusted average r = .91) for the IRT and short forms.

Table 4.

Correlations Between Different Forms Across Samples

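Columns, left to right: Calibration M (rOI, rSI, rOS); Calibration F (rOI, rSI, rOS); Validation 1 (rOI, rSI, rOS); Validation 2 (rOI, rSI, rOS); Validation 3 (rOI, rSI, rOS); Average r (rOI, rSI, rOS)
Scale rOI rSI rOS rOI rSI rOS rOI rSI rOS rOI rSI rOS rOI rSI rOS rOI rSI rOS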
Activity Level .79 .62 .75 .78 .58 .73 .78 .70 .81 .78 .67 .76 .83 .73 .82 .79 .66 .77
Anger/Frustration .82 .80 .79 .82 .81 .83 .83 .82 .82 .82 .78 .82 .83 .77 .82 .82 .80 .82
Attentional Focusing .75 .62 .75 .70 .54 .75 .78 .67 .78 .73 .55 .77 .77 .66 .79 .75 .61 .77
Attentional Shifting .58 - - .56 - - .45 - - .74 - - .65 - - .60 - -
Discomfort .81 .75 .79 .80 .73 .80 .82 .79 .81 .80 .78 .80 .82 .80 .80 .81 .77 .80
Fear .78 .77 .83 .76 .77 .86 .75 .73 .81 .80 .77 .83 .76 .76 .81 .77 .76 .83
High Intensity Pleasure .82 .59 .71 .82 .63 .71 .86 .73 .80 .78 .81 .77 .83 .71 .78 .82 .69 .75
Impulsivity .80 .73 .79 .77 .72 .75 .84 .77 .84 .79 .71 .82 .81 .76 .83 .80 .74 .81
Inhibitory Control .84 .75 .81 .83 .70 .78 .85 .73 .79 .84 .75 .80 .84 .78 .82 .84 .74 .80
Low Intensity Pleasure .78 .71 .76 .78 .71 .75 .73 .69 .75 .75 .70 .74 .74 .66 .72 .76 .69 .74
Positive Anticipation .75 .65 .78 .77 .62 .78 .82 .69 .81 .77 .62 .75 .78 .65 .79 .79 .65 .78
Perceptual Sensitivity .83 .76 .82 .80 .75 .80 .85 .77 .81 .83 .77 .81 .84 .77 .80 .83 .76 .81
Sadness .70 .50 .62 .69 .48 .66 .68 .50 .65 .74 .55 .68 .72 .62 .74 .71 .53 .67
Shyness .93 .85 .90 .91 .84 .90 .95 .90 .94 .95 .85 .91 .95 .86 .92 .94 .86 .91
Smiling and Laughter .76 .63 .79 .81 .69 .81 .81 .68 .83 .75 .54 .77 .78 .63 .81 .78 .63 .80
Soothability .79 .79 .78 .81 .79 .84 .81 .78 .81 .77 .74 .79 .79 .78 .77 .79 .78 .80
Average r .80 .70 .78 .79 .69 .78 .81 .73 .80 .79 .71 .79 .81 .73 .80 .80 .71 .79

Note. Calibration - M = mothers’ ratings from the calibration sample; Calibration - F = fathers’ ratings from the calibration sample; OI = correlation between original and IRT form; SI = correlation between short and IRT form; OS = correlation between original and short form. Average inter-form correlation presented in bottom row. Average scores exclude the attentional shifting scale as this scale is not included in the short form. Correlations adjusted for shared within-sample error variance based on Levy’s long to short correction (Levy, 1967); when adjusting the correlations between the IRT and short forms the scale with more items was treated as the long form in the formula. Unadjusted correlations can be found on the OSF (https://osf.io/r7mdf/).

Criterion Correlations and Rater Agreement.

The correlations between temperament and externalizing/internalizing problems, and between maternal and paternal ratings, are presented in Table 5. Across scales and raters, the original and IRT forms were associated with the CBCL scales to a largely equivalent degree (Cohen’s qs⁴ from .00 to .11; average q = .05). The short form tended to be the most weakly related to the CBCL scales, though differences were again generally small (average qs of .06 comparing the original and short forms, and .06 comparing the IRT and short forms). Parental agreement was also similar across forms; on average the highest agreement was observed with the original form (average qs of .04 compared against the IRT form, and .08 against the short form), and the lowest with the short form (average q of .07 compared against the IRT form).

Table 5.

Correlations with the Child Behavior Checklist and Interparent Agreement Across Forms in Calibration Sample

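Columns, left to right: Maternal Reports, Externalizing (O, I, S); Maternal Reports, Internalizing (O, I, S); Paternal Reports, Externalizing (O, I, S); Paternal Reports, Internalizing (O, I, S); Parent Agreement (O, I, S)
Scale O I S O I S O I S O I S O I S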
Activity Level .36 .31 .36 −.03 −.07 −.02 .29 .22 .29 −.06 −.09 −.07 .52 .49 .47
Anger/Frustration .38 .41 .41 .19 .22 .19 .32 .36 .23 .19 .21 .07 .42 .49 .42
Attentional Focusing −.32 −.37 −.31 −.07 −.13 −.06 −.30 −.34 −.29 −.10 −.16 −.10 .49 .46 .50
Attentional Shifting −.04 .07 - −.07 <.01 - −.09 −.03 - −.14 −.11 - .17 .25 -
Discomfort .03 −.02 .01 .21 .17 .18 .05 −.01 −.01 .25 .23 .13 .52 .49 .68
Fear .03 .06 .01 .32 .32 .26 .06 .09 .02 .34 .34 .16 .55 .51 .44
High Intensity Pleasure .08 .13 .04 −.17 −.15 −.14 .03 .08 <.01 −.23 −.20 −.15 .62 .57 .59
Impulsivity .28 .20 .18 −.16 −.22 −.20 .24 .16 .15 −.16 −.26 −.24 .62 .63 .59
Inhibitory Control −.50 −.48 −.47 −.14 −.13 −.13 −.47 −.44 −.35 −.12 −.11 −.05 .58 .56 .50
Low Intensity Pleasure −.26 −.22 −.13 −.09 −.08 −.05 −.26 −.26 −.19 −.05 −.09 −.08 .32 .32 .29
Positive Anticipation .14 .02 .09 .08 −.01 .05 .07 −.03 .015 .03 −.03 <.01 .37 .34 .36
Perceptual Sensitivity −.12 −.19 −.20 .02 −.01 −.04 −.18 −.25 −.21 −.12 −.19 −.20 .41 .41 .41
Sadness .11 .14 .04 .32 .33 .24 .10 .12 −.06 .26 .28 .04 .40 .38 .28
Shyness .09 .06 .07 −.26 −.26 −.26 .08 .07 .07 −.25 −.24 −.23 .68 .65 .65
Smiling and Laughter −.10 −.01 −.12 −.19 −.14 −.19 −.17 −.07 −.18 −.25 −.22 −.23 .33 .28 .29
Soothability .34 .28 .26 .32 .32 .30 .23 .16 .10 .26 .24 .13 .47 .44 .33
Average |r| .21 .19 .18 .17 .17 .15 .19 .18 .14 .18 .19 .13 .49 .47 .45

Note. O = original form; I = IRT form; S = short form; Externalizing = total externalizing problems scale of CBCL; Internalizing = total internalizing problems scale of the CBCL; Parent Agreement = correlation between maternal and paternal reports. Average correlation presented in bottom row. Average scores exclude the attentional shifting scale as this scale is not included in the short form.
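Cohen's q, used above to compare correlations across forms, is the difference between two Fisher z-transformed correlations. The sketch below shows the standard computation for one pair of values from Table 5; taking the absolute value is an assumption made here for illustration, and the paper's exact averaging procedure is not reproduced.

```python
import math

def cohens_q(r1: float, r2: float) -> float:
    """Absolute difference between two Fisher z-transformed correlations."""
    return abs(math.atanh(r1) - math.atanh(r2))

# Inhibitory Control with maternal-report externalizing problems (Table 5):
# original form r = -.50, IRT form r = -.48.
q_inhibitory = cohens_q(-0.50, -0.48)
print(f"{q_inhibitory:.3f}")
```

Working on the z scale matters because equal raw differences between correlations correspond to larger effects near ±1 than near 0.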

Supplemental Analyses: Higher Order Dimensional Structure

Consistent with the original development of the CBQ, we focused on the 16 specific temperament scales. Notably, there is evidence that these scales cluster around three superordinate temperament dimensions: Effortful Control, Negative Affectivity, and Surgency (Rothbart et al., 2001). A brief set of analyses was therefore conducted to explore the higher order dimensional structure of the CBQ scales across the different CBQ forms. Exploratory factor analysis (EFA) and exploratory structural equation models with target rotations (ESEM; Asparouhov & Muthén, 2009) were used. A more complete description of the analytic approach and results (including tables) from these analyses can be found on the OSF (https://osf.io/r7mdf/). The EFA and ESEM results consistently suggested that three factors were adequate to describe the higher-order structure of the CBQ; however, the specific nature of that structure was somewhat fuzzier than implied by the canonical dimensional structure. These conclusions were consistent across forms. Importantly, these analyses are about higher-order structure and do not necessarily reflect negatively on the psychometric quality of the individual scales, which is the primary focus of the present investigation. Indeed, the CBQ was not originally developed with an eye to any specific higher-order structure (Rothbart et al., 2001). Still, these supplemental results suggest that more work is needed to better conceptualize any higher-order dimensional structure of the CBQ.

Discussion

Item response theory (IRT) modeling techniques were applied to the Children’s Behavior Questionnaire (CBQ; Rothbart et al., 2001) to provide in-depth information about the psychometric functioning of each temperament scale and to identify items that could be removed to create a shorter inventory. Results suggested that the CBQ contains many (up to 77) somewhat redundant items. Their removal resulted in an abbreviated form functionally similar to the full CBQ form. Overall, results provide extensive documentation of the CBQ’s basic psychometric functioning for temperament researchers, and practical information for improving the efficiency of temperament assessment in research and applied contexts.

Full and Abbreviated CBQ Forms

The CBQ contains 16 temperament scales, and the initial IRT analyses demonstrated that these scales range in psychometric quality from relatively weaker (e.g., sadness) to relatively stronger (e.g., shyness). Most scales provide an adequate level of information (around 4.00; reliability of .75) for assessing children across most of the span of the underlying latent continuum (theta). Most scales also showed some multidimensionality, with pockets of additional covariation often centered on a set of closely related behaviors (e.g., anger about going to bed in the anger/frustration scale). Many of these specific factors are important for content coverage, and including some of their items will generally not undermine scale functioning. These items will tend to appear weaker, though, when examined in unidimensional models given their associations with specific facets.

The original CBQ scales seemed adequate based on these analyses; however, all scales contained items that contributed little incremental information to the assessment of the target trait and could be removed without substantially undercutting measurement quality. Altogether, 39% of the original CBQ’s items were removed in the IRT form (Table 2). Although this represents a substantial reduction in length, the IRT form scales largely functioned similarly to the original form. The raw information logits sometimes implied a substantial drop in precision (e.g., shyness); however, there are diminishing returns on information, such that at high levels of reliability seemingly large decreases in information logits do not necessarily correspond to similarly large drops in reliability.
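The diminishing returns on information can be made concrete with the usual conversion between information and reliability, which is consistent with the pairing of 4.00 logits and a reliability of .75 noted above. The sketch assumes the latent trait is scaled to unit variance, so that the squared standard error is 1/information; the information values in the example are illustrative rather than taken from a specific scale.

```python
def reliability_from_information(information: float) -> float:
    """Reliability implied by a given information value, assuming unit latent variance."""
    if information <= 0:
        raise ValueError("information must be positive")
    return 1.0 - 1.0 / information

# The same 5-logit loss of information matters far less at high precision.
high_before = reliability_from_information(20.0)  # 0.95
high_after = reliability_from_information(15.0)   # about 0.93
low_before = reliability_from_information(8.0)    # 0.875
low_after = reliability_from_information(3.0)     # about 0.67

drop_at_high_precision = high_before - high_after
drop_at_low_precision = low_before - low_after
```

Under this conversion, the roughly 3.50 logits reported for the sadness scale corresponds to a reliability near .71, close to its reported α of .72.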

Implications

The results highlight which items -- and by extension, the behaviors and narrow attributes -- are best suited for assessing individual differences between children, at least in the context of parent report methods. The data also reveal which of these items tap into narrower facets of temperament beyond the overarching scale. Conversely, the results also highlight which items are less informative -- either by being too strongly tied to narrow facets within a scale or failing to improve measurement quality past a point of diminishing returns. Indeed, the performance of the abbreviated IRT form in which these items were removed suggests that the analytic procedure used here was successful at shortening the scales without sacrificing much measurement precision.

The IRT form was also compared to the standard short form CBQ, which was developed using traditional psychometric methods (Putnam & Rothbart, 2006). The traditional CBQ short form contains 24 fewer items than the IRT form, a reduction in length of 52% compared to the original, and 20% compared to the IRT form (equivalently, the IRT form is 26% longer than the short form; Table 2). The IRT form was thus slightly longer, and slightly more reliable on average, but these differences were modest. Similarly, the IRT form was somewhat more predictive of broader adjustment, and demonstrated higher inter-rater agreement. Again, though, the differences between forms were modest and should not be overstated. Still, this does suggest that the different abbreviated forms have some distinct characteristics that may be more or less favorable depending on the goal of the assessment. For example, the IRT form scales seem to include a slightly broader range of content based on the thorough scale-level dimensionality assessment. This can help assess children across a wider range on the trait of interest. However, by omitting some of this content, the short form scales are often briefer, and may be more internally consistent (i.e., more tightly focused on the core of the trait).

Both the IRT and traditional short forms represent reasonable options for measuring temperament with fewer items than the full CBQ. An overarching implication of this work is the idea that parents may not necessarily need to complete the original full-length CBQ to obtain high-quality assessments of children’s temperamental traits. In both research and applied settings, a substantial portion of the full form can be omitted for the sake of parsimony without any major negative psychometric repercussions. The score sheet and item summary tables included on the OSF (https://osf.io/r7mdf/) are a useful reference that highlights the items that were and were not included in the IRT and short forms.

Limitations and Future Directions

Although several validation samples were incorporated into the evaluation of the different forms, it is worth noting that many of these samples were used in the development and evaluation of the short form. Validation Samples 1 and 4 were the only samples not used in the development of any of the original, short, or IRT forms. Of course, the IRT form was not considered previously in the validation samples, and the calibration sample was not used in the initial development of the short form. The total combination of these samples thus allowed each form to be considered in at least one sample that was not involved in its development. Also, one important aspect of psychometric functioning that was not considered here was differential item functioning (DIF; Tay, Meade, & Cao, 2015), or measurement bias, across meaningful variables such as child age, gender, and ethnicity (Rothbart et al., 2001). To be sure, there is some work on measurement bias in the CBQ (e.g., Clark et al., 2016), but these investigations are usually based on the full form, and only consider one or two potential sources of non-invariance, such as child and informant sex. Still, past results have been somewhat encouraging: although some measurement non-invariance exists across child and informant sex, the practical impact of this non-invariance tends to be small.

Future studies should more thoroughly investigate the invariance of the CBQ scales to provide another angle on the strengths and weaknesses of different items, and the general functioning of different forms. For example, the CBQ is considered appropriate for children between the ages of 3 and 7. This is a fairly wide age range, and examining developmentally relevant differential item functioning would be useful in further fine-tuning assessments of temperament by identifying stronger and weaker items in this regard. Future work could also explore the relations between the different forms and external criteria, such as observational assessments of temperament. Parent report and laboratory assessments of temperament tend not to overlap strongly, indicating that both approaches can provide complementary information (Durbin & Wilson, 2012; Seifer, 2002; Seifer et al., 2004). Potentially, the abbreviated scales may offer an advantage in this enterprise, given the elimination of weaker items.

Future work can also further explore the construction of the CBQ. For example, the CBQ uses a 7-point response scale, but there were many items for which certain response categories were hardly, if ever, used. This suggests that informants tend not to use all the response categories, and so it may be beneficial to reduce the number of response options in future iterations of the CBQ. A five-point response scale could be reasonable, as it was rarely the case that more than two response options were unused, and with five scale points items can still be treated as essentially continuous in analyses (Rhemtulla, Brosseau-Liard, & Savalei, 2012). Relatedly, the approach used here in the identification of the IRT form was fairly conservative, with a focus on largely maintaining the properties of the full form. Accordingly, future work can also explore using the general approach here to further reduce length after determining what would qualify as an acceptable versus unacceptable loss of psychometric functioning.
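
As a rough illustration of the kind of check that motivates the suggestion about response options, the sketch below flags response categories that are rarely endorsed for a single item. It is not the procedure used in this study: the function name `sparse_categories`, the 2% cutoff, and the example responses are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def sparse_categories(responses, n_points=7, min_prop=0.02):
    """Return the response categories endorsed by fewer than min_prop of raters.

    responses: iterable of integer ratings (1..n_points); None marks missing data.
    min_prop:  hypothetical cutoff for calling a category "hardly used".
    """
    valid = [r for r in responses if r is not None]
    if not valid:
        raise ValueError("no non-missing responses for this item")
    counts = Counter(valid)
    total = len(valid)
    # Counter returns 0 for categories that never occur.
    return [c for c in range(1, n_points + 1) if counts[c] / total < min_prop]


# Hypothetical ratings for one item from 100 informants (not study data).
example_item = [3] * 8 + [4] * 30 + [5] * 40 + [6] * 19 + [7] * 3
print(sparse_categories(example_item))  # categories 1 and 2 are never used
```

Applied to every item on a scale, a tally like this would show how often more than two categories go unused, which is the pattern that would bear on whether a five-point format is defensible.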

Conclusion

The child temperament literature is rapidly expanding as researchers and practitioners focus on the importance of early emerging individual differences in children’s dispositions. The wide availability of tools for assessing temperament that are both robust and practical is therefore an increasingly pressing matter. One of the most popular temperament inventories, the CBQ, is extremely comprehensive, but also lengthy. The current study used IRT and related techniques to analyze the CBQ scales and identify potentially unnecessary items. Results supported the general effectiveness of the CBQ scales while highlighting that each scale can be substantially shortened. Accordingly, as temperament questionnaires continue to undergo refinement, we believe IRT methods will prove invaluable for developing increasingly efficient high-quality assessment options.

Public Significance Statement:

The Children’s Behavior Questionnaire (CBQ) is the most popular questionnaire for assessing child temperament in research and applied settings. This study provides comprehensive documentation of the CBQ’s measurement properties, and offers a set of shorter scales for use in future studies and assessments.

Acknowledgements

This work was supported by grant T32 AA007477 (F. Blow) from the National Institute on Alcohol Abuse and Alcoholism. Calibration sample data collection was supported by the Kovler Research Scholar Fund of The Family Institute at Northwestern University, and the Eunice Kennedy Shriver National Institute of Child Health and Human Development (HD064687). Validation Sample 1 data collection was supported by funds allocated to Samuel Putnam and Maria Gartstein from the National Institute of Mental Health (U.S.; grant 5 T32 MH1893), awarded to University of Oregon, and from a new faculty award from Bowdoin College, awarded to Samuel Putnam. Validation Sample 2 data collection was supported by the National Institute of Mental Health (R01 MH037911) and the National Science Foundation (DBS-9209559). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies. The online material referenced in this manuscript can be found at the Open Science Framework (OSF) online repository at: https://osf.io/r7mdf/.

Footnotes

1

Attentional Focusing and Attentional Shifting scales are often combined, but were separated here based on the recommendation of analyses from the Northwest Mothers of Twins Study (NMOTS; Ahadi, Rothbart, & Ye, 1993).

2

These data were collected around the same time as the CBQ data in the calibration sample.

3

The corresponding Cronbach’s Alphas for the forms across samples are available on the OSF at: https://osf.io/r7mdf/.

4

Cohen’s q represents the difference between two correlations after each has been converted with Fisher’s r-to-z transformation. Conventional standards typically hold that a q of .20 represents a small effect.
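
For readers who wish to compute this index, a minimal sketch follows. It assumes the standard Fisher r-to-z transformation (the inverse hyperbolic tangent); the function names and the example correlations are hypothetical and are not values from this study.

```python
import math


def fisher_z(r: float) -> float:
    """Fisher's r-to-z transformation: z = atanh(r) = 0.5 * ln((1 + r) / (1 - r))."""
    return math.atanh(r)


def cohens_q(r1: float, r2: float) -> float:
    """Cohen's q: the difference between two Fisher z-transformed correlations."""
    return fisher_z(r1) - fisher_z(r2)


# Hypothetical example: a criterion correlates .50 with one form of a scale
# and .40 with another form of the same scale.
q = cohens_q(0.50, 0.40)
print(round(q, 3))
```

Because the transformation stretches values near the ends of the scale, the same raw difference between two correlations yields a larger q when the correlations are large than when they are near zero.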

Contributor Information

D. Angus Clark, University of Michigan, Department of Psychiatry and Addiction Center

Rebecca J. Brooker, Texas A&M University, Department of Psychological and Brain Sciences

Tricia K. Neppl, Iowa State University, Department of Human Development and Family Studies

Lucy Le Mare, Simon Fraser University, Faculty of Education

Grazyna Kochanska, The University of Iowa, Department of Psychological and Brain Sciences

Philip A. Fisher, University of Oregon, Department of Psychology

Leslie D. Leve, University of Oregon, College of Education

Mary K. Rothbart, University of Oregon, Department of Psychology

Samuel P. Putnam, Bowdoin College, Department of Psychology

References

  1. Achenbach TM, & Ruffle TM (2000). The Child Behavior Checklist and related forms for assessing behavioral/emotional problems and competencies. Pediatrics in Review, 21(1), 265–280.
  2. Ahadi SA, Rothbart MK, & Ye R (1993). Children’s temperament in the U.S. and China: Similarities and differences. European Journal of Personality, 7, 359–378.
  3. Asparouhov T, & Muthén B (2009). Exploratory structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 16, 397–438.
  4. Aust F, Diedenhofen B, Ullrich S, & Musch J (2013). Seriousness checks are useful to improve data validity in online research. Behavior Research Methods, 45(2), 527–535.
  5. Cai L (2012). flexMIRT: Flexible multilevel item factor analysis and test scoring [Computer software]. Seattle, WA: Vector Psychometric Group, LLC.
  6. Carey WB, & McDevitt SC (1989). Clinical and educational applications of temperament research. Amsterdam/Lisse: Swets & Zeitlinger.
  7. Caspi A, Moffitt TE, Newman DL, & Silva PA (1996). Behavioral observations at age 3 years predict adult psychiatric disorders: Longitudinal evidence from a birth cohort. Archives of General Psychiatry, 53(11), 1033–1039.
  8. Clark DA, Donnellan MB, Robins RW, & Conger RD (2015). Early adolescent temperament, parental monitoring, and substance use in Mexican-origin adolescents. Journal of Adolescence, 41, 121–131. doi: 10.1016/j.adolescence.2015.02.010
  9. Clark DA, & Bowles RP (2018). Model fit and item factor analysis: Overfactoring, underfactoring, and a program to guide interpretation. Multivariate Behavioral Research, 1–15.
  10. Clark DA, Durbin CE, Donnellan MB, & Neppl TK (2017). Internalizing symptoms and personality traits color parental reports of child temperament. Journal of Personality. doi: 10.1111/jopy.12293
  11. Clark DA, Durbin CE, Hicks BM, Iacono WG, & McGue M (2017). Personality in the age of industry: Structure, heritability, and correlates of personality in middle childhood from the perspective of parents, teachers, and children. Journal of Research in Personality, 67, 132–143.
  12. Clark DA, Listro CJ, Lo SL, Durbin CE, Donnellan MB, & Neppl TK (2016). Measurement invariance and child temperament: An evaluation of sex and informant differences on the Child Behavior Questionnaire. Psychological Assessment, 28(12), 1646–1662.
  13. Crede M, Harms P, Niehorster S, & Gaye-Valentine A (2012). An evaluation of the consequences of using short measures of the Big Five personality traits. Journal of Personality and Social Psychology, 102(4), 878–888.
  14. Creemers HE, Dijkstra JK, Vollebergh WAM, Ormel J, Verhulst FC, & Huizink AC (2010). Predicting life-time and regular cannabis use during adolescence; the roles of temperament and peer substance use: The TRAILS study. Addiction, 105, 699–708. doi: 10.1111/j.1360-0443.2009.02819.x
  15. de Ayala R (2009). The theory and practice of item response theory. New York: Guilford Press.
  16. Duckworth AL, & Allred KM (2012). Temperament in the classroom. In Zentner M, & Shiner RL (Eds.), Handbook of Temperament (pp. 607–626). New York: The Guilford Press.
  17. Duhig AM, Renk K, Epstein MK, & Phares V (2000). Interparental agreement on internalizing, externalizing, and total behavior problems: A meta-analysis. Clinical Psychology: Science and Practice, 7(4), 435–453.
  18. Durbin CE, & Wilson S (2012). Convergent validity of and bias in maternal reports of child emotion. Psychological Assessment, 24, 647–660. doi: 10.1037/a0026607
  19. Embretson SE, & Reise SP (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
  20. Fagot BI, & Leve LD (1998). Teacher ratings of externalizing behavior at school entry for boys and girls: Similar early predictors and different correlates. The Journal of Child Psychology and Psychiatry and Allied Disciplines, 39(4), 555–566.
  21. Fisher PA (1994). Temperament goodness of fit and psychosocial adjustment in children.
  22. Gartstein MA, Bridgett DJ, & Low CM (2012). Asking questions about temperament: Self- and other-report measures. In Zentner M, & Shiner RL (Eds.), Handbook of Temperament (pp. 183–208). New York: The Guilford Press.
  23. Gartstein MA, Putnam SP, & Rothbart MK (2012). Etiology of preschool behavior problems: Contributions of temperament attributes in early childhood. Infant Mental Health Journal, 33(2), 197–211.
  24. Gartstein MA, & Rothbart MK (2003). Studying infant temperament via the revised infant behavior questionnaire. Infant Behavior and Development, 26(1), 64–86.
  25. Goldsmith HH, & Gagne JR (2012). Behavioral assessment of temperament. In Zentner M, & Shiner RL (Eds.), Handbook of Temperament (pp. 209–228). New York: The Guilford Press.
  26. Hambleton RK, & Swaminathan H (1985). Item response theory: Principles and applications. Boston: Kluwer Nijhoff.
  27. Hansen M, Cai L, Stucky BD, Tucker JS, Shadel WG, & Edelen MO (2014). Methodology for developing and evaluating the PROMIS smoking item banks. Nicotine & Tobacco Research, 16(3), 175–189.
  28. Kamata A, & Bauer DJ (2008). A note on the relation between factor analytic and item response theory models. Structural Equation Modeling: A Multidisciplinary Journal, 15(1), 136–153.
  29. Klein DN, Dyson MW, Kujawa AJ, & Kotov R (2012). Temperament and internalizing disorders. In Zentner M, & Shiner RL (Eds.), Handbook of Temperament (pp. 541–561). New York: The Guilford Press.
  30. Kochanska G, DeVet K, Goldman M, Murray K, & Putnam SP (1994). Maternal reports of conscience development and temperament in young children. Child Development, 65(3), 852–868.
  31. Kotelnikova Y, Olino TM, Klein DN, Kryski KR, & Hayden EP (2015). Higher- and lower-order factor analyses of the Children’s Behavior Questionnaire in early and middle childhood. Psychological Assessment.
  32. Levy P (1967). The correction for spurious correlation in the evaluation of short-form tests. Journal of Clinical Psychology, 23(1), 84–86.
  33. Lo SL, Vroman LN, & Durbin CE (2014). Ecological validity of laboratory assessments of child temperament: Evidence from parent perspectives. Psychological Assessment, 27(1), 280–290.
  34. Markus KA, & Borsboom D (2013). Frontiers of test validity theory: Measurement, causation, and meaning. Routledge.
  35. McClowry SG, & Collins A (2012). Temperament-based intervention: Reconceptualized from a response-to-intervention framework. In Zentner M, & Shiner RL (Eds.), Handbook of Temperament (pp. 607–626). New York: The Guilford Press.
  36. Muthén LK, & Muthén BO (1998–2017). Mplus User’s Guide. Eighth Edition. Los Angeles, CA: Muthén & Muthén.
  37. Peterson ER, Mohal J, Waldie KE, Reese E, Atatoa Carr PE, Grant CC, & Morton SM (2017). A cross-cultural analysis of the Infant Behavior Questionnaire Very Short Form: An item response theory analysis of infant temperament in New Zealand. Journal of Personality Assessment, 99(6), 574–584.
  38. Putnam SP, Ellis LK, & Rothbart MK (2001). The structure of temperament from infancy through adolescence. In Eliasz A & Angleitner A (Eds.), Advances in research on temperament (pp. 165–182). Lengerich, Germany: Pabst Science.
  39. Putnam SP, & Rothbart MK (2006). Development of short and very short forms of the Children’s Behavior Questionnaire. Journal of Personality Assessment, 87(1), 103–113.
  40. Putnam SP, Rothbart MK, & Gartstein MA (2008). Homotypic and heterotypic continuity of fine-grained temperament during infancy, toddlerhood, and early childhood. Infant and Child Development: An International Journal of Research and Practice, 17(4), 387–405.
  41. Revicki DA, & Reise SP (2015). Summary: New IRT problems and future directions. In Reise SP & Revicki DA (Eds.), Handbook of Item Response Theory Modeling (pp. 457–462). New York, NY: Taylor & Francis.
  42. Rhemtulla M, Brosseau-Liard PÉ, & Savalei V (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17(3), 354.
  43. Richters JE (1992). Depressed mothers as informants about their children: A critical review of the evidence for distortion. Psychological Bulletin, 112(3), 485.
  44. Rodriguez A, Reise SP, & Haviland MG (2016). Evaluating bifactor models: Calculating and interpreting statistical indices. Psychological Methods, 21(2), 137.
  45. Rothbart MK (2011). Becoming who we are: Temperament and personality in development. New York, NY: The Guilford Press.
  46. Rothbart MK, Ahadi SA, Hershey KL, & Fisher P (2001). Investigations of temperament at three to seven years: The Children’s Behavior Questionnaire. Child Development, 72(5), 1394–1408.
  47. Rothbart MK, & Derryberry D (1981). Development of individual differences in temperament. In Lamb ME & Brown AL (Eds.), Advances in developmental psychology (Vol. 1, pp. 37–86).
  48. Samejima F (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika, 34, 100–114.
  49. Seifer R (2002). What do we learn from parent reports of their children’s behavior? Commentary on Vaughn et al.’s critique of early temperament assessments. Infant Behavior and Development, 25, 117–120.
  50. Seifer R, Sameroff A, Dickstein S, Schiller M, & Hayden LC (2004). Your own children are special: Clues to the sources of reporting bias in temperament assessments. Infant Behavior and Development, 27, 323–341. doi: 10.1016/j.infbeh.2003.12.005
  51. Schroder HS, Clark DA, & Moser JS (2017). Screening for problematic worry in adults with a single item from the Penn State Worry Questionnaire. Assessment.
  52. Shiner RL, & Caspi A (2012). Temperament and the development of personality traits, adaptations, and narratives. In Zentner M, & Shiner RL (Eds.), Handbook of Temperament (pp. 497–518). New York: The Guilford Press.
  53. Shiner RL, & DeYoung CG (2013). The structure of temperament and personality traits: A developmental perspective. In Zelazo P (Ed.), Oxford Handbook of Developmental Psychology (pp. 113–141). New York: Oxford University Press.
  54. Slocum-Gori SL, & Zumbo BD (2011). Assessing the unidimensionality of psychological scales: Using multiple criteria from factor analysis. Social Indicators Research, 102(3), 443–461.
  55. Smith GT, McCarthy DM, & Anderson KG (2000). On the sins of short-form development. Psychological Assessment, 12(1), 102–111.
  56. Stautz K, & Cooper A (2013). Impulsivity-related personality traits and adolescent alcohol use: A meta-analytic review. Clinical Psychology Review, 33, 574–592.
  57. Stucky BD, Thissen D, & Edelen MO (2013). Using logistic approximations of marginal trace lines to develop short assessments. Applied Psychological Measurement, 37(1), 41–57.
  58. Sulik MJ, Huerta S, Zerr AA, Eisenberg N, Spinrad TL, Valiente C, … & Edwards A (2010). The factor structure of effortful control and measurement invariance across ethnicity and sex in a high-risk sample. Journal of Psychopathology and Behavioral Assessment, 32(1), 8–22.
  59. Tackett JL, Martel MM, & Kushner SC (2012). Temperament, externalizing disorders, and Attention-Deficit/Hyperactivity Disorder. In Zentner M, & Shiner RL (Eds.), Handbook of Temperament (pp. 562–580). New York: The Guilford Press.
  60. Tay L, Meade AW, & Cao M (2015). An overview and practical guide to IRT measurement equivalence analysis. Organizational Research Methods, 18(1), 3–46.
  61. Thissen D, Nelson L, & Swygert KA (2001). Item response theory applied to combinations of multiple-choice and constructed-response items: Approximation methods for scale scores. In Thissen D & Wainer H (Eds.), Test Scoring. Hillsdale, NJ: Erlbaum.
  62. Thissen D, & Orlando M (2001). Item response theory for items scored in two categories. In Thissen D & Wainer H (Eds.), Test Scoring. Hillsdale, NJ: Erlbaum.
  63. Wirth RJ, & Edwards MC (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12(1), 58–79. doi: 10.1037/1082-989X.12.1.58
  64. Zentner M, & Shiner RL (2012). Fifty years of progress in temperament research: A synthesis of major themes, findings, and challenges and a look forward. In Zentner M, & Shiner RL (Eds.), Handbook of Temperament (pp. 673–700). New York: The Guilford Press.
