Abstract
The Child Behavior Checklist (CBCL) and Strengths and Difficulties Questionnaire (SDQ) both measure emotional and behavioral problems in children and adolescents, and scores on the two instruments are highly correlated. When administrative needs compel practitioners to change the instrument used or data from the two measures are combined to perform pooled analyses, it becomes necessary to compare scores on the two instruments. To enable such comparisons, we linked scores on three domains (Internalizing, Externalizing, and Total Problems) of the CBCL and SDQ in three age groups spanning 2–17 years. After assessing linking assumptions, we compared item response theory (IRT) and equipercentile linking methods to identify the most statistically justifiable link, ultimately selecting equipercentile linking with loglinear smoothing due to its minimal bias and its ability to link raw SDQ scores with both T-scores and raw scores from the CBCL. We derived crosswalk conversion tables to convert scores on one measure to the metric of the other and discuss the use of these tables in research and practice.
Keywords: Child Behavior Checklist, Strengths and Difficulties Questionnaire, linking, equating, assessment
Childhood mental health assessment is important in clinical practice to screen for emotional and behavioral problems and in research to assess severity and monitor change. In the U.S., 13%–25% of children are diagnosed with a mental disorder annually (Egger & Angold, 2006; Perou et al., 2013), and diagnostic comorbidity is common (Ghandour et al., 2019). Early identification of problems can help address contributing factors and reduce downstream negative effects on school readiness, academic achievement and attainment, peer relationships, substance use, and criminal behavior (Child Mind Institute, 2015; Perou et al., 2013). The importance of screening early and often is highlighted by the American Academy of Pediatrics, Committee on Psychosocial Aspects of Child and Family Health and Task Force on Mental Health (2009), which recommends that clinicians be equipped to identify and address pediatric mental health concerns during well-care visits. Practical limitations to administering pediatric mental health screeners may result in measure underutilization and thus underidentification of children who need additional services. Cost, time, and ease of use can prohibit implementing certain assessment tools despite their high reliability and validity (Cohen et al., 2008). Also, the measures used can change between administrations, for example, due to a change in budget or strategic direction or a desire to reduce time burden (e.g., Zuckerbrot et al., 2007). Such a switch necessitates a way to compare scores from different instruments to ensure continuity of monitoring.
In addition, researchers may encounter situations in which multiple instruments were administered to different groups of children. Such situations may arise when data from multiple sites are combined in a multistudy program or when existing data were collected using one measure but prospective data will be collected using a different measure. These harmonization issues are common in large-scale consortium research (Collins & Manolio, 2007; Feigelson et al., 2006; Pedersen et al., 2013; Smith-Warner et al., 2006; Willett et al., 2007), and with a broader movement to standardize measurement across nationally funded research programs (Downey & Olson, 2013) and the proliferation of modern perspectives on data sharing (Fischer & Zigmond, 2010; Van Noorden, 2013), it is increasingly important to be able to place scores from different measures onto a common metric. Crosswalk conversion tables, while originally developed to compare scores on aptitude tests (Lord & Wingersky, 1984), have become popular tools for converting scores between instruments in the psychosocial and health sciences, with recent research linking measures of depression (Choi et al., 2014), anxiety (Schalet et al., 2014), pain (Askew et al., 2013), psychological distress (Batterham et al., 2018), and fatigue (Lai et al., 2014), to name a few. While crosswalk conversions have limitations, including the introduction of error when the to-be-linked scores are not perfectly correlated (e.g., Lord, 1982) and potential breakdown of the linking relationship across subgroups (Dorans, 2004; Petersen, 2008), they are beneficial for comparing levels of functioning across samples and the results of studies which use different instruments. Crosswalk tables can also be used to harmonize data at the score level when the linked instruments do not share items or response formats. While integrative data analysis (IDA; Curran & Hussong, 2009; Hussong et al., 2013), in which joint latent variable modeling of multiple measures is used to analyze the corresponding constructs on a common metric, is theoretically better suited than crosswalk conversion tables for such harmonization with respect to bias and precision of the resulting estimates, this approach requires the analyst to be trained in latent variable modeling, as well as raw co-administration data or derived item parameters from concurrent calibration. In contrast, crosswalk tables permit the resulting linked scores to be used directly by analysts with no training in latent variable modeling and do not require overlapping administration. In the present study, we used established linking methodology (Choi et al., 2014) to link two popular measures of emotional and behavioral functioning, the Child Behavior Checklist (CBCL; Achenbach, 1991; Achenbach & Ruffle, 2000) and Strengths and Difficulties Questionnaire (SDQ; Goodman, 1997).
The CBCL has a long history of clinical and nonclinical use for assessing behavioral and emotional problems in children (Achenbach, 1991; Achenbach & Ruffle, 2000), with several age-based versions: a preschool parent/teacher-report form (1.5–5 years), a school-age parent/teacher-report form (6–18 years), and a school-age self-report form (11–18 years). The CBCL has many scoring options, including examination of individual items, syndrome scales (e.g., Anxious/Depressed), Diagnostic and Statistical Manual of Mental Disorders (DSM)-oriented scales (e.g., Depressive Problems, Anxiety Problems), domain scores (i.e., Internalizing, Externalizing), and/or the Total Problems score, which encompasses all assessed problems. The SDQ assesses similar constructs to the CBCL and has a similar response format and question structure (Goodman, 1997). Like the CBCL, the SDQ has multiple forms: a preschool form (2–4 years), a younger child school-age form (4–10 years), an older child school-age form (11–17 years), and a self-report form (11–17 years). The versions of the SDQ have nearly identical items, and each measures five specific domains: Emotional Problems, Conduct Problems, Hyperactivity, Peer Problems, and Prosocial, which measures positive functioning. Like the CBCL, the SDQ can also yield Internalizing, Externalizing, and Total Problems scores by combining the various subscales. As a considerably shorter measure (25 items in total), the SDQ is becoming popular as a simpler measure of child emotional and behavioral problems. The overlap in domains between the CBCL and SDQ makes them good candidates for score linking.
The CBCL has strong psychometric properties, with good interrater reliability (Mother × Father: .48 < r < .67 for syndrome scales, .59 < r < .67 for domain scores) and internal consistency (average α = .83 for syndrome scales, .92 for Internalizing and Externalizing, .97 for Total Problems; Achenbach & Rescorla, 2001; see also Achenbach et al., 2008; Holmbeck et al., 2008). The CBCL syndrome scales have been shown to differentially predict adult Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV; American Psychiatric Association, 2000) diagnoses (Hofstra et al., 2002) and to be highly correlated with comparable subscales on other measures, including the SDQ (mean correlation across subscales .58 < r < .75; Achenbach et al., 2008), the Conners Parent and Teacher Rating Scales (.71 < r < .89), and the Behavior Assessment System for Children (average r = .69; Achenbach & Rescorla, 2001). However, the length of the CBCL (100–119 items per form) can pose a burden for administration (Muris et al., 2003). While the SDQ has far fewer items than the CBCL (25 per form) and therefore offers a much briefer assessment of similar domains, this drop in item count comes with a corresponding drop in internal consistency (average α = .66 for parent-report, .78 for teacher-report) and interrater agreement (Parent × Teacher: .25 < r < .48; Goodman, 2001) for the difficulties scales. Despite this apparent disadvantage, the domain scores for the SDQ have demonstrated higher internal consistency (e.g., α = .82 for Total Difficulties; Goodman, 2001; see also Goodman, 1999; Muris et al., 2003), and the difficulties scales and domain scores have been found to predict risk for psychiatric disorders (Goodman, 2001) and mental health service use and contact (Bourdon et al., 2005). For a more comprehensive psychometric comparison of the CBCL and SDQ, we refer readers to Achenbach et al. (2008). In short, while the CBCL is longer and thus more reliable than the SDQ when measuring specific syndromes or difficulties, these differences are attenuated when broader domain scores are used, and the two measures have strong convergent, discriminant, and external validity as measures of child emotional and behavioral functioning.
As the CBCL and SDQ are both commonly used, the present study seeks to link or map scores from each instrument to the other to facilitate data pooling and enable practitioners and researchers to compare the scores across measures. To date, only one study has attempted to link the CBCL and SDQ. Stevens et al. (2021) linked the CBCL and SDQ Total scores in a sample of residential youth using equipercentile equating, yielding crosswalk tables which can be used to convert SDQ Total Difficulties scores to CBCL Total Problems T-scores or vice versa. However, a rigorous analysis of test linking assumptions was not performed: The correlation between CBCL and SDQ scores was not provided, and Cronbach’s α was only reported for each scale individually. Furthermore, only a single crosswalk table was constructed despite the separate CBCL norms for males and females, and no assessment of the quality of the linkage was conducted, either in the estimation sample or in key subsamples (e.g., male and female). Finally, Stevens et al. only linked the school-age forms in a sample consisting of youth in a residential care facility; therefore, results do not generalize to preschool-aged children and may not generalize to outpatient or nonpatient youth, the latter of which make up the broader demographic of those likely to be evaluated in clinical practice and research settings.
Method
Participants
The authors recruited participants through an internet panel company, collecting data from three samples of 500 parents each. Each sample was defined by the age of the index child: 2 years 0 months to 5 years 11 months (ages 2–5 sample), 6 years 0 months to 11 years 11 months (ages 6–11 sample), and 12 years 0 months to 17 years 11 months (ages 12–17 sample). Table 1 provides child and parent demographic information for each sample by child gender. The samples were collected with the goal of obtaining racial and ethnic representation consistent with national norms (~70%–80% White, 20% Hispanic or Latino, 10% Black or African American), allowing natural fallout on other demographic variables (e.g., education, income). With respect to race and ethnicity, these goals were generally achieved (72.7%–81.4% White across age and gender subgroups, 19.5%–26.7% Hispanic or Latino, 14.3%–23.1% Black or African American). Educational attainment, which was not selected for, was higher in the current samples than in the general U.S. population, with 94.3%–99.2% of parents holding a high school diploma or higher and 36.3%–71.2% holding a Bachelor’s degree or higher, compared to 88.6% and 33.1%, respectively, in the U.S. population (U.S. Census Bureau, 2019). Participants were parents or legal guardians of the index children; they were informed that they would be asked questions about their children’s health and well-being and provided informed consent prior to participation. The overall project had institutional review board approval, and this specific substudy was deemed exempt. Each parent completed both the CBCL and SDQ, yielding a single-group design, which is ideal for linking (Dorans, 2007). The order of the instruments was randomized for each participant, such that half received the CBCL first and the other half received the SDQ first. This study was not preregistered. Data and analysis code are not publicly available.
Table 1.
Child and Parent Demographic Information for Analysis Samples by Child Gender
| Characteristic | Ages 2–5: Female | Ages 2–5: Male | Ages 6–11: Female | Ages 6–11: Male | Ages 12–17: Female | Ages 12–17: Male |
|---|---|---|---|---|---|---|
| Number of participants | 245 | 255 | 217 | 283 | 236 | 264 |
| Child age (M) | 4.0 | 3.9 | 8.8 | 9.0 | 14.9 | 14.9 |
| Child Hispanic/Latino (%) | 20.4% | 22.7% | 26.7% | 24.0% | 19.5% | 21.6% |
| Child race (%) | ||||||
| White | 72.7% | 72.9% | 79.7% | 80.6% | 75.8% | 81.4% |
| Black or African American | 20.0% | 23.1% | 14.3% | 14.5% | 18.2% | 14.8% |
| American Indian or Alaska Native | 4.9% | 0.8% | 3.2% | 1.1% | 2.5% | 0.8% |
| Asian | 7.3% | 5.1% | 2.8% | 3.5% | 5.9% | 3.4% |
| Native Hawaiian or Pacific Islander | 2.0% | 0.8% | 1.4% | 0.4% | 0.4% | 0.4% |
| Other | 6.1% | 7.1% | 4.1% | 3.9% | 3.8% | 3.4% |
| Parent age (M) | 31.3 | 30.9 | 36.4 | 37.8 | 44.3 | 43.0 |
| Parent gender (%) | ||||||
| Female | 78.4% | 65.9% | 64.1% | 44.9% | 68.2% | 48.9% |
| Male | 21.2% | 34.1% | 35.9% | 55.1% | 31.8% | 51.1% |
| Other | 0.4% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| Parent education (%) | ||||||
| Some high school | 5.7% | 4.7% | 5.1% | 1.4% | 2.1% | 0.8% |
| High school graduate | 31.0% | 24.3% | 10.6% | 13.1% | 16.5% | 9.8% |
| Some college | 17.1% | 20.8% | 23.0% | 11.3% | 12.7% | 10.6% |
| Associate degree | 9.8% | 9.8% | 8.3% | 8.8% | 13.6% | 7.6% |
| Bachelor’s degree | 24.9% | 24.7% | 27.2% | 30.7% | 27.5% | 30.7% |
| Master’s degree | 9.8% | 12.9% | 20.3% | 26.5% | 21.2% | 28.0% |
| Professional or doctorate degree | 1.6% | 2.7% | 5.5% | 8.1% | 6.4% | 12.5% |
Table 2 provides the age range for index children and the instruments administered to each sample. The recommended age groups for the CBCL and SDQ do not fully overlap; specifically, while the CBCL norms cover 1.5 years to 5 years 11 months, 6 years to 11 years 11 months, and 12 years to 17 years 11 months, the SDQ versions are recommended for children 2 years to 4 years 11 months, 4 years to 10 years 11 months, and 11 years to 17 years 11 months, resulting in two discordant age ranges: 5 years to 5 years 11 months and 11 years to 11 years 11 months. Rather than add two additional, narrowly defined groups to our design to account for these two age ranges, we allowed some children to receive an SDQ version outside its recommended age range, administering the SDQ 2–4 years to parents of 5-year-old children (n = 127) and the SDQ 4–10 years to parents of 11-year-old children (n = 64). This yielded only three, rather than five, samples. We opted for this design because the SDQ forms differ little: only 4 items (out of 25) were changed between the SDQ 2–4 years and 4–10 years forms, and 5 items (out of 25) were changed between the SDQ 4–10 years and SDQ 11–17 years forms; most of these were minor wording changes that did not impact the item concept (e.g., SDQ 2–4 years: “Often fights with other children or bullies them” vs. SDQ 4–10 years: “Often fights with other youth or bullies them”; SDQ 4–10 years: “Nervous or clingy in new situations, easily loses confidence” vs. SDQ 11–17 years: “Nervous in new situations, easily loses confidence”). In addition, the CBCL has been normed to the general U.S. population, and we aimed to use these norms to calculate T-scores for linking and to evaluate the average level of emotional and behavioral functioning in our samples with respect to normative levels. The availability of these norms added additional utility to aligning the age structure of our samples with the norming structure of the CBCL, rather than the nonnormed age groupings defined by SDQ versions.
Table 2.
Child Age and Instrument Version for Each Score-Linking Sample
| Child age | CBCL version | SDQ version |
|---|---|---|
| 2 years, 0 months–5 years, 11 months | CBCL preschool 1.5–5 years | SDQ 2–4 years |
| 6 years, 0 months–11 years, 11 months | CBCL school-age 6–17 years | SDQ 4–10 years |
| 12 years, 0 months–17 years, 11 months | CBCL school-age 6–17 years | SDQ 11–17 years |
Note. CBCL = Child Behavior Checklist; SDQ = Strengths and Difficulties Questionnaire.
Measures
Child Behavior Checklist
The items in the CBCL correspond to problematic behaviors observed within the past 6 months, rated on a 3-point scale: 0 = Not True, 1 = Somewhat or Sometimes True, and 2 = Very True or Often True. The preschool CBCL contains 100 items covering multiple syndrome scales which can be combined into two domain scores and one total score: Internalizing domain score (Emotionally Reactive, Anxious/Depressed, Somatic Complaints, and Withdrawn); Externalizing domain score (Attention Problems and Aggressive Behavior); and the Total Problems score (i.e., all Internalizing and Externalizing syndrome scales, and Sleep Problems and Other Problems scales). The school-age CBCL similarly contains 113 items and syndrome scales that are combined to produce an Internalizing score (Anxious/Depressed, Withdrawn/Depressed, and Somatic Complaints); Externalizing score (Rule-Breaking Behavior and Aggressive Behavior); and Total score (i.e., all Internalizing and Externalizing syndrome scales, and Social Problems, Thought Problems, Attention Problems, and Other Problems). The CBCL DSM-oriented scales are not included in the current analyses.
Strengths and Difficulties Questionnaire
All versions of the SDQ contain 25 items, with 5 items measuring each of 5 dimensions: Emotional Problems, Conduct Problems, Hyperactivity, Peer Problems, and Prosocial Behavior, the last of which focuses on strengths of a child’s emotional and behavioral functioning. Like the CBCL, SDQ items are scored on a 3-point scale of 0 = Not True, 1 = Somewhat True, and 2 = Certainly True regarding the child’s behavior over the past 6 months. A Total Difficulties score can be calculated as the sum of the first four subscales. Additionally, the Emotional Problems and Peer Problems subscales can be combined to produce an Internalizing score, while Conduct Problems and Hyperactivity can be combined to produce an Externalizing score. Factor analyses have explored models including these higher order factors as an alternative latent structure of the SDQ; while results have been mixed, they generally favor the use of Internalizing and Externalizing dimensions over the less reliable subscale scores (Dickey & Blumberg, 2004; Koskelainen et al., 2001; Van Leeuwen et al., 2006). The Internalizing and Externalizing factors have also shown better convergent and discriminant validity across informants and with respect to clinical disorder than the component subscales, although the subscales can provide marginal utility over these higher order factors (Goodman et al., 2010).
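For readers implementing these composites, the following R sketch illustrates the SDQ domain scoring described above (Internalizing = Emotional Problems + Peer Problems; Externalizing = Conduct Problems + Hyperactivity; Total Difficulties = sum of the four difficulty subscales). The column names are hypothetical placeholders, not official scoring syntax, and responses are assumed to be already recoded so that higher values indicate more problems.

```r
# Illustrative SDQ domain scoring. Assumes a data frame d whose columns
# emo1..emo5, con1..con5, hyp1..hyp5, peer1..peer5 (hypothetical names)
# hold 0-2 item responses, with positively valenced items already
# reverse-scored.
score_sdq <- function(d) {
  subscale <- function(prefix) rowSums(d[paste0(prefix, 1:5)])
  data.frame(
    emotional     = subscale("emo"),
    conduct       = subscale("con"),
    hyperactivity = subscale("hyp"),
    peer          = subscale("peer"),
    internalizing = subscale("emo") + subscale("peer"),  # Emotional + Peer
    externalizing = subscale("con") + subscale("hyp"),   # Conduct + Hyperactivity
    total         = subscale("emo") + subscale("con") +
                    subscale("hyp") + subscale("peer")   # Total Difficulties
  )
}
```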
Statistical Analyses
All analyses were conducted separately for the Internalizing, Externalizing, and Total Problems domains in R (R Core Team, 2020). Confirmatory factor analyses (CFAs) were conducted using the lavaan package (Rosseel, 2012).
Preliminary Psychometric Analysis
To assess the assumptions of test linking in the CBCL and SDQ, we followed the recommendations of Choi et al. (2014). As a preliminary step, we examined item content for the two measures and calculated the correlation between SDQ and CBCL summed scores to ensure that the two measures were essentially measuring the same constructs.
The next linking assumption we tested was unidimensionality: that a single latent variable can primarily account for the pattern of covariance among the combined item responses to the two measures. To this end, we calculated statistics from classical test theory (CTT: item-total correlations) and exploratory factor analysis (EFA: first to second eigenvalue ratio, number of factors identified by parallel analysis, number of eigenvalues greater than 1) using the psych package (Revelle, 2020) in R. We also calculated coefficient α to assess the internal consistency of the combined item sets.
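As a sketch of this step, the R code below computes the CTT and EFA statistics described above using the psych package (as in our analyses); the data frame `items` is a hypothetical placeholder for the combined CBCL–SDQ item responses for one domain and sample.

```r
library(psych)

# 'items' is a hypothetical data frame of combined CBCL + SDQ item
# responses (0/1/2) for one domain in one sample.
polych <- polychoric(items)$rho            # polychoric correlations for ordinal items
evals  <- eigen(polych)$values
evals[1] / evals[2]                        # first-to-second eigenvalue ratio
sum(evals > 1)                             # number of eigenvalues greater than 1
fa.parallel(items, cor = "poly", fa = "fa")  # number of factors via parallel analysis

a <- alpha(items)
a$total$raw_alpha                          # coefficient alpha
a$item.stats$r.drop                        # corrected item-total correlations
```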
Our unidimensionality assessment also included CFAs of the combined CBCL and SDQ item sets. CFA permits more specific tests of unidimensionality than CTT and EFA, including whether residual relationships between items exist after accounting for the common influence of the single latent variable. CFA also enables the calculation of coefficient omega (ω; McDonald, 1999), a reliability coefficient which can be a better indicator than coefficient α of the reliability of the items as indicators of a single common factor (Revelle & Zinbarg, 2009). In particular, ω relies on the unidimensionality of a measure and, given that assumption, quantifies the proportion of variance in total scores attributable to the underlying latent variable using parameter estimates from a fitted confirmatory factor model. To estimate coefficient ω, we estimated a one-factor CFA model using the WLSMV estimator (Muthén, 1984; Muthén et al., 1997), which properly accounts for the ordinal level of measurement in the item responses and thus provides the most theoretically justifiable tests of the fit of the data to a unidimensional model. We assessed the fit of the resulting models using commonly used statistical indices and benchmark values (Hopwood & Donnellan, 2010; Lance et al., 2006), including the comparative fit index (CFI; >.90 = adequate fit, >.95 = very good fit), the Tucker–Lewis index (TLI; >.90 = adequate fit, >.95 = very good fit), and the root-mean-square error of approximation (RMSEA; <.10 = adequate fit, <.05 = very good fit) and its associated confidence interval. To assess residuals, we calculated the standardized root-mean-square residual (SRMR) as a model-wide measure of residuals and calculated the z statistic (estimate/SE) for differences between the observed and model-implied covariance matrices to identify the locations of statistically meaningful residuals. Finally, we calculated coefficient ω using the method described in Green and Yang (2009), which uses the categorical CFA model estimated with WLSMV to determine the proportion of variance in total scores attributable to the latent variable in the categorical factor model.
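A minimal lavaan sketch of this categorical one-factor CFA follows; the data frame `dat` and item names are hypothetical, and the omega call via semTools is shown as one possible implementation of the Green and Yang (2009) approach rather than our exact code.

```r
library(lavaan)

# One-factor categorical CFA of a combined CBCL + SDQ item set (WLSMV).
# 'dat' and the item names i1..i20 are hypothetical placeholders.
items <- paste0("i", 1:20)
model <- paste("g =~", paste(items, collapse = " + "))

fit <- cfa(model, data = dat, ordered = items, estimator = "WLSMV")

# Global fit indices and RMSEA confidence interval
fitMeasures(fit, c("cfi", "tli", "rmsea", "rmsea.ci.lower",
                   "rmsea.ci.upper", "srmr"))

# Standardized residuals to locate statistically meaningful local misfit
lavResiduals(fit)

# Categorical omega (Green & Yang, 2009), e.g., via semTools
semTools::compRelSEM(fit, ord.scale = TRUE)
```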
Linking the CBCL and SDQ
Multiple linking methods were used and compared to identify potential problems with linking and to evaluate the sensitivity of the linkage to the assumptions underlying each method (Kolen & Brennan, 2004). To link the CBCL and SDQ, we used the following statistical techniques (Choi et al., 2010): IRT-based fixed-parameter calibration (Haebara, 1980) and nonparametric equipercentile linking (Kolen & Brennan, 2004; Lord, 1982). To conserve space, details on these methods, their statistical frameworks, and their strengths and weaknesses are described in Supplemental Methods. Linking analyses were conducted in R using the mirt package (Chalmers, 2012) for item response theory (IRT) parameter estimation and the PROsetta (Choi & Lim, 2020) and equate packages (Albano, 2016) to derive linking functions.
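To make the equipercentile step concrete, here is a hedged sketch using the equate package; the score vectors, their ranges, and the smoothing degree are hypothetical, and the full pipeline (including IRT-based linking via mirt and PROsetta) is described in Supplemental Methods.

```r
library(equate)

# 'sdq' and 'cbcl' are hypothetical vectors of summed scores from the
# same respondents (single-group design, so the marginal distributions
# come from equivalent "groups" by construction).
x <- freqtab(sdq,  scales = 0:20)   # example SDQ raw-score range
y <- freqtab(cbcl, scales = 0:48)   # example CBCL raw-score range

# Equipercentile linking with loglinear presmoothing; the polynomial
# degree shown is illustrative, not the value used in the study.
link <- equate(x, y, type = "equipercentile",
               smoothmethod = "loglinear", degrees = 3)

link$concordance   # crosswalk: each SDQ score mapped to a CBCL score
```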
Each linking method yields a linking function, a (typically nonlinear) function relating scores on each instrument to a corresponding score on the other. To evaluate these functions, we statistically examined the relationships between linked and actual CBCL and SDQ scores, calculating the correlation between linked and observed scores, the bias in linked scores, and the standard deviation of differences between linked and observed scores. We also constructed Bland–Altman plots (Bland & Altman, 1999), which graph the average of each observed and linked score on the X-axis and the difference between these scores on the Y-axis, to assess linking bias across the score range. Crosswalk tables were then constructed using the selected linking function to convert SDQ raw scores to CBCL raw scores (and vice versa) and SDQ raw scores to CBCL T-scores (and vice versa). Crosswalk tables have been made publicly available at the American Psychological Association’s (APA) Open Science Framework (OSF) repository and can be accessed at https://osf.io/n5s9u/.
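The agreement statistics and Bland–Altman construction are straightforward to reproduce; the base-R sketch below assumes hypothetical vectors `observed` and `linked` on a common score metric.

```r
# Agreement statistics for linked vs. observed scores ('observed' and
# 'linked' are hypothetical numeric vectors on the same metric).
bias    <- mean(linked - observed)   # mean signed difference (linking bias)
sd_diff <- sd(linked - observed)     # SD of differences
r       <- cor(linked, observed)     # correlation between linked and observed

# Basic Bland-Altman plot: mean of the two scores vs. their difference,
# with the bias line and approximate 95% limits of agreement.
avg  <- (linked + observed) / 2
diff <- linked - observed
plot(avg, diff,
     xlab = "Mean of observed and linked score",
     ylab = "Linked minus observed")
abline(h = c(bias, bias - 1.96 * sd_diff, bias + 1.96 * sd_diff),
       lty = c(1, 2, 2))
```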
Subgroup Invariance
Once linking functions were derived, we assessed their invariance across subgroups (subgroup invariance; Choi et al., 2014) in the male and female preschool samples, repeating the assessments outlined above in these subsamples using the linking function for the combined preschool sample. Bland–Altman plots were constructed for each domain in the combined preschool sample, assessing the performance of the combined linking function in each subsample, and for the four linking functions for each domain in the school-age samples. Since the ages at which the SDQ was administered did not exactly overlap with the recommended age ranges for each version, we repeated the evaluation of the relationships between linked and actual scores using only children aged 5 and 11 years (i.e., those administered an SDQ version outside its recommended range) as a final test of subgroup invariance.
Assessing Measurement Invariance Between SDQ Versions
In practice, desired conversions between scores may not coincide perfectly with the administrations presented herein; for example, an 11-year-old child’s SDQ 11–17 score may need to be evaluated according to the CBCL T-score standards for ages 6–11 years, but the CBCL for ages 6–11 years was co-administered with the SDQ 4–10 in this study. A finding of measurement invariance across SDQ age forms would justify the use of SDQ to CBCL crosswalk tables in such nonstandard settings. To evaluate this measurement invariance, we combined the three samples of SDQ item responses and estimated confirmatory factor models four times, imposing measurement invariance constraints by gender, age, neither, or both, with freed structural parameters (means, variances) across groups and constrained measurement parameters. These models were estimated using the CFA of polychoric correlations, as described above, to obtain structural equation modeling (SEM)-based fit statistics and residuals. To obtain deviance statistics, including Akaike’s information criterion (AIC), AIC with correction for small sample size (AICc), Bayesian information criterion (BIC), and sample size adjusted BIC (SABIC), we also estimated these models in an IRT framework. Fit comparisons and deviance-based tests of differences in fit between these models constitute omnibus tests of metric and scalar invariance: If applying constraints on measurement parameters (item slopes and intercepts in IRT; factor loadings and thresholds in CFA) does not meaningfully impact model fit, then metric (factor loading/item slope) and scalar (threshold/intercept) invariance can be assumed to hold across SDQ versions, justifying the application of SDQ–CBCL crosswalk tables in nonstandard settings. We assumed that all factor structures were unidimensional and identical across groups and, when constraining measurement parameters to equality across groups, we freely estimated the mean and variance of the latent variables in the two older age groups to account for potential group differences in the mean and variance of the corresponding latent trait; in short, we assumed configural invariance, but did not assume structural invariance (Vandenberg & Lance, 2000).
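As one possible implementation of the CFA side of this omnibus test, the lavaan sketch below fits a configural model and a model with constrained measurement parameters across age groups; the variable names are hypothetical, and the parallel IRT deviance comparisons (AIC, AICc, BIC, SABIC) via mirt are omitted here.

```r
library(lavaan)

# 'dat' holds SDQ item responses with a grouping variable 'age_group';
# 'sdq_items' is a hypothetical character vector of item names.
model <- paste("g =~", paste(sdq_items, collapse = " + "))

# Configural model: same one-factor structure, all parameters free by group.
fit_free <- cfa(model, data = dat, group = "age_group",
                ordered = sdq_items, estimator = "WLSMV")

# Constrained model: loadings and thresholds equated across groups;
# lavaan frees latent means/variances in non-reference groups for
# identification, mirroring the freed structural parameters above.
fit_inv <- cfa(model, data = dat, group = "age_group",
               ordered = sdq_items, estimator = "WLSMV",
               group.equal = c("loadings", "thresholds"))

# Little change in fit under the constraints supports metric/scalar invariance.
fitMeasures(fit_free, c("cfi", "tli", "rmsea", "srmr"))
fitMeasures(fit_inv,  c("cfi", "tli", "rmsea", "srmr"))
```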
Construction of Analysis Samples
In practice, it may be necessary to convert SDQ scores to either CBCL raw scores or CBCL T-scores, necessitating the construction of two sets of conversion tables: SDQ raw score to CBCL raw score (and vice versa), and SDQ raw score to CBCL T-score (and vice versa). These conversion options permit two methods of calculating CBCL T-scores from SDQ raw scores: (a) convert SDQ raw scores to CBCL raw scores, then convert the resulting scores to CBCL T-scores using the CBCL’s T-score conversion tables or (b) convert SDQ raw scores directly to CBCL T-scores using a separate linking function derived for this purpose. To optimize the concordance between these two methods, we evaluated linking assumptions and derived linking functions in samples which matched the age–gender structure of the T-score conversion tables of the CBCL: combining male and female preschool-age children into a single preschool sample, while analyzing each school-age group (6–11, 12–17) and gender (male, female) combination separately. We also evaluated linking assumptions and derived and evaluated linking functions in the male and female preschool samples separately to assess subgroup invariance for these two groups.
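The two conversion routes can be made concrete with a small lookup sketch; the data frames and column names below are hypothetical stand-ins for the published crosswalk tables and the CBCL norming tables.

```r
# Two routes from an SDQ raw score to a CBCL T-score. 'raw_crosswalk'
# (columns sdq_raw, cbcl_raw), 'cbcl_norms' (columns raw, t), and
# 't_crosswalk' (columns sdq_raw, cbcl_t) are hypothetical lookup tables.
sdq_raw <- 7

# (a) SDQ raw -> CBCL raw (crosswalk), then CBCL raw -> T (CBCL norms);
#     linked raw scores are rounded before the norm lookup.
cbcl_raw <- raw_crosswalk$cbcl_raw[raw_crosswalk$sdq_raw == sdq_raw]
t_two_step <- cbcl_norms$t[cbcl_norms$raw == round(cbcl_raw)]

# (b) SDQ raw -> CBCL T directly, via a separate linking function.
t_direct <- t_crosswalk$cbcl_t[t_crosswalk$sdq_raw == sdq_raw]

c(two_step = t_two_step, direct = t_direct)
```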
Results
Content Comparison
The CBCL is far more comprehensive than the SDQ, with more items in each of the Internalizing, Externalizing, and Total Problems domains. Generally, the set of CBCL items used to calculate a score includes, in a slightly less verbose form, the set of SDQ items used to calculate that score (e.g., school-age CBCL item Fears and SDQ 4–10 item Many fears, easily scared). However, many differences exist between the two measures. First, in the Internalizing domain, the SDQ includes only one item on somatic complaints (e.g., Often complains of headaches), while the CBCL includes a whole subscale of Somatic Complaints in its Internalizing score. The SDQ includes many items on social interactions (e.g., Has at least one good friend; Picked on or bullied by other children) in its Peer Problems subscale, which is included in Internalizing, but the CBCL does not include similar items in its Internalizing score. For Externalizing, the CBCL includes many more items on disobedience, illegal behavior, and moodiness, which are represented by only a small number of less severe items (e.g., Generally obedient; Often lies or cheats; Steals from home, school, or elsewhere) in the SDQ. Furthermore, while the SDQ Hyperactivity scale, included in the Externalizing score, includes items relating to both hyperactivity (e.g., Restless, overactive) and attention (e.g., Easily distracted, concentration wanders), such items appear on the CBCL Attention Problems subscale (e.g., Impulsive; Can’t concentrate), which is included in the Externalizing domain on the preschool CBCL but not the school-age CBCL. Finally, and in addition to these differences between the Internalizing and Externalizing items, the Total Problems score on the CBCL contains many items with no corresponding items in the SDQ, including Thought Problems (e.g., Twitching; Sees things) and the broad set of problems included in Other Problems (e.g., Cruel to animals; Wets the bed; Shows off). Any imperfections in the performance of the linking functions derived herein can potentially be attributed to these nontrivial content differences.
Preliminary Psychometric Analysis
Figure 1 contains statistics used to evaluate unidimensionality, reliability, and between-test correlations in the combined CBCL–SDQ item sets. Across all samples and domains, the reliability of the combined item sets was high (α > .95, ω > .94). EFA and CFA metrics supported the unidimensionality of the combined Internalizing items across all samples: ratios of first to second eigenvalues in EFA were above 11, SDQ and CBCL scores were correlated above .83, and CFA model fit was very good (CFI > .96, TLI > .95, RMSEA < .047, SRMR < .08). For all samples except the preschool samples, the same trends held for the Total and Externalizing scores, albeit with less agreement between summed scores for Externalizing (.72 < r < .76) than Total (.83 < r < .86) scores. In the preschool samples (separate and combined), correlations between CBCL and SDQ summed scores were high for Externalizing and Total scores (.81 < r < .86); however, the first to second eigenvalue ratio was lower (>5.9) and CFA model fit was noticeably worse (CFI > .91, TLI > .91, RMSEA < .068, SRMR < .11), though still considered “adequate” according to our a priori benchmarks, suggesting some multidimensionality in these domains and samples.
Figure 1. Unidimensionality Assessment of Combined CBCL and SDQ Items.

Note. (a) Classical test theory and exploratory factor analysis (CTT/EFA) statistics; (b) confirmatory factor analysis (CFA) statistics. CBCL = Child Behavior Checklist; SDQ = Strengths and Difficulties Questionnaire; PA # Fact = number of factors identified in parallel analysis; Evals = eigenvalues.
A detailed analysis of EFA parameter estimates and CFA residuals revealed sizable local dependencies among positively valenced items in all versions of the SDQ, a phenomenon which has been reported elsewhere (e.g., van de Looij-Jansen et al., 2011). By “positively valenced” items, we refer to items which indicate a lack of emotional or behavioral problems (e.g., 7. Generally obedient), rather than the presence of emotional or behavioral problems (e.g., 5. Often loses temper). To account for these local dependencies, we performed a post hoc modification of our confirmatory factor models, reestimating them after adding a positive valence factor orthogonal to the general factor which loaded only on the positively valenced SDQ items (7, 11, 14, 21, and 25 in all SDQ forms), yielding a two-tier model (e.g., Cai, 2010). Model fit information for these models is presented in Figure 1 alongside the corresponding information for the unidimensional models. Because these factors were orthogonal, we were able to calculate coefficient ω hierarchical (ωh; Gignac, 2015), also reported in Figure 1, allowing us to assess the proportion of item-level variance accounted for by the general factor in each categorical factor model. This modification resulted in some fit indices shifting from “adequate” to “very good” for the Externalizing scales in the 2–5 age group but resulted in little change to model-based reliability (ω ≈ ωh ≈ .99); this indicates that, even when valence effects are taken into account, an overwhelming majority of total score variance in all domains and samples is attributable to the general factor. In the estimated CFA models, threshold values were similar between the two measures, while factor loadings on the general factor in the SDQ were markedly lower for items with positive valence. This, combined with the larger number of items in the CBCL, suggests that the CBCL is to be preferred when the most reliable measurement is desired, but otherwise the SDQ is an adequate substitute. See Supplemental Results, Supplemental Figures S2a–S2e and S3a–S3e, and Supplemental Tables S2a–S2o for specific discussion and parameter estimates for the categorical CFA models.
Linking the CBCL and SDQ
Differences between linking functions (IRT-based and the two equipercentile methods) were negligible; however, the equipercentile linking function with loglinear smoothing exhibited two practical advantages over the other two linking functions, and we selected this method for constructing crosswalk tables. First, IRT-based linking was not possible when CBCL T-scores were used, and if different linking functions were used for raw scores and T-scores, the resulting conversion tables may cause confusion.1 Thus, we found it most efficient to use equipercentile linking for all score types. Second, loglinear smoothing permits interpolation and extrapolation to some scores which were not observed in the data, allowing us to construct more complete crosswalk tables.
Statistical assessments of linking functions are presented in panel (a) of Figure 2 for each school-age age–gender subgroup and the combined preschool sample. Across all samples and domains, correlations between linked and observed scores were above .82, except for Externalizing in the school-age samples, where they reached a minimum of about .75 for females ages 12–17, and bias was very low in all samples and domains. Bland–Altman plots for the combined preschool sample (Supplemental Figure S1a) revealed minimal bias, as demonstrated by the trend lines being close to zero across the score range, while Bland–Altman plots for the separate school-age samples (Supplemental Figure S1b) illustrated even better performance than the preschool conversions, demonstrated by an even tighter fit of the trend lines around zero.
Figure 2. Crosswalk Conversion Statistics for Linking Samples (a) and Subsamples (b).

Note. CBCL = Child Behavior Checklist; SDQ = Strengths and Difficulties Questionnaire; IRT = IRT-based fixed-parameter equating; Equip = equipercentile equating with no smoothing; EquipL = equipercentile equating with loglinear smoothing. Subsamples in (b) include gender subsamples for the preschool CBCL (ages 2–5) and samples which were administered a different version of the SDQ than recommended for their age group (ages 5 and 11). The Y-axis on the middle and right-hand panels is in standard deviation units with respect to the destination metric (e.g., SDQ score for CBCL-to-SDQ comparisons).
Not surprisingly, the linking functions performed more poorly in their subsamples than in the samples used for their estimation, as illustrated in panel (b) of Figure 2, and the degree of deterioration depended on the subsample. The male and female preschool subsamples had comparable performance to the full preschool sample, with some increase in variability of differences between linked scores but low bias and high correlation between linked and observed scores. As in the full preschool sample, the Externalizing score linking performed more poorly than the Internalizing and Total linking in the preschool subsamples. The same pattern generally held in groups which were administered a version of the SDQ other than the recommended version for their age group. In these groups, the Externalizing factor exhibited poor performance, with low correlations between linked and observed scores, large mean differences between linking and observed scores, or both. In contrast, these statistics were much better for the Internalizing and Total scores in these subsamples, albeit not as good as those for the samples used to derive the linking functions. We accept the latter set of differences as minor in size (e.g., mean difference of 0.1 SDs or less, corresponding to a small effect size in the Cohen’s d metric) and conclude that, other than Externalizing, the crosswalk tables generated herein generalize well to these nonstandard subsamples.
Finally, we constructed crosswalk tables linking SDQ total scores to CBCL T-scores and vice versa using equipercentile linking with loglinear smoothing. Root mean squared differences between T-scores derived from SDQ summed scores directly and T-scores derived by first converting SDQ summed scores to CBCL summed scores were at most 1.5 T-score points, most of which is likely due to the rounding of CBCL summed scores required in the intermediate step of the latter approach. Thus, for each combination of sample (combined 2–5, male 6–11, female 6–11, male 12–17, female 12–17) and domain (Internalizing, Externalizing, Total Problems) used for deriving summed score-linking functions, we computed direct SDQ to CBCL T-score crosswalk tables, which are included with raw score conversion tables in Supplemental Tables S1a–S1o.
Assessing the Measurement Invariance of SDQ 2–4, SDQ 4–10, and SDQ 11–17
Figure 3 contains fit statistics from CFA and deviance statistics from IRT analyses for multiple group models of the SDQ, where groups are defined by gender (male and female) and age (2–5, 6–11, 12–17). Two-tier models with a positive valence factor are also included in Figure 3; these models generally fit better than those for the corresponding unidimensional models by both fit and deviance metrics, with generally unacceptable fit for the unidimensional models (except for Internalizing, where fit was generally acceptable) and large differences in deviance between unidimensional and two-tier models (~175 points for Internalizing; ~530 points for Externalizing; ~985 points for Total Problems). Within each set of models, there was little difference in relative fit according to SEM fit statistics for the different levels of measurement invariance, with a slight preference for the models with measurement invariance by both age and gender due to the larger number of degrees of freedom in these models. Models with measurement invariance by gender, but not by age, tended to have the best CFA-based model fit and lowest AIC and AICc, while models with measurement invariance by age and gender tended to have the lowest BIC and SABIC, although statistics for these models were generally similar regardless of the metric used. Based on these results, we concluded that the SDQ is essentially measurement invariant across its three forms, and that the crosswalk tables included herein can be used for nonstandard conversions. This also suggests, but does not prove, that the items with slight wording variations across SDQ forms function similarly across age ranges, and a more targeted study in which the same raters receive both variations of each item would be needed to verify that the unique items in each SDQ form have identical functioning in nonstandard age ranges.
Figure 3. Assessment of Measurement Invariance in the Strengths and Difficulties Questionnaire (SDQ).

Note. The displayed deviance values represent the difference between each statistic and the minimum of that statistic, calculated by domain.
Discussion
In this study, we linked three domains (Internalizing, Externalizing, and Total Problems) of the CBCL and SDQ according to the age and gender structure of the CBCL T-score conversion tables. The Internalizing and Total scores on the CBCL and SDQ were essentially unidimensional when combined into a single item set, with noticeable but not meaningful deviations from unidimensionality. In contrast, the Externalizing domain deviated slightly from unidimensionality in the 2–5 sample, albeit with acceptable unidimensional model fit according to the metrics used, and in the school-age samples, the Externalizing summed scores were not as strongly correlated between the CBCL and SDQ as the other domains. Results were essentially identical after combining males and females within the preschool (2–5) sample.
Across the three linking methods used (IRT-based fixed-parameter, equipercentile with and without loglinear smoothing), bias was minimal, but for the Externalizing domain in the 6–11 and 12–17 samples, correlations between linked and observed scores were lower than for other domains and samples. The three score-linking methods yielded roughly identical results; therefore, to enable extrapolation and interpolation of linked scores, equipercentile equating with loglinear smoothing was chosen to construct the crosswalk tables. Aside from the Externalizing domain, linking functions yielded acceptable results when applied to the preschool gender subsamples and to subsamples in which children were administered a version of the SDQ other than the recommended version for their age. The three versions of the SDQ were essentially measurement invariant, justifying the conversion of any SDQ score to the metric of the preschool or school-age CBCL and other nonstandard conversions.
Equipercentile linking was used to construct conversions between SDQ summed scores and CBCL T-scores, with little difference in CBCL T-scores generated by first linking SDQ summed scores to CBCL summed scores or by directly linking SDQ summed scores to CBCL T-scores. The full set of crosswalk tables (Supplemental Tables S1a–S1o) can be used to convert SDQ to CBCL scores and vice versa, where CBCL scores are either provided/desired as T-scores or summed scores. While potentially unwieldy, this large number of tables provides a comprehensive map between scores on these two instruments across their many parent-report versions. Furthermore, these tables permit the placement of SDQ scores onto a normed metric, namely that of the CBCL, providing an ability to compare scores to their expected distribution in the general population.
As mentioned, score-level linking is one of several possible methods for IDA. Another approach is to estimate factor scores using item response models estimated on combined data from multiple instruments (Curran & Hussong, 2009; Hussong et al., 2013). For researchers interested in applying this approach instead, we include our combined CBCL–SDQ CFA parameter estimates in Supplemental Tables S2a–S2o. These parameters can be used to estimate factor scores, which share the same generalizability limitations as our current samples, using lavaan, Mplus (Muthén & Muthén, 1998–2011), or similar software.
Practical Limitations
Compared to the crosswalk tables presented in a previous CBCL–SDQ linking study conducted in a residential care setting (Stevens et al., 2021; Table 2), the crosswalk tables accompanying this report predict lower CBCL Total Problems T-scores from lower SDQ Total Difficulties scores (mean difference of −2.56 for SDQ = 0), but higher CBCL scores from higher SDQ scores (mean difference of 19.5 for SDQ = 36). Interestingly, for the age groups overlapping between the two studies (ages 12–17), mean SDQ raw scores (9.9–10.7) and CBCL T-scores (54.1–56.8) reported herein were fairly similar to those reported in Stevens et al. (12.1 and 56.9, respectively), suggesting that differences in crosswalk conversions are a function of sample characteristics other than the distribution of observed scores. Specifically, the sample in Stevens et al. was recruited through a residential care facility and, relative to the current sample, had a higher percentage of male participants (60%) and of Black or African American participants (24.6%), with a smaller percentage of Hispanic or Latino (8.8%). While no link is expected to hold in all possible subpopulations (Newton, 2010), these differences suggest that future research may be needed to assess the generalizability of CBCL–SDQ crosswalk conversions across certain groups.
The current work and Stevens et al. (2021) share several broader limitations. Both studies targeted specific populations and the linking functions derived in these works may not generalize to new populations of interest. The choice of rater is also a potential methodological artifact of both studies; in the current work, all reporters were the parents of the target child, while in Stevens et al., reporters also included other caregivers, parole officers, and program officers. These reporting protocols limit the generalizability of the linking functions derived herein and in Stevens et al. to other reporters (e.g., medical professionals, teachers) not represented in these analyses. Given the differences between the linking functions derived herein and those in Stevens et al., users should acknowledge any differences between their population and/or rater and those used in deriving these linking functions and may consider conducting a targeted linking study for their population of interest.
For many analyses, not all SDQ or CBCL summed scores or CBCL T-scores were represented in our data; however, equipercentile linking with loglinear smoothing allows for interpolation and extrapolation of linked scores, such that the scores presented in Supplemental Tables S1a–S1o span the range of possible observed scores on the to-be-linked instruments. While the interpolated scores are likely as reliable as the scores which were directly linked, we repeat the general statistical recommendation to treat extrapolated scores with caution. Table 3 lists the observed mean and range of all linked scores (summed scores and T-scores), and values outside of this range should be treated as potentially unreliable, especially when interpreted for a given individual, for example, in clinical decision-making for an individual child.
Table 3.
Observed Score Ranges in Linking Samples
| Dimension | Age | Gender | SDQ raw: min | SDQ raw: mean | SDQ raw: max | CBCL raw: min | CBCL raw: mean | CBCL raw: max | CBCL T: min | CBCL T: mean | CBCL T: max | % Borderline | % Clinical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Internalizing | 2–5 | Both | 0 | 4 | 17 | 0 | 12.5 | 64 | 29 | 53.2 | 95 | 7.7 | 14.6 |
| Internalizing | 6–11 | Female | 0 | 4.3 | 16 | 0 | 10.7 | 62 | 33 | 54.8 | 99 | 8.9 | 15.5 |
| Internalizing | 6–11 | Male | 0 | 5.4 | 16 | 0 | 14 | 64 | 34 | 58.9 | 100 | 8.4 | 25.1 |
| Internalizing | 12–17 | Female | 0 | 4.9 | 15 | 0 | 13.2 | 60 | 33 | 55 | 99 | 9.6 | 17 |
| Internalizing | 12–17 | Male | 0 | 5.3 | 17 | 0 | 13.2 | 64 | 34 | 58.2 | 100 | 8.9 | 24.3 |
| Externalizing | 2–5 | Both | 0 | 7.1 | 20 | 0 | 14.2 | 47 | 28 | 42.4 | 63 | 0 | 0 |
| Externalizing | 6–11 | Female | 0 | 5.5 | 19 | 0 | 10.7 | 65 | 34 | 53.9 | 96 | 3.8 | 16 |
| Externalizing | 6–11 | Male | 0 | 6.7 | 19 | 0 | 15.8 | 68 | 33 | 58.3 | 98 | 5.1 | 25.8 |
| Externalizing | 12–17 | Female | 0 | 5.1 | 16 | 0 | 12.2 | 63 | 34 | 51.5 | 96 | 6.1 | 12.2 |
| Externalizing | 12–17 | Male | 0 | 5.4 | 20 | 0 | 12.4 | 70 | 34 | 54.8 | 100 | 5.4 | 21.2 |
| Total | 2–5 | Both | 0 | 11.1 | 34 | 0 | 43 | 170 | 28 | 53.5 | 95 | 8.7 | 13.2 |
| Total | 6–11 | Female | 0 | 9.8 | 30 | 0 | 39.9 | 224 | 25 | 54.4 | 97 | 6.6 | 16.9 |
| Total | 6–11 | Male | 0 | 12.1 | 34 | 0 | 55.3 | 224 | 24 | 58.8 | 97 | 5.8 | 29.5 |
| Total | 12–17 | Female | 0 | 9.9 | 31 | 0 | 45.9 | 214 | 24 | 54.1 | 97 | 10.5 | 15.7 |
| Total | 12–17 | Male | 0 | 10.7 | 35 | 0 | 46.5 | 238 | 24 | 56.8 | 100 | 5.8 | 25.9 |
Note. CBCL = Child Behavior Checklist; SDQ = Strengths and Difficulties Questionnaire; min = minimum; max = maximum.
Table 3 also contains the percentages of individuals in each subgroup with CBCL T-scores classified as Borderline (between 65 and 70) or Clinically Meaningful (above 70). Although the samples were constructed to resemble the general population, the percentages reported in Table 3 differ from what would be expected based on a normal distribution (6.7% Borderline and 9.1% Clinical). In particular, Externalizing scores were very low in the youngest subsample (ages 2–5), with no individuals scoring in the Borderline or Clinical range, while all other scores and subgroups had much higher percentages of Clinical scores (12.2%–29.5%) than expected in a general population sample. While these distributions could be due to idiosyncrasies of our sample, which was not collected through methods traditionally used in norming studies (e.g., stratified random sampling), they could also be indicative of outdated norms, as the current sample was collected in 2020 and the CBCL norms used herein were established in 2001. This time gap could influence standards for “normative” emotional and behavioral problems, similar to the Flynn effect in intelligence testing (Flynn, 1987), and an updated norming study for the CBCL could clarify this issue.
Due to the poor fit of the unidimensional Externalizing and Total models in the preschool forms and the relatively low correlation between observed and equated Externalizing scores in the school-age forms, we recommend exercising caution when applying the Externalizing conversions and preschool Total conversions. For preschool Externalizing and Total scores, while the linkages themselves were fairly stable, these composite scores on what are essentially multidimensional measures (e.g., Externalizing including conduct, hyperactivity, and inattention symptoms) can be difficult to interpret. The lower correlation and poorer subgroup invariance in the school-age Externalizing linkages are more concerning for placing scores onto a common metric, as they suggest that scores converted using these crosswalk tables may not be as similar as desired to the scores that would have been obtained on the instrument that was not administered. These differences could be due to the aforementioned differences in the operationalizations of Externalizing in the two measures: While the Externalizing domain in the school-age CBCL focuses on conduct problems, half of the Externalizing items in the SDQ focus on attention and hyperactivity problems, an entirely different domain. We therefore recommend that Externalizing scores on these two measures not be treated, strictly speaking, as fully linked measures of the same underlying construct.
In clinical practice, the Externalizing and preschool Total crosswalk tables may be unsuitable for application at the level of the individual in high-stakes decisions, for instance, when assessing changes in functioning in an inpatient or residential care setting as a decision criterion for clinical or social welfare decisions. In general, these qualifications do not apply to the Internalizing conversions for any age group or Total scores for the school-age groups, for which unidimensionality and linking analyses resulted in more reliable linking functions. Correlations between CBCL and SDQ summed scores for these domains and samples fluctuated around .866, which is a commonly used cutoff for classifying linked scores as equated, indicating interchangeability of the two sets of scores (Dorans & Walker, 2007; Newton, 2010). Because these correlations are close to the cutoff, we would not recommend treating any linked scores as completely equivalent in high-stakes decisions for individuals; rather, these linked scores can be considered the most likely scores that would be obtained had the to-be-linked instrument been administered instead. As with almost any psychological test, high-stakes decisions should not be based on (linked) test scores alone but in combination with clinical severity of disorder and level of functioning as assessed by individuals trained in making such assessments.
That being said, the aforementioned issues with the Externalizing and preschool Total scores do not necessarily preclude the use of their associated crosswalk conversions in research that requires pooling data from these measures, for example, in transdisciplinary consortium research or post hoc data aggregation. Despite these issues, the linked scores remain the best estimate of the score that would have been obtained on the not administered measure given an observed score on the administered measure, and placing scores on a common metric, although imperfect, retains utility over conducting no harmonization at all (e.g., Skaggs, 2005). Rather, we recommend that analysts using these linking functions conduct sensitivity analyses to determine whether the instrument used to generate a score influences their results; in regression analyses, this would involve including an indicator denoting the administered test as a covariate and moderator of the effect(s) of interest. Assuming otherwise equivalent samples, significant moderation would indicate that conclusions may be sensitive to the measure used, any other differences between administrations (e.g., population differences) notwithstanding. While this sensitivity analysis is an important tool to use whenever scores are linked to a common metric, it is particularly important when linkages are not perfectly reliable, as with the Externalizing and preschool Total Problems crosswalks presented herein. These linkages, when properly incorporated into pooled data analyses, provide a valuable tool for increasing the statistical power available to answer important research questions in child psychosocial health while permitting the statistical assessment of method effects caused by differences in assessment.
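As a concrete example of this recommendation, a minimal R sketch follows; the variable names are hypothetical, and in practice the model would also include the study's other covariates.

```r
# Sensitivity analysis for pooled analyses of linked scores. 'pooled' is a
# hypothetical data frame with an outcome, the harmonized score, and an
# indicator of which instrument (CBCL or SDQ) produced each score.
fit <- lm(outcome ~ linked_score * instrument, data = pooled)
summary(fit)  # a significant interaction flags measure-dependent conclusions
```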
Furthermore, it should be acknowledged that the correlation between linked and observed scores was not perfect and that there is inherent error in the process of converting scores on one instrument to equated scores on the other. This error is partly a function of sample size: while the sample sizes used to derive linking functions here (~250–500) are well beyond those needed to yield meaningful statistical gains from equating scores (50–200; Aşiret & Sünbül, 2016), they are smaller than in some large-scale test linking studies, for example, in high-stakes achievement test linking (e.g., Pommerich et al., 2000).
When both SDQ and CBCL scores must be used in the same analysis, researchers have an option of transforming CBCL scores into SDQ scores or vice versa. The CBCL has more items than the SDQ and is a more reliable measure; therefore, when roughly equal numbers of individuals complete both instruments, we recommend transforming SDQ scores into CBCL scores. However, if the number of SDQ scores is considerably higher than the number of CBCL scores, we recommend transforming the CBCL scores to SDQ scores instead, recognizing the tradeoff between the reliability of each score and the necessary loss of reliability when scores from one instrument are linked to the metric of another.
Conclusion
With an estimated 17 million children in the U.S. facing a mental health problem (Child Mind Institute, 2015), robust pediatric screening is of utmost importance. Pragmatic approaches to assessing and monitoring children’s emotional and behavioral functioning are required for such screening to become commonplace. The present study addresses this need by providing health care providers and researchers a way to convert scores between two popular measures. While caution is warranted when using such harmonized scores in practice, particularly at the individual level, the ability to score the CBCL and SDQ on a common metric via the crosswalks developed in this study provides a much-needed option to expand the utility of these measurement systems and, ultimately, of pediatric mental health screening in general.
Public Significance Statement.
We applied modern score-linking methodology to create score conversion tables for the Child Behavior Checklist and the Strengths and Difficulties Questionnaire, two commonly used measures of emotional and behavioral problems in children, allowing scores on each instrument to be converted to the metric of the other.
Acknowledgments
Data collection and preliminary analyses were sponsored by the Environmental Influences on Child Health Outcomes (ECHO) program, Office of the Director, National Institutes of Health, under Award Number U24OD023319 with co-funding from the Office of Behavioral and Social Sciences Research (OBSSR; Person Reported Outcomes Core). This study was not preregistered. We have no conflicts of interest to disclose.
Footnotes
Crosswalk tables for the Internalizing, Externalizing, and Total Problems domains of the Child Behavior Checklist (CBCL) and Strengths and Difficulties Questionnaire (SDQ) are publicly available at the American Psychological Association’s (APA) Open Science Framework (OSF) repository (https://osf.io/n5s9u/). Data and analysis code are not publicly available.
Supplemental materials: https://doi.org/10.1037/pas0001083.supp
In particular, converting an SDQ raw score directly to a CBCL T-score may yield a different T-score than first converting the SDQ raw score to a linked CBCL raw score and then converting that raw score to a T-score, unless equipercentile linking is used.
References
- Achenbach TM (1991). Manual for the Child Behavior Checklist/4-18 and 1991 profile. Department of Psychiatry, University of Vermont.
- Achenbach TM, Becker A, Döpfner M, Heiervang E, Roessner V, Steinhausen H-C, & Rothenberger A (2008). Multicultural assessment of child and adolescent psychopathology with ASEBA and SDQ instruments: Research findings, applications, and future directions. Journal of Child Psychology and Psychiatry, and Allied Disciplines, 49(3), 251–275. 10.1111/j.1469-7610.2007.01867.x
- Achenbach TM, & Rescorla L (2001). Manual for the ASEBA school-age forms & profiles: An integrated system of multi-informant assessment. ASEBA.
- Achenbach TM, & Ruffle TM (2000). The Child Behavior Checklist and related forms for assessing behavioral/emotional problems and competencies. Pediatrics in Review, 21(8), 265–271. 10.1542/pir.21-8-265
- Albano AD (2016). equate: An R package for observed-score linking and equating. Journal of Statistical Software, 74(8), 1–36. 10.18637/jss.v074.i08
- American Academy of Pediatrics, Committee on Psychosocial Aspects of Child and Family Health and Task Force on Mental Health. (2009). Policy statement—the future of pediatrics: Mental health competencies for pediatric primary care. Pediatrics, 124(1), 410–421. 10.1542/peds.2009-1061
- American Psychiatric Association. (2000). Diagnostic and statistical manual of mental disorders (4th ed., text rev.).
- Aşiret S, & Sünbül SÖ (2016). Investigating test equating methods in small samples through various factors. Educational Sciences: Theory & Practice, 16(2), 647–668. 10.12738/estp.2016.2.2762
- Askew RL, Kim J, Chung H, Cook KF, Johnson KL, & Amtmann D (2013). Development of a crosswalk for pain interference measured by the BPI and PROMIS pain interference short form. Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care and Rehabilitation, 22(10), 2769–2776. 10.1007/s11136-013-0398-5
- Batterham PJ, Sunderland M, Slade T, Calear AL, & Carragher N (2018). Assessing distress in the community: Psychometric properties and crosswalk comparison of eight measures of psychological distress. Psychological Medicine, 48(8), 1316–1324. 10.1017/S0033291717002835
- Bland JM, & Altman DG (1999). Measuring agreement in method comparison studies. Statistical Methods in Medical Research, 8(2), 135–160. 10.1177/096228029900800204
- Bourdon KH, Goodman R, Rae DS, Simpson G, & Koretz DS (2005). The Strengths and Difficulties Questionnaire: U.S. normative data and psychometric properties. Journal of the American Academy of Child & Adolescent Psychiatry, 44(6), 557–564. 10.1097/01.chi.0000159157.57075.c8
- Cai L (2010). A two-tier full-information item factor analysis model with applications. Psychometrika, 75(4), 581–612. 10.1007/s11336-010-9178-0
- Chalmers RP (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. 10.18637/jss.v048.i06
- Child Mind Institute. (2015). Children’s mental health report. https://childmind.org/2015-childrens-mental-health-report/
- Choi SW, & Lim S (2020). PROsetta: Linking patient-reported outcomes measures (R package Version 0.1.4). https://CRAN.R-project.org/package=PROsetta
- Choi SW, Reise SP, Pilkonis PA, Hays RD, & Cella D (2010). Efficiency of static and computer adaptive short forms compared to full-length measures of depressive symptoms. Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care and Rehabilitation, 19(1), 125–136. 10.1007/s11136-009-9560-5
- Choi SW, Schalet B, Cook KF, & Cella D (2014). Establishing a common metric for depressive symptoms: Linking the BDI-II, CES-D, and PHQ-9 to PROMIS depression. Psychological Assessment, 26(2), 513–527. 10.1037/a0035768
- Cohen LL, La Greca AM, Blount RL, Kazak AE, Holmbeck GN, & Lemanek KL (2008). Introduction to special issue: Evidence-based assessment in pediatric psychology. Journal of Pediatric Psychology, 33(9), 911–915. 10.1093/jpepsy/jsj115
- Collins FS, & Manolio TA (2007). Merging and emerging cohorts: Necessary but not sufficient. Nature, 445(7125), Article 259. 10.1038/445259a
- Curran PJ, & Hussong AM (2009). Integrative data analysis: The simultaneous analysis of multiple data sets. Psychological Methods, 14(2), 81–100. 10.1037/a0015914
- Dickey WC, & Blumberg SJ (2004). Revisiting the factor structure of the Strengths and Difficulties Questionnaire: United States, 2001. Journal of the American Academy of Child & Adolescent Psychiatry, 43(9), 1159–1167. 10.1097/01.chi.0000132808.36708.a9
- Dorans NJ (2004). Equating, concordance, and expectation. Applied Psychological Measurement, 28(4), 227–246. 10.1177/0146621604265031
- Dorans NJ (2007). Linking scores from multiple health outcome instruments. Quality of Life Research: An International Journal of Quality of Life Aspects of Treatment, Care and Rehabilitation, 16(Suppl. 1), 85–94. 10.1007/s11136-006-9155-3
- Dorans NJ, & Walker ME (2007). Sizing up linkages. In Dorans NJ, Pommerich M, & Holland PW (Eds.), Linking and aligning scores and scales (pp. 179–198). Springer. 10.1007/978-0-387-49771-6_10
- Downey AS, & Olson S (Eds.). (2013). Sharing clinical research data: Workshop summary. National Academies Press.
- Egger HL, & Angold A (2006). Common emotional and behavioral disorders in preschool children: Presentation, nosology, and epidemiology. Journal of Child Psychology and Psychiatry, and Allied Disciplines, 47(3–4), 313–337. 10.1111/j.1469-7610.2006.01618.x
- Feigelson HS, Cox DG, Cann HM, Wacholder S, Kaaks R, Henderson BE, Albanes D, Altshuler D, Berglund G, Berrino F, Bingham S, Buring JE, Burtt NP, Calle EE, Chanock SJ, Clavel-Chapelon F, Colditz G, Diver WR, Freedman ML, … Trichopoulos D (2006). Haplotype analysis of the HSD17B1 gene and risk of breast cancer: A comprehensive approach to multicenter analyses of prospective cohort studies. Cancer Research, 66(4), 2468–2475. 10.1158/0008-5472.CAN-05-3574
- Fischer BA, & Zigmond MJ (2010). The essential nature of sharing in science. Science and Engineering Ethics, 16(4), 783–799. 10.1007/s11948-010-9239-x
- Flynn JR (1987). Massive IQ gains in 14 nations: What IQ tests really measure. Psychological Bulletin, 101(2), 171–191. 10.1037/0033-2909.101.2.171
- Ghandour RM, Sherman LJ, Vladutiu CJ, Ali MM, Lynch SE, Bitsko RH, & Blumberg SJ (2019). Prevalence and treatment of depression, anxiety, and conduct problems in US children. The Journal of Pediatrics, 206, 256–267. 10.1016/j.jpeds.2018.09.021
- Gignac GE (2015). Estimating the strength of a general factor: Coefficient omega hierarchical. Industrial and Organizational Psychology: Perspectives on Science and Practice, 8(3), 434–438. 10.1017/iop.2015.59
- Goodman A, Lamping DL, & Ploubidis GB (2010). When to use broader internalising and externalising subscales instead of the hypothesised five subscales on the Strengths and Difficulties Questionnaire (SDQ): Data from British parents, teachers and children. Journal of Abnormal Child Psychology, 38(8), 1179–1191. 10.1007/s10802-010-9434-x
- Goodman R (1997). The Strengths and Difficulties Questionnaire: A research note. Journal of Child Psychology and Psychiatry, and Allied Disciplines, 38(5), 581–586. 10.1111/j.1469-7610.1997.tb01545.x
- Goodman R (1999). The extended version of the Strengths and Difficulties Questionnaire as a guide to child psychiatric caseness and consequent burden. Journal of Child Psychology and Psychiatry, and Allied Disciplines, 40(5), 791–799. 10.1111/1469-7610.00494
- Goodman R (2001). Psychometric properties of the Strengths and Difficulties Questionnaire. Journal of the American Academy of Child & Adolescent Psychiatry, 40(11), 1337–1345. 10.1097/00004583-200111000-00015
- Green SB, & Yang Y (2009). Reliability of summed item scores using structural equation modeling: An alternative to coefficient alpha. Psychometrika, 74(1), 155–167. 10.1007/s11336-008-9099-3
- Haebara T (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22(3), 144–149. 10.4992/psycholres1954.22.144
- Hofstra MB, van der Ende J, & Verhulst FC (2002). Child and adolescent problems predict DSM-IV disorders in adulthood: A 14-year follow-up of a Dutch epidemiological sample. Journal of the American Academy of Child & Adolescent Psychiatry, 41(2), 182–189. 10.1097/00004583-200202000-00012
- Holmbeck GN, Thill AW, Bachanas P, Garber J, Miller KB, Abad M, Bruno EF, Carter JS, David-Ferdon C, Jandasek B, Mennuti-Washburn JE, O’Mahar K, & Zukerman J (2008). Evidence-based assessment in pediatric psychology: Measures of psychosocial adjustment and psychopathology. Journal of Pediatric Psychology, 33(9), 958–980. 10.1093/jpepsy/jsm059
- Hopwood CJ, & Donnellan MB (2010). How should the internal structure of personality inventories be evaluated? Personality and Social Psychology Review, 14(3), 332–346. 10.1177/1088868310361240
- Hussong AM, Curran PJ, & Bauer DJ (2013). Integrative data analysis in clinical psychology research. Annual Review of Clinical Psychology, 9(1), 61–89. 10.1146/annurev-clinpsy-050212-185522
- Kolen MJ, & Brennan RL (2004). Test equating, scaling, and linking: Methods and practices. Springer. 10.1007/978-1-4757-4310-4
- Koskelainen M, Sourander A, & Vauras M (2001). Self-reported strengths and difficulties in a community sample of Finnish adolescents. European Child & Adolescent Psychiatry, 10(3), 180–185. 10.1007/s007870170024
- Lai JS, Cella D, Yanez B, & Stone A (2014). Linking fatigue measures on a common reporting metric. Journal of Pain and Symptom Management, 48(4), 639–648. 10.1016/j.jpainsymman.2013.12.236
- Lance CE, Butts MM, & Michels LC (2006). The sources of four commonly reported cutoff criteria: What did they really say? Organizational Research Methods, 9(2), 202–220. 10.1177/1094428105284919
- Livingston SA (1992). Small-sample equating with log-linear smoothing. ETS Research Report Series, 1992(1), 1–10. 10.1002/j.2333-8504.1992.tb01434.x
- Lord FM (1982). The standard error of equipercentile equating. Journal of Educational and Behavioral Statistics, 7(3), 165–174. 10.3102/10769986007003165
- Lord FM, & Wingersky MS (1984). Comparison of IRT true-score and equipercentile observed-score “equatings.” Applied Psychological Measurement, 8(4), 453–461. 10.1177/014662168400800409
- McDonald RP (1999). Test theory: A unified treatment. Lawrence Erlbaum.
- Muris P, Meesters C, & van den Berg F (2003). The Strengths and Difficulties Questionnaire (SDQ)—further evidence for its reliability and validity in a community sample of Dutch children and adolescents. European Child & Adolescent Psychiatry, 12(1), 1–8. 10.1007/s00787-003-0298-2
- Muthén B (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49(1), 115–132. 10.1007/BF02294210
- Muthén BO, du Toit SHC, & Spisic D (1997). Robust inference using weighted least squares and quadratic estimating equations in latent variable modeling with categorical and continuous outcomes. http://gseis.ucla.edu/faculty/muthen/articles/Article_075.pdf
- Muthén LK, & Muthén BO (1998–2011). Mplus user’s guide (6th ed.). Muthén & Muthén.
- Newton P (2010). Thinking about linking. Measurement: Interdisciplinary Research and Perspectives, 8(1), 38–56. 10.1080/15366361003749068
- Pedersen NL, Christensen K, Dahl AK, Finkel D, Franz CE, Gatz M, Horwitz BN, Johansson B, Johnson W, Kremen WS, Lyons MJ, Malmberg B, McGue M, Neiderhiser JM, Petersen I, & Reynolds CA (2013). IGEMS: The consortium on interplay of genes and environment across multiple studies. Twin Research and Human Genetics, 16(1), 481–489. 10.1017/thg.2012.110
- Perou R, Bitsko RH, Blumberg SJ, Pastor P, Ghandour RM, Gfroerer JC, Hedden SL, Crosby AE, Visser SN, Schieve LA, Parks SE, Hall JE, Brody D, Simile CM, Thompson WW, Baio J, Avenevoli S, Kogan MD, Huang LN, & the Centers for Disease Control and Prevention. (2013). Mental health surveillance among children—United States, 2005–2011. MMWR Supplements, 62(2), 1–35.
- Petersen NS (2008). A discussion of population invariance of equating. Applied Psychological Measurement, 32(1), 98–101. 10.1177/0146621607311581
- Pommerich M, Hanson BA, Harris DJ, & Sconing JA (2000). Issues in creating and reporting concordance results based on equipercentile methods (ACT Research Report No. 2000-1). ACT.
- R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
- Reise SP, Kim DS, Mansolf M, & Widaman KF (2016). Is the bifactor model a better model or is it just better at modeling implausible responses? Application of iteratively reweighted least squares to the Rosenberg Self-Esteem Scale. Multivariate Behavioral Research, 51(6), 818–838. 10.1080/00273171.2016.1243461
- Revelle W (2020). psych: Procedures for personality and psychological research. Northwestern University. https://CRAN.R-project.org/package=psych
- Revelle W, & Zinbarg RE (2009). Coefficients alpha, beta, omega, and the glb: Comments on Sijtsma. Psychometrika, 74(1), 145–154. 10.1007/s11336-008-9102-z
- Rosenbaum PR, & Thayer D (1987). Smoothing the joint and marginal distributions of scored two-way contingency tables in test equating. British Journal of Mathematical & Statistical Psychology, 40(1), 43–49. 10.1111/j.2044-8317.1987.tb00866.x
- Rosseel Y (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. 10.18637/jss.v048.i02
- Schalet BD, Cook KF, Choi SW, & Cella D (2014). Establishing a common metric for self-reported anxiety: Linking the MASQ, PANAS, and GAD-7 to PROMIS Anxiety. Journal of Anxiety Disorders, 28(1), 88–96. 10.1016/j.janxdis.2013.11.006
- Skaggs G (2005). Accuracy of random groups equating with very small samples. Journal of Educational Measurement, 42(4), 309–330. 10.1111/j.1745-3984.2005.00018.x
- Smith-Warner SA, Spiegelman D, Ritz J, Albanes D, Beeson WL, Bernstein L, Berrino F, van den Brandt PA, Buring JE, Cho E, Colditz GA, Folsom AR, Freudenheim JL, Giovannucci E, Goldbohm RA, Graham S, Harnack L, Horn-Ross PL, Krogh V, … Hunter DJ (2006). Methods for pooling results of epidemiologic studies: The pooling project of prospective studies of diet and cancer. American Journal of Epidemiology, 163(11), 1053–1064. 10.1093/aje/kwj127
- Stevens AL, Ho KY, Mason WA, & Chmelka MB (2021). Using equipercentile equating to link scores of the CBCL and SDQ in residential youth. Residential Treatment for Children & Youth, 38(1), 102–113. 10.1080/0886571X.2019.1704670
- U.S. Census Bureau. (2019). American Community Survey: Educational attainment. https://data.census.gov/cedsci/table?q=educational%20attainment&tid=ACSST1Y2019.S1501
- van de Looij-Jansen PM, Goedhart AW, de Wilde EJ, & Treffers PD (2011). Confirmatory factor analysis and factorial invariance analysis of the adolescent self-report Strengths and Difficulties Questionnaire: How important are method effects and minor factors? British Journal of Clinical Psychology, 50(2), 127–144. 10.1348/014466510X498174
- Van Leeuwen K, Meerschaert T, Bosmans G, De Medts L, & Braet C (2006). The Strengths and Difficulties Questionnaire in a community sample of young children in Flanders. European Journal of Psychological Assessment, 22(3), 189–197. 10.1027/1015-5759.22.3.189
- Van Noorden R (2013). Data-sharing: Everything on display. Nature, 500(7461), 243–245. 10.1038/nj7461-243a
- Vandenberg RJ, & Lance CE (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70. 10.1177/109442810031002
- Watson D (1988). The vicissitudes of mood measurement: Effects of varying descriptors, time frames, and response formats on measures of positive and negative affect. Journal of Personality and Social Psychology, 55(1), 128–141. 10.1037/0022-3514.55.1.128
- Willett WC, Blot WJ, Colditz GA, Folsom AR, Henderson BE, & Stampfer MJ (2007). Merging and emerging cohorts: Not worth the wait. Nature, 445(7125), 257–258. 10.1038/445257a
- Zuckerbrot RA, Maxon L, Pagar D, Davies M, Fisher PW, & Shaffer D (2007). Adolescent depression screening in primary care: Feasibility and acceptability. Pediatrics, 119(1), 101–108. 10.1542/peds.2005-2965