Abstract
Purpose:
Using the lens of classical test theory, we examine a linkage’s generalizability with respect to use in multivariable analyses, including multiple regression and structural equation modeling, rather than comparison of established subpopulations as is most common in the literature.
Methods:
To aid in this evaluation, we present a structural-equation-modeling based statistical method to examine the suitability of a given linkage for use cases involving continuous and categorical variables external to the linkage itself.
Results:
Using the PROMIS® Parent Proxy and Early Childhood Global Health measures, we show that, although a high correlation between the scores (here, $r = .83$) may imply a general suitability for linking, a more detailed investigation of content, measurement structure, and results of the proposed methodology reveals important differences between the measures that can compromise interchangeability in certain use cases.
Conclusion:
In addition to the statistical quality of a linkage, users of linking methodology should also assess the question of whether the linkage is appropriate to apply to particular use cases of interest.
Keywords: Test linking, psychometrics, structural equation modeling, generalizability
Plain English summary
Statistical transformations, called linkages, allow scores on one measure, for example a questionnaire about global physical and mental health, to be compared to scores on another similar measure. However, whether scores can be treated interchangeably with each other in all cases is a generally unanswered question. We describe a method to examine whether linked scores can be treated interchangeably in particular cases, specifically when included in statistical analyses involving other variables collected (e.g., demographic variables). We apply this test to two measures of global health, revealing use cases in which these measures are essentially equivalent and cases where they are not. This research will help increase the statistical rigor of linkages, allowing researchers to assess when and whether a linkage is useful for their purposes.
Modern score linking applications have extended beyond their origins in academic ability testing into other areas [1], including person-reported outcome (PRO) measures of quality of life. In multi-site research programs, these linkages are often used to harmonize variables for cross-sectional and longitudinal research. The goal of this manuscript is to statistically examine the validity of linkages in this context. In addition to standard psychometric evaluation of the linkages (e.g., correlation between scores), we propose criteria based on the use cases for the resulting scores, with the idea that the same linkage may be suitable for some purposes and unsuitable for others. Here, we echo the sentiments of [2] that “…perfect comparability is most likely neither feasible nor desirable, and that rational policymakers need to satisfy themselves with results that provide reasonable estimates (emphasis in original) rather than conclusive answers”.
Use cases involving comparison of observed scores on a measure and scores linked to that measure’s metric for an individual, and between subpopulations, have been well-studied in the educational assessment literature (e.g., [3]). In contrast, applications of linked scores to integrative data analysis (IDA; [4]), or the simultaneous analysis of multiple data sets, have received virtually no attention, and we focus on this application in this paper. In IDA, linkages are used to provide common metrics for measures of the same construct administered in separate samples. One such example is the Patient-Reported Outcomes Measurement Information System® (PROMIS®) Depression common metric [5-7], which provides comparable scores across several extant depression questionnaires by aligning scores onto the same PROMIS T-score metric, an item response theory based scoring system with a mean of 50 and a standard deviation of 10. Linkages can also be used to place scores onto metrics with existing norms, allowing interpretation with respect to the normed metric. Evaluating these use cases requires additional tools, and in this work, we propose one such tool: a structural-equation-modeling (SEM) based test of whether a particular linkage is appropriate for interpretation within a multivariable use case involving an external variable.
Internal and External Evaluation of Linkages
Typically, correlations between scores on two measures are used to judge suitability for linking. One example result of such an assessment is the ability to classify a linkage as an equating, meaning that scores on the two measures can be treated as interchangeable. Criteria for establishing equating are based on the correlation between scores on the two measures (traditionally judged against a fixed cutoff) and equality of reliability between the two measures [8]. This internal evaluation is, of course, critically important to evaluating the utility of a linkage, but only directly generalizes to the comparison of scores or groups of scores in isolation.
External evaluations of linkages are generally oriented towards the example use cases mentioned above (see also [3]), which generally assess differences between linkages derived within one or more subpopulations and evaluate those linkages either conditional on a particular score (e.g., revealing negative bias in one group at the low end of the score range), or on the full range of scores (e.g., revealing negative bias in general). Because these methods depend on accurate estimation of the linkage within each subgroup, they may perform poorly when a subgroup has few individuals. It is also not clear how to apply these methods to continuous variables. Historically, the subgroups of interest have been large and categorical, for example males versus females, so these limitations have mattered little. For integrative data analysis, and multivariable analysis in general, additional tools are needed which account for these limitations.
Motivation: Classical Test Theory
To introduce terminology for the score linking context, consider a case in which transformed scores on one measure, which we call linked scores to specify that the linking transformation has been applied, are treated as proxies for observed scores on the other measure, which we call actual scores. These scores are analogous, respectively, to the classical test theory (CTT) observed score, which is the score observed in practice and contains measurement error, and the true score, which would be obtained from a theoretical test with no measurement error. In CTT, measurement error is assumed to be independent not only of the true score, but of all other variables, which we call external variables with respect to their relationship to the measurement model. In score linking, the correlation between linked and actual scores will never be unity. Therefore, if the linked score is to be treated as a proxy for the actual score, one must similarly assume that the remaining variance in each score after accounting for the other is unrelated to any external variables of interest.
A Test for Appropriate Use Cases of Linked Scores
A score linking study in which the two measures are co-administered enables a test of whether variance in each of the linked and actual scores, after accounting for variance shared with the other score, is systematically related to external variables also collected. Let $Y$ be the actual score on whose metric inference is desired, and $X$ the linked score proposed to be interchangeable with $Y$. Note that, assuming the linking was successful, $Y$ and $X$ are on the same metric, as $X$ is not the score on the linked-from measure itself, but the transformed version of that score after linking. We define interchangeability of $Y$ and $X$ as indicating that differences between $Y$ and $X$ are attributable only to linking error, i.e., variance unrelated to external variables. Without loss of generality, assume both $Y$ and $X$ are standardized; then, this property can be written as
$$Y = \rho X + \varepsilon \tag{1}$$
where $\rho$ is the correlation between $Y$ and $X$, and $\varepsilon$ is an error term orthogonal to $X$. Because $\rho$ is not equal to one, the question arises as to whether the error term, representing variance in $Y$ unaccounted for by $X$, contains meaningless statistical noise, meaningful variance with respect to other variables to be used in analysis, meaningful variance unrelated to those other variables, or some combination thereof.
With no other variables, Equation (1) will always reproduce the only modeled correlation $\rho$; however, when additional variables are considered, this is no longer true, as Equation (1) implies constraints on correlations of $Y$ and $X$ with other variables. Let $Z$ represent an external variable of interest, such as a predictor, outcome, or covariate which is intended to be jointly modeled with $Y$ using $X$ as a substitute for $Y$. Then, assuming Equation (1) still holds, the correlation between $Y$ and $Z$ can be written as
$$r_{YZ} = \rho \, r_{XZ} \tag{2}$$
Here, $r_{YZ}$ and $r_{XZ}$ denote correlations between $Y$ and $Z$ and between $X$ and $Z$, respectively, with $\rho$ as above. If Equation (2) holds, because the correlation $\rho$ is expected to be positive, the only differences between $r_{YZ}$ and $r_{XZ}$ would be due to measurement error, quantified by 1 minus $\rho$. In this case, inferences made on the linked score $X$ would be comparable, aside from measurement error, to inferences which would have been made on $Y$ had $Y$ been observed, rendering $Y$ and $X$ interchangeable. Defined in this way, differences between $r_{YZ}$ and $r_{XZ}$ over and above those reflecting $\rho$ indicate lack of interchangeability, specifically that the “error” term in Equation (1) reflects, at least partially, meaningful variance with respect to use cases involving $Z$.
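The step from Equation (1) to Equation (2) can be sketched as follows. The notation is an assumption made for illustration: $Y$ is the standardized actual score, $X$ the standardized linked score, $Z$ a standardized external variable, $\rho$ the correlation between $Y$ and $X$, and $\varepsilon$ the error term of Equation (1).

```latex
\begin{align*}
r_{YZ} &= \operatorname{Cov}(Y, Z)
        = \operatorname{Cov}(\rho X + \varepsilon,\, Z) \\
       &= \rho \, r_{XZ} + \operatorname{Cov}(\varepsilon, Z), \\
\text{so}\quad r_{YZ} - \rho \, r_{XZ} &= \operatorname{Cov}(\varepsilon, Z).
\end{align*}
```

Under this sketch, Equation (2) holds exactly when the error term is uncorrelated with $Z$, and any departure from Equation (2) equals the covariance between the error term and $Z$.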
Interchangeability can be tested using the path model in Figure 1. Here, the directed arrow between the linked score $X$ and the actual score $Y$ implies Equation (1), while the omission of the dotted two-way path between $Y$ and the external variable $Z$ implies Equation (2); adding the dotted path yields a saturated model. Omitting the dotted path and conducting the omnibus model-fit test constitutes a test of interchangeability of the actual score $Y$ and the linked score $X$ in multivariable analyses involving $Z$. A similar test can be conducted where the roles of $Y$ and $X$ are switched; similarly, failing this test reveals that variance in $X$, after accounting for $Y$, is systematically related to $Z$. While there are many frameworks in which the difference between the two sides of Equation (2) can be tested, structural equation modeling enables out-of-the-box applications to non-Pearson correlations (e.g., polyserial correlations) using non-normal or non-continuous $Z$, inclusion of additional external variables besides $Z$ in the same test, and/or inclusion of more complex models in place of $Z$, yielding a flexible set of interchangeability tests. The results of this test generalize to analytic methods whose results are functions of correlations and covariances, including multiple regression and SEM.
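The omnibus test can be illustrated numerically. The following is a minimal sketch in plain Python, not the robust SEM-based test applied later in this paper. It assumes a single continuous external variable and multivariate normality. Under those assumptions, omitting the dotted path is equivalent to a zero partial correlation between the actual score (`y`) and the external variable (`z`) given the linked score (`x`), and the likelihood ratio statistic has 1 degree of freedom. All function names and the simulated data are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def interchangeability_lr_test(y, x, z):
    """Normal-theory likelihood ratio test of r_yz = r_xy * r_xz (1 df)."""
    n = len(y)
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    # Partial correlation of y and z given x; it is zero when Equation (2) holds.
    partial = (r_yz - r_xy * r_xz) / math.sqrt((1 - r_xy**2) * (1 - r_xz**2))
    chi2 = -n * math.log(1 - partial**2)
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Simulated (hypothetical) data: x is the linked score, z an external variable.
rng = random.Random(1)
n_obs = 2000
x = [rng.gauss(0, 1) for _ in range(n_obs)]
z = [0.5 * xi + rng.gauss(0, 1) for xi in x]
# y_ok differs from x only by noise unrelated to z.
y_ok = [0.8 * xi + rng.gauss(0, 0.6) for xi in x]
# y_bad carries extra variance that is related to z beyond x.
y_bad = [0.8 * xi + 0.3 * zi + rng.gauss(0, 0.6) for xi, zi in zip(x, z)]

chi2_ok, p_ok = interchangeability_lr_test(y_ok, x, z)
chi2_bad, p_bad = interchangeability_lr_test(y_bad, x, z)
```

With data simulated this way, the statistic for `y_bad` should be far larger than for `y_ok`, because only `y_bad` carries variance related to `z` beyond what `x` explains. Categorical external variables and non-normal data would call for the SEM machinery described above rather than this simplified version.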
Figure 1. Path model illustrating test of interchangeability.

Note. $Y$ represents scores on the metric to which scores are linked. $X$ represents linked scores on a different measure, linked to the metric of $Y$. $Z$ represents an external variable to the linkage, such as a demographic or predictor/criterion variable in a multivariable analysis. $Y$ is represented as a circle, consistent with its unobserved nature in linking applications.
When this test is significant, the question arises as to how much the resulting correlations and subsequent interpretation may differ. To assess this, one would add this correlation to the model in Figure 1 (dotted line) and re-estimate the model. This approach is similar to that of Bentler’s theory [9] in estimating reliability while partitioning item-specific variance into measurement error and variance unique to each item in a multi-item scale but also shared with external covariates. Alternatively, one could estimate the model in Figure 1 with the dotted line included and conduct a Wald test for significance of the residual correlation.
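The size of the misfit can also be illustrated with a small numeric sketch. The code below assumes standardized scores and compares the correlation implied by Equation (2) with the observed one; their difference is the covariance between the error term and the external variable. It does not implement a Wald test, and it may scale the residual association differently from the estimates an SEM package reports. The function names and the toy data are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def residual_association(y, x, z):
    """Return (implied r_yz, observed r_yz, observed minus implied)."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    implied = r_xy * r_xz  # value of r_yz if Equation (2) holds
    return implied, r_yz, r_yz - implied


# Toy data built from orthogonal contrasts so the answers are known exactly.
x = [1.0, 1.0, -1.0, -1.0]       # linked score
z = [2.0, 0.0, 0.0, -2.0]        # external variable, correlated with x
y_ok = [3.0, 1.0, -3.0, -1.0]    # 2*x plus noise orthogonal to x and z
y_bad = [4.0, 0.0, -2.0, -2.0]   # y_ok plus the part of z not shared with x

ok = residual_association(y_ok, x, z)
bad = residual_association(y_bad, x, z)
```

In this toy example the residual association for `y_ok` is zero up to floating-point error, while for `y_bad` it is positive, mirroring the distinction between interchangeable and non-interchangeable scores.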
We conducted a simulation study to evaluate the Type I error, power, and rate of reversed sign (positive versus negative) for correlations of the external variable $Z$ with the linked score $X$ versus the actual score $Y$. To conserve space and allow further discussion of the real-data example, we present the method, results, and discussion of this simulation study in Supplemental Materials.
Example: Linking Parent Proxy and Early Childhood PROMIS Global Health
To demonstrate this methodology, we evaluated two highly related measures: PROMIS Early Childhood Parent Report Scale v1.0 - Global Health 8a (EC) [10] and PROMIS Parent Proxy Scale v1.0 - Global Health 7 (PP) [11]. The two measures share three identical items, are both administered to a parent, and use the same 5-point Likert scale. Both EC and PP are scored using state-of-the-science item response theory (IRT)-based methods, specifically the unidimensional graded response model, and are reported using IRT-scaled, norm-referenced T scores with respect to the general population (here, of parents reporting on their children) in the United States. While the PROMIS adult Global Health measures report separate scores for mental and physical health, pediatric measures include only a single score spanning both physical and mental health. The EC measure was designed to enable age-appropriate assessment of global health in children ages 1 to 5, while the PP version was designed likewise for ages 5 to 17. Table 1 contains item content and perturbed PROMIS item slope parameters for the two measures; this item content and the measures themselves are publicly available from the HealthMeasures website at healthmeasures.net. Notably, the first three items overlap: in general, would you say your child’s health is [excellent to poor]; in general, how would you rate your child’s quality of life?; and in general, how would you rate your child’s physical health?
Table 1.
Item text and perturbed slope parameters from PROMIS calibration
| Item Text | EC Slope | PP Slope |
|---|---|---|
| In general, would you say your child's health is: | 2.46 | 4.54 |
| In general, would you say your child's quality of life is: | 2.10 | 2.95 |
| In general, how would you rate your child's physical health? | 2.36 | 4.71 |
| In general, how would you rate your child's mental health? | 3.64 | |
| How would you rate your child's mood? | 2.48 | |
| How would you rate your child's social skills? | 1.81 | |
| How would you rate your child's ability to think? | 2.77 | |
| How well is your child meeting developmental milestones? | 2.70 | |
| In general, how would you rate your child's mental health, including mood and ability to think? | | 2.11 |
| How often does your child seem really sad? | | 0.64 |
| How often does your child have fun with other children? | | 1.03 |
| How often does your child feel that you listen to his or her ideas? | | 1.12 |
Note. EC = PROMIS Early Childhood Global Health. PP = Parent Proxy Global Health. Perturbation was conducted to mask the actual values by adding random noise generated, separately for each parameter, from a random uniform distribution with limits of −.25 and .25. Items used in Op4G data had the text in strikethrough font removed and replaced with the underlined text for developmental appropriateness in the sample of parents of 1-5-year-olds.
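The perturbation described in the note can be sketched as below. This is an illustration only: the function name, the seed, and the example slope values are hypothetical and are not the actual PROMIS parameters.

```python
import random


def perturb_parameters(parameters, half_width=0.25, seed=None):
    """Add independent uniform noise in [-half_width, half_width] to each value."""
    rng = random.Random(seed)
    return [value + rng.uniform(-half_width, half_width) for value in parameters]


true_slopes = [2.0, 3.0, 1.5]  # hypothetical values, not real item parameters
masked_slopes = perturb_parameters(true_slopes, seed=2024)
```

Each masked value lies within .25 of its original, so the rank ordering of slopes that are more than .5 apart is preserved.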
Analysis Sample
Data for this study were collected by an internet panel company, Op4G, as part of the larger initiative to develop, validate, and norm a suite of PROMIS EC measures [12-13]. For an overview of the sample, see [13], and for a brief summary for the purposes of these analyses, see Supplemental Methods. Table 2 contains descriptive statistics for the Op4G data; see Supplemental Methods for information on missingness rates by variable.
Table 2.
Descriptive statistics for Op4G Data
| Characteristic | N = 386¹ |
|---|---|
| Respondent Female | 293 (76%) |
| Child Characteristics | |
| Age | 4.48 (0.36) |
| Female | 176 (46%) |
| Hispanic | 63 (16%) |
| White | 310 (80%) |
| Black or African American | 67 (17%) |
| American Indian or Alaska Native | 13 (3.4%) |
| Asian | 12 (3.1%) |
| Native Hawaiian or other Pacific Islander | 4 (1.0%) |
| Other Race | 11 (2.8%) |
| Allergies | 106 (27%) |
| Asthma | 50 (13%) |
| Gross Motor Disorder | 20 (5.2%) |
| Fine Motor Disorder | 19 (4.9%) |
| Anxiety | 31 (8.0%) |
| ADHD | 58 (15%) |
| ASD | 32 (8.3%) |
| Behavioral or Conduct Problems | 30 (7.8%) |
| Developmental Delay | 27 (7.0%) |
| Speech Disorder | 36 (9.3%) |
| PROMIS T Scores | |
| Parent Proxy (PP) Global Health | 51.6 (10.5) |
| Early Childhood (EC) Global Health | 46.0 (10.6) |
| Linked EC to PP Global Health | 51.5 (10.8) |
| Sleep Impairment 4-item SF | 55.9 (10.2) |
| Sleep Disturbance 4-item SF | 54.8 (10.9) |
| Relationship 6-item SF | 50.1 (10.1) |
| Relationship Caregiver SF | 49.4 (11.4) |
| Relationship Family SF | 48.7 (9.2) |
| Relationship Peer SF | 50.9 (10.5) |
| Sleep Full-length Bank | 56.4 (10.2) |
| Physical Activity Bank | 52.0 (9.3) |
| Relationship Full-length Bank | 50.8 (12.6) |
| Irritability Full-length Bank | 48.6 (10.9) |
| Anxiety Full-length Bank | 52.9 (10.8) |
| Depression Full-length Bank | 53.2 (10.0) |
| Curiosity 6a | 48.8 (10.4) |
| Persistence 6a | 52.3 (12.0) |
| Positive Affect Full-length Bank | 49.0 (11.1) |
| Flexibility 5a | 50.7 (12.4) |
| Frustration Tolerance 6a | 52.9 (12.0) |
| Neighborhood Characteristics | |
| Heavy Traffic | 2.68 (1.35) |
| Stranger Danger | 1.92 (1.11) |
| Road Safety | 2.19 (1.17) |
| No Lights | 3.02 (1.39) |
| Cross Roads | 2.76 (1.42) |
| No Transit | 2.62 (1.39) |
Note. ¹Mean (SD) are presented for continuous variables, while n (%) are presented for categorical variables. ²Wilcoxon rank sum test was used for continuous variables; Pearson's Chi-squared test was used for categorical variables with all expected cell counts ≥5; Fisher’s exact test was used for categorical variables with any expected cell count <5. See Supplemental Methods and Supplemental Table 2 for information on missingness rates by variable.
Psychometric Comparison of EC and PP Global Health
The correlation between scores on EC and PP Global Health was estimated as .829 in this sample. Mean (SD) of EC Global Health T scores was 46.0 (10.6) and for PP was 51.6 (10.5).
From the slope parameters in Table 1, observe that items measuring physical health (In general, would you say your child’s health is:, In general, how would you rate your child’s physical health?) have the highest loadings in PP Global Health, while items measuring mental health (In general, how would you rate your child’s mental health?, How would you rate your child’s ability to think?) have the highest loadings in EC Global Health. This difference suggests that, although both forms measure both mental and physical health, T scores for EC Global Health are more a function of mental health, while T scores for PP Global Health are more a function of physical health, a substantive difference in the content of the measures.
We received approval from our institution’s Institutional Review Board. Study materials and data are not publicly available.
Equipercentile Linking
Equipercentile linking with loglinear smoothing was conducted in the equate package [14] in R [15] to link EC Global Health T scores to the metric of PP Global Health T scores and vice versa; see Supplemental Methods for additional details. After linking, the correlation between linked EC to PP scores and observed PP scores was .834, while the correlation between the original and linked EC scores was .993 (Figure 2). Similar results can be seen in Figure 2 for the PP to EC conversion. Crosswalk tables derived from these linkages can be found in the OSF repository (https://osf.io/nf8bx/).
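The core idea of equipercentile linking can be sketched in a few lines. The code below is a simplified, unsmoothed illustration in plain Python: it is not the equate package's algorithm and omits the loglinear smoothing used here. The function names and the score values are hypothetical. Each score on the source measure is mapped to the score on the target measure that has the same percentile rank.

```python
import bisect


def percentile_rank(sorted_scores, value):
    """Mid-rank percentile rank (0 to 1) of value in a sorted sample."""
    below = bisect.bisect_left(sorted_scores, value)
    at_or_below = bisect.bisect_right(sorted_scores, value)
    return (below + 0.5 * (at_or_below - below)) / len(sorted_scores)


def quantile(sorted_scores, p):
    """Interpolated quantile that inverts the mid-rank percentile rank."""
    n = len(sorted_scores)
    pos = min(max(p * n - 0.5, 0.0), n - 1.0)
    lower = int(pos)
    upper = min(lower + 1, n - 1)
    frac = pos - lower
    return sorted_scores[lower] * (1 - frac) + sorted_scores[upper] * frac


def equipercentile_link(from_scores, to_scores):
    """Build a function mapping scores on one measure to the other's metric."""
    src, dst = sorted(from_scores), sorted(to_scores)

    def link(value):
        return quantile(dst, percentile_rank(src, value))

    return link


# Hypothetical co-administered T scores (illustrative values only).
ec_scores = [40.0, 45.0, 50.0, 55.0, 60.0]
pp_scores = [46.0, 51.0, 56.0, 61.0, 66.0]
ec_to_pp = equipercentile_link(ec_scores, pp_scores)
linked_mid = ec_to_pp(50.0)
```

Because the hypothetical PP scores are the EC scores shifted upward by 6 points, the linked value for an EC score of 50 lands at 56 on the PP metric. Real score distributions are discrete and irregular, which is why presmoothing is commonly applied before linking.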
Figure 2. Scatterplot matrix (off-diagonal) and univariate distributions (diagonal) of PROMIS Early Childhood and Parent Proxy Global Health T scores, and linked scores from each measure to the other, in Op4G sample.

Use Case Tests
Method
For each external variable in Table 2, separate tests were conducted treating EC and PP Global Health as the actual score $Y$ and the linked score $X$, respectively, and then with these roles reversed; in each case, $X$ was considered after being linked to the metric of $Y$. When this test was significant for categorical variables (e.g., indicators of emotional and behavioral disorders), we subsequently calculated mean scores for each category for each of the EC and PP Global Health T scores. For significant continuous variables (e.g., other PROMIS T scores), we estimated the model with the dotted line included in Figure 1 and obtained estimates of the residual correlations between $Y$ and the external variable $Z$. Use case tests were conducted as likelihood ratio tests comparing the model without the dotted line in Figure 1 to a saturated model using the robust maximum likelihood ratio test with “Huber-White” standard errors [16-17] using the lavaan package [18] in R; this test was used to account for known (for dichotomous variables) and potential (for other variables) non-normality in the modeled variables.
Results and Discussion
Because interchangeability tests were conducted twice for each external variable, in these results we use EC metric to refer to tests which evaluate the interchangeability of linked scores from PP Global Health, transformed to the EC metric using the derived linking functions, with actual EC scores, as the EC score metric is then common to both tested variables. Similarly, we use PP metric to refer to tests which evaluate the interchangeability of linked scores from EC Global Health with actual PP Global Health scores. The results of these tests, where significant for either metric, are reported in Table 3, while results for all variables are reported in Supplemental Table 4. Looking at the first row of Table 3, child developmental delay was the most statistically significant external variable for the EC metric; that is, when the actual EC Global Health score was treated as the actual score $Y$ in Figure 1, the linked PP Global Health score (now on the EC metric) was treated as the linked score $X$, and the binary child developmental delay indicator was treated as the external variable $Z$, the omnibus chi-square test of the model in Figure 1, excluding the dotted path between $Y$ and $Z$, revealed significant model misfit ($p = 6.7 \times 10^{-14}$). Continuing with the EC metric, the next most statistically significant violations of interchangeability occurred when PROMIS measures of Peer Relationships and Family Relationships were treated as the external variable in Figure 1. Fewer significant results were found for the PP metric, with the lowest p-values for PROMIS Positive Affect, Depression, and child developmental delay. Table 3 also contains estimates from the model with the dotted line included in Figure 1; returning to the first row for child developmental delay, Est. = −.045 in Table 3 indicates that, when the EC metric was evaluated, the estimated residual correlation between EC Global Health and child developmental delay was −.045.
Looking next at other T scores, for the EC metric, the largest residual correlation between EC Global Health and an external variable was for Peer Relationships at .17, while for the PP metric, the largest such correlation was for Positive Affect at .12. While effect sizes may be small, such residual correlations could flip the sign of important estimates in such analyses, leading to inaccurate conclusions.
Table 3.
Significant p-values and estimated residual correlations between external variables and Early Childhood (EC) and Parent Proxy (PP) Global Health
| Variable | EC Metric p | EC Metric Est. | PP Metric p | PP Metric Est. |
|---|---|---|---|---|
| Child Characteristic | | | | |
| Developmental Delay | .000000000000067 | −.045 | .0026 | .020 |
| ASD | .000068 | −.027 | .46 | .0057 |
| ADHD | .00041 | −.038 | .88 | −.0017 |
| Asian | .00090 | −.0084 | .72 | .00094 |
| Speech Disorder | .0066 | −.021 | .90 | .00097 |
| Behavioral or Conduct Problems | .013 | −.023 | .98 | .00022 |
| Fine Motor Disorder | .015 | −.019 | .84 | −.0017 |
| Asthma | .34 | .0080 | .025 | −.021 |
| Gross Motor Disorder | .041 | −.018 | .94 | .00064 |
| PROMIS T Score | | | | |
| Social Relationships Full-length Bank | .0000031 | .17 | .12 | .050 |
| Social Relationships - Peer 4a | .0000051 | .16 | .081 | .052 |
| Positive Affect Full-length Bank | .037 | .077 | .00019 | .12 |
| Frustration Tolerance 6a | .00075 | .13 | .36 | .033 |
| Depression Full-length Bank | .19 | −.047 | .0019 | −.11 |
| Persistence 6a | .0022 | .12 | .18 | .045 |
| Flexibility 5a | .0039 | .13 | .96 | .0022 |
| Social Relationships - Child-Caregiver Interactions 5a | .0051 | .13 | .19 | .047 |
| Irritability Full-length Bank | .018 | −.092 | .21 | −.045 |
| Sleep Problems - Disturbance 4a | .021 | −.11 | .23 | −.055 |
| Social Relationships - Family 4a | .073 | .080 | .027 | .085 |
| Curiosity 6a | .027 | .085 | .17 | .046 |
| Sleep Problems Full-length Bank | .033 | −.11 | .19 | −.067 |
| Sleep Problems - Impairment 4a | .043 | −.088 | .086 | −.075 |
Note. Est. = Estimated residual correlation between Variable and the actual score used in the test. Metric = metric on which the test was conducted, where EC (Early Childhood Global Health) evaluates linked PP (Parent Proxy Global Health) to EC scores as substitutes for actual EC scores, while PP evaluates linked EC to PP scores as substitutes for actual PP scores. PROMIS short form suffixes (e.g., 6a for Frustration Tolerance 6a) indicate specific PROMIS short form numbers as indexed on the HealthMeasures website, with the number (e.g., 6 in 6a) indicating the number of items. Results with p < .05 are presented in bold.
We next examined mean differences in linked and observed T scores for categorical disorder variables, which can be found in Figure 3 (Supplemental Table 5). For children diagnosed with developmental delay, linked T scores differed on average between EC and PP Global Health by 5.88 T score points (PP to linked PP) and 5.43 points (EC to linked EC), or more than half a standard deviation. For children without developmental delay, these T score differences were 0.4 and 0.16 units, or one twenty-fifth of a standard deviation or less. For other “mental” disorders, including motor disorders, PP scores tended to be higher for diagnosed individuals than linked EC scores, while for asthma, the opposite trend was observed. For non-diagnosed individuals, scores did not differ by more than half of a T score point (one twentieth of a standard deviation) in any case.
Figure 3. Mean differences in T scores using different scoring and crosswalk methods.

Note. Dev. Delay = developmental delay; ASD = autism spectrum disorder; ADHD = attention deficit hyperactivity disorder; Beh. Or Con. D. = Behavioral or Conduct Disorder; EC = Early Childhood Global Health; PP = Parent Proxy Global Health. Crosswalked scores were converted to the indicated metric from their original metric using the equipercentile linking function.
Looking at the patterns of significance in Table 3 and Supplemental Table 1, all variables with significant residual correlations are either PROMIS T scores or child diagnoses, except for Asian ethnicity. PROMIS T scores for Anxiety and Physical Activity, neighborhood conditions, and all other child and respondent characteristics were not significant. Thus, this test suggests that multivariable analyses involving this set of non-significant variables will be unaffected (except by linking error) by using linked scores in lieu of actual scores.
Lastly, to investigate whether specific items may be leading to this lack of interchangeability for children with developmental delay, we calculated mean item scores in children with and without developmental delay (Table 4). Three of the four items with the largest difference between children with and without developmental delay were in the EC form only, and all four related to mental health. Importantly, How well is your child meeting developmental milestones? had the largest difference of 1.58 raw score points lower for children with developmental delay. Failure to meet developmental milestones is one of the more common indicators leading to a diagnosis of developmental delay; thus, this finding of non-interchangeability is consistent with differences in item content between the measures.
Table 4.
Item mean scores for children with and without developmental delay
| Item Text | Measure | No Dev. Delay | Dev. Delay | Difference |
|---|---|---|---|---|
| In general, would you say your child’s health is: | Both | 4.42 | 3.62 | −0.80 |
| In general, would you say your child’s quality of life is: | Both | 4.51 | 3.80 | −0.71 |
| In general, how would you rate your child’s physical health? | Both | 4.49 | 3.68 | −0.81 |
| In general, how would you rate your child’s mental health? | EC Only | 4.42 | 3.41 | −1.02 |
| How would you rate your child’s mood? | EC Only | 4.13 | 3.32 | −0.82 |
| How would you rate your child’s social skills? | EC Only | 4.08 | 2.96 | −1.12 |
| How would you rate your child’s ability to think? | EC Only | 4.39 | 3.46 | −0.94 |
| How well is your child meeting developmental milestones? | EC Only | 4.31 | 2.72 | −1.58 |
| In general, how would you rate your child's mental health, including mood and ability to think? | PP Only | 4.40 | 3.27 | −1.13 |
| How often does your child seem really sad? | PP Only | 4.03 | 3.42 | −0.61 |
| How often does your child have fun with other children? | PP Only | 4.26 | 3.75 | −0.51 |
| How often does your child feel that you listen to his or her ideas? | PP Only | 4.32 | 3.94 | −0.38 |
Note. EC = Early Childhood Global Health; PP = Parent Proxy Global Health; Dev. Delay = Developmental Delay. Items used in Op4G data had the text in strikethrough font removed and replaced with the underlined text for developmental appropriateness in the sample of parents of 1-5-year-olds.
Discussion
In the applied example, although these scores are linkable, and often interchangeable, they are not always so. Using the proposed use-case test, we were able to determine when, rather than if, the linkage was appropriate, which we suggest is the more useful question. This approach is closely related to subgroup invariance analysis within item-level psychometrics (differential item functioning; [19]) and score-level test linking [20], but differs in its emphasis on practical importance for particular use cases. In this way, the proposed approach is most similar to the difference that matters paradigm (DTM; [21]), which evaluates a linkage with respect to a reporting scale (e.g., acceptance criteria for college admissions exams).
The proposed use-case evaluation provides estimates of the residual correlation of the actual score with the external variable, and which values can be considered ignorable depends, again, on the use case. To borrow an example from [22], a very small effect size for an intervention reducing suicidality would be a very important and life-saving finding, but if the effect size is smaller than the resulting residual correlation from the use-case test, relying on the associated linkage may erroneously portray a suicide-increasing intervention as suicide-preventive or vice versa. For larger effects, one can refer to Cohen's [23] criteria ($r = .1$, .3, and .5), wherein the goal would be to avoid misclassification of effect magnitude.
Limitations and Future Directions
While developers of linkages cannot know a priori all use cases which may apply, norms of the field and common extant usages of the linked measures can inform on which use case tests may be most valuable, thereby informing designers of linking studies on which external variables to also collect data on in addition to the to-be-linked measures. Similarly, users of linkages concerned about their particular use case may not find it represented exactly in any given linking study, but any information on score interchangeability can help inform whether linked scores might be useful, and this test remains valuable to the extent its results can provide such information.
The proposed method requires a co-administration design, unlike others presented in [3]. We focused on this design because many linking studies in PROs use it, probably because these measures tend to be less time-consuming for the respondent than educational tests like the SAT. Future work can extend these methods, and this philosophy of attending to use cases for linkages, to other designs as needed. In addition, the current study did not incorporate item-level information into the use case test itself, instead examining item content and group differences post hoc; proper techniques for incorporating this additional information into the use case test could more directly inform researchers on non-interchangeability of items, with the potential to identify more (or less) interchangeable item subsets without post hoc analysis. If use cases related to covariates unmeasured in the linking study are of concern, latent class models can attempt to include and account for these unobserved variables.
Relatedly, when cases of item non-interchangeability are detected, one can consider omitting problematic items (e.g., the item in the EC measure assessing developmental milestones). However, this practice carries risks worth considering. For example, if the violation reflects a Type I error, then item omission based on a single study could lead to loss of information in others. Also, the practice of removing items from measures risks a proliferation of numerous “versions” of measures, rendering measure selection cumbersome. Consistent with the focus on use cases, we recommend researchers proceed as we did here: present their results on violations of interchangeability, and leave the decision of whether to omit items to the designers of future studies, who are best suited to compare their use cases to those assessed with the proposed methods, unless convergent research and theory support item omission in (a more limited number of) revised versions of measures.
Lastly, in IDA, the presence of some scores on each measure may ameliorate any deleterious impact of treating linked scores interchangeably with actual scores on the metric of the desired measure. For example, suppose EC and PP Global Health scores were available for different subsets of participants, and a harmonized variable was created by linking PP scores to the EC metric and treating them interchangeably with actual EC scores. Lack of interchangeability, for example when assessing relationships between global health and developmental delay, would then have less of an impact than revealed by this test, which assumes only linked scores are used. It seems realistic to the authors to assume that multiplying the estimated impact (here, the estimates in Table 3) by the proportion of linked rather than actual scores used would reasonably approximate the degree of bias introduced by using this harmonized variable rather than only EC scores. However, this impact may also depend on other model features, and future work should explore how best to treat these harmonized variables in the face of (potential) non-interchangeability.
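The proportional approximation suggested above can be written out as a short sketch. It encodes only the heuristic stated in the text (estimated impact multiplied by the share of linked scores), which has not been validated here. The function name `approximate_harmonization_bias`, the sample sizes, and the impact value are hypothetical and are not taken from Table 3.

```python
def approximate_harmonization_bias(estimated_impact, n_linked, n_actual):
    """Scale the use-case test's estimated impact by the share of linked scores.

    ASSUMPTION (heuristic from the text, not a validated result): bias in a
    harmonized variable grows in proportion to the fraction of scores that
    are linked rather than actual.
    """
    total = n_linked + n_actual
    if n_linked < 0 or n_actual < 0 or total == 0:
        raise ValueError("sample sizes must be non-negative and not both zero")
    proportion_linked = n_linked / total
    return estimated_impact * proportion_linked


# Hypothetical example: 250 linked PP scores pooled with 750 actual EC scores,
# and a hypothetical estimated impact of 0.20 from the use-case test.
bias = approximate_harmonization_bias(0.20, n_linked=250, n_actual=750)
print(f"approximate bias: {bias:.3f}")
```

When every score is linked, the function returns the full estimated impact, matching the situation the use-case test itself assumes. When no score is linked, it returns zero.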
General Conclusion
The approach presented herein provides a tool for evaluating a linkage’s utility in use cases that involve multivariable analyses. By addressing these practical questions directly, linking in nonstandard contexts such as quality of life research can come to represent not a “descent” of linking, as warned of in [1], but rather equally rigorous applications of the same methodology.
Funding
Research reported in this publication was supported by the Environmental influences on Child Health Outcomes (ECHO) program, Office of The Director, National Institutes of Health, under Award Number U24OD023319 with co-funding from the Office of Behavioral and Social Sciences Research (OBSSR; Person Reported Outcomes Core).
Footnotes
Crosswalk tables are publicly available at the APA OSF repository (url: https://osf.io/nf8bx/). All measures used can be obtained from the HealthMeasures website at healthmeasures.net or modified therefrom, with modifications described in the manuscript. Study data and analysis code are not publicly available.
Competing Interests
We have no conflicts of interest to disclose.
Ethics Approval
This study was performed in line with the principles of the Declaration of Helsinki. We received approval from Northwestern University’s Institutional Review Board. Study materials and data are not publicly available.
Consent to Participate and Publish
All participants provided informed consent to participate in this research. No individual person’s data are presented, only aggregated results across participants.
The actual item parameters used for PROMIS scoring are proprietary and could not be included. To obtain these parameters, contact HealthMeasures.net. Perturbation was conducted to mask the actual values by adding noise generated, separately for each item, from a random uniform distribution with limits of -.25 and .25.
References
1. Dorans NJ, Pommerich M, & Holland PW (2007). Linking and aligning scores and scales. New York, NY: Springer.
2. Feuer MJ (2005). E pluribus Unum: Linking tests and democratic education. In Measurement and research in the accountability era (pp. 173–192). Routledge.
3. Huggins AC, & Penfield RD (2012). An NCME instructional module on population invariance in linking and equating. Educational Measurement: Issues and Practice, 31(1), 27–40.
4. Curran PJ, & Hussong AM (2009). Integrative data analysis: the simultaneous analysis of multiple data sets. Psychological Methods, 14(2), 81.
5. Choi SW, Schalet B, Cook KF, & Cella D (2014). Establishing a common metric for depressive symptoms: Linking the BDI, CESD, and PHQ9 to PROMIS depression. Psychological Assessment, 26(2), 513.
6. Kaat AJ, Newcomb ME, Ryan DT, & Mustanski B (2017). Expanding a common metric for depression reporting: Linking two scales to PROMIS® depression. Quality of Life Research, 26(5), 1119–1128.
7. Blackwell CK, Tang X, Elliott AJ, Thomes T, Louwagie H, Gershon R, … & Cella D (2021). Developing a common metric for depression across adulthood: Linking PROMIS depression with the Edinburgh Postnatal Depression Scale. Psychological Assessment, 33(7), 610.
8. Holland PW (2007). A framework and history for score linking. In Linking and aligning scores and scales (pp. 5–30). New York, NY: Springer New York.
9. Bentler PM (2017). Specificity-enhanced reliability coefficients. Psychological Methods, 22(3), 527.
10. Kallen MA, Lai JS, Blackwell CK, Schuchard JR, Forrest CB, Wakschlag LS, & Cella D (2022). Measuring PROMIS® global health in early childhood. Journal of Pediatric Psychology, 47(5), 523–533.
11. Forrest CB, Bevans KB, Pratiwadi R, Moon J, Teneralli RE, Minton JM, & Tucker CA (2014). Development of the PROMIS® pediatric global health (PGH-7) measure. Quality of Life Research, 23, 1221–1231.
12. Cella D, Blackwell CK, & Wakschlag LS (2022). Bringing PROMIS to early childhood: Introduction and qualitative methods for the development of early childhood parent report instruments. Journal of Pediatric Psychology, 47(5), 500–509.
13. Lai J-S, Kallen MA, Blackwell CK, Wakschlag LS, & Cella D (2022). Psychometric considerations in developing PROMIS® measures for early childhood. Journal of Pediatric Psychology, 47(5), 510–522.
14. Albano AD (2016). equate: An R Package for Observed-Score Linking and Equating. Journal of Statistical Software, 74(8), 1–36. doi: 10.18637/jss.v074.i08
15. R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
16. Huber PJ (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, CA: University of California Press.
17. White H (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48, 817–838. doi: 10.2307/1912934
18. Rosseel Y (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1–36. doi: 10.18637/jss.v048.i02
19. Vandenberg RJ, & Lance CE (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70.
20. Dorans NJ, & Holland PW (2000). Population invariance and the equitability of tests: Basic theory and the linear case. Journal of Educational Measurement, 37, 281–306.
21. Dorans NJ, & Feigenbaum MD (1994). Equating issues engendered by changes to the new SAT and PSAT/NMSQT. In Lawrence IM, Dorans NJ, Feigenbaum MD, Feryok NJ, Schmitt AP, & Wright NK (Eds.), Technical issues related to the introduction of the new SAT and PSAT/NMSQT (ETS Research Memorandum No. RM-94-10). Princeton, NJ: Educational Testing Service.
22. Lakens D (2013). Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863.
23. Cohen J (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.
