Assessing the Interchangeability of Linked Scores in Multivariable Statistical Analyses

Maxwell Mansolf; Courtney K Blackwell; David Cella; Jin-Shei Lai

doi:10.1007/s11136-023-03592-x

Author manuscript; available in PMC: 2025 Apr 1.

Published in final edited form as: Qual Life Res. 2024 Jan 31;33(4):1121–1131. doi: 10.1007/s11136-023-03592-x

PMCID: PMC10978247 NIHMSID: NIHMS1968702 PMID: 38294666

Abstract

Purpose:

Using the lens of classical test theory, we examine a linkage’s generalizability with respect to use in multivariable analyses, including multiple regression and structural equation modeling, rather than comparison of established subpopulations as is most common in the literature.

Methods:

To aid in this evaluation, we present a structural-equation-modeling based statistical method to examine the suitability of a given linkage for use cases involving continuous and categorical variables external to the linkage itself.

Results:

Using the PROMIS® Parent Proxy and Early Childhood Global Health measures, we show that, although a high correlation between the scores (here, $r = .829$) may imply a general suitability for linking, a more detailed investigation of content, measurement structure, and results of the proposed methodology reveals important differences between the measures which can compromise interchangeability in certain use cases.

Conclusion:

In addition to the statistical quality of a linkage, users of linking methodology should also assess the question of whether the linkage is appropriate to apply to particular use cases of interest.

Keywords: Test linking, psychometrics, structural equation modeling, generalizability

Plain English summary

Statistical transformations, called linkages, allow scores on one measure, for example a questionnaire about global physical and mental health, to be compared to scores on another similar measure. However, whether scores can be treated interchangeably with each other in all cases is a generally unanswered question. We describe a method to examine whether linked scores can be treated interchangeably in particular cases, specifically when included in statistical analyses involving other variables collected (e.g., demographic variables). We apply this test to two measures of global health, revealing use cases in which these measures are essentially equivalent and cases where they are not. This research will help increase the statistical rigor of linkages, allowing researchers to assess when and whether a linkage is useful for their purposes.

Modern score linking applications have extended beyond their origins in academic ability testing into other areas [1], including person-reported outcome (PRO) measures of quality of life. In multi-site research programs, these linkages are often used to harmonize variables for cross-sectional and longitudinal research. The goal of this manuscript is to statistically examine the validity of linkages in this context. In addition to standard psychometric evaluation of the linkages (e.g., correlation between scores), we propose criteria based on the use cases for the resulting scores, with the idea that the same linkage may be suitable for some purposes and unsuitable for others. Here, we echo the sentiments of [2] that “…perfect comparability is most likely neither feasible nor desirable, and that rational policymakers need to satisfy themselves with results that provide reasonable estimates (emphasis in original) rather than conclusive answers”.

Use cases involving comparison of observed scores on a measure and scores linked to that measure’s metric for an individual, and between subpopulations, have been well-studied in the educational assessment literature (e.g., [3]). In contrast, applications of linked scores to integrative data analysis (IDA; [4]), or the simultaneous analysis of multiple data sets, have received virtually no attention, and we focus on this application in this paper. In IDA, linkages are used to provide common metrics for measures of the same construct administered in separate samples. One such example is the Patient-Reported Outcomes Measurement Information System® (PROMIS®) Depression common metric [5-7], which provides comparable scores across several extant depression questionnaires by aligning scores onto the same PROMIS T-score metric, an item response theory based scoring system with mean = 50 and standard deviation = 10. Linkages can also be used to place scores onto metrics with existing norms, allowing interpretation with respect to the normed metric. These use cases require additional tools to evaluate, and in this work, we propose one such tool: a structural-equation-modeling (SEM) based test of whether a particular linkage is appropriate for interpretation within a multivariable use case involving an external variable.

Internal and External Evaluation of Linkages

Typically, correlations between scores on two measures are used to judge suitability for linking. One example result of such an assessment is the ability to classify a linkage as an equating, meaning that scores on the two measures can be treated as interchangeable. Criteria for establishing equating are based on the correlation between scores of measures (traditionally, using $r = .866$ as the cutoff) and equality of reliability between the two measures [8]. This internal evaluation is, of course, critically important to evaluating the utility of a linkage, but only directly generalizes to the comparison of scores or groups of scores in isolation.
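As background for the .866 cutoff, one common rationale (an assumption brought in here, not something stated in the text above) is that it is the correlation at which predicting one score from the other cuts the standard error of prediction in half, i.e., $1 - \sqrt{1 - r^2} = .5$. A minimal Python sketch, with a hypothetical helper name, compares that cutoff with the EC-PP correlation reported in this paper:

```python
import math


def reduction_in_uncertainty(r: float) -> float:
    """Proportional reduction in the standard error of prediction: 1 - sqrt(1 - r^2)."""
    return 1.0 - math.sqrt(1.0 - r * r)


cutoff = reduction_in_uncertainty(0.866)    # traditional equating cutoff
observed = reduction_in_uncertainty(0.829)  # EC-PP correlation reported in this paper
print(f"r = .866 -> {cutoff:.3f}; r = .829 -> {observed:.3f}")
```

On this yardstick the observed correlation falls somewhat short of the traditional cutoff, which is consistent with treating the linkage as useful but not as an equating.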

External evaluations of linkages are generally oriented towards the example use cases mentioned above (see also [3]), which generally assess differences between linkages derived within one or more subpopulations and evaluate those linkages either conditional on a particular score (e.g., revealing negative bias in one group at the low end of the score range), or on the full range of scores (e.g., revealing negative bias in general). Because these methods depend on accurate estimation of the linkage within each subgroup, they may perform poorly when a subgroup has few individuals. It is also not clear how to apply these methods to continuous variables. Historically, the subgroups of interest have been large and categorical, for example males versus females, so these limitations have mattered little. For integrative data analysis, and multivariable analysis in general, additional tools are needed which account for these limitations.

Motivation: Classical Test Theory

To introduce terminology for the score linking context, consider a case in which transformed scores on one measure, which we call linked scores to specify that the linking transformation has been applied, are treated as proxies for observed scores on the other measure, which we call actual scores. These scores are analogous to the classical test theory (CTT) observed score, the score observed in practice which contains measurement error; and true score from a theoretical test with no measurement error. In CTT, measurement error is assumed to be independent not only of the true score, but of all other variables, which we call external variables with respect to their relationship to the measurement model. In score linking, the correlation between linked and actual scores will never be unity. Therefore, if the linked score is to be treated as a proxy for the actual score, one must similarly assume that the remaining variance in each score after accounting for the other is unrelated to any external variables of interest.

A Test for Appropriate Use Cases of Linked Scores

A score linking study in which the two measures are co-administered enables a test of whether variance in each of the linked and actual scores, after accounting for variance shared with the other score, is systematically related to external variables also collected. Let $x$ be the actual score on whose metric inference is desired, and $w$ the linked score proposed to be interchangeable with $x$ . Note that, assuming the linking was successful, $x$ and $w$ are on the same metric, as $w$ is not the score on the linked-from measure itself, but the transformed version of that score after linking. We define interchangeability of $x$ and $w$ as indicating that differences between $x$ and $w$ are attributable only to linking error, i.e., variance unrelated to external variables. Without loss of generality, assume both $x$ and $w$ are standardized; then, this property can be written as

$$x = r_{x w} w + e \tag{1}$$

where $r_{x w}$ is the correlation between $x$ and $w$, and $e$ is an error term orthogonal to $w$. Because $r_{x w}$ is not equal to one, the question arises as to whether the error term, representing variance in $x$ unaccounted for by $w$, contains meaningless statistical noise, meaningful variance with respect to other variables to be used in analysis, meaningful variance unrelated to those other variables, or some combination thereof.

With no other variables, Equation (1) will always reproduce the only modeled correlation $r_{x w}$ ; however, when additional variables are considered, this is no longer true, as Equation (1) implies constraints on correlations of $w$ and $x$ with other variables. Let $z$ represent an external variable of interest, such as a predictor, outcome, or covariate which is intended to be jointly modeled with $x$ using $w$ as a substitute for $x$ . Then, assuming Equation (1) still holds, the correlation between $x$ and $z$ can be written as

$$r_{x z} = r_{w z} r_{x w} \tag{2}$$

Here, $r_{x z}$ and $r_{w z}$ denote correlations between $x$ and $z$ and between $w$ and $z$ , respectively, with $r_{x w}$ as above. If Equation (2) holds, because the correlation $r_{x w}$ is expected to be positive, the only differences between $r_{x z}$ and $r_{w z}$ would be due to measurement error, quantified by 1 minus $r_{x w}$ . In this case, inferences made on the linked score $w$ would be comparable, aside from measurement error, to inferences which would have been made on $x$ had $x$ been observed, rendering $x$ and $w$ interchangeable. Defined in this way, differences between $r_{w z}$ and $r_{x z}$ over and above those reflecting $r_{x w}$ indicate lack of interchangeability, specifically that the “error” term in Equation (1) reflects, at least partially, meaningful variance with respect to use cases involving $z$ .
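The step from Equation (1) to Equation (2) can be spelled out as follows. This is a sketch that assumes $x$, $w$, and $z$ are all standardized, so that covariances and correlations coincide:

```latex
\begin{aligned}
r_{x z} &= \operatorname{Cov}(x, z)
         = \operatorname{Cov}(r_{x w} w + e,\; z) \\
        &= r_{x w} \operatorname{Cov}(w, z) + \operatorname{Cov}(e, z) \\
        &= r_{x w}\, r_{w z} + \operatorname{Cov}(e, z).
\end{aligned}
```

Equation (2) therefore holds exactly when $\operatorname{Cov}(e, z) = 0$, and the discrepancy $r_{x z} - r_{x w} r_{w z}$ equals the covariance between the external variable and the part of $x$ that $w$ does not explain.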

Interchangeability can be tested using the path model in Figure 1; here, the directed arrow between $x$ and $w$ implies Equation (1), while the omission of the dotted two-way path between $x$ and $z$ implies Equation (2) and adding the dotted path yields a saturated model. Omitting the dotted path and conducting the omnibus model-fit test constitutes a test of interchangeability of the actual score $x$ and the linked score $w$ in multivariable analyses involving $z$ . A similar test can be conducted where the roles of $x$ and $w$ are switched; similarly, failing this test reveals that variance in $w$ , after accounting for $x$ , is systematically related to $z$ . While there are many frameworks in which difference between the two sides of Equation (2) can be tested, structural equation modeling enables out-of-the-box applications to non-Pearson correlations (e.g., polyserial correlations) using non-normal or non-continuous $z$ , inclusion of additional external variables besides $z$ in the same test, and/or inclusion of more complex models in place of $z$ , yielding a flexible set of interchangeability tests. The results of this test generalize to analytic methods whose results are functions of correlations and covariances, including multiple regression and SEM.

*Note*. $x$ represents scores on the metric to which scores are linked. $w$ represents linked scores on a different measure, linked to the metric of $x$ . $z$ represents an external variable to the linkage, such as a demographic or predictor/criterion variable in a multivariable analysis. $x$ is represented as a circle, consistent with its unobserved nature in linking applications.
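For a single external variable, the constraint tested by omitting the dotted path is equivalent to a zero partial correlation between $x$ and $z$ given $w$. The Python sketch below computes the corresponding normal-theory likelihood ratio statistic on 1 degree of freedom from three correlations. It assumes multivariate normality and Pearson correlations, so it is a simplified stand-in for, not a reproduction of, a robust SEM-based test; the function name and the input values are hypothetical.

```python
import math


def interchangeability_test(r_xw: float, r_xz: float, r_wz: float, n: int):
    """Normal-theory 1-df test that x and z are unrelated once w is accounted for.

    Returns (discrepancy, partial_r, chi2, p).
    """
    # Difference between the two sides of Equation (2).
    discrepancy = r_xz - r_xw * r_wz
    # Partial correlation of x and z given w.
    partial_r = discrepancy / math.sqrt((1 - r_xw ** 2) * (1 - r_wz ** 2))
    # Likelihood ratio statistic for dropping the x-z path from the saturated model.
    chi2 = max(0.0, -n * math.log(1 - partial_r ** 2))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return discrepancy, partial_r, chi2, p


# Hypothetical correlations: the first set satisfies Equation (2), the second does not.
consistent = interchangeability_test(r_xw=0.83, r_xz=-0.249, r_wz=-0.30, n=386)
violated = interchangeability_test(r_xw=0.83, r_xz=-0.40, r_wz=-0.30, n=386)
print(consistent)
print(violated)
```

Swapping the roles of $x$ and $w$ in the call gives the test in the other direction.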

When this test is significant, the question arises as to how much the resulting correlations and subsequent interpretation may differ. To assess this, one would add the residual correlation between $x$ and $z$ to the model in Figure 1 (dotted line) and re-estimate the model. This approach is similar to that of Bentler’s theory [9] in estimating reliability while partitioning item-specific variance into measurement error and variance unique to each item in a multi-item scale but also shared with external covariates. Alternatively, one could estimate the model in Figure 1 with the dotted line included and conduct a Wald test for significance of the residual correlation.

We conducted a simulation study to evaluate the Type I error, power, and rate of reversed sign (positive versus negative) for correlations of $z$ with $x$ versus $w$ . To conserve space and allow further discussion of the real-data example, we present the method, results, and discussion of this simulation study in Supplemental Materials.

Example: Linking Parent Proxy and Early Childhood PROMIS Global Health

To demonstrate this methodology, we evaluated two highly related measures: PROMIS Early Childhood Parent Report Scale v1.0 - Global Health 8a (EC) [10] and PROMIS Parent Proxy Scale v1.0 - Global Health 7 (PP) [11]. Between these two measures, three items are identical, both are administered to a parent, and both use the same 5-point Likert scale. Both EC and PP are scored using state-of-the-science item response theory (IRT)-based methods, specifically the unidimensional graded response model, and are reported using IRT-scaled, norm-referenced T scores with respect to the general population (here, of parents reporting on their children) in the United States. While the PROMIS adult Global Health measures report separate scores for mental and physical health, pediatric measures include only a single score spanning both physical and mental health. The EC measure was designed to enable age-appropriate assessment of global health in children ages 1 to 5, while the PP version was designed likewise for ages 5 to 17. Table 1 contains item content and perturbed PROMIS item slope parameters¹ for the two measures; this item content and the measures themselves are publicly available from the HealthMeasures website at healthmeasures.net. Notably, the first three items overlap: in general, would you say your child’s health is [excellent to poor]; in general, how would you rate your child’s quality of life?; and in general, how would you rate your child’s physical health?

Table 1.

Item text and perturbed slope parameters from PROMIS calibration

Item Text	EC Slope	PP Slope
In general, would you say your child's health is:	2.46	4.54
In general, would you say your child's quality of life is:	2.10	2.95
In general, how would you rate your child's physical health?	2.36	4.71
In general, how would you rate your child's mental health?	3.64
How would you rate your child's mood?	2.48
How would you rate your child's social skills?	1.81
How would you rate your child's ability to think?	2.77
How well is your child meeting developmental milestones?	2.70
In general, how would you rate your child's mental health, including mood and ability to think?		2.11
How often does your child seem really sad?		0.64
How often does your child have fun with other children?		1.03
How often does your child feel that you listen to his or her ideas?		1.12

Note. EC = PROMIS Early Childhood Global Health. PP = Parent Proxy Global Health. Perturbation was conducted to mask the actual values by adding random noise generated, separately for each parameter, from a random uniform distribution with limits of −.25 and .25. Items used in Op4G data had the text in strikethrough font removed and replaced with the underlined text for developmental appropriateness in the sample of parents of 1-5-year-olds .
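The perturbation described in the note can be sketched in a few lines of Python. The function name, the seed, and the input values are hypothetical; the actual PROMIS parameters are not reproduced here.

```python
import random


def perturb(params, half_width=0.25, seed=None):
    """Add independent Uniform(-half_width, half_width) noise to each parameter."""
    rng = random.Random(seed)
    return [p + rng.uniform(-half_width, half_width) for p in params]


true_slopes = [2.0, 3.0, 4.0]  # hypothetical values, not real PROMIS slopes
masked = perturb(true_slopes, seed=1)
print(masked)
```

Because each masked value lies within .25 of its source, the ordering of clearly separated slopes (for example, 2.46 versus 4.54 in Table 1) is preserved, while close values may swap order.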

Analysis Sample

Data for this study were collected by an internet panel company, Op4G, as part of the larger initiative to develop, validate, and norm a suite of PROMIS EC measures [12-13]. For an overview of the sample, see [13], and for a brief summary for the purposes of these analyses, see Supplemental Methods. Table 2 contains descriptive statistics for the Op4G data; see Supplemental Methods for information on missingness rates by variable.

Table 2.

Descriptive statistics for Op4G Data

Characteristic	N = 386¹
Respondent Female	293 (76%)
Child Characteristics
Age	4.48 (0.36)
Female	176 (46%)
Hispanic	63 (16%)
White	310 (80%)
Black or African American	67 (17%)
American Indian or Alaska Native	13 (3.4%)
Asian	12 (3.1%)
Native Hawaiian or other Pacific Islander	4 (1.0%)
Other Race	11 (2.8%)
Allergies	106 (27%)
Asthma	50 (13%)
Gross Motor Disorder	20 (5.2%)
Fine Motor Disorder	19 (4.9%)
Anxiety	31 (8.0%)
ADHD	58 (15%)
ASD	32 (8.3%)
Behavioral or Conduct Problems	30 (7.8%)
Developmental Delay	27 (7.0%)
Speech Disorder	36 (9.3%)
PROMIS T Scores
Parent Proxy (PP) Global Health	51.6 (10.5)
Early Childhood (EC) Global Health	46.0 (10.6)
Linked EC to PP Global Health	51.5 (10.8)
Sleep Impairment 4-item SF	55.9 (10.2)
Sleep Disturbance 4-item SF	54.8 (10.9)
Relationship 6-item SF	50.1 (10.1)
Relationship Caregiver SF	49.4 (11.4)
Relationship Family SF	48.7 (9.2)
Relationship Peer SF	50.9 (10.5)
Sleep Full-length bank	56.4 (10.2)
Physical Activity Bank	52.0 (9.3)
Relationship Full-length bank	50.8 (12.6)
Irritability Full-length Bank	48.6 (10.9)
Anxiety Full-length Bank	52.9 (10.8)
Depression full-length Bank	53.2 (10.0)
Curiosity 6a	48.8 (10.4)
Persistency 6a	52.3 (12.0)
Positive Affect Full-length Bank	49.0 (11.1)
Flexibility 5a	50.7 (12.4)
Frustration Tolerance 6a	52.9 (12.0)
Neighborhood Characteristics
Heavy Traffic	2.68 (1.35)
Stranger Danger	1.92 (1.11)
Road Safety	2.19 (1.17)
No Lights	3.02 (1.39)
Cross Roads	2.76 (1.42)
No Transit	2.62 (1.39)

Note. ¹Mean (SD) are presented for continuous variables, while n (%) are presented for categorical variables. ²Wilcoxon rank sum test was used for continuous variables; Pearson's Chi-squared test was used for categorical variables with all expected cell counts ≥5; Fisher’s exact test was used for categorical variables with any expected cell count <5. See Supplemental Methods and Supplemental Table 2 for information on missingness rates by variable.

Psychometric Comparison of EC and PP Global Health

The correlation between scores on EC and PP Global Health was estimated as .829 in this sample. Mean (SD) of EC Global Health T scores was 46.0 (10.6) and for PP was 51.6 (10.5).

From the slope parameters in Table 1, observe that items measuring physical health (In general, would you say your child’s health is:, In general, how would you rate your child’s physical health?) have the highest loadings in PP Global Health, while items measuring mental health (In general, how would you rate your child’s mental health?, How would you rate your child’s ability to think?) have the highest loadings in EC Global Health. This difference suggests that, although both forms measure both mental and physical health, T scores for EC Global Health are more a function of mental health, while T scores for PP Global Health are more a function of physical health, a substantive difference in the content of the measures.

We received approval from our institution’s Institutional Review Board. Study materials and data are not publicly available.

Equipercentile Linking

Equipercentile linking with loglinear smoothing was conducted in the equate package [14] in R [15] to link EC Global Health T scores to the metric of PP Global Health T scores and vice versa; see Supplemental Methods for additional details. After linking, the correlation between linked EC to PP scores and observed PP scores was .834, while the correlation between the original and linked EC scores was .993 (Figure 2). Similar results can be seen in Figure 2 for the PP to EC conversion. Crosswalk tables derived herefrom can be found in the OSF repository (https://osf.io/nf8bx/).
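To make the idea concrete, the following Python sketch implements a bare-bones, unsmoothed equipercentile link: a score on one measure is mapped to the score on the other measure that has the same percentile rank. It is an illustration only. It omits the loglinear smoothing used in this study, it is not the algorithm of the equate package, and the function names and toy scores are hypothetical.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(sorted_scores, value):
    """Mid-rank percentile rank (between 0 and 1) of value in a sorted sample."""
    below = bisect_left(sorted_scores, value)
    at_or_below = bisect_right(sorted_scores, value)
    return (below + at_or_below) / (2 * len(sorted_scores))


def quantile(sorted_scores, p):
    """Linearly interpolated quantile that inverts percentile_rank for untied samples."""
    n = len(sorted_scores)
    # Position on the mid-rank scale, clamped to the observed range.
    pos = min(max(p * n - 0.5, 0.0), n - 1.0)
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_scores[lo] + frac * (sorted_scores[hi] - sorted_scores[lo])


def equipercentile_link(from_scores, to_scores):
    """Return a function mapping a score on the 'from' measure to the 'to' metric."""
    from_sorted = sorted(from_scores)
    to_sorted = sorted(to_scores)

    def link(score):
        return quantile(to_sorted, percentile_rank(from_sorted, score))

    return link


# Toy co-administration data (hypothetical T scores).
ec_scores = [40, 45, 50, 55, 60]
pp_scores = [46, 51, 56, 61, 66]
ec_to_pp = equipercentile_link(ec_scores, pp_scores)
print(ec_to_pp(50))
```

With real data, the discreteness of the score distributions makes some form of presmoothing or postsmoothing advisable before tabulating a crosswalk.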

Use Case Tests

Method

For each external variable in Table 2, separate tests were conducted treating EC and PP Global Health as $x$ and $w$ and as $w$ and $x$ , where $w$ was considered after being linked to the metric of $x$ . When this test was significant for categorical variables (e.g., indicators of emotional and behavioral disorders), we subsequently calculated mean scores for each category for each of the EC and PP Global Health T scores. For significant continuous variables (e.g., other PROMIS T scores), we estimated the model with the dotted line included in Figure 1 and obtained estimates of the residual correlations between $x$ and $z$ . Use case tests were conducted as likelihood ratio tests comparing the model without the dotted line in Figure 1 to a saturated model using the robust maximum likelihood ratio test with “Huber-White” standard errors [16-17] using the lavaan package [18] in R; this test was used to account for known (for dichotomous variables) and potential (for other variables) non-normality in the modeled variables.

Results and Discussion

Because interchangeability tests were conducted twice for each external variable, in these results we use EC metric to refer to tests which evaluate the interchangeability of linked scores from PP Global Health, transformed to the EC metric using derived linking functions, with actual EC scores, as the EC score metric is then common to both tested variables. Similarly, we use PP metric to refer to tests which evaluate the interchangeability of linked scores from EC Global Health with actual PP Global Health scores. The results of these tests, where significant for either metric, are reported in Table 3, while results for all variables are reported in Supplemental Table 4. Looking at the first row of Table 3, child developmental delay was the most statistically significant external variable for the EC metric; that is, when the actual EC Global Health score was treated as $x$ in Figure 1, the linked PP Global Health score (now on the EC metric) was treated as $w$, and the binary child developmental delay indicator was treated as $z$, the omnibus chi-square test of the model in Figure 1, excluding the dotted path between $x$ and $z$, revealed significant model misfit ($p = 6.7 \times 10^{-14}$). Continuing with the EC metric, the next most statistically significant violations of interchangeability occurred when the PROMIS Social Relationships full-length bank and the Peer Relationships short form were treated as the external variable $z$ in Figure 1. Fewer significant results were found for the PP metric, with the lowest p-values for PROMIS Positive Affect, Depression, and child developmental delay. Table 3 also contains estimates from the model with the dotted line included in Figure 1; returning to the first row for child developmental delay, Est. = −.045 in Table 3 indicates that, when the EC metric was evaluated, the estimated residual correlation between EC Global Health and child developmental delay was −.045.

Looking next at other T scores, for the EC metric, the largest residual correlation between EC Global Health and an external variable was for the Social Relationships full-length bank at .17, while for the PP metric, the largest such correlation was for Positive Affect at .12. While effect sizes may be small, such residual correlations could flip the sign of important estimates in such analyses, leading to inaccurate conclusions.

Table 3.

Significant p-values and estimated residual correlations between external variables and Early Childhood (EC) and Parent Proxy (PP) Global Health

	EC Metric		PP Metric
Variable	p	Est.	p	Est.
Child Characteristic
Developmental Delay	.000000000000067	−.045	.0026	.020
ASD	.000068	−.027	.46	.0057
ADHD	.00041	−.038	.88	−.0017
Asian	.00090	−.0084	.72	.00094
Speech Disorder	.0066	−.021	.90	.00097
Behavioral or Conduct Problems	.013	−.023	.98	.00022
Fine Motor Disorder	.015	−.019	.84	−.0017
Asthma	.34	.0080	.025	−.021
Gross Motor Disorder	.041	−.018	.94	.00064
PROMIS T Score
Social Relationships Full-length Bank	.0000031	.17	.12	.050
Social Relationships - Peer 4a	.0000051	.16	.081	.052
Positive Affect Full-length Bank	.037	.077	.00019	.12
Frustration Tolerance 6a	.00075	.13	.36	.033
Depression Full-length Bank	.19	−.047	.0019	−.11
Persistence 6a	.0022	.12	.18	.045
Flexibility 5a	.0039	.13	.96	.0022
Social Relationships - Child-Caregiver Interactions 5a	.0051	.13	.19	.047
Irritability Full-length Bank	.018	−.092	.21	−.045
Sleep Problems - Disturbance 4a	.021	−.11	.23	−.055
Social Relationships - Family 4a	.073	.080	.027	.085
Curiosity 6a	.027	.085	.17	.046
Sleep Problems Full-length Bank	.033	−.11	.19	−.067
Sleep Problems - Impairment 4a	.043	−.088	.086	−.075

Note. Est. = Estimated residual correlation between Variable and the actual score used in the test. Metric = metric on which the test was conducted, where EC (Early Childhood Global Health) evaluates linked PP (Parent Proxy Global Health) to EC scores as substitutes for actual EC scores, while PP evaluates linked EC to PP scores as substitutes for actual PP scores. PROMIS short form suffixes (e.g., 6a for Frustration Tolerance 6a) indicate specific PROMIS short form numbers as indexed on the HealthMeasures website, with the number (e.g., 6 in 6a) indicating the number of items. Results with p < .05 are presented in bold.

We next examined mean differences in linked and observed T scores for categorical disorder variables, which can be found in Figure 3 (Supplemental Table 5). For children diagnosed with developmental delay, linked T scores differed on average between EC and PP Global Health by 5.88 T score points (PP to linked PP) and 5.43 points (EC to linked EC), or more than half a standard deviation. For children without developmental delay, these T score differences were 0.4 and 0.16 units, or one twenty-fifth of a standard deviation or less. For other “mental” disorders, including motor disorders, PP scores tended to be higher for diagnosed individuals than linked EC scores, while for asthma, the opposite trend was observed, and for non-diagnosed individuals scores did not differ by more than half of a T score point (one twentieth of a standard deviation) in any case.

*Note*. *Dev. Delay* = developmental delay; *ASD* = autism spectrum disorder; *ADHD* = attention deficit hyperactivity disorder; *Beh. Or Con. D.* = Behavioral or Conduct Disorder; EC = Early Childhood Global Health; PP = Parent Proxy Global Health. *Crosswalked* scores were converted to the indicated metric from their original metric using the equipercentile linking function.

Looking at the patterns of significance in Table 3 and Supplemental Table 4, all variables with significant residual correlations are either PROMIS T scores or child diagnoses, except for Asian ethnicity. PROMIS T scores for Anxiety and Physical Activity, neighborhood conditions, and all remaining child and respondent characteristics were not significant. Thus, this test suggests that multivariable analyses involving this set of non-significant variables will be unaffected (except by linking error) by using linked scores in lieu of actual scores.

Lastly, to investigate whether specific items may be leading to this lack of interchangeability for children with developmental delay, we calculated mean item scores in children with and without developmental delay (Table 4). Three of the four items with the largest difference between children with and without developmental delay were in the EC form only, and all four related to mental health. Importantly, How well is your child meeting developmental milestones? had the largest difference of 1.58 raw score points lower for children with developmental delay. Failure to meet developmental milestones is one of the more common indicators leading to a diagnosis of developmental delay; thus, this finding of non-interchangeability is consistent with differences in item content between the measures.

Table 4.

Item mean scores for children with and without developmental delay

Item Text	Measure	No Dev. Delay	Dev. Delay	Difference
In general, would you say your child’s health is:	Both	4.42	3.62	−0.80
In general, would you say your child’s quality of life is:	Both	4.51	3.80	−0.71
In general, how would you rate your child’s physical health?	Both	4.49	3.68	−0.81
In general, how would you rate your child’s mental health?	EC Only	4.42	3.41	−1.02
How would you rate your child’s mood?	EC Only	4.13	3.32	−0.82
How would you rate your child’s social skills?	EC Only	4.08	2.96	−1.12
How would you rate your child’s ability to think?	EC Only	4.39	3.46	−0.94
How well is your child meeting developmental milestones?	EC Only	4.31	2.72	−1.58
In general, how would you rate your child's mental health, including mood and ability to think?	PP Only	4.40	3.27	−1.13
How often does your child seem really sad?	PP Only	4.03	3.42	−0.61
How often does your child have fun with other children?	PP Only	4.26	3.75	−0.51
How often does your child feel that you listen to his or her ideas?	PP Only	4.32	3.94	−0.38

Note. EC = Early Childhood Global Health; PP = Parent Proxy Global Health; Dev. Delay = Developmental Delay. Items used in Op4G data had the text in strikethrough font removed and replaced with the underlined text for developmental appropriateness in the sample of parents of 1-5-year-olds.
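The ranking of items described before Table 4 can be checked directly from its Difference column. The short Python sketch below uses abbreviated item labels (our shorthand, not the official item text) and the values exactly as reported in the table:

```python
# (abbreviated item label, measure, difference) from the Difference column of Table 4.
items = [
    ("child's health", "Both", -0.80),
    ("quality of life", "Both", -0.71),
    ("physical health", "Both", -0.81),
    ("mental health", "EC Only", -1.02),
    ("mood", "EC Only", -0.82),
    ("social skills", "EC Only", -1.12),
    ("ability to think", "EC Only", -0.94),
    ("developmental milestones", "EC Only", -1.58),
    ("mental health incl. mood and thinking", "PP Only", -1.13),
    ("seems really sad", "PP Only", -0.61),
    ("fun with other children", "PP Only", -0.51),
    ("feels listened to", "PP Only", -0.38),
]

# Most negative differences first, i.e., items on which children with
# developmental delay score lowest relative to children without it.
largest_four = sorted(items, key=lambda row: row[2])[:4]
ec_only_count = sum(1 for _, measure, _ in largest_four if measure == "EC Only")

for label, measure, diff in largest_four:
    print(f"{label:40s} {measure:8s} {diff:+.2f}")
print("EC-only items among the four largest differences:", ec_only_count)
```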

Discussion

In the applied example, although these scores are linkable, and often interchangeable, they are not always so. Using the proposed use-case test, we were able to determine when, rather than if, the linkage was appropriate, which we suggest is the more useful question. This approach is closely related to subgroup invariance analysis within item-level psychometrics (differential item functioning; [19]) and score-level test linking [20], but differs in its emphasis on practical importance for particular use cases. In this way, the proposed approach is most similar to the difference that matters paradigm (DTM; [21]), which evaluates a linkage with respect to a reporting scale (e.g., acceptance criteria for college admissions exams).

The proposed use-case evaluation provides estimates of the residual correlation of the actual score with the external variable, and which values can be considered ignorable depend, again, on the use case. To borrow an example from [22] a very small effect size for an intervention reducing suicidality would be a very important and life-saving finding, but if the effect size is smaller than the resulting residual correlation from the use-case test, relying on the associated linkage may erroneously portray a suicide-increasing intervention as suicide-preventive or vice versa. For larger effects, one can refer to Cohen's [23] criteria ( $r = .1$ , .3, and .5), wherein the goal would be to avoid misclassification of effect magnitude.
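As a rough illustration of this reasoning, the Python sketch below flags an effect whose magnitude does not exceed the residual correlation, and labels effect magnitudes against Cohen's benchmarks. Treating the residual correlation as a simple additive shift in the effect is a simplifying assumption made for illustration, and the function names and the effect sizes are hypothetical; only the residual correlation of .12 is taken from Table 3 (Positive Affect, PP metric).

```python
def cohen_label(r: float) -> str:
    """Label |r| against Cohen's conventional benchmarks of .1, .3, and .5."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "below small"


def sign_could_flip(effect: float, residual_r: float) -> bool:
    """True when the residual correlation is at least as large in magnitude as the effect."""
    return abs(residual_r) >= abs(effect)


residual = 0.12           # Positive Affect, PP metric (Table 3)
small_effect = 0.05       # hypothetical effect of interest
moderate_effect = 0.25    # hypothetical effect of interest

print(sign_could_flip(small_effect, residual))
print(cohen_label(moderate_effect), "->", cohen_label(moderate_effect + residual))
```

The second line of output illustrates how a shift of this size could move an effect from one benchmark category into the next even when its sign is not at risk.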

Limitations and Future Directions

While developers of linkages cannot know a priori all use cases which may apply, norms of the field and common extant usages of the linked measures can inform on which use case tests may be most valuable, thereby informing designers of linking studies on which external variables to also collect data on in addition to the to-be-linked measures. Similarly, users of linkages concerned about their particular use case may not find it represented exactly in any given linking study, but any information on score interchangeability can help inform whether linked scores might be useful, and this test remains valuable to the extent its results can provide such information.

The proposed method requires a co-administration design, unlike other methods presented in [3]; we focused on this design mainly because so many linking studies in PROs use it, probably because these measures tend to be less time-consuming for the respondent than educational tests like the SAT. Future work can extend these methods, and this philosophy of attending to use cases for linkages, to other designs as needed. For example, the current study did not incorporate item-level information into the use case test itself, instead examining item content and group differences post hoc; proper techniques for incorporating this additional information into the use case test itself could more directly inform researchers on non-interchangeability of items, with the potential to identify more (or less) interchangeable item subsets without post hoc analysis. If use cases related to covariates unmeasured in the linking study are of concern, latent class models can attempt to include and account for these unobserved variables.

Relatedly, when cases of item non-interchangeability are detected, one can consider omitting problematic items (e.g., the item in the EC measure assessing developmental milestones). However, this practice carries risks worth considering. For example, if the violation reflects a Type I error, then item omission based on a single study could lead to loss of information in others. Also, the practice of removing items from measures risks a proliferation of numerous “versions” of measures, rendering measure selection cumbersome. Consistent with the focus on use cases, we recommend researchers proceed as we did here: present their results on violations of interchangeability, and leave the decision of whether to omit items to the designers of future studies, who are best suited to compare their use cases to those assessed with the proposed methods, unless convergent research and theory support item omission in (a more limited number of) revised versions of measures.

Lastly, in IDA, the presence of some scores on each measure may ameliorate any deleterious impact of treating linked scores interchangeably with actual scores on the metric of the desired measure. For example, suppose EC and PP Global Health scores were available for different subsets of participants, and a harmonized variable was created by linking PP scores to the EC metric and treating them interchangeably with actual EC scores. Lack of interchangeability, for example when assessing relationships between global health and developmental delay, would then have less of an impact than revealed by this test, which assumes only linked scores are used. It seems, to the authors, realistic to assume that multiplying the estimated impact (here, the estimates in Table 3) by the proportion of linked rather than actual scores used would reasonably approximate the degree of bias introduced by using this harmonized variable rather than only EC scores. However, this impact may also depend on other model features, and future work should explore how best to treat these harmonized variables in the face of (potential) non-interchangeability.
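The proportional approximation described above can be written as a short calculation. The Python sketch below is illustrative only: the function name and the input values are hypothetical (they are not taken from Table 3), and, as noted in the text, the true impact may depend on other model features that this simple scaling ignores.

```python
def approximate_harmonization_bias(
    linked_only_estimate: float, n_linked: int, n_actual: int
) -> float:
    """Scale a use-case estimate by the share of linked scores.

    linked_only_estimate: impact estimated by a use-case test that
        assumes only linked scores are used.
    n_linked: number of participants contributing linked scores.
    n_actual: number of participants contributing actual scores.
    """
    if n_linked < 0 or n_actual < 0:
        raise ValueError("sample sizes must be non-negative")
    total = n_linked + n_actual
    if total == 0:
        raise ValueError("at least one score is required")
    proportion_linked = n_linked / total
    # Heuristic from the text: bias is roughly impact times proportion linked.
    return linked_only_estimate * proportion_linked


if __name__ == "__main__":
    # Hypothetical inputs: an impact of 0.10 with 250 linked scores
    # and 750 actual scores in the harmonized variable.
    print(approximate_harmonization_bias(0.10, n_linked=250, n_actual=750))
```

With these made-up inputs, one quarter of the scores are linked, so the approximate bias is one quarter of the linked-only estimate. When all scores are linked, the function returns the linked-only estimate unchanged, which matches the assumption of the use-case test.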

General Conclusion

The approach presented herein provides a tool for evaluating a linkage’s utility in use cases involving multivariable analyses. By addressing these practical questions directly, linking in nonstandard contexts such as quality of life research can come to represent not a “descent” of linking, as warned of in [1], but rather an equally rigorous application of the same methodology.

Supplementary Material

Supplement: NIHMS1968702-supplement-supplement.docx (2.2 MB, docx)

Funding

Research reported in this publication was supported by the Environmental influences on Child Health Outcomes (ECHO) program, Office of The Director, National Institutes of Health, under Award Number U24OD023319 with co-funding from the Office of Behavioral and Social Sciences Research (OBSSR; Person Reported Outcomes Core).

Footnotes

Crosswalk tables are publicly available at the APA OSF repository (url: https://osf.io/nf8bx/). All measures used can be obtained from the HealthMeasures website at healthmeasures.net or modified therefrom, with modifications described in the manuscript. Study data and analysis code are not publicly available.

Competing Interests

We have no conflicts of interest to disclose.

Ethics Approval

This study was performed in line with the principles of the Declaration of Helsinki. We received approval from Northwestern University’s Institutional Review Board. Study materials and data are not publicly available.

Consent to Participate and Publish

All participants provided informed consent to participate in this research. No individual person’s data are presented, only aggregated results across participants.

The actual item parameters used for PROMIS scoring are proprietary and could not be included. To obtain these parameters, contact HealthMeasures.net. Perturbation was conducted to mask the actual values by adding noise generated, separately for each item, from a random uniform distribution with limits of -.25 and .25.

References

1. Dorans NJ, Pommerich M, & Holland PW (2007). Linking and aligning scores and scales. New York, NY: Springer.
2. Feuer MJ (2005). E pluribus Unum: Linking tests and democratic education. In Measurement and research in the accountability era (pp. 173–192). Routledge.
3. Huggins AC, & Penfield RD (2012). An NCME instructional module on population invariance in linking and equating. Educational Measurement: Issues and Practice, 31(1), 27–40.
4. Curran PJ, & Hussong AM (2009). Integrative data analysis: the simultaneous analysis of multiple data sets. Psychological Methods, 14(2), 81.
5. Choi SW, Schalet B, Cook KF, & Cella D (2014). Establishing a common metric for depressive symptoms: Linking the BDI, CESD, and PHQ9 to PROMIS depression. Psychological Assessment, 26(2), 513.
6. Kaat AJ, Newcomb ME, Ryan DT, & Mustanski B (2017). Expanding a common metric for depression reporting: Linking two scales to PROMIS^® depression. Quality of Life Research, 26(5), 1119–1128.
7. Blackwell CK, Tang X, Elliott AJ, Thomes T, Louwagie H, Gershon R, … & Cella D (2021). Developing a common metric for depression across adulthood: Linking PROMIS depression with the Edinburgh Postnatal Depression Scale. Psychological Assessment, 33(7), 610.
8. Holland PW (2007). A framework and history for score linking. In Linking and aligning scores and scales (pp. 5–30). New York, NY: Springer New York.
9. Bentler PM (2017). Specificity-enhanced reliability coefficients. Psychological Methods, 22(3), 527.
10. Kallen MA, Lai JS, Blackwell CK, Schuchard JR, Forrest CB, Wakschlag LS, & Cella D (2022). Measuring PROMIS^® global health in early childhood. Journal of Pediatric Psychology, 47(5), 523–533.
11. Forrest CB, Bevans KB, Pratiwadi R, Moon J, Teneralli RE, Minton JM, & Tucker CA (2014). Development of the PROMIS^® pediatric global health (PGH-7) measure. Quality of Life Research, 23, 1221–1231.
12. Cella D, Blackwell CK, & Wakschlag LS (2022). Bringing PROMIS to early childhood: Introduction and qualitative methods for the development of early childhood parent report instruments. Journal of Pediatric Psychology, 47(5), 500–509.
13. Lai J-S, Kallen MA, Blackwell CK, Wakschlag LS, & Cella D (2022). Psychometric considerations in developing PROMIS^® measures for early childhood. Journal of Pediatric Psychology, 47(5), 510–522.
14. Albano AD (2016). equate: An R Package for Observed-Score Linking and Equating. Journal of Statistical Software, 74(8), 1–36. doi: 10.18637/jss.v074.i08
15. R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
16. Huber PJ (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, CA: University of California Press.
17. White H (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48, 817–838. doi: 10.2307/1912934
18. Rosseel Y (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1–36. doi: 10.18637/jss.v048.i02
19. Vandenberg RJ, & Lance CE (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70.
20. Dorans NJ, & Holland PW (2000). Population invariance and the equitability of tests: Basic theory and the linear case. Journal of Educational Measurement, 37, 281–306.
21. Dorans NJ, & Feigenbaum MD (1994). Equating issues engendered by changes to the new SAT and PSAT/NMSQT. In Lawrence IM, Dorans NJ, Feigenbaum MD, Feryok NJ, Schmitt AP, & Wright NK (Eds.), Technical issues related to the introduction of the new SAT and PSAT/NMSQT (ETS Research Memorandum No. RM-94-10). Princeton, NJ: Educational Testing Service.
22. Lakens D (2013). Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863.
23. Cohen J (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.

