Published in final edited form as: Multivariate Behav Res. 2020 Jan 20;56(1):3–19. doi: 10.1080/00273171.2019.1707061

Extrinsic Convergent Validity Evidence to Prevent Jingle and Jangle Fallacies

Oscar Gonzalez 1, David P MacKinnon 2, Felix B Muniz 2

Abstract

In psychology, there have been vast creative efforts in proposing new constructs and developing measures to assess them. Less effort has been spent in investigating construct overlap to prevent bifurcated literatures, wasted research efforts, and jingle-jangle fallacies. For example, researchers could gather validity evidence to evaluate if two measures with the same label actually assess different constructs (jingle fallacy), or if two measures with different labels actually assess the same construct (jangle fallacy). In this paper, we discuss the concept of extrinsic convergent validity (Fiske, 1971), a source of validity evidence demonstrated when two measures of the same construct, or two measures of seemingly different constructs, have comparable correlations with external criteria. We introduce a formal approach to obtain extrinsic convergent validity evidence using tests of dependent correlations and evaluate the tests using Monte Carlo simulations. Also, we illustrate the methods by examining the overlap between the self-control and grit constructs, and the overlap among seven seemingly different measures of the connectedness to nature construct. Finally, we discuss how extrinsic convergent validity evidence supplements other sources of evidence that support validity arguments of construct overlap.


The validation process is crucial for accurate interpretation of behavioral constructs because it provides evidence that researchers are measuring the constructs they claim they are measuring. The relations between a measure and external criteria could be used as a source of evidence to determine the interpretation of the measure (Messick, 1989). For example, validity evidence could be obtained by examining the extent to which a measure is highly correlated with measures of the same construct (convergent validity evidence), and examining the extent to which a measure is not highly correlated with measures that do not assess the same construct (discriminant validity evidence; Campbell & Fiske, 1959; Messick, 1989; Widaman, 1985). In the same vein, vast creative efforts have been devoted to discovering and developing specific theoretical constructs and measures. Less effort has been devoted to investigating construct overlap (Dawis, 1992; Gulliksen, 1968; Judge, Erez, Bono, & Thoresen, 2002). In this case, two constructs overlap if their measures are highly, if not perfectly, correlated and equally predict the same criteria. The study of construct overlap has been conceptualized in terms of jingle and jangle fallacies. Kelley’s (1927) jangle fallacy describes labelling two measures differently when they assess the same construct. On the other hand, the jingle fallacy describes the use of a common label for measures that assess different constructs (Kelley, 1927; Thorndike, 1904). Jingle-jangle fallacies hinder scientific communication among researchers when we use the same label for different psychological phenomena, and preclude knowledge convergence when we study the same psychological phenomenon but refer to it differently (Block, 2000).

If previously discovered constructs are not considered, the development of new constructs might lead to redundant measures, bifurcated literatures, and constructs without unique psychological importance (Dawis, 1992). Along the same line, Messick (1984) indicated that “… by failing to maintain the distinction between constructs and their indicants or measures, we are in danger of reverting to the jungle of operationism whereby test meaning resides in each investigator’s measurement operations rather than in validated relational or nomological networks” (p. 169). For example, Block (2000) described an instance of the jingle fallacy in the study of extraversion, where the colloquial definition of extraversion might differ from how researchers measure it, for example as impulsivity, sociability, or other facets. For the jangle fallacy, Block (2000) noted that even though the constructs of sensation-seeking, disinhibition, and novelty-seeking have their own measures, those measures are highly correlated, suggesting that the constructs might be more similar than different. Here, jingle and jangle fallacies could prevent researchers from communicating findings on extraversion research or consolidating findings on impulse monitoring.

Starting from the premise that we can learn about constructs by studying their measures, the overlap of two constructs could be examined by building a validity argument that supports the interpretation of two measures as the same. Researchers could use many sources of evidence to support validity arguments of this type. Researchers could study the content of the measures closely; use multitrait-multimethod matrices to investigate the relations of the measures with other constructs across methods; examine incremental prediction of their measures on an outcome; use factor analysis to determine the dimensionality of the measures; and replicate the relations in other samples. After collecting evidence, researchers could determine if the constructs overlap by evaluating how, or if, the evidence supports the validity argument that the measures are interpreted as the same. This paper explores extrinsic convergent validity (ECV), a source of validity evidence that can supplement current methods that examine construct overlap. ECV evidence is demonstrated when two measures of the same construct, or two measures of seemingly different constructs, are highly correlated and have comparable correlations with a set of criteria (Fiske, 1971). ECV is not a new approach, but it has been largely unexplored (Lubinski, 2006). For example, Fiske (1973) used ECV evidence to study the relations among twelve of Murray’s needs (e.g., achievement, authority, order; Murray, 1938) measured by three seemingly similar scales: the Edwards Personal Preference Schedule (Edwards, 1959), the Adjective Check List (Gough & Heilbrun, 1965), and the Personality Research Form (Jackson & Guthrie, 1968). The correlations among the needs differed across the three scales (Fiske, 1973), so treating them as measures of the same construct could suggest a jingle fallacy. In a more recent example, Lubinski (2004) used ECV evidence to evaluate if measures of literacy, reading comprehension, and vocabulary assessed the same construct. The three measures had similar correlations with measures of aptitude and academic interest, so assuming that the three measures assess different constructs could suggest a jangle fallacy.

In previous research, ECV evidence entailed visually comparing patterns of correlations between measures and criteria. This process was informal and subjective, so similar to other approaches (Thielmann & Hilbig, 2019; Westen & Rosenthal, 2003), a motivation of this paper is to quantify and formally test for ECV evidence. This paper proposes hypothesis tests for dependent correlations as a formal approach to test if the correlation patterns of two measures with criteria are the same, thus obtaining evidence that two constructs could overlap. In this paper, we first describe the argument-based approach to validity and how extrinsic convergent validity fits in this validity framework. Then, we introduce two approaches to formally collect ECV evidence: (1) hypothesis testing for dependent correlations with analytical or resampling approaches, and (2) a model comparison approach for dependent correlations in structural equation modeling. The approach is then illustrated with both empirical data and published summary information. Finally, we present simulation results to evaluate the empirical Type 1 error rates and statistical power of four approaches to test the difference of two dependent correlations. Overall, the goal of this paper is to encourage researchers to examine construct overlap and consider jingle-jangle fallacies before proposing a new construct.

An Argument-based Approach to Validity

Validity is considered the most important concept of psychometrics, but its definition has changed through the years. Validity has ranged from investigating what an assessment purports to measure (Kelley, 1927; Rulon, 1946) to the consideration of social consequences in score use (Messick, 1989). Validity theory has developed from considering that a score is a valid representation of whatever it correlates with (Cureton, 1951), to building a structured argument, backed up with evidence, to ascribe a valid interpretation or use to a score (Kane, 2006). Messick (1989) defined validity as “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inference and actions based on test scores or other modes of assessment” (p. 13). In other words, researchers validate the interpretation or use of a score (e.g., to predict or infer a target objective), not the measure itself, by collecting evidence to support that interpretation or use. The argument-based approach to validity is similar to an attorney building an argument before going to the courtroom: evidence in favor of the case and evidence against counterhypotheses need to be made explicit. Finally, validation is a continuous and cumulative process, and validity is a matter of degree, specific to the decision based on the scores.

In previous treatments of validity, different sources of evidence to support validity arguments were considered different types of validity (Cronbach, 1971). Content validity referred to the composition of the measure and how it reflected the intended construct. Criterion or predictive validity referred to how the scores from the measure correlate with concurrent or future outcomes. Construct validity referred to the theoretical construct that could explain the interrelationships among the measured items even when there is no outcome or gold standard available. Construct validity was studied with either factor-analytic models or with multitrait-multimethod matrices (Campbell & Fiske, 1959) to determine convergent or discriminant validity evidence based on correlation patterns of the scores with similar measures. In psychology one can still encounter discussions of these types of validity as if they were independent and part of a checklist, but they should be treated as sources of evidence to support a validity argument (Kane, 2006). In other words, content, construct, and predictive validity evidence are different aspects of validity, not different types of validity (i.e., there is only one validity, not three). As such, we use the term validity evidence in the rest of the paper to emphasize that distinction (AERA, APA, & NCME, 2014). Lastly, current validity theory has been criticized because it might make the validation process impractical (Markus & Borsboom, 2013) and because it assumes that psychological attributes can be measured (Michell, 2009).

As mentioned above, an approach to studying construct overlap would be to build a validity argument that the interpretation of two constructs is the same; that is, to collect validity evidence demonstrating that the constructs are indistinguishable (e.g., high correlations between their measures, content experts suggesting similar constructs, etc.). In terms of current validity theory, ECV supplements sources of validity evidence concerned with interpreting scores from two measures as assessing the same construct. When two measures correlate highly, this is evidence of convergent validity, as indicated by Campbell and Fiske (1959), which in turn suggests construct overlap. The extrinsic part of ECV means that not only must the measures correlate highly, they must also correlate equally with external criteria. If two measures with different names correlate highly and have the same correlations with criteria, then there is evidence of extrinsic convergent validity, evidence of construct overlap, and perhaps of a jangle fallacy. On the other hand, if two measures with the same label correlate highly (or not), but have different correlations with criteria, then there is a lack of extrinsic convergent validity evidence, a lack of construct overlap, and perhaps a jingle fallacy. So, construct overlap is closely related to convergent validity evidence. If ECV were the only source of evidence provided, the similarity of correlation patterns between the measures examined and several criteria could indicate that the scores are empirically interchangeable within a specific domain. We argue that establishing similarity of correlation patterns is necessary to support the interpretation that two measures assess the same construct. A high correlation between the measures and similar correlations with criteria suggest that whatever is unique to each measure does not contribute to the prediction of criteria. More sources of validity evidence could supplement the conclusions from ECV to strengthen the argument of construct overlap.

Extrinsic Convergent Validity Evidence

The rationale of ECV is that measures that target the same construct should correlate highly and have similar correlations with an external criterion (Lubinski, 2006). We refer to the set of correlations of a measure with criteria as a correlation profile.1 For context, columns 2 and 3 of Table 1 show two correlation profiles: one for grit (measured by the Short Grit Scale, Grit-S; Duckworth & Quinn, 2009) and another for self-control (measured by the Brief Self-Control Scale, BSCS; Tangney, Baumeister, & Boone, 2004). In this case, the correlation between the Grit-S scores and the BSCS scores is r = .727. According to Fiske (1971), “as a rule, measures of the same construct are not interchangeable empirically and must not be considered conceptually equivalent until a high degree of convergence in correlational patterns with other variables, or extrinsic convergent validity, has been demonstrated empirically” (p. 245). When two measures correlate, the variance of each measure could be split into variance shared between the measures and variance unique to each measure. The unique variance of a measure is composed of reliable variance specific to the measure (referred to as specificity) and random error. If measures correlate highly, but they have different correlations with an external criterion, then the differential prediction could be due to different reliabilities or different specificities in the measures (Judge et al., 2002; Shrout & Yip-Bannicq, 2017).

Table 1.

Correlation profiles of the Grit-S and BSCS and results for Williams’ test, the multivariate delta standard errors, percentile bootstrap confidence intervals, and effect sizes for dependent correlations.

Williams’ test Delta method Bootstrap

Variable BSCS GRIT-S cor. diff. Cohen’s q t-value p-value Δ s.e. z-value p-value 95% Boot L.L. 95% Boot U.L.
FFMQ-Awareness 0.646 0.669 −0.022 −.041 −0.977 0.329 0.033 −0.672 0.502 −0.086 0.041
FFMQ-Describe 0.421 0.439 −0.018 −.022 −0.616 0.533 0.032 −0.560 0.576 −0.081 0.057
FFMQ-Nonjudge 0.409 0.389 0.020 .024 0.688 0.492 0.032 0.627 0.531 −0.047 0.093
FFMQ-Nonreact 0.395 0.401 −0.006 −.007 −0.217 0.828 0.032 −0.188 0.851 −0.079 0.058
FFMQ-Observe 0.226 0.224 0.002 .002 0.043 0.966 0.032 0.062 0.951 −0.068 0.071
Future TP 0.221 0.213 0.008 .008 0.240 0.811 0.033 0.245 0.807 −0.059 0.066
Past Negative TP −0.490 −0.440 −0.050 −.064 −1.765 0.078 0.033 −1.518 0.130 −0.120 0.019
Past Positive TP 0.310 0.352 −0.042 −.047 −1.362 0.174 0.032 −1.307 0.192 −0.105 0.027
Present Fatalistic TP −0.453 −0.412 −0.041 −.050 −1.401 0.162 0.032 −1.288 0.198 −0.107 0.024
Present Hedonistic TP −0.280 −0.089 −0.191 −.198 −6.245 <0.001 0.034 −5.606 <0.001 −0.264 −0.119
Daily Alcohol Consumption −0.237 −0.120 −0.116 −.121 −3.710 <0.001 0.033 −3.561 <0.001 −0.187 −0.054
Time since first Cigarette −0.043 0.047 −0.090 −.090 −2.795 0.005 0.033 −2.724 0.007 −0.146 −0.027

Note: Bold values are for correlations that are statistically significant after a Bonferroni adjustment for 12 tests (α = .004). Italicized p-values are for correlations that are statistically significant after the Benjamini-Hochberg procedure. FFMQ = Five Facet Mindfulness Questionnaire; TP = time perspective; BSCS = Brief Self-Control Scale; GRIT-S = Short Grit Scale; boot = bootstrap; cor. diff. = correlation difference; Δ s.e. = multivariate delta method standard error; L.L. = lower limit; U.L. = upper limit.

Two reasons why researchers might include ECV evidence in their validity arguments relate to the theoretical and statistical roles that measure specificity plays in distinguishing two constructs. For example, Marsh et al. (2019) discuss the relation between different types of self-efficacy measures. Theoretically, self-efficacy refers to a person’s belief that they can perform a certain behavior, so the assessment of self-efficacy is largely independent of a frame-of-reference. However, some measures of self-efficacy are more generalized and leave out the target outcome. As a result, generalized self-efficacy measures are prone to frame-of-reference effects and assess a construct more similar to global self-concept. Marsh et al. (2019) found across four waves of data that a generalized measure of math self-efficacy is similar to a measure of math self-concept (thought to be a global construct), with correlations between latent constructs of .94 and .96 and similar correlations with criteria (e.g., math grades, math test scores, math future aspirations, and verbal outcomes, controlling for covariates). On the other hand, measures of task-specific math self-efficacy had correlations between .57 and .63 with the latent constructs of generalized self-efficacy, and had lower correlations with criteria after controlling for covariates. Although generalized and task-specific measures of self-efficacy still have a substantial correlation and share similar labels, they do not assess the same construct. When the target outcome is removed, the reliable variance specific to the generalized self-efficacy measure differs from the reliable variance specific to the pure self-efficacy measure and is more similar to what a global measure of self-concept assesses.

Second, even when two measures correlate almost perfectly, they could have different correlations with a criterion, which could be due to the specificity of the measures (Schmidt et al., 1998). Suppose that variables x1, x2, and y are linearly related and two of the correlations among the three variables are known. There is then a limited range for the third correlation (Yule, 1922),

$$\text{range for } r_{y2} = r_{12}r_{y1} \pm \sqrt{\left(1 - r_{12}^2\right)\left(1 - r_{y1}^2\right)}$$

Here, r12 is the correlation between x1 and x2, ry2 is the correlation between x2 and y, and ry1 is the correlation between x1 and y. Suppose that a researcher observes a correlation between two measures of r12 = .998, which could arise if the researcher corrects for attenuation. If x1 and x2 have the same reliability and x1 correlates with criterion y at ry1 = .30, then the correlation between x2 and y could vary between ry2 = .24 and .36. Therefore, even though the two predictors could appear to be almost identical because r12 approaches 1, the possible difference in measure-criterion correlations could matter in practice. Given a small sample size, researchers would be more likely to find significant results using the measure that correlates with the criterion at ry1 = .30 than using the measure that correlates at ry2 = .24. In the preceding example, the difference of .06 between ry1 and ry2 is statistically significant at α = .05 with a sample size of N = 50 using Williams’ test (1959; discussed below). The two measures, x1 and x2, would not be interpreted as assessing the same construct unless one can show that the specificity of a measure does not lead to differential prediction of criteria in specific domains. A hypothetical example could be self-concept and self-efficacy having similar correlation profiles in the domain of self-regulation, but different correlation profiles in the domain of achievement (also see the empirical examples below).
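To make the bound concrete, here is a minimal R sketch (variable names are ours, not from the original article) that reproduces the range reported above:

    r12 <- .998  # correlation between the two measures
    ry1 <- .30   # correlation of measure 1 with the criterion
    # Yule's bounds on the remaining correlation, ry2:
    bounds <- r12 * ry1 + c(-1, 1) * sqrt((1 - r12^2) * (1 - ry1^2))
    round(bounds, 2)  # -> 0.24 0.36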

Several studies have used ECV evidence to support the interpretation of construct overlap and measure interchangeability. ECV evidence suggested that the short Bem Sex-Role Inventory (BSRI) and the Extended Personal Attributes Questionnaire (EPAQ) overlap as measures of the construct of androgyny, based on their correlation profiles with measures of self-views from the Differential Personality Questionnaire (Lubinski, Tellegen, & Butcher, 1983). ECV evidence was used to study the overlap between the Strong Interest Inventory and the Study of Values scale in intellectually gifted adolescents (Schmidt et al., 1998) using biographical information, psychological assessments, and objective test criteria (Cattell, 1965). Also, ECV evidence suggests that self-esteem, generalized self-efficacy, locus of control, and neuroticism might be closely related constructs, based on their correlation profiles with the Big Five factors and measures of happiness (Judge et al., 2002). However, these studies visually compared the correlation profiles across measures and did not formally test whether the correlation profiles were the same. Below, we propose several approaches to obtain ECV evidence by testing the hypothesis that two correlation profiles are equal.

Analytical and Resampling Tests of the ECV Hypothesis

For the simplest case, consider two constructs one suspects could overlap, and call their measures Scale A and Scale B. Suppose we have one criterion measure, Y. The correlation between Scale A and Y, rYA, and the correlation between Scale B and Y, rYB, should be equal if both scales measure the same construct, assuming the same reliability and no relevant specificity in the measures. One can test for ECV evidence by testing the null hypothesis H0: ρYA = ρYB, where rYA and rYB are estimators of ρYA and ρYB. Rejecting the null hypothesis would be inconsistent with the ECV hypothesis, and would provide evidence of a jingle fallacy if one had argued that the scales measured the same construct. Failing to reject the null hypothesis would be consistent with the ECV hypothesis, and would provide evidence consistent with a jangle fallacy if one had argued that the scales measured different constructs. Statistical tests should be paired with an effect size to better understand the magnitude of the differences between correlations. It is important to note that when two correlations come from the same sample, they are not statistically independent, so traditional methods to test the difference between two independent correlations do not apply (Dunn & Clark, 1971; Steiger, 1980a; 1980b; Wilcox, 2009; Zou, 2007). Below we discuss several tests for the difference between two dependent correlations.

Equation from Williams (1959).

The analytical formula to test for significance of the difference between two dependent correlations is a function of the correlations between the three variables Y, A, and B and sample size N (Steiger, 1980; Williams, 1959),

$$t_{(N-3)} = \left(r_{YA} - r_{YB}\right)\sqrt{\frac{(N-1)\left(1 + r_{AB}\right)}{2\left(\frac{N-1}{N-3}\right)|R| + \bar{r}^2\left(1 - r_{AB}\right)^3}} \tag{1}$$

The test statistic for the difference of two dependent correlations is t-distributed with N − 3 degrees of freedom. In Equation 1, |R| is the determinant of the matrix of correlations between Y, A, and B, and $\bar{r}$ is the mean of the dependent correlations,

$$|R| = 1 - r_{YA}^2 - r_{YB}^2 - r_{AB}^2 + 2\,r_{YA}r_{YB}r_{AB}; \qquad \bar{r} = \tfrac{1}{2}\left(r_{YA} + r_{YB}\right) \tag{2}$$

If the test statistic is statistically significant, it would suggest that the correlations rYA and rYB are significantly different from each other. Regardless of the result, an effect size, such as Cohen’s q (discussed below), could provide researchers with the magnitude of the difference of the dependent correlations. Finally, the equation from Williams (1959) can be computed from published summary statistics, so the full dataset is not needed.
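As a rough sketch, Williams’ test is straightforward to program from summary statistics alone; the function below is our own minimal base-R implementation of Equations 1 and 2, not the authors’ supplementary code:

    williams_test <- function(r_ya, r_yb, r_ab, n) {
      detR <- 1 - r_ya^2 - r_yb^2 - r_ab^2 + 2 * r_ya * r_yb * r_ab  # |R|, Eq. 2
      rbar <- (r_ya + r_yb) / 2                                      # mean correlation, Eq. 2
      tval <- (r_ya - r_yb) *
        sqrt(((n - 1) * (1 + r_ab)) /
             (2 * ((n - 1) / (n - 3)) * detR + rbar^2 * (1 - r_ab)^3))  # Eq. 1
      c(t = tval, df = n - 3, p = 2 * pt(-abs(tval), df = n - 3))
    }
    # First row of Table 1 (FFMQ-Awareness, N = 522); matches the reported
    # t = -0.977 up to rounding of the published correlations:
    williams_test(r_ya = .646, r_yb = .669, r_ab = .727, n = 522)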

Multivariate Delta Method.

A general approach to derive the variance of a function of random variables is the multivariate delta method (MacKinnon, 2008; Rao, 1973). A standard error to test H0: ρYA = ρYB can be derived by obtaining the variance of the estimator function rYA − rYB, which is the difference between the dependent correlation coefficients (Olkin & Finn, 1990). The multivariate delta method derives the variance of the function by pre- and post-multiplying the covariance matrix of the parameters in the function by the vector of partial derivatives of the function with respect to the parameters. Specifically, the covariance matrix of the parameters contains estimates of the variance of rYA, $S^2_{YA}$, the variance of rYB, $S^2_{YB}$, and the covariance between rYA and rYB, $S_{YA,YB}$. The covariance $S_{YA,YB}$ accounts for the dependence of the correlations. Analytical solutions for the variances and covariances between correlations can be found in Olkin and Siotani (1976). Also, the variances and covariances between correlations can be obtained empirically using maximum likelihood estimation and are commonly found in the output of structural equation modeling programs (e.g., the output option Tech3 in Mplus or the vcov option in the R package lavaan). The vector of partial derivatives contains the derivatives of rYA − rYB with respect to rYA and rYB. To derive the variance of the function f = rYA − rYB using the multivariate delta method, one can multiply,

$$\text{Variance of } r_{YA} - r_{YB} = \begin{bmatrix} \dfrac{df}{dr_{YA}} & \dfrac{df}{dr_{YB}} \end{bmatrix} \begin{bmatrix} S^2_{YA} & S_{YA,YB} \\ S_{YA,YB} & S^2_{YB} \end{bmatrix} \begin{bmatrix} \dfrac{df}{dr_{YA}} \\ \dfrac{df}{dr_{YB}} \end{bmatrix} \tag{3}$$

In this case, $df/dr_{YA} = 1$ and $df/dr_{YB} = -1$. Therefore, the variance of rYA − rYB is,

$$\text{Variance of } r_{YA} - r_{YB} = S^2_{YA} + S^2_{YB} - 2S_{YA,YB}. \tag{4}$$

To test H0: ρYA = ρYB, one would divide rYA - rYB by the square root of its variance and conduct a z-test on this ratio. If the ratio is statistically significant, then the dependent correlations are significantly different. Regardless of the outcome of the significance test, Cohen’s q could also suggest the magnitude of the correlation difference. Finally, the accurate performance of the multivariate delta method depends on sample size, and this method can be computed with summary statistics.
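The following lavaan-based sketch illustrates Equation 4; it assumes a data frame dat whose columns Y, A, and B have been standardized beforehand so that the estimated covariances are correlations, and the parameter labels (rya, ryb) are ours:

    library(lavaan)
    model <- '
      Y ~~ rya*A   # correlation of Y with Scale A
      Y ~~ ryb*B   # correlation of Y with Scale B
      A ~~ B
    '
    fit <- sem(model, data = dat)  # run dat <- as.data.frame(scale(dat)) first
    V    <- vcov(fit)[c("rya", "ryb"), c("rya", "ryb")]  # as from Tech3/vcov above
    diff <- coef(fit)["rya"] - coef(fit)["ryb"]
    se   <- sqrt(V["rya", "rya"] + V["ryb", "ryb"] - 2 * V["rya", "ryb"])  # Eq. 4
    z    <- diff / se
    2 * pnorm(-abs(z))  # two-sided p-value for H0: rho_YA = rho_YB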

Bootstrap Confidence Intervals.

Another approach to test H0: ρYA = ρYB is to build a bootstrap confidence interval for rYA - rYB and examine if the interval contains zero. The bootstrap is a nonparametric, resampling approach to derive an empirical distribution of a test statistic (Efron & Tibshirani, 1993). The bootstrap resamples datasets with replacement, so new datasets are built by sampling cases from the original dataset and single cases can appear more than once. In each bootstrapped dataset, the quantity rYA - rYB is estimated, and the empirical distribution of the quantity is used to choose confidence limits based on the quantiles from the distribution. If the confidence interval does not contain zero, then the dependent correlations are significantly different. The full dataset is needed to perform the bootstrap, and the interval width could be used to examine the magnitude of the estimate of the correlation difference.
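A minimal percentile-bootstrap sketch in R (assuming a data frame dat with columns Y, A, and B) could look as follows; we use 500 resamples here to match the illustrations below:

    set.seed(1234)  # for reproducibility
    boot_diff <- replicate(500, {
      d <- dat[sample(nrow(dat), replace = TRUE), ]  # resample cases with replacement
      cor(d$Y, d$A) - cor(d$Y, d$B)                  # rYA - rYB in this resample
    })
    quantile(boot_diff, c(.025, .975))  # 95% percentile CI; reject H0 if 0 is excluded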

Effect Size.

It is recommended that a significance test always be accompanied by an effect size. The raw difference between two correlation coefficients is not an accurate representation of effect size because the detectability of a given difference depends on where the correlations fall along the correlation scale. For example, Cohen (1988) showed that the power to detect the difference between r1 = .50 and r2 = .25 differs from the power to detect the difference between r1 = .90 and r2 = .65, even though both pairs of correlations differ by .25. Cohen’s q accounts for this discrepancy and is estimated by the difference of two Fisher’s Z-transformed correlations (Cohen, 1988),

$$\text{Cohen's } q = \frac{1}{2}\log\left(\frac{1+r_1}{1-r_1}\right) - \frac{1}{2}\log\left(\frac{1+r_2}{1-r_2}\right), \tag{5}$$

where r1 and r2 are the correlation coefficients being compared. Cohen (1988) suggests that q=.1 is a small effect size, q=.3 is a medium effect size, and q=.5 is a large effect size.
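Because Fisher’s Z transformation is simply the inverse hyperbolic tangent, Cohen’s q takes one line in R; the values below illustrate Cohen’s (1988) point that equal raw differences can correspond to very different effect sizes:

    cohens_q <- function(r1, r2) atanh(r1) - atanh(r2)  # Eq. 5
    cohens_q(.50, .25)  # 0.29, a medium effect
    cohens_q(.90, .65)  # 0.70, a large effect, despite the same raw difference of .25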

Analytical Extensions beyond the Two-Measures Case.

Formulas and approaches have been presented for the simple case of comparing two dependent correlations. Similar approaches extend to cases with more measures. For example, Scale C might also measure the same construct as Scale A and Scale B, so the null hypothesis would be H0: ρYA = ρYB = ρYC. Meng, Rosenthal, and Rubin (1992) provided an analytical formula to test if more than two dependent correlations with an outcome are equal,

$$\chi^2_{(k-1)} = \frac{(N-3)\sum_i \left(z_{r_i} - \bar{z}_r\right)^2}{(1 - r_x)\,h} \tag{6}$$

where the test statistic is χ2 distributed with k − 1 degrees of freedom, and k is the number of dependent correlations tested. Here, $r_x$ is the median correlation between the scales being tested, $z_{r_i}$ is the Fisher’s Z-transformation of correlation coefficient i out of the k total scale-criterion correlations tested, $\bar{z}_r$ is the mean of the $z_{r_i}$, and N is the total number of subjects. The h in the denominator of Eq. 6 is obtained from the following equations,

$$h = \frac{1 - f\,\overline{r^2}}{1 - \overline{r^2}}; \qquad f = \frac{1 - r_x}{2\left(1 - \overline{r^2}\right)} \tag{7}$$

where $\overline{r^2}$ is the average of the squared scale-criterion correlations ri, and rx is as defined above. Hoffman (2000) suggested that the equation from Meng et al. (1992) has appropriate Type 1 error rates and power to detect differences among three dependent correlations that share an outcome. The equation from Meng et al. (1992) also serves as the basis for quantifying construct validity (QCV; Westen & Rosenthal, 2003), an effort whose goal is to make the validation process less subjective. Finally, the equation from Meng et al. (1992) can be computed from summary statistics.
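A minimal base-R sketch of Equations 6 and 7 follows; the function name and arguments are ours (r_yx is the vector of k scale-criterion correlations, r_x the median intercorrelation among the scales), and f is capped at 1 as specified by Meng et al. (1992):

    meng_test <- function(r_yx, r_x, n) {
      k     <- length(r_yx)
      z     <- atanh(r_yx)           # Fisher's Z-transformed correlations
      r2bar <- mean(r_yx^2)          # average squared scale-criterion correlation
      f     <- min((1 - r_x) / (2 * (1 - r2bar)), 1)          # Eq. 7
      h     <- (1 - f * r2bar) / (1 - r2bar)                  # Eq. 7
      chi2  <- (n - 3) * sum((z - mean(z))^2) / ((1 - r_x) * h)  # Eq. 6
      c(chi2 = chi2, df = k - 1,
        p = pchisq(chi2, df = k - 1, lower.tail = FALSE))
    }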

Cases for Two Measures and Multiple Criteria.

Extending the two-measures case to more than one criterion is more complicated. For example, adding another dependent variable M, there are now two scales and two dependent variables, and the null hypothesis is H0: ρYA = ρYB and ρMA = ρMB. Note that the correlations of Scale A and Scale B with Y can differ from the correlations of Scale A and Scale B with M. There are several options to test this hypothesis. One could derive standard errors using the multivariate delta method for the differences rYA − rYB and rMA − rMB, as described in the section above. Also, one could conduct a test for the difference of dependent correlations for each criterion and control the family-wise Type 1 error rate with a Bonferroni correction by dividing the alpha level (typically α = .05) by the number of comparisons. One could instead control the false discovery rate by adjusting the p-values using the Benjamini-Hochberg (BH) procedure (Benjamini & Hochberg, 1995). Finally, one could turn to the structural equation modeling framework, as described in the next section.
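For example, the per-criterion p-values can be adjusted in R with p.adjust; here we use the Williams’ test p-values from Table 1, entering the two values reported as <0.001 as .001 for illustration:

    p_vals <- c(.329, .533, .492, .828, .966, .811, .078, .174, .162, .001, .001, .005)
    p.adjust(p_vals, method = "bonferroni")  # family-wise Type 1 error control
    p.adjust(p_vals, method = "BH")          # Benjamini-Hochberg false discovery rate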

A Structural Equation Modeling Approach to the ECV Hypothesis

Structural equation modeling (SEM) provides an approach to test the ECV hypothesis that is applicable to any number of measures and criteria using χ2 difference tests (Cheung & Chan, 2004; Preacher, 2006). For example, the hypothesis H0: ρYA = ρYB is tested by fitting a model where the correlations between the two measures and an outcome are constrained to be equal and comparing it to a model that allows the correlations between the measures and the outcome to be freely estimated. The two models are nested, so a χ2 difference test with k − 1 degrees of freedom (where k is the number of correlations constrained to be equal) can be conducted to test H0: ρYA = ρYB. Failing to reject the null hypothesis would indicate that the model with constraints does not fit significantly worse than the model without constraints. In other words, the dependent correlations were not found to differ significantly, consistent with the ECV hypothesis. The ECV hypothesis can be evaluated with SEM using summary covariance matrices.

When the baseline model allows all of the variables to correlate freely, the model is fully saturated, so its χ2 statistic and degrees of freedom are zero. Therefore, the test of H0: ρYA = ρYB is simply the χ2 test of model fit for the model with constraints. Just about any test with any number of scales and any number of dependent variables can be incorporated in an SEM framework by testing dependent correlation constraints. As the sample size or the number of constraints increases, the χ2 test is likely to be rejected. Therefore, researchers could take one of two options. One option would be to take a model comparison approach, similar to testing for measurement invariance (Millsap, 2011), by sequentially freeing the dependent correlation constraints until the χ2 is no longer significant. Researchers can then consider the dependent correlations that remain constrained as evidence supporting the ECV hypothesis, and the correlations that were freed as evidence against the ECV hypothesis. The other option is to consider alternative fit indices, such as the RMSEA, SRMR, and CFI, to indicate whether the model still fits the data well even though the χ2 test of model fit was rejected (Meade, Johnson, & Braddy, 2008).

Finally, researchers can leverage the flexibility of the SEM framework to obtain information about specific dependent correlation differences. For example, in both Mplus and lavaan, researchers can create new parameters from the differences of dependent correlations, such as rYA - rYB, and one can test if a specific difference is significantly different from zero. SEM software typically estimates standard errors for new parameters using the multivariate delta method, but one can also bootstrap the created statistic to test for statistical significance. It is important to note that standard errors may be inaccurate if SEM is conducted with summary correlation matrices, as opposed to summary covariance matrices (Cudeck & O’Dell, 1999).
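A lavaan sketch of the SEM approach for two measures (A, B) and two criteria (Y1, Y2) is shown below; the data frame dat and the parameter labels are ours, and the variables are assumed to be standardized so that covariances are correlations. Because the baseline model is saturated, the χ2 test of model fit for the constrained model is the test of the ECV constraints:

    library(lavaan)
    constrained <- '
      Y1 ~~ c1*A
      Y1 ~~ c1*B    # shared label c1 constrains the two correlations with Y1
      Y2 ~~ c2*A
      Y2 ~~ c2*B    # shared label c2 constrains the two correlations with Y2
      A  ~~ B
      Y1 ~~ Y2
    '
    fit_c <- sem(constrained, data = dat)
    fitMeasures(fit_c, c("chisq", "df", "pvalue", "rmsea", "cfi", "srmr"))

In a model with the correlations freely estimated and labeled (say, ra1 and rb1), adding a defined parameter such as d1 := ra1 - rb1 would yield a delta-method standard error for that specific difference, as described above.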

Considerations before Estimating and Testing Dependent Correlations

Dependent correlations are computed in the same sample, so restriction of range in the scores and the reliability of the criteria are controlled for, and it is expected that scores from the measures being compared would have similar reliability. Measurement error attenuates correlation coefficients: the highest correlation a measure can have with the outcome is the square root of the reliability of the least reliable variable (Fiske, 1971). Consequently, H0: ρYA = ρYB could be rejected simply because the reliabilities of Scale A and Scale B differ. Suggested solutions are to disattenuate correlation coefficients before conducting comparisons (Cohen, Cohen, West, & Aiken, 2003), to use analytical formulas developed for testing attenuated dependent correlation coefficients (Rosner, Wang, Eliassen, & Hibert, 2015), or to use latent variables in SEM (Cheung & Chan, 2004). Also, nonnormality and heteroscedasticity of scores (Fouladi, 2000; Wilcox, 2009; 2016; Wilcox & Tian, 2008) might distort the correlations between measures and outcomes. Finally, the measure-outcome correlations could also differ if the measures assess the same construct with different methods (e.g., self-report versus cognitive task; Eisenberg et al., 2019).

Extrinsic convergent validity is a source of evidence to support validity arguments for construct overlap and to study jingle-jangle fallacies. ECV evidence is not meant to be absolute evidence or to replace other sources of validity evidence. The ECV approach is meant to contribute to the validation process and make it less subjective. If ECV were used in isolation, without any other sources of evidence supporting the validity argument, its interpretation might be limited to testing whether two measures are empirically interchangeable. The logic of ECV is that if two highly correlated constructs predict the same criteria equally, then the specificity of the measures might not be important for prediction (Fiske, 1971), suggesting in turn that the hypothetical unique parts of the constructs evaluated are not distinguishable. A formal test for ECV is to test whether dependent correlations are equal, either through analytical, resampling, or SEM approaches. The tests of dependent correlations might be more accurate if researchers perform them with raw data rather than published summary information: a correlation coefficient assumes a unidimensional model and a measure free of measurement error, and without the data these assumptions cannot be examined. Below we provide one empirical illustration and one illustration with published summary information to show how researchers can test for ECV evidence using the different tests of dependent correlations. Simulation results are also presented to provide recommendations based on the performance of the different tests of dependent correlations.

Illustrations

Five tests of dependent correlations are illustrated with two examples. The multivariate delta method, the equation by Williams (1959), and the equation by Meng et al. (1992) were carried out using R code provided in the supplementary materials (R Core Team, 2016). SEM and bootstrapping (500 samples) were carried out in Mplus (Muthén & Muthén, 1998–2012). We used a Bonferroni correction and the BH procedure to control for multiple test comparisons.

Empirical Example: Grit and Self-Control

Two constructs reflecting self-regulation towards valued goals are self-control and grit. Self-control refers to self-initiated actions to resist conflicting impulses in favor of a valued goal, and grit refers to passion and perseverance for long-term, valued goals despite setbacks or roadblocks (Duckworth & Gross, 2014). Previous research suggests that self-control and grit are highly correlated (r > .60) and that they have different relations with achievement depending on the immediacy of goals (Duckworth, Peterson, Matthews, & Kelly, 2007). Whereas individuals with self-control regulate behavior, emotion, and attention towards resisting impulses in everyday situations, gritty individuals regulate behavior, emotion, and attention towards long-term goals. Gritty individuals can give in to temptations as long as the temptations do not distract them from the long-term goal. Recently, the interpretation of the grit construct has been brought into question because it is difficult to differentiate grit from conscientiousness, suggesting a jangle fallacy (Credé, Tynan, & Harms, 2017). Credé and colleagues (2017) also suggested that grit, just like self-control, should be considered another facet of conscientiousness, because both deal with the deferment of goals. Therefore, ECV evidence could be used to test the hypothesis that measures of grit and self-control are interchangeable, and in turn support the argument of construct overlap. For this example, we tested the ECV hypothesis between the construct of grit (measured by the 8-item Grit-S, coefficient α = .897, with support for essential unidimensionality; Gonzalez, Canning, Smyth, & MacKinnon, in press) and the construct of self-control (measured by the 13-item BSCS, coefficient α = .914) in the domain of self-regulation, using measures of mindfulness (FFMQ; Five Facet Mindfulness Questionnaire; Baer, Smith, Hopkins, Krietemeyer, & Toney, 2006), time perspective (FTP; Future Time Perspective scale; Carstensen & Lang, 1996; ZFTP; Zimbardo’s future time perspective scale; Zimbardo & Boyd, 1999), and alcohol and cigarette consumption. The dataset comes from a sample of MTurk participants (N = 522) who took these and other surveys within a large online battery of tasks (Eisenberg et al., 2018). We tested the ECV hypothesis with four tests of dependent correlations: the equation from Williams (1959), the multivariate delta method, bootstrap confidence intervals, and model comparison through SEM.

Results.

The correlation between the Grit-S and the BSCS was .727. The correlation profiles of the Grit-S and BSCS with the criteria are shown in Table 1; the correlation between the two correlation profiles (the correlation of the correlation coefficients) was .990, suggesting that the criteria that the Grit-S predicts best (and worst) are the criteria that the BSCS predicts best (and worst). Results for the equation from Williams (1959), the multivariate delta standard errors, percentile bootstrap confidence intervals, and effect sizes for dependent correlations are also shown in Table 1. The results largely agreed across the three univariate tests. After controlling the Type 1 error rate via a Bonferroni correction and the BH procedure, analyses indicated that ten of the twelve dependent correlation coefficients were not significantly different using Williams’ test and the multivariate delta method. In this case, grit and self-control had different correlations only with present hedonistic time perspective and daily alcohol consumption (see Table 1).

SEM was used to test whether the correlations between the measures and the twelve criteria were the same by constraining the correlations of grit and self-control with each criterion to equality. The χ2 test suggested that the model did not fit the data well, χ2(12) = 60.991, p < .001. On the other hand, some of the alternative fit indices (RMSEA = .088, CFI = .979, and SRMR = .017) were within conventional thresholds, suggesting that equal correlations of self-control and grit with the criteria described the relations generally well. Using information from the univariate tests above, the χ2 test of model fit was not significant (χ2(10) = 16.565, p = .085) when the relations of grit and self-control with present hedonistic time perspective and daily alcohol consumption were allowed to differ.

The univariate tests suggest partial support for the ECV hypothesis: the Grit-S and the BSCS had similar correlation profiles (N = 522) with the facets of mindfulness and most facets of time perspective. These nine correlations (all with Cohen’s q magnitudes of .064 or smaller) support the interpretation that grit and self-control could be the same construct with different labels, as in a jangle fallacy. However, the BSCS had different correlations with daily alcohol consumption, time since first cigarette, and present hedonistic time perspective compared to the Grit-S (all with Cohen’s q magnitudes of .09 or larger). These three correlation differences are consistent with theory suggesting that self-control is more useful than grit in resisting daily temptations (Duckworth & Gross, 2014). There is evidence that the specificity of self-control and grit is important in predicting self-regulation outcomes, so referring to the constructs as the same could render a jingle fallacy.

Published Example: Connectedness to Nature

In environmental psychology, the construct of connectedness to nature could have an important role in mitigating the current environmental crisis (Tam, 2013). Several concepts of humans’ relationship with nature have been introduced in the environmental psychology literature, but they have largely been studied in isolation. Four unidimensional environmental concepts are emotional affinity towards nature (EATN; emotional inclinations towards nature; Kals, Schumacher, & Montada, 1999), connectedness to nature (CTN; how people feel affectively connected to the natural community; Mayer & Frantz, 2004), inclusion of nature in the self (INS; including nature into one’s self-concept; Aron, Aron, Tudor, & Nelson, 1991), and commitment to nature (COM; human-nature interdependence; Davis, Green, & Reed, 2009). Three multidimensional environmental concepts are environmental identity (EID; how humans perceive and act towards nature; Clayton, 2003), nature relatedness (NR; affective, cognitive, and experiential aspects of the human-nature relationship; Nisbet, Zelenski, & Murphy, 2009), and connectivity with nature (CWN; perception of a fundamental sameness between one and nature; Dutcher, Finley, Luloff, & Johnson, 2007). These seven environmental concepts are highly similar and share many of the same labels and definitions. Tam (2013) determined that a single common factor explained the relationships among these seven concepts, and that there was ECV evidence based on a comparison of the magnitude and direction of the seven correlation profiles with several criteria. For this example, we tested the ECV hypothesis for the seven measures of connectedness to nature in the domain of personality using the Big Five Inventory (BFI; John, Donahue, & Kentle, 1991). Three tests for dependent correlations are illustrated: the equation from Meng et al. (1992), SEM with a χ2 test of model fit per criterion, and SEM with a χ2 test of model fit with all of the criteria. The correlation matrix of the seven concepts and their correlations with the BFI, obtained from Hong Kong Chinese undergraduates (N = 322), are reported in Tam (2013). The correlations among the Big Five factors were not reported in Tam (2013), so for illustration purposes we used the correlations among the Big Five factors reported in the BFI source article (Benet-Martinez & John, 1998). Finally, raw data were not available, so the bootstrap is not illustrated.

Results.

The median correlation between the seven connectedness to nature measures was .710. The correlation profiles of the seven measures are in the original article (see Table 3 in Tam, 2013), and the median correlation among the seven correlation profiles was .983. Results using the equation from Meng et al. (1992) are shown in Table 2. After a Bonferroni correction and the BH procedure for five multiple comparisons, analyses indicated that the dependent correlation coefficients with the Big Five were not significantly different for any criterion except agreeableness. SEM was used to test whether the correlations between the seven measures and the criteria were the same. First, the tests of dependent correlations for each of the criteria were carried out individually (reported in Table 2), and then simultaneously. After a Bonferroni correction for five multiple comparisons and the BH procedure for false discovery rates, the χ2 test of model fit for each criterion was not statistically significant. For the simultaneous test with five criteria, the model did not fit the data well, χ2(30) = 44.269, p = .045, but alternative fit indices were within conventional thresholds (RMSEA = .038, CFI = .993, and SRMR = .040). The results suggested that the correlation profiles of the seven measures of connectedness to nature with the Big Five (N = 322) support the ECV hypothesis in the domain of personality. Specifically, the seven measures of connectedness to nature did not significantly differ in their relationships with the BFI outcomes. The seven measures appear to assess the same construct in the domain of personality, so assuming that they assess different constructs could suggest a jangle fallacy.

Table 2.

Average measure-criterion correlation, the equation from Meng et al. (1992), and χ2 test to evaluate the measures of connectedness to nature from Tam (2013).

Meng χ2 Test

Variable Avg. cor. χ2 value df p-value Cor. Constraint χ2 value df p-value RMSEA CFI SRMR
Extraversion 0.162 9.411 6 0.152 0.160 9.680 6 0.138 0.044 0.998 0.017
Agreeableness 0.301 24.779 6 <0.001 0.274 14.069 6 0.029 0.065 0.996 0.037
Conscientiousness 0.242 9.327 6 0.156 0.233 9.424 6 0.151 0.042 0.998 0.020
Neuroticism −0.072 9.345 6 0.155 −0.062 9.948 6 0.127 0.045 0.998 0.016
Openness to Experience 0.330 12.423 6 0.053 0.312 6.176 6 0.408 0.010 1.000 0.027

Note: Bolded p-values are statistically significant after a Bonferroni correction (α = .01). Italicized p-values are statistically significant after the Benjamini-Hochberg procedure. Avg. cor. = average correlation across measures; Cor. constraint = estimated correlation constraint in the structural equation procedure.

Simulation Study

Monte Carlo simulation methods were used to evaluate the performance of four approaches to test the difference between two dependent correlations: the equation from Williams (1959), percentile bootstrap confidence intervals, multivariate delta method, and the χ2 test of model fit.

Simulation Procedure

There were 180 conditions in the simulation study and 500 replications per condition, so 90,000 datasets were analyzed. Factors varied in the simulation were the correlation between the measures (.5, .6, .7, .8), the correlation of measure 1 with the outcome (.3, .5, .7), the correlation of measure 2 with the outcome (.3, .5, .7), and sample size (50, 100, 200, 500, and 1,000). Datasets were generated by crossing the simulation factors to derive specific correlation matrices and then sampling from a multivariate normal distribution using the mvrnorm command from the MASS R package. Our fully-crossed design provides a replication of the conditions where the correlations with the outcome differed. In the supplementary materials, we expand the simulation conditions to describe the performance of the four tests when one of the measures is normally distributed (skewness = 0, kurtosis = 3) and the other exhibits different levels of nonnormality (either mild, skewness = 1.25 and kurtosis = 3.25, or moderate, skewness = 2.25 and kurtosis = 7; Kelley & Pornprasertmanit, 2016). The additional datasets were generated by sampling from a multivariate normal distribution with the specific correlation matrices and applying the transformation by Vale and Maurelli (1983) (with the rValeMaurelli function from the SimDesign R package) to yield the specified amount of nonnormality.
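For reference, generating the data for a single cell of the design could look like the following R sketch (the condition values are chosen arbitrarily from the factors above):

    library(MASS)
    r_ab <- .7; r_y1 <- .3; r_y2 <- .5; n <- 200   # one condition of the design
    R <- matrix(c(1,    r_ab, r_y1,
                  r_ab, 1,    r_y2,
                  r_y1, r_y2, 1), nrow = 3)        # population correlation matrix
    dat <- as.data.frame(mvrnorm(n, mu = rep(0, 3), Sigma = R))
    names(dat) <- c("A", "B", "Y")
    cor(dat)  # sample correlations vary around R; each test is then applied to dat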

The simulation outcomes were Type 1 error and statistical power. Conditions where the correlation between measure 1 with outcome and measure 2 with outcome were equal were used to study Type 1 error rates. Type 1 error rates between .025 and .075 were considered adequate (Bradley, 1978). Conditions where the correlation between measure 1 with outcome and measure 2 with outcome differed were used to study statistical power. Power above .80 was considered adequate. The Monte Carlo simulation was carried out in the R statistical environment (R Core Team, 2016) by programming the analytical formulas and the bootstrap. The lavaan (Rosseel, 2012) package was used to conduct χ2 tests of model fit using maximum likelihood estimation with correlations to the outcome constrained to be the same. The covariance between the dependent correlations for the multivariate delta standard error was estimated in lavaan using maximum likelihood estimation by fitting a model where the dependent correlations were allowed to vary, and the variance-covariance matrix between parameter estimates was extracted with the vcov command on the lavaan object. Annotated code in R and Mplus to carry out these procedures is presented in the supplementary materials.

Simulation Results

Two normally-distributed measures.

Type 1 error rates for four tests of dependent correlations are presented in Table 3. In 11 (out of 60) conditions, the test for dependent correlations from Williams (1959) had Type 1 errors above .075, with a maximum of .102. In four (out of 60) conditions, the bootstrap confidence intervals had Type 1 error rates above .075, with a maximum of .084. Conditions with a measure-outcome correlation of .7 contributed 11 out of the 15 conditions with high Type 1 error rates. Type 1 errors approached nominal values as sample size increased. The χ2 test and the multivariate delta method standard errors had Type 1 error rates below .025, especially when the measures-outcome correlation was large (.7).

Table 3.

Type 1 error rates for the tests of dependent correlations.

Correlation Between the Two Measures
0.5 0.6 0.7 0.8

Correlation of the Two Measures with the Outcome
N Test 0.3 0.5 0.7 0.3 0.5 0.7 0.3 0.5 0.7 0.3 0.5 0.7
50 Williams 0.052 0.042 0.070 0.056 0.074 0.072 0.064 0.068 0.064 0.050 0.046 0.076
Δ s.e. 0.014 0.022 0.000 0.032 0.018 0.000 0.036 0.020 0.014 0.030 0.010 0.002
Boot 0.048 0.072 0.068 0.066 0.074 0.054 0.072 0.072 0.060 0.066 0.066 0.066
χ2 test 0.022 0.022 0.002 0.044 0.030 0.000 0.048 0.034 0.018 0.040 0.026 0.002
100 Williams 0.062 0.056 0.080 0.056 0.080 0.080 0.060 0.048 0.086 0.056 0.056 0.078
Δ s.e. 0.034 0.014 0.008 0.028 0.032 0.004 0.038 0.020 0.006 0.038 0.020 0.008
Boot 0.052 0.056 0.082 0.070 0.084 0.074 0.076 0.068 0.072 0.074 0.062 0.060
χ2 test 0.040 0.020 0.010 0.042 0.034 0.008 0.048 0.026 0.014 0.046 0.022 0.008
200 Williams 0.050 0.048 0.064 0.066 0.060 0.090 0.048 0.038 0.074 0.046 0.050 0.088
Δ s.e. 0.040 0.018 0.004 0.052 0.022 0.008 0.034 0.018 0.008 0.032 0.022 0.008
Boot 0.062 0.074 0.052 0.082 0.048 0.072 0.046 0.040 0.060 0.054 0.062 0.074
χ2 test 0.044 0.022 0.004 0.056 0.028 0.012 0.038 0.022 0.012 0.036 0.022 0.010
500 Williams 0.040 0.058 0.102 0.044 0.072 0.062 0.062 0.074 0.064 0.038 0.042 0.072
Δ s.e. 0.046 0.022 0.006 0.034 0.026 0.004 0.050 0.032 0.008 0.030 0.012 0.004
Boot 0.066 0.066 0.068 0.042 0.068 0.056 0.056 0.070 0.044 0.040 0.042 0.050
χ2 test 0.050 0.022 0.006 0.042 0.040 0.004 0.058 0.034 0.012 0.034 0.014 0.004
1000 Williams 0.040 0.048 0.090 0.062 0.062 0.074 0.056 0.056 0.066 0.052 0.036 0.086
Δ s.e. 0.032 0.026 0.002 0.044 0.026 0.002 0.040 0.022 0.002 0.034 0.014 0.014
Boot 0.050 0.060 0.056 0.060 0.056 0.062 0.056 0.040 0.060 0.052 0.044 0.064
χ2 test 0.032 0.026 0.004 0.050 0.032 0.002 0.062 0.024 0.002 0.034 0.016 0.016

Note: Correlations of the measures with the outcome were the same, so the difference is zero. In bold red are the Type 1 error rates outside of Bradley’s (1978) robust criterion (.025, .075). N=sample size; Boot=bootstrap confidence intervals; Will=Williams’ test for dependent correlations; Δ s.e. = multivariate delta method standard error.

Power results for the tests of two dependent correlations are presented in Table 4. Results for the first and second replication of each condition were comparable, so the average of the two replications is presented. The bootstrap had higher statistical power than the other tests across all conditions. For any test and condition, power increased as a function of sample size, the difference in correlation coefficients, and the correlation between the two measures. In other words, tests of the difference in dependent correlations had greater power when the measures shared more variance. For conditions with a sample size of 50, the tests of dependent correlations had power above .80 to detect a correlation difference of at least .40 (q = .56; e.g., conditions where measure 1 correlates .3 with the outcome and measure 2 correlates .7 with the outcome). For conditions with a sample size of 100, the tests of dependent correlations had power above .80 to detect a correlation difference of at least .20 (q = .24) when the correlation between the two measures was .80. For conditions with sample sizes of 200, 500, and 1,000 (the last not presented in Table 4), all of the tests of dependent correlations had adequate statistical power to detect a correlation difference of .2 (q = .24).

Table 4.

Power to detect the difference of two dependent correlations across different tests.

N=50 N=100 N=200 N=500

r1 r2 q i Will Δ s.e. Boot χ2 Will Δ s.e. Boot χ2 Will Δ s.e. Boot χ2 Will Δ s.e. Boot χ2
0.3 0.5 0.24 0.5 0.344 0.236 0.390 0.270 0.633 0.506 0.657 0.532 0.875 0.824 0.879 0.831 1.000 0.999 1.000 0.999
0.3 0.5 0.24 0.6 0.402 0.295 0.450 0.350 0.681 0.574 0.708 0.611 0.918 0.876 0.920 0.890 1.000 1.000 1.000 1.000
0.3 0.5 0.24 0.7 0.485 0.375 0.521 0.437 0.796 0.711 0.804 0.740 0.973 0.952 0.974 0.954 1.000 1.000 1.000 1.000
0.3 0.5 0.24 0.8 0.715 0.601 0.720 0.655 0.934 0.885 0.939 0.908 0.999 0.997 0.998 0.998 1.000 1.000 1.000 1.000
0.5 0.7 0.32 0.5 0.488 0.210 0.488 0.250 0.768 0.484 0.764 0.516 0.960 0.844 0.962 0.856 1.000 1.000 0.996 1.000
0.5 0.7 0.32 0.6 0.528 0.256 0.544 0.303 0.831 0.579 0.821 0.624 0.985 0.936 0.980 0.947 1.000 0.998 1.000 0.999
0.5 0.7 0.32 0.7 0.665 0.400 0.683 0.459 0.915 0.755 0.914 0.793 0.996 0.978 0.992 0.982 1.000 1.000 1.000 1.000
0.5 0.7 0.32 0.8 0.806 0.556 0.836 0.602 0.970 0.916 0.960 0.920 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
0.3 0.7 0.56 0.5 0.948 0.836 0.932 0.864 1.000 0.994 1.000 0.996 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
0.3 0.7 0.56 0.6 0.977 0.912 0.972 0.939 0.999 0.999 0.999 0.999 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
0.3 0.7 0.56 0.7 0.998 0.990 0.999 0.994 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
0.3 0.7 0.56 0.8 1.000 0.998 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000

Note: Bold green is for conditions with power above .80. i=correlation between measure 1 and measure 2; r1= correlation of measure 1 with outcome; r2=correlation of measure 2 with outcome; N=sample size; Boot=bootstrap confidence intervals; Will=Williams’ test; Δ s.e. = multivariate delta method standard error; χ2 =chi-square test; q=Cohen’s q.

A normally- and a nonnormally-distributed measure.

For the mild nonnormality case, Type 1 error rates are presented in Table S1 and power is presented in Tables S2A and S2B. Across conditions, Type 1 error rates were slightly higher than in the case of two normally-distributed measures. As in the normal case, Williams' test (in 24 out of 60 conditions) and the bootstrap (in 27 out of 60 conditions) had Type 1 error rates above .075. Type 1 error rates below .025 were most common for the multivariate delta method standard error and the χ2 test when the measure-outcome correlation was large (r = .7). There was also slightly less power to detect correlation differences when one of the measures deviated from normality than when both measures were normal. However, power depended on which measure had the stronger association with the outcome: there was more power to detect correlation differences when the normal measure, rather than the nonnormal measure, had the stronger association with the outcome. Just as in the case of two normal measures, power increased as sample size, the measure-outcome correlations, and the correlation between the measures increased. For the moderate nonnormality case, there were convergence issues when the correlation between the measures was r = .8. Type 1 error rates, presented in Table S3, were above .075 across all tests and deviated further from the nominal rate as sample size and the measure-outcome correlation increased. Because of the high Type 1 error rates, statistical power for the moderate nonnormality case is not presented.
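The reference list cites Vale and Maurelli (1983) for simulating multivariate nonnormal data; one accessible implementation of that approach is the skewness and kurtosis arguments of lavaan's simulateData() (Rosseel, 2012). The sketch below generates one nonnormal and two normal variables with a target correlation structure; the specific skewness and kurtosis values are illustrative assumptions, not necessarily those defining the article's mild and moderate nonnormality conditions.

```r
library(lavaan)  # Rosseel (2012), cited in the references

# Population model: two measures (x1, x2) correlating .8 with each other
# and .3 and .5 with the outcome y; all variances fixed to 1
pop.model <- '
  x1 ~~ 1.0 * x1
  x2 ~~ 1.0 * x2
  y  ~~ 1.0 * y
  x1 ~~ 0.8 * x2
  x1 ~~ 0.3 * y
  x2 ~~ 0.5 * y
'

# skewness/kurtosis apply to the variables in order of appearance (x1, x2, y);
# here only x1 is nonnormal, with assumed skew = 2 and excess kurtosis = 7
dat <- simulateData(pop.model, sample.nobs = 200,
                    skewness = c(2, 0, 0), kurtosis = c(7, 0, 0))
round(cor(dat), 2)  # sample correlations should approximate the population values
```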

Discussion

When new psychological constructs are proposed without considering previously discovered constructs, the new constructs might be prone to jingle and jangle fallacies. Ignoring other psychological constructs can lead to wasted research effort and hinder the accumulation of knowledge about psychological phenomena (Block, 1995). In this paper, we discussed the concept and formal tests of extrinsic convergent validity (ECV; Fiske, 1971), demonstrated when two measures of a construct, or two measures of seemingly different constructs, are highly correlated and have comparable correlations with a set of criteria. ECV is a source of validity evidence that supports the empirical interchangeability of measures and the interpretation of construct overlap. Of course, ECV should not be used in isolation, but only as part of a larger validity argument. If ECV evidence supports the interpretation that two seemingly different constructs overlap, researchers might reconsider their theory and prevent a jangle fallacy. Conversely, if ECV evidence demonstrates that two similarly labelled constructs do not overlap, researchers might reconsider their labels and prevent a jingle fallacy. In our example based on published data, there was support for construct overlap and empirical interchangeability among the seven measures of the connectedness to nature construct in the domain of personality. On the other hand, the ECV evidence for self-control and grit in the domain of self-regulation was conflicting and could either support or fail to support construct overlap and empirical interchangeability. Conclusions from the examples might be limited to the domains evaluated, so more research is needed on other domains where the specificity of these measures matters for prediction; with different criteria, differences in correlation profiles could outweigh the similarities we observed. Overall, the goal of the illustrations was to show how to obtain and evaluate ECV evidence.

The results from our simulations suggested that Williams' test and the bootstrap tend to have Type 1 error rates near the nominal rate, although Type 1 error rates were positively biased in some conditions. The multivariate delta method and the χ2 test of model fit, on the other hand, had negatively biased Type 1 error rates. With regard to power, the percentile bootstrap was the most powerful test. However, these results depended on the distribution of the scores from the measures. Based on these results, we recommend using the bootstrap when the raw data are available and the measures are normally distributed, using the equation from Williams (1959) when only summary information is available, and reporting an effect size, such as Cohen's q, to convey the magnitude of the correlation differences. If scores from the measures are not normally distributed, alternative tests could be considered (Fouladi, 2000; Steiger, 1980a; Wilcox, 2009).
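A minimal R sketch of this recommended workflow when raw data are available: a percentile bootstrap confidence interval for the difference between the two dependent correlations, together with Cohen's q as the effect size. The data frame `dat` and its column names x1, x2, and y are placeholders.

```r
# dat is assumed to be a data frame with columns x1, x2 (the two measures)
# and y (the criterion); these names are placeholders
boot_ci <- function(dat, B = 2000, conf = .95) {
  diffs <- replicate(B, {
    d <- dat[sample(nrow(dat), replace = TRUE), ]  # resample rows with replacement
    cor(d$x1, d$y) - cor(d$x2, d$y)
  })
  alpha <- 1 - conf
  quantile(diffs, c(alpha / 2, 1 - alpha / 2))  # percentile bootstrap CI
}

cohens_q <- function(r1, r2) {
  # Difference between Fisher z-transformed correlations (Cohen, 1988)
  abs(atanh(r1) - atanh(r2))
}

cohens_q(.3, .5)  # 0.24, the q reported for the r1 = .3, r2 = .5 conditions
```

If the confidence interval excludes zero, the correlation profiles differ on that criterion; q conveys how large the difference is on Cohen's (1988) scale.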

A future direction for this research is to evaluate the performance of tests of dependent correlations when many criteria are compared simultaneously (as in Steiger, 1980b, 2005). Multivariate effect sizes, such as Mahalanobis' D (Del Giudice, 2017), could be useful in this case. It would also be important to study the properties of Cohen's q as an effect size for the difference between dependent correlations, along with determining an appropriate standard error for it (Kenny, 1987; Cohen, 1988). The confidence intervals in Zou (2007) could also help describe the magnitude of the correlation difference, and their empirical power should be studied.
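As a possible starting point, the R package cocor (which this article does not use, so this usage is an assumption) implements several tests of overlapping dependent correlations from summary statistics alone, including Williams' (1959) test and Zou's (2007) confidence interval:

```r
# Hypothetical illustration with the cocor package: compare two overlapping
# dependent correlations; test = "zou2007" reports Zou's (2007) confidence
# interval for the difference
library(cocor)
cocor.dep.groups.overlap(r.jk = .30,  # measure 1 with the criterion
                         r.jh = .50,  # measure 2 with the criterion
                         r.kh = .80,  # correlation between the two measures
                         n = 100,
                         test = c("williams1959", "zou2007"))
```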

What ECV can and cannot do

ECV is a source of evidence that evaluates measure specificity and that can support the validity argument for the interpretation that two constructs overlap. ECV presumes that the two measures are highly correlated before researchers consider construct overlap. Similar to Westen and Rosenthal's (2003) quantifying construct validity (QCV) approach and Thielmann and Hilbig's (2019) nomological consistency, ECV aims for a less subjective evaluation of validity evidence. The three approaches use hypothesis testing to evaluate measure-criteria relations, but each has a slightly different focus. ECV supports the argument of construct overlap by empirically showing that two measures correlate highly and have similar correlation profiles with criteria; nomological consistency supports the argument by showing similar correlation profiles across measures, but does not incorporate the correlation between the measures in the analysis; and QCV supports the argument by examining the omnibus relation between what is empirically found and what is theoretically expected. The conclusions of all three approaches are limited to the domains and criteria evaluated, so the approaches could also be used to identify the domains in which two constructs differ and those in which they are the same. We recommend interpreting evidence from these three approaches in reference to the raw correlations and effect size measures. For example, the correlation between the correlation profiles of the Grit-S and the BSCS could be high even if the individual measure-criteria correlations differ significantly (Smith, 2005), as the sketch below illustrates. Finally, evidence comes in degrees and is specific to the decisions based on the scores. Using the ECV approach in isolation might provide enough evidence for a pragmatic question about the use of measures, but only limited confidence that the constructs overlap beyond that specific domain. If the interpretation is more ambitious, more evidence could be included: studying the content of the measures closely; using multitrait-multimethod matrices to investigate the relations of the measures with other constructs across methods; evaluating incremental prediction of an outcome; replicating relations in other samples; factor-analyzing the measures; and evaluating ECV evidence in other domains. More sources of evidence make the argument stronger.
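A toy illustration of this point, with made-up correlation profiles across five hypothetical criteria:

```r
# Illustrative (hypothetical numbers): two measures can show a near-perfect
# correlation between their correlation profiles even when the individual
# measure-criterion correlations differ in magnitude
profile_grit <- c(.45, -.30, .25, .10, -.20)  # hypothetical criteria correlations
profile_bscs <- c(.60, -.45, .35, .20, -.35)  # uniformly stronger, same pattern
cor(profile_grit, profile_bscs)               # about .99 despite the gaps
```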

Choosing criteria.

The conclusions from the ECV approach might depend on the criteria used to assess the similarity of correlation profiles. As a result, the number and type of criteria used to obtain ECV evidence could provide researchers with either very weak or very strong evidence for any given validity argument. This raises two concerns: how to choose criteria and when to stop collecting validity evidence. Both concerns are directly related to the claims made by the validity argument. If the interpretation is limited to a specific domain, researchers could test the equality of correlation profiles that include criteria relevant to that domain. If the claim is more ambitious and general, it would be important to include a large variety of criteria from many domains where the two constructs are theoretically expected to differ. In terms of the number of criteria, there should be enough evidence that researchers feel comfortable rendering the appropriate interpretation of the scores. It is difficult to generalize across cases, but one might need less evidence in basic laboratory research than in high-stakes assessment. Therefore, the setting and consequences of score use need to be considered.

The goal of this paper is to motivate researchers to consider construct overlap as they map the ontology of the psychological constructs in their field. Researchers planning studies of new psychological constructs could include measures of similar constructs and test the ECV hypothesis with the tests of dependent correlations described here to prevent wasted research effort. In our examples, measures of connectedness to nature might demonstrate a jangle fallacy if researchers treat the measures as if they assessed different constructs, whereas grit and self-control demonstrate possible jingle and jangle fallacies depending on whether researchers treat the constructs as the same. Jingle-jangle fallacies may prevent researchers from accumulating scientific knowledge about psychological constructs. Therefore, we encourage researchers to use extrinsic convergent validity as a source of evidence as they build arguments to distinguish new constructs from those previously discovered.

Supplementary Material

Supp 1
Supp 2

Acknowledgements:

This research was supported in part by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1311230 and by the National Institute on Drug Abuse under Grants No. R37DA009757 and No. UH2DA041713.

The authors thank Dr. David Lubinski for inspiring this research and for his comments on prior versions of this manuscript. The ideas and opinions expressed herein are those of the authors alone, and endorsement by the authors' institutions, the National Science Foundation, and the National Institute on Drug Abuse is not intended and should not be inferred.

Funding: This work was supported by Grant No. DGE-1311230 from the National Science Foundation, and Grants No. R37DA009757 and No. UH2DA041713 from the National Institute on Drug Abuse.

Role of the Funders/Sponsors: None of the funders or sponsors of this research had any role in the design and conduct of the study; collection, management, analysis, and interpretation of data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.

Footnotes

1. Not to be confused with profile similarity from dyadic studies (Kenny, Kashy, & Cook, 2006).

Conflict of Interest Disclosures: Each author signed a form for disclosure of potential conflicts of interest. No authors reported any financial or other conflicts of interest in relation to the work described.

Ethical Principles: The authors affirm having followed professional ethical guidelines in preparing this work. These guidelines include obtaining informed consent from human participants, maintaining ethical treatment and respect for the rights of human or animal participants, and ensuring the privacy of participants and their data, such as ensuring that individual participants cannot be identified in reported results or from publicly available original or archival data.

References

1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
2. Aron A, Aron EN, Tudor M, & Nelson G (1991). Close relationships as including other in the self. Journal of Personality and Social Psychology, 60, 241–253. DOI: 10.1037/0022-3514.60.2.241.
3. Baer RA, Smith GT, Hopkins J, Krietemeyer J, & Toney L (2006). Using self-report assessment methods to explore facets of mindfulness. Assessment, 13, 27–45. DOI: 10.1177/1073191105283504.
4. Benet-Martinez V, & John OP (1998). Los Cinco Grandes across cultures and ethnic groups: Multitrait-multimethod analyses of the Big Five in Spanish and English. Journal of Personality and Social Psychology, 75, 729–750. DOI: 10.1037/0022-3514.75.3.729.
5. Benjamini Y, & Hochberg Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57, 289–300. DOI: 10.1111/j.2517-6161.1995.tb02031.x.
6. Block J (1995). A contrarian view of the five-factor approach to personality description. Psychological Bulletin, 117, 187–215. DOI: 10.1037/0033-2909.117.2.187.
7. Block J (2000). Three tasks for personality psychology. In Bergman LR, Cairns RB, Nilsson LG, & Nystedt L (Eds.), Developmental science and the holistic approach (pp. 155–164). Mahwah, NJ: Erlbaum.
8. Bradley JV (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. DOI: 10.1111/j.2044-8317.1978.tb00581.x.
9. Campbell DT, & Fiske DW (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. DOI: 10.1037/h0046016.
10. Carstensen LL, & Lang FR (1996). Future time perspective scale. Unpublished manuscript, Stanford University.
11. Cattell RB (1965). The scientific analysis of personality. New York: Penguin Books.
12. Cheung MWL, & Chan W (2004). Testing dependent correlation coefficients via structural equation modeling. Organizational Research Methods, 7, 206–223. DOI: 10.1177/1094428104264024.
13. Clayton S (2003). Environmental identity: A conceptual and operational definition. In Clayton S, & Opotow S (Eds.), Identity and the natural environment (pp. 45–65). Cambridge, MA: MIT Press.
14. Cohen J (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
15. Cohen J, Cohen P, West SG, & Aiken LS (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum.
16. Credé M, Tynan M, & Harms P (2017). Much ado about grit: A meta-analytic synthesis of the grit literature. Journal of Personality and Social Psychology, 113, 492–511. DOI: 10.1037/pspp0000102.
17. Cronbach LJ (1971). Test validation. In Thorndike RL (Ed.), Educational measurement (2nd ed., pp. 443–507). Washington, DC: American Council on Education.
18. Cudeck R, & O'Dell LL (1994). Applications of standard error estimates in unrestricted factor analysis: Significance tests for factor loadings and correlations. Psychological Bulletin, 115, 475–487. DOI: 10.1037/0033-2909.115.3.475.
19. Cureton EE (1951). Validity. In Lindquist EF (Ed.), Educational measurement (pp. 621–694). Washington, DC: American Council on Education.
20. Davis JL, Green JD, & Reed A (2009). Interdependence with the environment: Commitment, interconnectedness, and environmental behavior. Journal of Environmental Psychology, 29, 173–180. DOI: 10.1016/j.jenvp.2008.11.001.
21. Dawis RV (1992). The individual differences tradition in counseling psychology. Journal of Counseling Psychology, 39, 7–19. DOI: 10.1037/0022-0167.39.1.7.
22. Del Giudice M (2017). Heterogeneity coefficients for Mahalanobis' D as a multivariate effect size. Multivariate Behavioral Research, 52, 216–221. DOI: 10.1080/00273171.2016.1262237.
23. Duckworth A, & Gross JJ (2014). Self-control and grit: Related but separable determinants of success. Current Directions in Psychological Science, 23, 319–325. DOI: 10.1177/0963721414541462.
24. Duckworth AL, Peterson C, Matthews MD, & Kelly DR (2007). Grit: Perseverance and passion for long-term goals. Journal of Personality and Social Psychology, 92, 1087–1101. DOI: 10.1037/0022-3514.92.6.1087.
25. Duckworth AL, & Quinn PD (2009). Development and validation of the Short Grit Scale (GRIT–S). Journal of Personality Assessment, 91, 166–174. DOI: 10.1080/00223890802634290.
26. Dunn OJ, & Clark V (1971). Comparison of tests of the equality of dependent correlation coefficients. Journal of the American Statistical Association, 66, 904–908. DOI: 10.1080/01621459.1971.10482369.
27. Dutcher DD, Finley JC, Luloff AE, & Johnson JB (2007). Connectivity with nature as a measure of environmental values. Environment and Behavior, 39, 474–493. DOI: 10.1177/0013916506298794.
28. Edwards AL (1959). Manual-Edwards Personal Preference Schedule (Rev. ed.). New York: Psychological Corporation.
29. Efron B, & Tibshirani RJ (1994). An introduction to the bootstrap. Boca Raton, FL: CRC Press.
30. Eisenberg IW, Bissett PG, Canning JR, Dallery J, Enkavi AZ, Gabrieli SW, … Poldrack RA (2018). Applying novel technologies and methods to inform the ontology of self-regulation. Behaviour Research and Therapy, 101, 46–57. DOI: 10.1016/j.brat.2017.09.014.
31. Eisenberg IW, Bissett PG, Enkavi AZ, Li J, MacKinnon DP, Marsch LA, & Poldrack RA (2019). Uncovering the structure of self-regulation through data-driven ontology discovery. Nature Communications, 10, 2319. DOI: 10.1038/s41467-019-10301-1.
32. Fiske DW (1971). Measuring the concepts of personality. Chicago: Aldine.
33. Fiske DW (1973). Can a personality construct be validated empirically? Psychological Bulletin, 80, 89–92. DOI: 10.1037/h0034786.
34. Fiske DW, & Barack LI (1976). Individuality of item interpretation in interchangeable ACL scales. Educational and Psychological Measurement, 36, 339–345. DOI: 10.1177/001316447603600212.
35. Fouladi RT (2000). Performance of modified test statistics in covariance and correlation structure analysis under conditions of multivariate nonnormality. Structural Equation Modeling, 7, 356–410. DOI: 10.1207/S15328007SEM0703_2.
36. Gonzalez O, Canning JR, Smyth H, & MacKinnon DP (in press). A psychometric evaluation of the Short Grit Scale: A closer look at its factor structure and scale functioning. European Journal of Psychological Assessment. DOI: 10.1027/1015-5759/a000535.
37. Gough HG, & Heilbrun AB (1965). The Adjective Check List manual (Rev. ed.). Palo Alto, CA: Consulting Psychologists Press.
38. Gulliksen H (1968). Methods for determining equivalence of measures. Psychological Bulletin, 70, 534–544. DOI: 10.1037/h0026721.
39. Hoffman JM (2000). Methods to compare dependent correlations: A simulation study and application to an anabolic steroid prevention project (Doctoral dissertation, Arizona State University).
40. Jackson DN, & Guthrie GM (1968). Multitrait-multimethod evaluation of the Personality Research Form. Proceedings of the 76th Annual Convention of the American Psychological Association, 3, 177–178.
41. John OP, Donahue EM, & Kentle RL (1991). The Big Five Inventory – Versions 4a and 5a. Berkeley, CA: University of California, Berkeley, Institute of Personality and Social Research.
42. Judge TA, Erez A, Bono JE, & Thoresen CJ (2002). Are measures of self-esteem, neuroticism, locus of control, and generalized self-efficacy indicators of a common core construct? Journal of Personality and Social Psychology, 83, 693–710. DOI: 10.1037/0022-3514.83.3.693.
43. Kals E, Schumacher D, & Montada L (1999). Emotional affinity toward nature as a motivational basis to protect nature. Environment and Behavior, 31, 178–202. DOI: 10.1177/00139169921972056.
44. Kane M (2006). Validation. In Brennan R (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education and Praeger.
45. Kelley TL (1927). Interpretation of educational measurements. Yonkers, NY: World Book.
46. Kenny DA (1987). Statistics for the social and behavioral sciences. Boston: Little, Brown and Company.
47. Kenny DA, Kashy DA, & Cook WL (2006). The analysis of dyadic data. New York: Guilford.
48. Lubinski D (2004). Introduction to the special section on cognitive abilities: 100 years after Spearman's (1904) "'General intelligence,' objectively determined and measured." Journal of Personality and Social Psychology, 86, 96–111. DOI: 10.1037/0022-3514.86.1.96.
49. Lubinski D (2006). Ability tests. In Eid M (Ed.), Handbook of multimethod measurement in psychology (pp. 101–114). Washington, DC: American Psychological Association.
50. Lubinski D, Tellegen A, & Butcher JN (1983). Masculinity, femininity, and androgyny viewed and assessed as distinct concepts. Journal of Personality and Social Psychology, 44, 428–439. DOI: 10.1037/0022-3514.44.2.428.
51. MacKinnon DP (2008). Introduction to statistical mediation analysis. New York: Routledge.
52. Markus KA, & Borsboom D (2013). Frontiers of test validity theory: Measurement, causation, and meaning. New York: Routledge.
53. Marsh HW, Pekrun R, Parker PD, Murayama K, Guo J, Dicke T, & Arens AK (2019). The murky distinction between self-concept and self-efficacy: Beware of lurking jingle-jangle fallacies. Journal of Educational Psychology, 111, 331–353. DOI: 10.1037/edu0000281.
54. Mayer FS, & Frantz CM (2004). The connectedness to nature scale: A measure of individuals' feeling in community with nature. Journal of Environmental Psychology, 24, 503–515. DOI: 10.1016/j.jenvp.2004.10.001.
55. Meade AW, Johnson EC, & Braddy PW (2008). Power and sensitivity of alternative fit indices in tests of measurement invariance. Journal of Applied Psychology, 93, 568–592. DOI: 10.1037/0021-9010.93.3.568.
56. Meng XL, Rosenthal R, & Rubin DB (1992). Comparing correlated correlation coefficients. Psychological Bulletin, 111, 172–175. DOI: 10.1037/0033-2909.111.1.172.
57. Messick S (1984). Abilities and knowledge in educational achievement testing: The assessment of dynamic cognitive structures. In Plake BS (Ed.), Buros-Nebraska symposium on measurement and testing: Vol. 1. Social and technical issues in testing: Implications for test construction and usage. Hillsdale, NJ: Erlbaum.
58. Messick S (1989). Validity. In Linn RL (Ed.), Educational measurement (3rd ed., pp. 13–104). New York: Macmillan.
59. Michell J (2009). Invalidity in validity. In Lissitz RW (Ed.), The concept of validity: Revisions, new directions and applications (pp. 111–136). Charlotte, NC: Information Age Publishing.
60. Millsap RE (2011). Statistical approaches to measurement invariance. New York: Routledge.
61. Murray HA (1938). Explorations in personality. Oxford, England: Oxford University Press.
62. Muthén LK, & Muthén BO (1998–2012). Mplus user's guide (7th ed.). Los Angeles, CA: Muthén & Muthén.
63. Nisbet EK, Zelenski JM, & Murphy SA (2009). The nature relatedness scale: Linking individuals' connection with nature to environmental concern and behavior. Environment and Behavior, 41, 715–740. DOI: 10.1177/0013916508318748.
64. Olkin I, & Finn JD (1990). Testing correlated correlations. Psychological Bulletin, 108, 330–333. DOI: 10.1037/0033-2909.108.2.330.
65. Olkin I, & Siotani M (1976). Asymptotic distribution of functions of a correlation matrix. In Ikeda S (Ed.), Essays in probability and statistics (pp. 235–251). Tokyo: Shinko Tsusho.
66. Preacher KJ (2006). Testing complex correlational hypotheses with structural equation models. Structural Equation Modeling, 13, 520–543. DOI: 10.1207/s15328007sem1304_2.
67. R Core Team (2016). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/
68. Rao CR (1973). Linear statistical inference and its applications. New York: Wiley.
69. Rosner B, Wang W, Eliassen H, & Hibert E (2015). Comparison of dependent Pearson and Spearman correlation coefficients with and without correction for measurement error. Journal of Biometrics and Biostatistics, 6, 226. DOI: 10.4172/2155-6180.1000226.
70. Rosseel Y (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48, 1–36. http://www.jstatsoft.org/v48/i02/
71. Rulon PJ (1946). On the validity of educational tests. Harvard Educational Review, 16, 290–296.
72. Schmidt DB, Lubinski D, & Benbow CP (1998). Validity of assessing educational–vocational preference dimensions among intellectually talented 13-year-olds. Journal of Counseling Psychology, 45, 436–453. DOI: 10.1037/0022-0167.45.4.436.
73. Shrout PE, & Yip-Bannicq M (2017). Inferences about competing measures based on patterns of binary significance tests are questionable. Psychological Methods, 22, 84–93. DOI: 10.1037/met0000109.
74. Smith GT (2005). On construct validity: Issues of method and measurement. Psychological Assessment, 17, 396–408. DOI: 10.1037/1040-3590.17.4.396.
75. Steiger JH (1980a). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87, 245–251. DOI: 10.1037/0033-2909.87.2.245.
76. Steiger JH (1980b). Testing pattern hypotheses on correlation matrices: Alternative statistics and some empirical results. Multivariate Behavioral Research, 15, 335–352.
77. Steiger JH (2005). Comparing correlations. In Maydeu-Olivares A, & McArdle JJ (Eds.), Contemporary psychometrics: A festschrift to Roderick P. McDonald. Mahwah, NJ: Erlbaum.
78. Tam KP (2013). Concepts and measures related to connection to nature: Similarities and differences. Journal of Environmental Psychology, 34, 64–78. DOI: 10.1016/j.jenvp.2013.01.004.
79. Tangney JP, Baumeister RF, & Boone AL (2004). High self-control predicts good adjustment, less pathology, better grades, and interpersonal success. Journal of Personality, 72, 271–324. DOI: 10.1111/j.0022-3506.2004.00263.x.
80. Thielmann I, & Hilbig BE (2019). Nomological consistency: A comprehensive test of the equivalence of different trait indicators for the same constructs. Journal of Personality, 87, 715–730. DOI: 10.1111/jopy.12428.
81. Thorndike EL (1904). An introduction to the theory of mental and social measurements. New York: Teachers College, Columbia University.
82. Vale CD, & Maurelli VA (1983). Simulating multivariate nonnormal distributions. Psychometrika, 48, 465–471. DOI: 10.1007/BF02293687.
83. Westen D, & Rosenthal R (2003). Quantifying construct validity: Two simple measures. Journal of Personality and Social Psychology, 84, 608–618. DOI: 10.1037/0022-3514.84.3.608.
84. Widaman KF (1985). Hierarchically nested covariance structure models for multitrait-multimethod data. Applied Psychological Measurement, 9, 1–26. DOI: 10.1177/014662168500900101.
85. Wilcox RR (2009). Comparing Pearson correlations: Dealing with heteroscedasticity and nonnormality. Communications in Statistics-Simulation and Computation, 38, 2220–2234. DOI: 10.1080/03610910903289151.
86. Wilcox RR (2016). Comparing dependent robust correlations. British Journal of Mathematical and Statistical Psychology, 69, 215–224. DOI: 10.1111/bmsp.12069.
87. Wilcox RR, & Tian T (2008). Comparing dependent correlations. The Journal of General Psychology, 135, 105–112. DOI: 10.3200/GENP.135.1.105-112.
88. Williams EJ (1959). The comparison of regression variables. Journal of the Royal Statistical Society, Series B, 21, 396–399. DOI: 10.1111/j.2517-6161.1959.tb00346.x.
89. Yule GU (1922). An introduction to the theory of statistics. London: Charles Griffin.
90. Zimbardo PG, & Boyd JN (1999). Putting time in perspective: A valid, reliable individual-difference metric. Journal of Personality and Social Psychology, 77, 1271–1288. DOI: 10.1037/0022-3514.77.6.1271.
91. Zou GY (2007). Toward using confidence intervals to compare correlations. Psychological Methods, 12, 399–413. DOI: 10.1037/1082-989X.12.4.399.
