Educational and Psychological Measurement. 2019 Jan 13;80(2):389–398. doi: 10.1177/0013164418822703

Examining Validity-Related Correlations and Their Differences Under Assumption Violations

Tenko Raykov 1, Abdullah A Al-Qataee 2, Dimiter M Dimitrov 2,3
PMCID: PMC7047262  PMID: 32158027

Abstract

A procedure for evaluation of validity related coefficients and their differences is discussed, which is applicable when one or more frequently used assumptions in empirical educational, behavioral and social research are violated. The method is developed within the framework of the latent variable modeling methodology and accomplishes point and interval estimation of convergent and discriminant correlations as well as differences between them in cases of incomplete data sets with data not missing at random, nonnormality, and clustering effects. The procedure uses the full information maximum likelihood approach to model fitting and parameter estimation, does not assume availability of multiple indicators for underlying latent constructs, includes auxiliary variables, and accounts for within-group correlations on main response variables resulting from nesting effects involving studied respondents. The outlined procedure is illustrated on empirical data from a study using tertiary education entrance examination measures.

Keywords: auxiliary variable, convergent validity, correlation, discriminant validity, interval estimation, latent variable modeling, maximum likelihood, missing data


Validity is unquestionably of primary relevance in any measurement effort, and in particular in educational, behavioral, social, clinical, business and marketing research (e.g., McDonald, 1999). In many respects, therefore, validity may be seen as the bottom line of measurement in these and related sciences (e.g., Crocker & Algina, 2006). While one may argue that validity is predominantly a substantive concept, its assessment is substantially facilitated by considering forms of validity that become of special importance in certain conditions and studies. Two such forms of validity are convergent and discriminant validity (e.g., Campbell & Fiske, 1959). Their evaluation in empirical research may significantly assist a scholar in the process of accumulating evidence for or against validity of measuring instruments used with populations under investigation (e.g., Messick, 1995).

When estimating validity-related coefficients, such as convergent and discriminant coefficients, scholars often assume complete data or data missing at random (MAR), as well as lack of clustering effects, and possibly also normality of the manifest variables. Unfortunately, the majority of empirical studies in the educational, behavioral, and social sciences involve missing data that tend not to be missing at random. In addition, data are frequently not close to normal and are collected from respondents who are nested within higher-order units, such as classes, schools, teachers, neighborhoods, physicians, cities, counties, or regions, to mention but a few possibilities. Since respondents within these groups have typically shared various experiences across possibly many years of interactions of various kinds, their scores on key response variables may well exhibit notable similarities (e.g., Rabe-Hesketh & Skrondal, 2012). Under these circumstances, not accounting for this within-group correlation may lead to incorrect substantive conclusions and is likely to entail spuriously small standard errors for parameters of main interest (e.g., Raudenbush & Bryk, 2002). Nesting effects of this type often occur alongside data that are not missing at random, which further raises the chance of reaching misleading substantive conclusions if violations of the assumptions of complete data, MAR, normality, or independent respondents are not dealt with. Last but not least, many conventional studies of convergent and discriminant validity have proceeded within the multitrait-multimethod framework, which presumes availability of sufficient numbers of unidimensional multiple indicators of the studied traits, as well as sufficiently many constructs under consideration, in order to achieve model identification (e.g., Grayson & Marsh, 1994; see also Raykov & Marcoulides, 2016). This assumption is too strong or hard to meet, if possible at all, in many empirical settings where researchers only have access to overall measures of the latent variables of concern that need not be unidimensional. That is, they effectively have access merely to single indicators of the underlying traits under consideration, while still being interested in evaluating construct indicator relationships as well as their similarities and differences.

The present note discusses a procedure for evaluation of validity-related correlations as well as differences between convergent correlations, between discriminant correlations, and between convergent and discriminant correlations, without making the assumptions of complete data, multiple unidimensional indicators, data missing at random, normality, or independent respondents (lack of nesting effects). These assumptions, which are frequently violated in empirical studies, have in part characterized earlier research on evaluating validity-related correlation coefficients (e.g., Crocker & Algina, 2006; Raykov, 2011). The outlined approach is illustrated on data from a study concerned with tertiary entrance examination measures.

Background, Notation, and Assumptions

Suppose k (approximately) continuous measures are collected on a sample of n persons in a study of a population of interest (k, n > 1). Denote these observed variables by y1, . . ., yk and let y = (y1, . . ., yk)′ be their vector (with priming denoting transposition in the rest of the article). Designate by R = Corr(y) = [ρij] their population correlation matrix (i, j = 1, . . ., k; e.g., Raykov & Marcoulides, 2008). In educational, behavioral, social, clinical, organizational, and marketing studies, some values may well be missing in a collected data set, that is, the pertinent n×k data matrix, denoted M, may be incomplete. In this setting, suppose an investigator is concerned with evaluating (a) the similarity between correlations among (single) observed measures that on theoretical grounds are expected to be indicators of closely related concepts (latent traits, abilities, or constructs) and (b) the discrepancy between correlations of measures that on theoretical grounds are expected to be indicators of concepts (constructs) that are largely unrelated or only related to a limited extent (see next section for details). Frequently in empirical research, the measures indicated in (a) are of relevance when interested in assessing what may be referred to as convergent validity, while those in (b) when concerned with assessing discriminant (divergent) validity (e.g., Campbell & Fiske, 1959).

The increasingly popular latent variable modeling (LVM) methodology offers a readily utilized framework for responding to these concerns (B. O. Muthén, 2002), and will be therefore used in the rest of this article. Specifically, the approach of confirmatory factor analysis will be employed in the following discussion (e.g., Raykov & Marcoulides, 2006). More concretely, use will be made of the following special case of a generally applicable confirmatory factor analysis model:

y=α+Λη+ε, (1)

where Λ is the identity matrix of size k×k, η is a zero-mean vector of size k×1 consisting of dummy variables, α is the k×1 vector consisting of the observed means, and ε is the k×1 vector consisting of zeros (e.g., Raykov & Marcoulides, 2010).
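To illustrate why this special case is saturated, the following R sketch (R being the language of Appendix B) uses hypothetical standard deviations and correlations, not values from the study. It compares the parameterization stated above, with Λ the identity matrix, against the one that the Mplus code in Appendix A appears to use, namely free loadings with latent variances fixed at 1. Under both, with ε = 0, the implied covariance matrix of y is the same, and under the second the latent covariances coincide with the correlations of the observed measures.

```r
# Hypothetical population values (for illustration only, not from the study)
R_pop <- matrix(c(1.0, 0.7, 0.4,
                  0.7, 1.0, 0.3,
                  0.4, 0.3, 1.0), nrow = 3, byrow = TRUE)
sds <- c(10, 15, 5)

# Covariance matrix of y implied by these correlations and standard deviations
Sigma <- diag(sds) %*% R_pop %*% diag(sds)

# Parameterization 1: Lambda = I, eps = 0, so Cov(eta) = Cov(y)
Lambda_identity  <- diag(3)
implied_identity <- Lambda_identity %*% Sigma %*% t(Lambda_identity)

# Parameterization 2: free loadings (equal to the SDs), Var(eta) = 1, eps = 0,
# so Cov(eta) is the correlation matrix of y
Lambda_free      <- diag(sds)
implied_unit_var <- Lambda_free %*% R_pop %*% t(Lambda_free)

# Both parameterizations reproduce the same covariance matrix for y
all.equal(implied_identity, implied_unit_var)

# and the latent covariances in parameterization 2 are the correlations of y
all.equal(cov2cor(implied_unit_var), R_pop)
```

Because the model reproduces any covariance matrix exactly, it imposes no restrictions on the data and serves only as a vehicle for estimating the correlations and their standard errors.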

Evaluation of Convergent and Discriminant Validity Correlations and Their Differences When Assumptions Fail

In the remainder of this note, given the presence of missing data in the data matrix M, we will use the popular maximum likelihood (or full information maximum likelihood, FIML) method for estimating the correlations among the measures (e.g., Enders, 2010). As is widely known, the method is based on the MAR assumption (e.g., Little & Rubin, 2002). This assumption is fulfilled when, once the observed data are taken into account, the probability of missingness is not related to the actually missing (unobserved) values; whether that probability is related to the observed data is irrelevant. With this in mind, the MAR assumption permits what may be seen as systematic missingness in general, since the probability of missingness is allowed to relate to observed data, for example, to variables with complete observations (e.g., Raykov & Marcoulides, 2008).
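The distinction can be made concrete with a small simulation. The R sketch below is an illustration under assumed values (sample size, coefficients, and variable names are all hypothetical). It generates missingness indicators for a variable y under a MAR mechanism, where missingness depends only on a fully observed x, and under an NMAR mechanism, where it depends on y itself. A logistic regression of each indicator on x and y then shows whether y still predicts missingness once x is accounted for. Such a check is possible only in a simulation, since in real data the missing values of y are unavailable.

```r
set.seed(2019)
n <- 10000
x <- rnorm(n)                        # fully observed variable
y <- 0.6 * x + rnorm(n, sd = 0.8)    # variable that would receive missing values

# MAR: the probability of missingness depends only on the observed x
miss_mar  <- rbinom(n, 1, plogis(-1 + 1.5 * x))
# NMAR: the probability of missingness depends on the value of y itself
miss_nmar <- rbinom(n, 1, plogis(-1 + 1.5 * y))

# Does y predict missingness after conditioning on x?
fit_mar  <- glm(miss_mar  ~ x + y, family = binomial)
fit_nmar <- glm(miss_nmar ~ x + y, family = binomial)

round(rbind(MAR = coef(fit_mar), NMAR = coef(fit_nmar)), 2)
```

Under the MAR mechanism the coefficient of y should be close to zero, whereas under the NMAR mechanism it should be clearly positive.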

While the MAR assumption is very helpful, one may argue that it is not frequently fulfilled in empirical educational, behavioral, and social research. The reason is that oftentimes subjects with particular or unusual values on some variables (e.g., extreme scores) are more likely to give rise to missing data on these measures, thus invalidating that assumption. To counteract some consequences of this violation, as well as to enhance the plausibility of the MAR stipulation, we will use the method of auxiliary variables (AVs) when applying the robust version of the FIML method to account also for some violations of normality (e.g., Collins, Schafer, & Kam, 2001). The AVs are not measures of main interest to a researcher, and if there were no missing values on the variables in a model of concern he/she would not necessarily involve the AVs in their modeling and analytic efforts. Effective and useful AVs are typically causes of missingness or variables related to them, and hence represent measures containing some information about the missing values (Enders, 2010). The choice of AVs from among a set of available and also measured variables in a given empirical study may be facilitated by some statistical procedures (e.g., Raykov & Marcoulides, 2014; Raykov & West, 2016) and is likely to benefit to a pronounced degree by substantive considerations helping to identify measures in an overall data set that are closely related on theoretical grounds to dependent variables with missing values in a model of interest to the researcher. This general recommendation for selection of useful AVs will be followed in the next section dealing with an empirical application.

Since we will be concerned then with a utilization of this approach to data resulting from psychometric testing of applicants to tertiary educational institutions who are coming from different geographic regions, we need to account for the clustering effects that may result thereby. The latter would stem from the fact that applicants from the same geographic region have shared numerous experiences for instance when attending high school or earlier in their schooling period, due in part at least to the fact that they were enrolled in schools in the same general area of the country of relevance. As a consequence, it may be expected that there will be some nesting effects within region, which would likely affect the applicant scores on the measures of concern in this article. To account for these effects, rather than ignore them and thus most likely confront spuriously deflated standard errors for the correlations and their differences of interest below, we will use the robust version of FIML. This version, based on the so-called “sandwich” estimator, adjusts correspondingly the overall goodness of fit indices of fitted models as well as in particular the standard errors of individual parameters (L. K. Muthén & Muthén, 2018). Therefore, using this approach, one can obtain statistically valid standard errors and associated confidence intervals at prespecified confidence levels, for any given validity-related correlation or correlation difference when employing the saturated model in Equation (1). This feature of the approach is particularly attractive for the concerns of this note, namely point and interval estimation of validity related correlations and their differences in the presence of data not missing at random (NMAR), nonnormality, and clustering effects. In addition, this modeling approach is easily used with the widely circulated LVM software Mplus (L. K. Muthén & Muthén, 2018).
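The consequence of ignoring nesting can be illustrated with a small simulation. The R sketch below uses hypothetical values throughout (13 equally sized clusters, an assumed share of 0.3 of the variance due to cluster effects, and an assumed within-cluster correlation of 0.6). It compares the usual large-sample standard error of a correlation, which treats respondents as independent, with a standard error from a cluster bootstrap that resamples whole clusters. The cluster bootstrap is used here only as a simple stand-in that respects the nesting; it is not the sandwich estimator employed by the robust FIML method.

```r
set.seed(123)
J <- 13                       # number of clusters
m <- 63                       # hypothetical (equal) cluster size
n <- J * m
cluster <- rep(seq_len(J), each = m)

# Shared cluster effects make scores within a cluster more alike
u_x <- rnorm(J, sd = sqrt(0.3))
u_y <- rnorm(J, sd = sqrt(0.3))
e_x <- rnorm(n, sd = sqrt(0.7))
e_y <- 0.6 * e_x + rnorm(n, sd = sqrt(0.7 * (1 - 0.6^2)))
x <- u_x[cluster] + e_x
y <- u_y[cluster] + e_y

r <- cor(x, y)

# Large-sample SE that assumes independent (and normal) observations
se_naive <- (1 - r^2) / sqrt(n)

# Cluster bootstrap: resample whole clusters with replacement
rows_by_cluster <- split(seq_len(n), cluster)
r_boot <- replicate(1000, {
  picked <- sample(J, J, replace = TRUE)
  rows <- unlist(rows_by_cluster[picked], use.names = FALSE)
  cor(x[rows], y[rows])
})
se_cluster <- sd(r_boot)

round(c(r = r, se_naive = se_naive, se_cluster = se_cluster), 3)
```

With cluster effects of this size, the standard error that ignores nesting would be expected to be noticeably smaller than the one that respects it, which is the spurious deflation referred to above.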

We apply next the discussed procedure on data from a tertiary education entrance examination in the Kingdom of Saudi Arabia.

Point and Interval Estimation of Validity-Related Correlations and Their Differences for University Entrance Examination Measures

In this section, we will be concerned with evaluation of the correlations and their differences among several ability-related measures used for the aforementioned examination purpose. These measures are (a) the General Ability Test (GAT), and in particular its total score (abbreviated as GATT); (b) the Post-Graduate Ability Test (PGAT) and particularly its total score (PGATT); (c) the Science Achievement Admission Test (SAAT); (d) the Standardized Test of English Proficiency (STEP); as well as (e) the Teacher Licensure Test in two types of scores—General Test (TLTG) scores and Subject Test (TLTS) scores. The TLTG is taken by all applicants for teacher licensure, whereas the TLTS in a specific subject area (e.g., mathematics, biology, chemistry, physics, etc.) is taken only by teacher candidates who specialize in that area. All these tests are developed and administered by the National Center for Assessment in Saudi Arabia.

In a first step of an application of the estimation procedure outlined in the preceding section, we fit the model defined in Equation (1) (see also immediately following discussion) to the available data from n = 817 male applicants who were coming from J = 13 geographic regions of the Kingdom of Saudi Arabia. (See Appendix A for the Mplus source code used thereby as well as the note to it.) To counteract possible violations of MAR—to the degree feasible with auxiliary variables, given the measures at hand—and to enhance the plausibility of this assumption, we use as AVs the GAT quantitative and verbal scores, the PGAT quantitative and verbal scores, and the TLTG score (e.g., Collins et al., 2001; as indicated earlier, this model is saturated). The resulting correlation estimates, along with their standard errors, t-values, and two-tailed p-values, as well as the associated 95% confidence intervals obtained using the initial transformation approach in Raykov and Marcoulides (2011; see Appendix B for the R-function employed to furnish these intervals) are presented in Table 1.

Table 1.

Validity-Related Correlation Estimates, Standard Errors, and 95% Confidence Intervals (Software Format).

Correlation    Est.     SE       t         p       95% CI
GATT WITH
  SAAT         0.477    0.025    18.736    .000    (0.426, 0.524)
  PGATT        0.709    0.019    36.863    .000    (0.670, 0.744)
  STEP         0.399    0.028    14.417    .000    (0.343, 0.452)
  TLTS         0.372    0.031    12.052    .000    (0.310, 0.431)
SAAT WITH
  PGATT        0.461    0.025    18.502    .000    (0.411, 0.508)
  STEP         0.360    0.030    12.131    .000    (0.300, 0.417)
  TLTS         0.341    0.035     9.662    .000    (0.271, 0.408)
PGATT WITH
  STEP         0.465    0.025    18.403    .000    (0.415, 0.512)
  TLTS         0.471    0.027    17.173    .000    (0.416, 0.522)
STEP WITH
  TLTS         0.311    0.033     9.408    .000    (0.245, 0.374)

Note. Est. = parameter estimate, SE = standard error, t = t-value (= Est./SE), p = two-tailed p-value, 95% CI = 95% confidence interval (cf. L. K. Muthén & Muthén, 2018).

As can be seen from Table 1, the strongest evaluated correlation is that between the total scores on the GAT and PGAT measures, estimated at 0.709 (with a standard error [SE] of 0.019). Inspecting its confidence interval (CI), we see that a range of practically highly plausible population values for this correlation stretches from the mid/high 0.60s through the mid 0.70s (see last column of Table 1 containing the 95% CIs of the correlations). This finding can be expected since both variables involved in this correlation are academic aptitude measures, and hence would be anticipated to be markedly interrelated. On substantive grounds, this correlation can thus be viewed as a convergent validity-related correlation (e.g., Campbell & Fiske, 1959). (We may mention in passing that GAT is taken by high school graduates who apply to universities and measures their analytical and deductive skills in two parts, verbal and quantitative, whereas PGAT, which is taken by university graduates, measures the same type of skills in three parts, namely verbal, quantitative, and logical/inductive-spatial.) Conversely, the weakest correlation in Table 1 is that between STEP and TLTS, estimated at 0.311 (SE = 0.033), with a range of practically highly plausible population values stretching only from the mid 0.20s through the high 0.30s (95% CI: 0.245, 0.374). This considerably weaker correlation coefficient can be seen as expected, owing to the fact that the two measures evaluate distinct constructs that are not markedly interrelated on substantive grounds. Specifically, STEP measures proficiency in English as a second language, whereas TLTS measures knowledge and skills in a particular subject area such as mathematics, biology, chemistry, physics, and so forth.

As a second step of the application of the validity-related correlation estimation procedure discussed in this article, we fit the same model in Equation (1) (to the same data set and using the same auxiliary variables), but add as “external parameters” the differences (a) between two correlations that on substantive grounds can be expected to be markedly different, being a convergent and a discriminant correlation, respectively, and (b) between two correlations substantively expected to be close to each other, both being of discriminant type. Specifically, we estimate the difference between the correlation of the GAT and PGAT total scores, on the one hand, and the correlation between the STEP and TLTS scores, on the other hand. (See Appendix A for the Mplus source code used thereby.) This convergent-to-divergent correlation difference is estimated as 0.337 (0.028), with a 95% CI (0.282, 0.392). These results suggest that there is a marked population difference, as expected, between the correlation of close traits, as evaluated by the GAT and PGAT total scores, on the one hand, and the correlation of distinct traits, as evaluated by the STEP and TLTS measures, on the other hand. This difference is significant and is suggested to be positive in the population, as can be seen from the fact that the last stated confidence interval does not contain the zero point and lies entirely above it.

Furthermore, we estimate in the same way the difference between the correlation between the SAAT and STEP measures, on the one hand, and the correlation between the STEP and TLTS measures, on the other hand (see Table 1 for the estimates and related statistics for these correlations, and the note to Appendix A). These two correlations are both discriminant in nature, as the STEP measures (proficiency in English) do not overlap in content with the SAAT and TLTS measures. Both correlations are expected to be of similar magnitude and, as anticipated, their difference is estimated at a value close to 0, namely 0.02 (0.039), with a 95% CI (−0.057, 0.097). This interval covers the zero point, and hence one can conclude that there is not enough evidence in the analyzed data set to warrant rejection of the null hypothesis of these correlations’ equality, if one were interested in testing the conjecture of their identity (cf. Raykov & Marcoulides, 2008). We would like to point out that, using the same approach as above in this section, one can obtain point and interval estimates of the difference between any two correlations of interest, namely via introduction of an “external parameter” that is defined as equal to their difference.
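The two intervals reported for the correlation differences are consistent with symmetric (Wald-type) limits of the form estimate ± 1.96 × SE. This reading is an inference from the reported numbers rather than a statement about the software settings used. The R sketch below, in which the helper names `wald_ci` and `covers_zero` are hypothetical, recomputes the limits from the rounded estimates and standard errors given in the text. Agreement with the reported intervals can therefore be expected only to within rounding.

```r
# Symmetric (Wald-type) confidence interval from an estimate and its SE
wald_ci <- function(est, se, z = 1.96) {
  c(lower = est - z * se, upper = est + z * se)
}

# TRUE if the interval contains zero, i.e., equality is not rejected
covers_zero <- function(ci) ci[["lower"]] < 0 && ci[["upper"]] > 0

# Rounded estimates and SEs of the two correlation differences in the text
ci_conv_disc <- wald_ci(0.337, 0.028)   # convergent minus discriminant
ci_disc_disc <- wald_ci(0.02, 0.039)    # discriminant minus discriminant

round(ci_conv_disc, 3)
round(ci_disc_disc, 3)

covers_zero(ci_conv_disc)   # FALSE: interval lies entirely above zero
covers_zero(ci_disc_disc)   # TRUE: interval includes zero
```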

Conclusion

This note was concerned with evaluation of convergent validity– and discriminant validity–related correlations and differences among them for (approximately) continuous initial measures. These correlations and differences can obtain special theoretical and empirical importance when examining educational, behavioral, and social phenomena, and particularly at the measure validation stage (cf. Messick, 1995). No assumption of complete data, (sufficient number of) multiple unidimensional indicators of underlying traits of concern, data missing at random, normality, or independence was made thereby. As a model fitting and estimation method, full information maximum likelihood (robust version) was used, which permitted accounting for within-group correlation on the observed variables of concern.

The discussed point and interval estimation procedure is best used with large samples, since it is based on the maximum likelihood method, which rests on an asymptotic statistical theory (e.g., Casella & Berger, 2002). Whether this theory has obtained practical relevance in a given empirical study is currently a difficult question to answer, since this is likely substantially influenced by the amount of missing data (missing information) in the analyzed data set as well as by additional factors, such as the number of observed variables and parameters. We encourage future research addressing this complex question, which would hopefully contribute to developing guidelines for determining when the resulting estimates and standard errors can be trusted. Similarly, the procedure is based on the assumption of (approximately) continuous original measures. One may conjecture, however, that it may be robust to some violations of continuity, specifically with potentially as few as five to seven possible values on noncontinuous measures, as long as their distributions are not excessively skewed and/or kurtotic. We encourage further research, possibly based on comprehensive simulation studies, addressing the query whether this conjecture may be considered dependable. Last but not least, the results of an application of the outlined method can be expected to be more trustworthy with fewer assumption violations and with a lower degree of missing data in empirical studies (sets of relevant measures).

In conclusion, this note offers to empirical educational, behavioral, social, clinical, organizational, and marketing scientists a readily and widely applicable procedure for evaluation of validity-related correlations and their differences under multiple assumption violations, which may be particularly useful in empirical measure validation studies.

Acknowledgments

We are indebted to G. A. Marcoulides for valuable discussions on validity and its assessment.

Appendix A

Mplus Source Code for Evaluation of Validity-Related Correlations and Their Differences Under Assumption Violations

TITLE: EVALUATION OF VALIDITY RELATED CORRELATIONS AND THEIR
  DIFFERENCES UNDER ASSUMPTION VIOLATIONS.
DATA: FILE = <NAME OF RAW DATA FILE>;
VARIABLE: NAMES = SEQ GAT_Q GAT_V GAT_T SAAT PGAT_T PGAT_Q PGAT_V STEP
  TLT_G TLT_S SEX GENDER GRAD_YR REGION;
  USEV = GAT_T SAAT PGAT_T STEP TLT_S;
  MISSING = ALL(-999);
  AUXILIARY = (M) GAT_Q GAT_V PGAT_Q PGAT_V TLT_G;
ANALYSIS: ESTIMATOR = MLR;
MODEL: LGATT BY GAT_T*; GAT_T@0;
  LSAAT BY SAAT*; SAAT@0;
  LPGATT BY PGAT_T*; PGAT_T@0;
  LSTEP BY STEP*; STEP@0;
  LTLTS BY TLT_S*; TLT_S@0;
  LGATT-LTLTS@1;
  LGATT WITH LSAAT-LTLTS (FI12-FI15);
  LSAAT WITH LPGATT-LTLTS (FI23-FI25);
  LPGATT WITH LSTEP-LTLTS (FI34-FI35);
  LSTEP WITH LTLTS (FI45);
MODEL CONSTRAINT:
  NEW(GPG_GTLT, SS_STLT);
  GPG_GTLT = FI13-FI15;
  SS_STLT = FI24-FI25;
OUTPUT: CINTERVAL; ! PROVIDES SE’S FOR CORRELATIONS OF RELEVANCE.

Note. Annotating comments added after exclamation mark. For point and interval estimation of validity related correlations, drop the entire MODEL CONSTRAINT section (its 4 lines). (Modify correspondingly the MODEL CONSTRAINT section for point and interval estimation of other correlation differences.) For the 95% confidence intervals reported in the last column of Table 1, use the R-function “ci.corr” in Appendix B (cf. Raykov & Marcoulides, 2011, chap. 4).
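When modifying the MODEL CONSTRAINT section for other correlation differences, it helps to know which correlation each parameter label refers to. The R sketch below rests on our reading of the label lists in the WITH statements above. We take a list such as (FI12-FI15) to assign its labels in the order in which the latent variables were defined, so that the two digits index the variables GATT, SAAT, PGATT, STEP, and TLTS. The helper `label_pair` is hypothetical, and the estimates are the rounded values from Table 1. The sketch then evaluates the two expressions defined in the MODEL CONSTRAINT section at those estimates.

```r
# Order in which the latent variables are defined in the MODEL section
vars <- c("GATT", "SAAT", "PGATT", "STEP", "TLTS")

# Rounded correlation estimates from Table 1, keyed by parameter label
est <- c(FI12 = 0.477, FI13 = 0.709, FI14 = 0.399, FI15 = 0.372,
         FI23 = 0.461, FI24 = 0.360, FI25 = 0.341,
         FI34 = 0.465, FI35 = 0.471,
         FI45 = 0.311)

# Translate a label such as "FI15" into the pair of measures it refers to
label_pair <- function(label) {
  idx <- as.integer(strsplit(sub("^FI", "", label), "")[[1]])
  paste(vars[idx], collapse = " WITH ")
}

sapply(names(est), label_pair)

# The two expressions defined under MODEL CONSTRAINT, at the Table 1 values
new_params <- c(GPG_GTLT = est[["FI13"]] - est[["FI15"]],
                SS_STLT  = est[["FI24"]] - est[["FI25"]])
round(new_params, 3)
```

A difference between any other two correlations can be requested by replacing the labels on the right-hand side of a MODEL CONSTRAINT expression with the labels of the pairs of interest.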

Appendix B

R-Function for Obtaining Confidence Intervals of Validity-Related Correlations

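ci.corr <- function(r, se){
  # r = correlation estimate, se = its standard error
  z <- 0.5 * log((1 + r) / (1 - r))    # Fisher z-transform of the correlation
  sez <- se / (1 - r^2)                # delta-method standard error on the z scale
  ci_z_lo <- z - 1.96 * sez            # 95% limits on the z scale
  ci_z_up <- z + 1.96 * sez
  # back-transform the limits to the correlation scale
  ci_lo <- (exp(2 * ci_z_lo) - 1) / (exp(2 * ci_z_lo) + 1)
  ci_up <- (exp(2 * ci_z_up) - 1) / (exp(2 * ci_z_up) + 1)
  c(ci_lo, ci_up)
}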

Note. This R-function is a minor modification of the function “ci.pc” in Raykov and Marcoulides (2011, chap. 4) specifically for the needs of the present note.
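As a usage illustration, the following self-contained R sketch applies the same computation to three rows of Table 1. The helper name `fisher_ci` is hypothetical. It relies on the identities that 0.5·log((1 + r)/(1 − r)) equals `atanh(r)` and that the back-transformation equals `tanh()`, so it is algebraically the same calculation as in the function above. Because the inputs are the rounded estimates and standard errors from Table 1, the resulting limits can be expected to agree with the tabled intervals only to within rounding.

```r
# 95% CI for a correlation via the Fisher z-transform and the delta method
fisher_ci <- function(r, se, z_crit = 1.96) {
  z    <- atanh(r)           # same as 0.5 * log((1 + r) / (1 - r))
  se_z <- se / (1 - r^2)     # standard error on the z scale
  c(lower = tanh(z - z_crit * se_z),
    upper = tanh(z + z_crit * se_z))
}

# Rounded estimates and standard errors from three rows of Table 1
tab <- data.frame(
  pair = c("GATT WITH SAAT", "GATT WITH PGATT", "STEP WITH TLTS"),
  est  = c(0.477, 0.709, 0.311),
  se   = c(0.025, 0.019, 0.033)
)

ci_table <- t(mapply(fisher_ci, tab$est, tab$se))
rownames(ci_table) <- tab$pair
round(ci_table, 3)
```

Unlike a symmetric interval on the correlation scale, the limits obtained this way always stay within (−1, 1).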

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work by T. Raykov on this research was supported by the National Center for Assessment, Riyadh, Saudi Arabia.

References

  1. Campbell D. T., Fiske D. W. (1959). Convergent and discriminant validity by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
  2. Casella G., Berger J. (2002). Statistical inference. Monterey, CA: Wadsworth.
  3. Collins L. M., Schafer J. L., Kam C.-M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330-351.
  4. Crocker L., Algina J. (2006). Introduction to classical and modern test theory. Fort Worth, TX: Harcourt Brace Jovanovich.
  5. Enders C. K. (2010). Applied missing data analysis. New York, NY: Guilford Press.
  6. Grayson D., Marsh H. W. (1994). Identification with deficient rank loading matrices in confirmatory factor analysis: Multitrait multimethod models. Psychometrika, 59, 121-134.
  7. Little R. J. A., Rubin D. B. (2002). Statistical analysis with missing data. New York, NY: Wiley.
  8. McDonald R. P. (1999). Test theory. A unified treatment. Mahwah, NJ: Lawrence Erlbaum.
  9. Messick S. (1995). Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749.
  10. Muthén B. O. (2002). Beyond SEM: General latent variable modeling. Behaviormetrika, 29, 87-117.
  11. Muthén L. K., Muthén B. O. (2018). Mplus user’s guide. Los Angeles, CA: Muthén & Muthén.
  12. Rabe-Hesketh S., Skrondal A. (2012). Multilevel and longitudinal modeling with Stata. College Station, TX: Stata Press.
  13. Raudenbush S., Bryk A. (2002). Hierarchical linear models. Thousand Oaks, CA: Sage.
  14. Raykov T. (2011). Estimation of convergent and discriminant validity with multi-trait multi-method correlations. British Journal of Mathematical and Statistical Psychology, 64, 38-52.
  15. Raykov T., Marcoulides G. A. (2006). A first course in structural equation modeling (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
  16. Raykov T., Marcoulides G. A. (2008). An introduction to applied multivariate analysis. New York, NY: Taylor & Francis.
  17. Raykov T., Marcoulides G. A. (2010). Group comparisons in the presence of missing data using latent variable modeling techniques. Structural Equation Modeling, 17, 135-149.
  18. Raykov T., Marcoulides G. A. (2011). Introduction to psychometric theory. New York, NY: Taylor & Francis.
  19. Raykov T., Marcoulides G. A. (2014). Identifying useful auxiliary variables for incomplete data analyses: A note on a group difference examination approach. Educational and Psychological Measurement, 74, 537-550.
  20. Raykov T., Marcoulides G. A. (2016). On examining specificity in latent construct indicators. Structural Equation Modeling, 23, 845-855.
  21. Raykov T., West B. T. (2016). On enhancing plausibility of the missing at random assumption in incomplete data analyses via evaluation of response-auxiliary variable correlations. Structural Equation Modeling, 23, 43-53.

Articles from Educational and Psychological Measurement are provided here courtesy of SAGE Publications
