Abstract
Across a wide range of substance use outcomes, ethnic/racial minorities in the U.S. experience a disproportionately higher burden of negative health outcomes and/or lower levels of access to care (relative to non-Latinx White individuals). Various explanations for these substance use related health disparities have been proposed. This narrative review will not focus on the theoretical content of these explanations but will instead focus on the underlying statistical frameworks used to test such theories. Here, we provide a narrative review of psychometric critiques of cross-cultural research, which collectively suggest that: 1) research testing similarities and differences among ethnic/racial groups often fails to test the statistical assumption of equal instrument functioning across the ethnic/racial groups being compared; 2) testing this assumption is feasible using established guidelines from modern measurement theories; and 3) substance use research may need to explicitly incorporate tests of equal instrument functioning to prevent bias when making inferences across ethnic/racial groups. We provide recommendations for evaluating the cultural equivalence of measurement using structural equation modeling, and advocate that cross-cultural substance use research move toward statistical approaches that are better positioned to test for (and model) bias in measurement. Explicitly testing the cultural equivalence of measurement when making inferences across cultural groups (within a falsifiable psychometric framework) can advance our understanding of similarities and differences among ethnic/racial groups, and hence can provide a more socially just (and statistically robust) scientific base.
Keywords: cross-cultural, race, ethnicity, psychometrics, measurement invariance
There is a long history of health disparities in the United States (U.S.; Hammonds & Reverby, 2019; Singh et al., 2017; Williams, Priest, & Anderson, 2016), and the importance of addressing health disparities has been systemically acknowledged by the U.S. government for decades (e.g., Secretary’s Task Force on Black and Minority Health, 1985). These disparities are particularly consequential in substance use outcomes, with ethnic/racial minority populations frequently displaying disproportionately higher levels of negative outcomes and/or lower access and response to treatment (for reviews see Chartier & Caetano, 2010; Larimer & Arroyo, 2016; Pinedo, 2019; Spillane & Smith, 2007; Zapolski, Pedersen, McCarthy, & Smith, 2014; Zemore et al., 2018).
Understanding the development of substance use within a culturally sensitive framework is important for public health. Life expectancy in the U.S. is declining and this has been partially attributed to alcohol and other substance use problems (Koob, Powell, & White, 2020; Woolf et al., 2018). Although these “deaths of despair” were initially identified in (non-Latinx) White populations, the decline in life expectancy has been shown to apply across ethnicity/race (Gaydosh et al., 2019), with worse outcomes for minority populations in some instances. For example, although rates of alcohol-induced deaths have increased at an unprecedented pace for White males from 2000 to 2016, there is still a significantly higher alcohol-induced death rate among Latinx males, relative to White males (Spillane et al., 2020). It is important to note that health disparities in the U.S. exist even for groups stereotyped as “model minorities.” For example, research has shown high rates of substance use within subgroups of Asian Americans (Ahmmad & Adkins, 2020; Areba, Watts, Larson, Eisenberg, & Neumark-Sztainer, 2020), and as a group they are less likely to be diagnosed with an alcohol use disorder relative to White Americans with comparable symptoms (Cheng, Iwamoto, & McMullen, 2018).
In sum, although the exact nature of health disparities varies across cultural groups, developmental timing, and context, understanding the development of substance use related health disparities is of the utmost public health importance. This narrative review will not focus on the prevalence or theoretical explanations of these disparities but will instead summarize psychometric critiques of the inferential statistics typically used in this literature. Psychometric critiques of cross-cultural research hold that most studies fail to test the cultural equivalence of the instruments used to make inferences across cultural groups, which diminishes confidence in the validity of between group inferences. Although the term cultural equivalence can be broadly defined (Helms, 2015), this narrative review will focus on the psychometric equivalence of instruments used to make comparisons across cultural groups (Borsboom, 2006a). The psychometric perspective is useful because it provides a formal statistical framework with falsifiable procedures that can be used to ask whether we are likely to be measuring the same construct(s) across cultural groups (Davidov, Schmidt, Billiet, & Meuleman, 2018), and because it allows for statistical modeling of measurement bias within inferential tests (Hsiao & Lai, 2018). Prior to reviewing the evidence for psychometric critiques in substance use research and providing recommendations for future research, we briefly provide historical and theoretical context.
Historical Background
The relatively short history of scientific psychology has been characterized by fragmentation of research methods. For example, commenting on the history and course of the field, Lee Cronbach (1957; 1975) argued that scientific psychology was fragmented into the two subdisciplines of correlational psychology and experimental psychology. Correlational psychology emerged from research investigating individual and racial differences in intelligence (Helms, 2012), whereas experimental psychology focused on the manipulation of variables as the primary tool of scientific progress. Although half a century has passed since Cronbach, remnants of methodological fragmentation remain. For example, modern research using self-report tools almost always reports some statistical estimate of reliability of measurement in the sample being used to make inferences (such as Cronbach’s alpha), whereas research using experimental tasks is rarely held to the same measurement standards (Enkavi et al., 2019; Hedge, Powell, & Sumner, 2018; Lilienfeld & Treadway, 2016; Parsons, Kruijt, & Fox, 2019; Rouder & Haaf, 2019). The thesis of this review is thus that applications of these scientific traditions to cross-cultural research have not been sufficiently well positioned to ask if measurement scores used to compare ethnic/racial groups are equally reliable across the groups being compared. If instruments function differently across ethnic/racial groups, it is impossible to identify whether observed similarities and/or differences between groups are “real,” or a vestige of systematic measurement error. In the following sections we: a) review psychometric critiques of cross-cultural research; b) briefly summarize classical and modern measurement theories; c) review substance use research that has tested whether measurement instruments have comparable psychometric properties across ethnic/racial groups; and d) provide recommendations for future research.
Psychometric Critiques of Cross-cultural Research
Psychometric critiques of cross-cultural research have existed for decades, as can be seen by Janet Helms’ critique of the intelligence field (Helms [1992]. Why is there no study of cultural equivalence in standardized cognitive ability testing?), and her recapitulation of the limited attention that is paid to measurement issues in minority populations (Perry et al. [2008]. Why is there still no study of cultural equivalence in standardized cognitive ability tests?). Since then, numerous psychometric critiques of cross-cultural research have emerged, including in the substance use field (e.g., Burlew, Feaster, Brecht, & Hubbard, 2009; Eghaneyan et al., 2020; Lopez-Vergara et al., 2020), and in cross-cultural research more broadly (Borsboom, 2006b; Borsboom & Wijsen, 2017; Burgard & Chen, 2014; Burlew, Peteet, McCuistian, & Miller-Roenigk, 2019; Byrne & Campbell, 1999; Dong & Dumas, 2020; Gregorich, 2006; Ramirez, Ford, Stewart, & Teresi, 2005; Schmidt, Heffernan, & Ward, 2020; Spector, Liu, & Sanchez, 2015; Stevanovic et al., 2017; Stewart & Napoles-Springer, 2003). These psychometric critiques have documented that research frequently fails to test whether assessment instruments measure the same underlying construct(s) across ethnic/racial groups. As these psychometric critiques document, most cross-cultural research focuses on analyzing observed mean differences between ethnic/racial groups (i.e., has relied on Classical Test Theory). However, observed group differences may be biased if the psychometric properties of the instruments are not equivalent across the groups being compared (Davidov, Meuleman, Cieciuch, Schmidt, & Billiet, 2014). Simulation studies have shown that unequal instrument functioning across groups can compromise the validity of inferences made between such groups (Beuckelaer & Swinnen, 2018). For example, if a measure is more reliable in one group than in another, the accuracy of parameter estimates will differ across groups.
Hence, the point of psychometric critiques is that without testing for equivalent instrument functioning, it is not possible to know if observed similarities or differences between ethnic/racial groups are valid, or if they reflect measurement artifacts related to differential instrument functioning, or some degree of both (Lee, Little, & Preacher, 2018).
Classical Test Theory & Modern Measurement Theories
Classical Test Theory states that any quantitative assessment is the function of two things: 1) measurement error, and 2) the “true score” of what we are trying to measure. The innovative feature of Classical Test Theory (e.g., Lord & Novick, 1968; Thurstone, 1932) was the explicit recognition of the presence of measurement error in any scientific observation. Although the practice of obtaining an arithmetic mean of various observations as a proxy for “the truth” can be traced back to the 18th century (Traub, 1997), formalizing the idea that what we scientifically measure is not a perfect representation of “nature” led to substantial technological advances. In other words, theoretical recognition that our parameter estimates are not the actual parameters themselves led to statistical advances that tried to increase the precision of our inferences. For example, early in the 20th century, Charles Spearman demonstrated that measurement error can lead to erroneous conclusions, and provided a statistical method for correcting for such error by taking the arithmetic mean of various observations (i.e., the principle of aggregation) to better approximate nature (Spearman, 1904; Spearman, 1907; Spearman, 1910). To summarize, theoretical recognition of measurement error led to statistical innovations regarding ways of increasing the accuracy of measurement.
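Spearman's correction for attenuation can be written in a few lines. The following is a minimal sketch; the function name and the example reliability values are ours, chosen for illustration rather than taken from the cited papers:

```python
import math

def disattenuate(r_observed: float, rel_x: float, rel_y: float) -> float:
    """Spearman's correction for attenuation: estimate the construct-level
    correlation by dividing the observed correlation by the square root
    of the product of the two measures' reliabilities."""
    return r_observed / math.sqrt(rel_x * rel_y)

# An observed r of .30 between two measures with reliabilities of .60 and .50
# implies a construct-level correlation of about .55.
r_true = disattenuate(0.30, 0.60, 0.50)
```

Note that the correction presumes the error being "filtered out" is random; systematic error is left untouched, which is the crux of the critiques discussed next.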
The utilization of Classical Test Theory in cross-cultural research warrants further “unpacking” of the statistical assumptions being made when comparing means and covariances across ethnic/racial groups. The valid application of Classical Test Theory to cross-group comparisons rests on the assumption that measures do not differ in reliability across the groups being compared. Although the principle of aggregation can “filter out” random measurement error, it is not a useful method for protecting against systematic measurement error (Borsboom, 2005). Furthermore, Classical Test Theory has no formalized procedures for testing whether measurement error is random or systematic. Hence, the assumption that aggregation will “filter out” error equally across the groups being compared is not testable within Classical Test Theory (Markus & Borsboom, 2013), and this untestable assumption is the foundation of psychometric critiques of cross-cultural research.
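The asymmetry between random and systematic error can be demonstrated with a short simulation. This is an illustrative sketch with arbitrary values (the "true score" and bias constants are ours), not an analysis of real data:

```python
import random
import statistics

random.seed(1)
TRUE_SCORE = 10.0
BIAS = 1.5  # a constant, systematic distortion (e.g., an item that
            # reads differently in one group)

# Random error: zero-mean noise, which aggregation "filters out."
random_only = [TRUE_SCORE + random.gauss(0, 2) for _ in range(10_000)]

# Systematic error: the same noise plus a constant bias, which aggregation
# cannot remove no matter how many observations are averaged.
systematic = [TRUE_SCORE + BIAS + random.gauss(0, 2) for _ in range(10_000)]

mean_random = statistics.fmean(random_only)      # converges toward 10.0
mean_systematic = statistics.fmean(systematic)   # converges toward 11.5
```

The aggregate of the biased observations converges on the wrong value, and nothing within Classical Test Theory flags that this has happened.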
Over the past half century, various model-based measurement strategies have been developed (“modern measurement theories”). The common feature of model-based approaches is that their statistical equations include not only parameter estimates to represent the person, but also include estimates to represent the psychometric properties of the items (Embretson, 2010). Hence, modern measurement theories can explicitly ask about the cultural equivalence of measurement from a falsifiable psychometric framework (Boer, Hanke, & He, 2018; Han, Colarelli, & Weed, 2019; Putnick & Bornstein, 2016; Raju, Laffitte, & Byrne, 2002; Sass, 2011; Schmitt & Kuljanin, 2008). Before we elaborate on how model-based approaches test for the cultural equivalence of measurement (see Recommendations for Future Research section), we will discuss the findings of substance use research that has explicitly asked if we are likely to be measuring the same constructs across ethnic/racial groups.
Research Testing the Equivalence of Measurement Across Ethnicity/Race
Although the majority of cross-cultural substance use research has relied on Classical Test Theory (Eghaneyan et al., 2020; Montgomery, Burlew, Haeny, & Jones, 2020), there have been applications of modern measurement theories to detect measurement bias in cross-cultural research. Such studies have asked whether research instruments have equivalent psychometric properties across ethnic/racial groups. Differences in the psychometric properties of an instrument across groups suggest that there are systematic sources of bias in measurement across the groups being compared. Evidence for systematic bias in measurement violates a key assumption for the valid application of Classical Test Theory to cross-group comparisons (the assumption of equivalent instrument functioning across the groups being compared).
Systematic bias in measurement across ethnicity/race has been found in measures of alcohol use (Fish, Pollitt, Schulenberg, & Russell, 2018; Lopez-Vergara et al., 2020), alcohol outcome expectancies (Ham, Wang, Kim, & Zamboanga, 2013; McCarthy, Pedersen, & D’Amico, 2009), alcohol dependence (Carle, 2008; Carle, 2009), cocaine dependence (Wu et al., 2010), nicotine dependence (Lopez-Vergara et al., 2020; Rose, Dierker, Selya, & Smith, 2018), and cannabis involvement (Miller et al., 2019). Systematic bias in measurement across ethnicity/race has also been found in constructs that frequently overlap with substance use, such as socioeconomic status (Lopez-Vergara et al., 2020), discrimination (Bastos & Harnois, 2020; Harnois, Bastos, Campbell, & Keith, 2019; Lewis, Yang, Jacobs, & Fitchett, 2012; Lopez-Vergara et al., 2020; Reeve et al., 2011; Sladek, Umana-Taylor, McDermott, Rivas-Drake, & Martinez-Fuentes, 2020), intelligence (Wicherts, 2016), personality (Dong & Dumas, 2020), social support (Sacco, Casado, & Unick, 2011), and depression (Breslau, Javaras, Blacker, Murphy, & Normand, 2008; Crockett, Randall, Shen, Russell, & Driscoll, 2005), as well as neuropsychological tasks (Avila et al., 2020) and college admission tests (Santelices & Wilson, 2010). Hence, there is substantial empirical evidence supporting psychometric critiques of cross-cultural research.
It is important to note that some studies have examined psychometric differences in substance use instruments across cultural groups and found no evidence of bias in measurement (e.g., Bravo et al., 2019; Feldstein Ewing, Montanaro, Gaume, Caetano, & Bryan, 2015; Spillane & Smith, 2010). However, psychometric properties are sample dependent (Markus & Borsboom, 2013; Revelle & Condon, 2019). The sample dependence of psychometric properties is why contemporary research using self-report instruments reports a statistical estimate of reliability (such as Cronbach’s alpha) in every sample used to make inferences, even when such measures have produced acceptable estimates of reliability in previous samples. Accordingly, consistent with the studies that have empirically examined measurement invariance, it is important to test for measurement invariance in the specific samples and contexts under examination. Testing the assumption of the cross-cultural equivalence of measures is feasible using modern measurement theories, as we summarize in the next section.
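Because reliability estimates are sample dependent, statistics such as Cronbach's alpha must be recomputed in each sample. A minimal implementation (with made-up item scores, chosen only for illustration) shows how the same three-item scale can look reliable in one sample and unreliable in another:

```python
import statistics

def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha from item-level data: 'items' holds one list of
    scores per item, with respondents in the same order in every list."""
    k = len(items)
    sum_item_vars = sum(statistics.pvariance(col) for col in items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent sums
    return (k / (k - 1)) * (1 - sum_item_vars / statistics.pvariance(totals))

# Hypothetical scores on the same 3-item scale in two small samples.
sample_1 = [[4, 5, 3, 4, 5], [4, 4, 3, 5, 5], [5, 5, 3, 4, 4]]
sample_2 = [[1, 2, 3, 4, 5], [3, 1, 4, 2, 5], [2, 4, 1, 5, 3]]
alpha_1 = cronbach_alpha(sample_1)  # about .77
alpha_2 = cronbach_alpha(sample_2)  # about .18
```

The point is not the specific values but that a reliability estimate established in one sample carries no guarantee for the next, which is exactly why invariance must be tested in the groups actually being compared.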
Recommendations for Future Research
Various forms of model-based measurement have been developed, each with its own nomenclature (e.g., Embretson, 2010; Raju et al., 2002). One of the most researched and accessible frameworks comes from structural equation modeling (SEM). In SEM, the statistical overlap among multiple observed (also called “manifest”) indicators (e.g., items in a questionnaire, or performance on various trials of an experimental task) is predicted by a latent variable (i.e., a factor). In other words, the observed indicators are statistically regressed on a latent variable. Hence, variance on an indicator that is accounted for by the latent variable is conceptually analogous to what Classical Test Theory calls “true score” variance, whereas variance on the indicator that is not predicted by the latent variable is conceptually analogous to what Classical Test Theory calls “measurement error.” Because this process relies on statistical regression, two types of parameters are needed to regress the indicators on the latent variable: item slopes (so-called “factor loadings”) and item intercepts. Making meaningful comparisons between groups from observed (manifest) data requires that factor loadings and item intercepts be equivalent across the groups being compared (Millsap, 2012). In the SEM framework, variability in the latent variable can be used to quantify between group differences (i.e., what we are usually interested in asking), whereas between-group discrepancy in the relationship between the latent variable and its indicators can be used to ask about the cultural equivalence of measurement.
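The regression logic can be made concrete: for a standardized indicator regressed on a standardized latent factor, the squared factor loading gives the share of indicator variance that is "true score" variance. The small helper below is ours, written only to make the decomposition explicit:

```python
def variance_decomposition(loading: float) -> dict[str, float]:
    """For x = intercept + loading * factor + error, with both x and the
    factor standardized, loading**2 is the indicator variance explained by
    the latent variable (the SEM analogue of CTT 'true score' variance),
    and 1 - loading**2 is the analogue of 'measurement error.'"""
    true_score = loading ** 2
    return {"true_score": true_score, "error": 1.0 - true_score}

# A loading of .70 means about half the indicator's variance reflects
# the latent construct; the remainder is treated as measurement error.
decomp = variance_decomposition(0.70)
```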
In the SEM framework of testing for measurement invariance, a prerequisite for testing the equivalence of factor loadings and item intercepts is to establish configural invariance. Establishing configural invariance means that groups have equivalent latent “structures,” or in other words that the groups have the same number of factors and the same pattern of factor loadings (i.e., which items load on which factor). Configural invariance across groups implies that the same latent construct is being measured across the groups being compared. If the assumption of configural invariance is not satisfied, it is not statistically prudent to compare the groups. Research that cannot establish configural invariance should not be considered a “failure,” and should still be published (Han et al., 2019). Knowledge of when configural invariance does not hold can inform the field regarding the functioning of measurement instruments and/or advance our understanding of the manifestation of constructs (e.g., are the phenotypic features that make up “addiction” the same across cultural groups?).
After establishing configural invariance, between group equality in the magnitude of factor loadings (i.e., how much indicator variance is “captured” by the latent variable) can be tested. Testing the equality of factor loadings is known as metric invariance, and it is used to test for group equivalence in the underlying unit of measurement. Hence, metric invariance is a necessary condition for validly comparing covariances (e.g., correlations, regression coefficients) across groups (Lee et al., 2018). Group differences in factor loadings represent differences in how “good” an item is at indexing the latent construct (because factor loadings are conceptually analogous to the “true score” variance component of Classical Test Theory).
Equality of item intercepts, in addition to factor loadings, is known as scalar invariance, which is a necessary condition for validly comparing means across groups. Scalar invariance indicates that groups do not differ in item intercepts after accounting for the influence of the latent factor. Lack of scalar invariance means that individuals who have the same value on the latent construct do not have the same expected value on the observed (manifest) items. Lack of scalar invariance will bias mean level comparisons across groups, and hence testing for scalar invariance is important in preventing erroneous conclusions (Chen, 2008; Wicherts & Dolan, 2010).
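A short simulation shows why scalar noninvariance matters: two groups with identical latent means can display an observed mean "difference" that is entirely an artifact of unequal item intercepts. All values below are arbitrary and illustrative:

```python
import random
import statistics

random.seed(2)
LOADING = 0.8  # same factor loading in both groups (metric invariance holds)

def simulate_item(n: int, intercept: float) -> list[float]:
    """Observed item scores x = intercept + loading * factor + error,
    for a group whose latent factor is standard normal (latent mean = 0)."""
    return [intercept + LOADING * random.gauss(0, 1) + random.gauss(0, 0.5)
            for _ in range(n)]

# Same latent mean (0) in both groups, but group B's item intercept is
# shifted by 0.6 -- e.g., the item is interpreted differently there.
group_a = simulate_item(5_000, intercept=2.0)
group_b = simulate_item(5_000, intercept=2.6)

# The observed gap tracks the intercept difference, not any latent difference.
observed_gap = statistics.fmean(group_b) - statistics.fmean(group_a)
```

A researcher comparing raw means here would "find" a group difference of about 0.6 even though the groups do not differ on the construct at all.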
It is important to note that measurement invariance is not an “all or none” statistical outcome. When modern measurement theories detect significant amounts of measurement bias across groups (i.e., statistically significant between group differences in factor loadings and/or item intercepts), there are procedures to statistically model such bias to attempt to make more accurate parameter estimates. Modeling the degree of measurement bias (i.e., the magnitude of the group differences in factor loadings and/or intercepts) is known as partial measurement invariance (Byrne, Shavelson, & Muthen, 1989; Gunn, Grimm, & Edwards, 2020; Lai, Richardson, & Mak, 2019). Establishing partial measurement invariance allows for between group inferences (at the latent level), and is conceptually akin to “accounting” for the amount of systematic measurement error. However, model-based approaches to measurement are not a “magic pill” to fix differential instrument functioning, and the boundaries of how much bias in measurement can be “accounted” for have not been fully “mapped.” In the next paragraphs, we discuss an empirical approach for testing for measurement invariance, and for testing partial measurement invariance. The benefit of an empirical approach (e.g., Lee et al., 2018) is that the questions of “are we measuring the same construct?” and “did we account for bias in measurement?” are falsifiable.
Figure 1 shows a visual depiction of the steps to test for measurement invariance, as well as to test for partial measurement invariance. Figure 1 has two interrelated “cascades” of steps, with steps for invariance testing on the left and steps for partial invariance testing on the right. The cascade of steps is dependent on previous steps, for example, if configural invariance is not established then metric invariance cannot be tested, and if metric invariance (or partial metric invariance) is not supported then scalar invariance cannot be tested. The cascade of steps is also recursive (if “no” then go back to a previous step), and each recursive step is conducted by “freeing” one factor loading or item intercept at a time. Syntax for conducting these steps in the statistical software Mplus is provided in Tables 1–3, though there are other approaches to writing syntax within this software (see Mplus User’s Manual), as well as across software programs (for syntax using the open access software R see Maier, 2018).
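The cascade of steps can also be sketched as a small decision function mapping the highest invariance level established (full or partial) to the between-group inferences it licenses. The level labels below are our own shorthand, not terms from Mplus or any other software:

```python
def permitted_inferences(level: str) -> set[str]:
    """Map the highest invariance level established to the between-group
    comparisons it supports, per the configural -> metric -> scalar cascade."""
    if level == "none":            # configural invariance failed:
        return set()               # the groups should not be compared
    if level == "configural":      # same structure, but loadings unequal:
        return set()               # neither means nor covariances comparable
    if level in ("metric", "partial_metric"):
        return {"covariances"}     # correlations/regressions comparable
    if level in ("scalar", "partial_scalar"):
        return {"covariances", "means"}  # latent means also comparable
    raise ValueError(f"unknown invariance level: {level!r}")
```

For example, `permitted_inferences("partial_metric")` licenses covariance comparisons only, matching the rule that scalar (or partial scalar) invariance is required before comparing means.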
Figure 1 –
A psychometric framework for asking about the cultural equivalence of measurement
Note: Detailed discussion of these steps is provided in the text below.
Table 1 –
Example of Mplus syntax for configural invariance (Step 1)
Syntax | Meaning of syntax |
---|---|
GROUPING IS var (1=g1 2=g2); | This command tells Mplus which groups you are going to contrast. “var” is the grouping variable, and “g1,” “g2” are group names. |
MODEL: | The first MODEL command corresponds to the first group under the “GROUPING IS” command (i.e., group “g1”). |
Factor BY var1* (L1) var2* (L2) var3* (L3); factor@1; | Tells Mplus to make a factor called “factor” via indicators var1-var3, which are freely estimated (“*”) and given unique parameter names L1-L3. Sets the factor variance to 1. |
[var1*] (i1); [var2*] (i2); [var3*] (i3); | Variables inside brackets represent item intercepts, freely estimated (“*”) and given unique parameter names i1-i3. |
MODEL G2: | The second MODEL command must specify which group it applies to (i.e., group “g2”). |
Factor BY var1* var2* var3*; factor@1; | Makes a factor via indicators var1-var3, which are freely estimated and NOT given the parameter names used in the other group. |
[var1*]; [var2*]; [var3*]; | Intercepts freely estimated, and NOT given the parameter names used in the other group. |
Note: Mplus is not case sensitive; familiarity with Mplus is needed as not all steps (e.g., getting data “into” Mplus) are depicted.
Table 3 –
Example of Mplus syntax for scalar invariance
Scalar invariance (Step 3) | | Partial scalar invariance (Steps 3a-b) | |
---|---|---|---|
Syntax | Meaning of syntax | Syntax | Meaning of syntax |
GROUPING IS var (1=g1 2=g2); | This command tells Mplus which groups you are going to contrast. | GROUPING IS var (1=g1 2=g2); | This command tells Mplus which groups you are going to contrast. |
MODEL: | The first MODEL command corresponds to the first group under the “GROUPING IS” command (i.e., group “g1”). | MODEL: | The first MODEL command corresponds to the first group under the “GROUPING IS” command (i.e., group “g1”). |
Factor BY var1* (L1) var2* (L2) var3* (L3); factor@1; | Tells Mplus to make a factor called “factor” via indicators var1-var3, which are freely estimated (“*”) and given unique parameter names L1-L3. Sets the factor variance to 1. | Factor BY var1* (L1) var2* (L2) var3* (L3); factor@1; | Tells Mplus to make a factor called “factor” via indicators var1-var3, which are freely estimated (“*”) and given unique parameter names L1-L3. Sets the factor variance to 1. |
[var1*] (i1); [var2*] (i2); [var3*] (i3); | Variables inside brackets represent item intercepts, freely estimated (“*”) and given unique parameter names i1-i3. | [var1*]; [var2*] (i2); [var3*] (i3); | Item intercepts given unique parameter names i2-i3; the intercept being “released” to differ across groups (var1) is not given a parameter name. |
MODEL G2: | The second MODEL command must specify which group it applies to (group “g2”). | MODEL G2: | The second MODEL command must specify which group it applies to (group “g2”). |
Factor BY var1* (L1) var2* (L2) var3* (L3); factor@1; | Makes a factor via indicators var1-var3, freely estimated and given the same parameter names as the other group (holding loadings equal). | Factor BY var1* (L1) var2* (L2) var3* (L3); factor@1; | Makes a factor via indicators var1-var3, freely estimated and given the same parameter names as the other group (holding loadings equal). |
[var1*] (i1); [var2*] (i2); [var3*] (i3); | Intercepts freely estimated and given the same parameter names as the other group (holding intercepts equal). | [var1*]; [var2*] (i2); [var3*] (i3); | Intercepts i2 and i3 constrained equal across groups; the intercept for var1 is “released” to differ across groups. |
Note: Mplus is not case sensitive; familiarity with Mplus is needed as not all steps (e.g., getting data “into” Mplus) are depicted.
As shown in Figure 1, establishment of configural invariance is the first step. Table 1 provides syntax for conducting this step, specifically by estimating a multi-group confirmatory factor model that allows factor loadings and item intercepts to differ across the groups being compared. After Step 1, equality of factor loadings should be tested (i.e., metric invariance; Step 2). The left two columns in Table 2 provide syntax to test for metric invariance, specifically by forcing factor loadings to be equal across the groups being compared (while allowing item intercepts to differ). If metric invariance is not established (i.e., if the magnitude of factor loadings differs across groups), partial metric invariance can be tested by freeing one factor loading at a time to identify which factor loading(s) are statistically significantly different across groups, and “freeing” such factor loading(s) to differ across groups (Step 2a). The right two columns in Table 2 provide syntax to test for partial metric invariance: the factor loadings for “var1” are not given the same names across the different groups’ “MODEL” commands, which results in Mplus estimating unique factor loading values for “var1” in each group. If this partial metric invariance model fits the data as well as the configural invariance model (Step 2b), then scalar invariance can be tested (Step 3); but if the partial metric invariance model fits worse than the configural invariance model, it is necessary to go back to Step 2a and free additional factor loadings until all items with differential factor loadings across groups are identified.
If one exhausts all the recursive possibilities between Steps 2a and 2b without reaching a model that fits the data as well as the configural invariance model, then partial metric invariance could not be established (i.e., the degree of measurement bias could not be adequately “accounted” for). If partial metric invariance cannot be established, between group inferences about means and covariances should not be made.
Table 2 –
Example of Mplus syntax for metric invariance
Metric invariance (Step 2) | | Partial metric invariance (Steps 2a-b) | |
---|---|---|---|
Syntax | Meaning of syntax | Syntax | Meaning of syntax |
GROUPING IS var (1=g1 2=g2); | This command tells Mplus which groups you are going to contrast. | GROUPING IS var (1=g1 2=g2); | This command tells Mplus which groups you are going to contrast. |
MODEL: | The first MODEL command corresponds to the first group under the “GROUPING IS” command (i.e., group “g1”). | MODEL: | The first MODEL command corresponds to the first group under the “GROUPING IS” command (i.e., group “g1”). |
Factor BY var1* (L1) var2* (L2) var3* (L3); factor@1; | Tells Mplus to make a factor called “factor” via indicators var1-var3, which are freely estimated (“*”) and given unique parameter names L1-L3. Sets the factor variance to 1. | Factor BY var1* var2* (L2) var3* (L3); factor@1; | Indicators given unique parameter names L2-L3; the loading being “released” to differ across groups (var1) is not given a parameter name. |
[var1*] (i1); [var2*] (i2); [var3*] (i3); | Variables inside brackets represent item intercepts, freely estimated (“*”) and given unique parameter names i1-i3. | [var1*] (i1); [var2*] (i2); [var3*] (i3); | Variables inside brackets represent item intercepts, freely estimated (“*”) and given unique parameter names i1-i3. |
MODEL G2: | The second MODEL command must specify which group it applies to (group “g2”). | MODEL G2: | The second MODEL command must specify which group it applies to (group “g2”). |
Factor BY var1* (L1) var2* (L2) var3* (L3); factor@1; | Makes a factor via indicators var1-var3, freely estimated and given the same parameter names as the other group (holding loadings equal). | Factor BY var1* var2* (L2) var3* (L3); factor@1; | Loadings L2-L3 constrained equal across groups; the loading for var1 is “released” to differ across groups. |
[var1*]; [var2*]; [var3*]; | Intercepts freely estimated, and NOT given the parameter names used in the other group. | [var1*]; [var2*]; [var3*]; | Intercepts freely estimated, and NOT given the parameter names used in the other group. |
Note: Mplus is not case sensitive; familiarity with Mplus is needed as not all steps (e.g., getting data “into” Mplus) are depicted.
Syntax for testing for scalar invariance is provided in the left two columns of Table 3, specifically by estimating a multi-group confirmatory factor model that forces factor loadings and item intercepts to be equal (i.e., invariant) across the groups being compared. If scalar invariance is established (Step 3), then between group inferences on both means and covariances can be made (Step 4). However, if scalar invariance is not established (if item intercepts differ by group), partial scalar invariance can be tested by freeing one item intercept at a time to identify which intercept(s) significantly differ across groups (Step 3a). Syntax for testing for partial scalar invariance is provided in the right two columns of Table 3: the item intercept for “var1” is not given the same name across the different groups’ “MODEL” commands, which results in Mplus estimating unique item intercept values for “var1” in each group. If this partial scalar invariance model fits the data as well as the metric invariance (or partial metric invariance) model (Step 3b), then between group inferences on both factor means and covariances can be made (Step 4). If the partial scalar invariance model fits the data worse than the metric or partial metric invariance model, it is necessary to go back to Step 3a and free additional item intercepts until all items with differential intercepts across groups are identified. If all of the recursive steps between Steps 3a and 3b have been exhausted without an adequately fitting model, then partial scalar invariance could not be established (i.e., the degree of measurement bias could not be adequately “accounted” for), and between group inferences on means should not be made (though between group inferences on covariances can still be made).
For illustrative purposes, consider Lopez-Vergara et al. (2020), who followed the procedures in Figure 1 to test the measurement invariance of three indicators of smoking in a large sample (n=2,376) of middle-aged daily smokers balanced across the three largest ethnic/racial groups in the U.S. (Black, Latinx, White). A confirmatory factor model was specified with three indicators: average frequency of smoking, quantity of smoking, and time to first cigarette after waking. Configural invariance (Step 1) was established, but the test of metric invariance (Step 2) led to a decrease in model fit, prompting exploration of which factor loadings differed significantly across Black, Latinx, and White smokers. Because chi-square tests are overly sensitive in large samples, decrements in CFI greater than .01 were used to decide what constitutes a “significant enough” drop in model fit (Cheung & Rensvold, 2002; Meade, Johnson, & Braddy, 2008). Although focusing on CFI is a defensible strategy, it is important to note that there is no “gold standard” indicator of model fit. Modern conventions emphasize reporting model fit (and change in model fit) via various fit indices (e.g., CFI, RMSEA, and SRMR; see Putnick & Bornstein, 2016), as was done in Lopez-Vergara et al. (2020). Time to first cigarette was found to be the source of model misfit, and a partial measurement invariance model (Steps 2a-b) was found to fit the data as well as the comparison configural invariance model. Modeling the source of misfit indicated that time to first cigarette was a less reliable indicator for Black and Latinx smokers; in other words, the item functioned substantially better for White smokers.
This is analogous to the “true score variance” component of Classical Test Theory: variance in the indicator time to first cigarette consisted of 35% “true score variance” for Black smokers (standardized factor loading = .59, p<.001), 27% “true score variance” for Latinx smokers (standardized factor loading = .52, p<.001), and 61% “true score variance” for White smokers (standardized factor loading = .78, p<.001). We refer the reader to Lopez-Vergara et al. (2020) for further details. Importantly, conducting inferential analyses using Classical Test Theory (i.e., assuming equal instrument functioning) led to different associations relative to conducting inferential analyses after “accounting” for systematic bias in measurement (i.e., after accounting for the differential reliability of indicators across ethnicity/race).
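The “true score variance” percentages above follow directly from the reported standardized loadings: under a single-factor standardized solution, the squared loading is the proportion of indicator variance explained by the factor. A quick check of the arithmetic:

```python
# Standardized loadings for time to first cigarette reported in
# Lopez-Vergara et al. (2020); squaring each gives the proportion of
# "true score variance" (i.e., the indicator's communality).
loadings = {"Black": 0.59, "Latinx": 0.52, "White": 0.78}

for group, loading in loadings.items():
    true_score_pct = round(loading ** 2 * 100)
    print(f"{group}: {true_score_pct}% true score variance")
# Black: 35%, Latinx: 27%, White: 61%
```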
Applied Implications
Although empirically testing the cultural equivalence of measurement can increase the demands placed on researchers, testing for measurement invariance is important because bias in measurement may reflect systemic issues in science. For example, psychological science has been conducted primarily on White samples (Brady, Fryberg, & Shoda, 2018; Sue, 1999). Findings from primarily White samples are sometimes tested in ethnic/racial minority populations, usually without testing the cultural equivalence of measurement (Helms, 2015). Applying measures developed in one population to another can result in biased instrument functioning for many reasons, such as groups having distinct experiences to use as a reference when responding, cultural differences in what are deemed socially desirable responses, or distinct manifestations of constructs across cultural groups (Pendergast, von der Embse, Kilgus, & Eklund, 2017). Hence, testing for measurement invariance in the sample being used to make inferences across cultural groups should be a prerequisite when testing hypotheses that have substantial public health or “real world” implications (Green, Chen, Helms, & Henze, 2011; Schmidt et al., 2020). Below we discuss some ways the field can increase the utilization of cultural equivalence testing when researching important hypotheses.
For researchers familiar with SEM, the procedures and syntax we provide in the Figure and Tables may suffice to test for measurement invariance. Other resources for learning these analyses include previous reviews (e.g., Putnick & Bornstein, 2016), book chapters (e.g., Lee et al., 2018), and comprehensive introductory SEM textbooks (e.g., Little, 2013). Various three- to five-day comprehensive seminars on psychometrics or SEM are commercially available and provide step-by-step guidance on testing for measurement invariance (e.g., https://www.statscamp.org; https://statisticalhorizons.com). Finally, graduate and post-doctoral training programs focusing on health disparities and substance use may better position their students by providing advanced quantitative training.
Limitations
There are various limitations to this review, including that not all aspects of measurement invariance were discussed. For example, although it is possible to test for group invariance of residual variances (i.e., “strict measurement invariance”), this level of invariance testing was not discussed because psychometric critiques tend not to emphasize equality of residual variances (it is not necessary for making valid between-group inferences on means and covariances). Similarly, we did not discuss the procedural adaptations needed when using dichotomous indicators (Muthen & Asparouhov, 2002), or what to do when there are many groups (Muthen & Asparouhov, 2018). It is also important to note that SEM is not the only framework that can be used to test for bias in measurement; measurement invariance can also be tested via differential item functioning (DIF) analysis in the item response theory (IRT) framework (Raju et al., 2002; Tay, Meade, & Cao, 2015). Finally, we acknowledge that ethnic/racial categories are extremely heterogeneous (Helms, Jernigan, & Mascher, 2005). Psychometric critiques encourage research to “unpack” the psychometric properties of measurement across the groups being compared in order to make more meaningful comparisons on key constructs of interest. However, we do not intend to communicate that all cross-cultural research should test for measurement invariance (especially when it is infeasible to obtain relatively large samples). Rather, we mean that before any finding can be considered “well established,” it may be necessary to explicitly ask about the cultural equivalence of measurement.
Conclusions
In sum, a variety of psychometric critiques of cross-cultural research have emerged over the past three decades. These critiques converge in suggesting that most research fails to statistically test whether our research instruments measure the same constructs across ethnicity/race. A growing body of research indicates that many of our instruments can produce biased measurement across ethnicity/race. Without explicitly testing the cultural equivalence of measurement, it is impossible to determine the validity of observed group differences/similarities in substance use mechanisms and outcomes (Helms, 2015). Testing the assumption of equivalent instrument functioning is now feasible and may protect against systemic sources of bias that could impede progress in our understanding of the development of substance use across cultural groups. In a society with a robust history of marginalization of entire communities (e.g., Dunbar-Ortiz, 2014; Ortiz, 2018), it may be necessary for research on such communities to use statistical frameworks that can ask about, and account for, bias in measurement. We advocate that this research be conducted within statistical frameworks that are falsifiable, as opposed to simply assuming that our instruments measure the same construct(s) across cultural groups.
Public Significance Statement.
Statistical critiques question the comparability of measurement in research making group comparisons across ethnic/racial groups. Testing the comparability of measurement across ethnic/racial groups is statistically feasible yet infrequently done. Future research may benefit from statistically testing the comparability of measurement when making comparisons across ethnic/racial groups.
Acknowledgements:
Preparation of this manuscript was supported by NIAAA grants K08AA024794 (HLV) and K24AA026876 (SFE), NIDA grant K23DA039327 (NHW), and by U54GM115677 and P20GM125507 from the NIGMS, which fund Advance Clinical and Translational Research (Advance-CTR) (HLV) and the Center of Biomedical Research Excellence (COBRE) on Opioids and Overdose (NHW), respectively.
Footnotes
Disclosures: None of the authors have a conflict of interest to declare.
References
- Ahmmad Z, & Adkins DE (2020). Ethnicity and acculturation: Asian American substance use from early adolescence to mature adulthood. Journal of Ethnic & Migration Studies, 1–27.
- Areba EM, Watts AW, Larson N, Eisenberg ME, & Neumark-Sztainer D (2020). Acculturation and ethnic group differences in well-being among Somali, Latino, and Hmong adolescents. American Journal of Orthopsychiatry, advance online publication.
- Avila JF, Renteria MA, Witkiewitz K, Verney SP, Vonk JMJ, & Manly JJ (2020). Measurement invariance of neuropsychological measures of cognitive aging across race/ethnicity by sex/gender groups. Neuropsychology, 34, 3–14.
- Banks DE, & Zapolski TCB (2018). The crossover effect: A review of racial/ethnic variations in risk for substance use and substance use disorder across development. Current Addiction Reports, 5, 386–395.
- Bastos JL, & Harnois CE (2020). Does the Everyday Discrimination Scale generate meaningful cross-group estimates? A psychometric evaluation. Social Science & Medicine, 265, 113321.
- Beuckelaer A, & Swinnen G (2018). Biased latent variable mean comparisons due to measurement non-invariance: A simulation study. In Davidov E, Schmidt P, Billiet J, & Meuleman B (Eds.), Cross-cultural analysis: Methods and applications (2nd ed., pp. 127–156). New York, NY: Routledge.
- Boer D, Hanke K, & He J (2018). On detecting systematic measurement error in cross-cultural research: A review and critical reflection on equivalence and invariance tests. Journal of Cross-Cultural Psychology, 49, 713–734.
- Borsboom D (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge University Press.
- Borsboom D (2006a). When does measurement invariance matter? Measurement in a Multi-Ethnic Society, 44, s176–s181.
- Borsboom D (2006b). The attack of the psychometricians. Psychometrika, 71, 435–440.
- Borsboom D, & Wijsen LD (2017). Psychology’s atomic bomb. Assessment in Education: Principles, Policy & Practice, 24, 440–446.
- Brady LM, Fryberg SA, & Shoda Y (2018). Expanding the interpretive power of psychological science by attending to culture. Proceedings of the National Academy of Sciences, 115, 11406–11413.
- Bravo AJ, Pilatti A, Pearson MR, Read JP, Mezquita L, Ibanez MI, & Ortet G (2019). Cross-cultural examination of negative alcohol-related consequences: Measurement invariance of the Young Adult Alcohol Consequences Questionnaire in Spain, Argentina, and USA. Psychological Assessment, 31, 631.
- Breslau J, Javaras KN, Blacker D, Murphy JM, & Normand SLT (2008). Differential item functioning between ethnic groups in the epidemiological assessment of depression. The Journal of Nervous and Mental Disease, 196, 297–306.
- Burgard SA, & Chen PV (2014). Challenges of health measurement in studies of health disparities. Social Science & Medicine, 106, 143–150.
- Burlew AK, Feaster D, Brecht ML, & Hubbard R (2009). Measurement and data analysis in research addressing health disparities in substance abuse. Journal of Substance Abuse Treatment, 36, 25–43.
- Burlew AK, Peteet BJ, McCuistian C, & Miller-Roenigk BD (2019). Best practices for researching diverse groups. American Journal of Orthopsychiatry, 89, 354–368.
- Byrne BM, & Campbell TL (1999). Cross-cultural comparisons and the presumption of equivalent measurement and theoretical structure: A look beneath the surface. Journal of Cross-Cultural Psychology, 30, 555–574.
- Byrne BM, Shavelson RJ, & Muthen B (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456-
- Carle AC (2008). Cross-cultural validity of alcohol dependence across Hispanics and non-Hispanic Caucasians. Hispanic Journal of Behavioral Sciences, 30, 106–120.
- Carle AC (2009). Cross-cultural invalidity of alcohol dependence measurement across Hispanics and Caucasians in 2001 and 2002. Addictive Behaviors, 34, 43–50.
- Chartier K, & Caetano R (2010). Ethnicity and health disparities in alcohol research. Alcohol Research & Health, 33, 152-
- Chartier KG, Scott DM, Wall TL, Covault J, Karriker-Jaffe KJ, Mills BA, Luczak SE, Caetano R, & Arroyo JA (2014). Framing ethnic variations in alcohol outcomes from biological pathways to neighborhood context. Alcoholism: Clinical & Experimental Research, 38, 611–618.
- Chen FF (2008). What happens if we compare chopsticks with forks? The impact of making inappropriate comparisons in cross-cultural research. Journal of Personality and Social Psychology, 95, 1005–1018.
- Cheng AW, Iwamoto DK, & McMullen D (2018). Model minority stereotype and the diagnosis of alcohol use disorders: Implications for practitioners working with Asian Americans. Journal of Ethnicity in Substance Abuse, 17, 255–272.
- Cheung GW, & Rensvold RB (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255.
- Crockett LJ, Randall BA, Shen YL, Russell ST, & Driscoll AK (2005). Measurement equivalence of the Center for Epidemiological Studies Depression Scale for Latino and Anglo adolescents: A national study. Journal of Consulting and Clinical Psychology, 73, 47–58.
- Cronbach LJ (1957). The two disciplines of scientific psychology. American Psychologist, 12, 671–684.
- Cronbach LJ (1975). Beyond the two disciplines of scientific psychology. American Psychologist, 30, 116–127.
- Davidov E, Meuleman B, Cieciuch J, Schmidt P, & Billiet J (2014). Measurement equivalence in cross-national research. Annual Review of Sociology, 40, 55–75.
- Davidov E, Schmidt P, Billiet J, & Meuleman B (2018). Cross-cultural analysis: Methods and applications. Routledge.
- Dong Y, & Dumas D (2020). Are personality measures valid for different populations? A systematic review of measurement invariance across cultures, gender, and age. Personality and Individual Differences, 160, 109956.
- Dunbar-Ortiz R (2014). An Indigenous Peoples’ History of the United States. Beacon Press.
- Eghaneyan BH, Sanchez K, Haeny AM, Montgomery L, Lopez-Castro T, Burlew AK, Rezaeizadeh A, & Killian MO (2020). Hispanic participants in the National Institute on Drug Abuse’s Clinical Trials Network: A scoping review of two decades of research. Addictive Behaviors Reports, 12, 100287.
- Embretson SE (2010). Measuring psychological constructs with model-based approaches: An introduction. In Embretson SE (Ed.), Measuring psychological constructs: Advances in model-based approaches (pp. 1–7). Washington, DC: American Psychological Association.
- Enkavi AZ, Eisenberg IW, Bissett PG, Mazza GL, MacKinnon DP, Marsch LA, & Poldrack RA (2019). Large-scale analysis of test-retest reliabilities of self-regulation measures. Proceedings of the National Academy of Sciences, 116, 5472–5477.
- Feldstein Ewing SW, Montanaro EA, Gaume J, Caetano R, & Bryan AD (2015). Measurement invariance of alcohol instruments with Hispanic youth. Addictive Behaviors, 46, 113–120.
- Fish JN, Pollitt AM, Schulenberg JE, & Russell ST (2018). Measuring alcohol use across the transition to adulthood: Racial/ethnic, sexual identity, and educational differences. Addictive Behaviors, 77, 193–202.
- Gaydosh L, et al. (2019). The depths of despair among US adults entering midlife. American Journal of Public Health, 109, 774–780.
- Green CE, Chen CE, Helms JE, & Henze KT (2011). Recent reliability reporting practices in Psychological Assessment: Recognizing the people behind the data. Psychological Assessment, 23, 656–669.
- Gregorich SE (2006). Do self-report instruments allow meaningful comparisons across diverse population groups? Testing measurement invariance using the confirmatory factor analysis framework. Medical Care, 44, s78.
- Gunn HJ, Grimm KJ, & Edwards MC (2020). Evaluation of six effect size measures of measurement non-invariance for continuous outcomes. Structural Equation Modeling: A Multidisciplinary Journal, 27, 503–514.
- Ham LS, Wang Y, Kim SY, & Zamboanga BL (2013). Measurement equivalence of the brief comprehensive effects of alcohol scale in a multiethnic sample of college students. Journal of Clinical Psychology, 69, 341–363.
- Hammonds EM, & Reverby SM (2019). Toward a historically informed analysis of racial health disparities since 1619. American Journal of Public Health, 109, 1348–1349.
- Han K, Colarelli SM, & Weed NC (2019). Methodological and statistical advances in the consideration of cultural diversity in assessment: A critical review of group classification and measurement invariance testing. Psychological Assessment, 31, 1481–1496.
- Harnois CE, Bastos JL, Campbell ME, & Keith VM (2019). Measuring perceived mistreatment across diverse social groups: An evaluation of the Everyday Discrimination Scale. Social Science & Medicine, 232, 298–306.
- Hedge C, Powell G, & Sumner P (2018). The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behavior Research Methods, 50, 1166–1186.
- Helms JE (1992). Why is there no study of cultural equivalence in standardized cognitive ability testing? American Psychologist, 47, 1083–1101.
- Helms JE (2012). A legacy of eugenics underlies racial-group comparisons in intelligence testing. Industrial and Organizational Psychology, 5, 176–179.
- Helms JE (2015). An examination of the evidence in culturally adapted evidence-based or empirically supported interventions. Transcultural Psychiatry, 52, 174–197.
- Helms JE, Jernigan M, & Mascher J (2005). The meaning of race in psychology and how to change it: A methodological perspective. American Psychologist, 60, 27–36.
- Hsiao YY, & Lai MH (2018). The impact of partial measurement invariance on testing moderation for single and multi-level data. Frontiers in Psychology, 9, 740.
- Koob GF, Powell P, & White A (2020). Addiction as a coping response: hyperkatifeia, deaths of despair, and COVID-19. American Journal of Psychiatry, 177, 1031–1037.
- Lai MH, Richardson GB, & Mak HW (2019). Quantifying the impact of partial measurement invariance in diagnostic research: An application to addictions research. Addictive Behaviors, 94, 50–56.
- Larimer ME, & Arroyo JA (2016). Alcohol use among special populations. Alcohol Research: Current Reviews, 38, 1.
- Lee J, Little TD, & Preacher KJ (2018). Methodological issues in using structural equation models for testing differential item functioning. In Davidov E, Schmidt P, Billiet J, & Meuleman B (Eds.), Cross-cultural analysis: Methods and applications (2nd ed., pp. 65–94). New York, NY: Routledge.
- Lewis TT, Yang FM, Jacobs EA, & Fitchett G (2012). Racial/ethnic differences in response to the everyday discrimination scale: A differential item functioning analysis. American Journal of Epidemiology, 175, 391–401.
- Lilienfeld SO, & Treadway MT (2016). Clashing diagnostic approaches: DSM-ICD versus RDoC. Annual Review of Clinical Psychology, 12, 435–463.
- Little TD (2013). Longitudinal structural equation modeling. Guilford Press.
- Lopez-Vergara HI, Rosales R, Scheuermann TS, Nollen NL, Leventhal AM, & Ahluwalia JS (2020). Social determinants of alcohol and cigarette use by race/ethnicity: Can we ignore measurement issues? Psychological Assessment. Advance online publication. 10.1037/pas0000948.
- Lord FM, & Novick MR (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
- Mair P (2018). Modern psychometrics with R. Springer International Publishing.
- Markus KA, & Borsboom D (2013). Frontiers of test validity theory: Measurement, causation, and meaning. New York, NY: Routledge.
- McCarthy DM, Pedersen SL, & D’Amico EJ (2009). Analysis of item response and differential item functioning of alcohol expectancies in middle school youth. Psychological Assessment, 21, 444–449.
- Meade AW, Johnson EC, & Braddy PW (2008). Power and sensitivity of alternative fit indices in tests of measurement invariance. Journal of Applied Psychology, 93, 568–592.
- Miller AP, Merkle EC, Galenkamp H, Stronks K, Derks EM, & Gizer IR (2019). Differential item functioning analysis of the CUDIT and relations with alcohol and tobacco use among men across five ethnic groups: The HELIUS Study. Psychology of Addictive Behaviors, 33, 697–709.
- Millsap RE (2012). Statistical approaches to measurement invariance. New York, NY: Routledge.
- Montgomery L, Burlew AK, Haeny AM, & Jones CA (2020). A systematic scoping review of research on Black participants in the National Drug Abuse Treatment Clinical Trials Network. Psychology of Addictive Behaviors, 34, 117–127.
- Mulia N, Karriker-Jaffe KJ, Witbrodt J, Bond J, Williams E, & Zemore SE (2017). Racial/ethnic differences in 30-year trajectories of heavy drinking in a nationally representative U.S. sample. Drug and Alcohol Dependence, 170, 133–141.
- Muthen B, & Asparouhov T (2002). Latent variable analysis with categorical outcomes: Multiple-group and growth modeling in Mplus. Mplus Web Notes, 4(5), 1–22.
- Muthen B, & Asparouhov T (2018). Recent methods for the study of measurement invariance with many groups: Alignment and random effects. Sociological Methods & Research, 47, 637–664.
- Ortiz P (2018). An African American and Latinx History of the United States. Beacon Press.
- Parsons S, Kruijt A, & Fox E (2019). Psychological science needs a standard practice of reporting the reliability of cognitive-behavioral measurements. Advances in Methods and Practices in Psychological Science, 2, 378–395.
- Pendergast LL, von der Embse N, Kilgus SP, & Eklund KR (2017). Measurement equivalence: A non-technical primer on categorical multi-group confirmatory factor analysis in school psychology. Journal of School Psychology, 60, 65–82.
- Perry JC, Satiani A, Henze KT, Mascher J, & Helms JE (2008). Why is there still no study of cultural equivalence in standardized cognitive ability tests? Journal of Multicultural Counseling and Development, 36, 155–167.
- Pinedo M (2019). A current re-examination of racial/ethnic disparities in the use of substance abuse treatment: Do disparities persist? Drug and Alcohol Dependence, 202, 162–167.
- Putnick DL, & Bornstein MH (2016). Measurement invariance conventions and reporting: The state of the art and future directions for psychological research. Developmental Review, 41, 71–90.
- Raju NS, Laffitte LJ, & Byrne BM (2002). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87, 517–529.
- Ramirez M, Ford ME, Stewart AL, & Teresi J (2005). Measurement issues in health disparities research. Health Services Research, 40, 1640–1657.
- Reeve BB, Willis G, Shariff-Marco SN, Breen N, Williams DR, Gee GC, … Levein KY (2011). Comparing cognitive interviewing and psychometric methods to evaluate a racial/ethnic discrimination scale. Field Methods, 23, 397–419.
- Revelle W, & Condon DM (2019). Reliability from α to ω: A tutorial. Psychological Assessment, 31(12), 1395.
- Rose JS, Dierker LC, Selya A, & Smith PH (2018). Integrative data analysis of gender and ethnic measurement invariance in nicotine dependence symptoms. Prevention Science, 19, 748–760.
- Rouder JN, & Haaf JM (2019). A psychometrics of individual differences in experimental tasks. Psychonomic Bulletin and Review, 26, 452–467.
- Sacco P, Casado BL, & Unick CJ (2011). Differential item functioning across race in aging research: An example using a social support measure. Clinical Gerontologist, 34, 57–70.
- Santelices MV, & Wilson M (2010). Unfair treatment? The case of Freedle, the SAT, and the standardization approach to differential item functioning. Harvard Educational Review, 80, 106–134.
- Sass DA (2011). Testing measurement invariance and comparing latent factor means within a confirmatory factor analysis framework. Journal of Psychoeducational Assessment, 29, 347–363.
- Schmidt FL, Le H, & Ilies R (2003). Beyond alpha: An empirical examination of the effects of different sources of measurement error on reliability estimates for measures of individual differences constructs. Psychological Methods, 8, 206–224.
- Schmidt S, Heffernan R, & Ward T (2020). Why we cannot explain cross-cultural differences in risk assessment. Aggression and Violent Behavior, 50, 101346.
- Schmitt N, & Kuljanin G (2008). Measurement invariance: Review of practice and implications. Human Resource Management Review, 18, 210–222.
- Secretary’s Task Force on Black and Minority Health (1985). Report of the Secretary’s Task Force on Black & Minority Health. U.S. Department of Health and Human Services. Bethesda, MD: National Institutes of Health. Available at: http://archive.org/stream/reportofsecretar00usde#page/n1/mode/2up.
- Singh GK, et al. (2017). Social determinants of health in the United States: Addressing major health inequality trends for the nation, 1935-2016. International Journal of MCH and AIDS, 6, 139–164.
- Sladek MR, Umana-Taylor AJ, McDermott ER, Rivas-Drake D, & Martinez-Fuentes S (2020). Testing invariance of ethnic-racial discrimination and identity measures for adolescents across ethnic-racial groups and contexts. Psychological Assessment, 32, 509–526.
- Spearman C (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
- Spearman C (1907). Demonstration of formulae for true measurement of correlation. The American Journal of Psychology, 161–169.
- Spearman C (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.
- Spector PE, Liu C, & Sanchez JI (2015). Methodological and substantive issues in conducting multinational and cross-cultural research. Annual Review of Organizational Psychology and Organizational Behavior, 2, 101–131.
- Spillane NS, & Smith GT (2007). A theory of reservation-dwelling American Indian alcohol use risk. Psychological Bulletin, 133, 395–418.
- Spillane NS, & Smith GT (2010). Individual differences in problem drinking among tribal members from one First Nation community. Alcoholism: Clinical & Experimental Research, 34, 1985–1992.
- Spillane S, Shiels MS, Best AF, Haozous EA, Withrow DR, Chen Y, Berrington de Gonzalez A, & Freedman ND (2020). Trends in alcohol-induced deaths in the United States, 2000-2016. JAMA Network Open, 3, e1921451.
- Stevanovic D, et al. (2017). Can we really use available scales for child and adolescent psychopathology across cultures? A systematic review of cross-cultural measurement invariance data. Transcultural Psychiatry, 54, 125–152.
- Stewart AL, & Napoles-Springer AM (2003). Advancing health disparities research: Can we afford to ignore measurement issues? Medical Care, 1207–1220.
- Sue S (1999). Science, ethnicity, and bias: Where have we gone wrong? American Psychologist, 54, 1070–1077.
- Tay L, Meade AW, & Cao M (2015). An overview and practical guide to IRT measurement equivalence analysis. Organizational Research Methods, 18, 3–46.
- Thurstone LL (1932). The reliability and validity of tests. Ann Arbor, MI.
- Traub RE (1997). Classical test theory in historical perspective. Educational Measurement, 16, 8–13.
- Wicherts JM (2016). The importance of measurement invariance in neurocognitive ability testing. The Clinical Neuropsychologist, 30, 1006–1016.
- Wicherts JM, & Dolan CV (2010). Measurement invariance in confirmatory factor analysis: An illustration using IQ test performance of minorities. Educational Measurement: Issues and Practice, 29, 39–47.
- Williams DR, Priest N, & Anderson N (2016). Understanding associations between race, socioeconomic status and health: Patterns and prospects. Health Psychology, 35, 407–411.
- Woolf SH, et al. (2018). Change in midlife death rates across racial and ethnic groups in the United States: Systematic analysis of vital statistics. BMJ, 362:k3096.
- Wu L, Pan J, Blazer D, Tai B, Stitzer ML, & Woody GF (2010). Using a latent variable approach to inform gender and racial/ethnic differences in cocaine dependence: A National Drug Abuse Treatment Clinical Trials Network study. Journal of Substance Abuse Treatment, 38, s70–s79.
- Zapolski TC, Pedersen SL, McCarthy DM, & Smith GT (2014). Less drinking, yet more problems: Understanding African American drinking and related problems. Psychological Bulletin, 140, 188-
- Zemore SE, et al. (2018). The future of research on alcohol-related disparities across U.S. racial/ethnic groups: A plan of attack. Journal of Studies on Alcohol and Drugs, 79, 7–21.
- Zumbo BD (2007). Three generations of differential item functioning (DIF) analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4, 223–233.