Author manuscript; available in PMC: 2023 Jul 1.
Published in final edited form as: Assessment. 2021 Feb 20;29(5):925–939. doi: 10.1177/1073191120986613

Reliability of Differential Item Functioning in Alcohol Use Disorder: Bayesian Meta-Analysis of Criteria Discrimination Estimates

Colin E Vize 1, Sean P Lane 2
PMCID: PMC9303006  NIHMSID: NIHMS1822138  PMID: 33615848

Abstract

Numerous studies leverage item response theory (IRT) methods to examine measurement characteristics of alcohol use disorder (AUD) diagnostic criteria. Less work has examined the consistency of AUD IRT parameter estimates, an essential step for establishing measurement invariance, making statements about symptom diagnosticity, and validating the theoretical construct. A Bayesian meta-analysis of IRT discrimination values for AUD criteria across 33 independent samples (Total N = 321,998) revealed that overall consistency of AUD criteria discriminations was low (generalized intraclass correlation range = .105-.249). However, specific study characteristics accounted for substantial variability, suggesting that the unreliability is partially systematic. We replicated evidence of differential item functioning (DIF) via established factors (e.g., age, gender), but the magnitudes were small compared with DIF associated with assessment instrument. These results offer practical recommendations regarding which instruments to use when specific AUD criteria are of interest and which criteria are most sensitive when comparing demographic groups.

Keywords: alcohol use disorder, Bayesian meta-analysis, differential item functioning, generalizability theory, item response theory


The diagnosis and conceptualization of alcohol use disorder (AUD) has gone through substantial revision since the syndrome was first included among the personality disorders in earlier versions of the Diagnostic and Statistical Manual of Mental Disorders (i.e., DSM-I and DSM-II; American Psychiatric Association [APA], 1952, 1968).1 In the current iteration of the DSM–Fifth edition (DSM-5; APA, 2013), AUD is diagnosed based on the presence of 2 or more of 11 different criteria that assess cognitive, behavioral, and physiological symptoms, with severity being determined by the number of criteria endorsed. As statistical tools have become more advanced in recent decades, researchers have leveraged these advancements to further refine the field’s understanding of AUD. One such advancement has been the application of item response theory (IRT) to studying the AUD criteria.

Numerous IRT studies have been conducted on the AUD criteria set as instantiated in the 4th and 5th editions of the DSM (APA, 1994, 2013). The most common IRT model used in AUD research is the two-parameter logistic (2PL) model. The 2PL model estimates two parameters for each criterion based on a logistic model of (usually) binary endorsement: a threshold and discrimination parameter. Thresholds relate to how difficult it is to endorse a particular criterion. Consequently, many clinical measures are designed to include items/criteria that fall along a continuum of difficulty to ensure that the measure provides information about individuals who fall along various points of the latent spectrum (Reise & Waller, 2009). Within AUD research, studies have shown that certain AUD criteria like tolerance (having to drink a greater amount to achieve similar effects) tend to be relatively easy to endorse, while others such as withdrawal (insomnia/sweating/tremors/nausea within a few hours or days after stopping drinking) are more difficult (e.g., Martin et al., 2006).

In comparison, discrimination parameters relate to a criterion’s ability to distinguish between persons given their standing on the latent AUD spectrum. Higher discrimination values indicate that an item or criterion is better able to differentiate between persons at similar levels of the latent spectrum while lower values suggest that a criterion provides less information about individuals across varying levels of the latent spectrum. That is, a highly discriminating criterion is associated with a high probability of identifying individuals who will, versus will not, endorse the criterion within a narrow range along the latent spectrum. In contrast to criteria thresholds, past research suggests that both tolerance and withdrawal provide comparatively less information about latent AUD precision whereas interpersonal problems and giving up activities due to drinking have high discriminations (e.g., Martin et al., 2006; Saha et al., 2006). It is important to note that while thresholds often receive preferential attention given their association with symptom severity (Balsis et al., 2007; Cooper & Balsis, 2009; Langenbucher et al., 2004; Uebelacker et al., 2009), focused attention on any criterion given its relative severity is (arguably) only warranted to the extent that the corresponding discrimination is also high.
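The roles of the two 2PL parameters described above can be illustrated with a short sketch. It uses the standard 2PL response function, in which the probability of endorsing a criterion is a logistic function of the discrimination times the distance between a person's latent standing and the criterion's threshold. The function name and all parameter values below are hypothetical and chosen only for illustration; they are not estimates from any study discussed here.

```python
import math


def two_pl_probability(theta, discrimination, threshold):
    """Probability of endorsing a criterion under the 2PL model.

    theta: a person's standing on the latent AUD continuum.
    discrimination: slope of the response curve at the threshold.
    threshold: latent level at which endorsement probability is .50.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - threshold)))


# Hypothetical criteria sharing one threshold but differing in discrimination.
low_disc = {"discrimination": 0.8, "threshold": 1.0}
high_disc = {"discrimination": 2.5, "threshold": 1.0}

for theta in (0.0, 1.0, 2.0):
    p_low = two_pl_probability(theta, **low_disc)
    p_high = two_pl_probability(theta, **high_disc)
    print(f"theta={theta:+.1f}  low a: {p_low:.2f}  high a: {p_high:.2f}")
```

With the same threshold, the more discriminating criterion changes from unlikely to likely endorsement over a much narrower range of the latent continuum, which is why it carries more information near its threshold.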

IRT studies of AUD criteria have addressed a range of research questions. For example, IRT-based results have shown that the AUD criteria set assesses a single latent dimension (e.g., Harford et al., 2009; Hasin et al., 2013), with some indication that it is most effective at identifying severe cases along the AUD continuum (Saha et al., 2006). Other studies suggest that AUD criteria are most informative for individuals along the moderate portion of the AUD latent spectrum (e.g., Borges et al., 2010). Despite the relatively large number of IRT studies on AUD criteria, less work has examined the overall consistency of the observed threshold and discrimination estimates across studies. In their recent meta-analysis of IRT threshold estimates, Lane et al. (2016) found notably low reliability in AUD criteria thresholds, and also found important sources of systematic variability in threshold values (e.g., measurement instruments and age group). The present study extends similar analyses to examine the between-study consistency of IRT-derived discrimination parameters, and examine competing hypotheses regarding the perceived consistency of AUD criteria discrimination values. Such analyses are critical for arguments asserting the generalizability of AUD phenomenology and diagnosis across demographic subpopulations.

IRT Analyses of AUD Criteria

An important topic within IRT analyses of the AUD criteria is the presence (or lack thereof) of differential item functioning (DIF). DIF analyses have tended to focus on whether the discrimination and/or threshold values for AUD criteria show variability across relevant sample characteristics (e.g., gender, age group, ethnicity, etc.). The presence of DIF for discrimination estimates may indicate that an AUD criterion’s ability to distinguish between individuals that fall along different points of the underlying AUD spectrum is dependent on factors that may differ across samples.

Important instances of DIF have been noted in the AUD IRT literature. For example, when examining the discriminations of AUD criteria in a sample of adolescents, the quit/cut down criterion showed relatively poor discrimination (Martin et al., 2006), consistent with other work that has found that adolescents with more severe alcohol abuse problems tend to not endorse trying to cut down while other adolescents with less severe problems do report attempts to limit their use (Chung & Martin, 2002).

In their analysis of nationally representative data from the National Epidemiologic Survey on Alcohol and Related Conditions, Saha et al. (2006) found DIF across men and women, age groups, and ethnicity. With regard to discrimination values, the 25- to 44-year-old age group showed lower discrimination values compared with the ≥45 years old and 18- to 24-year-old age groups for the tolerance and hazardous use criteria. When considering ethnicity, African American participants showed lower discriminations for the tolerance, quit/cut down, and hazardous use criteria compared with White participants. By contrast, using another representative data set from the National Survey on Drug Use and Health across gender, age, and ethnicity, Harford et al. (2009) reported patterns not wholly consistent with the criteria identified by Saha et al. (2006; i.e., tolerance and hazardous use) while also reporting DIF across gender and age groups not identified by those authors (e.g., giving up activities, role interference). When focusing on the specific AUD symptom of dependence across four studies, Witkiewitz et al. (2016) found that certain assessment items tied to alcohol dependence showed differential functioning across relevant sample characteristics. For example, the authors found that items related to physiological dependence and delirium tremens were less useful in assessing the degree of dependence severity in younger individuals and females, respectively.

Given that numerous studies have examined DIF with regard to discrimination estimates within studies, the consistency of DIF at a more global level can help inform future efforts in establishing whether AUD criteria discriminations from any given study are generalizable, and if not, what sources may be limiting the generalizability of the estimates.

The Current Study

Though IRT has been utilized to explore differences in threshold and/or discrimination values within the AUD criteria set, less work has examined the between-study consistency of the AUD criteria discriminations. Discrepancies in discrimination estimates across studies can provide important insights into diagnoses of AUD. Most important, in cases where the criteria discriminations are not comparable across different types of groups (e.g., gender, ethnicity, etc.), it suggests that the underlying composition of the latent AUD trait may differ. The implications of such findings would suggest that not only can the presentation of AUD look different across groups even when similar symptoms are endorsed, but that the underlying etiology may also differ.

The current study looks to more critically examine the overall consistency in AUD criterion discriminations across studies. Previous work has been conducted examining the consistency of threshold estimates (Lane et al., 2016) and the present study looks to apply the same analytical framework to examinations of discriminations. Specifically, we aim to test whether previously reported sources of DIF within samples (e.g., gender, ethnicity, age group) as well as commonly examined meta-analytic moderators (e.g., clinical vs. nonclinical samples, measurement instrument) emerge as systematic sources of discrimination variability. Given previous DIF findings regarding AUD discrimination estimates, we hypothesize that there will be notable sources of systematic variability in discrimination estimates, which will limit the likelihood that the AUD discrimination estimates will be consistent across studies. However, debates remain regarding whether the AUD criteria consistently distinguish between individuals at differing levels of AUD severity, and whether certain criteria are more discriminating than others (e.g., Martin et al., 2011; Martin et al., 2014; Sayette, 2016). The argument that the criteria consistently distinguish between individuals would lead to the prediction that the AUD criteria will be the primary source of variability in discrimination values across studies, especially if criteria differences reported by epidemiological investigations are to be considered generalizable (cf. Harford et al., 2009; Saha et al., 2006).

Method2

Literature Search

The present literature search incorporated all results returned from the initial search conducted in May of 2015 for the original publication (see Lane et al., 2016, for a full review). Briefly, the PubMed, Web of Science, ProQuest (including dissertations), and PsycINFO databases were searched with a time restriction of January 1977 to May 2015. The search terms entered were “item response theory” or “differential item functioning” crossed with “alcohol use disorder,” “alcohol abuse,” or “alcohol dependence.” Inclusion criteria for studies returned by these search terms were that the paper needed to report on discrimination and severity parameter estimates from IRT analyses that assessed DSM-III, DSM-III-R, DSM-IV, DSM-5, ICD-9, or ICD-10 AUD criteria, and that the results were derived from 2PL model analyses. In order to include recent studies that had used IRT analyses since the publication of the original meta-analysis, an updated search was conducted following the exact procedures outlined above with the exception that the search aimed to identify studies published between May of 2015 and May of 2018.

The updated search returned 74 articles for review. After removing duplicate articles (n = 30), a review of the remaining 44 articles identified 3 new articles to be added, resulting in a total of 33 nonoverlapping articles (with 52 samples providing discrimination estimates) included in our quantitative analyses. Though the number of included articles could be considered moderate, the total sample size (N = 321,998) for the present study was extremely large, even by meta-analytic standards. A flow chart of the literature search results is presented in Figure 1, which maps the path of study inclusion over the four phases of our systematic review: identification, screening, eligibility, and inclusion. Studies included in the meta-analysis are marked with an asterisk in the references.

Figure 1. Flow diagram of updated literature search results.


Coded Information

In addition to the AUD criteria discrimination values, the information taken from each article included the study authors, year of publication, sample characteristics (i.e., mean age, clinical vs. population, gender ratio, etc.), sample size, AUD diagnostic instrument, diagnostic time frame, number of criteria assessed, and reporting metric for the discrimination values (unstandardized, standardized, IRT parameterized).3 The original data were coded by two independent raters and the raters showed near perfect agreement for the discrimination and severity values (intraclass correlations [ICCs] = .98-1.00) and showed good to excellent agreement for descriptive information taken from the studies (κ = .86-1.00). Due to only three unique studies being included from the updated literature search, the first and second author independently coded the three studies and results showed perfect agreement across all coded information.

For publications that examined DIF, we included discrimination estimates for each respective group in the analyses. We did not include aggregated estimates if such an analysis was conducted since they would be partially confounded with the individual DIF results. In this way, specific DIF identified in different publications was not explicitly examined, but was included and appropriately weighted to estimate meta-analytic evidence of DIF. We considered whether DIF in general across the criteria set existed, and, if so, for what criteria and factors was it robust across the extant literature.

Analyses

Our primary interest was in the overall reliability of the discrimination estimates for the AUD criteria set across studies. As such, we made use of generalizability theory (GT; Brennan, 1992) to estimate generalized ICCs (Shrout & Fleiss, 1979) within a Bayesian framework. Two models were examined. The first model is expressed in Equation 1,

$$P_{cs} = \mu + C_c + S_s + e_{cs} \tag{1}$$

where $P_{cs}$ represents the discrimination estimate for criterion $c$ from study $s$. $\mu$ represents the grand mean of all criteria discrimination estimates. $C_c$ is the tendency for criterion $c$ to be more or less discriminative across samples while $S_s$ is the tendency for study $s$ to produce higher or lower discriminations across the criteria set on average. The reliability of the estimated discrimination values for a fixed criterion set (i.e., $C_c$ from Equation 1) for any randomly selected study can be examined using Equation 2:

$$R_{1F} = \frac{\sigma^2_{\text{CRITERION}}}{\sigma^2_{\text{CRITERION}} + \sigma^2_{\text{ERROR}}} \tag{2}$$

Equation 2 provides a reliability estimate between 0 and 1, with higher values indicating a greater level of reliability for the AUD criteria set. In the present context, reliability refers to the ability of the criteria set to consistently account for variability observed in discrimination values across studies, relative to other sources of variability. In the basic model, the other sources of variability are aggregated into either a study component or error component.

The basic model in Equation 1 was then expanded to estimate additional variance components for diagnostic instrument, diagnosis time frame, sample type, gender, and age along with their interactions with AUD criteria (Equation 3).

$$P_{csitnmag} = \mu + C_c + S_s + I_i + T_t + N_n + M_m + A_a + G_g + (CI)_{ci} + (CT)_{ct} + (CN)_{cn} + (CM)_{cm} + (CA)_{ca} + e_{csitnmag} \tag{3}$$

In the expanded model, $P_{csitnmag}$ represents the discrimination parameter estimate for criterion $c$ from study $s$, where study was indexed by specific instrument ($i$; seven categories: Alcohol Use Disorder and Associated Disabilities Interview Schedule [AUDADIS]; Composite International Diagnostic Interview [CIDI]; Semistructured Assessment for the Genetics of Alcoholism [SSAGA]; Substance Abuse and Mental Health Services Administration [SAMHSA]; Psychiatric Research Interview for Substance and Mental Disorders [PRISM]; Structured Clinical Interview for DSM-IV [SCID]; Other), AUD diagnostic time frame ($t$; two categories: current, lifetime), sample composition ($n$; four categories: clinical, healthy, mixed, population), gender distribution ($m$; five categories: exclusively men, primarily men, approximately equal men and women, primarily women, exclusively women), age distribution ($a$; five categories: <18 years, primarily 18-30 years, primarily 30-50 years, >50 years, representative of the population 18 years and older), and being part of a group of studies that used the same or a partially overlapping sample ($g$). $\mu$, $C_c$, and $S_s$ are the same as in Equation 1, but now we include effects for instrument ($I_i$), diagnosis time frame ($T_t$), sample population ($N_n$), gender composition ($M_m$), age group ($A_a$), and being part of a group of studies that used overlapping samples ($G_g$). Two-way interaction terms are also included to examine additional sources of systematic variance in reliability. In the expanded model, the consistency of criterion discriminations for any randomly selected study that is not attributable to instrument, time frame, population, gender, or age can be examined using Equation 4.

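$$R_{1R} = \frac{\sigma^2_{\text{CRITERION}}}{\sigma^2_{\text{CRITERION}} + \sigma^2_{\text{CRITERION} \times \text{INSTRUMENT}} + \sigma^2_{\text{CRITERION} \times \text{TIMEFRAME}} + \sigma^2_{\text{CRITERION} \times \text{CLINICAL}} + \sigma^2_{\text{CRITERION} \times \text{GENDER}} + \sigma^2_{\text{CRITERION} \times \text{AGE}} + \sigma^2_{\text{ERROR}}} \tag{4}$$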

Similar to Equation 2 above, Equation 4 provides a reliability estimate between 0 and 1, with higher values indicating that the criteria set accounts for the most variability in discrimination estimates across studies, relative to other potential sources of variability (e.g., an interaction between a particular criterion and measurement instrument). The analyses described were all completed using a Bayesian approach. Sample sizes from each included analysis were used to weight sets of criteria discriminations to adjust for differences in precision when calculating meta-analytic estimates and confidence intervals (DuMouchel, 1990).
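The ratios in Equations 2 and 4 can be sketched as a single function, since Equation 2 is the special case of Equation 4 with no criterion interaction terms. This is a minimal illustration rather than the estimation procedure used in the study: the function name is hypothetical, the basic-model inputs are the unstandardized posterior medians reported in Table 2, and the interaction variances in the second call are made-up values used only to show the direction of the effect.

```python
def generalized_icc(criterion_var, error_var, interaction_vars=()):
    """Share of variance attributable to the criteria (Equations 2 and 4).

    interaction_vars holds the criterion-by-moderator variance components;
    leaving it empty reproduces the basic-model ratio in Equation 2.
    """
    denominator = criterion_var + sum(interaction_vars) + error_var
    if denominator <= 0:
        raise ValueError("variance components must sum to a positive value")
    return criterion_var / denominator


# Basic model, unstandardized posterior medians from Table 2.
basic_icc = generalized_icc(criterion_var=0.066, error_var=0.199)
print(round(basic_icc, 3))  # 0.249

# Hypothetical interaction variances (illustration only): adding them to the
# denominator lowers the share attributable to the criteria.
with_interactions = generalized_icc(0.066, 0.199, interaction_vars=(0.05, 0.02))
print(with_interactions < basic_icc)  # True
```

Plugging posterior medians into the ratio only approximates the posterior median of the ICC itself, because the median of a ratio is not generally the ratio of the medians.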

Although the various benefits of Bayesian methods have been emphasized elsewhere (e.g., Kruschke & Liddell, 2018; Wagenmakers et al., 2016), we chose to rely on Bayesian analyses in the present study in order to test different beliefs regarding whether the AUD criteria set would be the primary source of variability in discrimination estimates across studies.4 We made use of the Savage-Dickey method (Wagenmakers et al., 2010), allowing for approximate Bayes factors (BF) to be computed. The Savage-Dickey method provides an approximate BF by comparing the prior and posterior distribution densities at a specified point estimate (typically zero, the null hypothesis). In turn, the BF describes how much more (or less) likely the point estimate is after having seen the data. Thus, Bayesian analysis allows for direct quantification of evidence for or against particular point hypotheses.

In the context of the present study, BFs were computed to test point hypotheses regarding the amount of variability in discrimination estimates that could be accounted for by the AUD criteria set. Specifically, we examined the influence of two different types of prior specifications for the $\sigma^2_{\text{CRITERION}}$ parameter.5 First, we used moderately informed priors ($\sigma^2_{\text{CRITERION}} \sim N[.60, .035]$) that reflect some degree of confidence that the AUD criteria will be the primary source of variability in discrimination estimates before having seen the data. Next, we used strongly informed priors ($\sigma^2_{\text{CRITERION}} \sim N[.60, .001]$) that reflect very strong beliefs that the AUD criteria set will be the primary source of variability in discrimination estimates. By comparing the likelihoods of $\sigma^2_{\text{CRITERION}}$ parameter estimates under the prior and posterior distributions at the point estimate of .60, BFs can be computed to quantify the degree of evidence in favor of (or against) beliefs that the AUD criteria set will account for a moderate to large degree of variability (i.e., $\sigma^2_{\text{CRITERION}} = .60$) in discrimination estimates across studies.
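The density-ratio logic of the Savage-Dickey method can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the second argument of the moderately informed prior is treated here as a standard deviation (the text does not say whether it is an SD, variance, or precision), the posterior draws are simulated stand-ins for MCMC output, and the kernel density estimator, its bandwidth, and all function names are hypothetical choices.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def kde_density(x, samples, bandwidth):
    """Gaussian kernel density estimate of the sample density at x."""
    return sum(normal_pdf(x, s, bandwidth) for s in samples) / len(samples)


def savage_dickey_bf(prior_density, posterior_density):
    """Posterior-to-prior density ratio at the point hypothesis."""
    return posterior_density / prior_density


POINT = 0.60
# Assumption: .035 is interpreted as the prior SD for this illustration.
PRIOR_MEAN, PRIOR_SD = 0.60, 0.035

# Hypothetical posterior draws standing in for MCMC samples of the parameter.
rng = random.Random(0)
posterior_draws = [rng.gauss(0.52, 0.03) for _ in range(5000)]

prior_at_point = normal_pdf(POINT, PRIOR_MEAN, PRIOR_SD)
posterior_at_point = kde_density(POINT, posterior_draws, bandwidth=0.01)
bf = savage_dickey_bf(prior_at_point, posterior_at_point)
print(f"approximate BF at {POINT}: {bf:.3f}")
```

A ratio below 1 means the point value has become less plausible after seeing the data, and its reciprocal expresses how many times less plausible.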

Results

Figure 2 displays the plotted IRT discrimination estimates for raw and standardized values. The criteria are plotted in ascending order according to their median discrimination value across studies (Table 1). Both the raw and standardized discrimination values show notable degrees of systematic unreliability in that (1) there is marked variability within each criterion category across studies, and (2) there are multiple patterns of opposing change (larger/longer, cut down, role interference, time spent, interpersonal problems) across the different continuous study series. However, the standardized data show an increasing linear trend in discrimination values, suggesting some degree of reliability in the criteria. The differences between the unstandardized and standardized results are a consequence of differing within-study means and variances, which are eliminated when standardized data are used because all discrimination estimates are placed on the same scale. The choice to use standardized results may be justifiable if the between-study differences in scaling are viewed as nuisance variance as opposed to being indicative of meaningful study-specific variability. Therefore, although we primarily focus on unstandardized values for interpretation, we present results for both unstandardized and standardized discrimination values in Table 2.6
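The effect of standardization described above can be sketched with a small example. It assumes that standardization means z-scoring the discrimination estimates within each study, which is one plausible reading of the text rather than a confirmed description of the procedure; the function name, study labels, and discrimination values are all hypothetical.

```python
from statistics import mean, stdev


def standardize_within_study(estimates_by_study):
    """Z-score each study's discrimination estimates around that study's mean."""
    standardized = {}
    for study, estimates in estimates_by_study.items():
        values = list(estimates.values())
        study_mean = mean(values)
        study_sd = stdev(values)
        standardized[study] = {
            criterion: (value - study_mean) / study_sd
            for criterion, value in estimates.items()
        }
    return standardized


# Hypothetical studies: same ordering of criteria, but study_B reports
# discriminations on a scale twice as large as study_A.
raw = {
    "study_A": {"tolerance": 1.0, "withdrawal": 1.4, "time_spent": 2.2},
    "study_B": {"tolerance": 2.0, "withdrawal": 2.8, "time_spent": 4.4},
}

for study, z_scores in standardize_within_study(raw).items():
    formatted = {criterion: round(z, 2) for criterion, z in z_scores.items()}
    print(study, formatted)
```

After standardization the two hypothetical studies show the same profile, so any remaining disagreement between studies reflects differences in the relative ordering of criteria rather than differences in overall scale.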

Figure 2. Unstandardized (a) and standardized (b) discrimination estimates for AUD criteria.


Table 1.

Descriptive Information for Included Studies.

| Article | Sample a | Instrument b | Sample size | Time frame | # Criteria |
|---|---|---|---|---|---|
| Casey et al. (2012); Lane and Sher (2015) c | NESARC Wave 2 | AUDADIS-IV | 22,177 g | Current | 11 |
| Dawson et al. (2010) d | NESARC Wave 1 | AUDADIS-IV | 26,946 g | Current | 11 |
| Saha et al. (2006) d | NESARC Wave 1 | AUDADIS-IV | 22,526 h | Current | 10 |
| Saha et al. (2007) d | NESARC Wave 1 | AUDADIS-IV | 20,846 i | Current | 11 |
| Shmulewitz et al. (2010) d | Israeli households | AUDADIS-IV | 1,066 | Current | 11 |
| | | | 1,160 | Lifetime | 11 |
| Keyes et al. (2011) | NLAES | AUDADIS-IV | 18,352 i | Current | 12 |
| Preuss et al. (2014) | WHO/ISBRA | AUDADIS-based | 711 | Lifetime | 11 |
| | Australia | | 104 | | |
| | Brazil | | 212 | | |
| | Canada | | 227 | | |
| | Finland | | 86 | | |
| | Japan | | 82 | | |
| Mewton et al. (2011a); Proudfoot et al. (2006) | NSMHWB | CIDI V2.0 | 7,746 | Current | 11 |
| Mewton et al. (2011b) | NSMHWB | CIDI V2.0 | 853 j | Current | 11 |
| McCutcheon et al. (2011) | COGA | SSAGA | 8,605 | Lifetime | 9 |
| | Non-DUI Men | | 3,056 | | |
| | Non-DUI Women | | 3,894 | | |
| | DUI Men | | 1,330 | | |
| | DUI Women | | 325 | | |
| Beseler et al. (2010) | College students | Survey-specific | 353 | Current | 10 o |
| Hagman (2017) c | College students | Brief DSM-5 AUD Assessment | 923 g | Current | 11 |
| Hagman and Cohn (2011) | College students | CIDI-SAM | 396 | Current | 11 |
| Ehlke et al. (2012) | NSDUH 2009 | SAMHSA | 4,605 k | Current | 11 |
| Kuerbis et al. (2013b) | NSDUH 2009 | SAMHSA | 3,412 l | Current | 11 |
| Hagman and Cohn (2013) | NSDUH 2009 | SAMHSA | 3,806 m | Current | 11 p |
| Rose et al. (2012) | NSDUH 2002-2008 | SAMHSA | 9,356 n | Current | 11 |
| Harford et al. (2009) | NSDUH 2002-2005 | SAMHSA | 133,231 | Current | 11 |
| | Men, Age 12-17 years | | 11,651 | | |
| | Men, Age 18-25 years | | 27,377 | | |
| | Men, Age 26+ years | | 25,872 | | |
| | Women, Age 12-17 years | | 12,304 | | |
| | Women, Age 18-25 years | | 29,331 | | |
| | Women, Age 26+ years | | 26,696 | | |
| Srisurapanont et al. (2012) | Thai-NMH Survey | MINI-Thai | 3,718 | Current | 7 |
| | Men | | 3,174 | | |
| | Women | | 544 | | |
| | Adolescents | | 272 | | |
| | Adults | | 3,446 | | |
| Duncan et al. (2011) | MOAFTS | SSAGA | 2,835 | Lifetime | 11 |
| | Women, Age 18-20 | | 1,158 | | |
| | Women, Age 21-25 | | 1,677 | | |
| Derringer et al. (2013) | MTFS & SAGE | SSAGA | 6,597 | Lifetime | 7 |
| Gilder et al. (2011) e | American Indians | SSAGA | 530 | Lifetime | 10 |
| Gelhorn et al. (2008) | Mixed adolescents f | CIDI-SAM | 5,587 | Lifetime | 11 |
| Caetano et al. (2016) c | Puerto Rican adults | CIDI-Spanish | 1,107 g | Lifetime | 11 |
| Castaldelli-Maia et al. (2015) c | Brazilian adults | CIDI-Portuguese | 936 i | Current | 11 |
| Bond et al. (2012); Borges et al. (2010, 2011); Cherpitel et al. (2010) | ED patients | CIDI V1.0 | 3,191 | Current | 12 |
| | Argentina | | 662 | | |
| | Mexico | | 547 | | |
| | Poland | | 1,098 | | |
| | USA | | 884 | | |
| Hasin et al. (2012) d | Clinical | PRISM | 543 | Current | 11 |
| Langenbucher et al. (2004) | Clinical | CIDI-SAM | 372 | Lifetime | 9 |
| Wu et al. (2009) | Clinical | DSM-IV checklist | 462 | Current | 7 |
| Wu et al. (2012) | Clinical | DSM-IV checklist | 671 | Current | 7 |
| Martin et al. (2006) | Clinical adolescents | SCID | 464 | Lifetime | 11 |
| Edwards et al. (2013) | VATSPSUD | SCID | 7,454 | Lifetime | 11 |
| Kuerbis et al. (2013a) | SARD | SCID | 461 | Lifetime | 11 |

a. NESARC = National Epidemiologic Survey on Alcohol and Related Conditions; NLAES = National Longitudinal Alcohol Epidemiologic Study; WHO/ISBRA = World Health Organization/International Society on Biomedical Research Collaborative Study; NSMHWB = National Survey of Mental Health and Well-Being (Australia); COGA = Collaborative Study on the Genetics of Alcoholism; NSDUH = National Survey on Drug Use and Health; Thai-NMH Survey = Thai National Mental Health Survey; MOAFTS = Missouri Adolescent Female Twin Study; MTFS = Minnesota Twin Family Study; SAGE = Study of Addiction: Genes and Environment; ED = Emergency Department; VATSPSUD = Virginia Adult Twin Study of Psychiatric and Substance Use Disorders; SARD = Substance Abuse Research Demonstration.

b. AUDADIS-IV = Alcohol Use Disorder and Associated Disabilities Interview Schedule-IV; CIDI = Composite International Diagnostic Interview; SSAGA = Semistructured Assessment for the Genetics of Alcoholism; SAMHSA = Substance Abuse and Mental Health Services Administration; SAM = Substance Abuse Module; MINI-Thai = Mini International Neuropsychiatric Inventory, Thai module; PRISM = Psychiatric Research Interview for Substance and Mental Disorders; SCID = Structured Clinical Interview for DSM-IV.

c. Additional studies included in the updated analysis.

d. We could not confirm the reported metric for the item response theory (IRT) parameters with the authors, but based on the description and software used an IRT parameterization seemed likely. Differential item functioning analysis estimates could also not be acquired, so the aggregate results were used.

e. We selected the authors’ “once per month” binge drinking criteria for comparison of the IRT thresholds. Using the other criteria resulted in trivially different associations.

f. Combination of community, adjudicated, and clinical individuals.

g. Past year drinkers.

h. ≥ 12 drinks in past year and ever drank 5+ drinks on ≥ 1 occasion.

i. ≥ 12 drinks in past year.

j. Young adult (18-24 years) subsample only.

k. College students.

l. Age 50+ years.

m. Noncollege, age 18-25 years.

n. Adolescent and young adult drinkers (12-21 years) only.

o. Authors created a combined measure of interpersonal and legal problems criteria.

p. Tolerance severity not reported.

Table 2.

Bayesian Variance Parameter Estimates for Basic and Expanded Models Using Unstandardized and Standardized Discrimination Values.

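| Parameter | Basic model (Equation 1): median estimate (SD) | Basic model: 95% HDI | Expanded model (Equation 3): median estimate (SD) | Expanded model: 95% HDI |
|---|---|---|---|---|
| Unstandardized estimates | | | | |
| $\sigma^2_{\text{CRITERION}}$ | 0.066 (0.043) | [0.021, 0.156] | 0.031 (0.036) | [0.002, 0.107] |
| $\sigma^2_{\text{SAMPLE}}$ | 0.419 (0.099) | [0.322, 0.746] | 0.217 (0.086) | [0.073, 0.393] |
| $\sigma^2_{\text{INSTRUMENT}}$ | | | 0.105 (0.365) | [0.001, 0.803] |
| $\sigma^2_{\text{TIMEFRAME}}$ | | | 0.109 (174.3) a | [0.001, 2.57] |
| $\sigma^2_{\text{CLINICAL}}$ | | | 0.081 (0.617) | [0.002, 0.754] |
| $\sigma^2_{\text{GENDER}}$ | | | 0.035 (0.419) | [0.002, 0.448] |
| $\sigma^2_{\text{AGE}}$ | | | 0.108 (0.527) | [0.002, 0.787] |
| $\sigma^2_{\text{GROUP}}$ | | | 0.096 (0.104) | [0.002, 0.322] |
| $\sigma^2_{\text{CRITERION} \times \text{INSTRUMENT}}$ | | | 0.073 (0.024) | [0.034, 0.124] |
| $\sigma^2_{\text{CRITERION} \times \text{TIMEFRAME}}$ | | | 0.017 (0.015) | [0.002, 0.050] |
| $\sigma^2_{\text{CRITERION} \times \text{CLINICAL}}$ | | | 0.012 (0.010) | [0.002, 0.035] |
| $\sigma^2_{\text{CRITERION} \times \text{GENDER}}$ | | | 0.008 (0.006) | [0.002, 0.022] |
| $\sigma^2_{\text{CRITERION} \times \text{AGE}}$ | | | 0.006 (0.004) | [0.002, 0.015] |
| $\sigma^2_{\text{ERROR}}$ | 0.199 (0.013) | [0.174, 0.225] | 0.105 (0.011) | [0.119, 0.161] |
| ICC estimate | 0.249 | [0.105, 0.448] | 0.105 | [0.010, 0.294] |
| Standardized estimates | | | | |
| $\sigma^2_{\text{CRITERION}}$ | 0.438 (0.299) | [0.135, 1.089] | 0.189 (0.171) | [0.006, 0.543] |
| $\sigma^2_{\text{CRITERION} \times \text{INSTRUMENT}}$ | | | 0.202 (0.059) | [0.106, 0.327] |
| $\sigma^2_{\text{CRITERION} \times \text{TIMEFRAME}}$ | | | 0.021 (0.032) | [0.002, 0.085] |
| $\sigma^2_{\text{CRITERION} \times \text{CLINICAL}}$ | | | 0.036 (0.031) | [0.003, 0.104] |
| $\sigma^2_{\text{CRITERION} \times \text{GENDER}}$ | | | 0.019 (0.017) | [0.002, 0.055] |
| $\sigma^2_{\text{CRITERION} \times \text{AGE}}$ | | | 0.016 (0.013) | [0.002, 0.044] |
| $\sigma^2_{\text{ERROR}}$ | 0.582 (0.037) | [0.514, 0.656] | 0.399 (0.029) | [0.345, 0.458] |
| ICC estimate | 0.430 | [0.232, 0.676] | 0.209 | [0.011, 0.439] |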

Note. Estimates for Model 1 are based on 25,000 Markov Chain Monte Carlo (MCMC) iterations, while estimates for Model 2 are based on 75,000 MCMC iterations; all posterior distributions for the model estimates are inverse gamma distributions given that they are variance parameters. 95% HDI = 95% highest density interval.

a. The observed SD for the time frame main effect is a result of notable outliers produced over the course of MCMC sampling. After removing outliers (i.e., values greater than 50), the $\sigma^2_{\text{TIMEFRAME}}$ SD = 2.48 and is based on 74,809 iterations.

The unstandardized results for the basic model show an estimated ICC of .249 (95% HDI7 [0.105, 0.448]) when only considering two potential sources of variability in discrimination values: criteria and study-specific variance. The results for the expanded model show even lower reliability (ICC = .105, 95% HDI [0.010, 0.294]). The results for the expanded model show that the primary source of systematic unreliability in discrimination estimates was the criterion × instrument interaction (σ2 = .073, 95% HDI [0.034, 0.124]).

Further exploration of the criterion × instrument parameter showed that the types of measures that contributed significant variability to discrimination estimates (for certain criteria) were the AUDADIS, SSAGA, PRISM, SCID, CIDI, and SAMHSA, with the latter three measures showing significant effects for multiple criteria. Compared with the average discrimination estimate across all studies, studies that used the AUDADIS produced a higher discrimination value for social problems (b = .27, HDI [0.01, 0.55])8 while the SSAGA produced a higher discrimination value for role interference (b = .31, HDI [0.04, 0.59]). The PRISM produced a lower discrimination for the hazardous use criterion (b = −.61, HDI [−1.09, −0.21]). The SCID produced higher discrimination values for two criteria: hazardous use (b = .37, HDI [0.05, 0.73]) and larger/longer (b = .40, HDI [0.07, 0.76]). The CIDI produced higher discrimination estimates than the average study for both the give up activities (b = .31, HDI [0.04, 0.59]) and time spent (b = .36, HDI [0.09, 0.63]) criteria, but a lower discrimination estimate for the role interference criterion (b = −.30, HDI [−0.56, −0.04]). The SAMHSA produced higher discrimination estimates for both the role interference (b = .28, HDI [0.004, 0.56]) and social problems (b = .32, HDI [0.04, 0.60]) criteria, but a lower estimate for time spent (b = −.34, HDI [−0.62, −0.07]). Table 3 provides the posterior estimates for all criterion × instrument interactions (also see online supplementary Figure S1).

Table 3.

Posterior Median Discrimination Values for Instrument × Criterion Interactions.

Instrument  CRAV  CUT DOWN  GIVEUP  HAZUSE  HPROB  LEGAL  LRG/LONG  ROLINT  SPROB  TIME  TOL   WITD
AUDADIS     .02   −.09      .10     −.03    .11    −.12   −.03      .11     .27    −.12  −.16  −.16
CIDI        −.05  −.02      .31     −.05    .05    .01    .12       .30     −.20   .36   −.11  −.05
SSAGA       −.12  −.09      −.04    −.17    −.06   .02    −.19      .38     .06    −.03  −.14  .28
SAMHSA      .01   −.18      .00     .18     .19    .02    −.21      .28     .32    .34   −.15  −.15
PRISM       .00   .13       .13     −.61    .13    .00    .23       .11     −.14   .06   −.01  .17
SCID        .00   .02       −.07    .37     −.26   −.26   .40       −.27    −.25   .40   .15   −.08
OTHER       −.01  .07       −.03    −.19    −.03   .17    −.16      −.29    .11    .09   .05   .04

Note. Bolded effects indicate that the 95% credible interval for the point estimate does not include zero; AUDADIS-IV = Alcohol Use Disorder and Associated Disabilities Interview Schedule–IV; CIDI = Composite International Diagnostic Interview; SSAGA = Semi-Structured Assessment for the Genetics of Alcoholism; SAMHSA = Substance Abuse and Mental Health Services Administration; PRISM = Psychiatric Research Interview for Substance and Mental Disorders; SCID = Structured Clinical Interview for DSM-IV.

Despite expanding the number of variance parameters, there remained a main effect due to study-specific effects (σ² = .217, 95% HDI [0.073, 0.393]), indicating that there was notable within-study variability in discrimination estimates that was not attributable to the criteria or other sources of variability. It should be noted, however, that for some model parameters there was little variability, which limited the opportunity to examine meaningful differences. For example, many of the included samples were made up of equal parts men and women (n = 26; 50%). With regard to age, only six samples (11.54%) were composed of adolescents and only one sample was made up of elderly adults (1.92%; see online supplementary Figure S1).

Influence of $\sigma^2_{\mathrm{CRITERION}}$ Prior Distributions

Figure 3⁹ shows the prior and posterior distributions for the basic and expanded models that made use of moderately informed priors on the $\sigma^2_{\mathrm{CRITERION}}$ parameter. In both models, a value of .60 for the $\sigma^2_{\mathrm{CRITERION}}$ parameter is less likely after the data are observed. In the basic model, the ratio of densities suggests strong evidence that a value of .60 is unlikely given the data ($\mathrm{BF}_{10}$ ≈ .09). Specifically, a value of .60 is about 11 times less likely after having seen the data (i.e., 1/.09 = 11.11). Similar results were observed for the expanded model, in which the ratio of densities also suggests strong evidence that a value of .60 is unlikely given the data ($\mathrm{BF}_{10}$ ≈ .10). Unsurprisingly, the basic and expanded models with moderately informed priors produced higher median estimates for both $\sigma^2_{\mathrm{CRITERION}}$ and the ICC due to the influence of the priors on the $\sigma^2_{\mathrm{CRITERION}}$ parameter (Unstandardized Basic Model: $\sigma^2_{\mathrm{CRITERION}}$ = .15; ICC = .43; Unstandardized Expanded Model: $\sigma^2_{\mathrm{CRITERION}}$ = .14; ICC = .34).
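The density-ratio logic in the preceding paragraph can be sketched in a few lines. This is a minimal illustration of a Savage-Dickey style ratio, not the authors' implementation: the helper names are hypothetical, the posterior density is approximated here with a simple Gaussian kernel estimate over MCMC draws (an assumption; the article does not state how its densities were estimated), and the prior scale is left as a parameter because the figure note does not say whether .035 denotes a variance or a standard deviation.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def kde_density(draws, x, bandwidth):
    """Gaussian kernel density estimate of the draws, evaluated at x."""
    return sum(normal_pdf(x, d, bandwidth) for d in draws) / len(draws)


def savage_dickey(posterior_density, prior_density):
    """Ratio of posterior to prior density at the point of interest."""
    return posterior_density / prior_density


def times_less_likely(bf):
    """Reciprocal of the ratio: how much less likely the value became."""
    return 1.0 / bf
```

With the reported ratio of about .09, `times_less_likely(0.09)` reproduces the article's arithmetic (1/.09 ≈ 11.11). A ratio near 1, by contrast, means the data left the plausibility of the tested value essentially unchanged.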

Figure 3. Effects of $\sigma^2_{\mathrm{CRITERION}}$ priors on $\sigma^2_{\mathrm{CRITERION}}$ posterior distributions for basic and expanded models.

Note. The moderately informed prior is a normal distribution with a mean of .60 and relatively small variance: $\sigma^2_{\mathrm{CRITERION}} \sim N(.60, .035)$; the two points on the prior and posterior distributions indicate the point (.60) at which the Savage-Dickey ratio is computed to approximate a Bayes factor; both the prior and posterior distributions for the basic and expanded models are based on 75,000 iterations.

The strongly informed priors for the basic and expanded models yielded markedly different results. The BFs for the basic and expanded models were $\mathrm{BF}_{10}$ ≈ .97 and $\mathrm{BF}_{10}$ ≈ .90, respectively. Both BFs indicate that a hypothesis of $\sigma^2_{\mathrm{CRITERION}}$ = .60 is about as likely after having seen the data as it was before having seen the data.¹⁰ Thus, the posterior distributions for both models were effectively uninfluenced by any observed data and instead determined almost entirely by the strong priors on $\sigma^2_{\mathrm{CRITERION}}$.

Discussion

The present results show that for any randomly selected study, the reliability of AUD criteria discrimination estimates is low, though not trivial, ranging from .11 to .43 depending on the model and scaling. The results also show that diagnostic instrument was the largest source of systematic unreliability in the average discrimination estimates. Other, more commonly investigated factors, such as gender, age, diagnostic status, and diagnostic timeframe, were estimated to contribute to some degree to DIF across criteria discriminations, but these effects were less than 10% to 25% of the magnitude of the instrument effect.

The lack of consistency in discrimination estimates is not inherently problematic in cases where all criteria discriminations are high, which was the case in general (see Figure 2a). Yet, there remained significant variability in discrimination estimates across criteria that was attributable to study-specific effects in the expanded model, indicating that criterion precision, and consequently diagnostic precision, can still be improved. If such factors are acknowledged in advance, then they could reasonably be treated as true signal to be counted toward reliability (Lane et al., 2016), in which case the reported estimates for the expanded models would change, with the corresponding range indicating fair consistency (ICC range = .578-.603). In addition, if we consider the AUD IRT literature as a whole, averaging across all samples (k = 52) as in the classic ICC(3, k) case (Shrout & Fleiss, 1979), we estimate a range of .865 to .941, indicating substantial reliability. These are hypothetical estimates that indicate stability in the AUD criteria on average, but caution should be taken before interpreting them too optimistically, as most researchers can only conduct a single study at a time and do not have the luxury of pooling results across 52 samples. Indeed, the BFs derived from comparing the likelihood of specific point estimates of $\sigma^2_{\mathrm{CRITERION}}$ showed that the present data are inconsistent with a belief that the AUD criteria would be a reliable source of discrimination variability for any given study. When contrasting the results of the moderately informed priors with the strongly informed priors, one can see just how strongly one must believe in AUD criteria reliability to maintain that belief. Specifically, the belief must be strong enough to ignore all the present data. Though the results from the basic and expanded models with strong priors are not incorrect per se, the models necessitate a position that appears untenable: one must ignore the data made available in the present study.
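The contrast between single-study and averaged reliability discussed above can be illustrated with the Spearman-Brown relation, which links a single-measurement ICC to the reliability of a mean over k measurements. This is a simplified sketch under the assumption that all non-criterion variance shrinks by a factor of k when averaging; the article's own ICC(3, k) values come from its full variance decomposition, so the numbers produced here land near the reported range but need not match it exactly. The function name `spearman_brown` is hypothetical.

```python
def spearman_brown(icc_single, k):
    """Reliability of the mean of k measurements given a single-measurement ICC.

    Assumes every source of non-target variance is divided by k when averaging.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * icc_single / (1.0 + (k - 1) * icc_single)


# Single-study ICCs reported for the expanded and basic models, averaged over
# the k = 52 samples mentioned in the text.
for icc in (0.105, 0.249):
    print(f"ICC(single) = {icc:.3f} -> ICC(k = 52) = {spearman_brown(icc, 52):.3f}")
```

The sketch makes the practical point of the paragraph concrete: a reliability that looks substantial for an average over 52 samples corresponds to a low reliability for any one study.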

Practically speaking, the present findings for discrimination estimates suggest that the ordering of AUD criteria by how well they differentiate between individuals, given their standing on the latent AUD spectrum, is unlikely to be consistent from study to study. When considering the generally low reliability of discrimination estimates alongside the similarly low reliability observed for threshold estimates (Lane et al., 2016), the current results underscore the fact that discrimination/threshold estimates based on AUD criteria from a given study are unlikely to generalize to a separate IRT study of AUD criteria. These results have important implications for conceptualizations of AUD as a clinical construct. They also likely extend to other latent variable models of psychopathology to the extent to which consistency of factor structure is overlooked and otherwise judged by factor numeracy and/or item/factor specificity (Watts et al., 2020).

Implications for Clinical Assessment of AUD

One of the more important implications of the present study is that the choice of AUD assessment instrument may lead to substantively different conclusions regarding AUD diagnoses. For example, the SAMHSA and SSAGA instruments produced higher discrimination estimates for the role interference criterion compared with other measures. In turn, these measures assign greater weight to role interference as an indicator of the underlying AUD spectrum than do other measures. On the other hand, the CIDI produced a lower discrimination estimate for the role interference criterion. Ultimately, the results suggest that what may seem like an inconsequential decision (i.e., which assessment to choose for AUD) can result in important consequences for any given study’s understanding of the underlying disorder. The results displayed in Table 3 also offer information to clinicians and researchers about the strengths and/or limitations of particular AUD instruments in relation to their abilities to discriminate between individuals at various levels of the AUD latent spectrum. The data do not indicate that any one measure is better or worse overall than the others. Instead, the results show that depending on what researchers or clinicians may be most interested in, some measures may be better suited for their interests compared with others.

For example, if a researcher is interested in ensuring that the criteria of tolerance and withdrawal reliably discriminate between individuals at differing levels of the AUD spectrum, the present results show that there are no meaningful differences across measures in assessing the discriminatory ability of tolerance and withdrawal criteria. Thus, the researcher can be relatively confident that the assessment instrument they choose will not result in systematic bias with regard to those particular criteria. However, in cases where a researcher is interested in reliably assessing role interference as a means to discriminate between individuals, the data have clear suggestions. The SSAGA and SAMHSA both provide relatively higher discrimination estimates for the role interference criterion, while the CIDI provides relatively lower estimates.

Last, the present results provide important information that can inform both past and future IRT studies on AUD, as well as other disorders. As previously noted, there have been inconsistent findings regarding which AUD criteria have displayed DIF when examining discrimination estimates (e.g., Saha et al., 2006; Harford et al., 2009). Our results suggest that systematic sources of variability (e.g., measurement) may help explain why discrimination values from a given study do not generalize to another study. This point is particularly important given the influence IRT studies have had on refining the conceptualization and diagnosis of AUD. For example, IRT studies provided strong evidence that the AUD criteria assessed a single latent continuum, leading to the distinction between abuse and dependence criteria being eliminated in the DSM-5 (Hasin et al., 2013). Furthermore, IRT studies are frequently cited as reasons for eliminating the “legal trouble” criterion from the AUD criteria set, after such studies showed that this criterion was uninformative regarding an individual’s standing on the latent AUD spectrum relative to the other AUD criteria (Saha et al., 2006). Thus, though the utility of IRT analyses is clear, our results highlight that more work can be done to examine the generalizability of IRT findings from a given study more broadly. One important step would be to directly model the factors that may influence estimation variability, which could increase knowledge about the generalizability of the findings.

Implications for Clinical Assessment More Broadly

Though we have focused on AUD criteria in the present article, we believe that the analytical framework we utilized can also be extended to other clinical constructs to examine important clinical assessment questions. For example, researchers have argued that emotion dysregulation ought to be considered the core feature of borderline personality disorder (BPD; Glenn & Klonsky, 2009). As such, clinicians and researchers are likely to want to ensure that BPD assessment instruments contain emotion dysregulation items/criteria that reliably distinguish between individuals at varying levels of the BPD spectrum. Leveraging the current methods to explore these empirical questions may provide important information for clinicians and researchers seeking to use instruments that reliably assess core features of various disorders. Furthermore, the current methods may be useful to future IRT work on diagnostic criteria given that interest in replicability issues within clinical psychology research continues to grow (Tackett et al., 2017; Tackett et al., 2019).

It is also worthwhile to contrast the present results with previous analyses examining AUD criteria threshold estimates (Lane et al., 2016). Though the discrimination and threshold analyses found 12 and 13 noteworthy random effects, respectively (out of 84; 12 criteria × 7 instruments), when examining the criteria × instrument interaction, there was only one overlapping effect. In the original meta-analysis of threshold estimates, 5 of the 13 significant effects were found for the AUDADIS, whereas only one was found for the AUDADIS with regard to discrimination estimates, and for a separate criterion. The only overlap observed was for the SAMHSA and the time spent criterion: both threshold and discrimination estimates were lower than in the average study. Taken together, the combined meta-analytic results suggest that both threshold and discrimination estimates are in part a function of the AUD diagnostic instrument used, and the effects of diagnostic instruments span a diverse range of AUD criteria.

Limitations

There are some important limitations in the present study. First, there are other potential sources of variability in discrimination estimates that were not explored. For example, ethnicity has been examined as a source of DIF, with previous work finding evidence of DIF across different ethnic groups (e.g., Saha et al., 2006). However, there was little variability in ethnic makeup in the included samples and other options (e.g., a dichotomous comparison of White vs. non-White samples) would likely mask important differences across ethnic groups (Sue, 2001). An additional source of variability not explicitly accounted for is study quality. Our analyses included individual study sample as a random factor, which spans large-scale epidemiological studies through more idiosyncratic investigations of smaller clinical groups. The analyses also included research group as a random factor, which would capture whether some groups may produce higher quality, or “more significant,” results than others, or in higher impact journals. We did not code factors of study quality, and the variables we coded were not meant to be indicators as such. The present results should be considered with these caveats in mind.

Second, of the parameters included in the expanded models, some showed little variability (e.g., gender and age). Thus, despite their inclusion, the estimates for these parameters should be interpreted with caution. Though our results are generally consistent with DIF across gender and age as reported in previous investigations (e.g., larger/longer, withdrawal; see online supplementary Figure S2), as more data become available, it may be possible to more fully and precisely explore the influence of such factors on discrimination estimates.

Conclusion

There is generally low, but nontrivial consistency in AUD criteria discriminations within individual studies. This may underlie why DIF for various factors across individual criteria has been inconsistent. Nevertheless, averaging across all available data, there appears to be consistent structure in the discriminations of the set of AUD criteria. In other words, when one averages across studies, the AUD criteria do produce a consistent structure with respect to the overall variability observed in discrimination estimates, suggesting consistent rank-order stability in discriminations among the criteria. Thus, at the average level, the criteria are consistent in their abilities to differentiate between persons at varying levels along the AUD spectrum. This represents a silver lining in that an identifiable core structure is recoverable; but our primary analyses illustrate that such results cannot be expected in individual studies. It also ignores the considerable heterogeneity that exists in the latent AUD construct (Lane et al., 2016). We still found evidence of DIF as a function of established factors (e.g., age, gender), providing meta-analytic support on average, but the magnitudes were small in comparison with the effect of instrument. Thus, consideration of the assessment instrument, in addition to more common factors, remains a critical open issue regarding the inferred comparability of AUD diagnosis and severity across individuals and contexts.

Funding

The author(s) received the following financial support for the research, authorship, and/or publication of this article: Sean Lane was supported through funding provided by the National Institutes of Health (Grant R01AA027264).

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

1. Though a full review of the development of the criteria for AUD is outside the scope of the present article, see O’Brien (2011) and Hasin et al. (2013) for overviews of this topic including changes in DSM-5.

2. The methods in the current study were conducted in line with current Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Moher et al., 2009). The PRISMA guidelines are a checklist of 27 items deemed essential for inclusion in any given systematic review or meta-analysis. The guidelines provide an overview of information to be included within each section of the review manuscript, including the abstract, introduction, methods, results, discussion, and disclosure of funding sources. The ultimate goal of the PRISMA guidelines is to provide criteria for systematic reviews that will increase the transparency and quality of the overall report.

3. We elected to bin the potential moderators for a given analysis instead of coding them continuously (e.g., % women, average age) because precise information was not provided by a majority of studies. This choice allowed us to minimize missingness and fit the analysis into the random effects generalizability theory (GT) framework.

4. A more in-depth overview of how our primary analyses were adapted to a Bayesian framework can be found in the online supplementary material. Specifically, the online supplementary material provides information on how Equations 1 and 3 were represented in a hierarchical Bayesian framework as well as information on the Markov Chain Monte Carlo (MCMC) sampling methods. All analyses were conducted using SAS software’s (Version 9.4) PROC MCMC method (SAS Institute, 2013).

5. We chose the $\sigma^2_{\mathrm{CRITERION}}$ parameter for these analyses for two reasons. First, there is likely a diversity of beliefs regarding whether the parameter estimate would be small, moderate, or large. Second, if the value for the $\sigma^2_{\mathrm{CRITERION}}$ parameter is large, then the ICC estimate will also increase (see Equations 2 and 4), which indicates greater consistency in the AUD criteria set discrimination estimates across studies.

6. All results are derived from the Bayesian models that utilized uninformed priors for the variance parameters, unless otherwise noted.

7. The 95% HDI (highest density interval) spans 95% of the posterior distribution, and values within the interval are considered more probable than those outside the interval. The HDI also conveys the degree of confidence in an estimate, with wider intervals indicating less certainty in the estimated value.

8. Median posterior estimates are reported along with the 95% HDI.

9. Figure 3 was constructed using the “ggmcmc” (Fernández-i-Marín, 2016) and “ggplot2” (Wickham, 2009) packages in R (R Core Team, 2017).

10. See online supplementary Figure S2 for the plotted results.

References

  1. American Psychiatric Association. (1942). Diagnostic and statistical manual for mental disorders (1st ed.). Author.
  2. American Psychiatric Association. (1952). Diagnostic and statistical manual for mental disorders (2nd ed.). Author.
  3. American Psychiatric Association. (1994). Diagnostic and statistical manual for mental disorders (4th ed.). Author.
  4. American Psychiatric Association. (2013). Diagnostic and statistical manual for mental disorders (5th ed.). Author. 10.1176/appi.books.9780890425596
  5. Balsis S, Gleason ME, Woods CM, & Oltmanns TF (2007). An item response theory analysis of DSM-IV personality disorder criteria across younger and older age groups. Psychology and Aging, 22(1), 171–185. 10.1037/0882-7974.22.1.171
  6. *Beseler CL, Taylor LA, & Leeman RF (2010). An item-response theory analysis of DSM-IV alcohol-use disorder criteria and “binge” drinking in undergraduates. Journal of Studies on Alcohol and Drugs, 71(3), 418–423.
  7. *Bond J, Ye Y, Cherpitel CJ, Borges G, Cremonte M, Moskalewicz J, & Swiatkiewicz G (2012). Scaling properties of the combined ICD-10 dependence and harms criteria and comparisons with DSM-5 alcohol use disorder criteria among patients in the emergency department. Journal of Studies on Alcohol and Drugs, 73(2), 328–336.
  8. *Borges G, Cherpitel CJ, Ye Y, Bond J, Cremonte M, Moskalewicz J, & Swiatkiewicz G (2011). Threshold and optimal cut-points for alcohol use disorders among patients in the emergency department. Alcoholism: Clinical & Experimental Research, 35(7), 1270–1276.
  9. *Borges G, Ye Y, Bond J, Cherpitel CJ, Cremonte M, Moskalewicz GS, & Maritza R (2010). The dimensionality of alcohol use disorders and alcohol consumption in a cross-national perspective. Addiction, 105(2), 240–254. 10.1111/j.1360-0443.2009.02778.x
  10. Brennan RL (1992). Generalizability theory. Educational Measurement: Issues and Practice, 11(4), 27–34.
  11. *Caetano R, Vaeth PA, Mills B, & Canino G (2016). Employment status, depression, drinking, and alcohol use disorder in Puerto Rico. Alcoholism: Clinical and Experimental Research, 40(4), 806–815.
  12. *Casey M, Adamson G, Shevlin M, & McKinney A (2012). The role of craving in AUDs: Dimensionality and differential functioning in the DSM-5. Drug and Alcohol Dependence, 125(1-2), 75–80.
  13. *Castaldelli-Maia JM, Wang YP, Borges G, Silveira CM, Siu ER, Viana MC, Andrade AG, Martins SS, & Andrade LH (2015). Investigating dimensionality and measurement bias of DSM-5 alcohol use disorder in a representative sample of the largest metropolitan area in South America. Drug and Alcohol Dependence, 152, 123–130.
  14. *Cherpitel CJ, Borges G, Ye Y, Bond J, Cremonte M, Moskalewicz J, & Swiatkiewicz G (2010). Performance of a craving criterion in DSM alcohol use disorders. Journal of Studies on Alcohol and Drugs, 71(5), 674–684.
  15. Chung T, & Martin CS (2002). Concurrent and discriminant validity of DSM-IV symptoms of impaired control over alcohol consumption in adolescents. Alcoholism: Clinical & Experimental Research, 26(4), 485–492. 10.1111/j.1530-0277.2002.tb02565.x
  16. Cooper LD, & Balsis S (2009). When less is more: How fewer diagnostic criteria can indicate greater severity. Psychological Assessment, 21(3), 285–293.
  17. *Dawson DA, Saha TD, & Grant BF (2010). A multidimensional assessment of the validity and utility of alcohol use disorder severity as determined by item response theory models. Drug and Alcohol Dependence, 107(1), 31–38.
  18. *Derringer J, Krueger RF, Dick DM, Agrawal A, Bucholz KK, Foroud T, Grucza RA, Hesselbrock MN, Hesselbrock V, Kramer J, Nurnberger JI Jr., Schuckit M, Bierut LJ, Iacono WG, & McGue M (2013). Measurement invariance of DSM-IV alcohol, marijuana and cocaine dependence between community-sampled and clinically over-selected studies. Addiction, 108(10), 1767–1776.
  19. *Duncan AE, Agrawal A, Bucholz KK, Sartor CE, Madden PA, & Heath AC (2011). Deconstructing the architecture of alcohol abuse and dependence symptoms in a community sample of late adolescent and emerging adult women: An item response approach. Drug and Alcohol Dependence, 116(1-3), 222–227.
  20. DuMouchel W (1990). Bayesian meta-analysis. Statistical Methodology in the Pharmaceutical Sciences, 10(4), 509–529. 10.1177/096228020101000404
  21. *Edwards AC, Gillespie NA, Aggen SH, & Kendler KS (2013). Assessment of a modified DSM-5 diagnosis of alcohol use disorder in a genetically informative population. Alcoholism: Clinical and Experimental Research, 37(3), 443–451.
  22. *Ehlke SJ, Hagman BT, & Cohn AM (2012). Modeling the dimensionality of DSM-IV alcohol use disorder criteria in a nationally representative sample of college students. Substance Use & Misuse, 47(10), 1073–1085.
  23. Fernández-i-Marín X (2016). ggmcmc: Analysis of MCMC samples and Bayesian inference. Journal of Statistical Software, 70(9), 1–20. 10.18637/jss.v070.i09
  24. *Gelhorn H, Hartman C, Sakai J, Stallings M, Young S, Rhee S, Corley R, Hewitt J, Hopfer C, & Crowley T (2008). Toward DSM-V: An item response theory analysis of the diagnostic process for DSM-IV alcohol abuse and dependence in adolescents. Journal of the American Academy of Child & Adolescent Psychiatry, 47(11), 1329–1339.
  25. *Gilder DA, Gizer IR, & Ehlers CL (2011). Item response theory analysis of binge drinking and its relationship to lifetime alcohol use disorder symptom severity in an American Indian community sample. Alcoholism: Clinical & Experimental Research, 35(5), 984–995.
  26. Glenn CR, & Klonsky ED (2009). Emotion dysregulation as a core feature of borderline personality disorder. Journal of Personality Disorders, 23(1), 20–28.
  27. *Hagman BT (2017). Development and psychometric analysis of the Brief DSM–5 Alcohol Use Disorder Diagnostic Assessment: Towards effective diagnosis in college students. Psychology of Addictive Behaviors, 31(7), 797–806. 10.1037/adb0000320
  28. *Hagman BT, & Cohn AM (2011). Toward DSM-V: Mapping the alcohol use disorder continuum in college students. Drug and Alcohol Dependence, 118(2-3), 202–208.
  29. *Hagman BT, & Cohn AM (2013). Using latent variable techniques to understand DSM-IV alcohol use disorder criteria functioning. American Journal of Health Behavior, 37(4), 565–574.
  30. *Harford TC, Hsiao-ye Y, Faden VB, & Chen CM (2009). The dimensionality of DSM-IV alcohol use disorders among adolescents and adult drinkers and symptom patterns by age, gender, and race/ethnicity. Alcoholism: Clinical & Experimental Research, 33(5), 868–878. 10.1111/j.1530-0277.2009.00910.x
  31. *Hasin DS, Fenton MC, Beseler C, Park JY, & Wall MM (2012). Analyses related to the development of DSM-5 criteria for substance use related disorders: 2. Proposed DSM-5 criteria for alcohol, cannabis, cocaine and heroin disorders in 663 substance abuse patients. Drug and Alcohol Dependence, 122(1-2), 28–37.
  32. Hasin DS, O’Brien CP, Auriacombe M, Borges G, Bucholz K, Budney A, Compton WM, Crowley T, Ling W, Petry NM, Schuckit M, & Grant BF (2013). DSM-5 criteria for substance use disorders: Recommendations and rationale. American Journal of Psychiatry, 170(8), 834–851. 10.1176/appi.ajp.2013.12060782
  33. *Keyes KM, Krueger RF, Grant BF, & Hasin DS (2011). Alcohol craving and the dimensionality of alcohol disorders. Psychological Medicine, 41(3), 629–640.
  34. Kruschke JK, & Liddell TM (2018). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and planning from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1), 178–206. 10.3758/s13423-016-1221-4
  35. *Kuerbis AN, Hagman BT, & Morgenstern J (2013a). Alcohol use disorders among substance dependent women on temporary assistance with needy families: More information for diagnostic modifications for DSM-5. The American Journal on Addictions, 22(4), 402–410.
  36. *Kuerbis AN, Hagman BT, & Sacco P (2013b). Functioning of alcohol use disorders criteria among middle-aged and older adults: Implications for DSM-5. Substance Use & Misuse, 48(4), 309–322.
  37. Lane SP, Steinley D, & Sher KJ (2016). Meta-analysis of DSM alcohol use disorder criteria severities: Structural consistency is only “skin deep”. Psychological Medicine, 46(8), 1769–1784. 10.1017/S0033291716000404 [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. *Lane SP, & Sher KJ (2015). Limits of current approaches to diagnosis severity based on criterion counts: An example with DSM-5 alcohol use disorder. Clinical Psychological Science, 3(6), 819–835. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. *Langenbucher JW, Labouvie E, Martin CS, Sanjuan PM, Bavly L, Kirisci L, & Chung T (2004). An application of item response theory analysis to alcohol, cannabis, and cocaine criteria in DSM-IV. Journal of Abnormal Psychology, 113(1), 72–80. [DOI] [PubMed] [Google Scholar]
  40. *Martin CS, Chung T, Kirisci L, & Langenbucher JW (2006). Item response theory analysis of diagnostic criteria for alcohol and cannabis use disorders in adolescents: Implications for DSM-V. Journal of Abnormal Psychology, 115(4), 807–814. 10.1037/0021-843X.115.4.807 [DOI] [PubMed] [Google Scholar]
  41. Martin CS, Langenbucher JW, Chung T, Kenneth J, & Louis S (2014). Truth or consequences in the diagnosis of substance use disorders. Addiction, 109(11), 1773–1778. 10.1111/add.12615 [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Martin CS, Sher KJ, & Chung T (2011). Hazardous use should not be a diagnostic criterion for substance use disorders in DSM-5. Journal of Studies on Alcohol and Drugs, 72(4), 685–686. 10.15288/jsad.2011.72.685 [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. *McCutcheon VV, Agrawal A, Heath AC, Edenberg HJ, Hesselbrock VM, Schuckit MA, Kramer JR, & Bucholz KK (2011). Functioning of alcohol use disorder criteria among men and women with arrests for driving under the influence of alcohol. Alcoholism: Clinical & Experimental Research, 35(11), 1985–1993. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. *Mewton L, Slade T, McBride O, Grove R, & Teesson M (2011a). An evaluation of the proposed SM-5 alcohol use disorder criteria using Australian national data. Addiction, 106(5), 941–950. [DOI] [PubMed] [Google Scholar]
  45. *Mewton L, Teesson M, Slade T, & Cottler L (2011b). Psychometric performance of DSM-IV alcohol use disorders in young adulthood: Evidence from an Australian general population sample. Journal of Studies on Alcohol and Drugs, 72(5), 811–822. [DOI] [PubMed] [Google Scholar]
  46. Moher D, Liberati A, Tetzlaff J, Altman DG, & The PRISMA Group (2009). Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. PLoS Medicine, 6(7), e1000097. 10.1371/journal.pmed.1000097 [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. O’Brien C (2011). Addiction and dependence in DSM-V. Addiction, 106(5), 866–867. 10.1111/j.1360-0443.2010.03144.x [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. *Preuss UW, Watzke S, & Wurst FM (2014). Dimensionality and stages of severity of DSM-5 criteria in an international sample of alcohol-consuming individuals. Psychological Medicine, 44(15), 3303–3314. [DOI] [PubMed] [Google Scholar]
  49. *Proudfoot H, Baillie AJ, & Teesson M (2006). The structure of alcohol dependence in the community. Drug and Alcohol Dependence, 81(1), 21–26. [DOI] [PubMed] [Google Scholar]
  50. R Core Team. (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/ [Google Scholar]
  51. Reise SP, & Waller NG (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27–48. [DOI] [PubMed] [Google Scholar]
  52. *Rose JS, Lee CT, Selya AS, & Dierker LC (2012). DSM-IV alcohol abuse and dependence criteria characteristics for recent onset adolescent drinkers. Drug and Alcohol Dependence, 124(1-2), 88–94. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. *Saha TD, Chou SP, & Grant BF (2006). Toward an alcohol use disorder continuum using item response theory: Results from the National Epidemiologic Survey on Alcohol and Related Conditions. Psychological Medicine, 36(7), 931–941. 10.1017/S003329170600746X [DOI] [PubMed] [Google Scholar]
  54. *Saha TD, Stinson FS, & Grant BF (2007). The role of alcohol consumption in future classifications of alcohol use disorders. Drug and Alcohol Dependence, 89(1), 82–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. SAS Institute Inc. (2013). SAS/STAT User’s Guide, Version 9.4. Cary, NC: SAS Institute Inc. [Google Scholar]
  56. Sayette MA (2016). The role of craving in substance use disorders: Theoretical and methodological issues. Annual Review of Clinical Psychology, 12, 407–433. 10.1146/annurev-clinpsy-021815-093351 [DOI] [PubMed] [Google Scholar]
  57. *Shmulewitz D, Keyes KM, Beseler C, Aharonovich E, Aivadyan C, Spivak B, & Hasin D (2010). The dimensionality of alcohol use disorders: Results from Israel. Drug and Alcohol Dependence, 111(1-2), 146–154. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Shrout PE, & Fleiss JL (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. 10.1037/0033-2909.86.2.420 [DOI] [PubMed] [Google Scholar]
  59. *Srisurapanont M, Kittiratanapaiboon P, Likhitsathian S, Kongsuk T, Suttajit S, & Junsirimongkol B (2012). Patterns of alcohol dependence in Thai drinkers: A differential item functioning analysis of gender and age bias. Addictive Behaviors, 37(2), 173–178. [DOI] [PubMed] [Google Scholar]
  60. Sue DW (2001). Multidimensional facets of cultural competence. The Counseling Psychologist, 29(6), 790–821. 10.1177/0011000001296002 [DOI] [Google Scholar]
  61. Tackett JL, Brandes CM, King KM, & Markon KE (2019). Psychology’s replication crisis and clinical psychological science. Annual Review of Clinical Psychology, 15, 579–604. 10.1146/annurev-clinpsy-050718-095710 [DOI] [PubMed] [Google Scholar]
  62. Tackett JL, Lilienfeld SO, Patrick CJ, Johnson SL, Krueger RF, Miller JD, Oltmanns TF, & Shrout PE (2017). It’s time to broaden the replicability conversation: Thoughts for and from clinical psychological science. Perspectives on Psychological Science, 12(5), 742–756. 10.1177/1745691617690042 [DOI] [PubMed] [Google Scholar]
  63. Uebelacker LA, Strong D, Weinstock LM, & Miller IW (2009). Use of item response theory to understand differential functioning of DSM-IV major depression symptoms by race, ethnicity and gender. Psychological Medicine, 39(4), 591–601. [DOI] [PubMed] [Google Scholar]
  64. Wagenmakers EJ, Lodewyckx T, Kuriyal H, & Grasman R (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage-Dickey method. Cognitive Psychology, 60(3), 158–189. 10.1016/j.cogpsych.2009.12.001 [DOI] [PubMed] [Google Scholar]
  65. Wagenmakers EJ, Morey RD, & Lee MD (2016). Bayesian benefits for the pragmatic researcher. Current Directions in Psychological Science, 25(3), 169–176. 10.1177/0963721416643289 [DOI] [Google Scholar]
  66. Watts AL, Lane SP, Bonifay W, Steinley D, & Meyer FAC (2020). Building theories on top of, and not independent of, statistical models: The case of the p-factor. Psychological Inquiry. 10.31234/osf.io/3vsey [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Wickham H (2009). ggplot2: Elegant graphics for data analysis. Springer. 10.1007/978-0-387-98141-3 [DOI] [Google Scholar]
  68. Witkiewitz K, Hallgren KA, O’Sickey AJ, Roos CR, & Maisto SA (2016). Reproducibility and differential item functioning of the alcohol dependence syndrome construct across four alcohol treatment studies: An integrative data analysis. Drug and Alcohol Dependence, 158, 86–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  69. *Wu LT, Blazer DG, Woody GE, Burchett B, Yang C, Pan JJ, & Ling W (2012). Alcohol and drug dependence symptom items as brief screeners for substance use disorders: Results from the Clinical Trials Network. Journal of Psychiatric Research, 46(3), 360–369. [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. *Wu LT, Pan JJ, Blazer DG, Tai B, Stitzer ML, Brooner RK, Woody GE, Patkar AA, & Blaine JD (2009). An item response theory modeling of alcohol and marijuana dependences: A National Drug Abuse Treatment Clinical Trials Network study. Journal of Studies on Alcohol and Drugs, 70, 414–425. [DOI] [PMC free article] [PubMed] [Google Scholar]
