Abstract
Criterion-related validation of diagnostic test scores for a construct of interest is complicated by the unavailability of the construct directly. The standard method, Known Group Validation, assumes an infallible reference test in place of the construct, but infallible reference tests are rare. In contrast, Mixed Group Validation allows for a fallible reference test, but has been found to make strong assumptions not appropriate for the majority of diagnostic test validation studies. The Neighborhood model is adapted for the purpose of diagnostic test validation, which makes alternate, but also strong, assumptions. The statistical properties of the Neighborhood model are evaluated and the assumptions are reviewed in the context of diagnostic test validation. Alternatively, strong assumptions may be avoided by estimating only intervals for the validity estimates, instead of point estimates. The Method of Bounds is also adapted for the purpose of diagnostic test validation, and an extension, Method of Bounds–Test Validation, is introduced here for the first time. All three point-estimate methods were found to make strong assumptions concerning the conditional relationships between the tests and the construct of interest, and all three lack robustness to assumption violation. The Method of Bounds–Test Validation was found to perform well across a range of plausible simulated datasets where the point-estimate methods failed. The point-estimate methods are recommended in special cases where the assumptions can be justified, while the interval methods are appropriate more generally.
Keywords: diagnostic tests, test validation, mixed group validation, sensitivity, specificity, gold standard
Criterion-related validity is important and often central to supporting the intended interpretation and use of diagnostic test scores (Meyer et al., 2001), and hence the validity of the diagnostic test scores (American Educational Research Association, American Psychological Association, and National Council of Measurement in Education [AERA, APA, NCME], 2014). However, criterion-related validation studies for diagnostic test scores in psychology typically involve empirically validating the test scores against one criterion, such as clinical diagnosis, while interpreting the results as supporting the use of the test scores as a measure of a different criterion, such as the psychopathological construct. These two criteria are assumed equivalent, but this is rarely true.
Consider a typical psychological construct of interest, C, such as a psychopathological construct defined by the Diagnostic and Statistical Manual of Mental Disorders (5th ed.; DSM-5; APA, 2013). Constructs in psychopathology are often defined as dichotomous in that people either have the construct of interest (cases; C = 1) or do not (noncases; C = 0). While psychopathological constructs may be better understood as dimensions as opposed to dichotomous variables (Borsboom et al., 2016; Widiger & Samuel, 2005), dichotomous constructs remain prominent in the primary diagnostic systems (APA, 2013; World Health Organization, 1992), and categorization can have utility to clinicians and practitioners even if dimensions represent the constructs better (Kamphuis & Noordhof, 2009).
A typical diagnostic test X provides two possible outcomes, positive (X = 1) and negative (X = 0). The criterion-related validity of the diagnostic test scores for the construct of interest is given by the sensitivity and specificity of the test scores (Altman & Bland, 1994a; Yerushalmy, 1947). Sensitivity is the probability that a case will receive a positive outcome on the diagnostic test, or Pr(X = 1|C = 1). Specificity is the probability that a non-case will receive a negative outcome on the diagnostic test, or Pr(X = 0|C = 0). Here, the nonspecificity Pr(X = 1|C = 0), which is (1 – specificity), is often used to simplify the equations. Dichotomous classification of abnormal behavior is often required to inform common dichotomous decisions, such as whether to treat, refer, or insure patients (Bowden, Harrison, & Loring, 2014) or whether to classify a defendant as psychotic or malingering to assess criminal responsibility (Rogers, 2008). Criterion-related validity for the construct, with other validity evidence, supports the use of the dichotomous classification to inform these decisions (AERA, APA, NCME, 2014).
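As a minimal illustration of these definitions, the sketch below computes sensitivity, specificity, and nonspecificity from a 2 × 2 table of counts. It assumes, hypothetically, that true construct status is observable, which is exactly what is not available in practice; the function name and the counts are illustrative only.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Return (sensitivity, specificity, nonspecificity) from 2 x 2 counts.

    tp: cases (C = 1) testing positive (X = 1)
    fn: cases (C = 1) testing negative (X = 0)
    fp: noncases (C = 0) testing positive (X = 1)
    tn: noncases (C = 0) testing negative (X = 0)
    """
    sensitivity = tp / (tp + fn)      # Pr(X = 1 | C = 1)
    specificity = tn / (tn + fp)      # Pr(X = 0 | C = 0)
    nonspecificity = 1 - specificity  # Pr(X = 1 | C = 0)
    return sensitivity, specificity, nonspecificity


# Hypothetical counts, chosen only for illustration.
sens, spec, nonspec = sensitivity_specificity(tp=70, fn=30, fp=30, tn=70)
print(sens, spec, nonspec)
```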
While the validity argument to support the use of the test scores involves criterion-related validity for the construct of interest, the construct itself is not available for empirical validation studies. Instead, diagnostic test validation studies typically estimate criterion-related validity of the diagnostic test scores against another test, the reference test R, which also typically involves two possible outcomes, positive (R = 1) and negative (R = 0). In many studies, the reference test is clinical diagnosis for the construct of interest. An inconsistency arises as the validity argument is based on criterion-related validity for the construct, but the criterion-related validity for the reference test is observed instead.
Known Group Validation
Almost all diagnostic test validation studies use the method of Known Group Validation. The defining assumption of Known Group Validation is that the reference test is infallible or a perfect measure for the construct. In other words, the reference test is assumed to have perfect validity for the construct, so
$$\Pr(R = 1 \mid C = 1) = \Pr(R = 0 \mid C = 0) = 1. \tag{1}$$
With an infallible reference test, the reference test can be used in the place of the construct of interest, and the sensitivity and nonspecificity of the test to be validated are simply
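$$\Pr(X = 1 \mid C = 1) = \Pr(X = 1 \mid R = 1), \tag{2}$$

$$\Pr(X = 1 \mid C = 0) = \Pr(X = 1 \mid R = 0). \tag{3}$$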
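Let x, r, and c be values of X, R, and C, respectively. The asymptotic standard error on the estimate for Pr(X = 1|C = c) is equal to the asymptotic standard error on the estimate for Pr(X = 1|R = r) with r = c, which is

$$\mathrm{SE}\big[\widehat{\Pr}(X = 1 \mid R = r)\big] = \sqrt{\frac{\widehat{\Pr}(X = 1 \mid R = r)\,\big[1 - \widehat{\Pr}(X = 1 \mid R = r)\big]}{n\,\widehat{\Pr}(R = r)}}, \tag{4}$$

where $\widehat{\Pr}(\cdot)$ are estimated probabilities and n is the total sample size.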
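The Known Group Validation calculation can be sketched as follows. This is a minimal sketch, assuming the usual binomial standard error within each reference-test group (the group size equals n times the estimated Pr(R = r)); the function name and the counts are hypothetical.

```python
import math


def kgv_estimate(n_test_positive, n_group):
    """Known Group Validation estimate of Pr(X = 1 | R = r) for one
    reference-test group, with its asymptotic binomial standard error.

    n_test_positive: group members with X = 1
    n_group: size of the group with R = r (n times the estimated Pr(R = r))
    """
    p_hat = n_test_positive / n_group
    se = math.sqrt(p_hat * (1 - p_hat) / n_group)
    return p_hat, se


# Hypothetical study: 100 reference-positive and 400 reference-negative people.
kgv_sens, se_sens = kgv_estimate(n_test_positive=66, n_group=100)
kgv_nonspec, se_nonspec = kgv_estimate(n_test_positive=120, n_group=400)
print(kgv_sens, se_sens, kgv_nonspec, se_nonspec)
```

Under Known Group Validation, the first pair is read as sensitivity and the second as nonspecificity, which is only justified if the reference test is infallible.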
Of course, in practice, the reference test is rarely infallible (Knottnerus, van Weel, & Muris, 2002; Kraemer, 2014). To the extent the reference test is fallible, the diagnostic study may be inaccurate or biased (Valenstein, 1990).
While the fact that most diagnostic test validation studies do not use an infallible reference test is non-controversial, and while estimates of the error rates of the reference test are often available based on prior validation studies, the assumption of an infallible reference test is generally made out of necessity because methods that facilitate incorporating information about the fallibility of the reference test are not well known. The present article focuses on methods that can account for the fallibility of the reference test when the reference test error rates are known based on previous studies or sources.
Mixed Group Validation
The problem that reference tests are rarely infallible has long been recognized, and the first method popularized in psychology as an alternative was introduced by Dawes and Meehl (1966). More recently, Mixed Group Validation and related methods have received a resurgence of interest in psychology (e.g., Frederick & Bowden, 2009a, 2009b; Jewsbury & Bowden, 2013, 2014; Mossman et al., 2015; Mossman, Wygant, & Gervais, 2012; Ortega, Labrenz, Markowitsch, & Piefke, 2013; Thomas, Lanyon, & Millsap, 2009; Tolin, Steenkamp, Marx, & Litz, 2010).
Mixed Group Validation studies are similar to Known Group Validation studies, in that they have a test to be validated, a reference test, and a construct. However, instead of assuming the reference test is infallible, the defining assumption of Mixed Group Validation is that the reference test and test to be validated are conditionally independent given the true construct status (Dawes & Meehl, 1966; Jewsbury & Bowden, 2013). That is,
$$\Pr(X = x \mid R = r, C = c) = \Pr(X = x \mid C = c). \tag{5}$$
While Mixed Group Validation does not require that the reference test is infallible, the method does require that the validity of the reference test and the prevalence of the construct of interest, Pr(C = 1), are known, so that estimates of Pr(C = 1|R = 1) and Pr(C = 1|R = 0) can be calculated (with formulas given in Altman and Bland, 1994b).
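The conversion from reference test validity and prevalence to Pr(C = 1|R = r) is an application of Bayes' theorem. The sketch below is one way to write it, assuming the reference test sensitivity and specificity are known; the function and argument names are illustrative, not taken from the cited source.

```python
def construct_given_reference(prevalence, ref_sens, ref_spec):
    """Return (Pr(C = 1 | R = 1), Pr(C = 1 | R = 0), Pr(R = 1)).

    prevalence: Pr(C = 1)
    ref_sens:   Pr(R = 1 | C = 1)
    ref_spec:   Pr(R = 0 | C = 0)
    """
    # Total probability of a positive reference test outcome.
    p_r1 = ref_sens * prevalence + (1 - ref_spec) * (1 - prevalence)
    # Bayes' theorem within each reference-test group.
    p_c1_r1 = ref_sens * prevalence / p_r1
    p_c1_r0 = (1 - ref_sens) * prevalence / (1 - p_r1)
    return p_c1_r1, p_c1_r0, p_r1


# Illustrative values: prevalence .05, reference sensitivity and specificity .80.
p_c1_r1, p_c1_r0, p_r1 = construct_given_reference(0.05, 0.80, 0.80)
print(p_c1_r1, p_c1_r0, p_r1)
```

With these values, fewer than one in five people who test positive on the reference test are cases, which shows how far a reference-positive group can be from a group of known cases when prevalence is low.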
This assumption is sometimes phrased as that sensitivity and specificity are constant irrespective of reference test outcome (Frederick, 2000). That is, if groups were formed as testing positive and testing negative on the reference test, as typical in diagnostic test validation studies, the test to be validated will have the same validity within both groups.
From Equation 5 and following Dawes and Meehl (1966), when the reference test has two possible outcomes (R = 1 and R = 0), the sensitivity and nonspecificity of the test to be validated are
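$$\Pr(X = 1 \mid C = 1) = \frac{\Pr(X = 1 \mid R = 1)\,\Pr(C = 0 \mid R = 0) - \Pr(X = 1 \mid R = 0)\,\Pr(C = 0 \mid R = 1)}{\Pr(C = 1 \mid R = 1) - \Pr(C = 1 \mid R = 0)}, \tag{6}$$

$$\Pr(X = 1 \mid C = 0) = \frac{\Pr(X = 1 \mid R = 0)\,\Pr(C = 1 \mid R = 1) - \Pr(X = 1 \mid R = 1)\,\Pr(C = 1 \mid R = 0)}{\Pr(C = 1 \mid R = 1) - \Pr(C = 1 \mid R = 0)}. \tag{7}$$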
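Under conditional independence, each Pr(X = 1|R = r) is a mixture of sensitivity and nonspecificity weighted by Pr(C = 1|R = r) and Pr(C = 0|R = r), so the two reference-test groups give two linear equations in two unknowns. The sketch below solves that system; the function name is illustrative, and the inputs in the example are constructed to satisfy the assumption exactly.

```python
def mgv_estimates(p_x1_r1, p_x1_r0, p_c1_r1, p_c1_r0):
    """Mixed Group Validation point estimates (sensitivity, nonspecificity).

    p_x1_r1, p_x1_r0: Pr(X = 1 | R = 1), Pr(X = 1 | R = 0)
    p_c1_r1, p_c1_r0: Pr(C = 1 | R = 1), Pr(C = 1 | R = 0), assumed known
    """
    denom = p_c1_r1 - p_c1_r0  # zero if the reference test is uninformative
    sensitivity = (p_x1_r1 * (1 - p_c1_r0) - p_x1_r0 * (1 - p_c1_r1)) / denom
    nonspecificity = (p_x1_r0 * p_c1_r1 - p_x1_r1 * p_c1_r0) / denom
    return sensitivity, nonspecificity


# Illustrative check: build Pr(X = 1 | R = r) from a true sensitivity of .70 and
# nonspecificity of .30 under conditional independence, then recover them.
a1, a0 = 0.04 / 0.23, 0.01 / 0.77  # Pr(C = 1 | R = 1), Pr(C = 1 | R = 0)
p1 = 0.70 * a1 + 0.30 * (1 - a1)
p0 = 0.70 * a0 + 0.30 * (1 - a0)
mgv_sens, mgv_nonspec = mgv_estimates(p1, p0, a1, a0)
print(mgv_sens, mgv_nonspec)
```

Because the denominator is the difference between two posterior probabilities, small values of that difference magnify any error in the inputs, and the estimates are not constrained to lie between 0 and 1.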
Jewsbury and Bowden (2014) provided the standard error on the estimates for Pr(X = 1|C = c) as,
| (8) |
| (9) |
Mixed Group Validation assumptions of conditional independence may be inappropriate for many or most psychological diagnostic test validation studies (Jewsbury & Bowden, 2013). For example, the reference test and the test to be validated will often both be sensitive to symptom severity, over and above sensitivity to the dichotomized construct, which is a type of conditional dependence. Indeed, the test to be validated and the reference test may often be expected to share sensitivity to several other non-construct effects, such as effort, method effects, reading comprehension, cognitive abilities, response styles, and test taker sophistication. Mixed Group Validation is not robust to assumption violation (Jewsbury & Bowden, 2014), and many of the existing Mixed Group Validation studies are demonstrably and severely biased due to assumption violation (Jewsbury & Bowden, 2013).
The Mixed Group Validation assumptions are especially implausible if the construct is an arbitrary dichotomization of a continuous dimension or has underlying continuous attributes (Jewsbury & Bowden, 2013), which may be true for many psychopathological constructs (Widiger & Samuel, 2005). However, models with an alternate conceptualization of the construct and corresponding assumptions may still be appropriate for such constructs, and several such methods are explored in the present article.
The Present Study
Point-estimate methods and interval methods for psychological diagnostic test validation are reviewed, discussed, and compared. Beyond Known Group Validation and Mixed Group Validation, above, the Neighborhood model from the political science literature (Freedman, Klein, Sacks, Smyth, & Everett, 1991) is adapted for the purpose of diagnostic test validation for the first time. The assumptions for the Neighborhood model are explored in the context of psychological diagnostic test validation, and recommendations are made where the model may be the most appropriate.
The Method of Bounds (Duncan & Davis, 1953) is also adapted for the purpose of diagnostic test validation here for the first time. The Method of Bounds can provide a basis for interval based methods, but has limited utility unless supplemented with additional information. The present article builds on the discussion of expected conditional dependencies in psychology, to extend the Method of Bounds. This test validation extension, the Method of Bounds–Test Validation, is novel to the present article.
The performance and bias of all five methods are compared across a range of scenarios with simulated data, including plausible scenarios and scenarios consistent and inconsistent with model assumptions. Finally, because the present article introduces the Method of Bounds–Test Validation for the first time, a real data example is provided in the supplemental materials.
New Models for Diagnostic Test Validation
Neighborhood Model
The Neighborhood model was developed to model political voting patterns of subgroups when survey data for the subgroups are not available (Freedman et al., 1991; Klein, Sacks, & Freedman, 1991). The name comes from the scenario where a variable that differs between neighborhoods predicts voting patterns, but the same variable does not predict voting patterns within neighborhoods. In test validation terms, the variable is the outcome of the test to be validated (X), the neighborhood is the outcome of the reference test (R), and the voting pattern is the true diagnostic status on the construct of interest (C). The Neighborhood model is adapted here for two reasons. First, its assumptions suggest that it may be appropriate for certain psychological diagnostic test validation studies, as explored below. Second, its conditional independence assumption is an alternative to, and symmetrical with, that of Mixed Group Validation, the primary method in psychology for test validation without an infallible reference test.
As the Neighborhood model has not been used for test validation, the derivation will be repeated with test validation terms. The standard error and expected direction of bias are shown here as a novel contribution.
Model
The Neighborhood model assumption is that the test to be validated and the construct of interest are conditionally independent given reference test outcome. That is,
$$\Pr(X = x \mid R = r, C = c) = \Pr(X = x \mid R = r). \tag{10}$$
Like Mixed Group Validation, the Neighborhood model does not require that the reference test is infallible, but the method does require that the validity of the reference test and the prevalence of the construct of interest, Pr(C = 1), are known, so that Pr(C = 1|R = r) can be calculated (Altman and Bland, 1994b). The assumption in Equation 10 implies that sensitivity and nonspecificity can be calculated as
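$$\Pr(X = 1 \mid C = 1) = \frac{\sum_{r} \Pr(X = 1 \mid R = r)\,\Pr(C = 1 \mid R = r)\,\Pr(R = r)}{\sum_{r} \Pr(C = 1 \mid R = r)\,\Pr(R = r)}, \tag{11}$$

$$\Pr(X = 1 \mid C = 0) = \frac{\sum_{r} \Pr(X = 1 \mid R = r)\,\Pr(C = 0 \mid R = r)\,\Pr(R = r)}{\sum_{r} \Pr(C = 0 \mid R = r)\,\Pr(R = r)}. \tag{12}$$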
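The Neighborhood model estimate is a weighted average of Pr(X = 1|R = r) over the reference-test groups, with weights Pr(R = r|C = c) obtained from Pr(C = c|R = r) and Pr(R = r). A minimal sketch follows, with illustrative names and with inputs constructed so that the assumption holds exactly.

```python
def neighborhood_estimates(p_x1_r1, p_x1_r0, p_c1_r1, p_c1_r0, p_r1):
    """Neighborhood model point estimates (sensitivity, nonspecificity).

    p_x1_r1, p_x1_r0: Pr(X = 1 | R = 1), Pr(X = 1 | R = 0)
    p_c1_r1, p_c1_r0: Pr(C = 1 | R = 1), Pr(C = 1 | R = 0), assumed known
    p_r1:             Pr(R = 1)
    """
    p_r0 = 1 - p_r1
    p_c1 = p_c1_r1 * p_r1 + p_c1_r0 * p_r0  # prevalence, Pr(C = 1)
    p_c0 = 1 - p_c1
    sensitivity = (p_x1_r1 * p_c1_r1 * p_r1 + p_x1_r0 * p_c1_r0 * p_r0) / p_c1
    nonspecificity = (p_x1_r1 * (1 - p_c1_r1) * p_r1
                      + p_x1_r0 * (1 - p_c1_r0) * p_r0) / p_c0
    return sensitivity, nonspecificity


# Illustrative check with prevalence .05 and reference validity .80.
# Pr(X = 1 | R = 1) = 5/6 and Pr(X = 1 | R = 0) = 1/6 imply a sensitivity of
# .70 and a nonspecificity of .30 when the Neighborhood assumption holds.
nh_sens, nh_nonspec = neighborhood_estimates(
    p_x1_r1=5 / 6, p_x1_r0=1 / 6,
    p_c1_r1=0.04 / 0.23, p_c1_r0=0.01 / 0.77, p_r1=0.23)
print(nh_sens, nh_nonspec)
```

Unlike the Mixed Group Validation solution, this estimate is a convex combination of the observed proportions, so it always lies between Pr(X = 1|R = 0) and Pr(X = 1|R = 1).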
Standard errors
Assuming that Pr(R = r) and Pr(C = c|R = r) are known exactly and the errors on Pr(X = 1|R = r) are uncorrelated, Taylor series expansion implies that
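$$\mathrm{SE}^2\big[\widehat{\Pr}(X = 1 \mid C = c)\big] = \sum_{r} \left[\frac{\partial \Pr(X = 1 \mid C = c)}{\partial \Pr(X = 1 \mid R = r)}\right]^2 \mathrm{SE}^2\big[\widehat{\Pr}(X = 1 \mid R = r)\big], \tag{13}$$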
where higher order derivatives are zero. Equation 13 leads to
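$$\mathrm{SE}\big[\widehat{\Pr}(X = 1 \mid C = 1)\big] = \frac{1}{\Pr(C = 1)} \sqrt{\sum_{r} \Pr(C = 1 \mid R = r)^2\,\Pr(R = r)\,\frac{\widehat{\Pr}(X = 1 \mid R = r)\,\big[1 - \widehat{\Pr}(X = 1 \mid R = r)\big]}{n}}, \tag{14}$$

$$\mathrm{SE}\big[\widehat{\Pr}(X = 1 \mid C = 0)\big] = \frac{1}{\Pr(C = 0)} \sqrt{\sum_{r} \Pr(C = 0 \mid R = r)^2\,\Pr(R = r)\,\frac{\widehat{\Pr}(X = 1 \mid R = r)\,\big[1 - \widehat{\Pr}(X = 1 \mid R = r)\big]}{n}}. \tag{15}$$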
Equations 14 and 15 only account for sampling error on the estimates for Pr(X = 1|R = r), and not uncertainty in estimates for Pr(R = r) and Pr(C = 1|R = r). These equations are useful to compare the Neighborhood model with alternate models of diagnostic test validation, as sampling error on the test to be validated is almost exclusively the only source of error or bias accounted for in psychological diagnostic test validation studies. The equations are also useful to evaluate the robustness of the Neighborhood model to sampling error on the test to be validated in alternate conditions, to inform study design and sample size requirements. In practice, there may also be uncertainty in the estimates for Pr(C = 1|R = r), as well as sampling error on Pr(R = r).
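The standard error calculation can be sketched under the assumptions stated above: Pr(R = r) and Pr(C = c|R = r) are treated as known, and the only sampling error is independent binomial error on each Pr(X = 1|R = r). Because the Neighborhood estimate is linear in those proportions, its variance is the sum of squared weights times the binomial variances. The function name and example values are illustrative, and the code is a reading of those assumptions rather than a reproduction of the article's software.

```python
import math


def neighborhood_se(p_x1_r1, p_x1_r0, p_c1_r1, p_c1_r0, p_r1, n):
    """Approximate standard errors (SE sensitivity, SE nonspecificity) for the
    Neighborhood model, counting only sampling error on Pr(X = 1 | R = r)."""
    p_r = {1: p_r1, 0: 1 - p_r1}
    p_x1 = {1: p_x1_r1, 0: p_x1_r0}
    p_c1_r = {1: p_c1_r1, 0: p_c1_r0}
    p_c1 = sum(p_c1_r[r] * p_r[r] for r in (0, 1))

    def se_for(c):
        p_c = p_c1 if c == 1 else 1 - p_c1
        variance = 0.0
        for r in (0, 1):
            p_c_r = p_c1_r[r] if c == 1 else 1 - p_c1_r[r]
            weight = p_c_r * p_r[r] / p_c  # Pr(R = r | C = c)
            binomial_var = p_x1[r] * (1 - p_x1[r]) / (n * p_r[r])
            variance += weight ** 2 * binomial_var
        return math.sqrt(variance)

    return se_for(1), se_for(0)


# Illustrative values: prevalence .05, reference validity .80, n = 500.
se_sens, se_nonspec = neighborhood_se(
    p_x1_r1=5 / 6, p_x1_r0=1 / 6,
    p_c1_r1=0.04 / 0.23, p_c1_r0=0.01 / 0.77, p_r1=0.23, n=500)
print(se_sens, se_nonspec)
```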
Appropriate use of the model
The Neighborhood model assumes that the test to be validated is conditionally independent of the construct of interest, given the reference test outcome. That is, the test to be validated has zero validity (equal sensitivity and nonspecificity) within the groups defined as positive and negative on the reference test. The Neighborhood model is not appropriate when the test to be validated measures some aspect or individual difference dimension of the construct of interest that was not measured by the reference test. Psychological disorders are usually defined by the presence of multiple symptoms, and in some cases, the multiple symptoms have been found to relate to different latent dimensions of individual differences with methods such as factor analysis (e.g., social cognition and neurocognition in schizophrenia; Mehta et al., 2013). If the test to be validated measures symptoms that are not measured by the reference test, the Neighborhood model assumptions are expected to be violated. For example, the Neighborhood model would not be appropriate with a reference test for social cognition and a test to be validated for neurocognition, with schizophrenia as the construct of interest.
In practice, many psychological diagnostic tests for a given psychological construct have been found to measure the same latent dimension of individual differences. This may be most plausibly true for screening tests and short-form tests, when validated against a long-form version of the same or similar instrument. Screening and short-form tests are some of the most used clinical instruments (Shulman et al., 2006), and validating a short form against the full test battery is not uncommon (e.g., for the Wechsler Adult Intelligence Scale; Denney, Ringe, & Lacritz, 2015). In summary, an ideal research design for the Neighborhood model would be the use of a highly reliable measure of an individual difference dimension as the reference test, and a less reliable measure of the same individual difference dimension as the test to be validated.
Expected direction of assumption violation
Satisfying the Neighborhood assumptions exactly may be implausible, if only because diagnostic tests do not have perfect reliability (Murphy & Davidshofer, 2005). It is generally expected that a second application of a test or an alternate form of the same test will improve validity; for the same reason, it is generally expected that increasing the number of items in a test will improve reliability and validity (Wainer & Thissen, 2001). Therefore, even if the test to be validated measures the same dimension of individual differences as the reference test, the test to be validated still may have validity within groups defined as positive and negative on the reference test, due to not perfectly repeating unreliable misclassifications by the reference test. If the test to be validated has validity, or sensitivity is greater than nonspecificity, positive conditional dependencies exist.
Expected direction of bias
By definition, Pr(X = 1|R = r) is a weighted average of Pr(X = 1|R = r, C = 1) and Pr(X = 1|R = r, C = 0). This implies that Pr(X = 1|R = r) must be between these two conditional probabilities. If the expected direction of assumption violation, Pr(X = 1|R = r, C = 1) ≥ Pr(X = 1|R = r, C = 0), is assumed for every value of r, then,
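$$\frac{\sum_{r} \Pr(X = 1 \mid R = r, C = 1)\,\Pr(C = 1 \mid R = r)\,\Pr(R = r)}{\sum_{r} \Pr(C = 1 \mid R = r)\,\Pr(R = r)} \geq \frac{\sum_{r} \Pr(X = 1 \mid R = r)\,\Pr(C = 1 \mid R = r)\,\Pr(R = r)}{\sum_{r} \Pr(C = 1 \mid R = r)\,\Pr(R = r)}, \tag{16}$$

$$\frac{\sum_{r} \Pr(X = 1 \mid R = r, C = 0)\,\Pr(C = 0 \mid R = r)\,\Pr(R = r)}{\sum_{r} \Pr(C = 0 \mid R = r)\,\Pr(R = r)} \leq \frac{\sum_{r} \Pr(X = 1 \mid R = r)\,\Pr(C = 0 \mid R = r)\,\Pr(R = r)}{\sum_{r} \Pr(C = 0 \mid R = r)\,\Pr(R = r)}. \tag{17}$$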
The left sides of Equations 16 and 17 are the true sensitivity and nonspecificity, respectively, expressed as functions of the sensitivity and nonspecificity conditional on R = r. The right sides are the Neighborhood model estimates of sensitivity and nonspecificity as defined by Equations 11 and 12, respectively. In words, the true sensitivity is at least as large as the Neighborhood model estimate of sensitivity, and the true nonspecificity is no greater than the Neighborhood model estimate of nonspecificity. If the test is judged to have adequate sensitivity and specificity with the Neighborhood model, this conclusion will not be threatened by assumption violation in the expected direction. However, a test may be erroneously concluded to be invalid based on Neighborhood model sensitivity and specificity estimates. In this sense, the Neighborhood model is a conservative model, consistent with the logic of preferring low false positive rates over low false negative rates as typical in null hypothesis significance testing.
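The direction of bias can be checked numerically. The sketch below uses hypothetical within-group probabilities in which the test to be validated retains some validity inside each reference-test group (a positive dependency of .05), and compares the true sensitivity with the Neighborhood model estimate. All names and values are illustrative.

```python
def true_and_nh_sensitivity(p_x1_r_c1, p_x1_r_c0, p_c1_r, p_r):
    """Compare true sensitivity with the Neighborhood model estimate.

    Each argument is a pair indexed by r (position 0 is R = 0, position 1 is
    R = 1): Pr(X = 1 | R = r, C = 1), Pr(X = 1 | R = r, C = 0),
    Pr(C = 1 | R = r), and Pr(R = r).
    """
    groups = (0, 1)
    p_c1 = sum(p_c1_r[r] * p_r[r] for r in groups)
    # Pr(R = r | C = 1), the weights used for sensitivity.
    weights = [p_c1_r[r] * p_r[r] / p_c1 for r in groups]
    # Pr(X = 1 | R = r) is a weighted average of the two within-group values.
    p_x1_r = [p_c1_r[r] * p_x1_r_c1[r] + (1 - p_c1_r[r]) * p_x1_r_c0[r]
              for r in groups]
    true_sens = sum(weights[r] * p_x1_r_c1[r] for r in groups)
    nh_sens = sum(weights[r] * p_x1_r[r] for r in groups)
    return true_sens, nh_sens


true_sens, nh_sens = true_and_nh_sensitivity(
    p_x1_r_c1=[0.20, 0.85],            # hypothetical values for cases
    p_x1_r_c0=[0.15, 0.80],            # hypothetical values for noncases
    p_c1_r=[0.01 / 0.77, 0.04 / 0.23],
    p_r=[0.77, 0.23])
print(true_sens, nh_sens)
```

In this example the Neighborhood estimate falls below the true sensitivity, as the inequality predicts; reversing the sign of the within-group dependency reverses the direction.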
Method of Bounds
As discussed in the present article, both Mixed Group Validation and the Neighborhood model make strong assumptions that may be approximately satisfied in some diagnostic test validation studies. When the practical implications of the assumptions are explored, it is clear that the models are not appropriate for many or most diagnostic test validation studies with psychological disorders. While strong assumptions are required to obtain point-estimates of the validity coefficients, weaker assumptions may be made to obtain bounds for the validity coefficients.
The Method of Bounds for ecological regression (Duncan & Davis, 1953) is a very simple method that avoids problematic strong assumptions. As the Method of Bounds has not been previously considered for diagnostic test validation, the following shows the derivation of the Method of Bounds in test validation terms.
Model
By definition, sensitivity and nonspecificity are bounded by 0 and 1,
$$0 \leq \Pr(X = 1 \mid C = 1) \leq 1, \tag{18}$$

$$0 \leq \Pr(X = 1 \mid C = 0) \leq 1. \tag{19}$$
However, sensitivity and nonspecificity are further constrained in a less obvious way, if the unconditional probability of a positive test score is available, which is related to sensitivity and nonspecificity as
$$\Pr(X = 1) = \Pr(X = 1 \mid C = 1)\,\Pr(C = 1) + \Pr(X = 1 \mid C = 0)\,\Pr(C = 0). \tag{20}$$
Because nonspecificity is constrained between 0 and 1, Equation 20 means that sensitivity is constrained by,
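$$\frac{\Pr(X = 1) - \Pr(C = 0)}{\Pr(C = 1)} \leq \Pr(X = 1 \mid C = 1) \leq \frac{\Pr(X = 1)}{\Pr(C = 1)}. \tag{21}$$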
Because sensitivity is constrained between 0 and 1, Equation 20 means that nonspecificity is constrained by,
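$$\frac{\Pr(X = 1) - \Pr(C = 1)}{\Pr(C = 0)} \leq \Pr(X = 1 \mid C = 0) \leq \frac{\Pr(X = 1)}{\Pr(C = 0)}. \tag{22}$$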
Combining the two constraints for each of sensitivity and nonspecificity (Equations 18, 19, 21, and 22) leads to
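$$\max\left[0, \frac{\Pr(X = 1) - \Pr(C = 0)}{\Pr(C = 1)}\right] \leq \Pr(X = 1 \mid C = 1) \leq \min\left[1, \frac{\Pr(X = 1)}{\Pr(C = 1)}\right], \tag{23}$$

$$\max\left[0, \frac{\Pr(X = 1) - \Pr(C = 1)}{\Pr(C = 0)}\right] \leq \Pr(X = 1 \mid C = 0) \leq \min\left[1, \frac{\Pr(X = 1)}{\Pr(C = 0)}\right], \tag{24}$$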
where max[a, b] is the greater of a and b, and min[a, b] is the lesser of a and b. Equations 23 and 24 define the Method of Bounds (Duncan & Davis, 1953). The bounds are tautologies. For example, the upper bound on sensitivity comes from the fact that there cannot be more cases testing positive than cases and noncases combined testing positive, and that a probability cannot be greater than 1.
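The bounds need only the unconditional positive rate Pr(X = 1) and the prevalence. A minimal sketch follows; the function name is illustrative, and the example uses the values implied by the simulation design described later (sensitivity and specificity of .70, prevalence of .05).

```python
def method_of_bounds(p_x1, prevalence):
    """Method of Bounds intervals for sensitivity and nonspecificity.

    p_x1:       unconditional probability of a positive test, Pr(X = 1)
    prevalence: Pr(C = 1)
    Returns ((sens_low, sens_high), (nonspec_low, nonspec_high)).
    """
    p_c1, p_c0 = prevalence, 1 - prevalence
    sens = (max(0.0, (p_x1 - p_c0) / p_c1), min(1.0, p_x1 / p_c1))
    nonspec = (max(0.0, (p_x1 - p_c1) / p_c0), min(1.0, p_x1 / p_c0))
    return sens, nonspec


# Pr(X = 1) = .70 * .05 + .30 * .95 = .32 when prevalence is .05.
sens_bounds, nonspec_bounds = method_of_bounds(p_x1=0.32, prevalence=0.05)
spec_bounds = (1 - nonspec_bounds[1], 1 - nonspec_bounds[0])
print(sens_bounds, nonspec_bounds, spec_bounds)
```

With a prevalence this low, the sensitivity interval is uninformative (0 to 1), while the specificity interval is fairly narrow, roughly .66 to .72, because almost everyone in the sample is a noncase.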
Method of Bounds–Test Validation
Method of Bounds–Test Validation is a novel method proposed in the present article in reaction to the novel difficulties and opportunities for diagnostic test validation in psychology. As discussed above, the strong assumptions of the point-estimate methods of Known Group Validation, Mixed Group Validation, and the Neighborhood model are unlikely to be satisfied in the majority of diagnostic test validation studies. The Method of Bounds allows for bounds on sensitivity and nonspecificity without these strong assumptions, but the range of the bounds is typically large in practice. Additional constraints may be added to the Method of Bounds to reduce the range of the bounds. The Method of Bounds–Test Validation is introduced here to incorporate constraints that are appropriate for most psychological diagnostic test validation studies.
Model
As described above for Mixed Group Validation and the Neighborhood model, most diagnostic test validation studies may be expected to have positive conditional dependencies. As reviewed in the “Mixed Group Validation” section, Jewsbury and Bowden (2014) showed that for positive violations of the Mixed Group Validation conditional independence assumptions, Mixed Group Validation will overestimate sensitivity and underestimate nonspecificity. In the “Neighborhood” section of the present article, it was shown that for positive violations of the Neighborhood model conditional independence assumptions, the Neighborhood model will underestimate sensitivity and overestimate nonspecificity. Consequently, although it is generally not appropriate for most diagnostic test validation studies to assume the conditional independence assumptions made by Mixed Group Validation and the Neighborhood model, taking a much weaker assumption of the direction of the conditional dependence allows the model estimates to be interpreted as bounds for the true values.
Taking advantage of the expected positive conditional dependencies in most psychological diagnostic test validation studies, Mixed Group Validation and the Neighborhood model estimates can be used to provide additional constraints to add to Equations 23 and 24, allowing for more usefully narrow intervals while still avoiding the strong assumptions of the point-estimate based methods. That is,
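$$\max\left[0, \frac{\Pr(X = 1) - \Pr(C = 0)}{\Pr(C = 1)}, \mathrm{NH}[\Pr(X = 1 \mid C = 1)]\right] \leq \Pr(X = 1 \mid C = 1) \leq \min\left[1, \frac{\Pr(X = 1)}{\Pr(C = 1)}, \mathrm{MGV}[\Pr(X = 1 \mid C = 1)]\right], \tag{25}$$

$$\max\left[0, \frac{\Pr(X = 1) - \Pr(C = 1)}{\Pr(C = 0)}, \mathrm{MGV}[\Pr(X = 1 \mid C = 0)]\right] \leq \Pr(X = 1 \mid C = 0) \leq \min\left[1, \frac{\Pr(X = 1)}{\Pr(C = 0)}, \mathrm{NH}[\Pr(X = 1 \mid C = 0)]\right], \tag{26}$$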
where NH[Pr()] are Neighborhood model estimated probabilities and MGV[Pr()] are Mixed Group Validation estimated probabilities.
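A minimal sketch of the Method of Bounds–Test Validation follows. It assumes positive conditional dependencies, so that the Neighborhood estimate is a lower bound for sensitivity and an upper bound for nonspecificity, and the Mixed Group Validation estimate is the reverse. The function name is illustrative, and the point estimates in the example are rounded, hypothetical inputs.

```python
def mob_tv_bounds(p_x1, prevalence, nh_est, mgv_est):
    """Method of Bounds-Test Validation intervals.

    p_x1:       Pr(X = 1)
    prevalence: Pr(C = 1)
    nh_est:     Neighborhood model (sensitivity, nonspecificity) estimates
    mgv_est:    Mixed Group Validation (sensitivity, nonspecificity) estimates
    Returns ((sens_low, sens_high), (nonspec_low, nonspec_high)).
    """
    p_c1, p_c0 = prevalence, 1 - prevalence
    nh_sens, nh_nonspec = nh_est
    mgv_sens, mgv_nonspec = mgv_est
    sens = (max(0.0, (p_x1 - p_c0) / p_c1, nh_sens),
            min(1.0, p_x1 / p_c1, mgv_sens))
    nonspec = (max(0.0, (p_x1 - p_c1) / p_c0, mgv_nonspec),
               min(1.0, p_x1 / p_c0, nh_nonspec))
    return sens, nonspec


# Hypothetical point estimates for a study with prevalence .05.
sens_bounds, nonspec_bounds = mob_tv_bounds(
    p_x1=0.32, prevalence=0.05,
    nh_est=(0.38, 0.32), mgv_est=(0.97, 0.29))
print(sens_bounds, nonspec_bounds)
```

If the conditional dependencies are in fact negative, the lower bound can exceed the upper bound or the interval can exclude the true value, so the direction assumption should be justified before the interval is interpreted.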
Simulation
A simulation was conducted to compare and contrast the alternate methods, including the point-estimate methods of Known Group Validation, Mixed Group Validation, and the Neighborhood model, as well as the interval methods of the Method of Bounds and the Method of Bounds–Test Validation. As the dimensionality of all possible simulations is large (2³ = 8), the simulated conditions were carefully selected to examine the critical issues within a reasonable total number of conditions.
First, the conditional dependencies between the three variables (X, R, and C) were manipulated. To clarify, the conditional dependencies for the 2 × 2 × 2 case can be defined simply as
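$$D_{C=1} = \Pr(X = 1 \mid R = 1, C = 1) - \Pr(X = 1 \mid R = 0, C = 1), \tag{27}$$

$$D_{C=0} = \Pr(X = 1 \mid R = 1, C = 0) - \Pr(X = 1 \mid R = 0, C = 0), \tag{28}$$

$$D_{R=1} = \Pr(X = 1 \mid R = 1, C = 1) - \Pr(X = 1 \mid R = 1, C = 0), \tag{29}$$

$$D_{R=0} = \Pr(X = 1 \mid R = 0, C = 1) - \Pr(X = 1 \mid R = 0, C = 0). \tag{30}$$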
Half of the simulations were conducted with the Mixed Group Validation assumptions approximately satisfied (by directly specifying the values for $D_{C=1}$ and $D_{C=0}$), and the other half were conducted with the Neighborhood model assumptions approximately satisfied (by directly specifying the values for $D_{R=1}$ and $D_{R=0}$). Within each half, small negative (–.05), zero (0), small positive (.05), and moderately positive (.15) conditional dependencies were simulated. To avoid overcomplicating the simulation, the same conditional dependency was simulated within each pair (e.g., $D_{C=1}$ and $D_{C=0}$ both set to .05). Second, the prevalence or Pr(C = 1) was manipulated, with possible values of .05, .15, and .40. Lower prevalence values were selected because noncases are generally more common than cases in diagnostic test validation studies. Finally, the validity of the reference test was manipulated. The reference test either had very high validity, specifically, Pr(R = 1|C = 1) = Pr(R = 0|C = 0) = .90, or high validity, specifically, Pr(R = 1|C = 1) = Pr(R = 0|C = 0) = .80.
The total sample size was specified to be 500 in all simulations. The test to be validated had moderate validity in all simulations, specifically, Pr(X = 1|C = 1) = Pr(X = 0|C = 0) = .70. The simulation was conducted with R 3.2.1 (R Core Team, 2015). For each simulated condition, implied values for Pr(X = 1|R = r) could be calculated for both values of r based on the specifications noted above. In each simulation within each condition, each simulated participant with known R and C values was randomly assigned X = 1 or X = 0 with probability Pr(X = 1|R = r) for the participant’s r. At this point, the data could be used to emulate a research study. The proportion of X = 1 scores for each value of R (the estimate of Pr(X = 1|R = r)) was calculated and used in the equations to obtain the sensitivity and specificity estimates and standard errors for each method. Confidence intervals were calculated as [estimate – 1.96 × SE, estimate + 1.96 × SE], and whether the true value was within the confidence interval was recorded. The process was repeated 500,000 times for each of the 48 simulated conditions.
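One simulated condition can be sketched as follows. This is a simplified Python illustration, not the R code used in the study. It assumes that the conditional dependency is the difference Pr(X = 1|R = 1, C = c) – Pr(X = 1|R = 0, C = c), that group sizes are fixed at their expected values, and that a small number of replications suffices for illustration. Only the Known Group Validation estimates are computed, and all function names are hypothetical.

```python
import random


def implied_probabilities(prevalence, ref_validity, d_c,
                          sens=0.70, nonspec=0.30):
    """Return (Pr(X=1|R=1), Pr(X=1|R=0), Pr(R=1)) implied by a condition in
    which the same dependency d_c holds among cases and among noncases."""
    v = ref_validity  # Pr(R = 1 | C = 1) = Pr(R = 0 | C = 0)
    # Among cases: v * x_r1_c1 + (1 - v) * x_r0_c1 = sens.
    x_r0_c1 = sens - v * d_c
    x_r1_c1 = x_r0_c1 + d_c
    # Among noncases: (1 - v) * x_r1_c0 + v * x_r0_c0 = nonspec.
    x_r0_c0 = nonspec - (1 - v) * d_c
    x_r1_c0 = x_r0_c0 + d_c
    p_r1 = v * prevalence + (1 - v) * (1 - prevalence)
    c1_r1 = v * prevalence / p_r1              # Pr(C = 1 | R = 1)
    c1_r0 = (1 - v) * prevalence / (1 - p_r1)  # Pr(C = 1 | R = 0)
    p_x1_r1 = c1_r1 * x_r1_c1 + (1 - c1_r1) * x_r1_c0
    p_x1_r0 = c1_r0 * x_r0_c1 + (1 - c1_r0) * x_r0_c0
    return p_x1_r1, p_x1_r0, p_r1


def simulate_kgv(p_x1_r1, p_x1_r0, p_r1, n=500, reps=2000, seed=1):
    """Average Known Group Validation sensitivity and specificity estimates
    over repeated simulated samples of size n."""
    rng = random.Random(seed)
    n_r1 = round(n * p_r1)  # assumption: group sizes fixed at expectation
    n_r0 = n - n_r1
    total_sens = total_spec = 0.0
    for _ in range(reps):
        pos_r1 = sum(rng.random() < p_x1_r1 for _ in range(n_r1))
        pos_r0 = sum(rng.random() < p_x1_r0 for _ in range(n_r0))
        total_sens += pos_r1 / n_r1
        total_spec += 1 - pos_r0 / n_r0
    return total_sens / reps, total_spec / reps


# Prevalence .05, reference validity .80, mild positive dependency of .05.
p1, p0, p_r1 = implied_probabilities(prevalence=0.05, ref_validity=0.80,
                                     d_c=0.05)
mean_sens, mean_spec = simulate_kgv(p1, p0, p_r1)
print(p1, 1 - p0, mean_sens, mean_spec)
```

For this condition the implied values are roughly .404 for Pr(X = 1|R = 1) and .705 for Pr(X = 0|R = 0), far from the true sensitivity of .70 in the first case, which illustrates how a fallible reference test biases Known Group Validation when prevalence is low.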
Table 1 shows the simulated results where the Mixed Group Validation assumptions were directly manipulated, and Table 2 shows the simulated results where the Neighborhood model assumptions were directly manipulated. The results reveal several patterns, many of which have been shown theoretically to hold in general in the present article or previous research (Jewsbury & Bowden, 2013, 2014, for Mixed Group Validation; Valenstein, 1990, for Known Group Validation; present article for the Neighborhood model). General conclusions and comparisons of the methods that are supported by theoretical work and confirmed with simulations including the present simulation are summarized in Table 3.
Table 1.
Simulated Performance of Alternative Methods Under Different Designs, With Varying Degree of Mixed Group Validation Assumption Violation.
| Condition | Assumption violation | Pr(C = 1) | Pr(R = 1\|C = 1) | KGV sensitivity | KGV specificity | MGV sensitivity | MGV specificity | NH sensitivity | NH specificity | MoB sensitivity | MoB specificity | MoB-TV sensitivity | MoB-TV specificity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Mild negative MGV assumption violation, $D_{C=1} = D_{C=0} = -.05$ | .05 | 0.8 | 0.33 (0%) | 0.68 (90%) | 0.43 (84%) | 0.69 (92%) | 0.33 (0%) | 0.68 (85%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.33-0.43 (18%) | 0.68-0.68 (14%) |
| 2 | | .05 | 0.9 | 0.40 (0%) | 0.69 (94%) | 0.59 (90%) | 0.69 (94%) | 0.39 (0%) | 0.68 (88%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.39-0.59 (28%) | 0.69-0.69 (19%) |
| 3 | | .15 | 0.8 | 0.44 (0%) | 0.67 (82%) | 0.61 (87%) | 0.68 (92%) | 0.42 (0%) | 0.65 (35%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.42-0.61 (22%) | 0.65-0.68 (28%) |
| 4 | | .15 | 0.9 | 0.53 (5%) | 0.69 (92%) | 0.66 (92%) | 0.69 (94%) | 0.50 (0%) | 0.67 (65%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.50-0.66 (32%) | 0.67-0.69 (34%) |
| 5 | | .40 | 0.8 | 0.57 (2%) | 0.63 (29%) | 0.67 (90%) | 0.68 (91%) | 0.53 (0%) | 0.59 (0%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.53-0.67 (26%) | 0.59-0.68 (28%) |
| 6 | | .40 | 0.9 | 0.63 (46%) | 0.66 (74%) | 0.69 (94%) | 0.69 (94%) | 0.60 (9%) | 0.64 (26%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.60-0.69 (37%) | 0.64-0.69 (38%) |
| 7 | MGV assumption satisfaction, $D_{C=1} = D_{C=0} = 0$ | .05 | 0.8 | 0.37 (0%) | 0.69 (94%) | 0.70 (95%) | 0.70 (95%) | 0.36 (0%) | 0.68 (87%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.36-0.67 (50%) | 0.68-0.70 (31%) |
| 8 | | .05 | 0.9 | 0.43 (1%) | 0.70 (95%) | 0.70 (94%) | 0.70 (95%) | 0.42 (0%) | 0.69 (90%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.42-0.69 (50%) | 0.69-0.70 (27%) |
| 9 | | .15 | 0.8 | 0.47 (0%) | 0.68 (91%) | 0.70 (95%) | 0.70 (95%) | 0.44 (0%) | 0.65 (41%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.44-0.70 (50%) | 0.65-0.70 (49%) |
| 10 | | .15 | 0.9 | 0.55 (11%) | 0.69 (94%) | 0.70 (95%) | 0.70 (95%) | 0.52 (1%) | 0.67 (70%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.52-0.70 (50%) | 0.67-0.70 (44%) |
| 11 | | .40 | 0.8 | 0.59 (7%) | 0.64 (48%) | 0.70 (95%) | 0.70 (95%) | 0.54 (0%) | 0.60 (1%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.54-0.70 (51%) | 0.60-0.70 (51%) |
| 12 | | .40 | 0.9 | 0.64 (59%) | 0.67 (83%) | 0.70 (95%) | 0.70 (95%) | 0.61 (15%) | 0.64 (34%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.61-0.70 (50%) | 0.64-0.70 (50%) |
| 13 | Mild positive MGV assumption violation, $D_{C=1} = D_{C=0} = .05$ | .05 | 0.8 | 0.40 (0%) | 0.71 (94%) | 0.97 (86%) | 0.71 (91%) | 0.38 (0%) | 0.68 (88%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.38-0.86 (81%) | 0.69-0.70 (50%) |
| 14 | | .05 | 0.9 | 0.46 (2%) | 0.70 (94%) | 0.81 (91%) | 0.71 (94%) | 0.44 (0%) | 0.69 (91%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.44-0.80 (72%) | 0.69-0.70 (35%) |
| 15 | | .15 | 0.8 | 0.49 (0%) | 0.69 (95%) | 0.79 (88%) | 0.72 (90%) | 0.46 (0%) | 0.66 (47%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.46-0.79 (78%) | 0.66-0.71 (69%) |
| 16 | | .15 | 0.9 | 0.57 (21%) | 0.70 (95%) | 0.74 (92%) | 0.71 (94%) | 0.54 (3%) | 0.67 (75%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.54-0.74 (68%) | 0.67-0.71 (52%) |
| 17 | | .40 | 0.8 | 0.61 (18%) | 0.66 (67%) | 0.73 (89%) | 0.72 (90%) | 0.56 (0%) | 0.60 (1%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.56-0.73 (75%) | 0.60-0.72 (74%) |
| 18 | | .40 | 0.9 | 0.65 (71%) | 0.68 (88%) | 0.71 (93%) | 0.71 (93%) | 0.62 (22%) | 0.65 (43%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.62-0.71 (64%) | 0.65-0.71 (61%) |
| 19 | Moderate positive MGV assumption violation, $D_{C=1} = D_{C=0} = .15$ | .05 | 0.8 | 0.47 (0%) | 0.73 (79%) | 1.50 (26%) | 0.74 (60%) | 0.43 (0%) | 0.69 (90%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.43-0.99 (100%) | 0.69-0.71 (70%) |
| 20 | | .05 | 0.9 | 0.53 (19%) | 0.71 (89%) | 1.04 (58%) | 0.72 (87%) | 0.50 (3%) | 0.69 (92%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.50-0.94 (96%) | 0.69-0.71 (48%) |
| 21 | | .15 | 0.8 | 0.55 (3%) | 0.72 (88%) | 0.97 (35%) | 0.75 (58%) | 0.50 (0%) | 0.66 (60%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.50-0.94 (99%) | 0.66-0.74 (92%) |
| 22 | | .15 | 0.9 | 0.61 (52%) | 0.71 (92%) | 0.81 (69%) | 0.72 (86%) | 0.58 (14%) | 0.68 (83%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.58-0.81 (92%) | 0.68-0.72 (65%) |
| 23 | | .40 | 0.8 | 0.65 (59%) | 0.69 (92%) | 0.80 (46%) | 0.77 (53%) | 0.58 (0%) | 0.62 (5%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.58-0.80 (98%) | 0.62-0.77 (97%) |
| 24 | | .40 | 0.9 | 0.67 (88%) | 0.70 (94%) | 0.74 (79%) | 0.73 (83%) | 0.64 (43%) | 0.66 (61%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.64-0.74 (85%) | 0.66-0.73 (79%) |
Note. Values for sensitivity and specificity are the estimates from the respective method, to be compared with the true sensitivity of .70 and the true specificity of .70. Numbers in parentheses are the percentages of simulations in which the true value was within the estimated confidence interval. Assumption violation for MGV relates to conditional dependencies between the test to be validated and the reference test given the construct of interest. The variables $D_{C=1}$ and $D_{C=0}$ are defined in Equations 27 and 28. KGV = Known Group Validation; MGV = Mixed Group Validation; NH = Neighborhood model; MoB = Method of Bounds; MoB-TV = Method of Bounds–Test Validation.
Table 2.
Simulated Performance of Alternative Methods Under Different Designs, With Varying Degree of Neighborhood Model Assumption Violation.
| Row | Assumption violation | Pr(C = 1) | Pr(R = 1 \| C = 1) | KGV Sensitivity | KGV Specificity | MGV Sensitivity | MGV Specificity | NH Sensitivity | NH Specificity | MoB Sensitivity | MoB Specificity | MoB-TV Sensitivity | MoB-TV Specificity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | Mild negative NH assumption violation $D_{R=1} = D_{R=0} = -.05$ | .05 | 0.8 | 0.89 (0%) | 0.85 (0%) | 4.70 (0%) | 0.91 (0%) | 0.74 (53%) | 0.70 (94%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.74-1 (4%) | 0.70-0.72 (44%) |
| 26 | | .05 | 0.9 | 0.79 (51%) | 0.76 (22%) | 1.97 (0%) | 0.77 (13%) | 0.74 (82%) | 0.70 (95%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.74-1 (20%) | 0.70-0.72 (46%) |
| 27 | | .15 | 0.8 | 0.88 (0%) | 0.85 (0%) | 2.03 (0%) | 0.94 (0%) | 0.73 (64%) | 0.71 (92%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.73-1 (7%) | 0.71-0.75 (35%) |
| 28 | | .15 | 0.9 | 0.78 (51%) | 0.76 (24%) | 1.12 (0%) | 0.77 (10%) | 0.72 (88%) | 0.70 (94%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.72-1 (26%) | 0.70-0.75 (42%) |
| 29 | | .40 | 0.8 | 0.86 (0%) | 0.86 (0%) | 1.20 (0%) | 1.03 (0%) | 0.72 (79%) | 0.71 (86%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.72-1 (16%) | 0.71-0.90 (23%) |
| 30 | | .40 | 0.9 | 0.76 (44%) | 0.76 (35%) | 0.86 (1%) | 0.81 (4%) | 0.71 (92%) | 0.71 (93%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.71-0.86 (33%) | 0.71-0.81 (37%) |
| 31 | NH assumption satisfaction $D_{R=1} = D_{R=0} = 0$ | .05 | 0.8 | 0.83 (6%) | 0.83 (0%) | 4.26 (0%) | 0.89 (0%) | 0.70 (94%) | 0.70 (95%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.70-1 (49%) | 0.70-0.72 (49%) |
| 32 | | .05 | 0.9 | 0.75 (79%) | 0.75 (33%) | 1.83 (0%) | 0.76 (22%) | 0.70 (94%) | 0.70 (95%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.70-1 (49%) | 0.70-0.72 (49%) |
| 33 | | .15 | 0.8 | 0.83 (2%) | 0.83 (0%) | 1.89 (0%) | 0.91 (0%) | 0.70 (94%) | 0.70 (95%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.70-1 (49%) | 0.70-0.75 (49%) |
| 34 | | .15 | 0.9 | 0.75 (74%) | 0.75 (36%) | 1.07 (0%) | 0.77 (18%) | 0.70 (95%) | 0.70 (95%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.70-0.99 (49%) | 0.70-0.75 (50%) |
| 35 | | .40 | 0.8 | 0.83 (0%) | 0.83 (0%) | 1.14 (0%) | 1 (0%) | 0.70 (95%) | 0.70 (95%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.70-1 (49%) | 0.70-0.90 (49%) |
| 36 | | .40 | 0.9 | 0.75 (62%) | 0.75 (50%) | 0.84 (3%) | 0.79 (9%) | 0.70 (95%) | 0.70 (95%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.70-0.84 (50%) | 0.70-0.79 (50%) |
| 37 | Mild positive NH assumption violation $D_{R=1} = D_{R=0} = .05$ | .05 | 0.8 | 0.78 (52%) | 0.82 (0%) | 3.81 (0%) | 0.86 (0%) | 0.66 (75%) | 0.70 (95%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.66-1 (92%) | 0.70-0.72 (55%) |
| 38 | | .05 | 0.9 | 0.71 (93%) | 0.74 (44%) | 1.68 (0%) | 0.75 (33%) | 0.66 (90%) | 0.70 (95%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.66-1 (77%) | 0.70-0.72 (53%) |
| 39 | | .15 | 0.8 | 0.79 (29%) | 0.81 (0%) | 1.74 (0%) | 0.88 (0%) | 0.67 (80%) | 0.69 (94%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.67-1 (89%) | 0.69-0.75 (62%) |
| 40 | | .15 | 0.9 | 0.72 (89%) | 0.74 (49%) | 1.03 (1%) | 0.76 (29%) | 0.68 (92%) | 0.70 (95%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.68-0.98 (72%) | 0.70-0.75 (57%) |
| 41 | | .40 | 0.8 | 0.80 (5%) | 0.81 (1%) | 1.09 (0%) | 0.96 (0%) | 0.68 (87%) | 0.69 (91%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.68-1 (81%) | 0.69-0.90 (75%) |
| 42 | | .40 | 0.9 | 0.74 (77%) | 0.74 (64%) | 0.82 (9%) | 0.78 (18%) | 0.69 (94%) | 0.69 (94%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.69-0.82 (65%) | 0.69-0.78 (62%) |
| 43 | Moderate positive NH assumption violation $D_{R=1} = D_{R=0} = .15$ | .05 | 0.8 | 0.66 (85%) | 0.78 (4%) | 2.92 (0%) | 0.82 (0%) | 0.57 (3%) | 0.69 (94%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.57-1 (100%) | 0.69-0.72 (64%) |
| 44 | | .05 | 0.9 | 0.63 (81%) | 0.73 (68%) | 1.40 (4%) | 0.74 (60%) | 0.59 (46%) | 0.69 (94%) | 0-1 (100%) | 0.66-0.72 (100%) | 0.59-1 (98%) | 0.69-0.71 (56%) |
| 45 | | .15 | 0.8 | 0.70 (94%) | 0.78 (8%) | 1.44 (0%) | 0.83 (0%) | 0.60 (9%) | 0.68 (86%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.60-1 (100%) | 0.68-0.75 (82%) |
| 46 | | .15 | 0.9 | 0.67 (93%) | 0.73 (73%) | 0.93 (14%) | 0.74 (57%) | 0.63 (63%) | 0.69 (92%) | 0-1 (100%) | 0.58-0.75 (100%) | 0.63-0.93 (95%) | 0.69-0.74 (67%) |
| 47 | | .40 | 0.8 | 0.74 (68%) | 0.76 (34%) | 0.98 (0%) | 0.89 (0%) | 0.64 (31%) | 0.66 (56%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.64-0.97 (99%) | 0.66-0.88 (97%) |
| 48 | | .40 | 0.9 | 0.71 (94%) | 0.72 (86%) | 0.79 (35%) | 0.76 (47%) | 0.67 (80%) | 0.68 (86%) | 0-1 (100%) | 0.23-0.90 (100%) | 0.67-0.79 (87%) | 0.68-0.76 (80%) |
Note. Values for sensitivity and specificity are estimated values for the respective method, compared with the true sensitivity of .70 and the true specificity of .70. The number in parentheses is the percentage of simulations where the true value was within the estimated confidence interval. Assumption violation for NH relates to conditional dependencies between the test to be validated and the construct of interest given the reference test. The variables $D_{R=1}$ and $D_{R=0}$ are defined in Equations 29 and 30. KGV = Known Group Validation; MGV = Mixed Group Validation; NH = Neighborhood model; MoB = Method of Bounds; MoB-TV = Method of Bounds–Test Validation.
Table 3.
Comparative Properties of the Alternate Methods in Various Conditions.
| Condition | KGV | MGV | NH | MoB | MoB-TV |
|---|---|---|---|---|---|
| Reference test has perfect validity | Unbiased (Valenstein, 1990) | Unbiased, becomes equivalent to KGV (Jewsbury & Bowden, 2013; also, Equations 6-9) | Unbiased, becomes equivalent to KGV (Equations 11-12, 14-15) | Unbiased, becomes equivalent to KGV (Equations 23-24) | Unbiased, becomes equivalent to KGV (Equations 25-26) |
| Reference test validity increases | Assumptions increasingly satisfied: bias decreases (Valenstein, 1990) | More robust to assumption violation: bias decreases, error decreases (Jewsbury & Bowden, 2013; also, Tables 1 and 2) | More robust to assumption violation: bias decreases, error increases (Tables 1 and 2) | Range for sensitivity and specificity decreases (Tables 1 and 2) | Range for sensitivity and specificity decreases (Tables 1 and 2) |
| Reference test validity decreases | Assumptions increasingly violated, bias increases (Valenstein, 1990) | Less robust to assumption violation, bias increases, error increases (Jewsbury & Bowden, 2013; also, Tables 1 and 2) | Less robust to assumption violation, bias increases, error decreases (Tables 1 and 2) | Range for sensitivity and specificity increases (Tables 1 and 2) | Range for sensitivity and specificity increases (Tables 1 and 2) |
| Positive conditional dependencies | May over or underestimate sensitivity and specificity (Valenstein, 1990) | Overestimate sensitivity and specificity (Jewsbury & Bowden, 2014; also, Table 1) | Underestimate sensitivity and specificity (Equations 16 and 17; also, Table 2) | True values within bounds (“Method of Bounds” section; also, Tables 1 and 2) | True values within bounds (Tables 1 and 2) |
| Negative conditional dependencies | May over or underestimate sensitivity and specificity (Valenstein, 1990) | Underestimate sensitivity and specificity (Jewsbury & Bowden, 2014; also, Table 1) | Overestimate sensitivity and specificity (Equations 16 and 17; also Table 2) | True values within bounds (“Method of Bounds” section; also, Tables 1 and 2) | True values outside of bounds (Tables 1 and 2) |
| Prevalence goes to zero | Bias on specificity decreases, bias on sensitivity increases (Valenstein, 1990) | Bias and error on specificity decrease, bias and error on sensitivity increase (Jewsbury & Bowden, 2013; also Tables 1 and 2) | Bias and error on specificity decrease, bias and error on sensitivity increase (Tables 1 and 2) | Range for specificity decreases, range for sensitivity becomes 1 (“Method of Bounds” section; also Tables 1 and 2) | Range for specificity decreases, range for sensitivity becomes 1 (Tables 1 and 2) |
| Prevalence goes to one | Bias on sensitivity decreases, bias on specificity increases (Valenstein, 1990) | Bias and error on sensitivity decrease, bias and error on specificity increase (Jewsbury & Bowden, 2013; also Tables 1 and 2) | Bias and error on sensitivity decrease, bias and error on specificity increase (Tables 1 and 2) | Range for sensitivity decreases, range for specificity becomes 1 (“Method of Bounds” section; also Tables 1 and 2) | Range for sensitivity decreases, range for specificity becomes 1 (Tables 1 and 2) |
Note. Bias refers to the difference between the expected estimated value and the true value. A model is described as unbiased if the expected value is equal to the true value and the true value is within the 95% confidence interval 95% of the time. Robustness refers to the degree of bias for a degree of assumption violation. A model is described as more robust if the degree of bias is reduced while holding the degree of assumption violation constant. Positive conditional dependencies are expected in most psychological diagnostic test validation studies. KGV = Known Group Validation; MGV = Mixed Group Validation; NH = Neighborhood model; MoB = Method of Bounds; MoB-TV = Method of Bounds–Test Validation.
For the three point-estimate methods of Known Group Validation, Mixed Group Validation, and the Neighborhood model, a lack of robustness to the model assumptions was observed (Tables 1 and 2). Known Group Validation mean sensitivity estimates varied from 0.33 to 0.89; Mixed Group Validation mean sensitivity estimates varied from 0.43 to 4.70; and Neighborhood model mean sensitivity estimates varied from 0.33 to 0.74, versus a true value of 0.70. Tables 1 and 2 also reveal that for a given degree of assumption violation, the observed bias for all of the methods varies with respect to the validity of the reference test and the prevalence. Notably, all methods become more robust as the reference test becomes more valid.
The results in Tables 1 and 2 where mild and moderate positive conditional dependencies were simulated with a fallible reference test represent the plausible expectation for many psychological diagnostic test validation studies. As expected, Mixed Group Validation overestimated both sensitivity and specificity, while the Neighborhood model underestimated both sensitivity and specificity. Known Group Validation, in contrast, may be biased in either direction. The lack of robustness to assumption violation of all three point-estimate methods is a major concern, as the associated assumptions for each of the three methods cannot be expected to be perfectly satisfied in any realistic study. The Method of Bounds was not biased, but its bounds were generally too wide to be practically useful. Finally, the Method of Bounds–Test Validation was also not biased, and gave more practically useful bounds.
Note that while the Method of Bounds–Test Validation uses Mixed Group Validation and the Neighborhood model estimates as bounds, the Method of Bounds–Test Validation does not make the conditional independence assumptions of Mixed Group Validation and the Neighborhood model. Instead, by using the estimates as bounds rather than point estimates, the Method of Bounds–Test Validation allows for positive conditional dependencies, as described in the “Method of Bounds–Test Validation” section, above.
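The way the two point estimates act as bounds can be illustrated with a minimal sketch. The combination rule used here (Neighborhood model estimate as the lower bound, Mixed Group Validation estimate as the upper bound, both truncated to the unit interval and never wider than the Method of Bounds interval) is an assumption inferred from the pattern of mean values in Tables 1 and 2; the article's own definition is given in Equations 25 and 26. The function names `clip01` and `mob_tv_interval` are hypothetical.

```python
def clip01(x):
    """Truncate an estimate to the unit interval [0, 1]."""
    return max(0.0, min(1.0, x))


def mob_tv_interval(nh_est, mgv_est, mob_lower=0.0, mob_upper=1.0):
    """Assumed rule: NH estimate as lower bound, MGV estimate as upper bound.

    Both bounds are truncated to [0, 1] and kept inside the Method of Bounds
    interval (mob_lower, mob_upper). This covers only the positive conditional
    dependency case, where the NH estimate lies below the MGV estimate.
    """
    lower = max(clip01(nh_est), mob_lower)
    upper = min(clip01(mgv_est), mob_upper)
    return lower, upper


# Mean estimates from row 23 of Table 1 (moderate positive MGV violation).
sens_interval = mob_tv_interval(nh_est=0.58, mgv_est=0.80)
spec_interval = mob_tv_interval(nh_est=0.62, mgv_est=0.77,
                                mob_lower=0.23, mob_upper=0.90)
print(sens_interval, spec_interval)
```

Applied to the mean estimates in row 23 of Table 1, this rule gives 0.58-0.80 for sensitivity and 0.62-0.77 for specificity, matching the tabulated Method of Bounds–Test Validation intervals. In rows where estimates exceed 1 in some simulations, the tabulated mean bounds can differ slightly, because truncation is applied per simulation before averaging.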
While the Method of Bounds–Test Validation often produced wide intervals, these results should be interpreted in comparison with alternate methods under the same conditions. For example, in row 21 in Table 1, the range for the Method of Bounds–Test Validation sensitivity is .50 to .94. However, for the same simulated conditions, the standard method of Known Group Validation has a bias of –.15 and the true value was within the confidence interval only 3% of the time, versus 99% for the Method of Bounds–Test Validation. Similar to Known Group Validation, the other point-estimate methods of Mixed Group Validation and the Neighborhood model showed limited robustness to assumption violation. Mixed Group Validation showed a bias of .27 with the true value captured in the confidence interval 35% of the time, and the Neighborhood model showed a bias of –.20 with the true value captured in the confidence interval 0% of the time. Consequently, unless confidence can be placed in the strong and often implausible assumptions of the point-estimate methods, the wide bounds of Method of Bounds–based methods may be a better representation of how much information the study really provides. In such situations, wide intervals may be reflective of limitations in the study design, not the method.
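The bias figures quoted for row 21 follow directly from the tabulated means. The short sketch below only restates that arithmetic; the dictionary layout and the function name `bias_by_method` are illustrative choices, not part of the article.

```python
TRUE_SENSITIVITY = 0.70

# Row 21 of Table 1: (mean sensitivity estimate, coverage in percent).
ROW_21 = {
    "KGV": (0.55, 3),
    "MGV": (0.97, 35),
    "NH": (0.50, 0),
}


def bias_by_method(row, true_value):
    """Bias = mean estimate minus true value, rounded to two decimals."""
    return {method: round(est - true_value, 2) for method, (est, _) in row.items()}


biases = bias_by_method(ROW_21, TRUE_SENSITIVITY)
for method, (est, coverage) in ROW_21.items():
    print(f"{method}: estimate {est:.2f}, bias {biases[method]:+.2f}, coverage {coverage}%")
```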
Two potentially non-intuitive caveats should be noted when interpreting the results for the Method of Bounds–Test Validation. First, when either the Mixed Group Validation or the Neighborhood model assumptions are exactly satisfied, it might be expected that the corresponding model estimates are always selected as a bound. However, when there is a large amount of variability in the estimates (such as Mixed Group Validation estimates of sensitivity when prevalence is low, as in row 7, Table 1; Jewsbury & Bowden, 2014), the estimates can be greater than 1, in which case they are truncated to 1 by the method (see Equation 23). This truncation, for example, makes the mean upper bound on sensitivity lower than the mean Mixed Group Validation sensitivity estimate, leading to the counter-intuitive result that the true value is within the bounds 50% of the time yet does not fall within the mean bounds in row 7. Second, when the assumptions for Mixed Group Validation or the Neighborhood model were exactly satisfied, the true values for sensitivity and specificity were very close to the expected value of one of the associated bounds. Therefore, due to sampling error, the true value was expected to be between the bounds only about 50% of the time (see rows 7-12, Table 1; rows 31-36, Table 2).
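Both caveats can be reproduced with a small simulation. This is a sketch under assumptions that are not taken from the article: the upper-bound estimate is drawn from a normal distribution centered on the true value with a deliberately large standard deviation, the lower bound is assumed to sit safely below the true value, and only truncation at 1 is applied. The function name `simulate_upper_bound` and all parameter values are hypothetical.

```python
import random


def simulate_upper_bound(true_value=0.70, sd=1.0, n_sims=20000, seed=1):
    """Simulate an unbiased but noisy estimate used as an upper bound.

    Returns (coverage, mean_raw, mean_truncated), where coverage is the share
    of simulations in which the truncated upper bound is at or above the true
    value.
    """
    rng = random.Random(seed)
    covered = 0
    raw_sum = 0.0
    truncated_sum = 0.0
    for _ in range(n_sims):
        raw = rng.gauss(true_value, sd)   # assumed sampling distribution
        truncated = min(raw, 1.0)         # estimates above 1 are truncated to 1
        covered += truncated >= true_value
        raw_sum += raw
        truncated_sum += truncated
    return covered / n_sims, raw_sum / n_sims, truncated_sum / n_sims


coverage, mean_raw, mean_truncated = simulate_upper_bound()
print(f"coverage {coverage:.2f}, mean raw {mean_raw:.2f}, mean truncated {mean_truncated:.2f}")
```

Under these assumptions the upper bound sits at or above the true value in roughly half of the simulations, while the mean of the truncated bounds falls below the true value even though the raw estimates are centered on it.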
General Discussion
Criterion-related diagnostic test validation without an infallible reference test or perfect measure of the construct of interest is not straightforward. Estimating diagnostic test validity coefficients with a fallible reference test requires an additional assumption to estimate the relationship between the test to be validated and the construct, based on the observed relationship of the test to be validated and the reference test. Three point-estimate methods corresponding to alternate assumptions were described here: Known Group Validation, Mixed Group Validation, and the Neighborhood model. An alternate class of methods that estimate intervals was also reviewed, starting with the Method of Bounds and introducing the Method of Bounds–Test Validation.
The three point-estimate methods, Known Group Validation, Mixed Group Validation, and the Neighborhood model, make strong assumptions. None of the models demonstrated robustness to assumption violation in the simulation. Consequently, the assumptions of a model should be justified before it is used for diagnostic test validation purposes. The choice of model should be made carefully, and where possible, the study should be designed to satisfy the model assumptions.
Due to the inherent difficulties with the exact assumptions required for point-estimate methods, interval-based methods were considered. The Method of Bounds avoids strong assumptions, but was found to have limited utility because it generally estimates wide bounds. Building on work showing that Mixed Group Validation is expected to overestimate validity in diagnostic test validation studies in psychology (Jewsbury & Bowden, 2013), and that the Neighborhood model is expected to underestimate validity (“Neighborhood” section), an extension of the Method of Bounds was derived that incorporates additional information in the form of relatively safe and weak assumptions of positive conditional dependencies. The Method of Bounds–Test Validation is novel to the present article. The Method of Bounds–Test Validation performed relatively well across a range of plausible test validation studies, even when the point-estimate methods were severely biased.
While the Method of Bounds–Test Validation can often produce wide intervals for the validity coefficients, the width of the interval is also a function of the quality of the study and the amount of information truly known about the test validity. Given the strong assumptions typically made in diagnostic test validation studies, the degree of precision in estimating the test validity is likely overestimated. For example, in a meta-analysis of the Mini-Mental State Examination for dementia, Mitchell (2009) reported confidence intervals for sensitivity and nonspecificity for 39 studies. Of these 78 estimates, the 95% confidence interval for only 29 (37%) contained the average sensitivity or nonspecificity across all relevant studies, in contrast to the expected value of approximately 95%. Consequently, wider intervals may reflect a more accurate description of the available information.
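The gap between 29 of 78 and the nominal 95% can be quantified with a rough calculation. The sketch below treats the 78 intervals as independent and the pooled average as the true value; both are simplifying assumptions made here for illustration only, and genuine between-study heterogeneity would also lower coverage. The helper `binom_cdf` is a hypothetical name.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n_intervals = 78   # 39 studies x 2 coefficients
n_covering = 29    # intervals containing the pooled average

observed_rate = n_covering / n_intervals
expected_count = 0.95 * n_intervals
tail_prob = binom_cdf(n_covering, n_intervals, 0.95)

print(f"observed coverage: {observed_rate:.0%}")
print(f"expected covering intervals at nominal 95%: {expected_count:.1f}")
print(f"P(29 or fewer cover | nominal 95% holds): {tail_prob:.1e}")
```

Under these assumptions, about 74 of the 78 intervals would be expected to contain the average, and observing 29 or fewer would be vanishingly unlikely if the reported intervals had their nominal coverage.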
Wide confidence intervals obtained by taking a more disciplined approach and accounting for additional errors and biases typically neglected may be discouraging to researchers and disruptive to the field. Consequently, the Method of Bounds–Test Validation is recommended to supplement rather than replace Known Group Validation studies. This would provide a quantification of the degree of potential bias due to the erroneous assumption of an infallible reference test, as a gentle move toward a more comprehensive account of the errors and biases, and higher research standards in psychological diagnostic test validation studies.
The type of conditional independence assumed by Mixed Group Validation implies a latent class (Jewsbury & Bowden, 2013), but the Neighborhood model and the Method of Bounds–Test Validation do not assume latent classes. Latent variable models such as Mixed Group Validation, which assume conditional independence of the observed variables given the latent variables (Holland & Rosenbaum, 1986), are dominant in psychology, and discussions of whether psychopathological constructs are categorical or continuous are often framed in terms of latent classes versus latent traits (e.g., Borsboom et al., 2016). However, as the Neighborhood model demonstrates, alternate models with unobserved variables that are neither latent classes nor latent traits also exist, although they are generally neglected in psychology. While the field continues to question the appropriateness of treating psychopathological constructs as latent classes (Widiger & Samuel, 2005), dichotomous constructs remain advantageous for practical reasons (Kamphuis & Noordhof, 2009). The Neighborhood model and Method of Bounds–Test Validation allow for an alternate, and more plausible, approach to dichotomous psychopathological constructs as compared to latent classes.
There are many potentially fruitful directions for future research. While it may be implausible for a single point-estimate method to be appropriate for all diagnostic test validation studies in psychology, there is potential for the derivation of new models that make alternate conditional dependency assumptions that may be appropriate for certain types of validation studies. Relatedly, further empirical work on the typical magnitude of conditional dependencies in real studies will be helpful in generalizing simulation results to actual studies. As the Neighborhood model is introduced in the present article for diagnostic test validation for the first time, future research may evaluate the model in real data studies designed optimally for the model, following the discussion and recommendations in the “Neighborhood” section. The Method of Bounds–Test Validation could potentially be more useful if additional constraints, appropriate for most or some diagnostic studies, could be identified to further narrow the estimated intervals. Finally, the Method of Bounds–Test Validation could be revised as a Bayesian model, in which safe assumptions are expressed as priors, possibly with non-uniform distributions, rather than as hard constraints.
The methods reviewed in the present article are appropriate with research designs involving a reference test with known predictive power, Pr(C = c|R = r). In practice, many diagnostic test validation studies use reference tests for which estimates of predictive power are available from prior studies, but assume the reference test is infallible only because the available methodology, Known Group Validation, requires that assumption. A related issue arises when prior validation of the reference test itself is impossible because the construct of interest was not directly available even in prior studies. Latent class analysis can be used as a method of diagnostic test validation with three or more diagnostic tests of unknown validity, provided that the construct of interest conforms to the definition of a latent class (e.g., Thomas et al., 2009). Future research could also consider models that do not require prior validation of a reference test but involve alternative conditional dependency assumptions that may better conform to diagnostic constructs of interest.
Confidence intervals on the Method of Bounds–Test Validation bounds, accounting for sampling error in the outcome of the test to be validated, can be obtained from the relevant Mixed Group Validation and Neighborhood model equations (Equations 8-9, 14-15). The equations and simulations in the present article account only for sampling error in the estimates of Pr(X = x|R = r), so that the methods can be compared fairly with typical diagnostic test validation studies using the standard method, Known Group Validation, which almost universally consider only this type of sampling error. Alternatively, resampling methods such as the bootstrap could be used for a fuller account of sampling errors (Efron & Tibshirani, 1993).
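As an illustration of the resampling alternative, the sketch below computes a percentile bootstrap interval for a single proportion such as Pr(X = 1|R = 1). The data are hypothetical (55 positive results among 100 reference-positive cases), and the function name `bootstrap_ci` and its defaults are illustrative assumptions, not the article's procedure. A fuller account of sampling error would resample every data source that feeds the validity estimate, including any prior study used to estimate the predictive power of the reference test.

```python
import random


def bootstrap_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of binary outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the proportion in each replicate.
    stats = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical test results (X) among cases positive on the reference test (R = 1).
x_given_r1 = [1] * 55 + [0] * 45
point_estimate = sum(x_given_r1) / len(x_given_r1)
ci_lower, ci_upper = bootstrap_ci(x_given_r1)
print(f"Pr(X = 1 | R = 1) = {point_estimate:.2f}, 95% bootstrap CI ({ci_lower:.2f}, {ci_upper:.2f})")
```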
The point-estimate methods of Mixed Group Validation and the Neighborhood model are recommended for the special cases where the respective assumptions can be satisfied, but not in general. The Method of Bounds–Test Validation is also recommended as a simple and gentle way for researchers to supplement Known Group Validation studies with a quantification of the range of potential bias due to the erroneous assumption of an infallible reference test.
Supplemental Material
Supplemental material for Diagnostic Test Score Validation With a Fallible Criterion by Paul A. Jewsbury in Applied Psychological Measurement
Acknowledgments
The author thanks Stephen Bowden, Shelby Haberman, Samuel Livingston, and Nuo Xi for helpful comments, suggestions, and improvements.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material: Supplemental material for this article is available online.
References
- Altman D. G., Bland J. M. (1994a). Diagnostic tests 1: Sensitivity and specificity. British Medical Journal, 308, Article 1552.
- Altman D. G., Bland J. M. (1994b). Diagnostic tests 2: Predictive values. British Medical Journal, 309, Article 102.
- American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
- American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Arlington, VA: American Psychiatric Publishing.
- Borsboom D., Rhemtulla M., Cramer A. O. J., van der Maas H. L. J., Scheffer M., Dolan C. V. (2016). Kinds versus continua: A review of psychometric approaches to uncover the structure of psychiatric constructs. Psychological Medicine, 46, 1-13.
- Bowden S. C., Harrison E. J., Loring D. W. (2014). Evaluating research for clinical significance: Using critically appraised topics to enhance evidence-based neuropsychology. The Clinical Neuropsychologist, 28, 653-668.
- Dawes R. M., Meehl P. E. (1966). Mixed group validation. Psychological Bulletin, 66, 63-67.
- Denney D. A., Ringe W. K., Lacritz L. H. (2015). Dyadic Short Forms of the Wechsler Adult Intelligence Scale-IV. Archives of Clinical Neuropsychology, 30, 404-412.
- Duncan O. D., Davis B. (1953). An alternative to ecological correlation. American Sociological Review, 18, 665-666.
- Efron B., Tibshirani R. (1993). An introduction to the bootstrap. New York, NY: Chapman & Hall.
- Frederick R. I. (2000). Mixed group validation: A method to address the limitations of criterion group validation in research on malingering detection. Behavioral Sciences and the Law, 18, 693-718.
- Frederick R. I., Bowden S. C. (2009a). The test validation summary. Assessment, 16, 215-236.
- Frederick R. I., Bowden S. C. (2009b). Evaluating constructs represented by symptom validity tests in forensic neuropsychological assessment of traumatic brain injury. The Journal of Head Trauma Rehabilitation, 24, 105-122.
- Freedman D. A., Klein S. P., Sacks J., Smyth C. A., Everett C. G. (1991). Ecological regression and voting rights. Evaluation Review, 15, 673-711.
- Holland P. W., Rosenbaum P. R. (1986). Conditional association and unidimensionality in monotone latent variable models. The Annals of Statistics, 14, 1523-1543.
- Jewsbury P. A., Bowden S. C. (2013). Considerations underlying the use of mixed group validation. Psychological Assessment, 25, 204-215.
- Jewsbury P. A., Bowden S. C. (2014). A description of mixed group validation. Assessment, 21, 170-180.
- Kamphuis J. H., Noordhof A. (2009). On categorical diagnoses in DSM-V: Cutting dimensions at useful points? Psychological Assessment, 21, 294-301.
- Klein S. P., Sacks J., Freedman D. A. (1991). Ecological regression versus the secret ballot. Jurimetrics, 31, 393-413.
- Knottnerus J. A., van Weel C., Muris J. W. M. (2002). Evaluation of diagnostic procedure. British Medical Journal, 324, 477-480.
- Kraemer H. C. (2014). The reliability of clinical diagnoses: State of the art. Annual Review of Clinical Psychology, 10, 111-130.
- Mehta U. M., Thirthalli J., Subbakrishna D. K., Gangadhar B. N., Eack S. M., Keshavan M. S. (2013). Social and neuro-cognition as distinct cognitive factors in schizophrenia: A systematic review. Schizophrenia Research, 148, 3-11.
- Meyer G. J., Finn S. E., Eyde L. D., Kay G. G., Moreland K. L., Dies R. R., . . . Reed G. M. (2001). Psychological testing and psychological assessment: A review of evidence and issues. American Psychologist, 56, 128-165.
- Mitchell A. J. (2009). A meta-analysis of the accuracy of the mini-mental state examination in the detection of dementia and mild cognitive impairment. Journal of Psychiatric Research, 43, 411-431.
- Mossman D., Miller W. G., Lee E. R., Gervais R. O., Hart K. J., Wygant D. B. (2015). A Bayesian approach to mixed group validation of performance validity tests. Psychological Assessment, 27, 763-776.
- Mossman D., Wygant D. B., Gervais R. O. (2012). Estimating the accuracy of neurocognitive effort measures in the absence of a “gold standard.” Psychological Assessment, 24, 815-822.
- Murphy K. R., Davidshofer C. O. (2005). Psychological testing: Principles and applications (6th ed.). Upper Saddle River, NJ: Pearson Education.
- Ortega A., Labrenz S., Markowitsch H. J., Piefke M. (2013). Diagnostic accuracy of a Bayesian latent group analysis for the detection of malingering-related poor effort. The Clinical Neuropsychologist, 27, 1019-1042.
- R Core Team. (2015). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
- Rogers R. (2008). Clinical assessment of malingering and deception (3rd ed.). New York, NY: Guilford.
- Shulman K. I., Herrmann N., Brodaty H., Chiu H., Lawlor B., Ritchie K., Scanlan J. M. (2006). IPA survey of brief cognitive screening instruments. International Psychogeriatrics, 18, 281-294.
- Thomas M. L., Lanyon R. I., Millsap R. E. (2009). Validation of diagnostic measures based on latent class analysis: A step forward in response bias research. Psychological Assessment, 21, 227-230.
- Tolin D. F., Steenkamp M. M., Marx B. P., Litz B. T. (2010). Detecting symptom exaggeration in combat veterans using the MMPI-2 symptom validity scale: A mixed group validation. Psychological Assessment, 22, 729-736.
- Valenstein P. N. (1990). Evaluating diagnostic tests with imperfect standards. American Journal of Clinical Pathology, 93, 252-258.
- Wainer H., Thissen D. (2001). True score theory: The traditional method. In Wainer H., Thissen D. (Eds.), Test scoring (pp. 23-72). Mahwah, NJ: Lawrence Erlbaum.
- Widiger T. A., Samuel D. B. (2005). Diagnostic categories or dimensions? A question for the Diagnostic and statistical manual of mental disorders. Journal of Abnormal Psychology, 114, 494-504.
- World Health Organization. (1992). The ICD-10 classification of mental and behavioural disorders: Clinical descriptions and diagnostic guidelines. Geneva, Switzerland: Author.
- Yerushalmy J. (1947). Statistical problems in assessing methods of medical diagnosis, with special reference to X-ray techniques. Public Health Reports, 62, 1432-1449.
