Validation Study Methods for Estimating Odds Ratio in 2 × 2 × J Tables When Exposure is Misclassified

Bijan Nouri; Najaf Zare; Seyyed Mohammad Taghi Ayatollahi

doi:10.1155/2013/170120

2013 Dec 23;2013:170120.

PMCID: PMC3884800 PMID: 24454529

Abstract

Background. Misclassification of exposure variables in epidemiologic studies may lead to biased estimation of parameters and loss of power in statistical inferences. In this paper, the inverse matrix method, an efficient method for correcting the odds ratio for misclassification of a binary exposure, was generalized to nondifferential misclassification and 2 × 2 × J tables. Methods. Simple estimates for predictive values when misclassification is nondifferential are presented. Using them, we estimated the corrected log odds ratio and its variance for 2 × 2 × J tables, using the inverse matrix method. A two-step weighted likelihood method was also developed. Moreover, we compared the matrix and inverse matrix methods to the maximum likelihood (MLE) method using a simulation study. Results. In all situations, the inverse matrix method proved to be more efficient than the matrix method. The matrix and inverse matrix methods are more efficient under nondifferential than under differential misclassification. Conclusions. Although MLE is optimal among all of the methods, it is computationally difficult and requires programming. On the other hand, the inverse matrix method, which has a simple closed form, offers acceptable efficiency.

1. Introduction

In epidemiologic studies, where assessing the relationship between exposure and outcome variables is the main goal, misclassification of the exposure variable leads to a biased estimate of the odds ratio. In multicenter clinical trials, as the number of centers increases, so does the possibility of misclassification of the exposure variable and of the biases induced by it. Methods to correct for a possibly misclassified exposure, so that the strength of the association between exposure and outcome variables can be precisely assessed, have been a focus of statistical and epidemiological research for over 30 years. Beginning with classic papers, the issue of misclassification in tabular data has long been recognized and its adjustment has been considered [1–5]. In 1977, the matrix method was presented by Barron in order to correct nondifferential misclassification in 2 × 2 tables [6]. Greenland and Kleinbaum generalized it to differential misclassification and matched-pair data [7]. In addition, Greenland proposed a variance estimate for the matrix method under the assumptions of differential and nondifferential misclassifications [8]. Selen, in a simulation study, showed that the matrix and the maximum likelihood methods performed equally well [9]. In 1990, Marshall presented a more direct inverse matrix approach for correcting differential misclassifications [10]. Morrissey and Spiegelman compared the matrix and inverse matrix methods to the maximum-likelihood estimator using a grid search for 2 × 2 tables [11]. They found that, under the assumption of differential misclassification, the inverse matrix method was always more efficient than the matrix method. Examples of validation studies in the misclassification context in linear models are prevalent in the statistical and epidemiologic literature [12–15]. More recent works are included in the reference section [16–20].

The main impediment to the use of the inverse matrix method, despite its superiority compared to the matrix method, is that it is restricted to differential misclassification and 2 × 2 tables, while, in practice, it is very likely that misclassification is nondifferential or that we have a 2 × 2 × J table, in which J determines the number of centers in a multicenter clinical trial or is the levels of a confounder [21]. For example, when misclassification is due to recall bias, it is expected that the rates of misclassification in case and control groups are the same, leading to nondifferential misclassification. Here, we aim to further extend the focus within the inverse matrix method when misclassification is nondifferential and data is stratified on J strata.

Section 2.1 provides definitions and notation for a 2 × 2 × J table with a binary error-prone exposure. For the first time, we propose estimates for negative and positive predictive values as misclassification parameters under the assumption of nondifferential misclassification, which can be used to generalize the inverse matrix method to nondifferential cases (Section 2.2). Then we generalize the inverse matrix approach to the nondifferential assumption and to situations where the data are stratified on J strata (Section 3.1). Intuitive closed-form formulae for the misclassification-adjusted effect and its variance are presented. The matrix method and the likelihood method are then briefly reviewed. In addition, we present a new two-step method in Section 3.3, which uses cell counts corrected by the inverse matrix method to construct a weighted maximum likelihood method. Finally, a simulation study compares the mean square error (MSE) of each of the presented methods to that of the MLE under the assumptions of differential and nondifferential misclassifications. The main text focuses on simple formulae for conditions involving a binary misclassified exposure in a case-control study, with stratification on confounders. Several assumptions were made in this work. First, the disease status and level of strata were measured accurately. Second, the methods required an error-free criterion for exposure to validate the misclassified exposure in the validation study. Finally, we assumed that the main and validation studies are independent. The formulae are illustrated with data from an often-cited case-control study of sudden infant death syndrome (SIDS) and maternal use of antibiotics during pregnancy [22].

2. Method and Material

2.1. Definitions and Notations

To begin, consider a case-control study sample in which a binary exposure is measured by an error-prone method and a binary outcome is measured correctly, together with a J-level variable S (S can be a confounder or a combination of some confounders). The data can then be classified in a 2 × 2 × J table whose jth stratum contains n_1j cases and n_0j controls. Because the exposure status X* is an error-prone variable, the cross-table will be misclassified. The jth stratum of a 2 × 2 × J misclassified table and the notations that are employed in this paper are displayed in Table 1.

Table 1.

Notation for jth stratum of a 2 × 2 × J misclassified table.

	Error-prone exposure (X*)
Outcome (Y)	1	0	Total
1	A_1j* (n_11j)	B_1j* (n_01j)	n_1j
0	A_0j* (n_10j)	B_0j* (n_00j)	n_0j

X is supposed to be an error-free variable that shows the actual status of the exposure. The sensitivity and specificity of the error-prone diagnosis of exposure are known as misclassification parameters or misclassification rates and are defined by SE_kj = P(X* = 1 | X = 1, Y = k, S = j) and SP_kj = P(X* = 0 | X = 0, Y = k, S = j) for k = 0, 1 and j = 1, …, J, respectively. Misclassification rates can also be expressed through another set of parameters, known as positive and negative predictive values, which are defined as PPV_kj = P(X = 1 | X* = 1, Y = k, S = j) and NPV_kj = P(X = 0 | X* = 0, Y = k, S = j) for k = 0, 1 and j = 1, …, J, respectively. Among the different approaches developed to deal with the misclassification problem, the inverse matrix method (direct method) uses positive and negative predictive values as misclassification parameters; on the other hand, the matrix method (indirect method) and the likelihood method both use sensitivity and specificity values instead. Let e_kj* = P(X* = 1 | Y = k, S = j) = A_kj*/n_kj represent the prevalence of error-prone exposure for k = 0, 1 and j = 1, …, J, and let f_kj* = 1 − e_kj*.

Both sets of misclassification parameters are unknown and must be estimated through a validation study. Thus, in addition to the sample described above, known as the main study, another random sample of size m is drawn separately. This sample makes up the external validation study. In order to estimate the misclassification parameters, both the error-prone exposure status X* and the correct exposure status, a dichotomous error-free variable X, should be measured for each subject of the validation sample.

2.2. Estimation of Misclassification Parameters

Misclassification parameters are allowed to vary across the J strata in this approach, and separate misclassification parameters can be estimated for each stratum; if, however, the misclassification rates are similar across strata, the validation samples of the strata should be combined. For simplicity of notation, we describe the case in which misclassification rates are the same across strata; otherwise, estimation of the parameters is similar. Table 2 displays the available data m_kli for the validation study, where subscripts k = 0, 1, l = 0, 1, and i = 0, 1 indicate the outcome, error-free, and error-prone exposure status, respectively. Misclassification parameters and their variances are estimated in the validation study by the following formulae:

\begin{matrix} {SE}_{k} = \frac{m_{k 11}}{m_{k 1 .}}, \quad {SP}_{k} = \frac{m_{k 00}}{m_{k 0 .}}, \\ Var ({SE}_{k}) = \frac{{SE}_{k} (1 - {SE}_{k})}{m_{k 1 .}}, \\ Var ({SP}_{k}) = \frac{{SP}_{k} (1 - {SP}_{k})}{m_{k 0 .}}, \\ {PPV}_{k} = \frac{m_{k 11}}{m_{k . 1}}, \quad {NPV}_{k} = \frac{m_{k 00}}{m_{k . 0}}, \\ Var ({PPV}_{k}) = \frac{{PPV}_{k} (1 - {PPV}_{k})}{m_{k . 1}}, \\ Var ({NPV}_{k}) = \frac{{NPV}_{k} (1 - {NPV}_{k})}{m_{k . 0}} . \end{matrix}

(1)

In this case, a point, “.”, in the subscripts stands for summation over that subscript. Misclassification is nondifferential when the misclassification rates are independent of the outcome status. In other words, when P(X* = 1 | X = 1, Y = k) = P(X* = 1 | X = 1) and P(X* = 0 | X = 0, Y = k) = P(X* = 0 | X = 0); otherwise, misclassification is differential. We can easily adapt estimates of sensitivity and specificity to the following nondifferential assumption:

\begin{matrix} SE = \frac{m_{. 11}}{m_{. 1 .}}, SP = \frac{m_{. 00}}{m_{. 0 .}}, \\ Var (SE) = \frac{SE (1 - SE)}{m_{. 1 .}}, \\ Var (SP) = \frac{SP (1 - SP)}{m_{. 0 .}} . \end{matrix}

(2)
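As a minimal sketch, equations (1) and (2) can be evaluated directly from the validation counts of Table 2. The nested dictionary `m` and the function names `proportion`, `differential_rates`, and `nondifferential_se_sp` are illustrative choices made here, not part of the original paper.

```python
# Validation counts m[k][l][i] from Table 2, where k is the outcome Y,
# l is the error-free exposure X, and i is the error-prone exposure X*.
m = {
    1: {1: {1: 29, 0: 17}, 0: {1: 22, 0: 143}},
    0: {1: {1: 21, 0: 16}, 0: {1: 12, 0: 168}},
}


def proportion(successes, total):
    """Return a binomial proportion p and its variance estimate p(1 - p)/total."""
    p = successes / total
    return p, p * (1 - p) / total


def differential_rates(m, k):
    """Equation (1): SE_k, SP_k, PPV_k, NPV_k as (estimate, variance) pairs."""
    truly_exposed = m[k][1][1] + m[k][1][0]         # m_k1.
    truly_unexposed = m[k][0][1] + m[k][0][0]       # m_k0.
    classified_exposed = m[k][1][1] + m[k][0][1]    # m_k.1
    classified_unexposed = m[k][1][0] + m[k][0][0]  # m_k.0
    return {
        "SE": proportion(m[k][1][1], truly_exposed),
        "SP": proportion(m[k][0][0], truly_unexposed),
        "PPV": proportion(m[k][1][1], classified_exposed),
        "NPV": proportion(m[k][0][0], classified_unexposed),
    }


def nondifferential_se_sp(m):
    """Equation (2): SE and SP pooled over the case and control groups."""
    true_pos = m[1][1][1] + m[0][1][1]                     # m_.11
    truly_exposed = true_pos + m[1][1][0] + m[0][1][0]     # m_.1.
    true_neg = m[1][0][0] + m[0][0][0]                     # m_.00
    truly_unexposed = true_neg + m[1][0][1] + m[0][0][1]   # m_.0.
    return {
        "SE": proportion(true_pos, truly_exposed),
        "SP": proportion(true_neg, truly_unexposed),
    }


cases = differential_rates(m, 1)
controls = differential_rates(m, 0)
pooled = nondifferential_se_sp(m)
```

The sensitivity, specificity, and differential predictive-value estimates produced this way can be compared with the corresponding entries of Table 4.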

Once we assume that misclassification is nondifferential, the sensitivity and specificity are the same in the case and control groups. Nondifferential sensitivity and specificity do not, however, imply that PPV and NPV are equal across strata of the outcome, because PPV and NPV are functions of exposure prevalence as well as of sensitivity and specificity. Because of this problem, the methods that use predictive values as misclassification parameters (such as the inverse matrix method) have been restricted to the differential assumption. In order to overcome this difficulty, Morrissey and Spiegelman recommended the use of the same predictive values as those estimated under the differential misclassification assumption, which is not a reasonable approach [11]. Instead, in the case of nondifferential misclassification, we present the following estimates for predictive values:

\begin{matrix} {PPV}_{k} = \frac{SE \, m_{k 1 .}}{m_{k}^{'}}, \quad {NPV}_{k} = \frac{SP \, m_{k 0 .}}{m_{k . .} - m_{k}^{'}}, \\ Var ({PPV}_{k}) = \frac{{PPV}_{k} (1 - {PPV}_{k})}{m_{k}^{'}}, \\ Var ({NPV}_{k}) = \frac{{NPV}_{k} (1 - {NPV}_{k})}{m_{k . .} - m_{k}^{'}}, \end{matrix}

(3)

where m_k′ = SE m_k1. + (1 − SP) m_k0. If sensitivity and specificity are constant and known, using the probability rules, we can estimate predictive values as follows:

\begin{matrix} {PPV}_{k} = \frac{{SE}_{k} ({S P}_{k} - f_{k}^{*})}{e_{k}^{*} D_{k}}, {NPV}_{k} = \frac{{SP}_{k} ({SE}_{k} - e_{k}^{*})}{f_{k}^{*} D_{k}} . \end{matrix}

(4)

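In (4), D_k = SE_k + SP_k − 1, and e_k* and f_k* = 1 − e_k* are the observed proportions of subjects classified as exposed and unexposed. When equality of the misclassification parameters across the J strata cannot be assumed, the misclassification parameters can be estimated separately for each stratum, using a distinct validation sample for each stratum and the above relations.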
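The nondifferential estimates in (3) and the conversion in (4) can be sketched as follows. The function names are hypothetical, and the inputs are the pooled SE and SP from (2) together with the validation margins m_k1. and m_k0. read from Table 2.

```python
def nondifferential_predictive_values(se, sp, m_k1, m_k0):
    """Equation (3): PPV_k and NPV_k as (estimate, variance) pairs.

    se, sp : pooled sensitivity and specificity from equation (2)
    m_k1   : number of truly exposed validation subjects in outcome group k
    m_k0   : number of truly unexposed validation subjects in outcome group k
    """
    m_prime = se * m_k1 + (1 - sp) * m_k0   # expected number classified exposed
    m_rest = (m_k1 + m_k0) - m_prime        # expected number classified unexposed
    ppv = se * m_k1 / m_prime
    npv = sp * m_k0 / m_rest
    return {
        "PPV": (ppv, ppv * (1 - ppv) / m_prime),
        "NPV": (npv, npv * (1 - npv) / m_rest),
    }


def predictive_values_from_prevalence(se, sp, e_star):
    """Equation (4): PPV and NPV from SE, SP and the observed prevalence e*."""
    f_star = 1 - e_star
    d = se + sp - 1
    ppv = se * (sp - f_star) / (e_star * d)
    npv = sp * (se - e_star) / (f_star * d)
    return ppv, npv


# Pooled SE and SP from Table 2, then predictive values for cases and controls.
se, sp = 50 / 83, 311 / 345
cases = nondifferential_predictive_values(se, sp, m_k1=46, m_k0=165)
controls = nondifferential_predictive_values(se, sp, m_k1=37, m_k0=180)
```

With these inputs the estimates agree with the nondifferential column of Table 4 to about three decimal places. As a consistency check on (4), plugging in the differential SE_1 and SP_1 with the observed validation prevalence 51/211 returns the differential PPV_1 = 29/51 and NPV_1 = 143/160.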

Table 2.

Data layout for the validation study.

Validation study
	Y = 1		Y = 0
X*	X = 1	X = 0	X = 1	X = 0
1	m₁₁₁ = 29	m₁₀₁ = 22	m₀₁₁ = 21	m₀₀₁ = 12
0	m₁₁₀ = 17	m₁₀₀ = 143	m₀₁₀ = 16	m₀₀₀ = 168

Cell entries are the validation counts m_kli (outcome k, error-free exposure l, error-prone exposure i).

Example 1 —

Table 3 exhibits the main study data from an often-cited case-control study on sudden infant death syndrome (SIDS), which examined the relationship between maternal use of antibiotics during pregnancy and the odds of SIDS [22]. The table is classified on two strata (J = 2) according to the sex of infants, a common risk factor for SIDS. The error-prone measurement of drug use was an interview response X*, and it was validated by medical records X. A separate sample from a validation study with dual exposure measurement (both error-prone and error-free) is presented in Table 2. Using (1)–(3) and the validation data, separate estimates of sensitivity, specificity, and predictive values under both the differential and nondifferential misclassification assumptions are presented in Table 4. Two cells in the last column of Table 4 are blank because, for nondifferential misclassification, sensitivity and specificity are the same for case and control groups.

Table 3.

Uncorrected data from SIDS study stratified on the sex of infant.

	Males		Females
Interview response (X*)	Y = 1	Y = 0	Y = 1	Y = 0
Use	80	42	42	59
No use	257	261	185	218

Total	337	303	227	277

Table 4.

Estimates of misclassification parameters and their variance.

	Differential	Nondifferential
PPV₁	0.5686 (0.0048096)	0.6300 (0.0053002)
PPV₀	0.6363 (0.0070127)	0.5567 (0.0061395)
NPV₁	0.8937 (0.0005937)	0.8904 (0.0005842)
NPV₀	0.9130 (0.0004316)	0.9168 (0.0004310)
SE₁	0.6304 (0.0050651)	0.6024 (0.0028857)
SE₀	0.5676 (0.0066332)
SP₁	0.8667 (0.0007003)	0.9014 (0.0025761)
SP₀	0.9333 (0.0003457)

These estimates are based on the assumption that misclassification rates are the same on male and female strata.

3. Estimate of the Actual Association

3.1. Inverse Matrix Method

As mentioned in the introduction, the inverse matrix method was proposed by Marshall to correct misclassified 2 × 2 tables [10]. He restricted the use of his method to the assumption of differential misclassification. With the estimates we present for predictive values in (3), the inverse matrix method can be extended to the nondifferential assumption. Now, we generalize this approach to 2 × 2 × J tables with misclassified exposure and either differential or nondifferential assumptions. Estimates for the correct proportions of exposed and unexposed subjects in the jth stratum of the main study are e_kj = PPV_kj e_kj* + (1 − NPV_kj)f_kj* and f_kj = (1 − PPV_kj)e_kj* + NPV_kj f_kj*, respectively. These equations can be written in matrix form as

\begin{matrix} \begin{bmatrix} e_{1 j} \\ f_{1 j} \\ e_{0 j} \\ f_{0 j} \end{bmatrix} = \begin{bmatrix} {PPV}_{1 j} & 1 - {NPV}_{1 j} & 0 & 0 \\ 1 - {PPV}_{1 j} & {NPV}_{1 j} & 0 & 0 \\ 0 & 0 & {PPV}_{0 j} & 1 - {NPV}_{0 j} \\ 0 & 0 & 1 - {PPV}_{0 j} & {NPV}_{0 j} \end{bmatrix} \begin{bmatrix} e_{1 j}^{*} \\ f_{1 j}^{*} \\ e_{0 j}^{*} \\ f_{0 j}^{*} \end{bmatrix} . \end{matrix}

(5)

The biased log odds ratio estimate of the jth stratum is θ_j* = log(A_1j* B_0j* / (A_0j* B_1j*)), with asymptotic variance estimate V_j* = ∑_k (1/A_kj* + 1/B_kj*), where k = 0, 1. The corrected values in (5) are used to build the inverse matrix corrected log-odds-ratio θ_j = log(e_1j f_0j / (e_0j f_1j)), where e_kj estimates the actual prevalence of exposure P(X = 1 | Y = k, S = j) for k = 0, 1 and j = 1, …, J, and f_kj = 1 − e_kj. Assuming a binomial distribution for A_kj*, we derived the asymptotic variance of θ_j using the delta method as

\begin{matrix} V_{j j} = Var (θ_{j}) = \sum_{k = 0, 1} \frac{d_{k j}^{2} V_{e_{k j}^{*}} + {e_{k j}^{*}}^{2} V_{{PPV}_{k j}} + {f_{k j}^{*}}^{2} V_{{NPV}_{k j}}}{e_{k j}^{2} f_{k j}^{2}}, \end{matrix}

(6)

where V abbreviates variance and d_kj = PPV_kj + NPV_kj − 1. If nondifferential misclassification is assumed, the predictive values and variance estimates from (3) must be substituted in expression (6) for those from (1). In addition, the corrected proportions in the disease group are then no longer independent of the corrected proportions in the control group. Consequently, twice the covariance between the log odds, given in (7), must be subtracted from expression (6). Consider

\begin{matrix} \frac{V_{{SE}_{j}}}{f_{0 j} f_{1 j} D_{j}^{2}} + \frac{V_{{SP}_{j}}}{e_{0 j} e_{1 j} D_{j}^{2}} . \end{matrix}

(7)
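For a single stratum under the differential assumption, equations (5) and (6) can be sketched as below. The function name and the dictionary layout are hypothetical, and V_{e*} is taken as the binomial variance e*f*/n, in line with the binomial assumption stated above. The covariance correction (7) needed under the nondifferential assumption is deliberately not included.

```python
import math


def inverse_matrix_log_or(counts, ppv, npv, var_ppv, var_npv):
    """Inverse matrix corrected log odds ratio and its variance for one stratum.

    counts           : {k: (A*_k, B*_k)} observed exposed and unexposed counts
                       for cases (k = 1) and controls (k = 0)
    ppv, npv         : {k: value} predictive values for each outcome group
    var_ppv, var_npv : {k: value} their variance estimates
    """
    log_odds = {}
    variance = 0.0
    for k, (a_star, b_star) in counts.items():
        n = a_star + b_star
        e_star = a_star / n
        f_star = 1 - e_star
        # Equation (5): corrected proportions of exposed and unexposed subjects.
        e = ppv[k] * e_star + (1 - npv[k]) * f_star
        f = 1 - e
        # Equation (6): delta-method variance of the corrected log odds.
        d = ppv[k] + npv[k] - 1
        var_e_star = e_star * f_star / n
        variance += (d ** 2 * var_e_star
                     + e_star ** 2 * var_ppv[k]
                     + f_star ** 2 * var_npv[k]) / (e ** 2 * f ** 2)
        log_odds[k] = math.log(e / f)
    return log_odds[1] - log_odds[0], variance


# Male stratum of Table 3 with the differential estimates of Table 4.
theta_male, var_male = inverse_matrix_log_or(
    counts={1: (80, 257), 0: (42, 261)},
    ppv={1: 0.5686, 0: 0.6363},
    npv={1: 0.8937, 0: 0.9130},
    var_ppv={1: 0.0048096, 0: 0.0070127},
    var_npv={1: 0.0005937, 0: 0.0004316},
)
```

When PPV = NPV = 1 and their variances are zero, the function reduces to the uncorrected log odds ratio with the usual sum-of-reciprocals variance, which gives a quick sanity check.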

If a separate set of misclassification rates is estimated in the validation study for each stratum, the log odds ratios are independent; otherwise, using the same misclassification parameters for all the strata leads to the covariance (8) between θ_j and θ_i. Consider

\begin{matrix} V_{i j} = \sum_{k = 0, 1}^{} \frac{(V_{{SE}_{k}} / f_{k j} f_{k i} + V_{{SP}_{k}} / e_{k j} e_{k i})}{D_{k}^{2}} . \end{matrix}

(8)

If the same misclassification parameters for all the strata are estimated by assuming nondifferential misclassification, the covariance estimate for θ_j and θ_i is given by

\begin{matrix} V_{i j} = \sum_{k = 0, 1}^{} \sum_{h = 0, 1}^{} {(- 1)}^{h + k} \frac{(V_{SE} / f_{k j} f_{h i} + V_{SP} / e_{k j} e_{h i})}{D^{2}} . \end{matrix}

(9)

Let V be the symmetric J × J variance-covariance matrix of the J estimated log odds ratios, with ijth element V_ij. Then θ_w = (θ′V⁻¹1)V_w is the minimum-variance weighted average of the estimated log odds ratios, where V_w = 1/(1′V⁻¹1), and 1 and θ are the J-vectors of ones and of estimated log odds ratios, respectively. For large strata samples, θ_w is a consistent, asymptotically normal estimator of the actual log odds ratio, with consistent variance estimator V_w. Hence, a corrected Wald test of no association can be constructed using the z statistic θ_w/√V_w, with a corresponding corrected 100(1 − α)% confidence interval for the actual log odds ratio given by θ_w ± z_t√V_w, where z_t is the t = 1 − α/2 quantile of the standard normal distribution.

To evaluate the homogeneity of the θ_j, we can use the approximate chi-square statistic X² = θ′V⁻¹θ − V_w⁻¹θ_w², with J − 1 degrees of freedom. Large values of this statistic indicate a lack of homogeneity, in which case θ_w is an inadequate summary of the association.
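The pooled estimate θ_w, its variance V_w, the Wald interval, and the homogeneity statistic X² can be sketched with the standard library alone. The helper `solve` and the function `pool_log_odds_ratios` are names introduced here for illustration; the stratum estimates and the matrix V are assumed to have been computed beforehand.

```python
import math
from statistics import NormalDist


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def pool_log_odds_ratios(theta, V, alpha=0.05):
    """Minimum-variance weighted average of stratum log odds ratios.

    theta : list of J corrected log odds ratios
    V     : J x J variance-covariance matrix as a list of lists
    """
    v_inv_one = solve(V, [1.0] * len(theta))    # V^{-1} 1
    v_inv_theta = solve(V, list(theta))         # V^{-1} theta
    v_w = 1.0 / sum(v_inv_one)                  # 1 / (1' V^{-1} 1)
    theta_w = v_w * sum(t * w for t, w in zip(theta, v_inv_one))
    z_t = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_t * math.sqrt(v_w)
    # Homogeneity statistic, approximately chi-square with J - 1 df.
    x2 = sum(t * w for t, w in zip(theta, v_inv_theta)) - theta_w ** 2 / v_w
    return {
        "theta_w": theta_w,
        "v_w": v_w,
        "z": theta_w / math.sqrt(v_w),
        "ci": (theta_w - half_width, theta_w + half_width),
        "x2": x2,
    }


# Illustrative inputs only: two independent strata with made-up estimates.
result = pool_log_odds_ratios([0.6, -0.2], [[0.04, 0.0], [0.0, 0.05]])
```

With a diagonal V the weights reduce to ordinary inverse-variance weights, and for two independent strata X² equals (θ_1 − θ_2)²/(V_11 + V_22).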

3.2. Matrix Method

Now we briefly outline the matrix method. In this method sensitivity and specificity are used as misclassification parameters to estimate the log-odds ratio. The matrix method uses the equations A_kj* = SE_kj A_kj + (1 − SP_kj)B_kj and B_kj* = (1 − SE_kj)A_kj + SP_kj B_kj to estimate correct cell counts in the jth stratum; in matrix form these equations can be written as

\begin{matrix} \begin{bmatrix} A_{1 j} \\ B_{1 j} \\ A_{0 j} \\ B_{0 j} \end{bmatrix} = \begin{bmatrix} {SE}_{1 j} & 1 - {SP}_{1 j} & 0 & 0 \\ 1 - {SE}_{1 j} & {SP}_{1 j} & 0 & 0 \\ 0 & 0 & {SE}_{0 j} & 1 - {SP}_{0 j} \\ 0 & 0 & 1 - {SE}_{0 j} & {SP}_{0 j} \end{bmatrix}^{- 1} \begin{bmatrix} A_{1 j}^{*} \\ B_{1 j}^{*} \\ A_{0 j}^{*} \\ B_{0 j}^{*} \end{bmatrix} . \end{matrix}

(10)

The matrix corrected log-odds-ratio is θ_j = log(e_1j f_0j / (e_0j f_1j)), where e_kj = A_kj/n_kj and f_kj = 1 − e_kj. Assuming a binomial distribution for A_kj*, Greenland provided the asymptotic variance of θ_j, given by

\begin{matrix} V_{j j} = Var (θ_{j}) = \sum_{k = 0, 1} \frac{V_{{SE}_{k j}} / f_{k j}^{2} + V_{{SP}_{k j}} / e_{k j}^{2} + e_{k j}^{*} f_{k j}^{*} / (n_{k j} e_{k j}^{2} f_{k j}^{2})}{D_{k j}^{2}}, \end{matrix}

(11)

where D_kj = SE_kj + SP_kj − 1. The other components of the variance-covariance matrix are similar to what we derived for the inverse matrix method.

This method has two limitations: when the sum of sensitivity and specificity is equal to one, the corrected count estimates are undefined because the misclassification matrix is singular in (10); if the sum is less than one, negative estimates for corrected cell counts occur.
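A single-stratum sketch of the matrix method, combining the correction in (10) with the variance in (11), is given below. The function name and argument layout are hypothetical. The explicit checks mirror the two limitations just described: a singular or sign-reversing misclassification matrix, and corrected proportions that fall outside (0, 1).

```python
import math


def matrix_method_log_or(counts, se, sp, var_se, var_sp):
    """Matrix method corrected log odds ratio and variance for one stratum.

    counts         : {k: (A*_k, B*_k)} observed counts for cases (k = 1)
                     and controls (k = 0)
    se, sp         : {k: value} sensitivity and specificity per outcome group
    var_se, var_sp : {k: value} their variance estimates
    """
    log_odds = {}
    variance = 0.0
    for k, (a_star, b_star) in counts.items():
        n = a_star + b_star
        d = se[k] + sp[k] - 1
        if d <= 0:
            raise ValueError("SE + SP must exceed 1 for the matrix method")
        e_star = a_star / n
        f_star = 1 - e_star
        # First row of the inverted 2 x 2 block in (10), divided by n.
        e = (e_star - (1 - sp[k])) / d
        f = 1 - e
        if not 0 < e < 1:
            raise ValueError("corrected proportion falls outside (0, 1)")
        # Equation (11), contribution of outcome group k.
        variance += (var_se[k] / f ** 2
                     + var_sp[k] / e ** 2
                     + e_star * f_star / (n * e ** 2 * f ** 2)) / d ** 2
        log_odds[k] = math.log(e / f)
    return log_odds[1] - log_odds[0], variance


# Male stratum of Table 3 with the differential estimates of Table 4.
theta_male, var_male = matrix_method_log_or(
    counts={1: (80, 257), 0: (42, 261)},
    se={1: 0.6304, 0: 0.5676},
    sp={1: 0.8667, 0: 0.9333},
    var_se={1: 0.0050651, 0: 0.0066332},
    var_sp={1: 0.0007003, 0: 0.0003457},
)
```

Raising an error rather than returning a truncated estimate is a design choice of this sketch; the paper itself only notes that such cases lead to undefined or negative corrected counts.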

3.2.1. Direct Likelihood Method

Some authors use the maximum likelihood estimate (MLE) to estimate the actual odds ratio and test the exposure-disease association. Although this method can be the most efficient method of correction, it does not have a closed form and requires an iterative solution to a nonlinear set of equations under some constraints. If the error-free exposure variable X were available, we would be able to estimate the covariate-adjusted odds ratio by the following multiple logistic regression model:

\begin{matrix} logit [Pr (Y = 1 ∣ X, C)] = α + θ x + \sum_{j = 1}^{J - 1} γ_{j} c_{j} . \end{matrix}

(12)

However, we only have access to the error-prone binary variable X*. We assume that the covariates C_j (j = 1, …, J − 1) are dummy variables that indicate the stratum to which a subject belongs, with C_j = 1 if the subject belongs to the jth stratum. As mentioned before, θ denotes the overall log odds ratio. To obtain the ML estimate of the unknown parameter θ, we numerically maximize the following log-likelihood function with respect to θ. An approximate standard error for this estimate is obtained by inverting the observed information matrix. Consider

\begin{matrix} l (θ) = \sum_{j}^{} \sum_{k}^{} \sum_{i}^{} n_{i k j} \ln (Pr (X^{*} = i, Y = k ∣ C_{j} = c)), \end{matrix}

(13)

where n_ikj was presented in Table 1; note that Pr(X*, Y | C = c) in expression (13) is the observed-data contribution to the likelihood function, which can be obtained as follows:

\begin{matrix} \sum_{l = 0}^{1} Pr (X^{*} = i ∣ X = l, Y = k, C = c) \\ \times Pr (Y = k ∣ X = l, C = c) Pr (X = l ∣ C = c), \end{matrix}

(14)

where the first term in (14) can be obtained using the estimated sensitivity and specificity in (1) or (2) according to whether misclassification is differential or nondifferential, the second term displays the logistic model (12), and finally the last term is a nuisance parameter that can be modeled via a second logistic regression model of X on C.
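The observed-data log-likelihood in (13) and (14) can be sketched as a plain function that a numerical optimizer would then maximize. Several choices here are assumptions for illustration: strata are indexed from 0 with stratum 0 as the reference (so `gamma[0]` is 0), the nuisance term Pr(X = 1 | C) is represented by one prevalence per stratum rather than by a fitted second logistic model, and SE and SP are treated as fixed at their validation estimates. No optimizer is included.

```python
import math


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def observed_cell_prob(i, k, j, alpha, theta, gamma, prev, se, sp):
    """Equation (14): Pr(X* = i, Y = k | stratum j), summing over the true X.

    gamma : list of stratum effects, gamma[0] = 0 for the reference stratum
    prev  : list with prev[j] = Pr(X = 1 | stratum j)
    se,sp : {k: value} sensitivity and specificity by outcome group
    """
    total = 0.0
    for l in (0, 1):
        p_x = prev[j] if l == 1 else 1 - prev[j]
        p_y1 = expit(alpha + theta * l + gamma[j])      # logistic model (12)
        p_y = p_y1 if k == 1 else 1 - p_y1
        p_star1 = se[k] if l == 1 else 1 - sp[k]        # Pr(X* = 1 | X = l, Y = k)
        p_star = p_star1 if i == 1 else 1 - p_star1
        total += p_star * p_y * p_x
    return total


def log_likelihood(n, alpha, theta, gamma, prev, se, sp):
    """Equation (13): n[(i, k, j)] are the observed main-study counts."""
    return sum(
        count * math.log(observed_cell_prob(i, k, j, alpha, theta, gamma, prev, se, sp))
        for (i, k, j), count in n.items()
    )


# Table 3 counts keyed by (X*, Y, stratum), with males as stratum 0.
n = {
    (1, 1, 0): 80, (0, 1, 0): 257, (1, 0, 0): 42, (0, 0, 0): 261,
    (1, 1, 1): 42, (0, 1, 1): 185, (1, 0, 1): 59, (0, 0, 1): 218,
}
se = {1: 0.6304, 0: 0.5676}
sp = {1: 0.8667, 0: 0.9333}
# Arbitrary starting values, chosen only to show the call.
ll = log_likelihood(n, alpha=0.0, theta=0.2, gamma=[0.0, -0.1],
                    prev=[0.2, 0.2], se=se, sp=sp)
```

The function evaluates the likelihood at given parameter values only; maximizing it, and inverting the observed information for a standard error, would require a numerical routine that is outside this sketch.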

3.3. A Weighted Likelihood Method

We now briefly present a two-step method that combines the inverse matrix method and the likelihood method. In the first step, we use the inverse matrix method to correct the misclassified tables. In the second step, the corrected cell counts from the first step are used as weights on the corresponding (x, y, j) log-likelihood contributions. Formally, the weighted log-likelihood is given by

\begin{matrix} l (θ) = \sum_{j = 1}^{J} \sum_{k = 0}^{1} \sum_{l = 0}^{1} N_{l k j} l_{l k j} (θ), \end{matrix}

(15)

where N_lkj represents the cell count corrected by the inverse matrix method, l_lkj(θ) is its log-likelihood contribution, and the subscripts l = 0, 1, k = 0, 1, and j = 1, …, J indicate error-free exposure, outcome, and stratum, respectively. This weighted method is simpler than the previous likelihood method and directly uses the (x, y, j) log-likelihood contributions instead of (x*, y, j). The variance of the corrected log-OR can be obtained by inverting the Fisher information matrix.
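The second step can be sketched as a weighted logistic regression fitted by Newton-Raphson, with the corrected counts N_lkj as weights. This is a minimal illustration under stated assumptions: the function names are hypothetical, strata are indexed from 0 with stratum 0 as the reference, and the variance returned is the naive inverse information, which does not account for the uncertainty in the corrected counts themselves.

```python
import math


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_weighted_logistic(weights, n_strata, max_iter=25, tol=1e-10):
    """Maximize the weighted log-likelihood (15) for model (12).

    weights : {(l, k, j): N_lkj} corrected counts for exposure l, outcome k,
              stratum j = 0, ..., n_strata - 1
    Returns (beta, var_theta) with beta = [alpha, theta, gamma_1, ...].
    """
    p = n_strata + 1
    beta = [0.0] * p

    def design(l, j):
        row = [1.0, float(l)] + [0.0] * (n_strata - 1)
        if j > 0:
            row[1 + j] = 1.0      # dummy variable for stratum j
        return row

    for _ in range(max_iter):
        score = [0.0] * p
        info = [[0.0] * p for _ in range(p)]
        for (l, k, j), w in weights.items():
            x = design(l, j)
            mu = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for a in range(p):
                score[a] += w * (k - mu) * x[a]
                for b in range(p):
                    info[a][b] += w * mu * (1 - mu) * x[a] * x[b]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    unit = [0.0] * p
    unit[1] = 1.0
    var_theta = solve(info, unit)[1]   # (theta, theta) entry of the inverse information
    return beta, var_theta


# Made-up corrected counts for two strata, each with odds ratio 2.
weights = {
    (0, 1, 0): 50, (0, 0, 0): 50, (1, 1, 0): 40, (1, 0, 0): 20,
    (0, 1, 1): 10, (0, 0, 1): 20, (1, 1, 1): 30, (1, 0, 1): 30,
}
beta, var_theta = fit_weighted_logistic(weights, n_strata=2)
```

In this constructed example the fitted θ recovers log 2, the common stratum-specific log odds ratio, because the counts satisfy model (12) exactly.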

3.4. Combining the Results

Now, we need an approach to combine the results from the main study and the internal validation study. We emphasize that, for subjects in the validation study, a conventional analysis using an error-free exposure can be done to obtain a log-OR estimate θ_v with variance estimate V_v. The estimate of the effect for the main study is biased and must be adjusted using the earlier formulae in order to obtain a corrected log-OR estimate θ_m with variance estimate V_m. A weighted method to combine the log-OR from the two samples can be constructed as follows:

\begin{matrix} θ_{c} = \frac{(θ_{m} V_{v} + θ_{v} V_{m})}{(V_{m} + V_{v})} . \end{matrix}

(16)

θ_c has variance estimate V_c = V_m V_v/(V_m + V_v). The weighted method works well if θ_v and θ_m estimate a common value, which is achieved when the samples are randomly selected from the population and the misclassification parameter estimates are unbiased.
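Equation (16) is an inverse-variance weighted average of the two estimates, as the short sketch below makes explicit. The function name `combine_estimates` is chosen here for illustration.

```python
def combine_estimates(theta_m, v_m, theta_v, v_v):
    """Equation (16): combine main-study and validation-study log odds ratios.

    Each estimate is weighted by the variance of the other one, which is
    equivalent to weighting each by the inverse of its own variance.
    """
    theta_c = (theta_m * v_v + theta_v * v_m) / (v_m + v_v)
    v_c = v_m * v_v / (v_m + v_v)
    return theta_c, v_c


# Illustrative inputs only: equal variances give the simple average.
theta_c, v_c = combine_estimates(theta_m=0.2, v_m=0.04, theta_v=0.4, v_v=0.04)
```

With equal variances the combined estimate is the midpoint of the two inputs and the combined variance is half of either one; as V_v grows, θ_c moves toward θ_m.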

Example 2 —

For the data in Table 3, we have misclassified OR = 1.124 and ln(OR) = 0.1174 (0.02357), where the term in parentheses is the estimated variance. Suppose that the investigator assumes that misclassification rates are equal in the strata of male and female infants. Under the nondifferential assumption and using the parameters estimated in Table 4, the matrix method yields $\hat{OR} = 1.26$ and $\ln (\hat{OR}) = 0.2311 (0.04305)$ , and the inverse matrix method yields $\hat{OR} = 1.237$ and $\ln (\hat{OR}) = 0.213 (0.04103)$ . In contrast, if we use the differential estimates of the misclassification rates derived from Table 2, the results from the matrix and inverse matrix methods are $\ln (\hat{OR}) = 0.2678 (0.05147)$ and $\ln (\hat{OR}) = 0.2471 (0.04012)$ , respectively.

Applying the direct ML (13) to the data in Table 3 yields $\hat{β} = \ln (\hat{OR}) = 0.00585 (0.02418)$ and $\hat{β} = \ln (\hat{OR}) = 0.505 (0.0312)$ under the differential and nondifferential assumptions, respectively. Note the dramatic shift in the implication of assuming a differential misclassification. The application of the two-step weighted likelihood approach produces $\hat{β} = \ln (\hat{OR}) = 0.3027 (0.048309)$ and $\hat{β} = \ln (\hat{OR}) = 0.3047 (0.04831)$ given the differential and nondifferential cases, respectively.

4. Simulation Studies

We conducted a simulation study to compare corrected log-odds-ratios obtained from both matrix and inverse matrix methods and also to assess the appropriateness of the maximum likelihood and weighted maximum likelihood for 2 × 2 × J tables. In order to mimic the SIDS example, a total of 1000 sets of data were generated with a binary outcome (Y) and two other binary variables, the error-free exposure status and the stratification indicator (X and S), with a total sample size of 1144. Correct cell counts for each table were generated under the following simulation conditions: stratum prevalence P(S = 1) = 0.5 and conditional prevalences of exposure P(X = 1 | S = 1) = 0.5 and P(X = 1 | S = 0) = 0.5. In order to generate the response variable, a Bernoulli distribution with parameter π_lj = exp(α + θx_l + γS_j)/(1 + exp(α + θx_l + γS_j)) was used, where we set the effect parameters for all the simulation scenarios as α = −1, θ = 1.71 and γ = 1. The misclassified exposure (X*) was generated assuming numerous nondifferential conditions. In order to estimate the misclassification parameters for each of the simulated data sets, three scenarios of 20%, 30%, and 40% of each of their strata were taken as internal validation samples separately. The programming code is available from the authors on request.
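The data-generating step described above can be sketched as follows. This is not the authors' code: the function names, the use of Python's `random` module, and the seed handling are assumptions made for illustration, and the selection of validation subsamples and the estimation steps are left out.

```python
import math
import random


def simulate_misclassified_table(n=1144, alpha=-1.0, theta=1.71, gamma=1.0,
                                 p_s=0.5, p_x=0.5, se=0.90, sp=0.90, seed=0):
    """Generate one 2 x 2 x 2 table of counts keyed by (x_star, y, s).

    Misclassification is nondifferential: se and sp do not depend on y.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        s = int(rng.random() < p_s)                    # stratum indicator S
        x = int(rng.random() < p_x)                    # error-free exposure X
        pi = 1.0 / (1.0 + math.exp(-(alpha + theta * x + gamma * s)))
        y = int(rng.random() < pi)                     # outcome Y
        p_star = se if x == 1 else 1 - sp              # Pr(X* = 1 | X = x)
        x_star = int(rng.random() < p_star)            # error-prone exposure X*
        counts[(x_star, y, s)] = counts.get((x_star, y, s), 0) + 1
    return counts


def mean_square_error(estimates, true_value):
    """MSE of a list of estimates around the true parameter value."""
    return sum((est - true_value) ** 2 for est in estimates) / len(estimates)


table = simulate_misclassified_table(seed=1)
```

Repeating the call with 1000 different seeds, applying each correction method to every generated table, and passing the resulting estimates to `mean_square_error` would give quantities comparable in kind to those reported in Table 5.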

The last four columns of Table 5 represent scenarios in which nondifferential misclassification rates vary across strata, while the four columns before them display situations in which nondifferential misclassification rates are constant across strata; each column was analyzed in accordance with the manner in which it was generated. To compare the different methods, the mean square error (MSE) was used in order to take into account the bias and variance of the estimators simultaneously. As Table 5 shows, the MLE estimator is more efficient than other estimator methods in all scenarios. Our simulation study showed that for nondifferential misclassification the performance of the inverse matrix method and the likelihood method is very close but not equivalent. The inverse matrix method is considerably more precise than the matrix method. The performance of the likelihood and weighted likelihood methods is about the same under both assumptions. The efficiency of all estimators improves as the validation study sample size increases, but this improvement is most prominent in the weighted likelihood method, whose efficiency is approximately doubled by increasing the validation study sample size.

Table 5.

Simulation results for a 2 × 2 × 2 table when binary exposure is misclassified.

Validation sample rate	Method	Stratum 1	Se₁₁ = 0.90 Sp₁₁ = 0.90 Se₁₀ = 0.90 Sp₁₀ = 0.90	Se₁₁ = 0.85 Sp₁₁ = 0.90 Se₁₀ = 0.85 Sp₁₀ = 0.90	Se₁₁ = 0.80 Sp₁₁ = 0.85 Se₁₀ = 0.80 Sp₁₀ = 0.85	Se₁₁ = 0.75 Sp₁₁ = 0.80 Se₁₀ = 0.75 Sp₁₀ = 0.80	Se₁₁ = 0.90 Sp₁₁ = 0.90 Se₁₀ = 0.90 Sp₁₀ = 0.90	Se₁₁ = 0.85 Sp₁₁ = 0.90 Se₁₀ = 0.85 Sp₁₀ = 0.90	Se₁₁ = 0.80 Sp₁₁ = 0.85 Se₁₀ = 0.80 Sp₁₀ = 0.85	Se₁₁ = 0.70 Sp₁₁ = 0.80 Se₁₀ = 0.70 Sp₁₀ = 0.80
		Stratum 2	Se₂₁ = 0.90 Sp₂₁ = 0.90 Se₂₀ = 0.90 Sp₂₀ = 0.90	Se₂₁ = 0.85 Sp₂₁ = 0.90 Se₂₀ = 0.85 Sp₂₀ = 0.90	Se₂₁ = 0.80 Sp₂₁ = 0.85 Se₂₀ = 0.80 Sp₂₀ = 0.85	Se₂₁ = 0.75 Sp₂₁ = 0.80 Se₂₀ = 0.75 Sp₂₀ = 0.80	Se₂₁ = 0.85 Sp₂₁ = 0.85 Se₂₀ = 0.85 Sp₂₀ = 0.85	Se₂₁ = 0.80 Sp₂₁ = 0.85 Se₂₀ = 0.80 Sp₂₀ = 0.85	Se₂₁ = 0.70 Sp₂₁ = 0.80 Se₂₀ = 0.70 Sp₂₀ = 0.80	Se₂₁ = 0.90 Sp₂₁ = 0.95 Se₂₀ = 0.90 Sp₂₀ = 0.95
n_v = 0.2	Matrix		1.717 (0.0298)	1.711 (0.0325)	1.671 (0.0400)	1.652 (0.0483)	1.720 (0.0301)	1.713 (0.0338)	1.682 (0.0418)	1.715 (0.0375)
	Inverse matrix		1.724 (0.0232)	1.723 (0.0234)	1.732 (0.0238)	1.686 (0.0240)	1.727 (0.0218)	1.731 (0.0232)	1.634 (0.0252)	1.762 (0.0233)
	Likelihood		1.723 (0.0146)	1.725 (0.0155)	1.730 (0.0176)	1.689 (0.0182)	1.724 (0.0165)	1.730 (0.0182)	1.682 (0.0203)	1.724 (0.0187)
	Weighted likelihood		1.737 (0.0560)	1.726 (0.0554)	1.730 (0.0561)	1.685 (0.0539)	1.738 (0.0534)	1.730 (0.0532)	1.691 (0.0560)	1.721 (0.0544)

n_v = 0.3	Matrix		1.718 (0.0247)	1.720 (0.0263)	1.721 (0.0300)	1.659 (0.0358)	1.721 (0.0272)	1.711 (0.0330)	1.694 (0.0362)	1.718 (0.0335)
	Inverse matrix		1.720 (0.0188)	1.723 (0.0194)	1.723 (0.0237)	1.678 (0.0310)	1.723 (0.0223)	1.729 (0.0240)	1.673 (0.0322)	1.728 (0.0232)
	Likelihood		1.719 (0.0140)	1.719 (0.0144)	1.716 (0.0178)	1.671 (0.0174)	1.711 (0.0145)	1.722 (0.0181)	1.696 (0.0181)	1.723 (0.0173)
	Weighted likelihood		1.727 (0.0372)	1.740 (0.0383)	1.745 (0.0431)	1.682 (0.0364)	1.721 (0.0375)	1.728 (0.0388)	1.624 (0.0443)	1.733 (0.0362)

n_v = 0.4	Matrix		1.714 (0.0214)	1.719 (0.0228)	1.713 (0.0249)	1.662 (0.0293)	1.718 (0.0224)	1.712 (0.0235)	1.697 (0.0250)	1.711 (0.0233)
	Inverse matrix		1.712 (0.0162)	1.725 (0.0167)	1.723 (0.0140)	1.684 (0.0178)	1.715 (0.0166)	1.727 (0.0165)	1.692 (0.0162)	1.729 (0.0165)
	Likelihood		1.710 (0.0132)	1.722 (0.0140)	1.716 (0.0147)	1.673 (0.0163)	1.710 (0.0131)	1.716 (0.0140)	1.688 (0.0142)	1.711 (0.0162)
	Weighted likelihood		1.726 (0.0280)	1.724 (0.0278)	1.727 (0.0275)	1.678 (0.0277)	1.729 (0.0281)	1.728 (0.0282)	1.674 (0.0275)	1.729 (0.0273)

Numbers in each cell reflect mean of adjusted odds ratios (MSE) based on 1000 simulated data sets, with true OR = 1.71.

5. Discussion and Conclusions

The misclassification of exposure is an issue of broad concern in epidemiological research, and a considerable body of literature discusses adjusting inferences on exposure-disease relationships for such misclassification. The matrix method, inverse matrix method, likelihood method, Bayesian method, and SIMEX method are some of the general tools for performing such adjustments. Our main focus was on the extension of the inverse matrix method to nondifferential misclassification and 2 × 2 × J tables. At least in the context of misclassified binary exposure, this paper has illustrated several positive attributes of the inverse matrix method. First, in assessing diagnostic tests, the importance of predictive values (rather than sensitivity and specificity) is well known. The inverse matrix method, as a direct method, uses the predictive values. We presented simple closed-form estimates for positive and negative predictive values when misclassification is nondifferential. Second, the simulation study showed that the inverse matrix method estimator is more precise than the matrix method in all scenarios.

We also examined how the parameters under the researcher's control affect the MSEs of these approaches. We found that the optimal estimator depends on the size of the validation study. This is because, when the validation sample is large, a good part of the information about θ can come from the validation study itself. Morrissey and Spiegelman noted this phenomenon previously when comparing these methods for 2 × 2 tables; they showed that the efficiency of the estimators depends more on the relative size of the validation study than on the case-control ratio [11]. In another study, Greenland noted that the samples from which the misclassification parameters are estimated must be large enough to assure the approximate normality of the estimated SE, SP, PPV, and NPV [8]. Lyles noted that, under the differential misclassification setting, the inverse matrix method is equivalent to the maximum likelihood approach [19]: both have closed-form solutions, and these are exactly the same. For this reason, the two approaches should be equally efficient, and both should be more efficient than the matrix method.

The methods described in this study handle misclassification of exposure, but they are easily adapted to disease misclassification. The study depends on the assumptions stated at the end of the introduction. Further studies seem necessary to compare and extend these methods to settings where those assumptions are not met, for example, when outcome status and exposure status are simultaneously misclassified or when there is no gold standard to validate the exposure.

In conclusion, the following points can help in deciding which method to use. Although the MLE method has the minimum MSE in all scenarios, it can be computationally difficult and does not have a simple closed-form expression like the matrix and inverse matrix methods. When misclassification is nondifferential, the inverse matrix method performs very well, and when the validation study is large enough, it can perform as well as the likelihood method. The two-step weighted likelihood method has a simpler form than the likelihood method and, for nondifferential misclassification, can perform similarly to the matrix and inverse matrix methods.

Conflict of Interests

The authors declare that they have no conflict of interests.

Authors' Contribution

Bijan Nouri and Najaf Zare were responsible for the design, simulation, and interpretation. Seyyed Mohammad Taghi Ayatollahi supervised the study and interpreted the results. All the authors read and approved the final paper.

Acknowledgments

This work was a part of the Ph.D. thesis of Bijan Nouri and was supported by Shiraz University of Medical Sciences, Shiraz, Iran. The authors are also thankful to the reviewers for their invaluable comments.

References

1. Bross I. Misclassification in 2 × 2 tables. Biometrics. 1954;10(4):478–486.
2. Bross IDJ. Spurious effects from an extraneous variable. Journal of Chronic Diseases. 1966;19(6):637–647. doi: 10.1016/0021-9681(66)90062-2.
3. Cornfield J, Haenszel W, Hammond EC, Lilienfeld AM, Shimkin MB, Wynder EL. Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer Institute. 1959;22:173–203.
4. Koch GG. The effect of non-sampling errors on measures of association in 2 × 2 contingency tables. Journal of the American Statistical Association. 1969;64(327):852–863.
5. Tenenbein A. A double sampling scheme for estimating from binomial data with misclassifications. Journal of the American Statistical Association. 1970;65(331):1350–1361.
6. Barron BA. The effects of misclassification on the estimation of relative risk. Biometrics. 1977;33(2):414–418.
7. Greenland S, Kleinbaum D. Correcting for misclassification in two-way tables and matched-pair studies. International Journal of Epidemiology. 1983;12(1):93–97. doi: 10.1093/ije/12.1.93.
8. Greenland S. Variance estimation for epidemiologic effect estimates under misclassification. Statistics in Medicine. 1988;7(7):745–757. doi: 10.1002/sim.4780070704.
9. Selen J. Adjusting for errors in classification and measurement in the analysis of partly and purely categorical data. Journal of the American Statistical Association. 1986;81(393):75–81.
10. Marshall RJ. Validation study methods for estimating exposure proportions and odds ratios with misclassified data. Journal of Clinical Epidemiology. 1990;43(9):941–947. doi: 10.1016/0895-4356(90)90077-3.
11. Morrissey MJ, Spiegelman D. Matrix methods for estimating odds ratios with misclassified exposure data: extensions and comparisons. Biometrics. 1999;55(2):338–344. doi: 10.1111/j.0006-341x.1999.00338.x.
12. Davidov O, Faraggi D, Reiser B. Misclassification in logistic regression with discrete covariates. Biometrical Journal. 2003;45(5):541–553.
13. Liu X, Liang K-Y. Adjustment for non-differential misclassification error in the generalized linear model. Statistics in Medicine. 1991;10(8):1197–1211. doi: 10.1002/sim.4780100804.
14. Luan X, Pan W, Gerberich SG, Carlin BP. Does it always help to adjust for misclassification of a binary outcome in logistic regression? Statistics in Medicine. 2005;24(14):2221–2234. doi: 10.1002/sim.2094.
15. Magder LS, Hughes JP. Logistic regression when the outcome is measured with uncertainty. The American Journal of Epidemiology. 1997;146(2):195–203. doi: 10.1093/oxfordjournals.aje.a009251.
16. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. New York, NY, USA: Chapman & Hall; 2010.
17. Chu H, Wang Z, Cole SR, Greenland S. Sensitivity analysis of misclassification: a graphical and a Bayesian approach. Annals of Epidemiology. 2006;16(11):834–841. doi: 10.1016/j.annepidem.2006.04.001.
18. Greenland S. Maximum-likelihood and closed-form estimators of epidemiologic measures under misclassification. Journal of Statistical Planning and Inference. 2008;138(2):528–538.
19. Lyles RH. A note on estimating crude odds ratios in case-control studies with differentially misclassified exposure. Biometrics. 2002;58(4):1034–1037. doi: 10.1111/j.0006-341x.2002.1034_1.x.
20. Lyles RH, Lin J. Sensitivity analysis for misclassification in logistic regression via likelihood methods and predictive value weighting. Statistics in Medicine. 2010;29(22):2297–2309. doi: 10.1002/sim.3971.
21. Jurek AM, Greenland S, Maldonado G. Brief report: how far from non-differential does exposure or disease misclassification have to be to bias measures of association away from the null? International Journal of Epidemiology. 2008;37(2):382–385. doi: 10.1093/ije/dym291.
22. Kraus JF, Greenland S, Bulterys M. Risk factors for sudden infant death syndrome in the US collaborative perinatal project. International Journal of Epidemiology. 1989;18(1):113–120. doi: 10.1093/ije/18.1.113.
