Summary:
Identifiability of statistical models is a fundamental regularity condition that is required for valid statistical inference. Investigation of model identifiability is mathematically challenging for complex models such as latent class models. Jones et al. (2010) used Goodman’s technique (Goodman, 1974) to investigate the identifiability of latent class models with applications to diagnostic tests in the absence of a gold standard test. The tool they used was based on examining the singularity of the Jacobian or the Fisher information matrix, in order to obtain insights into local identifiability (i.e., there exists a neighborhood of a parameter such that no other parameter in the neighborhood leads to the same probability distribution as the parameter).
In this paper, we investigate a stronger condition: global identifiability (i.e., no two parameters in the parameter space give rise to the same probability distribution), by introducing a powerful mathematical tool from computational algebra: the Gröbner basis. Using several well-known examples (Warner (1965), Zhou (1993), Hui and Walter (1980) and Pepe and Janes (2007)), we argue that the Gröbner basis method is easy to implement and powerful for studying the global identifiability of latent class models, and is an attractive alternative to the information matrix analysis by Rothenberg (1971) and the Jacobian analysis by Goodman (1974) and Jones et al. (2010).
Keywords: Computational algebraic geometry, Latent class models, Polynomial equations, Survey sampling
1. Introduction
For model-based statistical inference, the assumption that the imposed model is identifiable needs to be verified before the model is applied to any real data. There are at least two distinct definitions of model identifiability: the global and the local one. Suppose a random variable Y has a probability density function f(y; θ) indexed by the parameter θ. The parameter θ is said to be globally identifiable if there is no other parameter value in the parameter space that gives rise to an identical probability distribution, or equivalently, f(y; θ*) = f(y; θ) for all y almost surely implies θ* = θ. This definition excludes the undesirable situation where different parameter values lead to the same distribution, which makes it impossible to distinguish the true parameter value from the others based on the data. Global identifiability is generally difficult to verify and there are limited mathematical tools available for such verifications (Huang, 2005). A weaker form of identifiability, namely local identifiability at θ0, holds if there exists a neighborhood of θ0 such that no other parameter value θ in the neighborhood satisfies f(y; θ) = f(y; θ0) for all y almost surely. Unlike global identifiability, local identifiability can be investigated by several effective methods. For example, one can investigate the singularity of the Fisher information matrix, or apply the inverse function theorem to the Jacobian matrix, to obtain conditions for local identifiability; see Theorem 8 in Rothenberg (1971). Using these tools, Jones et al. (2010) investigated the local identifiability of latent class models with applications to diagnostic tests in the absence of a gold standard test. More importantly, the Fisher information matrix also relates closely to the asymptotic variance of the maximum likelihood estimator of a model (Fisher, 1922).
However, the local identifiability generally does not imply the global identifiability, and is not sufficient to guarantee the validity of inference for the model.
In the application of evaluating the accuracy of a diagnostic test, the selected reference test (known as the gold standard test) may be imperfect due to measurement errors, non-existence, invasive nature, or expensive cost of a gold standard (Joseph et al., 1995). In some other applications, although a gold standard is available, only a subset of subjects are verified by the gold standard (Whiting et al., 2003). Partial verification bias arises when the selection of subjects verified by the gold standard depends on the results of a candidate test (de Groot et al., 2011). To account for the absence of a gold standard test and partial verification bias, methods have been proposed using latent class models and their variants (Hui and Walter, 1980; Zhou, 1993; Yang and Becker, 1997; Zhou, 1998; Hui and Zhou, 1998; Pepe and Janes, 2007; Chu et al., 2009; de Groot et al., 2011; Dendukuri et al., 2012; Wang et al., 2016); see also the review papers by Ma et al. (2014) and Liu et al. (2014) and the references therein.
While many models for diagnostic tests in the absence of the gold standard or under partial verification are available, the validity of the model identifiability is still questionable and in many cases remains an open problem. Goodman (1974) showed that a model can be nonidentifiable even when the degrees of freedom in the data meet or exceed the number of model parameters. Johnson et al. (2001) showed that Bayesian inferences based on data from a single population for the prevalence and accuracy of two screening tests are imprecise regardless of the sample size, which is a problem resulting from model nonidentifiability. Jones et al. (2010) developed Goodman’s technique by making use of symbolic algebraic calculations. However, all the aforementioned results are for local identifiability of models, which can only partially guarantee the validity of inference because identifiability holds in a neighborhood of the truth. Given the fact that the truth is often unknown, it is important to investigate the global identifiability, which often involves proving the uniqueness of solutions for a system of nonlinear equations.
The goal of this paper is two-fold. First, using symbolic algebraic calculations as in Jones et al. (2010), we propose a different technique, namely the Gröbner basis method originating from computational algebra. It has been used in various applications such as graphical models and causal inference (Drton et al., 2011; Leung et al., 2016; Garcia, 2004). However, a systematic introduction to the Gröbner basis method that is accessible to statisticians is lacking. Second, by studying several important models for diagnostic test accuracy settings, we demonstrate the feasibility and simplicity of using the Gröbner basis method. An attractive feature of this method is that it can be used by any statistician and biomedical researcher without training in computational algebra. In addition, we reveal new insights on the global identifiability of models for diagnostic tests, in particular those with verification bias, which have not been considered before. The surprisingly simple closed-form solution suggests a very intuitive procedure to quantify and correct for partial verification bias without additional modeling assumptions.
2. A Brief Introduction of Gröbner Basis Theory
For many latent class models, the model identifiability is shown by proving the uniqueness of solutions for a system of nonlinear equations (Hui and Walter, 1980; Pepe and Janes, 2007). Here we consider the situation where the nonlinear system is given as a set of multinomial polynomial equations (see below). In this case, techniques involving the computation of a Gröbner basis for the system can be used to decide whether the system has solutions and, should they exist, the multiplicity of such solutions. In particular, the computation of the reduced Gröbner basis for such a system can be used to solve the system and investigate the identifiability of a latent class model.
We start with some definitions. Given a set V = {x1, x2, … , xn} of n symbols, a monomial in V is a symbol of the form $x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n}$, where each $a_i$ is a non-negative integer. Thus if V = {x, y, z}, then $M_1 = x^3y^1z^3$, $M_2 = x^0y^6z^1 = y^6z$, and $M_3 = x^2y^0z^1 = x^2z$ are all monomials in V. By convention, the monomial $x_1^0 x_2^0 \cdots x_n^0$ is taken to be 1.
Let K be a field (such as the rational numbers, , or the real numbers, ). By K[x1, x2, … , xn] is meant all linear combinations of monomials in {x1, x2, …, xn} with coefficients in K. For example , , and, are all members of . The elements of K[x1, x2, … , xn] are called multinomial polynomials, or simply polynomials.
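If P is a set of polynomials, then the ideal generated by P is the set (denoted by 〈P〉) containing all polynomial linear combinations of elements of P:
$$\langle P \rangle = \{\, h_1 p_1 + \cdots + h_m p_m : m \geq 1,\ p_i \in P,\ h_i \in K[x_1, \ldots, x_n] \,\}.$$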
By Hilbert’s Basis Theorem, every ideal I in K[x1, x2, … , xn] is generated by some finite set P of polynomials. The generating set of an ideal is not unique.
An admissible order (also known as term order or monomial order) on K[x1, x2, … , xn] is a total order on the set of all monomials of the form $x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n}$ ($a_1, \ldots, a_n$ are non-negative integers) which has the following properties. If M1 and M2 are monomials, then,
M1 ⩽ M2 if and only if HM1 ⩽ HM2 for every monomial H;
M1 ⩽ HM1 for every monomial H.
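An example of a monomial order is the degree lexicographical order (deglex for short), which compares total degrees first and breaks ties lexicographically. In the case of n = 2, taking $x_1 > x_2$, the degree lexicographical order is
$$1 < x_2 < x_1 < x_2^2 < x_1x_2 < x_1^2 < x_2^3 < \cdots$$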
Other commonly used monomial orders are discussed in the Web Appendix A.
Once an order is placed on monomials then every polynomial f has a unique leading term (LT (f)), which is the monomial in f with the highest order. And the leading coefficient of f is the coefficient associated with the leading term. For example, the leading term of the polynomial is and the leading coefficient is 5. For an ideal I in K[x1, … , xn], its leading term ideal LT (I) is the ideal generated by the leading terms of all polynomials in I, i.e., LT (I) = 〈LT (f) : f ∈ I〉. We are now in a position to define a Gröbner basis.
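As a small illustration of how the leading term depends on the monomial order, the sketch below uses SymPy rather than the Maple code of the Web Appendix. The polynomial `f` is a hypothetical example chosen so that the two orders disagree; it is not taken from the text. SymPy names the degree lexicographical order `grlex`.

```python
import sympy as sp

x, y = sp.symbols("x y")

# Hypothetical polynomial, chosen so that deglex and lex pick different leading terms.
f = 5 * x * y**2 + 2 * x**2 + y

# Degree lexicographic order: total degree is compared first, so x*y**2 (degree 3) leads.
lt_deglex = sp.LT(f, x, y, order="grlex")
lc_deglex = sp.LC(f, x, y, order="grlex")

# Pure lexicographic order with x > y: the power of x is compared first, so x**2 leads.
lt_lex = sp.LT(f, x, y, order="lex")

print(lt_deglex, lc_deglex)
print(lt_lex)
```

Under deglex the leading term is the degree-3 monomial with coefficient 5, whereas under lex the term in x² leads, so the choice of order changes which polynomials act as reducers.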
For a set of polynomials P, a finite subset G of 〈P〉 is a Gröbner basis of P with respect to a monomial order if G is a generating set of 〈P〉 and satisfies
$$\langle \mathrm{LT}(g) : g \in G \rangle = \mathrm{LT}(\langle P \rangle).$$
That is, the leading terms in G are sufficient to generate the leading term ideal of 〈P〉. Given an initial set, P, of multinomial polynomials, the existence of a Gröbner basis is guaranteed by Buchberger’s Algorithm (see Buchberger and Winkler (1998)). However, if G is a Gröbner basis of P, then any finite subset of 〈P〉 that contains G is also a Gröbner basis of P. So the Gröbner basis of a set is not unique.
To remedy this nonuniqueness, we then introduce the definition of polynomial reduction. Given two polynomials, f and h, we say that the polynomial f can be reduced by h if f has a monomial which is a multiple of the leading monomial of h. Suppose M is the monomial in f which is a multiple of the leading monomial H of h. So M = QH for some monomial Q. In this case we write the reduction of f by h as R(f, h) = f − (c/d)Qh, where d is the leading coefficient of h and c is the coefficient associated with the monomial M in f.
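The reduction step can be tried out with SymPy, as a sketch under the assumption that SymPy stands in for the Maple implementation. The polynomials below are illustrative and are not from the text. The first computation applies the single step R(f, h) = f − (c/d)Qh by hand; `sympy.reduced` then repeats such steps against a list of divisors until no term can be reduced further.

```python
import sympy as sp

x, y = sp.symbols("x y")

f = 2 * x**4 + y**2 - x**2 + y**3
h1 = x**3 - x
h2 = y**3 - y

# One reduction step by h1: the term M = x**4 of f equals Q*H with Q = x and H = x**3.
# Here c = 2 (coefficient of M in f) and d = 1 (leading coefficient of h1).
one_step = sp.expand(f - sp.Rational(2, 1) * x * h1)

# Repeated reduction by h1 and h2 under lex order with x > y.
quotients, remainder = sp.reduced(f, [h1, h2], x, y, order="lex")

# Sanity check: f = q1*h1 + q2*h2 + remainder.
rebuilt = sp.expand(quotients[0] * h1 + quotients[1] * h2 + remainder)

print(one_step)
print(quotients, remainder)
```

No term of the final remainder is a multiple of the leading monomial of `h1` or `h2`, which is the stopping condition for the reduction.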
Then we say a Gröbner basis G = {g1, … , gr} is a reduced Gröbner basis if
for each gi ∈ G, its leading coefficient is 1.
the leading term of any gi ∈ G does not divide any term in any other gj ∈ G for i ≠ j.
With this definition, it can be shown that every ideal I in K[x1, … , xn] has a unique reduced Gröbner basis.
Intuitively, for a set of polynomials, P, a Gröbner basis (for the corresponding ideal 〈P〉) is a collection of polynomials which can be used to build all the elements of 〈P〉. A reduced Gröbner basis is also a collection of polynomials which can be used to build all the polynomials in 〈P〉 but consisting of irreducible (or “atomic”) polynomials. For example, consider a set . The bases {x2 + x1, x1} and {x2, x1} are both Gröbner bases of P. However, the former is not reduced since we can reduce x2 + x1 to x2 using x1.
More importantly, a Gröbner basis G of P shares the same solution set as P. Let
$$V(P) = \{\, x \in K^n : p(x) = 0 \text{ for all } p \in P \,\}$$
denote the solution set of P. We have $\langle P \rangle = \langle G \rangle$. And therefore, $V(P) = V(G)$. If G is a reduced Gröbner basis of P, then, over an algebraically closed field such as the complex numbers, $V(P)$ is empty if and only if G = {1}. This means an over-identifiable model can be detected by inspecting the reduced Gröbner basis. Compared to the original set of polynomials, its reduced Gröbner basis often has a simpler form and can therefore reduce the computational difficulty in solving the system of polynomial equations.
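Both points can be checked on tiny examples, sketched here with SymPy's `groebner` (an assumption; the paper itself uses Maple). The generating set {x1 + x2, x1} is assumed for illustration, and the second system is a deliberately inconsistent toy system.

```python
import sympy as sp

x1, x2, x = sp.symbols("x1 x2 x")

# Assumed generating set: x1 + x2 and x1 generate the same ideal as x1 and x2.
# SymPy returns the reduced basis, in which x1 + x2 has been reduced to x2.
G = sp.groebner([x1 + x2, x1], x1, x2, order="lex")
print(G.exprs)

# An inconsistent system (x = 0 and x = 1) has an empty solution set,
# so its reduced Groebner basis collapses to the constant 1.
G_empty = sp.groebner([x, x - 1], x, order="lex")
print(G_empty.exprs)
```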
Buchberger’s Algorithm was derived for computing the reduced Gröbner basis G for any input set P, and it has been widely implemented in software (Buchberger and Winkler, 1998; Buchberger, 2006). For readers interested in more details, we recommend the review papers and textbooks by Sturmfels (2005) and Cox et al. (2005). Finally, we provide pseudocode to outline the steps for investigating the global identifiability of a model via the Gröbner basis approach.
(1) List all the identifiability conditions of the model.
(2) Reduce the conditions in (1) to a system of polynomial equations.
(3) Apply Buchberger’s Algorithm to find a reduced Gröbner basis G.
(4) Solve the system G = 0 and determine whether the solution is unique.
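The steps above can be sketched in a few lines of SymPy. This is a minimal illustration under stated assumptions: the helper `reduced_basis` and the toy model (two parameters a and b observed only through their sum and product) are hypothetical and do not appear in the paper, and SymPy replaces Maple.

```python
import sympy as sp


def reduced_basis(equations, params):
    """Reduced Groebner basis under lex order; the last parameter sits lowest."""
    return sp.groebner(equations, *params, order="lex").exprs


a, b = sp.symbols("a b")

# Steps (1)-(2): a toy model in which only a + b = 5 and a * b = 6 are observed.
equations = [a + b - 5, a * b - 6]

# Step (3): reduced Groebner basis with b in the lowest position.
basis = reduced_basis(equations, [a, b])

# Step (4): solve the basis and count the solutions.
solutions = sp.solve(basis, [a, b], dict=True)

print(basis)
print(solutions)
```

The basis contains a quadratic in b alone plus an equation linear in a, so there are two solutions that differ by swapping a and b. The toy model is therefore locally but not globally identifiable unless the parameter space is restricted, for example to a < b.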
3. Examples
In this section we introduce four latent class models. In all examples, the data can be summarized by contingency tables, and the cell probabilities in the contingency tables (referred to as the marginal parameters) can be written as functions of the parameters of interest (referred to as the structural parameters). Since the marginal parameters are intrinsically identifiable by the identifiability of multinomial distributions, the identifiability of the structural parameters boils down to the problem of solving a system of polynomial equations. We provide solutions to these identifiability problems using the Gröbner basis method discussed in Section 2. The corresponding Maple code for finding the reduced Gröbner basis can be found in Web Appendix C.
3.1. Example 1: Randomized response in survey sampling (Warner, 1965)
Randomized response is a survey sampling technique employed in the context of asking sensitive questions, e.g., use of illicit drugs, where the idea is for respondents to answer truthfully or not based on a known chance mechanism. Warner (1965) proposed that the respondent draw from a deck of cards with specified proportions p of color A and 1 − p of color B - the respondent is to answer the question truthfully if she draws color A and lie if she draws color B. While the color of the card is unknown to the interviewer, it is possible to use the known proportion p to identify the prevalence of illicit drug use, even though some unknown number of respondents have lied.
In this example we consider a survey sampling problem where two questions are asked. In this case, two decks of cards are used, with each deck following the Warner model. For example, each respondent is asked to draw two cards with replacement from two decks of cards, and then answer: “Do you smoke?” and “Do you drink?” Whether or not to answer a question truthfully depends on the color of the corresponding card. Denote p1 as the known probability of drawing a card with the color corresponding to answer truthfully in deck 1, and p2 as the known probability for deck 2. Denote structural parameters π11, π10 and π01 as the respective prevalence of smoking and drinking, smoking but not drinking, and drinking but not smoking. Finally, denote the marginal parameters θ11, θ10 and θ01 as the proportions of answering ‘yes and yes’, ‘yes and no’ and ‘no and yes’, respectively. A simple calculation yields the following system of equations,
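$$\begin{aligned}
\theta_{11} &= p_1p_2\,\pi_{11} + p_1(1-p_2)\,\pi_{10} + (1-p_1)p_2\,\pi_{01} + (1-p_1)(1-p_2)\,\pi_{00},\\
\theta_{10} &= p_1(1-p_2)\,\pi_{11} + p_1p_2\,\pi_{10} + (1-p_1)(1-p_2)\,\pi_{01} + (1-p_1)p_2\,\pi_{00},\\
\theta_{01} &= (1-p_1)p_2\,\pi_{11} + (1-p_1)(1-p_2)\,\pi_{10} + p_1p_2\,\pi_{01} + p_1(1-p_2)\,\pi_{00},
\end{aligned} \tag{1}$$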
where π00 = 1 − π11 − π10 − π01. The marginal parameters θ11, θ10 and θ01 are directly observed. To show identifiability of the parameters of interest π11, π10 and π01, we only need to solve this system of equations. Here we have three linear equations with three structural parameters. Note that a necessary condition for model identifiability is that the number of structural parameters equals the number of linearly independent equations in the system (Huang and Bandeen-Roche, 2004).
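We first rewrite equation (1) in matrix form to apply Gaussian elimination. Substituting π00 = 1 − π11 − π10 − π01 gives
$$\begin{pmatrix} \theta_{11} - d \\ \theta_{10} - c \\ \theta_{01} - b \end{pmatrix} = \begin{pmatrix} a-d & b-d & c-d \\ b-c & a-c & d-c \\ c-b & d-b & a-b \end{pmatrix} \begin{pmatrix} \pi_{11} \\ \pi_{10} \\ \pi_{01} \end{pmatrix},$$
where a = p1p2, b = p1(1 − p2), c = (1 − p1)p2 and d = (1 − p1)(1 − p2). The determinant of the above 3×3 matrix is $(2p_1 - 1)^2(2p_2 - 1)^2$, which reveals that the structural parameters (π11, π10, π01) are identifiable if and only if p1 ≠ 1/2 and p2 ≠ 1/2. Routine linear algebra then yields the unique solution in closed form.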
Equivalently, to solve equation (1), we use Maple and obtain the following Gröbner basis
Note that the above Gröbner basis is a reduced Gröbner basis after dividing each polynomial by its leading coefficient. As in Gaussian elimination, each of the above polynomials involves exactly one structural parameter, and the coefficient of the structural parameter reveals the identification condition: neither p1 nor p2 can be 1/2. This result is equivalent to what we obtained from solving the system of equations via Gaussian elimination.
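The two routes can be compared in a short SymPy sketch; the paper's own computation uses Maple (Web Appendix C). The matrix below is one way of writing system (1) after substituting π00 = 1 − π11 − π10 − π01, and the numerical values of p1, p2 and the prevalences are hypothetical, chosen only to produce a concrete system.

```python
import sympy as sp

p1, p2 = sp.symbols("p1 p2")
a, b, c, d = p1 * p2, p1 * (1 - p2), (1 - p1) * p2, (1 - p1) * (1 - p2)

# Coefficient matrix of (pi11, pi10, pi01) once pi00 has been substituted out.
M = sp.Matrix([[a - d, b - d, c - d],
               [b - c, a - c, d - c],
               [c - b, d - b, a - b]])
offset = sp.Matrix([d, c, b])  # constant terms contributed by pi00
det = sp.factor(M.det())
print(det)

# Hypothetical numerical instance: known card probabilities and true prevalences.
vals = {p1: sp.Rational(7, 10), p2: sp.Rational(3, 5)}
true_pi = sp.Matrix([sp.Rational(1, 5), sp.Rational(3, 10), sp.Rational(1, 10)])
theta = (M * true_pi + offset).subs(vals)  # implied marginal parameters

# Groebner basis of the three linear equations in the structural parameters.
pi11, pi10, pi01 = sp.symbols("pi11 pi10 pi01")
unknown = sp.Matrix([pi11, pi10, pi01])
equations = list(M.subs(vals) * unknown + offset.subs(vals) - theta)
G = sp.groebner(equations, pi11, pi10, pi01, order="lex")
recovered = sp.solve(G.exprs, [pi11, pi10, pi01], dict=True)[0]

print(G.exprs)
print(recovered)
```

Each element of the basis involves a single structural parameter, so the prevalences can be read off directly, and the factored determinant vanishes exactly when p1 = 1/2 or p2 = 1/2.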
3.2. Example 2: Diagnostic tests without a gold standard - three tests and one population (Pepe and Janes, 2007)
In traditional diagnostic test accuracy studies, the evaluation of a candidate test is usually through the comparison with a reference test which is assumed to be a gold standard. When the reference test is imperfect, latent class models have been commonly used; see for example, the analysis of a single study in Begg and Greenes (1983), and the meta-analysis of multiple studies in Chu et al. (2009).
Recently, Pepe and Janes (2007) considered a setting where three candidate tests are simultaneously applied to each subject of a study population. Following their notation, let Yj denote the outcome of test j for j = 1, 2, 3 with 0 being negative and 1 being positive. Let D denote the unknown true disease status with 0 being absence and 1 being presence. It is typically assumed that the test results are conditionally independent given the true disease state, i.e., $\Pr(Y_1, Y_2, Y_3 \mid D) = \Pr(Y_1 \mid D)\Pr(Y_2 \mid D)\Pr(Y_3 \mid D)$. Let ϕj denote the sensitivity and ψj denote 1 − specificity of test j for j = 1, 2, 3, and let ρ denote the disease prevalence in the study population; these are structural parameters. Let pj = Pr(Yj = 1) for j = 1, 2, 3, pjk = Pr(Yj = Yk = 1) for j < k, and p123 = Pr(Y1 = Y2 = Y3 = 1); these are the marginal parameters, which are considered observed.
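By the law of total probability, the marginal probabilities can be represented by the following system of equations:
$$\begin{aligned}
p_j &= \rho\,\phi_j + (1-\rho)\,\psi_j, && j = 1, 2, 3,\\
p_{jk} &= \rho\,\phi_j\phi_k + (1-\rho)\,\psi_j\psi_k, && 1 \le j < k \le 3,\\
p_{123} &= \rho\,\phi_1\phi_2\phi_3 + (1-\rho)\,\psi_1\psi_2\psi_3.
\end{aligned} \tag{2}$$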
By observing the system of equations (2), we can see that the structural sensitivity parameters (ϕ1, ϕ2, ϕ3) and the specificity parameters (ψ1, ψ2, ψ3) are symmetric, and the sensitivity (or specificity) parameters are also symmetric up to a “label switching”. In such case, if we put ρ in the highest position of the degree lexicographic order, we obtain the following seven elements of a Gröbner basis:
| (3) |
where and
We note that the first element only involves ϕ1 with a quadratic term, and each of the elements 2–7 contains exactly one linear term of ϕ1 and another structural parameter. Therefore, the structural parameters are uniquely identified if the parameter ϕ1 is identifiable. Since the first element is a quadratic function of ϕ1, there are generally two roots, denoted as ϕ11 and ϕ12. The global identifiability of the structural parameter holds in a half space, partitioned by the surface ϕ1 = (ϕ11 + ϕ12)/2. Although the two roots ϕ11 and ϕ12 have closed-form expressions, the expressions are too complicated to reveal further insights into this model.
On the other hand, if we put ρ in the lowest position of the lexicographic order in the Gröbner basis algorithm, the outputs are similar to the system of equations (3) in the sense that only one element involves a quadratic function of a structural parameter (but now it is the parameter in the lowest position, ρ), and the remaining elements are linear functions of two distinct structural parameters. Therefore, we only need to focus on the quadratic function of ρ, i.e.,
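$$a\rho^2 - a\rho + c = 0, \tag{4}$$
where a and c are polynomials in the marginal parameters, with c = (p12 − p1p2)(p13 − p1p3)(p23 − p2p3). Equation (4) yields two roots $\rho = \{a \pm \sqrt{a^2 - 4ac}\}/(2a)$, which can be easily simplified. We factor $a^2 - 4ac = aA^2$, where A = 2p1p2p3 − p1p23 − p2p13 − p3p12 + p123. Since $a(a - 4c) = aA^2$, we can replace a by $A^2 + 4c$, and then the two roots of Equation (4) can be written as
$$\rho = \frac{1}{2} \pm \frac{1}{2}\sqrt{1 - \frac{4c}{A^2 + 4c}},$$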
which is exactly the same as the solution obtained by Pepe and Janes (2007) except they had a typographical error – there was a sum rather than a subtraction in the square root.
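The quadratic in ρ can be checked numerically with exact rational arithmetic. The sketch below assumes that the quadratic has the form aρ² − aρ + c = 0 with a = A² + 4c, as implied by the relations quoted above; the parameter values are hypothetical. It evaluates the quadratic at the true prevalence and at its mirror image 1 − ρ.

```python
from fractions import Fraction as F


def make_joint(rho, phi, psi):
    """Return a function giving Pr(all listed tests positive) under conditional independence."""
    def joint(tests):
        diseased, healthy = F(1), F(1)
        for j in tests:
            diseased *= phi[j]
            healthy *= psi[j]
        return rho * diseased + (1 - rho) * healthy
    return joint


# Hypothetical structural parameters: prevalence, sensitivities, 1 - specificities.
rho = F(1, 5)
phi = [F(9, 10), F(4, 5), F(7, 10)]
psi = [F(1, 10), F(1, 5), F(3, 10)]

p = make_joint(rho, phi, psi)
p1, p2, p3 = p([0]), p([1]), p([2])
p12, p13, p23 = p([0, 1]), p([0, 2]), p([1, 2])
p123 = p([0, 1, 2])

# Quantities defined in the text, computed from the marginal parameters only.
A = 2 * p1 * p2 * p3 - p1 * p23 - p2 * p13 - p3 * p12 + p123
c = (p12 - p1 * p2) * (p13 - p1 * p3) * (p23 - p2 * p3)
a = A**2 + 4 * c  # assumed relation, from a(a - 4c) = a A^2 with a != 0


def quadratic(r):
    return a * r**2 - a * r + c


print(quadratic(rho), quadratic(1 - rho))
```

Both evaluations are exactly zero, so the two roots are ρ and 1 − ρ. This is the label-switching ambiguity, and restricting ρ to one side of 1/2 removes it.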
The output of a univariate polynomial involving only ρ was not obtained by chance. This is guaranteed by the elimination property of lexicographic ordering in a Gröbner basis as long as ρ is in the lowest position. When computing the Gröbner basis, polynomials of parameters in the lower positions are used as “basic elements” to reduce the polynomials with parameters in the higher positions. Therefore, in this case, putting ρ in the lowest position maintains the symmetry of the other structural parameters, and therefore leads to better insights into the model. More discussion regarding how to choose the monomial orders can be found in the Web Appendices A and B.
3.3. Example 3: Diagnostic tests without a gold standard - Hui-Walter Paradigm
Hui and Walter (1980) considered a setting similar to Example 2, where the gold standard is not available. Two conditionally independent tests are applied simultaneously to subjects from two different study populations. Let Yj denote the binary result (1: positive, 0: negative) for test j, and let Z denote the known population membership of a subject (e.g. Z = 1 for female, and Z = 2 for male). Let D denote the underlying unknown true disease status. Hui and Walter (1980) assumed that the two tests are conditionally independent given the unknown disease status, i.e. Y1 ⊥ Y2|D, and that the results of the two tests of a subject provide no additional information on that subject’s population membership given the true disease status, i.e. Z ⊥ (Y1, Y2)|D.
To be consistent with Hui and Walter (1980) in notation, let α1 and α2 denote the false positive rates of tests 1 and 2 respectively, and β1 and β2 denote the false negative rates. Let θ1 and θ2 denote the disease prevalence in study populations 1 and 2, respectively. The parameters α1, α2, β1, β2, θ1, θ2 are structural parameters. We let pgjj′ (g = 1, 2; j, j′ = 1, 2) denote the identifiable marginal parameters that govern the conditional joint distribution of Y1 and Y2 given Z = g, where, following the subscripts used in the expressions below, j = 1 indicates a positive and j = 2 a negative result of test 1 (and likewise j′ for test 2). By the law of total probability, we have Pr(Y1, Y2|Z) = Pr(Y1, Y2|D = 1, Z)Pr(D = 1|Z) + Pr(Y1, Y2|D = 0, Z)Pr(D = 0|Z) = Pr(Y1, Y2|D = 1)Pr(D = 1|Z) + Pr(Y1, Y2|D = 0)Pr(D = 0|Z) = Pr(Y1|D = 1)Pr(Y2|D = 1)Pr(D = 1|Z) + Pr(Y1|D = 0)Pr(Y2|D = 0)Pr(D = 0|Z), where the last two steps follow by the conditional independence assumptions. The structural parameters are then the solution to the following system of six equations:
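$$\begin{aligned}
p_{g11} &= \theta_g(1-\beta_1)(1-\beta_2) + (1-\theta_g)\,\alpha_1\alpha_2,\\
p_{g12} &= \theta_g(1-\beta_1)\,\beta_2 + (1-\theta_g)\,\alpha_1(1-\alpha_2),\\
p_{g21} &= \theta_g\,\beta_1(1-\beta_2) + (1-\theta_g)(1-\alpha_1)\,\alpha_2,
\end{aligned} \qquad g = 1, 2, \tag{5}$$
where, in the second and third subscripts of p, the index 1 denotes a positive and 2 a negative test result.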
By observing the system of equations (5), we place the disease prevalences θ1 and θ2 in the lowest positions of the lexicographic order. The outputs are five polynomials, each involving θ2 and one other structural parameter to the first order, and a quadratic function of θ2:
$$a\theta_2^2 - a\theta_2 + c = 0, \tag{6}$$
where a and c are polynomials in the marginal parameters. We factor $a^2 - 4ac = aA^2$, where $A = 2p_{111}p_{211} + p_{111}p_{212} + p_{111}p_{221} + p_{112}p_{211} + p_{112}p_{221} + p_{121}p_{211} + p_{121}p_{212} - 2p_{211}^2 - 2p_{211}p_{212} - 2p_{211}p_{221} - 2p_{212}p_{221} - p_{111} + p_{211}$. Moreover, using equation (5), we obtain that $a = (\alpha_1 + \beta_1 - 1)^2(\alpha_2 + \beta_2 - 1)^2(\theta_1 - \theta_2)^2 \geq 0$, and $A = (\alpha_1 + \beta_1 - 1)(\alpha_2 + \beta_2 - 1)(\theta_1 - \theta_2)(2\theta_2 - 1)$. Thus θ2 can be solved from the quadratic function (6) using the above simplified expressions as:
$$\theta_2 = \frac{1}{2} \pm \frac{A}{2\sqrt{a}}. \tag{7}$$
Therefore, in order that the structural parameter θ2 is identifiable, we need a ≠ 0, which implies that the two populations have different disease prevalences (i.e., θ1 ≠ θ2), and that the sum of the false positive and false negative rates does not equal one for either test (i.e., αq + βq ≠ 1, for q = 1, 2). In addition, to make sure θ2 is globally identifiable, only one solution of θ2 is allowed in Equation (7). Therefore the parameter space of θ2 has to be either [0, 0.5] or [0.5, 1]. Once θ2 is uniquely identified, the other parameters are uniquely identified as functions of θ2 obtained from solving the system of linear equations of the Gröbner basis, and we achieve the global identifiability of all the structural parameters. This is equivalent to the main result in Hui and Walter (1980). By using the Gröbner basis, we were able to obtain new insights on the conditions of global identifiability.
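The factorization of A and the two-root structure can be verified with exact fractions. In this sketch the parameter values are hypothetical, the subscripts of p use 1 for a positive and 2 for a negative result, and the quadratic is assumed to have the form aθ2² − aθ2 + c = 0, with c recovered from the stated relation a² − 4ac = aA².

```python
from fractions import Fraction as F

# Hypothetical structural parameters: false positive rates, false negative rates, prevalences.
alpha = [F(1, 10), F(1, 5)]
beta = [F(3, 20), F(1, 4)]
theta = [F(3, 10), F(3, 5)]


def cell(g, pos1, pos2):
    """Pr(test 1 result, test 2 result | population g); True means a positive result."""
    t = theta[g]
    diseased = ((1 - beta[0]) if pos1 else beta[0]) * ((1 - beta[1]) if pos2 else beta[1])
    healthy = (alpha[0] if pos1 else 1 - alpha[0]) * (alpha[1] if pos2 else 1 - alpha[1])
    return t * diseased + (1 - t) * healthy


p111, p112, p121 = cell(0, True, True), cell(0, True, False), cell(0, False, True)
p211, p212, p221 = cell(1, True, True), cell(1, True, False), cell(1, False, True)

# A written in terms of the marginal parameters, as in the text.
A_from_p = (2 * p111 * p211 + p111 * p212 + p111 * p221 + p112 * p211 + p112 * p221
            + p121 * p211 + p121 * p212 - 2 * p211**2 - 2 * p211 * p212
            - 2 * p211 * p221 - 2 * p212 * p221 - p111 + p211)

# The same quantities written in terms of the structural parameters.
k = (alpha[0] + beta[0] - 1) * (alpha[1] + beta[1] - 1) * (theta[0] - theta[1])
A_from_params = k * (2 * theta[1] - 1)
a = k**2
c = (a - A_from_p**2) / 4  # implied by a^2 - 4ac = a A^2 when a != 0


def quadratic(t):
    return a * t**2 - a * t + c


print(A_from_p == A_from_params)
print(quadratic(theta[1]), quadratic(1 - theta[1]))
```

The quadratic vanishes at θ2 and at 1 − θ2, which is why the parameter space of θ2 has to be restricted to one side of 0.5 for global identifiability.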
3.4. Example 4: Diagnostic tests with verification bias
In many systematic reviews of diagnostic tests, some subjects who undergo a diagnostic test may not have their disease condition status verified by a gold standard. More importantly, subjects who have their disease condition status verified do not represent a random sample from the population. This can lead to verification bias when evaluating the accuracy of the test (Ma et al., 2014).
Let D denote the binary disease status with 0 for healthy and 1 for diseased, Y denote the binary diagnostic test result with 0 for negative and 1 for positive, and M denote the indicator of whether the true disease status D is verified by the gold standard test (M = 0 if D is observed and 1 if not observed). Assume that given the test result Y , the missingness M is independent of the disease status D, i.e., M ⊥ D|Y . Denote the non-verification probabilities given the test result as ω1 = Pr(M = 1|Y = 1) and ω0 = Pr(M = 1|Y = 0). It is easy to see that when ω1 = ω0, the missingness is completely at random and there is no verification bias. In clinical practice, we often have ω1 < ω0. Denote π = Pr(D = 1), Se = Pr(Y = 1|D = 1) and Sp = Pr(Y = 0|D = 0) for disease prevalence, sensitivity and specificity, respectively. From a cohort study design, the identifiable marginal parameters are pjk = P(M = 0, Y = j, D = k) (j, k = 0, 1) and . The structural parameters are (π, ω0, ω1, Se, Sp).
Straightforward calculation yields pjk = Pr(M = 0, Y = j, D = k) = Pr(M = 0|Y = j, D = k)Pr(Y = j|D = k)Pr(D = k) = Pr(M = 0|Y = j)Pr(Y = j|D = k)Pr(D = k), where the first step is by conditional independence of M ⊥ D|Y , and the probability can be calculated similarly. By letting j, k take values of 0 and 1, the structural parameters are then the solution to the following system of five equations in (8). The results from Maple are five first-order polynomials, each involving only one structural parameter.
| (8) |
The solution to (8) is
| (9) |
where A = C + p11p01 + p10p01, B = C − p11p00 − p10p00 and .
To reveal further insights about π, note that
| (10) |
Similarly,
| (11) |
From the above simplified expressions, it is clear that the structural parameters are globally identifiable if and only if ω0 ≠ 1 and ω1 ≠ 1. In other words, the structural parameters are globally identifiable unless none of the positive or none of the negative test results are verified.
In addition to revealing identifiability conditions, the above results provide important insights into the direction and magnitude of bias when partial verification bias is ignored. In practice, naive estimators of prevalence π, sensitivity Se and specificity Sp are obtained by ignoring those with missing disease status, and are estimated by p11 + p01, p11/(p11 + p01) and p00/(p00 + p10), respectively. By Equation (10), the naive estimator of π tends to underestimate the true prevalence. Furthermore, Equation (11) suggests that naive estimators of Se and Sp tend to be upward biased and downward biased, respectively, if people with negative test results are less likely to have verification, i.e., ω0 > ω1, a common situation in practice.
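The direction of these biases can be illustrated with a small exact calculation. The sketch relies only on the identity pjk = Pr(M = 0|Y = j)Pr(Y = j|D = k)Pr(D = k) stated above and does not reproduce the displayed forms of Equations (9)-(11). The parameter values and the helper names `corrected` and `naive` are hypothetical, and the non-verification probabilities are treated as known because they are identifiable from Pr(M = 1|Y = j).

```python
from fractions import Fraction as F

# Hypothetical truth: prevalence, sensitivity, specificity, and non-verification rates.
prev, se, spec = F(1, 5), F(9, 10), F(4, 5)
omega = {1: F(1, 10), 0: F(3, 5)}  # omega[j] = Pr(M = 1 | Y = j), here omega0 > omega1

pr_y_given_d = {(1, 1): se, (0, 1): 1 - se, (1, 0): 1 - spec, (0, 0): spec}
pr_d = {1: prev, 0: 1 - prev}

# p[j, k] = Pr(M = 0, Y = j, D = k) = (1 - omega_j) Pr(Y = j | D = k) Pr(D = k)
p = {(j, k): (1 - omega[j]) * pr_y_given_d[j, k] * pr_d[k]
     for j in (0, 1) for k in (0, 1)}


def corrected(p, omega):
    """Undo the verification step: Pr(Y = j, D = k) = p[j, k] / (1 - omega_j)."""
    full = {jk: v / (1 - omega[jk[0]]) for jk, v in p.items()}
    prevalence = full[1, 1] + full[0, 1]
    return prevalence, full[1, 1] / prevalence, full[0, 0] / (1 - prevalence)


def naive(p):
    """Naive estimators from the text, which ignore subjects with missing disease status."""
    return (p[1, 1] + p[0, 1],
            p[1, 1] / (p[1, 1] + p[0, 1]),
            p[0, 0] / (p[0, 0] + p[1, 0]))


print(corrected(p, omega))
print(naive(p))
```

With ω0 > ω1, the corrected values recover the truth exactly, while the naive values understate the prevalence, overstate the sensitivity and understate the specificity, in line with the discussion above.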
4. Discussion
The global identifiability in different contexts has been considered by Allman and Rhodes (2006, 2009); Allman et al. (2009); Drton et al. (2011); Meshkat and Sullivant (2014), among many others. They used approaches in algebraic statistics such as Kruskal’s Theorem (Kruskal, 1977) to study model identifiability. We refer to Drton et al. (2009) for a more detailed introduction to algebraic statistics. In contrast to their methods, the Gröbner basis approach introduced here is more accessible to statisticians and biomedical researchers and can be applied without resorting to sophisticated mathematical arguments. In addition, if the model is identifiable, the Gröbner basis approach produces an explicit solution to the identifiability problem, which can provide further insights into the statistical models.
Although the Gröbner basis approach is designed to be directly applied to problems formulated as a system of polynomial equations, reparameterization can be used to expand the utility of the Gröbner basis approach. In the Web Appendix D, we included another investigation of the global identifiability of a latent class regression model with binary outcome and two latent classes.
As a final note, it has been suggested that when a model lacks identifiability, model expansion can help to achieve identifiability - see Johnson et al. (2001) and the discussion paper by Gustafson et al. (2005). In addition, under a Bayesian inference framework with carefully elicited priors, informative posteriors can still be obtained in models that lack identifiability (Georgiadis et al., 2003; McInturff et al., 2004; Dendukuri et al., 2012; Wang et al., 2016).
Acknowledgement
This work is supported in part by the National Institutes of Health grants 1R01LM012607 (RD and YC), 1R01AI130460 (RD and YC), P50MH113840 (YC), 1R01AI116794 (JHM and YC), R01LM009012 (JHM and YC), R01 LM010098 (JHM) and R01 AI116794 (JHM). MC was supported by the UTHealth Innovation for Cancer Prevention Research Training Program Pre-doctoral Fellowship (Cancer Prevention and Research Institute of Texas award RP160015). The content is solely the responsibility of the authors and does not necessarily represent the official views of the Cancer Prevention and Research Institute of Texas. The authors thank the referees, the associate editor and the editor for their constructive comments that substantially improved the presentation of this work. RD, MC, YN and MZ are co-first authors and XZ, JHM, JGI, DOS and YC are co-senior authors.
References
- Allman ES, Matias C, and Rhodes JA (2009). Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics pages 3099–3132. [Google Scholar]
- Allman ES and Rhodes JA (2006). The identifiability of tree topology for phylogenetic models, including covarion and mixture models. Journal of Computational Biology 13, 1101–1113. [DOI] [PubMed] [Google Scholar]
- Allman ES and Rhodes JA (2009). The identifiability of covarion models in phylogenetics. Computational Biology and Bioinformatics, IEEE/ACM Transactions on 6, 76–88. [DOI] [PubMed] [Google Scholar]
- Begg CB and Greenes RA (1983). Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics pages 207–215. [PubMed] [Google Scholar]
- Buchberger B (2006). Bruno buchberger’s phd thesis 1965: An algorithm for finding the basis elements of the residue class ring of a zero dimensional polynomial ideal. Journal of Symbolic Computation 41, 475–511. [Google Scholar]
- Buchberger B and Winkler F (1998). Gröbner bases and applications, volume 251 Cambridge University Press. [Google Scholar]
- Chu H, Chen S, and Louis T (2009). Random effects models in a meta-analysis of the accuracy of two diagnostic tests without a gold standard. Journal of the American Statistical Association 104, 512–523. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cox DA, Little JB, and O’shea D (2005). Using algebraic geometry, volume 185 Springer. [Google Scholar]
- de Groot JA, Bossuyt PM, Reitsma JB, Rutjes AW, Dendukuri N, Janssen KJ, and Moons KG (2011). Verification problems in diagnostic accuracy studies: consequences and solutions. BMJ 343,. [DOI] [PubMed] [Google Scholar]
- de Groot JA, Dendukuri N, Janssen KJ, Reitsma JB, Bossuyt PM, and Moons KG (2011). Adjusting for differential-verification bias in diagnostic-accuracy studies: a bayesian approach. Epidemiology 22, 234–241. [DOI] [PubMed] [Google Scholar]
- Dendukuri N, Schiller I, Joseph L, and Pai M (2012). Bayesian meta-analysis of the accuracy of a test for tuberculous pleuritis in the absence of a gold standard reference. Biometrics 68, 1285–1293. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Drton M, Foygel R, and Sullivant S (2011). Global identifiability of linear structural equation models. The Annals of Statistics 39, 865–886.
- Drton M, Sturmfels B, and Sullivant S (2009). Lectures on algebraic statistics. Springer.
- Fisher RA (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222, 309–368.
- Garcia LD (2004). Algebraic statistics in model selection. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 177–184. AUAI Press.
- Georgiadis MP, Johnson WO, Gardner IA, and Singh R (2003). Correlation-adjusted estimation of sensitivity and specificity of two diagnostic tests. Journal of the Royal Statistical Society: Series C (Applied Statistics) 52, 63–76.
- Goodman LA (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61, 215–231.
- Gustafson P, Gelfand AE, Sahu SK, Johnson WO, Hanson TE, Joseph L, and Lee J (2005). On model expansion, model contraction, identifiability and prior information: Two illustrative scenarios involving mismeasured variables [with comments and rejoinder]. Statistical Science pages 111–140.
- Huang G-H (2005). Model identifiability. Wiley StatsRef: Statistics Reference Online.
- Huang G-H and Bandeen-Roche K (2004). Building an identifiable latent class model with covariate effects on underlying and measured variables. Psychometrika 69, 5–32.
- Hui SL and Walter SD (1980). Estimating the error rates of diagnostic tests. Biometrics pages 167–171.
- Hui SL and Zhou XH (1998). Evaluation of diagnostic tests without gold standards. Statistical Methods in Medical Research 7, 354–370.
- Johnson WO, Gastwirth JL, and Pearson LM (2001). Screening without a “gold standard”: the Hui-Walter paradigm revisited. American Journal of Epidemiology 153, 921–924.
- Jones G, Johnson WO, Hanson TE, and Christensen R (2010). Identifiability of models for multiple diagnostic testing in the absence of a gold standard. Biometrics 66, 855–863.
- Joseph L, Gyorkos TW, and Coupal L (1995). Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. American Journal of Epidemiology 141, 263–272.
- Kruskal JB (1977). Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications 18, 95–138.
- Leung D, Drton M, and Hara H (2016). Identifiability of directed Gaussian graphical models with one latent source. Electronic Journal of Statistics 10, 394–422.
- Liu Y, Chen Y, and Chu H (2014). A unification of models for meta-analysis of diagnostic accuracy studies without a gold standard (in press). Biometrics.
- Ma X, Chen Y, Cole SR, and Chu H (2014). A hybrid Bayesian hierarchical model combining cohort and case–control studies for meta-analysis of diagnostic tests: Accounting for partial verification bias. Statistical Methods in Medical Research page 0962280214536703.
- Ma X, Nie L, Cole SR, and Chu H (2014). Statistical methods for multivariate meta-analysis of diagnostic tests: An overview and tutorial (in press). Statistical Methods in Medical Research.
- McInturff P, Johnson WO, Cowling D, and Gardner IA (2004). Modelling risk when binary outcomes are subject to error. Statistics in Medicine 23, 1095–1109.
- Meshkat N and Sullivant S (2014). Identifiable reparametrizations of linear compartment models. Journal of Symbolic Computation 63, 46–67.
- Pepe MS and Janes H (2007). Insights into latent class analysis of diagnostic test performance. Biostatistics 8, 474–484.
- Rothenberg TJ (1971). Identification in parametric models. Econometrica 39, 577–591.
- Sturmfels B (2005). What is a Gröbner basis? Notices Amer. Math. Soc. 52, 1199–1200.
- Wang XN, Zhou V, Liu Q, Gao Y, and Zhou X-H (2016). Evaluation of the accuracy of diagnostic scales for a syndrome in Chinese medicine in the absence of a gold standard. Chinese Medicine 11, 35.
- Warner SL (1965). Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association 60, 63–69.
- Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, and Kleijnen J (2003). The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Medical Research Methodology 3, 25.
- Yang I and Becker MP (1997). Latent variable modeling of diagnostic accuracy. Biometrics pages 948–958.
- Zhou X-H (1993). Maximum likelihood estimators of sensitivity and specificity corrected for verification bias. Communications in Statistics-Theory and Methods 22, 3177–3198.
- Zhou X-H (1998). Correcting for verification bias in studies of a diagnostic test’s accuracy. Statistical Methods in Medical Research 7, 337–353.
