Ascertainment-Adjusted Parameter Estimates Revisited

Michael P Epstein; Xihong Lin; Michael Boehnke

doi:10.1086/339517

2002 Mar 5;70(4):886–895. doi: 10.1086/339517

PMCID: PMC379117 PMID: 11880949

Abstract

Ascertainment-adjusted parameter estimates from a genetic analysis are typically assumed to reflect the parameter values in the original population from which the ascertained data were collected. Burton et al. (2000) recently showed that, given unmodeled parameter heterogeneity, the standard ascertainment adjustment leads to biased parameter estimates of the population-based values. This finding has important implications in complex genetic studies, because of the potential existence of unmodeled genetic parameter heterogeneity. The authors further stated the important point that, given unmodeled heterogeneity, the ascertainment-adjusted parameter estimates reflect the true parameter values in the ascertained subpopulation. They illustrated these statements with two examples. By revisiting these examples, we demonstrate that if the ascertainment scheme and the nature of the data can be correctly modeled, then an ascertainment-adjusted analysis returns population-based parameter estimates. We further demonstrate that if the ascertainment scheme and data cannot be modeled properly, then the resulting ascertainment-adjusted analysis produces parameter estimates that generally do not reflect the true values in either the original population or the ascertained subpopulation.

Introduction

Adjusting for nonrandom sampling, or ascertainment, has been an important topic in the genetics literature for many years (e.g., Weinberg 1912; Apert 1914; Fisher 1934; Haldane 1938; Morton 1959; Cannings and Thompson 1977; Elston and Sobel 1979; Ewens and Shute 1986a, 1986b; Vieland and Hodge 1995; de Andrade and Amos 2000). Ascertainment issues arise often in genetic studies because of the frequent use of nonrandom sampling, particularly when the trait of interest is rare. For a family-based genetic study of a rare disease, a common ascertainment sampling procedure is to collect families with at least one or at least two affected members. Ascertainment usually results in oversampling subjects from the affected subset of the original population and undersampling subjects from the complementary set. Failure to account for this ascertainment effect may lead to biased estimates of the parameters of interest.

After proper adjustment for ascertainment has been made, it is generally assumed that the resulting analysis will yield parameter estimates that reflect the values of the parameters in the original population from which the ascertained data were collected. Recently, Burton et al. (2000) stated that, in the presence of unmodeled parameter heterogeneity, a standard ascertainment-adjusted analysis returns parameter estimates that are biased with respect to the population-based values. This finding has important implications in genetic studies because of the probable existence of unmodeled parameter heterogeneity in a complex genetic trait. The authors’ finding implies that it can be difficult, if not impossible, to interpret the results of an ascertainment-adjusted genetic analysis with respect to the original population. This raises the question of whether it is futile even to attempt an ascertainment-adjusted analysis in a genetic study.

Burton et al. (2000) went on to state the important point that, given unmodeled heterogeneity, ascertainment-adjusted parameter estimates reflect the true parameter values in the ascertained subpopulation. We interpret this statement to mean that, in the presence of unmodeled heterogeneity, ascertainment-adjusted parameter estimates converge to the true parameter values in the ascertained subpopulation. Burton and colleagues illustrated their statements with two examples.

In the present article, we make two points regarding ascertainment-adjusted analyses in the presence of latent parameter heterogeneity. First, we demonstrate that the proper construction of the ascertainment-adjusted likelihood (which properly models both the ascertainment mechanism and the true nature of the data) yields population-based parameter estimates. Second, we demonstrate that if one is unable to construct the correct ascertainment-adjusted likelihood (as Burton et al. [2000] pointed out, this can occur), then the resulting parameter estimates need not reflect the true values in either the original population or the ascertained subpopulation. We support our points by revisiting the two examples of Burton et al. (2000). For each example, we describe the authors’ ascertainment-adjusted methods. We then describe ascertainment-adjustment procedures that yield parameter estimates that (when identifiable) reflect the true parameter values in the original population. Finally, we show that the standard ascertainment-adjusted analyses in the two examples produce parameter estimates that do not reflect the true parameter values in the ascertained subpopulation.

Material and Methods

Assumptions and Definitions

Suppose our original population consists of a set of n independent sibships. Let n_ASC denote the total number of sibships ascertained from the original population and let J_i denote the number of siblings in ascertained sibship i. Let D_ij represent an indicator variable for the presence or absence of the disease in the jth sibling in the ith sibship, where D_ij=1 if the disease is present and D_ij=0 otherwise.

General Form of the Ascertainment-Adjusted Likelihood

In general, one constructs the standard ascertainment-adjusted likelihood by dividing the unconditional likelihood by the probability of the ascertainment event. We let ASC_i denote the ascertainment event for sibship i. For example, ASC_i could represent ascertainment based on the presence of at least one affected sibling, such that

\[ \mathrm{ASC}_i = \Big\{ \sum_{j=1}^{J_i} D_{ij} \ge 1 \Big\} . \]

The ascertainment-adjusted likelihood then takes the form

\[ L = \prod_{i=1}^{n_{\mathrm{ASC}}} \frac{P(D_{i1}, D_{i2}, \ldots, D_{iJ_i})}{P(\mathrm{ASC}_i)} . \tag{1} \]

Example 1: Estimating Disease Prevalence

In their first example, Burton et al. (2000) were interested in estimating disease prevalence under the assumption of a population of n sibships, each of size J. They distributed the sibships into one of K discrete strata, each with a different disease prevalence p_k (k=1,…,K). The affection status of each sibling depended only on the sibship’s stratum-specific disease prevalence. Burton and colleagues collected an ascertained subpopulation by ascertaining all n_ASC sibships that included at least one affected sibling. Let N^(k) and N^(k)_ASC denote the number of sibships from stratum k in the original population and ascertained subpopulation, respectively. By definition,

\[ n = \sum_{k=1}^{K} N^{(k)} \]

and

\[ n_{\mathrm{ASC}} = \sum_{k=1}^{K} N^{(k)}_{\mathrm{ASC}} . \]

Burton et al. (2000) estimated the overall disease prevalence p as the average of the prevalence of each stratum weighted by its stratum size, which is asymptotically equivalent to being weighted by the probability of stratum membership. We denote the overall disease prevalence p in the original population by p_P and that in the ascertained subpopulation by p_A. By definition, p_P is estimated by

\[ \hat{p}_P = \sum_{k=1}^{K} p_k \frac{N^{(k)}}{n} , \]

whereas p_A is estimated by

\[ \hat{p}_A = \sum_{k=1}^{K} p_k \frac{N^{(k)}_{\mathrm{ASC}}}{n_{\mathrm{ASC}}} . \]

Burton et al. (2000) assumed that stratum membership was unobservable and estimated p by combining the ascertained subpopulation of each of the K strata into one overall subpopulation; they then analyzed the resulting sample, using the classical approaches for a homogeneous sample. Because of prevalence heterogeneity across strata, sibships in the higher-risk strata were more likely to be ascertained than were sibships in the lower-risk strata. This leads to differences in the distribution of the values of the overall prevalence between the ascertained subpopulation (p_A) and the original population (p_P).

Burton et al. (2000) assumed that, for a given sibship, D_i1,D_i2,…,D_iJ were independent Bernoulli random variables with disease probability p. They then constructed the ascertainment-adjusted likelihood across the n_ASC ascertained sibships as

\[ L(p) = \prod_{j=1}^{J} \left[ \frac{\binom{J}{j} p^{j} (1-p)^{J-j}}{1-(1-p)^{J}} \right]^{n_j} , \tag{2} \]

where n_j represents the number of (ascertained) sibships with j affected members (j=1,…,J) and n₁+n₂+…+n_J=n_ASC.

The authors’ motivation for considering the likelihood (2) is that one would have difficulty constructing the correct likelihood because of the inherent inability to resolve all the latent stratification in the analysis. They acknowledged that likelihood (2) was incorrect because it did not properly account for the prevalence heterogeneity due to the effect of unobserved strata. We note that, in fact, the main reason for likelihood (2) to fail is that it assumes that the disease statuses of all subjects in the ascertained subpopulation are independent. However, under the data-generating mechanism assumed by the authors, D_i1,D_i2,…,D_iJ are independent only when conditioned on their sibship’s stratum membership and therefore are marginally dependent. The likelihood (2) does not account for the marginal dependence of these observations in the pooled subpopulation.
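The marginal dependence described above can be checked numerically. The following is a minimal sketch that assumes an illustrative two-stratum population (membership probabilities 0.8 and 0.2, prevalences 0.10 and 0.40); these values and the variable names are assumptions chosen for demonstration, not quantities reported by Burton et al. (2000). It compares the joint probability that two siblings are both affected with the product of their marginal probabilities.

```python
# Illustrative (assumed) two-stratum population, used only for demonstration.
pi = [0.8, 0.2]      # stratum membership probabilities (assumed)
p = [0.10, 0.40]     # stratum-specific prevalences (assumed)

# Marginal probability that a single sibling is affected.
p_one = sum(w * pk for w, pk in zip(pi, p))

# Marginal probability that two siblings in the same sibship are both affected.
# Within a stratum the siblings are independent, so the joint term is p_k ** 2.
p_both = sum(w * pk ** 2 for w, pk in zip(pi, p))

# Covariance of the two indicators; zero would mean marginal independence.
cov = p_both - p_one ** 2

print(f"P(D_i1 = 1)           = {p_one:.4f}")
print(f"P(D_i1 = 1, D_i2 = 1) = {p_both:.4f}")
print(f"Cov(D_i1, D_i2)       = {cov:.4f}")
```

The covariance equals the variance of the stratum prevalences across strata, so it is zero only when every stratum shares the same prevalence.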

We now illustrate our first point: that an analysis based on the correct likelihood (which properly models the ascertainment criterion and the dependent nature of the data) leads to population-based estimates. Later, we demonstrate our second point: that if the data cannot be modeled properly, then ascertainment-adjusted parameter estimates do not reflect the true values in either the ascertained subpopulation or the original population. It actually is not difficult mathematically to replace the incorrect likelihood (2) with one that correctly accounts for the dependence among the disease status indicators D_ij, under the sampling frame assumed by the authors. To allow for the dependence, we must account for the stratum membership of the various sibships within the likelihood. Let π_k be the proportion of the population that is in stratum k. Initially, we assume that π_k is known for all k. Conditional on sibship i being in stratum k, D_i1,D_i2,…,D_iJ are independent and each follows a Bernoulli distribution with disease probability p_k. The unconditional likelihood for sibship i then has the form

\[ P(D_{i1}, D_{i2}, \ldots, D_{iJ}) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} p_k^{D_{ij}} (1-p_k)^{1-D_{ij}} . \]

The ascertainment-adjusted likelihood across all n_ASC ascertained sibships is then

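\[ L = \prod_{i=1}^{n_{\mathrm{ASC}}} \frac{\sum_{k=1}^{K} \pi_k \prod_{j=1}^{J} p_k^{D_{ij}} (1-p_k)^{1-D_{ij}}}{\sum_{k=1}^{K} \pi_k \left[ 1-(1-p_k)^{J} \right]} . \tag{3} \]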

Using the ascertainment-adjusted likelihood (3), we can, in principle, obtain estimates p̂₁, p̂₂,…, p̂_K of the stratum-specific prevalences p₁, p₂,…, p_K and estimate the overall disease prevalence by

\[ \hat{p}_P = \sum_{k=1}^{K} \pi_k \hat{p}_k . \]

However, we show in the Appendix that the estimates of p₁, p₂,…, p_K are only identifiable when the sibship size J is strictly greater than the number of strata K.

A second issue for our ascertainment-adjusted likelihood (3) is that we are assuming that both the number of strata K and the probabilities of stratum membership π₁,π₂,…,π_K are known. However, as stated by Burton et al. (2000), these are typically unknown in genetic analyses. In such cases, we might apply latent-class analysis methods and mixture models (Roeder et al. 1999) to the data to obtain valid estimates of the overall disease prevalence p_P. If marker genotype data are available for individuals within the original population, we could also estimate K and π₁,π₂,…,π_K, using the methods suggested by Pritchard et al. (2000), and then estimate p_P by use of the likelihood (3).

We use an example to contrast the results of the ascertainment-adjusted likelihood (2) with the ascertainment-adjusted likelihood (3). Burton et al. (2000) originally examined a simulated data set of n=8,000 sibships, each of size J=3, that were distributed into one of K=4 strata, each with its own stratum-specific disease prevalence. Within stratum k, the authors simulated the disease status of a sibling using a Bernoulli random variable with disease probability p_k. After simulating the disease phenotypes within the sibships (n=8,000), the authors ascertained all n_ASC sibships with one or more affected siblings.

Burton et al. (2000) estimated the overall disease prevalence p in the ascertained subpopulation using two different analyses. Using the likelihood (2), they estimated p by use of Gibbs sampling procedures (Gelfand and Smith 1990). They also estimated p by use of the method-of-moments Li-Mantel (1968) estimator (see Appendix). Like the ascertainment-adjusted likelihood (2), the validity of the Li-Mantel estimator requires that D_i1, D_i2, and D_i3 be independent and be identically distributed as Bernoulli random variables with disease probability p. Application of the Li-Mantel method in this example fails because of the dependence among the D_ij. It should be noted that if the D_ij are marginally independent, the Gibbs sampling method and Li-Mantel method used by Burton et al. (2000) would yield consistent estimates of the population-based disease prevalence p, even when the population is composed of latent subpopulations with heterogeneous disease prevalences. In the Appendix, we show that this statement holds for the Li-Mantel method.

Burton et al. (2000) found that estimates of disease prevalence p, based on both Gibbs sampling and the Li-Mantel estimator, more closely resembled the prevalence in the ascertained subpopulation than that in the original population. They then asserted that overall prevalence estimates using these two methods reflect the overall disease prevalence in the ascertained subpopulation. We interpret this to mean both estimates asymptotically converge to the true prevalence in the ascertained subpopulation. However, we show in the Appendix that the Li-Mantel estimator does not converge to the true prevalence in the ascertained subpopulation. To verify our theoretical findings, we use the data in the example of Burton et al. (2000) and apply equations (B1) and (B2) in the Appendix. The theoretical overall prevalence is 0.132 in the original population and 0.223 in the ascertained subpopulation. Using equation (B3) in the Appendix, we calculate that the asymptotic theoretical value of the Li-Mantel estimator is 0.238. These values are in nearly perfect agreement with those reported by the authors. It should be noted that the difference between 0.238 and 0.223 is intrinsic and is not due to sampling error in finite samples. Thus, the Li-Mantel estimate that ignores the strata does not reflect the true value in either the original population or the ascertained subpopulation, which validates our second point.

We could not apply our ascertainment-adjusted likelihood (3) to the ascertained data set of Burton et al. (2000), since there are K=4 strata and the sibship size is J=3, which makes p₁, p₂,…, p_K unidentifiable. To assure identifiable prevalence estimates, we modified the example to assume only K=2 disease strata. We simulated a population of n=10,000 sibships, each of size J=3. Stratum 1 contained 8,000 sibships of size 3 and had a simulated disease prevalence p₁ of 0.10. Stratum 2 contained the remaining 2,000 sibships of size 3 and had a simulated disease prevalence p₂ of 0.40. The population characteristics are shown in table 1. The overall population prevalence is then p_P = (0.10)(8,000/10,000) + (0.40)(2,000/10,000) = 0.16.

Table 1.

Original Population Characteristics

Stratum	No. of Sibships	No. of Siblings	No. of Affected Siblings	Disease Prevalence
1	8,000	24,000	2,400	.10
2	2,000	6,000	2,400	.40
Total	10,000	30,000	4,800	.16

To help in interpretation, we simulated the number of sibships with zero, one, two, and three affected siblings within each stratum to be the numbers expected. We then ascertained all n_ASC=3,736 sibships with at least one affected sibling. The characteristics of the ascertained subpopulation are shown in table 2. The prevalence in the ascertained subpopulation is p_A = (0.10)(2,168/3,736) + (0.40) (1,568/3,736) = 0.226.

Table 2.

Ascertained Subpopulation Characteristics

Stratum	No. of Sibships with 1 Affected Sibling	No. of Sibships with 2 Affected Siblings	No. of Sibships with 3 Affected Siblings	Total No. of Ascertained Sibships	Total No. of Ascertained Siblings	Total No. of Ascertained Affected Siblings
1	1,944	216	8	2,168	6,504	2,400
2	864	576	128	1,568	4,704	2,400
Total	2,808	792	136	3,736	11,208	4,800
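The entries of table 2 follow directly from the binomial model within each stratum. As a minimal sketch (the function and variable names are assumptions made for illustration), the following computes the expected numbers of sibships with one, two, and three affected siblings in each stratum, together with the two prevalence summaries p_P and p_A.

```python
from math import comb

J = 3                                # sibship size
n_sibships = {1: 8000, 2: 2000}      # stratum -> number of sibships (table 1)
prevalence = {1: 0.10, 2: 0.40}      # stratum -> disease prevalence (table 1)

def expected_counts(n, p, size):
    """Expected number of sibships with j = 1..size affected siblings."""
    return [n * comb(size, j) * p ** j * (1 - p) ** (size - j)
            for j in range(1, size + 1)]

counts = {k: expected_counts(n_sibships[k], prevalence[k], J) for k in n_sibships}
n_asc_k = {k: sum(c) for k, c in counts.items()}   # ascertained sibships per stratum
n_asc = sum(n_asc_k.values())
n_total = sum(n_sibships.values())

# Stratum prevalences weighted by stratum size in each population.
p_P = sum(prevalence[k] * n_sibships[k] / n_total for k in n_sibships)
p_A = sum(prevalence[k] * n_asc_k[k] / n_asc for k in n_sibships)

for k in counts:
    print(k, [round(c) for c in counts[k]], round(n_asc_k[k]))
print(round(n_asc), round(p_P, 3), round(p_A, 3))
```

The higher-prevalence stratum makes up one-fifth of the original population but a much larger share of the ascertained sibships, which is why p_A exceeds p_P.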

From table 2, the numbers of sibships with one affected sibling (n₁), two affected siblings (n₂), and three affected siblings (n₃) across both strata are 2,808, 792, and 136, respectively. Using these ascertained counts and knowing π₁=4/5 and π₂=1/5, we applied our ascertainment-adjusted likelihood (3). Using a Fisher-scoring estimation procedure, we obtained stratum-specific prevalence estimates p̂₁ (SE = 0.020) and p̂₂ (SE = 0.008), consistent with the values of p₁ and p₂ in the original population and not with those in the ascertained subpopulation. We then estimated the overall prevalence p̂_P (SE = 0.017), which also reflects the overall disease prevalence in the original population. This validates our first point.
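To illustrate how likelihood (3) can be maximized from the pooled counts alone, the following sketch replaces the Fisher-scoring procedure with a plain grid search over (p₁, p₂) in steps of 0.01. The grid search, the function name `loglik3`, and the grid resolution are assumptions made for illustration and are not the estimation procedure used in the article; the counts and the stratum proportions are those given in the text.

```python
from math import comb, log

J = 3
pi = (0.8, 0.2)                       # known stratum proportions (4/5 and 1/5)
n_obs = {1: 2808, 2: 792, 3: 136}     # pooled ascertained counts n_1, n_2, n_3

def loglik3(p1, p2):
    """Log of likelihood (3) for K = 2 strata, up to an additive constant."""
    ps = (p1, p2)
    # Probability that a sibship is ascertained (at least one affected sibling).
    p_asc = sum(w * (1 - (1 - p) ** J) for w, p in zip(pi, ps))
    total = 0.0
    for j, n_j in n_obs.items():
        # Marginal probability of exactly j affected siblings, mixing over strata.
        p_j = sum(w * comb(J, j) * p ** j * (1 - p) ** (J - j)
                  for w, p in zip(pi, ps))
        total += n_j * log(p_j / p_asc)
    return total

# Coarse grid over the open unit square; a real analysis would use a
# proper optimizer and report standard errors.
grid = [i / 100 for i in range(1, 100)]
p1_hat, p2_hat = max(((a, b) for a in grid for b in grid),
                     key=lambda ab: loglik3(*ab))
p_hat = pi[0] * p1_hat + pi[1] * p2_hat

print(p1_hat, p2_hat, round(p_hat, 3))
```

Because the counts in table 2 were set to their expected values, the maximizing grid point should coincide with the generating prevalences, and the implied overall prevalence should match p_P rather than p_A.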

We then compared our results with those obtained by means of the classical procedures used by Burton et al. (2000), which did not use any information about the dependent nature of the data and were therefore biased. We applied a Fisher-scoring procedure using the likelihood (2) and obtained a biased prevalence estimate of 0.241 (SE = 0.004). Likewise, when we applied the authors’ Li-Mantel estimator, we obtained a biased prevalence estimate of 0.237 (SE = 0.006). Using (B3) in the Appendix, we found that the asymptotic theoretical value of the Li-Mantel estimator is 0.237. These estimates do not reflect the overall disease prevalence in either the ascertained subpopulation (p_A = 0.226) or the original population (p_P = 0.16). These results are consistent with our second point.
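Both classical point estimates can be reproduced from the pooled counts in table 2. The sketch below is an illustration under stated assumptions: it maximizes likelihood (2) by solving its score equation with bisection rather than Fisher scoring, and it takes the Li-Mantel estimator to be (a_ASC − n₁)/(m_ASC − n₁), the standard complete-ascertainment form. The function names are hypothetical.

```python
J = 3
n_obs = {1: 2808, 2: 792, 3: 136}     # pooled ascertained counts (table 2)

n_asc = sum(n_obs.values())                      # ascertained sibships
a_asc = sum(j * n for j, n in n_obs.items())     # affected siblings
m_asc = J * n_asc                                # siblings in ascertained sibships

def truncated_mean(p):
    """Mean number of affected siblings in a zero-truncated binomial(J, p)."""
    return J * p / (1 - (1 - p) ** J)

def mle_likelihood2(mean_affected, tol=1e-12):
    """Solve truncated_mean(p) = mean_affected, the score equation of (2)."""
    lo, hi = 1e-9, 1 - 1e-9
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if truncated_mean(mid) < mean_affected:   # truncated_mean is increasing
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

p_lik2 = mle_likelihood2(a_asc / n_asc)
p_li_mantel = (a_asc - n_obs[1]) / (m_asc - n_obs[1])

print(round(p_lik2, 3), round(p_li_mantel, 3))
```

Neither value matches p_P = 0.16 or p_A = 0.226, in line with the second point of the text.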

The results from this example support our two main points. We can consistently estimate the overall disease prevalence in the original population from the disease statuses of the siblings in the ascertained subpopulation if we can correctly model the dependent structure of the data in the ascertainment-adjusted likelihood. If not, the resulting estimates need not reflect the true parameter values in either the original population or the ascertained subpopulation. Not surprisingly, incorrect specification of the likelihood, as in equation (2), can lead to biased estimates of the disease prevalence. If a non–likelihood-based approach, such as the method-of-moments Li-Mantel estimator, is used, then it is important to make sure the assumptions regarding the nature of the data (such as independent observations) are valid.

Example 2: Estimating Parameters in a Logistic Variance-Component Model

In their second example, Burton et al. (2000) investigated the effect of ascertainment on parameter estimates in a logistic variance-components model. They simulated the disease-status indicator D_ij as a Bernoulli random variable with mean μ_ij, using a logistic variance-components model where log[μ_ij/(1−μ_ij)]=η_ij and η_ij=α+β_B z_ij,B+β_N z_ij,N+C_i (Breslow and Clayton 1993). In this model, α represents the overall intercept, β_B is the regression coefficient for a binary covariate z_B, β_N is the regression coefficient for a normally distributed covariate z_N, and C_i is a random effect shared by all members of the ith sibship. Fixed covariates were centered about their means, to have expected values of zero. The random effect C_i was assumed to follow a normal distribution with mean of zero and variance σ²_C. After simulating sibships under the logistic variance-components model, the authors ascertained all sibships with at least one affected member from the original population, to form their ascertained subpopulation.

In the example, we focus on illustrating our first point: that an ascertainment-adjusted analysis based on a properly constructed ascertainment-adjusted likelihood returns population-based parameter estimates. To demonstrate this, we first examined the ascertainment-adjusted likelihood that Burton et al. (2000) used for analysis. After viewing the computer code that Burton et al. (2000) used, we determined that the authors constructed their ascertainment-adjusted likelihood by dividing the likelihood of the data by the probability of ascertainment conditional on the random effects. They then integrated the conditional ascertainment-adjusted likelihood over the random effects C_i. Specifically, their ascertainment-adjusted likelihood had the form

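\[ L = \prod_{i=1}^{n_{\mathrm{ASC}}} \int \frac{\prod_{j=1}^{J_i} \mu_{ij}^{D_{ij}} (1-\mu_{ij})^{1-D_{ij}}}{P(\mathrm{ASC}_i \mid C_i)} \, f(C_i) \, dC_i , \tag{4} \]

where

\[ P(\mathrm{ASC}_i \mid C_i) = 1 - \prod_{j=1}^{J_i} (1-\mu_{ij}) \]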

and where f(C_i) denotes the probability-density function of the normally distributed random variable C_i.

However, using the usual ascertainment-adjusted likelihood (1), we obtained the following ascertainment-adjusted likelihood for this example:

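\[ L = \prod_{i=1}^{n_{\mathrm{ASC}}} \frac{\int \prod_{j=1}^{J_i} \mu_{ij}^{D_{ij}} (1-\mu_{ij})^{1-D_{ij}} \, f(C_i) \, dC_i}{\int \left[ 1 - \prod_{j=1}^{J_i} (1-\mu_{ij}) \right] f(C_i) \, dC_i} . \tag{5} \]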

The correct ascertainment-adjusted likelihood (5) is different from (4). The likelihood (5) requires integrating over the distribution of the random effects C_i in the numerator and denominator separately before taking their ratio. In contrast, the likelihood (4) is misspecified and conditions on both the ascertainment and the random effects first, followed by integration over the distribution of the random effects. Results based on the likelihood (5) are consistent with the suggestion by the authors that a likelihood-based model can be constructed that returns population-based parameter estimates (see below).
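The difference between the two constructions can be made concrete for a single sibship. The sketch below is a simplified illustration, not the authors' analysis: it drops the covariates (so η_ij = α + C_i), and it approximates the integrals over C_i with a normalized equally spaced grid rather than adaptive Gaussian quadrature. The function names and the grid settings are assumptions.

```python
from itertools import product
from math import exp, expm1, log1p, sqrt

ALPHA, SIGMA2, J = -5.0, 4.5, 5       # intercept, random-effect variance, sibship size
SIGMA = sqrt(SIGMA2)

# Discretized N(0, SIGMA2) distribution for C_i: equally spaced nodes on
# [-6*SIGMA, 6*SIGMA] with weights proportional to the normal density.
STEP = 0.01
nodes = [-6 * SIGMA + STEP * i for i in range(int(12 * SIGMA / STEP) + 1)]
raw = [exp(-c * c / (2 * SIGMA2)) for c in nodes]
norm = sum(raw)
weights = [r / norm for r in raw]

def mu(c):
    """P(D_ij = 1 | C_i = c) under the covariate-free logistic model."""
    return 1.0 / (1.0 + exp(-(ALPHA + c)))

def p_data(d, c):
    """P(D_i1, ..., D_iJ | C_i = c) for a 0/1 pattern d."""
    m, k = mu(c), sum(d)
    return m ** k * (1 - m) ** (len(d) - k)

def p_asc(c):
    """P(at least one affected sibling | C_i = c), computed stably."""
    return -expm1(J * log1p(-mu(c)))

def contribution_4(d):
    """Form (4): divide inside the integral, then integrate over C_i."""
    return sum(w * p_data(d, c) / p_asc(c) for c, w in zip(nodes, weights))

def contribution_5(d):
    """Form (5): integrate numerator and denominator separately, then divide."""
    num = sum(w * p_data(d, c) for c, w in zip(nodes, weights))
    den = sum(w * p_asc(c) for c, w in zip(nodes, weights))
    return num / den

one_affected = (1, 0, 0, 0, 0)
lik4, lik5 = contribution_4(one_affected), contribution_5(one_affected)
print(f"form (4): {lik4:.4f}   form (5): {lik5:.4f}")

# Each form sums to one over all patterns with at least one affected sibling.
patterns = [d for d in product((0, 1), repeat=J) if sum(d) > 0]
sum4 = sum(contribution_4(d) for d in patterns)
sum5 = sum(contribution_5(d) for d in patterns)
print(round(sum4, 6), round(sum5, 6))
```

Both forms define proper probability distributions over the ascertainable patterns, but they assign different probabilities to the same pattern. Form (4) treats C_i as if it followed the N(0, σ²_C) distribution after ascertainment, whereas form (5) applies that distribution to the original population.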

Burton et al. (2000) applied the ascertainment-adjusted likelihood (4) to analyze a simulated data set. The authors set α=-5, β_B=-0.4, β_N=0.3, and σ²_C=4.5 in their logistic variance-components model. They simulated sibships with five members and then ascertained samples of 1,000 sibships, each with at least one affected member. The authors correctly noted that this ascertainment criterion selects sibships in which values of C_i are primarily in the upper tail of the normal distribution, so that the features of the random effects C_i in the ascertained subpopulation are different from those in the original population. They also noted that, although the random effects are still approximately normally distributed in the ascertained subpopulation, the empirical mean and variance of the C_i were 2.76 and 2.42, respectively, in contrast to 0 and 4.5 in the original population. This affects the values of the grand mean (α) and the variance parameter (σ²_C) in the ascertained subpopulation. In the subpopulation, the grand mean (α) is −5 + 2.76 = −2.24, whereas the variance parameter σ²_C is 2.42. So, although the true parameter values of (α, σ²_C) were (−5, 4.5) in the original population, the authors expected (α, σ²_C) to be closer to (−2.24, 2.42) in the ascertained subpopulation.
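The shift in the distribution of C_i under ascertainment can be seen in a small simulation. This is a rough sketch under simplifying assumptions: the covariates are omitted (so η_ij = α + C_i), the population size and random seed are arbitrary, and the variable names are hypothetical. It is not expected to reproduce the reported values of 2.76 and 2.42 exactly.

```python
import random
from math import exp, sqrt
from statistics import mean, pvariance

random.seed(1)                         # arbitrary seed for reproducibility
ALPHA, SIGMA2, J = -5.0, 4.5, 5        # generating values from the text
N_SIBSHIPS = 100_000                   # assumed size of the original population

ascertained_c = []
for _ in range(N_SIBSHIPS):
    c = random.gauss(0.0, sqrt(SIGMA2))            # sibship random effect C_i
    mu = 1.0 / (1.0 + exp(-(ALPHA + c)))           # per-sibling disease probability
    # Keep the sibship only if at least one of its J siblings is affected.
    if any(random.random() < mu for _ in range(J)):
        ascertained_c.append(c)

mean_c = mean(ascertained_c)
var_c = pvariance(ascertained_c)
print(len(ascertained_c), round(mean_c, 2), round(var_c, 2))
print("implied intercept in ascertained sibships:", round(ALPHA + mean_c, 2))
```

The ascertained random effects have a positive mean and a variance below 4.5, which is the pattern the authors described.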

Burton et al. (2000) performed their ascertainment-adjusted analysis by applying the likelihood (4), using Gibbs sampling procedures (Gelfand and Smith 1990; Zeger and Karim 1991) in the software package WinBUGS (Spiegelhalter et al. 2000). Their analysis yielded parameter estimates α̂ (SE = 0.11) and σ̂²_C (SE = 0.32), as reported in an erratum by Burton et al. (2000). These estimates were closer to the expected values (−2.24, 2.42) in the ascertained subpopulation than to those in the original population (−5, 4.5). From these results, the authors argued that the ascertainment-adjusted parameter estimates reflected the values of the parameters in the ascertained subpopulation rather than those in the original population. We suggest instead that this conclusion results from the use of a misspecified likelihood and does not represent the true nature of the problem.

To study whether we can recover mean values of (α, σ²_C) in the original population by use of the ascertainment-adjusted likelihood (5), we simulated 100 data sets of 1,000 ascertained sibships, each of size 5, using the same logistic variance-components model and same ascertainment criterion as Burton et al. (2000). We analyzed the ascertained subpopulation by maximizing the likelihood (5), which we evaluated using adaptive Gaussian quadrature (Pinheiro and Bates 1995). To ensure a high degree of accuracy, we used 20 quadrature points in our analyses. We implemented these estimation procedures using the SAS version 8 procedure PROC NLMIXED (SAS Institute). Our SAS code is available upon request.

Our analyses yielded mean estimates of α and σ²_C of −4.77 (SD = 0.74) and 4.21 (SD = 1.01), respectively, over the 100 simulated data sets. These results are consistent with the generating values of −5.0 and 4.5 in the original population and are inconsistent with those of −2.24 and 2.42 in the ascertained subpopulation. Appealing to asymptotics, we repeated the simulations with 100 data sets of 10,000 ascertained sibships of size five. Analyses yielded even better mean estimates of α and σ²_C of −4.95 (SD=0.24) and 4.43 (SD=0.33), respectively. Our results for this example support our first point that, for a well-specified model, ascertainment-adjusted parameter estimates reflect the true values of the parameters in the original population when the correct ascertainment-adjusted likelihood is used.

Discussion

Given a well-defined ascertainment scheme, it has long been assumed that ascertainment correction leads to parameter estimates that reflect parameter values in the population. Burton et al. (2000) recently demonstrated that, given unmodeled heterogeneity, the usual ascertainment adjustment leads to parameter estimates that do not reflect those in the original population. This conclusion is certainly true and is a useful warning to avoid performing genetic analyses uncritically.

Burton et al. (2000) go on to state the important finding that, given unmodeled heterogeneity, ascertainment-adjusted parameter estimates reflect parameter values in the ascertained subpopulation, and they support their claim with two examples. We demonstrate instead that: (1) if the genetic mechanism and ascertainment scheme can be appropriately modeled, the genetic analysis should yield estimates consistent with the parameter values in the original population; and (2) if not, the estimates using the conventional method cannot be expected to reflect the parameters in either the original population or the ascertained subpopulation.

To support our argument, we revisited the two examples of Burton et al. (2000) and showed that, for these examples, properly specified analyses yield ascertainment-adjusted parameter estimates that reflect parameter values in the original population. As we have shown, the key to recovering estimates that reflect parameter values in the original population is correct specification of the ascertainment-adjusted likelihood in the analysis. Incorrect specification of the ascertainment-adjusted likelihood owing to, for example, use of the conventional method, unknown model features, nonidentifiability of the correct model, or an uncertain ascertainment scheme, can be expected to lead to parameter estimates that do not reflect the true values in either the original population or the ascertained subpopulation. Similar conclusions likely hold for non–likelihood-based ascertainment-adjusted estimation procedures. We showed this clearly in example 1, where we demonstrated that the conventional Li-Mantel method in this context failed to consistently estimate the true prevalence value in either the original population or the ascertained subpopulation.

Although we did not prove that the ascertainment-correction equation (1) works in general to obtain population-based parameter estimates, it is reasonable to assume that it does in cases for which the correct ascertainment-adjusted likelihood can be derived. We feel it is important to emphasize that proper construction of the ascertainment-adjusted likelihood (1) is necessary in order for the ascertainment-adjusted analysis to return valid population-based estimates. As Burton et al. (2000) pointed out, circumstances exist in the analysis of complex traits in which one will be unable to correctly model the true nature of the data by use of (1), owing, perhaps, to the inability to resolve all the hidden data-influencing strata. In such cases, the resulting ascertainment-adjusted parameter estimates cannot be expected to reflect the true values of the parameters in either the original population or the ascertained subpopulation. To avoid this unpleasant predicament in genetic studies, we should seek, when possible, to apply current statistical methods, such as those described by Pritchard et al. (2000), and to develop new approaches, such as mixture models, to identify hidden strata.

Acknowledgments

We thank Drs. Robert Elston and Jane Olson for their helpful comments. We thank Dr. Paul Burton for generously providing us his WinBUGS computer code. This work was supported by National Institutes of Health grants T32 HG00040 (to M.P.E.), R29 CA76404 (to X.L.), and R01 HG00376 (to M.B.).

Appendix A: Identifiability of p₁, p₂,…, p_K by Use of the Ascertainment-Adjusted Likelihood (3)

In this Appendix, we briefly describe why estimates of stratum-specific prevalences p₁, p₂,…,p_K, by use of the likelihood (3) are identifiable only when sibship size J is strictly greater than the number of strata K. To show this holds, define the function

\[ \theta_j = \frac{\binom{J}{j} \sum_{k=1}^{K} \pi_k p_k^{j} (1-p_k)^{J-j}}{\sum_{k=1}^{K} \pi_k \left[ 1-(1-p_k)^{J} \right]} \]

for j=1,…,J-1. We can rewrite the ascertainment-adjusted likelihood (3), up to a constant factor, as

\[ L = \left( 1 - \sum_{j=1}^{J-1} \theta_j \right)^{n_J} \prod_{j=1}^{J-1} \theta_j^{n_j} . \]

We can easily obtain maximum-likelihood estimates of θ_j (j=1,…,J-1) from equation (3), and, from these estimates, determine maximum-likelihood estimates of p₁, p₂,…, p_K. However, if K>J-1, then p₁, p₂,…, p_K are clearly nonidentifiable. Therefore, we will only obtain identifiable estimates of p₁, p₂,…, p_K when the sibship size J is strictly greater than the number of strata K (J>K).

Appendix B: The Li-Mantel (1968) Estimator of Disease Prevalence Assuming Complete Ascertainment

Assume we have a population consisting of n sibships, each of size three. Let n_j denote the number of sibships in the population with j affected siblings (j=0,…,3) such that n=n₀+n₁+n₂+n₃. As before, let D_ij denote the affection status of the jth sibling in the ith sibship. Also, let a=n₁+2n₂+3n₃ denote the total number of affected siblings in the population.

To estimate the overall disease prevalence p, we collect all sibships from the population that have at least one affected sibling, to form the ascertained subpopulation. As defined earlier, we let n_ASC=n₁+n₂+n₃ denote the total number of sibships in the ascertained subpopulation. Also, let m_ASC=3n_ASC denote the total number of siblings in the ascertained subpopulation and define a_ASC as the number of affected siblings in the ascertained subpopulation. Under our complete ascertainment model, a_ASC=a. The Li-Mantel (1968) estimator of p then takes the form p̂=(a_ASC−n₁)/(m_ASC−n₁). If the values of p are the same for all subjects within the population, then p̂ is a consistent, but not unbiased, estimator of p that solves an estimating equation (Li and Mantel 1968; Burton et al. 2000).

Li-Mantel Estimator Assuming Multiple Strata and Marginal Dependence of Siblings within a Sibship

Now, assume that the disease prevalence varies across strata within the original population. To be consistent with the first example of Burton et al. (2000), assume that the original population contains K=4 strata with prevalences p₁, p₂, p₃, and p₄. Assume that the disease statuses of siblings are independent only when conditioned on stratum membership (so the disease statuses of siblings are marginally dependent). Let π_k denote the proportion of the original population found in stratum k. Also, let N^(k) and N^(k)_ASC denote the number of sibships from stratum k in the original population and ascertained subpopulation, respectively. By definition,

graphic file with name AJHGv70p886df16.jpg

and

graphic file with name AJHGv70p886df17.jpg

Therefore, the overall disease prevalences in the original population and the ascertained subpopulation, which we denote as p_P and p_A, respectively, converge in probability to the following forms:

graphic file with name AJHGv70p886df18.jpg

and

graphic file with name AJHGv70p886df19.jpg
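The two limits can be sketched directly from the definitions above. This is a reconstruction under the stated assumptions (sibships of size three, siblings independent given stratum, complete ascertainment), not a copy of the original displays. A sibship from stratum k is ascertained with probability 1 − (1 − p_k)³, and every affected sibling belongs to an ascertained sibship, which suggests:

```latex
% Reconstructed sketch; assumes pi_k is the proportion of sibships in stratum k.
\begin{align*}
p_P &= \sum_{k=1}^{4} \pi_k \, p_k , \\
p_A &= \frac{\sum_{k=1}^{4} \pi_k \, p_k}
            {\sum_{k=1}^{4} \pi_k \left[ 1 - (1 - p_k)^3 \right]} .
\end{align*}
```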

Suppose we fail to detect the strata and observe only the pooled ascertained sibship counts (n₁, n₂, and n₃). Burton et al. (2000) stated that the Li-Mantel (1968) estimator p̂ should reflect the disease prevalence in the ascertained subpopulation, p_A, but not that in the original population, p_P. We show that the Li-Mantel estimator consistently estimates neither p_P nor p_A. To show this, we evaluate the marginal expectations E[a_ASC], E[m_ASC], and E[n₁] by conditioning on all possible strata. We obtain the following expected values:

graphic file with name AJHGv70p886df20.jpg

graphic file with name AJHGv70p886df21.jpg

graphic file with name AJHGv70p886df22.jpg

Using these expected values, we have

By comparison of (B3) with the theoretical forms of p_P and p_A in (B1) and (B2), it is clear that, when the disease statuses are marginally dependent and we fail to account for strata, the Li-Mantel estimate fails to consistently estimate the overall disease prevalence in either the original population (p_P) or the ascertained subpopulation (p_A). Olson and Cordell (2000) demonstrated a similar result in the analysis of sibling recurrence risk.
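The comparison can be illustrated numerically. The sketch below computes the three large-sample limits under conditional independence within strata, using expectations derived from the definitions above rather than copied from the original displays. The function name prevalence_limits and the stratum values in the example are hypothetical.

```python
def prevalence_limits(pi, p):
    """Large-sample limits for sibships of size three, strata undetected.

    pi: stratum proportions (summing to 1); p: stratum prevalences.
    Siblings are assumed independent given the sibship's stratum.
    Returns (p_P, p_A, limit of the Li-Mantel estimator).
    """
    pairs = list(zip(pi, p))
    # Per-sibship expectations; the common factor n cancels in every ratio.
    e_a = sum(w * 3 * q for w, q in pairs)                   # affected siblings
    e_m = sum(w * 3 * (1 - (1 - q) ** 3) for w, q in pairs)  # ascertained siblings
    e_n1 = sum(w * 3 * q * (1 - q) ** 2 for w, q in pairs)   # singleton sibships
    p_pop = e_a / 3
    p_asc = e_a / e_m
    lm_limit = (e_a - e_n1) / (e_m - e_n1)
    return p_pop, p_asc, lm_limit


# Hypothetical strata: equal proportions, prevalences 0.05 to 0.40.
p_pop, p_asc, lm_limit = prevalence_limits([0.25] * 4, [0.05, 0.10, 0.20, 0.40])
print(round(p_pop, 4), round(p_asc, 4), round(lm_limit, 4))
```

For these illustrative values the three quantities are roughly 0.19, 0.44, and 0.27, so the Li-Mantel limit matches neither prevalence. With a single stratum the Li-Mantel limit reduces to p.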

Li-Mantel Estimator Assuming Multiple Strata and Marginal Independence of Siblings in a Sibship

Now, let us assume that the disease statuses of siblings are marginally independent. We show in such a case that the Li-Mantel estimator will consistently estimate the population prevalence p_P, even when the population contains strata with heterogeneous disease prevalences. As before, assume that the original population contains K=4 strata with prevalences p₁, p₂, p₃, and p₄. Let π_k denote the proportion of the original population found in stratum k. It can easily be shown that the population disease prevalence converges in probability to

graphic file with name AJHGv70p886df24.jpg

Assuming marginal independence of siblings in a sibship, the expected values E[a_ASC], E[m_ASC], and E[n₁] are evaluated as

Using these expected values, we have

graphic file with name AJHGv70p886df28.jpg

This shows that, in the presence of hidden stratification, the Li-Mantel estimator consistently estimates the population prevalence when the disease statuses of siblings in a sibship are marginally independent. This might occur when disease statuses of siblings are determined entirely by environmental factors that have no tendency to be excessively shared by siblings.
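Under marginal independence, each sibling behaves like an independent draw with prevalence p_P, so the single-stratum argument applies to the pooled data. A small Monte Carlo sketch can make the contrast with the dependent case concrete. It assumes one particular mechanism for marginal independence (each sibling draws a stratum independently), which is only one way to satisfy the condition; the function name, seed, and stratum values are hypothetical.

```python
import random


def simulate_li_mantel(pi, p, n_sibships, shared_stratum, seed=0):
    """Monte Carlo Li-Mantel estimate for sibships of size three.

    shared_stratum=True: one stratum per sibship (marginal dependence).
    shared_stratum=False: one stratum per sibling (marginal independence).
    Assumes the estimator form (a_ASC - n1) / (m_ASC - n1).
    """
    rng = random.Random(seed)
    strata = range(len(pi))
    counts = [0, 0, 0, 0]  # counts[j] = sibships with j affected siblings
    for _ in range(n_sibships):
        if shared_stratum:
            ks = rng.choices(strata, weights=pi, k=1) * 3
        else:
            ks = rng.choices(strata, weights=pi, k=3)
        affected = sum(rng.random() < p[k] for k in ks)
        counts[affected] += 1
    n1, n2, n3 = counts[1], counts[2], counts[3]
    a_asc = n1 + 2 * n2 + 3 * n3   # affected siblings in ascertained sibships
    m_asc = 3 * (n1 + n2 + n3)     # all siblings in ascertained sibships
    return (a_asc - n1) / (m_asc - n1)


pi = [0.25] * 4
p = [0.05, 0.10, 0.20, 0.40]       # population prevalence is 0.1875
print(simulate_li_mantel(pi, p, 50000, shared_stratum=False))
print(simulate_li_mantel(pi, p, 50000, shared_stratum=True))
```

With these illustrative values the independent-sibling estimate should fall near the population prevalence of 0.1875, whereas the shared-stratum estimate should settle well above it.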

References

Apert E (1914) The laws of Naudin-Mendel. J Hered 5:492–497
Breslow NE, Clayton DG (1993) Approximate inference in generalized linear mixed models. J Am Stat Assoc 88:9–25
Burton PR, Palmer LJ, Jacobs K, Keen KJ, Olson JM, Elston RC (2000) Ascertainment adjustment: where does it take us? Am J Hum Genet 67:1505–1514 (erratum: 69:672 [2001])
Cannings C, Thompson EA (1977) Ascertainment in the sequential sampling of pedigrees. Clin Genet 12:208–212
de Andrade M, Amos CI (2000) Ascertainment issues in variance components models. Genet Epidemiol 19:333–344
Elston RC, Sobel E (1979) Sampling considerations in the gathering and analysis of pedigree data. Am J Hum Genet 31:62–69
Ewens WJ, Shute NC (1986a) The limits of ascertainment. Ann Hum Genet 50:399–402
Ewens WJ, Shute NC (1986b) A resolution of the ascertainment sampling problem. I. Theory. Theor Popul Biol 30:388–412
Fisher RA (1934) The effects of methods of ascertainment upon the estimation of frequencies. Ann Eugen 6:13–25
Gelfand AE, Smith AFM (1990) Sampling based approaches to calculating marginal densities. J Am Stat Assoc 85:398–409
Haldane JBS (1938) The estimation of the frequencies of recessive conditions in man. Ann Eugen 8:255–262
Li CC, Mantel N (1968) A simple method of estimating the segregation ratio under complete ascertainment. Am J Hum Genet 20:61–81
Morton NE (1959) Genetic tests under incomplete ascertainment. Am J Hum Genet 11:1–16
Olson JM, Cordell HJ (2000) Ascertainment bias in the estimation of sibling genetic risk parameters. Genet Epidemiol 18:217–235
Pinheiro JC, Bates DM (1995) Approximations to the log-likelihood function in the nonlinear mixed-effects model. J Comput Graph Statist 4:12–35
Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155:945–959
Roeder K, Lynch KG, Nagin DS (1999) Modeling uncertainty in latent class membership: a case study in criminology. J Am Stat Assoc 94:766–776
Spiegelhalter D, Thomas A, Best N (2000) WinBUGS version 1.3 user manual. MRC Biostatistics Unit, Cambridge, UK
Vieland VJ, Hodge SE (1995) Inherent intractability of the ascertainment problem for pedigree data: a general likelihood framework. Am J Hum Genet 56:33–43
Weinberg W (1912) Methode und Fehlerquellen der Untersuchung auf Mendelschen Zahlen beim Menschen. Arch Rass Ges Biol 9:165–174
Zeger SL, Karim MR (1991) Generalized linear models with random effects: a Gibbs sampling approach. J Am Stat Assoc 86:79–86
