American Journal of Human Genetics. 2000 Mar 29;66(4):1328–1340. doi: 10.1086/302851

Detection of Disease Genes by Use of Family Data. I. Likelihood-Based Theory

Alice S Whittemore 1, I-Ping Tu 1
PMCID: PMC1288198  PMID: 10739758

Abstract

We present a class of likelihood-based score statistics that accommodate genotypes of both unrelated individuals and families, thereby combining the advantages of case-control and family-based designs. The likelihood extends the one proposed by Schaid and colleagues (Schaid and Sommer 1993, 1994; Schaid 1996; Schaid and Li 1997) to arbitrary family structures with arbitrary patterns of missing data and to dense sets of multiple markers. The score statistic comprises two component test statistics. The first component statistic, the nonfounder statistic, evaluates disequilibrium in the transmission of marker alleles from parents to offspring. This statistic, when applied to nuclear families, generalizes the transmission/disequilibrium test to arbitrary numbers of affected and unaffected siblings, with or without typed parents. The second component statistic, the founder statistic, compares observed or inferred marker genotypes in the family founders with those of controls or those of some reference population. The founder statistic generalizes the statistics commonly used for case-control data. The strengths of the approach include both the ability to assess, by comparison of nonfounder and founder statistics, the potential bias resulting from population stratification and the ability to accommodate arbitrary family structures, thus eliminating the need for many different ad hoc tests. A limitation of the approach is the potential power loss and/or bias resulting from inappropriate assumptions on the distribution of founder genotypes. The systematic likelihood-based framework provided here should be useful in the evaluation of both the relative merits of case-control and various family-based designs and the relative merits of different tests applied to the same design. It should also be useful for genotype-disease association studies done with the use of a dense set of multiple markers.

Introduction

In some diseases with complex genetic etiologies, conflicting results have emerged from case-control studies of association, compared with linkage analyses based on allele-sharing within families. Specifically, although the case-control studies have shown strong associations, the linkage tests have proved negative (Parsian et al. 1991). To explain this phenomenon, Risch and Merikangas (1996) have suggested that allele-sharing linkage tests can have poor power compared with tests for association and that a genomewide search for associations may be more sensitive than genome scanning for determination of linkage.

However, case-control studies may give biased measures of association as a result of unrecognized ethnic admixture of the population (known as the “population stratification” problem). This possibility has prompted interest in the use of family-based designs. Comparison of genotypes of affected individuals with those of their unaffected siblings or with Mendelian expectation based on the genotypes of their parents allows such designs to avoid this problem. However, family-based designs can be less powerful than case-control designs (Witte et al. 1999), and their advantage is unclear in light of uncertainty about the need to control for population stratification (Rothman et al. 1999).

In the present study, we derive a class of likelihood-based test statistics that are applicable to cases, controls, and arbitrary families with arbitrary patterns of missing data and that combine the advantages of family-based and case-control designs. The tests are based on the score statistics derived from a specific likelihood for the data. The likelihood function extends, to arbitrary families and to multiple markers, the likelihood proposed by Schaid and co-authors (Schaid and Sommer 1993, 1994; Schaid 1996; Schaid and Li 1997) for nuclear families with only affected offspring.

The score statistic comprises two component test statistics. The first statistic, the nonfounder statistic (NFS), evaluates transmission disequilibrium from parents to offspring. This statistic generalizes the transmission/disequilibrium test (TDT) (Ott 1989; Terwilliger and Ott 1992; Knapp et al. 1993; Spielman et al. 1993; Ewens and Spielman 1995; Spielman and Ewens 1996) and the score statistics proposed by Schaid and Sommer (1994), to include multiallelic markers, markers distinct from the trait locus, and multiple markers, with the use of families with both affected and unaffected offspring and families with missing parental genotypes. The second component statistic, the founder statistic (FS), compares marker genotypes in the family founders with those expected under the null hypothesis. This statistic generalizes the statistics that are commonly used for case-control data (Barcellos et al. 1997; Risch and Teng 1998).

We illustrate the tests with some simple examples. These focus on genotypes at a single diallelic marker that is in partial or complete disequilibrium with the etiologically relevant disease locus. In the present article, we apply the statistics to the special case of unrelated individuals; in the companion article (Tu et al. 2000), published in this issue of the Journal, we treat nuclear families. Basing our tests on a likelihood function requires that we specify a penetrance model for the relationship between the disease and the genotypes at the putative disease locus. We consider dominant and recessive models as well as a family of generalized linear models (GLMs) that includes the additive, multiplicative, and linear logistic models. To use the test statistics based on the dominant and recessive models, we must specify the extent of gametic disequilibrium between marker and disease loci. For GLMs, however, the tests depend only on the total allele counts at the marker locus in affected and unaffected individuals. Thus, for these models, the tests can be used with pooled DNA.

Test Statistics

We assume that members in each of N unrelated families have been genotyped at a set of closely spaced markers in a chromosomal region. We also assume that, for some of the family members, phenotype (affected versus unaffected) is known. We want to use the marker data to test the null hypothesis that none of the genes in the region is related to the disease. Our objective is to develop test statistics with good power under the alternative hypothesis that a locus in the region is associated with the disease. We base our tests on the likelihood of the family’s marker data, considered as a function of position t in the region. By “region,” we mean the set of all loci that are both linked to and in gametic disequilibrium with at least one of the markers. We make basic assumption A.1: given the family's genotypes at a test locus t, the family's phenotypes and marker genotypes are independent. Figure 1 shows the chromosomal region when the test locus t lies between two markers.

Figure 1.


A portion of the chromosomal region of interest, in which the test locus t is flanked by two diallelic markers. The power of the test statistics is determined by (a) the extent of disequilibrium between t and the two markers among the chromosomes in the founder population and (b) the probabilities of recombination between t and each of the two markers in meiosis from parents to offspring within a family.

Likelihood for a Family

We define a family to be a set of individuals such that (a) any two individuals are connected in the sense proposed by Thompson (1986, p 21) and (b) each individual is either a founder (neither parent belongs to the family) or a nonfounder (both parents belong to the family). The genotype and/or phenotype of any family member may be unknown; however, the family must contain at least one member with a known phenotype and at least one member with a known genotype. Let y=(y1,...,ym) denote the vector of phenotypes for the m members with known phenotype. Here yℓ has a value of 1 if the ℓth family member is affected with the disease and has a value of 0 otherwise, ℓ=1,...,m. We assume that phenotypes are missing at random (Little and Rubin 1987); that is, the probability of failure to observe a member’s disease status does not depend on his or her actual status or on his or her marker genotypes. With this assumption, the likelihood at locus t is the probability P(ℳ|R,y,t) of the family’s observed marker data, denoted as ℳ, given the family’s genealogical structure R, given the vector y of observed phenotypes, and given that t is the disease locus. We use Bayes' theorem to write this probability as follows:

$$P(\mathcal{M}\mid R,y,t)=\frac{P(y\mid\mathcal{M},R,t)\,P(\mathcal{M}\mid R,t)}{P(y\mid R,t)}.\tag{1}$$

To simplify the notation, we now suppress the dependence of the probabilities both on the family structure R and on the particular locus t. Let g=(g1,...,gm) denote the vector of genotypes at locus t of the m family members of known phenotype. We want to allow for the possibility that t is not one of the marker loci. Assumption A.1 states that the family’s phenotype y and its marker data are conditionally independent, given g. Thus, we write

$$P(y\mid\mathcal{M})=\sum_{g}P(y\mid g)\,P(g\mid\mathcal{M}),\tag{2}$$

where Σg denotes summation over all possible genotype vectors g. Substitution of expression (2) into probability (1) gives the likelihood for the family as

$$P(\mathcal{M}\mid y)=\frac{P(\mathcal{M})\sum_{g}P(y\mid g)\,P(g\mid\mathcal{M})}{P(y)}.\tag{3}$$

In this instance, P(y)=ΣgP(y|g)P(g) is the marginal probability of the family phenotype.

Family likelihood (3) involves three types of probabilities. The first type is the probability P(ℳ) of the family’s marker data, under the null hypothesis that the chromosomal region containing the markers is not related to disease risk. We shall specify this probability in terms of a vector of marker parameters. These determine the frequencies of the marker alleles or haplotypes among chromosomes in the populations from which the family’s founders were drawn, and we assume that they are known.

The second type of probability relates the alleles at locus t to the marker data. These probabilities appear in expression (3) as P(g|ℳ). They depend on the extent of gametic disequilibrium between locus t and the marker loci in the family founders and on the probability of recombination between t and its nearest marker loci in meioses within a family. In general, the extent of gametic disequilibrium between markers and t is not known, unless t is one of the marker loci. However, in many situations of practical interest, the test statistics themselves do not depend on this gametic disequilibrium, although their power does.

The third type of probability needed in the likelihood consists of the penetrance functions P(y|g) for the various possible family genotypes g. For family members with known phenotype, these penetrance functions give the joint probability of disease occurrence or nonoccurrence as functions of the members' genotypes at locus t. To specify these penetrances, we assume that, at most, one allele or one group of alleles, labeled D1, confers elevated disease risk. We also group all other alleles and label this group as allele D2. We set g=i for the genotype of an individual who carries i copies of the putative high-risk allele D1, i=0,1,2.

For families with one individual of known phenotype, we need only the three penetrances P(y=1|g=i), i=0,1,2. We shall consider the following general class of penetrance models:

$$P(y=1\mid g=i)=\pi(\alpha+\beta c_i),\quad i=0,1,2,\qquad c_0=0,\ c_2=1.\tag{4}$$

Here π is a known smooth monotonic function, α is a constant that specifies risk in D2D2 homozygotes, and β is an unknown constant relating risk in D1D1 homozygotes to that in D2D2 homozygotes. Also, c1 is a specified constant relating disease risk in D1D2 heterozygotes to that in D2D2 homozygotes. The value β=0 corresponds to the null hypothesis of no relation between the disease and locus t. In this instance, the parameter α determines the disease prevalence in the population π(α), under the null hypothesis that disease is unrelated to the family genotype g. We will often assume that π(α) is known from data on disease prevalence in the population. We shall call α and β “penetrance parameters.”

Penetrance has traditionally been modeled with the use of π(x)=x, with c1=1 for the dominant model, c1=0 for the recessive model, and c1=1/2 for the additive model. Other models in the class (4) include the multiplicative model, with π(x)=exp(x) and c1=1/2 (Self et al. 1991; Risch and Merikangas 1996; Schaid 1996; Whittaker and Lewis 1998), and the linear logistic model, with π(x)=e^x/(1+e^x) and c1=1/2. The penetrances in any model (4) with c1=1/2 are, after an appropriate transformation, linear in the number i of high-risk D1 alleles. Accordingly, we shall call them “GLMs” (McCullagh and Nelder 1989). Individuals with genotype g at a putative trait locus t will be said to have a “D1 count of cg.” Thus, if the penetrances are specified by a GLM, an individual’s D1 count is one-half of the number of his or her copies of allele D1. In contrast, for a recessive or dominant model, an individual’s D1 count is 1 if he or she is a carrier of the high-risk genotype; otherwise, it is 0.
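The penetrance models named above can be sketched in a few lines of Python. This is an illustration only: it assumes that class (4) has the form P(y=1|g=i)=π(α+βci) with c0=0 and c2=1, which is how the surrounding text describes it, and the names `MODELS`, `penetrances`, and `d1_count` are hypothetical, not taken from the article.

```python
import math

def logistic(x):
    """Linear logistic link: pi(x) = e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))

# Each model is a pair (link function pi, heterozygote constant c1).
# Assumed form of class (4): P(y=1 | g=i) = pi(alpha + beta * c_i), c_0=0, c_2=1.
MODELS = {
    "dominant": (lambda x: x, 1.0),
    "recessive": (lambda x: x, 0.0),
    "additive": (lambda x: x, 0.5),
    "multiplicative": (math.exp, 0.5),
    "logistic": (logistic, 0.5),
}

def d1_count(i, c1):
    """D1 count c_i of an individual carrying i copies of allele D1."""
    return (0.0, c1, 1.0)[i]

def penetrances(model, alpha, beta):
    """Return [P(y=1|g=0), P(y=1|g=1), P(y=1|g=2)] for the named model."""
    link, c1 = MODELS[model]
    return [link(alpha + beta * d1_count(i, c1)) for i in (0, 1, 2)]

if __name__ == "__main__":
    for name in MODELS:
        print(name, penetrances(name, alpha=0.1, beta=0.2))
```

Setting beta=0 makes all three penetrances equal to π(α), which is the null hypothesis of no relation between the disease and locus t.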

For families with m>1 members of known phenotype, we must specify their joint probability of disease, conditional on their genotypes at locus t. We assume that, given his or her own genotype g, an individual's phenotype does not depend on the genotypes of his or her relatives and that, given g, the phenotypes of relatives are conditionally independent. Therefore,

$$P(y\mid g)=\prod_{\ell=1}^{m}P(y_\ell\mid g_\ell).\tag{5}$$

These assumptions ignore the possibility that residual correlation in family phenotypes may be a result of other loci responsible for the disease or of shared, unmeasured risk factors. Fortuitously, the proposed statistical tests remain valid (in the sense of having the correct null asymptotic p values) even if this residual correlation is ignored, and we shall ignore it hereafter. However, it is possible that more-accurate modeling of the correlation could improve statistical power, and this possibility requires investigation.

From (5), we see that, under the null hypothesis β=0, the probability P(y|g) is independent of the family genotype g; thus, the likelihood (3) is simply the null probability P(ℳ) of the marker data. Our objective is to derive efficient score statistics based on the likelihood (3) and to use them to test the null hypothesis β=0 for various test loci t in the region covered by the markers. When assumption A.1 holds, the score statistics have standard Gaussian null distributions—asymptotically, as N→∞. After presentation of the statistics, we will discuss, in brief, the types of bias that can arise when assumption A.1 fails.

Score for a Family

We consider the use of likelihood-based efficient score statistics (Cox and Hinkley 1974) for testing of the null hypothesis that a family’s phenotype y is independent of its genotype g at locus t, that is, β=0. To describe these statistics, we suppose that there are K possible categories of marker genotypes for the family and that, if the marker genotypes were known for all members, the family could be classified in one and only one of the categories. For example, table 1 shows the K=15 categories for a nuclear family with one offspring, when the marker data consist of genotypes at a single diallelic locus. In general, the family’s marker category may not be known, because some members have not been typed at some or all of the marker loci. To deal with this uncertainty, we shall introduce a random variable xk that represents the null probability that the family has category k, given its observed marker data ℳ:

$$x_k=P(k\mid\mathcal{M})=\frac{r_k\,P(\mathcal{M}\mid k)}{\sum_{j=1}^{K}r_j\,P(\mathcal{M}\mid j)},\quad k=1,\dots,K.\tag{6}$$

In this instance, rk is the marginal probability that the family belongs to category k, under the null hypothesis. If the family is known to have, for example, category ℓ, then xℓ=1 and xk=0, k≠ℓ. If the category is not known, then xk is a conditional probability, given the marker data.

Table 1.

K=15 Categories of a Diallelic Autosomal Marker for a Nuclear Family with One Offspring

                                       Probability(b)
Marker Category (k=fh=f1f2h)(a)      uf      vh|f    C1k,C2k,C3k(c)   wk(d)
 1. B1B1,B1B1,B1B1 (222)             u2^2    1       1, 1, 1          2-ψ
 2. B1B1,B1B2,B1B1 (212)             u1u2    1/2     1, c1, 1         2-ψc1
 3. B1B1,B1B2,B1B2 (211)             u1u2    1/2     1, c1, c1        1-ψc1+c1
 4. B1B2,B1B1,B1B1 (122)             u1u2    1/2     c1, 1, 1         c1-ψ+1
 5. B1B2,B1B1,B1B2 (121)             u1u2    1/2     c1, 1, c1        2c1-ψ
 6. B1B1,B2B2,B1B2 (201)             u0u2    1       1, 0, c1         1+c1
 7. B2B2,B1B1,B1B2 (021)             u0u2    1       0, 1, c1         -ψ+c1
 8. B1B2,B1B2,B1B1 (112)             u1^2    1/4     c1, c1, 1        (1-ψ)c1+1
 9. B1B2,B1B2,B1B2 (111)             u1^2    1/2     c1, c1, c1       (2-ψ)c1
10. B1B2,B1B2,B2B2 (110)             u1^2    1/4     c1, c1, 0        (1-ψ)c1
11. B1B2,B2B2,B1B2 (101)             u1u0    1/2     c1, 0, c1        2c1
12. B1B2,B2B2,B2B2 (100)             u1u0    1/2     c1, 0, 0         c1
13. B2B2,B1B2,B1B2 (011)             u0u1    1/2     0, c1, c1        (1-ψ)c1
14. B2B2,B1B2,B2B2 (010)             u0u1    1/2     0, c1, 0         -ψc1
15. B2B2,B2B2,B2B2 (000)             u0^2    1       0, 0, 0          0

(a) fi = number of B1 alleles in the genotype of parent i, i=1,2; h = number of B1 alleles in genotype of the offspring.

(b) uf=uf1f2 = probability of parental genotype f=f1f2; vh|f = probability of offspring genotype h, given parental genotype f.

(c) Cℓk = D1 count of family member ℓ, ℓ=1,2,3, when family has category k and when D1=B1.

(d) wk=a1C1k+a2C2k+a3C3k, where aℓ is a phenotype value for family member ℓ, ℓ=1,2,3. The values shown are for (a1,a2,a3)=(1,-ψ,1).

In the Appendix, we show that, for the class (4) of penetrance models and on the basis of the likelihood (3) at locus t, the score for the family is

$$S=\sum_{k=1}^{K}w_k\,(x_k-r_k).\tag{7}$$

Here wk is a weight attached to marker category k. Thus, the score is a weighted sum of deviations between “observed” and expected frequencies of the K marker categories. The weights wk determine the importance of one marker category relative to another. They depend on the relations between genotypes at marker loci and genotypes at locus t as well as on the penetrance functions relating disease to genotypes at t.

To describe the weights, we consider the simple example of a nuclear family with one offspring evaluated at a single diallelic “candidate gene” with alleles B1=D1 and B2=D2 and with no missing genotypes or phenotypes. In table 1, the Marker Category column shows the 15 possible marker categories for the family, where f1, f2, and h denote the genotypes of parent 1, parent 2, and the offspring, respectively. For category k=f1f2h, the penetrance values c2, c1, and c0 determine a triple value, C1kC2kC3k=cf1cf2ch, of D1 counts for the family. The C1kC2kC3k column in table 1 shows these triple values for c2=1 and c0=0. The weight wk for category k is

$$w_k=\sum_{\ell=1}^{n}a_\ell\,C_{\ell k},\tag{8}$$

where Cℓk is the D1 count of family member ℓ when the family has marker category k, ℓ=1,...,n, and n=3 in this example. Also, aℓ, a phenotype value for member ℓ, is defined as follows:

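$$a_\ell=\begin{cases}1&\text{if member }\ell\text{ is affected,}\\-\psi&\text{if member }\ell\text{ is unaffected,}\\0&\text{if the phenotype of member }\ell\text{ is unknown.}\end{cases}\tag{9}$$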

In this definition, ψ=π/(1-π), where π=π(α) is the disease prevalence in the population under the null hypothesis. If, for example, parent 1 and the offspring are affected and parent 2 is unaffected, then a1=a3=1 and a2=-ψ. Substitution of these values in (8) results in wk=C1k+C3k-ψC2k. Thus, wk is a difference between D1 counts of affected and unaffected family members, with unaffected family members contributing a value ψ, relative to affected members. In table 1, the wk column shows the weights wk for the 15 marker categories.

The marker category probability rk in score (7) factors as the probability of the founder-genotype category multiplied by the conditional probability of the nonfounder-genotype category, given the founder-genotype category. Thus, a specific marker category can be written as k=fh, where f and h represent categories of marker genotypes for founders and nonfounders, respectively. For example, the family genotype B1B1,B1B2,B1B2 (table 1, row 3) is labeled f=21 and h=1, or k=fh=211. We write

$$r_k=r_{fh}=u_f\,v_{h\mid f},\tag{10}$$

where uf is the null probability of founder category f and where vh|f is the probability of nonfounder subcategory h, given that the founders’ genotypes belong to category f. The Probability column of table 1 shows uf and vh|f for the nuclear family in this example. Notice that the probabilities rk depend on the marker parameters only through uf; the probabilities vh|f are constants determined by the Mendelian laws of inheritance.

To illustrate computation of the score S, suppose that the nuclear family is observed to have marker category f1f2h=212. Then, from equation (6), x212=1 and xk=0, k≠212. Substitution of these values into equation (7) gives the score for this family as S=w212-Σk rkwk. From equation (8), we see that the score is the difference between the observed and expected values of a linear combination of D1 counts among the family members.

In row 2 (B1B1,B1B2,B1B1 [212]) of table 1, w212=2-ψc1. Also, from equation (8) and from the values for rk in table 1, Σk rkwk=Σℓ aℓ(Σk rkCℓk)=(a1+a2)(u2+c1u1)+a3(ũ2+c1ũ1), where ũ2=(u2+ 1/2 u1)^2 and ũ1=u1+2u0u2- 1/2 u1^2 are the null probabilities that the offspring has genotype B1B1 and B1B2, respectively. Thus, with (a1,a2,a3)=(1,-ψ,1), S=2-ψc1-(1-ψ)(u2+c1u1)-(ũ2+c1ũ1). Under Hardy-Weinberg equilibrium (HWE) for the parental-genotype frequencies, ũ2=u2=[P(B1)]^2 and ũ1=u1=2P(B1)P(B2). If, in addition, c1=1/2, then S=2- 1/2 ψ-(2-ψ)P(B1).
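The arithmetic in this example can be checked numerically. The Python sketch below enumerates the 15 trio categories of table 1, builds the weights of equation (8), and evaluates the score as the observed weight minus its null expectation for a fully typed family. It assumes, as table 1 does, that the two parental genotypes are independent, so that rk is the product of the two parental-genotype probabilities and the Mendelian transmission probability. The function names and the values P(B1)=0.3 and ψ=0.1 are illustrative choices, not values from the article.

```python
def mendel(f1, f2):
    """Mendelian P(h | f1, f2): offspring B1 count given parental B1 counts."""
    t1, t2 = f1 / 2, f2 / 2  # chance that each parent transmits B1
    probs = {2: t1 * t2,
             1: t1 * (1 - t2) + (1 - t1) * t2,
             0: (1 - t1) * (1 - t2)}
    return {h: v for h, v in probs.items() if v > 0}

def trio_categories(u):
    """Map each category (f1, f2, h) to its null probability u[f1]*u[f2]*v."""
    return {(f1, f2, h): u[f1] * u[f2] * v
            for f1 in (2, 1, 0) for f2 in (2, 1, 0)
            for h, v in mendel(f1, f2).items()}

def weight(k, a, c1):
    """w_k = sum over members of a_l * C_lk, with D1 counts (0, c1, 1)."""
    count = (0.0, c1, 1.0)
    return sum(a_l * count[g] for a_l, g in zip(a, k))

def score(observed, a, c1, u):
    """Score for a fully typed trio: observed weight minus its null mean."""
    r = trio_categories(u)
    return weight(observed, a, c1) - sum(r_k * weight(k, a, c1)
                                         for k, r_k in r.items())

def hwe(p):
    """Genotype probabilities (indexed by B1 count) under HWE, p = P(B1)."""
    return {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}

p, psi = 0.3, 0.1                       # illustrative values
S = score((2, 1, 2), a=(1.0, -psi, 1.0), c1=0.5, u=hwe(p))
closed_form = 2 - 0.5 * psi - (2 - psi) * p

if __name__ == "__main__":
    print(S, closed_form)
```

The enumerated score agrees with the closed form S=2- 1/2 ψ-(2-ψ)P(B1) for these inputs.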

Suppose that parent 1 is untyped. We therefore know only that the family marker category is either 212 or 112. Thus, according to equation (6) and rows 2 and 8 (B1B1,B1B2,B1B1 [212] and B1B2,B1B2,B1B1 [112], respectively) of table 1,

$$x_{212}=\frac{r_{212}}{r_{212}+r_{112}}=\frac{2u_2}{2u_2+u_1},\qquad x_{112}=\frac{r_{112}}{r_{212}+r_{112}}=\frac{u_1}{2u_2+u_1}.$$

Under HWE, x212=P(B1) and x112=P(B2). The family’s score is as follows:

$$S=x_{212}\,w_{212}+x_{112}\,w_{112}-\sum_{k}r_k w_k=P(B_1)\,w_{212}+P(B_2)\,w_{112}-\sum_{k}r_k w_k.$$
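The imputation of the marker category for the untyped parent can be illustrated with a short Python sketch. It assumes that, for the categories compatible with the observed data, xk is simply rk renormalized over those categories, and it uses the rk values from rows 2 and 8 of table 1 under HWE. The helper names and the value P(B1)=0.3 are illustrative.

```python
def hwe(p):
    """Genotype probabilities (indexed by B1 count) under HWE, p = P(B1)."""
    return {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}

def posterior_categories(compatible, r):
    """x_k = r_k / (sum of r_j over categories compatible with the data)."""
    total = sum(r[k] for k in compatible)
    return {k: r[k] / total for k in compatible}

p = 0.3                                 # illustrative allele frequency P(B1)
u = hwe(p)

# Null probabilities from table 1: r_212 = u1*u2*(1/2), r_112 = u1^2*(1/4).
r = {"212": u[1] * u[2] * 0.5, "112": u[1] * u[1] * 0.25}

# Parent 1 untyped; parent 2 is B1B2 and the offspring is B1B1.
x = posterior_categories(["212", "112"], r)

if __name__ == "__main__":
    print(x)
```

For these inputs the two imputed probabilities equal P(B1) and P(B2), as stated in the text.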

In general, when some family members are untyped, a random vector of probabilities (6) is assigned to the possible categories, where the probabilities are conditional on the observed marker data for the family. This random vector depends on the null founder-genotype probabilities uf, which must be specified or estimated from external data. In the companion article (Tu et al. 2000), we illustrate maximum-likelihood estimation of uf when the founders are parents in nuclear families. Martin et al. (1998) have also proposed such genotype reconstruction for parents in nuclear families.

We will now describe the weights wk for families of arbitrary structure, not only when some members may be untyped but, also, when the markers do not necessarily include the trait locus. To do so, we will let γℓk(i) denote the probability that member ℓ of a family with marker category k carries i copies of allele D1, i=0,1,2. The weights are then given according to equation (8), where Cℓk=c1γℓk(1)+γℓk(2) is now the conditional D1 count for family member ℓ, given the family marker category k. When, as in the example shown in table 1, the marker data include genotypes at locus t, a family marker category k specifies a D1 count for each family member.

In summary, computation of a family’s score S may involve two types of imputation: (a) imputation of the family marker category when it is incompletely observed (e.g., when parental genotypes are missing) and (b) imputation of family members’ D1 counts at the putative disease locus t, when locus t is not one of the markers. The latter imputation requires knowledge of the extent of disequilibrium between locus t and its nearest markers among the family founders. Like the disease prevalence π, disequilibrium parameters relating alleles at locus t to those at the marker loci cannot be estimated from the data, since genotypes at locus t are not observed. However, misspecification of the disequilibrium parameters will not affect the validity of a score test, although it will decrease the test’s power.

Expressions (8) and (9) indicate that the contribution of an unaffected member, relative to that of an affected member, is determined by the disease odds ψ=π/(1-π). Since the disease prevalence π is invariably <1/2, the counts of unaffected members contribute less to the score than do those of affected members. In particular, for rare diseases (π≪1), the counts of unaffected members contribute little to the score statistic. For many diseases, π is known at least approximately. It cannot be estimated from the family data, since the families have been ascertained on the basis of their phenotype. However, since π appears only in the weights wk and not in the null probabilities rk, its misspecification will not produce incorrect p values, although it may affect statistical power. Choosing optimal values for ψ is an area that requires research. In the companion article (Tu et al. 2000), we use simulations to compare the power of tests based on genotypes in phenotype-discordant sib pairs, with the use of ψ=1 and ψ equal to the disease odds in the population.

Under the null hypothesis, the score S of equation (7) has a mean of 0 and an asymptotic variance of

graphic file with name AJHGv66p1328df13.jpg

as described in the Appendix.

Decomposition of the Score

The factorization (10) of the null family marker category probabilities rk induces a decomposition of the score (7) as a sum of nonfounder and founder scores: S=SNF+SF. In this instance, the nonfounder score is

$$S_{NF}=\sum_{f}\sum_{h\mid f}w_{fh}\,\bigl(x_{fh}-x_{f\bullet}\,v_{h\mid f}\bigr),\tag{12}$$

where Σh|f denotes summation over the nonfounder categories h compatible with f, wfh=wk, and xfh=xk when fh=k, and with xf•=Σh|f xfh.

For the family with observed marker category 212 (see table 1), we have x21•=1 and xf•=0 when f≠21. Substitution of these values and the corresponding vh|f into equation (12) results in the following nonfounder score for this family: SNF=w212- 1/2(w212+w211)= 1/2(w212-w211). For the phenotype values (a1,a2,a3)=(1,-ψ,1), w212 and w211 are shown in rows 2 and 3 (B1B1,B1B2,B1B1 [212] and B1B1,B1B2,B1B2 [211], respectively) of table 1. With these values, SNF=2-ψc1- 1/2(3-2ψc1+c1)= 1/2(1-c1). Marker category 212 denotes transmission of allele B1 from heterozygous parent 2 to the affected offspring. If this parent had transmitted allele B2 to the affected offspring, resulting in marker category 211, then the NFS would be SNF= 1/2(c1-1). Thus, if c1=1/2, then SNF=1/4 or -1/4, depending on whether parent 2 transmits allele B1 or B2 to the offspring. Indeed, for a single marker in nuclear families with known genotypes and c1=1/2, SNF is just the TDT.
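A brief Python sketch of this calculation follows, for a fully typed trio in which the founder category is known. It computes the nonfounder score as the observed weight minus its Mendelian expectation given the parental genotypes, which is what equation (12) reduces to in this case according to the worked example. The function names and the numerical inputs are illustrative, not from the article.

```python
def mendel(f1, f2):
    """Mendelian P(h | f1, f2): offspring B1 count given parental B1 counts."""
    t1, t2 = f1 / 2, f2 / 2  # chance that each parent transmits B1
    probs = {2: t1 * t2,
             1: t1 * (1 - t2) + (1 - t1) * t2,
             0: (1 - t1) * (1 - t2)}
    return {h: v for h, v in probs.items() if v > 0}

def nonfounder_score(f1, f2, h, a, c1):
    """S_NF = w_fh - sum over h' of v(h'|f) * w_fh' for a fully typed trio."""
    count = (0.0, c1, 1.0)  # D1 counts for 0, 1, 2 copies of B1 = D1

    def w(child):
        return a[0] * count[f1] + a[1] * count[f2] + a[2] * count[child]

    expected = sum(v * w(child) for child, v in mendel(f1, f2).items())
    return w(h) - expected

psi = 0.1  # illustrative disease odds
transmit_b1 = nonfounder_score(2, 1, 2, a=(1.0, -psi, 1.0), c1=0.5)
transmit_b2 = nonfounder_score(2, 1, 1, a=(1.0, -psi, 1.0), c1=0.5)
# Parental phenotype values cancel, so setting them to 0 changes nothing.
no_parent_phenotypes = nonfounder_score(2, 1, 2, a=(0.0, 0.0, 1.0), c1=0.5)

if __name__ == "__main__":
    print(transmit_b1, transmit_b2, no_parent_phenotypes)
```

With c1=1/2 the two outcomes are 1/4 and -1/4, matching the text, and an offspring with unknown phenotype (a3=0) yields a score of 0.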

Notice that the parental-phenotype values do not contribute to SNF. In general, founders contribute only their genotypes to the NFS, and these founder genotypes are used only to compute the null expectations of the nonfounder genotypes. In contrast, phenotypes of nonfounders play a central role in the statistic. If, for example, the phenotype of the single offspring in this family were unknown, then SNF would vanish. The genotypes of nonfounders with unknown phenotype are useful for reconstruction of the family’s marker category, but, because they do not enter the weights wk, they are not evaluated directly in the NFS.

When the genotype of parent 1 is unknown, the nonfounder score is as follows:

graphic file with name AJHGv66p1328df15.jpg

As this example illustrates, when the founder category of the family is known, SNF does not involve the marker parameters. However, when the founder category is not known, SNF depends on the marker parameters through the founder-category probabilities uf, which, in formula (6), determine the rk for the xk.

The variance of SNF, conditional on the vector x=(x1•,...,xF•) of probabilities for the F founder categories, is as follows:

graphic file with name AJHGv66p1328df16.jpg

We now turn to the founder score, which is as follows:

graphic file with name AJHGv66p1328df17.jpg

In this case,

graphic file with name AJHGv66p1328df18.jpg

and Inline graphic is the expected D1 count for the ℓth member in a family whose founders have marker category f. Consider again the family with category 212 (see table 1), so that x21•=1 and xf•=0 when f≠21. For this family, Inline graphic. From equation (15), we see that, with phenotype values (a1,a2,a3)=(1,-ψ,1),

graphic file with name AJHGv66p1328df19.jpg

where we have used the expected D1 counts Inline graphic, and Inline graphic. Two observations are noteworthy in this case: (a) the expected D1 counts for the founder parents are just their observed D1 counts, whereas the D1 count of the offspring is his or her expected count, given those of his or her parents, and (b) if, for example, parent 1 has a known genotype but an unknown phenotype—so that a1=0—then the genotype of this parent still contributes to the founder score through its contribution to the expected D1 count of his or her affected offspring. In contrast, if the offspring has a known genotype but an unknown phenotype, his or her genotype is used only to reconstruct the genotypes of his or her parents.

The variance of SF is as follows:

graphic file with name AJHGv66p1328df20.jpg

The variance of the total score, given by equation (11), can be written as follows:

graphic file with name AJHGv66p1328df21.jpg

where E[VNF] is given by equation (13), with xf replaced with uf.

Score Statistic for N Families

For a collection of N independent families from a population that is homogeneous with respect to disease risk, the nonfounder, founder, and total score statistics are, respectively,

graphic file with name AJHGv66p1328df22.jpg

In this case, for the νth family, ν=1,...,N, VNFν, VFν, and Vν are given by (13), (16), and (17), and the hat denotes an estimate. (In these expressions, the unknown parameters uf must be replaced with estimates.) When the founder-genotype probabilities are specified correctly, the statistics 𝒯, 𝒯NF, and 𝒯F in (18) each have, asymptotically as N→∞, a standard Gaussian distribution under the null hypothesis of no effect at locus t.

The score statistics for N families from a population that is heterogeneous with respect to disease risk are the weighted averages of statistics for the I homogeneous subpopulations. For example, the total score statistic is Inline graphic, where, for subpopulation i, 𝒯i is the total score statistic and where the weight εi depends on its disease risk in the population πi, as described in the Appendix.

Some invariance properties of the score statistics are worth noting. First, a family’s founder score SFν in (14) is unchanged if all weights Inline graphic are replaced with Inline graphic, where ξν is an arbitrary family-specific constant, since Inline graphic. Second, the nonfounder score SNFν in (12) is unchanged when all weights wνfh are replaced by w*νfh=wνfh+ξνf, where ξνf is an arbitrary family-specific constant that is independent of its nonfounder category h. In particular, since the conditional D1 counts of founders depend only on their own category f and not on the category h of their descendants, the summands of wνfh corresponding to founders can be ignored. Therefore, neither the D1 counts nor the phenotypes of founders contribute to the NFS. Finally, the weights in either the founder score or the nonfounder score can all be multiplied by any nonzero constant ξ, without changing the standardized test statistics. We shall use these invariance properties to simplify the test statistics in the examples in this article and in the companion article (Tu et al. 2000).

We conclude this section with a brief discussion of the bias (i.e., incorrect asymptotic type I–error probabilities) that could arise if assumption A.1 fails. This assumption states that, given the family’s genotypes at a test locus t, its phenotypes and marker genotypes are independent. It would fail if one or more of the markers were associated with any (genetic or nongenetic) risk factors for the disease. In this case, the likelihood at locus t, under the null hypothesis β=0, would not be the null probability of the markers in the general population, since the families were ascertained for multiple cases of disease. Consequently, the FS, which compares the genotype frequencies among founders in these multiple-case families with those in some reference population, would not have a standard Gaussian distribution. Thus, the FS can be biased by association between markers and risk factors that do not segregate with disease within families. The NFS, in contrast, is conditioned on the observed or inferred distribution of marker genotypes in the founders of the ascertained families. This conditioning assures that the NFS has the correct asymptotic null distribution in the absence of linkage to a locus that segregates with disease within families, provided that, when some founder genotypes are unobserved, the distribution of founder genotypes is specified correctly. While departures from random mating, HWE, etc., in the founders could, in principle, affect the asymptotic null distribution of the NFS, it is difficult to envisage examples involving serious bias. Most likely, divergent results for NFS and FS at a given locus t, with the former being nonsignificant and the latter significant, would suggest marker-disease association in the absence of linkage. Conversely, if the NFS were significant but the FS were nonsignificant, then this would suggest that the markers are linked to, but are in weak gametic disequilibrium with, a disease locus t.

Application to Single Diallelic Polymorphisms in Case Series and Case-Control Studies

We illustrate the score statistics by applying them to very simple families, such as nuclear families and “families” consisting of single unrelated individuals. We regard a single individual, who is either affected (a case) or unaffected (a control), as a founder of his or her “family.” The scores of nuclear families and of unrelated cases and controls are summed to form the test statistics. The FS TF evaluates the genotypes of cases, controls, and parents in the nuclear families, whereas the NFS TNF evaluates just the genotypes of the offspring in the nuclear families, comparing them with the Mendelian expectation.

Suppose that individuals are typed at a single diallelic marker with alleles B1 and B2. The marker locus may be distinct from the test locus t, but, if this is so, then the alleles at the two loci are assumed to be in gametic disequilibrium in the population. Let P(DiBj) denote the probability that a random chromosome from the population carries the haplotype DiBj, i, j=1, 2. Table 2 gives these probabilities in terms of the marginal probabilities P(Di) and P(Bj) and the disequilibrium coefficient δ=P(D1B1)-P(D1)P(B1). The probability p1 that a random chromosome containing allele B1 also contains allele D1 is thus

$$p_1 = \frac{P(D_1B_1)}{P(B_1)} = P(D_1) + \frac{\delta}{P(B_1)}.$$

The analogous probability p2, for a chromosome containing B2, is

$$p_2 = \frac{P(D_1B_2)}{P(B_2)} = P(D_1) - \frac{\delta}{P(B_2)}.$$

For convenience, we introduce a standardized disequilibrium coefficient Δ, defined as Δ=p1-p2=δ/[P(B1)P(B2)]. By assumption, Δ≠0.

Table 2.

Haplotype Probabilities at Test Locus t and Marker Locus s

| Disease Allele | Haplotype Probability at Marker Allele B1 | Haplotype Probability at Marker Allele B2 | Total |
|---|---|---|---|
| D1 | P(B1)P(D1)+δ | P(B2)P(D1)-δ | P(D1) |
| D2 | P(B1)P(D2)-δ | P(B2)P(D2)+δ | P(D2) |
| Total | P(B1) | P(B2) | 1 |
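As an illustration of how the entries of table 2 determine the conditional probabilities p1 and p2, the following sketch computes the four haplotype probabilities from P(D1), P(B1), and δ and then divides by the marker-allele frequencies. The function names and the example values are hypothetical and are not taken from the article.

```python
def haplotype_probabilities(p_d1, p_b1, delta):
    """Return the four haplotype probabilities of table 2 as a dict."""
    p_d2, p_b2 = 1.0 - p_d1, 1.0 - p_b1
    probs = {
        ("D1", "B1"): p_b1 * p_d1 + delta,
        ("D1", "B2"): p_b2 * p_d1 - delta,
        ("D2", "B1"): p_b1 * p_d2 - delta,
        ("D2", "B2"): p_b2 * p_d2 + delta,
    }
    # A small tolerance guards against rounding at the boundary of the allowed range.
    if min(probs.values()) < -1e-12:
        raise ValueError("delta is outside the range allowed by the allele frequencies")
    return probs

def conditional_d1_probabilities(p_d1, p_b1, delta):
    """Return (p1, p2): the probability of D1 on a chromosome carrying B1 or B2."""
    probs = haplotype_probabilities(p_d1, p_b1, delta)
    p1 = probs[("D1", "B1")] / p_b1
    p2 = probs[("D1", "B2")] / (1.0 - p_b1)
    return p1, p2

# Hypothetical example: disease-allele frequency .2, equally frequent marker alleles.
p1, p2 = conditional_d1_probabilities(p_d1=0.2, p_b1=0.5, delta=0.025)
print(p1, p2, p1 - p2)  # approximately 0.25, 0.15, and 0.10
```

The difference p1−p2 printed at the end equals δ/[P(B1)P(B2)], which is the standardized coefficient Δ.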

In the companion article (Tu et al. 2000), we applied the score statistics to nuclear families for which markers at the putative disease locus are observed but for which the family marker category typically is unobserved. In contrast, in the present study, we apply them to families consisting of single, unrelated individuals in which the marker category is observed but in which the marker may not be the trait locus. Since these latter families contain only founders, the nonfounder score is 0, and the total score for an individual reduces to the founder score SF of (14). There are K=F=3 marker categories—namely, the three genotypes B2B2, B1B2, and B1B1—which are indexed as f=0,1,2, respectively. Their null probabilities are rf=uf, f=0,1,2.

Case Series

In case series, the marker genotypes of a sample of N cases are compared with those in some reference population. The score statistic is given by (14), with the weight for a case with genotype f given as Inline graphic, as shown in table 3. The invariance properties of the test statistic, which were described at the end of the previous section, allow us to standardize the weights so that w0=0 and w2=1. Note, however, that this standardization requires replacement of Inline graphic with Inline graphic, where Inline graphic. Since Inline graphic is proportional to the disequilibrium coefficient Δ, the standardization is possible only when Δ≠0. Indeed, SF is itself proportional to Δ. Thus, if Δ=0 (that is, if the marker were in linkage equilibrium with the test locus), then the distribution of the FS would be degenerate at 0.

Table 3.

Probability Inline graphic That a Founder with f Copies of Marker Allele B1 Carries i Copies of a Nearby Disease Allele

| f | i=0 | i=1 | i=2 | Weight | Standardized Weightᵃ |
|---|---|---|---|---|---|
| 2 | q1² | 2p1q1 | p1² | 2c1p1+ep1² ᵇ | 1 |
| 1 | q1q2 | p1q2+q1p2 | p1p2 | c1(p1+p2)+ep1p2 | λᶜ |
| 0 | q2² | 2p2q2 | p2² | 2c1p2+ep2² | 0 |

ᵃ Inline graphic, if and only if Δ≠0, where Δ=p1-p2 is the standardized disequilibrium coefficient.

ᵇ e=1-2c1.

ᶜ λ=(c1+ep2)/[2c1+e(p1+p2)].

In table 3, we see that the weight attached to B1B2 heterozygotes, when B1B1 and B2B2 homozygotes receive weights of 1 and 0, respectively, is λ, which is defined as follows:

$$\lambda = \frac{c_1 + e\,p_2}{2c_1 + e\,(p_1 + p_2)}. \tag{19}$$

Here e=1-2c1. We use the symbol SFν to denote the founder score for the νth case. Substitution of the weights (19) into equation (14) results in

$$S_{F\nu} = (x_{\nu 2} - u_2) + \lambda\,(x_{\nu 1} - u_1), \tag{20}$$

where uf is the frequency of genotype f in the reference population and where xνf=1 if the case’s genotype is f and xνf=0 otherwise, f=0,1,2. Summing the individual scores (20) over the N cases shows that the score is Σν=1,...,N SFν=N2+λN1-N(u2+λu1), where Nf is the number of cases with genotype f. The founder (and total) score statistic is

$$\mathcal{T} = \frac{N_2 + \lambda N_1 - N\,(u_2 + \lambda u_1)}{\sigma\sqrt{N}}, \tag{21}$$

where, from (16),

$$\sigma^2 = u_2 + \lambda^2 u_1 - (u_2 + \lambda u_1)^2. \tag{22}$$

To use 𝒯, we must specify the weight λ for heterozygotes and the genotype frequencies u0,u1,u2 in the reference population.
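The case-series computation can be sketched as follows. The sketch assumes that the statistic is the summed score N2+λN1−N(u2+λu1) divided by its null standard deviation, with the per-case variance taken to be the null variance of the standardized weight (0, λ, or 1); the function name, the example counts, and the use of Hardy-Weinberg reference frequencies are hypothetical choices made only for illustration.

```python
import math

def case_series_statistic(case_counts, u, lam=0.5):
    """Standardized case-series score.

    case_counts = (N0, N1, N2): numbers of cases with 0, 1, 2 copies of B1.
    u = (u0, u1, u2): genotype frequencies in the reference population.
    lam: weight for heterozygotes (homozygotes have weights 0 and 1).
    """
    n0, n1, n2 = case_counts
    u0, u1, u2 = u
    n = n0 + n1 + n2
    expected_weight = u2 + lam * u1                     # null mean of the weight
    score = n2 + lam * n1 - n * expected_weight         # summed founder scores
    sigma2 = u2 + lam ** 2 * u1 - expected_weight ** 2  # null variance of the weight
    return score / math.sqrt(n * sigma2)

# Hypothetical example: reference frequencies under Hardy-Weinberg proportions
# with B1 allele frequency 0.4 (an assumption made only for this illustration).
b1_freq = 0.4
u_hwe = ((1 - b1_freq) ** 2, 2 * b1_freq * (1 - b1_freq), b1_freq ** 2)
t = case_series_statistic((30, 50, 20), u_hwe, lam=0.5)
print(round(t, 3))  # about 1.443
```

With λ=1/2 and Hardy-Weinberg reference frequencies, the value agrees with the standardized difference between observed and expected B1 allele counts, here (90−80)/√48.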

Case-Control Studies

Case-control studies are based on comparison of the marker genotypes of, for example, N𝒜 cases with those of N𝒰 controls. The likelihood for the data is the product of the probabilities of case and control marker data, and the corresponding score is the sum of the score for the case data plus the score for the control data. The resulting test statistic can be written as

$$\mathcal{T} = \frac{\mathcal{T}_{\mathcal{A}} - \psi\varphi\,\mathcal{T}_{\mathcal{U}}}{\sqrt{1 + \psi^2\varphi^2}}, \tag{23}$$

where 𝒯𝒜 and 𝒯𝒰 are the test statistics (21) applied to cases and controls, respectively; where ψ is the phenotype value for controls in equation (9); and where φ²=N𝒰/N𝒜 is the control:case ratio.

To use 𝒯, we must specify the heterozygote weight λ of equation (19), the phenotype value ψ, and the null marker genotype probabilities u0,u1,u2. With the choice of ψ=N𝒜/N𝒰, equation (23) becomes

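$$\mathcal{T} = \frac{N_{\mathcal{A}2} + \lambda N_{\mathcal{A}1} - (N_{\mathcal{A}}/N_{\mathcal{U}})\,(N_{\mathcal{U}2} + \lambda N_{\mathcal{U}1})}{\sigma\sqrt{N_{\mathcal{A}}\,(1 + N_{\mathcal{A}}/N_{\mathcal{U}})}}, \tag{24}$$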

where N𝒜g and N𝒰g are the numbers of cases and controls, respectively, with genotype g=0,1,2 and where σ² is given by (22). This choice for ψ eliminates the null genotype frequencies u0,u1,u2 from the numerator of 𝒯; however, they appear in σ. They can be estimated from the group of controls as ûg=N𝒰g/N𝒰, g=0,1,2. The resulting test statistic 𝒯 is the standardized difference in expected D1 counts between cases and controls. For c1=1/2, 𝒯 is the statistic proposed by Barcellos et al. (1997) and Risch and Teng (1998), for testing of allelic association between disease and marker in pooled DNA from cases and controls. In this case, 𝒯 also is the score statistic for the traditional linear logistic regression of unrelated cases and controls.
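A minimal sketch of the case-control computation with ψ=N𝒜/N𝒰 follows. It rests on assumptions that are not spelled out in the text above: the case score enters with weight 1 and the control score with weight −ψ, the variance of the combined score is σ²N𝒜(1+N𝒜/N𝒰), and the null genotype frequencies are estimated from the controls. The function name and the example counts are hypothetical.

```python
import math

def case_control_statistic(case_counts, control_counts, lam=0.5):
    """Standardized case-control score with psi = N_A / N_U.

    case_counts and control_counts are (N0, N1, N2) genotype counts,
    indexed by the number of copies of marker allele B1.
    """
    a0, a1, a2 = case_counts
    k0, k1, k2 = control_counts
    n_a = a0 + a1 + a2
    n_u = k0 + k1 + k2
    psi = n_a / n_u
    # Null genotype frequencies estimated from the controls (assumption).
    u1, u2 = k1 / n_u, k2 / n_u
    sigma2 = u2 + lam ** 2 * u1 - (u2 + lam * u1) ** 2
    # The reference frequencies cancel from the numerator for this choice of psi.
    numerator = (a2 + lam * a1) - psi * (k2 + lam * k1)
    return numerator / math.sqrt(sigma2 * n_a * (1.0 + psi))

# Hypothetical example with 100 cases and 100 controls.
t_cc = case_control_statistic((30, 50, 20), (40, 45, 15), lam=0.5)
print(round(t_cc, 3))  # about 1.519
```

When cases and controls have identical genotype counts, the numerator vanishes and the statistic is 0, as expected under the null hypothesis.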

Heterozygote Weight λ

The optimal weight λ for heterozygotes varies according to the penetrance model and the extent of disequilibrium between trait and marker loci, as specified by the probabilities p1 and p2. In practice, these probabilities seldom are known. However, it is evident from (19) that, for c1=1/2, the weight is λ=1/2, independent of p1 and p2. For these models, 𝒯 is just the standardized difference between observed and expected B1 counts, regardless of the extent of gametic disequilibrium between the two loci (provided, of course, that some disequilibrium exists).

When trait and marker loci coincide (so that, for example, p2=0, p1=1), then λ=c1. In this case, 𝒯 is the standardized difference between observed and expected B1 counts in cases and controls. If trait and marker loci do not coincide and if c1≠ 1/2, then the extent of disequilibrium between trait and marker loci can have a large influence on the heterozygote weight λ. For a dominant model (c1=1), equation (19) shows that λ=q2/(q1+q2), which is closer to 1/2 than to 1 when the disease allele is rare. For a recessive model (c1=0), equation (19) shows λ=p2/(p1+p2), which varies from 0 (when the disease allele and allele B1 are in complete disequilibrium) to 1 (when the disease allele and allele B2 are in complete disequilibrium).

Suppose, for example, that the frequency of the disease allele is .01 and that the two marker alleles are equally likely. Then, for maximum disequilibrium between disease and marker loci, λ=0 for a recessive disease gene and λ=.505 for a dominant gene. If the disequilibrium coefficient δ equals one-half of its maximum value, then λ=.25 for a recessive gene and .503 for a dominant gene. Thus, when disease and marker loci do not coincide, the optimal weight λ for heterozygotes is not 0 for a recessive gene and 1 for a dominant gene. For rare disease alleles, it is always ∼1/2 when the disease allele is dominant. For common disease alleles, the optimal weight can strongly depend on the tightness of disequilibrium, regardless of the mode of inheritance. For instance, when the disease-allele frequency is .2, then, under maximum disequilibrium, λ=0 for a recessive model and λ=.63 for a dominant model. However, for loose disequilibrium (δ equal to one-fourth of its maximum value), λ=.38 for a recessive model and λ=.53 for a dominant model. Lack of knowledge concerning both the extent of gametic disequilibrium and the correct penetrance model suggests use of the value λ=1/2 in the test statistic (21). There is a need for investigation of the power associated with use of this strategy, for a range of situations.
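The dependence of the heterozygote weight on the penetrance model and on the extent of disequilibrium can be explored with a short sketch. It uses the closed form λ=(c1+e·p2)/[2c1+e(p1+p2)], with e=1−2c1, which follows from standardizing the table 3 weights, and it assumes positive disequilibrium, so that the maximum of δ is min[P(D1)P(B2), P(D2)P(B1)]. The function names are hypothetical.

```python
def heterozygote_weight(p1, p2, c1):
    """Weight for B1B2 heterozygotes when B2B2 and B1B1 have weights 0 and 1."""
    e = 1.0 - 2.0 * c1
    return (c1 + e * p2) / (2.0 * c1 + e * (p1 + p2))

def weight_for_scenario(p_d1, p_b1, fraction_of_max, c1):
    """Heterozygote weight when delta is a given fraction of its maximum value."""
    p_b2 = 1.0 - p_b1
    # Largest positive delta that keeps all haplotype probabilities nonnegative.
    delta_max = min(p_d1 * p_b2, (1.0 - p_d1) * p_b1)
    delta = fraction_of_max * delta_max
    p1 = p_d1 + delta / p_b1   # P(D1 | B1)
    p2 = p_d1 - delta / p_b2   # P(D1 | B2)
    return heterozygote_weight(p1, p2, c1)

# Recessive (c1=0) and dominant (c1=1) models, equally frequent marker alleles.
for p_d1 in (0.01, 0.2):
    for fraction in (1.0, 0.5, 0.25):
        recessive = weight_for_scenario(p_d1, 0.5, fraction, c1=0.0)
        dominant = weight_for_scenario(p_d1, 0.5, fraction, c1=1.0)
        print(p_d1, fraction, round(recessive, 3), round(dominant, 3))
```

For c1=1/2 the function returns 1/2 whatever the values of p1 and p2, which is the property that motivates the default choice λ=1/2.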

Discussion

We have described a likelihood-based score statistic for detection of disease genes by evaluation of phenotype-marker associations and transmission disequilibrium within families. The score statistic decomposes into two components, the NFS and the FS. These components represent two types of deviation between observed genotypes and their expectations under the null hypothesis. Each will have a large absolute value if the chromosomal region contains a disease locus and if the families have been selected to contain affected individuals. The NFS represents the deviation between observed and expected marker alleles in nonfounders, given the founder genotypes. It reflects transmission disequilibrium from parent to affected and unaffected offspring, and it will be large because the disease locus is linked to the markers. When the marker data consist of genotypes at a single locus, both the nonfounder and founder scores are proportional to the standardized disequilibrium coefficient Δ between marker and disease loci. Thus, the statistics should be used only to evaluate the etiologic relevance of loci t that are in linkage disequilibrium with the marker. In addition, the nonfounder score is proportional to 1-2θ, where θ is the probability of recombination between the marker and disease loci. Thus, if the two loci were unlinked (θ=1/2), then the nonfounder score would vanish identically, and the marker genotypes of nonfounders would not contribute to the total test statistic. If, on the other hand, θ=0, which would hold when we wish to test the etiologic relevance of the marker itself, then founders and nonfounders contribute equally to the total test statistic. In this sense, the genetic distance between marker and disease loci, as measured by their recombination fraction θ, determines the relative contributions of founder and nonfounder genotypes to the total test statistic.

The FS evaluates association between disease and marker alleles in family founders, and it extends the test statistics that are currently used for the analysis of case-control data. FS reflects deviation between the observed or inferred frequencies of marker genotypes in the founders and those that are expected in the general populations to which they belong. It measures association between the disease and the marker locus, and it will be large when there is gametic disequilibrium between disease and marker loci among the founder chromosomes. If the null founder-genotype frequencies can be estimated from independent data, then the FS provides information supporting or refuting the null hypothesis derived from the (observed or inferred) marker genotypes of the founders, even when their phenotypes are unknown. However, the FS also can be large because of population stratification or inappropriate assumptions (e.g., random mating or Hardy-Weinberg proportions) on the distribution of founder genotypes. Thus, the FS tests for association, whereas the NFS tests for linkage as the cause of the association.

These likelihood-based statistics have certain strengths and limitations. Because they are model-based, they clarify the role of the underlying genetic model in the determination of the weights for affected versus unaffected individuals and, when dealing with single diallelic markers, the weights for heterozygotes versus homozygotes. When the model is correctly specified, the statistics enjoy certain local asymptotic optimality properties (Cox and Hinkley 1974). However, the model requires assumptions about the distribution of the founder genotypes, and the effects on bias and power resulting from departures from these assumptions have not yet been fully evaluated. There is a need to assess the impact on bias and power loss associated with misspecification of the distribution of founder genotypes. One might expect that this impact would be small when the founder genotypes are known or can be inferred with some confidence.

One strength of the statistics is that they apply to any type of family structure, including a “family” consisting of a single individual, and that they thus eliminate the need for many different ad hoc tests. In addition, the approach provides a strategy for dealing with missing phenotypes and genotypes for key family members, such as parents. Also, when it is applied to families in which the phenotypes of some founders are known, the FS allows comparison of genotypes of affected and unaffected founders. Simultaneous evaluation of the FS (which may be biased by population stratification) and the NFS (which is less vulnerable to such bias) can provide insight into the etiologic relevance of observed associations.

The likelihood-based framework presented in this study stimulates consideration of several potentially useful extensions. First, the likelihood could be extended to accommodate censored survival data rather than binary disease outcomes. Second, the likelihood could be modified to include nongenetic covariates, in the manner considered by Self et al. (1991). In fact, the likelihood function proposed by Self et al. is a special case of the nonfounder component of the likelihood considered in the present study. Inclusion of nongenetic covariates would lead to score statistics that have been adjusted for the effects of the covariates. In addition, joint maximization of the likelihood, with respect to regression coefficients for both genetic markers and nongenetic factors, would allow for multivariate estimation of genotype relative risks (Schaid and Sommer 1993; Schaid and Li 1997; Witte et al. 1999).

If it becomes feasible to produce reliable estimations of both intermarker genetic distances and population-specific intermarker disequilibrium coefficients, then association studies will benefit from simultaneous consideration of multiple markers that may flank a disease locus. The systematic framework presented here should prove useful for such studies.

Acknowledgment

This research was supported by National Institutes of Health grant R35-CA47448. The authors thank Joseph B. Keller and the reviewers, for helpful comments on an earlier version of this manuscript.

Appendix

We derive the score statistic for a family with phenotype y=(y1,...,ym), where m is the number of members with known phenotype. Suppose that there are K categories of marker genotypes. Let rk denote the family’s null probability of having category k, and let xk=P(k|ℳ) denote its conditional probability of having category k, given its observed marker data ℳ. The likelihood (3) for the family can therefore be written as follows:

graphic file with name AJHGv66p1328df29.jpg

P(g|k) is the probability that a family with marker category k has genotype g at the disease locus t. Also, Θ is a vector of parameters that includes the penetrance parameters α and β, any unknown marker parameters in the probabilities rk, and the test-locus-vs.-marker parameters. Let Inline graphic be a null value of Θ (that is, one for which β=0 and for which the remaining parameters are specified under the null hypothesis). By differentiating the logarithm of (A1) with respect to β and evaluating the derivative at Inline graphic, we find, after some algebraic calculations, that the family’s score is as follows:

graphic file with name AJHGv66p1328df30.jpg

In this instance, Inline graphic is the logarithmic derivative of the null disease prevalence in the population, and wk is a nonnegative constant, as described in equations (8) and (9).

The null mean of S is 0, which follows from likelihood theory (Cox and Hinkley 1974). This can also be seen from equation (7) and from the fact that the null mean of the random variable

graphic file with name AJHGv66p1328df31.jpg

is

graphic file with name AJHGv66p1328df32.jpg

Inline graphic denotes summation over all possible realizations of the observed marker data ℳ.

The asymptotic variance of the score εS (Cox and Hinkley 1974) is

graphic file with name AJHGv66p1328df33.jpg

For N families from a population that is homogeneous with respect to disease risk π, the score statistic is Inline graphic. If the families are sampled from a heterogeneous population consisting of I identified subpopulations with disease risks πi, i=1,...,I, then Inline graphic, where Inline graphic and where 𝒯i is the score statistic for the subset of families from population i. In the present study, we assume that the population is homogeneous, so that εi≡ε, i=1,...,I, and we may take ε=1 without loss of generality.

Note added in proof.—Further discussion of likelihood-based methods analogous to the methods presented here can be found in a study by Clayton (1999).

References

  1. Barcellos LF, Klitz W, Field LL, Tobias R, Bowcock AM, Wilson R, Nelson MP (1997) Association mapping of disease loci, by use of a pooled DNA genomic screen. Am J Hum Genet 61:734–747
  2. Clayton D (1999) A generalization of the transmission/disequilibrium test for uncertain-haplotype transmission. Am J Hum Genet 65:1170–1177
  3. Cox DR, Hinkley DV (1974) Theoretical statistics. Chapman and Hall, London
  4. Ewens WJ, Spielman RS (1995) The transmission/disequilibrium test: history, subdivision, and admixture. Am J Hum Genet 57:455–464
  5. Knapp M, Seuchter SA, Bauer MP (1993) The haplotype-relative-risk (HRR) method for analysis of association in nuclear families. Am J Hum Genet 52:1085–1093
  6. Little RJA, Rubin DB (1987) Statistical analysis with missing data. John Wiley & Sons, New York
  7. Martin RB, Alda M, MacLean CJ (1998) Parental genotype reconstruction: applications of haplotype relative risk to incomplete parental data. Genet Epidemiol 15:471–490
  8. McCullagh P, Nelder JA (1989) Generalized linear models, 2d ed. Chapman and Hall, London
  9. Ott J (1989) Statistical properties of the haplotype relative risk. Genet Epidemiol 6:127–130
  10. Parsian A, Todd RD, Devor EJ, O’Malley KL, Suarez BK, Reich T, Cloninger CR (1991) Alcoholism and alleles of the human D2 dopamine receptor locus: studies of association and linkage. Arch Gen Psychiatry 48:655–663
  11. Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases. Science 273:1516–1517
  12. Risch N, Teng J (1998) The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases. I. DNA pooling. Genome Res 8:1273–1288
  13. Rothman N, Caporaso NE, Wacholder S, Garcia-Closas M, Lubin JH, Marcus P, Hoover RE, et al (1999) Evaluation of interactions between environmental exposures and common genetic polymorphisms: a population-based epidemiologic perspective. Proc Am Assoc Cancer Res 40:762–763
  14. Schaid DJ (1996) General score tests for associations of genetic markers with disease using cases and their parents. Genet Epidemiol 13:423–449
  15. Schaid DJ, Li H (1997) Genotype relative risks and association tests for nuclear families with missing parental data. Genet Epidemiol 14:1113–1118
  16. Schaid DJ, Sommer SS (1993) Genotype relative risks: methods for design and analysis of candidate-gene association studies. Am J Hum Genet 53:1114–1126
  17. Schaid DJ, Sommer SS (1994) Comparison of statistics for candidate-gene association studies using cases and parents. Am J Hum Genet 55:402–409
  18. Self SG, Longton G, Kopecky KJ, Liang K-Y (1991) On estimating HLA/disease association with application to a study of aplastic anemia. Biometrics 47:53–62
  19. Spielman RS, Ewens WJ (1996) The TDT and other family-based tests for linkage disequilibrium and association. Am J Hum Genet 59:983–989
  20. Spielman RS, McGinnis RE, Ewens WJ (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 52:506–516
  21. Terwilliger JD, Ott J (1992) A haplotype-based “haplotype relative risk” approach to detecting allelic associations. Hum Hered 42:337–346
  22. Thompson EA (1986) Pedigree analysis in human genetics. Johns Hopkins University Press, Baltimore
  23. Tu I-P, Balise RR, Whittemore AS (2000) Detection of disease genes by use of family data. II. Application to nuclear families. Am J Hum Genet 66:1341–1350 (in this issue)
  24. Whittaker JC, Lewis CM (1998) Effect of family structure on linkage tests using allelic association. Am J Hum Genet 63:889–897
  25. Witte JS, Gauderman WJ, Thomas DC (1999) Asymptotic bias and efficiency in case-control studies of candidate genes and gene-environment interactions: basic family designs. Am J Epidemiol 149:693–705

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics
