Author manuscript; available in PMC: 2015 Mar 1.
Published in final edited form as: Theor Popul Biol. 2013 Dec 7;92:51–54. doi: 10.1016/j.tpb.2013.11.004

Within a Sample from a Population, the Distribution of the Number of Descendants of a Subsample’s Most Recent Common Ancestor

John L Spouge
PMCID: PMC3944932  NIHMSID: NIHMS547351  PMID: 24321308

Abstract

Sample n individuals uniformly at random from a population, and then sample m individuals uniformly at random from the sample. Consider the most recent common ancestor (MRCA) of the subsample of m individuals. Let the subsample MRCA have j descendants in the sample (m≤ j ≤ n). Under a Moran or coalescent model (and therefore under many other models), the probability that j = n is known. In this case, the subsample MRCA is an ancestor of every sampled individual, and the subsample and sample MRCAs are identical. The probability that j = m is also known. In this case, the subsample MRCA is an ancestor of no sampled individual outside the subsample. This article derives the complete distribution of j, enabling inferences from the corresponding p-value. The text presents hypothetical statistical applications pertinent to taxonomy (the gene flow between Neanderthals and anatomically modern humans) and medicine (the association of genetic markers with disease).

Keywords: Most Recent Common Ancestor of a Subsample, Coalescent Theory

1 Introduction

Consider the following hypothetical situation. Within a sample of n individuals, a subsample of m individuals share a morphological character. Upon genetic analysis, the m individuals share some genetic characters with a further j – m ≥ 0 individuals within the sample. One might desire a p-value to test whether j – m is “too small”, i.e., to test whether the morphological character is too concentrated among individuals with the genetic characters to reflect chance alone. This article derives a p-value by giving the sampling distribution of j. Depending on its context, a small p-value might suggest, among other possibilities, that gene flow between the subpopulations represented by the subsample and its complement within the sample is not free (i.e., that the mathematical assumptions underlying the coalescent are violated), or that the genetic characters have a causal influence on the morphological character. The Discussion demonstrates how the p-value might be relevant to rejecting the hypothesis of free gene flow between Neanderthals and anatomically modern humans (Krings et al. 1997, Nordborg 1998, Krings et al. 2000) or to associating a genetic disease or phenotype with a set of DNA markers necessary but not sufficient for it.

To determine the distribution corresponding to the p-value, consider Kingman’s coalescent (Kingman 1982a, Kingman 1982b), where n individuals are sampled uniformly at random at time $t_0$ from a large population. Kingman examined a haploid population, but coalescent models can also apply to sexual populations (Nordborg 2004, Pollak 2004, Wakeley et al. 2012). A pure death process $D_t$ (t ≥ 0) counts the ancestors of the sample at prior times $t_0 - t$. The process $D_t$ transitions through the states $n \to n-1 \to \cdots \to 2 \to 1$, with the state $D_t = k$ (k = 2, …, n) having a sojourn time $\tau_k$ exponentially distributed with parameter $d_k = \frac{1}{2}k(k-1)$ and with the state $D_t = 1$ absorbing.
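As an illustration of the death process just described, the following Python sketch draws each sojourn time from an exponential distribution with rate k(k–1)/2 and sums the draws to obtain the time to the sample MRCA. The function names, the seed, and the choice n = 10 are assumptions made for illustration only; the comparison value 2(1 – 1/n) is the standard result obtained by summing the mean sojourn times.

```python
import random

def expected_tmrca(n):
    """Sum of the mean sojourn times 1/d_k, with d_k = k(k-1)/2, for k = 2..n."""
    return sum(2.0 / (k * (k - 1)) for k in range(2, n + 1))

def simulate_tmrca(n, rng):
    """Draw one realisation of the total time spent in states n, n-1, ..., 2."""
    return sum(rng.expovariate(k * (k - 1) / 2.0) for k in range(2, n + 1))

rng = random.Random(0)            # fixed seed, for reproducibility only
n, trials = 10, 20000             # illustrative values
sample_mean = sum(simulate_tmrca(n, rng) for _ in range(trials)) / trials
print(expected_tmrca(n), sample_mean)
```

Times are in the coalescent time units implied by the rates k(k–1)/2; the simulated mean should lie close to the analytic sum.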

The sample ancestry can be described using $\mathcal{E}_n$, the set of all equivalence relations on the n individuals. Consider the Markov chain $\mathcal{R}_n \to \mathcal{R}_{n-1} \to \cdots \to \mathcal{R}_2 \to \mathcal{R}_1$, whose state-space is $\mathcal{E}_n$, where $\mathcal{R}_k$ corresponds to having $D_t = k$ ancestors (k = n, n – 1, …, 1). The variate $\mathcal{R}_k$ partitions the n individuals into k equivalence classes, each equivalence class corresponding to an ancestor and containing the ancestor’s descendants at time $t_0$. Define the identity relation $\Delta = \{(i, i): i = 1, 2, \ldots, n\}$ and the trivial relation $\Theta = \{(i, j): i, j = 1, 2, \ldots, n\}$. Given $\xi, \eta \in \mathcal{E}_n$, let $\xi \prec \eta$ denote that η can be obtained from ξ by combining two equivalence classes in ξ, and in fact, $\Delta = \mathcal{R}_n \prec \mathcal{R}_{n-1} \prec \cdots \prec \mathcal{R}_2 \prec \mathcal{R}_1 = \Theta$. The transition probabilities of the Markov chain $\{\mathcal{R}_k\}$ are

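$$\mathbb{P}\{\mathcal{R}_{k-1} = \eta \mid \mathcal{R}_k = \xi\} = \begin{cases} 2/[k(k-1)] & \text{if } \xi \prec \eta \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$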

Kingman shows that if ξ contains k equivalence classes,

$$\mathbb{P}\{\mathcal{R}_k = \xi\} = \frac{(n-k)!\,k!\,(k-1)!}{n!\,(n-1)!}\,\lambda_1!\,\lambda_2!\cdots\lambda_k!, \tag{2}$$

where $\lambda_1, \lambda_2, \ldots, \lambda_k$ are the sizes of the equivalence classes of ξ. Eq (1) implies Eq (2), so Eq (2) holds for any model imposing Eq (1) on the ancestry of a sample, in particular the Moran model (Moran 1962, Kimura and Crow 1964, Watterson 1984, Donnelly and Tavare 1986a, Donnelly and Tavare 1986b) (without mutation), or indeed any model of ancestry approximating a coalescent process closely enough.
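The statement that Eq (1) implies Eq (2) can be checked exactly for a small sample. The sketch below propagates the transition rule of Eq (1) over all set partitions, starting from the partition into singletons, and compares the resulting probabilities with the product formula of Eq (2). The function names and the choice n = 5 are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial, prod

def chain_distributions(n):
    """Exact law of the partition with k classes, k = n..1, under Eq (1)."""
    start = frozenset(frozenset([i]) for i in range(n))
    law = {start: Fraction(1)}
    laws = {n: law}
    for k in range(n, 1, -1):
        step = Fraction(2, k * (k - 1))      # each pair of classes merges equally often
        nxt = {}
        for part, prob in law.items():
            for a, b in combinations(part, 2):
                merged = (part - {a, b}) | {a | b}
                nxt[merged] = nxt.get(merged, Fraction(0)) + prob * step
        law = nxt
        laws[k - 1] = law
    return laws

def kingman_formula(n, part):
    """Right side of Eq (2) for a partition given as a set of classes."""
    k = len(part)
    coeff = Fraction(factorial(n - k) * factorial(k) * factorial(k - 1),
                     factorial(n) * factorial(n - 1))
    return coeff * prod(factorial(len(block)) for block in part)

n = 5                                        # illustrative sample size
laws = chain_distributions(n)
agree = all(prob == kingman_formula(n, part)
            for law in laws.values() for part, prob in law.items())
print(agree)
```

Exact rational arithmetic avoids any rounding question in the comparison; the enumeration grows quickly with n and is meant only for small cases.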

Now, draw a subsample of m individuals uniformly at random from the sample of n individuals. The subsample has a most recent common ancestor (MRCA). For 1 ≤ m ≤ j ≤ n, let $p_{n,m;j}$ denote the probability that the subsample MRCA has j descendants within the sample. For j = n, e.g., the subsample has the same MRCA as the sample. From Theorem 2 in (Saunders et al. 1984) with $l_1 = l_2 = 2$ (or Example 1 in (Saunders et al. 1984)),

$$p_{n,m;n} = \frac{m-1}{m+1}\,\frac{n+1}{n-1}. \tag{3}$$

(See also p. 77 in (Hein et al. 2005))
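The definition of the number j of descendants can be made concrete with a small Monte Carlo sketch. It repeatedly merges a uniformly chosen pair of lineages, as Eq (1) prescribes, and stops when a single lineage first carries the whole subsample. The function name, the seed, and the values n = 6, m = 3 are illustrative assumptions; by exchangeability, the first m individuals stand in for the random subsample.

```python
import random
from collections import Counter

def simulate_j(n, m, rng):
    """Return the number of sampled descendants of the subsample MRCA."""
    if m == 1:
        return 1
    # Each lineage is [sampled descendants, how many of them are in the subsample].
    lineages = [[1, 1 if i < m else 0] for i in range(n)]
    while True:
        a, b = sorted(rng.sample(range(len(lineages)), 2))
        second = lineages.pop(b)             # pop the larger index first
        first = lineages.pop(a)
        merged = [first[0] + second[0], first[1] + second[1]]
        if merged[1] == m:                   # first lineage ancestral to all m
            return merged[0]
        lineages.append(merged)

rng = random.Random(1)                       # fixed seed, for reproducibility only
n, m, trials = 6, 3, 20000                   # illustrative values
counts = Counter(simulate_j(n, m, rng) for _ in range(trials))
estimate = counts[n] / trials
exact = (m - 1) * (n + 1) / ((m + 1) * (n - 1))   # Eq (3)
print(estimate, exact)
```

The estimated frequency of j = n should agree with Eq (3) up to Monte Carlo error.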

In a standard notation (Graham et al. 1994), let $n^{\underline{m}} = n(n-1)\cdots(n-m+1)$ denote the falling factorial for 1 ≤ m ≤ n, with $n^{\underline{0}} = 1$. In addition to Eq (3), we have the trivial boundary cases $p_{n,1;1} = p_{n,n;n} = 1$, so for m < n, consider the recursion

$$p_{n,m;m} = \frac{m(m-1)}{n(n-1)}\,p_{n-1,m-1;m-1} + \frac{(n-m)(n-m-1)}{n(n-1)}\,p_{n-1,m;m}, \tag{4}$$

which conditions on the first coalescence $\mathcal{R}_n \to \mathcal{R}_{n-1}$, the two terms corresponding to coalescences: (1) within the subsample (probability $m(m-1)/[n(n-1)]$); and (2) outside of the subsample (probability $(n-m)(n-m-1)/[n(n-1)]$). Eq (4) can provide an inductive proof of the formula

$$p_{n,m;m} = \frac{2(m-1)!}{(m+1)\,(n-1)^{\underline{m-1}}} \tag{5}$$

from (Wiuf and Donnelly 1999). (See also, e.g., p. 84 in (Hein et al. 2005) and Eq (1) in (Rosenberg 2007).)
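The agreement between the recursion in Eq (4) and the closed form in Eq (5) can be checked with exact rational arithmetic. The sketch below is a numerical check over a small range of n, not a substitute for the inductive proof; the function names are assumptions introduced for illustration.

```python
from fractions import Fraction
from functools import lru_cache
from math import factorial

def falling(x, k):
    """Falling factorial x(x-1)...(x-k+1); equals 1 when k = 0."""
    result = 1
    for t in range(k):
        result *= x - t
    return result

@lru_cache(maxsize=None)
def p_mm_recursive(n, m):
    """p_{n,m;m} computed from the recursion in Eq (4)."""
    if m == 1 or m == n:                     # boundary cases, both equal to 1
        return Fraction(1)
    within = Fraction(m * (m - 1), n * (n - 1))               # merge inside the subsample
    outside = Fraction((n - m) * (n - m - 1), n * (n - 1))    # merge outside the subsample
    return (within * p_mm_recursive(n - 1, m - 1)
            + outside * p_mm_recursive(n - 1, m))

def p_mm_closed(n, m):
    """Closed form of Eq (5), used here for m < n."""
    return Fraction(2 * factorial(m - 1), (m + 1) * falling(n - 1, m - 1))

for n in range(2, 13):
    for m in range(1, n):
        assert p_mm_recursive(n, m) == p_mm_closed(n, m)
print(p_mm_closed(5, 2))                     # prints 1/6
```

A coalescence between a subsample lineage and an outside lineage contributes nothing to the recursion, because it forces j > m.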

If j = m, Eq (5) provides a p-value $p_{n,m;m}$ to test whether, under the assumptions underlying the coalescent, subsample ancestries are likely to coalesce before coalescing with the remainder of the sample (see, e.g., p. 86 in (Hein et al. 2005) for examples concerning Neanderthal ancestry (Nordborg 1998, Harris and Hey 1999)). If j > m, then the relevant (left-sided) p-value becomes a sum $p_{n,m;j,\bullet} = \sum_{i=m}^{j} p_{n,m;i}$. Motivated by the applications mentioned above, Theorem 1 in the Theory section extends the analytic formula for $p_{n,m;j}$ from j = m and j = n to 1 ≤ m ≤ j ≤ n.

2 Theory

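Theorem 1: Let 1 ≤ m ≤ n, and consider a sample whose ancestry satisfies Eq (1). Under the set-up described above, for m = 1, definitions show that $p_{n,m;j}$ equals 1 if j = 1 and 0 otherwise. For m > 1,

$$p_{n,m;j} = \begin{cases} \dfrac{m-1}{m+1}\,\dfrac{2\,(j-2)^{\underline{m-2}}}{(n-1)^{\underline{m-1}}} & \text{for } 2 \le m \le j < n \\[2ex] \dfrac{m-1}{m+1}\,\dfrac{n+1}{n-1} & \text{for } 2 \le m \le j = n, \end{cases} \tag{6}$$

with $p_{n,m;j} = 0$ unless 2 ≤ m ≤ j ≤ n.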

Remark: Eq (6) reduces to Eq (5) in the case j = m, as it should.
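A minimal sketch of the distribution in Theorem 1, using exact fractions, may help in checking special cases such as the Remark. The function names are illustrative assumptions; the formulas are those of Eq (6) together with the m = 1 case stated in the theorem.

```python
from fractions import Fraction

def falling(x, k):
    """Falling factorial x(x-1)...(x-k+1); equals 1 when k = 0."""
    result = 1
    for t in range(k):
        result *= x - t
    return result

def pmf(n, m, j):
    """Probability that the subsample MRCA has j descendants in the sample."""
    if not (1 <= m <= j <= n):
        return Fraction(0)
    if m == 1:
        return Fraction(1 if j == 1 else 0)
    if j == n:                                # second case of Eq (6)
        return Fraction((m - 1) * (n + 1), (m + 1) * (n - 1))
    return Fraction(2 * (m - 1) * falling(j - 2, m - 2),     # first case of Eq (6)
                    (m + 1) * falling(n - 1, m - 1))

# The probabilities for a fixed (n, m) should sum to 1.
for n in range(1, 16):
    for m in range(1, n + 1):
        assert sum(pmf(n, m, j) for j in range(m, n + 1)) == 1

print([str(pmf(5, 2, j)) for j in range(2, 6)])   # ['1/6', '1/6', '1/6', '1/2']
```

For m = 2 the first case of Eq (6) does not depend on j, which is why the three values below j = n coincide in the printed example.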

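Proof: Note the following identity for 1 ≤ a ≤ b:

$$a\sum_{i=a}^{b}(i-1)^{\underline{a-1}} = \sum_{i=a}^{b}[i-(i-a)]\,(i-1)^{\underline{a-1}} = \sum_{i=a}^{b}\left[i^{\underline{a}} - (i-1)^{\underline{a}}\right] = b^{\underline{a}}, \tag{7}$$

where the second equality follows because $i\,(i-1)^{\underline{a-1}} = i^{\underline{a}}$ and $(i-1)^{\underline{a-1}}(i-a) = (i-1)^{\underline{a}}$, and the final equality follows because the sum telescopes and $(a-1)^{\underline{a}} = 0$.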
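The identity in Eq (7) is easy to spot-check numerically. The sketch below compares both sides for a small range of a and b; the helper names are illustrative assumptions.

```python
def falling(x, k):
    """Falling factorial x(x-1)...(x-k+1); equals 1 when k = 0."""
    result = 1
    for t in range(k):
        result *= x - t
    return result

def identity_lhs(a, b):
    """Left side of Eq (7): a times the sum of (i-1) falling (a-1), for i = a..b."""
    return a * sum(falling(i - 1, a - 1) for i in range(a, b + 1))

# Example: a = 2, b = 4 gives 2 * (1 + 2 + 3) = 12 = 4 * 3.
print(identity_lhs(2, 4), falling(4, 2))

for a in range(1, 10):
    for b in range(a, 15):
        assert identity_lhs(a, b) == falling(b, a)
```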

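Thus, $\sum_{j=m}^{n} p_{n,m;j} = 1$ for 2 ≤ m ≤ n:

$$\begin{aligned} \frac{m-1}{m+1}\left[\frac{n+1}{n-1} + \sum_{j=m}^{n-1}\frac{2\,(j-2)^{\underline{m-2}}}{(n-1)^{\underline{m-1}}}\right] &= \frac{m-1}{m+1}\left[\frac{n+1}{n-1} + \frac{2}{(n-1)^{\underline{m-1}}}\sum_{j=m}^{n-1}(j-2)^{\underline{m-2}}\right] \\ &= \frac{m-1}{m+1}\left[\frac{n+1}{n-1} + \frac{2}{(n-1)^{\underline{m-1}}}\,\frac{(n-2)^{\underline{m-1}}}{m-1}\right] \\ &= \frac{m-1}{m+1}\left[\frac{n+1}{n-1} + \frac{2}{n-1}\,\frac{n-m}{m-1}\right] = 1, \end{aligned} \tag{8}$$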

where the second equality follows from Eq (7).

For m = 1, definitions yield $p_{n,1;1} = 1$, and for m > 1, $p_{n,m;j} = 0$ unless m ≤ j ≤ n. To set up an inductive proof of Theorem 1 for the cases in Eq (6), let $\mathcal{P}_i$ be the proposition that Theorem 1 holds for every 2 ≤ m ≤ j ≤ n ≤ i. To start the induction, $\mathcal{P}_2$ is true, because by definition $p_{2,2;2} = 1$, agreeing with Eq (6) for 2 ≤ m ≤ j = n ≤ 2 (the other case 2 ≤ m ≤ j < n ≤ 2 being vacuous).

For the inductive step, assume $\mathcal{P}_{n-1}$ holds for some fixed n ≥ 3. From Eq (2) for k = 2, the probability that one of the two equivalence classes of $\mathcal{R}_2$ contains all m subsample individuals and has a total of i elements is

$$\left[\frac{(n-2)!\,2!\,1!}{n!\,(n-1)!}\,i!\,(n-i)!\right]\frac{(n-m)!}{(n-i)!\,(i-m)!} = \frac{2}{n-1}\,\frac{i^{\underline{m}}}{n^{\underline{m}}}, \tag{9}$$

because there are $(n-m)!/[(n-i)!\,(i-m)!]$ equally probable ways of forming the two equivalence classes of $\mathcal{R}_2$ by placing the m subsample individuals into an equivalence class of i elements.

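As usual, let empty sums equal 0. To check that $\mathcal{P}_n$ follows from $\mathcal{P}_{n-1}$, we check first that Eq (6) holds for 2 ≤ m ≤ j < n, then conclude from $\sum_{j=m}^{n} p_{n,m;j} = 1$ and Eq (8) that Eq (6) also holds for 2 ≤ m ≤ j = n. For 2 ≤ m ≤ j < n, then,

$$\begin{aligned} p_{n,m;j} &= \frac{2}{n-1}\sum_{i=j}^{n-1}\frac{i^{\underline{m}}}{n^{\underline{m}}}\,p_{i,m;j} \\ &= \frac{2}{n-1}\left[\frac{j^{\underline{m}}}{n^{\underline{m}}}\,\frac{m-1}{m+1}\,\frac{j+1}{j-1} + \sum_{i=j+1}^{n-1}\frac{i^{\underline{m}}}{n^{\underline{m}}}\,\frac{m-1}{m+1}\,\frac{2\,(j-2)^{\underline{m-2}}}{(i-1)^{\underline{m-1}}}\right] \\ &= \frac{m-1}{m+1}\,\frac{2}{n-1}\,\frac{1}{n^{\underline{m}}}\left[j^{\underline{m}}\,\frac{j+1}{j-1} + 2\,(j-2)^{\underline{m-2}}\sum_{i=j+1}^{n-1} i\right] \\ &= \frac{m-1}{m+1}\,\frac{2}{n-1}\,\frac{1}{n^{\underline{m}}}\left\{j^{\underline{m}}\,\frac{j+1}{j-1} + 2\,(j-2)^{\underline{m-2}}\,\frac{1}{2}\left[n^{\underline{2}} - (j+1)^{\underline{2}}\right]\right\} \\ &= \frac{m-1}{m+1}\,\frac{2\,(j-2)^{\underline{m-2}}}{(n-1)^{\underline{m-1}}}, \end{aligned} \tag{10}$$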

where the first equality is justified as follows. One of the two equivalence classes of $\mathcal{R}_2$ must contain all m individuals from the subsample. Let the equivalence class of $\mathcal{R}_2$ containing the subsample have size i, where i ∈ {j, j + 1, …, n – 1}. In each case, Eq (9) gives the probability of the size i, the weight for $p_{i,m;j}$ in the right side of the first equality in Eq (10).

Now, for m ≥ 2, every j with $p_{n,m;j} > 0$ satisfies 2 ≤ m ≤ j ≤ n. Because Eq (10) applies to the cases 2 ≤ m ≤ j < n, Eq (8) shows that Eq (3) for $p_{n,m;n}$ holds. Thus, Eq (6) also holds for 2 ≤ m ≤ j = n, so $\mathcal{P}_n$ follows from $\mathcal{P}_{n-1}$, completing the induction and the proof.
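The structure of the induction can be mirrored numerically. The sketch below builds the probabilities for j < n from the first equality in Eq (10), obtains the case j = n from normalization, and then compares the table with the closed form of Eq (6). It is a check over small n under the stated formulas, not part of the proof; the function names are illustrative assumptions.

```python
from fractions import Fraction

def falling(x, k):
    """Falling factorial x(x-1)...(x-k+1); equals 1 when k = 0."""
    result = 1
    for t in range(k):
        result *= x - t
    return result

def closed_form(n, m, j):
    """Eq (6), for 2 <= m <= j <= n."""
    if j == n:
        return Fraction((m - 1) * (n + 1), (m + 1) * (n - 1))
    return Fraction(2 * (m - 1) * falling(j - 2, m - 2),
                    (m + 1) * falling(n - 1, m - 1))

def table_by_recursion(n_max, m):
    """table[n][j] for m <= j <= n <= n_max, built without using Eq (6)."""
    table = {m: {m: Fraction(1)}}            # a subsample equal to the sample
    for n in range(m + 1, n_max + 1):
        row = {}
        for j in range(m, n):
            # Condition on the size i of the class of R_2 holding the subsample.
            row[j] = Fraction(2, n - 1) * sum(
                Fraction(falling(i, m), falling(n, m)) * table[i][j]
                for i in range(j, n))
        row[n] = 1 - sum(row.values())       # remaining mass sits at j = n
        table[n] = row
    return table

for m in range(2, 8):
    table = table_by_recursion(12, m)
    for n, row in table.items():
        for j, value in row.items():
            assert value == closed_form(n, m, j)
print("recursion and closed form agree")
```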

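Corollary 1: In the set-up of Theorem 1, the left-sided p-value $p_{n,m;j,\bullet} = \sum_{i=m}^{j} p_{n,m;i}$ is $p_{n,1;1,\bullet} = 1$ for n ≥ 1 and

$$p_{n,m;j,\bullet} = \begin{cases} \dfrac{2}{m+1}\,\dfrac{(j-1)^{\underline{m-1}}}{(n-1)^{\underline{m-1}}} & \text{for } 2 \le m \le j < n \\[2ex] 1 & \text{for } 2 \le m \le j = n. \end{cases} \tag{11}$$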

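Proof: For 2 ≤ m ≤ j < n,

$$\begin{aligned} p_{n,m;j,\bullet} = \sum_{i=m}^{j} p_{n,m;i} &= \frac{2\,(m-1)}{(m+1)\,(n-1)^{\underline{m-1}}}\sum_{i=m}^{j}(i-2)^{\underline{m-2}} \\ &= \frac{2\,(m-1)}{(m+1)\,(n-1)^{\underline{m-1}}}\sum_{i=m-1}^{j-1}(i-1)^{\underline{m-2}} = \frac{2\,(j-1)^{\underline{m-1}}}{(m+1)\,(n-1)^{\underline{m-1}}}, \end{aligned} \tag{12}$$

where the third equality changes the index of summation from i to i′ = i – 1, and the final equality follows from the identity in Eq (7) with a = m – 1 and b = j – 1. Theorem 1 completes the proof, because for 2 ≤ m ≤ j = n, $p_{n,m;n,\bullet} = \sum_{i=m}^{n} p_{n,m;i} = 1$.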
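Corollary 1 translates directly into a small function. The sketch below implements the left-sided p-value of Eq (11) with exact fractions and confirms, over a small range, that it equals the cumulative sum of the probabilities in Eq (6). The function names are illustrative assumptions.

```python
from fractions import Fraction

def falling(x, k):
    """Falling factorial x(x-1)...(x-k+1); equals 1 when k = 0."""
    result = 1
    for t in range(k):
        result *= x - t
    return result

def pmf(n, m, j):
    """Eq (6), with the m = 1 case of Theorem 1."""
    if not (1 <= m <= j <= n):
        return Fraction(0)
    if m == 1:
        return Fraction(1 if j == 1 else 0)
    if j == n:
        return Fraction((m - 1) * (n + 1), (m + 1) * (n - 1))
    return Fraction(2 * (m - 1) * falling(j - 2, m - 2),
                    (m + 1) * falling(n - 1, m - 1))

def left_p_value(n, m, j):
    """Left-sided p-value of Eq (11): probability of at most j descendants."""
    if not (1 <= m <= j <= n):
        raise ValueError("require 1 <= m <= j <= n")
    if m == 1 or j == n:
        return Fraction(1)
    return Fraction(2 * falling(j - 1, m - 1), (m + 1) * falling(n - 1, m - 1))

for n in range(1, 14):
    for m in range(1, n + 1):
        for j in range(m, n + 1):
            cumulative = sum(pmf(n, m, i) for i in range(m, j + 1))
            assert left_p_value(n, m, j) == cumulative
print(left_p_value(3, 2, 2))                 # prints 1/3
```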

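Corollary 2: In the set-up of Theorem 1, and in the limit n → ∞ and $j n^{-1} \to f$ with m fixed, the left-sided p-value $p_{n,m;j,\bullet} = \sum_{i=m}^{j} p_{n,m;i}$ satisfies

$$\lim p_{n,m;j,\bullet} = \begin{cases} \dfrac{2}{m+1}\,f^{m-1} & \text{for } 2 \le m \le j < n \\[2ex] 1 & \text{for } 2 \le m \le j = n. \end{cases} \tag{13}$$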
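The limit in Corollary 2 can be illustrated numerically by evaluating Eq (11) for growing n with j/n held fixed. The values m = 4 and f = 0.3, and the function names, are assumptions chosen only for illustration.

```python
def left_p_value(n, m, j):
    """Eq (11) in floating point, for 2 <= m <= j < n."""
    value = 2.0 / (m + 1)
    for t in range(1, m):                    # ratio of the two falling factorials
        value *= (j - t) / (n - t)
    return value

def limit_value(m, f):
    """Right side of Eq (13) for j < n."""
    return 2.0 / (m + 1) * f ** (m - 1)

m, f = 4, 0.3                                # illustrative values
for n in (10**2, 10**4, 10**6):
    j = (3 * n) // 10                        # j/n equals 0.3 exactly for these n
    print(n, left_p_value(n, m, j), limit_value(m, f))
```

The finite-n values approach the limit from below in this example, and the gap shrinks roughly in proportion to 1/n.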

3 Numerical Results

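Typically, as Eq (3) suggests, much of the probability mass of {$p_{n,m;j}$: j = m, m + 1, …, n} occurs at $p_{n,m;n}$. To explore the corresponding left-sided p-values $p_{n,m;j,\bullet} = \sum_{i=m}^{j} p_{n,m;i}$ numerically, note from Eq (11) that $p_{n,m;n,\bullet} = 1$, that $p_{n,m;n-1,\bullet} = 1 - p_{n,m;n} = \frac{2}{m+1}\,\frac{n-m}{n-1}$, and that

$$p_{n,m;j,\bullet} = p_{n,m;j+1,\bullet}\,\frac{j-m+1}{j} \quad \text{for } 2 \le m \le j < n-1, \tag{14}$$

suggesting that typical left-sided p-values {$p_{n,m;j,\bullet}$} (m ≤ j ≤ n) decrease rapidly as j decreases from j = n to j = m.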

In Figure 1, e.g., $p_{16,8;12,\bullet} \approx 0.01$, suggesting that inference on samples as small as n = 16 can be surprisingly strong, even for j > m. For large samples (n → ∞), Corollary 2 confirms that small subsamples of m individuals can produce strong inferences, even if j is much larger than m.
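The value quoted for n = 16, m = 8, j = 12 can be reproduced from Eq (11). The sketch below also tabulates the p-values for n = 16 and m = 8 and checks the ratio between successive p-values, restricted here to j + 1 < n so that both values come from the first case of Eq (11). The function names are illustrative assumptions.

```python
from fractions import Fraction

def falling(x, k):
    """Falling factorial x(x-1)...(x-k+1); equals 1 when k = 0."""
    result = 1
    for t in range(k):
        result *= x - t
    return result

def left_p_value(n, m, j):
    """Left-sided p-value of Eq (11), for 2 <= m <= j <= n."""
    if j == n:
        return Fraction(1)
    return Fraction(2 * falling(j - 1, m - 1), (m + 1) * falling(n - 1, m - 1))

n, m = 16, 8
for j in range(m, n + 1):
    print(j, float(left_p_value(n, m, j)))

# Successive p-values differ by the factor (j - m + 1) / j when j + 1 < n.
for j in range(m, n - 1):
    assert left_p_value(n, m, j) == left_p_value(n, m, j + 1) * Fraction(j - m + 1, j)

print(left_p_value(16, 8, 12))               # prints 4/351, about 0.0114
```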

Figure 1. A semi-logarithmic plot of the left-sided p-value $p_{n,m;j,\bullet}$ against j for m ≤ j ≤ n = 8 and m ≤ j ≤ n = 16.

The solid circles and lines correspond to n = 8; the open circles/triangles and dotted lines correspond to n = 16. Points on a curve share a common value of m, given in the appropriate color on the left of the curve. Each point on the curve corresponds to $(j, p_{n,m;j,\bullet})$ plotted for the values of n and m common to the points on the curve.

4 Discussion

Theorem 1 is related to combinatorial results on unique event polymorphisms and the sub-trees corresponding to a mutation (Wiuf and Donnelly 1999). More specifically, a special case of Corollary 1 (Eq (5)) has been presented as a p-value for inferring monophyly (e.g., Eq (1) in (Rosenberg 2007)), and as a p-value for inferring non-random mating (e.g., in the informal discussion of the phylogenetic relationship of Neanderthals and modern humans on p. 84 of (Hein et al. 2005)). Because Corollary 1 generalizes Eq (5), it has similar applications in taxonomy.

As a hypothetical example, consider the phylogenetic tree presented in (Krings et al. 2000) (the following description of the tree suffices for present purposes). The tree was consistent with reciprocal monophyly of Neanderthals and modern humans, but contained too few Neanderthals to conclude reciprocal monophyly at p ≤0.05 from tree topology alone (e.g., (Rosenberg 2007)). An alternative statistic, estimated times to most recent common ancestor (TMRCA) (Nordborg 1998), effectively excluded random mating, but not the possibility of some gene flow. The statistical conclusions based on TMRCAs, however, require assumptions about the entire history of human population sizes (e.g., (Tang et al. 2002)), whereas conclusions based on tree topology alone require less restrictive assumptions, ones about the size of the human population co-existing with Neanderthals. In general, less restrictive assumptions yield more robust statistical tests, so an inference based on tree topology alone is more robust than an inference based on estimated TMRCAs.

As Neanderthal sequences accumulate, the present state of knowledge does not exclude inferences on a future genetic sample of Neanderthals and modern humans that generates a hypothetical tree with n – m human ancestors co-existing with m Neanderthals. The hypothetical tree might contain j – m > 0 human ancestors sharing a most recent common ancestor with the Neanderthals, the remaining human ancestors lying on a second lineage. With the use of the p-value in Corollary 1, the tree topology on its own (despite displaying gene flow) could suffice to reject (and reject more strongly than the present data permit) random mating between Neanderthals and human ancestors.

As another hypothetical example of Corollary 1, consider a study of a human genetic disease sampling n individuals under restrictions justifying a statistical analysis with coalescent theory. Let m individuals within the sample display the disease, along with a genetic marker suspected as necessary for the disease. Because of incomplete genetic penetrance or the absence of additional but unknown genetic factors required for disease, a further j – m > 0 individuals might display the relevant marker without displaying the disease. As in the coalescent theory of unique event polymorphisms (Griffiths and Tavare 1998, Wiuf and Donnelly 1999, Griffiths and Tavare 2003, Tavare 2004), assume that the mutation generating the marker occurred only once. Corollary 1 could test if the association of the marker with the disease reaches statistical significance.

Any statistical test using Eq (6) or Corollary 1 is based solely on Eq (1), an assumption common to most coalescent models. Accordingly, such a test is more robust than tests based on more specific coalescent models. Although the test loses power through its sparse assumptions, the Numerical Results section suggests that nonetheless, tests based on Corollary 1 can be surprisingly powerful.

Acknowledgements

It is my pleasure to acknowledge helpful conversations with Drs. Laszlo Szekely, Eva Czabarka, Susanta Tewari, Dave Erickson, and Peter Rogan. I also thank the Editor and two anonymous reviewers for their helpful suggestions, which greatly improved the article. This research was supported by the Intramural Research Program of the NIH, National Library of Medicine.


References

  1. Donnelly P, Tavare S. The ages of alleles and a coalescent. Advances in Applied Probability. 1986a;18:1–19.
  2. Donnelly P, Tavare S. The ages of alleles and a coalescent: Correction. Advances in Applied Probability. 1986b;18:1023.
  3. Graham RL, Knuth DE, Patashnik O. Concrete mathematics: a foundation for computer science. New York: Addison-Wesley; 1994.
  4. Griffiths RC, Tavare S. The age of a mutation in a general coalescent tree. Stochastic Models. 1998:273–295.
  5. Griffiths RC, Tavare S. The genealogy of a neutral mutation. In: Green P, Hjort N, Richardson S, editors. Highly Structured Stochastic Systems. Oxford University Press; 2003. pp. 93–412.
  6. Harris EE, Hey J. X chromosome evidence for ancient human histories. Proceedings of the National Academy of Sciences of the United States of America. 1999;96:3320–3324. doi: 10.1073/pnas.96.6.3320.
  7. Hein J, Schierup MH, Wiuf C. Gene Genealogies, Variation, and Evolution. Oxford: Oxford University Press; 2005.
  8. Kimura M, Crow JF. Number of alleles that can be maintained in finite population. Genetics. 1964;49:725. doi: 10.1093/genetics/49.4.725.
  9. Kingman J. The coalescent. Stochastic Processes and Their Applications. 1982a;13:235–248.
  10. Kingman J. On the genealogy of large populations. Essays in Statistical Science. 1982b;19:27–43.
  11. Krings M, Capelli C, Tschentscher F, et al. A view of Neandertal genetic diversity. Nature Genetics. 2000;26:144–146. doi: 10.1038/79855.
  12. Krings M, Stone A, Schmitz RW, et al. Neandertal DNA sequences and the origin of modern humans. Cell. 1997;90:19–30. doi: 10.1016/s0092-8674(00)80310-4.
  13. Moran PAP. The Statistical Processes of Evolutionary Theory. Oxford: Clarendon Press; 1962.
  14. Nordborg M. On the probability of Neanderthal ancestry. American Journal of Human Genetics. 1998;63:1237–1240. doi: 10.1086/302052.
  15. Nordborg M. Coalescent Theory. In: Handbook of Statistical Genetics. New York: Wiley; 2004. pp. 602–635.
  16. Pollak E. On an extension of the theory of coalescents to populations with two sexes. Jour. Ind. Soc. Ag. Statistics. 2004;57:84–93.
  17. Rosenberg NA. Statistical tests for taxonomic distinctiveness from observations of monophyly. Evolution. 2007;61:317–323. doi: 10.1111/j.1558-5646.2007.00023.x.
  18. Saunders IW, Tavare S, Watterson GA. On the genealogy of nested subsamples from a haploid population. Advances in Applied Probability. 1984;16:471–491.
  19. Tang H, Siegmund DO, Shen PD, et al. Frequentist estimation of coalescence times from nucleotide sequence data using a tree-based partition. Genetics. 2002;161:447–459. doi: 10.1093/genetics/161.1.447.
  20. Tavare S. Ancestral Inference in Population Genetics. Lectures on Probability Theory and Statistics. 2004;1837.
  21. Wakeley J, King L, Low BS, et al. Gene genealogies within a fixed pedigree, and the robustness of Kingman's coalescent. Genetics. 2012;190:1433–1445. doi: 10.1534/genetics.111.135574.
  22. Watterson GA. Lines of descent and the coalescent. Theoretical Population Biology. 1984;26:77–92.
  23. Wiuf C, Donnelly P. Conditional genealogies and the age of a neutral mutant. Theoretical Population Biology. 1999;56:183–201. doi: 10.1006/tpbi.1998.1411.
