Genetics. 2016 Apr 19;203(2):699–714. doi: 10.1534/genetics.116.187492

Efficient Genome-Wide Sequencing and Low-Coverage Pedigree Analysis from Noninvasively Collected Samples

Noah Snyder-Mackler, William H Majoros, Michael L Yuan, Amanda O Shaver, Jacob B Gordon, Gisela H Kopp, Stephen A Schlebusch, Jeffrey D Wall, Susan C Alberts, Sayan Mukherjee, Xiang Zhou, Jenny Tung
PMCID: PMC4896188  PMID: 27098910

Abstract

Research on the genetics of natural populations was revolutionized in the 1990s by methods for genotyping noninvasively collected samples. However, these methods have remained largely unchanged for the past 20 years and lag far behind the genomics era. To close this gap, here we report an optimized laboratory protocol for genome-wide capture of endogenous DNA from noninvasively collected samples, coupled with a novel computational approach to reconstruct pedigree links from the resulting low-coverage data. We validated both methods using fecal samples from 62 wild baboons, including 48 from an independently constructed extended pedigree. We enriched fecal-derived DNA samples up to 40-fold for endogenous baboon DNA and reconstructed near-perfect pedigree relationships even with extremely low-coverage sequencing. We anticipate that these methods will be broadly applicable to the many research systems for which only noninvasive samples are available. The lab protocol and software (“WHODAD”) are freely available at www.tung-lab.org/protocols-and-software.html and www.xzlab.org/software.html, respectively.

Keywords: capture-based enrichment, noninvasive samples, baboons, paternity analysis, pedigree, genome resequencing


The capacity to generate genetic data from low-quality or noninvasively collected samples, first developed in the 1990s, revolutionized the study of genetics, evolution, behavior, and ecology in natural populations. These methodological advances facilitated phylogenetic and phylogeographic analyses of difficult-to-sample taxa; helped define the role of admixture in mammalian evolution (Pérez et al. 2010; Sacks et al. 2011; Charpentier et al. 2012); and enabled theoretical expectations about paternal investment, kin recognition, and reproductive skew to be empirically tested, sometimes for the first time (Buchan et al. 2003; Smith et al. 2003; Archie et al. 2007; Gottelli et al. 2007). They also yielded important insights into the genetic viability and future prospects of threatened or endangered populations from which invasive samples are impossible to obtain (Idaghdour et al. 2003; Valière et al. 2003; Nagata et al. 2005; Rudnick et al. 2007; Mondol et al. 2009). Noninvasive genetic analysis has thus changed the ways we study population, ecological, and conservation genetics, and we would know far less about many species without it.

However, techniques for noninvasive genetic analysis have changed little in the past 20 years. Collection of genetic data from noninvasively collected tissues (e.g., feces, hair, and urine) continues to be labor intensive, time intensive, and vulnerable to technical artifacts such as allelic dropout and cross-contamination (Gagneux et al. 1997; Taberlet et al. 1999). Further, current methods ultimately yield very small amounts of data by today’s standards. Typical studies genotype up to several dozen microsatellite loci per individual—a trivial amount compared to the data sets now routinely generated using standard high-throughput sequencing approaches. Thus, while existing methods are sufficient for basic pedigree construction and estimating some population genetic parameters (although usually with substantial uncertainty), they are severely underpowered for many other types of analyses (Sabeti et al. 2002; Price et al. 2009; Li and Durbin 2011), including any that require local (i.e., gene- or region-specific) information instead of genome-wide averages (Huang et al. 2007; Li et al. 2007; Sankararaman and Sridhar 2008; Yang et al. 2011; Ma et al. 2014). Further, because noninvasively collected genotype data are most often based on microsatellites, they cannot take advantage of new tools designed specifically for single-nucleotide variants (Purcell et al. 2007; Visscher 2009; Durand et al. 2011).

Generating genome-scale data sets from noninvasive samples is challenging for two reasons. First, in many cases, the DNA extracted from these samples is low quality and highly fragmented. Second, it contains large proportions of nonhost DNA. For example, only ∼1% of DNA extracted from fecal-derived samples is endogenous to the donor animal [most is microbial (Perry et al. 2010)]. Sequence capture methods, in which synthesized baits are used to enrich for prespecified target sequences from a larger DNA pool (Gnirke et al. 2009), present a potential solution to both of these problems. Because shearing is a required step in library preparation, the problem of working with highly fragmented samples is obviated. Indeed, Perry et al. (2010) were able to use a modified version of sequence capture to target and sequence 1.5 Mb of the chimpanzee genome from fecal-derived DNA, with very low genotyping error rates relative to blood-derived DNA. More recently, Carpenter et al. (2013) reported a method for performing genome-wide sequence capture from low-quality ancient DNA samples, which recapitulate many of the challenges posed by noninvasive samples (e.g., highly fragmented DNA and low proportions of endogenous DNA).

However, while considerable investment in single samples often makes sense in ancient DNA studies, the low levels of postcapture enrichment associated with currently available protocols are not cost effective for population studies of noninvasive samples. Substantially higher rates of enrichment, particularly in nonrepetitive regions of the genome, will be essential to overcome this limitation. In addition, computational methods for analyzing the resulting data are also required, especially given that genome-scale sequencing efforts for such samples are likely to produce low-coverage data. For example, current paternity assignment approaches (Chakraborty et al. 1974; Marshall et al. 1998; Kalinowski et al. 2007) were not designed to deal with uncertain genotypes, an inevitable component of analyzing low-coverage sequencing data. Thus, for capture-based methods to become broadly accessible, the development of appropriate new computational approaches is also essential.

Here, we report an optimized laboratory protocol for genome-wide capture of endogenous DNA from noninvasively collected samples, combined with a novel computational approach to reconstruct pedigree links from the resulting data (implemented in the program WHODAD). We validate both our laboratory methods and computational tools, using noninvasively collected samples from 54 members of an intensively studied wild baboon population in the Amboseli basin of Kenya (Alberts and Altmann 2012). We also demonstrate the generalizability of our methods to noninvasive samples collected using different methods from a different baboon species from West Africa. Our protocol is cost effective, has manageable sample input requirements, yields good capture efficiency for high complexity, nonrepetitive DNA, and minimizes the need for extensive PCR amplification. Importantly, we find that genotype data generated from fecal samples closely match data from high-quality blood-derived DNA samples from the same individuals, and provide near-perfect information on pedigree relationships even with extremely low per-sample sequencing coverage (mean = 0.49× genome coverage). Together, these methods will enable population, conservation, and ecological genetic analyses of natural populations to again take a major leap forward, into the genomic era. At the same time, they will also introduce new systems to the genomics community.

Results

DSN digestion during bait construction increases library complexity

Our protocol relies on in vitro transcription of biotinylated RNA baits to capture host-specific DNA from the mixed pool of host, environmental, and microbial DNA extracted from noninvasive samples. Similar to Carpenter et al. (2013), RNA baits are generated from DNA templates obtained from a high-quality DNA sample (here, DNA extracted from blood). This approach avoids the high cost of custom bait synthesis (as in Perry et al. 2010 and Gnirke et al. 2009), but can also produce a bait set that includes a large proportion of low-complexity, repetitive regions. Consequently, many reads generated from captured DNA cannot be uniquely mapped, lowering the protocol’s efficiency. To address this concern, we incorporated a novel duplex specific nuclease (DSN) digestion in the bait construction step (Supplemental Material, Figure S1A; see Methods). Sequencing the DNA bait templates prior to in vitro amplification demonstrates that including the digestion step reduces the percentage of baits synthesized from low-complexity/highly duplicated regions. Specifically, a 4-hr incubation of sheared DNA at 68° followed by a 20-min DSN digestion in the presence of human Cot-1 produced the highest-complexity bait library of the five conditions we tested. Compared to DNA templates from a non-DSN-digested library, bait templates produced using these conditions reduced the number of reads mapping to multiple locations by 2.6-fold (from 19.2% to 7.5%; Figure S2).

Capture-based enrichment

We validated our full capture protocol (bait construction followed by capture of endogenous DNA and sequencing of captured fragments), using fecal-derived DNA (fDNA) samples collected from 54 individually recognized yellow baboons (36 males and 18 females; Figure 1) from the Amboseli baboon population, an intensively studied population in which maternal and paternal pedigree relationships are known for a large set of individuals (Buchan et al. 2003; Alberts et al. 2006; Alberts and Altmann 2012). We produced data for 52 of the samples in two successive capture efforts: “capture 1” was conducted on fDNA from 24 baboons, and “capture 2” was conducted on fDNA from 28 additional baboons after making multiple improvements to our initial protocol (changes to the protocol between capture efforts are described in detail in Table S1 and File S1; see Table S2 for information on sequencing coverage and mapping statistics). Data from the remaining two individuals, “LIT” and “HAP,” were generated to compare the captured fDNA sample with data derived from sequencing blood-derived genomic DNA (gDNA) samples from the same individuals.

Figure 1.


Pedigree of a subset of baboons monitored by the Amboseli Baboon Research Project. Samples from both males (squares) and females (circles) were enriched in capture 1 (green) or capture 2 (purple). Open circles and squares represent baboons that connect individuals in our pedigree, but who were not sequenced as part of this study. Each sequenced individual is represented by a unique number (below the circles or squares), with some individuals repeated because baboons often produce offspring with multiple mates. The paired fDNA and gDNA samples came from two individuals, HAP (blue) and LIT (orange), who were members of the study population but are not connected to this pedigree.

Our protocol (Figure S1) resulted in substantial enrichment of baboon DNA in the postcapture vs. precapture samples (see Table S2 for sample-specific details). A mean of 44.56% (range: 10.28–83.17%) of postcapture fragments mapped to the yellow baboon genome (Pcyn1.0), despite starting with precapture samples that contained a mean of only 2.04% endogenous baboon DNA, as estimated by quantitative PCR (qPCR) (range: 0.19–8.37%). However, in capture 1 a large proportion of the mapped fragments were identified as PCR duplicates (capture 1 mean = 71.97% of mapped fragments, range = 51.43–88.46%; Figure 2A). After removing PCR duplicates, a mean of 9.16% of the postcapture reads in capture 1 were nonduplicate mappable fragments (capture 1 range = 2.23–23.75%), producing a mean coverage of 0.20× per sample relative to the mappable baboon genome (mean sequencing depth of 5.8 Gb per sample; capture 1 range = 0.04–0.49×; Figure 2B). These numbers translated to an overall mean fold enrichment of 39.8× for mapped reads (capture 1 range = 8.0–111.8-fold, SD = 25.2) and 9.6× enrichment of non-PCR duplicate mapped reads (capture 1 range = 3.9–22.4-fold, SD = 5.0; Figure 2C).

Figure 2.


fDNA enrichment results. (A) Percentage of sequencing reads that mapped to the baboon genome and were not PCR duplicates (“Mapped,” dark blue), mapped and were PCR duplicates (“PCR Duplicate,” blue), or did not map and likely represent environmental or bacterial DNA in the case of fDNA/aDNA and unmappable fragments in the case of gDNA (“Other,” light blue). “gDNA” represents genomic DNA derived from the blood samples for LIT and HAP; “aDNA” represents ancient DNA data from capture-based enrichment reported in Carpenter et al. (2013). Numbers above each bar show the total number of PCR cycles used in each protocol. (B) Capture 2 produced significantly greater genome coverage than capture 1, despite a similar number of reads generated per sample (two-sample t-test, T = 9.7, P = 3.0 × 10⁻¹²). On average in capture 2, we obtained ∼0.73× coverage of the genome with 5.76 Gb of sequencing. If all 5.76 Gb mapped to the baboon genome as non-PCR duplicates, we would have produced ∼2.2× genome-wide coverage. (C) Capture 2 also produced significantly greater fold enrichment of baboon DNA (fold enrichment is measured as percentage of nonduplicate baboon DNA postcapture divided by percentage of baboon DNA precapture: two-sample t-test, T = 4.4, P = 7.3 × 10⁻⁵). (D) The amount of baboon DNA in the sample precapture [percentage of baboon DNA precapture, based on qPCR of the single-copy c-myc gene (Morin et al. 2001)] is strongly correlated with the percentage of baboon fragments obtained in postenrichment sequencing (Pearson’s r = 0.80, P = 1.0 × 10⁻¹¹). However, even samples with low amounts of endogenous DNA (<2%) exhibit substantial fold enrichment using our protocol (capture 1 mean = 10.60×, capture 2 mean = 24.82×).

Based on our results for capture 1, we made multiple protocol improvements prior to conducting capture 2 (Table S1 and File S1). The improved protocol was twice as effective on average, resulting in a mean 18-fold enrichment of high-quality, analysis-ready reads and a maximum fold enrichment of close to 40-fold [capture 2 range = 8.0–39.2-fold, Figure 2C; by comparison, methods optimized for ancient DNA achieved a mean of 5.5-fold enrichment of non-PCR duplicate fragments (Carpenter et al. 2013), Figure 2A]. Specifically, the protocol changes improved the proportion of nonduplicate mapped fragments by >4-fold, from a mean proportion of 9.16% in capture 1 to a mean proportion of 37.74% in capture 2 (capture 2 range = 6.16–68.61%), and reduced the proportion of PCR duplicates among mapped reads 2-fold (from 71.97% in capture 1 to 36.97% in capture 2). This improvement translated to an increase in overall genomic coverage from a mean of 0.20× in capture 1 to 0.73× in capture 2 (mean total sequencing of 5.7 Gb per sample; capture 2 range = 0.19–1.24×; Figure 2B). This improvement in coverage was not explained by increased sequencing depth in capture 2 (Table S2). Thus, while we would need to sequence a precapture fDNA sample 50–100 times as deeply as a blood- or tissue-derived sample to produce the same level of coverage, our capture method reduces this difference to ∼2 times the sequencing effort. Importantly, our method was also successful in enriching fDNA samples (n = 8) from independent samples collected from Guinea baboons (Papio papio; Figure 2A, Table S2), suggesting that our results are highly generalizable across different species and storage and extraction methods.
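The fold-enrichment and coverage arithmetic used in this section can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the example sample is invented, and the mappable genome size of 2.62 Gb is an assumption back-calculated from the statement that 5.76 Gb of nonduplicate mapped sequence would give ∼2.2× coverage.

```python
def fold_enrichment(pct_nondup_post: float, pct_pre: float) -> float:
    """Percentage of nonduplicate host DNA postcapture divided by the
    percentage of host DNA precapture (the Figure 2C definition)."""
    if pct_pre <= 0:
        raise ValueError("precapture percentage must be positive")
    return pct_nondup_post / pct_pre


def expected_coverage(sequenced_gb: float, frac_nondup_mapped: float,
                      genome_gb: float) -> float:
    """Approximate genome coverage from total sequencing (Gb), the fraction of
    that sequence that is mapped and nonduplicate, and the genome size (Gb)."""
    return sequenced_gb * frac_nondup_mapped / genome_gb


# Assumed mappable genome size (Gb), back-calculated from 5.76 Gb -> ~2.2x.
ASSUMED_GENOME_GB = 2.62

# Hypothetical sample: 2% host DNA precapture, 36% nonduplicate host reads postcapture.
example_fold = fold_enrichment(36.0, 2.0)

# Upper bound if every sequenced base were a nonduplicate mapped base.
max_cov = expected_coverage(5.76, 1.0, ASSUMED_GENOME_GB)
```

Because fold enrichment is a per-sample ratio, the mean of per-sample values generally differs from the ratio of the mean postcapture and precapture percentages.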

Sample attributes influencing capture efficiency

The amount of baboon DNA in the precapture fDNA sample was the strongest predictor of enrichment success. Specifically, the percentage of baboon DNA precapture, as assessed via qPCR, was positively correlated with the percentage of nonduplicate fragments mapped postcapture (Figure 2D; T = 6.88, P = 1.72 × 10⁻⁸). Samples from capture 2 had more precapture baboon DNA than samples used in capture 1 because we attempted to optimize the input samples based on our initial analyses in capture 1 (capture 1 mean = 1.21%, range = 0.19–4.90%; capture 2 mean = 2.80%, range = 0.25–8.37%). However, even when controlling for this difference, enrichment of samples from capture 2 was improved over that of capture 1. This pattern is observable whether assessed using the percentage of baboon DNA fragments sequenced postcapture (T_capture2 = 10.00, P = 6.76 × 10⁻¹³) or assessed using fold enrichment relative to precapture amounts (T_capture2 = 6.89, P = 1.69 × 10⁻⁸) and could not be explained by differences in the length of sequence fragments or overall sequencing depth (Figure S3, Table S2). The amount of fDNA library used in the capture reaction was also weakly positively correlated with the percentage of baboon DNA fragments sequenced postcapture, after controlling for the amount of baboon DNA in the precapture sample (T_ng_fDNA_library = 2.09, P = 0.042; Table S2).

Library complexity, distribution of reads, and GC content

The postcapture libraries included a higher proportion of PCR duplicates relative to reads generated from high-quality genomic DNA samples, for which fewer rounds of PCR amplification were required (PCR duplicate proportion: fDNA capture 1 mean = 69.6%, fDNA capture 2 mean = 36.8%, gDNA mean = 11.3% of mapped reads; 18 rounds of PCR in the capture protocol vs. 6 rounds for the high-quality samples). For comparison, this proportion is much lower than reported for ancient DNA (aDNA) samples, which go through more rounds of PCR amplification (aDNA mean = 94.6%; Figure 2A and Figure S4; Carpenter et al. 2013). Despite increases in clonality, the number of nonduplicate reads continued to increase with increasing sequencing depth, with the slope of this relationship especially favorable for capture 2 (Figure 3). Thus, deeper sequencing of postcapture libraries should continue to increase genome-wide coverage, albeit not as efficiently as sequencing blood-derived gDNA samples.

Figure 3.


Increased sequencing effort produces increased numbers of nonduplicate reads. Shown is the number of mapped reads plotted against the number of nonduplicate reads mapped [mean ± SD; plotted using the program “preseq” (Daley and Smith 2013)]. More complex libraries (i.e., those containing more nonduplicate fragments) have a slope closer to 1 (as in the case of the gDNA libraries), while less complex libraries have a shallower slope and asymptote at a smaller value. The main plot shows the first 10 million mapped reads for each sample. The inset shows the same plot for the first 1 million mapped reads.

As with other capture-based methods (Carpenter et al. 2013; Samuels et al. 2013), a modest fraction of the mapped fragments mapped to the mitochondrial genome (mtDNA). When we included all mapped reads, this fraction was similar in libraries from capture 1 and capture 2 (capture 1 mean = 6.55%; capture 2 mean = 6.73%; Figure S5A). However, capture 2 resulted in significantly more unambiguously nonduplicate mtDNA-mapped reads than capture 1, largely due to the paired-end sequencing used in capture 2 (capture 1 mean = 0.47% of all mapped reads; capture 2 mean = 6.46%; Figure S5B). The higher number of nonduplicate mtDNA reads in capture 2 thus produced much deeper overall coverage of the mitochondrial genome (Figure S5C), despite the fact that the ratio of mtDNA to nuclear DNA mapped reads was comparable between the two captures (Figure S5D). Finally, the distributions of read GC content for postcapture reads using our protocol, the DNA template for the RNA baits, and aDNA libraries were highly similar (Figure S6). This observation suggests that any GC bias relative to the genome appears during bait construction and/or sequencing, not during the hybridization step.

Postcapture fDNA-derived genotype data are consistent with individual identity and independently established pedigree relationships

To assess the accuracy of genotypes called from postcapture fDNA libraries, we compared genotype data from paired blood-derived gDNA (without capture) and postcapture fDNA libraries for two individuals, LIT and HAP. Using genotypes for sites that were called with a genotype quality (GQ) > 20 in both the fDNA and gDNA data sets for either LIT or HAP, we found that the majority of the genotypes called in both data sets were concordant (86.5% of 312,739 sites for the LIT paired samples; 77% of 40,132 sites for the HAP paired samples, for whom we had much lower coverage for the fecal-derived sample). As expected, the majority of the discordant sites occurred when the low-coverage fDNA sample was called as homozygous and the high-coverage gDNA sample was called as heterozygous (77.7% and 74.4% of discordant sites in LIT and HAP, respectively; Figure S7). Further, among all sites, the fDNA genotype captured at least one of the alleles from the gDNA genotype in 99.8% (LIT) and 99.6% (HAP) of cases (Figure S7). Thus, even when genotypes called in fDNA and gDNA samples from the same individual were discordant, they were almost always compatible.
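The distinction drawn above between concordant and merely compatible genotypes can be made concrete with a minimal Python sketch. It assumes biallelic sites coded as alternate-allele counts (0, 1, or 2), with None for sites that failed the quality filter in either sample; the function name and the toy data are hypothetical and are not drawn from the study.

```python
from typing import Optional, Sequence


def compare_genotypes(fdna: Sequence[Optional[int]],
                      gdna: Sequence[Optional[int]]) -> dict:
    """Compare paired genotype calls coded as alternate-allele counts (0/1/2).

    concordant: identical calls.
    compatible: calls share at least one allele, i.e. anything except
                opposite homozygotes (0 vs. 2).
    hom_vs_het: discordant sites where the fDNA call is homozygous and the
                gDNA call is heterozygous (the pattern expected when only
                one allele is sampled at low coverage).
    """
    # Keep only sites called in both samples.
    sites = [(f, g) for f, g in zip(fdna, gdna) if f is not None and g is not None]
    n = len(sites)
    if n == 0:
        raise ValueError("no sites called in both samples")
    concordant = sum(f == g for f, g in sites)
    compatible = sum(abs(f - g) < 2 for f, g in sites)
    hom_vs_het = sum(f in (0, 2) and g == 1 for f, g in sites)
    discordant = n - concordant
    return {
        "n_sites": n,
        "concordance": concordant / n,
        "compatibility": compatible / n,
        "hom_vs_het_among_discordant": hom_vs_het / discordant if discordant else 0.0,
    }


# Toy data: the last site is uncalled in the fDNA sample and is skipped.
toy_fdna = [0, 1, 2, 0, 2, None]
toy_gdna = [0, 1, 1, 2, 2, 1]
summary = compare_genotypes(toy_fdna, toy_gdna)
```

In the toy data, three of five shared sites are concordant, one discordant site is a homozygous fDNA call against a heterozygous gDNA call (still compatible), and one is an opposite-homozygote mismatch (incompatible).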

Further, we found that genotypes called from the postcapture fDNA libraries were more similar to the genotypes called from their high-quality gDNA counterparts than they were to those from other postcapture fDNA libraries. Specifically, k0 values from lcMLkin (Lipatov et al. 2015), which estimate the probability that two samples share no alleles that are identical by descent, were much smaller for the LITfDNA–LITgDNA paired samples (0.487) and HAPfDNA–HAPgDNA paired samples (0.243) than the k0 values calculated for the two blood-derived samples when compared to any other fDNA sample (k0 range LITfDNA vs. other fDNA samples = 0.996–1.000; Z = 849.2, P < 10⁻²⁰; k0 range HAPfDNA vs. other fDNA samples = 0.786–0.999; Z = 10.6, P < 10⁻²⁰; Figure 4A).

Figure 4.


Postcapture genotype data are consistent with individual identity and pedigree relationships. (A) The k0 values for the HAP and LIT fDNA–gDNA paired samples (arrows) were significantly lower than the range of k0 values for LITfDNA and HAPfDNA vs. any other fDNA sample (gray distribution). Lower k0 values reflect increased relatedness (i.e., decreased probability of no IBD sharing). (B) Estimated dyadic coefficient of relatedness values (range: 0–1) were correlated with independently obtained pedigree relatedness values calculated using the R package kinship2 (Sinnwell et al. 2014) (Pearson’s r = 0.73, P < 10⁻¹⁶). The blue line shows the best-fit slope and intercept from the linear model. Both k0 and the estimated relatedness values were calculated with lcMLkin (Lipatov et al. 2015).

For the 48 extended-pedigree individuals (Figure 1, including 8 Amboseli baboons with no known relatives in the pedigree), we then tested whether the estimated coefficient of relatedness values, r, from lcMLkin (Lipatov et al. 2015) in the postcapture data (range: 0–1, or 2× the kinship coefficient) were correlated with coefficient of relatedness values obtained from the independently constructed pedigree (based on known mother–offspring relationships and microsatellite-based paternity assignments: see Methods). Using a filtered set of 127,654 single-nucleotide variants (see Methods), we found a strong correlation between the two measures (Pearson’s r = 0.73, P < 10⁻¹⁶; Figure 4B). This correlation improved further if we imposed thresholds for the minimum number of sites genotyped in both individuals (“shared sites”) in a dyad (Figure S8). For example, if we removed all dyads with <2000 shared sites (84 of 1128 dyads or 7.4%), the correlation between pedigree relatedness and genotype similarity reached Pearson’s r = 0.86 (P < 10⁻¹⁶). Notably, for one individual we prepared and sequenced capture libraries from two independently collected fecal samples (libraries AMB_018 and AMB_040). For these biological replicates, the pairwise relatedness value was 0.774, more than twice as high as for any other pair of relatives (range of estimates for parent–offspring and full-sib pairs typed at ≥2000 sites: 0.10–0.38). Thus, our methods readily distinguish replicate samples (which can be inadvertently collected, especially in unhabituated populations) from those collected from distinct individuals, even close relatives.
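The relationship between the IBD-sharing coefficients (k0, k1, k2), the coefficient of relatedness r, and the kinship coefficient can be written out explicitly. The sketch below uses the standard definitions for non-inbred diploid individuals, r = k1/2 + k2 and kinship = r/2; it is not lcMLkin's code, and the function names and the table of pedigree expectations are illustrative assumptions.

```python
def relatedness(k0: float, k1: float, k2: float) -> float:
    """Coefficient of relatedness from IBD-sharing probabilities.

    k0, k1, k2 are the probabilities that two individuals share 0, 1, or 2
    alleles identical by descent at a locus; they must sum to 1.
    """
    if abs(k0 + k1 + k2 - 1.0) > 1e-9:
        raise ValueError("k0 + k1 + k2 must equal 1")
    return 0.5 * k1 + k2


def kinship(k0: float, k1: float, k2: float) -> float:
    """Kinship coefficient, which is half the coefficient of relatedness."""
    return relatedness(k0, k1, k2) / 2.0


# Pedigree expectations for some non-inbred relationships (k0, k1, k2).
EXPECTED_K = {
    "unrelated": (1.0, 0.0, 0.0),
    "half-siblings": (0.5, 0.5, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "parent-offspring": (0.0, 1.0, 0.0),
    "same individual": (0.0, 0.0, 1.0),
}

expected_r = {name: relatedness(*ks) for name, ks in EXPECTED_K.items()}
```

Under these expectations, replicate samples from one individual have r = 1 and parent–offspring or full-sibling pairs have r = 0.5. The values estimated here from low-coverage data (0.774 for the replicate pair; 0.10–0.38 for parent–offspring and full-sibling pairs) fall below those expectations but preserve the ordering.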

Paternity inference using WHODAD

Current methods for assigning paternity [e.g., CERVUS (Marshall et al. 1998; Kalinowski et al. 2007) and exclusion (Chakraborty et al. 1974)] assume genotype certainty, such that individuals are assigned a deterministic genotype at each locus (i.e., 0, 1, or 2 or a microsatellite repeat number; while a low level of measurement error due to sample mishandling can be modeled, this error rate is held constant across genotype calls). This assumption is violated in low-coverage sequencing data, in which genotypes are not known with certainty and this uncertainty varies across genotype calls. However, the relative probabilities of each genotype can be estimated, given estimated population allele frequencies and sequencing coverage information. To conduct paternity inference and pedigree reconstruction in this context, we therefore developed a novel approach to integrate information across low-coverage sites, implemented in the program WHODAD. Our method has two components. The first component identifies a top candidate male and tests whether he is significantly more related to the offspring than any other candidate male, using a P-value criterion. The second component tests whether the dyadic similarity between the top candidate and offspring is consistent with a parent–offspring dyad, using posterior probabilities obtained from a mixture model (see Methods and Figure S9).
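To illustrate why genotypes are uncertain at low coverage, and how relative genotype probabilities can be estimated from read counts and population allele frequencies, the following Python sketch computes posterior genotype probabilities at one biallelic site. It is a textbook calculation and not WHODAD's model: it assumes a binomial read-sampling likelihood with a fixed per-read error rate and a Hardy–Weinberg prior, and all names and parameter values are hypothetical.

```python
from math import comb
from typing import List


def genotype_posteriors(n_ref: int, n_alt: int, alt_freq: float,
                        error_rate: float = 0.01) -> List[float]:
    """Posterior probabilities of genotypes 0, 1, 2 (alternate-allele counts).

    Likelihood: each read independently shows the alternate allele with
    probability error_rate, 0.5, or 1 - error_rate for genotypes 0, 1, 2.
    Prior: Hardy-Weinberg proportions given the alternate-allele frequency.
    """
    n = n_ref + n_alt
    p_alt_read = (error_rate, 0.5, 1.0 - error_rate)
    q = alt_freq
    priors = ((1.0 - q) ** 2, 2.0 * q * (1.0 - q), q ** 2)
    likelihoods = [comb(n, n_alt) * p ** n_alt * (1.0 - p) ** n_ref
                   for p in p_alt_read]
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("data are impossible under the supplied parameters")
    return [j / total for j in joint]


# One reference read, no alternate reads, alternate-allele frequency 0.3.
single_read = genotype_posteriors(n_ref=1, n_alt=0, alt_freq=0.3)

# Ten reference reads, no alternate reads, same allele frequency.
ten_reads = genotype_posteriors(n_ref=10, n_alt=0, alt_freq=0.3)
```

With a single reference read, the heterozygous genotype keeps a posterior probability of roughly 0.30 under these assumptions, whereas ten reference reads reduce it to well below 0.01. This coverage-dependent uncertainty is the property that methods assuming a single deterministic genotype per locus cannot represent.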

Using WHODAD, we assigned paternity to all father–offspring pairs (n = 27) represented in the independently established extended pedigree in Figure 1. This approach is conservative because it departs from the usual practice of first identifying a likely set of candidate fathers based on demographic and prior pedigree information (the approach used in producing the pedigree in Figure 1). For 15 of the 27 offspring, we produced genotype data from the known mother with our enrichment protocol. WHODAD identified the same father as shown in the pedigree in 12 of these 15 trios (80%); in the other 3 trios (20%), no candidate male satisfied WHODAD’s paternity assignment criteria (in all 3 of these cases, sequencing coverage was very low for either the pedigree-identified father or offspring: 0.04–0.17×). For the remaining 12 offspring, we did not generate genotype data using our enrichment protocol for their mothers. To test all 27 father–offspring dyads together, we therefore reran WHODAD, excluding maternal genotype information. In this setting, WHODAD’s paternity assignments agreed with the pedigree data in 22 of 27 (81%) cases (Figure 5). Notably, when the pedigree-identified father was included in the data set, WHODAD never assigned paternity to a different male, whether or not maternal genotype data were available. Because our method is highly robust to exclusion of maternal genotype data, we therefore performed all subsequent analyses assuming maternal genotype data were not available, a scenario that may often occur in studies of natural populations.

Figure 5.


Paternity inference with WHODAD using low-coverage genotype calls. (1) When all males (n = 34) were included in the pool of candidate fathers (top bar), WHODAD assigned paternity to the same father identified in the pedigree for 22 of 27 (81%) of offspring (dark blue; see assignment criterion in Methods). The remaining offspring were not assigned a father based on WHODAD’s assignment criteria, most likely due to low sequencing coverage (5 of 27; light blue). (2) WHODAD’s accuracy was identical when we removed all close male relatives of the offspring (r ≥ 0.25) from the pool of candidate fathers. (3) When we removed all close relatives, including fathers, from the candidate pool, no fathers were assigned, as expected. (4) Finally, when we removed the father from the candidate pool but retained close relatives, our method incorrectly assigned paternity to 11% of offspring (3 of 27; bottom bar). All three incorrectly assigned fathers were closely related to the offspring (in two cases the assigned father was the half-brother of the offspring and in one case the assigned father was the son of the offspring).

The presence of close relatives, such as full- or half-siblings, can influence the accuracy of paternity assignment if these close relatives are also included as candidate fathers (Thompson and Meagher 1987; Marshall et al. 1998; Olsen et al. 2001; Ford and Williamson 2010). Thus, to examine how the presence of close male kin influenced the accuracy and confidence of WHODAD’s paternity assignments, we conducted three additional analyses. First, when all close male kin were removed from the candidate list of potential fathers (r ≥ 0.25), but the father was retained, our method performed equivalently to the case when both father and close relatives were in the candidate pool. Second, when we removed all close male kin including the father, none of the best candidate fathers from the conditional probability analysis (0%) were assigned as fathers based on WHODAD’s assignment criteria (Figure 5). Third, when we removed the father from the pool of candidate fathers, but included close male kin, 11% of the best remaining candidates (3 of 27 cases) were incorrectly assigned as fathers, based on comparison to the pedigree (Figure 5). All 3 of these false positives were close male relatives: in two cases WHODAD assigned the half-brother of the offspring as the likely father, and in one case WHODAD assigned the son of the offspring as the likely father. The best balance between maximizing the number of true positives while minimizing the number of false positives was achieved by combining both the P-value and mixture model criteria (see Methods). This approach outperformed either component used alone (Figure S10). For example, when all males were included in the candidate pool, the combined approach resulted in an 81% true positive rate and a 0% false positive rate, while using just the k0 values in a mixture model resulted in the same true positive rate (81%), but an additional 11% false positive rate (Figure S10).

Discussion

Our capture-based method strongly enriches the proportion of host DNA in low-quality DNA extracted from feces (fDNA). Our method is the first use of genome-wide enrichment-based capture methods (Carpenter et al. 2013; Enk et al. 2014; Ávila-Arcos et al. 2015) for noninvasively collected samples, which represent a major resource for behavioral, conservation, and evolutionary genetic studies in natural populations. Importantly, our protocol increases efficiency and lowers cost by reducing the input requirements (<1 μg) and number of PCR cycles relative to previous methods (Perry et al. 2010) and, in our final protocol, achieves up to 40-fold enrichment of postcapture endogenous DNA relative to precapture levels. We also show, for the first time since Perry et al. (2010), that capture libraries from low-quality samples produce genotype data that are highly concordant with genotype data derived from high-quality, noncaptured samples from the same individuals.

We anticipate that data generated through this protocol could be leveraged for a wide variety of applications. To illustrate this point for paternity analysis, we present an accompanying method, WHODAD, that produces results in near-perfect concordance with an independently constructed pedigree, using low-coverage data generated with our enrichment protocol. By incorporating prior information about pedigree links or other demographic and behavioral data, or by sequencing very low-coverage samples to additional depth (similar to typing more markers in conventional microsatellite analysis), its performance would be improved even further. For instance, in reconstructing pedigree links in the Amboseli population, we generally include only plausible candidates (e.g., we exclude males who were immature or not yet born at the offspring’s conception), not all males with genotype data, as we did here.

Together, these results provide valuable, accessible wet laboratory and computational tools for moving studies of difficult-to-sample natural populations forward into the genomics era. Importantly, our methods can be generalized to produce low-complexity DNA-depleted RNA baits for any species in which at least one high-quality DNA sample is available [or potentially a closely related species (Enk et al. 2014)]. Further, our results show that WHODAD is highly accurate for pedigree reconstruction even when the reference genome is not a high-quality chromosomal assembly (here, we used 33,120 contigs from Pcyn1.0) or, based on exploratory analyses, even from the same species. Specifically, when mapping to the reference genome for the rhesus macaque [MacaM (Zimin et al. 2014)] instead of baboon, which diverged from baboons 6–8 MYA (Steiper and Young 2006), WHODAD produced similarly accurate paternity assignments (21 of 27 fathers were correctly assigned using our recommended statistical thresholds compared to 22 of 27 when mapping to Pcyn1.0; there were no false positive assignments in either case).

Costs of performing the protocol

At the time of publication, using the same reagents as we used here and sourced from the same locations, the costs of generating these data are ∼$60 per sample (excluding sequencing costs). Because our method does not require the commercial synthesis of targeted capture probes, the majority of the costs are accounted for by the streptavidin-coated Dynabeads ($11 per preparation), RNA baits ($5 per sample), and high-sensitivity Bioanalyzer chips for quality control ($9 per sample). Replacing AMPure XP beads with homemade SPRI beads would reduce the per-sample costs considerably, as would pooling adapter-ligated fDNA samples prior to hybridization (instead of posthybridization, as reported here). For a multiplexed pool of 10 samples, we estimate that using these two strategies would result in a per-sample cost of ∼$29. Indeed, we have verified that multiplexing samples prior to hybridization does not result in loss of capture efficiency and actually resulted in improved yield of mapped, non-PCR duplicate reads (∼61% of reads; mean of 117-fold enrichment, range = 54.8–257.2-fold; Figure S11A), although it did result in more uneven coverage of samples sequenced within a pool (Figure S11B) and raises the possibility of barcode swapping [which can be managed using dual barcoding approaches (Kircher et al. 2012)]. Multiplexing also has the advantage of reducing the amounts of input DNA per sample and the number of PCR cycles required for the initial library preparation step. We are currently pursuing improvements to the protocol along these lines.

Based on achieving 40% non-PCR duplicate, mapped reads after capture (the mean result for capture 2 samples), we estimate that the sequencing costs of a 1× genome for baboon (∼2.9 Gb) would be ∼$200 (based on paired-end, 125-bp sequencing at $2000 per lane and exclusion of PCR duplicates). This cost per sample is approximately twice the cost of genotyping 14 microsatellites from the same fDNA sample (the previous strategy for the main study population, the Amboseli baboons; Van Horn et al. 2008) but provides substantially more genetic information. These estimates will drop further as the cost of high-throughput sequencing continues to fall, making application of our approach to whole populations increasingly feasible. Our finding that useful sequencing reads do not asymptote with deeper sequencing (Figure 3) also suggests the feasibility of producing a high-quality, high-coverage genome from such samples if one were to sequence more deeply. This approach would alleviate cases in which both alleles at a truly heterozygous site were not observed due to low sequencing depth (for example, a site covered by a single read can reveal only one of the two alleles). Notably, however, it would not fix “allelic dropout” problems in which an allele was not represented in the pool of sequenceable fragments (Pompanon et al. 2005). Analogous to the solution in noninvasive microsatellite typing, multiple, independent PCR reactions could be used to solve this problem.
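The arithmetic behind the sequencing estimate can be sketched as follows. This is a rough check rather than a costing tool: it assumes that every usable base contributes to coverage, and the samples-per-lane and lane-yield values are back-calculated from the quoted figures rather than reported in the text.

```python
# Figures quoted in the text.
GENOME_SIZE_BP = 2.9e9         # approximate baboon genome size
TARGET_COVERAGE = 1.0          # 1x genome
USABLE_FRACTION = 0.40         # mapped, non-PCR-duplicate reads after capture
BASES_PER_READ_PAIR = 2 * 125  # paired-end, 125-bp reads
LANE_COST_USD = 2000.0
QUOTED_COST_USD = 200.0        # per-sample sequencing estimate


def raw_bases_needed(genome_size_bp, coverage, usable_fraction):
    """Raw bases to sequence so that usable bases reach the target coverage."""
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return genome_size_bp * coverage / usable_fraction


raw_bases = raw_bases_needed(GENOME_SIZE_BP, TARGET_COVERAGE, USABLE_FRACTION)
read_pairs = raw_bases / BASES_PER_READ_PAIR

# Derived here from the quoted cost; neither value is reported in the text.
samples_per_lane = LANE_COST_USD / QUOTED_COST_USD
implied_lane_yield_bp = raw_bases * samples_per_lane

print(f"raw bases per sample:     {raw_bases:.3e}")
print(f"read pairs per sample:    {read_pairs:.3e}")
print(f"implied samples per lane: {samples_per_lane:.1f}")
print(f"implied lane yield (bp):  {implied_lane_yield_bp:.3e}")
```

Under these assumptions, a 1× genome requires roughly 7.25 Gb of raw sequence per sample, so the quoted cost corresponds to about 10 samples sharing a lane.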

Finally, to make the current protocol as cost effective as possible, we recommend that researchers use qPCR to choose DNA samples with the highest proportion of host DNA possible—the strongest predictor of the fold-change enrichment in endogenous DNA postcapture vs. precapture (Figure 2D).

Assigning paternity using WHODAD

The lack of available tools for working with low-coverage genomic data—realistically, one of the most likely data types to be produced for studies of natural populations—represents a major barrier to moving from low-throughput marker genotyping to genome-scale analyses. The pedigree structure of a study population is fundamental to understanding its genetic structure and social organization. However, current methods for pedigree reconstruction are unable to cope with high levels of genotype uncertainty. The approach we have implemented in WHODAD takes this uncertainty into account, suggesting one simple application for the wet laboratory methods presented here. Indeed, our method performed well when compared to an independently constructed extended pedigree, with its major challenges—differentiating between close relatives in a candidate pool—comparable to those reported for existing software (Marshall et al. 1998; Olsen et al. 2001; Kalinowski et al. 2007; Ford and Williamson 2010). Importantly, while analyses of pedigree structure using previously available methods are greatly aided by prior knowledge of mother–offspring relationships (Kalinowski et al. 2007), maternal links do not appear to be necessary for WHODAD analyses, which perform well even when no maternal information is available (Figure 5, Figure S9).

Conclusions

High-throughput sequencing approaches solve one problem of working with low-quality, noninvasive samples: the sheared nature of the original samples. Capture approaches have demonstrated great promise for solving the second major problem—large proportions of nonendogenous DNA—since the results published by Perry et al. (2010). Our results help to fulfill this promise by providing methods to perform cost-effective sequence capture from noninvasive samples on a genome-wide scale, coupled with analytical methods to deal with the resulting data (we note that our protocols could also be explored for broader application to aDNA samples). For questions in which investigators are specifically interested in variants in a priori-defined subsets of the genome [e.g., the exome (Vallender 2011; George et al. 2011)], targeted capture with synthesized baits may still be the best option. However, for the many types of analyses that use genome-scale data [e.g., local ancestry analysis, genome-wide scans for selection, and reconstruction of population demographic history (Sabeti et al. 2002; Huang et al. 2007; Li et al. 2007; Sankararaman and Sridhar 2008; Price et al. 2009; Durand et al. 2011; Li and Durbin 2011; Yang et al. 2011; Ma et al. 2014)], our approach will be more useful, especially as the costs of high-throughput sequencing continue to fall.

Here, we focused specifically on DNA obtained from fecal samples, which are one of the most commonly collected types of noninvasive samples: they contain information not only about host genetics, but also about endocrinological parameters (Palme 2005), gut microbiota (Ley et al. 2008), parasite burdens (Gillespie 2006), and gene expression levels (Knight et al. 2014). The sample banks already available for many natural populations thus open the door to population and evolutionary genomic studies in species in which such analyses were previously impossible. As the costs of data generation continue to fall, and the limiting factor for many studies becomes high-quality phenotypic data, we envision that such studies will rapidly move far beyond the simple analyses of paternity and pedigree structure reported here.

Methods

Bait generation

Similar to Carpenter et al. (2013), we use a cost-effective in vitro synthesis method based on T7 RNA polymerase amplification of sheared DNA from a high-quality sample (Figure S1A). We extracted genomic DNA from a blood sample collected from an olive baboon (P. anubis) that was unrelated to any of the individuals in the samples we wished to enrich. To generate baits, we sheared 5 μg of purified DNA to a mean fragment size of 150 bp and then end repaired and A-tailed the fragments, using the KAPA DNA Library Preparation Kit for Illumina Sequencing. We purified the resulting reaction, using a 1.8× ratio of AMPure beads to sample volume.

We annealed custom adapters to the A-tailed library by incubating the following reagents for 15 min at 20°: 10 μl 5× ligation buffer (KAPA Biosystems), 5 μl DNA Ligase (KAPA Biosystems), 1 μl 25 μM custom adapter, ≤34 μl of A-tailed DNA, and H2O up to 50 μl total volume. The custom adapters we used (EcoOT7dTV, Fwd 5′-GGAAGGAAGGAAGAGATAATACGACTCACTATAGGGCCTGGT; EcoOT7dTV, Rev 5′-/5Phos/CCAGGCCCTATAGTGAGTCGTATTATCTCTTCCTTCCTTCC) differ from those used in other protocols (Carpenter et al. 2013; Enk et al. 2014; Ávila-Arcos et al. 2015). Specifically, they contained (1) a T7 RNA polymerase recognition site, (2) flanking sequence that improves T7 transcription efficiency (Moll et al. 2004), and (3) an EcoO109I restriction enzyme cut site that allowed us to later cleave off the adapter sequence from T7 amplified RNAs (rather than blocking it, as in Carpenter et al. 2013).

We then digested the purified, adapter-ligated DNA with DSN (Axxora). DSN is a Kamchatka crab-derived enzyme that specifically degrades double-stranded DNA but not single-stranded DNA, allowing us to take advantage of DNA reassociation kinetics to reduce the representation of repetitive regions in the bait set (Figure S2; Shagina et al. 2010). We performed DSN digestion in fifteen 2-μl aliquots, each mixed with 1 μl 4× hybridization buffer [200 mM HEPES (pH 7.5), 2 M NaCl, 0.8 mM EDTA] and 1 μl human Cot-1 DNA (1 μg/μl). We denatured the DNA by heating to 98° for 3 min; held the reaction at 68° for 4 hr; and then added 4 μl H2O, 1 μl 10× DSN buffer, and 1 μl DSN (1 unit/μl) to the reaction. After 20 min of digestion, we stopped the reaction by adding 5 μl 2× DSN Stop Solution (10 mM EDTA) and purified it with 2.4× AMPure beads.

Next, we used Klenow DNA polymerase to blunt end the nondigested DNA, size selected for 200- to 300-bp fragments on a 2% agarose gel, and purified the size-selected fraction using the Zymoclean Gel DNA Recovery Kit (Zymo Research). After purification the aliquots were PCR amplified for 16 cycles, using 25 μl 2× HiFi Hot Start ReadyMix (KAPA Biosystems) and 1 μl each of 25 μM primers EcoOT7_PCR1 (5′-GGAAGGAAGGAAGAGATAATACGACTCACT) and EcoOT7_PCR2 (5′-TACGACTCACTATAGGGCCTGGT). Following amplification the bait DNA libraries were purified using 1.8× AMPure beads and the resulting product was visualized on a Bioanalyzer DNA 1000 chip (Agilent Technologies).

Finally, we in vitro transcribed the DNA libraries to construct biotin-tagged RNA baits, using the MEGAshortscript Kit (Life Technologies) and Biotin-UTP (Illumina). Briefly, 125–150 nM of DNA baits were incubated at 37° for 4 hr in the following reaction: 2 μl T7 10× reaction buffer; 2 μl each of T7 ATP, GTP, CTP, and UTP solutions (75 mM); 1 μl Biotin-UTP (50 mM); 2 μl T7 enzyme mix; and water to 20 μl total volume. We then digested the DNA template by adding 1 μl TURBO DNase (Life Technologies) to the reaction and incubating it at 37° for 15 min. We purified the resulting reaction with the MEGAclear Transcription Clean-Up Kit (Life Technologies) and eluted it in a final volume of 70 μl. To cleave off the adapter sequence, we digested the RNA baits with the EcoO109I enzyme (NEB). Finally, the baits were again purified with the MEGAclear Clean-Up Kit, eluted in 70 μl, and quantified on a Bioanalyzer RNA 6000, Eukaryote Total RNA chip (Agilent Technologies).

Samples, DNA extraction, and qPCR quantification

Baboon samples from Amboseli (the main study population) or West Africa (8 unhabituated Guinea baboons) were collected, stored, and extracted as detailed in Table S2. For LIT and HAP, gDNA was extracted from blood samples, using the QIAGEN (Valencia, CA) Maxi Kit. The majority of the sampled Amboseli individuals (48 of 54) were either members of a single extended pedigree or unrelated males living in the same study population (Figure 1). We assessed the proportion of endogenous DNA in each fDNA sample, using qPCR against the c-myc gene, as described in Morin et al. (2001).

Library preparation

All samples were fragmented to the desired size (200 or 400 bp; see Table S1), using a Bioruptor instrument (Diagenode). Illumina sequencing libraries were then generated from the fragmented DNA, using either the KAPA DNA library kits for Illumina (capture 1) or the NEBNext DNA Ultra library kit (capture 2; see Table S1). Libraries were amplified for 6 PCR cycles prior to capture-based enrichment. Sample-specific details of library preparation and sequencing results are described in Table S1. Note that we changed several steps between capture 1 and capture 2 based on interim improvements in the protocol (also detailed in Table S1). Because the methods used in capture 2 were ultimately more effective, the updated capture 2 protocol is described in the Methods section except where explicitly noted.

Capture-based enrichment

We modified the capture methods from Gnirke et al. (2009) and Perry et al. (2010) (Figure S1B). For each capture, we hybridized 121–626 ng of the fDNA libraries generated as described above to the RNA baits. First, we mixed each fDNA library with 2.5 μl human Cot-1 DNA (1 mg/ml), 2.5 μl salmon sperm DNA (1 mg/ml), and 0.6 μl index-blocking reagent (“IBR”) (50 μM). This mixture was incubated for 5 min at 95° followed by 12 min at 65°. Next, we added 13 μl of hybridization buffer (10× SSPE, 10× Denhardt’s solution, 10 mM EDTA, 0.2% SDS, preheated to 65°) and 7 μl of hybridization bait mixture (1 μl SUPERase-In, 750 ng RNA baits, and water up to a total volume of 7 μl, preheated to 65°) to the fDNA mixture and incubated the complete mixture at 65° for 48 hr (see Figure S12 for comparison of alternative bait concentrations and incubation times).

After incubation, we purified the enriched fDNA sample, using 50 μl Dynal MyOne Streptavidin T1 beads (Invitrogen, Carlsbad, CA). To do so, the beads were washed a total of three times with 200 μl binding buffer [1 M NaCl, 10 mM Tris-HCl (pH 7.5), 1 mM EDTA] and resuspended in 200 μl binding buffer. Next, the entire fDNA/RNA hybridization mix was added to the 200-μl Dynal MyOne Streptavidin T1 bead and binding buffer slurry. We incubated this mixture at room temperature for 30 min on an Eppendorf Thermomixer at 700 rpm. The mixture was placed on a magnetic rack, the supernatant was discarded, and the beads were washed once with 500 μl low-stringency wash buffer (1× SSC, 0.1% SDS) followed by a 15-min incubation at room temperature. The beads were then washed three times with 500 μl high-stringency wash buffer (0.1× SSC, 0.1% SDS) with a 10-min room temperature incubation between each wash. After the final wash, the enriched fDNA fraction was eluted from the beads with 50 μl elution buffer (0.1 M NaOH), transferred to a new tube containing 70 μl “neutralization buffer” (1 M Tris-HCl, pH 7.5), purified with 1.8× AMPure beads, and eluted in a 30-μl volume. A final PCR was carried out in a 50-μl reaction volume, using 23 μl of the posthybridization fDNA and either (1) 25 μl 2× KAPA High Fidelity master mix and 2 μl TruSeq universal primer (capture 1) or (2) 25 μl 2× NEBNext High Fidelity PCR master mix, 1 μl universal PCR primer, and 1 μl NEB indexing primer (capture 2). After 12 PCR cycles the final reaction was purified with 1× AMPure beads, eluted in 20 μl H2O, and visualized on a Bioanalyzer High Sensitivity DNA chip.

Sequencing and alignment

All high-throughput sequence generation was conducted on the Illumina HiSeq platform (see Table S1 for sequencing details). The resulting sequencing reads were mapped to a de novo assembly of the P. cynocephalus genome (Wall et al. 2016) (alignment available at https://abrp-genomics.biology.duke.edu/index.php?title=Other-downloads/Pcyn1.0), using the default settings of the bwa mem alignment algorithm v0.7.4-r385 (Li 2013). Reads that mapped to scaffold 10204 of Pcyn1.0 were assigned to mitochondrial DNA due to scaffold 10204's similarity (97% sequence similarity) to a published P. anubis mitochondrial genome (NCBI GenBank accession no. KC757406.1). Duplicate reads were marked and discarded in subsequent analyses, using the “MarkDuplicates” function in PicardTools (http://picard.sourceforge.net). To facilitate comparison across samples of differing coverage, and because coverage of the gDNA samples was much higher (∼30×) than for the fDNA samples for LIT and HAP (1.4× and 0.27×, respectively), we downsampled the gDNA libraries to 0.73× coverage (the median coverage of samples in capture 2), using “DownsampleSam” in PicardTools.

Comparison of sequencing data sets

In several analyses, we compared our capture-based enrichment results to two independent data sets: (i) a previously published capture-based enrichment of aDNA samples [NCBI SRA accession no. SRP042225 (Carpenter et al. 2013)] and (ii) shotgun sequencing from six capture 1 fDNA samples prior to hybridization (“precapture”; Table S1). The aDNA samples were aligned to the human genome (hg38) and the precapture fDNA samples were mapped to the de novo Pcyn1.0 genome assembly.

Library complexity, distribution of reads, and GC content

We calculated the complexity of each library, using two methods. First, we used the ENCODE Project’s PCR bottleneck coefficient (PBC), which is the proportion of nonduplicate mapped reads among the total number of mapped reads (Kharchenko et al. 2008; Landt et al. 2012). The PBC ranges from 0 to 1, where more complex libraries have higher values. Second, we used the function “c_curve” from the program preseq (v1.0.2) to plot the number of nonduplicate fragments mapped vs. the number of total mapped fragments (Daley and Smith 2013). More complex libraries (i.e., those with fewer duplicate fragments) have a c_curve slope closer to 1, meaning that increasing sequencing depth continues to provide novel information. Less complex libraries have a shallower slope and asymptote at smaller values. Finally, we evaluated the GC bias for each sequencing library, using Picard Tools’ “CollectGCBiasMetrics” (http://picard.sourceforge.net).
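Both complexity measures reduce to simple counts, as the following minimal sketch illustrates. It follows the PBC definition given in the text and tallies an observed complexity curve in read order; it does not reproduce preseq's own subsampling or extrapolation, and the fragment labels are hypothetical.

```python
def pcr_bottleneck_coefficient(nonduplicate_mapped, total_mapped):
    """Proportion of mapped reads that are not duplicates (0 to 1)."""
    if total_mapped <= 0:
        raise ValueError("total_mapped must be positive")
    if not 0 <= nonduplicate_mapped <= total_mapped:
        raise ValueError("nonduplicate_mapped must lie in [0, total_mapped]")
    return nonduplicate_mapped / total_mapped


def observed_complexity_curve(fragment_ids, step):
    """Return (total fragments seen, distinct fragments seen) every `step` reads.

    fragment_ids: one hashable label per mapped fragment (e.g., its mapping
    coordinates); identical labels are treated as duplicates.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    seen = set()
    curve = []
    for n, fragment in enumerate(fragment_ids, start=1):
        seen.add(fragment)
        if n % step == 0:
            curve.append((n, len(seen)))
    return curve


# Hypothetical toy library: six mapped fragments, four of them distinct.
toy_fragments = ["a", "b", "a", "c", "c", "d"]
print(pcr_bottleneck_coefficient(4, 6))
print(observed_complexity_curve(toy_fragments, step=2))
```

A curve that keeps rising with slope near 1 indicates that additional sequencing is still yielding new fragments, whereas a flattening curve indicates a library approaching saturation.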

Sample attributes influencing capture efficiency

To determine the sample attributes that predicted the success of our capture protocol, we first modeled the relationship between the proportion of nonduplicate reads that mapped to the baboon genome after capture (our primary measure of protocol success) and (i) the percentage of endogenous baboon DNA in the precapture samples, (ii) the amount of fDNA library (nanograms) that went into the capture, and (iii) whether the sample was captured using our initial protocol or the second version of the protocol (i.e., in capture 1 or capture 2). Second, we investigated the relationship between the same three variables and a secondary measure of protocol success, the fold-change enrichment of baboon DNA in the sample precapture vs. postcapture. Precapture concentrations of endogenous DNA in fDNA samples were measured as the concentration of baboon DNA estimated using qPCR, relative to the concentration of total DNA estimated using the Qubit High Sensitivity fluorometer (Life Technologies). To ensure that our qPCR-based measures were well calibrated, we confirmed the relationship between qPCR-based estimates and precapture sequence-based estimates of endogenous DNA in six samples for which both values were available (R² = 0.92; Figure S13). All statistical analyses were carried out in R (R Development Core Team 2015).
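The two ratios used as inputs and outcomes in these models can be written out explicitly. The sketch below is illustrative only: the concentrations are hypothetical, and treating the postcapture proportion as the fraction of nonduplicate reads mapping to the baboon genome is an assumption made here for the example.

```python
def endogenous_fraction(host_dna_conc, total_dna_conc):
    """Precapture proportion of host DNA: qPCR estimate over fluorometric total.

    Both concentrations must be in the same units (e.g., ng/ul).
    """
    if total_dna_conc <= 0:
        raise ValueError("total_dna_conc must be positive")
    return host_dna_conc / total_dna_conc


def fold_enrichment(postcapture_fraction, precapture_fraction):
    """Fold change in the proportion of endogenous DNA, postcapture vs. precapture."""
    if precapture_fraction <= 0:
        raise ValueError("precapture_fraction must be positive")
    return postcapture_fraction / precapture_fraction


# Hypothetical sample: 0.05 ng/ul baboon DNA by qPCR in 5 ng/ul total DNA,
# and 40% of postcapture reads mapped as nonduplicates.
pre = endogenous_fraction(0.05, 5.0)
print(f"precapture endogenous fraction: {pre:.3f}")
print(f"fold enrichment: {fold_enrichment(0.40, pre):.1f}")
```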

Variant calling

We used two different approaches to call variants and genotypes in our sample: SAMTOOLS (Li et al. 2009; Li 2011) and the Genome Analysis Toolkit (GATK) (McKenna et al. 2010; DePristo et al. 2011; Van der Auwera et al. 2013). In downstream analyses, we retained only variants that were identified by both methods, a strategy that produces a higher ratio of true positives to false positives than variants identified by a single method alone (O’Rawe et al. 2013). Duplicate-marked alignments were used as input for both methods. SAMTOOLS multisample variant calling was carried out using mpileup and bcftools, with a maximum allowed read depth (-D) of 100. GATK variant calling was carried out using HaplotypeCaller following the GATK v3.0 Best Practices for variant calling from DNA-seq. To minimize potential batch effects introduced by the two capture efforts, we used the following strategy. First, we called genotypes using reads from each capture independently. Second, we recalled genotypes using reads from both captures together. Third, we extracted the union set of variants called in steps 1 and 2 for downstream analysis.

Because no reference set of genetic variants is currently publicly available for baboons, we used a bootstrapping procedure for base quality score recalibration. Briefly, we performed an initial round of variant calling on read alignments without quality score recalibration. From this variant call set, we extracted a set of high-confidence variants that had a quality score ≥100 and were not flagged by any of the following hard filters: QD < 2.0; MQ < 35.0; FS > 60.0; HaplotypeScore > 13.0; MQRankSum < −12.5; and ReadPosRankSum < −8.0 (as described in Tung et al. 2015). We then recalibrated the base quality scores for each alignment, using this high-confidence set as the database of “known variants,” and repeated the same variant-calling and filtering procedure for three additional rounds. Finally, we identified the intersection set between the variants called from GATK and SAMTOOLS, respectively, using the bcftools function isec (Li et al. 2009). To produce our final call set, we removed all sites that were genotyped in only one of the capture efforts, had a minor allele frequency of <0.05, or were within 10 kb of one another, using vcftools (Danecek et al. 2011). For comparisons between the paired fDNA and gDNA samples, we used the above variant-calling pipeline to jointly genotype all samples sequenced in the study.
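The selection of high-confidence variants for recalibration can be expressed as a single predicate. The sketch below reads the listed annotation thresholds as exclusion criteria (the usual convention for GATK hard filters) and operates on a plain dictionary of annotation values; the function name, the record layout, and the choice to skip annotations that are absent from a record are assumptions made for illustration, not part of the published pipeline.

```python
MIN_QUAL = 100.0

# Each rule returns True when the variant should be excluded.
EXCLUSION_RULES = {
    "QD": lambda x: x < 2.0,
    "MQ": lambda x: x < 35.0,
    "FS": lambda x: x > 60.0,
    "HaplotypeScore": lambda x: x > 13.0,
    "MQRankSum": lambda x: x < -12.5,
    "ReadPosRankSum": lambda x: x < -8.0,
}


def is_high_confidence(qual, annotations):
    """Return True if a variant qualifies for the bootstrapped known-sites set.

    qual: the variant quality score.
    annotations: dict mapping annotation name to a float; annotations that
    are missing from the dict are not tested (an assumption of this sketch).
    """
    if qual < MIN_QUAL:
        return False
    for name, is_excluded in EXCLUSION_RULES.items():
        value = annotations.get(name)
        if value is not None and is_excluded(value):
            return False
    return True


# Hypothetical records.
print(is_high_confidence(250.0, {"QD": 12.0, "MQ": 58.0, "FS": 1.2}))  # kept
print(is_high_confidence(250.0, {"QD": 1.5, "MQ": 58.0}))              # low QD
print(is_high_confidence(50.0, {"QD": 12.0}))                          # low quality
```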

Estimating the coefficient of relatedness

To produce an estimate of relatedness between samples in our pedigree and to test for concordance between fecal and blood-derived samples for the same individuals, we used the program lcMLkin (Lipatov et al. 2015). lcMLkin uses the genotype likelihoods generated by GATK for each genotype call to calculate two measures: (i) k0, the probability that two individuals share no alleles that are identical by descent, and (ii) r, the coefficient of relatedness (Lipatov et al. 2015) (i.e., twice the kinship coefficient). Several other methods have been developed (Manichaikul et al. 2010; Yang et al. 2010) to estimate relatedness from thousands of SNPs, but lcMLkin yielded the best match to pedigree-based estimates in our data set (Figure S14).
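The relationship between identity-by-descent sharing probabilities and r can be made concrete with a short sketch. Here k1 and k2 (the probabilities of sharing one or two alleles identical by descent) are standard notation that is not defined in the text, and the tabulated values are the textbook expectations for outbred pedigrees rather than estimates from this data set.

```python
def relatedness_from_ibd(k0, k1, k2):
    """Coefficient of relatedness r = k1/2 + k2 (twice the kinship coefficient)."""
    if min(k0, k1, k2) < 0 or abs(k0 + k1 + k2 - 1.0) > 1e-6:
        raise ValueError("k0, k1, and k2 must be nonnegative and sum to 1")
    return 0.5 * k1 + k2


# Expected (k0, k1, k2) for common relationships in an outbred population.
EXPECTED_IBD = {
    "parent-offspring": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "half siblings": (0.5, 0.5, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}

for relationship, ks in EXPECTED_IBD.items():
    print(f"{relationship:>16}: k0 = {ks[0]:.2f}, r = {relatedness_from_ibd(*ks):.2f}")
```

Parent–offspring pairs and full siblings share the same expected r but differ in expected k0, which is one reason k0 is informative for separating true parents from other close kin.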

We also compared genotype calls for the matched fecal and blood-derived samples, using GATK’s GenotypeConcordance function (DePristo et al. 2011). This tool allowed us to determine concordance rates between data sets for different classes of variants (e.g., 0, 1, or 2).

WHODAD: paternity inference and pedigree reconstruction

Our paternity prediction model is based on a naive Bayes classifier that takes advantage of the rules of Mendelian segregation within pedigrees. Using data from all sites genotyped in a potential father–mother–offspring trio or, when the mother is not genotyped, all sites genotyped in a potential father–offspring dyad, it estimates the posterior probability that a potential candidate is the true father of a given offspring.

Our approach can be broken into three steps (Figure S9). First, we estimate, for each candidate male, the conditional probability that he is the true father of a given offspring, given the genotype data for the candidate, offspring, and mother, if known (below we show the case in which genotype information is available for the mother, but the model is similar when maternal genotype information is missing). Second, we assign a P-value for the top candidate male from the first step, for the null hypothesis that he is not more related to the focal offspring than the other candidates tested. Third, we calculate the probability that the genotype data for the top candidate and offspring are consistent with a true parent–offspring relationship, using a mixture model. Steps 2 and 3 perform subtly different functions in our analysis: step 2 tests that the top candidate is significantly more related to the offspring than any other candidate, whereas step 3 tests that the dyadic similarity between the candidate and the offspring looks as expected for parent–offspring dyads. We have found that combining both approaches is key to detecting true positive fathers while minimizing false positive calls that can occur when true fathers are not in the pool of genotyped candidates (Figure S10).

Step 1: estimating conditional probabilities for each trio:

For a given offspring or mother–offspring dyad, our goal is to infer the true genetic father from a pool of $n$ candidates. For the $i$th candidate, we use data for the $L_i$ variants for which we have genotype information for the known mother–offspring dyad and for the candidate father. Assuming the true father is present in the candidate pool (i.e., he has been genotyped), the probability that the $i$th potential candidate is the father is

$$P(F_i \mid M, O) = \frac{P(F_i, M, O)}{\sum_{k=1}^{n} P(F_k, M, O)}, \tag{1a}$$

where $P(F_i \mid M, O)$ denotes the probability that the candidate is the father, conditional on the (known) mother–offspring dyad; $P(F_i, M, O)$ denotes the joint probability of the whole trio; and $\sum_{k=1}^{n} P(F_k, M, O)$ is the sum of the joint probabilities of all possible trios evaluated in the analysis. In practice, we normalize these conditional probabilities to take into account differences in the number of variants evaluated for each trio by taking the $L_i$th root:

$$P(F_i \mid M, O) \approx \frac{P(F_i, M, O)^{1/L_i}}{\sum_{k=1}^{n} P(F_k, M, O)^{1/L_k}}. \tag{1b}$$
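Because joint probabilities over thousands of sites underflow ordinary floating-point numbers, the root-and-normalize step is most easily carried out on log probabilities. The following is a minimal sketch of that calculation, not WHODAD's implementation; the function name and the input values are hypothetical.

```python
import math


def normalized_paternity_scores(log_joint, n_sites):
    """Normalize per-candidate joint probabilities after taking the L_i-th root.

    log_joint[i] is log P(F_i, M, O) and n_sites[i] is L_i, the number of
    variants evaluated for candidate i. Taking the L_i-th root corresponds
    to dividing the log probability by L_i.
    """
    if len(log_joint) != len(n_sites) or not log_joint:
        raise ValueError("inputs must be nonempty and of equal length")
    scaled = [lp / L for lp, L in zip(log_joint, n_sites)]
    offset = max(scaled)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(s - offset) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example: candidate 0 has the higher raw joint probability only
# because he was evaluated at fewer sites; per-site evidence favors candidate 1.
scores = normalized_paternity_scores([-900.0, -1000.0], [600, 1000])
print([round(s, 3) for s in scores])
```

The example shows why the normalization matters: without the root, candidates genotyped at fewer sites would be favored simply because fewer probabilities are multiplied together.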

Each joint probability can be calculated in turn as

$$P(F_i, M, O) = \sum_{f,m,o} P(F_i, M, O, f, m, o) = \sum_{f,m,o} \prod_{j=1}^{L_i} P(F_i, M, O, f_{ij}, m_j, o_j), \tag{2}$$

where $m_j$, $f_{ij}$, and $o_j$ represent the genotype data for the $j$th variant of the mother, the candidate father, and the offspring, respectively. Genotypes take values in {0, 1, 2} (i.e., the number of copies of the reference allele at each individual–site combination). Importantly, although Equation 2 unrealistically assumes independence across loci, this assumption does not change the relative order of trio joint probabilities.

The probability $P(F_i, M, O, f_{ij}, m_j, o_j)$ for each locus can be further decomposed as

$$P(F_i, M, O, f_{ij}, m_j, o_j) \propto \frac{P(o_j \mid m_j, f_{ij})\, P(f_{ij} \mid F_i)\, P(m_j \mid M)\, P(o_j \mid O)}{P(o_j)}, \tag{3}$$

where we take genotype uncertainty into account by using GATK’s genotype probabilities to calculate the conditional genotype probabilities for $P(f_{ij} \mid F_i)$, $P(m_j \mid M)$, and $P(o_j \mid O)$ over all possible genotype values at each site–individual combination (i.e., the probabilities that each genotype is 0, 1, or 2, which sum to 1). We also ignore the scaling constant $P(F_i)P(M)P(O)$ because it cancels out in the numerator and denominator of (1). The marginal probability of the offspring’s genotype, $P(o_j)$, is calculated from the minor allele frequency of the variant in the population. Finally, the conditional probability $P(o_j \mid m_j, f_{ij})$ is based on the rules of Mendelian transmission (e.g., Marshall et al. 1998). Due to genotype uncertainty in low-coverage data, the values of $P(F_i \mid M, O)$ are small. However, the highest value is usually assigned to the most likely father (based on comparison to the pedigree; see Results) and we can directly assess the strength of the relative evidence for the top candidate vs. other candidates in step 2 by calibrating these values against permuted data.
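The per-locus calculation described for step 1 can be illustrated with a short sketch. It is a simplified reading of the model, not the WHODAD source: it assumes that the marginal offspring genotype probability follows Hardy–Weinberg proportions at the reference allele frequency and that it enters the per-locus term as a divisor, and all function names and inputs are hypothetical. Genotypes are indexed by the number of reference-allele copies.

```python
import math


def mendelian_prob(o, m, f):
    """P(offspring genotype o | mother m, father f); genotypes count reference alleles."""
    tm, tf = m / 2.0, f / 2.0  # chance that each parent transmits the reference allele
    if o == 2:
        return tm * tf
    if o == 0:
        return (1.0 - tm) * (1.0 - tf)
    return tm * (1.0 - tf) + (1.0 - tm) * tf


def hwe_genotype_probs(p_ref):
    """[P(g=0), P(g=1), P(g=2)] under Hardy-Weinberg proportions (an assumption here)."""
    q = 1.0 - p_ref
    return [q * q, 2.0 * p_ref * q, p_ref * p_ref]


def locus_term(father_gp, mother_gp, offspring_gp, p_ref):
    """Sum the per-locus term over all 27 genotype combinations.

    Each *_gp is a list of three genotype probabilities (summing to 1) for
    0, 1, or 2 reference-allele copies, such as those reported by a caller.
    """
    prior_o = hwe_genotype_probs(p_ref)
    total = 0.0
    for f in range(3):
        for m in range(3):
            for o in range(3):
                if prior_o[o] == 0.0:
                    continue  # genotype impossible at this allele frequency
                total += (mendelian_prob(o, m, f) * father_gp[f]
                          * mother_gp[m] * offspring_gp[o] / prior_o[o])
    return total


def log_joint_probability(trio_sites):
    """Sum of log per-locus terms; loci are treated as independent.

    trio_sites: iterable of (father_gp, mother_gp, offspring_gp, p_ref).
    """
    log_total = 0.0
    for father_gp, mother_gp, offspring_gp, p_ref in trio_sites:
        term = locus_term(father_gp, mother_gp, offspring_gp, p_ref)
        if term <= 0.0:
            return float("-inf")  # a Mendelian incompatibility with certain genotypes
        log_total += math.log(term)
    return log_total


# Hypothetical, fully certain genotypes: father 2, mother 0, offspring 1.
site = ([0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.5)
print(locus_term(*site))
print(log_joint_probability([site, site]))
```

Because loci are treated as independent, the sum over genotype vectors of a product over loci equals the product over loci of the per-locus sums, which is what allows the calculation to proceed one site at a time.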

Step 2: calculating resampling-based P-values:

To compute P-values for each paternity assignment, candidates are ranked based on their conditional probability $P(F_i \mid M, O)$ of being the true father. The log ratio of conditional probabilities between the highest-probability father and the second best candidate is the test statistic

$$v = \log\left(\frac{P(F_{\text{best}} \mid M, O)}{P(F_{\text{second}} \mid M, O)}\right). \tag{4}$$

To assess significance for $v$, we then simulate genotype data for a set of $n$ unrelated candidate fathers based on allele frequency information for each locus in the analysis and sequence coverage information for the real candidates, at each of the loci for which they were genotyped in the true data set. Specifically, for each locus-simulated unrelated candidate combination, $f_{ij}$, where $i$ indexes a (real) candidate male and $j$ indexes the locus, we simulate a vector of genotype probabilities for the candidate father, $(f_{ij0}, f_{ij1}, f_{ij2})$, which sum to 1. The number of probability vectors simulated for each candidate is based on the number and identity of the loci observed in the real data. For example, if the top candidate in the real data were evaluated based on 10,000 sites, we would simulate an unrelated male with genotype vector probabilities simulated for each of those 10,000 sites; if the second-best candidate was evaluated at 9000 sites, we would simulate an unrelated male with genotype vector probabilities simulated for each of those 9000 sites, and so on. The variant sets for different simulated candidates need not be identical and are in fact highly unlikely to be so in practice.

To simulate each vector, we draw values from a Dirichlet distribution (i.e., a distribution on probability vectors that sum to one). In principle, the Dirichlet distribution for each biallelic site could be parameterized by the genotype frequencies for each of the three potential genotype values, $\mathrm{Dir}(\pi_{j0}, \pi_{j1}, \pi_{j2})$, with genotype frequencies equal to the Hardy–Weinberg expected values based on the allele frequency of the reference allele [i.e., $p^2$, $2p(1 - p)$, $(1 - p)^2$, with $p$ estimated from the data]. However, the low coverage in our data introduces additional noise into this sampling problem, so we instead draw values from the following Dirichlet distribution,

$$(f_{ij0}, f_{ij1}, f_{ij2}) \sim \mathrm{Dir}\big(\kappa c_{ij} (\pi_{j0}, \pi_{j1}, \pi_{j2})\big), \tag{5}$$

where $c_{ij}$ is the read depth (coverage) for the site in (true) candidate father $i$, and $\kappa$ is a concentration parameter common to all sites and candidate fathers, estimated from the real data using the method of moments. $\kappa$ can be thought of as a scaling factor for the effect of coverage on variance in $(f_{ij0}, f_{ij1}, f_{ij2})$. To make the simulations as realistic as possible, all parameters are estimated from the real data as

$$\pi_{jl} = E(f_{ijl}), \tag{6}$$

where the expectation is based on the allele frequencies for the reference allele estimated across all individuals, for each locus j and genotype l combination, and

$$\kappa = \frac{E(f_{ijl}) - E(f_{ijl}^2)}{E(c_{ij} f_{ijl}^2) - E^2(c_{ij} f_{ijl})}, \tag{7}$$

where the expectations are based on the allele frequencies (as above) across all individuals and loci and across all three possible genotype values (0, 1, and 2) for each locus–individual combination. Our estimates for $\pi_{jl}$ and $\kappa$ are based on the observed average values from the data, which approximate the expected value.
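A single draw from the coverage-scaled Dirichlet distribution can be sketched with the standard library by normalizing independent gamma variates, a standard construction for Dirichlet sampling. This is an illustration rather than the WHODAD code: the value of κ and the example inputs are hypothetical, and genotype frequencies are indexed by the number of reference-allele copies under Hardy–Weinberg proportions.

```python
import random


def hwe_genotype_freqs(p_ref):
    """Expected frequencies of genotypes with 0, 1, and 2 reference-allele copies."""
    q = 1.0 - p_ref
    return [q * q, 2.0 * p_ref * q, p_ref * p_ref]


def simulate_genotype_probs(p_ref, depth, kappa, rng):
    """Draw (f0, f1, f2) ~ Dir(kappa * depth * pi) for one site in one simulated male.

    depth is the read depth observed at this site in the real candidate;
    larger kappa * depth concentrates the draw around the HWE frequencies.
    """
    if depth <= 0 or kappa <= 0:
        raise ValueError("depth and kappa must be positive")
    alphas = [kappa * depth * pi for pi in hwe_genotype_freqs(p_ref)]
    while True:
        # A Dirichlet draw is a vector of gamma variates divided by their sum.
        draws = [rng.gammavariate(a, 1.0) if a > 0 else 0.0 for a in alphas]
        total = sum(draws)
        if total > 0:  # redraw in the rare case that every variate underflows
            return [d / total for d in draws]


rng = random.Random(1)
# Hypothetical site: reference allele frequency 0.5, read depth 3, kappa = 1.
print([round(x, 3) for x in simulate_genotype_probs(0.5, 3, 1.0, rng)])
```

Repeating this draw for every site at which a real candidate was genotyped yields one simulated unrelated male with the same pattern of coverage as that candidate.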

After simulating genotype data for each candidate male as if he were unrelated to the focal offspring, we can obtain a new value of v (Equation 4) from the simulated data. By repeating this procedure s times, we can compute a P-value for the hypothesis that the best candidate in the true data is no more related to the focal offspring than any other candidate in the data set. This P-value is equal to the proportion of times the simulated test statistics exceed the observed test statistic. It intuitively corresponds to the probability of seeing a gap as large as the true gap between the conditional probabilities for the best and second-best candidates, if all candidates were in fact unrelated (or equally related) to the focal offspring.
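The test statistic and its resampling-based P-value reduce to a few lines once the simulated statistics are in hand. The sketch below is illustrative: the input values are hypothetical, and counting ties as exceedances is a conservative choice made here rather than a detail stated in the text.

```python
import math


def log_ratio_statistic(conditional_probs):
    """v = log(best / second best) across the candidates' conditional probabilities."""
    ranked = sorted(conditional_probs, reverse=True)
    if len(ranked) < 2 or ranked[1] <= 0:
        raise ValueError("need at least two candidates with positive probability")
    return math.log(ranked[0] / ranked[1])


def resampling_p_value(observed_v, simulated_vs):
    """Proportion of simulated statistics at least as large as the observed one."""
    if not simulated_vs:
        raise ValueError("simulated_vs must not be empty")
    n_exceed = sum(1 for v in simulated_vs if v >= observed_v)
    return n_exceed / len(simulated_vs)


# Hypothetical conditional probabilities for three candidates, and four
# statistics obtained from simulated sets of unrelated candidates.
observed = log_ratio_statistic([0.2, 0.7, 0.1])
print(round(observed, 3))
print(resampling_p_value(1.0, [0.1, 0.5, 1.2, 2.0]))
```

With s simulations the smallest attainable nonzero P-value is 1/s, so the number of simulations bounds the resolution of the test.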

Step 3: estimating the posterior probability of paternity:

WHODAD’s inference method, like other paternity inference methods [e.g., CERVUS (Marshall et al. 1998; Kalinowski et al. 2007)], can falsely assign paternity to a close relative if the true father is not included in the pool of potential fathers. Such false positives arise because these methods do not actually test the hypothesis that the assigned father is the true father, but rather whether the assigned father is significantly more closely related to the focal offspring than other candidates in the pool. A more direct method would be to test the probability of observing the data for a father–offspring dyad (or father–mother–offspring trio) under the alternative hypothesis that the assigned father is the true father. Testing the alternative hypothesis is nontrivial with low-coverage data and by itself can also yield incorrect inferences (Figure S10). However, in combination with the resampling-based P-values described above, it can improve paternity assignments.

To estimate the probability of the data given the best candidate–offspring dyad, we take advantage of the fact that dyadic measures of genotype similarity or relatedness or other estimates of identity-by-descent should differ for true parent–offspring pairs compared to all other dyads (except for full sibs). By utilizing the many dyadic values in a data set of mothers, offspring, and candidate fathers, we should therefore be able to distinguish father–offspring dyads from dyads involving other relatives or unrelated pairs. Notably, this method allows us to use dyadic values for mother–offspring pairs to maximum effect.

We use a normal mixture clustering approach and the k0 value from the R package lcMLkin, where k0 is the estimated probability that a dyad shares zero alleles identical by descent (so low k0 values indicate closer relatives). We denote yb as the vector of logit-transformed k0 measurements for the best candidate–offspring dyads for all tested father–offspring dyads; y1 as the vector of logit (k0) measurements for all known mother–offspring dyads, if any are present (y1 can be an empty vector if no mother–offspring dyads were sampled); and y0 as the vector of logit (k0) measurements for all other dyads. Thus, y0 captures the distribution of logit (k0) values for non-parent–offspring dyads; y1 captures the distribution of logit (k0) values for known parent–offspring dyads; and yb contains a mixture of logit (k0) values for both true parent–offspring dyads and non-parent–offspring dyads.

We first work only with y0 and use a mixture model approach to assign the logit (k0) value for each dyad i into one of K component normal distributions (fitted using the mixtools package in R, with a default value of K = 5; note that our analyses are robust to reasonable choices of K, see File S1). Components with lower mean values for k0 can be thought of as capturing the distribution of logit (k0) values for highly related dyads (e.g., half-siblings), whereas components with high mean values capture distantly related or unrelated dyads (if relatedness coefficients were used instead of k0, this direction would be reversed: low values would correspond to distantly related dyads instead). For y1, all dyads are from the same relatedness category (mother–offspring), so logit (k0) values in y1 can be modeled by a single distribution parameterized by a mean and a variance. Finally, for yb, values of logit (k0) can be assumed to be drawn either from the distribution on y1 or from one of the distributions (likely one with a low mean value) in the mixture model for y0,

$$y_{bi} \sim \pi N(\mu, \sigma^{2}) + (1 - \pi) N(\mu_{i}, \sigma_{i}^{2}), \tag{8}$$

where, for the ith dyad in yb, μi and σi² are the mean and variance of one of the distributions in the mixture model of y0; μ and σ² are the mean and variance of the distribution on y1; and π is the probability that a value in yb belongs to the parent–offspring distribution rather than to one of the distributions fitted in the mixture model for other dyads. To infer these parameters, for each dyad in yb, we assign μi, σi² to the mean and variance of the most likely normal component by evaluating the likelihood under all K components. We then combine y1 and yb to jointly infer π, μ, σ² in Equation 8.
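The component-assignment step can be illustrated with a short sketch. It assumes the K normal components have already been fitted to y0 and are supplied as (mean, variance) pairs; the helper names are hypothetical, the clipping constant in the logit transform is an assumption added to avoid infinite values, and the sketch compares unweighted normal densities, since the text does not say whether mixing weights enter the comparison.

```python
import math

def logit(k0, eps=1e-6):
    """Logit transform of k0; eps clips values at 0 or 1 (an assumption of this sketch)."""
    k0 = min(max(k0, eps), 1.0 - eps)
    return math.log(k0 / (1.0 - k0))

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def most_likely_component(y, components):
    """Return the (mean, variance) pair under which y has the highest normal likelihood."""
    return max(components, key=lambda mv: normal_logpdf(y, mv[0], mv[1]))

# Example: three illustrative components fitted to y0 (values are made up).
components = [(-3.0, 1.0), (0.0, 1.0), (3.0, 1.0)]
mu_i, var_i = most_likely_component(logit(0.08), components)
print(mu_i, var_i)
```

The returned pair plays the role of μi and σi² for that dyad in Equation 8.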

Finally, we introduce a latent indicator variable zbi for each dyad to indicate whether the ith dyad in yb is a true father–offspring dyad. The probability of being a true father–offspring dyad, or P (zbi = 1), becomes the final statistic used to assess our paternity assignments. To infer P (zbi = 1), we use an expectation-maximization algorithm (see File S1 for detailed information about the EM steps). WHODAD considers a male as the likely true father of a focal offspring if he was (i) the candidate with the highest conditional probability of paternity, (ii) assigned a P-value from our simulations < 0.05, and (iii) assigned P (zbi = 1) > 0.9.
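The exact EM updates are given in File S1 and are not reproduced here. The following is only a generic sketch of how EM could be applied to a two-component model of the form in Equation 8, under assumptions that are ours rather than the authors': the alternative component (μi, σi²) is held fixed for each dyad, known mother–offspring values in y1 contribute to μ and σ² with weight 1, π is updated from yb alone, and a small variance floor is imposed for numerical stability. All names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posteriors(yb, alt, pi, mu, var):
    """E-step: P(z_bi = 1 | y_bi) for each dyad under the current parameters."""
    out = []
    for y, (mu_i, var_i) in zip(yb, alt):
        a = pi * normal_pdf(y, mu, var)
        b = (1.0 - pi) * normal_pdf(y, mu_i, var_i)
        # If both densities underflow to zero, fall back to an uninformative 0.5.
        out.append(a / (a + b) if a + b > 0.0 else 0.5)
    return out

def em_paternity(yb, alt, y1=(), n_iter=500, tol=1e-9, min_var=1e-4):
    """Estimate (pi, mu, var) and return them with the final posteriors for yb.

    yb  : logit(k0) values for best candidate-offspring dyads
    alt : fixed (mu_i, var_i) pair for each dyad in yb
    y1  : logit(k0) values for known mother-offspring dyads (may be empty)
    """
    start = list(y1) if y1 else list(yb)
    mu = sum(start) / len(start)
    var = max(sum((y - mu) ** 2 for y in start) / len(start), min_var)
    pi = 0.5
    for _ in range(n_iter):
        post = posteriors(yb, alt, pi, mu, var)
        weight = sum(post) + len(y1)
        if weight == 0.0:
            break
        # M-step: weighted updates; y1 values count fully toward mu and var.
        new_mu = (sum(p * y for p, y in zip(post, yb)) + sum(y1)) / weight
        ss = sum(p * (y - new_mu) ** 2 for p, y in zip(post, yb))
        ss += sum((y - new_mu) ** 2 for y in y1)
        new_var = max(ss / weight, min_var)
        new_pi = sum(post) / len(yb)
        shift = abs(new_pi - pi) + abs(new_mu - mu) + abs(new_var - var)
        pi, mu, var = new_pi, new_mu, new_var
        if shift < tol:
            break
    return pi, mu, var, posteriors(yb, alt, pi, mu, var)

# Toy example with made-up values: three dyads resemble the known
# mother-offspring dyads and two resemble unrelated pairs.
y1 = [-6.0, -5.8, -6.2, -6.1, -5.9]
yb = [-6.0, -5.9, -6.1, 1.0, 1.2]
alt = [(-2.0, 0.5)] * 3 + [(1.0, 0.5)] * 2
pi, mu, var, post = em_paternity(yb, alt, y1)
print(round(pi, 3), round(mu, 3), [round(p, 3) for p in post])
```

In this toy example the posterior values play the role of P (zbi = 1) and would be compared against the 0.9 threshold.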

Testing the accuracy of paternity assignment using WHODAD

We assigned paternity using the methods detailed above for all previously identified father–offspring pairs (n = 27) in the Amboseli pedigree (Figure 1). This pedigree was constructed using a combination of observational life history data on female pregnancies and infant care (to infer maternal–offspring dyads), demographic data to identify possible candidate fathers, and microsatellite genotyping data analyzed in the program CERVUS (with confidence >95%; see Alberts et al. 2006 for additional detail).

Our data set contained maternal genotype information derived from the fecal enrichment protocol for 15 of these individuals (56%). We first used WHODAD to assign paternity for these 15 offspring while incorporating the genotype data from their mothers. To assess the accuracy of WHODAD in the absence of maternal genotype data, we then repeated the paternity analysis for the same 15 offspring without including the mother’s genotype. For this analysis, we were also able to include the 12 additional offspring for whom we did not have genotype data from the mother, but had genotype data from the known father (n = 27).

To examine how the presence of close male kin influenced the accuracy and confidence of WHODAD’s paternity assignments, we conducted three additional analyses. First, to assess the accuracy of WHODAD when the pedigree-assigned father is the only close male relative present, we removed all close relatives of the offspring except the father (r ≥ 0.25, e.g., grandfathers and half-sibling or full-sibling brothers) from the pool of potential fathers. Second, to test whether WHODAD assigned a father with high confidence even when no close relatives were present, we removed all close male relatives, including the pedigree-assigned father, from the pool of candidate males. Third, to assess the risk of confidently (but erroneously) assigning a close male relative as the likely father when the pedigree-assigned father was not genotyped, we removed the father from the pool of potential fathers. For all WHODAD analyses we report assignment accuracy based on whether the father was identified by WHODAD with a P-value <0.05 and a P (zbi = 1) > 0.90. Offspring were not assigned a father (“no assignment”) when the best candidate male was identified with a P-value >0.05 or a P (zbi = 1) < 0.90.
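The assignment rule used throughout these analyses can be written compactly. The sketch below is illustrative only: the function name and the dictionary of conditional probabilities are hypothetical, and the P-value and P (zbi = 1) are assumed to have been computed beforehand for the best candidate.

```python
def call_paternity(cond_probs, p_value, posterior, alpha=0.05, min_posterior=0.90):
    """Return the best candidate if both thresholds are met, otherwise None ("no assignment").

    cond_probs : dict mapping candidate ID to conditional probability of paternity
    p_value    : resampling-based P-value for the best candidate
    posterior  : P(z_bi = 1) for the best candidate-offspring dyad
    """
    best = max(cond_probs, key=cond_probs.get)
    if p_value < alpha and posterior > min_posterior:
        return best
    return None

# Example with made-up candidate IDs and values.
print(call_paternity({"M1": 0.7, "M2": 0.2, "M3": 0.1}, p_value=0.01, posterior=0.95))  # M1
print(call_paternity({"M1": 0.7, "M2": 0.2, "M3": 0.1}, p_value=0.20, posterior=0.95))  # None
```

Requiring both criteria means that a close male relative who stands out from the other candidates (low P-value) but whose k0 value does not look like a parent–offspring dyad (low posterior) yields no assignment rather than a false positive.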

Data availability

All sequencing data sets reported in this article have been deposited in the NCBI Short Read Archive (SRA), accession no. SRP064514. The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

Acknowledgments

We thank the Kenya Wildlife Service, the Institute of Primate Research, National Museums of Kenya, the National Council for Science and Technology, members of the Amboseli–Longido pastoralist communities, Tortilis Camp, and Ker & Downey Safaris for their assistance in Kenya. We also thank Jeanne Altmann and Elizabeth Archie for their generous support and access to the Amboseli Baboon Research Project data set and samples; Raphael Mututua, Serah Sayialel, Kinyua Warutere, Mercy Akinyi, Tim Wango, and Vivian Oudu for invaluable assistance with the Amboseli baboon sample collection; Emily McLean for assistance in identifying samples from the extended pedigree; and Tauras Vilgalys for assistance in drawing the pedigree. For access to the Guinea baboon samples, we thank Julia Fischer, Dietmar Zinner and José Carlos Brito; the Wild Chimpanzee Foundation for logistical support in Guinea; and the Ministère de l’Environnement et de la Protection de la Nature and the Direction des Parcs Nationaux in Senegal; the Opération du Parc National de la Boucle du Baoulé and the Ministère de l’Environnement et de l’Assainissement in Mali; the Office Guinéen de la Diversité Biologique et des Aires Protégées and the Ministère de L’Environnement, des Eaux et Forêts in Guinea; and the Ministère Délégue auprès du Premier Ministre, Chargé de l’Environnement et du Développement Durable in Mauritania. Finally, we thank P. J. Perry, Luis Barreiro, Greg Crawford, Tim Reddy, members of the Alberts and Tung laboratories, and two anonymous reviewers for their feedback on earlier versions of this work. This work was supported by National Science Foundation grants DEB-1405308 (to J.T.) and SMA-1306134 (to J.T. and N.S.-M.). G.H.K. was supported by the German Academic Exchange Service [Deutscher Akademischer Austauschdienst (DAAD)], the Christiane-Nüsslein-Volhard Foundation, The Leakey Foundation, and the German Primate Center. X.Z. was supported by a grant from the Foundation for the National Institutes of Health through the Accelerating Medicines Partnership BOEH15AMP. The authors declare no competing financial interests.

Author contributions: J.T., N.S.-M., S.M., and X.Z. conceived and designed the research. M.L.Y., A.O.S., J.B.G., G.H.K., and N.S.-M. performed all laboratory experiments. S.A.S., J.T., and J.D.W. provided the genome assembly. W.H.M., S.M., J.T., and X.Z. developed the computational methods. W.H.M. and X.Z. implemented the software. N.S.-M., J.T., and X.Z. analyzed the data. S.C.A. and G.H.K. provided samples, reagents, and logistical support. N.S.-M., J.T., and X.Z. wrote the manuscript with input from all of the coauthors.

Footnotes

Communicating editor: J. Shendure

Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.187492/-/DC1.

Literature Cited

  1. Alberts S. C., Altmann J., 2012. The Amboseli Baboon Research Project: 40 years of continuity and change, pp. 261–287 in Long-Term Field Studies of Primates, edited by Kappeler P. M., Watts D. P. Springer-Verlag, Berlin/Heidelberg, Germany.
  2. Alberts S. C., Buchan J. C., Altmann J., 2006. Sexual selection in wild baboons: from mating opportunities to paternity success. Anim. Behav. 72(5): 1177–1196.
  3. Archie E. A., Hollister-Smith J. A., Poole J. H., Lee P. C., Moss C. J., et al., 2007. Behavioural inbreeding avoidance in wild African elephants. Mol. Ecol. 16(19): 4138–4148.
  4. Ávila-Arcos M. C., Sandoval-Velasco M., Schroeder H., Carpenter M. L., Malaspinas A.-S., et al., 2015. Comparative performance of two whole genome capture methodologies on ancient DNA Illumina libraries. Methods Ecol. Evol. 6(6): 725–734.
  5. Buchan J. C., Alberts S. C., Silk J. B., Altmann J., 2003. True paternal care in a multi-male primate society. Nature 425(6954): 179–181.
  6. Carpenter M. L., Buenrostro J. D., Valdiosera C., Schroeder H., Allentoft M. E., et al., 2013. Pulling out the 1%: whole-genome capture for the targeted enrichment of ancient DNA sequencing libraries. Am. J. Hum. Genet. 93(5): 852–864.
  7. Chakraborty R., Shaw M., Schull W. J., 1974. Exclusion of paternity: the current state of the art. Am. J. Hum. Genet. 26(4): 477–488.
  8. Charpentier M. J. E., Fontaine M. C., Cherel E., Renoult J. P., Jenkins T., et al., 2012. Genetic structure in a dynamic baboon hybrid zone corroborates behavioural observations in a hybrid population. Mol. Ecol. 21(3): 715–731.
  9. Daley T., Smith A. D. A., 2013. Predicting the molecular complexity of sequencing libraries. Nat. Methods 10(4): 325–329.
  10. Danecek P., Auton A., Abecasis G., Albers C. A., Banks E., et al., 2011. The variant call format and VCFtools. Bioinformatics 27(15): 2156–2158.
  11. DePristo M. A., Banks E., Poplin R., Garimella K. V., Maguire J. R., et al., 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43(5): 491–498.
  12. Durand E. Y., Patterson N., Reich D., Slatkin M., 2011. Testing for ancient admixture between closely related populations. Mol. Biol. Evol. 28(8): 2239–2252.
  13. Enk J. M., Devault A. M., Kuch M., Murgha Y. E., Rouillard J.-M., et al., 2014. Ancient whole genome enrichment using baits built from modern DNA. Mol. Biol. Evol. 31(5): 1292–1294.
  14. Ford M. J., Williamson K. S., 2010. The aunt and uncle effect revisited - the effect of biased parentage assignment on fitness estimation in a supplemented salmon population. J. Hered. 101(1): 33–41.
  15. Gagneux P., Boesch C., Woodruff D. S., 1997. Microsatellite scoring errors associated with noninvasive genotyping based on nuclear DNA amplified from shed hair. Mol. Ecol. 6(9): 861–868.
  16. George R. D., McVicker G., Diederich R., Ng S. B., MacKenzie A. P., et al., 2011. Trans genomic capture and sequencing of primate exomes reveals new targets of positive selection. Genome Res. 21(10): 1686–1694.
  17. Gillespie T. R., 2006. Noninvasive assessment of gastrointestinal parasite infections in free-ranging primates. Int. J. Primatol. 27(4): 1129–1143.
  18. Gnirke A., Melnikov A., Maguire J., Rogov P., LeProust E. M., et al., 2009. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat. Biotechnol. 27(2): 182–189.
  19. Gottelli D., Wang J., Bashir S., Durant S. M., 2007. Genetic analysis reveals promiscuity among female cheetahs. Proc. Biol. Sci. 274(1621): 1993–2001.
  20. Huang B., Amos C., Lin D., 2007. Detecting haplotype effects in genomewide association studies. Genet. Epidemiol. 31: 803–812.
  21. Idaghdour Y., Broderick D., Korrida A., 2003. Faeces as a source of DNA for molecular studies in a threatened population of great bustards. Conserv. Genet. 4(6): 789–792.
  22. Kalinowski S. T., Taper M. L., Marshall T. C., 2007. Revising how the computer program CERVUS accommodates genotyping error increases success in paternity assignment. Mol. Ecol. 16(5): 1099–1106.
  23. Kharchenko P. V., Tolstorukov M. Y., Park P. J., 2008. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat. Biotechnol. 26(12): 1351–1359.
  24. Kircher M., Sawyer S., Meyer M., 2012. Double indexing overcomes inaccuracies in multiplex sequencing on the Illumina platform. Nucleic Acids Res. 40(1): e3.
  25. Knight J. M., Davidson L. A., Herman D., Martin C. R., Goldsby J. S., et al., 2014. Non-invasive analysis of intestinal development in preterm and term infants using RNA-sequencing. Sci. Rep. 4: 5453.
  26. Landt S. G., Marinov G. K., Kundaje A., Kheradpour P., Pauli F., et al., 2012. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 22(9): 1813–1831.
  27. Ley R. E., Lozupone C. A., Hamady M., Knight R., Gordon J. I., 2008. Worlds within worlds: evolution of the vertebrate gut microbiota. Nat. Rev. Microbiol. 6(10): 776–788.
  28. Li H., 2011. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27(21): 2987–2993.
  29. Li H., 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv: 1303.3997.
  30. Li H., Durbin R., 2011. Inference of human population history from individual whole-genome sequences. Nature 475: 493–496.
  31. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., et al., 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16): 2078–2079.
  32. Li Y., Sung W., Liu J., 2007. Association mapping via regularized regression analysis of single-nucleotide–polymorphism haplotypes in variable-sized sliding windows. Am. J. Hum. Genet. 80: 705–715.
  33. Lipatov M., Sanjeev K., Patro R., Veeramah K., 2015. Maximum likelihood estimation of biological relatedness from low coverage sequencing data. bioRxiv: 023374.
  34. Ma Y., Zhao J., Wong J.-S., Ma L., Li W., et al., 2014. Accurate inference of local phased ancestry of modern admixed populations. Sci. Rep. 4: 5800.
  35. Manichaikul A., Mychaleckyj J. C., Rich S. S., Daly K., Sale M., et al., 2010. Robust relationship inference in genome-wide association studies. Bioinformatics 26(22): 2867–2873.
  36. Marshall T. C., Slate J., Kruuk L. E., Pemberton J. M., 1998. Statistical confidence for likelihood-based paternity inference in natural populations. Mol. Ecol. 7(5): 639–655.
  37. McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., et al., 2010. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20(9): 1297–1303.
  38. Moll P. R., Duschl J., Richter K., 2004. Optimized RNA amplification using T7-RNA-polymerase based in vitro transcription. Anal. Biochem. 334(1): 164–174.
  39. Mondol S., Ullas Karanth K., Samba Kumar N., Gopalaswamy A. M., Andheria A., et al., 2009. Evaluation of non-invasive genetic sampling methods for estimating tiger population size. Biol. Conserv. 142(10): 2350–2360.
  40. Morin P. A., Chambers K. E. K., Boesch C., Vigilant L., 2001. Quantitative polymerase chain reaction analysis of DNA from noninvasive samples for accurate microsatellite genotyping of wild chimpanzees (Pan troglodytes verus). Mol. Ecol. 10(7): 1835–1844.
  41. Nagata J., Aramilev V. V., Belozor A., Sugimoto T., McCullough D. R., 2005. Fecal genetic analysis using PCR-RFLP of cytochrome b to identify sympatric carnivores, the tiger Panthera tigris and the leopard Panthera pardus, in far eastern Russia. Conserv. Genet. 6(5): 863–866.
  42. O’Rawe J., Jiang T., Sun G., Wu Y., Wang W., et al., 2013. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Med. 5(3): 28.
  43. Olsen J. B., Busack C., Britt J., Bentzen P., 2001. The aunt and uncle effect: an empirical evaluation of the confounding influence of full sibs of parents on pedigree reconstruction. J. Hered. 92(3): 243–247.
  44. Palme R., 2005. Measuring fecal steroids: guidelines for practical application. Ann. N. Y. Acad. Sci. 1046: 75–80.
  45. Pérez T., Naves J., Vázquez J. F., Seijas J., Corao A., et al., 2010. Evidence for improved connectivity between Cantabrian brown bear subpopulations. Ursus 21(1): 104–108.
  46. Perry G. H., Marioni J. C., Melsted P., Gilad Y., 2010. Genomic-scale capture and sequencing of endogenous DNA from feces. Mol. Ecol. 19(24): 5332–5344.
  47. Pompanon F., Bonin A., Bellemain E., Taberlet P., 2005. Genotyping errors: causes, consequences and solutions. Nat. Rev. Genet. 6(11): 847–859.
  48. Price A., Tandon A., Patterson N., Barnes K., 2009. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 5: e1000519.
  49. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M. A. R., et al., 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81: 559–575.
  50. R Development Core Team, 2015. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna.
  51. Rudnick J. A., Katzner T. E., Bragin E. A., DeWoody J. A., 2007. A non-invasive genetic evaluation of population size, natal philopatry, and roosting behavior of non-breeding eastern imperial eagles (Aquila heliaca) in central Asia. Conserv. Genet. 9(3): 667–676.
  52. Sabeti P., Reich D., Higgins J., 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832–837.
  53. Sacks B. N., Moore M., Statham M. J., Wittmer H. U., 2011. A restricted hybrid zone between native and introduced red fox (Vulpes vulpes) populations suggests reproductive barriers and competitive exclusion. Mol. Ecol. 20(2): 326–341.
  54. Samuels D. C., Han L., Li J., Quanghu S., Clark T. A., et al., 2013. Finding the lost treasures in exome sequencing data. Trends Genet. 29(10): 593–599.
  55. Sankararaman S., Sridhar S., 2008. Estimating local ancestry in admixed populations. Am. J. Hum. Genet. 82: 290–303.
  56. Shagina I., Bogdanova E., Mamedov I. Z., Lebedev Y., Lukyanov S., et al., 2010. Normalization of genomic DNA using duplex-specific nuclease. Biotechniques 48(6): 455–459.
  57. Sinnwell J. P., Therneau T. M., Schaid D. J., 2014. The kinship2 R package for pedigree data. Hum. Hered. 78(2): 91–93.
  58. Smith K., Alberts S. C., Altmann J., 2003. Wild female baboons bias their social behaviour towards paternal half-sisters. Proc. Biol. Sci. 270(1514): 503.
  59. Steiper M. E., Young N. M., 2006. Primate molecular divergence dates. Mol. Phylogenet. Evol. 41(2): 384–394.
  60. Taberlet P., Waits L., Luikart G., 1999. Noninvasive genetic sampling: look before you leap. Trends Ecol. Evol. 14(8): 323–327.
  61. Thompson E. A., Meagher T. R., 1987. Parental and sib likelihoods in genealogy reconstruction. Biometrics 43(3): 585–600.
  62. Tung J., Zhou X., Alberts S. C., Stephens M., Gilad Y., 2015. The genetic architecture of gene expression levels in wild baboons. eLife 4: e04729.
  63. Valière N., Fumagalli L., Gielly L., Miquel C., Lequette B., et al., 2003. Long-distance wolf recolonization of France and Switzerland inferred from non-invasive genetic sampling over a period of 10 years. Anim. Conserv. 6(1): 83–92.
  64. Vallender E. J., 2011. Expanding whole exome resequencing into non-human primates. Genome Biol. 12(9): R87.
  65. Van der Auwera G. A., Carneiro M. O., Hartl C., Poplin R., del Angel G., et al., 2013. From FastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics 43: 11.10.1–11.10.33.
  66. Van Horn R. C., Altmann J., Alberts S. C., 2008. Can’t get there from here: inferring kinship from pairwise genetic relatedness. Anim. Behav. 75(3): 1173–1180.
  67. Visscher P. M., 2009. Whole genome approaches to quantitative genetics. Genetica 136(2): 351–358.
  68. Wall J. D., Schlebusch S. A., Alberts S. C., Cox L., Snyder-Mackler N., et al., 2016. Genome-wide ancestry and divergence patterns from low-coverage sequencing data reveal a complex history of admixture in wild baboons. Mol. Ecol. DOI: 10.1111/mec.13684.
  69. Yang J., Benyamin B., McEvoy B. P., Gordon S., Henders A. K., et al., 2010. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42(7): 565–569.
  70. Yang J., Manolio T. A., Pasquale L. R., Boerwinkle E., Caporaso N., et al., 2011. Genome partitioning of genetic variation for complex traits using common SNPs. Nat. Genet. 43(6): 519–525.
  71. Zimin A. V., Cornish A. S., Maudhoo M. D., Gibbs R. M., Zhang X., et al., 2014. A new rhesus macaque assembly and annotation for next-generation sequencing analyses. Biol. Direct 9(1): 20.


Articles from Genetics are provided here courtesy of Oxford University Press
