Abstract
In polymerase chain reaction (PCR)-based DNA sequencing studies, there is the possibility that mutations at the binding sites of primers result in no primer binding and therefore no amplification. In this article, we call such mutations PCR dropouts and present a coalescent-based theory of the distribution of segregating PCR dropout mutations within a species. We show that dropout mutations typically occur along branch sections that are at or near the base of a coalescent tree, if at all. Given that a dropout mutation occurs along a branch section near the base of a tree, there is a good chance that it causes the alleles of a large fraction of a species to go unamplified, which distorts the tree shape. Expected coalescence times and distributions of pairwise sequence differences in the presence of PCR dropout mutations are derived under the assumptions of both neutrality and background selection. These expectations differ from when PCR dropout mutations are absent and may form the basis of inferential approaches to detect the presence of dropout mutations, as well as the development of unbiased estimators of statistics associated with population-level genetic variation.
Keywords: Background selection, coalescent, eDNA, metagenomics, pairwise differences, tree shape
Introduction
DNA sequencing is either primer based, whereby a specific genetic locus is amplified by polymerase chain reaction (PCR) and sequenced1 or by shotgun sequencing, whereby genomes are randomly broken into small segments and sequenced directly with no amplification.2,3 In primer-based sequencing, 2 primers that together flank a locus are used. Primers are typically about 18 to 24 nucleotides in length and bind to DNA. Each primer initiates replication on separate strands, with complementary effects such that after several rounds of replication and strand separation a specific locus is amplified.4
Substitutions at the binding sites of primers can lower the probability a primer binds and initiates replication. Not all substitutions at the binding site of a primer are equal. For example, substitutions within the last 3 to 4 nucleotides of the 3′ end can significantly reduce PCR replication.5,6 Primer coverage corresponds to the proportion of species or taxa in the sample that are, or are expected to be, amplified and sequenced for a primer or set of primers.7 Studies of primer coverage that compare primers to DNA sequence databases indicate that coverage for a specific primer varies depending on taxon, ranging from 0% to nearly 100%,7,8 where the taxon level is at the domain or phylum. For example, in the bacterial phylum, Nitrospirae coverage is about 97% for the commonly used 16S rRNA primer 519F if a single mismatch is allowed within the last 4 nucleotides of a primer binding site, but only about 32% if a single mismatch is not allowed.7 For the bacterial phylum Lentisphaerae and the commonly used 16S rRNA primer 338F, coverage is 97% if a single mismatch is allowed within the last 4 nucleotides of a primer binding site, but 0% if a single mismatch is not allowed.7 High coverage may occur at the phylum level for a specific primer, but is typically lower at the domain level due to broader sequence diversity.
A lack of primer coverage is straightforward to detect in studies where it is sought to sequence DNA from a single individual because amplification and sequencing will fail, which is directly detectable. In contrast, it is becoming increasingly common to pool, amplify, and sequence DNA from multiple individuals across one or more species. For example, in primer-based metagenomic and eDNA studies, a sample is taken from the environment that contains individuals and/or DNA from potentially thousands of species or OTUs (operational taxonomic units) and multiple individuals per species or OTU. In this approach, PCR amplification and sequencing is often non-targeted and used to discover or detect the species, OTUs or higher taxonomic groups present, as well as targeted and used to detect a specific species, OTU or taxonomic group. In non-targeted applications, there is the possibility that a taxonomic group or subset of individuals within a taxonomic group go undiscovered or undetected because of a mismatch between primers and binding sites. Furthermore, even with targeted sequencing, there is the possibility of low primer coverage.7,8
As context, many datasets deposited in the European Bioinformatics Institute metagenomics database (https://www.ebi.ac.uk/metagenomics/)9 involve primer/amplicon-based and non-targeted sequencing from nature. Although there has been progress in developing reference databases for primer/amplicon-based metagenomic studies to match sequences to species or higher taxonomic groups in a targeted manner,10 there is still a need for assessing whether a taxonomic group goes partially or fully unsequenced in non-targeted sequencing studies, as well as assess the degree of coverage of targeted taxa.
A challenge for primer-based metagenomic studies is assessing coverage for a sample from nature. A rarefaction curve may be used to assess coverage by plotting the number of taxa delineated versus the number of sequences recorded.11 If this curve is asymptotic, then nearly all taxa have been delineated. Nevertheless, this only applies to taxa that are amplified by the primer or primer set. Even if primer coverage was less than 100%, an asymptotic rarefaction curve can occur with more sequences because eventually all amplified species can eventually be sequenced and delineated to a taxon or as coming from an unknown taxon. Generally, it would be helpful for a metagenomic study to have available approaches to assess primer coverage for their specific sample from nature. In this article, we present theory to help the development of approaches to assess coverage.
In particular, we use coalescent theory to derive expectations for the pattern of nucleotide variation within a species when a mutation or a set of mutations at primer site(s) block the chain of events during a primer-based sequencing study, such that a set of sequences are not recorded as being present. This mutation or a set of mutations could either completely block a primer from binding to its binding site or reduce binding, such that amplification is so low that the locus is not sequenced for a set of individuals descendent from the mutation or a set of mutations. We call such mutations “PCR dropouts.”
A PCR dropout mutation or a set of PCR dropout mutations may occur ancestrally to all individuals within a species or later (forward in time), such that only a subset of individuals from a species have one or more of the dropouts. Dropout mutations that occur ancestrally to all individuals and are sufficient to block amplification of all individuals within the species are not the focus of this study. Instead, we focus on dropouts that are segregating within a population, such that these dropouts occur later (forward in time) than the most recent common ancestor (MRCA) to the species. Using coalescent theory,12,13 we show that segregating dropout mutations give rise to distinct patterns of DNA sequence variation. This pattern may be used to assess whether a species (or OTU) is prone to PCR dropouts and therefore reduced coverage. Furthermore, if several species (or OTUs) within a taxon have a signal of dropouts, this may indicate a larger problem of coverage for the entire taxon.
Although this article focuses on dropouts within a species, similar principles apply at higher levels of biological organization. We choose to first focus on within-species variation because DNA sequence evolution is well characterized by the coalescent process and this process is universal among species and taxa. In contrast, at higher levels, there is no standard theoretical framework of DNA sequence evolution, although Hey’s14 approach may be promising in the context PCR dropouts and metagenomics (see section “Discussion”).
Expectations for DNA sequence variation at the population level and in the presence of segregating dropout mutations will also be of potential use in other contexts besides assessing primer coverage. For example, there has been the development of estimators of nucleotide diversity that account for the higher rate of sequencing errors associated with next-generation sequencing technology.15,16 Yet, methods are not available that account for segregating PCR dropouts. Furthermore, methods have been developed to use population-level variation to infer the phylogenetic structure of ecological communities using metagenomic data.17 Accordingly, it would be helpful to consider the effect of PCR dropouts on population-level variation and whether they may affect inferences of phylogenetic structure.
The organization of this article is as follows: First, we present theory assessing the probability that dropout mutations occur along branch sections that are deep in the coalescent tree of a species or along branch section that occur more recently, as well as whether we expect a species to segregate a single dropout mutation or multiple dropout mutations. Generally, we find that dropout mutations typically occur along branch sections that are deep in the coalescent tree of a species. Furthermore, given rates of mutation at primer sites, a single dropout mutation typically segregates in a sample. Focusing on dropouts that completely block amplification, we then derive expectations for the distribution of coalescence times and pairwise differences of sequenced alleles at a locus. This analysis indicates that the distribution of coalescence times and pairwise differences is affected by the presence of a dropout and may therefore give rise to a detectable signal of a dropout. The theory accounts for the process of background selection,18,19 which is occurring at loci used in primer-based metagenomic studies, such as 16S/18S RNA.1
Theory
It is assumed that a sample is taken from an environment, be it water, soil, feces, etc, and DNA is purified from the sample and subject to PCR and sequencing. Although it may be that within a sample there are individuals across the tree of life, only a single genetic locus is amplified by PCR. For simplicity, we assume that the PCR primers target bacteria and archaea such that species are haploid and effective population sizes are potentially large. Although this article focuses on haploid microbes, the theory applies directly to diploid microbes with a change in timescale (see below). For a given species in the sample, there is a sample size of n. This sample size is not known. In the theory presented, we nevertheless assume a sample size of n and then develop expectations for how many individuals within the sample are expected to be unamplified given a sample of size n. The theory demonstrates that n is a nuisance parameter and detectable effects of dropout mutations can be inferred without the knowledge of n.
For PCR sequencing of a species or OTU to occur, DNA primers need to amplify the DNA of a species or OTU to a sufficient level that allows for sequencing. Here we define to be the rate of mutation that leads to a PCR dropout. For simplicity, we assume that a mutation either causes a dropout or does not, ie, the effects of mutations are discrete. An alternative model is quantitative, such that a mutation may decrease amplification by a certain fraction that is between, but not equal to, 0 and 1. We assume that the nucleotide sites that affect PCR amplification do not affect the fitness of an organism and consequently evolve neutrally. Amplified loci in metagenomic and environmental DNA studies are typically under purifying selection, such that we expect the amplified PCR locus to be affected by background selection.18 Initially, we assume neutrality and no background selection, but later consider the consequences of background selection. Furthermore, when developing expectations for polymorphism at an amplified locus, we focus on synonymous sites and/or assume neutrality and no background selection, but later consider the consequences of background selection.
Neutrality and no background selection at the amplified locus
A natural theoretical framework to model dropouts assuming neutrality is Kingman’s12 coalescent (see also Tajima13). In the Kingman coalescent, the genealogical history of a sample of size n is divided into sections in which coalescent events define the sections, whereby descending from the top to the bottom of the coalescent tree, the first section consists of n lineages, the second lineages, etc, with the last section consisting of 2 lineages. Following Fu,20 we call the section of the tree consisting of k branches “level k” (Figure 1). Mutations causing a dropout can occur along branches within each of these sections. A dropout mutation along a branch section is “unique” if no other dropout mutation occurs along its corresponding ancestral lineage from the point of mutation back to the MRCA of the sample (see Figure 2, section “Probability of the dropouts at level k are unique along their corresponding lines of descent”). Unique dropout mutations are of interest because all of the descendent lineages of the dropout will not be amplified by PCR, including lineages that have incurred subsequent dropout mutations (hence why we focus on unique dropouts). Next, we derive expectations for the distribution of the number of dropouts when there are k lineages remaining from a sample of size n. We then derive the conditional distribution of the number of dropouts that are unique at level k given a certain number of dropout mutations. For the set of unique dropouts, we derive the distribution of the number of lineages descendent from each dropout. With information about the number of descendant lineages, we then derive the conditional distribution of pairwise differences in the sample. This distribution is conditioned on the number of unique dropouts and numbers of descendent lineages from these dropouts, noting that by determining what individuals are not amplified by PCR we can then focus attention on the properties of individuals that are amplified by PCR.
Probability of branches with dropouts at level k,
Define to be a random variable giving the length of branch sections at level k in the coalescent tree. In continuous time, the distribution of is approximately exponential12 such that
where to simplify notation we let . The approximation is in the diffusion limit as the effective size of a population tends to infinity. In the Wright-Fisher model, 1 unit of time equals generations and in the Moran model 1 unit of time equals generations.12 For simplicity, we use 1 effective population size in our analysis, keeping in mind that the effective size of a Moran population is one half that of a Wright-Fisher population. Furthermore, note that for diploid microbes the baseline timescale is generations.
The expected number of dropout mutations per branch section at level k is , where the mutation rate is measured per generation and is measured on a generation timescale. The distribution of the number of mutations along a branch section is Poisson distributed and the probability of at least 1 dropout along a branch section at level k is
noting that is the probability of no dropout mutations. Define to be a random variable for the number of branch sections at level k with at least 1 dropout, then the probability is
(1) |
where , is the gamma function, and, to simplify notation, we let . A straightforward calculation from is the probability , which is
(2) |
and corresponds to the probability of at least 1 mutation prior to a coalescent event at level k (cf. Wakeley21).
Probability of the dropouts at level k are unique along their corresponding lines of descent
Define to be a random variable for the number of lineages at level k that have a dropout mutation at level k and no prior dropout mutations from level k to the MRCA of the sample, such that is the number of unique dropout mutations at level k and . Figure 2 illustrates the notation, whereby at level in the tree, there are 2 dropout mutations , but a second dropout mutation occurs along the line of descent to the MRCA for one of the dropouts, such that ; at level , and .
To calculate the distribution for , we note that moving down the coalescent tree from level k to the MRCA of the sample, we can define a set of numbers for i from 1 to which consists of the number of lineages with i descendants that have a unique dropout mutation at level k. Combining the set with the number , which is the number of lineages out of the k lineages remaining in the sample, forms a state space that characterizes the genealogical and dropout mutation process from level k to the MRCA of the sample. At the MRCA, and either all of the equal zero, such that there are no unique dropout mutations at level k, or one and only one of the equals 1, such that if and then .
Backward in time, 2 events can occur that are relevant to determining the distribution of . One event is a coalescence and the other event is a dropout mutation along a descendent lineage. If a coalescent event occurs, it could be between the 2 lineages with a dropout mutation, between a lineage with a dropout mutation and without a dropout mutation or between 2 lineages without a dropout mutation. For example, 2 lineages within the set with i descendants may coalesce resulting in the transition and this occurs at rate on an appropriate timescale. Or, 2 lineages, one with i descendants and the other with j descendants, may coalesce resulting in the transition
for at rate . A coalescent event could occur between lineages that are not descendants of dropout mutations or between a lineage with a descendent dropout mutation and one without a descendent dropout, such that at rate . Last, if a dropout mutation occurs along a descendent lineage below level k, then a dropout mutation at level k is no longer unique. For each element in , this event occurs at rate .
Generally, for , the following transitions are possible and their corresponding rate of transition, where for efficiency we only show elements in the state space set that change
(3) |
For the set , there will be a corresponding state space from level k to the MRCA. Define to be a random variable for the state of the system at time t into the past starting at some point in level k. is a finite-state and continuous-time Markov process with multiple absorbing states when . We can define an matrix to be the corresponding finite-state and continuous-time Markovian transition rate matrix, where is the size of the set . If the current state of the system is i, then the probability of being in state j at time t generations into the past satisfies the differential equation
(4) |
where is the full matrix consisting of elements and is defined by the transition rates in equation (3). The solution to the differential equation is
(5) |
For our purposes, we are interested in the probability at time t that the system is in a state in which starting from the initial state and furthermore whether one of the equals 1. Label the initial state as and states in which with an integer j corresponding to the number of descendent lineages with a unique dropout mutation at level k. For example, for and , the possibilities are , , and .
Define to be the probability at time t given , then
(6) |
Furthermore, in the limit as , then
(7) |
As has a finite dimension and satisfies properties of instantaneous transition rates, the limit is finite and defined.21
For the case , it is straightforward to directly write down for , which is
(8) |
where and , as well as recognizing that for the dropout to be unique no subsequent dropout mutations can occur along its lineage back to the MRCA of the sample.
Probability of the set of descendants given unique dropouts at level k
Define to be a set consisting of elements, in which an element is an integer greater than 0 equaling the number of descendants of one of the lineages, respectively, and noting that for the set is . Enumeration of is a direct application of the Pólya urn scheme and nicely applied in a similar population genetic context by Fu.19 For a sample of size n and level k in a coalescent tree, there are
ways the n individuals can be assigned to k ancestors. Of the k ancestors, we are interested in the number of descendants of each of the unique dropout mutations. The number of ways of drawing a set from n individuals under the constraint that at level k is
such that the probability of the set given is
(9) |
Probability of the set of descendants of unique dropouts at level k
From the law of conditional probability, the probability of the set of descendants of unique dropouts at level k is
(10) |
For sufficiently small , it is most likely that at a level there is a single dropout, given a dropout occurs, and we can focus our attention on the case
(11) |
For the purposes of metagenomics or eDNA studies, we may be interested in the probability for level k, which is the probability of the loss of d or more individuals in the sample due to a unique dropout at level k, and is calculated as
(12) |
Distribution of pairwise differences with dropouts
The presence of dropouts may affect the distribution of pairwise differences for a given species. For sufficiently small , unique dropouts tend to occur along branch sections near the MRCA of a sample (see section “Analysis and Results”). We therefore focus on the derivation of expectations for the distribution of pairwise difference when and , noting that, for unique dropouts at higher levels, the calculation of pairwise differences follows similar principles.
Single unique dropout at level 2 (Y2 = 1)
A dropout generates a subtree of sequenced individuals from a sample of n individuals in total. A subtree has levels just like the coalescent tree for an entire sample. To calculate the distribution of pairwise differences between 2 random and sequenced individuals involves determining (1) whether the 2 sequenced individuals coalesce at level j in the subtree, (2) whether the entire sample is at level k given the subtree is at level j, and (3) the time to coalesce given the level in the entire sample.
For a subtree of size , the probability that 2 randomly chosen individuals at the tips of a subtree coalesce at level j for is
(13) |
Wiuf and Donnelly22 derived the probability that, given a subtree first enters level j, the entire sample is at level k, such that
and where here and .
Given k, the distribution of coalescence times is
and can be calculated directed from the sum of exponentially distributed random variables.21,24 Note further that i is indexed initially with because the transition in was from . If Z is a random variable for the number of pairwise differences between the 2 sequences, then its distribution is
assuming that mutations are Poisson distributed along branch sections and the rate of mutation causing pairwise differences is . Over the distributions for j and k, the probability of z pairwise differences is
(14) |
Single unique dropout at level 3 (Y3=1)
Here we assume that the dropout mutation occurred such that the 2 clades of sequenced individuals coalesce when for the entire sample (Figure 3). We label these 2 clades A and B, such that the number of sequenced individuals in clade A is a and the number in clade B is b. Two lineages of sequenced individuals can both occur in clade A with probability , both in clade B with probability , or 1 in clade A and 1 in clade B.
Let be the probability of z pairwise differences given that 2 sequences are in clade A. As we are finding the marginal probability of z conditioned on a single clade, the result from Wiuf and Donnelly22 given above applies, such that
(15) |
which differs from because for clade A the minimum value of k for which is , noting that a is substituted for . For clade B, the 2 lineages can coalesce up to also such that is equal to , except for b substituted for a.
When one sequenced individual is in clade A and the other in clade B, the time to coalesce is distributed as , such that
(16) |
and together for is
(17) |
The effect of dropouts on coalescence times of sequenced individuals
The previous section demonstrated that dropouts affect the distribution of pairwise differences relative to when there are no dropouts. The difference in the distribution of pairwise difference comes about because of differences in coalescence times, which in turn reflects the mapping in the function. Here we further explore the effect of the mapping on coalescence times to see if there is additional information that may be used to detect dropouts.
Single unique dropout at level 2 (Y2 = 1)
For , there is a subtree of sequenced individuals that can be divided into j levels in a similar manner as for the entire sample n that is divided into k levels. The mapping is itself a Markov process from to , with the probability of a mapping depending on the k level in the previous mapping. We leave the study of this Markov process for later and here investigate the unconditional expectation for the mapping, which is defined as the expected value of k at level j in the subtree or
(18) |
The expected time between the and levels in the subtree is
(19) |
where is the largest integer less than or equal to . In contrast, the expected time between the jth and levels without drops is
from the standard coalescent.
Single unique dropout at level 3 (Y3 = 1)
As before, we assume that the sequenced individuals are in subclades A and B, as in Figure 3. The numbers of sequenced individuals in subclades A and B are a and b, respectively. We could use a similar approach as in section “Distribution of pairwise differences with dropouts” and condition on each subclade. This conditioning is justified when considering 2 sequenced lineages because either the 2 sequenced lineages occur within the same subclade with mapping according to , or in separate subclades and by definition coalesce at . In contrast, when seeking to derive expectations for the properties of coalescence times for the full set of sequenced individuals, we need to account for the joint set of coalescence times in both subtrees. The Wiuf and Donnelly23 theory does not include this case.
A simple argument indicates that the coalescence times are expected to be distorted with a dropout mutation versus no dropouts. For example, consider a period in the history of the sample in which there are a lineages in clade A and b lineages in clade B, and correspondingly unsequenced lineages. In the absence of recognizing that a dropout mutation has generated a structured coalescent process, the observed sample size is , as opposed to n. Accordingly, the expected coalescence rate, given the observed sample, is on a timescale of generations. Yet, the dropout mutation causes the coalescence process to be structured, such that the actual coalescence rate among sequenced individuals is
(20) |
for in the coalescent tree. Together, the coalescence rate is reduced by a factor of
(21) |
compared with what is expected for an observed sample size of .
Neutrality at the primer sites and background selection at the amplified locus
If deleterious mutations occur at rate U and have multiplicative effects on fitness, each with effect s, then the expected frequency of individuals with j deleterious mutations is Poisson distributed with mean .25 For a sample of size n, the expected number of sampled individuals in the jth mutational class is , where the jth mutational class has j deleterious mutations. Following the structured coalescent modeling framework of Nordborg,26 the class structure of deleterious mutations can be represented abstractly as a set , where is the number of individuals in the jth class and is the maximum value of j for which . Assuming no back mutation and a continuous-time model, transitions corresponding to deleterious mutations, such that and , occur at rate U and transitions corresponding to coalescent events, such that , occur at rate . For sufficiently large , mutational transitions occur on a fast timescale, whereas coalescent transitions occur on a slow timescale, such that the sample is expected to quickly transition to the 0-class. Once in the 0-class, the sample behaves in a neutral manner with coalescence rate . Accordingly, background selection is expected to shorten coalescence times proportionally along a genealogical tree.18
In principle, we could just scale time as and measure dropout properties leading to an expectation for pairwise differences, but this would be somewhat inaccurate because it would miss coalescent events that occur during the transition from the initial sample to the 0-class. Although these events are rare, they lead to a lack of differences between sequences, allowing for average pairwise difference to be less than , where S is the number of segregating sites and .27
For a genotype in the jth class, the expected time to reach the 0th class is units of time. During this time, lineages starting in the jth class may coalesce with other lineages, such that the probability that of the sequences in the jth mutational class coalesce at or before reaching the 0-class is approximately
(22) |
where is an approximation because it assumes that only sequences starting within the same mutational class can coalesce. This assumption is relaxed later when the distribution of pairwise differences is derived. Allowing for sequences starting within different mutational classes to coalesce during the fast period from the initial sample to the 0-class could be modeled as a nested Markovian process. Nevertheless, is an accurate approximation given that coalescent events are rare relative to mutation. The expected number of sequences that coalesce during the period of collapse to the 0th mutational class is
(23) |
where is the probability of the jth mutational class. With , .
Once we have , we can use the theory for the neutral case to derive expectations for the number of descendants of dropouts (, , etc), as well as the distribution of pairwise differences. Expectations for the probability of a dropout along a branch section are a function of with background selection compared with without background selection.
Conditional distribution of pairwise differences with dropouts
Single unique dropout at level 2
With background selection, the corresponding expressions for , , and are the same, except for the replacement of n with . We need to consider that a pair of sequences coalesce during the fast process in which sequences collapse to the 0th mutational class. For a sample of size n, the probability that 2 random sequences are in deleterious mutation classes and is for and for .
For 2 sequences in the same class , the average time to reach the 0th class is units of time and the probability that they coalesce before reaching the 0-class is
(24) |
For 2 sequences in different classes ( and ), the probability that they come together in the mutational class is
(25) |
and once in the class they can coalesce, such that the overall probability they coalesce is
(26) |
Together, the probability of no pairwise difference due to a coalescent event during the fast process of collapse to the 0th mutational class is
(27) |
After entering the 0-class, pairwise differences are expected to behave as with n being replaced by and by such that becomes , and with background selection the distribution of pairwise differences is
(28) |
Single unique dropout at level 3 (Y3 = 1)
Following similar logic as for and assuming the topology in Figure 3, with background selection becomes
(29) |
where again is with n substituted for and for . Likewise, becomes
(30) |
and becomes
(31) |
noting that the 2 sequences cannot coalesce during the fast time period when sampled in clades A and B.
The effect of dropouts on coalescence times of sequenced individuals
Once the sample reaches the 0th class mutationally, we expect the effects of dropouts on coalescence times to be the same as the neutral case, except for being replaced by . In the neutral case, a signal of the effect of dropouts on coalesce times comes about by taking the ratio of adjacent coalescence times, which is independent of . We therefore expect a similar effect of dropouts on adjacent coalescence times when is replaced by in the presence of background selection.
Analysis and Results
Expected values for
In the context of dropout mutations, is not expected to be large. As presented in the introduction, primer sites are typically around 20 nucleotides, with the highest probability that a mutation leads to a dropout occurring within 3 to 4 nucleotides of the 3′ end.7 Accordingly, regarding mutations with a high probability of directly leading to a dropout, the mutation rate is near that of the baseline rate for a single nucleotide and at most about 10 times the baseline rate per nucleotide site. Estimates of per-site mutation rates in bacteria are typically .28 Nevertheless, a species or OTU may be predisposed to a dropout mutation due to prior mutational history within primer sites, leading to a higher dropout rate of mutation. is also a function of . Less is known about for microbes. Smith29 estimated an effective size of or less in Neisseria meningitis, based on electrophoretic data and noting that this is many orders of magnitude less than our understanding of census population size. By comparison in the eukaryotic microbial species Saccharomyces paradoxus, an estimate of its effective population size is of the order .29 Focusing on effective population sizes in the range of and mutation rates in the range of to capture uncertainty leads to in the range of approximately . In the following section, we focus the analysis on values in the middle of this range, namely, of the order .
Under neutrality
Probability of branches with dropouts at level k,
Figure 4A plots and indicates that the probability of a lineage with 1 dropout along branch sections that are at lower levels in the coalescent tree is not too uncommon in the context of studies involving thousands of species or OTUs and assuming 2 values of . Two lineages with dropouts are rarer, but still appreciable at lower levels of the coalescent tree, provided that is large enough. Three or more lineages with dropout mutations are fairly rare in the context of studies involving thousands of species or OTUs. Although higher levels in a coalescent tree have more branch sections, these sections are short in length, such that the total probability of a dropout mutation along levels 500 to 1000 for a sample of size is less than that for level by itself and levels 4 through 50 together (Figure 4B). The values of the order lead to the values between 0.010 and 0.091 at level , 0.005 and 0.047 at level , and 0.003 and 0.032 at level . In total, for , equals 0.007 and 0.070 for and , respectively.
Probability of the dropouts at level k are unique along their corresponding lines of descent, and
Expressions for and (see section “Probability of the dropouts at level k are unique along their corresponding lines of descent”) work for any , but given a low dropout mutation rate it is likely that only 1 dropout occurs at level k, if at all. Multiplying by gives the joint probability of and in the limit as . Figure 5 indicates that lower levels of a coalescent tree are expected to have a unique dropout mutation with the highest probability, and that for the probability that at least 1 level has a unique mutation is still less than level .
Although single dropouts occur with the highest probability, to illustrate the utility of and , we consider the case when a dropout occurs at level and assume 2 dropouts at that level . For this case, the state space corresponding to (see section “Probability of the dropouts at level k are unique along their corresponding lines of descent”) is the set . The corresponding matrix is
and whereas is a somewhat complicated function of and t, is more easily written down, such that
and, numerically, for
indicating that there is a high probability given 2 dropout mutations at level that they are both unique.
Probability of d or more descendent lineages of a unique dropout mutation at level k,
There is a fairly good chance of a unique dropout at level 2 and that it leads to the loss of at least 100 individuals (~8%; Figure 6) for and . There is a smaller, but appreciable, chance under the same conditions at level 4 (~2%). Figure 6 assumes a single dropout mutation at a given level. To illustrate calculations for the case of 2 dropout mutations at level
As is more likely than when , the probability of d or more dropout individuals is greater for versus (Figure 7). For , it is rare (~0.004%) to simultaneously have 2 unique dropout mutations at level 4 and the loss of at least 100 individuals in a sample of size 1000. Here, the rarity primarily comes about from the rarity of the occurrence of 2 dropout mutations at level 4 (cf. Figure 4A).
Distribution of pairwise differences with dropouts and ratio of coalescent times between adjacent nodes
The distribution of pairwise differences is distorted when there is a single dropout at level 2 (Figure 8A). When the number of sequenced individuals is small relative to the entire sample, the distribution of pairwise difference is shifted toward smaller values because sequenced individuals must coalesce before the dropout, which shortens coalesce times. For a moderate to large number of sequenced individuals relative to the entire sample, the distribution can have a mode, unlike the case without dropouts.
With a dropout leading to 2 clades of sequenced individuals, such that (ie, Figure 3), the distribution of pairwise differences is distorted further relative to the case, such that coalescences restricted to be within A or B result in relatively few pairwise differences and coalescences restricted to be between clades A and B result in relatively large pairwise differences (Figure 8B).
The ratio of coalescent times between adjacent nodes indicates that it is lower at levels near the base of the tree in the presence of a dropout (Figure 9).
With background selection
Expected values for
Background selection is expected to decrease the effective value of by a factor of , which is the expected proportion of sequences free of deleterious mutation as a result of background selection.17 Furthermore, Charlesworth et al18 showed that the average pairwise difference is accurately approximated by a neutral model with multiplied by a factor of , where is of the order to for strong and moderate background selection.
Properties of samples with dropout mutations and background selection
Let us first revisit in the context of background selection for a species with effective population size . With background selection, such that and (following Charlesworth et al18), and . For mutation rates in the range of , the corresponding values of encompass the range of values explored in Figure 4, which were and . As an additional point of comparison, for a smaller value of , , which indicates that segregating dropouts would be rare for metagenomic studies with about 1000 species under these conditions. Nevertheless, for smaller values of U and/or s, is expected to increase.
Next, consider for , , , and . The expected number of sequences that coalesce is prior to collapse of the sample to the 0th mutational class, such that we expect . As a point of comparison when is smaller , and n, U, and s remain the same, the number of sequences that coalesce is . Overall, this suggests that with background selection and large effective population sizes.
The analysis involving suggests that, for large relative to , the probability of no pairwise difference due to a coalescent event during the fast process of collapse to the 0th mutational class is small because coalescence events are rare relative to the mutational collapse process to the 0th class. Numerically, for the conditions and and for a large sample size, for and for .
As for large , , we expect only a small difference between the distribution of pairwise difference with background selection versus without background selection , as well as between , , and and , , and , respectively. Therefore, Figure 8 presents an accurate approximation of the distribution of pairwise differences under background selection for .
Discussion
Bacterial and other microbial species are associated with populations of large effective size. This in combination with metagenomic and/or eDNA studies of microbial communities that include thousands of species or OTUs9 brings about the possibility that one or more of the sampled species or OTUs may be segregating mutations that cause dropouts at PCR primer sites. This article quantified the probability of a segregating dropout mutation at a primer site, as well as the effects of dropouts on the distribution of pairwise differences and coalescence times within a species or OTU.
The analysis indicates that segregating PCR dropout mutations are reasonably common provided that the effective population size of a species is large and/or the dropout mutation rate is reasonably high. Furthermore, if dropout mutations occur, they are expected to occur along basal branch sections in the coalescent tree of a sample. That dropouts are expected to occur along the basal branch section brings about the possibility that a large fraction of sampled individuals go unsequenced due to a dropout mutation within a species. This fraction of individuals that go unsequenced leaves behind a signal in the pattern of pairwise differences, as well as the coalescence times among sequenced individuals. In particular, there is a greater tendency for the distribution of pairwise differences to have a non-zero mode in its distribution, with a mode being particularly pronounced when a dropout mutation gives rise to 2 subclades of sequenced individuals as depicted in Figure 3. Furthermore, a dropout mutation is expected to distort coalescence times near the base of a coalescent tree, such that coalescence times are more equal in value relative to when dropouts are absent. The distortion of coalescence times is independent of n (the true sample size) in the context of equation (21). In other contexts, such as pairwise differences, the effect of dropouts is qualitatively the same across n. Here, inference approaches could integrate across possible values of n to assess whether a distribution is consistent with the presence of one or more dropout mutations.
When a dropout mutation gives rise to 2 clades of sequenced individuals as in Figure 3, the coalescent process for these 2 clades is structured such that individuals between clades cannot coalesce until after the dropout mutation (backwards in time). Qualitatively, the distribution of pairwise differences is similar to that when there is population structure,25 in which the distribution is a combination of small pairwise differences between individuals within the same clade and large pairwise differences between individuals in different clades. When a single dropout mutation occurs at level 2 in a coalescent tree, its effect on the distribution of pairwise differences is less pronounced than a level 3 dropout mutation that gives rise to 2 clades, such that there is a tendency for the distribution to have a non-zero mode.
A future direction for research is to develop statistical techniques to assess whether a metagenomic or eDNA sample has segregating dropout mutations. Assessing whether a dropout mutation occurs in a single species will be challenging because the pattern that is generated is similar to population structure26 or changes in population size.31 Furthermore, the coalescence process in a panmictic population of constant size gives rise to a highly diverse set of distributions of pairwise differences and coalesce times for a given sample size. It may be more promising to first develop methods that seek to assess whether there is a signal of dropouts at the population level across a clade of species or OTUs in a metagenomic study. A consistent signal of dropouts within a clade may lead to greater confidence in the presence of dropouts. A clade of species or OTUs may be predisposed to dropout mutations due to prior substitutions at the primer sites in the ancestor to the clade. We think that the EMBL Metagenomics database9 provides an ideal resource to begin to assess the potential occurrence of PCR dropout mutations. The database consists of thousands of studies and often within a study there are multiple samples from a site, each of which is sequenced using an amplicon approach. Having multiple samples from a site would allow for the assessment of whether a signature of PCR dropouts, perhaps across species or OTUs within a clade, is replicated across samples. If it is, then this would be evidence against the random occurrence of a pattern that is similar to predictions made in this study.
A consistent signal across samples could nevertheless be due to population structure or changes in population size, instead of PCR dropouts. Although qualitatively there are similarities in patterns of DNA sequence variation, quantitatively there are likely differences that can distinguish PCR dropouts from the population structure and changes in population size. For example, polymorphisms within clades of a coalescent tree are expected to be at different frequencies geographically because of partial isolation, drift, and the history of mutation. In contrast at a single geographic site and with no population structure at a larger spatial scale, we would expect more equal frequencies of polymorphisms, on average, across clades of a coalescent tree in the presence of PCR dropouts. Population growth may be distinguished from PCR dropouts in that recent growth is expected to increase the number of rare polymorphisms in a population, in contrast to PCR dropouts. Overall, the detection of PCR dropouts will likely be quantitative and involve the development of a model-based inference method with the processes of PCR dropout mutations, population structure, and changes in population size co-occurring. Assessment of the occurrence of PCR dropouts would then be probabilistic, whereby the likelihood or posterior probability of occurrence of a set of PCR dropouts would be weighed in combination with parameters that model population structure and demography.
The detection of the presence of dropout mutations may be useful in metagenomic studies that seek to not only identify taxon-level diversity, but also the abundances of taxa in a community.32 If a taxon has a signature of dropout mutations, then its abundance estimate may be lower than its true value.
Regarding existing methods that use next-generation sequencing approaches to the estimate nucleotide diversity14,15 and phylogenetic structure,17 our results indicate that PCR dropouts are a factor of consideration. In the case of nucleotide diversity, dropouts give rise to patterns of DNA sequence variation similar to population structure or changes in population size (see above) and therefore are expected to affect estimates of nucleotide diversity. The process of PCR dropout mutations could be added to coalescent-based estimators of population-level statistics, particularly in the context of the use of next-generation sequencing and environmental samples. Likewise, when inferring the phylogenetic structure of environmental samples that use primer-based sequencing, the addition of the process of mutation at primer sites may be warranted, as these mutations may lead to DNA sequence patterns that suggest demographic processes such as population structure and growth. Similar to the previous discussion of detecting PCR dropouts, the incorporation of dropouts to estimators of statistics and phylogenetic structure will likely involve adding the dropout process to model-based approaches, which would then affect the likelihood or posterior distributions of focal statistics or histories.
This article focused on properties of segregating dropout mutations within a species. The development of more powerful methods to detect dropouts will likely involve combining within-species and between-species approaches. In the context of between-species approaches, a promising theoretical framework is Hey’s14 model of cladogenesis and its potential generalization. In this article, we showed that dropout mutations affect the shape of coalescent trees and it is likely that they also affect the shape of phylogenetic trees. Properties of tree shape at the phylogenetic level under different diversification processes has been a topic of ongoing research, with detectable differences in diversification processes arising from differences in internode lengths in a phylogenetic tree.33 Differences in internode lengths phylogenetically are analogous to the effect of dropouts on the ratio of coalescence times. A strong signal for the presence of dropouts would be a clade with a distorted tree shape at the species level combined with distorted coalescence times within species within the clade. Of course, this type of inference requires segregating dropouts within species and polymorphism for dropouts at the clade level, whereas a more common pattern may be that an entire clade is not amplified by PCR due to ancestral dropout mutations. Furthermore, it is important to note that expectations about where dropout mutations occur genealogically at the population or species level using the coalescent may not hold at higher taxonomic levels due to differences in the expected distribution of relative internode lengths in a phylogenetic versus coalescent model.
Regarding sequencing methods, a solution to the occurrence of PCR dropout mutations is to use complementary non-overlapping primer sets that both amplify the same genetic locus.34 Using this approach, a locus in a species’ DNA will be amplified even if there are dropout mutations at a primer binding site. Nevertheless, this approach is not always used and there exist a large number of datasets that did not use complementary non-overlapping primer sets (see Mitchell et al).9 For these studies, approaches to assess coverage would be useful.
This article is a starting point to derive expectations for the effects of PCR efficiencies on the inference of taxon diversity and within-taxon genetic diversity in other contexts. For example, a common approach to biodiversity assessment is DNA metabarcoding.35 DNA metabarcoding is often applied to organisms with smaller effective population sizes than prokaryotes, such that the population-level rate of mutation at primer sites is lower. Accordingly, PCR dropout mutations are expected to be less common in these types of studies, but other factors such as differential PCR inhibition or amplification bias may result in the dropout of individual sequences.36 Generally, accounting for PCR efficiency may be important in studies that seek to infer population-level genetic diversity from metabarcode data.37
Coalescent theory has been applied in metagenomic studies previously. For example, O’Brien et al17 derived a coalescent-based approach to infer the phylogenetic relationships of species from a metagenomic sample, and Bittner et al38 and Liberles et al39 provide a broad overview for integrating coalescent approaches into metagenomic studies. Furthermore, Johnson and Slatkin40 presented a coalescent-based approach to detect recombination in a metagenomic sample. Our article suggests that coalescent theory may allow for the detection of PCR primer dropouts in metagenomic studies.
Conclusions
Primer coverage in microbial metagenomic and eDNA studies is a key uncertainty. In this article, we derived coalescent-based expectations for the pattern of DNA sequence variation within a species in the presence of segregating PCR dropout mutations, where a dropout mutation completely blocks PCR amplification. Dropout mutations can alter coalescence times and the distribution of pairwise differences, which may form the basis of statistical techniques to detect or account for reduced primer coverage.
Acknowledgments
I wish to thank C Somerville for her assistance developing this line of research and two reviewers for their comments that improved this work, as well as the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant program for their support.
Footnotes
Funding:The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the NSERC (Canada) - Discovery program.
Declaration of conflicting interests:The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions: CKG conceived and derived the model, performed its analysis and wrote the paper.
ORCID iD: Cortland K Griswold https://orcid.org/0000-0003-2993-7043
References
- 1. Hugenholtz P, Goebel BM, Pace NR. Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. J Bacteriol. 1998;180:4765-4774. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Tyson GW, Chapman J, Hugenholtz P, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37-43. [DOI] [PubMed] [Google Scholar]
- 3. Venter JC, Remington K, Heidelberg JF, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004;304:66-74. [DOI] [PubMed] [Google Scholar]
- 4. Alberts AB, Johnson A, Lewis J, et al. Molecular Biology of the Cell. 4th ed New York, NY: Garland Science; 2002. [Google Scholar]
- 5. Bru D, Martin-Laurent F, Philippot L. Quantification of the detrimental effect of a single primer-template mismatch by real-time PCR using the 16S rRNA gene as an example. Appl Environ Microb. 2008;74:1660-1663. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Wu JH, Hong PY, Liu WT. Quantitative effects of position and type of single mismatch on single base primer extension. J Microbiol Meth. 2009;77:267-275. [DOI] [PubMed] [Google Scholar]
- 7. Mao DP, Zhou Q, Chen CY, Quan ZX. Coverage evaluation of universal bacterial primers using the metagenomic datasets. BMC Microbiol. 2012;12:66. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Klindworth A, Pruesse E, Schweer, et al. Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic Acids Res. 2012;41:e1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Mitchell AL, Scheremetjew M, Denise H, et al. EBI Metagenomics in 2017: enriching the analysis of microbial communities, from sequence reads to assemblies. Nucleic Acids Research. 2018;46:D726-D735. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Santamaria M, Fosso B, Consiglio A, et al. Reference databases for taxonomic assignment in metagenomics. Brief Bioinform. 2012; 13: 682–695. [DOI] [PubMed] [Google Scholar]
- 11. Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010;6:e1000667. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Kingman JFC. On the genealogy of large populations. J Appl Prob A. 1982;19:27-43. [Google Scholar]
- 13. Tajima F. Evolutionary relationship of DNA sequences in finite populations. Genetics. 1983;105:437-460. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Hey J. Using phylogenetic trees to study speciation and extinction. Evolution. 1992;46:627-640. [DOI] [PubMed] [Google Scholar]
- 15. Liu X, Fu YX, Maxwell TJ, Boerwinkle E. Estimating population genetic parameters and comparing model goodness-of-fit using DNA sequences with error. Genome Res. 2010;20:101-109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Liu X. jPopGen Suite: population genetic analysis of DNA polymorphism from nucleotide sequences with errors. Method Ecol Evol. 2012;3:624-627. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. O’Brien J, Didelot X, Iqbal Z, Amenga-Etego L, Ahiska B, Falush D. A Bayesian approach to inferring the phylogenetic structure of communities from metagenomic data. Genetics. 2014;197:925-937. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Charlesworth B, Morgan MT, Charlesworth D. The effect of deleterious mutations on neutral variation. Genetics. 1993;134:1289-1303. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Charlesworth D, Charlesworth B, Morgan MT. The pattern of neutral molecular variation under the background selection model. Genetics. 1995;141:1619-1632. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Fu YX. Statistical properties of segregating sites. Theor Popul Biol. 1995;48:172-197. [DOI] [PubMed] [Google Scholar]
- 21. Wakeley J. Coalescent Theory. Greenwood Village, CO: Roberts; 2008. [Google Scholar]
- 22. Bharucha-Reid AT. Elements of the Theory of Markov Processes and Their Applications. New York, NY: Dover; 1997. [Google Scholar]
- 23. Wiuf C, Donnelly P. Conditional genealogies and the age of a neutral mutant. Theor Popul Biol. 1999;56:183-201. [DOI] [PubMed] [Google Scholar]
- 24. Ross SM. Introduction to Probability Models. Toronto, ON, Canada: Academic Press; 1997. [Google Scholar]
- 25. Kimura M, Maruyama T. The mutational load with epistatic gene interactions in fitness. Genetics. 1996;54:1337-1351. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Nordborg M. Structured coalescent process on different time scales. Genetics. 1997;146:1501-1514. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123:585-595. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Ford CB, Lin PL, Chase MR, et al. Use of whole genome sequencing to estimate the mutation rate of Mycobacterium tuberculosis during latent infection. Nature Genet. 2011;43:482-486. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Smith JM. The population genetics of bacteria. Proceed Roy Soc Biol Sci. 1991;245:37-41. [Google Scholar]
- 30. Masel J, Griswold CK. The strength of selection against the yeast prion [PSI+]. Genetics. 2009;181:1057-1063. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Rogers AR, Harpending HC. Population growth makes waves in the distribution of pairwise genetic differences. Mol Biol Evol. 1992;9:552-569. [DOI] [PubMed] [Google Scholar]
- 32. He JZ, Shen JP, Zhang LM, et al. Quantitative analyses of the abundance and composition of ammonia-oxidizing bacteria and ammonia-oxidizing archaea of a Chinese upland red soil under long-term fertilization practices. Environ Microbiol. 2007;9:2364-2374. [DOI] [PubMed] [Google Scholar]
- 33. Mooers AØ, Harmon LJ, Blum MGB, Wong DHJ, Heard SB. Some models of phylogenetic tree shape. In: Gascuel O, ed. Reconstructing Evolution: New Mathematical and Computational Advances. Oxford, UK: Oxford University Press; 2007:149-170. [Google Scholar]
- 34. Blais J, Lavoie SB, Giroux S, et al. Risk of misdiagnosis due to allele dropout and false-positive PCR artifacts in molecular diagnostics: analysis of 30,769 genotypes. J Mol Diagnost. 2015;17:505-514. [DOI] [PubMed] [Google Scholar]
- 35. Taberlet P, Coissac E, Pompanon F, Brochmann C, Willerslev E. Towards next-generation biodiversity assessment using DNA metabarcoding. Mol Ecol. 2012;21:2045-2050. [DOI] [PubMed] [Google Scholar]
- 36. Piñol J, Mir G, Gomez-Polo P, Agustí N. Universal and blocking primer mismatches limit the use of high-throughput DNA sequencing for the quantitative metabarcoding of arthropods. Mol Ecol Res. 2015;15:819-830. [DOI] [PubMed] [Google Scholar]
- 37. Elbrecht V, Vamos EE, Steinke D, Leese F. Estimating intraspecific genetic diversity from community DNA metabarcoding data. PeerJ. 2018;6:e4644. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Bittner L, Halary S, Payri C, et al. Some considerations for analyzing biodiversity using integrative metagenomics and gene networks. Biol Direct. 2010;5:47. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Liberles DA, Teufel AI, Liu L, Stadler T. On the need for mechanistic models in computational genomics and metagenomics. Genome Biol Evol. 2013;5:2008-2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. Johnson PLF, Slatkin M. Inference of microbial recombination rates from metagenomic data. PLoS Genet. 2009;5:e1000674. [DOI] [PMC free article] [PubMed] [Google Scholar]