Skip to main content
PLOS ONE logoLink to PLOS ONE
. 2014 Jul 2;9(7):e97899. doi: 10.1371/journal.pone.0097899

A Likelihood Approach to Estimate the Number of Co-Infections

Kristan A Schneider 1,*, Ananias A Escalante 2,3
Editor: Art F Y Poon4
PMCID: PMC4079681  PMID: 24988302

Abstract

The number of co-infections of a pathogen (multiplicity of infection or MOI) is a relevant parameter in epidemiology as it relates to transmission intensity. Notably, such quantities can be built into a metric in the context of disease control and prevention. Having applications to malaria in mind, we develop here a maximum-likelihood (ML) framework to estimate the quantities of interest at low computational and no additional costs to study designs or data collection. We show how the ML estimate for the quantities of interest and corresponding confidence-regions are obtained from multiple genetic loci. Assuming specifically that infections are rare and independent events, the number of infections per host follows a conditional Poisson distribution. Under this assumption, we show that a unique ML estimate for the parameter (Inline graphic) describing MOI exists which is found by a simple recursion. Moreover, we provide explicit formulas for asymptotic confidence intervals, and show that profile-likelihood-based confidence intervals exist, which are found by a simple two-dimensional recursion. Based on the confidence intervals we provide alternative statistical tests for the MOI parameter. Finally, we illustrate the methods on three malaria data sets. The statistical framework however is not limited to malaria.

Introduction

Infections are ubiquitous and ecologically complex processes. Indeed the chain of events conducing to the colonization and replication of parasites within a host involves many environmental, physiological, and genetic factors both in the host and the infectious agent. A common observation in many host-parasite interactions is that there are multiple genetically distinct lineages of the pathogen infecting the same individual host [1][3]. Whereas in some diseases such as malaria, this is considered an important parameter, in others it is still somehow a neglected aspect that is just starting to be considered [2].

The observation of multiple genetic variants or multiplicity of infection (MOI) is indicative of the transmission dynamics since it allows for the co-transmission of different parasite variants or the overlap of several genetic variants due to multiple infectious contacts. Thus, the incidence of MOI or superparasitism per se is an important metric of exposure [2], [4][7]. In addition to its epidemiological importance, as many other ecological processes involving genetically distinct individuals, MOI leads to several outcomes derived from the interactions among lineages. This process is usually referred to as the intra-host dynamics [3].

During the last two decades, the outcomes of intra-host dynamics have been the subject of several theoretical and experimental investigations exploring a broad spectrum of scenarios. Usually, such studies focus on major effects that different interconnected factors have in terms of parasite dispersion (parasite fitness) and/or the elicited manifestations of disease that may lead to an effect on the host's fitness [3], [8][11]. Furthermore, intra-host dynamics also affect the spread of parasite lineages with adaptive mutations conferring resistance to antimicrobial agents or that allow the evasion of immune and/or vaccine-mediated protection [12], [13]. Under all these circumstances, following or measuring MOI as a parameter is essential whenever epidemiological inferences or models involving intra-host dynamics are formulated.

Although it is possible to control or measure the number of distinctive parasite lineages in models and experimental settings (e.g.[14]), a totally different scenario is the one faced by those studying naturally occurring infections in the context of ecological and epidemiological investigations [4][6], [15], [16]. Under such circumstances, MOI is usually measured by ad hoc metrics that rely on a set of genetic markers or the observed polymorphism in one or several genes [2]. The need for an experimental definition of MOI has generated approaches based on phylogenetic frameworks (e.g. many viruses) or some form of multi-locus genotyping [2], [17]. Whereas such approximations have been useful, there is still need for a formal statistical framework that allows the estimation of the actual number of lineages and other approximations to MOI that facilitates and/or considers confounding factors.

Given the broad spectrum of genetic architectures observed in parasitic organisms, it is not possible to define a universal framework of MOI. E.g. HIV accumulates mutations at a rate that allows for the use of phylogenetic base methods [17]. On the other hand, eukaryotic parasites such as Plasmodium, Trypanosoma, Toxoplasma, and Schistosoma [18], [19] and bacteria such as Mycobacterium [16] evolve at a rate at which it is possible to determine a stable number of genetically distinct lineages during the course of an infection given a set of genetic markers. In this investigation, we describe a formal statistical framework to estimate MOI that allows, among other aspects, building formal tests for comparing groups, e.g., before or after deploying an intervention such as a vaccine, complicated versus non-complicated cases, populations with different exposures, among other possibilities.

More specifically, we further develop the maximum-likelihood framework introduced by [20], which allows to estimate MOI and prevalences of pathogen lineages from a single genetic marker, e.g., microsatellite loci. We establish how to compute ML estimates and confidence intervals (or regions) for all involved parameters. Based on these, we show how statistical tests can be constructed to test the parameters. Although, the framework is - in principle - not restricted to a particular disease or species, we applied it to malaria by comparing data sets from three endemic regions with different levels of endemicity.

The philosophy behind the method section's structure is the following. We first establish the general methods and then refine them assuming that the number of co-infections follows a conditional Poisson distribution. This structure embraces a better understanding of how to derive particular results for alternative choices to the Poisson distribution. Moreover, rigorous mathematical proofs are shifted to the appendix. Readers less interested in these technical details should feel free to skip them.

Methods

We adapt the maximum-likelihood method of [20] to estimate the average MOI. This approach is fully compatible with the model of [12], [21] which describes the hitchhiking effect associated with drug resistance in Malaria, for which MOI is a fundamental quantity. Being able to estimate MOI, the model can be ‘reverse engineered’ to reconstruct the evolutionary process underlying drug resistance. By doing so, a formal means is provided to identify those among the many compounding factors, which can be influenced to slow-down or prevent the spread of drug resistance in the course of public health initiatives.

1 Model background

Assume Inline graphic different ‘lineages’ Inline graphic of a pathogen, e.g., Inline graphic alleles at a marker locus (or haplotypes in a non-recombining region), circulate in a given population. Particularly, we have neutral markers in mind characterizing linages, so that their frequencies do not change too rapidly, e.g., due to selection. The Inline graphic lineages considered are those that contribute to infection, not new variants that are generated by mutation inside hosts, but ‘fail’ to participate in transmission.

Because we identify a pathogen with the allele at the considered locus, we will use the terms ‘lineage’ and ‘allele’ synonymously. (We refrain from using the term strain, as we refer here to a genotypic characterization and the term strain may have different meanings across pathogens.)

In vector notation, the lineages' relative frequencies are Inline graphic. An individual (host) is infected by Inline graphic (not necessarily different) lineages of the pathogen with probability Inline graphic. The Inline graphic lineages are sampled randomly from the pathogen population. Hence, within an infection, the combination of pathogen linages follows a multinomial distribution with parameters Inline graphic and Inline graphic. Consequently, the probability that Inline graphic of the infecting linages carry allele Inline graphic (Inline graphic) is given by Inline graphic, where Inline graphic, Inline graphic is a multinomial coefficient, and Inline graphic. Clearly, Inline graphic summarizes the pathogen configuration infecting a host.

In practice, Inline graphic is unknown for a given host. It is possible to detect which alleles (or lineages) are present in a clinical sample, but it is difficult to reliably reconstruct Inline graphic without using next generation sequencing, a technology that is not practical to use in many settings. For instance, if only a single allele, say Inline graphic, is found in a clinical sample, the patient might have been infected by just one parasite lineages (Inline graphic), or co-infected by several lineages (Inline graphic), all of which carry allele Inline graphic. Hence, it is convenient to represent an infection (lineages detected in a patient) by a vector of zeros and ones of length Inline graphic, referring to the detected alleles (lineages). Hence, a clinical sample is represented by a vector Inline graphic, where Inline graphic if Inline graphic is found in the infection, and otherwise Inline graphic. In mathematical terms Inline graphic. (Remember Inline graphic and Inline graphic for Inline graphic). Note that the vector Inline graphic is excluded, which corresponds to no infection. In the following, Inline graphic will always denote a vector of nonnegative integers and Inline graphic a vector of zeros and ones.

Let Inline graphic be the multiplicity of infection (MOI) with distribution Inline graphic. Because Inline graphic is unknown in practice, we aim to estimate it from clinical samples - or rather some summary statistics characterizing Inline graphic.

Assume a total of Inline graphic clinical samples, taken from different hosts roughly at the same time. We assume that the Inline graphic lineages Inline graphic detected in the samples are all lineages circulating in the population. (There is no knowledge of undetectable lineages.) Each clinical sample contains one or more of the Inline graphic lineages (alleles). (We assume that lineages that infected the host have not vanished due to intra-host dynamics, e.g., drug treatments, and that new lineages have not emerged inside the host, e.g. by mutation, recombination etc.) A clinical specimen with allelic (or lineage) configuration Inline graphic could descend from an infection with pathogen configuration Inline graphic as long as Inline graphic. Let Inline graphic denote the expected frequency of clinical specimen with allelic configuration Inline graphic. Then,

graphic file with name pone.0097899.e051.jpg (1)

where the first sum runs over all integers larger than or equal to Inline graphic, as this obviously is the minimum number of parasite lineages that could have caused the infection. The second sum runs over all possible configurations Inline graphic of exactly Inline graphic parasites that lead to the allelic configuration Inline graphic (i.e. Inline graphic), and hence could have potentially infected the host.

It follows, that for a given allele-frequency distribution Inline graphic, Inline graphic is determined by the distribution Inline graphic. If infections with the pathogen are rare, a natural assumption is that the number of pathogens infecting a host is Poisson distributed, or more precisely follows a conditional Poisson distribution (CPD), i.e.,

graphic file with name pone.0097899.e060.jpg (2)

Of note, this conditions on the fact that each host is infected by at least one pathogen. The mean value of this distribution is

graphic file with name pone.0097899.e061.jpg

Assuming the CPD (2), Inline graphic can explicitly be derived. In Analysis (subsection 4.1) it is shown that

graphic file with name pone.0097899.e063.jpg

2 Maximum likelihood

Consider a total of Inline graphic samples or clinical specimen, Inline graphic of which have allelic configuration Inline graphic. Hence, Inline graphic, where the sum runs over all zero-one vectors of length Inline graphic, i.e,. Inline graphic (the case of no infection i.e., Inline graphic is excluded).

Since the (natural) likelihood for observing these samples is Inline graphic, the log-likelihood is given by

graphic file with name pone.0097899.e072.jpg (3)

Assuming the CPD for the number of lineages infecting a host, it is shown in Analysis (subsection 4.2) that the log-likelihood becomes

graphic file with name pone.0097899.e073.jpg (4)

where Inline graphic is the number of samples that contain allele Inline graphic. The prevalence of allele Inline graphic is then Inline graphic. Notably, Inline graphic with equality if and only if exclusively single-lineage infections occur. This is one of two special cases that need to be treated separately. In the other special case all lineages are found in every infection. These cases are somewhat non-generic. We shall therefore formulate the following generic assumption.

Assumption 1 Assume that the sum over the alleles' prevalences is larger than one, but not all alleles are Inline graphic prevalent. In other words, more than one lineage is found in at least one infection, i.e., Inline graphic and not all lineages are found in every infection, i.e., Inline graphic for at least one Inline graphic .

Results

In the following Inline graphic will refer to the parameter of the CPD, or in the general case, to the parameter (or parameter vector) summarizing the distribution Inline graphic. In the latter case Inline graphic has to be interpreted as Inline graphic.

We shall start by deriving the maximum likelihood (ML) estimates for the parameters of interest. Before we do so, we shall start by a rather intuitive observation.

Not surprisingly Inline graphic can never be an ML estimate if multiple alleles are found in at least one sample, as Inline graphic implies single infections only. We summarize this in the following remark which is proved in Analysis (subsection 4.3).

Remark 1 If at least one sample contains more than one allele, i.e., Inline graphic, Inline graphic is not the maximum likelihood estimate.

To obtain the ML estimate for Inline graphic, (4) needs to be maximized on the simplex, either using the method of Lagrange multiplies or by eliminating one of the redundant variables, i.e., by setting e.g., Inline graphic. When using Lagrange multipliers we need to find the zeros of the derivatives of

graphic file with name pone.0097899.e093.jpg (5)

i.e., Inline graphic. The derivatives based on the conditional Poisson distribution are derived in Analysis (subsection 4.4). The equations Inline graphic can be straightforwardly solved by a Newton method, i.e., by iterating

graphic file with name pone.0097899.e096.jpg (6a)
graphic file with name pone.0097899.e097.jpg
graphic file with name pone.0097899.e098.jpg (6b)

and Inline graphic is any initial choice of Inline graphic and Inline graphic. Here, Inline graphic denotes the (transposed) Hessian matrix evaluated at Inline graphic, i.e.,

graphic file with name pone.0097899.e104.jpg (7)

If, in the general case, Inline graphic is a parameter vector, the derivatives above have to be interpreted accordingly.

In the case of the conditional Poisson distribution (2) the entries of the Hessian matrix are derived in Analysis (subsection 4.4).

Clearly, instead of (6) also Inline graphic can be iterated, which, however, is numerically less recommendable. Alternative approaches would be using an iterative least-square algorithm or the EM algorithm (cf. e.g.[22]).

Of note, in general, an ML estimate does neither necessarily exist, nor is it unique, not to mention that closed formulas typically do not exist. Unfortunately, assuming the CPD (2), the ML estimate indeed cannot be calculated explicitly. However, the estimate exists and is unique. Furthermore, although it can be straightforwardly derived by the above methods, the complexity of whole procedure can be greatly simplified.

Result 1 Assume the conditional Poisson distribution (2) for Inline graphic . Under Assumption 1 there is a unique maximum likelihood estimate Inline graphic . The first component Inline graphic is the unique positive solution of the equation.

graphic file with name pone.0097899.e110.jpg (8)

It is found by iterating

graphic file with name pone.0097899.e111.jpg (9)

which converges monotonically and at quadratic rate from any initial value Inline graphic.

The maximum likelihood estimates of the allele frequencies are given by

graphic file with name pone.0097899.e113.jpg (10)

The result is proven in Analysis (subsection 5.1).

For the sake of completeness we shall also consider the instances in which Assumption 1 is violated. In the first situation, only one pathogen lineage is found in each infection, i.e., there is no indication whatsoever of co-infections. The results are summarized in the following remark which is proven in Analysis (subsection 5.1).

Remark 2 Assume that each sample contains only one allele, i.e., Inline graphic . Then the ML estimates are Inline graphic and Inline graphic .

In the other non-generic case that all alleles are found in every sample an ML estimate does not exist, more precisely, it is Inline graphic, implying that – with probability one – all alleles are in every sample independently of the allele-frequency distribution.

Remark 3 Assume Inline graphic for all Inline graphic . Then the ML estimate is “ Inline graphic ” for every allelic distribution.

A proof can be found in Analysis (subsection 5.1).

Of note, the maximum likelihood has an intuitive interpretation. We summarize this as the following result which is proven in Analysis (subsection 5.1).

Remark 4 The maximum likelihood estimate Inline graphic is the set of parameters for which the observed number of samples containing allele Inline graphic equals its expectations, i.e.,

graphic file with name pone.0097899.e123.jpg

Hence, the maximum likelihood maximizes the expectation of the log-likelihood.

1 Confidence intervals from the profile-likelihood

Let Inline graphic denote the ML estimate. Confidence intervals can be derived from the profile-likelihood for each parameter.

We are interested in finding a confidence interval (CI) for Inline graphic. For a fixed value of Inline graphic, the profile likelihood is defined as

graphic file with name pone.0097899.e127.jpg

i.e., as the maximum likelihood taken over the remaining parameters while keeping the parameter of interest fixed. Moreover, denote the maximum likelihood by Inline graphic (clearly Inline graphic). Suppose Inline graphic is the true parameter and Inline graphic the corresponding profile likelihood. Then

graphic file with name pone.0097899.e132.jpg (11)

i.e. twice the difference of the maximum likelihood minus the profile likelihood assuming the true parameter is Inline graphic distributed with one degree of freedom (cf. e.g. [23], chapter 4). This can be used to construct confidence intervals for the true parameter Inline graphic. To construct a CI at the Inline graphic level, we need to find all Inline graphic satisfying

graphic file with name pone.0097899.e137.jpg

i.e., we need to find Inline graphic satisfying Inline graphic, where Inline graphic denotes Inline graphic-quantile of the Inline graphic distribution with Inline graphic degrees of freedom. In other words, the equation Inline graphic needs to be solved. By definition of Inline graphic, this means that Inline graphic needs to be solved with respect to Inline graphic, while simultaneously maximizing Inline graphic with respect to Inline graphic. The latter is done using the method of Lagrange multipliers for fixed Inline graphic, i.e.,

graphic file with name pone.0097899.e151.jpg

is maximized. This leads to the equations Inline graphic. Therefore, following [24] the bound of the confidence intervals are found by solving the following system of equations

graphic file with name pone.0097899.e153.jpg (12)

where Inline graphic

Clearly, Inline graphic can be straightforwardly solved by a Newton method, i.e., by iterating

graphic file with name pone.0097899.e156.jpg (13a)

where (Inline graphic) is the solution of the system of linear equations

graphic file with name pone.0097899.e158.jpg (13b)

and Inline graphic is any initial choice of Inline graphic, Inline graphic and Inline graphic. The derivative Inline graphic is identical to (7) except for the first line, which needs to be replaced by

graphic file with name pone.0097899.e164.jpg (14)

The derivatives of Inline graphic are given by (39). Hence, Inline graphic is given by

graphic file with name pone.0097899.e167.jpg (15)

where all derivatives are given by (39) and (40).

Again, alternatively Inline graphic can be iterated, which however requires to invert the matrix Inline graphic in every iteration step. The alternatives to the Newton method are again the EM algorithm or an iterated least-mean-square algorithm.

To obtain the confidence bounds Inline graphic and Inline graphic it is necessary to iterate (13) from two different initial values. Of note, obtaining one bound for the confidence interval is numerically only as demanding as obtain the ML estimate.

Confidence intervals for the allele frequencies Inline graphic are obtained similarly by iterating (13) with obvious changes. Namely, the first component of the function Inline graphic needs to be replaced by Inline graphic and the Inline graphic-th component by Inline graphic, i.e., Inline graphic is the gradient of Inline graphic with the derivative with respect to Inline graphic replaced by Inline graphic. Consequently Inline graphic is identical to Inline graphic with the Inline graphic-th component replaced by (14).

Importantly, existence and uniqueness of the confidence bounds Inline graphic and Inline graphic can be proved under the assumption of the CPD (2). Moreover, it is possible to significantly reduce the complexity of the Newton method (13) to find the CI's bounds. We obtain the following result, which is proven in Analysis (subsection 5.2).

Result 2 Suppose Assumption 1 holds. If Inline graphic is given by the conditional Poisson distribution (2), the confidence interval for Inline graphic (based on the profile likelihood) is uniquely defined.

The bounds of the confidence interval ( Inline graphic and Inline graphic ) for Inline graphic are obtained by iterating

graphic file with name pone.0097899.e191.jpg (16a)
graphic file with name pone.0097899.e192.jpg (16b)

where

graphic file with name pone.0097899.e193.jpg
graphic file with name pone.0097899.e194.jpg (16d)

and

graphic file with name pone.0097899.e195.jpg (16e)

There are exactly two possible solutions Inline graphic and Inline graphic . The algorithm is converging quadratically for any initial values Inline graphic sufficiently close to the one of the solutions.

The proof is found in Analysis (subsection 5.2).

Formally, the above result holds true in the non-generic cases Inline graphic and Inline graphic. If all samples contain just one lineage, i.e., Inline graphic, the ML estimate is Inline graphic and the confidence interval has the form Inline graphic. If all samples contain all lineages, i.e., Inline graphic the maximum likelihood estimate is Inline graphic and the confidence interval has the form Inline graphic, hence it is infinitely large. Although, formally the result still holds, the asymptotic (11) is no longer true, as discussed in Analysis (subsection 6), rendering the result inapplicable if Assumption 1 is violated.

2 Asymptotic confidence intervals

As an alternative to the profile likelihood, one can use the asymptotic normality of the maximum likelihood to construct confidence intervals. Asymptotically the difference of the maximum likelihood (Inline graphic) and the true parameter (Inline graphic) is normally distributed. However, it is important to notice that - unless one eliminates one of the redundant allele frequencies - the Lagrange multiplier Inline graphic needs to be treated like a regular parameter. The corresponding likelihood function is of course given by (5). Hence, the actual parameters involved are Inline graphic. The difference of the maximum likelihood Inline graphic and the true parameter Inline graphic is asymptotically distributed according to

graphic file with name pone.0097899.e213.jpg (17a)

or

graphic file with name pone.0097899.e214.jpg (17b)

where Inline graphic is the expected Fisher information and Inline graphic is the observed Fisher information (based on sample size Inline graphic). The matrix Inline graphic is the transposed Hessian matrix given by (7).

The expression Inline graphic is the convenient, although imprecise notation, for Inline graphic, where Inline graphic is the Inline graphic-dimensional identity matrix and Inline graphic the symmetric square root of the Fisher information. Namely, any positive semi-definite, symmetric matrix Inline graphic (as it is the case of any covariance matrix, and particularly the Fisher information) has a spectral decomposition Inline graphic, where Inline graphic is orthogonal and Inline graphic is the diagonal matrix that contains all eigenvalues. These are real and nonnegative, and the diagonal matrix that contains the square roots of the eigenvalues is denoted by Inline graphic. Hence, by setting Inline graphic, we have Inline graphic.

An often used alternative notation is

graphic file with name pone.0097899.e231.jpg

or

graphic file with name pone.0097899.e232.jpg

with Inline graphic and Inline graphic.

From (17) the asymptotic distribution of the parameters of interest Inline graphic follows immediately by dropping the ‘dummy’ variable Inline graphic and the corresponding rows and column in the inverse Fisher information. Of note, this is not identical to ‘formally’ derive the inverse Fisher information based on Inline graphic and Inline graphic. Namely, it is important to drive the asymptotic covariance matrix with respect to Inline graphic and Inline graphic.

Since Inline graphic the bounds for the Inline graphic CI for Inline graphic are given by

graphic file with name pone.0097899.e244.jpg (18)

and those for the components of Inline graphic by

graphic file with name pone.0097899.e246.jpg (19)

Here, Inline graphic denotes the Inline graphic quantile of the standard normal distribution.

Of course, when using the expected Fisher information, Inline graphic needs to be replaced by Inline graphic. Under the assumption of the conditional Poisson distribution (2), the second derivatives Inline graphic needed to derive the Fisher information are calculated in Analysis (subsection 4.4; eq.39). Moreover, evaluated at the maximum likelihood estimate, Inline graphic, it is seen that the expected and observed Fisher information are identical, i.e., Inline graphic, when assuming (2).

With some algebraic manipulation it is possible to simplify the expressions for the confidence intervals assuming the CPD (2).

Result 3 Suppose the number of co-infections follow the conditional Poisson distribution (2) and that Assumption 1 holds. Then an asymptotic Inline graphic -confidence interval for Inline graphic is given by

graphic file with name pone.0097899.e256.jpg (20)

Alternatively, the following formula, requires just the ML estimate for Inline graphic

graphic file with name pone.0097899.e258.jpg (21)

For a proof, see Analysis (subsection 5.3).

In the non-generic case Inline graphic for all Inline graphic, the ML estimate is not unique, and we have Inline graphic. Hence, asymptotic CIs make no sense in this case, neither for Inline graphic nor for the frequencies Inline graphic.

In the case Inline graphic, it also impossible to derive CIs as the asymptotics (17) break down (cf. subsection 6 in Analysis).

Explicit formulas for the CIs of the allele frequencies are obtained similarly.

Result 4 Under the same assumptions as Result 3, an asymptotic Inline graphic -confidence interval for Inline graphic is given by

graphic file with name pone.0097899.e267.jpg (22)

The proof can again be found in Analysis (subsection 5.3).

3 Testing the parameters

In practice, data from several loci is typically available, each of which yields a different ML estimate or there might be some prior estimate for the parameters of interest. Depending on particular properties of the marker loci (mutation rate, allele-frequency spectrum, biochemical issues in determining motif repeats, etc.) different marker loci will lead to different ML estimates. Hence, it is desirable to test whether different estimates are significantly different. The confidence intervals can be adapted to test the parameters.

Clearly, at different marker loci, different alleles will segregate and the allele-frequency spectra will be very different. Hence, for the present purpose, it is meaningless to compare the allele frequencies at different loci. However, the estimate for Inline graphic should be consistent, as this parameter is the same for all loci. Consequently, in the following we will focus on testing Inline graphic and present three alternative tests for the null hypothesis Inline graphic vs. the alternative Inline graphic.

3.1 The likelihood-ratio test

The first test is rather straightforward. Since

graphic file with name pone.0097899.e272.jpg (23)

under the null hypothesis Inline graphic, it is rejected at significance level Inline graphic if

graphic file with name pone.0097899.e275.jpg

In other words, we reject the null hypothesis for any Inline graphic that lies outside the Inline graphic-confidence interval of Inline graphic, which are obtained as outlined above in “Confidence intervals from the profile likelihood”. Therefore, this test requires no additional numerical effort if the confidence intervals were already derived.

The corresponding p-value is given by

graphic file with name pone.0097899.e279.jpg (24)

To calculate the p-value, Inline graphic needs to be derived first. Similarly as in section in “Confidence intervals from the profile likelihood”, this leads to the equations Inline graphic. Therefore, the system of equations

graphic file with name pone.0097899.e282.jpg (25)

needs to be solved by a Newton method, i.e., by iterating

graphic file with name pone.0097899.e283.jpg (26a)
graphic file with name pone.0097899.e284.jpg
graphic file with name pone.0097899.e285.jpg (26b)

and Inline graphic is any initial choice of Inline graphic and Inline graphic. The derivative Inline graphic is obtained from (7) by deleting the first row and column and substituting Inline graphic, i.e.,

graphic file with name pone.0097899.e291.jpg (27)

where all derivatives are given by (39) and (40).

Result 5 Suppose Assumption 1 and Inline graphic holds. In the case of the conditional poisson distribution, the p-value under the null hypothesis Inline graphic is given by (24), where Inline graphic is given by (4) with Inline graphic and Inline graphic given by

graphic file with name pone.0097899.e297.jpg (28)

where Inline graphic is the solution of (16e) with Inline graphic.

The solution Inline graphic is found by iterating

graphic file with name pone.0097899.e301.jpg (29)

The proof is presented in Analysis (subsection 5.4).

In case of Inline graphic, there are two possibilities. If Inline graphic, then Inline graphic. Hence, the null hypothesis is always rejected. This is clear, because if Inline graphic is the true parameter, it is impossible to observe data Inline graphic with Inline graphic (see Remark 7 in Analysis, subsection 6). However, if Inline graphic, then Inline graphic and Inline graphic, and the null hypothesis is always accepted.

Therefore, in the case of Inline graphic the test can still be formally performed in a meaningful way. However, note that the asymptotic (23) does not long hold true, as Inline graphic does not lie in the interior of the parameter space.

3.2 The score test

In the following, for any parameter choice Inline graphic, let Inline graphic by the corresponding profile-likelihood estimate, i.e., Inline graphic, where Inline graphic is the Inline graphic dimensional simplex. By using a dummy variable as before, Inline graphic is obtained from Inline graphic. The Fisher information can be written as

graphic file with name pone.0097899.e320.jpg

where Inline graphic is obtained from the Fisher information with the first row and column deleted. The definitions of the remaining sub-matrices follow accordingly.

A test for the null hypothesis Inline graphic vs. the alternative Inline graphic is obtained by using the fact that

graphic file with name pone.0097899.e324.jpg (30)

(cf. Remark 6 in subsection 5.4 of Analysis). The function

graphic file with name pone.0097899.e325.jpg (31)

serves as test statistic, where the data is Inline graphic. The test rejects Inline graphic at the Inline graphic-level if Inline graphic The corresponding p-value is Inline graphic.

Note that it is legitimate to write Inline graphic on the left-hand side of (30) because Inline graphic. However, it is nevertheless important to derive the asymptotic variance from Inline graphic.

Alternatively, the expected Fisher information Inline graphic in (30) and (31) can be replaced by the observed Fisher information Inline graphic. However, if Inline graphic is not the ML estimate, Inline graphic. As proven Analysis (subsection 5.4), one obtains for the CPD:

Result 6 Consider the score test for the null hypothesis Inline graphic vs. the alternative Inline graphic under the assumptions of Result 5. The test statistic based on the observed Fisher information is

graphic file with name pone.0097899.e340.jpg (32)

and that based on the expected Fisher information is

graphic file with name pone.0097899.e341.jpg (33)

The p-values are Inline graphic in either case. The frequencies Inline graphic are derived as specified in Result 5.

Of note, instead of (30) the ML estimate can be used as a plug-in estimate for the asymptotic variance, i.e., Inline graphic. In this case, it is not necessary to distinguish between the expected and observed Fisher information as they coincide (cf. section “Asymptotic confidence intervals”).

In summary one obtains:

Remark 5 Under the assumptions of Result 6, a test statistic for the null hypothesis Inline graphic vs. the alternative Inline graphic is

graphic file with name pone.0097899.e347.jpg (34)

where Inline graphic and Inline graphic are sample size and number of alleles, in the data yielding the estimate Inline graphic.

The proof is analogously to the one of Result 6.

The test cannot be applied in the special cases Inline graphic or Inline graphic for all Inline graphic, as the asymptotic (30) no longer holds true (cf. subsection 6 of Analysis).

3.3 The Wald test

A third test for the null hypothesis Inline graphic is an adaptation of the Wald test for the profile likelihood. It is based on the same asymptotic properties that we used to derive confidence intervals namely Inline graphic. This is exactly the same as the asymptotic Inline graphic as Inline graphic.

This implies Inline graphic or Inline graphic. Hence, the test statistic

graphic file with name pone.0097899.e360.jpg

can be used. The p-value is Inline graphic.

Now, we shall consider again the CPD. An explicit expression for Inline graphic is given by (54). Hence, we obtain:

Result 7 Under the assumptions of Result 5, the Wald test for the null hypothesis Inline graphic vs. the alternative Inline graphic has the test statistic

graphic file with name pone.0097899.e365.jpg (35)

based on the (expected or observed) Fisher information.

The p-values are Inline graphic in either case. Here, Inline graphic and the frequencies Inline graphic are derived as specified in Result 1.

Alternatively, if the profile-likelihood estimate based on Inline graphic is used as a plug-in for the asymptotic variance, one can employ Inline graphic or Inline graphic.

In the first case, using (53) implies that the test statistic changes to

graphic file with name pone.0097899.e372.jpg (36)

In the second case, (54) implies that the test statistic changes to

graphic file with name pone.0097899.e373.jpg (37)

Also the Wald test cannot be applied in the special cases Inline graphic or Inline graphic for all Inline graphic, as the asymptotic for Inline graphic no longer holds true (cf. subsection 6 of Analysis).

4 Testing the method

Although - as we have seen - most of the theory works quite general, assuming a CPD for the number of co-infections permits to derive explicit results or, at least, reduces the complexity significantly. However, assuming a CPD might not be justified. Therefore, it is desirable to have a test for the model's fit. Namely, let

graphic file with name pone.0097899.e378.jpg

be the likelihood assuming a perfect fit to the data, in which the expected frequencies of infection with stain configuration Inline graphic equal their observed frequencies. In other words, Inline graphic is the maximum likelihood of the saturated model. As there are Inline graphic possible allelic configurations Inline graphic infecting a host, Inline graphic has Inline graphic degrees of freedom. The maximum likelihood Inline graphic of the reduced model (assuming the CPD) has Inline graphic independent allele frequencies and one Poisson parameter. Therefore,

graphic file with name pone.0097899.e387.jpg (38)

Hence, the following test can be used.

Result 8 To test Inline graphic : “the conditional Poisson distribution is justified” vs. Inline graphic : “the conditional Poisson distribution is not justified”, the test-statistic Inline graphic can be used. The p-value is given by . Inline graphic

It should be mentioned that the above test might perform poorly if the number of lineages or alleles Inline graphic is large. The reason is that the Inline graphic distribution has too many degrees of freedom. This might be the case when using hyper-mutable microsatellite markers with 10 or more alleles found across samples.

Application to data

As an illustration, the methods are applied to three previously-described data sets [25][27]. Each of which comprises molecular data from P. falciparum-infected blood samples from endemic areas with different levels of malaria incidence. For each blood sample, parasite DNA was extracted and several microsatellite markers assayed.

1 Preliminary remarks

It is important to note beforehand that only (selectively) neutral markers should be included in the analysis. Namely, loci linked to others that are targets of selection (e.g., mdr1, crt, dhfr, dhps in P. falciparum that are associated with selection for drug resistance) will have skewed allele-frequency distribution. Hence, using these markers might lead to artifacts and severe misinferences. In practice, a marker located on a chromosome not carrying a strongly selected gene (e.g. resistance-conferring gene), can be regarded to be neutral. Moreover, clinical samples from groups that will be compared need to consider confounding effects such as differences in treatment polices, control interventions, and changing transmission intensities (e.g., a group should not contain samples from two time points during which treatment policies changed). By not considering such effects, the estimates of MOI would be inappropriate. For these reasons, we only used parts of the available data sets.

2 Data description

The first data set emerged from a longitudinal study conducted in Asembo Bay, a hyper-endemic region in Kenya, and was described in [27]. We included five (neutral) microsatellites on chromosome 2 and four (neutral) markers on chromosome 3. Additionally, we included two markers on chromosome 8, quite close to dhfr, which are common to all three data sets and meet Assumption 1. Only blood samples collected in the first study year (mid 1993 to mid 1994) were included, resulting in 42 blood samples.

The second data set described in [26] is from a study from Yaoundé, Cameroon, a region of intermediate/high transmission. Besides the two markers on chromosome 8 mentioned above, we included all eight available (neutral) microsatellite markers on chromosomes 2 and 3 from all 331 blood samples (data of one of the 332 original samples was unavailable).

The third data set is from Bolivar State, Venezuela, a region of low transmission. It was described in [25] and consists of 97 blood samples. Due to the low transmission intensities, for most markers each blood samples contains only one allele, violating Assumption 1. We included all markers that met Assumption 1 as well as all available neutral markers. Particularly, we included four on chromosome 2 and three on chromosome 3, two markers on chromosome 8 and one on chromosome 4, which are sufficiently distant from respectively dhps and dhfr to be considered neutral, and the two makers on chromosome 4, which were also included in the other data sets. All 97 blood samples were used.

3 Results

The results are summarized in Figures 1 and 2 and Tables 13. In all cases, the test for the model fit (cf. Result 8) justified the assumption of the CPD (cf. Tables 13). This is important because the three locations exhibit different transmission intensities. In all three regions, the ML estimates Inline graphic or rather the mean MOI, Inline graphic, obtained from different marker loci are fairly consistent. As expected, most variation in the estimates is observed in Kenya because of the low sample size. Moreover, the transmission intensities are stronger, which leads to more variation in allele-frequency spectra among marker loci, resulting in more variation among the ML estimates.

Figure 1. Shown are the ML estimates Inline graphic (dots) and their respective profile-likelihood-based (blue) and asymptotic (green) CIs for the data from Kenya (A), Cameroon (B) and Venezuela (C) for several microsatellite markers each.

Figure 1

Figure 2. Average ML estimates by region.

Figure 2

Averages are the arithmetic mean of the ML estimates Inline graphic 2 standard deviations derived from the microsatellite loci, which are common to all data sets, including (blue) and excluding (green) locus L1, which appears to be hyper-mutable in Kenya and Cameroon.

Table 1. Estimates for each locus of the data set from Kenya.

locus lower bound Inline graphic upper bound 2(LNL 1) d.f.
U7 1.00194 1.03409 1.15244 6.40471 9
0.968395 1.10265
L5 1.21506 1.38975 1.64696 67.8528 15
1.18622 1.61235
J3 1.16387 1.32208 1.56625 44.993 16
1.1331 1.52876
J6 1.13457 1.27344 1.49108 58.3296 15
1.10558 1.45595
U6 1.15044 1.29506 1.51735 65.1444 14
1.12211 1.48319
L4 1.18509 1.34319 1.57899 89.2578 18
1.15735 1.54568
U5 1.16453 1.31318 1.53811 76.1215 20
1.13692 1.50489
K6 1.31334 1.51443 1.7943 134.024 26
1.28687 1.76291
L1 1.3654 1.59303 1.90742 87.4142 16
1.33699 1.87367
c4 1.15248 1.30977 1.55585 15.9715 7
1.12049 1.51705
b3 1.06529 1.16656 1.34475 34.7327 16
1.03537 1.30777

Each row shows, locus name, lower profile-likelihood (top) and asymptotic (bottom) confidence bound, ML estimate, upper profile-likelihood (top) and asymptotic (bottom) confidence bound. For the confidence, bounds α = 0.05 was assumed. Moreover, the test statistic for the fit of the CPD (2) is shown as well as the corresponding degrees of freedom. In all cases, the outcomes are not significant, suggesting that the assumption of the CPD is justified.

Table 3. See description of Table 1 but for the Venezuela data set.

locus lower bound Inline graphic upper bound 2(LNL 1) d.f.
J3 N/A 1 N/A N/A N/A
N/A N/A
J6 N/A 1 N/A N/A N/A
N/A N/A
U6 N/A 1 N/A N/A N/A
N/A N/A
L4 N/A 1 N/A N/A N/A
N/A N/A
U5 0.974273 1.02745 1.08251 8.32780 8
1.00156 1.12327
K6 0.971082 1.03104 1.09339 8.06610 3
1.00176 1.13908
L1 0.984526 1.04242 1.10251 0.00000 2
1.00703 1.13188
c4 0.973367 1.02863 1.08592 9.79400 3
1.00163 1.1273
b3 0.99278 1.06223 1.13479 3.66900 4
1.01538 1.16345
fr13 0.981231 1.05152 1.12504 0.20579 3
1.00852 1.16137
ps6 0.98346 1.04538 1.1098 0.00000 2
1.00752 1.14139
ps7 0.978848 1.02256 1.06754 1.01430 4
1.00128 1.10032

N/A indicates that that the method is not applicable (cf. Analysis, section 6).

Table 2. See description of Table 1 but for the Cameroon data set.

locus lower bound Inline graphic upper bound 2(LNL 1) d.f.
L5 1.12239 1.17804 1.23538 165.239 27
1.12754 1.24098
J3 1.11596 1.17407 1.23404 105.218 26
1.12171 1.24032
J6 1.15263 1.21385 1.27704 178.18 25
1.15774 1.28258
U6 1.17975 1.23815 1.29829 270.763 32
1.18389 1.30274
L4 1.17469 1.24032 1.30817 222.664 29
1.17986 1.31378
U5 1.18476 1.25169 1.32089 195.916 24
1.18987 1.32643
K6 1.18436 1.24819 1.31408 294.437 40
1.18908 1.31919
L1 1.28997 1.36794 1.44861 332.781 40
1.29451 1.45349
c4 1.08125 1.20363 1.33427 0.958866 9
1.10312 1.36155
b3 1.1223 1.18418 1.24816 75.4321 27
1.12849 1.255

From Figure 1 it is apparent that the estimates for MOI are highest in Kenya, followed by Cameroon, whereas they are very low in Venezuela. This is summarized in Figure 2 showing that the average ML estimates across the regions differ by several standard deviations.

The 95% profile-likelihood CIs for Inline graphic, given by Inline graphic, are reasonably large for the data sets from Cameroon and Venezuela (cf. Figure 1). However, due to the relatively small sample size, they are much less informative for the Kenya dataset.

The asymptotic confidence intervals agree well with the profile-likelihood CIs (cf. Figure 1 and Tables 13). This is particularly true for Cameroon, as expected because of the large sample size. The profile-likelihood CIs from the Kenya and Venezuela data are asymmetric while, the asymptotic CIs are - by definition - symmetric (however, the transformation Inline graphic results in some asymmetry). (Note that, unlike profile-likelihood-based intervals, asymptotic CIs are not transformation respecting, i.e., Inline graphic is the transformed CI of Inline graphic, not the CI of Inline graphic.) In relative terms, this is more pronounced in Venezuela than in the Kenya data set. The reason is that the ML estimates Inline graphic from the Venezuela data are close to zero, i.e., the boundary of the parameter range. This results in a very skewed likelihood function, yielding quite asymmetric profile-likelihood CIs. On the contrary, in Kenya, the ML estimates are rather large, and the likelihood function tends to be symmetric around its maximum.

Furthermore, we tested for pairwise differences between the estimates based on different marker loci. Tables 46 report the p-values for the likelihood-ratio, the Score, and the Wald test for the three regions. In all data sets, all tests perform equally well. There are some discrepancies, mainly due to the above mentioned skewness of the likelihood function. In the case of a skewed likelihood function, the likelihood-ratio test is the most preferable, because it accounts for the skewness.

Table 4. Pairwise tests of ML estimates from obtained from the Kenya data set.

locus U7 L5 J3 J6 U6 L4 U5 K6 L1 c4 b3
U7 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0023
1.0000 0.0006 0.0030 0.0056 0.0033 0.0012 0.0020 0.0000 0.0000 0.0048 0.0514
1.0000 0.0004 0.0021 0.0043 0.0024 0.0007 0.0014 0.0000 0.0000 0.0034 0.0477
L5 0.0001 1.0000 0.5316 0.2527 0.3527 0.6536 0.4511 0.2585 0.0866 0.4667 0.0194
0.0000 1.0000 0.5097 0.2074 0.3159 0.6421 0.4236 0.2946 0.1258 0.4386 0.0032
0.0000 1.0000 0.5092 0.2053 0.3145 0.6419 0.4228 0.2934 0.1237 0.4376 0.0024
J3 0.0006 0.5062 1.0000 0.6079 0.7762 0.8277 0.9252 0.0632 0.0151 0.9045 0.0790
0.0000 0.5284 1.0000 0.5910 0.7708 0.8306 0.9246 0.1023 0.0396 0.9034 0.0342
0.0000 0.5279 1.0000 0.5907 0.7707 0.8306 0.9246 0.1002 0.0375 0.9034 0.0315
J6 0.0021 0.2273 0.6101 1.0000 0.8096 0.4465 0.6574 0.0141 0.0026 0.7079 0.1987
0.0000 0.2744 0.6265 1.0000 0.8136 0.4753 0.6696 0.0394 0.0146 0.7176 0.1378
0.0000 0.2727 0.6262 1.0000 0.8136 0.4747 0.6694 0.0374 0.0131 0.7174 0.1347
U6 0.0012 0.3379 0.7826 0.8139 1.0000 0.6088 0.8439 0.0292 0.0060 0.8825 0.1333
0.0000 0.3753 0.7878 0.8100 1.0000 0.6238 0.8464 0.0615 0.0232 0.8841 0.0769
0.0000 0.3742 0.7878 0.8099 1.0000 0.6236 0.8464 0.0593 0.0214 0.8841 0.0736
L4 0.0003 0.6545 0.8381 0.4722 0.6207 1.0000 0.7571 0.1053 0.0281 0.7502 0.0516
0.0000 0.6657 0.8353 0.4436 0.6058 1.0000 0.7510 0.1471 0.0583 0.7432 0.0172
0.0000 0.6655 0.8353 0.4428 0.6055 1.0000 0.7510 0.1451 0.0561 0.7432 0.0151
U5 0.0007 0.4476 0.9290 0.6720 0.8474 0.7546 1.0000 0.0498 0.0113 0.9732 0.0941
0.0000 0.4749 0.9296 0.6600 0.8448 0.7606 1.0000 0.0870 0.0333 0.9731 0.0451
0.0000 0.4742 0.9296 0.6598 0.8448 0.7605 1.0000 0.0848 0.0313 0.9731 0.0421
K6 0.0000 0.3001 0.1091 0.0331 0.0526 0.1364 0.0744 1.0000 0.5466 0.0931 0.0011
0.0000 0.2651 0.0698 0.0120 0.0250 0.0977 0.0421 1.0000 0.5613 0.0557 0.0000
0.0000 0.2636 0.0674 0.0105 0.0231 0.0954 0.0400 1.0000 0.5610 0.0528 0.0000
L1 0.0000 0.1093 0.0327 0.0075 0.0127 0.0396 0.0189 0.5391 1.0000 0.0278 0.0002
0.0000 0.0747 0.0125 0.0012 0.0030 0.0180 0.0057 0.5241 1.0000 0.0097 0.0000
0.0000 0.0725 0.0111 0.0008 0.0024 0.0165 0.0049 0.5237 1.0000 0.0083 0.0000
c4 0.0008 0.4259 0.9016 0.6976 0.8754 0.7267 0.9709 0.0453 0.0101 1.0000 0.1006
0.0000 0.4551 0.9027 0.6873 0.8737 0.7342 0.9710 0.0816 0.0312 1.0000 0.0500
0.0000 0.4544 0.9027 0.6871 0.8736 0.7341 0.9710 0.0794 0.0292 1.0000 0.0470
b3 0.0346 0.0067 0.0552 0.1581 0.0916 0.0237 0.0541 0.0000 0.0000 0.0824 1.0000
0.0005 0.0333 0.1130 0.2219 0.1529 0.0659 0.1086 0.0026 0.0010 0.1464 1.0000
0.0002 0.0308 0.1099 0.2196 0.1501 0.0631 0.1057 0.0021 0.0007 0.1429 1.0000

The ML estimate obtained from the locus specified in the rows (H 0) is tested against the estimates from the loci specified in the columns (HA). In each cell, the p-values for the likelihood-ratio (top), Score (middle), and Wald test (bottom) are shown. The Score and Wald tests are the version of eqs. (32) and (35), respectively. Significant differences are indicated in bold.

Table 6. See description of Table 4 but for the Venezuela data set.

locus L4 U5 K6 L1 c4 b3 fr13 ps6 ps7
L4 N/A N/A N/A N/A N/A N/A N/A N/A N/A
N/A N/A N/A N/A N/A N/A N/A N/A N/A
N/A N/A N/A N/A N/A N/A N/A N/A N/A
U5 N/A 1.0000 0.9047 0.5670 0.9669 0.2143 0.4224 0.5133 0.8397
N/A 1.0000 0.9084 0.6177 0.9673 0.3334 0.5093 0.5765 0.8291
N/A 1.0000 0.9084 0.6173 0.9673 0.3317 0.5085 0.5759 0.8291
K6 N/A 0.9008 1.0000 0.6754 0.9349 0.2827 0.5108 0.6147 0.7371
N/A 0.8968 1.0000 0.7045 0.9331 0.3859 0.5747 0.6553 0.7088
N/A 0.8967 1.0000 0.7043 0.9331 0.3846 0.5741 0.6550 0.7086
L1 N/A 0.6415 0.7433 1.0000 0.6751 0.5352 0.7913 0.9253 0.4825
N/A 0.5898 0.7164 1.0000 0.6329 0.5825 0.8035 0.9269 0.3846
N/A 0.5895 0.7162 1.0000 0.6325 0.5821 0.8035 0.9268 0.3831
c4 N/A 0.9665 0.9367 0.6027 1.0000 0.2359 0.4512 0.5465 0.8047
N/A 0.9660 0.9384 0.6457 1.0000 0.3501 0.5303 0.6018 0.7890
N/A 0.9660 0.9384 0.6453 1.0000 0.3485 0.5296 0.6014 0.7889
b3 N/A 0.3485 0.4354 0.5643 0.3751 1.0000 0.7844 0.6391 0.2254
N/A 0.2148 0.3242 0.5138 0.2505 1.0000 0.7711 0.6033 0.0871
N/A 0.2128 0.3221 0.5129 0.2468 1.0000 0.7710 0.6028 0.0833
fr13 N/A 0.4858 0.5827 0.7771 0.5167 0.7529 1.0000 0.8551 0.3411
N/A 0.3880 0.5150 0.7631 0.4303 0.7668 1.0000 0.8491 0.2077
N/A 0.3869 0.5142 0.7630 0.4286 0.7667 1.0000 0.8490 0.2047
ps6 N/A 0.5865 0.6872 0.9235 0.6193 0.6056 0.8612 1.0000 0.4314
N/A 0.5191 0.6477 0.9218 0.5626 0.6402 0.8667 1.0000 0.3188
N/A 0.5186 0.6473 0.9218 0.5618 0.6399 0.8667 1.0000 0.3168
ps7 N/A 0.8500 0.7629 0.4201 0.8192 0.1342 0.3064 0.3773 1.0000
N/A 0.8592 0.7853 0.5076 0.8323 0.2697 0.4269 0.4769 1.0000
N/A 0.8592 0.7852 0.5066 0.8323 0.2675 0.4256 0.4758 1.0000

N/A indicates that that the test is not applicable (cf. Analysis, section 6). Results for loci J3, J6, and U6 are not shown because the tests are also not applicable (as for locus L4).

Table 5. See description of Table 4 but for the Cameroon data set.

locus L5 J3 J6 U6 L4 U5 K6 L1 c4 b3
L5 1.0000 0.8960 0.2296 0.0282 0.0428 0.0171 0.0176 0.0000 0.6775 0.8465
1.0000 0.8953 0.2552 0.0439 0.0637 0.0312 0.0314 0.0000 0.6899 0.8480
1.0000 0.8953 0.2548 0.0436 0.0631 0.0307 0.0310 0.0000 0.6899 0.8480
J3 0.8896 1.0000 0.1787 0.0184 0.0299 0.0113 0.0115 0.0000 0.6282 0.7480
0.8904 1.0000 0.2058 0.0316 0.0483 0.0230 0.0229 0.0000 0.6445 0.7522
0.8904 1.0000 0.2054 0.0313 0.0478 0.0226 0.0225 0.0000 0.6445 0.7522
J6 0.2442 0.2195 1.0000 0.4042 0.4182 0.2490 0.2746 0.0000 0.8764 0.3811
0.2189 0.1918 1.0000 0.4189 0.4340 0.2717 0.2956 0.0001 0.8746 0.3595
0.2185 0.1913 1.0000 0.4188 0.4338 0.2713 0.2953 0.0001 0.8746 0.3593
U6 0.0601 0.0572 0.4610 1.0000 0.9489 0.6909 0.7583 0.0002 0.6138 0.1255
0.0406 0.0370 0.4468 1.0000 0.9490 0.6956 0.7611 0.0010 0.5961 0.0979
0.0401 0.0364 0.4467 1.0000 0.9490 0.6956 0.7611 0.0010 0.5961 0.0974
L4 0.0522 0.0500 0.4234 0.9428 1.0000 0.7394 0.8101 0.0003 0.5929 0.1122
0.0340 0.0311 0.4075 0.9427 1.0000 0.7427 0.8119 0.0012 0.5734 0.0853
0.0335 0.0306 0.4073 0.9427 1.0000 0.7427 0.8119 0.0012 0.5733 0.0848
U5 0.0240 0.0240 0.2604 0.6605 0.7426 1.0000 0.9160 0.0011 0.4914 0.0604
0.0125 0.0119 0.2379 0.6554 0.7392 1.0000 0.9157 0.0033 0.4621 0.0392
0.0122 0.0116 0.2376 0.6553 0.7392 1.0000 0.9157 0.0032 0.4620 0.0388
K6 0.0307 0.0303 0.3047 0.7436 0.8194 0.9192 1.0000 0.0007 0.5213 0.0736
0.0172 0.0162 0.2838 0.7406 0.8178 0.9195 1.0000 0.0025 0.4950 0.0503
0.0169 0.0158 0.2835 0.7406 0.8178 0.9195 1.0000 0.0024 0.4950 0.0499
L1 0.0000 0.0000 0.0001 0.0002 0.0013 0.0035 0.0017 1.0000 0.0429 0.0000
0.0000 0.0000 0.0000 0.0000 0.0003 0.0012 0.0005 1.0000 0.0145 0.0000
0.0000 0.0000 0.0000 0.0000 0.0003 0.0011 0.0004 1.0000 0.0144 0.0000
c4 0.3971 0.3531 0.7433 0.2282 0.2539 0.1366 0.1495 0.0000 1.0000 0.5589
0.3781 0.3306 0.7469 0.2498 0.2771 0.1618 0.1738 0.0000 1.0000 0.5469
0.3779 0.3303 0.7468 0.2495 0.2768 0.1613 0.1734 0.0000 1.0000 0.5468
b3 0.8332 0.7420 0.3252 0.0514 0.0710 0.0306 0.0323 0.0000 0.7548 1.0000
0.8315 0.7378 0.3465 0.0708 0.0950 0.0485 0.0498 0.0000 0.7621 1.0000
0.8315 0.7378 0.3463 0.0704 0.0945 0.0480 0.0494 0.0000 0.7621 1.0000

Tables 79 compare the three versions of the Score test, while Tables 1012 compare those for the Wald test. The results are fairly consistent. However, the versions given by eqs. 34, 37 and 36 of the Score and Wald tests, respectively tend to be most inconsistent with the other tests, especially the likelihood-ratio test. The reason is that these use the roughest approximations.

Table 7. The same as Table 4 but with the p-value of the three versions, according to eqs. 33 (top), 32 (middle), and 34 (bottom) of the Score test.

locus U7 L5 J3 J6 U6 L4 U5 K6 L1 c4 b3
U7 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
1.0000 0.0006 0.0030 0.0056 0.0033 0.0012 0.0020 0.0000 0.0000 0.0048 0.0514
1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
L5 0.0021 1.0000 0.5434 0.2794 0.3740 0.6603 0.4671 0.2351 0.0656 0.4817 0.0384
0.0000 1.0000 0.5097 0.2074 0.3159 0.6421 0.4236 0.2946 0.1258 0.4386 0.0032
0.3113 1.0000 0.5732 0.3465 0.4250 0.6755 0.5040 0.1888 0.0316 0.5206 0.1439
J3 0.0063 0.4923 1.0000 0.6173 0.7793 0.8258 0.9255 0.0429 0.0068 0.9050 0.1112
0.0000 0.5284 1.0000 0.5910 0.7708 0.8306 0.9246 0.1023 0.0396 0.9034 0.0342
0.3298 0.4597 1.0000 0.6398 0.7867 0.8215 0.9264 0.0159 0.0007 0.9065 0.2261
J6 0.0138 0.1988 0.6004 1.0000 0.8072 0.4283 0.6498 0.0057 0.0005 0.7023 0.2335
0.0000 0.2744 0.6265 1.0000 0.8136 0.4753 0.6696 0.0394 0.0146 0.7176 0.1378
0.3467 0.1398 0.5755 1.0000 0.8014 0.3862 0.6317 0.0004 0.0000 0.6876 0.3336
U6 0.0097 0.3146 0.7796 0.8162 1.0000 0.5994 0.8423 0.0156 0.0018 0.8816 0.1685
0.0000 0.3753 0.7878 0.8100 1.0000 0.6238 0.8464 0.0615 0.0232 0.8841 0.0769
0.3387 0.2623 0.7718 0.8217 1.0000 0.5772 0.8386 0.0029 0.0000 0.8793 0.2785
L4 0.0045 0.6475 0.8397 0.4882 0.6292 1.0000 0.7607 0.0812 0.0157 0.7539 0.0801
0.0000 0.6657 0.8353 0.4436 0.6058 1.0000 0.7510 0.1471 0.0583 0.7432 0.0172
0.3235 0.6311 0.8437 0.5265 0.6492 1.0000 0.7688 0.0428 0.0030 0.7635 0.1944
U5 0.0073 0.4304 0.9287 0.6788 0.8488 0.7508 1.0000 0.0315 0.0045 0.9733 0.1276
0.0000 0.4749 0.9296 0.6600 0.8448 0.7606 1.0000 0.0870 0.0333 0.9731 0.0451
0.3326 0.3907 0.9279 0.6949 0.8523 0.7421 1.0000 0.0096 0.0003 0.9734 0.2417
K6 0.0003 0.3211 0.1338 0.0524 0.0747 0.1621 0.0985 1.0000 0.5373 0.1174 0.0052
0.0000 0.2651 0.0698 0.0120 0.0250 0.0977 0.0421 1.0000 0.5613 0.0557 0.0000
0.2855 0.3706 0.2116 0.1276 0.1488 0.2293 0.1703 1.0000 0.5161 0.1981 0.0752
L1 0.0001 0.1329 0.0494 0.0169 0.0245 0.0583 0.0330 0.5486 1.0000 0.0435 0.0015
0.0000 0.0747 0.0125 0.0012 0.0030 0.0180 0.0057 0.5241 1.0000 0.0097 0.0000
0.2725 0.1973 0.1205 0.0744 0.0833 0.1215 0.0923 0.5682 1.0000 0.1153 0.0541
c4 0.0077 0.4075 0.9010 0.7034 0.8764 0.7221 0.9709 0.0279 0.0039 1.0000 0.1345
0.0000 0.4551 0.9027 0.6873 0.8737 0.7342 0.9710 0.0816 0.0312 1.0000 0.0500
0.3337 0.3652 0.8994 0.7172 0.8787 0.7112 0.9707 0.0078 0.0002 1.0000 0.2480
b3 0.0808 0.0015 0.0309 0.1226 0.0610 0.0095 0.0303 0.0000 0.0000 0.0527 1.0000
0.0005 0.0333 0.1130 0.2219 0.1529 0.0659 0.1086 0.0026 0.0010 0.1464 1.0000
0.4077 0.0000 0.0044 0.0572 0.0174 0.0005 0.0049 0.0000 0.0000 0.0119 1.0000

Table 9. See descriptions of Table 7 but for the Venezuela data set and Table 6.

locus L4 U5 K6 L1 c4 b3 fr13 ps6 ps7
L4 N/A N/A N/A N/A N/A N/A N/A N/A N/A
N/A N/A N/A N/A N/A N/A N/A N/A N/A
N/A N/A N/A N/A N/A N/A N/A N/A N/A
U5 N/A 1.0000 0.9028 0.5369 0.9666 0.1491 0.3702 0.4754 0.8446
N/A 1.0000 0.9084 0.6177 0.9673 0.3334 0.5093 0.5765 0.8291
N/A 1.0000 0.8967 0.4454 0.9659 0.0316 0.2222 0.3617 0.8587
K6 N/A 0.9027 1.0000 0.6587 0.9357 0.2236 0.4731 0.5912 0.7492
N/A 0.8968 1.0000 0.7045 0.9331 0.3859 0.5747 0.6553 0.7088
N/A 0.9085 1.0000 0.6072 0.9382 0.0878 0.3576 0.5181 0.7847
L1 N/A 0.6621 0.7546 1.0000 0.6927 0.5082 0.7847 0.9244 0.5216
N/A 0.5898 0.7164 1.0000 0.6329 0.5825 0.8035 0.9269 0.3846
N/A 0.7249 0.7888 1.0000 0.7443 0.4257 0.7639 0.9219 0.6381
c4 N/A 0.9667 0.9359 0.5774 1.0000 0.1722 0.4038 0.5137 0.8117
N/A 0.9660 0.9384 0.6457 1.0000 0.3501 0.5303 0.6018 0.7890
N/A 0.9674 0.9333 0.4999 1.0000 0.0464 0.2653 0.4136 0.8322
b3 N/A 0.4003 0.4779 0.5861 0.4258 1.0000 0.7902 0.6546 0.2899
N/A 0.2148 0.3242 0.5138 0.2505 1.0000 0.7711 0.6033 0.0871
N/A 0.5756 0.6147 0.6506 0.5849 1.0000 0.8083 0.7009 0.5189
fr13 N/A 0.5231 0.6091 0.7836 0.5514 0.7454 1.0000 0.8579 0.3961
N/A 0.3880 0.5150 0.7631 0.4303 0.7668 1.0000 0.8491 0.2077
N/A 0.6406 0.6907 0.8025 0.6546 0.7220 1.0000 0.8663 0.5710
ps6 N/A 0.6127 0.7032 0.9243 0.6426 0.5862 0.8584 1.0000 0.4764
N/A 0.5191 0.6477 0.9218 0.5626 0.6402 0.8667 1.0000 0.3188
N/A 0.6934 0.7522 0.9267 0.7109 0.5259 0.8494 1.0000 0.6131
ps7 N/A 0.8452 0.7502 0.3663 0.8119 0.0695 0.2346 0.3165 1.0000
N/A 0.8592 0.7853 0.5076 0.8323 0.2697 0.4269 0.4769 1.0000
N/A 0.8295 0.7092 0.2190 0.7891 0.0029 0.0744 0.1586 1.0000

Table 10. The same as Table 4 but with the p-value of the three versions, according to eqs. 35 (top), 37 (middle), and 36 (bottom) of the Wald test.

locus U7 L5 J3 J6 U6 L4 U5 K6 L1 c4 b3
U7 1.0000 0.0004 0.0021 0.0043 0.0024 0.0007 0.0014 0.0000 0.0000 0.0034 0.0477
1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
L5 0.0000 1.0000 0.5092 0.2053 0.3145 0.6419 0.4228 0.2934 0.1237 0.4376 0.0024
0.0019 1.0000 0.5406 0.2703 0.3667 0.6579 0.4613 0.2448 0.0723 0.4784 0.0319
0.1722 1.0000 0.5728 0.3442 0.4236 0.6753 0.5032 0.1877 0.0306 0.5197 0.1322
J3 0.0000 0.5279 1.0000 0.5907 0.7707 0.8306 0.9246 0.1002 0.0375 0.9034 0.0315
0.0057 0.4969 1.0000 0.6144 0.7784 0.8264 0.9254 0.0503 0.0089 0.9049 0.1020
0.2203 0.4592 1.0000 0.6395 0.7866 0.8215 0.9264 0.0152 0.0006 0.9065 0.2188
J6 0.0000 0.2727 0.6262 1.0000 0.8136 0.4747 0.6694 0.0374 0.0131 0.7174 0.1347
0.0126 0.2075 0.6026 1.0000 0.8079 0.4341 0.6522 0.0082 0.0009 0.7034 0.2247
0.2609 0.1384 0.5752 1.0000 0.8013 0.3855 0.6315 0.0004 0.0000 0.6875 0.3298
U6 0.0000 0.3742 0.7878 0.8099 1.0000 0.6236 0.8464 0.0593 0.0214 0.8841 0.0736
0.0088 0.3220 0.7803 0.8155 1.0000 0.6025 0.8428 0.0201 0.0027 0.8818 0.1591
0.2422 0.2612 0.7718 0.8216 1.0000 0.5769 0.8385 0.0027 0.0000 0.8793 0.2731
L4 0.0000 0.6655 0.8353 0.4428 0.6055 1.0000 0.7510 0.1451 0.0561 0.7432 0.0151
0.0040 0.6499 0.8393 0.4831 0.6265 1.0000 0.7594 0.0905 0.0192 0.7531 0.0715
0.2043 0.6309 0.8437 0.5257 0.6489 1.0000 0.7688 0.0418 0.0028 0.7634 0.1857
U5 0.0000 0.4742 0.9296 0.6598 0.8448 0.7605 1.0000 0.0848 0.0313 0.9731 0.0421
0.0065 0.4360 0.9288 0.6767 0.8484 0.7521 1.0000 0.0380 0.0062 0.9733 0.1183
0.2273 0.3900 0.9279 0.6947 0.8523 0.7420 1.0000 0.0091 0.0003 0.9734 0.2350
K6 0.0000 0.2636 0.0674 0.0105 0.0231 0.0954 0.0400 1.0000 0.5610 0.0528 0.0000
0.0003 0.3131 0.1275 0.0445 0.0658 0.1517 0.0883 1.0000 0.5406 0.1116 0.0033
0.1060 0.3691 0.2076 0.1210 0.1433 0.2262 0.1659 1.0000 0.5158 0.1929 0.0585
L1 0.0000 0.0725 0.0111 0.0008 0.0024 0.0165 0.0049 0.5237 1.0000 0.0083 0.0000
0.0001 0.1233 0.0449 0.0126 0.0192 0.0500 0.0264 0.5441 1.0000 0.0395 0.0008
0.0769 0.1939 0.1147 0.0665 0.0764 0.1166 0.0864 0.5679 1.0000 0.1080 0.0364
c4 0.0000 0.4544 0.9027 0.6871 0.8736 0.7341 0.9710 0.0794 0.0292 1.0000 0.0470
0.0069 0.4135 0.9011 0.7016 0.8760 0.7236 0.9709 0.0339 0.0054 1.0000 0.1252
0.2301 0.3644 0.8994 0.7171 0.8787 0.7111 0.9707 0.0074 0.0002 1.0000 0.2416
b3 0.0002 0.0308 0.1099 0.2196 0.1501 0.0631 0.1057 0.0021 0.0007 0.1429 1.0000
0.0779 0.0022 0.0348 0.1306 0.0677 0.0123 0.0355 0.0000 0.0000 0.0571 1.0000
0.3745 0.0000 0.0040 0.0560 0.0166 0.0004 0.0045 0.0000 0.0000 0.0112 1.0000

Table 12. See description of Table 10 but for the Venezuela data set and Table 6.

locus L4 U5 K6 L1 c4 b3 fr13 ps6 ps7
L4 N/A N/A N/A N/A N/A N/A N/A N/A N/A
N/A N/A N/A N/A N/A N/A N/A N/A N/A
N/A N/A N/A N/A N/A N/A N/A N/A N/A
U5 N/A 1.0000 0.9084 0.6173 0.9673 0.3317 0.5085 0.5759 0.8291
N/A 1.0000 0.9026 0.5369 0.9666 0.1482 0.3678 0.4747 0.8446
N/A 1.0000 0.8967 0.4449 0.9659 0.0310 0.2213 0.3610 0.8587
K6 N/A 0.8967 1.0000 0.7043 0.9331 0.3846 0.5741 0.6550 0.7086
N/A 0.9028 1.0000 0.6587 0.9357 0.2227 0.4712 0.5907 0.7494
N/A 0.9084 1.0000 0.6070 0.9382 0.0870 0.3569 0.5177 0.7845
L1 N/A 0.5895 0.7162 1.0000 0.6325 0.5821 0.8035 0.9268 0.3831
N/A 0.6639 0.7555 1.0000 0.6931 0.5076 0.7843 0.9244 0.5221
N/A 0.7246 0.7887 1.0000 0.7440 0.4252 0.7638 0.9219 0.6371
c4 N/A 0.9660 0.9384 0.6453 1.0000 0.3485 0.5296 0.6014 0.7889
N/A 0.9667 0.9359 0.5774 1.0000 0.1713 0.4016 0.5131 0.8118
N/A 0.9674 0.9333 0.4995 1.0000 0.0457 0.2645 0.4129 0.8322
b3 N/A 0.2128 0.3221 0.5129 0.2468 1.0000 0.7710 0.6028 0.0833
N/A 0.4069 0.4826 0.5862 0.4271 1.0000 0.7907 0.6551 0.2912
N/A 0.5740 0.6132 0.6499 0.5820 1.0000 0.8082 0.7005 0.5138
fr13 N/A 0.3869 0.5142 0.7630 0.4286 0.7667 1.0000 0.8490 0.2047
N/A 0.5270 0.6117 0.7836 0.5521 0.7452 1.0000 0.8579 0.3970
N/A 0.6399 0.6901 0.8024 0.6534 0.7219 1.0000 0.8662 0.5684
ps6 N/A 0.5186 0.6473 0.9218 0.5618 0.6399 0.8667 1.0000 0.3168
N/A 0.6152 0.7046 0.9243 0.6430 0.5857 0.8582 1.0000 0.4771
N/A 0.6930 0.7520 0.9267 0.7103 0.5256 0.8494 1.0000 0.6116
ps7 N/A 0.8592 0.7852 0.5066 0.8323 0.2675 0.4256 0.4758 1.0000
N/A 0.8449 0.7495 0.3662 0.8118 0.0689 0.2318 0.3156 1.0000
N/A 0.8295 0.7091 0.2180 0.7890 0.0028 0.0736 0.1576 1.0000

Table 8. See description of Table 7 but for the Cameroon data set.

locus L5 J3 J6 U6 L4 U5 K6 L1 c4 b3
L5 1.0000 0.8964 0.2148 0.0208 0.0326 0.0111 0.0117 0.0000 0.6733 0.8457
1.0000 0.8953 0.2552 0.0439 0.0637 0.0312 0.0314 0.0000 0.6899 0.8480
1.0000 0.8974 0.1800 0.0090 0.0153 0.0033 0.0038 0.0000 0.6509 0.8433
J3 0.8892 1.0000 0.1633 0.0126 0.0214 0.0068 0.0071 0.0000 0.6227 0.7460
0.8904 1.0000 0.2058 0.0316 0.0483 0.0230 0.0229 0.0000 0.6445 0.7522
0.8881 1.0000 0.1281 0.0044 0.0084 0.0016 0.0018 0.0000 0.5930 0.7395
J6 0.2585 0.2345 1.0000 0.3954 0.4087 0.2356 0.2621 0.0000 0.8770 0.3907
0.2189 0.1918 1.0000 0.4189 0.4340 0.2717 0.2956 0.0001 0.8746 0.3595
0.2959 0.2768 1.0000 0.3742 0.3858 0.2045 0.2330 0.0000 0.8801 0.4231
U6 0.0728 0.0699 0.4690 1.0000 0.9488 0.6881 0.7566 0.0001 0.6183 0.1385
0.0406 0.0370 0.4468 1.0000 0.9490 0.6956 0.7611 0.0010 0.5961 0.0979
0.1126 0.1132 0.4887 1.0000 0.9486 0.6814 0.7526 0.0000 0.6468 0.1884
L4 0.0643 0.0621 0.4324 0.9429 1.0000 0.7374 0.8091 0.0001 0.5979 0.1250
0.0340 0.0311 0.4075 0.9427 1.0000 0.7427 0.8119 0.0012 0.5734 0.0853
0.1030 0.1044 0.4545 0.9431 1.0000 0.7326 0.8066 0.0000 0.6292 0.1749
U5 0.0327 0.0328 0.2732 0.6635 0.7445 1.0000 0.9162 0.0005 0.4985 0.0713
0.0125 0.0119 0.2379 0.6554 0.7392 1.0000 0.9157 0.0033 0.4621 0.0392
0.0648 0.0684 0.3058 0.6706 0.7491 1.0000 0.9167 0.0000 0.5457 0.1184
K6 0.0404 0.0401 0.3166 0.7453 0.8204 0.9190 1.0000 0.0003 0.5278 0.0851
0.0172 0.0162 0.2838 0.7406 0.8178 0.9195 1.0000 0.0025 0.4950 0.0503
0.0747 0.0779 0.3465 0.7494 0.8227 0.9185 1.0000 0.0000 0.5701 0.1336
L1 0.0000 0.0000 0.0003 0.0006 0.0027 0.0061 0.0033 1.0000 0.0493 0.0000
0.0000 0.0000 0.0000 0.0000 0.0003 0.0012 0.0005 1.0000 0.0145 0.0000
0.0008 0.0013 0.0033 0.0040 0.0113 0.0182 0.0118 1.0000 0.1551 0.0029
c4 0.4075 0.3650 0.7412 0.2155 0.2402 0.1223 0.1355 0.0000 1.0000 0.5643
0.3781 0.3306 0.7469 0.2498 0.2771 0.1618 0.1738 0.0000 1.0000 0.5469
0.4341 0.3975 0.7360 0.1862 0.2082 0.0916 0.1051 0.0000 1.0000 0.5822
b3 0.8341 0.7443 0.3126 0.0414 0.0584 0.0223 0.0239 0.0000 0.7525 1.0000
0.8315 0.7378 0.3465 0.0708 0.0950 0.0485 0.0498 0.0000 0.7621 1.0000
0.8365 0.7503 0.2821 0.0231 0.0343 0.0092 0.0105 0.0000 0.7395 1.0000

Table 11. See description of Table 10 but for the Cameroon data set.

locus L5 J3 J6 U6 L4 U5 K6 L1 c4 b3
L5 1.0000 0.8953 0.2548 0.0436 0.0631 0.0307 0.0310 0.0000 0.6899 0.8480
1.0000 0.8963 0.2184 0.0226 0.0350 0.0125 0.0131 0.0000 0.6684 0.8456
1.0000 0.8974 0.1797 0.0089 0.0151 0.0033 0.0037 0.0000 0.6509 0.8433
J3 0.8904 1.0000 0.2054 0.0313 0.0478 0.0226 0.0225 0.0000 0.6445 0.7522
0.8893 1.0000 0.1669 0.0140 0.0234 0.0078 0.0081 0.0000 0.6163 0.7458
0.8881 1.0000 0.1278 0.0043 0.0083 0.0015 0.0017 0.0000 0.5930 0.7395
J6 0.2185 0.1913 1.0000 0.4188 0.4338 0.2713 0.2953 0.0001 0.8746 0.3593
0.2549 0.2320 1.0000 0.3979 0.4114 0.2393 0.2657 0.0000 0.8778 0.3923
0.2954 0.2763 1.0000 0.3740 0.3857 0.2042 0.2328 0.0000 0.8801 0.4229
U6 0.0401 0.0364 0.4467 1.0000 0.9490 0.6956 0.7611 0.0010 0.5961 0.0974
0.0694 0.0676 0.4668 1.0000 0.9488 0.6890 0.7571 0.0001 0.6259 0.1411
0.1117 0.1122 0.4886 1.0000 0.9486 0.6813 0.7526 0.0000 0.6467 0.1878
L4 0.0335 0.0306 0.4073 0.9427 1.0000 0.7427 0.8119 0.0012 0.5733 0.0848
0.0610 0.0599 0.4299 0.9429 1.0000 0.7380 0.8094 0.0001 0.6062 0.1275
0.1021 0.1034 0.4543 0.9431 1.0000 0.7326 0.8066 0.0000 0.6292 0.1743
U5 0.0122 0.0116 0.2376 0.6553 0.7392 1.0000 0.9157 0.0032 0.4620 0.0388
0.0302 0.0311 0.2696 0.6626 0.7439 1.0000 0.9162 0.0006 0.5114 0.0736
0.0638 0.0673 0.3054 0.6705 0.7491 1.0000 0.9167 0.0000 0.5456 0.1177
K6 0.0169 0.0158 0.2835 0.7406 0.8178 0.9195 1.0000 0.0024 0.4950 0.0499
0.0377 0.0382 0.3133 0.7448 0.8201 0.9191 1.0000 0.0004 0.5393 0.0876
0.0738 0.0768 0.3462 0.7494 0.8227 0.9185 1.0000 0.0000 0.5700 0.1329
L1 0.0000 0.0000 0.0000 0.0000 0.0003 0.0011 0.0004 1.0000 0.0144 0.0000
0.0000 0.0000 0.0002 0.0005 0.0021 0.0050 0.0026 1.0000 0.0769 0.0001
0.0006 0.0010 0.0029 0.0038 0.0108 0.0176 0.0113 1.0000 0.1547 0.0026
c4 0.3779 0.3303 0.7468 0.2495 0.2768 0.1613 0.1734 0.0000 1.0000 0.5468
0.4050 0.3631 0.7418 0.2191 0.2439 0.1262 0.1394 0.0000 1.0000 0.5652
0.4339 0.3972 0.7360 0.1859 0.2079 0.0912 0.1048 0.0000 1.0000 0.5821
b3 0.8315 0.7378 0.3463 0.0704 0.0945 0.0480 0.0494 0.0000 0.7621 1.0000
0.8339 0.7439 0.3157 0.0439 0.0615 0.0243 0.0260 0.0000 0.7496 1.0000
0.8365 0.7503 0.2818 0.0229 0.0340 0.0090 0.0104 0.0000 0.7395 1.0000

Overall, the methods perform well for all data sets and provide meaningful results. However, the statistical tests also yielded significant differences in some of the pairwise comparisons of the various Inline graphic estimates in each region (Tables 412). The allele frequencies differ of course but all are based on the same true parameter Inline graphic. If the estimates for Inline graphic are significantly different, some of them cannot be trusted. This can have various reasons. First, it can be a type I error. However, this occurs only with small probability if the CIs are well calibrated, i.e., their nominal coverage (Inline graphic) is close to the actual coverage. Asymptotic CIs and tests based on them (Wald, Score) will be more affected than profile-likelihood-based intervals, because the former are inherently forced to be symmetric. This is particularly true if the estimates for Inline graphic are close to zero. To quantify this effect, and to suggest heuristic methods to recalibrate the CIs, a systematic numerical robustness study of the approach is planned. Preliminary investigations, however, have shown that particularly the profile-likelihood-based CIs are well calibrated.

Second, the tests are designed to compare the ML estimate based on the data with a value Inline graphic, which has to be interpreted as prior knowledge. Strictly speaking, it is not meant to be estimated from data itself, or at least data which is available. A test designed to compare two estimates, should incorporate information from both data sets (data from both markers). A standard approach to resolve this is as follows. One could calculate the product of the maximum likelihood from both markers and compare it with the maximum likelihood of both markers conditioned on equality of Inline graphic. This however would require much more numerical effort than the tests here. Note further, that the structure of the data does not allow to perform a permutation test, because the allele-frequency distributions are expected to be different. This is true for two different marker loci in the same endemic region as well as for the same marker in two different populations.

Third, the model assumptions might be violated, i.e., the underlying Poisson distribution might not be correct. This can again be quantified in the coarse of a robustness study.

Fourth, the allele-frequency spectra of two different marker loci is very different, and the method might be sensitive to this. For instance strong skewness in the data distributions might bias the estimates. This is obviously the case if one marker shows no variation at all. Moreover, the number of different allele at different markers is very different, which results in very different probabilities of the ML estimates. These issues again need to be investigated in a numerical study.

Fifth, some STR markers tend to be hyper-mutable. As a result, not just the frequency distribution might be more problematic, but it is also more challenging to correctly identify the tandem repeat numbers. Hence, for hyper-mutable markers the data might have very bad quality. In our examples the marker labelled L1 appears to be hyper-mutable.

Because of all these possible reasons, it would be pre-mature to suggest a heuristic on how to decide, which estimates can be trusted the most. A systematic numerical follow-up study is planned to investigate all these possibilities in detail to provide suggestions on the criteria upon which the data is chosen.

Discussion

The number of genetically distinct lineages co-infecting a host - commonly referred to as “multiplicity of infection” (MOI) - is a key quantity in epidemiology. First, it relates with transmission intensity since it provides a metric for the number of secondary infections after a primary infection; assuming that the lineages circulating are identifiable (e.g. secondary infections within a clonal outbreak simply cannot be traceable). Second, it measures the possibility of genetic exchange among those lineages as determined by the genetic system of the pathogen in question. Finally, if phenotypic differences are associated with those lineages, MOI could lead to very complex dynamics driven by natural selection.

Measuring MOI is desirable in a variety of infectious diseases, but - in many instances - only feasible if it can be measured at low cost and with a reasonable effort. Optimally it should fit into standard study designs and should be easily computable with whatever genotyping data can be collected from clinical specimens. In order to meet these goals, we further developed the maximum-likelihood (ML) method originally proposed by [20] and applied it to three malaria datasets as examples.

From a total of Inline graphic samples (e.g. blood samples), the number of genetically distinguishable lineages present in each host are recorded. From the resulting data, assuming that hosts are infected randomly by those lineages according to their prevalence, we derived the likelihood function. If infections with the pathogen are rare events, a natural choice for the number of co-infecting lineages is a conditional Poisson distribution (CPD). This distribution comes with the appealing feature that it is characterized by a single parameter Inline graphic, whose transform Inline graphic is the average MOI. Assuming a CPD, the likelihood function simplifies as well as the procedure to derive the ML estimates. Although, this was previously described by [20], we were able to derive a number of important results: First, the ML estimate always exists and is unique. Second, it has the intuitive interpretation of being the parameter vector under which the observed are the expected prevalences for the distinguishable lineages, i.e., the observation is the expectation, if the ML estimate is the true parameter vector. Third, the recursion to compute the ML estimate for Inline graphic reduced from a multi- to a one-dimensional recursion, which just depends on the number of samples Inline graphic and the observed prevalences. The ML estimates for the lineages frequencies are explicit functions of Inline graphic. Fourth, the recursion for Inline graphic converges (at least) from every initial value Inline graphic. Convergence is monotonically, at quadratic rate, and typically occurs within a few iterations. Besides the obvious computational advantages provided of our results their actual foremost importance is that they justify the ML approach. Using an ML estimates is only appropriate if it has a significantly higher probability than distant alternative parameter choices, which is difficult to evaluate in a multi-dimensional space. However, the form of the ML estimate here - particularly because the lineages prevalences depend continuously on Inline graphic - indicates that the observation will have significantly lower probability under distant alternative parameter choices. The method worked well for the three malaria datasets to which it was applied, and gave similar results when applied to different independent microsatellite loci.

Although, our results justify the ML approach, it is nevertheless of fundamental importance to provide confidence intervals (CIs). We reported here on asymptotic and profile-likelihood-based CIs for all parameters. Asymptotic CIs are either based on the observed or the expected Fisher information, which under the CPD coincide. Explicit formulas for the CIs for all involved parameters were derived. Profile-likelihood based CIs were already emphasized by [20]. However, it was important to note that they can actually be derived at low numerical costs by using the method of Lagrange multiplies. This reduces the numerical effort to the same magnitude as for the ML estimate. Assuming the CPD, we proved that the CI for the parameter Inline graphic, yielding the estimate for the MOI, is uniquely defined. The confidence bounds are derived by a two-dimension recursion, which converges locally at quadratic rate. Both kinds of CIs gave meaningful results for the three data sets to which we applied the methods and they agree well. Although the asymptotic CIs are easier to derive, we suggest to use the profile-likelihood-based CIs if sample size is low and/or the ML estimate for Inline graphic is small for the reasons discussed in the application section. Although, we discussed CIs for the linages' frequencies, these are somewhat less interesting, unless one focuses on the prevalence of a particular linage. Otherwise one should derive confidence regions on the simplex for the lineage frequencies, which is done as outlined, but numerically more demanding.

To test the ML estimate against other parameter choices typically three statistical tests are used, the likelihood-ratio, the Score, and the Wald test. The latter two are based on the asymptotic CIs, while the likelihood-ratio test builds upon the profile-likelihood-based CIs. Motivated by our intention to apply the methods to malaria we focused on using these tests to compare estimates for the parameter Inline graphic. Namely, several genetic markers characterizing linages are typically available (e.g., several microsatellite markers), to all of which the methods are applicable. While the true parameter Inline graphic is of course the same for all markers, the ML estimates obtained from them will differ. It is therefore important to test whether these estimates differ significantly. The parameter Inline graphic changes on temporal and spatial scales. An obvious question is, whether MOI changes over time (e.g. before and after the implementation of control measures) or varies across endemic regions. Hence, it is important to test for significant differences in estimates for Inline graphic.

Not surprisingly all tests described perform equally well as they are asymptotically equivalent. However, as in the case of CIs we suggest to use the likelihood-ratio test if sample size is small or the parameters compared are small. If interested in p-values additional effort is required for the likelihood-ratio test, because a two-dimensional iteration needs to be performed. However, numerically this is only as demanding as obtaining the CIs. Because the test statistics for the Score and Wald tests can be derived, it is easy to derive p-values in these cases. For each of these two tests we provided three alternative variants, which all worked almost equally well in the provided examples. We should point out that it was our intention to indicate only how tests for the parameters can be constructed. With the usual approaches one could compare multiple parameters at the same time, including the information of all these markers. This however, exceeds both our intention and the scope of this article. Finally, as a justification for using the CPD, which simplifies the method to a great extent, we summarized the test suggested by [20]. Although the test will be uninformative if many lineages are present it provides a justification for the approach. Of note, the CPD is an intuitive assumption if infections are relatively rare events. This does not relate with the overall prevalence but rather with how high the observed incidence is in a given population in terms of the time scale required for the pathogen to complete its transmission cycle. Such relationship is hard to establish without complex simulations but it is worth noting that there could be biologic scenarios (particular pathogens or epidemiologic settings) where this assumption does not hold. Thus, it is advisable to check whether the CPD assumption is violated using the tests for the model fit proposed in this investigation. In our case of study, we observe robust estimates across very different epidemiologic settings. Overall, the methods developed here can be used to compare groups under different exposures, different manifestations of disease, groups of patients that have different genotypes (e.g. sickle cell or any other hemoglobinopathies associated with protection), or the efficacy of a given vaccine. Biologically, this method assumes that the rate of evolution of the marker used is “low” relative to the time of the infection. That is, there is a “numerable” set of lineages that can be estimated and no variants are generated during the time scale of one infection. Thus, it is not suitable for pathogens such as HIV or any other hypervariable virus. The second assumption is that the set of markers used to detect and characterize the MOI are effectively neutral, so they are not linked to genes under selection. Thus, the loci cannot be associated with antigens or drug resistance. As presented, each loci is considered independent, which is a typical assumption of genotyping base approximations used in molecular epidemiology. We also want to emphasize that this MOI estimate depends on the number of detectable lineages given a laboratory method. Thus, results from different markers such SNPs or microsatellites are expected to differ as a function of their differences in mutation rates and mode of evolution. One could actually calculate the fit of individual loci and then exclude potential outliers if there is any biological reason to do so (e.g. microsatellites under different evolutionary models where one is hyper-variable or non-variable when compared with others). The method is sensitive enough to detect differences in MOI under different epidemiologic settings as indicated by the analyses of empirical data. Whereas this is not per se a “genomic” method, in the sense that is not designed to estimate MOI directly from reads generated from next generation sequence (NGS) data, it can do so from a given set of SNPs or microsatellites detected by using NGS. Whereas the method was originally intended for applications to malaria, it can be applied to other parasitic or microbial diseases where the assumptions are not violated. E.g. variation on the VNTRs in a multi-clonal infection of Mycobacterium tuberculosis. Unlike empirical approaches where simply alleles are counted and then averaged, the proposed ML method provides a robust and computationally efficient statistical framework that can be integrated in epidemiological investigations.

Analysis

1 The Model

1.1 Background

Here, Inline graphic given by (1) is explicitly derived under the assumption that Inline graphic is given by the CPD (2). Namely,

graphic file with name pone.0097899.e432.jpg

where in the derivations the condition Inline graphic indicates that the product is taken over all non-zero components of Inline graphic, corresponding to the alleles found in a sample with allele configuration Inline graphic.

1.2 Log-Likelihood

Assuming that the number of lineages infecting a host follows the CPD (2), the log-likelihood (3) simplifies to

graphic file with name pone.0097899.e436.jpg

where

graphic file with name pone.0097899.e437.jpg

is the number of samples that contain allele Inline graphic. Notably, Inline graphic with equality only if all samples are single infections.

1.3 Proof of Remark 1

The proof of Remark 1 is as follows.

Proof of Remark 1. First, note that

graphic file with name pone.0097899.e440.jpg

Moreover, using de l'Hospitals rule we see that

graphic file with name pone.0097899.e441.jpg

because Inline graphic (note that this holds also true if Inline graphic for some Inline graphic). This proves that Inline graphic is not a maximum likelihood estimate, which is quite intuitive. Inline graphic

1.4 Derivatives of the log-likelihood

Assuming the CPD (2) the log-likelihood function is given by (4) and the derivatives of (5) are hence straightforwardly calculated to be

graphic file with name pone.0097899.e447.jpg (39a)
graphic file with name pone.0097899.e448.jpg (39b)
graphic file with name pone.0097899.e449.jpg (39c)
graphic file with name pone.0097899.e450.jpg (39d)
graphic file with name pone.0097899.e451.jpg
graphic file with name pone.0097899.e452.jpg (39e)

The entries of the Hessian matrix (7), i.e., the second derivatives of Inline graphic, given by (5), are calculated to be

graphic file with name pone.0097899.e454.jpg (40a)
graphic file with name pone.0097899.e455.jpg (40b)
graphic file with name pone.0097899.e456.jpg (40c)
graphic file with name pone.0097899.e457.jpg (40d)
graphic file with name pone.0097899.e458.jpg (40e)
graphic file with name pone.0097899.e459.jpg (40f)
graphic file with name pone.0097899.e460.jpg (40g)

2 Proofs of the main results

2.1 Existence and uniqueness of the ML estimate

First, the result showing existence and uniqueness of the ML estimate in the generic case is proven.

Proof of Result 1. Assume Inline graphic, as this cannot be the ML estimate according to Remark 1. Equating (39b) to zero yields Inline graphic for all Inline graphic. Substituting this into (39a) and setting the equation to zero yields Inline graphic. Therefore, we obtain Inline graphic or

graphic file with name pone.0097899.e466.jpg (41)

proving the last assertion. Hence, it remains to prove the statements for Inline graphic.

By using (41) and equating (39c) to zero, we obtain Inline graphic, which is equivalent to

graphic file with name pone.0097899.e469.jpg (42)

Therefore, the ML estimate is a solution of (42). Straightforward calculation gives

graphic file with name pone.0097899.e470.jpg

and

graphic file with name pone.0097899.e471.jpg

Note that, Inline graphic and Inline graphic, because Inline graphic. Hence, Inline graphic near zero. Note further that Inline graphic. Hence, Inline graphic has at least one positive solution. Since, Inline graphic for at least one Inline graphic, Inline graphic, implying that Inline graphic is strictly convex for Inline graphic. Because Inline graphic is strictly convex there can be at most one positive solution Inline graphic of Inline graphic. Moreover, Inline graphic is strictly monotonically increasing for Inline graphic.

The solution can be found by a Newton method. Because Inline graphic is strictly convex and monotonically increasing for Inline graphic, the Newton method converges monotonically to the solution Inline graphic. Moreover, because Inline graphic is continuous, the rate of convergence is at least quadratic. Noting that Inline graphic yields (9) completes the proof.Inline graphic

The special case, in which only single infections occur, is summarized by Remark 2. It can be proven as follows.

Proof of Remark 2. Examining the proof of Result 1 yields that that the ML estimate is any positive root of Inline graphic. In the present case Inline graphic. However, since Inline graphic must hold for at least one Inline graphic, Inline graphic is still strictly convex. This implies that Inline graphic for all Inline graphic. Hence, no maximum likelihood estimate with Inline graphic exists.

Moreover, since

graphic file with name pone.0097899.e502.jpg

the ML estimate can only be attained at Inline graphic.

In the limit Inline graphic, one obtains, as in the proof of Remark 1,

graphic file with name pone.0097899.e505.jpg

which is maximized at Inline graphic. Particularly, the likelihood function is finite in this case. Inline graphic

In the other non-generic situation, every lineage is found in all samples, which is described in Remark 3 and can be proven as follows.

Proof of Remark 3. The proof of Result 1 yields Inline graphic. Hence, Inline graphic has no positive solution, and hence no ML estimate with Inline graphic exists. Clearly, Remark 1 states that Inline graphic is also not an ML estimate.

In this case the log-likelihood function simplifies to

graphic file with name pone.0097899.e512.jpg
graphic file with name pone.0097899.e513.jpg
graphic file with name pone.0097899.e514.jpg

Taking the limit Inline graphic yields

graphic file with name pone.0097899.e516.jpg

Since Inline graphic implies that the likelihood is one, this limit case, which is - of note - independent of the allele-frequency distribution, is the maximum likelihood.Inline graphic

Remark 4 states that the expected number of samples containing a given lineage equals the observed number of samples containing this allele if the ML estimate is the true parameter. The proof is as follows.

Proof of Remark 4. The maximum likelihood estimate satisfies Inline graphic. Equating (39b) to zero yields Inline graphic for all Inline graphic. Substituting this into (39a) and setting the equation to zero yields Inline graphic. Therefore, we obtain Inline graphic or Inline graphic. Hence, it remains to be shown that Inline graphic holds.

In the following we will use that Inline graphic. To simplify the notation assume Inline graphic. Hence,

graphic file with name pone.0097899.e528.jpg

Successively repeating the last step gives

graphic file with name pone.0097899.e529.jpg

Since the alleles can be arbitrarily labeled, we obtain

graphic file with name pone.0097899.e530.jpg (43)

The proof is completed by noting that Inline graphic is obtained from (4) by replacing Inline graphic with Inline graphic.

2.2 Profile likelihood based confidence intervals

The existence sand uniqueness of the profile-likelihood-based confidence intervals are proven as follows.

Proof of Result 2. The proof consists of several parts.

Part A: Existence in the generic case. We first assume Inline graphic and Inline graphic for at least one Inline graphic and prove the CI's existence.

The CI's bounds satisfy (12). The equations Inline graphic, yield Inline graphic, or

graphic file with name pone.0097899.e539.jpg (44)

which implies that Inline graphic must hold for all Inline graphic. Since, Inline graphic, by summing up the above expression one arrives at Inline graphic. Thus, for fixed Inline graphic the Lagrange multiplier Inline graphic is a zero of the function

graphic file with name pone.0097899.e546.jpg (45)

Its derivative is given by

graphic file with name pone.0097899.e547.jpg (46)

Hence, Inline graphic is strictly monotonically increasing in Inline graphic, and consequently has at most one zero Inline graphic. Note that Inline graphic and Inline graphic. Hence, Inline graphic has exactly one solution Inline graphic. Furthermore, according to the implicit-function theorem, Inline graphic is a continuously differentiable function of Inline graphic.

The likelihood function (4) can be rewritten as

graphic file with name pone.0097899.e557.jpg (47)

Note that

graphic file with name pone.0097899.e558.jpg

Since Inline graphic for at least one Inline graphic, it follows that

graphic file with name pone.0097899.e561.jpg (48)

for any arbitrary but fixed allele-frequency vector Inline graphic. Moreover, the proof of Remark 1 reveals that

graphic file with name pone.0097899.e563.jpg (49)

Now, for any Inline graphic, let Inline graphic with Inline graphic given by (44) with Inline graphic.

Next, we show indirectly that Inline graphic.

First, assume Inline graphic. Hence, there exists a sequence Inline graphic, with Inline graphic but Inline graphic. Hence, Inline graphic such that for a subsequence Inline graphic, Inline graphic. Without loss of generality, Inline graphic. Let Inline graphic be the corresponding sequence of allele-frequency vectors. Since the simplex is compact, there exists a convergent subsequence Inline graphic. Because Inline graphic is continuous, it follows that Inline graphic, contradicting (48).

Analogously it is shown that Inline graphic.

Since Inline graphic, Inline graphic as well as Inline graphic are continuous, and Inline graphic, there exist Inline graphic, such that Inline graphic is a solution of (12), where Inline graphic is given by (44). This proves the existence of the CI's bounds.

Part B: Uniqueness in the generic case. Next, the uniqueness of the confidence intervals is proven. Assume two values Inline graphic with Inline graphic. Since Inline graphic is continuously differentiable the mean value theorem implies that there exists Inline graphic with Inline graphic. Application of the chain rule yields Inline graphic. By definition of Inline graphic, the relation Inline graphic holds. Hence, Inline graphic Thus,

Inline graphic, where Inline graphic is given by (44) with Inline graphic. This implies that Inline graphic is a zero of (39), or, in other words, that Inline graphic is a maximum likelihood estimate. Because of its uniqueness Inline graphic, and Inline graphic. Hence, Inline graphic or Inline graphic is impossible, and the CI is therefore uniquely defined.

Part C: Existence and uniqueness in the non-generic cases. In the case Inline graphic the same proof holds with obvious modifications. As (49) is violated and becomes Inline graphic. It follows that at least one solution of (12) exist. The above proof of uniqueness, implies that this is the only solution.

Similarly, for Inline graphic for all Inline graphic, (48) is violated and becomes Inline graphic, from which the existence of exactly one solution of (12) follows from the same proof as in the generic case.

Part D: Derivation of the CIs in the generic case. Parts A and B reveal that the bounds of the CI's bounds are the two solutions (Inline graphic and Inline graphic) of the equations Inline graphic and Inline graphic, where Inline graphic is given by (45), and Inline graphic with Inline graphic given by (44). A little algebraic manipulations yields that Inline graphic is given by (16e).

The solutions can be found by a Newton method. Straightforward calculation gives

graphic file with name pone.0097899.e620.jpg
graphic file with name pone.0097899.e621.jpg
graphic file with name pone.0097899.e622.jpg
graphic file with name pone.0097899.e623.jpg

where Inline graphic and Inline graphic are given by (16c) and (45) or (16d), respectively. Hence, the Newton method leads to the following iteration

graphic file with name pone.0097899.e626.jpg

Due to its relatively simple form, the above matrix can be easily inverted and the iteration can be rewritten as (16a) and (16b).

The Newton methods converges locally quadratically if the above matrix is nonsingular in the solutions. Part A of the proof reveals that these solutions satisfy Inline graphic, yielding Inline graphic. Hence, the matrix simplifies to

graphic file with name pone.0097899.e629.jpg

Therefore,

graphic file with name pone.0097899.e630.jpg

Clearly, since Inline graphic, Inline graphic if and only if Inline graphic. According to the proof of Result 1 this condition is only fulfilled at the unique ML estimate. Hence, Inline graphic in Inline graphic and Inline graphic. Therefore, the Newton method converges quadratically for any initial value sufficiently close to the respective solution.Inline graphic

2.3 Asymptotic confidence intervals. Proof of Result 3

This proof is slightly more general than necessary as we will re-use part of it later.

First, consider a matrix Inline graphic with the following structure

graphic file with name pone.0097899.e639.jpg (50a)

with

graphic file with name pone.0097899.e640.jpg (50b)

Let Inline graphic. We aim to derive Inline graphic. We do so by inverting Inline graphic blockwise. Namely,

graphic file with name pone.0097899.e644.jpg

The formulae applies whenever, Inline graphic and the Inline graphic matrix Inline graphic is invertible. Moreover,

graphic file with name pone.0097899.e648.jpg (51)

where Inline graphic, Inline graphic, and Inline graphic. Its inverse is given by

graphic file with name pone.0097899.e652.jpg

Hence, the desired quantity Inline graphic becomes

graphic file with name pone.0097899.e654.jpg

We are now ready to derive the confidence interval given by (18). To derive Inline graphic we first note that (7), (40) and rearrangement of the parameters imply that the Fisher information matrix has the form (50), with

graphic file with name pone.0097899.e656.jpg

given by (40), and Inline graphic corresponds to Inline graphic. Therefore,

graphic file with name pone.0097899.e659.jpg (52)

and consequently

graphic file with name pone.0097899.e660.jpg

Moreover,

graphic file with name pone.0097899.e661.jpg

and

graphic file with name pone.0097899.e662.jpg

Hence,

graphic file with name pone.0097899.e663.jpg (53)

Deriving Inline graphic is easy. Namely, exactly the same calculations hold with

graphic file with name pone.0097899.e665.jpg

By inspecting (40), it becomes clear that all derivations remain unchanged with Inline graphic replaced by Inline graphic (cf. eq. 53). This gives

graphic file with name pone.0097899.e668.jpg

which simplifies to

graphic file with name pone.0097899.e669.jpg (54)

Substituting the above with Inline graphic into (18) - using the fact that Inline graphic - yields (20) after after a little algebraic manipulation.

The identities Inline graphic follow from (43). Substituting this into (54) gives

graphic file with name pone.0097899.e673.jpg (55)

Substitution of the above evaluated at Inline graphic (using the fact that Inline graphic) into (18) yields (21) after some rearrangement.

graphic file with name pone.0097899.e725a.jpg

Proof of Result 4. To simplify the notation, we first derive the formulas for the confidence interval of Inline graphic. By re-arranging the parameters as in the proof of Result 3, it is obvious that the matrix Inline graphic given by (50) can be used instead of the Fisher information Inline graphic (or Inline graphic). Particularly, Inline graphic.

We can apply a blockwise inversion formula to Inline graphic similar as in the proof of Result 3. Namely,

graphic file with name pone.0097899.e682.jpg

where

graphic file with name pone.0097899.e683.jpg

with

graphic file with name pone.0097899.e684.jpg

Clearly,

graphic file with name pone.0097899.e685.jpg

where Inline graphic are the elements of Inline graphic. The inverse of Inline graphic is calculated exactly as the inverse of Inline graphic in the proof of Result 3. Namely, we arrive at

graphic file with name pone.0097899.e690.jpg

with Inline graphic, Inline graphic, and Inline graphic.

Hence, the desired quantity Inline graphic becomes

graphic file with name pone.0097899.e695.jpg

To derive the desired quantity Inline graphic (Inline graphic) we need to set Inline graphic, Inline graphic, and Inline graphic. By using (40) and (43) we obtain

graphic file with name pone.0097899.e701.jpg

Therefore,

graphic file with name pone.0097899.e702.jpg
graphic file with name pone.0097899.e703.jpg
graphic file with name pone.0097899.e704.jpg

Hence,

graphic file with name pone.0097899.e705.jpg
graphic file with name pone.0097899.e706.jpg
graphic file with name pone.0097899.e707.jpg
graphic file with name pone.0097899.e708.jpg
graphic file with name pone.0097899.e709.jpg

Moreover,

graphic file with name pone.0097899.e710.jpg
graphic file with name pone.0097899.e711.jpg
graphic file with name pone.0097899.e712.jpg
graphic file with name pone.0097899.e713.jpg
graphic file with name pone.0097899.e714.jpg

Combining the above yields,

graphic file with name pone.0097899.e715.jpg

and finally

graphic file with name pone.0097899.e716.jpg

Hence, the bounds of the confidence intervals are given by

graphic file with name pone.0097899.e717.jpg

By replacing Inline graphic by Inline graphic, one obtains the confidence interval of Inline graphic given by (22).

graphic file with name pone.0097899.e721.jpg

2.4 Testing the Parameters

Proof of Result 5. The result is proven by showing that the iteration (29) leads to the profile-likelihood with Inline graphic. The proof of Remark 2 reveals that the desired values for Inline graphic is the unique zero of Inline graphic given by (45). The zero can be found using a Newton method. Combining (45) and (46) yields (29) after a little rearrangement.

graphic file with name pone.0097899.e725.jpg

Remark 6. If Inline graphic and Inline graphic (and Inline graphic ) are the true (unknown) parameters, the asymptotic Inline graphic holds.

We aim to test only for Inline graphic , so any choice can be made for the true parameter. However, the parameters Inline graphic occur in the asymptotic variance Inline graphic . Hence, we need a plug-in estimate for the asymptotic variance. There are two possibilities. First, the true parameter Inline graphic is replaced by the profile-likelihood estimates Inline graphic based on Inline graphic and the asymptotic variance by Inline graphic . Here, either the expected or the observed Fisher information can be used.

Second, both Inline graphic and Inline graphic can be replaced by the ML estimate Inline graphic . In this case the expected and observed Fisher information coincide.

Proof of Result 6. The remark is proven by explicitly deriving the test statistic. To simplify the notation we write Inline graphic and Inline graphic for Inline graphic and Inline graphic, respectively. To derive Inline graphic (or Inline graphic) we can follow the proof of Result 3.

From the blockwise inversion formula (51) the relation

graphic file with name pone.0097899.e746.jpg (56)

follows immediately, where the denominator on the left-hand side is given by the reciprocal of (53).

Noting that Inline graphic given by (39a) one obtains Inline graphic. Substituting this and (56) in the test statistic (31), and writing Inline graphic and Inline graphic for Inline graphic and Inline graphic gives (32).

Of course, (56) also holds if Inline graphic is replaced by Inline graphic, where Inline graphic is given by (54). Thus, the same reasoning as above yields (33).Inline graphic

3 The case Inline graphic

3.1 Log-likelihood

In the limiting case that the true parameter is Inline graphic the conditional poison distribution becomes

graphic file with name pone.0097899.e759.jpg (57)

Following the derivations in subsection 4.1, Inline graphic becomes

graphic file with name pone.0097899.e761.jpg

where Inline graphic denotes the Inline graphicth base vector. Hence, the likelihood function (3) becomes

graphic file with name pone.0097899.e764.jpg (58)

This is the limiting case of (3) for Inline graphic. Furthermore, we can conclude the following.

Remark 7. If the true parameter is Inline graphic , according to (57), an observation Inline graphic with Inline graphic is impossible in a sample of size Inline graphic . Hence, Inline graphic with probability one.

Assume Inline graphic is the true parameter. Then, we can assume

graphic file with name pone.0097899.e772.jpg

As mentioned above, the case Inline graphic is just the continuation of the likelihood function.

Hence, we can define Inline graphic. Moreover, the (one-sided) derivatives of the likelihood function Inline graphic exist in Inline graphic. We have,

graphic file with name pone.0097899.e777.jpg (59a)
graphic file with name pone.0097899.e778.jpg (59b)
graphic file with name pone.0097899.e779.jpg (59c)

The proof is found in the next subsection 6.2.

From (59a) we immediately see that Inline graphic. Hence, the ML estimate Inline graphic (cf. Remark 2) is a boundary maximum. However, it is necessary for the asymptotic distributions (11), (17), (30), and (38) that all derivatives of the likelihood function vanish. As this is not the case, we can neither derive confidence intervals, nor test the parameters in the case Inline graphic.

3.2 Derivatives of the likelihood function

graphic file with name pone.0097899.e783.jpg

Applying de l'Hospitals rule gives Inline graphic. Hence, successive application of this rule to the above yields

graphic file with name pone.0097899.e785.jpg
graphic file with name pone.0097899.e786.jpg
graphic file with name pone.0097899.e787.jpg
graphic file with name pone.0097899.e788.jpg
graphic file with name pone.0097899.e789.jpg

Note, that the last steps also proves Inline graphic.

Similarly, from (39d)

graphic file with name pone.0097899.e791.jpg

Since from (58) Inline graphic, one obtains (59b).

Acknowledgments

The authors thank Andrea M. McCollum for sharing the data sets from Cameroon, Kenya, Venezuela. The constructive comments of one anonymous reviewer are gratefully acknowledged!

Funding Statement

This research is supported by U.S. National Institutes of Health grants 1U19AI089702 and R01GM084320 to AAE. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Read AF, Taylor LH (2001) The ecology of genetically diverse infections. Science 292: 1099–1102. [DOI] [PubMed] [Google Scholar]
  • 2. Balmer O, Tanner M (2011) Prevalence and implications of multiple-strain infections. The Lancet Infectious Diseases 11: 868–878. [DOI] [PubMed] [Google Scholar]
  • 3. Alizon S, de Roode JC, Michalakis Y (2013) Multiple infections and the evolution of virulence. Ecology Letters 16: 556–567. [DOI] [PubMed] [Google Scholar]
  • 4. Wacker M, Turnbull L, Walker L, Mount M, Ferdig M (2012) Quantification of multiple infections of plasmodium falciparum in vitro. Malaria Journal 11: 180. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Matussek A, Stark L, Dienus O, Aronsson J, Mernelius S, et al. (2011) Analyzing multiclonality of staphylococcus aureus in clinical diagnostics using spa-based denaturing gradient gel electrophoresis. Journal of Clinical Microbiology 49: 3647–3648. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Vu-Thien H, Hormigos K, Corbineau G, Fauroux B, Corvol H, et al. (2010) Longitudinal survey of staphylococcus aureus in cystic fibrosis patients using a multiple-locus variable-number of tandemrepeats analysis method. BMC Microbiology 10: 24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Tognazzo M, Schmid-Hempel R, Schmid-Hempel P (2012) Probing mixed-genotype infections ii: High multiplicity in natural infections of the trypanosomatid, crithidia bombi, in its host, bombus spp. PLoS ONE 7: e49137. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Frank SA (1992) A kin selection model for the evolution of virulence. Proceedings of the Royal Society of London Series B: Biological Sciences 250: 195–197. [DOI] [PubMed] [Google Scholar]
  • 9. Lively C (2005) Evolution of virulence: coinfection and propagule production in spore-producing parasites. BMC Evolutionary Biology 5: 64. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Schjørring S, Koella JC (2003) Sub-lethal effects of pathogens can lead to the evolution of lower virulence in multiple infections. Proceedings of the Royal Society of London Series B: Biological Sciences 270: 189–193. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Ben-Ami F, Mouton L, Ebert D (2008) The effects of multiple infections on the expression and evolution of virulence in a daphnia-endoparasite system. Evolution 62: 1700–1711. [DOI] [PubMed] [Google Scholar]
  • 12. Schneider KA, Kim Y (2010) An analytical model for genetic hitchhiking in the evolution of antimalarial drug resistance. Theor Popul Biol 78: 93–108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Klein EY, Smith DL, Laxminarayan R, Levin S (2012) Superinfection and the evolution of resistance to antimalarial drugs. Proc Biol Sci 279: 3834–3842. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Ben-Ami F, Routtu J (2013) The expression and evolution of virulence in multiple infections: the role of specificity, relative virulence and relative dose. BMC Evolutionary Biology 13: 97. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Thanapongpichat S, McGready R, Luxemburger C, Day N, White N, et al. (2013) Microsatellite genotyping of plasmodium vivax infections and their relapses in pregnant and non-pregnant patients on the thai-myanmar border. Malaria Journal 12: 275. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Cohen T, van Helden PD, Wilson D, Colijn C, McLaughlin MM, et al. (2012) Mixed-strain mycobacterium tuberculosis infections and the implications for tuberculosis treatment and control. Clinical Microbiology Reviews 25: 708–719. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Poon AFY, Swenson LC, Bunnik EM, Edo-Matas D, Schuitemaker H, et al. (2012) Reconstructing the dynamics of hiv evolution within hosts from serial deep sequence data. PLoS Comput Biol 8: e1002753. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Theron A, Sire C, Rognon A, Prugnolle F, Durand P (2004) Molecular ecology of schistosoma mansoni transmission inferred from the genetic composition of larval and adult infrapopulations within intermediate and definitive hosts. Parasitology 129: 571–585. [DOI] [PubMed] [Google Scholar]
  • 19. Lindstrm I, Sundar N, Lindh J, Kironde F, Kabasa J, et al. (2008) Isolation and genotyping of toxoplasma gondii from ugandan chickens reveals frequent multiple infections. Parasitology 135: 39–45. [DOI] [PubMed] [Google Scholar]
  • 20. Hill WG, Babiker HA (1995) Estimation of numbers of malaria clones in blood samples. Proceedings of the Royal Society of London Series B: Biological Sciences 262: 249–257. [DOI] [PubMed] [Google Scholar]
  • 21. Schneider K, Kim Y (2011) Approximations for the hitchhiking effect caused by the evolution of antimalarial-drug resistance. Journal of Mathematical Biology 62: 789–832. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
  • 23.Davison AC (2003) Statistical Models (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge University Press.
  • 24.Venzon DJ, Moolgavkar SH (1988) A method for computing profile-likelihood-based condence intervals. Journal of the Royal Statistical Society Series C (Applied Statistics) 37: pp. 87–94.
  • 25. McCollum AM, Mueller K, Villegas L, Udhayakumar V, Escalante AA (2007) Common origin and fixation of plasmodium falciparum dhfr and dhps mutations associated with sulfadoxinepyrimethamine resistance in a low-transmission area in south america. Antimicrob Agents Chemother 51: 2085–2091. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. McCollum AM, Basco LK, Tahar R, Udhayakumar V, Escalante AA (2008) Hitchhiking and Selective Sweeps of Plasmodium falciparum Sulfadoxine and Pyrimethamine Resistance Alleles in a Population from Central Africa. Antimicrob Agents Chemother 52: 4089–4097. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. McCollum AM, Schneider KA, Griffing SM, Zhou Z, Kariuki S, et al. (2012) Differences in selective pressure on dhps and dhfr drug resistant mutations in western kenya. Malar J 11: 77. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from PLoS ONE are provided here courtesy of PLOS

RESOURCES