A Likelihood Approach to Estimate the Number of Co-Infections

Kristan A Schneider; Ananias A Escalante

doi:10.1371/journal.pone.0097899

PLoS One. 2014 Jul 2;9(7):e97899. doi: 10.1371/journal.pone.0097899

Editor: Art F Y Poon

PMCID: PMC4079681 PMID: 24988302

Abstract

The number of co-infections of a pathogen (multiplicity of infection or MOI) is a relevant parameter in epidemiology as it relates to transmission intensity. Notably, such quantities can be built into a metric in the context of disease control and prevention. Having applications to malaria in mind, we develop here a maximum-likelihood (ML) framework to estimate the quantities of interest at low computational cost and at no additional cost to study designs or data collection. We show how the ML estimate for the quantities of interest and corresponding confidence-regions are obtained from multiple genetic loci. Assuming specifically that infections are rare and independent events, the number of infections per host follows a conditional Poisson distribution. Under this assumption, we show that a unique ML estimate for the parameter ($\lambda$) describing MOI exists which is found by a simple recursion. Moreover, we provide explicit formulas for asymptotic confidence intervals, and show that profile-likelihood-based confidence intervals exist, which are found by a simple two-dimensional recursion. Based on the confidence intervals we provide alternative statistical tests for the MOI parameter. Finally, we illustrate the methods on three malaria data sets. The statistical framework however is not limited to malaria.

Introduction

Infections are ubiquitous and ecologically complex processes. Indeed, the chain of events leading to the colonization and replication of parasites within a host involves many environmental, physiological, and genetic factors both in the host and the infectious agent. A common observation in many host-parasite interactions is that there are multiple genetically distinct lineages of the pathogen infecting the same individual host [1]–[3]. Whereas in some diseases such as malaria this is considered an important parameter, in others it is still a somewhat neglected aspect that is just starting to be considered [2].

The observation of multiple genetic variants or multiplicity of infection (MOI) is indicative of the transmission dynamics since it allows for the co-transmission of different parasite variants or the overlap of several genetic variants due to multiple infectious contacts. Thus, the incidence of MOI or superparasitism per se is an important metric of exposure [2], [4]–[7]. In addition to its epidemiological importance, as many other ecological processes involving genetically distinct individuals, MOI leads to several outcomes derived from the interactions among lineages. This process is usually referred to as the intra-host dynamics [3].

During the last two decades, the outcomes of intra-host dynamics have been the subject of several theoretical and experimental investigations exploring a broad spectrum of scenarios. Usually, such studies focus on major effects that different interconnected factors have in terms of parasite dispersion (parasite fitness) and/or the elicited manifestations of disease that may lead to an effect on the host's fitness [3], [8]–[11]. Furthermore, intra-host dynamics also affect the spread of parasite lineages with adaptive mutations conferring resistance to antimicrobial agents or that allow the evasion of immune and/or vaccine-mediated protection [12], [13]. Under all these circumstances, following or measuring MOI as a parameter is essential whenever epidemiological inferences or models involving intra-host dynamics are formulated.

Although it is possible to control or measure the number of distinctive parasite lineages in models and experimental settings (e.g., [14]), a totally different scenario is the one faced by those studying naturally occurring infections in the context of ecological and epidemiological investigations [4]–[6], [15], [16]. Under such circumstances, MOI is usually measured by ad hoc metrics that rely on a set of genetic markers or the observed polymorphism in one or several genes [2]. The need for an experimental definition of MOI has generated approaches based on phylogenetic frameworks (e.g., for many viruses) or some form of multi-locus genotyping [2], [17]. Whereas such approximations have been useful, there is still a need for a formal statistical framework that allows the estimation of the actual number of lineages and other approximations to MOI that facilitate and/or consider confounding factors.

Given the broad spectrum of genetic architectures observed in parasitic organisms, it is not possible to define a universal framework of MOI. For example, HIV accumulates mutations at a rate that allows for the use of phylogeny-based methods [17]. On the other hand, eukaryotic parasites such as Plasmodium, Trypanosoma, Toxoplasma, and Schistosoma [18], [19] and bacteria such as Mycobacterium [16] evolve at a rate at which it is possible to determine a stable number of genetically distinct lineages during the course of an infection given a set of genetic markers. In this investigation, we describe a formal statistical framework to estimate MOI that allows, among other aspects, building formal tests for comparing groups, e.g., before or after deploying an intervention such as a vaccine, complicated versus non-complicated cases, populations with different exposures, among other possibilities.

More specifically, we further develop the maximum-likelihood framework introduced by [20], which allows estimating MOI and prevalences of pathogen lineages from a single genetic marker, e.g., microsatellite loci. We establish how to compute ML estimates and confidence intervals (or regions) for all involved parameters. Based on these, we show how statistical tests can be constructed to test the parameters. Although the framework is, in principle, not restricted to a particular disease or species, we applied it to malaria by comparing data sets from three endemic regions with different levels of endemicity.

The method section is structured as follows. We first establish the general methods and then refine them assuming that the number of co-infections follows a conditional Poisson distribution. This structure makes it easier to see how particular results can be derived for alternatives to the Poisson distribution. Moreover, rigorous mathematical proofs are deferred to the appendix. Readers less interested in these technical details should feel free to skip them.

Methods

We adapt the maximum-likelihood method of [20] to estimate the average MOI. This approach is fully compatible with the model of [12], [21], which describes the hitchhiking effect associated with drug resistance in malaria, for which MOI is a fundamental quantity. Being able to estimate MOI, the model can be ‘reverse engineered’ to reconstruct the evolutionary process underlying drug resistance. By doing so, a formal means is provided to identify those among the many compounding factors that can be influenced to slow down or prevent the spread of drug resistance in the course of public health initiatives.

1 Model background

Assume $n$ different ‘lineages’ of a pathogen, e.g., alleles at a marker locus (or haplotypes in a non-recombining region), circulate in a given population. Particularly, we have neutral markers in mind characterizing lineages, so that their frequencies do not change too rapidly, e.g., due to selection. The $n$ lineages considered are those that contribute to infection, not new variants that are generated by mutation inside hosts, but ‘fail’ to participate in transmission.

Because we identify a pathogen with the allele at the considered locus, we will use the terms ‘lineage’ and ‘allele’ synonymously. (We refrain from using the term strain, as we refer here to a genotypic characterization and the term strain may have different meanings across pathogens.)

In vector notation, the lineages' relative frequencies are $\mathbf{p} = (p_1, \dots, p_n)$. An individual (host) is infected by $m$ (not necessarily different) lineages of the pathogen with probability $\kappa_m$. The lineages are sampled randomly from the pathogen population. Hence, within an infection, the combination of pathogen lineages follows a multinomial distribution with parameters $m$ and $\mathbf{p}$. Consequently, the probability that $m_k$ of the $m$ infecting lineages carry allele $A_k$ ($k = 1, \dots, n$) is given by $\binom{m}{\mathbf{m}} \mathbf{p}^{\mathbf{m}}$, where $\mathbf{m} = (m_1, \dots, m_n)$ with $|\mathbf{m}| = \sum_{k} m_k = m$, $\binom{m}{\mathbf{m}} = \frac{m!}{m_1! \cdots m_n!}$ is a multinomial coefficient, and $\mathbf{p}^{\mathbf{m}} = \prod_{k=1}^{n} p_k^{m_k}$. Clearly, $\mathbf{m}$ summarizes the pathogen configuration infecting a host.

In practice, $\mathbf{m}$ is unknown for a given host. It is possible to detect which alleles (or lineages) are present in a clinical sample, but it is difficult to reliably reconstruct $\mathbf{m}$ without using next generation sequencing, a technology that is not practical to use in many settings. For instance, if only a single allele, say $A_k$, is found in a clinical sample, the patient might have been infected by just one parasite lineage ($m = 1$), or co-infected by several lineages ($m > 1$), all of which carry allele $A_k$. Hence, it is convenient to represent an infection (lineages detected in a patient) by a vector of zeros and ones of length $n$, referring to the detected alleles (lineages). Hence, a clinical sample is represented by a vector $\mathbf{i} = (i_1, \dots, i_n)$, where $i_k = 1$ if $A_k$ is found in the infection, and $i_k = 0$ otherwise. In mathematical terms, $\mathbf{i} = \operatorname{sign}(\mathbf{m})$, applied componentwise. (Remember $\operatorname{sign}(0) = 0$ and $\operatorname{sign}(x) = 1$ for $x > 0$.) Note that the vector $\mathbf{0}$ is excluded, which corresponds to no infection. In the following, $\mathbf{m}$ will always denote a vector of nonnegative integers and $\mathbf{i}$ a vector of zeros and ones.

Let $m$ be the multiplicity of infection (MOI) with distribution $(\kappa_m)_{m \ge 1}$. Because $m$ is unknown in practice, we aim to estimate it from clinical samples, or rather some summary statistics characterizing $(\kappa_m)$.

Assume a total of $N$ clinical samples, taken from different hosts roughly at the same time. We assume that the lineages detected in the samples are all lineages circulating in the population. (There is no knowledge of undetectable lineages.) Each clinical sample contains one or more of the $n$ lineages (alleles). (We assume that lineages that infected the host have not vanished due to intra-host dynamics, e.g., drug treatments, and that new lineages have not emerged inside the host, e.g., by mutation, recombination etc.) A clinical specimen with allelic (or lineage) configuration $\mathbf{i}$ could descend from an infection with pathogen configuration $\mathbf{m}$ as long as $\operatorname{sign}(\mathbf{m}) = \mathbf{i}$. Let $Q_{\mathbf{i}}$ denote the expected frequency of clinical specimens with allelic configuration $\mathbf{i}$. Then,

$$Q_{\mathbf{i}} = \sum_{m \ge |\mathbf{i}|} \kappa_m \sum_{\substack{\mathbf{m}:\ |\mathbf{m}| = m \\ \operatorname{sign}(\mathbf{m}) = \mathbf{i}}} \binom{m}{\mathbf{m}} \mathbf{p}^{\mathbf{m}}, \tag{1}$$

where the first sum runs over all integers larger than or equal to $|\mathbf{i}| = \sum_k i_k$, as this obviously is the minimum number of parasite lineages that could have caused the infection. The second sum runs over all possible configurations $\mathbf{m}$ of exactly $m$ parasites that lead to the allelic configuration $\mathbf{i}$ (i.e., $\operatorname{sign}(\mathbf{m}) = \mathbf{i}$), and hence could have potentially infected the host.
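As a small illustration of equation (1), the following sketch evaluates the expected frequency of an allelic configuration by brute-force enumeration of all pathogen configurations. The function names, the truncation point `m_max`, and the example MOI distribution are hypothetical choices made for this illustration only; they are not part of the original framework.

```python
import math
from itertools import product

def multinomial_prob(m_vec, p):
    """Multinomial probability of pathogen configuration m_vec = (m_1, ..., m_n)."""
    coef = math.factorial(sum(m_vec))
    for m_k in m_vec:
        coef //= math.factorial(m_k)  # exact: every partial quotient is an integer
    return coef * math.prod(p_k ** m_k for p_k, m_k in zip(p, m_vec))

def sign(m_vec):
    """Detected-allele vector: 1 where a lineage is present, 0 otherwise."""
    return tuple(1 if m_k > 0 else 0 for m_k in m_vec)

def expected_frequency(i_vec, p, kappa, m_max):
    """Evaluate equation (1); the outer sum is truncated at m_max (an assumption)."""
    i_vec = tuple(i_vec)
    total = 0.0
    for m in range(sum(i_vec), m_max + 1):
        # Enumerate all configurations of exactly m lineages that show pattern i_vec.
        for m_vec in product(range(m + 1), repeat=len(p)):
            if sum(m_vec) == m and sign(m_vec) == i_vec:
                total += kappa(m) * multinomial_prob(m_vec, p)
    return total

# Hypothetical example: three alleles and an MOI distribution supported on {1, 2, 3}.
p = (0.5, 0.3, 0.2)
kappa = lambda m: {1: 0.6, 2: 0.3, 3: 0.1}.get(m, 0.0)
for i_vec in product((0, 1), repeat=len(p)):
    if any(i_vec):
        print(i_vec, round(expected_frequency(i_vec, p, kappa, m_max=3), 4))
```

The enumeration grows quickly with the number of alleles and with `m_max`, so it is only meant for checking small cases by hand.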

It follows that, for a given allele-frequency distribution $\mathbf{p}$, $Q_{\mathbf{i}}$ is determined by the distribution $(\kappa_m)$. If infections with the pathogen are rare, a natural assumption is that the number of pathogens infecting a host is Poisson distributed, or more precisely follows a conditional Poisson distribution (CPD), i.e.,

$$\kappa_m = \frac{1}{e^{\lambda} - 1}\,\frac{\lambda^m}{m!}, \qquad m = 1, 2, \dots \tag{2}$$

Of note, this conditions on the fact that each host is infected by at least one pathogen. The mean value of this distribution is

$$\frac{\lambda}{1 - e^{-\lambda}}.$$

Assuming the CPD (2), $Q_{\mathbf{i}}$ can explicitly be derived. In Analysis (subsection 4.1) it is shown that

$$Q_{\mathbf{i}} = \frac{1}{e^{\lambda} - 1} \prod_{k=1}^{n} \bigl(e^{\lambda p_k} - 1\bigr)^{i_k}.$$
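The conditional Poisson distribution and the resulting closed form for the expected frequencies, $Q_{\mathbf{i}} = (e^{\lambda} - 1)^{-1} \prod_k (e^{\lambda p_k} - 1)^{i_k}$, can be sketched in a few lines. The function names and the parameter values below are hypothetical and serve only as an illustration.

```python
import math
from itertools import product

def cpd_pmf(m, lam):
    """Probability of m infecting lineages under the conditional Poisson distribution (2)."""
    if m < 1:
        return 0.0
    # lam^m / m! evaluated on the log scale, divided by (e^lam - 1)
    return math.exp(m * math.log(lam) - math.lgamma(m + 1)) / math.expm1(lam)

def cpd_mean(lam):
    """Mean MOI under the CPD: lam / (1 - e^(-lam))."""
    return lam / (1.0 - math.exp(-lam))

def q_closed_form(i_vec, lam, p):
    """Expected frequency of allelic configuration i_vec under the CPD."""
    value = 1.0
    for i_k, p_k in zip(i_vec, p):
        if i_k:
            value *= math.expm1(lam * p_k)  # factor e^(lam * p_k) - 1 for detected alleles
    return value / math.expm1(lam)

# Hypothetical parameters
lam, p = 1.5, (0.5, 0.3, 0.2)
print("mean MOI:", cpd_mean(lam))
for i_vec in product((0, 1), repeat=len(p)):
    if any(i_vec):
        print(i_vec, q_closed_form(i_vec, lam, p))
```

Summing the closed form over all non-zero configurations gives one, which is a quick sanity check of the formula.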

2 Maximum likelihood

Consider a total of $N$ samples or clinical specimens, of which $n_{\mathbf{i}}$ have allelic configuration $\mathbf{i}$. Hence, $N = \sum_{\mathbf{i}} n_{\mathbf{i}}$, where the sum runs over all zero-one vectors of length $n$, i.e., $\mathbf{i} \in \{0,1\}^n \setminus \{\mathbf{0}\}$ (the case of no infection, i.e., $\mathbf{i} = \mathbf{0}$, is excluded).

Since the (natural) likelihood for observing these samples is $\prod_{\mathbf{i}} Q_{\mathbf{i}}^{n_{\mathbf{i}}}$, the log-likelihood is given by

$$L = \sum_{\mathbf{i}} n_{\mathbf{i}} \log Q_{\mathbf{i}}. \tag{3}$$

Assuming the CPD for the number of lineages infecting a host, it is shown in Analysis (subsection 4.2) that the log-likelihood becomes

$$L(\lambda, \mathbf{p}) = -N \log\bigl(e^{\lambda} - 1\bigr) + \sum_{k=1}^{n} N_k \log\bigl(e^{\lambda p_k} - 1\bigr), \tag{4}$$

where $N_k$ is the number of samples that contain allele $A_k$. The prevalence of allele $A_k$ is then $N_k / N$. Notably, $\sum_{k=1}^{n} N_k \ge N$ with equality if and only if exclusively single-lineage infections occur. This is one of two special cases that need to be treated separately. In the other special case all lineages are found in every infection. These cases are somewhat non-generic. We shall therefore formulate the following generic assumption.

Assumption 1 Assume that the sum over the alleles' prevalences is larger than one, but not all alleles are 100% prevalent. In other words, more than one lineage is found in at least one infection, i.e., $\sum_{k=1}^{n} N_k > N$, and not all lineages are found in every infection, i.e., $N_k < N$ for at least one $k$.
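To make the quantities entering the log-likelihood concrete, the sketch below counts the allele-specific sample numbers $N_k$ from zero-one sample vectors, evaluates the CPD log-likelihood $-N \log(e^{\lambda} - 1) + \sum_k N_k \log(e^{\lambda p_k} - 1)$, and checks Assumption 1. The function names and the toy data are hypothetical.

```python
import math

def allele_counts(samples):
    """N_k: number of samples in which allele k was detected."""
    n = len(samples[0])
    return [sum(s[k] for s in samples) for k in range(n)]

def log_likelihood(lam, p, N, N_k):
    """CPD log-likelihood; assumes every allele has p_k > 0."""
    return -N * math.log(math.expm1(lam)) + sum(
        Nk * math.log(math.expm1(lam * pk)) for Nk, pk in zip(N_k, p))

def satisfies_assumption_1(N, N_k):
    """At least one multiple infection, and not every allele in every sample."""
    return sum(N_k) > N and any(Nk < N for Nk in N_k)

# Hypothetical toy data: five samples typed at a marker with three alleles.
samples = [(1, 0, 0), (1, 1, 0), (0, 1, 0), (0, 0, 1), (1, 0, 1)]
N = len(samples)
N_k = allele_counts(samples)
print("N_k =", N_k, "Assumption 1 holds:", satisfies_assumption_1(N, N_k))
print("log-likelihood at lam = 1.2:", log_likelihood(1.2, (0.5, 0.3, 0.2), N, N_k))
```

Only the counts $N_k$ and the sample size $N$ enter the log-likelihood, so the full table of allelic configurations is not needed for estimation under the CPD.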

Results

In the following $\lambda$ will refer to the parameter of the CPD, or in the general case, to the parameter (or parameter vector) summarizing the distribution $(\kappa_m)$. In the latter case has to be interpreted as .

We first derive the maximum likelihood (ML) estimates for the parameters of interest, beginning with a rather intuitive observation.

Not surprisingly, $\lambda = 0$ can never be an ML estimate if multiple alleles are found in at least one sample, as $\lambda = 0$ implies single infections only. We summarize this in the following remark, which is proved in Analysis (subsection 4.3).

Remark 1 If at least one sample contains more than one allele, i.e., $\sum_{k=1}^{n} N_k > N$, then $\lambda = 0$ is not the maximum likelihood estimate.

To obtain the ML estimate for $(\lambda, \mathbf{p})$, (4) needs to be maximized on the simplex, either using the method of Lagrange multipliers or by eliminating one of the redundant variables, e.g., by setting $p_n = 1 - \sum_{k=1}^{n-1} p_k$. When using Lagrange multipliers we need to find the zeros of the derivatives of

$$\Lambda(\lambda, \mathbf{p}, \beta) = L(\lambda, \mathbf{p}) - \beta \Bigl(\sum_{k=1}^{n} p_k - 1\Bigr), \tag{5}$$

i.e., the solutions of $\nabla \Lambda = \mathbf{0}$, where $\beta$ denotes the Lagrange multiplier. The derivatives based on the conditional Poisson distribution are derived in Analysis (subsection 4.4). The equations can be straightforwardly solved by a Newton method, i.e., by iterating

(6a)

(6b)

and Inline graphic is any initial choice of and . Here, denotes the (transposed) Hessian matrix evaluated at , i.e.,

(7)

If, in the general case, Inline graphic is a parameter vector, the derivatives above have to be interpreted accordingly.

In the case of the conditional Poisson distribution (2) the entries of the Hessian matrix are derived in Analysis (subsection 4.4).

Clearly, instead of (6) also Inline graphic can be iterated, which, however, is numerically less recommendable. Alternative approaches would be using an iterative least-square algorithm or the EM algorithm (cf. e.g.[22]).

Of note, in general, an ML estimate need not exist, nor need it be unique, not to mention that closed formulas typically do not exist. Unfortunately, assuming the CPD (2), the ML estimate indeed cannot be calculated explicitly. However, the estimate exists and is unique. Furthermore, although it can be straightforwardly derived by the above methods, the whole procedure can be greatly simplified.

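Result 1 Assume the conditional Poisson distribution (2) for $m$. Under Assumption 1 there is a unique maximum likelihood estimate $(\hat\lambda, \hat{\mathbf{p}})$. The first component $\hat\lambda$ is the unique positive solution of the equation

$$\lambda + \sum_{k=1}^{n} \log\Bigl(1 - \frac{N_k}{N}\bigl(1 - e^{-\lambda}\bigr)\Bigr) = 0. \tag{8}$$

It is found by iterating

$$\lambda_{t+1} = \lambda_t - \frac{\lambda_t + \sum_{k=1}^{n} \log\Bigl(1 - \frac{N_k}{N}\bigl(1 - e^{-\lambda_t}\bigr)\Bigr)}{1 - \sum_{k=1}^{n} \frac{N_k}{N e^{\lambda_t} - N_k \bigl(e^{\lambda_t} - 1\bigr)}}, \tag{9}$$

which converges monotonically and at quadratic rate from any initial value $\lambda_1 > \hat\lambda$.

The maximum likelihood estimates of the allele frequencies are given by

$$\hat p_k = -\frac{1}{\hat\lambda} \log\Bigl(1 - \frac{N_k}{N}\bigl(1 - e^{-\hat\lambda}\bigr)\Bigr), \qquad k = 1, \dots, n. \tag{10}$$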

The result is proven in Analysis (subsection 5.1).
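The recursion of Result 1 is easy to implement. The sketch below applies Newton's method to the function $\lambda + \sum_k \log\bigl(1 - \tfrac{N_k}{N}(1 - e^{-\lambda})\bigr)$ and then computes the allele frequencies from the resulting estimate. The function name, the doubling rule used to pick a starting value to the right of the root, and the example counts are assumptions made for this illustration. The sketch also requires $0 < N_k < N$ for every allele, a stricter condition than Assumption 1 that is imposed here as a safeguard so that the starting-value search terminates.

```python
import math

def estimate_moi(N, N_k, tol=1e-12, max_iter=100):
    """Return (lam_hat, p_hat) for sample size N and allele counts N_k under the CPD."""
    if sum(N_k) <= N or any(Nk <= 0 or Nk >= N for Nk in N_k):
        raise ValueError("need sum(N_k) > N and 0 < N_k < N for every allele")
    a = [Nk / N for Nk in N_k]  # allele prevalences

    def f(lam):
        # 1 - a_k (1 - e^-lam) is written as 1 + a_k * expm1(-lam) for accuracy
        return lam + sum(math.log1p(ak * math.expm1(-lam)) for ak in a)

    def f_prime(lam):
        return 1.0 - sum(Nk / (N * math.exp(lam) - Nk * math.expm1(lam)) for Nk in N_k)

    lam = 1.0
    while f(lam) <= 0.0:  # move to the right of the positive root
        lam *= 2.0
    for _ in range(max_iter):  # Newton steps; iterates decrease towards the root
        step = f(lam) / f_prime(lam)
        lam -= step
        if abs(step) < tol:
            break
    p_hat = [-math.log1p(ak * math.expm1(-lam)) / lam for ak in a]
    return lam, p_hat

# Hypothetical data: 100 samples, three alleles detected in 60, 50 and 30 of them.
lam_hat, p_hat = estimate_moi(100, [60, 50, 30])
print("lambda_hat =", lam_hat)
print("p_hat =", p_hat, "sum =", sum(p_hat))
```

Because the function is convex and vanishes at zero, starting too close to zero would drive Newton's method to the trivial root; starting to the right of the positive root avoids this.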

For the sake of completeness we shall also consider the instances in which Assumption 1 is violated. In the first situation, only one pathogen lineage is found in each infection, i.e., there is no indication whatsoever of co-infections. The results are summarized in the following remark which is proven in Analysis (subsection 5.1).

Remark 2 Assume that each sample contains only one allele, i.e., $\sum_{k=1}^{n} N_k = N$. Then the ML estimates are $\hat\lambda = 0$ and $\hat p_k = N_k / N$.

In the other non-generic case, in which all alleles are found in every sample, an ML estimate does not exist; more precisely, it is $\hat\lambda = \infty$, implying that, with probability one, all alleles are in every sample independently of the allele-frequency distribution.

Remark 3 Assume $N_k = N$ for all $k$. Then the ML estimate is “$\hat\lambda = \infty$” for every allelic distribution.

A proof can be found in Analysis (subsection 5.1).

Of note, the maximum likelihood has an intuitive interpretation. We summarize this as the following result which is proven in Analysis (subsection 5.1).

Remark 4 The maximum likelihood estimate $(\hat\lambda, \hat{\mathbf{p}})$ is the set of parameters for which the observed number of samples containing allele $A_k$ equals its expectation, i.e.,

$$N_k = N\,\frac{1 - e^{-\hat\lambda \hat p_k}}{1 - e^{-\hat\lambda}}, \qquad k = 1, \dots, n.$$

Hence, the maximum likelihood maximizes the expectation of the log-likelihood.

1 Confidence intervals from the profile-likelihood

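Let $(\hat\lambda, \hat{\mathbf{p}})$ denote the ML estimate. Confidence intervals can be derived from the profile-likelihood for each parameter.

We are interested in finding a confidence interval (CI) for $\lambda$. For a fixed value of $\lambda$, the profile likelihood is defined as

$$L_p(\lambda) = \max_{\mathbf{p}} L(\lambda, \mathbf{p}),$$

i.e., as the maximum likelihood taken over the remaining parameters while keeping the parameter of interest fixed. Moreover, denote the maximum likelihood by $\hat L$ (clearly $\hat L = L_p(\hat\lambda)$). Suppose $\lambda_0$ is the true parameter and $L_p(\lambda_0)$ the corresponding profile likelihood. Then

$$2\bigl(\hat L - L_p(\lambda_0)\bigr) \sim \chi^2_1, \tag{11}$$

i.e., twice the difference of the maximum likelihood minus the profile likelihood assuming the true parameter is asymptotically $\chi^2$ distributed with one degree of freedom (cf. e.g. [23], chapter 4). This can be used to construct confidence intervals for the true parameter $\lambda_0$. To construct a CI at the $(1-\alpha)$ level, we need to find all $\lambda$ satisfying

$$2\bigl(\hat L - L_p(\lambda)\bigr) \le \chi^2_{1,1-\alpha},$$

i.e., we need to find $\lambda$ satisfying $L_p(\lambda) \ge \hat L - \tfrac{1}{2}\chi^2_{1,1-\alpha}$, where $\chi^2_{k,1-\alpha}$ denotes the $(1-\alpha)$-quantile of the $\chi^2$ distribution with $k$ degrees of freedom. In other words, the equation $L_p(\lambda) = \hat L - \tfrac{1}{2}\chi^2_{1,1-\alpha}$ needs to be solved. By definition of $L_p$, this means that $L(\lambda, \mathbf{p}) = \hat L - \tfrac{1}{2}\chi^2_{1,1-\alpha}$ needs to be solved with respect to $\lambda$, while simultaneously maximizing $L(\lambda, \mathbf{p})$ with respect to $\mathbf{p}$. The latter is done using the method of Lagrange multipliers for fixed $\lambda$, i.e.,

$$L(\lambda, \mathbf{p}) - \beta \Bigl(\sum_{k=1}^{n} p_k - 1\Bigr)$$

is maximized with respect to $\mathbf{p}$. This leads to the equations $\partial L / \partial p_k = \beta$ for $k = 1, \dots, n$, together with $\sum_{k} p_k = 1$. Therefore, following [24] the bounds of the confidence intervals are found by solving the following system of equations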

(12)

where Inline graphic

Clearly, Inline graphic can be straightforwardly solved by a Newton method, i.e., by iterating

(13a)

where ( Inline graphic ) is the solution of the system of linear equations

(13b)

and Inline graphic is any initial choice of , and . The derivative is identical to (7) except for the first line, which needs to be replaced by

(14)

The derivatives of Inline graphic are given by (39). Hence, is given by

(15)

where all derivatives are given by (39) and (40).

Again, alternatively Inline graphic can be iterated, which however requires to invert the matrix in every iteration step. The alternatives to the Newton method are again the EM algorithm or an iterated least-mean-square algorithm.

To obtain the confidence bounds Inline graphic and it is necessary to iterate (13) from two different initial values. Of note, obtaining one bound for the confidence interval is numerically only as demanding as obtain the ML estimate.

Confidence intervals for the allele frequencies Inline graphic are obtained similarly by iterating (13) with obvious changes. Namely, the first component of the function needs to be replaced by and the -th component by , i.e., is the gradient of with the derivative with respect to replaced by . Consequently is identical to with the -th component replaced by (14).

Importantly, existence and uniqueness of the confidence bounds Inline graphic and can be proved under the assumption of the CPD (2). Moreover, it is possible to significantly reduce the complexity of the Newton method (13) to find the CI's bounds. We obtain the following result, which is proven in Analysis (subsection 5.2).

Result 2 Suppose Assumption 1 holds. If Inline graphic is given by the conditional Poisson distribution (2), the confidence interval for (based on the profile likelihood) is uniquely defined.

The bounds of the confidence interval ( Inline graphic and ) for are obtained by iterating

(16a)

(16b)

where

(16d)

and

(16e)

There are exactly two possible solutions Inline graphic and . The algorithm converges quadratically for any initial values sufficiently close to one of the solutions.

The proof is found in Analysis (subsection 5.2).
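A profile-likelihood confidence interval for the Poisson parameter can also be obtained numerically without the recursion of Result 2. The following sketch is such a substitute: it uses plain bisection, first to maximize the CPD log-likelihood over the allele frequencies for a fixed parameter value (via the stationarity condition of the constrained maximization), and then to locate the two points where the profile likelihood drops by half the $\chi^2_1$ quantile below its maximum. All function names and the example counts are hypothetical, and the sketch assumes $0 < N_k < N$ for all alleles and moderate parameter values (roughly below 30).

```python
import math
from statistics import NormalDist

def mle_lambda(N, N_k):
    """Newton iteration for the ML estimate of the Poisson parameter."""
    a = [Nk / N for Nk in N_k]
    f = lambda x: x + sum(math.log1p(ak * math.expm1(-x)) for ak in a)
    df = lambda x: 1.0 - sum(ak * math.exp(-x) / (1.0 + ak * math.expm1(-x)) for ak in a)
    lam = 1.0
    while f(lam) <= 0.0:  # start to the right of the positive root
        lam *= 2.0
    for _ in range(100):
        step = f(lam) / df(lam)
        lam -= step
        if abs(step) < 1e-13:
            break
    return lam

def profile_freqs(lam, N_k):
    """Frequencies maximizing the log-likelihood for fixed lam.
    Stationarity gives 1 - exp(-lam * p_k) = r_k * (1 - exp(-t)) with r_k = N_k / max(N_k);
    t in (0, lam] is chosen by bisection so that the frequencies sum to one."""
    r = [Nk / max(N_k) for Nk in N_k]
    lo, hi = 0.0, lam
    for _ in range(100):
        t = 0.5 * (lo + hi)
        if sum(-math.log1p(rk * math.expm1(-t)) for rk in r) < lam:
            lo = t
        else:
            hi = t
    t = 0.5 * (lo + hi)
    return [-math.log1p(rk * math.expm1(-t)) / lam for rk in r]

def profile_loglik(lam, N, N_k):
    p = profile_freqs(lam, N_k)
    return -N * math.log(math.expm1(lam)) + sum(
        Nk * math.log(math.expm1(lam * pk)) for Nk, pk in zip(N_k, p))

def profile_ci(N, N_k, alpha=0.05):
    """Return (lower bound, ML estimate, upper bound)."""
    lam_hat = mle_lambda(N, N_k)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # chi-square(1) quantile equals z**2
    target = profile_loglik(lam_hat, N, N_k) - 0.5 * z * z
    g = lambda lam: profile_loglik(lam, N, N_k) - target

    def bisect(lo, hi):  # g changes sign exactly once between lo and hi
        lo_positive = g(lo) > 0.0
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if (g(mid) > 0.0) == lo_positive:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    lo = lam_hat / 2.0
    while g(lo) > 0.0:  # expand the bracket to the left
        lo /= 2.0
    hi = lam_hat * 2.0
    while g(hi) > 0.0:  # expand the bracket to the right
        hi *= 2.0
    return bisect(lo, lam_hat), lam_hat, bisect(lam_hat, hi)

# Hypothetical data: 100 samples, alleles found in 60, 50 and 30 of them.
print(profile_ci(100, [60, 50, 30]))
```

Bisection is slower than a Newton-type recursion but needs no derivatives, which keeps the sketch short and easy to verify.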

Formally, the above result holds true in the non-generic cases Inline graphic and . If all samples contain just one lineage, i.e., , the ML estimate is and the confidence interval has the form . If all samples contain all lineages, i.e., the maximum likelihood estimate is and the confidence interval has the form , hence it is infinitely large. Although, formally the result still holds, the asymptotic (11) is no longer true, as discussed in Analysis (subsection 6), rendering the result inapplicable if Assumption 1 is violated.

2 Asymptotic confidence intervals

As an alternative to the profile likelihood, one can use the asymptotic normality of the maximum likelihood to construct confidence intervals. Asymptotically the difference of the maximum likelihood ( Inline graphic ) and the true parameter () is normally distributed. However, it is important to notice that - unless one eliminates one of the redundant allele frequencies - the Lagrange multiplier needs to be treated like a regular parameter. The corresponding likelihood function is of course given by (5). Hence, the actual parameters involved are Inline graphic . The difference of the maximum likelihood and the true parameter is asymptotically distributed according to

(17a)

(17b)

where Inline graphic is the expected Fisher information and is the observed Fisher information (based on sample size ). The matrix is the transposed Hessian matrix given by (7).

The expression Inline graphic is the convenient, although imprecise notation, for , where is the -dimensional identity matrix and the symmetric square root of the Fisher information. Namely, any positive semi-definite, symmetric matrix (as it is the case of any covariance matrix, and particularly the Fisher information) has a spectral decomposition Inline graphic , where is orthogonal and is the diagonal matrix that contains all eigenvalues. These are real and nonnegative, and the diagonal matrix that contains the square roots of the eigenvalues is denoted by . Hence, by setting , we have .

An often used alternative notation is

with Inline graphic and .

From (17) the asymptotic distribution of the parameters of interest $(\lambda, \mathbf{p})$ follows immediately by dropping the ‘dummy’ variable (the Lagrange multiplier) and the corresponding row and column in the inverse Fisher information. Of note, this is not identical to ‘formally’ deriving the inverse Fisher information based on $\lambda$ and $\mathbf{p}$ alone. Namely, it is important to derive the asymptotic covariance matrix with respect to $(\lambda, \mathbf{p})$ and the Lagrange multiplier.

Since Inline graphic the bounds for the CI for are given by

(18)

and those for the components of Inline graphic by

(19)

Here, Inline graphic denotes the quantile of the standard normal distribution.

Of course, when using the expected Fisher information, Inline graphic needs to be replaced by . Under the assumption of the conditional Poisson distribution (2), the second derivatives needed to derive the Fisher information are calculated in Analysis (subsection 4.4; eq.39). Moreover, evaluated at the maximum likelihood estimate, , it is seen that the expected and observed Fisher information are identical, i.e., Inline graphic , when assuming (2).

With some algebraic manipulation it is possible to simplify the expressions for the confidence intervals assuming the CPD (2).

Result 3 Suppose the number of co-infections follow the conditional Poisson distribution (2) and that Assumption 1 holds. Then an asymptotic Inline graphic -confidence interval for is given by

(20)

Alternatively, the following formula requires just the ML estimate for $\lambda$

(21)

For a proof, see Analysis (subsection 5.3).

In the non-generic case $N_k = N$ for all $k$, the ML estimate is not unique, and we have $\hat\lambda = \infty$. Hence, asymptotic CIs make no sense in this case, neither for $\lambda$ nor for the frequencies $\mathbf{p}$.

In the case $\sum_{k=1}^{n} N_k = N$, it is also impossible to derive CIs as the asymptotics (17) break down (cf. subsection 6 in Analysis).

Explicit formulas for the CIs of the allele frequencies are obtained similarly.

Result 4 Under the same assumptions as Result 3, an asymptotic Inline graphic -confidence interval for is given by

(22)

The proof can again be found in Analysis (subsection 5.3).

3 Testing the parameters

In practice, data from several loci is typically available, each of which yields a different ML estimate or there might be some prior estimate for the parameters of interest. Depending on particular properties of the marker loci (mutation rate, allele-frequency spectrum, biochemical issues in determining motif repeats, etc.) different marker loci will lead to different ML estimates. Hence, it is desirable to test whether different estimates are significantly different. The confidence intervals can be adapted to test the parameters.

Clearly, at different marker loci, different alleles will segregate and the allele-frequency spectra will be very different. Hence, for the present purpose, it is meaningless to compare the allele frequencies at different loci. However, the estimate for $\lambda$ should be consistent, as this parameter is the same for all loci. Consequently, in the following we will focus on testing $\lambda$ and present three alternative tests for the null hypothesis $H_0: \lambda = \lambda_0$ vs. the alternative $H_A: \lambda \neq \lambda_0$.

3.1 The likelihood-ratio test

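The first test is rather straightforward. Since

$$2\bigl(\hat L - L_p(\lambda_0)\bigr) \sim \chi^2_1 \tag{23}$$

under the null hypothesis $H_0: \lambda = \lambda_0$, where $\hat L$ denotes the maximum likelihood and $L_p(\lambda_0)$ the profile likelihood at $\lambda_0$, the null hypothesis is rejected at significance level $\alpha$ if

$$2\bigl(\hat L - L_p(\lambda_0)\bigr) > \chi^2_{1,1-\alpha}.$$

In other words, we reject the null hypothesis for any $\lambda_0$ that lies outside the $(1-\alpha)$-confidence interval of $\lambda$, which is obtained as outlined above in “Confidence intervals from the profile likelihood”. Therefore, this test requires no additional numerical effort if the confidence intervals were already derived.

The corresponding p-value is given by

$$p = 1 - F_{\chi^2_1}\Bigl(2\bigl(\hat L - L_p(\lambda_0)\bigr)\Bigr), \tag{24}$$

where $F_{\chi^2_1}$ denotes the cumulative distribution function of the $\chi^2$ distribution with one degree of freedom.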

To calculate the p-value, Inline graphic needs to be derived first. Similarly as in section in “Confidence intervals from the profile likelihood”, this leads to the equations . Therefore, the system of equations

(25)

needs to be solved by a Newton method, i.e., by iterating

(26a)

(26b)

and Inline graphic is any initial choice of and . The derivative is obtained from (7) by deleting the first row and column and substituting , i.e.,

(27)

where all derivatives are given by (39) and (40).

Result 5 Suppose Assumption 1 and Inline graphic holds. In the case of the conditional poisson distribution, the p-value under the null hypothesis is given by (24), where is given by (4) with and given by

(28)

where Inline graphic is the solution of (16e) with .

The solution Inline graphic is found by iterating

(29)

The proof is presented in Analysis (subsection 5.4).
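The likelihood-ratio test can be sketched numerically as follows. The statistic is twice the difference between the maximized CPD log-likelihood and the profile likelihood at the hypothesized value of the Poisson parameter, and the p-value is the upper tail of the $\chi^2$ distribution with one degree of freedom, which equals $\operatorname{erfc}(\sqrt{x/2})$. The profile frequencies are found here by bisection, which is a stand-in for the recursion referred to in Result 5. Function names and data are hypothetical, and $0 < N_k < N$ is assumed for all alleles.

```python
import math

def mle_lambda(N, N_k):
    """Newton iteration for the ML estimate of the Poisson parameter."""
    a = [Nk / N for Nk in N_k]
    f = lambda x: x + sum(math.log1p(ak * math.expm1(-x)) for ak in a)
    df = lambda x: 1.0 - sum(ak * math.exp(-x) / (1.0 + ak * math.expm1(-x)) for ak in a)
    lam = 1.0
    while f(lam) <= 0.0:  # start to the right of the positive root
        lam *= 2.0
    for _ in range(100):
        step = f(lam) / df(lam)
        lam -= step
        if abs(step) < 1e-13:
            break
    return lam

def profile_loglik(lam, N, N_k):
    """Log-likelihood maximized over the allele frequencies for fixed lam (bisection)."""
    r = [Nk / max(N_k) for Nk in N_k]
    lo, hi = 0.0, lam
    for _ in range(100):
        t = 0.5 * (lo + hi)
        if sum(-math.log1p(rk * math.expm1(-t)) for rk in r) < lam:
            lo = t
        else:
            hi = t
    t = 0.5 * (lo + hi)
    p = [-math.log1p(rk * math.expm1(-t)) / lam for rk in r]
    return -N * math.log(math.expm1(lam)) + sum(
        Nk * math.log(math.expm1(lam * pk)) for Nk, pk in zip(N_k, p))

def lr_test(N, N_k, lam0):
    """Return (test statistic, p-value) for H0: lambda = lam0."""
    lam_hat = mle_lambda(N, N_k)
    stat = 2.0 * (profile_loglik(lam_hat, N, N_k) - profile_loglik(lam0, N, N_k))
    stat = max(stat, 0.0)  # guard against tiny negative values from rounding
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square(1) upper tail
    return stat, p_value

# Hypothetical data and null value
print(lr_test(100, [60, 50, 30], lam0=0.8))
```

A null value is rejected at level $\alpha$ exactly when the returned p-value is below $\alpha$, which matches the confidence-interval formulation of the test.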

In case of Inline graphic , there are two possibilities. If , then . Hence, the null hypothesis is always rejected. This is clear, because if is the true parameter, it is impossible to observe data with (see Remark 7 in Analysis, subsection 6). However, if , then and , and the null hypothesis is always accepted.

Therefore, in the case of Inline graphic the test can still be formally performed in a meaningful way. However, note that the asymptotic (23) no longer holds true, as does not lie in the interior of the parameter space.

3.2 The score test

In the following, for any parameter choice Inline graphic , let by the corresponding profile-likelihood estimate, i.e., , where is the dimensional simplex. By using a dummy variable as before, is obtained from . The Fisher information can be written as

where Inline graphic is obtained from the Fisher information with the first row and column deleted. The definitions of the remaining sub-matrices follow accordingly.

A test for the null hypothesis Inline graphic vs. the alternative is obtained by using the fact that

(30)

(cf. Remark 6 in subsection 5.4 of Analysis). The function

(31)

serves as test statistic, where the data is Inline graphic . The test rejects at the -level if The corresponding p-value is .

Note that it is legitimate to write Inline graphic on the left-hand side of (30) because . However, it is nevertheless important to derive the asymptotic variance from .

Alternatively, the expected Fisher information Inline graphic in (30) and (31) can be replaced by the observed Fisher information . However, if is not the ML estimate, . As proven in Analysis (subsection 5.4), one obtains for the CPD:

Result 6 Consider the score test for the null hypothesis Inline graphic vs. the alternative under the assumptions of Result 5. The test statistic based on the observed Fisher information is

(32)

and that based on the expected Fisher information is

(33)

The p-values are Inline graphic in either case. The frequencies are derived as specified in Result 5.

Of note, instead of (30) the ML estimate can be used as a plug-in estimate for the asymptotic variance, i.e., Inline graphic . In this case, it is not necessary to distinguish between the expected and observed Fisher information as they coincide (cf. section “Asymptotic confidence intervals”).

In summary one obtains:

Remark 5 Under the assumptions of Result 6, a test statistic for the null hypothesis Inline graphic vs. the alternative is

(34)

where Inline graphic and are sample size and number of alleles, in the data yielding the estimate .

The proof is analogous to that of Result 6.

The test cannot be applied in the special cases Inline graphic or for all , as the asymptotic (30) no longer holds true (cf. subsection 6 of Analysis).

3.3 The Wald test

A third test for the null hypothesis Inline graphic is an adaptation of the Wald test for the profile likelihood. It is based on the same asymptotic properties that we used to derive confidence intervals namely . This is exactly the same as the asymptotic as .

This implies Inline graphic or . Hence, the test statistic

can be used. The p-value is Inline graphic .

Now, we shall consider again the CPD. An explicit expression for Inline graphic is given by (54). Hence, we obtain:

Result 7 Under the assumptions of Result 5, the Wald test for the null hypothesis Inline graphic vs. the alternative has the test statistic

(35)

based on the (expected or observed) Fisher information.

The p-values are Inline graphic in either case. Here, and the frequencies are derived as specified in Result 1.

Alternatively, if the profile-likelihood estimate based on Inline graphic is used as a plug-in for the asymptotic variance, one can employ or .

In the first case, using (53) implies that the test statistic changes to

(36)

In the second case, (54) implies that the test statistic changes to

graphic file with name pone.0097899.e373.jpg

(37)

Also the Wald test cannot be applied in the special cases Inline graphic or for all , as the asymptotic for no longer holds true (cf. subsection 6 of Analysis).

4 Testing the method

Although, as we have seen, most of the theory holds quite generally, assuming a CPD for the number of co-infections permits the derivation of explicit results or, at least, reduces the complexity significantly. However, assuming a CPD might not be justified. Therefore, it is desirable to have a test for the model's fit. Namely, let

be the likelihood assuming a perfect fit to the data, in which the expected frequencies of infection with strain configuration Inline graphic equal their observed frequencies. In other words, is the maximum likelihood of the saturated model. As there are possible allelic configurations infecting a host, has degrees of freedom. The maximum likelihood of the reduced model (assuming the CPD) has independent allele frequencies and one Poisson parameter. Therefore,

(38)

Hence, the following test can be used.

Result 8 To test Inline graphic : “the conditional Poisson distribution is justified” vs. : “the conditional Poisson distribution is not justified”, the test-statistic can be used. The p-value is given by .

It should be mentioned that the above test might perform poorly if the number of lineages or alleles Inline graphic is large. The reason is that the distribution has too many degrees of freedom. This might be the case when using hyper-mutable microsatellite markers with 10 or more alleles found across samples.
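The fit test of Result 8 can be sketched as follows, under stated assumptions. The saturated log-likelihood is computed from counts of observed allelic configurations; the maximized log-likelihood of the CPD model and the degrees of freedom are treated as user-supplied inputs (here hypothetical numbers), since this sketch does not fit the CPD model itself. The chi-square tail probability is evaluated with the standard library only, using the recurrence for the regularized incomplete gamma function at integer and half-integer arguments.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper tail of the chi-square distribution for integer df >= 1.

    Uses Q(a+1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a+1) with y = x/2,
    starting from Q(1/2, y) = erfc(sqrt(y)) or Q(1, y) = exp(-y).
    """
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)

def saturated_loglik(config_counts: dict) -> float:
    """Log-likelihood when every configuration's probability equals its observed frequency."""
    n = sum(config_counts.values())
    return sum(c * math.log(c / n) for c in config_counts.values() if c > 0)

def fit_test(loglik_saturated: float, loglik_cpd: float, df: int):
    """Likelihood-ratio statistic 2(L_N - L_1) and its chi-square p-value."""
    statistic = 2.0 * (loglik_saturated - loglik_cpd)
    return statistic, chi2_sf(statistic, df)

# Hypothetical counts of hosts per observed allele configuration (three alleles A, B, C).
counts = {("A",): 40, ("B",): 25, ("C",): 15, ("A", "B"): 10, ("A", "C"): 6, ("B", "C"): 4}
l_sat = saturated_loglik(counts)
l_cpd = l_sat - 1.2   # hypothetical maximized CPD log-likelihood
stat, p = fit_test(l_sat, l_cpd, df=3)  # df is a hypothetical, user-supplied input
print(f"2(L_N - L_1) = {stat:.3f}, p = {p:.4f}")
```

With many alleles, most configurations are never observed, which is one way to see why the test can lose power when the number of alleles is large.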

Application to data

As an illustration, the methods are applied to three previously described data sets [25]–[27]. Each comprises molecular data from P. falciparum-infected blood samples from endemic areas with different levels of malaria incidence. For each blood sample, parasite DNA was extracted and several microsatellite markers were assayed.

1 Preliminary remarks

It is important to note beforehand that only (selectively) neutral markers should be included in the analysis. Namely, loci linked to others that are targets of selection (e.g., mdr1, crt, dhfr, dhps in P. falciparum, which are associated with selection for drug resistance) will have skewed allele-frequency distributions. Hence, using these markers might lead to artifacts and severe misinferences. In practice, a marker located on a chromosome not carrying a strongly selected gene (e.g., a resistance-conferring gene) can be regarded as neutral. Moreover, when clinical samples from different groups are compared, confounding effects such as differences in treatment policies, control interventions, and changing transmission intensities need to be considered (e.g., a group should not contain samples from two time points between which treatment policies changed). If such effects are not considered, the estimates of MOI would be inappropriate. For these reasons, we only used parts of the available data sets.

2 Data description

The first data set emerged from a longitudinal study conducted in Asembo Bay, a hyper-endemic region in Kenya, and was described in [27]. We included five (neutral) microsatellites on chromosome 2 and four (neutral) markers on chromosome 3. Additionally, we included two markers on chromosome 8, quite close to dhfr, which are common to all three data sets and meet Assumption 1. Only blood samples collected in the first study year (mid 1993 to mid 1994) were included, resulting in 42 blood samples.

The second data set described in [26] is from a study from Yaoundé, Cameroon, a region of intermediate/high transmission. Besides the two markers on chromosome 8 mentioned above, we included all eight available (neutral) microsatellite markers on chromosomes 2 and 3 from all 331 blood samples (data of one of the 332 original samples was unavailable).

The third data set is from Bolivar State, Venezuela, a region of low transmission. It was described in [25] and consists of 97 blood samples. Due to the low transmission intensities, for most markers each blood sample contains only one allele, violating Assumption 1. We included all markers that met Assumption 1 as well as all available neutral markers. Particularly, we included four on chromosome 2 and three on chromosome 3, two markers on chromosome 8 and one on chromosome 4, which are sufficiently distant from dhps and dhfr, respectively, to be considered neutral, and the two markers on chromosome 4, which were also included in the other data sets. All 97 blood samples were used.

3 Results

The results are summarized in Figures 1 and 2 and Tables 1–3. In all cases, the test for the model fit (cf. Result 8) justified the assumption of the CPD (cf. Tables 1–3). This is important because the three locations exhibit different transmission intensities. In all three regions, the ML estimates Inline graphic or rather the mean MOI, , obtained from different marker loci are fairly consistent. As expected, most variation in the estimates is observed in Kenya because of the low sample size. Moreover, the transmission intensities are stronger, which leads to more variation in allele-frequency spectra among marker loci, resulting in more variation among the ML estimates.

Averages are the arithmetic means of the ML estimates ± 2 standard deviations, derived from the microsatellite loci that are common to all data sets, including (blue) and excluding (green) locus L1, which appears to be hyper-mutable in Kenya and Cameroon.

Table 1. Estimates for each locus of the data set from Kenya.

locus	lower bound	ML estimate	upper bound	2(L_N–L₁)	d.f.
U7	1.00194	1.03409	1.15244	6.40471	9
	0.968395		1.10265
L5	1.21506	1.38975	1.64696	67.8528	15
	1.18622		1.61235
J3	1.16387	1.32208	1.56625	44.993	16
	1.1331		1.52876
J6	1.13457	1.27344	1.49108	58.3296	15
	1.10558		1.45595
U6	1.15044	1.29506	1.51735	65.1444	14
	1.12211		1.48319
L4	1.18509	1.34319	1.57899	89.2578	18
	1.15735		1.54568
U5	1.16453	1.31318	1.53811	76.1215	20
	1.13692		1.50489
K6	1.31334	1.51443	1.7943	134.024	26
	1.28687		1.76291
L1	1.3654	1.59303	1.90742	87.4142	16
	1.33699		1.87367
c4	1.15248	1.30977	1.55585	15.9715	7
	1.12049		1.51705
b3	1.06529	1.16656	1.34475	34.7327	16
	1.03537		1.30777

Each row shows the locus name, the lower profile-likelihood (top) and asymptotic (bottom) confidence bounds, the ML estimate, and the upper profile-likelihood (top) and asymptotic (bottom) confidence bounds. For the confidence bounds, α = 0.05 was assumed. Moreover, the test statistic for the fit of the CPD, 2(L_N–L₁), is shown together with the corresponding degrees of freedom. In all cases, the outcomes are not significant, suggesting that the assumption of the CPD is justified.

Table 3. See description of Table 1 but for the Venezuela data set.

locus	lower bound	ML estimate	upper bound	2(L_N–L₁)	d.f.
J3	N/A	1	N/A	N/A	N/A
	N/A		N/A
J6	N/A	1	N/A	N/A	N/A
	N/A		N/A
U6	N/A	1	N/A	N/A	N/A
	N/A		N/A
L4	N/A	1	N/A	N/A	N/A
	N/A		N/A
U5	0.974273	1.02745	1.08251	8.32780	8
	1.00156		1.12327
K6	0.971082	1.03104	1.09339	8.06610	3
	1.00176		1.13908
L1	0.984526	1.04242	1.10251	0.00000	2
	1.00703		1.13188
c4	0.973367	1.02863	1.08592	9.79400	3
	1.00163		1.1273
b3	0.99278	1.06223	1.13479	3.66900	4
	1.01538		1.16345
fr13	0.981231	1.05152	1.12504	0.20579	3
	1.00852		1.16137
ps6	0.98346	1.04538	1.1098	0.00000	2
	1.00752		1.14139
ps7	0.978848	1.02256	1.06754	1.01430	4
	1.00128		1.10032

N/A indicates that the method is not applicable (cf. Analysis, section 6).

Table 2. See description of Table 1 but for the Cameroon data set.

locus	lower bound	ML estimate	upper bound	2(L_N–L₁)	d.f.
L5	1.12239	1.17804	1.23538	165.239	27
	1.12754		1.24098
J3	1.11596	1.17407	1.23404	105.218	26
	1.12171		1.24032
J6	1.15263	1.21385	1.27704	178.18	25
	1.15774		1.28258
U6	1.17975	1.23815	1.29829	270.763	32
	1.18389		1.30274
L4	1.17469	1.24032	1.30817	222.664	29
	1.17986		1.31378
U5	1.18476	1.25169	1.32089	195.916	24
	1.18987		1.32643
K6	1.18436	1.24819	1.31408	294.437	40
	1.18908		1.31919
L1	1.28997	1.36794	1.44861	332.781	40
	1.29451		1.45349
c4	1.08125	1.20363	1.33427	0.958866	9
	1.10312		1.36155
b3	1.1223	1.18418	1.24816	75.4321	27
	1.12849		1.255

From Figure 1 it is apparent that the estimates for MOI are highest in Kenya, followed by Cameroon, whereas they are very low in Venezuela. This is summarized in Figure 2 showing that the average ML estimates across the regions differ by several standard deviations.
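The per-region summary (arithmetic mean of the per-locus ML estimates ± 2 standard deviations, with and without locus L1) can be sketched as follows. The inputs are the point estimates from the middle column of Table 2 (Cameroon) for the nine loci that occur in all three tables. Two assumptions are made here: that these tabulated values are the quantities being averaged, and that the sample standard deviation (n − 1 denominator) is used.

```python
import statistics

# ML estimates from Table 2 (Cameroon), restricted to loci present in Tables 1-3.
cameroon = {
    "J3": 1.17407, "J6": 1.21385, "U6": 1.23815, "L4": 1.24032, "U5": 1.25169,
    "K6": 1.24819, "L1": 1.36794, "c4": 1.20363, "b3": 1.18418,
}

def mean_pm_2sd(values):
    """Return (mean, mean - 2*sd, mean + 2*sd) using the sample standard deviation."""
    m = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: n - 1 denominator
    return m, m - 2.0 * sd, m + 2.0 * sd

with_l1 = mean_pm_2sd(list(cameroon.values()))
without_l1 = mean_pm_2sd([v for locus, v in cameroon.items() if locus != "L1"])

for label, (m, lo, hi) in (("including L1", with_l1), ("excluding L1", without_l1)):
    print(f"{label}: mean = {m:.4f}, range = [{lo:.4f}, {hi:.4f}]")
```

Dropping L1 lowers the mean and narrows the band, which is the reason for reporting both variants.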

The 95% profile-likelihood CIs for Inline graphic , given by , are reasonably large for the data sets from Cameroon and Venezuela (cf. Figure 1). However, due to the relatively small sample size, they are much less informative for the Kenya dataset.

The asymptotic confidence intervals agree well with the profile-likelihood CIs (cf. Figure 1 and Tables 1–3). This is particularly true for Cameroon, as expected because of the large sample size. The profile-likelihood CIs from the Kenya and Venezuela data are asymmetric, while the asymptotic CIs are, by definition, symmetric (however, the transformation Inline graphic results in some asymmetry). (Note that, unlike profile-likelihood-based intervals, asymptotic CIs are not transformation respecting, i.e., is the transformed CI of , not the CI of .) In relative terms, this is more pronounced in the Venezuela than in the Kenya data set. The reason is that the ML estimates Inline graphic from the Venezuela data are close to zero, i.e., the boundary of the parameter range. This results in a very skewed likelihood function, yielding quite asymmetric profile-likelihood CIs. In contrast, in Kenya the ML estimates are rather large, and the likelihood function tends to be symmetric around its maximum.
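To illustrate why a symmetric interval for the Poisson parameter becomes asymmetric after transformation, the following sketch assumes that the transformation in question is the mean of the zero-truncated (conditional) Poisson distribution, λ/(1 − e^(−λ)). It compares the transformed endpoints with a delta-method interval that is symmetric around the transformed estimate. The numerical inputs are hypothetical.

```python
import math

def mean_moi(lam: float) -> float:
    """Mean of a zero-truncated Poisson distribution with parameter lam > 0."""
    return lam / -math.expm1(-lam)

def mean_moi_slope(lam: float) -> float:
    """Derivative of mean_moi with respect to lam (used by the delta method)."""
    q = -math.expm1(-lam)  # 1 - exp(-lam)
    return (q - lam * math.exp(-lam)) / q ** 2

def compare_intervals(lam_hat: float, half_width: float):
    """Return (transformed interval, delta-method interval) for the mean MOI.

    Requires lam_hat - half_width > 0.
    """
    transformed = (mean_moi(lam_hat - half_width), mean_moi(lam_hat + half_width))
    centre = mean_moi(lam_hat)
    delta_half = mean_moi_slope(lam_hat) * half_width
    return transformed, (centre - delta_half, centre + delta_half)

# Hypothetical estimate and symmetric half-width on the lambda scale.
transformed, delta = compare_intervals(lam_hat=0.5, half_width=0.3)
centre = mean_moi(0.5)
print("transformed endpoints:", transformed)
print("delta-method interval:", delta)
```

Because the transformation is convex, the upper arm of the transformed interval is longer than the lower arm, whereas the delta-method interval stays symmetric; the two constructions therefore do not coincide.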

Furthermore, we tested for pairwise differences between the estimates based on different marker loci. Tables 4–6 report the p-values for the likelihood-ratio, the Score, and the Wald test for the three regions. In all data sets, all tests perform equally well. There are some discrepancies, mainly due to the above-mentioned skewness of the likelihood function. If the likelihood function is skewed, the likelihood-ratio test is preferable because it accounts for the skewness.

Table 4. Pairwise tests of ML estimates obtained from the Kenya data set.

locus	U7	L5	J3	J6	U6	L4	U5	K6	L1	c4	b3
U7	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0023
	1.0000	0.0006	0.0030	0.0056	0.0033	0.0012	0.0020	0.0000	0.0000	0.0048	0.0514
	1.0000	0.0004	0.0021	0.0043	0.0024	0.0007	0.0014	0.0000	0.0000	0.0034	0.0477
L5	0.0001	1.0000	0.5316	0.2527	0.3527	0.6536	0.4511	0.2585	0.0866	0.4667	0.0194
	0.0000	1.0000	0.5097	0.2074	0.3159	0.6421	0.4236	0.2946	0.1258	0.4386	0.0032
	0.0000	1.0000	0.5092	0.2053	0.3145	0.6419	0.4228	0.2934	0.1237	0.4376	0.0024
J3	0.0006	0.5062	1.0000	0.6079	0.7762	0.8277	0.9252	0.0632	0.0151	0.9045	0.0790
	0.0000	0.5284	1.0000	0.5910	0.7708	0.8306	0.9246	0.1023	0.0396	0.9034	0.0342
	0.0000	0.5279	1.0000	0.5907	0.7707	0.8306	0.9246	0.1002	0.0375	0.9034	0.0315
J6	0.0021	0.2273	0.6101	1.0000	0.8096	0.4465	0.6574	0.0141	0.0026	0.7079	0.1987
	0.0000	0.2744	0.6265	1.0000	0.8136	0.4753	0.6696	0.0394	0.0146	0.7176	0.1378
	0.0000	0.2727	0.6262	1.0000	0.8136	0.4747	0.6694	0.0374	0.0131	0.7174	0.1347
U6	0.0012	0.3379	0.7826	0.8139	1.0000	0.6088	0.8439	0.0292	0.0060	0.8825	0.1333
	0.0000	0.3753	0.7878	0.8100	1.0000	0.6238	0.8464	0.0615	0.0232	0.8841	0.0769
	0.0000	0.3742	0.7878	0.8099	1.0000	0.6236	0.8464	0.0593	0.0214	0.8841	0.0736
L4	0.0003	0.6545	0.8381	0.4722	0.6207	1.0000	0.7571	0.1053	0.0281	0.7502	0.0516
	0.0000	0.6657	0.8353	0.4436	0.6058	1.0000	0.7510	0.1471	0.0583	0.7432	0.0172
	0.0000	0.6655	0.8353	0.4428	0.6055	1.0000	0.7510	0.1451	0.0561	0.7432	0.0151
U5	0.0007	0.4476	0.9290	0.6720	0.8474	0.7546	1.0000	0.0498	0.0113	0.9732	0.0941
	0.0000	0.4749	0.9296	0.6600	0.8448	0.7606	1.0000	0.0870	0.0333	0.9731	0.0451
	0.0000	0.4742	0.9296	0.6598	0.8448	0.7605	1.0000	0.0848	0.0313	0.9731	0.0421
K6	0.0000	0.3001	0.1091	0.0331	0.0526	0.1364	0.0744	1.0000	0.5466	0.0931	0.0011
	0.0000	0.2651	0.0698	0.0120	0.0250	0.0977	0.0421	1.0000	0.5613	0.0557	0.0000
	0.0000	0.2636	0.0674	0.0105	0.0231	0.0954	0.0400	1.0000	0.5610	0.0528	0.0000
L1	0.0000	0.1093	0.0327	0.0075	0.0127	0.0396	0.0189	0.5391	1.0000	0.0278	0.0002
	0.0000	0.0747	0.0125	0.0012	0.0030	0.0180	0.0057	0.5241	1.0000	0.0097	0.0000
	0.0000	0.0725	0.0111	0.0008	0.0024	0.0165	0.0049	0.5237	1.0000	0.0083	0.0000
c4	0.0008	0.4259	0.9016	0.6976	0.8754	0.7267	0.9709	0.0453	0.0101	1.0000	0.1006
	0.0000	0.4551	0.9027	0.6873	0.8737	0.7342	0.9710	0.0816	0.0312	1.0000	0.0500
	0.0000	0.4544	0.9027	0.6871	0.8736	0.7341	0.9710	0.0794	0.0292	1.0000	0.0470
b3	0.0346	0.0067	0.0552	0.1581	0.0916	0.0237	0.0541	0.0000	0.0000	0.0824	1.0000
	0.0005	0.0333	0.1130	0.2219	0.1529	0.0659	0.1086	0.0026	0.0010	0.1464	1.0000
	0.0002	0.0308	0.1099	0.2196	0.1501	0.0631	0.1057	0.0021	0.0007	0.1429	1.0000

The ML estimate obtained from the locus specified in the rows (H ₀) is tested against the estimates from the loci specified in the columns (H_A). In each cell, the p-values for the likelihood-ratio (top), Score (middle), and Wald test (bottom) are shown. The Score and Wald tests are the versions given by eqs. (32) and (35), respectively. Significant differences are indicated in bold.

Table 6. See description of Table 4 but for the Venezuela data set.

locus	L4	U5	K6	L1	c4	b3	fr13	ps6	ps7
L4	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
U5	N/A	1.0000	0.9047	0.5670	0.9669	0.2143	0.4224	0.5133	0.8397
	N/A	1.0000	0.9084	0.6177	0.9673	0.3334	0.5093	0.5765	0.8291
	N/A	1.0000	0.9084	0.6173	0.9673	0.3317	0.5085	0.5759	0.8291
K6	N/A	0.9008	1.0000	0.6754	0.9349	0.2827	0.5108	0.6147	0.7371
	N/A	0.8968	1.0000	0.7045	0.9331	0.3859	0.5747	0.6553	0.7088
	N/A	0.8967	1.0000	0.7043	0.9331	0.3846	0.5741	0.6550	0.7086
L1	N/A	0.6415	0.7433	1.0000	0.6751	0.5352	0.7913	0.9253	0.4825
	N/A	0.5898	0.7164	1.0000	0.6329	0.5825	0.8035	0.9269	0.3846
	N/A	0.5895	0.7162	1.0000	0.6325	0.5821	0.8035	0.9268	0.3831
c4	N/A	0.9665	0.9367	0.6027	1.0000	0.2359	0.4512	0.5465	0.8047
	N/A	0.9660	0.9384	0.6457	1.0000	0.3501	0.5303	0.6018	0.7890
	N/A	0.9660	0.9384	0.6453	1.0000	0.3485	0.5296	0.6014	0.7889
b3	N/A	0.3485	0.4354	0.5643	0.3751	1.0000	0.7844	0.6391	0.2254
	N/A	0.2148	0.3242	0.5138	0.2505	1.0000	0.7711	0.6033	0.0871
	N/A	0.2128	0.3221	0.5129	0.2468	1.0000	0.7710	0.6028	0.0833
fr13	N/A	0.4858	0.5827	0.7771	0.5167	0.7529	1.0000	0.8551	0.3411
	N/A	0.3880	0.5150	0.7631	0.4303	0.7668	1.0000	0.8491	0.2077
	N/A	0.3869	0.5142	0.7630	0.4286	0.7667	1.0000	0.8490	0.2047
ps6	N/A	0.5865	0.6872	0.9235	0.6193	0.6056	0.8612	1.0000	0.4314
	N/A	0.5191	0.6477	0.9218	0.5626	0.6402	0.8667	1.0000	0.3188
	N/A	0.5186	0.6473	0.9218	0.5618	0.6399	0.8667	1.0000	0.3168
ps7	N/A	0.8500	0.7629	0.4201	0.8192	0.1342	0.3064	0.3773	1.0000
	N/A	0.8592	0.7853	0.5076	0.8323	0.2697	0.4269	0.4769	1.0000
	N/A	0.8592	0.7852	0.5066	0.8323	0.2675	0.4256	0.4758	1.0000

N/A indicates that the test is not applicable (cf. Analysis, section 6). Results for loci J3, J6, and U6 are not shown because the tests are also not applicable (as for locus L4).

Table 5. See description of Table 4 but for the Cameroon data set.

locus	L5	J3	J6	U6	L4	U5	K6	L1	c4	b3
L5	1.0000	0.8960	0.2296	0.0282	0.0428	0.0171	0.0176	0.0000	0.6775	0.8465
	1.0000	0.8953	0.2552	0.0439	0.0637	0.0312	0.0314	0.0000	0.6899	0.8480
	1.0000	0.8953	0.2548	0.0436	0.0631	0.0307	0.0310	0.0000	0.6899	0.8480
J3	0.8896	1.0000	0.1787	0.0184	0.0299	0.0113	0.0115	0.0000	0.6282	0.7480
	0.8904	1.0000	0.2058	0.0316	0.0483	0.0230	0.0229	0.0000	0.6445	0.7522
	0.8904	1.0000	0.2054	0.0313	0.0478	0.0226	0.0225	0.0000	0.6445	0.7522
J6	0.2442	0.2195	1.0000	0.4042	0.4182	0.2490	0.2746	0.0000	0.8764	0.3811
	0.2189	0.1918	1.0000	0.4189	0.4340	0.2717	0.2956	0.0001	0.8746	0.3595
	0.2185	0.1913	1.0000	0.4188	0.4338	0.2713	0.2953	0.0001	0.8746	0.3593
U6	0.0601	0.0572	0.4610	1.0000	0.9489	0.6909	0.7583	0.0002	0.6138	0.1255
	0.0406	0.0370	0.4468	1.0000	0.9490	0.6956	0.7611	0.0010	0.5961	0.0979
	0.0401	0.0364	0.4467	1.0000	0.9490	0.6956	0.7611	0.0010	0.5961	0.0974
L4	0.0522	0.0500	0.4234	0.9428	1.0000	0.7394	0.8101	0.0003	0.5929	0.1122
	0.0340	0.0311	0.4075	0.9427	1.0000	0.7427	0.8119	0.0012	0.5734	0.0853
	0.0335	0.0306	0.4073	0.9427	1.0000	0.7427	0.8119	0.0012	0.5733	0.0848
U5	0.0240	0.0240	0.2604	0.6605	0.7426	1.0000	0.9160	0.0011	0.4914	0.0604
	0.0125	0.0119	0.2379	0.6554	0.7392	1.0000	0.9157	0.0033	0.4621	0.0392
	0.0122	0.0116	0.2376	0.6553	0.7392	1.0000	0.9157	0.0032	0.4620	0.0388
K6	0.0307	0.0303	0.3047	0.7436	0.8194	0.9192	1.0000	0.0007	0.5213	0.0736
	0.0172	0.0162	0.2838	0.7406	0.8178	0.9195	1.0000	0.0025	0.4950	0.0503
	0.0169	0.0158	0.2835	0.7406	0.8178	0.9195	1.0000	0.0024	0.4950	0.0499
L1	0.0000	0.0000	0.0001	0.0002	0.0013	0.0035	0.0017	1.0000	0.0429	0.0000
	0.0000	0.0000	0.0000	0.0000	0.0003	0.0012	0.0005	1.0000	0.0145	0.0000
	0.0000	0.0000	0.0000	0.0000	0.0003	0.0011	0.0004	1.0000	0.0144	0.0000
c4	0.3971	0.3531	0.7433	0.2282	0.2539	0.1366	0.1495	0.0000	1.0000	0.5589
	0.3781	0.3306	0.7469	0.2498	0.2771	0.1618	0.1738	0.0000	1.0000	0.5469
	0.3779	0.3303	0.7468	0.2495	0.2768	0.1613	0.1734	0.0000	1.0000	0.5468
b3	0.8332	0.7420	0.3252	0.0514	0.0710	0.0306	0.0323	0.0000	0.7548	1.0000
	0.8315	0.7378	0.3465	0.0708	0.0950	0.0485	0.0498	0.0000	0.7621	1.0000
	0.8315	0.7378	0.3463	0.0704	0.0945	0.0480	0.0494	0.0000	0.7621	1.0000

Tables 7–9 compare the three versions of the Score test, while Tables 10–12 compare those of the Wald test. The results are fairly consistent. However, the version of the Score test given by eq. 34 and the versions of the Wald test given by eqs. 37 and 36 tend to be most inconsistent with the other tests, especially the likelihood-ratio test. The reason is that these use the roughest approximations.

Table 7. The same as Table 4 but with the p-values of the three versions of the Score test, according to eqs. 33 (top), 32 (middle), and 34 (bottom).

locus	U7	L5	J3	J6	U6	L4	U5	K6	L1	c4	b3
U7	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
	1.0000	0.0006	0.0030	0.0056	0.0033	0.0012	0.0020	0.0000	0.0000	0.0048	0.0514
	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
L5	0.0021	1.0000	0.5434	0.2794	0.3740	0.6603	0.4671	0.2351	0.0656	0.4817	0.0384
	0.0000	1.0000	0.5097	0.2074	0.3159	0.6421	0.4236	0.2946	0.1258	0.4386	0.0032
	0.3113	1.0000	0.5732	0.3465	0.4250	0.6755	0.5040	0.1888	0.0316	0.5206	0.1439
J3	0.0063	0.4923	1.0000	0.6173	0.7793	0.8258	0.9255	0.0429	0.0068	0.9050	0.1112
	0.0000	0.5284	1.0000	0.5910	0.7708	0.8306	0.9246	0.1023	0.0396	0.9034	0.0342
	0.3298	0.4597	1.0000	0.6398	0.7867	0.8215	0.9264	0.0159	0.0007	0.9065	0.2261
J6	0.0138	0.1988	0.6004	1.0000	0.8072	0.4283	0.6498	0.0057	0.0005	0.7023	0.2335
	0.0000	0.2744	0.6265	1.0000	0.8136	0.4753	0.6696	0.0394	0.0146	0.7176	0.1378
	0.3467	0.1398	0.5755	1.0000	0.8014	0.3862	0.6317	0.0004	0.0000	0.6876	0.3336
U6	0.0097	0.3146	0.7796	0.8162	1.0000	0.5994	0.8423	0.0156	0.0018	0.8816	0.1685
	0.0000	0.3753	0.7878	0.8100	1.0000	0.6238	0.8464	0.0615	0.0232	0.8841	0.0769
	0.3387	0.2623	0.7718	0.8217	1.0000	0.5772	0.8386	0.0029	0.0000	0.8793	0.2785
L4	0.0045	0.6475	0.8397	0.4882	0.6292	1.0000	0.7607	0.0812	0.0157	0.7539	0.0801
	0.0000	0.6657	0.8353	0.4436	0.6058	1.0000	0.7510	0.1471	0.0583	0.7432	0.0172
	0.3235	0.6311	0.8437	0.5265	0.6492	1.0000	0.7688	0.0428	0.0030	0.7635	0.1944
U5	0.0073	0.4304	0.9287	0.6788	0.8488	0.7508	1.0000	0.0315	0.0045	0.9733	0.1276
	0.0000	0.4749	0.9296	0.6600	0.8448	0.7606	1.0000	0.0870	0.0333	0.9731	0.0451
	0.3326	0.3907	0.9279	0.6949	0.8523	0.7421	1.0000	0.0096	0.0003	0.9734	0.2417
K6	0.0003	0.3211	0.1338	0.0524	0.0747	0.1621	0.0985	1.0000	0.5373	0.1174	0.0052
	0.0000	0.2651	0.0698	0.0120	0.0250	0.0977	0.0421	1.0000	0.5613	0.0557	0.0000
	0.2855	0.3706	0.2116	0.1276	0.1488	0.2293	0.1703	1.0000	0.5161	0.1981	0.0752
L1	0.0001	0.1329	0.0494	0.0169	0.0245	0.0583	0.0330	0.5486	1.0000	0.0435	0.0015
	0.0000	0.0747	0.0125	0.0012	0.0030	0.0180	0.0057	0.5241	1.0000	0.0097	0.0000
	0.2725	0.1973	0.1205	0.0744	0.0833	0.1215	0.0923	0.5682	1.0000	0.1153	0.0541
c4	0.0077	0.4075	0.9010	0.7034	0.8764	0.7221	0.9709	0.0279	0.0039	1.0000	0.1345
	0.0000	0.4551	0.9027	0.6873	0.8737	0.7342	0.9710	0.0816	0.0312	1.0000	0.0500
	0.3337	0.3652	0.8994	0.7172	0.8787	0.7112	0.9707	0.0078	0.0002	1.0000	0.2480
b3	0.0808	0.0015	0.0309	0.1226	0.0610	0.0095	0.0303	0.0000	0.0000	0.0527	1.0000
	0.0005	0.0333	0.1130	0.2219	0.1529	0.0659	0.1086	0.0026	0.0010	0.1464	1.0000
	0.4077	0.0000	0.0044	0.0572	0.0174	0.0005	0.0049	0.0000	0.0000	0.0119	1.0000

Table 9. See the descriptions of Table 7 and Table 6, but for the Venezuela data set.

locus	L4	U5	K6	L1	c4	b3	fr13	ps6	ps7
L4	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
U5	N/A	1.0000	0.9028	0.5369	0.9666	0.1491	0.3702	0.4754	0.8446
	N/A	1.0000	0.9084	0.6177	0.9673	0.3334	0.5093	0.5765	0.8291
	N/A	1.0000	0.8967	0.4454	0.9659	0.0316	0.2222	0.3617	0.8587
K6	N/A	0.9027	1.0000	0.6587	0.9357	0.2236	0.4731	0.5912	0.7492
	N/A	0.8968	1.0000	0.7045	0.9331	0.3859	0.5747	0.6553	0.7088
	N/A	0.9085	1.0000	0.6072	0.9382	0.0878	0.3576	0.5181	0.7847
L1	N/A	0.6621	0.7546	1.0000	0.6927	0.5082	0.7847	0.9244	0.5216
	N/A	0.5898	0.7164	1.0000	0.6329	0.5825	0.8035	0.9269	0.3846
	N/A	0.7249	0.7888	1.0000	0.7443	0.4257	0.7639	0.9219	0.6381
c4	N/A	0.9667	0.9359	0.5774	1.0000	0.1722	0.4038	0.5137	0.8117
	N/A	0.9660	0.9384	0.6457	1.0000	0.3501	0.5303	0.6018	0.7890
	N/A	0.9674	0.9333	0.4999	1.0000	0.0464	0.2653	0.4136	0.8322
b3	N/A	0.4003	0.4779	0.5861	0.4258	1.0000	0.7902	0.6546	0.2899
	N/A	0.2148	0.3242	0.5138	0.2505	1.0000	0.7711	0.6033	0.0871
	N/A	0.5756	0.6147	0.6506	0.5849	1.0000	0.8083	0.7009	0.5189
fr13	N/A	0.5231	0.6091	0.7836	0.5514	0.7454	1.0000	0.8579	0.3961
	N/A	0.3880	0.5150	0.7631	0.4303	0.7668	1.0000	0.8491	0.2077
	N/A	0.6406	0.6907	0.8025	0.6546	0.7220	1.0000	0.8663	0.5710
ps6	N/A	0.6127	0.7032	0.9243	0.6426	0.5862	0.8584	1.0000	0.4764
	N/A	0.5191	0.6477	0.9218	0.5626	0.6402	0.8667	1.0000	0.3188
	N/A	0.6934	0.7522	0.9267	0.7109	0.5259	0.8494	1.0000	0.6131
ps7	N/A	0.8452	0.7502	0.3663	0.8119	0.0695	0.2346	0.3165	1.0000
	N/A	0.8592	0.7853	0.5076	0.8323	0.2697	0.4269	0.4769	1.0000
	N/A	0.8295	0.7092	0.2190	0.7891	0.0029	0.0744	0.1586	1.0000

Table 10. The same as Table 4 but with the p-values of the three versions of the Wald test, according to eqs. 35 (top), 37 (middle), and 36 (bottom).

locus	U7	L5	J3	J6	U6	L4	U5	K6	L1	c4	b3
U7	1.0000	0.0004	0.0021	0.0043	0.0024	0.0007	0.0014	0.0000	0.0000	0.0034	0.0477
	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000
L5	0.0000	1.0000	0.5092	0.2053	0.3145	0.6419	0.4228	0.2934	0.1237	0.4376	0.0024
	0.0019	1.0000	0.5406	0.2703	0.3667	0.6579	0.4613	0.2448	0.0723	0.4784	0.0319
	0.1722	1.0000	0.5728	0.3442	0.4236	0.6753	0.5032	0.1877	0.0306	0.5197	0.1322
J3	0.0000	0.5279	1.0000	0.5907	0.7707	0.8306	0.9246	0.1002	0.0375	0.9034	0.0315
	0.0057	0.4969	1.0000	0.6144	0.7784	0.8264	0.9254	0.0503	0.0089	0.9049	0.1020
	0.2203	0.4592	1.0000	0.6395	0.7866	0.8215	0.9264	0.0152	0.0006	0.9065	0.2188
J6	0.0000	0.2727	0.6262	1.0000	0.8136	0.4747	0.6694	0.0374	0.0131	0.7174	0.1347
	0.0126	0.2075	0.6026	1.0000	0.8079	0.4341	0.6522	0.0082	0.0009	0.7034	0.2247
	0.2609	0.1384	0.5752	1.0000	0.8013	0.3855	0.6315	0.0004	0.0000	0.6875	0.3298
U6	0.0000	0.3742	0.7878	0.8099	1.0000	0.6236	0.8464	0.0593	0.0214	0.8841	0.0736
	0.0088	0.3220	0.7803	0.8155	1.0000	0.6025	0.8428	0.0201	0.0027	0.8818	0.1591
	0.2422	0.2612	0.7718	0.8216	1.0000	0.5769	0.8385	0.0027	0.0000	0.8793	0.2731
L4	0.0000	0.6655	0.8353	0.4428	0.6055	1.0000	0.7510	0.1451	0.0561	0.7432	0.0151
	0.0040	0.6499	0.8393	0.4831	0.6265	1.0000	0.7594	0.0905	0.0192	0.7531	0.0715
	0.2043	0.6309	0.8437	0.5257	0.6489	1.0000	0.7688	0.0418	0.0028	0.7634	0.1857
U5	0.0000	0.4742	0.9296	0.6598	0.8448	0.7605	1.0000	0.0848	0.0313	0.9731	0.0421
	0.0065	0.4360	0.9288	0.6767	0.8484	0.7521	1.0000	0.0380	0.0062	0.9733	0.1183
	0.2273	0.3900	0.9279	0.6947	0.8523	0.7420	1.0000	0.0091	0.0003	0.9734	0.2350
K6	0.0000	0.2636	0.0674	0.0105	0.0231	0.0954	0.0400	1.0000	0.5610	0.0528	0.0000
	0.0003	0.3131	0.1275	0.0445	0.0658	0.1517	0.0883	1.0000	0.5406	0.1116	0.0033
	0.1060	0.3691	0.2076	0.1210	0.1433	0.2262	0.1659	1.0000	0.5158	0.1929	0.0585
L1	0.0000	0.0725	0.0111	0.0008	0.0024	0.0165	0.0049	0.5237	1.0000	0.0083	0.0000
	0.0001	0.1233	0.0449	0.0126	0.0192	0.0500	0.0264	0.5441	1.0000	0.0395	0.0008
	0.0769	0.1939	0.1147	0.0665	0.0764	0.1166	0.0864	0.5679	1.0000	0.1080	0.0364
c4	0.0000	0.4544	0.9027	0.6871	0.8736	0.7341	0.9710	0.0794	0.0292	1.0000	0.0470
	0.0069	0.4135	0.9011	0.7016	0.8760	0.7236	0.9709	0.0339	0.0054	1.0000	0.1252
	0.2301	0.3644	0.8994	0.7171	0.8787	0.7111	0.9707	0.0074	0.0002	1.0000	0.2416
b3	0.0002	0.0308	0.1099	0.2196	0.1501	0.0631	0.1057	0.0021	0.0007	0.1429	1.0000
	0.0779	0.0022	0.0348	0.1306	0.0677	0.0123	0.0355	0.0000	0.0000	0.0571	1.0000
	0.3745	0.0000	0.0040	0.0560	0.0166	0.0004	0.0045	0.0000	0.0000	0.0112	1.0000

Table 12. See the descriptions of Table 10 and Table 6, but for the Venezuela data set.

locus	L4	U5	K6	L1	c4	b3	fr13	ps6	ps7
L4	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
U5	N/A	1.0000	0.9084	0.6173	0.9673	0.3317	0.5085	0.5759	0.8291
	N/A	1.0000	0.9026	0.5369	0.9666	0.1482	0.3678	0.4747	0.8446
	N/A	1.0000	0.8967	0.4449	0.9659	0.0310	0.2213	0.3610	0.8587
K6	N/A	0.8967	1.0000	0.7043	0.9331	0.3846	0.5741	0.6550	0.7086
	N/A	0.9028	1.0000	0.6587	0.9357	0.2227	0.4712	0.5907	0.7494
	N/A	0.9084	1.0000	0.6070	0.9382	0.0870	0.3569	0.5177	0.7845
L1	N/A	0.5895	0.7162	1.0000	0.6325	0.5821	0.8035	0.9268	0.3831
	N/A	0.6639	0.7555	1.0000	0.6931	0.5076	0.7843	0.9244	0.5221
	N/A	0.7246	0.7887	1.0000	0.7440	0.4252	0.7638	0.9219	0.6371
c4	N/A	0.9660	0.9384	0.6453	1.0000	0.3485	0.5296	0.6014	0.7889
	N/A	0.9667	0.9359	0.5774	1.0000	0.1713	0.4016	0.5131	0.8118
	N/A	0.9674	0.9333	0.4995	1.0000	0.0457	0.2645	0.4129	0.8322
b3	N/A	0.2128	0.3221	0.5129	0.2468	1.0000	0.7710	0.6028	0.0833
	N/A	0.4069	0.4826	0.5862	0.4271	1.0000	0.7907	0.6551	0.2912
	N/A	0.5740	0.6132	0.6499	0.5820	1.0000	0.8082	0.7005	0.5138
fr13	N/A	0.3869	0.5142	0.7630	0.4286	0.7667	1.0000	0.8490	0.2047
	N/A	0.5270	0.6117	0.7836	0.5521	0.7452	1.0000	0.8579	0.3970
	N/A	0.6399	0.6901	0.8024	0.6534	0.7219	1.0000	0.8662	0.5684
ps6	N/A	0.5186	0.6473	0.9218	0.5618	0.6399	0.8667	1.0000	0.3168
	N/A	0.6152	0.7046	0.9243	0.6430	0.5857	0.8582	1.0000	0.4771
	N/A	0.6930	0.7520	0.9267	0.7103	0.5256	0.8494	1.0000	0.6116
ps7	N/A	0.8592	0.7852	0.5066	0.8323	0.2675	0.4256	0.4758	1.0000
	N/A	0.8449	0.7495	0.3662	0.8118	0.0689	0.2318	0.3156	1.0000
	N/A	0.8295	0.7091	0.2180	0.7890	0.0028	0.0736	0.1576	1.0000

Table 8. See description of Table 7 but for the Cameroon data set.

locus	L5	J3	J6	U6	L4	U5	K6	L1	c4	b3
L5	1.0000	0.8964	0.2148	0.0208	0.0326	0.0111	0.0117	0.0000	0.6733	0.8457
	1.0000	0.8953	0.2552	0.0439	0.0637	0.0312	0.0314	0.0000	0.6899	0.8480
	1.0000	0.8974	0.1800	0.0090	0.0153	0.0033	0.0038	0.0000	0.6509	0.8433
J3	0.8892	1.0000	0.1633	0.0126	0.0214	0.0068	0.0071	0.0000	0.6227	0.7460
	0.8904	1.0000	0.2058	0.0316	0.0483	0.0230	0.0229	0.0000	0.6445	0.7522
	0.8881	1.0000	0.1281	0.0044	0.0084	0.0016	0.0018	0.0000	0.5930	0.7395
J6	0.2585	0.2345	1.0000	0.3954	0.4087	0.2356	0.2621	0.0000	0.8770	0.3907
	0.2189	0.1918	1.0000	0.4189	0.4340	0.2717	0.2956	0.0001	0.8746	0.3595
	0.2959	0.2768	1.0000	0.3742	0.3858	0.2045	0.2330	0.0000	0.8801	0.4231
U6	0.0728	0.0699	0.4690	1.0000	0.9488	0.6881	0.7566	0.0001	0.6183	0.1385
	0.0406	0.0370	0.4468	1.0000	0.9490	0.6956	0.7611	0.0010	0.5961	0.0979
	0.1126	0.1132	0.4887	1.0000	0.9486	0.6814	0.7526	0.0000	0.6468	0.1884
L4	0.0643	0.0621	0.4324	0.9429	1.0000	0.7374	0.8091	0.0001	0.5979	0.1250
	0.0340	0.0311	0.4075	0.9427	1.0000	0.7427	0.8119	0.0012	0.5734	0.0853
	0.1030	0.1044	0.4545	0.9431	1.0000	0.7326	0.8066	0.0000	0.6292	0.1749
U5	0.0327	0.0328	0.2732	0.6635	0.7445	1.0000	0.9162	0.0005	0.4985	0.0713
	0.0125	0.0119	0.2379	0.6554	0.7392	1.0000	0.9157	0.0033	0.4621	0.0392
	0.0648	0.0684	0.3058	0.6706	0.7491	1.0000	0.9167	0.0000	0.5457	0.1184
K6	0.0404	0.0401	0.3166	0.7453	0.8204	0.9190	1.0000	0.0003	0.5278	0.0851
	0.0172	0.0162	0.2838	0.7406	0.8178	0.9195	1.0000	0.0025	0.4950	0.0503
	0.0747	0.0779	0.3465	0.7494	0.8227	0.9185	1.0000	0.0000	0.5701	0.1336
L1	0.0000	0.0000	0.0003	0.0006	0.0027	0.0061	0.0033	1.0000	0.0493	0.0000
	0.0000	0.0000	0.0000	0.0000	0.0003	0.0012	0.0005	1.0000	0.0145	0.0000
	0.0008	0.0013	0.0033	0.0040	0.0113	0.0182	0.0118	1.0000	0.1551	0.0029
c4	0.4075	0.3650	0.7412	0.2155	0.2402	0.1223	0.1355	0.0000	1.0000	0.5643
	0.3781	0.3306	0.7469	0.2498	0.2771	0.1618	0.1738	0.0000	1.0000	0.5469
	0.4341	0.3975	0.7360	0.1862	0.2082	0.0916	0.1051	0.0000	1.0000	0.5822
b3	0.8341	0.7443	0.3126	0.0414	0.0584	0.0223	0.0239	0.0000	0.7525	1.0000
	0.8315	0.7378	0.3465	0.0708	0.0950	0.0485	0.0498	0.0000	0.7621	1.0000
	0.8365	0.7503	0.2821	0.0231	0.0343	0.0092	0.0105	0.0000	0.7395	1.0000

Table 11. See description of Table 10 but for the Cameroon data set.

locus	L5	J3	J6	U6	L4	U5	K6	L1	c4	b3
L5	1.0000	0.8953	0.2548	0.0436	0.0631	0.0307	0.0310	0.0000	0.6899	0.8480
	1.0000	0.8963	0.2184	0.0226	0.0350	0.0125	0.0131	0.0000	0.6684	0.8456
	1.0000	0.8974	0.1797	0.0089	0.0151	0.0033	0.0037	0.0000	0.6509	0.8433
J3	0.8904	1.0000	0.2054	0.0313	0.0478	0.0226	0.0225	0.0000	0.6445	0.7522
	0.8893	1.0000	0.1669	0.0140	0.0234	0.0078	0.0081	0.0000	0.6163	0.7458
	0.8881	1.0000	0.1278	0.0043	0.0083	0.0015	0.0017	0.0000	0.5930	0.7395
J6	0.2185	0.1913	1.0000	0.4188	0.4338	0.2713	0.2953	0.0001	0.8746	0.3593
	0.2549	0.2320	1.0000	0.3979	0.4114	0.2393	0.2657	0.0000	0.8778	0.3923
	0.2954	0.2763	1.0000	0.3740	0.3857	0.2042	0.2328	0.0000	0.8801	0.4229
U6	0.0401	0.0364	0.4467	1.0000	0.9490	0.6956	0.7611	0.0010	0.5961	0.0974
	0.0694	0.0676	0.4668	1.0000	0.9488	0.6890	0.7571	0.0001	0.6259	0.1411
	0.1117	0.1122	0.4886	1.0000	0.9486	0.6813	0.7526	0.0000	0.6467	0.1878
L4	0.0335	0.0306	0.4073	0.9427	1.0000	0.7427	0.8119	0.0012	0.5733	0.0848
	0.0610	0.0599	0.4299	0.9429	1.0000	0.7380	0.8094	0.0001	0.6062	0.1275
	0.1021	0.1034	0.4543	0.9431	1.0000	0.7326	0.8066	0.0000	0.6292	0.1743
U5	0.0122	0.0116	0.2376	0.6553	0.7392	1.0000	0.9157	0.0032	0.4620	0.0388
	0.0302	0.0311	0.2696	0.6626	0.7439	1.0000	0.9162	0.0006	0.5114	0.0736
	0.0638	0.0673	0.3054	0.6705	0.7491	1.0000	0.9167	0.0000	0.5456	0.1177
K6	0.0169	0.0158	0.2835	0.7406	0.8178	0.9195	1.0000	0.0024	0.4950	0.0499
	0.0377	0.0382	0.3133	0.7448	0.8201	0.9191	1.0000	0.0004	0.5393	0.0876
	0.0738	0.0768	0.3462	0.7494	0.8227	0.9185	1.0000	0.0000	0.5700	0.1329
L1	0.0000	0.0000	0.0000	0.0000	0.0003	0.0011	0.0004	1.0000	0.0144	0.0000
	0.0000	0.0000	0.0002	0.0005	0.0021	0.0050	0.0026	1.0000	0.0769	0.0001
	0.0006	0.0010	0.0029	0.0038	0.0108	0.0176	0.0113	1.0000	0.1547	0.0026
c4	0.3779	0.3303	0.7468	0.2495	0.2768	0.1613	0.1734	0.0000	1.0000	0.5468
	0.4050	0.3631	0.7418	0.2191	0.2439	0.1262	0.1394	0.0000	1.0000	0.5652
	0.4339	0.3972	0.7360	0.1859	0.2079	0.0912	0.1048	0.0000	1.0000	0.5821
b3	0.8315	0.7378	0.3463	0.0704	0.0945	0.0480	0.0494	0.0000	0.7621	1.0000
	0.8339	0.7439	0.3157	0.0439	0.0615	0.0243	0.0260	0.0000	0.7496	1.0000
	0.8365	0.7503	0.2818	0.0229	0.0340	0.0090	0.0104	0.0000	0.7395	1.0000

Overall, the methods perform well for all data sets and provide meaningful results. However, the statistical tests also yielded significant differences in some of the pairwise comparisons of the various Inline graphic estimates in each region (Tables 4–12). The allele frequencies differ, of course, but all estimates are based on the same true parameter . If the estimates for are significantly different, some of them cannot be trusted. This can have various reasons. First, it can be a type I error. However, this occurs only with small probability if the CIs are well calibrated, i.e., if their nominal coverage ( Inline graphic ) is close to the actual coverage. Asymptotic CIs and tests based on them (Wald, Score) will be more affected than profile-likelihood-based intervals, because the former are inherently forced to be symmetric. This is particularly true if the estimates for are close to zero. To quantify this effect, and to suggest heuristic methods to recalibrate the CIs, a systematic numerical robustness study of the approach is planned. Preliminary investigations, however, have shown that the profile-likelihood-based CIs in particular are well calibrated.

Second, the tests are designed to compare the ML estimate based on the data with a value $\lambda_0$, which has to be interpreted as prior knowledge. Strictly speaking, it is not meant to be estimated from data itself, or at least not from data which is available. A test designed to compare two estimates should incorporate information from both data sets (data from both markers). A standard approach to resolve this is as follows. One could calculate the product of the maximum likelihoods from both markers and compare it with the maximum likelihood of both markers conditioned on equality of $\lambda$. This, however, would require much more numerical effort than the tests here. Note further that the structure of the data does not allow a permutation test, because the allele-frequency distributions are expected to be different. This is true for two different marker loci in the same endemic region as well as for the same marker in two different populations.

Third, the model assumptions might be violated, i.e., the underlying Poisson distribution might not be correct. This can again be quantified in the course of a robustness study.

Fourth, the allele-frequency spectra of two different marker loci are very different, and the method might be sensitive to this. For instance, strong skewness in the data distributions might bias the estimates. This is obviously the case if one marker shows no variation at all. Moreover, the number of different alleles at different markers is very different, which results in very different probabilities of the ML estimates. These issues again need to be investigated in a numerical study.

Fifth, some STR markers tend to be hyper-mutable. As a result, not just the frequency distribution might be more problematic, but it is also more challenging to correctly identify the tandem repeat numbers. Hence, for hyper-mutable markers the data might have very bad quality. In our examples the marker labelled L1 appears to be hyper-mutable.

Because of all these possible reasons, it would be premature to suggest a heuristic on how to decide which estimates can be trusted the most. A systematic numerical follow-up study is planned to investigate all these possibilities in detail and to provide suggestions on the criteria by which the data should be chosen.

Discussion

The number of genetically distinct lineages co-infecting a host - commonly referred to as “multiplicity of infection” (MOI) - is a key quantity in epidemiology. First, it relates to transmission intensity, since it provides a metric for the number of secondary infections after a primary infection, assuming that the lineages circulating are identifiable (e.g. secondary infections within a clonal outbreak simply cannot be traced). Second, it measures the possibility of genetic exchange among those lineages as determined by the genetic system of the pathogen in question. Finally, if phenotypic differences are associated with those lineages, MOI could lead to very complex dynamics driven by natural selection.

Measuring MOI is desirable in a variety of infectious diseases, but - in many instances - only feasible if it can be measured at low cost and with a reasonable effort. Optimally it should fit into standard study designs and should be easily computable with whatever genotyping data can be collected from clinical specimens. In order to meet these goals, we further developed the maximum-likelihood (ML) method originally proposed by [20] and applied it to three malaria datasets as examples.

From a total of $N$ samples (e.g. blood samples), the number of genetically distinguishable lineages present in each host is recorded. From the resulting data, assuming that hosts are infected randomly by those lineages according to their prevalence, we derived the likelihood function. If infections with the pathogen are rare events, a natural choice for the number of co-infecting lineages is a conditional Poisson distribution (CPD). This distribution comes with the appealing feature that it is characterized by a single parameter $\lambda$, whose transform is the average MOI. Assuming a CPD, the likelihood function simplifies, as does the procedure to derive the ML estimates. Although this was previously described by [20], we were able to derive a number of important results. First, the ML estimate always exists and is unique. Second, it has the intuitive interpretation of being the parameter vector under which the observed prevalences of the distinguishable lineages equal the expected ones, i.e., the observation is the expectation if the ML estimate is the true parameter vector. Third, the recursion to compute the ML estimate for $\lambda$ reduces from a multi-dimensional to a one-dimensional recursion, which depends only on the number of samples and the observed prevalences. The ML estimates for the lineage frequencies are explicit functions of $\lambda$. Fourth, the recursion for $\lambda$ converges (at least) from every initial value . Convergence is monotonic, at quadratic rate, and typically occurs within a few iterations. Besides the obvious computational advantages provided by our results, their foremost importance is that they justify the ML approach. Using an ML estimate is only appropriate if it has a significantly higher probability than distant alternative parameter choices, which is difficult to evaluate in a multi-dimensional space. 
However, the form of the ML estimate here - particularly because the lineage prevalences depend continuously on $\lambda$ - indicates that the observation will have significantly lower probability under distant alternative parameter choices. The method worked well for the three malaria datasets to which it was applied, and gave similar results when applied to different independent microsatellite loci.
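As a small illustration of the CPD described above, the following Python sketch evaluates the conditional (zero-truncated) Poisson probabilities and the resulting mean MOI. It assumes the standard form of that distribution, P(m) = λ^m / (m! (e^λ − 1)) for m ≥ 1, whose mean is λ / (1 − e^{−λ}). The names `cpd_pmf`, `mean_moi`, and `lam`, as well as the numerical value used, are illustrative assumptions and not notation taken from the article.

```python
import math


def cpd_pmf(m, lam):
    """P(m lineages) for a Poisson(lam) variable conditioned on m >= 1."""
    if m < 1:
        return 0.0
    # Work on the log scale to avoid overflow in lam**m and m!.
    log_p = m * math.log(lam) - math.lgamma(m + 1) - math.log(math.expm1(lam))
    return math.exp(log_p)


def mean_moi(lam):
    """Mean of the conditional Poisson distribution: lam / (1 - exp(-lam))."""
    return lam / (-math.expm1(-lam))


lam = 1.5  # hypothetical parameter value, for illustration only
probs = [cpd_pmf(m, lam) for m in range(1, 60)]
mean_by_sum = sum(m * q for m, q in enumerate(probs, start=1))
print(sum(probs), mean_by_sum, mean_moi(lam))
```

The mean approaches 1 as `lam` tends to zero, which corresponds to the situation in which only single infections occur.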

Although our results justify the ML approach, it is nevertheless of fundamental importance to provide confidence intervals (CIs). We reported here on asymptotic and profile-likelihood-based CIs for all parameters. Asymptotic CIs are based either on the observed or on the expected Fisher information, which coincide under the CPD. Explicit formulas for the CIs for all involved parameters were derived. Profile-likelihood-based CIs were already emphasized by [20]. However, it is important to note that they can actually be derived at low numerical cost by using the method of Lagrange multipliers. This reduces the numerical effort to the same magnitude as for the ML estimate. Assuming the CPD, we proved that the CI for the parameter $\lambda$, yielding the estimate for the MOI, is uniquely defined. The confidence bounds are derived by a two-dimensional recursion, which converges locally at quadratic rate. Both kinds of CIs gave meaningful results for the three data sets to which we applied the methods, and they agree well. Although the asymptotic CIs are easier to derive, we suggest using the profile-likelihood-based CIs if the sample size is low and/or the ML estimate for $\lambda$ is small, for the reasons discussed in the application section. Although we discussed CIs for the lineages' frequencies, these are somewhat less interesting unless one focuses on the prevalence of a particular lineage. Otherwise one should derive confidence regions on the simplex for the lineage frequencies, which is done as outlined but is numerically more demanding.

To test the ML estimate against other parameter choices, three statistical tests are typically used: the likelihood-ratio, the Score, and the Wald test. The latter two are based on the asymptotic CIs, while the likelihood-ratio test builds upon the profile-likelihood-based CIs. Motivated by our intention to apply the methods to malaria, we focused on using these tests to compare estimates for the parameter $\lambda$. Namely, several genetic markers characterizing lineages are typically available (e.g., several microsatellite markers), to all of which the methods are applicable. While the true parameter is of course the same for all markers, the ML estimates obtained from them will differ. It is therefore important to test whether these estimates differ significantly. The parameter $\lambda$ changes on temporal and spatial scales. An obvious question is whether MOI changes over time (e.g. before and after the implementation of control measures) or varies across endemic regions. Hence, it is important to test for significant differences in estimates for $\lambda$.

Not surprisingly, all tests described perform equally well, as they are asymptotically equivalent. However, as in the case of CIs, we suggest using the likelihood-ratio test if the sample size is small or the parameters compared are small. If p-values are of interest, additional effort is required for the likelihood-ratio test, because a two-dimensional iteration needs to be performed. However, numerically this is only as demanding as obtaining the CIs. Because the test statistics for the Score and Wald tests can be derived, it is easy to derive p-values in these cases. For each of these two tests we provided three alternative variants, which all worked almost equally well in the provided examples. We should point out that it was our intention to indicate only how tests for the parameters can be constructed. With the usual approaches one could compare multiple parameters at the same time, including the information of all these markers. This, however, exceeds both our intention and the scope of this article. Finally, as a justification for using the CPD, which simplifies the method to a great extent, we summarized the test suggested by [20]. Although the test will be uninformative if many lineages are present, it provides a justification for the approach. Of note, the CPD is an intuitive assumption if infections are relatively rare events. This does not relate to the overall prevalence but rather to how high the observed incidence is in a given population in terms of the time scale required for the pathogen to complete its transmission cycle. Such a relationship is hard to establish without complex simulations, but it is worth noting that there could be biological scenarios (particular pathogens or epidemiologic settings) where this assumption does not hold. Thus, it is advisable to check whether the CPD assumption is violated using the tests for the model fit proposed in this investigation. 
In our case study, we observe robust estimates across very different epidemiologic settings. Overall, the methods developed here can be used to compare groups under different exposures, different manifestations of disease, groups of patients that have different genotypes (e.g. sickle cell or any other hemoglobinopathies associated with protection), or the efficacy of a given vaccine. Biologically, this method assumes that the rate of evolution of the marker used is “low” relative to the time of the infection. That is, there is a “numerable” set of lineages that can be estimated, and no variants are generated during the time scale of one infection. Thus, it is not suitable for pathogens such as HIV or any other hypervariable virus. The second assumption is that the markers used to detect and characterize the MOI are effectively neutral, so they are not linked to genes under selection. Thus, the loci cannot be associated with antigens or drug resistance. As presented, each locus is considered independent, which is a typical assumption of genotyping-based approximations used in molecular epidemiology. We also want to emphasize that this MOI estimate depends on the number of detectable lineages given a laboratory method. Thus, results from different markers such as SNPs or microsatellites are expected to differ as a function of their differences in mutation rates and mode of evolution. One could actually calculate the fit of individual loci and then exclude potential outliers if there is any biological reason to do so (e.g. microsatellites under different evolutionary models where one is hyper-variable or non-variable when compared with others). The method is sensitive enough to detect differences in MOI under different epidemiologic settings, as indicated by the analyses of empirical data. 
Whereas this is not per se a “genomic” method, in the sense that it is not designed to estimate MOI directly from reads generated from next-generation sequencing (NGS) data, it can do so from a given set of SNPs or microsatellites detected by using NGS. Whereas the method was originally intended for applications to malaria, it can be applied to other parasitic or microbial diseases where the assumptions are not violated, e.g., variation in the VNTRs in a multi-clonal infection of Mycobacterium tuberculosis. Unlike empirical approaches, where alleles are simply counted and then averaged, the proposed ML method provides a robust and computationally efficient statistical framework that can be integrated in epidemiological investigations.

Analysis

1 The Model

1.1 Background

Here, Inline graphic given by (1) is explicitly derived under the assumption that is given by the CPD (2). Namely,

graphic file with name pone.0097899.e432.jpg

where in the derivations the condition Inline graphic indicates that the product is taken over all non-zero components of , corresponding to the alleles found in a sample with allele configuration .

1.2 Log-Likelihood

Assuming that the number of lineages infecting a host follows the CPD (2), the log-likelihood (3) simplifies to

graphic file with name pone.0097899.e436.jpg

where

is the number of samples that contain allele Inline graphic . Notably, with equality only if all samples are single infections.
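The simplified log-likelihood can be illustrated with a short Python sketch. Because the displayed equations are not reproduced in this text, the form used here is an assumption derived directly from the model: if the number of lineages follows the CPD and each lineage carries allele k with frequency p_k, a sample showing exactly the allele set A has probability ∏_{k∈A}(e^{λ p_k} − 1) / (e^λ − 1), so the log-likelihood depends on the data only through the number of samples N and the counts N_k. The function names, the toy samples, and the parameter values are hypothetical.

```python
import math
from itertools import combinations


def sample_probability(present, lam, freqs):
    """Probability that exactly the alleles in `present` are seen in one sample."""
    prod = 1.0
    for k in present:
        prod *= math.expm1(lam * freqs[k])
    return prod / math.expm1(lam)


def log_likelihood(lam, freqs, n_samples, allele_counts):
    """Log-likelihood written in terms of N (n_samples) and the counts N_k."""
    return (-n_samples * math.log(math.expm1(lam))
            + sum(nk * math.log(math.expm1(lam * pk))
                  for nk, pk in zip(allele_counts, freqs) if nk > 0))


# Hypothetical toy data: each sample is the set of allele indices detected.
samples = [{0}, {0, 1}, {1}, {0, 2}, {0, 1, 2}]
freqs = [0.5, 0.3, 0.2]
lam = 1.2

# N_k: number of samples containing allele k.
counts = [sum(k in s for s in samples) for k in range(len(freqs))]

# Summing log-probabilities sample by sample gives the same value.
direct = sum(math.log(sample_probability(s, lam, freqs)) for s in samples)
compact = log_likelihood(lam, freqs, len(samples), counts)

# Sanity check: probabilities of all non-empty allele sets add up to one.
total = sum(sample_probability(c, lam, freqs)
            for r in range(1, len(freqs) + 1)
            for c in combinations(range(len(freqs)), r))
print(direct, compact, total)
```

In this toy data set the counts add up to 9 for 5 samples, in line with the remark that the sum of the counts equals the number of samples only if all samples are single infections.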

1.3 Proof of Remark 1

The proof of Remark 1 is as follows.

Proof of Remark 1. First, note that

Moreover, using de l'Hospital's rule we see that

graphic file with name pone.0097899.e441.jpg

because Inline graphic (note that this holds also true if for some ). This proves that is not a maximum likelihood estimate, which is quite intuitive.

1.4 Derivatives of the log-likelihood

Assuming the CPD (2) the log-likelihood function is given by (4) and the derivatives of (5) are hence straightforwardly calculated to be

(39a)

(39b)

(39c)

(39d)

(39e)

The entries of the Hessian matrix (7), i.e., the second derivatives of Inline graphic , given by (5), are calculated to be

(40a)

(40b)

(40c)

(40d)

(40e)

(40f)

(40g)

2 Proofs of the main results

2.1 Existence and uniqueness of the ML estimate

First, the result showing existence and uniqueness of the ML estimate in the generic case is proven.

Proof of Result 1. Assume Inline graphic , as this cannot be the ML estimate according to Remark 1. Equating (39b) to zero yields for all . Substituting this into (39a) and setting the equation to zero yields . Therefore, we obtain or

(41)

proving the last assertion. Hence, it remains to prove the statements for Inline graphic .

By using (41) and equating (39c) to zero, we obtain Inline graphic , which is equivalent to

(42)

Therefore, the ML estimate is a solution of (42). Straightforward calculation gives

graphic file with name pone.0097899.e470.jpg

and

graphic file with name pone.0097899.e471.jpg

Note that, Inline graphic and , because . Hence, near zero. Note further that . Hence, has at least one positive solution. Since, for at least one , , implying that is strictly convex for . Because is strictly convex there can be at most one positive solution of . Moreover, is strictly monotonically increasing for Inline graphic .

The solution can be found by a Newton method. Because Inline graphic is strictly convex and monotonically increasing for , the Newton method converges monotonically to the solution . Moreover, because is continuous, the rate of convergence is at least quadratic. Noting that yields (9) completes the proof.
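The Newton step can be sketched in Python as follows. The displayed equations (42) and (9) are not reproduced in this text, so the function iterated here is an assumption: it follows from requiring that the observed prevalence N_k/N equals the expected prevalence (1 − e^{−λ p_k}) / (1 − e^{−λ}) for every allele and that the frequencies sum to one, which gives f(λ) = λ + Σ_k log(1 − (N_k/N)(1 − e^{−λ})) = 0. The starting value is found by doubling until f is positive, which is a choice of this sketch rather than the article's prescription, and the counts in the usage lines are hypothetical.

```python
import math


def ml_estimate(n_samples, allele_counts, tol=1e-12, max_iter=100):
    """Newton iteration for the MOI parameter, then explicit allele frequencies.

    Sketch restricted to data with sum(N_k) > N and every N_k < N.
    """
    q = [nk / float(n_samples) for nk in allele_counts]
    if sum(q) <= 1.0 or any(x >= 1.0 for x in q):
        raise ValueError("sketch requires sum(N_k) > N and all N_k < N")

    def f(lam):
        # expm1(-lam) = -(1 - exp(-lam)), so the argument is -q_k (1 - e^{-lam}).
        return lam + sum(math.log1p(x * math.expm1(-lam)) for x in q)

    def f_prime(lam):
        e = math.exp(-lam)
        return 1.0 - sum(x * e / (1.0 + x * math.expm1(-lam)) for x in q)

    # f is convex with f(0) = 0 and f'(0) < 0; start to the right of the
    # positive root, where f is positive and increasing.
    lam = 1.0
    while f(lam) <= 0.0:
        lam *= 2.0

    for _ in range(max_iter):
        step = f(lam) / f_prime(lam)
        lam -= step
        if abs(step) < tol:
            break

    freqs = [-math.log1p(x * math.expm1(-lam)) / lam for x in q]
    return lam, freqs


# Hypothetical counts: 100 samples, three alleles seen in 60, 50 and 30 of them.
lam_hat, p_hat = ml_estimate(100, [60, 50, 30])
print(lam_hat, p_hat, sum(p_hat))
```

Starting to the right of the root means the iterates decrease monotonically toward the solution, which mirrors the monotone convergence argument given in the proof.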

The special case, in which only single infections occur, is summarized by Remark 2. It can be proven as follows.

Proof of Remark 2. Examining the proof of Result 1 yields that the ML estimate is any positive root of Inline graphic . In the present case . However, since must hold for at least one , is still strictly convex. This implies that for all . Hence, no maximum likelihood estimate with exists.

Moreover, since

graphic file with name pone.0097899.e502.jpg

the ML estimate can only be attained at Inline graphic .

In the limit Inline graphic , one obtains, as in the proof of Remark 1,

which is maximized at Inline graphic . Particularly, the likelihood function is finite in this case.

In the other non-generic situation, every lineage is found in all samples, which is described in Remark 3 and can be proven as follows.

Proof of Remark 3. The proof of Result 1 yields Inline graphic . Hence, has no positive solution, and hence no ML estimate with exists. Clearly, Remark 1 states that is also not an ML estimate.

In this case the log-likelihood function simplifies to

Taking the limit Inline graphic yields

graphic file with name pone.0097899.e516.jpg

Since Inline graphic implies that the likelihood is one, this limit case, which is - of note - independent of the allele-frequency distribution, is the maximum likelihood.

Remark 4 states that the expected number of samples containing a given lineage equals the observed number of samples containing this allele if the ML estimate is the true parameter. The proof is as follows.

Proof of Remark 4. The maximum likelihood estimate satisfies Inline graphic . Equating (39b) to zero yields for all . Substituting this into (39a) and setting the equation to zero yields . Therefore, we obtain or . Hence, it remains to be shown that holds.

In the following we will use that Inline graphic . To simplify the notation assume . Hence,

graphic file with name pone.0097899.e528.jpg

Successively repeating the last step gives

graphic file with name pone.0097899.e529.jpg

Since the alleles can be arbitrarily labeled, we obtain

(43)

The proof is completed by noting that Inline graphic is obtained from (4) by replacing with .
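The identity behind Remark 4 can be checked numerically with a short Python sketch. It assumes, as follows from the CPD and random assignment of lineages by frequency, that the probability of finding an allele with frequency p_k in a sample is (1 − e^{−λ p_k}) / (1 − e^{−λ}); multiplied by the number of samples, this is the expected count for that allele. The sketch compares this closed form with a direct sum over the number of lineages m. Function names and parameter values are illustrative assumptions.

```python
import math


def expected_prevalence(lam, p_k):
    """Closed form: probability that a given allele is present in a sample."""
    return math.expm1(-lam * p_k) / math.expm1(-lam)


def expected_prevalence_by_sum(lam, p_k, m_max=200):
    """Same quantity via sum over m of P(m) * (1 - (1 - p_k)**m)."""
    total = 0.0
    for m in range(1, m_max + 1):
        # Conditional Poisson weight for m lineages, on the log scale.
        log_kappa = (m * math.log(lam) - math.lgamma(m + 1)
                     - math.log(math.expm1(lam)))
        # Given m lineages, the allele is absent with probability (1 - p_k)**m.
        total += math.exp(log_kappa) * (1.0 - (1.0 - p_k) ** m)
    return total


lam, p_k = 1.3, 0.25  # hypothetical values
closed = expected_prevalence(lam, p_k)
summed = expected_prevalence_by_sum(lam, p_k)
print(closed, summed)
```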

2.2 Profile likelihood based confidence intervals

The existence and uniqueness of the profile-likelihood-based confidence intervals are proven as follows.

Proof of Result 2. The proof consists of several parts.

Part A: Existence in the generic case. We first assume Inline graphic and for at least one and prove the CI's existence.

The CI's bounds satisfy (12). The equations Inline graphic , yield , or

(44)

which implies that Inline graphic must hold for all . Since, , by summing up the above expression one arrives at . Thus, for fixed the Lagrange multiplier is a zero of the function

(45)

Its derivative is given by

(46)

Hence, Inline graphic is strictly monotonically increasing in , and consequently has at most one zero . Note that and . Hence, has exactly one solution . Furthermore, according to the implicit-function theorem, is a continuously differentiable function of .

The likelihood function (4) can be rewritten as

graphic file with name pone.0097899.e557.jpg

(47)

Note that

graphic file with name pone.0097899.e558.jpg

Since Inline graphic for at least one , it follows that

(48)

for any arbitrary but fixed allele-frequency vector Inline graphic . Moreover, the proof of Remark 1 reveals that

(49)

Now, for any Inline graphic , let with given by (44) with .

Next, we show indirectly that Inline graphic .

First, assume Inline graphic . Hence, there exists a sequence , with but . Hence, such that for a subsequence , . Without loss of generality, . Let be the corresponding sequence of allele-frequency vectors. Since the simplex is compact, there exists a convergent subsequence . Because is continuous, it follows that Inline graphic , contradicting (48).

Analogously it is shown that Inline graphic .

Since Inline graphic , as well as are continuous, and , there exist , such that is a solution of (12), where is given by (44). This proves the existence of the CI's bounds.

Part B: Uniqueness in the generic case. Next, the uniqueness of the confidence intervals is proven. Assume two values Inline graphic with . Since is continuously differentiable the mean value theorem implies that there exists with . Application of the chain rule yields . By definition of , the relation holds. Hence, Thus,

Inline graphic , where is given by (44) with . This implies that is a zero of (39), or, in other words, that is a maximum likelihood estimate. Because of its uniqueness , and . Hence, or is impossible, and the CI is therefore uniquely defined.

Part C: Existence and uniqueness in the non-generic cases. In the case Inline graphic the same proof holds with obvious modifications. As (49) is violated and becomes . It follows that at least one solution of (12) exists. The above proof of uniqueness implies that this is the only solution.

Similarly, for Inline graphic for all , (48) is violated and becomes , from which the existence of exactly one solution of (12) follows from the same proof as in the generic case.

Part D: Derivation of the CIs in the generic case. Parts A and B reveal that the CI's bounds are the two solutions ( Inline graphic and ) of the equations and , where is given by (45), and with given by (44). A little algebraic manipulation yields that is given by (16e).

The solutions can be found by a Newton method. Straightforward calculation gives

graphic file with name pone.0097899.e622.jpg

where Inline graphic and are given by (16c) and (45) or (16d), respectively. Hence, the Newton method leads to the following iteration

graphic file with name pone.0097899.e626.jpg

Due to its relatively simple form, the above matrix can be easily inverted and the iteration can be rewritten as (16a) and (16b).

The Newton method converges locally quadratically if the above matrix is nonsingular at the solutions. Part A of the proof reveals that these solutions satisfy Inline graphic , yielding . Hence, the matrix simplifies to

graphic file with name pone.0097899.e629.jpg

Therefore,

graphic file with name pone.0097899.e630.jpg

Clearly, since Inline graphic , if and only if . According to the proof of Result 1 this condition is only fulfilled at the unique ML estimate. Hence, in and . Therefore, the Newton method converges quadratically for any initial value sufficiently close to the respective solution.
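The construction of the profile-likelihood-based bounds can be sketched in Python. This is not the two-dimensional Newton recursion (16a)-(16b) of the proof: for robustness the sketch swaps in plain bisection and a ternary search. It assumes the log-likelihood −N log(e^λ − 1) + Σ_k N_k log(e^{λ p_k} − 1); for fixed λ, the maximizing frequencies then satisfy p_k = −log(1 − λ N_k/β)/λ, with the multiplier β fixed by Σ_k p_k = 1. The 95% level, all names, and the counts are assumptions of this illustration.

```python
import math

CHI2_95_1DOF = 3.841458820694124  # 0.95 quantile of chi-square, 1 d.o.f.


def profile_frequencies(lam, counts):
    """Frequencies maximizing the log-likelihood for fixed lam."""
    def g(beta):
        total = lam
        for nk in counts:
            r = lam * nk / beta
            if r >= 1.0:
                return -math.inf
            total += math.log1p(-r)
        return total

    lo = lam * max(counts)   # g tends to -infinity as beta decreases to lo
    hi = 2.0 * lo
    while g(hi) <= 0.0:      # g increases towards lam > 0 for large beta
        hi *= 2.0
    for _ in range(200):     # bisection for the unique zero of g
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:
            break
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return [-math.log1p(-lam * nk / hi) / lam for nk in counts]


def profile_loglik(lam, n_samples, counts):
    p = profile_frequencies(lam, counts)
    return (-n_samples * math.log(math.expm1(lam))
            + sum(nk * math.log(math.expm1(lam * pk))
                  for nk, pk in zip(counts, p) if nk > 0))


def profile_ci(n_samples, counts, quantile=CHI2_95_1DOF):
    """Return (lam_hat, lower, upper) for the MOI parameter."""
    def ell(lam):
        return profile_loglik(lam, n_samples, counts)

    hi = 1.0
    while ell(2.0 * hi) > ell(hi):   # bracket the maximum (unimodal profile)
        hi *= 2.0
    a, b = 1e-8, 2.0 * hi
    for _ in range(100):             # ternary search for the maximum
        m1, m2 = a + (b - a) / 3.0, b - (b - a) / 3.0
        if ell(m1) < ell(m2):
            a = m1
        else:
            b = m2
    lam_hat = 0.5 * (a + b)
    target = ell(lam_hat) - 0.5 * quantile

    def crossing(inside, outside):
        for _ in range(100):         # bisection on ell(lam) = target
            mid = 0.5 * (inside + outside)
            if ell(mid) > target:
                inside = mid
            else:
                outside = mid
        return 0.5 * (inside + outside)

    low_out = up_out = lam_hat
    while ell(low_out) > target:
        low_out *= 0.5
    while ell(up_out) > target:
        up_out *= 2.0
    return lam_hat, crossing(lam_hat, low_out), crossing(lam_hat, up_out)


# Hypothetical counts: 100 samples, alleles seen in 60, 50 and 30 of them.
lam_hat, lower, upper = profile_ci(100, [60, 50, 30])
print(lam_hat, lower, upper)
```

Bisection trades the quadratic convergence of the Newton recursion for simplicity, so it is slower but needs no starting value close to the solution.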

2.3 Asymptotic confidence intervals. Proof of Result 3

This proof is slightly more general than necessary as we will re-use part of it later.

First, consider a matrix Inline graphic with the following structure

(50a)

with

(50b)

Let Inline graphic . We aim to derive . We do so by inverting blockwise. Namely,

The formula applies whenever Inline graphic and the matrix is invertible. Moreover,

(51)

where Inline graphic , , and . Its inverse is given by

Hence, the desired quantity Inline graphic becomes

graphic file with name pone.0097899.e654.jpg

We are now ready to derive the confidence interval given by (18). To derive Inline graphic we first note that (7), (40) and rearrangement of the parameters imply that the Fisher information matrix has the form (50), with

given by (40), and Inline graphic corresponds to . Therefore,

graphic file with name pone.0097899.e659.jpg

(52)

and consequently

graphic file with name pone.0097899.e660.jpg

Moreover,

graphic file with name pone.0097899.e661.jpg

and

Hence,

graphic file with name pone.0097899.e663.jpg

(53)

Deriving Inline graphic is easy. Namely, exactly the same calculations hold with

By inspecting (40), it becomes clear that all derivations remain unchanged with Inline graphic replaced by (cf. eq. 53). This gives

graphic file with name pone.0097899.e668.jpg

which simplifies to

graphic file with name pone.0097899.e669.jpg

(54)

Substituting the above with Inline graphic into (18) - using the fact that - yields (20) after a little algebraic manipulation.

The identities Inline graphic follow from (43). Substituting this into (54) gives

graphic file with name pone.0097899.e673.jpg

(55)

Substitution of the above evaluated at Inline graphic (using the fact that ) into (18) yields (21) after some rearrangement.

Proof of Result 4. To simplify the notation, we first derive the formulas for the confidence interval of Inline graphic . By re-arranging the parameters as in the proof of Result 3, it is obvious that the matrix given by (50) can be used instead of the Fisher information (or ). Particularly, .

We can apply a blockwise inversion formula to Inline graphic in a similar way as in the proof of Result 3. Namely,

graphic file with name pone.0097899.e682.jpg

where

with

Clearly,

graphic file with name pone.0097899.e685.jpg

where Inline graphic are the elements of . The inverse of is calculated exactly as the inverse of in the proof of Result 3. Namely, we arrive at

with Inline graphic , , and .

Hence, the desired quantity Inline graphic becomes

To derive the desired quantity Inline graphic () we need to set , , and . By using (40) and (43) we obtain

graphic file with name pone.0097899.e701.jpg

Therefore,

graphic file with name pone.0097899.e702.jpg

Hence,

Moreover,

graphic file with name pone.0097899.e710.jpg

Combining the above yields,

graphic file with name pone.0097899.e715.jpg

and finally

graphic file with name pone.0097899.e716.jpg

Hence, the bounds of the confidence intervals are given by

graphic file with name pone.0097899.e717.jpg

By replacing Inline graphic by , one obtains the confidence interval of given by (22).

2.4 Testing the Parameters

Proof of Result 5. The result is proven by showing that the iteration (29) leads to the profile-likelihood with Inline graphic . The proof of Remark 2 reveals that the desired values for is the unique zero of given by (45). The zero can be found using a Newton method. Combining (45) and (46) yields (29) after a little rearrangement.
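A likelihood-ratio test along these lines can be sketched in Python. Equations (29), (45), and (46) are not reproduced in this text, so the multiplier equation below is an assumption derived from the CPD log-likelihood: for a fixed value `lam0`, the profile frequencies are p_k = −log(1 − lam0·N_k/β)/lam0, where β is the zero of g(β) = lam0 + Σ_k log(1 − lam0·N_k/β), found here by a Newton iteration. The p-value uses the chi-square distribution with one degree of freedom. All names and counts are hypothetical.

```python
import math


def log_lik(lam, freqs, n, counts):
    return (-n * math.log(math.expm1(lam))
            + sum(nk * math.log(math.expm1(lam * pk))
                  for nk, pk in zip(counts, freqs) if nk > 0))


def ml_estimate(n, counts, tol=1e-12):
    """Unconstrained ML estimate (assumes sum(N_k) > N and all N_k < N)."""
    q = [nk / float(n) for nk in counts]
    lam = 1.0
    f = lambda x: x + sum(math.log1p(v * math.expm1(-x)) for v in q)
    while f(lam) <= 0.0:
        lam *= 2.0
    for _ in range(100):
        e = math.exp(-lam)
        df = 1.0 - sum(v * e / (1.0 + v * math.expm1(-lam)) for v in q)
        step = f(lam) / df
        lam -= step
        if abs(step) < tol:
            break
    return lam, [-math.log1p(v * math.expm1(-lam)) / lam for v in q]


def profile_frequencies(lam0, counts, tol=1e-13):
    """Newton iteration for the multiplier beta at fixed lam0."""
    # Start where g(beta) <= 0; g is increasing and concave, so the
    # iterates increase monotonically towards the zero.
    beta = lam0 * max(counts) / (-math.expm1(-lam0))
    for _ in range(100):
        g = lam0 + sum(math.log1p(-lam0 * nk / beta) for nk in counts)
        dg = sum(lam0 * nk / (beta * (beta - lam0 * nk)) for nk in counts)
        step = g / dg
        beta -= step
        if abs(step) <= tol * beta:
            break
    return [-math.log1p(-lam0 * nk / beta) / lam0 for nk in counts]


def likelihood_ratio_test(lam0, n, counts):
    """Return (statistic, p_value) for the null hypothesis lambda = lam0."""
    lam_hat, p_hat = ml_estimate(n, counts)
    p0 = profile_frequencies(lam0, counts)
    stat = 2.0 * (log_lik(lam_hat, p_hat, n, counts) - log_lik(lam0, p0, n, counts))
    stat = max(stat, 0.0)
    # Survival function of chi-square with one degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: 100 samples, alleles seen in 60, 50 and 30 of them.
for lam0 in (0.2, 1.0, 3.0):
    print(lam0, likelihood_ratio_test(lam0, 100, [60, 50, 30]))
```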

Remark 6. If Inline graphic and (and ) are the true (unknown) parameters, the asymptotic holds.

We aim to test only for Inline graphic , so any choice can be made for the true parameter. However, the parameters occur in the asymptotic variance . Hence, we need a plug-in estimate for the asymptotic variance. There are two possibilities. First, the true parameter is replaced by the profile-likelihood estimates Inline graphic based on and the asymptotic variance by . Here, either the expected or the observed Fisher information can be used.

Second, both Inline graphic and can be replaced by the ML estimate . In this case the expected and observed Fisher information coincide.

Proof of Result 6. The remark is proven by explicitly deriving the test statistic. To simplify the notation we write Inline graphic and for and , respectively. To derive (or ) we can follow the proof of Result 3.

From the blockwise inversion formula (51) the relation

(56)

follows immediately, where the denominator on the left-hand side is given by the reciprocal of (53).

Noting that Inline graphic given by (39a) one obtains . Substituting this and (56) in the test statistic (31), and writing and for and gives (32).

Of course, (56) also holds if Inline graphic is replaced by , where is given by (54). Thus, the same reasoning as above yields (33).

3 The case $\lambda = 0$

3.1 Log-likelihood

In the limiting case that the true parameter is $\lambda = 0$, the conditional Poisson distribution becomes

(57)

Following the derivations in subsection 1.1, Inline graphic becomes

where Inline graphic denotes the th base vector. Hence, the likelihood function (3) becomes

(58)

This is the limiting case of (3) for Inline graphic . Furthermore, we can conclude the following.

Remark 7. If the true parameter is Inline graphic , according to (57), an observation with is impossible in a sample of size . Hence, with probability one.

Assume Inline graphic is the true parameter. Then, we can assume

graphic file with name pone.0097899.e772.jpg

As mentioned above, the case Inline graphic is just the continuation of the likelihood function.

Hence, we can define Inline graphic . Moreover, the (one-sided) derivatives of the likelihood function exist in . We have,

(59a)

(59b)

(59c)

The proof is found in the next subsection, 3.2.

From (59a) we immediately see that Inline graphic . Hence, the ML estimate (cf. Remark 2) is a boundary maximum. However, it is necessary for the asymptotic distributions (11), (17), (30), and (38) that all derivatives of the likelihood function vanish. As this is not the case, we can neither derive confidence intervals, nor test the parameters in the case Inline graphic .

3.2 Derivatives of the likelihood function

Applying de l'Hospital's rule gives Inline graphic . Hence, successive application of this rule to the above yields

graphic file with name pone.0097899.e786.jpg

Note that the last step also proves Inline graphic .

Similarly, from (39d)

Since from (58) Inline graphic , one obtains (59b).

Acknowledgments

The authors thank Andrea M. McCollum for sharing the data sets from Cameroon, Kenya, and Venezuela. The constructive comments of one anonymous reviewer are gratefully acknowledged.

Funding Statement

This research is supported by U.S. National Institutes of Health grants 1U19AI089702 and R01GM084320 to AAE. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Read AF, Taylor LH (2001) The ecology of genetically diverse infections. Science 292: 1099–1102. [DOI] [PubMed] [Google Scholar]
2. Balmer O, Tanner M (2011) Prevalence and implications of multiple-strain infections. The Lancet Infectious Diseases 11: 868–878. [DOI] [PubMed] [Google Scholar]
3. Alizon S, de Roode JC, Michalakis Y (2013) Multiple infections and the evolution of virulence. Ecology Letters 16: 556–567. [DOI] [PubMed] [Google Scholar]
4. Wacker M, Turnbull L, Walker L, Mount M, Ferdig M (2012) Quantification of multiple infections of plasmodium falciparum in vitro. Malaria Journal 11: 180. [DOI] [PMC free article] [PubMed] [Google Scholar]
5. Matussek A, Stark L, Dienus O, Aronsson J, Mernelius S, et al. (2011) Analyzing multiclonality of staphylococcus aureus in clinical diagnostics using spa-based denaturing gradient gel electrophoresis. Journal of Clinical Microbiology 49: 3647–3648. [DOI] [PMC free article] [PubMed] [Google Scholar]
6. Vu-Thien H, Hormigos K, Corbineau G, Fauroux B, Corvol H, et al. (2010) Longitudinal survey of staphylococcus aureus in cystic fibrosis patients using a multiple-locus variable-number of tandemrepeats analysis method. BMC Microbiology 10: 24. [DOI] [PMC free article] [PubMed] [Google Scholar]
7. Tognazzo M, Schmid-Hempel R, Schmid-Hempel P (2012) Probing mixed-genotype infections ii: High multiplicity in natural infections of the trypanosomatid, crithidia bombi, in its host, bombus spp. PLoS ONE 7: e49137. [DOI] [PMC free article] [PubMed] [Google Scholar]
8. Frank SA (1992) A kin selection model for the evolution of virulence. Proceedings of the Royal Society of London Series B: Biological Sciences 250: 195–197.
9. Lively C (2005) Evolution of virulence: coinfection and propagule production in spore-producing parasites. BMC Evolutionary Biology 5: 64.
10. Schjørring S, Koella JC (2003) Sub-lethal effects of pathogens can lead to the evolution of lower virulence in multiple infections. Proceedings of the Royal Society of London Series B: Biological Sciences 270: 189–193.
11. Ben-Ami F, Mouton L, Ebert D (2008) The effects of multiple infections on the expression and evolution of virulence in a Daphnia-endoparasite system. Evolution 62: 1700–1711.
12. Schneider KA, Kim Y (2010) An analytical model for genetic hitchhiking in the evolution of antimalarial drug resistance. Theor Popul Biol 78: 93–108.
13. Klein EY, Smith DL, Laxminarayan R, Levin S (2012) Superinfection and the evolution of resistance to antimalarial drugs. Proc Biol Sci 279: 3834–3842.
14. Ben-Ami F, Routtu J (2013) The expression and evolution of virulence in multiple infections: the role of specificity, relative virulence and relative dose. BMC Evolutionary Biology 13: 97.
15. Thanapongpichat S, McGready R, Luxemburger C, Day N, White N, et al. (2013) Microsatellite genotyping of Plasmodium vivax infections and their relapses in pregnant and non-pregnant patients on the Thai-Myanmar border. Malaria Journal 12: 275.
16. Cohen T, van Helden PD, Wilson D, Colijn C, McLaughlin MM, et al. (2012) Mixed-strain Mycobacterium tuberculosis infections and the implications for tuberculosis treatment and control. Clinical Microbiology Reviews 25: 708–719.
17. Poon AFY, Swenson LC, Bunnik EM, Edo-Matas D, Schuitemaker H, et al. (2012) Reconstructing the dynamics of HIV evolution within hosts from serial deep sequence data. PLoS Comput Biol 8: e1002753.
18. Theron A, Sire C, Rognon A, Prugnolle F, Durand P (2004) Molecular ecology of Schistosoma mansoni transmission inferred from the genetic composition of larval and adult infrapopulations within intermediate and definitive hosts. Parasitology 129: 571–585.
19. Lindström I, Sundar N, Lindh J, Kironde F, Kabasa J, et al. (2008) Isolation and genotyping of Toxoplasma gondii from Ugandan chickens reveals frequent multiple infections. Parasitology 135: 39–45.
20. Hill WG, Babiker HA (1995) Estimation of numbers of malaria clones in blood samples. Proceedings of the Royal Society of London Series B: Biological Sciences 262: 249–257.
21. Schneider K, Kim Y (2011) Approximations for the hitchhiking effect caused by the evolution of antimalarial-drug resistance. Journal of Mathematical Biology 62: 789–832.
22. Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
23. Davison AC (2003) Statistical Models (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge University Press.
24. Venzon DJ, Moolgavkar SH (1988) A method for computing profile-likelihood-based confidence intervals. Journal of the Royal Statistical Society Series C (Applied Statistics) 37: 87–94.
25. McCollum AM, Mueller K, Villegas L, Udhayakumar V, Escalante AA (2007) Common origin and fixation of Plasmodium falciparum dhfr and dhps mutations associated with sulfadoxine-pyrimethamine resistance in a low-transmission area in South America. Antimicrob Agents Chemother 51: 2085–2091.
26. McCollum AM, Basco LK, Tahar R, Udhayakumar V, Escalante AA (2008) Hitchhiking and Selective Sweeps of Plasmodium falciparum Sulfadoxine and Pyrimethamine Resistance Alleles in a Population from Central Africa. Antimicrob Agents Chemother 52: 4089–4097.
27. McCollum AM, Schneider KA, Griffing SM, Zhou Z, Kariuki S, et al. (2012) Differences in selective pressure on dhps and dhfr drug resistant mutations in western Kenya. Malar J 11: 77.

