PLOS Computational Biology. 2023 Jun 9;19(6):e1010247. doi: 10.1371/journal.pcbi.1010247

coiaf: Directly estimating complexity of infection with allele frequencies

Aris Paschalidis 1,#, Oliver J Watson 1,2,#, Ozkan Aydemir 1,3, Robert Verity 2, Jeffrey A Bailey 1,*
Editor: Daniel B Larremore 4
PMCID: PMC10310041  PMID: 37294835

Abstract

In malaria, individuals are often infected with different parasite strains. The complexity of infection (COI) is defined as the number of genetically distinct parasite strains in an individual. Changes in the mean COI in a population have been shown to be informative of changes in transmission intensity with a number of probabilistic likelihood and Bayesian models now developed to estimate the COI. However, rapid, direct measures based on heterozygosity or FwS do not properly represent the COI. In this work, we present two new methods that use easily calculated measures to directly estimate the COI from allele frequency data. Using a simulation framework, we show that our methods are computationally efficient and comparably accurate to current approaches in the literature. Through a sensitivity analysis, we characterize how the distribution of parasite densities, the assumed sequencing depth, and the number of sampled loci impact the bias and accuracy of our two methods. Using our developed methods, we further estimate the COI globally from Plasmodium falciparum sequencing data and compare the results against the literature. We show significant differences in the estimated COI globally between continents and a weak relationship between malaria prevalence and COI.

Author summary

Computational models, used in conjunction with rapidly advancing sequencing technologies, are increasingly being used to help inform surveillance efforts and understand the epidemiological dynamics of malaria. One such important metric, the complexity of infection (COI), indirectly quantifies the level of transmission. Existing “gold-standard” COI measures rely on complex probabilistic likelihood and Bayesian models. As an alternative, we have developed the statistics and software package coiaf, which features two rapid, direct measures to estimate the number of genetically distinct parasite strains in an individual (the COI). Our methods were evaluated using simulated data and subsequently compared to current state-of-the-art methods, yielding comparable results. Lastly, we examined the distribution of the COI in several locations across the world, identifying significant differences in the COI between continents. coiaf, therefore, provides a new, promising framework for rapidly characterizing polyclonal infections.


This is a PLOS Computational Biology Methods paper.

Introduction

Malaria remains a leading cause of death worldwide: in 2021, there were an estimated 247 million cases and 619,000 deaths around the globe [1]. Despite the considerable burden of malaria, these numbers represent the substantial global progress made to control malaria in the last two decades. The WHO reports that 2 billion malaria cases and 11.7 million malaria deaths were averted globally from 2000 to 2021 [1]. The majority of these gains reflect an increase in vector control initiatives [2–4], the development of highly efficacious antimalarial combination therapies [5–7], and improved case management through the deployment of rapid diagnostic tests (RDTs) [8–13]. However, evidence indicates that progress has slowed and that there is a need for new approaches to capitalize on the gains already made [1].

One approach is the use of computational methods, which often rely on recent advances in genetic sequencing and provide an increased understanding of malaria biology, to help inform control efforts [14–16]. For example, molecular and genomic epidemiology surveillance tools, which have been developing rapidly, can track drug resistance and understand and evaluate control efforts [17, 18]. Moreover, computational methods have been applied to identify polyclonal infections (malaria infections with multiple distinct strains) [19, 20]. Such polyclonal infections introduce additional genetic complexity that is often difficult to account for computationally. As a result, many of the population genetic tools frequently applied to study other organisms are unsuitable for studying malaria, and researchers often rely on limiting genetic analyses to individuals who are monoclonally infected [21].

Although determining the most informative metrics is an active field of investigation [22], one important metric is the complexity of infection (COI). Sometimes referred to as multiplicity of infection, although this is generally reserved for infections within the same cell, the COI represents the number of genetically distinct malaria genomes or strains that can be identified in a particular individual. These polyclonal infections may arise from one or both of the following: (i) a single infectious mosquito feeds on a human host, transferring several genetically distinct parasite strains (often referred to as a co-transmission event [23, 24]) or (ii) two or more infectious mosquitoes with distinct malaria strains feed on an individual (known as a superinfection event [24, 25]). Measures of genetic diversity and the COI are increasingly used for inferring malaria transmission intensity and evaluating malaria control interventions [18]. Transmission intensity has been shown to impact the contribution of each event towards the generation of within-host parasite genetic diversity [22]. Superinfection is modulated by the host’s current infections [26], age, and exposure-acquired immunity [27]. Additionally, the COI provides a practical approach for identifying monoclonal infections to simplify genomic analyses.

Traditionally, the COI was measured in one or a few regions of the genome, relying on the enumeration of the maximal number of haplotypes detected through PCR amplification at genes or markers encoding highly diverse length polymorphisms. Two of the most common markers are the merozoite surface proteins 1 and 2 (msp1 and msp2), surface proteins found on the merozoite stage of the malarial parasite [28, 29]. Traditional methods are hindered by limitations on the number of loci examined [30, 31], and lack the ability to detect parasites at low parasitemia in samples [32]. Furthermore, Sanger sequencing lacks sensitivity for low parasitemia samples without employing laborious subcloning [33].

High-throughput sequencing provides more sensitive and specific methods. As a result, new computational methods have been developed to determine the COI. Two early proposed methods were the FwS metric by Auburn et al., which characterizes within-host diversity and its relationship to population-level diversity [30], and the estMOI software, which utilizes local phasing information of microhaplotypes within read pairs to estimate the COI [32]. Unfortunately, the FwS metric neither relates directly to the COI nor has a concrete biological interpretation, and the estMOI software relies on observed local haplotypes and heuristic interpretation. More recently, new tools have been developed to better measure the COI beyond the maximal observed haplotype. For instance, the DEploid software package uses haplotype structure to infer the number of strains, their relative proportions, and the haplotypes present in a sample [34]. However, it is known that DEploid under-predicts the COI for high COI infections [24]. Other methods have been developed to examine the relatedness between parasite strains [35, 36]. The current state-of-the-art method for determining the COI of a sample is THE REAL McCOIL, which is an extension of the COIL method [31]. THE REAL McCOIL employs a Bayesian approach, turning heterozygous data into estimates of allele frequency using Markov chain Monte Carlo methods and jointly estimating the likelihood of the COI [37].

Despite various methods for estimating the COI, no rapid, direct measures have been developed to work effectively on a set of loci or at the genome-wide level. In this work, we present two new methods that use easily calculable metrics to directly estimate the COI from allele frequency data. Our two methods closely resemble the categorical and proportional methods implemented in THE REAL McCOIL [37], yet are geared towards large numbers of loci and provide rapid estimates of the COI. Using our methods, we have developed the software package coiaf in the programming language R, a language and environment for statistical computing and graphics [38].

Materials and methods

Problem formulation

Current state-of-the-art approaches for estimating the COI rely on identifying the number of different parasites present in an infection using high-throughput sequencing. In monoclonal infections, i.e., infections composed of only one parasite strain (COI = 1), all sequence reads should be identical, originating from the same parasite strain. However, a combination of each parasite strain, proportional to the strain’s abundance, will contribute to the observed sequence reads in mixed infections. At genetic loci containing variation, there is an increased chance of observing multiple alleles as the number of unrelated parasite strains within an infection increases. Therefore, the likelihood of observing multiple alleles at any locus depends on the number of parasite strains in an infection and the prevalence of genetic polymorphisms in the population.

We focus on only biallelic SNPs—the vast majority of loci—and define the major allele as the most prevalent allele in a population. We note that any multiallelic site can be collapsed into a biallelic site, although some information will be lost. Assuming for any individual there are l biallelic loci, we define the population-level allele frequency (PLAF) as the mean within-sample allele frequencies (WSAF) of the alternate allele, i.e., the non-3D7 reference, at each locus across a population. We further introduce the population-level minor allele frequency (PLMAF), which first requires us to define the minor allele at each locus. The minor allele is defined as the alternate allele if PLAF ≤ 0.5, and the reference allele if PLAF > 0.5. The PLMAF is defined as a vector, p, of length l composed of the frequencies of the minor allele at each locus across a population, namely p = (p1,…, pl), where pi ∈ [0, 0.5]. Additionally, we define the within-sample minor allele frequency (WSMAF) as a vector, w, of length l composed of the frequencies of the population-level minor allele at each locus for a single individual infection. For instance, the WSMAF will be equal to one when all sequence reads observed at a given locus are of the population-level minor allele.
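
As an illustration of these definitions, the following Python sketch derives the PLAF, the PLMAF, and the WSMAF from a matrix of alternate-allele WSAF values. The function name and the list-of-lists input layout are assumptions made for this example and are not taken from coiaf; missing data are not handled.

```python
def plmaf_and_wsmaf(wsaf):
    """Derive the PLMAF vector and the per-sample WSMAF vectors.

    wsaf: list of samples, each a list of alternate-allele WSAF values
          (one per locus, each in [0, 1]). Hypothetical input layout.
    """
    n_samples = len(wsaf)
    n_loci = len(wsaf[0])
    # PLAF: mean alternate-allele WSAF across the population at each locus.
    plaf = [sum(sample[i] for sample in wsaf) / n_samples for i in range(n_loci)]
    # The minor allele is the alternate allele when PLAF <= 0.5,
    # and the reference allele otherwise.
    alt_is_minor = [f <= 0.5 for f in plaf]
    # PLMAF p_i lies in [0, 0.5] by construction.
    plmaf = [f if minor else 1.0 - f for f, minor in zip(plaf, alt_is_minor)]
    # WSMAF: within-sample frequency of the population-level minor allele.
    wsmaf = [
        [w if minor else 1.0 - w for w, minor in zip(sample, alt_is_minor)]
        for sample in wsaf
    ]
    return plmaf, wsmaf
```

In this sketch a locus whose alternate allele dominates the population (PLAF > 0.5) has its within-sample frequencies flipped, so that a WSMAF of one always means that every read carries the population-level minor allele.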

Variant and Frequency Methods

Variant Method

Our overall goal is to estimate the COI of the sample, denoted by k, using the WSMAF and the PLMAF. We do this by comparing observed data to derived expressions that define a relationship between the WSMAF, the PLMAF, and the COI. We present two alternative expressions that we refer to as the Variant Method and the Frequency Method.

In the Variant Method, we examine a set of SNPs and express the probability of SNP i being heterozygous with respect to the PLMAF and the COI. We define Vi, a Bernoulli random variable that takes the value of one if a site is heterozygous and zero otherwise. The probability that locus i is heterozygous, written as P(Vi=1), will be equal to one minus the probability that a locus is homozygous (see Appendix B in S1 Appendices). We thus write,

$$E[V_i] = P(V_i = 1) = 1 - p_i^k - (1 - p_i)^k. \tag{1}$$

As the COI increases, the probability of observing a heterozygous locus within an infection also increases (see Fig 1). We note that this is the same expression used within the categorical method of THE REAL McCOIL (Eq (2)) [37]. Similarly to the categorical method of THE REAL McCOIL, we assume that loci are independent. If this assumption is not met, the confidence intervals around COI estimates will be impacted, with bootstrapped confidence intervals increasing, but the bias in the estimated COI should not change, provided the model is otherwise well specified (see Appendix C.2 in S1 Appendices). This is the same outcome as for THE REAL McCOIL categorical method [37], and was first described in the earlier COIL method [31].
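
A short numerical check of Eq (1) makes the dependence on the COI concrete. The sketch below is written in Python purely for illustration; the function name is hypothetical.

```python
def prob_heterozygous(p, k):
    """Eq (1): probability that a locus with PLMAF p is heterozygous at COI k."""
    return 1.0 - p**k - (1.0 - p)**k


# With p = 0.25, the probability rises with the COI:
# k = 1 gives 0.0, k = 2 gives 0.375, k = 3 gives 0.5625, k = 4 gives 0.6796875.
values = [prob_heterozygous(0.25, k) for k in range(1, 5)]
```

A monoclonal infection (k = 1) can never be heterozygous under this expression, which is why a COI of 1 corresponds to the curve at zero.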

Fig 1. Flowchart of methods.

(A) The relationship between the WSMAF and the PLMAF is shown for an example simulation with a COI of 4. (B) Data have been processed so that loci are deemed variant if they are heterozygous and invariant otherwise. (C) Homozygous data have been filtered out. (D-E) Following the processing of data, Eqs (1) and (2) have been plotted for varying COIs from 1 to 4, respectively.

Frequency Method

In the Frequency Method, we focus on the expected value of the within-sample minor allele frequency. For the sake of simplicity, the complete derivation has been left to Appendix B in S1 Appendices. Briefly, we determine the probability of a particular strain carrying the minor allele and then determine the expected WSMAF by summing over the expected contribution of each strain. We represent the expected value of the within-sample minor allele frequency given that a site is heterozygous as follows:

$$E[W_i \mid V_i = 1] = \frac{p_i - p_i^k}{1 - p_i^k - (1 - p_i)^k}. \tag{2}$$
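
Eq (2) can be evaluated directly. The Python sketch below (function name hypothetical) also shows two properties that follow from the algebra: at k = 2 the expected WSMAF at a heterozygous site is 0.5 for any p, because the numerator p(1 - p) is half of the denominator 2p(1 - p), and at k = 1 the denominator is zero, so the expression is undefined.

```python
def expected_wsmaf_given_het(p, k):
    """Eq (2): expected WSMAF at a heterozygous locus with PLMAF p and COI k."""
    if k <= 1:
        # 1 - p - (1 - p) = 0, so the ratio is undefined for a COI of 1.
        raise ValueError("Eq (2) is undefined for k <= 1")
    numerator = p - p**k
    denominator = 1.0 - p**k - (1.0 - p)**k
    return numerator / denominator


half = expected_wsmaf_given_het(0.25, 2)      # 0.1875 / 0.375 = 0.5
lower = expected_wsmaf_given_het(0.25, 4)     # 63/174, roughly 0.362
```

As k grows, the value moves from 0.5 down towards p, since both p^k and (1 - p)^k vanish.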

Estimation method

Given data, D : {(pi, wi, di), i = 1, …, l}, where pi is the PLMAF at locus i, wi is the WSMAF at locus i, and di is the within-sample sequencing coverage at locus i, we next explore our methods to approximate the COI of a sample. Data are first processed to account for sequencing error. This process denotes loci at which there was suspected sequencing error as homozygous instead of heterozygous (for additional information, see Appendix D in S1 Appendices).
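
One simple way to realize such a correction is a threshold rule, sketched below in Python. This is an assumption made for illustration: the actual procedure is described in Appendix D, and the function names and the 0.01 threshold here are hypothetical.

```python
def mask_suspected_errors(wsmaf, seq_error=0.01):
    """Treat near-homozygous loci as homozygous (assumed threshold rule).

    A WSMAF within seq_error of 0 or 1 is taken to reflect sequencing error
    rather than a second allele, and is snapped to 0 or 1 respectively.
    """
    cleaned = []
    for w in wsmaf:
        if w <= seq_error:
            cleaned.append(0.0)
        elif w >= 1.0 - seq_error:
            cleaned.append(1.0)
        else:
            cleaned.append(w)
    return cleaned


def heterozygous_indicator(wsmaf):
    """v_i = 1 when both alleles are observed at locus i, else 0."""
    return [1 if 0.0 < w < 1.0 else 0 for w in wsmaf]
```

For example, under this rule a locus with a WSMAF of 0.005 would be counted as homozygous for the major allele, while a locus with a WSMAF of 0.3 would remain heterozygous.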

Following adjustment for sequence error, we consider an arbitrary data point (pi, wi, di). Recall that the Variant Method and the Frequency Method examine different random variables. Specifically, the Variant Method identifies the probability of a locus being heterozygous, P(Vi=1), and the Frequency Method identifies the expected value of the WSMAF given a site is heterozygous, E[Wi|Vi=1]. To determine the COI, we utilize Eqs (1) and (2). We solve the following weighted least squares minimization problem for the Variant Method:

$$\min_k \left( \sum_{i=1}^{l} \left( v_i - \left( 1 - p_i^k - (1 - p_i)^k \right) \right)^2 d_i \right), \tag{3}$$

where vi is defined as the value of the random variable Vi. Similarly, we solve the following weighted least squares minimization problem for the Frequency Method:

$$\min_k \left( \sum_{i=1}^{l} \left( v_i \left( w_i - \frac{p_i - p_i^k}{1 - p_i^k - (1 - p_i)^k} \right) \right)^2 d_i \right). \tag{4}$$

Note that the estimation methods described minimize the sum of squared residuals between the observed data and the relationships derived in Eqs (1) and (2).
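
The two objectives can be written as plain functions of k. The following Python sketch is illustrative only: the function names and argument layout are assumptions, and v_i is derived here from w_i as the indicator that a locus is heterozygous.

```python
def variant_objective(k, p, w, d):
    """Eq (3): coverage-weighted squared residuals for the Variant Method."""
    total = 0.0
    for p_i, w_i, d_i in zip(p, w, d):
        v_i = 1.0 if 0.0 < w_i < 1.0 else 0.0      # observed heterozygosity
        expected = 1.0 - p_i**k - (1.0 - p_i)**k   # Eq (1)
        total += (v_i - expected) ** 2 * d_i
    return total


def frequency_objective(k, p, w, d):
    """Eq (4): coverage-weighted squared residuals for the Frequency Method."""
    if k <= 1:
        raise ValueError("the Frequency Method is undefined for k <= 1")
    total = 0.0
    for p_i, w_i, d_i in zip(p, w, d):
        if not 0.0 < w_i < 1.0:
            continue  # v_i = 0: homozygous loci contribute nothing
        denominator = 1.0 - p_i**k - (1.0 - p_i)**k
        if denominator <= 0.0:
            continue  # p_i = 0 leaves Eq (2) undefined; skipped in this sketch
        expected = (p_i - p_i**k) / denominator     # Eq (2)
        total += (w_i - expected) ** 2 * d_i
    return total
```

Because v_i multiplies the residual in Eq (4), only heterozygous loci inform the Frequency Method, whereas every locus contributes to the Variant Method.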

Solution methods

We solve this optimization problem using two methods: (i) assuming discrete values of the COI and (ii) assuming continuous values of the COI. Recall that COI is defined as the number of genetically distinct malaria parasite strains an individual is infected with. While a continuous value of the COI has no direct biological interpretation, significant departures from discrete values are expected due to parasite relatedness and a range of other biological phenomena, including, but not limited to, overdispersion in sequencing and parasite densities. Therefore, a continuous value of the COI may provide a more accurate representation of the overall population of samples being studied. Furthermore, as relatedness in mixed infections is common [36], a continuous COI may provide insights into the degree of relatedness between parasite strains in mixed infections and detect highly-related polyclonal infections that may traditionally be categorized as monoclonal.

We use a brute force approach to solve the discrete versions of the previously defined optimization problems, which involves computing the objective function for each COI considered. As brute force approaches can be computationally inefficient, we limit the range of values of the COI. To solve the continuous versions of the optimization problems, we utilize R’s built-in optimization function [38]. In particular, we leverage a quasi-Newton L-BFGS-B approach with box constraints [39]. We set the lower and upper bounds of the COI as 1 and 25, respectively, with the default starting value of the COI equal to two. Note that in both the discrete and continuous case, the upper bound of the COI is much larger than most COIs seen in the real world [37, 40]. In both cases, we also provide the capability to determine the 95% confidence interval for our COI estimates by leveraging bootstrapping techniques [41] (see Appendix E in S1 Appendices for more details).
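
The solution strategies can be sketched for the Variant Method as follows. This Python example is not the coiaf implementation: a golden-section search over [1, 25] is swapped in for L-BFGS-B to keep the example dependency-free, the bootstrap uses a simple percentile interval, which may differ from the procedure in Appendix E, and all function names, including the noise-free `toy_data` helper, are hypothetical.

```python
import math
import random


def variant_objective(k, data):
    """Eq (3) for data given as (p_i, v_i, d_i) tuples."""
    return sum((v - (1.0 - p**k - (1.0 - p)**k)) ** 2 * d for p, v, d in data)


def solve_discrete(data, max_coi=25):
    """Brute force: evaluate the objective at every integer COI in 1..max_coi."""
    return min(range(1, max_coi + 1), key=lambda k: variant_objective(k, data))


def solve_continuous(data, lower=1.0, upper=25.0, tol=1e-6):
    """Golden-section search on [lower, upper]; a stand-in for L-BFGS-B."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lower, upper
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = variant_objective(c, data), variant_objective(d, data)
    while b - a > tol:
        if fc < fd:  # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = variant_objective(c, data)
        else:        # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = variant_objective(d, data)
    return (a + b) / 2.0


def bootstrap_ci(data, n_boot=100, seed=0):
    """Percentile 95% interval from resampling loci with replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        solve_discrete([rng.choice(data) for _ in data]) for _ in range(n_boot)
    )
    lo = estimates[int(0.025 * (n_boot - 1))]
    hi = estimates[int(math.ceil(0.975 * (n_boot - 1)))]
    return lo, hi


def toy_data(k_true, loci_per_p=200, depth=100):
    """Hypothetical noise-free data: heterozygous fraction matches Eq (1)."""
    data = []
    for p in (0.05, 0.1, 0.2, 0.3, 0.4, 0.5):
        n_het = round(loci_per_p * (1.0 - p**k_true - (1.0 - p)**k_true))
        data += [(p, 1, depth)] * n_het + [(p, 0, depth)] * (loci_per_p - n_het)
    return data
```

On data built with `toy_data(4)`, the discrete solver returns 4 and the continuous solver returns a value close to 4. Golden-section search assumes a single minimum on the interval, which holds for this toy data but is not guaranteed in general; this is one reason a gradient-based bounded optimizer with a chosen starting value is a reasonable choice in practice.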

Data

To evaluate the accuracy and sensitivity of our methods, we created a simulator that generates synthetic sequencing data for several individuals in a given population. In overview, each individual is assigned a COI value. The haplotype of each strain is then assigned by sampling from the population-level minor allele frequency. Next, we simulate the number of sequence reads mapped to the reference and alternative allele by sampling in proportion to the parasite densities for each strain. After simulating sequence error, the mapped sequence reads are then used to derive the within-sample minor allele frequency. A detailed description of our simulator can be found in Appendix G in S1 Appendices.
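
The overall flow of such a simulator can be sketched in a few lines of Python. This is a simplified illustration rather than the simulator described in Appendix G: the function name and parameters are hypothetical, strain proportions are assumed to follow a symmetric Dirichlet distribution (drawn as normalized gamma variates), and sequencing error is modeled as each read reporting the wrong allele with a fixed probability.

```python
import random


def simulate_sample(plmaf, coi, depth=200, alpha=1.0, seq_error=0.0, rng=None):
    """Simulate the WSMAF of one infection (illustrative assumptions only)."""
    rng = rng or random.Random()
    # Strain proportions: normalized gamma draws give a Dirichlet(alpha) sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(coi)]
    total = sum(draws)
    proportions = [x / total for x in draws]
    wsmaf = []
    for p in plmaf:
        # Each strain carries the minor allele with probability p (the PLMAF).
        carries_minor = [rng.random() < p for _ in range(coi)]
        true_freq = sum(q for q, c in zip(proportions, carries_minor) if c)
        # A read shows the minor allele if it truly is minor and is read
        # correctly, or truly is major and is misread.
        read_prob = true_freq * (1.0 - seq_error) + (1.0 - true_freq) * seq_error
        minor_reads = sum(rng.random() < read_prob for _ in range(depth))
        wsmaf.append(minor_reads / depth)
    return wsmaf
```

With a COI of 1 and no sequencing error, every simulated locus is homozygous (WSMAF of 0 or 1), matching the expectation for monoclonal infections stated in the problem formulation.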

In addition to simulated data, we use sequencing data sampled from infected individuals worldwide to compare our methods to the current state-of-the-art COI estimation metric and investigate the COI distribution across the world. We analyzed over 7,000 P. falciparum samples from 28 malaria-endemic countries in Africa, Asia, South America, and Oceania from 2002 to 2015 from the MalariaGEN Plasmodium falciparum Community Project [42]. Detailed information about the data release, including brief descriptions of contributing partner studies and study locations, is available in the supplementary of MalariaGEN et al. [42]. We used the provided variant call files (VCFs) generated using a standard analysis pipeline. The median read depth of coverage of the initially sequenced field isolates was 73 across all samples. After removing replicate samples, mixed-species samples, and samples with low coverage, suspected contamination, or mislabelling, 5,970 samples remained for further analysis. Genomic data were further filtered for high-quality biallelic coding and non-coding SNPs as outlined in Zhu et al. [36]. Additionally, data were filtered to sites that are part of the core genome.

To apply our developed methods, we must estimate the population-level frequency of the minor allele. Consequently, we sought to assign the samples to geographic regions such that each region contained enough samples for the reliable estimation of the population-level minor allele frequencies. We used the Partitioning Around Medoids (PAM) algorithm to solve a k-medoids clustering problem [43, 44] to group samples based on the longitude and latitude of sample collection. We next calculated the silhouette information for each clustering of k groups [45], arriving at 24 regions globally (see Appendix L.2 in S1 Appendices for a map of locations). Given these 24 clusters, we filtered SNPs to variants with a population-level alternative allele frequency greater than 0.005 in each region. The 0.005 frequency cutoff was chosen as sequence error likely obscures the detection of true variation from parasite strains comprising less than 0.5% of total parasite density. Clusters of data were additionally traced to a specific continent and subregion as defined by the World Development Indicators [46, 47].
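
The clustering step can be illustrated with a compact Python sketch of PAM (a greedy BUILD phase followed by SWAP) and of the mean silhouette width used to compare values of k. The example makes simplifying assumptions: distances between (longitude, latitude) pairs are Euclidean rather than geodesic, the function names are hypothetical, and the quadratic-time loops are only suitable for small inputs.

```python
import math


def dist(a, b):
    """Euclidean distance between (longitude, latitude) pairs (a simplification)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k):
    """Partitioning Around Medoids: greedy BUILD followed by SWAP."""
    n = len(points)
    medoids = []
    for _ in range(k):  # BUILD: add the medoid that lowers the cost most
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(min(candidates, key=lambda i: total_cost(points, medoids + [i])))
    improved = True
    while improved:  # SWAP: accept any medoid/non-medoid swap that lowers the cost
        improved = False
        best = total_cost(points, medoids)
        for slot in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:slot] + [cand] + medoids[slot + 1:]
                cost = total_cost(points, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda j: dist(p, points[medoids[j]])) for p in points]
    return medoids, labels


def mean_silhouette(points, labels):
    """Average silhouette width over all points."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        def mean_dist(c):
            ds = [dist(p, q) for j, q in enumerate(points) if labels[j] == c and j != i]
            return sum(ds) / len(ds) if ds else None
        a = mean_dist(labels[i])
        others = [mean_dist(c) for c in clusters if c != labels[i]]
        if a is None or not others:
            scores.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        b = min(others)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def choose_k(points, k_values):
    """Pick the number of clusters with the largest mean silhouette width."""
    return max(k_values, key=lambda k: mean_silhouette(points, pam(points, k)[1]))
```

Applied to three well-separated groups of collection sites, `choose_k` with candidate values 2 to 5 selects three clusters, since splitting a tight group or merging two distant ones lowers the silhouette width.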

Results

Performance on simulated data

We simulated data for 1,000 loci with a read depth of 200 at each locus using our simulator. Data were simulated with a complexity of infection ranging from 1 to 20. This simulation did not introduce error to determine optimal performance based on sampling. Our methods, therefore, accounted for no sequencing error. The results of running the discrete version of the Variant Method and the Frequency Method are illustrated in Fig 2A and 2B, respectively.

Fig 2. Estimating the COI on simulated data.

The performance of the Variant Method (A) and Frequency Method (B) is shown for 100 simulations of a COI of 1–20 with 1,000 loci, a read depth of 200, no error added to the simulations, and no sequencing error assumed. Point size indicates density, with the red line representing the line y = x. (C) The mean absolute error for each method is shown. The black bars indicate the 95% confidence interval.

The Variant Method and the Frequency Method perform well for all COIs between 1 and 20. Notably, the lower the true value of the COI, i.e., the COI that data was simulated with, the better our models perform with a mean absolute error close to 0. Our models exhibit more variability across subsequent iterations as the COI increases and underestimate the true COI. For example, at a COI of 20, the estimated COI from the Frequency Method ranges from 13 to 24—the maximum COI our model could output was 25 in these trials. Nevertheless, most predicted COIs remain close to the true COI as witnessed by the low mean absolute errors of at most 2.14 in Fig 2C. Comparing our two methods, we find that the Variant and the Frequency Methods perform equally with insignificantly different mean absolute errors (p-value = 0.174) and biases (p-value = 0.884).

Sensitivity analysis

To understand the sensitivity of our models to alterations in the parameters considered, we tested the performance of the discrete and continuous representations of the Variant Method and the Frequency Method, assessing changes in the accuracy of our predictions. For each sample, we utilized bootstrapping techniques [48, 49] to determine the mean absolute error and bias of the predicted COI compared to the true COI. Furthermore, we ran each algorithm several times to ensure reliable results. A description of several key parameters perturbed and their default values can be found in Appendix H in S1 Appendices. The resulting figures can be found in Appendix L.1 in S1 Appendices.

Here, we highlight the effect of varying two metrics that can be controlled in the field: the read depth at each locus and the number of loci sequenced. Sequencing more loci at larger read depths is preferred as this results in higher-quality data. In general, as the coverage at each locus increases, the performance of our methods also improves (see Fig K and L in S1 Appendices). A monotonic, non-linear relationship is observed between sequence coverage and the mean absolute error, with diminishing returns in performance observed with sequence coverage greater than 100. As was the case for coverage, when the number of loci sequenced is low, around 100 loci, our methods have high variability and tend to underpredict the true COI (see Fig M and N in S1 Appendices). Performance improves as the number of loci increases to 1,000, whereas increasing the number of loci beyond this threshold does not seem to substantially impact the performance of our models. Note, however, that an increase in the number of loci does reduce our estimate’s variance. The variance will continue to decrease as the number of loci included increases, provided that loci are independent of one another (see Appendix C.2 in S1 Appendices).

The performance of our methods was also impacted by parameters controlling the underlying malarial biology. In particular, as the relatedness between parasite strains increased, our estimation methods began to underpredict the true COI. However, with 10% relatedness between strains in mixed infections, our methods could still estimate the COI of up to 10 with a low mean absolute error of less than 1. Underprediction of the COI increased consistently with increasing relatedness, with a 50% relatedness resulting in the COI being similarly underestimated by 50% (see Fig R in S1 Appendices). Our methods also underpredicted the COI in simulations with overdispersed parasite densities (see Fig O and P in S1 Appendices), i.e., with more uneven parasite densities in mixed infections, and in simulations with overdispersed read counts (see Fig Q in S1 Appendices), i.e., where the observed within-sample minor allele frequency exhibited greater variance than would be expected under a Binomial distribution determined by the true within-sample minor allele frequency.

Comparison to state-of-the-art methods

In this section, we compare our novel methods to the current state-of-the-art method used to estimate the COI, THE REAL McCOIL [37]. To compare these methods, we simulated data across a range of COIs multiple times and evaluated how perturbing parameters influenced the accuracy of both THE REAL McCOIL and our methods. We performed five separate analyses, varying the coverage, the number of loci, the level of overdispersion in parasite density, the relatedness between parasite strains, and the error introduced into the simulated data. The results are presented in Appendix L.1.1 in S1 Appendices. Across each analysis, we found few differences between our software and THE REAL McCOIL. As the number of loci and the coverage increased, the performance of both software packages improved. Conversely, as greater levels of overdispersion, relatedness, and sequence error were introduced into the simulations, both methods underpredicted the true COI.

Notably, coiaf provided improvements in computational performance compared to THE REAL McCOIL (see Appendix I in S1 Appendices). When the run time of the two estimation methods was directly compared on simulated data, the run time of the point estimate generated by the discrete and continuous methods of coiaf remained constant even as the number of loci increased. Conversely, the run time of THE REAL McCOIL increased linearly. As previously mentioned, our software package provides the capability to estimate the 95% confidence interval for our COI estimates using a bootstrapping approach. When comparing THE REAL McCOIL against our methods with 100 bootstrap replicates, which was observed to be sufficient to capture the uncertainty in the COI (see Appendix F in S1 Appendices), both methods exhibited linear increases in computational time with increasing samples and loci. However, the linear increase in computational time associated with our methods was smaller than that exhibited by THE REAL McCOIL.

We additionally compared our methods to the current state-of-the-art method by examining estimates of the COI on real-world data. As previously described, we grouped our data into 24 regions worldwide. To estimate the COI for each of the 5,970 samples, we examined an average of 32,362 (range: 15,276–40,272) loci in each region (see Appendix J in S1 Appendices). Furthermore, we ran 5 repetitions of THE REAL McCOIL on each sample, with a burn-in period of 1,000 iterations followed by 5,000 sampling iterations, and used standard methodology to confirm convergence between Markov chain Monte Carlo chains [50]. For additional information, see Appendix K in S1 Appendices. Fig 3 examines the COI estimation of THE REAL McCOIL and coiaf for all samples. We note that the relationship between coiaf’s estimated COI and THE REAL McCOIL’s estimated COI for each of the 24 individual regions is shown in Fig AF and AG in S1 Appendices.

Fig 3. Comparison between THE REAL McCOIL and coiaf.

The COI estimation using (A) the Variant Method and (B) the Frequency Method is compared against THE REAL McCOIL. (C) The distribution of differences between our estimation and THE REAL McCOIL’s estimation is shown. This difference is computed by subtracting THE REAL McCOIL’s median estimation of the COI from our estimated value of the COI. The high density observed above 0 for the Frequency Method occurs because the Frequency Method is undefined for a COI of 1. Consequently, for samples that THE REAL McCOIL estimates as having a COI equal to 2, the distribution of our estimates of the COI using the Frequency Method is skewed greater than 2 (B), in contrast to the Variant Method, which exhibits lower skewness (A).

We observe that the Variant Method and the Frequency Method strongly correlate with the estimates from THE REAL McCOIL (Fig 3A and 3B). When the COI is estimated to be below 5, both methods estimate COI values close to one another. However, as the estimated COI increases, there is greater variability in predictions. At these high COI values, our methods tend to estimate the COI within 3 of THE REAL McCOIL’s estimate (Fig 3C). As expected, the continuous estimation methods align with the discrete estimation methods. Furthermore, we note that the Frequency Method does not show estimates when THE REAL McCOIL predicts a COI of 1. This is because the Frequency Method at a COI of 1 is undefined; at a COI of 1, there would be no heterozygous loci used in the Frequency Method. We fit linear regression models to the data to quantify the relationship between our novel software and the current state-of-the-art method and evaluated the Pearson correlation between estimation methods. The results are reported in Table 1 and indicate that each of the methods introduced in coiaf is highly correlated with THE REAL McCOIL.

Table 1. Relationship between coiaf and THE REAL McCOIL.

A linear regression model was fit to the data to evaluate the relationship between coiaf’s and THE REAL McCOIL’s estimation methods. Furthermore, the Pearson correlation between the estimated COIs was computed.

| coiaf Estimation Method | Linear Regression R² | Linear Regression P-value | Correlation |
|---|---|---|---|
| Discrete Variant Method | 0.840 | <0.001 | 0.916 |
| Continuous Variant Method | 0.883 | <0.001 | 0.940 |
| Discrete Frequency Method | 0.804 | <0.001 | 0.896 |
| Continuous Frequency Method | 0.844 | <0.001 | 0.919 |
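
The two summary statistics in Table 1 are closely linked: for a simple linear regression with one predictor, R² equals the square of the Pearson correlation. The reported values are consistent with this identity up to rounding (for instance, 0.940² ≈ 0.884 against a reported R² of 0.883). The Python sketch below computes both quantities from paired estimates; the function names are hypothetical, and which method is treated as the predictor is an assumption that does not affect R².

```python
import math


def pearson_r(x, y):
    """Pearson correlation between paired COI estimates."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def linear_regression_r2(x, y):
    """Ordinary least squares fit y = intercept + slope * x, with its R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot
```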

Mapping COI worldwide

To demonstrate coiaf’s utility and better understand global patterns of the COI, we examined the distribution of COI in 24 different regions. In each region, we studied an average of 248 (range: 29 to 909) samples and 32,362 loci (range: 15,276 to 40,272) (see Appendix J in S1 Appendices). Our samples can be traced to four different continents, with most originating in either Africa (55.5%) or Asia (41.8%). All other samples were sequenced in Oceania (2.03%) or the Americas (0.620%).

Fig 4 highlights the mean and median COI across all samples in each of the 24 regions outlined previously. We, furthermore, aimed to understand the relationship between the complexity of infection and malaria prevalence by leveraging estimates of malaria microscopy prevalence in children aged two to ten generated by the Malaria Atlas Project [8, 9, 51]. Fig 4C plots the density of the COI for each region sorted by the region’s malaria prevalence. Table 2 outlines the mean COI in each of the four continents and seven subregions analyzed.

Fig 4. COI across the globe.

The mean (A) and median (B) COI of all samples in each study location within the 24 regions is plotted. The color and size of each point represent the magnitude of the COI. (C) A density plot for each region, where the color of the plot indicates in what subregion the data was sampled. The plots are sorted by the median microscopy prevalence in children aged two to ten as estimated in the Malaria Atlas Project [8, 9, 51] and indicated to the right of each density plot. Map data was obtained from Natural Earth (medium scale data, 1:50m), which is in the public domain.

Table 2. Mean COI.

Mean COI across each continent and subregion analyzed.

| Continent | Subregion | Number of Samples | Mean COI (SD) |
|---|---|---|---|
| Africa | Eastern Africa | 739 | 1.88 (1.10) |
| Africa | Middle Africa | 579 | 1.87 (0.948) |
| Africa | Western Africa | 1,996 | 1.87 (1.05) |
| Americas | South America | 37 | 1.03 (0.164) |
| Asia | South-Eastern Asia | 2,421 | 1.25 (0.506) |
| Asia | Southern Asia | 77 | 1.52 (0.641) |
| Oceania | Melanesia | 121 | 1.20 (0.440) |

Of all the continents, Africa had the highest mean COI (1.87), followed by Asia (1.26), Oceania (1.20), and the Americas (1.03). A Nemenyi post-hoc test [52] indicated that while Africa is statistically different from all the other continents (p-value: <0.001 in all cases), there are no significant differences between any pairing of the other three continents (Americas vs. Asia p-value: 0.204, Americas vs. Oceania p-value: 0.568, Asia vs. Oceania p-value: 0.816). We also compared subregions within each continent. No statistically significant difference was found between any pair of the three subregions in Africa (Eastern vs. Middle p-value: 0.914, Eastern vs. Western p-value: >0.999, Middle vs. Western p-value: 0.886). However, in Asia, there was a statistically significant difference between the mean COI in South-Eastern Asia and Southern Asia (p-value: 0.0282).
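The continent-level means quoted above can be related to the subregion rows of Table 2 with a short calculation. The sketch below assumes that each continent mean is the sample-weighted average of its subregion means; this is an assumption made for illustration, not a description of how the authors computed the values, and the function name is hypothetical.

```python
# Subregion rows copied from Table 2: (continent, subregion, n_samples, mean_coi).
TABLE_2 = [
    ("Africa", "Eastern Africa", 739, 1.88),
    ("Africa", "Middle Africa", 579, 1.87),
    ("Africa", "Western Africa", 1996, 1.87),
    ("Americas", "South America", 37, 1.03),
    ("Asia", "South-Eastern Asia", 2421, 1.25),
    ("Asia", "Southern Asia", 77, 1.52),
    ("Oceania", "Melanesia", 121, 1.20),
]


def continent_means(rows):
    """Pool subregion means into continent means, weighting by sample count."""
    totals = {}
    for continent, _subregion, n_samples, mean_coi in rows:
        n_sum, weighted_sum = totals.get(continent, (0, 0.0))
        totals[continent] = (n_sum + n_samples, weighted_sum + n_samples * mean_coi)
    return {c: weighted_sum / n_sum for c, (n_sum, weighted_sum) in totals.items()}


pooled = continent_means(TABLE_2)
# Print continents from highest to lowest pooled mean COI.
for continent, mean_coi in sorted(pooled.items(), key=lambda item: -item[1]):
    print(f"{continent}: {mean_coi:.2f}")
```

Under this assumption, the pooled values round to the continent means reported in the text, which suggests that the Asian mean of 1.26 is dominated by the much larger South-Eastern Asia sample.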

We found a weak positive correlation between the COI and the microscopy prevalence (Table 3). In regions with a lower prevalence, there were few samples with a COI larger than 2. In fact, in regions with a prevalence less than or equal to 0.01, more than 95% of samples had a COI of 1 or 2. In regions with a higher prevalence, there were more samples with a higher COI. In particular, in regions where the prevalence was greater than or equal to 0.1, more than 20% of samples had a COI greater than 2.

Table 3. Relationship between coiaf and malaria prevalence.

A linear regression model was fit to the data to evaluate the relationship between COI and prevalence. Furthermore, the Pearson correlation was examined.

coiaf Estimation Method | Linear Regression R2 | Linear Regression P-value | Pearson Correlation
Discrete Variant Method | 0.0770 | <0.001 | 0.278
Continuous Variant Method | 0.0787 | <0.001 | 0.281
Discrete Frequency Method | 0.0190 | <0.001 | 0.138
Continuous Frequency Method | 0.0218 | <0.001 | 0.148
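For a linear regression with a single predictor, the R2 equals the square of the Pearson correlation, so the two columns of Table 3 carry the same information. The sketch below checks this identity on a small set of made-up numbers (the `prevalence` and `coi` lists are hypothetical, not data from the study) and then confirms that the values copied from Table 3 agree with it up to rounding.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ols_r_squared(x, y):
    """R2 of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data, for illustration only.
prevalence = [0.01, 0.02, 0.05, 0.10, 0.15, 0.20, 0.30, 0.40]
coi = [1, 1, 2, 1, 3, 1, 2, 4]

r = pearson_r(prevalence, coi)
r2 = ols_r_squared(prevalence, coi)

# (R2, Pearson correlation) pairs copied from Table 3.
TABLE_3 = {
    "Discrete Variant Method": (0.0770, 0.278),
    "Continuous Variant Method": (0.0787, 0.281),
    "Discrete Frequency Method": (0.0190, 0.138),
    "Continuous Frequency Method": (0.0218, 0.148),
}
max_gap = max(abs(corr ** 2 - r_sq) for r_sq, corr in TABLE_3.values())
print(f"toy data: r^2 = {r ** 2:.4f}, R2 = {r2:.4f}; largest Table 3 gap = {max_gap:.4f}")
```

In practical terms, an R2 below 0.08 means that prevalence accounts for less than 8% of the variance in the estimated COI in this linear model, even though the association is statistically significant.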

Discussion

Despite advances in sequencing technologies and the development of various methods for estimating the COI, no direct measures have been developed to rapidly estimate the COI. In this work, we present two novel methods for directly estimating the COI based on minor allele frequencies. Our methods can provide rapid and accurate estimates. Compared to the current state-of-the-art estimation software, THE REAL McCOIL, coiaf generated similar point estimates faster, even when conducting 100 bootstrap replicates. The ability to produce point estimates in under a second can provide researchers with immediate information on the COI. Moreover, the performance of our methods suggests that they can be scaled to use whole-genome data with smaller increases in computational time than exhibited by THE REAL McCOIL. Our methods also provide a continuous measure that may provide insight into relatedness.
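To make the idea of a direct estimate concrete, the following toy sketch picks the integer COI whose expected heterozygosity curve best matches a sample's variant calls. It is not the coiaf implementation (the derivations of the Variant Method and the Frequency Method are in S1 Appendices). It assumes that strains carry the minor allele independently with probability equal to the PLMAF, that variant calls are error free, and it uses hypothetical function names and simulation settings.

```python
import random


def expected_variant_prob(plmaf, coi):
    """P(locus is heterozygous) when `coi` strains draw alleles independently."""
    return 1.0 - plmaf ** coi - (1.0 - plmaf) ** coi


def estimate_coi(plmafs, is_variant, max_coi=10):
    """Return the integer COI minimizing squared error against the 0/1 variant calls."""
    def sum_squared_error(coi):
        return sum(
            (v - expected_variant_prob(p, coi)) ** 2
            for p, v in zip(plmafs, is_variant)
        )
    return min(range(1, max_coi + 1), key=sum_squared_error)


def simulate_sample(true_coi, n_loci, rng):
    """Simulate PLMAFs and error-free variant calls for one sample (toy model)."""
    plmafs = [rng.uniform(0.05, 0.5) for _ in range(n_loci)]
    is_variant = []
    for p in plmafs:
        alleles = [rng.random() < p for _ in range(true_coi)]
        # A locus is a variant site if the strains do not all share one allele.
        is_variant.append(int(any(alleles) and not all(alleles)))
    return plmafs, is_variant


rng = random.Random(42)
plmafs, is_variant = simulate_sample(true_coi=3, n_loci=5000, rng=rng)
estimated = estimate_coi(plmafs, is_variant)
print("estimated COI:", estimated)
```

Because the fit is a single pass over the loci for each candidate COI, a calculation of this kind is cheap, which is consistent with the sub-second point estimates described above. Real data add read-depth noise, sequencing error, and related strains, none of which this sketch models.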

Through several simulations, we further explored how changing key sequencing variables, such as the number of loci and read depth at each locus, altered our software’s performance. We showed that for samples with a low and moderate COI, our methods could accurately predict the COI even with low coverage and a small number of loci. However, as the COI increased, these parameters became more important—a lack of sufficient sequencing corresponded with an underprediction of the true COI. Additionally, we demonstrated that several important factors could influence results, such as sequencing error or overdispersion in parasite density. Importantly, we also show that the population mean WSMAF is an unbiased estimator of the PLMAF (see Appendix C.1 in S1 Appendices). This finding requires certain assumptions which may not be met, for example, for loci associated with drug resistance and when sampling selectively from individuals following drug treatment. Nevertheless, this provides further advantages to using allelic read depth for COI estimation rather than haplotype calls, which are known to lead to biased estimates of the population-level allele frequency if the COI of samples is not accounted for [53].
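The contrast between the two allele frequency estimators can be illustrated with a small simulation. The sketch below is a toy model, not the proof in Appendix C.1: it assumes that each strain carries the minor allele independently with probability equal to the PLMAF, that strain proportions follow a flat Dirichlet distribution, and that within-sample frequencies are observed without read sampling noise. All function names and parameter values are hypothetical.

```python
import random


def simulate_wsmaf(plmaf, coi, rng):
    """Within-sample frequency of the minor allele at one locus (toy model)."""
    # Strain proportions: normalized Gamma(1, 1) draws, i.e. a flat Dirichlet.
    weights = [rng.gammavariate(1.0, 1.0) for _ in range(coi)]
    total = sum(weights)
    # Each strain carries the minor allele independently with probability plmaf.
    carries_minor = [rng.random() < plmaf for _ in range(coi)]
    return sum(w for w, c in zip(weights, carries_minor) if c) / total


def compare_estimators(plmaf=0.2, n_samples=20000, max_coi=4, seed=1):
    """Return (mean WSMAF, fraction of samples in which the allele is present)."""
    rng = random.Random(seed)
    wsmafs = [
        simulate_wsmaf(plmaf, rng.randint(1, max_coi), rng)
        for _ in range(n_samples)
    ]
    mean_wsmaf = sum(wsmafs) / n_samples
    presence = sum(w > 0 for w in wsmafs) / n_samples
    return mean_wsmaf, presence


mean_wsmaf, presence = compare_estimators()
print(f"mean WSMAF = {mean_wsmaf:.3f}, presence-based estimate = {presence:.3f}")
```

In this toy model the mean WSMAF stays close to the true PLMAF of 0.2, while counting the samples in which the allele is detected overestimates it, because a sample with COI k carries the allele with probability 1 - (1 - p)^k rather than p.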

An application of coiaf on several thousand P. falciparum samples from malaria-endemic countries in four continents from 2002–2015 [42] resulted in a comprehensive map of the complexity of infection worldwide. This study builds on previous reviews of the distribution of the COI globally [54] and is the first study to our knowledge to provide a holistic view of the COI based on allelic read depth as opposed to traditional methods leveraging msp1 and msp2 haplotyping.

In general, our results were in agreement with previously reported findings. For instance, we estimated lower average COI values in areas with historically lower malaria prevalence, such as South America and Southern Asia. In particular, in the Americas and Asia, we report a mean COI of 1.03 and 1.26, respectively. Previous efforts to estimate the COI in these regions have also found similarly low values. While directly comparing our estimates against these would be incorrect as the date and location of sample collection are very different, we are encouraged by the similarity in estimates. For example, in Brazil, the COI has been previously measured as low as 1.1 in the early 1990s [55]. In Papua New Guinea [40, 56, 57], Bangladesh [58], and Malaysia [59], previous estimates of the COI range between 1.00 and 2.12, 1.22 and 1.58, and 1.20 and 1.37, respectively. In Africa, on the other hand, we were surprised to find little to no difference in average COI estimates between the three subregions we studied: Eastern Africa, Middle Africa, and Western Africa, despite large differences in average malaria prevalence in these regions. In contrast, previous studies of the COI in African settings have found higher average COI values in some regions and, in general, greater variability. For example, in Cameroon, large mean COIs between 2.33 and 3.82 have been reported [60–62]. Conversely, in Ghana, the mean COI has been reported to be between 1.13 and 1.91 in 2012–2013 [63].

Much of the work surrounding the complexity of infection is motivated by the fact that COI has been proposed as an indicator of transmission intensity. Unfortunately, the relationship between COI and malaria prevalence remains an area of much debate, with many individual studies finding different relationships [64, 65]. Lopez and Koepfli report in their review article that across the 153 studies examined, there was a weak correlation between the mean COI and prevalence, an observation that agrees with our findings [54]. As previously noted, multiple patient-level factors (e.g., age and clinical status) may affect the relationship between malaria prevalence and COI. Additionally, Karl et al. suggested that this weak correlation may be attributed to spatial effects and the existence of geographic “hotspots,” where transmission may be much higher than in surrounding areas, causing individuals to have a greater COI [66]. Moreover, multiple studies have highlighted that seasonality affects the observed COI [67, 68]. Consequently, while there is undoubtedly a relationship between transmission intensity and COI, it is important to be aware of how many factors (age, clinical status, seasonality, spatial effects, parasite density, time since the last infection, and methods of detecting multiple infections) may impact this relationship. For example, the only metadata available when analyzing the data from the MalariaGEN Plasmodium falciparum Community Project [42] are the year and location of sample collection. Without being able to account for the other factors that impact the COI, we cannot make more meaningful interpretations of our analysis of COI patterns. Lastly, it is worth recalling that malaria prevalence is not directly related to transmission intensity. For example, two regions with the same malaria prevalence will likely have different transmission intensities if intervention coverage differs between the regions. Therefore, the variance observed between malaria prevalence and COI may reflect that malaria prevalence is itself an imperfect predictor of transmission intensity.

Our work is not without limitations. In part, limitations stem from our methods relying on certain biological assumptions, which may not be met in the real world. Additionally, the accuracy of our algorithms is impacted by sequence error. While this was not an issue in our analysis of the MalariaGEN Plasmodium falciparum Community Project (see Fig AK in S1 Appendices), high levels of sequence error need to be monitored and accounted for. While our software package allows users to account for this by providing a level of suspected sequencing error, sequence error is unlikely to be constant across the genome, and accurate inference of sequencing error is an active research challenge [69]. Furthermore, while our methods account for the coverage at each locus, if there is a low overall coverage for a sample, our results may underpredict the true COI. Similarly, many modern genotyping approaches rely on a pre-amplification step before genotyping or sequencing. The amplification of higher parasitemia samples may obscure lower prevalence parasite strains and result in lower inferred COI estimates. Lastly, our methods assume that the population-level minor allele frequency (PLMAF) is well captured by the samples provided. Sampling bias, undersampling, or complex spatial patterns of allele frequencies resulting from the spatial landscape of malaria transmission and heterogeneities in transmission intensities in a region may result in an inaccurate PLMAF estimation, which may influence our estimated COI.
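The effect of low coverage on detecting a minority strain can be illustrated with a simple binomial calculation. The sketch below assumes that reads at a locus are sampled independently, that there is no sequencing error, and that only the minority strain carries the allele in question. The function name and the 5% strain fraction are hypothetical choices for illustration.

```python
def prob_minor_strain_missed(strain_fraction, coverage):
    """Probability that none of `coverage` reads comes from the minority strain."""
    return (1.0 - strain_fraction) ** coverage


# A strain making up 5% of the infection, observed at increasing read depth.
for coverage in (10, 20, 50, 100):
    missed = prob_minor_strain_missed(0.05, coverage)
    print(f"coverage {coverage:>3}: P(no supporting reads) = {missed:.3f}")
```

At a depth of 20 reads, a strain at 5% frequency leaves no supporting reads at roughly a third of such loci, so those loci look monoclonal and the inferred COI is pulled downward. Pre-amplification that further skews strain proportions would have a similar effect by lowering the effective strain fraction.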

In conclusion, we developed two direct measures for estimating the COI given the within-sample allele frequency of a sample and the population-level allele frequency. We also add a new understanding of how population-level allele frequencies can be estimated without bias by relying on continuous estimates of within-sample allele frequencies. Our methods could estimate the COI for samples in less than a second and were accurate compared to simulated data and current COI estimation techniques. Our software will aid in estimating the complexity of infection, an increasingly important population genetic metric for inferring malaria transmission intensity and evaluating malaria control interventions [18, 70].

Supporting information

S1 Appendices. Appendices.

Includes the complete problem formulation, including derivations of the Variant Method and the Frequency Method, additional information about our methods and software package, and all omitted figures.

(PDF)

Acknowledgments

We thank the MalariaGEN Plasmodium falciparum Community Project for maintaining a large collection of sequencing data and variant calls.

Data Availability

Documentation for coiaf can be found at https://bailey-lab.github.io/coiaf. Additional code used to analyze data and create figures can be found at https://github.com/bailey-lab/coiaf-manuscript-work (DOI: 10.5281/zenodo.7931661) and https://github.com/bailey-lab/coiaf-real-data (DOI: 10.5281/zenodo.7826507).

Funding Statement

All authors (AP, OJW, OA, RV, JAB) acknowledge funding from the National Institutes of Health (NIH) and the National Institute of Allergy and Infectious Diseases (NIAID) (reference R01AI139520). RV additionally acknowledges funding from the MRC Centre for Global Infectious Disease Analysis (reference MR/R015600/1), jointly funded by the UK Medical Research Council (MRC) and the UK Foreign, Commonwealth & Development Office (FCDO), under the MRC/FCDO Concordat agreement and is also part of the EDCTP2 programme supported by the European Union. OJW was further supported by a Schmidt Science Fellowship in partnership with the Rhodes Trust. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. World Health Organization. World malaria report 2022. World Health Organization; 2022.
  • 2. Benelli G, Beier JC. Current vector control challenges in the fight against malaria. Acta Tropica. 2017;174:91–96. doi: 10.1016/j.actatropica.2017.06.028
  • 3. Raghavendra K, Barik TK, Reddy BPN, Sharma P, Dash AP. Malaria vector control: from past to future. Parasitology Research. 2011;108(4):757–779. doi: 10.1007/s00436-010-2232-0
  • 4. Takken W, Knols BGJ. Malaria vector control: current and future strategies. Trends in Parasitology. 2009;25(3):101–104. doi: 10.1016/j.pt.2008.12.002
  • 5. Mutabingwa TK. Artemisinin-based combination therapies (ACTs): best hope for malaria treatment but inaccessible to the needy! Acta Tropica. 2005;95(3):305–315.
  • 6. White NJ. Qinghaosu (Artemisinin): The Price of Success. Science. 2008;320(5874):330–334. doi: 10.1126/science.1155165
  • 7. Lin JT, Juliano JJ, Wongsrichanalai C. Drug-Resistant Malaria: The Era of ACT. Current infectious disease reports. 2010;12(3):165–173. doi: 10.1007/s11908-010-0099-y
  • 8. Hay SI, Snow RW. The Malaria Atlas Project: Developing Global Maps of Malaria Risk. PLOS Medicine. 2006;3(12):e473. doi: 10.1371/journal.pmed.0030473
  • 9. Pfeffer D, Lucas T, May D, Harris J, Rozier J, Twohig K, et al. malariaAtlas: an R interface to global malariometric data hosted by the Malaria Atlas Project. Malaria Journal. 2018;17(1):352. doi: 10.1186/s12936-018-2500-5
  • 10. Murray CK, Bell D, Gasser RA, Wongsrichanalai C. Rapid diagnostic testing for malaria. Tropical Medicine & International Health. 2003;8(10):876–883. doi: 10.1046/j.1365-3156.2003.01115.x
  • 11. Mouatcho JC, Goldring JPD. Malaria rapid diagnostic tests: challenges and prospects. Journal of Medical Microbiology. 2013;62(10):1491–1505. doi: 10.1099/jmm.0.052506-0
  • 12. Centers for Disease Control and Prevention. CDC—Malaria—Diagnosis & Treatment (United States)—Diagnosis (U.S.); 2019. Available from: https://www.cdc.gov/malaria/diagnosis_treatment/diagnosis.html.
  • 13. Watson OJ, Slater HC, Verity R, Parr JB, Mwandagalirwa MK, Tshefu A, et al. Modelling the drivers of the spread of Plasmodium falciparum hrp2 gene deletions in sub-Saharan Africa. eLife. 2017;6:e25008. doi: 10.7554/eLife.25008
  • 14. Andrade BB, Reis-Filho A, Barros AM, Souza-Neto SM, Nogueira LL, Fukutani KF, et al. Towards a precise test for malaria diagnosis in the Brazilian Amazon: comparison among field microscopy, a rapid diagnostic test, nested PCR, and a computational expert system based on artificial neural networks. Malaria Journal. 2010;9(1):117. doi: 10.1186/1475-2875-9-117
  • 15. Band G, Le QS, Clarke GM, Kivinen K, Hubbart C, Jeffreys AE, et al. Insights into malaria susceptibility using genome-wide data on 17,000 individuals from Africa, Asia and Oceania. Nature Communications. 2019;10(1):5732. doi: 10.1038/s41467-019-13480-z
  • 16. Timmann C, Thye T, Vens M, Evans J, May J, Ehmen C, et al. Genome-wide association study indicates two novel resistance loci for severe malaria. Nature. 2012;489(7416):443–446. doi: 10.1038/nature11334
  • 17. Kümpornsin K, Kochakarn T, Chookajorn T. The resistome and genomic reconnaissance in the age of malaria elimination. Disease Models & Mechanisms. 2019;12(12). doi: 10.1242/dmm.040717
  • 18. Daniels RF, Schaffner SF, Wenger EA, Proctor JL, Chang HH, Wong W, et al. Modeling malaria genomics reveals transmission decline and rebound in Senegal. Proceedings of the National Academy of Sciences. 2015;112(22):7067–7072. doi: 10.1073/pnas.1505691112
  • 19. Bushman M, Morton L, Duah N, Quashie N, Abuaku B, Koram KA, et al. Within-host competition and drug resistance in the human malaria parasite Plasmodium falciparum. Proceedings of the Royal Society B: Biological Sciences. 2016;283(1826):20153038. doi: 10.1098/rspb.2015.3038
  • 20. Birger RB, Kouyos RD, Cohen T, Griffiths EC, Huijben S, Mina MJ, et al. The potential impact of coinfection on antimicrobial chemotherapy and drug resistance. Trends in Microbiology. 2015;23(9):537–544. doi: 10.1016/j.tim.2015.05.002
  • 21. Verity R, Aydemir O, Brazeau NF, Watson OJ, Hathaway NJ, Mwandagalirwa MK, et al. The impact of antimalarial resistance on the genetic structure of Plasmodium falciparum in the DRC. Nature Communications. 2020;11(1):2107. doi: 10.1038/s41467-020-15779-8
  • 22. Watson OJ, Okell LC, Hellewell J, Slater HC, Unwin HJT, Omedo I, et al. Evaluating the Performance of Malaria Genetics for Inferring Changes in Transmission Intensity Using Transmission Modeling. Molecular Biology and Evolution. 2021;38(1):274–289. doi: 10.1093/molbev/msaa225
  • 23. Wong W, Griggs AD, Daniels RF, Schaffner SF, Ndiaye D, Bei AK, et al. Genetic relatedness analysis reveals the cotransmission of genetically related Plasmodium falciparum parasites in Thiès, Senegal. Genome Medicine. 2017;9. doi: 10.1186/s13073-017-0398-0
  • 24. Nkhoma SC, Trevino SG, Gorena KM, Nair S, Khoswe S, Jett C, et al. Co-transmission of Related Malaria Parasite Lineages Shapes Within-Host Parasite Diversity. Cell Host & Microbe. 2020;27(1):93–103.e4. doi: 10.1016/j.chom.2019.12.001
  • 25. Portugal S, Drakesmith H, Mota MM. Superinfection in malaria: Plasmodium shows its iron will. EMBO Reports. 2011;12(12):1233–1242. doi: 10.1038/embor.2011.213
  • 26. Portugal S, Carret C, Recker M, Armitage AE, Gonçalves LA, Epiphanio S, et al. Host mediated regulation of superinfection in malaria. Nature medicine. 2011;17(6):732–737. doi: 10.1038/nm.2368
  • 27. Rodriguez-Barraquer I, Arinaitwe E, Jagannathan P, Kamya MR, Rosenthal PJ, Rek J, et al. Quantification of anti-parasite and anti-disease immunity to malaria as a function of age and exposure. eLife. 2018;7:e35832. doi: 10.7554/eLife.35832
  • 28. Snounou G, Beck HP. The Use of PCR Genotyping in the Assessment of Recrudescence or Reinfection after Antimalarial Drug Treatment. Parasitology Today. 1998;14(11):462–467. doi: 10.1016/S0169-4758(98)01340-4
  • 29. Konaté L, Zwetyenga J, Rogier C, Bischoff E, Fontenille D, Tall A, et al. 5. Variation of Plasmodium falciparum msp1 block 2 and msp2 allele prevalence and of infection complexity in two neighbouring Senegalese villages with different transmission conditions. Transactions of The Royal Society of Tropical Medicine and Hygiene. 1999;93(Supplement_1):21–28.
  • 30. Auburn S, Campino S, Miotto O, Djimde AA, Zongo I, Manske M, et al. Characterization of Within-Host Plasmodium falciparum Diversity Using Next-Generation Sequence Data. PLOS ONE. 2012;7(2):e32891. doi: 10.1371/journal.pone.0032891
  • 31. Galinsky K, Valim C, Salmier A, de Thoisy B, Musset L, Legrand E, et al. COIL: a methodology for evaluating malarial complexity of infection using likelihood from single nucleotide polymorphism data. Malaria Journal. 2015;14(1):4. doi: 10.1186/1475-2875-14-4
  • 32. Assefa SA, Preston MD, Campino S, Ocholla H, Sutherland CJ, Clark TG. estMOI: estimating multiplicity of infection using parasite deep sequencing data. Bioinformatics. 2014;30(9):1292–1294. doi: 10.1093/bioinformatics/btu005
  • 33. Hu T, Chitnis N, Monos D, Dinh A. Next-generation sequencing technologies: An overview. Human Immunology. 2021;82(11):801–811. doi: 10.1016/j.humimm.2021.02.012
  • 34. Zhu SJ, Almagro-Garcia J, McVean G. Deconvolution of multiple infections in Plasmodium falciparum from high throughput sequencing data. Bioinformatics. 2018;34(1):9–15. doi: 10.1093/bioinformatics/btx530
  • 35. Wong W, Wenger EA, Hartl DL, Wirth DF. Modeling the genetic relatedness of Plasmodium falciparum parasites following meiotic recombination and cotransmission. PLOS Computational Biology. 2018;14(1):e1005923. doi: 10.1371/journal.pcbi.1005923
  • 36. Zhu SJ, Hendry JA, Almagro-Garcia J, Pearson RD, Amato R, Miles A, et al. The origins and relatedness structure of mixed infections vary with local prevalence of P. falciparum malaria. eLife. 2019;8:e40845. doi: 10.7554/eLife.40845
  • 37. Chang HH, Worby CJ, Yeka A, Nankabirwa J, Kamya MR, Staedke SG, et al. THE REAL McCOIL: A method for the concurrent estimation of the complexity of infection and SNP allele frequency for malaria parasites. PLOS Computational Biology. 2017;13(1):e1005348. doi: 10.1371/journal.pcbi.1005348
  • 38. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2020. Available from: https://www.R-project.org/.
  • 39. Byrd RH, Lu P, Nocedal J, Zhu C. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific Computing. 1995;16(5):1190–1208. doi: 10.1137/0916069
  • 40. Fola AA, Harrison GLA, Hazairin MH, Barnadas C, Hetzel MW, Iga J, et al. Higher Complexity of Infection and Genetic Diversity of Plasmodium vivax Than Plasmodium falciparum across all Malaria Transmission Zones of Papua New Guinea. The American Journal of Tropical Medicine and Hygiene. 2017;96(3):630–641. doi: 10.4269/ajtmh.16-0716
  • 41. Davison AC, Hinkley DV. Bootstrap methods and their applications. Cambridge: Cambridge University Press; 1997. Available from: http://statwww.epfl.ch/davison/BMA/.
  • 42. MalariaGEN, Ahouidi A, Ali M, Almagro-Garcia J, Amambua-Ngwa A, Amaratunga C, et al. An open dataset of Plasmodium falciparum genome variation in 7,000 worldwide samples. Wellcome Open Research. 2021;6:42. doi: 10.12688/wellcomeopenres.16168.1
  • 43. Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. 1st ed. Hoboken, N.J: Wiley-Interscience; 2005.
  • 44. Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K. cluster: Cluster Analysis Basics and Extensions; 2021. Available from: https://CRAN.R-project.org/package=cluster.
  • 45. Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. 1987;20:53–65. doi: 10.1016/0377-0427(87)90125-7
  • 46. World Bank. World Development Indicators. The World Bank.
  • 47. Arel-Bundock V, Enevoldsen N, Yetman C. countrycode: An R package to convert country names and country codes. Journal of Open Source Software. 2018;3(28):848. doi: 10.21105/joss.00848
  • 48. Mooney CZ, Mooney CF, Mooney CL, Duval RD, Duvall R. Bootstrapping: A Nonparametric Approach to Statistical Inference. SAGE; 1993.
  • 49. DiCiccio TJ, Efron B. Bootstrap confidence intervals. Statistical Science. 1996;11(3):189–228. doi: 10.1214/ss/1032280214
  • 50. Gelman A, Rubin DB. Markov chain Monte Carlo methods in biostatistics. Statistical Methods in Medical Research. 1996;5(4):339–355. doi: 10.1177/096228029600500402
  • 51. Weiss DJ, Lucas TCD, Nguyen M, Nandi AK, Bisanzio D, Battle KE, et al. Mapping the global prevalence, incidence, and mortality of Plasmodium falciparum, 2000–17: a spatial and temporal modelling study. The Lancet. 2019;394(10195):322–331. doi: 10.1016/S0140-6736(19)31097-9
  • 52. Nemenyi PB. Distribution-free multiple comparisons. Princeton University; 1963.
  • 53. Hastings IM, Nsanzabana C, Smith TA. A Comparison of Methods to Detect and Quantify the Markers of Antimalarial Drug Resistance. The American Journal of Tropical Medicine and Hygiene. 2010;83(3):489–495. doi: 10.4269/ajtmh.2010.10-0072
  • 54. Lopez L, Koepfli C. Systematic review of Plasmodium falciparum and Plasmodium vivax polyclonal infections: Impact of prevalence, study population characteristics, and laboratory procedures. PLoS ONE. 2021;16(6):e0249382. doi: 10.1371/journal.pone.0249382
  • 55. Sallenave-Sales S, Daubersies P, Mercereau-Puijalon O, Rahimalala L, Contamin H, Druilhe P, et al. Plasmodium falciparum: a comparative analysis of the genetic diversity in malaria-mesoendemic areas of Brazil and Madagascar. Parasitology Research. 2000;86(8):692–698. doi: 10.1007/PL00008554
  • 56. Mita T, Hombhanje F, Takahashi N, Sekihara M, Yamauchi M, Tsukahara T, et al. Rapid selection of sulphadoxine-resistant Plasmodium falciparum and its effect on within-population genetic diversity in Papua New Guinea. Scientific Reports. 2018;8:5565. doi: 10.1038/s41598-018-23811-7
  • 57. Barry AE, Schultz L, Senn N, Nale J, Kiniboro B, Siba PM, et al. High Levels of Genetic Diversity of Plasmodium falciparum Populations in Papua New Guinea despite Variable Infection Prevalence. The American Journal of Tropical Medicine and Hygiene. 2013;88(4):718–725. doi: 10.4269/ajtmh.12-0056
  • 58. Alam MS, Elahi R, Mohon AN, Al-Amin HM, Kibria MG, Khan WA, et al. Plasmodium falciparum Genetic Diversity in Bangladesh Does Not Suggest a Hypoendemic Population Structure. The American Journal of Tropical Medicine and Hygiene. 2016;94(6):1245–1250. doi: 10.4269/ajtmh.15-0446
  • 59. Atroosh WM, Al-Mekhlafi HM, Mahdy MA, Saif-Ali R, Al-Mekhlafi AM, Surin J. Genetic diversity of Plasmodium falciparum isolates from Pahang, Malaysia based on MSP-1 and MSP-2 genes. Parasites & Vectors. 2011;4(1):233. doi: 10.1186/1756-3305-4-233
  • 60. Apinjoh TO, Tata RB, Anchang-Kimbi JK, Chi HF, Fon EM, Mugri RN, et al. Plasmodium falciparum merozoite surface protein 1 block 2 gene polymorphism in field isolates along the slope of mount Cameroon: a cross-sectional study. BMC Infectious Diseases. 2015;15(1):309. doi: 10.1186/s12879-015-1066-x
  • 61. Roman DNR, Anne NNR, Singh V, Luther KMM, Chantal NEM, Albert MS. Role of genetic factors and ethnicity on the multiplicity of Plasmodium falciparum infection in children with asymptomatic malaria in Yaoundé, Cameroon. Heliyon. 2018;4(8):e00760. doi: 10.1016/j.heliyon.2018.e00760
  • 62. Metoh TN, Chen JH, Fon-Gah P, Zhou X, Moyou-Somo R, Zhou XN. Genetic diversity of Plasmodium falciparum and genetic profile in children affected by uncomplicated malaria in Cameroon. Malaria Journal. 2020;19(1):115. doi: 10.1186/s12936-020-03161-4
  • 63. Duah NO, Matrevi SA, Quashie NB, Abuaku B, Koram KA. Genetic diversity of Plasmodium falciparum isolates from uncomplicated malaria cases in Ghana over a decade. Parasites & Vectors. 2016;9(1):416. doi: 10.1186/s13071-016-1692-1
  • 64. Bei AK, Niang M, Deme AB, Daniels RF, Sarr FD, Sokhna C, et al. Dramatic Changes in Malaria Population Genetic Complexity in Dielmo and Ndiop, Senegal, Revealed Using Genomic Surveillance. The Journal of Infectious Diseases. 2018;217(4):622–627. doi: 10.1093/infdis/jix580
  • 65. Pacheco MA, Forero-Peña DA, Schneider KA, Chavero M, Gamardo A, Figuera L, et al. Malaria in Venezuela: changes in the complexity of infection reflects the increment in transmission intensity. Malaria Journal. 2020;19(1):176. doi: 10.1186/s12936-020-03247-z
  • 66. Karl S, White MT, Milne GJ, Gurarie D, Hay SI, Barry AE, et al. Spatial Effects on the Multiplicity of Plasmodium falciparum Infections. PLOS ONE. 2016;11(10):e0164054. doi: 10.1371/journal.pone.0164054
  • 67. Watson OJ, Verity R, Ghani AC, Garske T, Cunningham J, Tshefu A, et al. Impact of seasonal variations in Plasmodium falciparum malaria transmission on the surveillance of pfhrp2 gene deletions. eLife. 2019;8:e40339. doi: 10.7554/eLife.40339
  • 68. Sondo P, Derra K, Rouamba T, Nakanabo Diallo S, Taconet P, Kazienga A, et al. Determinants of Plasmodium falciparum multiplicity of infection and genetic diversity in Burkina Faso. Parasites & Vectors. 2020;13:427. doi: 10.1186/s13071-020-04302-z
  • 69. Zhu X, Wang J, Peng B, Shete S. Empirical estimation of sequencing error rates using smoothing splines. BMC Bioinformatics. 2016;17. doi: 10.1186/s12859-016-1052-3
  • 70. Zhong D, Koepfli C, Cui L, Yan G. Molecular approaches to determine the multiplicity of Plasmodium infections. Malaria Journal. 2018;17(1):172. doi: 10.1186/s12936-018-2322-5
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010247.r001

Decision Letter 0

Thomas Leitner, Daniel B Larremore

25 Jul 2022

Dear Dr Bailey,

Thank you very much for submitting your manuscript "coiaf: directly estimating complexity of infection with allele frequencies" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

In particular, the reviews, as a whole, suggest important directions that would be valuable for the authors to explore, namely (i) evaluating and discussing when/whether the complexity/memory advantages of coiaf are retained if one chooses to draw bootstrapped samples, (ii) the circumstances under which compute & memory are critical, and (iii) the conditions required for statistical consistency. I also thank the authors for submitting a paper whose clarity of writing allowed all reviewers and editor to engage fully with the manuscript.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Daniel B Larremore, Ph.D.

Associate Editor

PLOS Computational Biology

Thomas Leitner

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Uploaded as attachment

Reviewer #2: Uploaded as attachment

Reviewer #3: In this manuscript, Paschalidis and colleagues present a new method for estimating the complexity of infection (COI) of Plasmodium infections. COI is a foundational metric in Plasmodium genetics/genomics, and its estimation is an integral part of all bioinformatic analyses. In addition, as the authors discuss, it is increasingly being considered as part of the genomic epidemiology toolkit for tracking changes in transmission. In this clear, well-written manuscript, the authors give a good overview of current tools in the field, including their strengths and limitations. They then present their own method, packaged as coiaf, as being a faster and more scalable alternative to the current tool of choice, TheRealMcCoil.

Currently, the comparison between TheRealMcCoil and coiaf is limited to sequence data from a set of globally distributed samples. The two methods perform similarly but not identically, and I am left wondering how the accuracy of the two approaches compares. I suggest that the authors include a head-to-head comparison of coiaf and TheRealMcCoil on the simulated data as well, showing both COI and population level allele frequency estimates for the two methods.

In regards to assessing accuracy, I also suggest that the authors expand the simulated data results presented in the main text. I appreciated the wider range of parameters explored in the Supplement, and readers would benefit from seeing these data in the main results. In particular, I think it is worth highlighting the effects of strain proportions, read depth, and overdispersion of read counts.

Fig 1 in the appendix shows that coiaf produces an unbiased estimator of population level allele frequencies; however, accuracy is low, especially in the region of typical sample sizes (<=100). I would like to see these estimates compared to those produced using current common approaches: (1) estimating allele frequencies from putatively monoclonal samples only and (2) using TheRealMcCoil. Whether highly accurate maf estimates are necessary for COI estimation is another matter. The authors could demonstrate that their method provides sufficient accuracy for estimating COI without the added computational expense of accurately estimating population mafs. In this case, however, I would suggest they clarify this in the text.

In the absence of evidence that coiaf is more accurate than TheRealMcCoil, coiaf's increased computational efficiency strikes me as the most compelling benefit over the other. I would suggest highlighting this in the main Results (with a figure even). The authors might also consider giving increased emphasis to coiaf's ability to handle whole-genome data, although given the diminishing returns observed after crossing the 1000 loci threshold, this may not be a major benefit.

The R package itself was easy to install and run (although a few small comments are below).

In sum, I find this new approach for estimating COI in Plasmodium infections to be compelling and am particularly impressed by the speed and scalability. I, however, suggest a more thorough head-to-head comparison against TheRealMcCoil and a greater exploration of parameters like strain proportions, read depth, and read overdispersion in the main text. This is needed to present a strong case for coiaf becoming a new tool of choice.

Small comments and suggestions:

- The authors state that continuous COI estimates can provide insight into the relatedness of parasites in complex infections. In theory, I agree with this; in practice, I do not know how easy it is to interpret such numbers vis-a-vis other factors like low read coverage and genotyping error. Simulating and analyzing data to this effect would be a useful addition to this manuscript as I agree that this is an area of interest for the field.

- In Fig 1, the points are so dense that it is hard to derive any information from these plots. If the goal is to visualize the analytical steps, a cartoon schematic might be easier to interpret. (This is just a suggestion)

- The DePloidIBD paper (https://doi.org/10.7554/eLife.40845) does include information on the global distribution of Ke ('effective number of strains'). While this isn't equal to COI in a strict sense, it is highly related. These results could be compared to the global COI estimates made here.

- Judging from the performance of the script, "minor allele freq" is really "non-reference allele freq"

- It appears the simulations assume all loci are unlinked. This is a reasonable and common assumption, but I suggest making this explicit in the text since WGS and some targeted sequencing approaches generate linked variants. (I did see that LD was addressed in the appendix in regards to running TheRealMcCoil.)

- Many genotyping approaches now rely on a pre-amplification step prior to genotyping/sequencing. The (potential) effects of this could be mentioned in the Discussion.

- Do you suggest that users take the COI probabilities into account? Does some filtration at this level increase accuracy?

- On the GitHub page, the links to the vignettes are broken

- A simple manual would be a helpful addition. For instance, the text mentions incorporating genotyping error into the model, but it was unclear from the examples how to do this short of editing the main code. I also never found the code used for bootstrapping CIs.

- Running the 'method 2, continuous' example in 'example_real_data.Rmd', all COI values were over-inflated (including many with COI=25)

I found a few small errors in 'example_real_data.Rmd' (NB: I was using the latest release, not the dev version)

line 57, 85, 100, 128: a character string [0,1] is requested for coi_method

line 56, 84, 99, 127: expected tibble values were wsaf and plaf

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Attachment

Submitted filename: coiaf.docx

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010247.r003

Decision Letter 1

Thomas Leitner, Daniel B Larremore

9 Feb 2023

Dear Dr Bailey,

Thank you very much for submitting your manuscript "coiaf: directly estimating complexity of infection with allele frequencies" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Daniel B Larremore, Ph.D.

Academic Editor

PLOS Computational Biology

Thomas Leitner

Section Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: uploaded as an attachment

Reviewer #2: The authors have sufficiently addressed my major concerns through direct comparisons to THE REAL McCOIL on simulated data. I only have minor comments regarding some of the supplementary figures and minor copy editing.

1. line 118 appears to have the beginning cut off

2. In the figures comparing performance with THE REAL McCOIL using simulated data, how many bootstrapped samples are being used to estimate the 95% CI for the coiaf methods? If it's not the default setting of 100, then this should be mentioned, and the plots comparing speeds should be done with the same number of bootstrapped samples.

3. The difference between epsilon and sequencing error in the simulations is unclear to me. I see that epsilon is varied when evaluating the performance of the different estimation approaches, and the figure text (L.1 fig 10) explains that sequencing error is fixed at 1%. However, in the comparisons to THE REAL McCOIL (L.1.1 fig 5), the text states that the sequencing error is varied. Is this distinct from varying epsilon? I ask because the values chosen when varying sequencing error in the comparison to THE REAL McCOIL match the values of epsilon varied earlier (0.05, 0.01, 0.015); however, the performance of coiaf is wildly different between the two figures. This same concern applies to estimating PLAF as well, but that is not evaluated earlier.

Reviewer #3: In this revision, Paschalidis and colleagues have given thoughtful consideration to the reviewers' comments and substantially strengthened the manuscript. In my opinion, the current manuscript clearly presents sufficient detail for readers to understand and evaluate this new COI estimation method in relation to currently used approaches.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Attachment

Submitted filename: revised_review.docx

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010247.r005

Decision Letter 2

Thomas Leitner, Daniel B Larremore

1 May 2023

Dear Dr Bailey,

We are pleased to inform you that your manuscript 'coiaf: directly estimating complexity of infection with allele frequencies' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Daniel B Larremore, Ph.D.

Academic Editor

PLOS Computational Biology

Thomas Leitner

Section Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010247.r006

Acceptance letter

Thomas Leitner, Daniel B Larremore

5 Jun 2023

PCOMPBIOL-D-22-00780R2

coiaf: directly estimating complexity of infection with allele frequencies

Dear Dr Bailey,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Bernadett Koltai

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendices. Appendices.

    Includes the complete problem formulation, including derivations of the Variant Method and the Frequency Method, additional information about our methods and software package, and all omitted figures.

    (PDF)

    Attachment

    Submitted filename: coiaf.docx

    Attachment

    Submitted filename: Response to Reviewers.pdf

    Attachment

    Submitted filename: revised_review.docx

    Attachment

    Submitted filename: Response to Reviewers.pdf

    Data Availability Statement

    Documentation for coiaf can be found at https://bailey-lab.github.io/coiaf. Additional code used to analyze data and create figures can be found at https://github.com/bailey-lab/coiaf-manuscript-work (DOI: 10.5281/zenodo.7931661) and https://github.com/bailey-lab/coiaf-real-data (DOI: 10.5281/zenodo.7826507).


    Articles from PLOS Computational Biology are provided here courtesy of PLOS
