Abstract
Pathogen genomics is a powerful tool for tracking infectious disease transmission. In malaria, identity-by-descent is used to assess the genetic relatedness between parasites and has been used to study transmission and importation. In theory, identity-by-descent can be used to distinguish genealogical relationships to reconstruct transmission history or identify parasites for QTL experiments. MalKinID (Malaria Kinship Identifier) is a new classification model designed to identify genealogical relationships among malaria parasites based on genome-wide identity-by-descent proportions and identity-by-descent segment distributions. MalKinID was calibrated to the genomic data from 3 laboratory-based genetic crosses (yielding 440 parent-child and 9060 full-sibling comparisons). MalKinID identified lab-generated F1 progeny with >80% sensitivity and showed that 0.39 (95% CI 0.28, 0.49) of the second-generation progeny of a NF54 and NHP4026 cross were F1s and 0.56 (0.45, 0.67) were backcrosses of an F1 with the parental NF54 strain. In simulated outcrossed importations, MalKinID reconstructs genealogy history with high precision and sensitivity, with F1-scores exceeding 0.84. However, when importation involves inbreeding, such as during serial co-transmission, the precision and sensitivity of MalKinID declined, with F1-scores (the harmonic mean of precision and sensitivity) of 0.76 (0.56, 0.92) and 0.23 (0.0, 0.4) for parent-child and full-sibling and <0.05 for second-degree and third-degree relatives. Disentangling inbred relationships required adapting MalKinID to perform multisample comparisons. Genealogical inference is most powered when (1) outcrossing is the norm or (2) multisample comparisons based on a predefined pedigree are used. MalKinID lays the foundations for using identity-by-descent to track parasite transmission history and for separating progeny for quantitative-trait-locus experiments.
Keywords: malaria, genetic relatedness, identity-by-descent, inference, genealogy
MalKinID is a new classification model designed to identify malaria genealogical relationships using identity-by-descent (IBD). Malaria genomic epidemiologists have utilized IBD to quantify genetic relatedness, evaluate differences in transmission, and characterize parasite interpopulation movement. However, interpreting the IBD and genetic relatedness between parasites can be challenging. This study developed a classification model that leverages genetic relatedness to infer genealogical relationships to aid in the interpretation of the genetic relatedness networks currently used by the field and provide a genetics-based foundation for reconstructing malaria transmission history.
Introduction
Plasmodium falciparum is a single-cell, eukaryotic parasite that is responsible for one of the deadliest forms of malaria. Global eradication efforts have drastically altered the transmission landscape of P. falciparum, and malaria parasite genomics has emerged as a powerful tool for assessing ongoing transmission and the effectiveness of public health interventions (Ashton et al. 2020; Wong et al. 2024). However, the population genomics of malaria is unusual among pathogens because the parasite must sexually reproduce and recombine in a mosquito vector during transmission. Each parasite strain reflects up to 2 different genealogical histories, one from each parent, and can be genetically related to other individuals in the population.
The genetic relatedness of malaria parasites is a robust metric for malaria genomic epidemiology studies and has been used to evaluate changes in transmission (Neafsey and Volkman 2017; Taylor et al. 2017; Neafsey et al. 2021), distinguish between hypotheses (superinfection vs co-transmission) of multiple strain (polygenomic) infection formation (Nkhoma et al. 2012, 2018, 2020; Nair et al. 2014; Wong et al. 2017, 2018, 2022), detect differences in transmission structure (Schaffner et al. 2023), and identify regions of the genome undergoing strong, directional selection (Schaffner et al. 2023). Genetic relatedness is defined as the proportion of the genome between 2 individuals that shows identity-by-descent (IBD) and is thus inherited from the same recent common ancestor (Fig. 1).
Fig. 1.
a) Definition of genetic relatedness as the proportion of the genome that is considered IBD. IBD is defined as a section of a genome that is shared between 2 individuals and inherited from the same recent common ancestor. b) Generation of genetically related malaria parasites. Genetically related parasites first appear when a mosquito bites a superinfected individual with unrelated parasite strains. This provides parasites their first opportunity to outcross (mate with a genetically unrelated strain) and produce genetically related, but not inbred, progeny that are then transmitted to the next individual. If multiple, genetically distinct but related parasites are transmitted, the transmission event is referred to as co-transmission. Subsequent transmission and serial co-transmission events provide opportunities for inbreeding (mating between genetically related strains) and can generate genetically related, inbred progeny.
Despite its use in malaria genomic epidemiology, parasite genetic relatedness can be difficult to interpret. Partially related parasites are the result of a complex series of transmission events that involve a mix of both superinfection and co-transmission (Fig. 1b). Superinfection occurs when an individual is infected with multiple strains from separate mosquito bites. The strains within this initial superinfection represent a random sampling of parasites from the population and are assumed to be unrelated (Nkhoma et al. 2012, 2020; Wong et al. 2017, 2022). Mosquitoes feeding on this superinfected polygenomic infection provide the parasite with its first opportunity to outcross and generate genetically related progeny. Co-transmission occurs when the mosquito feeds on the superinfected infection and transmits multiple, often genetically related parasites into a new host. Subsequent feedings on these co-transmitted polygenomic infections enable inbreeding and the generation of inbred progeny. As such, the genetic relatedness of parasites depends on transmission history and how superinfection and co-transmission occur in the population.
New approaches that can disentangle the complex relationship between transmission history and genetic relatedness could be valuable for future malaria genomic epidemiology studies. In theory, transmission lineages could be reconstructed by identifying parent-child relatives in the population. Genealogical inference has been used in conservation biology and ecology to reconstruct pedigrees, study recent demographic history, and monitor wild animal population movement (Blouin 2003; Kirkpatrick et al. 2011; Staples et al. 2014; Huisman 2017; Jones and Manseau 2022). Malaria genealogical inference would also be useful for establishing inbred parasite lines for laboratory-based quantitative-trait-locus (QTL) experiments (Daley and Shepherd 2008; Solberg Woods 2014) designed to identify molecular determinants of antimalarial drug resistance.
However, genealogical inference is a surprisingly nuanced problem in malaria. One consequence of its haploid genome is that progeny parasites are not guaranteed to inherit half of their genomes from each parent (Wong et al. 2018). Another consequence is that genetically distinct meiotic siblings from the same oocyst, a sac-like structure that forms in the mosquito midgut after zygosis and contains the immediate, haploid products of meiosis, have an expected relatedness of 0.33 (Wong et al. 2018; Nkhoma et al. 2020). Meiotic siblings are distinct from full siblings, which have an expected relatedness of 0.5, and occur when 2 genetically distinct haploid parasites are sampled from 2 different oocysts that involve the zygosis and meiosis of the same parental strains (Wong et al. 2018; Nkhoma et al. 2020).
MalKinID (Malaria Kinship Identifier) is a new pseudolikelihood-based classification model that infers the genealogical relationship between parasite strains based on the genome-wide IBD proportion and the per-chromosome maximum IBD segment block (IBDmax) and IBD segment count (nsegment) distributions. The pseudolikelihoods used in MalKinID were based on the patterns observed in simulated parasites generated using a previously developed meiosis model with obligate chiasma formation and crossover interference (Wong et al. 2018) that was calibrated to whole genome sequencing data obtained from 3 laboratory-based genetic crosses (NF54 × NHP4026, MKK2835 × NHP1337, and MAL31 × KH004) (Fig. 2).
Fig. 2.
Study design. MalKinID infers the genealogical relationship (G) between 2 parasites based on rtotal, IBDmax, and nsegment. The pseudolikelihood is built by using a calibrated meiosis model that includes obligate chiasma formation and crossover interference to simulate the distributions of rtotal, IBDmax, and nsegment for each genealogical relationship.
Materials and methods
Meiosis simulation
Genetically related progeny were simulated using a previously published meiosis model that incorporates obligate chiasma formation and crossover interference (Wong et al. 2018). Briefly, the model relies on 2 parameters, the crossover rate (k) that quantifies the kilobase pairs per centimorgan and a crossover interference parameter (v) that determines how closely crossover events can be placed next to one another on a 4-chromatid bundle. A full specification of the meiosis model can be found in our previously published paper (Wong et al. 2018) and a brief description of the meiosis model, along with full details regarding the meiosis model recalibration are presented in Supplementary File 1.
Pseudolikelihood model structure
A pseudolikelihood model was used to determine the most likely genealogical relationship between parasite pairs based on (1) the total genetic relatedness (rtotal), (2) the size of the largest IBD segment per chromosome (IBDmax), and (3) the number of IBD segments per chromosome (nsegment). Both rtotal and IBDmax are bound between 0 and 1 because rtotal was defined as the fraction of the genome that is IBD and IBDmax was normalized and defined as the proportion of the chromosome the segment sits on. The structure of the pseudolikelihood model [Equation (1)] was defined as follows:
$$L(G \mid r_{\mathrm{total}}, \mathrm{IBD}_{\max}, n_{\mathrm{segment}}) = P(r_{\mathrm{total}} \mid G)\prod_{i=1}^{14} P(\mathrm{IBD}_{\max,i} \mid G)\,P(n_{\mathrm{segment},i} \mid G) \tag{1}$$
where i refers to the i-th chromosome in the genome and G refers to the examined genealogical relationship (Table 1). Note that Equation (1) shows that the pseudolikelihood model utilizes 29 different features: 1 genome-wide rtotal value, plus 1 IBDmax and 1 nsegment value for each of the 14 chromosomes in the P. falciparum genome.
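As a minimal sketch of how the 29 terms in Equation (1) combine, the function below sums log-probabilities rather than multiplying raw probabilities, which is a common way to avoid numerical underflow and is an assumption here, not a documented detail of MalKinID. The function name and argument names are hypothetical.

```python
import math

N_CHROMOSOMES = 14  # nuclear chromosomes in the P. falciparum genome


def log_pseudolikelihood(p_rtotal, p_ibdmax, p_nsegment):
    """Log of Equation (1) for one candidate relationship G.

    p_rtotal   -- P(rtotal | G), a single value
    p_ibdmax   -- list of 14 values P(IBDmax_i | G), one per chromosome
    p_nsegment -- list of 14 values P(nsegment_i | G), one per chromosome
    """
    if len(p_ibdmax) != N_CHROMOSOMES or len(p_nsegment) != N_CHROMOSOMES:
        raise ValueError("expected one value per chromosome")
    terms = [p_rtotal, *p_ibdmax, *p_nsegment]  # 1 + 14 + 14 = 29 features
    if any(t <= 0 for t in terms):
        return float("-inf")  # a zero term makes the whole product zero
    return sum(math.log(t) for t in terms)
```

Because the features are treated as independent, each term can be evaluated separately and the relationship with the largest sum is the most likely one.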
Table 1.
The genealogical relationships examined in this study and their expected genetic relatedness values.
| Kinship | Degree | Expected genetic relatedness |
|---|---|---|
| ***Parent-child (PC)*** | ***First*** | ***0.5*** |
| Full siblings (FS) | First | 0.5 |
| Meiotic sibling (MS) | First | 0.33 |
| ***Grandparent-grandchild (GC)*** | ***Second*** | ***0.25*** |
| Half siblings (HS) | Second | 0.25 |
| Full avuncular (FAV and FAV.MS) | Second | 0.25/0.166 |
| ***Great grandparent-great grandchild (GGC)*** | ***Third*** | ***0.125*** |
| Half avuncular (HAV) | Third | 0.125 |
| Full cousins (FCS and FCS.MS) | Third | 0.125/0.0830 |
| Unrelated | | 0 |
Bold and italicized rows emphasize the relationships that represent vertical genealogical descent.
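P(rtotal | G) was defined using a Beta distribution:
$$P(r_{\mathrm{total}} \mid G) = \frac{r_{\mathrm{total}}^{\alpha_G - 1}\,(1 - r_{\mathrm{total}})^{\beta_G - 1}}{B(\alpha_G, \beta_G)} \tag{2}$$
whose shape parameters, α_G and β_G, were fit to the simulated rtotal values for each genealogical relationship using the scipy.stats.beta.fit function from the scipy package (v 1.7.3) for Python 3, and where B is the Beta function. The location and scale parameters for the scipy.stats.beta.fit function were set to 0 and 1, respectively.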
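The sketch below illustrates fitting a Beta distribution on [0, 1] to simulated relatedness values and evaluating its log-density. It deliberately swaps in a method-of-moments estimator written with the standard library, not the maximum likelihood fit from scipy.stats.beta.fit that the study used, so the estimates will differ slightly. The Beta(20, 20) draws are placeholder data, not output of the meiosis model, and all function names are hypothetical.

```python
import math
import random
import statistics


def fit_beta_moments(values):
    """Method-of-moments estimates of the two Beta shape parameters."""
    m = statistics.fmean(values)
    v = statistics.variance(values)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def beta_logpdf(x, a, b):
    """Log-density of Beta(a, b) at x, for 0 < x < 1 (location 0, scale 1)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm


# Placeholder "simulated" relatedness values centred on 0.5.
rng = random.Random(0)
simulated = [rng.betavariate(20, 20) for _ in range(5000)]
a_hat, b_hat = fit_beta_moments(simulated)
```

One such pair of shape parameters would be stored per genealogical relationship and reused whenever an observed rtotal needs to be scored.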
P(IBDmax,i | G) was defined using a kernel density estimator (KDE) with spikes (Tataru et al. 2015; Guerrero Montero and Blythe 2023) as follows:
$$P(\mathrm{IBD}_{\max,i} \mid G) = P(0)\,\delta(\mathrm{IBD}_{\max,i}) + P(1)\,\delta(1 - \mathrm{IBD}_{\max,i}) + \left[1 - P(0) - P(1)\right] K_G(\mathrm{IBD}_{\max,i}) \tag{3}$$
where P(0) is the probability that IBDmax,i is 0 (no IBD segment on chromosome i) for the genealogical relationship G, P(1) is the probability that IBDmax,i is 1 (the IBD segment spans the length of chromosome i), δ is a Dirac delta function, and K_G is a KDE that was fit to the remaining data where IBDmax,i > 0 and IBDmax,i < 1 using the KernelDensity class in the Python 3 package sklearn.neighbors (v1.2.2).
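To make the spike-and-KDE construction concrete, here is a minimal standard-library sketch. It uses a hand-rolled Gaussian kernel in place of sklearn's KernelDensity, and the bandwidth of 0.05 is an arbitrary assumption because the bandwidth used in the study is not stated here. The function name is hypothetical.

```python
import math


def fit_spike_kde(values, bandwidth=0.05):
    """Return a function scoring a normalized max-segment value in [0, 1].

    Exact 0s and 1s are treated as point masses (the "spikes"); everything
    strictly between them is scored by a Gaussian KDE that is down-weighted
    by the probability mass not already assigned to the spikes.
    """
    n = len(values)
    p0 = sum(v == 0 for v in values) / n
    p1 = sum(v == 1 for v in values) / n
    interior = [v for v in values if 0 < v < 1]

    def kde(x):
        if not interior:
            return 0.0
        norm = len(interior) * bandwidth * math.sqrt(2 * math.pi)
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in interior) / norm

    def prob(x):
        if x == 0:
            return p0
        if x == 1:
            return p1
        return (1 - p0 - p1) * kde(x)

    return prob
```

Note that the spikes return probabilities while the interior returns a density, which mirrors the mixed discrete-continuous form of the model described above.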
P(nsegment,i | G) was defined as a probability mass function with add-one additive smoothing:
$$P(n_{\mathrm{segment},i} = x \mid G) = \hat{\theta}_x \tag{4.1}$$
$$\hat{\theta}_x = \frac{c + \alpha}{N + \alpha d} \tag{4.2}$$
where θ̂_x is the smoothed probability of observing nsegment,i = x, α is a pseudocount used for smoothing and set to 1, N is the number of trials examined, d is the number of categories, and c is the number of simulations where nsegment,i = x. N was set to 5,000 (the number of simulations run), and d was the number of simulated categories for nsegment,i plus 1. The last index of d represents a miscellaneous category for any segment count that was not observed in the simulation.
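The additive smoothing described above can be sketched as follows. The function name is hypothetical, and treating every unobserved count as a single shared catch-all category is how this sketch reads the description of the extra category.

```python
from collections import Counter


def smoothed_pmf(simulated_counts, alpha=1):
    """Additively smoothed probability mass function for segment counts.

    simulated_counts -- one IBD segment count per simulation (N values)
    alpha            -- pseudocount added to every category
    Returns (prob, d), where d is the number of observed categories plus
    one catch-all category for counts never seen in the simulations.
    """
    counts = Counter(simulated_counts)
    n_trials = len(simulated_counts)
    d = len(counts) + 1
    denom = n_trials + alpha * d

    def prob(x):
        # Unobserved counts fall into the catch-all category with c = 0.
        return (counts.get(x, 0) + alpha) / denom

    return prob, d
```

With α = 1, a segment count that never appeared in the simulations still receives a small nonzero probability, so a single unusual chromosome cannot drive the whole pseudolikelihood to zero.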
Classification
Classification was based on maximum likelihood analysis. The classification with the highest likelihood was chosen as the most likely hypothesis. The general likelihood ratio test was used to identify the most likely classification from a set of proposed genealogical relationships:
$$\Lambda = \frac{\max_{H_0 \in \Theta_0} L(H_0)}{\max_{H_A \in \Theta_A} L(H_A)} \tag{5}$$
where H_0 is a null hypothesis belonging to a set of null hypotheses in Θ_0 and H_A is an alternative hypothesis belonging to a set of alternative hypotheses in Θ_A.
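A minimal sketch of the classification step, assuming a log-pseudolikelihood has already been computed for every candidate relationship. The function name and the example values are hypothetical; returning the gap to the runner-up as a log-likelihood ratio is one simple way to express how decisively the best class wins, not necessarily how MalKinID reports it.

```python
def classify(log_likelihoods):
    """Pick the relationship with the highest log-pseudolikelihood.

    log_likelihoods -- dict mapping relationship label to log-pseudolikelihood
                       (must contain at least 2 candidates)
    Returns (best_label, log_ratio), where log_ratio is the log-likelihood
    difference between the best and second-best candidates.
    """
    ranked = sorted(log_likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_ll), (_, second_ll) = ranked[0], ranked[1]
    return best, best_ll - second_ll


# Hypothetical values for one parasite pair.
label, log_ratio = classify({"PC": -40.0, "FS": -42.5, "HS": -60.0})
```

Grouping labels before taking the maximum (for example, PC, FS, and MS as "first-degree") gives the coarser degree-level classification.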
Laboratory cross data
Whole genome sequences for the progeny from 3 laboratory crosses (MKK2835 × NHP1337, NF54 × NHP4026, and Mal31 × KH004) were obtained as described in Vendrely et al. (2020), Button-Simons et al. (2021), and Kumar et al. (2022). The whole genome sequences for each set of progeny were filtered to include only sites that were variant between the 2 parental lines, were biallelic, had an allele depth > 10, and had less than 20% of the sites missing across all the samples collected from that cross. This left 7,177 SNP variants for the MKK2835 × NHP1337 cross, 11,358 variants for the NF54 × NHP4026 cross, and 12,918 variants for the Mal31 × KH004 cross.
For the cross data, IBD was inferred using a modified version of hmmIBD (Schaffner et al. 2018) as described previously in Wong et al. (2018). This modified version replaces the emission probabilities with the following:
| (6.1) |
| (6.2) |
| (6.3) |
| (6.4) |
where C indicates concordance, D indicates discordance, and the sequencing error rate was set to 0.01. After running the modified hmmIBD, IBD segments were identified as contiguous blocks of loci inferred to be IBD. Segments shorter than 0.05 of the length of the chromosome on which they resided were removed from the analysis. These short segments could represent artifacts generated by hmmIBD or mitotic recombination events. Total relatedness was quantified as the total proportion of the genome that is IBD.
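The post-processing described above can be sketched for a single chromosome as follows. The function names are hypothetical, and measuring a segment as the distance between its first and last IBD locus is an assumption of this sketch; the study may define segment boundaries differently.

```python
def ibd_segments(positions, is_ibd):
    """Return (start, end) base-pair intervals of contiguous IBD loci."""
    segments, start, prev = [], None, None
    for pos, flag in zip(positions, is_ibd):
        if flag and start is None:
            start = pos  # a new IBD block opens at this locus
        if not flag and start is not None:
            segments.append((start, prev))  # block closed at the previous locus
            start = None
        prev = pos
    if start is not None:
        segments.append((start, prev))  # block runs to the last locus
    return segments


def summarize_chromosome(positions, is_ibd, chrom_length, min_fraction=0.05):
    """Drop short segments, then report the per-chromosome features."""
    kept = [(s, e) for s, e in ibd_segments(positions, is_ibd)
            if (e - s) / chrom_length >= min_fraction]
    fractions = [(e - s) / chrom_length for s, e in kept]
    return {
        "ibd_max": max(fractions, default=0.0),  # largest segment, as a proportion
        "n_segment": len(kept),                  # number of retained segments
        "ibd_bp": sum(e - s for s, e in kept),   # base pairs in IBD
    }
```

Summing `ibd_bp` over all 14 chromosomes and dividing by the total genome length would then give the total relatedness for the pair.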
Multilayer perceptron machine learning model
A multilayer perceptron (MLP) model was fit to the simulated data using the MLPClassifier function from the Python package sklearn (sklearn.neural_network, v1.2.2) using the “adam” stochastic gradient-based optimizer, the “relu” activation function, and 3 hidden layers. The number of nodes per hidden layer was determined using a test-and-train 90–10 split where 90% of the data for each genealogical relationship were used to train the model and the remaining 10% were used for model validation. The optimal number of nodes per hidden layer was defined as the value yielding the highest accuracy in the validation data set. Once the number of nodes per hidden layer was determined, the MLP model with the corresponding nodes per hidden layer was trained on the entire data set. A total of 55,000 simulated data points were used for model training, 5,000 for each of the examined 11 genealogical classes.
F1-score
The F1-score is the harmonic mean of the precision (the positive-predictive-value) and sensitivity (recall). It was defined as follows:
$$F_1 = \frac{2\,TP}{2\,TP + FP + FN} \tag{7}$$
where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.
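As a short worked example of the definition above, the function below computes the F1-score from raw counts. Returning 0.0 when there are no positives at all is a convention chosen for this sketch, not something specified in the text.

```python
def f1_score(tp, fp, fn):
    """F1-score from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    # Convention for this sketch: an undefined score (no positives) is 0.0.
    return 2 * tp / denom if denom else 0.0
```

For instance, 8 true positives with 2 false positives and 2 false negatives give a precision and a sensitivity of 0.8 each, and therefore an F1-score of 0.8.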
Results
Brief model overview
By default, MalKinID classifies parasite pairs into 11 different genealogical relationships (Table 1) that cover a broad range of outcrossed parasite relationships. These genealogical relationships include first-degree relatives [parent-child (PC), full-sibling (FS), and meiotic siblings (MS)], second-degree relatives [grandparent-grandchild (GC), half siblings (HS), full avuncular relationships where the avuncular relative is a FS (FAV) or MS (FAV.MS) to the parent], and third-degree relatives [great-grandparent–great-grandchild (GGC), full cousin where the related parental strains are FS (FCS) or MS (FCS.MS), and half avuncular (HAV)]. MalKinID is not restricted to these genealogical relationships and can be modified to perform multisample comparisons to identify any genealogical relationships based on a predefined pedigree tree.
Simulating the genetic relatedness of parasite pairs using a calibrated meiosis model
Previous pedigree reconstruction studies have shown that the genome-wide proportion of IBD between 2 individuals (referred to as total relatedness or rtotal) and the IBD segment block distribution can be used to distinguish genealogical relationships (Browning 1998; Zhao and Liang 2001). To test whether these findings applied to malaria, a previously published meiosis model that includes obligate chiasmata formation and crossover interference (Wong et al. 2018) was used to derive the distributions for rtotal and the per-chromosome maximum IBD segment (IBDmax, expressed as a proportion of the size of each chromosome) and IBD segment count (nsegment) for each of the genealogical relationships described in Table 1 based on the pedigree described in Fig. 3a [Equations (1)–(4) in the Materials and methods]. The distribution of IBD segment blocks was summarized by the per-chromosome values of IBDmax and nsegment. These values were chosen because they provide statistically independent measures for each chromosome, which facilitates their integration into the final pseudolikelihood model.
Fig. 3.
Simulating outcrossed pedigrees. a) The pedigree used to simulate genetically related parasites in this study. Each node of the pedigree represents a genetically distinct parasite. Unlabeled nodes represent external, genetically unrelated parasites that are used for outcrossing. The parental strains (P1, P2, and P3) are all genetically unrelated to one another. F11 and F12 represent 2 different F1-progeny, F21 and F22 two different F2-progeny, and F31 an F3 progeny descended from P1 and P2. A11 refers to a descendant from an “alternate” lineage involving P2 and P3. Simulated distribution for b) genome-wide total relatedness (rtotal) and max IBD segment (IBDmax) distributions for chromosome 14 for c) first-, d) second-, and e) third-degree relatives, and IBD segment count (nsegment) distribution for chromosome 14 for f) first-, g) second-, and h) third-degree relatives. These simulated distributions were generated by fitting the raw simulation output to the equations used in the pseudolikelihood model [Equations (1)–(4)]. FAV.MS and FCS.MS denote FAV and FCS descended from MS.
The meiosis model incorporates obligate chiasma formation and includes parameters that specify the crossover rate and the strength of crossover interference. For this study, the meiosis model was recalibrated to newly available genomic data obtained from 440 unique PC relationships and 9,060 unique FS relationships from 3 laboratory crosses: NF54 × NHP4026, MKK2835 × NHP1337, and Mal31 × KH004 (Supplementary Table 1 and Fig. 1 in Supplementary File 1). Several optimum parameter choices were observed (Supplementary Fig. 1): (1) a set with no crossover interference, (2) a set with weak crossover interference, and (3) a set with moderate/high crossover interference (see Supplementary File 1, Meiosis model recalibration, for additional details). The parameter set with no crossover interference was excluded because crossover interference is a fundamental aspect of meiosis that is observed across diverse taxa. Unless otherwise noted, all model results were made using the weak crossover interference parameter set because it results in more accurate PC classification rates (see below and Supplementary File 1).
Simulations reveal variations in genetic relatedness and IBD segment block distribution between genealogical relationships
These simulations revealed differences in rtotal and the IBD segment block distributions that could be used to distinguish genealogical relationships (Fig. 3; see Supplementary Fig. 2 in Supplementary File 2). First-degree relatives were the most related, with expected rtotal values of 0.50 for PC and FS and 0.33 for MS. Simulations showed that the IBD segment block distributions were significantly different for each of these genealogical relationships (P << 1.0e-10 for all chromosomes; Fig. 3f, Kruskal–Wallis test).
Relative to FS, PC have fewer but larger IBD segments that were more likely to span the entire length of the chromosome. For PC, smaller chromosomes, such as chromosome 1, were more likely to have chromosome-wide IBD segments [0.23 (95% CI 0.22, 0.24)] than longer chromosomes [e.g. chromosome 14, 0.07 (95% CI 0.06, 0.07)] (Supplementary File 3). These differences in the IBD segment block distribution explain why the distribution of rtotal for FS was narrower than that of PC, despite having identical expectations (Supplementary Fig. 3). The standard deviations of the simulated rtotal distributions for PC and FS were 0.095 (95% CI 0.093, 0.097) and 0.076 (95% CI 0.075, 0.078), respectively.
With the exception of full avuncular and full cousin relationships, the expected relatedness values for all second- and third-degree relatives were 0.25 and 0.125, respectively (Fig. 3b). The expected relatedness of full avuncular relatives and full cousins depended on whether their parental F1 strains (nodes F11 and F12 in Fig. 3a) were FS or MS. If the parental F1 strains were FS (FAV and FCS), they had the same expected rtotal as the other second- and third-degree progeny, 0.25 and 0.125, respectively. If the parental F1 strains were MS (FAV.MS and FCS.MS), they were less related than expected, with an expected rtotal of 0.166 and 0.0830, respectively.
The IBD segment block distributions of second- and third-degree relatives were characterized by fewer IBD segments and a greater proportion of chromosomes with no IBD segments at all (Fig. 3d, e, g, and h). While each of the genealogical relationships within second- and third-degree relatives had significantly different IBDmax and nsegment distributions (P << 1.0e-10 for all chromosomes, Kruskal–Wallis test), the differences were less pronounced than those seen within first-degree relatives.
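The values for the MS-derived relationships can be checked with a short calculation. It assumes that the expected relatedness of meiotic siblings is exactly 1/3 (reported above as 0.33) and that each outcrossed generation of descent halves the expected sharing, so the avuncular value is half the sibling value and the cousin value is a quarter of it:

$$E[r_{\mathrm{total}}^{\mathrm{FAV.MS}}] = \tfrac{1}{2} \cdot \tfrac{1}{3} = \tfrac{1}{6} \approx 0.167, \qquad E[r_{\mathrm{total}}^{\mathrm{FCS.MS}}] = \tfrac{1}{4} \cdot \tfrac{1}{3} = \tfrac{1}{12} \approx 0.083$$

The same halving applied to FS (0.5) gives 0.25 for FAV and 0.125 for FCS, matching Table 1.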
Accurate genealogical inference required utilizing rtotal and the per-chromosome IBDmax and nsegment distributions
Based on these results, the differences in rtotal and the per-chromosome IBDmax and nsegment distributions were evaluated to determine whether they could be used for genealogical inference. It was hypothesized that a model utilizing rtotal alone could distinguish first-, second-, and third-degree relatives but that further classification into the specific genealogical subrelationships (for instance, PC and FS) would require a model that also utilized the per-chromosome IBDmax and nsegment distributions. It was also of interest to determine whether including MS, FAV.MS, and FCS.MS would lessen genealogical inference accuracy, as their respective rtotal distributions blurred the distinctions between the other first-, second-, and third-degree relatives (Fig. 3b). To address these questions, this study evaluated several pseudolikelihoods that involved different modifications of Equation (1). Equation (1) assumes that rtotal and each of the per-chromosome IBDmax and nsegment values are statistically independent. The majority of pairwise comparisons between these features had no statistically significant correlations (Spearman's correlation coefficient < 0.10) (Supplementary Fig. 4). For MS, FAV.MS, and FCS.MS, there was a slight correlation (Spearman's correlation coefficient between 0.08 and 0.13) between IBDmax and nsegment if they were from the same chromosome.
The first set of examined pseudolikelihoods excluded the MS, FAV.MS, and FCS.MS categories and focused on evaluating whether a pseudolikelihood model utilizing rtotal alone could accurately classify genealogical relationships. As expected, classification based on rtotal alone performed poorly when identifying individual genealogical relationships but could be used to classify parasite pairs as first-, second-, or third-degree relatives (Supplementary Fig. 5). The true classification rate (the proportion of accurately identified comparisons) for first-degree relatives was 0.943 (95% CI 0.938, 0.947), second-degree relatives 0.771 (95% CI 0.762, 0.779), and third-degree relatives 0.828 (95% CI 0.821, 0.836).
Including the per-chromosome IBDmax and nsegment distributions in the pseudolikelihood model increased the true classification rates for first- [0.989 (95% CI 0.987, 0.991)], second- [0.925 (95% CI 0.92, 0.93)], and third-degree relatives [0.944 (95% CI 0.940, 0.950)] (Supplementary Fig. 6). It also allowed the model to further distinguish first-degree relatives as either PC [0.96 (95% CI 0.956, 0.967)] or FS [0.97 (95% CI 0.964, 0.974)]. However, differentiation of the genealogical relationships within second- and third-degree relatives remained challenging, with true classification rates of ∼0.40 for GC, HS, GGC, and HAV and 0.60–0.80 for FAV and FCS.
These trends held true when expanding the pseudolikelihood model to include meiotic siblings and their descendants (Fig. 4a–e). Overall, including MS, FAV.MS, and FCS.MS caused true classification rates to decline slightly, particularly for the genealogical relationships that define second- and third-degree relatives. As before, rtotal alone was insufficient to accurately distinguish individual genealogical relationships. When including the additional MS relationships in the model, the true classification rates for first-, second-, and third-degree relatives were 0.93 (95% CI 0.92, 0.94), 0.81 (95% CI 0.80, 0.83), and 0.86 (95% CI 0.85, 0.87), respectively. The classification rates for the individual genealogical relationships within first-degree relatives were consistently higher than those for second- and third-degree relatives. The true classification rate for MS was 0.76 (95% CI 0.75, 0.77).
Fig. 4.
Classification of outcrossed parasite genealogies on simulated a–e) and empirical lab cross data f–h). Refer to Fig. 2a or Table 1 for abbreviations. a) PCA plot of the simulated rtotal, IBDmax, and nsegment data. The PCA included a third component that is not shown. Each genealogical relationship is represented by 5,000 independent simulations and was fit with a Gaussian KDE for visual clarity. Darker spots indicate a region with higher mass for that genealogical relationship. MalKinID classification rates when including all meiotic sibling possibilities b–e). The term to the left of the “|” refers to the inferred classification and the term to the right the true genealogical relationship. b) classifies samples by their genealogical relationship while the other subplots classify samples as c) first-, d) second-, or e) third-degree relatives. MS.MS refers to meiotic siblings, while FAV.MS and FCS.MS refer to the meiotic sibling variants for FAV and FCS. Empirical classification rates for f) PC, g) first-degree siblings (FS and MS), and h) first-degree relatives from the laboratory cross data. Error bars indicate 2 standard deviations from the mean from 5,000 comparisons. 1° siblings refers to FS and MS. 1° refers to all first-degree relatives (PC, FS, and MS).
To test whether the assumption of statistical independence used in the pseudolikelihood model [Equation (1)] could affect classification accuracy, a MLP model was trained on the simulated data. MLP models are machine learning–based neural networks with fully connected neurons. Unlike the pseudolikelihood model used in this study, MLP models do not assume that features are statistically independent. The classifications from the trained MLP model were consistent with those from the pseudolikelihood model (R2 = 0.995; Supplementary Fig. 7) and yielded similar true classification rates.
MalKinID accurately classifies known PC and FS relatives from 3 laboratory-based genetic crosses
The performance of MalKinID was first assessed using the genomic data from the 3 crosses (NF54 × NHP4026, MKK2835 × NHP1337, and Mal31 × KH004) (Fig. 4f–h). In these crosses, MalKinID correctly identified 0.80 (95% CI 0.77, 0.84) of the PC relationships and 0.80 (95% CI 0.79, 0.81) of the first-degree siblings (FS or MS). Of the correctly identified first-degree siblings, 0.92 (95% CI 0.91, 0.94) were identified as FS. The average of the true classification rates for PC and first-degree siblings (the macroaverage) was 0.80 (95% CI 0.77, 0.83). When using MalKinID to classify parasites as first-, second-, or third-degree relatives, 0.94 (95% CI 0.91, 0.96) and 0.99 (95% CI 0.98, 0.99) of the PC and FS relationships were correctly identified as first-degree relatives, and the overall macroaverage rate for both PC and FS comparisons was 0.96 (95% CI 0.93, 0.99).
Replacing the rtotal, IBDmax, and nsegment distributions with those generated from the meiosis model with the moderate/strong crossover interference parameter set (Supplementary File 1) resulted in similar macroaverage classification rates across all PC and FS comparisons [0.80 (95% CI 0.76, 0.84)], but differential true classification rates for FS and PC. This parameterization yielded higher true classification rates for FS [0.92 (95% CI 0.91, 0.93)] but worse true classification rates for PC [0.68 (95% CI 0.63, 0.72)] (Supplementary Fig. 8 in Supplementary File 1).
MalKinID accurately reconstructs the genealogical history of outcrossed parasite lineages with high precision and recall
To determine whether MalKinID could faithfully reconstruct transmission lineages, the performance of MalKinID was evaluated using simulations designed to represent the following: (1) “outcrossed point importations,” the importation of a single parasite strain in a population with sufficiently high transmission intensity that superinfection (repeated infection of an already infected individual) promotes parasite outcrossing (Fig. 5a/c), and (2) “inbred point importations” (Fig. 5b/d), the importation of a single parasite strain in a population with such low transmission intensity or such isolated transmission lineages that superinfection is rare and serial co-transmission chains (the repeated transmission of multiple, often genetically-related parasites) are the norm. For outcrossed point importations, MalKinID accurately identified parasite relationships with high precision and sensitivity (Fig. 5c). The F1-scores for PC, FS, second-degree, and third-degree relatives were 0.95 (0.84, 1.0), 0.94 (0.72, 1.0), 0.84 (0.60, 1.0), and 0.84 (0.64, 0.94), respectively.
Fig. 5.
Performance of MalKinID on simulated importations involving a/c) outcrossing and b/d) inbreeding through serial co-transmission. In a) and b), nodes starting with “A” represent superinfecting parasites sampled from the local population. For all other nodes, the nonsubscript part of the names denotes the generation (F1, F2, F3, and F4) while the subscript refers to different versions of it. The accuracy of MalKinID was summarized using the F1-score. The error bars indicate 2 standard deviations from the mean.
However, there was a significant drop in precision and sensitivity when MalKinID was applied to inbred point importations, with F1-scores of 0.76 (0.56, 0.92) and 0.23 (0.0, 0.4) for PC and FS and <0.05 for second-degree and third-degree relatives (Fig. 5d). Under these conditions, second- and third-degree relatives were consistently misidentified as PC. This is because inbreeding can result in a wide range of overlapping rtotal and per-chromosome IBDmax and nsegment distributions whose values depend on how inbred the original parental strains are (Supplementary Fig. 9).
Tree-based genealogical inference and multisample comparisons can disentangle inbred parasite genealogies
Thus far, the analyses presented in this study have focused on using MalKinID to identify the most likely genealogical relationship between parasite pairs. Figure 5d showed that relying on pairwise comparisons to perform genealogical inference fails when inbreeding occurs. Fortunately, the issues caused by inbreeding can be overcome by basing inference on multisample comparisons (Sieberts et al. 2002) that use a predefined pedigree describing all the possible genealogical relationships between sample pairs. Such pedigrees are difficult to specify in genomic epidemiology studies but are readily available in laboratory-based genetic crosses and QTL experiments.
As an example, MalKinID was used to identify outbred and inbred parasites from a 2-generation cross involving NF54 and NHP4026 (Fig. 6a). This 2-generation cross was generated by feeding mosquitoes a blood meal containing NF54 and NHP4026, pooling the resulting first-generation progeny, and feeding that pool to a second batch of mosquitoes. The initial pool of first-generation progeny contained a mix of F1 progeny generated through outcrossing (mating between 2 genetically distinct parasites) and parental clones generated through selfing (mating between 2 clones). This pool was fed to the second batch of mosquitoes without prior removal of parental clones. As a result, the second-generation pool could contain parental, F1, and F2 progeny as well as backcrossed progeny generated by crossing an F1 with NF54 (B1) or NHP4026 (B2).
Fig. 6.
Classification of backcrossed, F1, and F2 parasites in a typical lab cross experiment with 2 rounds of transmission. a) Pedigree used to represent the genealogical history of a laboratory-based genetic cross with 2 unrelated parents (P1 and P2). F1 progeny are FS and represented by nodes F11 and F12 and F2 progeny by nodes F21 and F22. Backcrossed progeny to P1 are represented by B1 and backcrossed progeny to P2 are represented by B2. Inbreeding is denoted by double lines. True classification and misclassification rates for b) F1, c) F2, d) B1, and e) B2. Inferred classifications in the 2-generation cross for the f) first and g) second-generation progeny pools. The error bars represent 2 standard errors from the mean from 40,000 simulated comparisons.
Classification involved a 2-step triangulation strategy (Supplementary File 1) that (1) first identified backcrossed progeny and then (2) disambiguated F1 and F2 progeny. Importantly, because the pedigree was known, the likelihood used in MalKinID could be modified (Supplementary Equations 2 and 3) to infer genealogical relationships based on 3-way comparisons (trios) to triangulate the exact nodal position of a sample in the pedigree tree (Fig. 6a; Supplementary Fig. 10) (Sieberts et al. 2002). In silico experiments show that this approach identified simulated F1, F2, and backcrossed progeny with >90% accuracy (Fig. 6b–e). The true classification rate for F1 progeny was 0.925 (95% CI 0.917, 0.932), F2 progeny 0.953 (95% CI 0.947, 0.959), B1 progeny 0.954 (95% CI 0.949, 0.960), and B2 progeny 0.964 (95% CI 0.958, 0.969).
When applied to the empirical data, MalKinID correctly identified 0.87 (95% CI 0.78, 0.96) of the nonparental strains in the first-generation pool as F1s (Fig. 6f). The data from this 2-generation cross were not used to calibrate the meiosis model and were independent of the previously mentioned NF54 × NHP4026 cross used for calibration. When applied to the second-generation pool, MalKinID showed that 0.39 (95% CI 0.28, 0.49) of the second-generation progeny were F1s, 0.56 (95% CI 0.45, 0.67) were backcrosses involving NF54, and only a very small fraction [0.0125 (95% CI 0, 0.037)] were F2s (Fig. 6g).
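To show how a classified fraction and its 95% CI relate to underlying counts, the sketch below computes a normal-approximation (Wald) interval for a proportion. The source does not state which interval method or sample size was used. The counts here (31 and 45 out of 80) are hypothetical, chosen only because they give proportions and intervals close to those reported for the second-generation pool.

```python
import math


def wald_ci(successes, n, z=1.96):
    """Proportion with a normal-approximation (Wald) confidence interval.

    The interval is clipped to [0, 1] because a proportion cannot fall
    outside that range.
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts (assumption, not from the source): 80 classified
# second-generation progeny, 31 called F1 and 45 called NF54 backcrosses.
for label, count in (("F1", 31), ("backcross to NF54", 45)):
    p, lower, upper = wald_ci(count, 80)
    print(f"{label}: {p:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```

The Wald interval is simple but performs poorly for proportions near 0 or 1, which is why the lower bound needs clipping at 0; a Wilson or exact interval would be a reasonable alternative for very small fractions.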
Discussion
MalKinID is a classification model that infers the genealogical relationships between parasites based on the total, genome-wide proportion of IBD shared between individuals and 2 aspects of the IBD segment block distribution. MalKinID accurately distinguishes first-, second-, and third-degree relatives from one another and can further distinguish the subrelationships within first-degree relatives (PC, FS, and MS). These relationships can be identified even in the presence of meiotic siblings (Fig. 4), despite the fact that their expected relatedness (0.33) lies between the expected relatedness of the other first-degree relatives (0.50) and second-degree relatives (0.25). Genealogical inference with MalKinID is most powered when (1) outcrossing predominates and genealogical inference can be achieved by examining the relatedness and IBD segment block distributions between sample pairs or (2) a pedigree describing the genealogical history connecting individuals can be specified to enable multisample comparisons and triangulate a sample's position on the tree.
Incorporating genealogical inference into genomic epidemiology studies could provide greater insight into the transmission structure of a moderate-to-low transmission population and address key questions regarding imported and local transmission. In Senegal, networks of genetic relatedness suggest that the parasite populations are highly interconnected (Schaffner et al. 2023) and a key question is whether the parasites in low transmission settings are locally sustained or the result of importation from a higher transmission area. Inferring the genealogical relationship of related parasites between populations could be used to approximate the origin and timing of importation. For example, first-degree relationships would suggest recent importation while second- and third-degree relatives would imply older importation events. Genetic relatedness network analyses have also revealed a surprising amount of diversity in network structure in low transmission areas. Deconstructing these networks by reorganizing them into pedigrees that emphasize the hierarchical descendance of individuals (Pemberton 2008; Kirkpatrick et al. 2011; He et al. 2013; Staples et al. 2014), and thus a potential transmission history, could provide much-needed insight into parasite population movement and transmission structure that could inform intervention design.
Deconstructing the genetic relatedness of malaria parasites in malaria endemic settings with MalKinID would require utilizing high density genotyping tools such as whole genome sequencing to accurately estimate IBD segment block boundaries. While genome-wide relatedness can be estimated with as few as 200 biallelic or 100 multiallelic genetic markers (Taylor et al. 2019), model-based benchmarking analyses suggest that 25 SNPs per centimorgan (Guo, Takala-Harrison, et al. 2024) are needed for accurate IBD segment block inference. Genealogical inference may be possible with lower resolution genotypic methods but would require refitting the pseudolikelihood based on simulated outputs that define IBD segment block boundaries based on the position of the nearest, examined genetic markers. The benchmarking study also showed that the algorithm used to infer IBD from empirical data can have a significant impact (Guo, Takala-Harrison, et al. 2024); algorithms such as hmmIBD (Schaffner et al. 2018) and Refined IBD (Morgan et al. 2020) are recommended because they outperformed those such as hap-IBD (Zhou et al. 2020) and isoRelate (Henden et al. 2018).
However, inbreeding poses a challenge for malaria genealogical inference, especially in low transmission settings where genetically related parasites are more frequent. Inbreeding is a known problem with genealogical inference (Kirkpatrick et al. 2011), but many of the techniques used to account for inbreeding were designed for diploid organisms (Liu et al. 2010). These techniques do not work in malaria because its predominantly haploid genome prevents using the IBD between homologous chromosomes to identify inbred individuals (Jacquard 1975). To limit the effects of inbreeding, one can use MalKinID to study genetically related parasites from different populations. Parasites sampled from different populations are less likely to be inbred, and interpopulation genetic relatedness analyses have been used to generate different hypotheses regarding importation (Taylor et al. 2017). However, these studies typically focus on the most highly related parasite pairs (relatedness > 0.5) (Taylor et al. 2017) and do not utilize any additional information from the IBD segment block distribution. Genealogical inference with MalKinID would complement previous studies by enabling more granular investigations based on identifying PC, GC, and GGC relatives to identify imported parasites, identify their most likely source population, and estimate the number of rounds of local transmission following importation.
An alternative solution would be to utilize multisample comparisons based on a hypothesized pedigree tree that describes the genealogical history of genetically related parasites. This approach overcomes inbreeding limitations by leveraging the relatedness among multiple samples to assess the joint likelihood across all examined sample-pair comparisons and shifts the focus of genealogical inference from identifying the most likely pairwise relationship to the most likely pedigree tree that best explains the relatedness within a sample cluster. Tree-based genealogical inference helps constrain genealogical inference and could improve robustness against minor errors in identifying IBD segments. These trees can be generated stochastically, such as with a random pedigree tree simulator (Huisman 2017; Ochoa), or inferred from data using ancestral recombination graph techniques (Camponovo et al. 2023; Nielsen et al. 2024). Pairing MalKinID with both approaches will likely be needed, as ancestral recombination graphs have not yet been applied to the malaria parasite, and it is unclear how its variable effective recombination rate could affect the structure of inferred ancestral recombination graphs (Camponovo et al. 2023).
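The shift from pairwise to tree-based inference can be illustrated with a minimal sketch. It assumes, for illustration only, that the joint log-likelihood of a candidate pedigree is the sum of pairwise log-likelihoods over all sample pairs. The sample names, log-likelihood values, and candidate pedigrees below are hypothetical; they are not MalKinID output, and the functions are not part of the MalKinID code base.

```python
# Hypothetical pairwise log-likelihoods: pair -> {relationship: log-likelihood}.
PAIR_LOG_LIKS = {
    ("A", "B"): {"PC": -1.0, "FS": -3.0, "GC": -6.0},
    ("A", "C"): {"PC": -2.0, "FS": -4.0, "GC": -2.5},
    ("B", "C"): {"PC": -1.5, "FS": -1.0, "GC": -5.0},
}

# Each candidate pedigree implies one relationship for every sample pair.
CANDIDATE_PEDIGREES = {
    # A is the parent of B, and B is the parent of C.
    "chain": {("A", "B"): "PC", ("B", "C"): "PC", ("A", "C"): "GC"},
    # A is the parent of both B and C, which are full siblings.
    "siblings": {("A", "B"): "PC", ("A", "C"): "PC", ("B", "C"): "FS"},
}


def joint_log_likelihood(implied_relationships, pair_log_liks):
    """Sum the pairwise log-likelihoods implied by one candidate pedigree."""
    return sum(
        pair_log_liks[pair][relationship]
        for pair, relationship in implied_relationships.items()
    )


def best_pedigree(candidates, pair_log_liks):
    """Return the name of the candidate pedigree with the highest joint score."""
    return max(
        candidates,
        key=lambda name: joint_log_likelihood(candidates[name], pair_log_liks),
    )


for name, implied in CANDIDATE_PEDIGREES.items():
    print(name, joint_log_likelihood(implied, PAIR_LOG_LIKS))
print("best:", best_pedigree(CANDIDATE_PEDIGREES, PAIR_LOG_LIKS))
```

Scoring whole pedigrees means that an ambiguous pair, such as A and C in this toy example, is resolved by what the other comparisons imply rather than by its own likelihood alone.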
Multisample and tree-based genealogical inference was needed to identify inbred parasites among the progeny of a 2-generation laboratory-based genetic cross (Vendrely et al. 2020; Button-Simons et al. 2021; Kumar et al. 2022). MalKinID showed that the majority of second-generation progeny were actually F1s or backcrosses to the NF54 parental strain, which could indicate preferential mating with NF54 or high NF54 selfing rates in the first generation. Using MalKinID to identify and classify recombinant, inbred parasites could better inform genetic linkage mapping and QTL analyses (Daley and Shepherd 2008; Solberg Woods 2014) designed to study the genetics of clinically relevant phenotypes such as drug resistance, virulence, invasion, and transmission (Vendrely et al. 2020).
MalKinID treats each of the 14 per-chromosome IBD segment block distributions as statistically independent. This assumption is based on the independent recombination and assortment of chromosomes during meiosis and supported by the lack of significant correlations between the components of the pseudolikelihood model (Supplementary Fig. 4). This is important because selection can distort the IBD segment block distribution, resulting in longer than expected IBD segment blocks. However, the effects of selection tend to be localized (Charlesworth et al. 2000; Schaffner et al. 2018) and are unlikely to extend to other chromosomes due to independent reassortment. As a result, MalKinID should accurately distinguish genealogical relationships even if certain regions of the genome are undergoing strong, directional selection. This assumption is partially validated by the high accuracy of MalKinID on the empirical lab cross data, which previously showed evidence of selective sweeps due to culture-adaptation (Vendrely et al. 2020). One way of correcting for any potential distortions would be to remove chromosomes with evidence of selective sweeps from analyses (Guo, Borda, et al. 2024).
Future studies should incorporate a Bayesian prior into MalKinID that specifies the probability that a meiotic sibling will be present in the examined data set to prevent unnecessary depression of true classification rates in situations where meiotic siblings are expected to be rare. In the absence of data to inform this prior, the following heuristic is advised. When evaluating parasites in situations where it is highly unlikely that parasites will be derived from the same oocyst, such as when examining parasites between monogenomic (single strain) infections that are the progeny from lab cross experiments (Fig. 4g), meiotic siblings and their derivatives can be excluded. When examining parasites within polygenomic infections, particularly those that are the result of co-transmission, meiotic siblings and their derivatives should not be excluded when making inferences with MalKinID as there is a greater chance of comparing parasites derived from the same oocyst. Characterizing the genealogies of parasites within and between polygenomic infections could help address questions regarding superinfection and co-transmission, particularly in high transmission settings where a significant fraction of polygenomic infections can contain genetically related parasites (Nkhoma et al. 2018; Wong et al. 2022). Polygenomic infections will need to first be deconvoluted using technologies such as long-read (Sakamoto et al. 2022) or single-cell (Nair et al. 2014; Nkhoma et al. 2020) whole genome sequencing to establish the genomic phase of each co-infecting strain prior to analysis.
In conclusion, MalKinID is a new classification model that can be used to identify the genealogical relationships of genetically related parasites. In theory, MalKinID can be used to identify the parent-offspring relationships in natural parasite populations. Pairwise relatedness-based inferences of genealogical history work best for identifying importations in predominantly outbreeding populations where inbreeding is infrequent. When inbreeding is present, relatedness-based genealogical inference will need to focus on utilizing multisample, tree-based inferences to accurately identify the genealogical relationships between parasites.
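Treating the 14 per-chromosome distributions as independent means their contributions multiply, so on the log scale they add. The sketch below illustrates only that step, under stated assumptions: the per-chromosome probabilities are made-up placeholders, the function names are hypothetical, and the genome-wide relatedness component of the MalKinID pseudolikelihood is omitted.

```python
import math

N_CHROMOSOMES = 14  # number of per-chromosome distributions named in the text


def log_pseudolikelihood(per_chromosome_probs):
    """Sum of log-probabilities across chromosomes, assuming independence."""
    if len(per_chromosome_probs) != N_CHROMOSOMES:
        raise ValueError(f"expected {N_CHROMOSOMES} per-chromosome values")
    return sum(math.log(p) for p in per_chromosome_probs)


def most_likely_relationship(probs_by_relationship):
    """Pick the relationship with the highest summed log-probability."""
    return max(
        probs_by_relationship,
        key=lambda rel: log_pseudolikelihood(probs_by_relationship[rel]),
    )


# Placeholder probabilities (assumption): the probability of the observed
# segment pattern on each chromosome under each candidate relationship.
toy_probs = {
    "PC": [0.30] * N_CHROMOSOMES,
    "FS": [0.25] * N_CHROMOSOMES,
    "second-degree": [0.05] * N_CHROMOSOMES,
}

print(most_likely_relationship(toy_probs))
```

Because each chromosome contributes its own additive term, a distortion confined to one chromosome shifts only one of the 14 terms, and dropping a chromosome with evidence of a selective sweep amounts to dropping its term from the sum.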
Supplementary Material
Acknowledgments
We thank all the members of the Malaria Crosses Collaboration Project for their help in generating the malaria crosses used for model calibration.
Contributor Information
Wesley Wong, Department of Immunology and Infectious Diseases, Harvard T. H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA.
Lea Wang, Harvard College, Harvard University, Cambridge, MA 02138, USA.
Stephen F Schaffner, Infectious Disease and Microbiome Program, Broad Institute, Cambridge, MA 02142, USA.
Xue Li, Program in Disease Intervention and Prevention, Texas Biomedical Research Institute, San Antonio, TX 78227, USA.
Ian Cheeseman, Program in Host Pathogen Interactions, Texas Biomedical Research Institute, San Antonio, TX 78227, USA.
Timothy J C Anderson, Program in Disease Intervention and Prevention, Texas Biomedical Research Institute, San Antonio, TX 78227, USA.
Ashley Vaughan, Center for Global Infectious Disease Research, Seattle Children's Research Institute, Seattle, WA 98105, USA; Department of Pediatrics, University of Washington, Seattle, WA 98195, USA.
Michael Ferdig, Department of Biological Sciences, Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, USA.
Sarah K Volkman, Department of Immunology and Infectious Diseases, Harvard T. H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA; School of Nursing, Simmons University, Boston, MA 02115, USA.
Daniel L Hartl, Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA.
Dyann F Wirth, Department of Immunology and Infectious Diseases, Harvard T. H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA.
Data availability
Whole genome sequencing data for cloned progeny from lab-generated P. falciparum crosses used for calibration (NF54 × NHP4026, MKK2835 × NHP1337, and Mal31 × KH004) were previously published and are available at NCBI under project PRJNA524855 (Li et al. 2019). Filtered SNP genotypes for these crosses and the 2-generation crosses involving NF54 × NHP4026 are available at https://github.com/emilyli0325/P01-cloned.progeny. The 2-generation NF54 × NHP4026 cross is labeled as NF54.mcherry × NHP4026.GFP in the GitHub repository. Details of the model, simulation, and computational tools used are described in the Materials and methods. Supplementary File 1 contains supplemental text, Supplementary File 2 consolidates all IBD segment distributions for each genealogical relationship for Supplementary Fig. 2, and Supplementary File 3 lists the parameters used in MalKinID. Code to run MalKinID is available on GitHub (https://github.com/weswong/MalKinID) and Zenodo (doi: 10.5281/zenodo.14052412) under the MIT license. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software and to permit persons to whom the Software is furnished to do so. This software is provided “as is,” without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.
Supplemental material available at GENETICS online.
Funding
Experimental genetic crosses used were funded by the National Institutes of Health (5P01AI127338-07). Work in the Texas Biomedical Research Institute was conducted in facilities constructed with support from the Research Facilities Improvement Program grant C06 RR013556 from the National Center for Research Resources. Model calibration and MalKinID development were supported by the Bill and Melinda Gates Foundation (OPP1156051 to D.F.W. and INV-049909 to S.K.V.) and the National Institutes of Health (5R21AI141843-02 to S.K.V.).
Literature cited
- Ashton RA, Prosnitz D, Andrada A, Herrera S, Yé Y. 2020. Evaluating malaria programmes in moderate- and low-transmission settings: practical ways to generate robust evidence. Malar J. 19(1):75. doi: 10.1186/s12936-020-03158-z.
- Blouin MS. 2003. DNA-based methods for pedigree reconstruction and kinship analysis in natural populations. Trends Ecol Evol. 18(10):503–511. doi: 10.1016/S0169-5347(03)00225-8.
- Browning S. 1998. Relationship information contained in gamete identity by descent data. J Comput Biol. 5(2):323–334. doi: 10.1089/cmb.1998.5.323.
- Button-Simons KA, Kumar S, Carmago N, Haile MT, Jett C, Checkley LA, Kennedy SY, Pinapati RS, Shoue DA, McDew-White M, et al. 2021. The power and promise of genetic mapping from Plasmodium falciparum crosses utilizing human liver-chimeric mice. Commun Biol. 4(1):734. doi: 10.1038/s42003-021-02210-1.
- Camponovo F, Buckee CO, Taylor AR. 2023. Measurably recombining malaria parasites. Trends Parasitol. 39(1):17–25. doi: 10.1016/j.pt.2022.11.002.
- Charlesworth B, Harvey PH, Barton NH. 2000. Genetic hitchhiking. Philos Trans R Soc Lond B Biol Sci. 355(1403):1553–1562. doi: 10.1098/rstb.2000.0716.
- Daley JW, Shepherd RK. 2008. Utilization of F1 information in estimating QTL effects in F2 crosses between outbred lines. Journal of Animal Breeding and Genetics. 125(1):35–44. doi: 10.1111/J.1439-0388.2007.00699.X.
- Montero J G, Blythe RA. 2023. Self-contained beta-with-spikes approximation for inference under a Wright–Fisher model. Genetics. 225(2):iyad092. doi: 10.1093/genetics/iyad092.
- Guo B, Borda V, Laboulaye R, Spring MD, Wojnarski M, Vesely BA, Silva JC, Waters NC, O’Connor TD, Takala-Harrison S. 2024. Strong positive selection biases identity-by-descent-based inferences of recent demography and population structure in Plasmodium falciparum. Nat Commun. 15(1):2499. doi: 10.1038/s41467-024-46659-0.
- Guo B, Takala-Harrison S, O’Connor TD. 2024. Benchmarking and Optimization of Methods for the Detection of Identity-By-Descent in High-Recombining Plasmodium falciparum Genomes. bioRxiv 592538. doi: 10.1101/2024.05.04.592538, preprint: not peer reviewed.
- He D, Wang Z, Han B, Parida L, Eskin E. 2013. IPED: inheritance path-based pedigree reconstruction algorithm using genotype data. J Comput Biol. 20(10):780–791. doi: 10.1089/CMB.2013.0080.
- Henden L, Lee S, Mueller I, Barry A, Bahlo M. 2018. Identity-by-descent analyses for measuring population dynamics and selection in recombining pathogens. PLoS Genet. 14(5):e1007279. doi: 10.1371/journal.pgen.1007279.
- Huisman J. 2017. Pedigree reconstruction from SNP data: parentage assignment, sibship clustering and beyond. Mol Ecol Resour. 17(5):1009–1024. doi: 10.1111/1755-0998.12665.
- Jacquard A. 1975. Inbreeding: one word, several meanings. Theor Popul Biol. 7(3):338–363. doi: 10.1016/0040-5809(75)90024-6.
- Jones TB, Manseau M. 2022. Genetic networks in ecology: a guide to population, relatedness, and pedigree networks and their applications in conservation biology. Biol Conserv. 267:109466. doi: 10.1016/j.biocon.2022.109466.
- Kirkpatrick B, Li SC, Karp RM, Halperin E. 2011. Pedigree reconstruction using identity by descent. J Comput Biol. 18(11):1481–1493. doi: 10.1089/cmb.2011.0156.
- Kumar S, Li X, McDew-White M, Reyes A, Delgado E, Sayeed A, Haile MT, Abatiyow BA, Kennedy SY, Camargo N, et al. 2022. A malaria parasite cross reveals genetic determinants of Plasmodium falciparum growth in different culture media. Front Cell Infect Microbiol. 12:878496. doi: 10.3389/FCIMB.2022.878496.
- Li X, Kumar S, McDew-White M, Haile M, Cheeseman IH, Emrich S, Button-Simons K, Nosten F, Kappe SHI, Ferdig MT, et al. 2019. Genetic mapping of fitness determinants across the malaria parasite Plasmodium falciparum life cycle. PLoS Genet. 15(10):e1008453. doi: 10.1371/journal.pgen.1008453.
- Liu EY, Zhang Q, McMillan L, de Villena FPM, Wang W. 2010. Efficient genome ancestry inference in complex pedigrees with inbreeding. Bioinformatics. 26(12):i199–i207. doi: 10.1093/BIOINFORMATICS/BTQ187.
- Morgan AP, Brazeau NF, Ngasala B, Mhamilawa LE, Denton M, Msellem M, Morris U, Filer DL, Aydemir O, Bailey JA, et al. 2020. Falciparum malaria from coastal Tanzania and Zanzibar remains highly connected despite effective control efforts on the archipelago. Malar J. 19(1):47. doi: 10.1186/s12936-020-3137-8.
- Nair S, Nkhoma SC, Serre D, Zimmerman PA, Gorena K, Daniel BJ, Nosten F, Anderson TJC, Cheeseman IH. 2014. Single-cell genomics for dissection of complex malaria infections. Genome Res. 24(6):1028–1038. doi: 10.1101/gr.168286.113.
- Neafsey DE, Taylor AR, MacInnis BL. 2021. Advances and opportunities in malaria population genomics. Nat Rev Genet. 22(8):502–517. doi: 10.1038/s41576-021-00349-5.
- Neafsey DE, Volkman SK. 2017. Malaria genomics in the era of eradication. Cold Spring Harb Perspect Med. 7(8):a025544. doi: 10.1101/CSHPERSPECT.A025544.
- Nielsen R, Vaughn AH, Deng Y. 2024. Inference and applications of ancestral recombination graphs. Nat Rev Genet. doi: 10.1038/s41576-024-00772-4.
- Nkhoma SC, Banda RL, Khoswe S, Dzoole-Mwale TJ, Ward SA. 2018. Intra-host dynamics of co-infecting parasite genotypes in asymptomatic malaria patients. Infection, Genetics and Evolution. 65:414–424. doi: 10.1016/J.MEEGID.2018.08.018.
- Nkhoma SC, Nair S, Cheeseman IH, Rohr-Allegrini C, Singlam S, Nosten F, Anderson TJ. 2012. Close kinship within multiple-genotype malaria parasite infections. Proc Biol Sci. 279(1738):2589–2598. doi: 10.1098/rspb.2012.0113.
- Nkhoma SC, Trevino SG, Gorena KM, Nair S, Khoswe S, Jett C, Garcia R, Daniel B, Dia A, Terlouw DJ, et al. 2020. Co-transmission of related malaria parasite lineages shapes within-host parasite diversity. Cell Host Microbe. 27(1):93–103.e4. doi: 10.1016/J.CHOM.2019.12.001.
- Ochoa A. simfam: simulate and model family pedigrees with structured founders. https://cran.r-project.org/web/packages/simfam/vignettes/simfam.html. Accessed 01 June 2024.
- Pemberton JM. 2008. Wild pedigrees: the way forward. Proc R Soc Lond B Biol Sci. 275(1635):613–621. doi: 10.1098/RSPB.2007.1531.
- Sakamoto Y, Miyake S, Oka M, Kanai A, Kawai Y, Nagasawa S, Shiraishi Y, Tokunaga K, Kohno T, Seki M, et al. 2022. Phasing analysis of lung cancer genomes using a long read sequencer. Nat Commun. 13(1):3464. doi: 10.1038/s41467-022-31133-6.
- Schaffner SF, Badiane A, Khorgade A, Ndiop M, Gomis J, Wong W, Ndiaye YD, Diedhiou Y, Thwing J, Seck MC, et al. 2023. Malaria surveillance reveals parasite relatedness, signatures of selection, and correlates of transmission across Senegal. Nat Commun. 14(1):7268. doi: 10.1038/s41467-023-43087-4.
- Schaffner SF, Taylor AR, Wong W, Wirth DF, Neafsey DE. 2018. hmmIBD: software to infer pairwise identity by descent between haploid genotypes. Malar J. 17(1):196. doi: 10.1186/s12936-018-2349-7.
- Sieberts SK, Wijsman EM, Thompson EA. 2002. Relationship inference from trios of individuals, in the presence of typing error. Am J Hum Genet. 70(1):170–180. doi: 10.1086/338444.
- Solberg Woods LC. 2014. QTL mapping in outbred populations: successes and challenges. Physiol Genomics. 46(3):81–90. doi: 10.1152/PHYSIOLGENOMICS.00127.2013.
- Staples J, Qiao D, Cho MH, Silverman EK, Nickerson DA, Below JE. 2014. PRIMUS: rapid reconstruction of pedigrees from genome-wide estimates of identity by descent. Am J Hum Genet. 95(5):553–564. doi: 10.1016/j.ajhg.2014.10.005.
- Tataru P, Bataillon T, Hobolth A. 2015. Inference under a Wright-Fisher model using an accurate beta approximation. Genetics. 201(3):1133–1141. doi: 10.1534/genetics.115.179606.
- Taylor AR, Jacob PE, Neafsey DE, Buckee CO. 2019. Estimating relatedness between malaria parasites. Genetics. 212(4):1337–1351. doi: 10.1534/genetics.119.302120.
- Taylor AR, Schaffner SF, Cerqueira GC, Nkhoma SC, Anderson TJC, Sriprawat K, Pyae Phyo A, Nosten F, Neafsey DE, Buckee CO. 2017. Quantifying connectivity between local Plasmodium falciparum malaria parasite populations using identity by descent. PLoS Genet. 13(10):e1007065. doi: 10.1371/JOURNAL.PGEN.1007065.
- Vendrely KM, Kumar S, Li X, Vaughan AM. 2020. Humanized mice and the rebirth of malaria genetic crosses. Trends Parasitol. 36(10):850–863. doi: 10.1016/j.pt.2020.07.009.
- Wong W, Griggs AD, Daniels RF, Schaffner SF, Ndiaye D, Bei AK, Deme AB, MacInnis B, Volkman SK, Hartl DL, et al. 2017. Genetic relatedness analysis reveals the cotransmission of genetically related Plasmodium falciparum parasites in Thiès, Senegal. Genome Med. 9(1):5. doi: 10.1186/s13073-017-0398-0.
- Wong W, Schaffner SF, Thwing J, Seck MC, Gomis J, Diedhiou Y, Sy N, Ndiop M, Ba F, Diallo I, et al. 2024. Evaluating the performance of Plasmodium falciparum genetic metrics for inferring national malaria control programme reported incidence in Senegal. Malar J. 23(1):68. doi: 10.1186/s12936-024-04897-z.
- Wong W, Volkman S, Daniels R, Schaffner S, Sy M, Ndiaye YD, Badiane AS, Deme AB, Diallo MA, Gomis J, et al. 2022. RH: a genetic metric for measuring intrahost Plasmodium falciparum relatedness and distinguishing cotransmission from superinfection. PNAS Nexus. 1(4):pgac187. doi: 10.1093/PNASNEXUS/PGAC187.
- Wong W, Wenger EA, Hartl DL, Wirth DF. 2018. Modeling the genetic relatedness of Plasmodium falciparum parasites following meiotic recombination and cotransmission. PLoS Comput Biol. 14(1):e1005923. doi: 10.1371/journal.pcbi.1005923.
- Zhao H, Liang F. 2001. On relationship inference using gamete identity by descent data. J Comput Biol. 8(2):191–200. doi: 10.1089/106652701300312940.
- Zhou Y, Browning SR, Browning BL. 2020. A fast and simple method for detecting identity-by-descent segments in large-scale data. Am J Hum Genet. 106(4):426–437. doi: 10.1016/j.ajhg.2020.02.010.