Scientific Reports. 2022 Dec 7;12:21150. doi: 10.1038/s41598-022-25568-6

Geographical classification of malaria parasites through applying machine learning to whole genome sequence data

Wouter Deelder 1,2, Emilia Manko 1, Jody E Phelan 1, Susana Campino 1,#, Luigi Palla 1,3,#, Taane G Clark 1,✉,#
PMCID: PMC9729610  PMID: 36476815

Abstract

Malaria, caused by Plasmodium parasites, is a major global health challenge. Whole genome sequencing (WGS) of Plasmodium falciparum and Plasmodium vivax genomes is providing insights into parasite genetic diversity and transmission patterns, and can inform decision making for clinical and surveillance purposes. Advances in sequencing technologies are helping to generate large genomic datasets in a timely manner, with the prospect of applying Artificial Intelligence analytical techniques (e.g., machine learning) to support programmatic malaria control and elimination. Here, we assess the potential of applying deep learning convolutional neural network approaches to predict the geographic origin of infections (continents, countries, GPS locations) using WGS data of P. falciparum (n = 5957; 27 countries) and P. vivax (n = 659; 13 countries) isolates. Using identified high-quality genome-wide single nucleotide polymorphisms (SNPs) (P. falciparum: 750 k, P. vivax: 588 k), an analysis of population structure and ancestry revealed clustering at the country level. When predicting locations for both species, classification (compared to regression) methods had the lowest distance errors, and > 90% accuracy at a country level. Our work demonstrates the utility of machine learning approaches for geo-classification of malaria parasites. With timelier WGS data generation across more malaria-affected regions, the performance of machine learning approaches for geo-classification will improve, thereby supporting disease control activities.

Subject terms: Population genetics, Machine learning

Introduction

Malaria, caused by Plasmodium parasites and transmitted by Anopheles mosquitoes, remains a pressing global health problem, with a mortality and morbidity burden heavily concentrated among children less than five years old. The morbidity and mortality impacts of Plasmodium falciparum malaria are predominantly concentrated in Sub-Saharan Africa, whereas the burdens of Plasmodium vivax are most heavily felt in Asia and South America1. The complex co-evolutionary history between Plasmodium parasites, humans, and Anopheles mosquitoes is contained within the genome of each organism, and genomic tools and data are of key importance for understanding the fundamental genetic underpinning of malaria, its geo-spatial distribution and control strategies to eliminate it. A rapidly growing number of P. falciparum and P. vivax isolates have undergone whole genome sequencing (WGS), and continued advances in genomic technologies are likely to accelerate the timely generation of datasets from clinical and surveillance blood samples to inform disease epidemiology and control.

The rich information contained in WGS data can be used to infer transmission patterns, detect drug resistance, and support wider malaria control initiatives and elimination strategies2,3. WGS data in combination with population genomic methods can detect selective sweeps associated with drug resistance and infer the geographic origin of infections, indicating whether infections are imported or drug resistant and whether treatment should be adapted accordingly. It is known that malaria parasites have a population structure primarily based on geography4,5. Several informative molecular barcodes for speciation and geography have been developed2,3, but typically these barcodes have not used the whole genome due to the high dimensionality of the data and the associated computational cost3. However, machine learning (a subfield of Artificial Intelligence), with its ability to incorporate and analyse very large and high-dimensional datasets in an efficient manner, appears well suited to geo-prediction using WGS data. Machine learning can be applied for classification, which concerns predicting a label (e.g., country, continental region), and regression, which involves predicting a quantity (e.g., longitude or latitude).

Machine learning has been applied effectively across a variety of problems in malaria research, including the detection of evolutionary selection associated with drug resistance6,7, the classification and detection of parasites in red blood cells8–11, and antimalarial drug discovery12. Deep learning is a subset of machine learning where algorithms aim to extract and learn series of hierarchical representations, often leveraging large amounts of data. The application of deep learning, and especially neural networks, has been explored within population genetics13,14, including for other pathogens15,16. Pioneering work has also shown that machine learning, including deep learning convolutional neural networks (CNNs), can be used to predict geographic locations from human, mosquito and P. falciparum genetic variation17, building on methods and the use of large genotyping chips or WGS for population structure assessment18,19. Here, we aim to further expand on the application of geo-prediction for malaria parasites by using a very large dataset of isolates sourced globally (P. falciparum, n = 5957, 27 countries; P. vivax, n = 659, 13 countries) across 11 regions (South East Asia (SEA), Southern SEA (SSEA), South Asia, South America, West Africa, Central Africa, South Central Africa, East Africa, Horn of Africa, Southern Africa, Oceania). We explore the potential of both regular machine learning approaches that aim to learn representations from sequence and geographical data, as well as deep learning approaches that aim to learn and extract layers of hierarchical representations of SNP combinations linked to geography. We compare four commonly applied approaches, including classification methods that predict locations and subsequently interpolate to specific coordinates, and compare performance across geographies (countries) both when their isolates are included in, and when they are excluded from, the training sets used to develop the models.

Materials and methods

Processing of raw sequencing data

Publicly available raw Illumina (> 150 bp paired end) sequence data from previously published studies of P. falciparum and P. vivax were downloaded from the ENA repository (see S1 Table and S2 Table for accession numbers), accompanied by meta-data including locations of sampling (see S1 Table and S2 Table for latitude and longitude coordinates). The data included public raw sequence and GPS data from MalariaGEN projects (www.malariagen.net). Raw WGS data for P. falciparum (n = 5957) and P. vivax (n = 659) were aligned with the Pf3D7 (v3) and PvP01 (v1) reference genomes, respectively, using bwa-mem software (v0.7.12) with default parameter settings (e.g., concerning mismatch and sequence read clipping penalties; see http://bio-bwa.sourceforge.net/bwa.shtml). The samtools (v1.9) functions fixmate and markdup were applied to the resulting BAM files to call a set of potential variants20. For variant quality control, calibration assessments were performed using the GATK’s BaseRecalibrator and ApplyBQSR functions, benchmarking off known high quality variants from genetic crosses for P. falciparum5,21 and previously curated datasets for P. vivax20. A revised set of SNPs and insertions/deletions (indels) was called with GATK’s HaplotypeCaller (version 4.1.4.1) using the option -ERC GVCF5,22. Variants were then assigned a quality score using GATK’s Variant Quality Score Recalibration (VQSR), and those with a VQSLOD score < 0, representing variants more likely to be false than true, were filtered out7,22. Additionally, SNPs were removed if they had more than 10% missing alleles7,22. The resulting dataset comprised parasite genomes of P. falciparum (5,957 isolates, 750 k SNPs) and of P. vivax (659 isolates, 588 k SNPs). The population structure was assessed using a principal component analysis (PCA) of between-isolate SNP differences. In parallel, ADMIXTURE analysis23 was performed to understand the composition of ancestral groups across geography, where the optimal number of groups (K) was established using cross-validation with values ranging between 1 and 20. This cross-validation analysis led to 10 ancestral groups for both P. falciparum and P. vivax (K = 10).
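
As an illustration of the SNP missingness filter and the PCA step described above, the following is a minimal sketch and not the pipeline used in this study. It assumes a hypothetical genotype matrix coded numerically with NaN for missing calls, imputes remaining missing calls by the per-SNP mean, and runs PCA directly on the centred genotype matrix (rather than on a between-isolate distance matrix); the function names and toy data are illustrative only.

```python
import numpy as np

def filter_snps_by_missingness(genotypes, max_missing=0.10):
    """Drop SNP columns whose fraction of missing calls exceeds max_missing.

    genotypes: (n_isolates, n_snps) float array, with np.nan marking a missing call.
    Returns the filtered matrix and the boolean mask of retained SNPs.
    """
    missing_fraction = np.isnan(genotypes).mean(axis=0)
    keep = missing_fraction <= max_missing
    return genotypes[:, keep], keep

def pca_scores(genotypes, n_components=2):
    """Project isolates onto the top principal components via SVD."""
    # Mean-impute any remaining missing calls per SNP (an assumption of this sketch).
    col_means = np.nanmean(genotypes, axis=0)
    filled = np.where(np.isnan(genotypes), col_means, genotypes)
    centred = filled - filled.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)  # proportion of variance per component
    return u[:, :n_components] * s[:n_components], explained[:n_components]

# Toy data: 4 isolates x 4 SNPs; the last SNP is missing in half of the isolates.
toy = np.array([
    [0, 0, 1, np.nan],
    [0, 1, 1, np.nan],
    [1, 0, 0, 1.0],
    [1, 1, 0, 1.0],
], dtype=float)

filtered, keep = filter_snps_by_missingness(toy, max_missing=0.10)
scores, explained = pca_scores(filtered, n_components=2)
print(keep, scores.shape, explained)
```

In this toy example the fourth SNP is removed because 50% of its calls are missing, and the remaining three SNPs are used for the projection.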

Statistical models and performance

Using machine learning (ML) and deep learning (DL) statistical models, the goal was to use SNPs to predict geographical source at a location (GPS), country, and regional resolution. We applied two standard models for classification at a country and region level: (1) penalized multinomial logistic regression classifier (LOG-C; ML); (2) CNN (CNN-C; DL). Subsequently, we used the predictive probabilities placed on different locations to perform a weighted interpolation between these locations and make predictions at the GPS coordinate level.

In particular, the final prediction location (longitude and latitude) was determined by a weighted average of classifier predictions, where weights are the probabilities placed by the model on each location.
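
The weighted interpolation described above can be sketched as follows. This is an illustrative example under stated assumptions: the two locations and the class probabilities are hypothetical, and latitude and longitude are averaged arithmetically, which ignores the curvature of the Earth and wrap-around at the antimeridian.

```python
import numpy as np

def interpolate_location(class_probs, class_coords):
    """Probability-weighted average of the coordinates of each candidate location.

    class_probs:  (n_samples, n_classes) array; each row sums to 1.
    class_coords: (n_classes, 2) array of (latitude, longitude) in degrees.
    Returns an (n_samples, 2) array of predicted (latitude, longitude).
    """
    probs = np.asarray(class_probs, dtype=float)
    coords = np.asarray(class_coords, dtype=float)
    return probs @ coords

# Two hypothetical sampling locations (latitude, longitude).
class_coords = [(10.0, 100.0), (20.0, 110.0)]

# First isolate: 75% / 25% split between the two locations; second: fully assigned.
class_probs = [[0.75, 0.25],
               [0.00, 1.00]]

predicted = interpolate_location(class_probs, class_coords)
print(predicted)
```

An isolate assigned entirely to one location is placed exactly at that location, whereas an uncertain assignment is pulled towards the other candidate locations in proportion to their probabilities.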

We also applied two regression models for GPS coordinate prediction: (3) penalised linear regression model (LIN-R; ML); (4) CNN (CNN-R; DL). The LOG-C and LIN-R models were tuned on the regularization strength C for the L1 penalty (LASSO) and implemented in the sklearn Python package (https://scikit-learn.org). The penalty parameters were tuned using cross-validation (see below, S3 Table). The deep learning CNN architecture was implemented using the Keras library (version 2.2.4)24 in Python. Our CNN models had an architecture with a soft-max prediction layer and regularization through dropout25 to prevent overfitting and support transferability. The main model had one convolutional layer with 4 filters, with respective filter size of (40, 9), followed by two drop-out and dense layers with ReLU activation (similar to17), and applied the Stochastic Gradient Descent algorithm for optimisation. We trained and validated the models for 1000 epochs. The parameterisation of the models is summarised in S3 Table. We created a stratified three-way split of the dataset (80% training, 10% validation, 10% test) for all models, and used the validation dataset to cross-validate parameters (S3 Table). The LOG-C and LIN-R models were cross-validated (stratified, four-fold) on the regularization strength C for the L1 penalty. The reported scores (accuracy, mean weighted distance error) were calculated by making predictions on the hold-out test set (see S3 Table for the final parameter set). In addition, we conducted a “leave-one-geography-out” analysis, where each single geography in the training dataset was omitted in turn, with the model trained on the remaining geographies, to understand generalizability towards previously unseen locations26.
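
To make the data split and the tuning of the L1 penalty concrete, the following is a minimal sketch using scikit-learn. It is not the code used in this study: the genotype matrix is synthetic, the three "countries" and the grid of candidate C values are hypothetical, and C is tuned on a single validation set rather than by four-fold cross-validation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a genotype matrix: 3 hypothetical countries,
# 60 isolates each, 50 binary SNPs with country-specific allele frequencies.
n_per_country, n_snps = 60, 50
freqs = rng.uniform(0.05, 0.95, size=(3, n_snps))
X = np.vstack(
    [rng.binomial(1, f, size=(n_per_country, n_snps)) for f in freqs]
).astype(float)
y = np.repeat(np.arange(3), n_per_country)

# Stratified split: 80% training, then the remaining 20% halved into
# validation (10%) and test (10%).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Tune the inverse regularization strength C of the L1 (LASSO) penalty
# on the validation set; the grid below is an arbitrary illustration.
candidate_C = [0.01, 0.1, 1.0, 10.0]
val_scores = {}
for C in candidate_C:
    model = LogisticRegression(penalty="l1", solver="saga", C=C, max_iter=2000)
    model.fit(X_train, y_train)
    val_scores[C] = model.score(X_val, y_val)

best_C = max(val_scores, key=val_scores.get)

# Refit with the selected C and evaluate once on the hold-out test set.
final_model = LogisticRegression(penalty="l1", solver="saga", C=best_C, max_iter=2000)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
class_probs = final_model.predict_proba(X_test)  # per-country probabilities
print(best_C, test_accuracy)
```

The per-country probabilities returned by `predict_proba` are the quantities that a classifier of this kind would pass on to the weighted interpolation step.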

Classification accuracy was determined after assigning predicted latitude and longitude pairs to individual countries. For the classification models, a mean (weighted) distance error was calculated using the Haversine method to allow for (angular) distance calculations along a sphere, based on the difference between the actual and estimated location. The latter was determined by a weighted average of classifier predictions, where weights are the probabilities placed by the model on each location. The accuracy was calculated based on the labels of the prediction versus the test data. For reference, the baseline accuracy using a naive prediction based on the most common country would be 18.8% for P. falciparum (Cambodia) and 24.3% for P. vivax (Thailand). For the regression models, the error was calculated using the Haversine method based on the difference between the predicted and actual latitude and longitude using angular distance.
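
For readers unfamiliar with the Haversine method, a minimal sketch of the distance error calculation is given below. The Earth radius of 6371 km is an assumption of this example (the value used in the study is not stated here), and the function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumed for this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def mean_distance_error_km(actual, predicted):
    """Mean Haversine distance over paired (latitude, longitude) tuples."""
    distances = [
        haversine_km(a_lat, a_lon, p_lat, p_lon)
        for (a_lat, a_lon), (p_lat, p_lon) in zip(actual, predicted)
    ]
    return sum(distances) / len(distances)

# One degree of longitude along the equator is roughly 111.2 km.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)

# Hypothetical example: one exact prediction and one that is a degree off.
error = mean_distance_error_km(
    actual=[(0.0, 0.0), (0.0, 0.0)],
    predicted=[(0.0, 0.0), (0.0, 1.0)],
)
print(round(one_degree, 1), round(error, 1))
```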

Results

Malaria isolate sequence data and population structure

Raw WGS data with accompanying geographic origin information were available in the public domain for P. falciparum (n = 5957, 27 countries) and P. vivax (n = 659, 13 countries) (Table 1), which represent the global distributions for each parasite. Most P. falciparum isolates were sourced from SEA (2,648, 44.5%) followed by West Africa (2,042, 34.3%) and East Africa (451, 7.6%). For P. vivax, most isolates were sourced from SEA (282, 42.9%) followed by South America (220, 33.4%) and SSEA (48, 7.3%) (Table 1). By analysing each species separately, high quality genome-wide SNPs were identified across the isolates (P. falciparum 750 k SNPs, P. vivax 588 k SNPs). Most SNPs had low minor allele frequencies (SNPs with MAF < 1%: P. falciparum 94.6%, P. vivax 77.6%) (S1 Figure). Most SNPs were in genic regions (P. falciparum 76.5%, P. vivax 54.3%), with a high proportion of non-synonymous (NS) amino acid changes (P. falciparum 63.0%, P. vivax 42.5%). The genetic diversity amongst P. falciparum isolates was relatively homogeneous across the 27 countries (SNP π: median 0.037, range 0.027–0.053), and lower in magnitude than P. vivax, whose data was sourced from 13 countries (SNP π: median 0.056, range 0.037–0.066) (Table 1).

Table 1.

Sample origin and SNP Diversity by geographic location.

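| Region | Country | Pf SNP diversity | Pf N* | Pf % | Pv SNP diversity | Pv N** | Pv % |
|---|---|---|---|---|---|---|---|
| West Africa | Benin | 0.040 | 76 | 1.3 | | | |
| West Africa | Burkina Faso | 0.028 | 86 | 1.4 | | | |
| West Africa | Gambia | 0.035 | 164 | 2.8 | | | |
| West Africa | Ghana | 0.033 | 928 | 15.6 | | | |
| West Africa | Guinea | 0.040 | 161 | 2.7 | | | |
| West Africa | Ivory Coast | 0.034 | 70 | 1.2 | | | |
| West Africa | Mali | 0.034 | 378 | 6.3 | | | |
| West Africa | Mauritania | 0.035 | 77 | 1.3 | | | |
| West Africa | Nigeria | 0.050 | 18 | 0.3 | | | |
| West Africa | Senegal | 0.039 | 84 | 1.4 | | | |
| East Africa | Kenya | 0.035 | 116 | 1.9 | | | |
| East Africa | Tanzania | 0.035 | 320 | 5.4 | | | |
| East Africa | Uganda | 0.053 | 15 | 0.3 | | | |
| Horn of Africa | Ethiopia | 0.048 | 25 | 0.4 | 0.060 | 44 | 6.7 |
| Central Africa | Cameroon | 0.033 | 237 | 4.0 | | | |
| South Central Africa | DRC | 0.032 | 339 | 5.7 | | | |
| Southern Africa | Madagascar | 0.040 | 24 | 0.4 | | | |
| Southern Africa | Malawi | 0.027 | 29 | 0.5 | | | |
| South Asia | India | | | | 0.062 | 40 | 6.1 |
| South Asia | Bangladesh | 0.037 | 83 | 1.4 | | | |
| South East Asia (SEA) | Cambodia | 0.040 | 1118 | 18.8 | 0.049 | 70 | 10.6 |
| South East Asia (SEA) | Laos | 0.039 | 126 | 2.1 | | | |
| South East Asia (SEA) | Myanmar | 0.039 | 246 | 4.1 | 0.061 | 27 | 4.1 |
| South East Asia (SEA) | Thailand | 0.038 | 928 | 15.6 | 0.056 | 160 | 24.3 |
| South East Asia (SEA) | Vietnam | 0.036 | 147 | 2.5 | 0.048 | 13 | 2.0 |
| South East Asia (SEA) | China | | | | 0.066 | 12 | 1.8 |
| Southern SEA (SSEA) | Malaysia | | | | 0.040 | 48 | 7.3 |
| South America | Colombia | 0.046 | 16 | 0.3 | 0.055 | 30 | 4.6 |
| South America | Peru | 0.037 | 24 | 0.4 | 0.059 | 88 | 13.4 |
| South America | Brazil | | | | 0.061 | 82 | 12.5 |
| South America | Mexico | | | | 0.039 | 20 | 3.0 |
| Oceania | PNG | 0.040 | 120 | 2.0 | 0.037 | 24 | 3.6 |
| Total | | | 5955 | 100 | | 658 | 100 |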

Pf P. falciparum, Pv P. vivax; PNG Papua New Guinea; DRC Democratic Republic of Congo.

Unsupervised clustering methods were applied to the genome-wide SNPs of each species to reveal the extent of their population structure and linked (pseudo-)ancestral patterns. Principal component analysis (PCA) of P. falciparum and P. vivax isolates revealed the expected separation by continent, and clear evidence of population structure at both the regional and country level (Fig. 1). An analysis of population structure and ancestry using ADMIXTURE software23 determined the number of ancestral groups (P. falciparum K = 10, P. vivax K = 10), and their relative abundance for each isolate was estimated (Fig. 2). For P. falciparum, there were dominant ancestral groups across region and continent (Africa 4, SEA 4, Oceania 1, South America 1), with some evidence of mixture of ancestries (e.g., SEA isolates with 3 ancestral populations), but a general consistency within country. For P. vivax, the numbers of dominant ancestral groups by region differed from P. falciparum (South America 4, SEA 2, SSEA 2, East Africa 1, South Asia 1), due to sampling and Plasmodium species endemicity differences, such as the near absence of P. vivax in Africa. Overall, there was more homogeneity of ancestral groups within P. vivax isolates, with some groups broadly linked to neighbouring countries (compare with Fig. 1). These analyses confirmed that spatial-genomic clustering and classification is possible using WGS data.

Figure 1.

Population structure using principal component analysis based on all high-quality SNPs. Axes show percentage of variation explained by each principal component (PC).

Figure 2.

ADMIXTURE analysis involving 10 inferred ancestral populations (denoted as K1 to K10).

Application of geo-classification models

For P. falciparum, the predictive performance of the classification methods (LOG-C, CNN-C) was stronger than that of the regression models (LIN-R, CNN-R) in regional (Table 2) and country-wide (Table 3) analyses (mean distance error (km): LIN-R 470, LOG-C 93, CNN-R 245, CNN-C 77). For locations included in the training dataset, the accuracy of the classification models was close to 100% at the regional level, and close to 90% at the country level (S4 Table, S5 Table). The poorest performance of the models was for African populations; for example, the mean distance error for CNN-C was high in West African (267 km) and East African countries (117 km, especially Kenya and Uganda), as well as Malawi (530 km) (Table 3), compared to other regions. This observation is consistent with the complex ancestries in African populations (Fig. 2), as well as another deep learning analysis17. As expected, when predicting countries absent from the training data, the distance errors (km) were at least ~ five-fold larger (LIN-R 2246, LOG-C 1848, CNN-R 1983, CNN-C 1540), with the poorest predictions for Peru (Table 4). The best performing model in this setting was the CNN-C classifier (Fig. 3).
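
The evaluation on countries absent from the training data can be sketched as follows. This is an illustrative example only, not the code used in this study: the genotypes are synthetic, the four country coordinates are hypothetical, and a default scikit-learn logistic regression stands in for the LOG-C and CNN-C classifiers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(1)

# Hypothetical country coordinates (latitude, longitude) and synthetic genotypes.
coords = np.array([[0.0, 0.0], [0.0, 10.0], [10.0, 0.0], [10.0, 10.0]])
n_per_country, n_snps = 30, 40
freqs = rng.uniform(0.05, 0.95, size=(len(coords), n_snps))
X = np.vstack(
    [rng.binomial(1, f, size=(n_per_country, n_snps)) for f in freqs]
).astype(float)
country = np.repeat(np.arange(len(coords)), n_per_country)

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km; inputs in degrees (NumPy arrays allowed)."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    a = (np.sin((phi2 - phi1) / 2) ** 2
         + np.cos(phi1) * np.cos(phi2) * np.sin(np.radians(lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

errors_km = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, country, groups=country):
    held_out = int(country[test_idx][0])
    # Train only on the remaining countries.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], country[train_idx])
    probs = model.predict_proba(X[test_idx])
    # Interpolate between the countries the model has seen (model.classes_).
    pred = probs @ coords[model.classes_]
    true_lat, true_lon = coords[held_out]
    errors_km[held_out] = float(
        np.mean(haversine_km(pred[:, 0], pred[:, 1], true_lat, true_lon))
    )

print(errors_km)
```

Because a classifier can only place probability on countries it was trained on, the interpolated prediction for a held-out country always falls among the remaining locations, which is one reason the errors for unseen geographies are expected to be larger.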

Table 2.

Mean distance Error (km) per model by region using geographies included in the training data.

| Parasite | Region | N | LIN-R* | LOG-C* | CNN-R* | CNN-C* |
|---|---|---|---|---|---|---|
| Pf | West Africa | 2042 | 665 [375–1354] | 302 [5–681] | 368 [161–1169] | 267 [45–728] |
| Pf | East Africa | 451 | 708 [693–1198] | 200 [3–1581] | 297 [289–856] | 117 [0–1856] |
| Pf | Horn of Africa | 25 | 569 [569–569] | 0 [0–0] | 124 [124–124] | 0 [0–0] |
| Pf | Central Africa | 237 | 635 [635–635] | 29 [29–29] | 184 [184–184] | 0 [0–0] |
| Pf | SC Africa | 339 | 478 [478–478] | 3 [3–3] | 34 [34–34] | 0 [0–0] |
| Pf | Southern Africa | 53 | 490 [490–968] | 7 [7–433] | 1543 [1018–1543] | 0 [0–530] |
| Pf | SEA | 2648 | 312 [247–744] | 19 [8–121] | 152 [39–559] | 7 [0–53] |
| Pf | South America | 40 | 1936 [1820–2053] | 3 [0–7] | 3683 [2535–4832] | 0 [0–0] |
| Pf | Oceania | 120 | 488 [488–488] | 0 [0–0] | 697 [697–697] | 0 [0–0] |
| Pv | Horn of Africa | 44 | 334 [334–334] | 0 [0–0] | 142 [142–142] | 0 [0–0] |
| Pv | South Asia | 40 | 500 [500–500] | 0 [0–0] | 517 [517–517] | 0 [0–0] |
| Pv | South East Asia | 282 | 616 [156–2751] | 25 [0–1033] | 578 [288–704] | 0 [0–1463] |
| Pv | Southern SEA | 48 | 213 [213–213] | 0 [0–0] | 957 [957–957] | 0 [0–0] |
| Pv | South America | 220 | 906 [134–3080] | 0 [0–0] | 667 [574–2773] | 0 [0–0] |
| Pv | Oceania | 24 | 175 [175–175] | 0 [0–0] | 1103 [1103–1103] | 0 [0–0] |

Pf P. falciparum, Pv P. vivax, * mean [range], CNN Convolutional Neural Network, SC South Central, SEA South East Asia; LOG-C multinomial logistic regression classifier; CNN-C CNN classifier; LIN-R penalised linear regression model; CNN-R CNN regression model.

Table 3.

Mean distance error (km) per model on test data using those countries included in the training data.

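| Parasite | Region | Location | LIN-R | LOG-C | CNN-R | CNN-C |
|---|---|---|---|---|---|---|
| P. falciparum | West Africa | Benin | 700 | 4 | 354 | 45 |
| P. falciparum | West Africa | Burkina Faso | 374 | 96 | 161 | 88 |
| P. falciparum | West Africa | Gambia | 775 | 132 | 317 | 107 |
| P. falciparum | West Africa | Ghana | 401 | 48 | 193 | 52 |
| P. falciparum | West Africa | Guinea | 751 | 515 | 459 | 402 |
| P. falciparum | West Africa | Ivory Coast | 630 | 681 | 695 | 728 |
| P. falciparum | West Africa | Mali | 563 | 345 | 208 | 271 |
| P. falciparum | West Africa | Mauritania | 615 | 676 | 382 | 410 |
| P. falciparum | West Africa | Nigeria | 1039 | 329 | 1169 | 329 |
| P. falciparum | West Africa | Senegal | 1354 | 274 | 565 | 263 |
| P. falciparum | East Africa | Kenya | 693 | 200 | 297 | 117 |
| P. falciparum | East Africa | Tanzania | 707 | 3 | 289 | 0 |
| P. falciparum | East Africa | Uganda | 1198 | 1581 | 856 | 1856 |
| P. falciparum | Horn of Africa | Ethiopia | 568 | 0 | 124 | 0 |
| P. falciparum | Central Africa | Cameroon | 635 | 28 | 184 | 0 |
| P. falciparum | SC Africa | DRC | 477 | 2 | 34 | 0 |
| P. falciparum | Southern Africa | Madagascar | 490 | 6 | 1543 | 0 |
| P. falciparum | Southern Africa | Malawi | 968 | 432 | 1018 | 530 |
| P. falciparum | SEA | Bangladesh | 743 | 9 | 159 | 0 |
| P. falciparum | SEA | Cambodia | 312 | 18 | 112 | 21 |
| P. falciparum | SEA | Laos | 276 | 121 | 152 | 53 |
| P. falciparum | SEA | Myanmar | 360 | 10 | 559 | 0 |
| P. falciparum | SEA | Thailand | 247 | 7 | 39 | 7 |
| P. falciparum | SEA | Vietnam | 356 | 90 | 199 | 0 |
| P. falciparum | South America | Colombia | 2052 | 0 | 4832 | 0 |
| P. falciparum | South America | Peru | 1820 | 7 | 2535 | 0 |
| P. falciparum | Oceania | PNG | 488 | 0 | 697 | 0 |
| P. falciparum | | Mean | 470 | 93 | 245 | 77 |
| P. vivax | Horn of Africa | Ethiopia | 334 | 0 | 142 | 0 |
| P. vivax | South Asia | India | 500 | 0 | 517 | 0 |
| P. vivax | SEA | Cambodia | 638 | 25 | 648 | 0 |
| P. vivax | SEA | China | 2751 | 1033 | 704 | 1463 |
| P. vivax | SEA | Myanmar | 616 | 311 | 350 | 311 |
| P. vivax | SEA | Thailand | 604 | 0 | 288 | 0 |
| P. vivax | SEA | Vietnam | 156 | 0 | 578 | 0 |
| P. vivax | SSEA | Malaysia | 213 | 0 | 957 | 0 |
| P. vivax | South America | Brazil | 3080 | 0 | 2773 | 6 |
| P. vivax | South America | Colombia | 1057 | 0 | 667 | 0 |
| P. vivax | South America | Mexico | 134 | 0 | 1502 | 0 |
| P. vivax | South America | Peru | 755 | 0 | 574 | 0 |
| P. vivax | Oceania | PNG | 175 | 0 | 1103 | 0 |
| P. vivax | | Mean | 890 | 33 | 819 | 36 |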

DRC Democratic Republic of Congo; PNG Papua New Guinea; CNN Convolutional Neural Network; LOG-C multinomial logistic regression classifier; CNN-C CNN deep learner classifier; LIN-R penalised linear regression model; CNN-R Penalised CNN regression model; SC South Central; SEA South East Asia; SSEA Southern SEA.

Table 4.

Mean distance error (km) per model on test data for unseen geographies.

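| Parasite | Location | LIN-R | LOG-C | CNN-R | CNN-C |
|---|---|---|---|---|---|
| P. falciparum | Cambodia | 496 | 669 | 322 | 628 |
| P. falciparum | Cameroon | 959 | 1545 | 1472 | 1636 |
| P. falciparum | DRC | 1150 | 2331 | 2531 | 2456 |
| P. falciparum | Ethiopia | 1118 | 1760 | 1252 | 1394 |
| P. falciparum | Myanmar | 703 | 731 | 470 | 728 |
| P. falciparum | Peru | 9050 | 4050 | 5856 | 2400 |
| P. falciparum | Mean | 2246 | 1848 | 1983 | 1540 |
| P. vivax | Cambodia | 591 | 323 | 1709 | 564 |
| P. vivax | Ethiopia | 2499 | 5174 | 3528 | 4140 |
| P. vivax | Malaysia | 459 | 1594 | 3617 | 2064 |
| P. vivax | Peru | 2376 | 2943 | 1196 | 2852 |
| P. vivax | Mean | 1481 | 2508 | 2512 | 2405 |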

CNN Convolutional Neural Network; DRC Democratic Republic of Congo; LOG-C multinomial logistic regression classifier; CNN-C CNN deep learning classifier; LIN-R penalised linear regression model; CNN-R Penalised CNN regression model.

Figure 3.

Maps with predicted vs. actual locations for the best predictive models. Blue points are the actual locations in the dataset, red points are the predicted locations (where different to actual), and red lines link the actual and the predicted locations. CNN-C deep learning Convolutional Neural Network classifier. LOG-C penalised multinomial logistic regression classifier.

For P. vivax, the predictive performance of the classification methods (LOG-C, CNN-C) was also superior to that of the regression models (LIN-R, CNN-R) across regional (Table 2) and country-wide (Table 3) analyses (mean distance error (km): LIN-R 890, LOG-C 33, CNN-R 819, CNN-C 36). For locations included in the training dataset, the accuracy of the classification models was close to 100% at both the regional and country level, with the poorest performance in neighbouring China and Myanmar (S4 Table, S5 Table). The mean distance error for the countries not used in the development of the model was distinctly larger (km: LIN-R 1481, LOG-C 2508, CNN-R 2512, CNN-C 2405), with the poorest predictions for Ethiopia and Peru (Table 4). The best performing model in this setting was the LIN-R regression (Fig. 3).

Discussion

WGS data of Plasmodium parasites can be used to detect imported infections, drug resistance, and transmission patterns, thereby assisting decision making in clinical and malaria control settings. With the implementation of WGS gaining traction across health systems, there is an opportunity to implement statistical learning methodologies to assist surveillance activities. A clear use-case is the determination of the geographical origin of isolates, building on insights from previous work which shows that genomic data can be used to cluster parasites by geography2–5. Our work reveals that machine learning approaches, particularly those focusing on classification (e.g., deep learning CNNs), have the potential to accurately predict geographic locations at a GPS and country-level resolution. As expected, the performance was much stronger for isolates whose geographic origin was already represented at the country level in the dataset, demonstrating the need for WGS to be implemented more widely to fill country gaps in genetic diversity. The weakest predictions were for P. falciparum in West and East Africa, where common ancestries, mixed infections, movement of people, drug resistance and malaria endemicities can complicate genetic diversity analysis. The distance errors are similar to those in a previous machine learning analysis of P. falciparum (median < 20 km), which implemented a single deep learning approach on a smaller dataset17. Our CNN classification approach appeared to perform well across parasite species, was implemented with measures to minimise the effects of over-fitting, and its performance is likely to improve with greater isolate sampling and WGS data.

Whilst we have implemented a limited set of machine learning methods, there is scope to test alternative approaches (e.g., gradient boosted trees, support vector machines)16 or further optimise our model parametrisations (beyond the default settings) to improve performance. For example, while L1-penalized regression approaches are generally quite competitive, stability selection on top of the LASSO generally leads to improvements27. Moreover, the resulting model is white box and leads to a set of interpretable SNPs. CNNs are the most utilised deep learning network type, and known to outperform alternative approaches28. However, one limitation of CNN models is their “black box” nature, with a complex architecture consisting of several layers, which in our context (and others17) makes it difficult to establish which (combinations of) SNPs are informative for the geographical profiling. Other studies have used population genomic approaches to determine informative SNPs, with a focus on applying genotyping assays or amplicon sequencing for resource poor settings2,3. We provide computer code to implement the models, to assist future assessments in simulation or empirical studies. Future work should focus on the development of an online “geo-locator” tool that reveals a prediction of location, which can be assessed for its plausibility against the actual position, if known, and feed back into the model building and learning process. Such a framework could also be extended to integrate explicit drug resistance markers29, as well as genomic data for malaria vectors17, and use sequences generated on portable and field deployable sequencing platforms (e.g., Oxford Nanopore Technology MinION). Such tools would be of immediate value to malaria control programs in endemic countries, including those that are implementing elimination activities and wish to differentiate between locally acquired and imported infections. They would also assist those countries with low malaria burden, including through the detection of imported parasites that could threaten malaria elimination targets.

In summary, our study has demonstrated that machine learning methods can play an informative role in determining the geographic origin of WGS isolates, thereby providing important insights for both control and surveillance activities. Further, such approaches will be scalable when WGS becomes routine and cost effective, resulting in a setting with increasingly “big data” being available for decision making. The utility of this “learning” system will improve with time, as underlying methodologies and model performances improve with more data becoming available, and as they are implemented within informatic tools to assist surveillance and clinical decision making. This utility underscores the benefit of making sequencing data and linked geographical information publicly available to global databases in a more timely fashion to understand infection dynamics, the advantages of which have also been demonstrated by the COVID-19 crisis.

Conclusion

Advances in sequencing technologies are making real time genomics-informed surveillance and clinical management a reality. With the resulting big genomic datasets, our study has shown that machine learning methods, a subset of Artificial Intelligence, can accurately predict the geographical source of malaria parasites from sequence data. With greater geographical coverage and informatics infrastructure, such approaches will improve in performance and assist malaria control and elimination activities.

Acknowledgements

TGC was funded by Medical Research Council UK grants (MR/M01360X/1, MR/N010469/1, MR/R025576/1, MR/R020973/1 and MR/X005895/1). SC was funded by BloomsburySET and Medical Research Council UK grants (MR/M01360X/1, MR/R025576/1, MR/R020973/1 and MR/X005895/1). We thank Aleksei Ponomarev for providing support on Python coding.

Author contributions

W.D., S.C., L.P., and T.G.C. conceived and designed the study. E.M. and J.E.P. performed the bioinformatic processing of the raw sequencing data. W.D. and E.M. performed the population genetic and statistical analysis, under the supervision of S.C., L.P. and T.G.C. W.D. wrote the first draft of the manuscript. All authors commented on and edited the manuscript and approved the final version. W.D. and T.G.C. compiled the final manuscript.

Data availability

The raw WGS data is available from the European Nucleotide Archive (ENA) (see S1 Table and S2 Table for project accession numbers). Computing code and machine learning models are available from https://github.com/WDee/GeoComparison.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Susana Campino, Luigi Palla and Taane G. Clark.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-022-25568-6.

References

  • 1. World Health Organization. World Malaria Report (2020).
  • 2. Preston MD, et al. A barcode of organellar genome polymorphisms identifies the geographic origin of Plasmodium falciparum strains. Nat. Commun. 2014;5:1–7. doi: 10.1038/ncomms5052.
  • 3. Diez Benavente E, et al. A molecular barcode to inform the geographical origin and transmission dynamics of Plasmodium vivax malaria. PLoS Genet. 2020;16:e1008576. doi: 10.1371/journal.pgen.1008576.
  • 4. Diez Benavente E, et al. Distinctive genetic structure and selection patterns in Plasmodium vivax from South Asia and East Africa. Nat. Commun. 2021;12:1–11. doi: 10.1038/s41467-021-23422-3.
  • 5. Samad H, et al. Imputation-based population genetics analysis of Plasmodium falciparum malaria parasites. PLoS Genet. 2015;11:e1005131. doi: 10.1371/journal.pgen.1005131.
  • 6. Pybus M, et al. Hierarchical boosting: A machine-learning framework to detect and classify hard selective sweeps in human populations. Bioinformatics. 2015;31:493. doi: 10.1093/bioinformatics/btv493.
  • 7. Deelder W, et al. Using deep learning to identify recent positive selection in malaria parasite sequence data. Malar. J. 2021;20:1–9. doi: 10.1186/s12936-021-03788-x.
  • 8. Quan Q, Wang J, Liu L. An effective convolutional neural network for classifying red blood cells in malaria diseases. Interdiscip. Sci. Comput. Life Sci. 2020;12:217–225. doi: 10.1007/s12539-020-00367-7.
  • 9. Liang Z, et al. CNN-based image analysis for malaria diagnosis. In: Proc. 2016 IEEE Int. Conf. Bioinforma. Biomed. BIBM 2016. 2017:493–496. doi: 10.1109/BIBM.2016.7822567.
  • 10. Poostchi M, Silamut K, Maude RJ, Jaeger S, Thoma G. Image analysis and machine learning for detecting malaria. Transl. Res. 2018;194:36–55. doi: 10.1016/j.trsl.2017.12.004.
  • 11. Fuhad KMF, et al. Deep learning based automatic malaria parasite detection from blood smear and its smartphone based application. Diagnostics. 2020;10:329. doi: 10.3390/diagnostics10050329.
  • 12. Neves BJ, et al. Deep Learning-driven research for drug discovery: Tackling malaria. PLoS Comput. Biol. 2020;16:e1007025. doi: 10.1371/journal.pcbi.1007025.
  • 13. Flagel L, Brandvain Y, Schrider DR. The unreasonable effectiveness of convolutional neural networks in population genetic inference. Mol. Biol. Evol. 2019;36:220–238. doi: 10.1093/molbev/msy224.
  • 14. Sanchez T, Cury J, Charpiat G, Jay F. Deep learning for population size history inference: design, comparison and combination with approximate Bayesian computation. bioRxiv. 2020. doi: 10.1101/2020.01.20.910539.
  • 15. Deelder W, et al. Machine learning predicts accurately Mycobacterium tuberculosis drug resistance from whole genome sequencing data. Front. Genet. 2019;10:922. doi: 10.3389/fgene.2019.00922.
  • 16. Libiseller-Egger J, Phelan J, Campino S, Mohareb F, Clark TG. Robust detection of point mutations involved in multidrug-resistant Mycobacterium tuberculosis in the presence of co-occurrent resistance markers. PLoS Comput. Biol. 2020;16:e1008518. doi: 10.1371/journal.pcbi.1008518.
  • 17. Battey CJ, Ralph PL, Kern AD. Predicting geographic location from genetic variation with deep neural networks. Elife. 2020;9:1–22. doi: 10.7554/eLife.54507.
  • 18. Guillot G, Jónsson H, Hinge A, Manchih N, Orlando L. Accurate continuous geographic assignment from low- to high-density SNP data. Bioinformatics. 2016;32:1106–1108. doi: 10.1093/bioinformatics/btv703.
  • 19. Bhaskar A, Javanmard A, Courtade TA, Tse D, Valencia A. Novel probabilistic models of spatial genetic ancestry with applications to stratification correction in genome-wide association studies. Bioinformatics. 2017;33:879–885. doi: 10.1093/bioinformatics/btw720.
  • 20. Li H. Improving SNP discovery by base alignment quality. Bioinformatics. 2011;27:1157–1158. doi: 10.1093/bioinformatics/btr076.
  • 21. Miles A, et al. Indels, structural variation, and recombination drive genomic diversity in Plasmodium falciparum. Genome Res. 2016;26:1288–1299. doi: 10.1101/gr.203711.115.
  • 22.Benavente ED, et al. Genomic variation in Plasmodium vivax malaria reveals regions under selective pressure. PLoS ONE. 2017;12:e0177134. doi: 10.1371/journal.pone.0177134. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19:1655–1664. doi: 10.1101/gr.094052.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Chollet, F., & others. Keras. GitHub. Retrieved from https://github.com/fchollet/keras (2015).
  • 25.Srivastava N, Hinton G, Krizhevsky A, Salakhutdinov R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014;15:1929–1958. [Google Scholar]
  • 26.Mordelet F, Vert JP. ProDiGe: Prioritization of disease genes with multitask machine learning from positive and unlabeled examples. BMC Bioinform. 2011;12:1–15. doi: 10.1186/1471-2105-12-389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Mahé P, Tournoud M. Predicting bacterial resistance from whole-genome sequences using k-mers and stability selection. BMC Bioinform. 2018;19:1–11. doi: 10.1186/s12859-018-2403-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Alzubaidi L, et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J. Big Data. 2021;8:1–74. doi: 10.1186/s40537-021-00444-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Turkiewicz A, et al. Genetic diversity of the Plasmodium falciparum GTP-cyclohydrolase 1, dihydrofolate reductase and dihydropteroate synthetase genes reveals new insights into sulfadoxine-pyrimethamine antimalarial drug resistance. PLoS Genet. 2020;16:e1009268. doi: 10.1371/journal.pgen.1009268. [DOI] [PMC free article] [PubMed] [Google Scholar]

Data Availability Statement

The raw WGS data are available from the European Nucleotide Archive (ENA); project accession numbers are listed in S1 Table and S2 Table. Computing code and machine learning models are available from https://github.com/WDee/GeoComparison.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
