PLOS Neglected Tropical Diseases. 2021 Apr 7;15(4):e0008755. doi: 10.1371/journal.pntd.0008755

Combining natural language processing and metabarcoding to reveal pathogen-environment associations

David C Molik 1,2,*, DeAndre Tomlinson 3,#, Shane Davitt 1,¤,#, Eric L Morgan 2, Matthew Sisk 2, Benjamin Roche 4, Natalie Meyers 2, Michael E Pfrender 1
Editor: Peter John Myler 5
PMCID: PMC8055023  PMID: 33826634

Abstract

Cryptococcus neoformans is responsible for life-threatening infections that primarily affect immunocompromised individuals and has an estimated worldwide burden of 220,000 new cases each year—with 180,000 resulting deaths—mostly in sub-Saharan Africa. Surprisingly, little is known about the ecological niches occupied by C. neoformans in nature. To expand our understanding of the distribution and ecological associations of this pathogen we implement a Natural Language Processing approach to better describe the niche of C. neoformans. We use a Latent Dirichlet Allocation model to de novo topic model sets of metagenetic research articles written about varied subjects which either explicitly mention, inadvertently find, or fail to find C. neoformans. These articles are all linked to NCBI Sequence Read Archive datasets of 18S ribosomal RNA and/or Internal Transcribed Spacer gene-regions. The number of topics was determined based on the model coherence score, and articles were assigned to the created topics via a Machine Learning approach with a Random Forest algorithm. Our analysis provides support for a previously suggested linkage between C. neoformans and soils associated with decomposing wood. Our approach, using a search of single-locus metagenetic data, gathering papers connected to the datasets, de novo determination of topics, the number of topics, and assignment of articles to the topics, illustrates how such an analysis pipeline can harness large-scale datasets that are published/available but not necessarily fully analyzed, or whose metadata is not harmonized with other studies. Our approach can be applied to a variety of systems to assert potential evidence of environmental associations.

Author summary

We expand the utility of Natural Language Processing (NLP), backtracking through metabarcodes, utilizing papers that may not mention our subject of interest, C. neoformans, in a departure from usual text analysis methods. We confirm that C. neoformans is associated with decomposing wood which is reinforced by the inferred literature studied here on C. neoformans and its close congeneric relatives. This work demonstrates the potential utility of pairing NLP with single-locus metagenetic data for the study of Neglected Tropical Diseases. While the results of this article are largely confirmatory, we present a novel method to study the ecological niches of rare pathogens that leverages the immense amount of data available to researchers in the NCBI Sequence Read Archive (SRA) combined with a text-mining analysis based on Natural Language Processing. We demonstrate that text processing, noun identification, and verb identification can play an important role in analyzing a large corpus of documents together with metagenetic data. Forging this connection requires access to all of the available ecological 18S ribosomal RNA and Internal Transcribed Spacer NCBI SRA datasets. These datasets use metabarcoding to query taxonomic diversity in eukaryotic organisms, and in the case of the Internal Transcribed Spacer, they specifically target Fungi. The presence of specific species is inferred when diagnostic 18S or ITS gene region sequences are found in the SRA data. We searched for C. neoformans in all 18S and ITS datasets available and gathered all associated journal articles that either cite the SRA data accessions or are cited in the SRA data accessions. Published metagenetic data often have associated metadata including: latitude and longitude, temperature, and other physical characteristics describing the conditions in which the metagenetic sample was collected. 
These metadata are not always presented in consistent formats, so harmonizing study methods may be needed to appropriately compare metagenetic data, as is commonly required in meta-analysis studies. We present an analysis which takes as input articles associated with SRA datasets that were found to contain evidence of C. neoformans. We apply NLP methods to this corpus of articles to describe the niche of C. neoformans. Our results reinforce the current understanding of C. neoformans’s niche, indicating the pertinence of employing an NLP analysis to identify the niche of an organism. This approach could further the description of virtually any other organism that routinely appears in metagenetic surveys, especially pathogens, whose ecological niches are unknown or poorly understood.

Introduction

From the inception of the study of public health and epidemiology, two major challenges have been of paramount importance: identifying the origin of the pathogen and identifying the determinants of transmission. This work focuses on origin. An early example of confronting and overcoming these challenges through evidence and inference can be found in Dr. John Snow’s observations of cholera in 19th-century England. A major contributor to the foundations of modern epidemiology, Snow is perhaps best known for his role in fighting the infamous 1854 Broad Street cholera outbreak. Confronted with a devastating outbreak of cholera, Snow sought to determine its origin. Drawing on a body of indirect evidence, he hypothesized that cholera was transmitted via contaminated water [1]. His hypothesis was based on statistical evidence linking water supply companies and water sources to an increase in deaths [2]. In the modern era, indirect evidence again played an important role in establishing the link between El Niño and cholera outbreaks in South America in the 1990s, informing optimal strategies for vaccination [3,4]. Similarly, we are informed by this tradition of non-obvious inference as we pursue the niche of Cryptococcus neoformans through the use of Natural Language Processing (NLP).

Cryptococcus neoformans, a basidiomycete dimorphic yeast (subphylum Agaricomycotina), was first described by Francesco Sanfelice [5], and transferred to genus Cryptococcus by Jean Paul Vuillemin in 1901 [6]. It is the principal causative agent of cryptococcosis [7], a disease that usually presents as a meningitis coinfection in immunocompromised individuals, principally HIV/AIDS patients [8]. This pathogen is responsible for life-threatening infections and has an estimated worldwide burden of 220,000 new cases each year, with 180,000 resulting deaths. The African continent is home to 162,500 new cryptococcosis cases a year [9], with the majority of worldwide deaths occurring in sub-Saharan Africa [9]. For such a prevalent and destructive pathogen, it is somewhat surprising that the life cycle in nature and the ecological niches occupied by Cryptococcus spp. are not yet fully understood [10]. C. neoformans’s “classic” habitat is thought to be soil and avian guano [10]. However, the closely related species Cryptococcus gattii [11], responsible for the Vancouver Island (BC, Canada) outbreak of cryptococcosis in 1999, is documented to be associated with the bark of a variety of tree species (potentially over 50) [12]. Several recent studies (at least a dozen) have also found C. neoformans isolates in association with trees, such as eucalyptus in Egypt [13] and olive in Turkey [14]. Furthermore, recent work has shown that C. neoformans is able to grow both on live plant material, such as Arabidopsis seedlings and Douglas fir trees, and in saprobic association with dead plant materials [15]. Therefore, the ecological niche of Cryptococcus may be broader than previously recognized. 
While it is posited that most cases of cryptococcosis are not typically associated with a specific known environmental exposure [10], specific ecological associations could be obscured by sampling bias, because cryptococcosis typically affects immunocompromised individuals.

Previous work shows a community-level linkage between C. neoformans and woody decomposers [16–18]. While C. neoformans is generally associated with tree bark and soil, our results suggest that these are possibly common environments for C. neoformans, and that consequently the range of C. neoformans may be larger than expected, as evidenced by aerosol sampling in Columbia [19]. C. neoformans is hypothesized to be an accidental pathogen. The accumulation of genomic traits that make it an effective pathogen, such as the ability to form capsules and cross the blood-brain barrier via infected phagocytes [20–22], may have been acquired during its evolutionary history as a result of selective pressures unrelated to pathogenicity in human hosts. A greater understanding of the features of C. neoformans’s natural habitat will aid research into the accidental pathogenesis hypothesis [20,21,23]. This connection between virulence and the environment necessitates a deeper understanding of the particular environmental associations of the pathogen.

The distribution, number of species, and phylogenetic relationships among members of the Cryptococcus species complex have been difficult to accurately define, with recent propositions that C. neoformans and C. gattii are themselves polyphyletic aggregates representing at least seven different species [24]. Prior to the Vancouver Island outbreak, C. gattii had only been reported from tropical and subtropical regions [25]. This gap in observation shows that not only do cryptic environmental niches exist, but they could also have serious consequences for the epidemiology of cryptococcosis, lending weight to the argument that C. neoformans may have a broader range of habitats than is currently recognized. The classical method to identify Cryptococcus in nature has been the recovery and culture of natural isolates. The recent expansion in wealth and breadth of environmental metagenetic datasets now makes it possible to further uncover some of these cryptic niches, especially from studies not primarily targeted towards Cryptococcus spp. In this study we use NLP to overcome some of the hurdles of a more traditional meta-analysis. Published sample data often lack machine-readable metadata and extensive descriptions of sampling methods or physical measurements (e.g. salinity, temperature). Knowing this, we “look in” on, or computationally process, the descriptions of metagenetic samples given within the main body of articles. NLP can circumvent differences in scientific units and article format among studies that can cause errors in traditional meta-analysis. An extreme example of such a situation would be a meta-analysis based on articles which recorded temperature in either kelvins or degrees Celsius. Both are agreed scientific standards, and without metadata it is possible to mistake one for the other, even computationally. By using NLP on the body of the text of the article we can ignore such conflicts in the metadata. 
In our analysis, we make an assumption that the text in journal descriptions has a connection to the presence of C. neoformans in the metagenetic data used for the study. The assertion of this link could be misleading if the paper associated with the dataset did not actually describe the conditions in which the samples were found, if say, a dataset was associated after the publishing of the article: an important factor in the lack of metadata and description of the sample conditions.

In this study, the C. neoformans–woody decomposition association is expounded via a machine learning text-categorization analysis. To gather the data (the text of journal articles) for this analysis, the 18S portion of the eukaryotic ribosomal subunit (18S) and the Internal Transcribed Spacer (ITS, although note that in most papers, this one included, ITS usually refers to the 2nd ITS sometimes denoted as ITS2) of the reference sequences of C. neoformans were used as query sequences for the entirety of National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) [27] collection of metagenetic single-locus datasets of either 18S or ITS amplicons. The papers associated with the datasets containing C. neoformans DNA sequences, as indicated by the SRA search, were then collected and utilized in a random forest analysis. This analysis revealed that studies which found C. neoformans can be associated by their shared mentioning of terms indicating woody decomposition. This method, utilizing papers associated with metagenetic datasets, is useful in part because it reveals studies that found C. neoformans that might not explicitly mention C. neoformans. In our case, notably, only one paper mentioned C. neoformans within the body of its text necessitating an analytic approach that goes beyond simple search of papers that mention C. neoformans. In this study, the ecology of C. neoformans was elucidated using data from metabarcoding studies (for more information on metabarcoding see Box 1). Such an approach allows us to garner insights into species that are biomedically important and not well-studied ecologically.

Box 1. “What is Metabarcoding?”

Metagenetic single-locus data generally fall under the method of metabarcoding. Metabarcoding is a method of determining the taxonomic diversity within the contents of a sample by the analysis of the DNA sequence from a specific gene region [26]. This is much like using a primer to find a specific species, except that a generalized primer is used instead. Searching for Fungi with metabarcoding is usually done through two common gene regions, the 18S and the ITS; both are portions of the ribosomal DNA found in all eukaryotes. The technique of metabarcoding is mostly used in the analysis of microbiomes, for bacteria and archaea as well as eukaryotes. Metabarcoding has been used to locate multicellular organisms as well [26]. The datasets that were combed through for the presence of C. neoformans are the eukaryotic fractions of microbial communities.

Methods

We use NLP to determine “topics” that are associated with positive labeled papers, or papers associated with SRA samples that contained C. neoformans, and Random Forests (made of decision trees; for more information on decision trees see Box 2) to assess the quality of these “topics” (see Fig 1 for a visualization of the analysis steps). NLP techniques read and pseudo-comprehend human text and language to derive meaning and discover interactions through machine learning. Two areas that NLP focuses on are word syntax and word semantics [28]. The syntax of a sentence refers to the arrangement of words. Semantics refers to the meaning derived from the words and their arrangement. Model creation algorithms such as Latent Dirichlet Allocation (LDA) can be used as classifiers and predictors [28]. We use LDA in the de novo determination of the number of “topics” (henceforth: topics) or rough association of journal articles within our corpus of documents. We then utilize random forest in our machine learning text-categorization pipeline to assign journal articles to the LDA determined topics. The topics which have the bulk of papers containing the unstated connection to C. neoformans are the topics from which environmental niche associated words are drawn. Decision Tree Learning is a reliable evaluation method for classification methods and assessing model performance. Decision trees have been powerful classifiers in machine learning algorithms since 1975 [29]. Ensemble methods, a composite of several methods, have improved the performance of individual models by different means, notably by bootstrap aggregation [30]. Random forest is an adaptation of decision trees with a bootstrap advantage. Bootstrap aggregation involves creating multiple decision trees from resamples of the sample data and using majority voting across the trees to determine the final prediction [30]. 
With the ability to create hundreds of trees from varied samples of the training data, the overall performance of the final model is less over-fitted and has better performance than a single decision tree. In a departure from pure NLP research we determined the number of topic clusters via NLP evaluation and confirmatory visualization methods, and found that the majority of our positive hit papers formed a single topic. The results indicated that there was a combination of words that drew these papers together. Furthermore, these articles could be recalled and predicted reliably, indicating a measure of validity.

Box 2. What is a Decision Tree?

Decision trees operate on an entropy-reduction basis or a probability maximization basis, depending on the type. Decision trees split datasets or information by calculating which feature lowers the entropy of the system the most, repeating until the entropy is at a minimum [29]. In ID3 decision trees, information gain is the change in entropy from a previous state to a new state based on a condition, like a feature in a dataset [29]. Information gain determines the starting feature to be split by comparing the entropies across all candidate features, and choosing the split based on the feature that provides the lowest entropy, which is the highest information gain. CART (Classification and Regression Trees) use a similar measure called the Gini index, which is computed as 1 minus the sum of the squared probabilities of each class [43]. The feature with the lowest Gini index is chosen for a split, and this continues until the Gini index equals 0. The problem with both decision tree approaches is the dependence on the initial starting feature to be split. By the nature of the calculation, a single tree is susceptible to overfitting due to class imbalances and the initial starting node. Random forest resolves these two disadvantages by using a bootstrap aggregation (i.e. bagging) method [44]. It creates multiple decision trees with different starting nodes, and uses a majority vote system for the final prediction. This approach resolves both issues because it accounts for different starting points, and by having multiple randomly generated models running concurrently, overfitting is reduced even with class imbalances. The overfitting that one tree would have in the forest is mitigated by a different tree that did not overfit; this method is similar to taking an average of many different trees all at once, but more robust.
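The split criteria described in this box can be illustrated with a minimal Python sketch. The function names and the toy labels are hypothetical and are not taken from the study; the formulas are the standard Shannon entropy, Gini index, and information gain.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy labels, not data from the study.
parent = ["positive"] * 4 + ["negative"] * 4
perfect_split = [["positive"] * 4, ["negative"] * 4]
useless_split = [["positive", "positive", "negative", "negative"]] * 2

print(entropy(parent), gini(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that separates the classes completely recovers all of the parent entropy, while a split that leaves both children as mixed as the parent gains nothing.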

Fig 1. General Workflow for associating articles into topics, and then determining topic and paper assignment validity.

Four overarching actions are taken in this analysis: the initial barcode search, the results of which are used in a paper search, the resulting papers of which are used in a topic model, and finally the topic model is validated with a random forest model.

Searching NCBI SRA, collecting articles

Our initial task was to collect datasets that contain C. neoformans. We created a script that crawled the NCBI SRA for datasets generated using universal 18S and fungal specific ITS PCR primers. This search was denoted by the query run on the NCBI SRA database in October of 2018:

“ecological metagenomes” [Organism] AND (18S OR ITS) AND (cluster public[prop] AND “biomol dna” [Properties])

This query was used to find candidate metagenetic datasets, later pared down to just datasets containing C. neoformans via BLAST alignment. The SRA identifiers of samples were downloaded in fasta format. Using the command-line utility blastn_vdb, this file of SRA numbers was used to extract the corresponding NCBI datasets and search those datasets against reference sequences of the 18S and ITS regions diagnostic of C. neoformans. This search yielded the SRA numbers of the NCBI datasets that were specific (henceforth: positive) for C. neoformans (see supplement for SRA and gene accession numbers). The stringency parameters for the searches for both ITS and 18S were a percent identity value of 98% and an e-value of 1e-140, more stringent than the commonly used 97% identity for species-level identification. A query sequence spanning the entirety of the region was used, which may have limited the number of identified sequences. Of all possible 18S and ITS SRA datasets at the time of the query, 1380 were found to contain a sequence matching the query sequence.
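The stringency filter described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' script: it assumes the BLAST hits were written in the standard 12-column tabular layout (percent identity in column 3, e-value in column 11), and it assumes a hypothetical subject ID of the form run accession, a dot, then a read index. The example hit lines are invented.

```python
def passes_thresholds(line, min_identity=98.0, max_evalue=1e-140):
    """Check one tab-separated BLAST hit against the identity and e-value cutoffs."""
    fields = line.rstrip("\n").split("\t")
    percent_identity = float(fields[2])   # column 3 of the tabular layout
    evalue = float(fields[10])            # column 11 of the tabular layout
    return percent_identity >= min_identity and evalue <= max_evalue

def positive_runs(lines):
    """Collect run accessions that have at least one hit passing both cutoffs."""
    runs = set()
    for line in lines:
        if passes_thresholds(line):
            subject_id = line.split("\t")[1]
            runs.add(subject_id.split(".")[0])  # assumed "<run>.<read>" ID format
    return runs

# Invented example hits; the accessions are placeholders.
hits = [
    "query_ITS\tSRR0000001.15\t99.2\t300\t2\t0\t1\t300\t1\t300\t1e-150\t540",
    "query_ITS\tSRR0000002.7\t97.5\t300\t7\t0\t1\t300\t1\t300\t1e-145\t520",
    "query_18S\tSRR0000003.3\t98.6\t150\t2\t0\t1\t150\t1\t150\t1e-70\t270",
]
print(positive_runs(hits))
```

Only the first invented hit satisfies both cutoffs: the second falls below 98% identity and the third has too weak an e-value.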

After retrieving the SRA identifier of datasets that contained C. neoformans, the SRA was matched to the Bioproject ID. Many of the 1380 SRA datasets had redundant Bioproject IDs. With the Bioproject ID, the associated published journal article was retrieved via a direct link with the Bioproject ID, or through a Google Scholar and/or Google web search based on a match of the date, authors, and sequencing technology published in the article and the Bioproject. Articles retrieved via either of these methods were labeled “positive”; only one positive article explicitly mentioned C. neoformans in the text of the paper. SRA data that was searched and found to not contain C. neoformans was used as “negative.” Randomly selected SRA numbers, of all published SRA numbers under “ecological metagenomes” and “biomol dna”, were spot checked for their Bioproject IDs. Since Bioprojects often contain multiple SRA datasets, the other datasets in any given Bioproject had to be checked for the presence of C. neoformans diagnostic sequences. These steps ensured negative hits would be as close in format and topic as possible to the positive hits. Papers will typically only reference one Bioproject, but if any of the SRAs within the Bioproject contained C. neoformans, that paper would be considered “positive”. Random negative SRAs from the list of all SRA were selected, their Bioproject IDs retrieved and the Bioproject-linked papers were retrieved for use as the negative set. PDFs of the positive and negative articles were downloaded. We selected 31 negatively labeled control papers for inclusion. This step ensured that texts of the negative set would be similar enough to the text of papers in the positive set to enable further analysis driven by properties beyond the basic structure and writing style of the articles in question. 
Retrieved papers were read for references to sampling locations and environmental types; articles were assigned randomly to two different volunteer analysts for double assessment of the location and environment of each journal article thought to be a “positive” hit. The group of analysts was made up of the authors, as well as members of the Navari Family Center for Digital Scholarship. Since positive hit articles would, by necessity, mention metabarcoding and potentially have a high incidence of description related to metabarcoding processes and environments, there would be a degree of textual similarity. As an illustration, comparing the poems of Robert Louis Stevenson [31,32] with the “positive” hit papers using NLP could be expected to reveal clear differences between the research articles and Stevenson’s poems driven merely by their style and time period, overwhelming the question of whether the texts describe differences in niche associated with the presence of C. neoformans. In our case the analyzed C. neoformans NLP project corpus is all metabarcoding papers, which helps ensure that we can use NLP to examine differences between metabarcoding papers, rather than surfacing subject matter differences between wider ranging research articles.
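The labeling rule described above, in which a Bioproject (and therefore its paper) counts as positive if any one of its SRA datasets is positive, can be sketched as follows. The function name and all identifiers are hypothetical placeholders, not accessions from the study.

```python
def label_bioprojects(run_to_bioproject, positive_runs):
    """Label a Bioproject 'positive' if any of its SRA runs is positive, else 'negative'."""
    labels = {}
    for run, project in run_to_bioproject.items():
        if run in positive_runs:
            labels[project] = "positive"           # one positive run is enough
        else:
            labels.setdefault(project, "negative")  # never overwrite a positive label
    return labels

# Placeholder identifiers for illustration only.
run_to_bioproject = {
    "SRR0000001": "PRJNA000001",
    "SRR0000002": "PRJNA000001",
    "SRR0000003": "PRJNA000002",
}
labels = label_bioprojects(run_to_bioproject, positive_runs={"SRR0000001"})
print(labels)
```

This is why every dataset in a candidate negative Bioproject has to be checked: a single positive run flips the label for the whole project.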

The associated documents were classified with two binary class labels, “positive” and “negative”. We identified a total of 113 papers, but there was a skew towards positive papers resulting in a class imbalance between the positive and negative labeled documents. To rectify this imbalance, of the 113 papers, 82 labeled as positive hits of C. neoformans were used for training the model; 31 labeled as negative hits were used for testing. Creating the model this way means that we no longer have to account for the class imbalance and instead use the mismatch of our negative hits to measure our accuracy. The downloaded PDF files were converted into text files using an in-house R script with the “TM” library [33]. These raw, preprocessed documents were aggregated into a text corpus using the Gensim python library [34], which allowed for easier manipulation of all the text files at once, while retaining the information and properties of each document. Gensim was chosen as the scientific package of choice for its wide support and ease of use for a variety of machine learning applications and use cases, in addition to its memory efficient architecture for multiple trials. To obtain optimal results, data cleaning was performed on the text corpus. The articles were reviewed and optimized for machine-readable grammar (i.e. lemmatized) prior to the aggregation; many symbols, characters, whitespace, and words that are not relevant to understanding the essence of a sentence or a topic were removed [28,35]. Additionally, words were deleted from the corpus if they were noisy, or not relevant to the analysis, such as popular author and publisher names. Afterwards, the corpus was lemmatized and stemmed to root words using Gensim. Lastly, all punctuation and stop words were removed using the NLTK python package [36]. Stop words removed (i.e. not accepted for consideration within the analysis) were the base stop words from the NLTK package.
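The cleaning steps above can be illustrated with a dependency-free Python sketch. It is a stand-in, not the pipeline used in the study: the short stop-word set substitutes for the NLTK base list, the noisy-word set is hypothetical, and lemmatization and stemming are omitted.

```python
import re

# Small stand-in for the NLTK base stop-word list (assumption, not the real list).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "was", "were", "from", "with"}
# Hypothetical examples of noisy terms such as publisher names.
NOISY_WORDS = {"plos", "elsevier"}

def clean_document(text, stop_words=STOP_WORDS, noisy_words=NOISY_WORDS):
    """Lowercase, keep alphabetic tokens only, then drop stop, noisy, and very short words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens
            if t not in stop_words and t not in noisy_words and len(t) > 2]

sentence = "Soil samples were collected from the decomposing wood, 2018."
print(clean_document(sentence))
```

Note that this sketch discards digits, so a token such as "18S" would be lost; a real pipeline would need to decide whether such terms should be protected.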

Topic modeling

We generated topic models using the Latent Dirichlet Allocation (LDA) method. This algorithm, commonly used in computer science, has additional utility in well-constructed text documents, such as academic journal articles [28]. A Dirichlet distribution is a family of multivariate, continuous, probability distributions used in categorical distributions and Bayesian statistics [40]. It can be generalized as families of continuous probability distributions from “[0, 1]” with multiple variables. For comparison, consider Principal Component Analysis (PCA), which, like LDA, is an unsupervised algorithm; LDA is specifically designed for NLP while PCA is more general. PCA uses linear transformations to maximize the variance between features and assumes all data objects are continuous. In comparison, LDA uses multivariate continuous distributions and probability distributions, with the assumption that the data objects are discrete and in a “bag-of-words” format.
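To make the Dirichlet distribution concrete, the following minimal sketch draws one sample using only the Python standard library. It relies on the standard construction of normalizing independent Gamma draws; the concentration values and seed are arbitrary illustrative choices, not parameters from the study.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [draw / total for draw in draws]

rng = random.Random(42)                                  # fixed seed for repeatability
topic_mixture = sample_dirichlet([0.5, 0.5, 0.5], rng)   # arbitrary concentrations
print(topic_mixture, sum(topic_mixture))
```

Each sample is a vector of values in [0, 1] that sums to 1, which is why a Dirichlet draw can serve as a document's mixture over topics.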

To determine the topics from the text corpus, LDA essentially reverse engineers the document corpus. The documents to be recreated are represented as random mixtures over latent topics, and each topic is characterized by a distribution over all the words. When a particular set of words maximizes the likelihood of successfully recreating the document, that set of words is taken to belong to a particular topic.

To summarize (see: Fig 2), α and β can be compared to matrices, with row i and column j. In α, a value in row i and column j refers to how likely document i contains topic j, meanwhile in β that value represents how likely topic i contains word j. The Gensim library has an LDA package which generates the word-topic probabilities, and by mapping the word-topic probabilities to the corresponding documents, the final output of this portion of the method is a list of topic probabilities for the training and testing set with the appropriate “positive” or “negative” label from each document [35].
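The matrix view described above can be sketched with two small tables. The values, vocabulary, and function names are invented for illustration: `doc_topic` plays the role of the document-by-topic matrix and `topic_word` the role of the topic-by-word matrix.

```python
# Invented probabilities for two documents, two topics, and three words.
vocabulary = ["wood", "soil", "marine"]
doc_topic = [[0.9, 0.1],           # row i: how likely document i contains topic j
             [0.2, 0.8]]
topic_word = [[0.5, 0.4, 0.1],     # row i: how likely topic i contains word j
              [0.1, 0.2, 0.7]]

def word_probabilities(doc_topic_row, topic_word):
    """Mix the per-topic word distributions using one document's topic weights."""
    n_topics, n_words = len(topic_word), len(topic_word[0])
    return [sum(doc_topic_row[k] * topic_word[k][w] for k in range(n_topics))
            for w in range(n_words)]

def dominant_topic(doc_topic_row):
    """Index of the topic with the highest probability for a document."""
    return max(range(len(doc_topic_row)), key=lambda k: doc_topic_row[k])

first_doc = word_probabilities(doc_topic[0], topic_word)
print(dict(zip(vocabulary, first_doc)), dominant_topic(doc_topic[0]))
```

The rows of `doc_topic` are the kind of per-document topic probabilities that are later passed, together with the positive or negative label, to the classifier.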

Fig 2. Plate diagram displaying the conditional probabilistic dependencies in Latent Dirichlet Allocation.

Depicts the core of the LDA method. Plate diagrams are graphical models of probabilistic models with conditional dependence between random variables [37,38]. The boxes are “plates”; they are repeated entities of documents for the outer plate, and words in the inner plate. The Greek symbols denote the hidden or unsupervised areas of LDA, while the regular characters denote the user-created or supervised areas of this algorithm. α is the per-document topic distribution [39]. β is the per-topic word distribution, and φ is the word distribution for a singular topic K. Θ is the topic distribution in a particular document M, then Z is the topic for the N-th word in document M. These two pathways converge at W, which is the specific word. W is the observed variable while every other layer in this process is latent, or unsupervised.
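For reference, the dependencies in the plate diagram correspond to the following generative process. This is the standard smoothed formulation of LDA, which we assume matches the figure; in that formulation α and β act as the Dirichlet parameters that govern the per-document topic distributions Θ and the per-topic word distributions φ, respectively.

```latex
\begin{aligned}
\theta_m &\sim \mathrm{Dirichlet}(\alpha), & m &= 1,\dots,M \\
\varphi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
z_{m,n} &\sim \mathrm{Categorical}(\theta_m), & n &= 1,\dots,N_m \\
w_{m,n} &\sim \mathrm{Categorical}(\varphi_{z_{m,n}})
\end{aligned}
```

Here M is the number of documents, K the number of topics, and N_m the number of words in document m; only the words w are observed.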

Since the LDA model in our NLP pipeline is dependent on the number of topics and the optimal number of topics is unknown, it is necessary to create multiple models, each having a different number of topics as an initial parameter. After the model is created, its coherence and perplexity scores are used to evaluate its quality before inputting it into a supervised classifier (for more information on model evaluation see Box 3). Coherence is a measure that gives a general understanding of whether the topics are well-defined. Essentially, topic coherence scores show how well the words relate to each other semantically within a topic, and a higher coherence score indicates a better overall model. Model perplexity is a measurement of how well a model can predict unseen data, based on the probability distribution the model assigns to held-out samples [41].
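The model-selection step above can be sketched in a few lines of Python. The perplexity function uses the standard definition (the exponential of the negative mean log-probability of held-out tokens); the coherence scores are invented placeholders rather than results from the study, and in practice they would come from the topic-modeling library.

```python
import math

def perplexity(token_probabilities):
    """exp of the negative mean log-probability assigned to held-out tokens."""
    n = len(token_probabilities)
    return math.exp(-sum(math.log(p) for p in token_probabilities) / n)

def best_topic_count(coherence_by_k):
    """Return the number of topics whose model has the highest coherence score."""
    return max(coherence_by_k, key=coherence_by_k.get)

# Invented coherence scores keyed by number of topics (placeholders only).
coherence_by_k = {2: 0.41, 3: 0.52, 4: 0.47}
print(best_topic_count(coherence_by_k))
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

A model that assigns every held-out token a probability of 0.25 has a perplexity of 4, meaning it is as uncertain as a uniform choice among four words; lower perplexity is better.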

Box 3. Model Evaluation

Common machine learning practices include model evaluation of the statistical significance of the model classification. Generating a confusion matrix (Table 1) is the starting point that provides a general overview of the model’s capabilities. The four possible classifications are true positive (TP), true negative (TN), false positive (FP), and false negative (FN). A true positive indicates the model predicted a data object to be positive and was correct (in our case an article that found C. neoformans and was marked as “positive”). A false negative is a positive object that was incorrectly classified as negative (a “positive” paper classified as negative). Similarly, a false positive is a negative object that was classified as positive (a “negative” paper classified as positive). Lastly, a true negative implies the classifier correctly predicted a negative object (a “negative” paper classified as negative). The truth values in this experiment are the assigned “positive” and “negative” labels from the initial SRA search. The prediction values were binary “positive” and “negative” labels based on the topic probabilities from the LDA model. These groups are tallied, aggregated and analyzed to quantify the performance of the classifier.

It is possible to statistically evaluate the number and proportion of correctly positively identified journal articles from the Random Forest classifier using “Precision” and “Recall.” Precision is the proportion of elements predicted as positive that are actually positive. Recall is the true positive rate, which is the proportion of actual positive elements that were correctly identified as positive. These metrics can be skewed when there is a class imbalance, so a third statistic, called the F1 statistic, is used. It is the harmonic mean of precision and recall and gives a balanced evaluation of how the classifier is performing, since a high F1 score requires both the precision and recall scores to be high.

Table 1. Classic confusion matrix to visually analyze the classification performance of an algorithm.

| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Ground Truth Positive | True Positive (TP) | False Negative (FN) |
| Ground Truth Negative | False Positive (FP) | True Negative (TN) |
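
As a minimal sketch of how the four cells of the confusion matrix translate into the evaluation statistics described above, the following function computes accuracy, precision, recall, and F1 directly from counts. The function name and the counts are hypothetical and chosen only for illustration; they are not the counts from this study.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Precision: of everything predicted positive, how much was truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall (true positive rate): of all actual positives, how many were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=9, fn=1, fp=4, tn=6)
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, which is why a high F1 requires both to be high.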

Random forest classification

We used a supervised classification method to quantify how well the LDA model performed. The latent structure of an LDA model requires an evaluation method to determine the quality of its probability distributions and of the word-topic pairings it creates. Random Forest classification is a suitable method because it handles a wide variety of datasets while preserving accuracy and resisting overfitting. After the LDA topic probabilities were created, they were used as the basis for the random forest model. Since the classes are imbalanced between the “Positive” labeled and “Negative” labeled articles, the training and testing sets maintained the same proportional imbalance to preserve the integrity of the analysis. The random forest classifier, implemented with the scikit-learn Python library, builds an ensemble of decision trees with the topic probabilities as features and the positive or negative label as the final classification [42].

We created a series of scripts in order to implement LDA topic modeling, run Random Forest Classification, and calculate the AUC-ROC (see supplement). Random Forest is an evolution of the Decision Tree algorithm, in which a number of decision trees are created by randomly selecting portions of the available dataset.

In this experiment, the features for the random forest classifier were the topic probabilities, and the label was the binary class label, positive or negative. Since the journal articles are imbalanced between the numbers of “Positive” and “Negative” papers, the training and testing sets maintained the same proportional imbalance to preserve the integrity of the analysis. The random forest classifier, implemented with the scikit-learn Python library using default parameters, classified the articles based on the LDA model output [39,42].
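
Keeping the same class proportions in the training and testing sets is commonly called a stratified split. The following is a minimal standard-library sketch of the idea, not the script used in this study; the function name, the 3:1 class ratio, and the test fraction are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    # Split every class separately so the imbalance carries over to both sets.
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical 3:1 imbalance between the two labels.
labels = ["positive"] * 30 + ["negative"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
```

With these inputs, both the training and the testing indices keep the 3:1 ratio of the full label list.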

Area Under the Curve-Receiver Operating Characteristic (AUC-ROC) is an important evaluation measurement for binary classification. The ROC is a probability curve, and the AUC is a measure of separability. Combined, this metric shows how well a model can distinguish between classes. The two components of the ROC curve are the true positive rate and the false positive rate. The true positive rate (TPR) is the number of true positives divided by the sum of the true positives and the false negatives. The false positive rate (FPR) is the number of false positives divided by the sum of the false positives and the true negatives. The TPR describes how good the model is at predicting the positive class when the actual outcome is positive. The FPR details how often the positive class is predicted when the actual outcome is negative. The ROC algorithm computes the TPR and FPR from the truth values and the predicted values from the random forest classifier. Afterwards, the AUC function computes the area under the resulting curve. A classifier with no skill at distinguishing positives from negatives appears as a straight diagonal line on the graph, and the higher the skill, the more the curve bows toward the top left corner. Another benefit of the AUC-ROC curve is that it combines multiple statistics in one. The AUC score is a more general summary than the ROC statistics: the AUC is the integral of the ROC curve, while the ROC curve is a visual representation of the true positive and false positive rates at various classification thresholds.
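
The relationship between the TPR, the FPR, and the AUC can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the code used in this study; the function names, labels, and scores are hypothetical, and the sketch assumes that both classes are present in the truth values.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, highest first.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical truth labels (1 = positive) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
area = auc(curve)
```

In this example, 8 of the 9 possible positive-negative pairs are ranked correctly, and the computed area is 8/9, which illustrates the interpretation of the AUC as the probability that a positive instance is ranked above a negative one.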

We used a supervised classification method to quantify how well the latent topic structure in the LDA model performed. After the LDA topic probabilities were created, they were used as the basis for the random forest model. The Random Forest script assembled multiple decision trees and combined their predictions via majority voting to produce the final classification. Decision trees by themselves tend to overfit the training data, which results in high variance, but random forest acts as an averaging method to reduce variance [41].
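
The majority-voting step can be illustrated with a small sketch. The three “trees” below are hand-written hypothetical rules over the three topic probabilities, not trees learned from data; in an actual random forest each tree is trained on a random sample of the dataset.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: each maps the topic
# probabilities (topic_0, topic_1, topic_2) of one article to a label.
trees = [
    lambda p: "positive" if p[2] > 0.5 else "negative",
    lambda p: "positive" if p[1] < 0.3 else "negative",
    lambda p: "positive" if p[0] < 0.2 else "negative",
]


def forest_predict(topic_probs):
    """Collect one vote per tree and return the majority label."""
    return majority_vote([tree(topic_probs) for tree in trees])


prediction = forest_predict((0.25, 0.2, 0.55))
```

Here two of the three rules vote “positive,” so the forest returns “positive” even though one tree disagrees; averaging over many such votes is what reduces the variance of any single tree.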

Results

Collecting articles

While the metabarcode dataset search resulted in 1380 SRA datasets, each SRA dataset references a Bioproject associated with a paper, and multiple SRA datasets often referenced the same Bioproject ID. The papers came from a wide variety of publishers, with Elsevier, Nature, Wiley, and Frontiers contributing more papers than other sources. Two years, 2000 and 2007, seem to be outliers, perhaps caused by authors who assigned new datasets to older papers. The vast majority of studies mentioned that they were based on the 18S genomic region only, rather than ITS solely, both regions, or neither of the regions. The “neither of the regions” category contains papers which are associated with a dataset, where the body of the text does not explicitly mention the association. Soil was the primary environment discussed in a plurality of papers, and the marine environment was the second most frequent (see Fig 3B). Sampling across papers had a global distribution (see Fig 4A) and covered a variety of environment types.

Fig 3. Examination of positive hit papers.

In panel “A,” papers are graphed according to which genomic region, if any, they mentioned in their text, regardless of whether the paper was retrieved via the ITS or 18S region. “Ds” indicates that the paper did not mention a genomic region (i.e., the paper was found via the metabarcoding search and dataset but may not link back to the dataset itself, which can indicate that the paper was attached to the dataset after publication). In panel “B,” papers are grouped by the environment they primarily discussed; a reader assigned the overall environment of each paper. Panel “C” indicates the publisher, and panel “D” the date of publication, for the papers used in this study.

Fig 4. Sampling locations were derived for each record.

If coordinates for sample sites were provided in the original paper, those were used. If they were not, the centroid of the most precise geographic entity was used. Panel A shows these data with the symbol size scaled to the number of samples indicated at each location. Panel B shows a kernel density of sample locations. Panel C shows the substrate class at each sample location. Together these locations show a global distribution of C. neoformans. It is likely that these locations have sampling bias. Constructed from vector base layers from Natural Earth [45].

Text analysis

Before creating the LDA Topic Model, the collected articles were aggregated and parsed into unigrams, resulting in over 700,000 different words used across the articles. LDA topic modeling relies on the text corpus that it creates the model from, so words with relatively low occurrences can add confusion to a model. Therefore, all words that did not have at least 5 occurrences were excluded from the dataset to preserve model integrity; this culling left 75,000 words in the dataset.
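
The occurrence threshold described above can be sketched with the Python standard library. This is an illustrative sketch, not the script used in this study; the function name and the toy documents are hypothetical, and the threshold of 5 follows the text.

```python
from collections import Counter


def filter_rare_words(tokenized_docs, min_count=5):
    """Drop every word whose total count across the corpus is below min_count."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    keep = {word for word, n in counts.items() if n >= min_count}
    filtered = [[word for word in doc if word in keep] for doc in tokenized_docs]
    return filtered, keep


# Toy unigram documents for illustration only.
docs = [
    ["soil", "forest", "soil", "litter"],
    ["soil", "ocean", "soil"],
    ["soil", "forest"],
]
filtered, vocabulary = filter_rare_words(docs, min_count=5)
```

In this toy corpus only “soil” reaches five occurrences, so it is the only word that survives the cull.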

LDA topic model

The words output from the text analysis in turn became the input texts for the Gensim program, which generated the LDA topic models. The script created 8 different models, varying from 2 topics to 9 topics. Creating multiple models allows for visualization and analysis to determine the optimal parameters and is a common practice in machine learning. We evaluated the coherence and perplexity scores for each model. The metric for coherence was based on the extrinsic University of California Irvine (UCI) metric, which compares every word in the corpus to every other word in the corpus [46]. By default, the metric Gensim uses for perplexity calculation is the negative logarithmic expansion. Both plots are shown in Fig 6.
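
To give an intuition for a UCI-style coherence score, the following sketch computes the mean pointwise mutual information (PMI) over pairs of a topic’s top words. It is a simplification and an assumption on our part: it estimates word probabilities from document-level co-occurrence in a toy corpus, whereas the UCI metric is usually computed with sliding windows over a reference corpus, and it is not the Gensim implementation. The function name, the smoothing constant, and the documents are hypothetical.

```python
import math
from itertools import combinations


def pmi_coherence(top_words, documents, eps=1e-12):
    """Mean pointwise mutual information over all pairs of a topic's top words."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words)) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2 = prob(w1), prob(w2)
        if p1 == 0 or p2 == 0:
            continue  # skip words that never occur in the corpus
        # eps keeps the logarithm finite when the pair never co-occurs.
        scores.append(math.log((prob(w1, w2) + eps) / (p1 * p2)))
    return sum(scores) / len(scores) if scores else 0.0


# Toy corpus for illustration only.
docs = [
    ["root", "litter", "forest"],
    ["root", "litter"],
    ["ocean", "viral"],
    ["ocean", "viral", "fish"],
]
coherent = pmi_coherence(["root", "litter"], docs)
mixed = pmi_coherence(["root", "ocean"], docs)
```

Words that tend to appear in the same documents score higher than words that never co-occur, so a topic whose top words belong together receives a higher coherence. Repeating such a score for each candidate number of topics and keeping the highest mirrors the selection procedure described above.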

Fig 5. Count of the most common words across positive and negative papers.

Stopwords are excluded. Our results show that “crop”, “miner”, “manag-”, “forest-”, “litter”, “fertil-”, “wheat”, “contamin-”, “amend-”, and “root” have the highest association probabilities in the assignment of articles from groups of positive or negative hit papers to topics. The figure is a sample representation of the most popular words found across the text corpus that were statistically relevant after filtering for occurrences. This reveals a potential association between the most common words and environments that contain soil and rhizosphere. Only words with a count above 1000 are shown.

Each LDA model created has an associated number of topics, coherence score, and perplexity, which are plotted in Fig 6. The selected LDA model returned three topics and topic probabilities. Each topic had associated words weighted by their relevance to the topic. To determine the optimal number of topics, another visualization tool, pyLDAvis [47,48], was used. pyLDAvis showed good separation between the topics on the intertopic distance map. The intertopic distances were computed within the algorithm, based on Jensen-Shannon divergence calculations. The centers of the default topic circles are laid out in two dimensions according to a multidimensional scaling (MDS) algorithm that is run on the inter-topic distance matrix. The dimensionality of the inter-topic distances grows with the number of topics, so MDS reduces the dimensionality while maintaining, as far as possible, the distances between topics when they are placed on the same plane. The greater the separation without topic overlap, the better the model. The model with three topics had the best separation in the group, along with the highest coherence score. Therefore, we considered it the optimal model and proceeded to further analyze its parameters.
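
The Jensen-Shannon divergence that underlies the intertopic distances can be computed directly from two topic-word distributions. The following standard-library sketch is for illustration only and is not the pyLDAvis implementation; the function names and the three toy distributions over a four-word vocabulary are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Symmetric divergence: each distribution is compared to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic-word distributions over the same four-word vocabulary.
topic_a = [0.7, 0.2, 0.1, 0.0]
topic_b = [0.0, 0.1, 0.2, 0.7]
topic_c = [0.6, 0.3, 0.1, 0.0]

far = jensen_shannon(topic_a, topic_b)
near = jensen_shannon(topic_a, topic_c)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (no shared words), so topics with well-separated vocabularies sit far apart on the intertopic distance map.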

Fig 6. Graph signifying the coherence score performance in comparison to the number of topics for each model.

Coherence and Perplexity were calculated at different numbers of topics. Coherence is highest at 3, which indicates that three topics should be used in the model.

By default, the Gensim-created models report the top words for each topic and the weight of each word within that topic.

In order to relate the topics and top words within each topic to the original documents, we created a script with the aid of the pandas package. Pandas allowed for data table manipulation, and the script enabled us to combine the two data tables into one, and sort the values according to the classification label. The script enabled the visualization of what the LDA model produced: three topic groups with associated words and weights within each group. By examining the words, their associated weights within each topic, and subject matter knowledge on the text corpus, it was possible to determine whether the topic was dominated by either positive papers or negative papers. Topic zero and Topic one share some similarities in words and environments, while Topic two has a different set entirely. The words in Topic two include crop, miner-, forest, root, fertil-; these words have strong associations to soil and woody environments, while the other topics have associations with warm, watery environments.
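
The step of relating topics back to the labeled documents can be sketched without pandas: each article is assigned to its highest-probability topic, and the “positive” and “negative” labels are then tallied per topic. This is a simplified illustration, not the script used in this study; the function name, topic probabilities, and labels are hypothetical.

```python
from collections import Counter, defaultdict


def tally_labels_by_topic(doc_topic_probs, labels):
    """Count class labels among the documents dominated by each topic."""
    tallies = defaultdict(Counter)
    for probs, label in zip(doc_topic_probs, labels):
        # The dominant topic is the index with the highest probability.
        dominant = max(range(len(probs)), key=probs.__getitem__)
        tallies[dominant][label] += 1
    return {topic: dict(counts) for topic, counts in tallies.items()}


# Hypothetical per-article topic probabilities (topic 0, topic 1, topic 2).
doc_topic_probs = [
    (0.1, 0.1, 0.8),
    (0.2, 0.1, 0.7),
    (0.1, 0.8, 0.1),
    (0.6, 0.3, 0.1),
    (0.5, 0.4, 0.1),
]
labels = ["positive", "positive", "negative", "positive", "negative"]
tallies = tally_labels_by_topic(doc_topic_probs, labels)
```

In this toy example one topic collects only positive articles, one only negative articles, and one a mixture, which is the kind of pattern used to judge whether a topic is dominated by positive or negative papers.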

Two of the keywords from the negative topic are “viral” and “ocean,” which seems to indicate that there may be differences in subject matter in these articles (see Fig 7, Topic 1). The middling topic (Fig 7, Topic 0) has words like “protist,” “manure,” “edna,” and “fish.” Lastly, the positive topic (Fig 7, Topic 2) has words like “root,” “litter,” “forest,” and “management.” Positive paper words like “root,” “litter,” and “forest” likely describe soils and rhizospheres; papers which primarily study “soil” are the largest share of positive hit papers (see Fig 3).

Fig 7. Word count and word weight for the top lemmatized (truncated) words in each topic, plotted from the model outputs with the matplotlib library.

The most common words in each topic can give insights into what the potential topics are. The topics need to be interpreted by the researchers, together with subject matter experts, based on the text corpus given to the model and the words present in the topics.

Random forest evaluation

The results from the random forest classifier are stochastic and vary with each iteration of the script; they can deviate based on how the decision trees are created and on the majority voting process. Using the default settings of scikit-learn’s random forest classifier, a confusion matrix and AUC-ROC curve were created from the best iteration of the script out of 50 trials (see Figs 8 and 9 and Table 2).

Fig 8. Confusion matrix from random forest classifier with heatmap indicator for added visualization.

Starting from the top left corner, this square indicates the number of true positives. Moving clockwise, the next square is the false negative area, with the true negative square below it. The last square in the bottom left corner refers to the false positives.

Fig 9. Graph of the Area Under the Curve—Receiver Operating Characteristic curve, generated by Scikit-plot.

By default, the curve algorithm splits the ROC curve into three parts: the ROC curve for each class, the micro-average of the ROC curve, and the macro-average of the ROC curve. The micro-average shows that the model predicts better than a random assignment of papers to classes.

Table 2. Summary table of accuracy, precision, recall, F1 score, AUC score, and ROC score from the random forest classifier.

| Accuracy | Precision | Recall | F1 Score | AUC Score | ROC Score |
| --- | --- | --- | --- | --- | --- |
| 0.67 | 0.71 | 0.90 | 0.80 | 0.77 | 0.55 |

Discussion

Our results show that there is a link between C. neoformans and wood decomposition, confirming that this fungus lives in the environment as a saprophyte with a preference for wood substrates. This characteristic makes C. neoformans able to occupy a multitude of environmental niches worldwide. These conclusions confirm the validity of the methodology applied here. Our de novo topic generation produced three topics: a topic consisting of positively labeled articles; a middling topic with both negatively labeled and positively labeled articles; and a topic consisting of negatively labeled papers. This indicates that environmental keywords make the positively labeled articles more similar to each other than to the randomly selected negative papers. Since LDA, a truly unsupervised algorithm, was used in topic modeling, only extrapolations of the topics can be made; the topics are essentially bags of words with paper associations. However, the ability to simultaneously characterize hundreds of journal articles, and find the statistically relevant topics and associated words, is a powerful tool in general information gathering and identification of hidden/inferable connections between groups of topics. With the increased use and reduced costs of high-throughput sequencing technology to sample biological diversity in the environment, the number of potentially vastly informative, yet not fully explored, datasets is likely to greatly increase. Data-mining this increasingly large corpus will become crucial to make the most efficient use of these data.

The accuracy, precision, and recall scores for our results are above those of an average, no-skill model and are consistent across iterations of the classifier. In addition, the high F1 score indicates that the classifier is assigning most papers to their correct class, which is ideal for any classification method. In the AUC-ROC curve, the scores revealed interesting aspects of classification performance. The macro-average ROC score was heavily affected by the per-class ROC scores, while the micro-average was substantially greater. The micro-average ROC curve is the weighted average ROC of both classes and is the most informative given the class imbalance; the macro-average, or overall average, was decreased by the presence of false positives and false negatives. The micro-average is appropriate since we want to classify positively identified papers that have C. neoformans, so having a higher weight for correctly classifying positive papers is ideal. Our accuracy in correctly classifying positively labeled papers was reinforced by the resulting AUC score, which can also be interpreted as the probability that a positive instance is ranked higher than a negative instance. The AUC score and the micro-average ROC score lead to a similar conclusion, which is that the classifier is correctly classifying positively identified papers at a moderate level (see Table 2).

From a technical perspective, the relatively small number of positive hit papers proved to be challenging for a variety of reasons, including the skew of the extant articles towards positive results. In the data cleaning phase, there was a 90% reduction in the total number of words due to the thresholds we set for higher model performance. This reduction is accounted for by single-use words, or words shared by only a few articles. Without a high enough count of words and documents, evaluating a model becomes ineffective. With a bigger corpus of papers, the output of the initial NLP would have had a higher count of words and word frequencies, so the word list(s) to run the LDA against would have been larger, providing more data points for the LDA model to utilize. In addition, the class imbalance between the “positive” and “negative” labeled papers was not ideal; a more equal distribution would have allowed a more standard analysis and approach in the classifier evaluation. While, for the most part, these challenges were circumvented in our approach and pipeline, utilizing even more computational resources may result in better model prediction. LDA depends on dataset size and text corpus content to give the best Dirichlet distribution. The need to set the number of topics before beginning the LDA analysis may lead future researchers to investigate the Hierarchical Dirichlet Process (HDP), which is similar to LDA but does not require an initial topic number [49]. In addition, using neural network techniques, like the lda2vec framework based on the word2vec neural network framework [50], may lead to even more significant results due to the advantages deep neural networks have over standard classification methods [42,51]. Similarly, with standardization of the metadata content in metabarcoding studies, work like ours would not need to rely as much on Natural Language Processing.

More traditional meta-analysis methods like non-linear statistical models could be utilized not only to determine more about the range of environmental features where C. neoformans is found but, in all likelihood, to expand our estimation of its range. There are some limitations caused by the data itself, including the massive reduction from the words in the documents to the words that made it into the text corpus for analysis in Gensim. If even more papers were used, the results and analyses would be more robust. In light of the data limitations encountered in this paper, we advocate for more metadata standardization (e.g. similar units, required measurements for certain kinds of studies) and more standardization of reporting in journal articles (i.e. required detail in methodology). These small steps can save time in meta-analysis and possibly help NLP.

We analyzed C. neoformans, an important human pathogen which is known to have a broad distribution in nature and a few recognized ecological niches, to identify additional potential cryptic niches. It is important to note that the same approach could be used to track different organisms, such as more geographically localized disease agents. Furthermore, the same approach can also be extended to bacterial barcode sequences (such as the 16S rDNA gene region). Given a large enough corpus of relevant articles, this technique could be used to track the ecological niche and geographical location of specific disease agents, an important aspect of biosurveillance. In this work on C. neoformans we reinforce the idea of a global distribution; however, the niche of C. neoformans is still not well known, which raises the question: if the pathogen has a global distribution, why does it disproportionately affect some populations more than others? And could how C. neoformans interacts with its environment be part of the answer? Importantly, when environmental metagenomic datasets are sampled, organisms that are not the focus of the metagenetic paper will often be found but not reported, as in the corpus used in this work, where C. neoformans was found but no reference to C. neoformans was made. This analysis sidesteps that problem.

Acknowledgments

The authors would like to thank the Notre Dame Center for Research Computing for supporting the computational infrastructure that this work was run on, as well as Daniel A. Molik for providing software development computing hardware consultation. The authors also thank Emmet Flynn and Paul Brunts, whose undergraduate data science coursework inspired some of the machine learning aspects of this paper. Finally, the authors thank the peacefulness and quiet solemnity of the Delbruck Building’s back patio for providing the scene of most of this article’s ideas.

Data Availability

The data underlying the results presented in the study are available from osf.io at doi.org/10.17605/OSF.IO/29V3F and doi.org/10.17605/OSF.IO/49W5R.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. Snow J. On the Mode of Communication of Cholera. Edinb Med J. 1856;1: 668–670.
  • 2. Paneth N, Vinten-Johansen P, Brody H, Rip M. A rivalry of foulness: official and unofficial investigations of the London cholera epidemic of 1854. Am J Public Health. 1998;88: 1545–1553. doi: 10.2105/ajph.88.10.1545
  • 3. Colwell RR. Global Climate and Infectious Disease: The Cholera Paradigm*. Science. 1996;274: 2025–2031. doi: 10.1126/science.274.5295.2025
  • 4. Clemens JD. Vaccines in the time of cholera. Proc Natl Acad Sci. 2011;108: 8529–8530. doi: 10.1073/pnas.1105807108
  • 5. Sanfelice F. Contributo alla morfologia e biologia dei blastomiceti che si sviluppano nei succhi di alcuni frutti. Ann Ig. 1894;4: 463–495.
  • 6. Vuillemin P. Les blastomycètes pathogènes. Rev Gen Sci Pures Appl. 1901;12: 732–751.
  • 7. Benham RW. Cryptococcosis and blastomycosis. Ann N Y Acad Sci. 1950;50: 1299. doi: 10.1111/j.1749-6632.1950.tb39828.x
  • 8. Boulware DR. Cryptococcus: from human pathogen to model yeast. Lancet Infect Dis. 2011;11: 434.
  • 9. Rajasingham R, Smith RM, Park BJ, Jarvis JN, Govender NP, Chiller TM, et al. Global burden of disease of HIV-associated cryptococcal meningitis: an updated analysis. Lancet Infect Dis. 2017;17: 873–881. doi: 10.1016/S1473-3099(17)30243-8
  • 10. Heitman J, Kozel TR, Kwon-Chung KJ, Perfect JR, Casadevall A. Cryptococcus: from human pathogen to model yeast. ASM press; 2010.
  • 11. Kwon-Chung KJ, Fraser JA, Doering TL, Wang Z, Janbon G, Idnurm A, et al. Cryptococcus neoformans and Cryptococcus gattii, the etiologic agents of cryptococcosis. Cold Spring Harb Perspect Med. 2014;4: a019760–a019760. doi: 10.1101/cshperspect.a019760
  • 12. Springer DJ, Chaturvedi V. Projecting global occurrence of Cryptococcus gattii. Emerg Infect Dis. 2010;16: 14–20. doi: 10.3201/eid1601.090369
  • 13. Elhariri M, Hamza D, Elhelw R, Refai M. Eucalyptus Tree: A Potential Source of Cryptococcus neoformans in Egyptian Environment. Int J Microbiol. 2016/01/13 ed. 2016;2016: 4080725–4080725. doi: 10.1155/2016/4080725
  • 14. Ergin Ç, Şengül M, Aksoy L, Döğen A, Sun S, Averette AF, et al. Cryptococcus neoformans Recovered From Olive Trees (Olea europaea) in Turkey Reveal Allopatry With African and South American Lineages. Front Cell Infect Microbiol. 2019;9: 384. doi: 10.3389/fcimb.2019.00384
  • 15. Springer DJ, Mohan R, Heitman J. Plants promote mating and dispersal of the human pathogenic fungus Cryptococcus. PLOS ONE. 2017;12: e0171695. doi: 10.1371/journal.pone.0171695
  • 16. Mašínová T, Yurkov A, Baldrian P. Forest soil yeasts: Decomposition potential and the utilization of carbon sources. Fungal Ecol. 2018;34: 10–19. doi: 10.1016/j.funeco.2018.03.005
  • 17. Cadete RM, Lopes MR, Rosa CA. Yeasts Associated with Decomposing Plant Material and Rotting Wood. In: Buzzini P, Lachance M-A, Yurkov A, editors. Yeasts in Natural Ecosystems: Diversity. Cham: Springer International Publishing; 2017. pp. 265–292. doi: 10.1007/978-3-319-62683-3_9
  • 18. Lazera MS, Cavalcanti MAS, Londero AT, Trilles L, Nishikawa MM, Wanke B. Possible primary ecological niche of Cryptococcus neoformans. Med Mycol. 2000;38: 379–383. doi: 10.1080/mmy.38.5.379.383
  • 19. Vélez N, Escandón P. Report on novel environmental niches for Cryptococcus neoformans and Cryptococcus gattii in Colombia: Tabebuia guayacan and Roystonea regia. Med Mycol. 2017;55: 794–797. doi: 10.1093/mmy/myw138
  • 20. Dromer F, Casadevall A, Perfect J, Sorrell T, Heitman J. Cryptococcus: from human pathogen to model yeast. 2011.
  • 21. Vu K, Tham R, Uhrig JP, Thompson GR, Na Pombejra S, Jamklang M, et al. Invasion of the Central Nervous System by Cryptococcus neoformans Requires a Secreted Fungal Metalloprotease. Berman J, editor. mBio. 2014;5: e01101–14. doi: 10.1128/mBio.01101-14
  • 22. Santiago-Tirado FH, Onken MD, Cooper JA, Klein RS, Doering TL. Trojan Horse Transit Contributes to Blood-Brain Barrier Crossing of a Eukaryotic Pathogen. Casadevall A, editor. mBio. 2017;8: e02183–16. doi: 10.1128/mBio.02183-16
  • 23. Casadevall A. Evolution of intracellular pathogens. Annu Rev Microbiol. 2008;62: 19–33. doi: 10.1146/annurev.micro.61.080706.093305
  • 24. Hagen F, Khayhan K, Theelen B, Kolecka A, Polacheck I, Sionov E, et al. Recognition of seven species in the Cryptococcus gattii/Cryptococcus neoformans species complex. Fungal Genet Biol. 2015;78: 16–48. doi: 10.1016/j.fgb.2015.02.009
  • 25. Galanis E, Macdougall L, Kidd S, Morshed M, British Columbia Cryptococcus gattii Working Group. Epidemiology of Cryptococcus gattii, British Columbia, Canada, 1999–2007. Emerg Infect Dis. 2010;16: 251–257. doi: 10.3201/eid1602.090900
  • 26. Deiner K, Bik HM, Mächler E, Seymour M, Lacoursière-Roussel A, Altermatt F, et al. Environmental DNA metabarcoding: Transforming how we survey animal and plant communities. Mol Ecol. 2017;26: 5872–5895. doi: 10.1111/mec.14350
  • 27. Leinonen R, Sugawara H, Shumway M, International Nucleotide Sequence Database Collaboration. The sequence read archive. Nucleic Acids Res. 2010/11/09 ed. 2011;39: D19–D21. doi: 10.1093/nar/gkq1019
  • 28. Zhai C, Massung S. Text data management and analysis: a practical introduction to information retrieval and text mining. Association for Computing Machinery and Morgan & Claypool; 2016.
  • 29. Quinlan JR. Induction of decision trees. Mach Learn. 1986;1: 81–106.
  • 30. Dietterich TG. Ensemble methods in machine learning. Springer; 2000. pp. 1–15.
  • 31. Stevenson RL. Robert Louis Stevenson: A child’s garden of verses. Scribner; 1895.
  • 32. Bell I. Dreams of Exile: Robert Louis Stevenson, a Biography. Macmillan; 1993.
  • 33. Feinerer I, Hornik K, Meyer D. Text Mining Infrastructure in R. J Stat Softw Vol 1 Issue 5 2008. 2008. Available: https://www.jstatsoft.org/v025/i05
  • 34. Řehůřek R, Sojka P. Software Framework for Topic Modelling with Large Corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Valletta, Malta: ELRA; 2010. pp. 45–50.
  • 35. Cambridge UP. Introduction to information retrieval. 2009.
  • 36. Bird S, Klein E, Loper E. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.; 2009.
  • 37. Airoldi EM, Blei D, Erosheva EA, Fienberg SE. Handbook of mixed membership models and their applications. CRC press; 2014.
  • 38. Airoldi EM. Getting started in probabilistic graphical models. PLoS Comput Biol. 2007;3: e252. doi: 10.1371/journal.pcbi.0030252
  • 39. Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. J Mach Learn Res. 2003;3: 993–1022.
  • 40. Dirichlet and Inverted Dirichlet Distributions. Continuous Multivariate Distributions. John Wiley & Sons, Ltd; 2005. pp. 485–527. doi: 10.1002/0471722065.ch49
  • 41. Ali J, Khan R, Ahmad N, Maqsood I. Random forests and decision trees. Int J Comput Sci Issues IJCSI. 2012;9: 272.
  • 42. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12: 2825–2830.
  • 43. Raileanu LE, Stoffel K. Theoretical comparison between the gini index and information gain criteria. Ann Math Artif Intell. 2004;41: 77–93.
  • 44. Breiman L. Bagging predictors. Mach Learn. 1996;24: 123–140.
  • 45.Natural Earth—Free vector and raster map data at 1:10m, 1:50m, and 1:110m scales. [cited 20 Mar 2021]. Available: https://www.naturalearthdata.com/
  • 46. Röder M, Both A, Hinneburg A. Exploring the space of topic coherence measures. Proceedings of the eighth ACM international conference on Web search and data mining. 2015. pp. 399–408.
  • 47. Mabey B. bmabey/pyLDAvis. 2020. Available: https://github.com/bmabey/pyLDAvis
  • 48. Sievert C, Shirley K. LDAvis: A method for visualizing and interpreting topics. 2014. pp. 63–70.
  • 49. Teh YW, Jordan MI, Beal MJ, Blei DM. Sharing clusters among related groups: Hierarchical Dirichlet processes. 2005. pp. 1385–1392.
  • 50. Moody CE. Mixing dirichlet topic models and word embeddings to make lda2vec. ArXiv Prepr ArXiv160502019. 2016.
  • 51. Gulli A, Pal S. Deep learning with Keras. Packt Publishing Ltd; 2017.
PLoS Negl Trop Dis. doi: 10.1371/journal.pntd.0008755.r001

Decision Letter 0

Peter John Myler, Todd B Reynolds

5 Nov 2020

Dear Mr Molik,

Thank you very much for submitting your manuscript "Combining Natural Language Processing and Metabarcoding to Reveal Pathogen-Environment Associations" for consideration at PLOS Neglected Tropical Diseases. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Peter John Myler, Ph.D.

Associate Editor

PLOS Neglected Tropical Diseases

Todd Reynolds

Deputy Editor

PLOS Neglected Tropical Diseases

***********************

Reviewer's Responses to Questions

Key Review Criteria Required for Acceptance?

As you describe the new analyses required for acceptance, please consider the following:

Methods

-Are the objectives of the study clearly articulated with a clear testable hypothesis stated?

-Is the study design appropriate to address the stated objectives?

-Is the population clearly described and appropriate for the hypothesis being tested?

-Is the sample size sufficient to ensure adequate power to address the hypothesis being tested?

-Were correct statistical analyses used to support conclusions?

-Are there concerns about ethical or regulatory requirements being met?

Reviewer #1: The methods applied in this study are appropriate, but the limited number of samples analyzed could lead to wrong conclusions.

The first paragraph concerning methods is too long and sometimes difficult to understand.

I suggest moving the background on validation of the computational methodologies to the introduction, where it can motivate the choice of these methods.

As an introduction to the methods, I suggest briefly listing the steps of the analysis as reported in Fig. 1, each of which can then be described in detail in its own paragraph.

Line 214. Change "were positive for C. neoformans" to "were specific for C. neoformans".

Line 214. Add accession number of the sequences used as queries.

Figure 2. Symbols in the legend have been converted into squares, so it is not possible to follow the explanation of the diagram.

Line 290. The abbreviation PCA should be spelled out at first mention.

Textboxes 1, 2 and 3 should be removed since they add a lot of confusing information.

I suggest reporting in the methods only the concept and calculation of precision, recall, accuracy, F1, false positives and negatives, and true positives and negatives.

Reviewer #2: There is a significant amount of detail about various aspects of the method and lengthy explanations provided in text boxes. The methods seem appropriate, but the question being asked is somewhat vague and the application of the various methods is not always clear. It was not clear how the text analysis was culled from 75,000 words down to the 25 presented in Figure 4. The authors chose the Gensim Python program to conduct the LDA topic model generation, but they do not state why this program was chosen, whether there are alternatives, or whether it is considered the standard method for this type of analysis.

Reviewer #3: No new analysis required.

--------------------

Results

-Does the analysis presented match the analysis plan?

-Are the results clearly and completely presented?

-Are the figures (Tables, Images) of sufficient quality for clarity?

Reviewer #1: Figure 4. Do the different bar colors have any meaning?

Line 387. Gensim software should be first cited in the methods with a reference.

Lines 417-424. Topic 1 is not described.

Random Forest Evaluation. In this paragraph results should be reported in detail.

Figure 7. What is the unit of the bar on the right side of the figure?

Figure 8. I suppose class 0 refers to negative papers and class 1 to positive papers, but this should be better explained.

Table 2. By your definitions, precision is the proportion of true positives out of all the others. In this case that is 20/31 = 0.64, so why did you report 0.71? If my calculation is right, then the F1 score should also be 0.77 ((0.64 + 0.9)/2).
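
For readers checking the arithmetic in the Table 2 comment, the following is a minimal sketch of how a simple average and the standard F1 score differ. It assumes the figures quoted in the comment (20/31, quoted as 0.64, and a recall of 0.9) and the conventional definition of F1 as the harmonic mean of precision and recall. The function names are illustrative and do not come from the manuscript. Precision is conventionally TP / (TP + FP); whether 20/31 matches that definition depends on how the 31 items break down, which the comment does not state.

```python
def arithmetic_mean(precision, recall):
    """Simple average, as written in the comment: (P + R) / 2."""
    return (precision + recall) / 2


def f1_score(precision, recall):
    """Conventional F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the reviewer comment (assumed here, not re-derived
# from the manuscript's Table 2). Note that 20 / 31 is roughly 0.645.
precision_quoted = 0.64
recall_quoted = 0.9

simple_average = arithmetic_mean(precision_quoted, recall_quoted)
harmonic_mean = f1_score(precision_quoted, recall_quoted)

print(f"simple average: {simple_average:.2f}")
print(f"harmonic mean (F1): {harmonic_mean:.2f}")
```

Under these assumptions the simple average is 0.77, while the harmonic mean is about 0.75, so the two formulas should not be expected to agree.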

Reviewer #2: It is not clear what topics were used to model, as the authors list 25 potential topic words excluding stop words. They state that the actual model apparently had three topics, but it wasn’t clear which three were included. Or if they modeled each set using an overlapping 3 topic word model. The significance of Figure 6 is not clearly explained. The final model presented in Figure 7 identified 20 true positives out of 31 tested samples, which is only 65%. It is reported in Figure 8 that the micro-average of the two ROC curves is 0.72, but each of those have an area of 0.55, so it is not at all clear how they obtained an average of 0.72. They also do not explain the significance of either micro- or macro-average. The data are presented with very little attempt to explain how they inform the prediction of where C. neoformans might be found or what it means in the context of the question being asked.
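
The question of how a micro-average can differ from the per-class ROC areas can be illustrated with a small sketch. The labels and scores below are entirely hypothetical toy values, not data from the manuscript, and the helper `auc` is an illustrative rank-based (Mann-Whitney) estimate rather than the authors' implementation. The macro-average is taken as the mean of the per-class areas; the micro-average pools every (sample, class) decision into one ranking before computing the area.

```python
def auc(scores, labels):
    """Rank-based ROC AUC: fraction of (positive, negative) pairs
    in which the positive scores higher; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true class per sample and predicted
# (score for class 0, score for class 1) per sample.
y_true = [1, 1, 1, 1, 0, 0]
proba = [(0.1, 0.9), (0.3, 0.7), (0.4, 0.6),
         (0.4, 0.6), (0.2, 0.8), (0.4, 0.6)]

# Per-class (one-vs-rest) areas, then their unweighted mean (macro).
per_class = []
for k in (0, 1):
    scores_k = [row[k] for row in proba]
    labels_k = [1 if y == k else 0 for y in y_true]
    per_class.append(auc(scores_k, labels_k))
macro = sum(per_class) / len(per_class)

# Micro-average: pool every (sample, class) pair into a single ranking.
pooled_scores = [row[k] for row in proba for k in (0, 1)]
pooled_labels = [1 if y == k else 0 for y in y_true for k in (0, 1)]
micro = auc(pooled_scores, pooled_labels)

print("per-class AUC:", per_class)
print("macro-average:", macro)
print("micro-average:", round(micro, 3))
```

In this toy case each per-class area is 0.5 while the pooled, micro-averaged area is higher, because the classifier scores the majority class highly for every sample. This shows only that such a gap is possible with imbalanced classes; it does not establish that this is what happened in Figure 8.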

Reviewer #3: Additional characterization of the positive hits to include geographic location and/or change over time in geographic location or other information about how the sequencing sample was collected.

--------------------

Conclusions

-Are the conclusions supported by the data presented?

-Are the limitations of analysis clearly described?

-Do the authors discuss how these data can be helpful to advance our understanding of the topic under study?

-Is public health relevance addressed?

Reviewer #1: Lines 449-451. C. neoformans is distributed worldwide; it is not concentrated in the sub-Saharan area!

Reviewer #2: It is difficult to ascertain what conclusions the authors have obtained from this study. They state that it provides support for the previously suggested linkage between C. neoformans and decomposing wood. This is neither a novel nor unexpected result. The link between C. neoformans and decomposing wood is not in doubt as there are dozens of papers that have identified C. neoformans in various environmental locations associated with rotting wood. The application of NLP is a somewhat novel approach, but the utility of this approach is not very apparent in the unconvincing predictive power of the model that they built. It seems that unless the metagenetic data collected includes information about sampling location, the approach in this paper is unlikely to provide any additional insight as to the environmental origins of the sequenced samples.

Reviewer #3: Additional discussion about implications for public health would enhance the paper.

--------------------

Editorial and Data Presentation Modifications?

Use this section for editorial suggestions as well as relatively minor modifications of existing data that would enhance clarity. If the only modifications needed are minor and/or editorial, you may wish to recommend “Minor Revision” or “Accept”.

Reviewer #1: (No Response)

Reviewer #2: None

Reviewer #3: I'm not sure if the presence of text boxes and other insets fits with the style of the journal. The information given in the textboxes may need to be added into the main text of the article, or given in a glossary at the end.

--------------------

Summary and General Comments

Use this section to provide overall comments, discuss strengths/weaknesses of the study, novelty, significance, general execution and scholarship. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. If requesting major revision, please articulate the new experiments that are needed.

Reviewer #1: In general this study is interesting, since it tries to obtain new information through new computational methodologies able to analyze large amounts of complex data.

On the other hand, it involves many steps, each with a multitude of variables that are difficult to control and that can compromise the final result of the analysis.

Reviewer #2: This manuscript presents a potentially novel method for identifying the environmental source of metagenetic data in the short-read archive at NCBI. The proposed method uses natural language processing to evaluate the associated manuscripts for information regarding where samples were taken. The hypothesis is that sampled data may include genetic data from organisms in addition to those that were intended to be sampled. In this study, the authors identify C. neoformans genetic material in SRA Biosamples that had not previously been shown to contain C. neoformans data. However, their results provide no additional information about the environmental source of these data other than that they were found in samples associated with decomposing wood, which is a well-established niche for C. neoformans. They do not make a convincing argument that this method will provide additional insight over what has probably already been collected and provided at the time of sample collection.

Reviewer #3: The manuscript “Combining Natural Language Processing and Metabarcoding to Reveal Pathogen-Environment Associations” by Molik et al. presents an interesting methodology which applies a Machine Learning approach to a specific scientific topic of interest, namely, the habitat of C. neoformans. While the paper is extremely well written and the concept is of interest to the field, I think there are a few limitations that should be addressed and that, if overcome, would enhance the reception of the paper.

In the discussion, the author implies that the current consensus in the field is that C. neoformans is more abundant in sub-Saharan Africa (Line 450), which is certainly not the case. It is well understood that while cases of cryptococcosis occur mainly in sub-Saharan Africa, this is due to the spread of HIV/AIDS in these areas, not to any additional presence of the yeast. C. neoformans is known to be cosmopolitan, and has been found in the environment all over the world, which is discussed elsewhere in the paper. The overall conclusion of the paper, that C. neoformans is associated with soils and decomposing wood, seems not to add anything that was not previously known. The idea behind the paper and the methodology leading up to this conclusion are very exciting. If the positive hits could be further mined for additional data, such as where the positive hits were geographically located and/or how the geographic location changed over time, it would strengthen the conclusion and add to what is already known. It would be interesting to know these data so that researchers can get more evidence for distribution and potential exposure and latency in individuals. While I think discussion of the technique and the implications for other pathogens is important, it is also important to provide more discussion about the topic selected for this paper. Some questions to consider in the discussion: How would understanding the ecological niche benefit the Cryptococcus field? What else could be revealed by future analyses with a larger number of papers? Could this method also be applied to microbiome sequencing data to ascertain the potential age or area of exposure to C. neoformans? Overall I think the approach has a lot of promise, and the authors have done a great job of explaining their method and taking a unique approach to investigating a pathogen's origin.

--------------------

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see https://journals.plos.org/plosntds/s/submission-guidelines#loc-methods

PLoS Negl Trop Dis. doi: 10.1371/journal.pntd.0008755.r003

Decision Letter 1

Peter John Myler, Todd B Reynolds

2 Mar 2021

Dear Mr Molik,

Thank you very much for submitting your manuscript "Combining Natural Language Processing and Metabarcoding to Reveal Pathogen-Environment Associations" for consideration at PLOS Neglected Tropical Diseases. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

If you respond appropriately to the comments from Reviewer 1, the manuscript should be suitable for publication.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Peter John Myler, Ph.D.

Associate Editor

PLOS Neglected Tropical Diseases

Todd Reynolds

Deputy Editor

PLOS Neglected Tropical Diseases

***********************

Reviewer's Responses to Questions

Key Review Criteria Required for Acceptance?

As you describe the new analyses required for acceptance, please consider the following:

Methods

-Are the objectives of the study clearly articulated with a clear testable hypothesis stated?

-Is the study design appropriate to address the stated objectives?

-Is the population clearly described and appropriate for the hypothesis being tested?

-Is the sample size sufficient to ensure adequate power to address the hypothesis being tested?

-Were correct statistical analyses used to support conclusions?

-Are there concerns about ethical or regulatory requirements being met?

Reviewer #1: Authors replied sufficiently to the comments

Reviewer #2: The methods are better explained compared to the first submission and meet the stated criteria.

Reviewer #3: The methods are articulated and apply to a testable hypothesis. The authors have taken steps to make the description of the method clear and address potential shortcomings in terms of sample size limitations and choice of model.

--------------------

Results

-Does the analysis presented match the analysis plan?

-Are the results clearly and completely presented?

-Are the figures (Tables, Images) of sufficient quality for clarity?

Reviewer #1: Authors replied sufficiently to the comments

Reviewer #2: Yes, the results meet all of the stated criteria.

Reviewer #3: Authors have addressed the lack of geographic analysis by adding an additional figure and addressing it in the text as well.

--------------------

Conclusions

-Are the conclusions supported by the data presented?

-Are the limitations of analysis clearly described?

-Do the authors discuss how these data can be helpful to advance our understanding of the topic under study?

-Is public health relevance addressed?

Reviewer #1: see general comments

Reviewer #2: The edits to the conclusions from the first submission make a clearer case for how this approach could be used to study other underreported pathogens and also why their chosen approach is a better method than neural language processing.

Reviewer #3: The authors have expanded their discussion as suggested by the reviewers and fixed a potentially misleading line about geographic distribution.

--------------------

Editorial and Data Presentation Modifications?

Use this section for editorial suggestions as well as relatively minor modifications of existing data that would enhance clarity. If the only modifications needed are minor and/or editorial, you may wish to recommend “Minor Revision” or “Accept”.

Reviewer #1: no comments

Reviewer #2: None

Reviewer #3: None

--------------------

Summary and General Comments

Use this section to provide overall comments, discuss strengths/weaknesses of the study, novelty, significance, general execution and scholarship. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. If requesting major revision, please articulate the new experiments that are needed.

Reviewer #1: The authors replied correctly to most of the comments, but one point concerning the conclusions still remains to be revised.

The sentence from line 567 to line 570 should be substantially revised as follows:

“Our results show that there is a link between C. neoformans and wood decomposition, confirming that this fungus lives in the environment as a saprophyte with a preference for wood substrates. This characteristic makes C. neoformans able to occupy a multitude of environmental niches worldwide. These conclusions confirm the validity of the methodology applied here. Our de novo…..”

The statement about findings of C. neoformans outside sub-Saharan Africa is not correct and should be deleted.

Figure legends are lacking.

Reviewer #2: The authors responded well to the critiques raised by the reviewers and I feel the manuscript is suitable for publication.

Reviewer #3: The authors have taken appropriate measures to address the concerns of all three reviewers and have added an additional figure and amended the text to clarify confusing language and make the data more accessible. Although the novelty and significance were initially called into question, the updated manuscript emphasizes the novelty of the technique and the new figure adds information about C. neoformans location that should be of interest to the field.

--------------------

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/plosntds/s/submission-guidelines#loc-materials-and-methods

References

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

PLoS Negl Trop Dis. doi: 10.1371/journal.pntd.0008755.r005

Decision Letter 2

Peter John Myler, Todd B Reynolds

9 Mar 2021

Dear Mr Molik,

We are pleased to inform you that your manuscript 'Combining Natural Language Processing and Metabarcoding to Reveal Pathogen-Environment Associations' has been provisionally accepted for publication in PLOS Neglected Tropical Diseases.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Neglected Tropical Diseases.

Best regards,

Peter J Myler, Ph.D.

Associate Editor

PLOS Neglected Tropical Diseases

Todd Reynolds

Deputy Editor

PLOS Neglected Tropical Diseases

***********************************************************

Thanks for addressing the remaining issues raised by Reviewer 1.  

PLoS Negl Trop Dis. doi: 10.1371/journal.pntd.0008755.r006

Acceptance letter

Peter John Myler, Todd B Reynolds

1 Apr 2021

Dear Mr Molik,

We are delighted to inform you that your manuscript, "Combining Natural Language Processing and Metabarcoding to Reveal Pathogen-Environment Associations," has been formally accepted for publication in PLOS Neglected Tropical Diseases.

We have now passed your article onto the PLOS Production Department who will complete the rest of the publication process. All authors will receive a confirmation email upon publication.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any scientific or type-setting errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Note: Proofs for Front Matter articles (Editorial, Viewpoint, Symposium, Review, etc...) are generated on a different schedule and may not be made available as quickly.

Soon after your final files are uploaded, the early version of your manuscript will be published online unless you opted out of this process. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Neglected Tropical Diseases.

Best regards,

Shaden Kamhawi

co-Editor-in-Chief

PLOS Neglected Tropical Diseases

Paul Brindley

co-Editor-in-Chief

PLOS Neglected Tropical Diseases

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: revision_letter-1.docx

    Attachment

    Submitted filename: revision_letter_minor_revision.docx

    Data Availability Statement

    The data underlying the results presented in the study are available from osf.io at doi.org/10.17605/OSF.IO/29V3F and doi.org/10.17605/OSF.IO/49W5R.


    Articles from PLoS Neglected Tropical Diseases are provided here courtesy of PLOS