PLOS One. 2024 Jan 19;19(1):e0291406. doi: 10.1371/journal.pone.0291406

Finding Candida auris in public metagenomic repositories

Jorge E Mario-Vasquez 1, Ujwal R Bagal 2, Elijah Lowe 3, Aleksandr Morgulis 4, John Phan 3, D Joseph Sexton 1, Sergey Shiryev 4, Rytis Slatkevičius 5, Rory Welsh 1, Anastasia P Litvintseva 1, Matthew Blumberg 5, Richa Agarwala 4, Nancy A Chow 1,*
Editor: Ricardo Santos 6
PMCID: PMC10798454  PMID: 38241320

Abstract

Candida auris is a newly emerged multidrug-resistant fungus capable of causing invasive infections with high mortality. Despite intense efforts to understand how this pathogen rapidly emerged and spread worldwide, its environmental reservoirs are poorly understood. Here, we present a collaborative effort between the U.S. Centers for Disease Control and Prevention, the National Center for Biotechnology Information, and GridRepublic (a volunteer computing platform) to identify C. auris sequences in publicly available metagenomic datasets. We developed the MetaNISH pipeline that uses SRPRISM to align sequences to a set of reference genomes and computes a score for each reference genome. We used MetaNISH to scan ~300,000 SRA metagenomic runs from 2010 onwards and identified five datasets containing C. auris reads. Finally, GridRepublic has implemented a prospective C. auris molecular monitoring system using MetaNISH and volunteer computing.

Introduction

Candida auris is an emerging and often multidrug-resistant yeast that can cause invasive candidiasis, a life-threatening disease with high mortality [1]. The World Health Organization (WHO) classified C. auris as a critical priority pathogen due to its high outbreak potential, resistance to most available antifungal medicines, and ability to persist in the healthcare environment despite intensive infection prevention strategies [2].

Although the pathogen was first described in Japan in 2009 [3], the earliest known C. auris isolates were retrospectively identified and date back to 1996 in South Korea [4]. Whole-genome sequencing (WGS) of C. auris isolates from four world regions revealed four phylogenetically distinct clades of this fungal pathogen wherein isolates clustered geographically (Clade I, South Asia; Clade II, East Asia; Clade III, Africa; and Clade IV, South America). This finding supported the hypothesis that C. auris emerged independently and simultaneously in geographically separated human populations [5]. WGS of a recently identified isolate from Iran showed the existence of a fifth major clade, with hundreds of thousands of single nucleotide polymorphisms (SNPs) separating this isolate from the four known clades [6,7]. The five major clades are separated by tens to hundreds of thousands of SNPs. Within each clade, isolates from varying countries are typically separated by hundreds to thousands of SNPs [8].

Despite intense efforts to understand how this pathogen emerged and spread to healthcare facilities worldwide, the natural reservoirs of C. auris are poorly understood. Two alternative hypotheses have been proposed to explain the origin of C. auris. One suggests that C. auris existed in the environment before clinical recognition and emerged as a human pathogen due to thermal adaptation in response to environmental changes [9]. Several biological properties of C. auris, such as thermotolerance and halotolerance, that allow this fungus to survive in hypersaline environments provide indirect evidence supporting this theory [10]. The other hypothesis is based on the unique propensity of C. auris to colonize human skin [11,12] and suggests that C. auris might have existed as a minor skin commensal colonizing poorly studied sites on the human body, such as the external ear canal, in isolated human populations and emerged globally in response to the increased use of antifungals in medical and agricultural practices [10,13]. This hypothesis is indirectly supported by the results of molecular dating, which showed that the emergence of outbreak-causing strains in three different lineages (Clade I, Clade III, and Clade IV) coincided with the introduction of azoles into clinics and agriculture [13]. Other explanations are also possible, and more research is needed to better understand the environmental and human reservoirs of this pathogen. Two recent publications reported the isolation of C. auris from a salt marsh and sandy beach on the Andaman Islands in India and from coastal estuaries in Colombia [14,15]. These findings suggest a need for further environmental and human microbiome evaluations.

To conduct extensive environmental evaluations in a financially and logistically feasible manner, investigators have utilized metagenomic data in public repositories [16] like the Sequence Read Archive (SRA), the largest global sequence repository [17]. In this study, the U.S. Centers for Disease Control and Prevention (CDC), the National Center for Biotechnology Information (NCBI), and GridRepublic partnered to develop MetaNISH (Metagenomic Needles In Sequence Hay) and pipelines that utilize it. With these pipelines, we retrospectively screened ~300,000 shotgun metagenomic SRA runs from 2010 to 2022 to identify and describe datasets containing C. auris. In addition, we started prospectively screening datasets for this fungal pathogen daily in April 2023.

Materials and methods

MetaNISH design

NCBI developed the MetaNISH pipeline to screen metagenomic read sets for the presence of each genome in the given set of reference genomes (see benchmark development section). The pipeline consists of two steps: (I) the alignment step using SRPRISM [18] that aligns reads to all reference genomes as a single database, and (II) the score computation step, which increases the score for samples with reads aligned across the genome compared to samples with reads aligned to a small section of the genome.

Alignment is performed with SRPRISM as it guarantees the reporting of all equally good alignments (max 255) across all sequences in the database. Additionally, it supports specifying the region on the reads that must align and a maximum number of errors (mismatches, insertions, or deletions) in the reported alignments. For this study, we required SRPRISM to align the first 100 bases of the reads and specified a maximum of 15 errors for the reported alignments. For reads shorter than 100 bases, the full length of the read is aligned. The design of SRPRISM guarantees that the first 100 bases will have at most 5 of the 15 errors allowed. We chose the first 100 bases as the region that must align as the read quality drops beyond that in many Illumina runs, and 100 bp is also long enough to avoid spurious matches. An example of such a read is SRR11734778.40769.1, which is a paired read with each mate of length 251 bases. Alignments for this read are included in the data released at Zenodo.org (https://doi.org/10.5281/zenodo.10214980). It was shown that the first 163 bases of the read were an exact match to C. auris genomes. However, the remaining portion of the read (specifically, the substring from 164 to 251) seems of inferior quality as it did not report a match to anything in the non-redundant (nr) nucleotide database at NCBI as of November 2023.

Score computation was devised for metagenomic read sets where the depth of coverage by reads could vary considerably over the genome. To determine the extent to which reads were aligned with the genome, regardless of coverage at aligned regions, a padding of up to 100 Kb was added to either end of the genome region aligned by the read. A scaling factor adjusted the padding length for each read and genome according to the number of alignments, ensuring that the padding for a read with multiple mappings to a genome does not exceed 100 Kb at any location. The score was the percentage of the genome covered by padded alignments. For example, read SRR11734778.40769.1 aligned to four C. auris genomes with only one location in each genome. Therefore, full padding of 100 Kb was used for each of the four alignments. However, read SRR11734778.1429600.1 aligned to four contigs on three genomes, as shown in Fig 1. Fig 1A and 1B show four alignments on two genomes, which reduced the padding to 25 Kb. Fig 1C shows two alignments on the third genome, which reduced the padding to 50 Kb. Padded coverage cannot extend beyond contig boundaries.
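The padded-coverage score described above can be illustrated with a minimal Python sketch. It is not the released MetaNISH scoring software. It assumes that the scaling factor divides the 100 Kb padding by the number of alignments a read has on the genome (an inference consistent with the 25 Kb and 50 Kb values quoted for Fig 1), that coordinates are 0-based and half-open, and that the function name and input layout are hypothetical.

```python
from collections import Counter, defaultdict


def padded_score(contig_lengths, alignments, max_pad=100_000):
    """Percent of one genome covered by padded alignments.

    contig_lengths: {contig_name: length_in_bases} for a single genome.
    alignments: list of (read_id, contig_name, start, end) tuples.
    Assumption: padding per alignment = max_pad // (alignments of that read
    on this genome); this rule is inferred, not taken from the MetaNISH code.
    """
    per_read = Counter(read for read, _, _, _ in alignments)
    intervals = defaultdict(list)
    for read, contig, start, end in alignments:
        pad = max_pad // per_read[read]
        lo = max(0, start - pad)                    # clip at contig start
        hi = min(contig_lengths[contig], end + pad)  # clip at contig end
        intervals[contig].append((lo, hi))

    covered = 0
    for spans in intervals.values():
        spans.sort()
        cur_lo, cur_hi = spans[0]
        for lo, hi in spans[1:]:
            if lo <= cur_hi:            # overlapping or adjacent: merge
                cur_hi = max(cur_hi, hi)
            else:                       # gap: close the current interval
                covered += cur_hi - cur_lo
                cur_lo, cur_hi = lo, hi
        covered += cur_hi - cur_lo
    return 100.0 * covered / sum(contig_lengths.values())
```

Because overlapping padded intervals are merged before counting, the result depends on where alignments fall along the genome and not on how many reads pile up at one location, which is the property the score is meant to capture.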

Fig 1. Padding in alignments of read SRR11734778.1429600.1 on different assemblies.


(A) Assembly GCA_000150115.1 (B) Assembly GCA_000151005.2 (C) Assembly GCA_000151035.1.

Using the reference genomes provided and empirical data from a set of 4,000 SRA runs, we proposed padding of 100 Kb and a score of at least 75 to indicate the presence of the corresponding genome in the read set. The choice is conservative: it may flag a few runs as scoring at least 75 when the genome is not present (false positives) but is unlikely to miss any run in which the genome is present (false negatives). At the same time, the parameters are not so conservative that the false positive rate becomes unacceptable. An example that illustrates the importance of padding regardless of the number of reads aligned at any genome location is SRR9016983. Reads from SRR9016983 align to 2.8% of the reference genome B12037.1 at 1,975 locations throughout the genome. Among these reads, 50.4% have 1X coverage, and 38.7% have 2X coverage. Adding 100 Kb padding to these adjacent alignments allows the coverage to reach 100%, thus increasing the likelihood of detection.

The CPU time for the 4,000 runs used for determining the parameters varied from 25 seconds (for SRR7125652) to 17 hours 14 minutes (for SRR8550535) with a median time of 25 minutes. SRR7125652 has 86,554 paired reads, while SRR8550535 has over 418 million paired reads. We noted that SRPRISM, which takes almost all the time in the MetaNISH pipeline (score calculation takes only a couple of seconds), can be run in multi-threaded mode with good scaling up to eight threads, but we did not use that option for the results reported here.

The design of MetaNISH can be used for tracking any pathogen. It requires developing a reference set with representative genomes from all clades of the pathogen to be tracked and from closely related species. Doing so allows SRPRISM to find the best matches for each read among the genomes where a match can be expected. Then, empirical analysis is needed to find suitable parameters for padding and score threshold. If read properties change substantially over time, the alignment method and parameters may need to be revisited.

Benchmark development

CDC collated a set of 100 reference genomes representing priority pathogens for fungal molecular surveillance, including a representative subset of genomes for C. auris [19,20]. Specifically, 14 genomes were of C. auris, 44 were of other Candida species, and 42 were of other fungal genera (i.e., Ajellomyces, Blastomyces, Clavispora, Coccidioides, Cryptococcus, Emergomyces, Emmonsia, Paracoccidioides, Pichia, Pneumocystis, Saitozyma, Sporothrix, and Talaromyces). Detailed information regarding assemblies used are found in S1 Table. The data released for this paper also includes sequences for all 100 reference genomes.

For the benchmark dataset, sequencing data for 20 metagenomic runs were generated by sequencing clinical specimens with C. auris spiked in at various concentrations. Briefly, residual material from C. auris colonization screening swabs collected from the anterior nares was used to build the benchmarking dataset. The qualitative presence of C. auris was first confirmed by enrichment broth culture [21]. The concentration of C. auris cells was then assessed through a quantitative SYBR Green qPCR as previously described [22]. Cell concentrations were interpolated from a standard curve built using samples spiked with C. auris AR 0385 at serial dilutions ranging from 10^7 CFU/mL to 10^3 CFU/mL. Concentrations in the standard curve were confirmed by CFU counts and tested with three biological replicates at each concentration. The melt curve was referenced for both standard curve and benchmark samples to confirm that a strong melt peak was present in positive samples at ~83–84°C, the signature temperature indicative of C. auris. No nonspecific amplification was observed in the standard curve or benchmarking samples. Five "no template controls" were included in the run. As expected, there was no amplification in these samples. Sequence data were deposited in PRJNA631031.

MetaNISH implementation

The bash pipeline used by CDC for searching NCBI’s SRA database, integrated with MetaNISH, is depicted in Fig 2. The following filtering criteria were applied (platform: Illumina; library_source: metagenomic; consent: public; assay_type: WGS; library_selection: random) so that only whole-genome sequence metagenomic datasets were downloaded, temporarily, together with additional metadata (accession ID, biosample, bioproject, release date, library layout, mbases, and organism). The alignment and scoring were done as per the MetaNISH design described earlier. The pipeline reports the scores for all 100 reference genomes for each SRA ID.
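The filtering step can be sketched in Python as follows. This is an illustration only and not the bash pipeline used by CDC. The criteria names and values are those listed above; the record layout (a dictionary per run with an "accession" key), the function names, and the case-insensitive comparison are assumptions made for the example.

```python
# Criteria taken from the text; values are compared case-insensitively
# because metadata capitalization is an assumption here.
CRITERIA = {
    "platform": "illumina",
    "library_source": "metagenomic",
    "consent": "public",
    "assay_type": "wgs",
    "library_selection": "random",
}


def passes_filter(record):
    """Return True when the record satisfies every filtering criterion."""
    return all(
        str(record.get(field, "")).lower() == wanted
        for field, wanted in CRITERIA.items()
    )


def select_runs(records):
    """Return the accessions of runs to download and screen."""
    return [r["accession"] for r in records if passes_filter(r)]
```

A run whose assay_type is, for example, AMPLICON is dropped by this filter, which mirrors the restriction of the study to shotgun metagenomic data.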

Fig 2. Pipeline for C. auris sequence-based monitoring using MetaNISH.


Steps 1–4 comprise collecting the required input data (sample sequence reads and reference database) for MetaNISH (step 5), whose output is a file with the scores for all references for each sample processed. Finally (step 6), this set of files is processed and analyzed to obtain the samples with positive hits (score ≥ 75) for C. auris.

Data analysis

The samples in the benchmark set were spiked using C. auris AR 0385 (Biosample SAMN05379620 as per CDC’s AR isolate bank; strain B11244), which has reads in SRA under accession SRR3883465 but no published assembly. Therefore, we used the assembly in our reference set of 100 genomes that is closest to the spike-in strain for presenting the data analysis. We found the closest assembly in the following manner. The reads in SRR3883465 were assembled using SKESA [23], resulting in an assembly with a length of 12.21 Mb and N50 of 22 Kb. The assembly was then aligned to all reference genomes using BLAST, retaining only the best e-value alignments, and coverage on the reference genomes was determined using the retained alignments. The analysis revealed that 12.19 Mb of the assembly had alignments to reference genomes, with the maximum coverage for the reference assembly B12342.1 at 11.52 Mb aligned. The second-best coverage was for B11245, but it had only 1.4 Mb aligned. All alignments to B12342.1 had a percent identity of at least 99.6%, of which all except 8,465 bp aligned at a percent identity of at least 99.9%. Hence, MetaNISH scores for the benchmark dataset were presented using reference genome B12342.1. These scores were compared to KrakenUniq [24], a method for metagenomic classification that provides a quantitative measure of genome coverage. KrakenUniq was run with defaults, except that no information was printed for unclassified sequences, using the parameter --only-classified-output.

Heatmaps were generated using alignments to reference genome B12342.1. Contigs in the reference assembly were split into consecutive intervals of size 200 Kb and 2 Kb to represent padding of 100 Kb and 1 Kb, respectively, on both ends of the read alignments. For each alignment, the starting position of the alignment on the contig determined the bin to which the alignment contributes, and the count for that bin was increased by one. The counts were plotted in MATLAB (version R2020a, Update 2) using the imagesc function to produce the heatmaps.
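The binning behind the heatmaps can be sketched in a few lines of Python. This is an illustrative sketch and not the code used to produce Fig 3; it assumes 0-based start positions on a single contig, and the function name is hypothetical.

```python
def bin_counts(contig_length, starts, bin_size=200_000):
    """Count alignments per consecutive bin of a contig.

    contig_length: contig length in bases.
    starts: 0-based start positions of alignments on the contig (assumption).
    bin_size: 200_000 mirrors 100 Kb padding on both ends; use 2_000 for 1 Kb.
    """
    n_bins = -(-contig_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for start in starts:
        counts[start // bin_size] += 1      # bin chosen by alignment start
    return counts
```

Stacking one such count vector per run, row by row, gives a matrix of the kind that an image-plotting function such as imagesc can display.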

SRA reads were aligned to the set of reference genomes, and a score for each reference genome using padded coverage was obtained for SRA ids from January 2010 to November 2022, retrospectively. Using the output from MetaNISH, we scanned the scores for all C. auris reference assemblies using a MetaNISH score ≥50 up to the maximum possible score of 100 to obtain the number of SRA runs with at least a hit on any of the C. auris assemblies. With the suggested score of ≥75 as the threshold for positive pathogen identification, samples with C. auris positive hits were described using the metadata collected.
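The scanning step can be sketched as follows. This Python example is illustrative and not the analysis script used in the study. It assumes that each run is summarized by its highest score across the C. auris reference assemblies (one reading of "at least a hit on any of the C. auris assemblies"); the input layout and function names are hypothetical. The score ranges are those used to report the results: [50–75), [75–85), [85–95), [95–100), and exactly 100.

```python
BINS = [(50, 75), (75, 85), (85, 95), (95, 100)]  # closed start, open end


def score_bin(score):
    """Return the label of the score range, or None for scores below 50."""
    if score == 100:
        return "100"
    for lo, hi in BINS:
        if lo <= score < hi:
            return f"[{lo}-{hi})"
    return None


def summarize(run_scores, threshold=75):
    """Tally runs per score range and list runs at or above the threshold.

    run_scores: {run_id: {assembly_id: score}} restricted to C. auris
    assemblies. Using the maximum score per run is an assumption.
    """
    tally = {}
    positives = []
    for run, per_assembly in run_scores.items():
        best = max(per_assembly.values())
        label = score_bin(best)
        if label is not None:
            tally[label] = tally.get(label, 0) + 1
        if best >= threshold:
            positives.append(run)
    return tally, sorted(positives)
```

Runs in the [50–75) range are counted but not reported as positive, which keeps borderline scores visible without lowering the threshold of 75.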

Results

Benchmarking the reference dataset in the monitoring tool

The benchmark dataset is further described in Table 1. The presence of C. auris spike-in, cell concentration, scores computed by MetaNISH using different padding lengths, and the assembly coverage reported by KrakenUniq (Table 1 and Fig 3) are indicated for each metagenomic run. Using a score of 75 with 100 Kb padding, MetaNISH detected all true positive samples as well as one false positive sample, while a score of 80 separated all positive and negative samples. Significant variation was observed in scores with padding of less than 100 Kb, no padding, and KrakenUniq. For example, SRR11734778 and SRR11734781 had similar cell concentrations (7.1 x 10^4 CFU/mL and 6.7 x 10^4 CFU/mL, respectively) and the same score (100) using 100 Kb padding; however, they had substantially different scores with 1 Kb padding (39.97 and 89.09, respectively), no padding (7.42 and 24.38, respectively), and KrakenUniq (3.73 and 13.52, respectively). Increasing padding to even just 10 Kb brings the padded coverage to over 98 for SRR11734778 and SRR11734781. As reflected in Table 1 and Fig 3, a padding length of 100 Kb was effective in differentiating positive samples like SRR11734782, with a score of 92.74, from negative samples, while MetaNISH with less than 100 Kb or no padding and KrakenUniq coverage values were not effective. Fig 3 shows that the score is not affected by the number of reads aligning in a specific bin: coverage is high for SRR11734785 (all yellow) and relatively low but well distributed throughout the genome for SRR11734774 (primarily light blue).

Table 1. Benchmark results using B12342 reference genome.

| Run | Status | Concentration a | MetaNISH 150 Kb | MetaNISH 100 Kb | MetaNISH 50 Kb | MetaNISH 10 Kb | MetaNISH 1 Kb | MetaNISH no padding | KrakenUniq coverage*100 |
|---|---|---|---|---|---|---|---|---|---|
| SRR11734785 | pos | 5.8 x 10^5 | 100 | 100 | 100 | 100 | 99.98 | 99.9 | 82.9 |
| SRR11734772 | pos | 3.5 x 10^5 | 100 | 100 | 100 | 100 | 99.97 | 99.57 | 76.98 |
| SRR11734791 | pos | 2.0 x 10^5 | 100 | 100 | 100 | 100 | 99.96 | 85.92 | 56.38 |
| SRR11734780 | pos | 1.6 x 10^5 | 100 | 100 | 100 | 100 | 99.93 | 97.88 | 71.1 |
| SRR11734775 | pos | 1.2 x 10^5 | 100 | 100 | 100 | 100 | 99.32 | 50.33 | 30.6 |
| SRR11734777 | pos | 8.6 x 10^4 | 100 | 100 | 100 | 99.98 | 89.8 | 30.07 | 17.36 |
| SRR11734778 | pos | 7.1 x 10^4 | 100 | 100 | 100 | 98.8 | 39.97 | 7.42 | 3.73 |
| SRR11734781 | pos | 6.7 x 10^4 | 100 | 100 | 100 | 100 | 89.09 | 24.38 | 13.52 |
| SRR11734776 | pos | 4.3 x 10^4 | 100 | 100 | 100 | 99.98 | 66.17 | 13.89 | 9.003 |
| SRR11734779 | pos | 2.7 x 10^4 | 100 | 100 | 100 | 99.99 | 68.22 | 13.6 | 7.575 |
| SRR11734773 | pos | 1.1 x 10^4 | 100 | 100 | 100 | 99.99 | 93.81 | 29.67 | 16.4 |
| SRR11734774 | pos | 1.1 x 10^4 | 100 | 100 | 99.8 | 71.75 | 14.28 | 2.36 | 1.756 |
| SRR11734783 | pos | 1.0 x 10^4 | 100 | 100 | 100 | 99.74 | 48.4 | 8 | 4.61 |
| SRR11734784 | pos | 2.9 x 10^3 | 100 | 100 | 100 | 99.99 | 67.56 | 13.65 | 8.118 |
| SRR11734782 | pos | 1.9 x 10^3 | 96.72 | 92.74 | 74.01 | 23.71 | 3.04 | 0.43 | 0.8356 |
| SRR11734790 | neg | NA | 85.08 | 79.43 | 53.74 | 14.74 | 1.84 | 0.25 | 0.9267 |
| SRR11734789 | neg | NA | 65.16 | 57.18 | 34.14 | 8.43 | 0.99 | 0.13 | 0.4599 |
| SRR11734787 | neg | NA | 50.28 | 41.88 | 22.96 | 5.29 | 0.68 | 0.11 | 0.1616 |
| SRR11734788 | neg | NA | 46.98 | 39.93 | 22.75 | 5.4 | 0.63 | 0.08 | 0.8494 |
| SRR11734786 | mock | NA | 3.56 | 2.88 | 1.47 | 0.57 | 0.14 | 0.04 | 0.4101 |

Pos: Positive for C. auris; neg: Negative for C. auris; mock: Pooled skin swab samples negative for C. auris.

a Units in CFU/mL.
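The effect of the score threshold on the benchmark can be checked with a short Python sketch. The scores are the 100 Kb padding values reported in Table 1 (14 positive runs at 100, one positive run at 92.74, and the five negative or mock runs); the function name is hypothetical and the example is only a worked illustration of the comparison between thresholds of 75 and 80.

```python
# (score with 100 Kb padding, C. auris spiked in?) taken from Table 1.
SCORES_100KB = [(100, True)] * 14 + [
    (92.74, True),
    (79.43, False),
    (57.18, False),
    (41.88, False),
    (39.93, False),
    (2.88, False),   # mock sample
]


def confusion(scores, threshold):
    """Return (true pos, false neg, false pos, true neg) at a threshold."""
    tp = sum(1 for s, pos in scores if pos and s >= threshold)
    fn = sum(1 for s, pos in scores if pos and s < threshold)
    fp = sum(1 for s, pos in scores if not pos and s >= threshold)
    tn = sum(1 for s, pos in scores if not pos and s < threshold)
    return tp, fn, fp, tn


print(confusion(SCORES_100KB, 75))  # (15, 0, 1, 4)
print(confusion(SCORES_100KB, 80))  # (15, 0, 0, 5)
```

At a threshold of 75 the single false positive is the negative run scoring 79.43; raising the threshold to 80 removes it without losing the positive run at 92.74.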

Fig 3. Distribution of the number of reads aligned in the benchmark set to the B12342 reference genome.


The reference genome is binned by 200 Kb to reflect padding of 100 Kb on both ends of read alignments (A) and similarly by 2 Kb to reflect the padding of 1 Kb (B). The read sets are sorted by decreasing concentration levels of C. auris, with the topmost run SRR11734785 having the highest concentration.

Detection of Candida auris in metagenomic datasets

The number of samples per year that met the filtering criteria of the pipeline (Fig 2) increased from one sample in 2010 to 57,756 in 2022 (Table 2). As of December 2022, 291,341 SRA samples were analyzed (Table 2) to produce an output of C. auris hits with varying genome coverage (Table 3). C. auris was identified in five sample datasets: PRJNA488992 (2 SRA runs), PRJNA657014 (4 SRA runs), PRJNA475330 (1 SRA run), PRJNA631031 (15 SRA runs), and PRJNA557323 (2 SRA runs) (Table 4). Sequence reads from PRJNA488992 were collected from wastewater drains and river samples from Delhi, India. Reads from the PRJNA657014 dataset were from skin swabs of the residents of healthcare facilities in the United States where C. auris had been identified [25], and reads from PRJNA631031 were from our benchmark dataset. Sequence reads from PRJNA557323 were collected from human stool samples from Hong Kong. Finally, PRJNA475330 samples were collected from a sulfur-oxidizing, nitrate-reducing enrichment culture of a groundwater sample from Germany (Table 4).

Table 2. Number of SRA records scanned.

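| Year | Records scanned | Cumulative |
|---|---|---|
| 2010 | 1 | 1 |
| 2011 | 41 | 42 |
| 2012 | 551 | 593 |
| 2013 | 2570 | 3163 |
| 2014 | 7023 | 10186 |
| 2015 | 17402 | 27588 |
| 2016 | 11529 | 39117 |
| 2017 | 21458 | 60575 |
| 2018 | 26897 | 87472 |
| 2019 | 51395 | 138867 |
| 2020 | 47766 | 186633 |
| 2021 | 46952 | 233585 |
| 2022 | 57756 | 291341 |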

Table 3. Binned padded genome coverage (score) of SRA runs (samples) with Candida auris hits.

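| Year | Score [50–75) | Score [75–85) | Score [85–95) | Score [95–100) | Score 100* |
|---|---|---|---|---|---|
| 2010 | 0 | 0 | 0 | 0 | 0 |
| 2011 | 0 | 0 | 0 | 0 | 0 |
| 2012 | 0 | 0 | 0 | 0 | 0 |
| 2013 | 0 | 0 | 0 | 0 | 0 |
| 2014 | 0 | 0 | 0 | 0 | 0 |
| 2015 | 0 | 0 | 0 | 0 | 0 |
| 2016 | 5 | 0 | 0 | 0 | 0 |
| 2017 | 0 | 0 | 0 | 0 | 0 |
| 2018 | 0 | 0 | 0 | 0 | 0 |
| 2019 | 1 | 0 | 2 | 4 | 6 |
| 2020 | 6 | 3 | 5 | 9 | 14 |
| 2021 | 0 | 0 | 0 | 0 | 0 |
| 2022 | 1 | 0 | 0 | 0 | 0 |
| Total | 13 | 3 | 7 | 13 | 20 |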

Numbers in bold correspond to the samples where C. auris was identified, and its metadata is shown in Table 4. *This is not an interval; it equals the number of runs with a score of 100. For all other columns, the interval is closed at the beginning and open at the end.

Table 4. Bioproject metadata for samples with WGS data at SRA with C. auris positive hits.

| Run | Record score | Release year | Bioproject | SRA study | Title | Environment or isolation source |
|---|---|---|---|---|---|---|
| SRR8584355 | 100% | 2019 | PRJNA488992 | SRP159446 | Metagenomics of wastewater drains and river samples from Delhi, India | Wastewater drain |
| SRR8584356 | 100% | | | | | Urban river |
| SRR9016982 | 100% | 2019 | PRJNA657014 | SRP277451 | Sequencing data from point prevalence study associated with C. auris Raw sequence reads | Combined axilla and inguinal crease (groin) and anterior nares (Human skin metagenome) |
| SRR9016983 | 100% | | | | | |
| SRR9016984 | 100% | | | | | |
| SRR9016985 | 100% | | | | | |
| SRR10237756 | >90% | 2019 | PRJNA475330 | SRP161559 | Metagenomic assembly of the iron-reducing, 1-methylnaphthalene-degrading enrichment culture (1MN) | Sulfur-oxidizing nitrate-reducing enrichment culture |
| SRR11734772 | 100% | 2020 | PRJNA631031 | SRP260772 | Study of microbial diversity of anterior nares swabs from patients colonized by the pathogen Candida auris | Human nasopharyngeal metagenome |
| SRR11734773 | 100% | | | | | |
| SRR11734774 | 100% | | | | | |
| SRR11734775 | 100% | | | | | |
| SRR11734776 | 100% | | | | | |
| SRR11734777 | 100% | | | | | |
| SRR11734778 | 100% | | | | | |
| SRR11734779 | 100% | | | | | |
| SRR11734780 | 100% | | | | | |
| SRR11734781 | 100% | | | | | |
| SRR11734783 | 100% | | | | | |
| SRR11734784 | 100% | | | | | |
| SRR11734785 | 100% | | | | | |
| SRR11734791 | 100% | | | | | |
| SRR11734782 | >90% | | | | | |
| SRR10680803 | >90% | 2020 | PRJNA557323 | SRP237407 | Human gut metagenomes from Hong Kong populations | Stool samples (Human gut metagenome) |
| SRR10680804 | | | | | | |

Prospective monitoring

GridRepublic has implemented a molecular monitoring system using MetaNISH on a volunteer computing platform (i.e., a distributed computing platform comprised of resources volunteered by the general public). This system successfully screens all new metagenomic data submitted to SRA daily for C. auris (averaging 925 runs per day). These results are available on the web at www.gridrepublic.org/biosurveillance.

Discussion

In a collaborative effort between CDC, NCBI, and GridRepublic, we developed and benchmarked bioinformatics tools for the prospective monitoring of metagenomic datasets for the detection of C. auris and examined ~300,000 SRA runs released between 2010 and 2022 to identify this pathogen. Using benchmarking samples generated by spiking human skin microbiome with known concentrations of C. auris, we found that the MetaNISH pipeline with the following parameters was successful in identifying samples with C. auris: (i) alignment of the first 100 bases to the target using SRPRISM, (ii) padding length of 100 Kb, and (iii) score threshold of at least 75. Raising the cutoff score to 80 separated all positive and negative samples; however, using a cutoff of 75 may increase the chances of identifying samples with a low prevalence of C. auris. The proposed parameters were especially beneficial for the detection of benchmarking samples with lower concentrations of C. auris reads, such as SRR11734774 and SRR11734782, which showed low base pair coverage of 2.36% and 0.43% by SRPRISM, and 1.8% and 0.8% by KrakenUniq, but generated scores of 100 and 92.74 with MetaNISH and 100 Kb padding (Table 1). Scores above 90 indicate that the alignments were well distributed throughout the genome, increasing confidence in the results.

Our study identified five metagenomic datasets containing C. auris sequences in the public SRA repository. The first was the benchmark dataset used to test our pipeline. The second was from skin swabs of patients colonized with C. auris [25]. Detection of C. auris in these samples was not surprising, although it was important for providing an independent validation of the developed method. The third dataset was from a study of the stool microbiome of healthy individuals in Hong Kong [26], a finding that was novel and unexpected. Although C. auris has previously been isolated from the gastrointestinal tract [27,28], it is generally accepted that this fungus is primarily a skin colonizer [12]. Several publications show that Candida spp. can survive passage through the gut in healthy adults and possibly spread further via wastewater [29,30]. Our observation raises the question of whether patients colonized with C. auris on the skin are also colonized in the gut and whether some human communities may harbor previously unknown reservoirs of C. auris [31,32]. More studies of healthy people are needed to understand the prevalence of C. auris in the community [33].

The presence of C. auris in the fourth set, a laboratory culture of iron-reducing bacteria, most likely indicates contamination [34], although it suggests that the fungus can survive under such iron-reducing conditions. The detection of C. auris in aquatic biome samples from Delhi is consistent with the recent reports showing isolation of C. auris from coastal waters in India and Colombia [14,15], which provided support for the hypothesis of an environmental origin of this pathogen [9,10]. However, it is equally likely that C. auris spread into aquatic environments from contaminated wastewater after being excreted from the gastrointestinal tract or washed off the skin of colonized people [35]. These findings point to its most likely mode of spread (any aquatic stream or aqueous medium) between human populations and the environment, in either direction [36].

A limitation of this study is that only shotgun metagenomic data were analyzed, and amplicon sequencing data were excluded. Relatively high costs and the need for more advanced bioinformatics have limited the use of shotgun metagenomics for microbiome analysis on a large scale [37–39]. In contrast, the amplicon sequencing approach is the most widely used method for analyzing microbial communities due to its cost-effectiveness, established data analysis pipelines, and availability of an extensive archive of reference data [12,25,36]. Thus, building a validated search option for amplicon datasets into MetaNISH, by generating benchmark amplicon datasets and C. auris reference databases, will complement the existing metagenomic search function. The other limitation of the study is that for most SRA submissions, only limited metadata on the specimens is available. A follow-up with the submitters is often needed to identify additional details of the study and to determine whether a finding of C. auris sequences in the sample is indeed a reflection of the sequenced community and not an artifact of laboratory practices, in which C. auris might have been used as a loading control or occurred as a contaminant. It is also important to point out that the identification of reads of C. auris in certain samples may not necessarily indicate that these samples represent the ecological niche for this fungus. As described above with the example of finding C. auris in coastal waters, the directionality of C. auris transition between the human skin and coastal waters is not clear. It is equally likely that the fungus emerged in coastal habitats and later transitioned into human populations, or that it emerged elsewhere and was introduced into the coastal waters from colonized persons.

Because, in many cases, there is a significant time lag between the collection of a sample and the submission/publication of its sequence reads, the detections that can be made (even daily) do not imply an active outbreak response but are valuable post hoc information that allows tracking trends of spread and are encompassed in the data integration of a One Health surveillance system [40].

The findings presented in this study using MetaNISH on public metagenomic data support the results of previous work on C. auris in natural environments. This work also lays the foundation for the prospective monitoring system for C. auris: the modular design of MetaNISH makes it suitable for this daily job, which, in addition to addressing scientific questions about the origin of C. auris, provides a necessary public health monitoring tool for investigating the spread of C. auris into new areas. GridRepublic has implemented the pipeline developed and evaluated in our study on a distributed computing network, adapting it into a real-time monitoring system. Future efforts can adapt this tool to monitor other emerging pathogens and public health threats.

Supporting information

S1 Table. Reference genomes.

(DOCX)

Acknowledgments

This research work was supported in part by the National Center for Biotechnology Information of the National Library of Medicine (NLM), National Institutes of Health. This work was also made possible through support from the CDC Office of Advanced Molecular Detection (OAMD). The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of the Centers for Disease Control and Prevention.

Data Availability

SRPRISM is available at https://github.com/ncbi/SRPRISM. Software for rank calculation, sequence data for reference genomes, and a sample run have been deposited at Zenodo.org and are available at https://doi.org/10.5281/zenodo.10214980. The benchmark dataset is available at NCBI under BioProject PRJNA631031. All additional data analyzed in the manuscript is already publicly available in SRA at NCBI (https://www.ncbi.nlm.nih.gov/sra).

Funding Statement

The authors received no specific funding for this work.

References

  • 1. Watkins RR, Gowen R, Lionakis MS, Ghannoum M. Update on the Pathogenesis, Virulence, and Treatment of Candida auris. Pathog Immun. 2022;7(2):46–65. doi: 10.20411/pai.v7i2.535
  • 2. WHO. WHO fungal priority pathogens list to guide research, development and public health action: World Health Organization; 2022. Available from: https://www.who.int/publications/i/item/9789240060241
  • 3. Satoh K, Makimura K, Hasumi Y, Nishiyama Y, Uchida K, Yamaguchi H. Candida auris sp. nov., a novel ascomycetous yeast isolated from the external ear canal of an inpatient in a Japanese hospital. Microbiology and Immunology. 2009;53(1):41–4. doi: 10.1111/j.1348-0421.2008.00083.x
  • 4. Lee WG, Shin JH, Uh Y, Kang MG, Kim SH, Park KH, et al. First Three Reported Cases of Nosocomial Fungemia Caused by Candida auris. Journal of Clinical Microbiology. 2011;49(9):3139–42. doi: 10.1128/JCM.00319-11
  • 5. Lockhart SR, Etienne KA, Vallabhaneni S, Farooqi J, Chowdhary A, Govender NP, et al. Simultaneous Emergence of Multidrug-Resistant Candida auris on 3 Continents Confirmed by Whole-Genome Sequencing and Epidemiological Analyses. Clin Infect Dis. 2017;64(2):134–40. doi: 10.1093/cid/ciw691
  • 6. Chow NA, De Groot T, Badali H, Abastabar M, Chiller TM, Meis JF. Potential Fifth Clade of Candida auris, Iran, 2018. Emerging Infectious Diseases. 2019;25(9):1780–1. doi: 10.3201/eid2509.190686
  • 7. Spruijtenburg B, Badali H, Abastabar M, Mirhendi H, Khodavaisy S, Sharifisooraki J, et al. Confirmation of fifth Candida auris clade by whole genome sequencing. Emerging Microbes & Infections. 2022;11(1):2405–11. doi: 10.1080/22221751.2022.2125349
  • 8. Chow NA, Gade L, Tsay SV, Forsberg K, Greenko JA, Southwick KL, et al. Multiple introductions and subsequent transmission of multidrug-resistant Candida auris in the USA: a molecular epidemiological survey. The Lancet Infectious Diseases. 2018;18(12):1377–84. doi: 10.1016/S1473-3099(18)30597-8
  • 9. Casadevall A, Kontoyiannis DP, Robert V. Environmental Candida auris and the Global Warming Emergence Hypothesis. mBio. 2021;12(2). doi: 10.1128/mBio.00360-21
  • 10. Jackson BR, Chow N, Forsberg K, Litvintseva AP, Lockhart SR, Welsh R, et al. On the Origins of a Species: What Might Explain the Rise of Candida auris? Journal of Fungi. 2019;5(3):58. doi: 10.3390/jof5030058
  • 11. Huang X, Hurabielle C, Drummond RA, Bouladoux N, Desai JV, Sim CK, et al. Murine model of colonization with fungal pathogen Candida auris to explore skin tropism, host risk factors and therapeutic strategies. Cell Host & Microbe. 2021;29(2):210–21.e6. doi: 10.1016/j.chom.2020.12.002
  • 12. Proctor DM, Dangana T, Sexton DJ, Fukuda C, Yelin RD, Stanley M, et al. Integrated genomic, epidemiologic investigation of Candida auris skin colonization in a skilled nursing facility. Nature Medicine. 2021;27(8):1401–9. doi: 10.1038/s41591-021-01383-w
  • 13. Chow NA, Muñoz JF, Gade L, Berkow EL, Li X, Welsh RM, et al. Tracing the Evolutionary History and Global Expansion of Candida auris Using Population Genomic Analyses. mBio. 2020;11(2). doi: 10.1128/mBio.03364-19
  • 14. Arora P, Singh P, Wang Y, Yadav A, Pawar K, Singh A, et al. Environmental Isolation of Candida auris from the Coastal Wetlands of Andaman Islands, India. mBio. 2021;12(2). doi: 10.1128/mBio.03181-20
  • 15. Escandón P. Novel Environmental Niches for Candida auris: Isolation from a Coastal Habitat in Colombia. Journal of Fungi. 2022;8(7):748. doi: 10.3390/jof8070748
  • 16. Abdill RJ, Adamowicz EM, Blekhman R. Public human microbiome data are dominated by highly developed countries. PLOS Biology. 2022;20(2):e3001536. doi: 10.1371/journal.pbio.3001536
  • 17. Katz K, Shutov O, Lapoint R, Kimelman M, Brister JR, O’Sullivan C. The Sequence Read Archive: a decade more of explosive growth. Nucleic Acids Res. 2022;50(D1):D387–D390. doi: 10.1093/nar/gkab1053
  • 18. Morgulis A, Agarwala R. SRPRISM (Single Read Paired Read Indel Substitution Minimizer): an efficient aligner for assemblies with explicit guarantees. Gigascience. 2020;9(4). doi: 10.1093/gigascience/giaa023
  • 19. Muñoz JF, Welsh RM, Shea T, Batra D, Gade L, Howard D, et al. Clade-specific chromosomal rearrangements and loss of subtelomeric adhesins in Candida auris. Genetics. 2021;218(1). doi: 10.1093/genetics/iyab029
  • 20. Welsh RM, Misas E, Forsberg K, Lyman M, Chow NA. Candida auris Whole-Genome Sequence Benchmark Dataset for Phylogenomic Pipelines. J Fungi (Basel). 2021;7(3). doi: 10.3390/jof7030214
  • 21. Welsh RM, Bentz ML, Shams A, Houston H, Lyons A, Rose LJ, et al. Survival, Persistence, and Isolation of the Emerging Multidrug-Resistant Pathogenic Yeast Candida auris on a Plastic Health Care Surface. Journal of Clinical Microbiology. 2017;55(10):2996–3005. doi: 10.1128/JCM.00921-17
  • 22. Sexton DJ, Kordalewska M, Bentz ML, Welsh RM, Perlin DS, Litvintseva AP. Direct Detection of Emergent Fungal Pathogen Candida auris in Clinical Skin Swabs by SYBR Green-Based Quantitative PCR Assay. Journal of Clinical Microbiology. 2018;56(12). doi: 10.1128/JCM.01337-18
  • 23. Souvorov A, Agarwala R, Lipman DJ. SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biology. 2018;19(1). doi: 10.1186/s13059-018-1540-z
  • 24. Breitwieser FP, Baker DN, Salzberg SL. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biology. 2018;19(1). doi: 10.1186/s13059-018-1568-0
  • 25. Huang X, Welsh RM, Deming C, Proctor DM, Thomas PJ, Gussin GM, et al. Skin Metagenomic Sequence Analysis of Early Candida auris Outbreaks in U.S. Nursing Homes. mSphere. 2021;6(4):e0028721. doi: 10.1128/mSphere.00287-21
  • 26. Yeoh YK, Chen Z, Wong MCS, Hui M, Yu J, Ng SC, et al. Southern Chinese populations harbour non-nucleatum Fusobacteria possessing homologues of the colorectal cancer-associated FadA virulence factor. Gut. 2020;69(11):1998–2007. doi: 10.1136/gutjnl-2019-319635
  • 27. Alam MJ, Begum K, Endres BT, McPherson J, Costa G, Miranda JM, et al. 1720. Isolation and Characterization of Candida auris From an Active Surveillance System in Texas. Open Forum Infectious Diseases. 2019;6(Supplement_2):S630–S.
  • 28. Zuo T, Zhan H, Zhang F, Liu Q, Tso EYK, Lui GCY, et al. Alterations in Fecal Fungal Microbiome of Patients With COVID-19 During Time of Hospitalization until Discharge. Gastroenterology. 2020;159(4):1302–10.e5. doi: 10.1053/j.gastro.2020.06.048
  • 29. Leonardi I, Paramsothy S, Doron I, Semon A, Kaakoush NO, Clemente JC, et al. Fungal Trans-kingdom Dynamics Linked to Responsiveness to Fecal Microbiota Transplantation (FMT) Therapy in Ulcerative Colitis. Cell Host & Microbe. 2020;27(5):823–9.e3. doi: 10.1016/j.chom.2020.03.006
  • 30. Rossi A, Chavez J, Iverson T, Hergert J, Oakeson K, Lacross N, et al. Candida auris Discovery through Community Wastewater Surveillance during Healthcare Outbreak, Nevada, USA, 2022. Emerging Infectious Diseases. 2023;29(2):422–5. doi: 10.3201/eid2902.221523
  • 31. Olsen M, Nassar R, Senok A, Moloney S, Lohning A, Jones P, et al. Mobile phones are hazardous microbial platforms warranting robust public health and biosecurity protocols. Scientific Reports. 2022;12(1). doi: 10.1038/s41598-022-14118-9
  • 32. Tharp B, Zheng R, Bryak G, Litvintseva AP, Hayden MK, Chowdhary A, et al. Role of Microbiota in the Skin Colonization of Candida auris. mSphere. 2023;8(1):e0062322. doi: 10.1128/msphere.00623-22
  • 33. Ahmad S, Asadzadeh M. Strategies to Prevent Transmission of Candida auris in Healthcare Settings. Curr Fungal Infect Rep. 2023:1–13. doi: 10.1007/s12281-023-00451-7
  • 34. Müller H, Marozava S, Probst AJ, Meckenstock RU. Groundwater cable bacteria conserve energy by sulfur disproportionation. The ISME Journal. 2020;14(2):623–34. doi: 10.1038/s41396-019-0554-1
  • 35. Akinbobola AB, Kean R, Hanifi SMA, Quilliam RS. Environmental reservoirs of the drug-resistant pathogenic yeast Candida auris. PLOS Pathogens. 2023;19(4):e1011268. doi: 10.1371/journal.ppat.1011268
  • 36. Irinyi L, Roper M, Malik R, Meyer W. Finding a Needle in a Haystack—In Silico Search for Environmental Traces of Candida auris. Jpn J Infect Dis. 2022;75(5):490–5. doi: 10.7883/yoken.JJID.2022.068
  • 37. Hilt EE, Ferrieri P. Next Generation and Other Sequencing Technologies in Diagnostic Microbiology and Infectious Diseases. Genes (Basel). 2022;13(9). doi: 10.3390/genes13091566
  • 38. Ranjan R, Rani A, Metwally A, Mcgee HS, Perkins DL. Analysis of the microbiome: Advantages of whole genome shotgun versus 16S amplicon sequencing. Biochemical and Biophysical Research Communications. 2016;469(4):967–77. doi: 10.1016/j.bbrc.2015.12.083
  • 39. Rausch P, Rühlemann M, Hermes BM, Doms S, Dagan T, Dierking K, et al. Comparative analysis of amplicon and metagenomic sequencing methods reveals key features in the evolution of animal metaorganisms. Microbiome. 2019;7(1). doi: 10.1186/s40168-019-0743-1
  • 40. Ko KKK, Chng KR, Nagarajan N. Metagenomics-enabled microbial surveillance. Nature Microbiology. 2022;7(4):486–96. doi: 10.1038/s41564-022-01089-w

Decision Letter 0

Ricardo Santos

31 Oct 2023

PONE-D-23-27761

Finding Candida auris in public metagenomic repositories

PLOS ONE

Dear Dr. Chow,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 15 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Ricardo Santos

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. 

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

Additional Editor Comments:

Please see below the comments and suggested MAJOR revisions made by the individual(s) who reviewed your manuscript.  If provided, the referee's report(s) indicate the revisions that need to be made before it can be accepted for publication.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: N/A

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript presented by Mario-Vasquez et al. describing the finding of C. auris in WGS data using “MetaNISH” is very well written and can be divided into two main parts: i) the presentation of a tool that can identify C. auris in metagenomics sequencing data; and ii) the screening of public repositories for the presence of this species in metagenomics data. For this reason, it is of interest for a broad community. Still, I have some comments/concerns that I would like to clarify:

1. The manuscript describes the MetaNISH tool and states its availability through an NCBI ftp address. This is not the most standard way to comply with the FAIR principles. It would be good to have the code of the tool freely available in an open-source repository, such as github or gitlab.

2. Regarding the behavior of MetaNISH, it was not clear to me why only the first 100bp of a read are aligned to the reference database.

3. The padding strategy applied by the authors is very interesting and seems to be of extreme relevance for the quality of the results. How was the 100kb padding determined as the best threshold? In the manuscript I see a comparison between 1kb and 100kb, which is a huge difference. Were intermediate distances also tested?

4. The authors describe a “collage” of reference genomes, which I assume correspond to the MetaNISH reference database. It would be important to know which genomes correspond to this reference and the criteria used to choose them. Moreover, the authors mention 44 other Candida species, and 42 other fungal genera, but which species and assemblies? A supplementary table with this information would be important for transparency.

5. Still regarding the reference database, it is not clear to me whether i) the reads are aligned to the combination of all these species, or ii) exclusively to the assemblies of the species of interest. If i), how does MetaNISH deals with multimappings and how does this affect the score? If ii), by accepting a 5% error rate, how does MetaNISH ensures that the reads correspond to C. auris and not to a close-related species? It is important to clarify this in the text.

6. In Figure 1, the authors describe the pipeline used for the whole analysis and MetaNISH is just a small part of it. It would be important to clarify in the text and in the caption of the figure what were the pre-MetaNISH steps. It is not clear to me, for example, why the Docker logo is used without any reference to it in the manuscript. Moreover, as the “Curated Reference Database” is outside the MetaNISH, does this mean that MetaNISH can be run with any reference database created by the user? I think the manuscript would benefit if this Figure was substantially improved, clarifying not only the pre-MetaNISH steps, but also, and more importantly, the MetaNISH workflow and its input/output files.

7. Regarding the benchmarking and comparison with KrakenUniq, what were the settings used for KrakenUniq? Also, how much time does MetaNISH require for the analysis of an SRA entry?

8. Sample SRR11734778 seems to have an outlier behavior when compared to the others (Table 1). Do the authors have an idea of what can be influencing the score in this sample?

9. Table 3 needs more explanation. Does “score” correspond to “genome coverage”? And the numbers correspond to the number of samples? Why are some numbers in red?

10. Regarding the MetaNISH implementation in GridRepublic, does it specifically screen for C. auris? Or for all the species present in the reference database?

Thank you.

Reviewer #2: The authors describe an interesting system for searching for Candida auris using metagenomic data deposited in the NCBI public database. They present their analyses clearly and in detail, and also explain the many limitations of their method. This method looks very promising not only for C. auris, but also for searching for other fungal pathogens in the environment, determining the microbiome and proposing possible hypotheses for the emergence of C. auris.

One question; do you think these results can be modified if you used other strain from different clades to determine the parameters?

Reviewer #3: I find the manuscript very interesting and thorough, and the method they propose appears very sensitive in detecting C.auris infections. I have no major comments about the paper.

Minor comment: I find the description of "padding" confusing, and particularly this: "For each read and each genome, the padding length was scaled down by the number of alignments to the genome by that read. The score was the percentage of the genome covered by padded alignments." I suggest using a formula to make it clearer and more motivated, and possibly a figure to illustrate what the padding achieves.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes: Rahul Siddharthan

**********

PLoS One. 2024 Jan 19;19(1):e0291406. doi: 10.1371/journal.pone.0291406.r002

Author response to Decision Letter 0


15 Dec 2023

Review Comments to the Author

Reviewer #1: The manuscript presented by Mario-Vasquez et al. describing the finding of C. auris in WGS data using “MetaNISH” is very well written and can be divided into two main parts: i) the presentation of a tool that can identify C. auris in metagenomics sequencing data; and ii) the screening of public repositories for the presence of this species in metagenomics data. For this reason, it is of interest for a broad community. Still, I have some comments/concerns that I would like to clarify:

1. The manuscript describes the MetaNISH tool and states its availability through an NCBI ftp address. This is not the most standard way to comply with the FAIR principles. It would be good to have the code of the tool freely available in an open-source repository, such as github or gitlab.

MetaNISH uses SRPRISM that is already published and available on GitHub at https://github.com/ncbi/SRPRISM.

A snapshot of the code for computing score, sequences for reference genomes, and supporting data was created on Zenodo (https://doi.org/10.5281/zenodo.10214980), which is more appropriate for creating a publicly available snapshot of such data.

2. Regarding the behavior of MetaNISH, it was not clear to me why only the first 100bp of a read are aligned to the reference database.

The MetaNISH Design section of Materials and methods (lines 83-92) now reads:

For reads shorter than 100 bases, the full length of the read is aligned. The design of SRPRISM guarantees that the first 100 bases will have at most 5 of the 15 errors allowed. We chose the first 100 bases as the region that must align as the read quality drops beyond that in many Illumina runs, and 100 bp is also long enough to avoid spurious matches. An example of such a read is SRR11734778.40769.1, which is a paired read with each mate of length 251 bases. Alignments for this read are included in the data released at Zenodo.org (https://doi.org/10.5281/zenodo.10214980). It was shown that the first 163 bases of the read were an exact match to C. auris genomes. However, the remaining portion of the read (specifically, the substring from 164 to 251) seems of inferior quality as it did not report a match to anything in the non-redundant (nr) nucleotide database at NCBI as of November 2023.
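The rule described above can be illustrated with a minimal Python sketch. It is not SRPRISM code: the function names are hypothetical, and the error check counts substitutions only, whereas SRPRISM also handles indels. It shows only that the region required to align is the first 100 bases (or the whole read if shorter) and that at most 5 errors are tolerated in that region.

```python
def alignment_region(read: str, region_length: int = 100) -> str:
    """Return the part of a read that must align.

    This is the first `region_length` bases, or the full read
    when the read is shorter than that.
    """
    return read[:region_length]


def within_error_budget(region: str, reference_window: str, max_errors: int = 5) -> bool:
    """Hypothetical substitution-only check (an assumption for illustration).

    Counts mismatching positions between the region and an equally long
    reference window and compares the count with the error budget.
    """
    if len(region) != len(reference_window):
        raise ValueError("region and reference window must have equal length")
    mismatches = sum(a != b for a, b in zip(region, reference_window))
    return mismatches <= max_errors


# A 251-base read contributes only its first 100 bases to the required region.
long_read = "ACGT" * 62 + "ACG"  # 251 bases
print(len(alignment_region(long_read)))
```

For a read such as the 251-base example above, everything past position 100 is outside the region that must align, which is why a low-quality tail does not prevent a match.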

3. The padding strategy applied by the authors is very interesting and seems to be of extreme relevance for the quality of the results. How was the 100kb padding determined as the best threshold? In the manuscript I see a comparison between 1kb and 100kb, which is a huge difference. Were intermediate distances also tested?

The MetaNISH Design section of Materials and methods (lines 109-113) now reads:

Using the reference genomes provided and empirical data from a set of 4,000 SRA runs, we proposed padding of 100 Kb and a score of at least 75 to indicate the presence of the corresponding genome in the read set. The choice is conservative: it can potentially flag a few runs as scoring at least 75 when the genome is not present (false positives) but is unlikely to miss any (false negatives). At the same time, the parameters are not so conservative as to make the false positive rate unacceptable.

The section Benchmarking the reference dataset in the monitoring tool, Results (lines 211-214), now reads:

As reflected in Table 1 and Figure 3, a padding length of 100 Kb was found to be effective in differentiating positive samples like SRR11734782, with a score of 92.74, from negative samples, while MetaNISH scores with less than 100 Kb of padding or no padding, and KrakenUniq coverage values, were not effective.

Table 1 (lines 222-225) has been updated, indicating several padding lengths tested:

Table 1. Benchmark results using B12342 reference genome.

Columns 150 Kb through None are MetaNISH scores with the specified padding; the last column is KrakenUniq coverage × 100.

| Run | Status | Concentration (CFU/mL) | 150 Kb | 100 Kb | 50 Kb | 10 Kb | 1 Kb | None | KrakenUniq coverage × 100 |
|---|---|---|---|---|---|---|---|---|---|
| SRR11734785 | pos | 5.8 × 10^5 | 100 | 100 | 100 | 100 | 99.98 | 99.9 | 82.9 |
| SRR11734772 | pos | 3.5 × 10^5 | 100 | 100 | 100 | 100 | 99.97 | 99.57 | 76.98 |
| SRR11734791 | pos | 2.0 × 10^5 | 100 | 100 | 100 | 100 | 99.96 | 85.92 | 56.38 |
| SRR11734780 | pos | 1.6 × 10^5 | 100 | 100 | 100 | 100 | 99.93 | 97.88 | 71.1 |
| SRR11734775 | pos | 1.2 × 10^5 | 100 | 100 | 100 | 100 | 99.32 | 50.33 | 30.6 |
| SRR11734777 | pos | 8.6 × 10^4 | 100 | 100 | 100 | 99.98 | 89.8 | 30.07 | 17.36 |
| SRR11734778 | pos | 7.1 × 10^4 | 100 | 100 | 100 | 98.8 | 39.97 | 7.42 | 3.73 |
| SRR11734781 | pos | 6.7 × 10^4 | 100 | 100 | 100 | 100 | 89.09 | 24.38 | 13.52 |
| SRR11734776 | pos | 4.3 × 10^4 | 100 | 100 | 100 | 99.98 | 66.17 | 13.89 | 9.003 |
| SRR11734779 | pos | 2.7 × 10^4 | 100 | 100 | 100 | 99.99 | 68.22 | 13.6 | 7.575 |
| SRR11734773 | pos | 1.1 × 10^4 | 100 | 100 | 100 | 99.99 | 93.81 | 29.67 | 16.4 |
| SRR11734774 | pos | 1.1 × 10^4 | 100 | 100 | 99.8 | 71.75 | 14.28 | 2.36 | 1.756 |
| SRR11734783 | pos | 1.0 × 10^4 | 100 | 100 | 100 | 99.74 | 48.4 | 8 | 4.61 |
| SRR11734784 | pos | 2.9 × 10^3 | 100 | 100 | 100 | 99.99 | 67.56 | 13.65 | 8.118 |
| SRR11734782 | pos | 1.9 × 10^3 | 96.72 | 92.74 | 74.01 | 23.71 | 3.04 | 0.43 | 0.8356 |
| SRR11734790 | neg | NA | 85.08 | 79.43 | 53.74 | 14.74 | 1.84 | 0.25 | 0.9267 |
| SRR11734789 | neg | NA | 65.16 | 57.18 | 34.14 | 8.43 | 0.99 | 0.13 | 0.4599 |
| SRR11734787 | neg | NA | 50.28 | 41.88 | 22.96 | 5.29 | 0.68 | 0.11 | 0.1616 |
| SRR11734788 | neg | NA | 46.98 | 39.93 | 22.75 | 5.4 | 0.63 | 0.08 | 0.8494 |
| SRR11734786 | mock | NA | 3.56 | 2.88 | 1.47 | 0.57 | 0.14 | 0.04 | 0.4101 |

Pos: positive for C. auris; neg: negative for C. auris; mock: pooled skin swab samples negative for C. auris. Concentration is given in CFU/mL.
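As a small illustration of how the proposed cutoff would be applied, the Python sketch below flags runs whose score at 100 Kb padding is at least 75. The scores are copied from the lowest-concentration positive run and the negative and mock runs in Table 1; the function and variable names are hypothetical and not part of MetaNISH.

```python
SCORE_THRESHOLD = 75.0  # proposed cutoff for "genome present"

# (status, MetaNISH score at 100 Kb padding), values taken from Table 1
benchmark_100kb = {
    "SRR11734782": ("pos", 92.74),
    "SRR11734790": ("neg", 79.43),
    "SRR11734789": ("neg", 57.18),
    "SRR11734787": ("neg", 41.88),
    "SRR11734788": ("neg", 39.93),
    "SRR11734786": ("mock", 2.88),
}


def flag_runs(scores, threshold=SCORE_THRESHOLD):
    """Return the sorted run IDs whose score meets or exceeds the threshold."""
    return sorted(run for run, (_, score) in scores.items() if score >= threshold)


flagged = flag_runs(benchmark_100kb)
print(flagged)  # ['SRR11734782', 'SRR11734790']
```

With these table values, the lowest-concentration positive run is flagged, and one run labeled negative (SRR11734790, score 79.43) also crosses the cutoff, which is the kind of occasional false positive that the conservative parameter choice accepts.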

4. The authors describe a “collage” of reference genomes, which I assume correspond to the MetaNISH reference database. It would be important to know which genomes correspond to this reference and the criteria used to choose them. Moreover, the authors mention 44 other Candida species, and 42 other fungal genera, but which species and assemblies? A supplementary table with this information would be important for transparency.

Following the reviewer’s suggestion, the supplementary table S1 was created. It contains information about the assemblies used in our reference database.

In addition to the comprehensive list of C. auris genomes, with which we tried to cover the diversity of clades known up to that time, we included reference genomes mainly for pathogenic fungi of public health interest. Given the knowledge at that time, we would not necessarily expect to find these fungi outside hospital environments; as with C. auris, we wanted to see whether they could have other sources of origin or dispersion.

Additional text has been added at lines 137-139 of the Benchmark Development section, Materials and methods, relating Table S1 to the main manuscript.

5. Still regarding the reference database, it is not clear to me whether i) the reads are aligned to the combination of all these species, or ii) exclusively to the assemblies of the species of interest. If i), how does MetaNISH deals with multimappings and how does this affect the score? If ii), by accepting a 5% error rate, how does MetaNISH ensures that the reads correspond to C. auris and not to a close-related species? It is important to clarify this in the text.

Reads are aligned to each assembly in the reference database, and then the alignments for each SRA run are ranked according to the percentage of padded coverage. This is stated in the manuscript as follows:

- Lines 75-78: The pipeline consists of two steps: (I) the alignment step using SRPRISM (18) that aligns reads to all reference genomes as a single database, and (II) the score computation step, which increases the score for samples with reads aligned across the genome compared to samples with reads aligned to a small section of the genome.

- Lines 79-80: Alignment is performed with SRPRISM as it guarantees the reporting of all equally good alignments (max 255) across all sequences in the database.

- Lines 159-161: The alignment and scoring were done as per the MetaNISH design described earlier. The scores for all 100 reference genomes for each SRA ID are reported by the pipeline.

- Lines 191-192: SRA reads were aligned to the set of reference genomes, and a score for each reference genome using padded coverage was obtained for SRA ids from January 2010 to November 2022, retrospectively.

A new figure was created to address the reviewer's concerns about multimappings (now the new Figure 1), and Lines 96-104 now read:

A scaling factor was applied to adjust the padding length for each read and genome based on the number of alignments, which ensures that multiple mappings of a read to a genome do not exceed 100 Kb for each location. The score was the percentage of the genome covered by padded alignments. For example, read SRR11734778.40769.1 aligned to four C. auris genomes with only one location in each genome. Therefore, full padding of 100 Kb was used for each of the four alignments. However, read SRR11734778.1429600.1 aligned to four contigs on three genomes, as shown in Figure 1. Figures 1A and 1B show four alignments on two genomes, which reduced the padding to 25 Kb. Figure 1C shows two alignments on the third genome, which reduced the padding to 50 Kb. Padded coverage cannot extend beyond contig boundaries.
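The scoring rule in the passage above can be sketched in Python as follows. This is an illustrative reading, not the released rank-calculation software: the function name and input layout are hypothetical, and it assumes that the scaled padding is added on both sides of each alignment, which the text does not state explicitly. Padding is 100 Kb divided by the number of alignments the read has on that genome, padded intervals are clipped at contig boundaries, and the score is the percentage of the genome covered by the merged intervals.

```python
from collections import defaultdict


def padded_coverage_score(alignments, contig_lengths, padding=100_000):
    """Percentage of one genome covered by padded alignments.

    alignments: list of (read_id, contig, start, end), 0-based half-open,
                all on contigs of the same genome.
    contig_lengths: dict mapping contig name to its length.
    Assumption: padding is applied on each side of every alignment.
    """
    # Number of alignments each read has on this genome.
    per_read = defaultdict(int)
    for read_id, _, _, _ in alignments:
        per_read[read_id] += 1

    # Pad each alignment by padding / (alignments of that read), clipped to the contig.
    intervals = defaultdict(list)
    for read_id, contig, start, end in alignments:
        pad = padding // per_read[read_id]
        lo = max(0, start - pad)
        hi = min(contig_lengths[contig], end + pad)
        intervals[contig].append((lo, hi))

    # Merge overlapping intervals per contig and sum the covered bases.
    covered = 0
    for spans in intervals.values():
        spans.sort()
        cur_lo, cur_hi = spans[0]
        for lo, hi in spans[1:]:
            if lo <= cur_hi:
                cur_hi = max(cur_hi, hi)
            else:
                covered += cur_hi - cur_lo
                cur_lo, cur_hi = lo, hi
        covered += cur_hi - cur_lo

    return 100.0 * covered / sum(contig_lengths.values())


# One read with two alignments on a 1 Mb contig: each gets 50 Kb of padding.
score = padded_coverage_score(
    [("r2", "c1", 100_000, 100_100), ("r2", "c1", 900_000, 900_100)],
    {"c1": 1_000_000},
)
print(round(score, 2))
```

Dividing the padding by the per-read alignment count keeps a repetitive read from inflating coverage, while reads spread across the genome still raise the score quickly.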

6. In Figure 1, the authors describe the pipeline used for the whole analysis and MetaNISH is just a small part of it. It would be important to clarify in the text and in the caption of the figure what were the pre-MetaNISH steps. It is not clear to me, for example, why the Docker logo is used without any reference to it in the manuscript. Moreover, as the “Curated Reference Database” is outside the MetaNISH, does this mean that MetaNISH can be run with any reference database created by the user? I think the manuscript would benefit if this Figure was substantially improved, clarifying not only the pre-MetaNISH steps, but also, and more importantly, the MetaNISH workflow and its input/output files.

Figure 2 (formerly Figure 1, addressed in the reviewer’s question) has been refined following the reviewer's suggestion, and more details have been added to make it self-explanatory. Additional text has been added to the caption of the figure to clarify the whole process. Docker was used to create a container to query data from the SRA db cloud, as indicated in Figure 2.

MetaNISH can be run with any reference database created by the user. This is stated in the manuscript as follows:

- Lines 125-130: The design of MetaNISH can be used for tracking any pathogen. It requires developing a reference set with representative genomes from all clades for the pathogen to be tracked and for nearby species. Doing so allows SRPRISM to find the best matches for each read among the genomes where a match can be expected. Then, empirical analysis is needed to find suitable parameters for padding and score threshold. If read properties change substantially over time, the alignment method and parameters may need to be revisited.

- Lines 327-328: Future efforts can adapt this tool to monitor other emerging pathogens and public health threats.

7. Regarding the benchmarking and comparison with KrakenUniq, what were the settings used for KrakenUniq? Also, how much time does MetaNISH require for the analysis of an SRA entry?

Settings used for KrakenUniq are stated in Lines 181-184:

These scores were compared to KrakenUniq (24), a method for metagenomic classification that provides a quantitative measure of genome coverage. KrakenUniq was run with defaults, except no information was printed for unclassified sequences using parameter --only-classified-output.
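As an illustration of the setting described above, the following Python sketch builds (but does not run) a KrakenUniq command line with default settings plus `--only-classified-output`. The authors' exact command is not given in the text, so everything except that one flag is an assumption: the database directory and file names are placeholders, and `--db`, `--report-file`, and `--output` are standard KrakenUniq options used here only to make the command complete.

```python
import shlex


def krakenuniq_command(db_dir, reads_file, report_path, output_path):
    """Build a KrakenUniq command with defaults plus --only-classified-output.

    All paths are hypothetical placeholders; nothing is executed here.
    """
    return [
        "krakenuniq",
        "--db", db_dir,                # placeholder database directory
        "--report-file", report_path,  # per-taxon report
        "--output", output_path,       # per-read classifications
        "--only-classified-output",    # the one non-default setting in the text
        reads_file,
    ]


cmd = krakenuniq_command("krakenuniq_db", "reads.fastq",
                         "report.tsv", "classified.tsv")
print(shlex.join(cmd))
```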

The time MetaNISH requires for an analysis is described in Lines 119-124:

The CPU time for the 4,000 runs used for determining the parameters varied from 25 seconds (for SRR7125652) to 17 hours 14 minutes (for SRR8550535) with a median time of 25 minutes. SRR7125652 has 86,554 paired reads, while SRR8550535 has over 418 million paired reads. We noted that SRPRISM, which takes almost all the time in the MetaNISH pipeline (as score calculation takes only a couple of seconds), can be run in multi-threaded mode with good scaling up to eight threads, but we did not use that option for results reported here.

8. Sample SRR11734778 seems to have an outlier behavior when compared to the others (Table 1). Do the authors have an idea of what can be influencing the score in this sample?

Additional columns for padding added to Table 1 show that SRR11734778 is not an outlier, even with a padding of 10 Kb. Figure 1 also illustrates the analysis of a read from this read set.

9. Table 3 needs more explanation. Does “score” correspond to “genome coverage”? And the numbers correspond to the number of samples? Why are some numbers in red?

As stated in lines 98-99 of the MetaNISH Design section:

The score was the percentage of the genome covered by padded alignments.

So, the score corresponds to the percentage of padded genome coverage. The word ‘score’ has been added to the table title text referring to the above.

The numbers in the table correspond to the number of SRA runs (or samples); the word ‘samples’ has also been added to the table title. The numbers in red bold are samples whose metadata are shown in Table 4; this information has been added to the table's footnotes.

Table 3, lines 249-253, has been updated:

Table 3. Binned padded genome coverage (score) of SRA runs (samples) with Candida auris hits.

        Score ranges
Year    [50-75)   [75-85)   [85-95)   [95-100)   100*
2010    0         0         0         0          0
2011    0         0         0         0          0
2012    0         0         0         0          0
2013    0         0         0         0          0
2014    0         0         0         0          0
2015    0         0         0         0          0
2016    5         0         0         0          0
2017    0         0         0         0          0
2018    0         0         0         0          0
2019    1         0         2         4          6
2020    6         3         5         9          14
2021    0         0         0         0          0
2022    1         0         0         0          0
Total   13        3         7         13         20

Numbers in red bold correspond to the samples where C. auris was identified and whose metadata are shown in Table 4.

*This is not an interval; it equals the number of runs with a score of 100. For all other columns, the interval is closed at the beginning and open at the end.
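The interval convention in the footnote (closed at the beginning, open at the end, with 100 counted on its own) can be made concrete with a short Python sketch. The function name and the handling of scores below 50, which fall outside the table, are assumptions made for illustration; the bin edges and labels are those of Table 3.

```python
from bisect import bisect_right

# Bin edges and labels as they appear in the header of Table 3.
EDGES = [50, 75, 85, 95, 100]
LABELS = ["[50-75)", "[75-85)", "[85-95)", "[95-100)", "100"]


def score_bin(score):
    """Return the Table 3 column for a score, or None if it is outside the table."""
    if score < 50 or score > 100:
        return None
    # bisect_right places a score equal to an edge into the bin that
    # starts at that edge, which gives closed-open intervals.
    return LABELS[bisect_right(EDGES, score) - 1]


for s in (50, 74.9, 75, 99.9, 100):
    print(s, score_bin(s))
```

A score of exactly 75 therefore lands in [75-85) rather than [50-75), and only a score of exactly 100 lands in the last column.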

10. Regarding the MetaNISH implementation in GridRepublic, does it specifically screen for C. auris? Or for all the species present in the reference database?

Thank you.

While this implementation could screen for any of the reference genomes in the database, a benchmark must first be developed to determine the appropriate set of reference genomes from all clades for each pathogen to be monitored. For this reason, we can only guarantee results for C. auris.

Reviewer #2: The authors describe an interesting system for searching for Candida auris using metagenomic data deposited in the NCBI public database. They present their analyses clearly and in detail, and also explain the many limitations of their method. This method looks very promising not only for C. auris, but also for searching for other fungal pathogens in the environment, determining the microbiome and proposing possible hypotheses for the emergence of C. auris.

One question: do you think these results would change if you used other strains from different clades to determine the parameters?

As stated in Lines 125-130, if the reference database covers all the known representative genomic diversity of a pathogen, using a different strain is not expected to significantly change the parameters to be determined, unless that strain belongs to a different species.

Reviewer #3: I find the manuscript very interesting and thorough, and the method they propose appears very sensitive in detecting C. auris infections. I have no major comments about the paper.

Minor comment: I find the description of "padding" confusing, and particularly this: "For each read and each genome, the padding length was scaled down by the number of alignments to the genome by that read. The score was the percentage of the genome covered by padded alignments." I suggest using a formula to make it clearer and more motivated, and possibly a figure to illustrate what the padding achieves.

A new figure was created to address the reviewer's concerns (now the new Figure 1), and new text has been added. Lines 96-104 now read:

A scaling factor was applied to adjust the padding length for each read and genome based on the number of alignments, so this ensures that multiple mappings of a read to a genome do not exceed 100 Kb for each location. The score was the percentage of the genome covered by padded alignments. For example, read SRR11734778.40769.1 aligned to four C. auris genomes with only one location in each genome. Therefore, full padding of 100 Kb was used for each of the four alignments. However, read SRR11734778.1429600.1 aligned to four contigs on three genomes, as shown in Figure 1. Figures 1A and 1B show four alignments on two genomes, which reduced the padding to 25 Kb. Figure 1C shows two alignments on the third genome, which reduced the padding to 50 Kb. Padded coverage cannot extend beyond contig boundaries.
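One way to write the quoted description as a formula is given below. It is a hedged reading of the text, not a formula taken from the manuscript: the symbols are introduced here for illustration, and padding each alignment symmetrically on both sides is an assumption. Here $P = 100$ Kb is the full padding, $n(r,g)$ is the number of alignments of read $r$ to genome $g$, $[s_a, e_a)$ is the aligned span of alignment $a$ on a contig of length $L_{c(a)}$, and $A_g$ is the set of all alignments to $g$.

```latex
% Padding for read r on genome g, scaled by its number of alignments
p(r, g) = \frac{P}{n(r, g)}, \qquad P = 100\ \text{Kb}

% Padded interval of alignment a of read r, clipped to contig boundaries
I(a) = \bigl[\, \max(0,\ s_a - p(r, g)),\ \min(L_{c(a)},\ e_a + p(r, g)) \,\bigr)

% Score: percentage of the genome covered by the union of padded intervals
\mathrm{score}(g) = 100 \times
  \frac{\bigl|\bigcup_{a \in A_g} I(a)\bigr|}{\sum_{c \in g} L_c}
```

For $n(r,g) = 1, 2, 4$ this gives $p = 100$ Kb, $50$ Kb, and $25$ Kb, which agrees with the padding values in the quoted example.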

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Ricardo Santos

5 Jan 2024

Finding Candida auris in public metagenomic repositories

PONE-D-23-27761R1

Dear Dr. Chow,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Ricardo Santos

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I thank the authors for carefully addressing all my comments. Some minor comments on this new version are:

1. Figure 1 should be provided as a single figure with 3 panels, and not as 3 independent figures. The same applies to Figure 3.

2. In Figure 2, the box in step 4 is out of format, and the species name in step 6 should be in italics.

3. In Figure 2, is an arrow connecting step 1 and step 3 missing? Or is the first arrow in step 2 pointing in the wrong direction?

4. Table 3, numbers in red bold were clarified, but numbers in bold (not red) are not clarified.

Reviewer #3: I am satisfied with the responses of the authors to my and the other reviewer's comments. I have no further comments or questions about the paper.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: Yes: Rahul Siddharthan

**********

Acceptance letter

Ricardo Santos

11 Jan 2024

PONE-D-23-27761R1

PLOS ONE

Dear Dr. Chow,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Ricardo Santos

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Reference genomes.

    (DOCX)

    Data Availability Statement

    SRPRISM is available at https://github.com/ncbi/SRPRISM. Software for rank calculation, sequence data for reference genomes, and a sample run have been deposited at Zenodo.org and are available at https://doi.org/10.5281/zenodo.10214980. The benchmark dataset is available at NCBI under BioProject PRJNA631031. All additional data analyzed in the manuscript is already publicly available in SRA at NCBI (https://www.ncbi.nlm.nih.gov/sra).


    Articles from PLOS ONE are provided here courtesy of PLOS
