Skip to main content
Briefings in Bioinformatics logoLink to Briefings in Bioinformatics
. 2024 Mar 14;25(2):bbae085. doi: 10.1093/bib/bbae085

metaProbiotics: a tool for mining probiotic from metagenomic binning data based on a language model

Shufang Wu 1, Tao Feng 2, Waijiao Tang 3, Cancan Qi 4, Jie Gao 5,6, Xiaolong He 7, Jiaxuan Wang 8, Hongwei Zhou 9,, Zhencheng Fang 10,
PMCID: PMC10940841  PMID: 38487846

Abstract

Beneficial bacteria remain largely unexplored. Lacking systematic methods, understanding probiotic community traits becomes challenging, leading to various conclusions about their probiotic effects among different publications. We developed language model–based metaProbiotics to rapidly detect probiotic bins from metagenomes, demonstrating superior performance in simulated benchmark datasets. Testing on gut metagenomes from probiotic-treated individuals, it revealed the probioticity of intervention strains–derived bins and other probiotic-associated bins beyond the training data, such as a plasmid-like bin. Analyses of these bins revealed various probiotic mechanisms and bai operon as probiotic Ruminococcaceae’s potential marker. In different health–disease cohorts, these bins were more common in healthy individuals, signifying their probiotic role, but relevant health predictions based on the abundance profiles of these bins faced cross-disease challenges. To better understand the heterogeneous nature of probiotics, we used metaProbiotics to construct a comprehensive probiotic genome set from global gut metagenomic data. Module analysis of this set shows that diseased individuals often lack certain probiotic gene modules, with significant variation of the missing modules across different diseases. Additionally, different gene modules on the same probiotic have heterogeneous effects on various diseases. We thus believe that gene function integrity of the probiotic community is more crucial in maintaining gut homeostasis than merely increasing specific gene abundance, and adding probiotics indiscriminately might not boost health. We expect that the innovative language model–based metaProbiotics tool will promote novel probiotic discovery using large-scale metagenomic data and facilitate systematic research on bacterial probiotic effects. The metaProbiotics program can be freely downloaded at https://github.com/zhenchengfang/metaProbiotics.

Keywords: probiotics, metagenome, language model, machine learning

INTRODUCTION

As defined by the World Health Organization (WHO), probiotics are live microorganisms that confer a health benefit on the host when administered in adequate amounts [1]. Some relatively well-studied bacteria have shown multiple distinct probiotic properties, such as protecting intestinal barriers [2], immunomodulation [3], regulating the gut microbiota [4], metabolic regulation [5] and alleviating mental illness through the gut–brain axis [6]. Owing to the health benefits to the host, an enormous effort has been made to discover novel probiotics and understand their probiotic mechanisms in recent years, and the probiotics market is expected to reach $85.4 billion by 2027 [7].

However, determining whether a microorganism is a probiotic may require a comprehensive assessment based on multiple complex factors and lengthy clinical trials. Currently, many molecular mechanisms of probiotics remain unknown. In particular, many intestinal commensal bacteria are potential candidates for next-generation probiotics (NGPs), but challenges in their cultivation often hinder physiological studies [8]. Clinically, selecting and verifying a probiotic strain typically involves a time-consuming randomized controlled trial (RCT) [9], and only certain common genera, such as Lactobacillus (which has been reclassified into 25 genera, namely, the emended genus Lactobacillus, Paralactobacillus and 23 novel genera [10]) and Bifidobacterium, can be used to perform large-scale RCTs due to safety concerns. Perplexingly, conflicting results regarding the probiotic effects of certain bacteria often emerge in different studies and reports [11, 12]. Due to the multifaceted interactions between bacteria and their hosts, without a systematic approach for characterizing probiotics within microbial communities, it is challenging to elucidate their specific roles in maintaining host health.

With the accumulation of vast amounts of metagenomic data, the machine learning approach makes it possible for systematic, large-scale probiotic mining using deoxyribonucleic acid (DNA) data, bypassing the above difficulties. Such techniques can derive decision rules from labeled datasets, enabling effective assessments of new data. Their recent application in microbial genomic annotation tasks [13, 14] and the increasing amount of genomic data on known probiotics [15, 16] highlight the potential of such approaches in the field of probiotic mining.

Sun et al. introduced such a related tool, iProbiotics, which utilizes k-mer frequencies to characterize complete bacterial genomes and employs the support vector machine algorithm for probiotic identification [17]. While iProbiotics has been validated using complete bacterial genomes, its efficacy in draft genomes from metagenomes remains uncertain. Despite the application of the k-mer frequency model in various bioinformatics tasks [18], it primarily captures oligonucleotide occurrence frequencies and may not reflect sequence function. The limitations of k-mers include redundancy in conserved regions [19–22], a lack of specificity [23, 24] and high-dimensional feature spaces with high k values [18]. Metagenomic sequences are often fragmented due to poor assembly, which can lead to fluctuating k-mer frequencies. Recently, natural language processing (NLP) techniques have offered innovative biological sequence representation methods. In such models, oligonucleotides or oligo-amino acids are treated as ‘words’ and DNA or protein sequences as ‘sentences’. Through unsupervised pretraining on large datasets, each word is mapped to a context-based feature vector, potentially offering more informative representations than k-mer frequencies. This approach has shown success in tasks such as antimicrobial peptide prediction [25], secreted effector identification [26], microbial gene function interpretation [27] and human-virus–protein interaction prediction [28].

In this study, we developed the metaProbiotics tool, which digitally characterizes DNA sequences in metagenomic bins using word vectors and employs random forests for the identification of bins originating from probiotics. Upon assembling and binning the raw metagenomic data using sequence assembly and binning tools, users can then utilize metaProbiotics to identify probiotics from the binned data. Our aim was to overcome the limitations of traditional methods, such as wet-lab experiments and clinical trials, used for probiotic discovery. This approach offers biologists an efficient, systematic method for characterizing probiotic communities. The metaProbiotics program can be freely downloaded at https://github.com/zhenchengfang/metaProbiotics.

RESULTS

The unsupervised pretrained word vector for sequence representation

The construction of metaProbiotics primarily involved two steps, as illustrated in Figure 1. During unsupervised pretraining, we employed the skip-gram algorithm, using 120 reference prokaryotic genomes from the National Center for Biotechnology Information (NCBI) (https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prok_reference_genomes.txt) as the ‘corpus’. Based on the contextual information of nucleotide sequences in DNA, each 8-nucleotide combination was mapped to a word vector with a dimension of 100. For the given DNA sequence(s) from a complete genome, metagenomic fragment or set of fragments from a metagenomic bin, the average word vector of all 8-mers was used to represent the sequence(s).

Figure 1.

Figure 1

The framework of metaProbiotics. The development of metaProbiotics involved two main steps. The unsupervised pretraining step used the skip-gram algorithm to convert each 8-mer into a word vector language model. Taking the word vector as the mathematical model for the DNA sequences, the supervised training step employed the experimentally validated probiotic and non-probiotic genomic data to construct a random forest for the identification of metagenomic bins of probiotics.

Using the word vector model, we constructed feature vectors for 237 experimentally validated probiotic genomes from 69 species and 470 non-probiotic genomes from 105 species collected by Sun et al. [17] (link: http://bioinfor.imu.edu.cn/iprobiotics/public/Download, hereafter referred to as the Sun dataset).

The traditional k-mer model solely quantifies the frequency of various nucleotide strings in DNA, failing to capture the biochemical properties inherent in different nucleotide combinations. In contrast, the word vector model, by analyzing the context of nucleotide strings across diverse sequences, maps contextually similar nucleotide strings to similar spatial coordinates. This spatial representation enables the biochemical characteristics of nucleotide sequences to be more effectively reflected, thereby enhancing the characterization of a species’ biological functions [29]. To compare the advantages of the word vector model with the traditional k-mer model, we performed t-distributed Stochastic Neighbor Embedding (tSNE) dimensionality reduction and visual analysis on both the 8-mer’s word vector and 8-mer models for probiotics and non-probiotics, focusing on their spatial distribution characteristics (Figure S1A and C). We found that both models displayed distinct clustering characteristics for probiotics, with approximately four prominent probiotic clusters in the tSNE plots of both the 8-mer and word vectors. However, in the 8-mer tSNE plot, one probiotic cluster was surrounded by non-probiotics and tended to merge with nearby non-probiotics after k-means clustering (Figure S1C and D), indicating that the 8-mer model does not perfectly separate probiotics in the coordinate space. By contrast, in the word vector tSNE plot, probiotic clusters were not significantly surrounded by non-probiotics and formed relatively independent groups after k-means clustering (Figure S1A and B). We further assessed clustering efficacy using silhouette coefficients, independent of label information, and mutual information, dependent on label information. The silhouette coefficients for each clustered point in the word vector model were significantly higher than those in the 8-mer model (Figure S1E, P = 9.42e-19***, rank sum test), and the mutual information for the word vector model clustering also surpassed the 8-mer model (Figure S1F), demonstrating the superiority of the word vector model in characterizing the function of probiotics.

We employed cosine similarity from NLP to assess whether word vectors can highlight essential attributes of probiotics. As part of the preclinical evaluation, probiotics should exhibit robust survival and colonization capabilities, particularly within the gastrointestinal tract [30]. We found that the average cosine similarity between the word vectors of probiotics and those of genes associated with bacterial colonization ability (specifically clpE and dps for bile resistance; bsh for bile salt hydrolase; gadC for acid resistance; msrB for heavy metal exclusion; clpC for persistence; copA and met for competitiveness; ispA for adherence; treC for growth; and lsp for adsorption [17, 31]) was significantly greater than the similarity between these genes and the word vectors of non-probiotics (Figure 2A).

Figure 2.

Figure 2

Utilizing cosine similarity to demonstrate the characterization ability of the word vector model for probiotic properties (A, B) and comparing the performance of metaProbiotics and iProbiotics using simulated metagenomic bin benchmark datasets (CH). (A) The average cosine similarity of word vectors between the bacterial genome and genes related to survival and colonization ability, including clpE, dps, bsh, gadC, msrB, clpC, copA, met, ispA, treC and lsp. The probiotics showed significantly higher cosine similarity than the non-probiotics (P = 0.0029**, rank sum test). (B) The average cosine similarity of word vectors between the bacterial genome and genes from each category of the CAZy database. The probiotics showed significantly higher cosine similarity than the non-probiotics (P = 0.0313*, paired rank sum test). (C–H) Performance comparison between metaProbiotics and iProbiotics using the test set bins with different completeness; the (C) Youden index, (D) accuracy, (E) MCC, (F) recall, (G) precision and (H) F1-score were used for the comprehensive evaluation.

Another pivotal attribute of probiotic microorganisms is their carbohydrate metabolism capacity, including the conversion of specific carbohydrates into propionic, lactic or acetic acid [32–34]. Genes were sourced from the Carbohydrate-Active enZYmes (CAZy) database [35], and the mean word vector for each CAZy class was determined, including Auxiliary Activities (AA), Carbohydrate-Binding Modules (CBM), Carbohydrate Esterases (CE), Glycoside Hydrolases (GH), GlycosylTransferases (GT) and Polysaccharide Lyases (PL). Subsequently, the cosine similarity between probiotics and these CAZy classes as well as between non-probiotics and these classes was computed. Notably, the average cosine similarity of probiotics to each CAZy class was significantly increased compared to that of non-probiotics (Figure 2B).

On the whole, the unsupervised word vector model effectively distinguishes between probiotics and non-probiotics and can highlight related attributes of probiotics.

Benchmarking the supervised random forest for probiotic mining

In the supervised training phase, a random forest model for probiotic mining was developed using the word vector model. The Sun dataset genomes [17] (link: http://bioinfor.imu.edu.cn/iprobiotics/public/Download) were partitioned into training and test sets at a 9:1 ratio (Table S1). For benchmarking, DNA fragments were randomly extracted from the complete genomes of the Sun dataset to form simulated metagenomic bins. Given the challenges of full genome reconstruction from metagenomes, bins with completeness ranging from 5% to 100% were created, with sequence lengths uniformly distributed between 2000 and 20 000 bp. These bins were mapped to pretrained word vectors. Following 10-fold cross-validation on the training set, the random forest was chosen as the classifier due to its superior accuracy and area under the curve (AUC) among other related machine learning algorithms (Figures S2 and S3).

After the supervised training of the random forest on the training set bins, we evaluated the performance of metaProbiotics against iProbiotics using the benchmark test set bins. The evaluation metrics included the Youden index, accuracy, Matthews correlation coefficient (MCC), recall, precision and F1-score. Across all bin completeness levels, metaProbiotics consistently outperformed iProbiotics (Figure 2 C–H). While the performance of metaProbiotics slightly decreased at 5% completeness, it remained robust otherwise. iProbiotics struggled with fragmented metagenomic data, particularly in the Youden index and MCC, both of which are sensitive to prediction bias. Notably, iProbiotics often misclassified sequences as negative in fragmented data, leading to reduced metric values. Using the Lactobacillus genus as an example, we assessed the ability of metaProbiotics to predict probiotics across species within the same genus. The evaluation of Lactobacillus bins from the test set revealed an impressive AUC of 94.31% (Figure S4), suggesting that metaProbiotics maintains a robust resolution at the species level.

We further collected draft genomes for Lactobacillus and Bifidobacterium strains sequenced and submitted to NCBI, including Lactobacillus paracasei (recently reclassified as Lacticaseibacillus paracasei) and Bifidobacterium breve, which are typically considered as probiotics, and L. iners, which is typically not considered as probiotic. We found that for draft genomes from L. iners, both metaProbiotics and iProbiotics predicted them as non-probiotics. For species commonly classified as probiotics, metaProbiotics was largely able to predict these draft genomes as probiotics, whereas iProbiotics only identified a few high-quality Lc. paracasei draft genomes (average contig length > 20 000 bp, N50 > 60 000 bp) as probiotics (Tables S2S4). This suggests that metaProbiotics has a relatively lower requirement for genome completeness, making it more suitable for mining probiotics in complex metagenomic data. Additionally, through literature research, we collected genomes of six recently isolated and sequenced Bifidobacterium strains reported to have probiotic effects [36–38] (see Table S5 for the accessions). We found that metaProbiotics successfully predicted all of them as probiotics, further demonstrating its potential in identifying novel probiotic strains.

These results indicate that metaProbiotics excels in analyzing fragmented and low-integrity metagenomic data, making it appropriate for probiotic mining, particularly for identifying novel NGPs that are challenging to cultivate from metagenomes.

Application case for testing metaProbiotics with real gut metagenome

Due to the lack of data labels, real metagenomic data are often challenging to use for evaluating tools designed for metagenomic data analysis [39]. To better demonstrate the performance of metaProbiotics in real-world scenarios, we employed gut metagenomes from cohorts subjected to probiotic interventions, in which bins originating from the intervention strains were utilized as labeled data for the quantitative assessment of metaProbiotics.

We employed sample data from two cohorts (see Table S6 for the sample accessions): one was treated with the probiotic strain Lc. casei Zhang [40] (a total of 30 samples that were sampled longitudinally, referred to as the LcZ group), and the other was treated with B. longum AH1206 [41] (a total of 22 samples from the treatment subgroup, referred to as the BlA group). Sequencing data from both cohorts were subjected to sequence assembly and binning. Ultimately, we obtained 209 bins in the LcZ group and 728 bins in the BlA group, with a notable number of bins exhibiting low completeness (Figures S5 and S6). By calculating the average nucleotide identity (ANI) with genomes in the Culturable Genome Reference (CGR) [42], we annotated the taxonomic units of bins identified by metaProbiotics as originating from probiotics. In both groups, a substantial number of bins could not be assigned to specific genera (Figure 3A), suggesting that a significant portion of the probiotics identified by metaProbiotics either had an unknown taxonomic classification or originated from unculturable genera. This underscores the potential of metaProbiotics to reveal novel probiotics.

Figure 3.

Figure 3

Probiotic mining from healthy human gut samples derived from two probiotic intervention cohorts (LcZ group and BlA group). (A) The proportion of probiotic bins identified by metaProbiotics that can be associated at the genus level using the gut reference bacterial genome database (CGR). (B) Venn diagram of the probiotic bins identified by the metaProbiotics and iProbiotics tools in the LcZ group. The yellow star (LcZ_bin90) indicates a bin with an ANI higher than 98% against the intervention strains, and the gray star (LcZ_bin44) indicates a plasmid-like bin whose abundance is highly correlated with the intervention strain. (C) Venn diagram for the LcZ group. We adjusted the decision threshold to allow metaProbiotics to identify the same number of probiotics as iProbiotics to perform a more quantitative comparison. (D) The correlation of abundance between LcZ_bin90 and LcZ_bin44 in samples from different sampling time in the LcZ group. The Spearman correlation index was used. (E) The plasmid likelihood score was calculated by PPR-Meta [39] for each contig in LcZ_bin90 and LcZ_bin44. (F) Similar to (B), the Venn diagram for the BlA group, yellow stars refer to bins with ANI higher than 98% against the intervention strains. (G) Similar to (C), the Venn diagram for the BlA group after adjusting the decision threshold of metaProbiotics.

We further observed that in both the LcZ and BlA groups, the overlap between probiotics predicted by metaProbiotics and iProbiotics was notably limited. Bins with an ANI greater than 98% to either the Lc. casei Zhang or B. longum AH1206 genome were considered to have originated from these two strains. Remarkably, metaProbiotics accurately revealed the probiotic property of all these bins. In contrast, iProbiotics struggled to recognize these bins, as evident from the Venn diagram, in which these bins were primarily predicted by metaProbiotics (Figure 3B and F). Given that metaProbiotics predicted more bins than iProbiotics, we adjusted the decision threshold of metaProbiotics to match the number of positive predictions for iProbiotics, ensuring that its broader coverage was not due to leniency. Even so, metaProbiotics still identified all bins closely resembling the intervention bacteria (Figure 3C and G), indicating that the discrepancy in prediction counts between metaProbiotics and iProbiotics was due to the low sensitivity of iProbiotics, rather than an elevated false-positive rate of metaProbiotics. In analyzing the six bins from the BlA group associated with the intervention probiotic strain, we further assessed the quality of these bins and the predictive outcomes of two tools. We found that iProbiotics is particularly sensitive to the length of contigs within a bin and the bin’s overall completeness, identifying only those bins with N50 > 40 000 bp and Completeness >70% as probiotics (Table S7). On the other hand, metaProbiotics demonstrated the ability to identify all bins as probiotics, including those with lower quality, indicating a more robust performance.

In the LcZ group, a bin (bin ID: LcZ_bin44) whose abundance was strictly synchronized with that of the Lc. casei Zhang-derived bin (bin ID: LcZ_bin90) was predicted as a probiotic by metaProbiotics (Figure 3B and D). Although its taxonomy was unassigned, it encodes the Lactobacillaceae-derived prtP gene, a secreted protease that breaks down milk proteins, suggesting a synergistic prebiotic function with Lc. casei Zhang. According to PPR-Meta [39], contigs in LcZ_bin44 showed significantly higher plasmid likelihood scores (Figure 3E), implying a plasmid origin of LcZ_bin44 and highlighting metaProbiotics’ ability to detect mobile elements with probiotic functions.

To assess the tool’s generalizability, we focused on the common genera that are not present in the probiotic training set, which have more than one bin in either cohort and at least one bin predicted as probiotics by the metaProbiotics tool. These genera, including Butyricicoccus, Parabacteroides, Phocaeicola, Agathobaculum, Alistipes, Collinsella and Ruminococcus, have reported probiotic capacities, including functions such as short-chain fatty acid (SCFA) production, immunomodulation, metabolic regulation and intestinal barrier protection via diverse molecular mechanisms (Table S8).

Gene function analysis based on Gene Ontology (GO) enrichment indicated that the probiotic bins identified by metaProbiotics encode a variety of genes linked to probiotic mechanisms (Figure 4A). Notably, in addition to genes involved in carbohydrate metabolism, which are traditionally deemed essential for probiotics, there were genes tied to metal–iron responses. This finding was in accord with observations that certain probiotics can regulate the harmful accumulation and toxicity of heavy metals in humans [43, 44]. Furthermore, these identified probiotics show abundant genes associated with the synthesis of substances exhibiting probiotic effects, including glycosaminoglycan, which is known to protect against diseases such as ankylosing spondylitis [45] and rheumatoid arthritis [46].

Figure 4.

Figure 4

Gene function analysis of the probiotics identified by metaProbiotics. (A) GO enrichment analysis of probiotic-encoded genes. (B) The enrichment of the bai operon in unclassified Ruminococcaceae probiotics (P = 0.0007***, Fisher’s exact test). All five unclassified putative Ruminococcaceae bins containing bai operons were identified as probiotics, while no bai operon was found in non-probiotic Ruminococcaceae bins. (C) The bai operons in Ruminococcaceae-related bins. The contigs with bai operons and the alignment graph with reference bai operons are shown.

Interestingly, focusing on the bins associated with the family of unclassified Ruminococcaceae, we found that all five Ruminococcaceae-related bins containing the bai operon were identified as probiotics by metaProbiotics, whereas no bai operon was found in non-probiotics Ruminococcaceae bins (Figure 4B and C). Our prior research indicated that certain Ruminococcaceae can ameliorate chronic insomnia and cardiometabolic diseases via the microbiota–bile acid axis [47]. Thus, based on our findings, we speculate that promoting bile acid metabolism is an important factor in the prebiotic functions of some Ruminococcaceae and that genes related to bile acid metabolism, such as the bai operon, are the key genes responsible for the probiotic functions of these bacteria.

We further evaluated the capability of metaProbiotics to identify probiotics from metagenomic binning data derived from various sources, thereby demonstrating the tool’s universality. We utilized data from Huang et al. [48], in which they conducted interventions on human and mice populations using the probiotic strain Lactiplantibacillus plantarum HNU082. We downloaded metagenomic data from these human and mice subjects and processed it similarly to the LcZ and BlA groups. This process yielded metagenomic bins from different populations, which were then predicted using metaProbiotics and iProbiotics. The results indicate that metaProbiotics consistently identified bins derived from the intervention strain as probiotics, outperforming iProbiotics (Figure S7). This suggests that the efficacy of metaProbiotics in identifying probiotics is not confined to specific types of metagenomic data, thus indicating a broad range of applicability.

Consistency and heterogeneity in probiotic abundance profiles across healthy and diseased individuals

We assessed the abundance of probiotic bins identified from the above application cases by metaProbiotics in the gut metagenome of individuals across various cohorts. Each cohort consisted of samples from both healthy individuals and individuals with specific diseases, including type 2 diabetes (T2D) [49], liver cirrhosis (LC) [50] and colorectal cancer (CRC) [51–54] (see Table S9 for the samples’ accessions). We found that in cohorts of different disease types, the proportion of healthy individuals with a total normalized abundance of probiotics (NAP) greater than 0 consistently exceeded that of disease patients, indicating a higher odds value for health (Figure 5A: bins present in all individuals in a certain cohort were used for statistic; Figure S8: all bins were used). When considering a total NAP >0 as an exposure factor, these probiotic bins could serve as protective factors against diverse diseases (OR = 0.5177, P = 0.0005***, the light-blue boxes in Figure 5A), underscoring their consistent roles across various diseases.

Figure 5.

Figure 5

The abundance profile of the identified probiotic bins among cohorts of individuals with different conditions, including healthy, CRC, LC and T2D. (A) Comparison of the overall NAP between healthy individuals and those with specific diseases across different cohorts. The high abundance of probiotics (sum of NAP exceeding 0) was designated as the exposure factor, based on which the odds values in each cohort were determined. The relationship between probiotic abundance and health status is depicted in the light-blue box. This demonstrates that probiotics identified by metaProbiotics act as a protective factor across different diseases (OR = 0.5177, P = 0.0005***, one-sided Fisher’s exact test). (B) In each cohort, a health status prediction model based on a random forest was trained to identify human health status using the abundance profiles of probiotic bins in each healthy individual and each individual with a specific disease. (C) For each model in (B), we predicted the health status of every individual in each cohort with a specific disease and calculated the predictive AUC. When the cohort used for training was the same as the cohort used for prediction, we employed a leave-one-out approach to evaluate the AUC. We observed that when the prediction cohort was different from the training cohort, there was a significant decline in the AUC, particularly for cross-cohort predictions between CRC and LC.

However, we also observed that among cohorts of different disease types, the distribution patterns of the abundance of these probiotics between healthy individuals and disease patients might show significant heterogeneity. Following the procedure depicted in Figure 5B, we attempted to train a health status prediction model based on random forests using the abundance features of these probiotic bins for each disease type. However, these models struggled with cross-disease predictions; when they were trained on one disease type and tested on another, the predictive AUC notably decreased. This decline was especially evident for CRC and LC, where the AUC for cross-disease predictions dropped below 50%. This suggests that probiotics may exhibit considerable variability in their protective roles against different diseases. We further explored the possibility of identifying a core subset of probiotics that play a significant role in health status prediction across different diseases, potentially aiding in future cross-disease universal predictions. Using feature importance calculation based on out-of-bag (OOB) in random forests, we pinpointed the top 20 bins of highest importance in health status prediction models developed using various disease cohorts. These bins were then examined for their shared presence across different models using a Venn diagram (Figure S9). The results revealed that we could not identify any bins that were crucial for health status prediction in all three diseases analyzed. Additionally, there were only a few bins that played a significant role in at least two diseases. This further suggests that the heterogeneous expression of probiotics across different diseases is quite pronounced, and identifying a group of bacterial species beneficial for all types of diseases remains challenging.

We further collected cohorts from two other disease types, including irritable bowel syndrome (IBS) and Crohn’s disease (CD) [55, 56] (see Table S10 for the samples’ accessions). After undergoing similar analyses, we also observed patterns of consistency and heterogeneity in probiotics. Specifically, the overall abundance of probiotics can be considered a protective factor across various diseases, but health status prediction models constructed using these probiotic abundance features struggled to achieve favorable outcomes in cross-disease predictions (Figure S10). We further investigated whether expanding the range of diseases in the training set of the health prediction model could enhance its cross-disease predictive ability. We merged the cohorts of CRC, LC and T2D (i.e. combining healthy individuals from all cohorts into one category and those with any of the diseases into another) for training the health status prediction model. Subsequently, we applied the trained model to predict health status in IBS and CD cohorts. We found that increasing the variety of diseases in the training set did not significantly improve the model’s ability of cross-disease predictions (Figure S10B). The variation of probiotics across different diseases continues to impede the development of a universal predictive model.

Revealing the heterogeneity of probiotic gene modules across diseases using the global gut probiotic set created by metaProbiotics

To elucidate the heterogeneous behavior of probiotics in various diseases based on a comprehensive view of the gut probiotic community, we utilized metaProbiotics to identify probiotics from 30 691 metagenome-assembled genomes (MAGs) sourced from over 5700 global healthy gut samples in the HumGut database [57], ultimately yielding 5990 probiotics spanning different phyla (Figure 6A). This genomic set significantly expands the species coverage of the current probiotic databases. The information of this probiotic is provided in the ‘probiotics_set.csv’ file in the GitHub site https://github.com/zhenchengfang/metaProbiotics/releases/tag/Globle_gut_probiotics_set.

Figure 6.

Figure 6

Using metaProbiotics to construct a comprehensive gut probiotic set and revealing the abundance patterns of gene modules within these probiotic communities in healthy individuals and in different disease groups (including CRC, LC and T2D) through WGCNA. (A) Utilizing metaProbiotics to construct a dataset resource for human gut probiotics from the 30 691 metagenome-assembled genomes in the HumGut database. The pie chart displays the number of probiotic genomes sourced from HumGut, and the bar chart shows the phylum distribution of the identified probiotics. (B) Calculating the abundance of genes encoded by probiotics in (A) among healthy individuals and specific disease groups (CRC, LC and T2D). Using WGCNA, these genes were divided into different gene modules. Each row in the heatmap represents a gene module; the color of the module indicates its correlation with the respective phenotype. (C). Illustrating the correlation of each probiotic gene module with CRC and LC diseases, identifying a contrasting trend in the module’s correlation with the two conditions. The Spearman correlation index was used. (D) Determining the ratio of significantly positively correlated module counts to significantly negatively correlated module counts for each disease and providing a reference value for this ratio in the healthy population (indicated by the red dashed line). The P-value was obtained through a one-sided binomial test based on the proportion of positive modules to the total statistical modules. (E) Calculating the ratio of significantly negatively correlated module counts to the total module count for each disease, providing a reference value in the healthy population (red dashed line). The P-value was obtained through a one-sided binomial test. (F) Venn diagram displaying the modules with significant negative correlations across different phenotypes. (G) Showcasing hub gene information for modules showing significant correlation (P < 0.05, including both positive and negative correlations) across the three diseases. This includes the proteins encoded by the hub genes, the taxon of the genome containing the hub gene and a bubble chart indicating the magnitude of the correlation of the corresponding gene module in different diseases.

Utilizing the gut metagenomic data from the aforementioned individuals who were healthy or had one of the CRC, LC and T2D, we calculated the abundance of each gene from the 5990 probiotics constructed by metaProbiotics across different phenotypic groups. These probiotic genes were then categorized into gene modules using weighted correlation network analysis (WGCNA) (Figure 6B). As shown in Figure 6B, certain gene modules in the healthy condition exhibited a degree of enrichment (positively correlated with the phenotype), with no substantial module depletion (negatively correlated), indicating a relatively complete composition of probiotic gene modules in the healthy state. In contrast, within various disease conditions, while there was enrichment of gene modules, a conspicuous depletion of gene modules was also evident. This suggests that the enrichment of certain probiotic gene modules may not necessarily promote host health, and the onset of diseases often coincides with the loss of gene modules.

Concurrently, we observed substantial variations in the types of modules that were enriched or depleted across different disease states. For instance, modules that were positively correlated with CRC appeared to be negatively correlated in LC and vice versa, indicating a pronounced antagonistic relationship (Figure 6C). This could also explain the AUC value below 0.5 observed in Figure 5C when making cross-cohort health status predictions for CRC and LC. Furthermore, we examined the ratio of significantly positively correlated modules to negatively correlated modules (P < 0.05) in each condition. We found that this ratio was consistently lower in various disease states compared to the healthy state, although the difference was not always statistically significant (Figure 6D). Additionally, we assessed the proportion of significantly negatively correlated (P < 0.05) modules among the total modules in each state. We observed that this proportion was significantly higher in disease states than in the healthy state (Figure 6E), with considerable variation in the missing modules across different conditions (Figure 6F). These observations suggest that the integrity of gene functions within the probiotic community is more crucial for maintaining gut homeostasis than merely increasing the abundance of specific functional genes. The same probiotic gene module exhibits heterogeneity across different diseases, making it challenging to determine the host’s health status based solely on the enrichment or absence of specific probiotics.

Building on this, we further identified modules that were significantly correlated (both positively and negatively) across CRC, LC and T2D. We pinpointed the hub genes of these modules (a total of 21 hub genes) and documented the taxonomic information of the genomes containing these hub genes as well as their module correlations across various conditions (Figure 6G). We again observed that the probiotic effects on diseases exhibited heterogeneity. No taxon associated with any of the hub genes was found to have probiotic benefits across all analyzed diseases. For instance, the probiotic Akkermansia exhibited beneficial effects (negatively correlated) for CRC but not for LC or T2D. Moreover, different genes from a specific probiotic species can be categorized into different gene modules of the community, and a single module can have varied probiotic effects across diseases. For example, Faecalibacterium prausnitzii had five genes serving as hub genes for five modules. Two of these hub genes (HPt domain–containing protein and putative RRM domain-containing protein) were beneficial for CRC and LC but not for T2D, while the other three exhibited the opposite effect, offering no protective benefits for CRC but being beneficial for LC and T2D. Similarly, Ruminococcus bromii had three genes identified as hub genes for the three modules. Two of these modules were beneficial for LC and T2D but not for CRC, while the other module was beneficial for CRC and LC but not for T2D.

Overall, utilizing the comprehensive gut probiotic set constructed by metaProbiotics, we elucidated the significance of probiotic gene functional integrity in maintaining host health, taking CRC, LC and T2D as examples. We also highlighted the heterogeneous characteristics of probiotics across different diseases. We believe that these findings offer insights for the clinical application of probiotics. Specifically, not all functional modules of probiotics may be protective against every disease. In interventions involving probiotics, the integrity of microbial functions should be prioritized, as an excessive increase in the abundance of specific probiotics might, in fact, produce undesirable outcomes.

DISCUSSION

The language model of metaProbiotics considers eight adjacent bases as a ‘word’ to construct word vectors. However, in the genome, genes or domains are often the basic units of function. Some studies have also considered genes as ‘words’ for modeling [58]. metaProbiotics is mainly aimed at next-generation sequencing data of the metagenome. The length of contigs in these data is often short, while a complete metabolic gene cluster generally contains at least five genes. If using a gene as a ‘word’, it is unlikely that a contig obtained from next-generation sequencing will contain a complete ‘sentence’, which does not allow a full demonstration of the advantages of the language model. Therefore, we chose the string of bases instead of the entire gene as a word unit to better capture the different motifs in the DNA sequence. By combining ultrahigh-depth next-generation sequencing and third-generation sequencing, it is now possible to reconstruct a complete bacterial genome from metagenomic data instead of fragmented draft genomes [59]. Therefore, in the future, we will further combine the data characteristics of different sequencing technologies to build genome language models from different dimensions and improve the algorithm’s discrimination of probiotic genomes.

metaProbiotics cannot directly utilize raw sequencing reads to identify probiotics within a metagenome. Therefore, the ability to reconstruct the draft genome of probiotics from the metagenome through sequence assembly and binning is a prerequisite for the successful identification of probiotics by metaProbiotics. Numerous assembly and binning tools have been published, with common assembly tools including MegaHit [60], IDBA-UD [61], metaSPAdes [62] and MetaVelvet [63] and common binning tools including MetaBAT2 [64], MaxBin [65] and CONCOCT [66]. However, studies have shown that different tools exhibit varying preferences for samples with distinct microbial community structures [67] (such as species abundance and diversity), leading to significant variations in the results obtained from the same sample by different tools. These discrepancies can introduce biases in metaProbiotics’ predictions. For instance, when using binning tools, we attempted binning on metagenomic data from six mouse samples [68] (accession: PRJNA293015) using both MetaBAT2 and MaxBin, resulting in 239 and 448 bins, respectively, a substantial difference. Without prior knowledge of the microbiota, it is challenging to determine which result is more accurate. To integrate the strengths of different tools, comprehensive workflows such as MetaWRAP [69] and Metaphor [70] have been developed. These tools often improve accuracy compared to individual tools but at the cost of significantly longer run times, making them less suitable for large-scale data processing. We recommend that readers choose tools flexibly based on their needs. For instance, if a user’s primary interest lies in focusing on the dominant species in a microbial community rather than the diversity of species, metaSPAdes can be utilized for sequence assembly. This tool is capable of generating longer contigs but is less sensitive to species diversity [67], when dealing with samples exhibiting high species diversity and the presence of low-abundance species, there may be a loss of information. For smaller sample sizes, comprehensive but time-consuming workflows like MetaWRAP can be used to enhance the accuracy of downstream metaProbiotics. For detailed information on common sequencing assembly and binning tools, readers may refer to reviews by Yang et al [71]. Moreover, if researchers aim to obtain high-quality metagenomic binning data, they may opt to use Hi-C sequencing data. This sequencing technique involves crosslinking DNA from the same bacterial cell prior to sequencing, allowing for more accurate binning of DNA sequences post-sequencing, utilizing the crosslinking information [71, 72]. However, compared to standard metagenomic sequencing, Hi-C sequencing is more costly, and researchers should select the data type based on their specific needs.

In some cases, binning software may group different strains with high sequence similarity into the same bin. For instance, in Table S7, we identified a bin with a contamination level of 281.82%. This bin likely contains sequences from both probiotic strains and other bacterial strains. However, the current design of metaProbiotics does not allow for further differentiation between components from probiotics and other strains within a bin. On one hand, as discussed earlier, we can use a combination of different binning tools to further optimize the binning quality and obtain purer probiotic sequences. On the other hand, analysis in this paper suggests that probiotics exhibit relatively distinct modular characteristics in their genes when exerting probiotic effects. In subsequent research, it is necessary to further expand the capabilities of our tool to include enhanced modeling and characterization of various gene modules within the genome, along with providing more comprehensive functional annotations for these modules. This will enable metaProbiotics to answer the question ‘which gene modules in a bin are responsible for specific probiotic functions?’ This approach could not only help researchers extract core functional modules from contaminated data for in-depth mechanistic studies but also provide guidance for the rational use of probiotics in clinical settings.

One of the significant findings of this work is that probiotics exhibit heterogeneous behavior across different disease types. Predicting this heterogeneity based on the probiotic community structure and host factors and establishing a robust health state prediction model to enhance personalized probiotic interventions are crucial for future research. In fact, the probiotic properties of certain bacteria might be influenced by various factors, such as the abundance profile of other bacteria, environmental conditions and the host’s diet. While metaProbiotics can accurately capture vital information related to probiotic functionality from primary sequences using language models, it does not consider other external factors in the probiotic discrimination process. This limitation could hinder the application of metaProbiotics in personalized disease prevention and treatment. In recent years, numerous bioinformatics tools have been developed for predicting the microbiome–drugs interaction, as well as the microbiome–diet interaction [73, 74]. These tools are immensely helpful in further analyzing the heterogeneous effects of probiotics in different individuals. Users can integrate them with their specific needs for further analysis of identified probiotics, serving as a complement to metaProbiotics. Since these tools consider a broader range of factors, they may require more diverse data type inputs, such as nutrient intake matrices [74]. Therefore, it is advisable to have data from specifically designed experiments before utilizing these tools. In our future work, we plan to utilize large-scale population cohort data, gather comprehensive host information and leverage these multimodal data to establish mathematical models for bacterial strains. These models incorporate diverse information, including genomic data, environmental factors and host details, to develop a more holistic discrimination algorithm. We aspire to ultimately guide personalized medicine tailored to individual differences through big data–driven AI algorithms. This includes personalized probiotic interventions based on individual dietary habits and lifestyle information, as well as customized dietary and medication recommendations based on the characteristics of an individual’s probiotic microbiota.

In conclusion, leveraging artificial intelligence language models, we introduced a novel computational approach to identify probiotics from metagenomic bin data. This method facilitates the discovery of novel probiotics and offers a fresh perspective, systematically elucidating the characteristics and heterogeneous manifestations of probiotic communities in maintaining host health.

METHODS

Pretraining of word vectors

We downloaded 120 representative prokaryotic genomes from NCBI (https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/). These 120 bacteria contained a total of 240 DNA sequences derived from bacterial chromosomes and plasmids, and we also extracted the complementary sequence for each DNA sequence, doubling the number of DNA sequences to 480. These 480 sequences were then used for the pretraining of word vectors. In the metaProbiotics language model, each set of k adjacent bases is treated as a word, where k can be fixed or can vary during processing. According to Sun et al. [17], oligonucleotide markers distinguishing probiotics from non-probiotics are predominantly found in oligonucleotides with higher k values (k > 5), where the top four oligonucleotides with the highest importance scores are 8-mers. Therefore, a fixed k of 8 was used to train the word vector. All overlapping 8-mers were extracted from the 480 sequences, and each 8-mer combination was embedded into an initialized numeric vector. Using the skip-gram-based dna2vec algorithm [75], a neural network with one hidden layer containing 100 nodes was employed for word vector training. During training, the neural network takes an 8-mer string of bases from the DNA sequence as input and predicts 10 adjacent 8-mer strings both upstream and downstream of the initial 8-mer sequence. Assuming that k-mers with similar biological properties will have similar sequence contexts, these k-mers are likely to produce similar outputs in the final layer. Consequently, during unsupervised training, k-mers with analogous biological properties are mapped to similar feature vectors in the hidden layer. The feature vector from the 100-node hidden layer then serves as the word vector embedding for the corresponding 8-mer. Relevant parameters of the dna2vec algorithm during training were set as follows: ‘k=8, vec-dim=100, epoch=10, context=10’. Negative sampling was employed to enhance the efficiency of the standard softmax function. In subsequent algorithm construction and application steps, for any given DNA sequences (whether a single sequence or multiple sequences from a bin), we employ JellyFish [76] (v2.3.0) to obtain all 8-mer strings. All corresponding word vectors of these 8-mer strings are summed, and each feature of the word vector is averaged. Consequently, any given DNA sequence, or multiple DNA fragments within a bin, can be represented by a word vector with a dimensionality of 100. Ultimately, this average word vector represents the word vector for the given DNA sequences.

Data set construction of probiotics and non-probiotics

We downloaded the experimentally verified probiotic and non-probiotics genomes from Sun et al. [17] at http://bioinfor.imu.edu.cn/iprobiotics/public/Download. These probiotics and non-probiotics were randomly divided into a training set and a test set at a 9:1 ratio. The training set was further partitioned into 10 equal parts for 10-fold cross-validation. For each genome in the dataset, the ID information, category and the groups into which it was randomly divided are detailed in Table S1. We then randomly extracted sequence fragments from each genome to construct simulated metagenomic binning data, which were used as a benchmark dataset for the development of metaProbiotics. The lengths of these randomly extracted sequences follow a uniform distribution between 2000 and 20 000 bp. Sequences are extracted with equal probability from either the positive strand or the complementary strand of the genome. In the training set, each genome yielded 20 simulated bins with 5% completeness, 10 bins with 10% completeness, 5 bins with 20% completeness, 3 bins with 40% completeness, 2 bins each with 60% and 80% completeness and 1 bin with 100% completeness. In the test set, each genome produced one bin for each completeness level: 5%, 10%, 20%, 40%, 60%, 80% and 100%.

Cosine similarity calculation

For a given bacterial genome word vector (B) or gene word factor (g), the cosine similarity is defined as Inline graphic. Before calculating the similarity, each element of the word vector will be centered so that it has a zero mean. When calculating the cosine similarity between the genome and colonization-related genes, we collected the base sequences of related genes from NCBI by a keyword search. For example, for the bsh gene, we used keywords ‘(bsh[Gene Name]) AND “bacteria”[Organism]’ for searches. Each gene had multiple sequence records, and the average word vector of each record was used as the representative word vector of this gene. For each genome, we calculated its cosine similarity with the representative word vector of each gene and obtained the average cosine similarity, which was used as a measure of similarity related to colonization ability between the genome and genes. When calculating the cosine similarity between the genome and the genes from the CAZy database, the base sequences of genes in the database were downloaded from the dbCAN2 website [77] and then clustered with 90% similarity using the cd-hit-est module of CD-HIT [78] (v4.8.1, parameter settings: -c 0.90, -n 8 and -d 100), and the word vector of the representing sequence was calculated. For the genes of each category in CAZy, we took the average word vectors of these genes as the representative word vectors of this category and calculated their average cosine similarity with the word vectors of each probiotic and non-probiotic genome.

Evaluation of the tools using a simulated benchmark dataset

Both iProbiotics and metaProbiotics can assign scores to input sequences ranging from 0 to 1. A threshold of 0.5 is used for discrimination: sequences with scores above 0.5 are classified as positive samples (probiotics), while those below 0.5 are deemed negative samples (non-probiotics). metaProbiotics can score either a single sequence or multiple sequences from a bin collectively. In contrast, iProbiotics can only evaluate individual sequences sequentially. Given our focus on metagenomic data analysis, where DNA from a bacterial bin often appears as multiple sequence fragments, if multiple sequences from a specific bin are input into iProbiotics, we use the average score of these sequences as the bin’s score. Furthermore, metagenomic sequences vary in length, with longer sequences typically yielding more reliable predictions than shorter ones. Thus, when computing the average score of iProbiotics, we weight each sequence’s score by its length to obtain a weighted average score, which serves as the iProbiotics identification result. When assessing tools based on benchmark datasets from simulated metagenomics, we train metaProbiotics using a training set data and test it on test set data. For real sample applications and released tool versions, we train metaProbiotics using all available data.

During the benchmarking process, the criteria of the Youden index, accuracy, MCC, recall, precision and F1-score are defined as follows:

graphic file with name DmEquation1.gif
graphic file with name DmEquation2.gif
graphic file with name DmEquation3.gif
graphic file with name DmEquation4.gif
graphic file with name DmEquation5.gif
graphic file with name DmEquation6.gif

where Inline graphic, Inline graphic, and TP, TN, FN and FP refer to the number of true positives, true negatives, false negatives and false positives, respectively.

Data processing of metagenomic samples from the LcZ group and BlA group

We obtained the sequencing data from the SRA database according to the accessions of the samples. We used fastp [79] for quality control, MegaHit [60] (v1.2.9) for sequence assembly (using default parameters) and then the built-in MetaBAT2 [64] in MetaWRAP [69] (docker version) to bin the contigs assembled ​​from the LcZ group and BlA group (using default parameters). The completeness of each bin was evaluated by CheckM [80] (v1.0.12). To identify the bins associated with the intervening strain, we used fastaANI [81] (v1.33) to calculate the ANI between each bin and the intervening strain genome (Lc. casei Zhang: NC_014334.2; B. longum AH1206: NZ_CP016019.1) (parameter setting: --fragLen 1000), and bins with an ANI higher than 98% were regarded as bins derived from the intervening strain. To infer the associated taxon of each bin using the CGR database, we used fastANI to calculate the ANI between each bin and each reference genome and used 80% as the threshold for genus determination. If a bin is linked to multiple taxa at the same time, we think that this bin may be mixed with genomes from different bacteria, and this bin will be assigned to multiple taxa at the same time. To identify bins associated with the family of unclassified Ruminococcaceae, we searched for bins with an ANI greater than 75% compared to the reference unclassified Ruminococcaceae genomes. The ANI thresholds for the genus and family levels were chosen based on the report of Zhang et al. [82], and the threshold was slightly lowered considering that the metagenomic bins usually do not contain a complete and pollution-free genome. In the LcZ group, we also used the quant_bins module (default parameters) in MetaWRAP to calculate the abundance of each bin. We used eggNOG-mapper [83] for gene function annotation, took all gene GO terms as the background gene set and the probiotic gene GO terms as the target objects and then conducted GO enrichment analysis using the enricher function in clusterProfiler [84] (parameter setting: pAdjustMethod = ‘BH’, pvalueCutoff = 0.05, qvalueCutoff = 0.2). The gutSMASH [85] was used to identify the bai operon over each bin.

Abundance profiles of probiotics identified from test populations in cohorts with various disease types

The gut metagenomic reads of cohorts with different disease types (CRC, LC and T2D) were downloaded according to their accession. Both the LC and T2D types contained one cohort with healthy individuals and disease patients, while the CRC type contained three cohorts from different regions (Table S9). The reads passed quality control in T2D cohort was obtained from Feng et al. [86], and we used fastp for quality control of other cohorts. We used the quant_bins module in MetaWRAP to calculate the abundance of the identified probiotics from the LcZ and BlA groups in each cohort. During the use of MetaWRAP, we used the -b parameter to specify the probiotic bins from the LcZ group and the BlA group and did not specify the -a parameter. To calculate the odds value of each cohort, the abundance of bins was z score–normalized; therefore, the positive or negative value indicated the upregulation or downregulation of the bin abundance. When constructing the cross-cohort healthy prediction model, the original abundance value of all probiotic bins served as the random forest input so that the model could perform prediction for a single sample independently.

Gene modules analysis based on WGCNA

We used Prodigal [87] (v2.6.3) to extract genes from the 5990 predicted probiotic genomes from the global gut metagenome, and we used CD-HIT to reduce redundancy under a 90% similarity threshold. The abundance of each probiotic gene in each individual included in Table S9 was calculated using the quant module of salmon [88] (docker version, parameters setting: -l A). Considering that the abundance profile of genes across samples was sparse, we extracted the top 20 000 genes by calculating the total abundance among all the samples to perform the downstream analysis. The gene abundance matrix for the healthy condition and three types of disease conditions was then used for co-expression network analysis using the R package WGCNA [89] (v1.72.1). Then, the network was constructed by the following steps with specific functions: (1) choose a set of powers to analyze the network topology using the pickSoftThreshold function (parameters: powerVector = c(1:30), verbose = 5) and then choose 5 as the best power value; (2) use the adjacency function to transform the similarity matrix into an adjacency matrix (parameters: power = 5); (3) use the TOMsimilarity function to transform the adjacency matrix into a topological overlap matrix (TOM) and use the default parameters; and (4) use the cutreeDynamic function to identify modules, based on the dynamic tree cut method (parameters: deepSplit = 2, minClusterSize = 50, pamRespectsDendro = F). The genes were first hierarchically clustered using the hclust function (parameters: method = ‘average’) and then transferred to the cutreeDynamic function; (5) use the moduleEigengenes function to obtain the eigengene in each module (default parameters); (6) use the mergeCloseModules function to merge similar gene modules (parameters: cutHeight = 0.25, verbose = 3); and (7) use the cor function to calculate the correlation between each module and its phenotype and the corPvalueStudent function to calculate the corresponding p value. The hub genes are obtained using chooseTopHublnEachModule function. All the hub genes were submitted to the UniProt online blast server (https://www.uniprot.org/blast) for function annotation. Based on the UniProt output file, the function of the protein with the highest match will be used for hub gene annotation. If the top-matching protein lacks functional annotation, the highest-matching protein with functional information will be used, prefixed with ‘putative’.

Key Points

  • The metaProbiotics tool uses the word vector language model to characterize deoxyribonucleic acid sequences and employs the random forest algorithm to rapidly identify the probiotic-derived bins from metagenomic data, demonstrating superior performance in simulated benchmark datasets.

  • Testing on gut metagenomes from probiotic-treated individuals, metaProbiotics can better identify the bins with probioticity, including a plasmid-like bin showing synergistic prebiotic function with the intervention strain.

  • Gene function analysis revealed multiple probiotic mechanisms of the probiotic bins identified by metaProbiotics and bai operon is a potential key marker of some probiotically active Ruminococcaceae.

  • Probiotic bins identified by metaProbiotics could serve as protective factors against diverse diseases, signifying their probiotic role.

  • Large-scale probiotics analysis with metaProbiotics shows that gene function integrity of the probiotic community is more crucial in maintaining gut homeostasis than merely increasing specific gene abundance, owning to the fact that different gene modules on the same probiotic have heterogeneous effects on various diseases; adding probiotics indiscriminately might not boost health.

Supplementary Material

cover_letter_bbae085
cover_letter_bbae085.docx (46.5KB, docx)
Supplementary_information_1_bbae085
Supplementary_information_2_bbae085

ACKNOWLEDGEMENTS

We thank Dr Rongrong Luo of Zhujiang Hospital for her helpful discussion regarding the code debugging.

Author Biographies

Shufang Wu is a post-doctoral researcher in the Department of Laboratory Medicine at Zhujiang Hospital, Southern Medical University. Her research interest includes bioinformatics and microbial genomics.

Tao Feng is a doctor student in the Department of Laboratory Medicine at Zhujiang Hospital, Southern Medical University.

Waijiao Tang is a pharmacist in the Department of Laboratory Medicine at Zhujiang Hospital, Southern Medical University.

Cancan Qi is a post-doctoral researcher in the Department of Laboratory Medicine at Zhujiang Hospital, Southern Medical University.

Jie Gao is a professor in the Department of Gastroenterology at The Second Affiliated Hospital of Guangzhou Medical University.

Xiaolong He is a professor in the Department of Laboratory Medicine at Zhujiang Hospital, Southern Medical University.

Jiaxuan Wang is an assistant researcher in the Department of Laboratory Medicine at Zhujiang Hospital, Southern Medical University.

Hongwei Zhou is a professor in the Department of Laboratory Medicine at Zhujiang Hospital, Southern Medical University. His research interest includes microbiome, clinical medicine and bioinformatics.

Zhencheng Fang is an assistant researcher in the Department of Laboratory Medicine at Zhujiang Hospital, Southern Medical University. His research interest includes genomics, bioinformatics and artificial intelligence.

Contributor Information

Shufang Wu, Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China.

Tao Feng, Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China.

Waijiao Tang, Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China.

Cancan Qi, Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China.

Jie Gao, Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China; Department of Gastroenterology, The Second Affiliated Hospital of Guangzhou Medical University, Guangzhou, China.

Xiaolong He, Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China.

Jiaxuan Wang, Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China.

Hongwei Zhou, Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China.

Zhencheng Fang, Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, China.

FUNDING

This work was supported by the National Key Research and Development Program of China (2022YFA0806400) and the National Natural Science Foundation of China (82102508, 81925026, 82104625, 82302610).

DATA AVAILABILITY

The list of 120 reference prokaryotic genomes used for language model training can be downloaded from NCBI (link: https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prok_reference_genomes.txt). The list of the probiotic and non-probiotics genomes used for metaProbiotics construction can be downloaded from Sun et al. [17] (link: http://bioinfor.imu.edu.cn/iprobiotics/public/Download), and the organization of these genomes is provided in Table S1. The accession list of metagenomic sequencing data for the LcZ and BlA groups is provided in Table S6. The accession list of the metagenomic sequencing data for cohorts with healthy individuals and different disease types is provided in Table S9. The accession list of the probiotic set constructed by metaProbiotics using the global gut metagenome is provided in the ‘probiotics_set.csv’ file in https://github.com/zhenchengfang/metaProbiotics/releases/tag/Globle_gut_probiotics_set.

References

  • 1. Hill C, Guarner F, Reid G, et al. Expert consensus document. The International Scientific Association for Probiotics and Prebiotics consensus statement on the scope and appropriate use of the term probiotic. Nat Rev Gastroenterol Hepatol 2014;11(8):506–14. [DOI] [PubMed] [Google Scholar]
  • 2. Wu Y, Zhang X, Liu X, et al. Strain specificity of lactobacilli with promoted colonization by galactooligosaccharides administration in protecting intestinal barriers during Salmonella infection. J Adv Res 2024;56:1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Pacyga-Prus K, Jakubczyk D, Sandström C, et al. Polysaccharide BAP1 of Bifidobacterium adolescentis CCDM 368 is a biologically active molecule with immunomodulatory properties. Carbohydr Polym 2023;315:120980. [DOI] [PubMed] [Google Scholar]
  • 4. Merenstein D, Fraser CM, Roberts RF, et al. Bifidobacterium animalis subsp. lactis BB-12 protects against antibiotic-induced functional and compositional changes in human fecal microbiome. Nutrients 2021;13(8):2814. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Purton T, Staskova L, Lane MM, et al. Prebiotic and probiotic supplementation and the tryptophan-kynurenine pathway: a systematic review and meta analysis. Neurosci Biobehav Rev 2021;123:1–13. [DOI] [PubMed] [Google Scholar]
  • 6. Bambury A, Sandhu K, Cryan JF, Dinan TG. Finding the needle in the haystack: systematic identification of psychobiotics. Br J Pharmacol 2018;175(24):4430–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Markets and Markets . https://www.marketsandmarkets.com/PressReleases/probiotics.asp. Accessed 3 September 2023.
  • 8. O'Toole PW, Marchesi JR, Hill C. Next-generation probiotics: the spectrum from probiotics to live biotherapeutics. Nat Microbiol 2017;2:17057. [DOI] [PubMed] [Google Scholar]
  • 9. Sharif S, Meader N, Oddie SJ, et al. Probiotics to prevent necrotising enterocolitis in very preterm or very low birth weight infants. Cochrane Database Syst Rev 2020;10(10):CD005496. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Zheng J, Wittouck S, Salvetti E, et al. A taxonomic note on the genus Lactobacillus: description of 23 novel genera, emended description of the genus Lactobacillus Beijerinck 1901, and union of Lactobacillaceae and Leuconostocaceae. Int J Syst Evol Microbiol 2020;70(4):2782–858. [DOI] [PubMed] [Google Scholar]
  • 11. Suez J, Zmora N, Segal E, Elinav E. The pros, cons, and many unknowns of probiotics. Nat Med 2019;25(5):716–29. [DOI] [PubMed] [Google Scholar]
  • 12. Fölster-Holst R. Probiotics in the treatment and prevention of atopic dermatitis. Ann Nutr Metab 2010;57(Suppl):16–9. [DOI] [PubMed] [Google Scholar]
  • 13. Mathieu A, Leclercq M, Sanabria M, et al. Machine learning and deep learning applications in metagenomic taxonomy and functional annotation. Front Microbiol 2022;13:811495. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Fang Z, Feng T, Zhou H, Chen M. DeePVP: identification and classification of phage virion proteins using deep learning. Gigascience 2022;11:giac076. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Tao L, Wang B, Zhong Y, et al. Database and bioinformatics studies of probiotics. J Agric Food Chem 2017;65(35):7599–606. [DOI] [PubMed] [Google Scholar]
  • 16. Chan PL, Lauw S, Ma KL, et al. ProBioQuest: a database and semantic analysis engine for literature, clinical trials and patents related to probiotics. Database (Oxford) 2022(2022):baac059. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Sun Y, Li H, Zheng L, et al. iProbiotics: a machine learning platform for rapid identification of probiotic properties from whole-genome primary sequences. Brief Bioinform 2022;23(1):bbab477. [DOI] [PubMed] [Google Scholar]
  • 18. Jaillard M, Palmieri M, Belkum A, Mahé P. Interpreting k-mer-based signatures for antibiotic resistance prediction. Gigascience 2020;9(10):giaa110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Davis JJ, Boisvert S, Brettin T, et al. Antimicrobial resistance prediction in PATRIC and RAST. Sci Rep 2016;6:27930. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Drouin A, Giguère S, Déraspe M, et al. Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons. BMC Genomics 2016;17(1):754. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Mahé P, Tournoud M. Predicting bacterial resistance from whole-genome sequences using k-mers and stability selection. BMC Bioinformatics 2018;19(1):383. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Drouin A, Letarte G, Raymond F, et al. Interpretable genotype-to-phenotype classifiers with performance guarantees. Sci Rep 2019;9(1):4071. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Nguyen M, Brettin T, Long SW, et al. Developing an in silico minimum inhibitory concentration panel test for Klebsiella pneumoniae. Sci Rep 2018;8(1):421. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Nguyen M, Long SW, McDermott PF, et al. Using machine learning to predict antimicrobial MICs and associated genomic features for nontyphoidal salmonella. J Clin Microbiol 2019;57(2):e01260–18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Ma Y, Guo Z, Xia B, et al. Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nat Biotechnol 2022;40(6):921–31. [DOI] [PubMed] [Google Scholar]
  • 26. Zhang Y, Zhang Y, Xiong Y, et al. T4SEfinder: a bioinformatics tool for genome-scale prediction of bacterial type IV secreted effectors using pre-trained protein language model. Brief Bioinform 2022;23(1):bbab420. [DOI] [PubMed] [Google Scholar]
  • 27. Miller D, Stern A, Burstein D. Deciphering microbial gene function using natural language processing. Nat Commun 2022;13(1):5731. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Tsukiyama S, Hasan MM, Fujii S, Kurata H. LSTM-PHV: prediction of human-virus protein-protein interactions by LSTM with word2vec. Brief Bioinform 2021;22(6):bbab228. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Asgari E, Mofrad MR. Continuous distributed representation of biological sequences for deep proteomics and genomics. PloS One 2015;10(11):e0141287. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Singer JR, Blosser EG, Zindl CL, et al. Preventing dysbiosis of the neonatal mouse intestinal microbiome protects against late-onset sepsis. Nat Med 2019;25(11):1772–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Corcoran BM, Stanton C, Fitzgerald G, Ross R. Life under stress: the probiotic stress response and how it may be manipulated. Curr Pharm Des 2008;14(14):1382–99. [DOI] [PubMed] [Google Scholar]
  • 32. Derrien M, Hylckama Vlieg JE. Fate, activity, and impact of ingested bacteria within the human gut microbiota. Trends Microbiol 2015;23(6):354–66. [DOI] [PubMed] [Google Scholar]
  • 33. Le Barz M, Anhê FF, Varin TV, et al. Probiotics as complementary treatment for metabolic disorders. Diabetes Metab J 2015;39(4):291–303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Schwiertz A, Taras D, Schäfer K, et al. Microbiota and SCFA in lean and overweight healthy subjects. Obesity (Silver Spring) 2010;18(1):190–5. [DOI] [PubMed] [Google Scholar]
  • 35. Drula E, Garron ML, Dogan S, et al. The carbohydrate-active enzyme database: functions and literature. Nucleic Acids Res 2022;50(D1):D571–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Zhu M, Zhu J, Fang S, Zhao B. Complete genome sequence of Bifidobacterium animalis subsp. lactis BLa80, a strain isolated from human breast milk. Microbiol Resour Announc 2023;12(6):e0046522. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Reyes-Castillo PA, González-Vázquez R, Torres-Maravilla E, et al. Bifidobacterium longum LBUX23 isolated from feces of a newborn; potential probiotic properties and genomic characterization. Microorganisms 2023;11(7):1648. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Korzhenkov AA, Tepliuk AV, Sidoruk KV, et al. A dataset of four probiotic Bifidobacterium strains genome assemblies. Data Brief 2021;34:106710. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Fang Z, Tan J, Wu S, et al. PPR-meta: a tool for identifying phages and plasmids from metagenomic fragments using deep learning. Gigascience. 2019;8(6):giz066. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Wang J, Zhang J, Liu W, et al. Metagenomic and metatranscriptomic profiling of Lactobacillus casei Zhang in the human gut. NPJ Biofilms Microbiomes 2021;7(1):55. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Maldonado-Gómez MX, Martínez I, Bottacini F, et al. Stable engraftment of Bifidobacterium longum AH1206 in the human gut depends on individualized features of the resident microbiome. Cell Host Microbe 2016;20(4):515–26. [DOI] [PubMed] [Google Scholar]
  • 42. Zou Y, Xue W, Luo G, et al. 1,520 reference genomes from cultivated human gut bacteria enable functional microbiome analyses. Nat Biotechnol 2019;37(2):179–85. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Astolfi ML, Protano C, Schiavi E, et al. A prophylactic multi-strain probiotic treatment to reduce the absorption of toxic elements: in-vitro study and biomonitoring of breast milk and infant stools. Environ Int 2019;130:104818. [DOI] [PubMed] [Google Scholar]
  • 44. Arun KB, Madhavan A, Sindhu R, et al. Probiotics and gut microbiome - prospects and challenges in remediating heavy metal toxicity. J Hazard Mater 2021;420:126676. [DOI] [PubMed] [Google Scholar]
  • 45. Zhou C, Zhao H, Xiao XY, et al. Metagenomic profiling of the pro-inflammatory gut microbiota in ankylosing spondylitis. J Autoimmun 2020;107:102360. [DOI] [PubMed] [Google Scholar]
  • 46. Kishikawa T, Maeda Y, Nii T, et al. Metagenome-wide association study of gut microbiome revealed novel aetiology of rheumatoid arthritis in the Japanese population. Ann Rheum Dis 2020;79(1):103–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Jiang Z, Zhuo LB, He Y, et al. The gut microbiota-bile acid axis links the positive association between chronic insomnia and cardiometabolic diseases. Nat Commun 2022;13(1):3002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Huang S, Jiang S, Huo D, et al. Candidate probiotic Lactiplantibacillus plantarum HNU082 rapidly and convergently evolves within human, mice, and zebrafish gut but differentially influences the resident microbiome. Microbiome 2021;9(1):151. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Qin J, Li Y, Cai Z, et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 2012;490(7418):55–60. [DOI] [PubMed] [Google Scholar]
  • 50. Qin N, Yang F, Li A, et al. Alterations of the human gut microbiome in liver cirrhosis. Nature 2014;513(7516):59–64. [DOI] [PubMed] [Google Scholar]
  • 51. Gupta A, Dhakan DB, Maji A, et al. Association of Flavonifractor plautii, a flavonoid-degrading bacterium, with the gut microbiome of colorectal cancer patients in India. mSystems 2019;4(6):e00438–19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Dhakan DB, Maji A, Sharma AK, et al. The unique composition of Indian gut microbiome, gene catalogue, and associated fecal metabolome deciphered using multi-omics approaches. Gigascience. 2019;8(3):giz004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Thomas AM, Manghi P, Asnicar F, et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat Med 2019;25(4):667–78. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Hannigan GD, Duhaime MB, RuffinMT, IV, et al. Diagnostic potential and interactive dynamics of the colorectal cancer virome. MBio 2018;9(6):e02248–18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Li M, Wang C, Guo Q, et al. More positive or more negative? Metagenomic analysis reveals roles of virome in human disease-related gut microbiome. Front Cell Infect Microbiol 2022;12:846063. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Lewis JD, Chen EZ, Baldassano RN, et al. Inflammation, antibiotics, and diet as environmental stressors of the gut microbiome in pediatric Crohn's disease. Cell Host Microbe 2015;18(4):489–500. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Hiseni P, Rudi K, Wilson RC, et al. HumGut: a comprehensive human gut prokaryotic genomes collection filtered by metagenome data. Microbiome 2021;9(1):165. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Shang J, Tang X, Sun Y. PhaTYP: predicting the lifestyle for bacteriophages using BERT. Brief Bioinform 2023;24(1):bbac487. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Jin H, Quan K, He Q, et al. A high-quality genome compendium of the human gut microbiome of Inner Mongolians. Nat Microbiol 2023;8(1):150–61. [DOI] [PubMed] [Google Scholar]
  • 60. Li D, Liu CM, Luo R, et al. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 2015;31(10):1674–6. [DOI] [PubMed] [Google Scholar]
  • 61. Peng Y, Leung HC, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 2012;28(11):1420–8. [DOI] [PubMed] [Google Scholar]
  • 62. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res 2017;27(5):824–34. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res 2012;40(20):e155. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Kang DD, Li F, Kirton E, et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 2019;7:e7359. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65. Wu YW, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 2016;32(4):605–7. [DOI] [PubMed] [Google Scholar]
  • 66. Alneberg J, Bjarnason BS, Bruijn I, et al. Binning metagenomic contigs by coverage and composition. Nat Methods 2014;11(11):1144–6. [DOI] [PubMed] [Google Scholar]
  • 67. Vollmers J, Wiegand S, Kaster AK. Comparing and evaluating metagenome assembly tools from a microbiologist's perspective - not only size matters! PloS One 2017;12(1):e0169662. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68. Ni Y, Wong VH, Tai WC, et al. A metagenomic study of the preventive effect of Lactobacillus rhamnosus GG on intestinal polyp formation in ApcMin/+ mice. J Appl Microbiol 2017;122(3):770–84. [DOI] [PubMed] [Google Scholar]
  • 69. Uritskiy GV, DiRuggiero J, Taylor J. MetaWRAP-a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome. 2018;6(1):158. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Salazar VW, Shaban B, Quiroga MDM, et al. Metaphor-a workflow for streamlined assembly and binning of metagenomes. Gigascience. 2022;12:giad055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71. Yang C, Chowdhury D, Zhang Z, et al. A review of computational tools for generating metagenome-assembled genomes from metagenomic sequencing data. Comput Struct Biotechnol J 2021;19:6301–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Stalder T, Press MO, Sullivan S, et al. Linking the resistome and plasmidome to the microbiome. ISME J 2019;13(10):2437–46. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Wang L, Tan Y, Yang X, et al. Review on predicting pairwise relationships between human microbes, drugs and diseases: from biological data to computational models. Brief Bioinform 2022;23(3):bbac080. [DOI] [PubMed] [Google Scholar]
  • 74. Xu X, Lubomski M, Holmes AJ, et al. NEMoE: a nutrition aware regularized mixture of experts model to identify heterogeneous diet-microbiome-host health interactions. Microbiome. 2023;11(1):51. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75. Ng P. dna2vec: Consistent vector representations of variable-length k-mers, 2017, arXiv preprint arXiv:1701.06279.
  • 76. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 2011;27(6):764–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77. Zhang H, Yohe T, Huang L, et al. dbCAN2: a meta server for automated carbohydrate-active enzyme annotation. Nucleic Acids Res 2018;46(W1):W95–101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78. Fu L, Niu B, Zhu Z, et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28(23):3150–2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79. Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018;34(17):i884–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80. Parks DH, Imelfort M, Skennerton CT, et al. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 2015;25(7):1043–55. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81. Jain C, Rodriguez-R LM, Phillippy AM, et al. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 2018;9(1):5114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82. Zhang W, Du P, Zheng H, et al. Whole-genome sequence comparison as a method for improving bacterial species definition. J Gen Appl Microbiol 2014;60(2):75–8. [DOI] [PubMed] [Google Scholar]
  • 83. Cantalapiedra CP, Hernández-Plaza A, Letunic I, et al. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol Biol Evol 2021;38(12):5825–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84. Wu T, Hu E, Xu S, et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation (Camb) 2021;2(3):100141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85. Pascal Andreu V, Augustijn HE, Chen L, et al. gutSMASH predicts specialized primary metabolic pathways from the human gut microbiota. Nat Biotechnol 2023;41:1416–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86. Feng T, Wu S, Zhou H, et al. MOBFinder: a tool for MOB typing for plasmid metagenomic fragments based on language model. bioRxiv 2023. 10.1101/2023.12.06.570414. [DOI] [Google Scholar]
  • 87. Hyatt D, Chen GL, Locascio PF, et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88. Patro R, Duggal G, Love MI, et al. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 2017;14(4):417–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 2008;9:559. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

cover_letter_bbae085
cover_letter_bbae085.docx (46.5KB, docx)
Supplementary_information_1_bbae085
Supplementary_information_2_bbae085

Data Availability Statement

The list of 120 reference prokaryotic genomes used for language model training can be downloaded from NCBI (link: https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prok_reference_genomes.txt). The list of the probiotic and non-probiotics genomes used for metaProbiotics construction can be downloaded from Sun et al. [17] (link: http://bioinfor.imu.edu.cn/iprobiotics/public/Download), and the organization of these genomes is provided in Table S1. The accession list of metagenomic sequencing data for the LcZ and BlA groups is provided in Table S6. The accession list of the metagenomic sequencing data for cohorts with healthy individuals and different disease types is provided in Table S9. The accession list of the probiotic set constructed by metaProbiotics using the global gut metagenome is provided in the ‘probiotics_set.csv’ file in https://github.com/zhenchengfang/metaProbiotics/releases/tag/Globle_gut_probiotics_set.


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES