ABSTRACT
Microorganism culturing is essential in microbiological research, with the selection of suitable culture media being critical for successful microbial growth. Traditionally, this selection has relied on empirical knowledge or trial and error, often resulting in inefficiency. In this study, we analysed nutrient compositions from the MediaDive database to construct a dataset of 2369 media types. Leveraging this dataset and microbial 16S rRNA sequences, we developed 45 binary classification models using the XGBoost algorithm. These models demonstrated strong predictive performance, achieving accuracies ranging from 76% to 99.3%, with the top‐performing models for J386, J50 and J66 media reaching 99.3%, 98.9% and 98.8%, respectively. The models effectively predicted growth conditions for various human gut microbes, confirming their practical utility. This research improves the efficiency of microbial cultivation and highlights the potential of machine learning to optimise culture media selection and advance microbiological studies.
Keywords: 16S rRNA, bacterial growth prediction, culture medium, machine learning, XGBoost algorithm

1. Introduction
Studies of microorganisms often require their culture in the laboratory. However, it is challenging to determine the appropriate culture medium for each microorganism due to their high degree of diversity. The appropriate culture medium must provide the necessary nutrients, including nitrogen sources, carbon sources and inorganic salts, as well as the essential environmental conditions for studying their growth patterns, metabolic characteristics and functions (Guerrero‐Ferreira and Nishiguchi 2011; Overmann et al. 2017).
Recent advances in machine learning have opened new avenues for addressing these challenges. For example, Schinn et al. (2021) used the Markov chain Monte Carlo method combined with the Chinese hamster ovary (CHO) cell metabolic model to adjust the amino acid concentrations in culture media dynamically, and thereby enhanced the growth rate of CHO cells. Similarly, Ashino et al. (2019) used a decision tree model to predict the key components of a culture medium that would significantly influence the growth rate and cell density of microorganisms, which play crucial roles in the effectiveness of the culture process. Akin et al. (2020) used multivariate adaptive regression splines (MARS) to predict the quality of strawberry shoots, their growth density and leaf colour based on the concentrations of the main inorganic components of the culture medium. The above studies implied that artificial‐intelligence methods have potential for screening the components of microbial culture media.
Traditional methods of culture medium selection, such as process analysis, experiment design and statistical methods (Yongmin and Zhaolie 2007), are time‐consuming, inefficient and increasingly limited due to the continuous discovery of new microbial species (Kaiser et al. 2007). Therefore, it is important to develop new methods that can quickly and accurately select an appropriate culture medium to support the growth of a particular bacterium and thereby improve microbial research.
The 16S rRNA sequence is about 1500 bp long and comprises 10 evolutionarily conserved regions and 9 variable regions (Varliero et al. 2023). It has been used both for the classification of bacteria (conserved regions) and for the analysis of their evolution (variable regions) (Clarridge 2004). Classification using the 16S rRNA sequence has been shown to provide higher resolution and accuracy than traditional clustering methods based on amino acid composition or physicochemical properties (Chen et al. 2022; Janda and Abbott 2007).
In this study, we have developed a tool named MediaMatch. We constructed 45 binary classification models using the XGBoost algorithm with data on culture media from the MediaDive database together with microbial 16S rRNA sequences to predict the appropriate culture media for various microorganisms. These 45 models learned from the variable regions of the 16S rRNA sequence to identify whether different bacterial species could grow on 1 or more of the 45 culture media. The F1 score of most models exceeded 90%, and experimental results demonstrated that the predictions were efficient and accurate.
2. Materials and Methods
2.1. Culture Medium–Microorganism Dataset Construction
MediaDive (Koblitz et al. 2023) is a comprehensive database that contains information on microorganisms and culture media. In total, 2369 entries on culture media were sourced from this database for training and compilation purposes. The comprehensive dataset used encompassed parameters including culture conditions, nutritional components of the media and their respective contents, as well as all bacterial strains that could be cultured on the media. This information was collated to construct a culture medium–microorganism dataset.
2.2. Construction of Input Datasets
To explore the adaptability of bacteria to growth in various culture media, we constructed a single input dataset from which the 45 models were built. We first compiled a table of all bacteria that could be cultured in the media available in our database, 33,852 in total, and found 16S rRNA sequences for 26,271 of them. We then used iLearnPlus (Chen et al. 2021) to convert these 16S rRNA sequences into feature values for model construction. iLearnPlus calculates the frequencies of the different 3‐mers in each 16S rRNA sequence using a sliding‐window technique; frequencies were used rather than direct counts to avoid biases caused by varying sequence lengths. The input features were the same for all 45 models, namely the 3‐mer frequencies of the 16S rRNA sequences of the 26,271 bacteria, but the labels differed. For each medium, a bacterium that could grow on that medium was labelled 1 (positive) and a bacterium that could not was labelled 0 (negative). This dataset became the foundation for training and validating our prediction models, with the aim of identifying the growth adaptability of different bacterial species to specific culture media.
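The 3‐mer featurisation described above can be illustrated with a minimal Python sketch. This is not the iLearnPlus implementation: the function name `kmer_frequencies`, the conversion of U to T and the decision to skip windows containing ambiguous bases are assumptions made only for illustration.

```python
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return the relative frequency of every possible DNA k-mer in `sequence`.

    A window of width k slides along the sequence one base at a time. Counts
    are divided by the number of valid windows, so sequences of different
    lengths yield comparable feature vectors.
    """
    alphabet = "ACGT"
    # Fixed feature order: AAA, AAC, ..., TTT (64 features when k = 3).
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    seq = sequence.upper().replace("U", "T")  # accept RNA or DNA notation
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # assumption: skip windows with ambiguous bases (e.g. N)
            counts[window] += 1
            total += 1
    if total == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: n / total for kmer, n in counts.items()}


# Toy sequence, not a real 16S rRNA gene.
features = kmer_frequencies("ACGTACGTTT")
```

Each sequence is thereby reduced to 64 numbers that sum to 1, which is the kind of fixed‐length input a tree‐based classifier requires.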
2.3. Selection of Machine‐Learning Algorithms
Numerous machine‐learning algorithms have been shown to be both effective and versatile across a broad spectrum of applications (Zhang et al. 2022). In the present study, we selected five algorithms for detailed comparative analysis: Extreme Gradient Boosting (XGBoost) (Chen and Guestrin 2016), Classification and Regression Tree (CART) (Acito 2023), Support Vector Machine (SVM) (Cortes and Vapnik 1995), K‐Nearest Neighbors (KNN) (Cover and Hart 1967) and Random Forest (RF) (Breiman 2001). Finally, we chose the XGBoost algorithm and built models for each of the 45 culture media.
2.3.1. Model Evaluation Metrics
To evaluate the performance of the five candidate algorithms, we used five metrics: accuracy, precision, recall, F1 score and the area under the precision–recall curve (AUPRC). Taken together, these metrics provided a comprehensive assessment and enabled us to determine the most suitable algorithm for our research.
2.3.2. Accuracy
Accuracy was calculated by dividing the number of correctly predicted samples by the total number of samples (Zhang et al. 2017):
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
where TP is true positive, TN is true negative, FP is false positive and FN is false negative.
2.3.3. Precision
Precision is a measure of how many of the predicted positive samples were actually true positives. High precision indicated that the model was less likely to classify negative samples as positive (Saito and Rehmsmeier 2015):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$
where TP is true positive and FP is false positive.
2.3.4. Recall
Recall is a measure of the proportion of true positive samples among all actual positive samples. High recall indicates that the model was better at capturing positive samples and reducing the number of missed positive samples (Zhang et al. 2023):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$
where TP is true positive and FN is false negative.
2.3.5. F1 Score
The F1 score is the harmonic mean of precision and recall, and was used to provide a comprehensive evaluation of the model's performance. The F1 score combines both precision and recall, providing a more balanced performance measure, especially in cases of class imbalance (Chicco and Jurman 2020):
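$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}$$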
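The four metrics defined in Sections 2.3.2 to 2.3.5 can be computed directly from the confusion‐matrix counts. The sketch below is illustrative only: the function name `classification_metrics` and the example counts are hypothetical, and returning 0 when a denominator is zero is an assumption, not a convention taken from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for one medium: 90 growers among 1000 strains.
example = classification_metrics(tp=80, tn=900, fp=10, fn=10)
```

With these made‐up counts the accuracy is 0.98 while precision, recall and F1 are all about 0.89, which shows why accuracy alone can look flattering when negatives dominate.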
2.3.6. AUPRC
The area under the precision–recall curve (AUPRC) summarises precision and recall across all decision thresholds and was used to provide a comprehensive evaluation of the model's performance. By focusing on the positive class, AUPRC offers a more informative measure under class imbalance than ROC‐based metrics (Davis and Goadrich 2006):
$$\mathrm{AUPRC} = \int_{0}^{1} P(R)\,\mathrm{d}R \tag{5}$$
where R is Recall and P is Precision.
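In practice the area under the precision–recall curve is estimated from a finite ranking of predictions. The following sketch uses the step‐wise (average precision) approximation, in which precision is accumulated at every rank where a positive sample is retrieved. It is an illustration, not the evaluation code used in this study; the function name `average_precision` is hypothetical and tied scores are not treated specially.

```python
def average_precision(y_true, y_score):
    """Step-wise approximation of the area under the precision-recall curve."""
    # Rank samples from the highest to the lowest predicted score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    tp = 0
    area = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            # Recall rises by 1/n_pos at each positive; weight the precision
            # at that rank (tp / rank) by this recall increment.
            area += (tp / rank) / n_pos
    return area


# Toy example: positives are ranked first and third.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```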
2.4. Model Optimization Method
After selecting a specific model, we built a model for each chosen culture medium, transforming the multilabel, multiclass problem into a binary classification problem. In addition, we performed parameter optimization for each of the constructed models.
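The conversion of the multilabel problem into one binary problem per medium can be sketched as follows. The strain names and medium assignments are invented for illustration, and `build_binary_labels` is a hypothetical helper rather than a function from the study's code.

```python
def build_binary_labels(strain_media, medium_id):
    """Return {strain: 1 or 0} for one medium (one-vs-rest labelling).

    A strain is labelled 1 if the medium is among those recorded as
    supporting its growth, and 0 otherwise.
    """
    return {
        strain: int(medium_id in media)
        for strain, media in strain_media.items()
    }


# Hypothetical records: strain -> set of medium IDs with recorded growth.
strain_media = {
    "strain_A": {"104", "78"},
    "strain_B": {"104"},
    "strain_C": {"693"},
}

# One label vector per medium; the feature matrix is shared by all models.
labels_by_medium = {
    medium: build_binary_labels(strain_media, medium)
    for medium in ("104", "78", "693")
}
```

Each entry of `labels_by_medium` would then be paired with the same 3‐mer feature matrix to train one binary classifier.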
2.4.1. Grid Search
To ensure that the model had strong generalisation capabilities, we used ‘GridSearchCV’ from the scikit‐learn library, which is a grid search technique for optimising model parameters. Grid search (Budiman 2019) utilises various parameter combinations to train models, thus confirming the best configuration. During the grid search process, we focused primarily on two key parameters of the XGBoost model: the maximum depth of the tree and the learning rate (Ma et al. 2022). In this study, the exploration range for the maximum tree depth was set from 3 to 10, and the learning rate was set from 0.01 to 0.4. This range was chosen based on consideration of balancing the performance of the XGBoost model and the risk of overfitting. Lower tree depths help prevent overfitting, while higher tree depths may increase a model's complexity and performance. Similarly, by exploring a wide range of learning rates, a rate that ensured efficient model learning while avoiding too‐rapid convergence to local optima was determined (Pan et al. 2022).
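The logic of an exhaustive grid search over these two parameters can be sketched in plain Python. This is not a call to scikit‐learn's GridSearchCV: `grid_search` and `toy_score` are hypothetical names, the scoring function is a placeholder for a cross‐validated metric, and the individual learning‐rate values are assumed, because only the range (0.01 to 0.4) is stated above.

```python
from itertools import product


def grid_search(evaluate, max_depths, learning_rates):
    """Evaluate every parameter combination and return the best one."""
    best_params, best_score = None, float("-inf")
    for depth, rate in product(max_depths, learning_rates):
        score = evaluate(max_depth=depth, learning_rate=rate)
        if score > best_score:
            best_params = {"max_depth": depth, "learning_rate": rate}
            best_score = score
    return best_params, best_score


MAX_DEPTHS = list(range(3, 11))               # tree depths 3 to 10, as in the text
LEARNING_RATES = [0.01, 0.05, 0.1, 0.2, 0.4]  # assumed grid points within 0.01 to 0.4


def toy_score(max_depth, learning_rate):
    # Placeholder for a cross-validated score; it peaks at depth 6 and rate 0.1.
    return 1.0 - 0.01 * abs(max_depth - 6) - abs(learning_rate - 0.1)


best_params, best_score = grid_search(toy_score, MAX_DEPTHS, LEARNING_RATES)
```

In a real run, `evaluate` would train a model with the given parameters and return its mean validation score, so the cost grows with the product of the grid sizes (here 8 × 5 = 40 fits per fold).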
2.4.2. Fivefold Cross‐Validation
In fivefold cross‐validation, the dataset was divided into five equal subsets. In each iteration, four subsets were used to train the model, and the remaining subset was used for validation. This process was repeated five times, so that each subset served as the validation set exactly once and every data point was used for validation. This method improved the reliability of our performance evaluation by ensuring comprehensive coverage across different data subsets (Mahmudah et al. 2021).
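The partitioning scheme can be illustrated with a short sketch that only manipulates sample indices. The function name `k_fold_indices` and the fixed random seed are assumptions; the split shown is a plain shuffled split without stratification, and when the number of samples is not divisible by five the fold sizes differ by at most one.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (training, validation) index pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


splits = k_fold_indices(23)
```

Every index appears in exactly one validation fold, and the training and validation sets of any single split never overlap.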
2.4.3. Loss Function
In the present study, the objective of the XGBoost model was set to binary:logistic, which is well suited to binary classification. With this objective, the model outputs the probability that each observation belongs to the positive class, which is crucial for distinguishing between the two possible outcomes in a binary classification task (Wang et al. 2020), and it is trained by minimising the logistic loss:
$$L = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log\left(\hat{y}_i\right) + \left(1 - y_i\right)\log\left(1 - \hat{y}_i\right)\right] \tag{6}$$
where $n$ is the number of observations in the dataset, $y_i$ is the actual label of the $i$‐th observation (taking a value of 0 or 1) and $\hat{y}_i$ is the predicted probability of the $i$‐th observation belonging to the positive class (1).
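The logistic loss for a set of predictions can be evaluated with a few lines of Python. This is a sketch for illustration: the name `binary_log_loss` is hypothetical, and clipping the probabilities with a small `eps` to avoid log(0) is an assumption rather than a detail taken from this study.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean logistic loss for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Confident, correct predictions give a small loss; confident errors a large one.
good = binary_log_loss([1, 0, 1], [0.9, 0.1, 0.8])
bad = binary_log_loss([1, 0, 1], [0.1, 0.9, 0.2])
```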
2.5. Extraction of Important Features
Feature importance scores for motifs were calculated using XGBoost's built‐in get_score function. This method quantifies each motif's contribution to model accuracy by evaluating its relative influence on prediction outcomes, where higher scores indicate greater predictive importance. To obtain comprehensive importance metrics, we: (1) computed motif importance scores for each of the 45 culture medium‐specific models, and (2) aggregated these values across all models to determine each motif's overall significance.
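The aggregation step can be sketched as a sum over per‐model dictionaries of the kind that XGBoost's get_score returns (feature name mapped to score). The scores below are invented, `aggregate_importance` is a hypothetical helper, and the sketch makes no assumption about which importance type was used in the study.

```python
from collections import Counter


def aggregate_importance(per_model_scores):
    """Sum motif importance over all models and rank motifs by the total.

    `per_model_scores` maps a medium ID to a {motif: score} dictionary.
    A motif that a model never used is simply absent and contributes 0.
    """
    total = Counter()
    for scores in per_model_scores.values():
        total.update(scores)  # adds the scores motif by motif
    return total.most_common()  # (motif, summed score), highest first


# Made-up scores for two medium-specific models.
per_model_scores = {
    "104": {"TTT": 12.0, "CGG": 5.0},
    "78": {"TTT": 8.0, "AAT": 9.0},
}
ranking = aggregate_importance(per_model_scores)
```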
2.6. Bacterial Culture
In the experiment, Collinsella aerofaciens was cultivated using three types of culture media: 78+ (supplemented with heme and vitamin K1), 78 (without heme and vitamin K1) and 104, while Eubacterium ventriosum was cultivated using culture medium 104. Each strain was inoculated into the corresponding culture medium at 3% (V/V) under strict anaerobic conditions, with an uninoculated culture medium serving as the control group. Each group included three replicates. Subsequently, the mixtures were incubated at 37°C with shaking at 300 rpm for 48 h. The absorbance values at 600 nm were measured every 30 min throughout the incubation period. The growth curves were plotted with error bars for statistical analysis.
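A growth curve with error bars can be summarised from the three replicates at each time point as shown in the sketch below. It assumes that the error bars represent the sample standard deviation, which is not specified above; the OD600 readings are invented and `summarise_replicates` is a hypothetical helper.

```python
from statistics import mean, stdev


def summarise_replicates(od_by_replicate):
    """Return a (mean, sample SD) pair for each time point.

    `od_by_replicate` is a list of equal-length OD600 series, one per
    replicate, all sampled at the same time points.
    """
    summary = []
    for readings in zip(*od_by_replicate):  # readings at one time point
        summary.append((mean(readings), stdev(readings)))
    return summary


# Invented readings: three replicates at two consecutive time points.
curve = summarise_replicates([[0.1, 0.2], [0.2, 0.4], [0.3, 0.6]])
```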
3. Results
3.1. Analysis of the Culture Medium–Microorganism Datasets
Analysis of the culture medium–microorganism dataset showed that the culture media examined contained 863 different types of nutrients, reflecting the diversity and variety of nutrients required for bacterial culture. Due to the large number of available nutrients, manual experimentation to determine the optimal medium would be unfeasible. Statistical analysis showed that most culture media contained several nutrients in common, such as NaCl and MgSO4·7H2O, together with one or several specific nutrients, such as flavin adenine dinucleotide (FAD) (Figure 1A). Culture media containing specific nutrients could be used to culture certain types of bacteria, indicating that some bacteria require specific nutrients. For example, the aerobic heterotrophic bacterium Calycomorphotria hydatis is recorded in the MediaDive database as growing only on culture medium ID 1560 at pH 8.5 under laboratory conditions, while a review of the literature showed that this bacterium has also been cultured in M1H NAG ASW medium (Boersma et al. 2020). By comparing their components, we found 14 nutrients in common between the two culture media (Table 1). We counted the occurrence of these 14 nutrients across the culture media recorded in the MediaDive database and found that some were very common, including yeast extract, NaCl and MgSO4·7H2O, each appearing over 1000 times. However, others were used relatively infrequently, such as N‐acetylglucosamine, which appeared only 21 times in the database. Nutrients such as N‐acetylglucosamine, which were present in both media but are used less frequently, may be essential for the growth of C. hydatis. We sorted the compilation of 863 nutrients by overall frequency of occurrence and highlighted those with the highest frequency across different media (Figure 1B). Sodium salts, potassium salts, Mg2+, yeast extract and beef powder appeared frequently across 2369 media, implying that these nutrients may fulfil essential requirements for bacterial growth. 
The most frequently used nutrients were Mg2+ and Na+, indicating that bacteria have relatively consistent demands for these ions, while their needs for nitrogen and carbon sources may be more flexible (Figure 1B). Therefore, in the formulation of bacterial media, selecting suitable nitrogen and carbon sources should be prioritised to meet the needs of various microorganisms.
FIGURE 1.

Analysis of nutritional components in culture media. (A) Nutritional components of 2369 culture media, containing 863 different nutrients, with many media having several special nutrients. Colours ranging from light to dark indicate increasing concentrations. (B) The 50 most‐frequently occurring nutrients were identified.
TABLE 1.
Common nutrients and their occurrence in culture media.
| Nutritional component | Frequency |
|---|---|
| Yeast extract | 1523 |
| NaCl | 1425 |
| MgSO4·7H2O | 1252 |
| MgCl2·6H2O | 871 |
| KCl | 841 |
| FeSO4·7H2O | 787 |
| ZnSO4·7H2O | 766 |
| Glucose | 416 |
| MnSO4·H2O | 311 |
| Na2SO4 | 258 |
| Peptone | 252 |
| CaCl2 | 106 |
| HEPES | 59 |
| N‐Acetylglucosamine | 21 |
3.2. Construction and Analysis of Input Dataset
To construct an accurate predictive model, we selected all 45 media types in the MediaDive database that could be used to culture over 100 species of bacteria (Figure 2A). Media that supported fewer species, many of which were suitable for culturing only one or two species of bacteria, were discarded to ensure a sufficient and balanced dataset. Among these 45 media, medium ID 65 could be used for the culture of 1539 species of bacteria, while medium ID 381 was suitable for 101 species (Figure 2A). We then categorised the frequencies of nutrient components in these 45 media types as low, intermediate or high (Figure 2B). Low‐frequency nutrients, appearing only one to three times, such as MnSO4·H2O, sodium acetate and (NH4)3 citrate, may be used for specific bacteria or under special culture conditions (Tramontano et al. 2018). Intermediate‐frequency nutrients, such as beef extract, FeSO4·7H2O and peptone, are used to provide nitrogen sources and other nutrients to facilitate microbial growth (Madigan et al. 2018). High‐frequency nutrients, including yeast extract, K2HPO4, NaCl, MgSO4·7H2O, starch and glucose, were the most prevalent (Figure 2B). Ions such as K+, Na+ and Mg2+ are present in general‐purpose culture media and play crucial roles in bacterial growth and function. They are indispensable in biological processes, being involved in cellular metabolism and physiological activities, and thereby promote the growth and reproduction of microorganisms (Macêdo et al. 2019).
FIGURE 2.

Culture capacity and nutritional analysis of selected media. (A) Statistical ranking of the number of bacteria each medium can be used to culture. The results show that 45 media cultured more than 100 types of bacteria. (B) Cluster analysis of the 45 media using the nutritional component contents of each medium as input. The analysis yielded four categories.
3.3. Comparison of Different Model Algorithms
Before using machine‐learning techniques to predict the most suitable culture media for bacterial culture, choosing an appropriate algorithm was a crucial step. This not only achieved better accuracy but also prevented overfitting. Considering their diverse strengths and limitations, we compared several commonly used algorithms: XGBoost, RF, CART, SVM and KNN. To comprehensively evaluate these algorithms, we tested all 45 culture media types using multiple modelling approaches and compared their averaged performance metrics. The test results showed that XGBoost demonstrated superior performance across nearly all metrics, achieving 93.0% accuracy, 92.2% precision, 93.0% recall, 92.3% F1 score and a 65.1% AUPRC score. While KNN (with 92.3% accuracy, 92.5% precision, 92.3% recall, 92.3% F1 score and 57.9% AUPRC) and RF (with 92.8% accuracy, 91.7% precision, 92.8% recall, 91.7% F1 score and 61.4% AUPRC) showed competitive performance, they fell slightly short of XGBoost in overall metrics. CART trailed further behind (with 90.2% accuracy, 90.0% precision, 90.2% recall, 90.1% F1 score and 41.6% AUPRC) and SVM exhibited the weakest results (with 89.6% accuracy, 88.0% precision, 89.6% recall, 87.3% F1 score and 26.1% AUPRC) (Figure 3 and Table S1). Given its balanced and superior performance across all key indicators, particularly in handling imbalanced data, XGBoost emerged as the optimal choice, ensuring both high detection rates and reliable positive‐class predictions for practical applications.
FIGURE 3.

Comparison of different algorithms.
3.4. Model Training and Validation
In this study, we used the XGBoost algorithm along with features calculated from 16S rRNA sequences to construct models. The 16S rRNA sequence is crucial for bacterial identification and is widely applied in species classification (Raghava et al. 2000). Ultimately, we collected 26,271 16S rRNA sequences and developed 45 binary classification models for the prediction of growth on culture media, naming each model after its respective medium ID.
Subsequently, we evaluated the predictive performance of the models, and the results were presented visually using a heat map (Figure 4A). As shown in Figure 4A, accuracy was 76.9%–99.8%, precision was 76.6%–99.8%, recall was 77.0%–99.9% and F1 scores were 76.8%–99.9%. These data underscored the high accuracy and reliability of most models in predicting suitable culture media for bacteria. With all values above 70% and most above 80%, the models demonstrated effective predictive performance for most bacteria.
FIGURE 4.

Performance metrics and ROC curve analysis of the models. (A) Model performance analysis, including accuracy, precision, F1 score and recall. Colours ranging from light to dark indicate increasing values. (B) Model ROC curve analysis, where a larger area under the ROC curve (AUC) indicates better model performance.
Receiver operating characteristic (ROC) curves were generated, and the area under the curve (AUC) for each model was computed to evaluate their diagnostic capability in identifying suitable culture media for bacteria. The AUC values were 85–100%, with the majority exceeding 90% (Figure 4B), indicating their exceptional classification capability. Significantly, models J386, J50 and J58 had a perfect AUC value of 100%. Models 84, 11, 252, J42, J43, 83, 987, 553 and 554 also had high predictive accuracy, with AUC values of 99%. Even the models with the lowest AUC values reached a value of 78%, signifying considerable predictive capability. The models with high AUC values not only affirmed the effectiveness of our approach but also underscored the vast potential of machine learning in microbial culture analysis.
3.5. Extraction of Important Features
To elucidate the mechanism of the above models that had accurate prediction capability, we used the get_score function to filter out key features among the 16S rRNA sequences of different bacteria. Due to differences in the 16S rRNA sequences, conserved and variable regions can be used for species classification and can effectively distinguish between species. Through systematic comparison, we found that splitting 16S rRNA sequences into 3‐mers yielded higher accuracy and lower computational cost (Table S2). Thus, these 3‐mer motifs were used to calculate importance scores in the model. After accumulation of the importance of features across 45 different culture medium suitability models, we found that the models paid more attention to five motifs in 16S rRNA: TTT, TTG, AAT, AGT and CGG (Figure 5A). Furthermore, these five motifs were identified frequently in each of the 45 models (Figure 5B). The five motifs on 45 16S rRNA sequences selected at random from each of the 45 models are shown in Figure S1. We randomly selected 16S rRNA sequences from media ID 104, 78 and 693. The results show that the TTT, TTG, AAT, AGT and CGG motifs were all within the variable regions (Figure 5C). These observations indicate that the model can effectively learn the characteristics of the variable regions in the 16S rRNA sequences of different bacteria, enabling differentiation between different strains and matching to the appropriate culture media.
FIGURE 5.

Analysis of motif importance across various models. (A) Summed importance of various motifs across all models, with the most important highlighted in red. (B) The importance of motifs in each model displayed separately. Colours ranging from light to dark indicate increasing values. (C) Input sequences were selected randomly, the variable regions were labelled and the distributions of various motifs were depicted.
3.6. Model Testing
To test the versatility and accuracy of our model further, we applied it to predict suitable culture media for specific bacterial species. Through literature review, we selected 10 bacteria from 73 components of the human gut microbiome (Zheng et al. 2022). Three of the selected bacteria, Roseburia faecis, Ruminococcus torques 1 and Anaerostipes hadrus (Ezaki et al. 1994; Duncan et al. 2006; Kant et al. 2015), are difficult to culture in the laboratory. The remaining seven were Faecalibacterium prausnitzii 2 (Barcenilla et al. 2000), Parasutterella excrementihominis (Nagai et al. 2009), Eubacterium ventriosum (Varel et al. 1995), Collinsella aerofaciens (Kageyama et al. 1999), Coprococcus comes (Van de Merwe and Stegeman 1985), Dorea formicigenerans (Shen et al. 2022) and Blautia obeum 1 (Hatziioanou et al. 2017). Specific strain information can be found in Table 2. We obtained the 16S rRNA sequences of these 10 bacteria from the National Center for Biotechnology Information (NCBI) database. Together with data from the literature, we summarised the culture media that support the growth of these strains (Table 3).
TABLE 2.
Strain information.
| Bacterium | Strain |
|---|---|
| Roseburia faecis | M88/1 |
| Ruminococcus torques 1 | GIFU 12126 |
| Anaerostipes hadrus | DSM 3319 |
| Faecalibacterium prausnitzii 2 | A2‐165 |
| Parasutterella excrementihominis | YIT 11859 |
| Eubacterium ventriosum | ATCC 27560 |
| Collinsella aerofaciens | ATCC 25986 |
| Coprococcus comes | 27758 |
| Dorea formicigenerans | ATCC 27755 |
| Blautia obeum 1 | ATCC 29174 |
TABLE 3.
Culture media of selected bacteria.
| Bacterium | Report | Prediction results |
|---|---|---|
| Roseburia faecis | M2GSC, Selenomonas‐like (Zheng et al. 2022) | NA |
| Ruminococcus torques 1 | GAM (Zheng et al. 2022) | NA |
| Anaerostipes hadrus | NA | 110 |
| Faecalibacterium prausnitzii 2 | M2GSC (Zheng et al. 2022) | NA |
| Parasutterella excrementihominis | Anaerobe basal agar, PY, GAM (Zheng et al. 2022) | NA |
| Eubacterium ventriosum | NA | 104 |
| Collinsella aerofaciens | EG agar (Zheng et al. 2022) | 104, 78, J14 |
| Coprococcus comes | NA | 104 |
| Dorea formicigenerans | NA | 693, 78 |
| Blautia obeum 1 | BHI (Zheng et al. 2022) | 104 |
Note: The table includes data for 10 bacteria. Entries marked as ‘NA’ indicate information not mentioned in the literature. The ‘Report’ column refers to the media mentioned in the literature as capable of supporting bacterial growth. The ‘Prediction results’ column refers to the media predicted by the model to support bacterial growth.
Predictive analysis using our model determined that 6 of the 10 bacteria could grow on 1 or more of the 45 common culture media (Table 3). Specifically, A. hadrus was predicted to grow on medium ID 110; E. ventriosum, C. comes and B. obeum 1 were predicted to grow on medium ID 104; C. aerofaciens was predicted to grow on media ID 104, 78 and J14; and D. formicigenerans was predicted to grow on media ID 693 and 78. We also downloaded from the MediaDive database the culture media and the bacteria that they support (Table S3). If the model returns no prediction for a bacterium, this table can be consulted to check whether a suitable culture medium has been recorded.
To verify these predictions experimentally, we tested the growth of E. ventriosum DSM 3988 and C. aerofaciens DSM 13712 on media ID 104 and ID 78. The MediaDive database shows that medium ID 78 can selectively be supplemented with heme and vitamins K1 and K3 from the vitamin K group. Therefore, we cultured the bacteria on ID 78+ (supplemented with heme and vitamin K1) and ID 78 (without heme and vitamin K1), as well as on the predicted ID 104 medium with heme and vitamin K1. We measured the optical density at 600 nm (OD600) every 30 min to assess bacterial growth and plotted the growth curves. The results show that C. aerofaciens grew well on media ID 78 and 78+, and E. ventriosum grew well on medium ID 104, confirming the reliability of the predictions of our model (Figure 6).
FIGURE 6.

Bacterial growth curves.
4. Discussion
In this study, we combined the XGBoost algorithm with 16S rRNA sequence features and developed a series of 45 binary classification models based on data from the MediaDive database to predict the adaptability of bacterial growth on 45 different culture media. These models provide a new means of selecting suitable media for novel, unknown bacterial strains, with the F1 scores of most models exceeding 90%. Compared with traditional experimental methods that require laborious individual testing of each medium, the AI‐driven prediction system identifies suitable culture media within seconds per microbial strain. It also enhances biosafety, because reduced hands‐on experimentation minimises laboratory exposure and validation efforts can be focused on high‐confidence predictions. Furthermore, although all of the machine‐learning methods evaluated are open source and require only seconds for both model construction and medium prediction, XGBoost outperformed the other algorithms across the overall evaluation metrics.
The MediaDive database contains data on over 2000 culture media capable of supporting the growth of more than 30,000 bacterial species. The database shows that some media can support the growth of hundreds of bacterial species, while most media can support only one or two specific species, resulting in an extremely imbalanced dataset. Although many data points are not included, the existing data in the MediaDive database is still sufficiently vast to support the training of our models. Therefore, to ensure a sufficient and balanced dataset to train the models, we selected 45 culture media capable of supporting the growth of over 100 bacterial species.
The 16S rRNA sequence can be used to represent a group of similar bacteria. Our model uses the frequencies of different 3‐mers from 16S rRNA as input; 3‐mers are computationally efficient and can effectively avoid data sparsity issues, retaining enough information without being overly long, which helped to improve the performance and accuracy of the model. To process these 16S rRNA sequences, we used iLearnPlus, which employs a sliding‐window technique to process the sequences. Specifically, iLearnPlus segments the 16S rRNA sequence into all possible 3‐mer fragments and calculates the frequency of each fragment. This approach effectively captures local patterns and features within the sequence, providing strong support for subsequent classification tasks. 16S rRNA performed well in classification tasks, effectively distinguishing between different strains. Most models using 16S rRNA also showed high accuracy, with success rates exceeding 90%.
Most of the models constructed performed well, but some, such as those for media 514 and 830, had accuracy rates of around 70%. We examined these media and found that their nutritional components are quite flexible. For example, medium 514 can be modified by changing the basic formula, such as adding more inorganic salts, to create seven different variants (media 514a–514g), each capable of supporting the growth of different bacteria. Medium 830 can be transformed into four different variants (media 830a–830d). This flexibility may lead to instability in the performance of the model: different formulation variants can support the growth of different bacterial species, which increases the diversity and complexity of the dataset and in turn reduces the accuracy of the model.
In future work, existing prediction models will be refined and expanded to cover a broader range of culture media types and more diverse bacterial communities. This will significantly improve the prediction accuracy and generalizability of the model, address the limitations of current research methods and provide richer and more precise analytical tools for microbiological research.
Author Contributions
Jianhan Liu: methodology, software, data curation, validation, formal analysis, visualization, writing – review and editing. Guoshun Xu: software, visualization. Wuge Liu: formal analysis. Tuoyu Liu: software. Yanjun Li: data curation, formal analysis. Tao Tu: supervision. Huiying Luo: supervision, funding acquisition. Ningfeng Wu: supervision. Bin Yao: supervision, funding acquisition. Jian Tian: conceptualization, formal analysis, funding acquisition. Jie Zhang: conceptualization, investigation, project administration, writing – review and editing. Feifei Guan: conceptualization, methodology, investigation, formal analysis, visualization, project administration, writing – review and editing.
Ethics Statement
The authors have nothing to report.
Consent
All authors provided their consent for the publication.
Conflicts of Interest
The authors declare no conflicts of interest.
Supporting information
Figure S1: Distribution of 5 motifs on the 16S rRNA.
Table S1: mbt270245‐sup‐0002‐TableS1.xlsx.
Acknowledgements
We would like to thank Dr. Su Xiaoyun and Dr. Hao Zhenzhen from the Institute of Animal Sciences, Chinese Academy of Agricultural Sciences for their guidance and help in the bacterial culture experiment. This study was supported by the National Key R&D Program of China [2022YFC2105500], the Agricultural Science and Technology Innovation Program [CAAS‐ZDRW202304] and the China Agriculture Research System of MOF and MARA (CARS‐41).
Liu, J., Xu G., Liu W., et al. 2025. “MediaMatch: Prediction of Bacterial Growth on Different Culture Media Using the XGBoost Algorithm.” Microbial Biotechnology 18, no. 10: e70245. 10.1111/1751-7915.70245.
Funding: This work was supported by the Agricultural Science and Technology Innovation Program (CAAS‐ZDRW202304), the China Agriculture Research System of MOF and MARA (CARS‐41) and the National Key R&D Program of China (2022YFC2105500).
Jianhan Liu and Guoshun Xu contributed equally to this work.
Contributor Information
Jie Zhang, Email: zhangjie09@caas.cn.
Feifei Guan, Email: guanfeifei@caas.cn.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request. The datasets, models and code used in this study are available on GitHub: https://github.com/liujianhan12/MicroBoost.
References
- Acito, F. 2023. Classification and Regression Trees, 169–191. Springer Nature Switzerland.
- Akin, M., Eyduran S. P., Eyduran E., and Reed B. M. 2020. “Analysis of Macro Nutrient Related Growth Responses Using Multivariate Adaptive Regression Splines.” Plant Cell Tissue and Organ Culture 140: 661–670.
- Ashino, K., Sugano K., Amagasa T., and Ying B. W. 2019. “Predicting the Decision Making Chemicals Used for Bacterial Growth.” Scientific Reports 9: 7251.
- Barcenilla, A., Pryde S. E., Martin J. C., et al. 2000. “Phylogenetic Relationships of Butyrate‐Producing Bacteria From the Human Gut.” Applied and Environmental Microbiology 66: 1654–1661.
- Boersma, A. S., Kallscheuer N., Wiegand S., et al. 2020. “Alienimonas californiensis gen. nov. sp. nov., a Novel Planctomycete Isolated From the Kelp Forest in Monterey Bay.” Antonie van Leeuwenhoek International Journal of General and Molecular Microbiology 113: 1751–1766.
- Breiman, L. 2001. “Random Forests.” Machine Learning 45: 5–32.
- Budiman, F. 2019. “SVM‐RBF Parameters Testing Optimization Using Cross Validation and Grid Search to Improve Multiclass Classification.” Scientific Visualization 11: 80–90.
- Chen, T. Q., and Guestrin C. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). Association for Computing Machinery.
- Chen, Z., Liu X., Zhao P., et al. 2022. “iFeatureOmega: An Integrative Platform for Engineering, Visualization and Analysis of Features From Molecular Sequences, Structural and Ligand Data Sets.” Nucleic Acids Research 50: W434–W447.
- Chen, Z., Zhao P., Li C., et al. 2021. “iLearnPlus: A Comprehensive and Automated Machine‐Learning Platform for Nucleic Acid and Protein Sequence Analysis, Prediction and Visualization.” Nucleic Acids Research 49: 19.
- Chicco, D., and Jurman G. 2020. “The Advantages of the Matthews Correlation Coefficient (MCC) Over F1 Score and Accuracy in Binary Classification Evaluation.” BMC Genomics 21: 6.
- Clarridge, J. E., III. 2004. “Impact of 16S rRNA Gene Sequence Analysis for Identification of Bacteria on Clinical Microbiology and Infectious Diseases.” Clinical Microbiology Reviews 17: 840–862.
- Cortes, C., and Vapnik V. 1995. “Support‐Vector Networks.” Machine Learning 20: 273–297.
- Cover, T., and Hart P. 1967. “Nearest Neighbor Pattern Classification.” IEEE Transactions on Information Theory 13: 21–27.
- Davis, J., and Goadrich M. 2006. “The Relationship Between Precision‐Recall and ROC Curves.” In Proceedings of the 23rd International Conference on Machine Learning, 233–240. Association for Computing Machinery.
- Duncan, S. H., Aminov R. I., Scott K. P., Louis P., Stanton T. B., and Flint H. J. 2006. “Proposal of Roseburia faecis sp. nov., Roseburia hominis sp. nov. and Roseburia inulinivorans sp. nov., Based on Isolates From Human Faeces.” International Journal of Systematic and Evolutionary Microbiology 56: 2437–2441.
- Ezaki, T., Li N., Hashimoto Y., Miura H., and Yamamoto H. 1994. “16S Ribosomal DNA Sequences of Anaerobic Cocci and Proposal of Ruminococcus hansenii comb. nov. and Ruminococcus productus comb. nov.” International Journal of Systematic and Evolutionary Microbiology 44: 130–136.
- Guerrero‐Ferreira, R. C., and Nishiguchi M. K. 2011. “Bacterial Biodiversity in Natural Environments.” In The Importance of Biological Interactions in the Study of Biodiversity, edited by J. Lopez Pujol, 3–14. London: IntechOpen.
- Hatziioanou, D., Gherghisan‐Filip C., Saalbach G., et al. 2017. “Discovery of a Novel Lantibiotic Nisin O From Blautia obeum A2‐162, Isolated From the Human Gastrointestinal Tract.” Microbiology 163: 1292–1305.
- Janda, J. M., and Abbott S. L. 2007. “16S rRNA Gene Sequencing for Bacterial Identification in the Diagnostic Laboratory: Pluses, Perils, and Pitfalls.” Journal of Clinical Microbiology 45: 2761–2764.
- Kageyama, A., Benno Y., and Nakase T. 1999. “Phylogenetic Evidence for the Transfer of Eubacterium lentum to the Genus Eggerthella as Eggerthella lenta gen. nov., comb. nov.” International Journal of Systematic Bacteriology 49: 1725–1732.
- Kaiser, C., Peuker T., Bauch T., Ellert A., and Luttmann R. 2007. “PAT – Process Analytical Technology in Cultivation Processes With Recombinant Escherichia coli.” IFAC Proceedings Volumes 40: 267–272.
- Kant, R., Rasinkangas P., Satokari R., Pietilä T. E., and Palva A. 2015. “Genome Sequence of the Butyrate‐Producing Anaerobic Bacterium Anaerostipes hadrus PEL 85.” Genome Announcements 3: 2.
- Koblitz, J., Halama P., Spring S., et al. 2023. “MediaDive: The Expert‐Curated Cultivation Media Database.” Nucleic Acids Research 51: D1531–D1538.
- Ma, Y., Pan H., Qian G., et al. 2022. “Prediction of Transmission Line Icing Using Machine Learning Based on GS‐XGBoost.” Journal of Sensors 2022: 2753583.
- Macêdo, W. V., Sakamoto I. K., Azevedo E. B., and Damianovic M. H. R. Z. 2019. “The Effect of Cations (Na+, Mg2+, and Ca2+) on the Activity and Structure of Nitrifying and Denitrifying Bacterial Communities.” Science of the Total Environment 679: 279–287.
- Madigan, M., Sattley W., Bender K., Stahl D., and Buckley D. 2018. Brock Biology of Microorganisms: Global Edition. Pearson Deutschland.
- Mahmudah, K. R., Indriani F., Takemori‐Sakai Y., Iwata Y., Wada T., and Satou K. 2021. “Classification of Imbalanced Data Represented as Binary Features.” Applied Sciences 11: 7825.
- Nagai, F., Morotomi M., Sakon H., and Tanaka R. 2009. “Parasutterella excrementihominis gen. nov., sp. nov., a Member of the Family Alcaligenaceae Isolated From Human Faeces.” International Journal of Systematic and Evolutionary Microbiology 59: 1793–1797.
- Overmann, J., Abt B., and Sikorski J. 2017. “Present and Future of Culturing Bacteria.” Annual Review of Microbiology 71: 711–730.
- Pan, S., Zheng Z., Guo Z., and Luo H. 2022. “An Optimized XGBoost Method for Predicting Reservoir Porosity Using Petrophysical Logs.” Journal of Petroleum Science and Engineering 208: 109520.
- Raghava, G. P. S., Solanki R. J., Soni V., and Agrawal P. 2000. “Fingerprinting Method for Phylogenetic Classification and Identification of Microorganisms Based on Variation in 16S rRNA Gene Sequences.” BioTechniques 29: 108–116.
- Saito, T., and Rehmsmeier M. 2015. “The Precision‐Recall Plot Is More Informative Than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets.” PLoS One 10: e0118432.
- Schinn, S. M., Morrison C., Wei W., Zhang L., and Lewis N. E. 2021. “A Genome‐Scale Metabolic Network Model and Machine Learning Predict Amino Acid Concentrations in Chinese Hamster Ovary Cell Cultures.” Biotechnology and Bioengineering 118: 2118–2123.
- Shen, Y., Wang Y. L., Wei X., et al. 2022. “Engineering the Active Site Pocket to Enhance the Catalytic Efficiency of a Novel Feruloyl Esterase Derived From Human Intestinal Bacteria Dorea formicigenerans.” Frontiers in Bioengineering and Biotechnology 10: 14.
- Tramontano, M., Andrejev S., Pruteanu M., et al. 2018. “Nutritional Preferences of Human Gut Bacteria Reveal Their Metabolic Idiosyncrasies.” Nature Microbiology 3: 514–522.
- Van de Merwe, J. P., and Stegeman J. H. 1985. “Binding of Coprococcus comes to the Fc Portion of IgG. A Possible Role in the Pathogenesis of Crohn's Disease?” European Journal of Immunology 15: 860–863.
- Varel, V. H., Tanner R. S., and Woese C. R. 1995. “Clostridium herbivorans sp. nov., a Cellulolytic Anaerobe From the Pig Intestine.” International Journal of Systematic Bacteriology 45: 490–494.
- Varliero, G., Lebre P. H., Stevens M. I., Czechowski P., Makhalanyane T., and Cowan D. A. 2023. “The Use of Different 16S rRNA Gene Variable Regions in Biogeographical Studies.” Environmental Microbiology Reports 15: 216–228.
- Wang, C., Deng C. Y., and Wang S. Z. 2020. “Imbalance‐XGBoost: Leveraging Weighted and Focal Losses for Binary Label‐Imbalanced Classification With XGBoost.” Pattern Recognition Letters 136: 190–197.
- Yongmin, W., and Zhaolie C. 2007. “The Study and Design Methods of Serum‐Free Medium for Animal Cells.” Journal of Chinese Biotechnology 27: 110–114.
- Zhang, C., Liu C., Zhang X., and Almpanidis G. 2017. “An Up‐To‐Date Comparison of State‐of‐the‐Art Classification Algorithms.” Expert Systems with Applications 82: 128–150.
- Zhang, J. M., Harman M., Ma L., and Liu Y. 2022. “Machine Learning Testing: Survey, Landscapes and Horizons.” IEEE Transactions on Software Engineering 48: 1–36.
- Zhang, S., Huo Z., Sun Y., Li F., and Jia B. 2023. “Pilot Maneuvering Performance Analysis and Evaluation With Deep Learning.” International Journal of Aerospace Engineering 2023: 6452129.
- Zheng, W. S., Zhao S. J., Yin Y. H., et al. 2022. “High‐Throughput, Single‐Microbe Genomics With Strain Resolution, Applied to a Human Gut Microbiome.” Science 376: 1068.
