Abstract
Accurately identifying genes responsible for specific functions is a cornerstone of biological research, but current methods are often limited to single-species analyses. Here, we present a novel method, called Genomic and Phenotype-based machine learning for Gene Identification (GPGI), that leverages large-scale, cross-species genomic and phenotypic data for functional gene discovery. Using bacterial rod-shape determination as a case study, we demonstrate GPGI’s ability to rapidly identify key genes. Our approach uses machine learning to predict bacterial shape from protein structural domain profiles, identifying influential domains whose corresponding genes are selected for experimental validation. Focused gene knockouts in Escherichia coli confirmed the critical roles of two genes, pal and mreB, in maintaining rod-shaped morphology. We further validated GPGI’s robustness by demonstrating its consistent performance even with reduced datasets. GPGI thus offers a rapid, accurate, and efficient way to identify multiple key genes associated with complex traits across diverse organisms.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12866-025-04411-8.
Keywords: Genomic and phenotype-based machine learning for gene identification (GPGI), Functional gene identification, Bacterial morphology, Rod shape, Machine learning, Predictive model
Introduction
The accurate identification of key functional genes responsible for specific functions is a cornerstone of biological research. As the fundamental molecular units orchestrating diverse life processes and traits [1, 2], genes are key to unraveling life’s basic principles, tracing biological evolution, and deciphering the mechanisms underlying disease [3, 4]. Traditional methods for functional gene research, such as mutant screening, map-based cloning, and genome-wide association analysis (GWAS), have achieved considerable success in discovering key genes that determine traits like bacterial morphology [5–10]. For instance, the rod shape in many bacteria is maintained by the elongasome, a complex of essential proteins including PBP2, MreBCD, RodA, and others [10]. However, these methods, which are primarily based on the analysis of individual species, have inherent limitations. The random nature of mutations induced by chemical or physical mutagenesis demands considerable time and effort to screen for meaningful variants [7, 8]. Furthermore, constructing and maintaining mutant libraries is not only time-consuming and resource-intensive but also presents formidable challenges in handling lethal mutations and achieving comprehensive gene mutation coverage. While effective, methods like map-based cloning often require protracted timelines [5, 11, 12]. Even widely adopted methods like GWAS necessitate substantial resources for sample collection and sequencing within a single species [13–16]. The common bottleneck of these approaches is their “single-species” analytical framework, which struggles to systematically unveil the multifunctional properties of genes across different organisms and leaves the functions of numerous predicted genes an enigma.
The explosive growth in genome sequencing has led to an abundance of data, with over 430,000 bacterial genomes sequenced to date (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/, July 2025). Yet, the depth of research has not kept pace with the breadth of data. The functional genes of the vast majority of bacteria remain unexplored “data silos,” with research resources highly concentrated on a few “model” organisms [17]. This “data-rich, knowledge-poor” paradigm presents an unprecedented opportunity for artificial intelligence, particularly machine learning (ML), making its application increasingly vital and valuable in modern biological research.
Machine learning has demonstrated a formidable capacity for handling complex biological data. Its landmark application, AlphaFold2, has solved the 50-year-old challenge of predicting protein 3D structures [18]. Inspired by this breakthrough, researchers are increasingly utilizing machine learning and deep learning to predict gene expression levels, analyze epigenetic characteristics, discover antimicrobial peptides, and identify disease-related biomarkers and drug targets [19–22]. The core advantages of ML lie in its exceptional ability to analyze large, complex, non-linear datasets and achieve high-accuracy predictions [23, 24]. Furthermore, ML methods are distinguished by their efficiency, cost-effectiveness, and high-throughput capabilities, which significantly reduce analytical time and labor costs [23–25]. Predicting an organism’s phenotype directly from its genomic data, a feat long considered the Holy Grail of genomics research [25], would revolutionize fields from medicine to environmental science. Currently, ML has made initial progress towards this goal, accurately predicting bacterial traits like optimal pH and Gram type from genomic data, and showing promise in predicting temperature preference and oxygen requirements [26–28]. Recently, Koblitz et al. (2025) also utilized machine learning to explore the relationship between prokaryotic genotypes and phenotypes, successfully constructing predictive models for eight physiological traits [29]. However, these studies have predominantly remained at the “phenotype prediction” level, and to date, there have been no reports of machine learning being accurately applied to the identification of key biological functional genes.
In this study, we developed a novel computational framework named “Genomic and Phenotype-based machine learning for Gene Identification” (GPGI). This method is designed to leverage machine learning techniques to predict phenotypes by analyzing the protein structural domain profiles of thousands of bacterial genomes. The core concept of GPGI is that functionally similar genes across different species share similar protein domain composition. Therefore, protein domains can serve as a “universal functional language” across species, enabling the characterization and correlation of gene functions. Building on this premise, we employ machine learning algorithms to establish a precise model that links structural domains to phenotypes across species. Genes containing highly contributing domains are then selected as candidate genes. The detailed workflow is illustrated in Fig. 1. To systematically validate the effectiveness of the GPGI method, we selected bacterial rod shape as a case study, with the aim of identifying key genes involved in its determination.
Fig. 1.
Workflow of GPGI. First, all genes encoded by each bacterial genome are analyzed using pfam_scan. Then, after counting the number of domains, a domain count matrix is constructed for each species. By employing various machine learning models to train and optimize the relationship between the domain matrix and phenotypes, an optimal predictive model is obtained. Finally, candidate genes are selected based on the ranking of domain importance
Methods
Bacterial genomes and phenotypic data
We compiled a list of sequenced bacterial genomes from the NCBI [30] FTP server (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt) and sourced morphological data from the BacDive database [31]. Most bacteria with documented shape information were categorized as cocci, bacilli (rods), or spirilla. Less common morphologies, such as star-shaped or ring-shaped, were consolidated into an “other” category for analytical convenience due to their limited numbers. This resulted in four primary shape classifications: cocci, rods, spirilla, and other (Supplementary Tables 1 and 2).
To create our analytical dataset, we first intersected the genomic and phenotypic data lists. To manage the immense size of the full genomic database and avoid an impractical bulk download of entire genomes, we developed a program to specifically download only the proteomes (encoded protein sets) for bacteria where matched phenotypic information was available. This targeted approach yielded a curated dataset of 3,750 bacteria with corresponding proteomic and trait information, which was used for all subsequent GPGI analyses.
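The intersection step described above can be sketched as follows. This is a minimal illustration and not the authors' download program: the dictionary layout, the function name, and the placeholder accessions are all assumptions, and real inputs would be parsed from assembly_summary.txt and a BacDive export.

```python
def match_genomes_to_phenotypes(genome_index, shape_by_species):
    """Return {species: (accession, shape)} for species present in both inputs.

    genome_index: dict mapping species name -> assembly accession (assumed layout)
    shape_by_species: dict mapping species name -> recorded cell shape
    """
    # Set intersection of the two key views keeps only species with both data types.
    shared = sorted(genome_index.keys() & shape_by_species.keys())
    return {name: (genome_index[name], shape_by_species[name]) for name in shared}


# Toy records; the accessions are placeholders, not real assemblies.
genomes = {
    "Escherichia coli": "GCF_EXAMPLE_1",
    "Staphylococcus aureus": "GCF_EXAMPLE_2",
    "Species without phenotype": "GCF_EXAMPLE_3",
}
shapes = {
    "Escherichia coli": "rod",
    "Staphylococcus aureus": "coccus",
    "Species without genome": "spiral",
}

matched = match_genomes_to_phenotypes(genomes, shapes)
# Only these proteomes would need to be downloaded.
to_download = [accession for accession, _ in matched.values()]
```

Restricting downloads to the matched accessions is what avoids fetching the full genome collection.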
Construction of the feature matrix
We resolved protein structural domains from the proteome encoded by each genome using the pfam_scan software package with the Pfam-A database (version 33.0) [32]. To create a feature matrix, all the identified protein domains within each bacterium’s genome were concatenated. Each unique concatenated domain string was treated as a feature (column). We then constructed a frequency matrix where each row corresponded to a single bacterium (identified by its Latin name or strain ID) and each column to a unique concatenated domain string, with cell values representing the occurrence count. This matrix served as our primary dataset (Supplementary Table 3).
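A minimal sketch of the frequency matrix may help make its structure concrete. The input format (one list of domain hits per genome) is an assumption made for illustration; parsing real pfam_scan output is not shown, and the function name is hypothetical.

```python
from collections import Counter


def build_domain_matrix(domains_by_genome):
    """Build a genome-by-domain count matrix.

    domains_by_genome: dict mapping genome name -> list of domain hits
    Returns (row_names, column_names, matrix), where matrix[i][j] is the
    number of times column j's domain occurs in row i's genome.
    """
    counts = {name: Counter(hits) for name, hits in domains_by_genome.items()}
    columns = sorted({domain for c in counts.values() for domain in c})
    rows = sorted(counts)
    # A Counter returns 0 for domains that a genome lacks.
    matrix = [[counts[row][domain] for domain in columns] for row in rows]
    return rows, columns, matrix


# Toy input using domain names that appear in Table 1.
hits = {
    "genome_A": ["OmpA", "MreB_Mbl", "OmpA"],
    "genome_B": ["FlaA", "OmpA"],
}
rows, columns, matrix = build_domain_matrix(hits)
```

Each row is one bacterium and each column one feature, matching the layout described for Supplementary Table 3.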
Machine learning model construction and optimization
The construction of our machine learning model proceeded through three sequential stages: feature matrix integration, dataset partitioning, and model training. First, we integrated bacteria with known morphological information and their corresponding protein domain frequency matrices to generate a training dataset with protein domains as features. Next, we employed stratified sampling to randomly partition the entire dataset into training and testing sets at a 3:1 ratio.
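The stratified 3:1 partition can be illustrated with a short sketch. The study itself was carried out in R; the Python function below is a hypothetical stand-in that only shows the idea of splitting each shape class separately so that class proportions are preserved in both sets.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Split sample indices into train/test sets, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)


# Toy labels: 8 rods and 4 cocci give 6 + 3 training and 2 + 1 test samples.
labels = ["rod"] * 8 + ["coccus"] * 4
train_idx, test_idx = stratified_split(labels)
```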
To select the optimal predictive model, we systematically compared the performance of five commonly used machine learning algorithms within the R environment: a decision tree model was built using the rpart() function from the rpart package; a random forest model was trained using the randomForest() function from the randomForest package; support vector machine classification was implemented with the svm() function from the e1071 package, which separates classes in a high-dimensional feature space defined by a kernel function (a radial basis kernel by default in e1071); a non-parametric conditional inference tree was established using the ctree() function from the party package; and a naive Bayes classifier was trained using the naiveBayes() function from the e1071 package.
During the random forest model training, key hyperparameters were set to optimize model performance and analytical efficiency. We explicitly set the number of trees (ntree) to 1000. This value was chosen to balance model stability and computational efficiency: a sufficient number of trees reduces the model’s random fluctuations, ensuring the reliability of feature importance assessment, while avoiding excessive computational overhead. Additionally, we enabled feature importance evaluation (importance = TRUE) and sample proximity matrix calculation (proximity = TRUE) to rank the contribution of each feature and analyze sample relationships after model construction; however, these two parameters do not directly affect the model’s predictive performance. For other core hyperparameters, such as the number of features randomly sampled at each split (mtry) and the minimum size of terminal nodes (nodesize), we used the default values provided in the R package (i.e., mtry is the square root of the total number of features, and nodesize is 1 for classification tasks). The use of default values aimed to simplify and standardize the analytical pipeline, which not only improved computational efficiency but also ensured the comparability and reproducibility of the results across different analysis batches.
We evaluated the performance of our classification models using common machine learning metrics: accuracy, recall, and Kappa coefficient [33, 34], which were calculated from the confusion matrix generated during prediction. Based on these metrics, we continuously refined the model to achieve better classification performance and accuracy by adjusting the proportion of different bacterial types in the training data.
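For reference, the three metrics can be derived from a confusion matrix as in the sketch below. It assumes that the Kappa coefficient is Cohen's kappa, and the toy matrix is invented purely to show the arithmetic; it is not data from this study.

```python
def classification_metrics(confusion):
    """Compute accuracy, per-class recall, and Cohen's kappa.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    accuracy = correct / total

    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]
    recall = [confusion[i][i] / row_totals[i] for i in range(n_classes)]

    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, recall, kappa


# Invented two-class example: rows are true classes, columns are predictions.
toy = [[45, 5], [10, 40]]
accuracy, recall, kappa = classification_metrics(toy)
# accuracy = 0.85, recall = [0.9, 0.8], kappa = 0.7
```

Kappa corrects accuracy for chance agreement, which is why it is more informative than accuracy alone when class sizes are unequal.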
A machine learning approach for identifying key candidate genes
We utilized protein domains as classification features in our machine learning predictions and assessed their importance using the “importance” function of the random forest algorithm [35]. We hypothesized that a higher importance ranking corresponds to a greater influence of that protein domain on bacterial shape. This importance ranking formed the basis for exploring key shared genes using extensive cross-species genomic data.
We selected Escherichia coli BL21 (DE3) as the target strain for subsequent knockout experiments, focusing on the top ten ranked protein domains as key determinants of bacterial rod shape. We then identified the corresponding genes for these domains within the E. coli BL21 (DE3) genome as knockout candidates.
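The candidate-selection step can be sketched with the rod-class importance values and gene assignments reported in Table 1. The function name and record layout are hypothetical; the sketch simply ranks domains by importance, keeps the top ten, and drops domains with no corresponding gene in E. coli.

```python
# (domain, rod-class importance, E. coli gene or None), copied from Table 1.
TABLE1 = [
    ("OmpA", 0.002413, "pal"),
    ("YicC_N", 0.002047, "yicC"),
    ("FlaA", 0.001771, None),
    ("MreB_Mbl", 0.001759, "mreB"),
    ("MotA_ExbB", 0.001662, "tolQ"),
    ("FtsX_ECD", 0.001579, "ftsX"),
    ("Amidase_3", 0.001513, "amiC"),
    ("Plug", 0.001462, "yddB"),
    ("FlgD", 0.001427, None),
    ("RNA_pol_Rpb6", 0.001347, "rpoZ"),
]


def candidate_genes(records, top_n=10):
    """Rank domains by importance and return (domain, gene) pairs found in the target strain."""
    ranked = sorted(records, key=lambda rec: rec[1], reverse=True)[:top_n]
    return [(domain, gene) for domain, _, gene in ranked if gene is not None]


candidates = candidate_genes(TABLE1)
knockout_targets = [gene for _, gene in candidates]
```

Two of the ten domains (FlaA and FlgD) have no counterpart in E. coli, which leaves eight knockout candidates.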
Construction and verification of gene knockout strains
Gene knockout procedures in this study were based on the CRISPR/Cpf1 dual-plasmid gene editing system (pEcCpf1/pcrEG) developed by the research group of Professor Liu at Jiangnan University, using E. coli BL21(DE3) as the host strain [36]. All strains were cultured at 37 °C with selection using kanamycin (50 µg/ml) and spectinomycin (100 µg/ml).
The construction of gene knockout vectors was a two-step process. First, crRNA sequences targeting each gene of interest (sequences in Supplementary Table 4) were cloned into the pcrEG plasmid backbone to create an intermediate plasmid, pcrE-XXX. Subsequently, using the E. coli BL21(DE3) genome as a template, the upstream and downstream homology arm fragments (each ~500 bp) of the target gene were amplified. These homology arms were then recombined with the linearized pcrE-XXX plasmid using a multi-fragment homologous recombination kit (Vazyme Biotech Co., Ltd.) to generate the final knockout vector, pcrEG-XXX. Primers for homology arm amplification and vector linearization (sequences in Supplementary Table 5) were designed for this recombination method, with the linearization primers being vector-F (5’-CGAACGGCAGATCAGAATTTTGTAATAAAA-3’) and vector-R (5’-GGCCAAGCTAACTAAGTTTGAAAAAACAAAC-3’).
Prior to transformation, competent cells were prepared by introducing the helper plasmid pEcCpf1 into wild-type E. coli BL21(DE3), followed by induction of Cpf1 protein expression. The constructed knockout vector pcrEG-XXX (50–100 ng) was then transformed into these pEcCpf1-containing competent cells.
To verify successful gene knockout, 10 single colonies were randomly picked from each transformation plate for initial screening via colony PCR. The PCR system (10 µl) contained: 5 µl of 2×Taq Plus Master Mix (Dye Plus), 0.5 µl of upstream primer test-xxx-F (10 µM), 0.5 µl of downstream primer test-xxx-R (10 µM), 1 µl of cell lysate template, and 3 µl of ddH₂O (verification primer sequences in Supplementary Table 6). The PCR program was: 95 °C for 5 min; 28 cycles of (95 °C for 30 s, 56 °C for 30 s, 72 °C for 20 s); and a final extension at 72 °C for 10 min. Samples yielding PCR products of the expected size via gel electrophoresis were sent to Sangon Biotech (Shanghai) Co., Ltd. for Sanger sequencing to confirm successful gene knockout.
Scanning electron microscopy (SEM) sample preparation and observation
To observe cellular morphology, the verified gene knockout strains and the original E. coli BL21(DE3) were analyzed by SEM. Strains were first activated from −80 °C glycerol stocks by streaking onto LB agar plates containing appropriate antibiotics (50 µg/ml kanamycin and 100 µg/ml spectinomycin) and incubated at 37 °C for 12 h. A single colony was then inoculated into liquid LB medium with antibiotics and grown at 37 °C with shaking at 200 r/min for 8 h.
Cells in the logarithmic growth phase were harvested by centrifugation (4000 r/min, 5 min), washed three times with PBS buffer, and fixed in 2.5% glutaraldehyde solution at 4 °C for 4 to 12 h, with gentle resuspension to prevent cell aggregation.
Fixed samples were thoroughly washed with PBS and subsequently subjected to a graded ethanol dehydration series: 20 min each in 30%, 50%, 70%, 80%, 95%, and 100% ethanol at 4 °C. Finally, the samples were washed twice with 100% acetone for 20 min each to complete dehydration. The dehydrated samples were dried using a critical point dryer with CO₂ to prevent structural damage from surface tension. During this process, acetone was completely replaced by liquid CO₂, the chamber was heated to ~35 °C to bring CO₂ beyond its critical point, and pressure was slowly released to exhaust the gaseous CO₂. Dried samples were retrieved and examined for morphological changes using an SEM. All SEM images presented were captured at 10,000× magnification at the Biotechnology Center of Anhui Agricultural University.
Results
Building a predictive model for bacterial shape using machine learning
We employed a dataset consisting of 3,750 bacteria with matched genomic and trait information to train and test our model using five machine learning algorithms. Both the support vector machine and random forest algorithms performed well. The accuracy of model training reached 91.37% and 87.89%, respectively, with both algorithms yielding kappa coefficients above 0.81 (Fig. 2). The random forest algorithm outperformed the support vector machine algorithm in predicting the test set, achieving a high kappa coefficient of 0.93, indicating a close match between its classifications and the actual data (Fig. 2).
Fig. 2.
Predicting results of five machine learning algorithms. Note: SVM, support vector machine algorithm; CTree, conditional inference tree algorithm; decision_tree, decision tree algorithm; naive_bayes, naive bayes algorithm; random_forest, random forest algorithm; train, training set; test, test set
A novel approach for identifying key genes
Focusing on rod shape as the trait of interest, we utilized protein structural domains as classification features in the machine learning predictions and assessed their importance using the “importance” function of the random forest algorithm to identify potential key genes regulating rod shape. To determine how many features to retain, we used the rfcv function in the randomForest package to perform a five-fold cross-validation of the training model. This analysis generated a model accuracy curve (Fig. 3), where the horizontal axis represents the number of features and the vertical axis represents the error rate. The error rate decreased markedly as the number of features increased, particularly over the first 10 or so features. Beyond 10 features the curve levels off, indicating that adding further features has little impact on model accuracy and that the additional features are of relatively low importance. Based on this analysis, we selected the 10 most important features as the most relevant and informative for predicting bacterial rod shape (Table 1).
The eight candidate genes present in E. coli were knocked out individually. Among all the knockout strains (Supplementary Fig. 1), only Δpal and ΔmreB displayed significant morphological changes compared to the original cells (Fig. 4; Table 1). The original E. coli cells exhibited long or short rod shapes with rounded ends. In contrast, the knockout strain lacking the mreB gene displayed a nearly spherical shape, significantly shorter than the original cells, with no notable change in diameter. The E. coli strain lacking the pal gene exhibited an irregular shape resembling a protoplasmic state with no cell wall (Fig. 4).
Regarding the knockout strains of the remaining six genes, no significant changes in length were observed, although some strains, specifically those lacking yicC, tolQ, amiC, yddB, and rpoZ, displayed alterations in surface morphology, such as folds and depressions (Supplementary Fig. 1 & Supplementary Fig. 2).
Fig. 3.
The model accuracy curve
Table 1.
Candidate domains and genes regulating bacterial shape
| Domain | Coccus | Rod↓ | Spiral | Other | MeanDecreaseAccuracy | MeanDecreaseGini | Gene name | Protein ID |
|---|---|---|---|---|---|---|---|---|
| OmpA | 0.005285 | 0.002413 | 0.005321 | 0.00515 | 0.004262 | 6.238621 | pal | WP_001295306.1 |
| YicC_N | 0.001397 | 0.002047 | 0.001862 | 0.003572 | 0.002016 | 1.886309 | yicC | WP_000621336.1 |
| FlaA | 0.000973 | 0.001771 | 0.011123 | 0.001188 | 0.003407 | 5.90406 | - | - |
| MreB_Mbl | 0.007044 | 0.001759 | 0.00416 | 0.003285 | 0.004038 | 4.691432 | mreB | WP_000913396.1 |
| MotA_ExbB | 0.007018 | 0.001662 | 0.004844 | 0.004291 | 0.004293 | 5.899423 | tolQ | WP_001307062.1 |
| FtsX_ECD | 0.000809 | 0.001579 | 0.004971 | 0.003328 | 0.002283 | 2.331755 | ftsX | WP_001042003.1 |
| Amidase_3 | 0.000697 | 0.001513 | 0.000851 | 0.001527 | 0.00113 | 1.499186 | amiC | WP_000016907.1 |
| Plug | 0.001401 | 0.001462 | 0.001479 | 0.0014 | 0.001435 | 1.521337 | yddB | WP_000832450.1 |
| FlgD | 0.003965 | 0.001427 | 0.005393 | 0.004115 | 0.003401 | 3.466897 | - | - |
| RNA_pol_Rpb6 | 0.000605 | 0.001347 | 0.005913 | 0.001938 | 0.002155 | 4.183334 | rpoZ | WP_000135058.1 |
-, not present in E. coli
Fig. 4.
Morphology of E. coli after knockout of the pal and mreB genes, respectively. Note: The magnification is 10,000×; the scale bar spans ten squares in total
Assessing the stability of the novel approach
We used four shape phenotypes to identify key rod-determining genes and analyzed the stability of the gene identification method with a constant total dataset of 3,750 samples. We assessed the impact of training set size on both prediction accuracy and gene identification accuracy by progressively reducing the number of genomes used. When the training set comprised 50% or more of the total dataset, both the OmpA and MreB_Mbl domains were consistently present among the top ten important domains (Fig. 5A). OmpA consistently ranked first when the training set included more than 1,500 genomes (Supplementary Table 7), and even with only 338 genomes, at least one of the two key structural domains was present within the top ten important domains (Fig. 5A), demonstrating the robustness of our method.
Fig. 5.
Assessing the robustness of the method with varying genome numbers. Note: A Four shapes are employed; B Limited to cocci and bacilli. The upper panel shows the presence or absence of the two validated structural domains among the top ten domains when trained with different numbers of genomes; “+” indicates presence; “-” indicates absence. The lower panel illustrates the influence of genome number on predictive performance
To enhance efficiency, we refined our analysis by focusing solely on cocci and bacilli, reducing the dataset to 2,468 genomes. Even with this smaller dataset, phenotype prediction accuracy consistently exceeded 91% when the training set included more than 741 samples (Fig. 5B). Remarkably, both the OmpA and MreB_Mbl domains were consistently ranked among the top ten important domains when the training set represented 60% or more of the data. The OmpA domain maintained its top ranking with training sets exceeding 988 genomes (Supplementary Table 8). Importantly, our method robustly identified at least one of the two key structural domains even when using as few as 124 genomes for training (Fig. 5B). While this was also observed with only 25 genomes used for training, the results were less stable than with more than 124 genomes.
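The presence/absence summary in the upper panels of Fig. 5 can be reproduced in outline as follows. The rankings below are hypothetical placeholders standing in for the importance rankings obtained at different training-set sizes; only the two validated domain names and the “+”/“-” notation come from the text.

```python
VALIDATED_DOMAINS = ("OmpA", "MreB_Mbl")


def presence_row(ranked_domains, validated=VALIDATED_DOMAINS, top_n=10):
    """Mark each validated domain '+' if it lies within the top_n ranked domains, else '-'."""
    top = set(ranked_domains[:top_n])
    return {domain: "+" if domain in top else "-" for domain in validated}


# Hypothetical rankings (most important first); filler names are placeholders.
filler = [f"domain_{i}" for i in range(1, 12)]
rankings_by_size = {
    "large training set": ["OmpA", "MreB_Mbl"] + filler,
    # OmpA sits at rank 10, MreB_Mbl falls outside the top ten.
    "small training set": filler[:9] + ["OmpA"] + filler[9:] + ["MreB_Mbl"],
}

summary = {size: presence_row(ranking) for size, ranking in rankings_by_size.items()}
```

Repeating this check over progressively smaller random subsamples gives one row of “+”/“-” marks per training-set size.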
Discussion
Cross-species genomic analysis for identifying key genes
Genes are pivotal in expressing life processes and characteristics, holding the key to unraveling the nature of life, exploring organismal development and evolution, and understanding the mechanisms behind disease occurrence. However, current methods for functional gene research primarily focus on analyzing individual species, and new identification methods are crucial for discovering functional genes that have significant effects on phenotypic traits. Our paper establishes a novel approach that leverages the vast amount of published genomic and phenotypic data across species, directly offering a faster and more cost-effective alternative to traditional approaches involving mutant creation and population genetics.
A key challenge addressed by our novel approach is the identification of functionally similar genes across diverse species, even when their sequences differ significantly (orthologous genes) [37]. To achieve this, we utilized protein structural domains as a means to distinguish genes with similar functions in different species [38], leading to a breakthrough in predicting bacterial shape. While some studies have employed similar methods to establish correlations between genomes or proteins and traits, there is currently no conclusive experimental validation confirming a direct relationship between the two [28].
In the initial machine learning prediction model, we utilized protein structural domains as classification features and assessed the significance of each feature using the “importance” function of the random forest algorithm [35, 39, 40]. A higher importance ranking indicated a greater impact of the feature. Based on this importance index, we hypothesized that these highly ranked features represented essential combinations of genes related to bacterial shape. To test this hypothesis, we individually knocked out the first eight representative genes, observing significant changes in the length or morphology of E. coli after knocking out the pal and mreB genes. Specifically, the knockout of mreB transformed the bacteria from rod-shaped to spherical, which is consistent with findings from previous studies [41]. Additionally, the knockout of the yicC, tolQ, amiC, yddB, and rpoZ genes resulted in the formation of folds and depressions on the surface of E. coli. These observations suggest that the key genes identified by our novel approach do indeed play a role in determining the rod shape of bacteria. They also demonstrate the feasibility of using proteomic features from multiple species to identify key genes regulating common traits.
Traditional functional gene identification using mutant libraries requires extensive creation and screening of mutants, often leading to redundant identification of the same genes, wasting considerable effort. Moreover, if the mutant library is generated by physical or chemical mutagenesis, a single strain or plant can have multiple mutations, requiring further extensive efforts to pinpoint the crucial mutation. After extensive screening, only a few key genes may be identified. Finding additional key genes requires further effort, and some of the identified genes may be previously validated ones, leading to a waste of time and resources. GWAS is effective for identifying superior alleles and key genes within a single species, but it is not applicable to identifying key functional genes for shared traits across species. The core advantage of machine learning lies in its exceptional big data analysis capabilities, particularly in handling complex, non-linear data to achieve high prediction accuracy [23, 24]. Additionally, some ML models offer good interpretability, aiding in the identification of key factors [23, 24]. This study harnesses the advantages of machine learning to identify key functional genes for shared shapes across different species, enabling the simultaneous identification of multiple key genes in a single analysis—a feat not achievable with existing techniques, thus demonstrating a significant technological advantage.
Factors influencing bacterial shape prediction accuracy
The relative proportions of different bacterial types have a significant impact on prediction accuracy. Initially, we utilized all 4,847 bacterial species with available genomic and trait information for training and testing the model, but the results were unsatisfactory (data not shown). To address this issue, we rebalanced the dataset by reducing the number of bacilli while proportionally increasing cocci and spirochetes, aiming for a more representative and diverse dataset that could improve prediction accuracy. Bergey’s Manual of Determinative Bacteriology [42] was consulted to manually add 251 species of cocci and 160 species of spirochetes. We refined the training set by manually adjusting the number of strains for cocci, bacilli, and spirochetes and carefully examining the phenotypes, resulting in a final composition of 854 cocci species, 999 bacilli species, 589 spirochetes species, and 373 other bacterial species. The test set included 284 cocci, 331 bacilli, 196 spirochetes, and 124 other bacteria. With this composition, the classification results from the five algorithms improved. Two algorithms, the support vector machine and random forest, achieved high Kappa coefficients in the training set (0.88 and 0.83, respectively), indicating excellent agreement between predicted and true values. While the support vector machine initially showed higher accuracy and recall on the training set, the random forest algorithm emerged as the top performer when predicting the test set, achieving an accuracy of 94.76% and recall rates of 97.18%, 92.75%, 92.75%, and 87.90% for cocci, bacilli, spirochetes, and other bacteria, respectively.
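As a quick consistency check, the class counts reported above can be summed to confirm that they reproduce the 3,750-genome dataset and the 3:1 stratified split described in the Methods. The snippet uses only the numbers stated in the text; the variable names are illustrative.

```python
# Class counts as reported in the text.
train = {"cocci": 854, "bacilli": 999, "spirochetes": 589, "other": 373}
test = {"cocci": 284, "bacilli": 331, "spirochetes": 196, "other": 124}

train_total = sum(train.values())   # 2815
test_total = sum(test.values())     # 935
dataset_total = train_total + test_total  # 3750

overall_ratio = train_total / test_total
# Train-to-test ratio within each class; stratification keeps these close to 3.
per_class_ratio = {name: train[name] / test[name] for name in train}
```

Both the overall ratio and each per-class ratio come out at roughly 3.0, as expected for a stratified 3:1 partition.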
Insights into the genetic basis of bacterial rod shape
The E. coli MreB protein is a main component of the cell cytoskeleton, playing a crucial role in cell vitality and the formation of rod-shaped cell morphology [41]. In E. coli, MreB forms helical filaments underneath the cell membrane, controlling cell wall growth and maintaining cell shape [43, 44]. Mutations in the mreB gene can disrupt cell wall symmetry, leading to bent or abnormally twisted cell morphologies [45]. While a complete knockout of the mreB gene was not achieved in this study (data not shown), previous research suggests it would likely be lethal due to the protein’s essential role in cell wall synthesis and assembly. Some bacteria with partial loss-of-function mutations in mreB were able to survive, likely due to the retention of some cytoskeletal function, but exhibited altered morphologies, appearing as shorter rods or nearly spherical cells (ΔmreB mutants).
The E. coli Pal protein is associated with peptidoglycan, interacting with the BamA protein to participate in the folding and assembly of outer membrane proteins, thereby contributing to outer membrane integrity [46–48]. Deletion of the pal gene can lead to outer membrane abnormalities and increased sensitivity to antibiotics [46]. Given that the outer membrane is a major determinant of bacterial morphology [43], the Δpal mutant likely exhibited altered outer membrane structure in E. coli, resulting in morphological changes, including a near-spherical or irregular shape. While knockout of the yicC, tolQ, amiC, yddB, and rpoZ genes did not cause significant changes in E. coli length or overall shape, the observation of numerous folds and depressions on the bacterial surface suggests these genes might have subtle or combinatorial effects on bacterial phenotype.
While our study successfully identified two key genes influencing rod-shaped morphology in E. coli (pal and mreB), other known morphology-related genes, such as rodZ and the pbp genes [47, 48], were not identified. This discrepancy could be attributed to limitations in the training data. For example, some bacterial shape annotations, derived from online sources or literature, might be inaccurate or incomplete. Additionally, the use of genomes sequenced using older technologies could introduce errors that affect model accuracy. Despite these limitations, our method’s ability to identify two key genes highlights its potential. As more accurate and comprehensive genomes become available, particularly with the advancements in third-generation sequencing, we anticipate that our approach will prove increasingly valuable for the rapid and precise identification of key genes across various biological traits.
Robustness of the novel approach across varying dataset sizes
Identifying functional genes is crucial for studying biological mechanisms at the molecular level, but common methods primarily rely on mutant-based approaches or single-species population analysis. This study establishes a novel method for gene identification that leverages large-scale, cross-species genomic data. Accurate phenotype prediction from genomic information is a prerequisite for gene identification. We achieved this by classifying bacteria into four shape categories (cocci, bacilli, spirilla, and other shapes) based on their genomic information. We then hypothesized that domains ranked highly by weight in the predictive model are also crucial for determining bacterial shape, so that the ranking of domain weights could be used for gene identification. We experimentally knocked out a subset of candidate genes and successfully identified two key shape-determining genes. Because using up to 3,750 bacterial genomes to predict four bacterial shapes may be costly in practical applications, we investigated predictive accuracy with progressively reduced numbers of genomes and phenotypes. Our results show that key rod-shape-determining genes can be accurately identified using either two or four bacterial shape phenotypes. Gene identification accuracy improved with larger genomic datasets, with which both key genes could be identified simultaneously. Even with only 124 genomes, at least one gene encoding a key structural domain could be identified. This evaluation of identification stability with decreasing genome numbers demonstrates the robustness of this novel gene identification method. Now that bacterial genome sequences are readily available, this large-scale, genome-based method for identifying functional genes provides a powerful tool for accelerating the discovery of biological functions.
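The stability evaluation described above can be illustrated with a minimal sketch. It is not the analysis code used in this study (which was written in R): it uses Python with only the standard library, a simulated presence/absence matrix, and a simple difference-in-frequency score as a stand-in for the model-derived domain weights. All function names, the toy data generator, and the subsample sizes are hypothetical and chosen only for illustration.

```python
import random

ROD = "bacilli"  # label used for rod-shaped genomes in this toy example


def domain_scores(profiles, labels):
    """Stand-in importance (an assumption, not the model-derived weight):
    absolute difference in mean domain presence between rods and non-rods."""
    rods = [p for p, y in zip(profiles, labels) if y == ROD]
    others = [p for p, y in zip(profiles, labels) if y != ROD]
    scores = []
    for j in range(len(profiles[0])):
        mean_rod = sum(p[j] for p in rods) / len(rods)
        mean_other = sum(p[j] for p in others) / len(others)
        scores.append(abs(mean_rod - mean_other))
    return scores


def top_k(profiles, labels, names, k):
    """Return the names of the k highest-scoring domains."""
    scores = domain_scores(profiles, labels)
    ranked = sorted(range(len(names)), key=lambda j: scores[j], reverse=True)
    return [names[j] for j in ranked[:k]]


def stratified_subsample(labels, size, rng):
    """Draw genome indices without replacement, about half from each class."""
    rod_idx = [i for i, y in enumerate(labels) if y == ROD]
    other_idx = [i for i, y in enumerate(labels) if y != ROD]
    half = size // 2
    return rng.sample(rod_idx, half) + rng.sample(other_idx, size - half)


def stability_by_size(profiles, labels, names, sizes, k, rng):
    """For each subsample size, report the fraction of the full-data top-k
    domains that is recovered in the subsample's top-k."""
    reference = set(top_k(profiles, labels, names, k))
    results = []
    for size in sizes:
        idx = stratified_subsample(labels, size, rng)
        sub_profiles = [profiles[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        sub_top = set(top_k(sub_profiles, sub_labels, names, k))
        results.append((size, len(reference & sub_top) / k))
    return results


def make_toy_data(n_genomes, n_domains, n_informative, rng):
    """Simulate presence/absence profiles; the first n_informative domains
    are enriched in rods, the remaining domains are uninformative noise."""
    names = [f"domain_{j}" for j in range(n_domains)]
    profiles, labels = [], []
    for i in range(n_genomes):
        label = ROD if i % 2 == 0 else "cocci"
        row = []
        for j in range(n_domains):
            if j < n_informative:
                p = 0.9 if label == ROD else 0.1
            else:
                p = 0.5
            row.append(1 if rng.random() < p else 0)
        profiles.append(row)
        labels.append(label)
    return profiles, labels, names


rng = random.Random(0)
profiles, labels, names = make_toy_data(2000, 50, 5, rng)
for size, overlap in stability_by_size(profiles, labels, names, [100, 500, 2000], 5, rng):
    print(f"{size} genomes: top-5 overlap with full-data ranking = {overlap:.2f}")
```

In a real analysis, the scoring function would be replaced by the importance measure of the trained classifier, and the overlap would be tracked for the top 10 domains as the training set grows.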
Conclusion
We developed GPGI, a novel cross-species machine learning approach that successfully identifies key genes by predicting phenotypes from protein structural domain profiles. Using bacterial rod-shape as a case study, GPGI rapidly identified pal and mreB, whose critical roles were experimentally confirmed in E. coli. The method demonstrated robustness even with reduced datasets, offering a promising, scalable, and efficient alternative to traditional gene identification techniques. GPGI has the potential to accelerate the discovery of genes underlying complex traits across diverse microorganisms.
Limitations
The accuracy of GPGI is inherently dependent on the quality and comprehensiveness of the input genomic and phenotypic data. Biases in data representation could lead to the spurious identification of genes as key candidates if they are primarily associated with co-occurring traits rather than the direct target phenotype, or if they are merely tightly linked to actual causal genes. Additionally, the overrepresentation of certain large gene families in the dataset might cause the machine learning model to overlook some genuinely important, less prevalent genes during the ‘black box’ analysis. Furthermore, experimental validation was confined to E. coli and a single trait. Given the relatively smaller gene count in bacteria compared to more complex organisms, the generalizability of this bacterial genome-based model for key functional gene identification in eukaryotes such as fungi, protozoa, plants, and animals remains to be established. These challenges necessitate further methodological refinement and investigation in future studies.
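One way to probe the co-occurrence concern raised above is to check whether a top-ranked domain has partners with nearly identical presence/absence patterns across genomes, since the model cannot distinguish such domains from one another. The following is a minimal sketch under stated assumptions: it is written in Python with the standard library only, uses Jaccard similarity as the co-occurrence measure, and relies on hypothetical function names, a hypothetical threshold, and a tiny made-up matrix. It was not part of the analysis in this study.

```python
def presence_sets(profiles, names):
    """Map each domain name to the set of genome indices in which it occurs."""
    return {
        name: {i for i, row in enumerate(profiles) if row[j] > 0}
        for j, name in enumerate(names)
    }


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cooccurring_partners(profiles, names, candidates, threshold=0.9):
    """For each candidate domain, list the other domains whose presence
    pattern across genomes is at least `threshold` similar (hypothetical cutoff)."""
    sets = presence_sets(profiles, names)
    partners = {}
    for cand in candidates:
        partners[cand] = [
            other
            for other in names
            if other != cand and jaccard(sets[cand], sets[other]) >= threshold
        ]
    return partners


# Toy matrix: rows are genomes, columns are domains (1 = present, 0 = absent).
toy_names = ["domain_A", "domain_B", "domain_C"]
toy_profiles = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
]
print(cooccurring_partners(toy_profiles, toy_names, ["domain_A"]))
```

Candidates that share their presence pattern with many partners would warrant extra caution, because a knockout of any one of the corresponding genes tests only one of several statistically equivalent explanations.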
Supplementary Information
Supplementary Material 1. Supplementary Fig. 1 Verification of gene knockout by sequencing analysis.
Supplementary Material 2. Supplementary Fig. 2 Morphological observation of E. coli after gene knockout. Note: The magnification is 10,000×; the ruler spans ten squares in total.
Supplementary Material 3. Supplementary Table 1 Optimal model grouping list.
Supplementary Material 4. Supplementary Table 2 Species composition and genome source information.
Supplementary Material 5. Supplementary Table 3 Data Structure.
Supplementary Material 6. Supplementary Table 4 crRNA sequences.
Supplementary Material 7. Supplementary Table 5 Sequences of primers used for amplifying homology arms.
Supplementary Material 8. Supplementary Table 6 Sequences of primers used for colony PCR verification.
Supplementary Material 9. Supplementary Table 7 Influence of Training Set Size on the Stability of the Top 10 Domains.
Supplementary Material 10. Supplementary Table 8 Influence of Training Set Size (Cocci and Bacilli Only) on the Stability of the Top 10 Domains.
Supplementary Material 11. Supplementary Table 9 Download URLs for the bacterial strains used in this study.
Supplementary Material 12. Dataset S1–S17: Detailed information on feature values derived from the analysis of incrementally larger genome datasets using four shape phenotypes.
Supplementary Material 13. Dataset S18–S35: Detailed information on feature values derived from the analysis of incrementally larger genome datasets using two shape phenotypes.
Authors’ contributions
G.H. conceptualized the project. G.H. and Q.L. designed the experiments. Q.L., Y.Z., H.L., and J.S. performed most of the experiments and data analysis. C.X., Y.X., and S.L. performed part of the experiments. Y.Z. and G.H. wrote the manuscript. G.H. supervised the project.
Funding
This work was supported by grants from the Anhui Provincial Peak Discipline Project of Biology, the Germplasm Resource Bank Construction Project of the Anhui Provincial Department of Agriculture and Rural Affairs, and the Joint Research Project for Elite Maize Varieties of Anhui Province.
Data availability
The protein sequence datasets analysed in this study are publicly available in the NCBI RefSeq repository (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/); specific accession numbers and download shell commands are listed in Supplementary Tables 2 and 9. The curated protein sequence dataset of the 3,750 bacterial species used in the final analysis is also available in a figshare repository (10.6084/m9.figshare.30022573), and the corresponding machine learning datasets, analysis code, and results are available in a separate figshare repository (10.6084/m9.figshare.29975167.v1). All data and R code are also available from the corresponding author (G.H.) upon reasonable request.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Yunhua Zhang, Email: yunhua9681@163.com.
Guomin Han, Email: guominhan@ahau.edu.cn.
References
- 1. Yuan Y, Huo Q, Zhang Z, Wang Q, Wang J, Chang S, et al. Decoding the gene regulatory network of endosperm differentiation in maize. Nat Commun. 2024;15:34.
- 2. Huang Y, Wang H, Zhu Y, Huang X, Li S, Wu X, et al. THP9 enhances seed protein content and nitrogen-use efficiency in maize. Nature. 2022;612:292–300.
- 3. Liu X, Wang Y, Lu H, Li J, Yan X, Xiao M, et al. Genome-wide analysis identifies NR4A1 as a key mediator of T cell dysfunction. Nature. 2019;567:525–9.
- 4. Schmitz MT, Sandoval K, Chen CP, Mostajo-Radji MA, Seeley WW, Nowakowski TJ, et al. The development and evolution of inhibitory neurons in primate cerebrum. Nature. 2022;603:871–7.
- 5. Song X-J, Huang W, Shi M, Zhu M-Z, Lin H-X. A QTL for rice grain width and weight encodes a previously unknown RING-type E3 ubiquitin ligase. Nat Genet. 2007;39:623–30.
- 6. Wang X, Wang H, Liu S, Ferjani A, Li J, Yan J, et al. Genetic variation in ZmVPP1 contributes to drought tolerance in maize seedlings. Nat Genet. 2016;48:1233–41.
- 7. Datta MS, Kishony R. A spotlight on bacterial mutations for 75 years. Nature. 2018;563:633–44.
- 8. Price MN, Wetmore KM, Waters RJ, Callaghan M, Ray J, Liu H, et al. Mutant phenotypes for thousands of bacterial genes of unknown function. Nature. 2018;557:503–9.
- 9. Singh SP, Montgomery BL. Determining cell shape: adaptive regulation of cyanobacterial cellular differentiation and morphology. Trends Microbiol. 2011;19:278–85.
- 10. Liu X, Biboy J, Consoli E, Vollmer W, den Blaauwen T. MreC and MreD balance the interaction between the elongasome proteins PBP2 and RodA. PLoS Genet. 2020;16:e1009276.
- 11. Wang E, Wang J, Zhu X, Hao W, Wang L, Li Q, et al. Control of rice grain-filling and yield by a gene with a potential signature of domestication. Nat Genet. 2008;40:1370–4.
- 12. Jin J, Huang W, Gao J-P, Yang J, Shi M, Zhu M-Z, et al. Genetic control of rice plant architecture under domestication. Nat Genet. 2008;40:1365–9.
- 13. Wang Y, Wang X, Sun S, Jin C, Su J, Wei J, et al. GWAS, MWAS and mGWAS provide insights into precision agriculture based on genotype-dependent microbial effects in foxtail millet. Nat Commun. 2022;13:5913.
- 14. Abdellaoui A, Yengo L, Verweij KJH, Visscher PM. 15 years of GWAS discovery: realizing the promise. Am J Hum Genet. 2023;110:179–94.
- 15. Cano-Gamez E, Trynka G. From GWAS to function: using functional genomics to identify the mechanisms underlying complex diseases. Front Genet. 2020. 10.3389/fgene.2020.00424.
- 16. Han G, Li C, Xiang F, Zhao Q, Zhao Y, Cai R, et al. Genome-wide association study leads to novel genetic insights into resistance to Aspergillus flavus in maize kernels. BMC Plant Biol. 2020;20:206.
- 17. Jensen PA. Ten species comprise half of the bacteriology literature, leaving most species unstudied. bioRxiv. 2025;01(04):631297. 10.1101/2025.01.04.631297.
- 18. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9.
- 19. Geng H, Wang M, Gong J, Xu Y, Ma S. An Arabidopsis expression predictor enables inference of transcriptional regulators for gene modules. Plant J. 2021;107:597–612.
- 20. Zhang Y, Hamada M. DeepM6ASeq: prediction and characterization of m6A-containing sequences using deep learning. BMC Bioinformatics. 2018;19:524.
- 21. Ma Y, Guo Z, Xia B, Zhang Y, Liu X, Yu Y, et al. Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nat Biotechnol. 2022;40:921–31.
- 22. Camacho DM, Collins KM, Powers RK, Costello JC, Collins JJ. Next-generation machine learning for biological networks. Cell. 2018;173:1581–92.
- 23. Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol. 2022;23:40–55.
- 24. Jiang Y, Luo J, Huang D, Liu Y, Li D. Machine learning advances in microbiology: a review of methods and applications. Front Microbiol. 2022. 10.3389/fmicb.2022.925454.
- 25. Karlsen ST, Rau MH, Sánchez BJ, Jensen K, Zeidan AA. From genotype to phenotype: computational approaches for inferring microbial traits relevant to the food industry. FEMS Microbiol Rev. 2023;47:fuad030.
- 26. Edirisinghe JN, Goyal S, Brace A, Colasanti R, Gu T, Sadhkin B, et al. Machine learning-driven phenotype predictions based on genome annotations. bioRxiv. 2023;08(11):552879. 10.1101/2023.08.11.552879.
- 27. Jensen DB, Vesth TC, Hallin PF, Pedersen AG, Ussery DW. Bayesian prediction of bacterial growth temperature range based on genome sequences. BMC Genomics. 2012;13:S3.
- 28. Ramoneda J, Stallard-Olivera E, Hoffert M, Winfrey CC, Stadler M, Niño-García JP, et al. Building a genome-based understanding of bacterial pH preferences. Sci Adv. 2023;9:eadf8998.
- 29. Koblitz J, Reimer LC, Pukall R, Overmann J. Predicting bacterial phenotypic traits through improved machine learning using high-quality, curated datasets. Commun Biol. 2025;8:897.
- 30. Sayers EW, Cavanaugh M, Frisse L, Pruitt KD, Schneider VA, Underwood BA, et al. GenBank 2025 update. Nucleic Acids Res. 2025;53:D56–61.
- 31. Schober I, Koblitz J, Sardà Carbasse J, Ebeling C, Schmidt ML, Podstawka A, et al. BacDive in 2025: the core database for prokaryotic strain data. Nucleic Acids Res. 2025;53:D748–56.
- 32. Li W, Cowley A, Uludag M, Gur T, McWilliam H, Squizzato S, et al. The EMBL-EBI bioinformatics web and programmatic tools framework. Nucleic Acids Res. 2015;43:W580–4.
- 33. Natland ST, Andersen LF, Nilsen TIL, Forsmo S, Jacobsen GW. Maternal recall of breastfeeding duration twenty years after delivery. BMC Med Res Methodol. 2012;12:179.
- 34. Macarthur C, Macarthur A, Weeks S. Accuracy of recall of back pain after delivery. BMJ. 1996;313:467.
- 35. Han H, Guo X, Yu H. Variable selection using Mean Decrease Accuracy and Mean Decrease Gini based on Random Forest. In: 2016 7th IEEE International Conference on Software Engineering and Service Science (ICSESS). 2016:219–24.
- 36. Zhu X, Wu Y, Lv X, Liu Y, Du G, Li J, et al. Combining CRISPR–Cpf1 and recombineering facilitates fast and efficient genome editing in Escherichia coli. ACS Synth Biol. 2022. 10.1021/acssynbio.2c00041.
- 37. Sun J, Lu F, Luo Y, Bie L, Xu L, Wang Y. OrthoVenn3: an integrated platform for exploring and visualizing orthologous data across genomes. Nucleic Acids Res. 2023;51:W397–403.
- 38. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 2021;49:D412–9.
- 39. Breiman L. Random forests. Mach Learn. 2001;45:5–32.
- 40. Gregorutti B, Michel B, Saint-Pierre P. Correlation and variable importance in random forests. Statist Comput. 2017;27:659–78.
- 41. Kruse T, Bork-Jensen J, Gerdes K. The morphogenetic MreBCD proteins of Escherichia coli form an essential membrane-bound complex. Mol Microbiol. 2005;55:78–89.
- 42. RS. J-B. Bergey’s manual of determinative bacteriology: a key for the identification of organisms of the class schizomycetes. Nature. 1931;128:6–6.
- 43. Soufo HJD, Graumann PL. Dynamic movement of actin-like proteins within bacterial cells. EMBO Rep. 2004. 10.1038/sj.embor.7400209.
- 44. Jiang H, Si F, Margolin W, Sun SX. Mechanical control of bacterial cell shape. Biophys J. 2011;101:327–35.
- 45. Billaudeau C, Yao Z, Cornilleau C, Carballido-López R, Chastanet A. MreB forms subdiffraction nanofilaments during active growth in Bacillus subtilis. mBio. 2019. 10.1128/mbio.01879-18.
- 46. Cascales E, Bernadac A, Gavioli M, Lazzaroni J-C, Lloubes R. Pal lipoprotein of Escherichia coli plays a major role in outer membrane integrity. J Bacteriol. 2002;184:754–9.
- 47. Malinverni JC, Silhavy TJ. An ABC transport system that maintains lipid asymmetry in the Gram-negative outer membrane. Proc Natl Acad Sci. 2009;106:8009–14.
- 48. Noinaj N, Kuszak AJ, Balusek C, Gumbart JC, Buchanan SK. Lateral opening and exit pore formation are required for BamA function. Structure. 2014;22:1055–62.