BMC Genomics. 2025 Nov 20;26:1068. doi: 10.1186/s12864-025-12282-6

A systematic longitudinal study of microbiome: integrating temporal-spatial dimensions with causal and deep learning models

Liugen Wang 1, Guanpeng Qi 1, Yongle Shi 2, Yibing Ma 2, Jie Gao 2
PMCID: PMC12636166  PMID: 41266961

Abstract

Longitudinal microbiome data provide a unique opportunity to explore dynamic interactions between microbial communities and disease progression. However, these data are often characterized by missing values, sparse signals, and limited interpretability, which impede effective biomarker discovery and accurate disease modeling. Therefore, we propose SysLM, a comprehensive deep learning framework for systematic analysis of longitudinal microbiome data. It comprises two synergistic modules: SysLM-I and SysLM-C. SysLM-I focuses on the task of missing-value inference, combines metadata and three feature enhancement strategies, and comprehensively captures temporal causality and long-term dependence through Temporal Convolutional Network and Bi-directional Long Short-Term Memory modules. SysLM-C integrates deep learning with causal inference modeling to construct three causal spaces to accomplish the tasks of classification and screening of multiple types of biomarkers, including differential biomarkers of microbiomes, network biomarkers, core biomarkers, dynamic biomarkers, disease-specific biomarkers, and shared biomarkers. SysLM demonstrates superior performance in imputation, classification, and biomarker discovery across multiple datasets. Importantly, it uncovers novel microbial mechanisms underlying ulcerative colitis, highlighting its value for precision medicine. By integrating deep learning with causal modeling, SysLM offers a promising approach to advance microbiome-based disease research and facilitate the development of targeted therapeutic strategies.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12864-025-12282-6.

Keywords: Longitudinal microbiome data, Deep learning, Causal spaces, Multiple types of biomarkers

Introduction

Advances in high-throughput sequencing have revolutionized microbiome research, highlighting the critical role of microbial communities in human health and disease, including gastrointestinal disorders, metabolic syndromes, immune dysregulation, and cancers [1]. In contrast to cross-sectional data, longitudinal microbiome data, which samples the dynamics of individual microbiomes through multiple time points, has provided a unique perspective for understanding the temporal changes in microbial communities and their relationship with disease development [2].

However, longitudinal microbiome datasets often contain missing values caused by technical limitations or irregular sampling, which can introduce bias and compromise downstream analyses [3]. Traditional methods usually rely on assumptions about the missingness mechanism (e.g., MCAR, MAR, or MNAR), which may not adequately account for the complexity and time-dependence of the data [4]. In recent years, deep learning methods have performed well in inferring missing values. The BRITS model infers missing values through a bidirectional recurrent neural network without assuming the data distribution [5], and the CATSI model introduces “context vectors” to learn the global temporal dynamics and further improve accuracy [6]. These methods improve the accuracy of missing data inference by automatically learning the temporal patterns of the data.

Beyond missing data inference, effective analysis of longitudinal microbiome data requires integrated approaches that combine metadata incorporation, feature enhancement, and robust biomarker discovery. Existing models tend to address these challenges in isolation. The DeepMicroGen model uses generative adversarial networks to infer missing values and to perform classification on longitudinal microbiome data, but lacks in-depth exploration of biomarker screening and functional analysis [7]. Fung et al. focus on classification by sequentially filling in zero values and integrating self-knowledge distillation with neural networks, but do not explore further analyses [8]. MDITRE is an interpretable model designed to discover associations between microbiome dynamics and host status through human-readable rules, but it does not support causal inference [9]. Similarly, some models incorporate metadata for classification, but they remain limited to downstream classification tasks and do not fully exploit the potential of microbiome data. For example, Chen et al. use the FE_GRU model to predict host status based on longitudinal microbiome data in combination with phylogenetic trees and metadata [10]. The phyLoSTM model combines a convolutional neural network and a long short-term memory (LSTM) network, integrating feature extraction with the environmental factors of the host to improve classification [11]. Other models focus primarily on the classification task. For instance, Metwally et al. combine Minimum Redundancy Maximum Relevance (mRMR) and LSTM to classify food allergies [12].

Moreover, the complexity of longitudinal microbiome data and substantial inter-individual variability present significant challenges to model interpretability [13], which is crucial for clinical applications. Many existing approaches lack sufficient transparency, which restricts the extraction of meaningful biological insights and reduces their translational potential. To address these challenges, we propose SysLM, a systematic and interpretable deep learning framework. The framework combines the strengths of deep learning and causal modeling to effectively handle time-dependence and individual differences in longitudinal microbiome data. SysLM integrates feature augmentation and metadata with Temporal Convolutional Network (TCN) and Bi-directional Long Short-Term Memory (BiLSTM) networks in its SysLM-I module to accurately infer missing values while capturing temporal causality and long-term dependencies [14, 15]. On this basis, the SysLM-C module employs causal inference to construct multiple causal spaces, enabling comprehensive biomarker identification and elucidation of key microbial pathways linked to ulcerative colitis (UC). Extensive validation on diverse datasets demonstrates SysLM’s improved predictive accuracy, interpretability, and biological relevance, offering a promising tool for microbiome-based precision medicine.

Materials and methods

Datasets

To comprehensively assess model performance, we collect six publicly available real datasets or research projects, namely DIABIMMUNE, BONUS-CF, DiGiulio, PROTECT, iHMP-IBD, and iHMP-T2D. The DIABIMMUNE project aims to unravel the biological mechanisms underlying the increased prevalence of autoimmune diseases and allergies. The data are collected from a total of 222 infants in three countries (Finland, Estonia and Russia) [16]. Different allergy types are used as classification targets to support early diagnosis of related diseases. The BONUS-CF project is a prospective multicenter study of infants with cystic fibrosis (CF) that aims to identify microbiological correlates of the poor growth observed in these infants. It includes a total of 232 subjects with sequenced samples, and in this paper the presence or absence of CF is used as the classification outcome [17, 18]. The DiGiulio dataset comes from a case-control study of 40 pregnant women designed to explore the relationship between microorganisms and preterm labor. In this paper, the presence or absence of preterm labor is used as the classification target to better understand the impact of microorganisms on pregnancy outcomes [19, 20]. The PROTECT dataset covers follow-up data from 428 patients with new-onset pediatric UC, and in this paper patients are classified into active and inactive groups based on their Pediatric Ulcerative Colitis Activity Index (PUCAI) scores [21]. The iHMP-IBD project collects multi-omics longitudinal data from approximately 90 patients with inflammatory bowel diseases (IBD) to analyze the temporal changes in the gut microbiota, and we use the type of IBD as the classification outcome [22, 23]. The iHMP-T2D project establishes a cohort of about 100 subjects at risk of type 2 diabetes (T2D) to study the biological changes associated with T2D, and we use the insulin-resistant (IR) and insulin-sensitive (IS) status of T2D patients as the classification outcome, which is intended to assist in the development of personalized treatment plans [22].
A description of the data preprocessing process is provided in the Supplementary Section 1.1.

SysLM

Compared to microbiome data from a single time point, longitudinal microbiome data, in which samples are collected at multiple time points, can provide detailed information on the dynamics of microbial communities and reveal their relationship with host health. However, longitudinal studies often compromise the accuracy and consistency of data due to missing samples during follow-up. Therefore, we propose the SysLM model, which contains two sub-models, SysLM-I and SysLM-C. The structure of the SysLM model is schematically shown in Fig. 1A. The SysLM-I model is mainly used to infer longitudinal microbiome data with missing values to maximize the recovery of missing values and improve the predictive performance and generalization ability of the model’s downstream analysis (Methodological details are provided in the Supplementary Sect. 1.2). Inferring missing values in data helps reduce data bias and inconsistency, improve analysis accuracy, and avoid wasting potential information by deleting samples with missing values. In our experiments, we compare SysLM-I with multiple benchmark methods, using MAE, MSE, RMSE, and R² as evaluation metrics, to validate its inferential performance.

Fig. 1. A The flowchart of SysLM. B The flowchart of SysLM-C

To enhance the performance of the SysLM-I model on different datasets and to better capture the complex structure and distributional features of the data, we introduce two diversity loss functions on top of the mean square error (MSE) loss. First, $\mathrm{loss}_\alpha$ measures the difference between the alpha-diversity of the generated values and that of the true values, calculated using the Shannon index. Second, $\mathrm{loss}_\beta$ measures the difference between the beta-diversity of the generated values and that of the true values, calculated using the Bray-Curtis distance. The loss of the model therefore consists of three components:

$$\mathrm{loss} = \mathrm{loss}_r + w_\alpha\,\mathrm{loss}_\alpha + w_\beta\,\mathrm{loss}_\beta, \qquad \mathrm{loss}_\alpha = \lVert H(\hat{X}) - H(X) \rVert_F, \qquad \mathrm{loss}_\beta = \lVert D_{BC}(\hat{X}) - D_{BC}(X) \rVert_F \tag{1}$$

where $\mathrm{loss}_r$ measures the error between the model-generated values $\hat{X}$ and the true values $X$, $H(\cdot)$ is the Shannon index, and $D_{BC}(\cdot)$ is the Bray-Curtis distance. To effectively optimize the model, the differences in diversities are measured using the Frobenius norm. By optimizing the diversity of the generated data, the model can generate microbial data with higher biological plausibility and ecological diversity. $w_\alpha$ and $w_\beta$ are the weighting coefficients of the two diversity terms, chosen from [1e-1, 1e-2, ⋯, 1e-5].
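The three-component loss described above can be illustrated with a minimal, dependency-free Python sketch. It is not the SysLM-I implementation: the function names, the list-of-lists data layout (one abundance profile per sample), the plain mean used for the MSE term, and the unsquared Frobenius norm are assumptions made for illustration. Only the Shannon index, the Bray-Curtis distance, and the weight value 1e-5 reported in the parameter sensitivity analysis come from the text.

```python
import math

def shannon_index(abundances):
    """Shannon index H = -sum(p_i * ln p_i) over non-zero relative abundances."""
    total = sum(abundances)
    if total <= 0:
        return 0.0
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance profiles."""
    denom = sum(u) + sum(v)
    if denom == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / denom

def pairwise_bray_curtis(samples):
    """Sample-by-sample Bray-Curtis matrix (beta-diversity)."""
    return [[bray_curtis(s, t) for t in samples] for s in samples]

def frobenius(rows):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in rows for x in row))

def diversity_aware_loss(generated, true, w_alpha=1e-5, w_beta=1e-5):
    """loss = loss_r + w_alpha * loss_alpha + w_beta * loss_beta (illustrative)."""
    # Reconstruction term: mean squared error over all entries.
    n_entries = sum(len(row) for row in true)
    loss_r = sum(
        (g - t) ** 2
        for g_row, t_row in zip(generated, true)
        for g, t in zip(g_row, t_row)
    ) / n_entries
    # Alpha-diversity term: norm of the per-sample Shannon index differences.
    h_diff = [[shannon_index(g) - shannon_index(t) for g, t in zip(generated, true)]]
    loss_alpha = frobenius(h_diff)
    # Beta-diversity term: norm of the difference between Bray-Curtis matrices.
    d_gen = pairwise_bray_curtis(generated)
    d_true = pairwise_bray_curtis(true)
    d_diff = [[a - b for a, b in zip(rg, rt)] for rg, rt in zip(d_gen, d_true)]
    loss_beta = frobenius(d_diff)
    return loss_r + w_alpha * loss_alpha + w_beta * loss_beta
```

With weights as small as 1e-5, the diversity terms act as a mild regularizer: the reconstruction error dominates, and the diversity penalties mainly break ties between imputations with similar MSE.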

Currently, deep learning-based longitudinal microbiome studies mainly focus on inferring missing values and on classification tasks based on subject status, and few studies have explored in depth how to fully utilize the temporal properties of longitudinal microbiome data to identify potential microbial biomarkers. To address this issue, we propose the SysLM-C model, which uses the inferred data to classify subjects and find microbial biomarkers associated with subject status (methodological details are provided in the Supplementary Sect. 1.3). Its performance is assessed by comparing it with published classification models using metrics such as AUC. The model can identify microbial biomarkers with potential causal relationships and provide insights into microbe-host health relationships. The SysLM-C model offers enhanced interpretability in biomarker discovery and can reveal potential causal relationships between microbes and host status, providing theoretical support for subsequent mechanistic studies and the development of personalized therapeutic regimens.

As shown in Fig. 1B, the SysLM-C model consists of three causal spaces. The first causal space is mainly used for subject classification and to improve model interpretability, and contains three loss functions, i.e., classification loss (used to train the classifier and differentiate between Case and Ctrl labels), reconstruction loss (ensures that the data features are reconstructed from the causal relationships and maintains the structural information), and causal loss (captures the causal paths and causal effects between semantic vectors to help the model understand the causal inference). The second causal space is the static causal space, which is designed to capture global causal relationships between microbes and subject states, and contains causal loss (learning causal paths and causal effects between microbes and subject states) and reconstruction loss. The third causal space is referred to as the dynamic causal space, which is designed to capture trends in the temporal evolution of diseases or health changes. This space incorporates dynamic causal loss, which captures causal relationships in time-series data, alongside causal loss and reconstruction loss. This combination ensures that the model comprehensively understands the causal structure over time, thereby enhancing the relevance of the dynamic relationships extracted from the dynamic causal map [24]. In addition, it is also necessary to constrain the consistency of the causal relations learned in the static and dynamic causal spaces with the reconstructed state semantic vectors, which helps the model to better understand the causal inference in the data at both the static and dynamic levels, and thus improves the generalization ability and accuracy of the model. Thus, the total loss of the SysLM-C model can be represented as follows:

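$$\mathrm{loss} = \mathrm{loss}_{cls} + w_{rec}\,\mathrm{loss}_{rec} + w_{dag}\,\mathrm{loss}_{dag} + w_{dyn}\,\mathrm{loss}_{dyn} + w_{con}\,\mathrm{loss}_{con} \tag{2}$$

where $\mathrm{loss}_{cls}$ is the classification loss, and $w_{rec}$, $w_{dag}$, $w_{dyn}$, and $w_{con}$ are the weights of the reconstruction loss, causal loss, dynamic causal loss, and consistency loss, respectively, chosen from [1e-1, 1e-2, ⋯, 1e-5]. The formulas for the different types of losses mentioned above can specifically be expressed as follows: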

$$\mathrm{loss}_{cls} = -\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t) \tag{3}$$
graphic file with name d33e389.gif 4
graphic file with name d33e393.gif 5
graphic file with name d33e397.gif 6
graphic file with name d33e401.gif 7

where $\alpha_t$ is a balancing factor used to adjust the weights between categories, $p_t$ is the predicted probability of the correct category, and $\gamma$ is the modulating factor that controls the loss contribution of easy-to-classify and hard-to-classify samples. Following Lin et al. [25], $\alpha_t$ and $\gamma$ are set to 0.25 and 2, respectively. $T$ is the number of time points. Inline graphic, Inline graphic, Inline graphic represent the causal matrices of the three spaces, respectively. Inline graphic denotes the Hadamard product. $q_1$, $q_2$, $q_3$ denote the number of variables in the three causal spaces, respectively. $\rho$ and $\alpha$ are the two hyperparameters of the causal loss, both set to 1 [26]. When calculating the causal loss, the dimension of the feature causal space may be too large, which can cause numerical difficulties. Therefore, following Zheng et al. [27], we normalize the Hadamard product of the causal matrix, which does not change the directed-acyclicity constraint on the matrix. The error term Inline graphic (Inline graphic) compensates for numerical errors from normalization and relaxes the constraint, ensuring stability and convergence in DAG learning. Here, Tr denotes the trace of a matrix. Inline graphic, Inline graphic are the Frobenius norm and L2 norm, respectively. Inline graphic and Inline graphic denote the reconstruction of the state semantic vector Case in the 3rd causal space and the 2nd causal space, respectively. Similarly, Inline graphic and Inline graphic are the reconstructions of the state semantic vector Ctrl, respectively.
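To make the role of the trace, the Hadamard product, and the hyperparameters $\rho$ and $\alpha$ concrete, the following sketch evaluates the widely used trace-exponential acyclicity function $h(A) = \mathrm{Tr}(e^{A \circ A}) - q$ and an augmented-Lagrangian penalty $\frac{\rho}{2} h(A)^2 + \alpha\, h(A)$. We assume this is the form meant by the reference to Zheng et al.; the exact causal loss of SysLM-C, including its normalization of the Hadamard product and its error term, is not reproduced here. The function names and the truncated power series used for the matrix exponential are illustrative choices.

```python
def _matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def _trace(m):
    return sum(m[i][i] for i in range(len(m)))

def acyclicity(adj, terms=30):
    """h(A) = Tr(exp(A ∘ A)) - q, using a truncated power series for exp.

    h(A) is zero exactly when the weighted graph has no directed cycle.
    """
    q = len(adj)
    squared = [[w * w for w in row] for row in adj]  # Hadamard product A ∘ A
    power = [[1.0 if i == j else 0.0 for j in range(q)] for i in range(q)]
    total = float(q)  # k = 0 term: Tr(I) = q
    factorial = 1.0
    for k in range(1, terms + 1):
        power = _matmul(power, squared)
        factorial *= k
        total += _trace(power) / factorial
    return total - q

def causal_penalty(adj, rho=1.0, alpha=1.0):
    """Augmented-Lagrangian penalty with rho = alpha = 1 as stated in the text."""
    h = acyclicity(adj)
    return 0.5 * rho * h * h + alpha * h
```

For a two-node graph, `acyclicity([[0.0, 0.7], [0.0, 0.0]])` is zero because the single edge forms no cycle, whereas adding the reverse edge makes the value strictly positive, so minimizing the penalty pushes the learned matrix toward a DAG.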

Results

Parameter sensitivity analysis and ablation study

In microbiome data, there are more zeros at finer taxonomic levels (e.g. species or genus), whereas data at the phylum level are more complete. Therefore, to ensure robustness and generalizability, we conduct a parameter sensitivity analysis using the phylum-level DIABIMMUNE dataset.

For SysLM-I, we evaluate the impact of the loss weights $w_\alpha$ and $w_\beta$ using 5-fold cross-validation. The optimal configuration is determined as $w_\alpha = w_\beta$ = 1e-5, which is then applied uniformly across all datasets and taxonomic levels to ensure consistency and biological plausibility (see Fig. 2A).

Fig. 2. The results of the parameter sensitivity analysis. In the SysLM-C model, for clear visualization, we take the base-10 logarithm of the parameter combinations

For SysLM-C, we tune four loss weights ($w_{rec}$, $w_{dag}$, $w_{dyn}$, and $w_{con}$) based on class imbalance ratios, optimizing primarily for AUC (see Fig. 2B–E). Supplementary Table S3 summarizes representative weight combinations for different positive-to-negative sample ratios. This analysis highlights the importance of appropriately balancing optimization objectives under different data distributions. By tailoring weight configurations to specific imbalance ratios, we improve model performance and stability in classification tasks.

Ablation experiments further confirm the necessity of key modules and loss components for both SysLM-I and SysLM-C, with detailed results presented in Supplementary Sect. 1.4.

Performance comparison of missing value imputation methods

To comprehensively evaluate the performance of the SysLM-I model in inferring missing values from longitudinal microbiome datasets, we conduct extensive comparative experiments using 5-fold cross-validation. In each fold, the observed values in the test set are masked and treated as missing, and models are trained on the remaining training data. The imputation performance is assessed by calculating MAE, MSE, RMSE, and R² at the masked positions, and final results are averaged.
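The evaluation protocol above scores a method only at the positions that were masked. A minimal sketch of that scoring step is given below; the function name, the boolean mask layout, and the nested-list representation are assumptions for illustration and are not taken from the SysLM code.

```python
import math

def masked_metrics(true, imputed, mask):
    """MAE, MSE, RMSE and R² computed only where mask is True.

    true, imputed, mask: lists of rows with identical shapes; a True entry in
    mask marks a value that was hidden from the model and then imputed.
    """
    pairs = [
        (t, p)
        for t_row, p_row, m_row in zip(true, imputed, mask)
        for t, p, m in zip(t_row, p_row, m_row)
        if m
    ]
    if not pairs:
        raise ValueError("mask selects no positions")
    n = len(pairs)
    mae = sum(abs(t - p) for t, p in pairs) / n
    mse = sum((t - p) ** 2 for t, p in pairs) / n
    mean_true = sum(t for t, _ in pairs) / n
    ss_tot = sum((t - mean_true) ** 2 for t, _ in pairs)
    # R² is undefined when the masked true values are all identical.
    r2 = 1.0 - (mse * n) / ss_tot if ss_tot > 0 else float("nan")
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": r2}
```

In a 5-fold setting, this function would be called once per fold and the resulting dictionaries averaged, which matches the averaging described in the text.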

We compare SysLM-I against several baseline methods: the deep learning models BRITS [5], CATSI [6], and DeepMicroGen [7], as well as simple statistical approaches such as Mean and Median imputation. The results, shown in Supplementary Fig. S2, demonstrate that SysLM-I generally outperforms or performs comparably to other methods across most evaluation metrics and taxonomic levels. Specifically, SysLM-I achieves the best MAE in 88.24% of experiments and the best MSE, RMSE, and R² in 91.18% of cases. This highlights the strength of SysLM-I in handling complex missing data across diverse microbial datasets.

Further, to assess whether the inferred values maintain ecological validity, we compare the alpha and beta diversity between inferred and true values across all methods. Due to the relatively poor performance of statistical methods in imputation, this diversity comparison focuses only on deep learning-based methods. As shown in Supplementary Fig. S3, SysLM-I demonstrates the best reconstruction of microbial diversity patterns, particularly from the phylum to genus levels, where community profiles tend to be more consistent and biologically informative. Performance at the species level is relatively less consistent, likely due to the high sparsity and variability inherent to this taxonomic resolution. Among other methods, BRITS performs reasonably well, while DeepMicroGen performs worse, possibly due to its reliance on post-processing strategies.

These results collectively suggest that SysLM-I not only achieves accurate numerical imputation but also preserves the underlying ecological structure of microbial communities, thus offering both quantitative and biological advantages.

Performance evaluation and method comparison in classification tasks

To comprehensively evaluate the effects of different missing value imputation methods and classification strategies on downstream classification tasks, we systematically compare various classification models and loss functions using longitudinal microbiome data with key metrics such as AUC, AUPR, ACC, and F1-score.

First, we compare the classification performance of different imputation methods using three classifiers: phyLoSTM [11], CNN-LSTM [8], and DeepMicroGen [7]. Results show that imputations from the SysLM-I model achieve the best and most stable performance across all classification models (see Supplementary Fig. S4). Other imputation methods show more fluctuation, indicating that SysLM-I provides more reliable inferred values to support accurate and robust classification.

Second, considering the severe imbalance of positive and negative samples, we compare various classification loss functions, including cross-entropy (CE), weighted cross-entropy (WCE), DICE loss, and Focal Loss. Using the gold standard dataset and 5-fold cross-validation, Focal Loss outperforms the others on the AUC and AUPR metrics, though it performs slightly worse on ACC and F1-score in the egg allergy classification task (Supplementary Fig. S5). This suggests that the emphasis of Focal Loss on hard-to-classify samples contributes to more stable and robust performance on the key metrics.
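The effect of Focal Loss on easy and hard samples can be seen in a short sketch of the binary form from Lin et al., with $\alpha_t$ = 0.25 and $\gamma$ = 2 as stated in the Methods. The per-sample formulation and the function names are illustrative; the SysLM-C training code may aggregate the loss differently.

```python
import math

def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for one sample; p is the predicted P(y = 1)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(min(max(p_t, eps), 1.0))

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one sample: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p               # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing factor
    p_t = min(max(p_t, eps), 1.0)                # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy positive (p = 0.9) is down-weighted far more than a hard one (p = 0.1).
easy_ratio = focal_loss(0.9, 1) / cross_entropy(0.9, 1)
hard_ratio = focal_loss(0.1, 1) / cross_entropy(0.1, 1)
```

Relative to cross-entropy, the well-classified sample keeps only a small fraction of its loss while the misclassified sample keeps a much larger share, which is the mechanism by which Focal Loss concentrates training on hard cases under class imbalance.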

Finally, we compare the SysLM-C model with several baseline models (phyLoSTM, CNN-LSTM, and DeepMicroGen), focusing on AUC performance across different taxonomic levels. SysLM-C achieves the best AUC in multiple settings across the phylum, class, order, family, genus, and species levels, demonstrating the advantages of its temporal and spatial modeling capabilities (Supplementary Fig. S6). Additionally, SysLM-C shows improvements in AUPR, ACC, and F1-score (see Supplementary Fig. S6), confirming its robustness and superiority across datasets.

In summary, the SysLM series models achieve efficient, stable, and accurate predictions in classification tasks on longitudinal microbiome data through high-quality imputation, appropriate loss design, and effective classification architecture.

Statistical test

To further assess the necessity of introducing metadata, we conduct a Mantel test to analyze the correlation between metadata and specific microbial taxa at the phylum level across all datasets. The results, shown in Supplementary Fig. S7, indicate statistically significant correlations between metadata and OTUs across all datasets. This implies that there may be some correlation or influence between these factors (such as country, gender, and race) and OTUs. These results support the necessity of incorporating metadata when inferring missing values for microbiomes as well as for classification tasks.
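For readers unfamiliar with the Mantel test, the sketch below shows a basic one-sided permutation version operating on two square distance matrices (for example, a metadata distance matrix and a microbial distance matrix over the same subjects). The choice of Pearson correlation, the number of permutations, the seed, and all function names are assumptions for illustration; the text does not state which variant or software was used.

```python
import math
import random

def _upper_triangle(matrix, order=None):
    """Upper-triangle entries, optionally after relabelling rows and columns."""
    n = len(matrix)
    idx = order if order is not None else list(range(n))
    return [matrix[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]

def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither vector is constant

def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Return (correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    a = _upper_triangle(dist_a)
    observed = _pearson(a, _upper_triangle(dist_b))
    n = len(dist_a)
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)  # permute subjects in one matrix only
        if _pearson(a, _upper_triangle(dist_b, order)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

Permuting whole rows and columns together, rather than individual entries, preserves the dependence structure within each distance matrix, which is why the Mantel test is preferred over a plain correlation test on the flattened matrices.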

Based on the gold standard data, we further assess the reliability of the SysLM method by statistical tests. First, a Shapiro-Wilk test is used to determine whether the data follow a normal distribution (see Supplementary Table S4-S5, p-value < 0.01). If the data satisfy the normal distribution assumption, a one-way ANOVA (p-value < 0.05) is then used to compare the performance of the different models on the evaluation metrics. Statistical tests indicate that the SysLM method is significantly different from the other methods in the inference task and that the results of all inference methods conform to a normal distribution (see Table 1).

Table 1. The results of the one-way ANOVA test (p-value < 0.05)

Methods                             R²        MAE       MSE       RMSE      AUC       AUPR      ACC       F1-score
BRITS                               7.48E-04  5.54E-05  7.64E-04  1.16E-04  -         -         -         -
CATSI                               2.68E-17  3.81E-17  6.02E-13  3.82E-14  -         -         -         -
DeepMicroGen (inference task)       8.44E-16  3.24E-16  4.15E-13  2.15E-14  -         -         -         -
Mean                                1.70E-20  2.14E-17  2.79E-14  5.13E-15  -         -         -         -
Median                              4.56E-20  1.81E-17  3.20E-14  5.27E-15  -         -         -         -
CNN-LSTM                            -         -         -         -         2.35E-03  0.074     2.91E-03  0.048
DeepMicroGen (classification task)  -         -         -         -         2.16E-02  0.11      1.39E-01  0.28
phyLoSTM                            -         -         -         -         2.26E-02  0.035     2.68E-02  0.21

For the classification task, considering that the four-label classification task of the gold-standard dataset can be viewed as a multi-label task, we combine the classification results to simplify the analysis. Although the DeepMicroGen results on the ACC metric do not conform to a normal distribution, we treat them as normal so that all methods are compared with the same test. The results show that SysLM-C is significantly different from other methods on the AUC metric, while only some of the results are significantly different on the AUPR, ACC, and F1-score metrics (see Table 1). To further enhance the statistical robustness of our results, we additionally conduct Wilcoxon signed-rank tests, and the corresponding results are provided in Supplementary Table S6. Some of these results are consistent with those obtained from the one-way ANOVA.

Interpretability of the model

As shown in Fig. 1B, the causal DAG learned in the semantic vector causal space offers a potential approach for interpreting model predictions. Specifically, after obtaining the optimized causal matrix Inline graphic, we gradually remove edges in Inline graphic according to their weights until Inline graphic satisfies the properties of a directed acyclic graph. This process ensures that the pruned DAG has a reliable and significant structure (the results after pruning are shown in Fig. 3A–K). These causal graphs appear to be interpretable, and in several cases the inferred relationships align with prior domain knowledge or reported associations.
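The edge-removal step can be sketched as follows. The text states only that edges are removed according to their weights until the graph is acyclic; removing the edge with the smallest absolute weight first is our assumption, as are the function names and the convention that a non-zero entry `weights[i][j]` encodes an edge from node i to node j.

```python
def has_cycle(weights):
    """Depth-first search for a directed cycle in a weighted adjacency matrix."""
    n = len(weights)
    state = [0] * n  # 0 = unvisited, 1 = on the current path, 2 = finished

    def visit(node):
        state[node] = 1
        for nxt in range(n):
            if weights[node][nxt] != 0:
                if state[nxt] == 1 or (state[nxt] == 0 and visit(nxt)):
                    return True
        state[node] = 2
        return False

    return any(state[i] == 0 and visit(i) for i in range(n))

def prune_to_dag(weights):
    """Drop the weakest edges one at a time until no directed cycle remains."""
    pruned = [list(row) for row in weights]
    edges = sorted(
        (abs(w), i, j)
        for i, row in enumerate(pruned)
        for j, w in enumerate(row)
        if w != 0
    )
    for _, i, j in edges:
        if not has_cycle(pruned):
            break
        pruned[i][j] = 0.0
    return pruned
```

For a three-node cycle with weights 0.9, 0.8, and 0.1, only the 0.1 edge is removed and the two strong edges survive. Because edges are dropped globally in ascending order, a weak edge that is not part of any cycle may also be removed before the graph becomes acyclic.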

Fig. 3. The results of DAG visualization for different tasks that explain possible causal relationships in the task

Specifically, as can be seen in Fig. 3A, for the classification task of milk allergy in the DIABIMMUNE dataset, we find that Inline graphic, which suggests that the microbial community, by regulating the function of the immune system, may influence the occurrence of milk allergic reactions to a certain extent [28–30]. Country-related factors of the subjects, such as dietary habits, environmental pollution and sanitary conditions, may play a role in the development of milk allergy by altering the composition of the microbial community or by directly influencing the immune response [31]. Gender may also influence susceptibility to allergy and the strength of the immune response through physiologic and immunologic differences. For the classification task of egg allergy, we find that Inline graphic (see Fig. 3B), which suggests that gender may influence egg allergy through differences in the immune system, whereas country-related factors may alter the composition of the microbial community through diet and environment. In addition, the occurrence of egg allergy may lead to changes in the structure of the microbial community in the subjects [28, 32, 33]. For the classification task of CF, Inline graphic indicates that microbial community composition may be associated with subject-specific biological characteristics (see Fig. 3E). Furthermore, the status of CF patients may be associated with the composition of their microbial communities, suggesting that certain specific microbial communities could either contribute to the maintenance of health or interfere with it [34].

For the other classification tasks, we observe similar causal relationships, i.e., the interaction between the microbial community and the metadata may jointly influence the subject’s state. In addition, some of the causal relationships are consistent with the Mantel test results, which provides additional support for the interpretability and reliability of the model. In conclusion, the causal graph analysis of the SysLM-C model can reveal potential complex interactions between microbial communities and subject metadata (e.g., state, gender, etc.). This provides an important perspective for understanding the multilevel causal mechanisms of disease occurrence and the role of microbial communities in disease.

Microbial biomarkers

Leveraging the static and dynamic causal spaces (Inline graphic, Inline graphic), our model identifies six types of microbial biomarkers that provide insights into disease mechanisms. Below, we define and analyze each type with representative findings.

Differential biomarkers

Differential biomarkers are defined as microbial taxa that differ significantly between case and control groups. In this work, they are identified based on their simultaneous connections to both Case and Ctrl nodes in the static DAG. Compared to traditional statistical methods (e.g., Mann-Whitney U test, false discovery rate (FDR) < 0.05) [35], our method identifies a larger number of differential microbial biomarkers with some overlap (Fig. 4A–C; Supplementary Fig. S8), suggesting improved sensitivity.
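The selection rule based on simultaneous connections can be expressed in a few lines. In this sketch, the static DAG is represented as a list of `(source, target, weight)` tuples and an edge in either direction counts as a connection; both the representation and that reading of "connection" are assumptions, and the taxon names in the usage example are placeholders rather than results.

```python
def neighbours(edges, node):
    """Nodes that share an edge, in either direction, with the given node."""
    linked = set()
    for source, target, _weight in edges:
        if source == node:
            linked.add(target)
        elif target == node:
            linked.add(source)
    return linked

def differential_biomarkers(edges, case="Case", ctrl="Ctrl"):
    """Taxa connected to both the Case node and the Ctrl node."""
    shared = neighbours(edges, case) & neighbours(edges, ctrl)
    return sorted(shared - {case, ctrl})

# Hypothetical static DAG: taxon_a touches both state nodes, the others one.
toy_edges = [
    ("taxon_a", "Case", 0.6),
    ("Ctrl", "taxon_a", -0.4),
    ("taxon_b", "Case", 0.3),
    ("taxon_c", "Ctrl", 0.5),
]
selected = differential_biomarkers(toy_edges)
```

Here only `taxon_a` is selected, since `taxon_b` and `taxon_c` are each linked to a single state node.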

Fig. 4. The visualization results of differential biomarkers. A Differential biomarkers of milk allergy screened based on the statistical test method. B Differential biomarkers of milk allergy screened based on our method. C Comparison of the number of differential biomarkers at all taxonomic levels in all datasets between the statistical test method and our method

To further validate the reliability of the screened differential biomarkers, we compare the screening results with those reported in the literature. For example, p__Firmicutes and c__Clostridia have been investigated as probiotic candidates for milk allergy treatment [30]. Meanwhile, g__Bifidobacterium and g__Coprococcus have been reported to show reduced abundance in patients with milk allergy [36], and f__Lachnospiraceae and f__Leuconostocaceae have shown significant differences between egg allergy patients and healthy people [28]. c__Clostridia has also been found to differ significantly between food allergy patients and healthy populations, especially in peanut allergy [37]. In addition, p__Actinobacteria has been reported to be useful in early infancy for preventing food allergies [33]. o__Pseudomonadales is strongly associated with CF; in particular, Pseudomonas aeruginosa is one of the main causative organisms of lung infections in patients with cystic fibrosis [38]. The abundance of p__Firmicutes has been found to be higher, and that of p__Proteobacteria lower, in women who delivered preterm compared to those who delivered at term [39]. The higher abundance of g__Lactobacillus in CD patients compared to UC patients and non-IBD patients suggests that g__Lactobacillus may be associated with the pathologic process of CD [40]. The abundance of g__Faecalibacterium and g__Roseburia is significantly reduced in UC patients [41], whereas the abundance of c__Actinobacteria is significantly increased [42]. g__Flavonifractor has been reported to be highly correlated with IS in T2D patients, and g__Coprococcus and g__Dorea are closely associated with IR [43]. These findings demonstrate partial agreement between our screened biomarkers and previously reported microbial alterations, thereby providing preliminary support for the potential applicability of our method across multiple diseases. However, further biological and clinical validation is required.

Network biomarkers

Network biomarkers are defined as microbial communities exhibiting coherent associations related to the phenotype of interest. They are identified through community detection performed on the static DAG using the Louvain algorithm. Modules containing both Case and Ctrl nodes highlight structural network changes across groups (Supplementary Fig. S9).

The microbial network structure of the Ctrl group is generally denser, showing stronger intra-community interactions. In some Case groups, however, such as the CF group, we observe a denser network structure, which may reflect stronger interactions or more complex network formations among microbial communities in specific disease states. These observations indicate potential differences in microbial networks between individuals with certain diseases and healthy individuals. In addition, distinct network patterns are observed across different taxonomic levels, highlighting the importance of multi-level taxonomic analysis.

To assess the reliability of the screened biomarkers, validation is performed based on published literature. For example, for milk allergy, network analysis shows a positive regulatory relationship between p__Firmicutes and the Ctrl group, which is consistent with the findings of Bunyavanich et al. [30]. In addition, the network suggests that the abundance of p__Firmicutes could potentially be modulated through the regulation of other connected microbes, which offers possible avenues for influencing the immune response in milk allergy. In the network structure of CF patients, we observe that the Case group positively regulates p__Actinobacteria, suggesting that the abundance or activity of this group may be up-regulated, which is consistent with the findings of Burke et al. [44]. In the network of CD patients, p__Firmicutes is negatively regulated, suggesting that its activity may be suppressed in CD patients, which is consistent with the findings of Loh and Blaut and of Kang et al. [45, 46]. In contrast, p__Actinobacteria is positively regulated in UC patients [46], suggesting that it may play a role in the pathological mechanism of IBD by promoting inflammation and exacerbating the disease. Screening of these network biomarkers provides potential insights into changes in disease-associated microbial communities, which may have significant clinical implications.

Core biomarkers

Core biomarkers are microbes that appear consistently across time-resolved DAGs and are connected to Case at each time point. These stable associations suggest long-term involvement in disease progression (Table 2; Supplementary Table S7). For example, p__Cyanobacteria has been shown to be strongly associated with CD and UC [47, 48]. In addition, reduction of p__Proteobacteria has been found to improve the therapeutic efficacy of CD treatment in two prospective studies [49], and reduction of p__Proteobacteria can alleviate the symptoms of UC [50]. o__Clostridiales has also been reported by Schirmer et al. to be associated with the severity of UC [21]. However, phylum-level biomarkers, especially those from low-abundance or unusual taxa, should be interpreted with caution, as they may result from sequencing noise or annotation artifacts. These findings may enhance our understanding of disease progression and could inform future studies on microbial markers for disease monitoring.

Table 2.

Core biomarkers at the phylum level for different datasets or different categories

| Dataset | Category | Core markers |
| --- | --- | --- |
| DIABIMMUNE | Total | p__Fusobacteria |
| BONUS-CF | CF | p__Proteobacteria |
| iHMP-IBD | CD | p__Cyanobacteria, p__Gemmatimonadetes, p__Proteobacteria, p__Planctomycetes |
| iHMP-IBD | UC | p__Fibrobacteres, p__Chlorobi, p__Tectomicrobia, p__Proteobacteria |
| iHMP-IBD | Non-IBD | p__Tectomicrobia, p__Cyanobacteria, p__Tenericutes, p__Armatimonadetes, p__Latescibacteria, p__Firmicutes, p__Actinobacteria, p__Euryarchaeota |
| PROTECT | Active | p__TM7, p__Cyanobacteria |

No core biomarkers found at the phylum level for category labels not shown
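
The definition of core biomarkers given above (taxa linked to Case in every time-resolved DAG) amounts to a set intersection over time points. The sketch below illustrates this under stated assumptions: each DAG is represented as a list of (source, target, weight) edges, an edge counts as present when its weight is non-zero, and the function name `core_biomarkers` and the toy graphs are hypothetical rather than taken from SysLM-C.

```python
def core_biomarkers(dags, phenotype="Case"):
    """Return the taxa connected to `phenotype` in every time-resolved DAG."""
    linked_per_time = []
    for edges in dags:
        linked = set()
        for source, target, weight in edges:
            if weight == 0:
                continue  # treat zero-weight edges as absent
            if source == phenotype:
                linked.add(target)
            elif target == phenotype:
                linked.add(source)
        linked_per_time.append(linked)
    if not linked_per_time:
        return set()
    return set.intersection(*linked_per_time)


# Hypothetical toy DAGs for three time points (not data from the study).
toy_dags = [
    [("p__Proteobacteria", "Case", 0.4), ("p__Firmicutes", "Case", 0.2)],
    [("p__Proteobacteria", "Case", 0.5), ("p__Firmicutes", "Ctrl", 0.3)],
    [("Case", "p__Proteobacteria", 0.1), ("p__Firmicutes", "Case", 0.0)],
]
print(core_biomarkers(toy_dags))
```

In this toy input only p__Proteobacteria is linked to Case at all three time points, so it is the only taxon returned.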

Dynamic biomarkers

Dynamic biomarkers are defined based on time-varying DAGs. OTUs showing substantial variation in causal weights or exhibiting time-associated trends in their relationship with Case are considered dynamic biomarkers. The results are shown in Supplementary Fig. S10. We observe that microbial biomarkers exhibit different trends across time and taxonomic scales, suggesting that their involvement may vary at different stages of disease development. For example, f__Prevotellaceae is significantly enriched in IBD, and the increasing trend of its causal effect in IBD patients may imply a progressively increasing role in the disease process [51, 52]. The abundance of g__Lachnospira has been reported to be significantly reduced in CD patients [53], and we observe a decreasing trend in its causal effect, which may be associated with its reduced presence during disease progression. p__Proteobacteria shows an increasing trend of causal effect in UC patients, which is consistent with the previous finding that UC symptoms may be alleviated by decreasing p__Proteobacteria [50]. In a longitudinal study, Zhou et al. found that the abundance of g__Butyricicoccus is reduced in IR patients compared with IS-type T2D patients [54]. In our study, g__Butyricicoccus exhibits an increasing trend in its causal effect on IR, which may imply that its role in the disease becomes more pronounced over time. To further elucidate the evolving roles of key microbial taxa in disease progression, we visualize their dynamic causal trajectories (Supplementary Fig. S10M), which highlight the importance of temporal modeling in uncovering potential microbial mechanisms. The causal weights reflect the estimated regulatory effect of a taxon on disease outcome, which may not always directly follow the relative abundance trends in the original or imputed data. Supplementary Fig. S10N provides the temporal abundance trends for comparison, illustrating cases where causal effect and abundance trajectories are consistent as well as cases where they differ.
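
One way to operationalize "substantial variation" and "time-associated trends" in causal weights is sketched below. The source does not state the criteria used in SysLM-C, so everything here is an assumption for illustration: variation is measured as the range of the weights, the trend as the least-squares slope over equally spaced time points, and the thresholds `min_range` and `min_abs_slope` as well as the toy trajectories are hypothetical.

```python
def trend_slope(weights):
    """Least-squares slope of weights against time indices 0, 1, 2, ..."""
    n = len(weights)
    t_mean = (n - 1) / 2
    w_mean = sum(weights) / n
    numerator = sum((t - t_mean) * (w - w_mean) for t, w in enumerate(weights))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    return numerator / denominator


def dynamic_biomarkers(trajectories, min_range=0.2, min_abs_slope=0.05):
    """Flag taxa whose causal weights vary strongly or follow a clear trend."""
    flagged = {}
    for taxon, weights in trajectories.items():
        if len(weights) < 2:
            continue  # a trend needs at least two time points
        spread = max(weights) - min(weights)
        slope = trend_slope(weights)
        if spread >= min_range or abs(slope) >= min_abs_slope:
            flagged[taxon] = {"range": spread, "slope": slope}
    return flagged


# Hypothetical causal-weight trajectories toward Case (not data from the study).
toy_trajectories = {
    "taxon_rising": [0.1, 0.2, 0.3, 0.4],
    "taxon_flat": [0.2, 0.2, 0.2, 0.2],
}
flagged = dynamic_biomarkers(toy_trajectories)
print(flagged)
```

A positive slope corresponds to an increasing causal effect over time and a negative slope to a decreasing one; as noted above, neither needs to match the abundance trend of the same taxon.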

Disease-specific and shared biomarkers

Disease-specific biomarkers refer to microbes identified in only one disease group, while shared biomarkers refer to microbes that are present in at least two disease groups. For example, 15 shared biomarkers are observed between the IR and IS subtypes of T2D (Fig. 5), with 38 specific microbial biomarkers in the IR group and 18 in the IS group. This finding suggests that different subtypes of T2D may have unique microbial ecological profiles, reflecting potential differences in disease mechanisms between these subtypes. Specifically, IR and IS subtypes may be influenced by different microbial communities, which could be associated with disease processes such as insulin resistance and glucose metabolism. Moreover, no shared microbial biomarkers are found between the two T2D subtypes and other diseases, which may indicate the microbial specificity of T2D and suggest its distinctiveness from other diseases in terms of microbiome composition. This finding may contribute to a better understanding of the pathogenic mechanisms of T2D and provides potential biomarker candidates for personalized therapy and precision medicine.

Fig. 5.

The visualization results of disease-specific and shared biomarkers based on all taxonomic levels. Shared biomarkers refer to microbes that are present in at least two disease groups

We also observe shared microbial biomarkers between CF and IBD, which may point to a common pathogenesis for both diseases [55]. In addition, we identify biomarkers shared between food allergies (milk, egg, and peanut allergies) and IBD. This finding could point to common mechanisms of these immune-mediated diseases at the microbial level. Understanding these connections may offer novel insights into immune response mechanisms and help develop potential cross-disease therapeutic strategies [56–58]. Through this visualization, we can observe the microbial biomarkers specific to each disease as well as those that co-occur in multiple diseases, thus providing potential support for understanding the pathological mechanisms of diseases and cross-disease commonalities.
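
The split into disease-specific and shared biomarkers follows directly from the definitions above: a marker is shared if it is screened in at least two disease groups and specific if it is screened in exactly one. A minimal sketch, assuming the screened markers are available as one set per disease group (the function name and the toy marker sets are hypothetical):

```python
from collections import Counter


def split_shared_specific(markers_by_disease):
    """Separate markers found in >= 2 disease groups from those found in only one."""
    counts = Counter(
        marker
        for markers in markers_by_disease.values()
        for marker in set(markers)  # count each marker once per disease group
    )
    shared = {marker for marker, count in counts.items() if count >= 2}
    specific = {
        disease: set(markers) - shared
        for disease, markers in markers_by_disease.items()
    }
    return shared, specific


# Hypothetical marker sets (not the screening results of the study).
toy_markers = {
    "IR": {"g__A", "g__B", "g__C"},
    "IS": {"g__B", "g__D"},
    "CF": {"g__E"},
}
shared, specific = split_shared_specific(toy_markers)
print(shared)
print(specific)
```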

By identifying diverse types of microbial biomarkers, our framework provides multi-perspective insights into the roles of microbes in disease onset and progression, offering potential targets for diagnosis, monitoring, and intervention. Due to the high sparsity and large proportion of unclassified taxa at the species level—which may compromise biological interpretability—our biomarker analyses primarily focus on higher taxonomic levels, including phylum, class, order, family, and genus.

Functional analysis

In our preliminary analysis, we have screened different types of biomarkers that provide valuable insights for a deeper understanding of disease mechanisms. However, identifying microbial community composition alone is insufficient to fully elucidate the roles of microbes in disease progression, particularly with regard to their functional potential. Therefore, to further explore the functional information of microbial communities, we use the Phylogenetic Investigation of Communities by Reconstruction of Unobserved States (PICRUSt) tool for functional analysis based on the screened differential biomarkers [59]. PICRUSt can infer functional characteristics of microbial communities from 16S rRNA gene sequences. Because PICRUSt relies on the Greengenes database and requires Greengenes IDs, we perform the functional analysis on the PROTECT dataset. The results are shown in Supplementary Figs. S11 and S12.

Our results demonstrate that different functions exhibit significant abundance differences at different taxonomic levels (see Supplementary Fig. S11A). In the KEGG function prediction analysis, the function Cell Motility shows a notably higher abundance at the family level than at other taxonomic levels. These findings underscore the necessity of fine-grained, multi-level taxonomic analysis to better understand the relationship between microbial classification and function.

In addition, we have analyzed between-group differences (UC active vs. inactive state) in the functions of the samples at the L2 and L3 levels of the KEGG prediction results (see Supplementary Fig. S11B and Fig. S11C) using STAMP software (default Welch’s t-test, p-value < 0.05) [60] and the OmicStudio cloud platform (default t-test, p-value < 0.05) [61], respectively. The results of both analyses are consistent, which supports the reliability of the screened differential pathways. The results obtained through the OmicStudio cloud platform are presented in this study.
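
For readers unfamiliar with Welch’s t-test, the statistic behind the between-group comparison can be written in a few lines. This is a standard textbook formulation using only the Python standard library, not the STAMP or OmicStudio implementation; it returns the t statistic and the Welch–Satterthwaite degrees of freedom, and the p-value would then be read from a Student t distribution with that many degrees of freedom. The sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two group means (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical predicted pathway abundances in two groups (not study data).
active = [1.0, 2.0, 3.0, 4.0, 5.0]
inactive = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(active, inactive)
print(round(t_stat, 4), round(dof, 4))
```

Unlike the pooled-variance t-test, this form does not assume equal variances in the two groups, which is why the degrees of freedom are generally not an integer.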

The results reveal significant functional differences between the UC active and inactive groups across various taxonomic levels. Specifically, at the L2 level, a total of 22 functional modules are significantly different, of which 18 functional modules are higher in the UC active group than in the inactive group, and 14 functional modules show the opposite trend. At the L3 level, a total of 92 functional modules show significant differences, of which 72 functional modules are higher in the UC active group than in the inactive group, and 32 functional modules are lower than in the inactive group. These results further support the potential relevance of the screened biomarkers. In addition, we observe that the same functional module can behave differently at different taxonomic levels. For example, at the phylum level, the L2 module Metabolism is higher in the active state of UC than in the inactive state, whereas the opposite trend is observed at the family level. This phenomenon may reflect the complexity of microbial community structure, indicating heterogeneous differences in microbiome data across taxonomic levels. Analyzing microbial communities at different levels may provide valuable insights into their dynamics across various spatial scales, reflecting their complexity and adaptability in different disease states, and potentially aiding in understanding the underlying biological mechanisms.

To further explore the relationship between microbiomes and functional modules, we have conducted a Spearman correlation analysis based on differential biomarkers and differential functional pathways, and the results are shown in Supplementary Fig. S12. By combining the results of the STAMP intergroup analysis and the Spearman correlation analysis, we observe a strong link between the abundance of p__Actinobacteria and p__Bacteroidetes and the Carbohydrate metabolism pathway, which has been shown to be crucial in the development of UC [62]. These findings indicate that p__Actinobacteria and p__Bacteroidetes may contribute to the pathogenesis of UC by modulating Carbohydrate metabolism.
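
The Spearman coefficient used here is the Pearson correlation of the ranks of the two variables, with tied values receiving their average rank. The sketch below spells this out with the standard library only; it illustrates the statistic itself rather than the pipeline used in the study, and the abundance vectors are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = tied_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical taxon abundance and pathway abundance across five samples.
taxon = [1.0, 2.0, 3.0, 4.0, 5.0]
pathway = [5.0, 6.0, 7.0, 8.0, 7.0]
rho = spearman(taxon, pathway)
print(round(rho, 4))
```

Because it works on ranks, the coefficient captures monotonic rather than strictly linear relationships, which suits skewed abundance data.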

In addition, we have found that the Amino acid metabolism pathway is closely related to c__Bacilli and c__Erysipelotrichi, and the Lipid metabolism pathway is closely associated with f__Ruminococcaceae, f__Bifidobacteriaceae, f__Erysipelotrichaceae, and f__Lachnospiraceae. It has been noted that the Lipid metabolism and Amino acid metabolism pathways undergo specific changes in active UC [63], which is consistent with our results and further supports the possibility that these pathways are involved in the pathophysiological processes of UC.

Meanwhile, we also identify a close correlation between the Energy metabolism pathway and p__Actinobacteria, p__Firmicutes, p__Tenericutes, and c__Betaproteobacteria. The Energy metabolism pathway has been shown to be down-regulated in UC [64], which is consistent with our findings and suggests that these taxa may influence the onset and progression of UC by regulating energy metabolism. Finally, we observe that the Protein kinases pathway shows significant differences in UC and is strongly associated with g__Blautia. Existing studies are consistent with this result and also indicate a potential involvement of the Protein kinases pathway in the pathogenesis of UC [65].

By analyzing functional differences at different taxonomic levels, this study characterizes the role of microbiomes at each level and their impact on functional expression. Overall, it reveals a complex relationship between the microbiome and functional modules in UC, particularly for the Carbohydrate metabolism, Amino acid metabolism, Lipid metabolism, Energy metabolism, and Protein kinases pathways. These results provide new insights into the pathogenesis of UC and may inform the development of future therapeutic strategies targeting these pathways.

Execution time

We also measure run times on the DIABIMMUNE dataset at different taxonomic levels. As can be seen from the results (see Supplementary Tables S8 and S9), the Mean and Median methods take negligible time to infer missing values. For the deep learning methods, run time generally increases as the taxonomic level becomes finer. Overall, SysLM-I achieves a more pronounced run-time advantage. SysLM-C shows some efficiency improvement in classification tasks, especially in comparison with DeepMicroGen. We continue to optimize its performance with a view to matching other state-of-the-art models in speed. Note that for the inference task, all models are run on an 11th Gen Intel(R) Core(TM) i7-11800H CPU @ 2.30 GHz with an NVIDIA GeForce RTX 3060 Laptop GPU. For the classification task, all models are run on a 14 vCPU Intel(R) Xeon(R) Gold 6348 CPU @ 2.60 GHz with an A800-80GB GPU.
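
Run-time comparisons of this kind can be reproduced with a small timing harness. The sketch below is an assumption about how such measurements might be taken, not the procedure used for Supplementary Tables S8 and S9: it reports the median wall-clock time over several repeats, and `mean_impute` is a hypothetical stand-in for the Mean baseline.

```python
import time
from statistics import median


def time_call(func, *args, repeats=5, **kwargs):
    """Median wall-clock time (seconds) of func(*args, **kwargs) over `repeats` runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return median(durations)


def mean_impute(series):
    """Hypothetical Mean baseline: replace None with the mean of observed values."""
    observed = [value for value in series if value is not None]
    fill = sum(observed) / len(observed)
    return [fill if value is None else value for value in series]


# Hypothetical abundance series with missing time points (not study data).
toy_series = [0.2, None, 0.4, None, 0.6] * 1000
elapsed = time_call(mean_impute, toy_series, repeats=5)
print(f"median run time: {elapsed:.6f} s")
```

Reporting the median rather than a single run reduces the influence of one-off delays such as cache warm-up.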

Discussion

Understanding dynamic host–microbiome interactions requires longitudinal data, but pervasive missing values and fragmented analyses often prevent comprehensive biological insights, calling for an integrated and interpretable framework. To address this challenge, we propose SysLM, a systematic framework integrating two complementary models: SysLM-I for missing data imputation and SysLM-C for causal analysis and biomarker identification. SysLM-I leverages advanced deep learning techniques to capture temporal dependencies and causal relationships, effectively restoring missing values while preserving data integrity. SysLM-C constructs interpretable causal spaces to explore associations between the microbiome and host health, supporting the identification and interpretation of potential disease-associated microbial biomarkers.

Through multi-scale temporal and taxonomic analyses from the phylum to genus levels, we identify biomarkers across different classification levels, contributing to a better understanding of potential pathological mechanisms. Functional pathway analyses further support the potential biological relevance of these biomarkers by implicating several metabolic pathways associated with ulcerative colitis, including Carbohydrate metabolism, Amino acid metabolism, Lipid metabolism, Energy metabolism, and pathways involving Protein kinases. These findings suggest that microbial communities may play an important role in disease progression and could inform future research into potential therapeutic strategies.

Despite the promising results demonstrated by the SysLM framework, several limitations should be acknowledged. First, the current model does not explicitly account for complex or varying missing mechanisms, and parameters tuned at higher-level taxonomic resolutions may not be optimal for sparse and variable lower-level data. Second, while our framework identifies biomarkers from the phylum to genus levels, species-level data are sparse and highly variable, which substantially reduces model performance and limits reliable species-level biomarker discovery. Additionally, further experimental validation is necessary, and future work will focus on integrating multi-omics data, expanding SysLM’s applicability, and optimizing the model to handle diverse missing mechanisms and sparse taxonomic levels, thereby enhancing both accuracy and generalizability.

Authors’ contributions

LW, GQ, YS, YM, JG designed the study. LW carried out analyses and wrote the program. LW and JG wrote the paper. All authors read and approved the final manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (Grant Nos. 12271216, 11831015, 92370131).

Data availability

The raw data used in this study are described in the Materials and Methods section under Datasets. Specifically, the DIABIMMUNE dataset is accessible via DIABIMMUNE Microbiome Project (https://diabimmune.broadinstitute.org/diabimmune), with DOI: 10.1016/j.cell.2016.04.007 and MicrobiomeDB (https://microbiomedb.org), with DOI: 10.1093/nar/gkx1027. The BONUS-CF dataset is available from MicrobiomeDB (https://microbiomedb.org), with DOI: 10.1038/s41591-019-0714-x and 10.1093/nar/gkx1027. The DiGiulio dataset is available with DOI: 10.1073/pnas.1502875112 and 10.1186/s12859-020-03803-z. The PROTECT dataset is available with DOI: 10.1016/j.chom.2018.09.009. The iHMP-IBD dataset is available via IBDMDB (http://ibdmdb.org), with DOI: 10.1016/j.chom.2014.08.014 and 10.1038/s41586-019-1237-9. The iHMP-T2D dataset is available with DOI: 10.1016/j.chom.2014.08.014. Both iHMP datasets are described in the HMPDACC multi-omic data resource article (10.1093/nar/gkaa996). To facilitate reproducibility, a representative processed dataset has been deposited in Figshare (10.6084/m9.figshare.30443732). The source code is available in a public GitHub repository: https://github.com/wang-124/SysLM-model.git.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Zheng D, Liwinski T, Elinav E. Interaction between microbiota and immunity in health and disease. Cell Res. 2020;30:492–506. 10.1038/S41422-020-0332-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Lugo-Martinez J, Ruiz-Perez D, Narasimhan G, Bar-Joseph Z. Dynamic interaction network inference from longitudinal Microbiome data. Microbiome. 2019;7:54. 10.1186/S40168-019-0660-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Kodikara S, Ellul S, Le Cao KA. Statistical challenges in longitudinal Microbiome data analysis. Brief Bioinform. 2022;23:bbac273. 10.1093/BIB/BBAC273. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Ren L, Wang T, Sekhari Seklouli A, Zhang H, Bouras A. A review on missing values for main challenges and methods. Inf Syst. 2023;119:102268. 10.1016/J.IS.2023.102268. [Google Scholar]
  • 5.Cao W, Zhou H, Wang D, Li Y, Li J, Li L. BRITS: bidirectional recurrent imputation for time series. Adv Neural Inf Process Syst. 2018;2018–December:6775–85. [Google Scholar]
  • 6.Yin K, Feng L, Cheung WK. Context-Aware time series imputation for Multi-Analyte clinical data. J Healthc Inf Res. 2020;4:411–26. 10.1007/S41666-020-00075-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Choi JM, Ji M, Watson LT, Zhang L. DeepMicroGen: a generative adversarial network-based method for longitudinal Microbiome data imputation. Bioinformatics. 2023;39:btad286. 10.1093/BIOINFORMATICS/BTAD286. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Fung DLX, Li X, Leung CK, Hu P. A self-knowledge distillation-driven CNN-LSTM model for predicting disease outcomes using longitudinal Microbiome data. Bioinf Adv. 2023;3:vbad059. 10.1093/BIOADV/VBAD059. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Maringanti VS, Bucci V, Gerber GK. MDITRE: scalable and interpretable machine learning for predicting host status from Temporal Microbiome dynamics. mSystems. 2022;7:e0013222. 10.1128/MSYSTEMS.00132-22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Chen X, Liu L, Zhang W, Yang J, Wong KC. Human host status inference from Temporal Microbiome changes via recurrent neural networks. Brief Bioinform. 2021;22:bbab223. 10.1093/BIB/BBAB223. [DOI] [PubMed] [Google Scholar]
  • 11.Sharma D, Xu W. PhyLoSTM: a novel deep learning model on disease prediction from longitudinal Microbiome data. Bioinformatics. 2021;37:3707–14. 10.1093/BIOINFORMATICS/BTAB482. [DOI] [PubMed] [Google Scholar]
  • 12.Metwally AA, Yu PS, Reiman D, Dai Y, Finn PW, Perkins DL. Utilizing longitudinal Microbiome taxonomic profiles to predict food allergy via long Short-Term memory networks. PLoS Comput Biol. 2019;15. 10.1371/JOURNAL.PCBI.1006693. [DOI] [PMC free article] [PubMed]
  • 13.Gerber GK. The dynamic Microbiome. FEBS Lett. 2014;588:4131–9. 10.1016/J.FEBSLET.2014.02.037. [DOI] [PubMed] [Google Scholar]
  • 14.Zhang S, Zheng D, Hu X, Yang M. Bidirectional Long Short-Term Memory Networks for Relation Classification. 2015.
  • 15.Bai S, Zico Kolter J, Koltun V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. 2018.
  • 16.Vatanen T, Kostic AD, D’Hennezel E, Siljander H, Franzosa EA, Yassour M, et al. Variation in Microbiome LPS immunogenicity contributes to autoimmunity in humans. Cell. 2016;165:842–53. 10.1016/J.CELL.2016.04.007/ATTACHMENT/FF71A98D-0A64-44FB-AB09-2F31A93D71D9/MMC4.PDF. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Oliveira FS, Brestelli J, Cade S, Zheng J, Iodice J, Fischer S, et al. MicrobiomeDB: a systems biology platform for integrating, mining and analyzing Microbiome experiments. Nucleic Acids Res. 2018;46:D684–91. 10.1093/NAR/GKX1027. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Hayden HS, Eng A, Pope CE, Brittnacher MJ, Vo AT, Weiss EJ, et al. Fecal dysbiosis in infants with cystic fibrosis is associated with early linear growth failure. Nat Med. 2020;26:215. 10.1038/S41591-019-0714-X. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.DiGiulio DB, Callahan BJ, McMurdie PJ, Costello EK, Lyell DJ, Robaczewska A, et al. Temporal and Spatial variation of the human microbiota during pregnancy. Proc Natl Acad Sci U S A. 2015;112:11060–5. 10.1073/PNAS.1502875112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Zhang X, Yi N. NBZIMM: negative binomial and zero-inflated mixed models, with application to microbiome/metagenomics data analysis. BMC Bioinformatics. 2020;21:488. 10.1186/S12859-020-03803-Z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Schirmer M, Denson L, Vlamakis H, Franzosa EA, Thomas S, Gotman NM, et al. Compositional and Temporal changes in the gut Microbiome of pediatric ulcerative colitis patients are linked to disease course. Cell Host Microbe. 2018;24:600–e6104. 10.1016/J.CHOM.2018.09.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.The Integrative Human Microbiome Project. Dynamic analysis of Microbiome-Host omics profiles during periods of human health and disease. Cell Host Microbe. 2014;16:276–89. 10.1016/J.CHOM.2014.08.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Lloyd-Price J, Arze C, Ananthakrishnan AN, Schirmer M, Avila-Pacheco J, Poon TW, et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature. 2019;569:655–62. 10.1038/S41586-019-1237-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Zhao K, Zhang L, Causality-inspired spatial-temporal. expla-nations for dynamic graph neural networks. 2024.
  • 25.Lin TY, Goyal P, Girshick R, He K, Dollar P. Focal loss for dense object detection. IEEE Trans Pattern Anal Mach Intell. 2017;42:318–27. 10.1109/TPAMI.2018.2858826. [DOI] [PubMed] [Google Scholar]
  • 26.Lin T, Song K, Jiang Z, Kang Y, Yuan W, Li X, et al. Towards human-like perception: learning structural causal model in heterogeneous graph. Inf Process Manag. 2024;61:103600. 10.1016/J.IPM.2023.103600. [Google Scholar]
  • 27.Zheng X, Aragam B, Ravikumar P, Xing EP. DAGs with NO TEARS: Continuous Optimization for Structure Learning. 2018.
  • 28.Fazlollahi M, Chun Y, Grishin A, Wood RA, Burks AW, Dawson P, et al. Early-life gut Microbiome and egg allergy. Allergy. 2018;73:1515–24. 10.1111/ALL.13389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Moriki D, Francino MP, Koumpagioti D, Boutopoulou B, Rufián-Henares JÁ, Priftis KN, et al. The role of the gut Microbiome in cow’s milk allergy: A clinical approach. Nutrients. 2022;14:4537. 10.3390/NU14214537. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Bunyavanich S, Shen N, Grishin A, Wood R, Burks W, Dawson P, et al. Early-life gut Microbiome composition and milk allergy resolution. J Allergy Clin Immunol. 2016;138:1122–30. 10.1016/J.JACI.2016.03.041. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Petrus NCM, Henneman P, Venema A, Mul A, Van Sinderen F, Haagmans M, et al. Cow’s milk allergy in Dutch children: an epigenetic pilot survey. Clin Transl Allergy. 2016;6. 10.1186/S13601-016-0105-Z. [DOI] [PMC free article] [PubMed]
  • 32.Goldberg MR, Mor H, Magid Neriya D, Magzal F, Muller E, Appel MY, et al. Microbial signature in IgE-mediated food allergies. Genome Med. 2020;12:92. 10.1186/S13073-020-00789-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Wang S, Zhang R, Li X, Gao Y, Dai N, Wei Y, et al. Relationship between maternal-infant gut microbiota and infant food allergy. Front Microbiol. 2022;13:933152. 10.3389/FMICB.2022.933152. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Martínez-Rodríguez S, Friaza V, Girón-Moreno RM, Gallego EQ, Salcedo-Posadas A, Figuerola-Mulet J, et al. Fungal microbiota dynamics and its geographic, age and gender variability in patients with cystic fibrosis. Clin Microbiol Infect. 2023;29:539.e1-539.e7. 10.1016/J.CMI.2022.11.001. [DOI] [PubMed] [Google Scholar]
  • 35.Xiao L, Zhang F, Zhao F. Large-scale Microbiome data integration enables robust biomarker identification. Nat Comput Sci. 2022;2:307–16. 10.1038/s43588-022-00247-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Moriki D, León ED, García-Gamero G, Jiménez-Hernández N, Artacho A, Pons X, et al. Specific gut Microbiome signatures in children with cow’s milk allergy. Nutrients. 2024;16:2752. 10.3390/NU16162752. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Ponda P, Cerise JE, Navetta-Modrov B, Kiehm J, Covelli GM, Weiss J, et al. The age-specific Microbiome of children with milk, egg, and peanut allergy. Ann Allergy Asthma Immunol. 2024;133:203–e2106. 10.1016/J.ANAI.2024.04.028. [DOI] [PubMed] [Google Scholar]
  • 38.Sousa AM, Pereira MO. Pseudomonas aeruginosa diversification during infection development in cystic fibrosis Lungs-A review. Pathogens. 2014;3:680–703. 10.3390/PATHOGENS3030680. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.You YA, Yoo JY, Kwon EJ, Kim YJ. Blood microbial communities during pregnancy are associated with preterm birth. Front Microbiol. 2019;10:1122. 10.3389/FMICB.2019.01122. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Rengarajan S, Vivio EE, Parkes M, Peterson DA, Roberson EDO, Newberry RD, et al. Dynamic Immunoglobulin responses to gut bacteria during inflammatory bowel disease. Gut Microbes. 2020;11:405–20. 10.1080/19490976.2019.1626683. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Machiels K, Joossens M, Sabino J, De Preter V, Arijs I, Eeckhaut V, et al. A decrease of the butyrate-producing species roseburia hominis and Faecalibacterium Prausnitzii defines dysbiosis in patients with ulcerative colitis. Gut. 2014;63:1275–83. 10.1136/GUTJNL-2013-304833. [DOI] [PubMed] [Google Scholar]
  • 42.Barberio B, Facchin S, Patuzzi I, Ford AC, Massimi D, Valle G, et al. A specific microbiota signature is associated to various degrees of ulcerative colitis as assessed by a machine learning approach. Gut Microbes. 2022;14:2028366. 10.1080/19490976.2022.2028366. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Takeuchi T, Kubota T, Nakanishi Y, Tsugawa H, Suda W, Kwon ATJ, et al. Gut microbial carbohydrate metabolism contributes to insulin resistance. Nature. 2023;621:389–95. 10.1038/S41586-023-06466-X. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Burke DG, Fouhy F, Harrison MJ, Rea MC, Cotter PD, O’Sullivan O, et al. The altered gut microbiota in adults with cystic fibrosis. BMC Microbiol. 2017;17:1–11. 10.1186/S12866-017-0968-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Loh G, Blaut M. Role of commensal gut bacteria in inflammatory bowel diseases. Gut Microbes. 2012;3:544–55. 10.4161/GMIC.22156. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Kang DY, Park JL, Yeo MK, Kang SB, Kim JM, Kim JS, et al. Diagnosis of crohn’s disease and ulcerative colitis using the Microbiome. BMC Microbiol. 2023;23:336. 10.1186/S12866-023-03084-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Cougnoux A, Movassaghi M, Picache JA, Iben JR, Navid F, Salman A, et al. Gastrointestinal tract pathology in a BALB/c Niemann-Pick disease type C1 null mouse model. Dig Dis Sci. 2018;63:870–80. 10.1007/S10620-018-4914-X. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Moradi S, Bagheri R, Amirian P, Zarpoosh M, Cheraghloo N, Wong A, et al. Effects of spirulina supplementation in patients with ulcerative colitis: a double-blind, placebo-controlled randomized trial. BMC Complement Med Ther. 2024;24:109. 10.1186/S12906-024-04400-W. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Levine A, Wine E, Assa A, Sigall Boneh R, Shaoul R, Kori M, et al. Crohn’s disease exclusion diet plus partial enteral nutrition induces sustained remission in a randomized controlled trial. Gastroenterology. 2019;157:440–e4508. 10.1053/J.GASTRO.2019.04.021. [DOI] [PubMed] [Google Scholar]
  • 50.Cui L, Guan X, Ding W, Luo Y, Wang W, Bu W, et al. Scutellaria baicalensis Georgi polysaccharide ameliorates DSS-induced ulcerative colitis by improving intestinal barrier function and modulating gut microbiota. Int J Biol Macromol. 2021;166:1035–45. 10.1016/J.IJBIOMAC.2020.10.259. [DOI] [PubMed] [Google Scholar]
  • 51.Becker F, Gavins FNE, Fontenot J, Jordan P, Yun JY, Scott R, et al. Dynamic gut Microbiome changes following regional intestinal lymphatic obstruction in primates. Pathophysiology. 2019;26:253–61. 10.1016/J.PATHOPHYS.2019.06.004. [DOI] [PubMed] [Google Scholar]
  • 52.Zhang T, Kayani M, ur R, Hong L, Zhang C, Zhong J, Wang Z, et al. Dynamics of the salivary Microbiome during different phases of crohn’s disease. Front Cell Infect Microbiol. 2020;10:544704. 10.3389/FCIMB.2020.544704. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Wang Y, Gao X, Ghozlane A, Hu H, Li X, Xiao Y, et al. Characteristics of faecal microbiota in paediatric Crohn’s disease and their dynamic changes during Infliximab therapy. J Crohns Colitis. 2018;12:337–46. 10.1093/ECCO-JCC/JJX153.
  • 54. Zhou X, Shen X, Johnson JS, Spakowicz DJ, Agnello M, Zhou W, et al. Longitudinal profiling of the microbiome at four body sites reveals core stability and individualized dynamics during health and disease. Cell Host Microbe. 2024;32:506–526.e9. 10.1016/J.CHOM.2024.02.012.
  • 55. Lee JM, Leach ST, Katz T, Day AS, Jaffe A, Ooi CY. Update of faecal markers of inflammation in children with cystic fibrosis. Mediators Inflamm. 2012;2012:948367. 10.1155/2012/948367.
  • 56. Grzybowska-Chlebowczyk U, Woś H, Sieroń AL, Wiȩcek S, Auguściak-Duma A, Koryciak-Komarska H, et al. Serologic investigations in children with inflammatory bowel disease and food allergy. Mediators Inflamm. 2009;2009:512695. 10.1155/2009/512695.
  • 57. Frehn L, Jansen A, Bennek E, Mandic AD, Temizel I, Tischendorf S, et al. Distinct patterns of IgG and IgA against food and microbial antigens in serum and feces of patients with inflammatory bowel diseases. PLoS ONE. 2014;9:e106750. 10.1371/JOURNAL.PONE.0106750.
  • 58. Hu S, Bourgonje AR, Gacesa R, Jansen BH, Björk JR, Bangma A, et al. Mucosal host-microbe interactions associate with clinical phenotypes in inflammatory bowel disease. Nat Commun. 2024;15:1–14. 10.1038/s41467-024-45855-2.
  • 59. Langille MGI, Zaneveld J, Caporaso JG, McDonald D, Knights D, Reyes JA, et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat Biotechnol. 2013;31:814–21. 10.1038/NBT.2676.
  • 60. Parks DH, Beiko RG. Identifying biologically relevant differences between metagenomic communities. Bioinformatics. 2010;26:715–21. 10.1093/BIOINFORMATICS/BTQ041.
  • 61. Lyu F, Han F, Ge C, Mao W, Chen L, Hu H, et al. OmicStudio: A composable bioinformatics cloud platform with real-time feedback that can generate high-quality graphs for publication. iMeta. 2023;2:e85. 10.1002/IMT2.85.
  • 62. Camarillo GF, Goyon EI, Zuñiga RB, Salas LAS, Escárcega AEP, Yamamoto-Furusho JK. Gene expression profiling of mediators associated with the inflammatory pathways in the intestinal tissue from patients with ulcerative colitis. Mediators Inflamm. 2020;2020:9238970. 10.1155/2020/9238970.
  • 63. Cui D, Han X, Jin J, Wang Y, Chen Z, Gong Y, et al. Metabolite profiling in assessing ulcerative colitis activity: A systematic review. Hum Nutr Metabolism. 2025;39:200298. 10.1016/J.HNM.2025.200298.
  • 64. Noh JY, Farhataziz N, Kinter MT, Yan X, Sun Y. Colonic dysregulation of major metabolic pathways in experimental ulcerative colitis. Metabolites. 2024;14:194. 10.3390/METABO14040194.
  • 65. Zhang M, Zhou J, Wang H, He L, Wang J, Yang X, et al. Exploration of the shared pathways and common biomarker PAN3 in ankylosing spondylitis and ulcerative colitis using integrated bioinformatics analysis. Front Immunol. 2023;14:1089622. 10.3389/FIMMU.2023.1089622.

Data Availability Statement

The raw data used in this study are described in the Datasets subsection of Materials and Methods. The DIABIMMUNE dataset is accessible via the DIABIMMUNE Microbiome Project (https://diabimmune.broadinstitute.org/diabimmune; DOI: 10.1016/j.cell.2016.04.007) and MicrobiomeDB (https://microbiomedb.org; DOI: 10.1093/nar/gkx1027). The BONUS-CF dataset is available from MicrobiomeDB (https://microbiomedb.org), with DOIs 10.1038/s41591-019-0714-x and 10.1093/nar/gkx1027. The DiGiulio dataset is available with DOIs 10.1073/pnas.1502875112 and 10.1186/s12859-020-03803-z. The PROTECT dataset is available with DOI 10.1016/j.chom.2018.09.009. The iHMP-IBD dataset is available via IBDMDB (http://ibdmdb.org), with DOIs 10.1016/j.chom.2014.08.014 and 10.1038/s41586-019-1237-9. The iHMP-T2D dataset is available with DOI 10.1016/j.chom.2014.08.014. Both iHMP datasets are described in the HMPDACC multi-omic data resource article (DOI 10.1093/nar/gkaa996). To facilitate reproducibility, a representative processed dataset has been deposited in Figshare (DOI 10.6084/m9.figshare.30443732). The source code is available in a public GitHub repository: https://github.com/wang-124/SysLM-model.git.


Articles from BMC Genomics are provided here courtesy of BMC
