Abstract
Background
Asthma is a heterogeneous disease with high morbidity. Advancement in high-throughput multi-omics approaches has enabled the collection of molecular assessments at different layers, providing a complementary perspective of complex diseases. Numerous computational methods have been developed for the omics-based patient classification or disease outcome prediction. Yet, a systematic benchmarking of those methods using various combinations of omics data for the prediction of asthma development is still lacking.
Objective
We aimed to investigate the computational methods in disease status prediction using multi-omics data.
Method
We systematically benchmarked 18 computational methods using all the 63 combinations of six omics data (GWAS, miRNA, mRNA, microbiome, metabolome, DNA methylation) collected in The Vitamin D Antenatal Asthma Reduction Trial (VDAART) cohort. We evaluated each method using standard performance metrics for each of the 63 omics combinations.
Results
Our results indicate that overall Logistic Regression, Multi-Layer Perceptron, and MOGONET display superior performance, and the combination of transcriptional, genomic and microbiome data achieves the best prediction. Moreover, we find that including the clinical data can further improve the prediction performance for some but not all the omics combinations.
Conclusions
Specific omics combinations can achieve the best prediction of asthma development in children, and certain computational methods perform better than others.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12931-023-02368-8.
Keywords: Asthma, Disease status, Prediction, Multi-omics
Background
Asthma is a chronic condition characterized by wheezing, coughing and reversible airflow obstruction [1]. The global prevalence, morbidity, mortality, and economic burden associated with asthma have been increasing in the past decades [2]. Advances in high-throughput sequencing technologies have made molecular assessments available at the genome, epigenome, transcriptome, proteome, metabolome, and microbiome levels, providing the potential for a comprehensive understanding of human health and diseases [3–6]. Prediction of disease status, including asthma, is critical for understanding the etiology of the disease, discovering molecular biomarkers and subsequently identifying suitable interventions. Integrated approaches that combine multi-omics data from different biological layers might improve our ability to bridge the gap from genotype to phenotype [7–10].
Numerous computational methods have been developed to classify patients using their single- or multi-omics data. For example, ensemble-based methods, random forest, and gradient-boosted decision trees have outperformed approaches that use only single-omics data or that directly concatenate the features from different omics data types in multi-omics classification tasks [11–13]. Moreover, several deep learning-based methods have been proposed for classification in biomedical applications, achieving higher performance than existing supervised multi-omics integration methods in various classification tasks [14, 15]. However, those computational methods have not previously been benchmarked on various combinations of omics data for disease status prediction. Note that for disease status prediction, the omics data are collected before the disease onset, which is fundamentally different from the patient classification problem, where the omics data are collected after the disease onset.
Here, we compared different disease status prediction methods (using standard performance metrics) on six different types of omics data collected in The Vitamin D Antenatal Asthma Reduction Trial (VDAART) cohort [16]. Our aim is to identify the best prediction method and the best combination of omics data for the prediction of asthma development (see Fig. 1). Our results indicate that Logistic Regression, Multi-Layer Perceptron, and Graph Neural Network-based method MOGONET display superior performance and the combinations of transcriptional, genomic and microbiome data can yield the best prediction of asthma development. Moreover, we found that including the clinical covariates can further improve the prediction performance for some (but not all) omics combinations.
Methods
VDAART cohort
VDAART is a clinical trial designed to test the hypothesis that vitamin D supplementation in pregnant women prevents the development of asthma and allergies in their children [17, 18]. Pregnant women between 18 and 40 years of age and at an estimated gestational age between 10 and 18 weeks were recruited at three clinical centers: Boston Medical Center, Washington University at Saint Louis, and Kaiser Permanente Southern California Region. In the VDAART study, six types of omics data were collected from the children: (1) GWAS: genome-wide SNP genotyping data and genome-wide association study analysis results. Genotyping of children in VDAART was performed on the Illumina Infinium HumanOmniExpressExome BeadChip, and SNP genotypes were called using the Illumina GenCall software. (2) Child miRNA (cord blood) and (3) child mRNA transcriptomics (cord blood). Total RNA was isolated from samples using the Qiagen miRNeasy Serum/Plasma extraction kit and QIAcube automation. Small RNA sequencing libraries were prepared using the Norgen Biotek Small RNA Library Prep Kit and then sequenced on the Illumina NextSeq 500 platform with 51 bp single-end reads. (4) Child microbiome at 3–6 months. DNA extractions were performed on stool samples, and the bacterial 16S rRNA gene (V3 to V5 hypervariable regions) was amplified. (5) Child metabolomics at 1 year. Nontargeted global metabolomic profiles were generated at Metabolon Inc. using ultra-performance liquid chromatography–tandem mass spectrometry (UPLC-MS/MS). (6) Child DNA methylation data (cord blood). DNA was extracted from cord blood and peripheral blood using the Qiagen Puregene Kit (Valencia, CA, USA) and bisulfite converted using the EZ DNA Methylation-Gold Kit (Zymo Research, Irvine, CA, USA). We randomized samples by chips and plates and generated DNA methylation data using the Infinium HumanMethylation450 BeadChip (Illumina, San Diego, CA, USA).
Among the 748 child participants in VDAART, 102 participants (13.6%) have all six types of omics data available. Among the six omics data types, GWAS data has the largest sample size (see Fig. 2). Postnatally, questionnaires were administered to the mother by telephone every 3 months up to the child’s third birthday; they inquired about the health of the infant and child, especially the occurrence of wheezing illnesses and of asthma and allergy symptoms and diagnoses. Yearly in-person visits for the child were used to obtain questionnaire data, take anthropometric measurements, and collect blood. Here, we applied various machine learning models to predict the children’s asthma status at year 3 using the six omics data types collected at or before year 1. Assessment of asthma was based on a doctor’s diagnosis, defined as a positive response to a direct question to the mother at any time in the first three years of the child’s life. As recent symptoms may help identify young children with significant asthma [19], a more specific definition of doctor’s diagnosis plus symptoms and medication use in the past was used. In addition, the following were also collected in the VDAART study: vitamin D levels in the blood of both the mother (through measurement of 25(OH)D levels in cord blood at delivery) and the child (at year 1); and other relevant covariates, e.g., maternal asthma, race and clinical center (see Table 1 for characteristics).
Table 1.
Characteristics | Healthy (n = 249) | Asthmatic (n = 83) |
---|---|---|
Gender | ||
Male | 128 | 33 |
Female | 121 | 50 |
Race | ||
Asian | 20 | 2 |
Black, African American | 100 | 42 |
Native Hawaiian | 2 | 1 |
White | 86 | 30 |
Others | 40 | 6 |
Mother’s age (year) | 28.11 ± 5.66 | 27.15 ± 6.09 |
Mother’s gestation age in days, at enrollment | 96.51 ± 18.78 | 101.11 ± 19.76 |
Vitamin D blood values (ng/ml) at enrollment visit | 24.42 ± 10.46 | 23.00 ± 9.99 |
Site name | ||
Boston Medical Center | 45 | 25 |
Kaiser Permanente Southern California Region | 104 | 23 |
Washington University at Saint Louis | 100 | 35 |
Prediction methods and performance evaluation
We leveraged several classical classifiers in scikit-learn [20], e.g., k-Nearest Neighbors (KNN), Logistic Regression (LR), LRCV (Logistic Regression with built-in cross-validation), Random Forest (RF), Multi-Layer Perceptron (MLP) and Gradient Boosting. We also considered two state-of-the-art deep learning methods: MOGONET [14] and Tabnet [21]. In addition, we evaluated LR-VAE (Variational AutoEncoder) and LRCV-VAE, which compress the input dimension of the miRNA, mRNA, microbiome, metabolomics and DNA methylation data to 5 via a variational autoencoder, an approach that has been heavily used for dimension reduction of biological data [22, 23] (see Table 2 for the list of prediction methods). To compare the performance of different methods on the prediction of asthma status, we first split the subjects into two groups for the following evaluation purposes: (1) Hold-out validation: among the 102 subjects that have all six omics data types available, we randomly chose 16 cases, then randomly selected 16 controls matched to the cases on race and clinical center. (2) Cross-validation: fivefold cross-validation was used to evaluate the performance of each classification method on the remaining subjects (300 in total). To evaluate the performance of each method, we used standard classification performance metrics: (1) Accuracy; (2) F1-score; (3) AUROC: Area Under the Receiver Operating Characteristic (ROC) curve; and (4) AUPRC: Area Under the Precision-Recall Curve (PRC).
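The cross-validation protocol and the four metrics can be illustrated with a minimal sketch. It assumes scikit-learn, synthetic data in place of the VDAART omics, stratified folds, a 0.5 decision threshold, and average precision as the estimate of AUPRC; none of these choices is confirmed by the text, and `cross_validate_metrics` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import StratifiedKFold


def cross_validate_metrics(model, X, y, n_splits=5, seed=0):
    """Return per-fold Accuracy, F1, AUROC and AUPRC for a binary classifier."""
    # Assumption: stratified folds, so every test fold contains both classes.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {"accuracy": [], "f1": [], "auroc": [], "auprc": []}
    for train_idx, test_idx in cv.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        prob = model.predict_proba(X[test_idx])[:, 1]  # P(asthmatic)
        pred = (prob >= 0.5).astype(int)               # assumed threshold
        scores["accuracy"].append(accuracy_score(y[test_idx], pred))
        scores["f1"].append(f1_score(y[test_idx], pred, zero_division=0))
        scores["auroc"].append(roc_auc_score(y[test_idx], prob))
        # Average precision is used here as the AUPRC estimate.
        scores["auprc"].append(average_precision_score(y[test_idx], prob))
    return {name: np.array(vals) for name, vals in scores.items()}


# Synthetic stand-in: 300 subjects, roughly 25% cases (not VDAART data).
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           weights=[0.75, 0.25], random_state=0)
results = cross_validate_metrics(LogisticRegression(max_iter=1000), X, y)
for name, vals in results.items():
    print(f"{name}: mean over folds = {vals.mean():.3f}")
```

Because cases are the minority class, Accuracy alone can look favorable for a classifier that mostly predicts "healthy", which is one reason to report F1 and AUPRC alongside it.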
Table 2.
Method | Description | Refs. |
---|---|---|
Linear models | ||
LR | Logistic Regression models the probability that an object belongs to a class by taking the log-odds of the class to be a linear combination of the features | [42] |
LRCV | Logistic Regression with built-in cross-validation support to find the optimal parameters | [42] |
LR-VAE | Logistic Regression with features reduced using a VAE (Variational AutoEncoder) | [43, 44] |
LRCV-VAE | Logistic Regression with built-in cross-validation support to find the optimal parameters and with features reduced using a VAE | [43, 44] |
Nearest neighbors | ||
KNN | k-nearest neighbors algorithm that assigns an object to the most common class among its k nearest neighbors | [45] |
Support vector machine | ||
SVC | C-Support Vector Classification is a method for classification by constructing a set of hyperplanes in high dimensional space | [46] |
Ensemble methods | ||
AdaBoost | The AdaBoost algorithm is an iterative procedure that tries to approximate the Bayes classifier by combining many weak classifiers | [47, 48] |
GTB | The learning procedure in Gradient Tree Boosting consecutively fits new models to provide a more accurate estimate of the response variable | [49, 50] |
RF | Random forest is an ensemble classifier that constructs many decision trees; the final prediction is the class selected by most trees | [51] |
Bagging | Bagging algorithm is a method for generating multiple versions of a predictor, then using these predictions to get an aggregated predictor | [52] |
Ensemble | Aggregate the predictions of all other classifiers together. The continuous probability of a subject being asthmatic is the average probabilities of 15 methods, and a subject is predicted as asthmatic if it was predicted as asthmatic by at least 7 methods | |
Decision trees | ||
DecisionTree | Decision Trees predict the response value by learning simple decision rules inferred from the data features | [53] |
ERT | An extremely randomized tree classifier is a tree-based ensemble method consisting of randomizing strongly both attribute and cut point choice | [54] |
Naïve Bayes | ||
BernoulliNB | Implements the Naïve Bayes training and classification for data that is distributed based on multivariate Bernoulli distribution | [55] |
GaussianNB | Implements the Naïve Bayes training and classification for data that is distributed based on multivariate Gaussian distribution | [56] |
Neural networks | ||
MLP | Multi-layer Perceptron is a fully connected feedforward neural network with at least three layers | [57] |
MOGONET | MOGONET is a multi-omics data analysis framework for classification tasks utilizing graph convolutional networks | [14] |
Tabnet | Tabnet uses a canonical deep neural network architecture for tabular data with interpretability | [21] |
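
The Ensemble rule described in Table 2 (average the probabilities of the 15 base methods, and call a subject asthmatic when at least 7 methods do) can be sketched as follows. The function name `ensemble_predict` and the random inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def ensemble_predict(probs, preds, min_votes=7):
    """Combine base classifiers by probability averaging and vote counting.

    probs: (n_methods, n_subjects) predicted probabilities of asthma.
    preds: (n_methods, n_subjects) binary calls (1 = asthmatic).
    """
    probs = np.asarray(probs, dtype=float)
    preds = np.asarray(preds, dtype=int)
    mean_prob = probs.mean(axis=0)             # continuous ensemble score
    votes = preds.sum(axis=0)                  # methods calling "asthmatic"
    label = (votes >= min_votes).astype(int)   # at least 7 of 15 by default
    return mean_prob, label


# Illustrative input: 15 hypothetical methods scoring 4 subjects.
rng = np.random.default_rng(0)
probs = rng.uniform(size=(15, 4))
preds = (probs >= 0.5).astype(int)
mean_prob, label = ensemble_predict(probs, preds)
print(mean_prob.round(2), label)
```

Note that the vote threshold of 7 out of 15 is slightly below a strict majority, so the ensemble label leans toward calling a subject asthmatic.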
Feature selection
Omics data is typically high-dimensional in the sense that the number of features is significantly larger than the number of samples [24, 25]. Feature selection can filter out irrelevant and redundant features by identifying a subset of relevant features [26]. In addition, using fewer features as inputs to machine learning models reduces the risk of over-fitting. Numerous methods can be used for feature selection, e.g., univariate statistical testing, feature variance, Random Forest importance ranking, and information-theoretic measures [15, 27]. Here, we used the Wilcoxon rank-sum test on the cross-validation subjects to identify the key features of the count data (miRNA, mRNA and microbiome data), owing to its solid False Discovery Rate (FDR) control and good power [28] (see Additional file 1: Sect. 2 for details of the statistical analysis). For each of those data types, the 300 features with the lowest p-values were selected, so that the number of features is comparable to the number of subjects (249 healthy controls and 83 asthmatic cases). For the continuous metabolomics and methylation data, we used the feature variance to identify the 300 features with the largest variance across subjects [29]. We reduced the genetic data to 4 polygenic scores (PGS) computed in previous work [30] and 2 SNPs (rs4795399 and rs117097909) in the established 17q21 locus [31].
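The two selection rules described above can be sketched in a few lines. This is an illustration on synthetic data, assuming SciPy's `ranksums` as the rank-sum implementation; the helper names `top_k_by_ranksum` and `top_k_by_variance` are hypothetical. In practice the selection would be fitted on the cross-validation subjects only, so that the hold-out subjects do not influence which features are kept.

```python
import numpy as np
from scipy.stats import ranksums


def top_k_by_ranksum(X, y, k=300):
    """Indices of the k features with the smallest Wilcoxon rank-sum p-values."""
    cases, controls = X[y == 1], X[y == 0]
    pvals = np.array([ranksums(cases[:, j], controls[:, j]).pvalue
                      for j in range(X.shape[1])])
    return np.argsort(pvals)[:k]


def top_k_by_variance(X, k=300):
    """Indices of the k features with the largest variance across subjects."""
    return np.argsort(X.var(axis=0))[::-1][:k]


rng = np.random.default_rng(0)
y = np.array([0] * 60 + [1] * 20)

# Synthetic count data: features 0-2 are shifted upward in cases.
counts = rng.poisson(5, size=(80, 1000)).astype(float)
counts[y == 1, :3] += 10
selected_counts = top_k_by_ranksum(counts, y, k=10)

# Synthetic continuous data: feature 7 has inflated variance.
continuous = rng.normal(size=(80, 500))
continuous[:, 7] *= 10
selected_cont = top_k_by_variance(continuous, k=5)

print(sorted(selected_counts.tolist()), selected_cont.tolist())
```

Unlike the rank-sum filter, the variance filter never looks at the labels, so it cannot leak outcome information into the selected features.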
Omics data imputation
Since not all six omics data types are available for each subject, we first performed data imputation so that each prediction method was evaluated on the same set of subjects, enabling us to systematically examine the capability of each omics in the prediction of asthma development. To keep as much omics data unimputed as possible while maximizing the number of subjects, we selected the subjects for whom the following three omics data types were all available: GWAS, DNA methylation and microbiome. Then, we imputed the miRNA, mRNA and metabolomics data using each of the following three methods: (1) median imputation: the missing value of a feature is replaced with the median value of the other samples. (2) TOBMI [32] (trans-omics block missing data imputation): the missing data of a subject in one omics is the weighted combination of the k nearest neighbors identified from another omics. Here, the missing values of miRNA and mRNA were imputed using a k-nearest neighbors (KNN) weighted method, where the gene expression of a subject with missing data is the weighted combination of the k nearest neighbors identified using the DNA methylation data. We applied the same idea to impute the metabolomics data using the microbiome data; in this case, the distance matrix was constructed from the microbiome data. (3) missForest [33]: an iterative imputation method based on a random forest classifier. Of the subjects, 66% were missing one omics data type, 28% were missing two omics data types, and only 5% were missing all three omics data types. Note that imputing the missing data on the original omics data requires considerable computational effort, so we performed the imputation after the feature selection. We emphasize that the imputation here is subject based, in the sense that the entire omics of some subjects was missing, rather than only a few features within an omics. Therefore, some traditional imputation methods, e.g., k-nearest neighbors, cannot be directly utilized.
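The idea behind subject-level, cross-omics imputation can be sketched as below. This is a simplified illustration in the spirit of TOBMI, not the published algorithm: the Euclidean distance, the inverse-distance weights, the value of k, and the function name `cross_omics_knn_impute` are all assumptions made for the example.

```python
import numpy as np


def cross_omics_knn_impute(reference, target, k=5):
    """Fill subjects whose entire target-omics row is missing (all NaN).

    reference: (n_subjects, p) fully observed omics (e.g. DNA methylation).
    target:    (n_subjects, q) omics in which some rows are entirely NaN.
    """
    reference = np.asarray(reference, dtype=float)
    target = np.array(target, dtype=float)        # copy; input is not modified
    missing = np.isnan(target).all(axis=1)
    donors = np.flatnonzero(~missing)             # subjects with observed data
    k = min(k, donors.size)
    for i in np.flatnonzero(missing):
        # Distance to every donor, measured in the reference omics.
        dist = np.linalg.norm(reference[donors] - reference[i], axis=1)
        nearest = np.argsort(dist)[:k]
        weights = 1.0 / (dist[nearest] + 1e-12)   # assumed inverse-distance weights
        weights /= weights.sum()
        target[i] = weights @ target[donors[nearest]]
    return target


# Toy example: subjects 1 and 3 lack the target omics entirely.
reference = np.array([[0.0], [0.1], [10.0], [10.1]])
target = np.array([[1.0, 1.0],
                   [np.nan, np.nan],
                   [5.0, 5.0],
                   [np.nan, np.nan]])
imputed = cross_omics_knn_impute(reference, target, k=1)
print(imputed)
```

With k = 1 each missing subject simply inherits the profile of its closest donor in the reference omics; larger k blends several donors and smooths the imputed profile.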
Results
Healthy and asthmatic children show differences in their multi-omics profiles
We first examined the differences in the imputed multi-omics profiles between the healthy controls (n = 249) and asthmatic cases (n = 83). We found a significant difference between the distributions of the healthy and asthmatic groups using the t-SNE visualization (permutational multivariate analysis of variance, PERMANOVA), regardless of the imputation method (see Fig. 3).
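For readers unfamiliar with PERMANOVA, the test statistic and its permutation p-value can be sketched as follows. The example uses synthetic data and assumes Euclidean distances and 999 permutations; the distance measure and permutation count used in the study are not stated here, and `pseudo_f` and `permanova` are hypothetical helper names.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def pseudo_f(d2, labels):
    """One-way PERMANOVA pseudo-F from squared distances and group labels."""
    n = len(labels)
    groups = np.unique(labels)
    ss_total = d2[np.triu_indices(n, k=1)].sum() / n
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = d2[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(X, labels, n_perm=999, seed=0):
    """Return (pseudo-F, permutation p-value) for group differences in X."""
    labels = np.asarray(labels)
    d2 = squareform(pdist(X, metric="euclidean")) ** 2  # assumed metric
    f_obs = pseudo_f(d2, labels)
    rng = np.random.default_rng(seed)
    exceed = sum(pseudo_f(d2, rng.permutation(labels)) >= f_obs
                 for _ in range(n_perm))
    return f_obs, (exceed + 1) / (n_perm + 1)


# Synthetic profiles: 30 "healthy" and 10 "asthmatic" subjects, shifted means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 5)),
               rng.normal(1.5, 1.0, size=(10, 5))])
labels = np.array([0] * 30 + [1] * 10)
f_obs, p_value = permanova(X, labels)
print(f"pseudo-F = {f_obs:.2f}, p = {p_value:.3f}")
```

The p-value is bounded below by 1 / (n_perm + 1), so the number of permutations sets the smallest p-value that can be reported.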
There are four consistently high-performing methods in the cross-validations
Among all tested methods in fivefold cross-validation, we found that LR, LRCV, MLP and MOGONET show relatively higher performance across all four evaluation metrics (with median imputation). For example, the highest Accuracy, F1, AUROC and AUPRC of LRCV across the fivefold cross-validation are 0.92, 0.8, 0.96 and 0.89, respectively. MOGONET, a novel multi-omics integrative method that jointly explores omics-specific learning and cross-omics correlation learning based on Graph Convolutional Networks (GCN), shows similar performance to LRCV (see Fig. 4; Additional file 1: Fig. S1). In particular, we found that the performance of those top-ranking methods is robust to different imputation methods (see Additional file 1: Fig. S2, S3 for missForest and TOBMI imputation). The higher performance of those four methods implies that predicting children’s asthma development by leveraging the rich information in multi-omics data is feasible.
Transcriptional and genomic data are critical for asthma prediction
Figure 4 shows the predictive performance of each prediction method across all possible combinations of the six omics data types. We observed that the prediction performance largely depends on the omics used. To examine the importance of different omics combinations for predicting children’s asthma status, we ranked the 63 combinations of the six omics data types by their median performance across all prediction methods. Interestingly, we found a consistent omics importance ranking across the four evaluation metrics: mRNA alone, and combinations of GWAS, miRNA and mRNA, achieve the highest performance. In particular, mRNA alone ranks highest for Accuracy, AUROC and AUPRC (see Additional file 1: Fig. S4). Furthermore, we measured the importance of each feature (such as a gene, mRNA or miRNA) using MOGONET. Since MOGONET with the omics combination of genome, miRNA and mRNA data yields the overall best performance, we selected this omics combination and computed the feature importance in MOGONET as the decrease in performance (e.g., in F1 score) after the feature is removed. We found that biomarkers (i.e., features with high importance scores) identified by MOGONET have also been associated with asthma (see Additional file 1: Table S1). For example, hsa-miR-581, a microRNA downregulated in severe asthma, is associated with forced expiratory volume in 1 s (FEV1) and immune inflammation [34]. In addition, hsa-miR-376c-3p, hsa-miR-374b-5p and hsa-miR-374c-5p, among others, are circulating microRNAs associated with lung function in asthma [35]. Compared with those from healthy controls, bronchial smooth muscle cells from asthmatic patients express different levels of hsa-miR-376a-3p and hsa-miR-330-5p [36]. ENSG00000267174 is a long noncoding RNA (lncRNA), and many lncRNAs have been shown to be associated with asthma severity or inflammatory phenotype [37]. ENSG00000004139 can regulate cell survival and cytokine release after inflammasome activation [38].
Again, we found that those top-ranking omics combinations are quite robust to different imputation methods. These results suggest that accurate prediction of asthma development in children does not require collecting as many omics data types as possible. Instead, combining transcriptional and genomic data can yield superior performance for predicting asthma development at year 3.
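The count of 63 follows from taking every non-empty subset of six omics types, i.e. 2^6 - 1 = 63. A short sketch of the enumeration and of ranking combinations by their median score across methods is given below; the scores in `toy_scores` are made-up numbers for illustration only, not results from this study, and the helper names are hypothetical.

```python
from itertools import combinations
from statistics import median

OMICS = ["GWAS", "miRNA", "mRNA", "microbiome", "metabolome", "methylation"]


def all_combinations(items):
    """Every non-empty subset of items, as tuples."""
    return [combo for r in range(1, len(items) + 1)
            for combo in combinations(items, r)]


def rank_by_median(scores):
    """scores: {combo: {method: metric}} -> combos sorted best first."""
    return sorted(scores, key=lambda c: median(scores[c].values()),
                  reverse=True)


combos = all_combinations(OMICS)
print(len(combos))  # 63

# Made-up metric values for two combinations and three methods.
toy_scores = {
    ("mRNA",): {"LR": 0.8, "MLP": 0.7, "KNN": 0.6},
    ("GWAS", "mRNA"): {"LR": 0.9, "MLP": 0.5, "KNN": 0.6},
}
ranking = rank_by_median(toy_scores)
print(ranking)
```

Ranking by the median rather than the maximum keeps a single exceptionally strong method from deciding which omics combination is considered best.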
Different imputation methods produce a similar performance
Although multi-omics analysis can reveal connections between biomolecules from different layers of omics data, one of the key challenges in multi-omics approaches is missing values within and across the omics data. Missing values across omics are a particular concern because they result in different sample sizes among the omics, which requires imputation before downstream analyses, e.g., classification. We compared the prediction performance of each prediction method using all 63 omics combinations imputed with the three different methods, and found that median and TOBMI imputation achieve significantly higher AUPRC than missForest (see Additional file 1: Fig. S5). Yet, the overall performance of the three imputation methods is similar.
Hold-out validation displays similar results to cross-validations
Phenotypes in biological studies are typically imbalanced; for example, most binary traits have fewer cases than controls [39]. To examine the performance of each prediction method on a balanced data set without imputation, we trained each method using all 300 subjects from the fivefold cross-validation, then evaluated it on the additional 32 subjects (16 healthy controls and 16 asthmatic cases). Again, we found that LR and MOGONET show superior performance over the other methods: the Accuracy, F1, AUROC and AUPRC were 0.78, 0.74, 0.70 and 0.72, respectively, for LR, and 0.69, 0.59, 0.66 and 0.75 for MOGONET (see Fig. 5; Additional file 1: Fig. S6). In addition, we found that the combination of miRNA and mRNA achieves the highest Accuracy and AUPRC. Yet, the combination of miRNA and microbiome data produces the highest F1 and AUROC (see Additional file 1: Fig. S7).
Utilizing covariates can further improve the prediction performance for particular omics combinations
To evaluate whether including covariates together with omics data can further improve the prediction performance, we added the following covariates associated with each subject to the prediction model: the father’s and mother’s asthma status, race, and vitamin D level. The preceding hold-out validation using all 63 omics combinations showed that the combination of miRNA and mRNA, or the combination of miRNA and microbiome data, reaches the optimal performance for most of the prediction methods. Here we investigated the influence of covariates by examining the performance of each method before and after including those covariates in addition to the best-performing omics combinations. As those covariates cannot be included easily in all prediction models (e.g., MOGONET would require treating them as an additional omics data type), we focused on two promising methods, LR and LRCV, that can exploit all predictors on an equal footing. We found that the impact of covariates on asthma prediction depends on the omics used. For example, including covariates further improves the prediction for the miRNA and mRNA combination for both LR and LRCV, regardless of the performance metric (see Fig. 6a). Yet, including those covariates decreases the prediction performance for the miRNA and microbiome combination (see Fig. 6b). To understand this difference, we compared the LR coefficients of each covariate between the two omics combinations and found that the coefficients from the two omics combinations are positively correlated. Yet, some covariates, such as history of eczema or atopic dermatitis in the mother, the mother’s marital status and history of hay fever or allergic rhinitis in the mother, have high coefficients in one combination but not in the other.
Discussion
The global prevalence, morbidity, mortality and economic burden of children’s asthma have significantly increased in the past 40 years [1]. Predicting asthma development in children is imperative for understanding the etiology of the disease and identifying suitable interventions [10]. Yet, many diseases (including asthma) are heterogeneous, which makes the prediction of disease status a major challenge. Here, we leveraged the rich omics data collected in the VDAART cohort to examine existing classification methods for the prediction of children’s asthma development at year 3 using multi-omics data collected at or before year 1. Our results imply that including a subset of the omics data types is helpful in asthma outcome prediction; in particular, a combination of transcriptional, genomic and microbiome data can achieve optimal prediction. In addition, the imputation methods for missing values do not show a significant impact on the prediction.
Our analysis of the impact of covariates on the prediction of asthma development suggests that including covariates in the prediction models does not always improve the performance. This also suggests that the conclusions drawn from VDAART may be valid in other cohorts, e.g., cohorts comprising subjects with a different racial distribution, because race is not an important predictor in this study. However, we acknowledge the importance of replicating these findings in additional diverse populations.
Vitamin D can impact the development of the lung and immune system during the fetal and early postnatal periods [40, 41]; thus, deficiency of vitamin D in pregnancy may be important in early asthma and wheezing. The VDAART randomized clinical trial reported that the 3-year incidence of asthma or recurrent wheeze in the infants was 24.3% with a 4400-IU/d supplement and 30.4% with a 400-IU/d supplement [17]. This reduction demonstrates that supplementation of vitamin D may be an important intervention for child health. The prediction of children’s asthma development after including the covariates indicates that vitamin D level is associated with a reduction (negative coefficient) in the relative risk of asthma when the prediction is accurate, for instance, when using the combination of miRNA and microbiome omics data types. This confirms that supplementation of vitamin D in pregnancy can reduce the risk of asthma for children.
Omics data usually contains missing values. Integrating omics data typically requires that all omics be available for each subject, which becomes more challenging as more types of omics data are included. Data imputation enables us to systematically examine the impact of each omics data type on the prediction of disease status. Our results demonstrate that the superior methods, e.g., Logistic Regression, still outperform the other methods when applied to combinations of non-imputed omics, e.g., miRNA and microbiome.
Acknowledgements
We wish to thank all VDAART participants. We thank John P. Ziniti, Rob Chase, Kathleen Lee-Sarwar, Nancy Laranjo, Mike McGeachie, Hooman Mirzakhani, Priyadarshini Kachroo, and Jody Sylvia for preparing the omics data. We thank Kimberly Glass, Arda Halu, and Enrico Maiorino for valuable discussion.
Author contributions
YYL and STW conceived and designed the project. XWW, TW, DPS, CC, ZS, SLK, JH, AMH, OAZ, RZ evaluated the methods. XWW and YYL drafted the manuscript. All authors contributed to result interpretation and revised the manuscript. All authors read and approved the final manuscript.
Funding
VDAART was supported by grant U01HL091528 from the NHLBI. YYL was supported by the National Institutes of Health (grant numbers: R01AI141529, R01HD093761, RF1AG067744, UH3OD023268, U19AI095219, U01HL089856).
Availability of data and materials
The data presented in this study are available upon request.
Declarations
Ethics approval and consent to participate
VDAART IRB approval was obtained from each of the three clinical centers and the Data Coordinating Center, that is, Washington University in St. Louis, Kaiser Health Care San Diego, Boston Medical Center and Brigham and Women’s Hospital in Boston. Study subjects provided written, informed consent.
Consent for publication
Not applicable.
Competing interests
In the past three years, EKS received grant support from GlaxoSmithKline and Bayer. Other authors declare no competing interests. Scott T. Weiss receives author royalty from UpToDate, and is on the Board of Directors of Histolix. Jessica Lasky-Su is a consultant for TruDiagnostic, and is on the Scientific Advisory Board of Precion. Augusto A. Litonjua is on the Data Safety Monitoring Board of PreCISE Network, and receives author royalty from UpToDate.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Yang-Yu Liu, Email: yyl@channing.harvard.edu.
Scott T. Weiss, Email: restw@channing.harvard.edu
References
- 1.Braman SS. The global burden of asthma. Chest. 2006;130:4S–12S. doi: 10.1378/chest.130.1_suppl.4S. [DOI] [PubMed] [Google Scholar]
- 2.Caffrey Osvald E, Bower H, Lundholm C, et al. Asthma and all-cause mortality in children and young adults: a population-based study. Thorax. 2020;75:1040–1046. doi: 10.1136/thoraxjnl-2020-214655. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Di Resta C, Galbiati S, Carrera P, et al. Next-generation sequencing approach for the diagnosis of human diseases: open challenges and new opportunities. Ejifcc. 2018;29:4. [PMC free article] [PubMed] [Google Scholar]
- 4.Grada A, Weinbrecht K. Next-generation sequencing: methodology and application. J Invest Dermatol. 2013;133:e11. doi: 10.1038/jid.2013.248. [DOI] [PubMed] [Google Scholar]
- 5.Kilpinen H, Barrett JC. How next-generation sequencing is transforming complex disease genetics. Trends Genet. 2013;29:23–30. doi: 10.1016/j.tig.2012.10.001. [DOI] [PubMed] [Google Scholar]
- 6.Ku CS, Naidoo N, Wu M, et al. Studying the epigenome using next generation sequencing. J Med Genet. 2011;48:721–730. doi: 10.1136/jmedgenet-2011-100242. [DOI] [PubMed] [Google Scholar]
- 7.Bersanelli M, Mosca E, Remondini D, et al. Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformatics. 2016;17:S15. doi: 10.1186/s12859-015-0857-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Graw S, Chappell K, Washam CL, et al. Multi-omics data integration considerations and study design for biological systems and disease. Mol Omics. 2021;17:170–185. doi: 10.1039/D0MO00041H. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Hasin Y, Seldin M, Lusis A. Multi-omics approaches to disease. Genome Biol. 2017;18:83. doi: 10.1186/s13059-017-1215-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Subramanian I, Verma S, Kumar S, et al. Multi-omics data integration, interpretation, and its application. Bioinform Biol Insights. 2020;14:117793221989905. doi: 10.1177/1177932219899051. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Picard M, Scott-Boyer M-P, Bodein A, et al. Integration strategies of multi-omics data for machine learning analysis. Comput Struct Biotechnol J. 2021;19:3735–3746. doi: 10.1016/j.csbj.2021.06.030. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Xie G, Dong C, Kong Y, et al. Group lasso regularized deep learning for cancer prognosis from multi-omics and clinical features. Genes. 2019;10:240. doi: 10.3390/genes10030240. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Chaudhary K, Poirion OB, Lu L, et al. Deep learning-based multi-omics integration robustly predicts survival in liver cancerusing deep learning to predict liver cancer prognosis. Clin Cancer Res. 2018;24:1248–1259. doi: 10.1158/1078-0432.CCR-17-0853. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Wang T, Shao W, Huang Z, et al. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat Commun. 2021;12:3445. doi: 10.1038/s41467-021-23774-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Rohart F, Gautier B, Singh A, et al. mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput Biol. 2017;13:e1005752. doi: 10.1371/journal.pcbi.1005752. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Childhood Asthma Management Program Research Group. The childhood asthma management program (CAMP): design, rationale, and methods. Control Clin Trials. 1999;20:91–120.
- 17. Litonjua AA, Carey VJ, Laranjo N, et al. Effect of prenatal supplementation with vitamin D on asthma or recurrent wheezing in offspring by age 3 years: the VDAART randomized clinical trial. JAMA. 2016;315:362–370. doi: 10.1001/jama.2015.18589.
- 18. Weiss ST, Litonjua AA. Can we prevent childhood asthma before birth? Summary of the VDAART results so far. Expert Rev Respir Med. 2016;10:1039–1040. doi: 10.1080/17476348.2016.1227257.
- 19. Galant SP, Morphew T, Amaro S, et al. Current asthma guidelines may not identify young children who have experienced significant morbidity. Pediatrics. 2006;117:1038–1045. doi: 10.1542/peds.2005-1076.
- 20. Buitinck L, Louppe G, Blondel M, et al. API design for machine learning software: experiences from the scikit-learn project. arXiv preprint arXiv:1309.0238 2013.
- 21. Arik SO, Pfister T. TabNet: Attentive Interpretable Tabular Learning. arXiv:1908.07442 [cs, stat] 2020.
- 22. Lin E, Mukherjee S, Kannan S. A deep adversarial variational autoencoder model for dimensionality reduction in single-cell RNA sequencing analysis. BMC Bioinformatics. 2020;21:1–11. doi: 10.1186/s12859-020-3401-5.
- 23. Wang D, Gu J. VASC: dimension reduction and visualization of single-cell RNA-seq data by deep variational autoencoder. Genomics Proteomics Bioinformatics. 2018;16:320–331. doi: 10.1016/j.gpb.2018.08.003.
- 24. Leclercq M, Vittrant B, Martin-Magniette ML, et al. Large-scale automatic feature selection for biomarker discovery in high-dimensional OMICs data. Front Genet. 2019;10:452. doi: 10.3389/fgene.2019.00452.
- 25. Moon KR, van Dijk D, Wang Z, et al. Visualizing structure and transitions in high-dimensional biological data. Nat Biotechnol. 2019;37:1482–1492. doi: 10.1038/s41587-019-0336-3.
- 26. Bommert A, Sun X, Bischl B, et al. Benchmark for filter methods for feature selection in high-dimensional classification data. Comput Stat Data Anal. 2020;143:106839. doi: 10.1016/j.csda.2019.106839.
- 27. Du W, Cao Z, Song T, et al. A feature selection method based on multiple kernel learning with expression profiles of different types. BioData Mining. 2017;10:4. doi: 10.1186/s13040-017-0124-x.
- 28. Li Y, Ge X, Peng F, et al. Exaggerated false positives by popular differential expression methods when analyzing human population samples. Genome Biol. 2022;23:79. doi: 10.1186/s13059-022-02648-4.
- 29. Zhuang J, Widschwendter M, Teschendorff AE. A comparison of feature selection and classification methods in DNA methylation studies using the illumina infinium platform. BMC Bioinformatics. 2012;13:59. doi: 10.1186/1471-2105-13-59.
- 30. Sordillo JE, Lutz SM, Jorgenson E, et al. A polygenic risk score for asthma in a large racially diverse population. Clin Exp Allergy. 2021;51:1410–1420. doi: 10.1111/cea.14007.
- 31. Ferreira MA, Mathur R, Vonk JM, et al. Genetic architectures of childhood- and adult-onset asthma are partly distinct. Am J Hum Genet. 2019;104:665–684. doi: 10.1016/j.ajhg.2019.02.022.
- 32. Dong X, Lin L, Zhang R, et al. TOBMI: trans-omics block missing data imputation using a k-nearest neighbor weighted approach. Bioinformatics. 2019;35:1278–1283. doi: 10.1093/bioinformatics/bty796.
- 33. Stekhoven DJ, Bühlmann P. MissForest-non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28:112–118. doi: 10.1093/bioinformatics/btr597.
- 34. Francisco-Garcia AS, Garrido-Martín EM, Rupani H, et al. Small RNA species and microRNA profiles are altered in severe asthma nanovesicles from broncho alveolar lavage and associate with impaired lung function and inflammation. Noncoding RNA. 2019;5:51. doi: 10.3390/ncrna5040051.
- 35. Kho AT, Sharma S, Davis JS, et al. Circulating MicroRNAs: association with lung function in asthma. PLoS ONE. 2016;11:e0157998. doi: 10.1371/journal.pone.0157998.
- 36. Alexandrova E, Miglino N, Hashim A, et al. Small RNA profiling reveals deregulated phosphatase and tensin homolog (PTEN)/phosphoinositide 3-kinase (PI3K)/Akt pathway in bronchial smooth muscle cells from asthmatic patients. J Allergy Clin Immunol. 2016;137:58–67. doi: 10.1016/j.jaci.2015.05.031.
- 37. Gysens F, Mestdagh P, de Bony de Lavergne E, et al. Unlocking the secrets of long non-coding RNAs in asthma. Thorax. 2022;77:514–522. doi: 10.1136/thoraxjnl-2021-218359.
- 38. Carty M, Kearney J, Shanahan KA, et al. Cell survival and cytokine release after inflammasome activation is regulated by the Toll-IL-1R protein SARM. Immunity. 2019;50:1412–1424.e6. doi: 10.1016/j.immuni.2019.04.005.
- 39. Zhou W, Nielsen JB, Fritsche LG, et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet. 2018;50:1335–1341. doi: 10.1038/s41588-018-0184-y.
- 40. Zosky GR, Berry LJ, Elliot JG, et al. Vitamin D deficiency causes deficits in lung function and alters lung structure. Am J Respir Crit Care Med. 2011;183:1336–1343. doi: 10.1164/rccm.201010-1596OC.
- 41. Yurt M, Liu J, Sakurai R, et al. Vitamin D supplementation blocks pulmonary structural and functional changes in a rat model of perinatal vitamin D deficiency. Am J Physiol Lung Cell Mol Physiol. 2014;307:L859–L867. doi: 10.1152/ajplung.00032.2014.
- 42. Tolles J, Meurer WJ. Logistic regression: relating patient characteristics to outcomes. JAMA. 2016;316:533–534. doi: 10.1001/jama.2016.7653.
- 43. Doersch C. Tutorial on Variational Autoencoders. arXiv:1606.05908 [cs, stat] 2016.
- 44. Arnold TB. kerasR: R Interface to the keras deep learning library. J Open Source Softw. 2017;2:296. doi: 10.21105/joss.00296.
- 45. Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat. 1992;46:175–185.
- 46. Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011;2:1–27. doi: 10.1145/1961189.1961199.
- 47. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997;55:119–139. doi: 10.1006/jcss.1997.1504.
- 48. Hastie T, Rosset S, Zhu J, et al. Multi-class AdaBoost. Stat Interface. 2009;2:349–360. doi: 10.4310/SII.2009.v2.n3.a8.
- 49. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29:1189–1232. doi: 10.1214/aos/1013203450.
- 50. Natekin A, Knoll A. Gradient boosting machines, a tutorial. Front Neurorobot. 2013;7:21. doi: 10.3389/fnbot.2013.00021.
- 51. Ho TK. Random decision forests. In: Proceedings of 3rd international conference on document analysis and recognition 1995; 1:278–282.
- 52. Breiman L. Bagging predictors. Mach Learn. 1996;24:123–140. doi: 10.1007/BF00058655.
- 53. Loh W-Y. Classification and regression trees. Wiley Interdiscip Rev Data Mining Knowl Discov. 2011;1:14–23. doi: 10.1002/widm.8.
- 54. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn. 2006;63:3–42. doi: 10.1007/s10994-006-6226-1.
- 55. McCallum A, Nigam K. A comparison of event models for naive bayes text classification. In: AAAI-98 workshop on learning for text categorization 1998; 752:41–48.
- 56. Zhang H. The optimality of naive Bayes. In: Proceedings of the 17th international Florida artificial intelligence research society conference (FLAIRS) 2004.
- 57. Hinton GE. Connectionist learning procedures. Mach Learn. 1990; 555–610.
Data Availability Statement
The data presented in this study are available upon request.