Briefings in Bioinformatics. 2022 Dec 13;24(1):bbac537. doi: 10.1093/bib/bbac537

Incomplete time-series gene expression in integrative study for islet autoimmunity prediction

Khandakar Tanvir Ahmed 1, Sze Cheng 2, Qian Li 3, Jeongsik Yong 4, Wei Zhang 5
PMCID: PMC9851333  PMID: 36513375

Abstract

Type 1 diabetes (T1D) outcome prediction plays a vital role in identifying novel risk factors, ensuring early patient care and designing cohort studies. TEDDY is a longitudinal cohort study that collects a vast amount of multi-omics and clinical data from its participants to explore the progression and markers of T1D. However, missing data in the omics profiles make outcome prediction a difficult task. TEDDY collected time series gene expression for fewer than 6% of enrolled participants. Additionally, for the participants whose gene expression was collected, 79% of time steps are missing. This study introduces an advanced bioinformatics framework for gene expression imputation and islet autoimmunity (IA) prediction. The imputation model generates synthetic data for participants with partially or entirely missing gene expression. The prediction model integrates the synthetic gene expression with other risk factors to achieve better predictive performance. Comprehensive experiments on TEDDY datasets show that: (1) Our pipeline can effectively integrate synthetic gene expression with family history, HLA genotype and SNPs to better predict IA status at 2 years (sensitivity 0.622, AUC 0.715) compared with the individual datasets and state-of-the-art results in the literature (AUC 0.682). (2) The synthetic gene expression contains predictive signals as strong as the true gene expression, reducing reliance on expensive and long-term longitudinal data collection. (3) Time series gene expression is crucial to the proposed improvement and shows significantly better predictive ability than cross-sectional gene expression. (4) Our pipeline is robust to limited data availability. Availability: Code is available at https://github.com/compbiolabucf/TEDDY

Keywords: incomplete time-series gene expression, type-1 diabetes, islet autoimmunity prediction, multi-omics, autoencoders, long short-term memory

Introduction

Gene expression changes throughout the timeline of chronic diseases such as diabetes, hypertension, obesity and heart disease; therefore, periodically measured gene expression may better explain the underlying mechanisms of these diseases than cross-sectional gene expression collected once per participant [1]. Some prospective longitudinal cohort studies collect that information. However, these studies tend to suffer from loss to follow-up [2, 3], which means the time series data will have missing values if participants are absent during scheduled visits when data are collected. Moreover, limited by cost and logistics, data are often collected for a subset of participants, i.e. some participants will have no gene expression data available. An effective data imputation technique is necessary to use the gene expression for downstream analyses [4, 5]. Researchers have investigated computational methods for handling the missing value problem in gene expression, and several algorithms have been proposed to impute gene expression. The missing gene expression problem can be broadly divided into two groups: (1) the first group contains participants with partially available gene expression, and (2) the second group contains participants with no available gene expression. Many frameworks have been developed for the first group; they consider global or local relations among genes, domain knowledge and other omics data for imputation [6–10]. The second group of missing value problems is more apparent in multi-omics analysis, where some participants can be present in another omics type but absent in gene expression. For such conditions, several frameworks have been developed that use other omics data to guide the imputation of gene expression [11–13]. As most studies evaluate gene expression profiles at a single time point, most of the available imputation frameworks are also designed to impute such gene expression datasets.
The imputation of time series data poses additional challenges because of the time dependency among the time steps from the same participant. A handful of frameworks have been proposed for the imputation of time series gene expression data [10, 14, 15], but they do not handle multi-omics data or participants with completely missing gene expression. More recently, some advanced algorithms have been proposed for time series data imputation in other domains [16–19]. However, samples with no available gene expression still complicate integrative time series analysis to study chronic diseases.

Type 1 diabetes (T1D) is a common chronic disease in children caused by the autoimmune response against pancreatic β cells. Despite active research, the exact causes of the disease remain unknown and no cure exists [20]. Islet autoimmunity (IA), which precedes the clinical onset of T1D [21], can be used as a marker to study the progression toward T1D. The Environmental Determinants of Diabetes in the Young (TEDDY) is a longitudinal prospective study that uses a nested case-control cohort to identify risk factors associated with T1D. Early and accurate identification of high-risk children (children with a high probability of developing IA) will allow us to design a better case-control cohort to identify risk factors, which may eventually lead to the prevention and cure of the disease. Therefore, predicting IA has been the center of attention in diabetes studies for a long time [22–24]. Recent attempts to predict outcomes in T1D studies have used genetic factors [25–29], metabolic status [30, 31], family history and environmental risk factors [22, 23, 32] for the prediction. Gene expression of the participants has been widely ignored, even though the predictive power of gene expression is well established in the literature on different diseases [33–36]. Integration of gene expression with other omics profiles is also well documented to improve prediction results for different objectives such as biomarker identification, patient stratification and survival prediction [37–41], which may translate into better outcome prediction for T1D. One reason for the reluctance to use gene expression is its weak association with the outcome, partly attributable to the high missing rate. A few previous studies [23, 42] have used time series gene expression from TEDDY to explain the progression toward T1D and found encouraging results for T1D onset prediction.
However, they only predict T1D for a small subset of total participants, and some use IA information for the prediction. IA information becomes available years after the birth of the child and their enrollment in the study. Therefore, the prediction can only be performed after a certain time. Missing data in gene expression hinder the analyses in these studies as well. TEDDY collects time series gene expression from its participants as well as their SNP, HLA genotype and family history. The gene expression is collected for less than 6% of the enrolled participants and suffers from a large number of missing time steps. This limits the opportunity for a comprehensive integrative study involving all participants. However, SNP data, which are cross-sectional, are available for all participants, enabling us to impute partially or completely missing gene expression profiles.

The primary objective of this work is to propose a model that will impute partially or entirely missing gene expression with synthetic data. We employ a deep learning-based model to generate synthetic gene expression from SNP data and available gene expression. We demonstrate that the synthetic gene expression contains a competitive predictive signal compared with the true gene expression and improves state-of-the-art prediction results. We also explore the importance of time series gene expression in capturing the underlying mechanisms of T1D. The rest of the manuscript is organized as follows: the TEDDY study setup, our research design and methodology are described in the next section. The Experiments section is dedicated to experimental setups and validation of the results. The Discussion section contains a brief discussion of the results along with our limitations and future directions. Concluding remarks are presented in the last section.

Research design and methods

Data sets and participants

The TEDDY study is designed to identify the environmental risk factors impacting the development of IA and the onset of T1D [43]. TEDDY enrolls 8676 high-risk children based on the HLA genotype of the children and their first-degree relatives [22]. Follow-up for each child starts at 3 months and lasts until 15 years of age. Children are tested for islet autoantibodies (IAA, GADA, IA-2A, ZnT8A) at each visit, and gene expression is also measured. Visits for the participants are 3 months apart for the first 4 years. After that, the interval is 3 months for participants with any positive islet autoantibody test and 6 months for the rest. The outcome of interest in this study, IA, is defined as the presence of two consecutive positive tests for any particular islet autoantibody. In other words, if there are consecutive positive tests for at least one of the four autoantibodies, we consider that participant as IA positive. Many risk factors, including HLA genotype, SNPs, dietary factors, family history, sex and seroconversion age, have been investigated in previously published works that narrow down the candidates for a predictive study [22, 42, 44–46]. Although these risk factors are found to be weakly associated with T1D outcome [22], these studies ignored time series gene expression, which may introduce complementary information and a time factor for better prediction results.
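The IA definition above (two consecutive positive tests for the same autoantibody) can be sketched as follows. The data layout, a dictionary of per-visit booleans keyed by autoantibody name, and the function name are assumptions made for illustration and do not reflect the TEDDY data format.

```python
from typing import Dict, List

AUTOANTIBODIES = ("IAA", "GADA", "IA-2A", "ZnT8A")


def is_ia_positive(visits: Dict[str, List[bool]]) -> bool:
    """Return True if any single autoantibody is positive at two consecutive visits."""
    for antibody in AUTOANTIBODIES:
        results = visits.get(antibody, [])
        # Compare each visit with the one immediately after it.
        for previous, current in zip(results, results[1:]):
            if previous and current:
                return True
    return False


# Hypothetical participant: GADA is positive at two consecutive visits.
example = {
    "IAA": [False, True, False, False],
    "GADA": [False, False, True, True],
}
print(is_ia_positive(example))
```

Positive tests for two different autoantibodies at neighbouring visits do not satisfy the definition in this sketch, because the two positives must belong to the same autoantibody.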

Many TEDDY-identified risk factors have been previously explored, and family history, HLA genotype and SNP were shown to be better predictors for IA status [22]. Based on the literature, we include 12 SNPs, HLA genotype and family history in this study. Details about the 12 SNPs can be found in [22]. We performed an exhaustive search for the best SNP combination and found rs4597342, rs12708716, rs4948088 and rs1143678 combined with HLA genotype and family history to be the best-performing combination for IA status prediction. Therefore, we include these variables in further analyses of this study. Risk factors are binarized before feeding them into models. Family history was categorized as first-degree relatives having T1D versus no T1D. SNPs were categorized as major (no copy of minor allele) versus minor (one or two copies of minor alleles). The HLA genotype is defined as DR3/DR4 versus others. TEDDY participants (m = 6,812) with available family history, HLA genotype and SNP data are included in this study.
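The binarization rules described above can be sketched as a small helper. The argument names and the encoding of the HLA genotype as a string are assumptions for illustration; only the three categorization rules come from the text.

```python
from typing import Dict


def binarize_risk_factors(
    first_degree_relative_t1d: bool,
    hla_genotype: str,
    minor_allele_copies: Dict[str, int],
) -> Dict[str, int]:
    """Binarize family history, HLA genotype and SNPs as described in the text."""
    features = {
        # First-degree relative with T1D versus no T1D.
        "family_history": int(first_degree_relative_t1d),
        # DR3/DR4 versus all other genotypes.
        "hla_dr3_dr4": int(hla_genotype == "DR3/DR4"),
    }
    for snp, copies in minor_allele_copies.items():
        # Major (no copy of the minor allele) versus minor (one or two copies).
        features[snp] = int(copies >= 1)
    return features


# Hypothetical participant using the four selected SNPs.
participant = binarize_risk_factors(
    first_degree_relative_t1d=True,
    hla_genotype="DR4/DR4",
    minor_allele_copies={
        "rs4597342": 0,
        "rs12708716": 1,
        "rs4948088": 2,
        "rs1143678": 0,
    },
)
print(participant)
```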

The gene expression in TEDDY is a time series with 2013 time steps belonging to 401 children. Gene expression is collected until 72 months at 3 or 6 months intervals. Approximately 79% of time steps are missing for the 401 participants, which significantly impedes its ability to be used in a time series study. In the cohort of 6812 participants, the missing rate rises to 98.77%, as the other 6411 (94.11% of 6812) participants have no available gene expression. Therefore, the gene expression is unusable for downstream analyses involving a cohort of 6812 participants. Nevertheless, the number of missing participants changes across time steps, allowing us to remove some time steps with fewer available participants.
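The missing rates quoted above can be checked with a few lines of arithmetic. The sketch assumes a regular grid of 24 possible time steps per participant (3 to 72 months at 3-month intervals); this grid is an assumption, but it reproduces the reported percentages.

```python
observed_steps = 2013            # time steps with measured gene expression
children_with_expression = 401   # participants with any gene expression
cohort_size = 6812               # participants included in this study
steps_per_child = 72 // 3        # assumed regular grid: 24 steps up to 72 months

# Missing rate among the 401 participants with some gene expression.
missing_among_measured = 1 - observed_steps / (children_with_expression * steps_per_child)

# Missing rate over the full cohort of 6812 participants.
missing_in_cohort = 1 - observed_steps / (cohort_size * steps_per_child)

# Share of the cohort with no gene expression at all.
share_without_expression = (cohort_size - children_with_expression) / cohort_size

print(f"{missing_among_measured:.2%}, {missing_in_cohort:.2%}, {share_without_expression:.2%}")
```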

The number of available participants at each time step is presented in Figure 1, which illustrates that the rate of missing participants increases in later time steps. Moreover, after 48 months, some participants visited every 6 months instead of 3, resulting in an even lower data availability rate. As available data are necessary to train the imputation model, a lower data availability rate degrades the model training, and thus the quality of the synthetic gene expression. To reduce the impact of missing data and maintain a regular interval of 3 months between consecutive time steps, we set a cutoff of 48 months for gene expression in this study. Therefore, the gene expression of each participant consists of 16 time steps corresponding to 3 to 48 months at 3-month intervals. Although setting a cutoff lowers the missing rate to 98.22%, it is still impractical to use gene expression with predictive algorithms without effective data imputation. Therefore, we propose a deep-learning-based imputation model, described in the following subsection, that can generate synthetic gene expression at missing time steps from SNPs. We keep 17 039 protein-coding genes in the gene expression. Using all 17 039 features may cause the model to overfit or impose a computational burden with redundant information [47, 48], so we use forward feature selection to find the number of genes that gives the best prediction results. Once we have the optimal number of genes, gene expression, family history, HLA genotype and SNPs are merged into a single time series dataset. For family history, HLA genotype and SNPs, the same value for a participant is replicated at every time step. Dimensions of the datasets used in different stages of this study are tabulated in Table 1.
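The merging step, in which time-invariant risk factors are replicated at every time step, can be sketched as below. The function name and the list-of-lists layout are assumptions for illustration; the sizes in the example (16 time steps, 10 genes, 6 binary risk factors) follow the numbers given elsewhere in the text.

```python
from typing import List, Sequence


def merge_static_with_series(
    expression_series: Sequence[Sequence[float]],
    static_features: Sequence[int],
) -> List[List[float]]:
    """Append the same static feature values to the gene vector of every time step."""
    return [list(step) + list(static_features) for step in expression_series]


# Hypothetical participant: 16 time steps with 10 selected genes each.
n_steps, n_genes = 16, 10
expression = [[0.1 * t] * n_genes for t in range(n_steps)]
# Family history, HLA genotype and four SNPs, already binarized.
static = [1, 0, 0, 1, 1, 0]

merged = merge_static_with_series(expression, static)
print(len(merged), len(merged[0]))
```

Each participant then contributes one matrix of time steps by features, which is the input shape a recurrent classifier expects.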

Figure 1.


The number of available participants at each time step. The number of participants with available gene expression at each time step; up to 24 time steps (72 months) are plotted. The plot shows a decrease in the availability of gene expression at later time points. The 16th time step is selected as the cutoff point for gene expression.

Table 1.

Dimensions of gene expression and SNP used in different stages of the study

Dataset Dimension
Original gene expression Inline graphic
Imputed gene expression Inline graphic
Original SNP Inline graphic
SNP (for imputation) Top 50 PCs of original SNP
SNP (for IA prediction) 12 SNPs selected from literature

Imputation model overview

The overall framework of the proposed study is illustrated in Figure 2. The framework has two main components: a deep learning-based imputation model and a long short-term memory (LSTM)-based classifier. Synthetic gene expression is first generated for missing time steps through the imputation model using SNP and available gene expression. Family history, HLA genotype, SNP and completed gene expression are then fed into the classifier to predict IA positive and IA negative participants.

Figure 2.


An overall illustration of the proposed framework. Incomplete gene expression is imputed using SNP in the imputation model (DNN). Completed gene expression, SNP, HLA genotype and family history are fed into the classifier (LSTM) to predict IA positive and IA negative participants.

Although gene expression is either partially or completely missing for every participant, SNP data are available for all of them. Therefore, our proposed imputation model is trained to map the SNP data to gene expression and generate the value for missing time steps. The imputation is carried out for each time step separately, i.e. the model is retrained for imputing every time step as seen in Figure 2. For imputing a time step, participants with available gene expression at that time step are separated and randomly divided into training and validation sets with a 70-30 split ratio. All other participants without gene expression are considered as the test set. The training samples’ SNP data and gene expression are used as input and output to train the model. The imputation model is illustrated in Figure 3. Let Inline graphic be the total number of participants in the study. SNP data (Inline graphic) are available for all Inline graphic participants, whereas the gene expression is available for Inline graphic participants among them. Inline graphic genes from Inline graphic participants in the gene expression Inline graphic, observed in Inline graphic is defined as Inline graphic, where Inline graphic denotes the observation from Inline graphic time step. For imputation of Inline graphic time step, we first train an autoencoder (Inline graphic) to find a lower dimensional representation of Inline graphic, given by Inline graphic, where Inline graphic represents transposition and Inline graphic is the embedding size. Inline graphic contains the property of each feature in the observed data at Inline graphic time step, which will be later used in equation (7) to guide the synthetic data generation. Both encoder and decoder are five layers feed-forward neural networks. The encoder finds the lower dimensional embedding from the original data, which is fed to the decoder as it tries to reconstruct the original data from the embedding. 
The autoencoder is trained with the reconstruction loss for 100 epochs using Adam optimizer and a learning rate of 0.0001. Output from the encoder and decoder are given by equations (1) and (2), respectively, and the network is trained following equation (3).

graphic file with name DmEquation1.gif (1)
graphic file with name DmEquation2.gif (2)
graphic file with name DmEquation3.gif (3)

Figure 3.


An illustration of the proposed imputation model. Incomplete gene expression Inline graphic is imputed using autoencoders Inline graphic, Inline graphic and multilayer perceptron (MLP) Inline graphic.

Then we move forward to the imputation of missing values. It consists of two components: (i) a six-layer fully connected deep neural network (Inline graphic) and (ii) an autoencoder (Inline graphic) that follows the first component (Inline graphic). Inline graphic takes SNP data Inline graphic as input and generates synthetic gene expression Inline graphic for Inline graphic time step. Inline graphic is the number of features in the SNP data. Dimension of the SNP data is reduced to avoid overfitting using principal component analysis (PCA) implemented using sklearn.decomposition.PCA package [49]. The top 50 principal components (PCs) are used in the imputation. Inline graphic notation is used throughout the study to denote synthetic data and values derived from synthetic data. The ReLU activation function follows hidden layers in the Inline graphic network. Layers can be formulated as

graphic file with name DmEquation4.gif (4)

where Inline graphic, Inline graphic are learnable parameters and Inline graphic is the activation function. For the first layer, Inline graphic and in the final layer Inline graphic. Additionally, we use the same model to generate the synthetic gene expression Inline graphic for all Inline graphic participants from Inline graphic. To recollect, Inline graphic is the total number of participants in the study and Inline graphic is the participants with available gene expression; therefore, Inline graphic. Afterwards, Inline graphic is fed into the autoencoder Inline graphic where an embedding Inline graphic is generated that represents the characteristics of the imputed features in a lower dimension following equations (5) and (6). The purpose of this autoencoder Inline graphic is to ensure that feature properties remain the same before and after imputation, which means generated data will have similar properties as the true data.

graphic file with name DmEquation5.gif (5)
graphic file with name DmEquation6.gif (6)

The objective functions for Inline graphic and Inline graphic are formulated as equations (7) and (8), respectively.

graphic file with name DmEquation7.gif (7)
graphic file with name DmEquation8.gif (8)

The first element (Inline graphic) in equation (7) makes the synthetic gene expression for the Inline graphic participants similar to the true gene expression at Inline graphic time step. The second element (Inline graphic) introduces information from previous time steps in the imputation. Inline graphic represents the last observed gene expression at Inline graphic time step while imputing data at Inline graphic time step. We assume that gene expression at a time step is more similar to its closest time step, which provides a better estimation for the missing gene expression. Inline graphic denotes the time difference between Inline graphic and Inline graphic time steps, which ensures that gene expression observed in closer time step from Inline graphic has more contribution in the imputation compared with gene expression observed at further time steps. The last element (Inline graphic) in equation (7) ensures that feature characteristics are similar in gene expression (Inline graphic) before and after imputation.
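The three elements of equation (7) described above can be illustrated with a simplified sketch. The exact formulation is not recoverable from this text, so everything specific here is an assumption: mean squared error is used for each element, the previous-time-step element is weighted by the inverse of the time difference, and the three elements are summed with equal weight. All function and argument names are hypothetical.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def imputation_loss(
    synthetic: Sequence[float],
    observed_now: Sequence[float],
    last_observed: Sequence[float],
    steps_apart: int,
    embedding_true: Sequence[float],
    embedding_synthetic: Sequence[float],
) -> float:
    """Illustrative three-element loss; the weighting scheme is an assumption."""
    # Element 1: synthetic data should match the true data at the current time step.
    current_term = mse(synthetic, observed_now)
    # Element 2: stay close to the last observed time step, with less weight
    # when that observation is further away in time (assumed 1 / time difference).
    previous_term = mse(synthetic, last_observed) / steps_apart
    # Element 3: feature embeddings before and after imputation should agree.
    embedding_term = mse(embedding_synthetic, embedding_true)
    return current_term + previous_term + embedding_term


synthetic = [1.0, 2.0, 3.0]
near = imputation_loss(synthetic, [1.0, 2.0, 3.0], [0.0, 1.0, 2.0], 1, [0.5], [0.5])
far = imputation_loss(synthetic, [1.0, 2.0, 3.0], [0.0, 1.0, 2.0], 4, [0.5], [0.5])
print(near, far)
```

With the same mismatch against the last observed time step, the penalty is smaller when that observation is four steps away than when it is one step away, which mirrors the stated assumption that closer time steps are more informative.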

Inline graphic is trained for 100 epochs using Adam optimizer with a learning rate of 0.001, and Inline graphic is trained for 25 epochs using Adam optimizer with a learning rate of 0.00001. Inline graphic is fully trained at each epoch of Inline graphic to obtain the best embedding value. High-quality embedding generated by Inline graphic will in turn result in better training for Inline graphic as seen in equation (7).

As most participants have no true gene expression, the imputation model must have two characteristics to ensure the best performance. It has to be able to generate synthetic data for a participant using only SNP and maximize the information extraction from the SNP simultaneously. In our model, the multilayer perceptron, Inline graphic, is responsible for mapping SNP to gene expression. To ensure that Inline graphic uses only SNP as input, we cannot integrate available gene expression from other time steps. On the other hand, ignoring other time steps will result in loss of valuable information and inferior mapping of SNP to gene expression that contradicts the second characteristic. As mentioned before, all participants in the training set have partial true gene expression. Therefore, we employ the available gene expression from other time steps for a participant in the objective function through the second element, whereas once trained, Inline graphic only uses SNP data to generate synthetic gene expression. It ensures that we can generate synthetic data for participants in the test set with no prior gene expression and also harness the prior information from participants during training. The validation set is used to tune the hyperparameters and choose the best model during training. Then SNP data of the test set are fed into the network to generate gene expression values for those participants with missing time steps. The same procedure is repeated for each time point. The deep neural network model is implemented in PyTorch [50].
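The per-time-step data split described in this subsection can be sketched as follows: participants with observed gene expression at a given time step are divided 70-30 into training and validation sets, and everyone else forms the test set to be imputed. The function name, the seed argument and the dictionary input are assumptions for illustration.

```python
import random
from typing import Dict, List, Tuple


def split_for_time_step(
    participant_ids: List[int],
    has_expression: Dict[int, bool],
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Split participants for imputing one time step (70% train, 30% validation)."""
    observed = [p for p in participant_ids if has_expression[p]]
    # Participants without gene expression at this time step are the imputation targets.
    test = [p for p in participant_ids if not has_expression[p]]
    rng = random.Random(seed)
    rng.shuffle(observed)
    n_train = round(0.7 * len(observed))
    return observed[:n_train], observed[n_train:], test


# Hypothetical time step: 20 of 100 participants have observed gene expression.
ids = list(range(100))
availability = {p: p < 20 for p in ids}
train_ids, validation_ids, test_ids = split_for_time_step(ids, availability)
print(len(train_ids), len(validation_ids), len(test_ids))
```

Because availability differs across time steps, the split (and the model trained on it) is recomputed for each time step.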

Classifier and metrics

In this study, we use an LSTM-based classifier [51] for time series predictions. LSTM is a type of recurrent neural network that takes time series data as input and maps them to a label while accounting for the time factor. Our classifier is a three-layer LSTM network followed by a fully connected layer and a sigmoid activation function. The hidden size of the LSTM is 200. For predictions using only family history, HLA genotype and SNP, we use a random forest implemented through sklearn.ensemble.RandomForestClassifier. Area Under the Receiver Operating Characteristic Curve (AUC), sensitivity, specificity and Youden’s index are used to evaluate the performance of the classifiers.
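A minimal PyTorch sketch of a classifier matching this description is given below. It assumes that "three-layer" refers to three stacked LSTM layers and that the output at the last time step is passed to the fully connected layer; neither detail is confirmed by the text, and the class name and input sizes are hypothetical.

```python
import torch
from torch import nn


class IAClassifier(nn.Module):
    """Stacked LSTM followed by a fully connected layer and a sigmoid."""

    def __init__(self, n_features: int, hidden_size: int = 200, num_layers: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_features,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,  # input shape: (batch, time steps, features)
        )
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        output, _ = self.lstm(x)
        # Assumption: use the hidden state at the last time step for classification.
        last_step = output[:, -1, :]
        return torch.sigmoid(self.fc(last_step)).squeeze(-1)


# Hypothetical batch: 4 participants, 16 time steps, 16 features per step.
model = IAClassifier(n_features=16)
batch = torch.randn(4, 16, 16)
probabilities = model(batch)
print(probabilities.shape)
```

The output is one probability of IA positivity per participant, which can be thresholded to obtain sensitivity and specificity or used directly to compute AUC.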

Experiments

A total of 6812 samples from TEDDY with family history, HLA genotype and SNP data available are included in our study; 338 (4.96%) of them develop IA within 24 months. In the experiments, we show the improvement in IA prediction after the integration of gene expression with family history, HLA genotype and SNP, assess the quality of the imputed gene expression used in the prediction, and show that the enriched gene sets (e.g. pathways) are significant for T1D pathogenesis.

Integration of gene expression improves IA prediction

Feature properties and selection

Family history, HLA genotype and SNPs: TEDDY-identified risk factors have a wide range of abilities to predict IA. Sensitivity, specificity, Youden’s index and AUC of family history, HLA genotype and SNPs used in this study for predicting IA outcome at 24 months are reported in Supplementary Table S1. Family history is the best predictor of IA (Youden’s index = 0.265) followed by HLA genotype, rs4597342, rs12708716, rs4948088 and rs1143678, in that order. Additionally, these risk factors do not act in isolation; therefore, the risk of developing IA associated with different combinations of risk factors is investigated and reported in Supplementary Table S2.

Gene expression: An optimum number of genes from the gene expression is added to the family history, HLA genotype and SNP data to improve the IA prediction. The genes are ranked based on their variance across the samples, and genes with the highest variances are added sequentially to the LSTM-based classifier along with family history, HLA genotype and SNP until its performance on validation data goes down. Based on this result, we choose to add the top 10 genes in our analysis which gives the highest validation AUC of 0.73.
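The gene selection procedure can be sketched as a variance ranking followed by sequential addition that stops when the validation score drops. The scoring callback stands in for training and validating the LSTM-based classifier; it and the toy data are hypothetical.

```python
from statistics import pvariance
from typing import Callable, Dict, List, Tuple


def rank_genes_by_variance(expression: Dict[str, List[float]]) -> List[str]:
    """Order genes from highest to lowest variance across samples."""
    return sorted(expression, key=lambda gene: pvariance(expression[gene]), reverse=True)


def forward_select(
    ranked_genes: List[str],
    validation_score: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Add genes in ranked order until the validation score goes down."""
    selected: List[str] = []
    best_score = float("-inf")
    for gene in ranked_genes:
        score = validation_score(selected + [gene])
        if score < best_score:
            break  # performance on validation data dropped; stop adding genes
        selected.append(gene)
        best_score = score
    return selected, best_score


# Toy data: four hypothetical genes measured in three samples.
toy_expression = {
    "g1": [1.0, 5.0, 9.0],
    "g2": [2.0, 2.0, 2.0],
    "g3": [0.0, 10.0, 20.0],
    "g4": [1.0, 2.0, 3.0],
}
# Hypothetical validation AUC as a function of the number of genes used.
toy_auc = {1: 0.70, 2: 0.73, 3: 0.71, 4: 0.69}

ranked = rank_genes_by_variance(toy_expression)
chosen, auc = forward_select(ranked, lambda genes: toy_auc[len(genes)])
print(ranked, chosen, auc)
```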

Prediction results

Improved IA outcome prediction: A collective role of omics layers determines the physiology behind a complex disease like T1D. Therefore, integrating additional omics data into the analysis can provide complementary information enabling a better outcome prediction. We predict the IA status of participants using: (1) family history, HLA genotype and SNPs, (2) synthetic gene expression, (3) a combination of family history, HLA genotype, SNPs and synthetic gene expression. IA status labels are generated separately at each time cutoff (in months): all participants developing IA by the cutoff month are considered IA positive, and all others are considered IA negative. Predictions using only gene expression and a combination of family history, HLA genotype, SNPs and gene expression are carried out with the LSTM model described in the subsection Classifier and metrics. Predictions using family history, HLA genotype and SNPs are carried out with the random forest model described in the same subsection. All predictions in this study are repeated 50 times with random splitting of samples into training, validation and test sets. The mean values of AUC, sensitivity, specificity and Youden’s index of test sets from 50 repetitions are reported in Table 2 and all results thereafter.
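The label generation at different cutoffs can be sketched as below. The input format, a mapping from participant to the month at which IA was first met (or None if never observed), is an assumption for illustration; the cutoff values are those listed in Table 2.

```python
from typing import Dict, Optional, Sequence


def ia_labels_at_cutoffs(
    ia_onset_month: Dict[str, Optional[int]],
    cutoffs: Sequence[int] = (18, 24, 30, 36, 48, 72),
) -> Dict[int, Dict[str, int]]:
    """Label a participant 1 at a cutoff if IA developed by that month, else 0."""
    return {
        cutoff: {
            participant: int(month is not None and month <= cutoff)
            for participant, month in ia_onset_month.items()
        }
        for cutoff in cutoffs
    }


# Hypothetical participants.
onsets = {"p1": 15, "p2": 27, "p3": None}
labels = ia_labels_at_cutoffs(onsets)
print(labels[24], labels[30])
```

A participant who develops IA at 27 months is negative for the 24-month cutoff and positive for every cutoff from 30 months onward, so the positive class grows as the cutoff increases.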

Table 2.

Predictions at different IA cutoff. Results (sensitivity, specificity, Youden’s index, AUC) of IA status prediction using three input data at different IA cutoffs are calculated. The combination of family history, HLA genotype, SNP and gene expression shows better performance compared with them individually. AUC, sensitivity and Youden’s index drop when the IA cutoff is increased suggesting the difficulty associated with predicting further into the future. Improvements using combined data at all cutoffs are statistically significant (P-value<0.001)

IA cutoff   Family history + HLA + SNP        Gene expression                   Combined
(months)    Sen    Spe    Y index  AUC        Sen    Spe    Y index  AUC        Sen    Spe    Y index  AUC
18          0.597  0.701  0.298    0.651      0.421  0.799  0.220    0.623      0.640  0.750  0.390    0.717
24          0.542  0.719  0.261    0.643      0.393  0.861  0.254    0.639      0.622  0.761  0.383    0.715
30          0.484  0.746  0.230    0.634      0.390  0.871  0.261    0.639      0.599  0.771  0.370    0.708
36          0.467  0.737  0.204    0.623      0.378  0.887  0.265    0.646      0.575  0.772  0.347    0.701
48          0.476  0.716  0.192    0.598      0.342  0.910  0.252    0.633      0.531  0.780  0.311    0.681
72          0.460  0.715  0.175    0.591      0.360  0.876  0.236    0.631      0.494  0.784  0.278    0.671

The results in Table 2 illustrate the improvement in prediction at every time cutoff caused by the inclusion of gene expression with other features, which signifies the importance of additional information contained within the gene expression. Moreover, gene expression considers the time factor and reflects the physiological changes over a period of time instead of a snapshot of the underlying processes. We also find it more difficult to predict further into the future, as sensitivity, Youden’s index and AUC decrease gradually as the IA cutoff increases. AUC, sensitivity, specificity and Youden’s index of our proposed model are better than those of the baseline where we used only the time-invariant features. Higher sensitivity is crucial for this prediction, as false negative results can lead to neglected care of a high-risk child. Moreover, for the IA status cutoff at 24 months, our proposed model (AUC 0.715) outperforms the state-of-the-art result published by the TEDDY study group in an 8-year progress report [22] (AUC 0.682). They also used family history, HLA genotype and SNPs in the predictive model; therefore, it can be inferred that the improvement in our study is caused by the use of reliable synthetic gene expression. Additionally, we investigated the prediction of the appearance of the first islet autoantibody type by 24 months. The results are tabulated in Table 3, which shows that combined data perform better at predicting both IAA-first (IAA appears first) and GADA-first (GADA appears first) participants compared with the baselines. Demography is also known to be linked with the progression to T1D [52]. Therefore, we repeated the experiments reported in Table 2 with gender and race added to family history, HLA genotype, SNP and gene expression as input. We found no improvement in the prediction performance from including demographic information in the analysis; the results are reported in Supplementary Table S3.
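For reference, sensitivity, specificity and Youden's index follow their standard definitions, and the Youden's index column of Table 2 equals sensitivity plus specificity minus one. The sketch below uses hypothetical confusion-matrix counts and then checks the 24-month combined result from Table 2.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of IA-positive participants that are predicted positive."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of IA-negative participants that are predicted negative."""
    return true_negatives / (true_negatives + false_positives)


def youden_index(sens: float, spec: float) -> float:
    """Youden's index J = sensitivity + specificity - 1."""
    return sens + spec - 1


# Hypothetical counts for one test split.
sens = sensitivity(true_positives=30, false_negatives=20)
spec = specificity(true_negatives=750, false_positives=250)
print(sens, spec, round(youden_index(sens, spec), 3))

# Check against Table 2, combined data at the 24-month cutoff.
table_j = round(youden_index(0.622, 0.761), 3)
print(table_j)
```

A false negative lowers sensitivity directly, which is why the text singles out sensitivity as the measure tied to missed high-risk children.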

Table 3.

Predictions of different IA outcomes. Results (sensitivity, specificity, Youden’s index, AUC) of first islet autoantibody appearance at 24 months are calculated. The combination of family history, HLA genotype, SNP and gene expression shows better performance compared with them individually

IA outcome   Family history + HLA + SNP        Gene expression                   Combined
             Sen    Spe    Y index  AUC        Sen    Spe    Y index  AUC        Sen    Spe    Y index  AUC
IAA-first    0.486  0.721  0.207    0.635      0.454  0.798  0.252    0.627      0.615  0.716  0.331    0.690
GADA-first   0.524  0.738  0.262    0.657      0.351  0.905  0.256    0.622      0.646  0.759  0.405    0.718

Impact of time series gene expression: As most studies collect gene expression from a participant at a single time point, we designed an experiment to investigate what the results would be if TEDDY had collected cross-sectional gene expression instead. We use the random forest to predict IA labels at 24 months using gene expression at one time step up to 24 months along with family history, HLA genotype and SNP. The experiment is repeated for each time step; therefore, eight predictions correspond to the eight time steps (up to 24 months) in gene expression. All predictions use the same values for family history, HLA genotype and SNP, differing only in gene expression data. The results are illustrated in Figure 4, which shows lower performance than the time series model (AUC 0.715) at all time steps. The best AUC (0.704) is obtained when gene expression at the third time step (9 months) is used to predict the IA status. The results show a significant gain in prediction performance from including the time factor of gene expression in the analysis. Moreover, the AUC values drop at later time steps, which indicates a deteriorating predictive signal in the synthetic gene expression at later time steps. This behavior can be attributed to the insufficient availability of true gene expression (training data) in later time steps, as shown in Figure 1, which results in poorer training of the imputation model. However, our proposed model is not significantly vulnerable to this limitation, as the LSTM considers all time steps during the prediction.

Figure 4.

IA status prediction using one gene expression time step, family history, HLA genotype and SNP. IA status is predicted at 24 months to illustrate the predictive ability of the gene expression if collected at one time point instead of a longitudinal study.

Impact of the availability of gene expression: We used 16 time points in our time series analysis, a number determined by the regular collection interval and the availability of gene expression up to 48 months. We investigate the importance of longer time series for prediction by applying different cutoffs to the gene expression. For a given cutoff month, we use only the input data up to that month in our LSTM-based classifier, which imitates limited data availability. The IA status for this experiment is taken at 24 months and is fixed for all gene expression cutoffs. The mean AUC, sensitivity, specificity and Youden’s index are reported in Table 4. Using combined input data with more time steps generally improves the prediction, which is expected because longer time series can capture more physiological changes. Moreover, our proposed model is robust to limited data availability: the AUC drops only from 0.715 to 0.701 (a 1.96% drop) when 12 months is used as the cutoff for gene expression (25% of the input data). We also investigate whether the predictive ability of synthetic gene expression comes solely from the SNPs. If it did, we could use SNPs in place of gene expression and still obtain similar predictive performance. We find that SNPs by themselves predict poorly but help us obtain an effective mapping to gene expression. The results are reported in Supplementary Table S4.
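
The arithmetic behind the cutoff experiment can be reproduced with a short Python sketch. It assumes evenly spaced time steps (16 steps over 48 months, i.e. one every 3 months, which is inferred from the text rather than stated explicitly), and the helper names are hypothetical.

```python
INTERVAL_MONTHS = 3  # assumption: 16 time points spread evenly over 48 months
TOTAL_STEPS = 16


def steps_for_cutoff(cutoff_months):
    """Number of time steps available up to and including the cutoff month."""
    return min(cutoff_months // INTERVAL_MONTHS, TOTAL_STEPS)


def truncate_series(series, cutoff_months):
    """Keep only the time steps observed up to the cutoff."""
    return series[:steps_for_cutoff(cutoff_months)]


def relative_drop_percent(full_value, reduced_value):
    """Relative decrease of reduced_value with respect to full_value, in percent."""
    return (full_value - reduced_value) / full_value * 100.0


series = list(range(1, TOTAL_STEPS + 1))  # placeholder: one value per time step
for cutoff in (12, 24, 36, 48):
    kept = truncate_series(series, cutoff)
    print(f"cutoff {cutoff} months: {len(kept)} steps ({len(kept) / TOTAL_STEPS:.0%} of input)")

# AUC with all time steps (0.715) versus the 12-month cutoff (0.701).
print(f"relative AUC drop: {relative_drop_percent(0.715, 0.701):.2f}%")
```

With a 12-month cutoff the classifier sees 4 of the 16 time steps, which is the 25% of input data quoted in the text.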

Table 4.

Predictions with gene expression at different time cutoffs. Results (sensitivity, specificity, Youden’s index, AUC) of IA status prediction using gene expression and combined data up to the cutoff month are calculated. Higher AUC, sensitivity and Youden’s index at larger cutoffs show the improvement associated with additional time steps. * denotes results with statistically significant differences compared with the result using all time steps (48 months).

Cutoff (months) | Gene expression                | Combined
                | Sen     Spe     Y index AUC    | Sen     Spe     Y index AUC
12              | 0.317   0.945   0.262   0.608* | 0.578   0.792   0.370   0.701*
24              | 0.357   0.907   0.264   0.623  | 0.577   0.800   0.377   0.710
36              | 0.357   0.904   0.261   0.631  | 0.502   0.787   0.289   0.721
48              | 0.421   0.799   0.220   0.623  | 0.622   0.761   0.383   0.715

Quality of synthetic gene expression

The improvement in prediction from adding synthetic gene expression to family history, HLA genotype and SNP depends on the quality of the synthetic data. Here, we design an experiment to compare the predictive ability of synthetic gene expression with that of true gene expression. Only the 401 samples with true gene expression are included in this experiment to make the results comparable. We therefore have two datasets representing the same samples: a true gene expression dataset and a synthetic gene expression dataset. We predict IA labels at different time cutoffs (in months) from each dataset using the LSTM-based classifier. The results are shown in Figure 5, which illustrates the better predictive performance of the synthetic data across all time step cutoffs. The improvement can primarily be attributed to the higher availability of data in the synthetic gene expression. As mentioned before, 79% of time steps are missing in the cohort of 401 participants, which translates to 79% of time steps having synthetic gene expression against 21% of time steps having true gene expression. More time steps in the input time series result in better analysis and, consequently, a higher AUC. However, true gene expression makes up only approximately 1.2% of the input time series gene expression for the 6812 participants, as most participants have no true gene expression available. Merging the true gene expression with the synthetic gene expression would therefore be inconsequential, and all predictions in the subsection Prediction results were designed using only synthetic gene expression.
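
The share of true gene expression in the full input can be estimated from the figures quoted in the paper. The sketch below is a back-of-the-envelope calculation that assumes every participant contributes the same number of time steps; the variable and function names are illustrative.

```python
N_WITH_EXPRESSION = 401  # participants with any true gene expression
N_TOTAL = 6812           # participants used in the prediction experiments
MISSING_FRACTION = 0.79  # share of time steps missing among the 401 (rounded)


def fraction_true_overall(n_with, n_total, missing_fraction):
    """Share of all participant/time-step slots backed by true measurements.

    Assumes an equal number of time steps per participant, so the share is
    (participants with data / all participants) * (observed share of steps).
    """
    return (n_with / n_total) * (1.0 - missing_fraction)


share_with_any = N_WITH_EXPRESSION / N_TOTAL
share_true = fraction_true_overall(N_WITH_EXPRESSION, N_TOTAL, MISSING_FRACTION)

print(f"participants with any true expression: {share_with_any:.1%}")
print(f"time-step slots with true expression:  {share_true:.2%}")
```

The result is a little over 1.2%, in line with the 1.23% quoted in the Discussion; the small difference comes from using the rounded 79% missing rate.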

Figure 5.

IA status prediction using true gene expression and synthetic gene expression. IA status is predicted using true and synthetic gene expression representing the same 401 participants.

Discussion

Detecting IA-positive children helps researchers identify high-risk children early and intervene. It also helps them reduce the cohort size and increase the efficiency of T1D studies. However, detection of IA might not always be enough, as children can develop IA years after birth, whereas TEDDY participants were enrolled at birth. Hence, the prediction of IA can play a significant role in finding high-risk children more effectively. In T1D outcome prediction, the time series gene expression collected by TEDDY is underused, partly because of the high percentage of missing data. To solve the missing data problem, we generated synthetic gene expression from SNPs, and it shows competitive performance compared with true gene expression. We also successfully translated it into better IA prediction, as shown in Table 2. Encouraged by this predictive ability, we used the synthetic gene expression to identify several pathways known to be related to T1D, which are reported in Supplementary Table S5. Although gene expression consisting entirely of true data might improve the prediction even further, missing data are inevitable for future participants as well because of the limitations of longitudinal data collection. This is evident from the fact that all 401 participants with gene expression in the existing TEDDY study have incomplete profiles. Therefore, developing a framework that reduces the reliance on time series gene expression collection while retaining the improvement introduced by the time factor is an important task. In this study, we not only predict IA status with higher sensitivity, specificity, Youden’s index and AUC, but also do so using entirely synthetic gene expression in the classifier.

In our study, we used input data up to 48 months but, in some cases, predicted IA status earlier than 48 months, as seen in Table 2. Two critical questions arise if, for example, we predict IA status at 24 months using input data up to 48 months. First, are the prediction results biased by the presence of gene expression collected after 24 months? Second, participants are tested for islet autoantibodies when gene expression is collected at visits on or after 24 months; if the participants already have their IA results, further prediction becomes moot. Regarding the first concern, all gene expression used for classification in this study is synthetic, and true gene expression is used only to generate it. We have shown better prediction results for 6812 participants using true gene expression from only 401 participants. Moreover, the percentage of available gene expression is very small (1.23%), so it cannot create a significant bias in the prediction results. We also showed competitive results in Table 4 when IA status is predicted at 24 months using input data up to 12 or 24 months. Regarding the second concern, gene expression, the only time series data used as input to the classifier, is unavailable for approximately 94% of participants. For future participants, we do not need to collect gene expression; instead, synthetic gene expression can be generated as soon as their SNP data are available and then used to predict their IA status years in advance. Incomplete RNA-seq gene expression is also available for the cohort; however, because of the inferior prediction performance we observed with it, microarray gene expression was used in this study.

Our proposed method improves IA prediction by including synthetic gene expression. We focused on gene expression for early-life IA prediction and for identifying prognostic genes, whereas family history, HLA genotype and the selected SNPs were chosen based on the literature. A study involving other datasets, such as metabolites, more SNPs and environmental variables, could further improve the prediction accuracy and draw a more detailed picture of the disease pathogenesis. Additionally, the participants in TEDDY are all high-risk children screened using the HLA genotype, so the true gene expression used as training data in the imputation model also comes from those high-risk participants. Synthetic gene expression for children not yet identified as high risk may be inaccurate if the imputation model is trained only on the gene expression of high-risk participants. Therefore, our proposed pipeline is not an alternative to HLA genotype-based risk assessment but rather complements it to better identify high-risk children.

Conclusion

T1D is a chronic autoimmune disease characterized by irreversible destruction of islet β-cells. The incidence and prevalence of T1D have increased worldwide in recent years, which can strain access to and the affordability of insulin, the only treatment that keeps a T1D patient alive. Therefore, it is now more important than ever to find the key factors affecting the onset of this disease and to develop effective treatments or cures. A comprehensive case-control study such as TEDDY can provide researchers with answers to those questions. Early prediction of IA can help us design a better case-control study and ensure timely care for high-risk children. This study offers an approach for generating synthetic time series gene expression from SNPs and obtaining an improved and early IA prediction. Our proposed framework improves on state-of-the-art IA prediction by integrating synthetic gene expression in the analysis. Additionally, we compared time series gene expression with cross-sectional gene expression and showed the superior performance of time series gene expression even when it is entirely synthetic. The framework also opens the door to further computational analyses that link genes to T1D outcomes and to time series analyses that use incomplete multi-omics data to study other chronic diseases.

Key Points

  • A framework for imputing incomplete time series gene expression is developed, showing that better prediction performance can be achieved by introducing synthetic gene expression in an integrative study.

  • Synthetic gene expression provides more training data for a better characterization of participants and islet autoimmunity prediction.

  • Synthetic time series gene expression has superior ability to describe chronic diseases compared with cross-sectional gene expression.

Supplementary Material

SupplementaryDocument_bbac537

Funding

This work is supported by U24DK097771 from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) via the NIDDK Information Network’s (dkNET) New Investigator Pilot Program in Bioinformatics.

Acknowledgments

This research utilizes data obtained by the TEDDY study group, a collaborative clinical study sponsored by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute of Allergy and Infectious Diseases (NIAID), National Institute of Child Health and Human Development (NICHD), National Institute of Environmental Health Sciences (NIEHS), and Centers for Disease Control and Prevention (CDC) and JDRF. The data from the TEDDY study reported here were supplied by the database of Genotypes and Phenotypes (dbGaP), which is maintained by the National Center for Biotechnology Information (NCBI). This manuscript was not prepared in collaboration with investigators of the TEDDY study and does not necessarily reflect the opinions or views of the TEDDY study, dbGaP, or the NIDDK.

Khandakar Tanvir Ahmed is currently a PhD student in Computer Science at the University of Central Florida. He received his BS in Electronic Engineering from Bangladesh University of Engineering and Technology, Bangladesh. His research interests include modeling post-transcriptional regulations and application of machine learning in computational biology.

Sze Cheng is currently a PhD candidate in Biochemistry, Molecular Biology, and Biophysics at the University of Minnesota, Twin Cities. She received her BS in Biology from Lafayette College. Her research interests include post-transcription gene regulation by cellular signaling as well as RNA cancer biology.

Qian Li is an assistant professor in Biostatistics at St. Jude Children’s Research Hospital. She received her Ph.D. in Statistics from University of Missouri-Kansas City and postdoctoral training in statistical genomics at University of Kansas Medical Center and Moffitt Cancer Center. Her research interests are statistical modeling of genomics, longitudinal omics and multi-omics in cancer and T1D.

Jeongsik Yong is an associate professor of Biochemistry, Molecular Biology and Biophysics at the University of Minnesota Twin Cities. He received his PhD and postdoctoral training in Biochemistry and Biophysics at the University of Pennsylvania. His research interests are broadly in the plasticity of functional transcriptomics by cellular signaling.

Wei Zhang is an assistant professor in Computer Science at the University of Central Florida. He received his PhD in Computer Science from University of Minnesota Twin Cities. His research interests are broadly in cancer genomics, network-based learning in bioinformatics and disease phenotype prediction.

Contributor Information

Khandakar Tanvir Ahmed, Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA.

Sze Cheng, Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota Twin Cities, Minneapolis, MN 55455, USA.

Qian Li, Department of Biostatistics, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA.

Jeongsik Yong, Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota Twin Cities, Minneapolis, MN 55455, USA.

Wei Zhang, Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA.

References

  • 1. Crabtree BF, Ray SC, Schmidt PM, et al. The individual over time: time series applications in health care research. J Clin Epidemiol 1990;43(3):241–60. [DOI] [PubMed] [Google Scholar]
  • 2. Euser AM, Zoccali C, Jager KJ, et al. Cohort studies: prospective versus retrospective. Nephron Clin Pract 2009;113(3):c214–7. [DOI] [PubMed] [Google Scholar]
  • 3. Hammoudeh S, Gadelhaq W, Janahi I. Prospective cohort studies in medical research. IntechOpen 2018;11–28. [Google Scholar]
  • 4. Fortuin V, Baranchuk D, Rätsch G, et al. Gp-vae: Deep probabilistic time series imputation. In: International conference on artificial intelligence and statistics. Cambridge MA: JMLR, 2020, 1651–61. [Google Scholar]
  • 5. Saad M, Chaudhary M, Karray F, et al. Machine learning based approaches for imputation in time series data and their impact on forecasting. In: Derek Abbott (ed.) 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC) New York City: IEEE, 2020, 2621–7. [Google Scholar]
  • 6. Badsha MB, Li R, Liu B, et al. Imputation of single-cell gene expression with an autoencoder neural network. Quantitative Biology 2020;8(1):78–94. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Yungang X, Zhang Z, You L, et al. scIGANs: single-cell RNA-seq imputation using generative adversarial networks. Nucleic Acids Res 2020;48(15):e85–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Viñas R, Azevedo T, Gamazon ER, et al. Deep Learning Enables Fast and Accurate Imputation of Gene Expression. Front Genet 2021;12:489. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Song M, Greenbaum J, Joseph Luttrell IV, et al. A review of integrative imputation for multi-omics datasets. Front Genet 2020;11:570255. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Moorthy K, Mohamad MS, Deris S. A review on missing value imputation algorithms for microarray gene expression data. Current Bioinformatics 2014;9(1):18–22. [DOI] [PubMed] [Google Scholar]
  • 11. Zhou X, Chai H, Zhao H, et al. Imputing missing RNA-sequencing data from DNA methylation by using a transfer learning–based neural network. GigaScience 2020;9(7):giaa076. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Voillet V, Besse P, Liaubet L, et al. Handling missing rows in multi-omics data integration: multiple imputation in multiple factor analysis framework. BMC bioinformatics 2016;17(1):1–16. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Lin D, Zhang J, Li J, et al. An integrative imputation method based on multi-omics datasets. BMC bioinformatics 2016;17(1):1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Moorthy K, Jaber AN, Ismail MA, et al. Missing-values imputation algorithms for microarray gene expression data. Microarray Bioinformatics 2019;1986:255–66. [DOI] [PubMed] [Google Scholar]
  • 15. Choong MK, Charbit M, Yan H. Autoregressive-model-based missing value estimation for DNA microarray time series data. IEEE Trans Inf Technol Biomed 2009;13(1):131–7. [DOI] [PubMed] [Google Scholar]
  • 16. Luo Y, Cai X, Zhang Y, et al. Multivariate time series imputation with generative adversarial networks. Advances in neural information processing systems 2018;31:1603–14. [Google Scholar]
  • 17. Afrifa-Yamoah E, Mueller UA, Taylor SM, et al. Missing data imputation of high-resolution temporal climate time series data. Meteorological Applications 2020;27(1):e1873. [Google Scholar]
  • 18. Cao W, Wang D, Li J, et al. Brits: Bidirectional recurrent imputation for time series. Advances in neural information processing systems 2018;31:6776–86. [Google Scholar]
  • 19. Yoon J, Zame WR, Schaar M. Estimating missing data in temporal data streams using multi-directional recurrent neural networks. IEEE Transactions on Biomedical Engineering 2018;66(5):1477–90. [DOI] [PubMed] [Google Scholar]
  • 20. Teddy . The environmental determinants of diabetes in the young (TEDDY) study, n.d. https://teddy.epi.usf.edu/ (31 December 2021, date last accessed).
  • 21. Kawasaki E. Type 1 diabetes and autoimmunity. Clinical pediatric endocrinology 2014;23(4):99–105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Krischer JP, Liu X, Vehik K, et al. Predicting islet cell autoimmunity and type 1 diabetes: an 8-year TEDDY study progress report. Diabetes Care 2019;42(6):1051–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Webb-Robertson B-JM, Bramer LM, Stanfill BA, et al. Prediction of the development of islet autoantibodies through integration of environmental, genetic, and metabolic markers. J Diabetes 2021;13(2):143–53. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Orešič M, Gopalacharyulu P, Mykkänen J, et al. Cord serum lipidome in prediction of islet autoimmunity and type 1 diabetes. Diabetes 2013;62(9):3268–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Winkler C, Krumsiek J, Buettner F, et al. Feature ranking of type 1 diabetes susceptibility genes improves prediction of type 1 diabetes. Diabetologia 2014;57(12):2521–9. [DOI] [PubMed] [Google Scholar]
  • 26. Oram RA, Patel K, Hill A, et al. A type 1 diabetes genetic risk score can aid discrimination between type 1 and type 2 diabetes in young adults. Diabetes Care 2016;39(3):337–44. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Beyerlein A, Bonifacio E, Vehik K, et al. Progression from islet autoimmunity to clinical type 1 diabetes is influenced by genetic factors: results from the prospective TEDDY study. J Med Genet 2019;56(9):602–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Bonifacio E, Beyerlein A, Hippich M, et al. Genetic scores to stratify risk of developing multiple islet autoantibodies and type 1 diabetes: a prospective study in children. PLoS Med 2018;15(4):e1002548. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Hippich M, Beyerlein A, Hagopian WA, et al. Genetic contribution to the divergence in type 1 diabetes risk between children from the general population and children from affected families. Diabetes 2019;68(4):847–57. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Sosenko JM, Palmer JP, Rafkin-Mervis L, et al. Glucose and C-peptide changes in the perionset period of type 1 diabetes in the Diabetes Prevention Trial–Type 1. Diabetes Care 2008;31(11):2188–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Redondo MJ, Geyer S, Steck AK, et al. A type 1 diabetes genetic risk score predicts progression of islet autoimmunity and development of type 1 diabetes in individuals at risk. Diabetes Care 2018;41(9):1887–94. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Ferrat LA, Vehik K, Sharp SA, et al. A combined risk score enhances prediction of type 1 diabetes among susceptible children. Nat Med 2020;26(8):1247–55. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. J Comput Biol 2002;9:505–12. [DOI] [PubMed] [Google Scholar]
  • 34. Ran S, Liu X, Wei L, et al. Deep-Resp-Forest: A deep forest model to predict anti-cancer drug response. Methods 2019;166:91–102. [DOI] [PubMed] [Google Scholar]
  • 35. Zarringhalam K, Degras D, Brockel C, et al. Robust phenotype prediction from gene expression data using differential shrinkage of co-regulated genes. Sci Rep 2018;8(1):1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Ahmed KT, Sun J, Chen W, et al. In silico model for miRNA-mediated regulatory network in cancer. Brief Bioinform 2021;22(6):bbab264. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Sharifi-Noghabi H, Zolotareva O, Collins CC, et al. MOLI: multi-omics late integration with deep neural networks for drug response prediction. Bioinformatics 2019;35(14):i501–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Dimitrakopoulos C, Hindupur SK, Häfliger L, et al. Network-based integration of multi-omics data for prioritizing cancer genes. Bioinformatics 2018;34(14):2441–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Singh A, Shannon CP, Gautier B, et al. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics 2019;35(17):3055–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Chaudhary K, Poirion OB, Liangqun L, et al. Deep learning–based multi-omics integration robustly predicts survival in liver cancer. Clin Cancer Res 2018;24(6):1248–59. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Ahmed Khandakar Tanvir, Sun Jiao, Cheng Sze, Yong Jeongsik, and Zhang Wei. Multi-omics data integration by generative adversarial network. Bioinformatics, 38(1):179–86, 08 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Xhonneux L-P, Knight O, Lernmark Å, et al. Transcriptional networks in at-risk individuals identify signatures of type 1 diabetes progression. Sci Transl Med 2021;13(587):eabd5666. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. TEDDY Study Group, et al. The environmental determinants of diabetes in the young (TEDDY) study. Ann N Y Acad Sci 2008;1:1150. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Jacobsen LM, Larsson HE, Tamura RN, et al. Predicting progression to type 1 diabetes from ages 3 to 6 in islet autoantibody positive TEDDY children. Pediatr Diabetes 2019;20(3):263–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Li Q, Parikh H, Butterworth MD, et al. Longitudinal metabolome-wide signals prior to the appearance of a first islet autoantibody in children participating in the TEDDY study. Diabetes 2020;69(3):465–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Steck AK, Vehik K, Bonifacio E, et al. Predictors of progression from the appearance of islet autoantibodies to early childhood diabetes: The Environmental Determinants of Diabetes in the Young (TEDDY). Diabetes Care 2015;38(5):808–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Cai J, Luo J, Wang S, et al. Feature selection in machine learning: A new perspective. Neurocomputing 2018;300:70–9. [Google Scholar]
  • 48. Ghaddar B, Naoum-Sawaya J. High dimensional data classification and feature selection using support vector machines. European Journal of Operational Research 2018;265(3):993–1004. [Google Scholar]
  • 49. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 2011;12:2825–30. [Google Scholar]
  • 50. Paszke A, Gross S, Massa F, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 2019;32:8026–37. [Google Scholar]
  • 51. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9(8):1735–80. [DOI] [PubMed] [Google Scholar]
  • 52. Redondo MJ, Libman I, Cheng P, et al. Racial/ethnic minority youth with recent-onset type 1 diabetes have poor prognostic factors. Diabetes Care 2018;41(5):1017–24. [DOI] [PubMed] [Google Scholar]

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press