eLife. 2020 Apr 1;9:e54532. doi: 10.7554/eLife.54532

© 2020, Taujale et al

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Figure 6. (A) Brief pipeline of the ML analysis. Training set inputs to the pipeline are shown in green boxes. Steps of the ML analysis, shown in purple boxes, correspond to the panels of the figure. (B) Percent accuracy based on 10-fold cross-validation (CV) for each of the trained ML models. (C) Confusion matrix from the best model (GDBT using 239 features). (D) Scatter plot showing the probability scores assigned to each predicted sequence, grouped by predicted donor type. Colors indicate the confidence level of the prediction, based on the probability of assignment to a given donor class as well as the confidence interval of the predicted class, i.e., the difference in probability between the first and second predicted classes (Figure 6—source data 2).

Figure 6—source data 1. List of the 713 training dataset sequences used for machine learning.
The ‘Assigned Donor Class’ column indicates one of the six classes the donor belongs to.

elife-54532-fig6-data1.xlsx (58.1 KB, xlsx)

Figure 6—source data 2. Results for donor prediction using the GDBT ML model for GT-A sequences from five model organisms.
The validation datasets (rows highlighted in blue) include GTs that have some experimental characterization but were not included in the characterized dataset. The validation set was used to compare the model predictions with the experimental results. The ‘Match Experimental’ column indicates whether the prediction matched the experimental results. The prediction set includes predictions for GTs of unknown function. The ‘Confidence’ column gives the confidence of each prediction, derived from the probability of the first class and its difference from the probability of the second class. Probabilities for all six classes are provided in the ‘Classwise Probablity’ columns.

elife-54532-fig6-data2.xlsx (71.5 KB, xlsx)
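
The confidence measure described above combines the probability of the top-ranked class with its margin over the runner-up. The following is a minimal Python sketch of that idea, not the authors' implementation: the function names and the `min_top` and `min_margin` thresholds are hypothetical, since the legend does not state the cutoffs used.

```python
from typing import Sequence, Tuple


def top_two_margin(probs: Sequence[float]) -> Tuple[int, float, float]:
    """Return (index of top class, its probability, margin over the runner-up)."""
    if len(probs) < 2:
        raise ValueError("need at least two class probabilities")
    # Rank class indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    first, second = ranked[0], ranked[1]
    return first, probs[first], probs[first] - probs[second]


def confidence_label(
    probs: Sequence[float],
    min_top: float = 0.5,     # hypothetical cutoff, not taken from the paper
    min_margin: float = 0.2,  # hypothetical cutoff, not taken from the paper
) -> str:
    """Label a prediction 'high' only if both the top probability and the margin are large."""
    _, top, margin = top_two_margin(probs)
    return "high" if top >= min_top and margin >= min_margin else "low"
```

For a six-class probability vector, `top_two_margin` returns the predicted class index together with the two quantities the legend names, and `confidence_label` turns them into a coarse label under the assumed cutoffs.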

Figure 6—figure supplement 1.
