Fig. 2 | A classical ML approach to predict mammalian TEs from mRNA sequence.

a, UpSet plot showing the r² measured on ten held-out CV folds of LGBM models that predict the mean TE across human cell types using various feature sets (n = 10,242 genes in total across the ten folds). Colored feature sets are those that contributed to the optimal sequence-only model. Median r² and statistically significant differences in performance between pairs of models are indicated. P values were calculated using one-sided, paired t tests adjusted with a Bonferroni correction. NS, not significant. All additional feature sets that were considered but did not significantly improve performance are labeled as ‘other’. b,c, Scatter plots comparing the predicted and observed mean TEs, averaged across cell types, for human (b) and mouse (c). The r², Pearson (r) and Spearman (ρ) correlation coefficients, integrating the results across the ten CV folds, are also shown. d,e, Importance of the features used by the optimal sequence-only model (shown as a red bar in a) for human (d) and mouse (e). For a given feature, importance was measured as the total information gain summed across all splits using the feature, averaged across all folds. The colors of the bars correspond to the mean Spearman ρ between each feature and TE, averaged over the ρ values computed for each cell type. Feature names are colored according to the feature set to which they belong.
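The importance measure described for panels d and e can be illustrated with a minimal sketch. This is not the authors' code: the feature names, gain values, and function names below are hypothetical, and the sketch assumes that a feature not used in a given fold contributes a gain of 0 to the average. In LightGBM, per-fold totals of this kind are what `Booster.feature_importance(importance_type="gain")` reports.

```python
from collections import defaultdict


def total_gain_per_feature(splits):
    """Sum the information gain over every split that uses each feature."""
    totals = defaultdict(float)
    for feature, gain in splits:
        totals[feature] += gain
    return dict(totals)


def mean_importance_across_folds(folds):
    """Average the per-fold total gain of each feature over all folds.

    Assumption: a feature that is never split on in a fold counts as 0 there.
    """
    per_fold = [total_gain_per_feature(splits) for splits in folds]
    features = sorted({f for totals in per_fold for f in totals})
    return {
        f: sum(totals.get(f, 0.0) for totals in per_fold) / len(per_fold)
        for f in features
    }


# Hypothetical toy data: two folds, each a list of (feature, gain) split records.
folds = [
    [("utr5_len", 2.0), ("cds_gc", 1.0), ("utr5_len", 4.0)],
    [("utr5_len", 3.0), ("kozak", 5.0)],
]
importance = mean_importance_across_folds(folds)
```

With real models, the list of folds would be replaced by the split records (or the per-feature gain totals) extracted from each of the ten trained CV models.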