Predicting functional UTR variants by integrating region-specific features

Guangyu Li; Jiayu Wu; Xiaoyue Wang

2024 May 23;25(4):bbae248. doi: 10.1093/bib/bbae248

PMCID: PMC11116830 PMID: 38783704

Abstract

The untranslated region (UTR) of messenger ribonucleic acid (mRNA), including the 5′UTR and 3′UTR, plays a critical role in regulating gene expression and translation. Variants within the UTR can lead to changes associated with human traits and diseases; however, computational prediction of UTR variant effect is challenging. Current noncoding variant prediction mainly focuses on the promoters and enhancers, neglecting the unique sequence of the UTR and thereby limiting their predictive accuracy. In this study, using consolidated datasets of UTR variants from disease databases and large-scale experimental data, we systematically analyzed more than 50 region-specific features of UTR, including functional elements, secondary structure, sequence composition and site conservation. Our analysis reveals that certain features, such as C/G-related sequence composition in 5′UTR and A/T-related sequence composition in 3′UTR, effectively differentiate between nonfunctional and functional variant sets, unveiling potential sequence determinants of functional UTR variants. Leveraging these insights, we developed two classification models to predict functional UTR variants using machine learning, achieving an area under the curve (AUC) value of 0.94 for 5′UTR and 0.85 for 3′UTR, outperforming all existing methods. Our models will be valuable for enhancing clinical interpretation of genetic variants, facilitating the prediction and management of disease risk.

Keywords: UTR, functional variants, prediction model

Introduction

The untranslated region (UTR) of messenger ribonucleic acid (mRNA)—5′UTR and 3′UTR—plays a crucial role in regulating transcript expression, stability, localization and translation. Due to their critical roles in maintaining gene function and cellular homeostasis, the sequences encoding UTRs are among the most conserved noncoding sequences within the genome [1–5]. Both 5′UTR and 3′UTR of mRNAs modulate mRNA functions through interactions with RNA-binding proteins (RBPs) and microRNAs via specific cis-regulatory elements [1, 6, 7].

Within the 5′UTR, upstream open reading frames (uORFs) interact with the translational machinery to modulate the translation rate of the downstream coding sequence [8]. Disruption of uORFs can lead to a significant reduction in protein expression [8]. Internal ribosome entry site (IRES) sequences, which form secondary structures, can interact with IRES trans-acting factors and subsequently recruit ribosomes to initiate cap-independent translation of the mRNA [6, 9]. IRES elements are mostly present in genes involved in stress, mitosis or apoptosis and in those activated in human tumor cells [10]. Other regulatory structures in 5′UTRs, such as pseudoknots, hairpins and RNA G-quadruplexes, also contribute to the modulation of translation by interacting with different RBPs [6, 11].

The 3′UTR of mRNA contains polyadenylation signals (PAS), which affect maturation, stability and translation of mRNA by guiding the cleavage and polyadenylation process [1, 7]. Variations in these regions can lead to alternative polyadenylation and thereby produce mRNA isoforms with diverse functions, a process implicated in diseases such as cancer [1, 12]. Additionally, the 3′UTR contains miRNA target sites, which are sequences complementary to the seed region of miRNAs. Binding of miRNAs to their target sites can lead to decreased mRNA stability or repression of translation [1, 12].

Variants within UTRs that affect these regulatory motifs can have significant consequences, leading to phenotypic changes and diseases. For example, a melanoma risk variant, G-34T in the 5′UTR of the CDKN2A gene, reduces the CDKN2A protein level by creating an aberrant upstream start codon (uAUG) that inhibits downstream translation [13]. The rs10954213 variant in the 3′UTR of the IRF5 gene, which is associated with increased susceptibility to systemic lupus erythematosus, changes the length of the 3′UTR by altering a PAS, leading to reduced mRNA stability [14]. Understanding such UTR variants is essential for disease risk management and the advancement of precision medicine.

Existing methods for predicting functional variants, such as CADD [15], FunSeq2 [16] and LINSIGHT [17], primarily focus on evolutionary conservation and are effective mainly for variants in coding regions or under strong selection. Other models designed specifically for noncoding regions, like FATHMM-MKL [18], GWAVA [19] and RF [20], have demonstrated that the incorporation of variant-specific regulatory and sequence features improves prediction of functional regulatory variants. Tools like UTRannotator [21] and UTR.annotation [22] annotate variants affecting known regulatory elements (uORF, Kozak, PAS, etc.) in UTRs for evaluation of deleteriousness. Notably, UTR.annotation revealed variants in UTR regulatory elements, while not always conserved, are associated with expression changes. High-throughput reporter assays have identified additional UTR-specific sequence features crucial for function [23, 24]. Incorporating these features may lead to a model with enhanced sensitivity for predicting functional UTR variants.

Therefore, we developed FunUV, a set of two specialized classification models, one tailored for the 5′UTR and the other for the 3′UTR. Each model incorporates site conservation, secondary structure and UTR-specific features, including well-known cis-regulatory elements and other newly identified sequence attributes. Our model demonstrates enhanced sensitivity in predicting functional UTR variants, outperforming existing methods and thereby enhancing the interpretation of UTR variant functionality.

Methods

Collection of multi-sourced datasets for UTR variants

We collect five databases of variants reported in population or patients: ClinVar [25], HGMD-PUBLIC [26], NHGRI-EBI GWAS Catalog [27], 1000 Genomes Project (1KG) [28] and gnomAD [29]. Additionally, we include three datasets of disease-associated variants: candidate causal SNPs for 39 immune and nonimmune diseases in the fine-mapping study (FineMap) [30], microRNA-related variants in 3′UTR associated with the risk of Alzheimer’s disease (AD-SNP) [31] and a database named miRdSNP of disease-associated SNPs in 3′UTR [32] (miRdSNP). We also collect three experimental datasets, including allelic imbalanced SNPs (AIS) [33], functional variants in 3′UTR based on massive parallel reporter assays (MPRA) [24] and loss of function (LOF) variants in BRCA1/2 identified through base editing screening (BBE) [34]. We then filter and annotate these datasets to generate uniformly formatted data files for subsequent analysis (Supplementary Fig. S1 available online at http://bib.oxfordjournals.org/). The final information of valid samples contained in each dataset can be found in Supplementary Tables S1 and S2 available online at http://bib.oxfordjournals.org/.

Regarding the positive set, we collect the following variants: pathogenic and likely pathogenic (PLP) variants from ClinVar; deleterious variants from HGMD; GWAS variants with P-value <5E-8, odds ratio ≥1, and no SNPs in high linkage disequilibrium (r² > 0.2, calculated by LDtrait [35]); causal variants from FineMap; imbalanced variants with false discovery rate <0.01 from AIS (AIS-IB); disease-associated variants from AD-SNP and miRdSNP (only for 3′UTR); MPRA variants with significant effects in all six cell lines and corrected P-value <0.1 (only for 3′UTR); and LOF variants from BBE.

Regarding the negative set, we use the following variants: benign and likely benign (BLB) variants from ClinVar; common variants with population frequency > 5% in all continental ancestries from 1KG and gnomAD, excluding any variant present in the positive datasets, with the two sources then merged (1KG-gnomAD); and nonimbalanced variants (AIS-NIB) from AIS.

We then match the genes in the positive and negative sets, meaning that only the genes present in the positive set are retained in the negative set. In this way, bias arising from sequence heterogeneity between genes is mitigated to some extent when comparing sequence composition features.
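The gene-matching step can be illustrated with a minimal sketch. The record layout (dictionaries with `id` and `gene` keys) and the function name are hypothetical and chosen only for illustration; the actual data files may be organized differently.

```python
def match_negative_to_positive_genes(positive, negative):
    """Keep only negative variants whose gene also occurs in the positive set."""
    positive_genes = {variant["gene"] for variant in positive}
    return [variant for variant in negative if variant["gene"] in positive_genes]


# Toy records (hypothetical layout): n2 is dropped because TP53 has no positive variant.
positive = [{"id": "p1", "gene": "CDKN2A"}, {"id": "p2", "gene": "IRF5"}]
negative = [{"id": "n1", "gene": "CDKN2A"}, {"id": "n2", "gene": "TP53"}]
matched = match_negative_to_positive_genes(positive, negative)
```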

All datasets are annotated with the UTR features described below, forming a variant-feature matrix for subsequent feature distribution analysis and prediction model construction.

Transcripts selection and UTR definition

We refer to Matched Annotation from NCBI and EMBL-EBI (MANE) [36] to determine the transcript of a gene. The MANE project provides the canonical transcript set ‘MANE Select’, which serves as a universal standard for clinical reporting. It also includes an extended transcript set, ‘MANE Plus Clinical’, which is used when ‘MANE Select’ alone is not sufficient to report the pathogenic variants available in public resources.

Annotation of region-specific features in UTR

For feature annotation, we provide not only the full-length sequence of the whole UTR but also the flanking sequences of a variant. We extract flanking sequences of 50 base pairs (bp) for 5′UTR and 100 bp for 3′UTR, referring to other studies [23, 24]. During the process, we preferentially extract upstream and downstream sequences of equal length, unless there is insufficient length at one end.
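The flank-extraction rule can be sketched as follows. This is an illustration under two assumptions that the text does not state: the quoted length (50 or 100 bp) is treated as the total flank budget, split evenly between the two sides, and when one side is too short the unused budget is given to the other side. The function name and 0-based coordinates are hypothetical.

```python
def extract_window(utr, pos, flank_total):
    """Return (start, end) slice bounds of a window around 0-based index `pos`.

    The window holds the variant base plus up to `flank_total` flanking bases,
    split equally upstream and downstream when the UTR is long enough.
    """
    half = flank_total // 2
    max_down = len(utr) - pos - 1          # bases available downstream
    up = min(half, pos)                    # bases taken upstream
    down = min(half, max_down)             # bases taken downstream
    # Assumption: budget unused on a short side is reassigned to the other side.
    spare_up, spare_down = half - up, half - down
    down = min(down + spare_up, max_down)
    up = min(up + spare_down, pos)
    return pos - up, pos + down + 1


utr = "A" * 200
window = utr[slice(*extract_window(utr, 100, 50))]  # 25 bp on each side plus the variant
```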

We identified 52 features of 5′UTR and 50 features of 3′UTR and categorized them into four groups: site conservation, secondary structure, functional elements and sequence composition (Supplementary Table S3 available online at http://bib.oxfordjournals.org/). Below are detailed descriptions of the methods used to define these features:

Site conservation

We use two canonical conservation scores: PhastCons100way score (PCS) [4] and PhyloP100way score (PPS) [37] to annotate site conservation.

Minimum free energy alteration

We use the minimum free energy (MFE) to characterize the secondary structure of the sequence, since many of the regulatory structures are presented and function in the form of secondary structures, especially for 5′UTR [6]. RNAFold [38, 39] is used to calculate the MFE of flanking sequences containing alt or ref, respectively. The difference between the two MFE values is used as the final feature value for subsequent analysis.
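A minimal sketch of the MFE alteration feature is given below. The folding routine is passed in as a callable so that the example stays self-contained; in practice it would wrap RNAFold. The sign convention (alt minus ref) is an assumption, since the text only states that the difference between the two values is used, and `toy_mfe` is a placeholder energy model, not a real folding algorithm.

```python
def mfe_alteration(ref_window, alt_window, mfe):
    """Difference in minimum free energy between the alt and ref windows.

    `mfe` is any callable mapping a sequence to its MFE (kcal/mol).
    Assumption: the difference is taken as alt minus ref.
    """
    return mfe(alt_window) - mfe(ref_window)


def toy_mfe(seq):
    # Placeholder for illustration only: -1 kcal/mol per G or C base.
    return -float(sum(base in "GC" for base in seq))


delta = mfe_alteration("GGCGC", "GGAGC", toy_mfe)  # losing one C raises the toy energy by 1
```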

uAUG alteration or uStop alteration

We scan the full sequence of the 5′UTR and classify alterations of uAUG or uStop elements into three categories [21], according to whether the variant produces a new uAUG/uStop or alters an existing one. The feature value is denoted by the indicator variable 1, −1 or 0.
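The three-valued indicator can be sketched by comparing element counts in the reference and alternate 5′UTR. The mapping used here (1 for a gained element, −1 for a lost one, 0 otherwise) is an assumption consistent with the description of nonzero values as increases or decreases in element number. For simplicity this sketch counts stop codons in any reading frame, which is looser than an annotation tool that requires the stop to be in frame with an upstream start codon.

```python
def count_motif(seq, motifs):
    """Count occurrences (overlaps allowed) of any of `motifs` in `seq`."""
    return sum(seq.startswith(m, i) for i in range(len(seq)) for m in motifs)


def element_alteration(ref_utr, alt_utr, motifs):
    """Return 1 if the variant adds an element, -1 if it removes one, else 0."""
    diff = count_motif(alt_utr, motifs) - count_motif(ref_utr, motifs)
    return (diff > 0) - (diff < 0)


UAUG = ("ATG",)                  # upstream start codon, DNA alphabet
USTOP = ("TAA", "TAG", "TGA")    # stop codons, frame ignored in this sketch

gain = element_alteration("CCACGGCC", "CCATGGCC", UAUG)  # C>T creates an ATG
```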

uORF alteration

The feature value is the change in uORF length caused by the variant, expressed as a proportion of the total 5′UTR length.

Kozak-score alteration

The Kozak motif strength of all 8 bp flanking sequences containing the variant location is identified and defined as Kozak-score. Then we calculate the alteration of Kozak-score caused by the variant as feature value. The calculation method is as follows:

a) All 8 bp flanking sequences containing ATG are extracted from the upstream and downstream of the target base site.
b) Tomtom [40] is used to compute the similarity between sequences extracted in the above step and the two Kozak motifs (strong or weak) identified from a previous mean ribosome load based research [23].
c) The sequence with the smallest q-value (should be ≤0.1) in the Tomtom result is selected as the most similar Kozak motif sequence.
d) The effect size of strong Kozak motif is set to 1 and the effect size of the weak Kozak motif is set to 0.5, then the final effect score of the most similar Kozak motif is computed as:

(1)

IRES annotation

We annotate whether the variant is located in an IRES region using the Human IRES Atlas database [10]. The feature value is denoted by the indicator variable 1 or 0.

PAS alteration

We search all 6 bp flanking sequences containing the variant site and map them to 15 common PAS motifs [41] to determine whether the variant produces a new PAS or alters an existing one. The feature value is denoted by the indicator variable 1, −1 or 0.

Other AU-enriched regulatory elements

We also analyze changes to other canonical AU-enriched regulatory elements in the 3′UTR, including AUUUA (AU-class element), UUAUUUAWW (AU-9mer element) and WWWUAUUUAUWWW/WWAUUUAUUUAWW (AU-13mer element) [24, 42]. We search for all occurrences of these elements that overlap the variant site and determine whether the variant changes the element count. This feature is named AU-elements, and the feature value is the change in count.
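Counting these elements can be sketched with regular expressions, where the IUPAC code W (A or U) is expanded to a character class. Two choices here are assumptions made for illustration: counts of the four patterns are summed into a single value, and overlapping matches are all counted.

```python
import re

AU_PATTERNS = ["AUUUA", "UUAUUUAWW", "WWWUAUUUAUWWW", "WWAUUUAUUUAWW"]


def compile_iupac(pattern):
    # W is the IUPAC code for A or U; the lookahead lets matches overlap.
    return re.compile("(?=" + pattern.replace("W", "[AU]") + ")")


AU_REGEXES = [compile_iupac(p) for p in AU_PATTERNS]


def count_au_elements(seq):
    """Total number of AU-element matches in a DNA or RNA sequence."""
    rna = seq.upper().replace("T", "U")
    return sum(len(rx.findall(rna)) for rx in AU_REGEXES)


def au_element_alteration(ref_window, alt_window):
    """Change in AU-element count caused by the variant (alt minus ref)."""
    return count_au_elements(alt_window) - count_au_elements(ref_window)


change = au_element_alteration("GGATTTAGG", "GGATCTAGG")  # T>C destroys one AUUUA
```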

miRNA binding site annotation

We annotate miRNA binding sites using the TargetScanHuman database [43]. P_ct value, the probability of preferentially conserved targets for all highly conserved miRNA families [44], is adopted as the feature value.

Sequence composition

Following the MPRA-based screening of variants in 3′UTR [24], we calculate the 44 metrics used in that study to measure sequence composition. The 44 metrics include 4 nucleotide percentages (A-ratio/T-ratio/C-ratio/G-ratio), 3 dinucleotide percentages (GC-ratio/AC-ratio/AG-ratio), 16 exact dinucleotide counts (AA-count/AT-count/AC-count/AG-count…), 4 maximum homopolymer lengths (A-homo/T-homo/C-homo/G-homo), 16 maximum dinucleotide lengths across all bases (AA-dimer/AT-dimer/AC-dimer/AG-dimer…), and a measure of sequence uniformity (Sequni), computed as: for i in range(1, len(seq)): if seq[i] == seq[i-1]: Sequni += 1.
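Several of these metrics can be computed with a few lines of Python. The sketch below covers the nucleotide percentages, exact dinucleotide counts, maximum homopolymer lengths and Sequni; the three dinucleotide percentages and the dimer-length metrics are omitted because their exact definitions are not given here. Function names are illustrative.

```python
from itertools import groupby

BASES = "ATCG"


def nucleotide_ratios(seq):
    """Fraction of each base in the sequence."""
    return {b + "-ratio": seq.count(b) / len(seq) for b in BASES}


def dinucleotide_counts(seq):
    """Exact counts of all 16 dinucleotides, scanning every adjacent pair."""
    counts = {a + b + "-count": 0 for a in BASES for b in BASES}
    for i in range(len(seq) - 1):
        key = seq[i:i + 2] + "-count"
        if key in counts:          # skips pairs containing N or other symbols
            counts[key] += 1
    return counts


def max_homopolymer(seq):
    """Length of the longest run of each base."""
    longest = {b + "-homo": 0 for b in BASES}
    for base, run in groupby(seq):
        key = base + "-homo"
        if key in longest:
            longest[key] = max(longest[key], len(list(run)))
    return longest


def sequni(seq):
    """Number of positions whose base equals the preceding base."""
    return sum(seq[i] == seq[i - 1] for i in range(1, len(seq)))


features = {**nucleotide_ratios("AAACGT"), **dinucleotide_counts("AAACGT"),
            **max_homopolymer("AAACGT"), "Sequni": sequni("AAACGT")}
```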

Construction of classification model

Building the training set and test set

As described above, we merge the 11 datasets into positive and negative sets, respectively (Supplementary Fig. S1 available online at http://bib.oxfordjournals.org/). The final datasets include 1,795 positive variants and 1,135 negative variants for 5′UTR, and 3,332 positive variants and 11,197 negative variants for 3′UTR (Supplementary Tables S1 and S2 available online at http://bib.oxfordjournals.org/). We split the merged data into a training set and a test set at a ratio of 7:3, and repeat this process 10 times with different random seeds. To address the issue of sample imbalance in our dataset, we employ resampling techniques on the training sets. For the 5′UTR variants, we apply Edited Nearest Neighbors (ENN) to undersample the negative samples, while for the 3′UTR variants, we use Synthetic Minority Over-sampling Technique (SMOTE) to oversample the positive variants. In each iteration, the training set is used to tune the hyperparameters of each model by 10-fold cross-validation, and the test set is used to evaluate the performance of the model. The final optimal model is determined based on the average area under the curve (AUC) of the receiver operating characteristic (ROC) on the test sets.
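The repeated 7:3 split can be sketched in plain Python as below. Stratifying by class label is an assumption made for this illustration, as the text does not say whether the split preserved class ratios. Any resampling (ENN or SMOTE) would then be applied to the training indices of each split, leaving the test indices untouched.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


# Toy labels: 30 positives and 70 negatives, split 10 times with seeds 0..9.
labels = [1] * 30 + [0] * 70
splits = [stratified_split(labels, 0.3, seed) for seed in range(10)]
```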

Training model

We use GridSearchCV from scikit-learn in Python 3.7 for hyperparameter tuning of six models including XGBoost, gradient boosting decision tree (GBDT), random forest (RF), logistic regression (LR), support vector machine (SVM) and multi-layer perceptron (MLP). After systematic tuning, we select the GBDT model as the final optimal model for both 5′UTR and 3′UTR (Supplementary Table S4 available online at http://bib.oxfordjournals.org/).

Feature selection

We use a two-step approach, including a filter-based method and an embedded method to refine the feature set utilized in our models.

Step1: We select an initial set of 21 features from the 52 features for 5′UTR and 19 features from the 50 features for 3′UTR with the following criteria based on feature distribution analysis and prior knowledge:

a) Retain two conservation score features, PCS and PPS.
b) Retain all features related to functional elements (Kozak-score, uORF, uAUG, uStop and IRES for 5′UTR; PAS, miRNA binding site and AU-elements for 3′UTR).
c) Retain key structure and sequence composition features including MFE, GC-ratio and Sequni.
d) For 5′UTR, all features related to C/G base composition are retained (CC-count, CG-count, GC-count, GG-count, C-homo, G-homo, CC-dimer, CG-dimer, GC-dimer, GG-dimer).
e) For 3′UTR, all features related to A/T base composition are retained (AA-count, AT-count, TA-count, TT-count, A-homo, T-homo, AA-dimer, AT-dimer, TA-dimer, TT-dimer).
f) Other features that show significant differences (P ≤ 0.1) in distribution between negative and positive datasets are also retained, as illustrated in Supplementary Fig. S2A and B available online at http://bib.oxfordjournals.org/.

Step2: we then use an iterative feature selection process to determine the final feature set as follows:

a) Use all features from Step1 to fit initial models.
b) Calculate AUC value on the test set and feature importance scores, ranking features based on the importance score in descending order.
c) Remove the least important feature to get a new feature set.
d) Refit model using the new feature set and calculate the AUC value on the test set.
e) If the AUC value increases after a feature is removed, repeat c) and d). If the AUC value decreases, add the removed feature back and stop; the resulting feature set is final.
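Steps a) to e) amount to a greedy backward elimination, which can be sketched as follows. The `evaluate` callable stands in for fitting the model and returning the test-set AUC together with feature importance scores; it and the toy data are hypothetical. Stopping when the AUC is unchanged (a tie) is an assumption, since the procedure only specifies the increase and decrease cases.

```python
def backward_eliminate(features, evaluate):
    """Greedy backward elimination.

    `evaluate(features)` must return (auc, importance) where `importance`
    maps each feature name to its importance score.
    """
    current = list(features)
    auc, importance = evaluate(current)
    while len(current) > 1:
        weakest = min(current, key=lambda f: importance[f])
        candidate = [f for f in current if f != weakest]
        new_auc, new_importance = evaluate(candidate)
        if new_auc <= auc:
            break  # removal did not improve AUC: keep the feature and stop
        current, auc, importance = candidate, new_auc, new_importance
    return current


# Toy evaluator (illustrative only): noise features lower the AUC, signal features raise it.
IMPORTANCE = {"PCS": 0.5, "GC-ratio": 0.3, "noise1": 0.01, "noise2": 0.02}


def toy_evaluate(features):
    n_noise = sum(f.startswith("noise") for f in features)
    auc = 0.80 + 0.02 * (len(features) - n_noise) - 0.03 * n_noise
    return auc, {f: IMPORTANCE[f] for f in features}


selected = backward_eliminate(list(IMPORTANCE), toy_evaluate)
```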

Finally, we get 20 features for 5′UTR: PCS, PPS, MFE alteration, Sequni, GC-ratio, Kozak-score, uORF, uAUG, uStop, IRES, CC-count, CG-count, GC-count, GG-count, AG-dimer, CC-dimer, GC-dimer, GG-dimer, C-homo, G-homo; and 16 features for 3′UTR: PCS, PPS, MFE alteration, Sequni, GC-ratio, PAS, miRNA binding sites, AU-elements, AA-count, AT-count, TA-count, TT-count, AC-dimer, AT-dimer, TA-dimer, T-homo.

External independent validation dataset

We collect an additional independent dataset from a high-throughput screening study of 3′UTR variants [45] as a validation set to assess the performance of FunUV. Table S5 of that paper contains a total of 6,266 variants, with 169 impacting mRNA translation and 137 impacting mRNA stability. We define variants that affect either mRNA translation (polysome significant) or mRNA stability (in vitro transcribed mRNA significant) as functional variants, and finally obtain 298 positive variants and 5,967 negative variants in this validation set (Supplementary Table S5 available online at http://bib.oxfordjournals.org/).

Obtaining functional scores of other prediction methods

We compare our method with seven existing prediction methods designed for the noncoding region variants, including CADD, FATHMM, FunSeq2, GWAVA, LINSIGHT, RF and WEVar [46]. For FATHMM and GWAVA, we install source code to run calculations locally on all collected datasets. For other software, the corresponding precomputed functional score files are directly downloaded and annotated to the datasets. CrossMap [47] is used for genome coordinate transformation, when the results are based on GRCh37/hg19. Since none of these methods is specifically designed for UTR, the sample size for each method varies (Supplementary Table S6 available online at http://bib.oxfordjournals.org/).

Statistical methods

We use the Kolmogorov–Smirnov test (KS-test) to compare the shapes of distributions between datasets. We use the Chi-square test (sample size ≥40 and frequency number ≥5) or, otherwise, Fisher’s exact test to compare the proportion of positive values between datasets, for example in the point-line plots showing the nonzero proportion of specific functional element alterations and in the bar plots showing the proportion of low values of homopolymer length or dinucleotide length.

Evaluation metrics

The metrics we use to evaluate model performance include:

Sensitivity or recall or true positive rate (TPR) = $\frac{TP}{TP + FN}$

Specificity or true negative rate (TNR) = $\frac{TN}{TN + FP}$

False positive rate (FPR) = 1 − Specificity = $\frac{FP}{FP + TN}$

Precision = $\frac{TP}{TP + FP}$

Balanced accuracy = $\frac{TPR + TNR}{2}$

F1 score = $\frac{2 \times Precision \times Recall}{Precision + Recall}$

Where TP is true positive, TN is true negative, FP is false positive and FN is false negative.
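For reference, these metrics can be computed from the four confusion-matrix counts using their standard definitions. The function name and the example counts are illustrative only.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall, true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": fp / (fp + tn),            # equals 1 - specificity
        "precision": precision,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Example counts: 80 TP, 90 TN, 10 FP, 20 FN.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```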

Results

Existing prediction algorithms have reduced performance on UTR variants

The UTR regions have unique regulatory features that set them apart from other noncoding regions. Our comparative assessment of seven algorithms for predicting functional noncoding variants revealed a notable reduction in performance when applied to UTR variants. Using PLP and BLB variants from ClinVar dataset, we observed a general decrease in the AUC values of the ROC for UTR regions compared to the total noncoding region. This decrease is most pronounced in WEVar, showing a maximum difference ratio of 0.2 (Fig. 1A).

Figure 1. Performance comparison between UTR and total noncoding region for existing methods, and conservation profiling of variants across genome regions. (A) Bar plot comparing AUC values on ClinVar variants between UTR and total noncoding region for different methods. (B) Box plot of two conservation scores for ClinVar variants in different genome regions.

Current algorithms predominantly rely on inter-species evolutionary conservation [15–17], designed under the assumption that variants in highly conserved regions are more likely to have functional consequences. However, evaluation of conservation scores for variants in UTR regions demonstrated less distinction between BLB and PLP UTR variants compared to variants in coding or intronic regions (Fig. 1B). This observation suggests a need for developing UTR-specific prediction models that incorporate features beyond conservation.

Differential patterns of UTR-specific features distinguish functional variants

To explore UTR-specific features that might differentiate functional from nonfunctional variants, we analyzed a comprehensive set of UTR variants from 11 datasets. These datasets included variants classified as PLP/BLB from sources like ClinVar, HGMD, 1000 Genomes and gnomAD, as well as fine-mapped hits in GWAS studies, and variants evaluated through high-throughput assays like MPRA and AIS assays. Variants were annotated as positive or negative based on their documented functional impacts in each dataset (Supplementary Fig. S1 available online at http://bib.oxfordjournals.org/). For example, in ClinVar, BLB variants were categorized as negative, while PLP variants were positive. In AIS assays, variants with significant allelic imbalance (AIS-IB) were classified as positive variants.

Our analysis uncovered that positive variants more frequently altered known regulatory elements. For 5′UTR variants, which included 1,795 positive and 1,135 negative variants, we observed a wider distribution of alterations in the Kozak-score, uORF, uAUG and uStop in the positive set (Supplementary Fig. S3A–3D available online at http://bib.oxfordjournals.org/). In particular, the proportions of nonzero feature values in the positive datasets were 1.5 to 9 times larger than in the corresponding negative datasets (P < 0.05), with the largest differences seen between PLP and BLB variants in ClinVar (Fig. 2A). For 3′UTR variants, which included 3,332 positive and 11,197 negative variants, alterations in miRNA binding sites and PAS were significantly enriched in the positive datasets, displaying a 2-fold to 10-fold increase compared to the negative variants (P < 0.05) (Fig. 2B and Supplementary Fig. S3E–3F available online at http://bib.oxfordjournals.org/).

Figure 2. Distribution comparison of features across datasets or between positive set and negative set. (A) Point-line plot of nonzero value ratios for functional elements in 5′UTR including Kozak-score, uORFs, uAUGs, uStops and IRES. A nonzero value is an annotation value not equal to 0, indicating an increase or decrease in the number of functional elements. Chi-square tests (sample size ≥40 and frequency number ≥5) or Fisher’s exact tests are used to compare nonzero counts. (B) Point-line plot of nonzero value ratios for functional elements in 3′UTR including PAS, AU-elements and miRNA binding sites. Here, the AD-SNP dataset is merged into the miRdSNP dataset and labeled as AD-miRdSNP, due to the small sample size. Chi-square tests (sample size ≥40 and frequency number ≥5) or Fisher’s exact tests are used to compare nonzero counts. (C) Bar plot comparing MFE alterations between positive and negative sets for variants in 5′UTR or 3′UTR. Y-axis is the mean value of MFE alterations between alt and ref. Error bars represent the 95% confidence interval. MFE characterizes the secondary structure of the sequence. P-values are from KS-tests. (D and E) Kernel density plot comparing sequence composition related features between positive set and negative set for variants in 5′UTR or 3′UTR, including (D) Sequni in 5′UTR or in 3′UTR and (E) GC-ratio in 5′UTR or in 3′UTR. The black vertical lines indicate the median values. P-values are from KS-tests.

In addition to these established regulatory elements, we also explored additional sequence features for potential predictive utility. MFE alterations were significantly higher for positive variants, indicating larger effects on UTR secondary structure (Fig. 2C). Moreover, the distributions of sequence composition features like GC ratio, specific dinucleotides and Sequni (continuous G/C or A/T sequences) showed distinctive patterns. For instance, the Sequni distribution in the positive set of the 5′UTR exhibited a double-peaked profile, while in the 3′UTR positive set, a wider tail of smaller Sequni values was observed (Fig. 2D). The GC ratio displayed a contrasting trend, decreasing in the positive set of the 5′UTR and increasing in the positive set of the 3′UTR (Fig. 2E). The C/G-dinucleotide features in 5′UTR and A/T-dinucleotide features in 3′UTR both had a higher proportion of low values in positive sets (Supplementary Fig. S4A–4B available online at http://bib.oxfordjournals.org/).

In summary, we observed differential distributions of various regulatory, structural and sequence composition features between functional and nonfunctional UTR variants, suggesting the potential utility of these features for predicting functional UTR variants.

Integrating UTR-specific sequence features for accurate prediction of functional UTR variants

To enhance the prediction accuracy of UTR variant functionality, we developed FunUV, a series of classification models by integrating key UTR-specific features identified in our analysis. Recognizing the distinct nature of 5′UTR and 3′UTR regions, we built two classification models separately, each using a different set of features (see details in Methods and Supplementary Fig. S2 available online at http://bib.oxfordjournals.org/ for feature selection). For the 5′UTR model building, we incorporated 20 features, and for the 3′UTR model, 16 features were selected. These features included two conservation scores, MFE alteration, various functional element alterations and sequence composition features that exhibited divergent patterns in our previous results.

To construct the most effective models for each UTR region, we explored several machine learning algorithms, including tree-based ensemble models, SVM, logistic regression (LR) and MLP (Fig. 3A). The integrated datasets were split into training and test sets at a ratio of 7:3, repeated 10 times using different random seeds. Employing 10-fold cross-validation on each training set, we fine-tuned the hyperparameters for each model, ultimately selecting GBDT models with 500 trees as the optimal choice for both 5′UTR and 3′UTR (see details in Supplementary Table S4 available online at http://bib.oxfordjournals.org/). On the 10 held-out test sets, the 5′UTR FunUV model achieved an average AUC of 0.94, with average precision of 0.86, average balanced accuracy of 0.85 and average F1-score of 0.85 (Fig. 3B and C). The 3′UTR FunUV model had an average AUC of 0.85, with average precision of 0.78, average balanced accuracy of 0.75 and average F1-score of 0.76 (Fig. 3D and E). Since there is little difference between the results of different random seeds (Fig. 3B–E), we fixed the random seed to 1 for both 5′UTR and 3′UTR and then conducted subsequent analysis.

Figure 3. Construction of classification model using machine learning. (A) Schematic diagram illustrating the construction of FunUV. (B) ROC curves of the optimal model GBDT for 5′UTR, showing AUC values on test sets split using different random seeds. (C) Bar plot of evaluation metrics of model performance for 5′UTR; error bars represent standard deviation. (D) ROC curves of the optimal model GBDT for 3′UTR, showing AUC values on test sets split using different random seeds. (E) Bar plot of evaluation metrics of model performance for 3′UTR; error bars represent standard deviation.

To interpret feature importance, we applied SHapley Additive exPlanations (SHAP) on each model [48] (Fig. 4A and B). SHAP values indicate how much each feature contributes, positively or negatively, to pushing model output toward a ‘functional variant’ prediction. As anticipated, in both 5′UTR and 3′UTR, the PCS score had the highest mean SHAP value, consistent with the high conservation of UTR regions [2–5]. The functional elements Kozak-score, uORF, IRES (Supplementary Fig. S5A available online at http://bib.oxfordjournals.org/) and miRNA binding (Supplementary Fig. S5B available online at http://bib.oxfordjournals.org/) all contributed positively to the prediction, although their contributions were lower overall, possibly due to their lower proportions of nonzero values. Some sequence composition features, Sequni and GC-ratio, along with specific dinucleotide counts (GC-count and CC-count in 5′UTR; AA-count and TT-count in 3′UTR), were also highly ranked, indicating their significant influence on the models.

Figure 4. Dissection of sequence composition related features. (A and B) Feature contributions of (A) the model for 5′UTR and (B) the model for 3′UTR. The higher the absolute SHAP value, the greater the contribution of the feature to the model. (C) Scatter plot of CC-count and GC-count and their SHAP values in 5′UTR. Each point represents one variant, and higher SHAP values imply greater propensity for functional variants. The negative correlation in the scatter plots indicates that the feature values are lower in functional variants. The color scale depicts the magnitude of dependent features, such as GC-ratio. (D) 11-mer motif comparison between ref and alt in the positive set in 5′UTR. Y-axis is the probability of the four nucleotides at each location; the larger the letter, the greater the probability of that nucleotide occurring at that location. X-axis is the location within the flanking sequence, and the 6th position is the variant site. (E) Scatter plot of AA-count and TA-count and their SHAP values in 3′UTR. (F) 11-mer motif information of ref for the negative set and MPRA dataset in 3′UTR.

In 5′UTR functional variants used for model training, we found that the CC-count and GC-count were negatively correlated with SHAP values (Fig. 4C). Analysis of 11-mer context sequences revealed 5′UTR functional variants were enriched in mutations that converted C/G to A/T bases within GC-rich regions (Fig. 4D and Supplementary Fig. S5C available online at http://bib.oxfordjournals.org/). This is consistent with the importance of GC-rich elements for formation of 5′UTR structures and RBP binding [6, 49].

In 3′UTR variants used for model training, AA/TA-counts were negatively correlated with SHAP values (Fig. 4E). Functional variants identified in MPRA assays were mainly found in sequences with higher AT-content, leading to a direct reduction in AA/AT-dinucleotide count (Fig. 4F). This pattern is concordant with the fact that long uracil homopolymers serve as the binding motifs for RBPs in 3′UTR [49, 50]. Nonetheless, when we compared functional 3′UTR variants annotated in HGMD and ClinVar, this trend was not apparent (Supplementary Fig. S5D available online at http://bib.oxfordjournals.org/). This discrepancy indicates that additional regulatory motifs, like miRNA binding sites and yet unidentified ones in the 3′UTR, may contribute to the complex post-transcriptional regulation of mRNA.

Improved performance of FunUV compared to current prediction methods

To further evaluate the prediction performance, we systematically compared FunUV with seven existing prediction methods: CADD, FATHMM, FunSeq2, GWAVA, LINSIGHT, RF and WEVar.

First, we examined the overall performance on the entire test set assembled from the 11 datasets. The AUC of FunUV was 0.88, significantly greater than that of the other methods; the advantage was most apparent at false positive rates (FPR) below 0.4, where the true positive rate (TPR, i.e. sensitivity) of FunUV increased markedly (Fig. 5A). According to the Youden index (Fig. 5B), the sensitivity and specificity of FunUV were 0.67 and 0.94, respectively, higher than those of the next best method, WEVar (0.63 and 0.81), a recent approach that integrates multiple algorithms within a statistical framework [46]. Moreover, FunUV outperformed all methods in predicting both 5′UTR and 3′UTR variants (Fig. 5C), suggesting the effectiveness of integrating UTR-specific features. Of note, GWAVA performed poorly, likely because its training relied on functional annotations mainly for enhancers and promoters rather than UTR-specific elements [19].
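For readers less familiar with the Youden index, the following sketch shows one common way to derive an operating point from classifier scores: scan candidate thresholds and keep the one that maximizes J = sensitivity + specificity − 1. The labels and scores are toy values, and the function names are hypothetical; this is not the evaluation code used in the study.

```python
def roc_points(labels, scores):
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A variant is called functional when its score >= threshold.
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        points.append((t, tp / pos, 1 - fp / neg))
    return points


def best_youden(labels, scores):
    """Pick the threshold maximizing J = sensitivity + specificity - 1."""
    return max(roc_points(labels, scores), key=lambda p: p[1] + p[2] - 1)


# Toy labels (1 = functional) and scores; not data from the study.
labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
threshold, sensitivity, specificity = best_youden(labels, scores)
```

Applying the same definition to the values reported above gives J = 0.67 + 0.94 − 1 = 0.61 for FunUV and J = 0.63 + 0.81 − 1 = 0.44 for WEVar.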

Performance comparison of FunUV and other methods. (A) ROC curve of FunUV and other methods showing AUC values. (B) Scatter plot of evaluation metrics. The point size depicts the Youden value. (C) Bar plot comparing AUC values between different methods for 5′UTR and 3′UTR. (D) AUC values for FunUV and other models across different data sources. ClinVar and HGMD datasets are evaluated independently. GWAS, FineMap, AD and miRdSNP datasets are combined and labeled as “Other” as they all contain disease-associated SNPs. BBE and MPRA datasets are combined and labeled as “Screening” as they both represent high-throughput experimental data. The AIS dataset is evaluated separately. For all datasets except ClinVar, AIS and Screening, variants from 1KG and gnomAD are used as negative samples in the AUC calculations. (E) Bar plot comparing evaluation metrics on the external validation set. The left side of the 0-axis shows the sensitivity when the specificity is 0.9, and the right side of the 0-axis shows the sensitivity when the specificity is 0.8. The color scale depicts the magnitude of the AUC value. (F) Composition of 50 bp flanking sequences for 16 variants specifically identified by FunUV. Variant information (left) includes the gene name, ref, alt and the coordinate position in the 3′UTR of the gene. Sequence (middle) consists of the 25 bp flanking sequences upstream and downstream of the variant position, with ref/alt in parentheses. Feature types (right) include reduction of TT-count, reduction of AA-count, reduction of AT-count or TA-count and reduction of other elements (CU-element in TULP4_T_G_3UTR_2610 or AU-class in GRAMD1B_A_G_3UTR_1405).

Subsequently, we assessed the performance of the methods across the datasets from various sources in the test set. Although performance varied across datasets, FunUV achieved the best results on all datasets (Fig. 5D). The second best method, FATHMM, also considers features like conservation and GC-content that contributed significantly in our model (Fig. 4A) [18]. Classical algorithms like CADD, FunSeq2, and LINSIGHT rely primarily on evolutionary conservation, explaining their comparable performance [15–17]. Notably, RF, another tree-based model, performed poorly, particularly on screening data, which could potentially be attributed to biases in its training data [20].
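A per-source comparison of this kind can be sketched as follows. The example computes AUC in its rank-based form (the probability that a randomly chosen positive scores higher than a randomly chosen negative) separately for each source of positives, pairing each with a shared pool of negatives. All scores, source groupings and names below are hypothetical placeholders, not values from the study.

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores grouped by data source; the shared negatives stand in
# for population variants used as negative samples.
positives_by_source = {
    "HGMD": [0.9, 0.8, 0.4],
    "Other": [0.7, 0.3],
}
shared_negatives = [0.5, 0.2, 0.1, 0.35]

auc_by_source = {
    source: auc(pos, shared_negatives)
    for source, pos in positives_by_source.items()
}
```

The pairwise loop is quadratic in the number of variants, which is acceptable for a small illustration; a sort-based rank computation would be preferable for large test sets.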

Lastly, we evaluated the performance of FunUV using an independent validation dataset obtained from a newly published study [45]. This study used MPRA to identify hundreds of functional variants among 14,497 3′UTR mutations found in 185 advanced prostate tumors. In this external dataset, FunUV not only achieved the highest AUC but also exhibited a sensitivity that ranged from 1.6 to 2.1 times higher than other methods at a specificity of 0.8 (Fig. 5E). Among the 51 variants uniquely identified by FunUV, 16 caused a reduction in A/T dinucleotide count and had a conservation score of 0 for PCS (Fig. 5F). In particular, the 2610 T > G variant in TULP4 changes a CU-enriched element, resulting in a decreased TT-count, and the 1405 A > G variant in GRAMD1B changes an AU-class element, which is reflected in the reduction of AT-count. These results underscore the power of distinct sequence composition features in identifying functional UTR variants.
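Sensitivity at a fixed specificity, as reported for the external dataset, can be computed along the following lines. This is a minimal sketch under the assumption that variants scoring strictly above a threshold are called functional; the function name and the toy scores are hypothetical and do not reproduce the study's numbers.

```python
def sensitivity_at_specificity(pos_scores, neg_scores, target_specificity):
    """Sensitivity at the lowest threshold whose specificity meets the target.

    Variants scoring strictly above the threshold are called functional, so
    specificity is the fraction of negatives at or below the threshold.
    """
    for t in sorted(set(pos_scores) | set(neg_scores)):
        specificity = sum(1 for s in neg_scores if s <= t) / len(neg_scores)
        if specificity >= target_specificity:
            return sum(1 for s in pos_scores if s > t) / len(pos_scores)
    return 0.0


# Toy scores for nonfunctional (negative) and functional (positive) variants.
neg = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.80]
pos = [0.90, 0.70, 0.50, 0.38, 0.22]

sens_at_80 = sensitivity_at_specificity(pos, neg, 0.8)
sens_at_90 = sensitivity_at_specificity(pos, neg, 0.9)
```

Because the lowest qualifying threshold is used, the returned value is the highest sensitivity attainable at or above the requested specificity, which makes the ratio between two methods at the same specificity directly comparable.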

Discussion

In this work, we developed FunUV, a classification model dedicated to predicting functional UTR variants. For any SNV or small insertion/deletion in a UTR, FunUV provides annotations for more than 50 UTR-related features as well as a predicted functional score to aid the interpretation of clinical variants. FunUV outperformed existing prediction methods, demonstrating the effectiveness of incorporating UTR-specific features.

Our analysis of feature contributions provides insights into the molecular mechanisms underlying UTR variant functionality. We found that conservation remains a key factor in predicting functional UTR variants, consistent with the recognized conservation status of UTRs. Additionally, sequence composition features, particularly in regions with high GC or AT content, contributed significantly to the model. For instance, G to A or T conversions, or C to A or T conversions, in GC-rich areas may have functional consequences by disrupting secondary structures in 5′UTR. These mutations, often resulting from environmental factors like ultraviolet light, oxidative damage, or tobacco exposure [51, 52], may have clinical implications, especially in the context of cancer-related somatic mutations. In 3′UTR, disruptions of continuous A or T sequences also proved to be important, enabling the identification of nonconserved yet functionally impactful variants.

As a supervised learning model, FunUV’s performance depends heavily on the scale and reliability of the labeled data. To ensure our model is trained on high-quality data, we curated a comprehensive set of UTR variants with known and predicted functional impacts from diverse sources. We included pathogenic variants from ClinVar and HGMD for their established links to diseases. Similarly, SNPs associated with diseases and traits in GWAS were selected for their association with functional consequences at the population level. Moreover, we included results from high-throughput experimental assays, as they offer direct evidence for a variant’s functional impact. These include datasets from AIS and MPRA, which measure the effect of variants on gene expression, as well as base editing screens, which assess the consequence of variants on cellular function.

Although we have gathered all available data for FunUV’s training, the limited size of the dataset containing functional UTR variants and the presence of potential false positives may affect its prediction accuracy. Even though we have applied resampling techniques to address the class imbalance, it may still influence the model’s performance and generalizability. Additionally, unlike some deep learning models that train directly on sequence information, the efficacy of FunUV relies on the accuracy and completeness of UTR-specific functional annotations. Therefore, it exhibits higher sensitivity toward functional variants with known UTR features. As the understanding of UTR sequences evolves, integrating new insights can enhance the efficacy of FunUV. While deep learning models like EVE [53] offer promising directions for variant prediction, they currently face challenges with data volume and interpretability. Consequently, we chose a classical machine learning framework for FunUV, prioritizing interpretability and data suitability. As more extensive human variant datasets become available, or as high-throughput assays for genetic sequence analysis improve, we anticipate the exploration of additional modeling strategies.
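To make the class-imbalance point concrete, the sketch below shows random oversampling, one simple resampling strategy in which minority-class examples are duplicated until the classes are balanced. This is an illustrative stand-in only: the specific resampling technique applied in FunUV is not stated in this section, and the function name and toy data are hypothetical.

```python
import random


def random_oversample(features, labels, seed=0):
    """Duplicate rows of smaller classes at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for y, rows in by_class.items():
        # Sample with replacement to fill the gap to the target size.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy imbalanced set: 2 functional (1) vs 6 nonfunctional (0) variants.
X = [[0.9], [0.8], [0.1], [0.2], [0.3], [0.15], [0.25], [0.05]]
y = [1, 1, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = random_oversample(X, y)
```

Resampling of this kind is normally applied to the training split only, so that duplicated variants do not leak into the data used for evaluation.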

In conclusion, FunUV represents a dedicated approach to UTR variant prediction. Leveraging UTR-specific sequence features, it serves as a valuable tool for clinical variant assessment and risk management. In future studies, collecting more validated positive UTR variants will be necessary to further improve the model.

Key Points

We integrate more than 50 UTR-specific features by summarizing current studies on UTR to characterize functional UTR variants.
We train a classification model named FunUV on multi-source datasets; it is a dedicated method for predicting functional UTR variants whose impacts are documented by diverse sources, and it achieves good performance.
Our analysis shows that features like C/G-related sequence composition in 5′UTR and A/T-related sequence composition in 3′UTR can effectively distinguish functional UTR variants from others.

Supplementary Material

Supplementary_Figures_finalsub_bbae248: supplementary_figures_finalsub_bbae248.docx (22.8 MB, docx)

Supplementary_Tables_finalsub_bbae248: supplementary_tables_finalsub_bbae248.xlsx (25.7 MB, xlsx)

Acknowledgements

We thank Jun Cao and all the other members of the Wang laboratory for helpful discussions.

Author Biographies

Guangyu Li is a PhD candidate at the Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College. Her research interests are machine learning and bioinformatics.

Jiayu Wu is a PhD candidate at the Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College. Her research interests are gene editing techniques and regulatory elements in noncoding region.

Xiaoyue Wang is a Professor at the Peking Union Medical College Hospital, Chinese Academy of Medical Sciences. Her research interests include decoding disease genes and pathogenic variants through the use of single-cell multi-omics, gene editing techniques and bioinformatics.

Contributor Information

Guangyu Li, State Key Laboratory of Common Mechanism Research for Major Diseases; Center for bioinformatics, National Infrastructures for Translational Medicine, Institute of Clinical Medicine and Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, 1 Shuai Fu Yuan, Dongcheng District, Beijing 100005, China.

Jiayu Wu, State Key Laboratory of Common Mechanism Research for Major Diseases; Center for bioinformatics, National Infrastructures for Translational Medicine, Institute of Clinical Medicine and Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, 1 Shuai Fu Yuan, Dongcheng District, Beijing 100005, China.

Xiaoyue Wang, State Key Laboratory of Common Mechanism Research for Major Diseases; Center for bioinformatics, National Infrastructures for Translational Medicine, Institute of Clinical Medicine and Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, 1 Shuai Fu Yuan, Dongcheng District, Beijing 100005, China.

Funding

This work was supported by the National Natural Science Foundation of China (grant nos. 32122023 and 32070603 to X.W.) and the National High Level Hospital Clinical Research Funding (2023-PUMCH-E-008 to X.W.).

Data availability

FunUV-related code and the ready-to-use prediction data from this study are available on GitHub at https://github.com/Wangxiaoyue-lab/FunUV.

Authors’ contributions

Xiaoyue Wang, Guangyu Li and Jiayu Wu designed the project.

Guangyu Li analyzed the data.

Guangyu Li and Xiaoyue Wang wrote the manuscript with input from all the authors.

References

1. Schuster SL, Hsieh AC. The untranslated regions of mRNAs in cancer. Trends Cancer 2019;5:245–62.
2. Litterman AJ, Kageyama R, Le Tonqueze, et al. A massively parallel 3′ UTR reporter assay reveals relationships between nucleotide content, sequence conservation, and mRNA destabilization. Genome Res 2019;29:896–906.
3. Meisler MH. Evolutionarily conserved noncoding DNA in the human genome: how much and what for? Genome Res 2001;11:1617–8.
4. Siepel A, Bejerano G, Pedersen JS, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005;15:1034–50.
5. Byeon GW, Cenik ES, Jiang L, et al. Functional and structural basis of extreme conservation in vertebrate 5′ untranslated regions. Nat Genet 2021;53:729–41.
6. Leppek K, Das R, Barna M. Functional 5′ UTR mRNA structures in eukaryotic translation regulation and how to find them. Nat Rev Mol Cell Biol 2018;19:158–74.
7. Mayr C. Regulation by 3′-untranslated regions. Annu Rev Genet 2017;51:171–94.
8. Calvo SE, Pagliarini DJ, Mootha VK. Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans. Proc Natl Acad Sci U S A 2009;106:7507–12.
9. Weingarten-Gabbay S, Elias-Kirma S, Nir R, et al. Comparative genetics. Systematic discovery of cap-independent translation sequences in human and viral genomes. Science 2016;351:351.
10. Yang TH, Wang CY, Tsai HC, Liu CT. Human IRES atlas: an integrative platform for studying IRES-driven translational regulation in humans. Database (Oxford) 2021;2021:baab025.
11. Auweter SD, Oberstrass FC, Allain FH. Sequence-specific binding of single-stranded RNA: is there a code for recognition? Nucleic Acids Res 2006;34:4943–59.
12. Djuranovic S, Nahvi A, Green R. miRNA-mediated gene silencing by translational repression followed by mRNA deadenylation and decay. Science 2012;336:237–40.
13. Liu L, Dilworth D, Gao L, et al. Mutation of the CDKN2A 5′ UTR creates an aberrant initiation codon and predisposes to melanoma. Nat Genet 1999;21:128–32.
14. Graham RR, Kyogoku C, Sigurdsson S, et al. Three functional variants of IFN regulatory factor 5 (IRF5) define risk and protective haplotypes for human lupus. Proc Natl Acad Sci U S A 2007;104:6758–63.
15. Rentzsch P, Witten D, Cooper GM, et al. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res 2019;47:D886–94.
16. Fu Y, Liu Z, Lou S, et al. FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol 2014;15:480.
17. Huang YF, Gulko B, Siepel A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat Genet 2017;49:618–24.
18. Shihab HA, Rogers MF, Gough J, et al. An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics 2015;31:1536–43.
19. Ritchie GR, Dunham I, Zeggini E, Flicek P. Functional annotation of noncoding sequence variants. Nat Methods 2014;11:294–6.
20. Lu Y, Wu Y, Liu Y, et al. Prediction of disease-associated functional variants in noncoding regions through a comprehensive analysis by integrating datasets and features. Hum Mutat 2021;42:667–84.
21. Zhang X, Wakeling M, Ware J, Whiffin N. Annotating high-impact 5′ untranslated region variants with the UTRannotator. Bioinformatics 2021;37:1171–3.
22. Liu Y, Dougherty JD. Utr.annotation: a tool for annotating genomic variants that could influence post-transcriptional regulation. Bioinformatics 2021;37:3926–8.
23. Sample PJ, Wang B, Reid DW, et al. Human 5′ UTR design and variant effect prediction from a massively parallel translation assay. Nat Biotechnol 2019;37:803–9.
24. Griesemer D, Xue JR, Reilly SK, et al. Genome-wide functional screen of 3′UTR variants uncovers causal variants for human disease and evolution. Cell 2021;184:5247–5260.e19.
25. Landrum MJ, Kattman BL. ClinVar at five years: delivering on the promise. Hum Mutat 2018;39:1623–30.
26. Stenson PD, Mort M, Ball EV, et al. The human gene mutation database: building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum Genet 2014;133:1–9.
27. Li MJ, Wang P, Liu X, et al. GWASdb: a database for human genetic variants identified by genome-wide association studies. Nucleic Acids Res 2012;40:D1047–54.
28. 1000 Genomes Project Consortium, Abecasis GR, Auton A, et al. An integrated map of genetic variation from 1,092 human genomes. Nature 2012;491:56–65.
29. Whiffin N, Armean IM, Kleinman A, et al. The effect of LRRK2 loss-of-function variants in humans. Nat Med 2020;26:869–77.
30. Farh KK, et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature 2015;518:337–43.
31. Ghanbari M, Ikram MA, de Looper, et al. Genome-wide identification of microRNA-related variants associated with risk of Alzheimer’s disease. Sci Rep 2016;6:28387.
32. Bruno AE, Li L, Kalabus JL, et al. miRdSNP: a database of disease-associated SNPs and microRNA target sites on 3′UTRs of human genes. BMC Genomics 2012;13:44.
33. Maurano MT, Haugen E, Sandstrom R, et al. Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo. Nat Genet 2015;47:1393–401.
34. Huang C, Li G, Wu J, et al. Identification of pathogenic variants in cancer genes using base editing screens with editing efficiency correction. Genome Biol 2021;22:80.
35. Machiela MJ, Chanock SJ. LDlink: a web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants. Bioinformatics 2015;31:3555–7.
36. Morales J, Pujar S, Loveland JE, et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 2022;604:310–5.
37. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 2010;20:110–21.
38. Hofacker IL. Vienna RNA secondary structure server. Nucleic Acids Res 2003;31:3429–31.
39. Lorenz R, Bernhart SH, Höner Zu Siederdissen C, et al. ViennaRNA package 2.0. Algorithms Mol Biol 2011;6:26.
40. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol 2007;8:R24.
41. Li L, Huang KL, Gao Y, et al. An atlas of alternative polyadenylation quantitative trait loci contributing to complex trait and disease heritability. Nat Genet 2021;53:994–1005.
42. Bakheet T, Frevel M, Williams BR, et al. ARED: human AU-rich element-containing mRNA database reveals an unexpectedly diverse functional repertoire of encoded proteins. Nucleic Acids Res 2001;29:246–54.
43. McGeary SE, Lin KS, Shi CY, et al. The biochemical basis of microRNA targeting efficacy. Science 2019;366(6472):eaav1741.
44. Friedman RC, Farh KK, Burge CB, Bartel DP. Most mammalian mRNAs are conserved targets of microRNAs. Genome Res 2009;19:92–105.
45. Schuster SL, Arora S, Wladyka CL, et al. Multi-level functional genomics reveals molecular and cellular oncogenicity of patient-based 3′ untranslated region mutations. Cell Rep 2023;42:112840.
46. Wang Y, Jiang Y, Yao B, et al. WEVar: a novel statistical learning framework for predicting noncoding regulatory variants. Brief Bioinform 2021;22(6):bbab189.
47. Zhao H, Sun Z, Wang J, et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 2014;30:1006–7.
48. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Adv Neural Inf Process 2017;30:4768–77.
49. Dominguez D, Freese P, Alexis MS, et al. Sequence, structure, and context preferences of human RNA binding proteins. Mol Cell 2018;70:854–867.e9.
50. Mukherjee N, Wessels HH, Lebedeva S, et al. Deciphering human ribonucleoprotein regulatory networks. Nucleic Acids Res 2019;47:570–81.
51. Zhivagui M, Hoda A, Valenzuela N, et al. DNA damage and somatic mutations in mammalian cells after irradiation with a nail polish dryer. Nat Commun 2023;14:276.
52. Alexandrov LB, Ju YS, Haase K, et al. Mutational signatures associated with tobacco smoking in human cancer. Science 2016;354:618–22.
53. Frazer J, Notin P, Dias M, et al. Disease variant prediction with deep generative models of evolutionary data. Nature 2021;599:91–5.


[ref53] 53. Frazer J, Notin P, Dias M, et al. Disease variant prediction with deep generative models of evolutionary data. Nature 2021;599:91–5. [DOI] [PubMed] [Google Scholar]
