2023 Jan 11;3(1):100384. doi: 10.1016/j.crmeth.2022.100384

Table 2.

Transcriptomic-level deep-learning applications

Method name | Year | Main functionalities | Datasets | Model | Species | Tissue/cell types
Promoter/TSS:

CNNProm69 2017 promoter recognition
  • EPD70
    • human
    • mouse
    • Arabidopsis
  • RegulonDB71
    • 839 E. coli promoters
  • DBTBS72
    • 746 B. subtilis promoters
  • 1–2 layer CNN (250 bp input for eukaryotes, 85 bp input for prokaryotes)

human
mouse
Arabidopsis
E. coli
B. subtilis
non-specific
DeeReCT-PromID73 2019 promoter recognition on highly imbalanced dataset
  • 16,455 human promoter sequences from EPD

  • two-branch CNN, one branch with pooling layer, the other without pooling (600 bp input)

  • iteratively enriches hard examples during training

human non-specific
DeeReCT-TSS74 2021 promoter recognition guided by RNA-seq
  • FANTOM539
    • RNA-seq from 10 cell lines
    • CAGE-seq-identified TSSs from 10 cell lines
  • two-branch CNN (1,001 bp input)
    • one branch for sequence and one for RNA-seq base coverage
human multiple (10 cell types)
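
The CNN models in the promoter rows above all take a fixed-length DNA window (250 bp, 600 bp, or 1,001 bp) as input. A minimal sketch of how such a window could be one-hot encoded follows. The function name `one_hot_window`, the A/C/G/T channel order, the all-zero encoding of ambiguous bases, and the centered crop/pad policy are illustrative assumptions, not conventions taken from any of the cited tools.

```python
# Hypothetical helper: channel order and padding policy are assumptions.
BASES = "ACGT"

def one_hot_window(seq, length):
    """Return a length x 4 one-hot matrix for a window centered on `seq`."""
    seq = seq.upper()
    if len(seq) >= length:
        # Crop symmetrically around the center of the sequence.
        start = (len(seq) - length) // 2
        seq = seq[start:start + length]
    else:
        # Pad both sides with N, which is encoded as an all-zero row.
        missing = length - len(seq)
        left = missing // 2
        seq = "N" * left + seq + "N" * (missing - left)
    return [[1 if base == b else 0 for b in BASES] for base in seq]

# Short toy sequence padded to 8 positions.
window = one_hot_window("acgtn", 8)
```

Each row of `window` corresponds to one position; a real model would typically receive this matrix as a tensor with one channel per base.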

Splicing:

Barash et al.75 2010 splicing prediction of cassette exons
  • Microarray profile of 3,665 cassette exons in 27 mouse tissues from Fagnani et al.76

  • 1,014-dim features extracted from flanking sequence of the cassette exon

  • a dedicated probabilistic model to estimate (qinc,qexc,qnc) from microarray profiles

  • a one-layer NN (1,014 dim input) to predict the above probabilities from features of flanking sequence

mouse multiple (27 tissues)
Leung et al.77 2014 splicing prediction of cassette exons
  • RNA-seq profile of 11,019 cassette exons in five mouse tissues from Brawand et al.78

  • 1,393-dim features extracted from flanking sequence of the cassette exon

  • MLP (1,393 dim input), with indices of two tissues to compare

mouse multiple (5 tissues)
Xiong et al.79 2015 splicing prediction of cassette exons in exon triplets
  • Bodymap 2.0 (NCBI accession GEO: GSE30611)
    • 10,689 cassette exons
    • 16 normal tissues
  • 1,393-dim features extracted from flanking sequence of the cassette exon

  • MLP (1,393 dim input)

  • Bayesian MCMC used for learning to avoid overfitting

human multiple (16 tissues)
DARTS80 2019 splicing prediction of cassette exons guided by RNA-seq
  • training: ENCODE8

  • K562 and HepG2 shRNA RBP knockdown datasets

  • testing: Roadmap Epigenomics31 RNA-seq data

  • MLP (2,926 + 1,498 × 2 dim input)
    • 2,926 cis sequence features
    • 1,498 × 2 RBP expression levels
    • BHT integration of deep-learning prediction and RNA-seq evidence
human K562, HepG2
SpliceAI81 2019 splice site prediction from pre-mRNA
  • GENCODE82 v24 isoforms
    • train: 13,384 genes, 130,796 donor-acceptor pairs
    • test: 1,652 genes, 14,289 donor-acceptor pairs
  • GTEx10 (novel isoforms)
    • 67,012 splice donors, 62,911 splice acceptors
  • CNN with dilated convolution62 and residual block (5,000 nt input)
    • dense classification of (no splice site, donor, acceptor)
human non-specific
Pangolin83 2022 splice site prediction from pre-mRNA
  • reference transcripts
    • GENCODE v34 for human transcripts
    • ENSEMBL84 release 100 for rhesus monkey transcripts
    • GENCODE m25 for mouse transcripts
    • ENSEMBL release 101 for rat transcripts
  • RNA-seq data from four tissues (heart, liver, brain, and testis) of human, rhesus monkey, mouse, and rat from Cardoso-Moreira et al.85

  • CNN with dilated convolution and residual blocks (15,000 nt input)

  • predicts per-tissue splicing event

human
rhesus monkey
mouse
rat
multiple (4 tissues in each species)
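
SpliceAI and Pangolin above use dilated convolutions to cover thousands of nucleotides of sequence context. A short sketch of why dilation helps: for stride-1 convolutions stacked in sequence, the receptive field is 1 plus the sum of (kernel size − 1) × dilation over all layers. The layer configurations below are made up for illustration and are not the published architectures.

```python
def receptive_field(layers):
    """Receptive field (in nt) of stacked stride-1 1D convolutions.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    return 1 + sum((k - 1) * d for k, d in layers)

# Hypothetical configurations, chosen only for illustration.
plain = [(11, 1)] * 8                       # eight undilated layers
dilated = [(11, 2 ** i) for i in range(8)]  # dilation doubles each layer

plain_rf = receptive_field(plain)      # 1 + 8 * 10 = 81
dilated_rf = receptive_field(dilated)  # 1 + 10 * (2**8 - 1) = 2551
```

With the same number of layers and parameters, the dilated stack in this toy setting sees about 30 times more context than the undilated one.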

Polyadenylation:

Leung et al., 201886 2018 PAS quantification (pairwise comparison)
  • dataset for PAS reference
    • PolyA_DB 287
    • GENCODE
    • APADB88
    • Derti et al. (polyA-seq data)89
    • Lianoglou et al.90 (3′-seq data)
  • dataset for PAS quantification
    • Lianoglou et al.90 (3′-seq data)

  • two-branch CNN for PAS pairwise comparison

human multiple (7 tissue types)
DeeReCT-PolyA91 2019 PAS recognition
  • Dragon human poly(A) dataset92
    • 14,740 sequences for the 12 main human PAS motif variants
  • Omni human poly(A) dataset93
    • 18,786 positive true PAS sequences for 12 human PAS motif variants
  • Xiao et al.94 3′-READS sequencing of mouse fibroblast cells of C57BL/6J (BL), SPRET/EiJ (SP), and their F1

  • CNN with group normalization95 (200 nt input)

human
mouse
non-specific
APARENT96 2019 PAS quantification
  • ∼3 million APA reporter variants from massively parallel reporter assays (13 libraries)
    • 9 of the 13 libraries (∼2.4 million variants) used for training; the other four held out entirely
  • two-layer CNN (186 nt, a length that all randomized regions of the reporters can fit in)

  • prediction of the inclusion ratio of the proximal PAS in the reporter assay

  • gradient-based forward engineering of PAS sequences

human non-specific
DeeReCT-APA97 2021 PAS quantification
  • Xiao et al.94 3′-READS sequencing of mouse fibroblast cells of C57BL/6J (BL), SPRET/EiJ (SP), and their F1

  • CNN + BiLSTM (448 nt per PAS; variable number of PASs per example)
    • models the interactions between competing PASs
mouse (BL, SP, and BLxSP F1 hybrid) fibroblast
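
Several rows above frame PAS quantification as predicting relative usage among competing sites (for APARENT, the ratio of the proximal PAS in the reporter). A minimal sketch of how such a usage ratio could be computed from read counts is given here. The function name `pas_usage`, the optional pseudocount, and the example counts are assumptions for illustration, not values or procedures from the cited studies.

```python
def pas_usage(counts, pseudocount=0.0):
    """Fraction of reads assigned to each PAS in one gene or reporter.

    `counts` lists read counts per PAS, ordered proximal to distal.
    """
    total = sum(counts) + pseudocount * len(counts)
    if total == 0:
        raise ValueError("no reads support any PAS")
    return [(c + pseudocount) / total for c in counts]

# Made-up counts: proximal site first, then two distal sites.
usage = pas_usage([30, 50, 20])
proximal_ratio = usage[0]
```

The usage values sum to one across the competing sites, which is why models that score each PAS jointly can capture competition that a per-site model cannot.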

RNA subcellular localization:

RNATracker98 2019 subcellular localization prediction
  • mRNA sequences from Ensembl84 2017 release

  • RNA secondary structure predicted with RNAplfold99

  • mRNA subcellular localization profiles
    • CeFra-Seq data from Benoit Bouvrette et al., 2018100
      • cytosol, nuclear, membranes, insoluble
    • APEX-RIP data from Kaewsapsak et al.101
      • ER, mitochondrial, cytosol, nuclear
  • CNN + BiLSTM (∼200 nt to more than 30,000 nt)

human HepG2 and HEK293T cell lines

MicroRNA targets:

MiRTDL102 2015 microRNA target prediction
  • TarBase dataset103
    • 1,297 positive miRNA + mRNA pairs and 309 negative pairs (human, mouse, and rat)
    • dataset further extended by a constraint-relaxing method to 198,620 positive pairs and 19,660 negative pairs
  • CNN prediction based on 20 features of the miRNA and mRNA pair

human
mouse
rat
non-specific

Translation:

Cuperus et al.104 2017 5′ UTR translational efficiency prediction
  • measurements of 489,348 50-nt-long yeast 5′ UTRs in a massively parallel growth selection experiment

  • 3-layer CNN (>50 nt to fit the randomized region)

  • forward engineering of 5′ UTR sequences

yeast N/A

RNA-protein binding:

DeepBind17 2015 sequence-based RNA-protein binding prediction
  • RNAcompete105
    • 207 distinct RBPs from 24 eukaryotes
  • CNN (101 nt input)

multiple (24 eukaryotes) non-specific
NucleicNet106 2019 structure-based RNA-protein binding prediction
  • 483 RNA-protein complexes from PDB107, de-duplicated to 158 ribonucleoprotein structures

  • CNN with ResNet-like architecture

multiple non-specific