---
title: "nORF_Data_Prep"
author: "Matt Neville"
date: "12/05/2020"
output: html_document
---
The following is a document for processing OpenProt (http://www.openprot.org/) ([Brunet et al., 2018](https://doi.org/10.1093/nar/gky936)) and sorfs.org (http://www.sorfs.org/) ([Olexiouk et al., 2018](https://doi.org/10.1093/nar/gkx1130)) into a human novel open reading frame (nORF) dataset.
Command line steps are given in bash blocks with eval = F so will not run by default. These are provided for replication purposes, exact commands needed may differ by machine.
# 1. Setup and Downloads
## 1a. Load required libraries
```{r Libraries, include=FALSE}
library(GenomicRanges)
library(rtracklayer)
library(GenomicFeatures)
library(Biostrings)
library(tidyverse)
```
## 1b. Download OpenProt entries
The code below downloads and names `.tsv`, `.bed`, and `.fasta` files with the following parameters from OpenProt (http://www.openprot.org/p/download):
* Release: 1.3
* Species: Homo Sapiens
* Assembly: GRCg38.p5
* Protein Type: AltProts and Isoforms
* Annotation: Ensembl (GRCh38.83)
* Supporting Evidence: All predicted
```{bash OpenProt_files, eval = F}
#TSV file
wget http://www.openprot.org/download/files/1.3/human-openprot-r1_3-altprots+isoforms-grch38.83.tsv.zip
unzip human-openprot-r1_3-altprots+isoforms-grch38.83.tsv.zip
mv human-openprot-r1_3-altprots+isoforms-grch38.83.tsv dataFiles/ensemblOpenProtAllPredicted.tsv
#BED file
wget http://www.openprot.org/download/files/1.3/human-openprot-r1_3-altprots+isoforms-grch38.83.bed.zip
unzip human-openprot-r1_3-altprots+isoforms-grch38.83.bed.zip
mv human-openprot-r1_3-altprots+isoforms-grch38.83.bed dataFiles/ensemblOpenProtAllPredicted_38.bed
#Protein FASTA file
wget http://www.openprot.org/download/files/1.3/human-openprot-r1_3-altprots+isoforms-grch38.83.fasta.zip
unzip human-openprot-r1_3-altprots+isoforms-grch38.83.fasta.zip
mv human-openprot-r1_3-altprots+isoforms-grch38.83.fasta dataFiles/ensemblOpenProtAllPredicted_38.fasta
#Clean up
rm human-openprot-r1_3-altprots+isoforms-grch38.83.*
```
## 1c: Download sorfs.org entries
The code below downloads files with the following parameters from sorfs.org (http://www.sorfs.org/BioMart):
* Database: Homo Sapiens
* Floss Classification: Good, Extreme
* Main Attributes: Sorf ID, Chromosome, Sorf Start, Sorf End, Strand, Spliced Start Parts, Spliced Stop Parts, Start Codon, Sorf Length, AA sequence, Transcript sequence, Biotype, Annotation, Ensembl Transcript ID
```{bash sorfs_file, eval = F}
# Main sorfs download used in norfs.org dataset
wget -O dataFiles/sorfsDownload.txt 'http://biomart.biobix.be/martservice/results?query='
# ** Alternative download not part of main pipeline**
# Optionally, the code below downloads a similar file for sorfs.org entries that also have mass spec evidence from their PRIDE ReSpin pipeline. This can be processed in 2b. with processDataset(dataset = "sorfs_MS")
# wget -O dataFiles/sorfsDownload_MS.txt 'http://biomart.biobix.be/martservice/results?query='
```
## 1d. Download ensembl GFF file for annotation
This code downloads the ensembl `.gff` file for classifying the novel ORFs:
```{bash ensembl_file, eval = F}
#Download and unzip GFF3 file
wget -O dataFiles/Homo_sapiens.GRCh38.96.gff3.gz ftp://ftp.ensembl.org/pub/release-96/gff3/homo_sapiens/Homo_sapiens.GRCh38.96.gff3.gz
gunzip dataFiles/Homo_sapiens.GRCh38.96.gff3.gz
```
# 2. Process Files
The following section will create custom BED and GTF files for the nORF dataset.
Specifically, this script will:
* A) Extract sorfs.org entries and OpenProt entries labelled as 'AltProt' with mass spec AND/OR ribo-seq evidence. These are the entries that qualify as novel ORFs.
* B) Remove duplicate entries based on matching genomics coordinates
* C) Output subsetted `.gtf` and `.bed` files based on input argument: openprot, openprot2pep, sorfs, sorfsMS, all
## 2a. Processing Functions
These are functions used during part 2b
```{r Proccessing_functions}
#Functions for processing sorfs and openProt datasets
processOpenProt <- function(annotation) {
#Read in BED, FASTA, and TSV files based on ensembl or refseq
if (annotation == "ensembl") {
bedFile <- read_tsv("dataFiles/ensemblOpenProtAllPredicted_38.bed",
col_names = c("chrom", "chromStart", "chromStop", "name", "score",
"strand", "thickStart", "thickEnd", "itemRgb", "blockCount","blockSizes", "blockStarts"),
col_types = "ciiciciicicc") %>%
#Remove a bugged entry
filter(name != "IP_296985" | chrom != "chrY")
tsvFile <- read_tsv("dataFiles/ensemblOpenProtAllPredicted.tsv", skip = 1, col_types = "ccciddicciicccciiddcccd")
proteinFasta <- readAAStringSet("dataFiles/ensemblOpenProtAllPredicted_38.fasta")
} else if (annotation == "refseq") {
bedFile <- read_tsv("dataFiles/refseqOpenProtAllPredicted_38.bed",
col_names = c("chrom", "chromStart", "chromStop", "name", "score",
"strand", "thickStart", "thickEnd", "itemRgb", "blockCount","blockSizes", "blockStarts"),
col_types = "ciiciciicicc")
tsvFile <- read_tsv("dataFiles/refseqOpenProtAllPredicted.tsv", skip = 1, col_types = "ccciddicciicccciiddcccd")
proteinFasta <- readAAStringSet("dataFiles/refseqOpenProtAllPredicted_38.fasta")
}
#Get IDs for subsets of interest
#Also filters out some altProts that are isoforms on other transcripts and therefore unlikely to be true nORFs
IDaltEvidence <- tsvFile %>%
filter(`protein type` == "AltProt" & (`MS score` != 0 | `TE score` != 0)) %>%
filter(!grepl('II', `protein accession (others)`)) %>%
dplyr::select(`protein accession numbers`) %>%
distinct()
#Dataframe with AA seq
dfp <- data.frame(str_split(names(proteinFasta), "\\|", simplify = T)[,1], paste(proteinFasta))
colnames(dfp) <- c("name", "aaSeq")
#Create table formatted consistently with sorfs (effectively bed12 format + extra info)
openProtTable <- bedFile %>%
filter(name %in% IDaltEvidence$`protein accession numbers`) %>%
mutate(source = "openprot.org") %>%
left_join(dfp, by = "name") %>%
mutate(startCodon = "ATG")
openProtTable <- openProtTable %>%
mutate(length = str_length(aaSeq))
return(openProtTable)
}
processSorfs <- function(dataset = "sorfs") {
#Read in TXT file
if (dataset == "sorfsMS") {
rawFile <- read_tsv("dataFiles/sorfsDownload_MS.txt", col_types = 'cciiccciccccccc')
} else {
rawFile <- read_tsv("dataFiles/sorfsDownload.txt", col_types = 'cciicccicccccc')
}
#Remove duplicates based on identical Chr, start, stop, strand, AND AAseq
rawFile <- rawFile %>%
distinct(Chromosome, `Sorf start`, `Sorf end`, Strand, `AA-sequence` , .keep_all = TRUE) %>%
filter(`Sorf start` >= 0) %>%
mutate(`AA-sequence` = str_remove(`AA-sequence`, "\\*"))
#Convert splice info from sorfs.org into usable bed12 format
#First deal with simple no splice cases
noSplice <- rawFile %>%
filter(is.na(`Spliced start parts`)) %>%
mutate(blockCount = "1",
blockSizes = `Sorf end` - (`Sorf start` - 1),
blockStarts = "0")
#Cases with splicing
spliced <- rawFile %>%
filter(!is.na(`Spliced start parts`))
blockCounter <- function(sorf) {
blockCount <- length(str_split(sorf[6], "_", simplify = T))
return(blockCount)
}
blockSizer <- function(sorf) {
starts <- as.numeric(str_split(sorf[6], "_", simplify = T)) - as.numeric(sorf[3]) - 1
ends <- as.numeric(str_split(sorf[7], "_", simplify = T)) - as.numeric(sorf[3])
blockSizes <- paste((ends - starts), collapse = ",")
return(blockSizes)
}
blockStarter <- function(sorf) {
starts <- as.numeric(str_split(sorf[6], "_", simplify = T)) - as.numeric(sorf[3])
blockStarts <- paste(starts, collapse = ",")
return(blockStarts)
}
spliced <- spliced %>%
mutate(blockCount = apply(spliced, 1, blockCounter),
blockSizes = apply(spliced, 1, blockSizer),
blockStarts = apply(spliced, 1, blockStarter))
#Join back together
sorfsSpliceFormatted <- rbind(noSplice, spliced)
#Deal with similar sorfs by selecting longest when having the same end site + splice count
sorfsPlus <- sorfsSpliceFormatted %>%
filter(Strand == '1') %>%
arrange(Chromosome,`Sorf end`, `Sorf start`) %>%
distinct(Chromosome,`Sorf end`, blockCount, .keep_all = T)
sorfsMinus <- sorfsSpliceFormatted %>%
filter(Strand == '-1') %>%
arrange(Chromosome, `Sorf start`, -`Sorf end`) %>%
distinct(Chromosome, `Sorf start`, blockCount, .keep_all = T)
sorfsJoined <- bind_rows(sorfsPlus, sorfsMinus) %>%
arrange(Chromosome, `Sorf start`, `Sorf end`) %>%
#Remove a set of ~100 incorrectly annotated splice sites
filter(substr(blockStarts, 1, 1) == "0")
#Create table formatted consistently with openProt (effectively bed12 format + extra info)
sorfsTable <- tibble(
chrom = paste0("chr", sorfsJoined$Chromosome),
chromStart = sorfsJoined$`Sorf start` - 1,
chromStop = sorfsJoined$`Sorf end`,
name = gsub(":", "_", sorfsJoined$`Sorf ID`),
score = 0,
strand = gsub("1", "+", gsub("-1", "-", sorfsJoined$Strand)),
thickStart = sorfsJoined$`Sorf start` - 1,
thickEnd = sorfsJoined$`Sorf end`,
itemRgb = "0,0,0",
blockCount = as.integer(sorfsJoined$blockCount),
blockSizes = sorfsJoined$blockSizes,
blockStarts = sorfsJoined$blockStarts,
source = "sorfs.org",
aaSeq = sorfsJoined$`AA-sequence`,
startCodon = sorfsJoined$`Start codon`,
length = sorfsJoined$`Sorf length`)
return(sorfsTable)
}
processPseudogenes <- function() {
bedFile <- read_tsv("dataFiles/pseudogenes.bed", col_names = c("chrom", "chromStart", "chromStop", "name", "score", "strand",
"thickStart", "thickEnd", "itemRgb", "blockCount","blockSizes",
"blockStarts"),
col_types = "ciiciciicicc")
pseudogeneTable <- bedFile %>%
mutate(source = "pseudogenes_Xu2016") %>%
mutate(aaSeq = NA) %>%
mutate(startCodon = 'ATG') %>%
mutate(length = NA)
return(pseudogeneTable)
}
combineNovelORFs <- function(sorfs,openprot,pseudogenes) {
#Bind and remove duplcates between datasets
merged <- bind_rows(sorfs,openprot,pseudogenes) %>%
arrange(chrom, chromStart, chromStop, source) %>%
distinct(chrom, chromStart, chromStop, aaSeq, .keep_all = T)
#Deal with similar orfs by selecting longest when having the same end site + splicing
plusStrand <- merged %>%
filter(strand == '+') %>%
arrange(chrom, chromStop, chromStart, source) %>%
distinct(chrom, chromStop, blockCount, .keep_all = T)
minusStrand <- merged %>%
filter(strand == '-') %>%
arrange(chrom, chromStart, -chromStop) %>%
distinct(chrom, chromStart, blockCount, .keep_all = T)
strandsJoined <- bind_rows(plusStrand, minusStrand) %>%
arrange(chrom, chromStart, chromStop)
return(strandsJoined)
}
addIDs <- function(novelORFtable) {
idKey <- readRDS("dataFiles/nORFsDB1.3.rds") %>%
mutate(mergeKey = str_c(start, end, name)) %>%
dplyr::select(mergeKey, id)
novelORFtableMerge <- novelORFtable %>%
mutate(mergeKey = str_c((chromStart +1),chromStop, name)) %>%
left_join(idKey, by = "mergeKey") %>%
mutate(name = ifelse(str_detect(name, "psH"), name, id)) %>%
dplyr::select(-id, mergeKey) %>%
filter(!(is.na(name)))
return(novelORFtableMerge)
}
createBed12 <- function(novelORFtable) {
return(novelORFtable[,1:12])
}
createGTF <- function(novelORFtable) {
gtfFile <- tibble(
seqname = gsub("chr", "", novelORFtable$chrom),
source = novelORFtable$source,
feature = "gene",
start = novelORFtable$chromStart + 1,
end = novelORFtable$chromStop,
score = ".",
strand = novelORFtable$strand,
frame = ".",
attributes = paste0('gene_id "', novelORFtable$name,'"; AA_seq "', novelORFtable$aaSeq,
'"; start_codon "', novelORFtable$startCodon, '"; sorf_length "', novelORFtable$length,'";' ))
return(gtfFile)
}
```
## 2b. Processing Code
Here we show the code for the whole norfs dataset but could change to specifically OpenProt or sorfs at this point
```{r proccssingCode}
#Function calls appropriate functions from 'processingFunctions.R' based on supplied argument
processDataset <- function(dataset) {
if (dataset == "openprotEnsembl") {
openprot <- processOpenProt(annotation = "ensembl")
return(openprot)
} else if (dataset == "openprotRefseq") {
openprotRefseq <- processOpenProt(annotation = "refseq")
return(openprotRefseq)
} else if (dataset == "sorfs") {
sorfs <- processSorfs()
return(sorfs)
} else if (dataset == "pseudogenes") {
pseudogenes <- processPseudogenes()
return(pseudogenes)
} else if (dataset == "sorfsMS") {
sorfsMS <- processSorfs(dataset = "sorfsMS")
return(sorfsMS)
} else if (dataset == "all") {
openprot <- processOpenProt(annotation = "ensembl")
sorfs <- processSorfs()
pseudogenes <- processPseudogenes()
combined <- combineNovelORFs(sorfs,openprot,pseudogenes)
return(combined)
}}
novelORFtableOriginal <- processDataset(dataset = "all")
novelORFtableNamed <- addIDs(novelORFtableOriginal)
#Convert to bed12 and write out
novelORFbed <- createBed12(novelORFtableNamed)
write_tsv(novelORFbed, path = paste0("all","_38.bed"), col_names = F)
#Convert to gtf and write out
novelORFgtf <- createGTF(novelORFtableNamed)
write.table(novelORFgtf, paste0("all","_38.gtf"), col.names = F, row.names = F, sep = "\t", quote = F)
```
# 3. Classify Entries
This section classifies the combined OpenProt and sorfs.org entries by their transcript type/relation to annotated protein coding regions.
## 3a. Convert from bed 12 format to bed 6
**Note**: This code requires bedtools to be installed and in your PATH.
```{bash bed12tobed6, eval = F}
#Create bed6 file
bedtools bed12tobed6 -i all_38.bed > dataFiles/all_38.6.bed
```
## 3b. Classifying Functions
```{r classifyingFunctions}
# 3 Functions from old version of R package Guitar that covert bed12 directly to granges
BED12toGRangesList <- function(filepath, header=FALSE) {
# message
print("Converting BED12 to GRangesList")
print("It may take a few minutes")
# read bed file
a = read.table(filepath,sep="\t",header=header,stringsAsFactors =FALSE)
# mcols_info =a[,13:length(a[1,])]
a = a[,1:12]
# get transcripts
no_tx = length(a[,1])
tx_id = 1:no_tx;
tx_name = paste("line_",1:no_tx,sep="")
tx_chrom = a[,1]
tx_strand = a[,6]
tx_start = a[,2]+1
tx_end = a[,3]
transcripts= data.frame(tx_id,tx_name,tx_chrom,tx_strand,tx_start,tx_end)
head(transcripts)
# get genes
tx_name = tx_name
gene_id = as.character(a[,4])
gene_id[is.na(gene_id)]="NA"
gene=data.frame(tx_name,gene_id)
#
splicing <- lapply(1:no_tx, .spliceSingleTrans, a=a, tx_start=tx_start)
splicing <- .combineListOfSplicing(splicing)
# make txdb
peaks = suppressWarnings(
makeTxDb(transcripts=transcripts,
splicings=splicing,
genes=gene))
# generate GRangesList
tx <- exonsBy(peaks, "tx",use.names=TRUE)
mcols(tx) <- a
return(tx)
}
.combineListOfSplicing <- function(t){
a <- paste("t[[",1:length(t),"]]", sep="")
a <- paste(a,collapse =",")
a <- paste("rbind(",a,")",sep="")
c <- parse(text=a)
b <- suppressWarnings(eval(c))
return(b)
}
.spliceSingleTrans <- function(i,a,tx_start) {
tx = a[i,]
tx_id = i
exon_rank=1:as.integer(tx[10])
# get start
temp = as.integer(strsplit(as.character(tx[12]), ",")[[1]]) + tx_start[i]
exon_start=temp
# get end
temp = as.integer(strsplit(as.character(tx[11]), ",")[[1]])
temp2 = temp + exon_start - 1
exon_end=temp2
# get CDS
cds_start = exon_start
cds_end = exon_end
# get data frame
splicing_tx = data.frame(tx_id,exon_rank,exon_start,exon_end,cds_start,cds_end)
return(splicing_tx)
}
generateTranscriptPairs <- function(transcriptFeature, class, annotationTibble) {
#Create subsetted Granges objects
nORF_List <- annotationTibble %>%
filter(transcriptClass == class)
novelORFs <- novelORFs[novelORFs@elementMetadata@listData$V4 %in% nORF_List$novelORF_ID]
#Find any and full overlaps
anyoverlap <- findOverlaps(novelORFs, transcriptFeature, type = "any")
transcriptPairsAnnotation <- tibble(novelORF_ID = novelORFs@elementMetadata@listData$V4[anyoverlap@from],
transcript_ID = transcriptFeature@partitioning@NAMES[anyoverlap@to],
transcriptClass = class) %>%
left_join(transcriptTypes, by = "transcript_ID") %>%
mutate(matchCompare = str_c(novelORF_ID, transcript_ID))
#Check which ones are also full overlaps
fulloverlap <- findOverlaps(novelORFs, transcriptFeature, type = "within")
transcriptFullAnnotation <- tibble(novelORF_ID = novelORFs@elementMetadata@listData$V4[fulloverlap@from],
transcript_ID = transcriptFeature@partitioning@NAMES[fulloverlap@to],
transcriptClass = class) %>%
mutate(matchCompare = str_c(novelORF_ID, transcript_ID))
#Label any matches that are also full matches
transcriptPairsAnnotation <- transcriptPairsAnnotation %>%
mutate(isFull = ifelse(transcriptPairsAnnotation$matchCompare %in% transcriptFullAnnotation$matchCompare, T, F)) %>%
dplyr::select(-matchCompare)
return(transcriptPairsAnnotation)
}
prioritizeNonCoding <- function(annotationPairs, ranks) {
#Label the priority rank of each annotation
annotationPairs <- annotationPairs %>%
left_join(ranks, by = "transcriptBiotype")
#Sort by full overlaps first and then highest priority rank, then take first top result for each nORF
annotationPairs <- annotationPairs %>%
arrange(novelORF_ID, -isFull, rank, transcript_ID) %>%
distinct(novelORF_ID, .keep_all = T) %>%
dplyr::select(-isFull, -rank)
return(annotationPairs)
}
prioritizeCodingIntrons <- function(annotationPairs) {
#Sort by full overlaps, then take first top result for each nORF
annotationPairs <- annotationPairs %>%
arrange(novelORF_ID, -isFull, transcript_ID) %>%
distinct(novelORF_ID, .keep_all = T) %>%
dplyr::select(-isFull)
return(annotationPairs)
}
prioritizeCoding <- function(annotationPairs, ranks) {
#Label the priority rank of each annotation
annotationPairs <- annotationPairs %>%
left_join(ranks, by = "ORFannotation")
#Sort by full overlaps first and then highest priority rank, then take first top result for each nORF
annotationPairs <- annotationPairs %>%
arrange(novelORF_ID, rank, transcript_ID) %>%
distinct(novelORF_ID, .keep_all = T) %>%
dplyr::select(-rank)
return(annotationPairs)
}
classifyProteinCoding <- function(annotationPairs, txdb) {
#Create features
introns <- intronsByTranscript(txdb, use.names = T)
cds <- cdsBy(txdb, "tx", use.names = T)
utr5 <- fiveUTRsByTranscript(txdb, use.names = T)
utr3 <- threeUTRsByTranscript(txdb, use.names = T)
annotOverlap <- function(novelORFs, annot, type) {
overlap <- findOverlaps(novelORFs, annot, type = type)
pairsTibble <- tibble(novelORF_ID = novelORFs@elementMetadata@listData$V4[overlap@from],
transcript_ID = annot@partitioning@NAMES[overlap@to]) %>%
mutate(matchCompare = str_c(novelORF_ID, transcript_ID))
}
#Find partial and full overlaps with features
CDSany <- annotOverlap(novelORFs = novelORFs, annot = cds, type = "any")
CDSfull <- annotOverlap(novelORFs = novelORFs, annot = cds, type = "within")
UTR5any <- annotOverlap(novelORFs = novelORFs, annot = utr5, type = "any")
UTR5full <- annotOverlap(novelORFs = novelORFs, annot = utr5, type = "within")
UTR3any <- annotOverlap(novelORFs = novelORFs, annot = utr3, type = "any")
UTR3full <- annotOverlap(novelORFs = novelORFs, annot = utr3, type = "within")
INTRONany <- annotOverlap(novelORFs = novelORFs, annot = introns, type = "any")
#Get comparable column
annotationPairs <- annotationPairs %>%
mutate(matchCompare = str_c(novelORF_ID, transcript_ID))
#Add feature overlaps for each transcript
annotationPairs <- annotationPairs %>%
mutate(CDSany = ifelse(annotationPairs$matchCompare %in% CDSany$matchCompare, T, F)) %>%
mutate(CDSfull = ifelse(annotationPairs$matchCompare %in% CDSfull$matchCompare, T, F)) %>%
mutate(UTR5any = ifelse(annotationPairs$matchCompare %in% UTR5any$matchCompare, T, F)) %>%
mutate(UTR5full = ifelse(annotationPairs$matchCompare %in% UTR5full$matchCompare, T, F)) %>%
mutate(UTR3any = ifelse(annotationPairs$matchCompare %in% UTR3any$matchCompare, T, F)) %>%
mutate(UTR3full = ifelse(annotationPairs$matchCompare %in% UTR3full$matchCompare, T, F)) %>%
mutate(INTRONany = ifelse(annotationPairs$matchCompare %in% INTRONany$matchCompare, T, F)) %>%
mutate(ORFannotation = "None") %>%
dplyr::select(-matchCompare)
#Use feature overlaps to classify each norf-transcript pair
annotationPairs <- annotationPairs %>%
mutate(ORFannotation = ifelse(CDSfull, "cds", ORFannotation)) %>%
mutate(ORFannotation = ifelse(CDSany & UTR5any, "utr5-cds", ORFannotation)) %>%
mutate(ORFannotation = ifelse(CDSany & UTR3any, "cds-utr3", ORFannotation)) %>%
mutate(ORFannotation = ifelse(UTR5full, "utr5", ORFannotation)) %>%
mutate(ORFannotation = ifelse(UTR3full, "utr3", ORFannotation)) %>%
mutate(ORFannotation = ifelse(CDSany & INTRONany & !CDSfull, "cds-intronic", ORFannotation)) %>%
mutate(ORFannotation = ifelse(UTR5any & INTRONany, "utr5-intronic", ORFannotation)) %>%
mutate(ORFannotation = ifelse(UTR3any & INTRONany, "utr3-intronic", ORFannotation)) %>%
mutate(ORFannotation = ifelse(CDSany & !CDSfull & !UTR3any & !UTR5any & !INTRONany, "cds-intergenic", ORFannotation)) %>%
mutate(ORFannotation = ifelse(UTR5any & !UTR5full & !CDSany & !INTRONany, "utr5-intergenic", ORFannotation)) %>%
mutate(ORFannotation = ifelse(UTR3any & !UTR3full & !CDSany & !INTRONany, "utr3-intergenic", ORFannotation)) %>%
dplyr::select(novelORF_ID, transcript_ID, transcriptClass, ORFannotation) %>%
mutate(transcriptBiotype = "protein_coding")
return(annotationPairs)
}
groupClasses <- function() {
#1. Within a protein coding exon
codingOverlaps <- findOverlaps(novelORFs, proteinCodingExons, type = "within")
codingOverlapsIDs <- novelORFs@elementMetadata@listData$V4[unique(codingOverlaps@from)]
#2. Within a non coding exon
nonCodingOverlaps <- findOverlaps(novelORFs, nonCodingExons, type = "within")
nonCodingOverlapsIDs <- novelORFs@elementMetadata@listData$V4[unique(nonCodingOverlaps@from)]
#3. Partial overlap of protein coding exon
partialCodingOverlaps <- findOverlaps(novelORFs, proteinCodingExons, type = "any")
partialCodingOverlapsIDs <- novelORFs@elementMetadata@listData$V4[unique(partialCodingOverlaps@from)]
#4. Partial overlap of non coding exon
partialnonCodingOverlaps <- findOverlaps(novelORFs, nonCodingExons, type = "any")
partialnonCodingOverlapsIDs <- novelORFs@elementMetadata@listData$V4[unique(partialnonCodingOverlaps@from)]
#5. Within an intron of protein coding transcript
codingIntronsOverlaps <- findOverlaps(novelORFs, proteinCodingIntrons, type = "any")
codingIntronsOverlapsIDs <- novelORFs@elementMetadata@listData$V4[unique(codingIntronsOverlaps@from)]
#6. Within an intron of non-coding transcript
nonCodingIntronsOverlaps <- findOverlaps(novelORFs, nonCodingIntrons, type = "any")
nonCodingIntronsOverlapsIDs <- novelORFs@elementMetadata@listData$V4[unique(nonCodingIntronsOverlaps@from)]
#Create tibble for annotating each novel ORF
annotationMaster <- tibble(novelORF_ID = novelORFs@elementMetadata@listData$V4,
transcript_ID = NA,
transcriptClass = NA,
transcriptBiotype = NA,
ORFannotation = NA) %>%
#Defaults intergenic
mutate(transcriptClass = "intergenic") %>%
#Overwrite if non-coding intronic
mutate(transcriptClass = ifelse(novelORF_ID %in% nonCodingIntronsOverlapsIDs, "intronic_ncTranscript", transcriptClass)) %>%
#Overwrite if coding intronic
mutate(transcriptClass = ifelse(novelORF_ID %in% codingIntronsOverlapsIDs, "intronic_codingTranscript", transcriptClass)) %>%
#Overwrite if partial non-coding transcript exonic
mutate(transcriptClass = ifelse(novelORF_ID %in% partialnonCodingOverlapsIDs, "non_coding", transcriptClass)) %>%
#Overwrite if partial coding transcript exonic
mutate(transcriptClass = ifelse(novelORF_ID %in% partialCodingOverlapsIDs, "protein_coding", transcriptClass)) %>%
#Overwrite if full non-coding transcript exonic
mutate(transcriptClass = ifelse(novelORF_ID %in% nonCodingOverlapsIDs, "non_coding", transcriptClass)) %>%
#Overwrite if full coding transcript exonic
mutate(transcriptClass = ifelse(novelORF_ID %in% codingOverlapsIDs, "protein_coding", transcriptClass))
return(annotationMaster)
}
classify_nORFs <- function(annotationTibble, txdbInput) {
annotationTibble2 <- annotationTibble %>%
mutate(ORFannotation = ifelse(transcriptClass == "intergenic", "intergenic", ORFannotation))
ncIntronsPairs <- generateTranscriptPairs(transcriptFeature = nonCodingIntrons, class = "intronic_ncTranscript", annotationTibble = annotationTibble2)
annotation_ncIntrons <- prioritizeNonCoding(annotationPairs = ncIntronsPairs, ranks = ncAnnotationRank) %>%
mutate(ORFannotation = "intronic")
codingIntronsPairs <- generateTranscriptPairs(transcriptFeature = proteinCodingIntrons, class = "intronic_codingTranscript", annotationTibble = annotationTibble2)
annotation_codingIntrons <- prioritizeCodingIntrons(annotationPairs = codingIntronsPairs) %>%
mutate(ORFannotation = "intronic")
nonCodingPairs <- generateTranscriptPairs(transcriptFeature = nonCodingExons, class = "non_coding", annotationTibble = annotationTibble2)
annotation_nonCoding <- prioritizeNonCoding(annotationPairs = nonCodingPairs, ranks = ncAnnotationRank) %>%
#Annotate as ncRNA, unless it is pseudogenic or antisense
mutate(ORFannotation = "ncRNA") %>%
mutate(ORFannotation = ifelse(transcriptBiotype == "antisense", "antisense", ORFannotation)) %>%
mutate(ORFannotation = ifelse(transcriptBiotype == "retained_intron", "retained_intron", ORFannotation)) %>%
mutate(ORFannotation = ifelse(transcriptBiotype == "bidirectional_promoter_lncRNA", "bidirectional_promoter_lncRNA", ORFannotation)) %>%
mutate(ORFannotation = ifelse(transcriptBiotype == "nonsense_mediated_decay", "nmd", ORFannotation)) %>%
mutate(ORFannotation = ifelse(str_detect(transcriptBiotype, 'pseudogene'), "pseudogene", ORFannotation))
proteinCodingPairs <- generateTranscriptPairs(transcriptFeature = proteinCodingExons, class = "protein_coding", annotationTibble = annotationTibble2)
proteinCodingPairsDetailed <- classifyProteinCoding(annotationPairs = proteinCodingPairs, txdb = txdbInput)
annotation_coding <- prioritizeCoding(annotationPairs = proteinCodingPairsDetailed, ranks = proteinCodingAnnotationRank)
annotationCombined <- annotationTibble2 %>%
filter(transcriptClass == "intergenic") %>%
bind_rows(annotation_ncIntrons) %>%
bind_rows(annotation_codingIntrons) %>%
bind_rows(annotation_nonCoding) %>%
bind_rows(annotation_coding)
return(annotationCombined)
}
#Get IDs of all nORFs in frame with a coding exon
generateInFrameIDs <- function(gffFile, bed6File, txdb) {
gff0 = read_tsv(gffFile, comment = "#", col_names = F, col_types = 'ccccccccc')
gff <- gff0 %>%
filter(X3 == "exon") %>%
separate(X9, sep = ";", c("a", "b","c","d","e","f","g","h")) %>%
dplyr::select(b,e,d, X7)
colnames(gff) <- c("exon_ID", "start_phase", "end_phase", "strand")
gff <- gff %>%
mutate(exon_ID = str_remove(exon_ID, "Name=")) %>%
mutate(end_phase = str_remove(end_phase, "ensembl_end_phase=")) %>%
mutate(start_phase = str_remove(start_phase, "ensembl_phase="))
bed0 = read_tsv(bed6File, col_names = F, col_types = 'ciiccc')
bedFirstExonPlus <- bed0 %>%
filter(X6 == "+") %>%
arrange(X1, X2) %>%
distinct(X1, X4, .keep_all = T)
bedFirstExonMinus <- bed0 %>%
filter(X6 == "-") %>%
arrange(X1, -X3) %>%
distinct(X1,X4, .keep_all = T)
bedFirstExon <- bedFirstExonPlus %>%
bind_rows(bedFirstExonMinus) %>%
arrange(X1, X2)
write_tsv(bedFirstExon, "dataFiles/all_38_firstExon.6.bed", col_names = F)
exonsAll <- exons(txdb, use.names=T)
novelORFsExons <- import.bed("dataFiles/all_38_firstExon.6.bed")
novelORFsExons <- renameSeqlevels(novelORFsExons, gsub("chr","", seqlevels(novelORFsExons)))
inFrame <- findOverlaps(novelORFsExons, exonsAll, type = "any")
inFrameTibble <- tibble(norf_ID = novelORFsExons@elementMetadata@listData$name[inFrame@from],
norfStart = novelORFsExons@ranges@start[inFrame@from],
norfEnd = novelORFsExons@ranges@start[inFrame@from] + novelORFsExons@ranges@width[inFrame@from] - 1,
exon_ID = exonsAll@ranges@NAMES[inFrame@to],
exonStart = exonsAll@ranges@start[inFrame@to],
exonEnd = exonsAll@ranges@start[inFrame@to] + exonsAll@ranges@width[inFrame@to] - 1) %>%
left_join(gff, by = "exon_ID") %>%
filter(end_phase != "-1" | start_phase != "-1")
#Add reading frames for + strand
inFrameNorfsPlus <- inFrameTibble %>%
filter(strand == "+") %>%
mutate(norfFrame = norfStart %% 3) %>%
#If exon has start_phase -1 then shift based on end phase
#Else shift based on start phase
mutate(exonFrame = ifelse(start_phase == "-1", (exonEnd - as.integer(end_phase) + 1) %% 3, (exonStart - as.integer(start_phase)) %% 3)) %>%
filter(exonFrame == norfFrame) %>%
distinct(norf_ID)
inFrameNorfsMinus <- inFrameTibble %>%
filter(strand == "-") %>%
mutate(norfFrame = norfEnd %% 3) %>%
#If exon has end_phase -1 then shift based on start phase
#Else shift based on end phase
mutate(exonFrame = ifelse(start_phase == "-1", (exonStart + as.integer(end_phase) - 1) %% 3, (exonEnd + as.integer(start_phase)) %% 3)) %>%
filter(exonFrame == norfFrame) %>%
distinct(norf_ID)
inFrameNorf_IDs <- inFrameNorfsPlus %>%
bind_rows(inFrameNorfsMinus)
cds <- cds(human)
cdsOverlap <- findOverlaps(novelORFs, cds)
cdsNovelORFs <- novelORFs@elementMetadata@listData$V4[unique(cdsOverlap@from)]
#Choose only inFrameNorfs that actually overlap cds
inFrameNorf_IDs <- inFrameNorf_IDs %>%
filter(norf_ID %in% cdsNovelORFs)
return(inFrameNorf_IDs)
}
#Annotate as coding region nORF if it matches selected transcript types OR overlaps with CDS
generateCodingRegionIDs <- function(annotationTable, proteinCodingExons) {
overlap <- findOverlaps(novelORFs, proteinCodingExons, type = "any")
overlapNovelORFs <- novelORFs@elementMetadata@listData$V4[unique(overlap@from)]
annotationTable2 <- annotationTable %>%
filter(transcriptClass == "protein_coding" | transcriptClass == "intronic_codingTranscript" |
transcriptBiotype == "nonsense_mediated_decay" | transcriptBiotype == "retained_intron" |
transcriptBiotype == "antisense" | novelORF_ID %in% overlapNovelORFs)
return(annotationTable2)
}
```
## 3c. Classifying Code
```{r classiyingCode}
##### First load in 3 required data files: txdb, novelORFs and transcript biotype
#1.Txdb for humans hg38
human <- makeTxDbFromEnsembl(organism="Homo sapiens",
release=96,
circ_seqs=DEFAULT_CIRC_SEQS,
server="ensembldb.ensembl.org",
username="anonymous", password=NULL, port=0L,
tx_attrib=NULL)
#2.Read in novel ORFs list
novelORFs <- BED12toGRangesList("all_38.bed")
novelORFs <- renameSeqlevels(novelORFs, gsub("chr","", seqlevels(novelORFs)))
#3.Load in transcript biotypes file
#In process group IG and TR genes into protein coding and re-annotate a few misclassified transcripts
#Note that transcriptTypes.txt file was generated for the ensembl biomart page with the follwing parameters:
#Ensembl Genes 96, Dataset: Human genes(GRCh38.p12), Attributes: Transcript stable ID + Transcript type, Unique results only
proteinCodingBiotypes <- c("protein_coding","IG_C_gene", "IG_D_gene", "IG_J_gene", "IG_V_gene",
"TR_C_gene", "TR_D_gene", "TR_J_gene", "TR_V_gene")
transcriptTypes <- read_tsv("dataFiles/transcriptTypes.txt", col_names = c("transcript_ID", "transcriptBiotype"), skip = 1, col_types = "cc") %>%
mutate(transcriptBiotype = ifelse(transcriptBiotype %in% proteinCodingBiotypes, 'protein_coding', transcriptBiotype))
##### Set priority ranks for coding and non-coding annotations
#Priority rank for coding annotations
proteinCodingAnnotationRank <- tibble(ORFannotation = c("cds", "utr5-cds", "cds-utr3", "utr5", "utr3", "cds-intronic",
"utr5-intronic", "utr3-intronic", "cds-intergenic", "utr5-intergenic", "utr3-intergenic", "None"),
rank = 1:12)
#Priority rank for non-coding annotations
ncAnnotationRank <- tibble(
transcriptBiotype = c('nonsense_mediated_decay', 'non_stop_decay', 'retained_intron',
'sense_intronic', 'sense_overlapping', '3prime_overlapping_ncrna', 'antisense', 'bidirectional_promoter_lncRNA',
'IG_V_pseudogene', 'IG_C_pseudogene','TR_V_pseudogene', 'rRNA_pseudogene',
'polymorphic_pseudogene', 'translated_unprocessed_pseudogene', 'transcribed_processed_pseudogene',
'transcribed_unprocessed_pseudogene', 'processed_pseudogene', 'unprocessed_pseudogene',
'unitary_pseudogene', 'transcribed_unitary_pseudogene', 'pseudogene', 'lincRNA',
'macro_lncRNA', 'snoRNA','Mt_rRNA','Mt_tRNA','miRNA', 'snRNA','misc_RNA','processed_transcript', 'TEC'),
rank = 1:31)
exonsByTranscript <- exonsBy(human, by = "tx", use.names = T)
intronsByTranscript <- intronsByTranscript(human, use.names = T)
proteinCodingTranscriptIDs <- transcriptTypes %>%
filter(transcriptBiotype == 'protein_coding')
proteinCodingExons <- exonsByTranscript[exonsByTranscript@partitioning@NAMES %in% proteinCodingTranscriptIDs$transcript_ID]
proteinCodingIntrons <- intronsByTranscript[intronsByTranscript@partitioning@NAMES %in% proteinCodingTranscriptIDs$transcript_ID]
nonCodingTranscriptIDs <- transcriptTypes %>%
filter(transcriptBiotype != 'protein_coding')
nonCodingExons <- exonsByTranscript[exonsByTranscript@partitioning@NAMES %in% nonCodingTranscriptIDs$transcript_ID]
nonCodingIntrons <- intronsByTranscript[intronsByTranscript@partitioning@NAMES %in% nonCodingTranscriptIDs$transcript_ID]
#Separate nORFs into five major classes and create tibble for annotating each novel ORF
#1. Within a protein coding exon
#2. Within a non coding exon
#3. Within an intron of protein coding transcript
#4. Within an intron of non-coding transcript
#5. None of 1-4 (intergenic)
annotationClasses <- groupClasses()
#Classify in full detail
annotationMaster <- classify_nORFs(annotationTibble = annotationClasses, txdbInput = human)
#Label as protein region coding T/F and in-frame with protein coding T/F
inFrameIDs <- generateInFrameIDs(gffFile = "dataFiles/Homo_sapiens.GRCh38.96.gff3", bed6File = "dataFiles/all_38.6.bed", txdb = human)
codingRegionIDs <- generateCodingRegionIDs(annotationMaster, proteinCodingExons)
annotationMasterFiltered <- annotationMaster %>%
mutate(inFrame = ifelse(novelORF_ID %in% inFrameIDs$norf_ID, T, F)) %>%
mutate(codingRegion = ifelse(novelORF_ID %in% codingRegionIDs$novelORF_ID, T, F)) %>%
dplyr::select(-transcriptClass) %>%
arrange(novelORF_ID)
write_tsv(annotationMasterFiltered, "nORFclassification.tsv")
```
## 3d. Removing "in-frame" entries
This generates the noInFrame files used in all manuscript analyses.
```{r removeInFrame}
noInFrame_nORFs <- read_tsv("nORFclassification.tsv", col_types = 'ccccll') %>%
filter(inFrame == F)
# novelORFtableNamed from section 2b
noInFrame_nORFs <- novelORFtableNamed %>%
#Remove 28 pseudogenes from Xu et al 2016 that were added to the norfs database but not used in the manuscript analyses
filter(source != "pseudogenes_Xu2016") %>%
#Remove in-frame entries
filter(name %in% noInFrame_nORFs$novelORF_ID)
#Convert to bed12 and write out
noInFrame_bed <- createBed12(noInFrame_nORFs)
write_tsv(noInFrame_bed, path = "noInFrame_38.bed", col_names = F)
#Convert to gtf and write out
noInFrame_gtf <- createGTF(noInFrame_nORFs)
write.table(noInFrame_gtf, "noInFrame_38.gtf", col.names = F, row.names = F, sep = "\t", quote = F)
```