Discovery of Allele-Specific Protein-RNA Interactions in Human Transcriptomes

Emad Bahrami-Samani; Yi Xing

doi:10.1016/j.ajhg.2019.01.018

2019 Feb 28;104(3):492–502. doi: 10.1016/j.ajhg.2019.01.018

PMCID: PMC6407496 PMID: 30827501

Abstract

Gene expression is tightly regulated at the post-transcriptional level through splicing, transport, translation, and decay. RNA-binding proteins (RBPs) play key roles in post-transcriptional gene regulation, and genetic variants that alter RBP-RNA interactions can affect gene products and functions. We developed a computational method ASPRIN (Allele-Specific Protein-RNA Interaction) that uses a joint analysis of CLIP-seq (cross-linking and immunoprecipitation followed by high-throughput sequencing) and RNA-seq data to identify genetic variants that alter RBP-RNA interactions by directly observing the allelic preference of RBP from CLIP-seq experiments as compared to RNA-seq. We used ASPRIN to systematically analyze CLIP-seq and RNA-seq data for 166 RBPs in two ENCODE (Encyclopedia of DNA Elements) cell lines. ASPRIN identified genetic variants that alter RBP-RNA interactions by modifying RBP binding motifs within RNA. Moreover, through an integrative ASPRIN analysis with population-scale RNA-seq data, we showed that ASPRIN can help reveal potential causal variants that affect alternative splicing via allele-specific protein-RNA interactions.

Keywords: genetic variation, protein-RNA interactions, CLIP-seq, RNA-seq, post-transcriptional regulation, bioinformatics, transcriptomics, single-nucleotide polymorphism, alternative splicing

Introduction

Natural genetic polymorphisms can diversify the transcriptome and proteome among individuals by altering the post-transcriptional processing and modification of RNA.¹ Such regulatory variation can cause disease, modify disease risk, or affect therapeutic response.²,³ Thus, the discovery of genetic variants that affect post-transcriptional RNA regulation may reveal causal mechanisms underlying phenotypic variability and disease pathogenesis in human populations.⁴,⁵

RNA-binding proteins (RBPs) are key regulators of post-transcriptional RNA processing and modification.⁶ RBPs participate in various steps of RNA regulation, including splicing, transport, translation, and decay, thus determining the fate of RNAs after transcription.⁷ RBPs bind to their RNA targets via defined sequence and/or structural motifs.⁸

The predominant technology for transcriptome-wide mapping of RBP-RNA interactions is CLIP-seq.⁹,¹⁰,¹¹,¹² Multiple variants of CLIP-seq (HITS-CLIP,⁹ PAR-CLIP,¹⁰ iCLIP,¹¹ and eCLIP¹²), aimed at improving library efficiency and reducing artifacts, have been used to define the RBP-RNA binding landscape of hundreds of RBPs across different cell types and species. These variants share the same core procedure: the RBP is cross-linked to its RNA targets, which permits more stringent washing of unbound RNA, and the bound RNA is then subjected to high-throughput sequencing. Because of their technical differences and biases, however, they deliver slightly different datasets, as detailed in Chakrabarti et al.¹³

Previous studies have investigated the effects of genetic variants on post-transcriptional regulation, primarily using a sequence motif-based approach. Jian et al.¹⁴ reviewed eight bioinformatics tools that predict splice-altering single nucleotide variants in the human genome. These methods use information about highly conserved splicing regulatory elements (5′ and 3′ splice sites and branch point signals) as well as auxiliary cis-acting elements recognized by trans-acting RBPs¹⁴ to predict the effects of genetic variants on alternative splicing. Some other recent studies used defined binding motifs of RBPs to predict variants that alter RBP-RNA interactions.¹⁵,¹⁶ However, as RBP binding motifs are typically short (4–6 nucleotides) and degenerate, methods based on RBP motifs are expected to have a low accuracy and high noise.¹⁷

We developed ASPRIN (Allele-Specific Protein-RNA Interaction), a computational method to identify genetic variants that alter RBP-RNA interactions via a joint analysis of CLIP-seq and RNA-seq data. The premise of ASPRIN is that the allelic ratio in CLIP-seq data compared to that in RNA-seq data of the same cell type can reflect the effects of genetic variants on RBP-RNA interactions. We performed a systematic ASPRIN analysis of ENCODE CLIP-seq (eCLIP) and RNA-seq data for 166 RBPs in two cell lines. One advantage of eCLIP is that it is designed to enrich for fragments that are truncated at the cross-link location,¹² although the degree of enrichment is RBP dependent. In some other types of CLIP experiments such as PAR-CLIP, by contrast, the characteristic T-to-C mutation at the cross-link sites¹⁰ introduces additional complications to allele-specific analysis. ASPRIN identified genetic variants that alter RBP-RNA interactions by modifying conserved RBP binding sites. Moreover, through an integrative ASPRIN analysis with population-scale RNA-seq data, we showed that ASPRIN can help reveal causal variants that affect alternative splicing via allele-specific protein-RNA interactions.

Material and Methods

Calling Variants from RNA-Seq Data

The total RNA-seq data for HepG2 whole-cell preparations from two different labs (ENCODE: ENCSR468ION and ENCSR181ZGR), a HepG2 cytosolic fraction (ENCODE: ENCSR862HPO), a HepG2 nuclear fraction (ENCODE: ENCSR061SFU), K562 whole-cell preparations from two different labs (ENCODE: ENCSR000AEN and ENCSR885DVH), a K562 cytosolic fraction (ENCODE: ENCSR860DWK), and a K562 nuclear fraction (ENCODE: ENCSR040YBR) were downloaded from the ENCODE website.

The GATK Best Practices workflow for calling single-nucleotide polymorphisms (SNPs) and indels on RNA-seq data was used with minor modifications.¹⁸ Briefly, the datasets were mapped using STAR v.2.5.2a,¹⁹ and total RNA-seq data from all fractions and all labs were merged to make one large RNA-seq dataset for each cell line. The rest of the pipeline included adding read groups, sorting, marking duplicates, and creating the index using Picard tools v1.134 (see Web Resources), followed by splitting and trimming, subsequent reassignment of mapping qualities, indel realignment, base recalibration, and finally, calling variants using the GATK pipeline.¹⁸ Mapping and variant calling statistics are given in Table S1.

Filtering SNPs

In our analyses, we removed false positive SNPs due to sequencing errors, alignment artifacts, and RNA editing events. Only heterozygous variants in the RNA-seq data that matched known SNPs in the NCBI SNP Database (dbSNP)²⁰ were kept. Potential RNA editing events were labeled and removed by intersecting called heterozygous variants with the RADAR (Rigorously Annotated Database of A-to-I RNA editing) RNA editing database.²¹ However, ASPRIN can run in different modes to consider variants that are SNPs, RNA editing events, or both.
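The filtering logic described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Variant` record, the `filter_snps` function, and the mode names (`"snp"`, `"editing"`, `"both"`) are hypothetical and are not taken from the ASPRIN code, and sites are matched by chromosome and position only, which is a simplification.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    chrom: str
    pos: int        # 1-based position
    ref: str
    alt: str
    genotype: str   # e.g. "0/1" for a heterozygous call

HETEROZYGOUS = {"0/1", "1/0", "0|1", "1|0"}

def filter_snps(called, dbsnp_sites, radar_sites, mode="snp"):
    """Keep heterozygous calls according to a (hypothetical) mode name.

    mode="snp":     site is in dbSNP and not in RADAR
    mode="editing": site is in RADAR
    mode="both":    site is in dbSNP or in RADAR
    """
    kept = []
    for v in called:
        if v.genotype not in HETEROZYGOUS:
            continue  # only heterozygous sites carry allelic information
        site = (v.chrom, v.pos)
        in_dbsnp = site in dbsnp_sites
        in_radar = site in radar_sites
        if mode == "snp" and in_dbsnp and not in_radar:
            kept.append(v)
        elif mode == "editing" and in_radar:
            kept.append(v)
        elif mode == "both" and (in_dbsnp or in_radar):
            kept.append(v)
    return kept
```

In this sketch `dbsnp_sites` and `radar_sites` are plain sets of `(chrom, pos)` tuples; a real pipeline would more likely intersect VCF or BED files and also compare alleles.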

eCLIP Data Analysis

For pre-processing ENCODE eCLIP data, the standard operating procedure (SOP) published on the ENCODE website was followed. In brief, (1) adaptors were trimmed using cutadapt v1.10,²² (2) a second round of adaptor cutting was performed to control for double ligation events, (3) the resulting reads were mapped to the human-specific version of Repbase²³ using STAR 2.5.2a¹⁹ to remove repetitive elements and other repetitive reads, as well as to control for spurious artifacts from rRNA, (4) reads mapped to repetitive regions were filtered out of the resulting output from STAR, and (5) PCR duplicates were further removed using random-mers that were provided in the names of the reads. The raw read files available at the ENCODE data portal are already pre-processed, and random-mers that can reveal PCR duplicates are removed from the reads and put in the read names. This information can be used for removing PCR duplicates that are mapped to the same genomic location.
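Step (5) can be illustrated with a minimal sketch. It assumes, purely for illustration, that the random-mer is the first colon-separated field of the read name and that each mapped read is represented as a `(name, chrom, start, strand)` tuple; the actual read-name layout should be checked against the ENCODE files, and the function name is hypothetical.

```python
def remove_pcr_duplicates(reads):
    """Keep one read per (chrom, start, strand, random-mer) combination.

    `reads` is an iterable of (name, chrom, start, strand) tuples.
    Assumption: the random-mer is the text before the first ':' in the name.
    """
    seen = set()
    unique = []
    for name, chrom, start, strand in reads:
        randomer = name.split(":", 1)[0]
        key = (chrom, start, strand, randomer)
        if key not in seen:
            seen.add(key)
            unique.append((name, chrom, start, strand))
    return unique
```

Reads that map to the same location but carry different random-mers are kept, since they are taken to derive from distinct RNA fragments.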

Mapped reads for each replicate were sorted, merged, and indexed, and the resulting mapped reads file was used as input for ASPRIN. In our analysis of ASPRIN results, when we needed the peaks, second (paired-end) reads were used to perform peak-calling using Piranha,²⁴ with a bin size of 1 nt to achieve single-nucleotide resolution in peak-calling for the eCLIP data. We considered significant peaks to be those that had a corrected p value of less than 0.01. Mapping and peak calling statistics for RBPs in HepG2 and K562 cell lines are given in Tables S2 and S3, respectively.

ASPRIN Allelic Ratio Test

For each RBP, ASPRIN counts the number of reads that cover each allele in the CLIP-seq and RNA-seq datasets and forms a contingency table with (1) the number of reads covering the reference allele in CLIP-seq, (2) the number of reads covering the alternative allele in CLIP-seq, (3) the number of reads covering the reference allele in RNA-seq, and (4) the number of reads covering the alternative allele in RNA-seq. The result of Fisher’s exact test for each SNP shows whether a particular SNP is significantly differentially bound by an RBP. For each RBP, ASPRIN p values are corrected for multiple hypothesis testing using the Benjamini-Hochberg method and SNPs with q value < 0.1 are reported as significantly differentially bound, or “ASPRIN SNPs.”
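As a rough illustration of the test described above, the following standard-library sketch computes a two-sided Fisher's exact test on each 2×2 table and applies the Benjamini-Hochberg correction. It is not the ASPRIN implementation: the function names and the input layout (a dictionary mapping a SNP identifier to its four read counts) are assumptions made for this example, and the ten-read minimum is taken from the description of the pipeline in the Results.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x reference-allele reads in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_observed * (1 + 1e-7))
    return min(1.0, p)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values (q values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

def call_asprin_snps(counts, min_reads=10, q_cutoff=0.1):
    """counts: {snp_id: (clip_ref, clip_alt, rna_ref, rna_alt)} for one RBP."""
    tested = {s: c for s, c in counts.items()
              if c[0] + c[1] >= min_reads and c[2] + c[3] >= min_reads}
    snps = list(tested)
    pvals = [fisher_exact_two_sided(*tested[s]) for s in snps]
    qvals = benjamini_hochberg(pvals)
    return {s: q for s, q in zip(snps, qvals) if q < q_cutoff}
```

The correction is applied per RBP, so `call_asprin_snps` would be called once for each RBP's count table.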

Assessing the Robustness of ASPRIN

To measure the error associated with our variant filtering method, RNA-seq datasets for the GM12878 cell line were downloaded from SRA (SRA: SRR307897 and SRR307898) and the complete genotype for this cell line was downloaded from the 1000 Genomes (1000G) project website.²⁵ We performed variant calling as described above and intersected the set of called variants with 1000G SNPs, dbSNP, and RADAR.

To investigate the choice of RNA-seq protocol and how it may affect the power of ASPRIN, in addition to the total RNA-seq data, we also downloaded polyA+ mRNA-seq data for the same cell lines, fractions, and laboratories: HepG2 whole-cell preparations from two different labs (ENCODE: ENCSR985KAT and ENCSR561FEE), a HepG2 cytosolic fraction (ENCODE: ENCSR931WGT), a HepG2 nuclear fraction (ENCODE: ENCSR058OSL), K562 whole-cell preparations from two different labs (ENCODE: ENCSR000AEO and ENCSR545DKY), a K562 cytosolic fraction (ENCODE: ENCSR384ZXD), and a K562 nuclear fraction (ENCODE: ENCSR530NHO). To normalize for read number and length, we sampled n reads from each of these datasets, ten times, where n was the minimum number of reads among these datasets. The RNA-seq libraries that had 100-nucleotide reads (from Brenton Graveley’s lab) were also truncated to 50 nucleotides, to have the same read length as the RNA-seq libraries with 50-nucleotide reads (from Eric Lecuyer’s lab). We then called variants from all these datasets and compared the number of called variants and the regions in which these variants were located. We also ran the ASPRIN pipeline on all eCLIP datasets with these ten subsampled RNA-seq datasets using only cytosolic polyA+ mRNA-seq and nuclear total RNA-seq, to compare the number of ASPRIN SNPs that can be called using these two distinct RNA-seq sets representing different RNA species and subcellular fractions.
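The normalization step above can be sketched as follows. The sketch works on in-memory lists of read sequences for brevity; the function name, the dictionary layout, and the fixed random seed are assumptions for illustration, and a real analysis would operate on FASTQ files.

```python
import random

def subsample_libraries(libraries, n_draws=10, read_length=50, seed=0):
    """Draw equally sized random subsamples from every library.

    libraries: {library_name: [read_sequence, ...]}
    Each draw samples n reads per library without replacement, where n is
    the size of the smallest library, and truncates reads to `read_length`.
    """
    rng = random.Random(seed)
    n = min(len(reads) for reads in libraries.values())
    draws = []
    for _ in range(n_draws):
        draw = {
            name: [read[:read_length] for read in rng.sample(reads, n)]
            for name, reads in libraries.items()
        }
        draws.append(draw)
    return draws
```

Reads that are already 50 nucleotides long are left unchanged by the truncation, so the same function covers both laboratories' libraries.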

To investigate the cross-linking bias and its potential effects on our analysis, for any ASPRIN SNP that was associated with at least one of the 75 RBPs in the HepG2 cell line, we counted for how many RBPs this SNP was (1) called significant with preference for the reference allele, (2) called significant with preference for the alternative allele, (3) not called significant, and (4) not present in enough reads to pass the filters for the ASPRIN analysis.

ASPRIN SNP Enrichment or Depletion in Genomic Regions

We measured the enrichment of ASPRIN SNPs in different genomic regions using Fisher’s exact test. For RBP x and region r, we counted (1) the number of ASPRIN SNPs for x in r and (2) the rest of ASPRIN SNPs for x. In addition, for the background, we counted (3) the number of ASPRIN SNPs in r for the rest of the RBPs and (4) the number of ASPRIN SNPs in any region except r for the rest of the RBPs. Then, we used Fisher’s exact test to measure the significance of enrichment or depletion of ASPRIN SNPs in region r for RBP x compared to the average expectation.
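To make the four counts concrete, the sketch below builds the 2×2 table for one RBP and one region from per-RBP region counts and derives a log fold enrichment. The input layout and function names are assumptions for illustration, the base-2 logarithm is a choice made here, and the significance test itself (Fisher's exact test on the returned table) is left out.

```python
from math import log2

def enrichment_table(counts, rbp, region):
    """Return counts (1)-(4) for RBP `rbp` and region `region`.

    counts: {rbp_name: {region_name: number_of_ASPRIN_SNPs}}
    """
    in_region = counts[rbp].get(region, 0)
    elsewhere = sum(counts[rbp].values()) - in_region
    others = [regions for name, regions in counts.items() if name != rbp]
    bg_in_region = sum(regions.get(region, 0) for regions in others)
    bg_elsewhere = sum(sum(regions.values()) for regions in others) - bg_in_region
    return in_region, elsewhere, bg_in_region, bg_elsewhere

def log2_fold_enrichment(table):
    """log2 of (fraction of the RBP's SNPs in the region) / (background fraction)."""
    a, b, c, d = table
    return log2((a / (a + b)) / (c / (c + d)))
```

A positive value indicates enrichment and a negative value depletion relative to the other RBPs; tables with zero counts would need a pseudocount before taking the logarithm.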

Measuring RBP Sequence Specificity

We determined the sequence specificity of RBPs as the information content of the motif obtained by de novo motif discovery in the high-quality binding sites as defined by the Piranha peak caller.²⁴ For each RBP, peaks were called using Piranha. Each peak was then assigned to the genomic region (intron, 5′ UTR, coding segment, 3′ UTR, noncoding RNA, or intergenic sequence) that contained it. All peaks in noncoding or intergenic regions were filtered out, and the highest peak in each gene was selected as the representative peak of that RBP binding to the gene. Finally, the top 1,000 peaks based on the corrected p value reported by Piranha were selected as the set of high-quality peaks. Zagros⁸ was then used for de novo motif discovery, using sequence and secondary structure information. The parameters were window size 6 and top 10 motifs (-w 6 -n 10), and we selected the top motif reported by Zagros as the discovered motif for that RBP. The information content of each RBP consensus motif was obtained by computing the information content of each position from its Shannon entropy and averaging over all positions within the consensus sequence. RBPs with consensus sequences that had more information content were considered to have higher sequence specificity.
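The averaging of per-position information content can be illustrated with a short sketch. It assumes the common convention that the information content of a position is log2(4) minus its Shannon entropy, in bits; the text does not spell out the exact formula, so this convention and the function names are assumptions.

```python
from math import log2

def position_information(column):
    """Information content (bits) of one motif position.

    column: {"A": p_A, "C": p_C, "G": p_G, "T": p_T}, probabilities summing to 1.
    """
    entropy = -sum(p * log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy  # log2(4) = 2 bits is the maximum for four nucleotides

def motif_information(pwm):
    """Average information content over all positions of a motif."""
    return sum(position_information(column) for column in pwm) / len(pwm)
```

Under this convention a fully conserved position contributes 2 bits and a uniform position contributes 0 bits, so motifs with higher averages are more sequence specific.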

Motif Enrichment Analysis

Motif enrichment analysis was done using the STORM software.²⁶ As described above, the top 6-nucleotide motif discovered by Zagros in the top 1,000 peaks for each RBP was used as the consensus motif for that RBP. STORM can use the motif position weight matrix output from Zagros directly and calculate the enrichment of that motif in the set of input sequences.

For each SNP, a sequence of 11 nucleotides centered at the SNP (spanning all six 6-mer windows that include the SNP) was extracted. Then for each sequence we flipped the center nucleotide, the SNP, to the alternative allele. Therefore, for each RBP, two sets of sequences were formed that are pairwise identical except for the center position, which contains the two alleles of the SNP. One set contains the alleles with low-affinity binding and the other contains the alleles with high-affinity binding. Then, STORM was run with the corresponding consensus motif for each RBP on the two sets of sequences for that RBP to assess the difference in motif score. Parameters for STORM can be set to find the top occurrence of a motif per sequence (-n 1 -q) in single-stranded mode (-S) for RNA. For each RBP, we only considered SNPs that have positive scores in both high and low binding affinity sequences to filter out SNPs occurring outside the binding site. For each SNP the maximum motif score among all six possible windows in the high binding affinity sequence and its corresponding motif score in the low binding affinity sequence were selected to produce the boxplots of motif scores for each RBP in each position of the motif. We also defined a motif impact score for each RBP and its associated ASPRIN SNP set as the maximum difference in average motif score between the two alleles with high versus low binding affinity in the window of six nucleotides overlapping the ASPRIN SNP.
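The window logic above can be sketched with a simple log-odds scan. This is not STORM's scoring function: the log2 odds against a uniform background, the function names, and the requirement that the position weight matrix has no zero entries are all assumptions made for this illustration.

```python
from math import log2

def flip_center(seq, allele):
    """Return `seq` with its center nucleotide replaced by `allele`."""
    center = len(seq) // 2
    return seq[:center] + allele + seq[center + 1:]

def window_scores(seq, pwm, background=0.25):
    """Log-odds score of every motif-length window in `seq`.

    pwm: list of {"A": p, "C": p, "G": p, "T": p} with strictly positive entries.
    For an 11-nt sequence and a 6-nt motif this yields six windows,
    all of which contain the center position.
    """
    k = len(pwm)
    return [
        sum(log2(pwm[i][base] / background) for i, base in enumerate(seq[start:start + k]))
        for start in range(len(seq) - k + 1)
    ]

def paired_motif_scores(high_seq, low_seq, pwm):
    """Best window in the high-affinity sequence and the matching low-affinity score."""
    high = window_scores(high_seq, pwm)
    low = window_scores(low_seq, pwm)
    best = max(range(len(high)), key=high.__getitem__)
    motif_position = len(high_seq) // 2 - best  # 0-based position of the SNP in the motif
    return motif_position, high[best], low[best]
```

For a SNP at the fourth nucleotide of a TGCATG-like motif, `paired_motif_scores` returns motif position 3 (0-based) together with the scores of the two alleles in that window.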

Splicing Quantitative Trait Loci (sQTLs) Analysis

To demonstrate the utility of ASPRIN in finding relevant SNPs that may cause changes in splicing, we analyzed ASPRIN SNPs in the HepG2 cell line and sQTLs calculated from population-scale RNA-seq data in liver as part of the GTEx consortium.²⁷ RNA-seq and genotype data of liver tissues from 71 individuals (GTEx v6) were downloaded, the reads were mapped to the hg19 genome, and Percent Spliced In (Psi) values were calculated for each splicing event in each individual. We selected events satisfying the condition Max(Psi) – Min(Psi) > 0.1 over all individuals. Then, for each splicing event, GLiMMPS²⁸ was run on SNPs within a 400-kb window centered on the splicing event. The false discovery rate (FDR) was estimated using a permutation procedure to obtain the null distribution. In each of the ten permutations, we shuffled the individuals’ genotypes so that each individual would have a randomly assigned genotype. We then ran GLiMMPS to obtain the sQTLs on the permuted data, recorded the minimum p value for each exon over all cis SNPs in each permutation, and used this set of p values as the empirical null distribution for estimating the FDR. Using an FDR threshold of 10%, we calculated the p value cutoff t such that P(p₀ < t)/P(p₁ < t) = 0.1, where P(p₀ < t) is the fraction of expected p values from the null distribution less than t and P(p₁ < t) is the fraction of observed p values less than t from the real data. For each splicing event, the sQTLs were defined as the SNPs that have p values less than the cutoff. The linkage disequilibrium (LD) with all the ASPRIN SNPs was calculated and used for selecting only the exons that had sQTLs in high LD with ASPRIN SNPs (r² > 0.8). The LD map was created using a CEU population.²⁹ Exons for events in which the ASPRIN SNP is near the exon were further filtered with the criterion that the ASPRIN SNP is within a window of 500 nucleotides around the alternative splicing event.
The windows were defined for each alternative splicing event as follows: (1) skipped exon: 500 nucleotides into the introns on each side of the skipped exon; (2) mutually exclusive exons: 500 nucleotides into the introns on each side of two mutually exclusive exons; (3 and 4) alternative 5′ or 3′ splice sites: 500 nucleotides into the introns on each side of the longer exon; and (5) intron retention: 500 nucleotides into the exons on each side of the retained intron. The numbers of each type of alternative splicing event that pass the filters are given in Table S4.
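The permutation-based cutoff described above, where t is chosen so that P(p₀ < t)/P(p₁ < t) reaches the FDR threshold, can be sketched as a search over candidate thresholds. The function names are hypothetical, and picking the largest candidate whose ratio does not exceed the threshold is one reasonable reading of the procedure rather than the authors' exact implementation.

```python
from bisect import bisect_left

def fraction_below(sorted_values, t):
    """Fraction of values strictly less than t (input must be sorted)."""
    return bisect_left(sorted_values, t) / len(sorted_values)

def fdr_cutoff(null_pvalues, observed_pvalues, fdr=0.1):
    """Largest candidate t with P(p0 < t) / P(p1 < t) <= fdr, or None.

    null_pvalues:     p values from the permuted data (empirical null)
    observed_pvalues: p values from the real data
    """
    null_sorted = sorted(null_pvalues)
    observed_sorted = sorted(observed_pvalues)
    cutoff = None
    for t in sorted(set(null_sorted) | set(observed_sorted)):
        observed_fraction = fraction_below(observed_sorted, t)
        if observed_fraction == 0:
            continue  # ratio undefined when nothing is called at this threshold
        if fraction_below(null_sorted, t) / observed_fraction <= fdr:
            cutoff = t
    return cutoff
```

SNPs with p values below the returned cutoff would then be reported as sQTLs; if no threshold satisfies the ratio, the function returns `None`.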

Genome-wide Association Study (GWAS) Signals

23,444 GWAS SNPs with p values < 10⁻⁵ were downloaded from the NHGRI GWAS catalog²⁹ and PLINK v1.08p³⁰ was used to calculate the LD between ASPRIN SNPs and GWAS SNPs on the LD map that was created using a CEU population.²⁹ SNPs in high LD (r² > 0.8) with GWAS SNPs were reported as GWAS-correlated ASPRIN SNPs.
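For readers unfamiliar with the r² statistic, the sketch below computes it as the squared Pearson correlation between genotype dosages (0, 1, or 2 copies of an allele) at two SNPs and applies the 0.8 threshold. It is an illustration only and does not reproduce PLINK; the function names and the genotype layout are assumptions.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # r2 is undefined for a monomorphic SNP; treated as 0 here
    return cov * cov / (var_x * var_y)

def gwas_correlated(asprin_genotypes, gwas_genotypes, threshold=0.8):
    """Return ASPRIN SNP ids with r2 > threshold to at least one GWAS SNP.

    Both arguments map a SNP id to its dosage vector over the same individuals.
    """
    return sorted(
        snp for snp, x in asprin_genotypes.items()
        if any(genotype_r2(x, y) > threshold for y in gwas_genotypes.values())
    )
```

In practice the comparison would be restricted to SNP pairs within some genomic distance of each other rather than computed for all pairs.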

Results

ASPRIN Pipeline for Detecting Allele-Specific Protein-RNA Interactions

The discovery of allele-specific protein-RNA interactions in ASPRIN is based on the rationale that if a particular SNP creates or disrupts an RBP binding site, we would expect to observe a difference in the allelic ratio of the SNP in the CLIP-seq reads compared to the corresponding RNA-seq reads from the same cell type. A schematic diagram of ASPRIN is provided in Figure 1A. Briefly, to call SNPs, RNA-seq reads were mapped to the human genome and transcriptome, and single nucleotide variants (SNVs) were called using the GATK pipeline¹⁸ (see details in Material and Methods). We then applied stringent filters to remove false positive SNPs contributed by potential sequencing errors, alignment artifacts, and RNA editing events. Specifically, heterozygous variants in RNA-seq data that matched known SNPs in dbSNP were kept,²⁰ while potential RNA editing events were removed by intersection with the RADAR RNA editing database.²¹ After this set of high-confidence SNPs was generated, CLIP-seq reads were mapped and reads supporting the reference or alternative allele in the CLIP-seq data were counted. Additionally, because RNA-seq reads are typically longer than CLIP-seq reads, we split the 100 bp RNA-seq reads in the ENCODE data into two 50 bp segments and mapped them separately to count reference and alternative alleles in the RNA-seq data, to alleviate systematic mapping bias for the reference over the alternative alleles in CLIP-seq data compared to the RNA-seq data. Indeed, by splitting 100 bp RNA-seq reads, the mapping bias was largely removed (Figure S1). Finally, we tested each SNP site with at least ten reads (sum of two alleles) in both the RNA-seq and CLIP-seq data for significant difference in allelic ratio via Fisher’s exact test of allelic read counts in RNA-seq versus CLIP-seq data. After correcting for multiple hypothesis testing, we reported SNPs with corrected p values of less than 0.1 as ASPRIN SNPs (Figure 1A). 
An example result for the HepG2 cell line is an A-to-G SNP (rs115776575) in PTPN4 (MIM: 176878) that disrupts a highly conserved “A” nucleotide in the “TGCATG” consensus motif of RBFOX2. While the allelic ratio between “A” and “G” was 1:1 in the RNA-seq reads, the “G” allele represented only 10.5% of the CLIP-seq reads (Figure 1B), consistent with RBFOX2 binding to the TGCATG motif and with disruption of that binding by the A-to-G SNP at the fourth nucleotide position of the motif.

Figure 1. The ASPRIN Pipeline for Identifying Allele-Specific Protein-RNA Interactions from CLIP-Seq and RNA-Seq Data

(A) Flowchart of the ASPRIN pipeline: variants are called from RNA-seq data, and heterozygous variants are intersected with dbSNP to obtain a list of high-confidence SNPs and intersected with RADAR to filter out potential A-to-I RNA editing events. For each SNP, ASPRIN counts the number of reads in the CLIP-seq and RNA-seq data that support each allele. An allelic ratio test then assesses whether one allele is significantly more preferred for RBP binding.

(B) An A-to-G SNP (rs115776575) disrupts a consensus RBFOX2 binding site in *PTPN4*. This disruption of binding is illustrated in the difference in the numbers of reads containing each allele in CLIP-seq reads, while equal numbers of reads contain each allele in the RNA-seq data.

ASPRIN Is Robust in Discovering SNPs Involved in Allele-Specific Protein-RNA Interactions

We evaluated various issues that may affect the performance of ASPRIN, such as errors arising from calling variants from RNA-seq data, choice of RNA-seq protocols, and potential artifacts due to the cross-linking step in CLIP-seq experiments. First, since whole-genome genotype data are not available for most of the cell types with CLIP-seq data, we assessed our SNP calling procedure using RNA-seq data alone. To obtain a ground truth for this assessment, we called SNVs using RNA-seq data for the GM12878 cell line (SRA accessions SRR307897 and SRR307898), for which high-quality whole-genome genotype data are available from the 1000G project.²⁵ After calling SNVs in GM12878 using our pipeline, we intersected the set of heterozygous variants with known SNPs in GM12878 from the 1000G project²⁵ and known A-to-I RNA editing sites in the RADAR database²¹ to investigate the distribution of different variant types. As shown in Figure 2A, 63.2% of the called SNVs were known SNPs and 23.8% were known RNA editing events. The remaining 13.0% were unknown variants that did not match any 1000G SNPs or RADAR sites and the distribution of all 12 possible single-nucleotide changes suggested that these unknown variants represented a mixture of SNPs and RNA editing events (Figure 2A). As shown in Figure 2B, 89.6% of the called SNVs that were in the dbSNP were also present in the 1000G data for GM12878, suggesting an upper bound of 10.4% for the false discovery rate of our RNA-seq-based SNP calling procedure. Moreover, 3.6% of the called SNVs for GM12878 present in the 1000G data were not in the dbSNP, suggesting that the use of the dbSNP had a minimal impact on the false negative rate of SNP identification. Collectively, our data suggest that, by using dbSNP and RADAR as filters, we can obtain a set of high-confidence SNPs from our RNA-seq variant calling in the absence of matching genotype data.

Figure 2. RNA-Seq Variants Called in the GM12878 Cell Line

dbSNP and RADAR were used as external references to obtain a set of high-confidence SNPs from RNA-seq variant calling in the absence of matching genotype data.

(A) Intersection of variants with the 1000G SNPs and RADAR RNA editing events as well as the distribution of variant types over all 12 possible single-nucleotide changes.

(B) The variant filtering steps in the ASPRIN pipeline yield low false discovery and low false negative rates.

Next, we investigated issues that may affect the power of ASPRIN for calling SNPs and identifying allele-specific protein-RNA interactions. Specifically, the choice of RNA-seq protocol may affect the power of ASPRIN depending on the binding location of a given RBP within the RNA. For instance, a cytosolic polyA+ RNA-seq library would be appropriate for an RBP that predominantly binds to exons within mRNAs in the cytosol, but not for an RBP that predominantly binds to introns within precursor mRNAs in the nucleus. To investigate the most appropriate RNA-seq protocols and libraries, we randomly sampled equal numbers of reads from polyA+ and total RNA-seq libraries of distinct subcellular fractions (nucleus, cytosol, and whole-cell) from the HepG2 cell line and performed SNP calling and ASPRIN analysis on the sampled RNA-seq data. For both polyA+ and total RNA-seq libraries, we called the highest number of SNPs from the nuclear RNA-seq data and the lowest number of SNPs from the cytosolic RNA-seq data (Figure 3A). The lowest number of SNPs was called from cytosolic polyA+ RNA-seq data (Figure 3A); these SNPs were enriched for exonic regions within UTRs (untranslated regions) and CDS (coding segments) and depleted for intronic regions within pre-mRNAs (Figure 3B). A similar trend was observed for the K562 leukemia cell line (Figure S2). On the other hand, as reads of cytosolic polyA+ RNA-seq libraries were concentrated within CDS and UTR regions, such data may have better power for detecting allele-specific protein-RNA interactions of RBPs that bind predominantly to exons. As expected, the nuclear fraction of the total RNA-seq library provided a much greater power for ASPRIN analysis of an RBP that binds predominantly to introns (HNRNPM), while ASPRIN analyses of an RBP that binds predominantly to exonic regions (YBX3) identified similar numbers of ASPRIN SNPs from the cytosolic polyA+ RNA-seq library and the nuclear total RNA-seq library (Figure 3C). 
Furthermore, after calling peaks, we sorted all RBPs in both cell lines based on the ratio of exonic (CDS and UTR regions) to intronic peaks. The complete distributions of peaks in different regions for all RBPs are shown in Figure S3 and we excluded RBPs for which more than 50% of peaks fell in intergenic regions and noncoding RNAs. We observed a positive correlation (Pearson correlation coefficient = 0.34, p value < 0.0001) between binding of an RBP to exonic regions and the relative power of identifying significant ASPRIN SNPs using cytosolic polyA+ RNA-seq libraries, despite large variation among individual RBPs (Figure 3D).

Figure 3. RNA-Seq Variants Called from Different RNA-Seq Libraries of the HepG2 Cell Line

Two methods of library selection (polyA+ and total RNA) in different subcellular fractions (nucleus, cytosol, and whole-cell fractions from two different labs: EL = Eric Lecuyer’s lab at Institut de Recherches Cliniques de Montréal, and BG = Brenton Graveley’s lab at University of Connecticut).

(A) Numbers of variants called from different RNA-seq libraries and their intersections with dbSNP and RADAR.

(B) Distribution of called variants in different genomic regions.

(C) Numbers of significant ASPRIN variants from polyA+ cytosolic or total RNA nuclear RNA-seq libraries for an RBP that binds predominantly to intronic regions (HNRNPM) and an RBP that binds predominantly to exonic regions (YBX3). Standard error of the mean is indicated as the error bar for each library selection method and subcellular fraction.

(D) The ratio of ASPRIN SNPs found using polyA+ cytosolic RNA-seq libraries to ASPRIN SNPs found using total RNA nuclear RNA-seq libraries increases as the ratio of exonic to intronic peaks increases.

Finally, we evaluated potential false positives that may arise from the cross-linking step in CLIP-seq experiments. Specifically, the sequences in the CLIP-seq libraries may be altered by mutation or deletion at the cross-linking site.⁹,¹⁰,¹¹ We noted that in the eCLIP protocol used for generating the ENCODE CLIP-seq data, the majority of fragments were truncated at the cross-linking site rather than containing mutations or deletions.¹² Nonetheless, we investigated this issue further by calling SNVs from the ENCODE eCLIP data and comparing the distribution of variant types to that of the RNA-seq data and observed a similar distribution (Figure S4). Another possible source of artifacts is cross-linking bias that may shift the read count toward specific nucleotides in the CLIP-seq data. However, 70% of ASPRIN SNPs were called significant for only one RBP. Only 6% of ASPRIN SNPs were called significant for more than five RBPs. Among these SNPs, the same allele was preferred by all RBPs in 87% of the SNPs, whereas in the remaining 13%, different alleles were preferred by different RBPs (Figure S5). Overall, these data suggest that the fraction of ASPRIN SNPs that may be attributable to CLIP-seq cross-linking bias is small.

To assess the reproducibility of ASPRIN using different eCLIP replicates, we ran ASPRIN on all the ENCODE data and each eCLIP replicate separately. For each pair of datasets, we calculated the normalized intersection over union of the number of ASPRIN SNPs to show for each eCLIP replicate which dataset shows the highest degree of agreement. As shown in Figures S6 and S7, ASPRIN is reproducible between replicates in both cell lines.
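The agreement measure can be illustrated with a plain intersection over union of two ASPRIN SNP sets. The text does not specify how the value was normalized, so this sketch computes only the unnormalized ratio; the function names and the dictionary layout are assumptions.

```python
def intersection_over_union(snps_a, snps_b):
    """Size of the intersection divided by the size of the union of two SNP sets."""
    a, b = set(snps_a), set(snps_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match(replicate, datasets):
    """Name of the dataset whose ASPRIN SNP set agrees best with `replicate`.

    datasets: {dataset_name: iterable_of_snp_ids}
    """
    return max(datasets, key=lambda name: intersection_over_union(replicate, datasets[name]))
```

If ASPRIN is reproducible, the best match for a replicate should be the other replicate of the same RBP.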

ASPRIN Identifies Functionally Relevant SNPs for Different Classes of RBPs

To assess the potential functional relevance of the ASPRIN results, we investigated the positional distribution of ASPRIN SNPs for different classes of RBPs. To this end, we classified RBPs based on their known functions,³¹ and we defined genomic regions as follows: (1) 5′ UTRs, (2) upstream proximal intronic regions (500 nucleotides upstream of an internal exon), (3) coding regions, (4) downstream proximal intronic regions (500 nucleotides downstream of an internal exon), (5) 3′ UTRs, (6) distal intronic regions (more than 500 nucleotides away from exons on both sides), (7) noncoding RNAs, and (8) intergenic regions. Then, for each RBP, we calculated the enrichment of ASPRIN SNPs in different genomic regions (see details in Material and Methods). As expected, ASPRIN SNPs were more enriched in regions to which RBPs bind to perform their known functions (Figure 4). For instance, in the HepG2 cell line, we observed an enrichment (p value < 0.001) of ASPRIN SNPs in the 5′ UTR for translation regulators such as DDX3X and NCBP2, with 27.1% and 16.0% of their ASPRIN SNPs found within the 5′ UTR, respectively. Multiple classes of splicing factors showed distinct patterns of positional distributions for their ASPRIN SNPs. We observed an enrichment of ASPRIN SNPs in upstream proximal intronic regions for branch point recognition factors such as SF3B4 (30.5%), U2AF2 (12.2%), U2AF1 (8.8%), and SF3A3 (11.8%). Similarly, ASPRIN SNPs were enriched in the downstream proximal intronic regions for RBPs that are part of the 5′ splice site machinery such as PRPF8 (22.0%), EFTUD2 (14.8%), and RBM22 (11.7%). There was an enrichment of ASPRIN SNPs in coding regions for several splicing regulators that primarily bind to coding exons, such as SRSF1 (40.7%) and TRA2A (29.0%). 
For RBFOX2, we observed an enrichment of ASPRIN SNPs in both upstream and downstream proximal intronic regions (6.0% and 15.1%, respectively), as we expect RBFOX2 to bind to either region to promote exon skipping or inclusion, respectively. The ASPRIN SNPs of HNRNP proteins were enriched in distal intronic regions and depleted in coding regions, consistent with the fact that these RBPs predominantly bind to distal intronic regions. Finally, RBPs that regulate mRNA stability, such as IGF2BP proteins and LIN28, showed an enrichment of ASPRIN SNPs in the 3′ UTR (Figure 4A). We observed a similar pattern in the K562 cell line, where RBPs profiled in both cell lines showed similar patterns of regional preference (Figures 4B and S8). The numbers of ASPRIN SNPs for each RBP in HepG2 and K562 are provided in Figure S9.

Figure 4. Enrichment of ASPRIN SNPs in Different Genomic Regions

Positional distributions of ASPRIN SNPs for different classes of RBPs in HepG2 (A) and K562 (B) cell lines. The diagram at the top of the figure depicts the genomic regions used in the analysis. RBPs were classified based on their known functions.³¹ In both panels, the enrichment of ASPRIN SNPs for each RBP in different genomic regions is shown as a heatmap of color-coded log fold enrichment (top) and as barplots of the percentage of total ASPRIN SNPs (bottom).

ASPRIN SNPs Affect RBP Consensus Motifs

To explore the potential molecular mechanisms by which ASPRIN SNPs affect protein-RNA interactions, we investigated the effects of ASPRIN SNPs on RBP consensus motifs. We predicted that if an RBP binds to RNAs in a highly sequence-specific manner, then variants within the conserved RBP consensus motif are likely to affect binding. First, we called peaks from ENCODE CLIP-seq data using Piranha²⁴ and performed de novo motif discovery on called peaks using Zagros⁸ to obtain a 6-nucleotide consensus motif for each RBP. We then calculated the information content of the consensus motif, defined as the average information content of each position within the 6-nucleotide motif, as a measure of sequence specificity (see details in Material and Methods). Figure 5A shows the RBPs in HepG2, sorted by the sequence specificity of their consensus motifs. Among all RBPs, HNRNPA1 and RBFOX2 had the highest sequence specificity of their consensus motifs, and they are known to bind to highly conserved AGGGAG³² and TGCATG³³ motifs, respectively. Next, for all ASPRIN SNPs of a given RBP, we obtained two sets of sequences that corresponded to the two alleles, i.e., one with high binding affinity and the other with low binding affinity. Finally, we used the position weight matrix that was obtained for all RBP consensus motifs by Zagros and calculated the motif scores for the two sets of sequences using STORM²⁶ (Figure S10 and Material and Methods). Figure 5B shows the motif scores of five RBPs with high (HNRNPA1, RBFOX2), median (DKC1), and low (NCBP2, XRN2) consensus motif sequence specificity. Variants in different positions within the consensus motif did not seem to affect binding equally. For example, for HNRNPA1, variants in position 5 of the motif had a more significant effect on binding than did variants in other positions. This result shows that not all positions in the consensus motif contribute equally to RBP-RNA interactions.
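
As an illustration of the sequence-specificity measure, the sketch below computes the average per-position information content of a motif from Shannon's entropy. It assumes the common convention of a uniform nucleotide background, so that each position contributes 2 − H(p) bits; the function names and the input layout (a list of per-position probability vectors) are hypothetical and are not taken from Zagros or the ASPRIN code.

```python
import math

def position_information(probs):
    """Information content (bits) of one motif position.

    probs holds the probabilities of A, C, G, T at that position.
    Assumption: uniform background, so IC = log2(4) - H(probs).
    """
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(len(probs)) - entropy

def motif_specificity(pwm):
    """Average information content over all positions of the motif."""
    return sum(position_information(column) for column in pwm) / len(pwm)
```

With this convention, a perfectly conserved 6-nucleotide motif has a specificity of 2 bits per position and a motif with uniform base frequencies has a specificity of 0.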

Figure 5. The Effect of ASPRIN SNPs on RBP Consensus Motifs

(A) RBPs in the HepG2 cell line, sorted based on the sequence specificity (i.e., information content) of their consensus motif. For each RBP, the information content was calculated by taking the average of the information content for each position within the motif, calculated using Shannon’s entropy.

(B) Boxplots comparing the consensus motif scores for alleles with high and low binding affinity. Two RBPs with the lowest sequence specificity (XRN2 and NCBP2), one RBP with the median sequence specificity (DKC1), and two RBPs with the highest sequence specificity (RBFOX2 and HNRNPA1) are shown. The consensus motif obtained from the top 1,000 peaks for each RBP is represented at the bottom of each graph. The middle line of the boxplot represents median value. The low and high ends of the box represent the 25% and 75% quantiles, respectively. The two whiskers extend to 1.5 times the interquartile range.

(C) As sequence specificity increases, we observe a larger difference between the consensus motif scores of the high-affinity versus low-affinity ASPRIN alleles.

To further explore the relationship between the ASPRIN SNPs and RBP consensus motifs, we defined a motif impact score for each RBP and its associated ASPRIN SNP set as the maximum difference of average motif score between the two alleles with high versus low binding affinity in the window of six nucleotides overlapping the ASPRIN SNP (see details in Figure S10). We observed a positive correlation (Pearson correlation coefficient = 0.29, p value < 0.05) between the motif impact score and the sequence specificity of a given RBP’s consensus motif (Figure 5C), suggesting that for highly sequence-specific RBPs, ASPRIN SNPs tend to affect binding by altering the consensus binding motifs within the RNA. For instance, in the case of HNRNPA1 and RBFOX2, we observed a higher motif score for alleles with higher binding affinity, while for NCBP2 and XRN2, we did not observe noticeable differences in motif scores between the two alleles in any position of their consensus motif (Figure 5C).
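
One way to read the definition of the motif impact score is sketched below. For each of the six placements of the motif window that overlap the SNP, the high-affinity and low-affinity allele sequences are scored against the position weight matrix, the scores are averaged over all ASPRIN SNPs of the RBP, and the largest high-minus-low difference is reported. The log-odds scoring with a uniform background, the probability floor, and the input tuple layout are assumptions made for illustration; the authors' scoring used STORM, and the exact procedure is given in Figure S10.

```python
import math

def log_odds_score(window, pwm, background=0.25, floor=1e-3):
    """Sum of log2(p / background) over the window.

    pwm[i] maps each base to its probability at motif position i.
    The floor (an assumption) avoids log(0) for unseen bases.
    """
    return sum(
        math.log2(max(pwm[i][base], floor) / background)
        for i, base in enumerate(window)
    )

def motif_impact_score(snps, pwm):
    """Maximum over motif positions of mean(high score) - mean(low score).

    snps: list of (flank_seq, snp_index, high_allele, low_allele), where
    flank_seq has at least len(pwm) - 1 bases on each side of snp_index.
    """
    width = len(pwm)
    differences = []
    for pos in range(width):  # the SNP sits at motif position `pos`
        high_scores, low_scores = [], []
        for seq, idx, high, low in snps:
            start = idx - pos
            for allele, bucket in ((high, high_scores), (low, low_scores)):
                allele_seq = seq[:idx] + allele + seq[idx + 1:]
                bucket.append(log_odds_score(allele_seq[start:start + width], pwm))
        mean_high = sum(high_scores) / len(high_scores)
        mean_low = sum(low_scores) / len(low_scores)
        differences.append(mean_high - mean_low)
    return max(differences)
```

Under this reading, an RBP with a highly specific motif yields a large score when its ASPRIN SNPs fall on conserved motif positions, whereas a degenerate motif yields a score near zero regardless of the SNP position.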

ASPRIN Can Help Reveal Causal Variants Affecting Alternative Splicing

Finally, we investigated whether ASPRIN can help reveal causal genetic variants that affect post-transcriptional gene regulation. For this analysis, we focused on the genetic variation of alternative splicing. A series of population-scale transcriptome studies have revealed widespread alternative splicing variation among human individuals,⁴ but it remains challenging to pinpoint the causal genetic variants underlying this splicing variation. To match our ASPRIN analysis of the HepG2 liver cell line, we analyzed liver RNA-seq data along with matching genotype data of 71 individuals from the GTEx consortium (v6). We performed a transcriptome-wide scan of splicing quantitative trait loci (sQTLs) using GLiMMPS²⁸ and obtained ASPRIN SNPs correlated with GLiMMPS sQTLs (see details in Material and Methods).

Our joint ASPRIN and GLiMMPS analyses revealed candidate causal SNPs that affected alternative splicing via allele-specific protein-RNA interactions. For example, GLiMMPS identified several SNPs that were significantly associated with an exon-skipping event in FAM114A1, one of which was an ASPRIN SNP (Figure 6A). The genotype at the ASPRIN SNP was significantly associated with the level of exon inclusion, with the GG and AA genotypes showing the highest and lowest levels of exon inclusion, respectively (Figure 6B). The ASPRIN analysis indicated that the G allele was associated with significantly greater binding by the splicing factor SRSF9 (Figure 6B), while the A allele disrupted binding at the highly conserved “G” nucleotide at the fourth position of the SRSF9 consensus motif (Figure 6B). Collectively, these data suggest that the G-to-A SNP disrupted the binding of the splicing activator SRSF9, leading to reduced inclusion of the FAM114A1 exon. Similarly, we identified an ASPRIN SNP for the splicing factor SF3B4, which was significantly associated with an alternative 3′ splice site event in ARL6IP4 (MIM: 607668) (Figures 6C and 6D). This C-to-T SNP was located seven nucleotides upstream of the intron-exon boundary and disrupted a highly conserved “C” nucleotide at the fourth position of the SF3B4 consensus motif. This was reflected by a much lower percentage of the T allele in the SF3B4 CLIP-seq data than in the RNA-seq data and increased usage of an upstream cryptic 3′ splice site for the TT genotype (Figures 6C and 6D). Overall, our results show that ASPRIN can help pinpoint causal variants within a window of SNPs that are correlated with levels of alternative splicing and in high linkage disequilibrium with each other.

Figure 6. ASPRIN Helps Reveal Causal Variants Affecting Alternative Splicing

(A) Distribution of GLiMMPS p values around the exon skipping event in FAM114A1. For each SNP, the p value indicates the significance of correlation between genotype and exon inclusion level within a 400-kb window centered on the splicing event.

(B) Plots indicating the correlation of exon inclusion level with genotype for the ASPRIN SNP, differential binding of SRSF9 to the ASPRIN SNP that is in high LD with the GLiMMPS sQTL, and CLIP-seq allelic coverage on the ASPRIN SNP illustrating the effect of the SNP on the RBP consensus motif. The middle line of the boxplot represents median value. The low and high ends of the box represent the 25% and 75% quantiles, respectively. The two whiskers extend to 1.5 times the interquartile range.

(C and D) Similar plots are shown for a GLiMMPS sQTL involving alternative 3′ splice site usage in ARL6IP4, along with an ASPRIN SNP with differential binding of SF3B4 that is in high LD with the sQTL.

We further associated ASPRIN SNPs with GWAS SNPs.²⁹ Specifically, we used the LD map of a CEU population to calculate LD correlations between all ASPRIN SNPs and SNPs associated with diseases and traits in the NHGRI GWAS catalog.²⁹ Tables S5 and S6 show all ASPRIN SNPs in high LD (r² > 0.8) with GWAS SNPs in HepG2 and K562 cell lines, respectively. These tables can be used by researchers to narrow down their search for candidate causal SNPs from GWAS signals of human traits or diseases.
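
The LD filter can be illustrated with a small sketch. It approximates r² as the squared Pearson correlation between genotype dosages (0, 1, or 2 copies of the alternate allele) of two SNPs across individuals, which is a simplification of the haplotype-based r² reported by dedicated LD tools. The function names, the dictionary inputs, and the SNP identifiers in the usage note are hypothetical; in practice only SNP pairs within a limited genomic distance would be compared.

```python
def genotype_r2(dosages_a, dosages_b):
    """Squared Pearson correlation between two SNPs' genotype dosages."""
    n = len(dosages_a)
    if n == 0 or n != len(dosages_b):
        raise ValueError("dosage vectors must be non-empty and of equal length")
    mean_a = sum(dosages_a) / n
    mean_b = sum(dosages_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosages_a, dosages_b))
    var_a = sum((a - mean_a) ** 2 for a in dosages_a)
    var_b = sum((b - mean_b) ** 2 for b in dosages_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # monomorphic SNP: r2 is undefined, treated as 0 here
    return cov * cov / (var_a * var_b)

def snps_in_high_ld(asprin_genotypes, gwas_genotypes, threshold=0.8):
    """Return (asprin_id, gwas_id, r2) for all pairs with r2 > threshold.

    Both arguments map a SNP identifier to its dosage vector, with
    individuals in the same order in every vector.
    """
    hits = []
    for asprin_id, dosages_a in asprin_genotypes.items():
        for gwas_id, dosages_b in gwas_genotypes.items():
            r2 = genotype_r2(dosages_a, dosages_b)
            if r2 > threshold:
                hits.append((asprin_id, gwas_id, r2))
    return hits
```

For example, calling snps_in_high_ld({"rsA": [0, 1, 2, 1, 0, 2]}, {"rsG1": [0, 1, 2, 1, 0, 2], "rsG2": [2, 0, 1, 1, 2, 0]}) keeps only the pair (rsA, rsG1), because the second GWAS SNP is only weakly correlated with the ASPRIN SNP.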

Discussion

We report ASPRIN, a computational tool for identifying genetic variants that may affect RBP-RNA interactions, by quantifying and contrasting the allelic ratios of heterozygous SNPs in CLIP-seq versus RNA-seq data. Unlike previous work that relied on short RBP consensus motifs,¹⁵^,¹⁶ ASPRIN adopts a data-driven approach to directly observe the allelic preference of RBPs in CLIP-seq data, using matching RNA-seq data from the same cell type as the control. Our comprehensive ASPRIN analysis of 166 RBPs in two ENCODE cell lines identified 55,646 candidate allele-specific protein-RNA interaction events. These events may provide valuable information for interpreting causal signals underlying human transcriptomic variation and phenotypic diversity. Of note, recent population transcriptomic studies (such as the GTEx project²⁷) have revealed widespread genetic variation of gene expression and RNA processing in human populations, but identifying the causal SNPs underlying such regulatory variation remains difficult. The ASPRIN analysis provides an independent source of information that may assist the fine mapping of SNPs associated with gene expression levels or RNA processing patterns. In this work, we present two example cases in which the ASPRIN analysis reveals the likely causal variant responsible for splicing QTLs in the human liver. Future studies integrating other layers of RNA regulatory processes may reveal ASPRIN SNPs that causally impact other aspects of RNA processing and metabolism in human cells.

Declaration of Interests

Y.X. is a scientific co-founder of Panorama Medicine Inc.

Acknowledgments

The authors thank Drs. Levon Demirdjian and Ying Nian Wu for insightful discussions and the ENCODE Consortium and the ENCODE production laboratories for generating the eCLIP and RNA-seq data. This work was supported by National Institutes of Health grants R01GM088342 and U01CA233074 to Y.X. E.B.S. was partly supported by National Institutes of Health T32 Tumor Cell Biology Training Grant (T32CA009056).

Published: February 28, 2019

Footnotes

Supplemental Data can be found with this article online at https://doi.org/10.1016/j.ajhg.2019.01.018.

Web Resources

ASPRIN source code, https://github.com/Xinglab/ASPRIN
dbSNP, https://www.ncbi.nlm.nih.gov/projects/SNP/
ENCODE, https://www.encodeproject.org/
OMIM, http://www.omim.org/
Picard, http://broadinstitute.github.io/picard/
RADAR database version 2, http://lilab.stanford.edu/GokulR/database/Human_AG_all_hg19_v2.txt
Repbase, https://www.girinst.org/downloads/
SRA, https://www.ncbi.nlm.nih.gov/sra

Supplemental Data

Document S1. Figures S1–S10

mmc1.pdf (3.6 MB, pdf)

Table S1. RNA-Seq Data Mapping and Variant Calling Stats

mmc2.xlsx (14.6 KB, xlsx)

Table S2. eCLIP Mapping and Peak Calling Stats HepG2

mmc3.xlsx (21.3 KB, xlsx)

Table S3. eCLIP Mapping and Peak Calling Stats K562

mmc4.xlsx (30.7 KB, xlsx)

Table S4. ASPRIN SNP sQTL Correlation

mmc5.xlsx (9.8 KB, xlsx)

Table S5. ASPRIN SNP GWAS Correlation HepG2

mmc6.xlsx (95.7 KB, xlsx)

Table S6. ASPRIN SNP GWAS Correlation K562

mmc7.xlsx (91.3 KB, xlsx)

Document S2. Article plus Supplemental Data

mmc8.pdf (5.2 MB, pdf)

References

1. Glisovic T., Bachorik J.L., Yong J., Dreyfuss G. RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett. 2008;582:1977–1986. doi: 10.1016/j.febslet.2008.03.004.
2. Cooper T.A., Wan L., Dreyfuss G. RNA and disease. Cell. 2009;136:777–793. doi: 10.1016/j.cell.2009.02.011.
3. Lukong K.E., Chang K.W., Khandjian E.W., Richard S. RNA-binding proteins in human genetic disease. Trends Genet. 2008;24:416–425. doi: 10.1016/j.tig.2008.05.004.
4. Park E., Pan Z., Zhang Z., Lin L., Xing Y. The expanding landscape of alternative splicing variation in human populations. Am. J. Hum. Genet. 2018;102:11–26. doi: 10.1016/j.ajhg.2017.11.002.
5. Wang G.-S., Cooper T.A. Splicing in disease: disruption of the splicing code and the decoding machinery. Nat. Rev. Genet. 2007;8:749–761. doi: 10.1038/nrg2164.
6. Hentze M.W., Castello A., Schwarzl T., Preiss T. A brave new world of RNA-binding proteins. Nat. Rev. Mol. Cell Biol. 2018;19:327–341. doi: 10.1038/nrm.2017.130.
7. Moore M.J. From birth to death: the complex lives of eukaryotic mRNAs. Science. 2005;309:1514–1518. doi: 10.1126/science.1111443.
8. Bahrami-Samani E., Penalva L.O., Smith A.D., Uren P.J. Leveraging cross-link modification events in CLIP-seq for motif discovery. Nucleic Acids Res. 2015;43:95–103. doi: 10.1093/nar/gku1288.
9. Licatalosi D.D., Mele A., Fak J.J., Ule J., Kayikci M., Chi S.W., Clark T.A., Schweitzer A.C., Blume J.E., Wang X. HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature. 2008;456:464–469. doi: 10.1038/nature07488.
10. Hafner M., Landthaler M., Burger L., Khorshid M., Hausser J., Berninger P., Rothballer A., Ascano M., Jr., Jungkamp A.-C., Munschauer M. Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell. 2010;141:129–141. doi: 10.1016/j.cell.2010.03.009.
11. König J., Zarnack K., Rot G., Curk T., Kayikci M., Zupan B., Turner D.J., Luscombe N.M., Ule J. iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nat. Struct. Mol. Biol. 2010;17:909–915. doi: 10.1038/nsmb.1838.
12. Van Nostrand E.L., Pratt G.A., Shishkin A.A., Gelboin-Burkhart C., Fang M.Y., Sundararaman B., Blue S.M., Nguyen T.B., Surka C., Elkins K. Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP). Nat. Methods. 2016;13:508–514. doi: 10.1038/nmeth.3810.
13. Chakrabarti A.M., Haberman N., Praznik A., Luscombe N.M., Ule J. Data science issues in studying protein–RNA interactions with CLIP technologies. Ann. Rev. Biomed. Data Sci. 2018;1:235–261. doi: 10.1146/annurev-biodatasci-080917-013525.
14. Jian X., Boerwinkle E., Liu X. In silico prediction of splice-altering single nucleotide variants in the human genome. Nucleic Acids Res. 2014;42:13534–13544. doi: 10.1093/nar/gku1206.
15. Mao F., Xiao L., Li X., Liang J., Teng H., Cai W., Sun Z.S. RBP-Var: a database of functional variants involved in regulation mediated by RNA-binding proteins. Nucleic Acids Res. 2016;44(D1):D154–D163. doi: 10.1093/nar/gkv1308.
16. Singh B., Trincado J.L., Tatlow P.J., Piccolo S.R., Eyras E. Genome sequencing and RNA-motif analysis reveal novel damaging noncoding mutations in human tumors. Mol. Cancer Res. 2018;16:1112–1124. doi: 10.1158/1541-7786.MCR-17-0601.
17. Bahrami-Samani E., Vo D.T., de Araujo P.R., Vogel C., Smith A.D., Penalva L.O., Uren P.J. Computational challenges, tools, and resources for analyzing co- and post-transcriptional events in high throughput. Wiley Interdiscip. Rev. RNA. 2015;6:291–310. doi: 10.1002/wrna.1274.
18. McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M.A. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110.
19. Dobin A., Davis C.A., Schlesinger F., Drenkow J., Zaleski C., Jha S., Batut P., Chaisson M., Gingeras T.R. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635.
20. Sherry S.T., Ward M.-H., Kholodov M., Baker J., Phan L., Smigielski E.M., Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29:308–311. doi: 10.1093/nar/29.1.308.
21. Ramaswami G., Li J.B. RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Res. 2014;42:D109–D113. doi: 10.1093/nar/gkt996.
22. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet journal. 2011;17:10–12.
23. Bao W., Kojima K.K., Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA. 2015;6:11. doi: 10.1186/s13100-015-0041-9.
24. Uren P.J., Bahrami-Samani E., Burns S.C., Qiao M., Karginov F.V., Hodges E., Hannon G.J., Sanford J.R., Penalva L.O., Smith A.D. Site identification in high-throughput RNA-protein interaction data. Bioinformatics. 2012;28:3013–3020. doi: 10.1093/bioinformatics/bts569.
25. Auton A., Brooks L.D., Durbin R.M., Garrison E.P., Kang H.M., Korbel J.O., Marchini J.L., McCarthy S., McVean G.A., Abecasis G.R., 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.
26. Schones D.E., Smith A.D., Zhang M.Q. Statistical significance of cis-regulatory modules. BMC Bioinformatics. 2007;8:19. doi: 10.1186/1471-2105-8-19.
27. Ward M.C., Gilad Y. Human genomics: Cracking the regulatory code. Nature. 2017;550:190–191. doi: 10.1038/550190a.
28. Zhao K., Lu Z.X., Park J.W., Zhou Q., Xing Y. GLiMMPS: robust statistical model for regulatory variation of alternative splicing using RNA-seq data. Genome Biol. 2013;14:R74. doi: 10.1186/gb-2013-14-7-r74.
29. Welter D., MacArthur J., Morales J., Burdett T., Hall P., Junkins H., Klemm A., Flicek P., Manolio T., Hindorff L., Parkinson H. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014;42:D1001–D1006. doi: 10.1093/nar/gkt1229.
30. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A., Bender D., Maller J., Sklar P., de Bakker P.I., Daly M.J., Sham P.C. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795.
31. Van Nostrand E.L., Freese P., Pratt G.A., Wang X., Wei X., Blue S.M., Dominguez D., Cody N.A., Olson S., Sundararaman B. A large-scale binding and functional map of human RNA binding proteins. bioRxiv. 2017. doi: 10.1038/s41586-020-2077-3.
32. Burd C.G., Dreyfuss G. RNA binding specificity of hnRNP A1: significance of hnRNP A1 high-affinity binding sites in pre-mRNA splicing. EMBO J. 1994;13:1197–1204. doi: 10.1002/j.1460-2075.1994.tb06369.x.
33. Damianov A., Ying Y., Lin C.-H., Lee J.-A., Tran D., Vashisht A.A., Bahrami-Samani E., Xing Y., Martin K.C., Wohlschlegel J.A., Black D.L. Rbfox proteins regulate splicing as part of a large multiprotein complex LASR. Cell. 2016;165:606–619. doi: 10.1016/j.cell.2016.03.040.

