Summary
Gene misexpression is the aberrant transcription of a gene in a context where it is usually inactive. Despite its known pathological consequences in specific rare diseases, we have a limited understanding of its wider prevalence and mechanisms in humans. To address this, we analyzed gene misexpression in 4,568 whole-blood bulk RNA sequencing samples from INTERVAL study blood donors. We found that while individual misexpression events occur rarely, in aggregate they were found in almost all samples and a third of inactive protein-coding genes. Using 2,821 paired whole-genome and RNA sequencing samples, we identified that misexpression events are enriched in cis for rare structural variants. We established putative mechanisms through which a subset of SVs lead to gene misexpression, including transcriptional readthrough, transcript fusions, and gene inversion. Overall, we develop misexpression as a type of transcriptomic outlier analysis and extend our understanding of the variety of mechanisms by which genetic variants can influence gene expression.
Keywords: misexpression, ectopic expression, aberrant expression, transcriptomic outlier, structural variants, transcriptional readthrough, transcript fusion
Misexpression is the aberrant transcription of a gene in a context where it is usually inactive. Despite its known pathological consequences, we have a limited understanding of its wider prevalence and mechanisms. Here, we explore the prevalence of misexpression, the genetic variants associated with misexpression, and their mechanisms of action.
Introduction
Temporal and spatial regulation of gene expression is essential for the functioning of multicellular eukaryotes. Gene regulation involves the context-specific activation and maintenance of transcription, as well as gene silencing to avoid aberrant transcription interfering with normal cellular function. The aberrant transcription of a gene in a context where it is usually inactive is termed gene misexpression (also referred to as ectopic expression) (Figure 1A).1 Gene misexpression can occur either via the transcription of a single inactive gene or via the production of a novel transcript derived in part from an inactive gene. We refer to these different types of events as non-chimeric and chimeric misexpression, respectively.
Figure 1.
Identification of misexpression events and characterization of misexpressed genes
(A) Gene misexpression is the aberrant transcription of a gene in a context where it is usually inactive. In this schematic, the majority of individuals have negligible or no expression of gene A (inactive, gray), with only a handful of individuals showing high expression (misexpression, red).
(B) Distribution of gene activity across 29,614 genes within the INTERVAL whole-blood RNA-seq dataset. For each gene, activity is quantified as the percentage of samples where the gene has a TPM >0.1 (x axis). Inactive genes are defined as having a TPM >0.1 in less than 5% of samples (vertical dashed line).
(C) Proportion of 39,513,200 gene-sample pairs (8,650 inactive genes across 4,568 samples) that are misexpressed (y axis) across different misexpression Z score thresholds (x axis). Text labels indicate the total number of misexpression events at each misexpression Z score threshold.
(D) Enrichment of gene-level features within 4,437 genes that are misexpressed (Z score >2 and TPM >0.5) versus 4,213 non-misexpressed genes. The 15 features with the highest absolute log odds passing Bonferroni correction are shown. Lines indicate 95% confidence intervals for the fitted parameters using the standard normal distribution.
(E) Human Phenotype Ontology (HPO) terms by -log10(adjusted p value) on the x axis, underrepresented within 1,070 misexpressed protein-coding genes using 3,092 inactive protein-coding genes as the custom background. The top 10 most significant results are shown.
Gene misexpression can have profound phenotypic consequences, as evidenced by the development of ectopic eyes across different tissues in Drosophila melanogaster upon targeted misexpression of eyeless.2 In humans, gene misexpression has been implicated in cancers3,4 and several rare diseases, for example, congenital limb malformations,5 congenital hyperinsulinism,6 and monogenic severe childhood obesity.7 These studies have identified gain-of-function genetic variants that lead to both chimeric and non-chimeric gene misexpression. For example, chimeric misexpression can be caused by transcript fusions7 and non-chimeric misexpression via rearrangements in 3D chromatin architecture8 or loss of silencer function.6 However, these studies have predominantly focused on a limited number of disease-related loci.
Recent large-scale RNA sequencing (RNA-seq) studies analyzing transcriptional outliers in humans have demonstrated that outliers are enriched for rare single-nucleotide variants (SNVs), indels, and structural variants (SVs) in cis9,10,11,12 and that these outlier-associated genetic variants can contribute to complex disease risk.11,13 However, these studies focused on outliers in highly expressed genes within the tissue(s) under study, overlooking misexpression of inactive genes. Consequently, the prevalence of gene misexpression in humans, the genes whose misexpression can be tolerated, and their associated properties are unknown. Furthermore, the types of genetic variants associated with misexpression and their mechanisms remain underexplored.
To address these gaps in our understanding, we conducted a genome-wide analysis of gene misexpression using bulk RNA-seq data from 4,568 blood donors from the INTERVAL study.14,15 We assessed the prevalence of gene misexpression across genes and samples and the characteristics of genes that tolerate misexpression. Additionally, we established the types of genetic variants associated with gene misexpression as well as their putative mechanisms of action using 2,821 paired whole-genome sequencing (WGS) and RNA-seq samples.
Subjects and methods
The INTERVAL study
The INTERVAL study is a prospective cohort study of approximately 50,000 participants nested within a randomized trial of varying blood donation intervals.14,15 Between 2012 and 2014, blood donors aged 18 years and older were recruited at 25 centers of England’s National Health Service Blood and Transplant (NHSBT). All participants gave informed consent before joining the study, and the National Research Ethics Service approved this study (11/EE/0538). Participants were generally in good health, as blood donation criteria exclude individuals with a history of major diseases (e.g., myocardial infarction, stroke, cancer, HIV, and hepatitis B or C) and who have had a recent illness or infection. Participants completed an online questionnaire comprising questions on demographic characteristics (e.g., age, sex, ethnicity), lifestyle (e.g., alcohol and tobacco consumption), self-reported height and weight, diet, and use of medications.
WGS
WGS was performed on 12,354 samples using the Illumina HiSeq X10 platform as paired-end 151-bp reads at the Wellcome Sanger Institute (WSI). Reads were aligned to the GRCh38 human reference genome with decoys (also known as HS38DH) using BWA MEM.16 Variants were called using GATK4.0.10.1.17 GATK Variant Quality Score Recalibration (VQSR) was used to identify probable false-positive calls. We removed 491 samples, including 77 samples with coverage below 12×, 134 with >3% non-reference discordance, 118 with >3% FreeMix (VerifyBamID2) score, 221 samples failing identity checks, 30 samples swapped, 40 samples failing sex checks, 39 duplicates, and 9 samples with possible contamination. Genotypes with allele read balance >0.1 for homozygous reference variants, <0.9 for homozygous alternative variants, or not between 0.2 and 0.8 for heterozygous variants were removed. Genotypes were also removed if the proportion of informative reads was <0.9 or the total read depth >100. We performed variant quality control and filtered out variants that failed to meet the following requirements: call rate per site >95%, mean genotype quality value >20, Hardy-Weinberg equilibrium (HWE) p value >1 × 10−6 only for autosomes. All monomorphic variants with alternative allele count = 0 were further removed, although we kept all monomorphic variants with reference allele count = 0. For chrX and chrY, we applied an additional step to correct the allele counts and frequencies due to female and male samples. Overall, this resulted in 116,382,870 variants (100,694,832 SNVs and 15,688,038 indels), including 6,637,420 (5.7%) multi-allelic sites across 11,863 participants. The WGS data have been deposited at the European Genome-phenome Archive (EGA) under accession number EGAD00001008661.
SV calling
Generation of the SV callset has been described in full previously.18 In brief, deletions, duplications, inversions, and mobile element insertions were called using a combination of Genome STRiP,19 Lumpy,20 CNVnator,21 and svtools.22 For the sv-pipeline duplications and deletions, a random forest classifier using read alignment parameters was trained to minimize false positives. This resulted in 88% sensitivity and 99% specificity for deletions and 55% sensitivity and 92% specificity for duplications. Inversions were retained if <10% of genotypes were missing, HWE was not violated, and >10% of alternate allele supporting reads came from split and paired read ends. Breakends were removed from the callset. Final tuning of the overall quality score was modeled to ensure that 90% of genotypes were identical among duplicate samples. To produce a single set of non-overlapping calls, we performed additional pruning steps. Briefly, we identified overlapping sites with significant genotype concordance and retained the site of higher mean sample quality (sv-pipeline) or the larger site (Genome STRiP). To merge the Genome STRiP deletions with the sv-pipeline deletions, we identified overlapping sv-pipeline deletions or mobile element insertions. If the deletion was <5 kb or involved a mobile element insertion, the sv-pipeline coordinates were retained; otherwise, the coordinates of the larger SV were retained. The final callset consisted of 123,801 SVs comprising 107,966 deletions, 11,681 duplications, 1,395 inversions, and 1,395 mobile element insertions across 10,728 participants. The final callset was compared to SV calls from the 1000 genomes and Hall-SV cohorts.19,23 The callset captured 93% and 92% of common deletions and 65% and 75% of common duplications from each cohort, respectively. An overview of the SV callset is provided in Figure S1.
RNA sample processing and sequencing
Generation of the RNA-seq data has been described in full previously.24 In brief, blood samples were collected from INTERVAL participants in Tempus Blood RNA Tubes (ThermoFisher Scientific) and stored at −80°C until use. RNA extraction was performed by QIAGEN Genomic Services using an in-house-developed protocol. mRNA was isolated using a NEBNext Poly(A) mRNA Magnetic Isolation Module (NEB) and then re-suspended in nuclease-free water. Globin depletion was performed using a KAPA RiboErase Globin Kit (Roche). RNA library preparation was done using a NEBNext Ultra II RNA Library Prep Kit for Illumina (NEB) on a Bravo WS automation system (Agilent). Samples were PCR amplified using a KapaHiFi HotStart ReadyMix (Roche) and unique dual-indexed tag barcodes. PCR products were purified using AMPure XP SPRI beads (Agencourt). Libraries were pooled up to 95-plex in equimolar amounts on a Biomek NX-8 liquid handling platform (Beckman Coulter), quantified using a High Sensitivity DNA Kit on a 2100 Bioanalyzer (Agilent), and then normalized to 2.8 nM. Samples were sequenced using 75-bp paired-end sequencing (reverse stranded) on a NovaSeq 6000 system (S4 flow cell, Xp workflow; Illumina).
RNA-seq alignment
The data pre-processing, including RNA-seq quality control, STAR and Salmon alignments were performed with a Nextflow pipeline, which is publicly available at https://github.com/wtsi-hgi/nextflow-pipelines/blob/rna_seq_interval_5591/pipelines/rna_seq.nf, including the specific aligner parameters. We assessed the sequence data quality using FastQC v.0.11.8. Reads were aligned using STAR v.2.7.3.a.25 The STAR index was built against GRCh38 Ensembl GTF v.97 using the option -sjdbOverhang 75. STAR was run in a two-pass setup with recommended ENCODE options to increase mapping accuracy: (1) a first alignment step of all samples was used to discover novel splice junctions; (2) splice junctions of all samples from the first step were collected and merged into a single list; and (3) a second alignment step realigned all samples using the merged splice junctions list from (2) as input. From the aligned RNA-seq read data, gene-level read counts were calculated from the number of reads mapping to exons using featureCounts v.2.0.0.26 The raw gene-level count data contained 60,617 genes across 4,778 samples.
Quality control of RNA-seq samples
Samples mismatched between RNA-seq and genotyping data within the cohort were identified using QTLtools MBV v.1.2.27 Five sample swaps were corrected. Samples with covariates indicating lower-quality data were identified and removed using the following metrics: RIN <4 or read depth <10 million assigned reads by featureCounts v.2.0.0.26 Samples with missing sequencing covariates and genotyping data were removed, as were samples with suspected contamination. One sample from each flagged pair of related participants, estimated as first- or second-degree from genetic data, was removed, prioritizing samples with WGS data. After this stage, 47 samples were removed, leaving 4,731 remaining samples.
Gene expression quantification
Prior to expression quantification, the following genes were removed: globin genes, rRNA genes, genes on non-reference chromosomes, pseudoautosomal region genes, and genes with “retained_intron” or “read_through” transcript annotation. After removing these genes, 59,144 remained. Gene-level read counts were converted to transcripts per million (TPM) values using the total length of merged exons. The total length of merged exons was computed by collapsing the GENCODE v.31 annotation for each gene to a single transcript using the custom isoform-collapsing procedure from the Genotype-Tissue Expression (GTEx) project.28,29
Removal of global expression outliers
Using TPM values across 59,144 genes and 4,731 samples, samples with many top expression events due to either technical or biological effects were removed as global expression outliers. To count the number of top expression events per sample, genes with TPM equal to zero across all samples were not included, leaving 57,555 genes. Then, for each sample, the number of most extreme expression events in these remaining genes was calculated. Based on an elbow plot of the number of top expression events per sample, 3.4% of samples (163/4,731) with ≥5× the expected number of top expression outliers (total genes/total samples) were removed, resulting in 4,568 samples (Figure S2). The remaining samples had 0–60 top expression events, with 3,273 samples having at least 1 top expression event.
Inactive gene identification
Enrichment testing and downstream analysis were limited to autosomal protein-coding or long non-coding RNA genes in GENCODE v.31.29 Additionally, to minimize misexpression false positives, only genes that passed the expression thresholds for expression quantitative trait loci (eQTL) mapping in at least one of 49 GTEx tissues were retained, leaving 29,614 genes.28 For the remaining genes, we calculated the percentage of samples with a TPM >0.1. Across all genes, this percentage had a bimodal distribution separating highly and lowly expressed genes (Figure 1B). To focus on genes that had very low or no detectable expression, we selected 8,779 genes that had a TPM >0.1 in less than 5% of samples. This approach is analogous to the method used by the GTEx consortium to define active genes for eQTL mapping.28
Inactive gene set validation
To validate our inactive gene set, we used several approaches.
-
(1)
We intersected 60,603 genes in GENCODE v.31 with predicted chromatin states from the Roadmap Epigenomics Project’s 15-state ChromHMM trained on peripheral blood mononuclear cell (PBMC) data.30 Then, for each gene, we calculated the fractional overlap of each chromatin state and conducted k-means clustering to group genes that had similar epigenetic modifications. We performed k-means clustering with 2–10 clusters and selected k = 8 clusters because these clusters were the most biologically interpretable. Based on the overlapping chromatin states, we manually annotated these 8 clusters with the following labels: transcription, weak transcription, quiescent, polycomb weak quiescent, polycomb weak, polycomb repressed, heterochromatin, and unassigned (Figure S2). Ninety-four percent (8,277/8,779) of inactive genes were grouped in clusters with high overlap of repressive or quiescent chromatin states (Figure S2).
-
(2)
We checked whether our method of identifying inactive genes led to a similar gene set using a different whole-blood RNA-seq dataset. We identified inactive genes in whole-blood RNA-seq data from GTEx using the same approach. In brief, we focused on 558 American-European individuals, as defined by the GTEx consortium, to limit the effects of population stratification. The GTEx consortium defined American-European individuals as the subset of self-reported White individuals that grouped together tightly according to genotype principal components 1 and 2.28 Samples with ≥5× the expected number of top expression outliers (total genes/total samples) were removed (n = 18). Comparison between INTERVAL and GTEx was restricted to the 29,614 genes defined previously. Inactive genes were defined as having a TPM >0.1 in less than 5% of samples (n = 8,207). Eighty percent of the INTERVAL inactive gene set were found in the inactive genes identified in GTEx. This is in spite of cohort differences including sample size, participant age, transcript annotation reference, RNA-seq strandedness, and sampling method (blood donation versus postmortem).
-
(3)
We tested whether our inactive gene set contained many genes expressed in other whole-blood RNA-seq datasets. To do so, we examined the overlap between inactive genes and eGenes (genes with at least one eQTL) from GTEx and eQTLgen.28,31 Only 6.4% and 3.1% of the INTERVAL inactive genes were eGenes in GTEx whole blood and eQTLGen, respectively.
These results confirmed that we had identified a set of genes with very low or no expression across different datasets using information from different data types.
Defining gene misexpression
TPM values were Z score transformed for each inactive gene across all 4,568 samples passing quality control. A gene in a sample was defined as misexpressed if it had a TPM >0.5 and a Z score >2. Z scores were used to allow comparison of misexpression events across genes. In addition to the Z score threshold, a TPM threshold of 0.5 was used to remove misexpression events that had a high Z score but only low expression.
Accounting for non-genetic sources of gene misexpression
To ensure that gene misexpression was not associated with biological or technical confounders, we correlated the expression of each inactive gene with 225 covariates. These covariates included participant age, height, weight, BMI, sex, 89 Sysmex cell count measurements, 67 inferred xCell cell enrichments,32 25 technical covariates, top 20 genetic PCs, season, and sequencing batch. We removed 1.5% (129/8,779) of genes whose expression was significantly correlated (|Spearman’s rho| > 0.2, FDR-adjusted p < 0.05) with any covariate (Table S1). The low percentage of genes removed confirmed that for the majority of genes, misexpression events could not be attributed to the systematic effect of a measured or inferred covariate. The final inactive gene set is provided in Table S2.
Gene-level features
We curated a set of gene-level features in order to understand the differences between misexpressed and non-misexpressed genes. The full set of features is provided in Table S3. For each of the 8,650 inactive genes, we compiled a set of 82 gene-level features across five major categories: genomic, constraint and conservation, expression, regulation, and gene sets. Genomic features such as gene length, gene density, and distance to the closest gene were calculated from GENCODE v.31.29 Constraint scores included LOEUF and missense OUEF from gnomAD, as well as pLI, probability of recessive lethality, probability of complete haploinsufficiency, pHaplo, pTriplo, episcore, and the enhancer domain score (EDS).33,34,35,36 The mean conservation score across the gene body was calculated using PhyloP basewise conservation score across 100 vertebrates.37 The number of conserved elements per base pair within a ±10 kb window around a gene was calculated using GERP++ conserved elements.38 The number and type of tissues a gene is expressed in were calculated from GTEx.28 Active gene density and distance were calculated by subsetting to genes with a median TPM >0.5 in INTERVAL. Regulation features derived from chromatin states were calculated using the Roadmap Epigenomics Project’s 15-state ChromHMM trained on PBMC data.30 The fraction of a gene overlapping A/B compartments was derived from GM12878 Hi-C data processed by the 4D Nucleome project.39,40 We generated a set of topologically associating domain (TAD) boundaries in GM12878 cells that were shared (within ±50 kb) across IMR90, HUVEC, HNEK, and HMEC cell lines from the 4D Nucleome project and used these shared boundaries to calculate the closest distance from each gene to a TAD boundary. Enhancer features based on proximity- and activity-linking methods were from Wang and Goldstein.36 Gene sets included protein-coding genes annotated in GENCODE v.31, oncogenes (tier 1, dominant) from the COSMIC v.97 Cancer Gene Census (CGC),41 approved drug targets curated by OpenTargets (OT v.22.11),42 developmental disorder genes from the Decipher DDG2P database43 and OMIM,44 and different gene sets annotated by gnomAD including olfactory, autosomal recessive, autosomal dominant, and haploinsufficient genes.33 All features were Z score transformed across all inactive genes. Inactive genes were split into two groups depending on whether they were not misexpressed (4,437 genes) or misexpressed at least once (4,213 genes), defining misexpression with a misexpression Z score >2 and TPM >0.5. Using different Z score thresholds did not lead to markedly different results (Figure S3). The enrichment of each feature within the misexpressed group was calculated using logistic regression. Across all tests, p values were adjusted using Bonferroni correction. 95% confidence intervals for the fitted parameters were calculated using the standard normal distribution. Underenrichment of Human Phenotype Ontology (HPO) terms within misexpressed protein-coding genes was calculated using the gProfiler (gProfiler2 v.0.2.1) functional profiling function with all 3,092 inactive protein-coding genes used as the custom background.45
Matching RNA-seq and WGS samples
Samples with matching RNA-seq and WGS data were identified using QTLtools MBV v.1.2.27 Out of 4,568 RNA-seq samples, 2,821 and 2,640 samples had a matching WGS sample with SNV/indel calls and SV calls, respectively. The difference in matching samples was due to a higher number of samples failing SV calling.
Genetic variant enrichment calculations
For each enrichment test, we defined a misexpression group as all expression events (expression of a given gene in an individual) passing the specified misexpression Z score threshold and a TPM >0.5. The control group was defined as all expression events below these thresholds restricted to the genes within the misexpression group. Therefore, for each misexpression threshold, the misexpression and control group gene sets were identical. This ensured that enrichment calculations reflected differences in genetic effects rather than differences in mutation background distributions between non-identical gene sets. Risk ratios were calculated as the proportion of expression events in the misexpression group with a given variant type within the tested genomic region and minor allele frequency (MAF) range over the proportion of events in the control group. For these enrichment tests, we counted variants overlapping a ±10 kb window around the gene body. p values were calculated using a two-sided Fisher’s exact test, and 95% confidence intervals were calculated using a normal approximation. We tested four non-overlapping MAF thresholds: rare (0%–1% MAF), low frequency (1%–5% MAF), and common variants (5%–10% and 10%–50% MAF).
To test the enrichment of variants at different genomic distances from genes involved in misexpression events, we assigned variants to a genomic window for each gene. Variants were assigned to 200 kb windows up to 1 Mb upstream and downstream of the gene start and end, respectively, or when overlapping the gene itself to the gene body window. In cases where a variant spanned multiple windows, the variant was placed in the window closest to the gene, with variants overlapping any part of the gene assigned to the gene body window. This resulted in all gene-variant pairs being uniquely assigned to a single window. Enrichment testing was then conducted for each genomic window separately.
To investigate variant consequences, SVs were annotated with the most severe Ensembl Variant Effect Predictor (VEP) consequence on the gene in the test window (VEP v.97.3).46 Variants with no predicted consequence on the gene were annotated according to the most severe consequence if the annotation was regulatory or intergenic (TFBS ablation, TF_binding_site_variant, regulatory_region_variant, TFBS_amplification, intergenic_variant, regulatory_region_ablation, regulatory_region_amplification). If the variant had no predicted consequence on the gene and its most severe consequence was not regulatory or intergenic, then it was annotated as having no predicted effect. Enrichment calculations were performed for each variant consequence that had at least one individual with an SV within the tested window. For SVs, a ±200 kb window around the gene body was used.
Overall, we performed 700 genetic variant enrichment tests. Across all tests, p values were adjusted using Bonferroni correction. All generic variant enrichment results can be found in Table S5.
Identifying misexpression-associated and control rare SVs
We identified 23,159 rare (MAF <1%) SVs located within ±200 kb of an inactive gene for which misexpression (Z score >2 and TPM >0.5) was observed at least once (4,437 genes). For each gene-SV pair, we calculated the median TPM and Z score across all samples with the SV. We defined misexpression-associated SVs as SVs with a nearby gene that had a median TPM >0.5 and median Z score >2. We additionally excluded gene-SV pairs where any sample with the SV had a TPM <0.1, resulting in 105 misexpression-associated SVs. These criteria allow for variable levels of gene expression around the misexpression threshold while removing likely non-causal variants. From the 23,159 rare SVs, we defined control SVs as having a maximum TPM equal to 0 for every inactive gene where the SV is within 200 kb. This resulted in 20,157 control variants.
Misexpression-associated SVs were annotated based on their VEP consequence on the misexpressed gene as done for the genetic enrichment analysis and their position relative to the misexpressed gene. We confirmed that all misexpression-associated duplications were tandem duplications by manually inspecting them in the Integrative Genome Viewer.47
SV properties
To understand the different properties of misexpression-associated and control SVs, we annotated SVs with five features based on conservation, mutational constraint, and deleteriousness scores. Deletions and duplications were scored with CADD-SV v.1.1 in batches of 5,000 variants.48 Scoring inversions and mobile element insertions is currently not supported by CADD-SV. Since we were comparing CADD-SV distributions between control and misexpression-associated variants, we used the raw CADD-SV scores as recommended by the CADD-SV authors. PhyloP conservation scores were downloaded from UCSC genome browser.37 Each SV was annotated based on the maximum conservation score observed across all overlapping bases. Constraint Z scores passing all quality control checks for coding and non-coding regions were downloaded from gnomAD.49 SVs were annotated with the maximum gnomAD Z score across all overlapping 1 kb windows. SVs were annotated based on the minimum gwRVIS across all overlapping bases.50 SVs were annotated with a categorical variable based on whether they overlapped a human accelerated region (HAR).51 Misexpression-associated deletions and duplications were compared separately versus controls using logistic regression, with each score modeled independently, Z score transformed, and SV length included as a covariate. p values were adjusted across all tests using Bonferroni correction. 95% confidence intervals for the fitted parameters were calculated using the standard normal distribution. The genomic score enrichment results can be found in Table S6. Excluding SV length as a covariate did not lead to dramatic changes in the genomic score enrichments (Figure S4).
SV regulatory features
To understand the regulatory features specific to misexpression-associated SVs, we conducted enrichment analysis across 23 regulatory features. All regulatory features with their transformation, cell type, and data source are described in Table S7. A/B compartments measured in the GM12878 cell line were downloaded from the 4D Nucleome project.39,40 We generated a set of TAD boundaries in GM12878 cells that were shared across IMR90, HUVEC, HNEK, and HMEC cell lines. A TAD boundary was considered shared if another TAD boundary was located within ±50 kb in another cell line. CpG islands were downloaded from the UCSC genome browser.52 CCCTC-binding factor (CTCF)-only candidate cis-regulatory elements (cCREs) across all cell types were downloaded from ENCODE.53 CTCF cCREs generated in primary cells from whole blood with CTCF ChIP-seq data available (CD14+ monocytes, neutrophils, and B cells) were also downloaded.53 Regulatory features derived from chromatin states were calculated using the Roadmap Epigenomics Project’s 15-state ChromHMM trained on PBMC data.30 For all data types, features were generated by encoding SV overlap as a binary indicator that was subsequently Z score transformed.
To assess the enrichment of different regulatory annotations in misexpression-associated SVs, we performed logistic regression, modeling misexpression status as a function of each regulatory feature individually, with SV length included as an additional covariate. Excluding SV length did not lead to dramatic changes in the log-odds values, but many more regulatory features were significant (Figure S4). Logistic regression was conducted separately for deletions and duplications. p values were adjusted across all tests using Bonferroni correction. 95% confidence intervals for the fitted parameters were calculated using the standard normal distribution. The regulatory feature enrichment results can be found in Table S8.
Selection and characterization of transcriptional readthrough candidate SVs
To identify deletions that were transcriptional readthrough candidates, we selected misexpression-associated deletions that satisfied the following criteria.
-
(1)
The deletion is located upstream of the misexpressed gene.
-
(2)
The deletion partially overlaps a gene’s 3′ end and overlaps a terminal exon polyA site from polyASite 2.0.54 The overlapping gene is expressed in whole blood (median TPM >0.5 in the INTERVAL dataset) and is on the same strand as the misexpressed gene.
-
(3)
The region upstream of the misexpressed gene up to the SV breakpoint does not contain an expressed gene (median TPM >0.5 in the INTERVAL dataset) on the same strand as the misexpressed gene.
If a deletion was associated with misexpression of multiple genes, then the gene closest to the SV was selected in order to define the expected readthrough region.
To identify duplications that were transcriptional readthrough candidates, we selected misexpression-associated duplications that satisfied the following criteria.
-
(1)
The duplication overlaps the entire misexpressed gene.
-
(2)
The duplication partially overlaps the 5′ end of a gene that is expressed (median TPM >0.5 in the INTERVAL dataset), is positioned downstream of the misexpressed gene, and is on the same strand as the misexpressed gene.
-
(3)
The region upstream of the misexpressed gene up to the SV breakpoint does not contain an expressed gene (median TPM >0.5 in the INTERVAL dataset) on the same strand as the misexpressed gene.
If a duplication was associated with misexpression of multiple genes, then the gene with the shortest expected readthrough region was selected. This resulted in 12 transcriptional readthrough candidate deletions and 5 transcriptional readthrough candidate duplications.
For both deletions and duplications, the uniquely mapped read count and fraction of bases with non-zero coverage (FBNC) of the region upstream of the misexpressed gene up to the SV breakpoint was calculated using BedTools coverage, requiring the same strandedness and treating split BAM entries as distinct bed intervals.55 Read counts were converted to fragments per kilobase of transcript per million mapped reads (FPKM) using sample read depth and the length of the region. The FPKM metric indicates the number of reads mapping across the region but not their distribution, whereas the FBNC metric indicates how much of the region is transcribed but not the quantity of the expression. They thereby provide complementary evidence for transcriptional readthrough. For each readthrough region, Z scores were calculated for both FPKM and FBNC metrics across all 4,568 RNA-seq samples passing quality control. A total of 2,640 samples with available SV calls and WGS were then annotated as having either a deletion, a duplication, or no transcriptional readthrough candidate SV.
Identification of fusion transcripts
We used STAR fusion v.1.10.1 to identify fusion transcripts.56 First, we ran STAR fusion across all samples with a misexpression-associated variant in max sensitivity mode with a STAR max mate distance of 50 kb and with no annotation filter (as recommended for detecting fusion in non-cancer samples). We selected fusion events that involved misexpressed genes that had a misexpression-associated SV in cis. We removed fusion events that were not supported across all samples with the misexpression-associated SV. Next, we ran STAR fusion again using the same parameters, except without applying the max sensitivity mode and with the -denovo_reconstruct, FusionInspector validate, and -examine_coding_effect flags applied.57 Fusion transcripts validated by FusionInspector from this run were labeled high evidence, whereas those that were only identified in the first run were labeled low evidence.
Salmon transcript quantification
We used Salmon v.1.1.0 for transcript quantification.58 The Salmon index was built against GRCh38 cDNA, which was used to generate transcript-level quantification from the sequence data. R packages tximport v.1.14.2, AnnotationHub v.2.18.0, BiocFileCache v.1.10.2, and BiocGenerics v.0.32.0 were applied to obtain various count matrices from these quantifications at the transcript or gene level. For samples with inversion chr3:125,966,617–125,980,782 (GRCh38), the transcript percentage for each ROPN1B transcript was calculated as the transcript TPM divided by the total TPM across all transcripts.
Identification of SVs with potential to alter 3D chromatin architecture
From the misexpression events with no putative mechanism, we selected misexpression-associated SVs that overlapped a TAD boundary (shared across multiple cell lines) and a CTCF-only cCRE across all cell types from ENCODE.39 For duplications, we also required that the variant completely overlapped the misexpressed gene and an Enh or EnhG ChromHMM state from PBMCs.30 For deletions, we also required that the variant did not overlap the misexpressed gene. This led to the identification of four misexpression events with a candidate SV for altering 3D chromatin architecture.
Genome track visualization
Gviz v.1.38.4 was used to visualize genomic tracks and FusionInspector results.59
Results
Identification of misexpression events in whole blood
To identify misexpression events, we first defined a set of inactive genes with negligible or no detectable expression across the majority of the 4,568 whole-blood RNA-seq samples from the INTERVAL study (subjects and methods). To focus on genes more likely to be functional, we restricted our analysis to 29,614 autosomal protein-coding and long non-coding RNA genes with evidence of being expressed in at least one tissue from the GTEx project. From these, we identified 8,779 inactive genes that were expressed (TPM >0.1) in less than 5% of samples (Figure 1B). These comprised 3,173 (36.1%) protein-coding genes and 5,606 (63.9%) long non-coding RNAs. We confirmed that these genes were likely inactive using other whole-blood RNA-seq datasets, such as GTEx, and predicted chromatin states from PBMCs (Figure S2). To account for non-genetic drivers of misexpression, such as sequencing depth or variation in cell proportions, we removed 129 (1.5%) genes that were significantly correlated (|Spearman’s rho| > 0.2, FDR-adjusted p < 0.05) with any of 225 technical and cellular covariates. We transformed expression values into Z scores and identified 28,956 misexpression events (Z score >2 and TPM >0.5). Across all inactive gene-sample pairs, the proportion of misexpression events was low (0.07%, 28,956/39,513,200), with the number of events decreasing substantially at higher Z score thresholds (Figures 1C and S5). While individual misexpression events occurred rarely, in aggregate they were found in 51% of inactive genes (4,437/8,650) and in 96% of samples (4,386/4,568), with a median of four events per sample (Figure S5). A total of 34.6% of inactive protein-coding genes (1,070/3,092) and 60.6% of inactive lncRNA genes (3,367/5,558) were misexpressed at least once in our cohort (Table S4).
Misexpressed genes are shorter, depleted of developmental genes, and less tightly regulated
Next, we investigated the properties that differ between genes with and without misexpression events. We tested for enrichment of 82 gene-level features in genes that were misexpressed at least once (''misexpressed genes'') versus genes with no observed misexpression events (''non-misexpressed genes'') across different misexpression Z score thresholds and based on protein-coding status (Figures 1D, S3, and S6; subjects and methods). Overall, misexpressed genes were shorter, less likely to be protein-coding, less constrained according to both mutational (gnomAD LOEUF and missense OEUF) and non-mutational (EDS and Episcore) metrics, and less likely to be implicated in developmental diseases.33,35,36,44 These genes also had fewer predicted enhancer interactions based on both proximity- and activity-linking approaches from Wang and Goldstein,36 suggesting that they are under weaker regulatory control. Misexpressed genes were less likely to be expressed in brain, pituitary, and heart tissues and generally were expressed across fewer GTEx tissues (n = 52). Additionally, protein-coding misexpressed genes were underrepresented for HPO terms relating to phenotypic abnormalities of the nervous and musculoskeletal system (Figures 1E and S6; subjects and methods). This is of interest as congenital limb malformations are known to be caused by gene misexpression.8 Taken together, these results suggest that natural selection has acted to prevent the misexpression of genes important in developmental processes. Importantly, these results also demonstrate that our method for identifying gene misexpression is valid, as we would expect misexpression of inactive developmental genes to be deleterious and therefore underenriched in a generally healthy population cohort.
Rare SVs are associated with gene misexpression
To assess the influence of genetic variation on gene misexpression in cis, we conducted genetic variant enrichment analyses. Our analysis focused on 2,821 participants with both WGS and RNA-seq data in the INTERVAL study (subjects and methods). In total, we conducted 700 enrichment tests, determining significance using a Bonferroni-adjusted p value threshold (p < 0.05). Firstly, we tested whether rare (MAF <1%), low-frequency (1% ≤ MAF <5%), or common (5% ≤ MAF <50%) SNVs, indels (≤50 bp), or SVs (>50 bp) were enriched within the gene body and flanking sequence of genes involved in misexpression events (subjects and methods).
Across all tested Z score thresholds, we observed a significant enrichment of rare SVs around gene misexpression events, whereas no consistent significant enrichment was observed for low-frequency or common SVs across Z score thresholds (Figure 2A). The enrichment for rare SVs increased dramatically at increasing Z score thresholds, with 2.37% (19/803) of extreme misexpression events (Z score >40) having a nearby rare SV compared to 0.38% (66/17,380) of less extreme events (Z score >2, Figures 2A and S7). Notably, we did not find a significant enrichment for SNVs or indels at any MAF or Z score threshold (maximum enrichment SNVs = 1.04 and indels = 1.15) and even observed a significant weak underenrichment in some cases (Figures 2A and S8).
Figure 2.
Enrichment of genetic variants near to gene misexpression events
Across all figures, enrichments were calculated as the relative risk of having a nearby variant type or consequence given the misexpression status. Bars represent 95% Wald confidence intervals of the relative risk estimates. The line at enrichment = 1 indicates no enrichment; asterisks positioned either side of the line indicate significant enrichment or underenrichment after Bonferroni correction.
(A) Enrichment of SNVs, indels, and SVs within the gene body and flanking sequence of genes involved in misexpression events across different misexpression Z score thresholds and MAF cutoffs. A flanking sequence of ±10 kb was used around each gene.
(B) Enrichment of rare (MAF <1%) SNVs, indels and SVs within 200 kb genomic windows and the body of genes involved in misexpression events. The misexpression threshold shown is a Z score >10 and TPM >0.5.
(C) Enrichment of rare (MAF <1%) deletions and duplications in a ±200 kb window around genes involved in misexpression events across different misexpression Z score thresholds.
(D) Enrichment of rare (MAF <1%) deletions, duplications, and inversions within 200 kb genomic windows and the body of genes involved in misexpression events. The misexpression threshold shown is a Z score >10 and TPM >0.5.
(E) Enrichment of rare (MAF <1%) SVs, stratified by their class and predicted VEP consequence in a ±200 kb window around genes involved in misexpression events. The misexpression threshold shown is a Z score >10 and TPM >0.5. Only SV consequences with at least one Bonferroni significant enrichment at any Z score threshold are shown.
We examined whether the observed rare SV enrichment could be due to a small number of SVs leading to the misexpression of many genes. However, out of the 312 SVs within 200 kb of a misexpression event, 95% (297) were linked to only one gene and the remainder to a maximum of two genes. We also assessed whether the observed rare SV enrichment could be due to a small number of participants with a high number of SVs and misexpression events. Similarly, of the 206 participants containing misexpression events with a nearby SV, 89% (183) had only one misexpression event with an SV in cis.
To investigate the influence of genetic variation on gene misexpression over longer distances, we tested rare variant enrichment at increasing distances from genes. For each gene, we assigned each rare SNV, indel, and SV to a unique genomic window up to 1 Mb upstream or downstream and tested variant enrichment for each window independently (subjects and methods). Across all Z score thresholds, enrichment was highest for rare SVs within the gene body and decreased at greater distances from the misexpressed gene (Figures 2B and S9), remaining significant up to 200 kb upstream. Interestingly, rare SV enrichment was not symmetrical around misexpressed genes, with greater enrichment upstream of transcription start sites (TSSs) compared to downstream of the transcription termination site (TTS). Similarly to the gene-level analysis, we found that rare SNVs were not significantly enriched across any window or expression threshold and again observed significant weak underenrichment in some genomic windows (Figures 2B and S10). While indels did show a significant enrichment in some windows, the level of enrichment was much lower than SVs (maximum significant enrichment = 1.11) and was not consistently observed across all Z score thresholds (Figure S10).
Rare deletions, duplications, and inversions are associated with gene misexpression
We hypothesized that misexpression events are associated with a specific type of structural variation and therefore conducted enrichment tests for the four different SV classes available: deletions, duplications, inversions, and mobile element insertions (Figures 2C and S11; subjects and methods). Across all Z score thresholds, rare deletions and duplications were significantly enriched within a 200 kb window around misexpressed genes, with duplications consistently showing the highest enrichment. However, at this sample size and genomic window size, rare inversions and mobile element insertions were not significantly enriched.
Next, we tested whether different SV classes showed distinct patterns of enrichment at increasing distances from misexpressed genes (subjects and methods). Rare duplications and inversions were significantly enriched only within the gene body of the misexpressed gene (Figure 2D). For duplications, all tested Z score thresholds were significant, while for inversions this enrichment was significant only up to a Z score threshold of 10, likely due to the low number of inversion calls (Figure S11). Rare deletions were significantly enriched in the window 200 kb upstream of the TSS across all Z score thresholds (Figures 2D and S11). Rare deletions were also enriched within the gene body of the misexpressed gene, but this was only significant at higher Z score thresholds (Figure S11). Rare mobile element insertions were not significantly enriched within any tested window at any Z score threshold. No significant enrichment was observed at greater distances for any SV class.
We annotated each rare SV by its predicted consequence on the inactive genes in the tested window using VEP.46 For each SV class, we then tested for enrichment of predicted consequences ±200 kb around misexpressed genes relative to controls (Figures 2E and S12; subjects and methods). We found that inversions affecting coding regions had the highest enrichment of any variant consequence; however, this was only significant up to a Z score threshold of 20, again likely due to the low number of inversion calls. Deletions upstream and with no predicted effect on the tested gene were significantly enriched, as were deletions affecting non-coding transcripts at higher Z score thresholds. Additionally, duplications leading to transcript amplification and affecting coding regions as well as non-coding transcripts were significantly enriched. These results support the enrichment observed for SVs classes within specific genomic windows.
Properties and regulatory features of misexpression-associated SVs
To understand the general properties of misexpression-associated rare SVs, we compared a set of 105 misexpression-associated and 20,150 control SVs (subjects and methods). Of these 105 SVs, 87 were deletions, 16 were duplications, and 2 were inversions (Figure 3A). All the duplications were confirmed to be tandem duplications (subjects and methods). Notably, for 60% and 28% of deletions and duplications, VEP did not predict an effect on the misexpressed gene (Figure 3B). While the majority (72%) of duplications overlapped the misexpressed gene either entirely or partially, this was not the case for deletions (8% overlapping) (Figure 3C). Therefore, we analyzed the properties of deletions and duplications separately, excluding inversions due to their low numbers.
Figure 3.
Properties and regulatory features associated with misexpression-associated rare SVs
(A) Proportion of misexpression-associated and control deletions, duplications, inversions, and mobile element insertions.
(B and C) Proportion of misexpression-associated deletions and duplications by their predicted VEP consequence on the misexpressed gene (B) and position relative to the misexpressed gene (C).
(D) SV length distributions of misexpression-associated and control duplications and deletions. The lower, middle, and upper hinges of the boxplots correspond to the 25th percentile, median, and 75th percentile, respectively. p values were calculated using a one-sided Mann-Whitney test comparing the lengths of control and misexpression-associated SVs.
(E) Enrichment (x axis) of misexpression-associated deletions (left panel, red) and duplications (right panel, green) compared to controls for genomic scores (y axis) including evolutionary conservation (PhyloP), predicted deleteriousness (CADD-SV), constraint (gnomAD Z score constraint and gwRVIS), and HARs. Enrichments were calculated as the log-odds ratio with lines indicating 95% confidence intervals for the fitted parameters using the standard normal distribution. Asterisks indicate significant enrichment after Bonferroni correction.
(F) Enrichment of misexpression-associated deletions and duplications compared to controls for regulatory features including CTCF candidate cis-regulatory elements (cCREs) from ENCODE, TAD boundaries shared across multiple cell lines, A and B compartments, chromatin states from the Roadmap Epigenomics Project, and CpG islands from the UCSC genome browser. Enrichments were calculated as the log-odds ratio, and tiles shaded in gray do not pass Bonferroni correction.
First, we found that misexpression-associated deletions and duplications were on average 2.5 and 3.4 times longer, respectively, than control variants (p = 6.03 × 10−7 and p = 3.9 × 10−6, one-sided Mann-Whitney U test; Figure 3D). Since MAF and SV length are inversely correlated, we also compared the lengths of singletons and found that misexpression-associated deletions and duplications remained on average 2.8 and 2.0 times longer, respectively (p = 2.3 × 10−4 and p = 2.2 × 10−3, one-sided Mann-Whitney U test; Figure S4). To avoid the correlation between length and other genomic features driving enrichment, we included length as a covariate in subsequent enrichment analyses.
To investigate the importance of regions overlapping misexpression-associated SVs versus controls, we tested for enrichment of five different genomic scores spanning evolutionary conservation, constraint, and deleteriousness (Figures 3E and S4; subjects and methods). Both misexpression-associated deletions and duplications were significantly enriched within more conserved regions compared to controls (Bonferroni p < 0.05) and were predicted to be significantly more deleterious by CADD-SV (Bonferroni p < 0.05).37,48 However, only duplications were located in more constrained regions (Bonferroni p < 0.05).49,50 Neither misexpression-associated deletions nor duplications were significantly enriched for HARs (Bonferroni p ≥ 0.05).51
To determine whether misexpression-associated SVs were enriched in specific regulatory features compared to the control SVs, we annotated SVs with 23 regulatory features (Figures 3F and S4; subjects and methods). Misexpression-associated deletions were most strongly enriched for transcribed regions but were also significantly enriched for regions with weak repressed polycomb, CTCF-binding sites in B cells and neutrophils, CpG islands, enhancers, and TAD boundaries (Bonferroni p < 0.05). Misexpression-associated duplications were most strongly enriched for enhancers but also showed significant enrichment for transcribed regions, active promoters, CpG islands, and repressed polycomb (Bonferroni p < 0.05). Overall, these enrichments suggest that a subset of SVs may lead to gene misexpression via disruption of regulatory regions.
Deletions and duplications lead to chimeric misexpression via transcriptional readthrough
Next, we aimed to identify putative mechanisms whereby the 105 misexpression-associated SVs lead to gene misexpression. From the genetic variant and regulatory feature enrichment analysis, we hypothesized that a subset of deletions and duplications could cause transcriptional readthrough resulting in chimeric gene misexpression. Based on their position and genomic context, we identified 17 (16.2%) transcriptional readthrough candidate SVs (12 deletions, 5 duplications) from the 105 misexpression-associated SVs (Figures 4A and 4B; subjects and methods).
Figure 4.
Transcriptional readthrough leads to chimeric misexpression
(A) Schematic diagram of a deletion resulting in transcriptional readthrough and chimeric gene misexpression. Deletion of the transcription termination site of an expressed gene (green) leads to transcriptional readthrough. This results in misexpression of the usually inactive gene (blue) located downstream.
(B) Schematic diagram of a tandem duplication resulting in transcriptional readthrough and chimeric gene misexpression. The tandem duplication places an inactive gene (blue) downstream of an expressed gene (green) with no transcription termination site. This leads to transcriptional readthrough and misexpression of the usually inactive gene.
(C) FPKM and FBNC Z scores over the predicted readthrough regions for samples with candidate deletions (red triangles) and duplications (green squares), as well as samples with no candidate SVs (gray circles). Relationship between (D) FBNC Z score and (E) FPKM Z score with the respective misexpression Z score for samples with candidate deletions (red) and duplications (green), as well as samples with no candidate SVs (gray).
(F) Length of the predicted readthrough region for duplications (red) and deletions (green).
(G) Deletion of the 3′ end of ST6GAL1 results in transcriptional readthrough. Transcriptional readthrough leads to RTP1 misexpression (orange gene) and intergenic splicing between ST6GAL1 and RTP1 (intergenic reads, orange). In the sashimi plot, the line width corresponds to the number of reads spanning a given junction.
(H) Expression of RTP1 comparing a sample with deletion chr3:187,069,321–187,094,542 (GRCh38) to samples without the deletion. Red color indicates samples passing the misexpression threshold TPM >0.5 and Z score >2; gray samples are below this threshold.
DEL, deletion; DUP, duplication.
To assess whether these candidate SVs resulted in transcriptional readthrough, we computed FPKM and the FBNC over the predicted readthrough regions across all 4,568 RNA-seq samples (subjects and methods). We Z score-transformed FPKM and FBNC metrics to account for differing levels of background transcription at each locus. We considered a readthrough mechanism likely only where there was evidence of outlying transcription (FPKM Z score >2) across a larger region (FBNC Z score >2). For all samples with one of the 17 deletions and duplications, we observed aberrant (Z score >2) levels of both FPKM and FBNC over the predicted readthrough region (Figures 4C and S13). Furthermore, both Z scores were positively correlated with the level of gene misexpression across samples with a transcriptional readthrough candidate SV (FBNC Spearman’s rho = 0.74, p = 5.10 × 10−5, FPKM = 0.72, p = 1.11 × 10−4; Figures 4D and 4E), while this was not the case for samples with no candidate SV (FBNC Spearman’s rho = 0.05 and FPKM = 0.12). Together these results provide strong evidence that these SVs lead to transcriptional readthrough resulting in gene misexpression.
Of the 12 transcriptional readthrough deletions, only 5 were within 5 kb and therefore were annotated by VEP as upstream variants with respect to the misexpressed gene (Figure S13). Of the 5 transcriptional readthrough duplications, 2 were annotated as non-coding transcript variants and 3 as transcript amplifications (Figure S13). These VEP consequences supported the enrichments observed in Figure 2E. The median length of the readthrough region was 15 kb, but remarkably for one deletion (chr3:187,069,321–187,094,542 [GRCh38]) we observed misexpression of a gene 103 kb away (Figure 4F). At this locus, split reads revealed that intergenic splicing occurred between the expressed ST6GAL1 and the usually inactive RTP1 (Figure 4G). The sample with this deletion had highly aberrant expression levels of RTP1 relative to samples without the deletion (Figure 4H). RTP1 is normally expressed in multiple non-blood tissues, with the highest expression in the brain frontal cortex according to the GTEx project.28 We also observed evidence of intergenic splicing due to transcriptional readthrough for a deletion (chr5:77,674,588−77,771,600 [GRCh38]) involving misexpression of OTP. According to the GTEx project, OTP is normally expressed in the hypothalamus (Figure S13; subjects and methods).28
Deletions and duplications lead to chimeric misexpression via transcript fusion
Previous studies in rare diseases have demonstrated that pathogenic gene misexpression can occur via transcript fusion.7 Therefore, we hypothesized that a subset of deletions and duplications could lead to chimeric gene misexpression via transcript fusion. To assess this, we used STAR fusion to identify fusion transcripts (subjects and methods).56 We identified 12 fusion transcripts involving misexpressed genes that were consistently observed with a misexpression-associated SV within 200 kb. Out of these, we labeled 10 as high evidence and 2 as low evidence using STAR fusion’s filtering criteria (subjects and methods) and focused our mechanistic analysis on fusion transcripts with high evidence. Of these, 3 were associated with duplications and 7 with deletions.
We had described 2 of the 7 deletion-associated fusion transcripts previously as being the result of transcriptional readthrough and intergenic splicing involving RTP1 and OTP (Figures 4H and S13). Out of the other deletions, 2 removed the 3′ end and TTS of an active gene and 5′ end of an inactive gene on the same strand, resulting in the inactive gene coming under the control of an active promoter (Figures 5A and S14). For the remaining 3 deletion-associated fusion events, the mechanism was unclear. This may be due to failure to detect more complex rearrangements at these loci or because these variants are non-causal.
Figure 5.
Transcript fusion and gene inversions lead to gene misexpression
(A) Schematic diagram of a deletion resulting in transcript fusion and chimeric gene misexpression. The deletion of the 3′ end of an active gene (green) and 5′ end of an inactive gene (blue) results in a fusion transcript containing portions of the active and inactive gene’s transcripts.
(B) Schematic diagram of a tandem duplication resulting in transcript fusion and gene misexpression. The duplication of the 3′ end of an inactive gene (blue) and 5′ end of an active gene (green) in tandem results in a fusion transcript containing portions of the active and inactive gene’s transcripts.
(C) Expression of MYH1 comparing a sample with duplication chr17:10,078,018–10,512,685 (GRCh38) to samples without this duplication. Red color indicates samples passing the misexpression threshold TPM >0.5 and Z score >2; gray samples are below this threshold.
(D) FusionInspector visualization of the GAS7–MYH1 fusion transcript. Duplication chr17:10,078,018–10,512,685 (GRCh38) breakpoints are labeled in green. Introns have been shortened for visualization, and breakpoint positions have been approximated accordingly. In the sashimi plot, the line width corresponds to the number of reads spanning a given junction. The misexpressed gene and the fusion reads are highlighted in orange.
(E) Expression of ROPN1B comparing samples with inversion chr3:125,966,617–125,980,782 (GRCh38) and samples without this inversion. Red color indicates samples passing the misexpression threshold TPM >0.5 and Z score >2; gray samples are below this threshold.
(F) Location of inversion chr3:125,966,617–125,980,782 (GRCh38) showing all ROPN1B transcripts. The Ensembl canonical transcript is labeled with an asterisk and the major misexpressed transcript with a double asterisk.
(G) Percentage expression of ROPN1B transcripts for all samples with inversion chr3:125,966,617–125,980,782 (GRCh38) compared to the average transcript percentage across 170 samples without the inversion and with non-zero transcript expression.
(H) Distribution of misexpression Z scores across different types of misexpression mechanisms. Text labels indicate the number of misexpression events for each putative mechanism.
DEL, deletion; DUP, duplication; INV, inversion.
All 3 duplications resulted in the 3′ end of an inactive gene being positioned within an active gene on the same strand (Figures 5B and S14). This leads to part of the inactive gene coming under the control of an active promoter. One of the duplications (chr17:10,078,018−10,512,685 [GRCh38]) was associated with a GAS7-MYH1 fusion transcript (Figure 5C). The duplication’s breakpoints were consistent with the structure of the fusion transcript (Figure 5D). MYH1 is normally exclusively expressed in skeletal muscle tissue.28 FusionInspector predicted that this fusion transcript contained a novel open reading frame resulting from the in-frame concatenation of 61 N-terminal residues of GAS7 and 1,603 C-terminal residues of MYH1 (total predicted protein length 1,664 residues).57 Similar fusion reads have been detected previously in multiple cancer samples from lung, stomach, and intestine.60,61
Inverting gene orientation is associated with non-chimeric misexpression
Gene misexpression was not limited to transcriptional readthrough and transcript fusion mechanisms. We observed that ROPN1B had consistently elevated expression across 10 samples with an inversion (chr3:125,966,617−125,980,782 [GRCh38]) spanning the 5′ end of the gene (Figures 5E and 5F). Transcript quantification using Salmon showed that ROPN1B misexpression was transcript specific, with the major misexpressed transcript being completely contained within the inversion (Figure 5G).58 Inverting ROPN1B’s orientation may lead to ectopic enhancer-promoter contacts resulting in misexpression, but this cannot be confirmed with the data available. Additionally, we observed elevated expression of intestinal alkaline phosphatase (ALPI) across 6 participants carrying two deletions (chr2:232,375,546−232,379,537 and chr2:232,428,106−232,431,877 [GRCh38]) in cis (Figure S15). However, the mechanism by which these deletions lead to misexpression is unclear.
Overall, we have identified a putative mechanism for 42% (41/98) of events with a misexpression-associated SV in cis. Out of these mechanisms, transcript fusion on average led to the most extreme levels of misexpression and gene inversions the lowest (Figure 5H). We manually inspected the remaining 57 events but could not identify SVs with shared mechanisms that could explain the observed misexpression. Interestingly, only 4 of these events had an SV in cis that overlapped a TAD boundary and CTCF-binding site in the required orientation to result in misexpression via rearrangements in 3D chromatin architecture (subjects and methods). However, we could not confirm that these SVs were causal. This result suggests that in our cohort 3D genome rearrangements leading to gene misexpression may be exceedingly rare.
Discussion
In this study, we have developed gene misexpression as a type of transcriptomic outlier analysis and conducted a genome-wide characterization of the gene misexpression landscape using bulk RNA-seq in a cohort of 4,568 blood donors. We found that misexpression events occur in the majority of samples and in a third of inactive protein-coding genes. By integrating WGS and RNA-seq data, we assessed the influence of genetic variation on gene misexpression, demonstrating that these events are enriched for rare SVs in cis. We also show that a subset of SVs lead to misexpression via specific mechanisms, including transcriptional readthrough, transcript fusion, and inverting gene orientation. These findings extend our understanding of gene misexpression and its genetic mechanisms beyond the limited number of samples and disease-relevant loci where misexpression had previously been described.
Large-scale RNA-seq studies have found that different categories of transcriptional outliers are enriched for distinct types of genetic variation.9,11 We found that although only a small proportion of misexpression events had a nearby rare SV, this represented a strong enrichment compared to non-misexpression events in the same genes. Compared to SV enrichment in other outlier types measured across multiple tissues, this rare SV enrichment occurred at shorter distances from the TSS.11 Although rare disease studies have demonstrated that SNVs and indels can lead to gene misexpression,6 we did not observe a consistent significant enrichment for these types of genetic variation in this large, predominantly healthy cohort. These results emphasize the disproportionate impact that large genetic perturbations have in influencing gene expression.9,10,11,12 Unlike in previous studies focusing on SVs,10,12 we identified multiple regulatory features that were enriched in misexpression-associated SVs and used these features to identify specific misexpression mechanisms.
Previous studies of rare diseases have observed that non-chimeric misexpression, where an individual inactive gene is aberrantly transcribed, can result from rearrangements in 3D chromatin architecture resulting in changes in enhancer-promoter interactions (enhancer adoption).8 However, in our cohort, we were unable to identify such events with confidence. This might indicate that these events are exceedingly rare in healthy human populations. Alternatively, our approach may fail to identify these events due to a lack of high-resolution, context-specific Hi-C data. However, parallels can be drawn between the SV mechanisms resulting in chimeric misexpression we have observed here and those involving rearrangements in 3D chromatin architecture. While SVs leading to chimeric misexpression via transcript fusion or transcriptional readthrough place an inactive gene under the control of a different active promoter, SVs resulting in enhancer adoption place an inactive promoter under the control of active enhancers. Therefore, a consistent theme across different SV misexpression mechanisms is changes to the regulatory environment of an inactive gene through alterations to either its promoter or enhancers.
It is important to highlight the limitations of this study. Firstly, we have only analyzed gene misexpression within whole blood in predominantly European participants, and further studies should examine the prevalence of misexpression across different populations, tissues, and cell types as well as in disease. Indeed, a link between misexpression and disease is more likely to be detected in the relevant disease tissue rather than in whole blood. We further note that misexpression events occurred more frequently in shorter genes, which may be due to biological or technical effects. Secondly, we have focused on high-confidence events by using a stringent TPM expression threshold, acknowledging that we are likely missing some true misexpression events with this approach. This threshold is arbitrary, and the level of misexpression required to influence cellular processes is likely to vary across genes and contexts. However, our enrichment results for rare SVs were consistent and increased at higher misexpression Z score thresholds, suggesting that we are effectively capturing genetic associations. Thirdly, due to the technical difficulties of calling SVs in short-read genome sequencing, we may be unable to detect some SVs that lead to misexpression, and therefore the proportion of misexpression events associated with an SV is likely underestimated. Some of the misexpression events not associated with SVs may be due to non-genetic mechanisms such as leaky transcription, chromatin plasticity, or specific environmental cues. Finally, when interpreting the consequences of rare SVs, we have focused on mechanisms that are shared by multiple variants or events where the SV is found in multiple samples. Therefore, we are biased toward detecting more common misexpression mechanisms and may miss additional mechanisms caused by ultra-rare SVs.
Interpreting the functional effects of rare genetic variation remains challenging and is important for understanding the molecular mechanisms by which variants influence human traits. Here, we have extended our understanding of how genetic variants influence gene expression. The fact that rare SVs can induce misexpression not just in the rare disease context should be taken into account in future studies when cataloging and interpreting their effects in population cohorts. This is especially important for SVs associated with human complex diseases, as it is currently unknown what fraction of these SVs may mediate their phenotypic effects by causing gene misexpression.
Data and code availability
The INTERVAL study data used in this paper are available to bona fide researchers by emailing helpdesk@intervalstudy.org.uk. The data access policy for the data is available by emailing CEU-DataAccess@medschl.cam.ac.uk. The RNA-seq data (n = 4,732 INTERVAL participants) have been deposited at the EGA under the accession number EGAD00001008015. The WGS data have been deposited at the EGA under accession number EGAD00001008661. The Nextflow pipeline used for STAR and Salmon alignments is available at https://github.com/wtsi-hgi/nextflow-pipelines/blob/rna_seq_interval_5591/pipelines/rna_seq.nf. Custom code used for analysis of processed sequencing data is available here: https://github.com/tvdStichele/interval_misexpression_manuscript.
Acknowledgments
We thank members of the Davenport and Parts laboratories for helpful discussion and feedback. In particular, we thank Megan Gozzard, Jacob Hepkema, Matthew Hurles, Seri Kitada, Andrew Lawson, Julie Matte, and Juliane Weller for providing comments on the manuscript and discussing results. We thank the Wellcome Sanger Institute’s Human Genetics Informatics (HGI) team for mapping the bulk RNA-seq reads. This research was funded in whole, or in part, by the Wellcome Trust (grant numbers 206194 and 220540/Z/20/A). This study makes use of data generated by the DECIPHER community. A full list of centers who contributed to the generation of the data is available from https://deciphergenomics.org/about/stats and via email from contact@deciphergenomics.org. Funding for the DECIPHER project was provided by Wellcome (grant number WT223718/Z/21/Z). The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The data used for the analyses described in this manuscript were obtained from the GTEx portal on 09/20/2023. For the purpose of open access, the author has applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission. Further acknowledgments can be found in the supplemental information.
Author contributions
Conceptualization: T.V.; data analysis: T.V., K.L.B., W.L., B.H., K.W., K.K., E.P., J.M., A.P.N.; provided resources: B.H., K.W., K.K., E.P., J.M., A.P.N., D.J.R., E.D.A.; funding acquisition: E.E.D., D.S.P., M.I., N.S., A.P., A.B., A.S.B., J.D., S.P.; interpreted results: T.V., K.L.B., E.E.D., N.d.K., M.T., D.S.P., M.I., L.P., J.K., A.T., E.P.; supervision: E.E.D., L.P., D.S.P., N.S., M.I., A.P., A.B., A.S.B.; writing, original draft: T.V., K.L.B., E.E.D.; all authors reviewed and edited the manuscript.
Declaration of interests
The authors declare the following interests: T.V. has received PhD studentship funding from AstraZeneca. J.M. completed this work while employed by the University of Cambridge but is now an employee of Genomics plc. S.P. is a current employee and stockholder of AstraZeneca. D.J.R. is an employee of NHS Blood and Transplant. A.B. is currently an employee of Bayer AG, Research and Early Development Precision Medicine, Research & Development, Pharmaceutical Division, Wuppertal, DE. A.P. is a current employee and stockholder of AstraZeneca. D.S.P. is a current employee and stockholder of AstraZeneca. K.K. is a current employee and stockholder of AstraZeneca. J.D. serves on scientific advisory boards for AstraZeneca, Novartis, and UK Biobank and has received multiple grants from academic, charitable, and industry sources outside of the submitted work. M.I. is a trustee of the Public Health Genomics (PHG) Foundation, is a member of the Scientific Advisory Board of Open Targets, and has a research collaboration with AstraZeneca, which is unrelated to this study.
Published: July 24, 2024
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2024.06.017.
Web resources
A/B compartments in GM12878, https://data.4dnucleome.org/files-processed/4DNFILYQ1PAY/
ChromHMM PBMC states, https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E062_15_coreMarks_hg38lift_mnemonics.bed.gz
Cosmic Cancer Gene Census v.97, https://cancer.sanger.ac.uk/census
CpG islands, http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/cpgIslandExt.txt.gz
DECIPHER gene list, https://www.ebi.ac.uk/gene2phenotype/downloads/DDG2P.csv.gz
ENCODE cCREs all CTCF-only sites, https://downloads.wenglab.org/Registry-V3/GRCh38-cCREs.CTCF-only.bed
ENCODE cCREs B cells, https://downloads.wenglab.org/Registry-V3/Seven-Group/ENCFF035DJL.7group.bed
ENCODE cCREs CD14+ monocytes, https://downloads.wenglab.org/Registry-V3/Seven-Group/ENCFF389PZY_ENCFF587XGD_ENCFF184NWF_ENCFF496PSJ.7group.bed
ENCODE cCREs neutrophils, https://downloads.wenglab.org/Registry-V3/Seven-Group/ENCFF685DZI_ENCFF311TAY_ENCFF300LXQ.7group.bed
Ensembl v.97, http://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
GENCODE v.31, https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.annotation.gtf.gz
GERP++ elements, https://bds.mpi-cbg.de/hillerlab/120MammalAlignment/Human120way/data/conservation/gerpElements_hg38_multiz120Mammals.bed.gz
gnomAD constraint Z scores, https://gnomad.broadinstitute.org/downloads#v3-genomic-constraint
gnomAD gene lists, https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-020-2308-7/MediaObjects/41586_2020_2308_MOESM4_ESM.zip
GTEx v.8 eQTL data all tissues, https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_eQTL_EUR.tar
GTEx v.8 eQTL expression matrices, https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_eQTL_expression_matrices.tar
GTEx v.8 median TPM per tissue, https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz
GTEx v.8 read count whole blood, https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/gene_tpm/gene_tpm_2017-06-05_v8_whole_blood.gct.gz
gwRVIS, https://az.app.box.com/v/jarvis-gwrvis-scores/folder/159704875574
HARs, https://ftp.ncbi.nlm.nih.gov/geo/series/GSE180nnn/GSE180714/suppl/GSE180714%5FHARs%2Ebed%2Egz
OMIM, https://www.omim.org
OpenTargets targets information, http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/22.11/output/etl/json/targets/
pHaplo and pTriplo scores, https://ars.els-cdn.com/content/image/1-s2.0-S0092867422007887-mmc7.xlsx
PhyloP 100-way, http://hgdownload.soe.ucsc.edu/goldenPath/hg38/phyloP100way/hg38.phyloP100way.bw
polyASite 2.0 database polyA sites, https://polyasite.unibas.ch/download/atlas/2.0/GRCh38.96/atlas.clusters.2.0.GRCh38.96.bed.gz
polyASite 2.0 database polyA sites with sample information, https://polyasite.unibas.ch/download/atlas/2.0/GRCh38.96/atlas.clusters.2.0.GRCh38.96.tsv.gz
sv-pipeline, https://github.com/hall-lab/sv-pipeline
TAD boundaries in GM12878, https://data.4dnucleome.org/files-processed/4DNFIVK5JOFU/
TAD boundaries in HMEC, https://data.4dnucleome.org/files-processed/4DNFIJL18YS3/
TAD boundaries in HNEK, https://data.4dnucleome.org/files-processed/4DNFICLU9GUP/
TAD boundaries in HUVEC, https://data.4dnucleome.org/files-processed/4DNFI9MZWZF7/
TAD boundaries in IMR90, https://data.4dnucleome.org/files-processed/4DNFIMNT2VYL/
Supplemental information
References
- 1.Prelich G. Gene overexpression: uses, mechanisms, and interpretation. Genetics. 2012;190:841–854. doi: 10.1534/genetics.111.136911. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Halder G., Callaerts P., Gehring W.J. Induction of ectopic eyes by targeted expression of the eyeless gene in Drosophila. Science. 1995;267:1788–1792. doi: 10.1126/science.7892602. [DOI] [PubMed] [Google Scholar]
- 3.Northcott P.A., Lee C., Zichner T., Stütz A.M., Erkek S., Kawauchi D., Shih D.J.H., Hovestadt V., Zapatka M., Sturm D., et al. Enhancer hijacking activates GFI1 family oncogenes in medulloblastoma. Nature. 2014;511:428–434. doi: 10.1038/nature13379. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Weischenfeldt J., Dubash T., Drainas A.P., Mardin B.R., Chen Y., Stütz A.M., Waszak S.M., Bosco G., Halvorsen A.R., Raeder B., et al. Pan-cancer analysis of somatic copy-number alterations implicates IRS4 and IGF2 in enhancer hijacking. Nat. Genet. 2017;49:65–74. doi: 10.1038/ng.3722. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Lupiáñez D.G., Kraft K., Heinrich V., Krawitz P., Brancati F., Klopocki E., Horn D., Kayserili H., Opitz J.M., Laxova R., et al. Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell. 2015;161:1012–1025. doi: 10.1016/j.cell.2015.04.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Wakeling M.N., Owens N.D.L., Hopkinson J.R., Johnson M.B., Houghton J.A.L., Dastamani A., Flaxman C.S., Wyatt R.C., Hewat T.I., Hopkins J.J., et al. Non-coding variants disrupting a tissue-specific regulatory element in HK1 cause congenital hyperinsulinism. Nat. Genet. 2022;54:1615–1620. doi: 10.1038/s41588-022-01204-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Kempf E., Landgraf K., Stein R., Hanschkow M., Hilbert A., Abou Jamra R., Boczki P., Herberth G., Kühnapfel A., Tseng Y.-H., et al. Aberrant expression of agouti signaling protein (ASIP) as a cause of monogenic severe childhood obesity. Nat. Metab. 2022;4:1697–1712. doi: 10.1038/s42255-022-00703-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Weischenfeldt J., Ibrahim D.M. When 3D genome changes cause disease: the impact of structural variations in congenital disease and cancer. Curr. Opin. Genet. Dev. 2023;80 doi: 10.1016/j.gde.2023.102048. [DOI] [PubMed] [Google Scholar]
- 9.Li X., Kim Y., Tsang E.K., Davis J.R., Damani F.N., Chiang C., Hess G.T., Zappala Z., Strober B.J., Scott A.J., et al. The impact of rare variation on gene expression across tissues. Nature. 2017;550:239–243. doi: 10.1038/nature24267. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Chiang C., Scott A.J., Davis J.R., Tsang E.K., Li X., Kim Y., Hadzic T., Damani F.N., Ganel L., Consortium G.T.E., et al. The impact of structural variation on human gene expression. Nat. Genet. 2017;49:692–699. doi: 10.1038/ng.3834. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Ferraro N.M., Strober B.J., Einson J., Abell N.S., Aguet F., Barbeira A.N., Brandt M., Bucan M., Castel S.E., Davis J.R., et al. Transcriptomic signatures across human tissues identify functional rare genetic variation. Science. 2020;369 doi: 10.1126/science.aaz5900. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Scott A.J., Chiang C., Hall I.M. Structural variants are a major source of gene expression differences in humans and often affect multiple nearby genes. Genome Res. 2021;31:2249–2257. doi: 10.1101/gr.275488.121. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Smail C., Ferraro N.M., Hui Q., Durrant M.G., Aguirre M., Tanigawa Y., Keever-Keigher M.R., Rao A.S., Justesen J.M., Li X., et al. Integration of rare expression outlier-associated variants improves polygenic risk prediction. Am. J. Hum. Genet. 2022;109:1055–1064. doi: 10.1016/j.ajhg.2022.04.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Moore C., Sambrook J., Walker M., Tolkien Z., Kaptoge S., Allen D., Mehenny S., Mant J., Di Angelantonio E., Thompson S.G., et al. The INTERVAL trial to determine whether intervals between blood donations can be safely and acceptably decreased to optimise blood supply: study protocol for a randomised controlled trial. Trials. 2014;15:363. doi: 10.1186/1745-6215-15-363. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Di Angelantonio E., Thompson S.G., Kaptoge S., Moore C., Walker M., Armitage J., Ouwehand W.H., Roberts D.J., Danesh J., INTERVAL Trial Group Efficiency and safety of varying the frequency of whole blood donation (INTERVAL): a randomised trial of 45 000 donors. Lancet. 2017;390:2360–2371. doi: 10.1016/S0140-6736(17)31928-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 2013 doi: 10.48550/arXiv.1303.3997. Preprint at. [DOI] [Google Scholar]
- 17.McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., DePristo M.A. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Howell B. University of Cambridge; 2022. The Contribution of Structural Variants to 2,095 Molecular Phenotypes in 12,354 European Ancestry Individuals. PhD thesis. [DOI] [Google Scholar]
- 19.Handsaker R.E., Korn J.M., Nemesh J., McCarroll S.A. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat. Genet. 2011;43:269–276. doi: 10.1038/ng.768. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Layer R.M., Chiang C., Quinlan A.R., Hall I.M. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 2014;15:R84. doi: 10.1186/gb-2014-15-6-r84. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Abyzov A., Urban A.E., Snyder M., Gerstein M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011;21:974–984. doi: 10.1101/gr.114876.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Larson D.E., Abel H.J., Chiang C., Badve A., Das I., Eldred J.M., Layer R.M., Hall I.M. svtools: population-scale analysis of structural variation. Bioinformatics. 2019;35:4782–4787. doi: 10.1093/bioinformatics/btz492. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Abel H.J., Larson D.E., Regier A.A., Chiang C., Das I., Kanchi K.L., Layer R.M., Neale B.M., Salerno W.J., Reeves C., et al. Mapping and characterization of structural variation in 17,795 human genomes. Nature. 2020;583:83–89. doi: 10.1038/s41586-020-2371-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Tokolyi A., Persyn E., Nath A.P., Burnham K.L., Marten J., Vanderstichele T., Tardaguila M., Stacey D., Farr B., Iyer V., et al. Genetic determinants of blood gene expression and splicing and their contribution to molecular phenotypes and health outcomes. medRxiv. 2023 doi: 10.1101/2023.11.25.23299014. Preprint at. [DOI] [Google Scholar]
- 25.Dobin A., Davis C.A., Schlesinger F., Drenkow J., Zaleski C., Jha S., Batut P., Chaisson M., Gingeras T.R. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Liao Y., Smyth G.K., Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30:923–930. doi: 10.1093/bioinformatics/btt656. [DOI] [PubMed] [Google Scholar]
- 27.Fort A., Panousis N.I., Garieri M., Antonarakis S.E., Lappalainen T., Dermitzakis E.T., Delaneau O. MBV: a method to solve sample mislabeling and detect technical bias in large combined genotype and sequencing assay datasets. Bioinformatics. 2017;33:1895–1897. doi: 10.1093/bioinformatics/btx074. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.GTEx Consortium The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369:1318–1330. doi: 10.1126/science.aaz1776. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Frankish A., Diekhans M., Jungreis I., Lagarde J., Loveland J.E., Mudge J.M., Sisu C., Wright J.C., Armstrong J., Barnes I., et al. GENCODE 2021. Nucleic Acids Res. 2021;49:D916–D923. doi: 10.1093/nar/gkaa1087. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Roadmap Epigenomics Consortium. Kundaje A., Meuleman W., Ernst J., Bilenky M., Yen A., Heravi-Moussavi A., Kheradpour P., Zhang Z., Wang J., et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–330. doi: 10.1038/nature14248. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Võsa U., Claringbould A., Westra H.-J., Bonder M.J., Deelen P., Zeng B., Kirsten H., Saha A., Kreuzhuber R., Yazar S., et al. Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat. Genet. 2021;53:1300–1310. doi: 10.1038/s41588-021-00913-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Aran D., Hu Z., Butte A.J. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 2017;18:220. doi: 10.1186/s13059-017-1349-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Karczewski K.J., Francioli L.C., Tiao G., Cummings B.B., Alföldi J., Wang Q., Collins R.L., Laricchia K.M., Ganna A., Birnbaum D.P., et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434–443. doi: 10.1038/s41586-020-2308-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Collins R.L., Glessner J.T., Porcu E., Lepamets M., Brandon R., Lauricella C., Han L., Morley T., Niestroj L.-M., Ulirsch J., et al. A cross-disorder dosage sensitivity map of the human genome. Cell. 2022;185:3041–3055.e25. doi: 10.1016/j.cell.2022.06.036. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Han X., Chen S., Flynn E., Wu S., Wintner D., Shen Y. Distinct epigenomic patterns are associated with haploinsufficiency and predict risk genes of developmental disorders. Nat. Commun. 2018;9:2138. doi: 10.1038/s41467-018-04552-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Wang X., Goldstein D.B. Enhancer Domains Predict Gene Pathogenicity and Inform Gene Discovery in Complex Disease. Am. J. Hum. Genet. 2020;106:215–233. doi: 10.1016/j.ajhg.2020.01.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Pollard K.S., Hubisz M.J., Rosenbloom K.R., Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010;20:110–121. doi: 10.1101/gr.097857.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Davydov E.V., Goode D.L., Sirota M., Cooper G.M., Sidow A., Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++ PLoS Comput. Biol. 2010;6 doi: 10.1371/journal.pcbi.1001025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Rao S.S.P., Huntley M.H., Durand N.C., Stamenova E.K., Bochkov I.D., Robinson J.T., Sanborn A.L., Machol I., Omer A.D., Lander E.S., Aiden E.L. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014;159:1665–1680. doi: 10.1016/j.cell.2014.11.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Dekker J., Belmont A.S., Guttman M., Leshyk V.O., Lis J.T., Lomvardas S., Mirny L.A., O’Shea C.C., Park P.J., Ren B., et al. The 4D nucleome project. Nature. 2017;549:219–226. doi: 10.1038/nature23884. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Tate J.G., Bamford S., Jubb H.C., Sondka Z., Beare D.M., Bindal N., Boutselakis H., Cole C.G., Creatore C., Dawson E., et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res. 2019;47:D941–D947. doi: 10.1093/nar/gky1015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Ochoa D., Hercules A., Carmona M., Suveges D., Baker J., Malangone C., Lopez I., Miranda A., Cruz-Castillo C., Fumis L., et al. The next-generation Open Targets Platform: reimagined, redesigned, rebuilt. Nucleic Acids Res. 2023;51:D1353–D1359. doi: 10.1093/nar/gkac1046. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Firth H.V., Richards S.M., Bevan A.P., Clayton S., Corpas M., Rajan D., Van Vooren S., Moreau Y., Pettett R.M., Carter N.P. DECIPHER: Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources. Am. J. Hum. Genet. 2009;84:524–533. doi: 10.1016/j.ajhg.2009.03.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Amberger J.S., Bocchini C.A., Schiettecatte F., Scott A.F., Hamosh A. Omim.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res. 2015;43:D789–D798. doi: 10.1093/nar/gku1205. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Reimand J., Kull M., Peterson H., Hansen J., Vilo J. g:Profiler--a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res. 2007;35:W193–W200. doi: 10.1093/nar/gkm226. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.McLaren W., Gil L., Hunt S.E., Riat H.S., Ritchie G.R.S., Thormann A., Flicek P., Cunningham F. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:122. doi: 10.1186/s13059-016-0974-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Robinson J.T., Thorvaldsdóttir H., Winckler W., Guttman M., Lander E.S., Getz G., Mesirov J.P. Integrative genomics viewer. Nat. Biotechnol. 2011;29:24–26. doi: 10.1038/nbt.1754. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Kleinert P., Kircher M. A framework to score the effects of structural variants in health and disease. Genome Res. 2022;32:766–777. doi: 10.1101/gr.275995.121. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Chen S., Francioli L.C., Goodrich J.K., Collins R.L., Kanai M., Wang Q., Alföldi J., Watts N.A., Vittal C., Gauthier L.D., et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature. 2024;625:92–100. doi: 10.1038/s41586-023-06045-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Vitsios D., Dhindsa R.S., Middleton L., Gussow A.B., Petrovski S. Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning. Nat. Commun. 2021;12:1504. doi: 10.1038/s41467-021-21790-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Girskis K.M., Stergachis A.B., DeGennaro E.M., Doan R.N., Qian X., Johnson M.B., Wang P.P., Sejourne G.M., Nagy M.A., Pollina E.A., et al. Rewiring of human neurodevelopmental gene regulatory programs by human accelerated regions. Neuron. 2021;109:3239–3251.e7. doi: 10.1016/j.neuron.2021.08.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Gardiner-Garden M., Frommer M. CpG islands in vertebrate genomes. J. Mol. Biol. 1987;196:261–282. doi: 10.1016/0022-2836(87)90689-9. [DOI] [PubMed] [Google Scholar]
- 53.ENCODE P.C., Moore J.E., Purcaro M.J., Pratt H.E., Epstein C.B., Shoresh N., Adrian J., Kawli T., Davis C.A., Dobin A., et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020;583:699–710. doi: 10.1038/s41586-020-2493-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Herrmann C.J., Schmidt R., Kanitz A., Artimo P., Gruber A.J., Zavolan M. PolyASite 2.0: a consolidated atlas of polyadenylation sites from 3’ end sequencing. Nucleic Acids Res. 2020;48:D174–D179. doi: 10.1093/nar/gkz918. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Haas B.J., Dobin A., Li B., Stransky N., Pochet N., Regev A. Accuracy assessment of fusion transcript detection via read-mapping and de novo fusion transcript assembly-based methods. Genome Biol. 2019;20:213. doi: 10.1186/s13059-019-1842-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Haas B.J., Dobin A., Ghandi M., Van Arsdale A., Tickle T., Robinson J.T., Gillani R., Kasif S., Regev A. Targeted in silico characterization of fusion transcripts in tumor and normal tissues via FusionInspector. Cell Rep. Methods. 2023;3 doi: 10.1016/j.crmeth.2023.100467. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Patro R., Duggal G., Love M.I., Irizarry R.A., Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods. 2017;14:417–419. doi: 10.1038/nmeth.4197. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Hahne F., Ivanek R. Visualizing Genomic Data Using Gviz and Bioconductor. Methods Mol. Biol. 2016;1418:335–351. doi: 10.1007/978-1-4939-3578-9_16. [DOI] [PubMed] [Google Scholar]
- 60.Walsh P.S., Hao Y., Ding J., Qu J., Wilde J., Jiang R., Kloos R.T., Huang J., Kennedy G.C. Maximizing Small Biopsy Patient Samples: Unified RNA-Seq Platform Assessment of over 120,000 Patient Biopsies. J. Personalized Med. 2022;13 doi: 10.3390/jpm13010024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Vellichirammal N.N., Albahrani A., Banwait J.K., Mishra N.K., Li Y., Roychoudhury S., Kling M.J., Mirza S., Bhakat K.K., Band V., et al. Pan-Cancer Analysis Reveals the Diverse Landscape of Novel Sense and Antisense Fusion Transcripts. Mol. Ther. Nucleic Acids. 2020;19:1379–1398. doi: 10.1016/j.omtn.2020.01.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The INTERVAL study data used in this paper are available to bona fide researchers by emailing helpdesk@intervalstudy.org.uk. The data access policy for the data is available by emailing CEU-DataAccess@medschl.cam.ac.uk. The RNA-seq data (n = 4,732 INTERVAL participants) have been deposited at the EGA under the accession number EGAD00001008015. The WGS data have been deposited at the EGA under accession number EGAD00001008661. The Nextflow pipeline used for STAR and Salmon alignments is available at https://github.com/wtsi-hgi/nextflow-pipelines/blob/rna_seq_interval_5591/pipelines/rna_seq.nf. Custom code used for analysis of processed sequencing data is available here: https://github.com/tvdStichele/interval_misexpression_manuscript.