Bioinformatics. 2019 Mar 23;35(20):3906–3912. doi: 10.1093/bioinformatics/btz202

ORE identifies extreme expression effects enriched for rare variants

F Richter 1, G E Hoffman 2,3, K B Manheimer 4, N Patel 5, A J Sharp 3,5, D McKean 6, S U Morton 6, S DePalma 6, J Gorham 6, A Kitaygorodksy 7, G A Porter Jr 8, A Giardini 9, Y Shen 7,10, W K Chung 11, J G Seidman 6, C E Seidman 6, E E Schadt 2,3,4,2, B D Gelb 3,5,12,✉,2
Editor: Oliver Stegle
PMCID: PMC6792115  PMID: 30903145

Abstract

Motivation

Non-coding rare variants (RVs) may contribute to Mendelian disorders but have been challenging to study due to small sample sizes, genetic heterogeneity and uncertainty about relevant non-coding features. Previous studies identified RVs associated with expression outliers, but varying outlier definitions were employed and no comprehensive open-source software was developed.

Results

We developed Outlier-RV Enrichment (ORE) to identify biologically-meaningful non-coding RVs. We implemented ORE combining whole-genome sequencing and cardiac RNAseq from congenital heart defect patients from the Pediatric Cardiac Genomics Consortium and deceased adults from Genotype-Tissue Expression. Use of rank-based outliers maximized sensitivity while a most extreme outlier approach maximized specificity. Rarer variants had stronger associations, suggesting they are under negative selective pressure and providing a basis for investigating their contribution to Mendelian disorders.

Availability and implementation

ORE, source code, and documentation are available at https://pypi.python.org/pypi/ore under the MIT license.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Quantitative trait loci (QTLs) are widely used to relate non-coding genetic variation to gene expression and other molecular phenotypes (Schadt et al., 2003). QTLs are limited by requiring multiple observations of the same variant in the same biological context (e.g. similar cell types, epistatic relationships and epigenetic backgrounds), to allow for sufficient power to detect correlations between gene expression and common genetic variation. These conditions also limit the power to study non-coding rare variants (RVs), which are more frequent and some of which are hypothesized to contribute significantly to disease heritability (Gibson, 2012).

One method to address this challenge is to study associations between RVs and outliers in gene expression (Chiang et al., 2017; Cummings et al., 2017; Kremer et al., 2017; Li et al., 2014, 2017; Montgomery et al., 2011; Pala et al., 2017; Zeng et al., 2015; Zhao et al., 2016). RNAseq from relevant tissue has become increasingly recognized as a tool for implicating outliers in Mendelian disorders (Cummings et al., 2017; Kremer et al., 2017), and there has been some work relating outliers to RVs (Table 1), even with non-expression molecular phenotypes (Guo et al., 2015). There are two steps to identifying outliers. First, a ‘typical’ biological range has to be established for each gene by assuming a distribution (e.g. normal or binomial) or lack thereof (e.g. with ranks or permutations). Second, a definition of anomalous expression has to be established, which is either a value more extreme than a threshold, the most extreme value observed in a cohort, or both. In a subset of previous investigations in which outliers were associated with RVs, the corresponding code repositories were made publicly available. However, these repositories are specific to a single group’s data, analyses and hypotheses, thus limiting their generalizability.

Table 1.

Outlier definitions in previous studies associating RV with extreme expression events

Outlier definition | Distribution | References
Absolute z-score ≥ 2 | Normal | Montgomery et al. (2011)
Mahalanobis distance per sample from other samples per gene set | Multinomial | Zeng et al. (2015)
Most extreme bin (5/410 samples per bin) | Non-parametric | Zhao et al. (2016)
Z-score ≤ −3, logFC −3.3 | Normal | McKean et al. (2016)
Absolute z-score ≥ 3 | Normal | Cummings et al. (2017)
1-vs-all DE using DESeq2 (FWER < 0.05, logFC z-score ≥ 3) | Negative binomial | Kremer et al. (2017)
Most extreme in cohort and absolute z-score ≥ threshold (2, 3, 4) | Normal | Chiang et al. (2017) and Li et al. (2017)
Family information: family QTL effect size >95%, likelihood of most extreme parent–child expression sharing | Normal | Li et al. (2014, 2017) and Pala et al. (2017)
Metabolite read-out: variable (most extreme, P-value < 0.05) | Normal, non-parametric | Guo et al. (2015)

DE, differential expression; FWER, family wise error rate; logFC, log-fold change.

Here, we introduce ORE (outlier-RV enrichment), an open-source software package that associates RVs from whole-genome sequencing (WGS) and gene expression outliers from RNAseq to identify biologically meaningful non-coding RVs. It combines previous approaches into a single software package that supports multiple variant functional classes, utilizes pre-programed as well as user-defined outlier definitions, and provides an extensible output that is easily interpreted and suitable for downstream analyses. ORE has the same inputs as FastQTL, a commonly used QTL tool, thus facilitating integration into existing pipelines that analyze RNAseq and WGS from the same individuals (Ongen et al., 2016). We implemented ORE in WGS and heart tissue RNAseq data obtained from two cohorts, 192 samples from patients with structural congenital heart disease (CHD) from the Pediatric Cardiac Genomics Consortium (PCGC) and 71 samples from deceased adult donors from the Genotype-Tissue Expression (GTEx) Consortium (Li et al., 2017). We chose these data to highlight ORE’s utility in the context of human congenital anomalies.

2 Materials and methods

2.1 Software architecture, design and implementation

A flowchart describing the software is in Supplementary Figure S1.

2.1.1 Inputs

The two required inputs are (i) gene transcription start site (TSS) location with expression as Browser Extensible Data (BED) and (ii) genotypes as Variant Call Format (VCF). Both are standard file formats and must be indexed with Tabix, which are the same input specifications required for FastQTL (Li, 2011; Ongen et al., 2016). It is recommended to run bcftools norm on the VCF to standardize every allele into its most parsimonious and left aligned form (Danecek et al., 2011) so that appropriate allele frequencies (AFs) are determined from population databases; however, intra-cohort AF calculation is also performed to mitigate population database biases. Optional inputs are a covariate matrix (same format as FastQTL), the outlier definition, specific populations for determining AF, AF thresholds, maximum variant distance from the TSS, variant functional class (e.g. intronic, intergenic), variant quality metrics, and non-coding annotations (e.g. enhancers, promoters) in 0-based BED format. An example ORE command to calculate outlier-RV associations is shown in Command 1.

  1. ore --vcf wgs.vcf.gz
  2.   --bed expression_residuals.bed.gz
  3.   --output ore_results
  4.   --distribution normal
  5.   --threshold 2 3
  6.   --extrema
  7.   --af_rare 0.01 1e-3
  8.   --tss_dist 1e4 5e3
  9.   --variant_class UTR5

Command 1. BASH command for running ORE. The variant and gene expression files are specified with --vcf (Line 1) and --bed (Line 2), respectively. The output prefix is provided with --output (Line 3). The outlier specifications --distribution (Line 4), --threshold (Line 5) and --extrema (Line 6) indicate that outliers are defined using a normal distribution with a z-score more extreme than two or three and being the most extreme value observed per gene. Variant information is specified with --af_rare (Line 7), --tss_dist (Line 8) and --variant_class (Line 9) to encode that variants are defined as rare with a maximum population AF ≤ 0.01 or 0.001, and to only use variants within 5 or 10 kb of the TSS and in 5’ untranslated regions (UTRs). There are two z-score, two AF, and two TSS distance thresholds, so ORE will automatically calculate enrichments for all 8 combinations for low, high and all expression outliers.
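The eight parameter sets implied by Command 1 are the Cartesian product of the three two-valued options. The following is a minimal Python sketch of that enumeration, not ORE's own code; the variable names are illustrative assumptions.

```python
from itertools import product

# Values taken from Command 1; the variable names are hypothetical.
z_thresholds = [2, 3]
af_cutoffs = [0.01, 1e-3]
tss_distances = [1e4, 5e3]

# Every (z-score, AF, TSS distance) combination that would be tested.
combinations = list(product(z_thresholds, af_cutoffs, tss_distances))

for z, af, dist in combinations:
    print(f"|z| >= {z}, AF <= {af}, TSS distance <= {dist:.0f} bp")
```

Each of these combinations would then be evaluated separately for low, high and all expression outliers.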

2.1.2 Variant abstraction and annotation

ORE first abstracts the contents of a VCF from allelic to genotypic values (i.e. 0/0 → 0, 0/1 → 1 and 1/1 → 2) and wide to long format, such that each row corresponds to a single variant in a single individual. ORE automatically detects the genome build, confirms that the genome build matches between variants and gene expression, matches sample names between the VCF and BED files (only the intersecting samples are used to calculate the association), handles multi-allelic lines as well as bi-allelic values (via a simplified implementation of bcftools norm --multiallelics -both) (Li et al., 2009), and contains options for filtering variants based on genotype quality, total depth and, for heterozygous variants, alternate allelic ratio.
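The allelic-to-genotypic and wide-to-long conversion can be illustrated with a short Python sketch. This is not ORE's implementation: the function names are hypothetical, the sketch assumes bi-allelic sites, and dropping homozygous-reference and missing calls from the long table is an assumption made here for brevity.

```python
def genotype_to_dosage(gt):
    """Convert a VCF GT string such as '0/1' or '1|1' to an alternate-allele count.

    Returns None when any allele is missing ('.').
    """
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    # Count non-reference alleles; assumes a bi-allelic site.
    return sum(1 for a in alleles if a != "0")


def wide_to_long(variant_id, sample_names, genotypes):
    """Return one (variant, sample, dosage) row per sample carrying the variant."""
    rows = []
    for sample, gt in zip(sample_names, genotypes):
        dosage = genotype_to_dosage(gt)
        if dosage:  # skips missing calls (None) and homozygous reference (0)
            rows.append((variant_id, sample, dosage))
    return rows


# One VCF line (wide) becomes one row per carrier (long).
rows = wide_to_long("1:12345:A:G", ["s1", "s2", "s3", "s4"],
                    ["0/0", "0/1", "1/1", "./."])
print(rows)
```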

Having abstracted variants, the first step is obtaining each variant’s AF. In addition to calculating the intra-cohort AF and extracting the VCF AF, ANNOVAR (v2018Apr16) is packaged with ORE and optionally adopted for this function (Wang et al., 2010). ORE prepares one-based inputs for ANNOVAR, which includes extending the ‘End’ of the reference allele in a deletion to comport with its length (Wang et al., 2010). ORE executes ANNOVAR and uses the output to determine each variant’s maximum AF in any population in the Genome Aggregation Database, by default only considering populations with at least 1000 samples in the database (Lek et al., 2016). After obtaining AF, ORE assigns variants to the closest TSS in the gene expression input file. Simultaneously, ORE converts the variants to 0-based BED files and uses BEDTools to intersect variants with segmental duplications (>0.99 pairwise similarity for ≥ 1 kb in human genome), low complexity regions, and other mappability features (Bailey et al., 2001; Dale et al., 2011; Quinlan and Hall, 2010). These variants are excluded from downstream analyses. Finally, variants are filtered for those within a certain distance of the gene TSS (default 10 kb) and, optionally, those belonging to a specific functional class (e.g. intergenic, intronic) or user-defined annotations.
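Assigning a variant to the closest TSS and applying the distance cut-off can be sketched with a binary search over sorted TSS positions. This is an illustrative Python sketch under stated assumptions, not ORE's code (which the text says relies on BEDTools): the function name, the tuple layout and the toy coordinates are all hypothetical.

```python
from bisect import bisect_left


def closest_tss(position, tss_list):
    """Return (gene, distance) for the TSS nearest to a variant position.

    tss_list holds (tss_position, gene) tuples for one chromosome,
    sorted by position, and must not be empty.
    """
    positions = [p for p, _ in tss_list]
    i = bisect_left(positions, position)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = tss_list[max(i - 1, 0):i + 1]
    tss_pos, gene = min(candidates, key=lambda t: abs(t[0] - position))
    return gene, abs(tss_pos - position)


# Toy coordinates for illustration only.
tss = [(1000, "GENE_A"), (50000, "GENE_B"), (90000, "GENE_C")]
gene, distance = closest_tss(48000, tss)
within_default_window = distance <= 10_000  # 10 kb default from the text
print(gene, distance, within_default_window)
```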

2.1.3 Phenotype abstraction and outlier calling

ORE provides parametric and non-parametric outlier definitions in addition to supporting custom, user-defined outliers. The default definition comports with GTEx: in normally distributed data, outliers are events more extreme than a z-score threshold and the most extreme event observed in the cohort (Chiang et al., 2017; Li et al., 2017). With this definition, there is a maximum of one outlier per gene, while each sample can have multiple outliers. Alternatively, users can classify all values more extreme than a threshold as outliers or specify outliers using percentile thresholds in a non-parametric rank-based distribution. When outliers are specified as being drawn from a normal distribution, genes where >5% of samples are outliers are excluded. Finally, outliers can be calculated externally (e.g. with OUTRIDER) (Brechtmann et al., 2018) and outlier status can simply be passed in with 1 or 0 corresponding to whether or not each gene–sample pair is an outlier by specifying --distribution custom. The number of outliers per sample is automatically printed as a table and histogram; optionally, the maximum number of outliers per individual can be set to automatically exclude samples with global expression differences not necessarily attributable to RVs. If the most extreme outlier definition is used, sample exclusion is automatically iterated until all samples have fewer than the maximum number specified. Custom lists of samples can also be excluded. If the user provides a covariate matrix, these covariates are automatically regressed out, with the mean added to the residuals before all outlier calculations.
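The default definition (more extreme than a z-score threshold and the most extreme value in the cohort) can be sketched for a single gene as follows. This is a minimal Python illustration rather than ORE's implementation; the function name and toy expression values are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def call_extreme_outlier(expression, threshold=2.0):
    """Return (sample, z) for the most extreme sample of one gene, or None.

    expression maps sample ID -> expression value for a single gene.
    A call is made only if the most extreme |z| reaches the threshold,
    so each gene yields at most one outlier.
    """
    values = list(expression.values())
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return None
    z = {s: (v - mu) / sd for s, v in expression.items()}
    sample = max(z, key=lambda s: abs(z[s]))
    return (sample, z[sample]) if abs(z[sample]) >= threshold else None


# Toy expression values for one gene (illustrative only).
gene_expression = {"s1": 10.0, "s2": 10.2, "s3": 9.9, "s4": 10.1,
                   "s5": 9.8, "s6": 14.0, "s7": 10.0, "s8": 10.1}
print(call_extreme_outlier(gene_expression, threshold=2.0))
print(call_extreme_outlier(gene_expression, threshold=3.0))
```

With only eight samples the largest attainable z-score is below 3, so the second call returns no outlier; this illustrates why stricter thresholds need larger cohorts.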

2.1.4 Outlier output and enrichment calculation

Having abstracted variants and outliers to Python objects, class composition is used to combine these data into a single object. After joining, the outliers associated with RVs are printed to a file as a table that can be used for other downstream work (e.g. gene set enrichment). In addition, the outlier-RV association is calculated with both variant- and gene-centric methods (Supplementary Fig. S2). For the variant-centric method, the number of rare/common variants associated with outliers/non-outliers is enumerated with a 2 × 2 contingency table. This is similar to RV association tests. For the gene-centric method, the number of outlier/non-outlier genes associated with/without RVs is expressed as a 2 × 2 contingency table. The controls are non-outliers from the subset of genes that have outliers, to avoid confounding owing to differences in outlier detection propensity. These contingency tables are determined for the Cartesian product of annotations and parameters being tested (i.e. all combinations of thresholds and annotations). ORE automatically calculates enrichment for high, low and all expression outliers.
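The gene-centric 2 × 2 table and its odds ratio can be sketched in a few lines of Python. The function names and toy counts below are hypothetical and chosen only to illustrate the layout; they are not taken from ORE or from the study data.

```python
def contingency_table(pairs):
    """Count gene-sample pairs into a 2x2 table.

    pairs is an iterable of (is_outlier, has_rare_variant) booleans.
    Layout: [[outlier & RV, outlier & no RV],
             [non-outlier & RV, non-outlier & no RV]]
    """
    table = [[0, 0], [0, 0]]
    for is_outlier, has_rv in pairs:
        table[0 if is_outlier else 1][0 if has_rv else 1] += 1
    return table


def odds_ratio(table):
    """Odds ratio (a*d)/(b*c); returns infinity when b*c is zero."""
    (a, b), (c, d) = table
    if b * c == 0:
        return float("inf")
    return (a * d) / (b * c)


# Toy gene-sample pairs (illustrative only).
pairs = ([(True, True)] * 3 + [(True, False)] * 7
         + [(False, True)] * 10 + [(False, False)] * 80)
table = contingency_table(pairs)
print(table, round(odds_ratio(table), 2))
```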

For every contingency table specification, the odds ratio and Fisher’s exact test P-value are calculated. Statistical inference from 2 × 2 contingency tables can be classified into frequentist, Bayesian, and likelihood approaches (Choi et al., 2015). Of these, frequentist inference with Fisher’s exact test was chosen as the enrichment statistic because it is relatively non-parametric, conservative and well-known, the latter being important for software uptake. However, ORE automatically provides raw counts, which can be studied with other approaches, such as Bayesian inference (e.g. by subtracting parameter estimates from two binomials with non-informative priors) or likelihood inference (e.g. by back-calculating the P-value from the chi-squared approximation of the likelihood ratio test). These alternatives might be preferable for improving power with smaller sample sizes.
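For readers who want to recompute the enrichment statistic from the raw counts, a two-sided Fisher's exact test can be written with the standard library alone. This is a sketch under stated assumptions: the function name is hypothetical, and the two-sided convention used here (summing all tables no more probable than the observed one) is an assumption, since the text does not state which alternative ORE reports.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Sum probabilities of all tables no more likely than the observed one;
    # the small relative tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Small worked table (illustrative counts only).
print(fisher_exact_two_sided(8, 2, 1, 5))
```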

To assess the genome-wide significance of these associations (i.e. correct for all correlated hypotheses tested) and account for correlated gene expression, ORE optionally implements an orthogonal permutation test. We permuted RNAseq sample IDs and defined the permutation P-value as the proportion of permutations with association P-values as extreme as or more extreme than the most extreme observed (Equation 1). Here, x is the most significant P-value with OR > 1 observed in the real data, nj is the most significant P-value with OR > 1 observed in permutation j, N is the total number of permutations, and 1 is the indicator function. These permutations are automatically implemented in ORE (--n_perms flag) and empirically prioritize cutoffs for downstream work.

$$P_{\mathrm{permutation}} = \frac{\sum_{j=1}^{N} \mathbf{1}(x \geq n_j)}{N} \qquad (1)$$
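Equation 1 amounts to counting the permutations whose best P-value is at least as small as the best P-value in the real data, then dividing by the number of permutations. A minimal Python sketch follows; the function name and the toy P-values are hypothetical and not taken from ORE.

```python
def permutation_p_value(observed_p, permuted_min_p):
    """Permutation P-value as in Equation 1.

    observed_p: most significant P-value (OR > 1) in the real data (x).
    permuted_min_p: most significant P-value (OR > 1) from each
    permutation (n_j), one entry per permutation.
    """
    n_perms = len(permuted_min_p)
    # Indicator 1(x >= n_j): the permutation is as or more extreme.
    as_extreme = sum(1 for p in permuted_min_p if p <= observed_p)
    return as_extreme / n_perms


# Toy values for illustration only.
print(permutation_p_value(0.002, [0.04, 0.001, 0.2, 0.01, 0.5]))
```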

2.2 Data ascertainment and preparation

2.2.1 WGS and RNAseq data

Patients with structural CHD were enrolled in the PCGC CHD Network Study (CHD GENES: ClinicalTrials.gov identifier NCT01196182) (Gelb et al., 2013). The protocols were approved by the Institutional Review Boards of Boston’s Children’s Hospital, Brigham and Women’s Hospital, Children’s Hospital of Los Angeles, Children’s Hospital of Philadelphia, Columbia University Medical Center, Great Ormond Street Hospital, Icahn School of Medicine at Mount Sinai, Rochester School of Medicine and Dentistry, Steven and Alexandra Cohen Children’s Medical Center of New York and Yale School of Medicine. All participants or their parents provided informed consent. The DNAs were sequenced at the Baylor College of Medicine Genomic and RNA Profiling Core (n = 27), the New York Genome Center (NYGC) Genomic Research Services (n = 21), or the Broad Institute for Genomic Services (n = 132) following the same protocol. Genomic DNAs from venous blood or saliva were prepared for sequencing using a PCR-free library preparation (n = 48, Baylor and NYGC) or SK2-IES library preparation (n = 132, Broad). All samples were sequenced on an Illumina Hi-Seq X Ten with 150-bp paired reads to a median depth > 30× per individual.

Reads were aligned to GRCh37 with the Burrows-Wheeler Aligner-MEM (Li and Durbin, 2009). GATK Best Practices recommendations were implemented for base quality score recalibration (QSR), indel realignment, and duplicate removal (McKenna et al., 2010). Standard hard filtering parameters were used for SNV and indel discovery across all PCGC samples, followed by N + 1 joint genotyping and variant QSR (DePristo et al., 2011; van der Auwera et al., 2002). Variants were kept if data were missing in ≤ 30% of calls and variants had genotype quality ≥ 30, total depth ≥ 7, PASS classification with GATK QSR and, for heterozygous variants, an alternate allelic ratio between 0.2 and 0.8.
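The per-genotype quality criteria listed above can be expressed as a small predicate. This Python sketch is illustrative only: the function and argument names are hypothetical, the bounds on the allelic ratio are assumed to be inclusive, and the cohort-level missingness filter (≤ 30% of calls) is not covered.

```python
def passes_quality_filters(gq, depth, alt_depth, dosage, filter_status="PASS"):
    """Apply the genotype-level thresholds described in the text.

    gq: genotype quality; depth: total read depth; alt_depth: reads
    supporting the alternate allele; dosage: 0, 1 or 2 alternate alleles.
    """
    if filter_status != "PASS" or gq < 30 or depth < 7:
        return False
    if dosage == 1:
        # Heterozygous calls must have a balanced alternate allelic ratio.
        ratio = alt_depth / depth
        return 0.2 <= ratio <= 0.8
    return True


print(passes_quality_filters(gq=40, depth=20, alt_depth=9, dosage=1))
print(passes_quality_filters(gq=40, depth=20, alt_depth=2, dosage=1))
```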

For RNAseq, discarded cardiac tissues were obtained from 327 participants at PCGC recruitment sites during cardiothoracic surgical procedures and were immediately frozen in liquid nitrogen or placed in RNAlater. RNA was subsequently extracted and sequenced (50-bp paired reads) to a target depth of >4 million reads (median, 16 million reads; range, 4–223 million reads; n = 327 probands). Reads were aligned to hg19 using Subread, and expression counts per gene were quantified with featureCounts (Liao et al., 2013, 2014).

WGS data from GTEx (n = 148) were downloaded from the Database of Genotypes and Phenotypes (www.ncbi.nlm.nih.gov/gap) under accession phs000424.v6.p1 (Carithers et al., 2015). GTEx V6 heart ventricle RNAseq counts and sample metadata (n = 130) were downloaded from the GTEx portal (https://www.gtexportal.org/).

2.2.2 RNAseq expression processing

Counts were analyzed separately for every tissue in each cohort. Genes with an average ≥1 reads per kb of transcript per million mapped reads were kept. Trimmed mean of M-values was used to normalize the total read counts by both the total number of counts and distribution of counts across genes, and Voom was used to non-parametrically estimate the precision of the variance of this normalization (Law et al., 2014; Robinson and Oshlack, 2010). Following normalization, principal component analysis was used to visualize biases in each dataset. In PCGC data, the tissue types of 18 samples were reclassified to comport with their tissue cluster and four samples were removed due to ambiguous tissue clustering. In addition, samples were excluded due to syndromic diagnoses (n = 23; either 22q11.2del or trisomy 21) or sex mismatch and sample mixing based on XIST/UTY gene expression and annotated sample gender (n = 7).
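The expression filter (average ≥1 reads per kb of transcript per million mapped reads, RPKM) can be illustrated with a short Python sketch. The function names and toy numbers are hypothetical; this is not the pipeline used in the study, which relied on established normalization tools.

```python
from statistics import mean


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kb of transcript per million mapped reads."""
    return count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)


def keep_gene(rpkm_values, min_mean=1.0):
    """Keep a gene when its average RPKM across samples reaches the cut-off."""
    return mean(rpkm_values) >= min_mean


# Toy example: 500 reads on a 2 kb transcript in a 20 million read library.
print(rpkm(500, 2000, 20_000_000))
print(keep_gene([0.2, 0.5, 3.0]), keep_gene([0.1, 0.2, 0.3]))
```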

All known confounding variables and the first five surrogate variables (SVs) were regressed out for downstream analyses. In PCGC, known covariates were library preparation kit (Illumina Nextera or 5 PRIME), sequencing platform (Illumina Hi-Seq X Ten or NextSeq), tissue storage (RNAlater or frozen in liquid nitrogen), age, and gender (Supplementary Fig. S3). In GTEx, known covariates were ischemic time, nucleic acid isolation batch, collection site code and RNA integrity number. In all cohorts, SVs were identified through probabilistic evaluation of expression residuals and subsequently regressed out (Stegle et al., 2012). Following quality control, expression z-scores were calculated from the residuals for each gene in each individual.

3 Results

We analyzed the relationship between outliers in expression and RV in heart tissue RNAseq data from GTEx and the PCGC. We first used PCGC data to compare different classes of variants within 10 kb of the TSS, considering only variants with consistent class assignments for RefSeq and ENSEMBL. We defined outliers as any gene with an expression z-score absolute value ≥ 2 and the most extreme observed in the cohort. We used successively more stringent AF cutoffs to ensure robustness of our findings. Nineteen samples had more than 200 outliers and were automatically excluded by ORE with --max_outliers_per_id 200 (Supplementary Fig. S4). Histograms and lists of outliers per sample before and after removal (as in Supplementary Fig. S4) are automatic ORE outputs. In PCGC atrial tissues (n = 82), we observed gene-centric enrichment for RVs across all AF cut-offs in 5’ UTRs (P ranged from 1.6 × 10−3 to 3 × 10−2, Fisher’s exact test). This is consistent with GTEx, which found a stronger enrichment of 5’ UTR compared with other variant classes for RVs associated with outliers (Li et al., 2017). In contrast to atrial, the most significant associations were with splicing variants in PCGC ventricle (n = 45) and 3’ UTRs in PCGC vascular/valvar (n = 50) (Supplementary Fig. S5).
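The iterative exclusion of samples with too many outliers can be sketched as a loop that recalls the most extreme outlier per gene among the remaining samples until no sample exceeds the limit. This Python sketch is an assumption-laden illustration, not ORE's implementation: the function name and toy z-scores are hypothetical, and z-scores are not recomputed after each exclusion round.

```python
from collections import Counter


def iterative_sample_exclusion(z_scores, max_outliers, threshold=2.0):
    """Return (kept, excluded) sample sets.

    z_scores maps gene -> {sample: z}. In each round the most extreme
    sample per gene (if |z| >= threshold) is counted as an outlier, and
    samples with more than max_outliers outliers are removed.
    """
    kept = set().union(*(per_sample.keys() for per_sample in z_scores.values()))
    excluded = set()
    while True:
        counts = Counter()
        for per_sample in z_scores.values():
            remaining = {s: z for s, z in per_sample.items() if s in kept}
            if not remaining:
                continue
            top = max(remaining, key=lambda s: abs(remaining[s]))
            if abs(remaining[top]) >= threshold:
                counts[top] += 1
        over = {s for s, n in counts.items() if n > max_outliers}
        if not over:
            return kept, excluded
        kept -= over
        excluded |= over


# Toy z-scores (illustrative only): sample "a" is extreme in every gene.
toy_z = {
    "g1": {"a": 3.0, "b": 2.5, "c": 0.1},
    "g2": {"a": -4.0, "b": 0.2, "c": 2.1},
    "g3": {"a": 2.2, "b": -2.1, "c": 0.3},
}
kept, excluded = iterative_sample_exclusion(toy_z, max_outliers=2)
print(sorted(kept), sorted(excluded))
```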

As the 5’ UTR functional class was the most significant across all AFs, it provided the best lens for comparing different outlier methods (Fig. 1). In PCGC atrial tissues, all outlier definitions were powered to identify significant associations with RVs. We observed the highest effect size for outliers representing the most extreme events per gene. Figure 1c illustrates typical examples of gene-centric and variant-centric 2 × 2 contingency tables for most extreme outliers and 5’ UTR RVs at AF < 10−5. In order to account for testing multiple hypotheses (in this case multiple AF thresholds) and gene expression correlations, ORE automatically implements a permutation test (see Section 2). This resulted in P = 1 × 10−3 for 5’ UTR RVs with the most extreme outliers (1000 permutations, Fig. 1d). In GTEx, expression outliers defined with ranks were the only category powered to observe a statistically significant association for variants in 5’ UTRs. To provide insight into these observations, we simulated sample size as a function of effect size, RVs per gene per sample, and outliers per sample. These results indicate that GTEx and PCGC were both powered to detect an association for rank-based outliers with 5’ UTR RVs for AF < 0.05, but only PCGC was powered to detect associations for most extreme outliers for AF < 10−5 (Supplementary Fig. S6). The counts for each comparison and the final set of outliers with 5’ UTR RVs are provided in Supplementary Material.

Fig. 1.


Association between expression outliers and 5’ UTR RVs within 10 kb of the TSS. The effect size and significance for different outlier definitions are illustrated for (a) PCGC atrial tissues (n = 82) and (b) GTEx ventricular tissues (n = 71). In PCGC, associations with the most extreme outliers had the largest effect size and most significant association between effect size and AF (P = 2 × 10−3). In GTEx, rank-based expression outliers were the only category powered to observe a statistically significant association (light-colored bars indicate zero 5’ UTR RV–outlier pairs). (c) Examples of variant- and gene-centric enrichment for most extreme expression outliers with 5’ UTR RVs at AF < 10−5. (d) A permutation approach affirms significance for 5’ UTR RVs with most extreme outliers while accounting for multiple hypotheses and gene expression correlations. OR, odds ratio; RV, rare variant; CV, common variant

AF tended to be inversely associated with effect size; we observed the most significant association in most extreme outliers (linear model of log2-OR regressed on log10-AF P = 3 × 10−3). To confirm this association, we performed logistic regression of most extreme outlier status on a non-zero intercept plus log10-transformed maximum population AF, with AF = 0 transformed to 10−5. Rarer AFs were a significant predictor of outlier status overall (P = 4 × 10−3) and among variants with AF < 0.05 (P = 0.01).
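The AF transformation and the direction of the effect-size trend can be illustrated with a small Python sketch. Everything here is hypothetical: the function names are invented, AF = 0 is mapped to 1e-5 (10 to the power −5) as in the logistic regression set-up, and the log2-OR values are synthetic points on an exact line, not results from the study.

```python
from math import log10


def log10_af(af, floor=1e-5):
    """log10-transform an allele frequency, mapping AF = 0 to the floor."""
    return log10(af if af > 0 else floor)


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


# Synthetic illustration: log2-OR decreases as AF increases.
afs = [0, 1e-4, 1e-3, 1e-2]
x = [log10_af(af) for af in afs]
y = [-0.5 * xi for xi in x]  # made-up log2-OR values on an exact line
print(x, ols_slope(x, y))
```

A negative slope corresponds to the pattern described in the text, in which rarer variants show larger effect sizes.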

In addition to the broad variant classes described earlier, users can also specify an arbitrary number of 0-based BED files, within which enrichment statistics are automatically computed. Examples are illustrated in Supplementary Figure S7, highlighting strong associations for RVs with AF < 10−5 in CCCTC-binding factor (CTCF) and cardiac transcription factor-binding sites (TFBSs) and UCSC DNase Hypersensitivity Sites (DHSs) (Dickel et al., 2016; Thurman et al., 2012; Wang et al., 2012). RVs in CTCF motifs had stronger enrichment closer to the TSS, consistent with cis-regulatory effects. The TFBSs and DHSs enrichments were more significant for low expression outliers, consistent with loss of function.

In order to assess the impact of systematic biases, we tested for an association between effect size and the number of SVs in PCGC atrial samples. Consistent with GTEx, we observed an association (linear model of log2-OR regressed on SV count P = 6.5 × 10−3, Fig. 2a), although this association was primarily driven by the first five SVs. We also tested for an association between effect size and the maximum number of outliers per sample (Fig. 2b). We observed a disproportionate effect on the number of outliers with RVs and a moderate association between effect size and removing samples with divergent outlier profiles (linear model of log2-OR regressed on max outliers/sample P = 0.1).

Fig. 2.


Effects of SVs and individuals with divergent outlier profiles. After regressing out five known covariates (Supplementary Fig. S3), (a) regressing out additional SVs (x-axis) had significant effects on the signal (P = 6.5 × 10−3). (b) Excluding individuals with divergent outlier profiles had moderate effects on the signal and large effects on outlier–RV pair counts

Finally, we profiled the performance of ORE on 12 AMD 2.3 GHz Interlagos cores (Supplementary Table S1). The VCF variant abstraction step (i.e. looping over variants sequentially within each chromosome to convert them to formats suitable for performing enrichment) was the computational bottleneck, taking >85% of compute time and providing an avenue for future computational refinements, necessary for larger cohorts.

4 Discussion

ORE provides a systematic framework for testing hypotheses relevant to identifying biologically meaningful non-coding RVs. To encourage uptake, ORE handles multiple classes of variants and outlier definitions, has a user-friendly command line interface, and uses standard file formats. The software is written in Python 3, is deposited on the Python Package Index, has robust error handling and logging, and takes advantage of Python’s object-oriented programing principles to enable class composition, modularity, encapsulation and abstraction.

We tested ORE in two independent cardiac tissue datasets. We demonstrated that the association between RVs and outliers observed in previous studies is robust for multiple definitions in our current analysis, although this is contingent on sample ascertainment. For smaller sample sizes or to maximize sensitivity when prioritizing non-coding RVs with large effect sizes, we recommend defining outliers using ranks. To maximize specificity for identifying functional non-coding RVs, we recommend defining outliers as the most extreme values observed in a cohort after setting a minimum z-score threshold. These recommendations are based on comparisons of the number of outlier–RV pairs and the significance of these associations in multiple cohorts with varying sample sizes. These outlier definitions can also serve as a benchmark for custom, more high-confidence outlier definitions, such as those obtained from repeated measures or families. Furthermore, minimum sample sizes for detecting associations can be extrapolated from Supplementary Figure S6. Biology could be driving differences in power: the congenital pathology in PCGC patients, compared with the relatively healthy GTEx cohort, could be associated with an overall burden of outlier–RV pairs.

In addition to testing the robustness of different outlier definitions and replicating previous work in a novel cohort, we showed that rarer variants have increased associations with outliers, consistent with the notion that these RVs are under negative selective pressure. We also repeated GTEx experiments on the impact of correcting for SVs and removing samples with divergent outlier profiles. We found significant associations with SVs, and the user should be aware of impacts on the number of outlier–RV pairs. Users can specify a FastQTL-formatted covariate matrix, providing flexibility in deciding to remove as much transcriptional covariance as possible (the general strategy for cis-regulatory variants), or retain SVs that could reflect relevant biological variance.

One limitation to this software is that ORE assigns variants to genes with the closest TSS. An alternative approach is to assign variants to all genes within a region, which could increase sensitivity and capture pleiotropic events. However, this could inflate test statistics by double counting variants. Ideally, we would sidestep this question with tissue and cell-type specific 3D DNA interaction data. High resolution Hi-C enhancer–promoter pairs (Whalen et al., 2015), correlations between expression and genomic loci (e.g. DNA variants, H3K27ac, methylation, DNase I hypersensitivity) (Osterwalder et al., 2018; Shooshtari et al., 2017; Short et al., 2018), and integrative machine learning methods can identify these interactions, and future work will incorporate these data as they come online.

Another limitation of ORE is that it provides genome-wide measures and therefore does not give confidence in individual outlier–RV pairs. Aside from experimental perturbation, methods for identifying causal variants are (i) focusing on enrichments with the expected number of outlier–RV pairs approaching zero or (ii) identifying the same outlier–RV pairs in multiple individuals.

WGS has primarily shed light on the genetic architecture of Mendelian disorders through detection of structural variants and enhanced coverage of coding regions (Meienberg et al., 2016). Although WGS promises to identify relevant non-coding variation, constraining this expansive hypothesis space has proven difficult. Recent literature has shown that this benefit of WGS could be realized by using RNAseq as a functional read-out for identifying non-coding RVs with large effect sizes (Cummings et al., 2017; Kremer et al., 2017; Li et al., 2017; McKean et al., 2016). ORE facilitates this type of analysis. Furthermore, ORE can be generalized to outliers from other molecular phenotypes including allelic ratios, splicing, protein levels, methylation and chromatin-immunoprecipitation.

Supplementary Material

btz202_Supplementary_Data

Acknowledgements

We are grateful to the patients and families who participated in this research, and thank the following for patient recruitment: A. Julian, M. Mac Neal, Y. Mendez, T. Mendiz-Ramdeen and C. Mintz (Icahn School of Medicine at Mount Sinai); N. Cross (Yale School of Medicine); J. Ellashek and N. Tran (Children's Hospital of Los Angeles); B. McDonough, J. Geva and M. Borensztein (Harvard Medical School), K. Flack, L. Panesar and N. Taylor (University College London); E. Taillie (University of Rochester School of Medicine and Dentistry); S. Edman, J. Garbarini, J. Tusi and S. Woyciechowski (Children's Hospital of Philadelphia); D. Awad, C. Breton, K. Celia, C. Duarte, D. Etwaru, N. Fishman, M. Kaspakoval, J. Kline, R. Korsin, A. Lanz, E. Marquez, D. Queen, A. Rodriguez, J. Rose, J.K. Sond, D. Warburton, A. Wilpers and R. Yee (Columbia Medical School); D. Gruber (Cohen Children's Medical Center, Northwell Health). This work was supported in part through the computational resources and staff expertise provided by Scientific Computing at the Icahn School of Medicine at Mount Sinai. These data were generated by the Pediatric Cardiac Genomics Consortium (PCGC), under the auspices of the National Heart, Lung, and Blood Institute's Bench to Bassinet Program (https://benchtobassinet.com). The results analyzed and published here are based in part on data generated by Gabriella Miller Kids First Pediatric Research Program projects phs001138.v1.p2/phs001194.v1.p2, and were accessed from the Kids First Data Resource Portal (https://kidsfirstdrc.org/) and/or dbGaP (www.ncbi.nlm.nih.gov/gap). This manuscript was prepared in collaboration with investigators of the PCGC and has been reviewed and/or approved by the PCGC. PCGC investigators are listed at https://benchtobassinet.com/Centers/PCGCCenters.aspx.

Funding

This work was supported by the National Institute of Dental and Craniofacial Research Interdisciplinary Training in Systems and Developmental Biology and Birth Defects [T32HD075735 to F.R.] and Mount Sinai Medical Scientist Training Program [5T32GM007280 to F.R.]. The Pediatric Cardiac Genomics Consortium (PCGC) program is funded by the National Heart, Lung, and Blood Institute, National Institutes of Health, U.S. Department of Health and Human Services through grants UM1HL128711, UM1HL098162, UM1HL098147, UM1HL098123, UM1HL128761, and U01HL131003. The PCGC Kids First study includes data sequenced by the Broad Institute (U24 HD090743-01).

Conflict of Interest: none declared.

References

  1. Bailey J.A. et al. (2001) Segmental duplications: organization and impact within the current human genome project assembly. Genome Res., 11, 1005–1017.
  2. Brechtmann F. et al. (2018) OUTRIDER: a statistical method for detecting aberrantly expressed genes in RNA sequencing data. Am. J. Hum. Genet., 103, 907–917.
  3. Carithers L.J. et al. (2015) A novel approach to high-quality postmortem tissue procurement: the GTEx project. Biopreserv. Biobank, 13, 311–319.
  4. Chiang C. et al. (2017) The impact of structural variation on human gene expression. Nat. Genet., 49, 692–699.
  5. Choi L. et al. (2015) Elucidating the foundations of statistical inference with 2 × 2 tables. PLoS One, 10, e0121263.
  6. Cummings B.B. et al. (2017) Improving genetic diagnosis in Mendelian disease with transcriptome sequencing. Sci. Transl. Med., 9, eaal5209.
  7. Dale R.K. et al. (2011) Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics, 27, 3423–3424.
  8. Danecek P. et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156–2158.
  9. DePristo M.A. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet., 43, 491–498.
  10. Dickel D.E. et al. (2016) Genome-wide compendium and functional assessment of in vivo heart enhancers. Nat. Commun., 7, 12923.
  11. Gelb B. et al. (2013) The Congenital Heart Disease Genetic Network Study: rationale, design, and early results. Circ. Res., 112, 698–706.
  12. Gibson G. (2012) Rare and common variants: twenty arguments. Nat. Rev. Genet., 13, 135–145.
  13. Guo L. et al. (2015) Plasma metabolomic profiles enhance precision medicine for volunteers of normal health. Proc. Natl. Acad. Sci. USA, 112, E4901–E4910.
  14. Kremer L.S. et al. (2017) Genetic diagnosis of Mendelian disorders via RNA sequencing. Nat. Commun., 8, 15824.
  15. Law C.W. et al. (2014) voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol., 15, R29.
  16. Lek M. et al. (2016) Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536, 285–291.
  17. Li H. (2011) Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics, 27, 718–719.
  18. Li H. et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078–2079.
  19. Li H., Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760.
  20. Li X. et al. (2017) The impact of rare variation on gene expression across tissues. Nature, 550, 239–243.
  21. Li X. et al. (2014) Transcriptome sequencing of a large human family identifies the impact of rare noncoding variants. Am. J. Hum. Genet., 95, 245–256.
  22. Liao Y. et al. (2014) featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30, 923–930.
  23. Liao Y. et al. (2013) The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res., 41, e108.
  24. McKean D.M. et al. (2016) Loss of RNA expression and allele-specific expression associated with congenital heart disease. Nat. Commun., 7, 12824.
  25. McKenna A. et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 20, 1297–1303.
  26. Meienberg J. et al. (2016) Clinical sequencing: is WGS the better WES? Hum. Genet., 135, 359–362.
  27. Montgomery S.B. et al. (2011) Rare and common regulatory variation in population-scale sequenced human genomes. PLoS Genet., 7, e1002144.
  28. Ongen H. et al. (2016) Fast and efficient QTL mapper for thousands of molecular phenotypes. Bioinformatics, 32, 1479–1485.
  29. Osterwalder M. et al. (2018) Enhancer redundancy provides phenotypic robustness in mammalian development. Nature, 554, 239–243.
  30. Pala M. et al. (2017) Population- and individual-specific regulatory variation in Sardinia. Nat. Genet., 49, 700–707.
  31. Quinlan A.R., Hall I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841–842.
  32. Robinson M.D., Oshlack A. (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol., 11, R25.
  33. Schadt E.E. et al. (2003) Genetics of gene expression surveyed in maize, mouse and man. Nature, 422, 297–302.
  34. Shooshtari P. et al. (2017) Integrative genetic and epigenetic analysis uncovers regulatory mechanisms of autoimmune disease. Am. J. Hum. Genet., 101, 75–86.
  35. Short P.J. et al. (2018) De novo mutations in regulatory elements in neurodevelopmental disorders. Nature, 555, 611–616.
  36. Stegle O. et al. (2012) Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc., 7, 500–507.
  37. Thurman R.E. et al. (2012) The accessible chromatin landscape of the human genome. Nature, 489, 75–82.
  38. Van der Auwera G.A. et al. (2002) In: Bateman A. et al. (eds.) Current Protocols in Bioinformatics. John Wiley & Sons, Inc, Hoboken, NJ.
  39. Wang J. et al. (2012) Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res., 22, 1798–1812.
  40. Wang K. et al. (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res., 38, e164–e164.
  41. Whalen S. et al. (2015) Enhancer-promoter interactions are encoded by complex genomic signatures on looping chromatin. Nat. Genet., 48, 488–496.
  42. Zeng Y. et al. (2015) Aberrant gene expression in humans. PLoS Genet., 11, e1004942.
  43. Zhao J. et al. (2016) A burden of rare variants associated with extremes of gene expression in human peripheral blood. Am. J. Hum. Genet., 98, 299–309.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
