. 2021 Nov 5;126(7):981–993. doi: 10.1038/s41416-021-01612-6

Table 3.

Methods for identifying putative target genes and functional variants.

Method: summary Advantages Disadvantages
Defining candidate-regulatory sequences (CRS)
In silico alignment: Alignment of “local” genes and credible variants with markers of open chromatin, active histone marks and/or transcription factors. Reviewed in Klein and Hainer [103].

High-throughput in silico analysis

Multiple data sources, widely available through, for example ENCODE and Roadmap Epigenomics Project (box 2).

Primary cell data available through Roadmap Epigenomics Project.

Can be combined into an algorithm.

The relevant tissue and/or cell type is not necessarily known.

Biased towards cell lines (MCF-7, MCF 10A and T-47D) and tissue (breast epithelium) rather than primary cells (Fig. 1a)

Limited markers/TF in primary cells (Fig. 1b)

By combining data sources, algorithms lose granularity; can use a weighting scheme for different data types but these by definition require a series of assumptions about the hierarchy of data sources.
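A weighting scheme of the kind mentioned above can be sketched as follows. This is only an illustration: the track names, interval coordinates and weights are hypothetical, and choosing the weights is exactly the set of assumptions about the hierarchy of data sources that the table warns about.

```python
from typing import Dict, List, Tuple

# (chromosome, start, end), 0-based half-open, as in BED files
Interval = Tuple[str, int, int]


def overlaps(chrom: str, pos: int, intervals: List[Interval]) -> bool:
    """True if the variant position falls inside any interval of the track."""
    return any(c == chrom and s <= pos < e for c, s, e in intervals)


def score_variant(chrom: str, pos: int,
                  tracks: Dict[str, List[Interval]],
                  weights: Dict[str, float]) -> float:
    """Sum the weights of every annotation track the variant overlaps."""
    return sum(
        (weights.get(name, 0.0)
         for name, intervals in tracks.items()
         if overlaps(chrom, pos, intervals)),
        0.0,
    )


# Hypothetical annotation tracks and weights (assumptions, not from the source)
tracks = {
    "open_chromatin": [("chr1", 100, 200)],
    "H3K27ac": [("chr1", 150, 300)],
    "TF_binding": [("chr1", 500, 600)],
}
weights = {"open_chromatin": 1.0, "H3K27ac": 2.0, "TF_binding": 3.0}

variants = [("chr1", 160), ("chr1", 120), ("chr1", 400)]
scores = {v: score_variant(v[0], v[1], tracks, weights) for v in variants}
```

Collapsing the tracks into one score makes ranking easy but hides which data source drove the score, which is the loss of granularity noted above.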

Functional outputs for CRS

MPRA: Massively Parallel Reporter Assay [45, 46], plasmid-based high-throughput approach to reporter gene assays.

CRS are placed upstream of a reporter gene driven by a minimal promoter and barcodes are inserted in the 3’UTR of the reporter gene.

The activity of the CRS is measured by pairing its RNA expression to the transcribed barcodes.

High-throughput functional readout of CRS and variants within those sequences across the whole genome.

Limited to cells that can be easily transfected.

The length of the sequences tested is restricted by the length of oligos that can be synthesised (~200 bp).

Episomal assay.

May be confounded by possible effects from promoter-binding proteins.
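The barcode-based readout can be illustrated with a minimal sketch. It assumes, as is common practice but not stated in the table, that RNA barcode counts are normalised to the counts of the same barcode in the plasmid DNA library, and that per-barcode values are summarised by their median. All barcode sequences, CRS names and counts are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median


def crs_activity(barcode_to_crs, rna_counts, dna_counts, pseudocount=1.0):
    """Median log2(RNA/DNA) across the barcodes assigned to each CRS."""
    per_crs = defaultdict(list)
    for barcode, crs in barcode_to_crs.items():
        dna = dna_counts.get(barcode, 0)
        if dna == 0:
            continue  # barcode not observed in the DNA library; skip it
        rna = rna_counts.get(barcode, 0)
        ratio = (rna + pseudocount) / (dna + pseudocount)
        per_crs[crs].append(math.log2(ratio))
    return {crs: median(values) for crs, values in per_crs.items()}


# Hypothetical barcode assignments and read counts
barcode_to_crs = {
    "AAAC": "crs_ref", "AAAG": "crs_ref",
    "CCCA": "crs_alt", "CCCT": "crs_alt", "GGGG": "crs_alt",
}
rna_counts = {"AAAC": 7, "AAAG": 15, "CCCA": 1, "CCCT": 3}
dna_counts = {"AAAC": 1, "AAAG": 3, "CCCA": 1, "CCCT": 3}

activity = crs_activity(barcode_to_crs, rna_counts, dna_counts)
```

Summarising over several barcodes per CRS dampens barcode-specific effects; real analyses would also normalise for sequencing depth and use replicates.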

 lenti-MPRA [50]: modification of MPRA that uses lentiviral vectors as opposed to plasmids.

Broadens the range of cells and tissue types that can be used, to include hard-to-transfect cell types.

Barcodes cloned into the 5’ UTR to reduce the distance between the CRS and barcode and hence, the risk of CRS-barcode swapping.

Integration of viral vector provides “in-genome” readout.

Using on average >50 barcodes per CRS reduces the impact of binding of RNA-associated factors and RNA stability on the results.

The length of the sequences tested is restricted by the length of oligos that can be synthesised (~200 bp).

May be confounded by possible effects from promoter-binding proteins.

STARR-seq [47]: Self-Transcribing Active Regulatory Region sequencing, plasmid-based high-throughput reporter gene assay in which the CRS itself is used as the barcode.

CRS are cloned downstream of the reporter gene in the 3’UTR. The activity of the CRS is measured by comparing the amount of RNA produced relative to the amount of genomic DNA in the STARR-seq library.

The elimination of barcodes simplifies the library and allows screening of complex libraries.

CRS are cloned rather than synthesised; the length of CRS is limited only by cloning efficiency and a range of 150–1500 bp is possible.

Enhancer activity may be confounded by effects from the binding of RNA-associated factors and the stability of the assayed RNA sequence.

Episomal assay.

Limited applicability to mammalian genomes due to their size and complexity; has been applied to human cells using selected bacterial artificial chromosomes.

CapStarr-seq [31]: modification of STARR-seq which incorporates a sequence capture step.

Overcomes limited applicability to mammalian genomes by incorporating a sequence capture step to focus on regions of interest.

Enhancer activity may be confounded by effects from the binding of RNA-associated factors and the stability of the assayed RNA sequence.

Episomal assay.

GRO-seq [48]: Global nuclear Run-On sequencing, captures nascent and newly synthesised RNA by bromouridine triphosphate (BrUTP) labelling of transcripts followed by immunoprecipitation of labelled transcripts with an antibody against BrUTP.

Assesses transcriptional regulation and activity across the whole genome.

Sensitive, with a resolution of 10 bp.

Robust nascent transcriptome profiles, including short-lived enhancer RNAs

Capable of assessing RNAPI, RNAPII, and RNAPIII dynamics and processing properties.

Generates precise quantification of promoter-proximal RNA polymerases.

Low contamination of processed RNA.

Laborious assay.

Requires a high input of cells (~1 × 10^7).

In vitro assay.

Regulatory factors bound to the polymerase might be eliminated by the use of sarkosyl to prevent de novo initiation of transcription.

fastGRO-seq [56]: modification of GRO-seq using 4-thio ribonucleotide (4-S-UTP) labelling followed by biotin tagging of the 4-S-UTP residues which are then captured using streptavidin beads.

More efficient in terms of both assay time and cell input (0.5 × 10^6 cells required).

Can be used to analyse tissue and primary cells.

Highly reproducible.

Low contamination of processed RNA.

In vitro assay
PRO-seq [54]: Precision nuclear Run-On sequencing, modified GRO-seq assay that incorporates biotinylated nucleotides into the 3′ end of the nascent RNA and uses biotin–streptavidin pulldown.

High resolution (single nucleotide)

Low contamination of processed RNA.

Laborious assay.

Requires a high input of cells (~1 × 10^7).

In vitro assay.

The RNA polymerase position at the beginning of transcription is mostly lost and so, it may not generate a precise quantification of promoter-proximal RNA polymerases.

TTchem-seq [57]: Transient Transcriptome chemical sequencing. Captures nascent and newly synthesised RNA using 4-thiouridine (4SU) labelling, uses hydrolysis instead of sonication to fragment RNA, biotin tagging of the 4SU residues and biotin streptavidin pulldown.

In vivo assay, based on metabolic labelling of RNA which minimises any variability or cellular stress.

4SU labelling is relatively easy to perform and control which is important when handling multiple samples.

Highly reproducible.

Identification of regions of active transcription is limited to a resolution of 20–500 nucleotides which is the RNA fragment size range obtained after fragmentation.

High contamination of processed RNA

Identifying putative target genes
eQTL [59, 60]: Expression Quantitative Trait Locus analysis. Test of association between gene expression (now measured by RNA-seq, previously by microarray) and genotype.

Direct test of genotype–phenotype association.

Can test local (generally defined as ≤1 to 2 Mb) and distant (>1 to 2 Mb) genes.

The relevant tissue and/or cell type is not necessarily known

Limited availability of appropriate tissue and/or primary cell data, particularly large series of “normal” tissue/cells

Steady-state mRNA levels may not be the relevant phenotype.
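The association test at the core of an eQTL analysis can be sketched as a simple linear regression of expression on genotype dosage (0, 1 or 2 copies of the alternative allele). This is a minimal illustration under an additive model with no covariates, which real analyses would include; the dosages and expression values are hypothetical.

```python
import math


def eqtl_regression(dosages, expression):
    """Ordinary least squares of expression on genotype dosage.

    Returns (slope, standard_error). Assumes at least three samples and
    that the dosages are not all identical.
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    sxy = sum((g - mean_g) * (e - mean_e)
              for g, e in zip(dosages, expression))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    rss = sum((e - (intercept + slope * g)) ** 2
              for g, e in zip(dosages, expression))
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return slope, standard_error


# Hypothetical data: six individuals, one variant, one gene
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]

slope, standard_error = eqtl_regression(dosages, expression)
t_statistic = slope / standard_error
```

The slope estimates the change in expression per copy of the allele; the t statistic would be compared against a t distribution with n − 2 degrees of freedom to obtain a p value.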

Colocalization [66]: Extension to individual SNP:eQTL approaches. Uses multiple variants and compares the distribution of summary statistics from eQTL and GWAS.

Reduces false positives by comparing distributions of summary statistics (as opposed to individual variants).

By using gene expression data from multiple tissues, can be informative regarding “causal tissues”.

Limited availability of appropriate tissue and/or primary cell data, particularly large series of “normal” tissue/cells

Steady-state mRNA levels may not be the relevant phenotype.
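One widely used formulation of colocalization, which the table does not name and is assumed here for illustration, converts per-variant summary statistics into approximate Bayes factors and compares five hypotheses: no association (H0), association with one trait only (H1, H2), distinct causal variants (H3) and a shared causal variant (H4). The sketch below assumes at most one causal variant per trait in the region; the prior probabilities, prior standard deviation and all effect sizes are hypothetical.

```python
import math


def approx_bayes_factor(beta, se, prior_sd=0.15):
    """Wakefield-style approximate Bayes factor for one variant."""
    z = beta / se
    r = prior_sd ** 2 / (se ** 2 + prior_sd ** 2)
    return math.sqrt(1 - r) * math.exp(0.5 * z * z * r)


def coloc_posteriors(abf_gwas, abf_eqtl, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 from per-variant Bayes factors."""
    s1 = sum(abf_gwas)
    s2 = sum(abf_eqtl)
    s12 = sum(a * b for a, b in zip(abf_gwas, abf_eqtl))
    unnormalised = [
        1.0,                          # H0: no association
        p1 * s1,                      # H1: GWAS trait only
        p2 * s2,                      # H2: expression only
        p1 * p2 * (s1 * s2 - s12),    # H3: two distinct causal variants
        p12 * s12,                    # H4: one shared causal variant
    ]
    total = sum(unnormalised)
    return [value / total for value in unnormalised]


# Hypothetical (beta, se) pairs for three variants in one region
gwas = [(0.01, 0.02), (0.12, 0.02), (0.02, 0.02)]
eqtl_shared = [(0.0, 0.05), (0.4, 0.05), (0.05, 0.05)]    # peak at same variant
eqtl_distinct = [(0.0, 0.05), (0.05, 0.05), (0.4, 0.05)]  # peak elsewhere

abf_gwas = [approx_bayes_factor(b, s) for b, s in gwas]
pp_shared = coloc_posteriors(
    abf_gwas, [approx_bayes_factor(b, s) for b, s in eqtl_shared])
pp_distinct = coloc_posteriors(
    abf_gwas, [approx_bayes_factor(b, s) for b, s in eqtl_distinct])
```

In this toy example the shared-peak case favours H4 and the offset-peak case favours H3, which is how comparing whole distributions of summary statistics helps to avoid false positives from coincidental overlap of individual variants.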

LDSC-SEG [76], DESE [77], CoCoNet [78]: Examples of statistical methods that use gene expression and GWAS data to infer causal tissues. These, and additional such methods, are reviewed in ref. [79].

Requires gene expression but not eQTL data (i.e., does not require genotypes to be associated with the gene expression).

Can help to inform relevant tissue or cell type for in vitro experiments.

Assumes that driver genes will be relatively highly expressed in the most disease-relevant tissue types

LDSC-SEG additionally assumes that SNPs near such driver genes will be enriched for heritability

Limited by the availability of gene expression data in relevant tissues or cell types

Steady-state mRNA levels may not be the relevant phenotype.

Transcriptome-wide association studies (TWAS [68, 69]): eQTL cohorts are used to develop models of expression variation on a per gene basis; models are then used to predict gene expression for individuals in GWAS and test for association between gene expression and outcome.

Informative both for discovery (new risk loci) and for inferring target genes at “known” GWAS loci.

Can help to inform relevant tissue or cell type for in vitro experiments.

Limited availability of appropriate tissue and/or primary cell data, particularly large series of “normal” tissue/cells

Steady-state mRNA levels may not be the relevant phenotype.
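The two TWAS stages can be sketched in a few lines. This is a toy illustration only: the per-variant weights stand in for a model that would have been trained in an eQTL cohort, the variant identifiers, genotypes and outcome values are hypothetical, and a simple correlation replaces the regression or summary-statistic tests used in practice.

```python
import math


def predict_expression(genotypes, weights):
    """Predicted expression = weighted sum of genotype dosages per person."""
    return [sum(w * person[snp] for snp, w in weights.items())
            for person in genotypes]


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Stage 1 output (assumed): weights for one gene learned in an eQTL cohort
weights = {"rs_a": 0.5, "rs_b": -0.2}

# Stage 2: genotyped individuals from a GWAS with a quantitative outcome
genotypes = [
    {"rs_a": 0, "rs_b": 2},
    {"rs_a": 1, "rs_b": 1},
    {"rs_a": 2, "rs_b": 0},
    {"rs_a": 2, "rs_b": 1},
]
outcome = [0.1, 0.4, 0.9, 0.7]

predicted = predict_expression(genotypes, weights)
association = pearson(predicted, outcome)
```

Because expression is predicted from genotype alone, no expression measurements are needed in the GWAS individuals themselves.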

 Comparison with somatically mutated cancer genes (boxes 1 and 2): in silico analysis of somatic variation in tumours using whole genome or exome sequences.

Provides robust evidence for a functional role in cancer either on an ad hoc basis or by comprehensively comparing genes that are local (generally within 1 Mb of a locus) with lists of somatically mutated genes.

Undermines the “discovery” aspect of GWAS; only provides confirmation that the concept of an unbiased GWAS approach is sound.
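The comprehensive comparison described here amounts to intersecting two gene sets. A minimal sketch, using placeholder gene names and coordinates (all hypothetical) and the 1 Mb window mentioned above:

```python
def local_genes(chrom, position, genes, window=1_000_000):
    """Names of genes that lie at least partly within `window` bp of a locus."""
    return {
        name
        for name, (gene_chrom, start, end) in genes.items()
        if gene_chrom == chrom
        and start <= position + window
        and end >= position - window
    }


# Placeholder gene annotation: name -> (chromosome, start, end)
genes = {
    "GENE_A": ("chr5", 1_200_000, 1_250_000),
    "GENE_B": ("chr5", 2_900_000, 2_950_000),
    "GENE_C": ("chr5", 9_000_000, 9_100_000),
    "GENE_D": ("chr7", 2_000_000, 2_050_000),
}
# Placeholder list of somatically mutated cancer genes
somatic_driver_genes = {"GENE_B", "GENE_C"}

nearby = local_genes("chr5", 2_000_000, genes)
supported_targets = nearby & somatic_driver_genes
```

Only genes already present in the somatic list can be recovered this way, which is why the approach confirms rather than discovers.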
Linking CRS with putative target genes
CHi-C [96, 97]: Capture Hi-C. Chromatin-interaction method that exploits the 3D proximity of long-range regulatory elements and the genes that they regulate using formaldehyde cross-linking of chromatin followed by sequence capture to focus on regions or features of interest.

High throughput

Potentially two-sided (i.e., either GWAS loci or the promoters of putative target genes can be used as “baits”).

Agnostic

CHi-C interaction peaks will include interactions that are structural (e.g. driven by CTCF and/or cohesin) rather than regulatory

in situ CHi-C requires large numbers of cells (new Hi-C kits are reducing the numbers of cells required).

Most data have been generated in cell lines, not primary cells—in part due to the requirement for large numbers of cells

Interaction peaks are defined by a viewpoint—i.e., linkage-disequilibrium blocks or promoters.

ChIA-PET [98]: Chromatin Interaction Analysis by Paired-End Tag sequencing, HiChIP [10]: combination of 3C or Hi-C technology with chromatin immunoprecipitation.

High-throughput

Two-sided, but only when both ends of the interaction are captured (i.e., they both involve the TF or histone modification of choice).

ChIA-PET requires large numbers of cells; HiChIP less so, particularly with new HiChIP kits

Very little published data – ChIA-PET data generated in MCF-7 for ESR1, MCF-7, and POLR2A as part of ENCODE. Interaction peaks are defined by a viewpoint—the TF or histone modification used for the immunoprecipitation.

CRISPR-Cas9: Genome editing system in which a guide RNA delivers a Cas9 nuclease to a specific DNA locus where the nuclease makes a double-stranded break. Genetic changes are introduced during the DNA repair process. These changes can be a specific nucleotide change (knock-in using homology-directed repair (HDR)) or the removal of a DNA sequence or an entire gene (knock-out).

In genome (as opposed to episomal) assay

Genome can be precisely manipulated by the CRISPR system’s ability to introduce specific changes.

Relatively simple assay to design and perform.

Random modifications can occur in off-target sequences.

It is not suitable for all cells; some do not use homology-directed repair as their main repair pathway, and some are non-diploid due to genome instability.

HDR efficiency is relatively low; for GWAS CCVs where a single base change is often required, base editing approaches may provide an alternative (reviewed in ref. [104]).

CRISPRi (CRISPR interference), CRISPRa (CRISPR activation [105]) and other CRISPR modifications: techniques that use a deactivated Cas9 (dCas9) fused to an effector domain, e.g., a repressor such as the Krüppel-associated box (KRAB), which spreads repressive histone modifications (CRISPRi), or an activator such as VP64-p65-Rta (VPR; CRISPRa). Reviewed in ref. [104], with recent additions including CRISPR knock-in [106] and the repression system CRISPRoff [107].

Highly specific assays; multiple target genes can be modulated simultaneously and the introduced genomic changes are potentially reversible.

Can be challenging to design sgRNA proximal to the region of interest.

It is important to design multiple sgRNA for each target as they have variable efficiency.
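Why sgRNA design near a region of interest can be difficult is easier to see with a small sketch of the search itself. It assumes the commonly used SpCas9 enzyme, which requires an NGG protospacer-adjacent motif (PAM) immediately 3′ of a 20-nt target; the table does not specify a Cas9 variant, and the function name, distance cut-off and test sequence are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_guides(seq, target, max_distance=30):
    """List (strand, start, protospacer) for 20-nt sites next to an NGG PAM.

    `start` is the 0-based position of the protospacer on the forward
    strand; only sites whose midpoint is within `max_distance` bp of
    `target` are returned.
    """
    seq = seq.upper()
    guides = []
    # Forward strand: 20-nt protospacer followed by NGG
    for m in re.finditer(r"(?=([ACGT]{20})[ACGT]GG)", seq):
        guides.append(("+", m.start(), m.group(1)))
    # Reverse strand: appears on the forward strand as CCN + 20 nt
    for m in re.finditer(r"(?=CC[ACGT]([ACGT]{20}))", seq):
        guides.append(("-", m.start() + 3, revcomp(m.group(1))))
    return [g for g in guides if abs(g[1] + 10 - target) <= max_distance]


# Toy sequence containing a single forward-strand site
sequence = "ACGT" * 5 + "TGG" + "AAAA"
forward_hits = find_guides(sequence, target=10)
reverse_hits = find_guides(revcomp(sequence), target=17)
```

If no PAM falls close enough to the variant or element of interest, the list is empty; designing several guides per target, as recommended above, requires several hits within the window.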