TargetOrtho: A Phylogenetic Footprinting Tool to Identify Transcription Factor Targets

Lori Glenwinkel; Di Wu; Gregory Minevich; Oliver Hobert

doi:10.1534/genetics.113.160721

Genetics. 2014 Feb 20;197(1):61–76. doi: 10.1534/genetics.113.160721

PMCID: PMC4012501 PMID: 24558259

Abstract

The identification of the regulatory targets of transcription factors is central to our understanding of how transcription factors fulfill their many key roles in development and homeostasis. DNA-binding sites have been uncovered for many transcription factors through a number of experimental approaches, but it has proven difficult to use this binding site information to reliably predict transcription factor target genes in genomic sequence space. Using the nematode Caenorhabditis elegans and other related nematode species as a starting point, we describe here a bioinformatic pipeline that identifies potential transcription factor target genes from genomic sequences. Among the key features of this pipeline is the use of sequence conservation of transcription-factor-binding sites in related species. Rather than using aligned genomic DNA sequences from the genomes of multiple species as a starting point, TargetOrtho scans related genome sequences independently for matches to user-provided transcription-factor-binding motifs, assigns motif matches to adjacent genes, and then determines whether orthologous genes in different species also contain motif matches. We validate TargetOrtho by identifying previously characterized targets of three different types of transcription factors in C. elegans, and we use TargetOrtho to identify novel target genes of the Collier/Olf/EBF transcription factor UNC-3 in C. elegans ventral nerve cord motor neurons. We have also implemented the use of TargetOrtho in Drosophila melanogaster using conservation among five species in the D. melanogaster species subgroup for target gene discovery.

Keywords: C. elegans, cis-regulatory element, transcription factor

TRANSCRIPTION factors (TFs) and small RNAs represent the largest families of gene regulatory molecules in eukaryotes. Identifying target genes for these regulatory factors is a key challenge that remains to be solved. While targets of regulatory RNAs can often be inferred by sequence complementarity, there are no clearly delineated rules to de novo predict DNA sequence targets of DNA-binding domains of transcription factors.

In vitro techniques such as CASTing (cyclic amplification and selection of targets) (Wright et al. 1991), EMSA (electrophoretic mobility shift assay) (Hellman and Fried 2007), and multiple sequence comparisons between small sets of hand-picked cis-regulatory sequences, as well as in vivo techniques such as DNase-seq (Song and Crawford 2010) and ChIP-seq (Carey et al. 2009) or mutational analysis of transcription factor-regulated reporter genes, have allowed the derivation of high-information-content consensus-binding motifs for many transcription factors. While ChIP-seq allows for the genome-wide identification of transcription-factor-binding sites (TFBSs), in cases where the signal-to-noise ratio of TF binding is small, a certain level of nonfunctional TF binding is expected to occur, rendering it difficult to predict true regulatory targets with high confidence without utilizing additional predictive strategies.

Using a set of experimentally verified binding sequences, it is possible to build a representative position weight matrix (PWM) and to perform a purely bioinformatic genome-wide search for TF consensus sites. This approach provides a cost- and time-efficient alternative to in vivo experiments, and, with the accessibility of whole-genome sequence data, multiple species genomes are available for a comparative genomic analysis that utilizes conservation of binding sites between species. Strong purifying selection is expected to maintain binding elements in functional regions so that conservation of TFBS between species is predictive of function. While sequence conservation may suggest function, additional predictive criteria, including binding-site enrichment among orthologous regulatory regions together with expression profiling data or chromatin immunoprecipitation (ChIP) data, especially tissue-specific data, provide a multi-faceted approach for confident regulatory target gene prediction.
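As an illustration of the PWM-based search described above, the following is a minimal sketch and not the FIMO implementation that TargetOrtho uses. It assumes equal-length binding sites, a uniform background, a pseudocount of 1, and a forward-strand-only scan; the function names `build_pwm`, `score`, and `scan` are hypothetical.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=None):
    """Build a log2 likelihood-ratio PWM from equal-length binding sites."""
    # Assumption: uniform background frequencies unless supplied.
    background = background or {base: 0.25 for base in "ACGT"}
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total)
                            / background[base])
            for base in "ACGT"
        })
    return pwm

def score(pwm, window):
    """Log-likelihood ratio score of one window (A/C/G/T characters only)."""
    return sum(column[base] for column, base in zip(pwm, window))

def scan(pwm, sequence, threshold):
    """Return (position, window, score) for forward-strand windows at or above threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        s = score(pwm, window)
        if s >= threshold:
            hits.append((i, window, s))
    return hits
```

A real scanner would also score the reverse complement and handle ambiguous bases; both are omitted here for brevity.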

Existing tools such as the MEME suite (Bailey and Elkan 1994; Bailey et al. 2009), PhyloCon (Wang 2007), PhyME (Sinha et al. 2005), PhyloGibbs (Siddharthan et al. 2005), and EvoPrinter (Odenwald et al. 2005) that utilize sequence conservation for motif discovery as well as programs like EEL (Hallikas et al. 2006), which evaluate regulatory modules genome-wide without incorporating sequence conservation, are excellent resources for identifying a TFBS to build a PWM or for identifying novel target genes without considering conservation. However, these programs either do not provide a way to assess the novel regulatory targets of a given TF or do not include sequence conservation for functional prediction. TargetOrtho fills this gap by providing an alignment-free conservation assignment of orthologous motifs that is independent of motif orientation and outperforms pairwise alignment methods (Elemento and Tavazoie 2005; Gordân et al. 2010). This more relaxed definition of conservation accounts for the inherent degeneracy and orientation independence of TFBS so that variant nucleotides within a motif do not prevent conservation calls between species. Such strategies for target gene prediction have been implemented for specific TF regulatory target gene discovery (Aerts et al. 2006; Ward and Bussemaker 2008; Herrmann et al. 2012), but these approaches have not been applied to the automated prediction of TF regulatory target genes from user-defined PWMs together with a target gene ranking system that accounts for the degree of motif match conservation, quality, and frequency for target gene prediction. For an overview of target gene prediction strategies, see Aerts et al. 2012.

We have previously described one framework for the application of an exhaustive in silico approach for the identification of transcription factor target genes using experimentally derived consensus-binding sites together with an alignment-free assignment of conservation across multiple species genomes (Bigelow et al. 2004). This application, called CisOrtho, compared genome scans of two distinct nematode genomes. We describe here a number of significant expansions to this original pipeline. The new pipeline, TargetOrtho, includes (1) an expansion from the PWM search of two genomes to that of the genomes of five species (see “genomes” in Supporting Information, File S1); (2) region-specific and alignment-independent conservation assignments controlled by user-defined positional conservation constraints between orthologous motif matches; (3) display of binding-site frequency by gene region and cross-species motif match score-based filtering by gene region; (4) the option to restrict motif location relative to the first or last exon of a gene; and (5) the ability to display predicted binding sites on standard genome browsers including the WormBase and FlyBase Gbrowse tools in the form of bed-formatted genome browser track files where sites are shaded according to predicted binding-site strength as derived from the binding-site log-likelihood ratio score. The new ranking scheme used by TargetOrtho can be finely tuned by the user by scaling the weight of a given filtering criterion. Moreover, we have expanded TargetOrtho to include an option to search each genome against up to five co-occurrences of TFBSs using up to five predetermined PWMs for the discovery of conserved, enriched cis-regulatory modules (CRMs). The CRM option allows the user to restrict the nucleotide distance between TFBSs in the same gene region as well as the order of the TFBSs by using the order from the user’s uploaded input motifs.
Further filtering may be applied through user-selected query lists that restrict the results or report specifically on a subset of genes such as putative target genes determined through expression-profiling experiments, ChIP-chip/ChIP-seq data, or gene ontology associations. Finally, TargetOrtho can now be used for target gene discovery in both Caenorhabditis and Drosophila species.

Materials and Methods

Ortholog assignments

Nematode ortholog assignments based on Ensembl COMPARA (Vilella et al. 2009), which predicts orthology of the longest isoform based on homology as well as on conserved gene order, were downloaded using BioMart WS220 datasets (Smedley et al. 2009). The melanogaster subgroup ortholog assignments were downloaded from FlyBase precomputed data files (http://flybase.org/static_pages/downloads/bulkdata7.html, version: gene_orthologs_fb_2013_03.tsv.gz).

Gene coordinates

Exon and gene coordinates for nematode genomes were parsed from gff3 annotation files (current versions: C. elegans, WS220; Caenorhabditis briggsae, WS234; Caenorhabditis brenneri, WS234; Caenorhabditis remanei, WS234; Caenorhabditis japonica, WS234) downloaded from WormBase’s FTP site (ftp://ftp.wormbase.org/pub/wormbase/). Exon and gene coordinates for fly genomes were parsed from exon sequence files (fasta) downloaded from FlyBase precomputed data files (http://flybase.org/static_pages/downloads/bulkdata7.html). Current genome versions include the following: Drosophila melanogaster, r-5.1; Drosophila yakuba, r-1.3; Drosophila erecta, r-1.4; Drosophila simulans, r-1.4; and Drosophila sechellia, r-1.3.

Source code

TargetOrtho employs the FIMO (Grant et al. 2011) tool from the MEME suite (Bailey et al. 2009) for genome-wide motif scanning. Motif matches are associated with genes using an ANSI C++ script written by Henry Bigelow. All other TargetOrtho scripts were written in Python (or XML for the Galaxy interface scripts) by L. A. Glenwinkel.

Data analysis

Outcomes of comparison tests were determined using the Mann–Whitney–Wilcoxon test as implemented in Python’s scipy.stats module. q-values for multiple testing corrections were calculated as in Storey and Tibshirani (2003). P-values were accepted as significant if the corresponding q-value was <0.05; the q-value represents the minimum false discovery rate that is incurred when calling that test significant.
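To make the q-value threshold concrete, the sketch below converts a list of P-values into q-values. It is a simplification and not the authors' code: Storey and Tibshirani estimate the proportion of true null hypotheses (pi0) from the P-value distribution, whereas this sketch fixes pi0 = 1 by default, which makes the result identical to Benjamini-Hochberg adjusted P-values. The function name `q_values` is hypothetical.

```python
def q_values(p_values, pi0=1.0):
    """Return q-values in the same order as the input P-values.

    Assumption: pi0 (fraction of true nulls) is supplied by the caller;
    with pi0 = 1.0 this is the Benjamini-Hochberg adjustment.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so each q-value is the minimum
    # FDR estimate over all thresholds at or above that P-value.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * p_values[idx] * m / rank)
        q[idx] = running_min
    return q

def significant(p_values, alpha=0.05):
    """Flag tests whose q-value is below alpha."""
    return [q < alpha for q in q_values(p_values)]
```

With pi0 fixed at 1 the estimate is conservative; estimating pi0 from the data, as in the cited method, can only lower the q-values.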

For each test, motif matches in the set of previously validated transcription factor target genes were compared to a set of 1000 random coding genes for each ranking criterion in each gene region. Six unique gene regions were analyzed (upstream, intron, exon, downstream, best site of any region, and upstream plus intron) for each of eight ranking criteria (C. elegans site score, C. elegans averaged region score, C. elegans site frequency, averaged species site score, averaged species region score, averaged species site frequency, site conservation, and site offset variance measured as the coefficient of variation). In addition, four total gene-ranking criteria (C. elegans averaged gene score, C. elegans total site frequency per gene, averaged species averaged gene score, and averaged species site frequency per gene) and the cumulative site score derived from all criteria per region were analyzed (see Figure S2 for an overview of the results of all tests in all regions).

For each ranking criterion in each gene region, the best motif match value was used for the comparison between groups when several values were present. For example, the best upstream motif-match log-likelihood score per gene region was compared between transcription factor-dependent genes and 1000 random coding genes. Additionally, cumulative site scores derived from upstream and intronic data were compared between previously validated target genes and random genes.

Wilcoxon rank-sum tests were used to compare ventral nerve cord neuron counts in wild-type or unc-3(e151) worms (Table S1). See Table S5 for all input parameters used for motif analysis with TargetOrtho.

Gene Ontology term analysis

Gene ontology (GO) enrichment analysis was done using the web-based GOrilla tool (Eden et al. 2007, 2009) using the single list of ranked genes option with a P-value threshold of 10⁻³ using slow mode. See Table S12 for full GO term analysis results. Genes in each ontology category were binned according to the best TargetOrtho upstream or intronic site rank per gene and plotted showing the number of genes in each TargetOrtho ranking bin for selected ontology terms.

Reporter constructs

GFP fusions were generated as in Hobert (2002). The VL6 and BC14284 strains were provided by the Caenorhabditis Genetics Center, which is funded by the National Institutes of Health Office of Research Infrastructure Programs (P40 OD010440). See Table S1 for strain details.

Availability

The TargetOrtho package is available as a command line tool or for installation as a Galaxy tool (Goecks et al. 2010). The Galaxy option offers an accessible way to use TargetOrtho on any platform via Galaxy’s web hosting option (http://wiki.galaxyproject.org/Admin/Get%20Galaxy). See http://hobertlab.org/targetortho/ for general usage and availability.

Results

To expand the known repertoire of TF target genes for a better understanding of diverse biological processes, we have engineered a bioinformatic pipeline allowing for robust target gene prediction. We first describe the program architecture for the discovery of novel TF target genes as well as target genes regulated by CRMs whereby multiple TFBSs work in concert. In the following sections, we then examine individual criteria for ranking TFBSs across entire genomes and show that, for three motifs with extensive in vivo-validated target genes, these criteria are robust predictors of real target genes. Because the regulatory logic of in vivo TF binding is not well understood, we implement user-defined adjustments for each of the ranking criteria chosen. We show that the strategy of combining binding-site data from the genomes of multiple species is justified as it drastically improves target gene prediction. Finally, we show that our pipeline further improves target gene prediction by combining the averaged species ranking data into one final cumulative site score for each predicted binding site in the genome.

Features of TargetOrtho

General overview of the pipeline:

TargetOrtho provides a comparative genomic approach for the identification of transcription factor target genes for which a collection of binding sites, represented as the PWM, has been experimentally identified. The pipeline is executed in four steps (or five if multiple input PWMs are used). Briefly, genomes of five species are searched for motif matches against a PWM in MEME plain text format (see MEME documentation at http://meme.nbcr.net/meme/doc/meme-format.html and http://meme.nbcr.net/meme/doc/examples/meme_example_output_files/meme.html) derived from experimentally validated binding sites using the FIMO (Grant et al. 2011) motif scanner. Sites from each species are then associated with the nearest exon in the upstream and downstream direction and matched to orthologous regions in the reference genome (currently, C. elegans or D. melanogaster). Finally, filtering and ranking criteria are applied to each reference genome motif match, resulting in a ranked list of sites and their associated target genes. TargetOrtho output consists of browsable HTML tables, tab-delimited text files, and bed-formatted genome browser track files along with a compressed folder containing all results for download (Figure 1 and Table S2). The execution of TargetOrtho is facilitated by Galaxy (Goecks et al. 2010), a general bioinformatics workflow management system in which results are automatically browsable and available for download and sharing from any platform (Figure 2). TargetOrtho can also be installed locally and executed via the command line as a stand-alone program or added as a tool to a locally hosted Galaxy instance (see http://galaxyproject.org). See File S1 for a detailed program overview.
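The alignment-free conservation step can be sketched as follows. This is an illustrative reconstruction and not the TargetOrtho source: the data layout (`hits` keyed by species, then by gene and region), the gene names, and the choice to count the reference species itself are assumptions.

```python
def conservation_count(ref_gene, region, hits, orthologs, reference="C. elegans"):
    """Count species with at least one motif match in the given region of
    ref_gene (reference species) or of its ortholog (other species)."""
    species_with_site = 0
    for species, species_hits in hits.items():
        if species == reference:
            gene = ref_gene
        else:
            # Look up the ortholog of the reference gene in this species.
            gene = orthologs.get(species, {}).get(ref_gene)
        if gene is not None and species_hits.get((gene, region)):
            species_with_site += 1
    return species_with_site

# Hypothetical motif-match offsets keyed by (gene, region) for each species.
hits = {
    "C. elegans": {("geneX", "upstream"): [-500]},
    "C. briggsae": {("CBG_X", "upstream"): [-480, -1200]},
    "C. remanei": {("CRE_X", "intron"): [150]},
}
orthologs = {
    "C. briggsae": {"geneX": "CBG_X"},
    "C. remanei": {"geneX": "CRE_X"},
}
```

Because each genome is scanned independently, no sequence alignment is needed; only the ortholog table links the per-species hit lists.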

Overview of TargetOrtho pipeline. Beginning with one to five input position weight matrices (PWM_j, j = 1–5, in MEME plain text format) and an optional query list with genes of interest, five species genomes (top orange box) are scanned with the motif scanner FIMO, resulting in one motif match hit table per genome i (i = 1–5). Each site is then associated with an exon, followed by ortholog pairing between the reference species and each species-associated site. Orthologous sites are then ranked according to the TargetOrtho ranking criteria. If more than one input PWM is specified, promoters having at least one motif match for each PWM are filtered to a *cis*-regulatory module table. All results are output as tab-delimited text files, html browsable files, and bed format genome browser files.

Galaxy screenshots. (A) TargetOrtho user interface hosted by Galaxy. The TargetOrtho tool is shown in the Galaxy tool (left). Two TargetOrtho input files are shown in the History (right). (1) A motif file in Meme version 4 format and (2) a user-defined list of genes in plain text format. These files are uploaded using the “get data” tool built into the Galaxy platform. Adjustable TargetOrtho parameters are shown (middle). (B) TargetOrtho Results screenshot. Upon job completion, two TargetOrtho output files appear in the History (right): TargetOrtho browse results (html/text) is selected and shown in the middle. The top-ranked site per gene table (html version) is displayed along with a link to browse all TargetOrtho output files. A second result file in the History allows for a single-click local download of all results as a compressed directory. (C) TargetOrtho Summary statistics plots are included in the results directory as html files and may be viewed from the Galaxy interface or locally from the downloaded results files. (Top left) Site distribution by conservation. Blue shows all unique motif matches; yellow shows the number of candidate target genes. (Top middle) Species representation among all motif matches. (Top right) Site count by gene region. (Bottom left) Target gene frequency by gene region. (Bottom middle) Log-likelihood motif score distribution by species. (Bottom right) Site positional distribution by species conservation. See Table S2 for additional TargetOrtho results descriptions.

Adjustable program features:

TargetOrtho includes several adjustable features (Figure 3 and Table S3): (1) two reference genomes are available for target gene discovery. The C. elegans option includes searches across five species of the Caenorhabditis genus, while a D. melanogaster option includes genome-wide comparative searches across five melanogaster subgroup species. A reference genome is defined as the genome from which candidate TF target genes are reported (see “genomes” in Supporting Information, File S1); (2) the distance between distinct motif matches (Figure 3C) and linear motif order for CRM searches (see CRM searches for multiple motifs); (3) the offset variance (Figure 3F) of orthologous motif matches to constrain the positional conservation of a motif match (see “orthology matching” in Supporting Information, File S1); (4) the upstream (Figure 3D) and downstream (Figure 3E) motif match distance from the first or last adjacent annotated exon; (5) the number of intervening genes between a motif match and an associated gene may be constrained as well as the intervening distance from the associated gene allowed if annotated genes are positioned between a motif match and its associated gene; and (6) the cumulative site score is constrained by scaling options to weight each site ranking criterion (Figure 4C and Table S3). For example, if the motif frequency among orthologous gene regions is important, the user may up-weight this factor to inflate the effect of the motif frequency on the cumulative site score. See Table S3 for a description of all adjustable features.

TargetOrtho input parameters. (Right) TargetOrtho user interface hosted on the Galaxy platform. Select TargetOrtho input parameters are shown. (Left) Graphical representation of select input parameters from right panel. The tool interface on Galaxy shows all default values, which may be changed by the user. Default values may also be viewed from the command line tool by using the command “python TargetOrtho.py -h.” Each input parameter should be adjusted for the individual input motif(s) by the user. See Table S3 for a description of all adjustable input parameters and default values. (A) Example TargetOrtho input motifs for target gene discovery of genes with co-occurrences of motif A and motif B. TargetOrtho takes this input as a MEME version 4 input motif file (http://meme.nbcr.net/meme/doc/meme-format.html) with up to five input motifs. (B) Example of gene query list input file in plain text format showing a subset of user-defined genes for TargetOrtho to specifically report on. Gene names must be in gene public name format (*unc-3*, *ttx-3*) when available; otherwise, transcript names (C09G1.4, F08D12.1) may be used for *C. elegans*. FlyBase gene IDs in the form FBgn must be used for *D. melanogaster* gene names. These may include suspected transcription factor target genes of interest or serve as a negative control list of genes that are expected to be transcription factor independent. An option to report only data for these genes is available (right: “only report query list results”); otherwise, whole-genome results are reported with additional reporting on the query list gene results. (C) The maximum distance between motifs (for more than one motif query only). This option constrains the allowed distance between any two motifs from the motif input file. If five motifs are used as input, this distance limits the distance between any two adjacent motifs where the adjacent motifs are from separate entries in the motif input file.
This does not preclude the user from specifying a search for identical motifs in the input file. For example, one may choose to search for target genes having at least two occurrences of motif A in the upstream region. To accomplish this, the user would include motif A two times in the input file. The order of motifs in a given gene region may also be constrained by selecting “ordered” or “unordered” for the “Order of motifs” parameter. If ordered is chosen, co-occurrences of motifs must be positioned in the order given in the motif input file. For example, if motif A, motif B, and motif C are included in the input file with the ordered option, all target gene candidates must have these three motifs in the order motif A, motif B, motif C or in the order motif C, motif B, motif A among all orthologous gene regions for a candidate target gene to be included in TargetOrtho output. (D) The maximum upstream distance that a motif may be positioned for target gene association. (E) The maximum downstream distance that a motif may be positioned for target gene association. In addition to the maximum upstream and downstream distance, the number of intervening genes allowed between any motif match and associated gene, as well as the cutoff distance from the first ATG allowed if any intervening genes are positioned between a motif match and the associated gene, may be specified by the user (right). (F) The maximum offset variance constrains the positional variance allowed between orthologous motif matches between species. The offset variance is calculated by taking the absolute value of the coefficient of variation of each motif match offset (shown as the distance from the first annotated exon of the associated gene) in each species. See *Orthology Matching* section in File S1 for detailed explanation. See Table S3 for explanations of all adjustable parameters.
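The offset variance in (F) can be illustrated with a short sketch. It is not taken from the TargetOrtho source: the use of the population standard deviation, the exhaustive search over candidate groupings, and the names `offset_variance` and `best_grouping` are assumptions made for illustration.

```python
import statistics
from itertools import product

def offset_variance(offsets):
    """Absolute coefficient of variation (stdev / mean) of motif-match offsets.

    Assumption: population standard deviation; offsets are signed distances
    from the first (or last) exon, for example -500 for an upstream site.
    """
    mean = statistics.mean(offsets)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return abs(statistics.pstdev(offsets) / mean)

def best_grouping(ref_offset, candidates_by_species):
    """Pick one candidate offset per species that minimizes the offset variance
    together with the reference site. Exhaustive search, for illustration only."""
    best = None
    species = list(candidates_by_species)
    for combo in product(*(candidates_by_species[s] for s in species)):
        variance = offset_variance((ref_offset,) + combo)
        if best is None or variance < best[0]:
            best = (variance, dict(zip(species, combo)))
    return best
```

A grouping whose offset variance exceeds the user-defined maximum would then be rejected as not positionally conserved.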

TargetOrtho ranking criteria. Each orthologous gene region per species is divided into upstream, intragenic [intron (green line) and exon (green box)], and downstream regions. (A) Log-likelihood score ranking criteria. Individual predicted binding sites (orange bubbles) are overlaid with the site score. “Site score” (black numerals): the log-likelihood ratio score of an individual motif match. “Average species site score” (orange numerals): the averaged site score across orthologous regions between species where each reference species site is matched to the positionally best-matched orthologous-species-region motif match. Best matches are determined by grouping sites across species and filtering for the best offset (site position relative to exon 1 for upstream sites or the last exon for downstream sites). See “offset variance.” “Region score” (blue numerals): the average site score within each species across a given region. “Average species region score” (purple numerals): the region score averaged across species for a given region. “Gene site score” (blue numerals): the averaged site score across all regions searched for each species. “Average species gene site score” (purple numerals): the gene site score averaged across species. Conservation (orange numerals): Alignment-independent site conservation is determined by the number of species with at least one predicted binding site in an orthologous region to the reference species motif match. (B) Motif match frequency and position ranking criteria. Individual predicted binding sites (orange) are overlaid with the site offset. “Offset” (black numbers) refers to the site position relative to exon 1 for upstream sites or the last exon for downstream sites. “Offset variance” (orange numerals): the absolute value of the coefficient of variation of the offsets for each matched orthologous motif match between species. 
Smaller values indicate increased positional constraint compared to motif matches that are differentially positioned between species. “Site count” (blue): The number of predicted binding sites in a given region per species. Averaged species region site count (purple): The site count averaged across orthologous species regions where the region shown is upstream. “Gene site count” (blue): The total site count across all regions of a gene including upstream, intragenic, and downstream (when included) for each species. “Average species gene site count” (purple numerals): the gene site count averaged across all orthologous regions of an associated gene between species. (C) Ranking criteria and cumulative site score per predicted binding site. Column 1, “TargetOrtho ranking criteria per site” indicates the ranking criteria used to calculate the final cumulative score for each predicted binding site (orange) in the reference genome. Column 2, “Raw score”: raw values for each ranking criteria described in A and B. Column 3, “Normalized score”: Each raw value from column 2 is normalized between 0 and 100 using the minimum and maximum value unique to each motif across the genomes. Column 4, “Average normalized score”: The final cumulative score assigned to each predicted binding site in the reference genome is calculated by averaging the normalized scores in column 3. Column 5, “Site rank”: The rank order of each predicted binding site taken by ordering each predicted site in the reference genome by the cumulative score.

CRM searches for multiple motifs:

TargetOrtho includes an option to search each genome against up to five co-occurrences of transcription-factor-binding sites using up to five predetermined PWMs for the discovery of conserved, enriched CRMs. In addition to the filtering applied to individual genome-wide searches, the CRM option allows the user to restrict the nucleotide distance between TFBSs in the same gene region as well as the order of the TFBS by using the order from the user’s uploaded motifs (Figure 3C and Table S3). CRM target genes are scored by averaging the adjustable cumulative site score of each component motif (see Adjustable cumulative site score).
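A minimal sketch of the two CRM constraints and of the CRM score follows. It assumes one chosen site per input motif, measures distance between site start positions, and uses hypothetical function names; the actual TargetOrtho logic may differ in these details.

```python
def satisfies_crm(positions, max_distance, ordered):
    """Check CRM constraints for one site per motif.

    positions[k] is the coordinate of the site for the k-th motif in the
    input file. Adjacent sites along the sequence must lie within
    max_distance of each other; if ordered, the motifs must appear in input
    order or in exactly reversed order.
    """
    by_position = sorted(range(len(positions)), key=lambda k: positions[k])
    coords = [positions[k] for k in by_position]
    if any(b - a > max_distance for a, b in zip(coords, coords[1:])):
        return False
    if ordered:
        forward = list(range(len(positions)))
        return by_position == forward or by_position == forward[::-1]
    return True

def crm_score(cumulative_site_scores):
    """CRM score: the mean of the component motifs' cumulative site scores."""
    return sum(cumulative_site_scores) / len(cumulative_site_scores)
```

Accepting the reversed order reflects the orientation independence described for ordered searches (motif A, B, C or motif C, B, A).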

Binding-site ranking criteria for prediction of regulatory target genes:

After conservation assignment, additional criteria were assessed for each site in each genome for eventual cumulative score calculations and final site ranking. Generally, the cumulative site score used for site ranking is determined for each site in a reference genome (C. elegans or D. melanogaster) according to its binding strength as represented by the log-likelihood ratio score and binding-site frequency associated with the target gene. Each site score and site count is averaged across species for use in the cumulative site-score calculation. Specifically, each site is ranked by the averaged species site score (Figure 4A), the averaged species region score (Figure 4A), the averaged species gene score (Figure 4A), the site conservation (Figure 4A), the offset variance (Figure 4B), the averaged species region site count (Figure 4B), and the averaged species gene site count (Figure 4B). Each site in the reference genome is ranked individually using these ranking criteria.

For example, as shown in Figure 4, consider the site Y with a log-likelihood site score of 7.2 found at −500 nucleotides upstream of a gene and conserved in five species. The averaged species site score of 7.2 is determined by grouping site Y with one orthologous site in each genome and then averaging the site scores across species where site grouping is determined using the minimum positional offset variance (0.07) from the first exon of gene X. The offset variance is also used for site ranking. The averaged species region score of 7.37 is determined by first averaging the site score across the upstream region of gene X as well as the orthologous upstream regions in each species and then averaging this value across species, where the upstream distance is constrained by the user. The averaged species gene score (6.54) is determined by averaging the site score across all gene regions—in this case, the upstream, intron, exon, and downstream regions—for each orthologous gene and then averaging this value across species. An analogous strategy is applied for the site frequency; in this case, the averaged species upstream site count (1.4) and averaged species gene site count (2.6) of gene X. Finally, these criteria are used to generate a final cumulative site score (see Adjustable cumulative site score) of 73.84 for TargetOrtho site ranking.
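The two-stage averaging in this example (first within a region, then across species) can be sketched as below. The per-species site scores are hypothetical values chosen only so that the averaged upstream site count comes out at 1.4; they are not the values behind Figure 4. Skipping species without sites when averaging scores is also an assumption.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def averaged_species_region_score(scores_by_species):
    """Average site scores within each species' region, then across species.
    Assumption: species with no site in the region are skipped."""
    return mean(mean(s) for s in scores_by_species.values() if s)

def averaged_species_site_count(scores_by_species):
    """Average number of sites per species in the region (zeros included)."""
    return mean(len(s) for s in scores_by_species.values())

# Hypothetical upstream site scores for gene X and its four orthologs.
upstream = {
    "species1": [7.2, 6.8],
    "species2": [7.5],
    "species3": [7.0],
    "species4": [8.1, 7.3],
    "species5": [7.4],
}
```

The same two functions applied to all regions of a gene instead of a single region would give the gene-level score and site count.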

Adjustable cumulative site score:

Individual ranking criteria are combined into a single cumulative site score for each site in the reference genome, providing a list of target gene candidates. The cumulative site score is generated as:

$$\text{cumulative site score} = \frac{\sum_{i=1}^{n} \frac{c_{i} - b_{i}}{a_{i} - b_{i}} \cdot 100 \cdot \omega_{i}}{j}$$

where c_i is the raw value of the ith ranking criterion out of n total ranking criteria, a_i is the maximum value from all c_i in a given TargetOrtho search, b_i is the minimum value from all c_i in a given TargetOrtho search, $ω_{i}$ is an optional scaling factor applied to each ranking criterion (default $ω_{i}$ = 1), and j is the number of ranking criteria where $ω_{i}$ > 0. Sites that are found only in the reference genome, and hence are unconserved, were assigned a cumulative site score of zero so that they are automatically ranked last but are still displayed in the TargetOrtho results.
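The cumulative site score can be computed as in the following sketch, which applies the definitions above literally: min-max normalization to 0–100, multiplication by the scaling factor, and division by the number of criteria with a positive scaling factor. The dictionary-based interface, the criterion names, and the handling of a zero range (a_i = b_i) are assumptions; the text also does not say how criteria for which smaller is better, such as offset variance, are oriented, so no inversion is applied here.

```python
def cumulative_site_score(raw, minimum, maximum, weights=None):
    """Weighted mean of min-max normalized ranking criteria for one site.

    raw, minimum, maximum: dicts keyed by criterion name holding c_i, b_i, a_i.
    weights: optional dict of scaling factors (default 1.0 for every criterion).
    """
    if weights is None:
        weights = {name: 1.0 for name in raw}
    j = sum(1 for name in raw if weights[name] > 0)
    if j == 0:
        return 0.0
    total = 0.0
    for name, c in raw.items():
        a, b = maximum[name], minimum[name]
        # Assumption: a criterion with no spread contributes a normalized 0.
        normalized = 0.0 if a == b else (c - b) / (a - b) * 100
        total += normalized * weights[name]
    return total / j
```

Setting a scaling factor to 0 removes a criterion from both the sum and the divisor, so the remaining criteria are still averaged on the 0–100 scale.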

In detail, each motif match is ranked by first determining the average species site counts and averaged species site scores across the associated gene; then each site in the reference genome is ranked by normalizing each ranking criterion value between 0 and 100 and by averaging the normalized values for each site to obtain the final cumulative site score for sites that are present in at least two orthologous genome regions. Each normalized criterion score may be weighted to affect the cumulative site score according to user preferences (option -A, -B, -C, -D, -E, -F, -G for average species site score, average species region score, average species gene score, average species gene site count, conservation, and offset variance, respectively, where options A–G may be any real number) (Figure 4C; TargetOrtho ranking criteria). Weighting specific ranking criteria may be of interest when prior information is available as to the nature of each ranking criterion in experimentally validated TF target genes. The default strategy of evenly weighting each ranking criterion in the computation of the cumulative site score results in significantly better cumulative site scores in validated target genes compared to random genes in our analysis of three well-characterized C. elegans TFBSs.

Program output:

TargetOrtho results include a top-ranked-per-gene table showing the best-ranked site for each associated gene, as well as an all-conserved-hits-ranked table showing every ranked motif match for every candidate target gene. Each site is assigned a rank according to its cumulative site score, with the best cumulative site score assigned a rank of 1. Additionally, results tables with all hit-gene associations are included for each species and each motif, along with genome browser track files in bed format. All TargetOrtho outputs are described in Table S2 and Figure 5.
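The relationship between the two ranked tables can be sketched as follows. This is an illustrative sketch only: the record fields (`gene`, `site_id`, `score`) and the tie-breaking rule (input order) are assumptions, not a description of the actual TargetOrtho output format.

```python
def rank_sites(sites):
    """Return (all_ranked, top_per_gene) from a list of site records.

    Each record is a dict with hypothetical keys 'gene', 'site_id', 'score'.
    The highest cumulative site score receives rank 1.
    """
    # sorted() is stable, so tied scores keep their input order (assumption).
    ordered = sorted(sites, key=lambda s: s["score"], reverse=True)
    all_ranked = [dict(s, rank=i) for i, s in enumerate(ordered, start=1)]

    # The per-gene table keeps only the first (best-ranked) site of each gene.
    top_per_gene = []
    seen = set()
    for site in all_ranked:
        if site["gene"] not in seen:
            seen.add(site["gene"])
            top_per_gene.append(site)
    return all_ranked, top_per_gene


sites = [
    {"gene": "geneA", "site_id": "a1", "score": 40.0},
    {"gene": "geneB", "site_id": "b1", "score": 85.0},
    {"gene": "geneA", "site_id": "a2", "score": 90.0},
    {"gene": "geneC", "site_id": "c1", "score": 0.0},  # unconserved site
]
all_ranked, top_per_gene = rank_sites(sites)
print([(s["site_id"], s["rank"]) for s in top_per_gene])
```

Because unconserved sites carry a score of zero, they fall to the bottom of both tables while remaining visible.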

TargetOrtho output example. (A) TargetOrtho top-ranked site per gene table in HTML format. This table is a subset of the “All-conserved-hits-ranked” table (see Table S2 for descriptions of all TargetOrtho output files) showing only the best-ranked site in each candidate target gene as opposed to showing data for every site in every candidate target gene. Each table row shows data for one motif match in the reference genome (*C. elegans* or *D. melanogaster*) with an option to expand the row to show data for other species. Each top-ranked site shown in the table also includes information about overall site count for the corresponding region and total site count across the entire putative target gene. Additionally, average site scores per region and per gene are shown for each table entry. To see all sites in a gene, consult the “All conserved hits ranked” table. See Figure 4C legend for explanations of column values. (B) Wormbase Gbrowse screenshot of TargetOrtho results. Genome browser track files are output in bed format for viewing predicted binding sites in standard genome browsers. Higher scoring binding sites are shaded darker grey than lower scoring sites. See Table S2 for a description of all TargetOrtho results files.

Validation of TargetOrtho using experimentally identified target genes

Strategy:

Using three well-characterized TFBSs from C. elegans and in vivo validation of TargetOrtho predicted target genes, we find that the interspecies motif match score (log-likelihood ratio score), motif conservation, and frequency of TFBSs among orthologous gene regions are successful predictors of TF regulatory target genes. The three TFBSs used for validation of TargetOrtho include the UNC-3-binding site (UNC-3 motif), bound by the terminal selector for cholinergic motor neuron fate in the ventral nerve cord, UNC-3 (Kratsios et al. 2012); the TTX-3/CEH-10 heterodimer-binding site (AIY motif), the terminal selector motif for the AIY interneuron (Wenick and Hobert 2004); and the CHE-1-binding site (ASE motif) required for terminal specification of the chemosensory ASE gustatory neurons (Etchberger et al. 2007). Several dozen experimentally validated target genes that contain binding sites for the respective transcription factors have previously been identified. TargetOrtho ranking criteria were compared between TF-dependent genes and 1000 random coding genes for each motif (Figure S1). For a detailed explanation of the data sets and motif construction as well as data set verification bias corrections, see File S1.

Cumulative site scores in upstream and intronic regions better predict regulatory targets of TFs than sites in other gene regions:

To assess the predictive value of different gene regions, cumulative site scores derived from data from upstream, upstream + intron, exon, downstream, or the best cumulative site score from any gene region were compared in TF-dependent genes and random coding genes. We find that TF-dependent gene motif matches perform best when upstream and intronic regions are combined to generate the cumulative site score for all analyses performed. Cumulative site scores derived from upstream or upstream + intronic regions resulted in greater differences between TF-dependent genes and random coding genes compared to cumulative site scores derived from other gene regions (Figure S2).

Individual ranking criteria as well as cumulative site scores derived from averaged species data better predict verified TF target genes compared to ranking criteria from a single genome:

To assess the predictive value of individual binding-site ranking criteria derived from multiple species as opposed to using a single genome for target gene prediction, we compared individual ranking criteria derived from C. elegans data alone or data derived from multiple species. We find that each individual criterion averaged across species shows greater discrimination between TF-dependent gene sites and random gene sites. Comparison tests for individual TargetOrtho site ranking criteria (Figure 6, Figure S3, Figure S4, Figure S5, Figure S6, and Figure S7) suggest that averaging multiple species data (Figure 6, A′–G′) results in more significant differences between criteria in TF-dependent genes and random coding genes compared to ranking criteria data from the reference genome alone (Figure 6, A–E and K). Also see Table S6, Table S7, Table S8, Table S9, Table S10, and Table S11, and corresponding Figure S3, Figure S4, Figure S5, Figure S6, and Figure S7.
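To make the kind of comparison described above concrete, the sketch below builds empirical cumulative distribution functions (CDFs) for a ranking criterion in two gene sets and reports the largest vertical gap between them. That gap is the two-sample Kolmogorov–Smirnov statistic; it is used here only as an illustration, and this section does not state that it is the test behind the reported P-values. All numbers are made up.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the fraction of values <= x."""
    data = sorted(sample)
    n = len(data)

    def cdf(x):
        return bisect_right(data, x) / n

    return cdf


def max_cdf_gap(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    cdf_a, cdf_b = ecdf(sample_a), ecdf(sample_b)
    # The gap can only change at observed values, so checking those suffices.
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(cdf_a(x) - cdf_b(x)) for x in points)


# Hypothetical criterion values for known target genes and random genes.
target_values = [70.0, 80.0, 90.0]
random_values = [10.0, 20.0, 30.0, 80.0]
print(max_cdf_gap(target_values, random_values))  # 0.75
```

A larger gap means the criterion separates the two gene sets more cleanly, which is the qualitative pattern the CDF plots are meant to show.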

UNC-3 motif analysis. *unc-3*-dependent target gene data (blue) compared to random coding gene data (gray). The set of previously characterized *unc-3*-dependent genes and 1000 random coding genes were submitted to TargetOrtho using the UNC-3 motif as input (Figure S1A). Data distributions for each TargetOrtho ranking criterion were compared between known target genes and random coding genes. CDF plots of individual ranking criteria (plots A–E and plots A′–G′): CDF plots are shown for individual ranking criteria A–E and A′–G′. TargetOrtho ranking criteria derived from averaged species data (A′–G′) better distinguish previously validated TF target genes from random genes compared to using *C. elegans* (reference genome) data alone (A–E). CDF plots A–E show only ranking criteria derived from *C. elegans* genome data while CDF plots A′–E′ show the corresponding ranking criteria derived from averaged species data. CDF plots F′ and G′ show averaged species data having no reference genome counterpart, including the conservation and offset variance data distributions. CDF plots of cumulative site scores (plots H and I and plots H′–J′): Data distributions for cumulative site scores derived from unique combinations of TargetOrtho ranking criteria are shown in CDF plots H, I, H′, I′, and J′. CDF plot H shows the cumulative site score distributions derived from *C. elegans* upstream and intronic data only calculated from A–C. (Left) Plots A′-C′ show the cumulative site score CDF plots calculated from the corresponding averaged species upstream and intronic data. CDF plot I shows cumulative site scores derived from criteria shown in CDF plots A–E where CDF plots D and E represent total gene ranking criteria in *C. elegans* only. (D) *C. elegans* averaged upstream and intronic site scores. (E) *C. elegans* averaged site score across all gene regions. 
CDF plot I′ (left) shows the data distribution of cumulative site scores derived from A′–E′ where CDF plots D′ and E′ represent the corresponding total gene ranking criteria averaged across species. CDF plot J′ shows cumulative site scores derived from all averaged species ranking criteria (A′–G′). (K) −Log₁₀ (P-value) for each ranking criteria comparison test where transcription-factor-dependent genes were compared to 1000 random coding genes. Compare *C. elegans* data A–E to average species data A′–E′ plus F′ and G′. (L) –Log₁₀ (P-values) for each comparison test where cumulative site scores in transcription-factor-dependent genes are compared to scores in random coding genes. Compare *C. elegans*-derived cumulative site score (H and I) to averaged-species-derived cumulative sites scores (H′, I′, and J′).

To assess the predictive value of cumulative site scores derived from averaged species data compared to scores derived from a single species, we compared cumulative site scores in TF-dependent genes to scores in random genes for both cases. Generating the cumulative site score from combined averaged species data (Figure 6, H′–J′; Figure S3, Figure S4, Figure S5, Figure S6, and Figure S7) increases the significance of the difference between TF target gene sites and random gene sites compared to building cumulative site scores from the upstream and intronic site information in the reference genome alone (Figure 6, H, I, and L). Also see corresponding Figure S3, Figure S4, Figure S5, Figure S6, and Figure S7.

GO enrichments include relevant TF target genes for further investigation:

To demonstrate the utility of TargetOrtho predictions in finding biologically relevant target genes, GO enrichments among top-ranked target genes were assessed. GO analysis was performed on TargetOrtho’s top-ranked sites per gene for whole-genome runs using upstream and intronic gene regions with the UNC-3 motif, AIY motif, and ASE motif using the GOrilla tool (Eden et al. 2007, 2009). The resulting ontologies among highly ranked predicted TF target genes show enrichments in neurogenesis pathway genes for all three terminal selector genes, providing ample candidates for further in vivo experimentation (Figure S8 and Table S12).

Validation of TargetOrtho through identification of novel UNC-3 target genes

For in vivo validation of TargetOrtho, 13 highly ranked potential UNC-3 target genes (Figure 7 and Table S4, gene list 7) were further investigated. Eight of these genes are completely uncharacterized, while five have published expression patterns in the ventral nerve cord (VNC), where UNC-3 exerts its regulation as a terminal selector of cholinergic motor neurons. To examine whether the uncharacterized genes are expressed in unc-3-expressing cells and are regulated by unc-3, we generated GFP promoter fusions for the eight candidate target genes with no reported anatomical expression patterns (Figure 8). Transgenic lines expressing each of these reporters indeed show expression in VNC motor neurons (MNs), where UNC-3 is known to be expressed. Six of these reporter transgenes (C09G1.4, F08D12.1, F32B5.2, F47D12.3, C04E6.13, F57B7.2) were crossed into the unc-3(e151) mutant background, and each one of them showed significant loss (P < 0.001) of VNC neuron expression in the unc-3(e151) mutant, suggesting UNC-3 dependence (Figure 9; Table S1). We also crossed two (hlh-32, F53E4.1) of the five transgenes with previously described VNC MN expression into an unc-3 mutant background and also found significant loss (P < 0.001) of VNC neuron expression, again suggesting UNC-3 dependence (Figure 9; Table S1). While these results confirm UNC-3 dependence, they do not distinguish direct UNC-3 regulation via binding to UNC-3 sites in each promoter from indirect regulation by downstream UNC-3 effectors. Deletion analysis of candidate UNC-3-binding sites in UNC-3-dependent genes is necessary to confirm direct UNC-3 regulation of the candidate target genes.

Cumulative site scores of novel *unc-3* target genes. CDF plot of best upstream or intronic cumulative site score per gene in novel *unc-3*-predicted target genes (blue) compared to the whole-genome distribution of upstream cumulative site scores (gray). The range of newly validated UNC-3 target gene cumulative site scores (orange) overlaps previously characterized *unc-3* target genes (blue). Sites for experimental validation were chosen before the cumulative site score ranking scheme was finalized, so many putative target gene scores from the whole-genome sampling are higher than those from the validation set. While these results suggest that picking novel target genes that rank similarly to previously characterized TF target genes is a valid strategy, choosing candidates from the higher scoring end of the distribution may result in even better predictions.

Gbrowse shots of novel *unc-3* target genes. TargetOrtho genome browser track files from a UNC-3 whole-genome run were uploaded to Wormbase’s Gbrowse tool using the custom tracks option (TargetOrtho_results). This track shows each *unc-3* motif match as a shaded arrow. The direction of the arrow indicates the strand while the shading of the arrow corresponds to the strength of the motif match. Darker shading indicates higher log-likelihood motif match scores where the raw log-likelihood motif match score is scaled between 500 and 1000 using the maximum and minimum *C. elegans* (reference genome) scores from the TargetOrtho run. The reporter coverage track shows the coordinates of each GFP fusion reporter used for validation of TargetOrtho in wild-type and *unc-3* mutant animals.
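The score scaling described in this legend can be sketched as a linear rescaling into the 500–1000 range, written into the score column of a six-column BED line (chrom, start, end, name, score, strand). This is a hedged illustration: the function names, the rounding, the example coordinates, and the behavior when all scores are equal are assumptions rather than details taken from TargetOrtho.

```python
def scale_score(raw, lowest, highest, out_min=500, out_max=1000):
    """Linearly rescale a raw log-likelihood score into [out_min, out_max]."""
    if highest == lowest:
        return out_max  # assumption: identical scores all get full shading
    fraction = (raw - lowest) / (highest - lowest)
    return int(round(out_min + fraction * (out_max - out_min)))


def bed_line(chrom, start, end, name, raw, lowest, highest, strand):
    """Format one motif match as a tab-separated six-column BED record."""
    score = scale_score(raw, lowest, highest)
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])


# Hypothetical motif match; lowest/highest stand for the minimum and maximum
# reference genome scores observed in one run.
print(bed_line("chrI", 100, 112, "UNC-3_site", 12.0, 8.0, 16.0, "+"))
```

Genome browsers that honor the BED score column draw higher values in darker shades, which matches the shading behavior described for the track.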

Novel *unc-3*-predicted target genes validated *in vivo*. *unc-3* mutants show loss of reporter expression compared to wild-type worms in VNC motor neurons where UNC-3 is known to be a terminal selector of cholinergic motor neuron fate. (Left) Wild-type *C. elegans* worms. (Right) *unc-3(e151)* worms. GFP fusions were injected into wild-type worms and then crossed into *unc-3(e151)* for scoring. Bar charts show the VNC neuron counts for wild-type and *unc-3* mutant worms in all scored lines. All reporter constructs are complex extrachromosomal arrays except *hlh-32 (VL6)*.

Utility of TargetOrtho in other species

TargetOrtho is useful for identifying TF target genes beyond C. elegans. To expand the functionality of TargetOrtho, we have implemented the pipeline for the melanogaster subgroup species with D. melanogaster as the reference genome. Numerous studies have utilized sequence conservation among closely related species to identify biologically functional elements. The relatively close phylogenetic distance between species in the melanogaster subgroup makes it amenable to conservation-based prediction of sequence function and suitable for target gene prediction with TargetOrtho.

The ASE motif used for TargetOrtho analysis in C. elegans is conserved in D. melanogaster and is bound by the Drosophila GLASS TF (Moses et al. 1989), the ortholog of CHE-1 in C. elegans. Two previously characterized GLASS-binding sites in Lz and Rh1 are highly ranked by TargetOrtho when the five species genomes are drawn from the melanogaster species subgroup. Other CHE-1 target genes with ASE regulatory motifs are also conserved in D. melanogaster and are highly ranked by TargetOrtho (data not shown). The UNC-3 motif is also conserved in D. melanogaster, and preliminary analysis suggests that the unc-17 ortholog, a validated UNC-3 target gene in C. elegans, is highly ranked by TargetOrtho in D. melanogaster. This trend is apparent in other UNC-3 target orthologs in Drosophila as well (data not shown). These preliminary results support a role for TargetOrtho target gene prediction in other species.

Discussion

We have demonstrated the predictive power of TargetOrtho using two approaches: bioinformatic validation of previously characterized TF-dependent genes compared to randomized coding genes and in vivo validation of novel TargetOrtho-predicted target genes. The bioinformatic validation supports a multi-species approach to candidate target gene prediction, with averaged-species-derived TargetOrtho rankings showing the most discrimination between validated target genes and randomized genes. Similar trends were observed for PWM scans done on subsets of previously validated target genes not used to construct the PWM itself, providing a conservative estimate of TargetOrtho’s predictive power. The latter approach suggests that whole-genome PWM scans utilizing the multi-species ranking criteria result in strong novel target gene predictions, with 6/6 scored reporter constructs showing expression in TF-expressing cells and TF dependence of that expression.

TargetOrtho provides an effective in silico approach for the identification of novel TF target genes. It offers a complementary approach to existing software that focuses mainly on de novo motif discovery by instead beginning with an experimentally validated motif and searching for conserved regulatory target genes. In this respect, TargetOrtho allows one to greatly expand the repertoire of TF target genes for a more complete understanding of the extensive regulatory networks controlled by TFs. TargetOrtho employs an alignment-independent method of conservation assignment necessary to accommodate the characteristic sequence degeneracy in TFBSs as well as motif repositioning within promoters due to sequence indels introduced over evolutionary time. The ability to overlay TargetOrtho-ranked results with other experimental data such as expression profiling, ChIP, or gene ontology data allows for additional layers of filtering to narrow down the best candidate target genes for further experimentation. In this respect, TargetOrtho serves as a powerful supplement to existing data.

The compactness of its genome and the often-observed proximity of cis-regulatory elements to their target genes make C. elegans particularly suited for TargetOrtho-based analysis of TF targets. However, increases in genome size and the sometimes very distal location of cis-regulatory control elements complicate target gene assignment in more complex metazoan species so that the utility of TargetOrtho may be limited. Another caveat of TargetOrtho use is that, while it has proved to work well for the three test cases presented here, its predictive power is expected to diminish with low-information-content motifs. A motif that occurs frequently in a given genome is likely to be conserved in orthologous genomes by chance alone, thus increasing the likelihood of false-positive target gene predictions. In cases where PWM information content is high, but the motif length is low (four to seven nucleotides), the same problem is expected.
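A rough back-of-the-envelope calculation illustrates why short motifs are problematic. The sketch below estimates how often one exact sequence of a given length is expected to occur by chance, under two simplifying assumptions that are not from the paper: all four bases are equally likely, and the genome is about 100 Mb (roughly the size of the C. elegans genome). Real PWM matches tolerate degenerate positions, so chance matches would be even more frequent than this estimate.

```python
def expected_chance_matches(genome_length, motif_length, both_strands=True):
    """Expected exact matches to one fixed sequence in a random genome.

    Assumes independent, equally likely bases (a simplification).
    """
    positions = max(genome_length - motif_length + 1, 0)
    strands = 2 if both_strands else 1
    return strands * positions / 4 ** motif_length


GENOME_LENGTH = 100_000_000  # assumed approximate genome size in bp

for length in (4, 7, 12):
    count = expected_chance_matches(GENOME_LENGTH, length)
    print(f"motif length {length}: about {count:,.0f} chance matches")
```

Under these assumptions a four-nucleotide sequence is expected hundreds of thousands of times, so finding it near orthologous genes in several species says little about function, whereas a twelve-nucleotide sequence is expected only about a dozen times.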

Alternatively, true TF target genes may be ranked low if appropriate ortholog assignments have not been made. In these cases, TargetOrtho will underestimate the cumulative site score due to lack of nonreference genome species information. Target genes with nonconserved sites may also often be highly ranked due to strong reference genome results (such as motif count or log-likelihood site score). A second target gene may have identical reference genome rankings, but averaging the ranking criteria across species can lower its overall score when its sites in other species score poorly, even though having poorly scoring sites in additional species is clearly better than having no sites in additional species. Assuming that conservation increases the likelihood of biological functionality, one may choose to weight the conservation score (1–5) so that, despite underperforming averaged species data, the overall extent of conservation is considered. For our analysis of three well-characterized TFs, known TF target genes outperformed randomized coding genes despite this flaw. Additionally, weighting schemas may be explored for a given TargetOrtho run by adjusting the rank scaling parameters at run time. TargetOrtho results are also available as tab-delimited text so that the user may re-sort the data as appropriate. While adjustable input parameters allow flexibility in the ranking schema, users must consider carefully the implications of tweaking individual ranking criteria. Such exploratory adjustments may result in user-biased predictions with the potential for an increase in the false discovery rate. To address this issue, we recommend running TargetOrtho using the query list option with a query list of previously characterized target genes so that ranking of these target genes may be assessed among whole-genome results.
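One simple way to carry out the recommended check is to look up where previously characterized target genes fall in a whole-genome ranking after each change to the weights. The sketch below assumes the ranking is available as a plain list of gene names ordered best first (for example, re-sorted from the tab-delimited output); the function name and the placeholder gene names other than unc-17 are hypothetical.

```python
def known_target_ranks(ranked_genes, known_targets):
    """Map each known target to (rank, percentile), or None if it is absent.

    ranked_genes is ordered best first; a percentile near 0 is near the top.
    """
    first_rank = {}
    for rank, gene in enumerate(ranked_genes, start=1):
        first_rank.setdefault(gene, rank)  # keep the best rank per gene
    total = len(ranked_genes)
    report = {}
    for gene in known_targets:
        rank = first_rank.get(gene)
        report[gene] = None if rank is None else (rank, 100.0 * rank / total)
    return report


# Hypothetical ranking of four genes and a query list of known targets.
ranking = ["unc-17", "geneB", "geneC", "geneD"]
print(known_target_ranks(ranking, ["unc-17", "geneD", "geneX"]))
```

If a new weighting scheme pushes known targets toward the bottom of the ranking, that is a signal the adjustment may be biasing predictions rather than improving them.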

In conclusion, TargetOrtho provides a cost- and time-efficient in silico approach for the identification of novel TF target genes, and, together with its CRM search function, is poised to unravel the regulatory logic of diverse biological processes.

Acknowledgments

We thank Q. Chen for expert assistance in generating transgenic strains, H. Bussemaker for valuable suggestions, and members of the Hobert lab for comments on the manuscript. This work was funded by the National Institutes of Health (R01NS039996-05; R01NS050266-03; 5T32DK007328-33). O.H. is an Investigator of the Howard Hughes Medical Institute.

Footnotes

Communicating editor: P. Sengupta

Supporting information is available online at http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.113.160721/-/DC1.

Literature Cited

Aerts S., 2012. Computational strategies for the genome-wide identification of cis-regulatory elements and transcriptional targets. Curr. Top. Dev. Biol. 98: 121–145.
Aerts S., Lambrechts D., Maity S., Van Loo P., Coessens B., et al., 2006. Gene prioritization through genomic data fusion. Nat. Biotechnol. 24: 537–544.
Bailey T. L., Elkan C., 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers, pp. 28–36 in Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (UCSD Technical Report CS94–351). AAAI Press, August 1994.
Bailey T. L., Boden M., Buske F. A., Frith M., Grant C. E., et al., 2009. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37: W202–W208.
Bigelow H. R., Wenick A. S., Wong A., Hobert O., 2004. CisOrtho: a program pipeline for genome-wide identification of transcription factor target genes using phylogenetic footprinting. BMC Bioinformatics 5: 27.
Carey M. F., Peterson C. L., Smale S. T., 2009. Chromatin immunoprecipitation (ChIP). Cold Spring Harb. Protoc. 9: Prot5279.
Eden E., Lipson D., Yogev S., Yakhini Z., 2007. Discovering motifs in ranked lists of DNA sequences. PLOS Comput. Biol. 3: e39.
Eden E., Navon R., Steinfeld I., Lipson D., Yakhini Z., 2009. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 10: 48.
Elemento O., Tavazoie S., 2005. Fast and systematic genome-wide discovery of conserved regulatory elements using a non-alignment based approach. Genome Biol. 6: R18.
Etchberger J. F., Lorch A., Sleumer M. C., Zapf R., Jones S. J., et al., 2007. The molecular signature and cis-regulatory architecture of a C. elegans gustatory neuron. Genes Dev. 21: 1653–1674.
Goecks J., Nekrutenko A., Taylor J., 2010. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11: R86.
Gordân R., Narlikar L., Hartemink A. J., 2010. Finding regulatory DNA motifs using alignment-free evolutionary conservation information. Nucleic Acids Res. 38: e90.
Grant C. E., Bailey T. L., Noble W. S., 2011. FIMO: scanning for occurrences of a given motif. Bioinformatics 27: 1017–1018.
Hallikas O., Palin K., Sinjushina N., Rautiainen R., Partanen J., et al., 2006. Genome-wide prediction of mammalian enhancers based on analysis of transcription-factor binding affinity. Cell 124: 47–59.
Hellman L. M., Fried M. G., 2007. Electrophoretic mobility shift assay (EMSA) for detecting protein-nucleic acid interactions. Nat. Protoc. 2: 1849–1861.
Herrmann C., Van de Sande B., Potier D., Aerts S., 2012. i-cisTarget: an integrative genomics method for the prediction of regulatory features and cis-regulatory modules. Nucleic Acids Res. 40: e114.
Hobert O., 2002. PCR fusion-based approach to create reporter gene constructs for expression analysis in transgenic C. elegans. Biotechniques 32: 728–730.
Kratsios P., Stolfi A., Levine M., Hobert O., 2012. Coordinated regulation of cholinergic motor neuron traits through a conserved terminal selector gene. Nat. Neurosci. 15: 205–214.
Moses K., Ellis M. C., Rubin G. M., 1989. The glass gene encodes a zinc-finger protein required by Drosophila photoreceptor cells. Nature 340(6234): 531–536.
Odenwald W. F., Rasband W., Kuzin A., Brody T., 2005. EVOPRINTER, a multigenomic comparative tool for rapid identification of functionally important DNA. Proc. Natl. Acad. Sci. USA 102: 14700–14705.
Siddharthan R., Siggia E. D., van Nimwegen E., 2005. PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLOS Comput. Biol. 1: e67.
Sinha S., Blanchette M., Tompa M., 2005. PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics 5: 170.
Smedley D., Haider S., Ballester B., Halland R., London D., et al., 2009. BioMart: biological queries made easy. BMC Genomics 10: 22.
Song L., Crawford G. E., 2010. DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harb. Protoc. 2010(2): pdb.prot5384.
Storey J. D., Tibshirani R., 2003. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. USA 100: 9440–9445.
Vilella A. J., Severin J., Ureta-Vidal A., Heng L., Durbin R., et al., 2009. EnsemblCompara GeneTrees: complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 19: 327–335.
Wang T., 2007. Using PhyloCon to identify conserved regulatory motifs. Curr. Protoc. Bioinformatics, Chapter 2, Unit 2.12.
Ward L. D., Bussemaker H. J., 2008. Predicting functional transcription factor binding through alignment-free and affinity-based analysis of orthologous promoter sequences. Bioinformatics 24: i165–i171.
Wenick A. S., Hobert O., 2004. Genomic cis-regulatory architecture and trans-acting regulators of a single interneuron-specific gene battery in C. elegans. Dev. Cell 6: 757–770.
Wright W. E., Binder M., Funk W., 1991. Cyclic amplification and selection of targets (CASTing) for the myogenin consensus binding site. Mol. Cell. Biol. 11(8): 4104–4110.