Asymmetron: a toolkit for the identification of strand asymmetry patterns in biological sequences

Ilias Georgakopoulos-Soares; Ioannis Mouratidis; Guillermo E Parada; Navneet Matharu; Martin Hemberg; Nadav Ahituv

doi:10.1093/nar/gkaa1052

Nucleic Acids Res. 2020 Nov 19;49(1):e4. doi: 10.1093/nar/gkaa1052

PMCID: PMC7797064 PMID: 33211865

Abstract

DNA strand asymmetries can have a major effect on several biological functions, including replication, transcription and transcription factor binding. As such, DNA strand asymmetries and mutational strand bias can provide information about biological function. However, a versatile tool to explore this does not exist. Here, we present Asymmetron, a user-friendly computational tool that performs statistical analysis and visualizations for the evaluation of strand asymmetries. Asymmetron takes as input DNA features provided with strand annotation and outputs strand asymmetries for consecutive occurrences of a single DNA feature or between pairs of features. We illustrate the use of Asymmetron by identifying transcriptional and replicative strand asymmetries of germline structural variant breakpoints. We also show that the orientation of the binding sites of 45% of the human transcription factors analyzed has a significant DNA strand bias in transcribed regions, which is also corroborated in ChIP-seq analyses and is likely associated with transcription. In summary, we provide a novel tool to assess DNA strand asymmetries and show how it can be used to derive new insights across a variety of biological disciplines.

INTRODUCTION

Even though the DNA double helix is a symmetric structure, many biological processes such as replication, transcription and transcription factor binding are directional. The directionality of these processes results in the inhomogeneous distribution of genomic sequences relative to the two complementary DNA strands. Reflecting directionality biases, strong compositional strand asymmetries have been observed across the entire tree of life, ranging all the way from viral to eukaryotic genomes. This bias has been ascribed to replication origins and transcription initiation sites in all these organisms (1–6). In double-stranded DNA viruses, a GC-skew, which measures the asymmetry in the distribution of Gs and Cs in the two strands, has been observed between the leading and lagging strands (7). In prokaryotic genomes, genes are more frequently observed in the leading orientation, a phenomenon that is more pronounced for essential genes (8). This asymmetry is in accordance with evidence suggesting that genes in the lagging strand accumulate an excess of mutations relative to those in the leading orientation (9). In mammals, the testis expresses the highest number of genes relative to any other tissue. This mechanism safeguards the germline DNA integrity through reduced mutations at the transcribed strand as a result of transcription-coupled repair and in turn leads to reduced population diversity at those sequences (10).

DNA mutations can be oriented relative to transcription and replication, using as reference the template/non-template and leading/lagging strands, respectively. If the reference nucleotide or motif at the site of the mutation is found more frequently in one strand relative to the other, following correction for background strand preferences, it indicates a mutational strand asymmetry. This mutational strand imbalance can have a major impact on disease, development and evolution. For example, the transcription-coupled repair pathway preferentially repairs DNA damage at the template strand, as it can otherwise impede the RNA polymerase progression (11). In lung cancer, tobacco-related carcinogens form bulky adducts at guanines and their preferential repair at the template strand of expressed genes results in mutational imbalance of G>T substitutions (12). The apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like (APOBEC) is a cytidine deaminase involved in antiviral defense. However, off-target APOBEC-associated mutagenesis in the human genome is often observed in cancer cells and more frequently targets cytosines on the lagging replicative strand (13,14). Transcriptional and replicative strand asymmetries have also been characterized in relationship with gene expression levels and replication timing, providing further mechanistic insights (13–15).

Sequences that are non-palindromic can be oriented relative to one another. A pair of motifs can be on the same or opposite strands and, if they are on opposite strands, they can be in convergent (facing each other) or divergent (facing away from each other) orientations. Examples include many transcription factor binding sites. One of the most notable is the CCCTC-binding factor (CTCF), the motif orientation of which dictates chromatin looping and three-dimensional genome topology (16). Another noteworthy example is the Yin Yang 1 (YY1) transcription factor, whose motif orientation in the c-Fos promoter reverses the expression of the downstream gene, therefore acting either as an activator or a repressor depending on the genomic orientation of its motif relative to the transcriptional direction (17). In transcription factor heterodimers, the orientation of the individual motifs can also influence binding and expression levels (18,19). Other cases of strand asymmetries include endogenous repetitive elements with preferences in their orientation relative to each other and relative to transcriptional and replicative direction, which could influence their jumping activity (20,21), or the relative orientation of genes and proximal long non-coding RNAs that can regulate the expression of each other. In particular, there is evidence that antisense transcripts can form self-regulatory circuits with the target gene (22).

Characterization of strand asymmetries can thus allow for the identification of novel DNA elements, improve our understanding regarding their interactions with one another, and advance our knowledge of the underlying processes in mutagenesis and evolution. To date, there is no versatile tool to perform analysis of strand asymmetries across biological problems. Here, we introduce Asymmetron, a novel, multi-purpose computational toolkit that systematically characterizes strand asymmetry patterns in nucleotide sequences. Asymmetron is composed of four functions (Figure 1): ‘consecutive_patterns.py’, which finds strand patterns within consecutive occurrences of a single genomic element; ‘contained_asymmetries.py’, which is used for pairs of genomic elements in which one is contained within the other; ‘pairwise_asymmetries.py’, which finds asymmetries between pairs of proximal genomic elements; and ‘orientation.py’, which assigns strand orientation from one genomic element to another and can be used to orient features of interest such as chromatin immunoprecipitation followed by sequencing (ChIP-seq) peaks or mutations relative to strand-assigned genomic sequences of interest. Using Asymmetron we show that germline structural variant breakpoints can be oriented relative to transposable elements and find transcriptional and replicative strand asymmetries in them, suggesting transposable element activity in the germline. We also provide evidence that the orientation of many transcription factor binding sites (TFBSs) is highly biased across promoters and in transcribed regions and validate our findings by analyzing the orientation of TFBSs within ChIP-seq peaks.

Figure 1. — Graphical depiction of Asymmetron functionalities. The Asymmetron toolkit is composed of four functions that enable the estimation of strand asymmetries within and between BED file datasets. The consecutive_patterns.py function enables the identification of patterns within consecutive occurrences of a feature. The contained_asymmetries.py function calculates the strand asymmetries of a feature of interest (motifs) contained within another feature (regions). The pairwise_asymmetries.py function estimates the orientation bias between two features that are in proximity to each other. The orientation.py function orients an un-annotated feature relative to another overlapping feature that has strand annotation; it is integrated within all three other functions and can also run independently.

MATERIALS AND METHODS

Asymmetron enables versatile genomic investigations of strand asymmetry patterns across different biological problems. It is a Python-based toolkit and its core BED-formatted file comparison functions use the package Pybedtools (23). Asymmetron provides support for three types of analyses: (i) consecutive strand asymmetry estimation in a single file with strand annotation; (ii) strand asymmetry estimation of strand-assigned motifs within strand-assigned regions; (iii) strand asymmetry estimation between two strand-assigned motifs in proximity or overlapping each other. A fourth function performs the strand assignment of an unassigned feature based on another overlapping feature, thereby enabling the strand asymmetry analysis of the first (Figure 1).

Let us define the alphabet $\Sigma = \{A, C, G, T\}$. DNA can be represented by a pair of sequences $(A_n, B_n)$, where $A_n = a_1 a_2 \ldots a_n$ with $a_i \in \Sigma$, and $B_n = b_1 b_2 \ldots b_n$ is its complement strand, where $b_i = T$ if $a_i = A$ and $b_i = G$ if $a_i = C$ (and vice versa). Because of the directionality of the two strands, we read B_n from right to left, e.g. if B_n = AGGCT, we will say that B_n contains the motif TCG. Here, we use ‘motif’ to refer to a short sequence from the same alphabet that is of particular interest.
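The right-to-left reading convention can be illustrated with a minimal Python sketch. The helper names `complement`, `read_b_strand` and `strand_of_motif` are hypothetical and are not part of Asymmetron; the sketch only restates the definitions above for a non-palindromic motif.

```python
# Complement table over the alphabet {A, C, G, T}.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(a_strand: str) -> str:
    """Return B_n: the position-by-position complement of A_n."""
    return "".join(COMPLEMENT[base] for base in a_strand)


def read_b_strand(b_strand: str) -> str:
    """Return B_n in reading order, i.e. from right to left."""
    return b_strand[::-1]


def strand_of_motif(a_strand: str, motif: str):
    """Return '+' if the motif is read on A_n, '-' if on B_n, else None."""
    if motif in a_strand:
        return "+"
    if motif in read_b_strand(complement(a_strand)):
        return "-"
    return None


# Example from the text: B_n = AGGCT is read as TCGGA, which contains TCG.
print(read_b_strand("AGGCT"))                       # TCGGA
print(strand_of_motif(complement("AGGCT"), "TCG"))  # -
```

Because the complement of AGGCT is TCCGA, the motif TCG is absent from A_n and is only found when B_n is read from right to left, so it is reported with the − sign.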

Analyses are often performed on genomic data, to extract all locations of a specific motif. In Asymmetron, we use these locations to estimate strand asymmetries through several types of analyses. We use these methods to evaluate strand asymmetries of non-palindromic sequences. To represent the locations of the motif in the genome, it is enough to save the chromosome, the index where the motif starts, the index where the motif ends as well as which strand the motif is found at. A common format used to store this information is a BED file, which, inter alia, saves the above-mentioned information. In this format, the strand is represented with a + or − sign for A_n or B_n respectively, which we will also use here. The information of a BED file relevant for our analyses can be represented as a set of vectors S, where in each vector the chromosome is represented as c, the start coordinate as s, the end coordinate as e and the sign as r.
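As a minimal sketch of this vector representation, the following Python snippet reads BED lines into (c, s, e, r) records. It assumes the standard six-column BED layout with the strand in column 6; the `Interval` type and `parse_bed` helper are hypothetical names used for illustration and do not reproduce Asymmetron's own parser.

```python
from typing import Iterable, List, NamedTuple


class Interval(NamedTuple):
    c: str  # chromosome
    s: int  # start coordinate
    e: int  # end coordinate
    r: str  # strand sign: '+' or '-' ('.' when the strand column is absent)


def parse_bed(lines: Iterable[str]) -> List[Interval]:
    """Parse tab-separated BED lines, assuming the strand is in column 6."""
    records = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments and UCSC header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        strand = fields[5] if len(fields) > 5 else "."
        records.append(Interval(fields[0], int(fields[1]), int(fields[2]), strand))
    return records


bed_lines = [
    "chr1\t10\t20\tmotif_a\t0\t+",
    "# a comment line",
    "chr1\t5\t9\tmotif_b\t0\t-",
]
# Sort by chromosome and then by start position, as used in later analyses.
for record in sorted(parse_bed(bed_lines), key=lambda v: (v.c, v.s)):
    print(record)
```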

The commands used to perform the analyses and the files can be found on the GitHub page (https://github.com/Ahituv-lab/Asymmetron). Asymmetron documentation, including a tutorial, several examples and a description of all available options, is available at http://asymmetron.readthedocs.io/.

Consecutive strand asymmetry estimation for single motifs

Nearby recurrences of a motif in the genome can have biological significance. To examine the patterns emerging from recurring motifs, we developed this function which allows the observation of consecutive occurrences of a motif. It analyzes whether there is an asymmetry in the number of times the motif appears in one strand versus the other (Figure 2A).

Figure 2. — Schematic of strand asymmetry analyses across different scenarios. (A) Estimation of biases in the orientation patterns of consecutive occurrences of a motif relative to those found in the shuffled simulated data. Calculation of orientation patterns for consecutive motif occurrences is performed using the function *consecutive_patterns.py*. In the presented example, there are seven motifs, six of them in the same orientation and one in the opposite orientation. We perform N simulations (in the schematic *N = 1*) and calculate the adjusted strand asymmetry ratio and empirical P-value. In this simple case, simulated strand asymmetry = 3/7 < strand asymmetry = 6/7, so the set of successes, as defined in the methods section, for which the simulated asymmetry is higher than the strand asymmetry has a cardinality of 0. This results in a trivial P-value of 1, as is to be expected from only a single simulation. (B) Estimation of transcriptional strand asymmetry of a motif in genic regions. Genes in both orientations are shown. Calculation of transcriptional strand asymmetries can be performed using the function *contained_asymmetries.py*. In the schematic, there are ten motifs distributed across two oppositely oriented genes; the null hypothesis is that they are equally likely to have either orientation relative to the gene direction. There are seven motifs in the non-template orientation, resulting in $P \approx 0.34$, calculated with the two-tailed binomial test. (C) Motifs can occur in same (++ and −−) or in opposite (+− and −+) strand orientations. The order of two same-type or different motifs is not taken into consideration because the double-stranded DNA molecule is bidirectional; nevertheless, if a third strand-oriented feature was included, their order would be another factor to account for. (D) Motifs in opposite strands can be further separated into convergent (+−) or divergent (−+) orientations. These orientations of motif pairs are specific to non-palindromic motifs. (E) Orientation of motif pairs and estimation of same/opposite and convergent/divergent strand asymmetry ratios using a miniature genome example of two chromosomes and several occurrences of two motifs in pairs. Calculation of the strand asymmetry for motif pairs is performed with the function *pairwise_asymmetries.py*. In the schematic, there are eight motif pairs across the two chromosomes; the null hypothesis is that they are equally likely to have same or opposite orientation and, in the subset of opposite orientation cases, that they are equally likely to be in convergent or divergent orientation. There are three motif pairs in same orientation and five motif pairs in opposite orientation, resulting in $P \approx 0.73$ for the same/opposite strand asymmetry; the convergent/divergent strand asymmetry among the five motif pairs in opposite orientation is evaluated with the same two-tailed binomial test.

Let S be the vector representation of the input BED file. Let $v_1, v_2, \ldots, v_k$, with $v_i = (c_i, s_i, e_i, r_i)$, be the vectors of set S sorted first by chromosome c and then by start position s. We define the distance between two consecutive appearances of the motif in the same chromosome as $d_i = s_{i+1} - e_i$. If $d_i$ falls outside the user-specified distance limits, they are not considered consecutive for the purpose of this analysis. Let C be a set consisting of sequences of characters $r_i \in \{+, -\}$, where each character is the sign of an appearance of the motif that fits the previously mentioned criteria. We define m as the cardinality of the set $\{(r_i, r_{i+1}) : r_i = r_{i+1}\}$, which represents all consecutive appearances of the motif on the same strand (both on A_n or both on B_n). Similarly, o is defined as the cardinality of the set $\{(r_i, r_{i+1}) : r_i \neq r_{i+1}\}$, which represents all consecutive appearances of the motif on opposite strands (one on A_n and the other on B_n). The strand asymmetry ratio is defined as $r = \frac{m}{m + o}$, which represents the magnitude of consecutive orientation bias. We then run N simulations (default: N = 1000), randomly assigning a value (‘+’ or ‘−’) to every $r_i \in C$, while keeping the total number of ‘+’ and ‘−’ in C constant. Following the same procedure as above, the strand asymmetry ratio $r_{sim}$ is calculated for each simulation. The adjusted strand asymmetry ratio is then defined as the original strand asymmetry ratio r divided by the mean strand asymmetry ratio $\bar{r}_{sim}$ across simulations. We define a success as $r_{sim} > r$. Let L be the set of successes. We use the cardinality of L to calculate the empirical P-value as follows:

$P = 2 \cdot \min(p, 1 - p)$, where $p = \frac{|L| + 1}{N + 1}$. We multiply by 2 because the test is two-tailed, so that the significance is not over-estimated.
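The simulation procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not Asymmetron's implementation: the strand asymmetry ratio is assumed to be m / (m + o) over consecutive pairs, all motifs are assumed to lie on one chromosome within the distance limits, and the function names are hypothetical. The sketch stops at the number of successes and leaves the conversion to an empirical P-value to the formula in the text.

```python
import random


def same_opposite_counts(signs):
    """Count consecutive pairs on the same strand (m) and on opposite strands (o)."""
    pairs = list(zip(signs, signs[1:]))
    m = sum(1 for a, b in pairs if a == b)
    o = len(pairs) - m
    return m, o


def asymmetry_ratio(signs):
    """Assumed ratio m / (m + o); NaN when there are no consecutive pairs."""
    m, o = same_opposite_counts(signs)
    return m / (m + o) if (m + o) else float("nan")


def consecutive_asymmetry(signs, n_sim=1000, seed=0):
    """Return (observed ratio, adjusted ratio, number of successes)."""
    rng = random.Random(seed)
    observed = asymmetry_ratio(signs)
    shuffled = list(signs)
    simulated = []
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # a permutation keeps the '+' and '-' totals fixed
        simulated.append(asymmetry_ratio(shuffled))
    mean_sim = sum(simulated) / n_sim
    adjusted = observed / mean_sim if mean_sim else float("nan")
    # A success is a simulation whose ratio exceeds the observed ratio.
    successes = sum(1 for r_sim in simulated if r_sim > observed)
    return observed, adjusted, successes


observed, adjusted, successes = consecutive_asymmetry(list("+++---"))
print(observed, adjusted, successes)
```

Shuffling the observed signs, rather than drawing each sign independently, is one simple way to keep the totals of ‘+’ and ‘−’ constant across simulations.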

The outputs of this tool include a table with the statistical evaluation of the asymmetry bias for each inputted pattern; BED files with statistically biased coordinates consecutively observed for each inputted pattern with an extra column having their estimated Bonferroni corrected P-value and barplot visualizations of the distribution of observed versus expected consecutive occurrences of each pattern and other relevant statistics. As an extension, the tool also offers the option to analyze custom patterns provided by the user.

Strand asymmetry estimation between regions and overlapping motifs

The strand asymmetry between regions and overlapping motifs tool requires a set of strand-oriented BED-formatted files of the regions of interest and a set of strand-oriented BED-formatted motif files. The tool performs independently the analysis across pairs of region and motif files and measures the strand asymmetry scores for the motifs overlapping or contained in the regions (Figure 2B).

Let S₁, S₂ be the set representation of two strand-annotated BED files. For each vector $v_1 = (c_1, s_1, e_1, r_1)$ in S₁, this function will compare it to every vector $v_2 = (c_2, s_2, e_2, r_2)$ in S₂. If $c_1 = c_2$, $s_1 < e_2$ and $s_2 < e_1$, which means that there is an overlap between the two vectors, we assign the motif/region pair to one of the two following categories: if $r_1 = r_2$ we consider them to have the same strand orientation; if $r_1 \neq r_2$ we consider them to have the opposite orientation. Using the total number of pairs in same strand orientation and opposite strand orientation, we calculate the strand asymmetry ratio as follows:

$\text{strand asymmetry ratio} = \frac{k}{k + l}$

We symbolize the number of occurrences in same strand orientation as k and occurrences in opposite strand orientation as l. We define $n = k + l$ as the total number of comparisons. We then calculate the P-value with the two-tailed binomial test for k occurrences out of n comparisons, where P is the user-defined probability for same strand orientation (default = 0.5, assuming a random distribution of the orientation between motifs and regions).

The corresponding P-values are calculated using the ‘scipy’ package in Python (24) and are adjusted with Bonferroni correction in case of multiple tests.
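For illustration, an exact two-tailed binomial test and a Bonferroni adjustment can be written without external dependencies. This sketch sums the probabilities of all outcomes that are no more likely than the observed count, which is a common definition of the exact two-sided test; whether it matches the specific ‘scipy’ call used by Asymmetron is an assumption, and the function names are hypothetical.

```python
from math import comb


def binomial_two_tailed(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial P-value for k same-orientation pairs out of n.

    Sums the probabilities of all outcomes that are no more likely than k.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Schematic of Figure 2B: seven of ten motifs in the non-template orientation.
print(binomial_two_tailed(7, 10))  # 0.34375
```

Passing a value other than 0.5 as `p` corresponds to testing against a user-defined expected probability of same strand orientation.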

The outputs include a table with the strand asymmetry score and statistics for each comparison. It also includes visualizations in the form of barplots for the number of occurrences in same versus opposite orientations and other relevant statistics.

Strand asymmetry estimation between pairs of proximal motifs

The tool uses as input a pair of BED files representing two motifs. Let S₁, S₂ be the set representation of two strand-annotated BED files. For each vector $v_1 = (c_1, s_1, e_1, r_1)$ in S₁ we use the ‘bedtools closest’ function to determine the closest element in S₂, $v_2 = (c_2, s_2, e_2, r_2)$, such that $c_1 = c_2$ and the distance between $v_1$ and $v_2$ is minimal. In the case of a tie, i.e. multiple $v_2$ fulfilling that criterion, all instances are reported by default and are considered for the subsequent analysis. If the distance between the two is within the user-specified parameters, then the pair is assigned to the following categories: if $r_1 = r_2$, then they are considered to have the same orientation (Figure 2C). Conversely, if $r_1 \neq r_2$ they are considered to have the opposite orientation. If they have the opposite orientation, there is a further subdivision into convergent or divergent (Figure 2D). Let $r_{up} = r_1$ if $s_1 < s_2$ and $r_{up} = r_2$ otherwise. If $r_{up} = +$ then the pair is considered convergent, otherwise it is considered divergent (Figure 2E).

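Strand asymmetry ratios are calculated as:

$\text{same/opposite ratio} = \frac{\text{same}}{\text{same} + \text{opposite}}, \qquad \text{convergent/divergent ratio} = \frac{\text{convergent}}{\text{convergent} + \text{divergent}}$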

To calculate the corresponding P-values the same procedure is followed as described in the methods section of strand asymmetry estimation between regions and overlapping motifs. The convergent/divergent P-value is calculated similarly to the same/opposite P-value, with k the number of occurrences in convergent strand orientation and l the number of occurrences in divergent strand orientation.
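The classification of a motif pair can be sketched as below. The sketch assumes that the upstream element is the one with the smaller start coordinate and that a ‘+’ upstream element paired with a ‘−’ downstream element is convergent (+−), as in Figure 2D; the ratio definitions (same over total, convergent over opposite) and the function names are assumptions made for illustration.

```python
def classify_pair(s1, r1, s2, r2):
    """Classify a proximal motif pair from start coordinates and strand signs."""
    if r1 == r2:
        return "same"
    # The upstream element is the one with the smaller start coordinate.
    upstream_sign = r1 if s1 < s2 else r2
    # '+' upstream and '-' downstream face each other (+-): convergent.
    return "convergent" if upstream_sign == "+" else "divergent"


def pairwise_ratios(pairs):
    """Return (same/total ratio, convergent/opposite ratio) for (s1, r1, s2, r2) pairs."""
    labels = [classify_pair(*pair) for pair in pairs]
    same = labels.count("same")
    convergent = labels.count("convergent")
    divergent = labels.count("divergent")
    opposite = convergent + divergent
    same_ratio = same / (same + opposite) if labels else float("nan")
    conv_ratio = convergent / opposite if opposite else float("nan")
    return same_ratio, conv_ratio


example_pairs = [
    (100, "+", 200, "-"),  # convergent
    (300, "-", 400, "+"),  # divergent
    (500, "+", 600, "+"),  # same
]
print(pairwise_ratios(example_pairs))
```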

The outputs include a table of the asymmetries for same versus opposite strand and convergent versus divergent orientations. It also includes barplots for each asymmetry comparison, distribution plots showing the strand asymmetries as a function of distance between motif pairs and other relevant statistics.

Orientation assignment prior to asymmetry analysis

The previous functions are based on the fact that the motifs are assigned a specific strand (+ or −) because they are found either on A_n or B_n. In the case that the feature of one file is present in both strands and thus lacks strand annotation, it is possible to assign to it a strand annotation based on a feature provided in a second file. For this, the user needs to provide an un-annotated BED file, as well as one annotated BED file of a different feature using the same genome annotation. Let S₁, S₂ be the sets representing the two BED files, with S₁ representing the annotated file and S₂ the un-annotated file. For each vector $v_2 = (c_2, s_2, e_2)$ in S₂, this function will compare it to every vector $v_1 = (c_1, s_1, e_1, r_1)$ in S₁. If $c_1 = c_2$, $s_1 < e_2$ and $s_2 < e_1$, which means that there is an overlap between the two vectors, then we set $r_2 = r_1$. If there are multiple vectors $v_1$ that fulfil the overlap criteria, then only the one with the minimal distance between the centers of $v_1$ and $v_2$, defined as $\left| \frac{s_1 + e_1}{2} - \frac{s_2 + e_2}{2} \right|$, is kept.
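A brute-force sketch of this strand assignment is given below. It assumes half-open BED coordinates for the overlap test, drops un-annotated features that overlap nothing, and uses a hypothetical function name; Asymmetron itself relies on Pybedtools, which this illustration does not reproduce.

```python
def assign_strand(unannotated, annotated):
    """Assign strands to (c, s, e) features from overlapping (c, s, e, r) features.

    When several annotated features overlap, the one whose center is closest
    to the center of the un-annotated feature is used.
    """
    oriented = []
    for c2, s2, e2 in unannotated:
        best = None  # (center distance, strand) of the best overlap so far
        for c1, s1, e1, r1 in annotated:
            if c1 == c2 and s1 < e2 and s2 < e1:  # half-open interval overlap
                distance = abs((s1 + e1) / 2 - (s2 + e2) / 2)
                if best is None or distance < best[0]:
                    best = (distance, r1)
        if best is not None:
            oriented.append((c2, s2, e2, best[1]))
    return oriented


annotated = [("chr1", 0, 100, "+"), ("chr1", 50, 60, "-")]
unannotated = [("chr1", 52, 58), ("chr2", 5, 6)]
print(assign_strand(unannotated, annotated))  # [('chr1', 52, 58, '-')]
```

In the example, both annotated features overlap the first un-annotated feature, and the one centered at position 55 is kept because its center coincides with that of the feature being oriented.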

Genomic analyses

The human genome build hg38 was used throughout this work. Gene annotation from GENCODE was used (v33); the file was derived from https://www.gencodegenes.org/ and filtered to include only protein-coding genes (25). Germline structural variant data were downloaded from the gnomAD website (https://gnomad.broadinstitute.org/), using version 2 of the database (26). Coordinates of transposable elements were derived for the human genome (hg38) from http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz (version from 11 March 2019), which uses RepeatMasker (Smit, Hubley and Green, www.repeatmasker.org), and were filtered to include LINE, SINE and LTR transposable elements. Repli-seq data for the BG02ES cell line were obtained from the ENCODE project (Release 2) (https://www.encodeproject.org/) (27) and lifted over to hg38; leading and lagging orientation of the replication machinery across the human genome was inferred as described in (13). Genes were divided into ten equal-sized bins, with an upstream and a downstream 1 kB bin added for each gene, resulting in twelve bins. Pearson correlations between transcriptional strand asymmetry of transposable elements and bin number were calculated excluding the upstream and downstream 1 kB bins. Position frequency matrices (PFMs) of transcription factors were derived from JASPAR (release 2020) for the non-redundant CORE collection (28) (http://jaspar.genereg.net/download/CORE/JASPAR2020_CORE_vertebrates_non-redundant_pfms_meme.zip) and motif scanning was performed with FIMO, using as background model the nucleotide frequencies across the human genome and requiring a P-value < 10⁻⁶ (29). Transcription factors with no motif occurrences below the P-value threshold in the human genome were excluded from the analyses. Unibind PWM motif maps (https://unibind.uio.no/static/data/bulk/pwm_tfbs_per_tf.tar.gz), from the 2019 release, extracted from ChIP-seq experiments of their corresponding transcription factor with the peak-caller MACS, were analysed for transcriptional strand asymmetry across genic regions (30). Statistical analysis was performed in Python with the packages ‘math’, ‘scipy’, ‘pandas’ and ‘numpy’ and in R; visualizations were performed in Python with the ‘matplotlib’ and ‘seaborn’ packages and in R with the ‘ggplot2’ package.
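The division of a gene into twelve bins can be sketched as follows. The function name and the rounding of bin edges to integer coordinates are assumptions made for illustration; bins are returned in the direction of transcription, so that for a gene on the − strand the upstream flank is the one at higher coordinates.

```python
def gene_bins(start, end, strand, n_bins=10, flank=1000):
    """Return 12 (bin_start, bin_end) intervals ordered along transcription:
    upstream flank, n_bins gene-body bins, downstream flank."""
    length = end - start
    # Bin edges are rounded to integers, so bin sizes may differ by one base.
    edges = [start + round(i * length / n_bins) for i in range(n_bins + 1)]
    body = list(zip(edges[:-1], edges[1:]))
    left_flank = (max(0, start - flank), start)
    right_flank = (end, end + flank)
    bins = [left_flank] + body + [right_flank]
    # On the '-' strand transcription runs towards lower coordinates.
    return bins if strand == "+" else bins[::-1]


for index, interval in enumerate(gene_bins(1000, 2000, "-")):
    print(index, interval)
```

The first and last entries are the 1 kB flanking bins, which would be excluded when correlating strand asymmetry with bin number as described above.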

Estimation of endogenous repeat element asymmetries

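Transcriptional and replicative strand asymmetries of endogenous repeat elements were estimated as:

$\text{transcriptional strand asymmetry} = \frac{\text{non-template}}{\text{non-template} + \text{template}}$

$\text{replicative strand asymmetry} = \frac{\text{leading}}{\text{leading} + \text{lagging}}$

where non-template, template, leading and lagging denote the number of occurrences of a transposable element in each orientation.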

To calculate the corresponding Bonferroni-corrected P-values the same procedure is followed as described in the methods section of strand asymmetry estimation between regions and overlapping motifs.

When calculating the bias in breakpoints in template/non-template and leading/lagging strands to correct for background asymmetries in the orientation of endogenous repeat elements we estimated the adjusted strand asymmetry ratio as:

From which we calculated the adjusted strand asymmetry ratio for transcriptional and replicative strand asymmetries for the breakpoints:

We then calculate the Bonferroni-corrected P-values, as described in the methods section, replacing the expected binomial probability P with the probability of background transcriptional strand asymmetry and the background replicative strand asymmetry of each endogenous repeat element respectively.

TFBS transcriptional strand asymmetry estimation

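For each transcription factor the transcriptional strand asymmetry of its TFBSs was estimated as:

$\text{transcriptional strand asymmetry} = \frac{\text{non-template}}{\text{non-template} + \text{template}}$

where non-template and template denote the number of TFBS occurrences in each orientation within transcribed regions.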

To calculate the corresponding Bonferroni-corrected P-values the same procedure is followed as described in the methods section of strand asymmetry estimation between regions and overlapping motifs.

For each PWM motif of each transcription factor we generated simulations (N = 100) in which we randomly shuffled the order of the rows. For each of these simulated PWMs we generated genome-wide maps of their motif occurrences. Next, we estimated the expected transcriptional strand asymmetry ratio of each simulated PWM and calculated the mean transcriptional strand asymmetry ratio across all simulations, resulting in an expected transcriptional strand asymmetry. The adjusted transcriptional strand asymmetry ratio was estimated as the transcriptional strand asymmetry ratio of the original PWM over the mean expected transcriptional strand asymmetry ratio from the simulations. We then calculate the Bonferroni-corrected P-value, as described in the methods section, replacing the default probability of same strand orientation with the mean probability of the simulations.
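The shuffling control can be sketched in Python as below. The sketch assumes that each row of the matrix is one motif position with four nucleotide columns (the MEME layout), so that shuffling rows permutes positions; genome-wide scanning of each shuffled matrix is outside its scope, and the simulated ratios in the example are hypothetical numbers rather than results.

```python
import random


def shuffle_pwm_rows(pwm, rng):
    """Return a copy of the PWM with the order of its rows randomly permuted."""
    shuffled = [row[:] for row in pwm]
    rng.shuffle(shuffled)
    return shuffled


def adjusted_asymmetry(observed_ratio, simulated_ratios):
    """Return (adjusted ratio, mean expected ratio) from the simulated PWMs."""
    mean_expected = sum(simulated_ratios) / len(simulated_ratios)
    return observed_ratio / mean_expected, mean_expected


rng = random.Random(0)
pwm = [
    [0.9, 0.0, 0.1, 0.0],  # position 1: columns A, C, G, T
    [0.0, 0.8, 0.1, 0.1],  # position 2
    [0.1, 0.1, 0.0, 0.8],  # position 3
]
simulated_pwms = [shuffle_pwm_rows(pwm, rng) for _ in range(100)]

# Hypothetical ratios standing in for the scans of the simulated PWMs.
adjusted, expected = adjusted_asymmetry(0.60, [0.50, 0.52, 0.48])
print(adjusted, expected)
```

The mean expected ratio returned here is the value that would replace the default probability of same strand orientation in the binomial test.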

RESULTS

To illustrate the use of Asymmetron, we carried out analyses, which resulted in novel biological insights: (i) by orienting germline structural variant breakpoints relative to transposable elements we identify transcriptional and replicative strand asymmetries; (ii) we provide evidence that the orientation of a large portion of TFBSs is biased relative to the transcription direction, across the human transcriptome; (iii) we show that closely-spaced homotypic CTCF binding sites are more likely to be in the same orientation; (iv) in addition, we show how to use Asymmetron to detect motif strand asymmetries by using a previously characterized motif bias between the TATA-box and the initiator element.

Transposable element orientations reveal structural variant asymmetries

Although strand asymmetries for nucleotide substitutions, insertions and deletions (indels) have been previously characterized using the trinucleotide context of substitutions or the repetitive patterns at the site of indels (13,14,31), a strand asymmetry analysis has not been performed for structural variants. Here, we investigated transcriptional and replicative strand asymmetries of three classes of transposable elements: Long Interspersed Nuclear Elements (LINEs), Short Interspersed Nuclear Elements (SINEs) and Long Terminal Repeats (LTRs). We also oriented population structural variant breakpoints from the Genome Aggregation Database (gnomAD) (26) relative to each of these repetitive elements to assess their likelihood of occurring at a particular orientation.

We first analyzed the transcriptional strand asymmetry of each of these types of repetitive elements across genic regions using Asymmetron:

python contained_asymmetries.py gencode.v33.annotation.bed LINEs.bed

The strand asymmetry was measured as the ratio of non-template to total occurrences. A ratio above 0.5 reflects a preference of the transposable element for the non-template strand, while a ratio below 0.5 reflects a bias towards the template strand orientation. We found a preference for LINEs, SINEs and LTRs to be at the template strand (ratios 0.392, 0.471, 0.316, respectively, binomial test, Bonferroni corrected P-values < 0.001) (Figure 3A), consistent with previous reports (32–34). We also subdivided SINEs into Alu repeats and Mammalian-wide interspersed repeats (MIRs), finding strong transcriptional strand asymmetries only in Alu repeats (Figure 3A). Similarly, we divided LINE retrotransposons in L1 and L2 and found significant transcriptional strand asymmetries in both, which were more pronounced for L1 repeats (Figure 3A).

Figure 3. — Transcriptional and replicative strand asymmetries of structural variants using transposable elements for their orientation. (A) Transcriptional and replicative strand asymmetry bias for endogenous retroelements. Adjusted P-values are Bonferroni corrected and are derived from binomial tests. Transcriptional strand asymmetry is the ratio of non-template to non-template and template occurrences, while replicative strand asymmetry is the ratio of leading orientation occurrences to leading and lagging occurrences of a transposable element. (B) Transcriptional strand asymmetry bias of endogenous retroelements relative to their position across the gene. Pearson correlations were estimated for the ten bins between the TSS and the TES. Adjusted P-values displayed as * for P-value <0.05, ** for P-value <0.01 and *** for P-value <0.001. (C) Log₂ adjusted strand asymmetry ratio for transcriptional and replicative strand asymmetry of structural variant breakpoints overlapping endogenous retroelements correcting for their background strand asymmetries. Adjusted P-values are Bonferroni corrected and are derived from binomial tests.

We investigated the transcriptional strand asymmetry of transposable elements as a function of their position in the gene. To perform this, we separated each gene into ten equal-sized bins and added a 1 kB upstream window bin and a 1 kB downstream window bin (Figure 3B). For LINEs and SINEs we found a more pronounced template strand asymmetry bias in regions closer to the TSS, which decreased as a function of distance from it, whereas for LTRs, we could not observe a statistically significant correlation (Figure 3B). We also observed a positive correlation between relative position in the gene and transcriptional strand asymmetry for Alu and L1 repeats, whereas MIR repeats did not display a statistically significant correlation and L2 elements displayed a negative correlation (Figure 3B). These results suggest that transposable elements are preferentially located relative to orientation and position in genic regions.

Next, we investigated the frequency of structural variants at each of these elements at the template and non-template strand orientations. To perform this analysis, we oriented structural variants relative to endogenous elements:

python orientation.py gnomad_v2.1_sv.sites.bed LINEs.bed

After correcting for the background asymmetries of their orientation preferences within transcribed regions, we observed that for LINEs and SINEs there was a significant association between their orientation and the probability of harboring a structural variant breakpoint, with a preference for the template strand (Adjusted transcriptional strand asymmetry ratios of 0.916 and 0.936; binomial test, Bonferroni corrected, P-value < 0.001), while for LTRs we could not find a preference (Figure 3C). When we subdivided LINE and SINE repeat elements, we found that the structural variant breakpoint transcriptional strand asymmetries were found for L1 and Alu repeat elements (binomial test, Bonferroni corrected, P-value < 0.001), but not for L2 or MIR elements (P-value > 0.05) (Figure 3C).

python contained_asymmetries.py gencode.v33.annotation.bed gnomad_v2.1_sv.sites.LINEs.bed --expected_asym=0.392

Next, we investigated if the directionality of the replication fork was associated with the orientation of LINEs, SINEs and LTRs and if their orientation also influences the likelihood of observing structural variant breakpoints within those elements. We used Repli-seq data from BG02ES (27), a human embryonic stem cell (ESC) line, to infer the directionality of the replication-fork genome-wide. Similarly to the transcriptional strand asymmetry ratio, the replicative strand asymmetry ratio reflected the occurrences of the transposable elements in the leading orientation over their total occurrences. We found that LINEs and SINEs were more likely to be found in the leading strand orientation (Strand asymmetry: LINEs: 0.524, SINEs: 0.520, binomial test, Bonferroni corrected P-values < 0.001), whereas LTRs did not display a significant orientation bias (P-value > 0.05) (Figure 3A). For SINEs, we found replicative strand asymmetries at Alu repeats but not at MIR repeats (Figure 3A). We also separated LINEs into L1 and L2 repeats and found replicative strand asymmetries only for L1 repeats (Figure 3A).

python contained_asymmetries.py Bg02es_RepliStrand.bed LINEs.bed
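To make the notion of a contained strand asymmetry concrete, the sketch below counts features that fall inside stranded regions and share, or do not share, the region's strand. It is only an illustration under stated assumptions: the intervals are toy BED-like tuples, the function name is hypothetical, the search is a naive nested loop, and the strand column of the region file is assumed to encode the direction of interest (for example, replication fork direction).

```python
from collections import Counter


def count_orientations(regions, features):
    """Count features contained in a region, split by relative strand.

    Both inputs are lists of (chrom, start, end, strand) tuples using
    half-open, BED-style coordinates. A feature contained in several
    regions is counted once per region.
    """
    counts = Counter(same=0, opposite=0)
    for f_chrom, f_start, f_end, f_strand in features:
        for r_chrom, r_start, r_end, r_strand in regions:
            contained = (f_chrom == r_chrom
                         and r_start <= f_start and f_end <= r_end)
            if contained:
                key = "same" if f_strand == r_strand else "opposite"
                counts[key] += 1
    return counts


# Toy data (hypothetical coordinates).
regions = [("chr1", 0, 1000, "+"), ("chr1", 2000, 3000, "-")]
features = [
    ("chr1", 100, 400, "+"),    # same strand as the first region
    ("chr1", 500, 800, "-"),    # opposite strand to the first region
    ("chr1", 2100, 2400, "-"),  # same strand as the second region
    ("chr1", 900, 1100, "+"),   # not fully contained, ignored
    ("chr2", 100, 400, "+"),    # no region on this chromosome, ignored
]
counts = count_orientations(regions, features)
ratio = counts["same"] / (counts["same"] + counts["opposite"])
print(counts, round(ratio, 3))
```

For genome-scale inputs an interval index or a dedicated library would be needed; the nested loop is kept only for readability.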

We investigated if the replicative orientation of these endogenous elements was associated with the likelihood of observing germline structural variants. To perform this, we used the structural variant breakpoints that were oriented relative to the repetitive elements and studied their replicative strand asymmetry:

python orientation.py Bg02es_RepliStrand.bed LINEs.bed

We corrected for the background asymmetry in the orientation of each transposable element and investigated if structural variant breakpoints were more likely to be found at a specific orientation. For both LINEs and SINEs there was a significant strand asymmetry, with a higher frequency of structural variant breakpoints at the leading orientation (P-value < 0.001), whereas for LTRs no bias was detected (P-value > 0.05) (Figure 3C). We performed the same analysis for the subdivided LINE and SINE repeat elements and found structural variant breakpoint replicative strand asymmetry for L1 and Alu elements, but not for L2 or MIR elements (Figure 3C), similar to our results regarding the transcriptional strand asymmetries.

python contained_asymmetries.py Bg02es_RepliStrand.bed LINEs.bed gnomad_v2.1_sv.sites.LINEs.bed --expected_asym 0.524

We also separated Alu repeats into the three subfamilies (AluJ, AluY and AluS) and L1 repeats into primate-specific (L1P) and mammalian-wide (L1M) and found consistent transcriptional and replicative strand asymmetries in all of them (Supplementary Table S1). However, we only found breakpoint strand asymmetries in AluY and AluS repeats for Alu subfamilies and L1P for L1 subfamilies (Supplementary Table S2), consistent with previous work indicating that only members of AluY, AluS and L1P subfamilies remain active in the human genome (35,36). We currently cannot rule out that other mechanisms, such as nonallelic homologous recombination, contribute to the observed differences at younger repeats, and future experimental work is required to provide additional evidence for this.

Finally, we investigated if the transcriptional and replicative strand asymmetries of transposable elements were dependent on each other or if they were independent contributors. When we controlled for transcription direction and performed the replicative strand asymmetry analysis, the results remained largely unaltered, as was the case when controlling replicative orientation and performing the transcriptional strand asymmetry analysis. These results provide additional evidence that endogenous repeat elements have orientation preferences determined by both replication and transcription.

Strand asymmetries of TFBSs at promoters and across transcribed regions

Many regulatory elements are found within transcribed regions. Nevertheless, it remains unknown if the transcription process influences the transcription factor DNA strand regulatory grammar within transcribed regions. Here, we generated a transcriptome-wide map of human TFBSs with FIMO (29) using the JASPAR vertebrate non-redundant list of transcription factors (28). We filtered out transcription factor Position Weight Matrices (PWMs) for which there were no matches meeting the P-value threshold, resulting in 551 PWMs, representing a diverse set of transcription factors. We oriented each TFBS occurrence with respect to the transcription direction as template or non-template. As a null hypothesis, we assumed that TFBSs are equally likely to occur at both orientations.

First, we investigated if TFBS orientation biases could be identified across transcribed regions (transcription start to transcription end). We found that out of 551 TF PWMs, 248 (45%) displayed significant transcriptional strand asymmetries (binomial tests with P-value < 0.05, Bonferroni corrected) (Figure 4A). To account for the influence of the nucleotide composition in TFBS transcriptional strand asymmetries, we shuffled the order of the rows of each PWM 100 times, from which we estimated the average expected transcriptional strand asymmetries. After correcting for nucleotide composition biases, we found that 150 (27%) of the transcription factors showed significant transcriptional strand asymmetries, with 73% being shared with our earlier model (Supplementary Figure S1a). These results indicate that the orientation of TFBSs is not random across transcribed regions.
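The row-shuffling control can be sketched as follows. This is an illustrative reading rather than the code used in the study: it assumes that each PWM row corresponds to one motif position (as in MEME-style matrices), so that permuting rows keeps the overall nucleotide composition of the motif while destroying its positional pattern. The function name, the seed and the toy matrix are hypothetical.

```python
import random


def shuffled_pwms(pwm, n_shuffles=100, seed=0):
    """Return n_shuffles copies of `pwm` with the row order permuted.

    `pwm` is a list of rows; each row is assumed to hold the A, C, G, T
    probabilities of one motif position. The input is left unchanged.
    """
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    shuffled = []
    for _ in range(n_shuffles):
        rows = [row[:] for row in pwm]  # copy rows before permuting
        rng.shuffle(rows)
        shuffled.append(rows)
    return shuffled


# Toy 4-position motif (hypothetical values).
toy_pwm = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
controls = shuffled_pwms(toy_pwm, n_shuffles=100, seed=1)
print(len(controls), "shuffled matrices generated")
```

Each shuffled matrix would then be scanned in the same way as the original PWM, and the mean asymmetry of the shuffled scans used as the expected value in the binomial test.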

Figure 4. A large proportion of transcription factor binding sites display significant transcriptional strand asymmetry bias. (A, B) Volcano plots showing the transcriptional strand asymmetry of TFBSs and the associated P-values from binomial testing with Bonferroni correction, for multiple transcription factors. Grey colored marks represent TFBSs with non-statistically significant strand asymmetries. (A) Across transcribed regions. (B) Across 1 kb upstream from the TSS and across 1 kb downstream from the TSS. (C) Volcano plots across five cell lines showing the strand asymmetry and Bonferroni corrected binomial P-values of TFBSs for each ChIP-seq experiment. Strand asymmetry relative to gene orientation of transcription factors found in at least four ChIP-seq experiments and showing statistically significant TFBS strand asymmetry bias for at least 75% of the experiments performed. (D−F) Box plots display the statistically significant strand asymmetry scores for each transcription factor across ChIP-seq experiments. (D) Strand asymmetry across transcribed regions. (E) Strand asymmetry across promoter upstream regions. (F) Strand asymmetry across promoter downstream regions. The embedded text in D−F displays the number of ChIP-seq experiments for which statistically significant TFBS strand asymmetry was observed. (G) CTCF motif orientation across genic regions, separated in bins. Each dot represents a ChIP-seq experiment for which the CTCF motif with the highest binding score and proximity to the center of each peak was used. The light blue line represents the median strand asymmetry across all ChIP-seq experiments. (H) Strand asymmetry analysis of TATA-box and INR motifs as a function of their pairwise distance within promoter regions.

We also compared the strand asymmetry bias of each TFBS in the promoter upstream region (−1000 bp to Transcription Start Site (TSS)) and the promoter downstream region (TSS to 1000 bp). We found that on average TFBSs displayed stronger strand asymmetry patterns in the downstream promoter regions, with median absolute orientation biases of 7.14% versus 11.25% in the upstream and downstream promoter regions, respectively (Mann−Whitney U, P-value = 3.1e−9) (Figure 4B). However, the stronger strand asymmetry patterns for the downstream promoter region relative to the upstream promoter regions were explained by the nucleotide composition restrictions of the former (Supplementary Figure S1b). These results are in accordance with the notion that the transcription process imposes restrictions in the orientation preference of TFBSs.

An example of the Asymmetron command for one of the transcription factors:

python contained_asymmetries.py gencode.v33.1kB_upstream.bed CTCF.bed
python contained_asymmetries.py gencode.v33.1kB_downstream.bed CTCF.bed

To provide additional evidence that the observed TFBS strand asymmetries relative to the transcription direction reflect differences in the likelihood of transcription factor binding, we performed an extended analysis using the UniBind dataset. This dataset encompasses ChIP-seq experiments of 231 transcription regulators studied across 315 diverse cell lines and conditions (30). For each ChIP-seq peak in each experiment, the TFBS with the highest binding score and closest proximity to the peak summit for the corresponding transcription factor is selected, generating genome-wide high confidence TFBS maps. Using this dataset, we compared the strand asymmetry of TFBSs upstream and downstream of the TSS.

We measured the orientation preference of each transcription factor across cell lines and conditions at transcribed regions and found that transcription factors displayed significant orientation preference relative to the transcription direction in ∼20% of ChIP-seq experiments (binomial test with Bonferroni correction, P-value < 0.05). We focused our analysis on the five cell lines with the most experiments available, K562, MCF7, HEPG2, GM12878 and A549. We found that certain transcription factors consistently displayed orientation preference across multiple experiments and across different cell lines (Figure 4C). Some of the most pronounced asymmetries included those of RUNX1, SOX2, FOXA1, FOXA2, ZNF384, HNF4A, HNF4G, EGR1, ESRRA, NFYA, NFYB, NFIC, USF1, USF2, E2F4 and KLF4. When we compared ChIP-seq experiments across cell lines for these transcription factors, we found consistency in both the orientation preference and the statistical significance (Figure 4C and D).

We also subdivided the analysis into the promoter upstream region (−1000 bp to TSS) and the promoter downstream region (TSS to 1000 bp). Although we were underpowered, we found that in the promoter upstream region CTCF and CTCFL TFBSs consistently displayed a preference for the non-template strand. In particular, out of 202 ChIP-seq experiments analyzed, 194 showed a statistically significant orientation preference of the CTCF motif for the non-template strand after multiple testing correction in the promoter upstream region (Figure 4E). In the promoter downstream region, YY1 displayed a preference for the non-template strand with a statistically significant strand asymmetry in 18 out of 19 experiments analyzed (Figure 4F). For CTCF, we investigated if the observed transcriptional strand asymmetry was influenced by its motif positioning across the gene. We found across ChIP-seq experiments that the bound CTCF motif orientations were influenced by the position in the gene, with a negative correlation relative to bin number from TSS to Transcription End Site (TES) (Pearson correlation, r = −0.96, P-value < 0.001) and a non-template strand asymmetry in promoter upstream and transcription termination downstream regions (Figure 4G). These results confirm our previous findings, which showed that transcription factors are not strand agnostic and that orientation relative to the TSS could determine function.
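The positional analysis of CTCF can be illustrated with a short sketch that assigns a motif position to a bin along the gene, counted from the TSS, and then correlates bin number with per-bin strand asymmetry. It is a sketch under assumptions: the number of bins, the function names and the per-bin asymmetry values are hypothetical and are not taken from the study, and positions are assumed to lie inside the gene.

```python
from math import sqrt


def relative_bin(position, gene_start, gene_end, gene_strand, n_bins=10):
    """Bin index of a position along a gene, from the TSS (bin 0) to the TES.

    Coordinates are half-open and BED-style. On the minus strand the TSS
    sits at the highest coordinate, so the offset is mirrored.
    """
    length = gene_end - gene_start
    offset = position - gene_start
    if gene_strand == "-":
        offset = length - 1 - offset
    return min(n_bins - 1, offset * n_bins // length)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-bin asymmetry ratios that decline from TSS to TES.
bins = list(range(10))
asymmetry = [0.56, 0.55, 0.53, 0.53, 0.51, 0.50, 0.49, 0.48, 0.48, 0.46]
print(f"r = {pearson_r(bins, asymmetry):.2f}")
```

Mirroring the offset for minus-strand genes keeps bin 0 at the TSS regardless of gene orientation, which is what allows bins from different genes to be pooled.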

Detecting motif orientation bias in homotypic motif occurrences

To showcase how Asymmetron can be used to study the orientation preference of consecutive occurrences of a motif, we performed a case study on the CTCF motif. We investigated the orientation bias of high confidence homotypic CTCF motif occurrences across the human genome. We found a preference for the same orientation for consecutive CTCF motif occurrences within distances of up to 100 bp (empirical P-value < 0.001).

python consecutive_patterns.py CTCF.bed
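A minimal sketch of the consecutive-occurrence analysis is given below. It is not the Asymmetron code: the function names, the toy sites and the permutation scheme are assumptions. Consecutive sites on the same chromosome are paired when the gap between them is at most 100 bp, and an empirical one-sided P-value is obtained by shuffling the strand labels among sites, which is only one of several ways such a P-value could be derived.

```python
import random


def consecutive_same_pairs(sites, max_distance=100):
    """Return (same-strand pairs, all pairs) among consecutive sites.

    `sites` is a list of (chrom, start, end, strand) tuples; neighbours
    are paired if they share a chromosome and lie within max_distance.
    """
    ordered = sorted(sites)
    same = total = 0
    for (c1, _s1, e1, st1), (c2, s2, _e2, st2) in zip(ordered, ordered[1:]):
        if c1 == c2 and s2 - e1 <= max_distance:
            total += 1
            same += st1 == st2
    return same, total


def empirical_p(sites, max_distance=100, n_perm=1000, seed=0):
    """One-sided empirical P-value for an excess of same-strand pairs."""
    rng = random.Random(seed)
    observed, _ = consecutive_same_pairs(sites, max_distance)
    strands = [site[3] for site in sites]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(strands)  # permute strand labels, keep positions fixed
        permuted = [(c, s, e, st)
                    for (c, s, e, _old), st in zip(sites, strands)]
        same, _ = consecutive_same_pairs(permuted, max_distance)
        hits += same >= observed
    return (hits + 1) / (n_perm + 1)


# Toy motif occurrences (hypothetical coordinates).
toy_sites = [
    ("chr1", 0, 19, "+"), ("chr1", 50, 69, "+"), ("chr1", 100, 119, "-"),
    ("chr1", 5000, 5019, "-"), ("chr2", 10, 29, "+"),
]
print(consecutive_same_pairs(toy_sites), empirical_p(toy_sites, n_perm=200))
```

Adding one to both the numerator and the denominator keeps the empirical P-value from ever being reported as exactly zero.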

Other potential applications of this function include the identification of miRNA clusters with strand bias, the investigation of CTCF orientations at long genomic distances and in the 3D organization of the genome (same/opposite and convergent/divergent orientation analyses) and the orientation preferences of endogenous repeat elements, among others.

Detecting motif orientation bias in motif pairs using Asymmetron

To show how to use Asymmetron to study strand asymmetries in motif pairs, we examined a previously characterized orientation bias of the TATA-box relative to the initiator element (INR) (37), the locations of both of which were extracted using JASPAR PWM motifs. We focused our analysis on regions around the TSS (−1000 bp to +1000 bp). We found that both the orientation and the pairwise distance of the two motifs were highly biased (Figure 4H) and consistent with the literature (38). In particular, there was a significant bias relative to their orientation with preference for the opposite strand (P-value < e−11), which was pronounced at a pairwise distance of 30–50 bp and which disappeared for shorter or longer distances (Figure 4H).

python pairwise_asymmetries.py TATA_box.bed INR.bed
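The pairwise analysis can be sketched in the same spirit. The example below counts same-strand and opposite-strand pairs of two motif types whose gap falls inside a distance window. It is an illustration only: the function name and coordinates are hypothetical, and the pairwise distance is assumed to be the gap between the two intervals (zero when they overlap), which may differ from the definition used by Asymmetron.

```python
def pairwise_orientation(motifs_a, motifs_b, min_distance, max_distance):
    """Count (same, opposite) strand pairs within a distance window.

    Both inputs are lists of (chrom, start, end, strand) tuples with
    half-open, BED-style coordinates.
    """
    same = opposite = 0
    for chrom_a, start_a, end_a, strand_a in motifs_a:
        for chrom_b, start_b, end_b, strand_b in motifs_b:
            if chrom_a != chrom_b:
                continue
            # Gap between the two intervals; overlapping motifs give 0.
            distance = max(0, max(start_a, start_b) - min(end_a, end_b))
            if min_distance <= distance <= max_distance:
                if strand_a == strand_b:
                    same += 1
                else:
                    opposite += 1
    return same, opposite


# Toy motif occurrences (hypothetical coordinates).
tata_sites = [("chr1", 100, 108, "+")]
inr_sites = [
    ("chr1", 140, 147, "-"),  # gap of 32 bp, opposite strand
    ("chr1", 300, 307, "+"),  # gap of 192 bp, outside the window
    ("chr2", 140, 147, "-"),  # different chromosome, ignored
]
print(pairwise_orientation(tata_sites, inr_sites, 30, 50))
```

Running the function over successive distance windows gives a distance-resolved view of the orientation bias, and the resulting counts can be tested against an even split with a binomial test.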

DISCUSSION

Asymmetron is a multi-purpose toolkit that enables the exploration of strand asymmetries in diverse biological problems. We applied Asymmetron to four different biological problems, showing that: (i) germline structural variants are more likely to overlap LINE and SINE transposable elements with particular orientations relative to transcription and replication direction, (ii) 45% of transcription factors show highly biased TFBS orientation preferences relative to transcriptional direction across human genic regions, (iii) homotypic occurrences of nearby CTCF motifs are biased towards the same strand and (iv) TATA-box and INR motifs found at the core promoter display a motif orientation bias.

The observed background asymmetries in the orientation of LINE and SINE transposable elements could have been influenced by the contribution of polyadenylation signals within them through selection pressure, as previously suggested (33). In addition, dynamic changes in replication directionality during evolution could explain the weaker strand asymmetry biases observed, relative to transcriptional strand asymmetry biases, especially for inactive transposable elements (Figure 3A). The identification of biases in population structural variant breakpoints relative to LINE and SINE transposable elements suggests activity of these elements in the germline, which has been influenced by their orientation relative to the direction of replication and transcription. MIR and L2 repeats have lost the ability to retrotranspose (34), whereas a small subset of L1 and Alu repeats remain active today (35,36,39,40). The observed strand asymmetries at L1 and Alu repeats are consistent with this notion and with previous work, finding a preference of L1 repeat integration towards the leading orientation (41). The absence of structural variant strand asymmetries at LTRs is also consistent with the notion that these elements are inactive. Additional work is required to understand if the observed asymmetries of structural variants at transposable elements are aggravated in cancer and other disorders and if they are associated with disease development.

We have shown that the orientation preference of multiple TFBSs around promoters and at transcribed regions cannot be explained by the nucleotide composition differences in the template and non-template strands (Supplementary Figure S1). Therefore, it could be the result of transcription factor preferential affinity for the motif at the forward or reverse-complement orientations and interaction or interference with RNA polymerase progression. A strand preference could also indicate roles of certain transcription factors at the RNA level, examples being SOX2 (Figure 4E) and YY1 (Figure 4F), which are known to bind both DNA and RNA to regulate gene expression (42,43). Our results suggest that TFBS orientation in transcribed regions is non-random and influences gene regulatory grammar. However, it remains unknown how the orientation of transcription factors between closely spaced TFBSs influences steric hindrance and competition for binding (44). Experimental designs that systematically evaluate how the orientation of TFBSs within cis-regulatory modules influences regulatory element activity could further increase our understanding. The conglomeration of transcription factors in cis-regulatory modules could be influenced by the orientation and pairwise distance of TFBSs, and high-throughput reporter assay experiments (45) could provide valuable insights in this direction.

Asymmetron enables the study of asymmetric biological processes. Investigation of transcriptional and replicative strand asymmetries across biological organisms reflects the number of replication forks and their orientation, the gene density and the diverse mechanisms safeguarding genome integrity (1–4). Strand asymmetry analysis could increase our understanding of mutational processes across different disorders and evolution (6,13). In cancer, the orientation of structural variant breakpoints could reveal unknown mutational mechanisms. In gene regulation, investigation of orientation preferences between transcription factors and their location relative to transcriptional direction could enable better modelling of gene expression. In summary, we have shown that Asymmetron can serve as a useful tool to annotate and detect DNA strand asymmetries and associate them with specific biological functions.

DATA AVAILABILITY

A Python implementation package can be found in GitHub:

https://github.com/Ahituv-lab/Asymmetron.

ACKNOWLEDGEMENTS

We would like to thank Michael Kosicki and Vikram Agarwal for helpful discussions and useful manuscript comments.

Author contributions: I.G.S., I.M. and N.A. conceived the study. I.G.S., I.M. and G.P. wrote the code, performed the analyses and generated the visualizations. N.M., M.H. and N.A. supervised the research. I.G.S., I.M., G.P., N.M., M.H. and N.A. wrote the manuscript.

Contributor Information

Ilias Georgakopoulos-Soares, Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA; Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA.

Ioannis Mouratidis, Aristotle University of Thessaloniki, Department of Mathematics, Thessaloniki, GR, Greece.

Guillermo E Parada, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK; Wellcome Trust Cancer Research UK Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge CB2 1QN, UK.

Navneet Matharu, Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA; Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA; Innovative Genomics Institute, University of California San Francisco, San Francisco, CA, USA.

Martin Hemberg, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK; Wellcome Trust Cancer Research UK Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge CB2 1QN, UK.

Nadav Ahituv, Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA; Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Human Genome Research Institute [1UM1HG009408, R01HG010333, 1R21HG010065, 1R21HG010683 to N.A.]; National Institute of Mental Health [1R01MH109907, 1U01MH116438 to N.A.]; National Heart Lung and Blood Institute [R35HL145235 to N.A.]; G.P. and M.H. were supported by a core grant from the Wellcome Trust. Funding for open access charge: NHGRI.

Conflict of interest statement. None declared.

REFERENCES

1. Lobry J.R. Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 1996; 13:660–665.
2. Kano-Sueoka T., Lobry J.R., Sueoka N. Intra-strand biases in bacteriophage T4 genome. Gene. 1999; 238:59–64.
3. Mrázek J., Karlin S. Strand compositional asymmetry in bacterial and large viral genomes. Proc. Natl. Acad. Sci. U.S.A. 1998; 95:3720–3725.
4. Green P., Ewing B., Miller W., Thomas P.J., NISC Comparative Sequencing Program, Green E.D. Transcription-associated mutational asymmetry in mammalian evolution. Nat. Genet. 2003; 33:514–517.
5. Rocha E.P.C., Touchon M., Feil E.J. Similar compositional biases are caused by very different mutational effects. Genome Res. 2006; 16:1537–1547.
6. Polak P., Arndt P.F. Transcription induces strand-specific mutations at the 5′ end of human genes. Genome Res. 2008; 18:1216–1223.
7. Grigoriev A. Strand-specific compositional asymmetries in double-stranded DNA viruses. Virus Res. 1999; 60:1–19.
8. Rocha E.P.C. The organization of the bacterial genome. Annu. Rev. Genet. 2008; 42:211–233.
9. Million-Weaver S., Samadpour A.N., Moreno-Habel D.A., Nugent P., Brittnacher M.J., Weiss E., Hayden H.S., Miller S.I., Liachko I., Merrikh H. An underlying mechanism for the increased mutagenesis of lagging-strand genes in Bacillus subtilis. Proc. Natl. Acad. Sci. U.S.A. 2015; 112:E1096–E1105.
10. Xia B., Yan Y., Baron M., Wagner F., Barkley D., Chiodin M., Kim S.Y., Keefe D.L., Alukal J.P., Boeke J.D. et al. Widespread transcriptional scanning in the testis modulates gene evolution rates. Cell. 2020; 180:248–262.
11. Hanawalt P.C., Spivak G. Transcription-coupled DNA repair: two decades of progress and surprises. Nat. Rev. Mol. Cell Biol. 2008; 9:958–970.
12. Pleasance E.D., Stephens P.J., O’Meara S., McBride D.J., Meynert A., Jones D., Lin M.-L., Beare D., Lau K.W., Greenman C. et al. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature. 2010; 463:184–190.
13. Morganella S., Alexandrov L.B., Glodzik D., Zou X., Davies H., Staaf J., Sieuwerts A.M., Brinkman A.B., Martin S., Ramakrishna M. et al. The topography of mutational processes in breast cancer genomes. Nat. Commun. 2016; 7:11383.
14. Haradhvala N.J., Polak P., Stojanov P., Covington K.R., Shinbrot E., Hess J.M., Rheinbay E., Kim J., Maruvka Y.E., Braunstein L.Z. et al. Mutational strand asymmetries in cancer genomes reveal mechanisms of DNA damage and repair. Cell. 2016; 164:538–549.
15. Tomkova M., Tomek J., Kriaucionis S., Schuster-Böckler B. Mutational signature distribution varies with DNA replication timing and strand asymmetry. Genome Biol. 2018; 19:129.
16. Guo Y., Xu Q., Canzio D., Shou J., Li J., Gorkin D.U., Jung I., Wu H., Zhai Y., Tang Y. et al. CRISPR inversion of CTCF sites alters genome topology and enhancer/promoter function. Cell. 2015; 162:900–910.
17. Natesan S., Gilman M.Z. DNA bending and orientation-dependent function of YY1 in the c-fos promoter. Genes Dev. 1993; 7:2497–2509.
18. Seldeen K.L., McDonald C.B., Deegan B.J., Farooq A. Single nucleotide variants of the TGACTCA motif modulate energetics and orientation of binding of the Jun-Fos heterodimeric transcription factor. Biochemistry. 2009; 48:1975–1983.
19. Jolma A., Yin Y., Nitta K.R., Dave K., Popov A., Taipale M., Enge M., Kivioja T., Morgunova E., Taipale J. DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature. 2015; 527:384–388.
20. Stenger J.E., Lobachev K.S., Gordenin D., Darden T.A., Jurka J., Resnick M.A. Biased distribution of inverted and direct Alus in the human genome: implications for insertion, exclusion, and genome stability. Genome Res. 2001; 11:12–27.
21. Kim E.Z., Wespiser A.R., Caffrey D.R. The domain structure and distribution of Alu elements in long noncoding RNAs and mRNAs. RNA. 2016; 22:254–264.
22. Pelechano V., Steinmetz L.M. Gene regulation by antisense transcription. Nat. Rev. Genet. 2013; 14:880–893.
23. Dale R.K., Pedersen B.S., Quinlan A.R. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics. 2011; 27:3423–3424.
24. Virtanen P., Gommers R., Oliphant T.E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W., Bright J. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods. 2020; 17:261–272.
25. Frankish A., Diekhans M., Ferreira A.-M., Johnson R., Jungreis I., Loveland J., Mudge J.M., Sisu C., Wright J., Armstrong J. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 2019; 47:D766–D773.
26. Karczewski K.J., Francioli L.C., Tiao G., Cummings B.B., Alföldi J., Wang Q., Collins R.L., Laricchia K.M., Ganna A., Birnbaum D.P. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020; 581:434–443.
27. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489:57–74.
28. Fornes O., Castro-Mondragon J.A., Khan A., van der Lee R., Zhang X., Richmond P.A., Modi B.P., Correard S., Gheorghe M., Baranašić D. et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2020; 48:D87–D92.
29. Grant C.E., Bailey T.L., Noble W.S. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011; 27:1017–1018.
30. Gheorghe M., Sandve G.K., Khan A., Chèneby J., Ballester B., Mathelier A. A map of direct TF-DNA interactions in the human genome. Nucleic Acids Res. 2019; 47:e21.
31. Georgakopoulos-Soares I., Koh G., Momen S.E., Jiricny J., Hemberg M., Nik-Zainal S. Transcription-coupled repair and mismatch repair contribute towards preserving genome integrity at mononucleotide repeat tracts. Nat. Commun. 2020; 11:1980.
32. Medstrand P., van de Lagemaat L.N., Mager D.L. Retroelement distributions in the human genome: variations associated with age and proximity to genes. Genome Res. 2002; 12:1483–1495.
33. Roy-Engel A.M., El-Sawy M., Farooq L., Odom G.L., Perepelitsa-Belancio V., Bruch H., Oyeniran O.O., Deininger P.L. Human retroelements may introduce intragenic polyadenylation signals. Cytogenet. Genome Res. 2005; 110:365–371.
34. Krull M., Petrusma M., Makalowski W., Brosius J., Schmitz J. Functional persistence of exonized mammalian-wide interspersed repeat elements (MIRs). Genome Res. 2007; 17:1139–1145.
35. Bennett E.A., Keller H., Mills R.E., Schmidt S., Moran J.V., Weichenrieder O., Devine S.E. Active Alu retrotransposons in the human genome. Genome Res. 2008; 18:1875–1883.
36. Brouha B., Schustak J., Badge R.M., Lutz-Prigge S., Farley A.H., Moran J.V., Kazazian H.H. Jr. Hot L1s account for the bulk of retrotransposition in the human population. Proc. Natl. Acad. Sci. U.S.A. 2003; 100:5280–5285.
37. O’Shea-Greenfield A., Smale S.T. Roles of TATA and initiator elements in determining the start site location and direction of RNA polymerase II transcription. J. Biol. Chem. 1992; 267:6450.
38. Carcamo J., Buckbinder L., Reinberg D. The initiator directs the assembly of a transcription factor IID-dependent transcription complex. Proc. Natl. Acad. Sci. U.S.A. 1991; 88:8052–8056.
39. Sassaman D.M., Dombroski B.A., Moran J.V., Kimberland M.L., Naas T.P., DeBerardinis R.J., Gabriel A., Swergold G.D., Kazazian H.H. Jr. Many human L1 elements are capable of retrotransposition. Nat. Genet. 1997; 16:37–43.
40. Rodriguez-Martin B., Alvarez E.G., Baez-Ortega A., Zamora J., Supek F., Demeulemeester J., Santamarina M., Ju Y.S., Temes J., Garcia-Souto D. et al. Pan-cancer analysis of whole genomes identifies driver rearrangements promoted by LINE-1 retrotransposition. Nat. Genet. 2020; 52:306–319.
41. Flasch D.A., Macia Á., Sánchez L., Ljungman M., Heras S.R., García-Pérez J.L., Wilson T.E., Moran J.V. Genome-wide de novo L1 retrotransposition connects endonuclease activity with replication. Cell. 2019; 177:837–851.
42. Holmes Z.E., Hamilton D.J., Hwang T., Parsonnet N.V., Rinn J.L., Wuttke D.S., Batey R.T. The Sox2 transcription factor binds RNA. Nat. Commun. 2020; 11:1805.
43. Sigova A.A., Abraham B.J., Ji X., Molinie B., Hannett N.M., Guo Y.E., Jangi M., Giallourakis C.C., Sharp P.A., Young R.A. Transcription factor trapping by RNA in gene regulatory elements. Science. 2015; 350:978–981.
44. Weingarten-Gabbay S., Segal E. The grammar of transcriptional regulation. Hum. Genet. 2014; 133:701–711.
45. Inoue F., Ahituv N. Decoding enhancers using massively parallel reporter assays. Genomics. 2015; 106:159–164.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

gkaa1052_Supplemental_File

Click here for additional data file.^{(110.3KB, pdf)}

Data Availability Statement

A Python implementation package can be found in GitHub:

https://github.com/Ahituv-lab/Asymmetron.

[B1] 1. Lobry J.R. Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 1996; 13:660–665. [DOI] [PubMed] [Google Scholar]

[B2] 2. Kano-Sueoka T., Lobry J.R., Sueoka N.. Intra-strand biases in bacteriophage T4 genome. Gene. 1999; 238:59–64. [DOI] [PubMed] [Google Scholar]

[B3] 3. Mrázek J., Karlin S.. Strand compositional asymmetry in bacterial and large viral genomes. Proc. Natl. Acad. Sci. U.S.A. 1998; 95:3720–3725. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B4] 4. Green P., Ewing B., Miller W., Thomas P.J., Comparative Sequencing Program N.I.S.C., Green E.D.. Transcription-associated mutational asymmetry in mammalian evolution. Nat. Genet. 2003; 33:514–517. [DOI] [PubMed] [Google Scholar]

5. Rocha E.P.C., Touchon M., Feil E.J. Similar compositional biases are caused by very different mutational effects. Genome Res. 2006; 16:1537–1547.

6. Polak P., Arndt P.F. Transcription induces strand-specific mutations at the 5′ end of human genes. Genome Res. 2008; 18:1216–1223.

7. Grigoriev A. Strand-specific compositional asymmetries in double-stranded DNA viruses. Virus Res. 1999; 60:1–19.

8. Rocha E.P.C. The organization of the bacterial genome. Annu. Rev. Genet. 2008; 42:211–233.

9. Million-Weaver S., Samadpour A.N., Moreno-Habel D.A., Nugent P., Brittnacher M.J., Weiss E., Hayden H.S., Miller S.I., Liachko I., Merrikh H. An underlying mechanism for the increased mutagenesis of lagging-strand genes in Bacillus subtilis. Proc. Natl. Acad. Sci. U.S.A. 2015; 112:E1096–E1105.

10. Xia B., Yan Y., Baron M., Wagner F., Barkley D., Chiodin M., Kim S.Y., Keefe D.L., Alukal J.P., Boeke J.D. et al. Widespread transcriptional scanning in the testis modulates gene evolution rates. Cell. 2020; 180:248–262.

11. Hanawalt P.C., Spivak G. Transcription-coupled DNA repair: two decades of progress and surprises. Nat. Rev. Mol. Cell Biol. 2008; 9:958–970.

12. Pleasance E.D., Stephens P.J., O’Meara S., McBride D.J., Meynert A., Jones D., Lin M.-L., Beare D., Lau K.W., Greenman C. et al. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature. 2010; 463:184–190.

13. Morganella S., Alexandrov L.B., Glodzik D., Zou X., Davies H., Staaf J., Sieuwerts A.M., Brinkman A.B., Martin S., Ramakrishna M. et al. The topography of mutational processes in breast cancer genomes. Nat. Commun. 2016; 7:11383.

14. Haradhvala N.J., Polak P., Stojanov P., Covington K.R., Shinbrot E., Hess J.M., Rheinbay E., Kim J., Maruvka Y.E., Braunstein L.Z. et al. Mutational strand asymmetries in cancer genomes reveal mechanisms of DNA damage and repair. Cell. 2016; 164:538–549.

15. Tomkova M., Tomek J., Kriaucionis S., Schuster-Böckler B. Mutational signature distribution varies with DNA replication timing and strand asymmetry. Genome Biol. 2018; 19:129.

16. Guo Y., Xu Q., Canzio D., Shou J., Li J., Gorkin D.U., Jung I., Wu H., Zhai Y., Tang Y. et al. CRISPR inversion of CTCF sites alters genome topology and enhancer/promoter function. Cell. 2015; 162:900–910.

17. Natesan S., Gilman M.Z. DNA bending and orientation-dependent function of YY1 in the c-fos promoter. Genes Dev. 1993; 7:2497–2509.

18. Seldeen K.L., McDonald C.B., Deegan B.J., Farooq A. Single nucleotide variants of the TGACTCA motif modulate energetics and orientation of binding of the Jun-Fos heterodimeric transcription factor. Biochemistry. 2009; 48:1975–1983.

19. Jolma A., Yin Y., Nitta K.R., Dave K., Popov A., Taipale M., Enge M., Kivioja T., Morgunova E., Taipale J. DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature. 2015; 527:384–388.

20. Stenger J.E., Lobachev K.S., Gordenin D., Darden T.A., Jurka J., Resnick M.A. Biased distribution of inverted and direct Alus in the human genome: implications for insertion, exclusion, and genome stability. Genome Res. 2001; 11:12–27.

21. Kim E.Z., Wespiser A.R., Caffrey D.R. The domain structure and distribution of Alu elements in long noncoding RNAs and mRNAs. RNA. 2016; 22:254–264.

22. Pelechano V., Steinmetz L.M. Gene regulation by antisense transcription. Nat. Rev. Genet. 2013; 14:880–893.

23. Dale R.K., Pedersen B.S., Quinlan A.R. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics. 2011; 27:3423–3424.

24. Virtanen P., Gommers R., Oliphant T.E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W., Bright J. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods. 2020; 17:261–272.

25. Frankish A., Diekhans M., Ferreira A.-M., Johnson R., Jungreis I., Loveland J., Mudge J.M., Sisu C., Wright J., Armstrong J. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 2019; 47:D766–D773.

26. Karczewski K.J., Francioli L.C., Tiao G., Cummings B.B., Alföldi J., Wang Q., Collins R.L., Laricchia K.M., Ganna A., Birnbaum D.P. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020; 581:434–443.

27. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489:57–74.

28. Fornes O., Castro-Mondragon J.A., Khan A., van der Lee R., Zhang X., Richmond P.A., Modi B.P., Correard S., Gheorghe M., Baranašić D. et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2020; 48:D87–D92.

29. Grant C.E., Bailey T.L., Noble W.S. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011; 27:1017–1018.

30. Gheorghe M., Sandve G.K., Khan A., Chèneby J., Ballester B., Mathelier A. A map of direct TF-DNA interactions in the human genome. Nucleic Acids Res. 2019; 47:e21.

31. Georgakopoulos-Soares I., Koh G., Momen S.E., Jiricny J., Hemberg M., Nik-Zainal S. Transcription-coupled repair and mismatch repair contribute towards preserving genome integrity at mononucleotide repeat tracts. Nat. Commun. 2020; 11:1980.

32. Medstrand P., van de Lagemaat L.N., Mager D.L. Retroelement distributions in the human genome: variations associated with age and proximity to genes. Genome Res. 2002; 12:1483–1495.

33. Roy-Engel A.M., El-Sawy M., Farooq L., Odom G.L., Perepelitsa-Belancio V., Bruch H., Oyeniran O.O., Deininger P.L. Human retroelements may introduce intragenic polyadenylation signals. Cytogenet. Genome Res. 2005; 110:365–371.

34. Krull M., Petrusma M., Makalowski W., Brosius J., Schmitz J. Functional persistence of exonized mammalian-wide interspersed repeat elements (MIRs). Genome Res. 2007; 17:1139–1145.

35. Bennett E.A., Keller H., Mills R.E., Schmidt S., Moran J.V., Weichenrieder O., Devine S.E. Active Alu retrotransposons in the human genome. Genome Res. 2008; 18:1875–1883.

36. Brouha B., Schustak J., Badge R.M., Lutz-Prigge S., Farley A.H., Moran J.V., Kazazian H.H. Jr. Hot L1s account for the bulk of retrotransposition in the human population. Proc. Natl. Acad. Sci. U.S.A. 2003; 100:5280–5285.

37. O’Shea-Greenfield A., Smale S.T. Roles of TATA and initiator elements in determining the start site location and direction of RNA polymerase II transcription. J. Biol. Chem. 1992; 267:6450.

38. Carcamo J., Buckbinder L., Reinberg D. The initiator directs the assembly of a transcription factor IID-dependent transcription complex. Proc. Natl. Acad. Sci. U.S.A. 1991; 88:8052–8056.

39. Sassaman D.M., Dombroski B.A., Moran J.V., Kimberland M.L., Naas T.P., DeBerardinis R.J., Gabriel A., Swergold G.D., Kazazian H.H. Jr. Many human L1 elements are capable of retrotransposition. Nat. Genet. 1997; 16:37–43.

40. Rodriguez-Martin B., Alvarez E.G., Baez-Ortega A., Zamora J., Supek F., Demeulemeester J., Santamarina M., Ju Y.S., Temes J., Garcia-Souto D. et al. Pan-cancer analysis of whole genomes identifies driver rearrangements promoted by LINE-1 retrotransposition. Nat. Genet. 2020; 52:306–319.

41. Flasch D.A., Macia Á., Sánchez L., Ljungman M., Heras S.R., García-Pérez J.L., Wilson T.E., Moran J.V. Genome-wide de novo L1 retrotransposition connects endonuclease activity with replication. Cell. 2019; 177:837–851.

42. Holmes Z.E., Hamilton D.J., Hwang T., Parsonnet N.V., Rinn J.L., Wuttke D.S., Batey R.T. The Sox2 transcription factor binds RNA. Nat. Commun. 2020; 11:1805.

43. Sigova A.A., Abraham B.J., Ji X., Molinie B., Hannett N.M., Guo Y.E., Jangi M., Giallourakis C.C., Sharp P.A., Young R.A. Transcription factor trapping by RNA in gene regulatory elements. Science. 2015; 350:978–981.

44. Weingarten-Gabbay S., Segal E. The grammar of transcriptional regulation. Hum. Genet. 2014; 133:701–711.

45. Inoue F., Ahituv N. Decoding enhancers using massively parallel reporter assays. Genomics. 2015; 106:159–164.
