BMC Bioinformatics. 2025 Jun 9;26:156. doi: 10.1186/s12859-025-06160-x

SEQSIM: A novel bioinformatics tool for comparisons of promoter regions—a case study of calcium binding protein spermatid associated 1 (CABS1)

Joy Ramielle L Santos 1,✉,#, Weijie Sun 2, A Dean Befus 1, Marcelo Marcet-Palacios 1,#
PMCID: PMC12150522  PMID: 40490738

Abstract

Background

Understanding transcriptional regulation requires an in-depth analysis of promoter regions, which house vital cis-regulatory elements such as core promoters, enhancers, and silencers. Despite the significance of these regions, genome-wide characterization remains a challenge due to data complexity and computational constraints. Traditional bioinformatics tools like Clustal Omega face limitations in handling extensive datasets, impeding comprehensive analysis. To bridge this gap, we developed SEQSIM, a sequence comparison tool leveraging an optimized Needleman-Wunsch algorithm for high-speed comparisons. SEQSIM can analyze complete human promoter datasets in under an hour, overcoming prior computational barriers.

Results

Applying SEQSIM, we conducted a case study on CABS1, a gene associated with spermatogenesis and stress response but lacking well-defined functions. Our genome-wide promoter analysis revealed 41 distinct homology clusters, with CABS1 residing within a cluster that includes promoters of genes such as VWCE, SPOCK1, and TMX2. These associations suggest potential co-regulatory networks. Additionally, our findings unveiled conserved promoter motifs and long-range regulatory sequences, including LINE-1 transposable element fragments shared by CABS1 and nearby genes, implying evolutionary conservation and regulatory significance.

Conclusions

These results provide insight into potential gene regulation mechanisms, enhancing our understanding of transcriptional control and suggesting new pathways for functional exploration. Future studies incorporating SEQSIM could elucidate co-regulatory networks and chromatin interactions that impact gene expression.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12859-025-06160-x.

Keywords: Promoter sequence similarity, transposable elements, CABS1 gene regulation, chromatin architecture, SEQSIM algorithm

Background

The regulation of transcription initiation is an essential and complex process involving various regulatory nucleotide sequences adjacent to genes. These cis-regulatory elements (CREs) include promoters, enhancers, silencers and other elements that determine when a gene is expressed. Other regulatory elements that can be more distal to the gene start site, such as transposable elements, insulators, boundary elements, and DNA that codes for non-coding RNAs, add further complexity to the regulation of gene expression [1, 2]. For example, sites bound by CCCTC-binding factor (CTCF), a protein that binds specifically to the CCCTC motif, act as insulators and facilitate the reorganization of 3D chromatin structure. This is important in bringing promoters, enhancers and transcriptional machinery together to facilitate initiation [3]. Although all human genes are differentially regulated at the transcriptional level, there are common regulatory elements used according to gene type. For example, elements such as NF-κB binding sites are present in inducible genes but absent in housekeeping genes [4, 5]. By performing a genome-wide analysis of promoter sequence homology, we aim to elucidate the diversity of human promoters more clearly and to assess the level of similarity within the resulting sequence clusters. We hypothesize that promoters that belong to the same homology cluster can then be aligned based on sequence to highlight conserved regulatory elements. This novel analysis will also help predict groups of genes that are likely to be co-regulated and potentially highlight co-regulation networks.

In an effort to understand human promoter sequence diversity, Gagniuc et al. performed sequence homology analysis of 8512 promoter sequences available in a promoter sequence database [6]. The authors categorized promoters into ten groups according to a 2-dimensional (2D) analysis consisting of plots of Kappa Index of Coincidence vs C + G % within the −499 to 100 positions relative to the transcription initiation site. That paper suggested categorizing the promoters according to the presence of promoter elements (TATA box, GC-box and CCAAT-box), DNA curvature, or nucleosome positioning [7–12]. Unfortunately, this approach was limited by the simplistic nature of 2D comparisons, leading to strictly tabular analyses of promoter types, and was thus unable to group promoters based on predicted functional families. Other limitations of this study were the small number of promoters available in databases at the time and the relatively short DNA sequence selected (e.g., −1950 to −522) [13]. To overcome these limitations, we performed a comprehensive promoter homology analysis of 2000 nucleotides upstream of every human gene.

Commonly used sequence similarity tools, like Clustal Omega, have file size restrictions that limit full-genome dataset analysis. Importing a genome-wide dataset into Clustal Omega is therefore not feasible, both because of these file size limits and because of prohibitively long computation times, estimated at > 1 year using modern equipment on Compute Canada servers. We therefore developed in-house software, Sequence Similarity (SEQSIM), that employs a sequence similarity algorithm inspired by the Needleman-Wunsch algorithm [14], specifically its sequence alignment and scoring matrices, with extensive modification to speed up performance. Unlike traditional dynamic programming approaches, SEQSIM simplifies alignment calculations to optimize performance for large-scale analyses, sharing characteristics with alignment-free methods in its efficiency and ability to handle high-throughput datasets [15]. SEQSIM enabled fast, pairwise comparison of the promoters of every human gene in the Genome Reference Consortium Human Build 38 (GRCh38.p14), providing an alternative method that complements both alignment-based and alignment-free methodologies.

SEQSIM could be especially useful in the study of under-studied genes. We selected calcium binding protein spermatid-associated 1 (CABS1), a poorly characterized gene, as a case study to evaluate SEQSIM. The CABS1 protein is most abundant in the testis and is also found in salivary glands [16]. Recent analysis of GeoData has detected CABS1 mRNA widely expressed in human tissues [17]. To date, CABS1 has been associated with spermatogenesis in several species and with acute psychological stress in humans [18–21]. A 7 amino acid peptide sequence in human CABS1 near its carboxyl terminus has anti-inflammatory activity. Using CABS1 as a case study presents a unique challenge and opportunity as it does not have another homologous human gene or known protein isoforms. In silico studies suggested that CABS1 likely interacts directly with other proteins [22]. Thus, a SEQSIM analysis of the CABS1 promoter and identification of other genes with similar promoter sequences may help reveal CABS1-interacting partners or proteins involved in a CABS1-associated network.

Methods

The pipeline involves data mining of genome information using a Python script; use of a specialized scoring function to compare promoter sequences (SEQSIM software); homology matrix visualization and data clustering using various programs outside of SEQSIM; and validation using third-party software. This pipeline is summarized in Fig. 1.

Fig. 1

Methodology pipeline using the novel comparison software generated for this study. The methodology occurred in three distinct phases: data mining, data processing, and data visualization. Ongoing and future approaches are denoted by the dashed arrows

SEQSIM software development

The software, SEQSIM (Sequence Similarity), was created to perform DNA sequence similarity analyses at a rate of 27 × 10⁶ comparisons per minute, completing a full human genome analysis in under 1 h. SEQSIM imports a list of named sequences and compares all sequences against each other to generate a sequence homology matrix with a percent similarity based on our comparison algorithm.

Data mining and preparation of comma separated values (CSV) import file

To prepare the input files required for SEQSIM, a data mining Python script was used to extract information from GRCh38. Chromosome information was downloaded in text format from NCBI (https://www.ncbi.nlm.nih.gov/gene) with the corresponding chromosome search criteria, e.g. 1[CH] AND human[ORGN]. The script then extracted information such as the nominal gene symbol and name used to reference the promoter, gene type, gene chromosomal location and coordinates, directionality, gene description and gene exon count. Next, the chromosome sequence was downloaded from NCBI’s GenBank repository [23]; for example, NC_000001 refers to chromosome 1. Each chromosome FASTA sequence was downloaded separately. The first nucleotide of each gene and the gene orientation were then used to extract the 2000-nucleotide promoter in the 5’ to 3’ direction. The program compiled all this information into a comma separated values (CSV) file.
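The extraction step can be illustrated with a minimal Python sketch. It is not the authors' script: the function names are hypothetical, and it assumes 1-based chromosome coordinates, that the "first nucleotide" of a minus-strand gene is its highest coordinate, and that minus-strand promoters are reported as the reverse complement so that every promoter reads 5’ to 3’ toward its gene.

```python
# Hypothetical helper names; coordinate conventions are assumptions (see lead-in).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_promoter(chrom_seq, gene_start, strand, length=2000):
    """Return the `length` nucleotides immediately upstream of a gene.

    gene_start is the 1-based coordinate of the gene's first nucleotide
    (assumed to be the highest coordinate for a minus-strand gene).
    Returns None when the chromosome end is too close to the gene.
    """
    if strand == "+":
        begin = gene_start - 1 - length      # 0-based start of the promoter
        if begin < 0:
            return None
        return chrom_seq[begin:gene_start - 1]
    if strand == "-":
        end = gene_start + length            # 0-based exclusive end
        if end > len(chrom_seq):
            return None
        # Upstream of a minus-strand gene lies at higher coordinates.
        return reverse_complement(chrom_seq[gene_start:end])
    raise ValueError("strand must be '+' or '-'")
```

For instance, with the toy chromosome `AAACCCGGGTTT` and `length=3`, a plus-strand gene starting at coordinate 7 has the promoter `CCC`.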

For SEQSIM, an additional CSV file was created with the identification of the sequence in the first column and the sequences to be analyzed in the next column. Alternatively, SEQSIM also accepts TXT files with the name of the sequence and the actual sequence separated by a line break. All sequences were required to be equal in size (2000 nucleotides) and were imported into the program as a string, a sequence of non-numerical characters.
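As a rough illustration of the two-column input described above, the following Python sketch reads name and sequence pairs and enforces the equal-length requirement. The function name and error message are hypothetical, and the sketch does not reproduce SEQSIM's own import code.

```python
import csv
import io


def read_sequences(handle, expected_length=2000):
    """Read (name, sequence) pairs from a two-column CSV file object.

    Raises ValueError if any sequence is not `expected_length` long,
    mirroring the requirement that all inputs be the same size.
    """
    records = []
    for row in csv.reader(handle):
        if len(row) < 2:
            continue                      # skip blank or malformed rows
        name = row[0].strip()
        seq = row[1].strip().upper()
        if len(seq) != expected_length:
            raise ValueError(
                f"{name}: expected {expected_length} nt, got {len(seq)}"
            )
        records.append((name, seq))
    return records


# Toy usage with 4-nt sequences instead of 2000-nt promoters.
example = io.StringIO("CABS1,ACGT\nMUC7,ttga\n")
print(read_sequences(example, expected_length=4))
```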

Comparison algorithm

The program currently defines a pairwise similarity score, normalized to the range 0 to 1, between two nucleotide sequences (strings) of the same size (N = 2000 nucleotides). An anchoring sequence, X, is identified and aligned to a second sequence, Y (the comparator sequence). When the two 2000-nucleotide sequences are aligned end to end with no offset, we call this position 1 (i = 1). At position 1, the program finds the longest string of nucleotides within the anchoring sequence that corresponds exactly to a segment in the comparator sequence. A match of 1 nucleotide corresponds to a value of 1 and each subsequent matching nucleotide adds 1 point. When a mismatch is encountered, the comparator sequence Y is shifted left by one nucleotide (i.e., position 1 of the anchoring sequence is compared with position 2 of the comparator sequence) and another comparison is made at this second position. Importantly, SEQSIM continues to compare sequences downstream of mismatched nucleotides. This process repeats until 2000 comparisons have been made between the anchor sequence X and the entire comparator sequence Y, creating a row of data that records the length of the segment in the comparator sequence that exactly matches a segment in the anchor sequence. The process is then repeated for positions 2 to 2000 of the anchoring sequence to generate 1999 more rows of data, giving 4,000,000 data points (2000 × 2000) for further processing (see Fig. 2 and Supplementary Fig. 1 for a more comprehensive example).

Fig. 2

Example used to develop the scoring algorithm used in our software. A Defines the variables and sequence notation used in our algorithm in a sample case study using the sequences X = ACTACT and Y = ATACTA. Variable L indicates the length of the longest matching substring of the comparator sequence minus one. The variable ‘i’ indicates the position of the first nucleotide of the first sequence (the reference sequence) in the notation L(X,i,Y). Position 1 on Sequence X and Y is indicated by the black arrow. B The largest substring match, highlighted in red, in Sequence Y when compared to each position, i, in Sequence X minus 1, or L(X,i,Y). A weighting function is applied, squaring the maximum L(X,i,Y) and summing the values. C The largest L(Y,i,X) with the same weighting function applied. D The values calculated in Panels (B) and (C) substituted into the overall scoring function (S) used in our algorithm, which takes a value between 0 and 1

In the resulting array, the maximum value generated from each positional comparison on the anchor sequence is then carried into our scoring algorithm (see Supplementary Fig. 1 for the detailed functions). The scoring function assigns a value between 0 and 1, which can then be converted into a percentage. Higher scores are assigned to sequences that share long segments of matching nucleotides.
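The per-position maxima and the squared weighting can be sketched in Python as follows. This is an illustrative reading of the description and the Fig. 2 caption, not the SEQSIM source code. It assumes that the maximum over all shifts at position i equals the longest prefix of X starting at i that occurs anywhere in Y. The final normalization (dividing by the self-comparison sums) is purely an assumption made here to obtain a value between 0 and 1; the actual scoring function S is given only in Supplementary Fig. 1. All function names are hypothetical.

```python
def longest_match_minus_one(x, y):
    """L(X,i,Y) for every position i of x.

    For each start position in x, find the longest run beginning there
    that occurs exactly somewhere in y, then subtract one (floored at 0).
    Naive implementation for clarity, not speed.
    """
    values = []
    for i in range(len(x)):
        length = 0
        while i + length < len(x) and x[i:i + length + 1] in y:
            length += 1
        values.append(max(length - 1, 0))
    return values


def weighted_sum(x, y):
    """Square each L(X,i,Y) and sum, so long matches dominate the total."""
    return sum(v * v for v in longest_match_minus_one(x, y))


def similarity(x, y):
    """ASSUMED normalization: both directions over the self-comparisons."""
    denominator = weighted_sum(x, x) + weighted_sum(y, y)
    if denominator == 0:
        return 1.0 if x == y else 0.0
    return (weighted_sum(x, y) + weighted_sum(y, x)) / denominator


x, y = "ACTACT", "ATACTA"             # example sequences from Fig. 2
print(longest_match_minus_one(x, y))  # [3, 2, 3, 2, 1, 0]
print(weighted_sum(x, y), weighted_sum(y, x))  # 27 23
```

Under the assumed normalization, identical sequences score 1.0 and the Fig. 2 example pair scores 50/110, or about 0.45. These numbers illustrate the sketch only and are not taken from the figure.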

User interface

Currently, no graphical user interface is available for SEQSIM. However, SEQSIM can be executed and run as a stand-alone app using the provided code in integrated development environments such as Microsoft Visual Studio. More information and a sample of the code can be found in the Supplementary Materials, including a visual step-by-step guide for the program.

Cluster analysis

Cluster analysis of promoter similarity was performed using the open-source network visualization software Gephi (version 0.9.2). Data used in the analyses consisted of a set of nodes, the promoter gene IDs, and edges, their pairwise similarity scores. To visualize the network, Gephi’s continuous graph layout algorithm, ForceAtlas2, was used because of its ability to graph clusters in large datasets with high precision and speed. The algorithm uses a force-directed approach to lay out the nodes in 2D space, with nodes of higher similarity drawn closer together. ForceAtlas2 also uses a gravity function that keeps nodes clustered and prevents them from drifting apart [24].

Once the data was imported to Gephi, default settings for ForceAtlas2 were used, including a gravity value of 1.0. To generate images, the “Prevent Overlap” option under Behavior Alternatives was enabled. ForceAtlas2 was run until convergence was achieved. Nodes that did not appear to cluster were filtered using modularity filters, such as degree distribution. The filter range parameters displayed nodes that had greater than 10 connections (edges) to other nodes.
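The node and edge preparation can be sketched in Python as below. This is a hypothetical pre-processing step, not the procedure used in the study: the score threshold stands in for the unspecified "top 10%" cut-off, the degree filter is applied once outside Gephi rather than with Gephi's own filters, and the `Source,Target,Weight` header is the column layout Gephi's spreadsheet importer recognizes for edge tables.

```python
import csv
import io
from collections import Counter


def matrix_to_edges(names, scores, threshold):
    """Keep one (source, target, weight) edge per pair scoring >= threshold."""
    edges = []
    for a in range(len(names)):
        for b in range(a + 1, len(names)):   # upper triangle only
            if scores[a][b] >= threshold:
                edges.append((names[a], names[b], scores[a][b]))
    return edges


def filter_by_degree(edges, min_connections=10):
    """Drop nodes that do not have more than `min_connections` edges."""
    degree = Counter()
    for source, target, _ in edges:
        degree[source] += 1
        degree[target] += 1
    keep = {node for node, d in degree.items() if d > min_connections}
    return [e for e in edges if e[0] in keep and e[1] in keep]


def edges_to_csv(edges):
    """Serialize edges as a CSV edge table with Source,Target,Weight columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()
```

Keeping only the upper triangle of the symmetric matrix halves the number of edges written, which matters at the scale of a genome-wide comparison.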

Heatmap generation

Heatmaps were generated using conditional formatting in Microsoft Excel. Inputs for each heatmap consisted of a matrix of numerical values (the similarity scores, S) between promoters within a specific chromosome in sequential order. The color scale chosen represented the range of values within the data set, red being a high similarity score and white a low score.

Each chromosome matrix was imported into Excel and the conditional formatting tool was applied. Under the “highlight cell rules” option, a custom rule was created that utilized a 3-color scale. Depending on the chromosome, the color scale could be adjusted to match the specific data being analyzed.

Validation with Clustal Omega and multiple sequence alignment

The promoter sequences used in SEQSIM were exported into FASTA format. The resulting file was imported into the online server for Clustal Omega, a widely used tool for multiple sequence alignment (MSA). Due to the upload file size limitations of Clustal Omega, validations were conducted piecewise, 100 by 100 promoters per analysis. Clustal Omega analysis was conducted using the following settings: Sequence Type: DNA; Output Format: ClustalW with character counts; Dealign Input: no; MBED-LIKE clustering guide-tree: yes; MBED-like clustering iteration: yes; combined iterations: default(0); Max guide-tree: default; Max HMM iterations: default; Order: input; Distance matrix: no; Output guide tree: yes. The resulting Percent Identity Matrices generated using Clustal Omega were then visualized using the methods described in the Heatmap generation section. Clustal Omega heatmaps and SEQSIM heatmaps were then compared to each other and superimposed to identify similarities and obvious differences.

Alignments from Clustal Omega were downloaded in FASTA file format. Similar promoter matches to that of CABS1 or sequences of interest were visualized using NCBI’s Multiple Sequence Alignment Viewer 1.23.0 by uploading the resulting FASTA file to the server. The promoter of CABS1, our case study gene, was then set as the “anchor” so all other sequences were compared relative to CABS1’s promoter. The other sequences were also arranged from highest to lowest homology to the anchor. For clarity, the resulting alignment was colored green for matching nucleotide sequences and red for differences. Any large regions of interest were searched in NCBI’s nucleotide BLAST (Basic Local Alignment Search Tool) server. To visualize and analyze the extended 10,000 nucleotide alignments, we also used JalView Desktop, a tool used for in-depth sequence analysis [25].

Determination of homology island boundaries

To determine the boundaries of highly homologous regions, referred to as “islands”, between promoters and an anchor (reference) sequence, a systematic approach based on sequence alignment and homology scoring was used. The starting point of an “island” was defined as the position where ten consecutive nucleotides in the promoter sequence exactly matched the corresponding nucleotides in the anchor sequence.

A sliding window approach was used to compute a homology score across the sequences. Using a 5-nucleotide window, a score of 1 was assigned for an exact match and 0 for each mismatch, calculating the homology score as the number of matches divided by the window size (e.g., four matches in a 5-nucleotide window equals a score of 0.8). This window was shifted by one nucleotide along the entire comparison length.

When the homology score first reached 0, indicating no matches within that window, the nucleotide coordinate marking the end of the island was determined by subtracting the window size (5 nucleotides) from the coordinate of the first 0 score.
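The boundary rule can be illustrated with a short Python sketch. It assumes the two sequences are already aligned position by position without gaps, and that the "coordinate" of a window is the position of its last nucleotide, so that subtracting the window size gives the last position before the first all-mismatch window. Neither convention is stated in the text, and the function name is hypothetical.

```python
def find_island(anchor, other, seed=10, window=5):
    """Return (start, end) of the first homology island, 1-based inclusive.

    start: first position that begins `seed` consecutive exact matches.
    end:   coordinate of the first zero-score window (taken here as the
           window's last nucleotide) minus the window size.
    Returns None if no run of `seed` matches exists.
    """
    n = min(len(anchor), len(other))
    match = [anchor[k] == other[k] for k in range(n)]

    # Locate the seed: ten consecutive matching nucleotides by default.
    start = None
    for k in range(n - seed + 1):
        if all(match[k:k + seed]):
            start = k
            break
    if start is None:
        return None

    # Slide the window one nucleotide at a time from the island start.
    for w in range(start, n - window + 1):
        score = sum(match[w:w + window]) / window   # e.g. 4 of 5 -> 0.8
        if score == 0:
            window_coordinate = w + window          # 1-based last position
            return (start + 1, window_coordinate - window)
    return (start + 1, n)                           # island runs to the end


print(find_island("TTT" + "A" * 10 + "CCCCC", "GGG" + "A" * 10 + "GGGGG"))
# (4, 13)
```

Because the island ends only when a whole window scores 0, isolated mismatches inside an island do not terminate it under this rule.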

Results

SEQSIM generated score matrix and cluster map

SEQSIM generated a 57,064 × 57,064 matrix of similarity scores between the promoters of every human gene. Due to computational constraints, we only performed a Gephi ForceAtlas2 cluster analysis of approximately the top 10% most similar promoters of the complete matrix. The Louvain method for determining modularity gave a score of 0.840 and identified 41 communities, or clusters, in total (Fig. 3A–C). The largest cluster (cluster 1) had 402 promoter members and the smallest (cluster 41) just one member. The five largest clusters were significantly larger in size than the remaining cluster population. These five main clusters contained more than 100 promoters (nodes) each, adding up to 67.4% of all clustered promoters (Fig. 3B).

Fig. 3

Cluster analysis of the top 10% most similar promoters of the human genome. A Methodology pipeline for generating the cluster diagrams in Gephi 0.9.2. 10% of our 57,064 × 57,064 matrix was visualized using Gephi 0.9.2, an open-source data visualization software. Each promoter is depicted in the software as a circle, or a ‘node.’ When there is similarity between sequences, a line, also known as an ‘edge,’ is drawn between each sequence. The more similar the sequences, the shorter the edge, thus bringing two nodes closer together. For clarity, the edges were removed from the figure but the relationships between nodes and clusters remain intact. B Network diagram of the most similar promoters mined from GRCh38. Due to computational constraints, only 10% of the analyzed promoters can be displayed. From the analysis, 5 main clusters were discovered, colored green, pink, blue, black and orange. Several other smaller clusters (grey) were also observed. A zoomed view of cluster 3 shows the CABS1 promoter in red. Nodes are sized according to the number of other promoters to which they are similar. The higher the number of hits, the larger the node. CABS1, with its multiple connections to many other promoters, appears to be a large contributor to its cluster. The CABS1 cluster is the 3rd largest cluster in terms of node quantity; however, its nodes appear larger visually due to their multiple connections. C Characterization curve of clusters in the human genome. In total, 41 clusters were discovered with populations ranging from 1 to 402 nodes

The clustering algorithm, ForceAtlas2, also showed relationships among the clusters. The closer the clusters are together (i.e., the shorter the edges), the more similar their promoters. For example, genes within Cluster 3 and Cluster 4 have promoters more similar to each other than to those in Cluster 2. We found no correlation between SEQSIM clustering and chromosomal location, i.e., promoters in a single cluster could be from many chromosomes. However, it was interesting that some of the smaller, satellite clusters featured promoters of genes entirely from a single chromosome. Such clusters ranged in size from 1 to 32 promoter nodes. For example, all 32 members of the 12th largest cluster were promoters from chromosome 11, and all 17 members of the 23rd largest cluster were from chromosome 1. Conversely, the main clusters contained promoters from nearly every chromosome. Proportionally, the number of promoters from each chromosome in a cluster varied greatly.

The CABS1 promoter is located in Cluster 3, the third largest cluster, which comprises 385 members, 259 of which are directly connected to CABS1. The promoter for CABS1 is highly connected to many others within the cluster, which corresponds to its increased node size. These included many promoters for non-coding RNAs and pseudogenes. According to our algorithm, the promoter of CABS1 (red node in Fig. 3B) is highly similar to the promoters of the protein-coding genes VWCE, SPOCK1, THSD4, RNF39, and TMX2. These highly homologous promoters were validated in Clustal Omega, with homology scores of 78.70%, 81.99%, 82.89%, 96.54%, and 77.09%, respectively, to the CABS1 promoter. Interestingly, these five genes were inter-chromosomal, not found on chromosome 4 like CABS1 (see Table 1). Of the 385 promoters in the CABS1 cluster, only 20 were on chromosome 4 with CABS1, and of the 20, 8 were unannotated genes, 9 were pseudogenes, and 2 were long non-protein coding RNAs, leaving only one as a promoter for a protein-coding gene: integrin binding sialoprotein (IBSP).

Table 1.

Summary table of promoters of interest related to CABS1

Promoters of known genes in the Inter-Chromosomal and Intra-Chromosomal sections with high similarity to the CABS1 promoter. Inter-chromosomal promoters were found to be clustered with CABS1. Also in the table are SMR3A and SMR3B, genes of interest due to evolutionary history and potential functional parallels with SMR1 in rat and CABS1 in human (see Discussion). The promoters of these genes did not cluster with CABS1 and did not show high promoter similarity

We attempted to characterize each cluster more thoroughly by identifying potential functional similarities through bioinformatics databases such as the Database for Annotation, Visualization and Integrated Discovery (DAVID) [26, 27]. Through DAVID, we searched for biological processes, cellular components, molecular functions, pathways and more. However, when queried, no single, defining similarity was found in functional annotations within clusters. Interestingly, when clustered using DAVID, our SEQSIM clusters often fragmented into smaller sub-families. For example, the genes of 319/385 members within SEQSIM cluster 3 separated into 10 sub-families with enrichment scores as high as 2.33 based on a variety of annotations like molecular function, gene sequence features and biological processes.

Heatmap of adjacent promoters around case study promoter of CABS1 in chromosome 4

The heatmap was generated for 200 promoters of genes flanking the CABS1 gene locus. Promoters are shown in the same order as they appear in chromosome 4 (Fig. 4A). Within Chromosome 4, and within 50 genes on either side of CABS1, the promoters for LINC02562, MUC7 and UGT2B11 closely resemble the promoter of CABS1. Other matches included genes that have not yet been defined (i.e., LOC genes). Comparative validation using Clustal Omega (Fig. 4B) provided similar results. With Clustal Omega, two additional promoters, for the genes RNU4ATAC9P and OPRPN, showed a homology score greater than 40% within the 50 closest genes to CABS1. Interestingly, the latter, like CABS1, is associated with male reproductive function among other roles [28] (see Table 1).

Fig. 4

Analysis of the CABS1 neighborhood in Chromosome 4. A Partial heatmap with section breaks generated from our software (red). For the purposes of this figure, we have selectively omitted portions of the linear genome leaving a buffer of 1 promoter before and after a similar promoter to CABS1. The grey bars depict the omissions in the linear genome. The first break between IFITM3P1 and LOC105377267 is from nucleotide 66094526 to 68300322. The other two breaks are [69216529 to 70334981] and [70532743 to 75037055]. B The same region was validated with Clustal Omega. The omissions for Panel B are as follows: [66094526 to 66003201], [68350090 to 68300322], [70532743 to 72953850], and [72979588 to 75037055] and differ from Panel A due to new promoter similarities discovered during Clustal Omega validation. The CABS1 promoter showed high similarity to a few adjacent promoters for MUC7, RNU4ATAC9P, and LINC02562, as depicted by the more intense red or blue colour. The CABS1 promoter also showed similarity to more distal promoters of undefined genes (LOC genes). Validation with the less penalizing Clustal Omega algorithm revealed additional gene promoters with relatively high similarity, such as OPRPN and RNU4ATAC9P; conversely, SEQSIM highlighted some unique similarities, such as UGT2B11. The names of these genes are highlighted in green

The MSA of the CABS1 promoter to its homologous counterparts (LOC105377261, UGT2B11, MUC7, LINC02562, LOC107986230, RNU4ATAC9P and OPRPN) revealed large “islands” of 100% similar regions. In a BLAST analysis, these regions were determined to be portions of a long-interspersed nuclear element-1 (LINE-1) with a length of 10,031 nt. To fully investigate how much of the LINE-1 element was present in the promoter of CABS1 and the other associated promoters, we conducted an extended sequence analysis of 10,000 nt. Among these 7 promoters, CABS1 retained the largest segment of the LINE-1 element, specifically a segment of 5452 nt spanning open reading frames 1 and 2 of LINE-1 (Fig. 5).

Fig. 5

Sequence Alignment of Multiple Genes with LINE-1 element. A scale bar of 12,000 nucleotides (12 k nt) is shown above the aligned sequences. The LINE-1 element is at the top, and its open-reading frame locations are labelled. The grey boxes below the LINE-1 element represent islands of high homology discovered in a multiple sequence alignment between LINE-1 and the promoters of CABS1, LOC107986230, LINC02562, UGT2B11, LOC105377261, MUC7, and RNU4ATAC9P. For simplification purposes, only islands greater in length than 100 nt are shown. The OPRPN promoter did not reveal any large islands of homology with the same LINE-1 element and thus this promoter was not represented in the figure. Notably, the CABS1 promoter has the largest region of homology, with a 5452 nt region showing 99% homology to LINE-1

Interesting patterns in other chromosomes

While our primary focus was on the case study gene CABS1, we extended the comprehensive analysis to generate similarity heatmaps for every chromosome (beyond the scope of this manuscript). The heatmaps for each chromosome revealed intriguing patterns of similarity between adjacent promoters, which appear to correlate with loops observed in topologically associated domains (TAD).

Figure 6A shows a small section of the first 100 promoters on chromosome 1, with similarity scores shown as color intensity. Striking patterns of similarity in the first 100 promoters resembled a series of diagonal slashes (arrows in Fig. 6B). Additionally, we observed more intriguing patterns such as the checkerboard-like background (open arrowheads point at both the light and dark segments of the checkerboard) emphasized in Fig. 6B. These patterns, though distinct from the diagonal slashes, may inform encoded genomic interactions. In Fig. 6C (without the digital enhancement seen in Fig. 6B), distinct bright red clusters of similarity (i.e., very similar promoters) among adjacent promoters were also observed. One example is shown with the closed arrowhead in Fig. 6B. It is interesting to note that when highly similar adjacent promoters form clusters, other promoters around the islands are highly dissimilar. This is reminiscent of how TADs form physical loops in chromatin and interact at a higher frequency within the TAD versus adjacent TADs. These results highlight that genes that are looped into close proximity within their TADs also share similar promoters.

Fig. 6

Heatmap data visualization for the first 100 gene promoters of Chromosome 1. A Gene view of the first ~ 600 kb of Chromosome 1 and a segment included in our heatmaps. B Overemphasized graphical representation of the patterns that can be extracted from the heatmap depicted in Fig. 6C. The black arrowhead indicates one homology cluster of interest due to the overall pattern created in the map. The white arrowheads indicate a darker and lighter red area observed in the first 100 genes. The black arrows show the diagonal slashes observed in the heatmap. The darker red indicates higher similarity in the promoters while the light red area depicts less similarity to adjacent promoters. White areas indicate little to no similarity. C The actual heatmap without digital enhancement generated using our novel comparison software and algorithm

Discussion

It is widely accepted that promoters are important in conditions like cancer, diabetes and neurodegenerative disorders [29–32]. These promoters, when changed, can alter the start of gene transcription and consequently gene expression significantly. Our study provides new insight into the number of promoter homology clusters in the human genome. Importantly, within the first 100 genes of chromosome 1, we discovered a pattern of promoter homology not previously seen, in which genes share promoter homology with a limited number of adjacent genes. This homology results in a checkerboard pattern of similarity observed in heatmaps (see Fig. 6B, C) which appears to repeat itself throughout the genome.

Statistical analysis of modularity revealed 41 clusters of promoters based on sequence similarity, with 67.4% of linked promoters appearing in the main 5 clusters (Fig. 3B). This could indicate five main “classes” of promoters, with several smaller sub-types representing the other 36 clusters (Fig. 3C). Gagniuc et al. previously categorized promoters into ten generic classes based on sequence features and associated regulatory proteins. Our results agree with these observations in that there are indeed different categories or classes of promoters [6]. Our findings also help complete the overall picture by including all human promoters in the analysis and expanding the number of promoter nucleotides in the study.

Interestingly, when using the DAVID database to explore potential functional similarities within clusters, no overarching functional pattern emerged. Instead, our SEQSIM clusters fragmented into smaller sub-families, such as cluster 3, which split into 10 sub-groups with varying degrees of functional enrichment. This fragmentation suggests that promoters grouped by sequence similarity may not always share unified biological roles. Future studies could benefit from further exploration of these sub-families to better understand their regulatory significance. Integrating additional bioinformatics tools and refining clustering methodologies might help identify more subtle relationships and functional nuances within promoter groups.

The examination of promoter sequence homology is essential in elucidating regulatory mechanisms and expression patterns for genes. Rhoads and McIntosh showed that genes like the salicylic acid-inducible alternative oxidase gene aox1 and those involved in disease resistance share similar sequences in their promoters [33]. Thus, these genes might be controlled in similar ways or have evolved together, which is important for determining how genes respond to environmental changes or stress. Likewise, another study focused on a gene in Brassica napus pollen and identified several genes in different species with homologous sequences in their promoter regions that potentially correspond to similar regulatory roles. Motif-discovery methods utilizing expectation maximization and Gibbs sampling strategies are widely regarded as the gold standard for identifying gene regulatory elements within the promoter [34]. These approaches excel at constructing MSAs of short motifs commonly shared among promoters of co-regulated genes and have been foundational in promoter analysis [35].

However, SEQSIM addresses a complementary gap by focusing on large-scale promoter homology that extends beyond the localized identification of similar short motifs. This large scale allows for the detection of broader homology patterns and long-range regulatory elements that might otherwise go unnoticed. While motif-discovery tools remain invaluable for pinpointing conserved, short regulatory sequences, SEQSIM provides a genome-wide perspective, identifying clusters of promoter homology and uncovering potential co-regulatory networks. The integration of both approaches could yield a more comprehensive understanding of transcriptional regulation, offering deeper insights into both localized and large-scale regulatory mechanisms.

Through validation, our study also compares SEQSIM to other sequence similarity software, such as Clustal Omega and alignment-free (AF) methods. While Clustal Omega is traditionally designed for global MSAs, it also provides distance calculation methods through percent identity matrices, aligning with SEQSIM’s focus on evaluating sequence similarity. Although Clustal Omega may not be the most representative choice for pairwise comparisons, it was a useful reference point to assess the robustness of SEQSIM’s similarity scoring. SEQSIM produced heatmaps similar to the percent identity matrix output of Clustal Omega (see Fig. 4) and to the output of more representative AF methods.
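To make the idea of a pairwise percent identity matrix concrete, the following is a minimal sketch of textbook Needleman-Wunsch global alignment [14] followed by an identity calculation. It is not SEQSIM's optimized implementation, nor Clustal Omega's: the scoring values (match +1, mismatch -1, gap -1), the choice of alignment length as the denominator, and the toy sequence names are all assumptions made for illustration.

```python
from itertools import combinations


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return the two gapped strings.

    Scoring values are illustrative assumptions, not SEQSIM's parameters.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical aligned columns as a percentage of alignment length."""
    al_a, al_b = needleman_wunsch(a, b)
    if not al_a:
        return 0.0
    matches = sum(x == y and x != "-" for x, y in zip(al_a, al_b))
    return 100.0 * matches / len(al_a)


def identity_matrix(seqs):
    """Symmetric percent identity matrix for a dict of name -> sequence."""
    names = list(seqs)
    matrix = {p: {q: 100.0 for q in names} for p in names}
    for p, q in combinations(names, 2):
        matrix[p][q] = matrix[q][p] = percent_identity(seqs[p], seqs[q])
    return matrix


# Hypothetical toy "promoters"; real promoter sequences are far longer.
toy = {"p1": "ACGTACGT", "p2": "ACGTTCGT", "p3": "TTTTACGA"}
matrix = identity_matrix(toy)
```

A matrix of this kind can be rendered directly as a heatmap. The quadratic time and memory cost of the dynamic programming table is one reason genome-wide comparisons require an optimized implementation.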

AF methods like Feature Frequency Profile (FFP), Chaos Game Representation (CGR), and Alfree, as detailed by Zielezinski et al. [15], offer alternatives to alignment-based approaches by leveraging k-mer frequencies or sequence embeddings. Although these methods are efficient, many struggle with computational limitations and highly divergent sequences, and require careful parameter tuning. For instance, tools like MatGAT, which can generate pairwise comparison matrices based on scoring matrices like PAM and BLOSUM without the need for pre-alignment, are most similar to SEQSIM but are limited by the maximum quantity of input sequences [36]. Despite these limitations, an analysis within MatGAT’s capabilities of the first 100 promoters in chromosome 1 resulted in a similarity matrix exhibiting patterns similar to those of SEQSIM and Clustal Omega (see Supplementary Fig. 3). Other algorithms, such as Mash, also exhibited a checkerboard pattern like SEQSIM, but lacked the distinct “slash” marks observed in the other three methods. As the existing tools are unable to conduct full-genome analysis, comparing computational time is difficult. Given these constraints, there is a growing need for scalable methods that can detect homology in full genome sequences. Unlike tools such as BLASTN, which rely on matches to known or indexed sequences, SEQSIM can identify novel sequence similarities in promoter regions, including those not represented in existing reference databases. A comparison of SEQSIM to other methods, AF and otherwise, is summarized in Supplementary Table 1. Given its unique approach and scalability, evaluating SEQSIM’s performance in benchmark studies like those by Zielezinski et al. would be a valuable next step [37].
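As a rough illustration of how k-mer based AF comparison works, the sketch below builds normalized k-mer frequency profiles and compares them with a Euclidean distance. This is a simplified stand-in and not the published FFP, CGR, Alfree, or Mash procedures; the function names, the default k = 3, and the distance measure are assumptions chosen for brevity.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=3):
    """Relative frequency of every overlapping k-mer in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles.

    The distance measure is an illustrative assumption; published AF
    tools use a variety of measures.
    """
    keys = set(p) | set(q)
    return sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))


# Hypothetical toy sequences.
d_same = profile_distance(kmer_profile("ACGTACGT"), kmer_profile("ACGTACGT"))
d_diff = profile_distance(kmer_profile("AAAA", k=2), kmer_profile("CCCC", k=2))
```

Because no alignment is computed, the cost grows with sequence length rather than with the product of two lengths, but the result depends heavily on k, which is one form of the parameter tuning mentioned above.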

Beyond comparing SEQSIM against existing methods, we leveraged its capabilities to explore biological questions in the hope of deepening our understanding of specific gene regulation. One such application was investigating the promoter landscape of CABS1, a potential biomarker of stress we have studied over the years. Although we have found CABS1 expression in several tissues [17, 38], its role in biology is not fully understood. We used the SEQSIM dataset to gain further insight into the biology of CABS1. The CABS1 promoter was in the third largest cluster, connected to 259 different promoters. This discovery implies that CABS1 transcriptional initiation is regulated similarly to many other human genes. Interestingly, many of these highly associated promoters belong to non-coding RNA genes, pseudogenes, and protein-coding genes. SPOCK1 transcripts, for example, are also widely expressed in the human body with particularly high expression levels in testis, much like CABS1 [39]. The biological functions of the 385 members of cluster 3, catalogued using public databases such as GeneCards, were heterogeneous and yielded little concrete insight. For example, VWCE and SPOCK1 are both protein-coding genes whose promoters are clustered with CABS1. However, their functions vary: SPOCK1 primarily regulates extracellular matrix remodeling and cell migration, whereas VWCE supports extracellular matrix structural integrity and vascular development [40]. In the future, studies could be conducted to determine the likelihood that genes within the same promoter cluster are co-expressed. Genes that are close and clustered together tend to be co-expressed [41]. However, our findings imply that factors beyond physical proximity, such as sequence similarity and shared control signals that extend across the genome, could influence gene coregulation and co-expression. For example, these long-range interactions could be a complex network of transcription factors that interact with similar sequences in each of the genes mentioned.

We also observed similarities between promoters on the same chromosome as CABS1. This proximity-associated similarity could imply a localized regulatory mechanism or co-evolution of functionally related genes. CABS1 is primarily expressed in testicular tissues and is thought to be involved in the late stages of spermatogenesis, and it also has been found in human saliva [17]. Notably, MUC7 (mucin-7, secreted) is a member of the mucin family of proteins which is specifically known for its role in saliva, contributing to innate defense in the oral cavity. While other neighboring genes lack distinct tissue or fluid expression patterns similar to CABS1, their chromosomal proximity and high promoter similarity to CABS1 could indicate critical regulatory interactions. LINC02562 (long intergenic non-protein coding RNA 2562) is part of a class of RNAs that, while non-coding, can influence gene regulation. Long non-coding RNAs are involved in diverse biological processes, including chromatin remodeling, transcriptional control and post-transcriptional modifications [42, 43]. Similarly, the small nuclear RNA U4atac Pseudogene 9 (RNU4ATAC9P) may be involved in minor spliceosome function. Although pseudogenes were traditionally thought to be non-functional, recent studies have suggested that pseudogenes, and types of pseudogenes like retrotransposons, may play a regulatory role in gene expression [43, 44].

CABS1 research is intertwined with studies on the rat gene Vcsa1 (absent in humans) and its protein, submandibular rat 1 (SMR1), which have significant roles in inflammation, shock response, and neuroprotection [16]. Specifically, human CABS1 shares a similar amino acid sequence near the carboxyl terminal (TDIFELL) with rat SMR1 (TDIFEGG), and both sequences have anti-inflammatory activities [16]. On chromosome 4, CABS1 is surrounded by other genes with activities similar to rat Vcsa1, so it was interesting that the promoter of CABS1 did not have high similarity to those of the related adjacent genes SMR3A or SMR3B, which are similarly associated with male reproductive function [45–47]. Dissimilar promoters in genes with potentially similar biological activities could be a result of evolutionary divergence and the need for tissue-specific expression. This diversity may help the same or similar functions to be carried out in different tissues, developmental stages, or environmental conditions.

In the CABS1 case study, we also discovered that the high scores generated by SEQSIM highlighted truncated LINE-1 fragments [48, 49] in the promoters of associated genes. Transposable elements (TEs) are DNA sequences capable of moving or “jumping” within the genome. In all cases, the LINE-1 element sequence was highly truncated. We were surprised to discover that the surviving LINE-1 regions were of similar size and derived from the same general region of the original sequence. These truncated sequences may play a similar role for several genes. The CABS1 promoter has extensive homology with LINE-1, retaining a large segment of this element, especially within the LINE-1 open reading frame 2 (ORF2). ORF2 typically codes for a protein with endonuclease and reverse transcriptase activity, which is essential for LINE-1’s autonomous retro-transposition. However, RNA data from GenBank revealed no transcripts in these regions. We suspect that transcripts cannot be found in this region due to the extensive redundancy of these ORFs throughout the genome, which makes it challenging to trace any specific ORF as the transcript source. Although there is no confirmation of transcripts, functional RNA or protein could still originate from the CABS1 promoter region. These retained sequences also seem to be biased towards the gene transcription start site. The presence of such a significant LINE-1 segment within the promoter of CABS1 and its neighboring genes could have multiple implications. LINE-1 elements are known to influence genomic architecture and function, potentially affecting gene expression through regulatory sequences or by altering chromatin states [50, 51]. TEs can lead to the formation of new chromatin loops and domains that affect the proximity of regulatory elements to target genes, impacting transcription and genomic structure [52]. Distinct TE subfamilies that function as tissue-specific enhancers in colon and liver cancers have been previously identified [53]. These TEs are characterized by genomic features associated with active enhancers, such as epigenetic marks and transcription factor binding, and are associated with differentially expressed genes in these types of cancer. The presence of the large islands of identical sequences in the multiple sequence alignments suggests that these features are evolutionarily conserved in the promoter and may hold functional importance. Variations or mutations in these conserved regions may cause gene dysregulation and abnormal phenotypes. A potential future application using data from SEQSIM could be the comparative analysis of promoters among individuals or populations to identify genetic variants associated with increased susceptibility to dysregulation. It would also be interesting to study whether the TEs responsible for these large islands of homology are conserved across species, and whether they retained their historical regulatory functions or acquired new roles over evolution, such as altering 3D chromatin shape [54, 55].

Our pilot analysis, which extended beyond chromosome 4, revealed intriguing patterns of promoter similarity across other chromosomes. A potential explanation for the similarity patterns is the presence of transposable elements, which tend to propagate within nearby genomic regions, contributing to structural patterns [56]. The diagonal slashes and checkerboard-like patterns observed in Fig. 6 indicate that such repetition may not be coincidental but may instead reflect an underlying genomic structure.

The role of promoters in chromosomal architecture, such as TADs or chromosomal territories that facilitate specific gene interactions, remains poorly understood. The observed clusters of highly similar promoters, interspersed with lower-similarity regions, resemble chromatin interaction domains, such as TADs. While the functional significance of these patterns remains to be elucidated, the similarity patterns could indicate distinct chromosomal neighborhoods where genes share similar chromatin environments, regulatory accessibility or proximity to nuclear compartments influencing gene expression. For example, 3D chromatin architecture in T-cells enables binding of regulatory factors like CCCTC-binding factor (CTCF) and special AT-rich sequence-binding protein 1 (SATB1), which influence gene expression [57]. The identified promoter patterns could indicate discrete functional regions within a chromosome, where neighboring genes work together to fulfill specific cellular functions, similar to TADs. Co-regulated gene clusters exist in yeast, such as those involved in ribosome biogenesis, suggesting spatial arrangement supports coordinated expression [58, 59]. Adjacent genes within the same TAD may share similar chromatin states, including histone modifications, DNA methylation patterns, or nucleosome occupancy [60]. Visualization of promoter relationships complements Hi-C data by highlighting interactions that might be transient or filtered out due to resolution constraints. For example, interleukin genes IL-4, IL-5, and IL-13 are known to reside within the same TAD, where chromatin state changes drive a TH2 immune response [61]. SEQSIM not only captures these known relationships but also reveals additional promoter connections with genes such as RAD50 and TH2-LCR (Supplementary Fig. 2), which were not evident in existing Hi-C data [62].

While SEQSIM effectively identifies sequence homology across promoter regions, it is important to acknowledge a limitation inherent to both SEQSIM and other pairwise comparison methods: the potential for repetitive elements, such as transposable elements, to drive apparent similarities. In our case study of CABS1, we observed that the homology cluster includes a truncated sequence of the LINE-1 element, a widely distributed transposable element. Full-length and truncated copies of the LINE-1 element, primarily the 3’ end, comprise almost 20% of the human genome [63–65]. This raises valid concerns about the biological relevance of these repetitive elements in gene regulation, particularly in tissue- and time-specific contexts.

Although some transposable elements have been shown to play regulatory roles, including influencing chromatin architecture and acting as enhancers in specific tissues [66], it is unlikely that all repetitive elements contribute directly to gene-specific regulation. The presence of LINE-1 sequences in promoter regions might reflect evolutionary remnants rather than active regulatory elements in each context.

To address this concern, future iterations of SEQSIM analyses could incorporate sequence masking techniques to filter out common repetitive elements before comparison. This approach would help clarify whether the observed homology is driven by functional regulatory elements or by shared repetitive sequences. Alternatively, highlighting transposable elements specifically to facilitate their detection may also enhance the analysis. Furthermore, integrating expression data or chromatin accessibility profiles could provide additional evidence to support the functional significance of identified promoter similarities. Beyond these approaches, follow-up analyses using functional assays, transcriptomics, or chromatin conformation capture could further assess co-regulation, though this is beyond the scope of the current study.
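One way such masking could be prototyped is sketched below. It assumes the input sequences are already soft-masked, meaning repeats are written in lowercase, as in many distributed genome assemblies; that input convention, the helper names, and the ungapped position-wise comparison are assumptions for illustration and are not part of SEQSIM.

```python
def hard_mask(seq, mask_char="N"):
    """Replace soft-masked (lowercase) bases with mask_char."""
    return "".join(mask_char if c.islower() else c for c in seq)


def masked_fraction(seq):
    """Fraction of the sequence flagged as repeat (lowercase)."""
    if not seq:
        return 0.0
    return sum(c.islower() for c in seq) / len(seq)


def identity_ignoring_masked(a, b, mask_char="N"):
    """Position-wise identity over columns where neither base is masked.

    Assumes ungapped sequences compared position by position, a
    simplification. Returns None when every column is masked.
    """
    pairs = [(x, y) for x, y in zip(a, b)
             if x != mask_char and y != mask_char]
    if not pairs:
        return None
    return sum(x == y for x, y in pairs) / len(pairs)


# Hypothetical soft-masked toy promoters; lowercase marks a repeat.
p1 = hard_mask("ACgtAC")
p2 = hard_mask("ACgtAT")
score = identity_ignoring_masked(p1, p2)
```

Comparing scores computed with and without masking would indicate how much of a given promoter similarity is attributable to annotated repeats.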

While SEQSIM offers significant advantages in large-scale sequence comparisons, it also has notable limitations. The algorithm applies a strict scoring function, which prioritizes long, uninterrupted matching sequences, potentially underestimating biologically relevant short motifs with regulatory significance. For example, Clustal Omega scored RNF3 at 96% sequence similarity to the CABS1 promoter region, whereas SEQSIM’s score was 4%. This stringent approach may limit its sensitivity in detecting weak yet functionally important sequence similarities, such as conserved transcription factor binding sites that span multiple promoters. Additionally, SEQSIM does not currently account for indels (insertions or deletions) or sequence rearrangements, which could be relevant for detecting structural variations associated with gene regulation. Another constraint is the lack of an integrated method for masking repetitive elements, meaning that shared sequences detected in homology clusters may be driven by highly abundant genomic repeats rather than functional regulatory elements. However, given the emerging evidence that repetitive elements, such as LINE-1, may play roles in gene regulation, chromatin structure, and enhancer activity, it could be valuable to modify SEQSIM to specifically analyze these elements. A targeted approach could help assess their conservation, sequence decay rates, and potential regulatory functions across different genomic contexts. Future iterations of SEQSIM could incorporate adaptive scoring methods, improved filtering for repetitive sequences, and enhanced clustering algorithms to detect meaningful regulatory patterns. Currently, our SEQSIM analysis has only investigated about 10% of the sequence homology data, as we do not have a method to conduct a full cluster analysis on the entire genome. We plan to explore artificial intelligence techniques and other iterative computational approaches to enable more complete processing of our data. Additionally, further research utilizing complementary techniques, such as functional genomics, chromatin conformation capture assays, and gene perturbation studies, can shed light on the functional implications of observed patterns and provide a deeper understanding of the intricate organization and regulation of genes within chromosomes.
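The gap between an identity-style score and a score that rewards long uninterrupted matches can be illustrated with a toy comparison. The run-weighted score below is purely hypothetical and is not SEQSIM's scoring function; it simply squares the length of each uninterrupted run of matches so that scattered mismatches are penalized far more heavily than under plain identity.

```python
def match_runs(a, b):
    """Lengths of maximal runs of consecutive identical positions."""
    runs, current = [], 0
    for x, y in zip(a, b):
        if x == y:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def plain_identity(a, b):
    """Fraction of identical positions (ungapped, position-wise)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def run_weighted_score(a, b):
    """Hypothetical strict score: sum of squared run lengths over n^2.

    Equals 1.0 for identical sequences and drops quickly when matches
    are interrupted. Illustrative only.
    """
    n = min(len(a), len(b))
    return sum(r * r for r in match_runs(a, b)) / (n * n) if n else 0.0


# Two toy sequences that differ at every fifth position.
s1 = "ACGTA" * 4
s2 = "ACGTC" * 4
identity = plain_identity(s1, s2)
strict = run_weighted_score(s1, s2)
```

In this toy case the sequences are 80% identical by position, yet the run-weighted score is 16%, because no matching run is longer than four bases. A scoring scheme of this general character would therefore rank a pair low even when an alignment-based identity is high.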

Conclusion

Our research demonstrates the utility of SEQSIM for comprehensive analysis and comparison of promoters across the human genome. Our novel approach revealed distinct promoter similarity clusters, indicating potential co-regulation, unique gene sets associated with different biological pathways, and the potential presence of evolutionarily conserved regions. Notably, identical sequence segments may be critical regulatory elements or TEs, implying an intricate interplay of genomic elements in gene regulation.

While computational constraints limited the depth of our analysis, the observed similarity patterns among adjacent gene promoters have sparked new hypotheses about co-regulation, chromosomal organization, and evolutionary significance. Future experimental validation and the use of additional techniques will shed light on these observations, elucidating the complexities of genome organization and regulation, thereby enriching our understanding of human health and disease.

Supplementary Information

Additional file 1. (26.9MB, tif)
Additional file 2. (170.4KB, jpg)
Additional file 3. (409.6KB, jpg)
Additional file 4. (259.5KB, pdf)

Acknowledgements

We thank Gilbert Lee for providing technical support for this project.

Author contributions

JRLS: conceptualization, data curation, formal analysis, investigation, methodology, validation, visualization, figure creation, writing—original draft, writing—review and editing. WS: conceptualization, data curation, formal analysis, methodology, software, visualization, writing—review and editing. DB: conceptualization, writing—review & editing. MM: conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, project administration, resources, software, supervision, validation, visualization, writing—review and editing.

Funding

This work was supported by the Natural Sciences and Engineering Research Council of Canada [RGPIN-2020-04553].

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. SEQSIM is freely available, implemented in C++ for Linux and Windows platforms. All files are included in the supplementary documentation or downloadable at this link https://sites.ualberta.ca/~joyramie/SEQSIM.zip. Additional resources, including a user guide and supplementary data, are downloadable with this publication or at https://sites.ualberta.ca/~joyramie/SEQSIM.html.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Joy Ramielle L. Santos and Marcelo Marcet-Palacios have contributed equally to this work.

References

  • 1.Weake VM, Workman JL. Inducible gene expression: diverse regulatory mechanisms. Nat Rev Genet. 2010;11(6):426–37. [DOI] [PubMed] [Google Scholar]
  • 2.Maston GA, Evans SK, Green MR. Transcriptional regulatory elements in the human genome. Annu Rev Genomics Hum Genet. 2006;7:29–59. [DOI] [PubMed] [Google Scholar]
  • 3.Holwerda SJB, de Laat W. CTCF: the protein, the binding partners, the binding sites and their chromatin loops. Philos Trans R Soc B Biol Sci. 2013;368(1620):20120369. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.De Jesús TJ, Ramakrishnan P. NF-κB c-Rel dictates the inflammatory threshold by acting as a transcriptional repressor. iScience. 2020;23(3): 100876. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Martone R, Euskirchen G, Bertone P, Hartman S, Royce TE, Luscombe NM, et al. Distribution of NF-κB-binding sites across human chromosome 22. Proc Natl Acad Sci. 2003;100(21):12247–52. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Gagniuc P, Ionescu-Tirgoviste C. Eukaryotic genomes may exhibit up to 10 generic classes of gene promoters. BMC Genomics. 2012;28(13):512. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Yamamoto YY, Yoshioka Y, Hyakumachi M, Obokata J. Characteristics of core promoter types with respect to gene structure and expression in arabidopsis thaliana. DNA Res Int J Rapid Publ Rep Genes Genomes. 2011;18(5):333–42. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Dineen DG, Wilm A, Cunningham P, Higgins DG. High DNA melting temperature predicts transcription start site location in human and mouse. Nucleic Acids Res. 2009;37(22):7360–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Yamamoto YY, Ichida H, Abe T, Suzuki Y, Sugano S, Obokata J. Differentiation of core promoter architecture between plants and mammals revealed by by LDSS analysis. Nucleic Acids Res. 2007;35(18):6219–26. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Kanhere A, Bansal M. Structural properties of promoters: similarities and differences between prokaryotes and eukaryotes. Nucleic Acids Res. 2005;33(10):3165–75. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Florquin K, Saeys Y, Degroeve S, Rouzé P, Van de Peer Y. Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. 2005;33(13):4255–64. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Fukue Y, Sumida N, Nishikawa J, Ohyama T. Core promoter elements of eukaryotic genes have a highly distinctive mechanical property. Nucleic Acids Res. 2004;32(19):5834–40. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Ye L, Qian Q, Zhang Y, You Z, Che J, Song J, et al. Analysis of the Sericin1 promoter and assisted detection of exogenous gene expression efficiency in the silkworm Bombyx Mori L. Sci Rep. 2015;6(5):8301. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48(3):443–53. [DOI] [PubMed] [Google Scholar]
  • 15.Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. 2017;18(1):186. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.St Laurent CD, St Laurent KE, Mathison RD, Befus AD. Calcium-binding protein, spermatid-specific 1 is expressed in human salivary glands and contains an anti-inflammatory motif. Am J Physiol Regul Integr Comp Physiol. 2015;308(7):R569-575. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Reyes-Serratos E, Santos JRL, Puttagunta L, Lewis S, Watanabe M, Gonshor A, et al. Identification and characterization of calcium binding protein, spermatid associated 1 (CABS1) in selected human tissues and fluids. bBbioRxiv. 2023. 10.1101/2023.07.21.550040v1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Zhang X, Zhou W, Zhang P, Gao F, Zhao X, Shum WW, et al. Cabs1 maintains structural integrity of mouse sperm flagella during epididymal transit of sperm. Int J Mol Sci. 2021;22(2):652. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Huang YL, Fu Q, Pan H, Chen FM, Zhao XL, Wang HJ, et al. Spermatogenesis-associated proteins at different developmental stages of buffalo testicular seminiferous tubules identified by comparative proteomic analysis. Proteomics. 2016;16(14):2005–18. [DOI] [PubMed] [Google Scholar]
  • 20.Shawki HH, Kigoshi T, Katoh Y, Matsuda M, Ugboma CM, Takahashi S, et al. Identification, localization, and functional analysis of the homologues of mouse Cabs1 protein in porcine testis. Exp Anim. 2016;65(3):253–65. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Kawashima A, Osman BAH, Takashima M, Kikuchi A, Kohchi S, Satoh E, et al. Cabs1 is a novel calcium-binding protein specifically expressed in elongate spermatids of mice. Biol Reprod. 2009;80(6):1293–304. [DOI] [PubMed] [Google Scholar]
  • 22.Marcet-Palacios M, Reyes-Serratos E, Gonshor A, Buck R, Lacy P, Befus AD. Structural and posttranslational analysis of human calcium-binding protein, spermatid-associated 1. J Cell Biochem. 2020;121(12):4945–58. [DOI] [PubMed] [Google Scholar]
  • 23.Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly. National Center for Biotechnology Information; Available from: https://www.ncbi.nlm.nih.gov/nuccore/NC_000001
  • 24.Jacomy M, Venturini T, Heymann S, Bastian M. ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PLoS One. 2014;9(6): e98679. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ. Jalview version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009;25(9):1189–91. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44–57. [DOI] [PubMed] [Google Scholar]
  • 27.Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 2022;50(W1):W216–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.OPRPN opiorphin prepropeptide [Homo sapiens (human)]. National Center for Biotechnology Information; 58503. Available from: https://www.ncbi.nlm.nih.gov/gene/58503#summary
  • 29.Davidson EH, Levine MS. Properties of developmental gene regulatory networks. Proc Natl Acad Sci. 2008;105(51):20063–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.D hr S. Linking disease-associated genes to regulatory networks via promoter organization. Nucleic Acids Res. 2005;33(3):864–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Ionescu-Tîrgovişte C, Gagniuc PA, Guja C. Structural properties of gene promoters highlight more than two phenotypes of diabetes. PLoS ONE. 2015;10(9): e0137950. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Lee TI, Young RA. Transcriptional regulation and its misregulation in disease. Cell. 2013;152(6):1237–51. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Rhoads DM, McIntosh L. The salicylic acid-inducible alternative oxidase gene Aox1 and genes encoding pathogenesis-related proteins share regions of sequence similarity in their promoters. Plant Mol Biol. 1993;21(4):615–24. [DOI] [PubMed] [Google Scholar]
  • 34.Das MK, Dai HK. A survey of DNA motif finding algorithms. BMC Bioinform. 2007;8(7):S21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Albani D, Altosaar I, Arnison PG, Fabijanski SF. A gene showing sequence similarity to pectin esterase is specifically expressed in developing pollen of brassica napus. Sequences in its 5′ flanking region are conserved in other pollen-specific promoters. Plant Mol Biol. 1991;16(4):501–13. [DOI] [PubMed] [Google Scholar]
  • 36.Campanella JJ, Bitincka L, Smalley J. MatGAT: an application that generates similarity/identity matrices using protein or DNA sequences. BMC Bioinform. 2003;4(1):29. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, et al. Benchmarking of alignment-free sequence comparison methods. Genome Biol. 2019;20(1):144. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Serratos EAR. The perplexity of calcium-binding protein, spermatid-associated 1 (CABS1): a molecule that despite its name, is present beyond the reproductive tract, with ties to stress, and possessing an anti-inflammatory domain only preserved in simians.
  • 39.SPOCK1 2 sparc (osteonectin), Cwcv and kazal like domains proteoglycan 1 [Internet]. GeneCards - The Human Gene Database; Available from: https://www.genecards.org/cgi-bin/carddisp.pl?gene=SPOCK1
  • 40.Stelzer G, Rosen N, Plaschkes I, Zimmerman S, Twik M, Fishilevich S, et al. The GeneCards suite: from gene data mining to disease genome sequence analyses. Curr Protoc Bioinforma. 2016;54:1.30.1-1.30.33. [DOI] [PubMed] [Google Scholar]
  • 41.Singer GAC, Lloyd AT, Huminiecki LB, Wolfe KH. Clusters of co-expressed genes in mammalian genomes are conserved by natural selection. Mol Biol Evol. 2005;22(3):767–75. [DOI] [PubMed] [Google Scholar]
  • 42.Ismail NH, Mussa A, Al-Khreisat MJ, Mohamed Yusoff S, Husin A, Al-Jamal HAN, et al. Dysregulation of non-coding RNAs: roles of miRNAs and lncRNAs in the pathogenesis of multiple myeloma. Non-Coding RNA. 2023;9(6):68. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Zhang HB, Hu Y, Deng JL, Fang GY, Zeng Y. Insights into the involvement of long non-coding RNAs in doxorubicin resistance of cancer. Front Pharmacol. 2023. 10.3389/fphar.2023.1243934. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Hall A, Middlehurst B, Cadogan MAM, Reed X, Billingsley KJ, Bubb VJ, et al. A Sine-Vntr-Alu at the Lrig2 locus is associated with proximal and distal gene expression in CRISPR and population models. Sci Rep. 2024;14(1):792. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.SMR3A gene - submaxillary gland androgen regulated protein 3a [Internet]. GeneCards - The Human Gene Database; Available from: https://www.genecards.org/cgi-bin/carddisp.pl?gene=SMR3A
  • 46.SMR3B gene - submaxillary gland androgen regulated protein 3B [Internet]. GeneCards - The Human Gene Database; Available from: https://www.genecards.org/cgi-bin/carddisp.pl?gene=SMR3B
  • 47.Mukherjee A, Park A, Wang L, Davies KP. Role of opiorphin genes in prostate cancer growth and progression. Future Oncol. 2021;17(17):2209–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Homo sapiens isolate 150210799 LINE 1, complete sequence - nucleotide - NCBI. Available from: https://www.ncbi.nlm.nih.gov/nucleotide/MZ092701.1?report=genbank&log$=nucltop&blast_rank=55&RID=6E5TG7M7013
  • 49.Gasparotto E, Burattin FV, Di Gioia V, Panepuccia M, Ranzani V, Marasca F, et al. Transposable elements co-option in genome evolution and gene regulation. Int J Mol Sci. 2023;24(3):2610. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Grillo G, Keshavarzian T, Linder S, Arlidge C, Mout L, Nand A, et al. Transposable elements are co-opted as oncogenic regulatory elements by lineage-specific transcription factors in prostate cancer. Cancer Discov. 2023. 10.1158/2159-8290.CD-23-0331. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Choudhary MNK, Quaid K, Xing X, Schmidt H, Wang T. Widespread contribution of transposable elements to the rewiring of mammalian 3D genomes. Nat Commun. 2023;14(1):634. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Lawson HA, Liang Y, Wang T. Transposable elements in mammalian chromatin organization. Nat Rev Genet. 2023;24(10):712–23. [DOI] [PubMed] [Google Scholar]
  • 53.Karttunen K, Patel D, Xia J, Fei L, Palin K, Aaltonen L, et al. Transposable elements as tissue-specific enhancers in cancers of endodermal lineage. bioRxiv. 2022. 10.1101/2022.12.16.520732v1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Chandrashekar DS, Dey P, Acharya KK. GREAM: a web server to short-list potentially important genomic repeat elements based on over-/under-representation in specific chromosomal locations, such as the gene neighborhoods, within or across 17 mammalian species. PLoS One. 2015;10(7): e0133647. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Lötscher E, Siwka W, Zimmer FJ, Grummt F, Zachau HG. Transposed human immunoglobulin C kappa gene regions carry clusters of conserved sequence elements. Gene. 1988;69(2):225–36. [DOI] [PubMed] [Google Scholar]
  • 56.Bourque G, Burns KH, Gehring M, Gorbunova V, Seluanov A, Hammell M, et al. Ten things you should know about transposable elements. Genome Biol. 2018;19(1):199. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Papadogkonas G, Papamatheakis DA, Spilianakis C. 3D genome organization as an epigenetic determinant of transcription regulation in T cells. Front Immunol. 2022. 10.3389/fimmu.2022.921375. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Hardan A, Botero J, Arnone J. Recent developments on the role of spatial positioning in gene expression and disease. SPG BioMed. 2018. 10.32392/biomed.34. [Google Scholar]
  • 59.Arnone JT, McAlear MA. Adjacent gene pairing plays a role in the coordinated expression of ribosome biogenesis genes Mpp10 and Yjr003c in Saccharomyces cerevisiae. Eukaryot Cell. 2011;10(1):43–53. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Zhou N, Friedberg I, Kaiser MS. Hierarchical Markov random field model captures spatial dependency in gene expression, demonstrating regulation via the 3D genome. bioRxiv. 2020. 10.1101/2019.12.16.878371v2.33501432 [Google Scholar]
  • 61.Onrust-van Schoonhoven A, de Bruijn MJW, Stikker B, Brouwer RWW, Braunstahl GJ, van IJcken WFJ, et al. 3D chromatin reprogramming primes human memory TH2 cells for rapid recall and pathogenic dysfunction. Sci Immunol. 2023;8(85): eadg3917. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Wang Y, Song F, Zhang B, Zhang L, Xu J, Kuang D, et al. The 3D Genome Browser: a web-based browser for visualizing 3D genome organization and long-range chromatin interactions. Genome Biol. 2018;19(1):151. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Rangwala SH, Zhang L, Kazazian HH. Many LINE1 elements contribute to the transcriptome of human somatic cells. Genome Biol. 2009;10(9):R100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Rodić N, Burns KH. Long interspersed element–1 (LINE-1): passenger or driver in human neoplasms? PLoS Genet. 2013;9(3): e1003402. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Ardeljan D, Taylor MS, Ting DT, Burns KH. The human LINE-1 retrotransposon: an emerging biomarker of neoplasia. Clin Chem. 2017;63(4):816–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Li X, Bie L, Wang Y, Hong Y, Zhou Z, Fan Y, et al. LINE-1 transcription activates long-range gene expression. Nat Genet. 2024;56(7):1494–502. [DOI] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

Additional file 1. (26.9MB, tif)
Additional file 2. (170.4KB, jpg)
Additional file 3. (409.6KB, jpg)
Additional file 4. (259.5KB, pdf)

Data Availability Statement

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. SEQSIM is freely available and is implemented in C++ for Linux and Windows platforms. All files are included in the supplementary documentation or can be downloaded from https://sites.ualberta.ca/~joyramie/SEQSIM.zip. Additional resources, including a user guide and supplementary data, can be downloaded with this publication or from https://sites.ualberta.ca/~joyramie/SEQSIM.html.


Articles from BMC Bioinformatics are provided here courtesy of BMC
