Skip to main content

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

bioRxiv logoLink to bioRxiv
[Preprint]. 2024 Apr 19:2024.04.15.589623. [Version 1] doi: 10.1101/2024.04.15.589623

ModDotPlot—Rapid and interactive visualization of complex repeats

Alexander P Sweeten 1,2, Michael C Schatz 1, Adam M Phillippy 2
PMCID: PMC11071298  PMID: 38712106

Abstract

Motivation:

A common method for analyzing genomic repeats is to produce a sequence similarity matrix visualized via a dot plot. Innovative approaches such as StainedGlass have improved upon this classic visualization by rendering dot plots as a heatmap of sequence identity, enabling researchers to better visualize multi-megabase tandem repeat arrays within centromeres and other heterochromatic regions of the genome. However, computing the similarity estimates for heatmaps requires high computational overhead and can suffer from decreasing accuracy.

Results:

In this work we introduce ModDotPlot, an interactive and alignment-free dot plot viewer. By approximating average nucleotide identity via a k-mer-based containment index, ModDotPlot produces accurate plots orders of magnitude faster than StainedGlass. We accomplish this through the use of a hierarchical modimizer scheme that can visualize the full 128 Mbp genome of Arabidopsis thaliana in under 5 minutes on a laptop. ModDotPlot is bundled with a graphical user interface supporting real-time interactive navigation of entire chromosomes.

Availability and Implementation:

ModDotPlot is available at https://github.com/marbl/ModDotPlot.

Introduction

Large tandemly repeating blocks of DNA, such as satellite repeats and their complex higher-order structures, are ubiquitous in many eukaryotic genomes, yet have been notoriously difficult to sequence and assemble. These motifs occur disproportionately in telomeric, centromeric, and heterochromatic regions of the genome (17), and are commonly referred to as genomic “dark matter” due to their prior absence from reference genomes (32). Recent advances in long-read sequencing and assembly tools have enabled genomics researchers to successfully assemble these complex regions, culminating in the first complete human genome (24) as well as important model organisms such as Arabidopsis (23) and non-human primates (20). More broadly, with tools such as Verkko (29) and hifiasm (UL) (6) now able to automatically assemble complete “telomere-to-telomere” chromosomes, developing new methods to analyze these previously dark regions of the genome has taken on new importance.

Traditionally, dot plots have been useful visualizations to characterize the structure of complex repeats (19). To generate such a plot, a sequence S is typically aligned with itself using software such as MUMmer (21), and plotted in a two dimensional space. This approach results in a set of line segments from [x,y] to [x+l-1,y+l-1] for all matches of length l (above some minimum length threshold) beginning at positions x and y in S. This yields a single diagonal line segment, representing the sequence aligned with itself, and all off-diagonal segments representing the location of paralogous repeat copies. If based on a gapped sequence alignment, these segments may also be colored by their average sequence identity, but the internal, fine-grained structure of the repeats cannot be represented by this technique.

To overcome this limitation, recent work by Vollger et al. introduced StainedGlass (33), which relies on a rasterized rather than vectorized approach. In this framework, the aim is to generate a similarity matrix Mw where each cell MwAi,Bj relates two genomic intervals Ai and Bj of length w beginning at positions wi and wj in S. By re-framing the problem in terms of intervals rather than single bases, a percent identity can be computed between all pairs of intervals and the matrix Mw can be rendered as a heatmap where each cell (pixel) represents the percent identity between the two substrings at the corresponding interval positions. This technique has extended the previously binary dot plot into a rich spectrum of information and proven highly effective for visualizing patterns of sequence evolution within tandem repeat arrays of both humans and plants (18, 34).

Although heatmaps produced by StainedGlass have been useful in practice, the workflow used to generate them has inherent limitations. First, StainedGlass uses Minimap2 (16) to determine sequence identity by computing the number of matches, mismatches, insertions, and deletions between pairs of substrings. Minimap2’s alignment heuristic is not well-suited for repetitive sequences (31) and leads to long runtimes, especially for short tandem repeats. For example, a single 3 Mbp human centromere requires over one hour to plot when running on a high performance compute cluster. Furthermore, StainedGlass partitions the input sequence into intervals of a fixed size. Similar substrings that are split across this boundary may fail to align, leading to inaccurate identity estimates.

To improve upon these limitations, we propose a k-mer-based approach that bypasses the computationally expensive requirement of sequence alignment. Estimating sequence identity from sets of k-length substrings (k-mers) has seen increasing use in genomics (25). Such tools typically utilize downsampling methods, such as minhash, to reduce the size of each k-mer set before estimating sequence identity using the Jaccard index or related set similarity measure.

In this work, we introduce ModDotPlot, a novel heatmap visualization tool that rapidly estimates sequence identity using hierarchical modimizers, a form of fractional minhashing (4, 11). Modimizers are defined as hashed k-mer values that have no remainder when divided by some number s, which we refer to as the sparsity. Here we restrict s to powers of two, s=2d, which conveniently results in the set of modimizers being: (1) precisely those hash values with d zeros in their least significant bits, and (2) a strict subset of the modimizers defined by s=2d-1. We use this efficient membership test and hierarchical property to efficiently downsample genomic k-mers at multiple levels of sparsity. We show that the resulting modimizers can be used to accurately estimate the average nucleotide identity (ANI) of two substrings, while being resistant to segmentation artifacts and orders of magnitude faster than StainedGlass. To conclude, we demonstrate ModDotPlot’s ability to elucidate the centromeric satellite structure of both plants and animals.

Materials and Methods

ModDotPlot takes as input a list of sequences in FASTA format and outputs a self-identity heatmap for each sequence, as well as comparative heatmaps for all pairwise combinations of sequences. In describing our methods, we assume the construction of a self-identity heatmap, but the necessary modifications for constructing comparative heatmaps is straightforward. ModDotPlot can be run one of two ways, specified at runtime: Static mode produces a static image file for each plot, while Interactive mode builds a plot hierarchy using multiple modimizer values so that the plot resolution can be adjusted in real time as the user adjusts the zoom level. We outline the workflow of both possible modes of ModDotPlot in Figure 1.

Figure 1.

Figure 1.

Overview of ModDotPlot’s workflow for producing a self-identity plot. Static mode: Hashed k-mers are evenly partitioned into intervals of length w. Modimizers are selected based on an estimated sketch size m within each interval. For each pairwise combination of intervals, identity is computed and stored in a matrix Mw. Finally, a heatmap is created based on the color thresholds provided. Interactive mode: Three distinct modimizer partitions are produced from a minimum interval length of wˆ=1 up to w=4. At launch, a heatmap is rendered for the largest window size (here, w=4). When the field of view is zoomed by half (highlighted region), the dot plot is rendered using a submatrix created from the partition at w=2. This process can extend until a plot produced from the minimum interval length wˆ is reached, with m remaining constant among all layers. While m=1 is used here for demonstration, ModDotPlot adjusts the modimizer sparsity such that m1000 in practice.

ModDotPlot first decomposes each sequence S of length n into a list of its constituent k-mers Sk. Each k-mer and its reverse complement are passed through a hash function h:[0,H] for some HR, with the smaller of the two values added into Sk. Once broken down into k-mers, ModDotPlot partitions Sk into evenly sized and non-overlapping genomic intervals of size w, also referred to as the window size. We define the number of intervals as r=n-k+1w, which we refer to as the resolution. This determines the height and width of the resulting heatmap. To reduce the runtime and space complexity of handling large sequences, ModDotPlot sketches each interval A into sets based on a modulo function, as originally proposed by Broder (4). We formally define our algorithm for sketching Sk in Supplementary Algorithm 1. This generates the following set for each interval:

MODsA=aSkA:a0mods (1)

We refer to any k-mer present in the sketch MODsA as a modimizer. We define sZ+ as the modimizer sparsity and restrict s to powers of 2. Note that the sparsity value is inversely related to the number of modimizers selected (i.e. the density), with s=2 resulting in approximately every second k-mer being selected, s=4 with every fourth k-mer, and so on. Given a set of k-mers sampled from a long random string, the expected number of modimizers per window is:

m=EMODs(A)=ws (2)

We refer to m as the modimizer sketch size, with larger values of m increasing the accuracy of the minhash similarity estimates. Given a desired plot resolution r and target sketch size m, the corresponding window size w=nr and required sparsity s=wm can be automatically derived. Based on prior work (25), we use m=1,000 as a good compromise between accuracy and efficiency.

In practice, if the k-mers in interval A are highly repetitive, then the true size of MODs(A) can be significantly less than m. To avoid selecting too few k-mers in a window, we introduce a threshold set to half the expected number of modimizers. If the size of MODs(A) is less than this threshold, modimizers are iteratively recomputed at half the sparsity until the modimizer count threshold is met or the sparsity hits one (i.e. every k-mer in A is included in the sketch).

Once the input sequence is partitioned and sketched, ModDotPlot produces a similiarity matrix Mw by estimating the identity between each pairwise combination of intervals A and B, which we refer to as a cell in the matrix. We estimate the proportion of k-mers in A that are contained in B, and vice-versa, via the containment index (4):

cˆmod(A,B)=MODs(A)MODs(B)MODs(A) (3)

Hera et al. show that for the FracMinHash scheme, a correction factor is needed for an unbiased estimate of the containment index (28), to account for cases where MODs(A) differs greatly from MODs(B). In practice, this can occur when interval A occurs in a repetitive genomic interval while interval B does not. Since modulo hashing is a variant of fractional minhashing, the same correction applies and we include the expected value in the denominator to achieve an unbiased estimate of the containment index:

cmodA,B=MODsAMODsBMODsA(1-(1-1s)A) (4)

Furthermore, since the containment index drops exponentially with respect to the mutation rate (15), it is useful to represent this as an estimate of percent sequence identity. As implemented in MashScreen (26), we model the probability of mutation at each position in a k-mer with the binomial distribution to estimate the ANI as:

ANIc(A,B)=cmod(A,B)1k (5)

For self-identity plots, ModDotPlot sets Mw(A,B)=maxANIcA,B,ANIc(B,A) to ensure the resulting matrix is symmetric. We note that the containment index is not a distance metric, as it neither satisfies the symmetry property nor the triangle inequality property; however, for two equally sized intervals, we show that ANIc correlates well with an alignment-based ANI. Furthermore, the containment index has the desirable property of not requiring a set operation in its denominator, meaning it is possible to increase the length of interval B without penalizing ANIc. We take advantage of this property to overcome segmentation artifacts, as described later.

Once the matrix of containment indices is computed, ModDotPlot outputs an identity heatmap analogous to a genomic dot plot. The heatmap is assigned a range of color values, ranging from t (a user provided threshold identity threshold) to 100. Any cells in the matrix <t are left uncolored. Use of t<80 is not recommended, as the identity estimate rapidly loses accuracy below this value for typical values of k and m, since the higher divergence may result in very few, or zero, k-mers shared between the two intervals. Given a symmetric self-identity dot plot, the upper diagonal of the dot plot can be used to produce a triangular dot plot in addition to the standard square.

Modimizer Hierarchy.

Modimizers present a quick and efficient sketching approach, as given a sparsity of s=2d, only the first d bits of each k-mer hash need to be checked to verify membership in MODs. In addition, modimizers are context-independent, providing a guarantee that any k-mer selected as a modimizer in one set will also be a modimizer in every other set, regardless of the neighboring context or genomic interval. Given these properties, it is guaranteed that any modimizer in MODs1(A) will also occur in MODs2(A) when s1 is an integer multiple of s2:

MODs1(A)MODs2(A)ifs1s2Z+ (6)

Thus, for a geometric sequence of sparsity values, the smaller modimizer sets will always be subsets of the larger ones. We call this the hierarchical property of modimizers. As we describe below, we leverage this property in order to reduce the memory and runtime overhead when generating dot plots at multiple zoom levels.

A hierarchical modimizer index consists of l modimizer sets with window sizes wˆ,2wˆ,,2(l-1)wˆ and corresponding sparsities sˆ,2sˆ,,2(l-1)sˆ. Given a user-specified modimizer sketch size mˆ and minimum window size wˆ, the initial sparsity is defined as sˆ=2log2wˆmˆ. To construct progressively sparser levels of the hierarchy, let A be an interval of size 2w, and AL and AR be the w-sized left and right halves of A respectively. Due to the hierarchical property, the modimizers for the next sparser level can be sampled from the previous level since MOD2s(A)MODsALMODsAR. Repeating this process, additional levels of the hierarchy are sampled until the window size exceeds |S|r, i.e. the resulting number of intervals would be less than the minimum resolution. For example, a 250 Mbp sequence plotted with a minimum window size of 10 Kbp and a resolution of 1,000 would result in 5 layers, since l=5 is the largest l such that 2(l-1)wˆ|S|r. We formally define our algorithm for producing the modimizer hierarchy in Supplementary Algorithm 2.

The runtime and space complexity for building the initial modimizer layer is 𝒪(n), as this requires linear scan of the sequence of size n. The expected complexity of each successive layer is half the previous due to the sparsity increasing by powers of two, so the overall runtime and space complexity of Supplementary Algorithm 2 remains 𝒪(n). This approach mirrors the “multilevel winnowing” (12) or “SHIMMER” (7) indices, but our use of modimizers rather than minimizers allows for unbiased containment estimates. From this index, similarity matrices can be efficiently computed for any pair of genomic ranges of the input sequence, with the maximum resolution determined by the minimum window size chosen when building the hierarchy.

Offset and Window Expansion.

When partitioning the input sequence into discrete intervals, it’s possible that two highly similar sequences can be partitioned in different ways, resulting in an inaccurate sequence identity estimate between them (Figure 2). This occurs whenever the two similar sequences are “out of register” and have a different offset relative to the start of the full sequence and that difference is not a multiple of the interval length. The result is that the sequences of the two intervals only partially overlap, rather than fully match. This can also occur within tandem repeats when the unit size is larger than the interval length, such as the rDNA arrays of human acrocentric chromosomes.

Figure 2.

Figure 2.

Sample cases for different interval offsets. k-mers shared between intervals A (blue) and B (green) are visualized with a red line. a) In an ideal partition, the shared k-mers are perfectly captured in both intervals. b) In a worse-case partition, only half of the the shared k-mers are captured in the cell, leading to a misleading identity estimate for this region. c) By keeping A fixed, but expanding B to B', ModDotPlot is able to better capture the similarity between two similar sequences with different offsets. The containment index of A in B' is then used to determine the score of the dot plot matrix cell Mw(A,B).

To overcome this offset issue, ModDotPlot extends each interval B by w2 in each direction to form the expanded interval B'. The containment index is then computed as cmod(A,B')1/k, accounting for any sequence similarities that extend beyond the boundaries of B. We show the effect of this approach when computing the containment index in Figure 2, as well as a practical example with human rDNA in Supplementary Figure 1. Since B does not appear in the denominator of Equation 4, expanding the size of B does not penalize or bias the containment index. Doubling the size of B accounts for the worst-case scenario of a match diagonal beginning in the middle of the interval, and so is the default behavior, but this expansion factor can be turned off or adjusted if necessary.

Implementation and User Interface.

ModDotPlot is implemented in the Python programming language (version 3.7 or later). By default, ModDotPlot runs in interactive mode using Plotly (10), which itself uses the Flask web framework (8). Consequently, plots are visualized on a web browser connected to the user’s localhost. Interactive ModDotPlot can also be run remotely, e.g. on a compute cluster, via port forwarding over an ssh tunnel. In static mode, containment indices are saved into a compressed BED file, and dot plots are produced using the Plotnine plotting library (27). In addition to the standard rectangular plots, static mode also supports triangular plot styles.

An important parameter common to all k-mer based methods is the choice of k, as this represents a trade-off between sensitivity and specificity. Smaller k-mers are more sensitive for detecting identity within divergent intervals, but lose specificity due to chance k-mer collisions. ModDotPlot allows for flexibility in setting k, but based on prior work (25), we set a default k=21 to ensure accurate estimates in most cases.

k-mers are hashed using MurmurHash3 (1) and all similarity matrices are stored in the form of NumPy arrays (9). The size of a similarity matrix is proportional to 𝒪r2 rather than the length of the genome sequence. By default, ModDotPlot uses a resolution of r=1,000 for efficient visualizations on most standard displays. To enable a responsive interface in interactive mode, a full similarity matrix is precomputed for each level of the modimizer hierarchy. However, since the number of layers scales logarithmically with the sequence length, only a few layers are needed in practice (e.g. l ≤ 5). When zooming on the plot, the appropriate matrix is chosen such that the number of cells in the matrix is at least the number of pixels in the plot. To prevent redundant computations of similarity matrices for future exploration, NumPy matrices can be saved as binary files and loaded directly as input.

Supplementary Figure 2 shows a screenshot of ModDotPlot’s user interface in interactive mode. Hovering over the plot shows the exact genomic coordinates, along with the corresponding estimated identity of each section. This example shows a plot highlighting the repeat-rich 30 Mbp Y chromosome from a siamang gibbon (Symphalangus syndactylus). Users can select a number of preset color-schemes, including high contrast schemes to aid visually impaired or colorblind users, or specify custom colors, either in hex code or RGB format. ModDotPlot also supports the creation of fullycustomizable static plots as PDF and PNG files.

Results

Plot Accuracy.

To showcase the improvements of ModDotPlot over StainedGlass, Figure 3 shows the plots produced by both tools for the centromeric alpha satellite array of the human HG002 X chromosome. The StainedGlass default window size of 2,000 produces a highly “checkered” plot containing streaks of apparently low identity within the array. However, this is not representative of any sort of centromere biology; rather, it is an artifact of partitioning the genome into windows of a fixed size. The canonical DXZ1 higher-order repeat (HOR) present in this array consists of twelve monomers totaling ~2,050 bp (22), which is slightly longer than the selected window size. Using a window size of 5,000 is sufficient to contain a complete HOR and alleviate this problem, but this comes at the cost of a lower resolution plot and requires advance knowledge of the repeat structure. In contrast, ModDotPlot produces an accurate plot regardless of window length and HOR size.

Figure 3.

Figure 3.

Plots produced by StainedGlass (column 1) and ModDotPlot (column 2), representing the upper diagonal self-identity heatmap of the HG002 DXZ1 satellite array (ChrX:57,680,000–61,000,000). Rows represent a window size of 2,000 (r=1,570) in ModDotPlot) and 5,000 (r=678) respectively. ModDotPlot was run with a default m=1,000. Plotting artifacts in the StainedGlass w=2,000 example are due to interactions between the partitioning window size and tandem repeat periodicity.

Figure 4 shows the strong correlation between ModDotPlot ANIc values and an alignment-based ANIm computed by MUMmer (21), but with the accuracy of ANIc decreasing with increasing sparsity (reduced sketch size), as expected (Supplementary Figure 3). For each pairwise combination of HORs present in chrX:58,000,771–58,200,827, the MUMmer ANIm was taken from the “AvgIdentity” of 1-to-1 alignments computed by the v4.0.1 “dnadiff” program. The vast majority of HORs, representing the canonical 12-mer structure, fall within the consensus range of 97–100% sequence identity (22) with high concordance (r=0.965) between ModDotPlot and MUMmer. Larger differences between the two methods arise from pairs of windows containing structural variation that confound MUMmer’s alignment-based similarity.

Figure 4.

Figure 4.

Scatterplots showing the linear relationship between MUMmer ANIm and ModDotPlot ANIc, using CHM13 chrX:58,000,771–58,200,827. The outlier group labeled in red represents a non-canonical 10-mer HOR (chrX:58,060,405–58,062,120), which is scored differently by the two methods due to the presence of a large deletion when compared to the 12-mer HOR. The dashed line represents ANIm=ANIc.

The containment index used by ModDotPlot does not penalize k-mer copy number differences or large insertions/deletions (indels) in the same way a global alignment would. For example, within the chromosome X centromeric array we observed a small number of windows where the ANIm and ANIc values differed substantially. Closer investigation revealed the presence of a single non-canonical HOR, consisting of a shorter 10 monomer repeat that was scored higher by ANIc when compared to the canonical 12 monomer repeat (Figure 4). The difference between these two repeats is interpreted as a large indel by MUMmer, resulting in a reduced ANIm. However, this difference is not penalized by ANIc, as the 10 monomers present in the shorter HOR are well-contained within the canonical 12 monomer.

Thus, ANIc is more akin to a local alignment similarity, i.e. the average similarity between the sequences that are shared, and reflects the point mutation rate between two sequences rather than the rate of larger structural variants. This is an important distinction, because in this case MUMmer ANIm confounds these two evolutionary processes, while ANIc isolates the point mutation rate of the individual monomers. Such differences between ANIc and ANIm are most pronounced within HOR satellite arrays, which are prone to unequal crossing over leading to frequent expansion and contraction of the arrays (3). For this reason, the UniAligner (5) tool, which is specifically built for aligning long tandem repeats, similarly uses an indel penalty of zero during its k-mer alignment phase.

Modimizer Sparsity.

Compared to other sketching approaches, modimizers lack any sort of “window guarantee,” meaning that no lower bounds exist on the number of k-mers that will be selected for each interval. In addition, the containment index is computed on sets of k-mers, not multisets (i.e. only the presence or absence of a k-mer is considered), so highly repetitive intervals will typically result in smaller k-mer sets, which can lead to reduced accuracy when estimating the containment. Although this is partially taken into account by the error term provided in Equation 4, we demonstrate that by dynamically modifying the sparsity, as done in Supplementary Algorithm 1, the number of modimizers selected per window can be kept above acceptable levels. Figure 5 shows this on a 4 Mbp centromeric region of CHM13 chromosome 1. Regions of alpha satellite repeats show a steep decline in the number of distinct k-mers; however, this can be corrected by adaptively reducing the modimizer sparsity in this region to boost the number of k-mers selected per window to at least m2 and thus improve the containment estimates. Without this correction, we find that real similarities between low-complexity satellite arrays can go entirely undetected.

Figure 5.

Figure 5.

Self-identity plot of the centromere of CHM13 Chromosome 1, overlaid with a smoothed unique k-mer frequency chart. Using a window size w=4,000 and a sparsity s=4, the expected number of modimizers per window is m=1,000. When using an uncorrected sparsity value (red), the number of unique modimizers per window can drop to under 200. By detecting the unexpectedly small set sizes and adjusting the sparsity of these windows, the total number of modimizers in each window can be increased to at least m2 (or, in pathological cases, all k-mers in the window).

Comparative Plots.

In addition to self-identity plots, ModDotPlot is also able to generate comparative plots between two different sequences. As an example, we showcase a pairwise dot plot between the DXZ1 alpha satellite arrays of two different human X chromosome centromeres, one from the HG002 genome and one from the CHM13 genome (Figure 6). These two arrays have been previously assembled and compared (3), but it is difficult to understand their structural differences by comparing only their self-identity plots. By plotting the two arrays against each other, their orthology relationship becomes clear. The comparative dot plot of the HG002 and CHM13 DXZ1 arrays reveals a faint diagonal representing the shared history of the two sequences, punctuated by over 300 large duplications/deletions distributed throughout the array (5). As noted above, centromeric satellite arrays are one of the fastest evolving regions of the human genome and accumulate many such structural variants through various recombinational mechanisms. Because of their unique evolutionary patterns, and propensity for bulk insertions/deletions, they have been one of the most difficult regions of the genome to align using traditional approaches.

Figure 6.

Figure 6.

Comparative dot plot of chromosome X DXZ1 satellite array from the HG002 and CHM13 genomes, overlaid with self-identity plots, using a 99% identity threshold. A faint, high-identity diagonal is visible in the comparative plot, indicating the orthologous sequences between these two highly variable arrays.

Runtime and Memory.

In Table 1, we compare the runtime and memory usage of ModDotPlot to StainedGlass across input sequences of various species and sizes. These include the HG002 X chromosome centromere (same sequence as Figure 3), the gibbon Y chromosome (Supplementary Figure 2), the human Y chromosome (30), and the entire gap-free reference genome of Arabadopsis (23), containing 5 chromosomes. For each input, both a static matrix and interactive matrices containing three layers were produced, based on a window size proportional to the length of the largest chromosome in the input group. Interactive StainedGlass plots were created in a similar way to ModDotPlot (i.e. a bottom-up approach based on a minimum window size), and stored in Cooler format (2).

Table 1.

Analysis of memory and runtime needed to produce the similarity matrix (this does not include plot runtime, as that is the same between StainedGlass and ModDotPlot). ModDotPlot was run with a target sketch size of m = 1,000 for all samples. For the Arabadopsis Col-CEN assembly, the runtime includes the comparative matrix between each pairwise combination of chromosomes, in addition to self-identity comparisons.

Sequence n Plot Type w ModDotPlot StainedGlass
CPU time (s) Memory (GB) CPU time (s) Memory (GB)

Human CHM13 4.0 Static 4000 11.10 0.43 1871.31 12.95
Chr1 Centromere 1000 204.85 1.16 2812.49 13.44

Gibbon mSymSyn1 29.9 Static 32000 51.16 2.05 9857.57 30.13
ChrY 8000 193.22 2.41 11264.01 33.50

Human HG002 62.5 Static 64000 80.47 4.06 11214.19 43.18
ChrY 16000 269.84 5.90 14806.91 48.95

Arabadopsis Col-CEN 128.5 Static 32000 289.12 6.13 16014.17 33.41
Whole Genome 8000 1734.11 9.57 20187.19 35.20

In all cases, ModDotPlot exhibits orders of magnitude lower runtime and memory requirements than StainedGlass. An analysis of the Snakemake report generated by StainedGlass showed that the Minimap2 alignment dominated the runtime and memory usage and was the clear bottleneck of the pipeline. We note that despite both tools requiring the sequence identity computation of r2 cells in each matrix, importantly, ModDotPlot’s runtime is independent of sequence length n. Computing ANIc for each cell requires a set intersection operation on two sets of size m, making Equation 5’s runtime complexity 𝒪(m). This can be observed in Table 1, as in interactive mode with high r, both Y chromosomes and the Human Chr1 centromere took a similar amount of CPU time, despite each sequence being vastly different in size. In contrast, StainedGlass requires each cell to run Minimap2 on an unsketched sequence of length w=nr. The 𝒪(n) runtime for identity estimation hinders the ability of StainedGlass to visualize whole genomes and large sequences.

Discussion

Traditional dot plot methods have struggled with the complexity and abundance of genomic repeats, often leading to oversimplified or inaccurate representations. The use of heatmaps offers a substantial improvement over classic vectorized dotplots as they allow for a more natural and nuanced representation of tandem repeats, thereby capturing subtle variations and patterns that vectorized plots obscure. This is especially true for the typical use case where the genomic sequences are manyfold larger than the resolution of the display so that a single pixel intrinsically represents many kilobases of sequence (e.g. a gigabase genome plotted on a 4K display). ModDotPlot improves upon previous methods in terms of speed and computing requirements by an order of magnitude, enabling visualization of whole genomes on a laptop. At the heart of ModDotPlot’s efficiency is its use of hierarchical modimizers, which enable the interactive visualization of vertebrate-sized genomes on a typical laptop. Additionally, the use of expanded intervals combined with the containment index efficiently corrects for registration artifacts inherent to rasterized similarity heatmaps. This is especially important for centromeric and rDNA repeats that are composed of large subunits that can straddle adjacent windows.

A number of additional features could be added to further extend the utility of ModDotPlot. We note how readily satellite arrays and other repeat classes can be visually identified from the dot plots, e.g. satellite arrays appear as dense blocks of color, segmental duplications as lines, and palindromes as lines that cross the diagonal. This raises the possibility of repeat annotation and classification using automated interpretation of dot plots, possibly through machine learning techniques. Additionally, the integration of arbitrary annotation tracks alongside the dot plots would add the ability to visualize genes and other notable features in the context of structural repeats and variation, as is possible with other visualization tools such as HiGlass (13). Lastly, ModDotPlot currently computes similarity matrices in advance of plotting, but with sufficiently fast set operations it would be possible to compute similarity matrices directly from the hierarchical modimizer index on the fly. This would enable interactive exploration of plots with essentially arbitrary resolution.

ModDotPlot highlights the power of minhashing as a fast yet accurate heuristic for sequence alignment, even within the most complex satellite repeat arrays. Alternative sketching approaches may further the utility of this approach. Minmers, for example, allow for an unbiased and accurate identity estimate, with the added advantage of having a window guarantee (14). While such methods can improve sensitivity for more diverged sequences, this comes at the expense of being slower to compute. However, the results presented here suggest that such methods may be able to guide alignments through highly repetitive and variable satellite arrays, ultimately improving our understanding of the structure, function, and evolution of these previously dark regions of the genome.

Supplementary Material

Supplement 1
media-1.pdf (3.8MB, pdf)

Acknowledgements

We would like to thank Mitchell Vollger, Ian Henderson, Karen Miga, and Nicholas Altemose for helpful discussions, and Richard Durbin for suggesting the term “modimizer” to describe an element of a modulo sketch. We would also like to thank Bryce Kille and Nico Ritschel for their feedback and improvements of this manuscript. This work was supported, in part, by the Intramural Research Program of the National Human Genome Research Institute, US National Institutes of Health (A.P.S. and A.M.P.), NSF awards IOS-2216612 and IOS-1758800 (to M.C.S.), and the Human Frontier Science Program award RGP0025/2021 (to M.C.S.). This work utilized the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov).

Bibliography

  • 1.Murmurhash3. URL https://github.com/aappleby/smhasher.
  • 2.Abdennur Nezar and Mirny Leonid A. Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics, 36(1):311–316, January 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Altemose Nicolas, Logsdon Glennis A., Bzikadze Andrey V., Sidhwani Pragya, Langley Sasha A., Caldas Gina V., Hoyt Savannah J., Uralsky Lev, Ryabov Fedor D., Shew Colin J., Sauria Michael E. G., Borchers Matthew, Gershman Ariel, Mikheenko Alla, Shepelev Valery A., Dvorkina Tatiana, Kunyavskaya Olga, Vollger Mitchell R., Rhie Arang, McCartney Ann M., Asri Mobin, Lorig-Roach Ryan, Shafin Kishwar, Lucas Julian K., Aganezov Sergey, Olson Daniel, de Lima Leonardo Gomes, Potapova Tamara, Hartley Gabrielle A., Haukness Marina, Kerpedjiev Peter, Gusev Fedor, Tigyi Kristof, Brooks Shelise, Young Alice, Nurk Sergey, Koren Sergey, Salama Sofie R., Paten Benedict, Rogaev Evgeny I., Streets Aaron, Karpen Gary H., Dernburg Abby F., Sullivan Beth A., Straight Aaron F., Wheeler Travis J., Gerton Jennifer L., Eichler Evan E., Phillippy Adam M., Timp Winston, Dennis Megan Y., O’Neill Rachel J., Zook Justin M., Schatz Michael C., Pevzner Pavel A., Diekhans Mark, Langley Charles H., Alexandrov Ivan A., and Miga Karen H.. Complete genomic and epigenetic maps of human centromeres. Science, 376(6588):eabl4178, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Broder A.Z.. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of Sequences 1997, pages 21–29, 1998. [Google Scholar]
  • 5.Bzikadze Andrey V. and Pevzner Pavel A.. Unialigner: a parameter-free framework for fast sequence alignment. Nature Methods, 20(9):1346–1354, 2023. [DOI] [PubMed] [Google Scholar]
  • 6.Cheng Haoyu, Asri Mobin, Lucas Julian, Koren Sergey, and Li Heng. Scalable telomere-totelomere assembly for diploid and polyploid genomes with double graph, 2023. [DOI] [PMC free article] [PubMed]
  • 7.Chin Chen-Shan and Khalak Asif. Human genome assembly in 100 minutes. bioRxiv, 2019. [Google Scholar]
  • 8.Grinberg Miguel. Flask web development: developing web applications with python. “ O’Reilly Media, Inc.”, 2018. [Google Scholar]
  • 9.Harris Charles R., Millman K. Jarrod, van der Walt Stéfan J., Gommers Ralf, Virtanen Pauli, Cournapeau David, Wieser Eric, Taylor Julian, Berg Sebastian, Smith Nathaniel J., Kern Robert, Picus Matti, Hoyer Stephan, van Kerkwijk Marten H., Brett Matthew, Haldane Allan, Fernández del Río Jaime, Wiebe Mark, Peterson Pearu, Gérard-Marchant Pierre, Sheppard Kevin, Reddy Tyler, Weckesser Warren, Abbasi Hameer, Gohlke Christoph, and Oliphant Travis E.. Array programming with numpy. Nature, 585(7825):357–362, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Plotly Technologies Inc. Collaborative data science, 2015. URL https://plot.ly.
  • 11.Irber Luiz, Brooks Phillip T., Reiter Taylor, Pierce-Ward N. Tessa, Hera Mahmudur Rahman, Koslicki David, and Brown C. Titus. Lightweight compositional analysis of metagenomes with fracminhash and minimum metagenome covers. bioRxiv, 2022. [Google Scholar]
  • 12.Jain Chirag, Dilthey Alexander, Koren Sergey, Aluru Srinivas, and Phillippy Adam M. A fast approximate algorithm for mapping long reads to large reference databases. J Comput Biol, 25(7):766–779, April 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Kerpedjiev Peter, Abdennur Nezar, Lekschas Fritz, Chuck McCallum Kasper Dinkla, Strobelt Hendrik, Luber Jacob M., Ouellette Scott B., Azhir Alaleh, Kumar Nikhil, Hwang Jeewon, Lee Soohyun, Alver Burak H., Pfister Hanspeter, Mirny Leonid A., Park Peter J., and Gehlenborg Nils. Higlass: web-based visual exploration and analysis of genome interaction maps. Genome Biology, 19(1):125, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Kille Bryce, Garrison Erik, Treangen Todd J, and Phillippy Adam M. Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation. Bioinformatics, 39(9): btad512, August 2023. ISSN 1367–4811. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Koslicki David and Zabeti Hooman. Improving minhash via the containment index with applications to metagenomic analysis. Applied Mathematics and Computation, 354:206–215, August 2019. [Google Scholar]
  • 16.Li Heng. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100, May 2018. ISSN 1367–4803. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Logsdon Glennis A. and Eichler Evan E.. The dynamic structure and rapid evolution of human centromeric satellite dna. Genes, 14(1), 2023. ISSN 2073–4425. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Logsdon Glennis A., Rozanski Allison N., Ryabov Fedor, Potapova Tamara, Shepelev Valery A., Mao Yafei, Rautiainen Mikko, Koren Sergey, Nurk Sergey, Porubsky David, Lucas Julian K., Hoekzema Kendra, Munson Katherine M., Gerton Jennifer L., Phillippy Adam M., Alexandrov Ivan A., and Eichler Evan E.. The variation and evolution of complete human centromeres. bioRxiv, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Maizel J V and Lenk R P. Enhanced graphic matrix analysis of nucleic acid and protein sequences. Proceedings of the National Academy of Sciences, 78(12):7665–7669, 1981. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Makova Kateryna D., Pickett Brandon D., Harris Robert S., Hartley Gabrielle A., Cechova Monika, Pal Karol, Nurk Sergey, Yoo DongAhn, Li Qiuhui, Hebbar Prajna, McGrath Barbara C., Antonacci Francesca, Aubel Margaux, Biddanda Arjun, Borchers Matthew, Erich Bornberg-Bauer Gerard G. Bouffard, Brooks Shelise Y., Carbone Lucia, Carrel Laura, Carroll Andrew, Chang Pi-Chuan, Chin Chen-Shan, Cook Daniel E., Craig Sarah J.C., de Gennaro Luciana, Diekhans Mark, Dutra Amalia, Garcia Gage H., Grady Patrick G.S., Green Richard E., Haddad Diana, Hallast Pille, Harvey William T., Hickey Glenn, Hillis David A., Hoyt Savannah J., Jeong Hyeonsoo, Kamali Kaivan, Kosakovsky Pond Sergei L., LaPolice Troy M., Lee Charles, Lewis Alexandra P., Loh Yong-Hwee E., Masterson Patrick, McGarvey Kelly M., McCoy Rajiv C., Medvedev Paul, Miga Karen H., Munson Katherine M., Pak Evgenia, Paten Benedict, Pinto Brendan J., Potapova Tamara, Rhie Arang, Rocha Joana L., Ryabov Fedor, Ryder Oliver A., Sacco Samuel, Shafin Kishwar, Shepelev Valery A., Slon Viviane, Solar Steven J., Storer Jessica M., Sudmant Peter H., Sweetalana, Sweeten Alex, Tassia Michael G., Thibaud-Nissen Françoise, Ventura Mario, Wilson Melissa A., Young Alice C., Zeng Huiqing, Zhang Xinru, Szpiech Zachary A., Huber Christian D., Gerton Jennifer L., Yi Soojin V., Schatz Michael C., Alexandrov Ivan A., Koren Sergey, O’Neill Rachel J., Eichler Evan E., and Phillippy Adam M.. The complete sequence and comparative analysis of ape sex chromosomes. bioRxiv, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Guillaume Marçais Arthur L. Delcher, Phillippy Adam M., Coston Rachel, Salzberg Steven L., and Zimin Aleksey. Mummer4: A fast and versatile genome alignment system. PLOS Computational Biology, 14(1):1–14, January 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Karen H Miga Yulia Newton, Jain Miten, Altemose Nicolas, Willard Huntington F, and Kent W James. Centromere reference models for human chromosomes X and Y satellite arrays. Genome Res, 24(4):697–707, February 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Naish Matthew, Alonge Michael, Wlodzimierz Piotr, Tock Andrew J., Abramson Bradley W., Anna Schmücker Terezie Mandáková, Jamge Bhagyshree, Lambing Christophe, Kuo Pallas, Yelina Natasha, Hartwick Nolan, Colt Kelly, Smith Lisa M., Ton Jurriaan, Kakutani Tetsuji, Martienssen Robert A., Schneeberger Korbinian, Lysak Martin A., Berger Frédéric, Bousios Alexandros, Michael Todd P., Schatz Michael C., and Henderson Ian R.. The genetic and epigenetic landscape of the arabidopsis centromeres. Science, 374(6569):eabi7489, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Nurk Sergey, Koren Sergey, Rhie Arang, Rautiainen Mikko, Bzikadze Andrey V., Mikheenko Alla, Vollger Mitchell R., Altemose Nicolas, Uralsky Lev, Gershman Ariel, Aganezov Sergey, Hoyt Savannah J., Diekhans Mark, Logsdon Glennis A., Alonge Michael, Antonarakis Stylianos E., Borchers Matthew, Bouffard Gerard G., Brooks Shelise Y., Caldas Gina V., Chen Nae-Chyun, Cheng Haoyu, Chin Chen-Shan, Chow William, de Lima Leonardo G., Dishuck Philip C., Durbin Richard, Dvorkina Tatiana, Fiddes Ian T., Formenti Giulio, Fulton Robert S., Fungtammasan Arkarachai, Garrison Erik, Grady Patrick G. S., Graves-Lindsay Tina A., Hall Ira M., Hansen Nancy F., Hartley Gabrielle A., Haukness Marina, Howe Kerstin, Hunkapiller Michael W., Jain Chirag, Jain Miten, Jarvis Erich D., Kerpedjiev Peter, Kirsche Melanie, Kolmogorov Mikhail, Korlach Jonas, Kremitzki Milinn, Li Heng, Maduro Valerie V., Marschall Tobias, McCartney Ann M., McDaniel Jennifer, Miller Danny E., Mullikin James C., Myers Eugene W., Olson Nathan D., Paten Benedict, Peluso Paul, Pevzner Pavel A., Porubsky David, Potapova Tamara, Rogaev Evgeny I., Rosenfeld Jeffrey A., Salzberg Steven L., Schneider Valerie A., Sedlazeck Fritz J., Shafin Kishwar, Shew Colin J., Shumate Alaina, Sims Ying, Smit Arian F. A., Soto Daniela C., Ivan Sović Jessica M. Storer, Streets Aaron, Sullivan Beth A., Françoise Thibaud-Nissen James Torrance, Wagner Justin, Walenz Brian P., Wenger Aaron, Wood Jonathan M. D., Xiao Chunlin, Yan Stephanie M., Young Alice C., Zarate Samantha, Surti Urvashi, McCoy Rajiv C., Dennis Megan Y., Alexandrov Ivan A., Gerton Jennifer L., O’Neill Rachel J., Timp Winston, Zook Justin M., Schatz Michael C., Eichler Evan E., Miga Karen H., and Phillippy Adam M.. The complete sequence of a human genome. Science, 376(6588):44–53, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Ondov Brian D., Treangen Todd J., Melsted Páll, Mallonee Adam B., Bergman Nicholas H., Koren Sergey, and Phillippy Adam M.. Mash: fast genome and metagenome distance estimation using minhash. Genome Biology, 17(1):132, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Ondov Brian D., Starrett Gabriel J., Sappington Anna, Kostic Aleksandra, Koren Sergey, Buck Christopher B., and Phillippy Adam M.. Mash screen: high-throughput sequence containment estimation for genome discovery. Genome Biology, 20(1):232, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.The plotnine development team. plotnine: A grammar of graphics for python. URL https://github.com/has2k1/plotnine.
  • 28.Rahman Hera Mahmudur, Pierce-Ward N Tessa, and Koslicki David. Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash. Genome Res, 33(7):1061–1068, June 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Rautiainen Mikko, Nurk Sergey, Walenz Brian P., Logsdon Glennis A., Porubsky David, Rhie Arang, Eichler Evan E., Phillippy Adam M., and Koren Sergey. Telomere-to-telomere assembly of diploid chromosomes with verkko. Nature Biotechnology, 41(10):1474–1482, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Rhie Arang, Nurk Sergey, Cechova Monika, Hoyt Savannah J., Taylor Dylan J., Altemose Nicolas, Hook Paul W., Koren Sergey, Rautiainen Mikko, Alexandrov Ivan A., Allen Jamie, Asri Mobin, Bzikadze Andrey V., Chen Nae-Chyun, Chin Chen-Shan, Diekhans Mark, Flicek Paul, Formenti Giulio, Fungtammasan Arkarachai, Carlos Garcia Giron Erik Garrison, Gershman Ariel, Gerton Jennifer L., Grady Patrick G. S., Guarracino Andrea, Haggerty Leanne, Halabian Reza, Hansen Nancy F., Harris Robert, Hartley Gabrielle A., Harvey William T., Haukness Marina, Heinz Jakob, Hourlier Thibaut, Hubley Robert M., Hunt Sarah E., Hwang Stephen, Jain Miten, Kesharwani Rupesh K., Lewis Alexandra P., Li Heng, Logsdon Glennis A., Lucas Julian K., Makalowski Wojciech, Markovic Christopher, Martin Fergal J., Mc Cartney Ann M., McCoy Rajiv C., McDaniel Jennifer, McNulty Brandy M., Medvedev Paul, Mikheenko Alla, Munson Katherine M., Murphy Terence D., Olsen Hugh E., Olson Nathan D., Paulin Luis F., Porubsky David, Potapova Tamara, Ryabov Fedor, Salzberg Steven L., Sauria Michael E. G., Sedlazeck Fritz J., Shafin Kishwar, Shepelev Valery A., Shumate Alaina, Storer Jessica M., Surapaneni Likhitha, Taravella Oill Angela M., Thibaud-Nissen Françoise, Timp Winston, Tomaszkiewicz Marta, Vollger Mitchell R., Walenz Brian P., Watwood Allison C., Weissensteiner Matthias H., Wenger Aaron M., Wilson Melissa A., Zarate Samantha, Zhu Yiming, Zook Justin M., Eichler Evan E., O’Neill Rachel J., Schatz Michael C., Miga Karen H., Makova Kateryna D., and Phillippy Adam M.. The complete sequence of a human y chromosome. Nature, 621(7978):344–354, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Sahlin Kristoffer, Baudeau Thomas, Cazaux Bastien, and Marchet Camille. A survey of mapping algorithms in the long-reads era. Genome Biology, 24(1):133, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Sedlazeck Fritz J., Lee Hayan, Darby Charlotte A., and Schatz Michael C.. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nature Reviews Genetics, 19(6):329–346, 2018. [DOI] [PubMed] [Google Scholar]
  • 33.Vollger Mitchell R, Kerpedjiev Peter, Phillippy Adam M, and Eichler Evan E. StainedGlass: interactive visualization of massive tandem repeat structures with identity heatmaps. Bioinformatics, 38(7):2049–2051, January 2022. ISSN 1367–4803. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Wlodzimierz Piotr, Rabanal Fernando A., Burns Robin, Naish Matthew, Primetis Elias, Scott Alison, Terezie Mandáková Nicola Gorringe, Tock Andrew J., Holland Daniel, Fritschi Katrin, Habring Anette, Lanz Christa, Patel Christie, Schlegel Theresa, Collenberg Maximilian, Mielke Miriam, Nordborg Magnus, Roux Fabrice, Shirsekar Gautam, Carlos AlonsoBlanco Martin A. Lysak, Novikova Polina Y., Bousios Alexandros, Weigel Detlef, and Henderson Ian R.. Cycles of satellite and transposon evolution in arabidopsis centromeres. Nature, 618(7965):557–565, 2023. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement 1
media-1.pdf (3.8MB, pdf)

Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints

RESOURCES