Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2023 Jun 16;39(6):btad388. doi: 10.1093/bioinformatics/btad388

WarpSTR: determining tandem repeat lengths using raw nanopore signals

Jozef Sitarčík 1,2,3,, Tomáš Vinař 4, Broňa Brejová 5, Werner Krampl 6,7,8, Jaroslav Budiš 9,10,11, Ján Radvánszky 12,13,14, Mária Lucká 15,16
Editor: Can Alkan
PMCID: PMC10307940  PMID: 37326967

Abstract

Motivation

Short tandem repeats (STRs) are regions of a genome containing many consecutive copies of the same short motif, possibly with small variations. Analysis of STRs has many clinical uses but is limited by technology mainly due to STRs surpassing the used read length. Nanopore sequencing, as one of long-read sequencing technologies, produces very long reads, thus offering more possibilities to study and analyze STRs. Basecalling of nanopore reads is however particularly unreliable in repeating regions, and therefore direct analysis from raw nanopore data is required.

Results

Here, we present WarpSTR, a novel method for characterizing both simple and complex tandem repeats directly from raw nanopore signals using a finite-state automaton and a search algorithm analogous to dynamic time warping. By applying this approach to determine the lengths of 241 STRs, we demonstrate that our approach decreases the mean absolute error of the STR length estimate compared to basecalling and STRique.

Availability and implementation

WarpSTR is freely available at https://github.com/fmfi-compbio/warpstr

1 Introduction

Short tandem repeats (STRs) are repetitive genomic elements containing many consecutive copies of the same short motif, typically of length 1–6 bp. The number of consecutive repeats often varies among individuals and is one of the largest sources of intraspecies genetic diversity (Gelfand et al. 2007, Gymrek et al. 2016). STRs are involved in the determination of quantitative traits and complex multifactorial diseases, and >50 STR loci in the human genome have been unambiguously associated with severe human monogenic diseases (repeat expansion disorders; REDs). REDs are typically caused by repeat length expansions of certain STRs over a pathogenicity threshold (Gymrek 2017, Depienne and Mandel 2021). Accurate determination of STR lengths is thus critical to differentiate between normal range, premutation range (having intergenerationally unstable lengths but without clinical symptoms), and pathogenic range alleles, but also because in several REDs, the STR length correlates with disease onset and severity (Doyu et al. 1992, Liu et al. 2020).

Up until recently, the STR lengths were estimated mostly by conventional methods of molecular biology (i.e. Southern blotting and PCR-based techniques). In general, their use is complicated by the highly variable range of allele lengths to be detected, by the immense length of the expanded ones, and by the stable secondary structures formed by these repeats. Moreover, the accuracy of these methods decreases significantly with the length and complexity of the STR motif. Therefore, although several improvements have been introduced to overcome these challenges, there is still no single method that would reliably identify and size all possible ranges of normal and expanded alleles, and it is recommended to use them in different parallel combinations. These methods are however limited to characterization of individual loci and are not suitable for genome-scale screening (Radvansky and Kadasi 2010, Bahlo et al. 2018). In contrast, short-read sequencing allows to study STRs in more detail and on the whole-genome scale. However, properly genotyping STRs of lengths surpassing the used read length is difficult and at best limited to the identification of the presence of such alleles, but not their size (Willems et al. 2017, Dashnow et al. 2018, Budiš et al. 2019, Dolzhenko et al. 2019, Halman and Oshlack 2020).

Recently, long-read technologies (such as Pacific Biosciences, and Oxford Nanopore) offered new possibilities for studying even larger expanded STRs (Sedlazeck et al. 2018). However, higher sequencing error rates pose additional challenges. One of the first tools using long reads, PacmonSTR (Ummat and Bashir 2014), modifies a traditional alignment algorithm to take higher error rates of Pacific Biosciences reads into account. The resulting estimate of the repeat number is subsequently corrected by a pair-hidden Markov model. RepeatHMM (Liu et al. 2017) performs error correction of repeat sequences in Oxford Nanopore reads by aligning them with the template of perfect repeats of the specified pattern using asymmetric sequence alignment. Then, RepeatHMM uses a hidden Markov model (HMM) to estimate the number of repeats directly from the corrected read sequences. In both tools, the results from individual reads are pooled together using Gaussian mixture models (GMMs) to predict the final genotype. TandemGenotypes (Mitsuhashi et al. 2019) uses a tailored alignment strategy based on last-split (Frith and Kawaguchi 2015) to obtain more confident alignments between the reference repeat region and reads. The change in the repeat number present in the read compared to the reference is then computed as a sum of contributions from the number of unaligned bases in the reference and the read.

Other tools, such as NanoSatellite (De Roeck et al. 2019) and STRique (Giesselmann et al. 2019), attempt to overcome high error rates in nanopore sequences by using the raw sequencing signal. The raw signal is comprised of measurements of electrical current as DNA passes through a nanopore. The signal is affected by a context of k nucleotides (typically k=6) and each context is read ∼8–9 times on average. The portion of the signal corresponding to the shift by one base is called an event. Typically, the raw signal is first translated into the DNA sequence by a basecalling software using complex machine learning models. The above-mentioned tools use the raw signal to overcome errors introduced through imperfect basecalling.

NanoSatellite (De Roeck et al. 2019) uses an expected signal corresponding to five copies (and later three copies) of the repeating motif to match sections of the real signal using a dynamic time warping (DTW) algorithm (Bellman and Kalaba 1959). The results are clustered into two clusters to obtain the final copy number call. STRique (Giesselmann et al. 2019) uses both flanking sequences and the repeating pattern to build a profile HMM, where match states correspond to k-mers in these sequences; the STRique model does not allow for variation within the repeat. The copy number is then determined by counting match states as the raw signal is aligned to the profile HMM.

We present a novel approach for the length determination of short tandem repeats (STRs) in a genome from the raw signal of nanopore sequencing reads. Our method models an STR locus using a finite-state automaton and uses an adapted version of the DTW algorithm (Bellman and Kalaba 1959) to find an accurate sequence of the read within the STR locus. Our approach improves on the methods introduced by NanoSatellite and STRique in several ways. First, we recognize that basecalling is typically much more accurate in non-repeating regions (such as sequences flanking an STR) than in regions containing short repeats, and thus, it is more efficient (and even more accurate) to simply use basecalled sequences to locate the flanks and subsequently isolate the signal corresponding to the STR locus. Second, instead of using greedy heuristics as in NanoSatellite, we model the whole STR locus by a finite-state automaton and modify the DTW algorithm to align the full length of the signal corresponding to the STR locus to this automaton, determining the corresponding sequence length by the number of states passed by the alignment path. This is similar to the STRique profile HMM; however, there are several major differences. First, in WarpSTR, the finite-state automaton enables much greater customization. This allows us to model more complex STRs that include combinations of heterogeneous motifs or are interrupted, which are common occurrences in medically relevant STRs (Andrew et al. 1994, Musova et al. 2009, Radvansky et al. 2011, Radvanszky et al. 2021). Second, WarpSTR attenuates the signal normalization problem using a novel signal polishing phase. Finally, we use Bayesian GMMs to summarize the information from multiple overlapping reads and to derive the final genotypes. Using nanopore reads from whole-genome sequencing of human genomes and comparing our results to high-confidence variant calls determined by integrative approaches, we determine that our approach is significantly more accurate than STRique.

2 Methods

The speed of DNA passing through a pore varies significantly, which makes signal-based analysis challenging (Midha et al. 2019). To address this problem, DTW technique (Bellman and Kalaba 1959), which has been used e.g. in speech processing, has been adapted to nanopore signal analysis (Loose et al. 2016, Han et al. 2018, 2020). Briefly, DTW uses dynamic programming to find the lowest cost alignment between two sequences of signal values S=s1s2sm and T=t1t2tn. To model variable signal speed, each signal value from S can be aligned to arbitrarily many values from T, and vice versa. The dynamic programming computes a two-dimensional matrix M, where Mi,j represents the cost of the best alignment of s1si to t1tj, using a recurrent formula Mi,j=d(si,tj)+min{Mi1,j1,Mi1,j,Mi,j1}; here, d(s,t) is a distance between signal values s and t. At the end of the computation, the cost of the best alignment is represented by Mm,n, and the best alignment can be recovered using standard back-tracing techniques used in dynamic programming.

To align a known DNA sequence S to a raw signal by DTW, we first estimate an ‘ideal’ signal observed by sequencing S. Namely, we decompose string S to a sequence of overlapping k-mers and for each k-mer use the value from the lookup table of the expected signal values provided by Oxford Nanopore Technologies (https://github.com/nanoporetech/kmer_models/). This signal can be then aligned to the real signal using dynamic programming, minimizing an error function based on the difference between the two compared signal points. In contrast to sequence alignment, alignment of multiple points in one signal to one point in the other signal is not penalized by DTW due to the expected wide variation in signal speeds.

In our work, we adapt DTW for alignment of a repetitive sequence pattern of an undetermined length. Using the alignment of the known sequence pattern to the raw signal helps to inform the decisions in case of ambiguities and ultimately results in a more accurate interpretation of the raw signal. Once the pattern, representing an STR, is aligned to the raw signal, its instantiation is fixed, and the length of the instantiation determines the length of the STR represented by the nanopore read.

2.1 WarpSTR overview

WarpSTR consists of a sequence of steps that include locating reads corresponding to a given STR locus, estimating STR locus length in each of the reads independently (by aligning the raw signal to a finite-state automaton), and finally summarizing the results using a Bayesian GMM. Figure 1 summarizes the whole pipeline; individual steps of the pipeline are described in the following sections. WarpSTR assumes that the input data were preprocessed by basecalling the raw reads and mapped to a reference genome using standard methods. For each read overlapping the target STR locus given by its genomic coordinates, the flanking sequences are then found in the basecalled sequence using local alignment, and the mapping between the called bases and the raw signal, which is an interim output of the Guppy basecaller (Wick et al. 2019), is used to locate the corresponding signal positions. In the experiments, we have used Guppy v.4.0.15 and the flanking sequence length of 110 bases.

Figure 1.

Figure 1.

Overview of the WarpSTR analysis.

2.2 STR representation by a finite-state automaton

An STR locus of an individual haplotype may differ from the reference genome in the number of repeats and in substitutions in individual repeat instances. To represent the variability of each STR, we use a finite-state automaton as shown in Fig. 2, constructed from a simple regular expression. In our software, we currently support IUPAC nucleotides, regions that can be skipped denoted by {}, and repeats occurring at least once denoted by (). For example, the DM2 locus, which consists of a CAGG repeat, potentially interrupted with CAGA or CAGC, can be modeled as (CAGG{CAGM}), where M is the IUPAC code for nucleotides A or C. Such simple regular expressions are sufficient to represent loci that were of interest in our analyses; however, it is straightforward to extend the approach to more complex regular expressions. In the finite-state automaton, we also include flanking sequences from the reference genome. For each STR locus, two automata are built, one for the forward and one for the reverse strand.

Figure 2.

Figure 2.

A finite-state automaton modeling a DM1 STR locus (CAG) and its flanking sequences (shown in gray).

In the next step, we extend the state space of the automaton to the k-mer space to keep track of the context of the last k nucleotides and transform the original automaton to a new finite-state automaton over k-mers (see example in Fig. 3). Briefly, each path of length k in the original automaton will be represented in the new automaton as a state. The two states corresponding to k-mers a and b are connected by a transition if a and b overlap by k1 nucleotides and if the corresponding transition is also present in the original automaton. After this transformation, we can assign to each state the expected value of the nanopore signal aligned to this state.

Figure 3.

Figure 3.

Extended finite-state automaton over the k-mer space.

To align the raw read signal S=s1sm to the finite-state automaton, we use dynamic programming analogous to DTW, where subproblem Mi,j represents the cost of the best alignment of the first i signal points s1si under the condition that we finish in state j. Values Mi,j can be computed using the following recurrence:

Mi,j=mink{j}pred(j){Mi1,k}+|siej|.

Here, pred(j) is the set of all predecessor k-mers of state j in the automaton and ej is the expected signal level for the k-mer represented by state j. A straightforward implementation of this recurrence works in O(mnd) time, where m is the length of the signal, n is the number of states in the automaton, and d is the maximum size of pred(j). By counting the number of state transitions and subtracting the flanking sequences, we obtain the estimated length of the STR (see Fig. 4 for an illustration).

Figure 4.

Figure 4.

A part of the warping path produced by the WarpSTR search algorithm using the state automaton for DM1 locus. States representing k-mers are shown on the left, while the part of the dynamic programming matrix is shown on the right. The warping path, i.e. the path through states with the lowest cost is shown as the sequence of lines with corresponding nucleotides.

When there is no user-defined regular expression modeling the repeat, WarpSTR creates it automatically from the reference genome using an input file listing genomic coordinates of target STRs and their repeat patterns. The software creates the expression using the exact copies of the pattern found in the reference and any incomplete copies or interruptions. For example, if the reference sequence of the STR region was GGG followed by six repeats of the pattern AGAGGG, the automatically generated regular expression was GGG(AGAGGG). These automatically created regular expressions can be adjusted by the user to add flexibility, for example interruptions not present in the reference genome.

2.3 Signal normalization

Nanopore signal is scaled and shifted differently in each sequencing read and it needs to be normalized before analysis so that the resulting values can be compared to the expected signal levels defined in the k-mer tables. The most common approach is based on the assumption that each read represents a sufficiently long random sequence, and thus basic statistics, such as the mean, the median, and the variance should match across different reads. We apply the median normalization strategy outlined in tombo (https://nanoporetech.github.io/tombo/), which for a raw signal sequence R=r1rn computes values shift=1/2(perc46.5(R)+perc53.5(R)) and scale=mediani(|rishift|), where percj(R) is the jth percentile of values r1rn. The normalized signal S=s1sn is obtained from R as si=(rishift)/scale.

Characteristics of a signal produced from a long repetitive region may differ substantially from a random sequence, and consequently, the values obtained by the median normalization may not match those in the expected signal level tables. This is likely one of the reasons why basecalling in these regions often exhibits large errors. To address the problem, we first use the standard normalization and our DTW algorithm to obtain an initial alignment of the raw signal to the finite-state automaton. We then polish the signal by a new spline-based polishing procedure (described below and in Supplementary Section S1), realign the signal to the automaton, and finally determine the final length of the STR.

To polish the signal, we aggregate all signal values aligned to a particular state j and compute their mean mj and standard deviation σj (Supplementary Fig. S1). State j represents a k-mer for which the lookup table of signal values shows an expected value ej. Ideally, values ej and mj will be identical, but in practice they often differ (Supplementary Fig. S3). To address this problem, we estimate a non-linear transformation function f interpolating pairs of values (mj,ej) by a B-spline (Supplementary Fig. S2). Before estimating f, we remove the pairs of (mj,ej) for which mj is too different from ej or where σj is very large, as in such cases, the state j most likely contains misaligned signal values. Finally, function f is used to transform the original signal values s1,s2,,sn to the polished signal values f(s1),f(s2),,f(sn). Figure 5 illustrates that even small changes resulting from polishing can lead to changes in the subsequent second alignment and the resulting repeat count.

Figure 5.

Figure 5.

Signal polishing effect on the alignment. Example alignment of the signal to the expected signal from the state automaton before (top) and after polishing (bottom). Before polishing, the normalized signal values are much higher than the expected signal values, and some of these differences decrease after polishing. More importantly, a spurious repeat in the highlighted windows disappears after polishing.

2.4 Adaptive restriction of event size

In some cases, the expected signal levels in different states are very similar to each other, and the noise present in the raw signal may drive the algorithm to match the same real event to multiple distant states by using several very small events to get to a new state. This would overestimate the STR length. We have solved this problem by requiring a given minimum number s of signal points to be aligned to each state, eliminating short events altogether.

Setting this parameter is important for accuracy. Using a too large fixed value may cause short consecutive events to be skipped altogether. To remediate this problem, we set the parameter adaptively, i.e. we decrease the minimum number of signal points adaptively when short consecutive events occur in the signal.

To find parts of the signal with short consecutive events, we use the result of alignment with s=4. Alignment result is split into non-overlapping windows containing signal points aligned with m events (in experiments, we used m=6). Each such window of w signal values is further looped through in an overlapping sliding window manner with a sliding window length 2p to obtain w2p+1 subwindows. Similarly as in Zhang et al. (2021), for each subwindow S of length 2p signal values, Welch’s unequal variances t-test is calculated as follows: t=(S1¯S2¯)/(yS12+yS22)/p, where S1 and S2 are S[0,p) and S[p,2p), respectively, y represents the standard deviation and S1¯ represents the mean of S1. When an extreme value t is calculated, it denotes that there is a significant context change in that subwindow. By simple local optima picking of t values from all subwindows, we obtain the set of peaks, denoting the expected number of context changes in the whole window. When the window has more context changes than m, WarpSTR presumes that some short events were missed. Thus, in the second signal polishing phase, the values of s are decreased by one for those particular windows.

2.5 Summarizing the results

After running the search algorithm, we obtain a length estimate for each read overlapping a particular STR locus. For diploid organisms, there could be two different alleles with different lengths (one inherited from each parent). Thus, we want to summarize all of the estimates by either one value (if the locus is homozygous) or two values (if the locus is heterozygous).

WarpSTR first filters out apparent outliers by removing estimates that are further than two standard deviations from the mean estimate. To separate reads into two groups, we used Bayesian GMMs with a maximum of two components from the Python scikit-learn library (Pedregosa et al. 2011) (with settings tied covariance type, the weight concentration prior 0.25, 5 initiations, 1000 iterations). WarpSTR declares an STR locus as homozygous if the number of reads in the smaller cluster does not exceed 20% of all unfiltered reads. In this case, the median allele length is reported as the final estimate. Otherwise, the STR locus is declared as heterozygous, and the medians of the two discovered clusters are reported.

3 Experiments

We have used publicly available nanopore data sets for human samples NA12878 (Jain et al. 2018) and NA24385 (https://labs.epi2me.io/gm24385_2020.09/ and https://labs.epi2me.io/gm24385_2020.11/) produced with R9.4.1 nanopore flow cells. We have downloaded the raw fast5 files and BAM files produced using minimap2 and human reference genome version hg38. The NA12878 data set was used for finding the optimal settings of hyperparameters while the NA24385 data set was used for final evaluation only.

To compare to other tools, which cannot handle complex STRs, we have selected STR regions with a single repeating pattern of length between 2 and 6 (i.e. no homopolymeric repeats); the reference sequence of STR region had to be longer than 30 bp and could not contain a homopolymeric sequence longer than 6. To this end, we have used information provided in HipSTR (Willems et al. 2017). To preserve the diversity of repeating patterns and genomic locations, we have grouped the loci based on the repeating pattern length and chromosome. The whole set was then randomly subsampled so that the number of loci from each group was at most 10. The resulting data set has 433 STR regions. The input regular expressions required by WarpSTR were automatically generated using the repeat pattern and the reference genome. In the second experiment, we use complex STRs to further demonstrate WarpSTR capabilities.

High-confidence benchmark VCF files assembled by a complex integrative analysis from multiple sequencing data sets (Zook et al. 2019, Wagner et al. 2022) (v.3.3.2 for NA12878, v.4.2.1 for NA24358) were used as a gold standard. Entries in the VCF files represent predicted variations compared to the reference genome, while also showing types of variations (i.e. whether it is a substitution, insertion, or deletion) and genotype information, i.e. whether the variation is homozygous, or heterozygous, in which case it could contain either one or two alternative genotypes.

As the VCF files stored combined predictions of eight different genotype callers, we first had to calculate the actual allele lengths for each site from the genotype. The allele lengths were computed by adding the length of insertions or subtracting the length of deletions from the reference allele length. For example, in case of reference allele length of 40 with a heterozygous insertion of a single alternative allele of length 4, and a homozygous deletion of 2 nucleotides, the final allele lengths were given as 38 (deletion of 2 nucleotides from the reference length of 40) and 42 (addition of 4 nucleotides and deletion of 2 nucleotides). To discard spurious or low-confidence allele length estimations, caller estimations were combined as follows. If only one caller had allele length estimation, then it was discarded. When two callers had allele length estimations and both of them were equal, then the estimation was taken as the true answer. When k3 callers had estimations and k1 of these callers had the same estimation, then it was taken as the true answer. In other cases or when there were no entries in the VCF file for a locus, the locus was discarded. This resulted in a final set of 241 loci with ground truth information.

3.1 Obtaining predictions for comparison

We compare our results to STRique (Giesselmann et al. 2019) (v0.4.2.) with default parameters. The length of flanks was set to 150 as recommended by STRique authors. We discarded reads where the repeat region was incorrectly extracted by STRique or with the weak signal alignment scores for flanks (<3.8) or where the resulting prediction was 0. While we wanted to include NanoSatellite (De Roeck et al. 2019) in the comparison, the tool is no longer maintained and authors recommend to use STRique as a replacement.

In addition to established tools, we also compare to a straightforward baseline approach, which obtains allele lengths directly from basecalled reads instead of using raw signal analysis. This emulates a straightforward analysis that would be performed by a user unconcerned with the issues introduced in the basecalling process. We first align the left and right flanking sequences of length 110 to each read. The read is used if both flanks map with high enough score and in a correct order. The allele length is then predicted as the distance between the flank matches. These predictions were further processed by filtering and clustering in the same manner as in WarpSTR, to obtain the final result.

3.2 Results

First, we compared STR length estimates for individual reads using the mean absolute error (MAE) and median absolute error (MedAE). For length calls y1,,yn and the gold standard answer (m1,m2), these measures are defined as MAE=avgi{min{|yim1|,|yim2|} and MedAE=mediani{min{|yim1|,|yim2|}. The results are shown in Table 1.

Table 1.

The performance comparison of WarpSTR, STRique, and the baseline on 241 loci.

WarpSTR Baseline STRique
Discarded reads 10.75% 10.75% 26.81%
# loci with the lowest MAE 204 36 0
# loci with the lowest MedAE 237 152 64
Global MAE 3.84 4.89 334.8
Global MedAE 0 1 4
Average D 0.73 0.96 11.74
Median D 0 0 6
# of loci with D = 0 205 182 37
Running time—1 thread 345.43 926.08
Running time—2 threads 196.79 481.83
Running time—4 threads 123.52 274.1
Running time—8 threads 106.47 256.72

WarpSTR produced estimates with the lowest MAE in 204 out of 241 loci. STRique performs poorly under this measure due to outliers occurring frequently even after score filtering recommended by the authors (we assume that this may be due to imprecise signal extraction in STRique). MedAE (median-based measure) is less prone to such outliers, but it created many ties between the tools. However, WarpSTR still performed the best in almost all cases. Supplementary Fig. S4 illustrates the distribution of MedAE across loci. When we combine predictions for reads from all loci to a single global MAE and MedAE score, WarpSTR also performs the best. WarpSTR uses data more efficiently than STRique (only 11% of reads were discarded due to inaccurate alignment of flanking sequences in WarpSTR, while STRique discards more than 26%).

Interestingly, the MAE for individual loci is influenced by the repeated pattern length (see Fig. 6). The patterns of length two are in general the most difficult for both methods. In such case, the expected signal has only two alternating values, a pattern easily confused with noise (see also Supplementary Section S2 for another class of difficult patterns). Even in these cases, WarpSTR outperforms basecalling in approximately half of the loci. For longer patterns, the error is generally lower and WarpSTR very consistently produces the best results.

Figure 6.

Figure 6.

MAE for WarpSTR and basecalling for individual loci colored by repeating pattern length. 25 loci that have very high MAE in both methods are not shown; these are most likely due to large expansions not captured by VCF callers in the gold standard.

So far we have compared predictions for individual reads. These were then combined to an overall genotype for each tool using the filtering and clustering procedure described in Section 2 (the same method was used for all tools). Let y1^,y2^ be the final predicted allele lengths and m1,m2 the true alleles, assuming y1^y2^, m1m2. As the error metric, we use D=|y1^m1|+|y2^m2| if the prediction or the true answer or both are heterozygous, and D=|y1^m1| if both are homozygous. As we can see in Table 1, WarpSTR performs better than other tools, providing perfect results in 85% of the loci (205 out of 241).

Table 1 also shows a comparison of the running time. As the tools have very different preprocessing and postprocessing steps, we only compare the core part of the pipelines (see Supplementary Section S3). WarpSTR is generally faster than STRique; apart from implementation differences, the main reason is that WarpSTR uses only matching states in the model, whereas STRique also uses insertion and deletion states.

One of the advantages of WarpSTR is its ability to analyze STRs with a complex structure, compared to other methods, which are limited by a single repeat pattern. We will illustrate this strength on two clinically relevant complex STR loci: HD (Huntington’s disease) (Andrew et al. 1994) and DM2 (Myotonic dystrophy type 2) (Radvanszky et al. 2021).

The HD locus consists of AGC and CGC repeats separated by a four-codon sequence AACAGCCGCCAC, which is not prone to repeat. Thus, the input sequence for WarpSTR was (AGC)AACAGCCGCCAC(CGC). For the reads included in the NA24385 sample, the WarpSTR predicted alleles of lengths 84 and 105, which agreed with the gold standard answer. Basecalling prediction was 83 and 102 (see Supplementary Section S4). STRique cannot be run on this locus, as it works only for simple repeat patterns.

Thanks to the usage of a state automaton, it is possible to count the number of occurrences of each repeating part of the input sequence, as given by parentheses in the WarpSTR input sequence, and these can be further clustered into genotypes. In the gold standard, HD has 17 and 24 repeats of AGC, and 12 repeats of CGC, and using WarpSTR, we came to the same result (see Supplementary Section S5).

The DM2 locus is particularly complex, consisting of CAGG repeats with CAGA or CAGC interruptions, followed by CAGA repeats and finally CA repeats. The input sequence for WarpSTR was given as ((CAGG)CAGM)(CAGA)(CA), where CAGM denotes both CAGA and CAGC interruptions. Figure 7 shows clustered predictions of DM2 for NA24385 subject split per repeat unit, also listing counts of individual interruptions. The first allele was predicted to contain 16 CAGG repeats, of which 2 are CAGA interruptions and 1 CAGC interruption, followed by 8 CAGA repeats and 21 CA repeats. The second allele was predicted as 18 CAGG repeats, of which 2 are CAGA interruptions and 1 CAGC interruption, followed by 6 CAGA repeats and 25 CA repeats. In the case of DM2 locus, it was not possible to derive the gold standard answer from the benchmark VCF files, since there was no agreement between individual variant callers used in the underlying integrative analysis. The baseline approach also failed, since the apparent errors introduced during basecalling made it impossible to reliably distinguish and count repeat units of this locus. Thus, we have no comparison to the gold standard or other approaches. This further demonstrates the ability of WarpSTR to analyze loci that are difficult to approach by other methods.

Figure 7.

Figure 7.

Clustered predictions of DM2 for NA24385 subject split per repeat unit.

4 Conclusion

In this article, we have investigated the problem of estimating the lengths of STRs from nanopore sequencing reads. Nanopore sequencing is a promising technology, providing very long sequencing reads at a reasonable cost and throughput (a single Promethion flowcell can sequence a human genome at a high depth). Long-read sequences provide a unique opportunity to study the role of STRs in various diseases, since previously employed short-read technology has problems with genotyping larger loci that can be even several hundred bases long.

Since basecalled reads from nanopore sequencing typically exhibit lower accuracy in STR regions, sometimes to the point at which the repeats are unrecognizable, we opted for analyzing the raw sequencing signal instead. We have adapted a commonly used DTW technique to work with representations of the STRs based on a finite-state automaton and determined the length of an STR by alignment of the raw signal to such an automaton. Proper scaling and shifting of the raw signal are also difficult in regions with repetitive sequences. To address this problem, we have developed a new method for polishing the raw signal using splines. Another innovation is an adaptive setting of the minimum event size for parts of the raw signal. The resulting software tool called WarpSTR is able to genotype STR alleles with high accuracy while outperforming baseline approach employing the basecalled sequences and STRique, another tool recently developed for this purpose.

In WarpSTR, basecalled sequences are used only to quickly locate reads relevant to a particular locus, as well as to isolate portions of raw sequencing signal that correspond to the STR locus. To this end, we only use flanking sequences, which are typically non-repetitive and easily mappable. Thus, even though WarpSTR uses the basecalled sequences, it does not suffer from the problems related to read misalignment that typically affects approaches based solely on read mapping.

One obvious extension would be to enrich the underlying finite-state automaton to allow for previously unmapped insertions, deletions, and substitutions. This would likely require employment of a probability-based scoring scheme instead of the simple scoring scheme used in the present work. In fact, defining a probabilistic model for the problem would allow us to use more advanced techniques to deal with high levels of uncertainty in nanopore raw signals, and perhaps allow us to predict a posterior distribution of STR length instead of a single value that can potentially harbor systematic errors.

Finally, at present, we use a simple clustering scheme to summarize the results as a genotyping call, but perhaps techniques modeling typical errors observed in STR analysis would lead to a further improvement.

Supplementary Material

btad388_Supplementary_Data

Contributor Information

Jozef Sitarčík, Comenius University Science Park, Bratislava 841 04, Slovakia; Geneton Ltd, Bratislava 841 04, Slovakia; Slovak Centre of Scientific and Technical Information, Bratislava 811 04, Slovakia.

Tomáš Vinař, Faculty of Mathematics, Physics, and Informatics, Comenius University, Bratislava 842 48, Slovakia.

Broňa Brejová, Faculty of Mathematics, Physics, and Informatics, Comenius University, Bratislava 842 48, Slovakia.

Werner Krampl, Comenius University Science Park, Bratislava 841 04, Slovakia; Geneton Ltd, Bratislava 841 04, Slovakia; Department of Molecular Biology, Faculty of Natural Sciences, Comenius University, Bratislava 841 04, Slovakia.

Jaroslav Budiš, Comenius University Science Park, Bratislava 841 04, Slovakia; Geneton Ltd, Bratislava 841 04, Slovakia; Slovak Centre of Scientific and Technical Information, Bratislava 811 04, Slovakia.

Ján Radvánszky, Comenius University Science Park, Bratislava 841 04, Slovakia; Geneton Ltd, Bratislava 841 04, Slovakia; Institute of Clinical and Translational Research, Biomedical Research Center, Slovak Academy of Sciences, Bratislava 845 05, Slovakia.

Mária Lucká, Slovak Centre of Scientific and Technical Information, Bratislava 811 04, Slovakia; KInIT—Kempelen Institute of Intelligent Technologies, Bottova, 7939/2A, Bratislava 811 09, Slovakia.

Supplementary data

Supplementary data are available at Bioinformatics online.

Funding

This research was supported by grants from the Slovak Research Grant Agency VEGA 1/0538/22 (T.V.) and 1/0463/20 (B.B.), by grants from Slovak Research and Development Agency APVV-18-0319 (J.R.) and APVV-18-0239 (T.V.), by funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 872539 (PANGAIA) and No 956229 (ALPACA), by Operational program Integrated Infrastructure within the project ‘Research in the SANET network and possibilities of its further use and development’, ITMS code 313011W988, co-financed by the European Regional Development Fund, by Life Defender—Protector of Life, ITMS code: 313010ASQ6, co-financed by the European Regional Development Fund (Operational Programme Integrated Infrastructure), and by Ministry of Education, Science, Research and Sport of the Slovak Republic under the Contract No. 0827/2021.

Conflict of interest

None declared.

References

  1. Andrew SE, Goldberg YP, Theilmann J. et al. A CCG repeat polymorphism adjacent to the CAG repeat in the Huntington disease gene: implications for diagnostic accuracy and predictive testing. Hum Mol Genet 1994;3:65–7. [DOI] [PubMed] [Google Scholar]
  2. Bahlo M, Bennett MF, Degorski P. et al. Recent advances in the detection of repeat expansions with short-read next-generation sequencing. F1000Res 2018;7:736. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bellman R, Kalaba R.. On adaptive control processes. IRE Trans Automat Contr 1959;4:1–9. [Google Scholar]
  4. Budiš J, Kucharík M, Ďuriš F. et al. Dante: genotyping of known complex and expanded short tandem repeats. Bioinformatics 2019;35:1310–7. [DOI] [PubMed] [Google Scholar]
  5. Dashnow H, Lek M, Phipson B. et al. STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biol 2018;19:121. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. De Roeck A, De Coster W, Bossaerts L. et al. NanoSatellite: accurate characterization of expanded tandem repeat length and sequence through whole genome long-read sequencing on PromethION. Genome Biol 2019;20:239. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Depienne C, Mandel J-L.. 30 years of repeat expansion disorders: what have we learned and what are the remaining challenges? Am J Hum Genet 2021;108:764–85. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Dolzhenko E, Deshpande V, Schlesinger F. et al. ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics 2019;35:4754–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Doyu M, Sobue G, Mukai E. et al. Severity of X-linked recessive bulbospinal neuronopathy correlates with size of the tandem CAG repeat in androgen receptor gene. Ann Neurol 1992;32:707–10. [DOI] [PubMed] [Google Scholar]
  10. Frith MC, Kawaguchi R.. Split-alignment of genomes finds orthologies more accurately. Genome Biol 2015;16:106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Gelfand Y, Rodriguez A, Benson G. et al. TRDB—the tandem repeats database. Nucleic Acids Res 2007;35:D80–87. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Giesselmann P, Brändl B, Raimondeau E. et al. Analysis of short tandem repeat expansions and their methylation state with nanopore sequencing. Nat Biotechnol 2019;37:1478–81. [DOI] [PubMed] [Google Scholar]
  13. Gymrek M. A genomic view of short tandem repeats. Curr Opin Genet Dev 2017;44:9–16. [DOI] [PubMed] [Google Scholar]
  14. Gymrek M, Willems T, Guilmatre A. et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat Genet 2016;48:22–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Halman A, Oshlack A.. Accuracy of short tandem repeats genotyping tools in whole exome sequencing data. F1000Res 2020;9:200. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Han R, Li Y, Gao X. et al. An accurate and rapid continuous wavelet dynamic time warping algorithm for end-to-end mapping in ultra-long nanopore sequencing. Bioinformatics 2018;34:i722–i731. [DOI] [PubMed] [Google Scholar]
  17. Han R, Wang S, Gao X. et al. Novel algorithms for efficient subsequence searching and mapping in nanopore raw signals towards targeted sequencing. Bioinformatics 2020;36:1333–43. [DOI] [PubMed] [Google Scholar]
  18. Jain M, Koren S, Miga KH. et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 2018;36:338–45. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Liu Q, Zhang P, Wang D. et al. Interrogating the “unsequenceable” genomic trinucleotide repeat disorders by long-read sequencing. Genome Med 2017;9:65. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Liu Q, Tong Y, Wang K. et al. Genome-wide detection of short tandem repeat expansions by long-read sequencing. BMC Bioinformatics 2020;21:542. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Loose M, Malla S, Stout M. et al. Real-time selective sequencing using nanopore technology. Nat Methods 2016;13:751–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Midha MK, Wu M, Chiu K-P. et al. Long-read sequencing in deciphering human genetics to a greater depth. Hum Genet 2019;138:1201–15. [DOI] [PubMed] [Google Scholar]
  23. Mitsuhashi S, Frith MC, Mizuguchi T. et al. Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads. Genome Biol 2019;20:58. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Musova Z, Mazanec R, Krepelova A. et al. Highly unstable sequence interruptions of the CTG repeat in the myotonic dystrophy gene. Am J Med Genet A 2009;149A:1365–74. [DOI] [PubMed] [Google Scholar]
  25. Pedregosa F et al. Scikit-learn: machine learning in Python. J Mach Learning Res 2011;12:2825–30. [Google Scholar]
  26. Radvansky J, Kadasi L.. The expanding world of myotonic dystrophies: how can they be detected? Genet Test Mol Biomarkers 2010;14:733–41. [DOI] [PubMed] [Google Scholar]
  27. Radvansky J, Ficek A, Minarik G. et al. Effect of unexpected sequence interruptions to conventional PCR and repeat primed PCR in myotonic dystrophy type 1 testing. Diagn Mol Pathol 2011;20:48–51. [DOI] [PubMed] [Google Scholar]
  28. Radvanszky J, Hyblova M, Radvanska E. et al. Characterisation of non-pathogenic premutation-range myotonic dystrophy type 2 alleles. JCM 2021;10:3934. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Sedlazeck FJ, Lee H, Darby CA. et al. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat Rev Genet 2018;19:329–46. [DOI] [PubMed] [Google Scholar]
  30. Ummat A, Bashir A.. Resolving complex tandem repeats with long reads. Bioinformatics 2014;30:3491–8. [DOI] [PubMed] [Google Scholar]
  31. Wagner J, Olson ND, Harris L. et al. Benchmarking challenging small variants with linked and long reads. Cell Genomics 2022;2:100128. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Wick RR, Judd LM, Holt KE. et al. Performance of neural network basecalling tools for oxford nanopore sequencing. Genome Biol 2019;20:129. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Willems T, Zielinski D, Yuan J. et al. Genome-wide profiling of heritable and de novo str variations. Nat Methods 2017;14:590–2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Zhang H, Li H, Jain C. et al. Real-time mapping of nanopore raw signals. Bioinformatics 2021;37:i477–i483. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Zook JM, McDaniel J, Olson ND. et al. An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol 2019;37:561–6. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

btad388_Supplementary_Data

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES