PLOS One. 2024 Apr 1;19(4):e0300545. doi: 10.1371/journal.pone.0300545

A comparison of software for analysis of rare and common short tandem repeat (STR) variation using human genome sequences from clinical and population-based samples

John W Oketch 1, Louise V Wain 2,3, Edward J Hollox 1,*
Editor: Paul Aurelian Gagniuc 4
PMCID: PMC10984476  PMID: 38558075

Abstract

Short tandem repeat (STR) variation is an often overlooked source of variation between genomes. STRs comprise about 3% of the human genome and are highly polymorphic. Some cause Mendelian disease, and others affect gene expression. Their contribution to common disease is not well-understood, but recent software tools designed to genotype STRs using short read sequencing data will help address this. Here, we compare software that genotypes common STRs and rarer STR expansions genome-wide, with the aim of applying them to population-scale genomes. By using the Genome-In-A-Bottle (GIAB) consortium and 1000 Genomes Project short-read sequencing data, we compare performance in terms of sequence length, depth, computing resources needed, genotyping accuracy and number of STRs genotyped. To ensure broad applicability of our findings, we also measure genotyping performance against a set of genomes from clinical samples with known STR expansions, and a set of STRs commonly used for forensic identification. We find that HipSTR, ExpansionHunter and GangSTR perform well in genotyping common STRs, including the CODIS 13 core STRs used for forensic analysis. GangSTR and ExpansionHunter outperform HipSTR for genotyping call rate and memory usage. ExpansionHunter denovo (EHdn), STRling and GangSTR outperformed STRetch for detecting expanded STRs, and EHdn and STRling used considerably less processor time compared to GangSTR. Analysis on shared genomic sequence data provided by the GIAB consortium allows future performance comparisons of new software approaches on a common set of data, facilitating comparisons and allowing researchers to choose the best software that fulfils their needs.

Introduction

Over the past two decades, most research on the contribution of genetic variation in the human genome to disease has focused on single nucleotide variation. Short tandem repeat (STR) variation has been generally overlooked, as it is not readily assayed by chip-based hybridisation approaches. STRs are short DNA sequence motifs, typically 2–6 bp in size, that are repeated multiple times in tandem. They are distributed throughout the human genome, comprising about 3% of the genome sequence, and have a mutation rate generally much higher than single nucleotide variants (around 2 × 10⁻³ per locus per generation for STRs compared to 10⁻⁸ for single nucleotide variants) [1–4]. They are frequently polymorphic with multiple alleles in a population, as expected by neutral population genetics theory for loci with high mutation rates.

Particular STRs cause a variety of severe rare monogenic diseases that are inherited in a Mendelian manner [5]. These include triplet-repeat expansion diseases, such as Huntington’s disease and myotonic dystrophy, caused by expansion of 3 bp repeat STRs in the coding region of HTT and in DMPK, respectively. Other disorders are caused by different repeat motifs in both coding and non-coding regions [5–8]. Emerging evidence shows that STRs modulate disease risk in several complex diseases such as autism and neurodegenerative disorders [9, 10].

STRs can affect the function of genes in three main ways: by directly altering the coding region of genes, by disrupting introns, or by affecting expression levels as part of enhancer elements. For example, in addition to CAG-repeat STRs that encode polyglutamine tracts in genes such as HTT, polymorphic STRs with GCA, GCC, GCG or GCT motifs encode polyalanine tracts of varying length in a variety of genes [11]. Other STRs can disrupt introns, such as the expansion of an intronic ATTCT-repeat in spinocerebellar ataxia type 10 [12]. It is well-established that STR variation is associated with expression levels of genes both at a single gene level and at a genome-wide level [13–15]. The extent to which STRs are responsible for genome-wide association signals detected using SNPs is unclear. Linkage disequilibrium between STRs and SNPs varies, but is expected to be lower than between two SNPs at the same recombination distance because the high mutation rate of STRs rapidly breaks down linkage disequilibrium over time. This is shown in attempts to impute STR length genotypes from SNP genotypes: biallelic STRs can generally be reliably imputed, but multi-allelic STRs are much less reliably imputed [16].

Given the potential importance of STRs in human disease, genome-wide studies examining STR variation are clearly warranted. The first generation of STR-calling software, including LobSTR, RepeatSeq and HipSTR [17–19], aimed to genotype STRs using sequencing reads that completely span the STR. This led to the limitation that STRs longer than the sequencing read length could not be genotyped. The second generation of STR-calling software uses information from the distance between the paired sequencing reads, in addition to direct sequence information across the repeat, to genotype the STR. This attempts to capture information on STRs that are longer than the sequencing read length and is particularly effective at identifying large expansions at an STR [20].

This study aimed to compare software, suitable for biobank-scale data analysis, that genotypes common STRs and identifies rarer STR expansions genome-wide on short-read genome sequences. This will inform approaches to investigate the role of genomic STR variants in polygenic human diseases. We selected software tools that had at least one of the following characteristics: not limited by read length, able to estimate STR allele length, or not previously compared. Three recent pieces of software are focused on profiling large repeat expansions: ExpansionHunter Denovo (EHdn) [21], STRetch [22] and STRling [23]. They all use sequence reads that span STRs and mate-pair distance information to identify large expansions. However, unlike STRetch, EHdn and STRling do not require a predefined catalogue of STRs to genotype an STR locus, so they can also identify novel STR loci not assembled in the reference genome. STRetch, STRling and EHdn can all detect and genotype large repeat expansions in a sample that are outliers from the population distribution of allele lengths. Here we assess the ability of STRetch, STRling and EHdn to identify rarer STR expansions by running them on 116 PCR-free short-read whole genome sequences containing clinically validated trinucleotide repeat expansions, normal repeat sizes and/or pre-mutations [24].

We also compare three software tools that focus on genome-wide genotyping of both short and expanded (longer than the sequencing read length) repeat arrays given a reference catalogue of STR genomic coordinates: ExpansionHunter, GangSTR and HipSTR. ExpansionHunter and GangSTR use short-read sequence mate-pair distance information together with STR-spanning sequence reads [24, 25]. Although HipSTR does not use mate-pair distance information and is therefore limited by sequence read length, a previous study showed that it outperformed its counterparts [19], and it was therefore included in this study. These three tools report diploid allele lengths. In addition, HipSTR reports the allele sequence. HipSTR was compared to both GangSTR and ExpansionHunter for genotyping common STRs genome-wide. Both GangSTR and ExpansionHunter can also identify large STR expansions and were compared against EHdn, STRling and STRetch to assess their ability to call rarer large STR expansions.

These tools have been partially compared in the literature by their authors, but a comprehensive comparison of STR-calling tools is still lacking. Two additional tools that call STR expansions, exSTRa and Tredparse, were excluded from this study: exSTRa does not estimate the allele length, and Tredparse cannot estimate repeat lengths longer than the sequencing mate-pair length [20, 25].

The genome data analysed here includes those from the Genome in a Bottle consortium and the 1000 Genomes project, which are collections of publicly-available samples. Our aim in this study is to benchmark different STR-calling software tools primarily using these reference samples. In particular, the Genome in a Bottle consortium provides key reference standard samples analysed by many different methods, so as new technical approaches become available we encourage others to use these same reference samples to facilitate fair comparison of software tools between studies [26].

Methods

Ethics statement

All sequencing data used in this study are from previous studies, from fully consented individuals. Sample collections are from the 1000 Genomes project (https://www.internationalgenome.org/sample_collection_principles/), Personal Genome Project Canada (https://personalgenomes.ca/), or NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research.

Samples and datasets

An overview of the approaches used in this paper is shown in Fig 1, and the different features of the software tools are shown in Table 1.

Fig 1. Flowchart of analyses and samples used in this study.

a) Genome in a bottle (GIAB) samples. b) 1000 Genomes project samples. c) Known clinical samples available from Coriell.

Table 1. Software for genotyping tandem repeats compared.

| Software | Repeat unit size range (bp) | STR catalogue required? | Genome-wide? | Approach | Estimates repeat length and/or sequence | Reference |
|---|---|---|---|---|---|---|
| HipSTR v.0.6.2 | 1–6 | Yes | Yes; only < read length | Individual calling | Both | Willems et al., 2017 |
| GangSTR v.2.5.4 | 1–20; 20+ | Yes | Yes | Individual calling | Only length | Mousavi et al., 2019 |
| ExpansionHunter v.5.0.0 | 1–6; 6+ | Yes | Yes | Individual calling | Only length | Dolzhenko et al., 2017 |
| STRetch | 1–6 | Yes | Yes; only expansions | Case/control outlier | Only length | Dashnow et al., 2018 |
| STRling | 1–6 | No | Yes; only expansions | Individual calling and case/control outlier | Both | Dashnow et al., 2022 |
| ExpansionHunter Denovo | 2–20; 20+ | No | Yes; only expansions | Case/control outlier | Both | Dolzhenko et al., 2020 |

The six tools were benchmarked by analysing three different datasets: firstly, ten gold-standard high coverage short-read human genome sequences from the Genome in a Bottle consortium (https://github.com/genome-in-a-bottle/giab_data_indexes), including two child-parent trios; secondly, four child-parent trios randomly selected from the 1000 Genomes project (https://www.internationalgenome.org/data); and thirdly, 116 samples with known trinucleotide repeat mutations obtained from the Coriell Institute (EGA accession number EGAD00001003562). The accession numbers for samples obtained from the Genome in a Bottle (GIAB) consortium and the 1000 Genomes Project are listed in Table 2.

Table 2. Samples used in this study.

| Sample ID | Population | Trio | Collection | Average fold sequence coverage | Paired-end read length (bp) | Accession number |
|---|---|---|---|---|---|---|
| HG002 | Ashkenazi | Son | GIAB | 30, 100, 300 | 150, 250 | NA24385 |
| HG003 | Ashkenazi | Father | GIAB | 30, 100, 300 | 150, 250 | NA24149 |
| HG004 | Ashkenazi | Mother | GIAB | 30, 100, 300 | 150, 250 | NA24143 |
| HG005 | Chinese | Son | GIAB | 30, 100, 300 | 250 | NA24631 |
| HG006 | Chinese | Father | GIAB | 30, 100 | 150 | NA24694 |
| HG007 | Chinese | Mother | GIAB | 30, 100 | 150 | NA24691 |
| NA12878 | CEU | - | GIAB | 30, 100, 300 | 150 | NA12878 |
| HG00403 | CHS | Father | 1000G | 30 | 150 | HG00403 |
| HG00404 | CHS | Mother | 1000G | 30 | 150 | HG00404 |
| HG00405 | CHS | Son | 1000G | 30 | 150 | HG00405 |
| NA18485 | YRI | Son | 1000G | 30 | 150 | NA18485 |
| NA18487 | YRI | Father | 1000G | 30 | 150 | NA18487 |
| NA18489 | YRI | Mother | 1000G | 30 | 150 | NA18489 |
| HG01500 | IBS | Father | 1000G | 30 | 150 | HG01500 |
| HG01501 | IBS | Mother | 1000G | 30 | 150 | HG01501 |
| HG01502 | IBS | Son | 1000G | 30 | 150 | HG01502 |
| NA06984 | CEU | Father | 1000G | 30 | 150 | NA06984 |
| NA06989 | CEU | Mother | 1000G | 30 | 150 | NA06989 |
| NA12329 | CEU | Daughter | 1000G | 30 | 150 | NA12329 |

GIAB = Genome in a Bottle; 1000G = 1000 Genomes Project

The 116 whole genomes [24] include samples previously validated as carrying one of the following disease-causing STR expansions: Fragile X Syndrome (FMR1), Huntington disease (HTT), Friedreich’s ataxia (FXN), Myotonic Dystrophy (DM1), Spinocerebellar Ataxia 1/3 (ATXN1/3), Spinal and Bulbar Muscular Atrophy (SBMA). The samples had been sequenced as 2x150 bp reads on Illumina HiSeqX, and repeat expansions had previously been detected using ExpansionHunter and standard PCR techniques [24]. All samples had been Illumina sequenced using PCR-free methods. The sequencing methods for these data sets are described in [24, 26, 27].

Bam file preparation

The seven GIAB bam files were down-sampled from their original coverage of either 300x or 100x to a final coverage of ~30x. In summary, five of the seven GIAB samples had an initial genome coverage of ~300x and two had a genome coverage of ~100x. The 300x genomes were down-sampled to 100x and all samples were down-sampled further to ~30x genome coverage using Picard v2.6.0 software [28]. The final 30x coverage was chosen to assess the applicability of STR genotyping tools on large cohorts of genomes sequenced at 30x genome coverage, which is a standard coverage adopted by the 1000 Genomes project, for example. Duplicate reads were removed from the bam files and the quality of the bam files was assessed using Qualimap v2.2.1 [29]. This data set was collated with twelve 30x genomes from the 1000 Genomes Project (1000G). Overall, we generated 3 datasets: 300x (n = 5), 100x (n = 7) and 30x (n = 19) genomes, including 6 child-parent trios, with three individuals sequenced using both 150 bp and 250 bp paired-end reads (Table 2). These sequences had been aligned to the GRCh38 human genome assembly [26]. All the bam files were sorted and indexed using samtools v1.9 [30] before STR genotype calling.
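The down-sampling arithmetic can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name is hypothetical, and we assume that the fraction of reads to retain is simply the target coverage divided by the starting coverage, as would be supplied to a read-level subsampler as a keep-probability.

```python
def keep_fraction(original_cov: float, target_cov: float) -> float:
    """Fraction of reads to keep so that mean coverage drops to target_cov."""
    if original_cov <= 0 or target_cov <= 0:
        raise ValueError("coverages must be positive")
    if target_cov > original_cov:
        raise ValueError("cannot down-sample to a higher coverage")
    return target_cov / original_cov


# Two-step scheme described in the text: 300x -> 100x, then 100x -> 30x.
step1 = keep_fraction(300, 100)
step2 = keep_fraction(100, 30)
print(round(step1, 3), round(step2, 3))
```

Because random subsampling only approximates the target, the resulting coverage would still need to be checked, which is consistent with the ~30x figure quoted above.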

Computing memory and run time evaluation

To evaluate compute resource usage, we recorded the time and memory taken to process each bam file from five samples sequenced with Illumina short reads at 30x coverage, using a single core of an Intel Xeon Skylake CPU running at 2.6 GHz clock speed with 80 GB RAM available. GangSTR, HipSTR, STRetch and ExpansionHunter were run on a custom STR catalogue containing 811899 STRs of 2–6 bp repeat units. This catalogue was built from GangSTR’s catalogue (hg38_v13.bed), available at https://github.com/gymreklab/GangSTR, limited to 2–6 bp repeat units. Because ExpansionHunter is sensitive to the presence of ambiguous bases (‘N’s) in the reference genome around the STR, some of the loci were dropped, leaving a total of 790661 loci. This custom catalogue is available at: https://doi.org/10.25392/leicester.data.22041020. GangSTR, HipSTR and ExpansionHunter were run at default parameters using this custom catalogue. STRetch was run using its whole genome sequencing pipeline starting from mapped bam files [22]. STRling and ExpansionHunter Denovo (EHdn) do not require a reference catalogue. STRling was run using its single sample pipeline [23]. Both EHdn and STRetch require a control set of samples. To explore STR expansion profiles in each sample with EHdn, the sample to be examined was treated as a case sample and the rest of the samples as controls, and outlier analysis was performed. For STRetch, we built a control set from the remaining subset of the genomes analysed in this study using the STRetch pipeline [22].
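Restricting a catalogue to 2–6 bp repeat units can be sketched as follows. This is only an illustration, not the code used in the study: the function name is hypothetical, and we assume a tab-separated BED-like layout with columns chrom, start, end, motif length and motif.

```python
from typing import Iterable, Iterator


def filter_catalogue(lines: Iterable[str], min_unit: int = 2,
                     max_unit: int = 6) -> Iterator[str]:
    """Yield catalogue lines whose repeat unit length is in [min_unit, max_unit].

    Assumed columns (tab-separated): chrom, start, end, motif_length, motif.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        motif = fields[4]
        if min_unit <= len(motif) <= max_unit:
            yield line.rstrip("\n")


# Toy records, not taken from any real catalogue.
example = [
    "chr1\t100\t120\t1\tA",                      # 1 bp unit: dropped
    "chr1\t200\t223\t4\tATCT",                   # 4 bp unit: kept
    "chr2\t300\t339\t20\tACGTACGTACGTACGTACGT",  # 20 bp unit: dropped
]
kept = list(filter_catalogue(example))
print(len(kept))
```

Removing loci flanked by ambiguous reference bases, as described above for ExpansionHunter, would need the reference sequence and is not covered by this sketch.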

Comparison of software for genotyping STRs

HipSTR, GangSTR and ExpansionHunter performance was compared genome-wide by assessing (a) the proportion of STRs genotyped and (b) the accuracy of the calls made, by analysing Mendelian inheritance patterns in 6 child-parent trios and by comparing call concordance for the same sample across varying sequence depths and sequence read lengths. First, each tool was run using its own STR catalogue, published and tested with the respective tool. These catalogues are of different sizes, with GangSTR listing 832,380 STR loci, HipSTR listing 1,638,945 loci and ExpansionHunter listing 174,293 loci. These catalogues are available at: https://github.com/HipSTR-Tool/HipSTR, https://github.com/gymreklab/GangSTR and https://github.com/Illumina/RepeatCatalogs. The HipSTR and ExpansionHunter catalogues consist of STRs with 2–6 bp repeat units, while the GangSTR catalogue consists of STRs with 2–20 bp repeat units. GangSTR was run using default parameters. HipSTR was run using two non-default parameters: --min-reads 10 and --def-stutter-model. These lower the minimum total number of reads required to genotype a locus from the default of 100 to 10, and apply a default stutter model, as recommended when running few samples [19]. ExpansionHunter was run at default parameters (see the code referenced in the data availability section). For comparisons between the three software tools, they were all run using the custom catalogue of 790661 STRs described above. The raw variant call format (VCF) files were filtered using the dumpSTR tool to remove calls with read coverage below 10 and, by setting the maximum call depth for all tools to 1000, those with abnormally high coverage [31]. Additional parameters were used for GangSTR, --filter-spanbound-only and --filter-badCI, to filter out calls where only spanning or bounding reads were found, and calls with maximum likelihood genotype estimates outside the 95% bootstrap confidence interval, as recommended by the GangSTR authors [25].

The performance of the three methods was further assessed by comparing the genotype calls to capillary electrophoresis (CE) data [32] across 13 forensic STRs, known as the core Combined DNA Index System (CODIS) STRs. This panel was further extended to include 9 additional forensic STRs, and the accuracy of the three software tools was assessed by analysing Mendelian inheritance patterns in the 5 parent-child trios.
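The depth thresholds described above (minimum 10, maximum 1000) can be illustrated on plain records. This sketch is not dumpSTR's implementation: the record layout and function name are hypothetical, and treating both bounds as inclusive is our assumption.

```python
def passes_depth_filter(depth, min_depth=10, max_depth=1000):
    """Keep a call only if its read depth lies within [min_depth, max_depth]."""
    return depth is not None and min_depth <= depth <= max_depth


# Toy calls; locus names and depths are invented for illustration.
calls = [
    {"locus": "chr1:200", "depth": 8},     # too shallow
    {"locus": "chr1:900", "depth": 31},    # kept
    {"locus": "chr2:300", "depth": 4200},  # abnormally deep
    {"locus": "chr3:150", "depth": None},  # no depth reported
]
kept = [c["locus"] for c in calls if passes_depth_filter(c["depth"])]
print(kept)
```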

Comparison of software for detecting and genotyping STR expansions

To assess the sensitivity of GangSTR, EHdn, STRling and STRetch at detecting STR expansions, the four tools were run on samples with known clinical STR expansions that had been analysed by ExpansionHunter in a previous publication [24]. EHdn was run in case-control mode rather than outlier mode [21]. For each clinical STR expansion locus, the samples with expansions at that locus were treated as cases, and samples with different STR expansions or normal allele sizes were used as controls. For STRetch, samples were used as controls for each other, and the depth-normalised read counts were compared to identify significantly expanded loci in each sample relative to the rest. Both STRetch and GangSTR were run using STR catalogues containing disease loci (https://stripy.org/expansionhunter-catalog-creator). GangSTR was run with the same parameters described above. However, for Fragile X Syndrome and Spinal and Bulbar Muscular Atrophy, GangSTR was run by specifying the ploidy of the X chromosome for male and female samples. STRling was run using its joint calling pipeline [23]. Our results were compared to the previously reported validated calls for these samples [24].
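The per-locus case/control assignment described above can be sketched as a simple labelling step. The sample identifiers, data layout and function name below are hypothetical, and the sketch says nothing about the input format any of the tools actually expects.

```python
def label_cases(sample_to_locus, locus):
    """Label samples with an expansion at `locus` as cases, all others as controls."""
    return {
        sample: ("case" if expanded == locus else "control")
        for sample, expanded in sample_to_locus.items()
    }


# Hypothetical samples; None marks a sample with normal allele sizes.
samples = {"S1": "HTT", "S2": "FMR1", "S3": "HTT", "S4": None}
labels = label_cases(samples, "HTT")
print(labels)
```

Repeating the call once per disease locus gives one case/control split per locus, so a sample that is a case for one locus serves as a control for the others.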

Results

Computing resource usage

For large scale studies involving thousands of samples, the relative processing time of different software tools can become an important aspect of software choice. We first compared the software tools that use an STR catalogue (Table 3). GangSTR, HipSTR, ExpansionHunter and STRetch were run on the same STR catalogue containing 790661 loci. GangSTR took an average of 10 hours per diploid genome across five samples, STRetch took about 7 hours and HipSTR took about 3 hours. ExpansionHunter in default mode (seeking mode) needed more than 7 days and was not run to completion. The software tools that do not require an STR catalogue, EHdn and STRling, took around 30 minutes per diploid genome.

Table 3. Comparison of processing times and memory usage of STR-calling software.

CPU time (hours:minutes:seconds)

| Sample | GangSTR | HipSTR | STRetch | EH (default mode) | EH (streaming mode) | EHdn | STRling |
|---|---|---|---|---|---|---|---|
| NA18485 | 10:33:01 | 3:58:53 | 06:39:38 | > 168 h | 01:33:22 | 00:30:50 | 00:24:33 |
| NA18487 | 10:27:03 | 3:00:32 | 07:10:47 | > 168 h | 01:28:28 | 00:30:09 | 00:24:03 |
| NA06984 | 10:13:42 | 2:47:50 | 06:21:20 | > 168 h | 01:23:40 | 00:30:33 | 00:24:19 |
| NA06989 | 8:41:14 | 2:34:05 | 05:57:05 | > 168 h | 01:17:33 | 00:27:24 | 00:22:11 |
| NA12329 | 10:24:04 | 2:50:49 | 07:15:59 | > 168 h | 01:19:26 | 00:29:49 | 00:23:42 |
| Average | 10:03:48 | 03:02:25 | 06:40:57 | n/a | 01:24:29 | 00:29:45 | 00:23:45 |

RAM (GB)

| Sample | GangSTR | HipSTR | STRetch | EH (default mode) | EH (streaming mode) | EHdn | STRling |
|---|---|---|---|---|---|---|---|
| NA18485 | 0.04 | 0.50 | 17.63 | n/a | 69.80 | 0.57 | 1.16 |
| NA18487 | 0.03 | 0.49 | 17.60 | n/a | 72.56 | 0.48 | 0.87 |
| NA06984 | 0.04 | 0.40 | 17.42 | n/a | 70.11 | 0.58 | 1.16 |
| NA06989 | 0.04 | 0.42 | 17.51 | n/a | 69.82 | 0.57 | 0.82 |
| NA12329 | 0.05 | 0.34 | 17.40 | n/a | 72.57 | 0.51 | 0.76 |
| Average | 0.04 | 0.43 | 17.51 | n/a | 70.97 | 0.54 | 0.96 |

STRetch and ExpansionHunter are the only software tools tested with a multithreading option. For STRetch, increasing the number of cores from 1 to 28 reduces the time from 7 hours to about 1.5 hours per diploid genome. Using ExpansionHunter’s streaming mode (which is recommended for large genomic catalogues) with at least 16 threads noticeably improved its performance, reducing its run time to an average of 1.5 hours per diploid genome (Table 3). ExpansionHunter was notable for its high memory usage, typically around 70 GB, in contrast to the other software tools, which used either ~17 GB (STRetch) or roughly 1 GB or less (Table 3).
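The per-tool averages in Table 3 can be reproduced from the per-sample CPU times with a few lines of arithmetic. The helper names below are ours, and truncating (rather than rounding) fractional seconds is an assumption chosen because it matches the tabulated averages.

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to a number of seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s


def to_hms(seconds: float) -> str:
    """Format seconds as hh:mm:ss, truncating any fractional second."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def mean_time(times):
    """Mean of a list of h:mm:ss strings, returned as hh:mm:ss."""
    return to_hms(sum(to_seconds(t) for t in times) / len(times))


# Per-sample CPU times transcribed from Table 3.
gangstr = ["10:33:01", "10:27:03", "10:13:42", "8:41:14", "10:24:04"]
hipstr = ["3:58:53", "3:00:32", "2:47:50", "2:34:05", "2:50:49"]
print(mean_time(gangstr), mean_time(hipstr))
```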

Number of common STR loci genotyped and genotype call concordance of common STRs

We compared GangSTR, ExpansionHunter and HipSTR, which are the three software tools that aim to genotype all STRs across the genome. They all rely on a catalogue of STR loci provided by the user, and the number of loci in that catalogue provides the upper limit on the number of genotyped STRs. Therefore, our results are presented as a percentage of STRs called using that particular catalogue. We first used the different catalogues provided with each software tool, and found several patterns common to all three tools (S1 Table). Firstly, all software tools call a high proportion of the STRs in their respective catalogues across all samples, with GangSTR and ExpansionHunter calling a higher proportion than HipSTR, although the different sizes of their catalogues must be taken into account. Secondly, more calls are made from 100x coverage data than 30x. The increase is small across most samples but is large (over 10 percentage points) for HG005 for HipSTR and GangSTR, where both tools call fewer than 90% of STRs in the 30x coverage data. Taken together, 30x is sufficient for STR calling, with little benefit from increasing sequencing depth. For read length, 150 bp paired-end is sufficient and increased read length seems slightly detrimental, at least for HipSTR, which made fewer calls across the four samples that were sequenced at 2x250 bp.

To ensure a fair comparison across tools, we repeated our analysis using the same STR catalogue containing 790661 loci (Table 4). For all software tools and samples, more than 90% of STRs in this catalogue were called, with most calling >99%. We compared the actual genotype calls made on the STRs called by the three tools by plotting each allele call for each tool on a scatterplot, normalised against the allele in the reference genome (Fig 2). This shows that over 85% of the calls were identical between the three methods, and of those calls, at least 89.0% were identical with the reference genome. There was no strong bias in under- or over-calling repeat number by HipSTR, ExpansionHunter or GangSTR. There was high concordance of the genotype calls made at 2x150 bp compared to 2x250 bp for each tool (S1 Table).
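The two quantities used in this comparison, an allele call normalised against the reference and the fraction of shared calls that agree between tools, can be sketched as below. The genotypes are invented, the function names are hypothetical, and treating genotypes as unordered pairs of repeat-unit counts is our assumption.

```python
def offset_from_reference(allele_units, ref_units):
    """Allele length in repeat units relative to the reference (0 = same)."""
    return allele_units - ref_units


def concordance(calls_a, calls_b):
    """Fraction of loci called by both tools where the unordered genotypes agree."""
    shared = [locus for locus in calls_a if locus in calls_b]
    if not shared:
        return 0.0
    same = sum(sorted(calls_a[l]) == sorted(calls_b[l]) for l in shared)
    return same / len(shared)


# Hypothetical genotypes (repeat units per allele) from two tools.
tool_a = {"locus1": (10, 11), "locus2": (12, 12), "locus3": (7, 8)}
tool_b = {"locus1": (11, 10), "locus2": (12, 13), "locus4": (5, 5)}
print(concordance(tool_a, tool_b), offset_from_reference(8, 10))
```

Only loci called by both tools enter the denominator, which mirrors the restriction above to STRs called by all tools being compared.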

Table 4. Percentage of loci called in a shared STR catalogue by GangSTR, HipSTR and ExpansionHunter.

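GangSTR

| Sample | Read length | Called at 30x (%) | Called at 100x (%) | Calls in both (%) | Identical by allele length (%) | Identical by allele sequence (%) |
|---|---|---|---|---|---|---|
| HG002 | 2x150 | 99.7 | 99.7 | 99.6 | 99.4 | NA |
| HG003 | 2x150 | 99.7 | 99.7 | 99.6 | 99.4 | NA |
| HG004 | 2x150 | 99.2 | 99.2 | 99.1 | 99.4 | NA |
| HG005 | 2x250 | 98.5 | 99.6 | 98.4 | 99.6 | NA |
| HG006 | 2x150 | 99.4 | 99.5 | 99.3 | 99.3 | NA |
| HG007 | 2x150 | 99.0 | 99.1 | 98.9 | 99.3 | NA |
| NA12878 | 2x150 | 99.1 | 99.1 | 98.9 | 99.3 | NA |
| HG002 | 2x250 | 96.1 | NA | NA | NA | NA |
| HG003 | 2x250 | 95.8 | NA | NA | NA | NA |
| HG004 | 2x250 | 96.8 | NA | NA | NA | NA |

HipSTR

| Sample | Read length | Called at 30x (%) | Called at 100x (%) | Calls in both (%) | Identical by allele length (%) | Identical by allele sequence (%) |
|---|---|---|---|---|---|---|
| HG002 | 2x150 | 96.2 | 96.0 | 95.5 | 99.8 | 99.9 |
| HG003 | 2x150 | 96.1 | 96.0 | 95.5 | 99.8 | 99.9 |
| HG004 | 2x150 | 95.8 | 95.5 | 95.2 | 99.8 | 99.9 |
| HG005 | 2x250 | 93.9 | 94.3 | 92.9 | 99.7 | 99.1 |
| HG006 | 2x150 | 96.0 | 95.9 | 95.0 | 99.8 | 99.9 |
| HG007 | 2x150 | 95.7 | 95.4 | 95.0 | 99.8 | 99.8 |
| NA12878 | 2x150 | 95.7 | 95.4 | 95.0 | 99.8 | 99.8 |
| HG002 | 2x250 | 94.1 | NA | NA | NA | NA |
| HG003 | 2x250 | 93.9 | NA | NA | NA | NA |
| HG004 | 2x250 | 93.9 | NA | NA | NA | NA |

ExpansionHunter

| Sample | Read length | Called at 30x (%) | Called at 100x (%) | Calls in both (%) | Identical by allele length (%) | Identical by allele sequence (%) |
|---|---|---|---|---|---|---|
| HG002 | 2x150 | 99.8 | 99.9 | 99.8 | 98.9 | NA |
| HG003 | 2x150 | 99.8 | 99.9 | 99.8 | 99.8 | NA |
| HG004 | 2x150 | 99.3 | 99.5 | 99.3 | 98.9 | NA |
| HG005 | 2x250 | 99.8 | 99.9 | 99.8 | 98.4 | NA |
| HG006 | 2x150 | 99.7 | 99.9 | 99.7 | 98.7 | NA |
| HG007 | 2x150 | 99.3 | 99.5 | 99.3 | 98.7 | NA |
| NA12878 | 2x150 | 99.3 | 99.5 | 99.3 | 98.8 | NA |
| HG002 | 2x250 | 99.3 | NA | NA | NA | NA |
| HG003 | 2x250 | 99.8 | NA | NA | NA | NA |
| HG004 | 2x250 | 99.4 | NA | NA | NA | NA |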

NA: not analysed

Fig 2. Comparison of STR calls from ExpansionHunter, HipSTR and GangSTR.

A) Comparison of the HG002 genotypes of the STRs called by both GangSTR and HipSTR. The x-axis represents the call made by GangSTR in comparison to the reference sequence (0 = same as reference sequence, -100 represents 100 fewer repeat units than the reference sequence). The y-axis represents the call made by HipSTR. B) Comparison of the genotypes of the STRs called by either GangSTR or HipSTR compared to ExpansionHunter (x-axis). The dotted red lines show the correlation between the calls compared.

Accuracy of common STR calls assessed using Mendelian inheritance

Accuracy of the genotypes made by GangSTR, ExpansionHunter and HipSTR can be measured using the expectation of Mendelian inheritance of alleles, whereby we would expect to see one allele of a child’s genotype in their mother and one allele in their father. Any allele in a child not observed in one of the parents is due either to a genotyping error in the child or the parent, or a de novo mutation in the child. Assuming a rate of de novo mutations at STR loci of 10⁻⁴ per generation [33], the contribution of de novo mutation to observed inheritance errors is minimal, and therefore differences in inconsistencies can be attributed to differences in STR genotyping accuracy.

Analysis of Mendelian inconsistencies using bcftools software across the five mother-father-child trios with 150 bp sequence reads shows very little difference in genotyping accuracy across the three methods. GangSTR and HipSTR showed the highest genotyping accuracy (Table 5). Comparison between the HG002, HG003, HG004 trio sequenced at 100x coverage and at 30x coverage shows that increased sequencing coverage does not improve genotyping accuracy.
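The Mendelian consistency rule described above can be written as a short check. This sketch only illustrates the rule for autosomal diploid genotypes; it is not how bcftools implements the test, and the genotypes and function name are invented for illustration.

```python
def mendelian_consistent(child, mother, father):
    """True if one child allele can come from the mother and the other from the father."""
    c1, c2 = child
    return (c1 in mother and c2 in father) or (c2 in mother and c1 in father)


# Toy trios as (child, mother, father) genotypes in repeat units.
trios = [
    ((10, 11), (10, 12), (11, 13)),  # consistent: 10 from mother, 11 from father
    ((10, 10), (10, 12), (11, 13)),  # inconsistent: father carries no 10 allele
    ((12, 14), (12, 12), (12, 13)),  # inconsistent: 14 is absent from both parents
]
results = [mendelian_consistent(*t) for t in trios]
percent_consistent = 100 * sum(results) / len(results)
print(results, round(percent_consistent, 1))
```

Applied across all loci called in a trio, the percentage of consistent loci corresponds to the kind of figure reported per tool in Table 5.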

Table 5. Mendelian inheritance of STR alleles.

| Parent-offspring trio | Coverage | GangSTR Mendelian consistent (%) | HipSTR Mendelian consistent (%) | ExpansionHunter Mendelian consistent (%) |
|---|---|---|---|---|
| HG002,HG003,HG004 | 100x | 99.5 | 99.6 | 99.2 |
| HG002,HG003,HG004 | 30x | 99.7 | 99.5 | 98.7 |
| NA18485,NA18489,NA18487 | 30x | 99.6 | 99.4 | 98.5 |
| NA06984,NA06989,NA12329 | 30x | 99.8 | 99.9 | 98.9 |
| HG00403,HG00404,HG00405 | 30x | 99.9 | 99.9 | 99.0 |
| HG01500,HG01501,HG01502 | 30x | 99.7 | 99.6 | 98.7 |

To investigate possible reasons for erroneous genotype calls, we stratified the genotypes that showed Mendelian inconsistencies by repeat unit size, normalised against the repeat-unit counts in the reference list (Fig 3). For all three tools (ExpansionHunter, HipSTR and GangSTR), 2 bp repeat unit STRs are overrepresented among the incorrect genotype calls. Therefore, for these methods, 2 bp repeat STR genotypes have the highest error rate, over twice the error rate of other STRs (S1 Fig).
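The stratification by repeat unit size can be sketched as a ratio of counts: inconsistent calls per unit size divided by the number of catalogue loci with that unit size. The data below are toy values and the function name is hypothetical; this is an illustration of the normalisation, not the analysis code used in the study.

```python
from collections import Counter


def error_rate_by_unit(inconsistent_motifs, catalogue_motifs):
    """Mendelian-inconsistent calls per repeat unit size, divided by catalogue counts."""
    errors = Counter(len(m) for m in inconsistent_motifs)
    totals = Counter(len(m) for m in catalogue_motifs)
    return {size: errors.get(size, 0) / n for size, n in sorted(totals.items())}


# Toy data: motifs of all catalogue loci and of loci with inconsistent calls.
catalogue = ["AC"] * 4 + ["ATT"] * 4 + ["ATCT"] * 2
inconsistent = ["AC", "AC", "ATT"]
rates = error_rate_by_unit(inconsistent, catalogue)
print(rates)
```

Normalising by catalogue counts matters because an excess of inconsistent calls for a given unit size could otherwise just reflect that unit size being more common in the catalogue.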

Fig 3. Repeat unit size distribution of Mendelian inconsistencies.

a) ExpansionHunter, GangSTR and HipSTR calls that are Mendelian inconsistent as a proportion of motif counts in the catalogue, b) all STRs in the catalogue used by the three software tools.

Genotyping of known forensic STR loci

Thirteen STRs, known as the core Combined DNA Index System (CODIS) STRs, are long STRs that show extensive variation between individuals, and are used across many different forensic STR panels for identification of individuals from genomic DNA [34, 35]. Analysis of the performance of GangSTR, ExpansionHunter and HipSTR in calling these STRs in comparison to capillary electrophoresis (CE) data is useful both for practical forensic analysis and as a measure of the error rate of the software, if we accept that capillary electrophoresis can be regarded as a gold standard. Matched capillary electrophoresis data for the 13 core STRs, generated using the forensic-standard Promega Powerplex Fusion 24 assay, have been published [32], and our calls for NA12878 were compared against these data (Table 6). ExpansionHunter calls matched CE data at 10/13 loci, with one allele called incorrectly at the TH01 locus and two alleles at both the FGA and D21S11 loci. GangSTR genotypes matched CE data for 11/13 loci, with one locus (D21S11) not called and one (TH01) called incorrectly. HipSTR showed the same results as GangSTR, except that, in addition, it incorrectly called one allele at D13S317. The three tools were also consistent across 19/22 forensic STRs analysed for Mendelian inheritance errors across the six trios.

Table 6. Comparison of STR genotype calls at core forensic loci for NA12878.

| Locus | Chrom | Start | End | Motif | GangSTR 100x | GangSTR 30x | HipSTR 100x | HipSTR 30x | ExpansionHunter 100x | ExpansionHunter 30x | CE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CSF1PO | 5 | 150076324 | 150076375 | ATCT | 10,11 | 10,11 | 10,11 | 10,11 | 10,11 | 10,11 | 10,11 |
| D5S818 | 5 | 123775556 | 123775599 | ATCT | 12,12 | 12,12 | 12,12 | 12,12 | 12,12 | 12,12 | 12,12 |
| D7S820 | 7 | 84160226 | 84160277 | TATC | 8,10 | 8,10 | 8,10 | 8,10 | 8,10 | 8,10 | 8,10 |
| D13S317 | 13 | 82148025 | 82148068 | TATC | 11,12 | 11,12 | 11.3,12.3 | 11.3,12.3 | 11,12 | 11,12 | 11,12 |
| D16S539 | 16 | 86352702 | 86352745 | GATA | 10,11 | 10,11 | 10,11 | 10,11 | 10,12 | 10,11 | 10,11 |
| D21S11 | 21 | 19181973 | 19182099 | TCTA | - | - | - | - | 18,34 | 34,34 | 30,30 |
| TH01 | 11 | 2171088 | 2171115 | AATG | 7,8 | 7,7 | 7,9.8 | 7,9.8 | 7,10 | 7,10 | 7,9.3 |
| TPOX | 2 | 1489653 | 1489684 | AATG | 8,8 | 8,8 | 8,8 | 8,8 | 8,8 | 8,8 | 8,8 |
| vWA | 12 | 5983977 | 5984044 | AGAT | 15,17 | 15,15 | 15,17 | 15,17 | 15,17 | 15,17 | 15,17 |
| D3S1358 | 3 | 45540739 | 45540802 | TCTA | 16,17 | 16,17 | 16,17 | 16,17 | 16,17 | 16,17 | 16,17 |
| D8S1179 | 8 | 124894865 | 124894916 | TATC | 12,12 | 12,12 | 12,12 | 12,12 | 12,13 | 12,13 | 12,12 |
| D18S51 | 18 | 63281667 | 63281738 | AGAA | 16,17 | 16,17 | 16,17 | 16,17 | 16,17 | 16,17 | 16,17 |
| FGA | 4 | 154587736 | 154587823 | GGAA | 22,24 | 22,24 | 22,24 | 22,24 | 23,25 | 23,25 | 22,24 |

N/B: F = failed QC; CE = capillary electrophoresis genotype
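Counting matches against CE genotypes can be sketched using one column of Table 6. The example below uses the GangSTR 100x calls as transcribed from the table; the function name is ours, and alleles are kept as strings because forensic labels such as 9.3 denote nine repeats plus three bases rather than a decimal number.

```python
def matches(call, truth):
    """Compare two unordered genotypes; a missing call (None) never matches."""
    return call is not None and sorted(call) == sorted(truth)


# CE genotypes for NA12878, transcribed from Table 6.
ce = {
    "CSF1PO": ("10", "11"), "D5S818": ("12", "12"), "D7S820": ("8", "10"),
    "D13S317": ("11", "12"), "D16S539": ("10", "11"), "D21S11": ("30", "30"),
    "TH01": ("7", "9.3"), "TPOX": ("8", "8"), "vWA": ("15", "17"),
    "D3S1358": ("16", "17"), "D8S1179": ("12", "12"), "D18S51": ("16", "17"),
    "FGA": ("22", "24"),
}
# GangSTR 100x calls from the same table; None marks a locus with no call.
gangstr_100x = {
    "CSF1PO": ("10", "11"), "D5S818": ("12", "12"), "D7S820": ("8", "10"),
    "D13S317": ("11", "12"), "D16S539": ("10", "11"), "D21S11": None,
    "TH01": ("7", "8"), "TPOX": ("8", "8"), "vWA": ("15", "17"),
    "D3S1358": ("16", "17"), "D8S1179": ("12", "12"), "D18S51": ("16", "17"),
    "FGA": ("22", "24"),
}
n_match = sum(matches(gangstr_100x[locus], ce[locus]) for locus in ce)
mismatched = [locus for locus in ce if not matches(gangstr_100x[locus], ce[locus])]
print(n_match, len(ce), mismatched)
```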

Detection of STR expansions at known clinical loci

To assess the reliability of calling expanded STRs in a clinical situation, we compared the performance of STRetch, EHdn, STRling and GangSTR in calling expanded STRs in samples with known STR expansions (Table 7). All four methods showed high sensitivity, with EHdn (95% sensitivity), GangSTR (89% sensitivity) and STRling (94% sensitivity) outperforming STRetch (68% sensitivity). Although STRetch, STRling and GangSTR were able to flag repeat expansions in most of the genes analysed, the three tools underestimated the repeat lengths in comparison to Southern blot results at the DMPK, FMR1 and FXN loci, which may be due to either mosaicism of the STR expansion or GC-bias of the Illumina sequencing approach [24]. These loci were characterised by longer expansions and/or expansions in both alleles, ranging from 50 to 1000 repeat units in the samples that were screened. Both STRetch and STRling performed poorly at the FMR1 locus (Table 7). Of the samples identified as expanded (p ≤ 0.05; adjusted for multiple testing) at the FMR1 locus by STRetch, all had reported repeat lengths below the expected FMR1 premutation ranges [5]. For STRling, half the samples identified had reported repeat lengths below the expected FMR1 premutation ranges. To assess specificity, 21 samples with known non-pathogenic allele lengths at FMR1 were analysed, with STRling calling two of these as expanded and GangSTR identifying one sample with an expanded allele length. This is a small dataset with which to robustly test the false positive rate, but it suggests overall high specificity for the software tools used.

Table 7. Detection of STR expansions at known clinical loci.

Disease | Gene | Total analysed | Identified using EHdn | Identified using STRetch | Identified using GangSTR | Identified using STRling
Spinal-bulbar muscular atrophy | AR | 1 | 0 | 0 | 0 | 0
Myotonic dystrophy type 1 | DMPK | 16 | 16 | 16 | 16 | 16
Fragile X | FMR1 | 34 | 34 | 19 | 32 | 33
Friedreich’s ataxia | FXN | 25 | 25 | 11 | 19 | 24
Huntington’s disease | HTT | 13 | 13 | 13 | 13 | 12
Dentatorubral-pallidoluysian atrophy | ATN1 | 2 | 2 | 2 | 1 | 2
Spinocerebellar ataxia type 1 | ATXN1 | 3 | 0 | 3 | 3 | 1
Spinocerebellar ataxia type 3 | ATXN3 | 1 | 0 | 1 | 1 | 1
Total | | 95 | 90 (95%) | 65 (68%) | 85 (89%) | 89 (94%)
Unaffected | | 21 | 0 | 0 | 1 | 2
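
The sensitivity percentages in Table 7 follow directly from the per-gene counts. As a minimal sketch, the totals can be recomputed as below; the dictionary layout and function names are illustrative assumptions and are not taken from the authors' analysis code. The specificity figure is our own derived quantity, computed from the 21 unaffected samples in the last row.

```python
# Counts transcribed from Table 7. The data layout and helper names are
# illustrative assumptions, not the authors' code.
TOTAL_ANALYSED = {"AR": 1, "DMPK": 16, "FMR1": 34, "FXN": 25,
                  "HTT": 13, "ATN1": 2, "ATXN1": 3, "ATXN3": 1}

IDENTIFIED = {
    "EHdn":    {"AR": 0, "DMPK": 16, "FMR1": 34, "FXN": 25,
                "HTT": 13, "ATN1": 2, "ATXN1": 0, "ATXN3": 0},
    "STRetch": {"AR": 0, "DMPK": 16, "FMR1": 19, "FXN": 11,
                "HTT": 13, "ATN1": 2, "ATXN1": 3, "ATXN3": 1},
    "GangSTR": {"AR": 0, "DMPK": 16, "FMR1": 32, "FXN": 19,
                "HTT": 13, "ATN1": 1, "ATXN1": 3, "ATXN3": 1},
    "STRling": {"AR": 0, "DMPK": 16, "FMR1": 33, "FXN": 24,
                "HTT": 12, "ATN1": 2, "ATXN1": 1, "ATXN3": 1},
}

# "Unaffected" row: samples called as expanded among 21 unaffected samples.
N_UNAFFECTED = 21
FALSE_POSITIVES = {"EHdn": 0, "STRetch": 0, "GangSTR": 1, "STRling": 2}


def sensitivity(tool: str) -> float:
    """Fraction of known expansions that the tool identified."""
    return sum(IDENTIFIED[tool].values()) / sum(TOTAL_ANALYSED.values())


def specificity(tool: str) -> float:
    """Fraction of unaffected samples that the tool did not call as expanded."""
    return (N_UNAFFECTED - FALSE_POSITIVES[tool]) / N_UNAFFECTED


for tool in IDENTIFIED:
    print(f"{tool}: sensitivity {sensitivity(tool):.0%}, "
          f"specificity {specificity(tool):.0%}")
```

With only 21 unaffected samples, a single false positive shifts the specificity estimate by almost five percentage points, which is why the text treats this figure as indicative only.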

Discussion

Genome-wide analysis of STRs using short read sequencing has lagged behind studies of other variation, including single nucleotide variation and, to a certain extent, structural variation. It is likely that STRs underlie some undiscovered genetic associations with complex disease [5]. Several software tools have been developed with the aim of accurately calling STR genotypes genome-wide, but with different ultimate aims. Some have been developed to detect and genotype large repeat expansions that are outliers from the population distribution of allele lengths for that particular STR. Development of these was motivated by well-established repeat expansions causing a variety of Mendelian diseases, and by more recent discoveries showing that large repeat expansions can underlie a large amount of complex disease, such as ALS [36, 37]. Other software tools aim to genotype STRs genome-wide irrespective of their alleles, using a catalogue of known STRs.

We selected the most recent software tools most appropriate to our needs in detecting both STR expansions and genome-wide STR genotypes in a population-based sample. A previous study described detection of six disease-causing STR expansions from whole exome sequences using several STR-calling software tools, including GangSTR and STRetch but not EHdn or STRling [38]. ExpansionHunter [39], STRetch and exSTRa [40] were used to detect clinically relevant STR expansions in those six genes with high specificity. Other reports have focused on forensic STR detection [41], or on genotyping a single STR locus or a small number of STR loci [42].

Our aim was to assess the feasibility of applying these software tools to large cohorts of genomes, both in terms of the quality of the resulting data and the practicality of using the software. We assessed this in a variety of ways. Processor and memory usage, in conjunction with processing time, was measured so that we could assess the feasibility of scaling up analysis to thousands of genomes given current computing resources. For the software using a catalogue to genotype known STRs (GangSTR, HipSTR and ExpansionHunter), we examined the effect of increasing sequencing depth on the proportion of STRs genotyped, and on accuracy. We then examined the concordance between GangSTR, ExpansionHunter and HipSTR using common STR catalogues, and the relative number of Mendelian inconsistencies in trios, to assess STR genotyping accuracy. We found sequencing depth to have no effect on the number or quality of STR genotypes called by HipSTR, but higher sequencing depth had a modest positive effect on GangSTR STR genotype calling. The quality of STR calls made by ExpansionHunter, GangSTR and HipSTR was very similar, with slightly higher agreement between GangSTR and HipSTR when given the same STR catalogue. Both GangSTR and HipSTR used less memory than ExpansionHunter, but GangSTR took about 3x more CPU time. Both GangSTR and HipSTR performed very well, but not perfectly, in genotyping STR loci used for forensics.
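
The Mendelian inconsistency measure used above can be illustrated per locus: a child's genotype is consistent if one allele can have come from the mother and the other from the father. The sketch below is a simplified illustration under stated assumptions, not the procedure used in this study: genotypes are pairs of repeat counts at an autosomal locus, the function names are hypothetical, and no-calls, sex chromosomes and genuine de novo mutations are ignored.

```python
from typing import Iterable, Tuple

# A diploid STR genotype: two allele lengths in repeat units, e.g. (15, 17).
Genotype = Tuple[int, int]


def is_mendelian_consistent(child: Genotype,
                            mother: Genotype,
                            father: Genotype) -> bool:
    """True if one child allele is found in the mother and the other in the father."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def inconsistency_rate(trio_calls: Iterable[Tuple[Genotype, Genotype, Genotype]]) -> float:
    """Fraction of loci where the (child, mother, father) calls are inconsistent."""
    calls = list(trio_calls)
    if not calls:
        raise ValueError("no trio genotype calls supplied")
    inconsistent = sum(
        not is_mendelian_consistent(child, mother, father)
        for child, mother, father in calls
    )
    return inconsistent / len(calls)


# Hypothetical calls at three loci for one trio: (child, mother, father).
example = [
    ((15, 17), (15, 15), (16, 17)),  # consistent: 15 from mother, 17 from father
    ((12, 13), (12, 12), (12, 12)),  # inconsistent: allele 13 is in neither parent
    ((16, 16), (16, 17), (16, 18)),  # consistent: 16 from each parent
]
print(f"Mendelian inconsistency rate: {inconsistency_rate(example):.2f}")
```

An inconsistent locus can reflect a genotyping error in any of the three samples or a true de novo mutation, so the rate is best read as a relative measure when comparing tools on the same trios.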

The software tools STRetch, STRling, ExpansionHunter, EHdn and GangSTR can all call expanded repeats, with only EHdn and STRling not requiring a predefined catalogue of STR loci. Because expanded repeats are expected to be rare, and underlie several clinical conditions, we used a previous dataset to test the ability of STRetch, STRling, EHdn and GangSTR to genotype known STR expansions. EHdn, STRling and GangSTR showed higher sensitivity than STRetch, at least under the conditions tested. EHdn and STRling used the fewest resources, analysing a genome in about half an hour using about half a GB of memory under our conditions.
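
To give a sense of what such per-genome figures imply at cohort scale, the following back-of-envelope sketch converts a per-genome run time and memory footprint into a cohort-level estimate. The per-genome values (about 0.5 hours and 0.5 GB) are taken from the text above; the cohort size, the number of concurrent jobs and the function names are hypothetical, and the model assumes jobs of equal length with no scheduling or I/O overhead.

```python
import math


def cohort_wall_clock_hours(n_genomes: int,
                            hours_per_genome: float,
                            parallel_jobs: int) -> float:
    """Wall-clock hours when genomes run in equal-length batches of parallel jobs."""
    if n_genomes < 0 or parallel_jobs < 1:
        raise ValueError("n_genomes must be >= 0 and parallel_jobs must be >= 1")
    batches = math.ceil(n_genomes / parallel_jobs)
    return batches * hours_per_genome


def peak_memory_gb(memory_per_genome_gb: float, parallel_jobs: int) -> float:
    """Total memory needed if every concurrent job uses the per-genome amount."""
    return memory_per_genome_gb * parallel_jobs


# Hypothetical cohort of 5,000 genomes with 100 single-sample jobs at a time.
hours = cohort_wall_clock_hours(n_genomes=5000, hours_per_genome=0.5, parallel_jobs=100)
memory = peak_memory_gb(memory_per_genome_gb=0.5, parallel_jobs=100)
print(f"Estimated wall-clock time: {hours:.1f} h; peak memory: {memory:.1f} GB")
```

The same calculation with a tool that needs days per genome makes the practical difference between the methods immediately visible.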

For case-control studies where thousands of genomes have been sequenced at ~30x coverage, and where both large expansions and known STRs are to be analysed genome-wide, we have decided to use GangSTR for known STRs and EHdn for large expansions. Although STRling used similar time and memory to EHdn when run in single-sample mode, its resource usage can increase linearly in joint calling mode, which is the appropriate mode for case-control outlier analysis [23]. EHdn and STRling are the only STR genotypers that can genotype expansions without a predefined catalogue, broadening their scope to identify previously unknown expansions. GangSTR and HipSTR are similar, with the increased computing resources needed by GangSTR offset, in our view, by its ability to detect STR expansions longer than the read length, which can support and extend EHdn calls. ExpansionHunter in default mode was unusable for genotyping a catalogue of 790,661 STR loci, as it needed more than 7 days, but it was efficient in streaming mode, albeit with higher memory requirements of about 70 GB per genome using 16 cores.

We note that, although the clinical repeat expansions and the forensic STRs have been validated by orthogonal data and methods, this is not the case with the other STR genotypes. For STRs that are associated with disease in subsequent analyses, it will be important to validate those particular STRs on a subset of samples using alternative methods, such as capillary electrophoresis. STR genotype calling from high throughput sequencing remains an area of active development, and we hope that further progress can be made in reliably calling STRs from both short read and long read sequencing data.

Our study had some limitations inherent to the different tools. First, we did not analyse 1 bp repeat unit motifs. Both GangSTR and EHdn have been optimised to genotype 2-20 bp repeat motifs, although EHdn genotypes only expanded repeats. 1 bp repeat motifs present a challenge to many sequencing technologies due to PCR stutter noise, which may introduce artefacts in the sequence; a detailed assessment of the accuracy of 1 bp repeat motif STR genotyping in PCR-free sequencing is therefore needed. Secondly, apart from EHdn and STRling, the tools require defined STR coordinates built from reference genomes. Therefore, STRs not assembled in the reference genome, including those in the highly complex regions of the genome that are often hard to sequence and assemble, cannot be genotyped. However, with improvements in PCR-free protocols and long read sequencing, combined with new genome assemblies, the ability to genotype these regions will improve.

Supporting information

S1 Table. Number of loci called as a percentage of the total in the catalogue for GangSTR, HipSTR and ExpansionHunter, and call concordance as a function of sequence depth.

(DOCX)

pone.0300545.s001.docx (20.8KB, docx)
S2 Table. Number of loci called as a percentage of the total in the catalogue for GangSTR, HipSTR and ExpansionHunter, and call concordance as a function of sequence length.

(DOCX)

pone.0300545.s002.docx (17.6KB, docx)
S1 Fig. ExpansionHunter, GangSTR and HipSTR’s 2 bp repeat unit calls—Mendelian inconsistent sites stratified by sequence motif.

(TIFF)

pone.0300545.s003.tiff (2.1MB, tiff)

Acknowledgments

The views expressed are those of the authors and not necessarily those of the National Health Service (NHS), the NIHR or the Department of Health. This research used the ALICE High Performance Computing Facility at the University of Leicester.

Data Availability

Code for analyses, and the full set of genotype calls at the clinical and forensic loci, are available at https://doi.org/10.25392/leicester.data.22041020. Genotype call VCF files from GangSTR, HipSTR and ExpansionHunter for the Genome in a Bottle samples are available at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ and https://doi.org/10.25392/leicester.data.22041020. Genotype call VCF files from GangSTR, ExpansionHunter and HipSTR for the 1000 Genomes samples used are available at https://doi.org/10.25392/leicester.data.22041020.

Funding Statement

JWO is funded by a Wellcome Trust PhD studentship as part of the Wellcome Trust Genetic Epidemiology and Public Health Genomics Doctoral Training Programme (grant number 218505/Z/19/Z). LVW holds a GSK/Asthma+Lung UK Chair in Respiratory Research (C17-1). The research was partially supported by the National Institute for Health Research (NIHR) Leicester Biomedical Research Centre. There was no additional external funding received for this study.

References

  • 1. Brinkmann B, Klintschar M, Neuhuber F, Hühne J, Rolf B. Mutation rate in human microsatellites: influence of the structure and length of the tandem repeat. The American Journal of Human Genetics. 1998. Jun 1;62(6):1408–15. doi: 10.1086/301869
  • 2. Burgarella C, Navascués M. Mutation rate estimates for 110 Y-chromosome STRs combining population and father–son pair data. Eur J Hum Genet. 2010;19: 70. doi: 10.1038/ejhg.2010.154
  • 3. Willems T, Gymrek M, Highnam G, Mittelman D, Erlich Y. The landscape of human STR variation. Genome Research. 2014;24: 1894–1904. doi: 10.1101/gr.177774.114
  • 4. Gymrek M. A genomic view of short tandem repeats. Current Opinion in Genetics and Development. 2017;44: 9–16. doi: 10.1016/j.gde.2017.01.012
  • 5. Hannan AJ. Tandem repeats mediating genetic plasticity in health and disease. Nature Reviews Genetics. 2018;19: 286–298. doi: 10.1038/nrg.2017.115
  • 6. Paulson H, Henry L, Paulson H. Repeat expansion diseases. Neurogenetics, Part I. 2018: 105. doi: 10.1016/B978-0-444-63233-3.00009-9
  • 7. Lieberman AP, Shakkottai VG, Albin RL. Polyglutamine Repeats in Neurodegenerative Diseases. 2018. doi: 10.1146/annurev-pathmechdis-.
  • 8. Van Kuilenburg ABP, Tarailo-Graovac M, Richmond PA, Drögemöller BI, Pouladi MA, Leen R, et al. Glutaminase Deficiency Caused by Short Tandem Repeat Expansion in GLS. N Engl J Med. 2019;380: 1433. doi: 10.1056/NEJMoa1806627
  • 9. Hannan AJ. Tandem repeat polymorphisms: modulators of disease susceptibility and candidates for ’missing heritability’. Trends in Genetics. 2010;26: 59–65. doi: 10.1016/j.tig.2009.11.008
  • 10. Trost B, Engchuan W, Nguyen CM, Thiruvahindrapuram B, Dolzhenko E, Backstrom I, et al. Genome-wide detection of tandem DNA repeats that are expanded in autism. Nature. 2020;586: 80–86. doi: 10.1038/s41586-020-2579-z
  • 11. Lavoie H, Debeane F, Trinh Q, Turcotte J, Corbeil-Girard L, Dicaire M, et al. Polymorphism, shared functions and convergent evolution of genes with sequences coding for polyalanine domains. 2024;12: 2967. doi: 10.1093/hmg/ddg329
  • 12. Matsuura T, Yamagata T, Burgess DL, Rasmussen A, Grewal RP, Watase K, et al. Large expansion of the ATTCT pentanucleotide repeat in spinocerebellar ataxia type 10. Nature genetics. 2000. Oct;26(2):191–4. doi: 10.1038/79911
  • 13. Dobbelstein M, Contente A, Dittmer A, Koch MC, Roth J. A polymorphic microsatellite that mediates induction of PIG3 by p53. Nature genetics. 2002;30: 315–320. doi: 10.1038/ng836
  • 14. Gymrek M, Willems T, Guilmatre A, Zeng H, Markus B, Georgiev S, et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nature Genetics. 2015;48: 22–29. doi: 10.1038/ng.3461
  • 15. Fotsing SF, Margoliash J, Wang C, Saini S, Yanicky R, Shleizer-Burko S, et al. The impact of short tandem repeat variation on gene expression. Nat Genet. 2019;51: 1652. doi: 10.1038/s41588-019-0521-9
  • 16. Saini S, Mitra I, Mousavi N, Fotsing SF, Gymrek M. A reference haplotype panel for genome-wide imputation of short tandem repeats. Nat Commun. 2018;9. doi: 10.1038/s41467-018-06694-0
  • 17. Gymrek M, Golan D, Rosset S, Erlich Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Research. 2012;22: 1154–1162. doi: 10.1101/gr.135780.111
  • 18. Highnam G, Franck C, Martin A, Stephens C, Puthige A, Mittelman D. Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles. 2012;41: e32. doi: 10.1093/nar/gks981
  • 19. Willems T, Zielinski D, Yuan J, Gordon A, Gymrek M, Erlich Y. Genome-wide profiling of heritable and de novo STR variations. Nature Methods. 2017;14: 590–592. doi: 10.1038/nmeth.4267
  • 20. Bahlo M, Bennett MF, Degorski P, Tankard RM, Delatycki MB, Lockhart PJ. Recent advances in the detection of repeat expansions with short-read next-generation sequencing [version 1; referees: 3 approved]. F1000Research. 2018;7: 1–11.
  • 21. Dolzhenko E, Bennett MF, Richmond PA, Trost B, Chen S, Van Vugt JJFA, et al. ExpansionHunter Denovo: A computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biology. 2020;21: 1–14. doi: 10.1186/s13059-020-02017-z
  • 22. Dashnow H, Lek M, Phipson B, Halman A, Sadedin S, Lonsdale A, et al. STRetch: Detecting and discovering pathogenic short tandem repeat expansions. Genome Biology. 2018;19: 1–13.
  • 23. Dashnow H, Pedersen BS, Hiatt L, Brown J, Beecroft SJ, Ravenscroft G, et al. STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci. Genome Biol. 2022;23. doi: 10.1186/s13059-022-02826-4
  • 24. Dolzhenko E, van Vugt JJFA, Shaw RJ, Bekritsky MA, Van Blitterswijk M, Narzisi G, et al. Detection of long repeat expansions from PCR-free whole-genome sequence data. Genome Research. 2017;27: 1895–1903. doi: 10.1101/gr.225672.117
  • 25. Mousavi N, Shleizer-Burko S, Yanicky R, Gymrek M. Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Research. 2019;47: 1–13.
  • 26. Zook JM, Catoe D, Mcdaniel J, Vang L, Spies N, Sidow A, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016;3. doi: 10.1038/sdata.2016.25
  • 27. Byrska-Bishop M, Evani US, Zhao X, Basile AO, Abel HJ, Regier AA, et al. High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. bioRxiv. 2021: 2021.02.06.430068.
  • 28. Ebbert MTW, Wadsworth ME, Staley LA, Hoyt KL, Pickett B, Miller J, et al. Evaluating the necessity of PCR duplicate removal from next-generation sequencing data and a comparison of approaches. BMC Bioinformatics. 2016;17.
  • 29. García-Alcalde F, Okonechnikov K, Carbonell J, Cruz LM, Götz S, Tarazona S, et al. Qualimap: evaluating next-generation sequencing alignment data. 2012;28: 2678. doi: 10.1093/bioinformatics/bts503
  • 30. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. 2024;10. doi: 10.1093/gigascience/giab008
  • 31. Mousavi N, Margoliash J, Pusarla N, Saini S, Yanicky R, Gymrek M. TRTools: a toolkit for genome-wide analysis of tandem repeats. 2020;37: 731. doi: 10.1093/bioinformatics/btaa736
  • 32. Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37: 907. doi: 10.1038/s41587-019-0201-4
  • 33. Willems T, Gymrek M, Poznik GD, Tyler-Smith C, Erlich Y. Population-Scale Sequencing Data Enable Precise Estimates of Y-STR Mutation Rates. The American Journal of Human Genetics. 2016;98: 919. doi: 10.1016/j.ajhg.2016.04.001
  • 34. Butler JM. Genetics and genomics of core short tandem repeat loci used in human identity testing. Journal of Forensic Sciences. 2006;51: 253–265. doi: 10.1111/j.1556-4029.2006.00046.x
  • 35. Brearley EJ, Singh P, Bhatti JS, Mastana S. Genetic variation and differentiation among a native British and five migrant South Asian populations of the East Midlands (UK) based on CODIS forensic STR loci. Annals of human biology. 2020;47: 572–583. doi: 10.1080/03014460.2020.1797162
  • 36. Dejesus-Hernandez M, Mackenzie IR, Boeve BF, Boxer AL, Baker M, Rutherford NJ, et al. Expanded GGGGCC Hexanucleotide Repeat in Noncoding Region of C9ORF72 Causes Chromosome 9p-Linked FTD and ALS. Neuron. 2011;72: 245. doi: 10.1016/j.neuron.2011.09.011
  • 37. Renton AE, Majounie E, Waite A, Simón-Sánchez J, Rollinson S, Gibbs JR, et al. A Hexanucleotide Repeat Expansion in C9ORF72 Is the Cause of Chromosome 9p21-Linked ALS-FTD. Neuron. 2011;72: 257. doi: 10.1016/j.neuron.2011.09.010
  • 38. Rajan-Babu I, Peng JJ, Chiu R, Birch P, Couse M, Guimond C, et al. Genome-wide sequencing as a first-tier screening test for short tandem repeat expansions. Genome Med. 2021;13. doi: 10.1186/s13073-021-00932-9
  • 39. Dolzhenko E, Deshpande V, Schlesinger F, Krusche P, Petrovski R, Chen S, et al. ExpansionHunter: A sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics. 2019;35: 4754–4756. doi: 10.1093/bioinformatics/btz431
  • 40. Tankard RM, Bennett MF, Degorski P, Delatycki MB, Lockhart PJ, Bahlo M. Detecting Expansions of Tandem Repeats in Cohorts Sequenced with Short-Read Sequencing Data. American Journal of Human Genetics. 2018;103: 858–873. doi: 10.1016/j.ajhg.2018.10.015
  • 41. Valle-Silva G, Frontanilla TS, Ayala J, Donadi EA, Simões AL, Castelli EC, et al. Analysis and comparison of the STR genotypes called with HipSTR, STRait Razor and toaSTR by using next generation sequencing data in a Brazilian population sample. Forensic Science International: Genetics. 2022;58. doi: 10.1016/j.fsigen.2022.102676
  • 42. Budiš J, Kucharík M, Ďuriš F, Gazdarica J, Zrubcová M, Ficek A, et al. Dante: genotyping of known complex and expanded short tandem repeats. 2018;35: 1310. doi: 10.1093/bioinformatics/bty791

Decision Letter 0

Paul Aurelian Gagniuc

15 Feb 2024

PONE-D-24-00164

A comparison of software for analysis of rare and common short tandem repeat (STR) variation using human genome sequences from clinical and population-based samples

PLOS ONE

Dear Dr. Hollox,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Mar 31 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Paul Aurelian Gagniuc, PhD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating in your Funding Statement:

“JWO is funded by a Wellcome Trust PhD studentship as part of the Wellcome Trust Genetic Epidemiology and Public Health Genomics Doctoral Training Programme by grant number 218505/Z/19/Z . LWV holds a GSK/Asthma+Lung UK Chair in Respiratory Research (C17-1). The research was partially supported by the National Institute for Health Research (NIHR) Leicester Biomedical Research Centre

Please provide an amended statement that declares *all* the funding or sources of support (whether external or internal to your organization) received during this study, as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now.  Please also include the statement “There was no additional external funding received for this study.” in your updated Funding Statement.

Please include your amended Funding Statement within your cover letter. We will change the online submission form on your behalf.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“JWO is funded by a Wellcome Trust PhD studentship as part of the Wellcome Trust Genetic Epidemiology and Public Health Genomics Doctoral Training Programme by grant number 218505/Z/19/Z. LWV holds a GSK/Asthma+Lung UK Chair in Respiratory Research (C17-1). The research was partially supported by the National Institute for Health Research (NIHR) Leicester Biomedical Research Centre; the views expressed are those of the author(s) and not necessarily those of the National Health Service (NHS), the NIHR or the Department of Health.”

We note that you have provided additional information within the Acknowledgements Section that is not currently declared in your Funding Statement. Please note that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“JWO is funded by a Wellcome Trust PhD studentship as part of the Wellcome Trust Genetic Epidemiology and Public Health Genomics Doctoral Training Programme by grant number 218505/Z/19/Z . LWV holds a GSK/Asthma+Lung UK Chair in Respiratory Research (C17-1). The research was partially supported by the National Institute for Health Research (NIHR) Leicester Biomedical Research Centre”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

4. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

5. Your ethics statement should only appear in the Methods section of your manuscript. If your ethics statement is written in any section besides the Methods, please move it to the Methods section and delete it from any other section. Please ensure that your ethics statement is included in your manuscript, as the ethics statement entered into the online submission form will not be published alongside your manuscript.

6. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

********** 

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

********** 

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

********** 

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

********** 

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This study is important as it identifies an approach to investigate the role of genomic STR variants in polygenic human diseases. It is also a study that genotypes common STRs with the software used and identifies rarer STR expansions genome-wide. Therefore, it can be published in the journal.

Tables must be written in the appropriate format. There is duplicate text content in the article; it appears to have been uploaded twice. The writing language should be reviewed.

Reviewer #2: The authors present a benchmarking of open-source software for the analysis of repeat expansion from short-read genomic data. The comparison is made considering: computational resources, genotyping accuracy (expansion length and variant calling). Four cases of application at genome wide detection, mendelian inheritance (allele differentiation), rare disease (known pathogenic expansions) and in forensic analysis (variant calling) are described. The experiments performed and the well detailed results are very useful for other scientists working with repeat expansion detection and encourage the use of good practices. The research topic fits the current scenario, as genomic sequencing is becoming cheaper and the availability of genomic data for analysis in different research fields (large population studies, clinical diagnostics, etc.) is increasing. Comparative evaluation of tools for detection of repeat expansions has not been widely published in the past. The comparative evaluation of such variants is of great importance to apply correct bioinformatics approaches in routine laboratory work, due to their genomic variability at the sequence level and their importance in diseases.

The abstract and introduction are well presented, addressing the current state of the art of the capabilities of detection tools. The experiments used are adequate and cover all the performance-based questions that the authors aim to address. The experimental design is well conceived, and control samples are grouped according to the questions. The design of the study considers differences between software that could imply restrictions in the comparison of the results. The number of samples used is sufficient to obtain meaningful results and the reference materials chosen (Coriell and 1000 Genomes) are adequate. For the comparison in each category, performance parameters were well chosen. The tables and figures are understandable and help other scientists to consider which software might be best applied to specific scientific questions. The study conforms to ethical standards. The data are freely available in repositories (web links work) and the methods are available for replication of the results by any other user. Data collection and interpretation is well done.

The presentation of the results is well structured and explained. It supports the conclusions and discussion. Supplementary tables provide all the data produced in the study at sample and variant level and provide sufficient evidence to support the benchmarking goals. The discussion is conducted in the context of the results presented and at the current or research level, limitations are discussed. The statistical analysis and parameters chosen for benchmarking conform to current guidelines for quantitative and qualitative variant detection. Figures and tables are well presented and support the results.

The use cases in Mendelian disorders and forensic medicine focus on current issues in STR detection (i.e. allele-specific genotyping and call accuracy, respectively). The clinical samples FMR1, HTT and FXN are good examples, as they have proven to be tricky in accurate detection. The authors point out the limitations of the study, such as the number of samples was not significant enough (e.g. FMR1) and the type of variation they do not include (e.g. 1 bp).

Overall, this study does not present a groundbreaking advance in the field but presents a comparative assessment for a type of genomic variant that is difficult to determine and for many years overlooked in population and clinical studies. Such benchmarking studies are urgently needed to improve best practice in selecting strategies for genomic analysis, especially in clinical settings. Importantly, the authors also addressed computational resources, an issue that is often underestimated and results in unexpectedly high costs.

Comments for the authors:

The authors state in the Abstract: “Their contribution to common disease is not well-understood, but recent software tools designed to genotype STRs using short read sequencing data are beginning to address this”, referring to repeats. Software helps to address questions related to the contribution of repeats to common disease, but the software itself does not address that contribution. The sentence is misleading and should be changed to improve the meaning within the context of the abstract.

Authors should not refer to Tables (Table 1 is mentioned twice) or detailed results in the introduction. Please delete.

Authors should more clearly specify in the introduction that the manuscript focuses exclusively on short read sequencing data (and not long read sequencing).

Figure 3 is pixelated in the review document. Please make sure that figures have the required resolution for publication.

The authors assume capillary electrophoresis as the gold standard for genotyping known forensic STR loci when benchmarking error rates. The accuracy of this technique (sensitivity, specificity) should be stated and discussed, in case it influences the cases where calls do not match the software.

The authors show underestimation of repeat length in clinical cases with very large expansions. Please briefly discuss which limitations of the software or of short-read techniques cause these underestimations.

********** 

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

PLoS One. 2024 Apr 1;19(4):e0300545. doi: 10.1371/journal.pone.0300545.r002

Author response to Decision Letter 0


26 Feb 2024

Response to reviewers

Reviewer #1: This study is important as it identifies an approach to investigate the role of genomic STR variants in polygenic human diseases. It is also a study that genotypes common STRs with the software used and identifies rarer STR expansions genome-wide. Therefore, it can be published in the journal.

Tables must be written in the appropriate format. There is duplicate text content in the article; it was uploaded twice. The writing language should be reviewed.

Thanks to the reviewer for their positive comments. Indeed, the entire article was presented twice in the package to reviewers – this was a mistake. We will follow the publisher’s guidance on table formatting.

Reviewer #2: The authors present a benchmarking of open-source software for the analysis of repeat expansions from short-read genomic data. The comparison considers computational resources and genotyping accuracy (expansion length and variant calling). Four application cases are described: genome-wide detection, Mendelian inheritance (allele differentiation), rare disease (known pathogenic expansions) and forensic analysis (variant calling). The experiments performed and the well detailed results are very useful for other scientists working with repeat expansion detection and encourage the use of good practices. The research topic fits the current scenario, as genomic sequencing is becoming cheaper and the availability of genomic data for analysis in different research fields (large population studies, clinical diagnostics, etc.) is increasing. Comparative evaluation of tools for detection of repeat expansions has not been widely published in the past. The comparative evaluation of such variants is of great importance to apply correct bioinformatics approaches in routine laboratory work, due to their genomic variability at the sequence level and their importance in diseases.

The abstract and introduction are well presented, addressing the current state of the art of the capabilities of detection tools. The experiments used are adequate and cover all the performance-based questions that the authors aim to address. The experimental design is well conceived: control samples are grouped according to the questions. The design of the study considers differences between software that could imply restrictions in the comparison of the results. The number of samples used is sufficient to obtain meaningful results and the reference materials chosen (Coriell and 1000 Genomes) are adequate. For the comparison in each category, performance parameters were well chosen. The tables and figures are understandable and help other scientists to consider which software might be best applied to specific scientific questions. The study conforms to ethical standards. The data are freely available in repositories (web links work) and the methods are available for replication of the results by any other user. Data collection and interpretation are well done.

The presentation of the results is well structured and explained. It supports the conclusions and discussion. Supplementary tables provide all the data produced in the study at sample and variant level and provide sufficient evidence to support the benchmarking goals. The discussion is conducted in the context of the results presented and the current state of research; limitations are discussed. The statistical analysis and parameters chosen for benchmarking conform to current guidelines for quantitative and qualitative variant detection. Figures and tables are well presented and support the results.

The use cases in Mendelian disorders and forensic medicine focus on current issues in STR detection (i.e. allele-specific genotyping and call accuracy, respectively). The clinical samples FMR1, HTT and FXN are good examples, as they have proven to be tricky to detect accurately. The authors point out the limitations of the study, such as the insufficient number of samples (e.g. FMR1) and the type of variation not included (e.g. 1 bp).

Overall, this study does not present a groundbreaking advance in the field but presents a comparative assessment for a type of genomic variant that is difficult to determine and for many years overlooked in population and clinical studies. Such benchmarking studies are urgently needed to improve best practice in selecting strategies for genomic analysis, especially in clinical settings. Importantly, the authors also addressed computational resources, an issue that is often underestimated and results in unexpectedly high costs.

We thank the reviewer for their careful review and positive comments. Changes are highlighted under “track changes” in the manuscript document.

Comments for the authors:

The authors state in the Abstract: “Their contribution to common disease is not well-understood, but recent software tools designed to genotype STRs using short read sequencing data are beginning to address this”, referring to repeats. Software helps to address questions related to the contribution of repeats to common disease, but the software itself does not address that contribution. The sentence is misleading and should be changed to improve the meaning within the context of the abstract.

We have changed “Their contribution to common disease is not well-understood, but recent software tools designed to genotype STRs using short read sequencing data are beginning to address this.” to “Their contribution to common disease is not well-understood, but recent software tools designed to genotype STRs using short read sequencing data will help address this.”

Authors should not refer to Tables (Table 1 is mentioned twice) or detailed results in the introduction. Please delete.

We have deleted references to Table 1 in the introduction, and have cut two sentences that discuss the results in detail (lines 124-128).

Authors should more clearly specify in the introduction that the manuscript focuses exclusively on short read sequencing data (and not long read sequencing).

We agree, and have made two clarifications in the introduction (lines 111 and 118), one in the abstract (line 35), and two in the methods (lines 165 and 205).

Figure 3 is pixelated in the review document. Please make sure that figures have the required resolution for publication.

Resolution of figures will be publication quality, following the publisher’s requirements.

The authors assume capillary electrophoresis as the gold standard for genotyping known forensic STR loci when benchmarking error rates. The accuracy of this technique (sensitivity, specificity) should be stated and discussed, in case it influences the cases where calls do not match the software.

STR calling using capillary electrophoresis will, of course, have an error rate. Sensitivity and specificity will be very dependent on sample origin, quality and assay. However, these data had been generated using a forensic genotyping kit (Promega’s PowerPlex Fusion), which is very robust and optimised for use on poor-quality forensic samples. Therefore, we believe that, when the kit is used on laboratory-quality samples, we are justified in treating these genotypes as an error-free gold standard. We have emphasised this in the manuscript by including “forensic-standard” (line 392).

The authors show underestimation of repeat length in clinical cases with very large expansions. Please briefly discuss which limitations of the software or of short-read techniques cause these underestimations.

We echo Dolzhenko et al. (2017) in suggesting that the GC bias of Illumina sequencing (leading to lower-than-expected coverage of GC-rich repeats) or somatic mosaicism may lead to this effect. We have now mentioned this (lines 409-410).

Attachment

Submitted filename: Response to reviewers.docx

pone.0300545.s004.docx (16KB, docx)

Decision Letter 1

Paul Aurelian Gagniuc

29 Feb 2024

A comparison of software for analysis of rare and common short tandem repeat (STR) variation using human genome sequences from clinical and population-based samples

PONE-D-24-00164R1

Dear Dr. Hollox,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Paul Aurelian Gagniuc, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Paul Aurelian Gagniuc

20 Mar 2024

PONE-D-24-00164R1

PLOS ONE

Dear Dr. Hollox,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Paul Aurelian Gagniuc

Academic Editor

PLOS ONE

Associated Data

    Supplementary Materials

    S1 Table. Number of loci called as percentage of total in catalogue for GangSTR and HipSTR and ExpansionHunter, and call concordance as function of sequence depth.

    (DOCX)

    pone.0300545.s001.docx (20.8KB, docx)

    S2 Table. Number of loci called as percentage of total in catalogue for GangSTR and HipSTR and ExpansionHunter, and call concordance as function of sequence length.

    (DOCX)

    pone.0300545.s002.docx (17.6KB, docx)

    S1 Fig. ExpansionHunter, GangSTR and HipSTR’s 2 bp repeat unit calls—Mendelian inconsistent sites stratified by sequence motif.

    (TIFF)

    pone.0300545.s003.tiff (2.1MB, tiff)

    Attachment

    Submitted filename: Response to reviewers.docx

    pone.0300545.s004.docx (16KB, docx)

    Data Availability Statement

    Code for analyses, and the full set of genotype calls at the clinical and forensic loci, are available at https://doi.org/10.25392/leicester.data.22041020. Genotype call VCF files from GangSTR, HipSTR and ExpansionHunter for the Genome in a Bottle samples are available at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/ and https://doi.org/10.25392/leicester.data.22041020. Genotype call VCF files from GangSTR, ExpansionHunter and HipSTR for the 1000 Genomes samples used are available at https://doi.org/10.25392/leicester.data.22041020.


    Articles from PLOS ONE are provided here courtesy of PLOS
