PLOS ONE. 2019 Dec 12;14(12):e0225216. doi: 10.1371/journal.pone.0225216

Abundance of ethnically biased microsatellites in human gene regions

Nick Kinney 1,2,¤,*, Lin Kang 1,2, Laurel Eckstrand 3, Arichanah Pulenthiran 1, Peter Samuel 1, Ramu Anandakrishnan 1, Robin T Varghese 1, P Michalak 1,3,4, Harold R Garner 1,2
Editor: Arnar Palsson5
PMCID: PMC6907796  PMID: 31830051

Abstract

Microsatellites–a type of short tandem repeat (STR)–have been used for decades as putatively neutral markers to study the genetic structure of diverse human populations. However, recent studies have demonstrated that some microsatellites contribute to gene expression, cis heritability, and phenotype. As a corollary, some microsatellites may contribute to differential gene expression and RNA/protein structure stability in distinct human populations. To test this hypothesis, we investigate genotype frequencies, functional relevance, and adaptive potential of microsatellites in five super-populations (ethnicities) drawn from the 1000 Genomes Project. We discover 3,984 ethnically-biased microsatellite loci (EBML); for each EBML at least one ethnicity has genotype frequencies statistically different from the remaining four. South Asian, East Asian, European, and American EBML show significant overlap; on the contrary, the set of African EBML is mostly unique. We cross-reference the 3,984 EBML with 2,060 previously identified expression STRs (eSTRs); repeats known to affect gene expression (64 total) are over-represented. The most significant pathway enrichments are those associated with the matrisome: a broad collection of genes encoding the extracellular matrix and its associated proteins. At least 14 of the EBML have established links to human disease. Analysis of the 3,984 EBML with respect to known selective sweep regions in the genome shows that allelic variation in some of them is likely associated with adaptive evolution.

Introduction

Approximately two thirds of the human genome consists of repetitive DNA [1]. These repeats vary in size, complexity, and abundance in the genome: microsatellites are perhaps the simplest. Each microsatellite consists of a short motif (1–6 base pairs) repeated in tandem to form an array [2]; over 600,000 unique microsatellites exist in the human genome [3, 4]. Despite the simplicity of microsatellites, they have been leveraged in forensic and kinship analysis for decades. Essentially, they serve as genetic fingerprints; a consequence of their high mutation rate. In addition, microsatellites have a well-established role in diseases such as fragile X syndrome, spinocerebellar ataxias, myotonic dystrophy, Friedreich ataxia, and Huntington's disease [5, 6].

Recently, microsatellites have garnered interest for their role in complex diseases and subtler effects on gene expression [7–10]. Variations in the length of repeat arrays influence gene expression by inducing Z-DNA and H-DNA folding [10]; altering nucleosome positioning [10, 11]; and changing the spacing of DNA binding sites [2, 9, 12]. In fact, a recent genome wide survey of short tandem repeats (STRs) identified 2,060 that affect nearby gene expression (eSTRs) and estimated that STRs contribute up to 15% of the cis heritability among all types of genetic variants [9]. Strong enrichments were found near transcription start sites and predicted enhancers [9]. Shortly before this work, another study concluded that microsatellites facilitated divergence of gene expression in humans and great apes [13]. Thus, microsatellites have the capacity to affect gene expression and may be leveraged by natural selection for efficient evolution. A review published on the heels of both studies suggested that microsatellites contribute to the missing heritability of polygenic disorders and called for a better understanding of the repeatome at large [8].

Despite our nascent understanding of microsatellite function, they are well-studied in diverse human populations. Landmark studies in the 1990s showed that microsatellites can be used to infer the demographic history of human populations [14–17]. In particular, a 1992 study of four racial/ethnic groups–African American, Mexican American, Asian, and white–concluded that microsatellites can be leveraged for individual identification [18]. This idea was implemented in a 1994 study that succeeded in clustering individuals according to their geographic origin [19]: microsatellite diversity was highest in Africans. Subsequent studies established that only a modest number of microsatellites are required to make reliable inferences and set precedents for a 2003 study of 377 microsatellites in 52 worldwide populations [20–22]. A more recent study used microsatellites to characterize genetic variation across 121 African populations, four African American populations, and 60 non-African populations [23]. Genetic diversity was shown to decline with distance from Africa; private alleles were more numerous in Africa than in other regions; and 14 ancestral population clusters were revealed [23].

Microsatellites are now routinely used to study the genetic structure of diverse human populations [24–28]. Informative panels range from ten to several thousand loci and often draw from the Marshfield screening sets [29]. One of the largest studies to date merged 8 datasets [30]: the aggregated data included 645 microsatellite loci with genotypes in 5,795 individuals from seven population groups. Africans, East Asians, Oceanians, and Native Americans formed distinct clusters; Europeans and South Asians formed part of a central heterogeneous cluster [30]. This overall pattern was reiterated in a smaller study of 46 ancestry informative markers [31]; principal component analysis (PCA) revealed distinct clusters for Africans and East Asians with overlapping central clusters corresponding to South Asians and Europeans, respectively [31]. Similar results have been obtained with samples drawn from the 1000 Genomes Project Phase 1, demonstrating the utility of microsatellite analysis from Illumina sequencing [32].

We use existing sequencing data from the 1000 Genomes Project Phase 3 to build on what is known about polymorphic germline microsatellites. First, we perform PCA of 316,147 microsatellites: a significant fraction of all microsatellites in the human genome. The pattern of variation is consistent with previous studies that focus exclusively on smaller sets of polymorphic microsatellites [30–32]. We use Fisher’s exact test to identify 3,984 ethnically-biased microsatellite loci (EBML); for each EBML at least one ethnicity has genotype frequencies statistically different from the remaining four. We find significant overlap between EBML and previously identified eSTRs: a key result of this study. EBML are enriched with core matrisome and matrisome associated genes. Further enrichments are found when we characterize EBML with respect to introns, exons, UTRs, and coding sequences (CDS). EBML are over-represented in soft sweep regions of the human genome, suggesting that microsatellites contribute to differential gene expression in distinct populations and have adaptive potential.

Results

Abundance of EBML in the human genome

Numerous studies have used microsatellites to reveal patterns of genetic variation within and between diverse human populations [23–28]; marker panels typically include ten to several thousand loci. We use next-generation sequencing (Illumina) to investigate 316,147 genome wide microsatellites in 2,529 samples belonging to five super-populations (ethnicities). PCA clusters reiterate the five ethnicities (Fig 1A), confirming the utility of microsatellite analysis from Illumina sequencing [32]. The overall pattern of variation is consistent with previous studies that focus on smaller panels of known polymorphic microsatellites [30–32]: AFR, EAS, and EUR in three outside clusters with AMR and SAS in two overlapping central clusters (Fig 1A). We use Fisher’s exact test (see methods) to identify 3,984 ethnically-biased microsatellite loci (EBML); for each EBML at least one ethnicity has genotype frequencies statistically different from the remaining four. With the exception of AFR, EBML show significant overlap in all ethnicity pairs (Fig 2A). On the contrary, microsatellites specific to AFR are under-represented in SAS, EAS, and EUR. PCA of the 3,984 EBML reveals a pattern of variation representative of all 316,147 genome wide microsatellites (Fig 1B). The first two principal components suggest that variation is greatest in AFR and EAS, respectively. Once again, this finding is consistent with previous studies [33] and is recapitulated in all subsequent analyses (see results below).

Fig 1. Discovery and analysis of EBML.

a, Principal component analysis of 316,147 genome wide microsatellites reveals a distinct pattern of variation in 5 super-populations (ethnicities). b, Principal component analysis of 3,984 EBML reveals a pattern of variation similar to genome wide microsatellites. c, Number of EBML identified in five ethnicities: UpSet plot inset with 5-way Venn Diagram. Overlap indicates microsatellites specific to two or more ethnicities. Oval regions sum to 3,171 (AFR); 1,494 (EAS); 450 (EUR); 647 (SAS); and 335 (AMR). All regions sum to 3,984: the total number of EBML identified in this study.

Fig 2. Summary of significant enrichments in the set of 3,984 EBML.

a, EBML for each of the 10 super-population pairs were used to construct a 2x2 contingency table followed by χ2 test of independence: matrix entries shown as -log(p-value). Each matrix entry corresponds to a test of the null: that EBML are independent in the pair of super-populations. Over-representation shown in blue; under-representation shown in red. With the exception of AFR, microsatellites specific to two or more super-populations are over-represented. b, Fourfold plots of EBML reveal significant overlap with matrisome genes (left) and eSTRs (right). Area of each quarter circle is proportional to the cell frequency (following marginal standardization). Color of each quarter circle corresponds with its Pearson residual: blue indicates the cell entry exceeds the expected value; red indicates the cell entry is less than the expected value. Confidence rings for the odds ratio allow a visual test of the null (no association); here, 99% confidence rings do not overlap, indicating the null is rejected. c, EBML are enriched with 1-mer and 2-mer repeats. Enrichments for each unit length were checked by constructing a 2x2 contingency table followed by χ2 test of independence. Bars show the odds ratio with 95% confidence interval: over-represented motifs shown in blue; under-represented motifs shown in red. Level of significance (p-value) is indicated symbolically: p<0.05 (*); p<0.01 (**); p<0.001 (***). d, The 64 EBML known to affect gene expression (eSTRs) have more alleles on average than the remaining 3,920 EBML; smoothed distribution of allele counts shown in blue and red, respectively. The difference is statistically significant (p = 0.01; two sided Kolmogorov-Smirnov test).

Identifying EBML is an easier task than explaining their origin. EBML could emerge due to high microsatellite mutability, genetic drift, isolation-by-distance, or natural selection. The action of natural selection–investigated in the next section–would suggest adaptive potential. Regardless, the 3,984 EBML add to the known polymorphic microsatellites and could be useful in future studies of population structure; a complete summary is shown in Fig 1C (S1–S6 Tables; S1 Code; S1 Dataset).

Correspondence between EBML, eSTRs, and selective sweeps suggests adaptive potential

A recent study used RNA sequencing of lymphoblastoid cell lines to investigate links between array length variations in 80,980 short tandem repeats (STRs) and expression of nearby genes [9]. The study identified 2,060 significant associations (among the 80,980) which established the importance of expression STRs (eSTRs). Cross-referencing the 80,980 STRs against our 316,147 microsatellites reveals 13,259 repeats in common. We constructed a 2x2 contingency table based on classifications (eSTR and/or EBML) of the 13,259 shared repeats (Fig 2B; right panel). Remarkably, 64 loci classify as an eSTR and EBML (S7 Table); the overlap is statistically significant (p = 1.53e-8; χ2 test). Interestingly, the overlapping set of 64 EBML/eSTRs average 6.63±3.86 alleles each (424 alleles total) while the remaining 3,920 EBML average 5.68±3.03 alleles each (22,278 alleles total); the difference is statistically significant (p = .011; Kolmogorov-Smirnov test) (Fig 2D). The set of 64 includes all five ethnicities: 53 AFR, 32 EAS, 17 SAS, 16 EUR, and 9 AMR. In addition, five are biased in every ethnicity and five are embedded in coding sequences. Genes harboring the former set of five include SNX2 (3'-UTR), CST3 (intron), C2orf50 (5'-UTR), ACP2 (5'-UTR), and VEGFB (intron); the five coding microsatellites are in NOP9, USP36, PTPN18, AK9, and SNAPC4.

The correspondence between EBML and eSTRs suggests that some microsatellites contribute to differential gene expression; however, this does not necessarily imply they have adaptive potential. To infer adaptive potential we identify microsatellites in selective sweep regions of the human genome. Briefly, a selective sweep occurs when strong positive selection–due to a novel allele–reduces nearby genetic variation; sweep regions have been established for six populations in the 1000 Genomes Project [34]. We find 434 (out of 3,984) EBML in soft sweep regions; among all tested microsatellites, 30,850 (out of 316,147) fall in soft sweep regions. The difference is small but statistically significant (p = 0.018). The 434 EBML include 21 in the coding sequence of 18 genes (S9 Table); one is a previously identified eSTR (CDS of gene USP36).
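The soft sweep comparison can be illustrated by arranging the reported counts in a 2x2 contingency table. The sketch below is an illustration under stated assumptions, not the analysis code of this study: it assumes the 30,850 tested microsatellites in soft sweep regions include the 434 EBML, it uses a Pearson χ2 test without continuity correction (the text does not name the test behind p = 0.018, so the p-value obtained here need not match it exactly), and the function name `chi2_2x2` is hypothetical.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Returns (chi-square statistic, p-value, odds ratio).
    No continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio

# Counts reported in the text.
ebml_total, ebml_in_sweep = 3984, 434
tested_total, tested_in_sweep = 316147, 30850

# Assumption: the tested set contains the EBML, so non-EBML counts are differences.
a = ebml_in_sweep                        # EBML inside soft sweep regions
b = ebml_total - ebml_in_sweep           # EBML outside soft sweep regions
c = tested_in_sweep - ebml_in_sweep      # non-EBML inside soft sweep regions
d = (tested_total - ebml_total) - c      # non-EBML outside soft sweep regions

chi2, p_value, odds_ratio = chi2_2x2(a, b, c, d)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, odds ratio = {odds_ratio:.3f}")
```

An odds ratio above 1 corresponds to over-representation of EBML in soft sweep regions; the same table layout applies to the other 2x2 enrichment tests described in the figure legends.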

Overall, these findings suggest a degree of mutual overlap between EBML, eSTRs, and selective sweeps in the human genome. The interpretation of this overlap is built up from the definition of each region. Briefly, each eSTR has the capacity to affect gene expression; and, for each EBML at least one ethnicity has a distribution of genotypes statistically different from the remaining four. Thus, any correspondence between the two implies differential gene expression in one or more human super-populations (ethnicities). These differences likely stem from a complex combination of high mutability, genetic drift, isolation-by-distance, and natural selection. Drift, mutability, and isolation are inevitable; but, the situation is less clear for natural selection. The correspondence between EBML and selective sweeps suggests they have adaptive potential and may be targeted by natural selection [11, 35–38]. On the other hand, it is possible that EBML are simply in linkage disequilibrium with targets of selection such as nearby SNPs. Overall it seems likely that some EBML–particularly the ones that overlap eSTRs–have bona fide adaptive potential.

Introns are over-represented among EBML; coding regions are under-represented

We characterize the 3,984 EBML in terms of their overlap with 3,324 introns, 342 exons, 542 UTRs, and 147 coding sequences. Although introns are the majority among tested microsatellites (148,464 out of 316,147), they are over-represented among EBML (Fig 3). On the other hand, coding microsatellites are under-represented (Fig 3). Enrichment analysis leading to these conclusions takes into account the coverage of microsatellites among samples and is robust to sample partitioning (see methods for details). Despite their under-representation, it is remarkable that coding EBML are found in all five ethnicities: 118 AFR, 51 EAS, 26 SAS, 20 EUR, and 14 AMR. Coding microsatellites in four genes (KAT6B, ATN1, HOMEZ, and CNDP1) are biased in all five ethnicities. Overall, we find 727 alleles for coding EBML. Two genes (ATN1 and VEZF1) have 13 alleles each: in both cases, the gene product harbors a glutamine repeat. Highly polymorphic non-glutamine repeats–10 alleles each–are found in gene products for TRAK1 (glutamic acid), SUPT20HL1 (alanine), KDM6B (proline), and AUTS2 (histidine). On the contrary, TYW3 and MTERF4 only have two alleles; but, each has an allele only found in African samples. In the case of TYW3, 44 out of 654 African samples possess a 4-aspartic acid (DDDD) allele. In the case of MTERF4, 39 out of 651 African samples possess a (glutamic acid/aspartic acid) 3-mer (DED).

Fig 3. Enrichment analysis of EBML with respect to gene regions: Introns, exons, coding sequences, and UTRs.

Analysis of microsatellites draws upon whole exome sequenced samples from the 1000 Genomes Project; consequently, enrichment with respect to gene regions could not be checked by directly comparing all 316,147 tested microsatellites to the 3,984 EBML. Instead, two series of enrichment tests are performed on subsets of the tested microsatellites and EBML. a, In the first series, each iteration removes microsatellites based on the minimum number of available samples. b, In the second series, each iteration considers microsatellites within a range of available samples. Each point represents a χ2 test of the null: no association between EBML and the gene region. Over-represented regions are plotted as -log(p) and appear above the x-axis; under-represented regions are plotted as log(p) and appear below the x-axis. Statistical significance (p = .05) is indicated with a dashed line. The null is rejected for intronic microsatellites (over-represented) and coding microsatellites (under-represented). The same conclusion is reached in both panels, suggesting the results are robust to details of the analysis. Note: exonic microsatellites include CDS and UTR microsatellites based on an independent series.

Characterization of EBML by repeat unit

In terms of repeat units, EBML are consistent with human microsatellites at large. Complete summary of EBML based on repeat unit is shown in Fig 4A. Mononucleotide poly-A/poly-T repeats are most common: we find 1,736 out of 3,984. These repeats are predominant in the genomes of human, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, and Saccharomyces cerevisiae [39]. Mononucleotide poly-G/poly-C repeats are far less common in our findings (154 out of 3,984) and in the genomes of the aforementioned organisms; nevertheless, both classes are over-represented among EBML as are mononucleotide repeats overall (p = 2.2e-16) (Fig 2C). Investigation of dinucleotide motifs reveals that EBML AC repeats (735 total) are most common while GC repeats (5 total) are extremely rare; both of these findings reiterate the overall distribution of genomic repeats in humans and D. melanogaster [39]. EBML di-nucleotide repeats are over-represented (p = 2.23e-5) (Fig 2C). EBML tri-nucleotide repeats (356 total) are not over or under represented (p = .5812); however, EBML 4-mer, 5-mer, and 6-mer repeats are all under-represented (p<1e-15). Thus, the composition of EBML shifts from over-representation of mono-nucleotide and di-nucleotide repeats to under-representation of tetra-nucleotide repeats and beyond (Fig 2C).

Fig 4. Characterization of EBML by repeat unit, gene feature, and amino acid.

a, The 16 most common repeat motifs among all 3,984 EBML: colors indicate super-population (ethnicity); for clarity A/T and AC/TG/CA/GT repeats are shown separately. b, The 16 most common amino acid motifs for all 147 coding EBML: colors indicate super-population (ethnicity). Small and hydrophilic amino acids appear most frequently: glutamine (Q), glutamic acid (E), aspartic acid (D), serine (S), and lysine (K). Poly-glutamine repeats are over-represented (p = 9.32e-9) as are glutamic acid repeats (p = .0238). c, Summary of genomic regions harboring the 3,984 EBML in five super-populations (ethnicities). Colors indicate EBML overlap with gene introns, UTRs, and coding sequences. Note: exonic microsatellites (not shown) include CDS and UTR microsatellites.

Poly-glutamine and poly-glutamic acid are common among coding EBML

It comes as no surprise that poly-glutamine repeats are most common among EBML found in coding sequences (we find 34 total); indeed, it is well known that glutamine repeats are abundant in the human proteome. Still, our results suggest that poly-glutamine is over-represented among coding EBML (p = 9.32e-9). Among the other common coding EBML–glutamic acid, aspartic acid, serine, and lysine–only the glutamic acid repeats (30 total) are over-represented (p = .0238). With the exception of serine, all of the aforementioned amino acids are hydrophilic. Overall we find hydrophilic amino acids (95 total) are over-represented (p = 2.38e-6) and hydrophobic amino acids (18 total) are under-represented (p = 1.20e-6) among the EBML. Given these results it is remarkable that we find an EBML embedded in CNDP1 encoding an array of large hydrophobic leucine residues. Indeed it is thought that repeat expansions encoding hydrophobic amino acids are more likely to have deleterious effects on protein function [39]; yet, we find arrays of 4, 5, 6, and 7 residues. The protein coding microsatellite in ATN1 is remarkable for its degree of array length variation. We find 13 variants encoding a 6 to 17 residue glutamine repeat. Although these variants are well known and in the normal range [40, 41], our results show that the distribution of variants differs in the 5 super-populations (ethnicities) under investigation. A complete summary of the EBML protein coding repeats is shown in (Fig 4B).

Pathway analysis for genes harboring EBML

Sets of genes harboring EBML for each ethnicity were checked for enrichment in Reactome and KEGG pathways [42, 43]; we find ten and nine significant pathways, respectively (Fig 5). The most significant Reactome pathways are those associated with the extracellular matrix (ECM): ECM proteoglycans, non-integrin membrane-ECM interactions, ECM organization, and degradation of the ECM. These four pathways–along with three collagen related pathways–were significant in two or more super-populations (Fig 5A). Statistically significant KEGG pathways reiterate these results. ECM receptor interaction and focal adhesions–specialized structures at cell-ECM contact points–are statistically significant in three and two super-populations, respectively (Fig 5B). Thus, the most significant functional enrichments–according to KEGG and Reactome–are those associated with the matrisome: a broad collection of ECM proteins comprising 1%-1.5% of the proteome. Further support for this conclusion is found by cross-referencing genes harboring EBML with 1,062 matrisome genes identified in a previous study [44]. Indeed core matrisome/matrisome associated genes are over-represented (p = .0023) (Fig 2B; left panel). In fact, 46 EBML are in collagen genes alone which in turn have been linked to numerous diseases (S8 Table). Still, the biological significance of these findings remains speculative. Only 64 out of 3,984 EBML have been shown to affect gene expression (see results above). Thus, more work needs to be done to determine the effects (if any) of EBML on ECM pathways and disease.

Fig 5. Reactome and KEGG pathways over-represented by genes harboring EBML for each super-population.

a, The most significant Reactome pathways are those associated with the extra-cellular matrix (ECM) and collagen formation, degradation, and modification. b, The top two KEGG pathways–Focal adhesion and ECM-receptor interaction–are consistent with Reactome.

Despite our nascent understanding of repeat polymorphisms and gene expression, coding microsatellites are the established cause of numerous diseases [8]. Among the 147 coding EBML, we identify 14 with established links to diseases (S10 Table). Questions of prevalence or predisposition for these diseases among the super-populations we investigate could not be addressed. In particular, causative alleles for most repeat expansion diseases extend beyond the 100bp range of Illumina sequencing used for this study; this unfortunate limitation is further discussed below. Our data may shed light on susceptibility to knee and hip osteoarthritis stemming from an aspartic acid (D) repeat in the ASPN1 gene. Previous studies suggest that frequency of the D14 allele increases with disease severity [45, 46]. Our findings show that the D14 allele is most common in Africans, which conceivably contributes to their high rates of large joint osteoarthritis [47]. Future work will address this hypothesis.

Discussion

Microsatellites have been used for decades to study the genetic structure of diverse human populations [23–28]. Relatively recent studies have shed light on their role in complex trait heritability; epistatic interactions; and superiority as markers of genome integrity [48, 49]. Progress in these areas is important as microsatellites are thought to contribute to disease susceptibility and ‘missing heritability’ [7, 50]. Our work builds on what is known with four main results. First, we identify 3,984 EBML; genetic variation in these regions is representative of the variation in 316,147 genome wide microsatellites. Second, a statistically significant number of EBML coincide with known eSTRs; i.e. they affect gene expression. Third, core matrisome and matrisome associated genes are over-represented among genes harboring EBML. Fourth, a significant number of EBML are in putative selective sweep regions. The 3,984 EBML could be useful in future studies of population structure and reiterate the utility of microsatellite analysis from Illumina sequencing [32].

Research leading to the discovery of eSTRs investigated their effects in different human populations; indeed, eSTR association signals were reproducible across populations [9]. We build on these results by showing that some microsatellites likely contribute to differential gene expression in different populations. In fact, three of our main results support this overall conclusion: (a) EBML harbor genetic variation particular to one or more ethnicities; (b) EBML overlap eSTRs; and (c) EBML are over-represented in regions of soft selective sweep. On the other hand, we recognize that these results provide necessary but not sufficient indication that microsatellites have adaptive potential and are targets of positive selection. In particular, it is unclear if selection acting directly on microsatellites, with their enhanced mutability and extended sequence motifs, leaves classical signatures of a selective sweep. However, we notice that, in a broad sense, a new microsatellite allele is an indel (insertion or deletion of one or more motifs), and indels or even transposable element insertions have been known to be targets of selection associated with selective sweeps [51, 52].

It is far too soon to say if functional repeats contribute to racial/ethnic disparities in complex disease rates and prevalence; if so, they are certainly one of many factors. Work on eSTRs demonstrated their enrichment in genes associated with various diseases: Crohn's disease, rheumatoid arthritis, and type 1 diabetes [9]. We find eSTRs over-represented among 3,984 EBML; however, other attempts to understand how microsatellites affect gene expression have led to mixed results. In particular, a study of nearly 5,000 promoter microsatellites revealed 183 significantly associated with nearby gene expression; but, only a small proportion of the microsatellites identified were significant in all populations under investigation [53]. Thus, more work needs to be done to establish links between eSTRs, EBML, and disease.

Some of our results are not surprising; in particular, characterization of the 3,984 EBML reiterates the extensive genetic diversity of African populations. Still, this is an important finding that underscores the lack of diversity in the human reference genome [54–56]. The current reference genome–GRCh38–stems primarily from a single individual; consequently, routine genetic analysis such as read mapping and variant calling may suffer for individuals whose genetic makeup differs from the reference. Alternate loci and databases of known variants partially address this problem; however, many of the alleles we identify are not present in genomic databases such as dbSNP. In fact, the 3,984 EBML we identify have a combined 22,702 alleles, rendering any single reference genome insufficient. Thus, our results reiterate the need to better represent diversity in the human genome and genomic databases at large [55]. Possibly, the EBML we identify–as well as undiscovered EBML–can help address this inequality by expanding what is known about human genetic variation.

Key limitations of our study lead us to hypothesize that many more EBML remain undiscovered. In particular, we only investigate microsatellite arrays shorter than 100bp: a number stemming from the short read (Illumina) sequencing used for the 1000 Genomes Project. Microsatellite variants exceeding 100bp are truncated, rendering their true array length difficult to infer [57]. Advances in long read sequencing will help address this challenge but introduce others. Sequence alignment algorithms, especially those that use an affine gap model, can introduce errors when challenged with large insertions or deletions [58, 59]. Unfortunately, most clinically relevant microsatellite variants do exceed 100bp. In addition, the 1000 Genomes Project–and by extension our results–may still fall short in capturing genetic diversity in some populations. Principal component analysis has shown that genetic diversity in Southeast Asia is under-represented in the 1000 Genomes Project [60]; and, whole genome sequencing of African populations has revealed millions of unshared genetic variants [56]. Thus, there are undoubtedly more EBML to be discovered.

Methods

Microsatellite list generation

A preliminary list of 625,178 microsatellites was generated using a two-step process: (a) detection in the reference genome, and (b) reduction of sequence similarity. Super-population specificity was subsequently tested for 316,147 microsatellites (see next sections). Here, we describe each step of preliminary list generation.

(a) Detection of microsatellites in the reference genome. A list of microsatellites in version 38 of the human reference genome was generated with a custom Perl script ‘searchTandemRepeats.pl’ using default parameters. This script has been used in previous microsatellite studies and is freely available online at http://genotan.sourceforge.net/#_Toc324410847 [61]. Briefly, the ‘searchTandemRepeats.pl’ script first searches for pure repetitive stretches: no impurities allowed. Imperfect repeats and compound repeats are handled using a “mergeGap” parameter with a default value of 10 base pairs. Essentially, impurities that interrupt stretches of pure repeat sequence are tolerated unless they exceed 10 base pairs. Likewise, repeats closer than 10 base pairs are considered compound. The initial list generated with this script included 1,671,121 microsatellites.
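The two ideas in step (a), a search for pure repeats followed by merging of nearby repeats, can be sketched in a few lines. This is an illustrative sketch and not the ‘searchTandemRepeats.pl’ script: the regular expression, the minimum copy number of 3, the function names, and the treatment of a gap of exactly 10 base pairs as mergeable are all assumptions made for the example.

```python
import re

def find_pure_repeats(seq, min_copies=3, max_unit=6):
    """Find pure tandem repeats with a 1-6 bp motif (no impurities allowed).

    Returns a list of (start, end, unit) with 0-based, end-exclusive coordinates.
    The lazy quantifier makes the regex try the shortest motif first.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq)]

def merge_nearby(repeats, merge_gap=10):
    """Merge consecutive repeats separated by at most merge_gap base pairs.

    A merged interval keeps the list of motifs it absorbed, so an interrupted
    repeat shows the same motif twice and a compound repeat shows different motifs.
    """
    merged = []
    for start, end, unit in repeats:
        if merged and start - merged[-1][1] <= merge_gap:
            prev_start, _, units = merged[-1]
            merged[-1] = (prev_start, end, units + [unit])
        else:
            merged.append((start, end, [unit]))
    return merged

# Toy sequence: an (A)8 array and an (AC)5 array separated by 3 bp.
example = "GGCT" + "A" * 8 + "CGT" + "AC" * 5 + "TTGCA"
pure = find_pure_repeats(example)
compound = merge_nearby(pure)
print(pure)
print(compound)
```

In the toy sequence the two pure arrays lie 3 base pairs apart, so they are reported as a single compound repeat; with a larger separation they would remain two entries.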

(b) Reduction of sequence similarity. It is well known that sequencing reads containing microsatellites are prone to mismapping [62]; particularly when different microsatellites possess the same repeat unit between similar 3’ and 5’ flanking sequences. We filtered the initial list of microsatellites (step a) to mitigate these effects. To begin, each microsatellite in the initial list (step a) was assigned a hash key constructed by concatenation of its 3’ flanking sequence, repeat unit, and 5’ flanking sequence. We used 5bp for the flanking regions. Thus, microsatellites ‘GCTGC(A)34CTTAG’ and ‘GCTGC(A)15CTTAG’ received the same hash key: ‘GCTGCACTTAG’. Next, hash keys appearing more than once–and their corresponding microsatellites–were removed from the initial list. The fact that there were many of these potentially ambiguous regions is not surprising since microsatellites are often embedded in larger repetitive motifs such as LINEs and SINEs [63]. Our filtered preliminary list included 625,178 microsatellites unique in the human genome: available at http://www.cagmdb.org/view_micros.php
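The hash key filter in step (b) can be sketched as follows. This is a minimal illustration rather than the code used in the study: the dictionary layout of each microsatellite record, the field names (`left`, `unit`, `right`), and the function names are hypothetical, and the third locus is invented for the example.

```python
from collections import Counter

def hash_key(left_flank, unit, right_flank, flank_len=5):
    """Concatenate the flanking bases adjacent to the array with the repeat unit.

    The copy number is deliberately ignored, so arrays that differ only in
    length receive the same key.
    """
    return left_flank[-flank_len:] + unit + right_flank[:flank_len]

def drop_ambiguous(microsatellites):
    """Keep only microsatellites whose hash key occurs exactly once."""
    keys = [hash_key(m["left"], m["unit"], m["right"]) for m in microsatellites]
    counts = Counter(keys)
    return [m for m, key in zip(microsatellites, keys) if counts[key] == 1]

loci = [
    # The two loci from the text share the key 'GCTGCACTTAG'.
    {"id": "locus1", "left": "GCTGC", "unit": "A", "copies": 34, "right": "CTTAG"},
    {"id": "locus2", "left": "GCTGC", "unit": "A", "copies": 15, "right": "CTTAG"},
    # Hypothetical locus with a unique key.
    {"id": "locus3", "left": "TTGCA", "unit": "AC", "copies": 12, "right": "GGATC"},
]
kept = drop_ambiguous(loci)
print([m["id"] for m in kept])
```

Note that every microsatellite sharing a repeated key is removed, not just the duplicates, which is why both example loci from the text are dropped.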

Microsatellite genotyping

We used the program RepeatSeq [61] to determine the genotype of microsatellites in next generation sequencing reads. RepeatSeq operates on three input files: a reference genome, a file containing reads aligned to the human reference genome (.bam file), and a list of query microsatellites (see methods above). Like most microsatellite genotyping programs, RepeatSeq excludes reads that do not span the entire repeat. These unusable reads are detected by their lack of one or both flanking regions that anchor the microsatellite to a unique position in the genome; by default, RepeatSeq requires that each read contain at least 3 matching base pairs for both the 3’ and 5’ flanking region. To further increase the accuracy of genotyping calls, we only considered microsatellites with six or more mapped reads: by default genotype calls only require 2 reads. The advantage of RepeatSeq over other microsatellite genotyping programs is that it realigns each read to the reference genome prior to array length detection [57]. This mitigates the main cause of microsatellite genotyping errors, specifically, improper read alignments. Improper alignments often arise when the 3’ and 5’ flanking regions of a microsatellite mimic its repeat unit or from large insertions and deletions, which are common in microsatellites. In these situations, alignment algorithms often incorrectly open and extend gaps; thus, careful realignment is critical to accurate microsatellite array length detection. RepeatSeq has been used in previous studies of microsatellites and is freely available: https://github.com/adaptivegenome/repeatseq.
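The two read-level filters described above (both flanks present, and a minimum of six usable reads) can be illustrated with a simplified Python sketch. This is not how RepeatSeq is implemented: it works on raw read strings rather than realigned reads, assumes a pure repeat, and all function names are hypothetical.

```python
import re

def repeat_copies(read, left_flank, unit, right_flank, min_flank=3):
    """Return the repeat copy number if the read spans the array, else None.

    A read is treated as spanning when it contains at least `min_flank`
    matching bases of each flank on either side of the repeat.
    """
    left_anchor = re.escape(left_flank[-min_flank:])
    right_anchor = re.escape(right_flank[:min_flank])
    pattern = left_anchor + "((?:" + re.escape(unit) + ")+)" + right_anchor
    match = re.search(pattern, read)
    if match is None:
        return None
    return len(match.group(1)) // len(unit)

def usable_lengths(reads, left_flank, unit, right_flank, min_reads=6):
    """Return array lengths from spanning reads, or None if fewer than min_reads."""
    lengths = []
    for read in reads:
        copies = repeat_copies(read, left_flank, unit, right_flank)
        if copies is not None:
            lengths.append(copies)
    return lengths if len(lengths) >= min_reads else None
```

A read that begins inside the repeat lacks the upstream anchor and is discarded, which is why coverage at a locus can be much higher than the number of reads contributing to its genotype.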

While other microsatellite genotypers report similar or better accuracy (HipSTR in particular [64]), RepeatSeq was specifically designed and validated using data from the 1000 Genomes Project. In addition, RepeatSeq performs local realignment and multiple sequence alignment of microsatellite arrays. The lobSTR genotyping program, introduced in 2012, does include analysis of the 1000 Genomes Project; however, evidence does not suggest that lobSTR assigns genotypes more accurately than RepeatSeq [65]. The exSTRa program is a new option which is more appropriate for detection of repeat expansion disorders [66]. Additional options include STRetch [67], TREDPARSE [68], and Dante [69].

Samples

We used existing data from the 1000 Genomes Project. Specifically, samples were downloaded from phase 3 of the 1000 Genomes Project: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/. All samples listed in the phase 3 index (20130502.phase3.exome.alignment.index) were included for analysis. Metadata for each sample was retrieved from the sample info file provided by the 1000 Genomes Project: 20130606_sample_info.txt. In total, 2,529 samples were included for analysis: 667 African (AFR), 502 European (EUR), 352 American (AMR), 514 East Asian (EAS), 494 South Asian (SAS).

Discovery and statistical testing of EBML

The 3,984 EBML presented in this work were discovered using a two-step procedure: (a) pairwise screening of super-populations, and (b) convergence of pairwise screens. In step (a) super-populations were screened pairwise: $\binom{5}{2} = 10$ total screens. The average number of tested microsatellites for each pairwise screen was 163,390 (Fig 6A). Step (b) only considered microsatellites passing false discovery: 3,713 on average (Fig 6B). In what follows, we describe steps (a) and (b) in detail.

Fig 6. Overall approach and statistical testing of genome wide microsatellites.

a, Number of microsatellites screened (red) and passing false discovery (blue) in pairwise comparisons of 5 super-populations. Details for one super-population (AFR) shown in bold. b, EBML for each super-population are identified using a two-step approach: step (a) pairwise screening against the remaining 4 super-populations, and step (b) convergence of the pairwise screens. Pairwise screens in step (a) involve 3 sub-steps: (i) construct a 2xN contingency table for each microsatellite; (ii) Fisher’s exact test to assign a p-value; (iii) Benjamini-Hochberg multiple testing correction to mitigate false discovery. Only microsatellites rejecting the null (no difference in genotype distribution) are examined in (b). EBML (yellow) are those that reject the null in all 4 pairwise screens: i.e. the set intersection of microsatellites identified in (b). We identify 3,171 EBML for AFR (shown), 450 for EUR, 1,494 for EAS, 647 for SAS, and 335 for AMR.

(a) Pairwise screening of super-populations. Pairwise comparisons involved three sub-steps: (i) construct a 2xN contingency table for each tested microsatellite; (ii) use Fisher’s exact test to assign a p-value; and (iii) use the Benjamini-Hochberg multiple testing correction to mitigate false discovery (Fig 6B). Each 2xN contingency table (sub-step i) was populated with the tally of samples in 2 super-populations over N different genotypes. At least 1 sample for each super-population was required; the number of genotypes (N) ranged from 1 (no variation) to over 20 (for highly polymorphic microsatellites). For each contingency table we performed a test of the null: no difference in genotype distribution. To assign a p-value (sub-step ii) we used Fisher’s exact test. We considered using a χ2 test or G-test; however, Fisher’s exact test has the advantage that it is equally valid on densely and sparsely populated tables. Various information metrics, which are not suited for testing a null, were not considered in this work. Only microsatellites passing false discovery (sub-step iii) were carried forward to step (b), described below. The numbers of microsatellites in each pairwise screen and the numbers passing false discovery are shown in Fig 6A. Details for step (a) and sub-steps (i-iii) are shown in Fig 6B.
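Sub-steps (i) to (iii) can be sketched in plain Python as follows. This is an illustrative sketch rather than the code used in the study: the exact test is implemented by brute-force enumeration of all tables with the observed margins (the Freeman-Halton extension of Fisher's exact test), which is only practical for small tables, and the false discovery level `alpha=0.05` is an assumption because the threshold is not stated above.

```python
from collections import Counter
from math import comb

def contingency_table(genotypes_a, genotypes_b):
    """Sub-step (i): tally samples of two super-populations over N genotypes."""
    labels = sorted(set(genotypes_a) | set(genotypes_b))
    count_a, count_b = Counter(genotypes_a), Counter(genotypes_b)
    return [[count_a[g] for g in labels], [count_b[g] for g in labels]]

def fisher_exact_2xn(table):
    """Sub-step (ii): exact p-value for a 2xN table by full enumeration."""
    row1, row2 = table
    cols = [a + b for a, b in zip(row1, row2)]
    n1 = sum(row1)

    def weight(row):
        # Unnormalized multivariate hypergeometric probability (an integer).
        w = 1
        for c, a in zip(cols, row):
            w *= comb(c, a)
        return w

    def first_rows(i, remaining):
        # Enumerate every first row compatible with the fixed margins.
        if i == len(cols) - 1:
            if remaining <= cols[i]:
                yield [remaining]
            return
        for a in range(min(cols[i], remaining) + 1):
            for rest in first_rows(i + 1, remaining - a):
                yield [a] + rest

    observed = weight(row1)
    extreme = sum(w for w in map(weight, first_rows(0, n1)) if w <= observed)
    return extreme / comb(sum(cols), n1)

def benjamini_hochberg(pvalues, alpha=0.05):
    """Sub-step (iii): flag tests rejected at false discovery rate alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected
```

For the highly polymorphic loci mentioned above (N over 20) full enumeration becomes expensive, so a production analysis would rely on an optimized network algorithm or a Monte Carlo approximation.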

(b) Convergence of pairwise screens. Each super-population received 4 pairwise screens from step (a); i.e. results of comparing microsatellite genotypes with the remaining 4 super-populations. Microsatellites for which the null was rejected in all 4 screens were considered EBML (Fig 6B). All intermediate calculations are provided in S2 Dataset.
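The convergence step amounts to a set intersection over the four screens that involve a given super-population. A minimal sketch follows; the dictionary layout (pairs stored as `frozenset` keys) and the function names are assumptions for illustration.

```python
from itertools import combinations

POPS = ["AFR", "EUR", "AMR", "EAS", "SAS"]

def converge(significant_by_pair):
    """Return EBML per super-population.

    `significant_by_pair` maps frozenset({pop_a, pop_b}) to the set of
    microsatellites passing false discovery in that pairwise screen.
    """
    ebml = {}
    for pop in POPS:
        screens = [
            significant_by_pair[frozenset((pop, other))]
            for other in POPS
            if other != pop
        ]
        # A locus must reject the null in all 4 screens involving `pop`.
        ebml[pop] = set.intersection(*screens)
    return ebml

def all_ebml(ebml):
    """Union across super-populations; a locus may be counted for several."""
    return set().union(*ebml.values())
```

Because one locus can qualify for more than one super-population, the per-population counts (3,171 + 450 + 1,494 + 647 + 335) exceed the 3,984 unique EBML.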

Annotation and enrichment analysis

Annotation and enrichment of the 3,984 EBML (see results) was performed with respect to (a) gene features, (b) known eSTRs, (c) matrisome genes, (d) selective sweep regions, (e) repeat lengths, (f) super-population pairs, and (g) amino acids. Analysis with respect to gene features (a) was designed to take into account differences in sample numbers: an intrinsic consequence of the whole exome sequenced samples used for this study. Subsequent analyses (b-g) used 2x2 contingency tables with χ2 test of independence: details are described below.

(a) Annotation and enrichment with respect to gene regions. We used cruzdb–a freely available package for Python–to identify microsatellite overlaps with known introns, exons, coding sequences (cds), 3'-UTRs, and 5'-UTRs. Which (if any) of these features are enriched among the EBML? To check for enrichments we constructed a 2x2 contingency table for each region based on classifications of microsatellites: EBML; not EBML; overlap with gene region; no overlap with gene region. We performed a χ2 test of the null: no association between EBML and the gene region (intron, exon, 3’-UTR, 5’-UTR, or cds). However, a single χ2 test of independence was insufficient since the samples used for microsatellite genotyping (see above) were whole exome sequenced. Consequently, coding and exonic microsatellites tended to be present in more samples than other gene features; on the contrary, intron and UTR microsatellites tended to be present in fewer samples. Thus, an intrinsic difference in the composition of tested microsatellites and EBML was anticipated.
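For reference, the χ2 test of independence on a 2x2 table can be written in closed form. The sketch below uses the Pearson statistic without continuity correction, which is an assumption (the text does not state whether a correction was applied), and the cell labels are hypothetical.

```python
from math import erfc, sqrt

def chi2_2x2(a, b, c, d):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Example layout: a = EBML overlapping the region, b = EBML not
    overlapping, c = non-EBML overlapping, d = non-EBML not overlapping.
    Returns (statistic, p_value) with 1 degree of freedom. All four
    margins must be non-zero.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value
```

The direction of an association (over- or under-representation) is not given by the p-value; it follows from comparing a·d with b·c.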

We therefore performed a series of enrichment tests beginning with the aforementioned χ2 test (this is the first iteration of the series). Successive iterations of the series removed microsatellites from both lists (all tested and EBML) and recomputed enrichment p-values for introns, exons, cds, 3'-UTRs, and 5'-UTRs. Removal of microsatellites in each iteration was determined by sample size: the second iteration removed microsatellites sequenced in 4 or fewer samples; the third iteration removed microsatellites sequenced in 8 or fewer samples. Minimum sample size doubled for 10 iterations.

A second series of tests used a variation of this approach to demonstrate robustness of enrichment analysis to procedure details. Again, each iteration of the series recomputed enrichment p-values for introns, exons, cds, 3'-UTRs, and 5'-UTRs. The first iteration considered microsatellites possessed by 2–3 samples. The second iteration considered microsatellites possessed by 4–7 samples. Window size doubled for 10 iterations. Conclusions drawn from enrichment analysis (see results) were the same regardless of approach. Both approaches suggest that intronic microsatellites are over-represented, and coding microsatellites are under-represented among those that are super-population specific.
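The two filtering schedules can be written out explicitly. In this sketch the function names are hypothetical, and "10 iterations" is read as 10 iterations in total for each series (for the first series this includes the unfiltered first iteration), which is an interpretation of the text rather than a stated fact.

```python
def cumulative_cutoffs(iterations=10):
    """First series: no filter, then drop loci seen in <= 4, <= 8, ... samples."""
    return [None] + [4 * 2 ** i for i in range(iterations - 1)]

def sample_windows(iterations=10):
    """Second series: loci seen in 2-3, 4-7, 8-15, ... samples."""
    return [(2 ** i, 2 ** (i + 1) - 1) for i in range(1, iterations + 1)]

def filter_cumulative(sample_counts, cutoff):
    """Keep loci sequenced in more than `cutoff` samples (None keeps all)."""
    return {ms: n for ms, n in sample_counts.items() if cutoff is None or n > cutoff}

def filter_window(sample_counts, window):
    """Keep loci whose sample count falls inside the inclusive window."""
    low, high = window
    return {ms: n for ms, n in sample_counts.items() if low <= n <= high}
```

Each filtered set would then be passed to the same 2x2 enrichment test, once for the tested microsatellites and once for the EBML.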

(b) Annotation and enrichment with respect to significant expression simple tandem repeats (eSTRs). A previous survey of 80,980 short tandem repeats (STRs) identified 2,060 that contribute to gene expression in humans (eSTRs) [9]; 13,259 of the 80,980 are present in our survey. We constructed a 2x2 contingency table based on classifications of the 13,259 shared repeats: EBML; not EBML; affects gene expression (eSTR); does not affect gene expression (other STR). We performed a χ2 test of the null: no association between EBML and eSTRs. The null was rejected (p = 1.53e-8); see results.

(c) Annotation and enrichment with respect to matrisome genes. A previous study identified 1,062 core matrisome and matrisome associated genes [44]. We constructed a 2x2 contingency table based on classifications of microsatellites: EBML; not EBML; core matrisome/matrisome associated gene; any other gene. We performed a χ2 test of the null: no association between EBML and matrisome genes. The null was rejected (p = 0.0023); see results.

(d) Annotation and enrichment with respect to selective sweep regions. Putative selective sweep regions were previously identified in 6 populations of the 1000 Genomes Project [34]: Three African populations (YRI, GWD, and LWK); one European population (CEU); one East Asian population (JPT); and one American population (PEL). A 2x2 contingency table was constructed based on classifications of microsatellites: EBML; not EBML; within sweep region; outside of sweep region. We performed a χ2 test of the null: no association between EBML and sweep regions. The null was rejected (p = 0.018); see results.

(e) Enrichment with respect to repeat length. Six 2x2 contingency tables were constructed based on classifications of microsatellites by super-population specificity and unit length. In each case we perform a χ2 test of the null: no association between EBML and unit length (see results).

(f) Enrichment with respect to specificity in two or more super-populations. Ten 2x2 contingency tables were constructed for the 3,984 EBML (one table for each pair of super-populations). Table entries enumerated the number of EBML in both super-populations, one super-population, and neither super-population, respectively. We performed a χ2 test of the null: no association between EBML in the pair of super-populations; see results.

(g) Enrichment with respect to amino acids for coding microsatellites. Contingency tables were constructed for coding microsatellites based on classification by super-population specificity and amino acids. We performed a χ2 test of the null: no association between EBML and amino acids; see results.

Principal component analysis

Smartpca with default parameters from EIGENSOFT [70] was used to perform principal component analysis (PCA) based on a covariance matrix. Details of EIGENSOFT can be found elsewhere. Briefly, analysis began with a matrix ($X_{mn}$) populated with microsatellite genotypes (m = 3,984) for each individual (n = 2,529). For PCA analysis of all tested microsatellites we used m = 316,167. Each matrix row (m) received two types of normalization. First, genotype (row) means are set to zero by subtracting the quantity $\mu_m = (\sum_n X_{mn})/N$ from each entry. Second, rows were normalized by dividing each entry by $\sqrt{p_m(1 - p_m)}$: see [70] for details. Once normalized, the covariance was computed for each pair (n·n) of individuals; the result is a covariance matrix ($C$, of size $n \times n$). Eigenvectors of the covariance matrix (typically ranked by their eigenvalue) represent the principal components of variation: see [70] for full details.
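The normalization and covariance steps can be sketched in plain Python. This is not smartpca: the estimate p_m = μ_m/2 is borrowed from the biallelic (0/1/2) genotype coding and is an assumption here, as is dividing the covariance by the number of loci; how microsatellite genotypes were encoded for EIGENSOFT is not specified above. Function names are hypothetical.

```python
from math import sqrt

def normalize_rows(X):
    """Center each locus (row) and scale by sqrt(p(1 - p)).

    X[m][n] is the genotype of individual n at locus m, coded 0/1/2
    (an assumed coding). Monomorphic rows are skipped.
    """
    normalized = []
    for row in X:
        mu = sum(row) / len(row)
        p = mu / 2  # assumed allele-frequency estimate
        if p <= 0 or p >= 1:
            continue
        scale = sqrt(p * (1 - p))
        normalized.append([(x - mu) / scale for x in row])
    return normalized

def covariance(M):
    """Return the n x n covariance between individuals (columns of M)."""
    n_loci, n_ind = len(M), len(M[0])
    return [
        [sum(M[m][i] * M[m][j] for m in range(n_loci)) / n_loci for j in range(n_ind)]
        for i in range(n_ind)
    ]
```

The eigenvectors of the resulting matrix, ordered by eigenvalue, would give the principal components; any standard linear algebra routine can be used for that step.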

Pathway analysis

Analysis of metabolic pathways harboring EBML was performed using the Reactome database and the Kyoto Encyclopedia of Genes and Genomes (KEGG). To begin, we used cruzdb to identify genes harboring the 3,984 EBML: 3,171 for AFR, 450 for EUR, 1,494 for EAS, 647 for SAS, and 335 for AMR. No gene was identified by cruzdb for 200 EBML. Next, the gene lists for each super-population were submitted to Reactome and KEGG databases: submission was performed programmatically using the clusterProfiler package in R [71]. The function compareCluster was used for visualization.

Sample availability, code availability, and URLs

Existing data was used for this study. All samples are freely available from the 1000 Genomes Project: http://www.internationalgenome.org/. PCA was performed with EIGENSOFT: https://github.com/DReichLab/EIG. Fourfold plots for matrisome and eSTR enrichments use the visualizing categorical data (vcd) package available in R. UpSet plot and 5-way Venn diagram use the UpSetR and Venn packages available in R, respectively. KEGG and Reactome pathway analysis was performed with the ReactomePA package and visualized with the clusterProfiler package. Additional R packages included corrplot, questionr, and analysis of biological data (abd). Sequencing reads, alignments, and microsatellite genotypes are available online: www.cagmdb.org/.

Supporting information

S1 Table. List of 3,984 EBML identified in this study.

(XLSX)

S2 Table. List of 3,171 microsatellites specific to African populations.

(XLSX)

S3 Table. List of 1,494 microsatellites specific to East Asian populations.

(XLSX)

S4 Table. List of 647 microsatellites specific to South Asian populations.

(XLSX)

S5 Table. List of 450 microsatellites specific to European populations.

(XLSX)

S6 Table. List of 335 microsatellites specific to American populations.

(XLSX)

S7 Table. List of 64 EBML also identified as eSTRs.

(XLSX)

S8 Table. List of 232 EBML in matrisome core/associated genes.

(XLSX)

S9 Table. List of 21 EBML coding repeats in selective sweep regions.

(XLSX)

S10 Table. List of 14 EBML coding repeats previously implicated in disease.

(XLSX)

S1 Dataset. Genotyping data for 3,984 EBML in 1000 Genomes Project samples.

(BZ2)

S2 Dataset. Pairwise comparisons of microsatellites in 5 super-populations.

(7Z)

S1 Code. Functions to access genotype tables for EBML.

(R)

Acknowledgments

NAK, LK, RA, PM, and HRG contributed to the conceptualization of this project, experimental design, and data analysis. NAK, LK, and PM contributed to the writing of this manuscript. NAK, LE, PS, LK, and PM were responsible for software writing and data analysis. NAK, AP, LE, LK, RA, PM, HRG, and RTV were responsible for manuscript preparation. All authors read and approved the final manuscript. We thank Liang Shan for assisting with statistical analysis.

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

This work was funded by a grant from the Bradley Engineering Foundation to the Edward Via College of Osteopathic Medicine and by a grant from Edward Via College of Osteopathic Medicine to NAK. HRG is the founder and co-owner of Orbit Genomics and received support in the form of salary. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. de Koning APJ, Gu WJ, Castoe TA, Batzer MA, Pollock DD. Repetitive Elements May Comprise Over Two-Thirds of the Human Genome. Plos Genet. 2011;7(12). ARTN e1002384 10.1371/journal.pgen.1002384 WOS:000299167900005.
  • 2. Ellegren H. Microsatellites: Simple sequences with complex evolution. Nature Reviews Genetics. 2004;5(6):435–45. 10.1038/nrg1348. WOS:000221759700014.
  • 3. Borstnik B, Pumpernik D. Tandem repeats in protein coding regions of primate genes. Genome Res. 2002;12(6):909–15. 10.1101/gr.138802
  • 4. Li YC, Korol AB, Fahima T, Nevo E. Microsatellites within genes: Structure, function, and evolution. Mol Biol Evol. 2004;21(6):991–1007. 10.1093/molbev/msh073 WOS:000221599300004.
  • 5. Murmann AE, Yu JD, Opal P, Peter ME. Trinucleotide Repeat Expansion Diseases, RNAi, and Cancer. Trends Cancer. 2018;4(10):684–700. 10.1016/j.trecan.2018.08.004
  • 6. Everett CM. Trinucleotide Repeat Disorders. Encyclopedia of Movement Disorders, Vol 3: Q-Z. 2010:290–6. WOS:000335076900087.
  • 7. Hannan AJ. TANDEM REPEAT POLYMORPHISMS Mediators of Genetic Plasticity, Modulators of Biological Diversity and Dynamic Sources of Disease Susceptibility. Adv Exp Med Biol. 2012;769:1–9. Book_Doi 10.1007/978-1-4614-5434-2. WOS:000333841400002.
  • 8. Hannan AJ. Tandem repeats mediating genetic plasticity in health and disease. Nat Rev Genet. 2018;19(5):286–98. 10.1038/nrg.2017.115
  • 9. Gymrek M, Willems T, Guilmatre A, Zeng HY, Markus B, Georgiev S, et al. Abundant contribution of short tandem repeats to gene expression variation in humans. Nat Genet. 2016;48(1):22–+. WOS:000367255300009. 10.1038/ng.3461
  • 10. Sawaya SM, Bagshaw AT, Buschiazzo E, Gemmell NJ. Promoter Microsatellites as Modulators of Human Gene Expression In: Hannan AJ, editor. Tandem Repeat Polymorphisms: Genetic Plasticity, Neural Diversity and Disease. New York, NY: Springer New York; 2012. p. 41–54.
  • 11. Vinces MD, Legendre M, Caldara M, Hagihara M, Verstrepen KJ. Unstable tandem repeats in promoters confer transcriptional evolvability. Science. 2009;324(5931):1213–6. Epub 2009/05/30. 10.1126/science.1170097
  • 12. Bacolla A, Wells RD. Non-B DNA Conformations as Determinants of Mutagenesis and Human Disease. Mol Carcinogen. 2009;48(4):273–85. 10.1002/mc.20507 WOS:000264918500002.
  • 13. Sonay TB, Carvalho T, Robinson MD, Greminger MP, Krutzen M, Comas D, et al. Tandem repeat variation in human and great ape populations and its impact on gene expression divergence. Genome Res. 2015;25(11):1591–9. WOS:000364355600001. 10.1101/gr.190868.115
  • 14. Bruford MW, Wayne RK. Microsatellites and Their Application to Population Genetic-Studies. Curr Opin Genet Dev. 1993;3(6):939–43. 10.1016/0959-437x(93)90017-J WOS:A1993MW46500017.
  • 15. Brinkmann B, Junge A, Meyer E, Wiegand P. Population genetic diversity in relation to microsatellite heterogeneity. Human Mutation. 1998;11(2):135–44. WOS:000071841800006.
  • 16. Nei M, Roychoudhury AK. Evolutionary Relationships of Human-Populations on a Global-Scale. Mol Biol Evol. 1993;10(5):927–43. WOS:A1993LX26600001. 10.1093/oxfordjournals.molbev.a040059
  • 17. Shriver MD, Smith MW, Jin L, Marcini A, Akey JM, Deka R, et al. Ethnic-affiliation estimation by use of population-specific DNA markers. Am J Hum Genet. 1997;60(4):957–64. WOS:A1997WT61400026.
  • 18. Edwards A, Hammond HA, Jin L, Caskey CT, Chakraborty R. Genetic variation at five trimeric and tetrameric tandem repeat loci in four human population groups. Genomics. 1992;12(2):241–53. Epub 1992/02/01.
  • 19. Bowcock AM, Ruizlinares A, Tomfohrde J, Minch E, Kidd JR, Cavallisforza LL. High-Resolution of Human Evolutionary Trees with Polymorphic Microsatellites. Nature. 1994;368(6470):455–7. WOS:A1994ND12000063. 10.1038/368455a0
  • 20. Zhivotovsky LA, Rosenberg NA, Feldman MW. Features of evolution and expansion of modern humans, inferred from genomewide microsatellite markers. Am J Hum Genet. 2003;72(5):1171–86. Epub 2003/04/12. 10.1086/375120
  • 21. Jorde LB, Rogers AR, Bamshad M, Watkins WS, Krakowiak P, Sung S, et al. Microsatellite diversity and the demographic history of modern humans. P Natl Acad Sci USA. 1997;94(7):3100–3. 10.1073/pnas.94.7.3100 WOS:A1997WR93000064.
  • 22. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155(2):945–59. WOS:000087475100039.
  • 23. Tishkoff SA, Reed FA, Friedlaender FR, Ehret C, Ranciaro A, Froment A, et al. The genetic structure and history of Africans and African Americans. Science. 2009;324(5930):1035–44. Epub 2009/05/02. 10.1126/science.1172257
  • 24. Algee-Hewitt BF, Edge MD, Kim J, Li JZ, Rosenberg NA. Individual Identifiability Predicts Population Identifiability in Forensic Microsatellite Markers. Curr Biol. 2016;26(7):935–42. Epub 2016/03/22. 10.1016/j.cub.2016.01.065
  • 25. Creanza N, Ruhlen M, Pemberton TJ, Rosenberg NA, Feldman MW, Ramachandran S. A comparison of worldwide phonemic and genetic variation in human populations. Proc Natl Acad Sci U S A. 2015;112(5):1265–72. Epub 2015/01/22. 10.1073/pnas.1424033112
  • 26. Santos NPC, Ribeiro-Rodrigues EM, Ribeiro-dos-Santos AKC, Pereira R, Gusmao L, Amorim A, et al. Assessing Individual Interethnic Admixture and Population Substructure Using a 48-Insertion-Deletion (INSEL) Ancestry-Informative Marker (AIM) Panel. Hum Mutat. 2010;31(2):184–90. WOS:000274461800009. 10.1002/humu.21159
  • 27. Friedlaender JS, Friedlaender FR, Reed FA, Kidd KK, Kidd JR, Chambers GK, et al. The genetic structure of Pacific Islanders. PLoS Genet. 2008;4(1):e19 Epub 2008/01/23. 10.1371/journal.pgen.0040019
  • 28. Genome of the Netherlands C. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet. 2014;46(8):818–25. Epub 2014/07/01. 10.1038/ng.3021
  • 29. Broman KW, Murray JC, Sheffield VC, White RL, Weber JL. Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet. 1998;63(3):861–9. Epub 1998/08/27. 10.1086/302011
  • 30. Pemberton TJ, DeGiorgio M, Rosenberg NA. Population Structure in a Comprehensive Genomic Data Set on Human Microsatellite Variation. G3-Genes Genom Genet. 2013;3(5):891–907. 10.1534/g3.113.005728 WOS:000319438700010.
  • 31. Santos C, Phillips C, Oldoni F, Amigo J, Fondevila M, Pereira R, et al. Completion of a worldwide reference panel of samples for an ancestry informative Indel assay. Forensic Sci Int-Gen. 2015;17:75–80. 10.1016/j.fsigen.2015.03.011 WOS:000355918400012.
  • 32. Willems T, Gymrek M, Highnam G, Genomes Project C, Mittelman D, Erlich Y. The landscape of human STR variation. Genome Res. 2014;24(11):1894–904. Epub 2014/08/20. 10.1101/gr.177774.114
  • 33. Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526(7571):75–+. WOS:000362095100037. 10.1038/nature15394
  • 34. Schrider DR, Kern AD. Soft Sweeps Are the Dominant Mode of Adaptation in the Human Genome. Molecular biology and evolution. 2017;34(8):1863–77. Epub 2017/05/10. 10.1093/molbev/msx154
  • 35. Fondon JW 3rd, Garner HR. Molecular origins of rapid and continuous morphological evolution. Proceedings of the National Academy of Sciences of the United States of America. 2004;101(52):18058–63. Epub 2004/12/15. 10.1073/pnas.0408118101
  • 36. Gemayel R, Vinces MD, Legendre M, Verstrepen KJ. Variable tandem repeats accelerate evolution of coding and regulatory sequences. Annual review of genetics. 2010;44:445–77. Epub 2010/09/03. 10.1146/annurev-genet-072610-155046
  • 37. Kashi Y, King DG. Simple sequence repeats as advantageous mutators in evolution. Trends in genetics: TIG. 2006;22(5):253–9. Epub 2006/03/29. 10.1016/j.tig.2006.03.005
  • 38. Haasl RJ, Payseur BA. Microsatellites as targets of natural selection. Molecular biology and evolution. 2013;30(2):285–98. Epub 2012/10/30. 10.1093/molbev/mss247
  • 39. Katti MV, Ranjekar PK, Gupta VS. Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol Biol Evol. 2001;18(7):1161–7. WOS:000169846400002. 10.1093/oxfordjournals.molbev.a003903
  • 40. Ikeuchi T, Koide R, Tanaka H, Onodera O, Igarashi S, Takahashi H, et al. Dentatorubral-Pallidoluysian Atrophy—Clinical-Features Are Closely-Related to Unstable Expansions of Trinucleotide (Cag) Repeat. Ann Neurol. 1995;37(6):769–75. WOS:A1995RD04000009. 10.1002/ana.410370610
  • 41. Komure O, Sano A, Nishino N, Yamauchi N, Ueno S, Kondoh K, et al. DNA Analysis in Hereditary Dentatorubral-Pallidoluysian Atrophy—Correlation between Cag Repeat Length and Phenotypic Variation and the Molecular-Basis of Anticipation. Neurology. 1995;45(1):143–9. WOS:A1995QB88600028. 10.1212/wnl.45.1.143
  • 42. Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res. 2018;46(D1):D649–D55. WOS:000419550700098. 10.1093/nar/gkx1132
  • 43. Du JL, Yuan ZF, Ma ZW, Song JZ, Xie XL, Chen YL. KEGG-PATH: Kyoto encyclopedia of genes and genomes-based pathway analysis using a path analysis model. Mol Biosyst. 2014;10(9):2441–7. WOS:000340437200017. 10.1039/c4mb00287c
  • 44. Naba A, Clauser KR, Hoersch S, Liu H, Carr SA, Hynes RO. The Matrisome: In Silico Definition and In Vivo Characterization by Proteomics of Normal and Tumor Extracellular Matrices. Mol Cell Proteomics. 2012;11(4). ARTN M111.014647 10.1074/mcp.M111.014647. WOS:000302786500016.
  • 45. von Pein F, Valkkila M, Schwarz R, Morcher M, Klima B, Grau A, et al. Analysis of the COL3A1 gene in patients with spontaneous cervical artery dissections. J Neurol. 2002;249(7):862–6. WOS:000177159700012. 10.1007/s00415-002-0745-x
  • 46. Kizawa H, Kou I, Iida A, Sudo A, Miyamoto Y, Fukuda A, et al. An aspartic acid repeat polymorphism in asporin inhibits chondrogenesis and increases susceptibility to osteoarthritis. Nat Genet. 2005;37(2):138–44. WOS:000226690100019. 10.1038/ng1496
  • 47. Liu RX, Yuan XL, Yu J, Quan Q, Meng HY, Wang C, et al. An updated meta-analysis of the asporin gene D-repeat in knee osteoarthritis: effects of gender and ethnicity. J Orthop Surg Res. 2017;12. ARTN 148 10.1186/s13018-017-0647-3. WOS:000412896000001.
  • 48. Queitsch C, Carlson KD, Girirajan S. Lessons from Model Organisms: Phenotypic Robustness and Missing Heritability in Complex Disease. Plos Genet. 2012;8(11). ARTN e1003041 10.1371/journal.pgen.1003041. WOS:000311891600029.
  • 49. Press MO, Carlson KD, Queitsch C. The overdue promise of short tandem repeat variation for heritability. Trends Genet. 2014;30(11):504–12. WOS:000344046400007. 10.1016/j.tig.2014.07.008
  • 50. Hannan AJ. Tandem repeat polymorphisms: modulators of disease susceptibility and candidates for 'missing heritability'. Trends Genet. 2010;26(2):59–65. WOS:000274987400004. 10.1016/j.tig.2009.11.008
  • 51. Chen CH, Chuang TJ, Liao BY, Chen FC. Scanning for the signatures of positive selection for human-specific insertions and deletions. Genome Biol Evol. 2009;1:415–9. Epub 2009/01/01. 10.1093/gbe/evp041
  • 52. Schlenke TA, Begun DJ. Strong selective sweep associated with a transposon insertion in Drosophila simulans. Proc Natl Acad Sci U S A. 2004;101(6):1626–31. Epub 2004/01/28. 10.1073/pnas.0303793101
  • 53. Quilez J, Guilmatre A, Garg P, Highnam G, Gymrek M, Erlich Y, et al. Polymorphic tandem repeats within gene promoters act as modifiers of gene expression and DNA methylation in humans. Nucleic Acids Res. 2016;44(8):3750–62. WOS:000376389000030. 10.1093/nar/gkw219
  • 54. Handsaker RE, Korn JM, Nemesh J, McCarroll SA. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet. 2011;43(3):269–U126. WOS:000287693800020. 10.1038/ng.768
  • 55. Hindorff LA, Bonham VL, Brody LC, Ginoza MEC, Hutter CM, Manolio TA, et al. Prioritizing diversity in human genomics research. Nat Rev Genet. 2018;19(3):175–+. WOS:000425031400008. 10.1038/nrg.2017.89
  • 56. Sherman RM, Forman J, Antonescu V, Puiu D, Daya M, Rafaels N, et al. Assembly of a pan-genome from deep sequencing of 910 humans of African descent (vol 51, pg 30, 2018). Nat Genet. 2019;51(2):364–. 10.1038/s41588-018-0335-1 WOS:000457314300025.
  • 57. Highnam G, Franck C, Martin A, Stephens C, Puthige A, Mittelman D. Accurate human microsatellite genotypes from high-throughput resequencing data using informed error profiles. Nucleic Acids Res. 2013;41(1). ARTN e32 10.1093/nar/gks981. WOS:000312889900032.
  • 58. Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: Accurate indel calls from short-read data. Genome Res. 2011;21(6):961–73. WOS:000291153400016. 10.1101/gr.112326.110
  • 59. Mose LE, Wilkerson MD, Hayes DN, Perou CM, Parker JS. ABRA: improved coding indel detection via assembly-based realignment. Bioinformatics. 2014;30(19):2813–5. WOS:000343082900018. 10.1093/bioinformatics/btu376
  • 60. Lu D, Xu S. Principal component analysis reveals the 1000 Genomes Project does not sufficiently cover the human genetic diversity in Asia. Front Genet. 2013;4:127 Epub 2013/07/13. 10.3389/fgene.2013.00127
  • 61. Tae H, Kim DY, McCormick J, Settlage RE, Garner HR. Discretized Gaussian mixture for genotyping of microsatellite loci containing homopolymer runs. Bioinformatics. 2014;30(5):652–9. WOS:000332259300009. 10.1093/bioinformatics/btt595
  • 62. Tae H, McMahon KW, Settlage RE, Bavarva JH, Garner HR. ReviSTER: an automated pipeline to revise misaligned reads to simple tandem repeats. Bioinformatics. 2013;29(14):1734–41. WOS:000321747800004. 10.1093/bioinformatics/btt277
  • 63. Cordaux R, Batzer MA. The impact of retrotransposons on human genome evolution. Nature reviews Genetics. 2009;10(10):691–703. 10.1038/nrg2640
  • 64. Willems T, Zielinski D, Yuan J, Gordon A, Gymrek M, Erlich Y. Genome-wide profiling of heritable and de novo STR variations. Nat Methods. 2017;14(6):590–+. WOS:000402291800021. 10.1038/nmeth.4267
  • 65. Gymrek M, Golan D, Rosset S, Erlich Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. 2012;22(6):1154–62. WOS:000304728100017. 10.1101/gr.135780.111
  • 66. Tankard RM, Bennett ME, Degorski P, Delatycki MB, Lockhart PJ, Bahlo M. Detecting Expansions of Tandem Repeats in Cohorts Sequenced with Short-Read Sequencing Data. Am J Hum Genet. 2018;103(6):858–73. WOS:000452535600003. 10.1016/j.ajhg.2018.10.015
  • 67. Dashnow H, Lek M, Phipson B, Halman A, Sadedin S, Lonsdale A, et al. STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biol. 2018;19. ARTN 121 10.1186/s13059-018-1505-2. WOS:000442375800001.
  • 68. Tang HB, Kirkness EF, Lippert C, Biggs WH, Fabani M, Guzman E, et al. Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes. Am J Hum Genet. 2017;101(5):700–15. WOS:000414251600004. 10.1016/j.ajhg.2017.09.013
  • 69. Budis J, Kucharik M, Duris F, Gazdarica J, Zrubcova M, Ficek A, et al. Dante: genotyping of known complex and expanded short tandem repeats. Bioinformatics. 2018. Epub 2018/09/12. 10.1093/bioinformatics/bty791
  • 70.Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics. 2006;38(8):904–9. 10.1038/ng1847 . [DOI] [PubMed] [Google Scholar]
  • 71.Yu GC, Wang LG, Han YY, He QY. clusterProfiler: an R Package for Comparing Biological Themes Among Gene Clusters. Omics. 2012;16(5):284–7. WOS:000303653300007. 10.1089/omi.2011.0118 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Arnar Palsson

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

19 Sep 2019

PONE-D-19-21466

Abundance of Super-Population Specific Microsatellites in Human Gene Regions

PLOS ONE

Dear Dr. Kinney,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The reviewers highlight multiple points that need consideration.

This involves framing of the study (what are the central questions), definitions of terms (super or just specific), methodological concerns (exome sequencing origin?) and questions of interpretation.

We would appreciate receiving your revised manuscript by Nov 03 2019 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Arnar Palsson, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please amend either the author names on the online submission form (via Edit Submission) or in the manuscript so that they are identical.

Additional Editor Comments (if provided):

The reviewers highlight multiple points that need consideration.

This involves framing of the study (what are the central questions), definitions of terms (super or just specific), methodological concerns (exome sequencing origin?) and questions of interpretation.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Kinney et al. investigate the variation of more than 300,000 microsatellites across five super-populations of human ethnic groups using genotypes called from 100bp Illumina reads as part of the 1000 Genomes Project. The authors identify 3,984 microsatellites they term super-population specific (SPS). Subsequent analyses of SPS microsatellites reveal: (1) PCA on the original set of 300,000+ microsatellites and the reduced set of SPS microsatellites produce highly similar patterns on PC1-PC2 biplots; (2) a significant fraction of SPS microsatellites overlap with a set of previously identified eSTRs – microsatellites associated with differences in gene expression; (3) genes harboring SPS microsatellites are enriched for extra-cellular matrix pathways, and; (4) a significant number of SPS microsatellites are coincidental with sequences previously implicated in selective sweeps.

Major Concerns

1. The motivation for identifying SPS microsatellites is never explicitly stated. The authors simply tell us they have done so. The closest we get to a motivating statement is in the abstract: “We discover 3,984 super-population specific microsatellites and investigate their potential biological and clinical significance.” But the connection between SPS microsatellites and their functional significance seems hinted at rather than explicitly reasoned through. We’re told, for example, that the SPS set is enriched for eSTRs under a section heading of “Numerous SPS microsatellites affect gene expression.” Yet, we never get explicit statements relating to the importance of this finding. We are told there is a significant overlap, that it may or may not relate to ethnic difference in disease prevalence, and that is it. While I applaud the Introduction’s excellent summary of the state of research into functional microsatellites, the reader is left to connect dots that are too far apart.

The four main findings emphasized by the authors – particularly (2) and (4) above – suggest the main motivation for identifying SPS microsatellites is that differences in genotype frequencies among super populations are potentially explained by the action of natural selection … and are therefore of functional importance. However, the manner in which the paper is written makes it feel like we should just know why SPS microsatellites are of importance and that we should just know why the statistically significant findings of the authors prove the worth of SPS microsatellites.

2. The definition of a super-population specific microsatellite is never given. While the method for finding them is explained well and in detail in the Methods section, a succinct, intuitive definition should be given in the abstract, as well as the Introduction and/or Results. For example, the authors might write, “An SPS microsatellite is a microsatellite in which the genotype frequencies of at least one super-population differs from those of all four of the remaining super-populations.” This is still a little clunky, but something to that effect would have made my reading of the paper much easier from the beginning.

3. Although differences in genotype frequencies may in fact stem from the action of natural selection due to different selective pressures at the continental scale, the authors never mention alternative explanations. An obvious alternative cause of SPS patterns of microsatellite variation is the combination of high mutability, genetic drift, and isolation-by-distance. These population genetic factors could easily lead to substantial differences in populations separated by vast distances. Indeed, the main interest in microsatellites has long been their potential to diagnose population structure. The authors do mention the potential use of microsatellites for population structure analyses near the beginning of the Discussion. However, the Discussion should include an honest appraisal of alternative evolutionary explanations for the development of SPS loci.

4. Another example of hinting at the importance of the findings rather than appraising them in a comprehensive and objective manner is the overlap of SPS microsatellites with genomic regions previously identified as potential regions where natural selection acted. The authors write that a selective sweep is “due to a novel allele [that] reduces nearby genetic variation.” However, natural selection on a microsatellite is a very different animal than selection on a SNP. Selective sweeps of linked genetic variation result from the emergence of that novel allele on a specific genetic background. If selection actually acts on a microsatellite locus, a novel allele is difficult to define. Due to the high mutability of microsatellite loci, the same favored repeat length can arise many times, linked to many different combinations of linked SNPs. Therefore, it is not clear that selection on a microsatellite locus would lead to the classical patterns associated with selective sweeps. The authors do mention that the overlaps are with soft sweep regions, which themselves leave much messier signatures of natural selection than hard sweeps on a new variant. In my opinion, however, this still overlooks the inherent problem of attempting to link selection on microsatellites with signatures of selection based on theory that assumes a SNP is the target of selection.

Indeed, we might think of this in the reverse. Isn’t it possible that selection on a single nucleotide variant (that may only offer a selective advantage in only one or a subset of super populations) would cause patterns of variation at linked microsatellites to become different from each other in different populations? In other words, the differences in microsatellite variation point to selection, but selection on a SNP. This is exactly how selection on SNPs in cis-regulatory regions that lead to lactase persistence in some African populations was identified.

Minor Concerns

1. p. 2, ln 44: The authors compare array length variation, not “array length mutations”

2. (throughout) Principal component analysis, not Principle component analysis.

3. (throughout) Strange use of colons. For the most part, the authors seem to use them like an em dash would be used.

4. p. 6, ln 116: just curious if higher PCs (e.g., PC3 or PC4) lead to separation of AMR and SAS super populations. More importantly, how much variation do the first two PCs explain?

5. p. 7, lns 143-146. The authors state that eSTRs are commonly found in regions subject to purifying selection, then define a sweep in terms of positive selection.

6. p. 8, ln 157+ Confused by the genomic compartments used. UTRs are technically part of exons, but not the CDS. So, (1) you might make a distinction between UTRs and full+partial exons that make up the CDS … but then why a separate “coding” compartment? … or (2) you might make a distinction between UTRs and CDS … but then why a separate “exon” compartment, since UTRs and CDS comprise exons. Is there overlap between the compartments here? – i.e., are some STRs double-counted?

7. p 11., ln 197 Reference to support the claim that hydrophobic amino acids are more likely to have deleterious effects on protein function.

8. p 14., 1st paragraph Why devote a whole paragraph of a short discussion section to the idea that we have an incomplete picture of human genetic variation? Isn’t the 1000 Genomes Project (with its 26 individual populations) a big step in the right direction? These days, the reference genome is simply a jumping off point for analyses that do make use of data sets that include much more comprehensive coverage of human genetic variation.

Reviewer #2: The manuscript is interesting but several issues need to be addressed before publication in PLOSONE.

1. What were the default parameters of the perl script the authors used for finding the repeats? Did they use a constant length cutoff, or did it change with the repeat unit? Studies have reported way more than 1.6 million STRs in humans, and knowing the exact parameters the authors used will help readers understand why they had such a small set of loci to start with.

2. Line 112 and other relevant places should be modified to clearly convey that they have used existing data and not done the sequencing themselves.

3. Link to eSTRs - Why did the authors choose to use all 80,980 loci rather than the 2060 loci that showed a significant effect on expression as reported by Gymrek et al. ? How many common loci are there between the 2060 eSTRs and 3984 SPS?

4. Authors are advised to elaborate on the p value testing they did to show that 64 common out of 3984 total is significant. In particular, I would like to see if they took into account the bias towards coding regions that is introduced by whole exome sequencing. As can be seen from their study, the loci considered by Gymrek et al were predominantly proximal to genic regions. And as the current study preliminarily uses exome data, the chances of overlap are higher than what is true if whole genome is considered as background. The authors must incorporate this into calculating the significance.

5. Regarding the analysis of over/under representation by repeat unit length, if a larger length cutoff was used for larger repeat unit lengths, that might explain why tetra to hexamers are underrepresented in their analyses.

6. The authors call these microsatellites super-population specific, yet there are a lot of overlaps of these loci indicating they are merely highly polymorphic. Is it really correct to call them “specific”? On a related note, what happens to the analyses if they are done only with the unique loci (~2500 loci added together)?
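
The dependence on background described in point 4 can be made concrete with a hypergeometric tail probability. The sketch below is illustrative only and is not the authors' analysis: `overlap_pvalue` is a hypothetical helper, the counts 64, 3,984 and 2,060 are taken from the comments above, and the two background sizes (80,980, the number of loci quoted in point 3, and 300,000, a round stand-in for the full set of genotyped loci) are assumptions chosen to show how the answer moves with the background.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def overlap_pvalue(background, set_a, set_b, shared):
    """Upper-tail hypergeometric probability P(X >= shared).

    background: number of loci that could have landed in either set
    set_a, set_b: sizes of the two sets drawn from that background
    shared: observed number of loci present in both sets
    """
    upper = min(set_a, set_b)
    # Overlaps below set_a + set_b - background are impossible.
    lower = max(shared, set_a + set_b - background)
    log_total = log_comb(background, set_a)
    tail = 0.0
    for x in range(lower, upper + 1):
        log_p = (log_comb(set_b, x)
                 + log_comb(background - set_b, set_a - x)
                 - log_total)
        tail += exp(log_p)
    return min(tail, 1.0)


# Hypothetical backgrounds: the same observed overlap is tested against each.
for background in (80_980, 300_000):
    print(background, overlap_pvalue(background, 3_984, 2_060, 64))
```

A background restricted to loci near genes raises the overlap expected by chance, so the same 64 shared loci look less surprising than they do against a genome-wide background.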

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2019 Dec 12;14(12):e0225216. doi: 10.1371/journal.pone.0225216.r002

Author response to Decision Letter 0


3 Oct 2019

(images included - see the attached document)

Dr. Arnar Palsson, Ph.D.

Academic Editor of Plos One

Dear Dr. Palsson:

We are pleased that our manuscript entitled “Abundance of Ethnically Biased Microsatellites in Human Gene Regions” is accepted for publication in Plos One. In the pages that follow we respond point-by-point to each of the reviewer concerns. The title of the original manuscript was “Abundance of Super-Population Specific Microsatellites in Human Gene Regions”. This change stems from reviewer concerns regarding definition of terms. It is more accurate to call these regions biased rather than specific; and, ethnicity has been used previously as a surrogate for the less familiar term super-population. We have addressed a number of additional weaknesses and look forward to sharing our revised manuscript with the scientific community. There are no related articles published or under consideration. If there are any specific questions that need a conversation, my cell is 301 338-1181.

Sincerely,

Dr. Nicholas Kinney

Primary Care Research Network and the Center for Bioinformatics and Genetics, VCOM

Professor of Biomedicine, VCOM

Reviewer #1:

Kinney et al. investigate the variation of more than 300,000 microsatellites across five super-populations of human ethnic groups using genotypes called from 100bp Illumina reads as part of the 1000 Genomes Project. The authors identify 3,984 microsatellites they term super-population specific (SPS). Subsequent analyses of SPS microsatellites reveal: (1) PCA on the original set of 300,000+ microsatellites and the reduced set of SPS microsatellites produce highly similar patterns on PC1-PC2 biplots; (2) a significant fraction of SPS microsatellites overlap with a set of previously identified eSTRs – microsatellites associated with differences in gene expression; (3) genes harboring SPS microsatellites are enriched for extra-cellular matrix pathways, and; (4) a significant number of SPS microsatellites are coincidental with sequences previously implicated in selective sweeps.

Major Concerns

1a. The motivation for identifying SPS microsatellites is never explicitly stated. The authors simply tell us they have done so. The closest we get to a motivating statement is in the abstract: “We discover 3,984 super-population specific microsatellites and investigate their potential biological and clinical significance.” But the connection between SPS microsatellites and their functional significance seems hinted at rather than explicitly reasoned through.

We agree that the motivation is tacitly assumed when it should be explicitly stated in the abstract; the reviewer’s thorough suggestions have enabled us to address this weakness. The revised manuscript now follows a clear thread from the beginning of the abstract which begins by highlighting the emerging functional relevance of some microsatellites:

"[Microsatellites] have been used for decades as putatively neutral markers to study the genetic structure of diverse human populations. However, recent studies have demonstrated that some microsatellites contribute to gene expression, cis heritability, and phenotype."

Next, we emphasize the motivation for the current study and approach:

"As a corollary, some microsatellites may contribute to differential gene expression and RNA/protein structure stability in distinct human populations. To test this hypothesis we investigate genotype frequencies, functional relevance, and adaptive potential of microsatellites in five super-populations (ethnicities) drawn from the 1000 Genomes Project."

Although we have addressed the reviewer's comment economically, we think the improvement is significant.

1b. We’re told, for example, that the SPS set is enriched for eSTRs under a section heading of “Numerous SPS microsatellites affect gene expression.” Yet, we never get explicit statements relating to the importance of this finding. We are told there is a significant overlap, that it may or may not relate to ethnic difference in disease prevalence, and that is it. While I applaud the Introduction’s excellent summary of the state of research into functional microsatellites, the reader is left to connect dots that are too far apart.

We agree that this is an important section that was given a weak heading and interpretation. The new section heading helps the reader anticipate the importance of the overlap between SPS microsatellites (now referred to as EBML), eSTRs, and selective sweeps:

"Correspondence between EBML, eSTRs, and selective sweeps suggest adaptive potential"

We have added a paragraph at the end of this important section that helps the reader interpret the correspondence between EBML, eSTRs, and selective sweeps. We begin by reiterating the definition of each region and provide a clear statement of significance.

"Overall, these findings suggest a degree of mutual overlap between EBML, eSTRs, and selective sweeps in the human genome. The interpretation of this overlap is built up from the definition of each region. Briefly, each eSTR has the capacity to affect gene expression; and, for each EBML at least one ethnicity has a distribution of genotypes statistically different from the remaining four. Thus, any correspondence between the two implies differential gene expression in one or more human super-populations."

We build upon this interpretation by offering alternative explanations based on high mutability, genetic drift, isolation-by-distance, and natural selection. The finding that EBML overlap with selective sweeps now has a logical motivation; in particular, it is a necessary way to show that microsatellites have adaptive potential:

"These differences likely stem from a complex combination of high mutability, genetic drift, isolation-by-distance, and natural selection. Drift, mutability, and isolation are inevitable; but, the situation is less clear for natural selection. The correspondence between EBML and selective sweeps suggests they have adaptive potential and may be targeted by natural selection"

Alternative explanations are provided and we offer a measured overall conclusion that reiterates the section heading:

"On the other hand, it’s possible that EBML are simply in linkage disequilibrium with targets of selection such as nearby SNPs. Overall it seems likely that some EBML – particularly the ones that overlap eSTRs – have bona fide adaptive potential."

1c. The four main findings emphasized by the authors – particularly (2) and (4) above – suggest the main motivation for identifying SPS microsatellites is that differences in genotype frequencies among super populations are potentially explained by the action of natural selection … and are therefore of functional importance. However, the manner in which the paper is written makes it feel like we should just know why SPS microsatellites are of importance and that we should just know why the statistically significant findings of the authors prove the worth of SPS microsatellites.

We agree that (2) and (4) are highly non-trivial, complementary, and represent the main findings of our work. This was not made clear enough to the reader. We have added passages throughout the manuscript to clarify the argument and recognize alternative explanations. The first section of our results now includes the following interpretation:

"Identifying EBML is an easier task than explaining their origin. EBML could emerge due to high microsatellite mutability, genetic drift, isolation-by-distance, or natural selection. The action of natural selection – investigated in the next section – would suggest adaptive potential."

We have identified SPS microsatellites (now referred to as EBML) in the first section; this should cue the reader that the manuscript will be turning to questions of their significance. The second section of the results builds a case for adaptive potential and suggests alternative explanations (see the new interpretation paragraph on page 9 lines 179-190):

"Overall it seems likely that some EBML – particularly the ones that overlap eSTRs – have bona fide adaptive potential."

2. The definition of a super-population specific microsatellite is never given. While the method for finding them is explained well and in detail in the Methods section, a succinct, intuitive definition should be given in the abstract, as well as the Introduction and/or Results. For example, the authors might write, “An SPS microsatellite is a microsatellite in which the genotype frequencies of at least one super-population differs from those of all four of the remaining super-populations.” This is still a little clunky, but something to that effect would have made my reading of the paper much easier from the beginning.

We have formulated a one-sentence minimal definition and now refer to these regions as ethnically biased microsatellite loci (EBML). The abstract now states the definition of these regions:

"We discover 3,984 ethnically biased microsatellite loci (EBML); for each EBML at least one ethnicity has genotype frequencies statistically different from the remaining four."

The same definition is provided in the author summary, introduction, and results. Changing the name of these regions is motivated by a need for clarity. As pointed out by reviewer 2 it is not appropriate to call these regions “specific”. Indeed, the underlying genotype distributions may overlap and the same region may be significant in more than one super-population; calling these regions biased is more accurate. Ethnicity has been used previously as a surrogate for the less familiar term super-population.
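
The one-sentence definition can be read as a simple predicate over pairwise comparisons. The sketch below is an illustration under stated assumptions, not the procedure from the Methods: it assumes that some two-sample test has already produced a p-value for every pair of super-populations, that "different from the remaining four" means different from each of the four individually, and that a single threshold `alpha` is used with no multiple-testing correction. The function names and the example p-values are hypothetical.

```python
from itertools import combinations

# 1000 Genomes super-population codes.
POPULATIONS = ("AFR", "AMR", "EAS", "EUR", "SAS")


def biased_populations(pairwise_p, alpha=0.05):
    """Return the populations that differ from all four others at a locus.

    pairwise_p maps frozenset({pop_a, pop_b}) to the p-value of a
    two-sample comparison of genotype frequencies (test left unspecified).
    """
    biased = []
    for pop in POPULATIONS:
        others = [p for p in POPULATIONS if p != pop]
        if all(pairwise_p[frozenset((pop, o))] < alpha for o in others):
            biased.append(pop)
    return biased


def is_ebml(pairwise_p, alpha=0.05):
    """A locus counts as an EBML if at least one population is biased."""
    return bool(biased_populations(pairwise_p, alpha))


# Hypothetical locus: AFR differs from every other population, while the
# remaining populations are indistinguishable from one another.
example = {frozenset(pair): 0.5 for pair in combinations(POPULATIONS, 2)}
for other in ("AMR", "EAS", "EUR", "SAS"):
    example[frozenset(("AFR", other))] = 1e-6

print(biased_populations(example))
```

Because the function returns a list rather than a single label, the same locus can be flagged for more than one super-population, which is the overlap that motivates "biased" over "specific".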

3. Although differences in genotype frequencies may in fact stem from the action of natural selection due to different selective pressures at the continental scale, the authors never mention alternative explanations. An obvious alternative cause of SPS patterns of microsatellite variation is the combination of high mutability, genetic drift, and isolation-by-distance. These population genetic factors could easily lead to substantial differences in populations separated by vast distances. Indeed, the main interest in microsatellites has long been their potential to diagnose population structure. The authors do mention the potential use of microsatellites for population structure analyses near the beginning of the Discussion. However, the Discussion should include an honest appraisal of alternative evolutionary explanations for the development of SPS loci.

Agreed, we have made a substantial revision to the discussion to address this weakness. Specifically, we have added a new paragraph that addresses the fundamental mechanisms of population differentiation (high microsatellite mutability, genetic drift, isolation-by-distance, and natural selection). We reiterate which of our results suggest some microsatellites have adaptive potential while recognizing alternatives:

"these results provide necessary but not sufficient indication that microsatellites have adaptive potential and are targets of positive selection. In particular, it is unclear if selection acting directly on microsatellites, with their enhanced mutability and extended sequence motifs, leave classical signatures of a selective sweep. However, we notice that, in a broad sense, a new microsatellite allele is an indel (insertion or deletion of (a) motif(s)), and indels or even transposable element insertions have been known to be targets of selection associated with selective sweeps (51,52)"

This paragraph (in addition to the new passages in the results) should clarify the message of the paper and offer further lines of inquiry to researchers in the field. In addition, we have removed a discussion paragraph regarding codon and amino acid usage in lower eukaryotes. While interesting, this passage detracts from the focus of the paper which is human genetics.

4. Another example of hinting at the importance of the findings rather than appraising them in a comprehensive and objective manner is the overlap of SPS microsatellites with genomic regions previously identified as potential regions where natural selection acted. The authors write that a selective sweep is “due to a novel allele [that] reduces nearby genetic variation.” However, natural selection on a microsatellite is a very different animal than selection on a SNP. Selective sweeps of linked genetic variation result from the emergence of that novel allele on a specific genetic background. If selection actually acts on a microsatellite locus, a novel allele is difficult to define. Due to the high mutability of microsatellite loci, the same favored repeat length can arise many times, linked to many different combinations of linked SNPs. Therefore, it is not clear that selection on a microsatellite locus would lead to the classical patterns associated with selective sweeps. The authors do mention that the overlaps are with soft sweep regions, which themselves leave much messier signatures of natural selection than hard sweeps on a new variant. In my opinion, however, this still overlooks the inherent problem of attempting to link selection on microsatellites with signatures of selection based on theory that assumes a SNP is the target of selection.

Indeed, we might think of this in the reverse. Isn’t it possible that selection on a single nucleotide variant (that may only offer a selective advantage in only one or a subset of super populations) would cause patterns of variation at linked microsatellites to become different from each other in different populations? In other words, the differences in microsatellite variation point to selection, but selection on a SNP. This is exactly how selection on SNPs in cis-regulatory regions that lead to lactase persistence in some African populations was identified.

The reviewer is certainly right: the association between microsatellite alleles and selective sweep regions does not necessarily imply that a microsatellite allele drives a selective sweep. The alternative scenario, whereby a neighboring SNP drives the selective sweep while the microsatellite hitchhikes to high frequency as a result of its linkage to the SNP, is equally, if not more, likely, and we now clarify it in the manuscript:

"…it’s possible that EBML are simply in linkage disequilibrium with targets of selection such as nearby SNPs. Overall it seems likely that some EBML – particularly the ones that overlap eSTRs – have bona fide adaptive potential."

The reviewer also brings up the important issue of whether or not microsatellites are expected to produce classical selective sweep patterns. We believe that as long as a new microsatellite allele increases fitness, positive selection acting on it will leave a signature similar to a SNP-based sweep. In a broader sense, at the sequence polymorphism level, a new microsatellite allele is an indel (insertion or deletion), and indels or even transposable element insertions have been known to be targets of selection associated with selective sweeps (PMID: 20333210, 14745026). The higher mutation rate of microsatellites may actually increase the contribution from hard selective sweeps that typically originate from newly emerging mutations, relative to soft sweeps that leverage the pre-existing standing genetic variation. We clarify this in the Discussion as follows:

"On the other hand, we recognize that these results provide necessary but not sufficient indication that microsatellites have adaptive potential and are targets of positive selection. In particular, it is unclear if selection acting directly on microsatellites, with their enhanced mutability and extended sequence motifs, leave classical signatures of a selective sweep. However, we notice that, in a broad sense, a new microsatellite allele is an indel (insertion or deletion of (a) motif(s)), and indels or even transposable element insertions have been known to be targets of selection associated with selective sweeps (51,52)."

The additional results and discussion paragraph should address the reviewer’s concerns and have greatly improved the manuscript.

Minor Concerns

1. p. 2, ln 44: The authors compare array length variation, not “array length mutations”

Fixed.

2. (throughout) Principal component analysis, not Principle component analysis.

Fixed.

3. (throughout) Strange use of colons. For the most part, the authors seem to use them like an em dash would be used.

We use colons to add information to a prior independent clause or to introduce a list after an independent clause. This recommendation comes from a popular writer’s guide by June Casagrande entitled “The best punctuation book, period”. To some extent this is a matter of preference, which we have elected not to change.

4. p. 6, ln 116: just curious if higher PCs (e.g., PC3 or PC4) lead to separation of AMR and SAS super populations. More importantly, how much variation do the first two PCs explain?

Interesting question; we do see that AMR and SAS populations are not well separated in PC1 and PC2. These two populations do separate when we look at PC3, but only for SPS microsatellites (now referred to as EBML). AMR and SAS still overlap quite a bit when we look at all microsatellites. It’s also surprising that EUR populations are the first to separate from the triplet of populations AMR, SAS, and EUR when we look at genetic variation across all microsatellites. The amount of variation explained by PCs 1-15 is fairly consistent for all microsatellites compared to EBML. Here we provide the reviewer with a summary image including additional principal component bi-plots and variation for principal components 1-15 (see attached document). We have updated the first figure of the manuscript to show the variation explained by PC1 and PC2.

5. p. 7, lns 143-146. The authors state that eSTRs are commonly found in regions subject to purifying selection, then define a sweep in terms of positive selection.

This passage was confusing. The motivation for looking at soft sweep regions is now clarified:

"The correspondence between EBML and eSTRs suggest some microsatellites contribute to differential gene expression; however, this does not imply they have adaptive potential. To infer adaptive potential we identify microsatellites in selective sweep regions of the human genome. Briefly, selective sweep occurs when strong positive selection – due to a novel allele – reduces nearby genetic variation; sweep regions have been established for six populations in the 1000 Genomes Project"

6. p. 8, ln 157+ Confused by the genomic compartments used. UTRs are technically part of exons, but not the CDS. So, (1) you might make a distinction between UTRs and full+partial exons that make up the CDS … but then why a separate “coding” compartment? … or (2) you might make a distinction between UTRs and CDS … but then why a separate “exon” compartment, since UTRs and CDS comprise exons. Is there overlap between the compartments here? – i.e., are some STRs double-counted?

Indeed exons include the UTR and CDS regions; figures 3 and 4 tabulate both to accommodate the reader’s preference. We now clarify the potential confusion in the figure 3 legend:

"Note: exonic microsatellites include CDS and UTR microsatellites based on an independent series."

The situation in figure 4c is a bit trickier. The reviewer is correct that microsatellites in exons are redundant with those in CDS and UTR regions. We have regenerated the figure excluding exonic microsatellites so that there is no double counting. The end of the figure legend provides a note to the reader:

"Note: exonic microsatellites (not shown) include CDS and UTR microsatellites."

7. p 11., ln 197 Reference to support the claim that hydrophobic amino acids are more likely to have deleterious effects on protein function.

Added.

8. p 14., 1st paragraph Why devote a whole paragraph of a short discussion section to the idea that we have an incomplete picture of human genetic variation. Isn’t the 1000 Genomes Project (with its 26 individual populations) a big step in the right direction? These days, the reference genome is simply a jumping off point for analyses that do make use of data sets that include much more comprehensive coverage of human genetic variation.

This paragraph doesn’t mention the 1000 Genomes Project, which certainly is a big step in the right direction. The intent of the paragraph is to point out potential limitations of a single reference genome even if it is merely used as a jumping off point. Genomes that differ from the reference – particularly African genomes – may contain sequences that cannot be mapped to the reference at all. The paragraph praises the recent version of the reference genome that includes alternative sequences but suggests that this approach is unsustainable; our results alone find 22,702 alleles. Still, it’s debatable whether the paragraph fits into the scope of the paper, which aims to contribute to what is known about functional microsatellites. We have left it in because it does not detract from the paper and provides a transition to the limitations discussed in the last paragraph.

Reviewer #2:

1. What were the default parameters of the perl script the authors used for finding the repeats? Did they use a constant length cutoff, or did it change with the repeat unit? Studies have reported way more than 1.6 million STRs in humans, and knowing the exact parameters the authors used will help readers understand why they had such a small set of loci to start with.

Good question! The perl script has three key integer-valued parameters: (a) merge gap; (b) max motif length; and (c) max target length. Default values for these parameters were 10, 8, and 90, respectively. The script first identifies perfect repeat motifs (up to 8bp). No impurities are allowed. Next, repeats in close proximity (up to 10bp) are merged. This step introduces a degree of impurities and allows for compound repeats. We do not search for repeats in excess of 90bp. Subsequent analysis of 100bp Illumina reads requires at least 5bp of 3’ and 5’ flanking sequence; in other words, we do not search for repeats exceeding the length of Illumina reads. This is a fairly severe limitation that helps explain why we begin with a small set of loci. We acknowledge this limitation in our discussion:

A key limitation of our study leads us to hypothesize that many more EBML remain undiscovered. In particular, we only investigate microsatellites arrays shorter than 100bp: a number stemming from the short read (illumina) sequencing used for the 1000 Genomes Project. Microsatellite variants exceeding 100bp are truncated rendering their true array length difficult to infer.

We would also emphasize that the script used to generate the initial list of microsatellites is previously published and is freely available: http://genotan.sourceforge.net/#_Toc324410847.
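The two-stage search described above (perfect repeats first, then merging of nearby repeats, then a length cap) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the published perl script: the function names, the `min_copies` and `min_len` thresholds, and the regular-expression approach are all hypothetical choices made here, while `merge_gap=10`, `max_motif=8`, and `max_target=90` mirror the defaults quoted in the response.

```python
import re


def find_perfect_repeats(seq, max_motif=8, min_copies=3, min_len=10):
    """Return sorted (start, end, motif) tuples for perfect tandem repeats.

    min_copies and min_len are assumptions for this sketch; the response
    does not state the minimum array size used by the original script.
    """
    hits = []
    for k in range(1, max_motif + 1):
        # A motif of length k followed by at least (min_copies - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            if m.end() - m.start() >= min_len:
                hits.append((m.start(), m.end(), m.group(1)))
    return sorted(hits)


def merge_repeats(hits, merge_gap=10):
    """Merge repeats that overlap or lie within merge_gap bases of each other."""
    merged = []
    for start, end, _motif in hits:
        if merged and start - merged[-1][1] <= merge_gap:
            # Extending the previous interval allows impure and compound repeats.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def find_microsatellites(seq, merge_gap=10, max_motif=8, max_target=90):
    """Perfect repeats -> merge neighbours -> drop arrays longer than max_target."""
    merged = merge_repeats(find_perfect_repeats(seq, max_motif), merge_gap)
    # 90 bp leaves at least 5 bp of flank on each side of a 100 bp read.
    return [(s, e) for s, e in merged if e - s <= max_target]


if __name__ == "__main__":
    toy = "CTGACT" + "A" * 12 + "GTC" + "CA" * 6 + "TGGT"
    print(find_microsatellites(toy))               # one compound repeat
    print(find_microsatellites(toy, merge_gap=0))  # two separate repeats
```

In the toy sequence the poly-A run and the CA run are 3bp apart, so the default merge gap joins them into a single compound array, whereas a merge gap of zero keeps them separate. Coordinates are 0-based and end-exclusive.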

The limitation of our analysis to 100bp and the focus on highly pure repeats explain why we investigate a small set of loci: 1.6 million generated by the initial script. In fact, this list was pared even more to mitigate the effects of improper read mapping; essentially, we make sure repeats have unique flanking regions (see METHODS – microsatellite list generation – reduction of sequence similarity). Finally, we have also made a webserver freely available to browse the preliminary list of repeats: http://www.cagmdb.org/view_micros.php.

2. Line 112 and other relevant places should be modified to clearly convey that they have used existing data and not done the sequencing themselves.

Line 481: All samples are freely available from the 1000 Genomes Project

Now reads: Existing data was used for this study. All samples are freely available…

Line 120: We use sequencing data from the 1000 Genomes Project

Now reads: We use existing sequencing data from the 1000 Genomes Project

Line 370: Specifically, samples were downloaded from phase 3…

Now reads: We used existing data from the 1000 Genomes Project. Specifically, samples…

3. Link to eSTRs - Why did the authors choose to use all 80,980 loci rather than the 2060 loci that showed a significant effect on expression as reported by Gymrek et al.? How many common loci are there between the 2060 eSTRs and 3984 SPS?

This is an important finding so we thank the reviewer for drawing attention to any potential confusion. We have made changes to key sentences in the results:

"The [Gymrek] study identified 2,060 significant associations (among the 80,980) which established the importance of expression STRs (eSTRs). Cross-referencing the 80,980 STRs against our 316,147 microsatellites reveals 13,259 repeats in common. We constructed a 2x2 contingency table based on classifications (eSTR and/or EBML) of the 13,259 shared repeats (Fig. 2b; right panel). Remarkably, 64 loci classify as an eSTR and EBML (Supplementary table 7); the overlap is statistically significant (p=1.53e-8; χ2 test)."

It should now be clear that we do use the 2060 loci reported by Gymrek et al. and that there are 64 common loci between the 2060 eSTRs and 3984 EBML. The revision draws better attention to the contingency table used to test statistical significance and explicitly states the type of test used (χ2 test). It’s also clear where the contingency table can be found (Fig. 2b; right panel).
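For readers who want to reproduce this kind of test, the χ2 statistic for a 2x2 contingency table can be computed with the standard library alone. The sketch below is a generic illustration, not the authors' code: the function name is hypothetical, no continuity (Yates) correction is applied (the response does not say whether one was used), and the example counts are toy values. In the paper's table, cell `a` would be the 64 loci classified as both eSTR and EBML and the four cells would sum to 13,259; the remaining three cell counts are not given in this response, so they are not filled in here.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (1 degree of freedom) for the table [[a, b], [c, d]].

    Returns (chi2, p_value). No continuity correction is applied; that is an
    assumption of this sketch.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    # Shortcut formula, equivalent to summing (observed - expected)^2 / expected.
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Toy counts only, chosen for illustration; they are not data from the study.
    print(chi2_2x2(10, 20, 30, 40))
```

Rerunning the same function on tables restricted to non-coding or non-exonic repeats is one way to check whether an overlap survives the exclusion of regions favored by exome sequencing.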

4. Authors are advised to elaborate on the p value testing they did to show that 64 common out of 3984 total is significant. In particular, I would like to see if they took into account the bias towards coding regions that is introduced by whole exome sequencing. As can be seen from their study, the loci considered by Gymrek et al were predominantly proximal to genic regions. And as the current study preliminarily uses exome data, the chances of overlap are higher than what is true if whole genome is considered as background. The authors must incorporate this into calculating the significance.

This question is important for confidence in one of our key findings. The reviewer is suggesting that whole exome sequencing – which is biased towards coding regions – creates the illusion that eSTRs and EBML overlap significantly.

To address this question we reiterate that analysis of the overlap between eSTRs and EBML only considers 13,259 repeats: the intersection of the 80,980 STRs from Gymrek et al and our initial list of 316,147 microsatellites. We do claim that there is statistically significant overlap between these regions (p=1.526e-8). However if we redo the analysis and exclude coding regions we still find significant overlap (p=1.248e-7). If we redo the analysis a third time excluding all exonic regions we still find significant overlap (p=1.092e-6). Contingency tables of the aforementioned results are shown here: (see document)

On one hand it’s possible to argue that the p-value is not as strong when coding sequences and exons are excluded from the analysis. On the other hand, the overlap is still highly significant despite exons being excluded even when the sequencing data is biased towards coding regions. Finally, page 9 of our manuscript concludes that EBML are over-represented in introns and under-represented in exons.

"Although introns are the majority among tested microsatellites (148,464 out of 316,147), they are over-represented among EBML (Fig. 3). On the other hand, coding microsatellites are under-represented (Fig. 3). Enrichment analysis leading to these conclusions takes into account the coverage of microsatellites among samples and is robust to sample partitioning (see methods for details)."

Thus, we emphasize two conclusions: (a) eSTRs and EBML overlap significantly even outside of exons; and (b) EBML are under-represented in coding regions and over-represented in introns. These results are only made stronger by the fact that whole exome sequencing data has a bias towards coding regions.

5. Regarding the analysis of over/under representation by repeat unit length, if a larger length cutoff was used for larger repeat unit lengths, that might explain why tetra to hexamers are underrepresented in their analyses.

Good question! Essentially the reviewer is asking if the length threshold for microsatellites used number of base pairs or number of repeated units. In case of the latter, tetra to hexamers may have been disadvantaged before enrichment analysis was done; however, this is not the case. First we submit to the reviewer a tabulation of the most common microsatellites by length (base pairs) in our initial list:

+------+-------+-------+
| nMer | bases | count |
+------+-------+-------+
| 1    | 10    | 49632 |
| 1    | 11    | 26181 |
| 1    | 12    | 18554 |
| 2    | 12    | 23942 |
| 2    | 13    | 17543 |
| 2    | 14    | 11067 |
| 3    | 15    |  9119 |
| 3    | 17    |  6909 |
| 3    | 16    |  5913 |
| 4    | 16    | 14525 |
| 4    | 19    | 11535 |
| 4    | 17    | 10350 |
| 5    | 15    | 24042 |
| 5    | 16    | 13594 |
| 5    | 19    | 10068 |
| 6    | 18    |  7692 |
| 6    | 19    |  4768 |
| 6    | 21    |  3480 |
+------+-------+-------+

The tabulation of microsatellites shows that the most common length in base pairs for various n-mer repeats is fairly constant; thus, we did not use a larger length cutoff for larger repeat units. We also provide the 2x2 tables used for performing the χ2 test of statistical significance: (see attached document)
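A tabulation of this kind (the most frequent array lengths for each motif size) is straightforward to generate from a list of repeats. The sketch below assumes each repeat is available as a `(motif_length, array_length_in_bases)` pair; that input format, the function name, and the toy records are hypothetical and are not taken from the authors' pipeline.

```python
from collections import Counter, defaultdict


def top_lengths(repeats, top=3):
    """Map each motif length to its `top` most common array lengths.

    repeats: iterable of (motif_length, array_length_in_bases) pairs.
    Returns {motif_length: [(array_length, count), ...]} ordered by count.
    """
    counts = defaultdict(Counter)
    for motif_len, bases in repeats:
        counts[motif_len][bases] += 1
    return {k: counts[k].most_common(top) for k in sorted(counts)}


if __name__ == "__main__":
    # Toy records for illustration only; not data from the study.
    toy = ([(1, 10)] * 4 + [(1, 11)] * 3 + [(1, 12)] * 2 + [(1, 13)]
           + [(2, 12)] * 2 + [(2, 14)])
    for motif_len, rows in top_lengths(toy).items():
        for bases, count in rows:
            print(motif_len, bases, count)
```

If the length threshold had been defined in repeat units rather than base pairs, the most common lengths for larger motifs would shift upward roughly in proportion to motif size, which is the pattern the table rules out.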

The message in these tables is that SPS microsatellites (now referred to as EBML) have a disproportionately high number of 1-mer and 2-mer repeats compared to the composition of microsatellites we used for analysis. On the contrary, EBML have a disproportionately low number of 4-mer, 5-mer, and 6-mer repeats. With the exception of 2-mer repeats, the strength of these associations leaves little doubt and the overall trend is clear: the composition of EBML shifts from over-representation of mono-nucleotide and di-nucleotide repeats to under-representation of tetra-nucleotide repeats and beyond.

6. The authors call these microsatellites super-population specific, yet there are a lot of overlaps of these loci indicating they are merely highly polymorphic. Is it really correct to call them “specific”? On a related note, what happens to the analyses if they are done only with the unique loci (~2500 loci added together).

We agree that calling these microsatellites super-population specific is potentially confusing. We have addressed this weakness in several ways. First, the abstract, author summary, and introduction include a succinct definition of SPS microsatellites (now called EBML):

"for each EBML at least one ethnicity has genotype frequencies statistically different from the remaining four"

This should clarify that the collections of EBML for each ethnicity are permitted to overlap. We have also changed the name of these regions to ethnically-biased microsatellite loci (EBML). It is more accurate to call these regions biased rather than specific. Ethnicity has been used previously as a surrogate for the less familiar term super-population.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Arnar Palsson

31 Oct 2019

Abundance of Ethnically Biased Microsatellites in Human Gene Regions

PONE-D-19-21466R1

Dear Dr. Kinney,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Arnar Palsson, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thank you for addressing my concerns effectively and presenting the changes you made in such a comprehensive and lucid manner.

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Ryan J. Haasl

Reviewer #2: Yes: Rakesh Mishra

Acceptance letter

Arnar Palsson

27 Nov 2019

PONE-D-19-21466R1

Abundance of Ethnically Biased Microsatellites in Human Gene Regions

Dear Dr. Kinney:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Arnar Palsson

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. List of 3,984 EBML identified in this study.

    (XLSX)

    S2 Table. List of 3,171 microsatellites specific to African populations.

    (XLSX)

    S3 Table. List of 1,494 microsatellites specific to East Asian populations.

    (XLSX)

    S4 Table. List of 647 microsatellites specific to South Asian populations.

    (XLSX)

    S5 Table. List of 450 microsatellites specific to European populations.

    (XLSX)

    S6 Table. List of 335 microsatellites specific to American populations.

    (XLSX)

    S7 Table. List of 64 EBML also identified as eSTRs.

    (XLSX)

    S8 Table. List of 232 EBML in matrisome core/associated genes.

    (XLSX)

    S9 Table. List of 21 EBML coding repeats in selective sweep regions.

    (XLSX)

    S10 Table. List of 14 EBML coding repeats previously implicated in disease.

    (XLSX)

    S1 Dataset. Genotyping data for 3,984 EBML in 1000 Genomes Project samples.

    (BZ2)

    S2 Dataset. Pairwise comparisons of microsatellites in 5 super-populations.

    (7Z)

    S1 Code. Functions to access genotype tables for EBML.

    (R)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.


    Articles from PLoS ONE are provided here courtesy of PLOS
