msRepDB: a comprehensive repetitive sequence database of over 80 000 species

Xingyu Liao; Kang Hu; Adil Salhi; You Zou; Jianxin Wang; Xin Gao

doi:10.1093/nar/gkab1089

Nucleic Acids Res. 2021 Dec 1;50(D1):D236–D245. doi: 10.1093/nar/gkab1089

PMCID: PMC8728181 PMID: 34850956

Abstract

Repeats are prevalent in the genomes of all bacteria, plants and animals, and they cover nearly half of the Human genome. They play indispensable roles in evolution, inheritance, variation and genomic instability, and serve as substrates for chromosomal rearrangements that include disease-causing deletions, inversions and translocations. Comprehensive identification, classification and annotation of repeats in genomes can provide accurate and targeted solutions towards understanding and diagnosis of complex diseases, optimization of plant properties and development of new drugs. RepBase and Dfam are the two most frequently used repeat databases, but they are not sufficiently complete. Because a comprehensive multi-species repeat database has been lacking, research in this field remains far from satisfactory. LongRepMarker is a framework recently developed by our group for comprehensive identification of genomic repeats. Here we propose msRepDB, built on LongRepMarker, which is currently the most comprehensive multi-species repeat database, covering >80 000 species. Comprehensive evaluations show that msRepDB contains more species, and more complete repeats and families, than the RepBase and Dfam databases. msRepDB is available at https://msrepdb.cbrc.kaust.edu.sa/pages/msRepDB/index.html.

INTRODUCTION

Repetitive sequences are abundantly distributed in the genomes of all viruses, bacteria, plants and animals (1). For example, they constitute up to 45% of the genome in Mouse and 50–70% in Human (2). The repetitive sequences of the genome play a central role in the stability of the chromosome, the cell cycle and the regulation of gene expression, and they are also important substrates for genome evolution (3–6). For example, the number and types of repetitive sequences vary between organisms and may reflect how rapidly an organism adapts to changes in its environment (7,8). Moreover, they are fundamental to the cooperative molecular interactions that form nucleoprotein complexes (9), and have also been implicated in molecular and cellular dysfunction associated with human diseases (10). For instance, tandem repeat expansions have been associated with >40 monogenic disorders and have recently been shown to be a major genetic contributor to frontotemporal dementia (FTD), amyotrophic lateral sclerosis (ALS) and autism spectrum disorder (ASD). ALS is the most common form of motor neuron disease (11,12), and ASD is a group of neurodevelopmental disorders characterized by atypical social function, communication deficits, restricted interests and repetitive behaviors (13–15). In addition, the expression of retrotransposition-competent transposable elements can lead to more insertions, which can disrupt gene function or alter gene expression, contributing to complex diseases such as lung cancer, pancreatic cancer, ovarian cancer, neurological diseases and blood diseases (16–18). Comprehensive identification, classification and annotation of repeats in genomes can provide accurate and targeted solutions towards understanding and diagnosis of complex diseases, optimization of plant properties and development of new drugs.

To achieve these goals, an accurate and complete repeat database is essential. RepBase (19) and Dfam (20) are the two most frequently used repeat databases, but they are not sufficiently complete, because most of the repetitive sequences collected in these two libraries were obtained with existing detection methods (such as RepeatScout (21) and RepeatMasker (22)). Owing to the limitations of sequencing data and to defects in the design of their detection principles, existing detection methods cannot accurately and comprehensively obtain the repetitive sequences of various species. For instance, in the Glycine max genome, when the combination of RepBase and Dfam is used as the repetitive sequence database, only 28.47% of bases can be annotated as LTR (Long Terminal Repeat) retrotransposons, whereas the expected proportion is about 42% (23). This means that LTR retrotransposons covering roughly 13.5% of the genome are left unannotated (Figure 1 and Table 4). Because a comprehensive multi-species repetitive sequence database has been lacking, research in this field remains far from satisfactory.

Figure 1. The classification and proportion statistics of repetitive sequences in the Human, Drosophila and Glycine max genomes, annotated with the combination of the two databases Dfam and RepBase, and with msRepDB. The Y-axis represents the proportion and the X-axis represents the species. ‘Overall’ represents all types of repetitive sequences, ‘Retroelements’ represents the retroposon elements, ‘DNA transposons’ represents the DNA transposon elements, ‘Unclassified’ represents the repetitive elements that cannot be classified from the available information and ‘Total interspersed repeats’ represents all interspersed repeats combined.

Table 4.

Partial comparison of the proportion and detailed classification of detected repeats generated based on two databases of the Glycine max genome

Combination of RepBase and Dfam [bases masked: 36.11%]				msRepDB [bases masked: 41.54%]
Repeat types	Number of elements	Length occupied	Percentage of sequences	Number of elements	Length occupied	Percentage of sequences
Retroelements	199 220	289 032 002 bp	29.52%	244 764	328 295 871 bp	33.54%
+SINEs	0	0 bp	0.00%	0	0 bp	0.00%
+Penelope	0	0 bp	0.00%	0	0 bp	0.00%
+LINEs	12 626	10 304 690 bp	1.05%	13 156	10 432 965 bp	1.07%
++CRE/SLACS	0	0 bp	0.00%	0	0 bp	0.00%
+++L2/CR1/Rex	0	0 bp	0.00%	0	0 bp	0.00%
+++R1/LOA/Jockey	0	0 bp	0.00%	0	0 bp	0.00%
+++R2/R4/NeSL	0	0 bp	0.00%	0	0 bp	0.00%
+++RTE/Bov-B	3 790	2 001 199 bp	0.20%	3 945	2 017 968 bp	0.21%
+++L1/CIN4	8 836	8 303 491 bp	0.85%	9 211	8 414 997 bp	0.86%
+LTR elements	186 594	278 727 312 bp	28.47%	231 608	317 862 906 bp	32.47%
++BEL/Pao	0	0 bp	0.00%	0	0 bp	0.00%
++Ty1/Copia	58 199	80 563 666 bp	8.23%	83 194	87 429 549 bp	8.93%
++Gypsy/DIRS1	126 690	195 309 037 bp	19.95%	140 926	225 546 399 bp	23.04%
+++Retroviral	0	0 bp	0.00%	340	206 126 bp	0.02%
DNA transposons	58 468	41 514 301 bp	4.24%	61 139	42 789 484 bp	4.37%
+hobo-Activator	7 612	2 233 822 bp	0.23%	5 901	1 964 869 bp	0.20%
+Tc1-IS630-Pogo	117	56 379 bp	0.01%	321	75 504 bp	0.01%
+En-Spm	0	0 bp	0.00%	0	0 bp	0.00%
+MuDR-IS905	0	0 bp	0.00%	0	0 bp	0.00%
+PiggyBac	0	0 bp	0.00%	0	0 bp	0.00%
+Tourist/Harbinger	923	564 171 bp	0.06%	1 006	582 191 bp	0.06%
+Other	0	0 bp	0.00%	0	0 bp	0.00%
Rolling circles	538	252 405 bp	0.03%	967	740 481 bp	0.08%
Unclassified	0	0 bp	0.00%	46 116	9 214 511 bp	0.94%
Total interspersed repeats		330 546 303 bp	33.77%		378 050 943 bp	38.62%
Small RNA	2 223	902 022 bp	0.09%	2 221	901 834 bp	0.09%
Satellites	19 885	2 175 759 bp	0.22%	9 389	6 367 996 bp	0.65%
Simple repeats	323 670	15 236 633 bp	1.56%	307 769	14 416 738 bp	1.47%
Low complexity	82 139	4 344 053 bp	0.44%	75 689	3 964 123 bp	0.40%

The test results were obtained by using RepeatMasker based on the msRepDB database and the combination of Dfam and RepBase respectively under the default parameter settings.

LongRepMarker (24) is a framework recently developed by our group for comprehensive identification of genomic repetitive sequences. The evaluations carried out in the LongRepMarker study show not only that LongRepMarker achieves more satisfactory results than existing detection methods, but also that it discovers a large number of new repeat sequences and families. Here we propose msRepDB (https://msrepdb.cbrc.kaust.edu.sa/pages/msRepDB/index.html), built on LongRepMarker, which is currently the most comprehensive multi-species repetitive sequence database, covering >80 000 species. msRepDB takes the reference sequence or assembly of a species as input, and outputs the masked sequences representing the detected repeats together with a comprehensive annotation report. The input reference sequences or assemblies must be in the FASTA format. msRepDB matches all subsequences against the database to find the repeat elements they contain, along with their locations and types, then masks the repeat elements in the input sequence and generates an annotation report. msRepDB also provides query and download functions: users can directly retrieve and download repetitive elements and their classification information by taxon name or family name. Thus, a user who has no sequence data, only a taxon name or a repeat family name, can still retrieve all relevant contents from the database through the download links provided (Figure 2).
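The masking step described above can be illustrated with a minimal Python sketch. It is not the msRepDB implementation: the function name `mask_repeats`, the 0-based half-open interval convention and the use of `N` as the masking character are assumptions made purely for illustration, and the toy sequence and hits are invented.

```python
def mask_repeats(sequence, repeat_intervals, mask_char="N"):
    """Return a copy of `sequence` with every repeat interval hard-masked.

    `repeat_intervals` holds (start, end) pairs, 0-based and half-open
    (an assumed convention); overlapping intervals are allowed.
    """
    masked = list(sequence)
    for start, end in repeat_intervals:
        # Clamp to the sequence bounds so out-of-range hits are ignored.
        start = max(0, start)
        end = min(len(masked), end)
        for i in range(start, end):
            masked[i] = mask_char
    return "".join(masked)


# Toy input (hypothetical): two annotated repeat copies in a short sequence.
toy_sequence = "ACGTACGTTTGACCAACGTACGT"
toy_hits = [(0, 8), (15, 23)]
masked_sequence = mask_repeats(toy_sequence, toy_hits)
```

In this sketch the repeat locations are given; in practice they would come from matching the input against the repeat library, for example with RepeatMasker.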

Figure 2. The function modules of the msRepDB database. The figure shows the four function modules of the msRepDB database, namely ‘Search’, ‘Download’, ‘Online Masking’ and ‘Submit’, and the detailed fields of the three detection reports, namely ‘Classification Report’, ‘Annotation Report’ and ‘Mapping Report’. For example, the ‘Search’ module provides the following four functions: (1) searching repeats by species name; (2) searching repeats by species name and family name; (3) displaying all annotated repeats of the species; and (4) displaying all annotated repeats of a specific family of the species. As another example, there are six fields in the mapping report: ‘Mapping zero time’, ‘Mapping one time’, ‘Mapping multiple time’, ‘N50’, ‘N75’ and ‘N90’. ‘Mapping zero time’ indicates the proportion of fragments that cannot be aligned to the reference genome; ‘Mapping one time’ indicates the proportion of fragments that can be aligned to only one location on the reference genome; ‘Mapping multiple time’ indicates the proportion of fragments that can be aligned to many locations on the reference genome; and ‘N50’, ‘N75’ and ‘N90’ indicate the segment length such that all segments of at least this length together cover at least 50%, 75% and 90%, respectively, of the total length of all segments.

MATERIALS AND METHODS

Data collection and identification of repetitive sequences by using LongRepMarker

To build a comprehensive multi-species repetitive sequence database, we first collected the reference genomes, or the assemblies of sequencing reads, of the species concerned. The NCBI website (https://www.ncbi.nlm.nih.gov/) is an important channel for obtaining these data. For example, entering ‘human’ in the search box on the NCBI homepage and clicking the search button leads to the download interface of the Human reference genome ‘GRCh38.p13’. Following the prompts and links yields a compressed file; decompressing it gives a FASTA file, which is the required Human reference genome.

As mentioned above, compared with existing detection methods (such as RepeatScout (21), RepeatMasker (22) and RepeatModeler2 (25)), LongRepMarker not only identifies repetitive sequences in the genome more completely, but also performs better at discovering new repetitive sequences and families. A more comprehensive multi-species repetitive sequence database can therefore be constructed from the detection results of LongRepMarker (https://github.com/BioinformaticsCSU/LongRepMarker). When the reference genome or the assembly of sequencing reads of a species is given to LongRepMarker as input, it carries out the following steps to identify and annotate the repetitive sequences it contains.

Identification of overlap sequences. The repetition relation is a special case of the overlap relation, so all possible repetition relationships can be found by searching for overlap sequences. Overlap sequences occupy only a small portion of the overall sequences. By finding the overlap sequences between assemblies or chromosomes, the algorithm locates the repetitive sequences faster and more accurately.
Conversion of overlap sequences into unique k-mers. The number and length of sequences have a great impact on the efficiency of multiple sequence alignment: in general, the more sequences there are and the longer they are, the greater the consumption of computational resources. The unique k-mers (26,27) are much smaller than the overlap sequences in both number and length, so mapping unique k-mers instead of overlap sequences greatly improves the efficiency of multiple sequence alignment (28).
Generation of the multi-alignment unique k-mers and their coverage regions on overlap sequences. The multi-alignment unique k-mers, first proposed in the LongRepMarker paper (24), are the unique k-mers that can be aligned to multiple different locations in the overlap sequences. Because of sequencing bias, the high-frequency threshold is often difficult to determine accurately, which strongly affects the range of the high-frequency k-mers (29–31). The multi-alignment unique k-mers are not affected by these factors. By using them to identify repeats in overlap sequences, the algorithm obtains the repeats in the genomes more comprehensively and stably.
Classification of regions on overlap sequences that can be covered by multi-alignment unique k-mers. Because unique k-mers are short, they easily form coupling alignments, that is, fusions of unique k-mers that should not be fused together (32,33). To eliminate the influence of coupling alignments, the algorithm further classifies the regions on the overlap sequences covered by the multi-alignment unique k-mers into two categories and filters out the false repetitive sequences, thereby improving the accuracy of the detection results.
Merging fragments with duplication or inclusion. The results of detection methods based on multiple sequence alignment inevitably contain redundant elements. To keep the detection results as free of impurities and redundancy as possible, the algorithm merges the detected repetitive fragments that have duplication or inclusion relationships (34).
Classification and annotation of the obtained repetitive sequences. Once the repetitive sequences are obtained, the algorithm classifies and annotates them, because repeats without classification and annotation information are of little use. In this step, a distributed RepeatClassifier (25) developed by our group is used to classify and annotate the obtained repetitive sequences.
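The idea behind the multi-alignment unique k-mers in the steps above can be sketched in a few lines of Python. This is a toy illustration, not LongRepMarker's code: exact string matching stands in for alignment, ‘unique k-mers’ is interpreted as the set of distinct k-mers, and the function names, the value of k and the toy sequences are all assumptions.

```python
from collections import defaultdict


def kmer_positions(sequences, k):
    """Map each distinct k-mer to the list of (sequence index, offset) hits."""
    positions = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for offset in range(len(seq) - k + 1):
            positions[seq[offset:offset + k]].append((seq_index, offset))
    return positions


def multi_occurrence_kmers(sequences, k):
    """Keep only the distinct k-mers found at more than one location."""
    return {
        kmer: hits
        for kmer, hits in kmer_positions(sequences, k).items()
        if len(hits) > 1
    }


# Toy overlap sequences (hypothetical); 'ACGT' occurs three times in total.
toy_overlaps = ["ACGTTACGT", "GGACGTCC"]
repeated = multi_occurrence_kmers(toy_overlaps, k=4)
```

The selection here depends only on whether a k-mer occurs at more than one location, not on a frequency threshold, which mirrors the motivation given for multi-alignment unique k-mers.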

Note that LongRepMarker differs from RepeatScout and RepeatModeler2 in its detection targets. RepeatScout and RepeatModeler2 both focus on the discovery of repeat families. A repeat family is an abstraction of a type of repeat sequence (a one-to-many relationship), and obtaining it requires two operations: merging the obtained repeat sequences and taking the consensus sequence. The detection goal of LongRepMarker, by contrast, is not to find repeat families but to comprehensively mine all repeat sequences of the genome (Supplementary Figure S2) and to provide a basis for accurately identifying the mutations that exist between different copies. Therefore, in the detection results of LongRepMarker, we merged duplicate copies with high consistency, retained the duplicate copies that differ as far as possible, and analyzed the structural variation that occurred in the differing copies. Our purpose is to provide a method for studying how variations between different duplicate copies affect the inheritance, evolution and variation of organisms.

Although the detection results of LongRepMarker contain some redundancy and chimerism, the repetitive sequences it identifies are still the most complete compared with other existing tools. To remove impurities and chimeras from the detection results and output purer repetitive sequences for database construction, three steps are carried out after the detection results of LongRepMarker are obtained: impurity removal, chimerism removal and consensus sequence construction. In this process, we used the Wicker 80/80/80 rule (as used in RepeatScout), filtered out overlaps whose identity is lower than 85% (as used in RepeatScout) and applied a cutoff score of 225 (as used in RepeatMasker). Once the purified repetitive sequences are generated, RepeatClassifier is used to classify and annotate them. The algorithm then merges the repeat sequences with their classification and annotation information into a file in the FASTA format (35). In this file, each sequence composed of A/T/G/C characters is a repeat sequence, and the line starting with an angle bracket above it is the annotation line, which contains the corresponding classification and annotation information (36).
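A short Python sketch may help to make the FASTA layout described above concrete. It assumes that the annotation line has the form `>name#Class/Family`, as in the submission example `>7SLRNA_short_#SINE/Alu` given later in this article. The function name `parse_repeat_fasta` and the second toy record are hypothetical.

```python
def parse_repeat_fasta(text):
    """Yield (name, classification, sequence) for each FASTA record.

    The annotation line is assumed to look like '>name#Class/Family'.
    """
    name, classification, chunks = None, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between sequence lines
        if line.startswith(">"):
            if name is not None:
                yield name, classification, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            header = (line[1:].split() or [""])[0]
            name, _, classification = header.partition("#")
            chunks = []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, classification, "".join(chunks)


# Toy input: shortened sequences, second record invented for illustration.
toy_fasta = """>7SLRNA_short_#SINE/Alu
GCCGGGCGCGGTGG
CGCGTGCC
>toy_element#LTR/Pao
ACGTACGT
"""
records = list(parse_repeat_fasta(toy_fasta))
```

Each record keeps the repeat sequence together with its classification, which is what a later storage or search step needs.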

Extracting the repetitive sequences and their corresponding families contained in each species from the detection results and storing them in the database

Once the purification step is complete, we extract the repetitive sequences from the merged results (files in the FASTA format) and store them in the msRepDB database according to their species name, NCBI accession number, taxid and family information.
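As a rough illustration of storing and retrieving repeats by these keys, the following sketch uses Python's built-in `sqlite3` module as a stand-in for the MySQL database used by msRepDB. The table name, column names and helper functions are hypothetical and do not describe the actual msRepDB schema. The species name and taxonomy id are taken from the submission example in this article, and the sequences are shortened toy values.

```python
import sqlite3


def create_repeat_store(connection):
    """Create a hypothetical table keyed by species, accession, taxid and family."""
    connection.execute(
        """
        CREATE TABLE IF NOT EXISTS repeats (
            species_name TEXT NOT NULL,
            accession    TEXT,
            taxid        INTEGER NOT NULL,
            family       TEXT NOT NULL,
            repeat_name  TEXT NOT NULL,
            sequence     TEXT NOT NULL
        )
        """
    )
    connection.execute(
        "CREATE INDEX IF NOT EXISTS idx_taxid_family ON repeats (taxid, family)"
    )


def store_repeat(connection, species_name, accession, taxid, family, repeat_name, sequence):
    """Insert one repeat record using a parameterized statement."""
    connection.execute(
        "INSERT INTO repeats VALUES (?, ?, ?, ?, ?, ?)",
        (species_name, accession, taxid, family, repeat_name, sequence),
    )


def repeats_for(connection, taxid, family=None):
    """Return (repeat_name, family, sequence) rows for a taxid, optionally one family."""
    query = "SELECT repeat_name, family, sequence FROM repeats WHERE taxid = ?"
    params = [taxid]
    if family is not None:
        query += " AND family = ?"
        params.append(family)
    return connection.execute(query + " ORDER BY repeat_name", params).fetchall()


connection = sqlite3.connect(":memory:")
create_repeat_store(connection)
# The accession is left empty (None) because none is given in the text.
store_repeat(connection, "Afrixalus weidholzi", None, 2517382, "SINE/Alu", "7SLRNA_short_", "GCCGGGCGCGGTGG")
store_repeat(connection, "Afrixalus weidholzi", None, 2517382, "LTR/Pao", "toy_element", "ACGTACGT")
sine_rows = repeats_for(connection, 2517382, family="SINE/Alu")
```

Indexing on taxid and family reflects the two lookups the text describes: all repeats of a species, and the repeats of one family within a species.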

DATABASE CONTENT AND USAGE

Home and About

The function of the Home page (Figure 3 A) is to introduce msRepDB, mainly including the application fields of msRepDB, the research and development principles, and the main advantages compared with the existing libraries. The function of the About page is to introduce the main functions and test samples of msRepDB.

Search and Download

The functions of the Search and Download page are as follows: (i) by selecting the species taxonomy name, NCBI accession number, taxonomy id and repeat family name in the Search and Download interface, users can retrieve the complete repetitive sequences, with classification information, of the specified species; (ii) by clicking the ‘Download’ button on the interface, users can download the retrieved repetitive sequences, with classification information, to the local disk (Figure 3B).

Usage example:

Click the ‘select species’ input box to trigger the list of candidate species names;
Select or write ‘Drosophila files genus’ in the list box of species taxonomy name, and click the ‘Search’ button;

[Server response]: The server will display all the repetitive sequences and classification information in the genome of Drosophila on the bottom of the interface.
Select ‘Drosophila files genus’ in the list box of species taxonomy name, select ‘LTR/Pao’ in the list box of repeat family name, and click the ‘Search’ button;

[Server response]: The server will display all the LTR/Pao-types of repetitive sequences and classification information in the genome of Drosophila on the bottom of the interface.
Click the ‘Total families of Drosophila files genus download’ button on the left of the interface;

[Server response]: The server will provide all the repetitive sequences with classification information (LTR/Pao) of the species selected (Drosophila files genus) by the user in the FASTA format (Figure 3 E), and the user can save the downloaded file to the preferred local directory through the ‘Browse’ option.

Online Masking

The functions of the Online Masking are as follows: (i) by dragging and dropping or pasting the sequence to be masked into the input box on the interface, the users can submit the sequence file in the FASTA format to be masked on the ‘Online Masking’ interface; (ii) when the server completes the masking task, it will feed back the masking results to the interface, and the user can obtain detailed masking results and related reports (the annotation mainly includes the classification of the repetitive sequences and their locations in the genome) by browsing and downloading (Figure 3C).

Usage example:

Select ‘Drosophila files genus’ in the list box of species taxonomy name and configure the related parameters;
Download the demo reference genome of Drosophila to the local disk. The complete reference download address is https://www.ncbi.nlm.nih.gov/assembly/GCF_000001215.4;
Upload the downloaded file to the server by dragging and dropping it into the online masking interface;
Click the ‘Submit Masking Job’ button;

[Server response]: After the server receives the submitted file, masking and the generation of the annotation reports take from several minutes to about ten minutes. When the whole online masking process is complete, the server reports that the task has finished and provides download links for all generated reports.
Click the download links (such as ‘Mapping Report’, ‘Classification Report’, ‘N50 & Alignment ratios Report’, ‘Masked sequences by msRepDB’ and ‘Extracted Masked Sequences’) in the online masking interface, and save the masked sequence files and several annotation reports (the annotation mainly includes the classification of the repetitive sequences and their locations in the genome) to the preferred local directory through the ‘Browse’ option;

[Server response]: The server will provide download links for the masked sequence files and all generated reports on the interface (Figure 3F–J).

Submit and tools

The submit function is mainly used to update the contents of the msRepDB database (Figure 3D). Update operations can be divided into the following two types: (i) insert new records into the msRepDB database and (ii) update existing records in the msRepDB database. The data submission operation is completed by the system administrator. Before data submission, the administrator needs to evaluate the submitted data to verify its authenticity and reliability. New data can be entered into the database after passing the assessment. The function of the ‘Tools’ page is to introduce the tools related to our research.

Usage example:

Enter the species’ scientific name, taxonomy id, NCBI accession number, and repeat sequence with the family information (Figure 3 D);

[Specific operation]:

Step1: Set the taxonomy id as ‘2517382’;

Step2: Set the species name as ‘Afrixalus weidholzi’;

Step3: Set the repeat sequence with ID and family information as follows

‘>7SLRNA_short_#SINE/Alu
GCCGGGCGCGGTGGCGCGTGCCTGTAGTCCCAGCTA
CTCGGGAGGCTGAGGTGGGAGGATCGCTTGAGTCCA
GGAGTTCTGGGCTGTAGTGCGCTATGCCGATCGGGT
GTCCGCACTAAGTTCGGCATCAATATGGTGACCTCC
CGGGAGCGGGGGACCACCAGGTTGCCTAAGGAGGGG
TGAACCGGCCCAGGTCGGAAACGGAGCAGGTCAAAA
CTCCCGTGCTGATCAGTAGTGGGATCGCGCCTGTGA
ATAGCCACTGCACTCCAGCCTGAGCAACATAGCGAG
ACCCCGTCTCTTAAAAAAAAAAAAAA';
Click the ‘Submit’ button on the interface;

[Server response]: When the server receives the submitted information, it will take several seconds to complete the storage and generate feedback information on the interface.

IMPLEMENTATION

The data processing and analysis functions of the msRepDB database were implemented using Python v.3.6.9 (www.python.org/getit/) coupled with the Spring Boot integrated framework (https://spring.io/projects/spring-boot). msRepDB runs on a Linux-based Maven server 3.8.1 (Maven is a build automation tool used primarily for Java projects, https://maven.apache.org/download.cgi). The database was developed using MySQL 5.7.31 (https://www.mysql.com/), and the web interface was developed using the HTML5 markup language (https://en.wikipedia.org/wiki/HTML5) combined with Bootstrap v.5.0.2 (https://v3.bootcss.com), layUI v.2.6.8 (https://www.layui.com/) and jQuery v.1.11.1 (http://jquery.com) (Supplementary Figures S3–S8). During online masking, two aligners were used: bwa (37) for the short sequence fragments and minimap2 (38) for the long sequence fragments.
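The choice between the two aligners can be sketched as a simple dispatch on fragment length. The text does not state the length cutoff, the bwa subcommand or any options, so the 1000 bp threshold, the use of `bwa mem` and `minimap2 -a`, and the function name `alignment_command` are all assumptions made for illustration.

```python
def alignment_command(reference_path, fragment_path, fragment_length, long_threshold=1000):
    """Build an aligner command line for one fragment file.

    `long_threshold` is a hypothetical cutoff; the article does not give
    the length that separates short from long fragments.
    """
    if fragment_length >= long_threshold:
        # minimap2 -a writes alignments in SAM format.
        return ["minimap2", "-a", reference_path, fragment_path]
    # bwa mem expects the reference to have been indexed with `bwa index`.
    return ["bwa", "mem", reference_path, fragment_path]


short_cmd = alignment_command("ref.fa", "fragments.fa", fragment_length=150)
long_cmd = alignment_command("ref.fa", "fragments.fa", fragment_length=5000)
```

The returned list can be passed to `subprocess.run` on a machine where both aligners are installed.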

DISCUSSION

Compared with the existing repeat databases, the major improvements of msRepDB are as follows. (i) msRepDB contains more species than the RepBase and Dfam databases (i.e. >84 000 in msRepDB versus about 62 000 in the combination of RepBase and Dfam). The comprehensive experiments carried out in the LongRepMarker study show not only that LongRepMarker achieves more satisfactory results than the existing detection methods (Supplementary Tables S4–S6, Supplementary Figures S9–S10), but also that it discovers a large number of new repeat sequences and families. (ii) For a single species, msRepDB contains more complete repeats and families than the existing repeat databases. We have conducted comprehensive experimental evaluations of the coverage and completeness of the msRepDB database. For example, we used the latest version of RepeatMasker (V.4.1.2) to classify and annotate the repeats of Human, Mouse, Rice, Glycine max and Drosophila based on the msRepDB database and on the combination of the latest RepBase (V.26.06) and Dfam (V.3.3) libraries, respectively. The frequency and length distribution, the multiple alignment ratio, the proportion of the reference genome covered and the duplication ratio of the repetitive sequences contained in msRepDB and in the combination of Dfam and RepBase are shown in Table 1. The repetitive sequences collected in msRepDB have a higher repetition frequency and a larger size overall. Furthermore, in terms of multiple alignment ratio, coverage of the reference genome and duplication ratio, the repetitive sequences contained in msRepDB are usually more accurate and less redundant than those contained in the combination of Dfam and RepBase. Here, the duplication ratio is the total number of aligned bases in the repetitive sequences divided by the total number of aligned bases in the reference. If too many repetitive sequences cover the same regions, the duplication ratio increases greatly. This can happen for several reasons, including overestimated repeat multiplicities and overlaps between repetitive sequences.
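The verbal definition of the duplication ratio can be turned into a small Python sketch. It assumes that each alignment is given as a 0-based, half-open (start, end) interval on the reference, and that ‘aligned bases in the reference’ means reference positions covered by at least one alignment. The function names and toy intervals are hypothetical. How the percentages reported in Table 1 are derived from this ratio is not specified in the text, so the sketch follows only the definition itself.

```python
def merged_length(intervals):
    """Number of reference bases covered by at least one interval."""
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


def duplication_ratio(alignments):
    """Aligned bases in the repeats divided by aligned bases in the reference."""
    reference_bases = merged_length(alignments)
    if reference_bases == 0:
        raise ValueError("no aligned bases")
    aligned_bases = sum(end - start for start, end in alignments)
    return aligned_bases / reference_bases


# Toy alignments: the first two overlap by 50 bases.
toy_alignments = [(0, 100), (50, 150), (200, 300)]
toy_ratio = duplication_ratio(toy_alignments)
```

With the toy data, 300 aligned bases fall on 250 covered reference positions, so overlapping repeats push the ratio above 1, which is the redundancy effect described above.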

Table 1.

Partial comparison of the length distribution, multiple alignment ratio, proportion of covering the reference genome and duplication ratio of elements contained in msRepDB database and the combination of Dfam and RepBase

		Length distribution					Mapping		RepeatMasker	Other
Species	Database	Num	Max (bp)	N50 (bp)	N75 (bp)	N95 (bp)	MAR (%)	Non-MAR (%)	Reference (%)	Duplication ratio (%)
H. sapiens (Human)	msRepDB	1613	20 016	2858	903	496	88.17%	11.82%	47.29%	0.09%
	Dfam+RepBase	1353	9043	2532	786	464	80.93%	19.06%	45.62%	0.15%
Mouse	msRepDB	1779	15 041	3691	1061	505	94.41%	5.58%	43.15%	0.14%
	Dfam+RepBase	1407	8959	2210	791	437	86.28%	13.71%	40.58%	0.21%
Oryza sativa(Rice)	msRepDB	3556	13 922	3584	1668	801	98.94%	1.05%	50.62%	3.90%
	Dfam+RepBase	3049	20 789	3879	1831	892	82.81%	17.18%	50.50%	4.14%
D. melanogaster	msRepDB	477	20 014	4646	2571	1153	99.65%	0.34%	21.86%	2.40%
	Dfam+RepBase	258	15 576	4802	3204	1036	89.77%	10.22%	20.85%	3.36%
Glycine max	msRepDB	1226	10 856	4536	3175	1130	100.00%	0.00%	41.31%	0.44%
	Dfam+RepBase	596	17 080	4688	4180	3207	90.45%	9.54%	36.11%	0.53%

‘Num’ represents the number of fragments contained in the database. ‘Max (bp)’ represents the length of the longest fragment in the database. ‘N50’, ‘N75’ and ‘N95’ each represent a fragment length such that all fragments of at least that length together cover at least 50%, 75% and 95%, respectively, of the total length of all fragments contained in the database. ‘MAR (%)’ and ‘Non-MAR (%)’ represent the ratios of multiple alignment and non-multiple alignment, respectively. ‘Reference (%)’ represents the proportion of the reference genome covered. ‘Duplication ratio’ represents the total number of aligned bases in the repetitive sequences divided by the total number of aligned bases in the reference. If too many repetitive sequences cover the same regions, the duplication ratio increases greatly. This can happen for several reasons, including overestimated repeat multiplicities and overlaps between repetitive sequences.
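The N50, N75 and N95 statistics defined above can be computed with one small function. The sketch below is a generic implementation of that definition, not code from msRepDB; the function name `nx_length` and the toy fragment lengths are hypothetical.

```python
def nx_length(lengths, fraction):
    """Largest length L such that fragments of length >= L cover at least
    `fraction` of the total length of all fragments."""
    if not lengths:
        raise ValueError("at least one fragment length is required")
    target = fraction * sum(lengths)
    running = 0
    # Walk from the longest fragment down until the target is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length


# Toy fragment lengths (total 30 bp).
toy_lengths = [2, 3, 4, 5, 6, 10]
n50 = nx_length(toy_lengths, 0.50)
n75 = nx_length(toy_lengths, 0.75)
n95 = nx_length(toy_lengths, 0.95)
```

For the toy lengths, the two longest fragments (10 and 6) already cover more than half of the 30 bp total, so N50 is 6. N75 and N95 need progressively shorter fragments to reach their targets.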

The experimental results in Tables 2, 3 and 4 show that, based on msRepDB, RepeatMasker annotated 3 852 568 Retroelements-type repeats (1 291 793.390 kb in length) on the Human genome, compared with 2 800 814 (1 236 215.277 kb in length) for the combination of the two other databases (Table 2); 1 828 Satellites-type repeats (1 862.670 kb in length) on the Drosophila genome, compared with 1 372 (1 804.199 kb in length) for the combination of the two other databases (Table 3); and 61 139 DNA-transposons-type repeats (42 789.484 kb in length) on the Glycine max genome, compared with 58 468 (41 514.301 kb in length) for the combination of the two other databases (Table 4). The experimental results shown in Tables 1–4, Supplementary Tables S7–S12 and Supplementary Figures S11–S26 indicate that msRepDB is the most complete multi-species repetitive sequence database at present. To evaluate the false positive rate of the detection results, we conducted experiments on simulated sequencing data for Drosophila: we used RepeatMasker to annotate the repeats, took the annotated set as the ground-truth set, and compared it with the annotations from RepeatScout and from LongRepMarker (Supplementary Table S13). All false positives are counted by comparing the ground-truth set of annotations with that of RepeatScout or LongRepMarker.

Table 2.

Partial comparison of the proportion and detailed classification of detected repeats generated based on two databases of the Human genome

Combination of RepBase and Dfam [bases masked: 45.62%]				msRepDB [bases masked: 47.29%]
Repeat types	Number of elements	Length occupied	Percentage of sequences	Number of elements	Length occupied	Percentage of sequences
Retroelements	2 800 814	1 236 215 277 bp	37.78%	3 852 568	1 291 793 390 bp	39.48%
+SINEs	1 453 130	369 205 643 bp	11.28%	1 602 909	329 745 622 bp	10.08%
+Penelope	75	14 277 bp	0.00%	75	14 225 bp	0.00%
+LINEs	807 771	588 058 432 bp	17.97%	1 630 986	696 100 321 bp	21.27%
++CRE/SLACS	0	0 bp	0.00%	0	0 bp	0.00%
+++L2/CR1/Rex	193 908	56 822 264 bp	1.74%	294 645	69 266 031 bp	2.12%
+++R1/LOA/Jockey	0	0 bp	0.00%	0	0 bp	0.00%
+++R2/R4/NeSL	399	95 545 bp	0.00%	400	95 122 bp	0.00%
+++RTE/Bov-B	9 890	2 788 967 bp	0.09%	9 890	2 771 539 bp	0.08%
+++L1/CIN4	603 337	528 287 954 bp	16.15%	1 325 814	623 904 329 bp	19.07%
+LTR elements	539 913	278 951 202 bp	8.53%	618 673	265 947 447 bp	8.13%
++BEL/Pao	0	0 bp	0.00%	0	0 bp	0.00%
++Ty1/Copia	0	0 bp	0.00%	12	3718 bp	0.00%
++Gypsy/DIRS1	14 309	3 767 626 bp	0.12%	15 125	3 750 523 bp	0.11%
+++Retroviral	515 395	272 547 814 bp	8.33%	593 203	259 578 662 bp	7.93%
DNA transposons	425 304	102 360 429 bp	3.13%	424 193	100 612 296 bp	3.07%
+hobo-Activator	280 952	57 692 527 bp	1.76%	280 102	56 974 131 bp	1.74%
+Tc1-IS630-Pogo	128 851	41 753 772 bp	1.28%	128 539	40 749 342 bp	1.25%
+En-Spm	0	0 bp	0.00%	0	0 bp	0.00%
+MuDR-IS905	0	0 bp	0.00%	0	0 bp	0.00%
+PiggyBac	2310	554 582 bp	0.02%	2285	546 552 bp	0.02%
+Tourist/Harbinger	321	59 199 bp	0.00%	320	59 104 bp	0.00%
+Other	0	0 bp	0.00%	0	0 bp	0.00%
Rolling circles	1614	402 976 bp	0.01%	3664	1 046 162 bp	0.03%
Unclassified	122 691	24 233 010 bp	0.74%	225 158	30 427 467 bp	0.93%
Total interspersed repeats		1 362 808 716 bp	41.65%		1 422 833 153 bp	43.48%
Small RNA	12 650	1 358 026 bp	0.04%	10 142	979 175 bp	0.03%
Satellites	15 404	82 714 065 bp	2.53%	12 135	79 167 870 bp	2.42%
Simple repeats	710 220	39 030 544 bp	1.19%	663 652	37 699 053 bp	1.15%
Low complexity	102 465	6 353 924 bp	0.19%	92 549	5 565 612 bp	0.17%

The test results were obtained by using RepeatMasker based on the msRepDB database and the combination of Dfam and RepBase respectively under the default parameter settings.
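To make the per-class differences in Table 2 easier to read, the sketch below parses a few rows as they are printed (space-grouped digits with a `bp` suffix) and reports msRepDB minus the RepBase and Dfam combination. The figures are copied from Table 2; the helper names `parse_count` and `differences` are illustrative and not part of any tool mentioned in the article.

```python
def parse_count(text):
    """Turn a space-grouped figure such as '1 236 215 277 bp' into an int."""
    return int("".join(ch for ch in text if ch.isdigit()))


# (repeat type, elements and length for RepBase+Dfam, elements and length for msRepDB)
# Values copied from Table 2 (Human genome).
ROWS = [
    ("Retroelements", "2 800 814", "1 236 215 277 bp", "3 852 568", "1 291 793 390 bp"),
    ("SINEs", "1 453 130", "369 205 643 bp", "1 602 909", "329 745 622 bp"),
    ("LINEs", "807 771", "588 058 432 bp", "1 630 986", "696 100 321 bp"),
    ("LTR elements", "539 913", "278 951 202 bp", "618 673", "265 947 447 bp"),
    ("DNA transposons", "425 304", "102 360 429 bp", "424 193", "100 612 296 bp"),
]


def differences(rows):
    """Map each repeat type to (element difference, length difference in bp)."""
    result = {}
    for name, n_ref, len_ref, n_new, len_new in rows:
        result[name] = (
            parse_count(n_new) - parse_count(n_ref),
            parse_count(len_new) - parse_count(len_ref),
        )
    return result


for name, (d_count, d_bp) in differences(ROWS).items():
    print(f"{name:16s} elements {d_count:+,d}  length {d_bp:+,d} bp")
```

For these rows the increase in masked length comes from LINEs, while the SINE, LTR-element and DNA-transposon lengths are somewhat lower with msRepDB than with the combined databases.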

Table 3.

Partial comparison of the proportion and detailed classification of repeats detected in the Drosophila genome using the two databases

Combination of RepBase and Dfam [bases masked: 20.85%]				msRepDB [bases masked: 21.86%]
Repeat types	Number of elements	Length occupied	Percentage of sequences	Number of elements	Length occupied	Percentage of sequences
Retroelements	15 330	21 048 835 bp	14.65%	23 186	22 483 014 bp	15.64%
+SINEs	0	0 bp	0.00%	0	0 bp	0.00%
+Penelope	0	0 bp	0.00%	0	0 bp	0.00%
+LINEs	5293	5 447 560 bp	4.49%	6134	6 416 652 bp	4.46%
++CRE/SLACS	0	0 bp	0.00%	0	0 bp	0.00%
+++L2/CR1/Rex	811	844 019 bp	0.59%	870	841 783 bp	0.59%
+++R1/LOA/Jockey	1014	1 562 240 bp	1.09%	1571	1 694 722 bp	1.18%
+++R2/R4/NeSL	38	39 896 bp	0.03%	38	39 900 bp	0.03%
+++RTE/Bov-B	0	0 bp	0.00%	0	0 bp	0.00%
+++L1/CIN4	0	0 bp	0.00%	0	0 bp	0.00%
+LTR elements	10 037	14 601 275 bp	10.16%	16 914	16 066 362 bp	11.18%
++BEL/Pao	2326	3 123 105 bp	2.17%	2937	3 118 973 bp	2.17%
++Ty1/Copia	500	740 782 bp	0.52%	784	733 449 bp	0.51%
++Gypsy/DIRS1	7211	10 737 388 bp	7.47%	13 243	12 190 939 bp	8.48%
+++Retroviral	0	0 bp	0.00%	0	0 bp	0.00%
DNA transposons	4135	1 870 086 bp	1.30%	4494	1 824 527 bp	1.27%
+hobo-Activator	189	75 919 bp	0.05%	168	76 244 bp	0.05%
+Tc1-IS630-Pogo	1112	609 344 bp	0.42%	1108	560 858 bp	0.39%
+En-Spm	0	0 bp	0.00%	0	0 bp	0.00%
+MuDR-IS905	0	0 bp	0.00%	0	0 bp	0.00%
+PiggyBac	23	8619 bp	0.01%	23	8617 bp	0.01%
+Tourist/Harbinger	0	0 bp	0.00%	0	0 bp	0.00%
+Other	2243	913 674 bp	0.64%	2454	894 197 bp	0.62%
Rolling circles	4662	999 082 bp	0.70%	5232	1 028 233 bp	0.72%
Unclassified	495	78 825 bp	0.05%	534	121 856 bp	0.08%
Total interspersed repeats		22 997 746 bp	16.00%		24 429 397 bp	17.00%
Small RNA	306	86 258 bp	0.06%	280	95 863 bp	0.07%
Satellites	1372	1 804 199 bp	1.26%	1828	1 862 670 bp	1.30%
Simple repeats	85 083	3 589 418 bp	2.50%	83 836	3 525 845 bp	2.45%
Low complexity	10 443	488 602 bp	0.34%	10 322	482 327 bp	0.34%

The test results were obtained by using RepeatMasker based on the msRepDB database and the combination of Dfam and RepBase respectively under the default parameter settings.
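A quick arithmetic check helps when reading Tables 2 and 3. Assuming the rows are transcribed correctly, the "Total interspersed repeats" length in each column equals the sum of the Retroelements, DNA transposons and Unclassified rows, with Rolling circles reported separately. The sketch below verifies this for all four columns; the dictionary layout and the function name are illustrative only.

```python
# Lengths in bp copied from Tables 2 and 3, in the order:
# (retroelements, DNA transposons, rolling circles, unclassified, reported total)
ROWS = {
    "Human / RepBase+Dfam": (1_236_215_277, 102_360_429, 402_976, 24_233_010, 1_362_808_716),
    "Human / msRepDB": (1_291_793_390, 100_612_296, 1_046_162, 30_427_467, 1_422_833_153),
    "Drosophila / RepBase+Dfam": (21_048_835, 1_870_086, 999_082, 78_825, 22_997_746),
    "Drosophila / msRepDB": (22_483_014, 1_824_527, 1_028_233, 121_856, 24_429_397),
}


def check_total(retro, dna, rolling, unclassified, total):
    """Return (matches without rolling circles, matches with rolling circles)."""
    base = retro + dna + unclassified
    return base == total, base + rolling == total


for label, row in ROWS.items():
    print(label, check_total(*row))
```

Each column matches only when rolling circles are left out of the sum, which is worth keeping in mind when comparing the totals with the "bases masked" percentages in the table headers.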

The latest version of the Dfam database (v3.4) contains specific data for only 552 species (https://dfam.org/home), which can be further subdivided into data unique to Dfam and data merged with RepBase. The data for its other species (about 61 518) are inherited directly from RepBase. By comparison, msRepDB currently holds the repetitive sequences of 84 601 species. These were derived from the LongRepMarker detection results after two processing steps: removing impurities and chimeras, and constructing consensus sequences (Supplementary Figure S1, Supplementary Tables S1–S3). In terms of data completeness, msRepDB fully covers Dfam and RepBase while also providing data on some previously unlisted species.

Continuous updates and long-term operation and maintenance are fundamental to the utility of the database. Since establishing the database, we have collected all available genomes from NCBI RefSeq (39) (https://www.ncbi.nlm.nih.gov/refseq/), Ensembl (40) (http://asia.ensembl.org/info/data/ftp/index.html), FungiDB (41) (https://fungidb.org/fungidb/app) and other sources, based on species name, NCBI accession number and taxid. The planned update measures are as follows. First, we will further expand species coverage and strive to build the most complete and accurate multi-species repetitive sequence database in the field of genomic repeat research. Second, we will continue to improve the performance of the algorithm in subsequent updates to achieve more accurate detection of repetitive sequences.

From a functional point of view, msRepDB not only provides a more complete multi-species repeat database for users to browse and download, but also offers online masking and annotation, which is a major feature of msRepDB. We have extensively optimized the code of the online masking function so that it can process large-scale sequences efficiently. With online masking and annotation, users can annotate genomes or sequences of interest accurately and quickly directly in msRepDB, and obtain detailed annotation reports without any third-party tool. The following scenario illustrates where online masking applies. Numerous cancers, genetic disorders, neurological disorders and metabolic disorders have been associated with Long Interspersed Element-1 (LINE-1 or L1) retrotransposition (42–44). When RepeatMasker annotates repetitive sequences in the Human genome, the results based on msRepDB contain 1 325 814 L1/CIN4 retrotransposon elements with an annotated length of 623 904 329 bp, whereas the results based on the combination of Dfam and RepBase contain 603 337 elements with an annotated length of 528 287 954 bp (Table 2). The same results can also be obtained through the online masking module of the msRepDB website. Because the proposed database contains more complete repetitive sequences and efficient user interfaces, we believe that it can provide accurate and targeted solutions towards understanding and diagnosis of complex diseases, optimization of plant properties and development of new drugs, and thus greatly benefit genome research.
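The L1/CIN4 comparison above can be restated as ratios. The short sketch below uses only the four figures quoted from Table 2 and computes the fold change in element count and annotated length, plus the mean annotated length per element. The variable and function names are illustrative, and the mean length is plain division, not a statistic reported in the article.

```python
# L1/CIN4 figures for the Human genome, copied from Table 2:
# (number of elements, annotated length in bp)
L1_COMBINED = (603_337, 528_287_954)    # RepBase + Dfam
L1_MSREPDB = (1_325_814, 623_904_329)   # msRepDB


def fold_change(new, reference):
    """Ratio of the new value to the reference value."""
    return new / reference


def mean_length(elements, total_bp):
    """Mean annotated length per element in bp."""
    return total_bp / elements


count_fold = fold_change(L1_MSREPDB[0], L1_COMBINED[0])
length_fold = fold_change(L1_MSREPDB[1], L1_COMBINED[1])

print(f"element count fold change:    {count_fold:.2f}")
print(f"annotated length fold change: {length_fold:.2f}")
print(f"mean length, RepBase+Dfam:    {mean_length(*L1_COMBINED):.1f} bp")
print(f"mean length, msRepDB:         {mean_length(*L1_MSREPDB):.1f} bp")
```

The element count grows faster than the annotated length, so the mean annotated length per L1 element is lower with msRepDB; this section does not explore why.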

DATA AVAILABILITY

The web interface to the database is available at https://msrepdb.cbrc.kaust.edu.sa/pages/msRepDB/index.html. The website is free and open to all users, and no login or password is required.

Contributor Information

Xingyu Liao, Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia; Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, P.R. China.

Kang Hu, Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, P.R. China.

Adil Salhi, Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia.

You Zou, Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, P.R. China.

Jianxin Wang, Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, P.R. China.

Xin Gao, Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

This work was supported by the National Natural Science Foundation of China [62002388, 61732009, 61772557, U1909208], King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) [FCC/1/1976-18-01, FCC/1/1976-23-01, FCC/1/1976-25-01, FCC/1/1976-26-01, REI/1/0018-01-01, REI/1/4216-01-01, REI/1/4437-01-01, REI/1/4473-01-01, URF/1/4352-01-01, URF/1/4379-01-01, REI/1/4742-01-01, URF/1/4098-01-01], Hunan Provincial Natural Science Foundation of China [2021JJ40787], Hunan Provincial Science and Technology Program [2018wk4001] and 111 Project [B18059]. This work was carried out in part using computing resources at the High Performance Computing Center of Central South University.

Conflict of interest statement. None declared.

REFERENCES

1. Cox R., Mirkin S.M. Characteristic enrichment of DNA repeats in different genomes. Proc. Natl. Acad. Sci. U.S.A. 1997; 94:5237–5242.
2. Lu J.Y., Shao W., Chang L., Yin Y., Li T., Zhang H., Hong Y., Percharde M., Guo L., Wu Z. et al. Genomic repeats categorize genes with distinct functions for orchestrated regulation. Cell Rep. 2020; 30:3296–3311.
3. Ahmad S.F., Singchat W., Jehangir M., Suntronpong A., Panthum T., Malaivijitnond S., Srikulnath K. Dark matter of primate genomes: satellite DNA repeats and their evolutionary dynamics. Cells. 2020; 9:2714.
4. Shapiro J.A., von Sternberg R. Why repetitive DNA is essential to genome function. Biol. Rev. 2005; 80:227–250.
5. Kaltenegger E., Leng S., Heyl A. The effects of repeated whole genome duplication events on the evolution of cytokinin signaling pathway. BMC Evol. Biol. 2018; 18:76–95.
6. Lu S., Wang G., Bacolla A., Zhao J., Spitser S., Vasquez K.M. Short inverted repeats are hotspots for genetic instability: relevance to cancer genomes. Cell Rep. 2015; 10:1674–1680.
7. George C.M., Alani E. Multiple cellular mechanisms prevent chromosomal rearrangements involving repetitive DNA. Crit. Rev. Biochem. Mol. Biol. 2012; 47:297–313.
8. Hall A.C., Ostrowski L.A., Pietrobon V., Mekhail K. Repetitive DNA loci and their modulation by the non-canonical nucleic acid structures R-loops and G-quadruplexes. Nucleus. 2017; 8:162–181.
9. Shweta M., Vinod G. Repetitive sequences in plant nuclear DNA: Types, Distribution, Evolution and Function. Genomics Proteomics Bioinformatics. 2014; 12:164–171.
10. Hannan A. Tandem repeats mediating genetic plasticity in health and disease. Nat. Rev. Genet. 2018; 19:286–298.
11. DeJesus-Hernandez M., Mackenzie I.R., Boeve B.F., Boxer A.L., Baker M., Rutherford N.J., Nicholson A.M., Finch N.A., Flynn H., Adamson J. et al. Expanded GGGGCC hexanucleotide repeat in noncoding region of C9ORF72 causes chromosome 9p-Linked FTD and ALS. Neuron. 2011; 72:245–256.
12. Alan E.R., Majounie E., Waite A., Simón-Sánchez J., Rollinson S., Gibbs J.R., Schymick J.C., Laaksovirta H., van Swieten J.C., Myllykangas L. et al. A hexanucleotide repeat expansion in C9ORF72 is the cause of chromosome 9p21-linked ALS-FTD. Neuron. 2011; 72:257–258.
13. Trost B., Engchuan W., Nguyen C.M., Thiruvahindrapuram B., Dolzhenko E., Backstrom I., Mirceta M., Mojarad B.A., Yin Y., Dov A. et al. Genome-wide detection of tandem DNA repeats that are expanded in autism. Nature. 2020; 586:80–86.
14. Mitra I., Huang B., Mousavi N., Ma M., Lamkin M., Yanicky R., Shleizer-Burko S., Lohmueller K.E., Gymrek M. et al. Patterns of de novo tandem repeat mutations and their role in autism. Nature. 2021; 589:246–250.
15. Hannan A.J. Repeat DNA expands our understanding of autism spectrum disorder. Nature. 2021; 589:200–202.
16. Beck C.R., Garcia-Perez J.L., Badge R.M., Moran J.V. LINE-1 elements in structural variation and disease. Annu. Rev. Gen. Hum. Genet. 2011; 12:187–215.
17. Chénais B. Transposable elements and human cancer: a causal relationship? Biochim. Biophys. Acta. 2013; 1835:28–35.
18. Belancio V.P., Roy-Engel A.M., Deininger P.L. All y’all need to know ’bout retroelements in cancer. Semin. Cancer Biol. 2010; 20:200–210.
19. Bao W., Kojima K.K., Kohany O. Repbase update, a database of repetitive elements in eukaryotic genomes. Mobile DNA. 2015; 6:11–17.
20. Hubley R., Finn R.D., Clements J., Eddy S.R., Jones T.A., Bao W., Smit A.F., Wheeler T.J. The Dfam database of repetitive DNA families. Nucleic Acids Res. 2016; 44:D81–D89.
21. Price A.L., Jones N.C., Pevzner P.A. De novo identification of repeat families in large genomes. Bioinformatics. 2005; 21:i351–i358.
22. Smit A.F.A., Hubley R., Green P. RepeatMasker Open-4.0. 2015; 1996–2015.
23. Schmutz J., Cannon S., Schlueter J., Ma J., Mitros T., Nelson W., Hyten D.L., Song Q., Thelen J.J., Cheng J. et al. Genome sequence of the palaeopolyploid soybean. Nature. 2010; 463:178–183.
24. Liao X., Li M., Hu K., Wu F.-X., Gao X., Wang J. A sensitive repeat identification framework based on short and long reads. Nucleic Acids Res. 2021; 49:e100.
25. Jullien M.F., Hubley R., Goubert C., Rosen J., Clark A.G., Feschotte C., Smit A.F. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. U.S.A. 2020; 117:9451–9457.
26. Liao X., Li M., Luo J., Zou Y., Wu F.-X., Pan Y., Luo F., Wang J. Improving de novo assembly based on read classification. IEEE/ACM Trans. Comput. Biol. Bioinformatics. 2020; 17:177–188.
27. Liao X., Li M., Zou Y., Wu F.-X., Pan Y., Wang J. An efficient trimming algorithm based on multi-feature fusion scoring model for NGS data. IEEE/ACM Trans. Comput. Biol. Bioinformatics. 2020; 17:728–738.
28. Clausen P.T.L.C., Aarestrup F.M., Lund O. Rapid and precise alignment of raw reads against redundant databases with KMA. BMC Bioinformatics. 2018; 19:307.
29. Koch P., Platzer M., Downie B.R. RepARK–de novo creation of repeat libraries from whole-genome NGS reads. Nucleic Acids Res. 2014; 42:e80.
30. Chong C., Nielsen R., Wu Y. REPdenovo: inferring de novo repeat motifs from short sequence reads. PLoS One. 2016; 11:e0150719.
31. Liao X., Gao X., Zhang X., Wu F.-X., Wang J. RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads. BMC Bioinformatics. 2020; 21:463.
32. Liao X., Li M., Zou Y., Wu F.-X., Pan Y., Wang J. Current challenges and solutions of de novo assembly. Quant. Biol. 2019; 7:90–109.
33. Sohn J.I., Nam J.W. The present and future of de novo whole-genome assembly. Brief. Bioinformatics. 2018; 19:23–40.
34. Chen Q., Zobel J., Verspoor K. Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study. Database. 2017; 2017:baw163.
35. Page A.J., Taylor B., Delaney A.J., Soares J., Seemann T., Keane J.A., Harris S.R. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb. Genom. 2016; 2:e000056.
36. Bao Z., Eddy S.R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 2002; 12:1269–1276.
37. Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics. 2009; 25:1754–1760.
38. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018; 34:3094–3100.
39. Pruitt K.D., Tatusova T., Maglott D.R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007; 35:D61–D65.
40. Hubbard T., Barker D., Birney E., Cameron G., Chen Y., Clark L., Cox T., Cuff J., Curwen V., Down T. et al. The Ensembl genome database project. Nucleic Acids Res. 2002; 30:38–41.
41. Basenko E.Y., Pulman J.A., Shanmugasundram A., Harb O.S., Crouch K., Starns D., Warrenfeltz S., Aurrecoechea C., Stoeckert C.J. Jr, Kissinger J.C. et al. FungiDB: an integrated bioinformatic resource for fungi and oomycetes. J. Fungi. 2018; 4:39–67.
42. Zhang X., Zhang R., Yu J. New understanding of the relevant role of LINE-1 retrotransposition in human disease and immune modulation. Front. Cell Dev. Biol. 2020; 8:657.
43. Solyom S., Ewing A.D., Rahrmann E.P., Doucet T., Nelson H.H., Burns M.B., Harris R.S., Sigmon D.F., Casella A., Erlanger B. et al. Extensive somatic L1 retrotransposition in colorectal tumors. Genome Res. 2012; 22:2328–2338.
44. Scott E.C., Gardner E.J., Masood A., Chuang N.T., Vertino P.M., Devine S.E. A hot L1 retrotransposon evades somatic repression and initiates human colorectal cancer. Genome Res. 2016; 26:745–755.
