Abstract
Microsatellites are ubiquitous short tandem repeats found in all known genomes and are known to play a very important role in various studies and fields including DNA fingerprinting, paternity studies, evolutionary studies, virulence and adaptation of certain bacteria and viruses etc. Due to the sequencing of several genomes and the availability of enormous amounts of sequence data during the past few years, computational studies of microsatellites are of interest for many researchers. In this context, we developed a software tool called Imperfect Microsatellite Extractor (IMEx), to extract perfect, imperfect and compound microsatellites from genome sequences along with their complete statistics. Recently we developed a user-friendly graphical-interface using JAVA for IMEx to be used as a stand-alone software named G-IMEx. G-IMEx takes a nucleotide sequence as an input and the results are produced in both html and text formats. The Linux version of G-IMEx can be downloaded for free from http://www.cdfd.org.in/imex
Keywords: Bioinformatics Tool, Stand-alone program, Microsatellites, Simple Sequence Repeats, Genomes
Background
Microsatellites, also known as Simple Sequence Repeats (SSRs) or Short Tandem Repeats (STRs), are tandem repetitions of a nucleotide motif of size 1-6 bp. They are distributed in both coding as well as non-coding regions of all known genomes. Because of their polymorphic nature, they are known to play an important role in gene regulation, pathogenesis, bacterial adaptation and in evolution of genomes [1–5]. They are also applied in various fields such as DNA fingerprinting, Paternity studies, Forensics, Evolutionary studies etc. As the sequencing of new genomes is increasing day-by-day, microsatellites of many genomes remain unexplored. Analysis of these microsatellites is important to understand their role in various studies. Computational analysis is a better alternative to the time-consuming and money-intensive traditional wet lab microsatellite studies. A software tool that can extract all types of microsatellites with greater sensitivity and provides flexible options to analyze the repeats detected is the need of the day.
Few tools [6–9] exist in the public domain for extracting microsatellites from genome sequences, but many of them suffer from certain lacunae in- terms of their features and their efficiency. In the course of our studies on evolution of microsatellites in prokaryotic genomes, we developed a novel algorithm [10] to detect imperfect microsatellites from nucleotide sequences. The algorithm has been implemented in the form of a stand- alone software with a user-friendly graphical user interface (GUI) called G-IMEx. The present communication gives the details of this software.
Methodology
The algorithmic details of IMEx have been reported elsewhere [10]. For the sake of continuity we reiterate the method. IMEx scans the input sequence and looks for two consecutive exact repeat units or two alternate exact repeat units and considers them as the ‘ candidate ’ microsatellite repeat tract. The ‘ candidate ’ tract is expanded on both sides by allowing few mismatches in each individual repeat unit ( ‘ k ’ ‐ imperfection limit / repeat unit) such that the percentage of imperfection of the entire tract does not cross the threshold set by the user. The expansion is also terminated if a repeat unit with more than ‘ k ’ mismatches is encountered. The program further collates and clusters equivalent microsatellite repeats into families. It also has an option to identify compound microsatellites, which are regions containing more than one microsatellite tract separated by a certain distance as defined by the user.
Software Requirements
G-IMEx has been developed on the Linux platform and requires preinstalled C and Java (for graphical interface). An ideal environment for running G-IMEx would be a latest Fedora or other Linux distribution with a gcc compiler (version 3.4 or higher), Java version (1.6 or higher) and any browser software.
Input options
G-IMEx offers several options for identification, extraction, collation, clustering and reporting of microsatellites from an input DNA sequence in FASTA format. The software can handle large sequences such as genomes easily and is comparatively faster than many other tools. Users can set thelimits for repeat size, repeat number, repeat type and imperfection level. In addition users can set levels (0 to 4) for clustering of equivalent microsatellites and also to detect compound microsatellites i.e., those microsatellites which are close to each other sequentially. There is also an option to use the core IMEx program in batch mode for scanning multiple sequences.
Output options
G-IMEx creates a folder with the name of the input sequence file and the results are stored in two formats ‐ html and text. The text format of results is optional and separate directories are created for text and html results. The output includes a well-formatted summary table file with information such as the repeating motif, repeat number, imperfection %, tract size, nucleotide composition and protein information (if it falls in coding region) etc. Along with the information about the microsatellite extracted, its corresponding alignment with its perfect repeat counter part is also produced automatically in a separate alignment file which facilitates analysis of mutational events in a microsatellite tract. Figure 1 shows the snapshot of the GUI and the result pages of G-IMEx.
Future Work
The current version of G-IMEx is available only for Linux users. Efforts are underway to develop versions compatible to Windows and Macintosh systems.
Figure 1.
Snapshots of Graphical User-Interface and Results Pages.
Acknowledgments
The authors acknowledge the support from the lab members of computational biology and the core grant of CDFD. SBM would like to thank Mr. Priyatosh Mishra for his valuable help during the development of the graphical interface and the management of AEC for providing necessary facilities.
Footnotes
Citation:Mudunuri etal; Bioinformation 5(5): 221-223 (2010)
References
- 1.Belkum AV, et al. Microbiol MolBiol Rev. 1998;62(2):275. [Google Scholar]
- 2.Martin P, et al. Proc Natl. Acad Sci USA. 2005;102(10):3800. doi: 10.1073/pnas.0406805102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Moxon ER, et al. Curr Biol . 1994;4(1):24. doi: 10.1016/s0960-9822(00)00005-1. [DOI] [PubMed] [Google Scholar]
- 4.Sreenu VB, et al. BMC Genomics . 2006;7:78. doi: 10.1186/1471-2164-7-78. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Kofler R, et al. BMC Genomics . 2008;9:612. doi: 10.1186/1471-2164-9-612. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Benson G, et al. Nucleic Acids Res. 1999;27:573. doi: 10.1093/nar/27.2.573. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Kofler R, et al. Bioinformatics. 2007;23:1683. doi: 10.1093/bioinformatics/btm157. [DOI] [PubMed] [Google Scholar]
- 8.LaRota M, et al. BMC Genomics. 2005;6:23. [Google Scholar]
- 9. http://www.genomics.ceh.ac.uk/msatfinder.
- 10.Mudunuri SB, Nagarajaram HA. Bioinformatics. 2007;23(10):1181. doi: 10.1093/bioinformatics/btm097. [DOI] [PubMed] [Google Scholar]

