AutoSeqMan: batch assembly of contigs for Sanger sequences

Jie-Qiong Jin; Yan-Bo Sun

doi:10.24272/j.issn.2095-8137.2018.027

2018 Mar 7;39(2):123–126.

PMCID: PMC5885390 PMID: 29515094

Abstract

With the wide application of DNA sequencing technology, DNA sequences are increasingly generated through the Sanger sequencing platform. SeqMan (in the LaserGene package) is an excellent program with an easy-to-use graphical user interface (GUI) employed to assemble Sanger sequences into contigs. However, as data sets grow, larger sample sets and more sequenced loci complicate contig assembly because of the considerable number of manual operations required to run SeqMan. Here, we present the ‘autoSeqMan’ software program, which automatically assembles contigs using the SeqMan scripting language. It has two main modules, ‘Classification’ and ‘Assembly’. Classification first undertakes preprocessing work, whereas Assembly generates a SeqMan script to consecutively assemble contigs for the classified files. Through comparison with manual operation, we show that autoSeqMan saves substantial time in the preprocessing and assembly of Sanger sequences. We hope this tool will be useful for those with large sample sets to analyze but little programming experience. It is freely available at https://github.com/Sun-Yanbo/autoSeqMan.

Keywords: Batch processing, Sanger sequences, Contig assembly, SeqMan

INTRODUCTION

DNA sequencing technology has experienced a revolutionary shift from automated Sanger sequencing (Sanger et al., 1977) to next-generation sequencing (NGS; reviewed by Shendure & Ji (2008) and Shendure et al. (2004)) and genome assembly. Although NGS has come to dominate due to its high throughput (Schuster, 2008), it is not suitable for many population studies because of high costs and other limiting factors. For example, errors are often introduced in the final assembly and/or annotation results obtained from NGS data (Bickhart et al., 2017), and thus variations detected in high-throughput analyses require validation by Sanger sequencing (Wall et al., 2014). Furthermore, in some recent population genomic studies, error rates for Illumina data have been found to increase with depth of coverage, and thus caution is needed when interpreting the results of NGS-based association studies (Wall et al., 2014). As such, Sanger sequencing technology is still widely used in many research fields, including evolutionary taxonomy based on short DNA sequences (Chen et al., 2017), studies of the evolutionary history of wild animals (Yuan et al., 2016), estimates of biodiversity and its influencing factors (Zhou et al., 2017), and validation of mutations identified from high-throughput analyses (Sun et al., 2013).

Further, with the wide application of DNA sequencing technology, e.g., DNA barcoding, which uses short and standardized DNA sequences for the identification of organisms (Hajibabaei et al., 2007; Savolainen et al., 2005), Sanger sequencing data continue to accumulate among evolutionary taxonomists and others. Thus, batch manipulation of these Sanger sequences has become an important task before downstream analyses, especially for those who do not have a programming or bioinformatics background or experience. Although several general-purpose sequence manipulation packages have been published previously, including MEGA (Kumar et al., 2016), EMBOSS (Rice et al., 2000), and FasParser (Sun, 2017), these packages all operate on assembled contigs (a contig being a consensus region of overlapping DNA segments), and none specifically addresses the batch assembly of Sanger sequences.

SeqMan is a popular program in the LaserGene software package (DNAStar, Inc., Madison, WI, USA), which is used for assembling Sanger sequences into contigs and has been widely applied in a great number of studies. It can handle from two to thousands of Sanger sequences at one time but requires a considerable number of manual operations (e.g., mouse actions, Figure 1) to run. Hence, it is complicated and time-consuming for those with large sets of samples to assemble. Fortunately, since the release of Version 7, SeqMan has provided a scripting language, including commands for opening, naming, saving, and closing projects, and a single script may be used to execute multiple assemblies consecutively without manual intervention.

Figure 1. Overview of assembling tasks for Sanger sequences

Here, we developed a program called autoSeqMan, which provides a simple way to automatically classify Sanger sequences and then consecutively assemble them on a personal computer. It is mainly designed for researchers with large sets of samples with one or more loci sequenced.

IMPLEMENTATION AND REQUIREMENTS

autoSeqMan was developed as a standalone Windows desktop application (compiled and tested in Windows 7/10). It involves two modules, ‘Classification’ and ‘Assembly’, corresponding to steps 2 and 3 in Figure 1, respectively. Each module can handle multiple files and requires the user to select a directory: for ‘Classification’, the directory containing the raw Sanger sequence files (*.ab1 files); for ‘Assembly’, the directory containing the classified sub-folders created by ‘Classification’. Theoretically, there is no limit to the number of files that can be analyzed.

This tool requires that the sequence files be named in a specific format, in which the sample ID is present at the beginning of the file name. The Classification module recognizes the sample ID by the appropriate delimiter and then creates sub-folders (see below). For convenience, autoSeqMan also provides a “Rename” tool to help users rename the ab1 files for the subsequent analyses.

CLASSIFICATION

This function is designed to automatically create sub-folders according to the sample ID and/or sequenced locus. All downstream analyses are performed in the corresponding sub-folders, where all results are also saved. In our laboratory experience, this is an efficient and convenient way to manage and query laboratory samples (Chen et al., 2017; Zhou et al., 2017).

The only input is the name of the directory that contains the raw ab1 files. Classification has two input prerequisites. First, all files must be stored in the same directory. Second, all files must be named according to a certain pattern, i.e., “sample-locus-others”. For example, the file name “YPX24212_16S-2215_TSS20171122-0871-1171_H02.ab1” denotes a DNA sequence of 16S from sample number “YPX24212”. The program automatically parses the file name according to the user-specified delimiter and then creates a sub-folder “YPX24212_16S” in the main output folder. The delimiter can be “-”, “_”, or another character. After classification, the program lists all sub-folders, and the user can view the files classified into each sub-folder by simply clicking the folder name (Figure 2).
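The classification step described above can be illustrated with a minimal Python sketch. This is not the autoSeqMan source code; the function names `subfolder_name` and `classify` are hypothetical, and the sketch assumes that the sub-folder name is simply the text before the first occurrence of the delimiter, which is consistent with the example file name given above.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List
import shutil


def subfolder_name(filename: str, delimiter: str = "-") -> str:
    """Return the text before the first delimiter (assumed sub-folder name)."""
    stem = Path(filename).stem  # drop the ".ab1" extension
    return stem.split(delimiter, 1)[0]


def classify(input_dir: str, delimiter: str = "-") -> Dict[str, List[str]]:
    """Move each .ab1 file in input_dir into a sub-folder named after its prefix."""
    root = Path(input_dir)
    groups: Dict[str, List[str]] = defaultdict(list)
    # sorted() collects the file list before any file is moved
    for ab1 in sorted(root.glob("*.ab1")):
        folder = root / subfolder_name(ab1.name, delimiter)
        folder.mkdir(exist_ok=True)
        shutil.move(str(ab1), str(folder / ab1.name))
        groups[folder.name].append(ab1.name)
    return dict(groups)
```

With the default delimiter “-”, `subfolder_name("YPX24212_16S-2215_TSS20171122-0871-1171_H02.ab1")` returns `"YPX24212_16S"`, matching the sub-folder described in the text.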

Figure 2. Overview of ‘Classification’ function in autoSeqMan

ASSEMBLY

This function automatically assembles the classified sequence files. It first reads the list of classified sub-folders created by the ‘Classification’ function and then generates a SeqMan script for consecutively assembling the sequences in each sub-folder. To use this function, the user must first install the DNASTAR package (version 7 or higher) and then provide autoSeqMan with the full path of the SeqMan program (which autoSeqMan can recognize automatically), after which the program completes all assembly tasks and saves the assembly results automatically. The default script generates all SQD, FAS, and SEQ results (Figure 3).
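The script-generation idea can be sketched in Python as follows. This is an illustration only, not the autoSeqMan implementation: the actual SeqMan commands are those shown in Figure 3 and are deliberately not reproduced here, so the per-project `template` is supplied by the caller and the one used in the usage lines is a placeholder, not valid SeqMan syntax. The names `list_subfolders` and `build_script` are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def list_subfolders(root: str) -> List[Path]:
    """Return the classified sub-folders of root in a stable order."""
    return sorted(p for p in Path(root).iterdir() if p.is_dir())


def build_script(subfolders: Iterable[Path], template: str) -> str:
    """Fill the per-project template once for every sub-folder.

    The template may use {folder} (full path) and {name} (folder name).
    """
    blocks: List[str] = []
    for folder in subfolders:
        blocks.append(template.format(folder=folder, name=folder.name))
    return "\n".join(blocks)


# Placeholder template: replace with the real per-project SeqMan commands.
PLACEHOLDER = "<SeqMan commands for project {name}>"
script_text = build_script([Path("YPX24212_16S"), Path("YPX24213_16S")], PLACEHOLDER)
```

Writing one block per sub-folder into a single script reflects the point made above that one script can run multiple assemblies consecutively without manual intervention.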

Figure 3. Default SeqMan script for assembling Sanger sequences

PERFORMANCE

The main aim of autoSeqMan is to reduce the manual operations involved in preparing files and running the SeqMan program. To evaluate its performance, we applied this tool to our laboratory data (Chen et al., 2017; Zhou et al., 2017). In this test, one hundred samples were used, each of which had two ab1 files available. The Classification operation created the sub-folders (named by sample ID, as well as locus name if provided) and moved the appropriate files into them within 8 s, substantially less than the time needed for manual operation (about 1 h, as tested by our colleagues). Performance of the Assembly operation depended greatly on the running efficiency of SeqMan. In this test, the Assembly module required 64 s to consecutively assemble contigs for the classified sequences, also substantially less than the ~2 h required for manual operation, which demonstrates the value of autoSeqMan in dealing with large data sets.
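The timings reported above can be converted into approximate speed-up factors with a few lines of Python. The manual times are the rounded values given in the text (about 1 h and ~2 h), so the resulting factors are rough estimates rather than precise measurements; the helper name `speedup` is introduced here for illustration.

```python
def speedup(manual_seconds: float, automated_seconds: float) -> float:
    """Ratio of manual time to automated time."""
    return manual_seconds / automated_seconds


classification = speedup(1 * 3600, 8)   # about 1 h manual vs 8 s  -> 450.0
assembly = speedup(2 * 3600, 64)        # ~2 h manual vs 64 s      -> 112.5
overall = speedup(3 * 3600, 8 + 64)     # ~3 h manual vs 72 s      -> 150.0
```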

LIMITATIONS

It is important to note that autoSeqMan does not filter or trim the sequence data, even though poor-quality sequence ends are always present. Thus, after running autoSeqMan, users should check the quality of the final assemblies in SeqMan. In addition, the output Fasta files will have very long IDs, which might cause errors in subsequent sequence analyses. If necessary, users can use the “Sort & Rename” function of FasParser (Sun, 2017) to shorten these IDs.
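For readers who prefer a scripted alternative, shortening Fasta IDs can be sketched in Python as below. This is not the FasParser “Sort & Rename” function, and the function name `shorten_fasta_ids` is hypothetical. The sketch assumes that each ID begins with the “sample-locus” prefix followed by the delimiter, which is an assumption about the output format rather than something stated above. It raises an error if shortening would produce duplicate IDs.

```python
def shorten_fasta_ids(fasta_text: str, delimiter: str = "-") -> str:
    """Keep only the part of each Fasta ID before the first delimiter."""
    out = []
    seen = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            short = line[1:].strip().split(delimiter, 1)[0]
            if short in seen:
                # Truncation must not merge two different records
                raise ValueError(f"duplicate ID after shortening: {short}")
            seen.add(short)
            out.append(">" + short)
        else:
            out.append(line)  # sequence lines are left unchanged
    return "\n".join(out) + "\n"
```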

ACKNOWLEDGEMENTS

Special thanks to our colleagues, Hong-Man Chen, Yu-Qi He, and Fang Yan, for their helpful suggestions on the improvement of autoSeqMan.

COMPETING INTERESTS

The authors declare that they have no competing interests.

AUTHORS’ CONTRIBUTIONS

Y.B.S. and J.Q.J. designed the study and wrote the manuscript. Y.B.S. wrote the software. J.Q.J. evaluated the performance of autoSeqMan. All authors read and approved the final manuscript.

Funding Statement

The development of this package was supported by the National Natural Science Foundation of China (31671326) and the Youth Innovation Promotion Association, Chinese Academy of Sciences.

REFERENCES

Bickhart D.M., Rosen B.D., Koren S., Sayre B.L., Hastie A.R., Chan S., Lee J., Lam E.T., Liachko I., Sullivan S.T., Burton J.N., Huson H.J., Nystrom J.C., Kelley C.M., Hutchison J.L., Zhou Y., Sun J.J., Crisà A., De León F.A.P., Schwartz J.C., Hammond J.A., Waldbieser G.C., Schroeder S.G., Liu G.E., Dunham M.J., Shendure J., Sonstegard T.S., Phillippy A.M., Van Tassell C.P., Smith T.P. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nature Genetics. 2017;49(4):643–650. doi: 10.1038/ng.3802.
Chen J.M., Zhou W.W., Poyarkov N.A., Jr., Stuart B.L., Brown R.M., Lathrop A., Wang Y.Y., Yuan Z.Y., Jiang K., Hou M., Chen H.M., Suwannapoom C., Nguyen S.N., van Duong T., Papenfuss T.J., Murphy R.W., Zhang Y.P., Che J. A novel multilocus phylogenetic estimation reveals unrecognized diversity in Asian horned toads, genus Megophrys sensu lato (Anura: Megophryidae). Molecular Phylogenetics & Evolution. 2017;106:28–43. doi: 10.1016/j.ympev.2016.09.004.
Hajibabaei M., Singer G.A., Hebert P.D., Hickey D.A. DNA barcoding: how it complements taxonomy, molecular phylogenetics and population genetics. Trends in Genetics. 2007;23(4):167–172. doi: 10.1016/j.tig.2007.02.001.
Kumar S., Stecher G., Tamura K. MEGA7: Molecular evolutionary genetics analysis version 7.0 for bigger datasets. Molecular Biology and Evolution. 2016;33(7):1870–1874. doi: 10.1093/molbev/msw054.
Rice P., Longden I., Bleasby A. EMBOSS: the European molecular biology open software suite. Trends in Genetics. 2000;16(6):276–277. doi: 10.1016/S0168-9525(00)02024-2.
Sanger F., Nicklen S., Coulson A.R. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America. 1977;74(12):5463–5467. doi: 10.1073/pnas.74.12.5463.
Savolainen V., Cowan R.S., Vogler A.P., Roderick G.K., Lane R. Towards writing the encyclopedia of life: an introduction to DNA barcoding. Philosophical Transactions of the Royal Society B: Biological Sciences. 2005;360(1462):1805–1811. doi: 10.1098/rstb.2005.1730.
Schuster S.C. Next-generation sequencing transforms today's biology. Nature Methods. 2008;5(1):16–18. doi: 10.1038/nmeth1156.
Shendure J., Ji H. Next-generation DNA sequencing. Nature Biotechnology. 2008;26(10):1135–1145. doi: 10.1038/nbt1486.
Shendure J., Mitra R.D., Varma C., Church G.M. Advanced sequencing technologies: methods and goals. Nature Reviews Genetics. 2004;5(5):335–344. doi: 10.1038/nrg1325.
Sun Y.B. FasParser: a package for manipulating sequence data. Zoological Research. 2017;38(2):110–112. doi: 10.24272/j.issn.2095-8137.2017.017.
Sun Y.B., Zhou W.P., Liu H.Q., Irwin D.M., Shen Y.Y., Zhang Y.P. Genome-wide scans for candidate genes involved in the aquatic adaptation of dolphins. Genome Biology and Evolution. 2013;5(1):130–139. doi: 10.1093/gbe/evs123.
Wall J.D., Tang L.F., Zerbe B., Kvale M.N., Kwok P.Y., Schaefer C., Risch N. Estimating genotype error rates from high-coverage next-generation sequence data. Genome Research. 2014;24(11):1734–1739. doi: 10.1101/gr.168393.113.
Yuan Z.Y., Zhou W.W., Chen X., Poyarkov N.A., Jr., Chen H.M., Jang-Liaw N.H., Chou W.H., Matzke N.J., Iizuka K., Min M.S., Kuzmin S.L., Zhang Y.P., Cannatella D.C., Hillis D.M., Che J. Spatiotemporal diversification of the true frogs (Genus Rana): A historical framework for a widely studied group of model organisms. Systematic Biology. 2016;65(5):824–842. doi: 10.1093/sysbio/syw055.
Zhou W.W., Jin J.Q., Wu J., Chen H.M., Yang J.X., Murphy R.W., Che J. Mountains too high and valleys too deep drive population structuring and demographics in a Qinghai-Tibetan Plateau frog Nanorana pleskei (Dicroglossidae). Ecology and Evolution. 2017;7(1):240–252. doi: 10.1002/ece3.2646.
