ABSTRACT
Here, we report the genome sequence of Thermobifida halotolerans DSM 44931, a bacterium that was originally isolated from a salt mine in the Yunnan Province of China. This genome was sequenced using Pacific Biosciences sequencing technology and was assembled into 2 contigs in 2 scaffolds. It has a total length of 5,506,851 bp and a GC content of 71.16%. Functional annotation of this genome provides further metabolic insight into this species.
KEYWORDS: bacteria, genome, thermophile, actinomycete, biotechnology
ANNOUNCEMENT
Thermobifida halotolerans is an aerobic, gram-positive, mesophilic bacterial species that was isolated from a salt mine in the Yunnan Province of China (1). It is a filamentous actinomycete whose genome encodes enzymes that can degrade synthetic polymers and several types of biomass (2–5). Until now, the only genome available for this species was a draft assembly with 29× coverage, 74 scaffolds, and 371 contigs (GenBank accession number LIZN00000000) (6). This work presents an improved high-quality draft genome of T. halotolerans DSM 44931.
T. halotolerans DSM 44931 was purchased from the Leibniz Institute DSMZ and cultivated in 100 mL of Czapek peptone medium for 7 days at 37°C. A 20 mL aliquot of this culture was centrifuged at 4,649 × g at room temperature for 8 minutes. High-molecular-weight genomic DNA was extracted from the pelleted sample using a modified cetyltrimethylammonium bromide protocol (7).
The quality and quantity of the extracted DNA were evaluated using an Agilent 2200 TapeStation and an Invitrogen Qubit 2.0 Fluorometer, respectively. The draft genome of T. halotolerans DSM 44931 was generated at the Department of Energy (DOE) Joint Genome Institute (JGI) using the Pacific Biosciences (PacBio) sequencing technology (8). Default parameters were used for all software unless otherwise specified. A >10 kb PacBio SMRTbell™ library was constructed and sequenced on the PacBio Sequel platform, which generated 81,163 high-fidelity CCS reads totaling 763,311,996 bp (Fig. 1). BBDuk in BBTools v38.98 was used to filter out low-quality reads (9). The input read coverage was 139.4×. Reads >5 kb were assembled with Flye v2.8.3 (10). The final draft assembly contained 2 contigs in 2 scaffolds, totaling 5,506,851 bp in size with a GC content of 71.16% (Table 1). The contigs were determined to be linear by Circlator v1.5.5 (11). The quality of this assembly was evaluated using tRNAscan-SE v2.0.4 (12) to count tRNAs; Barrnap v0.9-Dev (13) to determine the presence of the 5S, 16S, and 23S rRNA genes; and CheckM2 v1.1.0 (14) to estimate completeness and contamination (Table 1).
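As a rough sanity check on the sequencing depth, mean coverage can be approximated as the total number of sequenced bases divided by the assembly length. The sketch below uses only the read count, read total, and assembly size reported above; the function and constant names are illustrative, and the calculation is an assumption about how such a figure is typically derived, not a description of the JGI pipeline.

```python
def mean_coverage(total_bases: int, genome_length: int) -> float:
    """Approximate mean depth as sequenced bases per assembled base."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_bases / genome_length


# Values reported in the text above.
TOTAL_CCS_BASES = 763_311_996   # bp across all CCS reads
N_CCS_READS = 81_163            # number of CCS reads
ASSEMBLY_LENGTH = 5_506_851     # bp in the final draft assembly

coverage = mean_coverage(TOTAL_CCS_BASES, ASSEMBLY_LENGTH)
mean_read_length = TOTAL_CCS_BASES / N_CCS_READS

print(f"Approximate coverage: {coverage:.1f}x")
print(f"Mean CCS read length: {mean_read_length:.0f} bp")
```

This simple ratio gives roughly 138.6×, close to the reported input read coverage of 139.4×. The exact basis of the reported figure is not stated, so a small difference between the two is not unexpected.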
Fig 1.
A histogram of the length frequencies of the CCS reads generated by the PacBio Sequel platform.
TABLE 1.
Genome assembly statistics, quality assessment, and annotation statistics for T. halotolerans DSM 44931
| Metric | Assembly statistic |
|---|---|
| Total number of scaffolds | 2 |
| Total number of contigs | 2 |
| Total scaffold sequence length (bp) | 5,506,851 |
| Total contig sequence length (bp) | 5,506,851 |
| Scaffold N50 (bp) | 5,450,929 |
| Contig N50 (bp) | 5,450,929 |
| Pre-filtered CCS read N50 (bp) | 11,849 |
| Largest contig (bp) | 5,450,929 |
| Number of scaffolds >50 kb | 2 |
| Percent of genome in scaffolds >50 kb | 100% |
| Percent of reads assembled | 100% |
GC percentage shown as count of G's and C's divided by the total number of bases. The total number of bases is not necessarily synonymous with a total number of G's, C's, A's, and T's.
Regulatory or miscellaneous genes are genes that are not classified as a protein-coding gene, a type of RNA, or a pseudogene, but as 'unknown' or 'other' by the source provider.
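The N50 and GC definitions used in Table 1 can be illustrated with a short sketch. The second contig length (55,922 bp) is not reported directly; it is inferred here as the total length minus the largest contig. The `gc_percent` function follows one reading of the footnote above, namely that the denominator counts every base, including ambiguous ones such as N. Both function names are illustrative and do not correspond to the tools used for the assembly.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together contain at least half of the total bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def gc_percent(seq):
    """GC percentage as (G + C) / all bases, ambiguous bases included."""
    if not seq:
        raise ValueError("seq must not be empty")
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


# Largest contig from Table 1; the second length is inferred
# (5,506,851 - 5,450,929), not reported directly.
contig_lengths = [5_450_929, 55_922]

print(n50(contig_lengths))   # contig N50 in bp
print(gc_percent("GCNN"))    # N's count toward the denominator
```

Because the largest contig alone holds more than half of the assembled bases, the N50 equals the largest contig length, which matches the values in Table 1.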
The draft genome assembly was annotated using the JGI Integrated Microbial Genomes (15) annotation pipeline v5.1.9 (16). INFERNAL v1.1.2 (17) was used to search against the Rfam 13.0 (18) database (excluding tRNA and CRISPR models) to identify structural RNAs and regulatory motifs; GeneMark.hmm-2 v1.05 (19) and Prodigal v2.6.3 (20) were used to identify protein-coding genes; tRNAscan-SE v2.0.4 (12) was used to identify tRNAs; and CRT v1.8.2 (21) was used to predict CRISPR arrays. Functional annotation was assigned to the genes via lastal 1256 (22) and HMMER v3.1b2 (23) using the following databases: IMG-NR 20211118 (24), SMART 01_06_2016 (25), COG 2003 (26), TIGRFAM v15.0 (27), SuperFamily v1.75 (28), Pfam v34.0 (29), and Cath-Funfam v4.2.0 (30). Topological annotation of protein-coding genes was assigned using SignalP 4.1 (31) and decodeanhmm 1.1g (32). These annotations categorize the features of this genome into protein-coding genes, regulatory and miscellaneous features, and RNA genes; furthermore, they reveal predicted metabolic roles and clustering of the protein-coding genes (Table 1).
Overall, this high-quality draft genome improves our understanding of T. halotolerans biology and our ability to use this species and other actinomycetes for biotechnological applications.
ACKNOWLEDGMENTS
The work (proposal: 10.46936/10.25585/60008441) conducted by the U.S. Department of Energy Joint Genome Institute (https://ror.org/04xm1d337), a DOE Office of Science User Facility, is supported by the Office of Science of the U.S. Department of Energy, operated under Contract No. DE-AC02-05CH11231. Metadata curation and management were supported using resources from the Genomes OnLine Database (GOLD) (33). A portion of this work was funded by BASF CARA and the Office of Biological and Environmental Research of the U.S. Department of Energy via DE-SC0022142 and also via the Joint BioEnergy Institute (JBEI, https://ror.org/03ww55028) through contract DE-AC02-05CH11231. The United States government retains, and the publisher, by accepting the article for publication, acknowledges that the United States government retains, a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States government purposes. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States government. The authors acknowledge the use of the Biological Nanostructures Laboratory within the California NanoSystems Institute, supported by the University of California, Santa Barbara and the University of California, Office of the President.
Contributor Information
Michelle A. O'Malley, Email: momalley@ucsb.edu.
J. Cameron Thrash, University of Southern California, Los Angeles, California, USA.
DATA AVAILABILITY
The genome sequence was deposited to GenBank under the accession number JBGBYW000000000. The raw reads have been deposited in the NCBI SRA under the accession number SRP583730. Additional data can be explored or downloaded from the JGI Integrated Microbial Genomes with Microbiomes (IMG/M) portal using the NCBI BioProject accession number PRJNA1115251.
REFERENCES
- 1. Yang LL, Tang SK, Zhang YQ, Zhi XY, Wang D, Xu LH, Li WJ. 2008. Thermobifida halotolerans sp. nov., isolated from a salt mine sample, and emended description of the genus Thermobifida. Int J Syst Evol Microbiol 58:1821–1825. doi: 10.1099/ijs.0.65732-0
- 2. Ribitsch D, Herrero Acero E, Greimel K, Dellacher A, Zitzenbacher S, Marold A, Rodriguez RD, Steinkellner G, Gruber K, Schwab H, Guebitz GM. 2012. A new esterase from Thermobifida halotolerans hydrolyses polyethylene terephthalate (PET) and polylactic acid (PLA). Polymers (Basel) 4:617–629. doi: 10.3390/polym4010617
- 3. Zhang F, Hu SN, Chen JJ, Lin LB, Wei YL, Tang SK, Xu LH, Li WJ. 2012. Purification and partial characterisation of a thermostable xylanase from salt-tolerant Thermobifida halotolerans YIM 90462T. Process Biochem 47:225–228. doi: 10.1016/j.procbio.2011.10.032
- 4. Yin YR, Sang P, Xiao M, Xian WD, Dong ZY, Liu L, Yang LQ, Li WJ. 2021. Expression and characterization of a cold-adapted, salt- and glucose-tolerant GH1 β-glucosidase obtained from Thermobifida halotolerans and its use in sugarcane bagasse hydrolysis. Biomass Conv Bioref 11:1245–1253. doi: 10.1007/s13399-019-00556-5
- 5. Zhang F, Chen JJ, Ren WZ, Nie GX, Ming H, Tang SK, Li WJ. 2011. Cloning, expression and characterization of an alkaline thermostable GH9 endoglucanase from Thermobifida halotolerans YIM 90462 T. Bioresour Technol 102:10143–10146. doi: 10.1016/j.biortech.2011.08.019
- 6. Zhang F, Zhang XM, Yin YR, Li WJ. 2015. Cloning, expression and characterization of a novel GH5 exo/endoglucanase of Thermobifida halotolerans YIM 90462(T) by genome mining. J Biosci Bioeng 120:644–649. doi: 10.1016/j.jbiosc.2015.04.012
- 7. Allen GC, Flores-Vergara MA, Krasynanski S, Kumar S, Thompson WF. 2006. A modified protocol for rapid DNA isolation from plant tissues using cetyltrimethylammonium bromide. Nat Protoc 1:2320–2325. doi: 10.1038/nprot.2006.384
- 8. Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B, et al. 2009. Real-time DNA sequencing from single polymerase molecules. Science 323:133–138. doi: 10.1126/science.1162986
- 9. Bushnell B. n.d. BBDuk. Available from: sourceforge.net/projects/bbmap/
- 10. Chin CS, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, Clum A, Copeland A, Huddleston J, Eichler EE, Turner SW, Korlach J. 2013. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods 10:563–569. doi: 10.1038/nmeth.2474
- 11. Hunt M, Silva ND, Otto TD, Parkhill J, Keane JA, Harris SR. 2015. Circlator: automated circularization of genome assemblies using long sequencing reads. Genome Biol 16:294. doi: 10.1186/s13059-015-0849-0
- 12. Chan PP, Lin BY, Mak AJ, Lowe TM. 2021. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Res 49:9077–9096. doi: 10.1093/nar/gkab688
- 13. Seemann T. 2013. Barrnap v.0.9-Dev. Available from: https://github.com/tseemann/barrnap
- 14. Chklovski A, Parks DH, Woodcroft BJ, Tyson GW. 2023. CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nat Methods 20:1203–1212. doi: 10.1038/s41592-023-01940-w
- 15. Chen IMA, Chu K, Palaniappan K, Ratner A, Huang J, Huntemann M, Hajek P, Ritter SJ, Webb C, Wu D, Varghese NJ, Reddy TBK, Mukherjee S, Ovchinnikova G, Nolan M, Seshadri R, Roux S, Visel A, Woyke T, Eloe-Fadrosh EA, Kyrpides NC, Ivanova NN. 2023. The IMG/M data management and analysis system v.7: content updates and new features. Nucleic Acids Res 51:D723–D732. doi: 10.1093/nar/gkac976
- 16. Huntemann M, Ivanova NN, Mavromatis K, Tripp HJ, Paez-Espino D, Tennessen K, Palaniappan K, Szeto E, Pillay M, Chen IMA, Pati A, Nielsen T, Markowitz VM, Kyrpides NC. 2016. The standard operating procedure of the DOE-JGI metagenome annotation pipeline (MAP v.4). Stand in Genomic Sci 11. doi: 10.1186/s40793-016-0138-x
- 17. Nawrocki EP, Eddy SR. 2013. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29:2933–2935. doi: 10.1093/bioinformatics/btt509
- 18. Kalvari I, Argasinska J, Quinones-Olvera N, Nawrocki EP, Rivas E, Eddy SR, Bateman A, Finn RD, Petrov AI. 2018. Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res 46:D335–D342. doi: 10.1093/nar/gkx1038
- 19. Besemer J, Lomsadze A, Borodovsky M. 2001. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 29:2607–2618. doi: 10.1093/nar/29.12.2607
- 20. Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119. doi: 10.1186/1471-2105-11-119
- 21. Bland C, Ramsey TL, Sabree F, Lowe M, Brown K, Kyrpides NC, Hugenholtz P. 2007. CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 8:209. doi: 10.1186/1471-2105-8-209
- 22. Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC. 2011. Adaptive seeds tame genomic sequence comparison. Genome Res 21:487–493. doi: 10.1101/gr.113985.110
- 23. Finn RD, Clements J, Arndt W, Miller BL, Wheeler TJ, Schreiber F, Bateman A, Eddy SR. 2015. HMMER web server: 2015 update. Nucleic Acids Res 43:W30–8. doi: 10.1093/nar/gkv397
- 24. Clum A, Huntemann M, Bushnell B, Foster B, Foster B, Roux S, Hajek PP, Varghese N, Mukherjee S, Reddy TBK, Daum C, Yoshinaga Y, O’Malley R, Seshadri R, Kyrpides NC, Eloe-Fadrosh EA, Chen I-MA, Copeland A, Ivanova NN. 2021. DOE JGI metagenome workflow. mSystems 6. doi: 10.1128/mSystems.00804-20
- 25. Letunic I, Bork P. 2018. 20 years of the SMART protein domain annotation resource. Nucleic Acids Res 46:D493–D496. doi: 10.1093/nar/gkx922
- 26. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41. doi: 10.1186/1471-2105-4-41
- 27. Haft DH, Selengut JD, White O. 2003. The TIGRFAMs database of protein families. Nucleic Acids Res 31:371–373. doi: 10.1093/nar/gkg128
- 28. de Lima Morais DA, Fang H, Rackham OJL, Wilson D, Pethica R, Chothia C, Gough J. 2011. SUPERFAMILY 1.75 including a domain-centric gene ontology method. Nucleic Acids Res 39:D427–34. doi: 10.1093/nar/gkq1130
- 29. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Tosatto SCE, Paladin L, Raj S, Richardson LJ, Finn RD, Bateman A. 2021. Pfam: the protein families database in 2021. Nucleic Acids Res 49:D412–D419. doi: 10.1093/nar/gkaa913
- 30. Dawson NL, Lewis TE, Das S, Lees JG, Lee D, Ashford P, Orengo CA, Sillitoe I. 2017. CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Res 45:D289–D295. doi: 10.1093/nar/gkw1098
- 31. Nielsen H. 2017. Predicting secretory proteins with signalP, p 59–73. In Kihara D (ed), Protein function prediction: methods and protocols. Springer, New York, NY.
- 32. Krogh A, Larsson B, von Heijne G, Sonnhammer EL. 2001. Predicting transmembrane protein topology with a hidden markov model: application to complete genomes. J Mol Biol 305:567–580. doi: 10.1006/jmbi.2000.4315
- 33. Mukherjee S, Stamatis D, Li CT, Ovchinnikova G, Kandimalla M, Handke V, Reddy A, Ivanova N, Woyke T, Eloe-Fadrosh EA, Chen I-MA, Kyrpides NC, Reddy TBK. 2025. Genomes online database (GOLD) v.10: new features and updates. Nucleic Acids Res 53:D989–D997. doi: 10.1093/nar/gkae1000

