SeqForge: A scalable platform for alignment-based searches, motif detection, and sequence curation across meta/genomic datasets

Elijah R Bring Horvath; Jaclyn M Winter

doi:10.1101/2025.08.12.669971

This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2025 Aug 15:2025.08.12.669971. [Version 1] doi: 10.1101/2025.08.12.669971

PMCID: PMC12364017 PMID: 40832300

Abstract

Background:

The rapid increase in publicly available microbial and metagenomic data has created a growing demand for tools that can efficiently perform custom large-scale comparative searches and functional annotation. While BLAST+ remains the standard for sequence similarity searches, population-level studies often require custom scripting and manual curation of results, which can present barriers for many researchers.

Results:

We developed SeqForge, a scalable, modular command-line toolkit that streamlines alignment-based searches and motif mining across large genomic datasets. SeqForge automates BLAST+ database creation and querying, integrates amino acid motif discovery, enables sequence and contig extraction, and curates results into structured, easily parsed formats. The platform supports diverse input formats, parallelized execution for high-performance computing environments, and built-in visualization tools. Benchmarking demonstrates that SeqForge achieves near-linear runtime scaling for computationally intensive modules while maintaining modest memory usage.

Conclusions:

SeqForge lowers the computational barrier for large-scale meta/genomic exploration, enabling researchers to perform population-scale BLAST searches, motif detection, and sequence curation without custom scripting. The toolkit is freely available and platform-independent, making it suitable for both personal workstations and high-performance computing environments.

Keywords: Genomics, bioinformatics, microbes, genome mining, BLAST, metagenomics

Background

The rapid expansion of high-throughput sequencing has ushered in a new era of genomics, with thousands of microbial genomes and metagenomic assemblies deposited into public repositories each year. This expansion enables large-scale comparative analyses, genome mining initiatives, and meta-studies that uncover novel biological insight. Population-scale searches—such as pangenome exploration, biosynthetic gene cluster discovery, and antimicrobial resistance gene surveys—are increasingly essential for linking genomic diversity to ecological or clinical outcomes, as well as to drug discovery and development. However, leveraging this wealth of data requires computational tools that are both powerful and accessible to researchers with diverse technical expertise.

Numerous bioinformatics platforms support specific aspects of genome analysis. Quality assessment tools (e.g., QUAST [1], SeqKit2 [2], CheckM [3]), phylogenomic pipelines (e.g., PhyloPhlAn [4], getphylo [5], autoMLST [6], RAxML [7]), annotation frameworks (e.g., Anvi’o [8], eggNOG-mapper [9], AMRFinderPlus [10], Prokka [11]) and biosynthetic gene cluster discovery platforms (e.g., antiSMASH [12], ARTs [13], PRISM [14], RODEO [15]) allow researchers to assemble, annotate, and interpret genomic datasets. However, none of these platforms directly address a core need in exploratory meta/genomics: scalable, customizable alignment-based searches across large genome collections.

NCBI BLAST+ [16] remains the most widely used tool for sequence alignment and exploratory genome interrogation due to its robustness, speed, and versatility. However, population-scale studies with BLAST+ typically require custom scripting in Bash, Python, or similar languages to automate iterative database creation and query execution. For many researchers, particularly those without computational training, this requirement poses a significant barrier. Additionally, standard BLAST+ workflows generate a separate output file for each query‒database comparison, making downstream organization, parsing, and interpretation cumbersome, even for moderately sized datasets.

To address these challenges, we developed SeqForge, a scalable command-line toolkit that streamlines large-scale genomic exploration. SeqForge automates key components of the BLAST+ workflow, simplifies the management of population-scale datasets, and adds features for amino acid motif discovery, sequence extraction, and organized results curation. By bridging the gap between existing pipelines and user-friendly exploratory search tools, SeqForge lowers the computational barrier for meta/genomic research and accelerates the interrogation of large genomic collections.

Implementation

SeqForge is implemented in Python (>=3.10) and structured as a modular command-line toolkit with a centralized control unit. The package contains two core modules (Genome Search and Sequence Investigation) and a Utilities module (Figure 1). Each functional unit (e.g., makedb, query, extract) is implemented as a subcommand executed via a unified entry point (seqforge <subcommand> <args>), enabling access to all functions through a single interface. To maintain stability, each module is dynamically imported at runtime, so that failure in one component does not affect the operability of others. A built-in --module-health diagnostic command allows users to assess the availability of individual modules prior to execution. Internally, each module is structured around clear separation of responsibilities: input validation, multiprocessing logic, and output formatting. This modular design allows users to integrate individual components into custom workflows or use the complete SeqForge suite for full-scale data mining and exploration.
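The fault-isolation idea described above (importing each module only when needed, plus a health check) can be illustrated with a minimal sketch. This is not SeqForge's actual source: the `SUBCOMMANDS` table and both function names are hypothetical, and standard-library modules stand in for real components so the example runs anywhere.

```python
import importlib

# Hypothetical mapping of subcommand -> module path. Standard-library modules
# stand in for real components; the last entry simulates a broken module.
SUBCOMMANDS = {
    "makedb": "json",
    "query": "csv",
    "extract": "no_such_module_xyz",
}

def load_module(name):
    """Import a subcommand's module only when that subcommand is requested."""
    return importlib.import_module(SUBCOMMANDS[name])

def module_health():
    """Return {subcommand: status} without aborting on import failures."""
    report = {}
    for name in SUBCOMMANDS:
        try:
            load_module(name)
            report[name] = "ok"
        except Exception as exc:  # a broken module must not affect the others
            report[name] = f"unavailable ({type(exc).__name__})"
    return report

if __name__ == "__main__":
    for name, status in module_health().items():
        print(f"{name}: {status}")
```

Because imports happen inside `load_module`, a missing dependency in one component surfaces only when that component is requested or when the health check is run.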

Figure 1. Overview of the SeqForge platform. The platform is organized into three modules: the Genome Search module includes database creation (makedb), database query (query), and motif mining; the Sequence Investigation module includes extraction of aligned sequences (extract) and extraction of full contigs (extract-contig); the Utilities module includes metadata mining (search), FASTA file splitting (split-fasta), and calculation of assembly metrics (fasta-metrics). All modules are accessible via a unified command-line interface.

A unified file-handling system supports a wide range of FASTA input formats across all modules. This system allows for the submission of 1) individual FASTA files (including gzip-compressed formats), 2) directories containing multiple FASTA files, or 3) compressed archives (.zip, .tar, .tar.gz, .tgz) containing multiple FASTA files. Upon execution, SeqForge recursively enumerates and validates all FASTA files. For compressed archives, contents are extracted to a temporary working directory, which is automatically removed at the end of analysis unless the user specifies --keep-temp-files.
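A minimal sketch of this kind of input resolution is shown below, using only the Python standard library. The function names, the extension set, and the decision to detect archives by file name are assumptions made for illustration; SeqForge's own validation logic may differ.

```python
import shutil
import tempfile
from pathlib import Path

# Extensions named in the text; treating them as the full accepted set is an assumption.
FASTA_SUFFIXES = {".fa", ".fas", ".ffn", ".fna", ".fasta", ".faa"}
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz")

def is_fasta(path):
    """True for a FASTA extension, optionally followed by .gz."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    return bool(suffixes) and suffixes[-1] in FASTA_SUFFIXES

def collect_fasta(source, workdir):
    """Resolve a file, directory, or archive into a sorted list of FASTA paths."""
    source = Path(source)
    if source.is_dir():
        root = source
    elif source.name.endswith(ARCHIVE_SUFFIXES):
        shutil.unpack_archive(str(source), str(workdir))  # extract into the work dir
        root = Path(workdir)
    else:
        return [source] if is_fasta(source) else []
    # Recursive enumeration of every FASTA file below the root.
    return sorted(p for p in root.rglob("*") if p.is_file() and is_fasta(p))

def list_inputs(source):
    """List FASTA file names; the temporary directory is removed on exit."""
    with tempfile.TemporaryDirectory() as work:
        return [p.name for p in collect_fasta(source, work)]
```

Wrapping extraction in `tempfile.TemporaryDirectory` gives the default clean-up behavior described above; keeping the extracted files would simply mean extracting to a persistent directory instead.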

SeqForge was designed to streamline analysis of large meta/genomic datasets. For scalable performance, it uses Python’s concurrent.futures module for parallel execution. All computationally intensive modules (makedb, query, and extract) support multi-core processing via a --threads flag, allowing users to fully leverage available computational resources on personal laptops or high-performance computing (HPC) systems. SeqForge depends on BLAST+ 2.16.0, Biopython 1.85, pandas 2.3.1, matplotlib 3.10.5, seaborn 0.13.2, logomaker 0.8.7 [17], and numpy 2.3.2. All Python dependencies can be installed via pip, while BLAST+ is available through Conda or the NCBI FTP site. SeqForge is platform-independent and has been tested on Linux Ubuntu (v24.04) and macOS (v12.7.6) systems. All modules operate via the command line and are compatible with both personal workstations and HPC environments.
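The task-level parallelism described here can be sketched with concurrent.futures as follows. The text does not state which executor class SeqForge uses; a thread pool is assumed here purely to keep the example portable, and `run_tasks` and `gc_percent` are hypothetical stand-ins for real work such as a per-file BLAST+ call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_tasks(task, items, threads=4):
    """Run task(item) for every item on a pool of worker threads.

    Returns {item: result}. Exceptions are stored per item so that one failed
    input does not discard the results of the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(task, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                results[item] = exc
    return results

def gc_percent(seq):
    """Toy workload: GC content of a sequence, as a percentage."""
    return 100.0 * sum(seq.count(base) for base in "GC") / len(seq)

if __name__ == "__main__":
    print(run_tasks(gc_percent, ["GGCC", "ATAT", "ACGT"], threads=2))
```

For CPU-bound pure-Python work, `ProcessPoolExecutor` offers the same interface; when the heavy lifting happens in an external program, threads that wait on subprocesses are often sufficient.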

The core SeqForge workflow begins with creating a library of BLAST databases. Input file names are first scanned for special characters (e.g., non-extension periods, spaces, and parentheses) that are incompatible with the BLAST+ architecture. If non-compliant characters are detected, SeqForge prints an error message and exits. Users can sanitize file names in two ways (excluding compressed archive submission): add the --sanitize flag to the makedb command string to remove special characters in-place and proceed with database creation, or run the ‘seqforge sanitize’ utility, specifying either in-place sanitization or, with the --sanitize-outdir <directory> option, copying files to a new directory and renaming them there. Database type (nucl or prot) is determined automatically during database creation. Files with canonical nucleotide FASTA extensions (.fa, .fas, .ffn, .fna, .fasta) are treated as nucleotide, while amino acid FASTA files (.faa) are treated as protein.
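The two checks in this step, file-name screening and extension-based database typing, can be sketched as below. The exact character rules are an assumption: here every run of characters other than letters, digits, underscores, and hyphens in the file stem is replaced with a single underscore, which may not match what --sanitize actually does. Gzip-compressed names are not handled in this sketch.

```python
import re
from pathlib import Path

NUCL_EXT = {".fa", ".fas", ".ffn", ".fna", ".fasta"}
PROT_EXT = {".faa"}

def db_type(filename):
    """Infer the BLAST database type ('nucl' or 'prot') from the extension."""
    ext = Path(filename).suffix.lower()
    if ext in NUCL_EXT:
        return "nucl"
    if ext in PROT_EXT:
        return "prot"
    raise ValueError(f"unrecognized FASTA extension: {filename}")

def sanitize_name(filename):
    """Replace problematic characters in the stem; keep the extension.

    Assumed rule (illustrative only): anything outside [A-Za-z0-9_-] in the
    stem, including non-extension periods, becomes an underscore.
    """
    path = Path(filename)
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", path.stem).strip("_")
    return stem + path.suffix

def needs_sanitizing(filename):
    """True when the name would be changed by sanitize_name."""
    return sanitize_name(filename) != Path(filename).name

if __name__ == "__main__":
    name = "E. coli (K-12).v2.fna"
    print(needs_sanitizing(name), sanitize_name(name), db_type(name))
```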

Next, users may proceed to the query module, which allows users to run any number of queries against any number of databases. First, database type is detected by the query module and execution of translated nucleotide (tblastn) or protein (blastp) query is automatically performed (nucleotide database + amino acid query = tblastn, protein database + amino acid query = blastp). Users can run nucleotide-nucleotide (blastn) searches by adding the --nucleotide-query flag. BLAST+ integration is achieved through subprocess calls to blastn, blastp, and tblastn, ensuring full compatibility with NCBI’s established alignment tools. SeqForge does not currently bundle BLAST+ binaries but is compatible with standard BLAST+ installations from Bioconda or from the NCBI FTP site.
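The program-selection rule in this paragraph amounts to a small decision table, sketched below. Both function names are hypothetical. The BLAST+ options used (-query, -db, -outfmt) are standard, but the assembled command is only an illustration of what a subprocess call could look like, not SeqForge's exact invocation, and nothing is executed here.

```python
# Tabular output fields listed in the text for the results tables.
OUTFMT_FIELDS = ("qseqid sseqid pident evalue length mismatch gapopen "
                 "qstart qend sstart send bitscore qlen sframe")

def choose_program(db_type, nucleotide_query=False):
    """Map database type and query type to a BLAST+ program name."""
    if db_type == "nucl":
        return "blastn" if nucleotide_query else "tblastn"
    if db_type == "prot" and not nucleotide_query:
        return "blastp"
    # Nucleotide query vs. protein database is not among the combinations
    # described in the text, so it is rejected in this sketch.
    raise ValueError(f"unsupported combination: db_type={db_type!r}, "
                     f"nucleotide_query={nucleotide_query}")

def build_command(query_file, db_path, db_type, nucleotide_query=False):
    """Assemble an argument list that could be passed to subprocess.run."""
    program = choose_program(db_type, nucleotide_query)
    return [program, "-query", query_file, "-db", db_path,
            "-outfmt", "6 " + OUTFMT_FIELDS]

if __name__ == "__main__":
    print(build_command("rebH.faa", "dbs/genome1", "nucl"))
```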

By default, the query pipeline uses internally curated thresholds of 90% identity, 75% query coverage, and an E-value of 1 × 10⁻⁵. These values can be modified via command-line arguments. The standard output includes two concatenated results tables: all_results.csv, which contains all BLAST hits regardless of inclusion criteria, and all_filtered_results.csv, which contains only hits meeting the percent identity, query coverage, and E-value thresholds. These tables include all standard BLAST+ fields, including qseqid, sseqid, pident, evalue, length, mismatch, gapopen, qstart, qend, sstart, send, bitscore, qlen, and sframe, along with calculated query coverage. SeqForge also generates standard BLAST+ alignment files for each query unless the user specifies --no-alignment-files. To return the strongest match per query per genome, the --report-strongest-matches flag can be employed, producing filtered_results.csv. Parallelization is implemented by assigning each query file as a separate task (i.e., one query file = one task). For optimal performance, each query file should contain only one sequence. If a query file contains multiple sequences, the ‘seqforge split-fasta’ utility can split it into individual files. Multi-FASTA files can be used directly in the query pipeline; however, they are processed as a single task, which increases runtime.
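The filtering logic described above can be sketched in plain Python. Several details are assumptions rather than facts from the text: query coverage is computed as aligned query span divided by query length, the "strongest" hit is taken to be the one with the highest bitscore, and the `genome` key is a hypothetical column naming the database a hit came from.

```python
def query_coverage(hit):
    """Percent of the query covered by the alignment (assumed formula)."""
    return 100.0 * (abs(hit["qend"] - hit["qstart"]) + 1) / hit["qlen"]

def passes(hit, min_pident=90.0, min_qcov=75.0, max_evalue=1e-5):
    """Apply the default inclusion thresholds quoted in the text."""
    return (hit["pident"] >= min_pident
            and query_coverage(hit) >= min_qcov
            and hit["evalue"] <= max_evalue)

def strongest_per_query_genome(hits):
    """Keep one hit per (query, genome) pair: the highest bitscore (assumed)."""
    best = {}
    for hit in hits:
        key = (hit["qseqid"], hit["genome"])
        if key not in best or hit["bitscore"] > best[key]["bitscore"]:
            best[key] = hit
    return list(best.values())

# Made-up hits for illustration only.
EXAMPLE_HITS = [
    {"qseqid": "q1", "genome": "g1", "pident": 98.0, "evalue": 1e-50,
     "qstart": 1, "qend": 95, "qlen": 100, "bitscore": 180.0},
    {"qseqid": "q1", "genome": "g1", "pident": 99.0, "evalue": 1e-60,
     "qstart": 1, "qend": 100, "qlen": 100, "bitscore": 200.0},
    {"qseqid": "q1", "genome": "g2", "pident": 85.0, "evalue": 1e-40,
     "qstart": 1, "qend": 100, "qlen": 100, "bitscore": 150.0},
    {"qseqid": "q2", "genome": "g1", "pident": 95.0, "evalue": 1e-20,
     "qstart": 10, "qend": 59, "qlen": 100, "bitscore": 90.0},
]

if __name__ == "__main__":
    filtered = [h for h in EXAMPLE_HITS if passes(h)]
    for hit in strongest_per_query_genome(filtered):
        print(hit["qseqid"], hit["genome"], hit["bitscore"])
```

In this toy data set, the third hit fails the identity threshold and the fourth fails query coverage, so only the two q1/g1 hits pass and the higher-scoring one is reported.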

The query module includes a motif mining function that parses blastp results for specific or loosely defined amino acid motifs via the --motif argument. Users may submit a single motif or a space-separated list of motifs using the standard single-letter amino acid abbreviations, with ‘X’ serving as a wild card (e.g., WXWXIP). This option executes a modified regex search across all BLAST hits, regardless of inclusion thresholds, enabling detection within heterogeneous gene families. Motifs can be linked to specific query files by appending the query base name (file name without extension) in braces within the command string (e.g., for ‘query_file.faa’: --motif <motif>{query_file}), which restricts parsing to results from that file. To facilitate downstream analysis, motif matches can be exported in FASTA format as either the full gene/domain alignment (--motif-fasta-out) or the specific motif string (--motif-fasta-out --motif-only). When > 1 motif is specified, SeqForge generates one FASTA file per motif.
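A minimal sketch of how such a motif string could be turned into a regular expression is given below. It is an illustration rather than SeqForge's implementation: the function names are hypothetical, ‘X’ is mapped to any single uppercase letter, and a lookahead is used so that overlapping matches are reported, which is a design choice made for this example.

```python
import re

def parse_motif_arg(arg):
    """Split 'GHXXGE{AT}' into ('GHXXGE', 'AT'); the tag is None when absent."""
    match = re.fullmatch(r"([A-Za-z]+)(?:\{([^{}]+)\})?", arg)
    if match is None:
        raise ValueError(f"malformed motif argument: {arg!r}")
    return match.group(1).upper(), match.group(2)

def motif_to_pattern(motif):
    """Compile a motif, treating 'X' as a single-residue wild card."""
    body = "".join("[A-Z]" if c == "X" else re.escape(c) for c in motif.upper())
    return re.compile("(?=(" + body + "))")  # lookahead keeps overlapping hits

def find_motif(motif, sequence):
    """Return (1-based start, matched residues) for every occurrence."""
    pattern = motif_to_pattern(motif)
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence.upper())]

if __name__ == "__main__":
    print(parse_motif_arg("GHXXGE{AT}"))
    print(find_motif("WXWXIP", "MKWAWGIPLLWSWTIP"))
```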

Query results can be visualized by adding the --visualize flag to the query command string, with the standard output being a PNG at 300 dpi (output to PDF via --visualize --pdf). For queries against > 1 database, a heatmap will be generated (Figure 2A), with the color of each cell reflecting individual percent identity values per query per genome. For queries including motif searches, a sequence logo will additionally be generated using Logomaker [17] if any hits were returned (Figure 2B). Similar to the FASTA out option, for motif queries of > 1 unique motif string, a sequence logo will be generated for each search string.

Figure 2. Example visualization output of the SeqForge query pipeline. A Heatmap generated using the --visualize option with 20 protein queries against 50 *E. coli* nucleotide databases. B Sequence logo generated using the --motif WXWXIP --visualize command. Sequence logos were created with Logomaker [17]. Figures were minimally edited in Adobe Illustrator for clarity.

For downstream sequence analysis and annotation, SeqForge provides two options for extracting sequences identified via the query module: ‘seqforge extract’ and ‘seqforge extract-contig’. Users pass a query results file (all_results.csv, all_filtered_results.csv, or filtered_results.csv) to the extract module, along with the FASTA directory used to generate the BLAST database(s). With ‘seqforge extract’, aligned sequences are extracted to FASTA (multi-FASTA for > 1 hit) based on the alignment start and end values. Upstream and/or downstream base pair padding may be specified to extract the aligned sequence plus n neighboring base pairs for more in-depth investigation of genomic context. Furthermore, as metagenome assemblies are often too large to open in their entirety using a standard genome browser, SeqForge offers the extract-contig option, whereby entire contigs may be extracted based on a BLAST hit, facilitating manual sequence investigation using a genome browser.
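Padded extraction from BLAST subject coordinates can be sketched as follows. The conventions here are assumptions for illustration: coordinates are 1-based and inclusive, a hit with sstart > send lies on the reverse strand, "upstream" is interpreted relative to the hit's own strand, and the window is clipped at the contig ends. SeqForge's handling may differ.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse-complement a DNA string (other characters pass through)."""
    return seq.translate(COMPLEMENT)[::-1]

def padded_window(sstart, send, contig_len, upstream=0, downstream=0):
    """Return (start, end, strand) on the forward strand, 1-based inclusive."""
    if sstart <= send:
        start, end, strand = sstart - upstream, send + downstream, "+"
    else:  # reverse-strand hit: upstream lies at higher coordinates
        start, end, strand = send - downstream, sstart + upstream, "-"
    return max(1, start), min(contig_len, end), strand

def extract_hit(contig, sstart, send, upstream=0, downstream=0):
    """Slice the padded region and orient it to match the hit's strand."""
    start, end, strand = padded_window(sstart, send, len(contig),
                                       upstream, downstream)
    region = contig[start - 1:end]
    return region if strand == "+" else reverse_complement(region)

if __name__ == "__main__":
    contig = "AAACCCGGGTTT"
    print(extract_hit(contig, 4, 6, upstream=2, downstream=1))  # forward-strand hit
    print(extract_hit(contig, 9, 7, upstream=2, downstream=1))  # reverse-strand hit
```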

SeqForge also includes lightweight utility modules for metadata compilation and assembly statistics. The first of these is ‘seqforge search’, which extracts metadata from NCBI-generated GenBank and/or JSON files, including accession number, organism, host, isolation source, and collection region. Users can extract all available metadata (--all) or select specific fields (--fields <fields>). Output formats include CSV, TSV, and JSON. Missing metadata are reported as ‘not_specified’. This module supports both single-file and batch submission. Second, SeqForge offers the fasta-metrics tool, which quickly calculates common assembly statistics, including genome size, number of contigs, contig length distribution, longest/shortest contig, GC %, N count, N50/N90, L50/L90, and the lengths of each contig as a ‘|’ separated list.
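For readers less familiar with the N50/L50 family of statistics, the standard definition can be written in a few lines. This is a generic sketch of that definition, not the fasta-metrics source; the function name is hypothetical, and integer arithmetic is used to avoid floating-point edge cases at the threshold.

```python
def nx_lx(lengths, percent=50):
    """Return (Nx, Lx) for a list of contig lengths.

    Nx is the length of the contig at which the running total of contigs,
    sorted longest first, reaches percent% of the assembly size; Lx is the
    number of contigs needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running * 100 >= percent * total:
            return length, count
    raise ValueError("no contigs supplied")

if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20, 10]  # made-up contig lengths
    print("N50/L50:", nx_lx(contigs, 50))
    print("N90/L90:", nx_lx(contigs, 90))
```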

Example Escherichia coli (Table S1), Streptomyces (Table S2), and Penicillium (Table S3) genomes used for benchmarking were downloaded using NCBI Datasets [18]. Only complete genomes were included using the --assembly-level complete Datasets argument. For E. coli, only assemblies released after March 1, 2024, were selected. Query sequences for the E. coli cohorts were generated by predicting coding sequences (CDS) from E. coli strain Ec1119 (Accession: GCA_008364465.2) with Prodigal [19], producing a multi-FASTA amino acid file. This file was subsequently split into individual FASTA files using the ‘seqforge split-fasta’ utility, resulting in ~4,400 predicted CDSs. From this population, cohorts of 20 or 50 query sequences were randomly selected using an in-house Python script to serve as query files. The tryptophan halogenase RebH (Accession: Q8KHZ8) was downloaded from UniProt and used as the query to generate Figure 2. The Penicillium-trained CDS prediction model was generated using BRAKER2 [20], BUSCO (https://github.com/metashot/busco), RepeatModeler [21], and AUGUSTUS [22] (see Extended Methods for training protocol).

Results

Benchmarking and Performance

SeqForge is designed to expedite genome mining across datasets of any size, supporting batch submissions while maintaining flexibility and ease of interpreting large results datasets. To the best of our knowledge, there are no publicly available tools directly comparable to SeqForge. While some similar utility functions exist in QUAST [23] and SeqKit2 [2], these platforms focus on assembly and read quality analysis and sequence conversion and manipulation, rather than large-scale genome mining and analysis.

Benchmarking was performed for the makedb, query, extract/extract-contig modules to assess runtime scaling, memory requirements, and parallelization efficiency. Tests were performed on two representative datasets: 1) a “laptop-scale” dataset comprising 500 E. coli genomes and 20 gene queries, and 2) a “population-scale” dataset of 2,157 E. coli genomes with 50 gene queries, representing a typical workload for HPC environments. Tests used varying thread counts: 1, 8, and 16 threads for the 500-genome dataset; and 8, 16, 32, and 48 threads for the 2,157-genome dataset. Resource utilization was measured with ‘/usr/bin/time -v’, which reports wall clock runtime, average CPU usage, and maximum resident set size (RSS). Maximum RSS here reflects the peak memory used by any individual worker process, not the sum across all threads.

Across all modules, SeqForge demonstrated efficient scaling with increasing thread counts, with near-linear reductions in wall clock runtime for computationally intensive steps (makedb, query, and extract) (Table 1). For example, query execution on 500 genomes with 20 queries decreased from 32 minutes 33 seconds (1 thread) to 2 minutes using 16 threads. With the 2,157-genome, 50-query dataset, runtime dropped from 57 minutes 57 seconds (8 threads) to 15 minutes 15 seconds using 48 threads. Memory usage was modest for makedb (≤ 82 MB peak RSS) and moderate for query (≤ 901 MB peak RSS). Sequence extraction showed slightly diminishing returns beyond 16 threads, likely due to partial I/O-bound performance, but remained efficient, with extraction of > 100,000 sequences completing in just over 3 minutes using 8 threads and just over 1 minute at 48 threads (Table 1).
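Speedup and parallel efficiency can be derived directly from the m:s runtimes in Table 1. The short sketch below shows the arithmetic for the two query examples; the helper names are hypothetical, and efficiency is defined here as speedup divided by the ratio of thread counts, so a value near 1 corresponds to linear scaling relative to the baseline run.

```python
def to_seconds(mmss):
    """Convert an 'm:ss' runtime string, as used in Table 1, to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

def scaling(base_time, base_threads, time, threads):
    """Return (speedup, efficiency) relative to the baseline run."""
    speedup = to_seconds(base_time) / to_seconds(time)
    efficiency = speedup / (threads / base_threads)
    return speedup, efficiency

if __name__ == "__main__":
    # Query module runtimes taken from Table 1.
    for label, args in {
        "500 genomes, 1 -> 16 threads": ("32:33", 1, "2:00", 16),
        "2,157 genomes, 8 -> 48 threads": ("57:57", 8, "15:15", 48),
    }.items():
        speedup, efficiency = scaling(*args)
        print(f"{label}: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```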

Table 1.

Multiprocessing performance and resource usage of SeqForge modules. Wall clock runtimes and peak memory usage (maximum resident size, RSS, in megabytes) for makedb, query, extract, and extract-contig modules. These times are reported as minutes:seconds (m:s). RSS reflects the maximum memory used by any single worker process, not the sum across threads.

Module	# Genomes	# Queries	# Sequences	# Threads	Runtime (m:s)	Peak RSS (MB)	Avg. CPU (%)
Makedb	500	N/A	N/A	1	0:47	79	107
				8	0:06	79	808
				16	0:03	80	1524
	2157	N/A	N/A	8	0:23	81	786
				16	0:12	81	1537
				32	0:07	81	2974
				48	0:05	82	4411
Query	500	20	N/A	1	32:33	115	100
				8	3:55	114	790
				16	2:00	113	1558
	2157	50	N/A	8	57:57	901	667
				16	22:07	885	898
				32	16:37	885	1253
				48	15:15	885	1619
Query + Motif	150	1	N/A	1	0:41	1565	110
				8	0:18	1565	315
				16	0:34	1569	261
		2	N/A	1	5:09	1597	101
				8	1:07	1578	543
				16	0:48	1578	793
		4	N/A	1	14:38	1661	100
				8	2:28	1601	698
				16	1:33	1597	1136
Extract	500	20	10,003	1	2:27	1225	104
				8	0:21	1117	815
				16	0:11	1063	1583
	2157	50	100,524	8	3:09	1138	808
				16	1:49	1134	1550
				32	1:15	1129	2674
				48	1:14	1094	2908
Extract-Contig	500	20	500	1	0:17	9947	124
				8	0:18	10156	126
				16	0:18	9769	127
	2157	50	2145	8	0:59	33734	115
				16	1:02	27705	114
				32	0:56	14448	116
				48	0:57	14773	117

In contrast, extract-contig showed minimal speedup with more threads, consistent with an I/O-bound workload dominated by sequential file reads and writes. Peak RSS reached up to 33 GB when processing the 2,157-genome dataset, due to loading and writing entire contigs, though runtime remained under 1 minute even at large scale. Because most genomes in the benchmark dataset were complete assemblies with < 8 contigs, many contig extractions involved extracting whole chromosomal sequences. As a result, contig extraction was likely dominated by disk I/O rather than CPU processing and exhibited little benefit from parallelization compared to sequence extraction, which operates on smaller regions. Overall, SeqForge maintains modest per-thread memory usage, scales efficiently for computationally intensive modules, and delivers fast performance even for I/O-heavy workflows.

Motif Mining

To demonstrate SeqForge’s motif mining capabilities, we conducted a targeted search for canonical catalytic and stereochemistry-associated motifs within a well-characterized biosynthetic gene cluster (BGC). Specifically, we mined the erythromycin BGC (MIBiG accession: BGC0000055) for domain-level motifs associated with acyltransferase (AT), ketosynthase (KS), and ketoreductase (KR) activity across the polyketide synthase (PKS) modules. The erythromycin BGC consists of a loading module plus six chain-elongation modules containing seven AT domains (substrate specificity), six KS domains (condensation activity), six KR domains (reduction and stereochemistry determination), one dehydratase (DH) domain, and one enoyl reductase (ER) domain [24, 25]. While the ER domain plays a role in the final stereochemistry of the C-2 position (particularly the Y/V52’ residue [25, 26]), this benchmark focused solely on AT, KS, and KR motifs responsible for upstream chain elongation and modification of the C-2 and β-keto positions of the polyketide backbone. Motif search strings were defined using common PKS domain motifs for malonyl and methylmalonyl substrate selectivity, KS catalytic activity, and KR-associated stereochemistry determination [25, 27, 28]. The motif analysis was executed using the following command string: ‘--motif RVXXXQ{AT} GHXXGE{AT} YXXH{AT} HXSH{AT} HXFH{AT} TAXSSX{KS} HXAXXLDDX{KR} SSXXXXXXXXXXXXYXX{KR}’. SeqForge returned results as a structured table for all PKS modules (Table S4) and successfully detected the expected AT, KS, and KR motifs across all relevant modules (Table 2, Figures S1–S3). Importantly, motif parsing occurs within the context of the full alignment sequence, enabling researchers to extract matches directly using the --motif-fasta-out argument for downstream analysis. These results align with previously characterized enzymatic roles for each biosynthetic module, validating the utility of motif mining as a complementary tool for BGC functional annotation.

Table 2.

Domain presence, substrate specificity, activity prediction and extracted amino acid motifs from the erythromycin biosynthetic gene cluster (BGC0000055). Domain composition, predicted substrate specificity, and detected motifs for acyltransferase (AT), ketosynthase (KS), and ketoreductase (KR) domains identified using SeqForge’s --motif function with domain-linked search strings.

Module	Domains Present	AT	AT	AT	Substrate Prediction	KS	Activity Prediction	KR	KR	Stereochemistry Prediction	Characterized Stereochemistry
Load	AT	RVEVVQ	GHSIGE	N/A	Propionyl-CoA	N/A	N/A	N/A	N/A	N/A	N/A
1	KS, AT, KR	RVDVVQ	GHSQGE	YASH	Methylmalonyl-CoA	TACSSS	Active	HAAATLDDG	SSFASAFGAPGLGGYAP	B2: 2S, 3R	B2: 2S, 3R
2	KS, AT, KR	RVDVVQ	GHSQGE	YASH	Methylmalonyl-CoA	TACSSS	Active	N/A	SSGAGVWGSARQGAYAA	A1: 2R, 3S	A1: 2R, 3S
3	KS, AT, KR^*	RVDVVQ	GHSQGE	YASH	Methylmalonyl-CoA	TACSSS	Active	N/A	SSVAGIWGGAGMAAYAA	A1: 2R, 3S or C1: 2R or C2: 2S	C2: 2S
4	KS, AT, DH, ER, KR	RVDVLQ	GHSQGE	YASH	Methylmalonyl-CoA	TACSSS	Active	N/A	SSAASVLAGPGQGVYAA	C1: 2R or C2: 2S	C2: 2S
5	KS, AT, KR	RVDVVQ	GHSQGE	YASH	Methylmalonyl-CoA	TACSSG	Active	N/A	SSNAGVWGSPGLASYAA	A1: 2R, 3S or C1: 2R or C2: 2S	A1: 2R, 3S
6	KS, AT, KR	RVDVVQ	GHSQGE	YASH	Methylmalonyl-CoA	TACSSS	Active	N/A	SSGAGVWGSANLGAYSA	A1: 2R, 3S	A1: 2R, 3S

KR* indicates an inactive ketoreductase domain [25]. N/A: no motif hits detected. Key residues are in bold.

Leveraging SeqForge to Detect Potential Copper-Dependent Halogenases Across Penicillium

To demonstrate the broader utility of SeqForge, we applied the full pipeline to an uncharacterized fungal population. As a case study, we selected ApnU, a recently characterized copper-dependent halogenase from the Penicillium oxalicum 114–2 atpenin B BGC (MIBiG accession: BGC0002067) (Table S5) [29]. Unlike Fe(II)/α-ketoglutarate-dependent halogenases, ApnU catalyzes mono- or iterative chlorination of unactivated C(sp³)–H bonds via a unique pair of HXXHC motifs responsible for coordination of Cu(II), and it can also incorporate other halides or pseudohalides [29]. Given its mechanistic novelty, ApnU presented a compelling target for mining functionally analogous or homologous BGCs across the Penicillium genus. A dataset of 549 publicly available Penicillium genomes was curated from NCBI Datasets (Table S3) and converted to BLAST databases using ‘seqforge makedb’. A translated nucleotide search was performed with ‘seqforge query’, using the ApnU amino acid sequence as input and a relaxed inclusion threshold of ≥ 80% identity and ≥ 70% query coverage. Fourteen hits met these criteria and were retained for downstream analysis (Table S6).

Genomic context was assessed by extracting each hit along with 50 kb upstream and 30 kb downstream flanking sequences using ‘seqforge extract’. Coding sequences were predicted with AUGUSTUS, trained on the high-quality P. chrysogenum IBT 35668 genome (GCA_028827035.1), and validated against the canonical atpenin B cluster. Protein databases from these predictions were analyzed using SeqForge’s motif mining utility, which identified exactly two HXXHC motifs in each case, consistent with the ApnU copper-binding residues. To confirm the accuracy of motif detection, hits were extracted and aligned with the canonical ApnU sequence using MUSCLE [30]. Both HXXHC motifs were confirmed in all hits (Figure 3A, Figure S4), supporting the conservation of the copper-binding residues.

Figure 3. Atpenin B-like biosynthetic gene clusters (BGCs) identified in publicly available *Penicillium* genomes using the full SeqForge analytical pipeline. A The first (i) and second (ii) HXXHC copper-binding motifs required for catalytic activity of the ApnU copper-dependent halogenase. Sequence logos represent predicted ApnU homologs. B Synteny map of atpenin B-like BGCs identified in publicly available *Penicillium* genomes. orf1 and orf2 represent genes within the atpenin B BGC that have not yet been formally characterized [32].

To confirm that these genomic regions encoded atpenin B-like clusters, each sequence and its AUGUSTUS-generated GFF3 file were submitted to antiSMASH’s fungal pipeline [12]. All 14 regions showed medium-to-high similarity to the characterized atpenin B gene cluster (Figure 3B). Clinker alignments via CAGECAT [31] revealed strong synteny and high gene identity across all hits. Collectively, this analysis demonstrates SeqForge’s end-to-end capability for population-scale homology screening, flanking region extraction, motif detection, and downstream BGC characterization, enabling rapid discovery of biosynthetic diversity from large genome datasets.

Utilities

SeqForge includes several auxiliary utilities designed to streamline genome mining workflows. The ‘split-fasta’ command splits large multi-FASTA files into either single-record files or user-defined batches of n sequences. The ‘search’ module extracts genomic metadata (e.g., accession, isolation source, host, and related descriptors) from GenBank and JSON files and was used here to compile metadata for the test populations (Tables S1–S3). For rapid quality checks, the lightweight ‘fasta-metrics’ module reports basic assembly statistics and was benchmarked against QUAST on three E. coli and three Streptomyces genomes (Table S7).

Many annotation and CDS-prediction pipelines assign generic identifiers (e.g., “hypothetical protein”) or placeholder headers (e.g., from AUGUSTUS), which can result in duplicate sequence IDs within or across files. Such duplications can disrupt downstream parsing and interfere with SeqForge’s motif mining workflow. To address this, SeqForge provides a ‘unique-header’ utility that ensures collision-free FASTA identifiers across inputs. This tool preserves the original label while appending the source filename and a short alphanumeric tag. For example: >hypothetical (from genome1a.faa) becomes >hypothetical_genome1a_54uMe. This approach maintains identifiability across large, multi-file datasets and reduces the risk of bookkeeping or programmatic errors during downstream analyses.
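The renaming scheme in the example above can be sketched as follows. This is an illustration under stated assumptions, not the ‘unique-header’ source: the function name, the five-character tag length, the use of the first whitespace-delimited token as the label, and the retry-on-collision loop are all choices made for this example.

```python
import secrets
import string
from pathlib import Path

ALPHABET = string.ascii_letters + string.digits

def unique_header(header, filename, seen, tag_len=5):
    """Build '>label_stem_tag' and guarantee it is not already in `seen`."""
    stem = Path(filename).stem
    parts = header.lstrip(">").split()
    label = parts[0] if parts else "seq"  # fallback label is an assumption
    while True:
        tag = "".join(secrets.choice(ALPHABET) for _ in range(tag_len))
        new_id = f"{label}_{stem}_{tag}"
        if new_id not in seen:  # retry on the rare random collision
            seen.add(new_id)
            return ">" + new_id

if __name__ == "__main__":
    seen = set()
    for _ in range(3):
        print(unique_header(">hypothetical", "genome1a.faa", seen))
```

Tracking issued identifiers in a shared set makes uniqueness explicit across every file processed in one run, rather than relying on the random tag alone.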

Discussion

SeqForge addresses a practical gap between small BLAST+ searches and population-scale mining by providing a modular command-line interface that automates database construction, high-throughput querying, motif discovery, and downstream extraction/visualization. In contrast to pan-genome pipelines, HMM-centric profilers, or general QC/formatting toolkits, SeqForge preserves the interpretability and ubiquity of NCBI BLAST+ while adding scalability, uniform outputs, and utilities that smooth common stress points (e.g., filename sanitization, header de-duplication, and lightweight assembly metrics). This design lowers the barrier for exploratory meta/genomic analyses.

While SeqForge addresses a clear gap in currently available software, it does have limitations. Results remain sensitive to BLAST thresholds and database quality. Permissive settings can inflate false positives for promiscuous families, whereas overly stringent cutoffs may exclude divergent homologs. Dynamic querying, such as using multiple representative sequences or varying inclusion thresholds across runs, can help balance these trade-offs and strengthen result datasets. Similarly, the regex-based motif search, while fast, transparent, and well-suited for highly conserved motifs, may overlook gapped or degenerate variants that could be more effectively captured using profiles or HMMs.

Conclusion

The field of microbial genomics has made great strides in developing computational platforms that enrich microbial discovery, characterization, annotation, and functional investigation. SeqForge builds on this progress by providing an accessible, scalable solution for performing multi-query BLAST searches and motif mining across large genomic datasets. By integrating these capabilities into a streamlined workflow, SeqForge reduces the need for custom scripting, minimizes manual curation, and accelerates the identification of conserved functional motifs. Its modular design and support for multi-core execution make it equally suitable for small-scale analyses on personal computers and high-throughput screening on HPC clusters. SeqForge is freely available and adaptable to a wide range of genomic mining applications, helping accelerate the pace of discovery from the ever-growing wealth of publicly available genome datasets.

Supplementary Material

Supplement 1

media-1.pdf (745.1 KB, pdf)

Acknowledgements:

We thank Mathew Stein for valuable discussions on the design of SeqForge and for reviewing the draft manuscript.

Funding:

This work was supported by a 3i graduate research fellowship to ERBH; and the University of Utah Research Foundation, the Ben and Iris Margolis Foundation, and the National Institutes of Health [1R01AI155694] to JMW.

Abbreviations

AT: acyltransferase
BGC: biosynthetic gene cluster
CDS: coding sequence
DH: dehydratase
ER: enoyl reductase
HMM: hidden Markov model
HPC: high-performance computing
KR: ketoreductase
KS: ketosynthase
PKS: polyketide synthase
RSS: resident set size

Footnotes

Availability and Requirements

Project name: SeqForge

Home page: https://github.com/ERBringHorvath/SeqForge

Operating system: Platform independent

Programming language: Python

License: MIT

Any restrictions to use by non-academics: None.

Ethics approval and consent to participate: Not applicable.

Consent for publication: Not applicable.

Competing interests: The authors declare no conflicts of interest.

Availability of data and materials:

SeqForge is available through GitHub at https://github.com/ERBringHorvath/SeqForge. Please read the SeqForge documentation for information on installation and usage.

References

1. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. The silva ribosomal rna gene database project: Improved data processing and web-based tools. Nucleic Acids Res. 2013;41(Database issue):D590–6.
2. Shen W, Sipos B, Zhao L. Seqkit2: A swiss army knife for sequence and alignment processing. Imeta. 2024;3(3):e191.
3. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. Checkm: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25(7):1043–55.
4. Asnicar F, Thomas AM, Beghini F, Mengoni C, Manara S, Manghi P, et al. Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using phylophlan 3.0. Nat Commun. 2020;11(1):2500.
5. Booth TJ, Shaw S, Cruz-Morales P, Weber T. Getphylo: Rapid and automatic generation of multi-locus phylogenetic trees. BMC Bioinformatics. 2025;26(1):21.
6. Alanjary M, Steinke K, Ziemert N. Automlst: An automated web server for generating multi-locus species trees highlighting natural product potential. Nucleic Acids Res. 2019;47(W1):W276–W82.
7. Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. Raxml-ng: A fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics. 2019;35(21):4453–5.
8. Eren AM, Kiefl E, Shaiber A, Veseli I, Miller SE, Schechter MS, Fink I, Pan JN, Yousef M, Fogarty EC, Trigodet F, Watson AR, Esen ÖC, Moore RM, Clayssen Q, Lee MD, Kivenson V, Graham ED, Merrill BD, Karkman A, Blankenberg D, Eppley JM, Sjödin A, Scott JJ, Vázquez-Campos X, McKay LJ, McDaniel EA, Stevens SLR, Anderson RE, Fuessel J, Fernandez-Guerra A, Maignien L, Delmont TO, Willis AD. Community-led, integrated, reproducible multi-omics with anvi’o. Nature Microbiology. 2020;6:3–6.
9. Cantalapiedra CP, Hernandez-Plaza A, Letunic I, Bork P, Huerta-Cepas J. Eggnog-mapper v2: Functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol Biol Evol. 2021;38(12):5825–9.
10. Feldgarden M, Brover V, Gonzalez-Escalona N, Frye JG, Haendiges J, Haft DH, et al. Amrfinderplus and the reference gene catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci Rep. 2021;11(1):12728.
11. Seemann T. Prokka: Rapid prokaryotic genome annotation. Bioinformatics. 2014;30(14):2068–9.
12. Blin K, Shaw S, Vader L, Szenei J, Reitz ZL, Augustijn HE, et al. Antismash 8.0: Extended gene cluster detection capabilities and analyses of chemistry, enzymology, and regulation. Nucleic Acids Res. 2025;53(W1):W32–W8.
13. Mungan MD, Alanjary M, Blin K, Weber T, Medema MH, Ziemert N. Arts 2.0: Feature updates and expansion of the antibiotic resistant target seeker for comparative genome mining. Nucleic Acids Res. 2020;48(W1):W546–W52.
14. Skinnider MA, Johnston CW, Gunabalasingam M, Merwin NJ, Kieliszek AM, MacLellan RJ, Li H, Ranieri MRM, Webster ALH, Cao MPT, Pfeifle A, Spencer N, To QH, Wallace DP, Dejong CA, Magarvey NA. Comprehensive prediction of secondary metabolite structure and biological activity from microbial genome sequences. Nat Commun. 2020;11(1):6058.
15. Tietz JI, Schwalen CJ, Patel PS, Maxson T, Blair PM, Tai HC, et al. A new genome-mining tool redefines the lasso peptide biosynthetic landscape. Nat Chem Biol. 2017;13(5):470–8.
16. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. Blast+: Architecture and applications. BMC Bioinformatics. 2009;10:421.
17. Tareen A, Kinney JB. Logomaker: Beautiful sequence logos in python. Bioinformatics. 2020;36(7):2272–4.
18. Cox E, Tsuchiya MTN, Ciufo S, Torcivia J, Falk R, Anderson WR, et al. Ncbi taxonomy: Enhanced access via ncbi datasets. Nucleic Acids Res. 2025;53(D1):D1711–D5.
19. Hyatt D, Chen G, LoCascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119.
20. Bruna T, Hoff KJ, Lomsadze A, Stanke M, Borodovsky M. Braker2: Automatic eukaryotic genome annotation with genemark-ep+ and augustus supported by a protein database. NAR Genom Bioinform. 2021;3(1):lqaa108.
21. Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, et al. Repeatmodeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A. 2020;117(17):9451–7.
22. Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cdna alignments to improve de novo gene finding. Bioinformatics. 2008;24(5):637–44.
23. Mikheenko A, Prjibelski A, Saveliev V, Antipov D, Gurevich A. Versatile genome assembly evaluation with quast-lg. Bioinformatics. 2018;34(13):i142–i50.
24. Oliynyk M, Samborskyy M, Lester JB, Mironenko T, Scott N, Dickens S, et al. Complete genome sequence of the erythromycin-producing bacterium saccharopolyspora erythraea nrrl23338. Nat Biotechnol. 2007;25(4):447–53.
25. Kwan DH, Schulz F. The stereochemistry of complex polyketide biosynthesis by modular polyketide synthases. Molecules. 2011;16(7):6092–115.
26. Kwan DH, Sun Y, Schulz F, Hong H, Popovic B, Sim-Stark JC, et al. Prediction and manipulation of the stereochemistry of enoylreduction in modular polyketide synthases. Chem Biol. 2008;15(11):1231–40.
27. Reeves CD, Murli S, Ashley GW, Piagentini M, Hutchinson CR, McDaniel R. Alteration of the substrate specificity of a modular polyketide synthase acyltransferase domain through site-specific mutations. Biochemistry. 2001;40(51):15464–70.
28. Bisang C, Long PF, Cortés J, Westcott J, Crosby J, Matharu A-L, et al. A chain initiation factor common to both modular and aromatic polyketide synthases. Nature. 1999;401(6752):502–5.
29. Chiang CY, Ohashi M, Le J, Chen PP, Zhou Q, Qu S, et al. Copper-dependent halogenase catalyses unactivated c-h bond functionalization. Nature. 2025;638(8049):126–32.
30. Edgar RC. Muscle: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–7.
31. van den Belt M, Gilchrist C, Booth TJ, Chooi YH, Medema MH, Alanjary M. Cagecat: The comparative gene cluster analysis toolbox for rapid search and visualisation of homologous gene clusters. BMC Bioinformatics. 2023;24(1):181.
32. Bat-Erdene U, Kanayama D, Tan D, Turner WC, Houk KN, Ohashi M, et al. Iterative catalysis in the biosynthesis of mitochondrial complex ii inhibitors harzianopyridone and atpenin b. J Am Chem Soc. 2020;142(19):8550–4.
