ColabCuraTE: an easy-to-use, web-based pipeline for the manual curation of transposable elements

Scott L Travers; Abbas Khansa; Christopher E Ellison

doi:10.1186/s13100-025-00389-2

2025 Dec 10;17:2. doi: 10.1186/s13100-025-00389-2

Scott L Travers ^1,^2, Abbas Khansa ^1, Christopher E Ellison ^1,^2,^✉

PMCID: PMC12801901 PMID: 41372941

Abstract

Background

Transposable elements (TEs) are widespread mobile DNA sequences that shape genome structure, function, and evolution. Although automated tools exist for the de novo identification and classification of TEs, their output often requires manual refinement to generate accurate consensus sequences for individual TE families. This curation process is essential but remains time-consuming and inaccessible to many researchers, particularly those without bioinformatics expertise or access to sufficient computing resources. To address this gap, we developed ColabCuraTE, a web-based, user-friendly pipeline implemented in Google Colaboratory that enables manual curation of TEs without the need for local software installation or advanced programming skills.

Results

ColabCuraTE includes built-in visualization tools and guides users through a streamlined workflow—from TE copy identification, alignment extension, and refinement, to consensus sequence generation and TE family analysis. We validated the pipeline using both megabase-sized and gigabase-sized genomes and found that it reliably improves the quality and completeness of TE consensus sequences compared to outputs from automated de novo TE annotation tools.

Conclusions

ColabCuraTE lowers the barrier to TE curation by removing the infrastructure and expertise requirements that typically limit participation in genomic research. It excels at the targeted curation of individual TE families but can also be used for large-scale curation efforts when deployed via a course or workshop. Its accessibility, intuitive interface, and compatibility with existing tools make it a valuable resource for both researchers and educators. ColabCuraTE enables broader participation in TE annotation efforts and supports the integration of undergraduates in genomics research.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13100-025-00389-2.

Keywords: Transposable elements, Genome annotation, Multiple sequence alignment, Consensus sequence, Bioinformatics education

Introduction

Transposable elements (TEs) are DNA sequences that can move or copy themselves to new positions within a host genome. They are ubiquitous and often abundant components of genomes across the tree of life, playing a key role in shaping genome structure, function, and evolution. TEs contribute to genomic diversity by creating mutations, gene duplications, and chromosomal rearrangements, which are frequently deleterious, but can also provide the substrate for natural selection and adaptive evolution [1–4]. They are known to play critical roles in gene regulation, sometimes modulating the expression of nearby genes through cis-regulatory effects [5]. In some cases, TEs have given rise to entirely new genes and regulatory networks, highlighting their importance in genome innovation and evolution. Despite their potential to promote beneficial genomic changes, transposable elements are also well-known contributors to genomic instability and disease [6].

Given their various impacts on host biology, which remain to be fully understood, the accurate identification and annotation of TEs is critical for a comprehensive understanding of genome architecture, function, and evolutionary history. For non-model organisms, whose TEs have not been previously characterized, TE annotation involves two steps: (1) generation and classification of a single representative sequence for each TE family that resides within the genome of interest, and (2) identification of the genomic location(s) of each copy of each TE family. Step 1 above, identification and classification of TE consensi, presents a number of challenges [7–9]. TEs are highly diverse and often evolve rapidly, leading to high sequence divergence between related elements. This makes it difficult for automated tools to recognize distant homologs or classify TEs that may not closely resemble known families. A number of automated bioinformatics tools have been developed to aid in TE identification and annotation from genomic sequences [7, 10–14]. These tools use a combination of sequence similarity searches, de novo identification methods, and existing repeat libraries to classify TEs. After annotation, TE families are typically represented by a single consensus sequence, generated from a seed alignment which consists of a multiple sequence alignment (MSA) of multiple copies of that TE family from the genome. The complete collection of TE consensi generated for a particular species is known as a TE/repeat library. Once a TE library has been generated, it is relatively straightforward to identify the locations of each TE family within the genome of interest using a tool such as RepeatMasker [15].

TE copies are often fragmented and/or nested in other repeats, which makes the automated identification of full-length consensus sequences extremely difficult. Thus, automated approaches often produce fragmented consensus sequences for TE families, whereas full-length consensus sequences are crucial for correctly annotating individual TE copies in the genome and for downstream evolutionary analyses. As a result, manual curation is still required to refine the output of automated tools, correct errors, and ensure the generation of high-quality consensus sequences and seed alignments that can be deposited in curated TE databases [16–19]. This manual curation process is labor-intensive and typically requires bioinformatics expertise, limiting the scalability of TE annotation efforts across the wide range of available genomic data.

Involving undergraduate and high school students in the manual curation of TEs represents a promising solution to the problem of poorly characterized TE families in the genomes of most non-model organisms [20]. Undergraduate students are increasingly participating in Course-based Undergraduate Research Experiences (CUREs) in genomics through initiatives like the Genomics Education Partnership (GEP) [21, 22] and the Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science (SEA-PHAGES) [23], which successfully integrate genome annotation projects into the undergraduate curriculum [20, 24–32]. The availability of pre-designed research projects based on the growing number of sequenced genomes allows undergraduates to contribute meaningfully to scientific research, while also gaining hands-on experience in genomics and bioinformatics. However, despite the abundance of bioinformatic tools and training resources for conducting TE research [33], several barriers still hinder student involvement. Most students, particularly those in high school or at primarily undergraduate institutions or smaller research universities, lack access to the high-performance computing resources typically required for bioinformatics projects. Additionally, manual curation of TEs often necessitates familiarity with computational tools and coding, as well as an understanding of the complex biological concepts underlying TE evolution and classification, making it difficult for students with limited experience to participate effectively.

To address these barriers, we developed ColabCuraTE, an easy-to-use, open-access pipeline that facilitates the manual curation of TEs. ColabCuraTE is implemented as a Google Colaboratory (Colab) notebook (https://colab.research.google.com), which allows users to run bioinformatics tools in the cloud without the need for high-performance computing infrastructure. By leveraging the Google Colab platform, we aim to provide an accessible solution that requires minimal computational expertise, enabling a broader audience—including students and researchers from resource-limited institutions—to participate in TE curation. ColabCuraTE streamlines the curation process, from initial TE family selection to the generation of high-quality consensus sequences, providing a practical tool for researchers and students alike. By enabling easy access to TE curation tools, we hope to accelerate the annotation of TEs across a wide range of genomes, enhancing our understanding of genome evolution and function. ColabCuraTE can also serve as a way for students to increase their understanding of TE evolution and annotation via active learning by integrating the notebook into a CURE or workshop that pairs instruction on these topics with ColabCuraTE TE annotation.

Implementation

ColabCuraTE is an open-access, browser-based tool implemented in Google Colab to facilitate the manual curation of transposable elements. The pipeline assumes the user has already identified a TE family of interest for manual curation; however, we also provide tools for browsing summary information of a repeat library, which can be used to identify promising candidates for manual curation (see below for additional details). The pipeline requires two FASTA-format input files: a genome assembly file and a repeat library file containing one or more TE family consensus sequences (e.g. the XXXX-families.fa file output from RepeatModeler2 [10]). The outputs of the pipeline include an improved, full-length consensus sequence for the TE family of interest and a corresponding seed alignment in Stockholm (.stk) format, suitable for submission to the Dfam repository for curated TE families [34]. The pipeline consists of the following five major steps, which are explained in more detail in the next section (Fig. 1). (1) First, ColabCuraTE identifies all copies of the TE family of interest within the target genome and generates an initial seed alignment from either a subset of copies, or all copies, as specified by the user. (2) Next, it extends the flanking sequences of each TE copy in the seed alignment to improve TE sequence representation. (3) The extended sequences are then realigned, with additional extension performed if necessary. (4) In the refinement step, users can visually inspect the alignment to identify TE boundaries, trim non-TE sequence, and generate an updated consensus sequence and seed alignment. (5) Once a new consensus sequence is generated, the pipeline allows for reclassification of the new sequence by TE family, comparison to the original sequence, and examination of structural and functional features of the consensus sequence, along with summary information about its individual copies in the genome.
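As an illustration of the repeat library input described above, the following minimal sketch parses FASTA text and lists each family with its classification and consensus length. It is not part of ColabCuraTE. It assumes that headers follow the `name#Class/Superfamily` convention commonly seen in RepeatModeler2 output, and the family names and sequences in the usage note are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_library(text: str) -> List[Dict[str, object]]:
    """Return one record per family, sorted by consensus length (longest first)."""
    records = []
    for header, seq in read_fasta(text):
        identifier = header.split()[0]  # drop any free-text description
        # Assumed convention: "<family name>#<classification>"
        name, _, te_class = identifier.partition("#")
        records.append(
            {"name": name, "class": te_class or "Unknown", "length": len(seq)}
        )
    return sorted(records, key=lambda r: r["length"], reverse=True)
```

For example, `summarize_library(">famA#LTR/Copia\nACGTACGT\nACGT\n>famB#SINE/tRNA\nACGTA\n")` returns two records, with the hypothetical `famA` (12 bp) listed before `famB` (5 bp).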

Fig. 1 — Schematic overview of the *ColabCuraTE* pipeline for the manual curation of transposable elements

ColabCuraTE is a platform-independent, Python-based program, with dependencies managed through Conda and installed automatically during each virtual runtime in Google Colab, eliminating the need for local installation or configuration. Each runtime operates on a Google-provisioned virtual machine that grants users root privileges and provides complimentary computational resources, which are generally sufficient for most TE curation tasks. To use the pipeline, we recommend users upload input files from Google Drive, as large file uploads from a local directory can be interrupted, leading to potential data loss. Since runtime sessions are temporary (they reset after reaching their time limit or after prolonged inactivity), all files are deleted upon session termination. To address this, ColabCuraTE incorporates checkpoints throughout the pipeline, allowing users to back up their progress and resume their workflow at a later time if needed. Although the setup, installation, and file upload steps must be repeated with each new runtime, we have automated the process so that it only involves a few clicks and approximately five minutes of wait time. Throughout the pipeline, the user-friendly interface of the Colab notebook allows users to set input and output parameters through easy-to-use form fields, so no coding experience is needed to use ColabCuraTE.

Pipeline overview

TE identification

Following setup and installation, the pipeline begins by selecting a consensus sequence for the target TE family from the provided repeat library. All matching copies of this consensus sequence are identified within the user-provided reference genome assembly using RepeatMasker [15]. An initial seed alignment of these TE copies is then generated using RepeatModeler2’s generateSeedAlignments.pl script [10]. The pipeline then provides seed alignment filtering options, including a visualization of TE copy length distribution. This visualization helps users set a length cutoff to retain only full-length TE copies if desired. Additionally, users can define a maximum number of TE copies to retain in the alignment, optimizing computational efficiency and accelerating subsequent alignment steps.
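The filtering options described above can be sketched as follows. This is a simplified illustration rather than the ColabCuraTE implementation: the function names are hypothetical, and retaining the longest copies when a maximum copy number is set is an assumption, since the selection rule is not specified here.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def length_histogram(lengths: Iterable[int], bin_size: int = 500) -> Dict[int, int]:
    """Count TE copies per length bin; keys are the lower bound of each bin."""
    bins = Counter((length // bin_size) * bin_size for length in lengths)
    return dict(sorted(bins.items()))


def filter_copies(
    copies: Dict[str, str],
    min_length: int = 0,
    max_copies: Optional[int] = None,
) -> Dict[str, str]:
    """Keep copies of at least min_length bp, then cap the number retained.

    Assumption: when max_copies is set, the longest copies are kept.
    """
    kept = [(name, seq) for name, seq in copies.items() if len(seq) >= min_length]
    kept.sort(key=lambda item: len(item[1]), reverse=True)
    if max_copies is not None:
        kept = kept[:max_copies]
    return dict(kept)
```

In this sketch, the histogram plays the role of the length distribution plot: a cluster of copies in the longest bins suggests where to place the length cutoff passed as `min_length`.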

Extension

The heart of TE family annotation involves iterative extension and alignment of TE copies to ensure that the complete TE is included in the consensus sequence [16, 35, 36]. After filtering the seed alignment, the pipeline moves into the extension phase, where flanking genomic sequences are added to both ends of each TE copy. This step ensures that the final consensus sequence accurately represents the full-length TE. Users can adjust the extension length for each end, allowing for a tailored approach to sequence expansion. To optimize efficiency, we recommend an iterative extension strategy to avoid excessive elongation, which can slow down the subsequent alignment step. A practical approach is to start with a small flanking sequence, such as 200 base pairs (bp), proceed to the alignment step, and then assess whether the TE boundaries have been fully captured. If necessary, additional flanking sequence can be iteratively extended until the complete TE is included.
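The coordinate arithmetic behind the extension step can be illustrated with a short sketch. The `TECopy` record and `extend_copy` function are hypothetical names, the coordinates are assumed to be 0-based and half-open, and the 5′ and 3′ extension lengths are assumed to be given relative to the orientation of the TE, so they are swapped for copies on the minus strand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TECopy:
    contig: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


def extend_copy(copy: TECopy, contig_length: int,
                five_prime: int = 200, three_prime: int = 200) -> TECopy:
    """Add flanking sequence to both ends, clamped to the contig boundaries."""
    if copy.strand == "-":
        # On the minus strand the TE's 5' end lies at the higher genomic coordinate.
        genomic_left, genomic_right = three_prime, five_prime
    else:
        genomic_left, genomic_right = five_prime, three_prime
    new_start = max(0, copy.start - genomic_left)
    new_end = min(contig_length, copy.end + genomic_right)
    return TECopy(copy.contig, new_start, new_end, copy.strand)
```

Calling `extend_copy` repeatedly with a modest flank (for example, the 200 bp suggested above) mirrors the iterative strategy: each round widens the window only as far as needed before realignment.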

Alignment

Following extension, TE copies are realigned using MAFFT [37]. To assess whether the alignment fully captures the TE, the pipeline generates an alignment conservation plot, which visualizes column-wise conservation across the alignment. The start of the TE is typically marked by a conservation score increase from around 25% to above 50%, often approaching 100% for evolutionarily young, active TEs. High conservation should persist throughout the TE’s length before dropping back to approximately 25% at the endpoint. If the conserved region extends to the boundary of either end of the alignment, additional flanking sequence extension is required. This iterative refinement process continues until the alignment fully encapsulates the complete TE sequence, flanked by low-conservation regions on both ends.
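One way to compute a column-wise conservation profile of this kind is sketched below. The exact score used by ColabCuraTE is not specified here, so this sketch assumes a simple definition: the fraction of rows carrying the most common non-gap base in each column. The `needs_extension` helper, its window size, and its 50% threshold are likewise illustrative assumptions.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def column_conservation(alignment: Sequence[str]) -> List[float]:
    """Fraction of rows sharing the most common non-gap base, per column."""
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment must be non-empty with equal-length rows")
    n_rows = len(alignment)
    scores = []
    for column in zip(*alignment):
        counts = Counter(base.upper() for base in column if base not in "-.")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / n_rows)
    return scores


def needs_extension(scores: Sequence[float], window: int = 10,
                    threshold: float = 0.5) -> Tuple[bool, bool]:
    """Flag an edge (left, right) whose mean conservation is still high."""
    if not scores:
        raise ValueError("scores must not be empty")
    left, right = scores[:window], scores[-window:]
    return (sum(left) / len(left) >= threshold,
            sum(right) / len(right) >= threshold)
```

Under this definition, unrelated flanking sequence scores near 25% for four equally frequent bases, so an edge window that remains above the threshold indicates that the conserved region reaches the alignment boundary and further extension is warranted.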

Refinement

Once the alignment represents the full-length TE, users can view it with an interactive MSA visualization tool, react-msaview [38], to pinpoint precise TE edges. Users scroll through the alignment visualizer to identify start and end points of the alignable TE region, noting the columns that span these edges. Most RNA and DNA transposon families generate target site duplications (TSDs) during transposition [39]. TSDs are duplicated segments of the host genome, usually shorter than ~20 bp, that directly flank TE termini. TSDs are important features of TEs because their presence can be used to confirm that TE insertion termini have been correctly identified, and their length is informative for TE classification. ColabCuraTE includes a step that allows simultaneous visualization of the left and right edges of the seed alignment, to simplify the manual search for TSDs. Users can examine this truncated view to identify a TSD, if present. Once the TE termini have been identified, the alignment edges are trimmed to encompass only the full-length TE, and then trimmed internally using trimAl [40] to remove TE copy-specific insertion/deletion sequences (by default, indels present in fewer than 50% of the TE copies). This trimmed seed alignment is then used to generate the final TE consensus sequence using PILER [41].
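In ColabCuraTE the TSD search is performed by eye, but the underlying logic can be illustrated with a small sketch. Given the genomic sequence immediately upstream and downstream of a putative TE copy, a TSD of length k exists when the last k bases of the upstream flank equal the first k bases of the downstream flank. The function names, the exact-match requirement, and the 2 to 20 bp search range are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def find_tsd(left_flank: str, right_flank: str,
             min_len: int = 2, max_len: int = 20) -> Optional[str]:
    """Return the longest exact direct repeat abutting both TE termini, if any."""
    left, right = left_flank.upper(), right_flank.upper()
    limit = min(max_len, len(left), len(right))
    for k in range(limit, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return left[-k:]
    return None


def tsd_length_support(flank_pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Tally TSD lengths across copies; 0 means no TSD was found for a copy."""
    support = Counter()
    for left, right in flank_pairs:
        tsd = find_tsd(left, right)
        support[len(tsd) if tsd else 0] += 1
    return support
```

Because very short matches arise by chance, a single hit is weak evidence. A TSD length shared by many copies, as tallied by `tsd_length_support`, is more convincing, and shifting the candidate boundaries by a few columns before repeating the search is often informative.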

TE analysis

Manual curation often results in a longer TE consensus sequence compared to the original pre-curated sequence. ColabCuraTE allows users to visually compare their newly curated consensus sequence to the original sequence, using a sequence homology dot plot generated by EMBOSS [42]. The additional sequence often present in the curated consensus may contain features helpful for classification. ColabCuraTE therefore uses the RepeatClassifier module from RepeatModeler2 to reclassify the curated TE consensus [10]. Finally, using the curated consensus as input, ColabCuraTE runs TE-Aid [19] within the notebook to generate coverage and identity plots between genomic TE copies and the consensus sequence, a self-dot plot of the consensus to reveal structural motifs like long terminal repeats (LTRs) or terminal inverted repeats (TIRs), and annotations of structural and coding features, such as ORFs and TE protein hits, providing a complete view of the TE family’s structure and potential functions. Together, these steps ensure that ColabCuraTE delivers a high-quality, biologically meaningful TE consensus sequence ready for inclusion in curated TE libraries.
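The idea behind a sequence homology dot plot can be sketched with exact word matches. This is not the EMBOSS or TE-Aid implementation; the function name and word size are illustrative assumptions, and only the match coordinates are computed (plotting is omitted).

```python
from collections import defaultdict
from typing import List, Tuple


def dotplot_matches(seq_x: str, seq_y: str, word: int = 10) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where the word starting at seq_x[i] equals that at seq_y[j]."""
    seq_x, seq_y = seq_x.upper(), seq_y.upper()
    index = defaultdict(list)
    for j in range(len(seq_y) - word + 1):
        index[seq_y[j:j + word]].append(j)
    matches = []
    for i in range(len(seq_x) - word + 1):
        for j in index.get(seq_x[i:i + word], ()):
            matches.append((i, j))
    return matches
```

When the original consensus is compared with the curated one, matches falling on a single diagonal with a constant offset indicate that the original is contained within the curated sequence, and the offset gives the amount of sequence gained at the 5′ end. Comparing a consensus against itself yields the main diagonal plus off-diagonal runs where direct repeats such as LTRs occur.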

Common issues

In an ideal TE curation scenario, the MSA will exhibit high sequence conservation across the internal portion of the element, with a clear drop-off at the termini. These alignment edges will also coincide with known structural features—such as LTRs, TIRs, or poly-A tails—and may be flanked by TSDs, all of which help confirm that the full-length transposable element has been captured. For example, LTR retrotransposons typically show conserved TG and CA dinucleotides at their 5′ and 3′ ends, respectively, along with paired LTRs and short TSDs flanking the insertion [43]. When present and intact, these features allow for confident delimitation of TE boundaries. However, in practice, defining accurate termini may be more complicated due to a range of biological and technical challenges.

For instance, a common difficulty arises from the poorly defined 5′ boundaries of LINEs, which are frequently truncated during insertion [44]. In such cases, curators must rely on the better-preserved 3′ end and then search for a TSD to try to identify the full element [19]. The situation is further complicated when TSDs are absent, eroded, or highly variable in length across TE copies. TE families also often contain subfamilies that differ subtly in sequence and length (see below for more details) [45]. These differences can lead to mixed alignments where some copies extend further than others, blurring the edges of the full-length model.

Additional complexity arises from internal features of TEs themselves. Some TEs contain shorter repetitive elements embedded within them, such as AT-rich microsatellites commonly found near the 3′ end of LINEs, which vary in length across insertions [46]. These regions can create irregular alignment patterns near the termini, making it harder to identify the TE boundaries. Similarly, TEs inserted within segmental duplications may be flanked by highly similar host sequences, which can appear falsely conserved across multiple alignment rows and give the illusion of extended TE boundaries. Finally, the presence of rare insertions, highly divergent copies, copies from different but related families, or alignment artifacts near the edges can introduce noise and further obscure the true extent of the element.

Altogether, these issues mean that while some TE families have clearly defined structural features that make curation straightforward, many require intensive inspection and biological interpretation. Assigning boundaries often involves weighing patterns of conservation, structural expectations, and known TE biology, and in many cases, remains a subjective decision [19].

Testing

To evaluate the performance of ColabCuraTE, we applied the pipeline to five representative transposable element families from three genomes: the well-known copia LTR retrotransposon from the Drosophila melanogaster genome; three randomly selected TE families (a SINE, a gypsy LTR retrotransposon, and a LINE) from the genome of the snake Bungarus multicinctus; and one randomly selected TE family, a hAT DNA transposon, from the genome of rice, Oryza sativa (Table 1; Figs. 2, 3 and 4; Supplementary Figs. S1 and S2). These three species were chosen to assess how genome size influences pipeline performance, as the Bungarus genome is approximately an order of magnitude larger than the Drosophila genome (Table 1). De novo repeat libraries for each genome were generated using RepeatModeler2 with default parameters. Each TE family was processed through the full ColabCuraTE workflow, from which we documented runtime durations, differences between pre- and post-curated consensus sequences, and practical considerations encountered during the pipeline.

Table 1.

Genomes and transposable element (TE) families used to test ColabCuraTE, with consensus sequence lengths before and after manual curation

Species	NCBI assembly	Genome size	TE type (total bp in genome)	Consensus length (before)	Consensus length (after)
Drosophila melanogaster	GCA_000001215.4	143.7 Mb	copia LTR (336,689 bp)	1714 bp	5145 bp
Bungarus multicinctus	GCA_023653725.1	1.6 Gb	gypsy LTR (10,655,558 bp)	5652 bp	6090 bp
Bungarus multicinctus	GCA_023653725.1	1.6 Gb	SINE (13,281,989 bp)	225 bp	241 bp
Bungarus multicinctus	GCA_023653725.1	1.6 Gb	LINE (77,147,916 bp)	3536 bp	4755 bp
Oryza sativa	GCF_034140825.1	385.7 Mb	hAT DNA (310,320 bp)	2963 bp	3589 bp


Fig. 2 — Visualization outputs from *ColabCuraTE* using the *copia* LTR retrotransposon from *D. melanogaster*. A Histogram showing the length distribution of TE copies in the initial seed alignment. A minimum length threshold of 1500 bp was applied to exclude partial or truncated copies. B Alignment conservation plots illustrating the iterative extension and realignment process: initial alignment requiring additional extension on both ends (left), additional extension needed at the 5′ end (middle), and final alignment with clear conservation drop-off at both edges indicating full-length coverage (right). C MSA viewer displaying the extended alignment near the 3′ end, where users identify the approximate TE boundary positions. D MSA viewer during the TSD detection step. Most of the internal TE is excluded from view (indicated by x’s), allowing users to inspect the flanking regions at the same time, which simplifies TSD detection. In this example, a 5 bp TSD is present and highlighted with dotted boxes

Fig. 3 — Visualization outputs from *ColabCuraTE* for two *B. multicinctus* transposable element families. A *Gypsy* LTR retrotransposon: (top left) histogram of TE copy lengths used to filter the seed alignment. Based on this plot, a minimum length threshold of 5,500 bp was used to retain only full-length elements; (bottom left) alignment conservation plot following iterative extension, showing well-defined boundaries; (right) MSA viewer during the TSD detection step, with a 5 bp TSD highlighted by dotted boxes. B SINE element: (top left) length distribution of TE copies. Based on this plot, a minimum length threshold of 228 bp was set; (bottom left) alignment conservation plot following extension; (right) MSA viewer during TSD detection, where no TSD was identified

Fig. 4 — Comparison of consensus sequences before and after curation using *ColabCuraTE*. A Pairwise dot plots showing sequence similarity between pre- and post-curated consensus sequences for the *D. melanogaster copia* LTR retrotransposon (left), *B. multicinctus gypsy* LTR retrotransposon (middle), and *B. multicinctus* SINE element (right). B Self dot plots of the consensus sequences before (left) and after (right) curation, generated by TE-Aid, for the *copia* LTR (top), *gypsy* LTR (middle), and SINE (bottom). C Structural features and protein hits identified by TE-Aid in the consensus sequences before (left) and after (right) curation of the *copia* LTR (top), *gypsy* LTR (middle), and SINE (bottom). See Supplementary Figs. S1 and S2 for similar plots of the LINE and hAT DNA example TEs

While the major utility of ColabCuraTE is to extend fragmented TE consensi produced by automated pipelines like RepeatModeler2, we also tested its performance when starting with a putatively full-length LTR retrotransposon identified in D. melanogaster by HiTE [13], consisting of both the internal sequence and the two flanking LTRs. We determined that this element corresponded to the 297 LTR retrotransposon. ColabCuraTE confirmed that the consensus reported by HiTE encompassed the full-length 297 element, while also allowing for slight adjustments to the beginning and end of the consensus that are consistent with the 297 family’s known preference for inserting into ATAT target site motifs [47] (see Supplementary Fig. S3 for details).

Supplemental computing resources

While ColabCuraTE is designed to be accessible and free of high-performance computing requirements, it does rely on a pre-existing repeat library as input, typically generated by de novo TE discovery tools such as RepeatModeler2. However, running such tools on a whole-genome assembly can be computationally intensive and cannot be executed within the Google Colab environment due to runtime limits and memory constraints. Additionally, the TE identification steps of ColabCuraTE utilize RepeatMasker to annotate genomic copies of the TE family being curated. This step can be run within the ColabCuraTE notebook; however, for gigabase-sized genomes, it can take several hours to complete (Table 2). To address these limitations, we leveraged the Galaxy web server [48], a free, user-friendly platform for bioinformatics analyses, to run both RepeatModeler2 and RepeatMasker. We provide a detailed, step-by-step guide as an appendix section within the ColabCuraTE notebook that contains instructions for running both RepeatModeler2 and RepeatMasker on Galaxy. Users whose genome of interest lacks a de novo TE library, such as that produced by RepeatModeler2, will need to run RepeatModeler2 (or a similar automated de novo TE identification tool). Those without computational experience and/or resources can do so using Galaxy. For large-scale curation efforts, we recommend running RepeatMasker on Galaxy using the full repeat library (e.g. all TE consensi output by RepeatModeler2) as described in the notebook appendix, as this will dramatically reduce the curation runtime for each individual TE family.

Table 2.

ColabCuraTE runtimes (h:m:s) for each major pipeline step during testing across different genomes and TE families

Species (TE type)	Installation	TE identification	Extension and alignment	Refinement	TE analysis	Total runtime
D. melanogaster (copia LTR)	0:03:30	0:09:04	0:27:16	0:06:33	0:00:32	0:46:55
B. multicinctus (gypsy LTR)	0:04:00	3:34:25	0:57:26	0:04:50	0:01:17	4:41:58
B. multicinctus (SINE)	0:03:43	3:36:35	0:00:51	0:12:26	0:01:03	3:54:38
B. multicinctus (LINE)	0:04:33	4:01:12	0:32:53	0:10:40	0:01:58	4:51:16
O. sativa (hAT DNA)	0:03:40	0:29:36	0:16:59	0:05:30	0:00:25	0:56:10


For users who have not yet selected a TE family for manual curation, the appendix includes tools to explore a genome’s repeat landscape and identify candidate families. Using the RepeatMasker output, users can visualize the repeat landscape for their genome of interest and generate an interactive summary table of TE families, sortable by metrics such as genome abundance or sequence divergence.

Results & discussion

In this study, we developed ColabCuraTE, a new tool for manually curating transposable element families in a genome. We tested ColabCuraTE using five TE families from three genomes of contrasting size: the fruit fly, Drosophila melanogaster, the rice plant, Oryza sativa, and the many-banded krait snake, Bungarus multicinctus. Overall, the pipeline successfully guided the curation process for all TE families, producing a refined consensus sequence and seed alignment files suitable for downstream analysis and TE database submission. Below, we report on the pipeline’s performance during our testing, its current limitations, and other practical considerations for running ColabCuraTE.

Pipeline performance

In each test case, the final curated consensus sequence provided a more complete representation of the full-length TE compared to the uncurated RepeatModeler2 output (Table 1; Figs. 2, 3 and 4; Supplementary Figs. S1 and S2). Below, we provide additional details on the curation results for each of our five test cases.

D. melanogaster copia LTR retrotransposon

The final copia consensus sequence was 3431 bp longer than the pre-curated RepeatModeler2 consensus. The element reported by RepeatModeler2 was missing not only both LTRs, but also a large portion of the internal sequence (2.9 kb of missing sequence, which is ~63% of the complete internal portion of the TE). ColabCuraTE correctly extended the RepeatModeler2 fragment to incorporate the complete internal portion of the element, plus the two LTRs. In the final consensus, we were able to precisely identify the copia TE termini, as indicated by the presence of a 5 bp TSD (Fig. 2). This TE has previously been well-characterized, including the 5 bp TSD [49]. Our curated copia consensus sequence is nearly identical to the curated Dfam consensus, sharing 99.49% nucleotide identity. Both have the same length and termini, differing by only 26 internal nucleotides, which is likely due to differences in the subset of sequences selected during seed alignment filtering.
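The figures quoted for copia can be checked with a few lines of arithmetic. The sketch below assumes that identity is calculated as matching positions divided by the consensus length of 5145 bp (Table 1), which is an assumption about how the comparison was made rather than a description of the tool used.

```python
def percent_identity(aligned_length: int, differences: int) -> float:
    """Percent of aligned positions that are identical."""
    return 100.0 * (aligned_length - differences) / aligned_length


# Values taken from Table 1 and the text above.
length_before, length_after = 1714, 5145
length_gain = length_after - length_before          # 3431 bp
copia_identity = percent_identity(length_after, 26)  # 26 differing nucleotides

print(f"gain: {length_gain} bp, identity: {copia_identity:.2f}%")
# prints "gain: 3431 bp, identity: 99.49%"
```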

B. multicinctus gypsy LTR retrotransposon

The gypsy element identified by RepeatModeler2 contained the full LTR at the 5′ end (576 bp) but was missing 439 bp (76%) of the LTR at the 3′ end. ColabCuraTE correctly extended the 3′ end of the RepeatModeler2 fragment in this case as well, resulting in a full-length element with well-defined termini adjacent to a 5 bp TSD (Fig. 3).

B. multicinctus SINE element

With the two LTR elements above, the high sequence identity among the full-length TE copies and the presence of an easily identifiable TSD made the curation process relatively straightforward. However, curating the Bungarus SINE family was less clear-cut, as full-length copies were more divergent from one another, notably at the edges, making it difficult to precisely identify the TE boundaries (Fig. 3). Furthermore, although SINE transposition is known to generate a TSD, none was identifiable in this example (Fig. 3). Additional inspection using the TSD search tool in SINEBase [50] also failed to identify a clear TSD. This SINE may lack a TSD [51], or may belong to an older, inactive TE family whose sequence has degraded over time, rendering the TSD undetectable. This highlights a common challenge in TE curation stemming from the dynamic and often degenerative nature of TE evolution. Nevertheless, our final consensus was still a better representation of the full-length SINE compared to the pre-curated consensus (Table 1). Although its classification remained unchanged (SINE/tRNA-Deu) based on RepeatClassifier, SINEBase (UCON3: Euteleostomi SINE with a tRNA head), and CENSOR [52] (SINE2/tRNA: SINE2-6_TilSci, blue-tongued skink genome), our ~10 bp extension of the 5′ end captured a more complete tRNA head, including the Pol III promoter that SINEs utilize for transcription.

B. multicinctus LINE element

As mentioned above, LINE elements can be challenging to curate because their 5′ ends often show varying degrees of truncation, their 3′ ends may contain a variable-length AT-rich microsatellite, and their TSDs can vary in nucleotide length. Consequently, LINE seed alignments may require manual trimming of truncated LINE copies. These fine-scale adjustments are not feasible within ColabCuraTE alone, so we recommend following the LINE curation guidelines of Goubert et al. [19], which use an external alignment visualization and editing tool such as AliView [53] to manually trim truncated LINE copies and/or refine the AT-rich microsatellite portion of the alignment prior to consensus generation. For the Bungarus LINE element we tested, the varying degrees of 5′ truncation among copies in the filtered seed alignment are evident in the alignment conservation plot, which shows a gradual decline in sequence conservation toward the 5′ end (Supplementary Fig. S1b). For our curation procedure, we began with the better-defined 3′ edge and moved toward the 5′ edge, attempting to capture as much alignable sequence as possible until several copies terminated near a common point. We then trimmed the alignment edges at these positions and searched for TSDs in the ColabCuraTE notebook. Although no unambiguous TSD was detected, we identified several 2–4 bp direct repeats flanking the putative start and end coordinates of the longest copies we could identify (Supplementary Fig. S1c). This range is consistent with the short, variable TSDs reported for the L2 LINE family [54]. Accordingly, we trimmed the alignment at these points to represent the full-length element. The trimmed alignment was then exported to AliView for manual refinement, where we removed highly truncated copies, retained only near-complete ones, and adjusted terminal positions to ensure both edges contained only the alignable TE sequence (Supplementary Fig. S1c). Minor refinements were also made to the alignment of the 3′ AT-rich microsatellite. The finalized alignment was re-imported into ColabCuraTE for internal trimming and consensus generation. The resulting consensus sequence was 1,219 bp longer than the pre-curated RepeatModeler2 consensus and contained two ORFs (ORF1: 243 amino acids, ORF2: 1,241 amino acids) consistent with most LINE TEs (Supplementary Fig. S1).
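To make the idea of searching for short direct repeats at the putative element boundaries concrete, the following Python sketch checks whether the bases immediately upstream of a putative TE start match the bases immediately downstream of the putative TE end. It is an illustrative sketch only and is not the implementation used in the ColabCuraTE notebook: the function name `find_tsd`, its parameters, and the exact-match criterion are assumptions made for this example, and practical TSD searches usually tolerate mismatches and small coordinate offsets.

```python
def find_tsd(seq, te_start, te_end, min_len=2, max_len=20):
    """Return the longest exact direct repeat flanking seq[te_start:te_end].

    Hypothetical helper for illustration. Coordinates are 0-based and
    half-open: the TE occupies seq[te_start:te_end]. Returns None when no
    exact flanking repeat of at least min_len bases is found.
    """
    seq = seq.upper()
    best = None
    for k in range(min_len, max_len + 1):
        # Stop once the flanks would run past either end of the sequence.
        if te_start - k < 0 or te_end + k > len(seq):
            break
        upstream = seq[te_start - k:te_start]   # last k bases before the TE
        downstream = seq[te_end:te_end + k]     # first k bases after the TE
        if upstream == downstream:
            best = upstream                     # keep the longest match so far
    return best


# Toy example: a 10 bp "element" flanked by a 4 bp duplication.
toy = "GGG" + "ACGT" + "TTTTTTTTTT" + "ACGT" + "CCC"
print(find_tsd(toy, te_start=7, te_end=17))  # prints ACGT
```

Every length is tested independently because a match at one length does not imply a match at a shorter length: the upstream window grows leftward while the downstream window grows rightward.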

O. sativa hAT DNA transposon

The final hAT consensus sequence was 626 bp longer than the pre-curated RepeatModeler2 consensus. ColabCuraTE correctly extended the RepeatModeler2 fragment by several hundred base pairs on either end to incorporate the complete internal portion of the element, plus the two 19 bp terminal inverted repeats (TIRs) (Supplementary Fig. S2). In the final consensus, we were able to precisely identify the hAT TE termini, as indicated by the presence of an 8 bp TSD (Supplementary Fig. S2).
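The defining property of TIRs, that the 5′ terminus of the element matches the reverse complement of the 3′ terminus, can be illustrated with a short Python sketch. This is a simplified example under stated assumptions rather than the method used by ColabCuraTE: the function names are hypothetical, only the bases A, C, G, and T are handled, and the comparison requires perfect matches, whereas real TIRs are often imperfect.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement a DNA string containing only A, C, G, and T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(consensus, max_len=50):
    """Length of the perfect TIR at the termini of a consensus sequence.

    Hypothetical helper for illustration. The first k bases of the sequence
    equal the reverse complement of the last k bases exactly when the
    sequence and its reverse complement share a prefix of length k, so the
    function counts the length of that shared prefix.
    """
    seq = consensus.upper()
    rc = reverse_complement(seq)
    limit = min(max_len, len(seq) // 2)  # the two repeats must not overlap
    length = 0
    for a, b in zip(seq[:limit], rc[:limit]):
        if a != b:
            break
        length += 1
    return length


# Toy element: an 8 bp TIR surrounding an unrelated internal region.
tir = "CAGTGTTC"
toy = tir + "AAACCCAAACCC" + reverse_complement(tir)
print(terminal_inverted_repeat_length(toy))  # prints 8
```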

Runtimes

Total runtimes varied substantially among the three genomes, ranging from ~47 min for the copia element in Drosophila to ~4–5 h for each TE in the Bungarus genome (Table 2). This variation was primarily driven by a single step, the RepeatMasker run during seed alignment generation, which took just ~9 min for the smaller Drosophila genome but over 3 h for the much larger Bungarus genome (Table 2). However, we note that the ColabCuraTE appendix includes instructions for reducing this runtime by running RepeatMasker on Galaxy (see below). Excluding this step, the remainder of the pipeline completed in under an hour across all test cases. The next most time-consuming step was the MAFFT alignment, which is sensitive to both alignment length and the amount of flanking (non-TE) sequence introduced during extension (Table 2). To minimize alignment runtimes, we recommend an iterative extension strategy that avoids overextension and the inclusion of too much non-homologous sequence.

If a Colab runtime remains idle for too long, it will disconnect and delete all session files. This poses a particular challenge for the RepeatMasker step during the initial seed alignment, which can take several hours for large genomes. To address this, we implemented a checkpoint after this step that automatically saves the alignment output to the user’s Google Drive. If a session times out after this step, users can reconnect to a new runtime, complete the setup steps, and resume their workflow without loss of progress. Alternatively, running the RepeatMasker step on the Galaxy web server (see ColabCuraTE appendix) offers a reliable workaround when genome size prevents successful completion of this step within the Colab notebook.
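The checkpoint logic described above can be sketched in a few lines of Python. This is a generic illustration and not the code used in the ColabCuraTE notebook: the function names and the file and directory names are hypothetical, and in a Colab session the checkpoint directory would be a folder under the mounted Google Drive (commonly `/content/drive/MyDrive` after calling `google.colab.drive.mount`), which is assumed here rather than taken from the notebook.

```python
import shutil
from pathlib import Path


def save_checkpoint(output_file, checkpoint_dir):
    """Copy a finished output file into persistent storage.

    Hypothetical helper: checkpoint_dir stands in for a folder on the
    user's mounted Google Drive.
    """
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    destination = checkpoint_dir / Path(output_file).name
    shutil.copy2(output_file, destination)
    return destination


def restore_checkpoint(filename, checkpoint_dir, work_dir):
    """Copy a saved file back into a fresh session's working directory.

    Returns the restored path, or None if no checkpoint exists and the
    long-running step therefore has to be rerun.
    """
    source = Path(checkpoint_dir) / filename
    if not source.exists():
        return None
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    destination = work_dir / filename
    shutil.copy2(source, destination)
    return destination
```

In a new runtime, a call such as `restore_checkpoint("seed_alignment.out", checkpoint_dir, work_dir)` (with a hypothetical file name) would be attempted first, and the long-running step would be repeated only if it returns `None`.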

To significantly reduce ColabCuraTE runtimes for large-scale curation projects, such as curating an entire TE library, we recommend running RepeatMasker on Galaxy using the full set of TE consensi output by RepeatModeler2 (or a similar automated de novo TE identification tool). Users can then generate seed alignments for individual TE families by following the steps outlined in the ColabCuraTE appendix. While this approach requires a longer initial RepeatMasker runtime, it eliminates the need to rerun the process for each TE family within the ColabCuraTE notebook. For example, with the B. multicinctus genome, running RepeatMasker once on Galaxy with the full repeat library took approximately 14 h, but it reduced the time to generate each seed alignment in the ColabCuraTE notebook from over 3 h to around 25 min. With this strategy, total ColabCuraTE runtimes for individual TE families are typically closer to 1 h.
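The trade-off between the two strategies can be made explicit with a short break-even calculation. The Python sketch below uses the B. multicinctus figures reported above (a one-time Galaxy run of approximately 14 h, and seed alignment generation of around 25 min instead of over 3 h per family). Because the in-notebook time is given only as "over 3 h", the value of 180 min is treated as a lower bound, an assumption of this example, so the true break-even point may be reached with fewer families. The function names are hypothetical, and only the seed alignment step is counted.

```python
import math


def total_minutes(n_families, per_family_minutes, upfront_minutes=0):
    """Total seed-alignment time for n_families under one strategy."""
    return upfront_minutes + n_families * per_family_minutes


def break_even_families(upfront_minutes, slow_minutes, fast_minutes):
    """Smallest number of families for which the up-front run pays off.

    Solves upfront + n * fast < n * slow for the smallest integer n.
    """
    saving = slow_minutes - fast_minutes
    if saving <= 0:
        raise ValueError("the up-front strategy never pays off")
    return math.floor(upfront_minutes / saving) + 1


galaxy_upfront = 14 * 60   # one-time RepeatMasker run on Galaxy, in minutes
in_notebook = 3 * 60       # per-family RepeatMasker run in Colab (lower bound)
after_galaxy = 25          # per-family seed alignment after the Galaxy run

print(break_even_families(galaxy_upfront, in_notebook, after_galaxy))  # prints 6
print(total_minutes(6, in_notebook))                   # prints 1080
print(total_minutes(6, after_galaxy, galaxy_upfront))  # prints 990
```

Under these assumptions, the single Galaxy run becomes the faster option once six or more TE families from the same genome are curated.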

Advantages and limitations of ColabCuraTE

Several tools and protocols have been developed to support the manual curation of transposable elements. Notably, Goubert et al. [19] and Storer et al. [18] each provide detailed guides, along with accompanying scripts and software, for navigating the TE curation process. We highly recommend these resources for researchers new to TE curation, as they offer foundational insights and best practices. More recently, tools such as Earl Grey [7], MCHelper [12], HiTE [13], and TEtrimmer [14] have been developed to automate portions of the curation workflow, significantly improving the scalability of generating curated TE libraries in a time-efficient manner. However, these tools typically require computational proficiency and access to computing infrastructure. ColabCuraTE offers a distinct advantage in its accessibility and ease of use. To our knowledge, it is the first TE curation pipeline specifically designed for users with limited computational experience or resources. Because it runs entirely within the Google Colab environment, it eliminates the need for local installation or high-performance computing. Its intuitive, form-based interface guides users step-by-step through the curation process, making it particularly well-suited for students, educators, and researchers at primarily undergraduate institutions or in other resource-constrained settings.

While ColabCuraTE is not a practical solution for a single researcher or laboratory to generate a fully curated TE library for a specific organism, this pipeline can be scaled up to curate hundreds of TE families by employing it as part of a CURE. Based on the runtimes in Table 2, a class of ~30 students could easily curate several hundred TE families in ~10 h of class time, either when working on a megabase-sized genome or when running RepeatMasker via Galaxy for a gigabase-sized genome. Indeed, a previous course-based TE curation effort, using a more computationally advanced approach, curated over 400 tardigrade TE families through the combined effort of 37 students across two working days [20]. These examples suggest that CUREs and crowdsourcing, or “course-sourcing” [20], are viable approaches for large-scale TE curation efforts.

Our pipeline is not without limitations. For example, ColabCuraTE does not include functionality for analyzing or resolving TE subfamily structure. Over evolutionary time, TE families can undergo multiple discrete bursts of high transposition activity [55–57]. TE copies arising from different activity bursts often share unique sequence variants [58–61]. A variety of analysis approaches have been developed to group copies of a single TE family into subgroups associated with distinct bursts of activity [62–66]. Such methods fall under the category of TE subfamily analysis and annotation. However, recent work suggests that TE subfamily annotation can be unreliable [45, 67]. For this reason, we did not incorporate subfamily annotation into the ColabCuraTE pipeline. For users interested in these approaches, we note that they are covered in depth in the recent curation guidelines developed by Storer et al. [18]. Another limitation is the pipeline’s dependence on the Google Colab runtime. While Colab provides free and user-friendly access to cloud computing, it imposes memory and session time constraints that may affect performance when working with large genomes. To address this limitation, we incorporated workarounds using the Galaxy web server. While this approach maintains accessibility, it introduces additional steps that add some complexity to an otherwise streamlined workflow. We also acknowledge that, while the notebook is designed to mount the Google Drive of the user, some users may not want to give Colab read/write access to their Google Drive account for privacy or security reasons. For these users, we recommend two possible workarounds: (1) creating a new Google account that is used only for TE annotation, so that Colab has no access to the files in the user’s primary Google account, or (2) following the alternative instructions provided in the notebook for uploading files directly to the notebook, which obviates the need to mount Google Drive, although the automated saving of intermediate files (e.g., the RepeatMasker output) is not possible with this approach.

Despite these limitations, ColabCuraTE fills an important gap in TE research by enabling participation from undergraduates and researchers with limited computational experience or access to high-performance computing. We see this tool as an ideal entry point for involving students in genomics research, particularly through CUREs and related programs, where students can take ownership of curating TE families from genomes of interest. We piloted ColabCuraTE during a semester-long independent study course at Rutgers University, primarily to test usability and troubleshoot the pipeline. Each student successfully curated a TE family from the Bungarus genome featured in this study, highlighting the tool’s accessibility and educational value. These results underscore ColabCuraTE’s dual utility as both a research and teaching platform. Its integration into CURE-style annotation projects will offer a scalable way to train students in bioinformatics while also contributing valuable data to the broader TE research community.

Conclusions

Manual curation of transposable elements remains a critical step in refining automated annotations of TE libraries into accurate, biologically meaningful representations of the TE families in a genome—an essential foundation for downstream analyses of TE diversity and evolution. However, tools that make this process accessible to users without high-performance computing resources or bioinformatics expertise are limited. ColabCuraTE addresses this gap by providing a free, intuitive, and cloud-based platform designed specifically for users in resource-limited settings or with minimal computational training. While not a replacement for automated pipelines designed for large-scale TE library curation, ColabCuraTE excels at the precise, case-by-case refinement of individual TE families. Its step-by-step interface and lightweight infrastructure enable easier participation for students, educators, and researchers alike. Importantly, it also creates new opportunities for integrating TE research into classroom settings, where participants can meaningfully contribute to the annotation of transposable elements from underexplored genomes—advancing both genomics research and education.

Supplementary Information

Supplementary Material 1 (2.1 MB, PDF)

Acknowledgements

We thank members of the Ellison Lab for their feedback during the development of the pipeline. We especially thank Tim Stanek, Eros Barahona, Graham Ort, Jana Elsawwah, Julia Arscott, Lamia Khondaker, Michaella Schultz, and Soham Banerjee for testing and feedback on early versions of the ColabCuraTE notebook.

Other requirements

ColabCuraTE utilizes the following packages, which are managed through Conda and installed automatically during each virtual runtime in Google Colab: trimAl v1.5.0 [40], PILER v1.0 [41], RepeatModeler2 v2.0.5 [10], RepeatMasker v4.1.9 [15], Biopython v1.85 [68], MAFFT v7.526 [37], BEDTools v2.31.1 [69], EMBOSS v6.6.0 [42], BLAST v2.16.0 [70], HMMER v3.4 [71], Biostrings v2.74.0 [72], react-msaview v2.1.5 [38], and TE-Aid v1.0.0 [19].

License

MIT License

Any restrictions to use by non-academics

None

Abbreviations

TE: Transposable element
MSA: Multiple sequence alignment
CURE: Course-based undergraduate research experience
GEP: Genomics education partnership
SEA-PHAGES: Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science
Colab: Google Colaboratory
TSD: Target site duplication
LTR: Long terminal repeat
TIR: Terminal inverted repeat
bp: Base pair

Authors’ contributions

S.L.T. and C.E.E. contributed to the conception and design of this work. S.L.T., A.K., and C.E.E. contributed to the code development. S.L.T. and C.E.E. contributed to the acquisition, analysis, and interpretation of data. S.L.T. and C.E.E. contributed to manuscript preparation. All authors reviewed and approved the manuscript.

Funding

This work was supported by the National Institute of General Medical Sciences grants R01GM130698 and R35GM152168 to CEE.

Data availability

The testing datasets analyzed during the current study are available from the GitHub repository, https://github.com/Ellison-Lab/ColabCuraTE.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

C.E.E. is a member of the Mobile DNA editorial board.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Bourque G, Burns KH, Gehring M, Gorbunova V, Seluanov A, Hammell M, et al. Ten things you should know about transposable elements. Genome Biol. 2018;19:199.
2. Pasyukova EG, Nuzhdin SV, Morozova TV, Mackay TFC. Accumulation of transposable elements in the genome of Drosophila melanogaster is associated with a decrease in fitness. J Hered. 2004;95:284–90.
3. Montgomery EA, Huang SM, Langley CH, Judd BH. Chromosome rearrangement by ectopic recombination in Drosophila melanogaster: genome structure and evolution. Genetics. 1991;129:1085–98.
4. Cosby RL, Chang N-C, Feschotte C. Host–transposon interactions: conflict, cooperation, and cooption. Genes Dev. 2019;33:1098–116.
5. Chuong EB, Elde NC, Feschotte C. Regulatory activities of transposable elements: from conflicts to benefits. Nat Rev Genet. 2017;18:71–86.
6. Burns KH. Our conflict with transposable elements and its implications for human disease. Annu Rev Pathol. 2020;15:51–70.
7. Baril T, Galbraith J, Hayward A. Earl Grey: a fully automated user-friendly transposable element annotation and analysis pipeline. Mol Biol Evol. 2024;41:msae068.
8. Loreto ELS, Melo ESd, Wallau GL, Gomes TMFF. The good, the bad and the ugly of transposable elements annotation tools. Genet Mol Biol. 2023;46.
9. Rodriguez M, Makałowski W. Software evaluation for de novo detection of transposons. Mob DNA. 2022;13:14.
10. Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci USA. 2020;117:9451–7.
11. Ou S, Su W, Liao Y, Chougule K, Agda JRA, Hellinga AJ, et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 2019;20:275.
12. Orozco-Arias S, Sierra P, Durbin R, González J. MCHelper automatically curates transposable element libraries across eukaryotic species. Genome Res. 2024;34:2256–68.
13. Hu K, Ni P, Xu M, Zou Y, Chang J, Gao X, et al. HiTE: a fast and accurate dynamic boundary adjustment approach for full-length transposable element detection and annotation. Nat Commun. 2024;15:5573.
14. Qian J, Xue H, Ou S, Mann L, Storer J, Fürtauer L, et al. TEtrimmer: a tool to automate the manual curation of transposable elements. Nat Commun. 2025;16:8429.
15. Smit AF, Hubley R, Green P. RepeatMasker. Institute for Systems Biology. Available: https://www.repeatmasker.org/
16. Platt RN, II, Blanco-Berdugo L, Ray DA. Accurate transposable element annotation is vital when analyzing new genome assemblies. Genome Biol Evol. 2016;8:403–10.
17. Gozashti L, Hoekstra HE. Accounting for diverse transposable element landscapes is key to developing and evaluating accurate de novo annotation strategies. Genome Biol. 2024;25:4.
18. Storer JM, Hubley R, Rosen J, Smit AFA. Curation guidelines for de novo generated transposable element families. Curr Protoc. 2021;1:e154.
19. Goubert C, Craig RJ, Bilat AF, Peona V, Vogan AA, Protasio AV. A beginner’s guide to manual curation of transposable elements. Mob DNA. 2022;13:7.
20. Peona V, Martelossi J, Almojil D, Bocharkina J, Brännström I, Brown M, et al. Teaching transposon classification as a means to crowd source the curation of repeat annotation – a tardigrade perspective. Mob DNA. 2024;15:10.
21. Elgin SCR, Hauser C, Holzen TM, Jones C, Kleinschmit A, Leatherman J. The GEP: crowd-sourcing big data analysis with undergraduates. Trends Genet. 2017;33:81–5.
22. Shaffer CD, Alvarez C, Bailey C, Barnard D, Bhalla S, Chandrasekaran C, et al. The genomics education partnership: successful integration of research into laboratory classes at a diverse group of undergraduate institutions. CBE-Life Sci Educ. 2010;9:55–69.
23. Jordan TC, Burnett SH, Carson S, Caruso SM, Clase K, DeJong RJ, et al. A broadly implementable research course in phage discovery and genomics for first-year undergraduate students. mBio. 2014;5:e01051–13.
24. Leung W, Shaffer CD, Chen EJ, Quisenberry TJ, Ko K, Braverman JM, et al. Retrotransposons are the major contributors to the expansion of the Drosophila ananassae Muller F element. G3 Genes|Genomes|Genetics. 2017;7:2439–60.
25. Moya ND, Stevens L, Miller IR, Sokol CE, Galindo JL, Bardas AD, et al. Novel and improved Caenorhabditis briggsae gene models generated by community curation. BMC Genomics. 2023;24:486.
26. Chang WH, Mashouri P, Lozano AX, Johnstone B, Husić M, Olry A, et al. Phenotate: crowdsourcing phenotype annotations as exercises in undergraduate classes. Genet Med. 2020;22:1391–400.
27. Zhou N, Siegel ZD, Zarecor S, Lee N, Campbell DA, Andorf CM, et al. Crowdsourcing image analysis for plant phenomics to generate ground truth data for machine learning. PLoS Comput Biol. 2018;14:e1006337.
28. Singh M, Bhartiya D, Maini J, Sharma M, Singh AR, Kadarkaraisamy S, et al. The zebrafish GenomeWiki: a crowdsourcing approach to connect the long tail for zebrafish gene annotation. Database. 2014;2014.
29. Prost S, Petersen M, Grethlein M, Hahn SJ, Kuschik-Maczollek N, Olesiuk ME, et al. Improving the chromosome-level genome assembly of the Siamese fighting fish (Betta splendens) in a university master’s course. G3 Genes|Genomes|Genetics. 2020;10:2179–83.
30. Prost S, Winter S, De Raad J, Coimbra RTF, Wolf M, Nilsson MA, et al. Education in the genomics era: generating high-quality genome assemblies in university courses. Gigascience. 2020;9:giaa058.
31. Tello-Ruiz MK, Marco CF, Hsu F-M, Khangura RS, Qiao P, Sapkota S, et al. Double triage to identify poorly annotated genes in maize: the missing link in community curation. PLoS One. 2019;14:e0224086.
32. Rödelsperger C, Athanasouli M, Lenuzzi M, Theska T, Sun S, Dardiry M, et al. Crowdsourcing and the feasibility of manual gene annotation: a pilot study in the nematode Pristionchus pacificus. Sci Rep. 2019;9:18789.
33. Elliott TA, Heitkam T, Hubley R, Quesneville H, Suh A, Wheeler TJ. TE hub: a community-oriented space for sharing and connecting tools, data, resources, and methods for transposable element annotation. Mob DNA. 2021;12:16.
34. Storer J, Hubley R, Rosen J, Wheeler J, Smit AF. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob DNA. 2021;12:2.
35. Suh A, Smeds L, Ellegren H. Abundant recent activity of retrovirus-like retrotransposons within and among flycatcher species implies a rich source of structural variation in songbird genomes. Mol Ecol. 2018;27:99–111.
36. Galbraith JD, Ludington AJ, Suh A, Sanders KL, Adelson DL. New environment, new invaders—repeated horizontal transfer of LINEs to sea snakes. Genome Biol Evol. 2020;12:2370–83.
37. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30:772–80.
38. Diesh C, Stevens G, Buels R, Cain S. react-msaview. Available: https://github.com/GMOD/react-msaview
39. Wessler SR. Eukaryotic transposable elements: teaching old genomes new tricks. In: Caporale LH, editor. The implicit genome. Oxford University Press; 2006. pp. 138–65.
40. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–3.
41. Edgar RC, Myers EW. PILER: identification and classification of genomic repeats. Bioinformatics. 2005;21:i152–8.
42. Rice P, Longden I, Bleasby A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000;16:276–7.
43. Ellinghaus D, Kurtz S, Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 2008;9:18.
44. Szak ST, Pickeral OK, Makalowski W, Boguski MS, Landsman D, Boeke JD. Molecular archeology of L1 insertions in the human genome. Genome Biol. 2002;3:research00521.
45. Carey KM, Patterson G, Wheeler TJ. Transposable element subfamily annotation has a reproducibility problem. Mob DNA. 2021;12:4.
46. Grandi FC, An W. Non-LTR retrotransposons and microsatellites. Mob Genet Elem. 2013;3:e25674.
47. Linheiro RS, Bergman CM. Whole genome resequencing reveals natural target site preferences of transposable elements in Drosophila melanogaster. PLoS One. 2012;7:e30008.
48. The Galaxy Community. The Galaxy platform for accessible, reproducible, and collaborative data analyses: 2024 update. Nucleic Acids Res. 2024;52:W83–94.
49. Dunsmuir P, Brorein WJ Jr, Simon MA, Rubin GM. Insertion of the Drosophila transposable element Copia generates a 5 base pair duplication. Cell. 1980;21:575–9.
50. Vassetzky NS, Kramerov DA. SINEBase: a database and tool for SINE analysis. Nucleic Acids Res. 2012;41:D83–9.
51. Kapitonov VV, Jurka J. A novel class of SINE elements derived from 5S rRNA. Mol Biol Evol. 2003;20:694–702.
52. Kohany O, Gentles AJ, Hankus L, Jurka J. Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics. 2006;7:474.
53. Larsson A. AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics. 2014;30:3276–8.
54. Ichiyanagi K, Okada N. Mobility pathways for vertebrate L1, L2, CR1, and RTE clade retrotransposons. Mol Biol Evol. 2008;25:1148–57.
55. Le Rouzic A, Capy P. The first steps of transposable elements invasion: parasitic strategy vs. genetic drift. Genetics. 2005;169:1033–43.
56. Feschotte C, Jiang N, Wessler SR. Plant transposable elements: where genetics meets genomics. Nat Rev Genet. 2002;3:329–41.
57. Belyayev A. Bursts of transposable elements as an evolutionary driving force. J Evol Biol. 2014;27:2573–84.
58. Slagel V, Flemington E, Traina-Dorge V, Bradshaw H, Deininger P. Clustering and subfamily relationships of the Alu family in the human genome. Mol Biol Evol. 1987;4:19–29.
59. Jurka J, Smith T. A fundamental division in the Alu family of repeated sequences. Proc Natl Acad Sci USA. 1988;85:4775–8.
60. Willard C, Nguyen HT, Schmid CW. Existence of at least three distinct Alu subfamilies. J Mol Evol. 1987;26:180–6.
61. Krane DE, Clark AG, Cheng JF, Hardison RC. Subfamily relationships and clustering of rabbit C repeats. Mol Biol Evol. 1991;8:1–30.
62. Storer JM, Walker JA, Baker JN, Hossain S, Roos C, Wheeler TJ, et al. Framework of the Alu subfamily evolution in the platyrrhine three-family clade of Cebidae, Callithrichidae, and Aotidae. Genes. 2023;14:249.
63. Britten RJ, Baron WF, Stout DB, Davidson EH. Sources and evolution of human Alu repeated sequences. Proc Natl Acad Sci USA. 1988;85:4770–4.
64. Jurka J, Milosavljevic A. Reconstruction and analysis of human Alu genes. J Mol Evol. 1991;32:105–21.
65. Hubley R, Siegel A, Smit AF. COSEG. Institute for Systems Biology. Available: https://www.repeatmasker.org/COSEGDownload.html
66. Price AL, Eskin E, Pevzner PA. Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res. 2004;14:2245–52.
67. Wacholder AC, Cox C, Meyer TJ, Ruggiero RP, Vemulapalli V, Damert A, et al. Inference of transposable element ancestry. PLoS Genet. 2014;10:e1004482.
68. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–3.
69. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.
70. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
71. Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7:e1002195.
72. Pagès H, Aboyoun P, Gentleman R, DebRoy S. Biostrings: efficient manipulation of biological strings. Available: https://bioconductor.org/packages/Biostrings

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Material 1^{(2.1MB, pdf)}

Data Availability Statement

The testing datasets analyzed during the current study are available from the GitHub repository, https://github.com/Ellison-Lab/ColabCuraTE.

[CR1] 1.Bourque G, Burns KH, Gehring M, Gorbunova V, Seluanov A, Hammell M, et al. Ten things you should know about transposable elements. Genome Biol. 2018;19:199. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] 2.Pasyukova EG, Nuzhdin SV, Morozova TV, Mackay TFC. Accumulation of transposable elements in the genome of Drosophila melanogaster is associated with a decrease in fitness. J Hered. 2004;95:284–90. [DOI] [PubMed] [Google Scholar]

[CR3] 3.Montgomery EA, Huang SM, Langley CH, Judd BH. Chromosome rearrangement by ectopic recombination in Drosophila melanogaster: genome structure and evolution. Genetics. 1991;129:1085–98. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] 4.Cosby RL, Chang N-C, Feschotte C. Host–transposon interactions: conflict, cooperation, and cooption. Genes Dev. 2019;33:1098–116. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR5] 5.Chuong EB, Elde NC, Feschotte C. Regulatory activities of transposable elements: from conflicts to benefits. Nat Rev Genet. 2017;18:71–86. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR6] 6.Burns KH. Our conflict with transposable elements and its implications for human disease. Annu Rev Pathol. 2020;15:51–70. [DOI] [PubMed] [Google Scholar]

[CR7] 7.Baril T, Galbraith J, Hayward A. Earl grey: a fully automated user-friendly transposable element annotation and analysis pipeline. Mol Biol Evol. 2024;41:msae068. [DOI] [PMC free article] [PubMed]

[CR8] 8.Loreto ELS, Melo ESd, Wallau GL, Gomes TMFF. The good, the bad and the ugly of transposable elements annotation tools. Genet Mol Biol. 2023;46. [DOI] [PMC free article] [PubMed]

[CR9] 9.Rodriguez M, Makałowski W. Software evaluation for de novo detection of transposons. Mob DNA. 2022;13:14. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] 10.Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci USA. 2020;117:9451–7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] 11.Ou S, Su W, Liao Y, Chougule K, Agda JRA, Hellinga AJ, et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 2019;20:275. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] 12.Orozco-Arias S, Sierra P, Durbin R, González J. Mchelper automatically curates transposable element libraries across eukaryotic species. Genome Res. 2024;34:2256–68. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.Hu K, Ni P, Xu M, Zou Y, Chang J, Gao X, et al. Hite: a fast and accurate dynamic boundary adjustment approach for full-length transposable element detection and annotation. Nat Commun. 2024;15:5573. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] 14.Qian J, Xue H, Ou S, Mann L, Storer J, Fürtauer L, et al. Tetrimmer: a tool to automate the manual curation of transposable elements. Nat Commun. 2025;16:8429. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR15] 15.Smit AF, Hubley R, Green P. RepeatMasker. Institute for Systems Biology. Available: https://www.repeatmasker.org/

[CR16] 16.Platt RN, II, Blanco-Berdugo L, Ray DA. Accurate transposable element annotation is vital when analyzing new genome assemblies. Genome Biol Evol. 2016;8:403–10. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR17] 17.Gozashti L, Hoekstra HE. Accounting for diverse transposable element landscapes is key to developing and evaluating accurate de novo annotation strategies. Genome Biol. 2024;25:4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR18] 18.Storer JM, Hubley R, Rosen J, Smit AFA. Curation guidelines for de novo generated transposable element families. Curr Protoc. 2021;1:e154. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR19] 19.Goubert C, Craig RJ, Bilat AF, Peona V, Vogan AA, Protasio AV. A beginner’s guide to manual curation of transposable elements. Mob DNA. 2022;13:7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR20] 20.Peona V, Martelossi J, Almojil D, Bocharkina J, Brännström I, Brown M, et al. Teaching transposon classification as a means to crowd source the curation of repeat annotation – a tardigrade perspective. Mob DNA. 2024;15:10. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR21] 21.Elgin SCR, Hauser C, Holzen TM, Jones C, Kleinschmit A, Leatherman J. The GEP: crowd-sourcing big data analysis with undergraduates. Trends Genet. 2017;33:81–5. [DOI] [PubMed] [Google Scholar]

22. Shaffer CD, Alvarez C, Bailey C, Barnard D, Bhalla S, Chandrasekaran C, et al. The genomics education partnership: successful integration of research into laboratory classes at a diverse group of undergraduate institutions. CBE-Life Sci Educ. 2010;9:55–69.

23. Jordan TC, Burnett SH, Carson S, Caruso SM, Clase K, DeJong RJ, et al. A broadly implementable research course in phage discovery and genomics for first-year undergraduate students. mBio. 2014;5:e01051-13.

24. Leung W, Shaffer CD, Chen EJ, Quisenberry TJ, Ko K, Braverman JM, et al. Retrotransposons are the major contributors to the expansion of the Drosophila ananassae Muller F element. G3 Genes|Genomes|Genetics. 2017;7:2439–60.

25. Moya ND, Stevens L, Miller IR, Sokol CE, Galindo JL, Bardas AD, et al. Novel and improved Caenorhabditis briggsae gene models generated by community curation. BMC Genomics. 2023;24:486.

26. Chang WH, Mashouri P, Lozano AX, Johnstone B, Husić M, Olry A, et al. Phenotate: crowdsourcing phenotype annotations as exercises in undergraduate classes. Genet Med. 2020;22:1391–400.

27. Zhou N, Siegel ZD, Zarecor S, Lee N, Campbell DA, Andorf CM, et al. Crowdsourcing image analysis for plant phenomics to generate ground truth data for machine learning. PLoS Comput Biol. 2018;14:e1006337.

28. Singh M, Bhartiya D, Maini J, Sharma M, Singh AR, Kadarkaraisamy S, et al. The zebrafish GenomeWiki: a crowdsourcing approach to connect the long tail for zebrafish gene annotation. Database. 2014;2014.

29. Prost S, Petersen M, Grethlein M, Hahn SJ, Kuschik-Maczollek N, Olesiuk ME, et al. Improving the chromosome-level genome assembly of the Siamese fighting fish (Betta splendens) in a university master’s course. G3 Genes|Genomes|Genetics. 2020;10:2179–83.

30. Prost S, Winter S, De Raad J, Coimbra RTF, Wolf M, Nilsson MA, et al. Education in the genomics era: generating high-quality genome assemblies in university courses. Gigascience. 2020;9:giaa058.

31. Tello-Ruiz MK, Marco CF, Hsu F-M, Khangura RS, Qiao P, Sapkota S, et al. Double triage to identify poorly annotated genes in maize: the missing link in community curation. PLoS One. 2019;14:e0224086.

32. Rödelsperger C, Athanasouli M, Lenuzzi M, Theska T, Sun S, Dardiry M, et al. Crowdsourcing and the feasibility of manual gene annotation: a pilot study in the nematode Pristionchus pacificus. Sci Rep. 2019;9:18789.

33. Elliott TA, Heitkam T, Hubley R, Quesneville H, Suh A, Wheeler TJ. TE Hub: a community-oriented space for sharing and connecting tools, data, resources, and methods for transposable element annotation. Mob DNA. 2021;12:16.

34. Storer J, Hubley R, Rosen J, Wheeler J, Smit AF. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob DNA. 2021;12:2.

35. Suh A, Smeds L, Ellegren H. Abundant recent activity of retrovirus-like retrotransposons within and among flycatcher species implies a rich source of structural variation in songbird genomes. Mol Ecol. 2018;27:99–111.

36. Galbraith JD, Ludington AJ, Suh A, Sanders KL, Adelson DL. New environment, new invaders—repeated horizontal transfer of LINEs to sea snakes. Genome Biol Evol. 2020;12:2370–83.

37. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30:772–80.

38. Diesh C, Stevens G, Buels R, Cain S. react-msaview. Available: https://github.com/GMOD/react-msaview

39. Wessler SR. Eukaryotic transposable elements: teaching old genomes new tricks. In: Caporale LH, editor. The implicit genome. Oxford University Press; 2006. pp. 138–65.

40. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–3.

41. Edgar RC, Myers EW. PILER: identification and classification of genomic repeats. Bioinformatics. 2005;21:i152–8.

42. Rice P, Longden I, Bleasby A. EMBOSS: the European molecular biology open software suite. Trends Genet. 2000;16:276–7.

43. Ellinghaus D, Kurtz S, Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 2008;9:18.

44. Szak ST, Pickeral OK, Makalowski W, Boguski MS, Landsman D, Boeke JD. Molecular archeology of L1 insertions in the human genome. Genome Biol. 2002;3:research0052.1.

45. Carey KM, Patterson G, Wheeler TJ. Transposable element subfamily annotation has a reproducibility problem. Mob DNA. 2021;12:4.

46. Grandi FC, An W. Non-LTR retrotransposons and microsatellites. Mob Genet Elem. 2013;3:e25674.

47. Linheiro RS, Bergman CM. Whole genome resequencing reveals natural target site preferences of transposable elements in Drosophila melanogaster. PLoS One. 2012;7:e30008.

48. The Galaxy Community. The Galaxy platform for accessible, reproducible, and collaborative data analyses: 2024 update. Nucleic Acids Res. 2024;52:W83–94.

49. Dunsmuir P, Brorein WJ Jr, Simon MA, Rubin GM. Insertion of the Drosophila transposable element Copia generates a 5 base pair duplication. Cell. 1980;21:575–9.

50. Vassetzky NS, Kramerov DA. SINEBase: a database and tool for SINE analysis. Nucleic Acids Res. 2012;41:D83–9.

51. Kapitonov VV, Jurka J. A novel class of SINE elements derived from 5S rRNA. Mol Biol Evol. 2003;20:694–702.

52. Kohany O, Gentles AJ, Hankus L, Jurka J. Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor. BMC Bioinformatics. 2006;7:474.

53. Larsson A. AliView: a fast and lightweight alignment viewer and editor for large datasets. Bioinformatics. 2014;30:3276–8.

54. Ichiyanagi K, Okada N. Mobility pathways for vertebrate L1, L2, CR1, and RTE clade retrotransposons. Mol Biol Evol. 2008;25:1148–57.

55. Le Rouzic A, Capy P. The first steps of transposable elements invasion: parasitic strategy vs. genetic drift. Genetics. 2005;169:1033–43.

56. Feschotte C, Jiang N, Wessler SR. Plant transposable elements: where genetics meets genomics. Nat Rev Genet. 2002;3:329–41.

57. Belyayev A. Bursts of transposable elements as an evolutionary driving force. J Evol Biol. 2014;27:2573–84.

58. Slagel V, Flemington E, Traina-Dorge V, Bradshaw H, Deininger P. Clustering and subfamily relationships of the Alu family in the human genome. Mol Biol Evol. 1987;4:19–29.

59. Jurka J, Smith T. A fundamental division in the Alu family of repeated sequences. Proc Natl Acad Sci USA. 1988;85:4775–8.

60. Willard C, Nguyen HT, Schmid CW. Existence of at least three distinct Alu subfamilies. J Mol Evol. 1987;26:180–6.

61. Krane DE, Clark AG, Cheng JF, Hardison RC. Subfamily relationships and clustering of rabbit C repeats. Mol Biol Evol. 1991;8:1–30.

62. Storer JM, Walker JA, Baker JN, Hossain S, Roos C, Wheeler TJ, et al. Framework of the Alu subfamily evolution in the platyrrhine three-family clade of Cebidae, Callithrichidae, and Aotidae. Genes. 2023;14:249.

63. Britten RJ, Baron WF, Stout DB, Davidson EH. Sources and evolution of human Alu repeated sequences. Proc Natl Acad Sci USA. 1988;85:4770–4.

64. Jurka J, Milosavljevic A. Reconstruction and analysis of human Alu genes. J Mol Evol. 1991;32:105–21.

65. Hubley R, Siegel A, Smit AF. COSEG. Institute for Systems Biology. Available: https://www.repeatmasker.org/COSEGDownload.html

66. Price AL, Eskin E, Pevzner PA. Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res. 2004;14:2245–52.

67. Wacholder AC, Cox C, Meyer TJ, Ruggiero RP, Vemulapalli V, Damert A, et al. Inference of transposable element ancestry. PLoS Genet. 2014;10:e1004482.

68. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–3.

69. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.

70. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.

71. Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7:e1002195.

72. Pagès H, Aboyoun P, Gentleman R, DebRoy S. Biostrings: efficient manipulation of biological strings. Available: https://bioconductor.org/packages/Biostrings
