NAR Genomics and Bioinformatics. 2025 Jul 24;7(3):lqaf105. doi: 10.1093/nargab/lqaf105

Targeted decontamination of sequencing data with CLEAN

Marie Lataretu 1,2, Sebastian Krautwurst 3, Matthew R Huska 4, Mike Marquet 5, Adrian Viehweger 6, Sascha D Braun 7, Christian Brandt 8, Martin Hölzer 9
PMCID: PMC12288876  PMID: 40708849

Abstract

Many biological and medical questions are answered based on the analysis of sequence data. However, we can find contamination, artificial spike-ins, and overrepresented rRNA (ribosomal RNA) sequences in various read collections and assemblies. In particular, spike-ins used as controls, such as those known from Illumina or Nanopore data, are often not considered as contaminants and are therefore not appropriately removed during analyses. Additionally, removing human host DNA may be necessary for data protection and ethical considerations to ensure that individuals cannot be identified. We developed CLEAN, a pipeline to remove unwanted sequences from both long- and short-read sequencing techniques. While focusing on Illumina and Nanopore data with their technology-specific control sequences, the pipeline can also be used for host decontamination of metagenomic reads and assemblies, or the removal of rRNA from RNA-Seq data. The results are the purified sequences and the sequences identified as contaminated, with statistics summarized in a report. The output can be used directly in subsequent analyses, resulting in faster computations and improved results. Although decontamination seems mundane, many contaminants are routinely overlooked or are cleaned by steps that are not fully reproducible or are difficult to trace. CLEAN facilitates reproducible, platform-independent data analysis in genomics and transcriptomics and is freely available at https://github.com/rki-mf1/clean under a BSD3 license.

Graphical Abstract

Introduction

Next-generation sequencing and third-generation sequencing, commonly referred to as sequencing technologies, require quality control of the raw data. While most bioinformatics tools trim low-quality bases and remove adapter sequences, a critical step often overlooked is identifying DNA/RNA contamination from multiple sources [1], which can occur naturally or due to various factors. Contamination can arise during sample collection, shipping, or the sequencing library preparation process [2, 3]. Additionally, control sequences are often introduced to calibrate basecalling and monitor run quality. However, these controls have been found to contaminate microbial isolate genomes in public databases: Illumina’s spike-in sequence, PhiX, was shown to be a large-scale contaminant in microbial isolate genomes because the reads were not removed before assembly [4]. Also, we found that the positive control in Oxford Nanopore Technologies (ONT) DNA sequencing, a 3.6-kb standard amplicon known as DCS that maps to the 3′ end of the Lambda phage genome, is mislabeled as Escherichia coli or Klebsiella quasipneumoniae subsp. similipneumoniae plasmid in the NCBI GenBank (CP077071.1, CP092122.1; see Supplementary Figs S1–S3). For ONT native RNA sequencing, a yeast ENO2 Enolase II transcript of strain S288C, YHR174W, functions as a positive control. Spike-in steps are usually optional; however, the information about whether a spike-in was used often does not reach the user working with raw reads.

Aside from addressing intentionally introduced control sequences and known contaminants, there are cases where specific biological sequences must be removed. For instance, in Illumina RNA-Seq samples, it is often essential to remove ribosomal or mitochondrial RNA before read-count normalization and differential gene expression estimation [5]. This is particularly crucial for non-model species without optimized rRNA (ribosomal RNA) depletion kits [6]. Another example is eliminating host sequences, e.g. human sequences in human gut microbiome sequencing data [7].

Numerous tools have been developed for sequence data classification and decontamination, including Kraken 2 [8], Clark [9], Kaiju [10], HoCoRT [11], and Decontam [12], each with its own focus. Other tools specifically target ONT DNA spike-ins, such as nanolyse [13]. Nevertheless, despite the potential benefits in runtime and accuracy, many studies neglect proper read decontamination. As a direct result, we can find contamination omnipresent in genomic resources [14]. One reason might be that the output files of many pipelines cannot be directly used for downstream steps such as assembly or annotation and additional formatting of the files and extraction of the results are needed. We need decontamination tools that can be easily integrated into modern bioinformatics workflows.

To address these challenges, we introduce CLEAN (https://github.com/rki-mf1/clean), an all-in-one decontamination pipeline for short reads, long reads, and any FASTA-formatted sequences. Initially designed for removing Illumina and Nanopore spike-ins and host sequences in metagenomics, CLEAN’s functionality has been extended to also remove or keep user-provided reference sequences. It also simplifies rRNA removal from Illumina RNA-Seq data and offers a streamlined QC report. CLEAN produces intermediate mapping files for further analysis and can be executed easily on various platforms. It uses common output formats, enabling direct integration into downstream analyses, enhancing decontamination in molecular biology research and genomic resources.

Materials and methods

Implementation

We use the workflow manager Nextflow v21.04.0 or higher [15] to manage our workflow, encapsulating each step in Docker [16], Singularity containers, or Conda environments [17] to ensure reproducibility of the results. The modular structure makes it easier for us to update the containers and environments used by CLEAN periodically. CLEAN can be easily installed with a single command—the only prerequisites are Nextflow and one of Docker, Singularity, or Conda. We offer configurations for local execution, LSF and SLURM workload managers, and a simple cloud execution.

Workflow

CLEAN’s input can be single- and paired-end Illumina FASTQ files, ONT or PacBio FASTQ read files, or FASTA files (Fig. 1). The only required parameter is the input file, and users have the option to include a custom contamination reference FASTA file. We provide different external resources for common use cases, e.g. common host genomes, rRNA contamination reference, and spike-in sequences (see the Supplementary data). CLEAN combines all specified contaminants, allowing users to clean both host and spike-in reads in a single step. By default, each input file (FASTQ and/or FASTA) is mapped against the reference with minimap2 v2.26 [18] and specific options for short- or long-read data. For Illumina, we also offer a k-mer-based filtering option with bbduk (sourceforge.net/projects/bbmap) that directly results in clean and contaminated FASTQ files. Alternatively, the user can switch to BWA MEM [19] as short-read mapper. After mapping, we separate mapped from unmapped reads/contigs by the primary alignment with SAMtools [20]. CLEAN generates quality reports for input, clean and contamination files using FastQC (www.bioinformatics.babraham.ac.uk/projects/fastqc/) for Illumina, NanoPlot [21] for long reads, or QUAST [22] for FASTA files. MultiQC [23] summarizes all quality reports and mapping statistics in an HTML report. If minimap2 or BWA MEM were used, CLEAN additionally produces indexed mapping files (BAM) and an indexed contamination reference. If necessary, users can further analyze the results in a genome browser.
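The separation of mapped from unmapped reads by their primary alignment can be illustrated with a minimal Python sketch. This is not CLEAN's implementation (the pipeline uses SAMtools for this step); the function name `split_by_primary_alignment` and the example records are hypothetical, and the sketch assumes single-end reads in plain SAM text. It relies only on the standard SAM flag bits for unmapped (0x4), secondary (0x100), and supplementary (0x800) records.

```python
# Illustrative sketch only; CLEAN itself performs this step with SAMtools.
UNMAPPED = 0x4         # SAM flag bit: read is unmapped
SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


def split_by_primary_alignment(sam_lines):
    """Return (contaminated_ids, clean_ids) judged by primary SAM records.

    Assumes single-end reads: each read has exactly one primary record,
    which is either mapped (contamination) or unmapped (clean).
    """
    contaminated, clean = set(), set()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # only the primary record decides the read's fate
        if flag & UNMAPPED:
            clean.add(name)
        else:
            contaminated.add(name)
    return contaminated, clean


# Hypothetical records mapped against a PhiX-like contamination reference.
example_sam = [
    "@SQ\tSN:phiX\tLN:5386",
    "read1\t0\tphiX\t100\t60\t50M\t*\t0\t0\t*\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "read1\t2048\tphiX\t900\t60\t20M30H\t*\t0\t0\t*\t*",
    "read3\t16\tphiX\t10\t60\t50M\t*\t0\t0\t*\t*",
]
contaminated, clean = split_by_primary_alignment(example_sam)
print(sorted(contaminated), sorted(clean))
```

In this toy input, the supplementary record of `read1` is ignored, so each read is assigned to exactly one output set.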

Figure 1.

Schematic overview of the CLEAN workflow. Gray/blurred elements are optional and depend on the user input. The pipeline can search multiple FASTA or FASTQ inputs against a user-defined set of reference sequences (potential contamination). CLEAN automatically combines different user-defined FASTA reference sequences, built-in spike-in controls, and downloadable host species into one mapping index for decontamination. The user can also specify FASTA files comprising sequences that should explicitly not be counted as contamination. The output is finally filtered to provide well-formatted FASTA or FASTQ files for direct downstream analyses. The icons and diagram components that make up the schematic view were originally designed by James A. Fellows Yates and nf-core under a CC0 license (public domain).

We want to highlight three parameter options. First, for ONT data and the DCS control, dcs_strict exclusively considers reads that align to the DCS and cover at least one of its artificial ends as contamination. This prevents inadvertent removal of similar phage DNA that might actually belong to a metagenomic sample. Second, with min_clip, mapped reads are filtered by the total length (sum of both ends) of the soft-clipped positions. If min_clip ≥ 1, the total number of soft-clipped positions is considered; otherwise, the fraction of soft-clipped positions relative to the read length is used. Third, the user can specify FASTA files with keep. Input reads are then separately mapped to this reference. If a read maps to the “keep” reference but was classified as contamination before, CLEAN moves the read to the set of clean reads. This feature helps mitigate false-positive contaminant calls, particularly when dealing with closely related species or metagenomic samples.
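The min_clip and keep logic described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CLEAN's code: the function names are hypothetical, the strict greater-than comparison is an assumption, and how CLEAN handles reads above the threshold is not specified in this section. Soft-clipped positions are read from the `S` operations of a SAM CIGAR string.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar):
    """Sum the soft-clipped (S) bases over both ends of a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op == "S")


def exceeds_min_clip(cigar, read_length, min_clip):
    """Compare soft clipping against min_clip.

    min_clip >= 1 is read as an absolute number of positions,
    min_clip < 1 as a fraction of the read length.
    Assumption: the comparison is strictly greater-than.
    """
    clipped = soft_clipped_bases(cigar)
    value = clipped if min_clip >= 1 else clipped / read_length
    return value > min_clip


def apply_keep(contaminated, clean, maps_to_keep):
    """Move reads that map to the 'keep' reference back to the clean set."""
    rescued = contaminated & maps_to_keep
    return contaminated - rescued, clean | rescued


# Hypothetical read: 100 bases, 10 soft-clipped at each end.
print(exceeds_min_clip("10S80M10S", 100, 0.1))  # fraction 0.2 vs 0.1
print(exceeds_min_clip("10S80M10S", 100, 30))   # 20 positions vs 30

contaminated, clean = apply_keep({"r1", "r2"}, {"r3"}, {"r2", "r3"})
print(sorted(contaminated), sorted(clean))
```

Expressing the keep step as set operations makes the rule explicit: only reads previously classified as contamination can be rescued, and reads that were already clean are unaffected.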

Test datasets and computations

We applied CLEAN to five distinct case studies to evaluate the pipeline’s ability to decontaminate sequencing data. Detailed descriptions of the underlying public sequencing datasets, novel sequencing data generated for this study, and all computational steps can be found in the Supplementary data.

Case study I: removal of cell cultivation contamination

Nanopore and Illumina data from two previously published Chlamydiifrater isolates [24] and two novel isolates were decontaminated using a combined reference comprising the genome of Chlorocebus sabaeus (green monkey) and the mitochondrial DNA genome of Chlorocebus pygerythrus to remove host-derived contamination resulting from cultivation. Subsequently, we used the cleaned data to reconstruct hybrid de novo assemblies.

Case study II: decontamination in Nanopore native RNA sequencing

Direct RNA (dRNA) sequencing data from HCoV-229E-infected Huh7 cells [25] were processed using CLEAN to distinguish viral, yeast, and human reads, with results compared to manual assignment.

Case study III: rRNA removal from Illumina RNA-Seq data

CLEAN’s performance for rRNA removal was assessed against SortMeRNA [26] using seven simulated Illumina datasets and one real RNA-Seq sample from a bat transcriptome study [6], with runtime and accuracy comparisons.

Case study IV: human DNA spike-in removal from bacterial isolates after Nanopore adaptive sequencing

Mixed DNA samples with varying human-to-bacteria ratios, spanning five bacterial species, were sequenced on the Nanopore platform using adaptive sequencing to deplete human sequences in real time. Then, we used CLEAN to remove remaining human contamination after adaptive sequencing, facilitating accurate taxonomic classification and comparability between samples.

Case study V: large-scale SARS-CoV-2 data decontamination

We used CLEAN to process 3866 SARS-CoV-2 Nanopore amplicon datasets to ensure the removal of human contamination (including for personal data protection) while retaining viral reads using the “keep” parameter for subsequent upload to the European Nucleotide Archive. For scalability, the pipeline was executed on a high-performance computing cluster.

Results

Case study I: removal of cell cultivation contamination from Nanopore- and Illumina-sequenced Chlamydiaceae

The polished assemblies based on cleaned reads reveal 1.19-Mb circular genomes and 6-kb plasmids for each of the four Chlamydiifrater isolates. Without prior decontamination of the cell line DNA, contigs belonging to Chlorocebus species can be found in the final assemblies. With an older version of Unicycler, assembling the raw reads without a prior CLEAN step also yields more fragmented final assemblies, likely due to the inflated complexity of the initial short-read graph. A newer version of Unicycler resolved the fragmentation, but contigs belonging to the host cell line could still be found. Thus, decontamination of DNA belonging to a host cell line can improve the general assembly process and results in a much cleaner assembly.

Case study II: yeast enolase is a highly abundant spike-in control in Nanopore native RNA-Seq data

Nanopore sequencing is currently the only technology that allows the sequencing of native RNA strands without requiring a complementary DNA intermediate [27]. This “direct RNA” protocol includes the addition of a calibration strand (amplified RNA sequences of the Saccharomyces cerevisiae Enolase 2 mRNA, GenBank, NP_012044.1) as a spike-in positive control. Depending on the concentration of sample input RNA, this spike-in can represent a substantial fraction of the sequenced reads. In our study of direct RNA sequencing of human coronavirus genomes [25], these sequences made up 15.8% and 10.2% of the two samples, respectively. Due to algorithmic advances, re-basecalling the raw data with version 4.0.11 of the Guppy basecaller (RNA models are unchanged since then) yields more reads and a higher fraction of spike-in reads (31.4% and 31.0%, see Fig. 2). Guppy does not filter these with default parameters but has an optional parameter (calib_detect) to enable detection and filtering of calibration strand reads. However, we found that this functionality does not adequately detect spike-in reads: 35.4% and 19.8% of spike-in reads were still present when using this parameter. Applying CLEAN to this dataset removes all calibration strand reads, while preserving human and human coronavirus HCoV-229E reads (see Fig. 2).

Figure 2.

Number of reads mapping to the human genome, HCoV-229E, or S. cerevisiae Enolase 2 (from bottom to top) for two HCoV-229E samples, WT (left) and SL2 (right), after Guppy (default parameters), Guppy with calib_detect, or after CLEAN usage. Only CLEAN is able to remove all reads originating from the dRNA control sequence (ENO2), while, as expected, the percentage of reads mapped to the human genome and HCoV-229E has not changed. WT, wild type sample; SL2, sample with different RNA secondary structure; ENO2, S. cerevisiae Enolase 2 (dRNA control sequence).

Generally, if a positive control is not needed for the experiment, we suggest skipping the addition of this spike-in. This can increase the yield of desired RNA reads by freeing up throughput capacity. For all direct RNA read data with added spike-in, we propose using CLEAN to remove these sequences reliably and quickly before downstream analyses are performed.

Case study III: speeding up an everyday task in transcriptomics—removal of rRNA from Illumina RNA-Seq data

CLEAN performs equally well compared to SortMeRNA for the non-rRNA dataset: CLEAN’s recall is slightly better (<0.001) than SortMeRNA’s. CLEAN’s recall for the six rRNA datasets is, on average, 0.03 lower (minimum 0.01, maximum 0.06) than SortMeRNA (see Supplementary Table S1). On the real data sample, CLEAN runs ∼1.7-fold faster than SortMeRNA (see Supplementary Fig. S4). Results vary slightly with <0.014% divergence.
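For simulated data, where the origin of every read is known, recall can be computed directly from read identifiers. The following minimal sketch shows the standard definition (true positives divided by all true members of the class); the function name and the toy read IDs are hypothetical, and which class is counted as positive for each dataset in Supplementary Table S1 is not stated in this section.

```python
def recall(true_ids, predicted_ids):
    """Fraction of truly positive reads that the tool also reported.

    recall = TP / (TP + FN), with both arguments given as read-ID collections.
    """
    true_ids = set(true_ids)
    if not true_ids:
        raise ValueError("recall is undefined without any true positives")
    true_positives = true_ids & set(predicted_ids)
    return len(true_positives) / len(true_ids)


# Toy example: four simulated rRNA reads, three of them recovered,
# plus one non-rRNA read ("x9") wrongly reported.
simulated_rrna = {"r1", "r2", "r3", "r4"}
reported_rrna = {"r1", "r2", "r3", "x9"}
print(recall(simulated_rrna, reported_rrna))  # 0.75
```

Note that the wrongly reported read lowers precision but does not affect recall, which is why a recall comparison alone does not capture false removals.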

Case study IV: decontamination of human spike-in DNA from bacteria isolates

Nanopore sequencing allows for the selective depletion or enrichment of target sequences in real time during the sequencing run. Here, we created an ONT dataset that includes five different bacterial species, to which four different concentrations of human DNA (commercially bought, accession GCA_011064465.2) were added and sequenced with real-time depletion enabled. However, selective sequencing does not remove 100% of the target (human) DNA and, in particular, results in shorter reads of the targeted sequence (as sequencing stops as soon as a classification is possible). Therefore, we used CLEAN to remove the remaining human DNA contamination from the bacterial samples.

Overall, we observed a 99.07% reduction in human reads after applying CLEAN to the sequencing results. Reads before and after CLEAN were first adapter-trimmed with porechop (v0.3.2pre, https://github.com/artic-network/Porechop), and then classified with Kraken 2 (a k-mer-based method), to identify and quantify remaining human reads. Supplementary Table S2 lists the total number of reads per sample, those that could be taxonomically classified as human using Kraken 2, and the percentage of reads remaining after applying CLEAN. To further validate the remaining ∼1% human reads, classified by Kraken 2 but not removed by CLEAN (Supplementary Fig. S5), we mapped them back to the matching human reference genome of the commercially bought DNA (accession GCA_011064465.2) using minimap2 (v2.24). Remarkably, only 1/10th could be successfully aligned to the genome (see Supplementary Table S3), explaining why CLEAN’s mapping-based approach did not identify them as human. Therefore, we assume that the discrepancy might be related to the k-mer-based classification of Kraken 2 on these few remaining sequencing reads. Figure 3 illustrates the amount of human DNA spike-in remaining in the Acinetobacter pittii (Ap) sample that could not be completely removed by selective sequencing and the level of decontamination achieved by CLEAN (see Supplementary Fig. S5 for all samples). Our results show that CLEAN can be used to effectively remove human contamination from bacterial datasets. However, a small number of reads can still be annotated as human, which could be due to algorithmic and/or reference biases.
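As a worked illustration of how such a reduction figure is derived, the sketch below computes the percentage reduction from read counts before and after decontamination. The counts are made up to reproduce the reported 99.07%; the actual per-sample counts are in Supplementary Table S2, and whether the overall value was pooled across samples or averaged is not stated here.

```python
def percent_reduction(before, after):
    """Percentage of reads removed, relative to the count before cleaning."""
    if before <= 0:
        raise ValueError("the count before cleaning must be positive")
    return 100.0 * (before - after) / before


# Hypothetical counts of reads classified as human by Kraken 2.
human_before = 10_000
human_after = 93
print(round(percent_reduction(human_before, human_after), 2))  # 99.07
```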

Figure 3.

Abundances of reads per sequencing run for A. pittii (Ap) before and after CLEAN, after removal of potentially remaining adapters and barcodes with porechop and subsequent Kraken 2 and Bracken classification. H, percent of human DNA spike-in; B, percent of bacteria DNA. Supplementary Fig. S5 shows the results for all investigated bacteria.

Case study V: removal of human contamination from a large SARS-CoV-2 genomic surveillance dataset

CLEAN processed all 3866 SARS-CoV-2 amplicon FASTQ files on a high-performance cluster in 2 h 14 min, having used 1547 CPU hours for computation. Across all samples, the mean number of reads that were removed was 274, and median number of reads removed was 61. Only 15 samples had >5% of their reads removed.
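The relation between the reported wall-clock time and CPU hours can be made concrete with a short calculation. The sketch below uses only the figures given above (1547 CPU hours, 2 h 14 min, 3866 samples); the function name is hypothetical, and the result is an average that says nothing about how the cluster actually scheduled the jobs.

```python
def average_parallelism(cpu_hours, wall_hours, wall_minutes=0):
    """Average number of CPU cores busy over the whole run."""
    wall_time_hours = wall_hours + wall_minutes / 60
    return cpu_hours / wall_time_hours


avg_cores = average_parallelism(1547, 2, 14)
cpu_minutes_per_sample = 1547 * 60 / 3866

print(f"average cores in use: {avg_cores:.1f}")
print(f"CPU minutes per sample: {cpu_minutes_per_sample:.1f}")
```

On average, roughly 690 cores were busy during the run, and each sample consumed about 24 CPU minutes.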

Looking at reads that mapped to both the human genome and the SARS-CoV-2 genome, in all cases the reads mapped better to the SARS-CoV-2 genome: SARS-CoV-2 alignments typically spanned the full length of the read with high similarity, while alignments to the human genome covered only a small fraction of the read. Because of this, we did not consider those reads to be contamination and kept them using CLEAN’s keep functionality.

Discussion

We developed CLEAN to easily screen any nucleotide sequences against reference sequences to identify and remove potential contamination. This includes common tasks such as the removal of positive controls added during library preparation, host contamination, or rRNAs. Decontamination with CLEAN can be easily used as a preprocessing step before a main analysis since the output needs no further processing or reformatting. By default, the pipeline uses alignment-based approaches for short and long reads that subsequently also allow for the inspection of the reads aligned to a potential contamination reference in more detail. Furthermore, CLEAN provides quality control reports for more insights. CLEAN is freely available at https://github.com/rki-mf1/clean and can be easily installed and executed using Nextflow.

A recently published benchmark study introduced nf-core/detaxizer, a tool for decontaminating human sequences, and compared its performance with two other tools, including CLEAN [28]. It was found that the choice of tool and database in particular can result in many times more human data not being removed, underscoring the importance of careful selection of tools and reference genomes.

In this sense, CLEAN enables the easy use and combination of different reference genomes to allow tailored decontamination. For example, instead of the GRCh38 human reference genome, the recently published and more complete T2T-CHM13 genome [29] can be used and automatically downloaded by CLEAN. This is of great importance, because insufficient filtering of host material with earlier human reference genomes can lead to false sex biases and can unintentionally let host-specific DNA pass through bioinformatics analyses, where it could be exploited to identify individuals [30].

Limitations

CLEAN cannot be used to remove unexpected contaminants. For such a task, DecontaMiner, a tool to remove contaminating sequences of unmapped reads [31], or QC-Blind, a tool for quality control and contamination screening without a reference genome [32], can be used. Other tools try to find unexpected compositions in metagenomics samples to identify contaminants [12]. We also did not focus on the detection of cross-contamination with CLEAN; for this task, other tools such as ART-DeCo [33] can be used. Although we have shown that CLEAN can be used to remove rRNAs, we recommend specialized tools such as SortMeRNA when runtime is not a limitation. The same applies to tools such as Kraken 2 for taxonomic read classification.

Supplementary Material

lqaf105_Supplemental_File

Acknowledgements

We thank Fabien Vorimore from ANSES, France for sequencing of the two Chlamydiifrater strains and providing the raw data for our benchmark. We thank Stephan Fuchs from RKI, Germany, for fruitful discussions.

Author contributions: Conceptualization: M.H., C.B., A.V.; Software: M.L., M.H., M.R.H.; Formal analysis: M.H. (case study I), S.K. (case study II), M.L. (case study III), M.M. (case study IV), C.B. (case study IV), M.R.H. (case study V); Investigation: S.D.B. (bacteria samples with human spike-in DNA and nanopore sequencing); Writing—original draft: M.H., M.L.; Writing—review & editing: M.L., S.K., M.R.H., M.M., A.V., S.D.B., C.B., M.H.; Visualization: M.L., S.K., M.M., C.B.; Supervision: M.H.

Contributor Information

Marie Lataretu, Genome Competence Center, Robert Koch Institute, 13353 Berlin, Germany; RNA Bioinformatics and High-Throughput Analysis, University of Jena, 07743 Jena, Germany.

Sebastian Krautwurst, RNA Bioinformatics and High-Throughput Analysis, University of Jena, 07743 Jena, Germany.

Matthew R Huska, Genome Competence Center, Robert Koch Institute, 13353 Berlin, Germany.

Mike Marquet, Institute for Infectious Diseases and Infection Control, Jena University Hospital, 07747 Jena, Germany.

Adrian Viehweger, Institute of Medical Microbiology and Virology, University Hospital Leipzig, 04103 Leipzig, Germany.

Sascha D Braun, Leibniz-Institute of Photonic Technology (Leibniz-IPHT), 07745 Jena, Germany.

Christian Brandt, Institute for Infectious Diseases and Infection Control, Jena University Hospital, 07747 Jena, Germany.

Martin Hölzer, Genome Competence Center, Robert Koch Institute, 13353 Berlin, Germany.

Supplementary data

Supplementary data is available at NAR Genomics & Bioinformatics online.

Conflict of interest

None declared.

Funding

This work was supported by the European Centre for Disease Prevention and Control (grant number 2021/008 ECD.12222 to M.L.) and by the Federal Ministry of Education and Research (BMBF) in the context of the AVATAR project (grant number 16KISA012). The computational experiments were also tested on resources of the Friedrich Schiller University Jena supported in part by DFG grants INST 275/334-1 FUGG and INST 275/363-1 FUGG.

Data availability

CLEAN, including the user manual, is available on GitHub (https://github.com/rki-mf1/clean, DOI: 10.5281/zenodo.14803046) under the open source BSD3 license. All supporting analysis scripts are available in OSF (doi.org/10.17605/OSF.IO/CUXEM). Data used in this work are available in public databases:

Case study I: SRA BioSample IDs: SAMEA6565319 (strain 15-2067_O50), SAMEA6565320 (strain 15-2067_O99), and ENA study accession ID: PRJEB59173 (strains 15-2067_O09 and 15-2067_O77; see also [34]).

Case study II: sequencing data and scripts: doi.org/10.17605/OSF.IO/UP7B4.

Case study III: SRA BioSample ID: SAMN10246232.

Case study IV: ENA study accession ID: PRJNA1199779.

Case study V: ENA study accession ID: PRJEB76939.

References

  • 1. Nieuwenhuis TO, Yang SY, Verma RX et al. Consistent RNA sequencing contamination in GTEx and other data sets. Nat Commun. 2020; 11:1933. 10.1038/s41467-020-15821-9.
  • 2. Chrisman B, He C, Jung J-Y et al. The human “contaminome”: bacterial, viral, and computational contamination in whole genome sequences from 1000 families. Sci Rep. 2022; 12:9863. 10.1038/s41598-022-13269-z.
  • 3. Porter AF, Cobbin J, Li C-X et al. Metagenomic identification of viral sequences in laboratory reagents. Viruses. 2021; 13:2122. 10.3390/v13112122.
  • 4. Mukherjee S, Huntemann M, Ivanova N et al. Large-scale contamination of microbial isolate genomes by Illumina PhiX control. Stand Genomic Sci. 2015; 10:18. 10.1186/1944-3277-10-18.
  • 5. Zhao S, Zhang Y, Gamini R et al. Evaluation of two main RNA-seq approaches for gene quantification in clinical RNA sequencing: polyA+ selection versus rRNA depletion. Sci Rep. 2018; 8:4781. 10.1038/s41598-018-23226-4.
  • 6. Hölzer M, Schoen A, Wulle J et al. Virus- and interferon alpha-induced transcriptomes of cells from the microbat Myotis daubentonii. iScience. 2019; 19:647–61.
  • 7. Almeida A, Mitchell AL, Boland M et al. A new genomic blueprint of the human gut microbiota. Nature. 2019; 568:499–504. 10.1038/s41586-019-0965-1.
  • 8. Lu J, Rincon N, Wood DE et al. Metagenome analysis using the Kraken software suite. Nat Protoc. 2022; 17:2815–39. 10.1038/s41596-022-00738-y.
  • 9. Ounit R, Wanamaker S, Close TJ et al. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics. 2015; 16:236. 10.1186/s12864-015-1419-2.
  • 10. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. 2016; 7:11257. 10.1038/ncomms11257.
  • 11. Rumbavicius I, Rounge TB, Rognes T. HoCoRT: host contamination removal tool. BMC Bioinformatics. 2023; 24:371. 10.1186/s12859-023-05492-w.
  • 12. Davis NM, Proctor DM, Holmes SP et al. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome. 2018; 6:226. 10.1186/s40168-018-0605-2.
  • 13. De Coster W, D’Hert S, Schultz DT et al. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics. 2018; 34:2666–9. 10.1093/bioinformatics/bty149.
  • 14. Steinegger M, Salzberg SL. Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank. Genome Biol. 2020; 21:115. 10.1186/s13059-020-02023-1.
  • 15. Di Tommaso P, Chatzou M, Floden EW et al. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017; 35:316–9. 10.1038/nbt.3820.
  • 16. Boettiger C. An introduction to Docker for reproducible research, with examples from the R environment. arXiv, 2 October 2014, preprint: not peer reviewed. https://arxiv.org/abs/1410.0846.
  • 17. Grüning B, Dale R, Sjödin A et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018; 15:475–6.
  • 18. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018; 34:3094–100. 10.1093/bioinformatics/bty191.
  • 19. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv, 26 May 2013, preprint: not peer reviewed. https://arxiv.org/abs/1303.3997.
  • 20. Danecek P, Bonfield JK, Liddle J et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021; 10:giab008.
  • 21. De Coster W, Rademakers R. NanoPack2: population-scale evaluation of long-read sequencing data. Bioinformatics. 2023; 39:btad311. 10.1093/bioinformatics/btad311.
  • 22. Mikheenko A, Prjibelski A, Saveliev V et al. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics. 2018; 34:i142–50. 10.1093/bioinformatics/bty266.
  • 23. Ewels P, Magnusson M, Lundin S et al. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016; 32:3047–8. 10.1093/bioinformatics/btw354.
  • 24. Vorimore F, Hölzer M, Liebler-Tenorio EM et al. Evidence for the existence of a new genus Chlamydiifrater gen. nov. inside the family Chlamydiaceae with two new species isolated from flamingo (Phoenicopterus roseus): Chlamydiifrater phoenicopteri sp. nov. and Chlamydiifrater volucris sp. nov. Syst Appl Microbiol. 2021; 44:126200. 10.1016/j.syapm.2021.126200.
  • 25. Viehweger A, Krautwurst S, Lamkiewicz K et al. Direct RNA nanopore sequencing of full-length coronavirus genomes provides novel insights into structural variants and enables modification analysis. Genome Res. 2019; 29:1545–54. 10.1101/gr.247064.118.
  • 26. Kopylova E, Noé L, Touzet H. SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics. 2012; 28:3211–7. 10.1093/bioinformatics/bts611.
  • 27. Ergin S, Kherad N, Alagoz M. RNA sequencing and its applications in cancer and rare diseases. Mol Biol Rep. 2022; 49:2325–33. 10.1007/s11033-021-06963-0.
  • 28. Seidel J, Kaipf C, Straub D et al. nf-core/detaxizer: a benchmarking study for decontamination from human sequences. bioRxiv, 30 March 2025, preprint: not peer reviewed. 10.1101/2025.03.27.645632.
  • 29. Nurk S, Koren S, Rhie A et al. The complete sequence of a human genome. Science. 2022; 376:44–53. 10.1126/science.abj6987.
  • 30. Guccione C, Patel L, Tomofuji Y et al. Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data. Nat Commun. 2025; 16:825. 10.1038/s41467-025-56077-5.
  • 31. Sangiovanni M, Granata I, Thind AS et al. From trash to treasure: detecting unexpected contamination in unmapped NGS data. BMC Bioinformatics. 2019; 20:168. 10.1186/s12859-019-2684-x.
  • 32. Xi W, Gao Y, Cheng Z et al. Using QC-Blind for quality control and contamination screening of bacteria DNA sequencing data without reference genome. Front Microbiol. 2019; 10:1560. 10.3389/fmicb.2019.01560.
  • 33. Fiévet A, Bernard V, Tenreiro H et al. ART-DeCo: easy tool for detection and characterization of cross-contamination of DNA samples in diagnostic next-generation sequencing analysis. Eur J Hum Genet. 2019; 27:792–800. 10.1038/s41431-018-0317-x.
  • 34. Hölzer M, Reuschel C, Vorimore F et al. Exploring the genomic landscape of Chlamydiifrater species: novel features include multiple truncated major outer membrane proteins, unique genes and chlamydial plasticity zone orthologs. Access Microbiol. 2025; 7: 10.1099/acmi.0.000936.v3.

Articles from NAR Genomics and Bioinformatics are provided here courtesy of Oxford University Press
