NAR Genomics and Bioinformatics. 2024 May 25;6(2):lqae056. doi: 10.1093/nargab/lqae056

ViralFlow v1.0—a computational workflow for streamlining viral genomic surveillance

Alexandre Freitas da Silva 1,2,c, Antonio Marinho da Silva Neto 3,c, Cleber Furtado Aksenen 4,c, Pedro Miguel Carneiro Jeronimo 5,c, Filipe Zimmer Dezordi 6,7,c, Suzana Porto Almeida 8, Hudson Marques Paula Costa 9, Richard Steiner Salvato 10, Tulio de Lima Campos 11,d, Gabriel da Luz Wallau 12,13,14,d, on behalf of the Fiocruz Genomic Network
PMCID: PMC11127631  PMID: 38800829

Abstract

ViralFlow v1.0 is a computational workflow developed for viral genomic surveillance. Several key changes turned ViralFlow into a general-purpose reference-based genome assembler for all viruses with an available reference genome. New virus-agnostic modules were implemented to further study nucleotide and amino acid mutations. ViralFlow v1.0 runs on a broad range of computational infrastructures, from laptop computers to high-performance computing (HPC) environments, and generates standard and well-formatted outputs suited for both public health reporting and scientific problem-solving. ViralFlow v1.0 is available at: https://viralflow.github.io/index-en.html.

Introduction

Detecting and monitoring the spread of viral pathogens in the human population plays a crucial role in informing public health policies (1–3). Pathogen molecular surveillance is key to the rapid detection and evaluation of emerging and re-emerging lineages with altered phenotypes. In recent years, pathogen genome-wide surveillance has emerged as a powerful tool that allows higher-resolution characterization of complete viral genomes and hence a more complete understanding of viral evolution and the emergence of new viral strains (4–6). The most recent example of the unprecedented amount of viral genomic data and their usefulness for public health comes from the COVID-19 pandemic, during which severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been monitored in near real time across the globe (5,7). Genomic data were used to understand SARS-CoV-2 transmission dynamics and provide key information for the development of effective diagnostics and vaccines (8,9).

High-throughput sequencing (HTS) technologies have enabled dozens of pathogen genomes to be sequenced at once. This has moved the genomic surveillance bottleneck away from sequencing hundreds of genomes to analyzing ever-growing sequence datasets. Genomic surveillance of viral pathogens, like any surveillance system, benefits from rapid and consistent data sharing. However, managing, analyzing and summarizing the results for hundreds of genomes in a timely manner (on a daily or weekly basis) is a daunting task when using separate software solutions. The bioinformatic challenges span several computational steps, including multiple quality checks and intermediate steps that generate a substantial number of files requiring local processing and storage. To facilitate and accelerate the generation of genomic surveillance results, a number of pipelines/workflows have been proposed for processing large-scale SARS-CoV-2 genomic data, such as HAVoC (10), ASPICov (11) and Viralrecon (https://doi.org/10.5281/zenodo.3901628). Additionally, there is a diverse range of tools available for general viral analysis, such as V-pipe (12), ViReflow (13), Lazypipe (14), TRACESPipe (15), QVG (16), Haploflow (17), ASPIRE (18) and others. Some of these tools were developed to handle viral metagenomic data or only specific viruses, and they employ different assembly algorithms, each with its own advantages and limitations. ViralFlow v0.0.6 is one such workflow; it encapsulates several tools specifically tailored for genomic surveillance of SARS-CoV-2 (19). ViralFlow has been extensively used to assemble and analyze SARS-CoV-2 short-read genomic data and is the most used workflow in Brazil reported on GISAID, surpassing many commercial options (29.64% of genomes deposited with a described assembly methodology, as of October 31, 2023). However, despite this wide use, some limitations remain for broader genomic surveillance applications, including its use with other viruses.

In this application note, we describe ViralFlow v1.0 and compare it with its predecessor. This new version was refactored and implemented in a workflow language (NextFlow), which provides several advantages such as better process management, efficient parallelism and continuous checkpointing of processes. These features improve reproducibility and allow new features to be added to the workflow quickly and easily. Furthermore, this new version includes additional virus-agnostic software for mutation analysis and visualization, allowing a more in-depth characterization of the effects of mutations on viral genomes. ViralFlow v1.0 also generates files containing only the viral sequenced reads, ensuring that only viral data are shared. In summary, this new version of ViralFlow is a general reference-based viral genome assembler that can be easily customized for any virus with an available reference genome.

Materials and methods

ViralFlow v0.0.6 code was refactored within the NextFlow workflow system (20), focusing on four pillars: (i) modularity; (ii) simplified installation steps and documentation; (iii) reproducibility; and (iv) transparency of code development and usability. To achieve these aims, we used the ‘module’ functionality of NextFlow to separate the main tasks of ViralFlow v1.0 using Singularity containers (21). This modularity facilitates the integration of new modules and the replacement of existing ones. Only a few command-line steps are required to install ViralFlow v1.0 on both Ubuntu and macOS systems (https://viralflow.github.io/index-en.html). The current version of ViralFlow v1.0 can run analyses in two modes: ‘sars-cov-2’ and ‘custom’. The ‘sars-cov-2’ mode is pre-configured to execute all analyses without the need to configure a reference genome. In this mode, ViralFlow v1.0 relies on the SARS-CoV-2 reference genome (NC_045512.2), which is automatically set up upon selection. The ‘custom’ mode is more flexible, enabling the workflow to analyze any virus with an available genome. In this mode, users specify the reference genome either by providing the NCBI accession number (‘--refGenomeCode’ parameter) for automatic download and configuration, or by providing a FASTA file containing the reference genome (‘--referenceGenome’ parameter) accompanied by a GFF annotation file (‘--referenceGFF’ parameter), which carries the gene annotation needed for further analysis. Users can configure additional parameters according to their needs, e.g. activating (true) or deactivating (false) functions such as the prediction and annotation of mutations using SnpEff (‘--runSnpEff’ parameter) and the output of mapped reads (‘--writeMappedReads’ parameter).
Performance parameters can also be adjusted, including ‘--nextflowSimCalls’, which sets the number of simultaneous calls that NextFlow will handle, as well as the ‘--fastp_threads’ and ‘--mafft_threads’ parameters, which default to a single thread but can be increased for improved performance, if necessary. As ViralFlow v1.0 is based on NextFlow, users can invoke it directly using the main workflow module located at ‘~/ViralFlow/vfnext/main.nf’ or use the ViralFlow wrapper configured within the ViralFlow environment together with the adjusted parameters file. Comprehensive documentation and running mode examples are available on the ViralFlow website.
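
The parameters described above can be collected into a single NextFlow invocation. The following is a minimal Python sketch, assuming that each parameter is passed to `nextflow run` as `--name value`; the helper name `build_viralflow_args` is hypothetical, and parameters that the text does not name (mode selection, input and output directories) are deliberately left out. The ViralFlow documentation remains the reference for the actual interface.

```python
def build_viralflow_args(main_nf, ref_genome_code=None, reference_fasta=None,
                         reference_gff=None, run_snpeff=True,
                         write_mapped_reads=True, sim_calls=1,
                         fastp_threads=1, mafft_threads=1):
    """Assemble an argument list for a custom-mode run (hypothetical helper).

    Assumption: parameters are given on the command line as `--name value`.
    """
    if ref_genome_code is None and (reference_fasta is None or reference_gff is None):
        raise ValueError("provide an NCBI accession, or both a FASTA and a GFF file")

    args = ["nextflow", "run", str(main_nf)]
    if ref_genome_code is not None:
        # Reference is downloaded and configured from its NCBI accession.
        args += ["--refGenomeCode", ref_genome_code]
    else:
        # Reference is supplied locally as FASTA plus GFF annotation.
        args += ["--referenceGenome", str(reference_fasta),
                 "--referenceGFF", str(reference_gff)]

    args += [
        "--runSnpEff", str(run_snpeff).lower(),
        "--writeMappedReads", str(write_mapped_reads).lower(),
        "--nextflowSimCalls", str(sim_calls),
        "--fastp_threads", str(fastp_threads),
        "--mafft_threads", str(mafft_threads),
    ]
    return args


if __name__ == "__main__":
    # Print the command for inspection; nothing is executed here.
    print(" ".join(build_viralflow_args("main.nf", ref_genome_code="NC_045512.2",
                                        sim_calls=4, fastp_threads=2)))
```

Building the argument list separately from running it makes the chosen configuration easy to log alongside the results.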

The workflow filters raw reads using a minimum Phred quality score of 20 and applies a minimum read length of 75 bases during read trimming (‘--minLen’ parameter); this parameter may be adjusted by the user. Sequencing adapters are automatically detected and removed using fastp v0.23.4 (22), which can also deduplicate reads (‘--dedup’ parameter), if desired. ViralFlow v1.0 removes primer regions more efficiently by employing the ampliconclip command of samtools v.1.11 (23) with a user-provided BED file containing primer positions. Alternatively, a user-defined number of bases can be removed from all reads using the ‘--trimLen’ parameter. Reads passing the filters are mapped to the reference genome using BWA v.0.7.17 (24) in default mode. Quality metrics for the alignment (BAM) file are calculated using Picard v.2.27.2 with a default minimum mapping quality and base quality of 30, which may be adjusted by the user (‘--base_quality’ and ‘--mapping_quality’ parameters). Coverage plots are generated using the BAMdash tool (https://github.com/jonas-fuchs/BAMdash). The consensus genomes (FASTA) are built from the most frequent allele at each position, with samtools v.1.11 and iVar v1.4.2 (25), using a minimum depth of 25 to call the consensus, which may be adjusted by the user through the ‘--depth’ parameter. Bam-readcount v1.0.1 (26) and MAFFT v7.505 (27) are used to identify intrahost (minor allele) variants and build an alternative consensus (bearing the minor allele) using an in-house Python script. Single nucleotide polymorphism visualization plots are also generated using the snipit tool (https://github.com/aineniamh/snipit).
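
The consensus rule described above (most frequent allele per position, with a minimum depth of 25) can be illustrated with a short Python sketch. This is not the iVar implementation: the function name `call_consensus`, the toy pileup representation and the use of `N` to mask low-depth positions are all assumptions made for illustration.

```python
from collections import Counter


def call_consensus(pileup, min_depth=25):
    """Return a consensus string from a toy pileup.

    `pileup` holds one string per reference position, each containing the
    bases observed at that position. Positions with fewer than `min_depth`
    observations are masked with 'N' (an assumption of this sketch).
    """
    consensus = []
    for bases in pileup:
        if len(bases) < min_depth:
            consensus.append("N")
            continue
        # Most frequent base at this position; ties go to the base seen first.
        major_base, _count = Counter(bases).most_common(1)[0]
        consensus.append(major_base)
    return "".join(consensus)


if __name__ == "__main__":
    toy_pileup = [
        "A" * 30,             # depth 30, unanimous
        "A" * 20 + "G" * 10,  # depth 30, A is the major allele
        "C" * 10,             # depth 10, below the threshold
        "T" * 13 + "C" * 12,  # depth 25, T is the major allele
    ]
    print(call_consensus(toy_pileup))  # prints AANT
```

Lowering `min_depth` in this sketch corresponds to relaxing the ‘--depth’ parameter: more positions receive a base call, at the cost of calls supported by fewer reads.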

We implemented a new module containing freebayes v0.9.21 (28) and snpEff v.5.0 (29) to annotate the detected mutations and predict their impact. ViralFlow reports a range of files for each sample, including annotated variants, mutation files, consensus genomes with major and minor alleles as well as with ambiguous characters, an alignment of the consensus and the reference, intrahost single nucleotide variants, mapped reads, genome assembly statistics, quality control reports, alignment files (BAM), and plots showing mutations and genome coverage. Finally, ViralFlow generates compiled output reports with summarized information such as lineage, coverage breadth and coverage depth, in text/tabular format to allow further data wrangling.

To benchmark and evaluate the new ViralFlow v1.0 features, we compared its performance with that of ViralFlow v0.0.6 using four benchmark datasets with variable numbers of samples. Furthermore, we tested ViralFlow v1.0 performance using real datasets for SARS-CoV-2, monkeypox virus (MPXV), Dengue virus serotypes 1 and 2 (DENV-1 and DENV-2) and Zika virus (ZIKV).

Illumina simulated reads

ART (30), version ART-MountRainier-2016-06-05 (https://github.com/scchess/Art/tree/master), was employed to generate simulated datasets from various high-quality SARS-CoV-2 genomes (horizontal coverage > 99%) belonging to different lineages and encompassing different sets of mutations (substitutions and indels). We generated datasets with different numbers of paired FASTQ files: 1, 8, 16 and 32. The artificial read samples were created using the command art_illumina -ss HS25 -sam -i "$file" -p -l 150 -f 500 -m 200 -s 0 -o "$output". The -ss HS25 parameter selects the simulated HiSeq 2500 system, while the -p parameter requests paired-end reads. Read length was set to 150 bp (-l 150). According to the ART documentation, -f 500 sets the fold coverage to 500×, -m 200 sets the mean fragment size to 200 bp and -s 0 sets the standard deviation of the fragment size to 0. The list of genomes used for generating these artificial reads is available in Supplementary Table S1.
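
The simulation command can be wrapped in a small helper when many genomes are processed. The Python sketch below rebuilds the argument list reported above and estimates how many read pairs a given fold coverage implies. It follows ART's documented flag meanings (`-f` fold coverage, `-m` mean fragment size, `-s` fragment-size standard deviation); the helper names and the example file names are hypothetical, and the read-pair estimate is a back-of-the-envelope approximation rather than ART's exact count.

```python
def art_illumina_args(genome_fasta, out_prefix, read_len=150, fold=500,
                      mean_frag=200, sd_frag=0):
    """Argument list for the paired-end HiSeq 2500 simulation described in the text."""
    return [
        "art_illumina", "-ss", "HS25", "-sam",
        "-i", genome_fasta, "-p",
        "-l", str(read_len),   # read length (bp)
        "-f", str(fold),       # fold coverage
        "-m", str(mean_frag),  # mean fragment size (bp)
        "-s", str(sd_frag),    # fragment size standard deviation (bp)
        "-o", out_prefix,
    ]


def expected_read_pairs(genome_len, fold, read_len):
    """Approximate read pairs needed: each pair contributes 2 * read_len bases."""
    return round(genome_len * fold / (2 * read_len))


if __name__ == "__main__":
    # File names are placeholders; the command is printed, not executed.
    print(" ".join(art_illumina_args("genome.fasta", "sim_")))
    # NC_045512.2 is 29,903 bases long.
    print(expected_read_pairs(29903, fold=500, read_len=150))  # prints 49838
```

Passing the list to `subprocess.run` would execute the simulation, provided `art_illumina` is installed and on the `PATH`.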

Real-world benchmarking datasets

Benchmark datasets were obtained from https://github.com/CDCgov/datasets-sars-cov-2 and used to assess the performance of the workflows. These datasets consist of two real-life sets of reads and one negative control set, namely Bench 1 (BM data1), Bench 5 (BM data5) and Bench 6 (BM data6, negative control set), offering diverse perspectives on SARS-CoV-2 genomic data. BM data1, the ‘Boston Outbreak’ dataset, encompasses 63 samples sequenced by Illumina metagenomic sequencing to understand transmission in a real outbreak. BM data5, the ‘Non-VOI/VOC Lineages’ dataset, contains 39 Illumina samples used to benchmark non-specific lineage-calling workflows, generated with various primer sets, including ARTIC v1, ARTIC v3 and a random primer from NexteraXT. Lastly, BM data6, the ‘Failed QC’ dataset, features 24 Illumina samples that serve as controls for testing bioinformatics quality control, generated with the ARTIC v3 primer set and CDC in-house multiplex polymerase chain reaction (PCR) primers. All datasets were analyzed in triplicate, and the mean and standard deviation of the computational resources used (maximum memory peak and runtime) were calculated.
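
Summarizing triplicate runs as mean and standard deviation is straightforward with the Python standard library. In the sketch below the measurements are placeholder values, not results from the study, and the sample standard deviation (n − 1 denominator) is assumed because the text does not state which estimator was used.

```python
from statistics import mean, stdev


def summarize(values):
    """Return (mean, sample standard deviation) of replicate measurements."""
    return mean(values), stdev(values)


if __name__ == "__main__":
    # Placeholder triplicate measurements for one dataset (illustrative only).
    replicates = {
        "runtime_s": [10.0, 12.0, 14.0],
        "max_rss_mb": [510.0, 498.0, 505.0],
    }
    for metric, values in replicates.items():
        avg, sd = summarize(values)
        print(f"{metric}: mean={avg:.1f} sd={sd:.1f}")
```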

Computer setup

Two different computational platforms, namely AWS and the IAM Carlos Chagas Cluster, were employed to simulate diverse execution setups of the workflow. The Carlos Chagas Cluster node used ran Ubuntu Server 20.04.6 LTS and had 191 gigabytes (GB) of RAM and dual Intel(R) Xeon(R) Gold 5220R CPUs (2.20 GHz), for a total of 96 threads. To emulate a personal computer environment, an AWS m5.large virtual machine instance (8 GB RAM + 2 vCPUs) running Ubuntu Server 22.04 LTS was used.

Workflow setup

Both versions of the workflow were executed with carefully matched parameters to ensure the fairest comparison possible, considering the different software implemented. The specific arguments employed for each run can be found in Supplementary Table S2. Although the two versions differ in some steps, we extracted metrics that are comparable between them, such as the maximum resident set size (i.e. the highest amount of physical memory the process used at any given moment) and the wall-clock time.
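
Both metrics can be collected for any command on a Unix-like system with the Python standard library. The sketch below is one possible way of doing so and is not necessarily how the measurements in this study were taken; the function name `measure` is hypothetical. Note that `ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run `cmd` and return (wall-clock seconds, peak child RSS).

    RUSAGE_CHILDREN reports the largest resident set size among all child
    processes that have been waited for so far, so this sketch is most
    meaningful when one command is measured per Python process.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall_seconds = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return wall_seconds, peak_rss


if __name__ == "__main__":
    # A trivial child process stands in for a workflow run.
    wall, peak = measure([sys.executable, "-c", "x = list(range(100000))"])
    print(f"wall-clock: {wall:.3f} s, max RSS: {peak} (kB on Linux)")
```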

Additionally, to validate the genome assemblies produced by ViralFlow v1.0, we evaluated essential genomic statistics: coverage breadth, coverage depth and the level of agreement between the assemblies and the corresponding consensus of the original samples, which was assessed by comparing the lineages assigned by each version of ViralFlow.
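
These validation statistics can be made concrete with a short Python sketch. The definitions used here are common ones but are assumptions with respect to this study: breadth is the percentage of reference positions covered by at least `min_depth` reads, depth is the mean per-position depth, and concordance is the percentage of samples whose assigned lineage matches the expected one. All names and values are illustrative.

```python
def coverage_stats(depths, min_depth=1):
    """Return (breadth in %, mean depth) from per-position read depths."""
    covered = sum(1 for d in depths if d >= min_depth)
    breadth = 100.0 * covered / len(depths)
    mean_depth = sum(depths) / len(depths)
    return breadth, mean_depth


def lineage_concordance(called, expected):
    """Percentage of expected samples whose called lineage matches.

    Samples missing from `called` count as discordant.
    """
    agree = sum(1 for sample, lineage in expected.items()
                if called.get(sample) == lineage)
    return 100.0 * agree / len(expected)


if __name__ == "__main__":
    print(coverage_stats([0, 10, 30, 40]))  # prints (75.0, 20.0)
    expected = {"s1": "B.1", "s2": "P.1", "s3": "P.2", "s4": "B.1.1"}
    called = {"s1": "B.1", "s2": "P.1", "s3": "P.1"}
    print(lineage_concordance(called, expected))  # prints 50.0
```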

Diverse viral datasets

We performed both SARS-CoV-2 and custom mode analyses on real data (available under ENA project PRJEB71472) using only ViralFlow v1.0. Five datasets were employed, comprising 1 277 SARS-CoV-2, 56 MPXV, 37 DENV-1, 29 DENV-2 and 271 ZIKV samples. The performance metrics were collected from NextFlow using the -with-trace parameter.
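
A trace file produced with -with-trace is a tab-separated table with one row per task, which makes per-process summaries easy to compute. The Python sketch below assumes the trace contains `name` and `%cpu` columns (both are among NextFlow's default trace fields) and that task names have the form `process (tag)`; the sample rows are invented for illustration and are not data from this study.

```python
import csv
import io
from collections import defaultdict

# Invented example rows in the tab-separated layout of a trace file.
SAMPLE_TRACE = (
    "task_id\tname\tstatus\trealtime\t%cpu\tpeak_rss\n"
    "1\trunPicard (sampleA)\tCOMPLETED\t42s\t210.5%\t1.1 GB\n"
    "2\trunPicard (sampleB)\tCOMPLETED\t40s\t310.0%\t1.3 GB\n"
    "3\trunPangolin (sampleA)\tCOMPLETED\t35s\t480.2%\t900 MB\n"
)


def peak_cpu_by_process(trace_text):
    """Return the highest %cpu value observed for each process name."""
    peaks = defaultdict(float)
    for row in csv.DictReader(io.StringIO(trace_text), delimiter="\t"):
        process = row["name"].split(" (")[0]     # drop the per-sample tag
        cpu = float(row["%cpu"].rstrip("%"))     # "210.5%" -> 210.5
        peaks[process] = max(peaks[process], cpu)
    return dict(peaks)


if __name__ == "__main__":
    for process, cpu in peak_cpu_by_process(SAMPLE_TRACE).items():
        # 100% corresponds to one fully used core.
        print(f"{process}: {cpu:.1f}% (about {cpu / 100:.1f} cores)")
```

For a real run, the text of the trace file would be read from disk and passed to `peak_cpu_by_process` in place of `SAMPLE_TRACE`.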

Results

New features

ViralFlow v1.0 provides new implementations and improvements compared with its previous version, v0.0.6. The main difference is the refactoring into the NextFlow workflow language, which has enabled enhancements such as efficient parallelism and scalability of the workflow. Furthermore, this new version can analyze any virus with an available genome. Amplicon sequencing data are now handled more effectively through primer removal based on a BED file, and read deduplication has been implemented using fastp. The validation tests conducted with and without deduplication demonstrate consistent results across versions and modes for coverage breadth and depth (Supplementary Table S3). As expected, minor differences in total reads and coverage depth were observed when reads were deduplicated (Supplementary Table S3). New analysis modules have been implemented for the annotation and prediction of mutation effects, as well as for their visualization and for the generation of genomic maps showing coverage breadth and depth. We introduced a package manager based on Micromamba for speed and robustness, and standardized the containers to favor fast configuration and ensure the reproducibility of any analysis performed with ViralFlow v1.0.

Viral genomic statistics

Coverage breadth and depth were consistent across all dataset comparisons, and the negative control set (Bench 6) exhibited reduced coverage breadth and depth, as expected (Figure 1A).

Figure 1. ViralFlow 0.0.6 and 1.0 performance evaluation. (A) Coverage depth and breadth of artificial and benchmark datasets. Statistical metrics were obtained from viral assemblies performed on the Carlos Chagas cluster. (B) CPU usage of ViralFlow v1.0 using real data for different viruses. (C) Physical memory usage of ViralFlow v1.0 using real data for different viruses.

Time and memory consumption of both ViralFlow versions were similar in the AWS and Carlos Chagas Cluster environments, with ViralFlow v1.0 using slightly more memory, probably due to the newly implemented modules (Supplementary Figure S1A, B).

In the lineage concordance analysis using the simulated reads (ART dataset), 100% concordance was observed for both versions of ViralFlow. For the benchmark datasets, in contrast, version 1.0 consistently showed higher concordance with the previously identified lineages in the Bench 5 dataset, whereas versions 0.0.6 and 1.0 reached the same level of concordance for the Bench 1 dataset (Supplementary Figure S2).

Performance of ViralFlow v1.0 using diverse viral datasets running on high-performance computing (HPC)

In order to validate ViralFlow v1.0 with a wide range of viruses, we performed a complete run with Illumina sequencing data from SARS-CoV-2, MPXV, DENV-1, DENV-2 and ZIKV on the Carlos Chagas cluster.

CPU usage was always between 200% and 500% of a single core, which is equivalent to 2–5 threads being used simultaneously. Three processes (the virus-agnostic fixWGS and snpPlot, in all modes, and runPangolin, in the SARS-CoV-2 mode) requested more CPU (Figure 1B). Most processes requested less than 1 GB of memory, while a few tasks, such as alignConsensus2Ref and runPicard, required more (up to 5.2 GB) (Figure 1C). No individual task took more than 1 min to complete (Supplementary Figure S3A; Supplementary Table S4). The most time-consuming processes were runPicard, runIntrahostScript and, in the SARS-CoV-2 mode, runPangolin. The majority of tasks executed by ViralFlow took no more than 2.9 min on average. However, for larger viral genomes such as MPXV, some tasks (runPicard) required 18 min to complete (Supplementary Figure S3B; Supplementary Table S4). The analysis of 56 MPXV samples took a total of 54 min and 27 s (Supplementary File S1), while the analysis of 1 277 SARS-CoV-2 samples took a total of 8 h, 31 min and 22 s (Supplementary File S2).

Furthermore, we performed three additional analyses in the custom mode using three arbovirus datasets: 37 DENV-1, 29 DENV-2 and 271 ZIKV samples. ViralFlow v1.0 took 24 min (Supplementary File S3), 19 min and 28 s (Supplementary File S4), and 2 h, 5 min and 8 s to analyze these datasets, respectively.
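
Dividing each total runtime by the number of samples gives a rough per-sample throughput for the five datasets. The Python sketch below uses only the sample counts and runtimes reported in this section; because tasks run in parallel on the cluster, the resulting figure is elapsed time per sample, not the time a single sample would take on its own.

```python
# (samples, hours, minutes, seconds) as reported in the text.
RUNS = {
    "SARS-CoV-2": (1277, 8, 31, 22),
    "MPXV": (56, 0, 54, 27),
    "DENV-1": (37, 0, 24, 0),
    "DENV-2": (29, 0, 19, 28),
    "ZIKV": (271, 2, 5, 8),
}


def seconds_per_sample(samples, hours, minutes, seconds):
    """Total elapsed seconds divided by the number of samples."""
    total = hours * 3600 + minutes * 60 + seconds
    return total / samples


if __name__ == "__main__":
    for virus, run in RUNS.items():
        print(f"{virus}: {seconds_per_sample(*run):.1f} s per sample")
```

Under this measure, the SARS-CoV-2 run averages about 24 s of elapsed time per sample and the MPXV run about 58 s, which is in line with the longer task times observed for the larger MPXV genome.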

Discussion

Viral genomic surveillance has been incorporated into routine monitoring programs around the globe. However, the ease of generating raw viral sequences has not been matched by bioinformatic tools tailored for the timely analysis of hundreds to thousands of samples. Here we describe ViralFlow v1.0, a workflow with low memory and CPU requirements that can therefore be deployed in low-resource settings and that also scales to HPC environments. Moreover, ViralFlow v1.0 has improved modularity, transparency and usability, and follows better practices for advanced users and developers. These improvements also increase reproducibility and performance. Despite the availability of several computational pipelines for viral genomic data analysis in the literature, few of them are implemented in workflow management languages, and those that are remain limited to specific viruses. ViralFlow, while sharing similarities with some pipelines, combines workflow management with the flexibility to analyze any virus. We provide a feature comparison table covering ViralFlow and several other pipelines developed with similar goals in mind (Supplementary Table S5); these pipelines will be included in future benchmark comparisons using different case studies, genome assembly metrics and performance measures. Planned future developments will focus on incorporating heuristics for automatically selecting the best reference genome, which is particularly useful for segmented genomes such as that of influenza virus, and on the analysis of long reads.

Contributor Information

Alexandre Freitas da Silva, Departamento de Entomologia, Instituto Aggeu Magalhães (IAM)-Fundação Oswaldo Cruz-FIOCRUZ, Recife, Pernambuco 50670-420, Brazil; Núcleo de Bioinformática (NBI), Instituto Aggeu Magalhães (IAM)-Fundação Oswaldo Cruz-FIOCRUZ, Recife, Pernambuco 50670-420, Brazil.

Antonio Marinho da Silva Neto, Data Analysis and Engineering, Genomic Surveillance Unit, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

Cleber Furtado Aksenen, Fundação Oswaldo Cruz (Fiocruz) - Fiocruz-CE, Eusebio, Ceará 61760-000, Brazil.

Pedro Miguel Carneiro Jeronimo, Fundação Oswaldo Cruz (Fiocruz) - Fiocruz-CE, Eusebio, Ceará 61760-000, Brazil.

Filipe Zimmer Dezordi, Departamento de Entomologia, Instituto Aggeu Magalhães (IAM)-Fundação Oswaldo Cruz-FIOCRUZ, Recife, Pernambuco 50670-420, Brazil; Núcleo de Bioinformática (NBI), Instituto Aggeu Magalhães (IAM)-Fundação Oswaldo Cruz-FIOCRUZ, Recife, Pernambuco 50670-420, Brazil.

Suzana Porto Almeida, Fundação Oswaldo Cruz (Fiocruz) - Fiocruz-CE, Eusebio, Ceará 61760-000, Brazil.

Hudson Marques Paula Costa, Núcleo de Bioinformática (NBI), Instituto Aggeu Magalhães (IAM)-Fundação Oswaldo Cruz-FIOCRUZ, Recife, Pernambuco 50670-420, Brazil.

Richard Steiner Salvato, Secretaria Estadual da Saúde do Rio Grande do Sul, Centro Estadual de Vigilância em Saúde, Laboratório Central de Saúde Pública, Porto Alegre, Rio Grande do Sul 90450-190, Brazil.

Tulio de Lima Campos, Núcleo de Bioinformática (NBI), Instituto Aggeu Magalhães (IAM)-Fundação Oswaldo Cruz-FIOCRUZ, Recife, Pernambuco 50670-420, Brazil.

Gabriel da Luz Wallau, Departamento de Entomologia, Instituto Aggeu Magalhães (IAM)-Fundação Oswaldo Cruz-FIOCRUZ, Recife, Pernambuco 50670-420, Brazil; Núcleo de Bioinformática (NBI), Instituto Aggeu Magalhães (IAM)-Fundação Oswaldo Cruz-FIOCRUZ, Recife, Pernambuco 50670-420, Brazil; Department of Arbovirology, Bernhard Nocht Institute for Tropical Medicine, WHO Collaborating Center for Arbovirus and Hemorrhagic Fever Reference and Research, National Reference Center for Tropical Infectious Diseases, Bernhard-Nocht-Strasse 74, D-20359 Hamburg, Germany.

Data availability

The code is publicly available and maintained within a GitHub repository (https://github.com/WallauBioinfo/ViralFlow, also on Figshare, https://doi.org/10.6084/m9.figshare.24902754.v4), allowing version control, scrutiny and free use/reuse (MIT license). Data used in this manuscript were simulated or generated from original samples. All the data are available at ENA project: PRJEB71472.

Supplementary data

Supplementary Data are available at NARGAB Online.

Funding

The Coordenação de Aperfeiçoamento de Pessoal de Nível Superior; Centers for Disease Control and Prevention [DC; 002174]; Departamento de Ciência e Tecnologia (DECIT) of the Brazilian Ministry of Health and Vice Presidency of Research and Biological Collections; and Fiocruz Technological Platforms [specifically for scholarships to support the authors of this manuscript]. G.L.W. is supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) through their productivity research fellowships (307209/2023-7).

Conflict of interest statement. None declared.

References

  • 1. Tang J.-L., Li L.-M. Importance of public health tools in emerging infectious diseases. BMJ. 2021; 375:n2374.
  • 2. Haldane V., Jung A.-S., De Foo C., Bonk M., Jamieson M., Wu S., Verma M., Abdalla S.M., Singh S., Nordström A. et al. Strengthening the basics: public health responses to prevent the next pandemic. BMJ. 2021; 375:e067510.
  • 3. Ling-Hu T., Rios-Guzman E., Lorenzo-Redondo R., Ozer E.A., Hultquist J.F. Challenges and opportunities for global genomic surveillance strategies in the COVID-19 era. Viruses. 2022; 14:2532.
  • 4. Gire S.K., Goba A., Andersen K.G., Sealfon R.S.G., Park D.J., Kanneh L., Jalloh S., Momoh M., Fullah M., Dudas G. et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science. 2014; 345:1369–1372.
  • 5. Oude Munnink B.B., Worp N., Nieuwenhuijse D.F., Sikkema R.S., Haagmans B., Fouchier R.A.M., Koopmans M. The next phase of SARS-CoV-2 surveillance: real-time molecular epidemiology. Nat. Med. 2021; 27:1518–1524.
  • 6. Li J., Lai S., Gao G.F., Shi W. The emergence, genomic diversity and global spread of SARS-CoV-2. Nature. 2021; 600:408–418.
  • 7. Tosta S., Moreno K., Schuab G., Fonseca V., Segovia F.M.C., Kashima S., Elias M.C., Sampaio S.C., Ciccozzi M., Alcantara L.C.J. et al. Global SARS-CoV-2 genomic surveillance: what we have learned (so far). Infect. Genet. Evol. 2023; 108:105405.
  • 8. Grad Y.H., Lipsitch M. Epidemiologic data and pathogen genome sequences: a powerful synergy for public health. Genome Biol. 2014; 15:538.
  • 9. Kao R.R., Haydon D.T., Lycett S.J., Murcia P.R. Supersize me: how whole-genome sequencing and big data are transforming epidemiology. Trends Microbiol. 2014; 22:282–291.
  • 10. Truong Nguyen P.T., Plyusnin I., Sironen T., Vapalahti O., Kant R., Smura T. HAVoC, a bioinformatic pipeline for reference-based consensus assembly and lineage assignment for SARS-CoV-2 sequences. BMC Bioinformatics. 2021; 22:373.
  • 11. Tilloy V., Cuzin P., Leroi L., Guérin E., Durand P., Alain S. ASPICov: an automated pipeline for identification of SARS-Cov2 nucleotidic variants. PLoS One. 2022; 17:e0262953.
  • 12. Posada-Céspedes S., Seifert D., Topolsky I., Jablonski K.P., Metzner K.J., Beerenwinkel N. V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput data. Bioinformatics. 2021; 37:1673–1680.
  • 13. Moshiri N., Fisch K.M., Birmingham A., DeHoff P., Yeo G.W., Jepsen K., Laurent L.C., Knight R. The ViReflow pipeline enables user friendly large scale viral consensus genome reconstruction. Sci. Rep. 2022; 12:5077.
  • 14. Plyusnin I., Vapalahti O., Sironen T., Kant R., Smura T. Enhanced viral metagenomics with Lazypipe 2. Viruses. 2023; 15:431.
  • 15. Pratas D., Toppinen M., Pyöriä L., Hedman K., Sajantila A., Perdomo M.F. A hybrid pipeline for reconstruction and analysis of viral genomes at multi-organ level. GigaScience. 2020; 9:giaa086.
  • 16. Váradi A., Kaszab E., Kardos G., Prépost E., Szarka K., Laczkó L. Rapid genotyping of targeted viral samples using Illumina short-read sequencing data. PLoS One. 2022; 17:e0274414.
  • 17. Fritz A., Bremges A., Deng Z.-L., Lesker T.R., Götting J., Ganzenmueller T., Sczyrba A., Dilthey A., Klawonn F., McHardy A.C. Haploflow: strain-resolved de novo assembly of viral genomes. Genome Biol. 2021; 22:212.
  • 18. Lee S.-D., Wu M., Lo K.-W., Yip K.Y. Accurate reconstruction of viral genomes in human cells from short reads using iterative refinement. BMC Genomics. 2022; 23:422.
  • 19. Dezordi F.Z., Neto A.M.d.S., Campos T.d.L., Jeronimo P.M.C., Aksenen C.F., Almeida S.P., Wallau G.L. and on behalf of the Fiocruz COVID-19 Genomic Surveillance Network ViralFlow: a versatile automated workflow for SARS-CoV-2 genome assembly, lineage assignment, mutations and intrahost variant detection. Viruses. 2022; 14:217.
  • 20. Di Tommaso P., Chatzou M., Floden E.W., Barja P.P., Palumbo E., Notredame C. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 2017; 35:316–319.
  • 21. Kurtzer G.M., Sochat V., Bauer M.W. Singularity: scientific containers for mobility of compute. PLoS One. 2017; 12:e0177459.
  • 22. Chen S., Zhou Y., Chen Y., Gu J. Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018; 34:i884–i890.
  • 23. Danecek P., Bonfield J.K., Liddle J., Marshall J., Ohan V., Pollard M.O., Whitwham A., Keane T., McCarthy S.A., Davies R.M. et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021; 10:giab008.
  • 24. Li H., Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009; 25:1754–1760.
  • 25. Grubaugh N.D., Gangavarapu K., Quick J., Matteson N.L., De Jesus J.G., Main B.J., Tan A.L., Paul L.M., Brackney D.E., Grewal S. et al. An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biol. 2019; 20:8.
  • 26. Khanna A., Larson D.E., Srivatsan S.N., Mosior M., Abbott T.E., Kiwala S., Ley T.J., Duncavage E.J., Walter M.J., Walker J.R. et al. Bam-readcount - rapid generation of basepair-resolution sequence metrics. J. Open Source Softw. 2022; 7:3722.
  • 27. Katoh K., Standley D.M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 2013; 30:772–780.
  • 28. Garrison E., Marth G. Haplotype-based variant detection from short-read sequencing. arXiv doi: https://arxiv.org/abs/1207.3907, 20 July 2012, preprint: not peer reviewed.
  • 29. Cingolani P., Platts A., Wang L.L., Coon M., Nguyen T., Wang L., Land S.J., Lu X., Ruden D.M. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly (Austin). 2012; 6:80–92.
  • 30. Huang W., Li L., Myers J.R., Marth G.T. ART: a next-generation sequencing read simulator. Bioinformatics. 2012; 28:593–594.

Articles from NAR Genomics and Bioinformatics are provided here courtesy of Oxford University Press