Abstract
Background
Colorectal cancer (CRC) patients exhibit distinct gut microbiota disruption, known as dysbiosis, which is believed to play a causative role in CRC. One of the key bacterial species implicated in CRC dysbiosis is Bacteroides fragilis, which presents a paradox as it is also present in most healthy individuals. This discrepancy underscores the need to look beyond species-level associations and investigate intraspecies variation within B. fragilis.
Methods
From a highly specific collection of B. fragilis isolates from CRC patients and controls, a pangenome-wide association study was conducted, identifying intraspecies genetic variations associated with CRC. The CRC association of these genetic variations was then validated in a metagenome sequencing cohort of faecal samples from 877 individuals, with and without CRC. To test group differences, a mixed effects logistic regression with cohort as a random effect was performed for each genetic variation.
Results
Here we show that CRC-associated B. fragilis isolates are infected with specific Caudoviricetes prophages, significantly more often than negative controls. The initial discovery was made in our highly specific isolate collection and then validated in an independent metagenome sequencing cohort, finding that CRC patients were twice as likely to have detectable levels of these phages (OR = 2.05, p = 2.522E-7, SE = 0.139).
Conclusions
To our knowledge, these findings mark the first link between one of the most implicated driver bacteria and phages in CRC. They suggest a more complex role of phages in CRC dysbiosis than current models propose and highlight the potential of phages as CRC biomarkers.
Subject terms: Predictive markers, Colorectal cancer, Dysbiosis, Clinical microbiology, Metagenomics
Plain language summary
Colorectal cancer is linked to changes in human gut microbes, but how these changes are involved remains unclear. The common gut bacterium Bacteroides fragilis is often tied to colorectal cancer but is also present in healthy people. We investigated whether there was something more to these bacteria that could explain how they are tied to colorectal cancer. Using genetic tests, we found that B. fragilis from colorectal cancer patients are infected with specific, and previously unknown, viruses. These findings suggest a partnership between bacteria and their viruses that may shape disease. If confirmed, these findings may support earlier detection of colorectal cancer and guide new ways to treat and prevent this disease.
Damgaard et al. analyze Bacteroides fragilis genomes isolated from people with and without colorectal cancer, identifying prophages. Screening 877 faecal metagenomes, they find that these specific prophages are significantly enriched in colorectal cancer patients.
Introduction
Colorectal cancer (CRC) is the third most common cancer worldwide and is responsible for the second highest number of cancer-related deaths1–3. The risk of developing CRC is multifactorial, but as much as 80% of this risk has been assigned to environmental factors, with one of the greatest of these being the gut microbiota3–5. While the precise link between the microbiota and CRC remains elusive, it is well established that CRC is accompanied by dysbiosis of the gut microbiota, with evidence supporting a causative role6,7.
Due to the sheer size and complexity of the microbiota, it has been very difficult to pinpoint exact differences between CRC patients and controls that are consistent across cohorts8. Additionally, most studies of dysbiosis investigate the phylogenetic composition of the microbiota, thus overlooking the natural diversity between strains of the same species8. In particular, many of these composition-focused studies implicate Bacteroides fragilis as an important dysbiosis contributor6. However, this represents a paradox, as B. fragilis is a commensal bacterium ubiquitous in healthy individuals without CRC9. One possible explanation could lie in the genetic heterogeneity of B. fragilis, namely that CRC-associated isolates may possess unique genetic traits absent in isolates from healthy individuals. This is further supported by the fact that B. fragilis toxin (bft), the only known virulence factor of B. fragilis often linked to CRC, is encoded in only a small fraction of B. fragilis isolates and is spread by horizontal gene transfer6,10. Therefore, a deeper understanding of B. fragilis, particularly the differences between individual strains and their relationship to CRC, is needed.
One of our recent studies supported this link between B. fragilis and an increased risk of CRC11. Based on the hypothesis that CRC-mediated breaches in the intestinal epithelium enable bloodstream access for bacteria at the CRC site, we analysed cases of bacteraemia where the causative agents were specific gut bacteria12. Patients experiencing bacteraemia with B. fragilis indeed had an increased risk of being diagnosed with CRC, an association supported by several other studies as well13,14.
This collection of isolates from our bacteraemia study was available to us and provided a unique opportunity to investigate the genomic composition of CRC-associated B. fragilis isolates in unprecedented detail. We hypothesised that these B. fragilis isolates possess genetic features that differentiate them from isolates of patients who did not develop CRC. Accordingly, the purpose of this study was to compare genomes of CRC-associated B. fragilis isolates with negative controls from the same cohort to identify potential biomarkers for early CRC diagnosis15.
Here, we present a pangenome-wide association analysis of a highly specific collection of CRC-associated B. fragilis isolates, collected over a decade from a cohort of 2 million people. In total, 583 patients developed B. fragilis bacteraemia and 11 of these patients were diagnosed with CRC within weeks. From eight of these patients, isolates were available for further studies11. The aim was to identify unique genetic traits in B. fragilis isolates linked to CRC, distinguishing them from isolates found in controls. These findings were then independently validated using whole-genome metagenomic sequencing data of faecal samples from 877 patients with and without CRC.
From the isolate collection, we found that the CRC-associated B. fragilis genomes contained specific prophages, in a CRC-specific manner. These phage sequences were then extracted and screened for in the metagenomic sequencing cohorts, finding that CRC patients were twice as likely to have detectable levels of these phages (OR = 2.05, p = 2.522E-7, SE = 0.139).
Methods
Isolates and patients
In the previously described cohort study, 11 patients had B. fragilis bacteraemia and CRC in the period 2007 to 2016; isolates from eight of these patients were available and sequenced. As negative controls, we randomly selected 24 bacteraemia isolates from patients in the same cohort who were not diagnosed with CRC (n = 572)11. Individual age and sex data were not available for the cohort, but the median age for Bacteroides spp. bacteraemia CRC cases was 76.8 years (IQR 66.8–82.6) and for control cases 73.3 years (IQR 63.5–81.5), and the sex distribution was 44% male in CRC and 53.7% male in controls. These 24 negative control patients did not have CRC before bacteraemia or in a follow-up observation period of at least five years from the date of bacteraemia. The isolates and patients were from the same period of time and had the same age distribution. Thus, the only distinct difference between these groups was the diagnosis of CRC. We also included 16 randomly selected faecal B. fragilis isolates from healthy Danish children. These isolates were collected from 2009 to 2012 (n = 96) and included in a study assessing antimicrobial resistance in the B. fragilis group in children (0–6 years of age)16.
All the B. fragilis isolates were stored in glycerol broth at −80 °C. Isolates were cultured on supplemented chocolate agar in an anaerobic atmosphere for at least 24 h before DNA extraction.
Illumina whole genome sequencing and assembly
The Illumina MiSeq platform was utilized for conducting whole genome sequencing on all isolates. Genomic DNA was purified using the MasterPure DNA Purification kit (Epicentre Biotechnologies, Madison, WI, USA) and libraries were prepared using the Nextera XT kit (Illumina, Essex, United Kingdom), in accordance with the manufacturers’ standard protocol.
Quality control of the sequencing data was performed using FastQC (v0.11.9), and the read coverage was calculated using the Lander/Waterman equation17. De novo assembly was performed using the SPAdes-based pipeline Shovill (v1.1.0), with default settings, excluding any contigs <200 bp in length18. Open reading frame prediction and annotation were performed with Prokka (v1.14.6), using the standard parameters of the program19. Genomes with >150X coverage were deduplicated and downsampled using Seqkit (v2.4.0)20. All genome assemblies are available through GenBank and their BioSample accession numbers are given in Supplementary Data.
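The Lander/Waterman estimate referred to above is C = L × N / G, where L is the read length, N the number of reads and G the genome size. The following is a minimal sketch of that calculation, not the code used in the study. The read length, read count and genome size are illustrative assumptions, and the helper that derives a downsampling fraction assumes that genomes above 150X were reduced to a 150X target, which the text does not state explicitly.

```python
def lander_waterman_coverage(read_length_bp: int, n_reads: int, genome_size_bp: int) -> float:
    """Expected coverage C = L * N / G (Lander/Waterman)."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return read_length_bp * n_reads / genome_size_bp


def downsample_fraction(coverage: float, target_coverage: float = 150.0) -> float:
    """Fraction of reads to keep so that coverage does not exceed the target.

    Assumption: the 150X figure is used as the downsampling target.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return min(1.0, target_coverage / coverage)


# Hypothetical run: 4,000,000 reads of 250 bp against a 5.2 Mbp genome.
example_coverage = lander_waterman_coverage(250, 4_000_000, 5_200_000)
example_fraction = downsample_fraction(example_coverage)
print(f"coverage = {example_coverage:.1f}X, keep fraction = {example_fraction:.2f}")
```

A keep fraction of this kind could be passed to a read sampler; samples already at or below the target are left untouched because the fraction is capped at 1.0.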
Pangenome-wide association study
Pangenomes were constructed using Roary (v3.13.0) with the following settings: -e (multiFASTA alignment of core genes), -s (group paralogs), -ap (allow paralogs in core), -cd 99 (core genes must be in 99% of genomes), -i 90 (genes must have 90% identity using blastp)21. Pangenome-wide association studies (panGWAS) were performed using Scoary (v1.6.16) at the standard parameters of the software, with the “gene_presence_absence_matrix.csv” file from Roary as an input22. The panGWAS analysis was performed with patient CRC status as the analysed trait. The analysis was performed with all CTR samples pooled and then repeated once for each CTR group independently. All hits were also subject to Bonferroni correction for multiple testing.
Phylogenetic analysis
The sequence type of each sequenced genome was determined using the B. fragilis MLST scheme, through the pubMLST database23. A full core genome alignment on all isolates was performed using Roary and subsequently used as an input for FastTree (v2.1.11) to create a phylogenetic tree based on SNPs in the core genome24.
Each genome was also screened for the presence of the B. fragilis toxin genes (bft-1, bft-2, bft-3) and the cfiA gene, which defines which B. fragilis division the isolate belongs to. This was achieved by performing a TBLASTN with the translated sequence of each gene against all genomes with an e-value cut-off 1E-5, using BLAST+ (v2.11.0)25.
Nanopore whole genome sequencing and assembly
To identify the genomic context and full extent of each prophage identified in the pangenome-wide association studies, nanopore sequencing was performed on the genomes containing the CRC-associated prophages.
DNA was purified using the MasterPure DNA Purification kit (Epicentre Biotechnologies, Madison, WI, USA), in accordance with the manufacturer’s standard protocol. Long read sequencing was performed on the MinION platform from Oxford Nanopore Technologies (ONT), using the R9.4.1 flowcell (FLO-MIN106). DNA eluates (200–400 ng) were multiplexed and prepared for sequencing using the Rapid Barcoding Kit 96 (SQK-RBK110.96) following the manufacturer’s protocol (version RBK_9126_V110_REVO_24Mar2021). The library was sequenced for 24 h.
Reads were initially filtered using Filtlong (v0.2.1), keeping 95% of reads and excluding any reads <1500 bp in length. Quality control of the sequencing data was performed using FastQC, and the reads were subsequently assembled using Flye (v. 2.9.1-b1780) at the standard parameters of the software26. The assembled genomes were annotated with Prokka using the standard parameters.
Prophage analysis
The gene products of the CRC-associated genes identified in the pangenome-wide association study were assessed by searching the nucleotide sequences with BLAST against both the UniProt and GenBank databases and by analysis with ProteInfer27.
As the genes predominantly shared sequence homology with phage genes, a comprehensive screening of phage genes was performed on the genomes. All genes from each genome were screened against the NCBI Viral RefSeq database, the Gut Virome Database (GVD), and the Danish Enteric Virome Catalogue (DEVoC)28,29. This was achieved using BLAST+ locally25.
Prophage cladogram
Prophage boundaries were determined by identifying flanking bacterial core genes. Flanking core gene pairs were identified using Corekaburra (v0.0.5)30. The validity of these boundaries was manually assessed by confirming that the genes encoded within the predicted prophage region matched against the phage databases and that the flanking core regions were joined in the prophage-negative genomes. Lastly, the attL and attR sequences were determined by sequentially aligning the suspected flanking regions with MAFFT (v7.520) until overlapping direct repeat sequences were identified31. The attB site was then located in control isolates by performing a BLASTN with the attR and attL sequences as query25.
The prophage sequences were then extracted from the whole genome sequence data and a multiple sequence alignment was performed using MAFFT31. The prophage genomes were then sorted in accordance with the multiple sequence alignment and a cladogram was created using Easyfig (v2.2.5)32.
Phage classification
The genome sequence of each prophage was extracted from its respective bacterial host genome and searched against the GenBank and RefSeq databases to identify any known phages with similar sequences. Taxonomy against known dsDNA phages was inferred using ViPTREE v.4.0, where the smallest possible phylogenetic cluster encompassing all the phages identified in this study was identified33. All phage sequences within this cluster, along with the phages identified in this study, were retrieved and a multiple sequence alignment was performed along with a phylogenetic re-analysis through ViPTREE33. To further identify taxonomic links to known phages, all genomes were submitted in FASTA format to the PhaBOX server34.
All phages were named following the guidelines of the Bacterial and Archaeal Viruses Subcommittee (BAVS) of the International Committee on the Taxonomy of Viruses (ICTV). The names were checked against the NCBI Nucleotide database with the term “vhost bacteria[filter] AND ddbj_embl_genbank[filter]”, which contains all the viral sequences of the INSDC databases including the phages classified by ICTV, to ensure they were not already in use35,36.
Sequence identity of the prophages was calculated using VIRIDIC at the standard parameters of the software, with a species threshold of 95% sequence identity across the entire genome and a genus threshold of 70%37.
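The species and genus thresholds described above amount to a simple decision rule on the intergenomic identity reported for a pair of phage genomes. The following is a minimal sketch of that rule, not VIRIDIC itself; the function name is hypothetical, and treating the thresholds as inclusive (at least 95% and at least 70%) is an assumption.

```python
SPECIES_THRESHOLD_PCT = 95.0  # species threshold stated in the text
GENUS_THRESHOLD_PCT = 70.0    # genus threshold stated in the text


def taxonomic_relationship(identity_pct: float) -> str:
    """Classify a pair of phage genomes from their intergenomic identity (%).

    Assumption: thresholds are inclusive.
    """
    if not 0.0 <= identity_pct <= 100.0:
        raise ValueError("identity_pct must be between 0 and 100")
    if identity_pct >= SPECIES_THRESHOLD_PCT:
        return "same species"
    if identity_pct >= GENUS_THRESHOLD_PCT:
        return "same genus"
    return "different genus"


# Illustrative identities only; these are not values from the study.
for identity in (97.2, 82.0, 40.0):
    print(identity, "->", taxonomic_relationship(identity))
```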
Metagenome public data selection and acquisition
Nine publicly available whole-genome metagenome sequencing cohorts were identified by searching NCBI for “colorectal cancer” and “gut metagenomic sequencing”. We were unable to retrieve case control metadata for the IT (SRP136711) cohort. Raw sequencing reads were downloaded for the remaining cohorts from the European Nucleotide Archive, using the following ENA identifiers: Accession for the Austrian cohort (AUS) was PRJEB7774, CRC, n = 46; and Control, n = 6338. Accession for the German cohort (GER) was PRJEB6070, CRC, n = 70; and Control, n = 6039. Accession for the French cohort (FRA) was PRJEB6070, CRC, n = 93; and Control, n = 6639. Accession for the American cohort (US) was PRJEB12449, CRC, n = 52; and Control, n = 5240. Accession for the Chinese cohort 1 (CHN1) was PRJNA763023, CRC, n = 50; and Control, n = 5041. Accession for the Japanese cohort (JPN) was PRJDB4176, CRC, n = 40; and Control, n = 4042. Accession for the Chinese cohort 2 (CHN2) was PRJNA731589, CRC, n = 53; and Control, n = 6843. Accession for the Chinese cohort 3 (CHN3) was PRJNA514108, CRC, n = 32; and Control, n = 4444. Adenoma and youth groups were not included in this study.
Metagenome assembly
All the available cohorts were originally sequenced with Illumina technology. Paired-end raw metagenome sequencing reads were retrieved from the NCBI SRA run selector tool using the fasterq-dump (v.2.11.0) function of sra-tools45. The raw reads were trimmed using Trimmomatic (v.0.39) with a sliding window of 4:20 and a minimum length of 35 bp46. Assemblies were constructed from the filtered and trimmed reads with SPAdes (v.3.15.5) using metaSPAdes mode, with the standard parameters of the software47. Of the 1371 samples, 18 could not be filtered with metaSPAdes due to an unresolved software issue, and assembly-only mode was used for these samples instead (Supplementary Data). Contig length, coverage, and quality were assessed using QUAST (v.5.0.2)48.
Phage screening of metagenomic data
To filter the dataset for DNA sequences possibly belonging to the specific phages, MASH (v.2.3) was used for a rough first pass to identify candidate contigs. Sketches were drawn for both the phages and metagenomic assemblies using MASH with the following settings: -k 18 (kmer-length), -s 10000 (sketch size) and -I (sketches contigs individually). MASH dist was then used to identify pairs of phage and metagenome sketches with a distance score of 0.3 or lower. These contigs were then extracted for further analysis using R with the seqinR package and searched against the phage genomes using BLAST+25. To identify samples that contained multiple phage fragments spanning several genes, samples with two or more BLASTN hits of >2000 bp alignment length each were identified. This length was based on the average length of two bacterial genes and the N50 values of the metagenomic contigs. A mixed effects logistic regression with cohort as a random effect was performed on the distribution of all Bacteroides phages across the two patient groups (CRC and CTR) using the lme4 (version 1.1-36) package in R.
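The positivity rule described above (two or more BLASTN hits, each with an alignment length above 2000 bp) can be sketched as a small filter over BLAST tabular output. This is an illustrative sketch, not the R code used in the study. It assumes the standard 12-column tabular format, in which the fourth column is the alignment length, and it assumes a hypothetical naming scheme in which each query ID is written as `sample|contig`; the example hits are invented.

```python
from collections import Counter


def count_long_hits(blast_lines, min_aln_len=2000):
    """Count hits per sample whose alignment length exceeds min_aln_len."""
    counts = Counter()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        sample_id = fields[0].split("|", 1)[0]  # hypothetical "sample|contig" IDs
        alignment_length = int(fields[3])        # 4th column of standard tabular output
        if alignment_length > min_aln_len:
            counts[sample_id] += 1
    return counts


def phage_positive_samples(blast_lines, min_aln_len=2000, min_hits=2):
    """Samples with at least min_hits qualifying hits are called phage positive."""
    counts = count_long_hits(blast_lines, min_aln_len)
    return {sample for sample, n in counts.items() if n >= min_hits}


def _hit(query, subject, length):
    # Build one invented 12-column line; only query ID and length matter here.
    return "\t".join([query, subject, "98.0", str(length), "10", "1",
                      "1", str(length), "1", str(length), "0.0", "3000"])


example_hits = [
    _hit("S1|contig_12", "FU01", 2500),
    _hit("S1|contig_40", "FU01", 3100),
    _hit("S2|contig_3", "ODE01", 2100),
    _hit("S2|contig_9", "ODE01", 1500),  # too short to count
    _hit("S3|contig_1", "FU02", 800),    # too short to count
]
print(sorted(phage_positive_samples(example_hits)))
```

Requiring several long hits rather than one trades some sensitivity for specificity, which matches the stated aim of identifying fragments spanning several genes.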
All BLASTN matches from the screening were also mapped along the genome length of each Bacteroides phage using the IRanges (2.38.1) and Biostrings (2.72.1) packages in R (4.4.2)49,50.
A 50 bp rolling window was applied to all phage genome sequences, calculating mean coverage of CRC and CTR cases for each window using the purrr (1.0.2), zoo (1.8.14), and IRanges (2.38.1) packages in R (4.4.2)49–52. To avoid redundancy, windows were sorted in order of discriminatory power, keeping only those that were >1000 bp apart. Starting with the highest scoring window, a greedy algorithm iteratively chose additional windows, up to a total of six, to maximise the Youden’s J of the joined samples. This selection process was performed with an 80/20 split, with 80% of samples chosen for training at random within the CRC and CTR groups, based on BioSample identifiers. The specificity, sensitivity and Youden’s J were calculated for the six fragments based on the remaining test samples.
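The greedy selection described above can be illustrated with a short sketch. It is not the authors' R implementation and rests on several assumptions: window coverage has been reduced to a per-sample presence/absence call, a sample counts as positive when any selected window is present (one reading of "joined samples"), and a window is added only if it strictly improves Youden's J (sensitivity + specificity − 1). All names and the toy data are hypothetical, and the 80/20 split is omitted.

```python
def youden_j(predicted, is_case):
    """Youden's J = sensitivity + specificity - 1 for boolean predictions."""
    tp = sum(p and c for p, c in zip(predicted, is_case))
    fn = sum((not p) and c for p, c in zip(predicted, is_case))
    tn = sum((not p) and (not c) for p, c in zip(predicted, is_case))
    fp = sum(p and (not c) for p, c in zip(predicted, is_case))
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity + specificity - 1.0


def greedy_select_windows(windows, is_case, max_windows=6, min_gap_bp=1000):
    """Greedily pick windows that maximise J of the union of positive calls.

    windows: dict mapping window start (bp) -> list of per-sample booleans.
    Windows within min_gap_bp of an already selected window are skipped.
    """
    selected = []
    joined = [False] * len(is_case)
    best_j = float("-inf")
    while len(selected) < max_windows:
        best_candidate = None
        for start, presence in windows.items():
            if any(abs(start - chosen) <= min_gap_bp for chosen in selected):
                continue  # too close to (or identical with) a selected window
            candidate = [a or b for a, b in zip(joined, presence)]
            j = youden_j(candidate, is_case)
            if j > best_j:
                best_j, best_candidate = j, (start, candidate)
        if best_candidate is None:
            break  # no remaining window improves J
        selected.append(best_candidate[0])
        joined = best_candidate[1]
    return selected, best_j


# Toy data: four CRC samples followed by four controls.
toy_is_case = [True, True, True, True, False, False, False, False]
toy_windows = {
    0:    [True, True, False, False, False, False, False, False],
    500:  [True, True, True, False, False, False, False, False],
    3000: [False, False, False, True, False, False, False, False],
    6000: [True, True, True, True, True, True, True, False],
}
toy_selected, toy_j = greedy_select_windows(toy_windows, toy_is_case)
print(toy_selected, toy_j)
```

In the toy data the window at 0 bp is never chosen because it lies within 1000 bp of the better window at 500 bp, and the window at 6000 bp is rejected because adding it lowers specificity more than it raises sensitivity.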
Bacteroides phage insertion site screening
All metagenomic contigs were species annotated using Kraken2 (v.2.1.3)53. Using the annotations, all contigs identified as belonging to B. fragilis were extracted, and the flanking core genes of the prophage insertion sites were identified using a TBLASTN with the translated sequence of each gene against all genomes with an e-value cut-off of 1E-5, using BLAST+25. Contigs containing both insertion site genes were then isolated and the distance between the genes was calculated. Any insertion site where the flanking genes were separated by any number of additional bases was identified and the inserted sequence extracted. The extracted sequences were then aligned against each other and the known Bacteroides phages using VIRIDIC37.
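The distance calculation and extraction step described above can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: gene positions are taken as 0-based, half-open (start, end) coordinates on the contig, the function name is hypothetical, and the sketch reports the raw gap between the two flanking genes, leaving any comparison against the spacing expected at an unoccupied site to the caller.

```python
def extract_inserted_sequence(contig_seq, gene_a, gene_b):
    """Return (gap_length, sequence) for the region between two flanking genes.

    gene_a and gene_b are (start, end) tuples, 0-based and half-open.
    Overlapping or directly adjacent genes yield (0, "").
    """
    (first_start, first_end), (second_start, second_end) = sorted([gene_a, gene_b])
    gap_length = second_start - first_end
    if gap_length <= 0:
        return 0, ""
    return gap_length, contig_seq[first_end:second_start]


# Toy contig: a 4 bp "gene", a 6 bp insert, and a second 4 bp "gene".
toy_contig = "AAAA" + "GGGGGG" + "TTTT"
toy_gap, toy_insert = extract_inserted_sequence(toy_contig, (0, 4), (10, 14))
print(toy_gap, toy_insert)
```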
Ethical statement
The study, from which the bacterial isolates were identified, was approved by the Danish Data Protection Agency and the Danish Patient Safety Authority. Ethics committee approval is not required for this type of study in Denmark.
The publicly available metagenome sequencing data used in this study were obtained from previously published studies, all of which received approval from their local ethics committees and collected the samples with written informed consent from all participants.
Results
We set out to identify possible genetic differences that would distinguish B. fragilis isolates from patients with CRC from those of control subjects without CRC. To achieve this, whole genome sequencing was performed on 48 B. fragilis isolates (Supplementary Data).
CRC-associated B. fragilis are not phylogenetically distinct
Firstly, it was examined whether the CRC-associated B. fragilis isolates formed a sub-species or distinct population of the genomes collected. An initial multi-locus sequence typing (MLST) analysis, based on the B. fragilis MLST-typing scheme, revealed that the CRC isolates did not belong to any specific sequence types or clonal complexes (Supplementary Data)23.
To perform a more comprehensive phylogenetic analysis, all isolates were used to create a pangenome. The pangenome consisted of 16,134 unique genes, with a core genome of 2,232 genes shared across all isolates. A phylogenetic tree was then created based on single-nucleotide polymorphisms (SNPs) in the core genes of the isolates. The analysis revealed that neither the CRC-associated isolates nor the controls formed any distinct phylogenetic clades (Fig. 1).
Fig. 1. Unrooted phylogenetic tree, based on single nucleotide polymorphisms in the core genes, of the Bacteroides fragilis isolates used in this study.
The abbreviations used in the isolate names are BF (B. fragilis), BC (Blood culture), FA (Faeces) and CRC (Colorectal cancer). Genomes are shaded in colour according to source. The presence of the different B. fragilis toxin genes bft 1, bft 2, bft 3 along with the sequence type (ST) of each isolate is also specified.
CRC-associated B. fragilis isolates share common accessory genes
As the CRC-associated isolates were not phylogenetically distinct based on the analysis of the core genes, attention was shifted to the non-core/accessory genes. On average, the genome of each isolate consisted of 52.5 ± 2.2% core genes, indicating a high degree of genetic variability within B. fragilis. Examining core genes in each division separately increased the number of core genes in each isolate to 68.9 ± 3.1% (Division I, cfiA negative) and 69.5 ± 2.0% (Division II, cfiA positive).
To identify whether any accessory genes or genetic variations were associated with patient CRC status, a pangenome-wide association study (GWAS) was performed. In the GWAS, a population-agnostic Fisher’s exact test was applied to each gene in the pangenome in relation to the patient’s CRC status. In total, 68 unidentified accessory genes were significantly enriched within the CRC-associated isolates; however, the hits did not survive Bonferroni correction (Supplementary Data). The top two genes had sensitivity and specificity scores of 75% and 90%, with the remaining 66 genes showing scores of 37.5% and 100%, respectively (Supplementary Data). All the identified CRC-associated accessory genes were of unknown function in B. fragilis.
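To make the per-gene test concrete, the sketch below computes a two-sided Fisher's exact test for a single gene and compares it with a Bonferroni threshold. It is an illustration, not Scoary's implementation. The 2 × 2 counts are inferred from the reported sensitivity and specificity (75% of 8 CRC isolates is 6 carriers; 90% specificity among 40 controls leaves 4 carriers), and using all 16,134 pangenome genes as the number of tests is an assumption about how the correction was applied.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


# Inferred counts: gene present in 6 of 8 CRC isolates and 4 of 40 controls.
p_top_gene = fisher_exact_two_sided(6, 2, 4, 36)
# Inferred counts: gene present in 3 of 8 CRC isolates and 0 of 40 controls.
p_other_gene = fisher_exact_two_sided(3, 5, 0, 40)

# Assumption: one test per gene in the 16,134-gene pangenome.
bonferroni_threshold = 0.05 / 16134
print(p_top_gene, p_other_gene, bonferroni_threshold)
print("survives Bonferroni:", p_top_gene < bonferroni_threshold)
```

Under these assumptions both genes are nominally significant at 0.05, yet neither falls below the corrected threshold, which is in line with the statement that the hits did not survive Bonferroni correction.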
The control isolates (CTR) consisted of matched control cases with CRC-negative bacteraemia isolates from the same demographic group as the CRC-associated isolates (n = 24) as well as naive faecal isolates from children (n = 16). The pangenome analysis was repeated for each CTR group separately and showed a similar absence of the identified accessory genes enriched in CRC isolates. No other accessory genes were associated with patient CRC status in any of these analyses.
The bft genes were not among the predicted CRC-associated genes. To identify their distribution, a BLAST analysis of the three bft gene variants was conducted against the isolates. This revealed that only 6/48 isolates encoded a bft gene (CRC = 2, CTR = 4). Both the bft-1 and bft-2 genes were present, but bft-3 was absent (Fig. 1).
CRC-associated B. fragilis isolates are infected with prophages
The nucleotide sequences of the CRC-associated genes identified using GWAS were extracted and, in an effort to elucidate their function, a homology search against genes of known function was performed. Interestingly, the majority of the genes with predicted protein products exhibited homology to previously identified phage genes, and all of the genes were located in close proximity to each other within the individual B. fragilis genomes (Supplementary Data).
A closer inspection of the genomic context of these genes and their flanking regions revealed that all genes were part of two distinct prophages inserted within the genome of B. fragilis (Fig. 2A, B). The structure of the prophages varied slightly across the isolates due to the natural genomic mosaicism displayed by phages. This meant that the 68 CRC-associated genes identified through GWAS were those conserved across all instances of the prophages, whereas the genes that varied were left undetectable by GWAS. Following this finding, the prophages were investigated in their entirety.
Fig. 2. Overview of the two CRC-associated prophage gene clusters and their genomic location.
A, B Prophage clusters are ordered by nucleotide sequence homology. The parent bacterial strain is designated to the right, along with the name of the prophage itself. A Synteny plot of all instances of the FU phage. B Synteny plot of all instances of the ODE phage. C Insertion sites of the Bacteroides phage ODE and FU prophages in the B. fragilis reference genome NCTC_9343.
The prophages are part of two novel families of Caudoviricetes
In total 13 prophages were identified across the isolates, and the complete sequence of each was extracted, characterized, and named according to the guidelines from the International Committee on Taxonomy of Viruses (ICTV)35. Taxonomy was inferred against known phages and all genes were annotated by sequence homology (Supplementary Data).
Through sequence identity, it was determined that the prophages belonged to two distinct and unknown families of the class Caudoviricetes. The closest known families to both were Autographiviridae and Herelleviridae (Fig. 3). However, the genome length did not match the range of the Herelleviridae family and the characteristic RNAP gene of Autographiviridae was also absent.
Fig. 3. Prophage phylogeny.
A Intergenomic sequence identity matrix for the prophages identified in the B. fragilis isolate collection. The sequence identity is stated specifically for each alignment in the matrix and shaded in blue according to the degree of similarity. The genome length ratio of the aligned sequences is shown in greyscale and aligned genome fraction that participated in the alignment is shown in yellow. B Unrooted phylogenetic proteomic tree of all phage genomes from VIPTree. The position of Bacteroides phage FU and ODE are indicated by the red star along with their nearest neighbours.
The first prophage group consisted of eight unique species of the same genus. These are referred to as Bacteroides phage FU01-08 (Fig. 2A, B). The Bacteroides phage FU group occupied the same insertion site and shared the left and right attachment sites (attL and attR), a pair of 16 bp direct repeat sequences (5´-TCCCTTCGGGAGTACA-3´) flanking the prophage sequences. The corresponding attB site was identified in control isolates, situated in the BF9343_RS17395 tRNA (locus tags from NCTC_9343) that is flanked by the hxpA and asnA genes (Fig. 2C). Both the tRNA and flanking genes are part of the B. fragilis core genome. Only Bacteroides phage FU08 was situated outside the insertion site. It also lacked the attR and attL sequences, indicating that it might be degraded, having undergone loss of genes.
The second prophage group consisted of three members of a single species (sequence identity > 95%), which we refer to as Bacteroides phage ODE. All three members of the Bacteroides phage ODE group occupied the same genomic insertion site of B. fragilis and shared a pair of 19 bp direct repeat attR and attL sequences (5´-TTAGATAGGGGTTCGATTC-3´). The attB site was located in the BF9343_RS20625 tRNA flanked by the porA and BF9343_4128 core genes (Fig. 2C). All isolates with Bacteroides phage ODE were also infected with Bacteroides phage FU.
Metagenomic sequencing data of the faecal CRC microbiome
To further investigate the link between Bacteroides phage FU and ODE and CRC, we sought to test our hypothesised association in a larger, independent dataset. We selected published shotgun metagenomic data from faecal samples of 877 patients (CRC = 434, CTR = 443) spanning multiple nationalities (Table 1, Supplementary Data). The cohorts are referred to by the predominant nationality of their main study population: Austria (AUS), Germany (GER), France (FRA), United States (US), China (CHN), and Japan (JPN).
Table 1.
Metagenomic faecal studies of CRC microbiota included in this study divided by each cohort and patient status
| Cohort code | Status | Cases | Age | Sex (M/F) | Bases per sample | N50 | ODE | FU | Both | Reference |
|---|---|---|---|---|---|---|---|---|---|---|
| FRA | CRC | 91 | 64.66 ± 12.23 | 54/37 | 5.01 Gbp | 1751 | 27 | 27 | 39 | Zeller et al. 2014 PRJEB6070 |
| FRA | CTR | 66 | 58.77 ± 12.77 | 33/33 | 5.23 Gbp | 1627 | 15 | 13 | 24 | Zeller et al. 2014 PRJEB6070 |
| AUS | CRC | 46 | 67.07 ± 10.91 | 28/18 | 5.12 Gbp | 3940 | 15 | 7 | 19 | Feng et al. 2015 PRJEB7774 |
| AUS | CTR | 63 | 67.06 ± 6.73 | 37/26 | 4.66 Gbp | 5227 | 15 | 6 | 16 | Feng et al. 2015 PRJEB7774 |
| GER | CRC | 70 | 68.31 ± 10.66 | 44/26 | 1.99 Gbp | 1655 | 26 | 15 | 35 | Wirbel et al. 2019 PRJEB6070 |
| GER | CTR | 60 | 57.57 ± 11.08 | 32/28 | 4.07 Gbp | 1733 | 16 | 9 | 20 | Wirbel et al. 2019 PRJEB6070 |
| US | CRC | 52 | 61.8 ± 13.6 | 37/15 | 5.18 Gbp | 1951 | 25 | 17 | 26 | Vogtmann et al. 2016 PRJEB12449 |
| US | CTR | 52 | 62.2 ± 11.0 | 37/15 | 5.32 Gbp | 2386 | 15 | 11 | 15 | Vogtmann et al. 2016 PRJEB12449 |
| CHN1 | CRC | 50 | 63.58 ± 8.29 | 25/25 | 10.98 Gbp | 2358 | 16 | 27 | 32 | Yang et al. 2021 PRJNA763023 |
| CHN1 | CTR | 50 | 63.36 ± 9.67 | 37/13 | 10.15 Gbp | 1488 | 9 | 13 | 16 | Yang et al. 2021 PRJNA763023 |
| JPN | CRC | 40 | 59.05 ± 12.83 | 21/19 | 6.11 Gbp | 2703 | 11 | 10 | 16 | Yachida et al. 2019 PRJDB4176 |
| JPN | CTR | 40 | 63.62 ± 12.36 | 23/17 | 6.98 Gbp | 3743 | 15 | 9 | 19 | Yachida et al. 2019 PRJDB4176 |
| CHN2 | CRC | 53 | 64.8 ± 5.57 | 34/19 | 11.51 Gbp | 4967 | 17 | 26 | 33 | Liu et al. 2022 PRJNA731589 |
| CHN2 | CTR | 68 | 62 ± 5.09 | 44/24 | 10.72 Gbp | 2634 | 8 | 17 | 21 | Liu et al. 2022 PRJNA731589 |
| CHN3 | CRC | 32 | - | - | 11.31 Gbp | 4930 | 6 | 18 | 21 | Gao et al. 2022 PRJNA514108 |
| CHN3 | CTR | 44 | - | - | 12.67 Gbp | 6298 | 6 | 16 | 18 | Gao et al. 2022 PRJNA514108 |
The patient statistics are given as mean age with standard deviation (SD) in years, biological sex, and disease status: CRC (colorectal cancer) and CTR (controls). Sequencing of each group is given as the average number of raw bases per sample before assembly and the average N50 of the assembled contigs. The number of samples in which Bacteroides phage FU and Bacteroides phage ODE were detected is also stated. Literature references and NCBI BioProject identifiers used to retrieve the data are given last.
Gender distribution was skewed with 486 male and 315 female subjects across all samples (64.8% male). However, this bias was almost consistent across the CRC (65.8% male) and CTR (64.2% male) groups. The mean age of patients ranged from 57 to 68 years and stayed relatively consistent across CRC and CTR subjects within each cohort, except for the CTR groups being slightly younger on average in the FRA (CRC = 64.66 ± 12.23, CTR = 58.77 ± 12.77) and GER (CRC = 68.31 ± 10.66, CTR = 57.57 ± 11.08) cohorts (Table 1, Supplementary Data). Age and gender data were not available for the individual patients in CHN3.
Sequencing depth across cohorts was in the range of 5 Gbp ± 1 Gbp for most samples, with the least data per sample in GER (CRC = 1.99 Gbp, CTR = 4.07 Gbp) and the most data per sample in CHN3 (CRC = 11.31 Gbp, CTR = 12.67 Gbp). All bioinformatic analysis and processing of raw reads were conducted identically across samples to reduce technical bias. The sample N50 values following assembly ranged from 268 bp to 34,416 bp, with mean and median values across all cohorts of 2,468 bp and 1,674 bp, respectively (Table 1, Fig. 4A).
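For readers unfamiliar with the N50 statistic used above, it is the contig length at which contigs of that length or longer contain at least half of the total assembled bases. The sketch below shows the standard calculation on invented contig lengths; it is not the QUAST implementation.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("contig_lengths must not be empty")
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length


# Invented contig lengths (bp) for illustration only.
toy_contig_lengths = [100, 200, 300, 400, 500]
print(n50(toy_contig_lengths))
```

Because most sample N50 values fall far below the length of a complete phage genome, a phage is expected to be split across several contigs, which is why the screening relies on multiple partial matches.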
Fig. 4. Metagenome metrics according to patient disease state with colorectal cancer (CRC) positive (red) and CRC negative (blue).
A Contig length distribution for each cohort, grouped by patient CRC status. The contigs are sorted according to the number of contigs of equal or greater length than that given on the x-axis. B, C Bacteroides phage FU and ODE distribution across metagenomic sequencing data from faecal samples (n = 877). A sample was considered phage positive when multiple fragments, each spanning several genes, were identified. Samples are distributed according to whether the patient had CRC (red) or was CRC negative (blue). Statistical significance is based on a two-sided mixed effects logistic regression. Country codes are based on the nationality of the majority in each patient population (AUS Austria, CHN China, GER Germany, FRA France, JPN Japan, and US United States).
Bacteroides phage FU and ODE show cross-cohort enrichment in CRC patients, based on faecal metagenomic sequencing data
The assembled metagenomic contigs were screened for the presence of Bacteroides phage FU and ODE groups. As the N50 value for each sample placed most contigs far below the length of the phage genomes, we sought to identify sequence matches of >2,000 bp in length. This length threshold was chosen based on the mean and median N50 values across the cohorts (2,468 bp and 1,674 bp) and had the added benefit of mitigating the penalty of the genomic mosaicism of the phage genomes (Fig. 2A, B, Fig. 4A). To increase specificity, only samples containing multiple such matches across the Bacteroides phage FU and ODE genomes were considered positive.
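The positivity rule described above can be sketched as a simple filter over alignment hits. This is an illustration only: the input layout (sample ID plus start and end coordinates on the phage genome), the function name `phage_positive_samples`, and the reading of "multiple" as two or more distinct matches are assumptions, the last one following the Discussion's statement that two or more unique matching sequences were required.

```python
from collections import defaultdict

MIN_MATCH_BP = 2000  # matches must be longer than this (threshold from the text)
MIN_MATCHES = 2      # assumption: "multiple" means two or more unique matches


def phage_positive_samples(hits, min_len=MIN_MATCH_BP, min_matches=MIN_MATCHES):
    """Return the set of sample IDs that pass the screening rule.

    hits: iterable of (sample_id, phage_start, phage_end) with 1-based,
    inclusive coordinates on the phage genome (hypothetical layout).
    """
    unique_matches = defaultdict(set)
    for sample_id, start, end in hits:
        lo, hi = sorted((start, end))  # tolerate reverse-strand hits
        if hi - lo + 1 > min_len:
            unique_matches[sample_id].add((lo, hi))
    return {s for s, m in unique_matches.items() if len(m) >= min_matches}


# Toy data: S1 has two distinct long matches, S2 has one match reported on
# both strands, and S3 has only short matches.
toy_hits = [
    ("S1", 100, 2600), ("S1", 9000, 12000),
    ("S2", 100, 2600), ("S2", 2600, 100),
    ("S3", 500, 1500), ("S3", 4000, 4900),
]
print(phage_positive_samples(toy_hits))
```

Requiring more than one distinct match trades some sensitivity for specificity, since a single shared structural gene is then not enough to call a sample positive.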
This screening detected DNA from the B. fragilis phages more often in the metagenome data of CRC patients than healthy controls (CRC = 52.00%, CTR = 34.38%), revealing a significant association between these specific B. fragilis phages and CRC (Fig. 4B, C, Table 1). This association was true across all cohorts except for the Bacteroides phage ODE in the JPN cohort (Fig. 4B). Bacteroides phage FU displayed the most significant CRC enrichment (CRC = 35.38%, CTR = 21.88%) and Bacteroides phage ODE was detected in more samples overall (CRC = 32.38%, CTR = 23.13%).
A mixed effects logistic regression with cohort as a random effect found that DNA from both Bacteroides phages was significantly enriched in patients with CRC compared to controls. Patients with CRC had about twice the odds of having detectable levels of Bacteroides phage FU (OR = 2.017, p = 1.023E-5, SE = 0.159) or of either of the two phages (OR = 2.05, p = 2.522E-7, SE = 0.139), and ~1.7 times the odds for Bacteroides phage ODE alone (OR = 1.69, p = 6.086E-4, SE = 0.154).
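The reported odds ratios, standard errors and p-values can be related to one another with a short calculation. The sketch below assumes that each SE is the standard error of the log odds ratio and that the p-values come from two-sided Wald tests; neither assumption is stated in this section, and the function name `wald_summary` is illustrative. Under these assumptions, the calculation gives p-values of the same order of magnitude as those reported.

```python
import math


def wald_summary(odds_ratio, se_log_odds, z_crit=1.959964):
    """Return (z, two-sided p, 95% CI for the OR) under a Wald approximation.

    Assumes se_log_odds is the standard error of log(odds_ratio).
    """
    beta = math.log(odds_ratio)
    z = beta / se_log_odds
    # Two-sided tail probability of a standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    ci = (math.exp(beta - z_crit * se_log_odds),
          math.exp(beta + z_crit * se_log_odds))
    return z, p, ci


for label, odds_ratio, se in [
    ("FU", 2.017, 0.159),
    ("ODE", 1.69, 0.154),
    ("either phage", 2.05, 0.139),
]:
    z, p, (low, high) = wald_summary(odds_ratio, se)
    print(f"{label}: z = {z:.2f}, p = {p:.2e}, 95% CI = {low:.2f} to {high:.2f}")
```

Small differences from the published p-values are expected because the odds ratios and standard errors are rounded in the text.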
All the matching sequence fragments from the previous analysis were also mapped along the length of the Bacteroides phage genomes. This revealed that the Bacteroides phage FU and ODE genomes were enriched in CRC patients along the entirety of their genomes (Fig. 5 and Figures S3–S5).
Fig. 5. Coverage of the Bacteroides phage FU02 and ODE02 genomes from BF_BC_ODE_DK_2016_CRC across the metagenomic cohort screening on patients with colorectal cancer (CRC, red, N = 434) and controls (CTR, blue, N = 443).
Assembled metagenomic contigs were screened with MASH and BLASTN to identify matching sequence fragments as described in Methods. The raw genome coverage plots contain all the unfiltered matches, counted once for each patient, and the genome coverage plot contains only quality hits (bitscore = >50). The genome-weighted coverage contains all reads, with the coverage of individual sequences accumulated.
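One way to compute a per-position coverage in which each patient is counted at most once is sketched below. The input layout (patient ID plus 1-based inclusive coordinates on the phage genome) and the function name `patient_coverage` are assumptions made for illustration; the sketch is not the authors' plotting code.

```python
def patient_coverage(hits, genome_length):
    """Count, for every genome position, how many distinct patients cover it.

    hits: iterable of (patient_id, start, end), 1-based inclusive coordinates
    (hypothetical layout). Overlapping hits from one patient count once.
    """
    covered_by_patient = {}
    for patient_id, start, end in hits:
        lo, hi = sorted((start, end))
        lo, hi = max(lo, 1), min(hi, genome_length)  # clip to the genome
        covered_by_patient.setdefault(patient_id, set()).update(range(lo, hi + 1))
    coverage = [0] * genome_length
    for positions in covered_by_patient.values():
        for pos in positions:
            coverage[pos - 1] += 1
    return coverage


# Toy genome of 6 bp: P1 has two overlapping hits, P2 has one.
print(patient_coverage([("P1", 1, 3), ("P1", 2, 5), ("P2", 4, 6)], 6))
```

Computing this separately for CRC and control patients yields two curves that can be compared along the genome, which is the kind of comparison shown in Fig. 5.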
As a control for CRC-associated phage enrichment, the above analysis was performed on the Escherichia phage P2. It was found not to be associated with patient CRC status (OR = 1.24, p = 0.278, SE = 0.199, Figure S6A). The genome coverage of the P2 phage was likewise not enriched among CRC patients (Figure S6B, C).
The insertion sites of each phage in B. fragilis were analysed across all samples to identify any prophages inserted at these sites that had been sequenced in their entirety. The Bacteroides phage FU could be recovered in full from eight metagenomic samples, six of which were from CRC patients (Figure S2, Supplementary Data). No other phage species were found to occupy this specific insertion site. The Bacteroides phage ODE was found in prophage form in a single control subject from the US cohort (Supplementary Data). The additional phages identified had an intergenomic sequence identity of >70% to the existing Bacteroides phages, placing them within the same families (Figure S2).
Finally, to evaluate the Bacteroides phage FU and ODE sequences as CRC biomarkers, metagenomic assemblies were randomly partitioned into training (80%) and test (20%) sets, balanced across CRC and control groups. The phage genomes were divided into 50 bp windows, and a greedy algorithm was then used to identify the best combination of six windows by maximizing Youden’s J. The resulting six-window panel detected 40.6% of CRC cases at 83.3% specificity in the test dataset.
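A greedy forward selection of this kind can be sketched as follows. Several details are assumptions rather than the authors' implementation: each sample is represented as the set of window indices detected in it, a sample is called positive if any window in the panel is present, ties are broken by the lowest window index, and the names `youden_j` and `greedy_panel` are hypothetical.

```python
def youden_j(calls, labels):
    """Youden's J = sensitivity + specificity - 1 for boolean calls and labels."""
    tp = sum(1 for c, y in zip(calls, labels) if c and y)
    fn = sum(1 for c, y in zip(calls, labels) if not c and y)
    tn = sum(1 for c, y in zip(calls, labels) if not c and not y)
    fp = sum(1 for c, y in zip(calls, labels) if c and not y)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both cases and controls are required")
    return tp / (tp + fn) + tn / (tn + fp) - 1.0


def greedy_panel(presence, labels, panel_size=6):
    """Greedily add the window that maximises J on the training data.

    presence: list of sets of detected window indices, one set per sample.
    labels:   list of booleans, True for CRC and False for control.
    """
    candidates = set().union(*presence)
    panel = []
    for _ in range(panel_size):
        best_window, best_j = None, float("-inf")
        for window in sorted(candidates.difference(panel)):
            trial = set(panel) | {window}
            # Assumed panel rule: positive if any panel window is detected.
            calls = [bool(sample & trial) for sample in presence]
            j = youden_j(calls, labels)
            if j > best_j:
                best_window, best_j = window, j
        if best_window is None:  # fewer candidate windows than panel_size
            break
        panel.append(best_window)
    return panel


# Toy training set: four CRC samples followed by four controls.
toy_presence = [{1, 2}, {1}, {3}, set(), {2}, set(), {3}, set()]
toy_labels = [True, True, True, True, False, False, False, False]
print(greedy_panel(toy_presence, toy_labels, panel_size=2))
```

For reference, the test-set performance reported above corresponds to J = 0.406 + 0.833 − 1 = 0.239. A greedy search is not guaranteed to find the globally best six-window combination, but it avoids evaluating every possible subset of windows.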
Discussion
Our unique collection of CRC-associated isolates allowed us to investigate the genomic composition of B. fragilis in unprecedented detail. We found that B. fragilis genomes are highly variable, with the average genome consisting of approximately 50% accessory genes. Of these accessory genes, two distinct prophages (Bacteroides phage FU and ODE) could differentiate CRC-associated isolates from those in patients who were not diagnosed with CRC. These findings were independently validated using whole genome metagenomic sequencing data from 680 patients, confirming their association with CRC.
To our knowledge, this study is the first to identify a CRC-specific bacterium and prophage interaction. B. fragilis and Caudoviricetes phages have each been separately implicated as among the most overrepresented taxa in CRC dysbiosis and development. Recent research reveals that the CRC-specific dysbiosis of the gut microbiome extends beyond bacteria to include viruses. In particular, phages of the class Caudoviricetes are found to be significantly more abundant among CRC patients54,55. While these phages are known to infect Gram-negative bacterial hosts, such as B. fragilis, their role in CRC dysbiosis remains unknown55.
There are several limitations to this study due to differences between the sampling of the metagenomic samples and our bacteraemia isolates. Firstly, the metagenomic data were obtained from faecal samples, whereas our CRC-associated isolates were from cases of bacteraemia. However, finding the same prophage infections enriched within CRC patients in both sample groups could strengthen the hypothesis that CRC-mediated breaches enable bloodstream access for the bacteria at the CRC site12–14. Secondly, our CRC-negative B. fragilis bacteraemia control subjects were not diagnosed with CRC in the five years following bacteraemia. In contrast, the control subjects of the metagenomic cohorts were not always subject to a colonoscopy and their disease status following sampling was unavailable15. These limitations increase the likelihood of false negative cases of CRC in the control groups from the metagenomics cohorts.
Another limitation of this study was the sometimes fragmented and shallow metagenomic sequencing of the faecal samples. The assembled contigs were on average very short, with a median N50 of 1,674 bp (Table 1, Fig. 4A). In our initial screening we decided to identify matches to our phage genomes at a minimum length within this size range, which could also reduce the sensitivity of the screening to the high levels of genome mosaicism. However, as phages may share structural genes or other genetic elements, it was decided that two or more unique matching sequences would be required for a possible match. This screening showed that DNA fragments from the Bacteroides FU and ODE phages were detectable and indeed significantly associated with patient CRC status, with metagenomes from CRC patients more often containing these phages than those from controls (Fig. 4C). This was also true across all eight cohorts for both phages, except for the JPN cohort, where Bacteroides phage ODE was found in more control subjects (Fig. 4B). Mapping all the sequence matches along the genome sequence of each Bacteroides FU and ODE phage revealed that the CRC association held across the phage sequences and was thus not a product of an isolated sequence motif (Fig. 5 and Figures S3–S5).
Surprisingly, intact versions of the Bacteroides phages were also retrieved from the metagenomic contigs, even though contigs of at least 50 kb in length made up less than 0.01% of contigs (Fig. 4A, Supplementary Data). The Bacteroides phage FU was found in eight unique samples from the US, JPN, CHN2, and AUS cohorts, with six of these being from CRC patients (Fig. 4B, Supplementary Data).
One possible explanation for the higher prevalence of these B. fragilis-infecting phages among CRC patients is the Kill-the-Winner effect. In this scenario, the increased abundance of B. fragilis in CRC patients makes it more likely for these bacteria to encounter phage particles, sparking an infection cascade56,57. Consequently, the presence of Bacteroides phages FU and ODE could be attributed to the increased amount of B. fragilis documented in the microbiota of CRC patients. In that case, the phage infections would be symptomatic of an increased amount of B. fragilis rather than a direct causative agent in CRC, although the phages would remain potentially useful biomarkers.
To investigate whether the CRC association of the Bacteroides phages was a product of an overall phage enrichment in CRC patients, we screened for the Escherichia phage P2, another known example of a Caudoviricetes phage mediating lysogenic infections of a CRC-implicated driver bacterium. This revealed no significant enrichment amongst CRC patients.
Both increasing age and CRC have been shown to increase phage diversity and the levels of specific phages28,55. These phages may influence the composition and behaviour of the microbiome, ultimately leading to CRC dysbiosis, either through lysis of non-CRC-associated bacteria or through lysogenic conversion. In lysogenic conversion, the prophage alters the phenotype of the infected host by providing genes for the bacterium to express. In the case of CRC, these phenotypic changes might alter the metabolism or pathogenic potential of the lysogen58,59.
The bacterial genes in the vicinity of the prophage insertion sites may also be disrupted by the prophage insertion creating knock-out mutants, or overexpression mutants by transcriptional readthrough from the prophage60. It is difficult to predict the effect of such alterations introduced by Bacteroides phage FU and ODE as the flanking genes of the attB sites for both are to our knowledge relatively unknown. However, despite them being core genes, they have been found to be non-essential in B. fragilis for growth in both complex and basic medium61.
As B. fragilis is considered a driver species in CRC development, lysogenic conversion mediated by prophages could partially explain the paradox of B. fragilis being ubiquitous in the healthy microbiome, while still acting as a driver bacterium in CRC development6,62–64. Notably, a case of lysogenic conversion in a Bacteroides species by a Caudoviricetes phage has been reported. In this case, lysogeny altered the expression of 115 genes in Bacteroides vulgatus (recently renamed Phocaeicola vulgatus), repressing bile acid metabolism58.
A previous cohort study investigating the gut microbiome and its relation to cancer found phages similar to Bacteroides phage ODE7. Interestingly, among their shared genes, the paper reported homology to the anthrax toxin genes found in Bacillus anthracis. This toxin acts through the ANTXR1 (TEM8) receptor in humans, which is known to play a pivotal role in tumorigenesis and is upregulated in human endothelial cells lining the tumour vasculature of CRC (Figure S1)7,65.
Current CRC microbiome studies often investigate bacteria and viruses separately. Our findings suggest integrating these studies and accounting for prophages. However, the genomic mosaicism of phages poses a challenge in larger datasets. Our highly specific dataset based on B. fragilis isolates made it possible to identify these prophage infections before moving on to analyse the metagenomic data. Previous studies have likely overlooked these phages due to the high genomic mosaicism of the phages. Future studies could address this by identifying common prophage insertion sites or conserved sequence motifs.
These phages also present a possible option for early detection and intervention in CRC, as they were discovered in patients who were not yet diagnosed with CRC. Future studies should focus on the diagnostic potential of these phages in individuals with or at risk of developing CRC. The most common non-invasive CRC screening method is the faecal immunochemical test (FIT), where occult blood is identified66. A PCR test for the presence of Bacteroides phage FU or ODE could be performed on these samples in addition to the screening for occult blood. This PCR could either target unique sequence motifs of these phages or test the insertion sites in B. fragilis for prophage insertions. Some countries have already implemented similar PCR panels targeting tumour mutations in the patient DNA66. As a preliminary test for such a panel, six sequence fragments from the FU and ODE phage genomes were identified using a greedy algorithm optimising Youden’s J; this panel detected 40.6% of CRC cases with 83.3% specificity. However, the actual PCR performance on faeces is dependent on many factors, and this panel should only be viewed as a shortlist for future studies. It should also be noted that detecting the phage sequences directly would capture both virions and prophages without distinguishing whether the phage is integrated, in contrast to targeting the insertion site. Thus, choosing the correct target depends on whether the lysogenic or lytic stages of infection are relevant in a CRC setting.
In conclusion, we believe our study provides compelling evidence for CRC-associated prophage infections of B. fragilis, a major contributor to CRC dysbiosis. The newly discovered Bacteroides phage FU and ODE genome sequences are significantly associated with CRC across multiple patient cohorts and nationalities. Despite the limitations of the sequencing material, this consistent presence highlights their potential value in diagnostics.
Acknowledgements
This study was supported by unrestricted grants from the Region of Southern Denmark, the Harboe Foundation, and the Novo Nordisk Foundation.
Author contributions
U.S.J. conceived the project and acquired funding. U.S.J., J.K.M., J.E.C., R.B.D., and T.V.S. provided bacterial isolates. F.D.N. performed the main bioinformatic analyses for the paper. U.S.J., M.G.J., T.V.S., M.L.S., and J.M.J. provided supervision during the project. U.S.J. and F.D.N. wrote the initial draft of the manuscript. All authors read, reviewed, and worked on the manuscript.
Peer review
Peer review information
Communications Medicine thanks Yasser Morsy and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Data availability
The genomes sequenced for this study have been deposited at GenBank under the BioProject accession numbers: PRJNA1036030 and PRJNA910333. Raw data used for creating Fig. 4 is available in the Supplementary Data.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s43856-026-01403-1.
References
- 1. Bray, F. et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 74, 229–263 (2024).
- 2. Quinn, M. J. et al. Cancer mortality trends in the EU and acceding countries up to 2015. Ann. Oncol. 14, 1148–1152 (2003).
- 3. Haggar, F. & Boushey, R. Colorectal cancer epidemiology: Incidence, mortality, survival, and risk factors. Clin. Colon Rectal Surg. 22, 191–197 (2009).
- 4. Wong, S. H. & Yu, J. Gut microbiota in colorectal cancer: mechanisms of action and clinical applications. Nat. Rev. Gastroenterol. Hepatol. 16, 690–704 (2019).
- 5. Aran, V., Victorino, A. P., Thuler, L. C. & Ferreira, C. G. Colorectal Cancer: Epidemiology, Disease Mechanisms and Interventions to Reduce Onset and Mortality. Clin. Colorectal Cancer 15, 195–203 (2016).
- 6. Tjalsma, H., Boleij, A., Marchesi, J. R. & Dutilh, B. E. A bacterial driver-passenger model for colorectal cancer: Beyond the usual suspects. Nat. Rev. Microbiol. 10, 575–582 (2012).
- 7. Nishijima, S. et al. Extensive gut virome variation and its associations with host and environmental factors in a population-level cohort. Nat. Commun. 13, 5252 (2022).
- 8. Baas, F. S., Brusselaers, N., Nagtegaal, I. D., Engstrand, L. & Boleij, A. Navigating beyond associations: Opportunities to establish causal relationships between the gut microbiome and colorectal carcinogenesis. Cell Host Microbe 32, 1235–1247 (2024).
- 9. Piquer-Esteban, S., Ruiz-Ruiz, S., Arnau, V., Diaz, W. & Moya, A. Exploring the universal healthy human gut microbiota around the World. Comput. Struct. Biotechnol. J. 20, 421–433 (2022).
- 10. Wexler, H. M. Bacteroides: the good, the bad, and the nitty-gritty. Clin. Microbiol. Rev. 20, 593–621 (2007).
- 11. Justesen, U. S. et al. Bacteremia with Anaerobic Bacteria and Association with Colorectal Cancer: A Population-based Cohort Study. Clin. Infect. Dis. 75, 1747–1753 (2022).
- 12. Kouzu, K., Tsujimoto, H., Kishi, Y., Ueno, H. & Shinomiya, N. Bacterial Translocation in Gastrointestinal Cancers and Cancer Treatment. Biomedicines 10, 380 (2022).
- 13. Kwong, T. N. Y. et al. Association Between Bacteremia From Specific Microbes and Subsequent Diagnosis of Colorectal Cancer. Gastroenterology 155, 383–390.e8 (2018).
- 14. Laupland, K. B., Edwards, F., Furuya-Kanamori, L., Paterson, D. L. & Harris, P. N. A. Bloodstream Infection and Colorectal Cancer Risk in Queensland Australia, 2000-2019. Am. J. Med. 136, 896–901 (2023).
- 15. Wirbel, J. et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat. Med. 25, 679–689 (2019).
- 16. Sydenham, T. V., Jensen, B. H., Petersen, A. M., Krogfelt, K. A. & Justesen, U. S. Antimicrobial resistance in the Bacteroides fragilis group in faecal microbiota from healthy Danish children. Int. J. Antimicrob. Agents 49, 573–578 (2017).
- 17. Andrews, S. FastQC: A Quality Control Tool for High Throughput Sequence Data. Babraham Bioinformatics. Available from: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (2010).
- 18. Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A. & Korobeynikov, A. Using SPAdes De Novo Assembler. Curr. Protoc. Bioinforma. 70, 10.1002/cpbi.102 (2020).
- 19. Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).
- 20. Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One 11, e0163962 (2016).
- 21. Page, A. J. et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31, 3691–3693 (2015).
- 22. Brynildsrud, O., Bohlin, J., Scheffer, L. & Eldholm, V. Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary. Genome Biol. 17, 238 (2016).
- 23. Nielsen, F. D., Skov, M. N., Sydenham, T. V. & Justesen, U. S. Development and Clinical Application of a Multilocus Sequence Typing Scheme for Bacteroides fragilis Based on Whole-Genome Sequencing Data. Microbiol. Spectr. 11, 10.1128/spectrum.05111-22 (2023).
- 24. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS One 5, e9490 (2010).
- 25. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinforma. 10, 421 (2009).
- 26. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
- 27. Sanderson, T., Bileschi, M. L., Belanger, D. & Colwell, L. J. ProteInfer, deep neural networks for protein functional inference. eLife 10.7554/eLife.80942 (2023).
- 28. Van Espen, L. et al. A Previously Undescribed Highly Prevalent Phage Identified in a Danish Enteric Virome Catalog. mSystems 6, 10.1128/msystems.00382-21 (2021).
- 29. Gregory, A. C. et al. The Gut Virome Database Reveals Age-Dependent Patterns of Virome Diversity in the Human Gut. Cell Host Microbe 28, 724–740.e8 (2020).
- 30. Jespersen, M. G., Hayes, A. & Davies, M. R. Corekaburra: pan-genome post-processing using core gene synteny. J. Open Source Softw. 7, 4910 (2022).
- 31. Katoh, K., Rozewicki, J. & Yamada, K. D. MAFFT online service: Multiple sequence alignment, interactive sequence choice and visualization. Brief. Bioinform. 10.1093/bib/bbx108 (2018).
- 32. Sullivan, M. J., Petty, N. K. & Beatson, S. A. Easyfig: A genome comparison visualizer. Bioinformatics 27, 1009–1010 (2011).
- 33. Nishimura, Y. et al. ViPTree: The viral proteomic tree server. Bioinformatics 33, 2379–2380 (2017).
- 34. Shang, J., Peng, C., Liao, H., Tang, X. & Sun, Y. PhaBOX: A web server for identifying and characterizing phage contigs in metagenomic data. Bioinforma. Adv. 3, 10.1093/bioadv/vbad101 (2023).
- 35. Adriaenssens, E. & Brister, J. R. How to name and classify your phage: An informal guide. Viruses 9, 70 (2017).
- 36. Brister, J. R., Ako-Adjei, D., Bao, Y. & Blinkova, O. NCBI viral Genomes resource. Nucleic Acids Res. 43, D571–D577 (2015).
- 37. Moraru, C., Varsani, A. & Kropinski, A. M. VIRIDIC — A Novel Tool to Calculate the Intergenomic Similarities of Prokaryote-Infecting Viruses. Viruses 12, 1268 (2020).
- 38. Feng, Q. et al. Gut microbiome development along the colorectal adenoma-carcinoma sequence. Nat. Commun. 6, 6528 (2015).
- 39. Zeller, G. et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol. Syst. Biol. 10, 10.15252/MSB.20145645 (2014).
- 40. Vogtmann, E. et al. Colorectal Cancer and the Human Gut Microbiome: Reproducibility with Whole-Genome Shotgun Sequencing. PLoS One 11, e0155362 (2016).
- 41. Yang, Y. et al. Dysbiosis of human gut microbiome in young-onset colorectal cancer. Nat. Commun. 12, 1–13 (2021).
- 42. Yachida, S. et al. Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer. Nat. Med. 25, 968–976 (2019).
- 43. Liu, N. N. et al. Multi-kingdom microbiota analyses identify bacterial–fungal interactions and biomarkers of colorectal cancer across cohorts. Nat. Microbiol. 7, 238–250 (2022).
- 44. Gao, R. et al. Integrated Analysis of Colorectal Cancer Reveals Cross-Cohort Gut Microbial Signatures and Associated Serum Metabolites. Gastroenterology 163, 1024–1037.e9 (2022).
- 45. Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 50, D20–D26 (2022).
- 46. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
- 47. Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. MetaSPAdes: A new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
- 48. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
- 49. Lawrence, M. et al. Software for Computing and Annotating Genomic Ranges. PLoS Comput. Biol. 9, e1003118 (2013).
- 50. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria (2024).
- 51. Wickham, H. & Henry, L. purrr: Functional Programming Tools. Available from: https://cran.r-project.org/package=purrr (2023).
- 52. Zeileis, A. & Grothendieck, G. zoo: S3 Infrastructure for Regular and Irregular Time Series. J. Stat. Softw. 14, 1–27 (2005).
- 53. Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 1–13 (2019).
- 54. Hannigan, G. D., Duhaime, M. B., Ruffin, M. T., Koumpouras, C. C. & Schloss, P. D. Diagnostic potential and interactive dynamics of the colorectal cancer virome. mBio 9, 1–13 (2018).
- 55. Nakatsu, G. et al. Alterations in Enteric Virome Are Associated With Colorectal Cancer and Survival Outcomes. Gastroenterology 155, 529–541.e5 (2018).
- 56. Maslov, S. & Sneppen, K. Population cycles and species diversity in dynamic Kill-the-Winner model of microbial ecosystems. Sci. Rep. 7, 1–8 (2017).
- 57. Thingstad, T. F. Elements of a theory for the mechanisms controlling abundance, diversity, and biogeochemical role of lytic bacterial viruses in aquatic systems. Limnol. Oceanogr. 45, 1320–1328 (2000).
- 58. Campbell, D. E. et al. Infection with Bacteroides Phage BV01 Alters the Host Transcriptome and Bile Acid Metabolism in a Common Human Gut Microbe. Cell Rep. 32, 108142 (2020).
- 59. Dragoš, A. et al. Phages carry interbacterial weapons encoded by biosynthetic gene clusters. Curr. Biol. 31, 3479–3489.e5 (2021).
- 60. Casjens, S. R. & Hendrix, R. W. Bacteriophage lambda: early pioneer and still relevant. Virology 479–480, 310–330 (2015).
- 61. Veeranagouda, Y., Husain, F., Tenorio, E. L. & Wexler, H. M. Identification of genes required for the survival of B. fragilis using massive parallel sequencing of a saturated transposon mutant library. BMC Genomics 15, 429 (2014).
- 62. Drewes, J. L. et al. High-resolution bacterial 16S rRNA gene profile meta-analysis and biofilm status reveal common colorectal cancer consortia. npj Biofilms Microbiomes 3, 34 (2017).
- 63. Boleij, A. et al. The Bacteroides fragilis toxin gene is prevalent in the colon mucosa of colorectal cancer patients. Clin. Infect. Dis. 60, 208–215 (2015).
- 64. Wu, S. et al. A human colonic commensal promotes colon tumorigenesis via activation of T helper type 17 T cell responses. Nat. Med. 15, 1016–1022 (2009).
- 65. St Croix, B. et al. Genes expressed in human tumor endothelium. Science 289, 1197–1202 (2000).
- 66. Shaukat, A. & Levin, T. R. Current and future colorectal cancer screening strategies. Nat. Rev. Gastroenterol. Hepatol. 19, 521–531 (2022).