Abstract
We have developed a virus detection and discovery computational pipeline, Pickaxe, and applied it to NGS databases provided by The Cancer Genome Atlas (TCGA). We analyzed a collection of whole genome (WGS), exome (WXS), and RNA (RNA-Seq) sequencing libraries from 3,052 participants across 22 different cancers. NGS data from nearly all tumor and normal tissues examined contained contaminating viral sequences. Intensive computational and manual efforts are required to remove these artifacts. We found that several different types of cancers harbored Herpesviruses including EBV, CMV, HHV1, HHV2, HHV6 and HHV7. In addition to the reported associations of Hepatitis B and C virus (HBV & HCV) with liver cancer, and Human papillomaviruses (HPV) with cervical cancer and a subset of head and neck cancers, we found additional cases of HPV integrated in a small number of bladder cancers. Gene expression and mutational profiles suggest that HPV drives tumorigenesis in these cases.
Keywords: TCGA, metagenomics, cancer, papillomavirus, herpesvirus, virome
Introduction
Like all organisms, humans are constantly bombarded with microorganisms including viruses. Many diseases are the consequence of acute infection with viruses and in these cases the pathogen may be present for a limited time and be localized to specific tissues. Some viruses establish subclinical lifelong persistent or latent infections in their host thereby becoming part of the normal microbiome. Bacteriophages also form a major component of the human microbiome, their presence being indicative of their bacterial hosts. All species, including humans, must constantly respond to the myriad of endogenous viruses they harbor as well as to the transient presence of pathogenic viruses. Yet human viral ecology is poorly understood.
The Cancer Genome Atlas (TCGA) is a large database of deep sequencing of thousands of human tumors. This database has enabled the survey of viruses found in the tissue of cancer patients (1, 3-5, 10, 11, 14, 15, 17-19). Collectively these studies detected Human papillomavirus (HPV) sequences in nearly all cervical carcinomas as well as in a subset of squamous cell carcinomas of the head and neck; Hepatitis B and Hepatitis C viral sequences associated with a subset of liver cancers; and EBV gene expression in a subset of stomach cancers. Furthermore, these analyses detected viral associations with cancer that were previously unrecognized. For example, HPV was detected in a small number of bladder cancers and members of the Herpesvirus family were detected in some tumor and normal tissues. These studies provide an overview of the types of viruses present in human cancer and demonstrate the ability to identify molecular hallmarks associated with viral presence.
Oncogenic viruses contribute to tumorigenesis by expressing transforming proteins or ncRNAs that act on key cellular targets to alter cellular biology. In many cases the action of viral oncogenes results in the activation and repression of signaling pathways that are reflected in changes in cellular gene expression. In addition, integration of viral DNA is a hallmark of tumorigenesis for some viruses. Thus, specific changes in cellular gene expression patterns and/or viral integration events can be indicative of viral action driving tumorigenesis. In this manuscript, we report a survey of viral sequences present in TCGA data representing 22 distinct types of human cancers. This is the first study to combine DNA (WGS and WXS) and RNA sequencing data sets to search for viral sequences present in human cancers and to deduce their effects on cellular gene expression.
Results
Virus detection pipeline and removal of artifacts
To identify known viruses present in tumor or normal tissue from cancer patients we compared sequences in TCGA databases to the reference genomes for all known viral species in NCBI (Viral RefSeq). Unmapped reads from whole genome sequencing (WGS), whole exome sequencing (WXS), and RNA-seq libraries were obtained from TCGA BAM files. High quality reads were selected and aligned with Bowtie 2 (12) to hg19 or additional human databases to further remove human sequence. The remaining reads were aligned to Viral RefSeq or to a set of 135 human papillomavirus genomes. A total of 4,268 WXS, 3,727 RNA-Seq, and 269 WGS libraries from 3,052 participants representing 4,562 unique samples across 22 different cancers were analyzed (Table 1; Figure 1A). Our analysis found that practically every library examined, tumor or normal, contained viral sequences. These reads aligned to 2,406 of the 3,102 viral species in Viral RefSeq, covering 81 of 84 viral families. In most cases only a few sequence reads aligned to a given virus and genome coverage was poor, making taxonomic classification unreliable. Some viruses were represented by millions of reads with complete genome coverage. The vast majority of alignments were to bacteriophage sequences (Figure 1B). In addition to bacteriophage, we detected sequences of a number of plant, insect and fungal viruses. The majority of the sequences scored as viral proved to be contaminants or computational artifacts. A complete list of all viral sequences detected can be found in the Online Table (worksheet “Raw virus alignments”; http://pipaslab.webfactional.com/tmp/OnlineTable.xlsx). A description of how we detected and eliminated artifacts can be found in the Supplemental Results.
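The contrast drawn above between a few scattered reads and complete genome coverage can be made concrete by computing the fraction of reference bases covered by at least one alignment. The following is a minimal sketch, not part of Pickaxe: it assumes alignments are available as 0-based, half-open (start, end) intervals on a single viral reference, and the function name and toy values are hypothetical.

```python
def genome_coverage(alignments, genome_length):
    """Return the fraction of reference bases covered by >= 1 alignment.

    alignments: iterable of (start, end) intervals, 0-based and half-open.
    This interval representation is an assumption made for illustration.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    cur_start, cur_end = None, None
    for start, end in sorted(alignments):
        # Clip to the reference so stray coordinates cannot inflate coverage.
        start, end = max(start, 0), min(end, genome_length)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # Disjoint from the running interval: bank it and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the running interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length


# Toy example: three reads on a hypothetical 1,000-base genome.
toy_reads = [(0, 100), (50, 150), (900, 1000)]
toy_coverage = genome_coverage(toy_reads, 1000)
```

In this toy case the first two reads overlap and merge into a single 150-base interval, so 250 of 1,000 bases are covered. A virus supported only by a handful of reads would score near zero by this measure, whereas one with reads across the whole reference would approach 1.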
Table 1. Number of TCGA patients, samples and libraries processed.
| Cancer Abbr | Cancer | Patients: virus positive | Patients: total | Patients: RNA-Seq | Patients: DNA-Seq | Samples: total | Samples: tumors | Samples: normals | Samples: RNA-Seq (*) | Samples: WXS | Samples: WGS | Samples: both RNA & DNA (**) | Total libraries (***) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLCA | Bladder Urothelial Carcinoma | 20 (7.5%) | 268 | 267 | 253 | 541 | 271 | 270 | 289 | 526 | 145 | 274 | 960 |
| BRCA | Breast invasive carcinoma | 0 (0.0%) | 49 | 49 | 48 | 100 | 55 | 45 | 100 | 91 | 4 | 91 | 195 |
| CESC | Cervical squamous cell carcinoma | 252 (98.8%) | 255 | 252 | 208 | 467 | 256 | 211 | 256 | 420 | | 209 | 676 |
| COAD | Colon adenocarcinoma | 74 (18.2%) | 407 | 407 | 398 | 429 | 408 | 21 | 429 (431) | 402 | 69 | 417 | 902 |
| GBM | Glioblastoma multiforme | 2 (1.2%) | 162 | 162 | 162 | 169 | 169 | | 169 (339) | 167 | 4 | 167 | 510 |
| HNSC | Head and Neck squamous cell carcinoma | 125 (24.2%) | 517 | 500 | 516 | 1096 | 518 | 578 | 542 (544) | 1078 | 15 | 524 | 1637 |
| KICH | Kidney Chromophobe | 2 (3.0%) | 66 | 66 | 66 | 91 | 66 | 25 | 91 | 91 | | 91 | 182 |
| KIRC | Kidney renal clear cell carcinoma | 0 (0.0%) | 50 | 50 | 50 | 100 | 50 | 50 | 100 | 97 | | 97 | 197 |
| KIRP | Kidney renal papillary cell carcinoma | 0 (0.0%) | 78 | 78 | 77 | 101 | 76 | 25 | 101 (107) | 100 | | 100 | 207 |
| LAML | Acute Myeloid Leukemia | 0 (0.0%) | 173 | 173 | 71 | 173 | 173 | | 173 | 71 | | 71 | 244 |
| LGG | Brain Lower Grade Glioma | 1 (1.0%) | 100 | 100 | 94 | 100 | 100 | | 100 | 94 | | 94 | 194 |
| LIHC | Liver hepatocellular carcinoma | 25 (35.2%) | 71 | 71 | 67 | 104 | 68 | 36 | 104 | 96 | | 96 | 200 |
| LUAD | Lung adenocarcinoma | 0 (0.0%) | 57 | 57 | 57 | 114 | 59 | 55 | 114 (115) | 108 | | 108 | 223 |
| LUSC | Lung squamous cell carcinoma | 0 (0.0%) | 56 | 56 | 56 | 99 | 56 | 43 | 99 (100) | 93 | | 93 | 193 |
| OV | Ovarian serous cystadenocarcinoma | 0 (0.0%) | 93 | 93 | 86 | 100 | 100 | | 100 | 86 | | 86 | 186 |
| PAAD | Pancreatic adenocarcinoma | 2 (5.0%) | 40 | 40 | 40 | 41 | 40 | 1 | 41 | 41 | | 41 | 82 |
| PRAD | Prostate adenocarcinoma | 0 (0.0%) | 56 | 56 | 56 | 100 | 56 | 44 | 100 | 93 | | 93 | 193 |
| READ | Rectum adenocarcinoma | 32 (20.5%) | 156 | 156 | 154 | 162 | 157 | 5 | 162 | 159 | 32 | 160 | 353 |
| SKCM | Skin Cutaneous Melanoma | 2 (2.0%) | 98 | 98 | 98 | 100 | 100 | | 100 | 100 | | 100 | 200 |
| STAD | Stomach adenocarcinoma | 36 (25.9%) | 139 | 139 | 137 | 143 | 127 | 16 | 143 | 141 | | 141 | 284 |
| THCA | Thyroid carcinoma | 0 (0.0%) | 66 | 66 | 62 | 132 | 74 | 58 | 132 | 116 | | 116 | 248 |
| UCEC | Uterine Corpus Endometrial Carcinoma | 0 (0.0%) | 95 | 95 | 93 | 100 | 95 | 5 | 100 | 98 | | 98 | 198 |
| TOTALS | | | 3052 | | | 4562 | 3074 | 1488 | 3545 (3727) | 4268 | 269 | 3267 | 8264 |
(*) Sample number equals number of analysis IDs processed (BAM files) except where number of analysis IDs is given in parentheses
(**) Number of samples that have both RNA and DNA data
(***) Number of analysis IDs where each corresponds to a unique BAM file
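The footnotes imply that the library total for each cancer is the sum of the RNA-Seq analysis IDs (the parenthesized value where one is given), the WXS libraries and the WGS libraries. As a quick illustration, the sketch below checks that relationship for a few rows copied from Table 1. The variable and function names are hypothetical, and the rows shown are a subset chosen for brevity.

```python
# (RNA-Seq analysis IDs, WXS, WGS, total libraries), copied from Table 1.
# Where Table 1 gives a value in parentheses for RNA-Seq, that value
# (the number of analysis IDs) is the one used here.
TABLE1_ROWS = {
    "BLCA": (289, 526, 145, 960),
    "COAD": (431, 402, 69, 902),
    "GBM": (339, 167, 4, 510),
    "HNSC": (544, 1078, 15, 1637),
    "TOTALS": (3727, 4268, 269, 8264),
}


def library_total(rna_seq_ids, wxs, wgs):
    """Sum the three library types for one cancer."""
    return rna_seq_ids + wxs + wgs


# Collect any row whose reported total disagrees with the sum.
mismatches = {
    abbr: row
    for abbr, row in TABLE1_ROWS.items()
    if library_total(*row[:3]) != row[3]
}
```

For the rows listed, `mismatches` comes out empty, which matches the 4,268 WXS, 3,727 RNA-Seq and 269 WGS libraries quoted in the text.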
Figure 1. Virus detection in human cancer databases.
A) High-quality non-human reads were extracted from selected TCGA BAM files. Non-human reads were obtained after subtraction to hg19 or additional human databases. Remaining reads (non-human, high-quality) were mapped to Viral RefSeq or a set of 135 HPV genomes. Several rules were applied to remove virus families and artifacts which resulted in a list of viruses present in the TCGA cancer databases. B) Number of total alignments to phiX174, other bacteriophage (non-phiX174), eukaryotic viruses (virus) and the set of virus alignments that were removed. The number of alignments for each bar is shown (M = millions). For A and B, see text for details and Online Table 1 (worksheet “filtering rules”) for a description of how raw alignments were handled. C) Number of samples that contained at least one virus in the five virus families detected.
Five viral families are prevalent in human cancer
After artifact removal, we concluded that 34 viruses representing five virus families are present in TCGA cancer samples (Figure 1C; Figure 2; and Online Table worksheet “Viruses in cancer table”). These include the Papillomaviridae, Polyomaviridae, Hepadnaviridae, Flaviviridae, and Herpesviridae. Viruses were detected in 7.5% to 98.8% of patients in seven cancers (BLCA, CESC, COAD, HNSC, LIHC, READ, and STAD) (Table 1). In addition, viral sequences were present in a single or a small number of normal or tumor samples from several cancers (Online Table worksheet “Viruses in cancer table”).
Figure 2. The human virome.

Pie charts are shown for cancers that have >5% of patients with a virus. Pie charts show the frequency of viral presence in each cancer.
Human Papillomavirus (HPV) sequences in cervical cancers
An analysis of TCGA RNA-seq databases reported that HPV was present in the vast majority of cervical cancers (19). In agreement with this study our analysis indicated that HPV is present in 252/256 (98.4%) tumors. In addition, we found HPV sequences in 125/211 (59.2%) of adjacent normal tissue samples (Figure 3). As expected, HPV16 and HPV18 accounted for most of the papillomaviruses detected in cervical cancer patients (89.4%). All of the HPV species detected were high-risk alpha-papillomaviruses except for a few samples containing HPV30, HPV73, or HPV111 (Figure 3, Supp Figure 3A). Most samples harbored a single HPV species but some had multiple species up to a maximum of six (Supp Figure 3B). The number of sequence reads aligning to HPV varied widely among samples and was higher in DNA-seq (maximum greater than 9.5 million) compared to RNA-seq datasets (maximum of 141,000).
Figure 3. Detection of HPV in CESC and HNSC.

Number of HPV reads in CESC or HNSC by RNA-Seq or WXS. Patients are sorted left to right by descending number of HPV16 reads.
HPV in head and neck cancers
HPV has also been associated with head and neck cancers (11, 14). Consistent with this, we detected HPV in 20.1% of HNSC patients (104/517). Of the 518 tumors and 578 normal samples analyzed from these HNSC patients, 103 (20.0%) and 2 (0.3%) contained HPV. HPV16 is the predominant HPV found in HNSC. We detected HPV16 expression in 17.2% (86/500) of HNSC patients (17.2% tumors [86/499]; 2.3% normals [1/43]) as well as other high-risk alphapapillomaviruses, HPV31, 33, 35 and 56 (Figure 3).
Hepatitis B and Hepatitis C viruses associated with liver cancers
We found Hepatitis B virus (HBV) and Hepatitis C virus (HCV) sequences in normal and tumor samples in patients with liver cancer as previously reported (10, 19) (Figure 2; and Figure 4). HBV and HCV virus sequences were present in 31.0% (22/71) and 5.6% (4/71) of liver cancer patients, respectively. HBV expression was detected by RNA-Seq in 27.9% of tumors (19/68) and in 22.2% of normal tissue (8/36) (Online Table worksheet “Viruses in cancer table”) and confirmed by WXS in 20 out of these 27 samples. Five of the RNA-Seq analyses (4 tumors and 1 normal) had more than 100,000 alignments (genome coverage > 82%) to HBV. We found evidence of HBV integration in 13/22 (59.1%) HBV-positive patients. HCV gene expression was detected in 5.6% (4/71) of liver cancer patients (2.9% tumors [2/68] and 8.3% normal [3/36]) with a maximum of 198 reads covering 54% of the genome. As expected, HCV was not detected by WXS nor was integration of HCV detected in these samples. The same HBV sequence was present in normal and tumor tissue from the same patient as evidenced by detection of the same polymorphisms compared to the HBV reference strain (Figure 4C). HBV was also detected in one sample of THCA and one sample of CESC. However, these HBV sequences were due to sample cross contamination during sequencing (Supp Figure 4).
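The argument that tumor and normal tissue from one patient carry the same HBV sequence rests on comparing polymorphisms relative to the reference strain. One simple way to quantify such a comparison is the overlap between two sets of variant calls. The sketch below is an illustration only and is not the method used in this study: the function name is hypothetical, and the variant positions are invented toy values, not data from any TCGA sample.

```python
def shared_fraction(variants_a, variants_b):
    """Jaccard index of two collections of (position, alt_base) polymorphisms.

    Returns 1.0 when the sets are identical and 0.0 when they are disjoint.
    Two empty sets are treated as identical (both match the reference).
    """
    a, b = set(variants_a), set(variants_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Invented toy polymorphisms, for illustration only.
patient1_normal = {(120, "T"), (845, "G"), (2301, "A")}
patient1_tumor = {(120, "T"), (845, "G"), (2301, "A")}
patient2_tumor = {(97, "C"), (845, "G"), (1500, "T"), (3010, "A")}

same_patient = shared_fraction(patient1_normal, patient1_tumor)
different_patients = shared_fraction(patient1_tumor, patient2_tumor)
```

A value near 1 for paired samples combined with a low value between patients is the pattern described for Figure 4C. Real data would also need to account for positions lacking coverage in one of the samples, which this sketch ignores.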
Figure 4. Detection of HBV and HCV in liver cancer.

Number of reads aligning to Hepatitis B (HBV) and Hepatitis C virus (HCV) by A) RNA-Seq and B) WXS. Samples are sorted left to right by descending number of HBV reads. C) Normal and tumor tissue from the same patient harbor the same HBV sequence. For two patients (top, TCGA-DD-A1EL, and bottom, TCGA-DD-A119), a coverage graph is plotted across the entire genome of HBV (NC_003977) (gene annotations are shown at the bottom). Vertical colored bars represent polymorphisms compared to the reference genome. Note that the same polymorphisms are present in the normal and tumor samples of the same patient but differ between the two patients.
Gene expression and mutational profiling suggest that HPV and BKV drive a small subset of bladder cancers
Previously it was reported that viral transcripts or viral DNA from HPV and BKV were present in a small subset of bladder cancer tumors (5). We analyzed each of these tumors and confirmed the presence of these same viruses. Next, we extended this analysis to include a total of 268 BLCA patients and identified additional samples harboring HPV (Table 2). In total, HPV was found in 19/268 tumor samples and BKV was present in a single tumor. Neither BKV nor HPV was present in normal tissue.
Table 2. HPVs in BLCA tumors.
| pid | barcode | TCGA | TCGA-virus | Pickaxe organism | RNA-Seq (1) | WGS | WXS |
|---|---|---|---|---|---|---|---|
| 24f21425-b001-4986-aedf-5b4dd851c6ad | TCGA-BT-A20V-01A-11R-A14Y-07 | yes | HPV45 | HPV45 | 3855 | ||
| 25ac9acd-328b-4fae-a448-4f3957b1b212 | TCGA-CF-A7I0-01A-22R-A352-07 | no | HPV16 | 15 | |||
| 39fdc742-8bdb-4e77-b20b-0f967a75e659 | TCGA-LC-A66R-01A-41R-A30C-07 | no | HPV51 | 16 | |||
| 3d644eaf-82fc-4bc7-93b7-3e600e8c6511 | TCGA-GV-A3JW-10A-01D-A20D-08 | yes | none | HPV16 | 11 | ||
| 74cbb078-9aae-422a-a7af-dd29ebca949e | TCGA-GC-A3I6-10A-01D-A20D-08 | yes | HPV16 | HPV16 | 27162 | 18097 | 874 |
| 854e5236-f2d6-4681-be62-0bead955c926 | TCGA-E7-A8O8-01A-11R-A36F-07 | no | HPV18 | 24 | |||
| 899f8d52-4c3a-4371-90eb-1f106001eddc | TCGA-FD-A3B4-01A-12D-A202-08 | yes | HPV56 | HPV53 | 30 | ||
| HPV56 | 4591 | ||||||
| 8c1dd7f7-b74a-4fa2-b6a7-86f0348d2567 | TCGA-FD-A3N6-01A-11D-A21C-26 | yes | HPV6b | HPV11 | 221 | ||
| HPV6b | 35852 | 2440 | 17 | ||||
| 8da76432-3004-47e5-a474-bd94bd4c0b33 | TCGA-FD-A5BY-01A-31R-A28M-07 | no | HPV52 | 14771 | 32 | ||
| 90bf49b7-6710-4665-82df-30ecc22c59a3 | TCGA-XF-A8HB-01A-11R-A36F-07 | no | HPV16 | 16888 | |||
| c8adce3d-4914-4851-9952-846b0c9c32de | TCGA-E7-A85H-01A-11R-A352-07 | no | HPV16 | 21 | |||
| eeca58fe-b5cb-4d13-a93f-a5a2ca00b6e2 | TCGA-K4-A5RH-10A-01D-A30H-08 | no | HPV51 | 35 | |||
(1) Numbers of alignments detected in RNA-Seq, WGS and WXS libraries
The integration of viral sequences in tumor DNA is one of the hallmarks of virus-driven cancers. Previous studies identified three bladder tumors containing integrated HPV sequences. Our analysis confirmed these integrations and identified an additional patient (participant 90bf49b7) harboring integrated HPV within the SLC2A1-AS1 gene on chromosome 1. The presence of human polyomavirus BKV in a single patient has been previously reported. We found 98% genome coverage by DNA-seq in this sample. Nearly 80% of the genome was covered by RNA-Seq reads due to expression of the viral early region (large T and small t antigens), agnoprotein and VP1 (Figure 5). Consistent with previous reports we detected two BKV chimeras in the RNA-seq sample. Both appear to be DNA junctions and map to chromosomes 2 and 12. The chr2 chimera occurs approximately 0.5 Mb downstream of GRB14. The chr12 chimera occurs within the gene HSP90B1. We assembled the BKV reads into several contigs and performed a BLASTN search of the two longest contigs against GenBank. The top hit for both contigs (99% identity) was BK polyomavirus isolate SJH-LG-253 (JN192434.1). These observations convinced us that BKV was present in this sample and that it was not the laboratory strain.
Figure 5. BKV expression in a bladder tumor.

Coverage graph of BKV (NC_001538) in patient tumor TCGA-DK-A3IT-01A. The whole genome of BKV virus is shown. The top track shows the number of reads aligning to each base in the genome (y-axis scale max = 5488). Read alignments are shown below the graph (red, forward; blue, reverse). The virus gene annotations are shown in the bottom track.
We found evidence for expression of the viral oncogenes E6 and E7 in HPV-positive tumors and of large T antigen (LT) in the BKV-positive tumor. HPV and BKV are known to block the activity of Rb, p16 and p53. Thus, tumors expressing active HPV or BKV oncoproteins are not under selective pressure to mutate these tumor suppressors. Therefore, we examined the status of the Rb, CDKN2A and p53 genes in virus-associated BLCA tumors. The mutation rate of Rb, CDKN2A and p53 in non-viral tumors was 16%, 7% and 49%, respectively. However, none of the BLCA viral tumors had a mutation in Rb, CDKN2A or p53 (Table 3). This suggests that HPV and BKV functionally impinge upon the Rb and p53 pathways in these tumors.
Table 3. Mutation contingency table for HPV and BKV presence.
| Gene | Virus (A) | Mutated: No (B) | Mutated: Yes (B) | % Mutated |
|---|---|---|---|---|
| RB1 | - | 164 | 32 | 16% |
| RB1 | + | 6 | 0 | 0% |
| CDKN2A | - | 183 | 13 | 7% |
| CDKN2A | + | 6 | 0 | 0% |
| TP53 | - | 100 | 96 | 49% |
| TP53 | + | 6 | 0 | 0% |
(A) HPV or BKV was considered present if the number of alignments to the virus was >= 1000.
(B) A gene was considered mutated if one or more non-silent mutations were found in that gene.
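The percentages in Table 3 follow directly from the contingency counts: the number of mutated tumors divided by all tumors in that virus category. A small sketch of the arithmetic, using the counts from Table 3 (the dictionary layout and function name are hypothetical):

```python
# (gene, virus status) -> (tumors without mutation, tumors with mutation),
# copied from Table 3.
TABLE3_COUNTS = {
    ("RB1", "-"): (164, 32),
    ("RB1", "+"): (6, 0),
    ("CDKN2A", "-"): (183, 13),
    ("CDKN2A", "+"): (6, 0),
    ("TP53", "-"): (100, 96),
    ("TP53", "+"): (6, 0),
}


def percent_mutated(not_mutated, mutated):
    """Percentage of tumors in a category that carry a non-silent mutation."""
    total = not_mutated + mutated
    if total == 0:
        raise ValueError("category contains no tumors")
    return 100.0 * mutated / total


# Rounded to whole percentages, as reported in the table.
rates = {key: round(percent_mutated(*counts))
         for key, counts in TABLE3_COUNTS.items()}
```

With these counts, the non-viral tumors give 16% (RB1), 7% (CDKN2A) and 49% (TP53), and each viral category gives 0%. Note that the viral category contains only six tumors, so the 0% values rest on a small sample.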
As a consequence of blocking the function of Rb, the E2F transcription factors are free to transactivate a set of downstream responsive genes termed E2F regulated genes (ERGs). We sought to determine if HPV and BKV tumors form a distinct subcluster of bladder cancers by analyzing the expression of a set of 325 ERGs (16). Using TCGA RNA-seq data, we calculated whether a gene was up- or downregulated relative to a set of normal bladder samples. Genes were categorized with a label of 1 for upregulated, 2 for not changed and 3 for downregulated (Johnson et al., submitted). We selected bladder tumors that had 0 alignments to HPV or BKV (n=211) and tumors that had at least 1000 alignments to HPV or BKV (n=7). The categorized expression data were clustered using the R pheatmap package (Figure 6). Two clusters of tumors are apparent (separated by a vertical black line): group A (n=37) and group B (n=181). All HPV- and BKV-associated tumors fall in group B. The ERGs also separate into two groups: group 1 and group 2 (Supp Table 1). Group 1 genes are generally not changed or are downregulated in tumors. Group 1 is enriched for genes associated with cell death such as FAS and BCL2 and with the MAPK pathway such as several MAPKs and MYC. Most group 2 genes are upregulated in tumors from group B but show a range of expression in group A tumors. Group 2 genes are enriched for DNA replication and cell cycle functions. We used consensus clustering (13) to determine if these tumor clusters were robust. After testing a range of k values from 2 to 10, and based on the CDF plot, k = 2 resulted in clusters of maximal stability (Supp Figure 5). The larger cluster (n=185) contained all the viral samples and overall 95% of the tumors remained grouped in the same cluster.
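The three-way labelling of each gene (1, 2 or 3) can be illustrated with a short sketch. The text specifies the labels but not the rule used to assign them, so everything else here is an assumption: the log2 fold-change cutoff of 1.0, the pseudocount, and the function names are hypothetical placeholders. The Hamming distance is included only as a simple way to compare two categorized profiles and is not the distance computed by pheatmap.

```python
import math

# Labels as stated in the text.
UP, UNCHANGED, DOWN = 1, 2, 3


def categorize(tumor_expr, normal_mean, log2_cutoff=1.0, pseudocount=1.0):
    """Label one gene in one tumor relative to the normal-sample mean.

    The cutoff and pseudocount are hypothetical; the source does not
    state how up- and downregulation were called.
    """
    log2_fc = math.log2((tumor_expr + pseudocount) / (normal_mean + pseudocount))
    if log2_fc >= log2_cutoff:
        return UP
    if log2_fc <= -log2_cutoff:
        return DOWN
    return UNCHANGED


def hamming(profile_a, profile_b):
    """Count the genes whose category differs between two tumors."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genes")
    return sum(x != y for x, y in zip(profile_a, profile_b))


# Toy expression values for three genes in two hypothetical tumors.
normal_means = [7.0, 31.0, 10.0]
tumor_x = [categorize(t, n) for t, n in zip([31.0, 7.0, 10.0], normal_means)]
tumor_y = [categorize(t, n) for t, n in zip([31.0, 31.0, 10.0], normal_means)]
distance_xy = hamming(tumor_x, tumor_y)
```

Working on categories rather than raw expression values makes the clustering insensitive to the magnitude of a change, so a gene that is strongly induced and one that is modestly induced contribute equally once both pass the cutoff.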
Figure 6. Clustering of bladder tumors using E2F regulated genes.

Bladder tumors that had 0 alignments to HPV or BKV, or had >1000 alignments to an HPV or BKV were selected. Using the categorized expression data for the ERGs (n=325), these tumors were clustered using the ‘pheatmap’ package in R. Tumors and ERGs separated into two clusters (marked by black lines). Virus associated tumors (colored by virus) are found in Group B tumors and the BKV sample is indicated with an arrow. In the heatmap, ‘red’ indicates upregulation in tumor relative to normal tissue, ‘yellow’ indicates no change in expression and ‘blue’ indicates downregulation in tumor relative to normal tissue (see Materials and Methods).
Next we asked how membership in group B relates to the mutation status of Rb, CDKN2A and p53 in non-viral tumors. We found that all tumors with mutated Rb were non-viral tumors of group B; Rb was not mutated in any group A tumor (Table 4). p53 was mutated in a higher percentage of non-viral group B tumors (54%, 88/162) than of group A tumors (24%, 8/34). However, the CDKN2A mutation rate was similar in non-viral group B and group A tumors, at 7% and 6%, respectively.
Table 4. Mutation contingency table for HPV and BKV presence by ERG tumor cluster.
| Gene | Cluster (A) | Mutated: No (B) | Mutated: Yes (B) | % Mutated |
|---|---|---|---|---|
| RB1 | A non-viral | 34 | 0 | 0% |
| RB1 | B non-viral | 130 | 32 | 20% |
| RB1 | B viral | 6 | 0 | 0% |
| CDKN2A | A non-viral | 32 | 2 | 6% |
| CDKN2A | B non-viral | 151 | 11 | 7% |
| CDKN2A | B viral | 6 | 0 | 0% |
| TP53 | A non-viral | 26 | 8 | 24% |
| TP53 | B non-viral | 74 | 88 | 54% |
| TP53 | B viral | 6 | 0 | 0% |
(A) Tumors are defined by the clustering in Figure 6.
(B) A gene was considered mutated if one or more non-silent mutations were found in that gene.
Human herpesviruses are present in many different cancers
We detected one or more members of the Herpesvirus family in at least some samples of 20 out of the 22 cancers examined (Table 5). Among these, EBV, CMV, and HHV6 were found in at least one sample of 17/22 cancers. Consistent with previous reports we found EBV associated with 23% of stomach cancers. The number of EBV sequence reads varied dramatically among the STAD tumor samples (Supp Table 2 “Herpesviruses/EBV presence”). We deduced viral gene expression profiles for 11 samples with the highest number of RNA-Seq reads (>= 5000) (Supp Figure 6A). The pattern of expression is similar to that reported by Strong et al (18) with most of the expression coming from the RPMS1 and A73 genes in the BamHI A region (Supp Figure 6B).
Table 5. Number of patients with herpesvirus.
| Cancer | Patients | Herpes+ | % | Tumor | Normal | RNA-seq | DNA-Seq | HSV1 Total | HSV1 T | HSV1 N | HSV2 Total | HSV2 T | HSV2 N | EBV Total | EBV T | EBV N | CMV Total | CMV T | CMV N | HHV6 Total | HHV6 T | HHV6 N | HHV7 Total | HHV7 T | HHV7 N | KSHV Total | KSHV T | KSHV N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLCA | 268 | 16 | 6% | 268 | 253 | 267 | 253 | 8 | 6 | 2 | 5 | 5 | 3 | 3 | 1 | 1 | ||||||||||||
| BRCA | 49 | 1 | 2% | 49 | 45 | 49 | 48 | 1 | 1 | |||||||||||||||||||
| CESC | 255 | 15 | 6% | 255 | 208 | 252 | 208 | 1 | 1 | 9 | 4 | 6 | 3 | 3 | 2 | 2 | 1 | 1 | 1 | |||||||||
| COAD | 407 | 74 | 18% | 407 | 21 | 407 | 398 | 1 | 1 | 19 | 19 | 39 | 39 | 19 | 19 | 7 | 4 | 3 | ||||||||||
| GBM | 162 | 5 | 3% | 162 | 162 | 162 | 3 | 3 | 2 | 2 | ||||||||||||||||||
| HNSC | 517 | 59 | 11% | 517 | 515 | 500 | 516 | 4 | 3 | 1 | 24 | 17 | 7 | 20 | 20 | 11 | 11 | 6 | 3 | 2 | 1 | 2 | 1 | 1 | ||||
| KICH | 66 | 0% | 66 | 25 | 66 | 66 | ||||||||||||||||||||||
| KIRC | 50 | 1 | 2% | 50 | 50 | 50 | 50 | 1 | 1 | |||||||||||||||||||
| KIRP | 78 | 3 | 4% | 76 | 25 | 78 | 77 | 3 | 3 | |||||||||||||||||||
| LAML | 173 | 4 | 2% | 173 | 173 | 71 | 3 | 3 | 1 | 1 | ||||||||||||||||||
| LGG | 100 | 2 | 2% | 100 | 100 | 94 | 1 | 1 | 1 | 1 | ||||||||||||||||||
| LIHC | 71 | 7 | 10% | 68 | 36 | 71 | 67 | 1 | 1 | 6 | 6 | |||||||||||||||||
| LUAD | 57 | 6 | 11% | 57 | 55 | 57 | 57 | 1 | 1 | 2 | 2 | 1 | 1 | 2 | 2 | 2 | ||||||||||||
| LUSC | 56 | 8 | 14% | 56 | 43 | 56 | 56 | 4 | 3 | 1 | 3 | 3 | 1 | 1 | ||||||||||||||
| OV | 93 | 7 | 8% | 93 | 93 | 86 | 7 | 7 | ||||||||||||||||||||
| PAAD | 40 | 5 | 13% | 40 | 1 | 40 | 40 | 1 | 1 | 2 | 2 | 2 | 2 | |||||||||||||||
| PRAD | 56 | 2 | 4% | 56 | 44 | 56 | 56 | 2 | 1 | 1 | ||||||||||||||||||
| READ | 156 | 32 | 21% | 156 | 5 | 156 | 154 | 9 | 9 | 18 | 18 | 7 | 6 | 1 | ||||||||||||||
| SKCM | 98 | 5 | 5% | 98 | 98 | 98 | 1 | 1 | 2 | 2 | 2 | 2 | ||||||||||||||||
| STAD | 139 | 48 | 35% | 127 | 16 | 139 | 137 | 32 | 31 | 1 | 19 | 19 | 5 | 5 | 1 | 1 | ||||||||||||
| THCA | 66 | 0% | 66 | 58 | 66 | 62 | ||||||||||||||||||||||
| UCEC | 95 | 2 | 2% | 95 | 5 | 95 | 93 | 1 | 1 | 1 | 1 | |||||||||||||||||
T, tumor; N, normal tissue
We also detected CMV (14%) and HHV6 (4%) in STAD tumors. In fact, 21% of rectal cancers and 18% of colon cancers contained EBV, CMV, or HHV6. In many cases, these viruses were detected by both DNA-seq and RNA-seq experiments, indicating they are not contaminants. In most cancers, the number of reads aligning to EBV, CMV, and HHV6 was relatively low, suggesting that only a subset of cells in the sample harbored the virus. In the case of CMV, examination of viral gene expression revealed transcription across most of the viral genome with a majority of expression coming from two non-coding genes (Supp Figure 7A, B). This expression pattern is similar to the reported transcriptome of a productive CMV infection in cell culture (8).
Other herpesviruses including HSV1, HSV2, HHV7 and KSHV were rarely detected (Supplemental Results). However, their presence in both DNA-seq and RNA-seq datasets suggests they are not contaminants. A search for chimeric reads found no evidence of integration of herpesviruses, suggesting that the viral DNA is episomal.
Discussion
In this manuscript, we surveyed a variety of cancers for the presence of known viruses. Databases such as those provided by the TCGA give scientists an unprecedented opportunity to examine viral infections in humans, identify new associations of known viruses with disease, and search human tissue for novel viruses. We have developed a computational pipeline to facilitate known virus identification and discovery from complex environmental samples. Here we have applied this pipeline to DNA sequence and RNA-seq datasets collected by the TCGA. Our goal was to assess the frequency with which viruses occurred in tumor and normal cells from different human tissues. The use of independent libraries for the same samples is important for sequence confirmation (10). In this study we explored viral diversity by a combined analysis of RNA-seq, WXS and WGS databases.
Artifacts and contamination obscure the search for viruses associated with cancer
We found viral sequences in most of the TCGA BAM files that we analyzed. There are two possible explanations for the presence of viral sequences in human tumor databases. First, a given virus may be present in human tissue because it infects humans, perhaps even contributing to tumorigenesis. Second, the viral detection may be an artifact, arising either from physical contamination by samples, reagents, or sequencing machines, or from computational artifacts of the database or pipeline, such as cross-mapping of reads to a closely related virus. Concerned about recent literature highlighting contamination in NGS datasets, we questioned whether each virus was truly present (2, 7, 10). A comprehensive list of computational artifacts and physical contaminants detected in TCGA datasets can be found in the Supplemental Results. An effort, beyond the scope of this manuscript, is underway to develop rules for identifying and computationally removing artifacts with the goal of reducing the burden of manual inspection.
Viruses are frequently found in tumors from a subset of cancers
In agreement with previous viral surveys, we found members of five virus families in TCGA databases. HPV was detected in 98.8% of cervical cancers. HPV was also present in a subset of head and neck cancers, as well as in a small number of bladder cancers. HBV or HCV were associated with some liver cancers. The association of EBV with stomach cancer has been under investigation for some time (18) so it was not surprising to find EBV in about 20% of stomach cancers. In addition, members of the polyomavirus and herpesvirus families were detected in some samples.
HPV and BKV are drivers of a small subset of bladder cancers
HPV was detected in 19 of 268 bladder cancers and BKV was detected in a single bladder tumor. Several factors convinced us that HPV and BKV presence in a subset of bladder carcinomas was not the consequence of contamination. First, in three patients, HPV16, HPV6b and HPV52 sequences were detected in both DNA and RNA sequencing experiments. Second, we detected chimeric junctions indicating viral integration of HPV16, HPV45, HPV56 in some patients. Finally, the viral transforming genes, E6 and E7, are transcribed in all the HPV-associated tumors (data not shown).
One common feature of tumorigenesis mediated by HPV and BKV is that both antagonize the functions of the Rb and p53 tumor suppressors. An analysis of mutational profiles revealed that nonviral bladder tumors harbored mutations in pRb (16%), CDKN2A (7%), and p53 (49%) indicating that the Rb and p53 pathways are compromised. In contrast, none of the HPV-associated tumors, nor the BKV-associated tumor, carried mutations in these genes. We propose that the Rb and p53 pathways are inactivated in a subclass of bladder cancers. This inactivation can occur either by mutation of key genes in these pathways (Rb, CDKN2A, p53) or by the expression of the HPV E7 and E6 proteins or the BKV LT. Consistent with this hypothesis, we found that E2F-dependent cellular genes (ERGs) that are normally repressed by Rb, are over-expressed in virus-positive tumors and in the subset of virus-negative tumors that carry Rb or CDKN2A mutations. Taken together, these observations suggest that HPV and BKV contribute to a small number of bladder carcinomas.
Herpesviruses are frequently detected in cancers of the stomach, colon, and rectum
Members of the Herpesvirus family were detected in a small number of normal tissues from cancer patients. In many cases, the gene expression patterns suggest that a small number of cells in the sample are productively infected with the virus. Strikingly, three herpesviruses, EBV, CMV, and HHV6, are found in a significant number of cancers of the gastrointestinal tract, specifically cancers of the stomach, colon, and rectum. The role of EBV in stomach cancer is well known and has been the subject of intensive study. However, the detection of CMV and HHV6 in stomach cancers, and of all three herpesviruses in cancers of the colon and rectum, but not in paired normal tissue, is provocative. One possibility is that these viruses contribute to a subset of gastrointestinal cancers. Alternatively, these tumors may present an environment that favors viral infection or the virus may reside in infiltrating immune cells in the tumor. In either case, herpesvirus presence suggests that these tumors are somehow different. The basis for this difference warrants further study to determine if a causal role exists for herpesviruses in these tumors.
Surveys of viruses in TCGA give an incomplete picture of the viruses that are actually present
The human body is host to a myriad of microorganisms, any of which might induce illness under certain conditions such as immunosuppression, poor diet, or stress. Tissue procurement and nucleic acid isolation methods need to be developed that minimize contamination and maximize the isolation of nucleic acids from viruses, bacteria, fungi and parasites. The presence of contaminating viral sequences in these data sets usually does not compromise the analysis of host gene expression profiling data because these sequences are computationally removed. However, their presence does sound a warning for virus discovery searches. The presence of known viruses in these data sets means that there will also be unknown viruses present. It will be critical to develop methods that distinguish these contaminating sequences from authentic novel infectious agents. Finally, sequencing depth and the methods used for isolating tumor cells and extracting nucleic acids often do not favor the detection of relatively small viral genomes present in a subset of cells. Thus, the viruses detected in studies like this one are likely to represent a small subset of the viruses actually present.
Materials and Methods
Cancer databases
The results published here are in whole based upon data generated by The Cancer Genome Atlas (TCGA) Research Network (http://cancergenome.nih.gov). All human data were handled in accordance with a Data Access Request between the University of Pittsburgh and the NIH for dbGaP study accession phs000178. BAM files were downloaded with GeneTorrent (http://cghub.ucsc.edu) and handled in accordance with the TCGA Data Use Certification Agreement (version 9/12/2013). CGHub is no longer active; BAM files can now be downloaded from the Genomic Data Commons (GDC; https://portal.gdc.cancer.gov/legacy-archive/search/f). Clinical and mutation data (broad.mit.edu_BLCA.IlluminaGA_DNASeq_automated.Level_2.1.5.0/PR_TCGA_BLCA_PAIR_Capture_All_Pairs_QCPASS_v6.aggregated.capture.tcga.uuid.automated.somatic.maf) were obtained from the TCGA Data Portal (https://tcga-data.nci.nih.gov). This website is no longer accessible, so we are maintaining a local copy of these data.
Computational pipeline for virus detection
Non-human reads from TCGA BAM files were extracted and processed as previously described (6). A selected set of BAM files from BLCA, CESC and HNSC was processed with an additional subtraction step against the human Ensembl cDNA database, Homo_sapiens.GRCh37.74.cdna.all.fa (http://ftp.ensembl.org/pub/release-74/fasta/homo_sapiens/cdna/). To obtain alignments to a set of 135 manually curated human papillomavirus genomes, RNA-Seq and WXS libraries of BLCA, CESC and HNSC were first subtracted against several human databases: Ensembl ncRNA (http://ftp.ensembl.org/pub/release-74/fasta/homo_sapiens/ncrna/) and cDNA, NCBI RefSeq (downloaded Aug 2014; ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/), UCSC Genes (hg38 assembly; http://genome.ucsc.edu/cgi-bin/hgTables), the human genome references hg19 (ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/hg19.zip) and hg38 (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.16_GRCh38.p1), and NCBI human_genomic (ftp://ftp.ncbi.nlm.nih.gov/blast/db/). The resulting non-human reads were then aligned against the 135 HPV genomes. A virus was considered detected in a TCGA BAM file if the number of alignments to the virus was >= 10 and the average MAPQ of the alignments was >= 5. While these filters may seem lenient, they were chosen to identify all virus-positive samples. We then applied several filters to further narrow the list of detected viruses: we removed all bacteriophages, plant viruses, insect viruses and artifacts to obtain a list of viruses that are present in human cancer databases.
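The detection rule stated above (at least 10 alignments and an average MAPQ of at least 5) can be illustrated with a short sketch. This is not the Pickaxe implementation; the function name detected_viruses, the input layout of (virus name, MAPQ) pairs, and the toy data are assumptions made for illustration only.

```python
from statistics import mean
from typing import Dict, Iterable, List, Tuple

MIN_ALIGNMENTS = 10  # threshold from the text: number of alignments >= 10
MIN_MEAN_MAPQ = 5.0  # threshold from the text: average MAPQ >= 5


def detected_viruses(alignments: Iterable[Tuple[str, int]]) -> List[str]:
    """Return the viruses that pass both thresholds for one BAM file.

    `alignments` yields one (virus_name, mapq) pair per aligned read.
    """
    mapqs_by_virus: Dict[str, List[int]] = {}
    for virus, mapq in alignments:
        mapqs_by_virus.setdefault(virus, []).append(mapq)
    return sorted(
        virus
        for virus, mapqs in mapqs_by_virus.items()
        if len(mapqs) >= MIN_ALIGNMENTS and mean(mapqs) >= MIN_MEAN_MAPQ
    )


# Hypothetical toy data: only HPV16 has enough reads with adequate MAPQ.
toy = [("HPV16", 42)] * 12 + [("EBV", 30)] * 4 + [("phiX174", 1)] * 50
print(detected_viruses(toy))  # ['HPV16']
```

In this toy example EBV fails the count threshold and phiX174 fails the mapping-quality threshold. The later removal of bacteriophages, plant viruses, insect viruses and artifacts is a separate filtering step that this sketch does not attempt.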
Viral integration analysis
Virus/human chimeric reads were detected with SummonChimera (9). High-quality reads from TCGA BAM files were searched against the selected viral genome and the human genome (hg19) with BLASTN and Bowtie 2. SummonChimera was run using the BLASTN output and the Bowtie 2 virus alignment files from the virus detection pipeline to generate a list of chimeric junctions and integrations in the human genome.
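The idea behind a virus/human chimeric read can be sketched as follows: one part of the read aligns to the viral genome and a largely distinct part aligns to the human genome. The code below is a simplified illustration of that idea only. It does not reproduce SummonChimera's algorithm or its input formats; the Hit record, the candidate_chimeras function and the max_overlap parameter are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    read_id: str
    source: str   # "virus" or "human"
    q_start: int  # 1-based start of the aligned segment on the read
    q_end: int    # 1-based inclusive end of the aligned segment on the read


def overlap(a: Hit, b: Hit) -> int:
    """Number of read positions covered by both hits (0 if disjoint)."""
    return max(0, min(a.q_end, b.q_end) - max(a.q_start, b.q_start) + 1)


def candidate_chimeras(hits: Iterable[Hit], max_overlap: int = 5) -> List[str]:
    """IDs of reads with a virus hit and a human hit on mostly distinct segments."""
    hits_by_read: Dict[str, List[Hit]] = {}
    for hit in hits:
        hits_by_read.setdefault(hit.read_id, []).append(hit)

    found = []
    for read_id, read_hits in hits_by_read.items():
        virus = [h for h in read_hits if h.source == "virus"]
        human = [h for h in read_hits if h.source == "human"]
        # A small overlap is tolerated because bases near a junction
        # can often be aligned to either genome.
        if any(overlap(v, h) <= max_overlap for v in virus for h in human):
            found.append(read_id)
    return sorted(found)


# Hypothetical toy hits: only r1 has separate viral and human segments.
toy_hits = [
    Hit("r1", "virus", 1, 50),
    Hit("r1", "human", 48, 100),
    Hit("r2", "virus", 1, 100),
    Hit("r2", "human", 1, 100),
    Hit("r3", "virus", 1, 100),
]
print(candidate_chimeras(toy_hits))  # ['r1']
```

A read such as r2, whose whole length matches both genomes, is not treated as a junction here, and r3 has no human hit at all. Mapping a candidate read to nucleotide-level junction coordinates in hg19 is the part handled by SummonChimera and is outside the scope of this sketch.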
Other analyses
Integrative Genomics Viewer (IGV; http://www.broadinstitute.org/igv/; version 2.1.24) was used to visualize alignments to virus reference sequences. R (www.r-project.org; version 3) was used for visualization, correlation analysis (fisher.test) and clustering. The R pheatmap package was used with default parameters. The R ConsensusClusterPlus package was used with default parameters except that Minkowski distance was used.
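For readers unfamiliar with the test behind fisher.test, the following sketch computes the two-sided Fisher's exact p-value for a 2x2 table using only the Python standard library. It is an illustration of the calculation, not the R code used here, and the example table is hypothetical rather than taken from our data. The small relative tolerance used to treat nearly equal probabilities as ties mirrors the approach we understand R to take, and is an assumption of this sketch.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    # The relative tolerance guards against floating-point ties.
    total = sum(
        p for p in map(prob, range(lowest, highest + 1)) if p <= observed * (1 + 1e-7)
    )
    return min(1.0, total)


# Hypothetical table: rows = virus-positive / virus-negative samples,
# columns = feature present / absent.
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))  # 0.4857
```

In practice an established implementation such as R's fisher.test should be preferred; the sketch is meant only to make the underlying hypergeometric calculation explicit.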
Acknowledgments
This work was supported by NIH grant CA170248 to J.M.P. We thank the University of Pittsburgh's Center for Simulation and Modeling (SAM), which provided the high performance computing (HPC) cluster used for the computation in these experiments.
References
- 1. Amirian ES, Bondy ML, Mo Q, Bainbridge MN, Scheurer ME. Presence of viral DNA in whole-genome sequencing of brain tumor tissues from the cancer genome atlas. Journal of virology. 2014;88:774. doi: 10.1128/JVI.02725-13.
- 2. Borozan I, Wilson S, Blanchette P, Laflamme P, Watt SN, Krzyzanowski PM, Sircoulomb F, Rottapel R, Branton PE, Ferretti V. CaPSID: a bioinformatics platform for computational pathogen sequence identification in human genomes and transcriptomes. BMC Bioinformatics. 2012;13:206. doi: 10.1186/1471-2105-13-206.
- 3. Cancer-Genome-Atlas-Research-Network. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature. 2015;517:576–582. doi: 10.1038/nature14129.
- 4. Cancer-Genome-Atlas-Research-Network. Comprehensive molecular characterization of gastric adenocarcinoma. Nature. 2014;513:202–209. doi: 10.1038/nature13480.
- 5. Cancer-Genome-Atlas-Research-Network. Comprehensive molecular characterization of urothelial bladder carcinoma. Nature. 2014;507:315–322. doi: 10.1038/nature12965.
- 6. Cantalupo PG, Katz JP, Pipas JM. HeLa nucleic acid contamination in The Cancer Genome Atlas leads to the misidentification of HPV18. Journal of virology. 2015 doi: 10.1128/JVI.03365-14.
- 7. Cantalupo PG, Katz JP, Pipas JM. HeLa Nucleic Acid Contamination in The Cancer Genome Atlas Leads to the Misidentification of Human Papillomavirus 18. Journal of virology. 2015;89:4051–4057. doi: 10.1128/JVI.03365-14.
- 8. Gatherer D, Seirafian S, Cunningham C, Holton M, Dargan DJ, Baluchova K, Hector RD, Galbraith J, Herzyk P, Wilkinson GW, Davison AJ. High-resolution human cytomegalovirus transcriptome. Proceedings of the National Academy of Sciences of the United States of America. 2011;108:19755–19760. doi: 10.1073/pnas.1115861108.
- 9. Katz JP, Pipas JM. SummonChimera infers integrated viral genomes with nucleotide precision from NGS data. BMC Bioinformatics. 2014;15:348. doi: 10.1186/s12859-014-0348-4.
- 10. Kazemian M, Ren M, Lin JX, Liao W, Spolski R, Leonard WJ. Possible HPV38 contamination of endometrial cancer RNA-Seq samples in The Cancer Genome Atlas database. Journal of virology. 2015 doi: 10.1128/JVI.00822-15.
- 11. Khoury JD, Tannir NM, Williams MD, Chen Y, Yao H, Zhang J, Thompson EJ, Meric-Bernstam F, Medeiros LJ, Weinstein JN, Su X. Landscape of DNA virus associations across human malignant cancers: analysis of 3,775 cases using RNA-Seq. Journal of virology. 2013;87:8916–8926. doi: 10.1128/JVI.00340-13.
- 12. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.
- 13. Monti S, Tamayo P, Mesirov J, Golub T. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn. 2003;52:91–118.
- 14. Parfenov M, Pedamallu CS, Gehlenborg N, Freeman SS, Danilova L, Bristow CA, Lee S, Hadjipanayis AG, Ivanova EV, Wilkerson MD, Protopopov A, Yang L, Seth S, Song X, Tang J, Ren X, Zhang J, Pantazi A, Santoso N, Xu AW, Mahadeshwar H, Wheeler DA, Haddad RI, Jung J, Ojesina AI, Issaeva N, Yarbrough WG, Hayes DN, Grandis JR, El-Naggar AK, Meyerson M, Park PJ, Chin L, Seidman JG, Hammerman PS, Kucherlapati R. Characterization of HPV and host genome interactions in primary head and neck cancers. Proceedings of the National Academy of Sciences of the United States of America. 2014 doi: 10.1073/pnas.1416074111.
- 15. Salyakina D, Tsinoremas NF. Viral expression associated with gastrointestinal adenocarcinomas in TCGA high-throughput sequencing data. Hum Genomics. 2013;7:23. doi: 10.1186/1479-7364-7-23.
- 16. Shackney SE, Chowdhury SA, Schwartz R. A Novel Subset of Human Tumors That Simultaneously Overexpress Multiple E2F-responsive Genes Found in Breast, Ovarian, and Prostate Cancers. Cancer Inform. 2014;13:89–100. doi: 10.4137/CIN.S14062.
- 17. Strong MJ, O'Grady T, Lin Z, Xu G, Baddoo M, Parsons C, Zhang K, Taylor CM, Flemington EK. Epstein-Barr virus and human herpesvirus 6 detection in a non-Hodgkin's diffuse large B-cell lymphoma cohort by using RNA sequencing. Journal of virology. 2013;87:13059–13062. doi: 10.1128/JVI.02380-13.
- 18. Strong MJ, Xu G, Coco J, Baribault C, Vinay DS, Lacey MR, Strong AL, Lehman TA, Seddon MB, Lin Z, Concha M, Baddoo M, Ferris M, Swan KF, Sullivan DE, Burow ME, Taylor CM, Flemington EK. Differences in gastric carcinoma microenvironment stratify according to EBV infection intensity: implications for possible immune adjuvant therapy. PLoS pathogens. 2013;9:e1003341. doi: 10.1371/journal.ppat.1003341.
- 19. Tang KW, Alaei-Mahabadi B, Samuelsson T, Lindh M, Larsson E. The landscape of viral expression and host gene fusion and adaptation in human cancer. Nat Commun. 2013;4:2513. doi: 10.1038/ncomms3513.