Abstract
Numerous viral sequences have been reported in the whole-genome sequencing (WGS) data of human blood. However, it is not clear to what degree the virus-mappable reads represent true viral sequences rather than random-mapping or noise originating from sample preparation, sequencing processes, or other sources. Identification of patterns of virus-mappable reads may generate novel indicators for evaluating the origins of these viral sequences. We characterized paired-end unmapped reads and reads aligned to viral references in human WGS datasets, then compared patterns of the virus-mappable reads among DNA sources and sequencing facilities which produced these datasets. We then examined potential origins of the source- and facility-associated viral reads. The proportions of clean unmapped reads among the seven sequencing facilities were significantly different (P < 2×10−16). We identified 260,339 reads that were mappable to a total of 99 viral references in 2,535 samples. The majority (86.7%) of these virus-mappable reads (corresponding to 47 viral references), which can be classified into four groups based on their distinct patterns, were strongly associated with sequencing facility or DNA source (adjusted P value < 0.01). Possible origins of these reads include artificial sequences in library preparation, recombinant vectors in cell culture, and phages co-contaminated with their host bacteria. The sequencing facility-associated virus-mappable reads and patterns were repeatedly observed in other datasets produced in the same facilities. We have constructed an analytic framework and profiled the unmapped reads mappable to viral references. The results provide a new understanding of sequencing facility- and DNA source-associated batch effects in deep sequencing data and may facilitate improved bioinformatics filtering of reads.
Keywords: High-throughput sequencing, Unmapped reads, Batch effect, Human virome
1. Introduction
Unmapped high-throughput sequencing reads
High-throughput sequencing (HTS) has been routinely used to study the genome, transcriptome, and epigenome of human samples. The data quality of HTS is critical for all genomic analyses. Many HTS reads remain unmapped against the human reference genome; for whole-genome sequencing (WGS) data, approximately 1–10% of reads are not mappable to the reference genome of the sequenced species [1]. These unmappable reads, which have historically been considered junk sequences and discarded in primary HTS data analyses, are attributed to a wide range of factors: for instance, repetitive genomic sequences (e.g., transposable elements), non-reference large insertions, limitations of short read alignment algorithms, quality of the reference genomes, and sequencing errors. While these unmapped reads may suggest the presence of microbial agents [2, 3] in sequenced human specimens, they can also indicate the existence of sequencing artifacts, including cross-species contamination during data production. Therefore, the proportion of unmapped reads and that of reads mapping to non-human sources, such as the human virome, are potential indicators of the quality of human WGS data.
Human virome analyses using HTS reads mappable to viral references
With the recent bioinformatics advances in WGS and RNA-Sequencing (RNA-Seq) data analyses, mapping the unmapped reads to viral references has been used to identify a broad range of infectious agents from the human virome [4, 5, 6, 7, 8, 9, 10, 11, 12, 13], particularly those with low abundance. For example, the mouse kidney parvovirus, which resulted in renal failure, was found in multiple laboratory mouse colonies [13]; the human viromes of gut microbiota [14], skin [15], and blood [12] were recently characterized using HTS; and a novel pathogen was found in cord colitis syndrome by screening WGS data after the removal of known microbial sequences [16].
Other origins of reads mappable to viral references in human HTS data
Although bioinformatics tools have been developed to detect viral or bacterial pathogens [17, 18, 19, 20, 21, 22], it is still challenging to distinguish true viral pathogens, particularly new viral species [8], from microbial contaminants, especially in low-coverage HTS data. These microbial contaminants impose additional challenges on all analyses related to viral sequences, such as discovering pathogens and viral integration events or elucidating composition of the human virome [3, 12, 23, 24, 25, 26, 27, 28, 29, 30, 31]. HTS reads may be contaminated by sources such as vectors, adapters, and other artificial sequences [23, 32, 33], which are frequently detected in virome-wide studies [12]. Artificial sequences may be distinguished by the fact that reads originating from them should not cover an entire viral reference genome, e.g., all adeno-associated viral reads originating from vector sequences map to a small region of the viral genome with extremely high coverage [34]. However, microbial contaminants can also be introduced from other sources, including cell culturing, ultrapure water [32], DNA and RNA extraction kits, PCR amplification reagents, and sequencing library preparation kits [23, 32], which are very challenging to differentiate.
The current study
Numerous viral sequences have been reported in WGS data from human blood DNA [12]. However, it is not clear to what degree these virus-mappable reads identified in human blood WGS data represent true viral sequences from the virome rather than random-mapping or noise originating from sample preparation, sequencing processes, or other batch-specific sources. Analyses searching for patterns of virus-mappable reads may generate novel indicators for evaluating the origins of these virus-mappable sequences. In this study, we examined the human-unmappable reads from WGS data which were mappable to viral genome references. We identified unique patterns of virus-mappable reads from viral contaminants or sequence artifacts that were strongly associated with sequencing facility (center) and DNA source (even in one of the best controlled WGS datasets), supporting facility and source related batch effects potentially common in HTS data.
2. Methods
2.1. Samples and Data
We analyzed the WGS data of 2,535 subjects from the 1000 Genomes Project (phase 3). Whole-genome aligned BAM files (hg19 genome build) were downloaded from the International Genome Sample Resource database. This dataset represents 26 different populations that can be categorized into five super-populations: African, Admixed American, European, East Asian, and South Asian. The WGS data was produced by seven primary sequencing facilities worldwide: Baylor College of Medicine (BCM), Beijing Genomics Institute (BGI), The Broad Institute (BI), Illumina, The Max Planck Institute of Molecular Genetics (MPIMG), The Wellcome Trust Sanger Institute (SC), and Washington University Genome Science Center (WUGSC) (Supplementary Table 1). The DNA sources for these subjects included lymphoblastoid cell lines (LCLs) and blood. All samples analyzed in this study were deidentified, and institutional review board approval for use of the existing data was obtained from the committee on human research at the University of Vermont.
2.2. Detection of clean paired-end unmapped reads
Paired-end unmapped reads were obtained in BAM format using SAMtools with the parameter “-f 12” [35]. After being sorted with SAMtools, the BAM files containing paired-end unmapped reads were converted to FASTQ files using HYDRA’s bamToFastq tool [36]. To eliminate low quality sequence data, NGSQCToolkit [37] was used to trim nucleotides at the 3’ end of each read having a quality score of less than 20 (Q20). We then used NGSQCToolkit to remove short reads (20 base pairs (bp) or shorter) and reads consisting of less than 80% high quality nucleotides (Q20), and FastUniq to remove duplicate reads [38]. One-way analysis of variance (ANOVA) was performed to measure the significance of differences of mean values in the percentages of unmapped reads among the populations, sequencing facilities, and DNA sources by using the “aov” function in R (v3.2.4).
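As an illustration only, the read-level criteria above can be sketched in Python. This is not the SAMtools or NGSQCToolkit implementation; the function names are hypothetical, and the trimming rule (drop trailing bases until one reaches Q20) is an assumption about how 3’ trimming behaves. The flag value 12 is the sum of the SAM flag bits for "read unmapped" (4) and "mate unmapped" (8).

```python
# SAM flag bits as defined in the SAM format specification.
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
PAIR_UNMAPPED = READ_UNMAPPED | MATE_UNMAPPED  # 12, the value passed to -f


def is_pair_unmapped(flag):
    """True when both the read and its mate are flagged as unmapped."""
    return (flag & PAIR_UNMAPPED) == PAIR_UNMAPPED


def trim_3prime(seq, quals, min_q=20):
    """Drop trailing bases whose Phred score is below min_q (assumed rule)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def passes_filters(seq, quals, min_len=21, min_q=20, min_frac=0.8):
    """Keep reads longer than 20 bp with at least 80% of bases at Q20 or above."""
    if len(seq) < min_len:
        return False
    high = sum(1 for q in quals if q >= min_q)
    return high / len(quals) >= min_frac
```

In this sketch a read would be trimmed first and then tested with `passes_filters`; duplicate removal is a separate step that is not shown.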
2.3. Removal of repeats, known vectors, and human sequences
To remove repeat sequences, the paired-end unmapped reads, in FASTQ format, were first converted to FASTA files then screened for repeats using the tools RepeatMasker, DUST, and TRF [39, 40, 41]. We additionally removed reads containing fewer than 50 bp of non-repeat sequence. To further remove human reads, we performed reciprocal alignment with the paired-end unmapped reads against the human reference genome (hg19) using BWA-MEM [42] with customized parameters ‘-k 19 -c 100000 -m -T 20 -h 10000 -Y -M’ [28]. Reads containing more than 50 bp of mappable sequence were considered human reads and removed from further analysis. To remove known vectors and other artificial sequences, we aligned the reads against a vector database using BWA-MEM with the same customized parameters described above. The vector database was curated by merging the UniVec (consisting of 5,456 sequences) and EMVEC (consisting of 4,189 sequences) databases, which were originally assembled by the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI), respectively. Reads with more than 50 bp mappable to the vector database were removed from further analysis.
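The 50 bp mappable-sequence rule can be illustrated with a short Python sketch. It assumes that "mappable sequence" is measured as the number of read bases placed on the reference in the alignment's CIGAR string (the M, = and X operations); the text does not state how the length was computed, so this definition and the function names are hypothetical.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_read_bases(cigar):
    """Count read bases aligned to the reference (M, = and X operations)."""
    if cigar == "*":  # "*" means no alignment is available
        return 0
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=X")


def is_reference_derived(cigar, max_mappable=50):
    """True when more than max_mappable read bases align to the reference."""
    return aligned_read_bases(cigar) > max_mappable
```

Under this assumption, a read aligned as `60M40S` would be removed, whereas `30M2I20M48S` (exactly 50 aligned bases) would be retained because the rule requires more than 50 bp.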
2.4. Screening of viral databases for virus-mappable reads
The NCBI Viral Genomes database, containing 6,009 complete viral genomes (obtained from https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/ on March 31, 2016), was used as the primary reference for the detection of virus-mappable reads. To explore additional viral sources, we also screened our larger in-house viral sequence database, which contains a total of approximately 921,000 unique viral reference sequences. Our viral sequence database was collected and filtered by curating 15 different sources of viral sequence databases. We kept only the unique viral references by removing redundant sequences, including references with the same GenBank accession numbers and references with different identification numbers but identical sequences. The names of these viral references were obtained from the NCBI Nucleotide database. More details of our database were described in our previous study [28]. All viral reference sequences shorter than 500 bp were removed from the analyses.
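The redundancy filtering described above can be sketched as follows. This is a minimal illustration, not the curation pipeline used for the in-house database: the function name and the `(accession, sequence)` input format are hypothetical, and comparing sequences by a case-insensitive SHA-256 digest is an assumption made to keep memory use modest.

```python
import hashlib


def curate_references(records, min_len=500):
    """Deduplicate (accession, sequence) records and drop short references.

    A record is skipped if its accession was already seen, if an identical
    sequence was already kept, or if it is shorter than min_len bp.
    """
    seen_accessions = set()
    seen_digests = set()
    kept = []
    for accession, sequence in records:
        if len(sequence) < min_len:
            continue
        digest = hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()
        if accession in seen_accessions or digest in seen_digests:
            continue
        seen_accessions.add(accession)
        seen_digests.add(digest)
        kept.append((accession, sequence))
    return kept
```

The first record encountered for a given accession or sequence is the one retained, so the input order determines which of several redundant entries survives.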
To detect candidate viral reads, the cleaned paired-end unmapped reads were separately aligned to the NCBI Viral Genomes database and our in-house viral database. BWA-MEM [42] was used for efficient read alignment with the same customized parameters described above. To identify confident viral reads, only those reads with more than 50 bp of non-repeat sequence which uniquely mapped to references with the same viral names in each database were used. To identify the patterns of virus-mappable reads, we only kept reads which mapped to a viral reference having either a minimum of 50 virus-mappable reads cumulatively across all subjects or at least 20 virus-mappable reads from a single subject.
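The reference-level retention rule can be expressed compactly. The sketch below assumes read counts are available as a dictionary keyed by `(subject, reference)`, which is a hypothetical data layout, and it treats both thresholds as inclusive, following the wording of this section.

```python
from collections import defaultdict


def select_references(read_counts, min_total=50, min_single=20):
    """Return references with enough virus-mappable reads to be profiled.

    read_counts maps (subject, reference) to a read count. A reference is
    kept if its reads summed over all subjects reach min_total, or if any
    single subject contributes at least min_single reads.
    """
    total = defaultdict(int)
    peak = defaultdict(int)
    for (subject, reference), n in read_counts.items():
        total[reference] += n
        peak[reference] = max(peak[reference], n)
    return {
        ref for ref in total
        if total[ref] >= min_total or peak[ref] >= min_single
    }
```

A reference supported by 20 reads in one subject passes through the single-subject criterion even if it is absent elsewhere, while a reference with 19 reads in each of two subjects fails both criteria.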
2.5. Identification of virus-mappable reads associated with sequencing facility or DNA source
We first searched for viral references with mappable reads that differed in abundance among the seven sequencing facilities or between the two DNA sources. Here, viral abundance was estimated as the number of virus-mappable reads divided by the sequencing coverage (i.e., number of virus-mappable reads per sequencing fold). This abundance was used to measure the enrichment and preference of viral reads. When comparing between LCLs and blood, we only considered subjects whose DNA source was originally reported.
We conducted a Mann-Whitney test of the viral abundance between each pair of sequencing facilities and between the two DNA sources (viral references with reads detected in fewer than five subjects were excluded). Then, the Mann-Whitney test of viral abundance between each pair of sequencing facilities was repeated among the subjects having the same DNA source. We then applied Holm’s correction method to adjust the P values using the p.adjust function in R. To retrieve all potential and high-quality facility- or source-associated virus-mappable reads, the viral references showing significant differences (adjusted P < 0.01) in viral abundance were included.
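For readers who want to see the two quantities concretely, the following Python sketch computes the abundance measure (reads per sequencing fold) and a Holm step-down adjustment. The analyses in this study were run in R; the function names here are hypothetical, and the Holm routine is a plain re-implementation of the standard procedure rather than the R code itself.

```python
def viral_abundance(n_reads, coverage):
    """Virus-mappable reads divided by sequencing coverage (fold)."""
    return n_reads / coverage


def holm_adjust(pvalues):
    """Holm step-down adjusted P values, returned in the input order.

    The i-th smallest P value (i starting at 1) is multiplied by
    (m - i + 1), capped at 1, and forced to be non-decreasing.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

For example, the raw P values 0.01, 0.04, 0.03 and 0.005 adjust to approximately 0.03, 0.06, 0.06 and 0.02, so none of them would pass an adjusted threshold of 0.01.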
2.6. Cluster analysis of facility and source-associated reads
To further investigate whether the detected virus-mappable reads differed in their presence among sequencing facilities, super-populations, or DNA sources, we clustered the detected viral references based on the similarity of the presence of viral reads across all analyzed subjects. To visualize the detected viral references, we applied an approach similar to that described in a previous study [12], i.e., a sample with the presence of a viral reference having virus-mappable reads was scored as 1 (this approach was applied only for heatmap generation, but not for viral abundance calculation). The heatmap.2 function in R with the default ‘hclust’ parameter was used to perform the hierarchical clustering analysis of the subjects for all detected viral references.
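The presence scoring used for the heatmap can be illustrated as below. The sketch builds a binary reference-by-subject matrix from a hypothetical `(subject, reference)` count dictionary and computes a Euclidean distance between two rows; the choice of Euclidean distance is an assumption made for illustration, and the hierarchical clustering step itself is not reproduced.

```python
import math


def presence_matrix(read_counts, subjects, references):
    """Rows are references, columns are subjects; 1 marks at least one read."""
    return [
        [1 if read_counts.get((s, r), 0) > 0 else 0 for s in subjects]
        for r in references
    ]


def euclidean(row_a, row_b):
    """Euclidean distance between two equal-length presence vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_a, row_b)))
```

Because every non-zero count collapses to 1, a subject with a single read and a subject with thousands of reads for the same reference are indistinguishable in this matrix, which is why the scoring was restricted to visualization.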
2.7. Independent replications
To verify whether the sequencing facility and DNA source-associated patterns observed in the 1000 Genomes Project data also existed in other independent datasets produced by the same sequencing facilities, we analyzed the WGS data (Build 37) of 158 subjects obtained from the Alzheimer’s Disease Sequencing Project. The sequence data was produced by three sequencing facilities: WUGSC, BI, and BCM (all samples were from blood). The same analyses described above were performed, and the detected patterns were compared with those from the 1000 Genomes Project samples.
3. Results
3.1. Comparison of unmapped reads
The average depth of the WGS data of the 2,535 subjects from the 1000 Genomes Project was 7.6-fold. There were significant differences (ANOVA P = 2×10−16) in the numbers of sequencing reads among the 26 different populations (Supplementary Figure 1A). When we compared the mean values of sequencing reads generated by each of the seven sequencing facilities, we found that the subjects sequenced by BI had the most (30.8 billion reads per subject) (Supplementary Figure 1B). We also compared the percentage of paired-end unmapped reads per subject among 26 different populations, seven sequencing facilities, and two DNA sources. We found a high degree of variability among populations (Figure 1C, Supplementary Figure 2, and Supplementary Table 2; ANOVA P < 2×10−16) and among sequencing facilities (Figure 1A and Supplementary Figure 3; P < 2×10−16), but not between DNA sources (Figure 1B; P = 0.22). For example, 40% of the subjects from the PUR population (Puerto Rican in Puerto Rico) had more than 10% paired-end unmapped reads. We further found a high degree of variability between sequencing facilities in each of the 26 populations (adjusted P < 0.01), while Illumina (adjusted P = 0.018), MPIMG (adjusted P = 0.014), and BCM (adjusted P = 0.013) showed no major variations across populations (Supplementary Table 3). These results indicate that sequencing facility likely accounted for the majority of the differences observed between populations (Figure 1C). For example, the BI center had the highest percentage of unmapped reads, while Illumina had the lowest, and many of the PUR population samples (44 out of 105) were sequenced by the BI center (Figure 1C). In all, the percentage of unmapped reads differed significantly between facilities.
Figure 1.

Distribution of percentages of unmapped reads across A) sequencing facilities, B) DNA source, and C) populations. Red bars indicate the mean percentages of paired-end unmapped reads.
3.2. Detection of virus-mappable reads associated with DNA source or sequencing facility
By aligning the clean paired-end unmapped reads separately to the NCBI Viral Genomes and our in-house viral databases, we obtained a total of 99 viral references that contained either more than 50 virus-mappable reads in total or more than 20 reads in a single subject (Supplementary Table 4). A total of 260,339 reads mappable to these 99 viral references were detected. Sixty-one of the 99 viral references were consistently found in both databases, while 21 and 17 were found only in the NCBI or in-house viral databases, respectively. Among the 99 viral references, 47 showed significantly differential viral abundance among subjects (adjusted P < 0.01) when stratified by DNA source or sequencing facility (Supplementary Table 5); these 47 references accounted for a total of 225,646 virus-mappable reads (86.7%) across all subjects. When we compared the number of subjects having each virus, we observed significantly more (t-test, P = 6.6×10−5) subjects containing reads mapped to each of the 47 DNA source or sequencing facility associated viral references than to the 52 non-associated references (Supplementary Figure 4). By comparison, when we compared the number of reads per subject, the difference between the two groups of viral references was not significant (t-test, P = 0.028). All reads mapped to these 47 viral references were kept for further analyses.
We divided these viral references into four groups based on the patterns of abundance of reads mapping to them: 1) phage references consistently observed in subjects from both blood and LCLs at specific facilities, 2) facility-specific adenovirus C references, 3) Enterobacteria phage phiX174 and other phage references, and 4) non-phage, non-adenovirus C viral references (Figure 2A). When we visualized the reads mapped to each viral reference across all subjects, the subjects were observed to cluster by sequencing facility, but not by population (Figure 2B). The abundance of these virus-mappable reads varied significantly among the viral references (Figure 2A). Of all the phage references, Bacillus phage SPbeta was most abundant. Of the non-phage viral references, human adenovirus 2 had the highest abundance.
Figure 2.

Abundance and patterns of all detected virus-mappable reads. A) We identified a total of 47 DNA source or sequencing facility associated viral references, which we separated into four groups based on the abundance patterns of viral reads. The percentage of subjects containing reads mapped to each viral reference is shown for each sequencing facility separately. Subjects sourced from LCLs or blood are shown separately. B) Heatmap of viral reads mapped to the 47 viral references detected in the 1000 Genomes Project WGS data of all subjects. Subjects clustered by sequencing facility, while no clusters were observed between different populations. The green boxes highlight the subjects containing viral reads which appear to be correlated with each other within the cluster.
3.3. Patterns of reads mappable to group 1: phages
The reads mappable to group 1 phages were highly prevalent in data sequenced at WUGSC (e.g., Enterobacteria phage T4), and at BI (e.g., Ralstonia phages, Burkholderia phages, Rhodococcus phages, and Xylella phage Paz) (Figure 2A). All subjects containing reads mapped to those phage references were clustered by sequencing facility, which was consistent for subjects sourced from both blood and LCLs. Figure 3A shows that Enterobacteria phage T4 mappable reads appeared in almost 100% and 75% of subjects sourced from LCLs and blood, respectively, which were sequenced by WUGSC. The percentages were significantly higher (> 9.7 times) than those from the other six facilities (Figure 3B). The total abundance of virus-mappable reads in the subjects sequenced at WUGSC was also significantly higher (t-test P = 2×10−16) than in those sequenced at the other facilities, and this was true for subjects sourced from both LCLs and blood (Figure 3C). To examine the alignment of virus-mappable reads, we inspected all reads mapped to the Enterobacteria phage T4 reference genome (NC_000866.4) across all subjects. We observed that nearly all reads were mapped to a 4.3-kb genomic region, consisting of two viral genes: 43 and pseT (Supplementary Table 5). We then aligned each of the two contigs assembled from these reads to the NCBI Nucleotide database using BLASTn [43]. Both contigs aligned to the Enterobacteria phage T4 genome at the two gene loci, and this genome was the top hit for both. We further searched the sequences of the two genes in the vector database using the VecScreen tool, and no matches were found. Thus, although it is challenging to determine their origins, we suspect that the reads mapped to the Enterobacteria phage T4 may originate from artificial sequences introduced as contamination during library preparation at the sequencing facility.
Figure 3.

Abundance and patterns of reads mapped to the Enterobacterial phage T4 reference in subjects sequenced at WUGSC, and Ralstonia phages, Burkholderia phages, Xylella phage Paz, and Rhodococcus phage REQ1 references sequenced at BI. A) Heatmap of all subjects containing group 1 phage mappable reads. B) Percentage of subjects containing phage T4 mappable reads for each sequencing facility. C) Abundance of phage T4 mappable reads in subjects sequenced at each facility. D) Percentage of subjects containing Ralstonia phage RSB1 mappable reads for each facility. E) Abundance of Ralstonia phage RSB1 mappable reads in subjects sequenced at each facility. The phage T4 mappable reads were abundant in subjects sequenced at WUGSC; and other group 1 phage mappable reads were abundant in subjects sequenced at BI.
We found that reads mapped to the Burkholderia phages, Ralstonia phages, Xylella phage Paz, and Rhodococcus phage REQ1 references exhibited greater prevalence (> 9.5 times) in subjects sequenced at BI than at other facilities (Figure 2A). These phage references were closely correlated and concurrently observed in a small group of subjects (Figure 3A). Among them, the Ralstonia phage RSB1 reference had the highest abundance (Figure 3E, i.e., the mean viral abundance value was 0.58 across all subjects with Ralstonia phage RSB1 mappable reads), appearing in around 25% of subjects sourced from both blood and LCLs and sequenced at BI (Figure 3D). The abundance of the Ralstonia phage RSB1 was similar in subjects sourced from blood and those sourced from LCLs (Figure 3E). Owing to the low depth of the WGS data or the low abundance of phage contaminants, the virus-mappable reads often covered a small percentage of the phage genomes. However, some phages, such as Ralstonia phage RSB1, did have reads evenly distributed across the entire phage genome; in contrast, other phages had viral reads mapped only to certain genomic regions (Supplementary Table 5). Almost all of the associated host bacteria (such as Ralstonia, Burkholderia, Rhodococcus, and Xylella [33]) have been reported as contaminants in DNA extraction kits or other laboratory reagents, indicating that reads mapping to the phages may have originated as co-contaminants with their host bacteria during library preparation at the sequencing facility.
3.4. Patterns of reads mappable to group 2: adenovirus C
Almost all reads mapped to adenovirus C were found in subjects sourced from LCLs; almost no reads were found in subjects sourced from blood. The abundance of these viral reads also differed in subjects sequenced at different facilities. Approximately 30% of the samples sequenced at SC and 23% of those sequenced at MPIMG had reads mappable to human adenovirus 2, a member of the species adenovirus C (also known as human mastadenovirus C) (Figure 4A and Figure 4B). The highest abundance of adenovirus C reads was observed in samples sequenced at SC (with a mean viral abundance value of 85.5 across all subjects with human adenovirus 2 mappable reads, Figure 4C), and the mean viral abundance values varied from 0.09 for BI to 85.5 for SC across the subjects with adenovirus C-mappable reads. By further inspecting the read alignment of each adenovirus C reference, we found that almost all reads were mapped, with high depth, to a 5,035 bp genomic region which consisted of multiple coding regions from adenovirus C: 52K, pIIIa, III, pX, and pVI. Members of adenovirus C, including human adenoviruses 1, 2, 5, and 6, are known to be highly infectious and appear in subjects from all populations [44]. However, recombinant adenoviruses are often used as vectors in gene expression studies. Thus, adenovirus C mappable reads can be introduced through use of customized adenoviral vectors [34]. Based on our analysis, we suspect that the adenovirus C mappable reads observed in our analyses are most likely derived from recombinant adenovirus vectors used during cell culture, rather than infectious pathogens.
Figure 4.

Adenovirus C mappable reads abundant in subjects mainly sourced from LCLs at SC and MPIMG. A) Heatmap of all subjects having adenovirus C mappable reads. B) Percentage of subjects having human adenovirus 2 mappable reads for each facility. C) Abundance of human adenovirus 2 mappable reads at each facility. At all facilities except Illumina, the adenovirus C mappable reads were abundant only in subjects sourced from LCLs; at Illumina, they were more abundant in subjects sourced from blood than in those sourced from LCLs.
3.5. Patterns of reads mappable to group 3: Enterobacteria phage phiX174 and other phages
Our analyses showed that almost all reads mapped to non-phiX174 Enterobacteria phages coincided with those mapped to Enterobacteria phage phiX174, whilst the reads mapped to most non-Enterobacteria phages were independent. Besides the reads mapped to the phage references described in group 1 above, all other phage references containing reads appearing in the 1000 Genomes Project samples were categorized as group 3 here. Phage phiX174 mappable reads were highly abundant in the data from BCM and WUGSC (Figure 2A). We analyzed the relationship between the phage phiX174 and other group 3 phage references. All subjects with detected group 3 phage references were divided into two subgroups based on the existence of phage phiX174 mappable reads and were clustered separately (Figure 5A).
Figure 5.

Abundance and patterns of reads mapped to the Enterobacterial phage phiX174 and other group 3 phages. A) Heatmaps of subjects having any group 3 phage mappable reads. The subjects with and without phage phiX174 mappable reads were clustered separately. B) Percentage of subjects having viral reads mapped to each phage reference, with the presence or absence of phage phiX174 mappable reads, separately. The reads mapped to some phages exhibit different patterns regardless of the presence of phage phiX174 mappable reads.
For most of the 175 subjects with phage phiX174 mappable reads, non-phiX174 Enterobacteria phages appeared to co-exist with phage phiX174, such as the reads mapped to Escherichia phage TL-2011b, which could originate from use of commercial Enterobacteria phage phiX174 reagents (Figure 5B). We then analyzed all viral reads cumulatively across all subjects for each of these phage genomes (Supplementary Table 5). We found that the viral reads were relatively evenly distributed across some phage references, such as Stx2-converting phage 1717 and Enterobacteria phage mEp460. Other phage references showed reads that were more enriched in certain genomic regions, which may be due to their high genome sequence homology.
We observed distinct patterns of viral read abundance for the 449 subjects that had reads mapped to group 3 phage references other than phiX174. The reads mapped to these phage references, including uncultured phage, Lactobacillus phage Lc-Nu, Bacillus phages, Pseudomonas phage phi297, and Erwinia phage phiEaH2, appeared in more subjects sourced from blood than from LCLs (> 6.3 times on average), implying potential blood sample origins (Figure 5B). For example, the Pseudomonas phage phi297 mappable reads were clustered in subjects sequenced by BGI rather than at other facilities (Figure 5A). When we analyzed the virus-mappable reads across all subjects for each phage genome (Supplementary Table 5), some phages, such as Bacillus phage phlS3501 and Pseudomonas phage phi297, showed more evenly distributed reads, while others, such as Lactobacillus phage Lc-Nu and Erwinia phage phiEaH2, had reads mapped only to specific genomic regions. Thus, in the absence of phage phiX174 mappable reads, we still observed viral reads mapped to other phage references, including Bacillus phages, Pseudomonas phage phi297, and Erwinia phage phiEaH2. Taken together, the phages with evenly distributed viral reads, including Stx2-converting phage 1717, Enterobacteria phage mEp460, Bacillus phage phlS3501, Pseudomonas phage phi297, and others, appeared either to coincide with spike-in phage phiX174 or to be found independently. In contrast, the reads mapped only to specific genomic regions in other phages from this group might originate from sequences homologous to the phages having evenly distributed reads.
3.6. Reads mappable to group 4: non-phage, non-adenovirus C viral references
Besides phages and adenovirus C, we also observed reads that mapped to many other non-phage viral references (Figure 2A). It is known that EBV sequences exist in WGS data derived from LCLs. We tried to remove EBV sequences when generating the paired-end unmapped reads from the input BAM files by using an EBV reference genome as a supplement to the human reference genome. However, we still identified some EBV reads in subjects sourced from LCLs, regardless of sequencing facility (Figure 2A). This was likely because we adopted less stringent parameters for read alignment, which allowed some EBV strains containing genomic variations relative to the EBV reference genome to escape filtering. Reads mapped to other herpesviruses, such as human herpesvirus 7 and macacine herpesvirus 4, were also observed in subjects sourced from blood at some facilities. Torque teno virus, which has been found in blood samples [45], appeared in the data produced by multiple facilities. However, due to the low number of viral reads, it is difficult to distinguish whether they are derived from contaminants or infectious viruses. The reads mapped to non-phage viral references could originate during cell culturing or could be due to sequence homology. They could possibly reflect real viral infections if they mainly appear in subjects sourced from blood, as with human herpesvirus 7 and torque teno virus. Thus, further validation is needed to verify the origins of the reads mapped to non-phage, non-adenovirus C viral references.
3.7. Independent replication of distinct viral sequence patterns
To verify whether our observed viral sequence patterns were replicable, we analyzed an independent WGS dataset (around 37-fold sequencing depth on average) generated by three facilities: BCM, BI, and WUGSC. Consistent with our earlier observations, a greater abundance of reads mapped to phage phiX174 and its related Enterobacteria phages was observed in subjects sequenced at BCM and WUGSC than in those sequenced at BI (Supplementary Figure 5). This agreed with the abundance patterns we observed in the 1000 Genomes Project samples. We also observed consistency in the abundance of reads mapped to Enterobacteria phage T4 in data generated at WUGSC, and Enterobacteria phage P1 and Pseudomonas phage phi297 in data generated at BI (Supplementary Table 6). These results indicate that reads mapped to these nine viral references were replicable in two independent WGS datasets produced by the same sequencing facilities (Supplementary Figure 5). Additionally, we observed a high abundance of EBV reads in six of the analyzed subjects; however, the DNA sources for the six subjects were documented as blood by the Alzheimer’s Disease Sequencing Project. We contacted the authors of this Project and the sources of these six samples were eventually verified to be LCLs, as predicted by our analysis (Supplementary Table 6).
4. Discussion
Summary of results
We systematically dissected the composition of virus-mappable sequences in human-unmappable reads. The identified reads mapped to 47 viral references and showed significant differential abundance between subjects sourced from blood and LCLs, or among subjects sequenced at different sequencing facilities. These sequencing facility- and DNA source-associated patterns of virus-mappable reads support the common existence of batch effects in HTS data. To verify our findings, we used an independent WGS dataset that was produced by three of the same sequencing facilities. Reads mapped to Enterobacteria phage T4 consistently appeared in subjects sequenced at WUGSC, and reads mapped to phage phiX174 and other phages consistently appeared at BCM and WUGSC. Our analysis also unexpectedly corrected DNA source information for samples from the Alzheimer’s Disease Sequencing Project.
Possible origins of sequencing facility and DNA source-associated reads
Contaminating viral reads have previously been reported in studies of the composition of unmapped reads and pathogen discovery [1, 12, 46, 47]. Such reads can be introduced from various sources, such as water, tissue culture plates, reagents, sequencing library preparation, use of recombinant vectors, and in some cases, their bacterial hosts [23]. Any of these sources might eventually lead to different abundances of viral reads among sequencing facilities. In this study, we hypothesize that vector contamination during cell culture might be a potential origin for the highly abundant reads mapped to adenovirus C references. As adenovirus C was not found in the two vector databases, our result implies a potential new source of vector contamination. The reads mapped to Enterobacterial and Shigella phages mainly originated in subjects sequenced at BCM and WUGSC (Shigella, which is in the order Enterobacterales and family Enterobacteriaceae, has over 99% similarity with the enterobacterium Escherichia [48]). The reads mappable to these phages were also observed to coincide with the presence of phage phiX174, indicating that they might be introduced through use of commercial phage phiX174 reagents. Phages can also originate from co-contaminating host bacteria or contaminate independently through certain batch effects, such as sharing of the same sequencing flow cell. In this study, a greater abundance of reads mappable to Ralstonia phages, Burkholderia phages, and other phages was observed in the subjects sequenced at BI. Indeed, the associated host bacteria, such as Ralstonia and Burkholderia, have been reported as common bacterial contaminants from ultrapure water systems [49], DNA extraction kits, or other laboratory reagents [33]. The evidence cumulatively indicates that some phages detected in the 1000 Genomes Project subjects are likely co-contaminants with their host bacteria and originated from ultrapure water or other kit/lab reagents.
Additionally, we observed reads mapped to a few rare viral references appearing in a small number of subjects. For example, reads mapped to the Staphylococcus phage references with high viral abundance only appeared in two subjects, both from LCLs and sequenced at separate facilities. The squirrel monkey retrovirus, which was also previously reported in cultured cell lines [50], appeared with high viral abundance (e.g., corresponding to up to 486 reads) in eight subjects sourced from LCLs and drawn from European populations.
Comparisons with prior research
Previous studies have found viral contaminants in WGS datasets from healthy individuals [1, 12, 32], although few of them focused on analyzing sequencing facility or DNA source patterns. For example, Laurence et al. [32] found fewer than 10 viral reads per dataset by analyzing the WGS data from the 1000 Genomes Project. As that study only analyzed a small subset of the data, it did not allow for comparisons between sequencing facilities. In another study, Tae et al. [1] found that adenovirus C might arise from vector DNA. In a more recent study, Moustafa et al. used BLASTn to detect viral reads after removing vector, plasmid, and other non-viral sequences, and examined the human virome in 8,000 blood samples [12]. In our study, we used BWA-MEM to identify uniquely mapped reads and detect batch-associated patterns, after applying similar quality controls and filtering, and analyzed 2,535 samples with a focus on identifying sequencing facility- and DNA source-associated patterns of virus-mappable reads. We further evaluated whether the viral reads mapped to the 47 batch-associated viral references commonly existed in other publicly available datasets that were previously used for profiling the human virome. Specifically, we compared our results with those from two studies [12, 27] (no sequencing facility- or tissue-specific patterns were analyzed in either study). Most of these 47 viral references were also identified in the datasets from the two studies. For example, Moustafa et al. [12] also observed viral reads mappable to phiX174 and other phages (Supplementary Table 7). This suggests that viral reads mappable to these batch-associated viral references exist in other human WGS datasets as well. In our study, we further investigated sequencing facility-specific and tissue-specific enrichment patterns, e.g., those associated with Ralstonia phages, which were significantly enriched in the data from BI but not in the data from other facilities.
In addition, adenovirus was not detected as contamination in the previous study [12]. In our study, we found that the reads mapped to adenovirus appeared only in subjects derived from LCLs, except for a small percentage of subjects sequenced at Illumina. We further found that the viral reads mapped to the adenovirus C references appeared in more than 20% of subjects sourced from LCLs at MPIMG. These results suggest that adenovirus may represent either contamination or an actual infectious pathogen; distinguishing between the two requires further investigation, e.g., based on the patterns of mappable viral reads.
Further considerations and future research
All virus-mappable reads were identified from human-unmappable reads, which may also contain sequencing errors and repeat sequences. To address these issues in this study, we first removed low-quality nucleotides and reads as well as repetitive sequences. We then removed known vectors, plasmids, and other artificial sequences. Although we observed cases in which virus-mappable reads were broadly distributed along the viral reference genomes with high abundance, we observed many more cases in which the reads were sporadically distributed with low abundance. This may have numerous causes: for example, for reads mapped to phages, if their host bacteria had low abundance, the viral abundance would also be low. Sequence homology among different viral species is another possibility. We performed reciprocal alignment and removed all possible reads mapped to the human genome, which might have led to removal of additional viral reads. However, even with stringent criteria and low-depth data, our results have demonstrated strong batch effects. Similar analyses could be applied to a broad range of other datasets. More sequencing facility- or DNA source-associated reads might be identified when high-depth data are used [51]. It is also necessary to perform similar analyses with RNA-Seq and whole-exome sequencing data, in which a greater abundance of certain viral reads might be observed, benefitting from the generally higher sequencing depths. Different batch effects or new viral read patterns might be found with these technologies, as different procedures are required for the preparation of sequencing libraries with each sequencing type [51, 52, 53]. Bacterial contaminants have been recently reported in WGS and RNA-Seq data [1, 23, 32, 51, 54, 55]. It might also be interesting to screen for bacterial reads to study the co-occurrence of host bacteria and their phages.
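The distinction drawn above between broadly and sporadically distributed reads can be made concrete by computing the fraction of a viral reference covered by at least one read. The following is a minimal sketch under stated assumptions, not the procedure used in this study: alignments are given as 0-based, half-open (start, end) intervals on a single reference, and the interval values and the reference length of 5,000 bp are hypothetical.

```python
def covered_length(alignments):
    """Number of reference bases covered by at least one aligned read.

    alignments: iterable of (start, end) intervals, 0-based and half-open.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(alignments):
        if cur_end is None or start > cur_end:
            # A gap precedes this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def breadth_of_coverage(alignments, reference_length):
    """Fraction of the reference covered by at least one read."""
    return covered_length(alignments) / reference_length


# Hypothetical alignments on a hypothetical 5,000 bp viral reference.
reads = [(0, 100), (50, 150), (400, 500)]
print(covered_length(reads), breadth_of_coverage(reads, 5000))
```

A low breadth combined with a low read count corresponds to the sporadic, low-abundance scenario, whereas many reads spread over most of the reference correspond to the broadly distributed one.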
In future studies, it will be necessary to verify identified “viral sequences” and subtract all verified batch-associated reads, particularly in studies characterizing the human blood virome or identifying pathogens. To facilitate virus-related HTS data analyses, we have made our curated in-house viral database available on our webpage. We suggest that users align their detected viral reads against the sequencing facility- and DNA source-associated viral references identified in this study. In general, 1) if reads are mapped to the exact genomic regions reported in this study, e.g., the same regions of adenovirus C, the reads should be excluded from the detection and discovery research of pathogens and infectious agents; 2) if reads are mapped to other genomic regions, the viral references should be more carefully inspected for whether they represent real infectious agents before the removal of the viral candidates; and 3) additionally, a blank control may also be sequenced when designing the experiments, and the identified viral sequences of interest could be further subjected to targeted sequencing for verification.
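Recommendations 1) and 2) above amount to a simple triage rule, which can be sketched as follows. This is an illustrative sketch only: the reference name, the region coordinates, and the category labels are hypothetical, and treating any overlap with a reported region as a match is an assumption, since the text refers to reads mapped to "the exact genomic regions".

```python
def overlaps(a, b):
    """True if two 0-based, half-open (start, end) intervals share a base."""
    return a[0] < b[1] and b[0] < a[1]


def triage_read(reference, span, reported_regions):
    """Classify one aligned read against reported batch-associated regions.

    reported_regions maps a viral reference name to a list of (start, end)
    intervals reported as batch-associated (hypothetical structure).
    """
    regions = reported_regions.get(reference)
    if regions is None:
        return "not_listed"   # reference not among the batch-associated ones
    if any(overlaps(span, region) for region in regions):
        return "exclude"      # recommendation 1: same region as reported
    return "inspect"          # recommendation 2: same reference, other region


# Hypothetical reported regions (placeholder names and coordinates).
reported = {"adenovirus_C_ref": [(100, 300), (1200, 1500)]}

print(triage_read("adenovirus_C_ref", (120, 270), reported))
print(triage_read("adenovirus_C_ref", (5000, 5150), reported))
print(triage_read("other_virus_ref", (10, 160), reported))
```

Recommendation 3), sequencing a blank control and verifying candidates by targeted sequencing, is an experimental step and is not covered by this sketch.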
Supplementary Material
Supplementary Figure 1 Summary of the 1000 Genomes Project WGS data. A) The total number of reads from each population. B) The total number of reads generated by the seven primary sequencing facilities. The full names of populations and super-populations, and sequencing facilities are listed in Supplementary Table 1. Red bars indicate the mean values of sequencing reads.
Supplementary Figure 2 Proportions of samples with various ranges of percentages of unmapped reads across A) Sequencing facilities, B) DNA sources, and C) populations.
Supplementary Figure 3 Distribution of percentages of unmapped reads among the seven sequencing facilities.
Supplementary Figure 4 Comparative analyses of A) the number of reads per subject and B) the number of subjects for each of the 47 sequencing facility- or DNA source-associated viral references and the 52 non-associated viral references.
Supplementary Figure 5 Heatmap of the viral references having abundant viral reads validated using an independent WGS dataset. The abundance patterns of the reads mapped to those viral references observed in this dataset were consistent with those observed in the 1000 Genomes Project data.
Supplementary Table 1 The 1000 Genomes Project (phase 3) subjects included in this study (Excel Table).
Supplementary Table 2 Distribution of the percentages of unmapped reads in each population
Supplementary Table 3 ANOVA by populations and sequencing facilities
Supplementary Table 4 Viral reference sequences detected by screening the NCBI Viral genome and in-house viral databases. The number of reads mapped to each viral reference is shown (Excel Table).
Supplementary Table 5 Viral reads associated with sequencing facility or DNA source. Multiple statistical tests were used to compare between each pair of sequencing facilities or DNA sources. Only uniquely mapped reads were used to calculate the length of viral references covered by viral reads (Excel Table).
Supplementary Table 6 Independent datasets included to replicate the sequencing facility-associated patterns of reads mapped to identified viral references (Excel Table).
Supplementary Table 7 Comparative analysis of identified sequencing facility- or DNA source-associated viral reads using datasets from other studies (Excel Table).
Highlights
Virus-mappable reads are identified in whole-genome sequencing datasets derived from human specimens.
Patterns of reads associated with technical variables (sequencing facility and DNA source) are identified and characterized.
Filtering of technical variable-associated reads may improve data quality of genomic and metagenomic analyses.
Acknowledgment
The data used in this study included those from the 1000 Genomes Project (Phase 3), and the dbGaP through accession number phs000572 (Alzheimer’s Disease Sequencing Project; the Build 37 data was used in this study. The Build 38 data can be downloaded from the NIAGADS database). The authors acknowledge the Vermont Advanced Computing Core and the Massachusetts Green High-Performance Computer C3DDB for computing resources. The authors thank John Baronas, Jason Kost, Arvis Sulovari, Guangchen Liu, and Michael Mariani for their discussions pertaining to the analysis. We thank William Langdon, Ph.D., Alan Walker, Ph.D., and other colleagues for their careful reviews of the manuscript. We also thank the anonymous reviewers for their careful reviews and constructive comments and suggestions.
Funding
This work was supported by the Start-up Fund of The University of Vermont, and partially by research grants from the National Institutes of Health National Institute of Allergy and Infectious Diseases (AI147084), the Department of Defense Lung Cancer Research Program (LC190467), and the Solve ME/CFS Initiative Ramsay Award Program.
Footnotes
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Electronic Database Information
Accession numbers and URLs for data presented herein are as follows: UCSC Genome Browser database: http://hgdownload.soe.ucsc.edu/downloads.html#human;
International Genome Sample Resource database: http://www.internationalgenome.org/home;
VecScreen tool: https://www.ncbi.nlm.nih.gov/tools/vecscreen/;
The database of Genotypes and Phenotypes (dbGaP): https://www.ncbi.nlm.nih.gov/gap;
The UniVec database: ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/;
The EMVEC database: https://www.ebi.ac.uk/Tools/sss/ncbiblast/vectors.html.
Conflict of Interest
The authors declare no potential conflict of interest.
References
- [1] Tae H, Karunasena E, Bavarva JH, McIver LJ, Garner HR, Large scale comparison of non-human sequences in human sequencing data, Genomics, 104 (2014) 453–458.
- [2] Handley SA, The virome: a missing component of biological interaction networks in health and disease, Genome Med, 8 (2016) 32.
- [3] Chiu CY, Viral pathogen discovery, Curr Opin Microbiol, 16 (2013) 468–478.
- [4] Cantalupo PG, Katz JP, Pipas JM, Viral sequences in human cancer, Virology, 513 (2018) 208–216.
- [5] Strong MJ, O’Grady T, Lin Z, Xu G, Baddoo M, Parsons C, Zhang K, Taylor CM, Flemington EK, Epstein-Barr virus and human herpesvirus 6 detection in a non-Hodgkin’s diffuse large B-cell lymphoma cohort by using RNA sequencing, J Virol, 87 (2013) 13059–13062.
- [6] Cao S, Wendl MC, Wyczalkowski MA, Wylie K, Ye K, Jayasinghe R, Xie M, Wu S, Niu B, Grubb R 3rd, Johnson KJ, Gay H, Chen K, Rader JS, Dipersio JF, Chen F, Ding L, Divergent viral presentation among human tumors and adjacent normal tissues, Sci Rep, 6 (2016) 28294.
- [7] Cancer Genome Atlas Research Network, Comprehensive molecular characterization of urothelial bladder carcinoma, Nature, 507 (2014) 315–322.
- [8] Cancer Genome Atlas Research Network, Integrated genomic and molecular characterization of cervical cancer, Nature, 543 (2017) 378.
- [9] Cancer Genome Atlas Research Network, Comprehensive and Integrative Genomic Characterization of Hepatocellular Carcinoma, Cell, 169 (2017) 1327–1341 e1323.
- [10] Salyakina D, Tsinoremas NF, Viral expression associated with gastrointestinal adenocarcinomas in TCGA high-throughput sequencing data, Hum Genomics, 7 (2013) 23.
- [11] Tang KW, Alaei-Mahabadi B, Samuelsson T, Lindh M, Larsson E, The landscape of viral expression and host gene fusion and adaptation in human cancer, Nat Commun, 4 (2013) 2513.
- [12] Moustafa A, Xie C, Kirkness E, Biggs W, Wong E, Turpaz Y, Bloom K, Delwart E, Nelson KE, Venter JC, Telenti A, The blood DNA virome in 8,000 humans, PLoS Pathog, 13 (2017) e1006292.
- [13] Roediger B, Lee Q, Tikoo S, Cobbin JCA, Henderson JM, Jormakka M, O’Rourke MB, Padula MP, Pinello N, Henry M, Wynne M, Santagostino SF, Brayton CF, Rasmussen L, Lisowski L, Tay SS, Harris DC, Bertram JF, Dowling JP, Bertolino P, Lai JH, Wu W, Bachovchin WW, Wong JJL, Gorrell MD, Shaban B, Holmes EC, Jolly CJ, Monette S, Weninger W, An Atypical Parvovirus Drives Chronic Tubulointerstitial Nephropathy and Kidney Fibrosis, Cell, 175 (2018) 530–543.e524.
- [14] Minot S, Bryson A, Chehoud C, Wu GD, Lewis JD, Bushman FD, Rapid evolution of the human gut virome, Proceedings of the National Academy of Sciences of the United States of America, 110 (2013) 12450–12455.
- [15] Byrd AL, Belkaid Y, Segre JA, The human skin microbiome, Nat Rev Microbiol, 16 (2018) 143–155.
- [16] Bhatt AS, Freeman SS, Herrera AF, Pedamallu CS, Gevers D, Duke F, Jung J, Michaud M, Walker BJ, Young S, Earl AM, Kostic AD, Ojesina AI, Hasserjian R, Ballen KK, Chen YB, Hobbs G, Antin JH, Soiffer RJ, Baden LR, Garrett WS, Hornick JL, Marty FM, Meyerson M, Sequence-based discovery of Bradyrhizobium enterica in cord colitis syndrome, The New England journal of medicine, 369 (2013) 517–528.
- [17] Kostic AD, Ojesina AI, Pedamallu CS, Jung J, Verhaak RG, Getz G, Meyerson M, PathSeq: software to identify or discover microbes by deep sequencing of human tissue, Nature biotechnology, 29 (2011) 393–396.
- [18] Bhaduri A, Qu K, Lee CS, Ungewickell A, Khavari PA, Rapid identification of non-human sequences in high-throughput sequencing datasets, Bioinformatics, 28 (2012) 1174–1175.
- [19] Naccache SN, Federman S, Veeraraghavan N, Zaharia M, Lee D, Samayoa E, Bouquet J, Greninger AL, Luk KC, Enge B, Wadford DA, Messenger SL, Genrich GL, Pellegrino K, Grard G, Leroy E, Schneider BS, Fair JN, Martinez MA, Isa P, Crump JA, DeRisi JL, Sittler T, Hackett J Jr., Miller S, Chiu CY, A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples, Genome Res, 24 (2014) 1180–1192.
- [20] Andrusch A, Dabrowski PW, Klenner J, Tausch SH, Kohl C, Osman AA, Renard BY, Nitsche A, PAIPline: pathogen identification in metagenomic and clinical next generation sequencing samples, Bioinformatics, 34 (2018) i715–i721.
- [21] Tampuu A, Bzhalava Z, Dillner J, Vicente R, ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples, PLoS One, 14 (2019) e0222271.
- [22] Ren J, Song K, Chao Deng C, Ahlgren NA, Fuhrman JA, Li Y, Xie X, Poplin R, Sun F, Identifying viruses from metagenomic data using deep learning, Quantitative Biology, 8 (2020) 64–77.
- [23] Strong MJ, Xu G, Morici L, Splinter Bon-Durant S, Baddoo M, Lin Z, Fewell C, Taylor CM, Flemington EK, Microbial contamination in next generation sequencing: implications for sequence-based analysis of clinical samples, PLoS Pathog, 10 (2014) e1004437.
- [24] Pennisi E, Microbiology. Contamination plagues some microbiome studies, Science, 346 (2014) 801.
- [25] Xu B, Zhi N, Hu G, Wan Z, Zheng X, Liu X, Wong S, Kajigaya S, Zhao K, Mao Q, Young NS, Hybrid DNA virus in Chinese patients with seronegative hepatitis discovered by deep sequencing, Proceedings of the National Academy of Sciences of the United States of America, 110 (2013) 10264–10269. [Research Misconduct Found]
- [26] Poore GD, Kopylova E, Zhu Q, Carpenter C, Fraraccio S, Wandro S, Kosciolek T, Janssen S, Metcalf J, Song SJ, Kanbar J, Miller-Montgomery S, Heaton R, McKay R, Patel SP, Swafford AD, Knight R, Microbiome analyses of blood and tissues suggest cancer diagnostic approach, Nature, 579 (2020) 567–574. [Retracted]
- [27] Asplund M, Kjartansdottir KR, Mollerup S, Vinner L, Fridholm H, Herrera JAR, Friis-Nielsen J, Hansen TA, Jensen RH, Nielsen IB, Richter SR, Rey-Iglesia A, Matey-Hernandez ML, Alquezar-Planas DE, Olsen PVS, Sicheritz-Ponten T, Willerslev E, Lund O, Brunak S, Mourier T, Nielsen LP, Izarzugaza JMG, Hansen AJ, Contaminating viral sequences in high-throughput sequencing viromics: a linkage study of 700 sequencing libraries, Clin Microbiol Infect, 25 (2019) 1277–1285.
- [28] Chen X, Kost J, Sulovari A, Wong N, Liang WS, Cao J, Li D, A virome-wide clonal integration analysis platform for discovering cancer viral etiology, Genome Res, 29 (2019) 819–830.
- [29] Chen X, Kost J, Li D, Comprehensive comparative analysis of methods and software for identifying viral integrations, Brief Bioinform, 20 (2019) 2088–2097.
- [30] Cao J, Li D, Searching for human oncoviruses: Histories, challenges, and opportunities, J Cell Biochem, 119 (2018) 4897–4906.
- [31] Sulovari A, Li D, VIpower: Simulation-based tool for estimating power of viral integration detection via high-throughput sequencing, Genomics, 112 (2020) 207–211.
- [32] Laurence M, Hatzis C, Brash DE, Common contaminants in next-generation sequencing that hinder discovery of low-abundance microbes, PLoS One, 9 (2014) e97876.
- [33] Salter SJ, Cox MJ, Turek EM, Calus ST, Cookson WO, Moffatt MF, Turner P, Parkhill J, Loman NJ, Walker AW, Reagent and laboratory contamination can critically impact sequence-based microbiome analyses, BMC Biol, 12 (2014) 87.
- [34] Luo J, Deng ZL, Luo X, Tang N, Song WX, Chen J, Sharff KA, Luu HH, Haydon RC, Kinzler KW, Vogelstein B, He TC, A protocol for rapid generation of recombinant adenoviruses using the AdEasy system, Nature protocols, 2 (2007) 1236–1247.
- [35] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup, The Sequence Alignment/Map format and SAMtools, Bioinformatics, 25 (2009) 2078–2079.
- [36] Quinlan AR, Clark RA, Sokolova S, Leibowitz ML, Zhang Y, Hurles ME, Mell JC, Hall IM, Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome, Genome Res, 20 (2010) 623–635.
- [37] Patel RK, Jain M, NGS QC Toolkit: a toolkit for quality control of next generation sequencing data, PLoS One, 7 (2012) e30619.
- [38] Xu H, Luo X, Qian J, Pang X, Song J, Qian G, Chen J, Chen S, FastUniq: a fast de novo duplicates removal tool for paired short reads, PLoS One, 7 (2012) e52249.
- [39] Smit A, Hubley R, Green P, RepeatMasker Open-4.0. 2013–2015, Institute for Systems Biology; http://repeatmasker.org, (2015).
- [40] Morgulis A, Gertz EM, Schaffer AA, Agarwala R, A fast and symmetric DUST implementation to mask low-complexity DNA sequences, J Comput Biol, 13 (2006) 1028–1040.
- [41] Benson G, Tandem repeats finder: a program to analyze DNA sequences, Nucleic Acids Res, 27 (1999) 573–580.
- [42] Li H, Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv preprint arXiv:1303.3997, (2013).
- [43] Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden TL, NCBI BLAST: a better web interface, Nucleic Acids Res, 36 (2008) W5–9.
- [44] Garnett CT, Erdman D, Xu W, Gooding LR, Prevalence and quantitation of species C adenovirus DNA in human mucosal lymphocytes, J Virol, 76 (2002) 10608–10616.
- [45] Focosi D, Antonelli G, Pistello M, Maggi F, Torquetenovirus: the human virome from bench to bedside, Clinical Microbiology and Infection, 22 (2016) 589–593.
- [46] Dolle DD, Liu Z, Cotten M, Simpson JT, Iqbal Z, Durbin R, McCarthy SA, Keane TM, Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes, Genome Res, 27 (2017) 300–309.
- [47] Kazemian M, Ren M, Lin JX, Liao W, Spolski R, Leonard WJ, Possible Human Papillomavirus 38 Contamination of Endometrial Cancer RNA Sequencing Samples in The Cancer Genome Atlas Database, J Virol, 89 (2015) 8967–8973.
- [48] Devanga Ragupathi NK, Muthuirulandi Sethuvel DP, Inbanathan FY, Veeraraghavan B, Accurate differentiation of Escherichia coli and Shigella serogroups: challenges and strategies, New Microbes New Infect, 21 (2018) 58–62.
- [49] Kulakov LA, McAlister MB, Ogden KL, Larkin MJ, O’Hanlon JF, Analysis of bacteria contaminating ultrapure water in industrial systems, Appl Environ Microbiol, 68 (2002) 1548–1555.
- [50] Uphoff CC, Denkmann SA, Steube KG, Drexler HG, Detection of EBV, HBV, HCV, HIV-1, HTLV-I and -II, and SMRV in human and other primate cell lines, J Biomed Biotechnol, 2010 (2010) 904767.
- [51] Lusk RW, Diverse and widespread contamination evident in the unmapped depths of high throughput sequencing data, PLoS One, 9 (2014) e110808.
- [52] Li G, Yang Y, Van Buren E, Li Y, Dropout imputation and batch effect correction for single-cell RNA sequencing data, Journal of Bio-X Research, 2 (2019) 169–177.
- [53] Zhang H, Cui N, Cai Y, Lei F, Weitz DA, Single-cell sequencing leads a new era of profiling transcriptomic landscape, Journal of Bio-X Research, 1 (2018) 2–6.
- [54] Merchant S, Wood DE, Salzberg SL, Unexpected cross-species contamination in genome sequencing projects, PeerJ, 2 (2014) e675.
- [55] Langdon WB, Mycoplasma contamination in the 1000 Genomes Project, BioData Min, 7 (2014) 3.
