Abstract
Although the recent Zika virus (ZIKV) epidemic in the Americas and its link to birth defects have attracted a great deal of attention1,2, much remains unknown about ZIKV disease epidemiology and ZIKV evolution, in part owing to a lack of genomic data. Here we address this gap in knowledge by using multiple sequencing approaches to generate 110 ZIKV genomes from clinical and mosquito samples from 10 countries and territories, greatly expanding the observed viral genetic diversity from this outbreak. We analysed the timing and patterns of introductions into distinct geographic regions; our phylogenetic evidence suggests rapid expansion of the outbreak in Brazil and multiple introductions of outbreak strains into Puerto Rico, Honduras, Colombia, other Caribbean islands, and the continental United States. We find that ZIKV circulated undetected in multiple regions for many months before the first locally transmitted cases were confirmed, highlighting the importance of surveillance of viral infections. We identify mutations with possible functional implications for ZIKV biology and pathogenesis, as well as those that might be relevant to the effectiveness of diagnostic tests.
Since its introduction into the Americas, mosquito-borne ZIKV (family: Flaviviridae) has spread rapidly, causing hundreds of thousands of cases of ZIKV disease, as well as ZIKV congenital syndrome and probably other neurological complications1–3. Phylogenetic analysis of ZIKV can reveal the trajectory of the outbreak and detect mutations that may be associated with new disease phenotypes or affect molecular diagnostics. Despite the 70 years since its discovery and the scale of the recent outbreak, however, fewer than 100 ZIKV genomes have been sequenced directly from clinical samples. This is due in part to technical challenges posed by low viral loads (for example, these are often orders of magnitude lower than in Ebola virus or dengue virus infection4–6), and by loss of RNA integrity in samples collected and stored without sequencing in mind. Culturing the virus increases the material available for sequencing but can result in genetic variation that is not representative of the original clinical sample.
We sought to gain a deeper understanding of the viral populations underpinning the ZIKV epidemic by extensive genome sequencing of the virus directly from samples collected as part of ongoing surveillance. We initially pursued unbiased metagenomic sequencing to capture both ZIKV and other viruses known to be co-circulating with ZIKV5. In most of the 38 samples examined by this approach there proved to be insufficient ZIKV RNA for genome assembly, but it still proved valuable to verify results from other methods. Metagenomic data also revealed sequences from other viruses, including 41 likely novel viral sequence fragments in mosquito pools (Extended Data Table 1). In one patient we detected no ZIKV sequence but did assemble a complete genome from dengue virus (type 1), one of the viruses that co-circulates with and presents similarly to ZIKV7.
To capture sufficient ZIKV content for genome assembly, we turned to two targeted approaches for enrichment before sequencing: multiplex PCR amplification8 and hybrid capture9. We sequenced and assembled complete or partial genomes from 110 samples from across the epidemic, out of 229 attempted (221 clinical samples from confirmed and possible ZIKV disease cases and eight mosquito pools; Table 1, Supplementary Table 1). This dataset, which we used for further analysis, includes 110 genomes produced using multiplex PCR amplification (amplicon sequencing) and a subset of 37 genomes produced using hybrid capture (out of 66 attempted). Because these approaches amplify any contaminant ZIKV content, we relied heavily on negative controls to detect artefactual sequence, and we established stringent, method-specific thresholds on coverage and completeness for calling high-confidence ZIKV assemblies (Fig. 1a). Completeness and coverage for these genomes are shown in Fig. 1b, c; the median fraction of the genome with unambiguous base calls was 93%. Per-base discordance between genomes produced by the two methods was 0.017% across the genome, 0.15% at polymorphic positions, and 2.2% for minor allele base calls. Concordance of within-sample variants is shown in more detail in Fig. 1d–f. Patient sample type (urine, serum, or plasma) made no significant difference to sequencing success in our study (Extended Data Fig. 1).
Table 1.
Country or territory | Samples | Samples with metagenomic data | Amplicon sequencing genomes | Hybrid capture genomes | Total genomes |
---|---|---|---|---|---|
Brazil | 53 | 12 | 27 | 7 | 27 |
Colombia | 20 | 0 | 4 | 2 | 4 |
Dominican Republic | 45 | 7 | 30 | 9 | 30 |
Guatemala/El Salvador | 3 | 0 | 1 | 0 | 1 |
Haiti | 4 | 0 | 1 | 0 | 1 |
Honduras | 20 | 6 | 18 | 8 | 18 |
Jamaica | 20 | 0 | 5 | 0 | 5 |
Martinique | 3 | 0 | 1 | 0 | 1 |
Puerto Rico | 15 | 0 | 3 | 1 | 3 |
Continental US | 36 | 12 | 20 | 10 | 20 |
Other | 10 | 1 | 0 | 0 | 0 |
Total | 229 | 38 | 110 | 37 | 110 |
Sample source information and sequencing results for 229 clinical and mosquito pool samples.
Continental United States includes eight mosquito pool samples; all others are clinical samples from the Americas. In the final column, genomes generated by both methods are counted only once. ‘Other’ includes regions without a ZIKV genome included in downstream analysis.
To investigate the spread of ZIKV in the Americas we performed a phylogenetic analysis of the 110 genomes from our dataset, together with 64 published genomes available on NCBI GenBank and in refs 10 and 11 (Fig. 2a). Our reconstructed phylogeny (Fig. 2b), which is based on a molecular clock (Extended Data Fig. 2), is consistent with the outbreak having originated in Brazil12: Brazil ZIKV genomes appear on all deep branches of the tree, and their most recent common ancestor is the root of the entire tree. We estimate the date of that common ancestor to have been in early 2014 (95% credible interval (CI) August 2013 to July 2014). The shape of the tree near the root remains uncertain (that is, the nodes have low posterior probabilities) because there are too few mutations to clearly distinguish the branches. This pattern suggests rapid early spread of the outbreak, consistent with the introduction of a new virus to an immunologically naive population. ZIKV genomes from Colombia (n = 10), Honduras (n = 18), and Puerto Rico (n = 3) cluster within distinct, well-supported clades. We also observed a clade consisting entirely of genomes from patients who contracted ZIKV in one of three Caribbean countries (the Dominican Republic, Jamaica, and Haiti) or the continental United States, containing 30 of 32 genomes from the Dominican Republic and 19 of 20 from the continental United States. We estimated the within-outbreak substitution rate to be 1.15 × 10⁻³ substitutions per site per year (95% CI (9.78 × 10⁻⁴, 1.33 × 10⁻³)), similar to prior estimates for this outbreak12. This is 1.3–5 times higher than reported rates for other flaviviruses13, but is measured over a short sampling period, and therefore may include a higher proportion of mildly deleterious mutations that have not yet been removed through purifying selection.
Determining when ZIKV arrived in specific regions helps to elucidate the spread of the outbreak and track rising incidence of possible complications of ZIKV infection. The majority of the ZIKV genomes from our study fall into four major clades from different geographic regions, for which we estimated a likely date for ZIKV arrival. In each case, the date was months earlier than the first confirmed, locally transmitted case, indicating ongoing local circulation of ZIKV before its detection. In Puerto Rico, the estimated date was 4.5 months earlier than the first confirmed local case14; it was 8 months earlier in Honduras15, 5.5 months earlier in Colombia16, and 9 months earlier for the Caribbean–continental US clade17. In each case, the arrival date represents the estimated time to the most recent common ancestor (tMRCA) for the corresponding clade in our phylogeny (Fig. 2c; see Extended Data Fig. 3 and Extended Data Table 2 for details). Similar temporal gaps between the tMRCA of local transmission chains and the earliest detected cases were seen when chikungunya virus emerged in the Americas18. We also observed evidence for several introductions of ZIKV into the continental United States, and found that sequences from mosquito and human samples collected in Florida cluster together, consistent with the finding of local ZIKV transmission in Florida in ref. 11.
Principal component analysis (PCA) is consistent with the phylogenetic observations (Fig. 2d). It shows tight clustering among ZIKV genomes from the continental United States, the Dominican Republic, and Jamaica. ZIKV genomes from Brazil and Colombia are similar and distinct from genomes sampled in other countries. ZIKV genomes from Honduras form a third cluster that also contains genomes from Guatemala or El Salvador. The PCA results show no clear stratification of ZIKV within Brazil.
Genetic variation can provide important insights into ZIKV biology and pathogenesis and can reveal potentially functional changes in the virus. We observed 1,030 mutations in the complete dataset, and they were well distributed across the genome (Fig. 3a). Any effect of these mutations cannot be determined from these data; however, the most likely candidates for functional mutations would be among the 202 nonsynonymous mutations (Supplementary Table 2) and the 32 mutations in the 5′ and 3′ untranslated regions (UTRs). Adaptive mutations are more likely to be found at high frequency or to be seen multiple times, although both effects can also occur by chance. We observed five positions with nonsynonymous mutations at more than 5% minor allele frequency that occurred on two or more branches of the tree (Fig. 3b); two of these (at positions 4,287 and 8,991) occurred together and might represent incorrect placement of a Brazil branch in the tree. The remaining three are more likely to represent multiple nonsynonymous mutations; one (at 9,240) appears to involve nonsynonymous mutations to two different alleles.
To assess the possible biological significance of these mutations, we looked for evidence of selection in the ZIKV genome. Viral surface glycoproteins are known targets of positive selection, and mutations in these proteins can confer adaptation to new vectors19 or aid immune escape20,21. We therefore searched for an excess of nonsynonymous mutations in the ZIKV envelope glycoprotein (E). However, the nonsynonymous substitution rate in E proved to be similar to that in the rest of the coding region (Fig. 3c, left); moreover, amino acid changes were significantly more conservative in that region than elsewhere (Fig. 3c, middle and right). Any diversifying selection occurring in the surface protein thus appears to be operating under selective constraint. We also found evidence for purifying selection in the ZIKV 3′ UTR (Fig. 3d, Supplementary Table 3), which is important for viral replication22.
While the transition-to-transversion ratio (6.98) was within the range seen in other viruses23, we observed a considerably higher frequency of C-to-T and T-to-C substitutions than other transitions (Fig. 3d, Extended Data Fig. 4, Supplementary Table 3). This enrichment was apparent both in the genome as a whole and at fourfold degenerate sites, where selection pressure is minimal. Many processes could contribute to this conspicuous mutation pattern, including mutational bias of the ZIKV RNA-dependent RNA polymerase, host RNA editing enzymes (for example, APOBECs, ADARs) acting upon viral RNA, and chemical deamination, but further investigation is required to determine the cause of this phenomenon.
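As an illustration of how a transition-to-transversion ratio is tallied, the following generic sketch classifies each substitution by whether it stays within the purines (A, G) or within the pyrimidines (C, T). It is not the analysis code used in this study; the function names and the input format (a list of reference/alternate base pairs) are assumptions made for the example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES

def ts_tv_ratio(substitutions):
    """Ratio of transitions to transversions over (ref, alt) pairs.

    Returns None when there are no transversions, since the ratio is
    undefined in that case.
    """
    ts = tv = 0
    for ref, alt in substitutions:
        if ref.upper() == alt.upper():
            continue  # not a substitution
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else None
```

C-to-T and T-to-C changes are both counted as transitions here, so an excess of those two classes raises the ratio without being visible in it; they would need to be tabulated separately to see the enrichment described above.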
Mismatches between PCR assays and viral sequence are a potential source of poor diagnostic performance in this outbreak24. To assess the potential influence of ongoing viral evolution on diagnostic function, we compared eight published qRT–PCR-based primer/probe sets to our data. We found numerous sites at which the probe or primer did not match an allele found among the 174 ZIKV genomes from the current dataset (Fig. 3e). In most cases, the discordant allele was shared by all outbreak samples, presumably because it was present in the Asian lineage that entered the Americas. These mismatches could affect all uses of the diagnostic assay in the outbreak. We also found mismatches from new mutations that occurred after ZIKV entry into the Americas. Most of these were present in less than 10% of samples, although one was seen in 29%. These observations suggest that genome evolution has not caused widespread degradation of diagnostic performance during the course of the outbreak, but that mutations continue to accumulate and ongoing monitoring is needed.
Analysis of within-host viral genetic diversity can reveal important information for understanding virus–host interactions and viral transmission. However, accurately identifying these variants in low-titre clinical samples is challenging, and further complicated by potential artefacts associated with enrichment before sequencing. To investigate whether we could reliably detect within-host ZIKV variants in our data, we identified within-host variants in a cultured ZIKV isolate used as a positive control throughout our study, and found that both amplicon sequencing and hybrid capture data produced concordant and replicable variant calls (Fig. 1d). In clinical and mosquito samples, hybrid capture within-host variants were noisier but contained a reliable subset: although most variants were not validated by the other sequencing method or by a technical replicate, those at high frequency were always replicable, as were those that passed a previously described filter25 (Fig. 1e, f, Extended Data Table 3). Within this high confidence set we looked for variants that were shared between samples as a clue to transmission patterns, but there were too few variants to draw any meaningful conclusions. By contrast, within-host variants identified in amplicon sequencing data were unreliable at all frequencies (Fig. 1f, Extended Data Table 3), suggesting that further technical development is needed before amplicon sequencing can be used to study within-host variation in ZIKV and other clinical samples with low viral titres.
Sequencing low-titre viruses such as ZIKV directly from clinical samples presents several challenges that are likely to have contributed to the paucity of genomes available from the current outbreak. While the development of technical and analytical methods will surely continue, we note that factors upstream in the process, including collection site and cohort, were strong predictors of sequencing success in our study (Extended Data Fig. 1). This finding highlights the importance of continuing development and implementation of best practices for sample handling, without disrupting standard clinical workflows, for wider adoption of genome surveillance during outbreaks. Additional sequencing, however challenging, remains critical to ongoing investigation of ZIKV biology and pathogenesis. Together with refs 10 and 11, this study advances both technological and collaborative strategies for genome surveillance in the face of unexpected outbreak challenges.
Methods
No statistical methods were used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment.
Ethics statement
The clinical studies from which samples were obtained were evaluated and approved by the relevant Institutional Review Boards/Ethics Review Committees at Hospital General de la Plaza de la Salud (Santo Domingo, Dominican Republic), University of the West Indies (Kingston, Jamaica), Universidad Nacional Autónoma de Honduras (Tegucigalpa, Honduras), Oswaldo Cruz Foundation (Rio de Janeiro, Brazil), Centro de Investigaciones Epidemiologicas—Universidad Industrial de Santander (Bucaramanga, Colombia), Massachusetts Department of Public Health (Jamaica Plain, Massachusetts), and Florida Department of Health (Tallahassee, Florida). Informed consent was obtained from all participants enrolled in studies at Hospital General de la Plaza de la Salud, Universidad Nacional Autónoma de Honduras, Oswaldo Cruz Foundation, and Universidad Industrial de Santander. IRBs at the University of West Indies, Massachusetts Department of Public Health, and Florida Department of Health granted waivers of consent given this research with leftover clinical diagnostic samples involved no more than minimal risk. Harvard University and Massachusetts Institute of Technology (MIT) Institutional Review Boards/Ethics Review Committees provided approval for sequencing and secondary analysis of samples collected by the aforementioned institutions.
Sample collections and study subjects
Patients with suspected ZIKV infection (including high-risk travellers) were enrolled through study protocols at multiple aforementioned collection sites. Clinical samples (including blood, urine, cerebrospinal fluid, and saliva) were obtained from suspected or confirmed ZIKV cases and from high-risk travellers. De-identified information about study participants and other sample metadata are reported in Supplementary Table 1.
Viral RNA isolation
RNA was isolated following the manufacturer’s standard operating protocol for 0.14–1-ml samples32 using the QIAamp Viral RNA Minikit (Qiagen), except that in some cases 0.1 M final concentration of β-mercaptoethanol (as a reducing agent) or 40 µg/ml final concentration of linear acrylamide (Ambion) (as a carrier) were added to AVL buffer before inactivation. Extracted RNA was resuspended in AVE buffer or nuclease-free water. In some cases, viral samples were concentrated using Vivaspin-500 centrifugal concentrators (Sigma-Aldrich) before inactivation and extraction. In these cases, 0.84 ml of sample was concentrated to 0.14 ml by passing through a 30-kDa filter and discarding the flow-through.
Carrier RNA and host rRNA depletion
In a subset of human samples, carrier poly(rA) RNA and host rRNA were depleted from RNA samples using RNase H selective depletion9,33. In brief, oligo d(T) (40 nt long) and/or DNA probes complementary to human rRNA were hybridized to the sample RNA. The sample was then treated with 15 units Hybridase (Epicentre) for 30 min at 45 °C. The complementary DNA probes were removed by treating each reaction with an RNase-free DNase (Qiagen) according to the manufacturer’s protocol. Following depletion, samples were purified using 1.8× volume AMPure RNAclean beads (Beckman Coulter Genomics) and eluted into 10 µl water for cDNA synthesis.
Illumina library construction and sequencing
cDNA synthesis was performed as described in previously published RNA-seq methods9. To track potential cross-contamination, 50 fg synthetic RNA (gift from M. Salit, NIST) was spiked into samples using unique RNA for each individual ZIKV sample. ZIKV negative control cDNA libraries were prepared from water, human K-562 total RNA (Ambion), or EBOV (KY425633.1) seed stock; ZIKV positive controls were prepared from ZIKV Senegal (isolate HD78788) or ZIKV Pernambuco (isolate PE243; KX197192.1) seed stock. The dual index Accel-NGS 2S Plus DNA Library Kit (Swift Biosciences) was used for library preparation. Approximately half of the cDNA product was used for library construction, and indexed libraries were generated using 18 cycles of PCR. Each individual sample was indexed with a unique barcode. Libraries were pooled at equal molarity and sequenced on the Illumina HiSeq 2500 or MiSeq (paired-end reads) platforms.
Amplicon-based cDNA synthesis and library construction
ZIKV amplicons were prepared as described8,11, similarly to ‘RNA jackhammering’ for preparing low-input viral samples for sequencing34, with slight modifications. After PCR amplification, each amplicon pool was quantified on a 2200 Tapestation (Agilent Technologies) using High Sensitivity D1000 ScreenTape (Agilent Technologies). Two microlitres of a 1:10 dilution of the amplicon cDNA was loaded and the concentration of the 350–550-bp fragments was calculated. The cDNA concentration, as reported by the Tapestation, was highly predictive of sequencing outcome (that is, whether a sample passed genome assembly thresholds) (Extended Data Fig. 5). cDNA from each of the two amplicon pools was mixed equally (10–25 ng each) and libraries were prepared using the dual index Accel-NGS 2S Plus DNA Library Kit (Swift Biosciences) according to the manufacturer’s protocol. Libraries were indexed with a unique barcode using seven cycles of PCR, pooled equally and sequenced on the Illumina MiSeq (250-bp paired-end reads) platform. Primer sequences were removed by hard trimming the first 30 bases for each insert read before analysis.
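The hard-trimming step at the end of this procedure can be sketched as follows. This is a minimal illustration, assuming each read is held as a pair of equal-length sequence and quality strings; `hard_trim` and `TRIM_LEN` are hypothetical names and do not correspond to the tool actually used.

```python
TRIM_LEN = 30  # bases removed from the start of each read, as stated in the text

def hard_trim(seq, qual, n=TRIM_LEN):
    """Drop the first n bases of a read together with their quality scores.

    Reads shorter than n become empty rather than raising an error.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return seq[n:], qual[n:]
```

Trimming a fixed length regardless of read content removes primer-derived bases without needing to align each read to the primer set, at the cost of discarding some genuine sequence when a primer is shorter than the trim length.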
Zika virus hybrid capture
Virus hybrid capture was performed as previously described9. Probes were created to target ZIKV and chikungunya virus (CHIKV). Candidate probes were created by tiling across publicly available sequences for ZIKV and CHIKV on NCBI GenBank35. Probes were selected from among these candidate probes to minimize the number used while maintaining coverage of the observed diversity of the viruses. Alternating universal adapters were added to allow two separate PCR amplifications, each consisting of non-overlapping probes. (To download probe sequences, see Supplementary Information.)
The probes were synthesized on a 12k array (CustomArray). The synthesized oligos were amplified by two separate emulsion PCR reactions with primers containing T7 RNA polymerase promoter. Biotinylated baits were in vitro transcribed (MEGAshortscript, Ambion) and added to prepared ZIKV libraries. The baits and libraries were hybridized overnight (~16 h), captured on streptavidin beads, washed, and re-amplified by PCR using the Illumina adaptor sequences. Capture libraries were then pooled and sequenced. In some cases, a second round of hybrid capture was performed on PCR-amplified capture libraries to further enrich the ZIKV content of sequencing libraries (Extended Data Fig. 6). In the main text, ‘hybrid capture’ refers to a combination of hybrid capture sequencing data and data from the same libraries without capture (unbiased), unless explicitly distinguished.
Genome assembly
We assembled reads from all sequencing methods into genomes using viral-ngs v1.13.3 (refs 36, 37). We taxonomically filtered reads from amplicon sequencing against a ZIKV reference, KU321639.1. We filtered reads from other approaches against the list of accessions provided in the Supplementary Information. To compute results on individual replicates, we de novo assembled these and scaffolded against KU321639.1. To obtain final genomes for analysis, we pooled data from multiple replicates of a sample, de novo assembled, and scaffolded against KX197192.1. For all assemblies, we set the viral-ngs ‘assembly_min_length_fraction_of_reference’ and ‘assembly_min_unambig’ parameters to 0.01. For amplicon sequencing data, unambiguous base calls required at least 90% of reads to agree in order to call that allele (‘major_cutoff’ = 0.9); for hybrid capture data, we used the default threshold of 50%. We modified viral-ngs so that calls to GATK’s UnifiedGenotyper set ‘min_indel_count_for_genotyping’ to 2.
At three sites with insertions or deletions (indels) in the consensus genome CDS, we corrected the genome using Sanger sequencing of the RT–PCR product (namely, at 3,447 in the genome for sample DOM_2016_BB-0085-SER; at 5,469 in BRA_2016_FC-DQ12D1-PLA; and at 6,516–6,564 in BRA_2016_FC-DQ107D1-URI, coordinates as in KX197192.1). At other indels in the consensus genome CDS, we replaced the indel with ambiguity.
Depth-of-coverage values from amplicon sequencing include read duplicates. In all other cases, we removed duplicates with viral-ngs.
Identification of non-ZIKV viruses in samples by unbiased sequencing
Using Kraken v0.10.6 (ref. 38) in viral-ngs, we built a database that included its default ‘full’ database (which incorporates all bacterial and viral whole genomes from RefSeq39 as of October 2015). Additionally, we included the whole human genome (hg38), genomes from PlasmoDB40, sequences covering mosquito genomes (Aedes aegypti, Aedes albopictus, Anopheles albimanus, Anopheles quadrimaculatus, Culex quinquefasciatus, and the outgroup Drosophila melanogaster) from GenBank35, protozoa and fungi whole genomes from RefSeq, SILVA LTP 16S rRNA sequences41, and all sequences from NCBI’s viral accession list42 (as of October 2015) for viral taxa that have human as a host. (To download the database, see Supplementary Information.)
For each sample, we ran Kraken on data from unbiased sequencing replicates (not including hybrid capture data) and searched its output reports for viral taxa with more than 100 reported reads. We manually filtered the results, removing ZIKV, bacteriophages, and known laboratory contaminants. For each sample and its associated taxa, we assembled genomes using viral-ngs as described above; the results are in Extended Data Table 1a. We used the following genomes for taxonomically filtering reads and as the reference for assembly: KJ741267.1 (cell fusing agent virus), AY292384.1 (deformed wing virus), NC_001477.1 (dengue virus type 1) and LC164349.1 (JC polyomavirus). When reporting sequence identity of an assembly to its taxon, we used BLASTN43 to determine the identity between the sequence and the reference used for its assembly.
To focus on metagenomics of mosquito pools (Extended Data Table 1b), we considered unbiased sequencing data from eight mosquito pools (not including hybrid capture data). We first ran the depletion pipeline of viral-ngs on raw data and then ran the viral-ngs Trinity44 assembly pipeline on the depleted reads to assemble them into contigs. We pooled contigs from all mosquito pool samples and identified all duplicate contigs with sequence identity >95% using CD-HIT45. Additionally, we used predicted coding sequences from Prodigal 2.6.3 (ref. 46) to identify duplicate protein sequences at >95% identity. We classified contigs using BLASTN43 against nt and BLASTX43 against nr (as of February 2017) and discarded all contigs with an E value greater than 1 × 10⁻⁴. We define viral contigs as contigs that hit a viral sequence, and we manually removed all reverse-transcriptase-like contigs owing to their similarity to retrotransposon elements within the Aedes aegypti genome. We categorized viral contigs with less than 80% amino acid identity to their best hit as likely novel viral contigs. Supplementary Table 4 lists the unique viral contigs we found, their best hit, and information scoring the hit.
Relationship between metadata and sequencing outcome
To determine whether available sample metadata are predictive of sequencing outcome, we tested the following variables: sample collection site, patient gender, patient age, sample type, and the number of days between symptom onset and sample collection (collection interval). To describe sequencing outcome of a sample S, we used the following response variable YS:
mean({ I(R) * (number of unambiguous bases in R) for all amplicon sequencing replicates R of S}), where I(R) = 1 if median depth of coverage of R ≥275 and I(R) = 0 otherwise.
This value is listed in Supplementary Table 1 under ‘Dependent variable used in regression on metadata’. We excluded the saliva, cerebrospinal fluid, and whole blood sample types owing to sample number (n = 1), and also excluded mosquito pool samples and rows with missing values. We excluded samples from one collection site (prefix JAM_2016_WI-) because most had missing values. We treated samples with type ‘Plasma EDTA’ as having type ‘Plasma’. We treated the collection interval variable as categorical (0–1, 2–3, 4–6, and 7+ days).
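The response variable defined above can be written out as a short function. This is a sketch under the assumption that each amplicon sequencing replicate is summarized by two numbers, its count of unambiguous bases and its median depth of coverage; the function and constant names are hypothetical.

```python
from statistics import mean

MIN_MEDIAN_DEPTH = 275  # depth threshold used by the indicator I(R)

def response_variable(replicates):
    """Compute Y_S for one sample.

    `replicates` is a sequence of (unambiguous_bases, median_depth)
    pairs, one per amplicon sequencing replicate R of the sample S.
    A replicate below the depth threshold contributes zero.
    """
    return mean(
        unambig if depth >= MIN_MEDIAN_DEPTH else 0
        for unambig, depth in replicates
    )
```

For example, a sample with one replicate of 10,000 unambiguous bases at 300× median depth and another of 8,000 bases at 100× has a response of 5,000, because the second replicate falls below the depth threshold and contributes zero to the mean.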
With a single model we underfit the zero counts, possibly because many zeros (samples without a replicate that passed ZIKV assembly) are truly ZIKV-negative. We thus view the data as coming from two processes: one determining whether a sample is ZIKV-positive or ZIKV-negative, and another that determines, among the observed passing samples, how much of a ZIKV genome we are able to sequence. We modelled the first process, predicting whether a sample is passing, with logistic regression (in R using GLM47 with binomial family and logit link); here, the observed passing samples are the samples S for which YS ≥2,500. For the second, we performed a beta regression, using only the observed passing samples, of YS divided by ZIKV genome length on the predictor variables. We implemented this in R using the betareg package48 and transformed fractions from the closed unit interval to the open unit interval as the authors suggest.
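Beta regression requires responses strictly between 0 and 1, so fractions equal to exactly 0 or 1 must be moved inward first. A transformation commonly recommended for this purpose, and described in the betareg documentation, is (y(n − 1) + 0.5)/n, where n is the number of observations. The sketch below assumes, without confirmation from the text, that this is the transformation meant; the function name is hypothetical.

```python
def squeeze_to_open_interval(fractions):
    """Map values in [0, 1] into (0, 1) via (y * (n - 1) + 0.5) / n.

    n is the number of observations. This is an assumed form of the
    transformation; the text does not spell it out.
    """
    n = len(fractions)
    return [(y * (n - 1) + 0.5) / n for y in fractions]
```

The adjustment shrinks as the number of observations grows, so with many passing samples the transformed fractions stay close to the observed ones.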
To test the significance of predictor variables, we used a likelihood ratio test. For variable Xi we compared a full model (with all predictors) against a model that used all predictors except Xi. The results of these tests are shown in Extended Data Fig. 1a, d. We explored the effects of sample type and collection interval on obtaining a passing assembly in Extended Data Fig. 1b, c, respectively. Error bars are 95% confidence intervals derived from binomial distributions. We explored the effects of these same two variables on YS (in passing samples only) in Extended Data Fig. 1e, f.
Criteria for pooling across replicates
We attempted to sequence one or more replicates of each sample and attempted to assemble a genome from each replicate. We discarded data from any replicates whose assembly showed high sequence similarity, in any part of the genome, to our assembly of the genome in a sample consisting of an African (Senegal) lineage (strain HD78788) of ZIKV. We used this sample as a positive control throughout this study, and considered its presence in the assembly of a clinical or mosquito pool sample to be evidence of contamination. Similarly, we discarded data from four replicates belonging to samples from the Dominican Republic because they yielded assemblies that were unexpectedly identical or highly similar to our assembly of the ZIKV isolate PE243 genome, another positive control used in this study. We also discarded data from replicates that showed evidence of contamination, at the RNA stage, by the baits used in hybrid capture; we detected these by looking for adapters that were added to these probes for amplification.
For amplicon sequencing, we considered an assembly of a replicate to be ‘passing’ if it contained at least 2,500 unambiguous base calls and had a median depth of coverage of at least 275× over its unambiguous bases (depth includes duplicate reads). For the unbiased and hybrid capture approaches, we considered an assembly of a replicate ‘passing’ if it contained at least 4,000 unambiguous base calls. For each approach, the unambiguous base threshold was based on an observed density of negative controls below the threshold (Fig. 1a). For amplicon sequencing assemblies, we added a coverage depth threshold because coverage depth was roughly binary across replicates, with negative controls falling in the lower class. On the basis of these thresholds, 0 of 99 negative controls used throughout our sequencing runs yielded passing assemblies and 32 of 32 positive controls yielded passing assemblies.
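The method-specific thresholds described above can be expressed as a single predicate. This is a minimal sketch; the method labels, function name, and constant names are assumptions chosen for the example.

```python
AMPLICON_MIN_UNAMBIG = 2500       # unambiguous base calls, amplicon sequencing
AMPLICON_MIN_MEDIAN_DEPTH = 275   # median depth over unambiguous bases
CAPTURE_MIN_UNAMBIG = 4000        # unambiguous base calls, unbiased/hybrid capture

def is_passing(method, unambig_bases, median_depth=None):
    """Return True if a replicate assembly meets its method's thresholds."""
    if method == "amplicon":
        if median_depth is None:
            raise ValueError("amplicon assemblies need a median depth")
        return (unambig_bases >= AMPLICON_MIN_UNAMBIG
                and median_depth >= AMPLICON_MIN_MEDIAN_DEPTH)
    if method in ("unbiased", "hybrid_capture"):
        return unambig_bases >= CAPTURE_MIN_UNAMBIG
    raise ValueError(f"unknown method: {method!r}")
```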
We considered a sample to have a passing assembly if any of its replicates, by either method, yielded an assembly that passed the above thresholds. For each sample with at least one passing assembly, we pooled read data across replicates for each sample, including replicates with assemblies that did not pass the assembly thresholds. When data were available from both amplicon sequencing and unbiased/hybrid capture approaches, we pooled amplicon sequencing data separately from data produced by the unbiased and hybrid capture approaches, the latter two of which were pooled together (henceforth, the ‘hybrid capture’ pool). We then assembled a genome from each set of pooled data. When assemblies on pooled data were available from both approaches, we selected for downstream analysis the assembly from the hybrid capture approach if it had at least 10,267 unambiguous base calls (95% of the reference genome used, GenBank accession KX197192.1); when this condition was not met, we selected the one that had more unambiguous base calls.
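The rule for choosing between the two pooled assemblies of a sample can be sketched as below. The function name and return labels are hypothetical, and the handling of a tie in unambiguous base counts (resolved here in favour of amplicon sequencing) is an assumption, since the text does not specify it.

```python
CAPTURE_PREFERRED_MIN_UNAMBIG = 10267  # 95% of reference KX197192.1, per the text

def select_assembly(amplicon_unambig, capture_unambig):
    """Pick which pooled assembly goes forward to downstream analysis.

    Hybrid capture is preferred whenever it is nearly complete;
    otherwise the assembly with more unambiguous base calls wins.
    """
    if capture_unambig >= CAPTURE_PREFERRED_MIN_UNAMBIG:
        return "hybrid_capture"
    if capture_unambig > amplicon_unambig:
        return "hybrid_capture"
    return "amplicon"  # includes ties (assumption)
```

Under this rule a hybrid capture assembly with 10,300 unambiguous bases is selected even when the amplicon assembly has more, whereas one with 9,000 bases is selected only if the amplicon assembly has fewer than 9,000.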
The number of ZIKV genomes publicly available before this study was the result of an NCBI GenBank35 search for ZIKV in February 2017. We filtered out any sequences with length <4,000 nt, excluded sequences that are being published as part of this study or in refs 10, 11, excluded sequences from non-human hosts, and excluded sequences labelled as having been passaged. We counted fewer than 100 sequences, the precise number depending on details of the count.
Visualization of coverage depth across genomes
For amplicon sequencing data, we plotted coverage across the 110 samples that yielded a passing assembly by amplicon sequencing (Fig. 1b). With viral-ngs, we aligned depleted reads to the reference sequence KX197192.1 using the novoalign aligner with options ‘-r Random -l 40 -g 40 -x 20 -t 100 -k’. Because of the nature of amplicon sequencing, duplicates were not identified or removed. We binarized depth at each nucleotide position, showing red if depth of coverage was at least 100×. Rows (samples) are hierarchically clustered to ease visualization.
For hybrid capture sequencing data, we plotted depth of coverage across the 37 samples that yielded a passing assembly (Fig. 1c). We aligned reads as described above for amplicon sequencing data, except we removed duplicates. For each sample, we calculated the depth of coverage at each nucleotide position. We then scaled the values for each sample so that each would have a mean depth of 1.0. At each nucleotide position, we calculated the median depth across the samples, as well as the 20th and 80th percentiles. We plotted the mean of each of these metrics within a 200-nt sliding window.
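The normalization and summary steps for the hybrid capture coverage plot can be sketched with the standard library alone. This is an illustration under stated assumptions: the helper names are hypothetical, the percentile uses linear interpolation (the text does not say which definition was used), and the sliding window is evaluated only where a full window fits.

```python
from statistics import mean, median


def normalize(depths):
    """Scale one sample's per-position depths to a mean of 1.0."""
    m = mean(depths)
    return [d / m for d in depths]


def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100] (assumed definition)."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def sliding_mean(values, window):
    """Mean of each full run of `window` consecutive positions."""
    return [mean(values[i:i + window])
            for i in range(len(values) - window + 1)]


def coverage_summary(depth_by_sample, window=200):
    """Return smoothed (median, 20th percentile, 80th percentile) tracks."""
    scaled = [normalize(d) for d in depth_by_sample]
    per_position = list(zip(*scaled))  # one tuple of samples per position
    med = [median(col) for col in per_position]
    p20 = [percentile(col, 20) for col in per_position]
    p80 = [percentile(col, 80) for col in per_position]
    return tuple(sliding_mean(track, window) for track in (med, p20, p80))
```

Scaling each sample before taking medians keeps deeply sequenced samples from dominating the shape of the curve.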
Multiple sequence alignments
We aligned ZIKV consensus genomes using MAFFT v7.221 (ref. 49) with the following parameters: ‘--maxiterate 1000 --ep 0.123 --localpair’.
In Supplementary Data, we provide sequences and alignments used in analyses.
Analysis of within- and between-sample variants
To measure overall per-base discordance between consensus genomes produced by amplicon sequencing and hybrid capture, we considered all sites at which base calls were made in both the amplicon sequencing and hybrid capture consensus genomes of a sample, and we calculated the fraction in which the bases were not in agreement. To measure discordance at polymorphic sites, we searched for positions with a polymorphism in all genomes generated in this study that we selected for downstream analysis (see ‘Criteria for pooling across replicates’ for choosing among the amplicon sequencing and hybrid capture genome when both are available). We then looked at these positions in genomes that were available from both methods, and we calculated the fraction in which the alleles were not in agreement.
To measure discordance at minor alleles, we searched for minor alleles in all genomes generated in this study that we selected for downstream analysis. We then looked at all sites at which there was a minor allele and for which genomes from both methods were available, and we calculated the fraction in which the alleles were not in agreement. For these calculations, we tolerated partial ambiguity (for example, ‘Y’ is concordant with ‘T’). If one genome had full ambiguity (‘N’) at a position and the other genome had an indel, we counted the site as discordant; otherwise, if one genome had full ambiguity, we did not count the site.
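The site-counting rules above can be sketched as follows. The IUPAC table is standard; the representation of an indel as `-` in aligned sequences, the treatment of two overlapping partial ambiguity codes as concordant, and the function names are assumptions made for illustration.

```python
# Standard IUPAC nucleotide codes (excluding N, which is handled separately).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def site_status(a, b):
    """Return 'concordant', 'discordant', or None (site is not counted)."""
    a, b = a.upper(), b.upper()
    if a == "N" or b == "N":
        other = b if a == "N" else a
        # Full ambiguity against an indel counts as discordant;
        # any other site with full ambiguity is skipped.
        return "discordant" if other == "-" else None
    if a == "-" or b == "-":
        return "concordant" if a == b else "discordant"
    # Partial ambiguity is tolerated: 'Y' (C or T) is concordant with 'T'.
    return "concordant" if set(IUPAC[a]) & set(IUPAC[b]) else "discordant"


def discordance(seq_a, seq_b):
    """Fraction of counted sites at which two aligned genomes disagree."""
    statuses = [site_status(a, b) for a, b in zip(seq_a, seq_b)]
    counted = [s for s in statuses if s is not None]
    if not counted:
        return float("nan")
    return counted.count("discordant") / len(counted)
```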
After assembling genomes, we identified within-sample variants by running V-Phaser 2.0 via viral-ngs37 on all pooled reads mapping to each sample assembly. When determining per-library allele counts at each variant position, we modified viral-ngs to require a minimum base (Phred) quality score of 30 for all bases, discard anomalous read pairs, and use per-base alignment quality (BAQ) in its calls to SAMtools50 mpileup. This is particularly helpful for filtering spurious amplicon sequencing variants because all generated reads start and end at a limited number of positions (owing to the pre-determined tiling of amplicons across the genome). Because amplicon sequencing libraries were sequenced using 250-bp paired-end reads, bases near the middle of the ~450-nt amplicons fall at the end of both paired reads, where quality scores drop and incorrect base calls are more likely. To determine the overall frequency of each variant in a sample, we summed allele counts (calculated using SAMtools50 mpileup via viral-ngs) across libraries.
When comparing variant frequencies between amplicon sequencing (seven technical replicates) and hybrid capture (seven technical replicates) replicates of the PE243 positive control (Fig. 1d), we included only positions at which the mean (pooled) frequency across replicates within at least one method was ≥1%. When comparing allele frequencies between replicate libraries, we restricted the sample set to only samples with a passing assembly in both methods, and included only samples with two or more replicates. By contrast, when comparing alleles across methods, we included samples that have a passing assembly by either method, with any number of replicates. For these comparisons, we included only positions with a minor variant; that is, positions for which both libraries/methods had an allele at 100% were removed, even if the single allele differed between the two libraries/methods. Additionally, we considered any allele with frequency <1% as not found (0%).
When comparing allele frequencies across methods: let fa and fhc be frequencies in amplicon sequencing and hybrid capture, respectively. If both are non-zero, we included an allele only if the read depth at its position was ≥1/min(fa, fhc) in both methods, and if depth at the position was at least 100× for hybrid capture and 275× for amplicon sequencing. If fa = 0, we required a read depth of max(1/fhc, 275) at the position in the amplicon sequencing method; similarly, if fhc = 0 we required a read depth of max(1/fa, 100) at the position in the hybrid capture method. This was to eliminate lack of coverage as a reason for discrepancy between two methods. When comparing allele frequencies across sequencing replicates within a method, we imposed only a minimum read depth (275× for amplicon sequencing and 100× for hybrid capture), but required this depth in both libraries. In samples with more than two replicates, we considered only the two replicates with the highest depth at each variant position.
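The cross-method depth filter can be written as one predicate. This is a sketch, assuming frequencies below 1% have already been set to 0 as described earlier; the function and argument names are hypothetical.

```python
MIN_DEPTH_AMPLICON = 275  # minimum depth for amplicon sequencing
MIN_DEPTH_CAPTURE = 100   # minimum depth for hybrid capture


def include_allele(f_a, f_hc, depth_a, depth_hc):
    """Decide whether an allele enters the cross-method comparison.

    f_a, f_hc: allele frequency (0 to 1) in amplicon sequencing and hybrid
    capture; depth_a, depth_hc: read depth at the position in each method.
    """
    if f_a == 0 and f_hc == 0:
        return False  # nothing to compare
    if f_a > 0 and f_hc > 0:
        # Depth must be enough to observe the rarer of the two frequencies,
        # and must also meet each method's minimum depth.
        needed = 1 / min(f_a, f_hc)
        return (depth_a >= max(needed, MIN_DEPTH_AMPLICON)
                and depth_hc >= max(needed, MIN_DEPTH_CAPTURE))
    if f_a == 0:
        # Absent from amplicon data: require enough amplicon depth to have
        # seen an allele at the hybrid capture frequency.
        return depth_a >= max(1 / f_hc, MIN_DEPTH_AMPLICON)
    # Absent from hybrid capture data: the symmetric requirement.
    return depth_hc >= max(1 / f_a, MIN_DEPTH_CAPTURE)
```

The `1 / f` term is the depth at which an allele of frequency `f` is expected to appear in one read, so a zero in the other method is unlikely to reflect low coverage alone.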
We considered allele frequencies from hybrid capture sequencing ‘verified’ if they passed the strand bias and frequency filters described in ref. 25, with the exception that we imposed a minimum allele frequency of 1% and allowed a variant identified in only one library if its frequency was ≥5%. In Fig. 1f and Extended Data Table 3, we considered variants ‘validated’ if they were present at ≥1% frequency in both libraries or methods. When comparing two libraries for a given method M (amplicon sequencing or hybrid capture): the proportion unvalidated is the fraction, among all variants in M at ≥1% frequency in at least one library, of the variants that are at ≥1% frequency in exactly one of the two libraries. Similarly, when comparing methods: the proportion unvalidated for a method M is the fraction, among all variants at ≥1% frequency in M, of the variants that are at ≥1% frequency in M and <1% frequency in the other method.
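The two definitions of "proportion unvalidated" can be made concrete with a short sketch. Variants are represented here as dictionary keys mapping to frequencies; that representation and the function names are assumptions for illustration.

```python
THRESHOLD = 0.01  # 1% minimum frequency, as in the text


def _present(freqs, variant):
    """True if the variant is at or above the threshold in this library."""
    return freqs.get(variant, 0.0) >= THRESHOLD


def unvalidated_between_libraries(lib1, lib2):
    """Among variants at >=1% in at least one library, the fraction that
    are at >=1% in exactly one of the two libraries."""
    variants = set(lib1) | set(lib2)
    considered = [v for v in variants if _present(lib1, v) or _present(lib2, v)]
    if not considered:
        return float("nan")
    exactly_one = [v for v in considered if _present(lib1, v) != _present(lib2, v)]
    return len(exactly_one) / len(considered)


def unvalidated_between_methods(method, other):
    """Among variants at >=1% in `method`, the fraction that are below 1%
    in the other method."""
    in_method = [v for v in method if _present(method, v)]
    if not in_method:
        return float("nan")
    missing = [v for v in in_method if not _present(other, v)]
    return len(missing) / len(in_method)
```

Note that the between-method proportion is asymmetric: it is computed once per method, which is why Extended Data Table 3a reports a separate value for each.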
We called SNPs on the aligned genomes using Geneious version 9.1.7 (ref. 51). We converted all fully or partially ambiguous calls, which are treated by Geneious as variants, into missing data. We then removed all sites that were no longer polymorphic from the SNP set and re-calculated allele frequencies. A nonsynonymous mutation is shown on the tree (Fig. 3b) if it includes an allele that is nonsynonymous relative to the ancestral state (see ‘Molecular clock phylogenetics and ancestral state reconstruction’ section below) and has a minor allele frequency of >5%; all occurrences of nonsynonymous alleles are shown. (Two mutations, at positions 2,853 and 7,229, had nominal derived allele frequencies over 95%; in both cases, the ‘ancestral’ allele was seen only in a small clade within the tree, suggesting that the ancestral allele was incorrectly assigned. These are not shown.) We placed mutations at a node such that the node leads only to samples with the mutation or with no call at that site. Uncertainty in placement occurs when a sample lacks a base call for the corresponding mutation; in this case, we placed the mutation on the most recent branch for which we have available data. We also used this ancestral ZIKV state to count the frequency of each type of substitution over various regions of the ZIKV genome, per number of available bases in each region (Fig. 3d and Supplementary Table 3).
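The masking and re-filtering of SNP calls can be sketched independently of Geneious. This is a minimal illustration, assuming the alignment is a list of equal-length strings and that gaps are treated as missing data along with ambiguity codes; the function name is hypothetical.

```python
from collections import Counter

UNAMBIGUOUS = set("ACGT")


def polymorphic_sites(alignment):
    """Return {1-based position: {allele: frequency}} for sites that remain
    polymorphic after ambiguous calls are converted to missing data."""
    sites = {}
    for pos, column in enumerate(zip(*alignment), start=1):
        # Keep only unambiguous calls; everything else is missing data.
        counts = Counter(b for b in (c.upper() for c in column)
                         if b in UNAMBIGUOUS)
        if len(counts) < 2:
            continue  # no longer polymorphic once ambiguity is masked
        total = sum(counts.values())
        sites[pos] = {allele: n / total for allele, n in counts.items()}
    return sites
```

In the toy alignment `["ACGT", "ACAT", "AYAN", "ACGT"]`, position 2 looks variable only because of the `Y` call and is dropped, while position 3 is retained with both alleles at 50%.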
We quantified the effect of nonsynonymous mutations using the original BLOSUM62 scoring matrix for amino acids52, in which positive scores indicate conservative amino acid changes and negative scores indicate unlikely or extreme substitutions. We assessed statistical significance for equality of proportions by χ² test (Fig. 3c, middle), and for difference of means by two-sample t-test with Welch–Satterthwaite approximation of d.f. (Fig. 3c, right). Error bars are 95% confidence intervals derived from binomial distributions (Fig. 3c, left and middle; Fig. 3d) or Student’s t distributions (Fig. 3c, right).
Maximum likelihood estimation and root-to-tip regression
We generated a maximum likelihood tree using a multiple sequence alignment that included genomes generated in this study, as well as a selection of other available sequences from the Americas, Southeast Asia, and the Pacific. The sequences are listed in Supplementary Information. We ran PhyML53 with the GTR substitution model and 4 gamma substitution rate categories; for the tree search operation, we used ‘BEST’ (best of NNI and SPR). In FigTree v1.4.2 (ref. 54), we rooted the tree on the oldest sequence used as input (GenBank accession EU545988.1).
We used TempEst v1.5 (ref. 55), which selects the best-fitting root with a residual mean squared function, to estimate root-to-tip distances. We performed regression in R with the lm function47 of distances on dates. The relationship between root-to-tip divergence and sample dates (Extended Data Fig. 2) supports the use of a molecular clock analysis in this study.
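The regression of root-to-tip distance on sampling date is an ordinary least-squares fit. The sketch below is a plain Python equivalent of that step, not the R code used in the study; the function name is hypothetical.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance on sampling date.

    dates: decimal years; distances: substitutions per site.
    Returns (slope, intercept, x_intercept). The slope is a rough estimate
    of the substitution rate per site per year, and the x-intercept is a
    rough estimate of the date of the root.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    x_intercept = -intercept / slope
    return slope, intercept, x_intercept
```

A clearly positive slope, as reported in Extended Data Fig. 2, is what indicates enough temporal signal for a molecular clock analysis.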
In Supplementary Data, we provide the output of PhyML, as well as the dates and distances used for root-to-tip regression.
Molecular clock phylogenetics and ancestral state reconstruction
For molecular clock phylogenetics, we made a multiple sequence alignment from the genomes generated in this study combined with a selection of other available sequences from the Americas. We did not use sequences from outside the outbreak in the Americas. Among ZIKV genomes published and publicly available on NCBI GenBank35, we selected 32 from the Americas that had at least 7,000 unambiguous bases, were not labelled as having been passaged more than once, and had location metadata. We also used 32 genomes from Brazil published in ref. 10 that met the same criteria. The sequences are listed in Supplementary Information.
We used BEAST v1.8.4 to perform molecular clock analyses56. We used sampled tip dates to handle inexact dates57. Because of sparse data in non-coding regions, we used only the CDS as input. We used the SRD06 substitution model on the CDS, which uses HKY with gamma site heterogeneity and partitions codons into two partitions (positions (1 + 2) and 3)58. To perform model selection, we tested three coalescent tree priors: a constant-size population, an exponential growth population, and a Bayesian Skyline tree prior (ten groups, piecewise-constant model)59. For each tree prior, we tested two clock models: a strict clock and an uncorrelated relaxed clock with log-normal distribution (UCLN)60. In each case, we set the molecular clock rate to use a continuous time Markov chain rate reference prior61. For all six combinations of models, we performed path-sampling (PS) and stepping-stone sampling (SS) to estimate marginal likelihood62,63. We sampled for 100 path steps with a chain length of 1 million, with power posteriors determined from evenly spaced quantiles of a Beta(α = 0.3, β = 1.0) distribution. The Skyline tree prior provided a better fit than the two other (baseline) tree priors (Extended Data Table 2), so we used this tree prior for all further analyses. Using a constant or exponential tree prior, a relaxed clock provides a better model fit, as shown by the log Bayes factor when comparing the two clock models. Using a Skyline tree prior, the log Bayes factor comparing a strict and relaxed clock is smaller than it is using the other tree priors, and it is similar to the variability between estimated log marginal likelihood from PS and SS methods. We chose to use a relaxed clock for further analyses, but we also report key findings using a strict clock.
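The log Bayes factor of a model against the baseline is the difference of their log marginal likelihoods. The sketch below applies this to the path-sampling values from Extended Data Table 2a; the dictionary and function names are illustrative.

```python
# Path-sampling log marginal likelihoods, copied from Extended Data Table 2a.
PS_LOG_ML = {
    "Skyline relaxed": -24952,
    "Skyline strict": -24950,
    "Exponential relaxed": -24974,
    "Exponential strict": -24989,
    "Constant relaxed": -25007,
    "Constant strict": -25026,
}


def log_bayes_factors(log_ml, baseline="Constant strict"):
    """log BF(model vs baseline) = log ML(model) - log ML(baseline)."""
    base = log_ml[baseline]
    return {model: value - base
            for model, value in log_ml.items() if model != baseline}


bf = log_bayes_factors(PS_LOG_ML)
# Skyline relaxed vs Skyline strict, the comparison discussed in the text.
skyline_clock_difference = bf["Skyline relaxed"] - bf["Skyline strict"]
```

Values recomputed this way from the rounded table entries can differ by about one unit from the tabulated Bayes factors, presumably because those were calculated before rounding.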
For the tree and tMRCA estimates in Fig. 2, as well as the clock rate reported in main text, we ran BEAST with 400 million MCMC steps using the SRD06 substitution model, Skyline tree prior, and relaxed clock model. We extracted clock rate and tMRCA estimates, and their distributions, with Tracer v1.6.0 and identified the maximum clade credibility (MCC) tree using TreeAnnotator v1.8.4. We visualised the tree in FigTree v1.4.2 (ref. 54). The reported credible intervals around estimates are 95% highest posterior density (HPD) intervals. When reporting substitution rate from a relaxed clock model, we give the mean rate (mean of the rates of each branch weighted by the time length of the branch). Additionally, for the tMRCA estimates in Fig. 2c with a strict clock, we ran BEAST with the same specifications (also with 400M steps) except using a strict clock model. The resulting data are also used in the more comprehensive comparison shown in Extended Data Fig. 3.
For the data with an outgroup in Extended Data Fig. 3, we ran BEAST as specified above (with strict and relaxed clock models), except with 100 million steps and with outgroup sequences in the input alignment. The outgroup sequences were the same as those used to make the maximum likelihood tree (see Supplementary Information). For the data excluding sample DOM_2016_MA-WGS16-020-SER in Extended Data Fig. 3, we ran BEAST as specified above (with strict and relaxed clocks), except we removed the sequence of this sample from the input and ran 100 million steps.
We used BEAST v1.8.4 to estimate transition and transversion rates within the CDS and non-coding regions. The model was the same as above except that we used the Yang96 substitution model on the CDS, which uses GTR with gamma site heterogeneity and partitions codons into three partitions64; for the non-coding regions, we used a GTR substitution model with gamma site heterogeneity and no codon partitioning. There were four partitions in total: one for each codon position and another for the non-coding region (5′ and 3′ UTRs combined). We ran this for 200 million steps. At each sampled step of the MCMC, we calculated substitution rates for each partition using the overall substitution rate, the relative substitution rate of the partition, the relative rates of substitutions in the partition, and base frequencies. In Extended Data Fig. 4, we plot the means of these rates over the steps; the error bars shown are 95% HPD intervals of the rates over the steps.
We used BEAST v1.8.4 to reconstruct ancestral state at the root of the tree using CDS and non-coding regions. The model was the same as above except that, on the CDS, we used the HKY substitution model with gamma site heterogeneity and codons partitioned into three partitions (one per codon position). On the non-coding regions we used the same substitution model without codon partitioning. We ran this for 50 million steps and used TreeAnnotator v1.8.4 to find the state with the MCC tree. We selected the ancestral state corresponding to this state.
In all BEAST runs, we discarded the first 10% of states from each run as burn-in.
In Supplementary Data, we provide BEAST input (XML) and output files. We also provide the sequence of the reconstructed ancestral state.
Principal component analysis
We carried out principal component analysis using the R package FactoMineR65. We imputed missing data with the package missMDA66 and we show the results in Fig. 2d.
Diagnostic assay assessment
We extracted primer and probe sequences from eight published RT–qPCR assays26–31 and aligned them to our ZIKV genomes using Geneious version 9.1.7 (ref. 51). We then tabulated matches and mismatches to the diagnostic sequence for all outbreak genomes, allowing multiple bases to match where the diagnostic primer and/or probe sequence contained nucleotide ambiguity codes (Fig. 3e).
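Counting mismatches between a diagnostic oligonucleotide and a genome, while letting ambiguity codes match any of the bases they represent, can be sketched as follows. The function name is hypothetical; the sketch assumes the oligo and target are already aligned, equal in length, and on the same strand, and that target bases are unambiguous.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(oligo, target):
    """Count positions where the target base is not allowed by the oligo.

    An ambiguity code in the oligo (for example 'R' = A or G) matches any
    base it encodes. A target base that is not A, C, G, or T is counted
    as a mismatch here (an assumption of this sketch).
    """
    if len(oligo) != len(target):
        raise ValueError("oligo and target must be the same length")
    return sum(1 for o, t in zip(oligo.upper(), target.upper())
               if t not in IUPAC.get(o, ""))
```

For a reverse primer, the target region would first need to be reverse-complemented, which this sketch does not do.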
Extended Data
Extended Data Table 1.
a

Species | Sample | # reads from species (% of total) | % genome unambiguous |
---|---|---|---|
Cell fusing agent virus | USA_2016_FL-01-MOS | 5662 (0.02%) | 99.1% |
Cell fusing agent virus | USA_2016_FL-04-MOS | 1588 (0.003%) | 91.1% |
Cell fusing agent virus | USA_2016_FL-05-MOS | 9614 (0.02%) | 99.9% |
Cell fusing agent virus | USA_2016_FL-06-MOS | 2646 (0.007%) | 82.2% |
Cell fusing agent virus | USA_2016_FL-08-MOS | 13608 (0.008%) | 99.4% |
Deformed wing virus-like | USA_2016_FL-06-MOS | 6580 (0.02%) | 8.34% |
Dengue virus type 1 | BLM_2016_MA-WGS16-006-SER | 2355926 (2.6%) | 99.8% |
JC polyomavirus | BRA_2016_FC-DQ75D1-URI | 8050 (0.20%) | 99.2% |
JC polyomavirus-like | USA_2016_FL-032-URI | 316 (0.001%) | 7.71% |
b

Sample | Total contigs | Classified contigs (all) | Classified contigs (viral) | Likely novel viral contigs |
---|---|---|---|---|
USA_2016_FL-01-MOS | 496 | 431 | 45 | 25 |
USA_2016_FL-02-MOS | 563 | 463 | 17 | 14 |
USA_2016_FL-03-MOS | 164 | 133 | 29 | 22 |
USA_2016_FL-04-MOS | 679 | 492 | 25 | 19 |
USA_2016_FL-05-MOS | 355 | 313 | 25 | 8 |
USA_2016_FL-06-MOS | 726 | 635 | 26 | 14 |
USA_2016_FL-07-MOS | 5967 | 5650 | 5 | 2 |
USA_2016_FL-08-MOS | 1679 | 1528 | 39 | 27 |
All pools: unique | 9013 | 8426 | 84 | 41 |
a, Viral species other than Zika were found by unbiased sequencing of 38 samples. Column 3, number of reads in a sample belonging to a species as a raw count and a per cent of total reads. Column 4, per cent genome assembled based on the number of unambiguous bases called. We identified cell fusing agent virus (a flavivirus) and deformed wing virus-like genomes in mosquito pools, and dengue virus type 1, JC polyomavirus, and JC polyomavirus-like genomes in clinical samples. All assemblies had ≥ 95% sequence identity to a reference sequence for the listed species, except cell fusing agent virus in USA_2016_FL-06-MOS (91%) and dengue virus type 1 in BLM_2016_MA-WGS16-006-SER (92%). The dengue virus type 1 genome showed ≥ 95% sequence identity to other available isolates of the virus.
b, Contigs assembled from unbiased sequencing data of eight mosquito pools. Column 2, number of contigs assembled. Column 3, number of contigs classified by BLASTN/BLASTX43. Column 4, number of contigs hitting a viral species. Column 5, number of contigs hitting a viral species with < 80% amino acid identity to the best hit. Each column is a subset of the previous column. Contigs in column 5 are considered likely to be novel. Last row lists counts, after removing duplicate contigs, for all mosquito pools combined. Supplementary Table 4 lists the unique viral contigs and their best hits.
Extended Data Table 2.
a

Estimator | Quantity | Skyline Relaxed | Skyline Strict | Exponential Relaxed | Exponential Strict | Constant Relaxed | Constant Strict |
---|---|---|---|---|---|---|---|
PS | log(marginal likelihood) | −24952 | −24950 | −24974 | −24989 | −25007 | −25026 |
PS | log(Bayes factor) | 74 | 76 | 53 | 38 | 20 | — |
SS | log(marginal likelihood) | −24957 | −24954 | −24976 | −24990 | −25010 | −25030 |
SS | log(Bayes factor) | 73 | 77 | 54 | 40 | 20 | — |
b

Parameter | Skyline Relaxed | Skyline Strict | Exponential Relaxed | Exponential Strict | Constant Relaxed | Constant Strict |
---|---|---|---|---|---|---|
Clock rate | 1.15E-03 [9.78E-04, 1.33E-03] | 1.09E-03 [9.32E-04, 1.25E-03] | 1.06E-03 [8.38E-04, 1.29E-03] | 9.42E-04 [7.42E-04, 1.14E-03] | 1.41E-03 [1.15E-03, 1.69E-03] | 1.18E-03 [9.97E-04, 1.36E-03] |
tMRCA: all | 2014.129 [2013.621, 2014.552] | 2013.981 [2013.531, 2014.417] | 2013.498 [2012.772, 2014.175] | 2013.401 [2012.724, 2014.028] | 2013.752 [2012.897, 2014.405] | 2013.806 [2013.349, 2014.241] |
tMRCA: Puerto Rico | 2015.632 [2015.376, 2015.849] | 2015.600 [2015.369, 2015.816] | 2015.599 [2015.314, 2015.900] | 2015.530 [2015.231, 2015.832] | 2015.796 [2015.533, 2016.039] | 2015.714 [2015.491, 2015.951] |
tMRCA: Honduras | 2015.300 [2014.928, 2015.594] | 2015.241 [2014.888, 2015.512] | 2015.197 [2014.850, 2015.524] | 2015.066 [2014.684, 2015.392] | 2015.527 [2015.206, 2015.834] | 2015.334 [2015.049, 2015.599] |
tMRCA: Colombia | 2015.333 [2015.088, 2015.567] | 2015.283 [2015.060, 2015.496] | 2015.246 [2014.989, 2015.472] | 2015.153 [2014.873, 2015.398] | 2015.411 [2015.201, 2015.636] | 2015.306 [2015.096, 2015.503] |
tMRCA: Caribbean | 2015.289 [2014.933, 2015.628] | 2015.242 [2014.876, 2015.578] | 2015.140 [2014.798, 2015.465] | 2015.007 [2014.623, 2015.373] | 2015.412 [2015.073, 2015.754] | 2015.278 [2014.952, 2015.605] |
a, Marginal likelihoods calculated with path-sampling (PS) and stepping-stone sampling (SS) for combinations of three coalescent tree priors (constant size population, exponential growth population, and Skyline) and two clock models (strict clock and uncorrelated relaxed clock with log-normal distribution). The Bayes factor is calculated against the baseline model, a constant size tree prior and strict clock.
b, Mean estimates and 95% credible intervals across evaluated models for the clock rate, date of tree root, and tMRCAs of the four regions shown in Fig. 2c. Under a Skyline tree prior, the use of strict and relaxed clock models yields similar estimates.
Extended Data Table 3.
a

Method | % unvalidated by other method |
---|---|
Amplicon sequencing | 87.3% (n = 126) |
Hybrid capture | 85.8% (n = 113) |
Hybrid capture, verified | 25.0% (n = 20) |
b

Method | % unvalidated in replicate: all variants | % unvalidated in replicate: variants passing strand bias filter |
---|---|---|
Amplicon sequencing | 92.7% (n = 304) | 66.7% (n = 3) |
Hybrid capture | 74.5% (n = 98) | 0.00% (n = 8) |
a, For each method (amplicon sequencing or hybrid capture), fraction of identified variants (≥ 1%) not identified at ≥ 1% by the other method (that is, unvalidated). ‘Verified’ hybrid capture variants are those passing strand bias and frequency filters, as described in Methods.
b, For each method, the fraction of identified variants unvalidated in a second library. To pass the strand bias filter, a variant must meet filter criteria in both replicates.
Supplementary Material
Acknowledgments
We thank M. and L. Benioff for their vision and support; L. Brown, E. Lee, M. Giovanni, J. Levin-Allerhand and E. S. Lander for support and guidance; M. Schleicher, E. Lipscomb, A. Felix, A. Saltzman, and S. Donnelly for assistance with IRB and ethics processes; E. Mair, L. Nogelo and E. Carmean for legal counsel; T. Mason and the Broad Institute Genomics Platform for sequencing support; A. Matthews, S. Chapman, D. Neafsey, and B. Birren for management and guidance; O. Pybus and ZiBRA Project colleagues for sharing data before publication; D. Olson, E. Asturias, M. Salit, and E. Simon-Loriere for sharing samples and reagents; and E. Holmes, G. Bello, R. Tewhey, A. Piantadosi, C. Edwards and the Sabeti Laboratory for discussions and reading of the manuscript. We are indebted to Zika patients and clinical teams for making this work possible. Funding was provided by: Marc and Lynne Benioff (P.C.S.); NIH NIAID U19AI110818 (Broad Institute); Howard Hughes Medical Institute (P.C.S.); Harvard University Burke Global Health Fellowship (P.C.S.); Broad Institute BroadNext10 program (A.G. and P.C.S.); AWS Cloud Credits for Research (P.C.S.); Conselho Nacional de Desenvolvimento Científico e Tecnológico (440909/2016-3) and Fundação de Amparo a Pesquisa do Estado do Rio de Janeiro (E-26/201.320/2016, E-26/201.332/2016, E-26/010.000194/2015) (P.T.B. and F.A.B.); NIH NIAID 1R01AI099210 (S.I. and S.F.M.); MIDAS-National Institute of General Medical Sciences U54GM111274 (M.E.H. and D.P.R.); NIH NIAID AI100190 (I.B. and L.G.); AEDES Network (I.B.) and Colombian Science, Technology and Innovation Fund of Sistema General de Regalías-BPIN 2013000100011 (L.V., R.M.G.R., M.C.M.M., and I.B.); ASTMH Shope Fellowship (K.G.B.); NSF DGE 1144152 (A.E.L.); PNPD/CAPES Postdoctoral Fellowship (E.D.); Fulbright-Colciencias Doctoral Scholarship (D.P.R.); NIH training grant 5T32AI007244-33 (N.D.G.); EU under grant agreements 278433-PREDEMICS and 643476-COMPARE (A.R.); and NIH NCATS CTSA UL1TR001114, NIH NIAID contract HHSN272201400048C, The Ray Thomas Foundation, and Pew Biomedical Scholarship (K.G.A.).
Footnotes
Online Content Methods, along with any additional Extended Data display items and Source Data, are available in the online version of the paper; references unique to these sections appear only in the online paper.
Supplementary Information is available in the online version of the paper.
Author Contributions C.B.M., S.W., C.A.F., S.M.W., K.W., J.Q., M.L.B., A.G.-Y., C.Y.L., R.R.S., G.B.-L., Y.R.V., L.M.P., A.L.T., C.M.Ba., M.C.P., C.Vas., A.C.C., M.R.C., K.N.H., E.W.K., J.J.A., K.F.G., L.A.P., R.M.G.R., M.C.M.M., C.M.Br., S.H., B.S., S.Sc., K.G., G.O., R.R.-S., and I.B. performed laboratory experiments and prepared samples for sequencing. H.C.M., C.B.M., C.A.F., S.M.W., K.W., J.Q., M.L.B., C.Y.L., A.G.-Y., N.D.G, A.G., and K.G.A. developed methods for ZIKV detection, targeted enrichment, and/or sequencing library preparation. H.C.M., C.B.M., S.W., S.F.S., M.L.B., A.E.L., C.H.T.-T., S.H.Y., D.J.P., E.D., A.R., T.M.L.S., I.B., and B.L.M. performed sequence assembly, curation, and/or data analyses. S.Sm., L.V., S.M., I.L., S.I., S.F.M., and F.A.B. led clinical studies and/or study sites. K.G.B., B.C., D.P.R., N.D.G., L.G., M.E.H., A.R., A.G., J.C.-N., C.Val., W.D., P.T.B., A.G., K.G.A., S.I., S.F.M., F.A.B., T.M.L.S., and I.B. provided critical insights and guidance. H.C.M., C.B.M., T.M.L.S., N.L.Y., B.L.M., and P.C.S. oversaw study design and management. H.C.M., C.B.M., S.W., S.F.S., A.E.L., N.L.Y., B.L.M. and P.C.S. drafted the manuscript. All authors reviewed the manuscript.
The authors declare no competing financial interests.
Data availability. Sequence data that support findings of this study have been deposited in NCBI GenBank35 under BioProject accession PRJNA344504. Zika virus genomes have accession numbers KY014295–KY014327 and KY785409–KY785485. The dengue virus type 1 genome sequenced in this study has accession number KY829115. See Supplementary Table 1 for a mapping of sample names to accession numbers.
References
- 1.World Health Organization. Zika situation report: Zika virus, Microcephaly and Guillain–Barré syndrome. 2017 http://who.int/emergencies/zika-virus/situation-report/2-february-2017/en/
- 2.Reynolds MR, et al. Vital signs: update on Zika virus-associated birth defects and evaluation of all U.S. infants with congenital Zika virus exposure—U.S. Zika pregnancy registry, 2016. MMWR Morb. Mortal. Wkly. Rep. 2017;66:366–373. doi: 10.15585/mmwr.mm6613e1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.de Vigilância em Saúde S. Protocolo de Vigilância e Resposta à Ocorrência de Microcefalia. Ministério da Saúde Brasília; 2016. [Google Scholar]
- 4.Schieffelin JS, et al. Clinical illness and outcomes in patients with Ebola in Sierra Leone. N. Engl. J. Med. 2014;371:2092–2100. doi: 10.1056/NEJMoa1411680. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Sardi SI, et al. Coinfections of Zika and chikungunya viruses in Bahia, Brazil, identified by metagenomic next-generation sequencing. J. Clin. Microbiol. 2016;54:2348–2353. doi: 10.1128/JCM.00877-16. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Martina BEE, Koraka P, Osterhaus ADME. Dengue virus pathogenesis: an integrated view. Clin. Microbiol. Rev. 2009;22:564–581. doi: 10.1128/CMR.00035-09. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Fauci AS, Morens DM. Zika virus in the Americas—yet another Arbovirus threat. N. Engl. J. Med. 2016;374:601–604. doi: 10.1056/NEJMp1600297. [DOI] [PubMed] [Google Scholar]
- 8.Quick J, et al. Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples. Nat. Protocols. 2017 doi: 10.1038/nprot.2017.066. http://dx.doi.org/10.1038/nprot.2017.066. [DOI] [PMC free article] [PubMed]
- 9.Matranga CB, et al. Enhanced methods for unbiased deep sequencing of Lassa and Ebola RNA viruses from clinical and biological samples. Genome Biol. 2014;15:519. doi: 10.1186/s13059-014-0519-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Faria NR, et al. Establishment and cryptic transmission of Zika virus in Brazil and the Americas. Nature. 2017 doi: 10.1038/nature22401. http://dx.doi.org/10.1038/nature22401. [DOI] [PMC free article] [PubMed]
- 11.Grubaugh ND, et al. Genomic epidemiology reveals multiple introductions of Zika virus into the United States. Nature. 2017 doi: 10.1038/nature22400. http://dx.doi.org/10.1038/nature22400. [DOI] [PMC free article] [PubMed]
- 12.Faria NR, et al. Zika virus in the Americas: early epidemiological and genetic findings. Science. 2016;352:345–349. doi: 10.1126/science.aaf5036. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Sall AA, et al. Yellow fever virus exhibits slower evolutionary dynamics than dengue virus. J. Virol. 2010;84:765–772. doi: 10.1128/JVI.01738-09. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Centers for Disease Control and Prevention. First case of Zika virus reported in Puerto Rico. 2015 https://www.cdc.gov/media/releases/2015/s1231-zika.html.
- 15.Pan American Health Organization. Zika: Epidemiological Report Honduras. 2017 http://www2.paho.org/hq/index.php?option=com_docman&task=doc_view&gid=35137&Itemid=270.
- 16.Pan American Health Organization. Epidemiological Update: Zika Virus Infection. 2015 http://www2.paho.org/hq/index.php?option=com_docman&task=doc_view&gid=32021&Itemid=270.
- 17.Pan American Health Organization. Zika: Epidemiological Report Dominican Republic. 2017 http://www2.paho.org/hq/index.php?option=com_docman&task=doc_view&gid=35103&Itemid=270.
- 18.Nunes MRT, et al. Emergence and potential for spread of chikungunya virus in Brazil. BMC Med. 2015;13:102. doi: 10.1186/s12916-015-0348-x.
- 19.Tsetsarkin KA, Vanlandingham DL, McGee CE, Higgs S. A single mutation in chikungunya virus affects vector specificity and epidemic potential. PLoS Pathog. 2007;3:e201. doi: 10.1371/journal.ppat.0030201.
- 20.Piantadosi A, et al. HIV-1 evolution in gag and env is highly correlated but exhibits different relationships with viral load and the immune response. AIDS. 2009;23:579–587. doi: 10.1097/QAD.0b013e328328f76e.
- 21.Villabona-Arenas CJ, et al. Dengue virus type 3 adaptive changes during epidemics in São Jose de Rio Preto, Brazil, 2006–2007. PLoS One. 2013;8:e63496. doi: 10.1371/journal.pone.0063496.
- 22.Brinton MA, Basu M. Functions of the 3′ and 5′ genome RNA regions of members of the genus Flavivirus. Virus Res. 2015;206:108–119. doi: 10.1016/j.virusres.2015.02.006.
- 23.Duchêne S, Ho SYW, Holmes EC. Declining transition/transversion ratios through time reveal limitations to the accuracy of nucleotide substitution models. BMC Evol. Biol. 2015;15:36. doi: 10.1186/s12862-015-0312-6.
- 24.Corman VM, et al. Clinical comparison, standardization and optimization of Zika virus molecular detection. Bull. World Health Organ. 2016 doi: 10.2471/BLT.16.175950.
- 25.Gire SK, et al. Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak. Science. 2014;345:1369–1372. doi: 10.1126/science.1259657.
- 26.Pyke AT, et al. Imported Zika virus infection from the Cook Islands into Australia, 2014. PLoS Curr. 2014 doi: 10.1371/currents.outbreaks.4635a54dbffba2156fb2fd76dc49f65e.
- 27.Lanciotti RS, et al. Genetic and serologic properties of Zika virus associated with an epidemic, Yap State, Micronesia, 2007. Emerg. Infect. Dis. 2008;14:1232–1239. doi: 10.3201/eid1408.080287.
- 28.Faye O, et al. One-step RT–PCR for detection of Zika virus. J. Clin. Virol. 2008;43:96–101. doi: 10.1016/j.jcv.2008.05.005.
- 29.Faye O, et al. Quantitative real-time PCR detection of Zika virus and evaluation with field-caught mosquitoes. Virol. J. 2013;10:311. doi: 10.1186/1743-422X-10-311.
- 30.Balm MND, et al. A diagnostic polymerase chain reaction assay for Zika virus. J. Med. Virol. 2012;84:1501–1505. doi: 10.1002/jmv.23241.
- 31.Tappe D, et al. First case of laboratory-confirmed Zika virus infection imported into Europe, November 2013. Euro Surveill. 2014;19:20685. doi: 10.2807/1560-7917.es2014.19.4.20685.
- 32.U.S. Food and Drug Administration. Zika virus response updates from FDA. 2017 https://www.fda.gov/EmergencyPreparedness/Counterterrorism/MedicalCountermeasures/MCMIssues/ucm485199.htm#eua.
- 33.Morlan JD, Qu K, Sinicropi DV. Selective depletion of rRNA enables whole transcriptome profiling of archival fixed tissue. PLoS One. 2012;7:e42882. doi: 10.1371/journal.pone.0042882.
- 34.Worobey M, et al. 1970s and ‘Patient 0’ HIV-1 genomes illuminate early HIV/AIDS history in North America. Nature. 2016;539:98–101. doi: 10.1038/nature19827.
- 35.Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res. 2016;44:D67–D72. doi: 10.1093/nar/gkv1276.
- 36.Park DJ, et al. Ebola virus epidemiology, transmission, and evolution during seven months in Sierra Leone. Cell. 2015;161:1516–1526. doi: 10.1016/j.cell.2015.06.007.
- 37.Tomkins-Tinch C, et al. Broad Institute viral-ngs: v1.13.3. 2016 https://github.com/broadinstitute/viral-ngs/releases/tag/v1.13.3.
- 38.Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15:R46. doi: 10.1186/gb-2014-15-3-r46.
- 39.O’Leary NA, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44:D733–D745. doi: 10.1093/nar/gkv1189.
- 40.Aurrecoechea C, et al. PlasmoDB: a functional genomic database for malaria parasites. Nucleic Acids Res. 2009;37:D539–D543. doi: 10.1093/nar/gkn814.
- 41.Yarza P, et al. The All-Species Living Tree project: a 16S rRNA-based phylogenetic tree of all sequenced type strains. Syst. Appl. Microbiol. 2008;31:241–250. doi: 10.1016/j.syapm.2008.07.001.
- 42.Brister JR, Ako-Adjei D, Bao Y, Blinkova O. NCBI viral genomes resource. Nucleic Acids Res. 2015;43:D571–D577. doi: 10.1093/nar/gku1207.
- 43.NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2016;44:D7–D19. doi: 10.1093/nar/gkv1290.
- 44.Grabherr MG, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 2011;29:644–652. doi: 10.1038/nbt.1883.
- 45.Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150–3152. doi: 10.1093/bioinformatics/bts565.
- 46.Hyatt D, LoCascio PF, Hauser LJ, Uberbacher EC. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics. 2012;28:2223–2230. doi: 10.1093/bioinformatics/bts429.
- 47.R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2016.
- 48.Cribari-Neto F, Zeileis A. Beta regression in R. J. Stat. Softw. 2010;34:1–24.
- 49.Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 2013;30:772–780. doi: 10.1093/molbev/mst010.
- 50.Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 51.Kearse M, et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics. 2012;28:1647–1649. doi: 10.1093/bioinformatics/bts199.
- 52.Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA. 1992;89:10915–10919. doi: 10.1073/pnas.89.22.10915.
- 53.Guindon S, et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 2010;59:307–321. doi: 10.1093/sysbio/syq010.
- 54.Rambaut A. FigTree. Version 1.4.2. Inst. Evol. Biol.; Univ. Edinburgh: 2014.
- 55.Rambaut A, Lam TT, Max Carvalho L, Pybus OG. Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). Virus Evol. 2016;2:vew007. doi: 10.1093/ve/vew007.
- 56.Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol. Biol. Evol. 2012;29:1969–1973. doi: 10.1093/molbev/mss075.
- 57.Shapiro B, et al. A Bayesian phylogenetic method to estimate unknown sequence ages. Mol. Biol. Evol. 2011;28:879–887. doi: 10.1093/molbev/msq262.
- 58.Shapiro B, Rambaut A, Drummond AJ. Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences. Mol. Biol. Evol. 2006;23:7–9. doi: 10.1093/molbev/msj021.
- 59.Drummond AJ, Rambaut A, Shapiro B, Pybus OG. Bayesian coalescent inference of past population dynamics from molecular sequences. Mol. Biol. Evol. 2005;22:1185–1192. doi: 10.1093/molbev/msi103.
- 60.Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. Relaxed phylogenetics and dating with confidence. PLoS Biol. 2006;4:e88. doi: 10.1371/journal.pbio.0040088.
- 61.Ferreira MAR, Suchard MA. Bayesian analysis of elapsed times in continuous-time Markov chains. Can. J. Stat. 2008;36:355–368.
- 62.Baele G, et al. Improving the accuracy of demographic and molecular clock model comparison while accommodating phylogenetic uncertainty. Mol. Biol. Evol. 2012;29:2157–2167. doi: 10.1093/molbev/mss084.
- 63.Baele G, Li WLS, Drummond AJ, Suchard MA, Lemey P. Accurate model selection of relaxed molecular clocks in bayesian phylogenetics. Mol. Biol. Evol. 2013;30:239–243. doi: 10.1093/molbev/mss243.
- 64.Yang Z. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 1996;42:587–596. doi: 10.1007/BF02352289.
- 65.Lê S, Josse J, Husson F. FactoMineR: an R package for multivariate analysis. J. Stat. Softw. 2008;25:1–18.
- 66.Josse J, Husson F. missMDA: a package for handling missing values in multivariate data analysis. J. Stat. Softw. 2016;70:1–31.
- 67.Gourinat A-C, O’Connor O, Calvez E, Goarant C, Dupont-Rouzeyrol M. Detection of Zika virus in urine. Emerg. Infect. Dis. J. 2015;21:84. doi: 10.3201/eid2101.140894.
- 68.Paz-Bailey G, et al. Persistence of Zika virus in body fluids—preliminary report. N. Engl. J. Med. 2017 doi: 10.1056/NEJMoa1613108.
Associated Data