Skip to main content
eLife logoLink to eLife
. 2021 Jan 19;10:e64509. doi: 10.7554/eLife.64509

Evidence for adaptive evolution in the receptor-binding domain of seasonal coronaviruses OC43 and 229e

Kathryn E Kistler 1,2,, Trevor Bedford 1,2
Editors: Daniel B Weissman3, Patricia J Wittkopp4
PMCID: PMC7861616  PMID: 33463525

Abstract

Seasonal coronaviruses (OC43, 229E, NL63, and HKU1) are endemic to the human population, regularly infecting and reinfecting humans while typically causing asymptomatic to mild respiratory infections. It is not known to what extent reinfection by these viruses is due to waning immune memory or antigenic drift of the viruses. Here we address the influence of antigenic drift on immune evasion of seasonal coronaviruses. We provide evidence that at least two of these viruses, OC43 and 229E, are undergoing adaptive evolution in regions of the viral spike protein that are exposed to human humoral immunity. This suggests that reinfection may be due, in part, to positively selected genetic changes in these viruses that enable them to escape recognition by the immune system. It is possible that, as with seasonal influenza, these adaptive changes in antigenic regions of the virus would necessitate continual reformulation of a vaccine made against them.

Research organism: Virus

Introduction

Coronaviruses were first identified in the 1960s and, in the decades that followed, human coronaviruses (HCoVs) received a considerable amount of attention in the field of infectious disease research. At this time, two species of HCoV, OC43 and 229E, were identified as the causative agents of roughly 15% of common colds (McIntosh, 1974; Heikkinen and Järvinen, 2003). Infections with these viruses were shown to exhibit seasonal patterns, peaking in January to March in the Northern Hemisphere, as well as yearly variation, with the greatest incidence occurring every 2–4 years (Monto and Lim, 1974; Hamre and Beem, 1972). Subsequently, two additional seasonal HCoVs, HKU1 and NL63, have entered the human population. These four HCoVs endemic to the human population usually cause mild respiratory infections, but occasionally result in more severe disease in immunocompromised patients or the elderly (Liu et al., 2020b). In the past 20 years, three additional HCoVs (SARS-CoV-1, MERS-CoV, and SARS-CoV-2) have emerged, which cause more severe respiratory illness. At the writing of this paper, amidst the SARS-CoV-2 pandemic, no vaccine for any HCoV is currently available, though many candidate SARS-CoV-2 vaccines are in production and clinical trials (Krammer, 2020).

Coronaviruses are named for the ray-like projections of spike protein that decorate their surface. Inside these virions is a positive-sense RNA genome of roughly 30 kB (Li, 2016). This large genome size can accommodate more genetic variation than a smaller genome (Woo et al., 2009). Genome flexibility, coupled with a RNA virus error-prone polymerase (Drake, 1993) and a high rate of homologous recombination (Pasternak et al., 2006), creates genetic diversity that is acted upon by evolutionary pressures that select for viral replication. This spawns much of the diversity within and between coronaviruses species (Woo et al., 2009; Hon et al., 2008) and can contribute to the virus’ ability to jump species-barriers, allowing a previously zoonotic CoV to infect and replicate in humans.

The battle between virus and host results in selective pressure for mutations that alter viral antigens in a way that evades immune recognition. Antigenic evolution, or antigenic drift, leaves a characteristic mark of positively selected epitopes within the viral proteins most exposed to the host immune system (Smith et al., 2004). For CoVs, this is the spike protein, exposed on the surface of the virion to human humoral immunity. Some human respiratory illnesses caused by RNA viruses, like seasonal influenza (Smith et al., 2004), evolve antigenically, while others, like measles, do not (Fulton et al., 2015). Because of this, seasonal influenza vaccines must be reformulated on a nearly annual basis, while measles vaccines typically provide lifelong protection. Whether HCoVs undergo antigenic drift is relevant not only to understanding HCoV evolution and natural immunity against HCoVs, but also to predicting the duration of a vaccine’s effectiveness.

Early evidence that closely related HCoVs are antigenically diverse comes from a 1980s human challenge study in which subjects were infected and then reinfected with a variety of 229E-related strains (Reed, 1984). All subjects developed symptoms and shed virus upon initial virus inoculation. After about a year, subjects who were re-inoculated with the same strain did not show symptoms or shed virus. However, the majority of subjects who were re-inoculated with a heterologous strain developed symptoms and shed virus. This suggests that immunity mounted against 229E viruses provides protection against some, but not all, other 229E strains. This is a result that would be expected of an antigenically evolving virus.

More recent studies have identified eight OC43 genotypes and, in East Asian populations, certain genotypes were shown to temporally replace other genotypes (Lau et al., 2011; Zhang et al., 2015; Zhu et al., 2018). Whether certain genotypes predominate due to antigenic differences that confer a fitness advantage is not known. However, evidence for selection in the spike protein of one of these dominant OC43 genotypes has been provided by dN/dS, a standard computational method for detecting positive selection (Ren et al., 2015). This method has also been used to suggest positive selection in the spike protein of 229E (Chibo and Birch, 2006). Additionally, two genetically distinct groupings (each of which include multiple of the aforementioned eight genotypes) of OC43 viruses have been shown to alternate in prevalence within a Japanese community, meaning that the majority of OC43 infections are caused by one group for about 2–4 years at which point the other group begins to account for the bulk of infections. It has been suggested that antigenic differences between these groups contribute to this epidemic switching (Komabayashi et al., 2020).

However, a similar surveillance of the NL63 genotypes circulating in Kilifi, Kenya found that NL63 genotypes persist for relatively long periods of time, that people become reinfected by the same genotype, and that reinfections are often enhanced by prior infection (Kiyuka et al., 2018). These findings are inconsistent with antigenic evolution in NL63.

Here we use a variety of computational approaches to detect adaptive evolution in spike and comparator proteins in HCoVs. These methods were designed as improvements to dN/dS with the intention of identifying adaptive substitutions within a serially sampled RNA virus population. We focus on the seasonal HCoVs that have been continually circulating in humans: OC43, 229E, HKU1, and NL63. Our analyses of nonsynonymous divergence, rate of adaptive substitutions, and Time to Most Recent Common Ancestor (TMRCA) provide evidence that the spike protein of OC43 and 229E is under positive selection. Though we conduct these analyses on HKU1 and NL63, we do not observe evidence for adaptive evolution in the spike protein of these viruses. For HKU1, there is not enough longitudinal sequencing data available for us to confidently make conclusions as to whether or not this lack of evidence reflects an actual lack of adaptive evolution.

Results

Phylogenetic consideration of viral diversity and recombination in OC43 and 229e

We constructed time-resolved phylogenies of the OC43 and 229E using publicly accessible sequenced isolates. A cursory look at these trees confirms previous reports that substantial diversity exists within each viral species (Zhang et al., 2015; Komabayashi et al., 2020; Lau et al., 2011). Additionally, the trees form ladder-like topologies with isolate tips arranged into temporal clusters rather than geographic clusters, indicating a single global population rather than geographically isolated populations of virus. The phylogeny of OC43 bifurcates immediately from the root (Figure 1), indicating that OC43 consists of multiple, co-evolving lineages. Because of the distinct evolutionary histories, it is appropriate to conduct phylogenetic analyses separately for each lineage. We have arbitrarily labeled these lineages ‘A’ and ‘B’ (Figure 1).

Figure 1. Phylogenetic trees for spike gene of seasonal human coronaviruses (HCoVs) OC43 and 229E.

Phylogenies built from (A) OC43 spike sequences from 389 isolates over 53 years, and (B) 229E spike sequences from 54 isolates over 31 years. OC43 bifurcates immediately after the root and is split into two lineages: lineage A (dark teal) and lineage B (light teal). 229E contains just one lineage (dark blue). For the analyses in this paper, the evolution of each gene (or genomic region) is considered separately, so phylogenies are built for each viral gene, and those phylogenies are used to split isolates into lineages for each gene. These are temporally resolved phylogenies with year shown on the x-axis. The clock rate estimate is 5 × 10–4 substitutions per nucleotide site per year for OC43 and 6 × 10–4 for 229E.

Figure 1.

Figure 1—figure supplement 1. Recombination occurs between human coronavirus (HCoV) isolates.

Figure 1—figure supplement 1.

A tanglegram draws lines between an isolate’s position on two phylogenies built on different genes (or genomic regions). Dramatic differences in an isolate’s position on one tree versus another are indicative of recombination. (A) Phylogenetic relationships between OC43 isolates based on spike sequences (left) versus relationships based on RdRp sequences (right). Light teal lines that connect isolates classified as lineage A based on their RdRp sequence to isolates classified as lineage B based on their spike sequence suggest that recombination occurred in these isolates or their ancestors. (B) Phylogenetic reconstruction of OC43 isolates based on S1 sequences (left) versus S2 sequences (right). Year is shown on the x-axis.
Figure 1—figure supplement 2. Phylogenetic trees for seasonal human coronaviruses (HCoVs) NL63 and HKU1.

Figure 1—figure supplement 2.

Phylogenies built from (A) NL63 spike sequences from 159 isolates over 37 years, and (B) HKU1 spike sequences from 41 isolates over 13 years. HKU1 bifurcates immediately after the root and is split into lineage A (darker blue) and lineage B (lighter blue). NL63 contains just one lineage (green). Both HCoVs are rooted on an outgroup sequence. For the analyses in this paper, the evolution of each gene (or genomic region) is considered separately, so phylogenies are built for each viral gene and those phylogenies are used to split isolates into lineages for each gene. These are temporally resolved phylogenies with year shown on the x-axis. The clock rate of each HCoV is listed in the section ‘Phylogenetic inference’.

Because recombination is common among coronaviruses (Pasternak et al., 2006; Hon et al., 2008; Lau et al., 2011), we built separate phylogenies for each viral gene. In the absence of recombination, each tree should show the same evolutionary relationships between viral isolates. A dramatic difference in a given isolate’s position on one tree versus another is strongly indicative of recombination (Kosakovsky Pond et al., 2006). Comparing the RNA-dependent RNA polymerase (RdRp) and spike trees reveals this pattern of recombination in some isolates (Figure 1—figure supplement 1A). A comparison of the trees of the S1 and S2 sub-domains of spike shows more limited evidence for intragenic recombination (Figure 1—figure supplement 1B), which is consistent with the fact that the distance between two genetic loci is inversely related to the chance that these loci remain linked during a recombination event. Though intragenic recombination likely does occur occasionally, analyzing genes, rather than isolates, greatly reduces the contribution of recombination to genetic variation in our analyses.

Thus, in all of our analyses, we use alignments and phylogenies of sequences of single genes (or genomic regions) rather than whole genome sequences of isolates. We designate the lineage of those genes (or genomic regions) based on the gene’s phylogeny. Though most isolates contain all genes from the same lineage, some isolates have, say, a lineage A spike gene and a lineage B RdRp gene. This strategy allows us to consider the evolution of each gene separately and interrogate the selective pressures acting on them.

It is worth noting that the analyses we use here to detect adaptive evolution canonically presume that selective pressures are acting on single nucleotide polymorphisms (SNPs). However, it is possible that recombination also contributes to the genetic variation that is acted on by immune selection. This would be most likely to occur if two closely related genomes recombine, resulting in the introduction of a small amount of genetic diversity without disrupting crucial functions. Our analyses do not aim to determine the source of genetic variation (i.e. SNPs or recombination), but rather focus on identifying if and how selection acts on this variation.

Because of its essential role in viral replication and lack of antibody exposure, we expect RdRp to be under purifying selection to maintain its structure and function. If HCoVs evolve antigenically, we expect to see adaptive evolution in spike, and particularly in the S1 domain of spike (Hofmann et al., 2006; Hulswit et al., 2019), due to its exposed location at the virion’s surface and interaction with the host receptor. Mutations that escape from population immunity are beneficial to the virus and so are driven to fixation by positive selection. This results in adaptive evolution of the virus population.

Phylogenetic inference of substitution prevalence within spike

Using phylogenies constructed from the spike gene, we tallied the number of independent amino acid substitutions at each position within spike. The average number of substitutions per site is higher in S1 than S2 for HCoV lineages in OC43 and 229E (Figure 2A). We focus on S1 rather than the receptor-binding domain (RBD) within S1 in our analyses, because it is known that neutralizing antibodies bind to epitopes within the N-terminal domain (NTD) as well as the RBD of S1 (Liu et al., 2020a; Zhang et al., 2018; Zhou et al., 2019). A greater occurrence of repeated substitutions is expected if some mutations within S1 confer immune avoidance. Alternatively, these repeated substitutions could be a result of high mutation rate and random genetic drift as has been shown at particular types of sites in SARS-CoV-2 (van Dorp et al., 2020). However, this latter hypothesis should affect all regions of the genome equally and should not result in a greater number of repeated substitutions in S1 than S2.

Figure 2. More sites mutate repeatedly within spike S1 versus S2.

(A) Number of substitutions observed at each amino acid position in the spike gene throughout the phylogeny. S1 (gray) and S2 (white) are indicated by shading and the number of substitutions per site is indicated by a dot and color-coded by human coronavirus (HCoV) lineage. The putative receptor-binding domains for 229E (Li et al., 2019) and the putative domain for OC43 (Lau et al., 2011) are indicated with light yellow bars. Asterisks indicate two example positions (192 and 262), which mutate repeatedly throughout the OC43 lineage A phylogeny. The OC43 phylogeny built from spike sequences and color-coded by genotype at positions 192 and 262 is shown in (B) and (C), respectively.

Figure 2.

Figure 2—figure supplement 1. Mutations at each position within Spike for NL63 and HKU1.

Figure 2—figure supplement 1.

Number of substitutions observed at each amino acid position in the spike gene throughout the phylogeny. S1 (gray) and S2 (white) are indicated by shading and the number of substitutions per site is indicated by a dot and color-coded by human coronavirus (HCoV) lineage: (A) NL63, (B) HKU1 (assuming all HKU1 isolates are a single lineage), (C) HKU1 lineage A (D) HKU1 lineage B (assuming there are two co-circulating HKU1 lineages).

If the repeated mutations are a product of immune selection, not only should S1 contain more repeated mutations, but we would also expect these mutations to spread widely after they occur due to their selective advantage. Additionally, we expect sites within S1 to experience diversifying selection due to the ongoing arms race between virus and host immune system. This is visible in the distribution of genotypes at the most repeatedly mutated sites in OC43 lineage A (Figure 2B and C).

Nonsynonymous and synonymous divergence in RdRp and subdomains of spike

An adaptively evolving gene, or region of the genome, should exhibit a high rate of nonsynonymous substitutions. For each seasonal HCoV lineage, we calculated nonsynonymous and synonymous divergence as the average Hamming distance from that lineage’s most recent common ancestor (Zanini et al., 2015). The rate of nonsynonymous divergence is markedly higher within spike versus RdRp of 229E and OC43 lineage A (Figure 3A). While nonsynonymous divergence increases steadily over time in spike, it remains roughly constant at 0.0 in RdRp. These results suggest that there is predominantly positive selection on OC43 and 229E spike, but predominantly purifying selection on RdRp. Separating spike into the S1 (receptor-binding) and S2 (membrane-fusion) domains reveals that the majority of nonsynonymous divergence in spike occurs within S1 (Figure 3B). In fact, the rates of nonsynonymous divergence in S2 are similar to those seen in RdRp, suggesting S2 evolves under purifying selection while S1 evolves adaptively.

Figure 3. Nonsynonymous divergence is higher in OC43 and 229E Spike S1 versus S2 or RdRp.

(A) Nonsynonymous (dashed lines) and synonymous divergence (solid lines) of the spike (dark orange) and RdRp (dark gray) genes of all 229E and OC43 lineages over time. Divergence is the average Hamming distance from the ancestral sequence, computed in sliding 3-year windows that contain at least two sequenced isolates. Shaded region shows 95% confidence intervals. Note that the absence of a line means that there are fewer than two sequences available at this time point and that, therefore, the divergence is not calculated. (B) Nonsynonymous and synonymous divergence within the S1 (light orange) and S2 (light gray) domains of spike. Year is shown on the x-axis and is shared between plots.

Figure 3.

Figure 3—figure supplement 1. Nonsynonymous divergence in NL63 and HKU1.

Figure 3—figure supplement 1.

Nonsynonymous (dashed lines) and synonymous divergence (solid lines) within the spike (dark orange) and RdRp (dark gray) genes and within S1 (light orange) and S2 (light gray) over time. Divergence is the average Hamming distance from the ancestral sequence, computed in sliding 3-year windows that contain at least two sequenced isolates. Shaded region shows 95% confidence intervals. (A) NL63, (B) HKU1 (assuming all HKU1 isolates belong to a single lineage), and (C) HKU1 (divided into two co-circulating lineages). Year is shown on the x-axis. Note that x- and y-axis scales are shared between the subplots but are different from Figure 3.
Figure 3—figure supplement 2. Ratio of divergence between genomic regions.

Figure 3—figure supplement 2.

(A) The ratio of nonsynonymous divergence in spike to nonsynonymous divergence in RdRp (dashed lines) and the equivalent ratio of synonymous divergence (solid lines) is shown for 229E (dark blue), OC43 lineage A (dark teal), and OC43 lineage B (light teal). (B) The same ratios of divergence as in panel A, except comparing S1 and S2. Year is on the x-axis.

Though we would expect synonymous divergence to be equivalent in all areas of the genome, this is not born out in our results. It is unclear whether the difference in synonymous divergence between genes reflects an actual biological difference. However, the ratio of nonsynonymous divergence in spike to nonsynonymous divergence in RdRp is consistently higher than the equivalent ratio of synonymous divergence (Figure 3—figure supplement 2). Thus, despite differences in synonymous divergence, spike is accumulating relatively more nonsynonymous divergence than RdRp.

We compared our analysis of divergence to the results of a more standard approach for detecting positive selection on certain branches of a phylogeny. This approach, called mixed effects model of evolution (MEME), is maximum-likelihood method that gives a single dN/dS value for each gene (Murrell et al., 2012; Weaver et al., 2018). In agreement with measures of nonsynonymous divergence over time, dN/dS estimates are higher in Spike than RdRp and higher in S1 than S2 (Table 1). Our estimate of dN/dS in OC43 spike is similar to the previously reported estimate of roughly 0.3 (Ren et al., 2015). However, we believe that the standard dN/dS approach is not the ideal tool for detecting adaptive evolution in HCoVs because it is a phylogenetic approach, which may be biased by recombination, and also because some assumptions of the model hold true for mammalian genomes, but not necessarily for RNA viruses.

Table 1. dN/dS is lower in spike than RdRp.

A single dN/dS value was computed for gene (or spike domain) and each human coronavirus (HCoV) using mixed effects model of evolution (MEME).

RdRp Spike S1 S2
229E 0.143 0.441 0.662 0.166
OC43 lineage A 0.080 0.435 0.466 0.301
OC43 lineage B 0.061 0.317 0.418 0.234
NL63 0.068 0.139 0.121 0.038

Rate of adaptation in RdRp and subdomains of spike

Therefore, as a complement to the divergence analysis, we implemented an alternative to the dN/dS method that was specifically designed to detect positive selection within RNA virus populations (Bhatt et al., 2011). Compared with traditional dN/dS methods, the Bhatt method has the advantages of: (1) measuring the strength of positive selection within a population given sequences collected over time, (2) higher sensitivity to identifying mutations that occur only once and sweep through the population, and (3) correcting for deleterious mutations (Bhatt et al., 2010; Bhatt et al., 2011). Briefly, this method defines a class of neutrally evolving nucleotide sites as those with synonymous mutations or where nonsynonymous polymorphisms occur at medium frequency. Then, the number of fixed and high-frequency nonsynonymous sites that exceed the neutral expectation are calculated. This method compares nucleotide sequences at each time point (the ingroup) to the consensus nucleotide sequence at the first time point (the outgroup) and yields an estimate of the number of adaptive substitutions within a given genomic region at each of these time points.

We adapted this method to detect adaptive substitutions in seasonal HCoVs. As shown in Figure 4, OC43 lineage A has continuously amassed adaptive substitutions in spike over the past >30 years while RdRp has accrued few, if any, adaptive substitutions. These adaptive substitutions are located within the S1, and not the S2, domain of spike (Figure 4). We observe a largely linear accumulation of adaptive substitutions in spike and S1 through time, although the method does not dictate a linear increase. This observation suggests that spike (and S1 in particular) is evolving in response to a continuous selective pressure. This is exactly what would be expected if these adaptive substitutions are evidence of antigenic evolution resulting from an evolutionary arms race between spike and the host immune system.

Figure 4. Adaptive substitutions accumulate over time in OC43 lineage A spike S1.

Figure 4.

Adaptive substitutions per codon within OC43 lineage A spike, S1, S2 and RdRp as calculated by our implementation of the Bhatt method. Adaptive substitutions are computed in sliding 3-year windows, and only for time points that contain three or more sequenced isolates. Green dots display estimated values calculated from the empirical data and green lines show linear regression fit to these points. Gray lines show the distribution of regressions fit to the computed number of adaptive substitutions from 100 bootstrapped data sets. Year is shown on the x-axis.

We estimate that OC43 lineage A accumulates roughly 1.8 × 10–3 adaptive substitutions per codon per year (or 1.4 adaptive amino acid substitutions in S1 each year) in the S1 domain of spike, while the rate of adaptation in OC43 lineage B is slightly higher and is estimated to result in an average 1.7 adaptive substitutions in S1 per year (Figure 5). The S1 domain of 229E is estimated to accrue 0.76 adaptive substitutions per year (a rate of 1.4 × 10–3 adaptive substitutions per codon per year).

Figure 5. The rate of adaptive substitution is highest in spike S1.

Adaptive substitutions per codon per year as calculated by our implementation of the Bhatt method. Rates are calculated within Spike, S1, S2 and RdRp for 229E and OC43 lineages. Error bars show 95% bootstrap percentiles from 100 bootstrapped data sets.

Figure 5.

Figure 5—figure supplement 1. NL63 and HKU1 have low rates of adaptation in spike.

Figure 5—figure supplement 1.

As in Figure 4, adaptive substitutions per codon per year are calculated by our implementation of the Bhatt method. (A) NL63 (green) and HKU1 (blue) are both considered to consist of a single lineage. (B) HKU1 is divided into two co-circulating lineages (blue and light blue). The calculated rates of adaptive substitution within spike, S1, S2 and RdRp are plotted alongside 229E and OC43 for comparison. Error bars show 95% bootstrap percentiles from 100 bootstrapped data sets.

A benefit of the Bhatt method is the ability to calculate the strength of selection, which allows us to compare these seasonal HCoVs to other viruses. We used our implementation of the Bhatt method to calculate the rate of adaptation for influenza A/H3N2, which is known to undergo rapid antigenic evolution (Rambaut et al., 2008; Yang, 2000), measles, which does not (Fulton et al., 2015), and influenza B strains Vic and Yam, which evolve antigenically at a slower rate than A/H3N2 (Bedford et al., 2014). We estimate that the receptor-binding domain of influenza A/H3N2 accumulates adaptive substitutions between two and three times faster than the HCoVs OC43 and 229E (Figure 6). The rates of adaptive substitution in influenza B/Yam and B/Vic are on par with the seasonal HCoVs. We detect no adaptive substitutions in the measles receptor-binding protein. These results put the evolution of the S1 domain of OC43 and 229E in context, indicating that the S1 domain is under positive selection, and that this positive selection generates new variants in the putative antigenic regions of these HCoVs at about the same rate as influenza B strains and about half the rate of the canonical example of antigenic evolution, the HA1 domain of influenza A/H3N2.

Figure 6. OC43 and 229E spike S1 accumulates adaptive substitutions faster than measles but slower than influenza A/H3N2.

Figure 6.

Comparison of adaptive substitutions per codon per year between measles (yellow), four influenza strains (A/H3N2, A/H1N1pdm, B/Vic, and B/Yam- shown in shades of red), OC43 lineage A (dark teal), OC43 lineage B (light teal), and 229E (dark blue). The polymerase, receptor-binding domain, and membrane fusion domain for influenza strains are PB1, HA1, and HA2. For both human coronaviruses (HCoVs), they are RdRp, S1, and S2, respectively. For measles, the polymerase is the L gene, the receptor-binding protein is the H gene, and the fusion protein is the F gene. Error bars show 95% bootstrap percentiles from 100 bootstrapped data sets.

Validation that rate of adaptation is not biased by recombination

Because coronaviruses are known to recombine, and recombination has the potential to impact evolutionary analyses of selection, we sought to verify that our results are not swayed by the presence of recombination. To do this, we simulated the evolution of OC43 lineage A spike and RdRp genes under varying levels of recombination and positive selection (representative phylogenies of simulated spike evolution can be seen in Figure 7—figure supplement 2) and used our implementation of the Bhatt method to identify adaptive evolution. As the strength of positive selection increases, we detect a higher rate of adaptive evolution, regardless of the level of recombination (Figure 7). This demonstrates that our estimates of adaptive evolution are not biased by recombination events.

Figure 7. Detection of positive selection is not biased by recombination.

OC43 lineage A sequences were simulated with varying levels of recombination and positive selection. The Bhatt method was used to calculate the rate of adaptive substitutions per codon per year for S1 (light orange), S2 (light gray), and RdRp (dark gray). The mean and 95% confidence interval of 10 independent simulations are plotted.

Figure 7.

Figure 7—figure supplement 1. Fewer years of longitudinally sampled isolates reduces ability to detect rate of adaptation.

Figure 7—figure supplement 1.

OC43 lineage A S1 sequences were simulated under conditions of no, moderate, and high rates of recombination in combination with no, moderate, or high strength of positive selection. The Bhatt method was used to calculate the ‘true’ rate of adaptive evolution under each of these scenarios using all available simulated sequence data (30 years), or the estimated rate if only the most recent 24, 14, 10, or 7 years of simulated sequences were used. The mean and 95% confidence intervals of 10 independent simulations are plotted.
Figure 7—figure supplement 2. Representative phylogenies of simulated spike data.

Figure 7—figure supplement 2.

OC43 lineage A spike sequence evolution was simulated under conditions of no, moderate, and high rates of recombination in combination with no, moderate, or high strength of positive selection. This figure shows time-resolved phylogenies built from 1 of the 10 independent simulations under each recombination/selection regime.

TMRCA of RdRp and subdomains of spike

Finally, we know that strong directional selection skews the shape of phylogenies (Volz et al., 2013). In influenza H3N2, immune selection causes the genealogy to adopt a ladder-like shape where the rungs are formed by viral diversification and each step is created by the appearance of new, antigenically superior variants that replace previous variants. This ladder-like shape can also be seen in the phylogenies of the OC43 and 229E (Figure 1). In this case, selection can be quantified by the timescale of population turnover as measured by the TMRCA, with the expectation that stronger selection will result in more frequent steps and therefore a smaller TMRCA measure (Bedford et al., 2011). We computed average TMRCA values from phylogenies built on Spike, S1, S2 or RdRp sequences of OC43 lineage A and 229E (Table 2). We did not compute TMRCA for OC43 lineage B because the limited number of available RdRp sequences for this lineage mean that TMRCA can only be calculated for about 4 years, which could artificially skew the TMRCA estimates. Our estimates of HCoV spike TMRCA are roughly 2–2.5 longer the estimated TMRCA for influenza A/H3N2 hemagglutinin (Bedford et al., 2011).

Table 2. Mean Time to Most Recent Common Ancestor (TMRCA) is lower in S1 than RdRp or S2.

Average TMRCA values (in years) for OC43 lineage A and 229E. The 95% confidence intervals are indicated in parentheses below mean TMRCA values.

Spike S1 S2 RdRp
OC43 lineage A 4.67
(4.04, 5.28)
3.45
(2.86, 4.05)
13.05
(11.24, 14.97)
17.39
(15.63, 19.15)
229E 4.19
(3.13, 5.25)
2.23
(1.76, 2.69)
5.08
(3.93, 6.23)
4.86
(4.04, 5.69)

We observe that for both OC43 lineage A and 229E, the average TMRCA is lower in spike than RdRp and lower in S1 versus S2. These results suggest strong directional selection in S1, likely driven by pressures to evade the humoral immune system. The difference in TMRCA between S1 and S2 is indicative not only of differing selective pressures acting on these two spike domains, but also of intra-spike recombination. This is because the immune selection imposed on S1 should also propagate neutral hitch-hiker mutations in closely linked regions such as S2. The difference in TMRCA suggests that recombination may uncouple these regions. Recombination can also push TMRCA to higher values, though this should not have a larger impact on RdRp than S1. The contributions of the forces of directional selection and recombination are difficult to parse from the TMRCA results. This emphasizes the importance of using methods, such as the Bhatt method, that are robust to recombination to detect adaptive evolution.

Application of methods for identifying adaptive evolution to HKU1 and NL63

Because HKU1 was identified in the early 2000s, there are fewer longitudinally sequenced isolates available for this HCoV compared to 229E and OC43 (Figure 1—figure supplement 2). Consequently, the phylogenetic reconstructions and divergence analysis of HKU1 have a higher level of uncertainty. To begin with, it is less clear from the phylogenies whether HKU1 represents a single HCoV lineage like 229E or, instead, should be split into multiple lineages like OC43 (Figure 1). Because of this, we completed all antigenic analyses for HKU1 twice: once considering all isolates to be members of a single lineage, and again after splitting isolates into two separate lineages. These lineages are arbitrarily labeled ‘A’ and ‘B’ as was done for OC43. When HKU1 is considered to consist of just one lineage, there is no signal of antigenic evolution by divergence analysis (Figure 3—figure supplement 1B) or by the Bhatt method of estimating adaptive evolution (Figure 5—figure supplement 1A). However, when HKU1 is assumed to consist of two co-circulating lineages, HKU1 lineage A has a markedly higher rate of adaptive substitutions in S1 than in S2 or RdRp (Figure 5—figure supplement 1B).

To demonstrate the importance of having a well-sampled longitudinal series of sequenced isolates for our antigenic analyses, we returned to our simulated OC43 S1 data sets. We mimicked shorter longitudinal series by truncating the data set to only 24, 14, 10, or 7 years of samples and ran the Bhatt analysis on these sequentially shorter time series (Figure 7—figure supplement 1). The rates of adaptation estimated from the truncated data sets can be compared to the ‘true’ rate of adaptation calculated from all simulated data. This simulated data reveals a general trend that less longitudinal data reduces the ability to detect adaptive evolution by skewing the estimated rate away from the ‘truth’ and increasing the uncertainty of the analysis. Given the dearth of longitudinal data for HKU1, we do not feel that it is appropriate to make strong conclusions about adaptive evolution, or lack thereof, in this HCoV.

Despite being identified at roughly the same time as HKU1, substantially more NL63 isolates have been sequenced (Figure 1—figure supplement 2) making the phylogenetic reconstruction and evolutionary analyses of this virus correspondingly more reliable. We do not observe evidence for adaptive evolution in NL63 (Figure 3—figure supplement 1A and Figure 5—figure supplement 1A) and this lack of support for adaptive evolution in the NL63 spike gene is more likely to reflect an actual lack of adaptive evolution in this virus.

Discussion

Using several corroborating methods, we provide evidence that the seasonal HCoVs OC43 and 229E undergo adaptive evolution in S1, the region of the spike protein exposed to human humoral immunity (Figures 3, 4, and 5). We additionally confirm that RdRp and S2 do not show signals of adaptive evolution. We observe that S1 accumulates between 0.8 (229E) and 1.4 (OC43) adaptive substitutions per year. We infer that these viruses accumulate adaptive substitutions at roughly half the rate of influenza A/H3N2 and at a similar rate to influenza B viruses (Figure 6). The most parsimonious explanation for the observation of substantial adaptive evolution in S1 is that antigenic drift is occurring in which mutations that escape from human population immunity are selectively favored in the viral population leading to repeated adaptive changes. However, it is formally possible that the adaptive evolution we detect is a result of selective pressures other than evasion of the adaptive immune system. Showing that this is truly antigenic evolution could involve a serological comparison of isolates that differ at S1 residues under positive selection.

In seasonal influenza and measles, the rates of adaptive evolution we estimate correlate well with relative rates of antigenic drift reported by other groups (Fulton et al., 2015; Bedford et al., 2014). The relative rates of adaptation we calculate also match the relative frequency of vaccine strain updates, as would be expected since vaccines must be updated to match antigenically evolving viruses. Since 2006, the A/H3N2 component of the seasonal influenza vaccine has been updated 10 times (11 different A/H3N2 strains), four different B/Vic strains and four different B/Yam strains have been included in the vaccine, and the measles vaccine strain has not changed (Global Influenza Surveillance and Response System [GISRS], https://www.who.int/influenza/vaccines/virus/en/). Using these numbers as guidance, our results suggest that a vaccine against OC43 or 229E might need to be updated as frequently as the B/Vic and B/Yam components of the influenza vaccine are.

We do not observe evidence of antigenic evolution in NL63 (Figure 3—figure supplement 1 and Figure 5—figure supplement 1). This likely represents a lack of marked adaptive evolution in S1. Our finding fits with a study of NL63 in Kenya, which identified multiple genotypes of NL63 and show that people regularly become reinfected with the same genotype of NL63 (Kiyuka et al., 2018). Additionally, Kiyuka et al. found that these genotypes circulate locally for a long period of time, suggesting a decent amount of viral diversity and a potential lack of evolution due to immune selection. Though our results cannot explain why OC43 and 229E likely evolve antigenically while NL63 does not, Kiyuka et al. observe that NL63 reinfections are sometimes enhanced by a previous infection and hypothesize that NL63 is actually under purifying selection at epitope sites (Kiyuka et al., 2018).

Though analysis of all HCoVs would benefit from more sequenced isolates, there is substantially less longitudinal sequencing data available for HKU1. Thus, despite finding no evidence of antigenic evolution in HKU1 (Figure 3—figure supplement 1 and Figure 5—figure supplement 1), it is possible that a more completely sampled time series of HKU1 genome sequences could alter the result for this virus (Figure 7—figure supplement 1).

Our conclusions of adaptive evolution in S1, arrived at through computational analyses of sequencing data, agree with studies that observe reinfection of subjects by heterologous isolates of 229E (Reed, 1984), sequential dominance of specific genotypes of OC43 (Lau et al., 2011; Zhang et al., 2015), and common reinfection by seasonal HCoVs from longitudinal serological data (Edridge et al., 2020). In this latter study, HCoV infections were identified from longitudinal serum samples by assaying for increases in antibodies against the nucleocapsid (N) protein of representative OC43, 229E, HKU1, and NL63 viruses. This study concluded that the average time between infections was 1.5–2.5 years, depending on the HCoV (Edridge et al., 2020). In comparison, influenza H3N2 reinfects people roughly every 5 years (Kucharski et al., 2015). Thus, frequent reinfection by seasonal HCoVs is likely due to a combination of factors and suggests waning immune memory, and/or incomplete immunity against reinfection, in addition to antigenic drift.

HCoVs are a diverse grouping split, phylogenetically, into two genera: NL63 and 229E are alphacoronaviruses, while OC43, HKU1, MERS, SARS, and SARS-CoV-2 are betacoronaviruses. The method of cell entry does not seem to correlate with genus. Coronaviruses bind to a remarkable range of host-cell receptors including peptidases, cell adhesion molecules, and sugars. Among the seasonal HCoVs, OC43 and HKU1 both bind 9-O-acetylsialic acid (Hulswit et al., 2019), while 229E binds human aminopeptidase N (hAPN) and NL63 binds angiotensin-converting enzyme 2 (ACE2) (Liu et al., 2020b). Despite a relatively large phylogenetic distance and divergent S1 structures, NL63 and SARS-CoV-1 and SARS-CoV-2 bind to the same host receptor using the same virus-binding motifs (VBMs) (Li, 2016). This VBM is located in the C-terminal domain of S1 (S1-CTD), which fits within the trend of S1-CTD receptor-binding in CoVs that bind protein receptors (Hofmann et al., 2006; Li, 2016). This is opposed to the trend among CoVs that bind sugar receptors, where receptor binding is located within the S1-NTD (Li, 2016). This localization roughly aligns with our observations that the majority of the repeatedly mutated sites occur toward the C-terminal end of 229E S1 and the N-terminal end of OC43 S1 (Figure 2).

Here we have provided support that at least two of the four seasonal HCoVs evolve adaptively in the region of spike that is known to interact with the humoral immune system. These two viruses span both genera of HCoVs, though due to the complexity of HCoV receptor binding and pathology mentioned above, it is not clear whether or not this suggests that other HCoVs, such as SARS-CoV-2, will also evolve adaptively in S1. This is important because, at the time of writing of this manuscript, many SARS-CoV-2 vaccines are in production and most of these exclusively include spike (Krammer, 2020). If SARS-CoV-2 evolves adaptively in S1 as the closely related HCoV OC43 does, it is possible that the SARS-CoV-2 vaccine would need to be frequently reformulated to match the circulating strains, as is done for seasonal influenza vaccines.

Materials and methods

All data, source code, and analyses can be found at https://github.com/blab/seasonal-cov-adaptive-evolution (Kistler, 2021; copy archived at swh:1:rev:83721cd000f2848d4f77d2a6da8c2d0df8a555a1). All phylogenetic trees constructed and analyzed in this manuscript can be viewed interactively at https://nextstrain.org/community/blab/seasonal-cov-adaptive-evolution; Mattenberger, 2021. All analysis code is written in Python 3 (Python Programming Language, SCR_008394) in Jupyter notebooks (Jupyter-console, RRID:SRC_018414).

Sequence data

All viral sequences are publicly accessible and were downloaded from ViPR (http://www.viprbrc.org) under the ‘Coronavirdiae’ with host ‘human’ (Pickett et al., 2012). Sequences labeled as ‘OC43’, ‘229E’, ‘HKU1’, and ‘NL63’ were pulled out of the downloaded FASTA file into four separate data files. Additionally, a phylogeny of all downloaded HCoVs was made and unlabeled isolates that clustered within clades formed by labeled OC43, 229E, HKU1, or NL63 isolates were marked as belonging to that HCoV type and added to our data files. Code for these data-parsing steps is located in data-wrangling/postdownload_formatting_for_rerun.ipynb.

Phylogenetic inference

For each of the four HCoV data sets, full-length sequences were aligned to a reference genome using the augur align command (Hadfield et al., 2018) and MAFFT (Katoh et al., 2002). Individual gene sequences were then extracted from these alignments if sequencing covered 50% or more of the gene using the code in data-wrangling/postdownload_formatting_for_rerun.ipynb. Sequence files for each gene are located in the data/ directory within each HCoV parent directory (ex: oc43/data/oc43_spike.fasta). A Snakemake file (Köster and Rahmann, 2012) within each HCoV directory follows the general outline of a Nextstrain build (Nextstrain, RRID:SCR_018223) and was used to align each gene to a reference strain and build a time-resolved phylogeny with IQ-Tree v1 (Nguyen et al., 2015) and TimeTree (Sagulenko et al., 2018). Phylogenies were viewed to identify the distribution of genotypes throughout the tree, different lineages, and signals of recombination using the nextstrain view command (Hadfield et al., 2018). The clock rate of the phylogeny based on spike sequences for each isolate (as shown in Figure 1 and Figure 1—figure supplement 2) was 0.0005 substitutions per nucleotide site per year for OC43, 0.0006 for 229E, 0.0007 for NL63, and 0.0062 for HKU1. All NL63 and HKU1 trees were rooted on an outgroup sequence. For NL63, the outgroup was 229e/AF304460/229e_ref/Germany/2000 and for HKU1 the outgroup was mhv/NC_048217_1/mhv/2006. Clock rates for the phylogenies built on each individual gene can be found within the results/ directory within each HCoV parent directory (ex: oc43/results/branch_lengths_oc43_spike.json).

Mutation counting

Amino acid substitutions at each position in spike were tallied from the phylogeny. In other words, the phylogenetic reconstruction of spike sequences returns nucleotides changes to the ancestral sequence along each branch. The number of times this changed amino acid identity at each position was tallied. This analysis was conducted using code in antigenic_evolution/site_mutation_rank.ipynb.

Divergence analysis

For each HCoV lineage and each gene, synonymous and nonsynonymous divergence was calculated at all time points as the average Hamming distance between each sequenced isolate and the consensus sequence at the first time point (founder sequence). The total number of observed differences between the isolate and founder nucleotide sequences that result in nonsynonymous (or synonymous) substitutions is divided by the number of possible nucleotide mutations that result in nonsynonymous (or synonymous) substitutions, weighted by kappa, to yield an estimate of divergence. Kappa is the ratio of rates of transitions:transversions and was calculated by averaging values from spike and RdRp trees built by BEAST 2.6.3 (Bouckaert et al., 2019) using the HKY+gamma4 model with two partitions and ‘coalescent constant population’. All BEAST results are found in. log files in gene- and HCoV-specific subdirectories within beast/. Divergence is calculated from nucleotide alignments. Sliding 3-year windows were used and only time points that contained at least two sequences were considered. The concept for this analysis is from Zanini et al., 2015 and code for our adaptation is in antigenic_evolution/divergence_weighted.ipynb. The ratios of divergence shown in Figure 3—figure supplement 2 are also calculated in this notebook.

Calculation of dN/dS

A dN/dS value was calculated for RdRp, spike, S1 and S2 of each HCoV using the Datamonkey (Weaver et al., 2018) implementation of MEME (Murrell et al., 2012). Aligned FASTA files (ex: oc43/results/aligned_oc43_rdrp.fasta) were uploaded to Datamonkey (http://datamonkey.org/meme) and dN/dS value was recorded as the calculated Global MG94xREV model non-synonymous/synonymous rate ratio.

Implementation of the Bhatt method

The rate of adaptive evolution was computed using an adaptation of the Bhatt method (Bhatt et al., 2011; Bhatt et al., 2010). For each lineage and each genomic region, we partitioned all available sequences into sliding 3-year windows and only used time points that contained at least three sequences in the analysis. We compared nucleotide sequences at each time point (the ingroup) to the consensus nucleotide sequence at the first time point (the outgroup). Eight estimators (silent fixed, replacement fixed, silent high frequency, replacement high frequency, silent mid-frequency, replacement mid-frequency, silent low frequency, and replacement low-frequency) are calculated by the site-counting method (Bhatt et al., 2010). In the site-counting method, each estimator is the product of the fixation or polymorphism score times the silent or replacement score, summed for each site in that frequency class. Fixation and polymorphism scores depend on the number of different nucleotides observed at the site and whether the outgroup base is present in the ingroup. Selectively neutral sites are assumed to contain the classes of silent polymorphisms and replacement polymorphisms occurring at a frequency between 0.15 and 0.75. A class of nonneutral, adaptive sites is then identified as having an excess of replacement fixations or polymorphisms (Bhatt et al., 2011). For each lineage and gene, 100 bootstrap alignments and ancestral sequences were generated and run through the Bhatt method to assess the statistical uncertainty of our estimates of rates of adaptation (Bhatt et al., 2011). The rate of adaptation (per codon per year) shown in Figure 5 is calculated by linear regression of the time series values of adaptive substitutions per codon (Figure 4). Our code for implementing the Bhatt method is at antigenic_evolution/bhatt_bootstrapping.ipynb.

Estimation of rates of adaptation of measles and influenza viruses

Influenza and measles alignments were generated by running Nextstrain the respective Nextstrain builds from https://github.com/nextstrain/seasonal-flu and https://github.com/nextstrain/measles (Hadfield et al., 2018; Nextstrain, RRID:SCR_018223). The seasonal influenza build was run with 20-year resolution for H3N2, H1N1pdm, Vic, and Yam. The rates of adaptation of different genes was calculated using our implementation of the Bhatt method described above. The receptor-binding domain used for influenza was HA1, for measles was the H protein, and for the HCoVs was S1. The membrane fusion protein used for influenza was HA2, for measles was the F protein, and for the HCoVs was S2. The polymerase for influenza was PB1, for measles was the L protein, and for the HCoVs was RbRd (nsp12). Our code for this analysis is at antigenic_evolution/bhatt_nextstrain.ipynb.

Simulation of evolving OC43 sequences

The evolution of OC43 lineage A Spike and RdRp genes was simulated using SANTA-SIM (Jariani et al., 2019). The OC43 lineage A root sequence was used as a starting point and the simulation was run for 500 generations and 10 simulated sequences were sampled every 50 generations. The spike and RdRp genes were simulated separately. Purifying selection was simulated across both genes. Evolution was simulated in the absence of recombination and with moderate and high levels of recombination during replication. Under each of these recombination paradigms, we simulated evolution in the absence of positive selection within spike and with moderate and high levels of positive selection. Positive selection was simulated through exposure-dependent selection at a subset of spike S1 sites proportional to the number of epitope sites in H3N2 HA (Luksza and Lässig, 2014). The simulated selection allows mutations in these ‘epitope’ sites to rise in frequency while also encouraging ‘epitopes’ to change over time (to mimic antigenic novelty). All simulations were run with a nucleotide mutation rate of 1 × 10−4 (Vijgen et al., 2005). Config files, results, and source code for these simulations can be at santa-sim_oc43a/ and the Bhatt method is implemented on the simulated data in antigenic_evolution/bhatt_simulated_oc43_data.ipynb.

Estimation of TMRCA

Mean TMRCA values were estimated for each gene and each HCoV using PACT (Bedford et al., 2011). Briefly, PACT computes TMRCA values by creating a series of subtrees that include only tips positioned within a temporal slice of the full tree and finding the common ancestor of these tips. The overall mean and 95% confidence intervals were calculated from the list of TMRCA values in these time slices. The PACT config files and results for each run are in the directory antigenic_evolution/pact/. The TMRCA estimations and subsequent analyses are executed by code in antigenic_evolution/tmrca_pact.ipynb.

Acknowledgements

We thank Jesse Bloom and members of the Bedford lab for useful feedback. KEK was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1762114. TB is a Pew Biomedical Scholar and is supported by NIH R35 GM119774-01.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Kathryn E Kistler, Email: kistlerk@uw.edu.

Daniel B Weissman, Emory University, United States.

Patricia J Wittkopp, University of Michigan, United States.

Funding Information

This paper was supported by the following grants:

  • National Science Foundation Graduation Research Fellowship Program to Kathryn E Kistler.

  • Pew Charitable Trusts Pew Biomedical Scholar to Trevor Bedford.

  • National Science Foundation DGE-1762114 to Kathryn E Kistler.

  • Pew Charitable Trusts NIH R35 GM119774-01 to Trevor Bedford.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Data curation, Software, Formal analysis, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing.

Conceptualization, Supervision, Funding acquisition, Validation, Methodology, Writing - review and editing.

Additional files

Transparent reporting form

Data availability

All data used in this study can be found at https://www.viprbrc.org/ and is archived as .fasta files in the following subdirectories of the Github repository for this project: https://github.com/blab/seasonal-cov-adaptive-evolution/tree/master/229e/data, https://github.com/blab/seasonal-cov-adaptive-evolution/tree/master/oc43/data, https://github.com/blab/seasonal-cov-adaptive-evolution/tree/master/nl63/data, https://github.com/blab/seasonal-cov-adaptive-evolution/tree/master/hku1/data (copy archived at https://archive.softwareheritage.org/swh:1:rev:83721cd000f2848d4f77d2a6da8c2d0df8a555a1).

References

  1. Bedford T, Cobey S, Pascual M. Strength and tempo of selection revealed in viral gene genealogies. BMC Evolutionary Biology. 2011;11:220. doi: 10.1186/1471-2148-11-220. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Bedford T, Suchard MA, Lemey P, Dudas G, Gregory V, Hay AJ, McCauley JW, Russell CA, Smith DJ, Rambaut A. Integrating influenza antigenic dynamics with molecular evolution. eLife. 2014;3:e01914. doi: 10.7554/eLife.01914. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bhatt S, Katzourakis A, Pybus OG. Detecting natural selection in RNA virus populations using sequence summary statistics. Infection, Genetics and Evolution. 2010;10:421–430. doi: 10.1016/j.meegid.2009.06.001. [DOI] [PubMed] [Google Scholar]
  4. Bhatt S, Holmes EC, Pybus OG. The genomic rate of molecular adaptation of the human influenza A virus. Molecular Biology and Evolution. 2011;28:2443–2451. doi: 10.1093/molbev/msr044. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bouckaert R, Vaughan TG, Barido-Sottani J, Duchêne S, Fourment M, Gavryushkina A, Heled J, Jones G, Kühnert D, De Maio N, Matschiner M, Mendes FK, Müller NF, Ogilvie HA, du Plessis L, Popinga A, Rambaut A, Rasmussen D, Siveroni I, Suchard MA, Wu CH, Xie D, Zhang C, Stadler T, Drummond AJ. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLOS Computational Biology. 2019;15:e1006650. doi: 10.1371/journal.pcbi.1006650. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Chibo D, Birch C. Analysis of human coronavirus 229E spike and nucleoprotein genes demonstrates genetic drift between chronologically distinct strains. Journal of General Virology. 2006;87:1203–1208. doi: 10.1099/vir.0.81662-0. [DOI] [PubMed] [Google Scholar]
  7. Drake JW. Rates of spontaneous mutation among RNA viruses. PNAS. 1993;90:4171–4175. doi: 10.1073/pnas.90.9.4171. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Edridge AWD, Kaczorowska J, Hoste ACR, Bakker M, Klein M, Loens K, Jebbink MF, Matser A, Kinsella CM, Rueda P, Ieven M, Goossens H, Prins M, Sastre P, Deijs M, van der Hoek L. Seasonal coronavirus protective immunity is short-lasting. Nature Medicine. 2020;26:1691–1693. doi: 10.1038/s41591-020-1083-1. [DOI] [PubMed] [Google Scholar]
  9. Fulton BO, Sachs D, Beaty SM, Won ST, Lee B, Palese P, Heaton NS. Mutational analysis of measles virus suggests constraints on antigenic variation of the glycoproteins. Cell Reports. 2015;11:1331–1338. doi: 10.1016/j.celrep.2015.04.054. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher RA. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018;34:4121–4123. doi: 10.1093/bioinformatics/bty407. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Hamre D, Beem M. Virologic studies of acute respiratory disease in young adults. V. coronavirus 229E infections during six years of surveillance. American Journal of Epidemiology. 1972;96:94–106. doi: 10.1093/oxfordjournals.aje.a121445. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Heikkinen T, Järvinen A. The common cold. The Lancet. 2003;361:51–59. doi: 10.1016/S0140-6736(03)12162-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Hofmann H, Simmons G, Rennekamp AJ, Chaipan C, Gramberg T, Heck E, Geier M, Wegele A, Marzi A, Bates P, Pöhlmann S. Highly conserved regions within the spike proteins of human coronaviruses 229E and NL63 determine recognition of their respective cellular receptors. Journal of Virology. 2006;80:8639–8652. doi: 10.1128/JVI.00560-06. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Hon CC, Lam TY, Shi ZL, Drummond AJ, Yip CW, Zeng F, Lam PY, Leung FC. Evidence of the recombinant origin of a bat severe acute respiratory syndrome (SARS)-like coronavirus and its implications on the direct ancestor of SARS coronavirus. Journal of Virology. 2008;82:1819–1826. doi: 10.1128/JVI.01926-07. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Hulswit RJG, Lang Y, Bakkers MJG, Li W, Li Z, Schouten A, Ophorst B, van Kuppeveld FJM, Boons GJ, Bosch BJ, Huizinga EG, de Groot RJ. Human coronaviruses OC43 and HKU1 bind to 9-O-acetylated sialic acids via a conserved receptor-binding site in spike protein domain A. PNAS. 2019;116:2681–2690. doi: 10.1073/pnas.1809667116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Jariani A, Warth C, Deforche K, Libin P, Drummond AJ, Rambaut A, Matsen Iv FA, Theys K. SANTA-SIM: simulating viral sequence evolution dynamics under selection and recombination. Virus Evolution. 2019;5:vez003. doi: 10.1093/ve/vez003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research. 2002;30:3059–3066. doi: 10.1093/nar/gkf436. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Kistler K. seasonal-cov-adaptive-evolution. swh:1:rev:83721cd000f2848d4f77d2a6da8c2d0df8a555a1Software Heritage. 2021 https://archive.softwareheritage.org/swh:1:dir:feb61594144d5787d91049172267d925b738079b;origin=https://github.com/blab/seasonal-cov-adaptive-evolution;visit=swh:1:snp:db7c97d981938da69e10585d540a30e869537903;anchor=swh:1:rev:83721cd000f2848d4f77d2a6da8c2d0df8a555a1/
  19. Kiyuka PK, Agoti CN, Munywoki PK, Njeru R, Bett A, Otieno JR, Otieno GP, Kamau E, Clark TG, van der Hoek L, Kellam P, Nokes DJ, Cotten M. Human coronavirus NL63 molecular epidemiology and evolutionary patterns in rural coastal Kenya. The Journal of Infectious Diseases. 2018;217:1728–1739. doi: 10.1093/infdis/jiy098. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Komabayashi K, Matoba Y, Tanaka S, Seto J, Aoki Y, Ikeda T, Shimotai Y, Matsuzaki Y, Itagaki T, Mizuta K. Longitudinal epidemiology of human coronavirus OC43 in Yamagata, Japan, 2010–2017: Two Groups Based on Spike Gene Appear One after Another. Journal of Medical Virology. 2020;7:825. doi: 10.1002/jmv.26361. [DOI] [PubMed] [Google Scholar]
  21. Kosakovsky Pond SL, Posada D, Gravenor MB, Woelk CH, Frost SD. Automated phylogenetic detection of recombination using a genetic algorithm. Molecular Biology and Evolution. 2006;23:1891–1901. doi: 10.1093/molbev/msl051. [DOI] [PubMed] [Google Scholar]
  22. Köster J, Rahmann S. Snakemake--a scalable bioinformatics workflow engine. Bioinformatics. 2012;28:2520–2522. doi: 10.1093/bioinformatics/bts480. [DOI] [PubMed] [Google Scholar]
  23. Krammer F. SARS-CoV-2 vaccines in development. Nature. 2020;586:516–527. doi: 10.1038/s41586-020-2798-3. [DOI] [PubMed] [Google Scholar]
  24. Kucharski AJ, Lessler J, Read JM, Zhu H, Jiang CQ, Guan Y, Cummings DA, Riley S. Estimating the life course of influenza A(H3N2) antibody responses from cross-sectional data. PLOS Biology. 2015;13:e1002082. doi: 10.1371/journal.pbio.1002082. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Lau SK, Lee P, Tsang AK, Yip CC, Tse H, Lee RA, So LY, Lau YL, Chan KH, Woo PC, Yuen KY. Molecular epidemiology of human coronavirus OC43 reveals evolution of different genotypes over time and recent emergence of a novel genotype due to natural recombination. Journal of Virology. 2011;85:11325–11337. doi: 10.1128/JVI.05512-11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Li F. Structure, function, and evolution of coronavirus spike proteins. Annual Review of Virology. 2016;3:237–261. doi: 10.1146/annurev-virology-110615-042301. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Li Z, Tomlinson AC, Wong AH, Zhou D, Desforges M, Talbot PJ, Benlekbir S, Rubinstein JL, Rini JM. The human coronavirus HCoV-229E S-protein structure and receptor binding. eLife. 2019;8:e51230. doi: 10.7554/eLife.51230. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Liu L, Wang P, Nair MS, Yu J, Rapp M, Wang Q, Luo Y, Chan JF, Sahi V, Figueroa A, Guo XV, Cerutti G, Bimela J, Gorman J, Zhou T, Chen Z, Yuen KY, Kwong PD, Sodroski JG, Yin MT, Sheng Z, Huang Y, Shapiro L, Ho DD. Potent neutralizing antibodies against multiple epitopes on SARS-CoV-2 spike. Nature. 2020a;584:450–456. doi: 10.1038/s41586-020-2571-7. [DOI] [PubMed] [Google Scholar]
  29. Liu DX, Liang JQ, Fung TS. Human Coronavirus-229E, -OC43, -NL63, and -HKU1. Reference Module in Life Sciences. 2020b;342:339–346. doi: 10.1016/B978-0-12-809633-8.21501-X. [DOI] [Google Scholar]
  30. Luksza M, Lässig M. A predictive fitness model for influenza. Nature. 2014;507:57–61. doi: 10.1038/nature13087. [DOI] [PubMed] [Google Scholar]
  31. Mattenberger F. seasonal-cov-adaptive-evolution. swh:1:rev:83721cd000f2848d4f77d2a6da8c2d0df8a555a1Software Heritage. 2021 https://archive.softwareheritage.org/swh:1:dir:feb61594144d5787d91049172267d925b738079b;origin=https://github.com/blab/seasonal-cov-adaptive-evolution;visit=swh:1:snp:db7c97d981938da69e10585d540a30e869537903;anchor=swh:1:rev:83721cd000f2848d4f77d2a6da8c2d0df8a555a1/
  32. McIntosh K. In: Current Topics in Microbiology and Immunology / Ergebnisse Der Mikrobiologie Und Immunitätsforschung. Koprowski H, Arber W, editors. Berlin: Springer; 1974. Coronaviruses: A Comparative Review; pp. 85–129. [DOI] [Google Scholar]
  33. Monto AS, Lim SK. The tecumseh study of respiratory illness. VI. Frequency of and relationship between outbreaks of coronavims infection. Journal of Infectious Diseases. 1974;129:271–276. doi: 10.1093/infdis/129.3.271. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Murrell B, Wertheim JO, Moola S, Weighill T, Scheffler K, Kosakovsky Pond SL. Detecting individual sites subject to episodic diversifying selection. PLOS Genetics. 2012;8:e1002764. doi: 10.1371/journal.pgen.1002764. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution. 2015;32:268–274. doi: 10.1093/molbev/msu300. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Pasternak AO, Spaan WJM, Snijder EJ. Nidovirus transcription: how to make sense…? Journal of General Virology. 2006;87:1403–1421. doi: 10.1099/vir.0.81611-0. [DOI] [PubMed] [Google Scholar]
  37. Pickett BE, Sadat EL, Zhang Y, Noronha JM, Squires RB, Hunt V, Liu M, Kumar S, Zaremba S, Gu Z, Zhou L, Larson CN, Dietrich J, Klem EB, Scheuermann RH. ViPR: an open bioinformatics database and analysis resource for virology research. Nucleic Acids Research. 2012;40:D593–D598. doi: 10.1093/nar/gkr859. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Rambaut A, Pybus OG, Nelson MI, Viboud C, Taubenberger JK, Holmes EC. The genomic and epidemiological dynamics of human influenza A virus. Nature. 2008;453:615–619. doi: 10.1038/nature06945. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Reed SE. The behaviour of recent isolates of human respiratory coronavirus in vitro and in volunteers: Evidence of heterogeneity among 229E-related strains. Journal of Medical Virology. 1984;13:179–192. doi: 10.1002/jmv.1890130208. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Ren L, Zhang Y, Li J, Xiao Y, Zhang J, Wang Y, Chen L, Paranhos-Baccalà G, Wang J. Genetic drift of human coronavirus OC43 spike gene during adaptive evolution. Scientific Reports. 2015;5:11451. doi: 10.1038/srep11451. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Sagulenko P, Puller V, Neher RA. TreeTime: maximum-likelihood phylodynamic analysis. Virus Evolution. 2018;4:vex042. doi: 10.1093/ve/vex042. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Smith DJ, Lapedes AS, de Jong JC, Bestebroer TM, Rimmelzwaan GF, Osterhaus AD, Fouchier RA. Mapping the antigenic and genetic evolution of influenza virus. Science. 2004;305:371–376. doi: 10.1126/science.1097211. [DOI] [PubMed] [Google Scholar]
  43. van Dorp L, Richard D, Tan CCS, Shaw LP, Acman M, Balloux F. No evidence for increased transmissibility from recurrent mutations in SARS-CoV-2. Nature Communications. 2020;11:5986. doi: 10.1038/s41467-020-19818-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Vijgen L, Lemey P, Keyaerts E, Van Ranst M. Genetic variability of human respiratory coronavirus OC43. Journal of Virology. 2005;79:3223–3225. doi: 10.1128/JVI.79.5.3223-3225.2005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Volz EM, Koelle K, Bedford T. Viral phylodynamics. PLOS Computational Biology. 2013;9:e1002947. doi: 10.1371/journal.pcbi.1002947. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Weaver S, Shank SD, Spielman SJ, Li M, Muse SV, Kosakovsky Pond SL. Datamonkey 2.0: a modern web application for characterizing selective and other evolutionary processes. Molecular Biology and Evolution. 2018;35:773–777. doi: 10.1093/molbev/msx335. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Woo PCY, Lau SKP, Huang Y, Yuen K-Y. Coronavirus diversity, phylogeny and interspecies jumping. Experimental Biology and Medicine. 2009;234:1117–1127. doi: 10.3181/0903-MR-94. [DOI] [PubMed] [Google Scholar]
  48. Yang Z. Maximum likelihood estimation on large phylogenies and analysis of adaptive evolution in human influenza virus A. Journal of Molecular Evolution. 2000;51:423–432. doi: 10.1007/s002390010105. [DOI] [PubMed] [Google Scholar]
  49. Zanini F, Brodin J, Thebo L, Lanz C, Bratt G, Albert J, Neher RA. Population genomics of intrapatient HIV-1 evolution. eLife. 2015;4:e11282. doi: 10.7554/eLife.11282. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Zhang Y, Li J, Xiao Y, Zhang J, Wang Y, Chen L, Paranhos-Baccalà G, Ren L, Wang J. Genotype shift in human coronavirus OC43 and emergence of a novel genotype by natural recombination. Journal of Infection. 2015;70:641–650. doi: 10.1016/j.jinf.2014.12.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Zhang S, Zhou P, Wang P, Li Y, Jiang L, Jia W, Wang H, Fan A, Wang D, Shi X, Fang X, Hammel M, Wang S, Wang X, Zhang L. Structural definition of a unique neutralization epitope on the Receptor-Binding domain of MERS-CoV spike glycoprotein. Cell Reports. 2018;24:441–452. doi: 10.1016/j.celrep.2018.06.041. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Zhou H, Chen Y, Zhang S, Niu P, Qin K, Jia W, Huang B, Zhang S, Lan J, Zhang L, Tan W, Wang X. Structural definition of a neutralization epitope on the N-terminal domain of MERS-CoV spike glycoprotein. Nature Communications. 2019;10:3068. doi: 10.1038/s41467-019-10897-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Zhu Y, Li C, Chen L, Xu B, Zhou Y, Cao L, Shang Y, Fu Z, Chen A, Deng L, Bao Y, Sun Y, Ning L, Liu C, Yin J, Xie Z, Shen K. A novel human coronavirus OC43 genotype detected in mainland China. Emerging Microbes & Infections. 2018;7:1–4. doi: 10.1038/s41426-018-0171-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision letter

Editor: Daniel B Weissman1
Reviewed by: Daniel B Weissman2

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

Coronavirus adaptation to immune responses is a very timely and important topic. This paper demonstrates that some of the seasonal coronavirus strains responsible for common colds have undergone substantial recent adaptation in a protein that is likely targeted by the immune system. However, they have not been adapting as quickly as similar proteins in the influenza virus, and at least one seasonal coronavirus strain has undergone very little adaptation in this protein.

Decision letter after peer review:

Congratulations, we are pleased to inform you that your article, "Evidence for adaptive evolution in the receptor-binding domain of seasonal coronaviruses", has been accepted for publication in eLife.

Please take note of the points below and we hope you will continue to support eLife. The main point that all the reviewers agreed on is that it would be helpful to compare your results to those that would be obtained with a standard dN/dS approach.

Reviewer #1:

The authors use dN/dS-based statistics to show that the S1 domain has been adapting in coronaviruses OC43 and 229E, but not in NL63. For at least OC43 and 229E, this agrees qualitatively with earlier studies. The authors should make a detailed, quantitative comparison with these earlier results.

1) I had a hard time understanding exactly what was going on in the simulations, and whether, e.g., they produced phylogenies that looked like the real ones. But perhaps the simulations should be cut or de-emphasized. From my understanding of the Bhatt et al. papers and the authors' code, their selection-detection method is entirely site-based and doesn't use the tree at all, and so should be insensitive to recombination a priori. If this is correct, the simulations are largely unnecessary (but see the next point).

2) It looks like there is substantial geographic population structure, as well as time-varying geographic biases in sampling. This seems like it could generate false signals of substitutions, where standing spatial variation appears to be temporal variation. Can the authors bound how large this contribution could be? If not, perhaps they could still say that they've found signatures of either global positive selection or local adaptation, both of which are interesting.

3) Why is synonymous divergence also much higher in OC43 S1 than it is in RDRP or S2?

4) As I mention in the general assessment above, the authors should compare their results to previous studies. This includes both overall substitution rates as well as specific sites found to have repeated substitutions. For NL63, how does the null result compare to what's known? I believe Kiyuka et al., 2018, found widespread reinfection by similar NL63 genotypes, which may be relevant.

Reviewer #2:

The paper by Kistler and Bedford explores whether adaptive evolution has led to diversification of coronaviruses responsible for the common cold in the human population. The paper is very well written and presents evidence for adaptive evolution in some strains and lack of evidence in others. I enjoyed reading the paper.

Reviewer #3:

This paper analyzing potential adaptive evolution in seasonal coronaviruses ("common colds") is highly relevant with well-supported conclusions. Its results will be widely applicable to myriad fields, including evolutionary biology, epidemiology and public health, vaccinology, immunology, and virology.

Kistler and Bedford present a timely and highly relevant analysis of adaptive evolution in seasonal "common cold" coronaviruses. Overall, I find the research compelling, well-performed, and mostly well-presented. The research contains sufficient statistical rigor with commendable computational reproducibility to be considered highly reliable. The authors conclude that at least two of the four known common cold coronaviruses have been undergoing adaptive evolution in their human hosts. These results may shed light on the emerging long-term evolutionary dynamics of SARS-CoV-2, the causative agent of COVID-19, as it continues circulating in humans, as well as informing ongoing vaccine design.

Kistler and Bedford present a timely and highly relevant analysis of adaptive evolution in seasonal "common cold" coronaviruses. Overall, I find the research compelling, well-performed, and mostly well-presented (though some organizational changes are needed). As always, the commitment to open code and data from the Bedford lab is admirable and successfully performed/communicated. In my comments below, I offer advice to improve the clarity and presentation of the paper, with a few small requested analyses or need for further explanations.

1) "We have arbitrarily labeled these lineages 'A' and 'B' (Figure 1)." The figure then shows A/B panels for different hCoV strains, which is a little confusing at first until you orient to the figure presentation. I recommend labeling the phylogeny in panel A with "A" and "B" rather than just using colors, and including that lineage information in the legend.

2) The legend for Figure 1 needs units for the clock rate. Presumably this is the codon sub rate per year, which is used elsewhere? Or is this nucleotide? Similarly, units are also needed:…

– Subsection “Rate of Adaptation in RdRp and subdomains of spike”, for the parenthetical "(or 0.45 adaptive substitutions each year)". Are the authors converting to nucleotide?

– Subsection “Phylogenetic Inference”.

3) In general, I find the references to supplementary figures in the text confusing. For example, I first read the phrase "Figure 1 Supplement 1A" to mean both Figure 1 and Supplement 1A. Writing this as, "Figure 1—figure supplement 1A" will make it more clear that Figure 1 of the main text is not being referenced.

4) Results, second paragraph – sentences are not well ordered. Should be in order: 1) Though…, 2) Because…, 3) This….

5) The last two sentences in the second paragraph of the Results seem tacked on and not immediately relevant to recombination. Please move these sentences or include a paragraph break.

6) Related to the previous comment, the Results section as a whole will benefit from improved organization, specifically by creating subsections. I highly recommend adding these to improve readability.

7) Comments for Figure 2:

– Unless it becomes too busy, small indicators (or at least in the legend to avoid figure noise) might be added to emphasize the RBD domain within S1 specifically.

– I recommend changing the color of the asterisks in panel A to match the lineage A (red) color.

8) The authors write, "from that lineage's common ancestor." It would be more precise to say the "from that lineage's most recent common ancestor."

9) Figure 3, specifically in panel A for RdRp, leaves some ambiguous interpretations: Is the line missing for OC43 lineage B because there is no RdRp data after the early 1990s (seems unlikely?), or because there were no adaptive substitutions, in which case the orange lines should remain steady at 0? This aspect of the figure should be clarified or fixed.

There is a similar situation for panel C, HKU1 lineage B, in Figure 2—figure supplement 1. In addition, the "C" for that panel is cut off at the bottom, so this figure needs to be slightly reformatted.

10) In the fifth paragraph of the Results, the authors introduce the H3N2 analysis, but the actual analysis is not really presented or described for another two pages. The H3N2 comparison is definitely not part of Figure 4, which this paragraph is introducing. I think this sentence is likely misplaced? Again, this comment shows that adding subsections to the Results section will be helpful for overall organization.

11) Results, I would like to see more details about what constitutes an “adaptive substitution” in the Bhatt method, which is not as widely used as dN/dS, within the main text itself. A couple additional sentences briefly and "birds eye view" explaining what constitutes adaptive will help orient readers. The easiest way to this – just move "Briefly,…each of these timepoints." to the Results section.

12) Jumping off of the last comment – why wasn't dN/dS done? Given that dN/dS is more commonly applied, I think a comparison of these results to standard dN/dS analysis is merited. In fact, including a dN/dS analysis may bolster the authors' overall conclusions and/or contribute to justifying using the Bhatt method, especially if dN/dS is not sufficient sensitive for this data.

13) In general, the authors should clarify their use of the terms "positive selection" and "adaptive substitutions." The former is traditionally associated with interpreting dN/dS, which isn't calculated in the manuscript, and the latter term is more mechanistically-oriented regarding effects of mutations. Therefore, what is meant by "positive selection" in the simulations, and how does this definition/implementation compare to the authors' measurements of "adaptive substitutions"?

14) Figure 7 and its associated analysis raised some concerns for me. It seems like only simulation 5 replicates were performed for each condition. Is there a reason so few simulations were performed (e.g. too computationally expensive?). Further, mean and CI bars for only 5 replicates in Figure 7 gives the impression that there are more than 5 replicates. A strip plot would be more forthcoming about the analyses conducted here, and some additional explanation about why only 5 replicates were performed per condition would help.

15) Table 1 and its associated analysis:

– CI's or some measure of statistical bounds should be included in Table 1.

– Where is OC43B in the table? Was the analysis not performed on this lineage, and if so why not?

– The authors motivate this analysis by explaining how TMRCA is meaningful for H3N2. Can the authors perform this analysis for H3N2 proteins as well to provide further context for the HCoV values, just as they did for these analyses associated with Figure 5?

16) Results, tenth paragraph. To motivate this analysis, the authors may also wish to cite this paper, co-authored by Trevor, that uses a downsampling strategy from empirical to study 2009 H1N1 dynamics, and shows time dependency in evolutionary metrics. https://bedford.io/papers/meyer-time-dependence/

– In addition, why did the authors use simulated data here? If we have HCoV sequence data since at least the 1990s, it seems possible to have used real data here. Further explanation/justification is therefore needed.

– All that said, looking at the CI's (assuming these are CI's – the legend needs to add this info) in Figure 7—figure supplement 1, the bounds across time points are often overlapping. One might expect that CIs would be wider as sample size decreases, which is not always the case. To my eye, the only panel in this figure that truly shows the authors' conclusion is the "no recombination/high positive selection" panel.

17) Please add a reference in the third paragraph of the Discussion about transmissibility/pathology correlates.

18) Subsection “Phylogenetic Inference”: IG-Tree typo should be IQ-TREE. In addition, The authors may also wish to confirm on their own end which IQ-TREE version was used – a major version 2 was released in 2020 and has a different citation. Either way, please indicate the IQ-TREE version used and make sure the citation is right for whichever was used.

19) Figure 2—figure supplement 1C – please choose different colors. Since some transparency is used for points, it's very hard to distinguish precisely light from dark purple.

20) Grammar and spelling:

– There should be a comma after 229E (as in, "…two species of HCoV, OC43 and 229E, were identified…")

– "Some human respiratory illness…" is a runon sentence. Please add a comma before ",while others,".

– Legend of Figure 1: Please add a comma at "…for each viral gene, and those…"

– Typo, "spikenand" → "spike and"

– Please add a comma before ", while the rate of adaptation…"

– Discussion, third paragraph, again a comma is needed before "while."

eLife. 2021 Jan 19;10:e64509. doi: 10.7554/eLife.64509.sa2

Author response


Reviewer #1:

The authors use dN/dS-based statistics to show that the S1 domain has been adapting in coronaviruses OC43 and 229E, but not in NL63. For at least OC43 and 229E, this agrees qualitatively with earlier studies. The authors should make a detailed, quantitative comparison with these earlier results.

1) I had a hard time understanding exactly what was going on in the simulations, and whether, e.g., they produced phylogenies that looked like the real ones. But perhaps the simulations should be cut or de-emphasized. From my understanding of the Bhatt et al. papers and the authors' code, their selection-detection method is entirely site-based and doesn't use the tree at all, and so should be insensitive to recombination a priori. If this is correct, the simulations are largely unnecessary (but see the next point).

We agree that Bhatt method should be robust to recombination since it is a site-based method. However, we think the simulated data in Figure 7 provides a useful confirmation of this in addition to providing a framework to test how limited longitudinal sequencing data affects the estimated rate of adaptation (Figure 7—figure supplement 1). In order to provide clarity, we have added Figure 7—figure supplement 2, which shows representative phylogenies for each of the recombination/selection. When presenting early iterations of this work to colleagues, we received repeated questions about potential bias from recombination and so although the Bhatt method is indeed robust to recombination (in theory and in practice), we feel it’s still useful to show this robustness explicitly.

2) It looks like there is substantial geographic population structure, as well as time-varying geographic biases in sampling. This seems like it could generate false signals of substitutions, where standing spatial variation appears to be temporal variation. Can the authors bound how large this contribution could be? If not, perhaps they could still say that they've found signatures of either global positive selection or local adaptation, both of which are interesting.

Examining the phylogenies with isolates colored by country

(https://nextstrain.org/community/blab/seasonal-cov-adaptive-evolution/oc43/spike?c=country)​ , we can see that the tree forms a ladder-like topology where the tips are organized by date, rather than by country. This is especially visible when examining sequences from the 2010’s, where isolates from several countries were sequenced. In these years, isolates frequently appear interspersed within the same temporal cluster rather than forming separate geographic clusters. This suggests that HCoVs are rapidly transmitted globally and this geographic mixing justifies analyzing all isolates together as one population, rather conducting separate analyses on geographically-partitioned data. We have added an explanation of this to the Results section, which reads as follows:

“Additionally, the trees form ladder-like topologies with isolate tips arranged into temporal clusters rather than geographic clusters, indicating a single global population rather than geographically-isolated populations of virus”.

3) Why is synonymous divergence also much higher in OC43 S1 than it is in RDRP or S2?

We are not sure why synonymous divergence differs between genes. To address this concern, we have calculated the ratio of nonsynonymous divergence in spike to nonsynonymous divergence in RdRp and compare this to the equivalent ratio of synonymous divergence. We also compute the same ratios for S1 versus S2. We find that the nonsynonymous divergence ratio is higher than the synonymous divergence ratio. This indicates that, while spike synonymous divergence is often higher than RdRp synonymous divergence, this difference is much smaller than the difference in nonsynonymous divergences. We take this to be a convincing result showing that nonsynonymous divergence in spike is indeed higher than nonsynonymous divergence in RdRp (and the same for the comparison of S1 to S2). We have added Figure 3—figure supplement 2 showing these results, and have inserted the following text in the Results section to acknowledge this observation:

“Though we would expect synonymous divergence to be equivalent in all areas of the genome, this is not born out in our results. […] Thus, despite differences in synonymous divergence, spike is accumulating more relatively more nonsynonymous divergence than RdRp.”

4) As I mention in the general assessment above, the authors should compare their results to previous studies. This includes both overall substitution rates as well as specific sites found to have repeated substitutions. For NL63, how does the null result compare to what's known? I believe Kiyuka et al., 2018, found widespread reinfection by similar NL63 genotypes, which may be relevant.

Thanks for pointing us towards the Kiyuka et al. paper – we have included a brief discussion of this paper in the Introduction and Discussion sections. Kiyuka et al. designate multiple NL63 genotypes and show that each genotype circulates within Kilifi Kenya for a relatively long period of time. Additionally, people become reinfected by the same genotype and Kiyuka et al. observe that NL63 reinfections are often enhanced by a previous infection. Kiyuka et al. hypothesize that enhanced reinfection signals that NL63 is exploiting an immune response and that epitopes would actually be under purifying selection because of this. Whether or not this hypothesis is correct, their observations of persistent circulation of NL63 genotypes and reinfection by the same genotype fit well with our results showing a lack of adaptive evolution in NL63 S1.

“Our finding fits with a study of NL63 in Kenya, which identified multiple genotypes of NL63 and show that people regularly become reinfected with the same genotype of NL63 (Kiyuka et al., 2018). […] Though our results cannot explain why OC43 and 229E likely evolve antigenically while NL63 does not, Kiyuka et al. observe that NL63 reinfections are sometimes enhanced by a previous infection and hypothesize that NL63 is actually under purifying selection at epitope sites (Kiyuka et al., 2018).”

We have also added a sentence in the Results comparing our calculation of dN/dS in OC43 spike to the estimates reported in Ren et al., 2015. The Chibo and Birch, 2006 paper states that “a probability of 1.0 was obtained when testing for positive selection”, but does not elaborate on specific dN/dS values.

“Our estimate of dN/dS in OC43 spike is similar to the previously reported estimate of roughly 0.3 (Ren et al., 2015).”

Reviewer #3:

[…] Kistler and Bedford present a timely and highly relevant analysis of adaptive evolution in seasonal "common cold" coronaviruses. Overall, I find the research compelling, well-performed, and mostly well-presented (though some organizational changes are needed). As always, the commitment to open code and data from the Bedford lab is admirable and successfully performed/communicated. In my comments below, I offer advice to improve the clarity and presentation of the paper, with a few small requested analyses or need for further explanations.

1) "We have arbitrarily labeled these lineages 'A' and 'B' (Figure 1)." The figure then shows A/B panels for different hCoV strains, which is a little confusing at first until you orient to the figure presentation. I recommend labeling the phylogeny in panel A with "A" and "B" rather than just using colors, and including that lineage information in the legend.

A legend showing the color for each lineage has been added to Figure 1 and the associated figure supplements.

2) The legend for Figure 1 needs units for the clock rate. Presumably this is the codon sub rate per year, which is used elsewhere? Or is this nucleotide? Similarly, units are also needed:

– Subsection “Rate of Adaptation in RdRp and subdomains of spike”, for the parenthetical "(or 0.45 adaptive substitutions each year)". Are the authors converting to nucleotide?

– Subsection “Phylogenetic Inference”.

The clock rate is given in substitutions per nucleotide site per year. These units have been added to the text. The parenthetical was altered for clarification and now reads: “(or 0.45 adaptive amino acid substitutions in S1 each year)”.

3) In general, I find the references to supplementary figures in the text confusing. For example, I first read the phrase "Figure 1 Supplement 1A" to mean both Figure 1 and Supplement 1A. Writing this as, "Figure 1—figure supplement 1A" will make it more clear that Figure 1 of the main text is not being referenced.

References to figures and supplements have been changed according to eLife convention, so that figures are referred to by (Figure 1) in the text and supplements are referred to by (Figure 1—figure supplement 1).

4) Results, second paragraph – sentences are not well ordered. Should be in order: 1) Though…, 2) Because…, 3) This….

We have restructured this paragraph to, hopefully, make the explanation of how we assigned lineages and completed gene-specific analyses more clear.

5) The last two sentences in the second paragraph of the Results seem tacked on and not immediately relevant to recombination. Please move these sentences or include a paragraph break.

We have introduced a paragraph break as well as changing the text in previous sentences to, hopefully, make this section more clear to the reader.

6) Related to the previous comment, the Results section as a whole will benefit from improved organization, specifically by creating subsections. I highly recommend adding these to improve readability.

We have added subsections to the Results section.

7) Comments for Figure 2:

– Unless it becomes too busy, small indicators (or at least in the legend to avoid figure noise) might be added to emphasize the RBD domain within S1 specifically.

We have added the 229E RBD and the putative OC43 RBD to Figure 2. We have also added a sentence to the Results that explains that we conduct analyses on all of S1 rather than just the RBD within it because neutralizing antibodies have been identified which bind to the N-Terminal Domain (NTD) as well as the RBD.

– I recommend changing the color of the asterisks in panel A to match the lineage A (red) color.

We have changed the layout of this figure to put OC43 lineage A and lineage B in separate panels, which should reduce the confusion due to over-layed data. We have decided to keep the asterisks black so that is clear that they are asterisks rather than data points and hope that separating the 2 lineages into different panels will obviate the need for coloring the asterisks.

8) The authors write, "from that lineage's common ancestor." It would be more precise to say the "from that lineage's most recent common ancestor."

We agree and have changed this sentence to read “most recent common ancestor”.

9) Figure 3, specifically in panel A for RdRp, leaves some ambiguous interpretations: Is the line missing for OC43 lineage B because there is no RdRp data after the early 1990s (seems unlikely?), or because there were no adaptive substitutions, in which case the orange lines should remain steady at 0? This aspect of the figure should be clarified or fixed.

We have added a sentence in the Figure 3 legend to clarify this.

There is a similar situation for panel C, HKU1 lineage B, in Figure 2—figure supplement 1. In addition, the "C" for that panel is cut off at the bottom, so this figure needs to be slightly reformatted.

We have reformatted the supplement to avoid the panels being cut off.

10) In the fifth paragraph of the Results, the authors introduce the H3N2 analysis, but the actual analysis is not really presented or described for another two pages. The H3N2 comparison is definitely not part of Figure 4, which this paragraph is introducing. I think this sentence is likely misplaced? Again, this comment shows that adding subsections to the Results section will be helpful for overall organization.

We have removed the mention of H3N2 in the Introduction of the Bhatt analysis. We now mention calculating rates of adaptation in H3N2, measles and flu B several paragraphs later, when we actually compare these rates to seasonal HCoVs.

11) Results, I would like to see more details about what constitutes an “adaptive substitution” in the Bhatt method, which is not as widely used as dN/dS, within the main text itself. A couple additional sentences briefly and "birds eye view" explaining what constitutes adaptive will help orient readers. The easiest way to this – just move "Briefly,…each of these timepoints." to the Results section.

We agree that it would be useful to orient the reader to the Bhatt method by providing a brief overview of the method when it is introduced in the Results section. We have included this explanation in the Results section as suggested and rearranged the Methods description of the Bhatt method accordingly.

“Briefly, this method defines a class of neutrally-evolving nucleotide sites as those with synonymous mutations or where nonsynonymous polymorphisms occur at medium frequency. […] This method compares nucleotide sequences at each timepoint (the ingroup) to the consensus nucleotide sequence at the first time point (the outgroup) and yields an estimate of the number of adaptive substitutions within a given genomic region at each of these timepoints.”

12) Jumping off of the last comment – why wasn't dN/dS done? Given that dN/dS is more commonly applied, I think a comparison of these results to standard dN/dS analysis is merited. In fact, including a dN/dS analysis may bolster the authors' overall conclusions and/or contribute to justifying using the Bhatt method, especially if dN/dS is not sufficient sensitive for this data.

We originally only included an analysis of the data using the Bhatt method because this method (and implementations of the McDonald-Krietman method in general) is better at distinguishing relaxed selection from positive selection, whereas dN/dS can get confused in these situations. However, while we think dN/dS is less appropriate for detecting selection in RNA virus populations, we agree that including this standard method for comparison is useful. We have included dN/dS estimates for each HCoV and gene in Table 1. As expected, here we see higher dN/dS values for spike than RdRp and higher dN/dS values for S1 than S2.

13) In general, the authors should clarify their use of the terms "positive selection" and "adaptive substitutions." The former is traditionally associated with interpreting dN/dS, which isn't calculated in the manuscript, and the latter term is more mechanistically-oriented regarding effects of mutations. Therefore, what is meant by "positive selection" in the simulations, and how does this definition/implementation compare to the authors' measurements of "adaptive substitutions"?

While the terms “positive selection” and “adaptive evolution” are not synonymous, they are related in that adaptive evolution is a result of positive selection acting on genetic variation. We have added a sentence to the Results section to explain this and aim to use these terms in throughout our paper as they are used in the literature (i.e. “positive selection” to refer to dN/dS results and “adaptive substitutions” to refer to the Bhatt method).

“Mutations that escape from population immunity are beneficial to the virus and so are driven to fixation by positive selection. This results in adaptive evolution of the virus population.”

For the simulations, we specify a fitness function that imitates positive selection at epitope sites, then we use the Bhatt method to detect adaptive evolution, the result of this positive selection. We have added more detail to the Materials and methods section describing how these simulations are done.

“Positive selection was simulated through exposure-dependent selection at a subset of spike S1 sites proportional to the number of epitope sites in H3N2 HA (Luksza and Lässig, 2014). The simulated selection allows mutations in these “epitope” sites to rise in frequency while also encouraging “epitopes” to change over time (to mimic antigenic novelty).”

14) Figure 7 and its associated analysis raised some concerns for me. It seems like only simulation 5 replicates were performed for each condition. Is there a reason so few simulations were performed (e.g. too computationally expensive?). Further, mean and CI bars for only 5 replicates in Figure 7 gives the impression that there are more than 5 replicates. A strip plot would be more forthcoming about the analyses conducted here, and some additional explanation about why only 5 replicates were performed per condition would help.

We have run additional simulations so Figure 7 now displays mean and 95% CI for 10 replicates instead of 5.

15) Table 1 and its associated analysis:

– CI's or some measure of statistical bounds should be included in Table 1.

We have added 95% confidence intervals to the table of TMRCA values (now Table 2).

– Where is OC43B in the table? Was the analysis not performed on this lineage, and if so why not?

We did not perform this analysis on OC43 lineage B because the limited number of RdRp sequences available for this lineage could artificially skew the TMRCA estimates. We have added a sentence to the Results section stating this.

We did not perform this analysis on OC43 lineage B because the limited number of RdRp sequences available for this lineage could artificially skew the TMRCA estimates. We have added a sentence to the Results section stating this.

– The authors motivate this analysis by explaining how TMRCA is meaningful for H3N2. Can the authors perform this analysis for H3N2 proteins as well to provide further context for the HCoV values, just as they did for these analyses associated with Figure 5?

Based on Bedford et al., 2011, our estimates of HCoV TMRCA are roughly 2-2.5 times longer than TMRCA of H3N2 hemagglutinin. We have included this information in the Results section.

16) Results, tenth paragraph. To motivate this analysis, the authors may also wish to cite this paper, co-authored by Trevor, that uses a downsampling strategy from empirical to study 2009 H1N1 dynamics, and shows time dependency in evolutionary metrics. https://bedford.io/papers/meyer-time-dependence/

– In addition, why did the authors use simulated data here? If we have HCoV sequence data since at least the 1990s, it seems possible to have used real data here. Further explanation/justification is therefore needed.

While we could use down-sampling strategy of the empirical data to show a decrease in power with a decrease in data, we think that using simulated data to show this relationship is better because it gives a “truth” to compare to. In other words, in the simulated data we know whether or not there is positive selection on residues within S1 and, therefore, whether or not adaptive substitutions should be detected. Using all​ ​ of the simulated data gives us a “true” rate of adaptation that we can compare other rates of adaptation to. Empirical HCoV data is already somewhat sparse and, therefore, we do not think the rates of adaptation we calculate can be regarded as a “truth”, but rather are estimates.

– All that said, looking at the CI's (assuming these are CI's – the legend needs to add this info) in Figure 7—figure supplement 1, the bounds across time points are often overlapping. One might expect that CIs would be wider as sample size decreases, which is not always the case. To my eye, the only panel in this figure that truly shows the authors' conclusion is the "no recombination/high positive selection" panel.

We have added clarification to the Figure 7—figure supplement 1 to state that the mean and 95% CI are plotted for each point. Additionally, we ran more simulations (for a total of 10). Our reasoning in including this figure supplement is to show that when the 30-year point is taken as the “true” rate of adaptation, it is evident that a reduced amount of temporal data alters the estimated rate of adaptation. Though the general trend is that less longitudinal sequence data results in a lower estimated rate of adaptation and higher uncertainty, this is not the case at every time point. However, whether a truncated time-series results in a lower or higher rate of adaptation, the estimated rate is still skewed from the “true” rate. We have added a sentence to the Results section to try to clarify this interpretation of Figure 7—figure supplement 1.

17) Please add a reference in the third paragraph of the Discussion about transmissibility/pathology correlates.

We have rewritten this sentence to remove this statement.

18) Subsection “Phylogenetic Inference”: IG-Tree typo should be IQ-TREE. In addition, The authors may also wish to confirm on their own end which IQ-TREE version was used – a major version 2 was released in 2020 and has a different citation. Either way, please indicate the IQ-TREE version used and make sure the citation is right for whichever was used.

The typo in the Materials and methods section has been fixed, the version number was added, and the citation was verified.

19) Figure 2—figure supplement 1C – please choose different colors. Since some transparency is used for points, it's very hard to distinguish precisely light from dark purple.

The typo in the Materials and methods section has been fixed, the version number was added, and the citation was verified.

20) Grammar and spelling:

– There should be a comma after 229E (as in, "…two species of HCoV, OC43 and 229E, were identified…")

Done.

– "Some human respiratory illness…" is a runon sentence. Please add a comma before ",while others,".

Done.

– Legend of Figure 1: Please add a comma at "…for each viral gene, and those…"

Done.

– Typo, "spikenand" → "spike and"

Addressed.

– Please add a comma before ", while the rate of adaptation…"

Done.

– Discussion, third paragraph, again a comma is needed before "while."

Fixed.

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Transparent reporting form

    Data Availability Statement

    All data used in this study can be found at https://www.viprbrc.org/ and is archived as .fasta files in the following subdirectories of the Github repository for this project: https://github.com/blab/seasonal-cov-adaptive-evolution/tree/master/229e/data, https://github.com/blab/seasonal-cov-adaptive-evolution/tree/master/oc43/data, https://github.com/blab/seasonal-cov-adaptive-evolution/tree/master/nl63/data, https://github.com/blab/seasonal-cov-adaptive-evolution/tree/master/hku1/data (copy archived at https://archive.softwareheritage.org/swh:1:rev:83721cd000f2848d4f77d2a6da8c2d0df8a555a1).


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd

    RESOURCES