Coordinated evolution at amino acid sites of SARS-CoV-2 spike

Alexey Dmitrievich Neverov; Gennady Fedonin; Anfisa Popova; Daria Bykova; Georgii Bazykin

doi:10.7554/eLife.82516

2023 Feb 8;12:e82516. doi: 10.7554/eLife.82516

Editors: Richard A Neher, Neil M Ferguson

PMCID: PMC9908078 PMID: 36752391

Abstract

SARS-CoV-2 has adapted in a stepwise manner, with multiple beneficial mutations accumulating in rapid succession at the origins of variants of concern (VOCs), and the reasons for this are unclear. Here, we searched for coordinated evolution of amino acid sites in the spike protein of SARS-CoV-2. Specifically, we searched for concordantly evolving site pairs (CSPs) for which changes at one site were rapidly followed by changes at the other site in the same lineage. We detected 46 sites which formed 45 CSPs. Sites in CSPs were closer to each other in the protein structure than random pairs, indicating that concordant evolution has a functional basis. Notably, site pairs carrying lineage-defining mutations of the four VOCs that circulated before May 2021 are enriched in CSPs. For the Alpha VOC, the enrichment is detected even if Alpha sequences are removed from analysis, indicating that VOC origin could have been facilitated by positive epistasis. Additionally, we detected nine discordantly evolving pairs of sites where mutations at one site unexpectedly rarely occurred on the background of a specific allele at another site, for example on the background of wild-type D at site 614 (four pairs) or derived Y at site 501 (three pairs). Our findings hint that positive epistasis between accumulating mutations could have delayed the assembly of advantageous combinations of mutations comprising at least some of the VOCs.

Research organism: Viruses

Introduction

Evolution of SARS-CoV-2 in human hosts before November 2020 was largely neutral, with little evidence for the emergence of novel adaptations with the exception of the fixation of D614G in the Spike protein (Dearlove et al., 2020; MacLean et al., 2021). However, since the end of 2020, evidence for adaptive viral evolution has started to accumulate, suggesting a change in the mode of evolution (Martin et al., 2021; Rochman et al., 2021b). The subsequent pandemic was characterized by the emergence of multiple concurrently circulating lineages with increased fitness compared to the ancestral variant, including Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), and Delta (B.1.617.2). These lineages are typically characterized by high divergence compared to cocirculating strains; divergence is often particularly pronounced at nonsynonymous sites, suggesting positive selection at the origin of these variants. Some of these sites show evidence of a change in selection regime at the origin of VOCs. For example, out of the 34 lineage-defining amino acid changes in the S-protein at the origin of the Omicron BA.1 sublineage (Hodcroft, 2021), eleven were characterized by strong purifying selection against changes of ancestral amino acids (Martin et al., 2022), suggesting that Omicron lineage-defining mutations at these sites were previously individually deleterious. In turn, the obvious high fitness of Omicron suggested that the origin of this variant has been characterized by a change in the selection regime at least at these sites. The reasons for this change are unclear. Several non-exclusive explanations were proposed, including a distinct mode of evolution at variant origin, for example, in an immunosuppressed individual (Corey et al., 2021; Kupferschmidt, 2021) or a different host species (Wei et al., 2021), and/or cascades of substitutions at positively epistatically interacting sites (Moulana et al., 2022). Here, we focus on the latter possibility.

Several previous studies have attempted to infer possible epistatic interactions between sites of the SARS-CoV-2 genome from sequence data or experimentally. In an early study, Zeng et al. used direct coupling analysis (DCA) to search for epistasis in the SARS-CoV-2 genome and reported several pairs of putatively interacting sites (Zeng et al., 2020). No pairwise interactions between sites of the S-protein were identified. For DCA to accurately detect interacting sites, the analyzed sequences need to be highly divergent (Bisardi et al., 2022). SARS-CoV-2 has a recent common ancestor, and divergence of its lineages is relatively low, limiting the applicability of DCA for this virus. Rodriguez-Rivas et al. applied DCA to homologous protein sequences from genomes of other coronaviruses and successfully predicted variability of SARS-CoV-2 protein sites, thus showing that knowledge of covariation between sites in related viruses is relevant for predicting evolution of new pathogens (Rodriguez-Rivas et al., 2022). In the study of Rochman et al., a method based on counting of mutations on the phylogeny was used to look for strongly associated mutation pairs (Rochman et al., 2020). They found intra- and intergenic epistasis between positively selected mutations in the nuclear localization signal (NLS) of the N-protein and the RBD in the S-protein. In the RBD, many of the detected epistatically interacting mutations were among the lineage signature mutations. Another proposed approach to study epistatic interaction was to estimate the fitness effects of mutations that arose on different backgrounds relative to their effects on the wild-type background (Rochman et al., 2021a). Using molecular dynamics, Rochman et al. estimated the effects of all individual non-synonymous mutations in the S-protein RBM on binding with the host ACE2 receptor and on binding with neutralizing antibodies (NAb). Effects of each mutation were estimated for the Wuhan ancestral background and for the Delta (452R, 478K), Gamma (417T, 484K, 501Y), and Omicron (339D, 371L, 373P, 375F, 417N, 440K, 446S, 477N, 478K, 484A, 493R, 496S, 498R, 501Y, 505H) variants. On average, the epistatic effects of mutations weakly stabilized NAb binding for Delta and destabilized it for Gamma and Omicron variants relative to the ancestral background. The authors concluded that the Gamma and Omicron variants had a higher potential for emergence of immune escape mutations than Delta or Wuhan variants.

For some site pairs, epistasis has been demonstrated experimentally. For example, the Q498R mutation alone affected the affinity of Spike to ACE2 only slightly (Zahradník et al., 2021), but on the background of N501Y, the affinity of binding increased by a factor of 4–25 (Starr et al., 2022a; Zahradník et al., 2021), with both mutations together increasing the affinity by up to 387-fold compared to the wild type (Starr et al., 2022a). The very strong binding provided by the double mutant allows accumulation, at the origin of Omicron, of multiple immune escape mutations at other sites which by themselves destabilize ACE2 binding (Moulana et al., 2022; Starr et al., 2022a).

Here, we study the mutual distribution of spike mutations in SARS-CoV-2 phylogeny to infer the pairs of sites with evidence for concordant and discordant evolution, as manifested by the propensity of substitutions at these sites to occur rapidly one after the other (for concordant evolution), or to avoid each other (for discordant evolution). We detect 46 concordantly evolved sites combined into 13 coevolving clusters, and 12 discordantly evolved sites. Many of the concordantly evolved sites carry the characteristic mutations of VOC lineages, strongly arguing for the role of positive epistasis in VOC origin.

Results

Detecting interdependently evolving pairs of sites

To find coevolving site pairs, we modified our previously developed phylogenetic approach (Kryazhimskiy et al., 2011; Neverov et al., 2021; Neverov et al., 2015) to improve the accuracy of detecting concordantly evolving site pairs (see Materials and methods). As in our previous work, we measured the concordance of evolution at two sites with the epistatic statistic, calculated as the weighted sum over consecutive pairs of mutations at these two sites on the phylogeny, where each mutation pair was weighted with an exponential penalty for the waiting time for the later mutation (Kryazhimskiy et al., 2011).

We first introduce some definitions. Hereafter, unless specified otherwise, we use the term ‘mutation’ to denote a triple consisting of a site, an ancestral amino acid, and a derived amino acid. Using ancestral state reconstruction, we are able to infer the order in which two specific mutations occurred in an evolving lineage. For a pair of consecutive mutations at two sites, we call the mutation that occurs first the leading mutation, and the mutation that follows it the trailing mutation. For an ordered pair of sites, we call the first site in the pair the background site, and the second site in the pair the foreground site. The epistatic statistic for an ordered pair of sites summarizes the weights of consecutive pairs of mutations at these sites, such that mutations at the background site are leading and mutations at the foreground site are trailing. The epistatic statistic for an unordered pair of sites is the sum of the statistics of the two corresponding ordered pairs (Neverov et al., 2021).
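The definitions above can be illustrated with a minimal Python sketch. It assumes that consecutive mutation pairs have already been extracted from the tree and reduced to their waiting times, and that the exponential penalty has the form exp(-t/tau). The scale tau, the function names, and the example waiting times are hypothetical and are not taken from the paper.

```python
import math

def ordered_statistic(waiting_times, tau=1.0):
    """Statistic for an ordered (background, foreground) site pair.

    Each consecutive mutation pair contributes exp(-t / tau), where t is the
    waiting time between the leading and the trailing mutation. The penalty
    scale `tau` is an illustrative assumption.
    """
    return sum(math.exp(-t / tau) for t in waiting_times)

def unordered_statistic(times_ab, times_ba, tau=1.0):
    """Statistic for an unordered pair: the sum of the two ordered statistics."""
    return ordered_statistic(times_ab, tau) + ordered_statistic(times_ba, tau)

# Hypothetical waiting times in arbitrary branch-length units.
a_leads_b = [0.0, 0.5, 2.0]   # mutations at site A followed by mutations at site B
b_leads_a = [1.0]             # mutations at site B followed by mutations at site A
stat = unordered_statistic(a_leads_b, b_leads_a, tau=1.0)
print(round(stat, 3))
```

Pairs with short waiting times dominate the sum, so the statistic is large when mutations at the two sites tend to follow each other rapidly.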

We introduced two significant changes to the original method (Kryazhimskiy et al., 2011; Neverov et al., 2021) which improved the power to infer epistasis (‘revised method’, see next section). First, as in Neverov et al., 2015, we modified the null model used to calculate the significance of the epistatic statistic in permutations. While previously (Kryazhimskiy et al., 2011; Neverov et al., 2021) we permuted the positions of mutations on the tree branches at each site independently of other sites, here, we fixed the positions of mutations for the background site and permuted just the positions of mutations at the foreground site. This change allowed us to account for the possible effects of leading mutations on the topology of the phylogenetic tree; for example, an advantageous leading mutation could give rise to a prolific clade (Neher, 2013) which in turn would carry a large number of trailing mutations, artefactually inflating the epistatic statistic for this pair of sites even in the absence of epistatic interactions.
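The permutation scheme described above can be sketched on a toy tree. This is a simplified illustration, not the authors' implementation: the tree, the uniform reshuffling of foreground mutations across branches, the definition of waiting time as the summed length of intervening branches, and all names are assumptions made for the example.

```python
import math
import random

# Toy rooted tree: branch (named by its child node) -> (parent, branch length).
TREE = {
    "a": ("root", 1.0), "b": ("root", 1.0),
    "a1": ("a", 0.5), "a2": ("a", 0.5),
    "b1": ("b", 0.5), "b2": ("b", 0.5),
}

def path_to_root(branch):
    """Branches from `branch` up to the root, starting with `branch` itself."""
    path = []
    while branch in TREE:
        path.append(branch)
        branch = TREE[branch][0]
    return path

def statistic(background, foreground, tau=1.0):
    """Sum exp(-t/tau) over foreground mutations that follow a background one.

    For each foreground mutation, walk up the tree until the nearest branch
    carrying a background mutation; t is the summed length of the branches
    strictly between the two mutated branches.
    """
    total = 0.0
    for fg in foreground:
        t = 0.0
        for anc in path_to_root(fg)[1:]:
            if anc in background:
                total += math.exp(-t / tau)
                break
            t += TREE[anc][1]
    return total

def permutation_p_value(background, foreground, n_perm=1000, seed=0):
    """Keep background mutations fixed; reshuffle only foreground mutations."""
    rng = random.Random(seed)
    observed = statistic(background, foreground)
    branches = sorted(TREE)
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.sample(branches, len(foreground))
        if statistic(background, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

p = permutation_p_value(background={"a"}, foreground=["a1", "a2"])
print(p)
```

Because the background mutation stays on its branch in every permutation, the size of the clade below it is the same in the observed and permuted data, which is the property the revised null model relies on.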

Second, while our previous work (Kryazhimskiy et al., 2011; Neverov et al., 2021) treated all substitutions at a site equally, we now distinguished between substitutions into different amino acids. Therefore, the revised epistatic statistic accounts for the preference of a specific mutation at the foreground site to follow a specific mutation at the background site. For this, in the calculation of the epistatic statistic, we now additionally scored each pair of consecutive mutations by the fraction of times that the specific type of mutation at the foreground site followed the specific type of mutation at the background site, among all occurrences of this type of mutation at the foreground site. Therefore, extra weight was given to those mutation types that became more frequent on a specific background.
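The allele-specific score described above can be written as a short Python sketch. The input format (one record per foreground mutation, annotated with the type of the leading background mutation or None) and the mutation labels are hypothetical; how this score is combined with the exponential penalty is not shown here.

```python
from collections import Counter

def allele_weights(events):
    """Fraction of foreground mutations of a given type that follow a given
    type of background mutation.

    `events` is a list of (background_type, foreground_type) tuples, one per
    foreground mutation; background_type is None when the foreground mutation
    did not follow any mutation at the background site.
    """
    fg_total = Counter(fg for _, fg in events)
    followed = Counter((bg, fg) for bg, fg in events if bg is not None)
    return {pair: n / fg_total[pair[1]] for pair, n in followed.items()}

# Hypothetical mutation types: background X>Y, foreground A>B and A>C.
events = [
    ("X>Y", "A>B"), ("X>Y", "A>B"), (None, "A>B"),
    ("X>Y", "A>C"), (None, "A>C"), (None, "A>C"), (None, "A>C"),
]
weights = allele_weights(events)
print(weights)
```

In this toy example, A>B follows X>Y in two of its three occurrences and so receives a higher weight than A>C, which follows X>Y in only one of four.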

Estimating the power of the method to detect epistasis

To demonstrate that our revised method improves inference of positive and negative epistasis, we used MimicrEE2 (Vlachos and Kofler, 2018) to simulate clonal evolution of linked sites. We simulated two modes of evolution: (i) under positive and negative selection without epistatic interactions (‘multiplicative mode’), and (ii) under epistatic selection (‘epistatic mode’).

Specifically, we simulated independent forward-time evolution of a population of 50,000 genotypes consisting of 100 biallelic sites. For the multiplicative mode, at the start of the simulations, 20 sites of the gene carried the disfavored allele, and were therefore under positive selection; and 20 sites carried the favored allele, and were therefore under negative selection. For the epistatic mode, 20 sites constituted 10 site pairs such that the sites within each pair evolved under positive epistasis; and another 20 sites constituted 10 site pairs such that the sites within each pair evolved under negative epistasis. Under each mode, the remaining 60 sites evolved neutrally (see Materials and methods).

We used simulations to estimate how the changes to the method for inference of epistasis introduced in this work impacted method accuracy. For this, using simulated datasets, we compared the power of the four variants of our method, corresponding to the presence or absence of the two modifications introduced in the previous section (accounting for amino acid identities and unlinking the distributions of mutations on the tree branches for background and foreground for the null model).

To compare the specificity of the four variants of the method, we used the multiplicative mode of simulation (i.e. the absence of epistasis), and asked how frequently concordant or discordant pairs were inferred under each model. Since there was no epistasis in the simulation, each such pair was spurious, and the best method would be the one with the fewest such pairs. For each method, we counted the number of spuriously inferred concordant and discordant pairs at the lowest p-value threshold in our simulation trials (10^-4). There were 24 concordant pairs and 16 discordant pairs in the method of Neverov et al., 2021, but just 2 concordant pairs and 2 discordant pairs in the revised method, indicating that the modifications introduced here helped improve the specificity of our approach. We used the 10% FDR level for this analysis and for all its variants (see below). At 10% FDR, no concordant or discordant pairs were inferred in this dataset by the revised method (Appendix 1—tables 1 and 2, Appendix 1—figure 1).

To study the accuracy of the four variants of the method, we used simulations with epistasis. The revised method detected all 10 positively epistatic site pairs as concordantly evolving; additionally, it spuriously detected five other site pairs as concordantly evolving (Appendix 1—tables 3 and 4). The revised method also detected 7 out of 10 negatively epistatic site pairs as discordantly evolving, and spuriously detected four other site pairs as discordantly evolving (Appendix 1—tables 5 and 6). The three other detection models were less accurate: the number of false predictions was greater than the number of true predictions for positive epistatic pairs for all other detection methods, and for negative epistatic pairs, for two out of three methods (Appendix 1—tables 3 and 5). Therefore, the revised method was the method of choice for subsequent analyses.
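The counts reported above for the revised method can be summarized as precision and recall. These two measures are standard summaries that we add here for illustration; the paper itself reports only the raw counts, and the function name is hypothetical.

```python
def precision_recall(true_pos, false_pos, n_real):
    """Precision = TP / (TP + FP); recall = TP / number of truly epistatic pairs."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / n_real
    return precision, recall

# Counts taken from the text: 10 simulated pairs of each kind.
concordant = precision_recall(true_pos=10, false_pos=5, n_real=10)
discordant = precision_recall(true_pos=7, false_pos=4, n_real=10)

print(f"concordant: precision={concordant[0]:.2f}, recall={concordant[1]:.2f}")
print(f"discordant: precision={discordant[0]:.2f}, recall={discordant[1]:.2f}")
```

In both cases true detections outnumber spurious ones, which the text states was not true of the other three variants for positively epistatic pairs.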

Phylogenetic analysis of SARS-CoV-2 spike

To obtain a phylogeny representative of SARS-CoV-2 diversity, we downloaded 3,299,439 complete genome sequences of SARS-CoV-2 aligned to the WIV04 reference genome from the GISAID EpiCoV database on 07.09.2021. We ignored insertions and deletions relative to the reference sequence and removed sequences with in-frame stop codons in the spike protein. We then clustered the remaining sequences by pairwise distances between S-protein subsequences, allowing up to three mutations in the S-protein within a cluster, which resulted in 7,348 clusters. For each cluster, we selected one representative sequence of the best quality with the earliest date of sampling. The median date of representative sequences was February 10, 2021. Therefore, the dataset covered approximately equally both characteristic periods of SARS-CoV-2 evolution: the neutral period between Jan and Nov 2020, and the period of antigenic drift between Dec 2020 and May 2021 (MacLean et al., 2021; Martin et al., 2021). We classified representative sequences according to pangolin lineages. Most sequences (5,721) were of the B.1.* sublineages. The representative sequences included some from the variants of concern (VOCs) Alpha (B.1.1.7+Q.*, 951 sequences), Beta (B.1.351.*, 192 sequences), Delta (B.1.617.2+AY.*, 24 sequences) and Gamma (P.1.*, 100 sequences). The phylogeny of representative sequences was reconstructed using IQ-TREE (Minh et al., 2020). The tree was rooted by the outgroup USA-WA1/2020 (EPI_ISL_404895) that matched the sequence of the putative SARS-CoV-2 progenitor (Bloom, 2021; Kumar et al., 2021). The ancestral sequences at internal tree nodes were reconstructed by TreeTime (Sagulenko et al., 2018). We extracted the part of the alignment that corresponded to the S gene, and collapsed the internal tree branches without mutations in the S gene. The final tree had 1,783 internal branches. For each internal branch, we listed the amino acid mutations that occurred on this branch.
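The clustering and representative-selection step can be sketched as follows. This is only one possible reading of the description above: the greedy seed-based clustering, the use of Hamming distance on aligned spike sequences, the `n_ambiguous` field as a proxy for sequence quality, and all record fields are assumptions made for illustration. The paper does not specify the algorithm in this section.

```python
from datetime import date

def hamming(a, b):
    """Number of differing positions between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def greedy_cluster(records, max_dist=3):
    """Assign each record to the first cluster whose seed is within max_dist.

    Distances are measured to the cluster seed only, so two non-seed members
    may differ by more than max_dist; this is a simplification.
    """
    clusters = []
    for rec in records:
        for cluster in clusters:
            if hamming(rec["spike"], cluster[0]["spike"]) <= max_dist:
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

def representative(cluster):
    """Prefer the fewest ambiguous residues, then the earliest sampling date."""
    return min(cluster, key=lambda r: (r["n_ambiguous"], r["date"]))

# Toy records with hypothetical identifiers and short stand-in sequences.
records = [
    {"id": "s1", "spike": "MFVFLVLLPL", "date": date(2020, 3, 1), "n_ambiguous": 2},
    {"id": "s2", "spike": "MFVFLVLLPL", "date": date(2020, 4, 1), "n_ambiguous": 0},
    {"id": "s3", "spike": "MFVFLVLLPA", "date": date(2020, 2, 1), "n_ambiguous": 0},
    {"id": "s4", "spike": "AAAAAVLLPL", "date": date(2020, 5, 1), "n_ambiguous": 0},
]
clusters = greedy_cluster(records)
reps = [representative(c)["id"] for c in clusters]
print(reps)
```

Here s1, s2, and s3 fall into one cluster and s4 into another; s3 is chosen over s2 because both have no ambiguous residues and s3 was sampled earlier.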

Concordantly evolving site pairs

To study the concordant and discordant evolution of pairs of sites in SARS-CoV-2 spike, we applied our approach to the distribution of mutations in the S gene on the reconstructed SARS-CoV-2 phylogeny. 185 of the sites carried two or more mutations on internal tree branches. We considered all 17,020 unordered pairs of these sites.
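As a quick check of the count above, the number of unordered pairs among 185 sites is the binomial coefficient C(185, 2) = 185 × 184 / 2, which can be verified directly:

```python
from math import comb

n_sites = 185                 # sites with two or more mutations on internal branches
n_pairs = comb(n_sites, 2)    # unordered pairs of distinct sites
print(n_pairs)                # 17020
```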

We detected 45 concordantly evolving site pairs which comprised 46 sites (Figure 1A, Appendix 1—table 7, Appendix 1—figure 2A). Our phylogenetic approach for detecting concordantly and discordantly evolving site pairs relies on the assumption that the tree provided for analysis is correct. To check the robustness of our results to uncertainty of phylogenetic reconstruction, we repeated the analysis on the tree reconstructed for the same set of sequences by UShER (Turakhia et al., 2021), which uses a maximum parsimony approach (see Materials and methods). Among the 45 site pairs inferred to be concordantly evolving (Appendix 1—table 7), 33 were also concordantly evolving on the UShER tree at 50% FDR, including 28 at 10% FDR (Table 1, Figure 1A, Appendix 1—table 8). Thus, we conclude that for 73% (33/45) of the detected concordantly evolving site pairs, the statistical signal was strong enough to be insensitive to phylogenetic uncertainty. In what follows, we focus on the IQ-TREE results.

Table 1. Concordantly evolving sites of the SARS-CoV-2 S-protein with FDR less than 10% for both reconstructed phylogenies (see Text).

The following characteristics are shown: coordinates on the S-protein sequence, nominal p-values, the value of the epistatic statistics, the total number of consecutive mutation pairs for the two corresponding ordered site pairs, numbers of mutations in consecutive pairs at sites 1 and 2, total numbers of mutations at sites 1 and 2, and the distance in the protein structure (PDB ID: 7JJJ). Pairs of sites where non-consecutive mutations are further from each other than expected (suggesting both epistatic and episodic selection; p-value <0.05 after adjustment) are indicated in bold; pairs of sites where they are closer to each other than expected (suggesting episodic rather than epistatic selection) are indicated in italic (see Appendix 1—tables 10 and 12). Physical distance could not be calculated for site pairs (13,152) and (681,716) because sites 13 and 681 were absent in 7JJJ.

site 1	site 2	cluster	p-value	epistatic statistic	#consec. pairs of mutations	#mutations in consec. pairs at site 1	#mutations in consec. pairs at site 2	#mutations at site 1	#mutations at site 2	physical distance, Å
13	152		<2e-5	2.864	5	5	4	5	16	-
20	417	4	2.2e-4	1.348	4	3	2	8	7	47.83
26	190	4	1.8e-4	1.029	4	4	3	12	3	18.4
63	213	3	4e-5	0.681	1	1	1	2	3	13.51
69	70	3	<2e-5	1.125	2	2	2	5	4	1.3
70	144	3	<2e-5	1.301	3	3	3	4	5	14.03
76	490		2.6e-4	0.22	1	1	1	3	3	45.18
189	360	2	1.6e-4	0.544	2	1	2	3	3	29.89
190	417	4	0.0001	1.272	3	2	2	3	7	39.93
259	261	3	<2e-5	1.35	1.5	2	1	4	2	3.81
356	360	2	<2e-5	1.283	2.5	2	3	2	3	10.12
359	360	2	<2e-5	0.976	1.5	1	2	2	3	1.34
439	441	1	<2e-5	2.097	4	3	4	9	8	3.5
440	441	1	1.4e-4	1.505	3	3	3	12	8	1.33
440	442	1	<2e-5	2.476	3	2	3	12	6	3.92
440	444	1	<2e-5	2.515	3.5	2	5	12	7	5.6
441	442	1	<2e-5	2.244	4	4	4	8	6	1.33
441	443	1	2e-5	1.281	5	4	2	8	2	3.23
441	444	1	<2e-5	3.492	5.5	4	4	8	7	2.6
442	443	1	6e-5	1.301	4	4	2	6	2	1.32
442	444	1	<2e-5	3.013	6	4	4	6	7	4.05
443	444	1	8e-5	1.043	2	2	2	2	7	1.32
501	1118	5	1.4e-4	2.737	16	13	12	40	22	131.71
681	716	5	<2e-5	3.905	16.5	12	16	59	21	-
716	982	5	2e-5	2.001	15	9	11	21	15	81.66
716	1118	5	4e-5	2.382	13	8	12	21	22	22.29
859	950		<2e-5	2.219	5.5	4	5	11	9	15.17
982	1118	5	1.4e-4	1.637	15	11	8	15	22	94.63

site 1	site 2	p-value	epistat.	#consec. pairs of mutations	#mut. in consec. pairs in site 1	#mut. in consec. pairs in site 2	#mut. in site 1	#mut. in site 2	FDR (UShER tree)	physical distance, Å
69	614	3e-3	0.07	5	1	5	5	14	0.231	46.75
222	501	3.1e-3	0.01	1	1	1	11	40	0.058	55.44
440	681	3.1e-3	0.01	1	1	1	12	59	0.055	-
501	675	1.2e-3	0.02	1	1	1	40	24	0.06	84.98
501	677	4.8e-4	0.05	3	2	3	40	39	0	88.20
570	614	2.4e-3	0.16	7	1	7	16	14	0.895	19.30
614	653	4e-5	0.03	4	1	4	14	5	0.025	15.47
614	982	3.2e-4	0.12	7	1	7	14	15	0.891	27.87
681	1176	3.9e-3	0.01	1	1	1	59	11	0.385	-

Model		#FP	min p-value	Maximal FDR threshold for #FP = 0
consider alleles	shuffle mutations in fgr. only	2	<1e-4	0,25
consider alleles	shuffle mutations both in bgr. and fgr.	3	<1e-4	0,15
ignore alleles	shuffle mutations in fgr. only	4	<1e-4	0,11
ignore alleles	shuffle mutations both in bgr. and fgr.	24	<1e-4	0,02

site1	site2	p-value	epistat	#consec. pairs of mut.	#mut. in consec. pairs in site1	#mut. in consec. pairs in site2	#mut. in site1	#mut. in site2	exp. epistat	SE
1	2	<1e-4	12,36	66,5	52	65	2	96	3,97	0,87
3	4	<1e-4	17,79	94	61	104	4	118	6,55	1,23
5	6	<1e-4	13,87	54	48	54	6	105	3,27	0,83
5	43	<1e-4	21,43	108,5	82	100	43	105	12,6	1,89
7	8	<1e-4	11,36	50	42	50	8	109	3,22	0,78
9	10	<1e-4	17,09	90	64	88	10	117	6,62	1,19
11	12	<1e-4	13,46	53,5	48	51	12	108	2,54	0,69
13	14	<1e-4	11,63	52,5	43	56	14	106	2,85	0,72
15	16	<1e-4	11,84	69	47	70	16	116	5,88	1,08
19	20	<1e-4	9,06	47	39	47	18	109	4,29	0,88
48	68	<1e-4	29,11	171,5	110	169	20	95	2,6	0,69
17	18	2e-4	8,23	66	38	65	31	95	6,61	1,23
19	31	2e-4	11,58	78,5	56	78	67	176	20,72	2,09
50	94	2e-4	24,34	162,5	100	151	68	183	17,55	1,9
78	94	2e-4	28,50	172	110	169	94	178	20,5	2,07

site1	site2	p-value	epistat	#consec. pairs of mut.	#mut. in consec. pairs in site1	#mut. in consec. pairs in site2	#mut. in site1	#mut. in site2	exp. epistat	SE
16	50	<1e-4	5,72	85,5	47	83	108	178	10,51	1,48
17	96	1e-4	7,44	93,5	54	90	109	192	12,92	1,70
23	24	<1e-4	8,27	70,5	51	68	137	129	13,70	1,65
25	26	<1e-4	6,13	73,5	37	74	139	121	10,84	1,46
26	95	1e-4	6,94	91,5	52	94	121	179	12,18	1,60
27	28	<1e-4	5,90	76	42	73	168	149	13,90	1,82
33	34	1e-4	4,70	48,5	37	45	117	136	8,55	1,30
35	36	<1e-4	5,47	59,5	39	55	150	127	10,81	1,51
37	38	<1e-4	4,10	44,5	32	41	113	146	8,93	1,39
39	40	<1e-4	4,97	61,5	37	59	115	142	11,11	1,47
39	47	<1e-4	7,27	86,5	54	84	115	166	13,09	1,66

site1	site2	p-value	epistat	#consec. pairs of mut.	#mut. in consec. pairs in site1	#mut. in consec. pairs in site2	#mut. in site1	#mut. in site2	FDR for the UShER tree	pdb distance
13	152	<2e-5	2,864	5	5	4	5	16	0,03
18	20	0,00008	1,661	3,5	3	3	28	8	1,19	3,62
20	26	<2e-5	1,985	4	3	3	8	12	0,36	15,19
20	417	0,00022	1,348	4	3	2	8	7	0,05	47,83
26	190	0,00018	1,029	4	4	3	12	3	0,06	18,40
63	64	0,00002	0,726	1	1	1	2	4	1,00	1,31
63	67	0,00008	0,726	1	1	1	2	5	1,00	9,38
63	69	0,00012	0,726	1	1	1	2	5	1,00	11,07
63	213	0,00004	0,681	1	1	1	2	3	0,01	13,51
64	67	0,00012	0,726	1	1	1	4	5	1,00	5,51
64	69	0,00014	0,726	1	1	1	4	5	1,00	7,05
67	69	<2e-5	1,101	2	2	2	5	5	0,55	3,59
69	70	<2e-5	1,125	2	2	2	5	4	0,01	1,30
70	144	<2e-5	1,301	3	3	3	4	5	0,01	14,03
76	490	0,00026	0,220	1	1	1	3	3	0,05	45,18
154	1071	<2e-5	1,767	4	2	3	5	5	0,25	95,13
155	157	<2e-5	1,113	2	2	2	3	11	0,53	3,74
189	356	0,00006	0,544	2	1	2	3	2	0,32	31,63
189	360	0,00016	0,544	2	1	2	3	3	0,07	29,89
190	417	0,0001	1,272	3	2	2	3	7	0,01	39,93
213	261	0,00008	0,562	1	1	1	3	2	0,37	11,10
259	261	<2e-5	1,350	1,5	2	1	4	2	0,05	3,81
262	272	<2e-5	1,183	4	4	1	9	2	0,94	31,42
356	357	0,00006	0,448	1	1	1	2	2	0,69	1,33
356	360	<2e-5	1,283	2,5	2	3	2	3	0,05	10,12
359	360	<2e-5	0,976	1,5	1	2	2	3	0,01	1,34
439	441	<2e-5	2,097	4	3	4	9	8	0,01	3,50
440	441	0,00014	1,505	3	3	3	12	8	0,01	1,33
440	442	<2e-5	2,476	3	2	3	12	6	0,01	3,92
440	443	0,00002	1,284	4	3	2	12	2	0,25	3,86
440	444	<2e-5	2,515	3,5	2	5	12	7	0,01	5,60
441	442	<2e-5	2,244	4	4	4	8	6	0,01	1,33
441	443	0,00002	1,281	5	4	2	8	2	0,02	3,23
441	444	<2e-5	3,492	5,5	4	4	8	7	0,01	2,60
442	443	0,00006	1,301	4	4	2	6	2	0,05	1,32
442	444	<2e-5	3,013	6	4	4	6	7	0,01	4,05
443	444	0,00008	1,043	2	2	2	2	7	0,03	1,32
484	655	0,00004	1,716	8	6	5	34	12	0,89	73,58
501	1118	0,00014	2,737	16	13	12	40	22	0,09	131,71
681	716	<2e-5	3,905	16,5	12	16	59	21	0,02
716	982	0,00002	2,001	15	9	11	21	15	0,01	81,66
716	1118	0,00004	2,382	13	8	12	21	22	0,01	22,29
859	950	<2e-5	2,219	5,5	4	5	11	9	0,01	15,17
982	1118	0,00014	1,637	15	11	8	15	22	0,01	94,63
1258	1259	0,00004	0,775	3	1	3	5	3	NA

site1	site2	p-value	epistat	#consec. pairs of mut.	#mut. in consec. pairs in site1	#mut. in consec. pairs in site2	#mut. in site1	#mut. in site2	IQ-TREE	pdb distance
12	346	0,00006	0,934	2	2	1	9	4	0	-
12	899	0,00004	0,724	2	2	1	9	3	0	-
13	152	0,00004	1,407	5	4	3	4	13	1	-
20	190	0,00022	0,885	3	2	3	8	4	0	22,59
20	417	0,00018	1,107	2	2	1	8	8	1	47,83
26	190	0,00026	0,913	4	3	4	12	4	1	18,40
26	655	0,00042	0,968	6,5	3	6	12	14	0	37,53
54	690	0,00044	0,714	1	1	1	11	3	0	33,65
62	251	<2e-5	0,748	1	1	1	2	3	0	32,01
67	96	<2e-5	0,599	1	1	1	3	7	0	7,12
69	70	<2e-5	1,496	3	3	3	5	5	1	1,30
69	144	0,00002	0,984	2	2	2	5	5	0	12,57
70	144	<2e-5	1,088	3	3	3	5	5	1	14,03
76	490	0,0001	0,275	1	1	1	2	3	1	45,18
80	215	<2e-5	1,687	6,5	3	6	21	14	0	12,51
80	950	0,00014	1,091	3	2	2	21	7	0	56,59
152	252	0,00016	0,623	2	2	1	13	4	0	15,26
189	360	0,0003	0,534	1	1	1	2	2	1	29,89
189	772	0,00024	0,440	1	1	1	2	2	0	44,17
190	417	<2e-5	1,134	2,5	2	2	4	8	1	39,93
215	1167	0,00042	0,853	2	2	2	14	4	0
255	256	0,00022	0,758	2	2	2	8	7	0	1,32
255	260	0,00036	0,811	3	3	2	8	3	0	9,61
256	258	0,00012	0,660	1	1	1	7	2	0	2,81
256	260	0,00014	1,056	2	2	2	7	3	0	8,18
259	260	<2e-5	1,536	2	2	2	3	3	0	1,31
259	261	0,00014	0,909	1	1	1	3	2	1	3,81
357	360	<2e-5	1,025	1,5	1	2	2	2	0	7,59
359	360	<2e-5	1,268	2	1	2	2	2	1	1,34
360	772	0,00018	0,534	1	1	1	2	2	0	58,61
439	440	<2e-5	2,038	3,5	2	4	10	12	0	1,31
439	441	<2e-5	2,795	5	3	4	10	5	1	3,50
439	444	0,00006	1,606	4,5	3	4	10	7	0	6,21
440	441	<2e-5	2,822	4,5	4	2	12	5	1	1,33
440	442	<2e-5	2,092	3	3	2	12	3	1	3,92
440	444	<2e-5	2,947	4,5	4	3	12	7	1	5,60
441	442	<2e-5	2,292	3	2	2	5	3	1	1,33
441	443	0,00002	1,383	2	2	2	5	2	1	3,23
441	444	<2e-5	3,829	6,5	4	4	5	7	1	2,60
441	445	0,00018	1,086	2	2	1	5	6	0	8,51
442	443	0,0001	1,132	2	2	1	3	2	1	1,32
442	444	0	2,387	4	3	3	3	7	1	4,05
443	444	0,00006	1,031	4	1	4	2	7	1	1,32
444	445	0,0002	1,270	4	5	3	7	6	0	1,31
501	570	0	2,734	16	13	10	46	19	0	53,69
501	1118	0,00048	2,028	18	14	13	46	19	1	131,71
570	681	0,00018	2,538	17,5	15	14	19	62	0
570	1118	0,00002	1,881	13,5	9	10	19	19	0	79,04
572	1181	0	0,533	1	1	1	3	2	0
583	1237	0,00014	0,242	1	1	1	3	3	1
681	716	0,00002	3,017	20	20	16	62	21	1
681	1118	0,00032	2,334	18	14	12	62	19	0
716	982	0	4,623	19,5	13	16	21	18	1	81,66
716	1118	0	2,872	17,5	11	13	21	19	1	22,29
859	950	0	1,596	4	3	4	9	7	1	15,17
982	1118	0	1,831	17,5	10	10	18	19	1	94,63
1027	1176	0,00016	1,206	5,5	5	4	12	9	0

site1	site2	p-value	epistat	#consec. pairs of mut.	#mut. in consec. pairs in site1	#mut. in consec. pairs in site2	#mut. in site1	#mut. in site2	IQ-TREE	pdb distance
18	681	0,00344	0,102	5,5	5	4	22	62	0
222	501	0,00314	0,019	1	1	1	12	46	1	55,44
439	501	0,00286	0	0	0	0	10	46	0	5,16
440	501	0,0019	0	0	0	0	12	46	0	9,47
440	681	0,00274	0,008	1	1	1	12	62	1
484	982	0,0021	0,020	3	3	3	43	18	0	35,42
501	675	0,00136	0,025	1	1	1	46	22	1	84,98
501	677	0,0001	0,023	2	2	2	46	33	1	88,2
570	677	0,00226	0,011	1	1	1	19	33	0	44,77
614	653	0,00022	0,037	4	1	4	10	5	1	15,47
675	716	0,00568	0	0	0	0	22	21	0	38,15
675	1118	0,00524	0	0	0	0	22	19	0	58,84
677	681	0,00264	0,098	3	3	3	33	62	0
677	716	0,00102	0,011	1	1	1	33	21	0	42,93
677	982	0,00152	0,004	1	1	1	33	18	0	54,37
677	1118	0,00066	0,002	1	1	1	33	19	0	64,36

site 1	site 2	#nonseq. pairs	zscore	lower pvalue	upper pvalue	lower pvalue adj.	upper pvalue adj.
13	152	2569	–0,724	0,7675	0,2325	1	0,5813
18	20	8610,5	1,148	0,1225	0,8775	0,3675	1
20	26	5719	2,444	0,0175	0,9825	0,1125	1
20	417	2533	1,738	0,04	0,96	0,2	1
26	190	3682	2,258	0,015	0,985	0,1125	1
63	64	124	1,736	0,06	0,94	0,2455	1
63	67	264	1,68	0,075	0,925	0,2596	1
63	69	239	2,224	0,025	0,975	0,1406	1
63	213	94	1,407	0,0825	0,9175	0,2652	1
64	67	1324	2,353	0,0125	0,9875	0,1125	1
64	69	1199	2,766	0,0025	0,9975	0,0281	1
67	69	2542	4,104	0	1	0,0281	1
69	70	2014	4,041	0	1	0,0281	1
70	144	627	2,952	0,0025	0,9975	0,0281	1
76	490	1748	0,632	0,25	0,75	0,5357	1
154	1071	502	–1,357	0,925	0,075	1	0,225
155	157	482	0,671	0,2375	0,7625	0,5344	1
189	356	108	–0,187	0,555	0,445	0,999	0,9536
189	360	58	0,927	0,17	0,83	0,4765	1
190	417	1631	0,099	0,445	0,555	0,8344	1
213	261	778	–1,251	0,905	0,095	1	0,2672
259	261	736,5	–2,384	0,9975	0,0025	1	0,0094
262	272	744	0,085	0,44	0,56	0,8344	1
356	357	65	–0,797	0,7525	0,2475	1	0,5862
356	360	63,5	–2,704	1	0,0025	1	0,0094
359	360	64,5	–0,39	0,6275	0,3725	1	0,8381
439	441	1196	–3,989	1	0,0025	1	0,0094
440	441	1647	–3,55	1	0,0025	1	0,0094
440	442	822	–3,23	1	0,0025	1	0,0094
440	443	931	–2,124	0,995	0,005	1	0,0173
440	444	1591,5	–3,511	1	0,0025	1	0,0094
441	442	446	–5,063	1	0,0025	1	0,0094
441	443	505	–4,26	1	0,0025	1	0,0094
441	444	864,5	–5,337	1	0,0025	1	0,0094
442	443	251	–3,988	1	0,0025	1	0,0094
442	444	429	–4,727	1	0,0025	1	0,0094
443	444	491	–3,972	1	0,0025	1	0,0094
484	655	18792	1,038	0,18	0,82	0,4765	1
501	1118	16121	1,443	0,07	0,93	0,2596	1
681	716	19375,5	–0,962	0,8325	0,1675	1	0,4434
716	982	8267	0,58	0,3075	0,6925	0,629	1
716	1118	9986	1,65	0,05	0,95	0,225	1
859	950	3684,5	0,747	0,2075	0,7925	0,5188	1
982	1118	8103	0,757	0,2325	0,7675	0,5344	1
1258	1259	1053	–1,435	0,9375	0,0625	1	0,2009

site 1	site 2	#nonseq. pairs	zscore	lower pvalue	upper pvalue	lower pvalue adj.	upper pvalue adj.
69	614	2875	2,27	0,015	0,985	0,06	1
222	501	17114	–0,861	0,795	0,205	0,995	0,369
440	681	10559	–2,229	0,995	0,005	0,995	0,045
501	675	18907	–2,39	0,9875	0,0125	0,995	0,0563
501	677	30478	–1,463	0,925	0,075	0,995	0,1856
570	614	5813	1,738	0,0425	0,9575	0,0956	1
614	653	1856	2,944	0,0025	1	0,0225	1
614	982	4913	2,226	0,02	0,98	0,06	1
681	1176	9599	–1,326	0,9175	0,0825	0,995	0,1856

site 1	site 2	#nonseq. pairs	zscore	lower pvalue	upper pvalue	lower pvalue adj.	upper pvalue adj.
12	346	1296	0,461	0,29	0,71	1	0,9263
12	899	883	0,705	0,245	0,755	1	0,9551
13	152	3955	–0,731	0,75	0,25	1	0,4597
20	190	2357	–0,372	0,6525	0,3475	1	0,5826
20	417	3184	0,692	0,225	0,775	1	0,9551
26	190	3876	0,993	0,1425	0,8575	0,8835	0,9975
26	655	10663,5	2,486	0,01	0,99	0,1425	1
54	690	923	–1,288	0,9175	0,0825	1	0,1959
62	251	95	0,51	0,2875	0,7125	1	0,9263
67	96	2027	0,462	0,285	0,715	1	0,9263
69	70	2241	2,982	0	1	0,0713	1
69	144	967	2,582	0,01	0,99	0,1425	1
70	144	833	2,303	0,02	0,98	0,1832	1
76	490	1801	0,766	0,2125	0,7875	1	0,9551
80	215	13261,5	–2,369	0,995	0,005	1	0,0204
80	950	6321	–1,417	0,9325	0,0675	1	0,1749
152	252	1142	0,042	0,4875	0,5125	1	0,8063
189	360	43	0,188	0,395	0,605	1	0,8621
189	772	109	–0,85	0,785	0,215	1	0,4226
190	417	2157,5	–0,741	0,7725	0,2275	1	0,4323
215	1167	2031	–1,602	0,9525	0,0475	1	0,1425
255	256	2126	–1,048	0,845	0,155	1	0,3398
255	260	1285	–2,55	1	0,0025	1	0,0119
256	258	645	–1,441	0,9375	0,0625	1	0,1696
256	260	872	–1,368	0,9275	0,0725	1	0,1797
259	260	435	–2,326	0,9925	0,0075	1	0,0267
259	261	778	–1,861	0,99	0,01	1	0,0335
357	360	30,5	–1,517	0,95	0,05	1	0,1425
359	360	38	–0,549	0,6725	0,3275	1	0,5657
360	772	39	–0,674	0,7225	0,2775	1	0,4943
439	440	2456,5	–2,577	0,9925	0,0075	1	0,0267
439	441	1061	–2,935	1	0,0025	1	0,0119
439	444	1143,5	–3,434	1	0,0025	1	0,0119
440	441	1555,5	–2,671	0,995	0,005	1	0,0204
440	442	897	–2,686	1	0,0025	1	0,0119
440	444	1675,5	–2,976	1	0,0025	1	0,0119
441	442	387	–4,065	1	0,0025	1	0,0119
441	443	440	–3,437	1	0,0025	1	0,0119
441	444	721,5	–4,488	1	0,0025	1	0,0119
441	445	596	–2,144	0,9825	0,0175	1	0,0554
442	443	253	–3,58	1	0,0025	1	0,0119
442	444	416	–4,346	1	0,0025	1	0,0119
443	444	472	–4,301	1	0,0025	1	0,0119
444	445	640	–2,502	0,9975	0,0025	1	0,0119
501	570	21489	3,943	0	1	0,0713	1
501	1118	21300	1,398	0,08	0,92	0,57	1
570	681	26547,5	2,15	0,015	0,985	0,171	1
570	1118	13096,5	1,926	0,0225	0,9775	0,1832	1
572	1181	511	–0,81	0,785	0,215	1	0,4226
583	1237	3077	–0,998	0,8325	0,1675	1	0,3536
681	716	27238	–1,029	0,8575	0,1425	1	0,3249
681	1118	26316	0,042	0,475	0,525	1	0,8063
716	982	10836,5	0,057	0,4625	0,5375	1	0,8063
716	1118	13434,5	–0,024	0,5175	0,4825	1	0,7858
859	950	4892	0,429	0,3125	0,6875	1	0,9263
982	1118	10470,5	0,258	0,3975	0,6025	1	0,8621
1027	1176	4023,5	0,98	0,155	0,845	0,8835	0,9975

	mean ranks for subset I (15 site pairs)	mean ranks for subset II (17005 site pairs)	delta
data	234,7	8517,8	–8283,1
simulations	2182,4	8516,1	–6333,7

	mean ranks for subset I (15 site pairs)	mean ranks for subset II (17005 site pairs)	delta
data	530,5	8517,5	–7987,1
simulations	3021,4	8515,3	–5494,0

	mean ranks for subset I (15 site pairs)	mean ranks for subset II (17005 site pairs)	delta
data	3137,2	8515,2	–5378,0
simulations	5545,9	8513,1	–2967,2

	mean ranks for subset I (55 site pairs)	mean ranks for subset II (16965 site pairs)	delta
data	543,3	8536,3	–7993,0
simulations	3532,3	8526,6	–4994,3

site 1	site 2	# nonseq. pairs	z-score	lower p-value	upper p-value	lower p-value adj.	upper p-value adj.
18	681	34644,5	–1,979	0,9725	0,0275	0,98	0,2
222	501	20756	0,222	0,3825	0,6175	0,68	1
439	501	7667	–1,207	0,8875	0,1125	0,98	0,45
440	501	11220	0,422	0,325	0,675	0,665	1
440	681	13859	–0,599	0,7375	0,2625	0,98	0,6629
484	982	21249	4,304	0,0025	1	0,0133	1
501	675	22065	–1,868	0,9625	0,0375	0,98	0,2
501	677	36276	–0,526	0,71	0,29	0,98	0,6629
570	677	22309	4,782	0,0025	1	0,0133	1
614	653	2273	1,739	0,05	0,95	0,16	1
675	716	13924	–0,731	0,7725	0,2275	0,98	0,6629
675	1118	13452	0,417	0,3325	0,6675	0,665	1
677	681	44811	–2,043	0,98	0,02	0,98	0,2
677	716	22891	0,604	0,265	0,735	0,665	1
677	982	17847	3,543	0,0025	1	0,0133	1
677	1118	22115	1,869	0,0425	0,9575	0,16	1

site types	#FP	#total possible FP	expected FP frequency	observed FP frequency
(pos,pos)	7	180	0,0364	0,0843
(pos,neg)	7	400	0,0810	0,0843
(neg,neg)	0	190	0,0385	0
(pos,neu)	33	1200	0,2429	0,3976
(neg,neu)	6	1200	0,2429	0,0723
(neu,neu)	30	1770	0,3583	0,3614
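
The two frequency columns in the table above can be read as each count divided by its column total: the expected frequency as "#total possible FP" over the sum of that column, and the observed frequency as "#FP" over the total number of false positives. This reading is an assumption inferred from the numbers. A minimal sketch of the arithmetic follows; the dictionary layout and variable names are illustrative only.

```python
# Rows of the table above: site-pair type -> (#FP, #total possible FP).
counts = {
    ("pos", "pos"): (7, 180),
    ("pos", "neg"): (7, 400),
    ("neg", "neg"): (0, 190),
    ("pos", "neu"): (33, 1200),
    ("neg", "neu"): (6, 1200),
    ("neu", "neu"): (30, 1770),
}

total_fp = sum(fp for fp, _ in counts.values())
total_possible = sum(possible for _, possible in counts.values())

# Assumed definitions: each frequency is a share of its column total.
expected = {k: possible / total_possible for k, (_, possible) in counts.items()}
observed = {k: fp / total_fp for k, (fp, _) in counts.items()}

for pair_type in counts:
    print(pair_type, f"expected={expected[pair_type]:.4f}",
          f"observed={observed[pair_type]:.4f}")
```

Read this way, pairs that include one site from a positively interacting pair and one neutral site account for a larger share of the false positives than their share of possible pairs, while (neg,neu) and (neg,neg) pairs account for a smaller share.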

Abstract

Introduction

Results

Detecting interdependently evolving pairs of sites

Estimating the power of the method to detect epistasis

Phylogenetic analysis of SARS-CoV-2 spike

Concordantly evolving site pairs

Figure 1. Concordantly evolving sites in SARS-CoV-2 Spike protein.

Table 1. Concordantly evolving sites of the SARS-CoV-2 S-protein with FDR less than 10% for both reconstructed phylogenies (see Text).

Figure 2. Clusters of coevolving sites on the protein structure.

Figure 3. Concordantly evolving pair of sites 501 and 1118.

Figure 3—figure supplement 1. Concordantly evolving pair of sites 501 and 1118.

Discordantly evolving site pairs

Table 2. Discordantly evolving sites of the SARS-CoV-2 S-protein.

Figure 4. The Q677H is depleted on the background of N501Y.

Distinguishing between epistasis and non-epistatic episodic selection

Long-term coordinated evolution of Spike

Coordinated evolution of sites carrying VOC mutations

Figure 5. Coevolution of S-protein sites 484 and 655.

Figure 5—figure supplement 1. Concordantly evolving pair of sites 484 and 655.

Discussion

Materials and methods

Constructing the set of sequences

Phylogenetic analysis and inference of ancestral sequences

Concordantly and discordantly evolving pairs of sites

Simulation of independent evolution of sites

Simulation of positively and negatively epistatically evolved site pairs

Comparing the signal strength of coordinated evolution across site subsets

Comparing the sets of concordantly evolving pairs and DCA high scoring pairs

Acknowledgements

Appendix 1

Appendix 1—table 1. Levels of spurious signal of concordant evolution, inferred in the simulated dataset with no epistasis by four variants of the method.

Appendix 1—table 2. Levels of spurious signal of discordant evolution, inferred in the simulated dataset with no epistasis by four variants of the method.

Appendix 1—table 3. Numbers of truly and falsely predicted concordantly evolving pairs of sites for the simulated data with positively and negatively epistatically interacting sites.

Appendix 1—table 4. Predicted concordantly evolving pairs of sites for the simulated data with positively and negatively epistatically interacting sites.

Appendix 1—table 5. Numbers of truly and falsely predicted discordantly evolving pairs of sites for the simulated data with positively and negatively epistatically interacting sites.

Appendix 1—table 6. Predicted discordantly evolving pairs for the simulated data with positively and negatively epistatically interacting sites.

Appendix 1—table 7. Predicted concordantly evolving pairs for the ML phylogeny of S-gene reconstructed by IQ-TREE.

Appendix 1—table 8. Predicted concordantly evolving pairs for the MP phylogeny of S-gene reconstructed by USHER.

Appendix 1—table 9. Predicted discordantly evolving pairs for the MP phylogeny of S-gene reconstructed by USHER.

Appendix 1—table 10. Coordinated episodic selection in concordantly evolving pairs predicted for the phylogeny reconstructed by IQ-TREE.

Appendix 1—table 11. Coordinated episodic selection in discordantly evolving pairs predicted for the phylogeny reconstructed by IQ-TREE.

Appendix 1—table 12. Coordinated episodic selection in concordantly evolving pairs predicted for the phylogeny reconstructed by USHER.

Appendix 1—table 13. Coordinated episodic selection in discordantly evolving pairs predicted for the phylogeny reconstructed by USHER.

Appendix 1—table 14. Pairs of lineage-defining sites of Alpha VOC (subset I) tend to have stronger signal of concordant evolution than the complementary subset of site pairs (subset II).

Appendix 1—table 15. Pairs of lineage-defining sites of Beta VOC tend to have stronger signal of concordant evolution than the complementary subset of site pairs.

Appendix 1—table 16. Pairs of lineage-defining sites of Delta VOC tend to have stronger signal of concordant evolution than the complementary subset of site pairs.

Appendix 1—table 17. Pairs of lineage-defining sites of Gamma VOC tend to have stronger signal of concordant evolution than the complementary subset of site pairs.

Appendix 1—table 18. Pairs of lineage-defining sites of Omicron VOC.

Appendix 1—table 19. Pairs of lineage-defining sites of Alpha VOC tend to have stronger signal of concordant evolution than the complementary subset of site pairs even if all Alpha and related lineages are excluded from the analysis.

Appendix 1—table 20. No difference in the strength of concordant evolution is detected for pairs of lineage-defining sites of Beta VOC and the complementary subset of site pairs if all Beta and related lineages are excluded from the analysis.

Appendix 1—table 21. No difference in the strength of concordant evolution is detected for pairs of lineage-defining sites of Delta VOC and the complementary subset of site pairs if all Delta and related lineages are excluded from the analysis.

Appendix 1—table 22. Pairs of lineage-defining sites of Gamma VOC tend to have weaker signal of concordant evolution than the complementary subset of site pairs if all Gamma and related lineages are excluded from the analysis.

Appendix 1—figure 1. Numbers of predicted concordantly (A) and discordantly (B) evolving pairs for different nominal p-values in the simulated data in non-epistatic mode of evolution, compared to the null distribution for the ML phylogeny reconstructed by IQ-TREE.

Appendix 1—figure 2. Numbers of predicted concordantly (A) and discordantly (B) evolving pairs for different nominal p-values in the S-gene of SARS-CoV-2, compared to the null distribution for the ML phylogeny reconstructed by IQ-TREE.

Funding Statement

Contributor Information

Funding Information

Additional information

Competing interests

Author contributions

Additional files

Data availability

References

Editor's evaluation

Richard A Neher

Decision letter

Author response

Author response image 1. Site pairs detected as concordantly (I) or discordantly (II) evolving at an FDR of 30%. Blue: site from a true concordantly evolving pair; red: site from a true discordantly evolving pair; yellow: neutral site. Thick lines indicate true epistatic pairs.

Author response table 1. Observed frequencies of spurious detection of concordant evolution for different types of site pairs, compared to the expected (multinomial test p-value = 0).