Testing for genetic mutation of seasonal influenza virus

Vera Liu; Stephen Walker

doi:10.1080/02664763.2021.1978955

2021 Sep 29;50(1):1–18.

PMCID: PMC9754041 PMID: 36530774

Abstract

Influenza virus strains undergo genetic mutations every year, and these changes in genetic makeup pose difficulties for effective vaccine selection. To better understand the problem, it is important to statistically quantify the amount of genetic change between circulating strains from different years. In this paper, we propose applying the nonparametric crossmatch test to phylogenetic trees to assess the level of discrepancy between the flu virus strains circulating in two different years, with the viruses represented by phylogenetic trees. The crossmatch test has advantages compared to parametric tests in that it preserves more information in the data. The outcome of the test indicates whether the circulating influenza virus has mutated sufficiently in the past year to be considered a new population of virus, suggesting the need to consider a new vaccine. We validate the test on simulated phylogenetic tree samples with varying branch lengths, as well as on publicly available virus sequence data from the ‘Global Initiative on Sharing All Influenza Data’ (GISAID: https://www.gisaid.org/).

Keywords: Phylogenetic tree, mutation, non-parametric test, seasonal influenza, crossmatch test

1. Introduction

Two types of human influenza viruses,¹ types A and B, cause seasonal epidemics of disease almost every year (see ‘Centers for Disease Control and Prevention’²), with type A divided into subtypes, such as H1N1 and H3N2, which are based on the surface glycoproteins hemagglutinin (HA) and neuraminidase (NA); see, for example, [9].

HA mediates the entry of a virus into the cell and the corresponding antibody protects the body from illness, while NA facilitates the spread of a virus among the cells and its antibody generally reduces the severity of illness [25]. Accordingly, flu vaccines are developed each year to protect the general public against infections.³ A flu vaccine is made up of a collection of virus strains observed in previous years that are predicted to circulate in the upcoming flu season, and would therefore provide effective protection if the vaccine is administered ahead of time. For example, according to the World Health Organization, it has been suggested that the 2019/20 flu vaccine contains an A/Brisbane/02/2018 (H1N1)pdm09-like virus, an A/Kansas/14/2017 (H3N2)-like virus, a B/Colorado/06/2017-like virus, and a B/Phuket/3073/2013-like virus. The extent to which the vaccine is effective depends on whether there is a good match between the strains selected as parts of the vaccine and the currently circulating strains.

When analyzing type A flu strains, of particular interest is the protein HA, which binds to the target cell and becomes the principal target of human immune responses; see [9]. A large number of studies have been conducted on the properties of HA and its evolution and drift over time; see [11,23], for example. Virus strains are typically characterized both genetically⁴ and antigenically⁵. Antigenic characterizations rely on the hemagglutination inhibition (HI) assay, which quantifies the ability of antisera raised against one virus strain to block the agglutination of red blood cells by another strain, and this in turn measures the antigenic similarity between two strains of virus; see [15,21]. The resulting HI titers are then compared and summarized. For example, in the work of Barr et al. [1], the HI titer of strain A/Brisbane/10/2007 to itself is 5120, meaning that 5120 is the highest dilution of serum that can effectively inhibit hemagglutination; see [15].

Based on the HI titer between each pair of antigen and antiserum, summary statistics are used to characterize the difference between two antigens (virus strains). An example of this work is [5], which compares summary statistics such as the A-distance (mean HI value of the strain of interest against all antisera), the M-distance (mean HI value of the strain of interest against all antisera over a constrained time period), and the L-distance (maximum HI value of the strain of interest against all antisera).

Genetically, flu strain sequences are represented as binary phylogenetic trees and the similarities between strains are described by clades; see [1]. Genetic characterization serves as an important step in understanding the evolutionary path and is usually done before further analysis (including antigenic and/or three-dimensional structural analysis) is conducted. Some recent work includes [24,26]. Our proposed method assesses the level of genetic discrepancy between circulating virus strains of two different years based on a collection of phylogenetic trees. A specific example of how our method aids current genetic analysis is detailed in Section 5. In essence, the core problem is to assess accurately (and with corresponding confidence intervals) the degree of overlap between two populations, each containing circulating strains from one particular year. If little overlap is found, the claim is that the flu strains have mutated sufficiently to be considered a new population that is distinct from previous years. Among other implications, this could indicate that the yearly vaccine should be updated. The rest of the paper focuses on a statistical testing method that quantifies the degree of overlap.

The paper is organized as follows. Section 2 introduces the mathematical background for the test, followed in Section 3 by some background on phylogenetic trees. Section 4 describes simulation results which provide validation for the method. In Section 5, the test is implemented on publicly available influenza sequence data from the ‘Global Initiative on Sharing All Influenza Data’ (GISAID). We conclude in Section 6 with a brief discussion and some pointers to future research.

2. Statistical foundation

In the following we describe the background to the test introduced in [20] which tests for the equality of random mechanisms generating data from two groups. It is universal and only requires a metric or distance to be defined on the relevant space where the data are observed. In our particular interest, the space is that of phylogenetic trees and the metric is described in [3].

Consider two potentially different sets of numbers on the real line, not necessarily equal in size. A matrix can be constructed from the pairwise distances between all points. This symmetric matrix can be partitioned into four blocks: the two diagonal blocks contain the pairwise distances between points within each group, and the two off-diagonal blocks contain the pairwise distances between points from different groups. To be more concrete, if $(X_{i})_{i = 1}^{n}$ are the points from one group, and $(Y_{j})_{j = 1}^{m}$ those from the other, then

G = (\begin{matrix} d (X_{i}, X_{i^{'}})_{i, i^{'} = 1 : n} & d (X_{i}, Y_{j})_{i = 1 : n, j = 1 : m} \\ d (Y_{j}, X_{i})_{j = 1 : m, i = 1 : n} & d (Y_{j}, Y_{j^{'}})_{j = 1 : m, j^{'} = 1 : m} \end{matrix}) .

The crossmatch test statistic is based on a pairing of the $n + m$ points, read off the $(n + m) \times (n + m)$ matrix, which minimizes the total distance between paired points. To expand on this, let $Z_{1 : n} = X_{1 : n}$ and $Z_{n + 1 : n + m} = Y_{1 : m}$ , with $n + m$ even so that every point has a partner. The crossmatch algorithm finds the self-inverse permutation $\hat{σ}$ on $\{1, \dots, n + m\}$ minimizing

D (σ) = \sum_{i = 1}^{n + m} d (Z_{i}, Z_{σ (i)})

subject to $σ (i) \neq i$ and $σ (σ (i)) = i$ . The crossmatch test statistic is then given by

T = \sum_{i = 1}^{n} 1 (\hat{σ} (i) > n);

i.e. it counts the number of pairs in which a point from the first group is matched with a point from the other group. The null hypothesis is rejected if T is too small: if, for example, T = 0, then no X is matched with any Y, so the groups do not overlap and are hence different.

On the other hand, if the random mechanism for generating the points is the same for both groups, then, as Rosenbaum [20] shows, it is possible to find the distribution of T under the null assumption. For example, under the null assumption, the probability that $X_{1}$ is paired with a point from the other group is $m / (N - 1)$ , where N = n + m. This follows since there are m points from the other group and N−1 possible points to be paired with. Hence, $1 (\hat{σ} (1) > n)$ is a Bernoulli random variable with probability of being 1 equal to $m / (N - 1)$ . Hence, as T is the sum of n such variables, $E (T) = n m / (N - 1)$ . The $1 (\hat{σ} (i) > n)$ , for $i = 1, \dots, n$ , are not independent; however, it remains straightforward to compute the variance of T, which is given by

Var (T) = \frac{2 n (n - 1) m (m - 1)}{(N - 3) (N - 1)^{2}} .

From these two moments one can at least use a normal approximation to T, and hence calculate a critical region for any desired type I error.
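The mean, variance and normal approximation above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the R package; the function names `crossmatch_null_moments` and `normal_cutoff` are hypothetical.

```python
from statistics import NormalDist


def crossmatch_null_moments(n, m):
    """Mean and variance of T under the null, using the formulas in the text."""
    N = n + m
    mean = n * m / (N - 1)
    var = 2 * n * (n - 1) * m * (m - 1) / ((N - 3) * (N - 1) ** 2)
    return mean, var


def normal_cutoff(n, m, alpha):
    """One-sided cutoff from the normal approximation: small T rejects the null."""
    mean, var = crossmatch_null_moments(n, m)
    z = NormalDist().inv_cdf(alpha)  # negative for alpha < 0.5
    return mean + z * var ** 0.5


mean, var = crossmatch_null_moments(100, 100)
print(mean, var, normal_cutoff(100, 100, 0.01))
```

With n = m = 100 the mean is about 50.25 and the variance about 25.1, so a one-sided 1% cutoff sits a little below 39 crossmatches.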

When X and Y lie in a Euclidean space, e.g. $R^{k}$ , and the distance is Euclidean, i.e.

d (X, Y) = \sqrt{\sum_{j = 1}^{k} (X_{j} - Y_{j})^{2}},

then

T / N \to π (1 - π) \int \frac{p (x) q (x)}{π p (x) + (1 - π) q (x)} d x,

where $N = n + m$ , the $(X_{i})_{i = 1 : n}$ are i.i.d. from density p, the $(Y_{i})_{i = 1 : m}$ are i.i.d. from density q, and $n / (n + m) \to π$ as $n + m \to \infty$ . So, if p = q, the integral equals 1 and $T / N \to π (1 - π)$ , or equivalently $T / n \to (1 - π)$ , in agreement with $E (T) = n m / (N - 1)$ . For more on the asymptotics of the test under the alternative, see [6].
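To get a feel for the integral in this limit, it can be evaluated numerically for a simple pair of densities. The sketch below assumes p = N(0, 1) and q = N(shift, 1), which are illustrative choices only, and uses a plain trapezoid rule; `overlap_integral` is a hypothetical helper name.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def overlap_integral(shift, pi=0.5, lo=-10.0, hi=10.0, steps=20000):
    """Trapezoid-rule value of the integral of p*q / (pi*p + (1-pi)*q),
    with p = N(0, 1) and q = N(shift, 1) (assumed densities, shift >= 0)."""
    hi = hi + shift
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        p, q = normal_pdf(x), normal_pdf(x, mu=shift)
        f = p * q / (pi * p + (1.0 - pi) * q)
        total += f * (0.5 if k in (0, steps) else 1.0)
    return total * h


for shift in (0.0, 1.0, 3.0):
    print(shift, overlap_integral(shift))
```

When the two densities coincide the integral equals 1, and it shrinks as the densities separate, which is why a small number of crossmatches signals little overlap.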

Here we present a simple illustration of the test statistic. Suppose n = m = 2 and $x_{1} = 1, x_{2} = 3$ while $y_{1} = 2, y_{2} = 4$ . The two groups clearly overlap, so it is plausible that they share the same random mechanism. The distance matrix is given by

M = (\begin{matrix} 0 & 2 & \underline{1} & 3 \\ 2 & 0 & 1 & \underline{1} \\ \underline{1} & 1 & 0 & 2 \\ 3 & \underline{1} & 2 & 0 \end{matrix}) .

The self-inverse permutation sought can be represented by selecting one off-diagonal element in each row and column, symmetrically, so that the sum of the selected elements is minimized. In the matrix M this is clearly achieved by the underlined 1s. Hence $\hat{σ} (1) = 3$ , $\hat{σ} (2) = 4$ , $\hat{σ} (3) = 1$ and $\hat{σ} (4) = 2$ , and so T = 2. Since 2 is the maximum value T can take here, it is not possible to reject the null hypothesis.
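The worked example can be checked by brute force. The sketch below enumerates every pairing of the pooled points and keeps the one with the smallest total distance; this is only practical for very small samples, and it is not the matching algorithm used by the R package. The function names are hypothetical, and indices are 0-based.

```python
def perfect_matchings(items):
    """Yield every way of splitting `items` (even length) into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail


def crossmatch_statistic(xs, ys, dist):
    """Return (T, pairing) for the minimum-total-distance pairing."""
    pooled = list(xs) + list(ys)
    n = len(xs)
    if len(pooled) % 2:
        raise ValueError("this sketch needs an even pooled sample size")
    best = min(
        perfect_matchings(list(range(len(pooled)))),
        key=lambda pairs: sum(dist(pooled[i], pooled[j]) for i, j in pairs),
    )
    # A pair is a crossmatch when exactly one of its members is in group 1.
    t = sum(1 for i, j in best if (i < n) != (j < n))
    return t, best


t, pairing = crossmatch_statistic([1, 3], [2, 4], lambda a, b: abs(a - b))
print(t, pairing)  # 2 [(0, 2), (1, 3)]
```

The pairing (0, 2), (1, 3) corresponds to matching $x_{1}$ with $y_{1}$ and $x_{2}$ with $y_{2}$ , as in the underlined entries of M.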

The strength of the test is that the distribution of T under the null is the same regardless of where the data come from. Hence the X and Y can come from any (possibly non-Euclidean) space and the distribution of T remains the same under the null hypothesis, which is that the random mechanism generating the data is the same for both groups.

The R package implementing the crossmatch test is available online at https://cran.r-project.org/web/packages/crossmatch/crossmatch.pdf.

3. Phylogenetic trees

Phylogeny is concerned with the evolution of groups and specifically with the lines of descent and relationships among groups. It is one of the best tools for understanding the evolution of pathogens. A phylogenetic tree is a diagram depicting a phylogeny through lines of evolutionary descent from a common ancestor. Throughout this article, the phrases ‘phylogenetic tree’ and ‘evolutionary tree’ are used interchangeably. Influenza viruses are constantly changing, undergoing genetic changes over time, and monitoring these changes in the genome is fundamental to the production of vaccines on a seasonal basis. The RNA genes of influenza are made up of nucleotides. It is the composition of these nucleotides, and the differences between them, which account for the different viruses. The differences and ancestry of viruses are demonstrated through the use of a phylogenetic tree. The tree shows how different viruses are related to each other, with viruses grouped together based on how close their corresponding nucleotides are. Specifically, the phylogenetic trees of influenza viruses will usually display how similar the viruses' hemagglutinin (HA) or neuraminidase (NA) genes are to one another. The tree consists of branches and branch lengths. Groups on the same branch share the same nucleotides. At a split of a branch, the length of the branch indicates how different from each other the groups are (i.e. the number of nucleotide differences).

3.1. Constructing phylogenetic trees

Constructing a phylogenetic tree from RNA sequences is not straightforward. Each analysis requires an optimality criterion to assess model fitness. Even once a criterion has been adopted, implementing an algorithm to find an optimal tree is far from easy. Criteria for tree construction fall into three categories [13]:

Parsimony: This approach is based on Occam's razor. The evolutionary tree with the simplest explanation is selected. The determining factor for the simplest explanation is the number of evolutionary changes characterized by the tree.
Maximum likelihood: This approach provides a statistical framework for phylogenetic reconstruction. A probability model is formulated which provides the likelihood of observing the data given a particular tree. The tree selected is the one which maximizes the likelihood.
Minimum evolution: This criterion is based on the distance from one taxon to another along the tree. A distance matrix between sequences leads to the construction of the tree, whereby branch lengths represent evolutionary distance between nodes. The optimal tree is the one which minimizes the least squares distance, that is, the sum of squared differences between the distances within the matrix and the corresponding distances within the tree. Its consistency and mathematical basis are demonstrated by Rzhetsky and Nei [18].

Algorithms for finding the optimal tree in each case are available. Further information can be obtained from [2]. The following sections of this paper utilize an improved (computationally more efficient) method based on the minimum evolution criterion, described by Desper and Gascuel [8], to find the optimal phylogenetic trees.

3.2. Distance between phylogenetic trees

Describing a geometry for spaces of phylogenetic trees has not historically been a simple objective. Recently, in [3], one has been presented for binary trees with a fixed number of leaves. These are trees which descend from a single node, bifurcate at lower nodes, and end at terminal nodes, also known as leaves. These leaves determine an RNA sequence or more generally an operational taxonomic unit (OTU).

The geometry provides a natural metric between trees, which allows for averaging procedures over trees and for representing uncertainty within a set of trees describing a specific data set. Indeed, statistical analyses of any type need to work with the notion of distance between the objects being estimated, usually labeled as parameters. As we are working with a statistical analysis of phylogenetic trees, a metric is essential.

As a simple illustration, the space of phylogenetic trees with 3 leaves has 3 members. The leaves in Figure 1 are labeled left to right as $(1, 2, 3)$ . The other two members have leaves reading $(2, 1, 3)$ and $(3, 1, 2)$ . Note that $(1, 3, 2)$ is regarded as the same tree as $(1, 2, 3)$ .
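The count of three trees for 3 leaves is an instance of a standard result: the number of rooted binary tree topologies on n labelled leaves is the double factorial (2n − 3)!!. The short sketch below evaluates it; the function name is hypothetical and the formula is quoted as a known fact rather than taken from the paper.

```python
def rooted_binary_tree_count(n_leaves):
    """Number of rooted binary tree topologies on n labelled leaves: (2n-3)!!."""
    count = 1
    for k in range(3, 2 * n_leaves - 2, 2):  # 3, 5, ..., 2n-3
        count *= k
    return count


for n in (3, 4, 5, 6):
    print(n, rooted_binary_tree_count(n))
```

The rapid growth of this count with the number of leaves is one reason to prefer many small trees over a few large ones when a space of trees has to be explored.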

The long outside edge indicates the evolutionary distance between leaves 1 and 2; and also 1 and 3. The shorter inner edge length indicates the distance between leaves 2 and 3.

To illustrate how the distance between two trees can be computed, we consider two trees each with 3 leaves, both with one long edge and three short edges. For simplicity, assume that the length of the longer edge is 2 and the rest are 1, as shown by Figure 2.

Figure 2. — Two different phylogenetic trees with 3 leaves: (a) Phylogenetic tree 1 and (b) Phylogenetic tree 2.

Each leaf has a binary representation with a 1 indicating the leaf label; hence each leaf comprises a single 1 and the rest 0. As we move up the tree and pass a node, the binary sequences combine to yield a new binary sequence describing the relevant edge. The distance is made up of the differences between the lengths of edges with the same binary sequence, and the sum of the edge lengths for different binary sequences.

To calculate the distance between two trees, we consider edges with the same binary sequence and edges with different binary sequences, respectively. In this example, e denotes an edge and is double indexed (for example, $e_{2, 1}$ denotes the first edge in tree 2). The second index is arbitrary and used for notational purposes only. In every tree, each edge is uniquely described by its binary sequence. For example, in tree 2, edge $e_{2, 1}$ has the binary sequence $(0, 1, 1)$ , meaning that both leaf 2 and leaf 3 are its offspring, but not leaf 1.
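The binary sequences of the edges can be read off a tree mechanically. The sketch below assumes a hypothetical encoding in which a tree is a nested tuple of 1-based leaf labels, so that `(1, (2, 3))` is the tree in which leaves 2 and 3 share a parent; this encoding is an illustration and is not taken from the paper.

```python
def edge_sequences(tree, n_leaves):
    """Return the binary sequence of every edge in a rooted tree.

    `tree` is a leaf label (int, 1-based) or a tuple of subtrees; each edge is
    identified by the indicator vector of the leaves descending from it.
    """
    sequences = []

    def leaves_below(node):
        if isinstance(node, int):
            return {node}
        members = set()
        for child in node:
            members |= leaves_below(child)
        return members

    def visit(node, is_root):
        if not is_root:  # the root has no edge above it
            below = leaves_below(node)
            sequences.append(tuple(int(k in below) for k in range(1, n_leaves + 1)))
        if not isinstance(node, int):
            for child in node:
                visit(child, False)

    visit(tree, True)
    return sequences


print(edge_sequences((1, (2, 3)), 3))
# [(1, 0, 0), (0, 1, 1), (0, 1, 0), (0, 0, 1)]
```

The sequence (0, 1, 1) is the internal edge whose offspring are leaves 2 and 3, matching the description of $e_{2, 1}$ above.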

Edge $e_{1, 2}$ and edge $e_{2, 4}$ have the same binary sequence: both have only leaf 1 as offspring. In this case, we simply compute their difference in length (here 2 minus 1). The same logic applies to edges $e_{1, 3}$ and $e_{2, 3}$ , and to edges $e_{1, 4}$ and $e_{2, 2}$ . For the pair of edges that have different binary sequences, here $e_{1, 1}$ and $e_{2, 1}$ (meaning that they have different offspring combinations), we take the sum of the edge lengths, which is 2 in this case.

Putting this all together, the distance D is given by

\begin{aligned} D_{tree 1, tree 2} & = \sqrt{| e_{1, 2} - e_{2, 4} |^{2} + | e_{1, 3} - e_{2, 3} |^{2} + | e_{1, 4} - e_{2, 2} |^{2} + (e_{1, 1} + e_{2, 1})^{2}} \\ & = \sqrt{1^{2} + 0^{2} + 1^{2} + 2^{2}} = \sqrt{6} . \end{aligned}

Further details and more efficient algorithms for trees with more leaves are provided in [7].
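The arithmetic of the worked example can be reproduced with a small function that matches edges by their binary sequences. This is a sketch of the rule used in the example only: it computes the length of the path that shrinks all unmatched edges to zero before growing the new ones, which is the distance in the 3-leaf example but not a general geodesic algorithm such as the one in [7]. The dictionary encoding, the function name, and the assignment of binary sequences to leaves 2 and 3 and to the internal edge of tree 1 are assumptions, since Figure 2 is not reproduced here.

```python
import math


def worked_example_distance(tree_a, tree_b):
    """Distance rule from the worked example (not a general geodesic algorithm).

    Each tree is a dict mapping an edge's binary sequence to its length.
    """
    shared = tree_a.keys() & tree_b.keys()
    # Edges with the same binary sequence contribute their squared difference.
    squared = sum((tree_a[s] - tree_b[s]) ** 2 for s in shared)
    # Unmatched edges: norm within each tree, then added across the two trees.
    norm_a = math.sqrt(sum(v ** 2 for s, v in tree_a.items() if s not in shared))
    norm_b = math.sqrt(sum(v ** 2 for s, v in tree_b.items() if s not in shared))
    squared += (norm_a + norm_b) ** 2
    return math.sqrt(squared)


# One long edge (length 2) and three short edges (length 1) per tree.
tree1 = {(1, 0, 0): 2, (0, 1, 0): 1, (0, 0, 1): 1, (1, 1, 0): 1}
tree2 = {(1, 0, 0): 1, (0, 1, 0): 1, (0, 0, 1): 2, (0, 1, 1): 1}
print(worked_example_distance(tree1, tree2), math.sqrt(6))
```

With one unmatched edge of length 1 in each tree, the last term is (1 + 1)² = 4, which together with the matched edges gives √6.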

3.3. Interpreting phylogenetic trees

As mentioned in Section 1, researchers are interested in both genetic and antigenic characterization of virus strains. Phylogenetic trees focus on genetic characterization (analysis based on gene sequence) and support tasks such as:⁶

Analyzing how closely different virus strains are related to one another, in terms of their gene sequences.
Monitoring the change of the virus sequence over time and geography.
Assessing to what extent the influenza vaccine strains demonstrate genetic similarity to the circulating strains.

The examples in this paper are based on minimum evolution trees. The branch lengths on a minimum evolution tree contain information on the distance from each taxon to another [13] and help us to answer the crucial question: are the distances between a flu strain and the vaccine statistically larger than the distances between two flu strains? If the answer is ‘yes’, we may conclude that the vaccine is genetically different from flu strains overall and thus will offer questionable protection. We will show how the crossmatch test allows us to quantify the similarity (or difference) between virus sequence and flu vaccine.

4. Simulation study

In this section we consider two simulation studies. In the first, we generate phylogenetic trees with varying branch lengths and varying numbers of leaves. In the second, we construct the trees by taking DNA sequences and performing random mutations on them. The overall goal is to give an idea of the power and properties of the test under different scenarios.

4.1. Trees with varying branch lengths and number of branches

In the first case, we are concerned with the test achieving pre-defined type one error values under the null assumption. We do this by generating trees with the same random mechanism and splitting into two groups, and we consider four different tree generating mechanisms, as follows:

Population 1: each branch length randomly drawn from the uniform distribution on $(0, 1)$ .
Population 2: each branch length randomly drawn from $Beta (2, 2)$ (symmetric, with the same center as population 1).
Population 3: each branch length randomly drawn from $Beta (5, 2)$ (asymmetric, with a different center from population 1).
Population 4: each branch length set at 1 (fixed branch length).

Specifically, for each of the above cases, we simulate N = 200 binary phylogenetic trees with four tips (and thus 6 branches) with corresponding branch lengths. These phylogenetic trees are then randomly split into two groups, each of size n = m = 100 observations. Thus, we have two samples of phylogenetic trees that are constructed under the same mechanism. We repeat the process and record the number of crossmatches each time. We then compare the observed rejection rate with the pre-defined type one error to see that they match.
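A scaled-down version of this null experiment can be run with the standard library alone. The sketch below makes several simplifying assumptions that are not in the paper: it uses n = m = 4 instead of 100 so that the optimal pairing can be found by brute force, it treats every tree as a vector of 6 branch lengths on one fixed topology, and it uses the Euclidean distance between those vectors as a stand-in for the tree metric of [3].

```python
import math
import random


def perfect_matchings(items):
    """Yield every split of `items` (even length) into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for k, partner in enumerate(rest):
        for tail in perfect_matchings(rest[:k] + rest[k + 1:]):
            yield [(first, partner)] + tail


def simulate_null(n=4, m=4, branches=6, reps=1000, seed=1):
    """Simulate T when both groups share one branch-length mechanism."""
    rng = random.Random(seed)
    matchings = list(perfect_matchings(list(range(n + m))))
    counts = []
    for _ in range(reps):
        # Each "tree" is a vector of Unif(0, 1) branch lengths; the first n
        # vectors are labelled group 1 and the remaining m group 2.
        trees = [[rng.random() for _ in range(branches)] for _ in range(n + m)]
        best = min(
            matchings,
            key=lambda pairs: sum(math.dist(trees[i], trees[j]) for i, j in pairs),
        )
        counts.append(sum(1 for i, j in best if (i < n) != (j < n)))
    return counts


counts = simulate_null()
print(sum(counts) / len(counts), 4 * 4 / (4 + 4 - 1))
```

The simulated average number of crossmatches should sit close to nm/(N − 1) = 16/7, whatever distribution is used for the branch lengths, which is the distribution-free property the full-scale simulation checks.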

Figure 3 gives the distribution of crossmatches under the null hypothesis where both groups of trees are generated with branch length $Unif (0, 1)$ .

Under the null, Rosenbaum [20] demonstrated that the distribution of the number of crossmatches, and hence the corresponding critical values, can be computed analytically. In particular, assuming both m and n are even, the number of crossmatches T has probability

P (T = t) = \frac{2^{t} \times (N / 2)!}{\binom{N}{n} \, t! \, [(m - t) / 2]! \, [(n - t) / 2]!} .

For example, the probability that there are 38 crossmatch pairs is $2^{38} \times 100! / [\binom{200}{100} \, 38! \, 31! \, 31!] = 0.008$ . Such probabilities are used to set up a critical value with a corresponding type I error. For example, we have $P (T \leq 38 \mid null) = 1.2\%$ and $P (T \leq 42 \mid null) = 7.3\%$ , which are the theoretical type one error rates for the critical values 38 and 42. We then simulate under various null scenarios to check that the actual type one error rates match. The results are shown in Table 1. We do not test against traditional type one error rates such as 5% because, with n = m = 100, no critical value gives a type one error of exactly 5%.
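The exact null probabilities can be evaluated with integer arithmetic, which avoids overflow in the large factorials. The sketch below is a direct transcription of the formula for P(T = t); the function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def crossmatch_pmf(t, n, m):
    """Exact null probability P(T = t) for even n and m."""
    if t < 0 or t > min(n, m) or (n - t) % 2 or (m - t) % 2:
        return Fraction(0)  # T must have the same parity as n and m
    N = n + m
    numerator = 2 ** t * factorial(N // 2)
    denominator = (comb(N, n) * factorial(t)
                   * factorial((m - t) // 2) * factorial((n - t) // 2))
    return Fraction(numerator, denominator)


def crossmatch_cdf(t, n, m):
    """Exact null probability P(T <= t)."""
    return sum(crossmatch_pmf(k, n, m) for k in range(t + 1))


print(float(crossmatch_pmf(38, 100, 100)))
print(float(crossmatch_cdf(38, 100, 100)), float(crossmatch_cdf(42, 100, 100)))
```

For n = m = 100 these evaluate to roughly 0.008, 0.012 and 0.073, in line with the values quoted above. Because n and m are even, T only takes even values, which is why the attainable type one error rates jump between 1.2% and 7.3% with nothing available near 5% at the critical values 38 to 42.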

Table 1.

Type one error rate under multiple null scenarios closely matches the pre-defined type one error rate; based on K = 10,000 repetitions each.

Theoretical rate	Unif (0,1)	Beta (2,2)	Beta (5,2)	Fixed at 1
Type one error = 1.2%	1.15%	1.24%	1.38%	1.13%
Type one error = 7.3%	6.9%	7.25%	6.96%	7.34%

The distribution of the statistic under the null assumption can be approximated by a normal distribution with the expectation and variance given in Section 2. Here we look at the discrepancy between the theoretical and observed type I error that results from approximating a discrete distribution by a normal distribution. To this end, we now take the sample size to be $n = m = 1000$ and set the type I errors to the traditional 1%, 2.5% and 5% values. For a one-tailed test (clearly we are interested in the case where the number of crossmatches is below a certain threshold), the critical z-scores are $- 2.33$ , $- 1.96$ and $- 1.64$ , respectively. This leads to the rejection of the null if the number of crossmatches is below or equal to 463, 469, and 474 pairs (out of 1000 possible pairs), respectively. Each case is repeated K = 10,000 times and the number of occasions on which the null is wrongly rejected is recorded. Table 2 shows that in each case, the actual type I error rate closely matches the pre-defined type I error under the normal approximation.

Table 2.

Type one error rate under multiple null scenarios closely matches the normal approximation type one error rate, with K = 10,000 iterations each.

Theoretical rate	$Unif (0, 1)$	$Beta (2, 2)$	$Beta (5, 2)$	Fixed at 1
1%	0.90%	0.83%	0.80%	0.95%
2.5%	2.43%	2.22%	2.28%	2.60%
5%	5.26%	5.40%	5.34%	5.43%

Figure 4 provides some illustrations of phylogenetic trees from populations 1 and 4.

Under the alternative, we conduct separate crossmatch tests of populations 2–4 against population 1 and assess how well the test picks up the structural difference between populations 1 and 2, 1 and 3, and 1 and 4. We repeat each of the above simulations K = 100 times and approximate the distribution of the number of crossmatches, as shown in Figure 5.

Figure 5. — Crossmatch distribution under three different alternative scenarios. (a) Alternative: branch lengths are fixed at 1. (b) Alternative: branch length $\sim Beta (5, 2)$ . (c) Alternative: branch length $\sim Beta (2, 2)$ .

Under each of the alternative cases, we see far fewer crossmatches out of 100 pairs compared to the null. The expected number of crossmatches under the assumption that the groups are identical is $n m / (N - 1) = 100^{2} / 199 = 50.25$ , which matches Figure 3 but clearly does not match Figure 5. Noticeably, the distribution differs across the alternative scenarios. Unsurprisingly, the comparison between trees with branch length $\sim Unif (0, 1)$ and trees with branch length $\sim Beta (2, 2)$ produces the most cross-matched pairs, as the generating mechanisms of these two groups are similar, though still different enough that the expected value 50.25 does not appear on the histogram.

We then focus on the population where all branch lengths $\sim Unif (0, 1)$ and the population where all branch lengths $\sim Beta (2, 2)$ , and investigate whether varying the number of leaves on each tree yields different crossmatch distributions. We are only interested in a relatively small number of leaves because, in the eventual application, we construct many small trees from the available flu virus sequences. Figure 6 is to be compared with panel (c) of Figure 5, where the alternative branch generating mechanism is the same and only the number of leaves differs. Table 3 shows the summary statistics under multiple alternatives.

Figure 6. — Crossmatch distribution under varying number of leaves. (a) Branch length from Beta(2,2) and Unif(0,1). (b) Branch length from Beta(2,2) and Unif(0,1).

Table 3.

Summary statistics of the crossmatch test with varying number of leaves.

	4 leaves	5 leaves	6 leaves
$E (T)$	17.78	33.32	41.52
$Var (T)$	47.43	59.00	51.24

If we set the type one error at 0.01 for a one-sided test, then with $E (T) = 50.25$ and $Var (T) = 25.12$ under the null distribution, we reject the null if the number of crossmatches is less than 38.60 (z-score = $- 2.33$ ). From the above simulations with 4, 5, and 6 leaves under this particular alternative, we notice that only with 4 leaves do we obtain strong power.
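A rough sense of the power in each case can be obtained from the summary statistics in Table 3. The sketch below assumes, purely for illustration, that T is approximately normal under the alternative with the tabulated mean and variance; this assumption is not made in the paper, and the resulting figures are only indicative.

```python
from statistics import NormalDist

# (E(T), Var(T)) for 4, 5 and 6 leaves, copied from Table 3.
table3 = {4: (17.78, 47.43), 5: (33.32, 59.00), 6: (41.52, 51.24)}
cutoff = 38.60  # rejection threshold quoted in the text


def approximate_power(mean, var, cutoff):
    """P(T < cutoff) if T were normal with the given mean and variance."""
    return NormalDist(mu=mean, sigma=var ** 0.5).cdf(cutoff)


power = {leaves: approximate_power(mu, var, cutoff)
         for leaves, (mu, var) in table3.items()}
for leaves, value in power.items():
    print(leaves, round(value, 2))
```

Under this approximation the rejection probability is close to 1 with 4 leaves, noticeably lower with 5 leaves, and below one half with 6 leaves, consistent with the observation that only the 4-leaf case gives strong power.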

From this section, we conclude that the tree generating mechanism and number of leaves generated both have a strong influence on the power of the test.

4.2. Trees constructed from simulated mutations

We further validate the method on phylogenetic trees constructed from simulated mutations. Based on known mitochondrial DNA sequences of Homo sapiens and Pan from [12], shown in Table 4, we simulate mutations at random sites in each. We then construct two types of phylogenetic tree. One tree population contains phylogenetic trees with five tips, all from variations of Homo sapiens, and the other contains phylogenetic trees with five tips from a mixture of variations of Homo sapiens and Pan; see Figures 7–9. Here, the two distinct populations of interest are Homo sapiens and Pan, and we show that the two types of trees constructed in this way are noticeably different.
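The mutation step can be sketched as follows. The mutation model here (a uniformly chosen different base at each of a fixed number of distinct random sites) is an assumption, since the text only says that mutations are simulated at random sites; the sequences used are just the visible prefixes from Table 4, and the names `mutate` and `hamming` are hypothetical.

```python
import random

BASES = "ACGT"


def mutate(sequence, n_sites, rng):
    """Substitute a different base at `n_sites` distinct random positions."""
    seq = list(sequence)
    for pos in rng.sample(range(len(seq)), n_sites):
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)


def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Visible prefixes of the two sequences in Table 4 (the table elides the rest).
H_PREFIX = "AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCA"
P_PREFIX = "AAGCTTCACCGGCGCAATTATCCTCATAATCGCCCA"

rng = random.Random(0)
variant = mutate(H_PREFIX, 5, rng)
print(hamming(H_PREFIX, variant))  # 5
```

Variants produced this way from one species stay close to each other, while a variant of the other species carries the additional between-species differences, which is what produces the longer branch in the mixed trees.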

Table 4.

Segment of known mitochondrial DNA sequences for two species. There are in total 80 site differences out of 898 bases ([12]).

H	AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCA…GCTT
P	AAGCTTCACCGGCGCAATTATCCTCATAATCGCCCA…GCTT

Figure 7. — The construction of two types of phylogenetic trees.

Figure 9. — Examples of two types of trees. (a) Trees from $M_{1}$ . (b) Trees from $M_{2}$ .

By construction, trees from $M_{2}$ differ from trees from $M_{1}$ by the presence of a single noticeably longer branch.

A closer look at the distribution of pairwise distances provides more intuition. Figure 10 shows the distribution of the $100^{2}$ pairwise distances across the two groups, and of the $2 \times \binom{100}{2} = 9900$ pairwise distances within the groups. The much larger overall distance between trees from different populations explains why the crossmatch test finds zero matches across groups.

5. H1N1 hemagglutinin (HA) sequences

In this section our goal is to quantify the genetic discrepancy of circulating influenza viruses across different years, in particular the HA sequences in H1N1 viruses. In most studies on influenza viruses, researchers analyze the genetic similarity (and hence signs of mutation and drift) between different virus strains with respect to varying time periods or geographical locations. Some recent examples include [16,19,22,24,26]. Virus sequences are usually put into a large phylogenetic tree where their evolutionary relationships are studied. On the other hand, our method quantifies genetic variation using several virus strains to construct multiple smaller phylogenetic trees. The degree of genetic difference (implied by the number of nucleotide differences) between each strain is represented by the length of the branches⁷.

The crossmatch test is conducted between group 1, where all the strains are from the same year, and group 2, where one strain is substituted by a sequence from another year. Some tree examples are shown in Figure 11. If enough genetic variation is present between these two time periods, a noticeably longer branch will be present in trees from the second group. This results in structural differences between the two groups of trees, which are detected by the crossmatch test. This allows us to say whether there is a significant difference in genetic makeup across time.

Figure 11. — By substituting one tip of a 2013 virus strain by a strain from 2016, a significantly longer branch is observed, signifying more genetic change across years. The crossmatch test detects the difference in tree structures and hence detects significant genetic variation. (a) Phylogenetic trees where all tips are H1N1 HA sequence from 2013. (b) Phylogenetic trees where tips are a mixture of H1N1 HA sequence from 2013 and 2016.

Virus strains are collected from GISAID. Information included in each strain is the year and location of occurrence, type and lineage (for example, A/Bosnia and Herzegovina/7356/2019, of type H1N1 and lineage pdm09), together with the corresponding HA sequence, typically of length between 1650 and 1750 bases.

Most existing inference methods based on phylogenetic trees construct a single tree from hundreds of available strains and describe the genetic similarity of two strains based on whether they are grouped into the same cluster within the tree; see [17], for example. In contrast, we construct multiple smaller phylogenetic trees, each using only several strains, because our interest lies in quantifying the overall difference between different years, instead of a simple comparison between any two specific strains. This construction also provides the possibility of a statistical test with a well-defined critical region and significance level.

In our case, two groups of phylogenetic trees, each containing 100 individual trees, are constructed using the existing R function ‘fastme.ols’ within the R package ‘ape’, which performs the minimum evolution algorithm [8]. Group I consists of trees with all tips coming from randomly selected circulating flu strains in a particular year, while Group II consists of trees with one tip selected from a different year and all remaining tips coming from the same year as in Group I. We illustrate the difference between the two groups with some examples.
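The sampling design for the two groups can be sketched independently of the tree-building step. In the sketch below the strain identifiers are hypothetical placeholders for GISAID records, the number of tips per tree (5) is an assumption, and the function name is made up for illustration; each sampled list of strains would then be passed to a tree-construction routine such as ‘fastme.ols’.

```python
import random


def sample_tip_sets(base_year, other_year, n_trees, n_tips, rng):
    """Return (group1, group2): lists of strain-ID lists, one list per tree."""
    # Group I: every tip is drawn from the base year.
    group1 = [rng.sample(base_year, n_tips) for _ in range(n_trees)]
    # Group II: one tip from the other year, the rest from the base year.
    group2 = [rng.sample(base_year, n_tips - 1) + [rng.choice(other_year)]
              for _ in range(n_trees)]
    return group1, group2


# Hypothetical strain identifiers standing in for GISAID records.
strains_2013 = [f"2013_{k:03d}" for k in range(50)]
strains_2016 = [f"2016_{k:03d}" for k in range(50)]

g1, g2 = sample_tip_sets(strains_2013, strains_2016,
                         n_trees=100, n_tips=5, rng=random.Random(0))
print(g1[0], g2[0])
```

Keeping all but one tip from the base year means that any systematic difference between the two groups of trees is attributable to the single substituted strain.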

We demonstrate how the test can be used in a meaningful way in the following scenario. From studies such as the one outlined by Jones et al. [14], shown in Figure 12, where H1N1 HA sequences collected globally across time are put into a phylogenetic tree, we notice that the genetic makeup in general drifts as time progresses. We also observe some overlap across years (for example, part of the gray branches are mingled with the brown ones, which signifies some similarity between virus strains in 2013 and 2014; part of the brown branches are mingled with the purple ones, which signifies some similarity between virus strains in 2013 and 2012). It is therefore reasonable to quantify the degree of genetic variation across different years.

Figure 12. Phylogenetic analysis of H1N1 HA sequences from [14]. In this large phylogenetic tree, we observe a general drifting pattern across years, while virus strains from some years overlap with those from previous years.

The crossmatch test shows that there is indeed a general drift in the H1N1 HA genetic makeup across years, with statistical significance, as shown in Table 5. The non-significant p-values on the diagonal are expected, since in those cases we are testing virus strains from the same year. The number of crossmatches decreases as the gap between the two years widens, so we are able to quantify the different amounts of drift that occur between any two selected years. This type of analysis provides an additional tool in which genetic variation is statistically quantified, and it supports a more informed decision about vaccine selection.

Table 5.

Summary of the crossmatch test across multiple years. Each entry is the number of crossmatches among 100 possible pairs, with the corresponding p-value in parentheses.

Year	2013	2014
2013	52 (0.709)	–
2014	38 (0.012)	54 (0.829)
2015	18 (0)	30 (0)
2016	2 (0)	22 (0)
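To make the link between a crossmatch count and its p-value concrete, the following is a minimal sketch of the exact null distribution of the crossmatch statistic given by Rosenbaum [20]. With group sizes n and m, N = n + m points and N/2 pairs, the probability of observing a1 crossmatches under the null is 2^a1 (N/2)! / [C(N, n) a0! a1! a2!], where a2 = (n - a1)/2 and a0 = (m - a1)/2 count the pairs formed within each group. The function names are hypothetical, and it is an assumption here that the p-values in Table 5 are lower-tail probabilities of this distribution with n = m = 100.

```python
from fractions import Fraction
from math import comb, factorial


def crossmatch_pmf(n, m):
    """Exact null distribution of the number of crossmatches (Rosenbaum [20]).

    n and m are the two group sizes; n + m must be even so that every
    point can be paired. Returns a dict {a1: probability as a Fraction}.
    """
    total = n + m
    if total % 2:
        raise ValueError("n + m must be even")
    pairs = total // 2

    pmf = {}
    for a1 in range(min(n, m) + 1):
        # Points left over after a1 crossmatches must pair within their group,
        # so n - a1 (and hence m - a1) has to be even.
        if (n - a1) % 2:
            continue
        a2 = (n - a1) // 2  # pairs with both points in the first group
        a0 = (m - a1) // 2  # pairs with both points in the second group
        numerator = 2 ** a1 * factorial(pairs)
        denominator = comb(total, n) * factorial(a0) * factorial(a1) * factorial(a2)
        pmf[a1] = Fraction(numerator, denominator)
    return pmf


def crossmatch_pvalue(a1, n, m):
    """Lower-tail p-value: few crossmatches indicate that the groups differ."""
    pmf = crossmatch_pmf(n, m)
    return float(sum(p for k, p in pmf.items() if k <= a1))
```

For example, `crossmatch_pvalue(38, 100, 100)` evaluates the lower tail for the 2013 versus 2014 entry. Under the null the expected number of crossmatches is nm/(N - 1), which is about 50 when n = m = 100, so counts close to 50 are unremarkable while counts far below it are evidence of drift.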


6. Discussion and future work

This paper has demonstrated the use of the crossmatch test for comparing viruses as represented by phylogenetic trees. The test provides a well-defined critical region for a chosen significance level, along with a p-value. The test statistic acts as a distance between the two groups; see [6]. This aspect motivates future work. Indeed, the dynamics of the changing virus from season to season can be explored, and future predictions made, based on a suitable time series analysis. Incorporating dynamical analysis of the mutation of virus strains has been of interest in improving the efficiency of influenza vaccination; see [10,27]. Monitoring the genetic distance of strains over time could be extremely useful, as a strong correlation between temporal distance and genetic distance indicates the possibility of strains in a particular season heralding the early strains of the next season; see [4]. As a future step, we would like to apply the test to virus strains over consecutive years and quantify the amount of genetic mutation both within a season and across seasons. This is a realistic agenda because the part of the tree that corresponds to similarities between the viruses is a component of the test statistic and hence is quantifiable.

Acknowledgments

The authors are grateful for the comments and suggestions of two reviewers.

Notes

¹ Types of Influenza Viruses. https://www.cdc.gov/flu/about/viruses/types.htm.

² Selecting Viruses for the Seasonal Influenza Vaccine, Center for Disease Control and Prevention. https://www.cdc.gov/flu/prevent/vaccine-selection.htm.

³ Vaccine Effectiveness: How Well Do the Flu Vaccines Work? https://www.cdc.gov/flu/vaccines-work/vaccineeffect.htm.

⁴ Genetic Characterization. https://www.cdc.gov/flu/about/professionals/genetic-characterization.htm.

⁵ Antigenic Characterization. https://www.cdc.gov/flu/about/professionals/antigenic.htm.

⁶ https://www.cdc.gov/flu/about/professionals/genetic-characterization.htm.

⁷ European Bioinformatics Institute https://www.ebi.ac.uk/.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

1. Barr I.G., McCauley J., Cox N., Daniels R., Engelhardt O.G., Fukuda K., Grohmann G., Hay A., Kelso A., Klimov A., Odagiri T., Smith D., Russell C., Tashiro M., Webby R., Wood J., Ye Z. and Zhang W., Epidemiological, antigenic and genetic characteristics of seasonal influenza A(H1N1), A(H3N2) and B influenza viruses: basis for the WHO recommendation on the composition of influenza vaccines for use in the 2009–2010 northern hemisphere season, Vaccine 28 (2010), pp. 1156–1167.
2. Barton N.H., Briggs D.E.G., Eisen J.A., Goldstein D.B. and Patel N.H., Phylogenetic reconstruction, In Evolution, N. Barton, eds., Chap. 27, Cold Spring Harbor Laboratory Press, New York, 2007.
3. Billera L.J., Holmes S.P. and Vogtmann K., Geometry of the space of phylogenetic trees, Adv. Appl. Math. 27 (2001), pp. 733–767.
4. Boni M.F., Vaccination and antigenic drift in influenza, Vaccine 26 (2008), pp. C8–C14.
5. Cai Z., Zhang T. and Wan X.F., Antigenic distance measurements for seasonal influenza vaccine selection, Vaccine 30 (2012), pp. 448–453.
6. Castro E. and Pelletier B., On the consistency of the crossmatch test, J. Stat. Plan. Infer. 171 (2016), pp. 184–190.
7. Chakerian J. and Holmes S., Computational tools for evaluating phylogenetic and hierarchical clustering trees, J. Comput. Graph. Stat. 21 (2012), pp. 581–599.
8. Desper R. and Gascuel O., Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle, J. Comput. Biol. 9 (2002), pp. 687–705.
9. Earn D.J.D., Dushoff J. and Levin S.A., Ecology and evolution of the flu, Trends. Ecol. Evol. (Amst.) 17 (2002), pp. 334–340.
10. Gupta V., Earl D.J. and Deem M.W., Quantifying influenza vaccine efficacy and antigenic distance, Vaccine 24 (2006), pp. 3881–3888.
11. Hampson A.W., Influenza virus antigens and ‘antigenic drift’, Perspectives in Medical Virology 7 (2002), pp. 49–85.
12. Hayasaka K., Gojobori T. and Horai S., Molecular phylogeny and evolution of primate mitochondrial DNA, Mol. Biol. Evol. 5 (1988), pp. 626–644.
13. Hillis D.M., Phylogenetic analysis, Curr. Biol. 7 (1997), pp. R129–R131.
14. Jones S., Nelson-Sathi S., Wang Y., Prasad R., Rayen S., Nandel V., Hu Y., Zhang W., Nair R., Dharmaseelan S., Chirundodh D. V., Kumar R. and Pillai R. M., Evolutionary, genetic, structural characterization and its functional implications for the influenza A (H1N1) infection outbreak in India from 2009 to 2017, Sci. Rep. 9 (2019), p. 14690.
15. Landolt G.A., Townsend H.G.G. and Lunn D.P., Equine influenza infection, in Equine Infectious Diseases, D. C. Sellon, M. T. Long, eds., 2nd ed., Chap. 13, W.B. Saunders, St. Louis, Missouri, 2014.
16. Opanda S., Bulimo W., Gachara G., Ekuttan C., Amukoye E. and Dijkman R., Assessing antigenic drift and phylogeny of influenza A (H1N1) pdm09 virus in Kenya using HA1 sub-unit of the hemagglutinin gene, PLoS One 15 (2020), p. e0228029.
17. Paessler S. and Veljkovic V., Using electronic biology based platform to predict flu vaccine efficacy for 2018/2019, [version 2; peer review: 2 approved, 1 approved with reservations]. F1000Research 7 (2018), p. 298.
18. Rzhetsky A. and Nei M., Theoretical foundation of the minimum-evolution method of phylogenetic inference, Mol. Biol. Evol. 10 (1993), pp. 1073–1095.
19. Rajao D.S., Anderson T.K., Kitikoon P., Stratton J., Lewis N.S. and Vincent A.L., Antigenic and genetic evolution of contemporary swine H1 influenza viruses in the United States, Virology 518 (2018), pp. 45–54.
20. Rosenbaum P. R., An exact distribution free test comparing two multivariate distributions based on adjacency, J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 67 (2005), pp. 515–530.
21. Russell C. A., Jones T. C., Barr I. G., Cox N. J., Garten R. J., Gregory V., Gust I. D., Hampson A. W., Hay A. J., Hurt A. C., de Jong J. C., Kelso A., Klimov A. I., Kageyama T., Komadina N., Lapedes A. S., Lin Y. P., Mosterin A., Obuchi M., Odagiri T., Osterhaus A. D.M.E., Rimmelzwaan G. F., Shaw M. W., Skepner E., Stohr K., Tashiro M., Fouchier R. A.M. and Smith D. J., Influenza vaccine strain selection and recent studies on the global migration of seasonal influenza viruses, Vaccine 26 (2008), pp. D31–D34.
22. Shao T.J., Li J., Yu X.F., Kou Y., Zhou Y.Y. and Qian X., Progressive antigenic drift and phylogeny of human influenza A(H3N2) virus over five consecutive seasons (2009-2013) in Hangzhou, China, Int. J. Infect Dis. 29 (2014), pp. 190–193.
23. Smith D.J., Lapedes A.S., de Jong J.C., Bestebroer T.M., Rimmelzwaan G.F., Osterhaus A.D.M.E. and Fouchier R.A., Mapping the antigenic and genetic evolution of influenza virus, Science 305 (2004), pp. 371–376.
24. Tapia R., Torremorell M., Culhane M., Medina R. A. and Neira V., Antigenic characterization of novel H1 influenza A viruses in swine, Sci. Rep. 10 (2020), p. e1003657.
25. Treanor J., Influenza vaccine outmaneuvering antigenic shift and drift, N. Engl. J. Med. 350 (2004), pp. 218–220.
26. Virk R.K., Jayakumar J., Mendenhall I.H., Moorthy M., Lam P., Linster M., Lim J., Lin C., Oon L.L.E., Lee H.K., Koay E.S.C., Vijaykrishna D., Smith G.J.D. and Su Y.C.F., Divergent evolutionary trajectories of influenza B viruses underlie their contemporaneous epidemic activity, Proc. Natl. Acad. Sci. 117 (2020), pp. 619–628.
27. Wu J.T., Wein L.M. and Perelson A.S., Optimization of influenza vaccine selection, Oper. Res. 53 (2005), pp. 456–476.

