Genome-wide epistasis and co-selection study using mutual information

Johan Pensar; Santeri Puranen; Brian Arnold; Neil MacAlasdair; Juri Kuronen; Gerry Tonkin-Hill; Maiju Pesonen; Yingying Xu; Aleksi Sipola; Leonor Sánchez-Busó; John A Lees; Claire Chewapreecha; Stephen D Bentley; Simon R Harris; Julian Parkhill; Nicholas J Croucher; Jukka Corander

Nucleic Acids Res. 2019 Jul 30;47(18):e112. doi: 10.1093/nar/gkz656

PMCID: PMC6765119 PMID: 31361894

Abstract

Covariance-based discovery of polymorphisms under co-selective pressure or epistasis has received considerable recent attention in population genomics. Both statistical modeling of the population level covariation of alleles across the chromosome and model-free testing of dependencies between pairs of polymorphisms have been shown to successfully uncover patterns of selection in bacterial populations. Here we introduce a model-free method, SpydrPick, whose computational efficiency enables analysis at the scale of pan-genomes of many bacteria. SpydrPick incorporates an efficient correction for population structure, which adjusts for the phylogenetic signal in the data without requiring an explicit phylogenetic tree. We also introduce a new type of visualization of the results similar to the Manhattan plots used in genome-wide association studies, which enables rapid exploration of the identified signals of co-evolution. Simulations demonstrate the usefulness of our method and give some insight into when this type of analysis is most likely to be successful. Application of the method to large population genomic datasets of two major human pathogens, Streptococcus pneumoniae and Neisseria meningitidis, revealed both previously identified and novel putative targets of co-selection related to virulence and antibiotic resistance, highlighting the potential of this approach to drive molecular discoveries, even in the absence of phenotypic data.

INTRODUCTION

Comparative methods for detecting co-evolutionary signals from population sequence data have received a lot of attention over the last few decades. As one of the more striking examples, statistical analysis of covariation between non-adjacent sites in large protein alignments has proven effective for predicting contacts between sites in the three-dimensional protein structure (1–8). Since sites in contact in the protein structure co-evolve under a common structural constraint, they give rise to a detectable trace of correlation in the protein alignment. Similarly, sites co-evolving under a shared selective pressure may give rise to a co-selection pattern that can be detected from sequence alignments, even in the absence of appropriate phenotypic data. As a result, attention has recently been directed toward exploratory co-variation analysis of genome-wide nucleotide alignments for bacterial populations, where the aim is to reveal putative sites co-evolving under a shared selective pressure and possibly, but not necessarily, being involved in epistatic interactions (9–11).

Genome-scale analysis of co-variation at single-nucleotide resolution, here termed a genome-wide epistasis and co-selection study (GWES), has already shown great potential; however, it poses considerable statistical and computational challenges as the number of pairs to be considered increases quadratically with the number of sites. Previous GWES approaches have been based on either straightforward pairwise tests (9), which do not distinguish between indirect and direct interactions, or a more elaborate model-based technique known as direct coupling analysis (DCA) (10,11), which is computationally demanding. The main motivation behind pairwise structure learning methods has typically been scalability; however, a recent simulation study with synthetic network models showed that pairwise methods based on mutual information (MI) can be as accurate as, and even outperform, model-based methods in the high-dimensional regime (12), which is the typical setting in GWES. While MI has been successful in detecting co-evolution from protein and RNA data (1,2,13,14), it has not yet been systematically applied to bacterial population genomics at a genome-wide scale.

In this work we introduce a novel MI-based GWES method, SpydrPick, which scales to analyses even at a pan-genome-wide scale. To account for population structure, we use a sequence reweighting technique commonly employed when analyzing protein sequence alignments (4,5,14), and also more recently when performing GWES (10,11). To select the best candidates of directly co-selected or interacting mutations among the identified signals of co-variation, we use a pruning method originally introduced for analysis of gene expression data (15), combined with an outlier detection method that identifies significant outliers in terms of a global background distribution estimated across the genome. The focus on the statistical quantification of the background pattern across the genome lends itself well to an intuitive and efficient visualization of the results akin to a Manhattan plot used in genome-wide association studies, which we term the GWES Manhattan plot. We demonstrate the usefulness and reliability of SpydrPick by application to both simulated data and two large population genomic datasets of the major human pathogens Streptococcus pneumoniae and Neisseria meningitidis. For the latter pathogen, we analyzed the entire pan-genome, which contains so many mutations that most model-based approaches are computationally infeasible, including even our recent highly optimized DCA-based software (11). An open-source C++ implementation of SpydrPick is available at https://github.com/santeripuranen/SpydrPick.

MATERIALS AND METHODS

Method

An overview of the SpydrPick pipeline is shown in Figure 1. The different steps are described in detail in the following sections.

Mutual information

MI is an information theoretic measure of the mutual dependence between two random variables (16). More specifically, let $X$ and $Y$ be two discrete random variables with outcome spaces $\mathcal{X}$ and $\mathcal{Y}$. The MI between $X$ and $Y$ is then formally defined by:

$$I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \qquad (1)$$

where $p(x,y)$ is the joint probability of $X = x$ and $Y = y$, while $p(x)$ and $p(y)$ are the corresponding marginal probabilities. In practice, the distributions in (1) are typically unknown and have to be estimated from data. Let $n(x,y)$ denote the count of the joint outcome $X = x$ and $Y = y$ occurring in a dataset containing $n$ independent and identically distributed (IID) observations generated from $p(X,Y)$. Typically, the joint probabilities are estimated by the relative frequencies of the joint outcomes, corresponding to maximum likelihood estimates. To avoid issues related to zero counts and increase the stability of the estimator, we add 0.5 to the joint counts:

$$\hat{p}(x,y) = \frac{n(x,y) + 0.5}{n + 0.5\, r_X r_Y} \qquad (2)$$

where $r_X = |\mathcal{X}|$ and $r_Y = |\mathcal{Y}|$. The corresponding marginal probabilities are calculated from the estimated joint probabilities by summing over the outcomes of the other variable. In the Bayesian framework, the above point estimator is the posterior mean under a Dirichlet prior distribution with the hyperparameters set to 0.5, corresponding to Jeffreys’ prior (17).
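As an illustration, the smoothed plug-in estimator described above can be sketched in a few lines of Python. This is a minimal sketch, not the SpydrPick implementation (which is written in C++): the function name `mi_estimate`, its arguments and the use of natural logarithms are assumptions made for the example.

```python
from collections import Counter
from math import log


def mi_estimate(xs, ys, outcomes_x, outcomes_y, pseudocount=0.5):
    """Plug-in MI (in nats) between two aligned columns.

    xs, ys      : equally long sequences of observed symbols
    outcomes_x/y: the outcome spaces (each symbol listed once)
    pseudocount : value added to every joint count (0.5 in the text)
    """
    if len(xs) != len(ys):
        raise ValueError("columns must have equal length")
    counts = Counter(zip(xs, ys))
    n = len(xs)
    denom = n + pseudocount * len(outcomes_x) * len(outcomes_y)
    # Smoothed joint probabilities over the full outcome grid.
    joint = {(x, y): (counts[(x, y)] + pseudocount) / denom
             for x in outcomes_x for y in outcomes_y}
    # Marginals are obtained by summing the smoothed joint distribution.
    px = {x: sum(joint[(x, y)] for y in outcomes_y) for x in outcomes_x}
    py = {y: sum(joint[(x, y)] for x in outcomes_x) for y in outcomes_y}
    return sum(p * log(p / (px[x] * py[y])) for (x, y), p in joint.items())
```

For two perfectly co-varying binary columns of length 8 this sketch returns about 0.37 nats, whereas two balanced independent columns give 0; the pseudocount keeps every term of the sum finite even when a joint outcome is never observed.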

Sequence reweighting

In the context of this work, $X$ and $Y$ in the previous paragraph correspond to single-nucleotide polymorphisms (SNPs) and the outcome spaces represent (subsets of) the four nucleotides and an additional category representing gaps. The observed data are in the form of a multiple sequence alignment (MSA) containing $n$ sequences $s_1, \ldots, s_n$ of length $d$. In general, the sequences in an MSA strongly violate the IID assumption since they share a linkage through an evolutionary relationship. This non-independence has long been recognized as a major issue in comparative analysis, introducing a phylogenetic bias that leads to an increase in false positives (18), impeding the separation of interesting signals from background noise caused by the population structure. As a result, various techniques for correcting for the population structure have been developed over the years (see (13) for an overview). Here, we apply a technique known as sequence reweighting, which has successfully been used previously for both protein contact map prediction (4,5) and DCA-based GWES (10,11). Reweighting assigns a weight to each sequence according to how different it is from the other sequences in the MSA, such that the counts of allele pairs occurring in the MI estimator will reflect the level of clusteredness across the MSA.

Let $m_i$ denote the number of sequences (including $s_i$ itself) in the data whose mean per-site Hamming distance to sequence $s_i$ is smaller than a specified threshold. The weight given to sequence $s_i$ is then simply calculated by

$$w_i = \frac{1}{m_i}.$$

Similar to previous works (5,10,11), we use a default distance threshold value of 0.1. Considering the large genetic distance separation that was recently observed for many bacterial species (19), we expect the results to be fairly robust toward the exact value of the distance threshold, as long as the value is chosen from an appropriate region. For example, previous DCA-based methods have been shown to be stable for values in the range of 0.10–0.25 (5,10).

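The effective count $n_{\mathrm{eff}}(x,y)$ is calculated by summing the weights of all sequences with the corresponding joint configuration over the SNP sites represented by $X$ and $Y$. The counts in (2) are then replaced with the corresponding effective counts:

$$\hat{p}(x,y) = \frac{n_{\mathrm{eff}}(x,y) + 0.5}{n_{\mathrm{eff}} + 0.5\, r_X r_Y},$$

where $n_{\mathrm{eff}} = \sum_i w_i$ is the total weight of all sequences and $r_X$ and $r_Y$ are the numbers of possible outcomes at the two sites.

The above estimates are finally plugged into (1), resulting in the reweighted MI estimator.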
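The reweighting step can be illustrated with a short Python sketch. It assumes the usual form of sequence reweighting, in which each sequence receives the reciprocal of its neighbour count as weight; the function names `sequence_weights` and `effective_counts` are hypothetical, and the naive all-against-all comparison is chosen for clarity, not speed.

```python
def sequence_weights(seqs, threshold=0.1):
    """Weight 1/m_i per sequence, where m_i counts the sequences (including
    sequence i itself) whose mean per-site Hamming distance to sequence i
    is below `threshold`. Assumes equally long sequences and threshold > 0."""
    n = len(seqs)
    length = len(seqs[0])
    m = [1] * n  # every sequence is its own neighbour (distance 0)
    for i in range(n):
        for j in range(i + 1, n):
            mismatches = sum(a != b for a, b in zip(seqs[i], seqs[j]))
            if mismatches / length < threshold:
                m[i] += 1
                m[j] += 1
    return [1.0 / count for count in m]


def effective_counts(seqs, weights, site_a, site_b):
    """Sum the sequence weights for each joint configuration at two sites."""
    counts = {}
    for seq, w in zip(seqs, weights):
        key = (seq[site_a], seq[site_b])
        counts[key] = counts.get(key, 0.0) + w
    return counts
```

In this sketch, two identical sequences each receive weight 0.5 and therefore contribute as much to the effective counts as one isolated sequence, which is the intended down-weighting of densely sampled clusters.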

Filtering out indirect links

An unavoidable issue with methods based solely on pairwise association tests is their inability to distinguish between direct and indirect associations. In particular, in the GWES context it is typically expected that a strong direct dependence between two distant SNP sites would be accompanied by a collection of slightly weaker indirect dependencies between sites in close proximity of the coupled sites due to genetic linkage. As a result, pinpointing the exact locations of co-evolving loci at SNP resolution in a bacterial GWES is in general very difficult due to strong LD between nearby sites. Still, considering that the identified links need to be examined manually, our aim is to produce as compact a list of SNP pairs as possible, containing the most likely candidates of mutations co-evolving under a shared selective pressure.

To select a subset of SNP pairs containing only the most promising links, we use the same filtering technique as ARACNE, which was originally introduced as a method for inferring gene expression networks (15). The filtering technique is based on a property known as the data processing inequality, which states that if two variables $X$ and $Y$ only interact through a third variable $Z$, then

$$I(X;Y) \le \min\{I(X;Z),\, I(Z;Y)\}.$$

In other words, the indirect dependence between $X$ and $Y$ cannot be larger than either of the two direct dependencies through which it is mediated. Formally, ARACNE starts from a graph containing a link for each non-zero MI value. The algorithm then examines each triplet of mutually linked variables and removes the weakest link (see Figure 2). In the degenerate case, where there is no unique weakest link in a triplet, no link is removed. The algorithm is order-independent in the sense that a link that has been marked for removal from one triplet is still considered present with respect to a non-examined triplet containing that link.

Figure 2. — Illustration of the ARACNE step (the width of the links represents the interaction strength): (A) true interaction structure: Z is strongly linked to X and Y, which are not directly linked to each other. (B) A pairwise test outputs a significant association between X and Y due to the indirect link through Z. (C) The ARACNE step removes the indirect link between X and Y, being the weakest out of the three links.

Naively applying the ARACNE filtering step would be computationally intractable, since there are in total $\binom{d}{3}$ possible triplets among $d$ SNP sites. However, in practice it is sufficient to run the procedure over a small list containing only the top estimated links. Consequently, the main computational part will still be to estimate the MI values over all $\binom{d}{2}$ SNP pairs. The ARACNE approach is appealing not only for its computational simplicity but also for its ability to produce a small representative set of links that are most likely to be direct. One of the drawbacks of this approach is that it will never output a triplet of mutually linked sites (except in the degenerate case), even if such a triplet exists. However, three mutually linked sites will still be contained in a single connected component and thus the association between the sites will remain visible.
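The filtering rule described above (drop the strictly weakest link of every fully linked triplet, marking rather than deleting so that the result does not depend on the order of examination) can be sketched as follows. This is an illustrative Python version for a short list of top links; the name `aracne_filter` and the dictionary-of-frozensets representation are assumptions, and the brute-force loop over node triplets is only reasonable for small inputs.

```python
from itertools import combinations


def aracne_filter(links):
    """links: dict mapping frozenset({a, b}) -> MI value.
    Returns the links that survive the ARACNE-style pruning."""
    nodes = sorted({v for pair in links for v in pair})
    marked = set()  # links flagged for removal; still visible to later triplets
    for a, b, c in combinations(nodes, 3):
        triplet = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if not all(edge in links for edge in triplet):
            continue  # the three nodes are not mutually linked
        ranked = sorted((links[edge], idx) for idx, edge in enumerate(triplet))
        # Remove the weakest link only if it is strictly weaker than the rest.
        if ranked[0][0] < ranked[1][0]:
            marked.add(triplet[ranked[0][1]])
    return {edge: mi for edge, mi in links.items() if edge not in marked}
```

Applied to the situation in Figure 2, with strong X–Z and Y–Z links and a weaker X–Y link, the sketch keeps the two strong links and discards X–Y.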

Threshold for result storage

Saving the complete output of a GWES to disk would typically result in such large files that they would become unwieldy. Nevertheless, since the main target is to identify the largest MI values, estimation results can be filtered online (i.e. as each new value is calculated) to reduce the amount of storage required. To this end, we use a subsampling procedure to determine a threshold for saving a user-specified top fraction of the MI values. This is done by randomly selecting a subset of SNP pairs for which the MI values are calculated. The empirical cumulative distribution function is then used to estimate an appropriate saving threshold that corresponds to the user-specified top fraction. To increase stability, the procedure is repeated several times and the median threshold value is selected.
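A minimal Python sketch of such a subsampling procedure is given below. The function name `saving_threshold`, the default number of sampled pairs and repeats, and the callback `mi_of_pair` (any function returning the MI value of a site pair) are hypothetical choices for illustration and are not taken from the SpydrPick source.

```python
import random
from statistics import median


def saving_threshold(n_sites, mi_of_pair, top_fraction,
                     n_pairs=10000, n_repeats=5, seed=0):
    """Estimate the MI value above which roughly `top_fraction` of all
    pairs fall, using repeated random subsamples of SNP pairs."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_repeats):
        values = []
        for _ in range(n_pairs):
            i, j = rng.sample(range(n_sites), 2)  # two distinct sites
            values.append(mi_of_pair(i, j))
        values.sort()
        # Empirical quantile at level 1 - top_fraction.
        k = min(len(values) - 1, int((1.0 - top_fraction) * len(values)))
        thresholds.append(values[k])
    # The median over repeats stabilises the estimate.
    return median(thresholds)
```

During the full scan, a pair would then be written to disk only if its MI value exceeds the returned threshold.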

Outlier analysis

To assess if a link is strong enough to warrant further study, we perform an outlier analysis. Due to genetic linkage, SNPs in close chromosomal proximity tend to be in strong linkage disequilibrium (LD). Note that LD here refers to SNPs showing a significant association specifically due to close genetic linkage. Since strong LD masks any potential signal of shared co-evolutionary selection pressure, we restrict the outlier analysis to non-LD pairs. The default approach for filtering out LD-pairs is to use a simple distance-based cut-off.

To estimate an outlier threshold among the non-LD pairs, we use a data-driven procedure based on Tukey's outlier test (20). The test assesses how extreme an MI value is in comparison to a global background distribution observed for the analyzed dataset. If the MI value of a direct link is flagged as an outlier, the corresponding SNP pair will automatically be carried forward for further analysis. As background distribution for the outlier test, we use an extreme value distribution by which we effectively attempt to model the distribution of maximum MI values for a site (w.r.t. non-LD pairs). In practice, we save the maximum MI value of each site and calculate the lower ($Q_1$) and upper ($Q_3$) quartiles of the empirical extreme value distribution. Following Tukey's criterion, we then label an MI value larger than $Q_3 + 1.5 \times (Q_3 - Q_1)$ as an outlier. In addition to the default threshold, we label an MI value larger than $Q_3 + 3 \times (Q_3 - Q_1)$ as an extreme outlier.

The typical approach for determining significance in this type of problem is to run a permutation analysis (15,21). For this application, such an approach would be too inclusive since the maximum MI values observed in the background distribution of real MSAs exceed those observed under a null model in which the sites are unlinked through permutations. Moreover, the extent of the tail region of the background distribution may vary significantly between datasets due to differences in population structure, recombination rate, etc. For these reasons, our significance analysis is based on identification of outliers among the actual MI values observed for a particular population. Being based on quartiles, Tukey's outlier test is by design very robust against extreme values. The critical assumption behind this procedure is that the majority of SNPs are not linked to other SNPs beyond LD, which is a reasonable assumption in most cases.
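The outlier thresholds can be computed directly from the per-site maxima. The following Python sketch assumes Tukey's standard fences (1.5 and 3 interquartile ranges above the upper quartile) and the default quartile convention of `statistics.quantiles`; the function names and the `(site_i, site_j, mi)` tuple layout are hypothetical, and SpydrPick may use a different quartile convention.

```python
from statistics import quantiles


def max_mi_per_site(pairs):
    """pairs: iterable of (site_i, site_j, mi) restricted to non-LD pairs.
    Returns a dict mapping each site to its largest observed MI value."""
    best = {}
    for site_i, site_j, mi in pairs:
        for site in (site_i, site_j):
            if mi > best.get(site, float("-inf")):
                best[site] = mi
    return best


def tukey_thresholds(max_values):
    """Return (outlier, extreme_outlier) thresholds from per-site maxima."""
    q1, _, q3 = quantiles(max_values, n=4)
    iqr = q3 - q1
    return q3 + 1.5 * iqr, q3 + 3.0 * iqr
```

A link is then labelled an outlier if its MI value exceeds the first threshold and an extreme outlier if it exceeds the second. Because only quartiles enter the calculation, a handful of very strong links has little influence on the thresholds themselves.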

Mutual information without gaps

When calculating the MI values, gaps are by default considered an outcome. While some gaps can be informative, others may simply be due to difficulties in the sequencing process: difficult-to-sequence regions may be systematically absent from all lower-quality sequences, resulting in distinct patches of gap characters that appear in parallel across samples. Hence, some interactions may be artificially amplified in regions with low-quality sequence data. To facilitate discovery of such cases in the subsequent manual analysis, we also calculate the MI value of the top pairs using only sequences where neither site of a pair contains a gap. Since the collection of sequences without gaps varies between pairs, it is difficult to compare gap-free MI values between SNP pairs in a meaningful way. However, the gap-free MI value can still be informative for a given pair in the sense that a large decrease in MI when dropping the gap sequences is an indication of a gap-driven interaction.
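The comparison between the default and the gap-free MI value of a pair can be sketched as follows. This is an illustrative Python version only: the gap symbol `-`, the function names, and the choice to take the outcome space from the symbols actually observed in each column are assumptions, and sequence weights are omitted for brevity.

```python
from collections import Counter
from math import log

GAP = "-"  # assumed gap symbol


def smoothed_mi(xs, ys, pseudocount=0.5):
    """Plug-in MI (nats) with a pseudocount, over the observed symbols."""
    ox, oy = sorted(set(xs)), sorted(set(ys))
    counts = Counter(zip(xs, ys))
    denom = len(xs) + pseudocount * len(ox) * len(oy)
    joint = {(x, y): (counts[(x, y)] + pseudocount) / denom
             for x in ox for y in oy}
    px = {x: sum(joint[(x, y)] for y in oy) for x in ox}
    py = {y: sum(joint[(x, y)] for x in ox) for y in oy}
    return sum(p * log(p / (px[x] * py[y])) for (x, y), p in joint.items())


def mi_with_and_without_gaps(xs, ys):
    """Return (MI using all sequences, MI using only gap-free sequences)."""
    kept = [(x, y) for x, y in zip(xs, ys) if x != GAP and y != GAP]
    gap_free_x = [x for x, _ in kept]
    gap_free_y = [y for _, y in kept]
    return smoothed_mi(xs, ys), smoothed_mi(gap_free_x, gap_free_y)
```

A pair whose first value is high but whose second value collapses toward zero is a candidate for a gap-driven interaction and deserves closer inspection of the underlying sequence quality.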

Implementation

The complete SpydrPick pipeline was implemented in C++ and supports parallel execution in a shared memory environment. Its space-efficient data structure, indexing strategy and online filtering of output jointly enable it to scale to genome datasets an order of magnitude larger than those handled by previous GWES software.

GWES Manhattan plot

For compactly visualizing the results of a GWES, we use a modified version of the GWAS Manhattan scatter plot. In a standard GWAS Manhattan plot, the association strength between a SNP and some phenotype (y-axis) is plotted against the chromosomal location of the SNP (x-axis), meaning that each point represents a single SNP. A GWES Manhattan plot has a similar design; however, each point now represents a pair of SNPs, such that the x-axis displays the distance between the chromosomal locations of the SNPs and the y-axis displays the association strength between the SNPs, which is here determined by their MI value.
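The coordinates of such a plot are straightforward to derive from a list of links. The Python sketch below only prepares the (distance, MI) points and leaves the actual drawing to any plotting library; the function name is hypothetical, and measuring distance along the shorter arc of a circular chromosome is an assumption that applies when the reference genome is circular.

```python
def gwes_manhattan_points(links, genome_length=None):
    """links: iterable of (pos_i, pos_j, mi) with chromosomal positions in bp.
    Returns a list of (distance, mi) points, one per SNP pair.

    If `genome_length` is given, the chromosome is treated as circular and
    the shorter of the two arcs between the positions is used."""
    points = []
    for pos_i, pos_j, mi in links:
        distance = abs(pos_i - pos_j)
        if genome_length is not None:
            distance = min(distance, genome_length - distance)
        points.append((distance, mi))
    return points
```

Plotting these points as a scatter plot, with the outlier thresholds drawn as horizontal lines, gives a view in which links that stand out from the distance-dependent background are immediately visible.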

Data

Simulated data

To test the accuracy of our method in a controlled setting, we applied it to simulated evolutionary scenarios with known parameter values. These simulated datasets were generated using fwdpp (22) along with custom functions to simulate bacterial recombination and selection.

To model a single recombining species adapting to multiple niches, we simulated two subpopulations, or demes, of sizes $N_1$ and $N_2$ that experience divergent selection pressures but exchange DNA through homologous recombination. We simulated a 200 kbp segment from a metapopulation of size […] individuals, with an overall population mutation rate of […] per bp, where $\mu$ is the physical mutation rate. Each new mutation had a […] chance of affecting fitness and experiencing either positive selection in deme 1 or negative selection in deme 2. Fitness effects were multiplicative, with individual fitness calculated as […], where $s$ represents the selection coefficient of the mutations that have a fitness effect, and a deme indicator takes on a value of 1 or −1 depending on whether the individual is in deme 1 or 2, respectively.

Since our ability to detect the positively co-selected mutations in deme 1 may vary with its relative size in the metapopulation, we varied the size of deme 1. We explored scenarios in which demes were similar in size (50:50) or where deme 1 was considerably rarer, only 10 and 5% of the total metapopulation (10:90 and 5:95, respectively). For each parameter set, we specified a selection coefficient $s$ so that […].

We also varied the population recombination rate […], where $r$ represents the physical recombination rate, so that mutations with fitness effects were either more linked ([…]) or less linked ([…]) to neutral mutations. Each individual in the metapopulation had the same chance of receiving DNA. Recombination within and between demes was proportional to deme size, such that the probability an individual in deme 1 served as a donor for any given recombination event was $N_1/(N_1 + N_2)$, where $N_1$ and $N_2$ are the deme sizes, representing a scenario in which there are no significant physical barriers between the demes. For all simulations, individuals that served as recombination donors transferred geometrically distributed DNA tracts with a mean length of 500 bp.

For each scenario, we ran five simulations for […] generations, after which a random sample of size […] was taken from the metapopulation, sampling demes with respect to their relative sizes. To investigate how the sample size affected the accuracy of our method, we subsampled the initial dataset with sample sizes ranging from 50 to 800. Using 10 iterations per sample size, we thus generated $5 \times 10 = 50$ datasets for each scenario and sample size. In total, the simulation study covered 3000 datasets, which on average contained 14 976 SNPs, after filtering out sites with a minor allele frequency (MAF) <1%. On average, 10 mutations were randomly placed under selection in a simulation.

Streptococcus pneumoniae

Our first real alignment contained 3042 S. pneumoniae strains collected in Maela, a refugee camp close to the border between Thailand and Myanmar (23). The whole genome alignment was generated from short-read data aligned to the reference sequence of S. pneumoniae ATCC 700669 whose genome is a circular chromosome of 2 221 315 bp (24). Loci with MAF >1% and gap frequency (GF) <15% were included in the analysis. The filtered alignment contained 94 880 SNPs.
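The site filter can be illustrated with a small Python sketch applied to one alignment column at a time. The exact definitions used by the authors are not stated in the text, so the choices here are assumptions: the gap symbol is `-`, the minor allele is taken to be the second most common nucleotide, and both frequencies are computed relative to all sequences. The function name `passes_filter` is hypothetical.

```python
from collections import Counter


def passes_filter(column, maf_min=0.01, gf_max=0.15, gap="-"):
    """Return True if an alignment column has MAF > maf_min and GF < gf_max."""
    n = len(column)
    gap_frequency = column.count(gap) / n
    alleles = Counter(c for c in column if c != gap).most_common()
    if len(alleles) < 2:
        return False  # monomorphic site: no minor allele
    minor_allele_frequency = alleles[1][1] / n
    return minor_allele_frequency > maf_min and gap_frequency < gf_max
```

Under these assumptions, a column with 90 A and 10 C passes the default thresholds, while a column in which 20% of the sequences carry a gap is rejected unless the gap threshold is relaxed.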

The diverse population structure in the data, together with the recombinant nature of S. pneumoniae, makes the data ideal for GWES (10). Moreover, this particular dataset has previously been analyzed by DCA approaches, which successfully discovered several interacting regions with plausible biological explanations (10,11). Hence, the main aim for this dataset was to investigate how well the key earlier findings could be rediscovered using our model-free method.

Neisseria meningitidis

Our second real alignment contained 2148 N. meningitidis strains, of which 543 were published by Lucidarme et al. (25) and the rest were obtained from different sequencing projects run at the Wellcome Sanger Institute, Cambridge (Supplementary Table S1). The pan-genome of the strains included in the study was created using Roary (26), with the percentage of isolates required for a gene to be considered core set to 95%. The core gene alignment and individual gene alignments of the 13 052 genes forming the pan-genome under the above criteria were obtained directly from the output. All individual genes were concatenated to obtain a pan-genome-wide alignment of 11 375 926 bp using the Alignment Manipulation and Summary (AMAS) tool (27). Loci with MAF >1% and GF <70% were included in the analysis. The filtered alignment contained 137 814 SNPs. An approximately maximum likelihood phylogenetic tree was estimated with FastTree (28) from the SNP sites in the core alignment (obtained with SNP-sites (29)) using the GTR model of nucleotide substitution and gamma rate heterogeneity among sites.

In contrast to the S. pneumoniae alignment, where all sequences were mapped to a reference sequence, this pan-genome-wide alignment was constructed by concatenating individual gene alignments. As a result, we can no longer use a straightforward distance-based cut-off to filter out LD-mediated links. Instead, we simply define two sites within the same gene as an LD-pair and two sites from different genes as a non-LD pair. The main aim for this dataset was to investigate if our method would still be able to extract plausible signals of co-selection under this modified setup.
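The two ways of separating LD pairs from non-LD pairs can be contrasted in a short Python sketch. The function name, the dictionary layout of a site and the `mode` argument are hypothetical; the distance cut-off is left as a required parameter because its value is not given here.

```python
def is_ld_pair(site_a, site_b, mode, ld_distance=None):
    """Classify a pair of sites as an LD pair.

    site_a, site_b: dicts with keys "pos" (position in bp) and "gene" (gene id)
    mode          : "distance" for a reference-mapped alignment,
                    "gene" for a concatenated pan-genome alignment
    ld_distance   : cut-off in bp, required when mode == "distance"
    """
    if mode == "distance":
        if ld_distance is None:
            raise ValueError("ld_distance is required in distance mode")
        return abs(site_a["pos"] - site_b["pos"]) < ld_distance
    if mode == "gene":
        # Positions in a concatenated alignment carry no genomic meaning,
        # so only gene membership is used.
        return site_a["gene"] == site_b["gene"]
    raise ValueError("mode must be 'distance' or 'gene'")
```

In gene mode, two sites in neighbouring genes are treated as a non-LD pair even though they may be physically close in a given strain, which is a limitation to keep in mind when interpreting links from the pan-genome analysis.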