Biometrika. 2018 Aug 4;106(1):1–18. doi: 10.1093/biomet/asy033

Gene hunting with hidden Markov model knockoffs

M Sesia, C Sabatti, E J Candès
PMCID: PMC6373422  PMID: 30799875

SUMMARY

Modern scientific studies often require the identification of a subset of explanatory variables. Several statistical methods have been developed to automate this task, and the framework of knockoffs has been proposed as a general solution for variable selection under rigorous Type I error control, without relying on strong modelling assumptions. In this paper, we extend the methodology of knockoffs to problems where the distribution of the covariates can be described by a hidden Markov model. We develop an exact and efficient algorithm to sample knockoff variables in this setting and then argue that, combined with the existing selective framework, this provides a natural and powerful tool for inference in genome-wide association studies with guaranteed false discovery rate control. We apply our method to datasets on Crohn’s disease and some continuous phenotypes.

Keywords: False discovery rate, Genome-wide association study, Knockoff, Variable selection

1. Introduction

1.1. The need for controlled variable selection

Automatic variable selection is a fundamental challenge in statistics, the urgency of which is induced by the growing reliance of many fields of science on the analysis of large amounts of data. As researchers strive to understand increasingly complex phenomena, the technology of high-throughput experiments allows them to measure and simultaneously examine millions of covariates. However, despite the abundance of variables available, often only a fraction of these are expected to be relevant to the question of interest. By discovering which variables are important, scientists can design a more targeted follow-up investigation and hope to understand how certain factors influence an outcome. A compelling example is offered by genome-wide association studies, whose goal is to identify which markers of genetic variation influence the risk of a particular disease or a trait, choosing from up to millions of single-nucleotide polymorphisms. A good selection algorithm should be able to detect as many relevant variables as possible using only a small number of samples, since these tend to be expensive to acquire. It should also ensure that the findings are replicable. Several statistical techniques have been proposed in an effort to address and balance these conflicting needs. The standard approach in genome-wide association studies is to separately compute a $p$-value for the null hypothesis of no association between the outcome of interest and each polymorphism, using a generalized linear model with one fixed effect and possibly random effects capturing the contribution of all other variables. To identify significant associations, the $p$-values may be compared to a threshold that guarantees approximate control, at the 0.05 level, of the familywise error rate, i.e., the probability of committing at least one Type I error across all tests.
This approach is very conservative and the selected variables, while apparently reproducibly associated with the response, can typically only explain a small portion of the genetic variance in the phenotype of interest (Manolio et al., 2009).

An alternative criterion for evaluating significance is the false discovery rate (Benjamini & Hochberg, 1995). This is attractive when one expects a multiplicity of true discoveries and it has been adopted in studies involving gene expression and many other genomic measurements (Storey & Tibshirani, 2003), including the study of expression quantitative trait loci. A broader adoption of the false discovery rate has been advocated as a natural strategy for improving the power of association studies for complex traits (Sabatti et al., 2003; Storey & Tibshirani, 2003; Brzyski et al., 2017).

Controlled variable selection is inherently difficult in high dimensions, but genome-wide association studies present at least two specific challenges. First, many phenotypes depend on the genetic variants through mechanisms that are mostly unknown (Zuk et al., 2012) and may involve interactions (Carlborg & Haley, 2004). Unfortunately, methods based on marginal testing are ill-equipped to detect interactions and the few current approaches that simultaneously analyse the role of multiple variants rely on linearity assumptions. The second prominent obstacle arises from the presence of correlations between the explanatory variables, as polymorphisms that occupy nearby positions in the genome are tightly linked. This results from the process by which the DNA is transmitted in humans and, as a fundamental characteristic of association studies, it cannot be neglected by methods aiming for valid inference.

These issues motivate the need for methods that can identify important variables for complex phenomena, while providing rigorous guarantees of Type I error control under milder and well-justified assumptions. In the following, we will present our solution and its detailed application to a few studies, after a brief summary of related previous work. Since a few technical terms from genetics appear in this paper, a glossary is included in the Supplementary Material.

1.2. Model-X knockoffs

Knockoffs (Candès et al., 2018) partially address the aforementioned issues by taking a radically different path from the traditional literature on high-dimensional variable selection. They provide a powerful and versatile method that rigorously controls the false discovery rate, under no modelling assumptions on the conditional distribution $F_{Y \mid X}$ of the response $Y$ given the covariates $X$. In fact, $F_{Y \mid X}$ may remain completely unspecified. This result is achieved by considering a setting in which the distribution $F_X$ of the covariates is presumed to be known. When this is the case, the latter can be used to generate a new set of artificial variables, the knockoff copies, that serves as a negative control for the original variables. It thus becomes possible to estimate and control the false discovery rate. Since this procedure takes the somewhat unusual path of modelling the covariates instead of the response, we sometimes refer to it as model-X knockoffs. In many circumstances, the premise of model-X knockoffs is arguably more principled than those of its traditional counterparts. In general, it is reasonable to shift the central burden of assumptions from $F_{Y \mid X}$ to $F_X$, since the former is the object of inference. In a genome-wide association study, an agnostic approach to the conditional distribution of the response is especially valuable, due to the possibly complex nature of the relations between genetic variants and phenotypes. Moreover, the presumption of knowing $F_X$ is well-grounded, since geneticists have at their disposal a rich set of models for how DNA variants arise and spread across human populations over time.
Genetic variation has been assessed in large collections of individuals: the UK Biobank (Sudlow et al., 2015) contains the genotypes of 500 000 subjects, while hundreds of thousands of additional samples are available from the National Center for Biotechnology Information (Mailman et al., 2007). This combination of theoretical knowledge and data gives us a good understanding of $F_X$.

Since knockoffs require knowledge of the underlying distribution $F_X$ of the original variables, which may not be accessible exactly, in practice some approximation is needed. However, even if the true $F_X$ is known, creating the knockoff copies is in general very difficult. To this date, the only special case for which an algorithm has been developed is that of multivariate Gaussian covariates (Candès et al., 2018). In this sense, knockoffs have not yet fully resolved the second crucial difficulty of association studies mentioned earlier, because a multivariate Gaussian approximation cannot fully take advantage of our prior information on the sequential structure of DNA (Wall & Pritchard, 2003). It thus seems important to develop new techniques that can benefit from advances in the study of population genetics and exploit more accurate parametric models for $F_X$.

1.3. Our contributions

In this paper, we introduce a new algorithm to sample knockoff copies of variables distributed as a hidden Markov model. To the best of our knowledge, this result is the first extension of model-X knockoffs beyond the special case of a Gaussian design, and it involves a class of covariate distributions that is of great practical interest. In fact, hidden Markov models are widely employed to describe sequential data with complex correlations.

While many applications of hidden Markov models are found in the context of speech processing (Juang & Rabiner, 1991) and video segmentation (Boreczky & Wilcox, 1998), their presence has also become nearly ubiquitous in the statistical analysis of biological sequences. Important instances include protein modelling (Krogh et al., 1994), sequence alignment (Hughey & Krogh, 1996), gene prediction (Krogh, 1997), copy number reconstruction (Wang et al., 2007), segmentation of the genome into diverse functional elements (Ernst & Kellis, 2012) and identification of ancestral DNA segments (Falush et al., 2003; Tang et al., 2006; Li & Durbin, 2011). Of special interest to us, following the empirical observation that variation along the human genome could be described by blocks of limited diversity (Patil et al., 2001), hidden Markov models have been broadly adopted to describe haplotypes, i.e., the sequence of alleles at a series of markers along one chromosome. The literature is too extensive to recapitulate: starting from some initial formulations (Stephens et al., 2001; Zhang et al., 2002; Qin et al., 2002; Li & Stephens, 2003), a vast set of models and algorithms is used routinely and effectively to reconstruct haplotypes and to impute missing genotype values. Software implementations include fastPHASE (Scheet & Stephens, 2006), Impute (Marchini et al., 2007; Marchini & Howie, 2010), Beagle (Browning & Browning, 2007; Browning & Browning, 2011), Bimbam (Guan & Stephens, 2008) and MaCH (Li et al., 2010). The success of these algorithms in reconstructing partially observed genotypes can be tested empirically, and their realized accuracy is a testament to the fact that hidden Markov models offer a good phenomenological description of the dependence between the explanatory variables in genome-wide association studies.

By developing a suitable construction for the knockoffs, we incorporate the prior knowledge on patterns of genetic variation and obtain a new variable selection method that addresses all the critical issues of association studies discussed in §1.1.

1.4. Related work

This paper is most closely related to Candès et al. (2018), which introduced the framework of model-X knockoffs and considered the special case of multivariate Gaussian variables. Earlier work (Barber & Candès, 2015) developed a closely related methodology specific to linear regression with a fixed design matrix, i.e., fixed-X knockoffs. In the interest of simplicity, in the rest of this paper we will refer to model-X knockoffs simply as knockoffs.

Traditional multivariate variable selection techniques have been applied in genome-wide association studies on numerous occasions. Some works have employed penalized regression, but they either lack Type I error control (Hoggart et al., 2008; Wu et al., 2009) or require very restrictive modelling assumptions (Brzyski et al., 2017). Similarly, their Bayesian alternatives (Li et al., 2011; Guan & Stephens, 2011) do not provide finite-sample guarantees. Some have tried to control the Type I errors of standard penalized regression methods through stability selection (Alexander & Lange, 2011), but the resulting procedure does not correctly account for variable correlations and is less powerful than marginal testing. Others have employed machine learning tools (Bureau et al., 2005) that can produce variable importance measures but no valid inference. In theory, some inferential guarantees have been obtained for the lasso (Zhao & Yu, 2006; Candès & Plan, 2009), generalized linear models (van de Geer et al., 2014) and even random forests (Wager & Athey, 2018), but they only hold under rather stringent sparsity assumptions.

Hidden Markov models have appeared before as part of a variable selection procedure for association studies, in order to combine marginal tests of association from correlated polymorphisms (Sun & Cai, 2009; Wei et al., 2009). However, this approach is fundamentally different from ours, since it is not multivariate and makes very different modelling assumptions.

2. Controlled variable selection via knockoffs

2.1. Problem statement

The controlled variable selection problem can be stated in formal terms by adopting the general setting of Candès et al. (2018). Suppose that we can observe a response $Y \in \mathbb{R}$ and a vector of covariates $X = (X_1, \ldots, X_p) \in \mathbb{R}^p$. Given $n$ such samples $(X^{(i)}, Y^{(i)})_{i=1}^{n}$ drawn from a population, we would like to know which variables are associated with the response. This can be made more precise by assuming that the observations are sampled independently from

$$(X^{(i)}, Y^{(i)}) \sim F_{XY}, \qquad i = 1, \ldots, n,$$

for some joint distribution $F_{XY}$. The concept of a relevant variable can be understood by first defining its opposite. We say that $X_j$ is null if and only if $Y$ is independent of $X_j$, conditionally on all other variables $X_{-j} = \{X_1, \ldots, X_p\} \setminus \{X_j\}$. This uniquely defines the set of null covariates $\mathcal{H}_0 \subseteq \{1, \ldots, p\}$ and the complement $\mathcal{S} = \{1, \ldots, p\} \setminus \mathcal{H}_0$. Our goal is to obtain an estimate $\hat{\mathcal{S}}$ of $\mathcal{S}$ while controlling the false discovery rate, the expected value of the false discovery proportion,

$$\mathrm{FDR} = E\left[ \frac{|\hat{\mathcal{S}} \cap \mathcal{H}_0|}{\max(1, |\hat{\mathcal{S}}|)} \right].$$

We emphasize the logic of this definition: a variable is null if it has no predictive power once we take into account all the other variables, i.e., it does not influence the response in any way. To relate this to traditional inference, in a generalized linear model, being null is equivalent to having a vanishing regression coefficient, under an extremely mild condition (Candès et al., 2018).

2.2. The limitations of marginal testing

Although by far the most common data analysis strategy in genome-wide association studies, marginal inference is not necessarily a principled choice, but rather one of convenience. Indeed, the scientific goal is to uncover the genetic basis of complex traits, those that are expected to be influenced by a large number of possibly interacting genetic variants. In this framework, the most natural model for relating a trait to genetic polymorphisms includes many such DNA variations. Adopting the simplifying additive assumption that is pervasive in genetics, one might be interested in estimating a generalized linear model that relates the trait value to a linear combination of the allele counts at many polymorphisms. Indeed, the statistical genetics literature documents many contributions in this direction, both in the Bayesian (Hoggart et al., 2008; Guan & Stephens, 2011) and in the frequentist (Wu et al., 2009) setting, as more comprehensively reviewed in Sabatti (2013). Yet, approaches that study the effects of many variants jointly, and try to identify the contribution of each one conditional on the rest, have not become part of the standard analysis pipeline for genome-wide data, even if they are the prevalent approach for variants prioritization and follow-up studies (Hormozdiari et al., 2014). This is due to difficulties encountered in articulating an effective genome-wide search for variants that influence the phenotype given every other polymorphism. These range from considerations of computational and data manipulation convenience, e.g., handling of missing data, to the challenge of distinguishing the contribution of highly correlated neighbouring variants, to the fact that, until recently, high-dimensional model selection strategies lacked finite-sample guarantees on the quality of the selected set. 
The contribution of this paper stems from the observation that this latest impasse can now in principle be overcome by deploying the knockoffs framework (Candès et al., 2018). We will describe how we handle the other difficulties in §6.1 and §7. For an up-to-date discussion of the advantages of investigating the effects of a variant in the context of all other recorded polymorphisms, see Brzyski et al. (2017).

2.3. The method of knockoffs

The main idea in Candès et al. (2018) is to generate a set of artificial covariates that have the same structure as the original ones but are known to be null. These are called the knockoff copies of $X$ and they can be used as negative controls to estimate the false discovery rate with almost any existing variable selection algorithm. In this paper, we develop new methods for sampling the knockoff copies, but we do not alter other aspects of the variable selection procedure of Candès et al. (2018). Therefore, we only present a short summary below, leaving a more detailed description for the Supplementary Material.

For each variable $X_j$, we need to construct a knockoff copy $\tilde{X}_j$ in such a way that $X = (X_1, \ldots, X_p)$ and $\tilde{X} = (\tilde{X}_1, \ldots, \tilde{X}_p)$ satisfy the following conditions:

$$Y \perp\!\!\!\perp \tilde{X} \mid X, \tag{1}$$
$$(X, \tilde{X})_{\mathrm{swap}(S)} \overset{d}{=} (X, \tilde{X}) \quad \text{for any } S \subseteq \{1, \ldots, p\}. \tag{2}$$

Above, the symbol $\overset{d}{=}$ indicates equality in distribution, while $(X, \tilde{X})_{\mathrm{swap}(S)}$ denotes the vector obtained by swapping the entries $X_j$ and $\tilde{X}_j$ for each $j \in S$. The pairwise exchangeability condition (2) requires the distribution of $(X, \tilde{X})$ to be invariant under this transformation. As we discuss later, (2) is essential and it is not always easy to produce a nontrivial, i.e., different from $X$ itself, vector $\tilde{X}$ that satisfies it. We refer to (1) as the nullity condition, since it implies that all knockoff copies are null variables in the augmented model that includes both $X$ and $\tilde{X}$. This clearly holds whenever $\tilde{X}$ is constructed without looking at $Y$.

Once we have the knockoff copies $\tilde{X}$, we can perform controlled variable selection in two steps. First, we compute feature importance statistics $Z = (Z_1, \ldots, Z_p)$ and $\tilde{Z} = (\tilde{Z}_1, \ldots, \tilde{Z}_p)$, such that $Z_j$ and $\tilde{Z}_j$ measure the importance of $X_j$ and $\tilde{X}_j$ in predicting $Y$, for each $j \in \{1, \ldots, p\}$. For example, we can think of $Z_j$ and $\tilde{Z}_j$ as the magnitudes of the lasso coefficients for $X_j$ and $\tilde{X}_j$, obtained by regressing $Y$ on $X$ and $\tilde{X}$ jointly, although many other options are available. Then, we combine them into a vector $W$ with $p$ entries defined as $W_j = Z_j - \tilde{Z}_j$. Intuitively, a positive and large value of $W_j$ indicates that the $j$th variable is truly important. More precisely, the knockoff filter of Barber & Candès (2015) is used to compute a data-dependent significance threshold $\tau$ in such a way as to select important variables, those with $W_j \geq \tau$, with provable control of the false discovery rate. In summary, knockoffs can be seen as a versatile wrapper that makes it possible to extend rigorous statistical guarantees, under very mild assumptions, to powerful practical methods that would otherwise be too complex for a direct theoretical analysis.
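The two-step selection described above can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes that the importance measures of the original variables and of their knockoffs have already been computed and are passed as two arrays (hypothetically named `Z` and `Z_tilde`), it combines them by taking differences, and it applies the knockoff+ threshold of Barber & Candès (2015) at a nominal false discovery rate level `q`.

```python
import numpy as np

def knockoff_threshold(W, q=0.1, offset=1):
    """Smallest t such that the estimated false discovery proportion is at most q.

    offset=1 corresponds to the knockoff+ threshold of Barber & Candes (2015).
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds: distinct nonzero magnitudes of W, in increasing order.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        # Large negative W_j act as negative controls for large positive ones.
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return t
    return np.inf  # no threshold meets the target level: select nothing

def knockoff_select(Z, Z_tilde, q=0.1):
    """Indices selected by the knockoff filter, with the statistics W and threshold."""
    W = np.asarray(Z, dtype=float) - np.asarray(Z_tilde, dtype=float)
    tau = knockoff_threshold(W, q)
    return np.flatnonzero(W >= tau), W, tau
```

Any other antisymmetric combination of the two importance measures could be substituted for the difference; the thresholding step would remain unchanged.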

2.4. Constructing knockoffs

In §2.3 we have said that the knockoff variables need to satisfy the nullity and pairwise exchangeability properties, (1) and (2). We now develop exact and computationally efficient procedures for the case in which $F_X$ corresponds to a Markov chain or a hidden Markov model, inspired by the following result.

Proposition 1 (Appendix B in Candès et al., 2018).

Let $X = (X_1, \ldots, X_p)$ be a vector of $p$ covariates with some known distribution $F_X$. Suppose that, with a single iteration over $j = 1, \ldots, p$, we sequentially sample $\tilde{X}_j$ from $\mathcal{L}(X_j \mid X_{-j}, \tilde{X}_{1:(j-1)})$, independently of the observed value of $X_j$. Then, the vector $\tilde{X}$ that we obtain is a knockoff copy of $X$.

The conditional distribution above of $X_j$ given all the other variables $X_{-j}$ and $\tilde{X}_{1:(j-1)}$ depends on the knockoff copies generated during the previous iterations, and it can be very difficult to compute in general, even though the distribution of $X$ is known. Therefore, Proposition 1 suggests a general recipe, but obtaining a practical algorithm is not always straightforward.

3. Knockoffs for Markov chains

We begin by focusing our attention on discrete Markov chains. Formally, we say that a vector of random variables $X = (X_1, \ldots, X_p)$, each taking values in a finite state space $\mathcal{X}$, is distributed as a discrete Markov chain if its joint probability mass function can be written as

$$\mathrm{pr}(X_1 = x_1, \ldots, X_p = x_p) = q_1(x_1) \prod_{j=2}^{p} Q_j(x_j \mid x_{j-1}), \tag{3}$$

where $q_1(x_1) = \mathrm{pr}(X_1 = x_1)$ denotes the marginal distribution of the first element of the chain and the transition matrices between consecutive variables are $Q_j(x_j \mid x_{j-1}) = \mathrm{pr}(X_j = x_j \mid X_{j-1} = x_{j-1})$.

Our first result, whose proof can be found in the Supplementary Material, provides a way of sampling exact knockoff copies of a discrete Markov chain.

Proposition 2.

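Suppose that $X = (X_1, \ldots, X_p)$ is distributed as the Markov chain in (3), with known parameters $(q_1, Q)$. Then, a knockoff copy $\tilde{X} = (\tilde{X}_1, \ldots, \tilde{X}_p)$ can be obtained by sequentially sampling, with a single iteration over $j = 1, \ldots, p$, the $j$th knockoff variable $\tilde{X}_j$ from

$$\mathrm{pr}(\tilde{X}_j = \tilde{x}_j \mid X_{-j} = x_{-j}, \tilde{X}_{1:(j-1)} = \tilde{x}_{1:(j-1)}) = \frac{Q_j(\tilde{x}_j \mid x_{j-1})\, Q_j(\tilde{x}_j \mid \tilde{x}_{j-1})\, Q_{j+1}(x_{j+1} \mid \tilde{x}_j)}{N_{j-1}(\tilde{x}_j)\, N_j(x_{j+1})}, \tag{4}$$

with the normalization functions $N_j$ defined recursively as

$$N_j(k) = \sum_{l \in \mathcal{X}} \frac{Q_j(l \mid x_{j-1})\, Q_j(l \mid \tilde{x}_{j-1})\, Q_{j+1}(k \mid l)}{N_{j-1}(l)}. \tag{5}$$

At the boundaries, (4) and (5) are understood with the conventions $Q_1(l \mid x_0) = q_1(l)$, $Q_1(l \mid \tilde{x}_0) = 1$, $N_0(l) = 1$ and $Q_{p+1}(k \mid l) = 1$ for all $k, l \in \mathcal{X}$.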

Therefore, Algorithm 1 is an exact procedure for sampling knockoff copies of a Markov chain.

Algorithm 1.

Knockoff copies of a discrete Markov chain.

For $j = 1$ to $p$:

 For $k$ in $\mathcal{X}$:

  Compute $N_j(k)$ according to (5).

 Sample $\tilde{X}_j$ according to (4).

At each step $j$ of Algorithm 1, the evaluation of the normalization function $N_j(k)$ involves a sum over all elements of the finite state space $\mathcal{X}$ and depends only on the previous $N_{j-1}$. Since this operation must be repeated for all values of $k \in \mathcal{X}$, sampling the $j$th knockoff variable requires $O(K^2)$ time, where $K = |\mathcal{X}|$ is the number of possible states of the Markov chain. This procedure is sequential, generating one knockoff variable at a time. Therefore, the total computation time is $O(pK^2)$, while the required memory is $O(K)$. It is also trivially parallelizable if one wishes to construct a knockoff copy for each of $n$ independent Markov chains. These features make Algorithm 1 efficient and suitable for high-dimensional applications.
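As an illustration of Algorithm 1, the following Python sketch samples a knockoff copy of a discrete Markov chain. It is a minimal sketch under stated assumptions, not the authors' software: states are coded as integers from 0 to K-1, positions are 0-based, the transition matrices are stored in a hypothetical array `Q` of shape (p, K, K) whose first slice is unused, and all transition probabilities are assumed strictly positive. At each position, the weight of a candidate value multiplies the transition probability from the previous original variable and the transition probability from the previous knockoff, and divides by the previous normalization function; multiplying by the transition probability to the next original variable and normalizing gives the sampling distribution.

```python
import numpy as np

def knockoff_step(j, x, xk, N_prev, q1, Q):
    """Sampling distribution of the j-th knockoff and the new normalization vector.

    Q[j][a, b] = pr(X_j = b | X_{j-1} = a) for j >= 1 (Q[0] is unused).
    xk holds the knockoffs already sampled at positions 0..j-1.
    """
    p = len(x)
    if j == 0:
        w = np.asarray(q1, dtype=float)          # first position: initial law only
    else:
        w = Q[j][x[j - 1]] * Q[j][xk[j - 1]] / N_prev
    if j < p - 1:
        N = w @ Q[j + 1]                         # N[k] = sum_l w[l] * Q[j+1][l, k]
        probs = w * Q[j + 1][:, x[j + 1]] / N[x[j + 1]]
    else:
        N = None                                 # last position: no next variable
        probs = w / w.sum()
    return probs, N

def sample_markov_chain_knockoff(x, q1, Q, rng):
    """One knockoff copy of the observed chain x, in O(p K^2) time."""
    p, K = len(x), len(q1)
    xk = np.empty(p, dtype=int)
    N_prev = np.ones(K)
    for j in range(p):
        probs, N_prev = knockoff_step(j, x, xk, N_prev, q1, Q)
        xk[j] = rng.choice(K, p=probs / probs.sum())
    return xk
```

Only the most recent normalization vector is kept between iterations, which is why the working memory does not grow with the length of the chain.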

4. Knockoffs for hidden Markov models

4.1. Hidden Markov models

A hidden Markov model assumes the presence of a latent Markov chain, whose states are not directly visible but conditional on which the observations are independently sampled. Formally, we say that $X = (X_1, \ldots, X_p)$, taking values in a finite state space $\mathcal{X}^p$, is distributed as a hidden Markov model with $K$ hidden states if there exists a vector $Z = (Z_1, \ldots, Z_p)$ such that

$$\begin{aligned} Z &\sim \mathrm{MC}(q_1, Q) && \text{(latent Markov chain)}, \\ X_j \mid Z &\sim f_j(X_j \mid Z_j) \ \text{independently for } j = 1, \ldots, p && \text{(emission distribution)}, \end{aligned} \tag{6}$$

where $\mathrm{MC}(q_1, Q)$ indicates the law of a discrete Markov chain as in (3), with each element $Z_j$ taking values in $\{1, \ldots, K\}$. Conditional on $Z$, each $X_j$ is sampled independently from the emission distribution $f_j(X_j \mid Z_j)$. We emphasize that we are restricting our attention to these discrete distributions solely for simplicity. At the price of slightly more involved notation, the knockoff construction can easily be extended to continuous emission distributions.

4.2. Generating knockoffs for hidden Markov models

The observed variables $X$ in the hidden Markov model (6) do not satisfy the Markov property. In fact, computing the conditional distributions $\mathcal{L}(X_j \mid X_{-j}, \tilde{X}_{1:(j-1)})$ from Proposition 1 would involve a sum over all possible configurations of $Z$. The complexity of this operation is exponential in $p$, thus making the naïve approach unfeasible even for moderately large datasets. Our solution is inspired by the traditional forward-backward methods for hidden Markov models. Having observed $X$, we propose to construct a knockoff copy $\tilde{X}$ according to Algorithm 2.

Algorithm 2.

Knockoff copies of a hidden Markov model.

Sample $Z$ from its conditional distribution given $X$, using Algorithm 3.

Sample a knockoff copy $\tilde{Z}$ of $Z$ using Algorithm 1.

Sample $\tilde{X}$ from the conditional distribution of $X$ given $Z = \tilde{Z}$, which is easy by conditional independence.

A graphical representation of Algorithm 2 is shown in Fig. 1. In the first stage, the latent Markov chain is imputed by sampling from the conditional distribution of $Z$ given $X$. This is done efficiently with Algorithm 3, a forward-backward iteration discussed in the Supplementary Material and similar to the Viterbi algorithm. Once $Z$ has been sampled, a knockoff copy $\tilde{Z}$ can be obtained with Algorithm 1. Finally, we sample $\tilde{X}$ from the emission distribution of $X$ given $Z = \tilde{Z}$, which is easy because of the conditional independence between the emission distributions in the hidden Markov model.

Fig. 1.

Sketch of Algorithm 2 for knockoff copies of a hidden Markov model, in the case $p = 3$.

Algorithm 3.

Forward-backward sampling for a hidden Markov model.

Initialize $\alpha_0(l) = 1/K$, $Q_1(k \mid l) = q_1(k)$ and $Q_{p+1}(k \mid l) = 1$ for all $k, l \in \{1, \ldots, K\}$.

For $j = 1$ to $p$ (forward pass):

 For $k = 1$ to $K$:

   $\alpha_j(k) = f_j(x_j \mid k) \sum_{l=1}^{K} Q_j(k \mid l)\, \alpha_{j-1}(l)$.

For $j = p$ to $1$ (backward pass):

 Sample $z_j$ according to $\mathrm{pr}(z_j = k) \propto Q_{j+1}(z_{j+1} \mid k)\, \alpha_j(k)$.

Return $z = (z_1, \ldots, z_p)$.
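The forward-backward sampling step can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code, and its conventions are assumptions: states and observed symbols are coded as integers starting from 0, positions are 0-based, `Q` has shape (p, K, K) with an unused first slice, and `f` is a hypothetical array of shape (p, K, M) holding the emission probabilities. The forward weights are rescaled at every position to avoid numerical underflow, which does not change the sampling distribution because the backward pass normalizes its weights.

```python
import numpy as np

def forward_weights(x, q1, Q, f):
    """alpha[j, k] proportional to pr(X_0..X_j = x_0..x_j, Z_j = k).

    Q[j][a, b] = pr(Z_j = b | Z_{j-1} = a) for j >= 1 (Q[0] is unused).
    f[j][k, s] = pr(X_j = s | Z_j = k).
    """
    p, K = len(x), len(q1)
    alpha = np.empty((p, K))
    alpha[0] = q1 * f[0][:, x[0]]
    alpha[0] /= alpha[0].sum()
    for j in range(1, p):
        alpha[j] = f[j][:, x[j]] * (alpha[j - 1] @ Q[j])
        alpha[j] /= alpha[j].sum()               # rescale to avoid underflow
    return alpha

def backward_probs(alpha, Q, j, z_next=None):
    """Law of Z_j given X = x and Z_{j+1} = z_next (None at the last position)."""
    w = alpha[j] if z_next is None else alpha[j] * Q[j + 1][:, z_next]
    return w / w.sum()

def sample_latent_chain(x, q1, Q, f, rng):
    """Exact sample of the latent chain given the observations, in O(p K^2) time."""
    alpha = forward_weights(x, q1, Q, f)
    p, K = alpha.shape
    z = np.empty(p, dtype=int)
    z_next = None
    for j in range(p - 1, -1, -1):
        z[j] = rng.choice(K, p=backward_probs(alpha, Q, j, z_next))
        z_next = z[j]
    return z
```

A knockoff copy of the observations would then be obtained by passing the sampled latent chain to a Markov chain knockoff sampler and drawing new emissions from the rows of `f` indexed by the knockoff latent states.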

The computation time required by Algorithms 1 and 3 is $O(pK^2)$, while the complexity of the final stage is simply $O(p\,|\mathcal{X}|)$ because the emission distributions are independent conditional on the latent Markov chain. Therefore, Algorithm 2 runs in $O(p(K^2 + |\mathcal{X}|))$ time. The following two results establish the correctness of this approach.

Proposition 3.

Suppose that $X = (X_1, \ldots, X_p)$ is observed from the hidden Markov model in (6), with known parameters $(q_1, Q, f)$. Then, Algorithm 3 produces an exact sample from the conditional distribution of its latent Markov chain $Z$ given $X$.

Theorem 1.

Suppose that $X = (X_1, \ldots, X_p)$ is observed from the hidden Markov model in (6), with known parameters $(q_1, Q, f)$. Then $(\tilde{X}, \tilde{Z})$ generated by Algorithm 2 is a knockoff copy of $(X, Z)$. That is, for any subset $S \subseteq \{1, \ldots, p\}$,

$$\{(X, Z), (\tilde{X}, \tilde{Z})\}_{\mathrm{swap}(S)} \overset{d}{=} \{(X, Z), (\tilde{X}, \tilde{Z})\}. \tag{7}$$

In particular, this implies that $\tilde{X}$ is a knockoff copy of $X$.

Proof.

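It suffices to prove (7), since marginalizing over $(Z, \tilde{Z})$ implies that $(X, \tilde{X})_{\mathrm{swap}(S)}$ has the same distribution as $(X, \tilde{X})$. Conditioning on the values of the latent variables, one can write

$$\begin{aligned} \mathrm{pr}\{(X, \tilde{X}) = (x, \tilde{x})_{\mathrm{swap}(S)}, (Z, \tilde{Z}) = (z, \tilde{z})_{\mathrm{swap}(S)}\} &= \mathrm{pr}\{(Z, \tilde{Z}) = (z, \tilde{z})_{\mathrm{swap}(S)}\}\, \mathrm{pr}\{(X, \tilde{X}) = (x, \tilde{x})_{\mathrm{swap}(S)} \mid (Z, \tilde{Z}) = (z, \tilde{z})_{\mathrm{swap}(S)}\} \\ &= \mathrm{pr}\{(Z, \tilde{Z}) = (z, \tilde{z})_{\mathrm{swap}(S)}\}\, \mathrm{pr}\{(X, \tilde{X}) = (x, \tilde{x}) \mid (Z, \tilde{Z}) = (z, \tilde{z})\} \\ &= \mathrm{pr}\{(Z, \tilde{Z}) = (z, \tilde{z})\}\, \mathrm{pr}\{(X, \tilde{X}) = (x, \tilde{x}) \mid (Z, \tilde{Z}) = (z, \tilde{z})\}. \end{aligned}$$

The first equality above follows from line 1 of Algorithm 2 and Proposition 3, whose proof can be found in the Supplementary Material. The second equality follows from the conditional independence of the emission distributions in a hidden Markov model. The third equality follows from $\tilde{Z}$ being a knockoff copy of $Z$, as established in Proposition 2. The last expression is the joint probability of $(X, \tilde{X}) = (x, \tilde{x})$ and $(Z, \tilde{Z}) = (z, \tilde{z})$, which proves (7). □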

5. Hidden Markov models in genome-wide association studies

5.1. Modelling single-nucleotide polymorphisms

In a genome-wide association study, the response $Y$ is the status of a disease or a quantitative trait of interest, while each sample of $X$ consists of the genotype for a set of single-nucleotide polymorphisms. In particular, we consider the case in which $X \in \{0, 1, 2\}^p$ collects unphased genotypes. For simplicity, in this section we restrict our attention to a single chromosome, since distinct ones are typically assumed to be independent. Different hidden Markov models have been proposed to describe the block-like patterns observed in the distribution of the alleles at adjacent markers, but in this paper we adopt the model implemented in fastPHASE (Scheet & Stephens, 2006) and outlined below. We opt for this model because we find that it has both an intuitive interpretation and remarkable computational efficiency. However, our knockoff construction from §4 can easily be implemented with other parameterizations.

The unphased genotype of an individual can be seen as the componentwise sum of two unobserved sequences, called haplotypes, $H^a$ and $H^b$, where $H^a_j, H^b_j \in \{0, 1\}$ are binary variables representing the allele on the $j$th marker. The main modelling assumption is that the two haplotypes are independent and identically distributed as hidden Markov models. This idea is sketched in Fig. 2 for $p = 3$. In order to precisely describe this model, we begin by focusing on a single sequence $H = (H_1, \ldots, H_p)$. Its distribution is in the same form as the model defined earlier in (6),

$$\begin{aligned} Z &\sim \mathrm{MC}(q_1, Q), \\ H_j \mid Z &\sim f_j(H_j \mid Z_j) \ \text{independently for } j = 1, \ldots, p, \end{aligned}$$

with a latent Markov chain $Z = (Z_1, \ldots, Z_p)$ whose elements indicate membership in one of $K$ groups of closely related haplotypes. These groups are characterized by specific allele frequencies at the various markers, so that one can see $H$ as a mosaic of segments, each originating from one of $K$ distinct motifs that can be loosely taken as representing the genome of the population founders. This model provides a good description of the local patterns of correlation, but it is phenomenological in nature and should not be interpreted as an accurate representation of the real sequence of mutations and recombinations that gave rise to the population haplotypes.

Fig. 2.

Sequence of $p = 3$ genotype polymorphisms (shaded) as the componentwise sum of two hidden Markov model haplotypes (white).

The marginal distribution of the first element of the hidden Markov chain $Z$ is

$$q_1(k) = \alpha_{1,k}, \qquad k \in \{1, \ldots, K\},$$

while the transition matrices are

$$Q_j(k' \mid k) = \begin{cases} e^{-r_j} + (1 - e^{-r_j})\, \alpha_{j,k'}, & k' = k, \\ (1 - e^{-r_j})\, \alpha_{j,k'}, & k' \neq k. \end{cases}$$

The parameters $\alpha = (\alpha_{j,k})$ describe the propensity of different motifs to succeed each other. The occurrence of a transition is regulated by the values of $r = (r_1, \ldots, r_p)$, which are intuitively related to the genetic recombination rates. Once a sequence of ancestral segments is fixed, the allele $H_j$ in position $j$ is sampled from the emission distribution

$$f_j(h_j \mid z_j) = \begin{cases} 1 - \theta_{j, z_j}, & h_j = 0, \\ \theta_{j, z_j}, & h_j = 1. \end{cases}$$

The parameters $\theta = (\theta_{j,k})$ represent the probabilities of the alleles being equal to 1, for each of the $p$ polymorphisms and the $K$ ancestral haplotype motifs. These can be estimated along with $\alpha$ and $r$.
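To make the haplotype model concrete, the Python sketch below builds one transition matrix and simulates a single haplotype. It is only an illustration and rests on an assumption about the parameterization, namely the fastPHASE form in which the latent state is kept with probability exp(-r_j) and is otherwise redrawn from the motif weights at that position. The array names `r`, `alpha` and `theta`, with shapes (p,), (p, K) and (p, K), are hypothetical choices for this example.

```python
import numpy as np

def haplotype_transition(r_j, alpha_j):
    """K x K matrix with entry [k, k2] = exp(-r_j) * 1{k2 == k} + (1 - exp(-r_j)) * alpha_j[k2]."""
    K = len(alpha_j)
    stay = np.exp(-r_j)
    return stay * np.eye(K) + (1.0 - stay) * np.tile(alpha_j, (K, 1))

def sample_haplotype(r, alpha, theta, rng):
    """Simulate the latent motif sequence z and the binary alleles h.

    alpha[j] are the motif weights at position j (alpha[0] is the initial law);
    theta[j, k] is the probability that the allele equals 1 under motif k.
    """
    p, K = alpha.shape
    z = np.empty(p, dtype=int)
    z[0] = rng.choice(K, p=alpha[0])
    for j in range(1, p):
        z[j] = rng.choice(K, p=haplotype_transition(r[j], alpha[j])[z[j - 1]])
    h = (rng.random(p) < theta[np.arange(p), z]).astype(int)
    return z, h
```

Small values of `r[j]` make the simulated haplotype a mosaic of long segments copied from the same motif, while large values let the motif change almost freely between adjacent markers.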

Having defined the distribution of $H$, we return our attention to the observed genotype vector. By definition, the genotype $X$ of an individual is obtained by pairing, marker by marker, the alleles on each haplotype and discarding information on the haplotype of origin, i.e., the phase. Then, under standard assumptions such as the Hardy–Weinberg equilibrium, the genotype vector of a randomly sampled subject can be described as the elementwise sum of two independent and identically distributed haplotype sequences, each following the above model. Consequently, its distribution is also a hidden Markov model. The latent Markov chain has bivariate states, corresponding to unordered pairs $\{k_a, k_b\}$ of haplotype latent states. It is easy to verify that these can take $K(K+1)/2$ possible values. By this construction, it follows that the initial-state probabilities for the genotype model are

$$q_1^{\mathrm{gen}}(\{k_a, k_b\}) = \begin{cases} (\alpha_{1,k_a})^2, & k_a = k_b, \\ 2\,\alpha_{1,k_a}\alpha_{1,k_b}, & k_a \neq k_b, \end{cases} \qquad (8)$$

and the transition matrices are

$$Q_j^{\mathrm{gen}}(\{k'_a, k'_b\} \mid \{k_a, k_b\}) = \begin{cases} Q_j(k'_a \mid k_a)\,Q_j(k'_b \mid k_b) + Q_j(k'_b \mid k_a)\,Q_j(k'_a \mid k_b), & k'_a \neq k'_b, \\ Q_j(k'_a \mid k_a)\,Q_j(k'_b \mid k_b), & k'_a = k'_b. \end{cases} \qquad (9)$$

Similarly, the emission probabilities for $X_j$ are

$$f_j^{\mathrm{gen}}(x_j; \{k_a, k_b\}, \theta) = \begin{cases} (1 - \theta_{j,k_a})(1 - \theta_{j,k_b}), & x_j = 0, \\ \theta_{j,k_a}(1 - \theta_{j,k_b}) + (1 - \theta_{j,k_a})\,\theta_{j,k_b}, & x_j = 1, \\ \theta_{j,k_a}\theta_{j,k_b}, & x_j = 2. \end{cases} \qquad (10)$$
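The bookkeeping behind the genotype model can be made concrete with a short Python sketch. It enumerates the unordered pairs of haplotype motifs and computes initial and emission probabilities for a genotype viewed as the sum of two independent alleles. The function names and numerical values are hypothetical, and each allele is assumed to be Bernoulli with a motif-specific success probability.

```python
from itertools import combinations_with_replacement


def genotype_states(K):
    """Unordered pairs {ka, kb} of haplotype motifs, stored as sorted tuples."""
    return list(combinations_with_replacement(range(K), 2))


def genotype_initial(alpha_1, state):
    """Initial probability of an unordered pair of independent motifs.

    alpha_1[k] is the (assumed) initial probability of motif k for one haplotype.
    """
    ka, kb = state
    if ka == kb:
        return alpha_1[ka] ** 2
    # Two ordered pairs, (ka, kb) and (kb, ka), map to the same unordered pair.
    return 2 * alpha_1[ka] * alpha_1[kb]


def genotype_emission(theta_j, state):
    """Probabilities of genotype 0, 1, 2 as the sum of two independent
    Bernoulli alleles with success probabilities theta_j[ka] and theta_j[kb]."""
    ka, kb = state
    a, b = theta_j[ka], theta_j[kb]
    return [(1 - a) * (1 - b), a * (1 - b) + (1 - a) * b, a * b]


K = 3
states = genotype_states(K)          # K * (K + 1) / 2 unordered pairs
alpha_1 = [0.2, 0.3, 0.5]            # made-up initial motif weights
theta_j = [0.9, 0.1, 0.5]            # made-up allele frequencies at one site
init = [genotype_initial(alpha_1, s) for s in states]
emis = [genotype_emission(theta_j, s) for s in states]
```

Both the initial probabilities over all pairs and the three emission probabilities for any given pair sum to one, which is a quick sanity check on the construction.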

5.2. Parameter estimation

The construction of knockoff copies requires knowing the distribution of the covariates, as discussed in §2.3. However, exact knowledge is unrealistic in practical applications and some degree of approximation is ultimately unavoidable. Since we have argued that the model in (8)–(10) offers a sensible and tractable description of real genotypes, it makes sense to estimate the Inline graphic parameters in Inline graphic from the data. In the usual setting for genome-wide association studies, one has available Inline graphic observations for each of the Inline graphic sites, so this task is not unreasonable. Moreover, the validity of this approach is empirically verified in our simulations with real genetic covariates, as discussed in the next section. Alternatively, if additional unsupervised observations, i.e., including only the covariates, from the same population are available, one could include them to improve the estimation.

All parameters can be efficiently estimated with a standard expectation-maximization technique in Inline graphic time, as already implemented in the freely available imputation software fastPHASE. This software fits the model described above, originally for the purpose of recovering missing observations, and it conveniently provides the estimates Inline graphic. An important advantage of the hidden Markov model is that the number of parameters grows only linearly in Inline graphic, which greatly reduces the risk of overfitting compared to a multivariate Gaussian approximation. The complexity of this model is controlled by the number Inline graphic of haplotype motifs, for which typical recommended values are around 10 (Scheet & Stephens, 2006) and which can be fine-tuned with crossvalidation. Even though the theoretical guarantee of false discovery rate control with knockoffs requires Inline graphic to be known, the numerical experiments discussed in the next section and in the Supplementary Material indicate that our procedure is robust to its estimation.

6. Numerical simulations

6.1. Numerical simulation with real genetic covariates

We now verify the power and robustness of our procedure with real covariates obtained from a genome-wide association study. We consider 29 258 polymorphisms on chromosome 1, genotyped in 14 708 individuals from WTCCC (2007). Following Candès et al. (2018), we simulate the response from a conditional logistic regression model of Inline graphic with Inline graphic nonzero coefficients, as described in the Supplementary Material.

Before applying our procedure, we reduce the number of covariates by pruning. This is desirable due to the presence of extremely high correlations between neighbouring sites, which makes it fundamentally impossible to distinguish nearly identical variables with a limited amount of data. Our solution uses hierarchical clustering to identify groups of sites in such a way that no two polymorphisms in different clusters have correlation greater than 0.5. Then, within each group we identify a single representative that is most strongly associated with the phenotype in a hold-out set of 1000 observations, described in more detail in the Supplementary Material. At this point, we will use knockoffs to perform variable selection on the cluster representatives, thus effectively interpreting these groups as the basic units of inference among which we search for important variables. Far from removing all correlations and making variable selection trivial, by pruning we acknowledge that a limited amount of data only allows limited resolution. Had we more data, we would prune less. This approach is also consistent with the common practice in genome-wide association studies of interpreting findings as identifying regions in the genome rather than as individual polymorphisms.
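The pruning step can be sketched in a few lines of Python. The sketch below is not the implementation used for the experiments; it assumes single-linkage clustering on absolute Pearson correlation, which is one way to guarantee that variables in different clusters never exceed the threshold, and it takes a hypothetical vector of hold-out association scores to choose the representatives. It also assumes that no column is constant.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def prune_clusters(columns, threshold=0.5):
    """Group columns so that columns in different groups never have absolute
    correlation above `threshold` (connected components, i.e. single linkage)."""
    p = len(columns)
    parent = list(range(p))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if abs(pearson(columns[i], columns[j])) > threshold:
                parent[find(i)] = find(j)  # merge the two groups

    clusters = {}
    for i in range(p):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def pick_representatives(clusters, score):
    """Within each cluster, keep the variable with the largest hold-out score."""
    return [max(c, key=lambda j: score[j]) for c in clusters]


# Toy genotype columns (one list per polymorphism) and made-up hold-out scores.
columns = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [0, 0, 0, 1, 1, 1],
    [2, 1, 0, 2, 1, 0],
]
clusters = prune_clusters(columns)
representatives = pick_representatives(clusters, score=[0.1, 0.9, 0.3, 0.2])
```

The quadratic loop over pairs is only suitable for a toy example; on a real chromosome one would restrict the comparison to nearby sites or rely on an optimized clustering routine.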

Having reduced the number of variables to 5260 by pruning, we split the samples, i.e., the rows of Inline graphic, into 10 subsets and separately fit the model of §5.1 with fastPHASE, using the default settings and assuming the presence of Inline graphic latent haplotype clusters. Once the parameters are estimated, we construct the knockoff copies using Algorithm 2. With our implementation, this takes approximately 0.1 seconds on a single core of a 2.60 GHz Intel Xeon CPU for each individual. We run the knockoffs procedure on each split, adopting as variable importance measures the magnitudes of the logistic regression coefficients fitted with an $\ell_1$-norm penalty tuned by crossvalidation. The knockoff filter is then applied at level Inline graphic and with offset equal to 0. The power and proportion of false discoveries are estimated by comparing our selections to the true logistic model, counting a finding as true if and only if any of the polymorphisms in the selected cluster has a nonzero coefficient. The entire experiment is repeated 10 times, starting with the choice of the logistic model. This yields 100 point estimates for the power and false discovery rate, whose empirical distribution is shown in Fig. 3 and Table 1, for different values of the signal amplitude. We have also applied the knockoff filter with offset, i.e., its slightly more conservative version, as explained in the Supplementary Material. As shown in Table 1, the value of the offset is of little practical consequence, except when very few discoveries are made, i.e., for a weak signal.
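The role of the offset can be illustrated with a minimal Python sketch of the knockoff filter. It assumes the usual data-dependent threshold of Barber & Candès (2015), applied to importance statistics that have already been computed; the function names and the toy statistics are hypothetical.

```python
def knockoff_threshold(W, q, offset=0):
    """Smallest t among the nonzero |W_j| such that
    (offset + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns infinity if no such t exists, in which case nothing is selected."""
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        negatives = sum(1 for w in W if w <= -t)
        positives = sum(1 for w in W if w >= t)
        if (offset + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(W, q, offset=0):
    """Indices of the variables whose statistic is at least the threshold."""
    t = knockoff_threshold(W, q, offset)
    return [j for j, w in enumerate(W) if w >= t]


# Toy statistics: large positive values suggest a relevant variable.
W_strong = [3, -1, 2, 1.5, -0.5, 4]
W_weak = [1, -0.5, 2]

sel_strong_0 = knockoff_select(W_strong, q=0.3, offset=0)
sel_strong_1 = knockoff_select(W_strong, q=0.3, offset=1)
sel_weak_0 = knockoff_select(W_weak, q=0.3, offset=0)
sel_weak_1 = knockoff_select(W_weak, q=0.3, offset=1)
```

In the first toy example the two offsets lead to different thresholds but the same selections, whereas in the second, with only a handful of candidate discoveries, the offset of 1 prevents any selection at all.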

Fig. 3.

(a) False discovery proportion, FDP, and (b) power of our procedure with real genetic variables. A box represents 100 experiments. The dashed black line in (a) indicates the target level Inline graphic. The offset of the knockoff filter is set equal to 1.

Table 1.

False discovery rate and power, in percentage, for the experiment of Fig. 3 with 95% normal confidence intervals, i.e., standard errors multiplied by 1.96, with and without offset

Signal amplitude | FDR (95% c.i.), offset 0 | FDR (95% c.i.), offset 1 | Power (95% c.i.), offset 0 | Power (95% c.i.), offset 1
8 | 9.3 Inline graphic | 4.7 Inline graphic | 27.9 Inline graphic | 17.3 Inline graphic
10 | 10.3 Inline graphic | 7.6 Inline graphic | 47.9 Inline graphic | 42.2 Inline graphic
12 | 10.6 Inline graphic | 8.2 Inline graphic | 59.1 Inline graphic | 55.8 Inline graphic
14 | 11.1 Inline graphic | 9.1 Inline graphic | 68.4 Inline graphic | 66.7 Inline graphic
16 | 11.8 Inline graphic | 9.7 Inline graphic | 76.0 Inline graphic | 74.3 Inline graphic
18 | 10.1 Inline graphic | 8.0 Inline graphic | 79.1 Inline graphic | 77.9 Inline graphic
20 | 10.5 Inline graphic | 8.7 Inline graphic | 78.3 Inline graphic | 77.6 Inline graphic

c.i., confidence interval.

The results show that the false discovery rate is controlled and suggest that one can safely apply our method to a genome-wide association study. Our confidence derives from the fact that our procedure enjoys the rigorous robustness of knockoffs for any conditional distribution of the phenotype. As far as Type I error control is concerned, it does not seem consequential that in this experiment we have chosen to simulate the response from a generalized linear model. In fact, the false discovery rate is provably controlled for any Inline graphic, provided that Inline graphic is well-specified. Since we have not artificially simulated the covariates but used real genotypes, we see no reason why our procedure should not enjoy the same control in a real association study.

7. Applications to genome-wide association studies

7.1. Analysis of genome-wide association data

We apply our procedure to data from two association studies: the Northern Finland 1966 Birth Cohort study of metabolic syndrome (Sabatti et al., 2009), accession number phs000276.v2.p1, and the Wellcome Trust Case Control Consortium study of Crohn’s disease (WTCCC, 2007).

The metabolic syndrome study comprises observations on 5402 individuals from northern Finland, including genotypes for approximately Inline graphic polymorphisms and nine phenotypes. We focus on measurements of cholesterol, triglyceride levels and height, as there is a rich literature on their genetic bases that we can rely upon for comparison. Since not all outcome measurements are available for every subject, the effective values of Inline graphic are different for each phenotype and a little lower than 5402. From the Crohn’s disease study, we analyse 2996 control and Inline graphic disease samples typed at Inline graphic polymorphisms.

We pre-process the data as described in the Supplementary Material and reduce the number of variables by pruning with the same method used in the numerical simulation of §6.1. Then, we perform variable selection using our knockoff procedure. Before applying Algorithm 2 to construct the knockoff copies, we estimate the parameters Inline graphic of the hidden Markov model from §5.1 using fastPHASE, separately for each of the first 22 chromosomes. Since the estimation of the covariate distribution does not make use of the response, we compute a single set of estimates for all phenotypes in the metabolic syndrome study, and a separate one for the Crohn’s disease study. In both cases, we run fastPHASE with a prespecified number of latent haplotype clusters Inline graphic. With its default settings, the imputation software estimates Inline graphic with the additional constraint that Inline graphic can only depend on the first index Inline graphic. For simplicity, we do not modify this setting.

Having sampled the knockoff copies, we assess variable importance as in §6.1, by performing a lasso regression of Inline graphic on the standardized knockoff-augmented matrix of covariates Inline graphic, with a regularization parameter Inline graphic chosen through ten-fold crossvalidation. For the Crohn’s disease study the response is binary and we use logistic regression with an $\ell_1$-norm penalty instead of the lasso. Relevant polymorphisms are then selected by applying the knockoff filter with target level Inline graphic and offset equal to Inline graphic.
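One common way of turning the fitted coefficients into importance statistics is sketched below in Python. It assumes the coefficient-difference statistic from the model-X knockoffs literature, with the augmented coefficient vector ordered as the original variables followed by their knockoffs; the function name and the numerical values are hypothetical, and fitting the penalized regression itself is left to any standard package.

```python
def coefficient_difference(beta):
    """Importance statistics W_j = |beta_j| - |beta_{j+p}| from a coefficient
    vector of length 2p: originals first, knockoff copies second."""
    p, remainder = divmod(len(beta), 2)
    if remainder:
        raise ValueError("expected one knockoff coefficient per original variable")
    return [abs(beta[j]) - abs(beta[j + p]) for j in range(p)]


# Made-up coefficients for p = 3 variables and their 3 knockoffs.
beta = [1.5, 0.0, -0.7, 0.2, 0.0, 0.9]
W = coefficient_difference(beta)
```

A large positive W_j indicates that the original variable entered the model more strongly than its knockoff, while a negative value suggests the opposite; swapping a variable with its knockoff flips the sign of the corresponding statistic.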

7.2. Results

We performed the analysis described above on the four datasets. Since our method is not deterministic, in each case the selections depend on the realization of Inline graphic. Repeating the procedure multiple times and cherry-picking the results would obviously violate the control of the false discovery rate, so we instead display all findings that are selected at least 10 times over 100 independent repeats of the knockoffs procedure. This is only supposed to provide the reader with an impression of the variability of our method, since in principle control of the false discovery rate does not necessarily hold if one aggregates selections obtained with different realizations of Inline graphic. Finding a good way of combining these selections remains an open research problem.
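The display rule described above amounts to a simple frequency count, sketched here in Python. The function name and the toy runs are hypothetical; as noted in the text, the resulting list is a descriptive summary and is not itself covered by the false discovery rate guarantee.

```python
from collections import Counter


def frequent_selections(runs, min_count=10):
    """Variables selected in at least `min_count` of the independent runs.

    runs: list of selections, one per realization of the knockoffs.
    """
    # Using set(run) ensures a variable is counted at most once per run.
    counts = Counter(j for run in runs for j in set(run))
    return sorted(j for j, c in counts.items() if c >= min_count)


# 100 made-up runs: variable 1 appears 12 times, 2 appears 17, 3 appears 5, 4 appears 83.
runs = [[1, 2]] * 12 + [[2, 3]] * 5 + [[4]] * 83
displayed = frequent_selections(runs, min_count=10)
```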

While we do not have sufficient experimental evidence to assess which of our discoveries are true, we can compare our results to those of studies carried out on much larger samples and consider these as the only available approximation of the truth. For lipids we rely on Global Lipids Genetics Consortium (2013), for height on Wood et al. (2014) and Marouli et al. (2017), and for Crohn’s disease on Franke et al. (2010). Since different studies include slightly different sets of polymorphisms and our analysis involves a pruning phase, some care has to be taken in deciding when findings match. Each of our clusters spans a genomic locus that can be described by the positions of the first and last polymorphisms. We consider one of our findings to be replicated if the larger study reports as significant a variable whose position is within the region spanned by the cluster we discover.
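The matching rule can be written down explicitly. The Python sketch below assumes that each cluster is summarized by its chromosome and the positions of its polymorphisms, and that the larger study provides a list of significant (chromosome, position) pairs; these data structures and the function name are hypothetical.

```python
def is_replicated(chromosome, cluster_positions, reported):
    """True if the larger study reports a significant variant on the same
    chromosome whose position lies within the span of the cluster.

    reported: iterable of (chromosome, position) pairs.
    """
    first, last = min(cluster_positions), max(cluster_positions)
    return any(c == chromosome and first <= pos <= last for c, pos in reported)


# Made-up positions, for illustration only.
reported = [(1, 150), (2, 500)]
hit = is_replicated(1, [100, 120, 200], reported)       # 150 lies in [100, 200]
miss = is_replicated(2, [100, 200], reported)           # 500 lies outside [100, 200]
boundary = is_replicated(2, [500, 600], reported)       # endpoints are included
```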

Our procedure identifies a larger number of potentially significant loci than traditional methods based on marginal testing, except in the case of triglycerides, for which very few findings are obtained with either approach. In Fig. 4(a), the distribution of the number of discoveries over 100 independent realizations of our knockoff variables is compared to the corresponding fixed quantity from the standard genomic analysis on the same dataset, as performed in the earlier works cited above. We can thus verify that, while our procedure is not deterministic, we consistently select more variables. In Fig. 4(b), we show the proportion of our discoveries that is confirmed by the corresponding meta-analyses, separately for each dataset. If we tried to naïvely estimate the false discovery rate from these plots, we would obtain a value much larger than the target level Inline graphic, but this would not be very meaningful because none of the meta-analyses is believed to have correctly identified all relevant associations. Instead, some perspective can be gained by comparing our proportion of confirmed discoveries to that obtained with marginal testing on the same data. In the case of one type of cholesterol and triglycerides, our confirmed proportion is appreciably higher, even though one may have intuitively expected a better agreement between studies relying on the same testing framework.

Fig. 4.

Discoveries made on different datasets: (a) total number of discoveries; (b) proportion of the discoveries that are confirmed by the meta-analysis. The boxplots refer to our method, while the dashed black lines represent the standard genomic analysis with the same data. The phenotypes are cholesterol (HDL, LDL), triglycerides (TG), height (HT) and Crohn’s disease (CD).

It should not be surprising that our results are at least partially consistent with those of previous studies. In spite of the fact that our method relies on fundamentally different principles, we have selected relevant variables after computing importance measures based on sparse generalized linear regression. The robustness of our Type I error control is completely unaffected by the validity of such a model, but a bias towards the discovery of additive linear effects naturally arises. In future studies, one may discover additional associations by easily deploying our procedure with more complex nonlinear measures of feature importance.

8. Discussion

Conditionally on Inline graphic and Inline graphic, the selections depend on the specific realization of the knockoffs Inline graphic. Different repetitions of our procedure provide reasonably consistent answers on the same data, but at this point it is not clear how to best aggregate the different results.

In our analysis of genetic data, we have pruned the variables during the pre-processing phase and restricted the inference to the representatives for each group. Alternatively, one could try to adapt the idea of group knockoffs in Dai & Barber (2016) to our method.

Different parameterizations of the hidden Markov model have been developed within the genotype imputation community and they can be easily exploited by our procedure. For example, if a collection of known haplotypes is available, it is possible to include them in the description of Inline graphic used to generate the knockoff copies. It would be interesting to investigate from an applied perspective the relative advantages of one choice over another.

Since we have computed variable importance measures based on generalized linear models, even though our false discovery rate control does not rely on any assumptions of linearity, the power may be negatively affected if the true likelihood is far from linear. In order to fully exploit the flexibility and robustness of knockoffs, it would be interesting to explore the use of alternative statistics that can better capture interactions and nonlinearities, e.g., trees.

At this point we know how to perform controlled variable selection with knockoffs in the special cases where the variables can be described by either a hidden Markov model or a multivariate normal distribution. It would be interesting to extend this to other classes of covariates, such as more general graphical models.

Acknowledgement

Candès was partially supported by the U.S. Office of Naval Research and by a Math+X Award from the Simons Foundation. Sesia was partially supported by the U.S. National Institutes of Health and the Simons Foundation. We thank Lucas Janson for inspiring discussions and for sharing his computer code.

Supplementary material

Supplementary material available at Biometrika online includes proofs of the theoretical results, a summary of existing knockoff methodology, further methodological details related to the numerical simulation and the data analysis, and a glossary of relevant technical terms from genetics.

References

  1. Alexander, D. H. & Lange, K. (2011). Stability selection for genome-wide association. Genet. Epidemiol. 35, 722–8.
  2. Barber, R. F. & Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist. 43, 2055–85.
  3. Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Statist. Soc. B 57, 289–300.
  4. Boreczky, J. S. & Wilcox, L. D. (1998). A hidden Markov model framework for video segmentation using audio and image features. In Proc. 1998 IEEE Int. Conf. Acoust. Speech Sig. Proces., vol. 6. IEEE.
  5. Browning, S. & Browning, B. (2007). Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–97.
  6. Browning, S. R. & Browning, B. L. (2011). Haplotype phasing: Existing methods and new developments. Nature Rev. Genet. 12, 703–14.
  7. Brzyski, D., Peterson, C. B., Sobczyk, P., Candès, E. J., Bogdan, M. & Sabatti, C. (2017). Controlling the rate of GWAS false discoveries. Genetics 205, 61–75.
  8. Bureau, A., Dupuis, J., Falls, K., Lunetta, K. L., Hayward, B., Keith, T. P. & Van Eerdewegh, P. (2005). Identifying SNPs predictive of phenotype using random forests. Genet. Epidemiol. 28, 171–82.
  9. Candès, E. J., Fan, Y., Janson, L. & Lv, J. (2018). Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. J. R. Statist. Soc. B 80, 551–77.
  10. Candès, E. J. & Plan, Y. (2009). Near-ideal model selection by $\ell_1$ minimization. Ann. Statist. 37, 2145–77.
  11. Carlborg, O. & Haley, C. S. (2004). Epistasis: Too often neglected in complex trait studies? Nature Rev. Genet. 5, 618–25.
  12. Dai, R. & Barber, R. (2016). The knockoff filter for FDR control in group-sparse and multitask regression. In Proc. 33rd Int. Conf. Mach. Learn., Balcan, M. F. & Weinberger, K. Q., eds., vol. 48 of Proceedings of Machine Learning Research. New York: Association for Computing Machinery.
  13. Ernst, J. & Kellis, M. (2012). ChromHMM: Automating chromatin-state discovery and characterization. Nature Meth. 9, 215–6.
  14. Falush, D., Stephens, M. & Pritchard, J. K. (2003). Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics 164, 1567–87.
  15. Franke, A., McGovern, D. P. B., Barrett, J. C., Wang, K., Radford-Smith, G. L., Ahmad, T., Lees, C. W., Balschun, T., Lee, J., Roberts, R. et al. (2010). Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nature Genet. 42, 1118–25.
  16. Global Lipids Genetics Consortium (2013). Discovery and refinement of loci associated with lipid levels. Nature Genet. 45, 1274–83.
  17. Guan, Y. & Stephens, M. (2008). Practical issues in imputation-based association mapping. PLOS Genet. 4, 1–11.
  18. Guan, Y. & Stephens, M. (2011). Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann. Appl. Statist. 5, 1780–815.
  19. Hoggart, C. J., Whittaker, J. C., De Iorio, M. & Balding, D. J. (2008). Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLOS Genet. 4, 1–8.
  20. Hormozdiari, F., Kostem, E., Kang, E. Y., Pasaniuc, B. & Eskin, E. (2014). Identifying causal variants at loci with multiple signals of association. Genetics 198, 497–508.
  21. Hughey, R. & Krogh, A. (1996). Hidden Markov models for sequence analysis: Extension and analysis of the basic method. Bioinformatics 12, 95–107.
  22. Juang, B. H. & Rabiner, L. R. (1991). Hidden Markov models for speech recognition. Technometrics 33, 251–72.
  23. Krogh, A. (1997). Two methods for improving performance of a HMM and their application for gene finding. In Proc. 5th Int. Conf. on Intelligent Systems for Molecular Biology, Gaasterland, T., Karp, P., Karplus, K., Ouzounis, G., Sander, C. & Valencia, A., eds. Menlo Park, California: AAAI Press, pp. 179–86.
  24. Krogh, A., Brown, M., Mian, I., Sjölander, K. & Haussler, D. (1994). Hidden Markov models in computational biology. J. Molec. Biol. 235, 1501–31.
  25. Li, H. & Durbin, R. (2011). Inference of human population history from individual whole-genome sequences. Nature 475, 493–6.
  26. Li, J., Das, K., Fu, G., Li, R. & Wu, R. (2011). The Bayesian lasso for genome-wide association studies. Bioinformatics 27, 516–23.
  27. Li, N. & Stephens, M. (2003). Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165, 2213–33.
  28. Li, Y., Willer, C. J., Ding, J., Scheet, P. & Abecasis, G. R. (2010). MaCH: Using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34, 816–34.
  29. Mailman, M. D., Feolo, M., Jin, Y., Kimura, M., Tryka, K., Bagoutdinov, R., Hao, L., Kiang, A., Paschall, J., Phan, L. et al. (2007). The NCBI dbGaP database of genotypes and phenotypes. Nature Genet. 39, 1181–6.
  30. Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., McCarthy, M. I., Ramos, E. M., Cardon, L. R., Chakravarti, A. et al. (2009). Finding the missing heritability of complex diseases. Nature 461, 747–53.
  31. Marchini, J. & Howie, B. (2010). Genotype imputation for genome-wide association studies. Nature Rev. Genet. 11, 499–511.
  32. Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. (2007). A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genet. 39, 906–13.
  33. Marouli, E., Graff, M., Medina-Gomez, C., Lo, K. S., Wood, A. R., Kjaer, T. R., Fine, R. S., Lu, Y., Schurmann, C., Highland, H. M. et al. (2017). Rare and low-frequency coding variants alter human adult height. Nature 542, 186–90.
  34. Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P. et al. (2001). Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294, 1719–23.
  35. Qin, Z. S., Niu, T. & Liu, J. S. (2002). Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am. J. Hum. Genet. 71, 1242–7.
  36. Sabatti, C. (2013). Multivariate linear models for GWAS. In Advances in Statistical Bioinformatics, Do, K.-A., Qin, Z. & Vannucci, M., eds. Cambridge: Cambridge University Press, pp. 188–207.
  37. Sabatti, C., Hartikainen, A.-L., Pouta, A., Ripatti, S., Brodsky, J., Jones, C. G., Zaitlen, N. A., Varilo, T., Kaakinen, M., Sovio, U. et al. (2009). Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature Genet. 41, 35–46.
  38. Sabatti, C., Service, S. & Freimer, N. (2003). False discovery rate in linkage and association genome screens for complex disorders. Genetics 164, 829–33.
  39. Scheet, P. & Stephens, M. (2006). A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, 629–44.
  40. Stephens, M., Smith, N. J. & Donnelly, P. (2001). A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68, 978–89.
  41. Storey, J. D. & Tibshirani, R. J. (2003). Statistical significance for genomewide studies. Proc. Nat. Acad. Sci. 100, 9440–5.
  42. Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M. et al. (2015). UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779.
  43. Sun, W. & Cai, T. T. (2009). Large-scale multiple testing under dependence. J. R. Statist. Soc. B 71, 393–424.
  44. Tang, H., Coram, M., Wang, P., Zhu, X. & Risch, N. (2006). Reconstructing genetic ancestry blocks in admixed individuals. Am. J. Hum. Genet. 79, 1–12.
  45. van de Geer, S., Bühlmann, P., Ritov, Y. & Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42, 1166–202.
  46. Wager, S. & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. J. Am. Statist. Assoc. DOI: 10.1080/01621459.2017.1319839.
  47. Wall, J. D. & Pritchard, J. K. (2003). Haplotype blocks and linkage disequilibrium in the human genome. Nature Rev. Genet. 4, 587–97.
  48. Wang, K., Li, M., Hadley, D., Liu, R., Glessner, J., Grant, S. F., Hakonarson, H. & Bucan, M. (2007). PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 17, 1665–74.
  49. Wei, Z., Sun, W., Wang, K. & Hakonarson, H. (2009). Multiple testing in genome-wide association studies via hidden Markov models. Bioinformatics 25, 2802–8.
  50. Wood, A. R., Esko, T., Yang, J., Vedantam, S., Pers, T. H., Gustafsson, S., Chu, A. Y., Estrada, K., Luan, J., Kutalik, Z. et al. (2014). Defining the role of common variation in the genomic and biological architecture of adult human height. Nature Genet. 46, 1173–86.
  51. WTCCC (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–78.
  52. Wu, T. T., Chen, Y. F., Hastie, T. J., Sobel, E. & Lange, K. (2009). Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 25, 714–21.
  53. Zhang, K., Deng, M., Chen, T., Waterman, M. S. & Sun, F. (2002). A dynamic programming algorithm for haplotype block partitioning. Proc. Nat. Acad. Sci. 99, 7335–9.
  54. Zhao, P. & Yu, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res. 7, 2541–63.
  55. Zuk, O., Hechter, E., Sunyaev, S. R. & Lander, E. S. (2012). The mystery of missing heritability: Genetic interactions create phantom heritability. Proc. Nat. Acad. Sci. 109, 1193–8.

Articles from Biometrika are provided here courtesy of Oxford University Press
