Proceedings of the National Academy of Sciences of the United States of America. 2021 Sep 3;118(36):e2104683118. doi: 10.1073/pnas.2104683118

DeepLINK: Deep learning inference using knockoffs with applications to genomics

Zifan Zhu a, Yingying Fan b,1, Yinfei Kong c, Jinchi Lv b, Fengzhu Sun a,1
PMCID: PMC8433583  PMID: 34480002

Significance

Although practically attractive with high prediction and classification power, complicated learning methods often lack interpretability and reproducibility, limiting their scientific usage. A useful remedy is to select truly important variables contributing to the response of interest. We develop a method for deep learning inference using knockoffs, DeepLINK, to achieve the goal of variable selection with controlled error rate in deep learning models. We show that DeepLINK can also have high power in variable selection with a broad class of model designs. We then apply DeepLINK to three real datasets and produce statistical inference results with both reproducibility and biological meanings, demonstrating its promising usage to a broad range of scientific applications.

Keywords: false discovery rate, knockoffs, deep learning, microbiome, single-cell

Abstract

We propose a deep learning–based knockoffs inference framework, DeepLINK, that guarantees the false discovery rate (FDR) control in high-dimensional settings. DeepLINK is applicable to a broad class of covariate distributions described by the possibly nonlinear latent factor models. It consists of two major parts: an autoencoder network for the knockoff variable construction and a multilayer perceptron network for feature selection with the FDR control. The empirical performance of DeepLINK is investigated through extensive simulation studies, where it is shown to achieve FDR control in feature selection with both high selection power and high prediction accuracy. We also apply DeepLINK to three real data applications to demonstrate its practical utility.


The era of big data offers enormous new opportunities but also produces unprecedented challenges in solving various data-related problems. The challenges stem not only from the large size of the data but also, and even more so, from their complexity in, for example, text, image, video, and audio data. As a result, complicated models such as deep neural networks have been proposed and widely used to analyze big data. Despite the appealing prediction and classification power of deep neural networks, there is strong pushback from the scientific community because of their “black box” nature. The complicated structure of many deep neural networks makes the interpretation and reproducibility of such models extremely difficult, if possible at all. To alleviate these issues, dimension reduction methods such as variable selection and latent factor models have been used in statistics and related applications.

In the past decade, feature (variable) selection has been a central topic in statistics (1, 2). Feature selection aims at identifying the truly important features that contribute to the effect of some response of interest. One desirable property of feature selection methods is that the error rate of selecting incorrect features can be controlled at some preselected target level while achieving high power. The celebrated procedure of Benjamini and Hochberg (3, 4) for false discovery rate (FDR) control has been shown to enjoy such a property both theoretically and empirically under some conditions on the P values calculated for evaluating the feature importance. Although a vast number of methods have been proposed for feature selection with the goal of controlling the error rate, such as the Benjamini–Yekutieli procedure (5), local FDR (6), q value (7), the adaptive Benjamini–Hochberg procedure (BH for short hereafter) (8), P value weighting (9), FDR regression (10), independent hypothesis weighting (11), adaptive shrinkage (12), adaptive P value thresholding (13), and the structure-adaptive Benjamini–Hochberg algorithm (14), very few can be used in complicated models such as deep neural networks. The intrinsic difficulties are that most existing methods either were proposed under much simpler model settings that are difficult or even impossible to generalize or depend heavily on P values as the feature importance measure. Such P values can be calculated based on some classical or asymptotic theory in simpler models. When we move away from these simple model settings to more complicated ones such as deep neural networks, however, we no longer have the luxury of calculating theoretically justified P values, making feature selection highly challenging. Recently, Candès et al. (15) proposed a new framework of model-X knockoffs for achieving the FDR control in feature selection, bypassing the use of conventional P values. Model-X knockoffs can be used as a wrapper by combining with any feature selection method that produces feature importance measures satisfying certain conditions. We provide a brief review of the model-X knockoffs in a later section. Thanks to the flexibility of model-X knockoffs, the framework was recently extended to the setting of deep neural networks in ref. 16 by proposing a new network architecture, DeepPINK, for the case when the features have a joint Gaussian distribution. The joint Gaussian assumption limits the practical applicability of the method proposed therein. In this paper, we explore more general distributional assumptions for the feature vector and propose a method for deep learning inference using knockoffs, named DeepLINK.

Latent factor models, which use lower-dimensional unobservable factors to model the comovements of features, have been well studied and broadly used in statistics (17–19), sociology (20, 21), bioinformatics (22–24), and economics (25–29). The most commonly used factor model assumes a linear relationship between the feature vector and latent factors. Since in practice, we can never be certain whether the dependency is truly linear, we are likely to face the problem of model misspecification, making the statistical estimation and inference results unreliable. Recently, the advent of deep learning has motivated the nonlinear factor models described by the architecture of the autoencoder.

DeepLINK combines the flexible nonlinear factor modeling power of the autoencoder with the feature selection and prediction power of DeepPINK. The nonlinear factor model for the feature vector described by the autoencoder enables us to generate the knockoff variables effectively without imposing restrictive joint distribution assumptions (e.g., Gaussian) on features. The feature selection and prediction power of DeepPINK allow for interpretable and reproducible statistical inference without sacrificing much power. It is worth mentioning that for the special case when both the factor model and the regression model of response on features are linear, the problem of model-X knockoffs inference was investigated in ref. 30 via proposing a parametric inference framework of Intertwined Probabilistic Factors Decoupling (IPAD). We demonstrate the superior performance of DeepLINK via simulations and three real data examples. Compared with IPAD, DeepLINK is more flexible and more robust to model misspecification and meanwhile, achieves comparable feature selection results with generally higher power.

Deep Learning–Based Knockoffs Inference

Variable Selection with False Discovery Rate (FDR) Control.

Consider the high-dimensional supervised learning with independent and identically distributed (i.i.d.) observations $(x_i, y_i)$, $i = 1, \dots, n$, where $x_i = (x_{i1}, \dots, x_{ip})^T \in \mathbb{R}^p$ is the feature vector and $y_i \in \mathbb{R}$ is the scalar response. The number of features $p$ can be comparable with or even larger than the number of observations $n$. Let $\{1, 2, \dots, p\}$ be the full set of all the features. Assume that the conditional distribution of response $y_i$ depends only on a small subset of features, and we aim to find the Markov blanket (31) (i.e., the smallest subset $S_0$ such that $y_i$ is independent of all remaining features given those in $S_0$). That is,

$$y_i \perp\!\!\!\perp \{x_{ij} : j \in S_0^c\} \mid \{x_{ik} : k \in S_0\}, \tag{1}$$

where $S_0^c$ denotes the complement of subset $S_0$ in the full set $\{1, 2, \dots, p\}$. The existence and uniqueness of the Markov blanket can be guaranteed under mild conditions on the joint distribution of $(x_i, y_i)$. The discussions in ref. 15 have more details. For ease of presentation, we refer to features in $S_0$ as the “true” features and those in $S_0^c$ as the “null” features hereafter.

The goal of our study is to identify true features while controlling the error rate under a predetermined level. Various performance metrics have been proposed to measure the feature selection error rate, such as the familywise error rate, k-familywise error rate (k-FWER) (32), false discovery proportion (FDP) (33), and FDR (3). Here, we adopt the widely used FDR defined as

$$\mathrm{FDR} := E[\mathrm{FDP}] \quad \text{with} \quad \mathrm{FDP} := \frac{|\hat{S} \cap S_0^c|}{\max\{|\hat{S}|, 1\}}, \tag{2}$$

where $\hat{S}$ is the set of selected features using some statistical procedure, $|\cdot|$ means the cardinality of a set, and the expectation is taken with respect to the randomness in $\hat{S}$. A modified version of FDR (mFDR) is defined as

$$\mathrm{mFDR} := E\left[\frac{|\hat{S} \cap S_0^c|}{|\hat{S}| + q^{-1}}\right], \tag{3}$$

where $q \in (0, 1)$ is the target FDR level. It is seen that FDR is more conservative than mFDR since controlling the FDR naturally results in the control of mFDR. We also use another important performance measure, power, to investigate the capability of a statistical procedure in discovering the true features. Formally speaking, power is defined as the expectation of the true discovery proportion (TDP):

$$\mathrm{Power} := E[\mathrm{TDP}] \quad \text{with} \quad \mathrm{TDP} := \frac{|\hat{S} \cap S_0|}{|S_0|}. \tag{4}$$

A desirable inference framework should be able to control the FDR at a prechosen target level and meanwhile, achieve high power.
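As a small illustration of Eqs. 2 to 4, the sketch below computes the FDP, the TDP, and the ratio inside the mFDR expectation for a single selected set. The function names `fdp_tdp` and `mfdr_ratio` are hypothetical helpers introduced here, not part of the DeepLINK software; averaging their outputs over repeated datasets would give empirical FDR, power, and mFDR.

```python
def fdp_tdp(selected, true_features):
    """Return (FDP, TDP) for one selected set, following Eqs. 2 and 4."""
    selected = set(selected)
    true_features = set(true_features)
    false_discoveries = len(selected - true_features)  # |S_hat intersect S0^c|
    true_discoveries = len(selected & true_features)   # |S_hat intersect S0|
    fdp = false_discoveries / max(len(selected), 1)
    tdp = true_discoveries / len(true_features)
    return fdp, tdp


def mfdr_ratio(selected, true_features, q):
    """Ratio inside the expectation of Eq. 3 for one selected set."""
    selected = set(selected)
    false_discoveries = len(selected - set(true_features))
    return false_discoveries / (len(selected) + 1.0 / q)


# Example: 5 selections, 1 of them null, with 10 true features in total.
fdp, tdp = fdp_tdp({0, 1, 2, 3, 42}, set(range(10)))
print(fdp, tdp)  # 0.2 0.4
```

Because the mFDR denominator adds $q^{-1}$ to $|\hat{S}|$, the mFDR ratio is never larger than the FDP for the same selected set, which is the sense in which FDR control is the more conservative requirement.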

Model Settings.

We focus on the setting where the high-dimensional feature vector $x_i$ depends on some low-dimensional latent factor vector $f_i \in \mathbb{R}^r$ with $r \ll p$ in a potentially nonlinear fashion. Specifically, assume the following factor structure for $x_i$:

$$x_i = g(f_i) + \epsilon_i, \quad i = 1, \dots, n, \tag{5}$$

where $g$ is a vector-valued function whose coordinates can take some nonlinear functional forms that are unknown to us, and $\epsilon_i \in \mathbb{R}^p$ is the factor model error vector with i.i.d. components. We make the additional assumption that the marginal distribution of the components of $\epsilon_i$ is from some parametric family $f_\theta$ with unknown parameter $\theta \in \mathbb{R}^m$, where $m$ is some fixed positive integer.

When the coordinates of g are all linear functions, model Eq. 5 becomes the widely used latent factor model in the literature, which we will refer to as the linear factor model to ease the presentation. Most existing works have been developed under the linear factor model assumption, which can be restrictive in some applications. Our proposed method will use a data-adaptive way to estimate the possibly nonlinear function g.

We also assume that the response $y_i$ depends on $x_i$ via the following nonparametric regression model

$$y_i = h(x_i) + \varepsilon_i, \quad i = 1, \dots, n, \tag{6}$$

where $h$ is some unknown function that can be either linear or nonlinear and the $\varepsilon_i$'s are independent model errors. For ease of presentation, we will use matrix and vector notation by denoting $X = (x_1, \dots, x_n)^T$ the $n \times p$ design matrix, $F = (f_1, \dots, f_n)^T$ the $n \times r$ matrix of factors, and $y = (y_1, \dots, y_n)^T$ the $n$-dimensional response vector. Define $C$ as an $n \times p$ matrix whose rows are $g(f_i)^T$, $i = 1, \dots, n$. Then, model Eq. 5 can be rewritten as

$$X = C + E, \tag{7}$$

where $E$ is a matrix with the $i$th row being $\epsilon_i^T$. Our goal can be specifically stated as developing a feature selection method with the FDR controlled at the target level $q$ under the flexible model settings Eqs. 5 and 6.

The Model-X Knockoffs Framework.

We will adopt the recently developed model-X knockoffs framework introduced in ref. 15 to achieve our goal of feature selection. For completeness, we give a brief review of the model-X knockoffs framework below. We refer the readers to ref. 15 for full details.

As discussed in the Introduction, various FDR control methods have been proposed since the seminal work of BH. Most of these existing methods achieve the FDR control under the assumption that valid P values can be calculated. However, having valid P values can become a luxury in the high-dimensional big data settings. Taking the generalized linear models as an example, when the feature dimensionality $p$ diverges with sample size $n$ at a rate of $n^{2/3}$ or faster, the classical asymptotic theory of maximum likelihood estimation (MLE) no longer applies. Consequently, the resulting P values calculated using the formula from the classical asymptotic theory become invalid. Ref. 34 has formal results on such a phenomenon. When more complicated models such as the random forests or deep neural networks are used, how to calculate valid P values for evaluating the feature importance is still an open question. To overcome this difficulty, Barber and Candès (35) introduced the fixed-X knockoffs framework, bypassing the use of P values to achieve the FDR control in the Gaussian linear model when $p$ is smaller than $n/2$. Recently, Candès et al. (15) proposed the model-X knockoffs framework, which achieves theoretically guaranteed FDR control in arbitrary dimensions and for arbitrary dependence structure of response $y$ on features $x$. These advantages motivate us to adapt the model-X knockoffs framework to our model settings.

The salient idea of the model-X knockoffs is to construct the so-called “model-X knockoff variables,” which perfectly mimic the dependence structure of the original variables but are conditionally independent of the response. For completeness, we include the definition of the model-X knockoff variables introduced in ref. 15 as follows.

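Definition: Model-X knockoff variables for a set of random variables $x = (x_1, \dots, x_p)^T$ are a new set of random variables $\tilde{x} = (\tilde{x}_1, \dots, \tilde{x}_p)^T$ that satisfies the following properties.

  1) For any subset $S \subset \{1, \dots, p\}$, $(x^T, \tilde{x}^T)_{\mathrm{swap}(S)} \overset{d}{=} (x^T, \tilde{x}^T)$, where $(x^T, \tilde{x}^T)_{\mathrm{swap}(S)}$ is obtained by swapping the components $x_j$ and $\tilde{x}_j$ in $(x^T, \tilde{x}^T)$ for each $j \in S$ and $\overset{d}{=}$ denotes equality in distribution;

  2) $\tilde{x} \perp\!\!\!\perp y \mid x$.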

The second property above is satisfied as long as $\tilde{x}$ is constructed without using the information of response $y$. To construct knockoff variables that satisfy the first property, we need to know the joint distribution of $x$. When such distribution is available, ref. 15 proposed a generic algorithm, Sequential Conditional Independent Pairs (SCIP), for the knockoff variable construction. When such information is unavailable, there has been some recent work on the practical construction of knockoff variables (for example, refs. 30 and 36–40).

Denote by $\tilde{x}_i$ the vector of knockoff variables for $x_i$, $i = 1, \dots, n$, and let $\tilde{X} = (\tilde{x}_1, \dots, \tilde{x}_n)^T$. For each $j = 1, \dots, p$, let $W_j$ be the knockoff statistic defined for measuring the importance of the $j$th original feature. Specifically, $W_j$ is a function of the augmented data matrix $[X, \tilde{X}]$ and the response vector $y$ [i.e., $W_j = w_j([X, \tilde{X}], y)$, with $w_j$ a function satisfying the “sign-flip” property]:

$$w_j([X, \tilde{X}]_{\mathrm{swap}(S)}, y) = \begin{cases} w_j([X, \tilde{X}], y), & j \notin S, \\ -w_j([X, \tilde{X}], y), & j \in S, \end{cases} \tag{8}$$

where $S$ can be any subset of $\{1, \dots, p\}$. The formal characterizations of the desired knockoff statistics as well as examples are in ref. 15. Intuitively, valid knockoff statistics measure the importance of original features, with large positive values indicating that the original features are important; for the null features in $S_0^c$, the corresponding $W_j$'s are expected to have small magnitudes and be symmetric around zero.
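The sign-flip property in Eq. 8 can be checked numerically on a toy statistic. The sketch below is an illustration only: `swap_columns` and `corr_diff_stat` are hypothetical helpers, the statistic $W_j = |x_j^T y| - |\tilde{x}_j^T y|$ is a simple stand-in rather than the DeepLINK statistic, and the "knockoffs" are placeholder random columns used purely to verify the algebra.

```python
import numpy as np


def swap_columns(X, X_knock, S):
    """Swap column j of X with column j of X_knock for every j in S."""
    X_sw, Xk_sw = X.copy(), X_knock.copy()
    idx = list(S)
    X_sw[:, idx], Xk_sw[:, idx] = X_knock[:, idx], X[:, idx]
    return X_sw, Xk_sw


def corr_diff_stat(X, X_knock, y):
    """Toy statistic W_j = |x_j' y| - |x~_j' y| (illustration only)."""
    return np.abs(X.T @ y) - np.abs(X_knock.T @ y)


rng = np.random.default_rng(0)
n, p = 50, 6
X = rng.standard_normal((n, p))
X_knock = rng.standard_normal((n, p))  # placeholder columns, not valid knockoffs
y = rng.standard_normal(n)

S = [1, 4]
W = corr_diff_stat(X, X_knock, y)
W_swapped = corr_diff_stat(*swap_columns(X, X_knock, S), y)

# Swapping the pairs in S flips the sign of W_j for j in S and leaves the rest.
signs = np.ones(p)
signs[S] = -1.0
print(np.allclose(W_swapped, signs * W))  # True
```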

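Finally, the set of important features is selected as $\hat{S} = \{j : W_j \ge t\}$ with $t = T$ or $t = T_+$, where $T$ is the knockoff threshold and $T_+$ is the knockoff+ threshold as proposed in ref. 15 and included below for completeness:

$$T = \min\left\{t > 0 : \frac{|\{j : W_j \le -t\}|}{\max\{|\{j : W_j \ge t\}|, 1\}} \le q\right\}, \tag{9}$$
$$T_+ = \min\left\{t > 0 : \frac{1 + |\{j : W_j \le -t\}|}{\max\{|\{j : W_j \ge t\}|, 1\}} \le q\right\}. \tag{10}$$

Here, $\hat{S}$ is defined as an empty set if $T = \infty$ or $T_+ = \infty$.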
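The two thresholds can be computed by a direct scan. The sketch below is a minimal illustration under one stated assumption: candidate values of $t$ are restricted to the observed nonzero magnitudes $|W_j|$, which is the usual convention since the ratios in Eqs. 9 and 10 only change at those values. The names `knockoff_threshold` and `knockoff_select` are hypothetical.

```python
import numpy as np


def knockoff_threshold(W, q, plus=True):
    """Smallest candidate t meeting Eq. 9 (plus=False) or Eq. 10 (plus=True)."""
    W = np.asarray(W, dtype=float)
    offset = 1.0 if plus else 0.0
    # np.unique returns the nonzero magnitudes sorted in increasing order.
    for t in np.unique(np.abs(W[W != 0])):
        ratio = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if ratio <= q:
            return float(t)
    return np.inf  # no feasible threshold


def knockoff_select(W, q, plus=True):
    """Indices j with W_j >= t; empty when the threshold is infinite."""
    W = np.asarray(W, dtype=float)
    t = knockoff_threshold(W, q, plus)
    if np.isinf(t):
        return np.array([], dtype=int)
    return np.flatnonzero(W >= t)


W = [5.0, 4.0, 3.5, 3.0, 2.5, 2.0, -1.0, 0.5, -0.2, 0.1]
print(knockoff_threshold(W, q=0.2, plus=True))   # 2.0
print(knockoff_select(W, q=0.2, plus=True))      # [0 1 2 3 4 5]
print(knockoff_threshold(W, q=0.2, plus=False))  # 0.5
```

In this toy input, the extra 1 in the numerator of Eq. 10 makes the knockoff+ rule stricter: it stops at $t = 2$, whereas the plain knockoff rule already accepts $t = 0.5$ and selects one additional feature.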

It has been formally shown in ref. 15 that the knockoff threshold controls the mFDR exactly and the knockoff+ threshold controls the FDR exactly at the finite-sample level, regardless of the sample size n, feature dimensionality p, and dependence structure of response y on features x.

DeepLINK.

We next introduce our framework of DeepLINK, a deep learning–based statistical inference framework using knockoffs. It consists of two parts: 1) an autoencoder network for the knockoff variable construction and 2) a multilayer perceptron (MLP) network for feature selection with the FDR control.

As reviewed in the last section, there are two key ingredients in the successful implementation of the model-X knockoffs framework: 1) the construction of knockoff variables and 2) the construction of knockoff statistics. Since the joint distribution of $x_i$ is unknown to us, the generic algorithm proposed in ref. 15 is no longer applicable to our settings. A remedy is to exploit the nonlinear factor model structure in Eq. 5 to construct approximate knockoff variables using the estimated distribution.

In view of Eq. 7, ideally if the realization $C$ and the marginal distribution $f_\theta$ of $\epsilon_i$ are both known a priori, then the knockoff variables can be constructed as

$$\tilde{X} = C + \tilde{E}, \tag{11}$$

with the entries of $\tilde{E}$ independently drawn from distribution $f_\theta$. It can be easily checked that such $\tilde{X}$ satisfies the two properties in Definition. Since $C$ and $f_\theta$ are generally unknown in practice, we next discuss methods to estimate them. We will also discuss the construction of knockoff statistics.

Part 1: Autoencoder for knockoffs construction.

Principal component analysis (PCA) has been a predominant method for extracting latent factors in the existing literature (41, 42). However, a key assumption for PCA to work well is that $x_i$ depends on $f_i$ linearly. To address the challenge caused by the potentially nonlinear factor model as specified in Eq. 5, we propose to use the deep learning model of autoencoder.

Given the design matrix $X$, we train an autoencoder with $X$ as the input as well as the target output. An illustrative plot of the autoencoder network is shown in Fig. 1. Denote by $\hat{C}$ the corresponding autoencoder output matrix. We propose to construct the knockoffs data matrix as

$$\tilde{X} = \hat{C} + \tilde{E}, \tag{12}$$

where $\tilde{E}$ is a matrix with entries independently sampled from the estimated marginal distribution $f_{\hat{\theta}}$ of $\epsilon_i$. For the specific case when $f_\theta$ is the Gaussian density of $N(0, \sigma^2)$, we have $\theta = \sigma^2$, which can be estimated as $\hat{\sigma}^2 = (np)^{-1} \sum_{1 \le i \le n,\, 1 \le j \le p} \hat{e}_{ij}^2$ with $\hat{e}_{ij}$'s the entries of the residual matrix $\hat{E} = X - \hat{C}$. This corresponds to the maximum likelihood estimate with the pseudo-observations $\hat{E}$. In general, parameter $\theta$ can be estimated by the maximum likelihood approach or the method of moments based on $\hat{E}$.
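The Gaussian case of Eq. 12 can be sketched in a few lines once a reconstruction $\hat{C}$ is available. In the sketch below, `gaussian_knockoffs` is a hypothetical helper that accepts any reconstruction matrix. To keep the example self-contained without a deep learning library, a rank-$r$ SVD reconstruction (`rank_r_reconstruction`) is used as a stand-in for the trained autoencoder output; this is a linear substitute chosen for illustration, not the autoencoder described above.

```python
import numpy as np


def gaussian_knockoffs(X, C_hat, rng):
    """Eq. 12 with Gaussian errors: X_knock = C_hat + E_tilde."""
    n, p = X.shape
    resid = X - C_hat                          # E_hat = X - C_hat
    sigma2_hat = np.sum(resid ** 2) / (n * p)  # (np)^{-1} * sum of e_ij^2
    E_tilde = rng.normal(0.0, np.sqrt(sigma2_hat), size=(n, p))
    return C_hat + E_tilde, sigma2_hat


def rank_r_reconstruction(X, r):
    """Assumed stand-in for the autoencoder output: rank-r SVD fit."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean + (U[:, :r] * s[:r]) @ Vt[:r]


rng = np.random.default_rng(1)
n, p, r = 500, 40, 3
F = rng.standard_normal((n, r))
Lam = rng.standard_normal((r, p))
X = F @ Lam + rng.standard_normal((n, p))  # linear factor model, sigma^2 = 1

C_hat = rank_r_reconstruction(X, r)
X_knock, sigma2_hat = gaussian_knockoffs(X, C_hat, rng)
print(X_knock.shape, round(sigma2_hat, 3))
```

Note that the response never enters this construction, which is what makes the second property in Definition hold. Swapping in a trained autoencoder would only change how `C_hat` is produced.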

Fig. 1. The autoencoder architecture. p-dim and r-dim indicate p dimensions and r dimensions, respectively.

Part 2: MLP for feature selection.

To construct the knockoff statistics, we need to first construct the feature importance measure. Since an important goal of our framework is to accommodate the flexible nonlinear relationship between response $y$ and features $x$, we propose to use the MLP for such modeling purpose. The input of the MLP is the augmented data matrix $[X, \tilde{X}]$. Instead of directly feeding the augmented feature vector into the MLP, we exploit the idea of DeepPINK developed in ref. 16 and construct a pairwise-connected filter layer with each filter representing a linear combination of one original feature and its knockoff counterpart. The filter layer is then fed to the canonical MLP. The illustrative architecture of DeepPINK is shown in Fig. 2.

Fig. 2. The DeepPINK architecture. 2p-dim, p-dim, and 1-dim indicate 2p dimensions, p dimensions, and 1 dimension, respectively.

To simplify the notation, we use DeepPINK with one hidden layer after the filter layer to discuss the construction of the knockoff statistics. Let $z = (z_1, \dots, z_p)^T$ and $\tilde{z} = (\tilde{z}_1, \dots, \tilde{z}_p)^T$ be the filter weights (i.e., each filter $F_j = z_j x_j + \tilde{z}_j \tilde{x}_j$) and $W^{(1)} \in \mathbb{R}^{p \times m}$, $W^{(2)} \in \mathbb{R}^{m \times 1}$ be the two weight matrices connecting the filter layer with the output layer, where $m$ is the number of neurons before the output layer. The knockoff statistics are defined as

$$W_j = Z_j^2 - \tilde{Z}_j^2, \quad j = 1, \dots, p, \tag{13}$$

where $Z_j = z_j w_j$, $\tilde{Z}_j = \tilde{z}_j w_j$, and $w = (w_1, \dots, w_p)^T = W^{(1)} W^{(2)}$. This can be easily generalized to cases with more than one hidden layer. Since the weights of neurons are natural measures of their importance, intuitively the $W_j$'s defined in Eq. 13 are valid knockoff statistics. Ref. 16 has more detailed discussions on the intuition of $W_j$ in Eq. 13. Important features can then be selected using the knockoffs inference procedure reviewed previously.
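Given trained weights, Eq. 13 is a few array operations. The sketch below assumes the filter weights and the two weight matrices have already been extracted from a fitted network as NumPy arrays; the function name `deeppink_statistics` and the toy weight values are made up for illustration.

```python
import numpy as np


def deeppink_statistics(z, z_tilde, W1, W2):
    """Knockoff statistics W_j = Z_j^2 - Z~_j^2 from Eq. 13.

    z, z_tilde : length-p filter weights for originals and knockoffs
    W1         : p x m weight matrix after the filter layer
    W2         : m x 1 weight matrix into the output neuron
    """
    w = (np.asarray(W1) @ np.asarray(W2)).ravel()  # w = W1 W2, length p
    Z = np.asarray(z) * w
    Z_tilde = np.asarray(z_tilde) * w
    return Z ** 2 - Z_tilde ** 2


# Toy weights with p = 3 features and m = 2 hidden neurons.
z = np.array([1.0, 0.5, 0.1])
z_tilde = np.array([0.1, 0.5, 1.0])
W1 = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
W2 = np.array([[1.0], [1.0]])

print(deeppink_statistics(z, z_tilde, W1, W2))  # [ 0.99  0.   -3.96]
```

Exchanging `z[j]` and `z_tilde[j]` negates the $j$th statistic and leaves the others unchanged, mirroring the sign-flip property in Eq. 8 at the level of the filter weights.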

Simulation Studies

We first evaluate the performance of DeepLINK on the simulated datasets. We consider various simulation settings when 1) the factor model is linear or nonlinear, 2) the link function between the response and the features is linear or nonlinear, and 3) the feature dimensionality is low or high. The computational cost of DeepLINK is presented in SI Appendix, section 1.

Simulation Designs.

We explore three different factor models: the linear factor model (Eq. 14), the additive quadratic factor model (Eq. 15), and the logistic factor model (Eq. 16), where for $i = 1, \dots, n$,

$$x_i = \Lambda f_i + \epsilon_i, \tag{14}$$
$$x_i = \Lambda [f_i^T, (f_i^2)^T, f_{i1} f_{i2}, f_{i1} f_{i3}, f_{i2} f_{i3}]^T + \epsilon_i, \tag{15}$$
$$x_{ij} = \frac{c_j}{1 + \exp([1, f_i^T] \lambda_j)} + \epsilon_{ij}, \quad j = 1, \dots, p. \tag{16}$$

Here, $f_i = (f_{i1}, f_{i2}, f_{i3})^T$ is the vector of latent factors, $f_i^2$ is the shorthand notation for $(f_{i1}^2, f_{i2}^2, f_{i3}^2)^T$, $\Lambda$ and $\lambda_j$ are the factor loading parameters of appropriate dimensions, and the $c_j$'s are some constants. The $f_{ij}$, $c_j$, $\lambda_{ij}$, and entries of $\Lambda$ and $\epsilon_i$ are all sampled independently from the standard normal distribution $N(0, 1)$.
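The three factor designs can be simulated directly from Eqs. 14 to 16. The sketch below is one possible reading of those equations, with all random quantities drawn from $N(0, 1)$ as stated; the function name `simulate_features` and the string labels for the models are hypothetical.

```python
import numpy as np


def simulate_features(n, p, model, rng):
    """Draw an n x p design matrix from one of the factor models (Eqs. 14-16)."""
    f = rng.standard_normal((n, 3))    # latent factors f_i, with r = 3
    eps = rng.standard_normal((n, p))  # factor model errors
    if model == "linear":              # Eq. 14
        Lam = rng.standard_normal((p, 3))
        return f @ Lam.T + eps
    if model == "quadratic":           # Eq. 15
        cross = np.column_stack(
            [f[:, 0] * f[:, 1], f[:, 0] * f[:, 2], f[:, 1] * f[:, 2]]
        )
        basis = np.hstack([f, f ** 2, cross])  # n x 9 basis of factor terms
        Lam = rng.standard_normal((p, 9))
        return basis @ Lam.T + eps
    if model == "logistic":            # Eq. 16
        c = rng.standard_normal(p)
        lam = rng.standard_normal((4, p))  # column j holds lambda_j
        lin = np.hstack([np.ones((n, 1)), f]) @ lam  # [1, f_i^T] lambda_j
        return c / (1.0 + np.exp(lin)) + eps
    raise ValueError(f"unknown factor model: {model}")


rng = np.random.default_rng(2)
for name in ("linear", "quadratic", "logistic"):
    X = simulate_features(n=1000, p=50, model=name, rng=rng)
    print(name, X.shape)
```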

The response vector $y = (y_1, \dots, y_n)^T$ is simulated from model Eq. 6. We investigate two different forms of the link function $h$, namely the linear design (Eq. 17) and the nonlinear design (Eq. 18):

$$h(x) = x^T \beta, \tag{17}$$
$$h(x) = \sin(x^T \beta) \exp(x^T \beta). \tag{18}$$

To simulate the coefficient vector $\beta = (\beta_1, \dots, \beta_p)^T$, we first randomly choose $s$ true signal locations and then set the $\beta_j$ at each location to be $A$ or $-A$ with equal probability, where $A$ is some positive value that varies in our simulation studies. The remaining $p - s$ components of $\beta$ are set to zero. It is seen that when the link function $h$ is linear, $A$ measures the signal strength with a larger value corresponding to a stronger signal. When $h$ is nonlinear, however, the signal strength may no longer be a monotone increasing function of $A$. The discussions in SI Appendix, section 2 have an example illustrating this. In fact, to the best of our knowledge, there is no widely adopted measure for the signal strength in the nonlinear model settings. Model errors $\varepsilon_i$'s are also sampled independently from the standard normal distribution $N(0, 1)$.
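The coefficient vector and response described above can be generated as follows. This is a minimal sketch: `simulate_beta` and `simulate_response` are hypothetical helper names, and the design matrix in the usage lines is a plain Gaussian placeholder rather than one of the factor model designs.

```python
import numpy as np


def simulate_beta(p, s, A, rng):
    """s random signal locations with values +A or -A; the other p - s are 0."""
    beta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    beta[support] = A * rng.choice([-1.0, 1.0], size=s)
    return beta, np.sort(support)


def simulate_response(X, beta, link, rng):
    """y = h(x) + N(0, 1) noise with h from Eq. 17 or Eq. 18."""
    eta = X @ beta
    if link == "linear":
        h = eta                          # Eq. 17
    elif link == "nonlinear":
        h = np.sin(eta) * np.exp(eta)    # Eq. 18
    else:
        raise ValueError(f"unknown link: {link}")
    return h + rng.standard_normal(X.shape[0])


rng = np.random.default_rng(4)
X = rng.standard_normal((1000, 50))  # placeholder design for illustration
beta, support = simulate_beta(p=50, s=10, A=1.0, rng=rng)
y_lin = simulate_response(X, beta, "linear", rng)
y_non = simulate_response(X, beta, "nonlinear", rng)
print(support, y_lin.shape, y_non.shape)
```

Because Eq. 18 multiplies an oscillating term by an exponential, the nonlinear responses can span several orders of magnitude, which is one reason the amplitude $A$ stops being a clean proxy for signal strength.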

Parameter Settings.

For all the simulation studies, the target FDR $q$ is set to 0.2, and the sample size $n$ is set to 1,000. For the linear link function setting, we explore two different feature dimensionalities, $p = 500$ and $p = 1{,}500$, with true signal size $s$ set to 10 and 30, respectively. For the nonlinear link function setting, $p$ is set to 50 and 500, and $s$ is fixed at 10. We vary the value of $A$ to investigate its impact on the performance of DeepLINK.

Neural Network Settings.

We next provide the details on the neural network architectures. We train the autoencoder network using the Adam algorithm with the mean squared error (MSE) as the loss function. For the linear factor model, the number of neurons in the autoencoder’s bottleneck layer is estimated by the $PC_{p1}$ algorithm proposed in ref. 43. It is worth noting that $PC_{p1}$ is designed for linear factor models. For the nonlinear factor models, we set the number of bottleneck neurons to the true number of factors $r = 3$. We conduct a robustness study of DeepLINK to the misspecification of $r$ in SI Appendix, section 3. We remark that $r$ can be tuned by cross-validation in real applications. For DeepPINK used in the feature selection step, we use MSE as the loss function coupled with $L_1$ regularization when the link function $h$ is linear. When the link function is nonlinear, we change the loss function to the mean squared logarithmic error (MSLE) because MSE may cause exploding gradients for large response values. In fact, MSLE also works well with other nonlinear link functions (SI Appendix, section 4). As general guidance, we suggest using MSE first and switching to MSLE when the gradients become too large during the model training. For both linear and nonlinear link functions, we use the Adam optimizer to train the network. For both the autoencoder and DeepPINK networks, we recommend using the exponential linear unit (ELU) as the activation function based on our experience from empirical studies. Our numerical study also suggests that the learning rate of Adam and the coefficient of $L_1$ regularization need to be tuned for the best performance of our method. The neural network settings are summarized in Table 1.
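To see why a logarithmic loss is gentler on large responses, the sketch below compares MSE and MSLE on a toy vector. The exact MSLE variant used for DeepLINK is not specified in the text; the version here uses $\log(1 + \cdot)$ and clips negative values to zero before taking logs, which is an assumption made so that the logarithm is always defined. The function names `mse` and `msle` are illustrative.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)


def msle(y_true, y_pred):
    """Mean squared logarithmic error on log(1 + value).

    Assumption: negative values are clipped to 0 so that log1p is defined.
    """
    log_true = np.log1p(np.maximum(y_true, 0.0))
    log_pred = np.log1p(np.maximum(y_pred, 0.0))
    return np.mean((log_true - log_pred) ** 2)


# Predictions that are off by a factor of two at three different scales.
y_true = np.array([1.0, 10.0, 1000.0])
y_pred = np.array([2.0, 20.0, 2000.0])

print(mse(y_true, y_pred))   # dominated by the largest response
print(msle(y_true, y_pred))  # similar contribution from every scale
```

On this input the MSE is driven almost entirely by the largest observation, while the MSLE treats the three relative errors comparably, so its gradients stay moderate when responses are large.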

Table 1. Neural network parameter settings

Network      Link function h  Activation  Loss  Optimizer  Regularization
Autoencoder  Linear           ELU         MSE   Adam       None
Autoencoder  Nonlinear        ELU         MSE   Adam       None
MLP          Linear           ELU         MSE   Adam       L1 regularization
MLP          Nonlinear        ELU         MSLE  Adam       L1 regularization

Simulation Results.

We investigate the performance of DeepLINK in the simulation study with different combinations of factor models, link functions, and dimensionalities. For each setting, we apply DeepLINK to 100 independently simulated datasets and calculate the average FDP and TDP as the empirical FDR and power, respectively. The knockoff+ threshold is used in our numerical studies because it controls the exact FDR.

Simulation results with the linear factor model.

We compare DeepLINK with IPAD reviewed in the Introduction using the simulated data in the linear factor model setting as specified in Eq. 14. Both methods successfully control the FDR under the target level 0.2. In terms of power, IPAD slightly outperforms DeepLINK in settings with the linear link function (Fig. 3 A and B). This is reasonable because IPAD was proposed under the assumption of the linear factor model and linear link function and makes full use of these parametric model structures, while DeepLINK makes no use of these model structures at all. For the nonlinear link function (Fig. 3 C and D), however, the power of IPAD drops significantly, while DeepLINK still maintains decently high power. It is also interesting to observe that the power of DeepLINK first increases sharply to the peak and then decreases slightly as A increases, which can be explained by the fact that A no longer serves as a good measure of the signal strength. These results demonstrate the versatility of DeepLINK. The capability of DeepLINK to tackle complicated nonlinear link functions makes it more useful in real applications since it is more robust to possible model misspecification.

Fig. 3. Comparisons between DeepLINK and IPAD in simulation settings with the linear factor model. A–D represent different combinations of the link function h, the number of features p, and the number of true signals s. FDR+ and Power+ denote the empirical FDR and power obtained using the knockoff+ threshold. The black dashed lines indicate the target FDR level. Each plot shows the change of FDR+ (solid lines) and Power+ (long dashed lines) for DeepLINK (blue) and IPAD (orange) against varying signal amplitude A.

Another interesting observation is that our simulation produces highly correlated features. To study DeepLINK’s ability to disentangle important features from their highly correlated noise features, we conducted additional analyses in SI Appendix, section 5.

Simulation results with the nonlinear factor model.

We now consider nonlinear factor models Eqs. 15 and 16. We will drop IPAD from the comparison because IPAD was proposed under the assumption of linear factor model and is not expected to perform well when the model is severely misspecified.* For the additive quadratic factor model in Eq. 15, FDR is perfectly controlled in all settings. Meanwhile, high power is achieved with reasonably large A in the two settings with linear link function (Fig. 4 A and B). However, the two settings with nonlinear link functions are very challenging, and the power of DeepLINK is significantly lower (Fig. 4 C and D). For the logistic factor model Eq. 16, DeepLINK controls the FDR and can achieve power close to one with a wide range of values for A in each setting (Fig. 5). The success of FDR control by DeepLINK in nonlinear factor model settings provides evidence that the autoencoder network can well capture the nonlinear factor structure and thus, generates valid knockoffs data matrices. Similar to the linear factor model setting, we again observe an inverted U-shaped curve of the power when the link function is nonlinear, which can be explained by the same reason as before.

Fig. 4. Simulation results of DeepLINK in settings with the additive quadratic factor model. A–D represent different combinations of the link function h, the number of features p, and the number of true signals s. FDR+ and Power+ denote the empirical FDR and power obtained using the knockoff+ threshold. The black dashed lines indicate the target FDR level. Each plot shows the change of FDR+ (solid line) and Power+ (long dashed line) against varying signal amplitude A.

Fig. 5. Simulation results of DeepLINK in settings with the logistic factor model. A–D represent different combinations of the link function h, the number of features p, and the number of true signals s. FDR+ and Power+ denote the empirical FDR and power obtained using the knockoff+ threshold. The black dashed lines indicate the target FDR level. Each plot shows the change of FDR+ (solid line) and Power+ (long dashed line) against varying signal amplitude A.

Robustness of DeepLINK to different activation functions.

We explore the effects of different activation functions used in the autoencoder and DeepPINK networks on the performance of DeepLINK (Figs. 6–8). In general, DeepLINK is robust to different combinations of activation functions in terms of both FDR control and power. The only exception is using the rectified linear unit (ReLU) activation in the autoencoder network. In the linear factor model setting with linear link function h and low feature dimensionality, the autoencoder with ReLU activation fails to control the FDR when the signal amplitude is small (Fig. 6A). We also observe that the autoencoder with ReLU activation has slightly inflated FDR in some other settings (Figs. 7 A and D and 8A). In the linear and additive quadratic factor model settings with linear link function h and large feature dimensionality (Figs. 6B and 7B), ReLU yields lower power than other activation functions when used in the autoencoder network. We thus recommend against using the ReLU activation in the autoencoder network for the DeepLINK applications.

Fig. 6. Linear factor model simulation results of DeepLINK using different activation functions. A–D represent different combinations of the link function h, the number of features p, and the number of true signals s. FDR+ and Power+ denote the empirical FDR and power obtained using the knockoff+ threshold. The black dashed lines indicate the target FDR level. Each plot shows the change of FDR+ (solid lines) and Power+ (long dashed lines) against varying signal amplitude A for different activation functions (ELU, ReLU, and Tanh) used in autoencoder and DeepPINK (e.g., ReLU-ELU represents using ReLU in autoencoder and ELU in DeepPINK).

Fig. 8. Logistic factor model simulation results of DeepLINK using different activation functions. A–D represent different combinations of the link function h, the number of features p, and the number of true signals s. FDR+ and Power+ denote the empirical FDR and power obtained using the knockoff+ threshold. The black dashed lines indicate the target FDR level. Each plot shows the change of FDR+ (solid lines) and Power+ (long dashed lines) against varying signal amplitude A for different activation functions (ELU, ReLU, and Tanh) used in autoencoder and DeepPINK (e.g., ReLU-ELU represents using ReLU in autoencoder and ELU in DeepPINK).

Fig. 7.

Fig. 7.

Additive quadratic factor model simulation results of DeepLINK using different activation functions. AD represent different combinations of the link function h, the number of features p, and the number of true signals s. FDR+ and Power+ denote the empirical FDR and power obtained using the knockoff+ threshold. The black dashed lines indicate the target FDR level. Each plot shows the change of FDR+ (solid lines) and Power+ (long dashed lines) against varying signal amplitude A for different activation functions (ELU, ReLU, and Tanh) used in autoencoder and DeepPINK (e.g., ReLU-ELU represents using ReLU in autoencoder and ELU in DeepPINK).
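
The FDR+ and Power+ quantities in the figure captions depend on the knockoff+ threshold of Barber and Candès (35). The following is a minimal sketch of that thresholding rule and of the per-run false discovery proportion and power, assuming a list `w` of knockoff statistics W_j (large positive values favor the original feature) and a target level `q`. The function names are illustrative and are not taken from the DeepLINK code; the empirical FDR and power would be averages of these per-run values over repeated simulations.

```python
def knockoff_plus_threshold(w, q):
    """Smallest t in {|W_j| : W_j != 0} such that
    (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q.
    Returns infinity when no candidate satisfies the bound."""
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        n_neg = sum(1 for x in w if x <= -t)
        n_pos = sum(1 for x in w if x >= t)
        if (1 + n_neg) / max(1, n_pos) <= q:
            return t
    return float("inf")


def select_features(w, q):
    """Indices of features whose statistic reaches the knockoff+ threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


def fdp_and_power(selected, true_signals):
    """False discovery proportion and power of one selection."""
    sel, truth = set(selected), set(true_signals)
    fdp = len(sel - truth) / max(1, len(sel))
    power = len(sel & truth) / max(1, len(truth))
    return fdp, power


# Toy statistics (hypothetical values): features 0-4 are true signals.
w = [5.0, 4.0, 3.0, 2.5, 2.0, -0.5, 0.3, -0.2, 0.1, 1.5]
selected = select_features(w, q=0.2)
fdp, power = fdp_and_power(selected, true_signals=[0, 1, 2, 3, 4])
```

The extra 1 in the numerator is what distinguishes knockoff+ from the plain knockoff threshold and makes the rule slightly more conservative.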

Real Data Applications

We further apply DeepLINK to three real datasets. All predictors in the three datasets below were standardized to unit variance before the analysis. In all real data applications, the error distribution fθ was fitted assuming a Gaussian distribution. The robustness of DeepLINK with respect to a misspecified error distribution is investigated in SI Appendix, section 6. We also compare the performance of DeepLINK with that of random forests (44, 45) in SI Appendix, section 7.

Application to a Microbiome Dataset.

The microbiome dataset is publicly available from a colorectal cancer (CRC)–related metagenomic study by Zeller et al. (46). The dataset contains whole genome–sequenced (WGS) DNA from stool samples of 184 individuals (91 CRC patients and 93 healthy controls). We aligned the DNA sequences against the National Center for Biotechnology Information (NCBI) microbial reference genome database and constructed an abundance matrix according to the alignment results. The matrix consisted of 184 rows and 434 columns, with each entry representing the abundance of a microbial species in the corresponding sample. We randomly split the dataset into a training set (80%) and a test set (20%) and fitted DeepLINK on the training set. The trained model was then applied to the test data, and the classification error rate was calculated. The random splitting procedure was repeated 100 times. However, the mean misclassification error on the test data was consistently around 0.5 under various parameter settings of DeepLINK, suggesting that a direct application of DeepLINK could fail. We also tried other popular classification methods, such as the Lasso and a simple deep neural network without the special architecture of DeepLINK, all of which gave error rates between 0.4 and 0.5, similar to random guessing.
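
The repeated random-split evaluation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `fit` and `predict` are hypothetical callables standing in for any classifier, the majority-class rule is only a placeholder baseline, and the SE is assumed to be the sample standard deviation of the per-split errors divided by the square root of the number of repetitions.

```python
import random
import statistics


def split_indices(n, train_frac, rng):
    """Shuffle 0..n-1 and cut into training and test index lists."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = int(round(train_frac * n))
    return idx[:n_train], idx[n_train:]


def repeated_split_error(X, y, fit, predict, n_repeats=100, train_frac=0.8, seed=0):
    """Mean and SE of the test misclassification error over random splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        train, test = split_indices(len(y), train_frac, rng)
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in test)
        errors.append(wrong / len(test))
    mean = statistics.mean(errors)
    se = statistics.stdev(errors) / len(errors) ** 0.5
    return mean, se


# Placeholder baseline (not DeepLINK): always predict the majority training class.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    return model


# Same class balance as the Zeller et al. data: 91 cases and 93 controls.
y = [1] * 91 + [0] * 93
X = [[0.0] for _ in y]  # features are ignored by the placeholder baseline
mean_err, se_err = repeated_split_error(X, y, fit_majority, predict_majority)
```

With nearly balanced classes, an uninformative rule of this kind lands close to 0.5 test error, which is the reference point for the "random guessing" comparison in the text.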

Consequently, we first performed a variable screening step and then applied DeepLINK to the screened dataset. Given the relatively small sample size, and to reduce the chance of including noise confounders, we identified an independent microbiome dataset for screening. This independent dataset was also publicly available and was collected for CRC–microbiome association analysis (47). It contained 128 WGS DNA samples from 74 CRC patients and 54 controls. Since the two microbiome datasets had different numbers of features, we restricted our analysis to the 274 features common to both. There are multiple options for the screening step (2, 48, 49). We adopted one of the state-of-the-art methods, which is based on the distance correlation, and ranked these 274 variables by the values of the asymptotic test statistics (50, 51). We randomly split the Zeller et al. (46) CRC microbiome data into training and test sets at a ratio of 80% to 20%. The justification for this split ratio is given in SI Appendix, section 8. We then trained the DeepLINK model on the training data using the top-ranked variables. To evaluate the impact of the number of variables retained after the screening step, denoted as d, we examined multiple values of d. Finally, we applied the trained DeepLINK model to the training and test datasets and calculated the corresponding classification error rates. The whole process was repeated 100 times. We set the number of neurons in the bottleneck layer of the autoencoder to three and chose the other model parameters by cross-validation. The MLP in DeepPINK had only one hidden layer with d neurons. The dropout rate was 0.4, and the L1- and L2-regularization weights were both 0.001. The mean and SE of the misclassification error on the training and test data are given in Table 2. The mean test error was lowest when 30 variables were retained after the screening step. However, when d increased to 100, the mean test error became relatively high (0.385), indicating that the performance of DeepLINK could be compromised once the screening step no longer helped eliminate noise variables.
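
To make the screening step concrete, the sketch below computes the sample distance correlation of Székely et al. (50) between each feature and the response and keeps the d top-ranked features. It is a simplified stand-in: the paper ranks by the asymptotic test statistics of ref. 51, whereas this sketch ranks by the plain sample distance correlation, and the function names are illustrative. The quadratic-time implementation is meant only for small examples.

```python
import math


def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(v)
    d = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in d]  # equals the column means by symmetry
    grand_mean = sum(row_mean) / n
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation between two univariate samples."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0:
        return 0.0  # a constant sample carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / denom)


def screen_top_d(columns, y, d):
    """Indices of the d feature columns with the largest distance correlation."""
    scores = [distance_correlation(col, y) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return order[:d]


# Toy data (hypothetical): the second column tracks the binary response best.
y = [0, 0, 0, 1, 1, 1]
columns = [[1, 0, 1, 0, 1, 0], [0, 1, 2, 3, 4, 5], [2, 2, 2, 2, 2, 2]]
kept = screen_top_d(columns, y, d=1)
```

In the microbiome analysis the scores would be computed on the independent screening dataset, so that the data used to rank features are not reused for training or testing.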

Table 2.

Mean and SE (in parentheses) of the misclassification error rates for the microbiome dataset

Training Test
d=20 0.172 (0.003) 0.319 (0.008)
d=30 0.104 (0.004) 0.306 (0.007)
d=40 0.019 (0.003) 0.328 (0.009)
d=50 0.008 (0.002) 0.319 (0.008)
d=100 0.000 (0.000) 0.385 (0.012)

The top 20 most selected microbial species along with their selection frequencies by DeepLINK coupled with screening are presented in Table 3. We only present the results for d=30, at which the mean test misclassification error was the lowest. Many of these selected species were reported to have important associations with CRC in the previous literature. For example, Parvimonas micra and Akkermansia muciniphila were among the four-bacteria biomarker panel of CRC identified by Osman et al. (52). In addition, P. micra’s enrichment in CRC was demonstrated in a number of previous studies (47, 53–55), and Purcell et al. (56) also reported its enrichment in one of the CRC subtypes. Other important CRC-related species that were also reported in previous studies include Dialister pneumosintes (57) and Bacteroides fragilis (58–61).
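
Selection frequencies of this kind can be tallied by counting, for each feature, the number of repetitions in which it was selected. A minimal sketch, assuming the selected set of each run is available as a list of feature names; the function names and the alphabetical tie-breaking rule are illustrative choices, not taken from the DeepLINK code:

```python
from collections import Counter


def selection_frequencies(selected_per_run):
    """Number of runs in which each feature was selected."""
    counts = Counter()
    for selected in selected_per_run:
        counts.update(set(selected))  # count a feature at most once per run
    return counts


def top_k(counts, k=20):
    """The k most frequently selected features, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Three hypothetical runs with placeholder feature names.
runs = [["a", "b"], ["a"], ["a", "c", "c"]]
counts = selection_frequencies(runs)
ranking = top_k(counts, k=2)
```

With 100 repetitions, the counts are directly comparable to the frequencies reported in the tables.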

Table 3.

Top 20 most selected microbial species when d=30 for the microbiome dataset

Species Frequency
D. pneumosintes 99
Eikenella corrodens 96
Staphylococcus haemolyticus 75
Intestinimonas butyriciproducens 75
B. fragilis 70
Latilactobacillus sakei 58
Clostridium bornimense 58
P. micra 51
A. muciniphila 49
Gemella sp. oral taxon 928 48
Clostridium chauvoei 45
Corynebacterium sp. NML98-0116 43
Prevotella intermedia 34
Streptococcus sp. A12 31
Ndongobacter massiliensis 30
Ruminococcus bicirculans 29
Lactococcus garvieae 28
Fusobacterium varium 26
Anaerococcus mediterraneensis 26
Desulfovibrio fairfieldensis 23

Application to a Murine Single-Cell RNA-Sequencing Dataset.

The murine single-cell RNA-sequencing (scRNA-seq) dataset is publicly available from Lane et al. (62), a study investigating the effect of lipopolysaccharide (LPS)-stimulated nuclear factor-κB (NF-κB) on gene expression. We first preprocessed the data following the suggestions in ref. 63. We filtered out cells with a mapping rate below 20% or with a nonzero expression proportion below 5%. We also filtered out genes expressed in fewer than 5% of total cells. The preprocessed data matrix contained the expression, in the form of transcripts per million (TPM), of 13,777 genes from 570 cells. We were interested in differential gene expression between cells in two conditions: unstimulated (202 cells) and stimulated with LPS after 150 min (368 cells). Due to the high dimensionality, it was computationally infeasible to apply DeepLINK to this dataset directly, even with powerful servers. The success of screening in the previous microbiome example motivated us to apply a screening step first to reduce the dimensionality. Since this dataset had a larger sample size than the microbiome dataset, we randomly split it into three parts for screening (50%), training (40%), and test (10%), instead of using an independent dataset for screening. We used the same model architecture and parameters tuned in the previous microbiome analysis. The mean and SE of the misclassification error over 100 repetitions on the training and test sets for this scRNA-seq dataset are provided in Table 4. We observe that the mean misclassification error on the test data can get as low as 0.010 when d=200.
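
The preprocessing filters and the three-way split can be sketched as below. Several details are assumptions made for illustration: the expression matrix is a list of per-cell rows of TPM values, per-cell mapping rates are supplied separately, cells are filtered before genes, and the gene filter is computed over the retained cells. The function names are hypothetical.

```python
import random


def filter_cells(matrix, mapping_rates, min_mapping=0.20, min_nonzero=0.05):
    """Indices of cells passing the mapping-rate and nonzero-proportion cutoffs."""
    keep = []
    for i, row in enumerate(matrix):
        nonzero_fraction = sum(v > 0 for v in row) / len(row)
        if mapping_rates[i] >= min_mapping and nonzero_fraction >= min_nonzero:
            keep.append(i)
    return keep


def filter_genes(matrix, cells, min_fraction=0.05):
    """Indices of genes expressed in at least min_fraction of the given cells."""
    keep = []
    for g in range(len(matrix[0])):
        expressed = sum(matrix[i][g] > 0 for i in cells) / len(cells)
        if expressed >= min_fraction:
            keep.append(g)
    return keep


def three_way_split(n, fractions=(0.5, 0.4, 0.1), seed=0):
    """Random split of 0..n-1 into screening, training, and test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_screen = int(round(fractions[0] * n))
    n_train = int(round(fractions[1] * n))
    return idx[:n_screen], idx[n_screen:n_screen + n_train], idx[n_screen + n_train:]


# Toy matrix (hypothetical values): 3 cells by 4 genes.
toy = [[1, 0, 0, 2], [0, 0, 0, 0], [3, 0, 1, 0]]
cells = filter_cells(toy, mapping_rates=[0.9, 0.9, 0.1])
genes = filter_genes(toy, cells)

# Split sizes for the 570 cells of the preprocessed murine dataset.
screen, train, test = three_way_split(570)
```

Keeping the screening cells disjoint from the training and test cells plays the same role as the independent screening dataset in the microbiome analysis.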

Table 4.

Mean and SE (in parentheses) of the misclassification error rates for the murine scRNA-seq dataset

Training Test
d=20 0.000 (0.000) 0.021 (0.003)
d=30 0.000 (0.000) 0.018 (0.002)
d=40 0.000 (0.000) 0.012 (0.002)
d=50 0.000 (0.000) 0.014 (0.002)
d=100 0.000 (0.000) 0.016 (0.002)
d=200 0.000 (0.000) 0.010 (0.001)
d=300 0.000 (0.000) 0.013 (0.002)
d=400 0.000 (0.000) 0.012 (0.002)
d=500 0.000 (0.000) 0.015 (0.002)

We further examined the top 20 most selected genes by DeepLINK equipped with screening for d=200, as presented in Table 5. Many of the selected genes were also reported as significant features in the original study (62), including Sqstm1, Sdc4, Abcg1, Rab31, Gmnn, Angpt2, Hsp90aa1, Tnfaip2, Clec4e, Gpx1, Sod2, and Fas. Gene Ontology (GO) analysis with domain Biological Process (BP) indicates that the up-regulation of genes in LPS-stimulated cells is related to NF-κB signaling (Sqstm1) and LPS response (Sod2). Also, Hsp90aa1 can bind LPS and mediate the LPS-induced inflammatory response according to UniProt (64), which may be related to its up-regulation in LPS-stimulated cells.

Table 5.

Top 20 most selected genes when d=200 for the murine scRNA-seq dataset

Gene Frequency Gene Frequency
Sqstm1 73 Gm26825 61
Cdkn1a 66 Hsp90aa1 60
Sdc4 65 Tnfaip2 60
Abcg1 64 Clec4e 58
Rab31 64 Gpx1 58
Gm28875 64 Sod2 57
Gmnn 63 Srsf5 55
Angpt2 63 Fas 51
Ehd4 61 Get1 50
Dnaja1 61 Hsp90ab1 50

Application to a Human Single-Cell RNA-Sequencing Dataset.

Another scRNA-seq dataset that we investigated is from a human glioblastoma study led by Darmanis et al. (65). We were interested in differential gene expression between neoplastic cells in the tumor core and those in the surrounding periphery. We used the same criteria as in the murine scRNA-seq study to preprocess the data, which resulted in a dataset with TPMs of 23,257 genes from 632 cells (580 in the tumor core and 52 in the periphery). Again, due to the high dimensionality, we first conducted dimensionality reduction using the distance correlation screening and then applied DeepLINK. The model architecture and parameters were the same as those in the last two real data studies. We repeated the experiment 100 times and present the mean and SE of the misclassification error on the training and test data in Table 6. The mean misclassification error on the test data attains its smallest value at d=200 and then remains more or less stable.

Table 6.

Mean and SE (in parentheses) of the misclassification error for the human scRNA-seq dataset

Training Test
d=20 0.006 (0.001) 0.072 (0.003)
d=30 0.002 (0.000) 0.068 (0.003)
d=40 0.001 (0.000) 0.064 (0.003)
d=50 0.001 (0.000) 0.060 (0.003)
d=100 0.001 (0.000) 0.059 (0.003)
d=200 0.000 (0.000) 0.046 (0.003)
d=300 0.000 (0.000) 0.056 (0.003)
d=400 0.000 (0.000) 0.049 (0.003)
d=500 0.000 (0.000) 0.050 (0.003)

The top 20 most selected genes by DeepLINK equipped with screening for d=200 are shown in Table 7. We next examined the biological meaning of these selected genes. As pointed out in the original study (65), down-regulation of genes like ATP1A2 and PRODH in the periphery might be related to their functions in the interstitial matrix invasion. We also observed that HIF3A was down-regulated in the tumor core, which was probably associated with the hypoxia in the core. A previous study also demonstrated that HIF3A was a dominant-negative regulator of HIF-1 and was thus down-regulated in a hypoxic environment (66). GO analysis with domain BP indicates that some genes up-regulated in the periphery have functions related to cell migration from periphery to core. For instance, HES6 has the GO term nervous system development, which is highly relevant to tumor cell migration. IGSF21 and CNTN1 have the GO term cell–cell adhesion, which is a central part of cell migration. ALDOC has the GO term glycolytic process, which produces a small amount of adenosine triphosphate (ATP) and may help cell migration as an energy provider. SERPINE2 has the GO term regulation of cell migration. Also, genes such as SPARCL1, NPL, and ST6GALNAC3 are involved in various metabolic processes.

Table 7.

Top 20 most selected genes when d=200 for the human scRNA-seq dataset

Gene Frequency Gene Frequency
ATP1A2 73 ANKRD20A9P 45
PRODH 73 HIF3A 44
HES6 62 NPL 44
IGSF21 57 AC131097.1 44
PPM1K 56 FAM240C 41
ALDOC 55 ST6GALNAC3 40
SPARCL1 55 MMP28 40
SERPINE2 52 CNTN1 39
RNPC3 52 MTRNR2L1 39
LOC102724788 47 EFHD1 38

Discussion

In this paper, we have developed a high-dimensional inference framework via knockoffs, DeepLINK, to enhance the interpretability and reproducibility of deep learning models. DeepLINK generates the knockoff variables under the possibly nonlinear factor model assumption using an autoencoder network and then fits the regression/classification model using the DeepPINK network. We have used various simulated datasets to numerically demonstrate that DeepLINK can achieve successful FDR control with attractive power in selecting features that are truly important for the response of interest. We have also showcased the practical utility and performance of DeepLINK on three real data applications.

When comparing the prediction performance of DeepLINK with random forests in SI Appendix, section 7, we noticed that random forests can outperform DeepLINK in terms of prediction for the microbiome dataset. This is likely caused by the distinctive prediction power of MLP and random forests. We remark that the MLP in the second step of DeepLINK can be replaced with random forests if one suspects that the latter can outperform in prediction. We also emphasize that the main purpose of DeepLINK is feature selection with controlled error rate, and to achieve the goal of FDR control in feature selection, the prediction power may be slightly compromised in some applications.

There are five potential directions for future investigations. First, in the real data applications, we consider binary outcomes. DeepLINK can be easily extended to the case of multiple classes if we replace the binary cross-entropy loss in the second step with multiclass cross-entropy. Second, the knockoff variable-generating process of DeepLINK simulates the idiosyncratic matrix E outside of the autoencoder network with non-deep-learning techniques. Designing a new deep neural network that can automate the knockoff variable-generating process may increase its efficiency and accuracy. Third, we currently have two separate networks in DeepLINK: the knockoff variable-generating network of autoencoder and the model fitting and inference network of DeepPINK. We would like to integrate them into one single network for a joint optimization so that the whole process can be fully automated. Such a feature can make DeepLINK even more user friendly. Fourth, heterogeneity in the samples is a practically important issue. It is possible that the samples consist of multiple subpopulations and that they have different true features. It is likely that DeepLINK can be extended to accommodate the heterogeneity. The key is to construct valid knockoff variables reflecting the subpopulation information. One naive method is to construct knockoff variables for each subpopulation and then combine them appropriately to form valid knockoff variables for the overall population. If this can be achieved, the second step of feature selection using MLP can be applied without modification. Finally, we would like to provide theoretical justifications on DeepLINK in terms of both FDR control and power. This can in turn guide the training of the underlying networks and further improve the interpretation of our deep learning inference method.
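
The first direction, moving from binary to multiclass outcomes, amounts to a change of loss function. The sketch below is not part of DeepLINK; it only illustrates, with hypothetical function names and toy probabilities, that the categorical cross-entropy over softmax outputs reduces to the binary cross-entropy when there are two classes.

```python
import math


def binary_cross_entropy(y, p):
    """Mean binary cross-entropy; p[i] is the predicted probability of class 1."""
    total = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p))
    return -total / len(y)


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def categorical_cross_entropy(labels, probs):
    """Mean multiclass cross-entropy; probs[i][k] is the probability of class k."""
    return -sum(math.log(p[k]) for k, p in zip(labels, probs)) / len(labels)


# With two classes, both losses agree on the same predictions.
y = [1, 0, 1]
p = [0.9, 0.2, 0.6]
two_class_probs = [[1 - pi, pi] for pi in p]
bce = binary_cross_entropy(y, p)
cce = categorical_cross_entropy(y, two_class_probs)

# Three-class example: scores are mapped to probabilities by softmax.
three_class_probs = [softmax([2.0, 0.5, -1.0]), softmax([0.0, 0.0, 3.0])]
loss3 = categorical_cross_entropy([0, 2], three_class_probs)
```

In a network, this corresponds to replacing a single sigmoid output unit with one softmax output unit per class.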

Supplementary Material

Supplementary File
pnas.2104683118.sapp.pdf (499.5KB, pdf)

Acknowledgments

This work was supported by NIH Grants R01GM120624 and 1R01GM131407 and NSF Grant DMS-1953356. Z.Z. was supported by the Viterbi Fellowship.

Footnotes

The authors declare no competing interest.

*The performance of IPAD is already very poor when the link function h alone takes a nonlinear form, as shown in Fig. 3 C and D.

This article is a PNAS Direct Submission.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2104683118/-/DCSupplemental.

Data Availability

Software data have been deposited in GitHub (https://github.com/zifanzhu/DeepLINK). Preprocessed data matrices for the four publicly available datasets can be downloaded from the corresponding links: Zeller microbiome (67), Yu microbiome (68), murine scRNA-seq (69), and human scRNA-seq (70).

References

  • 1. Fan J., Lv J., A selective overview of variable selection in high dimensional feature space. Stat. Sin. 20, 101–148 (2010).
  • 2. Fan J., Fan Y., High-dimensional classification using features annealed independence rules. Ann. Stat. 36, 2605–2637 (2008).
  • 3. Benjamini Y., Hochberg Y., Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).
  • 4. Benjamini Y., Discovering the false discovery rate. J. R. Stat. Soc. B 72, 405–416 (2010).
  • 5. Benjamini Y., Yekutieli D., The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29, 1165–1188 (2001).
  • 6. Efron B., Tibshirani R., Storey J. D., Tusher V., Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 96, 1151–1160 (2001).
  • 7. Storey J. D., A direct approach to false discovery rates. J. R. Stat. Soc. Series B Stat. Methodol. 64, 479–498 (2002).
  • 8. Benjamini Y., Krieger A. M., Yekutieli D., Adaptive linear step-up procedures that control the false discovery rate. Biometrika 93, 491–507 (2006).
  • 9. Genovese C. R., Roeder K., Wasserman L., False discovery control with p-value weighting. Biometrika 93, 509–524 (2006).
  • 10. Scott J. G., Kelly R. C., Smith M. A., Zhou P., Kass R. E., False discovery rate regression: An application to neural synchrony detection in primary visual cortex. J. Am. Stat. Assoc. 110, 459–471 (2015).
  • 11. Ignatiadis N., Klaus B., Zaugg J. B., Huber W., Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nat. Methods 13, 577–580 (2016).
  • 12. Stephens M., False discovery rates: A new deal. Biostatistics 18, 275–294 (2017).
  • 13. Lei L., Fithian W., Adapt: An interactive procedure for multiple testing with side information. J. R. Stat. Soc. Series B Stat. Methodol. 80, 649–679 (2018).
  • 14. Li A., Barber R. F., Multiple testing with the structure-adaptive Benjamini-Hochberg algorithm. J. R. Stat. Soc. Series B Stat. Methodol. 81, 45–74 (2019).
  • 15. Candès E., Fan Y., Janson L., Lv J., Panning for gold: ‘Model-X’ knockoffs for high dimensional controlled variable selection. J. R. Stat. Soc. B 80, 551–577 (2018).
  • 16. Lu Y., Fan Y., Lv J., Noble W. S., “DeepPINK: Reproducible feature selection in deep neural networks” in Advances in Neural Information Processing Systems, Bengio S., et al., Eds. (Advances in Neural Information Processing Systems, 2018), pp. 8676–8686.
  • 17. Uematsu Y., Fan Y., Chen K., Lv J., Lin W., SOFAR: Large-scale association network learning. IEEE Trans. Inf. Theory 65, 4924–4939 (2019).
  • 18. Zheng Z., Lv J., Lin W., Nonsparse learning with latent variables. Oper. Res. 69, 346–359 (2021).
  • 19. Friguet C., Kloareg M., Causeur D., A factor model approach to multiple testing under dependence. J. Am. Stat. Assoc. 104, 1406–1415 (2009).
  • 20. Shen Y., Jin R., “Learning personal+ social latent factor model for social recommendation” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Yang Q., Agarwal D., Eds. (Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012), pp. 1303–1311.
  • 21. Jenatton R., Le Roux N., Bordes A., Obozinski G., “A latent factor model for highly multi-relational data” in Advances in Neural Information Processing Systems (NIPS 2012), Pereira F., Burges C. J. C., Bottou L., Weinberger K. Q., Eds. (Advances in Neural Information Processing Systems, 2012), vol. 25, pp. 3176–3184.
  • 22. Argelaguet R., et al., Multi-Omics Factor Analysis—A framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol. 14, e8124 (2018).
  • 23. Frichot E., Schoville S. D., Bouchard G., François O., Testing for associations between loci and environmental gradients using latent factor mixed models. Mol. Biol. Evol. 30, 1687–1699 (2013).
  • 24. Blum Y., Le Mignon G., Lagarrigue S., Causeur D., A factor model to analyze heterogeneity in gene expression. BMC Bioinformatics 11, 368 (2010).
  • 25. Fan J., Fan Y., Lv J., High dimensional covariance matrix estimation using a factor model. J. Econom. 147, 186–197 (2008).
  • 26. Scott J. T., Factor analysis and regression. Econometrica 34, 552–562 (1966).
  • 27. Scott J. T., Factor analysis regression revisited. Econometrica 37, 719 (1969).
  • 28. Diebold F. X., Rudebusch G. D., Aruoba S. B., The macroeconomy and the yield curve: A dynamic latent factor approach. J. Econom. 131, 309–338 (2006).
  • 29. Uddin A., Yu D., Latent factor model for asset pricing. J. Behav. Exp. Finance 27, 100353 (2020).
  • 30. Fan Y., Lv J., Sharifvaghefi M., Uematsu Y., IPAD: Stable interpretable forecasting with knockoffs inference. J. Am. Stat. Assoc. 115, 1822–1834 (2020).
  • 31. Pearl J., “Markov and Bayesian networks: Two graphical representations of probabilistic knowledge” in Probabilistic Reasoning in Intelligent Systems, Pearl J., Ed. (Morgan Kaufmann, San Francisco, CA, 1988), pp. 77–141.
  • 32. Hommel G., Hoffmann T., Controlled Uncertainty in Multiple Hypothesenprüfung/Multiple Hypotheses Testing (Springer, 1988), pp. 154–161.
  • 33. Lehmann E. L., Romano J. P., Generalizations of the Familywise Error Rate in Selected Works of EL Lehmann (Springer, 2012), pp. 719–735.
  • 34. Fan Y., Demirkaya E., Lv J., Nonuniformity of p-values can occur early in diverging dimensions. J. Mach. Learn. Res. 20, 1–33 (2019).
  • 35. Barber R. F., Candès E. J., Controlling the false discovery rate via knockoffs. Ann. Stat. 43, 2055–2085 (2015).
  • 36. Fan Y., Demirkaya E., Li G., Lv J., RANK: Large-scale inference with graphical nonlinear knockoffs. J. Am. Stat. Assoc. 115, 362–379 (2020).
  • 37. Bates S., Candès E., Janson L., Wang W., Metropolized knockoff sampling. J. Am. Stat. Assoc., (2020). 10.1080/01621459.2020.1729163.
  • 38. Huang D., Janson L., Relaxing the assumptions of knockoffs by conditioning. Ann. Stat. 48, 3021–3042 (2020).
  • 39. Sesia M., Sabatti C., Candès E. J., Gene hunting with hidden Markov model knockoffs. Biometrika 106, 1–18 (2019).
  • 40. Bai X., Ren J., Fan Y., Sun F., KIMI: Knockoff inference for motif identification from molecular sequences with controlled false discovery rate. Bioinformatics 37, 759–766 (2021).
  • 41. Jolliffe I., Principal Component Analysis (Springer Verlag, New York, NY, 2002).
  • 42. Bengio Y., Courville A., Vincent P., Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828 (2013).
  • 43. Bai J., Ng S., Determining the number of factors in approximate factor models. Econometrica 70, 191–221 (2002).
  • 44. Breiman L., Random forests. Mach. Learn. 45, 5–32 (2001).
  • 45. Chi C. M., Vossler P., Fan Y., Lv J., Asymptotic properties of high-dimensional random forests. arXiv [Preprint] (2020). https://arxiv.org/abs/2004.13953 (Accessed 29 April 2020).
  • 46. Zeller G., et al., Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).
  • 47. Yu J., et al., Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut 66, 70–78 (2017).
  • 48. Fan J., Lv J., Sure independence screening for ultrahigh dimensional feature space (with discussion). J. R. Stat. Soc. B 70, 849–911 (2008).
  • 49. Fan J., Lv J., Sure independence screening (invited review article). Wiley StatsRef: Statistics Reference Online (2018). https://par.nsf.gov/biblio/10091881. Accessed 1 June 2018.
  • 50. Székely G. J., Rizzo M. L., Bakirov N. K., Measuring and testing dependence by correlation of distances. Ann. Stat. 35, 2769–2794 (2007).
  • 51. Gao L., Fan Y., Lv J., Shao Q., Asymptotic distributions of high-dimensional distance correlation inference. Ann. Stat. (2021).
  • 52. Osman M. A., et al., Parvimonas micra, Peptostreptococcus stomatis, Fusobacterium nucleatum and Akkermansia muciniphila as a four-bacteria biomarker panel of colorectal cancer. Sci. Rep. 11, 2925 (2021).
  • 53. Löwenmark T., et al., Parvimonas micra as a putative non-invasive faecal biomarker for colorectal cancer. Sci. Rep. 10, 15250 (2020).
  • 54. Dai Z., et al., Multi-cohort analysis of colorectal cancer metagenome identified altered bacteria across populations and universal bacterial markers. Microbiome 6, 70 (2018).
  • 55. Drewes J. L., et al., High-resolution bacterial 16S rRNA gene profile meta-analysis and biofilm status reveal common colorectal cancer consortia. NPJ Biofilms Microbiomes 3, 34 (2017).
  • 56. Purcell R. V., Visnovska M., Biggs P. J., Schmeier S., Frizelle F. A., Distinct gut microbiome patterns associate with consensus molecular subtypes of colorectal cancer. Sci. Rep. 7, 11590 (2017).
  • 57. Ai L., et al., Systematic evaluation of supervised classifiers for fecal microbiota-based prediction of colorectal cancer. Oncotarget 8, 9546–9556 (2017).
  • 58. Toprak N. U., et al., A possible role of Bacteroides fragilis enterotoxin in the aetiology of colorectal cancer. Clin. Microbiol. Infect. 12, 782–786 (2006).
  • 59. Boleij A., et al., The Bacteroides fragilis toxin gene is prevalent in the colon mucosa of colorectal cancer patients. Clin. Infect. Dis. 60, 208–215 (2015).
  • 60. Viljoen K. S., Dakshinamurthy A., Goldberg P., Blackburn J. M., Quantitative profiling of colorectal cancer-associated bacteria reveals associations between fusobacterium spp., enterotoxigenic Bacteroides fragilis (ETBF) and clinicopathological features of colorectal cancer. PLoS One 10, e0119462 (2015).
  • 61. Haghi F., Goli E., Mirzaei B., Zeighami H., The association between fecal enterotoxigenic B. fragilis with colorectal cancer. BMC Cancer 19, 879 (2019).
  • 62. Lane K., et al., Measuring signaling and RNA-seq in the same cell links gene expression to dynamic patterns of nf-κb activation. Cell Syst. 4, 458–469.e5 (2017).
  • 63. Korthauer K., et al., A practical guide to methods controlling false discoveries in computational biology. Genome Biol. 20, 118 (2019).
  • 64. UniProt Consortium, UniProt: A hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).
  • 65. Darmanis S., et al., Single-cell RNA-seq analysis of infiltrating neoplastic cells at the migrating front of human glioblastoma. Cell Rep. 21, 1399–1410 (2017).
  • 66. Maynard M. A., et al., Human HIF-3alpha4 is a dominant-negative regulator of HIF-1 and is down-regulated in renal cell carcinoma. FASEB J. 19, 1396–1406 (2005).
  • 67. Zhu Z., Data from “Abundance matrix for the WGS microbiome data set from Zeller et al.” GitHub. https://github.com/zifanzhu/DeepLINK/tree/main/Real_data_analyses/human_microbiome/data/microbiome_data_common.csv. Accessed 21 June 2021.
  • 68. Zhu Z., Data from “Abundance matrix for the WGS microbiome data set from Yu et al.” GitHub. https://github.com/zifanzhu/DeepLINK/tree/main/Real_data_analyses/human_microbiome/data/yu_CRC_common.csv. Accessed 21 June 2021.
  • 69. Zhu Z., Data from “Expression matrix for the murine scRNA-seq data set from Lane et al.” GitHub. https://github.com/zifanzhu/DeepLINK/tree/main/Real_data_analyses/murine_sc_RNAseq/data/rna1.csv. Accessed 21 June 2021.
  • 70. Zhu Z., Data from “Expression matrix for the human scRNA-seq data set from Darmanis et al.” GitHub. https://github.com/zifanzhu/DeepLINK/tree/main/Real_data_analyses/human_sc_RNAseq/data/rna2.csv. Accessed 21 June 2021.
