A modern maximum-likelihood theory for high-dimensional logistic regression

Pragya Sur; Emmanuel J Candès

doi:10.1073/pnas.1810420116

. 2019 Jul 1;116(29):14516–14525. doi: 10.1073/pnas.1810420116

A modern maximum-likelihood theory for high-dimensional logistic regression

Pragya Sur ^a,^1,², Emmanuel J Candès ^a,^b,^1,²

PMCID: PMC6642380 PMID: 31262828

Significance

Logistic regression is a popular model in statistics and machine learning to fit binary outcomes and assess the statistical significance of explanatory variables. Here, the classical theory of maximum-likelihood (ML) estimation is used by most software packages to produce inference. In the now common setting where the number of explanatory variables is not negligible compared with the sample size, we show that classical theory leads to inferential conclusions that cannot be trusted. We develop a theory that provides expressions for the bias and variance of the ML estimate and characterizes the asymptotic distribution of the likelihood-ratio statistic under some assumptions regarding the distribution of the explanatory variables. This theory can be used to provide valid inference.

Keywords: logistic regression, high-dimensional inference, maximum-likelihood estimate, likelihood-ratio test

Abstract

Students in statistics or data science usually learn early on that when the sample size $n$ is large relative to the number of variables $p$ , fitting a logistic model by the method of maximum likelihood produces estimates that are consistent and that there are well-known formulas that quantify the variability of these estimates which are used for the purpose of statistical inference. We are often told that these calculations are approximately valid if we have 5 to 10 observations per unknown parameter. This paper shows that this is far from the case, and consequently, inferences produced by common software packages are often unreliable. Consider a logistic model with independent features in which n and p become increasingly large in a fixed ratio. We prove that (i) the maximum-likelihood estimate (MLE) is biased, (ii) the variability of the MLE is far greater than classically estimated, and (iii) the likelihood-ratio test (LRT) is not distributed as a χ². The bias of the MLE yields wrong predictions for the probability of a case based on observed values of the covariates. We present a theory, which provides explicit expressions for the asymptotic bias and variance of the MLE and the asymptotic distribution of the LRT. We empirically demonstrate that these results are accurate in finite samples. Our results depend only on a single measure of signal strength, which leads to concrete proposals for obtaining accurate inference in finite samples through the estimate of this measure.

Logistic regression (1, 2) is one of the most frequently used models to estimate the probability of a binary response from the value of multiple features/predictor variables. It is widely used in the social sciences, the finance industry, the medical sciences, and so on. As an example, a typical application of logistic regression may be to predict the risk of developing a given coronary heart disease from a patient’s observed characteristics. Consequently, graduate students in statistics and many fields that involve data analysis learn about logistic regression, perhaps before any other nonlinear multivariate model. In particular, most students know how to interpret the excerpt of the computer output from Fig. 1, which displays regression coefficient estimates, standard errors, and P values for testing the significance of the regression coefficients. In textbooks we learn the following: (i) Fitting a model via maximum likelihood produces estimates that are approximately unbiased. (ii) There are formulas to estimate the accuracy or variability of the maximum-likelihood estimate (MLE) (used in the computer output from Fig. 1).

Fig. 1. — Excerpt from an object of class “glm” obtained by fitting a logistic model in R. The coefficient estimates ${\hat{β}}_{j}$ are obtained by maximum likelihood, and for each variable, R provides an estimate of the SD of $\hat{β_{j}}$ as well as a P value for testing whether $β_{j} = 0$ or not.

These approximations come from asymptotic results. Imagine we have $n$ independent observations $(y_{i}, X_{i})$ where $y_{i} \in {0,1}$ is the response variable and $X_{i} \in R^{p}$ the vector of predictor variables. The logistic model posits that the probability of a case conditional on the covariates is given by

P (y_{i} = 1 ∣ X_{i}) = ρ' (X_{i}^{'} β),

where $ρ' (t) = e^{t} / (1 + e^{t})$ is the standard sigmoidal function. When $p$ is fixed and $n \to \infty$ , the MLE $\hat{β}$ obeys

\sqrt{n} (\hat{β} - β) \overset{d}{\to} N (0, I_{β}^{- 1}),

[1]

where $I_{β}$ is the $p \times p$ Fisher information matrix evaluated at the true $β$ (3). A classical way of understanding Eq. 1 is in the case where the pairs $(X_{i}, y_{i})$ are i.i.d. and the covariates $X_{i}$ are drawn from a distribution obeying mild conditions so that the MLE exists and is unique. Now the limiting result Eq. 1 justifies the first claim of near unbiasedness. Further, software packages then return standard errors by evaluating the inverse Fisher information matrix at the MLE $\hat{β}$ [this is what R (4) does in Fig. 1]. In turn, these standard errors are then used for the purpose of statistical inference; for instance, they are used to produce P values for testing the significance of regression coefficients, which researchers use in thousands of scientific studies.

Another well-known result in logistic regression is Wilks’ theorem (5), which gives the asymptotic distribution of the likelihood-ratio test (LRT): (iii) Consider the likelihood ratio obtained by dropping $k$ variables from the model under study. Then under the null hypothesis that none of the dropped variables belongs to the model, twice the log-likelihood ratio (LLR) converges to a χ² distribution with $k$ degrees of freedom in the limit of large samples. Once more, this approximation is often used in many statistical software packages to obtain P values for testing the significance of individual and/or groups of coefficients.

1. Failures in Moderately Large Dimensions

New technologies now produce extremely large datasets, often with huge numbers of features on each of a comparatively small number of experimental units. However, software packages and practitioners continue to perform calculations as if classical theory applies and, therefore, the main issue is this: Do these approximations hold in high-dimensional settings where $p$ is not vanishingly small compared with $n$ ?

To address this question, we begin by showing results from an empirical study. Throughout this section, we set $n = 4,000$ and unless otherwise specified, $p = 800$ (so that the “dimensionality” $p / n$ is equal to 1/5). We work with a matrix of covariates, which has i.i.d. $N (0,1 / n)$ entries, and different types of regression coefficients scaled in such a way that

γ^{2} ≔ Var (X_{i}^{'} β) = 5 .

This is a crucial point: We want to make sure that the size of the log-odds ratio $X_{i}^{'} β$ does not increase with $n$ or $p$ , so that $ρ' (X_{i}^{'} β)$ is not trivially equal to either 0 or 1. Instead, we want to be in a regime where accurate estimates of $β$ translate into a precise evaluation of a nontrivial probability. With our scaling $γ = \sqrt{5} \approx 2.236$ , about 95% of the observations will be such that $- 4.472 \leq X_{i}^{'} β \leq 4.472$ so that $0.011 \leq ρ' (X_{i}^{'} β) \leq 0.989$ .

Unbiasedness?

Fig. 2 plots the true and fitted coefficients in the setting where one-quarter of the regression coefficients have a magnitude equal to 10, and the rest are 0. Half of the nonzero coefficients are positive and the other half are negative. A striking feature is that the black curve does not pass through the center of the blue scatter. This disagrees with what we would expect from classical theory. Clearly, the regression estimates are not close to being unbiased. When the true effect size $β_{j}$ is positive, we see that the MLE tends to overestimate it. Symmetrically, when $β_{j}$ is negative, the MLE tends to underestimate the effect sizes in the sense that the fitted values are in the same direction but with magnitudes that are too large. In other words, for most indexes $| {\hat{β}}_{j} | > | β_{j} |$ so that we are overestimating the magnitudes of the effects.

The bias is not specific to this example as the theory we develop in this paper will make clear. Consider a case where the entries of $β$ are drawn i.i.d. from $N (3,16)$ (the setup is otherwise unchanged). Fig. 3A shows that the pairs $(β_{j}, {\hat{β}}_{j})$ are not distributed around a straight line of slope 1; rather, they are distributed around a line with a larger slope. Our theory predicts that the points should be scattered around a line with slope 1.499 shown in red, as if we could think that $E {\hat{β}}_{j} \approx 1.499 β_{j}$ .

Fig. 3. — (A) Scatterplot of the pairs $(β_{j}, {\hat{β}}_{j})$ for i.i.d. $N (3,16)$ regression coefficients. The black line has slope 1. Again, we see that the MLE seriously overestimates effect magnitudes. The red line has slope $α^{⋆} \approx 1.499$ predicted by the solution to Eq. 5. We can see that ${\hat{β}}_{j}$ seems centered around $α^{⋆} β_{j}$ . (B) True conditional probability $f (X *) = ρ' (X *' β)$ (black curve) and corresponding estimated probabilities $\hat{f} (X *) = ρ' (X *' \hat{β})$ (blue points). Observe the dramatic shrinkage of the estimates toward the end points.

This bias is highly problematic for estimating the probability of our binary response. Suppose we are given a vector of covariates $X_{*}$ and estimate the regression function $f (X_{*}) = P (y = 1 ∣ X_{*})$ with

\hat{f} (X_{*}) = ρ' (X_{*}^{'} \hat{β}) .

Then because we tend to overestimate the magnitudes of the effects, we will also tend to overestimate or underestimate the probabilities depending on whether $f (X_{*})$ is greater or less than a half. This is illustrated in Fig. 3B. Observe that when $f (X_{*}) < 1 / 2$ , many predictions tend to be close to 0, even when $f (X_{*})$ is nowhere near 0. A similar behavior is obtained by symmetry; when $f (X_{*}) > 1 / 2$ , we see a shrinkage toward the other end point, namely, 1. Hence, we see a shrinkage toward the extremes and the phenomenon is amplified as the true probability $f (X_{*})$ approaches 0 or 1. Expressed differently, the MLE may predict that an outcome is almost certain (i.e., $\hat{f}$ is close to 0 or 1) when, in fact, the outcome is not at all certain. This behavior is misleading.

Accuracy of Classical Standard Errors?

Consider the same matrix $X$ as before and regression coefficients now sampled as follows: Half of the $β_{j}$ s are i.i.d. draws from $N (7,1)$ , and the other half vanish. Fig. 4A shows standard errors computed via Monte Carlo of maximum-likelihood (ML) estimates ${\hat{β}}_{j}$ corresponding to null coordinates. This is obtained by fixing the signal $β$ and resampling the response vector and covariate matrix $10,000$ times. Note that for any null coordinate, the classical estimate of SE based on the inverse Fisher information can be explicitly calculated in this setting and turns out to be equal to 2.66 (SI Appendix, section A). Since the SE values evidently concentrate around 4.75, we see that in higher dimensions, the variance of the MLE is likely to be much larger than the classical asymptotic variance. Naturally, using classical results would lead to incorrect P values and confidence statements, a major issue first noted in ref. 6.

Fig. 4. — (A) Distribution of $SE ({\hat{β}}_{j})$ for each variable $j$ , in which the SE is estimated from $10, 000$ samples. The classical SE value is shown in blue. Classical theory underestimates the variability of the MLE. (B) SE estimates computed from R for a single null (for which $β_{j} = 0$ ) obtained across $10, 000$ replicates resampling the response vector and the covariate matrix.

The variance estimates obtained from statistical software packages are different from the value 2.66 above because they do not take expectation over the covariates and use the MLE $\hat{β}$ in lieu of $β$ (plugin estimate) (SI Appendix, section A). Since practitioners often use these estimates, it is useful to describe how they behave. To this end, for each of the $10, 000$ samples $(X, y)$ drawn above, we obtain the R SE estimate for a single MLE coordinate corresponding to a null variable. The histogram is shown in Fig. 4B. The behavior for this specific coordinate is typical of that observed for any other null coordinate, and the maximum value for these standard errors remains below 4.5, significantly below the typical values observed via Monte Carlo simulations in Fig. 4A.

Distribution of the LRT?

By now, the reader should be suspicious that the χ² approximation for the distribution of the likelihood-ratio test holds in higher dimensions. Indeed, it does not and this actually is not a new observation. In ref. 7, the authors established that for a class of logistic regression models, the LRT converges weakly to a multiple of a χ² variable in an asymptotic regime in which both $n$ and $p$ tend to infinity in such a way that $p / n \to κ \in (0,1 / 2)$ . The multiplicative factor is an increasing function of the limiting aspect ratio $κ$ and exceeds 1 as soon as $κ$ is positive. This factor can be computed by solving a nonlinear system of two equations in two unknowns given in Eq. 8 below. Furthermore, ref. 7 links the distribution of the LRT with the asymptotic variance of the marginals of the MLE, which turns out to be provably higher than that given by the inverse Fisher information. These findings are of course completely in line with the conclusions from the previous paragraphs. The issue is that the results from ref. 7 assume that $β = 0$ ; that is, they apply under the global null where the response does not depend upon the predictors, and it is a priori unclear how the theory would extend beyond this case. Our goal in this paper is to study properties of the MLE and the LRT for high-dimensional logistic regression models under general signal strengths—restricting to the regime where the MLE exists.

To investigate what happens when we are not under the global null, consider the same setting as in Fig. 4. Fig. 5 shows the distribution of P values for testing a null coefficient based on the χ² approximation. Not only are the P values far from uniform, but also the enormous mass near 0 is problematic for multiple-testing applications, where one examines P values at very high levels of significance, e.g., near Bonferroni levels. In such applications, one would be bound to make a large number of false discoveries from using P values produced by software packages. To further demonstrate the large inflation near the small P values, we display in Table 1 estimates of the P-value probabilities in bins near 0. The estimates are much higher than what is expected from a uniform distribution. Clearly, the distribution of the LRT is far from a $χ_{1}^{2}$ .

Table 1.

P-value probabilities with SEs in parentheses

Threshold, %	Classical, %
$P {P value \leq 5}$	$10.77 (0.062)$
$P {P value \leq 1}$	$3.34 (0.036)$
$P {P value \leq 0.5}$	$1.98 (0.028)$
$P {P value \leq 0.1}$	$0.627 (0.016)$
$P {P value \leq 0.05}$	$0.365 (0.012)$
$P {P value \leq 0.01}$	$0.136 (0.007)$

Open in a new tab

Here, n=4, 000, p=800, X has i.i.d. Gaussian entries, and half of the entries of β are drawn from $N$ (7, 1).

Summary.

We have hopefully made the case that classical results, which software packages continue to rely upon, can be inaccurate in higher dimensions. (i) Estimates seem systematically biased in the sense that effect magnitudes are overestimated. (ii) Estimates are far more variable than classical results. And (iii) inference measures, e.g., P values, are unreliable especially at small values. Given the widespread use of logistic regression in high dimensions, a theory explaining how to adjust inference to make it valid is needed.

2. Our Contribution

We develop a theory for high-dimensional logistic regression models with independent variables that is capable of accurately describing all of the phenomena we have discussed. Taking them one by one, the theory from this paper explicitly characterizes (i) the bias of the MLE, (ii) the variability of the MLE, and (iii) the distribution of the LRT, in an asymptotic regime where the sample size and the number of features grow to infinity in a fixed ratio. Moreover, we shall see that our asymptotic results are extremely accurate in finite-sample settings in which $p$ is a fraction of $n$ ; e.g., $p = 0.2 n$ .

A useful feature of this theory is that in our model, all of our results depend on the true coefficients $β$ only through the signal strength $γ$ , where $γ^{2} ≔ Var (X_{i}^{'} β)$ . This immediately suggests that estimating some high-dimensional parameter is not required to adjust inference. We propose in Section 5 a method for estimating $γ$ and empirically study the quality of inference based on this estimate.

At the mathematical level, our arguments are very involved. Our strategy is to introduce an approximate message-passing algorithm that tracks the MLE in the limit of a large number of features and samples. In truth, a careful mathematical analysis is delicate and we defer the mathematical details to SI Appendix.

3. Prior Work

Asymptotic properties of M estimators in the context of linear regression have been extensively studied in diverging dimensions starting from ref. 8, followed by refs. 9 and 10, in the regime $p = o (n^{α})$ , for some $α < 1$ . Later on, the regime where $p$ is comparable to $n$ became the subject of a series of remarkable works (11–14); these papers concern the distribution of M estimators in linear models. The rigorous results from this literature all assume strongly convex loss functions, a property critically missing in logistic regression. The techniques developed in the work of El Karoui (14) and in ref. 13 play a crucial role in our analysis; the connections are detailed in SI Appendix. While this paper was under review, we also learned about extensions to penalized versions of such strongly convex losses (15). Again, this literature is concerned with linear models only and it is natural to wonder what extensions to generalized linear models might look like; see the comments at the end of the talk (16). More general exponential families were studied in refs. 17 and 18; these works were also in setups subsumed under $p = o (n)$ . Very recently, ref. 19 investigated classical asymptotic normality of the MLE under the global null and regimes in which it may break down as the dimensionality increases.

In parallel, there exists an extensive body of literature on penalized maximum-likelihood estimates/procedures for generalized linear models; see refs. 20 and 21, for example, and the references cited therein. This body of literature often allows $p$ to be larger than $n$ but relies upon strong sparsity assumptions on the underlying signal. The setting in these works is, therefore, different from ours.

In the low-dimensional setting where the MLE is consistent, finite-sample corrections to the MLE and the LRT have been suggested in a series of works—see, for instance, refs. 22–35. Although these finite-sample approaches aim at correcting the problems described in the preceding sections, that is, the bias of the MLE and nonuniformity of the P values, the corrections are not sufficiently accurate in high dimensions and the methods are often not scalable to high-dimensional data; see SI Appendix, section A for some simulations in this direction. A line of simulation-based results exists to guide practitioners about the sample size required to avoid finite-sample problems (36, 37). The rule of thumb is usually 10 events per variable (EPV) or more but we shall later clearly see that such a rule is not valid when the number of features is large. Ref. 38 contested the previously established 10 EPV rule. To the best of our knowledge, logistic regression in the regime where $p$ is comparable to $n$ has been quite sparsely studied. This paper follows up on the earlier contribution (7) of the authors, which characterized the LLR distribution in the case where there is no signal (global null). This earlier reference derived the asymptotic distribution of the LLR as a function of the limiting ratio $p / n$ . This former result may be seen as a special case of Theorem 4, which deals with general signal strengths. As is expected, the arguments are now much more complicated than when working under the global null.

4. Main Results

Setting.

We describe the asymptotic properties of the MLE and the LRT in a high-dimensional regime, where $n$ and $p$ both go to infinity in such a way that $p / n \to κ > 0$ . We work with independent observations ${X_{i}, y_{i}}$ from a logistic model such that $P (y_{i} = 1 ∣ X_{i}) = ρ' (X_{i}^{'} β)$ . We assume here that $X_{i} \sim N (0, n^{- 1} I_{p})$ , where $I_{p}$ is the $p$ -dimensional identity matrix. The exact scaling of $X_{i}$ is not important. As noted before, the important scaling is the signal strength $X_{i}^{'} β$ and we assume that the $p$ regression coefficients (recall that $p$ increases with $n$ ) are scaled in such a way that

lim_{n \to \infty} Var (X_{i}^{'} β) = γ^{2},

[2]

where $γ$ is fixed. It is useful to think of the parameter $γ$ as the signal strength. Another way to express Eq. 2 is to say that ${lim}_{n \to \infty} {‖ β ‖}^{2} / n = γ^{2}$ .

4.a. When Does the MLE Exist?

The MLE $\hat{β}$ is the minimizer of the negative log-likelihood $ℓ$ defined via (observe that the sigmoid is the first derivative of $ρ$ )

ℓ (b) = \sum_{i = 1}^{n} {ρ (X_{i}^{'} b) - y_{i} (X_{i}^{'} b)}, ρ (t) = \log (1 + e^{t}) .

[3]

A first important remark is that in high dimensions, the MLE does not asymptotically exist if the signal strength $γ$ exceeds a certain functional $g_{MLE} (κ)$ of the dimensionality: i.e., $γ > g_{MLE} (κ)$ . This happens because in such cases, there is a perfect separating hyperplane—separating the cases from the controls if you will—sending the MLE to infinity. It turns out that a companion paper (39) precisely characterizes the region in which the MLE exists.

Theorem 1 (39).

Let $Z$ be a standard normal variable with density $φ (t)$ and $V$ be an independent continuous random variable with density $2 ρ' (γ t) φ (t)$ . With $x_{+} = max (x, 0)$ , set

g_{MLE}^{- 1} (γ) = min_{t \in R} \{E {(Z - t V)}_{+}^{2}\},

[4]

which is a decreasing function of $γ$ . Then in the setting described above,

\begin{matrix} γ > g_{MLE} (κ) \Rightarrow {lim}_{n, p \to \infty} P {MLE exists} = 0, \\ γ < g_{MLE} (κ) \Rightarrow {lim}_{n, p \to \infty} P {MLE exists} = 1 . \end{matrix}

Hence, the curve $γ = g_{MLE} (κ)$ , or, equivalently, $κ = g_{MLE}^{- 1} (γ)$ shown in Fig. 6 separates the $κ - γ$ plane into two regions: One in which the MLE asymptotically exists and one in which it does not. Clearly, we are interested in this paper in the former region (the purple region in Fig. 6A).

4.b. A System of Nonlinear Equations.

As we shall soon see, the asymptotic behavior of both the MLE and the LRT is characterized by a system of equations in three variables $(α, σ, λ)$ ,

\{\begin{aligned} σ^{2} & = \frac{1}{κ^{2}} E [2 ρ' (Q_{1}) {(λ ρ' ({p r o x}_{λ ρ} (Q_{2})))}^{2}] \\ 0 & = E [ρ' (Q_{1}) Q_{1} λ ρ' ({p r o x}_{λ ρ} (Q_{2}))] \\ 1 - κ & = E [\frac{2 ρ' (Q_{1})}{1 + λ ρ^{″} ({p r o x}_{λ ρ} (Q_{2}))}] \end{aligned},

[5]

where $(Q_{1}, Q_{2})$ is a bivariate normal variable with mean $0$ and covariance

Σ (α, σ) = [\begin{matrix} γ^{2} & - α γ^{2} \\ - α γ^{2} & α^{2} γ^{2} + κ σ^{2} \end{matrix}] .

[6]

With $ρ$ as in Eq. 3, the proximal mapping operator is defined via

{p r o x}_{λ ρ} (z) = \arg min_{t \in R} \{λ ρ (t) + \frac{1}{2} {(t - z)}^{2}\} .

[7]

The system of equations in Eq. 5 is parameterized by the pair $(κ, γ)$ of dimensionality and signal strength parameters. It turns out that the system admits a unique solution if and only if $(κ, γ)$ is in the region where the MLE asymptotically exists!

It is instructive to note that in the case where the signal strength vanishes, $γ = 0$ , the system of equations in Eq. 5 reduces to the 2-dimensional system

\{\begin{aligned} σ^{2} & = \frac{1}{κ^{2}} E [{(λ ρ' ({p r o x}_{λ ρ} (τ Z)))}^{2}] \\ 1 - κ & = E [\frac{1}{1 + λ ρ^{″} ({p r o x}_{λ ρ} (τ Z))}] \end{aligned},

[8]

where $τ^{2} ≔ κ σ^{2}$ and $Z \sim N (0,1)$ . This holds because $Q_{1} = 0$ . It is not surprising that this system is that from ref. 7 since that work considers $β = 0$ and, therefore, $γ = 0$ .

We remark that similar equations have been obtained for M estimators in linear models; see, for instance, ref. 11; SI Appendix, Eqs. S1 and S2; and refs. 13–15.

4.c. The Average Behavior of the MLE.

Our first main result characterizes the “average” behavior of the MLE.

Theorem 2.

Assume the dimensionality and signal strength parameters $κ$ and $γ$ are such that $γ < g_{MLE} (κ)$ (the region where the MLE exists asymptotically and is shown in Fig. 6). Assume the logistic model described above where the empirical distribution of ${β_{j}}$ converges weakly to a distribution $Π$ with finite second moment. Suppose further that the second moment converges in the sense that as $n \to \infty$ , ${A v e}_{j} (β_{j}^{2}) \to E β^{2}$ , $β \sim Π$ . Then for any pseudo-Lipschitz function $ψ$ of order 2,^† the marginal distributions of the MLE coordinates obey

\frac{1}{p} \sum_{j = 1}^{p} ψ ({\hat{β}}_{j} - α_{⋆} β_{j}, β_{j}) \overset{a.s.}{\to} E [ψ (σ_{⋆} Z, β)], Z \sim N (0,1),

[9]

where $β \sim Π$ , independent of $Z$ .

Among the many consequences of this result, we give three:

•
This result quantifies the exact bias of the MLE in some statistical sense. This can be seen by taking $ψ (t, u) = t$ in Eq. 9, which leads to

\frac{1}{p} \sum_{j = 1}^{p} ({\hat{β}}_{j} - α_{⋆} β_{j}) \overset{a.s.}{\to} 0

and says that ${\hat{β}}_{j}$ is centered about $α_{⋆} β_{j}$ . This can be seen from the empirical results from the previous sections as well. When $κ = 0.2$ and $γ = \sqrt{5}$ , the solution to Eq. 5 obeys $α_{⋆} = 1.499$ and Fig. 3A shows that this is the correct centering.

•
Second, our result also provides the asymptotic variance of the MLE marginals after they are properly centered. This can be seen by taking $ψ (t, u) = t^{2}$ , which leads to

\frac{1}{p} \sum_{j = 1}^{p} {({\hat{β}}_{j} - α_{⋆} β_{j})}^{2} \overset{a.s.}{\to} σ_{⋆}^{2} .

As before, this can also be seen from the empirical results from the previous section. When $κ = 0.2$ and $γ = \sqrt{5}$ , the solution to Eq. 5 obeys $σ_{⋆} = 4.744$ and this is what we see in Fig. 4.

•
Third, our result establishes that upon centering the MLE around $α_{⋆} β$ , it becomes decorrelated from the signal $β$ . This can be seen by taking $ψ (t, u) = t u$ , which leads to

\frac{1}{p} \sum_{j = 1}^{p} ({\hat{β}}_{j} - α_{⋆} β_{j}) β_{j} \overset{a.s.}{\to} 0 .

This can be seen from our earlier empirical results in Fig. 6B. The scatter directly shows the decorrelated structure and the x axis passes right through the center, corroborating our theoretical finding.

It is of course interesting to study how the bias $α_{⋆}$ and the SD $σ_{⋆}$ depend on the dimensionality $κ$ and the signal strength $γ$ . We numerically observe that the larger the dimensionality and/or the larger the signal strength, the larger the bias $α_{⋆}$ will be. This dependence is illustrated in Fig. 7A. Further, note that as $κ$ approaches 0, the bias $α_{⋆} \to 1$ , indicating that the MLE is asymptotically unbiased if $p = o (n)$ . The same behavior applies to $σ_{⋆}$ ; that is, $σ_{⋆}$ increases in either $κ$ or $γ$ as shown in Fig. 7B. This plot shows the theoretical prediction $σ_{⋆}$ divided by the average classical SD obtained from $I_{β}^{- 1}$ , the inverse of the Fisher information. As $κ$ approaches 0, the ratio goes to 1, indicating that the classical SD value is valid for $p = o (n)$ ; this is true across all values of $γ$ . As $κ$ increases, the ratio deviates increasingly from 1 and we observe higher and higher variance inflation. In summary, the MLE increasingly deviates from what is classically expected as the dimensionality, the signal strength, or both increase.

Fig. 7. — (A) Bias $α_{⋆}$ as a function of $κ$ , for different values of the signal strength $γ$ . Note the logarithmic scale for the y axis. The curves asymptote at the value of $κ$ for which the MLE ceases to exist. (B) Ratio of the theoretical SD $σ_{⋆}$ and the average SD of the coordinates, as obtained from classical theory; i.e., computed using the inverse of the Fisher information. (C) Functional dependence of the rescaling constant $κ σ_{⋆}^{2} / λ_{⋆}$ on the parameters $κ$ and $γ$ .

Theorem 2 is an asymptotic result, and we study how fast the asymptotic kicks in as we increase the sample size $n$ . To this end, we set $κ = 0.1$ and let half of the coordinates of $β$ have constant value 10 and the other half be 0. Note that in this example, $γ^{2} = 5$ as before. Our goal is to empirically determine the parameters $α_{⋆}$ and $σ_{⋆}$ from $68,000$ runs, for each $n$ taking values in ${2,000,4,000,8,000}$ . Note that there are several ways of determining $α_{⋆}$ empirically. For instance, the limit Eq. 9 directly suggests taking the ratio $\sum_{j} {\hat{β}}_{j} / \sum_{j} β_{j}$ . An alternative is to consider taking the ratio when restricting the summation to nonzero indexes. Empirically, we find there is not much difference between these two choices and choose the latter option, denoting it as $\hat{α}$ . With $κ = 0.1, γ = \sqrt{5}$ , the solution to Eq. 5 is equal to $α_{⋆} = 1.1678, σ_{⋆} = 3.3466, λ_{⋆} = 0.9605$ . Table 2 shows that $\hat{α}$ is very slightly larger than $α_{⋆}$ in finite samples. However, observe that as the sample size increases, $\hat{α}$ approaches $α_{⋆}$ , confirming the result from Eq. 9.

Table 2.

Empirical estimates of the centering and SD of the MLE

Parameter	$p = 200$	$p = 400$	$p = 800$
$α_{⋆} = 1.1678$	$1.1703 (0.0002)$	$1.1687 (0.0002)$	$1.1681 (0.0001)$
$σ_{⋆} = 3.3466$	$3.3567 (0.0011)$	$3.3519 (0.0008)$	$3.3489 (0.0006)$

Open in a new tab

SEs of these estimates are within parentheses. In this setting, κ=0.1 and γ² =5. Half of the β_js are equal to 10 and the others to 0.

4.d. The Distribution of the Null MLE Coordinates.

Whereas Theorem 2 describes the average or bulk behavior of the MLE across all of its entries, our next result provides the explicit distribution of ${\hat{β}}_{j}$ whenever $β_{j} = 0$ , i.e., whenever the $j$ th variable is independent from the response $y$ .

Theorem 3.

Let $j$ be any variable such that $β_{j} = 0$ . Then in the setting of Theorem 2, the MLE obeys

{\hat{β}}_{j} \overset{d}{\to} N (0, σ_{⋆}^{2}) .

[10]

Further, for any finite subset of null variables ${i_{1}, \dots, i_{k}}$ , the components of $({\hat{β}}_{i_{1}}, \dots, {\hat{β}}_{i_{k}})$ are asymptotically independent.

In words, the null MLE coordinates are asymptotically normal with mean 0 and variance given by the solution to the system Eq. 5. An important remark is this: We have observed that $σ_{⋆}$ is an increasing function of $γ$ . Hence, the distribution of a null MLE coordinate depends on the magnitude of the remaining coordinates of the signal: The variance increases as the other coefficient magnitudes increase.

We return to the finite-sample precision of the asymptotic variance $σ_{⋆}^{2}$ . As an empirical estimate, we use ${\hat{β}}_{j}^{2}$ averaged over the null coordinates ${j : β_{j} = 0}$ since it is approximately unbiased for $σ_{⋆}^{2}$ . We work in the setting of Table 2 in which $σ_{⋆} = 3.3466$ , averaging our $68,000$ estimates. The results are given in Table 2; we observe that $\hat{σ}$ is very slightly larger than $σ_{⋆}$ . However, it progressively gets closer to $σ_{⋆}$ as the sample size $n$ increases.

Next, we study the accuracy of the asymptotic convergence results in Eq. 4. In the setting of Table 2, we fit $500,000$ independent logistic regression models and plot the empirical cumulative distribution function of $Φ ({\hat{β}}_{j} / σ_{⋆})$ in Fig. 8A for some fixed null coordinate. Observe the perfect agreement with a straight line of slope 1.

Fig. 8. — The setting is that from Table 2 with $n = 4, 000$ . (A) Empirical cdf of $Φ ({\hat{β}}_{j} / σ_{⋆})$ for a null variable ( $β_{j} = 0$ ). (B) P values given by the LLR approximation Eq. 11 for this same null variable. (C) Empirical distribution of the P values from B. (D) Same as C but showing accuracy in the lower tail (check the range of the horizontal axis). All these plots are based on 500,000 replicates.

4.e. The Distribution of the LRT.

We finally consider the distribution of the likelihood-ratio statistic for testing $β_{j} = 0$ .

Theorem 4.

Consider the LLR $Λ_{j} = {min}_{b : b_{j} = 0} ℓ (b) - {min}_{b} ℓ (b)$ for testing $β_{j} = 0$ . In the setting of Theorem 2, twice the LLR is asymptotically distributed as a multiple of a χ² under the null,

2 Λ_{j} \overset{d}{\to} \frac{κ σ_{⋆}^{2}}{λ_{⋆}} χ_{1}^{2} .

[11]

Also, the LLR for testing $β_{i_{1}} = β_{i_{2}} = \dots = β_{i_{k}} = 0$ for any finite $k$ converges to the rescaled χ² $(κ σ_{⋆}^{2} / λ_{⋆}) χ_{k}^{2}$ under the null.

Theorem 4 explicitly states that the LLR does not follow a $χ_{1}^{2}$ distribution as soon as $κ > 0$ since the multiplicative factor is then larger than 1, as demonstrated in Fig. 7C. In other words, the LLR is stochastically much larger than a $χ_{1}^{2}$ , explaining the large spike near 0 in Fig. 5. Also, Fig. 7C suggests that as $κ \to 0$ , the classical result is recovered.^‡ We refer the readers to SI Appendix, section G for an empirical comparison of the P values based on Eq. 11 and classical P values in settings with small $κ$ and moderate $n, p$ .

Theorem 4 extends to arbitrary signal strengths the earlier result from ref. 7, which described the distribution of the LLR under the global null ( $β_{j} = 0$ for all $j$ ). One can quickly verify that when $γ = 0$ , the multiplicative factor in Eq. 11 is that given in ref. 7, which easily follows from the fact that in this case, Eq. 5 reduces to Eq. 8. Furthermore, if the signal is sparse in the sense that $o (n)$ coefficients have nonzero values, then $γ^{2} = 0$ , which immediately implies that the asymptotic distribution for the LLR from ref. 7 still holds in such cases.

To investigate the quality of the accuracy of Eq. 11 in finite samples, we work on the P-value scale. We select a null coefficient and compute P values based on Eq. 11. The histogram for the P values across $500,000$ runs is shown in Fig. 8B and the empirical cumulative distribution function (cdf) in Fig. 8C. In stark contrast to Fig. 4, we observe that the P values are uniform over the bulk of the distribution.

From a multiple-testing perspective, it is essential to understand the accuracy of the rescaled χ² approximation in the tails of the distribution. We plot the empirical cdf of the P values, zooming in on the tail, in Fig. 8D. We find that the rescaled χ² approximation works well even in the tails of the distribution. To obtain a more refined idea of the quality of approximation, we zoom in on the smaller bins close to 0 and provide estimates of the P-value probabilities in Table 3 for $n = 4,000$ and $n = 8,000$ . The tail approximation is accurate, modulo a slight deviation in the bin for $P {P - value} \leq 0.1$ for the smaller sample size. For $n = 8,000$ , however, this deviation vanishes and we find perfect coverage of the true values. It seems that our approximation is extremely precise even in the tails.

Table 3.

P-value probabilities estimated over $500,000$ replicates with standard errors in parentheses

Threshold, %	$p = 400$ , %	$p = 800$ , %
$P {P value \leq 5}$	$5.03 (0.031)$	$5.01 (0.03)$
$P {P value \leq 1}$	$1.002 (0.014)$	$1.005 (0.014)$
$P {P value \leq 0.5}$	$0.503 (0.01)$	$0.49 (0.0099)$
$P {P value \leq 0.1}$	$0.109 (0.004)$	$0.096 (0.0044)$
$P {P value \leq 0.05}$	$0.052 (0.003)$	$0.047 (0.0031)$
$P {P value \leq 0.01}$	$0.008 (0.0013)$	$0.008 (0.0013)$

Open in a new tab

Here, κ=0.1 and the setting is otherwise the same as in Table 2.

4.f. Other Scalings.

Throughout this section, we worked under the assumption that $lim_{n \to \infty} Var (X_{i}^{'} β) = γ^{2}$ , which does not depend on $n$ , and we explained that this is the only scaling that makes sense to avoid a trivial problem. We set the variables to have variance $1 / n$ but this is of course somewhat arbitrary. For example, we could choose them to have variance $v$ as in $X_{i} \sim N (0, v I_{p})$ . This means that $X_{i} = \sqrt{v n} Z_{i}$ , where $Z_{i}$ is as before. This gives $X_{i}^{'} β = Z_{i}^{'} b$ , where $β = b / \sqrt{n v}$ . The conclusions from Theorems 2 and 3 then hold for the model with predictors $Z_{i}$ and regression coefficient sequence $b$ . Consequently, by simple rescaling, we can pass the properties of the MLE in this model to those of the MLE in the model with predictors $X_{i}$ and coefficients $β$ . For instance, the SE of $\hat{β}$ is equal to $σ_{⋆} / \sqrt{n v}$ , where $σ_{⋆}$ is just as in Theorems 2 and 3. On the other hand, the result for the LRT, namely Theorem 4, is scale invariant.

4.g. Non-Gaussian Covariates

Our model assumes that the features are Gaussian. However, we expect that the same results hold under other distributions with the proviso that they have sufficiently light tails. In this section, we empirically study the applicability of our results for certain non-Gaussian features.

In genetic studies, we often wish to understand how a binary response/phenotype depends on single-nucleotide polymorphisms (SNPs), which typically take on values in ${0,1,2}$ . When the $j$ th SNP is in Hardy–Weinberg equilibrium, the chance of observing 0, 1, and 2 is respectively $p_{j}^{2}$ , $2 p_{j} (1 - p_{j})$ , and ${(1 - p_{j})}^{2}$ , where $p_{j}$ is between 0 and 1. Below we generate independent features with marginal distributions as above for parameters $p_{j}$ varying in $[0.25, 0.75]$ . We then center and normalize each column of the feature matrix $X$ to have 0 mean and unit variance. Keeping everything else as in the setting of Fig. 3, we study the bias of the MLE in Fig. 9A. As for Gaussian designs, the MLE seriously overestimates effect magnitudes and our theoretical prediction $α_{⋆}$ accurately corrects for the bias. We also see that the bias-adjusted residuals $\hat{β} - α_{⋆} β$ are uncorrelated with the effect sizes $β$ , as shown in Fig. 9B.

Fig. 9. — Simulation for a non-Gaussian design. The $j$ th feature takes values in ${0,1,2}$ with probabilities $p_{j}^{2}, 2 p_{j} (1 - p_{j}), {(1 - p_{j})}^{2}$ ; here, $p_{j} \in [0.25, 0.75]$ and $p_{j} \neq p_{k}$ for $j \neq k$ . Features are then centered and rescaled to have unit variance. The setting is otherwise the same as for Fig. 3. (A) Analogue of Fig. 3A. Red line has slope $α_{⋆} \approx 1.499$ . (B) Analogue of Fig. 6B. Observe the same behavior as earlier: The theory predicts correctly the bias and the decorrelation between the bias-adjusted residuals and the true effect sizes.

The bulk distribution of a null coordinate suggested by Theorem 3 and the LRT distribution from Theorem 4 are displayed in Fig. 10. Other than the design, the setting is the same as in Fig. 8. The theoretical predictions are once again accurate. Furthermore, upon examining the tails of the P-value distribution, we once more observe a close agreement with our theoretical predictions. All in all, these findings indicate that our theory is expected to apply to a far broader class of features. We conduct additional experiments based on real-data design matrices in SI Appendix, section F.

That said, we caution readers against overinterpreting our results. For linear models, for example, it is known that the distribution of M estimators in high dimensions can be significantly different if the covariates follow a general elliptical distribution, rather than a normal distribution (11, 14).

5. Adjusting Inference by Estimating the Signal Strength

All of our asymptotic results, namely, the average behavior of the MLE, the asymptotic distribution of a null coordinate, and the LLR, depend on the unknown signal strength $γ$ . In this section, we describe a simple procedure for estimating this single parameter from an idea proposed by Boaz Nadler and Rina Barber after E.J.C. presented the results from this paper at the Mathematisches Forshunginstitut Oberwolfach on March 12, 2018.

5.a. ProbeFrontier: Estimating $γ$ by Probing the MLE Frontier

We estimate the signal strength by actually using the predictions from our theory, namely, the fact that we have asymptotically characterized in Section 4.a the region where the MLE exists. We know from Theorem 1 that for each $γ$ , there is a maximum dimensionality $g_{MLE}^{- 1} (γ)$ at which the MLE ceases to exist. We propose an estimate $\hat{κ}$ of $g_{MLE}^{- 1} (γ)$ and set $\hat{γ} = g_{MLE} (\hat{κ})$ . Below, we refer to this as the ProbeFrontier method.

Given a data sample $(y_{i}, X_{i})$ , we begin by choosing a fine grid of values $κ \leq κ_{1} \leq κ_{2} \leq \dots \leq κ_{K} \leq 1 / 2$ . For each $κ_{j}$ , we execute the following procedure:

Subsample.

Sample $n_{j} = p / κ_{j}$ observations from the data without replacement, rounding to the nearest integer. Ignoring the rounding, the dimensionality of this subsample is $p / n_{j} = κ_{j}$ .

Check whether MLE exists.

For the subsample, check whether the MLE exists or not. This is done by solving a linear programing feasibility problem; if there exists a vector $b \in R^{p}$ such that $X_{i}^{'} b$ is positive when $y_{i} = 1$ and negative otherwise, then perfect separation between cases and controls occurs and the MLE does not exist. Conversely, if the linear program is infeasible, then the MLE exists.

Repeat.

Repeat the two previous steps $B$ times and compute the proportion of times $\hat{π} (κ_{j})$ the MLE does not exist.

We next find $(κ_{j - 1}, κ_{j})$ , such that $κ_{j}$ is the smallest value in $K$ for which $\hat{π} (κ_{j}) \geq 0.5$ . By linear interpolation between $κ_{j - 1}$ and $κ_{j}$ , we obtain $\hat{κ}$ for which the proportion of times the MLE does not exist would be 0.5. We set $\hat{γ} = g_{MLE} (\hat{κ})$ . (Since the “phase-transition” boundary for the existence of the MLE is a smooth function of $κ$ , as is clear from Fig. 6, choosing a sufficiently fine grid ${κ_{j}}$ would make the linear interpolation step sufficiently precise.)

5.b. Empirical Performance of Adjusted Inference

We demonstrate the accuracy of ProbeFrontier via some empirical results. We begin by generating 4,000 i.i.d. observations $(y_{i}, X_{i})$ using the same setup as in Fig. 8 ( $κ = 0.1$ and half of the regression coefficients are null). We work with a sequence ${κ_{j}}$ of points spaced apart by $1 0^{- 3}$ and obtain $\hat{γ}$ via the procedure described above, drawing 50 subsamples. Solving the system Eq. 5 using $κ = 0.1$ and $\hat{γ}$ yields estimates for the theoretical predictions $(α_{⋆}, σ_{⋆}, λ_{⋆})$ equal to $(\hat{α}, \hat{σ}, \hat{λ}) = (1.1681, 3.3513, 0.9629)$ . In turn, this yields an estimate for the multiplicative factor $κ σ_{⋆}^{2} / λ_{⋆}$ in Eq. 11 equal to $1.1663$ . Recall from Section 4 that the theoretical values are $(α_{⋆}, σ_{⋆}, λ_{⋆}) = (1.1678, 3.3466, 0.9605)$ and $κ σ_{⋆}^{2} / λ_{⋆} = 1.1660$ . Next, we compute the LLR statistic for each null and P values from the approximation Eq. 11 in two ways: First, by using the theoretically predicted values, and second, by using our estimates. A scatterplot of these two sets of P values is shown in Fig. 11 (blue). We observe impeccable agreement.

Fig. 11. — Null P values obtained using the $(κ {\hat{σ}}^{2} / \hat{λ}) χ_{1}^{2}$ approximation plotted against those obtained using $(κ σ_{⋆}^{2} / λ_{⋆}) χ_{1}^{2}$ . Observe the perfect agreement with the red diagonal.

Next, we study the accuracy of $\hat{γ}$ across different choices for $γ$ , ranging from 0.3 to 5. We begin by selecting a fine grid of $γ$ values and for each, we generate observations $(y_{i}, X_{i})$ with $n = 4,000$ , $p = 400$ (so that $κ = 0.1$ ), and half the coefficients have a nonvanishing magnitude scaled in such a way that the signal strength is $γ$ . Fig. 12A displays $\hat{γ}$ vs. $γ$ in blue, and we note that ProbeFrontier works very well. We observe that the blue points fluctuate very mildly above the diagonal for larger values of the signal strength but remain extremely close to the diagonal throughout. This confirms that ProbeFrontier estimates the signal strength $γ$ with reasonably high precision. Having obtained an accurate estimate for $γ$ , plugging it into Eq. 5 immediately yields an estimate for the bias $α_{⋆}$ , SD $σ_{⋆}$ , and the rescaling factor in Eq. 11. We study the accuracy of these estimates in Fig. 12 B–D. We observe a similar behavior in all these cases, with the procedure yielding extremely precise estimates for smaller values and reasonably accurate estimates for higher values.

Fig. 12. — (*A–D*) ProbeFrontier estimates of signal strength $\hat{γ}$ (A), bias $\hat{α}$ (B), SD $\hat{σ}$ (C), and LRT factor $κ {\hat{σ}}^{2} / \hat{λ}$ (D) in Eq. 11, plotted against the theoretical values.

Finally, we focus on the estimation accuracy for a particular $(κ, γ)$ pair across several replicates. In the setting of Fig. 8, we generate 6,000 samples and obtain estimates of bias ( $\hat{α}$ ), SD ( $\hat{σ}$ ), and rescaling factor for the LRT ( $κ {\hat{σ}}^{2} / \hat{λ}$ ). The averages of these estimates are reported in Table 4. Our estimates always recover the true values up to the first digit. It is instructive to study the precision of the procedure on the P-value scale. To this end, we compute P values from Eq. 11, using the estimated multiplicative factor $κ {\hat{σ}}^{2} / \hat{λ}$ . The empirical cdf of the P values both in the bulk and in the extreme tails is shown in Fig. 13. We observe perfect agreement with the uniform distribution, establishing the practical applicability of our theory and methods.

Table 4.

Parameter estimates in the setting of Table 2: Averages over 6,000 replicates with SEs within parentheses

Parameters	True	Estimates
$γ$	2.2361	$2.2771 (0.0012)$
$α$	1.1678	$1.1698 (0.0001)$
$σ$	3.3466	$3.3751 (0.0008)$
$κ σ^{2} / λ$	1.166	$1.1680 (0.0001)$

Open in a new tab

Fig. 13. — (A) Empirical distribution of the P values based on the LLR approximation Eq. 11, obtained using the estimated factor $κ {\hat{σ}}^{2} / \hat{λ}$ . (B) Same as A, but showing the tail of the empirical cdf. The calculations are based on 500,000 replicates.

5.c. Debiasing the MLE and Its Predictions

We have seen that maximum likelihood produces biased coefficient estimates and predictions. The question is, how precisely can our proposed theory and methods correct this? Recall the example from Fig. 3, where the theoretical prediction for the bias is $α_{⋆} = 1.499$ . For this dataset, ProbeFrontier yields $\hat{α} = 1.511$ , shown as the green line in Fig. 14A. Clearly, the estimate of bias is extremely precise and coefficient estimates ${\hat{β}}_{j} / \hat{α}$ appear nearly unbiased.

Further, we can also use our estimate of bias to refine the predictions since we can estimate the regression function by $ρ' (X' \hat{β} / \hat{α})$ . Fig. 14B shows our predictions on the same dataset. In stark contrast to Fig. 3B, the predictions are now centered around the regression function and the massive shrinkage toward the extremes has disappeared. The predictions constructed from the debiased MLE are more accurate and no longer falsely predict almost certain outcomes. Rather, we obtain fairly nontrivial chances of being classified in either of the two response categories—as it should be.

6. Broader Implications and Future Directions

This paper shows that in high dimensions, classical ML theory has limitations; e.g., classical theory predicts that the MLE is approximately unbiased when in reality it overestimates effect magnitudes. Since the purpose of logistic modeling is to estimate the risk of a specific disease given a patient’s observed characteristics, for example, the bias of the MLE is problematic. A consequence of the bias is that the MLE pushes the predicted chance of being sick toward 0 or 1. This, along with the fact that P values computed from classical approximations are misleading in high dimensions, clearly make the case that routinely used statistical tools fail to provide meaningful inferences from both an estimation and a testing perspective.

We have developed a theory which gives the asymptotic distribution of the MLE and the LRT in a model with independent covariates. As seen in Section 4.g, our results likely hold for a broader range of feature distributions (i.e., other than Gaussian) and it would be important to establish this rigorously. Further, we have also shown how to adjust inference by plugging in an estimate of signal strength in our theoretical predictions.

We conclude with a few directions for future work: It would be of interest to develop corresponding results in the case where the predictors are correlated and to extend the results from this paper to other generalized linear models. Further, it is crucial to understand the robustness of our proposed P values to model misspecifications. We provide some preliminary simulations in SI Appendix, section E. Finally, covariates following general elliptical distributions can be challenging, as shown in refs. 11 and 14 in the context of linear models. Hence, caution is in order as to the broader applicability of our theory.

Supplementary Material

Supplementary File

pnas.1810420116.sapp.pdf^{(880.4KB, pdf)}

Acknowledgments

P.S. was partially supported by a Ric Weiland Graduate Fellowship. E.J.C. was partially supported by the Office of Naval Research under Grant N00014-16-1-2712, by the National Science Foundation via DMS 1712800, by the Math + X Award from the Simons Foundation, and by a generous gift from TwoSigma. P.S. thanks Andrea Montanari and Subhabrata Sen for helpful discussions about approximate message passing. We thank Małgorzata Bogdan, Nikolaos Ignatiadis, Asaf Weinstein, Lucas Janson, Stephen Bates, and Chiara Sabatti for useful comments about an early version of this paper.

Footnotes

The authors declare no conflict of interest.

^†A function $ψ : R^{m} \to R$ is said to be pseudo-Lipschitz of order $k$ if there exists a constant $L > 0$ such that for all $t_{0}, t_{1} \in R^{m}$ , $‖ ψ (t_{0}) - ψ (t_{1}) ‖ \leq L (1 + {‖ t_{0} ‖}^{k - 1} + {‖ t_{1} ‖}^{k - 1}) ‖ t_{0} - t_{1} ‖$ .

^‡For the analytically motivated reader, we remark that unlike classical theory, here the asymptotic distribution of the LRT, i.e., Theorem 4, does not follow directly from the asymptotic normality result in Theorem 3. It requires additional probabilistic analysis.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1810420116/-/DCSupplemental.

References

1.Cox D. R., The regression analysis of binary sequences. J. R. Stat. Soc. Ser. B (Methodol.) 20, 215–232 (1958). [Google Scholar]
2.Hosmer D. W., Lemeshow S., Applied Logistic Regression (John Wiley & Sons, 2013), vol. 398. [Google Scholar]
3.Van der Vaart A. W., Asymptotic Statistics (Cambridge University Press; ), vol. 3. [Google Scholar]
4.R Core Team , R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, Austria, 2018). [Google Scholar]
5.Wilks S. S., The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Stat. 9, 60–62 (1938). [Google Scholar]
6.Candes E., Fan Y., Janson L., Lv J., Panning for gold: ‘Model-x’ knockoffs for high dimensional controlled variable selection. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 80, 551–577 (2018). [Google Scholar]
7.Sur P., Chen Y., Candès E. J., The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square. Probab. Theory Relat. Fields, 10.1007/s00440-018-00896-9 (23 January 2019). [DOI] [Google Scholar]
8.Huber P. J., Robust regression: Asymptotics, conjectures and Monte Carlo. Ann. Stat. 1, 799–821 (1973). [Google Scholar]
9.Portnoy S., Asymptotic behavior of M-estimators of $p$ regression parameters when $p^{2} / n$ is large. I. Consistency. Ann. Stat. 12, 1298–1309 (1984) . [Google Scholar]
10.Portnoy S., Asymptotic behavior of M-estimators of $p$ regression parameters when $p^{2} / n$ is large; II. Normal approximation. Ann. Stat. 13, 1403–1417 (1985). [Google Scholar]
11.El Karoui N., Bean D., Bickel P. J., Lim C., Yu B., On robust regression with high-dimensional predictors. Proc. Natl. Acad. Sci. U.S.A 110, 14557–14562(2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Bean D., Bickel P. J., El Karoui N., Yu B., Optimal m-estimation in high-dimensional regression. Proc. Natl. Acad. Sci. U.S.A. 110, 14563–14568 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Donoho D., Montanari A., High dimensional robust M-estimation: Asymptotic variance via approximate message passing. Probab. Theory Relat. Fields 166, 935–969 (2013). [Google Scholar]
14.El Karoui N., On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probab. Theory Relat. Fields 170, 95–175 (2017). [Google Scholar]
15.Bradic J., Robustness in sparse high-dimensional linear models: Relative efficiency and robust approximate message passing. Electron. J. Stat. 10, 3894–3944(2016). [Google Scholar]
16.El Karoui N., Random matrices and high-dimensional M-estimation: Applications to robust regression, penalized robust regression and GLMs. https://video.seas.harvard.edu/media/14_03_31+Noureddine+El+Karoui–AM.mp4/1_e5szurf0/13151391 (31 March 2014).
17.Portnoy S., Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. Ann. Stat. 16, 356–366 (1988). [Google Scholar]
18.He X., Shao Q. M., On parameters of increasing dimensions. J. Multivariate Anal. 73, 120–135 (2000). [Google Scholar]
19.Fan Y., Demirkaya E., Lv J., Nonuniformity of p-values can occur early in diverging dimensions. arXiv:1705.03604 (10 May 2017). [PMC free article] [PubMed]
20.Bunea F., Honest variable selection in linear and logistic regression models via l1 and l1+l2 penalization. Electron J. Stat. 2, 1153–1194 (2008). [Google Scholar]
21.Van de Geer S., Bühlmann P., Ritov Y., Dezeure R., On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Stat. 42, 1166–1202 (2014). [Google Scholar]
22.Firth D., Bias reduction of maximum likelihood estimates. Biometrika 80, 27–38 (1993). [Google Scholar]
23.Anderson J., Richardson S., Logistic discrimination and bias correction in maximum likelihood estimation. Technometrics 21, 71–78 (1979). [Google Scholar]
24.McLachlan G., A note on bias correction in maximum likelihood estimation with logistic discrimination. Technometrics 22, 621–627 (1980). [Google Scholar]
25.Schaefer R. L., Bias correction in maximum likelihood logistic regression. Stat. Med. 2, 71–78 (1983). [DOI] [PubMed] [Google Scholar]
26.Copas J. B., Binary regression models for contaminated data. J. R. Stat. Soc. Ser. B (Methodol.) 50, 225–253 (1988). [Google Scholar]
27.Cordeiro G. M., McCullagh P., Bias correction in generalized linear models. J. R. Stat. Soc. Ser. B (Methodol.) 53, 629–643 (1991). [Google Scholar]
28.Bull S., Hauck W., Greenwood C., Two-step jackknife bias reduction for logistic regression MLEs. Commun. Stat. Simul. Comput. 23, 59–88 (1994). [Google Scholar]
29.Jennings D. E., Judging inference adequacy in logistic regression. J. Am. Stat. Assoc. 81, 471–476 (1986). [Google Scholar]
30.Bartlett M. S., Properties of sufficiency and statistical tests. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. 160, 268–282 (1937). [Google Scholar]
31.Box G., A general distribution theory for a class of likelihood criteria. Biometrika 36, 317–346 (1949). [PubMed] [Google Scholar]
32.Lawley D., A general method for approximating to the distribution of likelihood ratio criteria. Biometrika 43, 295–303 (1956). [Google Scholar]
33.Cordeiro G. M., Improved likelihood ratio statistics for generalized linear models. J. R. Stat. Soc. Ser. B (Methodol.) 45, 404–413 (1983). [Google Scholar]
34.Bickel P. J., Ghosh J., A decomposition for the likelihood ratio statistic and the Bartlett correction–A Bayesian argument. Ann. Stat. 18, 1070–1090 (1990). [Google Scholar]
35.Moulton L. H, Weissfeld L. A., Laurent R. T. S., Bartlett correction factors in logistic regression models. Comput. Stat. Data Anal. 15, 1–11 (1993). [Google Scholar]
36.Peduzzi P., Concato J., Kemper E., Holford T. R., Feinstein A. R., A simulation study of the number of events per variable in logistic regression analysis. J. Clin. Epidemiol. 49, 1373–1379 (1996). [DOI] [PubMed] [Google Scholar]
37.Vittinghoff E., McCulloch C. E., Relaxing the rule of ten events per variable in logistic and Cox regression. Am. J. Epidemiol. 165, 710–718 (2007). [DOI] [PubMed] [Google Scholar]
38.Courvoisier D. S., Combescure C, Agoritsas T., Gayet-Ageron A., Perneger T. V., Performance of logistic regression modeling: Beyond the number of events per variable, the role of data structure. J. Clin. Epidemiol. 64, 993–1000 (2011). [DOI] [PubMed] [Google Scholar]
39.Candès E. J., Sur P., The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. arXiv:1804.09753 (25 April 2018). [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary File

pnas.1810420116.sapp.pdf^{(880.4KB, pdf)}

[r1] 1.Cox D. R., The regression analysis of binary sequences. J. R. Stat. Soc. Ser. B (Methodol.) 20, 215–232 (1958). [Google Scholar]

[r2] 2.Hosmer D. W., Lemeshow S., Applied Logistic Regression (John Wiley & Sons, 2013), vol. 398. [Google Scholar]

[r3] 3.Van der Vaart A. W., Asymptotic Statistics (Cambridge University Press; ), vol. 3. [Google Scholar]

[r4] 4.R Core Team , R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, Austria, 2018). [Google Scholar]

[r5] 5.Wilks S. S., The large-sample distribution of the likelihood ratio for testing composite hypotheses. Ann. Math. Stat. 9, 60–62 (1938). [Google Scholar]

[r6] 6.Candes E., Fan Y., Janson L., Lv J., Panning for gold: ‘Model-x’ knockoffs for high dimensional controlled variable selection. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 80, 551–577 (2018). [Google Scholar]

[r7] 7.Sur P., Chen Y., Candès E. J., The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square. Probab. Theory Relat. Fields, 10.1007/s00440-018-00896-9 (23 January 2019). [DOI] [Google Scholar]

[r8] 8.Huber P. J., Robust regression: Asymptotics, conjectures and Monte Carlo. Ann. Stat. 1, 799–821 (1973). [Google Scholar]

[r9] 9.Portnoy S., Asymptotic behavior of M-estimators of $p$ regression parameters when $p^{2} / n$ is large. I. Consistency. Ann. Stat. 12, 1298–1309 (1984) . [Google Scholar]

[r10] 10.Portnoy S., Asymptotic behavior of M-estimators of $p$ regression parameters when $p^{2} / n$ is large; II. Normal approximation. Ann. Stat. 13, 1403–1417 (1985). [Google Scholar]

[r11] 11.El Karoui N., Bean D., Bickel P. J., Lim C., Yu B., On robust regression with high-dimensional predictors. Proc. Natl. Acad. Sci. U.S.A 110, 14557–14562(2013). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r12] 12.Bean D., Bickel P. J., El Karoui N., Yu B., Optimal m-estimation in high-dimensional regression. Proc. Natl. Acad. Sci. U.S.A. 110, 14563–14568 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r13] 13.Donoho D., Montanari A., High dimensional robust M-estimation: Asymptotic variance via approximate message passing. Probab. Theory Relat. Fields 166, 935–969 (2013). [Google Scholar]

[r14] 14.El Karoui N., On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probab. Theory Relat. Fields 170, 95–175 (2017). [Google Scholar]

[r15] 15.Bradic J., Robustness in sparse high-dimensional linear models: Relative efficiency and robust approximate message passing. Electron. J. Stat. 10, 3894–3944(2016). [Google Scholar]

[r16] 16.El Karoui N., Random matrices and high-dimensional M-estimation: Applications to robust regression, penalized robust regression and GLMs. https://video.seas.harvard.edu/media/14_03_31+Noureddine+El+Karoui–AM.mp4/1_e5szurf0/13151391 (31 March 2014).

[r17] 17.Portnoy S., Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. Ann. Stat. 16, 356–366 (1988). [Google Scholar]

[r18] 18.He X., Shao Q. M., On parameters of increasing dimensions. J. Multivariate Anal. 73, 120–135 (2000). [Google Scholar]

[r19] 19.Fan Y., Demirkaya E., Lv J., Nonuniformity of p-values can occur early in diverging dimensions. arXiv:1705.03604 (10 May 2017). [PMC free article] [PubMed]

[r20] 20.Bunea F., Honest variable selection in linear and logistic regression models via l1 and l1+l2 penalization. Electron J. Stat. 2, 1153–1194 (2008). [Google Scholar]

[r21] 21.Van de Geer S., Bühlmann P., Ritov Y., Dezeure R., On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Stat. 42, 1166–1202 (2014). [Google Scholar]

[r22] 22.Firth D., Bias reduction of maximum likelihood estimates. Biometrika 80, 27–38 (1993). [Google Scholar]

[r23] 23.Anderson J., Richardson S., Logistic discrimination and bias correction in maximum likelihood estimation. Technometrics 21, 71–78 (1979). [Google Scholar]

[r24] 24.McLachlan G., A note on bias correction in maximum likelihood estimation with logistic discrimination. Technometrics 22, 621–627 (1980). [Google Scholar]

[r25] 25.Schaefer R. L., Bias correction in maximum likelihood logistic regression. Stat. Med. 2, 71–78 (1983). [DOI] [PubMed] [Google Scholar]

[r26] 26.Copas J. B., Binary regression models for contaminated data. J. R. Stat. Soc. Ser. B (Methodol.) 50, 225–253 (1988). [Google Scholar]

[r27] 27.Cordeiro G. M., McCullagh P., Bias correction in generalized linear models. J. R. Stat. Soc. Ser. B (Methodol.) 53, 629–643 (1991). [Google Scholar]

[r28] 28.Bull S., Hauck W., Greenwood C., Two-step jackknife bias reduction for logistic regression MLEs. Commun. Stat. Simul. Comput. 23, 59–88 (1994). [Google Scholar]

[r29] 29.Jennings D. E., Judging inference adequacy in logistic regression. J. Am. Stat. Assoc. 81, 471–476 (1986). [Google Scholar]

[r30] 30.Bartlett M. S., Properties of sufficiency and statistical tests. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. 160, 268–282 (1937). [Google Scholar]

[r31] 31.Box G., A general distribution theory for a class of likelihood criteria. Biometrika 36, 317–346 (1949). [PubMed] [Google Scholar]

[r32] 32.Lawley D., A general method for approximating to the distribution of likelihood ratio criteria. Biometrika 43, 295–303 (1956). [Google Scholar]

[r33] 33.Cordeiro G. M., Improved likelihood ratio statistics for generalized linear models. J. R. Stat. Soc. Ser. B (Methodol.) 45, 404–413 (1983). [Google Scholar]

[r34] 34.Bickel P. J., Ghosh J., A decomposition for the likelihood ratio statistic and the Bartlett correction–A Bayesian argument. Ann. Stat. 18, 1070–1090 (1990). [Google Scholar]

[r35] 35.Moulton L. H, Weissfeld L. A., Laurent R. T. S., Bartlett correction factors in logistic regression models. Comput. Stat. Data Anal. 15, 1–11 (1993). [Google Scholar]

[r36] 36.Peduzzi P., Concato J., Kemper E., Holford T. R., Feinstein A. R., A simulation study of the number of events per variable in logistic regression analysis. J. Clin. Epidemiol. 49, 1373–1379 (1996). [DOI] [PubMed] [Google Scholar]

[r37] 37.Vittinghoff E., McCulloch C. E., Relaxing the rule of ten events per variable in logistic and Cox regression. Am. J. Epidemiol. 165, 710–718 (2007). [DOI] [PubMed] [Google Scholar]

[r38] 38.Courvoisier D. S., Combescure C, Agoritsas T., Gayet-Ageron A., Perneger T. V., Performance of logistic regression modeling: Beyond the number of events per variable, the role of data structure. J. Clin. Epidemiol. 64, 993–1000 (2011). [DOI] [PubMed] [Google Scholar]

[r39] 39.Candès E. J., Sur P., The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. arXiv:1804.09753 (25 April 2018). [Google Scholar]

PERMALINK

A modern maximum-likelihood theory for high-dimensional logistic regression

Pragya Sur

Emmanuel J Candès

Series information

Significance

Abstract

Fig. 1.

1. Failures in Moderately Large Dimensions

Unbiasedness?

Fig. 2.

Fig. 3.

Accuracy of Classical Standard Errors?

Fig. 4.

Distribution of the LRT?

Fig. 5.

Table 1.

Summary.

2. Our Contribution

3. Prior Work

4. Main Results

Setting.

4.a. When Does the MLE Exist?

Theorem 1 (39).

Fig. 6.

4.b. A System of Nonlinear Equations.

4.c. The Average Behavior of the MLE.

Theorem 2.

Fig. 7.

Table 2.

4.d. The Distribution of the Null MLE Coordinates.

Theorem 3.

Fig. 8.

4.e. The Distribution of the LRT.

Theorem 4.

Table 3.

4.f. Other Scalings.

4.g. Non-Gaussian Covariates

Fig. 9.

Fig. 10.

5. Adjusting Inference by Estimating the Signal Strength

5.a. ProbeFrontier: Estimating γ by Probing the MLE Frontier

Subsample.

Check whether MLE exists.

Repeat.

5.b. Empirical Performance of Adjusted Inference

Fig. 11.

Fig. 12.

Table 4.

Fig. 13.

5.c. Debiasing the MLE and Its Predictions

Fig. 14.

6. Broader Implications and Future Directions

Supplementary Material

Acknowledgments

Footnotes

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

5.a. ProbeFrontier: Estimating $γ$ by Probing the MLE Frontier