Sample Size Determination for Logistic Regression on a Logit-Normal Distribution

Seongho Kim; Elisabeth Heath; Lance Heilbrun

doi:10.1177/0962280215572407

Author manuscript; available in PMC: 2016 Sep 4.

Published in final edited form as: Stat Methods Med Res. 2015 Mar 4;26(3):1237–1247. doi: 10.1177/0962280215572407

PMCID: PMC4560689 NIHMSID: NIHMS657632 PMID: 25744106

Abstract

Although the sample size for simple logistic regression can be readily determined using currently available methods, the sample size calculation for multiple logistic regression requires some additional information, such as the coefficient of determination ( $R_{cov}^{2}$ ) of a covariate of interest with other covariates, which is often unavailable in practice. The response variable of logistic regression follows a logit-normal (LN) distribution which can be generated from a logistic transformation of a normal distribution. Using this property of logistic regression, we propose new methods of determining the sample size for simple and multiple logistic regressions using a normal transformation of outcome measures. Simulation studies and a motivating example show several advantages of the proposed methods over the existing methods: (i) no need for $R_{cov}^{2}$ for multiple logistic regression, (ii) available interim or group-sequential designs, and (iii) much smaller required sample size.

Keywords: Logistic regression, Logit-normal distribution, Power calculation, Sample size determination, Transformation

1. Introduction

Logistic regression analysis has been widely used to fit models for the probability of disease given marker values and to test the effect of a specific covariate on a binary response variable, often in the presence of other covariates. However, sample size determination for logistic regression is not straightforward because the model is non-linear. Several methods have been introduced for its sample size calculation. Whittemore [1] proposed a method that approximates the Fisher information matrix when the probability of response is small. Self and Mauritsen [2] introduced a general and flexible approach that approximates the power of score tests. Hsieh [3] presented sample size tables for logistic regression, in particular with normally distributed covariates, using Whittemore's formula [1]. Later, Hsieh et al. [4] proposed simpler and more accurate formulae for calculating the sample size for logistic regression models, which were implemented in PASS 12 (NCSS, Kaysville, Utah, USA).

The aforementioned methods can be readily applied to determine the sample size for simple logistic regression. However, the sample size calculation for multiple logistic regression requires additional information, such as the coefficient of determination ( $R_{cov}^{2}$ ) of a covariate of interest with other covariates, which is occasionally difficult to obtain in practice. Indeed, biochemical and pharmaceutical companies are often unwilling to share sensitive information with outside investigators due to confidentiality concerns, as described in the motivating example in Section 2. In this case, the sample size would be determined either by using simple logistic regression, which underestimates the required sample size, or by guessing the required $R_{cov}^{2}$ value subjectively. These practices yield either under- or over-powered designs.

Some researchers have focused on the sample size calculation for a transformed outcome measure [5,6]. The simulation studies in [6] showed that skewness increases the required sample size and that the sample size is decreased the most by transformation to a normal distribution. In fact, the response variable of a logistic regression follows a logit-normal (LN) distribution, which is generated from the logistic transformation of a normal distribution [7]. These properties of logistic regression inspired us to develop a transformation-based approach to determine the sample size. Our approach first transforms a logistic outcome measure to a normal distribution and then determines the sample size by the t-test. Sample size determination using a transformed outcome measure has three major advantages over the existing methods: (i) no need for $R_{cov}^{2}$ in the case of multiple logistic regression; (ii) straightforward implementation of interim or group-sequential designs based on a transformed outcome measure; and (iii) a much smaller required sample size. It should be noted that our approach applies only when the logistic outcome measure is continuous and comes from a logistic regression model. When the outcome measure is binary or ordinal, the proposed approach cannot be used.

2. Motivating example

Prostate cancer (PrC) is the second leading cause of cancer-related death in men. Although prostate specific antigen (PSA) blood testing remains the most widely used tool for PrC detection, considerable effort has been made to identify alternative biomarkers that overcome its lack of specificity. Recently, it has been discovered that sarcosine, alanine, glutamate, and glycine are metabolic biomarkers of PrC progression [8,9]. Using these metabolic biomarkers, a PrC diagnostic algorithm (a logistic regression model) was developed, which also took into account clinical information such as PSA and prostate volume. The outcome measure of this logistic algorithm will be called the M-score.

A new study was planned to validate the M-score by comparing with the PrC biopsy result in African American (AA) men who were referred for prostate biopsy for any clinical indication. The primary hypothesis was that the M-score, which has not been extensively studied in AA men, would have similar test characteristics in AA men as it did in European American men. The question that we were asked as statisticians was: how many AA men need to be included in the study? In the previous study, the M-score was elevated in AA PrC patients compared to those with benign prostate disease using a small sample size of 18. Based on this previous result, the study was designed to detect a difference of 10 points in the mean M-score of AA men with vs. those without PrC, based on biopsy results. Since the M-score was generated from a logistic regression model, a sample size could be determined based on a logistic regression model. However, covariates included in the logistic model were blinded and the biopsy result, positive or negative, was the only covariate available to us.

3. Methods

A logit-normal (LN) distribution is the probability distribution of a random variable whose logit follows a normal distribution. If a random variable X follows a normal distribution, then its logistic transform Y = logistic(X) is logit-normally distributed, where $\operatorname{logistic}(x) = \frac{\exp(x)}{1 + \exp(x)}$ . Likewise, if Y follows an LN distribution, then X = logit(Y) is normally distributed, where $\operatorname{logit}(y) = \log \left( \frac{y}{1 - y} \right)$ . In this paper we are interested in determining the sample size when an outcome measure follows an LN distribution.

Suppose two independent groups have been randomly and independently drawn from LN distributions, with n₀ and n₁ observations, respectively, $Y_{01}, Y_{02}, \cdots, Y_{0 n_{0}}$ and $Y_{11}, Y_{12}, \cdots, Y_{1 n_{1}}$ , with n = n₀ + n₁. That is, we assume that

$$Y_{ij} \sim LN(\phi_{i}, \psi_{i}), \tag{1}$$

where LN stands for a logit-normal distribution with mean of ϕ_i and standard deviation (SD) of ψ_i, i = 0,1, and j = 1,2, ⋯, n_i. An investigator wishes to test the null hypothesis that the two population means are equal, H₀: ϕ₀ = ϕ₁. The probability density function (pdf) of an LN distribution is

$$f_{Y}(y; \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} \cdot \exp \left( - \frac{(\operatorname{logit}(y) - \mu)^{2}}{2 \sigma^{2}} \right) \cdot \frac{1}{y (1 - y)}, \quad y \in (0, 1), \tag{2}$$

where logit(Y) ~ N(μ, σ) and N stands for a normal distribution with mean μ and SD σ. The pdf depends on the mean and SD of logit(Y) rather than those of Y, so the most direct approach to sample size determination for an LN distribution is to work with the normal distribution that logit(Y) follows. This approach requires that all the individual logit transformations, logit(Y_ij), i = 0,1; j = 1,2, …, n_i, be available in order to calculate the mean and SD of the normal distribution. However, as in the motivating example, no individual observations Y_ij were available to calculate the mean and SD of logit(Y). The mean (ϕ) and SD (ψ) of Y are the only available information, and no analytical solution exists to recover the mean and SD of logit(Y) from those of Y. Therefore, two approaches are proposed to estimate the mean and SD of logit(Y): the delta method and optimization.
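The density in Equation (2) can be sketched in a few lines of Python. The function names `logit` and `ln_pdf` are illustrative choices and do not come from the ssLogitNorm package; the sketch only shows how the normal density of logit(y) is combined with the Jacobian factor 1/(y(1 − y)).

```python
import math


def logit(y):
    """Log-odds of y, defined for 0 < y < 1."""
    return math.log(y / (1.0 - y))


def ln_pdf(y, mu, sigma):
    """Logit-normal density of Equation (2).

    A normal density with mean mu and SD sigma is evaluated at logit(y)
    and multiplied by the Jacobian 1 / (y (1 - y)) of the logit transform.
    """
    if not 0.0 < y < 1.0:
        return 0.0
    z = (logit(y) - mu) / sigma
    normal_part = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return normal_part / (y * (1.0 - y))


# Density at the midpoint for mu = 0, sigma = 1.
print(ln_pdf(0.5, 0.0, 1.0))
# Mirror symmetry: the density of LN with mean parameter mu at y equals
# the density with mean parameter -mu at 1 - y.
print(ln_pdf(0.3, 0.5, 1.0), ln_pdf(0.7, -0.5, 1.0))
```

The mirror symmetry printed in the last line is the same property that makes the normal parameters of LN(ϕ, ψ) and LN(1 − ϕ, ψ) differ only in the sign of the mean.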

3.1. Delta method

The delta method is employed to approximate the mean and SD of logit(Y). Suppose Y follows an LN distribution with mean ϕ and SD ψ. Then the logit of Y, X = logit(Y), follows a normal distribution with mean μ and SD σ by definition of an LN distribution. Since $\operatorname{logit}(Y) = \log \left( \frac{Y}{1 - Y} \right)$ has derivative $\frac{1}{Y (1 - Y)}$ , a first-order expansion around ϕ gives, by the delta method, the following estimates of the mean and SD of X:

$$\hat{\mu} \approx \log \left( \frac{\phi}{1 - \phi} \right); \tag{3}$$

$$\hat{\sigma} \approx \frac{\psi}{\phi (1 - \phi)}, \tag{4}$$

where ϕ and ψ are the mean and SD of an LN random variable Y.
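Equations (3) and (4) translate directly into code. The following is a minimal sketch; the function name `delta_method` is hypothetical, and the comparison values −2.64 and 1.10 are the "true" normal parameters listed for LN(0.1, 0.1) in Table 1.

```python
import math


def delta_method(phi, psi):
    """First-order (delta method) estimates of the mean and SD of logit(Y),
    given the mean phi and SD psi of Y (Equations (3) and (4))."""
    mu_hat = math.log(phi / (1.0 - phi))
    sigma_hat = psi / (phi * (1.0 - phi))
    return mu_hat, sigma_hat


mu_hat, sigma_hat = delta_method(0.1, 0.1)
# Bias as defined in Table 1: true value minus estimate.
bias_mu = -2.64 - mu_hat
bias_sigma = 1.10 - sigma_hat
print(mu_hat, sigma_hat, bias_mu, bias_sigma)
```

For this case the sketch gives biases of about −0.443 and −0.011, in line with the delta method columns of Table 1.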

3.2. Optimization

The delta method is a first-order approximation to the mean and variance of a transformed variable, whereas the logit function is non-linear over the range of values being examined. Thus, the delta method-based approximation may fit poorly, particularly when the coefficient of variation is high. To circumvent this difficulty, an optimization method is used to estimate the mean and SD of logit(Y). The estimates (μ̂, σ̂) are obtained by minimizing the objective function in Equation (5):

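$$(\hat{\mu}, \hat{\sigma}) = \operatorname*{argmin}_{\mu, \sigma} \left\{ \left( 1 - \frac{E(Y \mid \mu, \sigma)}{\phi} \right)^{2} + \left( 1 - \frac{\operatorname{Var}(Y \mid \mu, \sigma)}{\psi^{2}} \right)^{2} \right\}, \tag{5}$$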

where ϕ and ψ are the mean and SD of an LN random variable Y;

$$E(Y \mid \mu, \sigma) = \int_{0}^{1} z \, f_{Y}(z; \mu, \sigma) \, dz; \tag{6}$$

$$\operatorname{Var}(Y \mid \mu, \sigma) = \int_{0}^{1} \left( z - E(Y \mid \mu, \sigma) \right)^{2} f_{Y}(z; \mu, \sigma) \, dz. \tag{7}$$

The first and second terms in Equation (5) are the squared relative errors of the mean and the variance, respectively, so minimizing Equation (5) reduces the relative error of both moments simultaneously. Note that Equations (6) and (7) are the mean and variance of an LN distribution. Since no analytical solution is available for them, numerical integration was used to obtain these values. The numerical integration was carried out by adaptive quadrature [10] using the function ‘quadinf’ of the R package pracma, and Equation (5) was minimized using the R function ‘nlminb’.
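The optimization step can be sketched in plain Python under two simplifying assumptions that differ from the implementation described above: the moments in Equations (6) and (7) are computed by a trapezoidal rule after the change of variable x = logit(z), rather than by ‘quadinf’, and Equation (5) is minimized by a simple compass (pattern) search started at the delta method estimate, rather than by ‘nlminb’. All function names are hypothetical.

```python
import math


def logistic(x):
    """Numerically stable inverse logit."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def ln_moments(mu, sigma, n=400, width=8.0):
    """Mean and variance of Y = logistic(X), X ~ N(mu, sigma), using the
    trapezoidal rule on the logit scale over mu +/- width * sigma."""
    lo = mu - width * sigma
    h = 2.0 * width * sigma / n
    w_sum = m1 = m2 = 0.0
    for k in range(n + 1):
        x = lo + k * h
        z = (x - mu) / sigma
        w = math.exp(-0.5 * z * z)      # unnormalized normal weight
        if k == 0 or k == n:
            w *= 0.5                    # trapezoid end-point weights
        y = logistic(x)
        w_sum += w
        m1 += w * y
        m2 += w * y * y
    mean = m1 / w_sum
    return mean, m2 / w_sum - mean * mean


def objective(mu, sigma, phi, psi):
    """Sum of squared relative errors of the mean and variance, Equation (5)."""
    if sigma <= 0.0:
        return float("inf")
    mean, var = ln_moments(mu, sigma)
    return (1.0 - mean / phi) ** 2 + (1.0 - var / psi ** 2) ** 2


def fit_normal(phi, psi, step=0.25, tol=1e-7):
    """Compass search for (mu, sigma), started at the delta method estimate."""
    mu = math.log(phi / (1.0 - phi))
    sigma = psi / (phi * (1.0 - phi))
    best = objective(mu, sigma, phi, psi)
    while step > tol:
        moved = False
        for d_mu, d_sigma in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            value = objective(mu + d_mu, sigma + d_sigma, phi, psi)
            if value < best:
                mu, sigma, best, moved = mu + d_mu, sigma + d_sigma, value, True
        if not moved:
            step *= 0.5                 # refine the mesh when no move improves
    return mu, sigma


mu_hat, sigma_hat = fit_normal(0.6, 0.2)
print(mu_hat, sigma_hat)
```

For LN(0.6, 0.2) the sketch should land close to the normal distribution N(0.49, 0.99) listed in Table 1. Integrating on the logit scale is a design choice of this sketch: the integrand is then a smooth function multiplied by a Gaussian weight, with no singular behavior at the end points of (0, 1).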

However, the pdf f_Y(y; μ, σ) of an LN distribution, given in Equation (2), numerically becomes a function that is nonzero only near y = 0 and y = 1 as the SD (σ) of the corresponding normal distribution increases. In other words, given μ, the pdf f_Y(y; μ, σ) approaches δ(y) + δ(1 − y) as σ → ∞, where δ(y) = 0 if y ≠ 0 and 1 if y = 0. For example, Supplementary Information Figure S1(a) displays the density plots of a logit-normal distribution when the mean (μ) of the corresponding normal distribution is zero and its SD (σ) ranges from 0.1 to 1000. This plot shows that the pdf has two modes when σ ≥ 10 and numerically approaches δ(y) + δ(1 − y) for large σ. This behavior makes accurate numerical integration difficult. To investigate the behavior of the numerical integration, the areas under each of the pdfs in Supplementary Information Figure S1(a) were computed and plotted in Supplementary Information Figure S1(b). Although all the areas under the pdfs should equal one, some of the numerically computed areas fall below one when σ ≥ 20.
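The loss of probability mass can be reproduced qualitatively with a small experiment. The sketch below uses a fixed midpoint rule on (0, 1) rather than the adaptive quadrature used for Figure S1(b), so the numerical values will not match that figure; it only illustrates the mechanism, namely that for large σ the mass sits so close to 0 and 1 that a quadrature rule on (0, 1) misses it. The function names are hypothetical.

```python
import math


def ln_pdf(y, mu, sigma):
    """Logit-normal density of Equation (2) for 0 < y < 1."""
    z = (math.log(y / (1.0 - y)) - mu) / sigma
    normal_part = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return normal_part / (y * (1.0 - y))


def area_midpoint(mu, sigma, n=10000):
    """Area under the density on (0, 1) by the midpoint rule with n cells."""
    h = 1.0 / n
    return h * sum(ln_pdf((k + 0.5) * h, mu, sigma) for k in range(n))


# mu = 0 and sigma from 0.1 to 1000, as in Supplementary Figure S1.
areas = {sigma: area_midpoint(0.0, sigma) for sigma in (0.1, 1.0, 10.0, 100.0, 1000.0)}
for sigma, area in areas.items():
    print(sigma, round(area, 4))
```

With this fixed grid the computed area stays essentially at one for small σ and drops well below one as σ grows, even though the true area is always one.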

In particular, the aforementioned numerical difficulty occurs when the optimization method (M2) is applied in three cases: (ϕ ≤ 0.1 or ϕ ≥ 0.9; ψ ≥ 0.3), (ϕ = 0.2 or 0.8; ψ ≥ 0.4), and (0.3 ≤ ϕ ≤ 0.7; ψ ≥ 0.5). In other cases, the optimization problem is well-posed. For example, the traces of the objective function in Equation (5) were investigated for (ϕ₀ = 0.3, ϕ₁ = 0.7) and (ϕ₀ = 0.1, ϕ₁ = 0.9) using the optimization method (M2), as shown in Supplementary Information Figure S2. Because of the symmetry of the LN distribution, the true mean (μ) of the normal distribution corresponding to LN(ϕ, ψ) has the same absolute value as that corresponding to LN(1 − ϕ, ψ), and the true SDs (σ) are identical. Therefore, in the ideal case, the contour plot of the objective function for ϕ₀ = 0.3 should be symmetric to that for ϕ₁ = 0.7 regardless of the SD (ψ). Indeed, the solution areas (i.e., the darkest blue color) are symmetric to each other up to ψ = 0.3. However, as ψ exceeds 0.3, the solution areas become asymmetric, as can be seen in Supplementary Information Figure S3(a)–(e). This phenomenon grows worse when (ϕ₀ = 0.1, ϕ₁ = 0.9). In this case, even when the SD (ψ) is 0.1, the solution areas (darkest blue color) become dissimilar (see Supplementary Information Figure S3(f)–(h)).

3.3. Sample size determination

The sample size determination for an LN distribution is considered in this section. As mentioned in the motivating example, the means and SDs of two groups whose populations follow an LN distribution are the only available information for sample size calculation. To this end, three approaches can be used. The first uses the test on proportions, since Y ∈ (0,1), the second uses the simple logistic regression with one binary covariate (i.e., $R_{cov}^{2} = 0$ ), and the third uses the t-test with or without transformation to a normal distribution. It should be noted that the second approach will underestimate the sample size when there are other covariates, resulting in an underpowered study design. In this regard, the following five methods are considered in this work: (1) Using the t-test with the delta method (M1), (2) Using the t-test with the optimization (M2), (3) Using the t-test without transformation (M3), (4) Using the test on proportions (M4), and (5) Using the simple logistic regression with one binary covariate (M5).
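The t-test-based methods M1 and M3 can be sketched as follows. An exact t-test sample size requires the noncentral t distribution; this sketch instead assumes the usual normal approximation with a $z_{1-\alpha/2}^{2}/4$ small-sample correction, which is an assumption made here for illustration and not necessarily the formula used by ssLogitNorm. It can differ by one from an exact calculation. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(mean0, sd0, mean1, sd1, alpha=0.05, beta=0.2):
    """Approximate per-group sample size for a two-sided two-sample t-test
    with equal allocation (normal approximation plus z^2 / 4 correction)."""
    z_a = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_b = NormalDist().inv_cdf(1.0 - beta)
    n = (z_a + z_b) ** 2 * (sd0 ** 2 + sd1 ** 2) / (mean0 - mean1) ** 2
    return math.ceil(n + z_a ** 2 / 4.0)


def delta_method(phi, psi):
    """Equations (3) and (4): approximate mean and SD of logit(Y)."""
    return math.log(phi / (1.0 - phi)), psi / (phi * (1.0 - phi))


# Example: phi0 = 0.5, phi1 = 0.6, psi = 0.1, alpha = 0.05, beta = 0.2.
# M3: t-test on the raw scale.
n_m3 = n_per_group(0.5, 0.1, 0.6, 0.1)
# M1: t-test on the logit scale with delta method estimates.
n_m1 = n_per_group(*delta_method(0.5, 0.1), *delta_method(0.6, 0.1))
print(n_m3, n_m1)
```

M2 follows the same pattern, with the delta method estimates replaced by the estimates obtained from minimizing Equation (5).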

3.4. Software

An R package ssLogitNorm is developed for the power and sample size determination for an LN distribution and it is freely available at http://cansur.sourceforge.net. The brief instruction on how to use the R package ssLogitNorm can be found in the Supplementary Information.

4. Simulation studies

Because there is no analytical solution for the mean and SD of an LN distribution, Monte Carlo simulation was performed to find the true mean and true SD of the LN distribution corresponding to a given normal distribution. In particular, the normal distributions were chosen to correspond to LN distributions with a difference in mean (Δ) of 0.1 or 0.2 and the same SD (i.e., ψ = 0.10, 0.20, and 0.30), as can be seen in Table 1. Because of the symmetry of the LN distribution, the true mean (μ) of the normal distribution corresponding to LN(ϕ, ψ) has the same absolute value as that corresponding to LN(1 − ϕ, ψ), and the true SDs (σ) are identical. For instance, in Table 1 the normal distribution corresponding to LN(0.6, 0.2) is N(0.49, 0.99), while that corresponding to LN(0.4, 0.2) is N(−0.49, 0.99). Note that the corresponding normal distributions could not be found when (ϕ, ψ) is (0.1, 0.3) or (0.9, 0.3) because of the difficulty with numerical integration discussed in Section 3.2. Using these values, the two proposed approaches to a normal transformation are evaluated in terms of bias in the estimates of the mean and SD. The sample size calculation is further compared with other methods.
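The forward direction of this simulation, from normal parameters to the LN mean and SD, can be sketched as follows. The number of draws, the seed, and the function name are arbitrary choices for illustration and are not taken from the study.

```python
import math
import random


def ln_mean_sd_mc(mu, sigma, n=200000, seed=1):
    """Monte Carlo estimate of the mean and SD of Y = logistic(X),
    where X ~ N(mu, sigma)."""
    rng = random.Random(seed)
    total = total_sq = 0.0
    for _ in range(n):
        y = 1.0 / (1.0 + math.exp(-rng.gauss(mu, sigma)))
        total += y
        total_sq += y * y
    mean = total / n
    return mean, math.sqrt(total_sq / n - mean * mean)


# Normal distributions paired with LN(0.6, 0.2) and LN(0.4, 0.2) in Table 1.
mean_hi, sd_hi = ln_mean_sd_mc(0.49, 0.99)
mean_lo, sd_lo = ln_mean_sd_mc(-0.49, 0.99)
print(mean_hi, sd_hi, mean_lo, sd_lo)
```

Up to Monte Carlo error, the two runs should return means near 0.6 and 0.4 with a common SD near 0.2, which illustrates the symmetry between LN(ϕ, ψ) and LN(1 − ϕ, ψ).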

Table 1. Transformation to a Normal distribution.

Δ is the difference in mean between two logit-normal random variables, ϕ₁ = ϕ₀ + Δ, where ϕ_i is the logit-normal mean of the ith group; LN and Normal stand for a logit-normal distribution and its corresponding normal distribution, respectively. Bias is calculated as the difference between true and estimate (i.e., true – estimate); ψ is the logit-normal standard deviation of both groups; μ_i and σ_i are the mean and standard deviation of a normal distribution of the ith group, respectively.

(a) Δ = 0.1

LN		True Normal				Bias, delta method (M1)				Bias, optimization (M2)
ϕ₀	ψ	μ₀	μ₁	σ₀	σ₁	μ₀	μ₁	σ₀	σ₁	μ₀	μ₁	σ₀	σ₁
0.1	0.1	−2.64	−1.50	1.10	0.65	−0.4428	−0.1137	−0.0111	0.0250	−0.0023	0.0046	−0.005	0.0025
	0.2	−4.46	−1.93	2.99	1.52	−2.2628	−0.5437	0.7678	0.2700	−0.0042	0.0004	0.0048	−0.0040
	0.3	-	−3.36	-	3.57	-	−1.9737	-	1.6950	-	−0.0019	-	−0.0033
0.2	0.1	−1.50	−0.89	0.65	0.50	−0.1137	−0.0427	0.0250	0.0238	0.0046	0.0042	0.0025	0.0042
	0.2	−1.93	−1.06	1.52	1.14	−0.5437	−0.2127	0.2700	0.1876	0.0004	0.0019	−0.0040	0.0009
	0.3	−3.36	−1.52	3.57	2.31	−1.9737	−0.6727	1.6950	0.8814	−0.0019	−0.0049	−0.0033	0.0000
0.3	0.1	−0.89	−0.42	0.50	0.43	−0.0427	−0.0145	0.0238	0.0133	0.0042	0.0035	0.0042	−0.0039
	0.2	−1.06	−0.49	1.14	0.99	−0.2127	−0.0845	0.1876	0.1567	0.0019	−0.0019	0.0009	0.0020
	0.3	−1.52	−0.65	2.31	1.92	−0.6727	−0.2445	0.8814	0.6700	−0.0049	0.0019	0.0000	−0.0002
0.4	0.1	−0.42	0.00	0.43	0.42	−0.0145	0.0000	0.0133	0.0200	0.0035	0.0000	−0.0039	0.0035
	0.2	−0.49	0.00	0.99	0.95	−0.0845	0.0000	0.1567	0.1500	−0.0019	0.0000	0.0020	0.0041
	0.3	−0.65	0.00	1.92	1.82	−0.2445	0.0000	0.6700	0.6200	0.0019	0.0000	−0.0002	0.0013
0.5	0.1	0.00	0.42	0.42	0.43	0.0000	0.0145	0.0200	0.0133	0.0000	−0.0035	0.0035	−0.0039
	0.2	0.00	0.49	0.95	0.99	0.0000	0.0845	0.1500	0.1567	0.0000	0.0019	0.0041	0.0020
	0.3	0.00	0.65	1.82	1.92	0.0000	0.2445	0.6200	0.6700	0.0000	−0.0019	0.0013	−0.0002
0.6	0.1	0.42	0.89	0.43	0.50	0.0145	0.0427	0.0133	0.0238	−0.0035	−0.0042	−0.0039	0.0042
	0.2	0.49	1.06	0.99	1.14	0.0845	0.2127	0.1567	0.1876	0.0019	−0.0019	0.0020	0.0009
	0.3	0.65	1.52	1.92	2.31	0.2445	0.6727	0.6700	0.8814	−0.0019	0.0049	−0.0002	0.0000
0.7	0.1	0.89	1.50	0.50	0.65	0.0427	0.1137	0.0238	0.0250	−0.0042	−0.0046	0.0042	0.0025
	0.2	1.06	1.93	1.14	1.52	0.2127	0.5437	0.1876	0.2700	−0.0019	−0.0004	0.0009	−0.0040
	0.3	1.52	3.36	2.31	3.57	0.6727	1.9737	0.8814	1.6950	0.0049	0.0019	0.0000	−0.0033

(b) Δ = 0.2

LN		True Normal				Bias, delta method (M1)				Bias, optimization (M2)
ϕ₀	ψ	μ₀	μ₁	σ₀	σ₁	μ₀	μ₁	σ₀	σ₁	μ₀	μ₁	σ₀	σ₁
0.1	0.1	−2.64	−0.89	1.10	0.50	−0.4428	−0.0427	−0.0111	0.0238	−0.0023	0.0042	−0.0050	0.0042
	0.2	−4.46	−1.06	2.99	1.14	−2.2628	−0.2127	0.7678	0.1876	−0.0042	0.0019	0.0048	0.0009
	0.3	-	−1.52	-	2.31	-	−0.6727	-	0.8814	-	−0.0049	-	0.0000
0.2	0.1	−1.50	−0.42	0.65	0.43	−0.1137	−0.0145	0.0250	0.0133	0.0046	0.0035	0.0025	−0.0039
	0.2	−1.93	−0.49	1.52	0.99	−0.5437	−0.0845	0.2700	0.1567	0.0004	−0.0019	−0.0040	0.0020
	0.3	−3.36	−0.65	3.57	1.92	−1.9737	−0.2445	1.6950	0.6700	−0.0019	0.0019	−0.0033	−0.0002
0.3	0.1	−0.89	0.00	0.50	0.42	−0.0427	0.0000	0.0238	0.0200	0.0042	0.0000	0.0042	0.0035
	0.2	−1.06	0.00	1.14	0.95	−0.2127	0.0000	0.1876	0.1500	0.0019	0.0000	0.0009	0.0041
	0.3	−1.52	0.00	2.31	1.82	−0.6727	0.0000	0.8814	0.6200	−0.0049	0.0000	0.0000	0.0013
0.4	0.1	−0.42	0.42	0.43	0.43	−0.0145	0.0145	0.0133	0.0133	0.0035	−0.0035	−0.0039	−0.0039
	0.2	−0.49	0.49	0.99	0.99	−0.0845	0.0845	0.1567	0.1567	−0.0019	0.0019	0.0020	0.0020
	0.3	−0.65	0.65	1.92	1.92	−0.2445	0.2445	0.6700	0.6700	0.0019	−0.0019	−0.0002	−0.0002
0.5	0.1	0.00	0.89	0.42	0.50	0.0000	0.0427	0.0200	0.0238	0.0000	−0.0042	0.0035	0.0042
	0.2	0.00	1.06	0.95	1.14	0.0000	0.2127	0.1500	0.1876	0.0000	−0.0019	0.0041	0.0009
	0.3	0.00	1.52	1.82	2.31	0.0000	0.6727	0.6200	0.8814	0.0000	0.0049	0.0013	0.0000
0.6	0.1	0.42	1.50	0.43	0.65	0.0145	0.1137	0.0133	0.0250	−0.0035	−0.0046	−0.0039	0.0025
	0.2	0.49	1.93	0.99	1.52	0.0845	0.5437	0.1567	0.2700	0.0019	−0.0004	0.0020	−0.0040
	0.3	0.65	3.36	1.92	3.57	0.2445	1.9737	0.6700	1.6950	−0.0019	0.0019	−0.0002	−0.0033
0.7	0.1	0.89	2.64	0.50	1.10	0.0427	0.4428	0.0238	−0.0111	−0.0042	0.0023	0.0042	−0.0050
	0.2	1.06	4.46	1.14	2.99	0.2127	2.2628	0.1876	0.7678	−0.0019	0.0042	0.0009	0.0048
	0.3	1.52	-	2.31	-	0.6727	-	0.8814	-	0.0049	-	0.0000	-

Table 1 displays the results of a normal transformation using the delta method and the optimization approach. The closer the mean (ϕ) of an LN distribution is to 0.5, the smaller the bias of the estimates of the corresponding normal distribution for both methods. As the SD (ψ) of an LN distribution increases, the bias of the estimates increases for the delta method, whereas the optimization approach shows no such trend. As expected, M2 approximates the mean (μ) and SD (σ) of the corresponding normal distribution better than M1 does. Note that because no true value is available when (ϕ, ψ) is (0.1, 0.3) or (0.9, 0.3), no bias is given for these cases in Table 1.

The proposed methods are further evaluated in terms of sample size determination. The sample sizes are estimated for (α = 0.1, β = 0.1), (α = 0.05, β = 0.2), and (α = 0.05, β = 0.1) with a two-sided two-sample test, and the estimates are shown in Table 2. Note that the sample size in the table is per group. Since the proposed methods are based on a normal distribution, the true sample sizes for the normal distribution can be obtained; these are given in the column ‘True’ of the table. When the SD (ψ) of an LN distribution is small, both the delta (M1) and the optimization (M2) approaches yield sample sizes very close to the true sample size. However, as the SD (ψ) increases, the sample size of M1 becomes larger than the true sample size, while M2 stays close to the true sample size regardless of the size of the SD (ψ). M2 performs better because its transformation to a normal distribution is more accurate than that of M1, as shown in Table 1. However, in the cases (Δ = 0.1, ϕ₀ = 0.1, ψ = 0.3), (Δ = 0.2, ϕ₀ = 0.1, ψ = 0.3), and (Δ = 0.2, ϕ₀ = 0.7, ψ = 0.3), M2 fails to estimate the required sample size because of the aforementioned numerical difficulty. In contrast, M1 does not suffer from the numerical difficulty, even in these cases, owing to its analytical solutions (i.e., Equations (3) and (4)), and therefore yields the required sample sizes for all cases.

Table 2. Sample size determinations.

Δ is the difference in mean between two logit-normal random variables, ϕ₁ = ϕ₀ + Δ, where ϕ_i is the logit-normal mean of the ith group; LN and Normal stand for a logit-normal distribution and its corresponding normal distribution, respectively; ψ is the logit-normal standard deviation of both groups; M1: Using the t-test with the delta method, M2: Using the t-test with the optimization, M3: Using the t-test without transformation, M4: Using the test on proportions, and M5: Using the logistic regression with one binary covariate.

(a) Δ = 0.1

LN		α = 0.1, β = 0.1						α = 0.05, β = 0.2						α = 0.05, β = 0.1
ϕ₀	ψ	True	M1	M2	M3	M4	M5	True	M1	M2	M3	M4	M5	True	M1	M2	M3	M4	M5
0.1	0.1	12	22	12	18	837	216	11	21	12	17	767	199	15	27	15	23	1027	266
	0.2	16	86	16	70	837	216	15	79	15	64	767	199	20	105	20	86	1027	266
	0.3	-	192	-	155	837	216	-	176	-	143	767	199	-	235	-	191	1027	266
0.2	0.1	17	19	17	18	802	319	16	18	16	17	736	293	21	24	20	23	985	392
	0.2	42	74	42	70	802	319	39	68	39	64	736	293	52	91	52	86	985	392
	0.3	47	165	47	155	802	319	43	152	43	143	736	293	58	202	57	191	985	392
0.3	0.1	18	19	18	18	751	388	17	18	17	17	688	356	22	23	22	23	921	476
	0.2	61	71	60	70	751	388	57	66	56	64	688	356	75	88	74	86	921	476
	0.3	103	159	105	155	751	388	95	146	97	143	688	356	127	195	129	191	921	476
0.4	0.1	19	19	18	18	682	422	18	17	17	17	625	387	23	23	23	23	837	518
	0.2	68	71	68	70	682	422	63	65	63	64	625	387	84	87	84	86	837	518
	0.3	143	158	142	155	682	422	131	145	131	143	625	387	176	193	174	191	837	518
0.5	0.1	19	19	18	18	596	422	18	17	17	17	546	387	23	23	23	23	731	518
	0.2	68	71	68	70	596	422	63	65	63	64	546	387	84	87	84	86	731	518
	0.3	143	158	142	155	596	422	131	145	131	143	546	387	176	193	174	191	731	518
0.6	0.1	18	19	18	18	493	388	17	18	17	17	452	356	22	23	22	23	604	476
	0.2	61	71	60	70	493	388	57	66	56	64	452	356	75	88	74	86	604	476
	0.3	103	159	105	155	493	388	95	146	97	143	452	356	127	195	129	191	604	476
0.7	0.1	17	19	17	18	372	319	16	18	16	17	341	293	21	24	20	23	456	392
	0.2	42	74	42	70	372	319	39	68	39	64	341	293	52	91	52	86	456	392
	0.3	47	165	47	155	372	319	43	152	43	143	341	293	58	202	57	191	456	392

(b) Δ = 0.2

LN		α = 0.1, β = 0.1						α = 0.05, β = 0.2						α = 0.05, β = 0.1
ϕ₀	ψ	True	M1	M2	M3	M4	M5	True	M1	M2	M3	M4	M5	True	M1	M2	M3	M4	M5
0.1	0.1	5	8	5	6	205	67	5	8	5	6	188	62	7	10	7	7	252	82
	0.2	9	29	9	18	205	67	9	27	9	17	188	62	11	35	11	23	252	82
	0.3	-	63	-	40	205	67	-	58	-	37	188	62	-	77	-	49	252	82
0.2	0.1	6	6	6	6	194	88	6	6	6	6	178	81	7	8	7	7	238	108
	0.2	15	21	15	18	194	88	14	20	14	17	178	81	18	26	18	23	238	108
	0.3	20	46	20	40	194	88	19	43	19	37	178	81	25	57	25	49	238	108
0.3	0.1	6	6	6	6	179	101	6	6	6	6	164	93	7	7	7	7	220	124
	0.2	18	20	18	18	179	101	17	18	17	17	164	93	22	24	22	23	220	124
	0.3	33	43	33	40	179	101	31	40	31	37	164	93	41	52	41	49	220	124
0.4	0.1	6	6	6	6	160	105	6	6	6	6	146	97	7	7	7	7	196	129
	0.2	19	19	19	18	160	105	18	18	18	17	146	97	23	24	23	23	196	129
	0.3	39	42	38	40	160	105	36	39	36	37	146	97	47	51	47	49	196	129
0.5	0.1	6	6	6	6	136	101	6	6	6	6	124	93	7	7	7	7	166	124
	0.2	18	20	18	18	136	101	17	18	17	17	124	93	22	24	22	23	166	124
	0.3	33	43	33	40	136	101	31	40	31	37	124	93	41	52	41	49	166	124
0.6	0.1	6	6	6	6	107	88	6	6	6	6	98	81	7	8	7	7	131	108
	0.2	15	21	15	18	107	88	14	20	14	17	98	81	18	26	18	23	131	108
	0.3	20	46	20	40	107	88	19	43	19	37	98	81	25	57	25	49	131	108
0.7	0.1	5	8	5	6	73	67	5	8	5	6	67	62	7	10	7	7	89	82
	0.2	9	29	9	18	73	67	9	27	9	17	67	62	11	35	11	23	89	82
	0.3	-	63	-	40	73	67	-	58	-	37	67	62	-	77	-	49	89	82

In addition, the sample sizes have been determined using the t-test (M3), the test on proportions (M4), and the simple logistic regression with one binary covariate (M5). Unlike M1 and M2, the methods M3, M4, and M5 calculate the sample size without transformation. The sample size of M3 is similar to that of M1, while the sample sizes of M4 and M5 are much larger than those of the other methods. In particular, the sample sizes of M4 and M5 are comparable when ϕ₀ ≥ 0.5, while M4 requires a much larger sample size than M5 does when ϕ₀ ≤ 0.4 in Table 2.

5. Application to the motivating example

The five methods M1 to M5 are applied to the motivating example for sample size determination using the outcomes obtained from the previous study, as described in Section 2. There are two groups: PrC biopsy positive (Group 1) and negative (Group 2). The sample size is to be determined when the means and SDs of the M-scores are (0.69, 0.11) and (0.59, 0.18) for Group 1 and Group 2, respectively. Note that the sample size for M5 is calculated under the assumption that there is no other covariate (i.e., $R_{cov}^{2} = 0$ ), using PASS 12 (NCSS, Kaysville, Utah, USA), in which the method of Hsieh et al. [4] is implemented. When α = 0.05 and β = 0.2 with a two-sided two-sample test, the estimated sample sizes of M1, M2, and M3 are 35, 44, and 36 per group, respectively, while those of M4 and M5 are 462 and 350 per group, respectively. As in the simulation studies, M4 has the largest sample size and the methods M1 to M3 have similar sample sizes. Since $R_{cov}^{2} \neq 0$ in this motivating example, the sample size determined by simple logistic regression is underestimated. Nevertheless, comparing M2 with M5, the required sample size of M2 is smaller than that of M5 by a factor of more than 7.
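As a rough check on the M1 and M3 figures above, the arithmetic can be sketched using the normal approximation to the two-sided two-sample t-test with a $z_{1-\alpha/2}^{2}/4$ small-sample correction. This approximation is an assumption made here for illustration; it is not necessarily the formula implemented in ssLogitNorm and can differ by one from an exact t-test calculation. With $z_{0.975} \approx 1.960$ and $z_{0.80} \approx 0.842$ , the raw-scale calculation (M3) is

$$n_{M3} \approx \frac{(1.960 + 0.842)^{2} \, (0.11^{2} + 0.18^{2})}{(0.69 - 0.59)^{2}} + \frac{1.960^{2}}{4} \approx 34.9 + 0.96 \approx 35.9 \;\Rightarrow\; 36.$$

For M1, Equations (3) and (4) give $\hat{\mu}_{1} = \log(0.69 / 0.31) \approx 0.800$ , $\hat{\sigma}_{1} = 0.11 / (0.69 \cdot 0.31) \approx 0.514$ , $\hat{\mu}_{2} = \log(0.59 / 0.41) \approx 0.364$ , and $\hat{\sigma}_{2} = 0.18 / (0.59 \cdot 0.41) \approx 0.744$ , so that

$$n_{M1} \approx \frac{(1.960 + 0.842)^{2} \, (0.514^{2} + 0.744^{2})}{(0.800 - 0.364)^{2}} + \frac{1.960^{2}}{4} \approx 33.8 + 0.96 \approx 34.7 \;\Rightarrow\; 35.$$

Both values agree with the per-group sample sizes reported above for M3 and M1.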

6. Discussion

Logistic regression is one of the most popular methods in diagnostic classification. It is known that the response variable of the logistic regression follows a logit-normal distribution [7]. However, due to the complexity of an LN distribution, it is difficult to estimate the sample size directly. Therefore, two transformation-based approaches were introduced for the sample size determination for an LN distribution.

Several studies have shown that a transformation to a normal distribution can reduce the required sample size [5,6], and our simulation studies are consistent with those results. One alternative to consider is the t-test without transformation (M3). The estimated sample size of M3 is very close to that of M1 when the SD is small in our simulation studies and, likewise, the methods M1 and M3 result in similar sample sizes of 35 and 36 per group, respectively, in the motivating example. Nevertheless, it should be noted that the two methods analyze the outcome on different scales: M3 analyzes the outcomes on the raw scale, while M1 analyzes the transformed outcomes. For example, the estimated density plots for M1 and M3 are depicted in Figure 1 using the results estimated from the motivating example. The density plot of M1 is the estimated LN density, and that of M3 is a normal density. Although the methods M1 and M3 result in very similar sample sizes, the distribution of M1 is left-skewed, while that of M3 is symmetric. In addition, it can be seen that the distribution of M3 is right-truncated. Therefore, an M3-based analysis could be misleading, especially when the true logit-normal density is heavily skewed.

Figure 1. Estimated density plots of the motivating example.
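The truncation of the raw-scale normal model can be quantified with a short calculation. The sketch below computes how much probability a normal distribution with the M-score means and SDs of the motivating example places outside the interval (0, 1), where no M-score can occur. The function name is hypothetical, and which group is shown in Figure 1 is not assumed here.

```python
from statistics import NormalDist


def mass_outside_unit_interval(mean, sd):
    """Probability that a N(mean, sd) variable falls below 0 or above 1."""
    dist = NormalDist(mean, sd)
    return dist.cdf(0.0) + (1.0 - dist.cdf(1.0))


# Means and SDs of the M-score in the motivating example.
p_positive = mass_outside_unit_interval(0.69, 0.11)  # biopsy positive
p_negative = mass_outside_unit_interval(0.59, 0.18)  # biopsy negative
print(p_positive, p_negative)
```

The mass outside (0, 1) is small for these parameters but not zero, and it grows as the mean moves toward 0 or 1 or as the SD increases, which is where the raw-scale normal model departs most from the logit-normal one.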

One drawback of the optimization approach (M2) is the instability of the numerical integration when it is applied in the three cases (ϕ ≤ 0.1 or ϕ ≥ 0.9; ψ ≥ 0.3), (ϕ = 0.2 or 0.8; ψ ≥ 0.4), and (0.3 ≤ ϕ ≤ 0.7; ψ ≥ 0.5), as discussed in Section 3.2. However, this does not prevent the proposed optimization method (M2) from being used for trial design, because practitioners usually plan studies with a small ψ. Furthermore, because the delta method (M1) does not suffer from the numerical difficulty, M1 can calculate the required sample size even when ψ is large. Therefore, when the variance is large, one might instead use M1, which still yields a smaller sample size than M4 and M5 do.

Several major benefits of using the proposed methods are as follows. First, in case of multiple logistic regression, the proposed methods require no $R_{cov}^{2}$ between a covariate of interest and other covariates. In particular, in the motivating example, the underestimated sample size is inescapable if the logistic regression with one binary covariate (M5) is used while ignoring other covariates (i.e., $R_{cov}^{2} = 0$ ). On the contrary, the proposed methods calculate the sample size solely depending on the response variable, irrespective of covariates, meaning that the descriptive statistics of the response variable are the only required information.

Second, interim or group-sequential designs can be implemented with ease based on a transformed outcome measure. In order to save time and resources as well as to reduce study patients' exposure to an inferior treatment, there is a high demand for interim or group-sequential designs. However, such designs using logistic regression are not straightforward due to their complexity. On the other hand, the t-test has available versions of interim or group-sequential designs. Therefore, the proposed methods can be readily used to generate interim and group-sequential designs based on the normal-transformed outcome measures.

Third, the proposed methods have much smaller required sample sizes than other methods do. As shown in the simulation studies, the methods M4 and M5 yield the largest sample sizes among the five methods. The studies in reference [6] show that the required sample size can be decreased the most when the outcome measure of interest follows a normal distribution. This implies that the proposed methods can achieve the same power and significance level as do the methods M4 and M5 but with much smaller sample sizes. In particular, when the variance of the response variable is small, either method M1 or M2 can be used, but M2 will be a better choice when the variance is large. In addition, an R package ssLogitNorm is publicly available, which can be run in a web browser. The ssLogitNorm package is user-friendly and developed in an interactive web application platform.

Supplementary Material

Supplementary Information

NIHMS657632-supplement-Supplementary_Information.pdf (477.4 KB, PDF)

Acknowledgement

The authors would like to thank the editor and the anonymous reviewers. The Biostatistics Core is supported in part by NIH Cancer Center Support Grant P30 CA022453 to the Karmanos Cancer Institute at Wayne State University.

References

1. Whittemore A. Sample size for logistic regression with small response probability. Journal of the American Statistical Association. 1981;76:27–32.
2. Self SG, Mauritsen RH. Power/sample size calculations for generalized linear models. Biometrics. 1988;44:79–86.
3. Hsieh FY. Sample size tables for logistic regression. Statistics in Medicine. 1989;8:795–802. doi: 10.1002/sim.4780080704.
4. Hsieh FY, Bloch DA, Larsen MD. A simple method of sample size calculation for linear and logistic regression. Statistics in Medicine. 1998;17:1623–1634. doi: 10.1002/(sici)1097-0258(19980730)17:14<1623::aid-sim871>3.0.co;2-s.
5. Wolfe R, Carlin JB. Sample-size calculation for a log-transformed outcome measure. Controlled Clinical Trials. 1999;20:547–554. doi: 10.1016/s0197-2456(99)00032-x.
6. Jin H, Zhao X. Transformation and sample size. Sweden: Department of Economics and Society: Dalarna University; 2009.
7. Frederic P, Lad F. Two moments of the logitnormal distribution. Communications in Statistics-Simulation and Computation. 2008;37:1263–1269.
8. Sreekumar A, Poisson LM, Rajendiran TM, et al. Metabolomic profiles delineate potential role for sarcosine in prostate cancer progression. Nature. 2009;457:910–914. doi: 10.1038/nature07762. [Research Misconduct Found]
9. McDunn JE, Li Z, Adam KP, et al. Metabolomic signatures of aggressive prostate cancer. The Prostate. 2013;73:1547–1560. doi: 10.1002/pros.22704.
10. Kuonen D. Numerical integration in S-plus or R: a survey. Journal of Statistical Software. 2003;8:1–14.

