Author manuscript; available in PMC: 2016 Sep 4.
Published in final edited form as: Stat Methods Med Res. 2015 Mar 4;26(3):1237–1247. doi: 10.1177/0962280215572407

Sample Size Determination for Logistic Regression on a Logit-Normal Distribution

Seongho Kim 1,2,*, Elisabeth Heath 2, Lance Heilbrunh 1,2
PMCID: PMC4560689  NIHMSID: NIHMS657632  PMID: 25744106

Abstract

Although the sample size for simple logistic regression can be readily determined using currently available methods, the sample size calculation for multiple logistic regression requires some additional information, such as the coefficient of determination ($R^2_{\mathrm{cov}}$) of a covariate of interest with other covariates, which is often unavailable in practice. The response variable of logistic regression follows a logit-normal (LN) distribution which can be generated from a logistic transformation of a normal distribution. Using this property of logistic regression, we propose new methods of determining the sample size for simple and multiple logistic regressions using a normal transformation of outcome measures. Simulation studies and a motivating example show several advantages of the proposed methods over the existing methods: (i) no need for $R^2_{\mathrm{cov}}$ for multiple logistic regression, (ii) available interim or group-sequential designs, and (iii) much smaller required sample size.

Keywords: Logistic regression, Logit-normal distribution, Power calculation, Sample size determination, Transformation

1. Introduction

Logistic regression analysis has been widely used to fit models for the probability of disease given marker values and to test the effect of a specific covariate on the binary response variable, often in the presence of other covariates. However, sample size determination for logistic regression is not straightforward because the model is non-linear. Several methods have been introduced for its sample size calculation. Whittemore [1] proposed a method that approximates the Fisher information matrix when the probability of response is small. Self and Mauritsen [2] introduced a general and flexible approach that approximates the power of score tests. Hsieh [3] presented sample size tables for logistic regression, in particular with normally distributed covariates, using the formula of Whittemore [1]. Later, Hsieh et al. [4] proposed simpler and more accurate formulae to calculate sample size for logistic regression models, which were implemented in PASS 12 (NCSS, Kaysville, Utah, USA).

The aforementioned methods can be readily applied to determine the sample size for simple logistic regression. However, the sample size calculation for multiple logistic regression requires some additional information, such as the coefficient of determination ($R^2_{\mathrm{cov}}$) of a covariate of interest with other covariates, which is often difficult to obtain in practice. Indeed, biochemical and pharmaceutical companies are often unwilling to share sensitive information with outside investigators because of confidentiality concerns, as described in the motivating example in Section 2. In this case, the sample size would be determined either by using simple logistic regression, which underestimates the required sample size, or by guessing the required $R^2_{\mathrm{cov}}$ values subjectively. These practices yield either under- or over-powered designs.

Some researchers have focused on the sample size calculation for a transformed outcome measure [5,6]. The simulation studies in [6] showed that skewness increases the required sample size and that the sample size is decreased the most by transformation to a normal distribution. In fact, the response variable of a logistic regression follows a logit-normal (LN) distribution, which is generated from the logistic transformation of a normal distribution [7]. These properties of logistic regression have inspired us to develop a transformation-based approach to determine the sample size. Our approach first transforms the logistic outcome measure to a normally distributed one and then determines the sample size using the t-test. The sample size determination using a transformed outcome measure has three major advantages over the existing methods: (i) no need for $R^2_{\mathrm{cov}}$ in the case of multiple logistic regression; (ii) straightforward implementation of interim or group-sequential designs based on a transformed outcome measure; and (iii) much smaller required sample size. Note that our approach applies only when the logistic outcome measure is continuous and comes from a logistic regression model; it should not be used when the outcome measure is binary or ordinal.

2. Motivating example

Prostate cancer (PrC) is the second leading cause of cancer-related death in men. Although prostate specific antigen (PSA) blood testing remains the most widely used tool for PrC detection, important efforts have been conducted to determine alternative biomarkers to overcome its lack of specificity. Recently, it has been discovered that sarcosine, alanine, glutamate, and glycine are metabolic biomarkers of PrC progression [8,9]. Using these metabolic biomarkers, a PrC diagnostic algorithm (a logistic regression model) was developed which also took into account clinical information such as PSA and prostate volume. The outcome measure of this logistic algorithm will be called the M-score.

A new study was planned to validate the M-score by comparing with the PrC biopsy result in African American (AA) men who were referred for prostate biopsy for any clinical indication. The primary hypothesis was that the M-score, which has not been extensively studied in AA men, would have similar test characteristics in AA men as it did in European American men. The question that we were asked as statisticians was: how many AA men need to be included in the study? In the previous study, the M-score was elevated in AA PrC patients compared to those with benign prostate disease using a small sample size of 18. Based on this previous result, the study was designed to detect a difference of 10 points in the mean M-score of AA men with vs. those without PrC, based on biopsy results. Since the M-score was generated from a logistic regression model, a sample size could be determined based on a logistic regression model. However, covariates included in the logistic model were blinded and the biopsy result, positive or negative, was the only covariate available to us.

3. Methods

A logit-normal (LN) distribution is a probability distribution of a random variable whose logit follows a normal distribution. If a random variable X follows a normal distribution, then its logistic transform Y = logistic(X) is logit-normally distributed, where $\operatorname{logistic}(x) = \exp(x)/\{1 + \exp(x)\}$. Likewise, if Y follows an LN distribution, then X = logit(Y) is normally distributed, where $\operatorname{logit}(y) = \log\{y/(1 - y)\}$. In this paper we are interested in determining the sample size when an outcome measure follows an LN distribution.

Suppose there are two independent groups that have been randomly and independently drawn from an LN distribution, with $n_0$ and $n_1$ observations, respectively: $Y_{01}, Y_{02}, \ldots, Y_{0n_0}$ and $Y_{11}, Y_{12}, \ldots, Y_{1n_1}$, with $n = n_0 + n_1$. That is, we assume that

$$Y_{ij} \sim \mathrm{LN}(\phi_i, \psi_i), \tag{1}$$

where LN stands for a logit-normal distribution with mean $\phi_i$ and standard deviation (SD) $\psi_i$, $i = 0, 1$, and $j = 1, 2, \ldots, n_i$. An investigator wishes to test the null hypothesis that the two population means are equal, $H_0: \phi_0 = \phi_1$. The probability density function (pdf) of an LN distribution is

$$f_Y(y; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\operatorname{logit}(y) - \mu)^2}{2\sigma^2}\right) \cdot \frac{1}{y(1 - y)}, \quad y \in (0, 1), \tag{2}$$

where logit(y)~N(μ, σ) and N stands for a normal distribution. The pdf depends on the mean and SD of logit(Y) rather than those of Y, so the best approach to the sample size determination for an LN distribution is to use a normal distribution that logit(Y) follows. This approach requires that all the individual logit transformations, logit(Yij), i = 0,1; j = 1,2, …, ni, must be available in order to calculate the mean and SD for a normal distribution. However, as in the motivating example, no information (i.e., individual observations, Yij) was available to calculate the mean and SD of logit(Y). The mean (ϕ) and SD (ψ) of Y are the only available information, and no analytical solution exists to recover the mean and SD of logit(Y) from those of Y. Therefore, two approaches are proposed to estimate the mean and SD of logit(Y): the delta method and optimization.

3.1. Delta method

The delta method is employed to approximate the mean and SD of logit(Y). Suppose Y follows an LN distribution with mean ϕ and SD ψ. Then the logit of Y, X = logit(Y), follows a normal distribution with mean μ and SD σ by the definition of an LN distribution. Since $\operatorname{logit}(Y) = \log\{Y/(1 - Y)\}$, the delta method gives the following estimates of the mean and SD of X:

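$$\hat{\mu} \approx \log\left(\frac{\phi}{1 - \phi}\right); \tag{3}$$
$$\hat{\sigma} \approx \frac{\psi}{\phi(1 - \phi)}, \tag{4}$$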

where ϕ and ψ are the mean and SD of an LN random variable Y.
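Equations (3) and (4) can be evaluated directly. The following is a minimal Python sketch, not the authors' ssLogitNorm implementation; the function name `delta_method` is a hypothetical helper introduced only for illustration.

```python
import math

def delta_method(phi, psi):
    """First-order (delta-method) approximation of the mean and SD of logit(Y).

    phi, psi: mean and SD of the logit-normal variable Y, with 0 < phi < 1.
    """
    if not 0.0 < phi < 1.0 or psi <= 0.0:
        raise ValueError("need 0 < phi < 1 and psi > 0")
    mu_hat = math.log(phi / (1.0 - phi))   # Equation (3)
    sigma_hat = psi / (phi * (1.0 - phi))  # Equation (4): psi times the slope of logit at phi
    return mu_hat, sigma_hat

mu_hat, sigma_hat = delta_method(0.1, 0.1)
print(round(mu_hat, 4), round(sigma_hat, 4))  # -2.1972 1.1111
```

For (ϕ, ψ) = (0.1, 0.1), Table 1 lists the true normal parameters as μ = −2.64 and σ = 1.10, so the biases (true minus estimate) are about −0.4428 and −0.0111, which agrees with the delta-method columns of that table.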

3.2. Optimization

The delta method is the first-order approximation to the mean and variance of a transformed variable and the logit function is non-linear over the range of values being examined. Thus, the delta method-based approximation may not fit well particularly when the coefficient of variation is higher. To circumvent this difficulty, an optimization method is used to estimate the mean and SD of logit(Y). The estimates (μ̂, σ̂) are obtained by minimizing the function in Equation (5) as follows:

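$$(\hat{\mu}, \hat{\sigma}) = \underset{\mu, \sigma}{\arg\min} \left\{ \left(1 - \frac{E(Y \mid \mu, \sigma)}{\phi}\right)^2 + \left(1 - \frac{\mathrm{Var}(Y \mid \mu, \sigma)}{\psi^2}\right)^2 \right\} \tag{5}$$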

where ϕ and ψ are the mean and SD of an LN random variable Y;

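$$E(Y \mid \mu, \sigma) = \int_0^1 z\, f_Y(z; \mu, \sigma)\, dz; \tag{6}$$
$$\mathrm{Var}(Y \mid \mu, \sigma) = \int_0^1 \{z - E(Y \mid \mu, \sigma)\}^2 f_Y(z; \mu, \sigma)\, dz. \tag{7}$$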

The first and second terms in Equation (5) are the squared relative biases of the mean and of the variance, respectively, so minimizing Equation (5) reduces the relative bias of both estimates. Note that Equations (6) and (7) are the mean and variance of an LN distribution. Since no analytical solution is available for them, numerical integration was used to obtain these values. The numerical integration was carried out by adaptive quadrature [10], as implemented in the function ‘quadinf’ of the R package pracma. Equation (5) was minimized with the R function ‘nlminb’.
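The optimization in Equation (5) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: instead of `quadinf` and `nlminb` it uses a trapezoid rule on the normal scale (the moments are computed as expectations of logistic(μ + σz) over a standard normal z) and a simple compass search started at the delta-method estimate. The names `ln_moments`, `objective`, and `fit_normal` are hypothetical.

```python
import math

def logistic(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def ln_moments(mu, sigma, half_width=8.0, steps=800):
    """Mean and variance of Y = logistic(X), X ~ N(mu, sigma), by the trapezoid rule.

    Integrates over z = (x - mu) / sigma rather than over y, which avoids the
    1 / (y (1 - y)) factor of the density near the endpoints.
    """
    h = 2.0 * half_width / steps
    m0 = m1 = m2 = 0.0
    for k in range(steps + 1):
        z = -half_width + k * h
        w = math.exp(-0.5 * z * z) * (0.5 if k in (0, steps) else 1.0)
        y = logistic(mu + sigma * z)
        m0 += w
        m1 += w * y
        m2 += w * y * y
    mean = m1 / m0
    return mean, m2 / m0 - mean * mean

def objective(params, phi, psi):
    # Equation (5): squared relative errors of the mean and the variance.
    mu, sigma = params
    if sigma <= 0.0:
        return math.inf
    mean, var = ln_moments(mu, sigma)
    return (1.0 - mean / phi) ** 2 + (1.0 - var / psi ** 2) ** 2

def fit_normal(phi, psi, step=0.25, tol=1e-6):
    """Compass search for (mu, sigma), started at the delta-method estimate."""
    x = [math.log(phi / (1.0 - phi)), psi / (phi * (1.0 - phi))]
    fx = objective(x, phi, psi)
    while step > tol:
        improved = False
        for i in range(2):
            for d in (step, -step):
                cand = list(x)
                cand[i] += d
                fc = objective(cand, phi, psi)
                if fc < fx:
                    x, fx, improved = cand, fc, True
        if not improved:
            step /= 2.0  # shrink the search pattern once no axis move helps
    return x[0], x[1]

mu_hat, sigma_hat = fit_normal(0.4, 0.2)
print(round(mu_hat, 2), round(sigma_hat, 2))
```

For (ϕ, ψ) = (0.4, 0.2) the result should land close to the normal distribution N(−0.49, 0.99) reported in Table 1. Integrating on the normal scale is a design choice of this sketch; it differs from integrating the density in Equation (2) over (0, 1), which is what the numerical difficulty discussed next refers to.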

As the SD (σ) of the corresponding normal distribution increases, however, the pdf $f_Y(y; \mu, \sigma)$ of an LN distribution given in Equation (2) numerically becomes a function that is nonzero only at y = 0 and y = 1. In other words, given μ, the pdf goes to δ(y) + δ(1 − y) as σ → ∞, where δ(y) = 0 if y ≠ 0 and 1 if y = 0. For example, Supplementary Information Figure S1(a) displays the density plots of a logit-normal distribution when the mean (μ) of the corresponding normal distribution is zero and its SD (σ) ranges from 0.1 to 1000. This plot shows that the pdf has two modes when σ ≥ 10 and numerically approaches δ(y) + δ(1 − y) for large σ. This makes accurate numerical integration difficult. To investigate the behavior of the numerical integration, the areas under each of the pdfs of Supplementary Information Figure S1(a) are computed and plotted in Supplementary Information Figure S1(b). Although all the areas under the pdfs should equal one, some of the numerically computed areas fall below one when σ ≥ 20.
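The loss of probability mass can be reproduced qualitatively with a short Python sketch. It is only an illustration: it uses a fixed-grid midpoint rule rather than the adaptive quadrature used in the paper, so the value of σ at which the computed area starts to fall below one will not match Supplementary Information Figure S1(b). The names `ln_pdf` and `area_midpoint` are hypothetical.

```python
import math

def ln_pdf(y, mu, sigma):
    # Logit-normal density of Equation (2), valid for 0 < y < 1.
    x = math.log(y / (1.0 - y))
    return (math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
            / (sigma * math.sqrt(2.0 * math.pi) * y * (1.0 - y)))

def area_midpoint(mu, sigma, bins=10_000):
    # Fixed-grid midpoint rule on (0, 1); the endpoints are never evaluated.
    h = 1.0 / bins
    return h * sum(ln_pdf((k + 0.5) * h, mu, sigma) for k in range(bins))

for sigma in (0.5, 1.0, 10.0, 100.0):
    print(sigma, round(area_midpoint(0.0, sigma), 4))
```

For small σ the computed area is essentially one, whereas for σ = 100 most of the mass sits so close to 0 and 1 that the grid misses it and the computed area is far below one.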

In particular, this numerical difficulty occurs when the optimization method (M2) is applied to three cases: (ϕ ≤ 0.1 or ϕ ≥ 0.9; ψ ≥ 0.3), (ϕ = 0.2 or 0.8; ψ ≥ 0.4), and (0.3 ≤ ϕ ≤ 0.7; ψ ≥ 0.5). In other cases, the optimization problem is well-posed. For example, the traces of the objective function in Equation (5) are investigated for (ϕ0 = 0.3, ϕ1 = 0.7) and (ϕ0 = 0.1, ϕ1 = 0.9) using the optimization method (M2), as shown in Supplementary Information Figure S2. Because of the symmetry of the LN distribution, the absolute value of the true mean (μ) and the true SD (σ) of the normal distribution corresponding to LN(ϕ, ψ) are exactly the same as those corresponding to LN(1 − ϕ, ψ). Therefore, in the ideal case, the contour plot of the objective function for ϕ0 = 0.3 should be symmetric to that for ϕ1 = 0.7 regardless of the SD (ψ). Indeed, the solution areas (i.e., darkest blue color) are symmetric to each other up to ψ = 0.3. However, once ψ exceeds 0.3, the solution areas become asymmetric, as can be seen in Supplementary Information Figure S3(a)–(e). This phenomenon is worse when (ϕ0 = 0.1, ϕ1 = 0.9): even when the SD (ψ) is 0.1, the solution areas (darkest blue color) are dissimilar (see Supplementary Information Figure S3(f)–(h)).

3.3. Sample size determination

The sample size determination for an LN distribution is considered in this section. As mentioned in the motivating example, the means and SDs of two groups whose populations follow an LN distribution are the only available information for sample size calculation. To this end, three approaches can be used. The first uses the test on proportions, since Y ∈ (0,1); the second uses simple logistic regression with one binary covariate (i.e., $R^2_{\mathrm{cov}} = 0$); and the third uses the t-test with or without transformation to a normal distribution. It should be noted that the second approach will underestimate the sample size when there are other covariates, resulting in an underpowered study design. Accordingly, the following five methods are considered in this work: (1) the t-test with the delta method (M1), (2) the t-test with the optimization (M2), (3) the t-test without transformation (M3), (4) the test on proportions (M4), and (5) simple logistic regression with one binary covariate (M5).
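As a rough illustration of M1 and M3, the sketch below computes a per-group sample size from two means and two SDs. It is an assumption-laden simplification: it uses the standard normal-approximation formula for a two-sided two-sample comparison of means, whereas the paper's calculations are based on the t-test, so the results can be slightly smaller than the tabulated values. The function names `n_per_group` and `delta_method` are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(mu0, sd0, mu1, sd1, alpha=0.05, beta=0.2):
    """Per-group n for a two-sided two-sample comparison of means (normal approximation)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0) + NormalDist().inv_cdf(1.0 - beta)
    return math.ceil(z * z * (sd0 ** 2 + sd1 ** 2) / (mu1 - mu0) ** 2)

def delta_method(phi, psi):
    # Equations (3) and (4): approximate mean and SD of logit(Y).
    return math.log(phi / (1.0 - phi)), psi / (phi * (1.0 - phi))

phi0, phi1, psi = 0.4, 0.5, 0.1
m0, s0 = delta_method(phi0, psi)
m1, s1 = delta_method(phi1, psi)
n_m1 = n_per_group(m0, s0, m1, s1)        # M1: delta method, then test on the normal scale
n_m3 = n_per_group(phi0, psi, phi1, psi)  # M3: no transformation
print(n_m1, n_m3)  # 16 16
```

For this setting (Δ = 0.1, ϕ0 = 0.4, ψ = 0.1, α = 0.05, β = 0.2), Table 2 reports 17 per group for both M1 and M3; the difference of one subject per group is presumably due to the t-based calculation used there.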

3.4. Software

An R package, ssLogitNorm, was developed for power and sample size determination for an LN distribution and is freely available at http://cansur.sourceforge.net. Brief instructions on how to use ssLogitNorm can be found in the Supplementary Information.

4. Simulation studies

Because there is no analytical solution for the mean and SD of an LN distribution, Monte Carlo simulation was performed to find the true mean and true SD of an LN distribution corresponding to those of a normal distribution. In particular, the normal distributions were chosen to correspond to LN distributions with a difference in means (Δ) of 0.1 or 0.2 and a common SD (ψ = 0.10, 0.20, or 0.30), as shown in Table 1. Because of the symmetry of the LN distribution, the absolute value of the true mean (μ) and the true SD (σ) of the normal distribution corresponding to LN(ϕ, ψ) are exactly the same as those corresponding to LN(1 − ϕ, ψ). For instance, in Table 1 the normal distribution corresponding to LN(0.6, 0.2) is N(0.49, 0.99), while that corresponding to LN(0.4, 0.2) is N(−0.49, 0.99). Note that the corresponding normal distributions could not be determined when (ϕ, ψ) is (0.1, 0.3) or (0.9, 0.3) because of the numerical integration difficulty discussed in Section 3.2. Using these values, the two proposed approaches to a normal transformation are evaluated in terms of bias in the estimates of the mean and SD. The sample size calculation is further compared with other methods.
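The correspondence between a normal distribution and its LN moments can be checked by simulation. The sketch below is a forward check only (normal parameters in, LN mean and SD out); the search that the authors ran to build Table 1 is not described in detail, so nothing here should be read as their procedure. The function name `simulate_ln_moments`, the number of draws, and the seed are arbitrary assumptions.

```python
import math
import random

def simulate_ln_moments(mu, sigma, draws=200_000, seed=1):
    """Monte Carlo mean and SD of Y = logistic(X) with X ~ N(mu, sigma)."""
    rng = random.Random(seed)
    total = total_sq = 0.0
    for _ in range(draws):
        y = 1.0 / (1.0 + math.exp(-rng.gauss(mu, sigma)))
        total += y
        total_sq += y * y
    mean = total / draws
    return mean, math.sqrt(total_sq / draws - mean * mean)

print(simulate_ln_moments(0.49, 0.99))   # expected to be near (0.6, 0.2)
print(simulate_ln_moments(-0.49, 0.99))  # expected to be near (0.4, 0.2)
```

The two calls also illustrate the symmetry noted above: flipping the sign of μ maps the LN mean ϕ to 1 − ϕ and leaves the LN SD unchanged, up to simulation noise.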

Table 1. Transformation to a Normal distribution.

Δ is the difference in mean between two logit-normal random variables, ϕ1 = ϕ0 + Δ, where ϕi is the logit-normal mean of the ith group; LN and Normal stand for a logit-normal distribution and its corresponding normal distribution, respectively. Bias is calculated as the difference between true and estimate (i.e., true – estimate); ψ is the logit-normal standard deviation of both groups; μi and σi are the mean and standard deviation of a normal distribution of the ith group, respectively.

(a) Δ = 0.1

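Column groups, left to right: LN (ϕ0, ψ); true Normal (μ0, μ1, σ0, σ1); bias of the delta method, M1 (μ0, μ1, σ0, σ1); bias of the optimization, M2 (μ0, μ1, σ0, σ1). Rows that omit ϕ0 continue the ϕ0 of the row above; '-' marks entries that could not be computed.
ϕ0 ψ μ0 μ1 σ0 σ1 μ0 μ1 σ0 σ1 μ0 μ1 σ0 σ1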
0.1 0.1 −2.64 −1.50 1.10 0.65 −0.4428 −0.1137 −0.0111 0.0250 −0.0023 0.0046 −0.005 0.0025
0.2 −4.46 −1.93 2.99 1.52 −2.2628 −0.5437 0.7678 0.2700 −0.0042 0.0004 0.0048 −0.0040
0.3 - −3.36 - 3.57 - −1.9737 - 1.6950 - −0.0019 - −0.0033
0.2 0.1 −1.50 −0.89 0.65 0.50 −0.1137 −0.0427 0.0250 0.0238 0.0046 0.0042 0.0025 0.0042
0.2 −1.93 −1.06 1.52 1.14 −0.5437 −0.2127 0.2700 0.1876 0.0004 0.0019 −0.0040 0.0009
0.3 −3.36 −1.52 3.57 2.31 −1.9737 −0.6727 1.6950 0.8814 −0.0019 −0.0049 −0.0033 0.0000
0.3 0.1 −0.89 −0.42 0.50 0.43 −0.0427 −0.0145 0.0238 0.0133 0.0042 0.0035 0.0042 −0.0039
0.2 −1.06 −0.49 1.14 0.99 −0.2127 −0.0845 0.1876 0.1567 0.0019 −0.0019 0.0009 0.0020
0.3 −1.52 −0.65 2.31 1.92 −0.6727 −0.2445 0.8814 0.6700 −0.0049 0.0019 0.0000 −0.0002
0.4 0.1 −0.42 0.00 0.43 0.42 −0.0145 0.0000 0.0133 0.0200 0.0035 0.0000 −0.0039 0.0035
0.2 −0.49 0.00 0.99 0.95 −0.0845 0.0000 0.1567 0.1500 −0.0019 0.0000 0.0020 0.0041
0.3 −0.65 0.00 1.92 1.82 −0.2445 0.0000 0.6700 0.6200 0.0019 0.0000 −0.0002 0.0013
0.5 0.1 0.00 0.42 0.42 0.43 0.0000 0.0145 0.0200 0.0133 0.0000 −0.0035 0.0035 −0.0039
0.2 0.00 0.49 0.95 0.99 0.0000 0.0845 0.1500 0.1567 0.0000 0.0019 0.0041 0.0020
0.3 0.00 0.65 1.82 1.92 0.0000 0.2445 0.6200 0.6700 0.0000 −0.0019 0.0013 −0.0002
0.6 0.1 0.42 0.89 0.43 0.50 0.0145 0.0427 0.0133 0.0238 −0.0035 −0.0042 −0.0039 0.0042
0.2 0.49 1.06 0.99 1.14 0.0845 0.2127 0.1567 0.1876 0.0019 −0.0019 0.0020 0.0009
0.3 0.65 1.52 1.92 2.31 0.2445 0.6727 0.6700 0.8814 −0.0019 0.0049 −0.0002 0.0000
0.7 0.1 0.89 1.50 0.50 0.65 0.0427 0.1137 0.0238 0.0250 −0.0042 −0.0046 0.0042 0.0025
0.2 1.06 1.93 1.14 1.52 0.2127 0.5437 0.1876 0.2700 −0.0019 −0.0004 0.0009 −0.0040
0.3 1.52 3.36 2.31 3.57 0.6727 1.9737 0.8814 1.6950 0.0049 0.0019 0.0000 −0.0033
(b) Δ = 0.2

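Column groups, left to right: LN (ϕ0, ψ); true Normal (μ0, μ1, σ0, σ1); bias of the delta method, M1 (μ0, μ1, σ0, σ1); bias of the optimization, M2 (μ0, μ1, σ0, σ1). Rows that omit ϕ0 continue the ϕ0 of the row above; '-' marks entries that could not be computed.
ϕ0 ψ μ0 μ1 σ0 σ1 μ0 μ1 σ0 σ1 μ0 μ1 σ0 σ1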
0.1 0.1 −2.64 −0.89 1.10 0.50 −0.4428 −0.0427 −0.0111 0.0238 −0.0023 0.0042 −0.0050 0.0042
0.2 −4.46 −1.06 2.99 1.14 −2.2628 −0.2127 0.7678 0.1876 −0.0042 0.0019 0.0048 0.0009
0.3 - −1.52 - 2.31 - −0.6727 - 0.8814 - −0.0049 - 0.0000
0.2 0.1 −1.50 −0.42 0.65 0.43 −0.1137 −0.0145 0.0250 0.0133 0.0046 0.0035 0.0025 −0.0039
0.2 −1.93 −0.49 1.52 0.99 −0.5437 −0.0845 0.2700 0.1567 0.0004 −0.0019 −0.0040 0.0020
0.3 −3.36 −0.65 3.57 1.92 −1.9737 −0.2445 1.6950 0.6700 −0.0019 0.0019 −0.0033 −0.0002
0.3 0.1 −0.89 0.00 0.50 0.42 −0.0427 0.0000 0.0238 0.0200 0.0042 0.0000 0.0042 0.0035
0.2 −1.06 0.00 1.14 0.95 −0.2127 0.0000 0.1876 0.1500 0.0019 0.0000 0.0009 0.0041
0.3 −1.52 0.00 2.31 1.82 −0.6727 0.0000 0.8814 0.6200 −0.0049 0.0000 0.0000 0.0013
0.4 0.1 −0.42 0.42 0.43 0.43 −0.0145 0.0145 0.0133 0.0133 0.0035 −0.0035 −0.0039 −0.0039
0.2 −0.49 0.49 0.99 0.99 −0.0845 0.0845 0.1567 0.1567 −0.0019 0.0019 0.0020 0.0020
0.3 −0.65 0.65 1.92 1.92 −0.2445 0.2445 0.6700 0.6700 0.0019 −0.0019 −0.0002 −0.0002
0.5 0.1 0.00 0.89 0.42 0.50 0.0000 0.0427 0.0200 0.0238 0.0000 −0.0042 0.0035 0.0042
0.2 0.00 1.06 0.95 1.14 0.0000 0.2127 0.1500 0.1876 0.0000 −0.0019 0.0041 0.0009
0.3 0.00 1.52 1.82 2.31 0.0000 0.6727 0.6200 0.8814 0.0000 0.0049 0.0013 0.0000
0.6 0.1 0.42 1.50 0.43 0.65 0.0145 0.1137 0.0133 0.0250 −0.0035 −0.0046 −0.0039 0.0025
0.2 0.49 1.93 0.99 1.52 0.0845 0.5437 0.1567 0.2700 0.0019 −0.0004 0.0020 −0.0040
0.3 0.65 3.36 1.92 3.57 0.2445 1.9737 0.6700 1.6950 −0.0019 0.0019 −0.0002 −0.0033
0.7 0.1 0.89 2.64 0.50 1.10 0.0427 0.4428 0.0238 −0.0111 −0.0042 0.0023 0.0042 −0.0050
0.2 1.06 4.46 1.14 2.99 0.2127 2.2628 0.1876 0.7678 −0.0019 0.0042 0.0009 0.0048
0.3 1.52 - 2.31 - 0.6727 - 0.8814 - 0.0049 - 0.0000 -

Table 1 displays the results of a normal transformation using the delta method and the optimization approach. As the mean (ϕ) of an LN distribution gets closer to 0.5, the estimates of the corresponding normal distribution have smaller bias for both methods. As the SD (ψ) of an LN distribution increases, the bias of the estimates increases for the delta method, but the optimization approach shows no such trend. As expected, M2 approximates the mean (μ) and SD (σ) of the corresponding normal distribution better than M1 does. Note that because no true value is available when (ϕ, ψ) is (0.1, 0.3) or (0.9, 0.3), no bias is given for these cases in Table 1.

The proposed methods are further evaluated by the sample size determination. The sample sizes are estimated for (α = 0.1, β = 0.1), (α = 0.05, β = 0.2), and (α = 0.05, β = 0.1) with a two-sided two-sample test, and the estimates are shown in Table 2. Note that the sample size in the table is per group. Since the proposed methods are based on a normal distribution, the true sample sizes for a normal distribution can be obtained; these are given in the column ‘True’ in the table. When the SD (ψ) of an LN distribution is small, both the delta (M1) and the optimization (M2) approaches give sample sizes very similar to the true sample size. However, as the SD (ψ) increases, the sample size of M1 becomes larger than the true sample size, while M2 stays close to the true sample size regardless of the size of the SD (ψ). M2 performs better because its transformation to a normal distribution is more accurate than that of M1, as shown in Table 1. However, in the cases of (Δ = 0.1, ϕ0 = 0.1, ψ = 0.3), (Δ = 0.2, ϕ0 = 0.1, ψ = 0.3), and (Δ = 0.2, ϕ0 = 0.7, ψ = 0.3), M2 fails to estimate the required sample size because of the aforementioned numerical difficulty. In contrast, M1 does not suffer from the numerical difficulty, even in these cases, owing to its analytical solutions (i.e., Equations (3) and (4)), and therefore calculates the required sample sizes for all cases.

Table 2. Sample size determinations.

Δ is the difference in mean between two logit-normal random variables, ϕ1 = ϕ0 + Δ, where ϕi is the logit-normal mean of the ith group; LN and Normal stand for a logit-normal distribution and its corresponding normal distribution, respectively; ψ is the logit-normal standard deviation of both groups; M1: Using the t-test with the delta method, M2: Using the t-test with the optimization, M3: Using the t-test without transformation, M4: Using the test on proportions, and M5: Using the logistic regression with one binary covariate.

(a) Δ = 0.1

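For each (α, β) setting, the six columns are, in order: True, M1, M2 (the three transformation-based values), M3, M4, M5. The three settings appear left to right as α = 0.1, β = 0.1; α = 0.05, β = 0.2; α = 0.05, β = 0.1. Rows that omit ϕ0 continue the ϕ0 of the row above; '-' marks entries that could not be computed.
ϕ0 ψ True M1 M2 M3 M4 M5 True M1 M2 M3 M4 M5 True M1 M2 M3 M4 M5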
0.1 0.1 12 22 12 18 837 216 11 21 12 17 767 199 15 27 15 23 1027 266
0.2 16 86 16 70 837 216 15 79 15 64 767 199 20 105 20 86 1027 266
0.3 - 192 - 155 837 216 - 176 - 143 767 199 - 235 - 191 1027 266
0.2 0.1 17 19 17 18 802 319 16 18 16 17 736 293 21 24 20 23 985 392
0.2 42 74 42 70 802 319 39 68 39 64 736 293 52 91 52 86 985 392
0.3 47 165 47 155 802 319 43 152 43 143 736 293 58 202 57 191 985 392
0.3 0.1 18 19 18 18 751 388 17 18 17 17 688 356 22 23 22 23 921 476
0.2 61 71 60 70 751 388 57 66 56 64 688 356 75 88 74 86 921 476
0.3 103 159 105 155 751 388 95 146 97 143 688 356 127 195 129 191 921 476
0.4 0.1 19 19 18 18 682 422 18 17 17 17 625 387 23 23 23 23 837 518
0.2 68 71 68 70 682 422 63 65 63 64 625 387 84 87 84 86 837 518
0.3 143 158 142 155 682 422 131 145 131 143 625 387 176 193 174 191 837 518
0.5 0.1 19 19 18 18 596 422 18 17 17 17 546 387 23 23 23 23 731 518
0.2 68 71 68 70 596 422 63 65 63 64 546 387 84 87 84 86 731 518
0.3 143 158 142 155 596 422 131 145 131 143 546 387 176 193 174 191 731 518
0.6 0.1 18 19 18 18 493 388 17 18 17 17 452 356 22 23 22 23 604 476
0.2 61 71 60 70 493 388 57 66 56 64 452 356 75 88 74 86 604 476
0.3 103 159 105 155 493 388 95 146 97 143 452 356 127 195 129 191 604 476
0.7 0.1 17 19 17 18 372 319 16 18 16 17 341 293 21 24 20 23 456 392
0.2 42 74 42 70 372 319 39 68 39 64 341 293 52 91 52 86 456 392
0.3 47 165 47 155 372 319 43 152 43 143 341 293 58 202 57 191 456 392
(b) Δ = 0.2

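For each (α, β) setting, the six columns are, in order: True, M1, M2 (the three transformation-based values), M3, M4, M5. The three settings appear left to right as α = 0.1, β = 0.1; α = 0.05, β = 0.2; α = 0.05, β = 0.1. Rows that omit ϕ0 continue the ϕ0 of the row above; '-' marks entries that could not be computed.
ϕ0 ψ True M1 M2 M3 M4 M5 True M1 M2 M3 M4 M5 True M1 M2 M3 M4 M5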
0.1 0.1 5 8 5 6 205 67 5 8 5 6 188 62 7 10 7 7 252 82
0.2 9 29 9 18 205 67 9 27 9 17 188 62 11 35 11 23 252 82
0.3 - 63 - 40 205 67 - 58 - 37 188 62 - 77 - 49 252 82
0.2 0.1 6 6 6 6 194 88 6 6 6 6 178 81 7 8 7 7 238 108
0.2 15 21 15 18 194 88 14 20 14 17 178 81 18 26 18 23 238 108
0.3 20 46 20 40 194 88 19 43 19 37 178 81 25 57 25 49 238 108
0.3 0.1 6 6 6 6 179 101 6 6 6 6 164 93 7 7 7 7 220 124
0.2 18 20 18 18 179 101 17 18 17 17 164 93 22 24 22 23 220 124
0.3 33 43 33 40 179 101 31 40 31 37 164 93 41 52 41 49 220 124
0.4 0.1 6 6 6 6 160 105 6 6 6 6 146 97 7 7 7 7 196 129
0.2 19 19 19 18 160 105 18 18 18 17 146 97 23 24 23 23 196 129
0.3 39 42 38 40 160 105 36 39 36 37 146 97 47 51 47 49 196 129
0.5 0.1 6 6 6 6 136 101 6 6 6 6 124 93 7 7 7 7 166 124
0.2 18 20 18 18 136 101 17 18 17 17 124 93 22 24 22 23 166 124
0.3 33 43 33 40 136 101 31 40 31 37 124 93 41 52 41 49 166 124
0.6 0.1 6 6 6 6 107 88 6 6 6 6 98 81 7 8 7 7 131 108
0.2 15 21 15 18 107 88 14 20 14 17 98 81 18 26 18 23 131 108
0.3 20 46 20 40 107 88 19 43 19 37 98 81 25 57 25 49 131 108
0.7 0.1 5 8 5 6 73 67 5 8 5 6 67 62 7 10 7 7 89 82
0.2 9 29 9 18 73 67 9 27 9 17 67 62 11 35 11 23 89 82
0.3 - 63 - 40 73 67 - 58 - 37 67 62 - 77 - 49 89 82

In addition, the sample sizes have been determined using the t-test (M3), the test on proportions (M4), and simple logistic regression with one binary covariate (M5). Unlike M1 and M2, the methods M3, M4, and M5 calculate the sample size without transformation. The sample size of M3 is similar to that of M1, while the sample sizes of M4 and M5 are much larger than those of the other methods. In particular, the sample sizes of M4 and M5 are comparable when ϕ0 ≥ 0.5, while M4 requires a much larger sample size than M5 does when ϕ0 ≤ 0.4 in Table 2.

5. Application to the motivating example

The five methods M1 to M5 are applied to the motivating example for sample size determination, using the outcomes obtained from the previous study described in Section 2. There are two groups: PrC biopsy positive (Group 1) and negative (Group 2). The sample size is determined for M-score means and SDs of (0.69, 0.11) for Group 1 and (0.59, 0.18) for Group 2. Note that the sample size for M5 is calculated under the assumption that there is no other covariate (i.e., $R^2_{\mathrm{cov}} = 0$), using PASS 12 (NCSS, Kaysville, Utah, USA), which implements the method of Hsieh et al. [4]. When α = 0.05 and β = 0.2 with a two-sided two-sample test, the estimated sample sizes of M1, M2, and M3 are 35, 44, and 36 per group, respectively, while those of M4 and M5 are 462 and 350 per group, respectively. As in the simulation studies, M4 has the largest sample size and the methods M1 to M3 have similar sample sizes. Since $R^2_{\mathrm{cov}} \neq 0$ in this motivating example, the sample size determined by simple logistic regression is underestimated. Nevertheless, comparing M2 with M5, the required sample size of M2 is more than 7 times smaller than that of M5.
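The M1 figure can be approximately retraced by hand from Equations (3) and (4). The arithmetic below is a sketch that assumes the normal-approximation sample size formula for a two-sided two-sample comparison of means; the paper's value comes from a t-test calculation, which presumably explains the small difference.

$$
\begin{aligned}
\hat{\mu}_1 &\approx \log\left(\frac{0.69}{0.31}\right) \approx 0.800, &
\hat{\sigma}_1 &\approx \frac{0.11}{0.69 \times 0.31} \approx 0.514, \\
\hat{\mu}_2 &\approx \log\left(\frac{0.59}{0.41}\right) \approx 0.364, &
\hat{\sigma}_2 &\approx \frac{0.18}{0.59 \times 0.41} \approx 0.744, \\
n &\approx \frac{(z_{0.975} + z_{0.80})^2\,(\hat{\sigma}_1^2 + \hat{\sigma}_2^2)}{(\hat{\mu}_1 - \hat{\mu}_2)^2}
 \approx \frac{7.849 \times 0.818}{0.190} \approx 33.8 .
\end{aligned}
$$

Rounding up gives 34 per group under this approximation, compared with the 35 per group reported for M1. The same formula applied to the untransformed means and SDs (M3) gives $7.849 \times (0.11^2 + 0.18^2)/0.10^2 \approx 34.9$, that is, 35 per group, compared with the reported 36.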

6. Discussion

Logistic regression is one of the most popular methods in diagnostic classification. It is known that the response variable of the logistic regression follows a logit-normal distribution [7]. However, due to the complexity of an LN distribution, it is difficult to estimate the sample size directly. Therefore, two transformation-based approaches were introduced for the sample size determination for an LN distribution.

There are several studies showing that a transformation to a normal distribution can reduce the required sample size [5,6]. Our simulation studies are consistent with those previous studies. One approach to consider is using the t-test without transformation (M3). The estimated sample size of M3 is very close to that of M1 when the SD is small in our simulation studies and, likewise, the methods M1 and M3 result in similar sample sizes of 35 and 36 per group, respectively, in the motivating example. Nevertheless, the two methods imply different outcome analyses: M3 analyzes the outcomes on the raw scale, while M1 analyzes the transformed outcomes. For example, the estimated density plots for M1 and M3 are depicted in Figure 1 using the results estimated from the motivating example. The density plot of M1 is the estimated LN density plot, and that of M3 is a normal density plot. Although the methods M1 and M3 result in very similar sample sizes, the distribution of M1 is left-skewed, while that of M3 is symmetric. In addition, it can be seen that the distribution of M3 is right-truncated. Therefore, an M3-based analysis could be misleading, especially when the true logit-normal density is excessively skewed.

Figure 1. Estimated density plots of the motivating example.

One drawback of the optimization approach (M2) is the instability of the numerical integration when it is applied to the three cases (ϕ ≤ 0.1 or ϕ ≥ 0.9; ψ ≥ 0.3), (ϕ = 0.2 or 0.8; ψ ≥ 0.4), and (0.3 ≤ ϕ ≤ 0.7; ψ ≥ 0.5), as discussed in Section 3.2. However, this will not prevent the proposed optimization method (M2) from being used for trial design, because practitioners usually plan studies with a small ψ. Furthermore, because the delta method (M1) does not suffer from numerical difficulty, M1 can calculate the required sample size even when ψ is large. Therefore, when the variance is large, one might instead use M1, which still yields smaller sample sizes than M4 and M5 do.

Several major benefits of using the proposed methods are as follows. First, in the case of multiple logistic regression, the proposed methods require no $R^2_{\mathrm{cov}}$ between a covariate of interest and other covariates. In particular, in the motivating example, an underestimated sample size is inescapable if logistic regression with one binary covariate (M5) is used while ignoring other covariates (i.e., $R^2_{\mathrm{cov}} = 0$). In contrast, the proposed methods calculate the sample size solely from the response variable, irrespective of covariates, meaning that the descriptive statistics of the response variable are the only required information.

Second, interim or group-sequential designs can be implemented with ease based on a transformed outcome measure. In order to save time and resources as well as to reduce study patients' exposure to an inferior treatment, there is a high demand for interim or group-sequential designs. However, such designs using logistic regression are not straightforward due to their complexity. On the other hand, the t-test has available versions of interim or group-sequential designs. Therefore, the proposed methods can be readily used to generate interim and group-sequential designs based on the normal-transformed outcome measures.

Third, the proposed methods have much smaller required sample sizes than other methods do. As shown in the simulation studies, the methods M4 and M5 yield the largest sample sizes among the five methods. The studies in reference [6] show that the required sample size can be decreased the most when the outcome measure of interest follows a normal distribution. This implies that the proposed methods can achieve the same power and significance level as do the methods M4 and M5 but with much smaller sample sizes. In particular, when the variance of the response variable is small, either method M1 or M2 can be used, but M2 will be a better choice when the variance is large. In addition, an R package ssLogitNorm is publicly available, which can be run in a web browser. The ssLogitNorm package is user-friendly and developed in an interactive web application platform.

Supplementary Material

Supplementary Information

Acknowledgement

The authors would like to thank the editor and the anonymous reviewers. The Biostatistics Core is supported in part by NIH Cancer Center Support Grant P30 CA022453 to the Karmanos Cancer Institute at Wayne State University.

References

1. Whittemore A. Sample size for logistic regression with small response probability. Journal of the American Statistical Association. 1981;76:27–32.
2. Self SG, Mauritsen RH. Power/sample size calculations for generalized linear models. Biometrics. 1988;44:79–86.
3. Hsieh FY. Sample size tables for logistic regression. Statistics in Medicine. 1989;8:795–802. doi: 10.1002/sim.4780080704.
4. Hsieh FY, Bloch DA, Larsen MD. A simple method of sample size calculation for linear and logistic regression. Statistics in Medicine. 1998;17:1623–1634. doi: 10.1002/(sici)1097-0258(19980730)17:14<1623::aid-sim871>3.0.co;2-s.
5. Wolfe R, Carlin JB. Sample-size calculation for a log-transformed outcome measure. Controlled Clinical Trials. 1999;20:547–554. doi: 10.1016/s0197-2456(99)00032-x.
6. Jin H, Zhao X. Transformation and sample size. Sweden: Department of Economics and Society: Dalarna University; 2009.
7. Frederic P, Lad F. Two moments of the logitnormal distribution. Communications in Statistics-Simulation and Computation. 2008;37:1263–1269.
8. Sreekumar A, Poisson LM, Rajendiran TM, et al. Metabolomic profiles delineate potential role for sarcosine in prostate cancer progression. Nature. 2009;457:910–914. doi: 10.1038/nature07762. [Research Misconduct Found]
9. McDunn JE, Li Z, Adam KP, et al. Metabolomic signatures of aggressive prostate cancer. The Prostate. 2013;73:1547–1560. doi: 10.1002/pros.22704.
10. Kuonen D. Numerical integration in S-plus or R: a survey. Journal of Statistical Software. 2003;8:1–14.
