PLOS ONE. 2020 Mar 11;15(3):e0229615. doi: 10.1371/journal.pone.0229615

A simple model suggesting economically rational sample-size choice drives irreproducibility

Oliver Braganza 1,*
Editor: Luis M Miller2
PMCID: PMC7065751  PMID: 32160229

Abstract

Several systematic studies have suggested that a large fraction of published research is not reproducible. One probable reason for low reproducibility is insufficient sample size, resulting in low power and low positive predictive value. It has been suggested that the choice of insufficient sample sizes is driven by a combination of scientific competition and ‘positive publication bias’. Here we formalize this intuition in a simple model, in which scientists choose economically rational sample sizes, balancing the cost of experimentation against income from publication. Specifically, assuming that a scientist’s income derives only from ‘positive’ findings (positive publication bias) and that individual samples cost a fixed amount allows us to leverage basic statistical formulas into an economic optimality prediction. We find that if effects have i) low base probability, ii) small effect size or iii) low grant income per publication, then the rational (economically optimal) sample size is small. Furthermore, for plausible distributions of these parameters we find a robust emergence of a bimodal distribution of obtained statistical power and low overall reproducibility rates, both matching empirical findings. Finally, we explore conditional equivalence testing as a means to align economic incentives with adequate sample sizes. Overall, the model describes a simple mechanism explaining both the prevalence and the persistence of small sample sizes, and is well suited for empirical validation. It proposes economic rationality, or economic pressures, as a principal driver of irreproducibility and suggests strategies to change this.

Introduction

Systematic attempts at replicating published research have produced disquietingly low reproducibility rates, often below 50% [1–5]. A recent survey suggests that a vast majority of scientists believe we are currently in a ‘reproducibility crisis’ [6]. While the term ‘crisis’ is contested [7], the available evidence on reproducibility certainly raises questions. One likely reason for low reproducibility rates is insufficient sample size and the resulting low statistical power and positive predictive value [8–12]. In the most prevalent scientific statistical framework, null-hypothesis significance testing (NHST), the statistical power of a study is the probability of detecting a hypothesized effect with a given sample size. Insufficient power reduces the probability that a given hypothesis can be supported by statistical significance. Insufficient sample sizes therefore directly impair a scientist’s purported goal of providing evidence for a hypothesis. Additionally, small sample sizes imply low positive predictive value (PPV), i.e. a low probability that a given, statistically significant finding is indeed true [8, 10]. Therefore small sample sizes undermine not only the purported goal of the individual researcher, but also the reliability of the scientific literature in general.

Despite this, there is substantial evidence that chosen sample sizes are overwhelmingly too small [10–15]. For instance, in neuroscientific research, systematic evaluation of meta-analyses in various subfields yielded mean power estimates of 8 to 31% [10], substantially less than the commonly targeted 80%. Notably, these estimates should be considered optimistic, because they are based on effect size estimates from meta-analyses, which are in turn likely to be inflated due to publication bias [11, 16]. Remarkably, more prestigious journals appear to contain particularly small sample sizes [11, 17, 18]. Moreover, the scientific practice of choosing insufficient sample sizes appears to be extremely persistent, despite perennial calls for improvement since at least 1962 [11, 13, 19].

Perhaps the most prominent explanation for this phenomenon is the competitive scientific environment [6, 20, 21]. Scientists must maximize the number and impact of publications they produce with scarce resources (time, funding) in order to secure further funding and often, by implication, their job. For instance Smaldino and McElreath have suggested that ‘efficient’ scientists may ‘farm’ significant (i.e. publishable) results with low sample sizes [13]. This suggests that sample-size choices may reflect an economic equilibrium or, in other words, that small sample sizes may be economically rational. Notably, economic equilibria may be enforced not only by rational choice but also through competitive selection mechanisms (see Discussion) [13, 22, 23]. The existence of an economic equilibrium of small sample sizes would help to explain both the prevalence and the persistence of underpowering. Recently, this economic argument has been formally explored in two related optimality models [24, 25]. While similar to the present model in spirit and conclusion, these models contain some higher order parameters, creating challenges for empirical validation. Here, we present a simple model, well suited to empirical validation, in which observed sample sizes reflect an economic equilibrium. Scientists choose a sample size to maximize their profit by balancing the cost of experimentation with the income following from successful publications. For simplicity we assume only statistically significant ‘positive findings’ can be published and converted to funding, reflecting ‘positive publication bias’. The model predicts an (economically rational) equilibrium sample size (ESS), for a given base probability of true results (b), effect size (d), and mean grant income per publication (IF). We find that i) lower b leads to lower ESS, ii) greater d and IF lead to larger ESS. For plausible parameter distributions, the model predicts a bi-modal distribution of achieved power and reproducibility rates below 50%, both in line with empirical findings. Finally, we explore the ability of conditional equivalence testing [26] to address these issues and find that it leads to almost uniformly superior outcomes.

Materials and methods

Model

Economically rational scientists choose sample sizes to maximize their Profit from science, given by their Income from funding minus the Cost of experimentation (Eq 1). For simplicity we assume that they receive funding only if they publish, and that they can publish only positive results. The first condition reflects the dependence of funding decisions on the publication record, as captured by the adage ‘publish or perish’. The second condition captures the well documented phenomenon of positive publication bias (see Central assumptions section below). Specifically,

$$ \mathit{Profit}(s, IF, d, b) \;=\; \underbrace{IF \times TPR(s, d, b)}_{\text{Income}} \;-\; \underbrace{s}_{\text{Cost}} \qquad (1) $$

where IF is a positive constant reflecting mean grant income per publication (Income Factor), and TPR(s, d, b) is the total publishable rate given a sample size (s), effect size (d) and base probability of true effects (b). The latter term (b) [27] has also been called the ‘pre-study probability of a relationship being true (R/(R + 1))’ [8]. At the same time, scientists incur the cost of experimentation, which is assumed to be linearly related to sample size (s). For simplicity we scale IF as the number of sample pairs purchasable per publication, such that the cost of experimentation reduces to s. Accordingly, each sample pair costs one monetary (or temporal) unit. TPR(s, d, b) is the sum of the false and true positive rates and can be calculated using basic statistical formulas [8, 27] (Eq 2):

$$ TPR(s, d, b) \;=\; \underbrace{\alpha \times (1 - b)}_{\text{false positive rate}} \;+\; \underbrace{(1 - \beta(s)) \times b}_{\text{true positive rate}} \qquad (2) $$

where α = 0.05 is the Type-1 error, β(s) the Type-2 error, and (1 − β(s)) the statistical power. The equilibrium sample size (ESS) is then the sample size at which Profit is maximal.
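For quick reference, Eqs 1 and 2 and the resulting ESS search can be sketched in a few lines of Python, analogous to the supporting model code (this is a minimal illustration; the brute-force search bound s_max and the helper names are choices made here, not model parameters):

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

ALPHA = 0.05  # Type-1 error, fixed throughout the model

def tpr(s, d, b):
    """Total publishable rate (Eq 2): false positives plus true positives,
    with power from a two-sided, unpaired t-test on two groups of size s."""
    power = TTestIndPower().power(effect_size=d, nobs1=s, alpha=ALPHA,
                                  ratio=1.0, alternative='two-sided')
    return ALPHA * (1 - b) + power * b

def profit(s, IF, d, b):
    """Profit (Eq 1): income from publishable results minus sampling cost s."""
    return IF * tpr(s, d, b) - s

def ess(IF, d, b, s_min=4, s_max=1000):
    """Equilibrium sample size: the s (per group) at which Profit is maximal."""
    sizes = np.arange(s_min, s_max + 1)
    profits = np.array([profit(s, IF, d, b) for s in sizes])
    return int(sizes[np.argmax(profits)])
```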

To model conditional equivalence testing (CET), we assumed the procedure described by [26]. Briefly, all negative results are subjected to an equivalence test to establish whether they are statistically significant negative findings. A significant negative is here defined as an effect within previously determined equivalence bounds (±Δ), which are set to the ‘smallest effect size of interest’ [28]. The total publishable rate for CET (TPRCET(s, d, b, Δ)) is thus the sum of TPR(s, d, b) and the subsequently detected significant negative findings (Eq 3).

$$ TPR_{CET}(s, d, b, \Delta) \;=\; TPR(s, d, b) \;+\; \underbrace{\alpha_{CET} \times b \times \beta(s)}_{\text{false negative rate}} \;+\; \underbrace{(1 - \beta_{CET}(s, \Delta)) \times (1 - b) \times (1 - \alpha)}_{\text{true negative rate}} \qquad (3) $$

where αCET = 0.05 is the Type-1 error, βCET(s, Δ) the Type-2 error, (1 − βCET(s, Δ)) the statistical powerCET, and Δ the equivalence bound of the equivalence test. Note the additional correction factors β(s) and (1 − α) for the false and true negative rates, respectively, which account for the fact that the equivalence test is performed conditionally on the lack of a previous significant positive finding. The power of the CET (1 − βCET(s, Δ)) was computed using the two one-sided t-tests (TOST) procedure for independent-sample t-tests using the TOSTER package in R [29] (TOSTER::power.TOST.two with αCET = 0.05). Profit is then computed as above, but with all published findings (TPRCET(s, d, b, Δ)) instead of only positive findings (TPR(s, d, b)) contributing to Income.
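Eq 3 can be sketched in the same style (continuing the previous sketch, i.e. reusing ALPHA and tpr()). Note that power_cet below uses a normal approximation to the TOST power under a true null effect, standing in for the exact t-based computation from the R TOSTER package used in the actual model:

```python
from scipy.stats import norm
from statsmodels.stats.power import TTestIndPower

ALPHA_CET = 0.05  # Type-1 error of the conditional equivalence test

def power_cet(s, delta):
    """Approximate power of the TOST equivalence test (bounds +/- delta, in
    standardized units) when the true effect is zero."""
    se = (2.0 / s) ** 0.5                  # approx. SE of Cohen's d for two groups of size s
    z_crit = norm.ppf(1 - ALPHA_CET)
    return max(0.0, 2 * norm.cdf(delta / se - z_crit) - 1)

def tpr_cet(s, d, b, delta):
    """Total publishable rate under CET (Eq 3): TPR plus significant negatives."""
    beta = 1 - TTestIndPower().power(effect_size=d, nobs1=s, alpha=ALPHA,
                                     alternative='two-sided')
    # true effects missed by the NHST that spuriously pass the equivalence test
    false_negatives = ALPHA_CET * b * beta
    # true nulls not flagged as false positives and detected by the equivalence test
    true_negatives = power_cet(s, delta) * (1 - b) * (1 - ALPHA)
    return tpr(s, d, b) + false_negatives + true_negatives
```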

Positive predictive value (PPV) is computed as the fraction of true published findings to total published findings. All model code is added as supporting information.
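For the core model, where only positive findings are published, the PPV defined above reduces (in the notation of Eq 2, cf. [8, 10]) to:

$$ PPV(s, d, b) \;=\; \frac{(1 - \beta(s)) \times b}{(1 - \beta(s)) \times b \;+\; \alpha \times (1 - b)} $$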

Central assumptions

Our model relies on three central simplifying assumptions, which we first make explicit and justify:

  1. Economic equilibrium sample sizes are the result of profit maximization (i.e. optimization).

  2. Due to positive publication bias, scientists can publish only positive results, and receive income proportional to their publication rate.

  3. Sample size is chosen for a set of parameters (b, d, IF, see Table 1) which are externally given (e.g. by the research field).

Table 1. Parameter space: input parameters to the model.

Parameter | Description | Range
base rate (b) | base rate of true positive effects (also called the pre-study probability of a true effect) | 0–1
effect size (d) | Cohen's d (effect normalized to standard deviation) | 0.1–1.5
Income Factor (IF) | number of sample pairs purchasable per publication | 100–1000

The first assumption (profit maximization) can most simply be construed as rational choice in the economic sense, but may also be the outcome of competitive selection [13]. For instance, if funding is stochastic, scientists who choose profit-maximizing sample sizes would have an increased chance of survival. In contexts where the cost of sampling is mainly researcher time, profit can similarly be interpreted in terms of time. While rational choice would depend on private estimates of the parameters (b, d, IF), competitive selection could operate through a process of cultural evolution, potentially combined with social learning [13, 23]. Importantly, rational choice and competitive selection are not mutually exclusive and may act in concert.

The second assumption (positive publication bias), though obviously oversimplified, seems justified as a coarse description of most competitive scientific fields [30]. Even if negative results are published, they may neither achieve high impact nor translate into funding.

The third assumption (optimization for given b, d, IF) implies that scientists have no agency over the base probability of a hypothesis being true (b), the true effect size (d) or the mean income following publication (IF). Arguably, b and d are exogenously given by the arising hypotheses and true effects, while IF is likely to be an exogenous property of a research field. Accordingly, a scientific environment with a fixed or constrained combination of the three parameters can be thought of as a scientific niche. Note that this does not preclude the simultaneous occupation of multiple niches by individual scientists, for instance a high-IF/low-b niche and a high-b/low-IF niche. In combination with the first assumption, this implies that scientists choose, learn, or are selected for specific sample sizes within niches (but may simultaneously occupy multiple niches). Note that alternative models, in which the choice of b is endogenized, yield similar results [24, 25].

Simulation

Simulations were performed in Python 3.6 using the StatsModels toolbox [31]. Model code is shown as supporting information. Statistical power was calculated assuming independent, equally sized samples (s) and a two-sided, unpaired t-test given effect size d. Note that this implies that IF should be interpreted as the number of sample pairs purchasable, and s indicates the size of one of the samples. We also calculated power using a one-sample t-test, where IF represents the number of individual samples, and all results were robust. The Type-1 error (α) is assumed to be 0.05 throughout. Distributions in Fig 3 were generated using the numpy.random module. The input distribution for d was generated using a gamma distribution tuned to match empirical findings [11] (k = 3.5, θ = 0.2). Input distributions of b and IF were generated using uniform or beta distributions with α and β chosen from α = (1.1, 10), β = (1.1, 10) (IF values multiplied by 1000). Bimodal distributions were generated by mixing a low- and a high-skewed beta distribution (with the above parameters) with weights (0.5, 0.5; bimodal) or weights (0.9, 0.1; low/bimodal). From these distributions 1000 values were drawn at random and the ESS computed for each constellation. To compute the implied distribution of emergent power and positive predictive values, the corresponding values for each ESS were weighted by its TPR/ESS. This corrects for the fact that a small ESS allows more studies to be conducted, but each with a smaller TPR. For instance, a niche with half the ESS will allow twice the number of studies, suggesting these studies may be twice as frequent in the literature. However, of these smaller studies, a smaller fraction (TPR) will be significant, reducing their relative abundance in the literature. In fact, the two effects nearly cancel, such that the weighting does not substantially affect the emergent distributions.
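The parameter-distribution step can be sketched as follows (a minimal illustration assuming the ess() and tpr() helpers from the sketch above; the seed and the ‘low’ scenario shown are arbitrary choices):

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(0)
N = 1000

# effect sizes: gamma distribution tuned to empirical findings (k = 3.5, theta = 0.2)
d_draws = rng.gamma(shape=3.5, scale=0.2, size=N)

# base rate b, 'low' scenario: beta(1.1, 10), skewed toward small values;
# IF drawn from the same family and scaled to the 0-1000 range
b_draws = rng.beta(1.1, 10, size=N)
IF_draws = rng.beta(1.1, 10, size=N) * 1000

# 'low/bimodal' scenario: mix a low- and a high-skewed beta with weights 0.9/0.1
is_low = rng.random(N) < 0.9
b_bimodal = np.where(is_low, rng.beta(1.1, 10, N), rng.beta(10, 1.1, N))

# ESS, power at ESS and TPR/ESS weights for each sampled niche
ess_vals = np.array([ess(IF, d, b) for IF, d, b in zip(IF_draws, d_draws, b_draws)])
power_vals = np.array([TTestIndPower().power(effect_size=d, nobs1=s, alpha=0.05,
                                             alternative='two-sided')
                       for s, d in zip(ess_vals, d_draws)])
weights = np.array([tpr(s, d, b)
                    for s, d, b in zip(ess_vals, d_draws, b_draws)]) / ess_vals
# a weighted histogram of power_vals then corresponds to one panel of Fig 3C
```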

Fig 3. Distributions of statistical power for plausible input parameter distributions. Random input parameter constellations were drawn from a range of plausible, simulated distributions (grey, see Methods).


For each input parameter constellation the resultant ESS and power were calculated. A) Summary of model in- and outputs and the probed distributions. B) Empirically matched distribution of effect sizes d [11], used for all output distributions in C. C) Emergent power distributions and mean positive predictive values for each combination of distributions of b and IF.

Results

The equilibrium sample size (ESS)

We first illustrate the basic model behavior with an exemplary parameter set (b = 0.2, 0.5; d = 0.5; IF = 200). The most important model feature is the robust emergence of an economically optimal, i.e. ‘rational’, equilibrium sample size (ESS) at which Profit, i.e. the (indirect) Income from publications minus the Cost of experimentation, is maximal (Fig 1). A scientist’s Income (Fig 1A, blue & green curves) will be proportional to her publication rate, i.e. her total publishable rate (Eq 2). An optimal sample size (s) emerges because this rate must saturate close to the actual rate of true effects (b). Specifically, with infinite sample size (and power approaching one), the total publishable rate approaches the rate of actually true effects (b) plus a fraction of false positives (α × (1 − b)). The saturation and slope of the Income curve will thus depend on b, IF and power, the latter of which is a function of sample size. Conversely, additional samples always cost more, implying that at some sample size additional Cost will outpace additional Income. Computing Profit for increasing sample sizes (Eq 1) therefore reveals an optimal sample size at which Profit is maximal, termed the equilibrium sample size (ESS; Fig 1B).

Fig 1. Equilibrium sample size: basic model behavior illustrated with d = 0.5, IF = 200.


A) Illustrative Income (blue, green for b = 0.2, 0.5, respectively) and Cost (black) functions with increasing sample size (s); MU: monetary units, where one MU buys one sample pair. B) Profit functions for b = (0, 0.1,…, 1). For any given b, the ESS is the sample size at which Profit is maximal. C) Relation of the ESS to b (black curve). Respective ESS for b = 0.2 and 0.5 are indicated by blue and green lines. D) Statistical power at the ESS (P(ESS)) for the given d and IF. E) Expected distribution of statistical power at ESS if b is uniformly distributed.

At very small sample sizes, insufficient power will preclude the detection of true positives, but income will never fall below α × IF, since the rate of statistically significant findings never drops below α. Increasing sample size can then increase or decrease Profit, depending on the resulting increase in statistically significant results. For instance, if true effects are scarce (b ≤ 0.2 in Fig 1), increasing power will only modestly increase income (Fig 1A and 1B, blue line), leading to sharply decreasing Profit. In the extreme (no true effects, b = 0), increasing sample size linearly increases Cost, but the rate of statistically significant (publishable) findings remains constant at α. The result is a range of small b at which the ESS remains at the minimal value (s = 4 in our model) (Fig 1C). While we set s = 4 as the minimal possible sample size, the minimal publishable sample size may vary with field conventions. The model simply suggests that there will be an economic pressure toward the ESS. This economic pressure should be proportional to the peakedness of the Profit curve, i.e. the marginal decrease in Profit when slightly deviating from the ESS. As b increases from zero, the peakedness decreases until the ESS begins to shift rapidly to larger values (between b = 0.3 and 0.4 in our example). At larger values of b, peakedness increases again and the ESS begins to saturate. Note that adding a constant overhead cost or income per study will not affect the ESS. Such an overhead would shift the Cost curve (Fig 1A, black) as well as the Profit curves (Fig 1B) up or down, without altering the optimal sample size. Accordingly, we find that for a given d and IF, hypotheses with smaller base probability lead to smaller rational sample sizes.

Statistical power at ESS

We can now also explore the statistical power implied by the ESS (Fig 1D). It is helpful to separate the resultant curve into three phases: i) a range of constant small power where the ESS is minimal, ii) a small range of b (≈0.4 < b < 0.6) where power rises steeply, and iii) a range of large b where ESS and power saturate. Unsurprisingly, where the ESS is minimal, studies are also severely underpowered. Conversely, where the ESS begins to saturate, studies become increasingly well powered. Notably, there is only a small range of b in which moderately powered studies should emerge. In other words, for most values of b, power should be either very low or very high. For instance, assuming a uniform distribution of b, i.e. scientific environments in which all values of b are equally frequent, we should expect a bimodal distribution of power (Fig 1E). If b is itself bimodally distributed, this prediction becomes even stronger; for instance, scientific niches may be clustered around novelty-driven research with small b and confirmatory research with large b. Overall, power, like ESS, is positively related to b, with a distinctive three-phase waveform.
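As a usage sketch (assuming the ess() helper from the Methods sketch and the exemplary parameters d = 0.5, IF = 200 of Fig 1), the b-to-ESS and b-to-power relations can be traced as follows:

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

b_grid = np.linspace(0.0, 1.0, 11)
ess_grid = [ess(IF=200, d=0.5, b=b) for b in b_grid]
power_grid = [TTestIndPower().power(effect_size=0.5, nobs1=s, alpha=0.05,
                                    alternative='two-sided')
              for s in ess_grid]
# for b drawn uniformly, a histogram of power_grid approximates the distribution in Fig 1E
```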

Next, we explored how changing each individual input parameter (b, d and IF) affected ESS and power (Fig 2A and 2B, respectively). Specifically, we tested the sensitivity of the b-to-ESS and b-to-power relationships for plausible ranges of d and IF (Fig 2A). The ESS for a given b should depend both on effect size d (via power) and on IF (via the relative cost of a sample). We reasoned that the majority of scientific research is likely to be conducted in the ranges d ∈ [0.2, 1] and IF ∈ [50, 500] (see Discussion). For small d and IF, the range of small b where ESS and power are minimal is expanded (Fig 2A and 2B, upper left panels). Conversely, both greater d and greater IF shift the inflection point at which larger sample sizes become profitable, such that in this domain above-minimal sample sizes should be chosen even with small b (Fig 2A, lower right panels). When d and/or IF are large enough, the ESS leads to well powered studies across most values of b (Fig 2B, lower right panels). Accordingly, the distinctive three-phase waveform is conserved throughout much of the plausible parameter space but breaks down towards its edges. These data suggest that, ceteris paribus, a policy to increase any of the three input parameters, individually or in combination, will tend to promote better powered studies. For instance, increasing the funding rate (and thereby IF) or funding more confirmatory research with high b should both lead to better powered research.

Fig 2. Effect of d and IF on ESS.


A) Each individual line depicts the ESS as a function of b for a given combination of d and IF. B) Statistical power resulting from the ESS in panel (A).

Emergent power distributions for plausible input parameter distributions

In real scientific settings, a number of distributions of effect sizes d, income factors IF and base probabilities b are plausible. We therefore next investigated the emergent power distributions for multiple distributions of input parameters (Fig 3), calculating the ESS and resultant power for each random input parameter constellation (Fig 3A). The distribution of effect sizes was modeled after empirical data [11], with the majority of effects in the medium to large range (Fig 3B, see Discussion). Since the true distribution of IF is unknown, we modeled a range of distributions below IF = 1000. We reasoned that a single publication leading to funding for 1000 sample pairs was a conservative upper bound in view of published sample sizes [11]. Within this range we probed a uniform distribution as well as low, medium and high distributions of IF. Note that these data also demonstrate the predicted consequences of increasing or decreasing IF. Since the true distribution of b is similarly unknown, we first probed a uniform distribution (minimal assumption, Fig 3C1-4) and a bimodal distribution (assuming a cluster of exploratory fields with low b and confirmatory fields with high b, Fig 3C5-11). In both these cases the mean b is by definition around 0.5, i.e. as many hypotheses are true as are false. However, many scientific areas place an emphasis on ‘novelty’, suggesting substantially lower b [8, 13, 32]. We therefore also probed two more realistic distributions of b, namely low (most values around 0.1, Fig 3C9-12) and low/bimodal (low mixed with a minor second mode at high b, Fig 3C13-16). The latter models a situation where most studies (90%) are exploratory and the remaining studies are confirmatory. We found the resulting bimodal distribution of power to be robust throughout, with only the relative weights of the peaks changing.

Next, we investigated the mean reproducibility rates which could be expected for the resultant distributions. The positive predictive value (PPV) measures the probability that a positive finding is indeed true. It thus provides an upper bound on expected reproducibility rates (as the power of reproduction studies approaches 100%, reproducibility rates will approach the PPV). Note that the PPV should be interpreted in light of the underlying b. For instance, for the first two distributions of b (Fig 3C1-11), the mean base probability is already 50%, so PPV < 0.5 would indicate performance worse than chance. For the more realistic distributions of b (Fig 3C12-16), PPV ranged from 0.26 (Fig 3C10) to 0.4 (Fig 3C16), comparable to reported reproducibility rates. Thus, for plausible parameter distributions, rational sample-size choice robustly leads to a bimodal distribution of statistical power and to expected reproducibility rates below 50%. Additionally, these simulations suggest that creating more research environments with high b (e.g. Fig 3C5-8), for instance in the form of research institutions dedicated to confirmatory research [33], should lead to larger sample sizes and higher reproducibility. Finally, more scientific environments with large income per publication (IF), for instance through higher funding ratios, should lead to better powered science and higher reproducibility rates.

Conditional equivalence testing (CET)

An additional potential strategy to address the economic pressure towards small sample sizes is conditional equivalence testing (CET) [26, 29] (Fig 4). In CET, when a scientist fails to find a significant positive result in the standard NHST, she continues to test whether her data statistically support a null effect (a significant negative). A significant negative is defined as an effect within previously determined equivalence bounds (±Δ), which are set to the ‘smallest effect size of interest’ [28]. This addresses one of the main (and legitimate) drivers of positive publication bias, namely that absence of (significant) evidence is not evidence of absence [34]. Assuming that CET thus allows the publication of statistically significant negative findings in addition to significant positive findings implies i) an increase in the fraction of research that is published, ii) a resulting additional source of income from publication without additional sampling cost, and iii) an additional incentive for sufficient statistical power, as we will see below.

Fig 4. Conditional equivalence testing.


Exploration of model behavior under conditional equivalence testing. A) Basic model behavior illustrated with d = 0.5, IF = 200, Δ = 0.5d. Left: illustrations of effect sizes that would be considered significant positives (black) or negatives (red). Subpanels show emergent income and cost curves (a1), resulting Profit (a2), resulting ESSCET (a3), resulting power in black and powerCET in red (a4), and the resulting power distribution given uniformly distributed b (a5). For details see Fig 1. B) Same as A, but for Δ = d. C) Statistical power (black) and powerCET (red) at ESSCET for Δ = 0.5d, analogous to (a4), for various input parameter constellations of d and IF. For details see Fig 2. D) Same as C, but for Δ = d, analogous to (b4). E) Distributions of emergent statistical power (black) and powerCET (red) for plausible input parameter distributions given Δ = 0.5d. For details see Fig 3. F) Same as E, but for Δ = d.

A crucial step in CET is the a priori definition of an equivalence bound (±Δ), below which effects would be deemed consistent with a null effect. While this can be conceptually challenging in practice [28], it is important to point out that a ‘smallest effect size of interest’ is often invoked as an implicit justification for small sample sizes. In the following, we explore the effects of CET given either Δ = 0.5d or Δ = d, with α = 0.05 for both NHST and CET (Fig 4). Defining Δ in terms of d allows us to consider the same range of scientific environments, with variable expectations of d, as in the previous analyses (Figs 2 and 3). If Δ = 0.5d, a scientist is interested in detecting an effect of size d as above, but is somewhat uncertain how smaller effect sizes (0.5d to d) should be interpreted (Fig 4A). However, if her data support an effect <0.5d she would interpret her finding as a significant negative and publish it as such. The power of the CET (powerCET) to detect a negative effect in this setting is substantially smaller than that of the original NHST, leading to a second shoulder in the income curve at higher sample sizes (Fig 4a1). For the shown example (Δ = 0.5d, d = 0.5, IF = 200), this does not affect the ESS (Fig 4a2 and 4a3) or the resulting power to detect a positive effect (Fig 4a4 and 4a5, black). Note that each ESS now implies not only a power to detect a positive effect but also a powerCET to detect a negative effect (Fig 4a4 and 4a5, black and red, respectively), and that even an ESS with adequate power can have low powerCET. If Δ = d (Fig 4B), the same procedure is applied by the scientist, but all effects statistically significantly smaller than d will be published as negative findings. Note that this does not imply that all studies are published, since studies with insufficient power to detect either positive or negative findings will remain inconclusive. Ceteris paribus, with Δ = d, powerCET is comparable to the power of the original NHST. This has two consequences: i) the income shoulders of the two tests align, boosting Profit at the respective sample size, and ii) the dependency of Profit on b is largely removed (Fig 4b1 and 4b2). Indeed, the added economic incentive to detect and report significant negative findings leads to an inverted relationship between b and ESS, since more frequent true negatives (low b) imply higher potential profits from sample sizes adequate for the CET.
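As a usage sketch (assuming the tpr_cet() helper from the Methods sketch), the corresponding ESSCET follows from the same profit maximization; the delta values below correspond to Δ = 0.5d and Δ = d for the exemplary d = 0.5:

```python
def ess_cet(IF, d, b, delta, s_min=4, s_max=1000):
    """Equilibrium sample size under CET: maximize income from all published
    findings (Eq 3) minus sampling cost."""
    sizes = range(s_min, s_max + 1)
    return max(sizes, key=lambda s: IF * tpr_cet(s, d, b, delta) - s)

for delta in (0.25, 0.5):   # Delta = 0.5*d and Delta = d, with d = 0.5
    print(delta, ess_cet(IF=200, d=0.5, b=0.2, delta=delta))
```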

Systematically probing the input parameter space (as in Fig 2B) showed that CET did not remove the general dependencies of power on IF or d (Fig 4C and 4D). However, the boundary region of IF and d where adequate power first becomes economical shifted toward smaller values, particularly for small b. For instance, for IF = 100, d = 1 and b ≤ 0.2, the power at ESS shifts from ≈20% to ≈100%. This occurred for Δ = 0.5d (Fig 4C) and even more prominently for Δ = d (Fig 4D), where the relation of b to power is almost completely removed.

Finally, we probed the effect of CET on various input parameter distributions, analogous to Fig 3 (Fig 4E and 4F). We find that CET with Δ = 0.5d leads to improved power and PPV for most parameter distributions, but particularly for the more realistic ones (Fig 4E, low and low/bimodal distributions of b). This is even more pronounced when Δ = d, where PPV was 90% or higher for all but the low-IF distribution. These results suggest that CET could be a useful tool to change the economics of sample size, increasing not only the publication rate of negative findings but also mean statistical power and thereby the reproducibility of positive findings.

Discussion

Here, we describe a simple model in which sample-size choice is viewed as the result of competitive economic pressures rather than scientific deliberations (similar to [24, 25]). The model formalizes the economic optimality of small sample sizes for a large range of empirically plausible parameters with minimal assumptions. Additionally, it makes several empirically testable predictions, including a bimodal distribution of observed statistical power. Given the simplicity of the model, the apparent similarity between its predictions and empirically observed patterns is remarkable. Finally, our model allows us to explore a range of policy prescriptions to address insufficient sample sizes and irreproducibility. The core model suggests that any policy that increases mean funding per publication or the rate of confirmatory research should lead to better powered studies and increased reproducibility. Additionally, conditional equivalence testing may address publication bias and provide an economic incentive for better powered science.

Model predictions and empirical evidence

Our core model predicts i) a correlation between base probability and sample size, ii) a correlation between effect size and sample size, and iii) a correlation between mean grant income per publication and sample size. Moreover, for plausible parameter distributions the model predicts iv) a bimodal distribution of achieved statistical power and v) low overall reproducibility rates. For the purpose of discussion, it may be of particular interest to contrast our predictions, based on economically driven sample-size choice, with predictions derived from presumed scientifically driven sample-size choices. For instance, scientifically driven sample-size choice might be expected to i) require larger samples for more unlikely findings, ii) require larger samples for smaller effects, and iii) be independent of grant income. Moreover, scientifically driven sample sizes might be expected to iv) lead to a unimodal distribution of power around 80% and v) imply PPVs and reproducibility rates above 50%. Which sample sizes are scientifically ideal is of course a complex question in itself, and will depend not only on the cost of sampling but also on the scientific values of true and false positives as well as true and false negatives. Miller and Ulrich [35] present a scientifically normative model of sample-size choice, formalizing many of the above intuitions (importantly, however, they do not account for the possibility that negative findings may not enter the published literature). Overall, a prevalence of underpowered research certainly leads to a range of problems, from low reproducibility rates to unreliable meta-analytic effect size estimates [36]. Accordingly, the currently available empirical evidence appears more in line with the economically normative than with the scientifically normative account.

ad i) The available evidence suggests that journals with high impact and purportedly more novel (low b) findings feature smaller sample sizes [11, 17, 18], in line with our prediction. This finding seems particularly puzzling given the increased editorial and scientific scrutiny such ‘high-impact’ publications receive. Accordingly, publication in a ‘high-impact’ journal is generally considered a signal of quality and credibility [37]. From the perspective of an individual scientist, increasing sample size strictly increases the probability of being able to support her hypothesis (if she believes it is true) but does not alter the probability of rejecting it (if she expects it to be false). Similarly, a scientific optimality model assuming that true and false positive publications have equal but opposite scientific value suggests that more unlikely findings merit larger power [35]. All these considerations suggest that high-impact journals should contain larger sample sizes, highlighting a need for explanation.

ad ii) The available evidence suggests a negative correlation between effect size and sample size, seemingly contradicting our prediction [15, 17, 38]. However, the authors caution against over-interpreting this result due to the winner’s curse phenomenon [10, 17, 39]. This well documented phenomenon produces a negative correlation between sample size and estimated effect size even when a single hypothesis (i.e. a single true effect size) is probed in multiple independent studies. It arises because, at small sample sizes, only spuriously inflated effect sizes become statistically significant and enter the literature (also due to positive publication bias). Relating effect sizes from meta-analyses to original sample sizes by scientific subdiscipline may help to overcome this confound.

ad iii) We are unaware of evidence directly relating mean grant income per study to sample size. A study by Fortin and Currie [40] suggests diminishing returns in total impact for increasing awarded grant size. However, impact was assessed without reference to sample sizes. Furthermore, awarded grant size does not necessarily reflect mean grant income, since larger grants may be more competitive. Indeed, more competitive funding systems are likely to a) increase the underlying economic pressures and b) have additional adverse effects [41]. Indirect evidence in line with our prediction is presented by Sassenberg and Ditrich (2019) [42]. The authors show that larger sample sizes were associated with lower costs per sample. Since IF is expressed as ‘samples purchasable per publication’ this is analogous to higher IF correlating with greater sample size.

ad iv) The prediction of a bimodal distribution of power is well corroborated by evidence [10, 14, 15]. The lack of a mode around 80% power, both in our model and in all empirical studies, is particularly notable. By comparison, the scientific-value-driven model of Miller and Ulrich [35] suggests a single broad mode at intermediate levels of power.

ad v) The predicted low overall reproducibility rates are in line with empirical data for many fields. Two now-prominent studies from the pharmaceutical industry suggested reproducibility rates of 11 and 22% [1, 2]. Academic studies from psychology and experimental economics found rates of 36, 61 and 62% [3–5]. Our results suggest that differences in these numbers may be driven, for instance, by different base probabilities of hypotheses being true in the different fields.

The present model thus helps to explain a range of empirical phenomena and is amenable to closer empirical scrutiny in the future. Crucially, all in- and output parameters are in principle empirically verifiable. Future studies could, for instance, fit observed power distributions with the present model versus alternative formal models. This could directly generate predictions concerning input parameters which could in turn be empirically tested.

Niche optimization through competitive selection

A central assumption of this model is that sample size is optimized for a given set of parameters (b, d, IF). One way to interpret this is that scientists make rational sample-size choices based on their estimates of these parameters for each hypothesis. As noted above, maximizing profit in this context need not be an end in itself but can also be seen as a strategy to secure scientific survival, given the uncertainty of both the scientific process and funding decisions. Alternatively, optimization may occur by selection mechanisms, where sample sizes are determined through a process of cultural evolution [13]. In this case one must, however, make the additional assumption that parameters remain relatively constant within the scientific niche in which sample sizes are selected. Researchers then need only associate the scientific niche with a convention of sample-size choice. These conventions could then undergo independent evolution in each niche. Such scientific niches may correspond to scientific subdisciplines and may indeed be identifiable on the basis of empirically consistent sample sizes. We did not address how scientists should distribute their efforts across multiple niches (e.g. exploratory research and confirmatory research). This question has been previously addressed in a related optimality model [24]. The authors suggest that, given prevailing incentives emphasizing novel research, the majority of efforts should be invested into research with low b. This is reflected in the present model by the low skewed distributions of b (Fig 3C9-16). Future evolutionary models could further investigate how mixed strategies of sample-size choice perform when individual parameters vary within niches, or when scientists are uncertain of the niche.

Input parameter range estimates

We probed the arising ESS for what we judged to be plausible ranges of the three input parameters (b, d, IF):

The base (or pre-study) probability of true results is often assumed to be small (b < 0.1) for most fields [13, 32], in part because of a focus on novel research [24]. At the same time there is a small fraction of confirmatory studies, where substantial prior evidence indicates the hypothesis should be true, and b should thus be large. We therefore chose to cover the full range of b (b ∈ [0, 1]) in addition to some plausible distributions. In light of the considerations by [13] and [24], our distributions might be judged conservative in that real values of b may be lower. Notably, models endogenizing choice over b reach similar conclusions [24, 25].

We probed a plausible range (d ∈ [0.1, 1.5]) and an empirically matched distribution of d [11]. By comparison, in psychological research frequently cited reference points for small, medium and large effect sizes are d = 0.2, 0.5 and 0.8, respectively.

Notably our empirically matched distribution (Fig 3B) is based on published effect sizes, which are likely exaggerated due to the winner’s curse [17]. For instance [3, 4] find that true effect sizes are on average only around 50 to 60% of originally published effect sizes. This again renders our estimates conservative, in that true values may be lower.

An empirical estimate of IF is perhaps most difficult, since the full cost per sample (time, wage, money) may be difficult to separate from other arising costs. Note that a constant overhead cost or income, which is independent of sample size, should not alter the optimal sample size. Nevertheless, we reasoned that plausible values for IF should lie somewhere within the range of 10 to 1000. For instance, a typically reported sample size is 20 [10, 11]. IF must cover the cost of the positive result plus however many unpublished additional samples were required to obtain it. An IF of 1000 would thus allow for up to 49 negative findings (or 980 unpublished samples). Given that science does not seem to provide substantial net profits, larger values of IF seem implausible. Note that our model assumes linearly increasing cost for increasing sample size. We found that, for the curvature of the cost function to play a major role for the ESS, it would need to be very prominent around the range of sample sizes where the curvature of the power function is strongest. For simple non-linear functions, such as a cost exponent between 0.5 and 1.1, we found model behavior to be similarly captured by a linear cost function with adjusted slope. However, strongly non-linear cost functions might affect the ESS beyond the effect of slope (IF).
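The non-linear cost variant can be sketched as a one-line modification of the ESS search (assuming the tpr() helper from the Methods sketch; the exponent gamma is the only addition here, and gamma = 1 recovers the linear model):

```python
def ess_nonlinear(IF, d, b, gamma=1.0, s_min=4, s_max=1000):
    """ESS with a power-law cost s**gamma instead of the linear cost s."""
    sizes = range(s_min, s_max + 1)
    return max(sizes, key=lambda s: IF * tpr(s, d, b) - s ** gamma)
```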

Together, these considerations suggest that our predictions of power and expected reproducibility rates are more likely to be over- than underestimated. Of course, the forces affecting real sample-size choices are (hopefully) not solely the economic ones investigated here. Specifically, scientific deliberations about appropriate sample size should at least play a partial role. Indeed, the model predicts the minimal sample size over a wide range of parameters. This could, for instance, be the minimal sample size accepted by statistical software. Above, we have suggested it is the minimal sample size deemed acceptable in a scientific discipline. This implies that discipline-specific norms on minimal acceptable sample sizes reflect scientific deliberations and are enforced during the editorial and review process. Such non-economic forces may be particularly relevant where the profit peak is broad.

Reproducibility

In our model low reproducibility rates appear purely as a result of the economic pressures on sample size. Many additional practices, such as p-hacking, may increase the false positive rate for a given sample size [43–47]. The economic pressures underlying our model must be expected to also promote such practices. Moreover, many of these practices are likely to become more relevant for many small studies. For instance, flexible data analysis will increase the probability of false positives for each study. Moreover, in small studies substantial changes of effect size may result from minor changes in analysis (e.g. post-hoc exclusion of data points), thus increasing the relative power of biases. While substantial and mostly laudable efforts are being made to reduce such practices [43, 48], our approach emphasizes that scientists may in fact have limited agency over sample size, given the economic constraints of scientific competition. Given these constraints, our model suggests reproducibility can be enhanced by policies that i) increase the fraction of research with higher b (e.g. more confirmatory research), ii) lead to higher IF (e.g. higher funding rates), or iii) introduce conditional equivalence testing (CET) [26]. Several previous related models have explored the effect of various policy prescriptions in the light of such economic constraints [24, 25]. For instance, Campbell and Gustafson [25] explore the effects of increasing requirements for statistical stringency (e.g. setting α = 0.005), as called for by a highly publicized recent proposal [49]. However, they find that this may dramatically lower publication rates, effectively increasing waste (unpublished research) and competitive pressure, as well as reducing the rate of ‘breakthrough findings’. While we did not perform an extensive investigation of this policy, our model confirms that with α = 0.005 a large fraction of scientific niches, particularly those with small b, starts yielding net losses. Alternative proposals which directly address the underlying economic pressures are very much in line with our results [26, 33]. Campbell and Gustafson (2018) propose conditional equivalence testing to increase the publication value of negative findings. Indeed, incorporating their procedure into our model showed that CET should not only address publication bias, by allowing the publication of more negative findings, but also lead to improved power and reproducibility of positive findings. In practice, the definition of the equivalence bounds as well as the publication and factual monetary rewarding of negative findings may render the adoption of CET difficult. However, it is important to note that even a partial adoption should promote the predicted benefits. Moreover, these hurdles can also be addressed by funding policy. More ambitiously, Utzerath and Fernandez [33] propose to complement the current ‘discovery oriented’ system with an independent confirmatory branch of science, in which secure funding allows scientists to assess hypotheses impartially. Indeed, such a system may have many positive side effects. A consistent prospect of replication may act as an incentive toward good research practices for discovery-oriented scientists. An increased number of permanent scientific positions might reduce the pressure toward bad research practices.
Additionally, the emergence of a body of confirmatory research would allow many pressing meta-scientific questions to be addressed, including biases in meta-analytic effect size estimates [36], the actual frequencies of true hypotheses [32] and the true reproducibility rates of the published literature [5, 8].

Relation to proxyeconomics

Our model is consistent with, and an individual instance of, proxyeconomics [23]. Proxyeconomics refers to any competitive societal system in which an abstract goal (here: scientific progress) is promoted using competition based on proxy measures (here: publications). In such cases, the measures or the system may become corrupted due to overoptimization toward the proxy measure [50–54]. As discussed above, such systems have the general potential to create a situation of limited individual agency and system-level lock-in [23]. The present model shows how the specific informational deficits of a proxy allow pattern predictions of the potentially emergent corruption to be derived. Specifically, an informational idiosyncrasy of the proxy (positive publication bias) leads to a number of predictions which can be i) empirically verified, ii) contrasted with alternative models (see above), and iii) leveraged into policy prescriptions. A similar pattern prediction derived from positive publication bias is the winner’s curse [17]. Together, such pattern predictions provide concrete and compelling evidence for competition-induced corruption of proxy measures in competitive societal systems.

Conclusion

Our model strengthens the argument that economic pressures may be a principal driver of insufficient sample sizes and irreproducibility. The underlying mechanism hinges on the combination of positive publication bias and competitive funding. Accordingly, any policy to address irreproducibility should explicitly account for the arising economic forces or seek to change them [25, 33].

Supporting information

S1 File. Model code for quick reference.

Code to compute the ESS (and associated parameters) based on b, d, IF.

(PDF)

S2 File. Model code for quick reference.

Code to compute the ESSCET (and associated parameters) based on b, d, IF, Δ.

(PDF)

S3 File. Full python code to compute the ESS (and associated parameters) based on b, d, IF and to generate all figures.

(PY)

S4 File. Full python code to compute the ESSCET (and associated parameters) based on b, d, IF, Δ and to generate all figures.

(PY)

Acknowledgments

Special thanks to Heinz Beck for making this project possible, to Jonathan Ewell for comments on the manuscript, and to Harlan Campbell for pointing me towards conditional equivalence testing.

Data Availability

The code to run the model and generate the figures is available as supporting files and can additionally be downloaded at www.proxyeconomics.com.

Funding Statement

This study was funded by the VW Foundation under the program Originalitätsverdacht, Project: ‘Proxyeconomics’. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Begley CG, Ellis LM. Drug development: Raise standards for preclinical cancer research. Nature. 2012;483(7391):531–3. 10.1038/483531a [DOI] [PubMed] [Google Scholar]
  • 2. Prinz F, Schlange T, Asadullah K. Believe it or not: how much can we rely on published data on potential drug targets? Nature reviews Drug discovery. 2011;10(9):712 10.1038/nrd3439-c1 [DOI] [PubMed] [Google Scholar]
  • 3. Camerer CF, Dreber A, Forsell E, Ho TH, Huber J, Johannesson M, et al. Evaluating replicability of laboratory experiments in economics. Science. 2016;351(6280):1433–1436. 10.1126/science.aaf0918 [DOI] [PubMed] [Google Scholar]
  • 4. Camerer CF, Dreber A, Holzmeister F, Ho TH, Huber J, Johannesson M, et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour. 2018;2(9):637–644. 10.1038/s41562-018-0399-z [DOI] [PubMed] [Google Scholar]
  • 5. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716–aac4716. 10.1126/science.aac4716 [DOI] [PubMed] [Google Scholar]
  • 6. Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016;533(7604):452–454. 10.1038/533452a [DOI] [PubMed] [Google Scholar]
  • 7. Fanelli D. Opinion: Is science really facing a reproducibility crisis, and do we need it to? Proceedings of the National Academy of Sciences of the United States of America. 2018;115(11):2628–2631. 10.1073/pnas.1708272114 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Ioannidis JPA. Why most published research findings are false. PLoS Medicine. 2005;4(6):e124 10.1371/journal.pmed.0020124 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. McElreath R, Smaldino PE. Replication, communication, and the population dynamics of scientific discovery. PLoS ONE. 2015;10(8). 10.1371/journal.pone.0136088 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Button KS, Ioannidis JPa, Mokrysz C, Nosek Ba, Flint J, Robinson ESJ, et al. Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews Neuroscience. 2013;14(5):365–76. 10.1038/nrn3475 [DOI] [PubMed] [Google Scholar]
  • 11. Szucs D, Ioannidis JPA. Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLOS Biology. 2017;15(3):e2000797 10.1371/journal.pbio.2000797 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Lamberink HJ, Otte WM, Sinke MRT, Lakens D, Glasziou PP, Tijdink JK, et al. Statistical power of clinical trials increased while effect size remained stable: an empirical analysis of 136,212 clinical trials between 1975 and 2014. Journal of Clinical Epidemiology. 2018;102:123–128. 10.1016/j.jclinepi.2018.06.014 [DOI] [PubMed] [Google Scholar]
  • 13. Smaldino PE, McElreath R. The natural selection of bad science. Royal Society Open Science. 2016;3(9):160384 10.1098/rsos.160384 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Nord CL, Valton V, Wood J, Roiser JP. Power-up: A Reanalysis of ‘Power Failure’ in Neuroscience Using Mixture Modeling. Journal of Neuroscience. 2017;37(34):8051–8061. 10.1523/JNEUROSCI.3592-16.2017 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Dumas-Mallet E, Button KS, Boraud T, Gonon F, Munafò MR. Low statistical power in biomedical science: a review of three human research domains. Royal Society Open Science. 2017;4(2):160254 10.1098/rsos.160254 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Tsilidis KK, Panagiotou Oa, Sena ES, Aretouli E, Evangelou E, Howells DW, et al. Evaluation of excess significance bias in animal studies of neurological diseases. PLoS biology. 2013;11(7):e1001609 10.1371/journal.pbio.1001609 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Brembs B, Button K, Munafò M. Deep impact: unintended consequences of journal rank. Frontiers in Human Neuroscience. 2013;7 10.3389/fnhum.2013.00291 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Fraley RC, Vazire S. The N-Pact Factor: Evaluating the Quality of Empirical Journals with Respect to Sample Size and Statistical Power. PLoS ONE. 2014;9(10):e109019 10.1371/journal.pone.0109019 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Cohen J. The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology. 1962;65(3):145–153. 10.1037/h0045186 [DOI] [PubMed] [Google Scholar]
  • 20. Fang FC, Casadevall A. Competitive science: is competition ruining science? Infection and immunity. 2015;83(4):1229–33. 10.1128/IAI.02939-14 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Edwards MA, Roy S. Academic Research in the 21st Century: Maintaining Scientific Integrity in a Climate of Perverse Incentives and Hypercompetition. Environmental Engineering Science. 2017;34(1):51–61. 10.1089/ees.2016.0223 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Axtell R, Kirman A, Couzin ID, Fricke D, Hens T, Hochberg ME, et al. Challenges of Integrating Complexity and Evolution into Economics In: Wilson DS, Kirman A, editors. Complexity and Evolution: Toward a New Synthesis for Economics. MIT Press; 2016. [Google Scholar]
  • 23. Braganza O. Proxyeconomics, An agent based model of Campbell’s law in competitive societal systems; 2018. Available from: http://arxiv.org/abs/1803.00345.
  • 24. Higginson AD, Munafò MR. Current Incentives for Scientists Lead to Underpowered Studies with Erroneous Conclusions. PLOS Biology. 2016;14(11):e2000995 10.1371/journal.pbio.2000995 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Campbell H, Gustafson P. The World of Research Has Gone Berserk: Modeling the Consequences of Requiring “Greater Statistical Stringency” for Scientific Publication. The American Statistician. 2019;73(sup1):358–373. 10.1080/00031305.2018.1555101 [DOI] [Google Scholar]
  • 26. Campbell H, Gustafson P. Conditional equivalence testing: An alternative remedy for publication bias. PLOS ONE. 2018;13(4):e0195145 10.1371/journal.pone.0195145 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Smaldino PE. Measures of individual uncertainty for ecological models: Variance and entropy. Ecological Modelling. 2013;254:50–53. 10.1016/j.ecolmodel.2013.01.015 [DOI] [Google Scholar]
  • 28. Lakens D, Scheel AM, Isager PM. Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science. 2018;1(2):259–269. 10.1177/2515245918770963 [DOI] [Google Scholar]
  • 29. Lakens D. Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science. 2017;8(4):355–362. 10.1177/1948550617697177 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Fanelli D. Negative results are disappearing from most disciplines and countries. Scientometrics. 2012;90(3):891–904. 10.1007/s11192-011-0494-7 [DOI] [Google Scholar]
  • 31. Seabold S, Perktold J. Statsmodels: Econometric and Statistical Modeling with Python. In: Proc. of the 9th Python in Science Conf; 2010. p. 57.
  • 32. Johnson VE, Payne RD, Wang T, Asher A, Mandal S. On the Reproducibility of Psychological Science. Journal of the American Statistical Association. 2017;112(517):1–10. 10.1080/01621459.2016.1240079 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Utzerath C, Fernández G. Shaping Science for Increasing Interdependence and Specialization. Trends in neurosciences. 2017;40(3):121–124. 10.1016/j.tins.2016.12.005 [DOI] [PubMed] [Google Scholar]
  • 34. Hartung J, Cottrell JE, Giffin JP. Absence of evidence is not evidence of absence. Anesthesiology. 1983;58(3):298–300. 10.1097/00000542-198303000-00033 [DOI] [PubMed] [Google Scholar]
  • 35. Miller J, Ulrich R. Optimizing Research Payoff. Perspectives on Psychological Science. 2016;11(5):664–691. 10.1177/1745691616649170 [DOI] [PubMed] [Google Scholar]
  • 36. Stanley TD, Doucouliagos H, Ioannidis JPA. Finding the power to reduce publication bias. Statistics in Medicine. 2017;36(10):1580–1598. 10.1002/sim.7228 [DOI] [PubMed] [Google Scholar]
  • 37. Brembs B. Prestigious Science Journals Struggle to Reach Even Average Reliability. Frontiers in Human Neuroscience. 2018;12:37 10.3389/fnhum.2018.00037 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Kühberger A, Fritz A, Scherndl T. Publication Bias in Psychology: A Diagnosis Based on the Correlation between Effect Size and Sample Size. PLoS ONE. 2014;9(9):e105825 10.1371/journal.pone.0105825 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Esposito L, Drexler JFJF, Braganza O, Doberentz E, Grote A, Widman G, et al. Large-scale analysis of viral nucleic acid spectrum in temporal lobe epilepsy biopsies. Epilepsia. 2015;56(2):234–243. 10.1111/epi.12890 [DOI] [PubMed] [Google Scholar]
  • 40. Fortin JM, Currie DJ. Big Science vs. Little Science: How Scientific Impact Scales with Funding. PLoS ONE. 2013;8(6):e65263 10.1371/journal.pone.0065263 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Gross K, Bergstrom CT. Contest models highlight inherent inefficiencies of scientific funding competitions. PLOS Biology. 2019;17(1):e3000065 10.1371/journal.pbio.3000065 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Sassenberg K, Ditrich L. Research in Social Psychology Changed Between 2011 and 2016: Larger Sample Sizes, More Self-Report Measures, and More Online Studies. Advances in Methods and Practices in Psychological Science. 2019;2(2):107–114. 10.1177/2515245919838781 [DOI] [Google Scholar]
  • 43. Munafò MR, Nosek BA, Bishop DVM, Button KS, Chambers CD, Percie du Sert N, et al. A manifesto for reproducible science. Nature Human Behaviour. 2017;1(1):0021 10.1038/s41562-016-0021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Kerr NL. HARKing: hypothesizing after the results are known. Personality and social psychology review. 1998;2(3):196–217. 10.1207/s15327957pspr0203_4 [DOI] [PubMed] [Google Scholar]
  • 45. Eklund A, Nichols TE, Knutsson H. Cluster failure—Why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences. 2016;113(28):7900–7905. 10.1073/pnas.1602413113 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology—undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological science. 2011;22(11):1359–66. 10.1177/0956797611417632 [DOI] [PubMed] [Google Scholar]
  • 47. Kriegeskorte N, Simmons WK, Bellgowan PSF, Baker CI. Circular analysis in systems neuroscience: the dangers of double dipping. Nature Neuroscience. 2009;12(5):535–540. 10.1038/nn.2303 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. McNutt M. Journals unite for reproducibility. Science. 2014;346(6210):679–679. 10.1126/science.aaa1724 [DOI] [PubMed] [Google Scholar]
  • 49. Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers EJ, Berk R, et al. Redefine statistical significance. Nature Human Behaviour. 2018;2(1):6–10. 10.1038/s41562-017-0189-z [DOI] [PubMed] [Google Scholar]
  • 50. Campbell DT. Assessing the impact of planned social change. Evaluation and Program Planning. 1979;2(1):67–90. 10.1016/0149-7189(79)90048-X [DOI] [Google Scholar]
  • 51. Goodhart CAE. Problems of Monetary Management: The UK Experience In: Monetary Theory and Practice. London: Macmillan Education UK; 1984. p. 91–121. [Google Scholar]
  • 52. Strathern M. ‘Improving ratings’: audit in the British University system. European Review. 1997;5(3):305–321.
  • 53. Manheim D, Garrabrant S. Categorizing Variants of Goodhart’s Law; 2018. Available from: https://arxiv.org/abs/1803.04585v3.
  • 54. Fire M, Guestrin C. Over-Optimization of Academic Publishing Metrics: Observing Goodhart’s Law in Action; 2018. Available from: http://arxiv.org/abs/1809.07841.

Decision Letter 0

Luis M Miller

11 Dec 2019

PONE-D-19-26742

Economically rational sample-size choice and irreproducibility

PLOS ONE

Dear Dr. Braganza,

I write you in regards to the manuscript PONE-D-19-26742 entitled “Economically rational sample-size choice and irreproducibility” which you submitted to PLOS ONE.

I have solicited advice from two expert Reviewers, who have returned the reports shown below. The two reviewers provide positive recommendations, but both agree that the paper would benefit from a round of revisions before acceptance.

Reviewer #1 suggests that “the manuscript would be stronger if more specific, concrete, prospective predictions were made” and that “it would have been interesting to explore the impact of modifying the input parameters on the predicted outputs of the model”. I encourage you to explore some of the modifications proposed by the reviewer.

Reviewer #2 also suggests going beyond the analysis provided in the current version of the paper and exploring some modifications of the model. Additionally, (s)he raises a number of minor issues that could strengthen the paper.

Based on the Reviewers' reports and my own reading of the paper, I came to the decision to offer you the opportunity to revise the manuscript. If you decide to prepare a substantially revised version of the paper, please provide a detailed response to both Reviewers regarding how you have addressed their concerns. If you resubmit, I would ask the same two Reviewers to review the paper again.

We would appreciate receiving your revised manuscript by Jan 25 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Luis M. Miller, Ph.D.

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please consider changing the title so as to meet our title format requirement (https://journals.plos.org/plosone/s/submission-guidelines). In particular, the title should be "Specific, descriptive, concise, and comprehensible to readers outside the field" and in this case it is not informative and specific about your study's scope and methodology.

3. Please do not include funding sources in the Acknowledgments or anywhere else in the manuscript file. Funding information should only be entered in the financial disclosure section of the submission system. https://journals.plos.org/plosone/s/submission-guidelines#loc-acknowledgments

4. Please upload a copy of Figure 4, to which you refer in your text on page 6. If the figure is no longer to be included as part of the submission please remove all reference to it within the text.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The author presents a relatively simple model, whereby scientists choose “economically rational” sample sizes on the basis of the trade-off between the costs of collecting data (with larger studies being more expensive) and the value of publications (i.e., resulting grant income). The author acknowledges the relative simplicity of the model, but suggests that the model performs reasonably well in terms of predicting the distribution of statistical power in a way that mirrors empirical efforts to estimate this.

The work is a valuable addition to the literature, and complements previous efforts that have taken a similar approach (i.e., using modelling) to understand the impact of incentive structures on the behaviour of scientists, and the resulting quality of published research. However, I felt that (notwithstanding the virtues of simplicity) it would have been interesting to explore the impact of modifying the input parameters on the predicted outputs of the model (i.e., on scientists’ behaviour).

For example, the drivers of small sample size in the model are low base probability, small effect size, and/or low grant income per publication. One could argue that at least two of these are shaped by funding agencies – low base probability may result from an emphasis on novelty and “groundbreaking” research in funding applications, whilst a low grant income per publication may result from low funding rates. The latter, in particular, varies considerably (from ~5% to ~40% across countries), although I am not aware of any studies that have examined the relationship between funding rate and power distribution.

Similarly, certain fields may be more likely to have a high base rate probability than others – clinical trials of medical interventions, for example, represent the culmination of a long process of discovery, experimentation, validation, etc. Indeed, the principle of equipoise should suggest that only 50% of trials should demonstrate a benefit for the intervention over the comparator (which is borne out by empirical work). What would the model predict should be the difference between research of this kind compared with more blue-sky discovery research, where the base probability might be considerably lower?

In other words, for these models to be more than useful descriptions they should provide us with insights into how changing incentive structures might shape scientists’ behaviour, and the resulting quality of research. There may be opportunities here to make prospective (perhaps even quantitative) predictions that then could be tested – either as part of this paper, if the author has the resources to do so, or in future studies. The author briefly touches on these issues, but I think the manuscript would be stronger if more specific, concrete, prospective predictions were made.

Reviewer #2: This is a very timely and important article. A few minor revisions could improve the manuscript.

  • “IF is a positive constant reflecting mean grant income per publication.” Can we also think about this parameter as a reflection of the “cost to collect data”? If “IF” is large, this translates to the potential profit being higher. Does this imply that the relative cost per sample is small? Conversely, if “IF” is small, does this translate to a situation where data is relatively expensive? Can you discuss how your results reflect on fields where collecting data is expensive versus fields where collecting data is relatively inexpensive? On a related note, see Sassenberg et al. (2019), who conclude that a journal’s “demand for higher statistical power [...] evoked strategic responses among researchers. [...] [R]esearchers used less costly means of data collection, namely, more online studies and less effortful measures.”

  • Consider exploring (or at the very least commenting on) a nonlinear cost for sample size. In reality, the cost of increasing one’s sample size from 10 to 20 might be different than the cost of increasing it from 100 to 110.

• “Which sample sizes are scientifically ideal is of course a complex question in itself, and will depend not only on the cost of sampling but also on the scientific values of true and false positives as well as true and false negatives.” I suggest a comment or a reference on the meta-analytic impact of studies with low power. Consider for example the conclusions of Stanley et al. (2017) in “Finding the power to reduce publication bias.”

• “For instance, Campbell and Gustafson [43] propose conditional equivalence testing to increase the publication value of negative findings.” I suspect that if, in your model, both positive and negative findings were given equal value, the optimal sample size would still be very low. If, regardless of the outcome, the study will be published and the “IF” received, won’t it be optimal to conduct a large number of very small studies? With this in mind, could you elaborate on “increase the publication value of negative findings”? In other words, for the equation “Profit = IF * TPR – s”, what could we consider for replacing the TPR term? How should we be compensating researchers for their work? While I understand that your model is only an approximation of the complicated research economy, can you point to any alternatives to giving researchers a certain amount of grant money per (positive) publication?

Small things:

• “For simplicity we assume they receive funding only, if they publish and they can publish only positive results.” This is an important idea and it needs to be made crystal clear. Please consider rewriting this and perhaps elaborating.

  • “To compute the implied distribution of emergent power and positive predictive values, the corresponding values for each ESS were weighted by its TPR/ESS.” This is another very important idea. I suggest writing out an example to make sure the reader understands. For instance: “For example, with the same total amount of resources at hand, a researcher could conduct 10 small studies with a sample size of 10 or 2 large studies with a sample size of 50, … the emergent studies (i.e., the published literature) will then have….”

• “Two, now prominent, studies from the pharmaceutical industry suggested reproducibility rates of 11 and 22% [1,2, respectively].” Consider adding here a comment/reference to Johnson et al. (2017) “On the Reproducibility of Psychological Science.”

• Fix punctuation and spacing: “chosen for a set of parameters (b, d, IF , see table1)”

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Marcus Munafo

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Mar 11;15(3):e0229615. doi: 10.1371/journal.pone.0229615.r002

Author response to Decision Letter 0


24 Jan 2020

Reviewers' comments:

Review Comments to the Author

Reviewer #1: The author presents a relatively simple model, whereby scientists choose “economically rational” sample sizes on the basis of the trade-off between the costs of collecting data (with larger studies being more expensive) and the value of publications (i.e., resulting grant income). The author acknowledges the relative simplicity of the model, but suggests that the model performs reasonably well in terms of predicting the distribution of statistical power in a way that mirrors empirical efforts to estimate this.

The work is a valuable addition to the literature, and complements previous efforts that have taken a similar approach (i.e., using modelling) to understand the impact of incentive structures on the behaviour of scientists, and the resulting quality of published research. However, I felt that (notwithstanding the virtues of simplicity) it would have been interesting to explore the impact of modifying the input parameters on the predicted outputs of the model (i.e., on scientists’ behaviour).

For example, the drivers of small sample size in the model are low base probability, small effect size, and/or low grant income per publication. One could argue that at least two of these are shaped by funding agencies – low base probability may result from an emphasis on novelty and “groundbreaking” research in funding applications, whilst a low grant income per publication may result from low funding rates. The latter, in particular, varies considerably (from ~5% to ~40% across countries), although I am not aware of any studies that have examined the relationship between funding rate and power distribution.

Similarly, certain fields may be more likely to have a high base rate probability than others – clinical trials of medical interventions, for example, represent the culmination of a long process of discovery, experimentation, validation, etc. Indeed, the principle of equipoise should suggest that only 50% of trials should demonstrate a benefit for the intervention over the comparator (which is borne out by empirical work). What would the model predict should be the difference between research of this kind compared with more blue-sky discovery research, where the base probability might be considerably lower?

In other words, for these models to be more than useful descriptions they should provide us with insights into how changing incentive structures might shape scientists’ behaviour, and the resulting quality of research. There may be opportunities here to make prospective (perhaps even quantitative) predictions that then could be tested – either as part of this paper, if the author has the resources to do so, or in future studies. The author briefly touches on these issues, but I think the manuscript would be stronger if more specific, concrete, prospective predictions were made.

Response:

These comments are well taken. I agree fully that funding agencies have the power to affect IF and b, and potentially even d. Indeed, I have attempted to provide exactly the kind of insight about how changing the input parameters would change economically optimal behavior in Figs. 2 and 3, but have apparently not explained this well. The analysis shown in Fig. 2 was designed precisely to show the reader how modifying the input parameters, individually or in combination, will affect the equilibrium sample size and resulting power. These data show clearly that, ceteris paribus, a policy to increase any of these three variables individually (moving down or right across subplots, or right on individual x-axes) will tend to promote better-powered studies. They further provide a useful tool to examine the interactions between the parameters. For instance, if my field is interested in effects of d ≥ 0.5 and we assume equipoise (b = 0.5), then income per publication should cover at least 500 samples (IF = 500) to sustain well-powered research (Fig. 2B). Importantly, this also implies that a small d or b can, to a certain degree, be compensated by a high IF. Finally, it reveals dead spaces (e.g. IF = 10), where no plausible b or d leads to power >50%.
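As an illustration of this point, the following minimal sketch shows the kind of calculation that underlies Fig. 2. It is not the S3 supporting code: it assumes that s denotes the per-group sample size, that IF is expressed as the number of samples purchasable per publication, that power is computed with the statsmodels package, and that false positives (at alpha = 0.05) are also publishable; the exact definitions in the manuscript and supporting code may differ.

    # Minimal sketch (not the S3 supporting code): economically optimal sample size
    # for a single research niche, maximizing Profit = IF * P(publishable) - s.
    # Assumptions: s = samples per group, IF = samples purchasable per publication,
    # P(publishable) = b * power(s, d) + (1 - b) * alpha (false positives also publish).
    import numpy as np
    from statsmodels.stats.power import TTestIndPower

    def publishable_rate(s, b, d, alpha=0.05):
        """Probability that one study yields a statistically significant result."""
        power = TTestIndPower().power(effect_size=d, nobs1=s, alpha=alpha)
        return b * power + (1 - b) * alpha

    def equilibrium_sample_size(b, d, IF, s_grid=range(2, 2001)):
        """Grid search for the per-group sample size that maximizes expected profit."""
        profit = [IF * publishable_rate(s, b, d) - s for s in s_grid]
        return list(s_grid)[int(np.argmax(profit))]

    # Example from the paragraph above: equipoise (b = 0.5), d = 0.5, IF = 500.
    s_star = equilibrium_sample_size(b=0.5, d=0.5, IF=500)
    print(s_star, TTestIndPower().power(effect_size=0.5, nobs1=s_star, alpha=0.05))

Sweeping b, d and IF over grids in this way reproduces the kind of parameter map shown in Fig. 2, and on the parameters above the optimum should fall in the well-powered regime described in the text.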

In addition to this map of the parameter space, I have attempted to show what distributions of output power might emerge if different distributions of input parameters are assumed (Fig. 3). This provides insight not only into how the most empirically plausible input distributions (Fig. 3C, c9-c16) would translate into empirically verifiable power distributions, but also into how other, perhaps more desirable, distributions would. Indeed, Fig. 3C, c1-c8 shows precisely the case the reviewer asked about, in which 50% of research niches have b > 0.5, e.g. because funding agencies dedicate funding specifically to confirmatory research. Similarly, the rows in Fig. 3C show outcomes if income per publication is increased.
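The weighting step behind these emergent distributions can be sketched in the same spirit. The following self-contained example is again not the supporting code: the input distributions are purely hypothetical, power is computed with a normal approximation, and both the publishable rate b*power + (1 - b)*alpha and the weight (publishable rate divided by ESS, i.e. expected publications per sample spent) are simplified stand-ins for the TPR/ESS weighting described in the manuscript.

    # Sketch: mapping hypothetical niche-parameter distributions to an emergent,
    # publication-weighted power distribution. Each niche is weighted by its
    # publishable rate divided by its ESS (expected publications per sample spent).
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    S_GRID = np.arange(2, 2001)

    def approx_power(s, d, alpha=0.05):
        # normal-approximation power of a two-sided, two-sample test, s per group
        return norm.cdf(d * np.sqrt(s / 2) - norm.ppf(1 - alpha / 2))

    def ess_and_rate(b, d, IF, alpha=0.05):
        rate = b * approx_power(S_GRID, d) + (1 - b) * alpha   # assumption, see text
        i = np.argmax(IF * rate - S_GRID)
        return S_GRID[i], rate[i]

    powers, weights = [], []
    for _ in range(500):                                       # 500 hypothetical niches
        b, d, IF = rng.uniform(0.1, 0.5), rng.uniform(0.2, 0.8), rng.choice([100, 500, 1000])
        ess, rate = ess_and_rate(b, d, IF)
        powers.append(approx_power(ess, d))
        weights.append(rate / ess)
    print(np.average(powers, weights=weights))                 # emergent mean power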

I agree fully that it would be extremely interesting to explore the empirical relationship between funding rates, IF and the resulting power distributions, and I am also unaware of research investigating this. Similarly, in line with the reviewer’s suggestion, I strongly suspect that a closer investigation of the empirical literature on power would reveal many high-powered studies to be clinical studies (with substantial prior evidence and thus high b). Unfortunately, I do not presently have the means to explore this further. However, I have attempted to make clearer in the manuscript the effects of varying the input parameters, as discussed above (Fig. 2), as well as the predictions if, for example, funders placed a greater emphasis on confirmatory research (Fig. 3) or increased funding ratios (Figs. 2 and 3) (lines 206, 219ff, 236, 260ff).

I have further added an exploration of a particular policy prescription, namely conditional equivalence testing, as suggested by reviewer 2 (lines 266ff). Together, I believe this offers a range of prospective predictions, both to verify the model and to guide policy.

Reviewer #2: This is a very timely and important article. A few minor revisions could improve the manuscript.

  • “IF is a positive constant reflecting mean grant income per publication.” Can we also think about this parameter as a reflection of the “cost to collect data”? If “IF” is large, this translates to the potential profit being higher. Does this imply that the relative cost per sample is small? Conversely, if “IF” is small, does this translate to a situation where data is relatively expensive? Can you discuss how your results reflect on fields where collecting data is expensive versus fields where collecting data is relatively inexpensive? On a related note, see Sassenberg et al. (2019), who conclude that a journal’s “demand for higher statistical power [...] evoked strategic responses among researchers. [...] [R]esearchers used less costly means of data collection, namely, more online studies and less effortful measures.”

Response: Thank you for this comment. This is exactly right. As we state in the manuscript (line 68 in the revised manuscript with tracked changes): ‘For simplicity we scale IF as the number of samples purchasable per publication such that the cost of experimentation reduces to sample size (s)’. Accordingly, the predictions and evidence presented by Sassenberg and Ditrich (2019) fit squarely within the model framework presented here. We have included this in the discussion (lines 390ff).

  • Consider exploring (or at the very least commenting on) a nonlinear cost for sample size. In reality, the cost of increasing one’s sample size from 10 to 20 might be different than the cost of increasing it from 100 to 110.

Response: This is true. Nonlinear cost functions may affect the ESS. I have run some exploratory analyses varying the marginal sample cost by including an exponent (cost = s^varCost, with s = sample size and varCost = 0.5 to 1.1). While this affected equilibrium sample sizes, the change was mostly due to the resulting change in mean slope (i.e. it was similarly and more simply captured by changing IF). In theory, diminishing marginal costs would be relevant particularly if there is a range where sample costs start to diminish quickly, and this range coincides with the range where power approaches 80% or higher (where the income curve begins to saturate in Fig. 1A). Above this range, diminishing marginal costs will have only a minor effect, since marginal returns approach 0 (at 100% power the publication rate no longer increases with sample size). If marginal costs diminish quickly below this range, then we can arguably focus on the cost slope above it. In other words, for the curvature of the cost function to play a major role in determining the equilibrium sample size, it would need to be most prominent around exactly the range of sample sizes where the curvature of the power function is strongest. I would thus cautiously contend that in most cases a linear cost function (with adjustable slope) is a good approximation. I have now included a brief discussion of this (lines 459ff).
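For reference, a compact sketch of this kind of exploratory analysis (again not the code actually used for the manuscript; it relies on a normal-approximation power function and the same simplified publishable rate as in the sketches above):

    # Sketch: effect of a nonlinear cost term, cost = s**varCost, on the equilibrium
    # sample size (normal-approximation power; publishable rate is an assumption).
    import numpy as np
    from scipy.stats import norm

    S_GRID = np.arange(2, 2001)

    def approx_power(s, d, alpha=0.05):
        return norm.cdf(d * np.sqrt(s / 2) - norm.ppf(1 - alpha / 2))

    def ess_with_cost_exponent(b, d, IF, varCost, alpha=0.05):
        income = IF * (b * approx_power(S_GRID, d) + (1 - b) * alpha)
        return S_GRID[np.argmax(income - S_GRID**varCost)]

    for varCost in (0.5, 0.8, 1.0, 1.1):
        print(varCost, ess_with_cost_exponent(b=0.5, d=0.5, IF=500, varCost=varCost))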

• “Which sample sizes are scientifically ideal is of course a complex question in itself, and will depend not only on the cost of sampling but also on the scientific values of true and false positives as well as true and false negatives.” I suggest a comment or a reference on the meta-analytic impact of studies with low power. Consider for example the conclusions of Stanley et al. (2017) in “Finding the power to reduce publication bias.”

Response: Thank you for this comment. I have modified the manuscript accordingly (line 356).

• “For instance, Campbell and Gustafson [43] propose conditional equivalence testing to increase the publication value of negative findings.” I suspect that if, in your model, both positive and negative findings were given equal value, the optimal sample size would still be very low. If, regardless of the outcome, the study will be published and the “IF” received, won’t it be optimal to conduct a large number of very small studies? With this in mind, could you elaborate on “increase the publication value of negative findings”? In other words, for the equation “Profit = IF * TPR – s”, what could we consider for replacing the TPR term? How should we be compensating researchers for their work? While I understand that your model is only an approximation of the complicated research economy, can you point to any alternatives to giving researchers a certain amount of grant money per (positive) publication?

Response: It is true that rewarding the publication of each negative finding, independently of sample size, would promote even lower power (indeed, I believe the optimum would consistently be the minimum sample size). This is intuitively nonsensical, because such extreme underpowering will be judged as problematic for the legitimate reason that ‘absence of evidence is not evidence of absence’.

Conditional equivalence testing avoids this problem by separating statistically significant negative findings from inconclusive results. The TPR term then includes statistically significant positive as well as negative results (but not inconclusive results). It thus augments the economic incentive for sufficient power to detect positive results with an economic incentive for sufficient power to detect negative results. This is now explored in the revised manuscript (lines 76ff, lines 266ff, new Fig.4).
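To illustrate this, the following minimal sketch extends the profit calculation to the CET case. It is not the S4 supporting code: power is computed with normal approximations, the equivalence margin D is expressed in the same standardized units as d, false positives are assumed to be publishable, and the small overlap between NHST-significant and TOST-equivalent outcomes is ignored.

    # Sketch of the CET incentive described above: a study is rewarded if it is either
    # significantly positive (NHST) or significantly equivalent (TOST within a
    # standardized margin D, e.g. D = 0.5*d or D = d). Normal approximations;
    # false positives assumed publishable; overlap between NHST-significant and
    # TOST-equivalent outcomes ignored. Not the S4 supporting code.
    import numpy as np
    from scipy.stats import norm

    S_GRID = np.arange(2, 2001)

    def nhst_power(s, d, alpha=0.05):
        # P(two-sided significance | true standardized effect d), s per group
        return norm.cdf(d * np.sqrt(s / 2) - norm.ppf(1 - alpha / 2))

    def equivalence_power(s, D, alpha=0.05):
        # P(TOST declares equivalence within margin D | true effect = 0)
        se = np.sqrt(2.0 / s)
        return np.maximum(0.0, 2 * norm.cdf(D / se - norm.ppf(1 - alpha)) - 1)

    def ess_cet(b, d, IF, D, alpha=0.05):
        pub_rate = (b * nhst_power(S_GRID, d, alpha)                  # true positives
                    + (1 - b) * alpha                                 # false positives
                    + (1 - b) * equivalence_power(S_GRID, D, alpha))  # significant 'negatives'
        return S_GRID[np.argmax(IF * pub_rate - S_GRID)]

    print(ess_cet(b=0.5, d=0.5, IF=500, D=0.25),   # D = 0.5 * d
          ess_cet(b=0.5, d=0.5, IF=500, D=0.5))    # D = d

In this simplified accounting, tightening the margin (smaller D) raises the sample size needed for a conclusive negative result and therefore pushes the equilibrium sample size upward, which matches the intuition described above.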

Small things:

• “For simplicity we assume they receive funding only, if they publish and they can publish only positive results.” This is an important idea and it needs to be made crystal clear. Please consider rewriting this and perhaps elaborating.

Response: I have added some lines elaborating (lines 59ff).

  • “To compute the implied distribution of emergent power and positive predictive values, the corresponding values for each ESS were weighted by its TPR/ESS.” This is another very important idea. I suggest writing out an example to make sure the reader understands. For instance: “For example, with the same total amount of resources at hand, a researcher could conduct 10 small studies with a sample size of 10 or 2 large studies with a sample size of 50, … the emergent studies (i.e., the published literature) will then have….”

Response: I have added two sentences elaborating (lines 148ff).

• “Two, now prominent, studies from the pharmaceutical industry suggested reproducibility rates of 11 and 22% [1,2, respectively].” Consider adding here a comment/reference to Johnson et al. (2017) “On the Reproducibility of Psychological Science.”

Response: In my understanding, the study by Johnson et al. (2017) essentially reanalyzes the results of the Open Science Collaboration (2015), not to judge the reproducibility rates of the published literature, but to infer unknown quantities such as b (the base probability of true effects among all tests performed). They find that an observed reproducibility rate of ~36% suggests a b of ~0.1, where ~90% of performed tests never enter the literature. Note that this is highly consistent with the present findings (Fig. 3, c13-c16). Since their study does not add to reproducibility estimates, I would prefer not to cite it in the suggested context. I have, however, added a citation where the estimation of b is addressed (lines 243, 523).

• Fix punctuation and spacing: “chosen for a set of parameters (b, d, IF , see table1)”

Response: I will go through the document again to check for punctuation and spacing errors. In the example shown (line 103) there is in fact no space after IF, and the appearance is due to the LaTeX math environment. In my understanding, according to PLOS guidelines, I should use the math environment whenever writing variables to make their appearance uniform.

Attachment

Submitted filename: Response to Reviews.docx

Decision Letter 1

Luis M Miller

11 Feb 2020

A simple model suggesting economically rational sample-size choice drives irreproducibility

PONE-D-19-26742R1

Dear Dr. Braganza,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Luis M. Miller, Ph.D.

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The author has addressed all the comments raised in the previous round; I have no further suggestions for improving the manuscript.

Reviewer #2: Two very minor suggestions:

Line 280 – “In the following, we explore the effects of CET given either Δ = 0.5d or Δ = d (Fig. 4).”

You should add a note to clarify that you have alpha = 0.05 for both the testing of the null and the testing of equivalence. (Campbell & Gustafson suggest that other options could be considered.) Also, perhaps you might want to add that there is a one-to-one correspondence between the alpha and the width of the equivalence margin (inverse relationship).

Line 486 – “have explored the the effect of various”

Remove second “the”

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Marcus Munafo

Reviewer #2: No

Acceptance letter

Luis M Miller

14 Feb 2020

PONE-D-19-26742R1

A simple model suggesting economically rational sample-size choice drives irreproducibility

Dear Dr. Braganza:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Luis M. Miller

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. Model code for quick reference.

    Code to compute the ESS (and associated parameters) based on b, d, IF.

    (PDF)

    S2 File. Model code for quick reference.

    Code to compute the ESS_CET (and associated parameters) based on b, d, IF, Δ.

    (PDF)

    S3 File. Full Python code to compute the ESS (and associated parameters) based on b, d, IF and to generate all figures.

    (PY)

    S4 File. Full Python code to compute the ESS_CET (and associated parameters) based on b, d, IF, Δ and to generate all figures.

    (PY)

    Attachment

    Submitted filename: Response to Reviews.docx

    Data Availability Statement

    The code to run the model and generate the figures is available as supporting files and can additionally be downloaded at www.proxyeconomics.com.

