The Monty Hall problem is arguably one of the best-known probability problems in the public domain [1,2]. The problem is named after Monty Hall, the host of the American television game show Let’s Make a Deal. The game has three doors, with a car behind one door and a goat behind each of the other two. The contestant does not know which door hides the car and thus chooses a door at random. This is where the situation becomes interesting. The host, who can see what is behind the two unchosen doors, opens one unchosen door that has a goat behind it and asks whether the contestant would like to switch from the chosen door to the other unopened door. The question is: would switching increase the contestant’s chance of winning?
The Monty Hall problem became famous as a brain teaser. The first guess of an overwhelming majority of people is that switching would not increase the chance of winning, because it is impossible to be certain which of the two unopened doors has the car behind it. However, the fact that the car can be behind either unopened door does not mean that the two doors are equally likely to hide it. What is the reason? The order of actions matters.
Let us consider a thought experiment with two scenarios (Figure 1a).
Figure 1.

a, The Monty Hall problem and two scenarios. Scenario 1 is the Monty Hall problem, in which the contestant first chooses a door, and then the host opens an unchosen door with a goat behind. Scenario 2 switches the order and lets the host first open a door with a goat behind. Although the contestant is left to choose between two doors in both scenarios, the winning probabilities are different. b, Two analysis procedures for identifying “interesting” features, each of which receives a p-value. In the correct procedure (top), p-value thresholding is performed on all features’ p-values, resulting in a valid control of the FDR. In the incorrect procedure (bottom), a feature screening step is added, so only the features with the smallest p-values are retained before the p-value thresholding step, leading to failed FDR control. The violin plots on the right show the distributions of the false discovery proportions (FDPs) of the two procedures in a simulation study (Zenodo DOI:10.5281/zenodo.7809547), in which the target FDR is 0.05. Note that the FDR is defined as the expectation (i.e., average) of the FDP distribution; only the correct procedure controls the FDR under 0.05. c, An example table that streamlines the analysis procedures.
Scenario 1 is the Monty Hall problem: first, the contestant chooses a door; second, the host opens an unchosen door with a goat behind.
In Scenario 2, we switch the order of actions and let the host act first. That is, the host first randomly opens one of the two doors with a goat behind it, and the contestant then chooses one of the two remaining doors.
In both scenarios, the contestant is left to choose between two unopened doors, one of which has the car behind it. However, the contestant’s two choices have different chances of winning under the two scenarios (Figure 1a). In Scenario 1, the contestant has only a 1/3 chance of winning if not switching, and therefore a 2/3 chance of winning if switching. In contrast, in Scenario 2, the contestant has an equal 1/2 chance of winning regardless of the choice. Many people who are surprised by the Monty Hall problem in fact have Scenario 2 in mind, and thus think that the two unopened doors are equally likely to have the car behind them.
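The two scenarios can be checked numerically. The following is a minimal simulation sketch, not part of the original article; the function name `simulate_monty_hall`, the number of trials, and the random seed are illustrative assumptions.

```python
import random


def simulate_monty_hall(n_trials=100_000, seed=0):
    """Estimate winning probabilities under the two action orders.

    Returns (stay, switch, host_first):
      stay       - Scenario 1, contestant keeps the first choice
      switch     - Scenario 1, contestant switches to the other unopened door
      host_first - Scenario 2, host opens a goat door before any choice is made
    """
    rng = random.Random(seed)
    doors = range(3)
    stay_wins = switch_wins = host_first_wins = 0
    for _ in range(n_trials):
        car = rng.randrange(3)

        # Scenario 1: the contestant chooses, then the host opens an
        # unchosen door that hides a goat.
        pick = rng.randrange(3)
        opened = rng.choice([d for d in doors if d != pick and d != car])
        other = next(d for d in doors if d != pick and d != opened)
        stay_wins += pick == car
        switch_wins += other == car

        # Scenario 2: the host first opens one of the two goat doors, then
        # the contestant chooses between the two remaining doors.
        opened_first = rng.choice([d for d in doors if d != car])
        late_pick = rng.choice([d for d in doors if d != opened_first])
        host_first_wins += late_pick == car

    return (stay_wins / n_trials,
            switch_wins / n_trials,
            host_first_wins / n_trials)


stay, switch, host_first = simulate_monty_hall()
print(f"Scenario 1, stay:   {stay:.3f}")
print(f"Scenario 1, switch: {switch:.3f}")
print(f"Scenario 2:         {host_first:.3f}")
```

With enough trials, the three estimates should be close to 1/3, 2/3, and 1/2, respectively. The only difference between the two scenarios in the code is whether the host's door depends on the contestant's earlier choice.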
The Monty Hall problem is a phenomenal example that demonstrates how the order of actions can influence the final probability calculation. An interesting connection between the Monty Hall problem and scientific research is the calculation of the false discovery rate (FDR), the most widely used criterion in high-throughput data analysis where thousands of features (e.g., genes) are examined simultaneously. Technically, the FDR is defined as the expected proportion of false discoveries among the discoveries.
In bioinformatics analysis, two steps are typically taken to identify “interesting” features (Figure 1b, top). In step 1, a p-value is calculated for every feature (usually, a smaller p-value means the feature is more likely interesting). In step 2, a p-value threshold is determined by a statistical procedure (e.g., the Benjamini-Hochberg procedure [3] or Storey’s q-value procedure [4]) to control the FDR at a target level (e.g., 5%). After the two steps, a feature is identified as a discovery if its p-value is below the threshold.
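As an illustration of step 2, here is a minimal sketch of the Benjamini-Hochberg step-up rule in plain Python. The function names `bh_threshold` and `bh_discoveries` and the example p-values are made up for illustration; in practice an established implementation from a statistics package would normally be used.

```python
def bh_threshold(pvalues, target_fdr=0.05):
    """Return the Benjamini-Hochberg p-value threshold, or None if no
    p-value passes.

    The threshold is the largest sorted p-value p_(k) that satisfies
    p_(k) <= k * target_fdr / m, where m is the number of features.
    """
    m = len(pvalues)
    threshold = None
    for rank, p in enumerate(sorted(pvalues), start=1):
        if p <= rank * target_fdr / m:
            threshold = p  # keep the largest p-value that passes
    return threshold


def bh_discoveries(pvalues, target_fdr=0.05):
    """Return the indices of the features whose p-values are at or below
    the Benjamini-Hochberg threshold."""
    threshold = bh_threshold(pvalues, target_fdr)
    if threshold is None:
        return []
    return [i for i, p in enumerate(pvalues) if p <= threshold]


# Hypothetical p-values for ten features.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(bh_threshold(example), bh_discoveries(example))
```

Note that the threshold depends on m, the number of p-values passed to the procedure, which is why the set of features entering step 2 matters.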
In practice, most researchers do not validate all discoveries but only the features with the smallest p-values (Figure 1b, top). This “top feature validation” is a reasonable strategy given limited resources. However, if this strategy is not used as the last step but is instead performed as “top feature screening” before step 2, it breaks the theoretical guarantee of the p-value thresholding procedure in step 2, so that step 2 fails to control the FDR at the target level and the actual FDR is inflated (Figure 1b, bottom). This phenomenon is well known to statisticians and is often referred to as “double dipping” [5,6] because the same set of p-values is used twice: first to screen for the top features and second to find the p-value threshold.
To make the discussion concrete, imagine that we have RNA-seq samples from a wildtype condition and a gene knockdown condition. Our goal is to find the differentially expressed (DE) genes, i.e., the interesting features, whose expression levels changed significantly after the gene knockdown. The standard practice is to calculate a p-value for each gene (step 1) and find the p-value threshold corresponding to the 5% FDR (step 2). The genes with p-values below the threshold are then identified as DE genes. In the correct approach, the p-values are used only once to determine the threshold, and if the p-values are valid (i.e., the p-values of true non-DE genes are uniformly distributed between 0 and 1), the identified DE genes should satisfy the target 5% FDR (Figure 1b, top). In the incorrect approach, the p-values are used twice: first to screen for the genes with small p-values after step 1, and second to find the p-value threshold based on only these genes in step 2. The DE genes identified this way may have an actual FDR far exceeding the target 5% (Figure 1b, bottom).
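The contrast between the two approaches can be reproduced in a small simulation. The sketch below is not the simulation study cited in Figure 1b. It is a simplified stand-in under stated assumptions: 1,000 features, of which 100 are truly interesting; one-sided z-test p-values with a mean shift of 3 for the interesting features; a screening step that keeps the 200 smallest p-values; and the Benjamini-Hochberg procedure for thresholding. All function names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist


def bh_reject(pvalues, target_fdr):
    """Positions (in `pvalues`) rejected by the Benjamini-Hochberg procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * target_fdr / m:
            n_reject = rank
    return order[:n_reject]


def fdp(rejected, is_null):
    """False discovery proportion: share of rejected features that are null."""
    if not rejected:
        return 0.0
    return sum(is_null[i] for i in rejected) / len(rejected)


def simulate_fdr(n_reps=200, n_features=1000, n_signal=100, effect=3.0,
                 n_keep=200, target_fdr=0.05, seed=1):
    """Return (mean FDP of the correct procedure, mean FDP with screening)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    is_null = [i >= n_signal for i in range(n_features)]
    correct, double_dipped = [], []
    for _ in range(n_reps):
        # Step 1: one-sided z-test p-values; null p-values are uniform on (0, 1).
        pvals = [1.0 - std_normal.cdf(rng.gauss(0.0 if null else effect, 1.0))
                 for null in is_null]

        # Correct: threshold the p-values of all features.
        correct.append(fdp(bh_reject(pvals, target_fdr), is_null))

        # Incorrect: keep only the smallest p-values, then threshold them
        # as if they were the full set of features.
        kept = sorted(range(n_features), key=lambda i: pvals[i])[:n_keep]
        local = bh_reject([pvals[i] for i in kept], target_fdr)
        double_dipped.append(fdp([kept[j] for j in local], is_null))

    return sum(correct) / n_reps, sum(double_dipped) / n_reps


fdr_correct, fdr_screened = simulate_fdr()
print(f"Estimated FDR, thresholding all features: {fdr_correct:.3f}")
print(f"Estimated FDR, screening then thresholding: {fdr_screened:.3f}")
```

Under these assumptions, the first estimate should stay below the 0.05 target while the second should be several times larger. Screening shrinks the number of p-values seen by the thresholding step from 1,000 to 200, which makes the threshold about five times more lenient without removing any of the null features that have small p-values by chance.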
To avoid a possible failure of FDR control, a practical strategy is to use in silico negative controls, such as permuted data [7] or simulated data [8] that are expected to contain no interesting features, to verify that the p-values before thresholding (step 2) approximately follow the uniform distribution between 0 and 1 [9]. This sanity check is essential but largely neglected in data analysis. Another strategy is to avoid the complexity of p-value calculation and directly control the FDR [10], but a sanity check is still needed.
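One simple way to carry out this sanity check is to compare the empirical distribution of the negative-control p-values with the uniform distribution. The sketch below uses the one-sample Kolmogorov-Smirnov statistic and the large-sample approximation 1.36/sqrt(n) for the 5% critical value. The function names and the two synthetic p-value sets are illustrative assumptions; they stand in for p-values that would be computed on permuted or simulated data.

```python
import math
import random


def ks_uniform_statistic(pvalues):
    """Kolmogorov-Smirnov distance between the empirical distribution of
    `pvalues` and the Uniform(0, 1) distribution."""
    xs = sorted(pvalues)
    n = len(xs)
    return max(max(i / n - x, x - (i - 1) / n)
               for i, x in enumerate(xs, start=1))


def looks_uniform(pvalues, critical_coef=1.36):
    """Rough check at about the 5% level; assumes a large number of
    independent p-values."""
    return ks_uniform_statistic(pvalues) <= critical_coef / math.sqrt(len(pvalues))


rng = random.Random(0)
# Stand-in for well-calibrated negative-control p-values.
calibrated = [rng.random() for _ in range(2000)]
# Stand-in for anti-conservative p-values that are skewed toward 0.
inflated = [rng.random() ** 2 for _ in range(2000)]

print("calibrated:", round(ks_uniform_statistic(calibrated), 3))
print("inflated:  ", round(ks_uniform_statistic(inflated), 3))
```

A histogram or a quantile-quantile plot of the negative-control p-values serves the same purpose and is often easier to interpret. The numerical check is convenient when many analyses need to be screened automatically.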
In summary, how to calculate probability correctly is a challenging question in many real-world problems, ranging from the Monty Hall problem in mass media to the high-throughput data analysis problem in scientific research. The order of actions taken is a critical but often ignored factor that determines the validity of probability calculation. As a result, to ensure the transparency and reproducibility of statistical analysis results in research papers, researchers should precisely record all data analysis procedures, including, but not limited to, the selection of data points and features, in the exact order. To help researchers implement this practice, research journals may add to the reporting summary document a table that streamlines the analysis procedures (Figure 1c is an example).
Acknowledgements
The author appreciates the comments and feedback from Dr. Wei Li at the University of California, Irvine, Dr. Chongzhi Zang at the University of Virginia, and the author’s Ph.D. student Mr. Guanao Yan and postdoc Dr. Xinzhou Ge at UCLA.
The author was supported by the following grants: National Science Foundation DBI-1846216 and DMS-2113754, NIH/NIGMS R35GM140888, Johnson & Johnson WiSTEM2D Award, Sloan Research Fellowship, UCLA David Geffen School of Medicine W. M. Keck Foundation Junior Faculty Award, and the Chan-Zuckerberg Initiative Single-Cell Biology Data Insights Grant. The author was a fellow at the Radcliffe Institute for Advanced Study at Harvard University in 2022–2023 while she was writing this paper.
References
1. Letters to the Editor. The American Statistician 29, 67–71 (1975). https://doi.org/10.1080/00031305.1975.10479121
2. Rosenhouse J. The Monty Hall problem: the remarkable story of Math’s most contentious brain teaser. (Oxford University Press, 2009).
3. Benjamini Y & Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 289–300 (1995).
4. Storey JD. The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics 31, 2013–2035 (2003).
5. Benjamini Y. Simultaneous and selective inference: Current successes and future challenges. Biometrical Journal 52, 708–721 (2010).
6. Taylor J & Tibshirani RJ. Statistical learning and selective inference. Proceedings of the National Academy of Sciences 112, 7629–7634 (2015).
7. Li Y, Ge X, Peng F, Li W & Li JJ. Exaggerated false positives by popular differential expression methods when analyzing human population samples. Genome Biology 23, 79 (2022). https://doi.org/10.1186/s13059-022-02648-4
8. Song D, Wang Q, Yan G, Liu T & Li JJ. A unified framework of realistic in silico data generation and statistical model inference for single-cell and spatial omics. bioRxiv, 2022.09.20.508796 (2023). https://doi.org/10.1101/2022.09.20.508796
9. Song D & Li JJ. PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data. Genome Biology 22, 124 (2021). https://doi.org/10.1186/s13059-021-02341-y
10. Ge X et al. Clipper: p-value-free FDR control on high-throughput data from two conditions. Genome Biology 22, 288 (2021). https://doi.org/10.1186/s13059-021-02506-9
