Journal of Postgraduate Medicine
2025 Aug 22;71(3):132–135. doi: 10.4103/jpgm.jpgm_441_25

4. The riddle of confidence levels and the levels of significance in the era of artificial intelligence

A Indrayan
PMCID: PMC12534097  PMID: 40843525

ABSTRACT

Whether it is the diagnosis, treatment, and prognosis of individual patients or the research results derived from group studies, uncertainties are an integral part of the decision-making process because of the omnipresence of inter- and intra-individual variations. Whereas artificial intelligence methods are poised to steeply reduce the uncertainties at the individual level, they cannot completely eliminate the role of chance. This is especially so with group-based medical research where uncertainties may still confound the decision processes even with the advent of artificial intelligence. Probabilities quantify the uncertainties. Confidence level and the level of significance measure the probability in different contexts in statistical inference and remain the essential features of group-based research. Some researchers mix up these two levels. This article explains the difference between these two levels and offers a new perspective in the era of artificial intelligence.

KEY WORDS: Artificial intelligence, confidence level, level of significance, precision medicine, type-I error

Introduction

Medicine often resembles a roulette game, dominated by probabilities and likelihoods. The science of biostatistics is an effective guide to playing this game with conviction. No other science understands and handles randomness and uncertainties as well as statistics does, since probability is its backbone.

The practice of medicine is riddled with uncertainties, and probabilities play a decisive role in diagnosis, treatment, and prognosis, especially when confronting a new patient about whom almost nothing is known. A patient is right to demand and expect a treatment that works for him or her, but the omnipresent inter- and intra-individual variations hinder an assured outcome. Despite these challenges, medicine has flourished, provided succour to mankind, and is now poised to achieve greater credence due to the boost provided by artificial intelligence (AI), machine learning models, and large-language models that help target a specific individual with precision medicine.

AI-based methods surely alleviate the cognitive burden on medical professionals[1] and augment their ability to consider and take action with a greater degree of confidence. The younger generation, with extensive exposure to computer-based applications, would embrace this development with enthusiasm as experience and expertise are pushed to the margins by this new technology. However, a word of caution is in order. Precision medicine is expected to steeply reduce the role of chance and probabilities, but it cannot be eliminated altogether.[1] Moreover, this reduction is mostly for clinical practice at the individual patient level, and the role of chance in group-based empirical research may continue unabated. Thus, it is advisable to appreciate the role of probabilities and their implications for healthcare decisions and outcomes even in the era of AI.

Among many types of probabilities encountered in medical decisions, this article focuses on two well-known probability concepts in statistics, namely, the confidence level and the level of significance. These two terms are sometimes confused even by experienced researchers. They are conventionally tied to an empirical research setup where groups of cases are investigated but can also be useful for individual cases in a clinical setup.

Confidence Level

When a patient with cardiac problems presents a laboratory report with a triglyceride level (TGL) of 130 mg/dL, how confident are you in believing the report and accepting that this is indeed the level? While AI may suggest some answers, the decision is ultimately yours. If your confidence level is low, you may advise that the test be repeated. If the report is from an accredited laboratory, you may still want to guess that the actual level could be within ± 10 mg/dL of the reported level, and most likely between 120 and 140 mg/dL. Statistically, such variation can be measured by the standard deviation (SD) if multiple measurements are available. But that is theoretical, as several measurements on the same person at the same time are almost never done. In another context, if somebody has Q-wave changes in the ECG but no complaints, how confident would you be in initiating a treatment regimen for cardiac disease in such a case? One clinician may do so with 60% confidence and another with 80%. Yet another may have low confidence in his or her own assessment of the confidence level and may not be able to decide one way or the other. Such volatility is in the nature of confidence level and remains prominent in many situations.

We expect that AI models would handle these uncertainties and offer an answer for individual patients,[2] including an ambivalent answer. But similar uncertainties also occur, perhaps at exacerbated levels, in medical research, which is generally based on the study of a series of patients, as explained shortly. The uncertainties may continue at least in the foreseeable future despite the availability of AI. For example, clinicians know that a conclusion regarding a new regimen being better than an existing regimen is generally based on the results of a clinical trial conducted on a group of patients. If the efficacy of the new regimen is 92% against 86% for the existing regimen, how confident are you that the difference is indeed 6%? Statistically, 6% is the sample estimate of the difference in the efficacies of the two regimens, sometimes called the effect size. Imagine a repetition of the trial on similar subjects where the effect size turns out to be only 4%. Assuming there are no data errors and the trial was perfectly executed, the effect size would still vary from trial to trial because the participants are not the same, and inter-individual variations play the spoilsport. Statistically, these variations are called sampling fluctuations and are inherent in any research based on a sample of subjects. Two samples drawn from the same population would almost never give exactly the same results.

The conventional approach in medical research is to study a sample of subjects and attempt to extract a signal from noise that allows an inference to be drawn from that sample. This necessitates reliance on summaries such as the mean and standard deviation for quantitative data, and counts and percentages for qualitative data. These are called estimates because they are derived from samples and are likely to differ from sample to sample even in an era of AI. However paradoxical it may sound, the inference from group averages and counts is used on individual patients in clinical practice in the hope that it would provide desirable results in most cases, if not all. This might change when AI-based precision medicine takes over, but, as of now, to be more realistic, a plausible range of possibilities is considered within which the sample estimates would vary if several samples were studied. In reality, several samples are not actually studied; instead, established statistical methods are employed to compute this plausible range based on just one sample. For instance, if a trial estimates an effect size of 6%, statistical methods may suggest that the true effect size is likely to be somewhere between 3% and 9%. Note the term ‘likely’ in this statement. This likelihood is statistically called the confidence level. Since isolated samples can give a very high or very low effect size depending on the types of subjects that happen to be in the sample, if we want to be very confident, the range of plausible values is enlarged, say, to 1% to 11%, to include those rare possibilities. But such a broad interval could make it difficult for us to make a focused decision. Generally, a 95% confidence level is used to determine the plausible range a to b, effectively saying that 95% of intervals constructed this way from such samples of the same population would contain the actual value of the parameter.
This range is called 95% confidence interval (CI) and is generally calculated from a single sample as follows using Gaussian approximation for large samples:

95% CI: Estimate ± 2 × SE (Estimate),

where SE is the standard error of the estimate.

For the sample mean, SE(mean) = σ/√n, and

for the sample proportion, SE(proportion) = √[π(1 − π)/n],

where σ is the standard deviation (SD) in the concerned population and π is the proportion in the concerned population. In practice, since parameter values σ and π are almost never known in a medical setup, the sample estimate s for σ and P for π are used. In the case of means of quantitative data, the coefficient 2 is replaced by the corresponding t-value because σ is replaced by its estimate s. This is illustrated in the example given later in this article.

The SE measures the sample-to-sample variability in the estimate. The coefficient 2 is an approximation of 1.96 from a Gaussian distribution and is preferred not just for its simplicity but also because it takes care of the approximation made in using sample values s or P in place of the required σ or π, respectively. This CI is valid only where the statistical distribution of the estimate is Gaussian (normal). This is likely to be so due to the central limit theorem if the sample size is large, say, more than 30. For smaller samples from non-Gaussian distributions, other methods of obtaining the CIs, such as nonparametric methods, are available.[3]
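The generic recipe Estimate ± 2 × SE can be sketched in a few lines of code. The TGL-like readings below are invented for illustration; the article's formulas, not its data, are what the sketch follows:

```python
import math
import statistics

# Ten hypothetical TGL readings (mg/dL); invented for illustration only.
data = [128, 131, 135, 127, 133, 130, 129, 136, 132, 134]

mean = statistics.mean(data)
s = statistics.stdev(data)              # sample SD (estimate of sigma)
se = s / math.sqrt(len(data))           # SE(mean) = s / sqrt(n)

# 95% CI using the approximate coefficient 2, as in the article
lo, hi = mean - 2 * se, mean + 2 * se
print(f"mean = {mean:.1f}, 95% CI ~ {lo:.1f} to {hi:.1f}")
```

For a sample this small, the coefficient 2 would be replaced by the t-value at 9 df (2.262), as the article notes for means of quantitative data.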

An effective way to narrow down the plausible range – thus a more focused conclusion – is to increase the sample size. If a sample of size 240 gives a 95% CI from 3% to 9%, a sample of size 960 (4 times) would reduce the CI by half, to 4.5% to 7.5%. This happens because the sampling fluctuations in the estimate reduce as the sample size increases. That is, the estimate may widely differ from sample to sample when the size of each sample is 20 but would not differ much if the sample size is 300. Common sense, too, says that large-sample estimates would be nearly the same from sample to sample. Mathematically, this is because SE (Estimate) has √n in the denominator.
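The √n effect can be verified numerically. The sketch below mirrors the article's example (a CI of 3% to 9% at n = 240, so a half-width of 3 percentage points) and shows that quadrupling n halves the width:

```python
import math

def half_width(se_at_base, base_n, n, z=2.0):
    # SE scales as sqrt(base_n / n) because SE has sqrt(n) in the denominator
    return z * se_at_base * math.sqrt(base_n / n)

se_240 = 1.5  # implied by the 3%-9% interval: half-width 3 = 2 x SE

print(half_width(se_240, 240, 240))  # 3.0 (CI: 3% to 9%)
print(half_width(se_240, 240, 960))  # 1.5 (CI: 4.5% to 7.5%)
```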

As mentioned earlier, if we want to be more confident, say 99%, the CI (the plausible range) must be enlarged. This now becomes

99% CI: Estimate ± 3 × SE (Estimate).

Again, the coefficient 3 approximates the exact 2.58, and 3 is preferred because of the use of sample values in place of the unknown values of the population parameters.
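Both coefficients can be applied to the earlier trial example (efficacies 92% vs 86%). The per-arm sample size of 300 is an assumption for illustration; the article does not state one:

```python
import math

p1, p2, n = 0.92, 0.86, 300   # n = 300 per arm is an assumed value
diff = p1 - p2

# SE of the difference of two independent proportions
se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)

print(f"95% CI: {diff - 2*se:.3f} to {diff + 2*se:.3f}")
print(f"99% CI: {diff - 3*se:.3f} to {diff + 3*se:.3f}")
```

As expected, the 99% interval is wider: higher confidence is bought at the price of a less focused range.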

The CIs are not limited to the mean and proportion; they can be obtained for various other parameters such as correlation coefficient, odds ratio, hazard ratio, and survival duration. The objective of the article is not to give formulas for each CI but only to explain the concept of confidence level and the resulting CIs. This explanation is important because confidence level is sometimes mixed up with the level of significance.

The Level of Significance

Shift the focus from estimating how much the effect is to the test of significance, where a null and an alternative hypothesis are set up and the sample data are used to check whether these data provide sufficient evidence against the null. Although the level of significance has been explained in the previous article,[4] we revisit it here to emphasize its contrast with the confidence level so that the distinction between the two levels is clear.

The null affirms the status quo, as though there is nothing new, and the alternative asserts the new finding. In the classical setup of clinical trials, the null would say that there is no difference in the efficacy of the new and the existing regimen (effect size = 0), whereas the alternative is that the effect size > 0 (a one-sided alternative asserting that the new regimen is better than the existing, as opposed to the two-sided alternative effect size ≠ 0). As mentioned in the previous article, the newer strategy is to specify a minimum medically significant effect δ and test the null H0: effect size ≤ δ versus the alternative H1: effect size > δ. An effect size not more than δ is considered trivial and not enough to change the current practice. For example, a difference of 1% in the efficacy may be statistically significant but clinically irrelevant.

If the sample data do not provide sufficient evidence against the null, the null is not accepted but only retained in the sense that the sample fails to provide evidence against it. The situation returns to the status quo before the study. A Type-I error occurs when a true null hypothesis is rejected because the sample happens to provide sufficient evidence against it. Due to sampling fluctuations, the sample may wrongly say that a minimum medically significant effect is present when actually it is not. This is a serious error (as opposed to the Type-II error of not rejecting a false null), analogous to convicting an innocent person in a court of law. Because of the gravity of this error, a cap is prespecified so that its probability does not exceed a chosen threshold. This prespecified threshold (typically 0.05 or 5%) on the tolerance of the probability of Type-I error is called the level of significance.
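The meaning of a 5% level of significance can be seen in a simulation: when the null is true (both regimens equally effective), roughly 5% of trials will nonetheless wrongly declare a difference. The trial size, efficacy, and number of simulated trials below are arbitrary choices for the sketch:

```python
import math
import random

random.seed(1)

def one_trial(p=0.86, n=300, z=1.96):
    """Simulate one two-arm trial under a TRUE null (both arms efficacy p).
    Returns True if the two-sided z-test wrongly rejects at the 5% level."""
    x1 = sum(random.random() < p for _ in range(n))
    x2 = sum(random.random() < p for _ in range(n))
    p1, p2 = x1 / n, x2 / n
    pooled = (x1 + x2) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return abs(p1 - p2) > z * se  # True = Type-I error in this trial

trials = 2000
rate = sum(one_trial() for _ in range(trials)) / trials
print(f"observed Type-I error rate ~ {rate:.3f}")  # typically near 0.05
```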

To summarize, the primary distinction between the confidence level and the level of significance is that the former is used in the estimation setup, where the objective is to find how much the effect is, and the latter in the test-of-significance setup, where the objective is to find whether the value of the parameter equals a predefined value or lies in a range, as specified in the null and alternative hypotheses. This distinction is likely to remain in the AI era so long as empirical studies on groups of people are conducted. An AI-based decision would be correct only if the AI software is correctly trained to distinguish the precision quantified by the CI from the medically significant effect, and their values are specified by the user. Although the confidence level and the level of significance have distinct meanings and are used in different contexts, as just explained, they happen to be related.

Relation Between the Confidence Level and the Level of Significance

An intense discussion is going on among medical researchers about whether the confidence interval is more informative of the utility of the results than statistical tests of significance.[5] The following clarification may help those who want to understand and resolve this controversy.

Without going into mathematical details, realize that a 95% CI of an effect, such as from 3% to 9%, implies that a null hypothesis between these two values, such as H0: effect size = 4%, would not be rejected at the 5% level of significance. This is explained shortly, but note that a 100(1 – α)% CI corresponds to a 100α% level of significance for a two-tailed test. That is, a 95% CI corresponds to a 5% level of significance and a 99% CI corresponds to a 1% level of significance. The same can be stated for lower and upper confidence bounds and the corresponding one-tailed tests.
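This duality can be checked directly: a hypothesized effect lies inside the 95% CI exactly when the corresponding two-sided test does not reject it at the 5% level. The estimate and SE below are illustrative values, not from a real study:

```python
def in_ci(est, se, h0, z=1.96):
    # h0 lies within Estimate +/- z * SE
    return est - z * se <= h0 <= est + z * se

def rejected(est, se, h0, z=1.96):
    # two-sided test of H0: parameter = h0 at level alpha = 0.05
    return abs((est - h0) / se) > z

est, se = 6.0, 1.5  # e.g., effect size 6% with SE 1.5% (illustrative)
for h0 in (3.0, 4.0, 6.0, 9.5, 10.0):
    assert in_ci(est, se, h0) == (not rejected(est, se, h0))
print("inside the 95% CI <=> not rejected at the 5% level")
```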

To understand it more concretely, consider estimation of the mean albumin-creatinine ratio (ACR) in urine of chronic kidney disease (CKD) patients and healthy controls in samples of n1 = n2 = 30 each. To make the calculations easier, also assume that the SDs in the two groups are the same and equal to 10 mg/g. That is, s1 = s2 = 10 mg/g. If the mean ACR values in the study were 34 and 31 mg/g, respectively, the 95% CI for the difference in means in this case is given by

(mean1 − mean2) ± t(n1 + n2 – 2)(0.95) × s × √(1/n1 + 1/n2),

where t(n1 + n2 – 2)(0.95) is the table value of Student t at (n1 + n2 – 2) df covering the middle 95% of values, and s is the common (pooled) SD.

In our example, the 95% CI for the mean difference is

(34 − 31) ± 2.002 × 10 × √(1/30 + 1/30) = 3 ± 5.17,

or –2.17 to +8.17.

Since this interval contains 0, the null hypothesis H0: effect size = 0 will not be rejected at the 5% level of significance. For testing H0: effect size (δ) = 4, the Student t is

t = (34 − 31 − 4) / [10 × √(1/30 + 1/30)] = –1/2.582 = –0.387.

Since the table value of t at 58 df is 2.002 for the 5% level of significance (two-sided), and the calculated t of –0.387 based on sample values is smaller in absolute value, the null cannot be rejected. The result would be the same for any null hypothesis value within the 95% CI from –2.17 to +8.17. If the hypothesized value is outside the CI, such as δ = 9, the result will be significant with P < 0.05.
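The worked ACR example above can be reproduced in a few lines. The sketch below applies the pooled-SD two-sample t formulas with the article's numbers (means 34 and 31 mg/g, s1 = s2 = 10, n1 = n2 = 30, tabulated t at 58 df = 2.002):

```python
import math

m1, m2, s, n = 34.0, 31.0, 10.0, 30
t_crit = 2.002                           # table value of t at 58 df, 5% two-sided

se = s * math.sqrt(1 / n + 1 / n)        # pooled SE of the mean difference
lo, hi = (m1 - m2) - t_crit * se, (m1 - m2) + t_crit * se
t_for_delta4 = (m1 - m2 - 4) / se        # test of H0: effect size = 4

print(f"95% CI for the mean difference: {lo:.2f} to {hi:.2f}")
print(f"t for H0: delta = 4: {t_for_delta4:.3f}")
```

Any δ inside the printed interval gives |t| below 2.002 and hence is not rejected, which is exactly the duality stated earlier.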

The relation between the confidence level and the level of significance for a two-tailed test is shown in Figure 1. The space for CI is the complement of the space for the level of significance.

Figure 1. Relation between confidence level and the level of significance

Some statisticians argue that the CI provides a range of plausible values and thus offers a more informative representation of the data than the test of significance, which merely says whether or not statistical significance is achieved.[6] However, many medical decisions are binary, such as whether the disease is present or not, whether to do a surgery or not, or whether the effect size is more than 3 or not. For such binary decisions, tests of significance are required. Moreover, statistical tests yield exact P values such as 0.670, 0.03, and < 0.001. Such exact P values give an indication of the strength of evidence that a confidence level does not. CIs are almost invariably computed only at the 95% or 99% confidence level; nobody computes a CI at another confidence level such as 96.8%, whereas a test of significance can say that P < 0.032. Thus, the levels of significance have a definite edge over the confidence levels. Once the test of significance confirms that the minimum clinically relevant effect is present, the CI can be used to find the range of plausible values for that effect.

As noted earlier, many expect AI to reduce the uncertainty in all our activities. This expectation may not hold for the confidence level and the level of significance in group-based research. Nonetheless, AI can readily do the calculations to give the CI and the P value based on your sample data. The final decision will be yours, depending on the specification of the confidence level and the level of significance. Thus, it is crucial to understand the distinction between the two and use them with discretion.

Conflicts of interest

There are no conflicts of interest.


Funding Statement

Nil.

References

1. Gandhi TK, Classen D, Sinsky CA, Rhew DC, Vande Garde N, Roberts A, et al. How can artificial intelligence decrease cognitive and work burden for front line practitioners? JAMIA Open. 2023;6:ooad079. doi: 10.1093/jamiaopen/ooad079.
2. Scholes MS. Artificial intelligence and uncertainty. Risk Sci. 2025;1:100004.
3. Newcombe RG. Two-sided confidence intervals for the single proportion: Comparison of seven methods. Stat Med. 1998;17:857–72. doi: 10.1002/(sici)1097-0258(19980430)17:8<857::aid-sim777>3.0.co;2-e.
4. Indrayan A. P values, power, and medical significance for credible results. J Postgrad Med. 2025;71:91–4. doi: 10.4103/jpgm.jpgm_30_25.
5. Egbuchulem KI. How confident is the confidence interval. Ann Ib Postgrad Med. 2022;20:101–202.
6. Adedokun BO. P value and confidence intervals-Facts and farces. Ann Ib Postgrad Med. 2008;6:33–4. doi: 10.4314/aipm.v6i1.64041.
