Author manuscript; available in PMC: 2026 Jan 1.
Published in final edited form as: Comput Vis ECCV. 2024 Nov 26;15118:182–199. doi: 10.1007/978-3-031-73027-6_11

Task-Driven Uncertainty Quantification in Inverse Problems via Conformal Prediction

Jeffrey Wen 1, Rizwan Ahmad 1, Philip Schniter 1
PMCID: PMC12109201  NIHMSID: NIHMS2074267  PMID: 40438162

Abstract

In imaging inverse problems, one seeks to recover an image from missing/corrupted measurements. Because such problems are ill-posed, there is great motivation to quantify the uncertainty induced by the measurement-and-recovery process. Motivated by applications where the recovered image is used for a downstream task, such as soft-output classification, we propose a task-centered approach to uncertainty quantification. In particular, we use conformal prediction to construct an interval that is guaranteed to contain the task output from the true image up to a user-specified probability, and we use the width of that interval to quantify the uncertainty contributed by measurement-and-recovery. For posterior-sampling-based image recovery, we construct locally adaptive prediction intervals. Furthermore, we propose to collect measurements over multiple rounds, stopping as soon as the task uncertainty falls below an acceptable level. We demonstrate our methodology on accelerated magnetic resonance imaging (MRI): https://github.com/jwen307/TaskUQ.

Keywords: Inverse Problems, Uncertainty Quantification, Conformal Prediction, Posterior Sampling, MRI

1. Introduction

In imaging inverse problems, one seeks to recover an image x from measurements y=h(x) that mask, distort, and/or corrupt x with noise [7]. Linear inverse problems, where y=Ax+ϵ with noise ϵ and known forward operator A, include deblurring, super-resolution, inpainting, colorization, computed tomography (CT), and magnetic resonance imaging (MRI) [26]. Non-linear inverse problems include phase-retrieval, de-quantization, and image-to-image translation. These problems are generally ill-posed, in that it is impossible to perfectly infer x from y.

Most image recovery methods provide a single “point estimate” x^ from measurement y [7]. From x^ alone, it is difficult to determine accuracy with respect to the true x. That is, x^ does not quantify the uncertainty [1] in measurement-and-reconstruction that arises from the ill-posed nature of the inverse problem. This is problematic in safety-critical applications like medical imaging, where hallucinations or degradations in x^ can result in costly misdiagnoses [8].

Several approaches have been proposed to provide uncertainty quantification (UQ) within the image recovery process. One approach is to utilize Bayesian Neural Networks (BNNs), which treat the reconstruction network parameters as random variables [9, 21, 32, 40, 60]. This allows one to quantify epistemic (i.e., model) uncertainty by measuring the variation over reconstructions generated by different draws from the parameter distribution. Posterior sampling methods [2, 6, 14, 19, 20, 30] instead draw many samples from the distribution p(x|y) of plausible x given y, known as the posterior, and aim to quantify the uncertainty that the measurement process imposes on x (i.e., aleatoric uncertainty). It’s possible to combine BNNs with posterior sampling, as in [22].

Although the samples generated by BNNs and posterior-sampling methods could be used in many ways, they are most commonly used to compute pixel-wise uncertainty images or “maps.” To compute pixel-wise uncertainty intervals with statistical guarantees, conformal prediction can be used [4, 29, 36, 53]. Still, the value of these pixel-wise uncertainty maps is not clear. First, when recovering images, we are usually concerned about many-pixel visual structures (e.g., lesions in MRI, hallucinations) that single-pixel statistics say little about. Second, uncertainty maps are not easy to interpret. They often convey little beyond the notion that there is less pixel-wise uncertainty in smooth regions as compared to near edges. Third, it’s not clear how pixel-wise uncertainty relates to the overall imaging goal, which is often task-oriented, such as detecting whether a tumor is present or not.

One approach to assess multi-pixel uncertainty is Bayesian Uncertainty Quantification by Optimization (BUQO) [52], which aims to test whether a particular “structure of interest” in the maximum a-posteriori (MAP) reconstruction is truly present. However, inpainting is used to hypothesize what the image would look like without the structure, the correctness of which is difficult to guarantee.

In this work, we propose a novel UQ framework for imaging inverse problems that aims to provide a more impactful measure of uncertainty. In particular, we aim to quantify to what extent a downstream task behaves differently when supplied with the reconstructed image versus the true image. Our framework supports any measurement-and-reconstruction procedure and any downstream task that outputs a real-valued scalar. Our contributions are as follows.

  1. We propose to construct, using conformal prediction, an interval in the task-output space that is guaranteed to contain the true task output up to a user-specified probability. The prediction interval width provides a natural way to quantify the uncertainty that measurement-and-reconstruction contributes to the downstream task output. (See Fig. 1.)

  2. For posterior-sampling-based image reconstruction, we propose to construct adaptive uncertainty intervals that shrink when the measurements offer more certainty about the true output of the downstream task. (See Fig. 2.)

  3. We propose a multi-round acquisition protocol whereby measurements are accumulated until the task uncertainty is acceptably low.

  4. We demonstrate our approach on accelerated MRI with the task of soft-output-classifying a meniscus tear. Several conformal predictors are evaluated and compared.

Fig. 1:

High-level overview of our approach: For true image x, measurement y=h(x), recovery x^=g(y), and task output z^=f(x^), we use conformal prediction to construct an interval 𝒞(x^; dcal) ⊂ ℝ that is guaranteed to contain the true task output z=f(x) in the sense that P(Z ∈ 𝒞(X^; Dcal)) ≥ 1-α for some chosen error-rate α.

Fig. 2:

Detailed overview of our approach: For true image x, measurement y=h(x), reconstructions {x^(j)}j=1p, and task outputs z^(j)=f(x^(j)), we use conformal prediction with a calibration set dcal={({x^i(j)}j=1p, zi)}i=1n to construct an interval 𝒞({x^(j)}; dcal)=[bl, bu] that is guaranteed to contain the true task output z=f(x) in the sense that P(Z ∈ 𝒞({X^(j)}; Dcal)) ≥ 1-α for some chosen error-rate α.

2. Background

Conformal prediction [3, 54] is a framework for generating uncertainty sets with prescribed statistical guarantees. Notably, it can be applied to any black-box predictor without making any distributional assumptions about the data.

We now explain the basics of conformal prediction or, more precisely, the common variant known as split conformal prediction [37, 41]. Say that we have a black-box model f: 𝒳 → 𝒵 that predicts a target z ∈ 𝒵 from features x ∈ 𝒳. Say that we also have a calibration dataset dcal = {(xi, zi)}i=1n that was unseen when training f, as well as a test feature xtest and an unknown test target ztest. In split conformal prediction, we use the calibration data to construct a prediction set 𝒞(xtest; dcal) ∈ 2^𝒵 that provides the “marginal coverage” [38] guarantee

P(Ztest ∈ 𝒞(Xtest; Dcal)) ≥ 1-α, (1)

where α ∈ [0,1] is a user-specified error rate. In (1) and elsewhere, we use capital letters to denote random variables and lower-case to denote their realizations. Thus (1) can be interpreted as follows: When averaged over random calibration data Dcal and test data (Xtest, Ztest), the set 𝒞(Xtest; Dcal) is guaranteed to contain the correct target Ztest with probability no less than 1-α. Although we would prefer a “conditional coverage” guarantee of the form P(Ztest ∈ 𝒞(Xtest; Dcal) | Xtest = xtest) ≥ 1-α, this is generally impossible to achieve [38, 55].

We now describe the standard recipe for constructing a prediction set 𝒞(xtest; dcal). First one chooses a nonconformity score s(x,z;f) ∈ ℝ that assigns higher values to worse predictions. Then one computes the empirical quantile

q^ ≜ Quantile((1-α)(n+1)/n; s1,…,sn) (2)

from the calibration scores si = s(xi, zi; f). Finally one constructs

𝒞(xtest; dcal) = {z : s(xtest, z; f) ≤ q^}. (3)

Under these choices, it can be proven [37, 56] that the marginal coverage guarantee (1) holds when (X1,Z1),…,(Xn,Zn),(Xtest,Ztest) are i.i.d., and even under the weaker condition that they are exchangeable [54].

There are many ways to construct the nonconformity score s(x,z;f). For real-valued targets z, the simplest choice would be the absolute residual

s(x,z;f) = |z - f(x)|, which yields 𝒞(x; dcal) = [f(x) - q^, f(x) + q^], (4)

whose length |𝒞(x; dcal)| = 2q^ does not depend on x. We will discuss a few other choices in the sequel. For more on conformal prediction, we suggest the excellent overviews [3, 54].
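The recipe in (2)–(4) reduces to a few lines of code. The sketch below is our own minimal illustration (not the authors' released code); it computes the conformal quantile as the ⌈(1-α)(n+1)⌉-th smallest calibration score, which is equivalent to the quantile in (2):

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """q^ from Eq. (2): the ceil((1-alpha)(n+1))-th smallest calibration score."""
    n = len(scores)
    k = min(int(np.ceil((1 - alpha) * (n + 1))), n)  # rank, clipped to n
    return np.sort(np.asarray(scores))[k - 1]

def ar_interval(f_xtest, qhat):
    """Absolute-residual prediction interval from Eq. (4)."""
    return (f_xtest - qhat, f_xtest + qhat)
```

For the absolute-residual score s(x,z;f) = |z - f(x)|, the set in (3) is exactly the interval returned by `ar_interval`.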

3. Proposed Method

Suppose that we collect measurements y=h(x) of the true image x, from which we compute an image recovery x^=g(y). We would ideally like that x^=x, but this is impossible to guarantee with an ill-posed inverse problem. Although there are many ways to quantify the difference between x^ and x (e.g., PSNR, SSIM [57], LPIPS [62], DISTS [18]), we will instead assume that we are primarily interested in using x^ for some downstream task f(x^) ∈ ℝ. As a running example, we consider x to be a medical image, y to be accelerated MRI measurements, and f() ∈ [0,1] to be the soft output of a classifier that aims to detect the presence or absence of a pathology. For example, when f(x^)=0.7, the classifier believes that there is a 70% chance that the pathology exists.

When image recovery is imperfect (i.e., x^ ≠ x), we expect the task output to also be imperfect, in the sense that z^ = f(x^) ≠ f(x) = z. We are thus strongly motivated to understand how close z^ is to the true z or, even better, to construct a prediction interval 𝒞(x^) ⊂ ℝ that contains the true z with some guarantee. The interval width |𝒞(x^)| would then quantify the uncertainty that the measurement-and-reconstruction process contributes to predicting the true task output z.

We emphasize that our approach makes no assumptions about the task f() beyond it producing a real number. For example, if f() is a soft-output classifier, we do not assume that it is accurate or even calibrated [25]. Likewise, our approach does not aim to assess the uncertainty implicit in the task, but rather the additional uncertainty that measurement-and-reconstruction contributes to the task. For a soft-output classifier, a (true) output of z=f(x)=0.7 would express considerable uncertainty about the presence of a pathology in x. But if the true z could be perfectly predicted from x^, then the measurement-and-reconstruction process would bring no additional uncertainty.

To construct the interval 𝒞(x^), we use conformal prediction. Adapting the methodology from Sec. 2 to the current setting, we use a calibration set dcal = {(x^i, zi)}i=1n of (recovered-image, true-task-output) pairs, and we expect to satisfy the marginal coverage guarantee

P(Z ∈ 𝒞(X^; Dcal)) ≥ 1-α (5)

when (X^1,Z1),…,(X^n,Zn),(X^,Z) are exchangeable. In (5) and in the sequel, we explicitly denote the dependence of 𝒞(x^) on the calibration data. To construct the calibration set, we assume access to ground-truth examples {xi}i=1n, a measurement model h(), a reconstruction model g(), and a task function f(), from which we can construct yi = h(xi), x^i = g(yi), and zi = f(xi) for i=1,…,n.

In some cases we may instead have access to a posterior-sampling-based image reconstruction model that generates p recoveries {x^i(j)}j=1p from every measurement yi via x^i(j)=g(yi,vi(j)), where {vi(j)}j=1p are i.i.d. code vectors and, typically, vi(j)~𝒩(0,I). In this case, the prediction interval becomes 𝒞({x^i(j)}j=1p;dcal). As we will see, posterior sampling facilitates locally adaptive prediction sets.

Next we describe different ways to construct the prediction intervals 𝒞(x^; dcal) and 𝒞({x^(j)}j=1p; dcal), and later we describe a multi-round measurement protocol that exploits locally adaptive prediction intervals. See Fig. 2 for a detailed overview of our approach.

3.1. Method 1: Absolute Residuals (AR)

We first consider the case where image recovery yields a point-estimate x^=g(y) of the true x. As described in Sec. 2, a simple way to construct a nonconformity score is through the absolute residual (recall (4))

s(x^,z;f)=|z-f(x^)|. (6)

Evaluating this score on the calibration set dcal gives {si}i=1n, whose empirical quantile q^ can be computed as in (2) and used to construct the prediction interval

𝒞(x^; dcal) = [f(x^) - q^, f(x^) + q^], (7)

which then provides the marginal coverage property (5) [37].

Note that, with this choice of score, the interval width |𝒞(x^; dcal)| = 2q^ varies with the calibration set dcal but not with x^. Thus, for a fixed dcal, the score (6) provides no way to tell whether one x^ will yield more task-output uncertainty than a different x^.

3.2. Method 2: Locally-Weighted Residuals (LWR)

We now consider the case where we have a posterior-sampling-based recovery method that yields p recoveries {x^(j)}j=1p per measurement y. We make no assumption on how accurate or diverse these p samples are, other than assuming that the corresponding task-outputs z^(j)=f(x^(j)) are not all identical.

Suppose that we choose the nonconformity score

s({x^(j)}, z; f) = |z - z̄|/σz, with z̄ ≜ (1/p) Σj=1p f(x^(j)) and σz ≜ √[(1/p) Σj=1p (f(x^(j)) - z̄)²], (8)

evaluate it on the calibration set dcal to get scores {si}i=1n, and compute their empirical quantile q^ as in (2). Then the prediction interval

𝒞({x^(j)}; dcal) = [z̄ - σz q^, z̄ + σz q^] (9)

of this “locally weighted residual” (LWR) method provides the marginal coverage property in (5) [37].

In words, this method first computes (approximate) posterior samples z^(j) ~ Z|Y=y, which are then averaged to approximate the conditional mean z^mmse ≜ E(Z|Y=y) ≈ z̄ and the square-root conditional covariance √cov(Z|Y=y) ≈ σz. When exactly computed, the conditional covariance gives a meaningful uncertainty metric on how well the true Z can be estimated from measurements y, because cov(Z|Y=y) = E(|Z - z^mmse|² | Y=y). However, the σz that we compute is merely an approximation. So, with the aid of the calibration set, σz is adjusted by the scaling q^ to yield a prediction interval [z̄ - σz q^, z̄ + σz q^] that satisfies the marginal coverage criterion (5).

Importantly, the interval width |𝒞({x^(j)};dcal)| now varies with {x^(j)}j=1p through σz. This latter property is known as “local adaptivity” [37].
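The LWR score (8) and interval (9) might be implemented as follows. This is an illustrative sketch with our own helper names; `qhat` denotes the calibrated quantile computed from the scores as in (2):

```python
import numpy as np

def lwr_stats(task_outputs):
    """Sample mean z-bar and standard deviation sigma_z of the p task
    outputs z^(j) = f(x^(j)), as defined in Eq. (8)."""
    z = np.asarray(task_outputs)
    return z.mean(), z.std()

def lwr_score(task_outputs, z_true):
    """Locally weighted residual score from Eq. (8)."""
    zbar, sigma = lwr_stats(task_outputs)
    return abs(z_true - zbar) / sigma

def lwr_interval(task_outputs, qhat):
    """Locally adaptive prediction interval from Eq. (9)."""
    zbar, sigma = lwr_stats(task_outputs)
    return (zbar - sigma * qhat, zbar + sigma * qhat)
```

Note that the interval width 2·σz·q^ scales with the spread of the task outputs across posterior samples, which is exactly what makes the method locally adaptive.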

3.3. Method 3: Conformalized Quantile Regression (CQR)

Another popular locally adaptive method is known as conformalized quantile regression (CQR) [45]. The idea is to construct the nonconformity score using two quantile regressors [35], one which estimates the α/2-th quantile of Z|Y=y and the other which estimates the (1-α/2)-th quantile.

To compute these quantile estimates, we will once again assume access to a posterior-sampling-based recovery method that yields p recoveries {x^(j)}j=1p per measurement y. From the corresponding task-outputs z^(j)=f(x^(j)), we compute the empirical quantiles z^α/2 and z^1-α/2 using

z^ω ≜ Quantile(ω; z^(1),…,z^(p)). (10)

From these quantile estimates, we construct the nonconformity score

s({x^(j)}, z; f) = max{z^α/2 - z, z - z^1-α/2}, (11)

evaluate it on the calibration set dcal to obtain {si}i=1n, and compute their (1-α)(n+1)/n empirical quantile q^ as in (2). Then the prediction interval

𝒞({x^(j)}; dcal) = [z^α/2 - q^, z^1-α/2 + q^] (12)

provides the marginal coverage in (5) [45]. Like (9), this interval is locally adaptive. We will compare these three conformal prediction methods in Sec. 4.
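The CQR construction (10)–(12) can be sketched as follows (again an illustration with hypothetical function names, not the authors' implementation; the empirical quantiles are computed with NumPy's default linear interpolation):

```python
import numpy as np

def cqr_quantiles(task_outputs, alpha):
    """Empirical alpha/2 and 1-alpha/2 quantiles of z^(1),...,z^(p), per Eq. (10)."""
    z = np.asarray(task_outputs)
    return np.quantile(z, alpha / 2), np.quantile(z, 1 - alpha / 2)

def cqr_score(task_outputs, z_true, alpha):
    """Nonconformity score from Eq. (11): positive when z_true falls outside
    the estimated quantile range, negative when it falls inside."""
    lo, hi = cqr_quantiles(task_outputs, alpha)
    return max(lo - z_true, z_true - hi)

def cqr_interval(task_outputs, qhat, alpha):
    """Conformalized interval from Eq. (12): the quantile range, widened
    (or shrunk, if qhat < 0) by the calibrated correction qhat."""
    lo, hi = cqr_quantiles(task_outputs, alpha)
    return (lo - qhat, hi + qhat)
```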

3.4. Multi-Round Measurement Protocol

In many applications, there is a significant cost to collecting a large number of measurements (i.e., acquiring a high-dimensional y). One example is MRI, the details of which will be discussed in Sec. 4. For these applications, we propose to collect measurements over multiple rounds, stopping as soon as the task uncertainty falls below a prescribed level τ. The goal is to collect the minimal number of measurements that accomplishes the task with probability of at least 1-α.

Our approach is to use the prediction interval width |𝒞({x^(j)}; dcal)| as the metric for task uncertainty. This requires the interval to be locally adaptive, as with LWR and CQR above. The details are as follows. First, a sequence of C>1 nested measurement configurations is chosen, so that the resulting measurement sets obey 𝒴[1] ⊂ 𝒴[2] ⊂ ⋯ ⊂ 𝒴[C]. Then, for each configuration k=1,…,C, a calibration set dcal[k] is collected, from which the set-valued function 𝒞(·; dcal[k]) is constructed. At test time, we begin by collecting measurements y ∈ 𝒴[1] according to the first (i.e., minimal) configuration. From y we compute the reconstructions {x^(j)}j=1p and, from them, the task uncertainty |𝒞({x^(j)}; dcal[1])|. If this uncertainty falls below the desired τ, we stop collecting measurements. If not, we collect the additional measurements in 𝒴[2]∖𝒴[1] and repeat the procedure. Figure 3 summarizes the proposed multi-round protocol.
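The protocol reduces to a simple loop. The sketch below assumes hypothetical callables `acquire(k, y)` (augments y with the round-k measurements), `reconstruct(y)` (returns the p posterior samples), and `interval_width(recons, k)` (the calibrated |𝒞(·; dcal[k])|):

```python
def multi_round_acquisition(acquire, reconstruct, interval_width, num_rounds, tau):
    """Collect measurements round by round, stopping once the task
    uncertainty (conformal interval width) falls below tau."""
    y = None
    for k in range(1, num_rounds + 1):
        y = acquire(k, y)            # add the measurements in Y[k] \ Y[k-1]
        recons = reconstruct(y)      # p posterior samples {x^(j)}
        if interval_width(recons, k) < tau:
            return k, recons         # uncertainty acceptably low: stop early
    return num_rounds, recons        # budget exhausted at the densest mask
```

Because each round's conformal predictor is calibrated separately on dcal[k], the coverage guarantee (5) holds marginally within each round.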

Fig. 3:

Proposed multi-round measurement protocol. In each round, measurements are collected and reconstructions and conformal intervals are computed. If the length of the interval falls below a user-set threshold τ, the procedure stops. Otherwise, more measurements are collected, and the process repeats until the threshold has been met.

4. Numerical Experiments

We now demonstrate our task-based uncertainty quantification framework on MRI [34]. MRI offers exceptional soft-tissue contrast without ionizing radiation but suffers from very slow scan times. Accelerated MRI speeds the acquisition process by collecting a fraction 1/R of the measurements specified by the Nyquist sampling theorem. The integer R is known as the “acceleration rate.” When R>1, the inverse problem is ill-posed.

In MRI, a typical task is to diagnose the presence or absence of a pathology. Although this task is typically performed by a radiologist, neural-network-based classification is expected to play a significant role in aiding radiologists [12]. Thus, in our experiments, we implement the task f() using a neural network. Details are given below.

Data:

We use the multi-coil fastMRI knee dataset [61] and in particular the non-fat-suppressed subset, which includes 484 training volumes (17286 training slices, or images) and 100 validation volumes (2188 validation images). We use pathology labels from fastMRI+ [63]. For knee-MRI, meniscus tears yield the largest fastMRI+ label set, and so we choose meniscus-tear-detection as our task. To collect measurements, we retrospectively subsample the fastMRI data in k-space using a set of random nested masks that yield acceleration rates R ∈ {16, 8, 4, 2}, the details of which are described in the Supplementary Materials.

Image Recovery:

We consider two recovery networks g(). As a point estimator, we use the state-of-the-art E2E-VarNet from [49] and, as a posterior sampler, we use the conditional normalizing flow (CNF) from [58]. Both were specifically designed around the fastMRI dataset. Another option would be the MRI diffusion sampler [15], but its performance is a bit worse than the CNF and its sampling speed is 8000× slower [58]. The E2E-VarNet and CNF were each trained to handle all four acceleration rates with a single model.

Task Network:

We used a ResNet50 [27] for the task network f(). Starting from an ImageNet-based initialization, we pretrained the weights to minimize the unsupervised SimCLR loss [13] and later minimized binary-cross-entropy loss using the fastMRI+ labels. See the Supplementary Materials for details.

Empirical Validation:

Recall that the marginal coverage guarantee (5) holds on average over random test samples (X^,Z) and random calibration data Dcal = {(X^1,Z1),…,(X^n,Zn)}. To empirically validate marginal coverage and evaluate other average-performance metrics, we perform Monte-Carlo averaging over T=10000 trials as follows. In each trial t, we randomly partition the 2188-sample validation dataset into a 70% calibration fold with indices ical[t] and a 30% test fold with indices itest[t], construct conformal predictors using the calibration data dcal[t] = {({x^i(j)}j=1p, zi)}i∈ical[t], and evaluate performance on test fold t. Finally, we average performance over the T trials. Further details are given below.
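Each trial's random partition can be sketched as follows (an illustrative helper of our own; per-trial calibration then proceeds with the recipe of Sec. 2 on the calibration fold):

```python
import numpy as np

def trial_split(n_total, cal_frac, rng):
    """Randomly partition indices 0..n_total-1 into a calibration fold
    and a test fold for one Monte-Carlo trial."""
    perm = rng.permutation(n_total)
    n_cal = int(round(cal_frac * n_total))
    return perm[:n_cal], perm[n_cal:]

# One trial at the paper's 70/30 split of the 2188 validation slices:
rng = np.random.default_rng(0)
i_cal, i_test = trial_split(2188, 0.7, rng)
```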

4.1. Effect of Acceleration Rate and Conformal Prediction Scheme

We have seen that the interval length |𝒞({x^(j)};dcal)| provides a way to quantify the uncertainty that the measurement-and-reconstruction scheme contributes to the meniscus-classification task. So a natural question is: How is the interval length affected by the MRI acceleration R? We study this question below.

For a fixed acceleration R, the interval length is also affected by the choice of conformal predictor. All else being equal, better conformal predictors yield smaller uncertainty sets [3]. So another question is: How is the interval length affected by selecting among the AR, LWR, or CQR conformal methods?

To answer these questions, we compute the “average mean interval length” MIL̄ ≜ (1/T) Σt=1T MIL[t] using the trial-t mean interval length

MIL[t] ≜ (1/|itest[t]|) Σi∈itest[t] |𝒞({x^i(j)}j=1p; dcal[t])|. (13)

Figure 4 plots the average mean interval length versus R for the AR, LWR, and CQR conformal predictors using T=10000 trials, p=32 posterior samples, and error-rate α=0.05. The figure shows that, as expected, the average mean interval length decreases as more measurements are collected (i.e., as R decreases). The figure also shows that, as expected, the (locally adaptive) LWR and CQR methods give consistently smaller average mean interval lengths than the (non-adaptive) AR method. In this sense, posterior sampling is advantageous over point sampling.

Fig. 4:

a) Average mean interval length versus acceleration R with p=32 samples. b) Mean interval length versus p with acceleration R=16. All results use error-rate α=0.05 and T=10000 trials.

4.2. Effect of Number of Posterior Samples

Above, we saw that the measurement process and conformal method both affect the prediction-interval length. We conjecture that the image reconstruction process will also affect the prediction-interval length. To investigate this, we vary the number of samples p produced by the posterior-sampling scheme, reasoning that smaller values of p correspond to less accurate recoveries (e.g., a less accurate posterior mean approximation).

Figure 4b plots the average mean interval length versus p for the LWR and CQR conformal predictors using T=10000 trials, acceleration R=16, and error-rate α=0.05. As expected, the interval length decreases as the posterior sample size p grows. But interestingly, LWR is much more sensitive to small values of p than CQR. One implication is that small values of p may suffice when used with an appropriate conformal prediction method.

4.3. Empirical Validation of Coverage

To verify that the marginal coverage guarantee (5) holds, we compute the empirical coverage of Monte-Carlo trial t as

EC[t] ≜ (1/|itest[t]|) Σi∈itest[t] 1{zi ∈ 𝒞({x^i(j)}; dcal[t])}, (14)

where 1{·} denotes the indicator function. Existing theory (see, e.g., [3]) says that when ({X^i(j)}, Zi) and Dcal[t] in (14) are exchangeable pairs of random variables, EC[t] is random and distributed as

ntest·EC[t] ~ BetaBin(ntest, ncal+1-lcal, lcal) for lcal ≜ ⌊(ncal+1)α⌋, (15)

where ntest ≜ |itest[t]| and ncal ≜ |ical[t]|.

For each of the three conformal methods, Fig. 5 shows the histogram of {EC[t]}t=1T from (14) for T=10000, error-rate α=0.05, acceleration R=8, and p=32 posterior samples. The figure shows that this histogram is close to the histogram created from T samples of the theoretical distribution in (15). Figure 5 also prints the average empirical coverage 1Tt=1TEC[t] for each method, which is very close to the target value of 1-α=0.95. Thus we see that, in practice, conformal prediction behaves close to the theory.
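The theoretical distribution in (15) is available as `scipy.stats.betabinom`. A small sketch of the comparison, assuming a 70/30 split of the 2188 validation slices (the exact fold sizes of 1532 and 656 are our assumption, read off from the text):

```python
import numpy as np
from scipy.stats import betabinom

alpha, ncal, ntest = 0.05, 1532, 656     # ~70/30 split of 2188 validation slices
l = int(np.floor((ncal + 1) * alpha))    # l_cal from Eq. (15)

# Theoretical distribution of the *number* of covered test points;
# dividing by ntest gives the distribution of EC[t].
dist = betabinom(ntest, ncal + 1 - l, l)

# Its mean, normalized by ntest, sits just above the 1 - alpha target,
# consistent with the empirical means reported in Fig. 5.
mean_coverage = dist.mean() / ntest
```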

Fig. 5:

For the AR, LWR, and CQR conformal methods, each subplot shows the histograms of the empirical and theoretical empirical-coverage samples {EC[t]}t=1T across T=10000 Monte-Carlo trials using α=0.05, R=8, and p=32. The subplots are also labeled with the empirical mean of {EC[t]}t=1T, which is very close to the target value of 1-α=0.95.

4.4. Multi-Round Measurements

We now investigate the application of the multi-round measurement protocol from Sec. 3.4 to accelerated MRI. For this, we simulated the collection of MRI slices over rounds k=1,…,5, stopping as soon as the α=0.01 interval width |𝒞({x^(j)}; dcal[k])| falls below the threshold of τ=0.1. The first round collects k-space measurements at acceleration rate R[1]=16, and the remaining rounds each collect additional k-space measurements to yield R[2]=8, R[3]=4, R[4]=2, and R[5]=1, respectively. For quantitative evaluation, we randomly selected 8 multi-slice volumes from the 100-volume fastMRI validation set to act as test volumes (half of which were labeled as meniscus tears and half of which were not), and we used the remaining 92 volumes for calibration. We will refer to the corresponding index sets as itest and ical. Additional details about the MRI measurement procedure are given in the Supplementary Materials.

We begin by discussing the AR conformal prediction method, which uses the point-sampling E2E-VarNet [49] for image recovery. The AR method produces prediction intervals 𝒞(x^;dcal[k]) that are x^-invariant (i.e., not locally adaptive). Thus, immediately after calibration, it is known that k=4 measurement rounds (i.e., R=2) are necessary and sufficient to achieve the τ=0.1 threshold at error-rate α=0.01.

The LWR and CQR conformal prediction methods both use the CNF [58] with p=32 posterior samples and yield locally adaptive prediction intervals 𝒞({x^(j)}; dcal[k]). This allows them to evaluate the interval length for each {x^(j)} and stop the measurement process as soon as that length falls below the threshold τ. For test sample i ∈ itest, we denote the final measurement round as

ki ≜ min{k : |𝒞({x^i(j)}; dcal[k])| < τ}. (16)

(Note that {x^i(j)} also changes with the measurement round k, although the notation does not explicitly show this.) The average acceleration is then

R̄ ≜ [(1/|itest|) Σi∈itest 1/R[ki]]^-1. (17)

Table 1 shows the average acceleration R̄ for the AR, LWR, and CQR conformal methods. We see that R̄=2 for the AR method because it always uses four measurement rounds. The LWR and CQR methods achieve higher average accelerations R̄ because fewer measurement rounds suffice in a large fraction of cases. Table 1 also shows that the empirical coverage is close to what we would expect given this relatively small test set.

Table 1:

Average metrics for the multi-round MRI simulation (± standard error).

Method Average Acceleration Empirical Coverage Average Max Center Error

AR 2.000 0.991 ± 0.008 0.032 ± 0.017
LWR 5.157 0.992 ± 0.005 0.020 ± 0.002
CQR 6.762 0.987 ± 0.008 0.044 ± 0.009

Figure 6 plots the distribution of the final rounds {ki}i∈itest for the AR, LWR, and CQR conformal methods. It too shows that the AR method always uses four rounds (i.e., R=2), while the LWR and CQR methods typically use fewer rounds. However, this plot also shows that the LWR method sometimes uses five measurement rounds. This may seem counter-intuitive but can be explained as follows. At k=4, the AR method is calibrated so that the true task output z lands in the prediction interval in all but α=1% of the cases, where the length of that interval is small enough to meet the τ=0.1 threshold. Meanwhile, the LWR (and CQR) methods adapt the prediction interval based on the difficulty of {x^(j)}. In most cases, the LWR prediction interval is smaller than the AR interval, but for a few “difficult” cases the LWR prediction interval is wider, and in fact too wide to meet the τ=0.1 threshold. For these difficult cases, the LWR method moves on to the fifth measurement round.

Fig. 6:

Fraction of slices accepted after a given acceleration rate.

Based on the previous discussion, one might conjecture that the prediction intervals accepted by the AR method at round k=4 will be somehow worse than those accepted by LWR at k=4, even though their lengths all meet the threshold. We can confirm this by interpreting the midpoint of the prediction interval as an estimate of z and evaluating the absolute error on that estimate, which we call the “center error” (CE):

CE({x^(j)}, z) ≜ |z - (bl + bu)/2|, where [bl, bu] = 𝒞({x^(j)}; dcal). (18)

When evaluating the center error, we take the maximum over the slices in each volume. Table 1 lists the average maximum center error and confirms that it is smaller for LWR than for AR.
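The center error (18) and its per-volume maximum are straightforward to compute; a small sketch with our own helper names:

```python
def center_error(interval, z_true):
    """Absolute error of the interval midpoint as an estimate of z, Eq. (18)."""
    b_lo, b_hi = interval
    return abs(z_true - 0.5 * (b_lo + b_hi))

def max_center_error(intervals, z_trues):
    """Maximum center error over the slices of one volume, as used in Table 1."""
    return max(center_error(c, z) for c, z in zip(intervals, z_trues))
```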

Figure 7 shows examples of image reconstructions, pixel-wise standard deviation maps, and CQR prediction intervals for a test image labeled with a meniscus tear. At higher accelerations like R=16, relatively large variations across posterior samples {x^(j)} result in relatively large variations across classifier outputs {z^(j)}, which result in a large prediction interval, i.e., high uncertainty about the ground-truth classifier output z. At lower accelerations like R=4, relatively small variations across posterior samples yield smaller prediction intervals, i.e., less uncertainty about z. While the pixel-wise standard-deviation maps also show reduced variation across posterior samples, it’s difficult to draw conclusions about uncertainty in the downstream task from them. For example, the same pixel-wise variations could result from a set of reconstructions that show clear evidence for a tear in some cases and clear evidence to the contrary in others, or from a set of reconstructions that show clear evidence for a tear in all cases but are corrupted by different noise realizations. Our uncertainty quantification methodology circumvents these issues by focusing on the task itself. Furthermore, by leveraging the framework of conformal prediction, it ensures that the uncertainty estimates are statistically meaningful.

Fig. 7:

MR Image reconstructions and CQR prediction intervals at accelerations R=16 and R=4 with error-rate α=0.01 and a total of p=32 posterior samples. The fastMRI+ bounding box around the meniscus tear is magnified in red. The prediction intervals shrink as the posterior samples become more consistent in the meniscus region. The standard-deviation maps show areas of high pixel-wise uncertainty but are difficult to connect to the downstream task. Note, image brightness was increased to better highlight the tear. Best viewed when zoomed.

As far as practical implementation is concerned, for each slice in a volume, the CNF reconstructions, ResNet-50 classifier outputs, and conformal prediction intervals can be computed in 414 milliseconds for p=32 samples, or 7.4 milliseconds for p=2 samples, on a single NVIDIA A100 GPU.

5. Discussion

A number of works on uncertainty quantification for MRI have been proposed based on Bayesian neural networks and posterior sampling, e.g., [11, 15, 17, 20, 21, 30, 40, 47, 58]. They produce a set of possible reconstructions {x^(j)}, from which a pixel-wise uncertainty map is typically computed. Conformal prediction methods have also been proposed to generate pixel-wise uncertainty maps for MRI and other imaging inverse problems [4, 29, 36, 53], but with statistical guarantees. However, when imaging is performed with the eventual goal of performing a downstream task, pixel-wise uncertainty maps are of questionable value. In this work, we construct a conformal prediction interval that is statistically guaranteed to contain the task output from the true image. We focus on tasks that output a real-valued scalar, such as soft-output binary classification.

Other works have applied conformal prediction to MRI tasks. Lu et al. [39] consider a dataset {(xi, zi)} with MRI images xi and discrete ordinal labels zi ∈ {1,…,K} that rate the severity of a pathology. They design a predictor that, given test x, outputs a set 𝒵(x) ⊆ {1,…,K} that is guaranteed to contain the true label z with probability 1-α. Different from our work, [39] involves no inverse problem and aims to quantify the uncertainty in a discrete z. Sankaranarayanan et al. [46] compute uncertainty intervals on the presence/absence of semantic attributes in images, and mention that one application could be pathology detection in MRI (although they do not pursue it). Although their high-level goal is similar to ours, their solution requires a trained “disentangled” generative network that, in the case of MRI, would generate MRI images from pathology probabilities. To our knowledge, no such networks exist for MRI. In contrast, our method requires only a trained pathology classifier f(), which should be readily available.

Limitations:

First, our method requires a downstream task, which is not always available. Second, we demonstrated our method on only a single inverse problem and task; validation on other applications is needed. Third, our MRI application ideas are preliminary and not ready for clinical use. Since we use the conformal prediction interval width as a proxy for the diagnostic value of the reconstructed image(s), several aspects of our design (e.g., the choice of classifier f(·), recovery algorithm g(·), conformal prediction method, threshold τ, and error-rate α) would need to be tuned and validated through rigorous clinical studies. Fourth, for ease of exposition, the conformal methods that we use (AR, LWR, CQR) are somewhat simple. More advanced methods, like risk-controlling prediction sets (RCPS) [10], may perform better. Fifth, we considered only tasks that output a single real-valued scalar, such as soft-output binary classification. Extensions to more general tasks would be useful. Lastly, our posterior sampler only considers aleatoric uncertainty. In principle, epistemic uncertainty could be included by sampling the generator’s weights from a distribution, as in [22], but more work is needed in this direction.

6. Conclusion

For imaging inverse problems, we proposed a method to quantify how much uncertainty the measurement-and-reconstruction process contributes to a downstream task, such as soft-output classification. In particular, we use conformal prediction to construct an interval that is guaranteed to contain the task output from the true image with high probability. We showed that, with posterior-sampling-based image recovery methods, the prediction intervals can be made adaptive, and we proposed a multi-round measurement protocol that stops acquiring new data once the task uncertainty is sufficiently small. We applied our method to meniscus-tear detection in accelerated knee MRI and demonstrated significant gains in acceleration rate.
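The multi-round protocol admits a simple control-loop sketch. In the toy example below, `acquire`, `reconstruct`, `task`, and `interval` are hypothetical stand-ins for the measurement, recovery, classifier, and conformal-interval routines (not the paper's implementation), and `tau` is the acceptable interval width:

```python
def multiround(acquire, reconstruct, task, interval, tau, max_rounds=8):
    """Acquire measurement rounds until the conformal interval width <= tau."""
    meas = []
    for r in range(1, max_rounds + 1):
        meas.append(acquire(r))              # collect one more round of data
        pred = task(reconstruct(meas))       # scalar task output from recovery
        lo, hi = interval(pred, r)           # conformal interval at round r
        if hi - lo <= tau:                   # task uncertainty acceptable: stop
            break
    return pred, (lo, hi), r

# toy stand-ins where the interval width halves with each round
pred, (lo, hi), rounds = multiround(
    acquire=lambda r: r,
    reconstruct=lambda m: sum(m),
    task=lambda x: 0.5,
    interval=lambda p, r: (p - 1.0 / 2**r, p + 1.0 / 2**r),
    tau=0.3,
)
```

In the MRI setting, each round would correspond to acquiring additional k-space samples, so stopping early translates directly into a higher acceleration rate.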

Supplementary Material

supplementary

Acknowledgements

This work was supported in part by the National Institutes of Health under Grant R01-EB029957.

References

1. Abdar M, Pourpanah F, Hussain S, Rezazadegan D, Liu L, Ghavamzadeh M, Fieguth P, Cao X, Khosravi A, Acharya UR, et al.: A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion 76, 243–297 (2021)
2. Adler J, Öktem O: Deep Bayesian inversion. arXiv:1811.05910 (2018)
3. Angelopoulos AN, Bates S: Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning 16(4), 494–591 (2023). 10.1561/2200000101
4. Angelopoulos AN, Kohli AP, Bates S, Jordan MI, Malik J, Alshaabi T, Upadhyayula S, Romano Y: Image-to-image regression with distribution-free uncertainty quantification and applications in imaging. In: Proc. Intl. Conf. on Machine Learning (2022)
5. Angelopoulos AN, Bates S, Jordan M, Malik J: Uncertainty sets for image classifiers using conformal prediction. In: Proc. Intl. Conf. on Learning Representations (2020)
6. Ardizzone L, Kruse J, Wirkert S, Rahner D, Pellegrini EW, Klessen RS, Maier-Hein L, Rother C, Köthe U: Analyzing inverse problems with invertible neural networks. In: Proc. Intl. Conf. on Learning Representations (2019)
7. Arridge S, Maass P, Öktem O, Schönlieb CB: Solving inverse problems using data-driven models. Acta Numerica 28, 1–174 (Jun 2019)
8. Banerji CRS, Chakraborti T, Harbron C, MacArthur BD: Clinical AI tools must convey predictive uncertainty for each individual patient. Nature Medicine 29(12), 2996–2998 (2023). 10.1038/s41591-023-02562-7
9. Barbano R, Zhang C, Arridge S, Jin B: Quantifying model uncertainty in inverse problems via Bayesian deep gradient descent. In: Proc. IEEE Intl. Conf. on Pattern Recognition. pp. 1392–1399 (2021). 10.1109/ICPR48806.2021.9412521
10. Bates S, Angelopoulos A, Lei L, Malik J, Jordan M: Distribution-free, risk-controlling prediction sets. Journal of the ACM 68(6) (2021). 10.1145/3478535
11. Bendel M, Ahmad R, Schniter P: A regularized conditional GAN for posterior sampling in inverse problems. In: Proc. Neural Information Processing Systems Conf. (2023)
12. Boeken T, Feydy J, Lecler A, Soyer P, Feydy A, Barat M, Duron L: Artificial intelligence in diagnostic and interventional radiology: Where are we now? Diagnostic and Interventional Imaging 104(1), 1–5 (2023)
13. Chen T, Kornblith S, Norouzi M, Hinton G: A simple framework for contrastive learning of visual representations. In: Proc. Intl. Conf. on Machine Learning. pp. 1597–1607 (2020)
14. Chung H, Kim J, McCann MT, Klasky ML, Ye JC: Diffusion posterior sampling for general noisy inverse problems. In: Proc. Intl. Conf. on Learning Representations (2023)
15. Chung H, Ye JC: Score-based diffusion models for accelerated MRI. Med. Image Analysis 80, 102479 (2022)
16. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L: ImageNet: A large-scale hierarchical image database. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition. pp. 248–255 (2009)
17. Denker A, Schmidt M, Leuschner J, Maass P: Conditional invertible neural networks for medical imaging. Journal of Imaging 7(11), 243 (2021)
18. Ding K, Ma K, Wang S, Simoncelli EP: Image quality assessment: Unifying structure and texture similarity. IEEE Trans. on Pattern Analysis and Machine Intelligence 44(5), 2567–2581 (2020)
19. Durmus A, Moulines E, Pereyra M: Efficient Bayesian computation by proximal Markov chain Monte Carlo: When Langevin meets Moreau. SIAM Journal on Imaging Sciences 11(1), 473–506 (2018)
20. Edupuganti V, Mardani M, Vasanawala S, Pauly J: Uncertainty quantification in deep MRI reconstruction. IEEE Trans. on Medical Imaging 40(1), 239–250 (Jan 2021)
21. Ekmekci C, Cetin M: Uncertainty quantification for deep unrolling-based computational imaging. IEEE Trans. on Computational Imaging 8, 1195–1209 (2022). 10.1109/TCI.2022.3233185
22. Ekmekci C, Cetin M: Quantifying generative model uncertainty in posterior sampling methods for computational imaging. In: Proc. Neural Information Processing Systems Workshop (2023)
23. Engstrom L, Ilyas A, Salman H, Santurkar S, Tsipras D: Robustness (Python library) (2019), https://github.com/MadryLab/robustness
24. Falcon W, et al.: PyTorch Lightning (2019), https://github.com/PyTorchLightning/pytorch-lightning
25. Guo C, Pleiss G, Sun Y, Weinberger KQ: On calibration of modern neural networks. In: Proc. Intl. Conf. on Machine Learning. vol. 70, pp. 1321–1330 (2017)
26. Hammernik K, Küstner T, Yaman B, Huang Z, Rueckert D, Knoll F, Akçakaya M: Physics-driven deep learning for computational magnetic resonance imaging: Combining physics and machine learning for improved medical imaging. IEEE Signal Processing Magazine 40(1), 98–114 (2023)
27. He K, Zhang X, Ren S, Sun J: Deep residual learning for image recognition. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
28. Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Proc. Neural Information Processing Systems Conf. vol. 30 (2017)
29. Horwitz E, Hoshen Y: Conffusion: Confidence intervals for diffusion models. arXiv:2211.09795 (2022)
30. Jalal A, Arvinte M, Daras G, Price E, Dimakis A, Tamir J: Robust compressed sensing MRI with deep generative priors. In: Proc. Neural Information Processing Systems Conf. (2021)
31. Joshi M, Pruitt A, Chen C, Liu Y, Ahmad R: Technical report (v1.0) – pseudo-random Cartesian sampling for dynamic MRI. arXiv:2206.03630 (2022)
32. Kendall A, Gal Y: What uncertainties do we need in Bayesian deep learning for computer vision? In: Proc. Neural Information Processing Systems Conf. (2017)
33. Kingma DP, Ba J: Adam: A method for stochastic optimization. In: Proc. Intl. Conf. on Learning Representations (2015)
34. Knoll F, Hammernik K, Zhang C, Moeller S, Pock T, Sodickson DK, Akcakaya M: Deep-learning methods for parallel magnetic resonance imaging reconstruction: A survey of the current approaches, trends, and issues. IEEE Signal Processing Magazine 37(1), 128–140 (Jan 2020)
35. Koenker R, Bassett G: Regression quantiles. Econometrica 46(1) (1978). 10.2307/1913643
36. Kutiel G, Cohen R, Elad M, Freedman D, Rivlin E: Conformal prediction masks: Visualizing uncertainty in medical imaging. In: Proc. Intl. Conf. on Learning Representations (2023)
37. Lei J, G’Sell M, Rinaldo A, Tibshirani RJ, Wasserman L: Distribution-free predictive inference for regression. Journal of the American Statistical Association (2018)
38. Lei J, Wasserman L: Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society 76 (2014). 10.1111/rssb.12021
39. Lu C, Angelopoulos AN, Pomerantz S: Improving trustworthiness of AI disease severity rating in medical imaging with ordinal conformal prediction sets. In: Proc. Intl. Conf. on Medical Image Computing and Computer-Assisted Intervention (2022)
40. Narnhofer D, Effland A, Kobler E, Hammernik K, Knoll F, Pock T: Bayesian uncertainty estimation of learned variational MRI reconstruction. IEEE Trans. on Medical Imaging 41(2), 279–291 (2022)
41. Papadopoulos H, Proedrou K, Vovk V, Gammerman A: Inductive confidence machines for regression. In: Proc. European Conf. on Machine Learning. pp. 345–356 (2002). 10.1007/3-540-36755-1_29
42. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S: PyTorch: An imperative style, high-performance deep learning library. In: Proc. Neural Information Processing Systems Conf. pp. 8024–8035 (2019)
43. Pruessmann KP, Weiger M, Scheidegger MB, Boesiger P: SENSE: Sensitivity encoding for fast MRI. Magnetic Resonance in Medicine 42(5), 952–962 (1999)
44. Roemer PB, Edelstein WA, Hayes CE, Souza SP, Mueller OM: The NMR phased array. Magnetic Resonance in Medicine 16(2), 192–225 (1990)
45. Romano Y, Patterson E, Candès EJ: Conformalized quantile regression. In: Proc. Neural Information Processing Systems Conf. pp. 3543–3553 (2019)
46. Sankaranarayanan S, Angelopoulos AN, Bates S, Romano Y, Isola P: Semantic uncertainty intervals for disentangled latent spaces. In: Proc. Neural Information Processing Systems Conf. (2022)
47. Schlemper J, Castro DC, Bai W, Qin C, Oktay O, Duan J, Price AN, Hajnal J, Rueckert D: Bayesian deep learning for accelerated MR image reconstruction. In: Proc. Machine Learning for Medical Image Reconstruction Workshop. pp. 64–71 (2018)
48. Simonyan K, Zisserman A: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
49. Sriram A, Zbontar J, Murrell T, Defazio A, Zitnick CL, Yakubova N, Knoll F, Johnson P: End-to-end variational networks for accelerated MRI reconstruction. In: Proc. Intl. Conf. on Medical Image Computing and Computer-Assisted Intervention. pp. 64–73 (2020)
50. Sriram A, Zbontar J, Murrell T, Defazio A, Zitnick CL, Yakubova N, Knoll F, Johnson P: End-to-end variational networks for accelerated MRI reconstruction (code) (2020), https://github.com/facebookresearch/fastMRI
51. Sukthanker RS, Huang Z, Kumar S, Timofte R, Van Gool L: Generative flows with invertible attentions. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (2022)
52. Tang M, Repetti A: A data-driven approach for Bayesian uncertainty quantification in imaging. arXiv:2304.11200 (2023)
53. Teneggi J, Tivnan M, Stayman JW, Sulam J: How to trust your diffusion model: A convex optimization approach to conformal risk control. arXiv:2302.03791 (2023)
54. Vovk V, Gammerman A, Shafer G: Algorithmic Learning in a Random World. Springer (2005)
55. Vovk V: Conditional validity of inductive conformal predictors. In: Proc. Asian Conf. on Machine Learning. pp. 475–490 (2012)
56. Vovk V, Gammerman A, Saunders C: Machine-learning applications of algorithmic randomness. In: Proc. Intl. Conf. on Machine Learning. pp. 444–453 (1999)
57. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP: Image quality assessment: From error visibility to structural similarity. IEEE Trans. on Image Processing 13(4), 600–612 (Apr 2004)
58. Wen J, Ahmad R, Schniter P: A conditional normalizing flow for accelerated multi-coil MR imaging. In: Proc. Intl. Conf. on Machine Learning (2023)
59. Wen J, Ahmad R, Schniter P: MRI CNF (code) (2023), https://github.com/jwen307/mri_cnf
60. Xue Y, Cheng S, Li Y, Tian L: Reliable deep-learning-based phase imaging with uncertainty quantification. Optica 6(5) (2019). 10.1364/OPTICA.6.000618
61. Zbontar J, et al.: fastMRI: An open dataset and benchmarks for accelerated MRI. arXiv:1811.08839 (2018)
62. Zhang R, Isola P, Efros AA, Shechtman E, Wang O: The unreasonable effectiveness of deep features as a perceptual metric. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition. pp. 586–595 (2018)
63. Zhao R, Yaman B, Zhang Y, Stewart R, Dixon A, Knoll F, Huang Z, Lui YW, Hansen MS, Lungren MP: fastMRI+: Clinical pathology annotations for knee and brain fully sampled magnetic resonance imaging data. Scientific Data 9(1), 152 (2022). 10.1038/s41597-022-01255-z
