Author manuscript; available in PMC 2017 Nov 11.
Published in final edited form as: IEEE Trans Image Process. 2016 May 11;25(7):3303–3315. doi: 10.1109/TIP.2016.2567072

Multi-level Canonical Correlation Analysis for Standard-dose PET Image Estimation

Le An 1, Pei Zhang 2, Ehsan Adeli 3, Yan Wang 4, Guangkai Ma 5, Feng Shi 6, David S Lalush 7, Weili Lin 8, Dinggang Shen 9,
PMCID: PMC5106345  NIHMSID: NIHMS792299  PMID: 27187957

Abstract

Positron emission tomography (PET) images are widely used in many clinical applications, such as tumor detection and brain disorder diagnosis. To obtain PET images of diagnostic quality, a sufficient amount of radioactive tracer has to be injected into a living body, which inevitably increases the risk of radiation exposure. On the other hand, if the tracer dose is considerably reduced, the quality of the resulting images will be significantly degraded. It is therefore of great interest to estimate a standard-dose PET (S-PET) image from a low-dose one, in order to reduce the risk of radiation exposure while preserving image quality. This may be achieved by mapping both standard-dose and low-dose PET data into a common space and then performing patch-based sparse representation. However, a one-size-fits-all common space built from all training patches is unlikely to be optimal for each target S-PET patch, which limits the estimation accuracy. In this paper, we propose a data-driven multi-level Canonical Correlation Analysis (mCCA) scheme to solve this problem. Specifically, a subset of training data that is most useful in estimating a target S-PET patch is identified in each level and then used in the next level to update the common space and improve the estimation. Additionally, we use multi-modal magnetic resonance images to provide complementary information that further improves the estimation. Validations on phantom and real human brain datasets show that our method effectively estimates S-PET images and well preserves critical clinical quantification measures, such as the standard uptake value.

Keywords: PET estimation, multi-level CCA, sparse representation, locality-constrained linear coding, multi-modal MRI

I. Introduction

POSITRON emission tomography (PET) is a functional imaging technique that is often used to reveal metabolic information for detecting tumors, searching for metastases, and diagnosing certain brain diseases [1, 2]. By detecting pairs of gamma rays emitted from the radioactive tracer injected into a living body, the PET scanner generates an image based on the map of radioactivity of the tracer at each voxel location.

To obtain PET images of diagnostic quality, a standard-dose tracer is often used. However, this raises the risk of radioactive exposure, which can be potentially detrimental to one’s health. Recently, researchers have tried to lower the dose during PET scanning, e.g., using half of the standard dose [3]. Although it is desirable to reduce the dose during the imaging process, reducing the dose will inevitably degrade the overall quality of the PET image. As shown in Fig. 1(a) and (b), the low-dose PET (L-PET) image and the standard-dose PET (S-PET) image differ significantly in image quality, though both images are of the same subject. Our method aims to estimate the S-PET image in a data-driven manner to produce a result (Fig. 1(c)) that is very close to the original S-PET image. Also, since the modern PET scanner is often combined with other imaging modalities (e.g., magnetic resonance imaging (MRI)) to provide both metabolic and anatomical details [4], such information can be leveraged for better estimation of S-PET images.

Fig. 1. (a) Low-dose PET image of a subject. (b) Standard-dose PET image of the same subject. (c) Estimated standard-dose PET image by our method.

Since PET images often have a poor signal-to-noise ratio (SNR) due to high noise levels and low spatial resolution, many methods have been proposed to improve PET image quality during the reconstruction or post-reconstruction process. For example, during reconstruction, anatomical information from MRI priors [5–7] has been utilized. In [8], a nonlocal regularizer is developed that selectively considers the anatomical information only when it is reliable, where this information can come from MRI or CT. In the post-reconstruction process, CT [9, 10] or MRI [11] information can be incorporated, and in [12] both CT and MRI are combined. These methods can suppress noise and improve image quality. Some works have specifically focused on reducing the noise in PET images, including the use of the singular value thresholding concept and Stein’s unbiased risk estimate [13], the use of spatiotemporal patches in a non-local means framework [14], the joint use of wavelet and curvelet transforms [15], and simultaneous delineation and denoising [16].

The aforementioned methods are mainly developed to improve PET image quality during the reconstruction or post-reconstruction process. In contrast, in this work we study the possibility of generating an S-PET-like image of diagnostic quality from an L-PET image and MR images, and investigate how well the image quantification can be preserved in the estimated S-PET images. Our method is essentially a learning-based mapping approach that infers unknown data from known data in different modalities, unlike conventional image enhancement methods.

For practical nuclear medicine, it is desirable to reduce the dose of radioactive tracer. However, lowering the dose changes the measured activity of the underlying biological or metabolic process, and therefore low-dose and standard-dose PET images can differ in terms of activity. This motivates our method, in which we aim to estimate standard-dose-like PET images from L-PET images, a task that cannot be achieved by simple post-processing operations such as denoising. To the best of our knowledge, very few methods attempt to directly estimate the S-PET image from an L-PET image; examples use regression forests [17, 18]. Specifically, in [17] and [18], a regression forest (RF) is trained to estimate a voxel value in an S-PET image, with the L-PET voxel values in the neighborhood as input, and the quality of the estimated S-PET images is further improved by incremental refinement. In the CT imaging domain, to obtain a CT image of diagnostic quality with a lower dose, Fang et al. [19] proposed a low-dose CT perfusion deconvolution method using tensor total-variation regularization.

Recently, there has been rapid development in sparse representation (SR) and dictionary learning for medical images [20]. For example, estimating an S-PET image from an L-PET image can be achieved with patch-based SR by learning a pair of coupled dictionaries from L-PET and S-PET training patches, under the assumption that L-PET and S-PET patches lie on low-dimensional manifolds with similar geometry. To estimate a target S-PET patch, its corresponding L-PET patch is first sparsely represented by the L-PET dictionary, which comprises a set of training L-PET patches. The resulting reconstruction coefficients are then directly applied to the S-PET dictionary to estimate the S-PET patch, where the S-PET dictionary is composed of a set of S-PET patches, each corresponding to an L-PET patch in the L-PET dictionary.

Usually, the patches in the two dictionaries have different distributions (i.e., neighborhood geometries) due to changes in imaging conditions. Hence, it is inappropriate to directly apply the coefficients learned from the L-PET dictionary to the S-PET dictionary for estimation. A solution to this problem is to map the patches into a common space before applying the sparse coefficients, in order to minimize their distribution discrepancy. A common space, sometimes referred to as a coherent space, is a feature space in which coherence between the topological structures of data from different modalities (i.e., L-PET and S-PET patches in our case) is established. In this common space, the L-PET and S-PET features share a common topological structure, and thus an S-PET patch can be estimated more accurately by exploiting the geometric structure of the L-PET patches. One popular technique for learning such a common space is Canonical Correlation Analysis (CCA) [21], which has been widely applied in various tasks such as disease classification [22, 23], population studies [24], image registration [25], and medical data fusion [26]. CCA can be used to learn a global mapping from the original coupled L-PET and S-PET dictionaries and then map both kinds of data into their common space. However, a global common space mapping does not necessarily unify the neighborhood structures in the coupled dictionaries that are involved in reconstructing a specific L-PET patch; hence, it is sub-optimal to estimate its corresponding S-PET patch using the same reconstruction coefficients.

To accurately learn the common space for S-PET estimation, we propose a multi-level CCA (mCCA) framework. Fig. 2 illustrates a two-level scheme. In the first level (top part of Fig. 2), after mapping both L-PET and S-PET data into their common space, a test L-PET patch can be reconstructed by the L-PET dictionary. Rather than immediately estimating the target S-PET patch in this level, a subset of the L-PET dictionary atoms (patches with non-zero coefficients) that are most useful for reconstructing the test L-PET patch are selected and passed on together to the next level with the corresponding S-PET dictionary subset (lower part of Fig. 2). With this data-driven dictionary refinement, the subsequent common space learning and estimation will be improved in the next level. We observe that repeating this process leads to a better final estimation. In addition to the L-PET based estimation, we also leverage multi-modal MRI (i.e., T1-weighted and diffusion tensor imaging (DTI)) to generate an MRI based estimation in a similar way, which can be used to improve the simple L-PET based estimation in a fusion process. This is depicted in Fig. 3. As can be seen, given a test L-PET patch, the training patches are adaptively selected and then used to learn multiple levels of CCA-based common space, with the goal of better representing this test L-PET patch in each level. Similarly, estimation can be made from a test MRI patch (bottom part) by selecting one MRI modality that has the highest correlation with the test L-PET patch. Finally, a fusion strategy generates the final estimated S-PET patch, and all the estimated S-PET patches are aggregated to form the output S-PET image.

Fig. 2. Illustration of the multi-level CCA scheme (with two levels shown as an example). Filled patterns denote L-PET patches and unfilled patterns denote S-PET patches. A pair of L-PET and S-PET patches is indicated by the same color. A coarse reconstruction in the first level (top) is used to select the refined subsets of the dictionaries; at this stage, the dictionary atoms that contribute more to reconstructing the test patch are selected. The estimation in the second level (bottom) is more accurate thanks to the improved mapping and reconstruction using the refined dictionaries. Best viewed in color.

Fig. 3. Our framework for estimating an S-PET image from L-PET and MR images. The top part illustrates the proposed multi-level CCA-based estimation from L-PET patches, and the bottom part depicts the same strategy for estimation using MRI as input. In each level, for a particular test patch, a subset of dictionary atoms is adaptively selected, and the refined dictionaries in the original image space are provided to the next level for common space learning and reconstruction. In the final stage, a fusion strategy is adopted to generate the final estimated S-PET patch using both the MRI- and L-PET-based estimations. Best viewed in color.

We note that in a recent work [27], qualitative visual inspections were performed by physicians on whole-body PET images, and no significant difference between PET images with different doses was found; however, it was observed that the standard uptake values (SUVs) changed when different doses were used. In our work, we test our method on both brain phantom data with an abnormality (i.e., a lesion) and real brain data. Different from [27], we provide quantitative evaluations in terms of both image quality and clinical quantification measures. The results suggest that our estimated standard-dose-like PET images are more similar to the ground-truth standard-dose images, while the low-dose PET images deviate significantly from the standard-dose PET images in various measures.

Compared to [17, 18], in which PET estimation is formulated as a regression problem, our approach tackles it as a sparse representation problem. The sparse representation is computed in an iteratively refined common space for L-PET and S-PET images. Because the intra-data relationships in the L-PET and S-PET data spaces are different, a direct coding and estimation step in the original image space would not be optimal. In our approach, the estimation uses the sparse coefficients learned in the common space, which our experiments, using both image quality and clinical quantification measures, show to be more effective. Compared to the results in [18], with the same data and experimental settings, our method achieves superior performance. Furthermore, only T1-weighted MRI was used in [18], while in our method multi-modal MRI can be adaptively selected and utilized for improved estimation compared to using only T1.

In summary, the contributions of this work are two-fold:

  1. An mCCA-based data-driven scheme is developed to estimate an S-PET image from its L-PET counterpart, with the estimation quality iteratively refined across levels;

  2. Our framework combines both L-PET and multi-modal MRI for better estimation. To the best of our knowledge, this is the first work that estimates an S-PET image by fusing the information from its low-dose counterpart and multi-modal MR images.

The effectiveness of our proposed method was evaluated on both a phantom brain dataset and a real human brain dataset. Extensive experiments were conducted using both image quality metrics and clinical quantification measures. The results demonstrate that the estimated S-PET images well preserve critical measurements such as the standard uptake value (SUV) and show improved image quality in terms of quantitative measures such as the peak signal-to-noise ratio (PSNR), compared with the L-PET images and with the estimations of the baseline methods.

Below, we first describe the proposed method in detail in Section II. Then, we show extensive experimental results, evaluated with different metrics, on both the phantom brain dataset and the real human brain dataset in Section III. Finally, we conclude the paper in Section IV.

II. Methodology

Suppose we have a group of N training image pairs, each composed of an L-PET image and an S-PET image. Given a target L-PET image, we seek to estimate its S-PET counterpart using the training set in a patch-wise manner. Specifically, we first break down each pair of training images into a number of patches at corresponding voxels, leading to sets of L-PET and associated S-PET training patches. Given a target S-PET patch to be estimated, the training patches within the corresponding neighborhood are extracted and preselected. After learning and refining a common space over multiple levels, an estimate of the target S-PET patch from its L-PET counterpart is obtained by patch-based SR with the selected training patches. By replacing L-PET with multi-modal MRI and repeating the above process, we obtain estimates of the target S-PET patch from multiple modalities, which are then fused to obtain the final estimate. Below, we elaborate mCCA for L-PET and multi-modal MRI based estimation in detail. We use bold lowercase letters (e.g., w) to denote vectors and bold uppercase letters (e.g., W) for matrices. Before diving into details, we first briefly review CCA.

A. Canonical Correlation Analysis (CCA)

First introduced in [21], CCA is a multivariate statistical analysis tool that projects two sets of multivariate data into a common space such that the correlation between the projected data is maximized.

In our problem, given two data matrices $\mathbf{X} = \{\mathbf{X}_i \in \mathbb{R}^d,\, i = 1, 2, \ldots, K\}$ and $\mathbf{Y} = \{\mathbf{Y}_i \in \mathbb{R}^d,\, i = 1, 2, \ldots, K\}$ containing K pairs of data from two modalities, the goal of CCA is to find pairs of column projection vectors $\mathbf{w}_X \in \mathbb{R}^d$ and $\mathbf{w}_Y \in \mathbb{R}^d$ such that the correlation between $\mathbf{w}_X^{\top}\mathbf{X}$ and $\mathbf{w}_Y^{\top}\mathbf{Y}$ is maximized. Specifically, the objective function to be maximized is

$$\operatorname*{arg\,max}_{\mathbf{w}_X,\,\mathbf{w}_Y}\; \frac{\mathbf{w}_X^{\top}\mathbf{C}_{XY}\mathbf{w}_Y}{\sqrt{\mathbf{w}_X^{\top}\mathbf{C}_{XX}\mathbf{w}_X}\,\sqrt{\mathbf{w}_Y^{\top}\mathbf{C}_{YY}\mathbf{w}_Y}}, \tag{1}$$

where the data covariance matrices are computed as $\mathbf{C}_{XX} = E[\mathbf{X}\mathbf{X}^{\top}]$, $\mathbf{C}_{YY} = E[\mathbf{Y}\mathbf{Y}^{\top}]$, and $\mathbf{C}_{XY} = E[\mathbf{X}\mathbf{Y}^{\top}]$, in which $E[\cdot]$ denotes the expectation. Eq. (1) can be reformulated as the following constrained optimization problem:

$$\begin{aligned} \underset{\mathbf{w}_X,\,\mathbf{w}_Y}{\text{maximize}} \quad & \mathbf{w}_X^{\top}\mathbf{C}_{XY}\mathbf{w}_Y \\ \text{subject to} \quad & \mathbf{w}_X^{\top}\mathbf{C}_{XX}\mathbf{w}_X = 1, \\ & \mathbf{w}_Y^{\top}\mathbf{C}_{YY}\mathbf{w}_Y = 1. \end{aligned} \tag{2}$$

Eq. (2) can be solved through the following generalized eigenvalue problem

$$\begin{bmatrix} \mathbf{0} & \mathbf{C}_{XY} \\ \mathbf{C}_{YX} & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{w}_X \\ \mathbf{w}_Y \end{bmatrix} = \lambda \begin{bmatrix} \mathbf{C}_{XX} & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_{YY} \end{bmatrix} \begin{bmatrix} \mathbf{w}_X \\ \mathbf{w}_Y \end{bmatrix}. \tag{3}$$

Equivalently, $\mathbf{w}_X$ is an eigenvector of $\mathbf{C}_{XX}^{-1}\mathbf{C}_{XY}\mathbf{C}_{YY}^{-1}\mathbf{C}_{YX}$, and $\mathbf{w}_Y$ is an eigenvector of $\mathbf{C}_{YY}^{-1}\mathbf{C}_{YX}\mathbf{C}_{XX}^{-1}\mathbf{C}_{XY}$. The projection matrices $\mathbf{W}_X$ and $\mathbf{W}_Y$ are obtained by stacking the vectors $\mathbf{w}_X$ and $\mathbf{w}_Y$ as columns, corresponding to the different eigenvalues of the above generalized eigenvalue problem.
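For concreteness, the following is a minimal NumPy sketch of this computation; it solves Eq. (3) directly with a generalized symmetric eigensolver. The inputs are assumed to be zero-mean, and the small ridge term `reg` is our own addition for numerical stability (covariance matrices of patch data are often rank-deficient); neither assumption is part of the original formulation.

```python
import numpy as np
from scipy.linalg import eigh

def cca(X, Y, n_components, reg=1e-6):
    """Sketch of CCA via the generalized eigenvalue problem of Eq. (3).

    X, Y : d x K matrices of paired, zero-mean samples.
    Returns projection matrices W_X, W_Y of shape d x n_components,
    whose columns are the eigenvectors of the largest eigenvalues.
    """
    d, K = X.shape
    Cxx = X @ X.T / K + reg * np.eye(d)   # ridge keeps B positive definite
    Cyy = Y @ Y.T / K + reg * np.eye(d)
    Cxy = X @ Y.T / K

    # Left- and right-hand block matrices of Eq. (3): A v = lambda B v
    A = np.block([[np.zeros((d, d)), Cxy],
                  [Cxy.T, np.zeros((d, d))]])
    B = np.block([[Cxx, np.zeros((d, d))],
                  [np.zeros((d, d)), Cyy]])

    vals, vecs = eigh(A, B)                      # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]  # keep the largest correlations
    return vecs[:d, idx], vecs[d:, idx]          # W_X, W_Y
```

With this convention, data are projected into the common space as `W.T @ x`, which corresponds to $\mathbf{w}^{\top}\mathbf{x}$ above.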

B. Patch Preselection and Common Space Learning

Let $\mathbf{y}_{L,p}$ be a column vector representing a vectorized target L-PET patch of size m × m × m extracted at voxel p. To estimate its S-PET counterpart, we first construct an L-PET dictionary by extracting patches across the N training L-PET images within a neighborhood of size t × t × t centered at p. Repeating this process for each of the N training S-PET images, we harvest an S-PET dictionary coupled with the L-PET dictionary; each dictionary then contains a total of t³ × N patches (atoms).

As the dictionary size grows with t and N, out of all t³ × N patches we preselect a subset of K L-PET patches that are the most similar to $\mathbf{y}_{L,p}$, for computational efficiency in the subsequent learning process. The similarity between patches $\mathbf{y}_i$ and $\mathbf{y}_j$ is defined using structural similarity (SSIM) [28]:

$$\mathrm{SSIM}(\mathbf{y}_i, \mathbf{y}_j) = \frac{2\mu_i\mu_j}{\mu_i^2 + \mu_j^2} \times \frac{2\sigma_i\sigma_j}{\sigma_i^2 + \sigma_j^2}, \tag{4}$$

where μ and σ are the patch mean and standard deviation, respectively. This preselection strategy has been adopted with success in medical image analysis [29]. Note that this patch selection computes a structural similarity metric from the observed first- and second-order statistics of the voxel intensities in each patch, and makes no particular assumption about the noise structure of the images, Gaussian or otherwise. Other similarity metrics could also be applied here if suitable for PET images.
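As an illustration, a small NumPy sketch of this preselection step is given below. The function name and the `eps` stabilizer (guarding against flat, zero-variance patches) are our own additions; the similarity itself is the two-factor SSIM of Eq. (4).

```python
import numpy as np

def preselect_patches(y, candidates, K):
    """Select the K candidate patches most similar to y under Eq. (4).

    y          : target L-PET patch, flattened to a 1-D vector of length d.
    candidates : n x d array of vectorized candidate L-PET patches.
    Returns the indices of the K most similar candidates.
    """
    mu_y, sd_y = y.mean(), y.std()
    mu_c, sd_c = candidates.mean(axis=1), candidates.std(axis=1)
    eps = 1e-12  # stabilizer for flat patches (our addition)
    ssim = (2 * mu_y * mu_c / (mu_y**2 + mu_c**2 + eps)) \
         * (2 * sd_y * sd_c / (sd_y**2 + sd_c**2 + eps))
    return np.argsort(ssim)[::-1][:K]
```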

Let $\mathbf{D}_L = \{\mathbf{d}_{Li} \in \mathbb{R}^d,\, i = 1, 2, \ldots, K\}$ be the L-PET dictionary after preselection and $\mathbf{D}_S = \{\mathbf{d}_{Si} \in \mathbb{R}^d,\, i = 1, 2, \ldots, K\}$ be the corresponding S-PET dictionary, where d = m³ and each pair ($\mathbf{d}_{Li}$, $\mathbf{d}_{Si}$) consists of an L-PET patch and its corresponding S-PET patch. To improve the correlation between $\mathbf{D}_L$ and $\mathbf{D}_S$, we use CCA to learn mappings $\mathbf{w}_L, \mathbf{w}_S \in \mathbb{R}^d$ such that, after mapping, the correlation between $\mathbf{D}_L$ and $\mathbf{D}_S$ is maximized. The mappings are obtained by substituting $\mathbf{X}$ with $\mathbf{D}_L$ and $\mathbf{Y}$ with $\mathbf{D}_S$ in Eq. (3). The projection matrices $\mathbf{W}_L$ and $\mathbf{W}_S$ are composed of the vectors $\mathbf{w}_L$ and $\mathbf{w}_S$ corresponding to different eigenvalues. In summary, this step selects a subset of the training L-PET patches that are most similar to the given L-PET patch, and performs a CCA mapping that transforms the L-PET and S-PET dictionaries into a common space where their correlation is maximized.

C. S-PET Estimation by mCCA

In our multi-level scheme, we learn a CCA mapping at each level and always reconstruct the target L-PET patch $\mathbf{y}_{L,p}$ in the common space. Specifically, let $\mathbf{D}_L^1$ and $\mathbf{D}_S^1$ be the L-PET and S-PET dictionaries in the first level, and $\mathbf{W}_L^1$ and $\mathbf{W}_S^1$ be the learned mappings. The reconstruction coefficients $\boldsymbol{\alpha}^1$ for $\mathbf{y}_{L,p}$ in this level are determined by

$$\operatorname*{arg\,min}_{\boldsymbol{\alpha}^1}\; \left\| \mathbf{W}_L^1\!\left(\mathbf{y}_{L,p} - \mathbf{D}_L^1\boldsymbol{\alpha}^1\right) \right\|_2^2 + \lambda \left\| \boldsymbol{\delta} \odot \boldsymbol{\alpha}^1 \right\|_2^2, \quad \text{s.t. } \mathbf{1}^{\top}\boldsymbol{\alpha}^1 = 1, \tag{5}$$

where ⊙ denotes element-wise multiplication, and the locality adaptor $\boldsymbol{\delta}$ is defined as

$$\boldsymbol{\delta} = \exp\!\left( \frac{\mathrm{dist}\!\left(\mathbf{W}_L^1\mathbf{y}_{L,p},\, \mathbf{W}_L^1\mathbf{D}_L^1\right)}{\sigma} \right), \tag{6}$$

such that each element in $\boldsymbol{\delta}$ is computed from the Euclidean distance between the projected patch $\mathbf{W}_L^1\mathbf{y}_{L,p}$ and each projected dictionary atom in $\mathbf{W}_L^1\mathbf{D}_L^1$. σ is a parameter that adjusts the weight decay based on locality. The atoms in $\mathbf{D}_L^1$ that are far from $\mathbf{y}_{L,p}$ receive a larger penalty, resulting in smaller reconstruction coefficients in $\boldsymbol{\alpha}^1$, whereas the neighbors of $\mathbf{y}_{L,p}$ in $\mathbf{D}_L^1$ are penalized less, allowing higher weights for reconstruction. Eq. (5) is referred to as locality-constrained linear coding (LLC), which has an analytical solution [30], given by

α^1=(C+λdiag(δ))1, (7)
α1=α^1/1α^1,

where $\mathbf{C} = \left(\hat{\mathbf{y}}\mathbf{1}^{\top} - \mathbf{W}_L^1\mathbf{D}_L^1\right)^{\top}\!\left(\hat{\mathbf{y}}\mathbf{1}^{\top} - \mathbf{W}_L^1\mathbf{D}_L^1\right)$ is the data covariance of the projected dictionary centered at the projected patch $\hat{\mathbf{y}} = \mathbf{W}_L^1\mathbf{y}_{L,p}$. It has been shown that the locality constraint can be more effective than sparsity [31]. Instead of using $\boldsymbol{\alpha}^1$ to estimate the S-PET patch as the output of this level, the dictionary atoms in $\mathbf{D}_L^1$ with significant coefficients (e.g., larger than a predefined threshold) in $\boldsymbol{\alpha}^1$ are selected to build a refined L-PET dictionary. The refined L-PET dictionary and the corresponding S-PET dictionary are used for both common space learning and reconstruction in the next level.
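The closed form of Eq. (7) is cheap to evaluate; a NumPy sketch is given below. It assumes the patch and dictionary have already been projected into the common space, and writes the centered covariance $\mathbf{C}$ explicitly, following the LLC solution of [30].

```python
import numpy as np

def llc_code(y_proj, D_proj, lam, sigma):
    """Sketch of the LLC solution, Eqs. (5)-(7), in the common space.

    y_proj : projected test patch  W_L^1 y_{L,p}, shape (r,)
    D_proj : projected dictionary  W_L^1 D_L^1,   shape (r, K)
    """
    K = D_proj.shape[1]
    # Locality adaptor of Eq. (6): distant atoms get larger penalties
    dist = np.linalg.norm(D_proj - y_proj[:, None], axis=0)
    delta = np.exp(dist / sigma)

    # Covariance of the dictionary centered at the test patch
    Z = D_proj - y_proj[:, None]
    C = Z.T @ Z

    alpha = np.linalg.solve(C + lam * np.diag(delta), np.ones(K))
    return alpha / alpha.sum()   # rescale to satisfy 1^T alpha = 1
```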

At the l-th level (l ≥ 2), the reconstruction coefficients $\boldsymbol{\alpha}^l$ are calculated by

$$\operatorname*{arg\,min}_{\boldsymbol{\alpha}^l}\; \left\| \mathbf{W}_L^l\!\left(\mathbf{y}_{L,p} - \mathbf{D}_L^l\boldsymbol{\alpha}^l\right) \right\|_2^2 + \lambda\left\| \boldsymbol{\delta} \odot \boldsymbol{\alpha}^l \right\|_2^2 + \gamma\left\| \mathbf{W}_S^l\!\left(\mathbf{y}_{S,p}^l - \mathbf{y}_{S,p}^{l-1}\right) \right\|_2^2, \quad \text{s.t. } \mathbf{1}^{\top}\boldsymbol{\alpha}^l = 1, \tag{8}$$

where $\mathbf{W}_L^l$ and $\mathbf{W}_S^l$ are the CCA mappings for $\mathbf{D}_L^l$ and $\mathbf{D}_S^l$ at the current l-th level, $\mathbf{y}_{S,p}^{l-1} = \mathbf{D}_S^{l-1}\boldsymbol{\alpha}^{l-1}$ is the estimated S-PET patch from the previous (l − 1)-th level, and $\mathbf{y}_{S,p}^{l} = \mathbf{D}_S^{l}\boldsymbol{\alpha}^{l}$ is the estimation at the l-th level. The third term of Eq. (8) enforces that the estimation at the l-th level does not deviate significantly from that at the (l − 1)-th level, ensuring a gradual and smooth refinement at each level.

By repeating the process above, the dictionary atoms that are most important in reconstructing a target L-PET patch are selected. Therefore, the mapping and reconstruction in the subsequent level can be more effective towards the goal of estimating a particular target S-PET patch. In the final level, we obtain the L-PET based estimation, denoted by ŷS,p.
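Putting the pieces together, the sketch below outlines the per-patch multi-level loop, reusing the `cca` and `llc_code` sketches above. It is a simplification of the full method: patch centering is omitted, the first-level objective Eq. (5) is used at every level (i.e., the Eq. (8) consistency term is left out for brevity), and while λ, the coefficient threshold, and the number of levels follow Section III, the values of σ and `n_components` are assumed here.

```python
import numpy as np

def mcca_estimate(y, D_L, D_S, n_levels=2, thresh=1e-3,
                  n_components=64, lam=0.01, sigma=1.0):
    """Simplified per-patch multi-level CCA estimation (L-PET source only).

    y        : vectorized target L-PET patch, shape (d,)
    D_L, D_S : coupled dictionaries after preselection, shape (d, K)
    """
    for level in range(n_levels):
        Wl, Ws = cca(D_L, D_S, n_components)        # level-specific mapping
        alpha = llc_code(Wl.T @ y, Wl.T @ D_L, lam, sigma)
        if level < n_levels - 1:
            keep = np.abs(alpha) > thresh           # data-driven refinement
            D_L, D_S = D_L[:, keep], D_S[:, keep]
    return D_S @ alpha                              # L-PET-based estimate
```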

D. MRI based Estimation and Fusion

As MRI can reveal anatomical details, we would like to take advantage of this for S-PET estimation. For a given L-PET patch $\mathbf{y}_{L,p}$, we select one MR modality from the T1 and DTI (i.e., fractional anisotropy (FA) and mean diffusivity (MD)) images such that the correlation (i.e., cosine similarity in our case) between the selected MRI patch $\mathbf{y}_{M,p}$ and $\mathbf{y}_{L,p}$ is the highest, though other, more advanced schemes, such as combining all MR modalities, could be used. To compute the correlation, the patches are first normalized to zero mean and unit variance, which helps eliminate the influence of different intensity scales across image modalities. The MRI-based estimation $\check{\mathbf{y}}_{S,p}$ is computed in the same way as the L-PET-based estimation $\hat{\mathbf{y}}_{S,p}$, using the dictionary pair built from the MR images of the selected modality and the S-PET images in the training set. The final fused estimation is obtained by

$$\mathbf{y}_{S,p} = \omega_1\hat{\mathbf{y}}_{S,p} + \omega_2\check{\mathbf{y}}_{S,p}. \tag{9}$$

The fusion weights ω1 and ω2 are learned adaptively for each target S-PET patch by minimizing the following function

$$\begin{aligned} \operatorname*{arg\,min}_{\omega_1,\,\omega_2}\; & \left\| \mathbf{W}_L\mathbf{y}_{L,p} - \mathbf{W}_S\mathbf{y}_{S,p} \right\|_2^2 + \left\| \mathbf{P}_M\mathbf{y}_{M,p} - \mathbf{P}_S\mathbf{y}_{S,p} \right\|_2^2, \\ \text{s.t. } & \mathbf{y}_{S,p} = \omega_1\hat{\mathbf{y}}_{S,p} + \omega_2\check{\mathbf{y}}_{S,p}, \\ & \omega_1 + \omega_2 = 1,\quad \omega_1 \ge 0,\quad \omega_2 \ge 0, \end{aligned} \tag{10}$$

where $\mathbf{W}_L$ and $\mathbf{W}_S$ are the final-level mappings for the L-PET-based estimation, while $\mathbf{P}_M$ and $\mathbf{P}_S$ are the final-level mappings for the MRI-based estimation. This objective function ensures that, in their corresponding common spaces, the final output is close to both the input L-PET patch and the input MRI patch. Note that the L-PET-based and MRI-based estimations can be obtained with different numbers of levels of common space learning. The optimal values of ω1 and ω2 can be computed efficiently using a recently proposed active-set algorithm [32].
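Because ω2 = 1 − ω1, Eq. (10) reduces to a one-dimensional convex quadratic in ω1, so a closed-form minimizer clamped to [0, 1] yields the constrained optimum in this two-weight case; the sketch below uses that simplification in place of the active-set solver [32]. The projection matrices are assumed to be stored as (r × d) operators applied by left multiplication.

```python
import numpy as np

def fuse(y_hat, y_check, y_L, y_M, W_L, W_S, P_M, P_S):
    """Sketch of the fusion step, Eqs. (9)-(10), via a closed-form 1-D solve.

    y_hat    : L-PET-based estimate of the S-PET patch
    y_check  : MRI-based estimate of the S-PET patch
    y_L, y_M : input L-PET and MRI patches
    W_*, P_* : final-level projection operators, shape (r, d)
    """
    a, b = W_S @ y_hat, W_S @ y_check   # candidate estimates, L-PET space
    u, v = P_S @ y_hat, P_S @ y_check   # candidate estimates, MRI space
    c, w = W_L @ y_L, P_M @ y_M         # targets in the two common spaces

    # Minimize ||(c-b) - w1(a-b)||^2 + ||(w-v) - w1(u-v)||^2 over w1 in [0, 1]
    num = (c - b) @ (a - b) + (w - v) @ (u - v)
    den = (a - b) @ (a - b) + (u - v) @ (u - v)
    w1 = np.clip(num / max(den, 1e-12), 0.0, 1.0)
    return w1 * y_hat + (1.0 - w1) * y_check   # fused estimate, Eq. (9)
```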

III. Experiments

For proof of concept, we first evaluate our method on a simulated phantom brain dataset with 20 subjects. Then, a real brain dataset of 11 subjects is introduced and evaluated in detail. The datasets and the experimental results are described in the following.

On both the phantom and real brain datasets, a leave-one-out cross-validation (LOOCV) strategy was employed, i.e., each time one subject is used as the target subject and the rest are used for training. The patch size is set to 5 × 5 × 5 and the neighborhood size for dictionary patch extraction is 15 × 15 × 15. The patches are extracted with a stride of one voxel, and the overlapping regions are averaged to generate the final estimation. During preselection, K = 1200 patches are selected. The regularization parameter λ in Eq. (5) and Eq. (8) is set to 0.01, and γ in Eq. (8) is set to 0.1. We use two-level CCA for both the L-PET and MRI based estimations in the experiments, as we observe that using more levels does not bring significant improvement while increasing the computational time. In the first level, the dictionary atoms with coefficients larger than 0.001 in $\boldsymbol{\alpha}^1$ are selected for learning both the common space mapping and the reconstruction coefficients $\boldsymbol{\alpha}^2$ in the second level.

A. Phantom Brain Dataset

1) Data Description

The phantom brain dataset is constructed from 20 anatomically normal brain models [33, 34]. Within each model, a 3-D “fuzzy” tissue membership volume is available for each tissue class, including background, cerebrospinal fluid, gray matter, white matter, fat, muscle, muscle/skin, skull, blood vessels, connective tissue around fat, dura mater, and bone marrow. To examine the estimation quality, especially in abnormal regions, we randomly place a lesion in the middle temporal gyrus of each brain. Fig. 4 shows examples of an L-PET image and the corresponding S-PET image. Besides the simulated PET images, each model has a T1-weighted MRI.

Fig. 4. Sample images from brain phantom data. Bounding boxes enclose the lesion regions.

For the proof of concept, we want to answer the following questions:

  • Are the estimated S-PET alike images better than the original L-PET images?

  • Is performing the estimation in common space better than in the original image space?

  • Is multi-level CCA mapping more effective than a single-level CCA mapping?

  • Is MRI useful as additional estimation source?

  • Can comparable results be achieved by a denoising filter?

These questions are answered by evaluations using both image quality and clinical measures, as defined in the following.

2) Image Quality Evaluation

For quantitative evaluation, we first compute Peak Signal-to-Noise Ratio (PSNR) between an estimated S-PET image and the ground-truth S-PET image. PSNR is defined as

$$\mathrm{PSNR} = 10\log_{10}\!\left(\frac{\mathrm{DR}^2}{\mathrm{MSE}}\right), \tag{11}$$

where DR denotes the dynamic range of the image, and the mean square error (MSE) between the estimation y and the ground truth s for an image of size n × o × p is given by

$$\mathrm{MSE} = \frac{1}{nop}\sum_{i=1}^{n}\sum_{j=1}^{o}\sum_{k=1}^{p} \left| y(i,j,k) - s(i,j,k) \right|^{2}. \tag{12}$$
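A direct NumPy transcription of Eqs. (11)-(12) is shown below; taking the dynamic range from the ground-truth volume is our assumption, as the text does not specify how DR is computed.

```python
import numpy as np

def psnr(est, gt):
    """PSNR of Eqs. (11)-(12) between an estimated and a ground-truth volume."""
    est, gt = est.astype(np.float64), gt.astype(np.float64)
    mse = np.mean((est - gt) ** 2)          # Eq. (12), vectorized over voxels
    dr = gt.max() - gt.min()                # assumed definition of DR
    return 10.0 * np.log10(dr ** 2 / mse)   # Eq. (11)
```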

A larger PSNR value indicates greater similarity between the estimated S-PET image and the ground-truth S-PET image. Fig. 5 shows the average PSNR scores for the 20 subjects. Baseline comparisons include sparse representation in the original image space (SR), CCA, and multi-level CCA using L-PET as the only estimation source, while our method uses both multi-level CCA and multi-modal MRI.

Fig. 5. Average PSNR scores on the phantom brain dataset. Error bars indicate standard deviation. L means using L-PET as the only estimation source, and L+M means using both L-PET and MRI for estimation. Higher score is better. † indicates p < 0.01 in the t-test as compared to our method.

The results show that estimation in the common space learned by CCA is more accurate than estimation in the original image space. In addition, using more levels of common space learning leads to better results, which are further improved by leveraging MRI data. To validate the statistical significance of our method, we perform a paired-sample t-test comparing the baseline methods against ours. As indicated in Fig. 5, p < 0.01 is observed for all the baseline methods under comparison. This provides further evidence of the advantage of our method, i.e., mCCA (L+M).

Furthermore, the Signal-to-Noise Ratio (SNR) is computed on the estimated images. SNR is defined as

$$\mathrm{SNR} = 20\log_{10}\frac{m_{\mathrm{ROI}}}{\sigma_{\mathrm{ROI}}}, \tag{13}$$

where $m_{\mathrm{ROI}}$ and $\sigma_{\mathrm{ROI}}$ are the mean and standard deviation in the region of interest (ROI); here, the ROI is the lesion region. The average SNR values are shown in Fig. 6. A higher SNR indicates better quality. Compared to the other baseline methods, the highest SNR is achieved by the proposed method, with statistical significance of p < 0.05. The superior SNR performance is consistent with the PSNR comparison.

Fig. 6. Average SNR scores on the phantom brain dataset. The ROI for computing SNR is the lesion region. Error bars indicate standard deviation. L means using L-PET as the only estimation source, and L+M means using both L-PET and MRI for estimation. Higher score is better. † indicates p < 0.01 in the t-test as compared to our method, and * means p < 0.05.

3) Clinical Measure Evaluation

Besides image quality measures, it is also important that the ROI in an estimated S-PET image is well preserved in terms of clinical quantification, as compared to the ground-truth S-PET images. To examine this aspect, we evaluate two measures in the lesion region as the ROI. The first measure is the Contrast-to-Noise Ratio (CNR), which is important in clinical applications for detecting potentially low remnant activity after therapy [14]. CNR is computed between the ROI and the background (the cerebellum in this paper). We use the definition of CNR in [35], i.e.,

$$\mathrm{CNR} = \frac{\left(m_{\mathrm{ROI}} - m_{\mathrm{BG}}\right)/m_{\mathrm{BG}}}{\sqrt{\sigma_{\mathrm{ROI}}^2 + \sigma_{\mathrm{BG}}^2}}, \tag{14}$$

where mROI and mBG are the mean intensities, σROI and σBG are the standard deviations of the ROI and the background, respectively.
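Both ROI-based measures are straightforward to compute from boolean masks; a small sketch covering Eq. (13) and Eq. (14) follows. The mask-based interface is our own convention.

```python
import numpy as np

def roi_measures(img, roi_mask, bg_mask):
    """SNR (Eq. 13) within an ROI and CNR (Eq. 14) between ROI and background.

    img               : 3-D image volume
    roi_mask, bg_mask : boolean masks of the ROI (e.g., lesion) and the
                        background (e.g., cerebellum)
    """
    m_roi, s_roi = img[roi_mask].mean(), img[roi_mask].std()
    m_bg, s_bg = img[bg_mask].mean(), img[bg_mask].std()
    snr = 20.0 * np.log10(m_roi / s_roi)
    cnr = ((m_roi - m_bg) / m_bg) / np.sqrt(s_roi**2 + s_bg**2)
    return snr, cnr
```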

As the goal is to estimate S-PET images that are similar to the ground-truth S-PET images, we report the CNR difference between the estimated S-PET images and the ground-truth S-PET images; a smaller difference indicates less deviation from the ground truth. Fig. 7 shows the average CNR difference. We can see that the CNR difference is large in the original L-PET images as well as in the S-PET images estimated by patch-based SR. This difference is reduced by estimating the S-PET image in the common space learned by CCA, and the proposed mCCA scheme further reduces it. The small p-values, i.e., p < 0.05, demonstrate the statistical significance of the CNR results obtained by our method.

Fig. 7. Average CNR difference on the phantom brain dataset. The ROI for computing CNR is the lesion region. L means using L-PET as the only estimation source, and L+M means using both L-PET and MRI for estimation. Error bars indicate standard deviation. Lower score is better. † indicates p < 0.01 in the t-test as compared to our method, and * means p < 0.05.

Apart from CNR, the SUV calculated from PET images is also critical for diagnostic evaluation and treatment planning [36]. The use of SUV removes variability among patients caused by differences in body size and in the amount of injected tracer dose. In particular, changes in SUV are important in clinical applications; for example, SUV changes can be used to classify patients into different PET-based treatment response categories, and such response classification can guide subsequent treatment decisions [37]. The SUV can be calculated on a per-voxel basis, in which the value for a voxel at location (i, j, k) is defined by

$$\mathrm{SUV} = \frac{c(i,j,k)}{a/w}, \tag{15}$$

where c(i,j,k) is the radioactivity concentration in that voxel (in kBq/ml), a is the decay-corrected amount of injected dose (in kBq), and w is the body weight of the subject (in g). As suggested in [13], smaller changes in SUV are highly desirable, meaning that the estimation does not significantly change the quantitative markers of the PET image. Thus, we report the SUV difference in the lesion region for both the L-PET images and the estimated S-PET images, in order to examine how the SUV in these images deviates from the SUV in the ground-truth S-PET images.
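Eq. (15) translates directly into code; the sketch below applies it voxel-wise to a NumPy volume (the function name and interface are ours).

```python
def suv_map(c, a, w):
    """Per-voxel SUV of Eq. (15).

    c : radioactivity concentration volume (kBq/ml), e.g., a NumPy array
    a : decay-corrected injected dose (kBq)
    w : body weight of the subject (g)
    """
    return c / (a / w)
```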

Fig. 8 shows the SUV difference in the same ROI for the different methods. Compared to the baseline methods, the SUV difference of the proposed method is the smallest. This shows that the S-PET images estimated by our method better preserve the SUV, indicating improved clinical usability compared to the L-PET images or the outputs of the baseline methods. This improvement is statistically significant, as suggested by the small p-values in comparison with the other baseline methods.

Fig. 8. Average SUV difference on the phantom brain dataset. The ROI for computing SUV is the lesion region. L means using L-PET as the only estimation source, and L+M means using both L-PET and MRI for estimation. Error bars indicate standard deviation. Lower score is better. † indicates p < 0.01 in the t-test as compared to our method.

Since image denoising can also improve image quality, we compare our method with the following state-of-the-art denoising methods: 1) BM3D [38], a denoising method based on an enhanced sparse representation in the transform domain, which has also shown favorable performance in PET sinogram denoising [39]; and 2) Optimized Blockwise Nonlocal Means (OBNM) [40], originally developed for 3-D MR images and also applicable to PET images. The denoising operates directly on the L-PET input. The comparison results, in terms of the different measures, are listed in Table I. Note that, for fair comparison, only L-PET is used as the estimation source of our method in this comparison.

TABLE I.

Comparison with different denoising methods. For PSNR and SNR, higher score is better. For CNR and SUV difference, lower score is better.

Method PSNR SNR CNR diff. SUV diff.
mCCA 26.61 ± 1.31 13.75 ± 0.91 0.33 ± 0.18 0.27 ± 0.03
BM3D [38]† 24.21 ± 1.94 13.17 ± 0.97 0.72 ± 0.48 0.39 ± 0.10
OBNM [40]† 24.49 ± 1.86 13.22 ± 0.96 0.45 ± 0.20 0.37 ± 0.07

† indicates p < 0.01 in the t-test as compared to our method.

We observe from the results in Table I that our estimated S-PET images are notably better than the denoised L-PET images. In other words, a simple denoising process cannot produce S-PET-like images that are close to the ground-truth S-PET images, and the clinical quantification cannot be well preserved. To verify the statistical significance, we also perform t-tests comparing the results of the two denoising methods against the proposed method, and for all measures a small p-value, i.e., p < 0.01, is observed.

The evaluations on the phantom brain dataset with abnormal structures (i.e., lesions), using both image quality and clinical quantification measures, with comparisons to baseline methods, suggest that: 1) the estimated S-PET images are more similar to the ground-truth S-PET images than the L-PET images and the outputs of the other baseline methods; 2) estimation in the common space is more effective than in the original image space; 3) multi-level CCA common space learning leads to improved results; 4) MRI can further improve the estimation accuracy; and 5) simple denoising is not adequate to produce S-PET-like images that are close to the ground-truth S-PET images.

B. Real Brain Dataset

1) Data Description

We further evaluate the performance of our method on a real brain dataset. This dataset consists of PET and MR brain images from 11 subjects (6 females and 5 males), all of whom were recommended for PET scans based on clinical examination. Table II summarizes their demographic information. Among these subjects, subjects 9–11 were diagnosed with Mild Cognitive Impairment (MCI). All scans were acquired on a Siemens Biograph mMR MR-PET system. This study was approved by the University of North Carolina at Chapel Hill Institutional Review Board.

TABLE II.

Demographic information of the subjects in the experiments

Subject ID Age Gender Weight (kg)
1 26 F 50.3
2 30 M 137.9
3 33 M 103.0
4 25 F 85.7
5 18 F 59.9
6 19 M 72.6
7 36 M 102.1
8 28 F 83.9
9 65 M 68.0
10 86 F 68.9
11 86 M 74.8

The standard dose and low dose correspond to the standard administered activity (averaging 203 MBq, with an effective dose of 3.86 mSv) and a low administered activity (approximately 51 MBq, with an effective dose of 0.97 mSv), respectively. In general, the effective dose can be regarded as proportional to the administered activity [41]. In our case, the standard-dose activity level is at the low end of the range (185 to 740 MBq) recommended by the Society of Nuclear Medicine and Molecular Imaging (SNMMI) [41]. During PET scanning, the standard-dose scan was performed for a full 12 minutes within 60 minutes of injection of the 18F-FDG radioactive tracer, based on standard protocols. A second PET scan was acquired immediately after the first scan in list mode for 12 minutes. The second scan was then broken up into four separate 3-minute sets, each of which is considered a low-dose scan. As a result, the activity level in each 3-minute scan is significantly lower than the recommended range.

In other words, in this setting the L-PET images are completely separate from the S-PET images in the acquisition process. On the contrary, if only the first scan had been obtained and the low-dose scans acquired by breaking it into four sets, the L-PET and S-PET images would actually come from the same set of data, despite the dose difference. It is worth noting that a reduced acquisition time at standard dose is considered a surrogate for the standard acquisition time at a reduced dose. Regarding reconstruction, all PET scans were reconstructed using standard methods from the vendor. Attenuation correction using the Dixon sequence and scatter correction were applied to the PET images. The reconstruction was performed iteratively using the OS-EM algorithm [42] with three iterations and 21 subsets, followed by post-reconstruction filtering with a 3-D Gaussian of sigma 2 mm. Each PET image has a voxel size of 2.09 × 2.09 × 2.03 mm³.

Besides PET images, structural MR images were acquired at a resolution of 1 × 1 × 1 mm³, and diffusion images were acquired at a resolution of 2 × 2 × 2 mm³. We then compute FA and MD images from the diffusion images. For each subject, the MR images were linearly aligned to the corresponding PET image, and then all of the images were aligned to the first subject using FLIRT [43]. Non-brain tissues were then removed from the aligned images using a skull stripping method [44]. In summary, each subject has an L-PET image, an S-PET image, and three MR images (T1-weighted MRI, FA, and MD). Fig. 9 shows example PET and MR images from one subject.

Fig. 9. Sample PET and MR images of one subject from real brain data.

2) Image Quality Evaluation

For comparison, the benchmark methods include SR, coupled dictionary learning (CDL) [20], and regression forest (RF) [18]. Fig. 10 shows the PSNR results. Using L-PET as the only estimation source, the proposed method achieves the highest PSNR scores. With additional MRI data, further improvement is obtained, with a PSNR of 23.9 for our method. In addition, the small p-values from the t-test verify the statistical significance of our method.

Fig. 10. Average PSNR on the real brain dataset. L means using L-PET as the only estimation source, and L+M means using both L-PET and MRI for estimation. Error bars indicate standard deviation. Number below each method name is the max deviation. † indicates p < 0.01 in the t-test as compared to our method, and * means p < 0.05.

To examine the performance of our method more thoroughly, eight ROIs were segmented for each subject based on the T1-weighted MR image, and the estimation performance within each ROI is evaluated separately. Specifically, on each hemisphere of the brain (i.e., left or right), four ROIs were separated: frontal lobe, parietal lobe, occipital lobe, and temporal lobe. We use the Automated Anatomical Labeling (AAL) template [45] and merge related regions to cover these ROIs. Table III shows the SNR results for each ROI. A higher SNR indicates better quality. Compared to the other methods, the highest SNR is achieved by the proposed multi-level CCA with all estimation sources, and this observation holds for all ROIs. The t-test also yields p-values lower than 0.05 when comparing our method with the others.

TABLE III.

Signal-to-Noise Ratio (SNR) in different ROIs on the real brain dataset. Higher score is better. SD is standard deviation and MD is max deviation.

ROI	L-PET	SR (L)	CDL (L)	RF (L+T1)	mCCA (L)	mCCA (L+M)
1	7.53	8.23	8.56	8.73	8.72	8.76
2	8.30	8.89	9.07	9.15	9.42	9.51
3	7.98	7.99	8.64	9.01	9.24	9.40
4	8.34	7.71	8.34	7.10	9.45	9.64
5	10.25	10.50	10.93	11.58	12.19	12.41
6	10.61	10.32	11.02	10.36	12.35	12.62
7	11.09	11.22	12.95	12.57	13.02	13.31
8	11.66	11.91	13.35	13.21	13.66	13.96
Average	9.5	9.6	10.4	11.1*	11.0	11.2
SD	1.60	1.60	2.01	2.10	1.99	2.07
MD	2.19	2.31	2.99	3.00	2.65	2.76

† indicates p < 0.01 in the t-test as compared to our method, and * means p < 0.05.

3) Clinical Measure Evaluation

We further evaluate the proposed method in terms of clinical usability. Specifically, CNR is first calculated in each of the eight ROIs, with the cerebellum used as the background region. The CNR difference, which measures how much the CNR in the estimated S-PET images deviates from that in the ground truth, is reported in Table IV.

TABLE V.

Standard Uptake Value (SUV) difference for the MCI subjects. Lower score is better. SD is standard deviation and MD is max deviation.

Subject	L-PET	SR (L)	CDL (L)	RF (L+T1)	mCCA (L)	mCCA (L+M)
9 0.646 0.094 0.017 0.334 0.010 0.006
10 1.612 0.275 0.238 0.823 0.182 0.169
11 0.637 0.104 0.050 0.155 0.011 0.001

Average 0.965 0.158 0.101 0.437 0.068 0.059
SD 0.560 0.102 0.119 0.346 0.099 0.095
MD 0.647 0.117 0.136 0.358 0.115 0.110

Compared to the other methods, the CNR difference of the proposed method is the smallest across the ROIs. This superior performance is further corroborated by the p-values, which are all smaller than 0.05. It shows that the S-PET images estimated by our method are the most similar to the ground-truth S-PET images, indicating improved clinical usability compared to the L-PET images or the outputs of the other methods.

Furthermore, since three subjects in this dataset (Subjects 9–11) were diagnosed with MCI, it is particularly important that the estimated S-PET images preserve the SUV in the hippocampal regions. Therefore, we calculate the SUV difference in these regions for the MCI subjects. The results are listed in Table V. As observed, the SUV is well maintained by our method, with minimal deviation from the true SUV in the ground-truth S-PET images. This suggests that our method preserves the clinical quantification well not only in normal brain regions but also in abnormal ones.

For qualitative results, some samples of the S-PET images estimated by our method are shown in Fig. 11, together with the corresponding L-PET images and the ground-truth S-PET images for comparison. As observed, the estimated S-PET images are very close to the ground truth. Compared to the L-PET images, better visual quality is observed in both the ground-truth and the estimated S-PET images. The estimated images are smoother than the ground truth due to the averaging of patches in the final construction of the output images; this process also helps reduce the noise level, as evidenced by the improved SNR in Table III. In summary, both the quantitative and visual results suggest that S-PET images can be well estimated by our method, in terms of the image quality and clinical quantification measures that are critical in diagnosis.

Fig. 11. Visual comparisons of L-PET images (top), ground-truth S-PET images (middle), and the estimated S-PET images by our method (bottom).

C. Computational Cost

We implemented our algorithm in Matlab on a PC with an Intel 2.4 GHz CPU and 8 GB of memory. The main computation is the LLC step of Eq. (5), with a complexity of O(K²), where K is the dictionary size. It takes about 120 seconds to estimate one slice of a 3-D PET image in our data. Our approach is data-driven: at each voxel location, training patches have to be selected and mappings have to be learned, so the current implementation is not very efficient. However, further speedup is possible; for example, since each slice can be estimated independently, a parallel computing scheme over slices can significantly improve efficiency.

D. Discussion

The evaluation performed here compares S-PET to L-PET in the context of the same PET-MR scanner and the same acquisition time. This makes the comparison fair and leaves other variables out of the analysis. Nevertheless, other approaches to improving PET image quality under low-dose conditions exist, including increasing the scan time or exploiting other features such as time-of-flight, although the latter is currently not available on the Biograph mMR. Future studies should consider these other methods along with our technique in determining the best approach for a specific clinical or research protocol.

The proposed method can be generalized to other PET targets, applications, and scanning protocols; however, to maintain a controlled experiment, we performed the evaluation in the context of FDG brain PET with a specific dose reduction on the same scanner, controlled for the same scan time. Thus, the conclusions regarding performance should, at this time, be limited to these specific conditions. Future studies will consider how performance changes under different dose-reduction ratios, with different PET tracers, and in different anatomical targets.

Regarding radiation, although a single PET scan involves a low dose, the exposure from multiple scans accumulates. Based on the report on the Biological Effects of Ionizing Radiation (BEIR VII), the increased risk of cancer incidence is 10.8% per Sv; in other words, one brain PET scan increases the lifetime risk of cancer by about 0.04%. The International Commission on Radiological Protection considers the increased risk of death from cancer to be about 4% per Sv, meaning that one brain PET scan increases the risk of death from cancer by about 0.015%. Although these numbers are small, the risks accumulate for patients who undergo multiple PET scans as part of their treatment plan, and pediatric patients face increased risks. Therefore, the long-term focus of this work is to reduce the total dose for the non-standard population at increased risk from PET radiation.

IV. Conclusions

A multi-level CCA scheme has been proposed for estimating S-PET images from L-PET and multi-modal MR images. On both phantom and real brain datasets, extensive evaluations using both image quality and clinical measures have demonstrated the effectiveness of the proposed method. Notably, in the estimated S-PET images, the desired quantification measures such as SUV were faithfully preserved with respect to the ground-truth S-PET images. Compared with rival methods, our approach achieved superior performance.

In this work, we have demonstrated that high-quality S-PET-like images can be estimated offline in a learning-based framework from low-dose PET and MR images. This potentially meets an important clinical demand to significantly reduce the injected radioactive tracer dose during PET scanning. In the future, more effective estimation and fusion techniques will be studied to improve the estimation quality.

We have drawn our conclusions based on quantitative measures on two datasets. In the future, we plan to enroll more subjects in our dataset to evaluate the proposed method more rigorously. To further validate the effectiveness of our method in clinical tasks, larger-scale experiments and evaluations by physicians should be conducted, which is our ongoing work.

TABLE IV.

Contrast-to-Noise Ratio (CNR) difference in different ROIs on the real brain dataset. Lower score is better. SD is standard deviation and MD is max deviation.

ROI	L-PET	SR (L)	CDL (L)	RF (L+T1)	mCCA (L)	mCCA (L+M)
1	0.148	0.192	0.095	0.027	0.016	0.013
2	0.121	0.183	0.103	0.034	0.018	0.015
3	0.172	0.184	0.085	0.052	0.021	0.017
4	0.151	0.167	0.114	0.076	0.003	0.001
5	0.074	0.027	0.023	0.017	0.018	0.016
6	0.042	0.035	0.029	0.019	0.013	0.012
7	0.251	0.072	0.066	0.071	0.028	0.024
8	0.273	0.081	0.059	0.047	0.024	0.021
Average	0.154	0.118	0.071	0.043*	0.018	0.015
SD	0.071	0.079	0.034	0.023	0.008	0.007
MD	0.119	0.091	0.049	0.033	0.015	0.014

† indicates p < 0.01 in the t-test as compared to our method, and * means p < 0.05.

Acknowledgments

This work was supported by NIH grants (EB006733, EB008374, EB009634, MH100217, AG041721, AG049371, AG042599).

Biographies


Le An received the B.Eng. degree in telecommunications engineering from Zhejiang University, China, in 2006, the M.Sc. degree in electrical engineering from Eindhoven University of Technology, the Netherlands, in 2008, and the Ph.D. degree in electrical engineering from the University of California, Riverside, USA, in 2014. His research interests include image processing, computer vision, pattern recognition, and machine learning.


Pei Zhang received his Ph.D. from the University of Manchester in 2011. He then took up a research position at the University of North Carolina at Chapel Hill. His major interests involve computer vision, machine learning, and their applications in medical image analysis. He has published more than 10 peer-reviewed papers in major international journals and conferences.


Ehsan Adeli received his Ph.D. from the Iran University of Science and Technology. He is now a post-doctoral research fellow at the Biomedical Research Imaging Center, University of North Carolina at Chapel Hill. He previously worked as a visiting research scholar at the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA. Dr. Adeli’s research interests include machine learning, computer vision, multimedia, and image processing with applications to medical imaging.


Yan Wang received her Bachelor’s, Master’s, and Doctoral degrees from the College of Electronic and Information Engineering, Sichuan University, China, in 2009, 2012, and 2015, respectively. She then joined the College of Computer Science of Sichuan University as an assistant professor. She was a visiting scholar with the Department of Radiology, University of North Carolina at Chapel Hill, from 2014 to 2015. Her research interests focus on inverse problems in imaging science, including image denoising, segmentation, inpainting, and medical image reconstruction.


Guangkai Ma received the B.S. degree from Harbin University of Science and Technology, China, in 2010, and the M.E. degree from Harbin Institute of Technology, China, in 2012. He was a joint Ph.D. student in the Department of Radiology and BRIC, University of North Carolina at Chapel Hill, NC, USA, from 2013 to 2015. He is currently pursuing his Ph.D. degree in the Inertial Technology Research Center, Harbin Institute of Technology, China. His current research interests include gait recognition, pattern recognition, and medical image segmentation.


Feng Shi is currently an assistant professor in the Department of Radiology at the University of North Carolina at Chapel Hill. He studied Computer Science at the Institute of Automation, Chinese Academy of Sciences, and received his Ph.D. degree in 2008. His research interests include machine learning, neuro and cardiac imaging, multimodal image analysis, and computer-aided early diagnosis.


David S. Lalush is with the North Carolina State University and the University of North Carolina at Chapel Hill Joint Department of Biomedical Engineering. His current research interests include simultaneous PET-MRI image processing, analysis, and clinical applications, as well as X-ray imaging, tomographic reconstruction, and general image and signal processing.


Weili Lin is the director of the Biomedical Research Imaging Center (BRIC), Dixie Lee Boney Soo Distinguished Professor of Neurological Medicine, and the vice chair of the Department of Radiology at the University of North Carolina at Chapel Hill. His research interests include cerebral ischemia, human brain development, PET, and MR.


Dinggang Shen is a Professor of Radiology, Biomedical Research Imaging Center (BRIC), Computer Science, and Biomedical Engineering at the University of North Carolina at Chapel Hill (UNC-CH). He currently directs the Center for Image Analysis and Informatics, the Image Display, Enhancement, and Analysis (IDEA) Lab in the Department of Radiology, and the medical image analysis core in the BRIC. He was a tenure-track assistant professor at the University of Pennsylvania (UPenn) and a faculty member at Johns Hopkins University. Dr. Shen’s research interests include medical image analysis, computer vision, and pattern recognition. He has published more than 700 papers in international journals and conference proceedings. He serves as an editorial board member for six international journals, and served on the Board of Directors of the Medical Image Computing and Computer Assisted Intervention (MICCAI) Society from 2012 to 2015.

Contributor Information

Le An, Department of Radiology and Biomedical Research Imaging Center, University of North Carolina at Chapel Hill, Chapel Hill 27599, USA.

Pei Zhang, Department of Radiology and Biomedical Research Imaging Center, University of North Carolina at Chapel Hill, Chapel Hill 27599, USA.

Ehsan Adeli, Department of Radiology and Biomedical Research Imaging Center, University of North Carolina at Chapel Hill, Chapel Hill 27599, USA.

Yan Wang, College of Computer Science, Sichuan University, Chengdu 610064, China.

Guangkai Ma, Department of Radiology and Biomedical Research Imaging Center, University of North Carolina at Chapel Hill, Chapel Hill 27599, USA.

Feng Shi, Department of Radiology and Biomedical Research Imaging Center, University of North Carolina at Chapel Hill, Chapel Hill 27599, USA.

David S. Lalush, Joint UNC-NCSU Department of Biomedical Engineering, North Carolina State University, Raleigh, NC 27695, USA.

Weili Lin, MRI Lab, Department of Radiology and BRIC, University of North Carolina at Chapel Hill, Chapel Hill 27599, USA.

Dinggang Shen, Department of Radiology and Biomedical Research Imaging Center, University of North Carolina at Chapel Hill, Chapel Hill 27599, USA. He is also with the Department of Brain and Cognitive Engineering, Korea University, Seoul 02841, Republic of Korea.

References

1. Chen W. Clinical applications of PET in brain tumors. Journal of Nuclear Medicine. 2007;48(9):1468–1481. doi: 10.2967/jnumed.106.037689.
2. Quigley H, Colloby SJ, O'Brien JT. PET imaging of brain amyloid in dementia: A review. International Journal of Geriatric Psychiatry. 2011;26(10):991–999. doi: 10.1002/gps.2640.
3. Alessio A, Vesselle H, Lewis D, Matesan M, Behnia F, Suhy J, de Boer B, Maniawski P, Minoshima S. Feasibility of low-dose FDG for whole-body TOF PET/CT oncologic workup. Journal of Nuclear Medicine (abstract). 2012;53:476.
4. Pichler BJ, Kolb A, Nägele T, Schlemmer H-P. PET/MRI: Paving the way for the next generation of clinical multimodality imaging applications. Journal of Nuclear Medicine. 2010;51(3):333–336. doi: 10.2967/jnumed.109.061853.
5. Vunckx K, Atre A, Baete K, Reilhac A, Deroose C, Van Laere K, Nuyts J. Evaluation of three MRI-based anatomical priors for quantitative PET brain imaging. IEEE Transactions on Medical Imaging. 2012 Mar;31(3):599–612. doi: 10.1109/TMI.2011.2173766.
6. Ehrhardt MJ, Thielemans K, Pizarro L, Atkinson D, Ourselin S, Hutton BF, Arridge SR. Joint reconstruction of PET-MRI by exploiting structural similarity. Inverse Problems. 2015;31(1):015001.
7. Tang J, Rahmim A. Anatomy assisted PET image reconstruction incorporating multi-resolution joint entropy. Physics in Medicine and Biology. 2015;60(1):31–48. doi: 10.1088/0031-9155/60/1/31.
8. Nguyen V-G, Lee S-J. Incorporating anatomical side information into PET reconstruction using nonlocal regularization. IEEE Transactions on Image Processing. 2013;22(10):3961–3973. doi: 10.1109/TIP.2013.2265881.
9. Chan C, Fulton R, Barnett R, Feng D, Meikle S. Postreconstruction nonlocal means filtering of whole-body PET with an anatomical prior. IEEE Transactions on Medical Imaging. 2014 Mar;33(3):636–650. doi: 10.1109/TMI.2013.2292881.
10. Chun SY, Fessler JA, Dewaraja YK. Post-reconstruction nonlocal means filtering methods using CT side information for quantitative SPECT. Physics in Medicine and Biology. 2013;58(17):6225–6240. doi: 10.1088/0031-9155/58/17/6225.
11. Yan J, Lim JC-S, Townsend DW. MRI-guided brain PET image filtering and partial volume correction. Physics in Medicine and Biology. 2015;60(3):961–976. doi: 10.1088/0031-9155/60/3/961.
12. Turkheimer FE, Boussion N, Anderson AN, Pavese N, Piccini P, Visvikis D. PET image denoising using a synergistic multiresolution analysis of structural (MRI/CT) and functional datasets. Journal of Nuclear Medicine. 2008;49(4):657–666. doi: 10.2967/jnumed.107.041871.
13. Bagci U, Mollura D. Denoising PET images using singular value thresholding and Stein's unbiased risk estimate. Medical Image Computing and Computer-Assisted Intervention (MICCAI). 2013:115–122. doi: 10.1007/978-3-642-40760-4_15.
14. Dutta J, Leahy RM, Li Q. Non-local means denoising of dynamic PET images. PLoS ONE. 2013;8(12):e81390. doi: 10.1371/journal.pone.0081390.
15. Pogam AL, Hanzouli H, Hatt M, Rest CCL, Visvikis D. Denoising of PET images by combining wavelets and curvelets for improved preservation of resolution and quantitation. Medical Image Analysis. 2013;17(8):877–891. doi: 10.1016/j.media.2013.05.005.
16. Xu Z, Bagci U, Gao M, Mollura DJ. Highly precise partial volume correction for PET images: An iterative approach via shape consistency. 12th IEEE International Symposium on Biomedical Imaging (ISBI); 2015 Apr. pp. 1196–1199.
17. Kang J, Gao Y, Wu Y, Ma G, Shi F, Lin W, Shen D. Prediction of standard-dose PET image by low-dose PET and MRI images. Machine Learning in Medical Imaging. 2014:280–288.
18. Kang J, Gao Y, Shi F, Lalush DS, Lin W, Shen D. Prediction of standard-dose brain PET image by using MRI and low-dose brain [18F]FDG PET images. Medical Physics. 2015;42(9):5301–5309. doi: 10.1118/1.4928400.
19. Fang R, Zhang S, Chen T, Sanelli P. Robust low-dose CT perfusion deconvolution via tensor total-variation regularization. IEEE Transactions on Medical Imaging. 2015 Jul;34(7):1533–1548. doi: 10.1109/TMI.2015.2405015.
20. Wang B, Li L. Recent development of dual-dictionary learning approach in medical image analysis and reconstruction. Computational and Mathematical Methods in Medicine. 2015;2015:1–9. doi: 10.1155/2015/152693.
21. Hotelling H. Relations between two sets of variates. Biometrika. 1936;28(3/4):321–377.
22. Lee G, Singanamalli A, Wang H, Feldman M, Master S, Shih N, Spangler E, Rebbeck T, Tomaszewski J, Madabhushi A. Supervised multi-view canonical correlation analysis (sMVCCA): Integrating histologic and proteomic features for predicting recurrent prostate cancer. IEEE Transactions on Medical Imaging. 2015 Jan;34(1):284–297. doi: 10.1109/TMI.2014.2355175.
23. Zhu X, Suk H-I, Shen D. Multi-modality canonical feature selection for Alzheimer's disease diagnosis. Medical Image Computing and Computer-Assisted Intervention (MICCAI). 2014:162–169. doi: 10.1007/978-3-319-10470-6_21.
24. Dhillon P, Avants B, Ungar L, Gee J. Partial sparse canonical correlation analysis (PSCCA) for population studies in medical imaging. 9th IEEE International Symposium on Biomedical Imaging (ISBI); 2012 May. pp. 1132–1135.
25. Heinrich M, Papież B, Schnabel J, Handels H. Multispectral image registration based on local canonical correlation analysis. Medical Image Computing and Computer-Assisted Intervention (MICCAI). 2014:202–209. doi: 10.1007/978-3-319-10404-1_26.
26. Correa N, Adali T, Li Y-O, Calhoun V. Canonical correlation analysis for data fusion and group inferences. IEEE Signal Processing Magazine. 2010;27(4):39–50. doi: 10.1109/MSP.2010.936725.
27. Krol A, Naveed M, McGrath M, Lisi M, Lavalley C, Feiglin D. Very low-dose adult whole-body tumor imaging with F-18 FDG PET/CT. Medical Imaging 2015: Biomedical Applications in Molecular, Structural, and Functional Imaging. 2015:941711-1–941711-6.
28. Wang Z, Bovik A, Sheikh H, Simoncelli E. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing. 2004;13(4):600–612. doi: 10.1109/tip.2003.819861.
29. Coupé P, Manjón JV, Fonov V, Pruessner J, Robles M, Collins DL. Patch-based segmentation using expert priors: Application to hippocampus and ventricle segmentation. NeuroImage. 2011;54(2):940–954. doi: 10.1016/j.neuroimage.2010.09.018.
30. Wang J, Yang J, Yu K, Lv F, Huang T, Gong Y. Locality-constrained linear coding for image classification. IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2010. pp. 3360–3367.
31. Yu K, Zhang T, Gong Y. Nonlinear learning using local coordinate coding. Advances in Neural Information Processing Systems. 2009:2223–2231.
32. Chen Y, Mairal J, Harchaoui Z. Fast and robust archetypal analysis for representation learning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2014. pp. 1478–1485.
33. Aubert-Broche B, Griffin M, Pike G, Evans A, Collins D. Twenty new digital brain phantoms for creation of validation image data bases. IEEE Transactions on Medical Imaging. 2006 Nov;25(11):1410–1416. doi: 10.1109/TMI.2006.883453.
34. Aubert-Broche B, Evans AC, Collins L. A new improved version of the realistic digital brain phantom. NeuroImage. 2006;32(1):138–145. doi: 10.1016/j.neuroimage.2006.03.052.
35. Lartizien C, Kinahan PE, Comtat C. A lesion detection observer study comparing 2-dimensional versus fully 3-dimensional whole-body PET imaging protocols. Journal of Nuclear Medicine. 2004;45(4):714–723.
36. Kinahan PE, Fletcher JW. Positron emission tomography-computed tomography standardized uptake values in clinical practice and assessing response to therapy. Seminars in Ultrasound, CT and MRI. 2010;31(6):496–505. doi: 10.1053/j.sult.2010.10.001.
37. Vanderhoek M, Perlman SB, Jeraj R. Impact of different standardized uptake value measures on PET-based quantification of treatment response. Journal of Nuclear Medicine. 2013;54(8):1188–1194. doi: 10.2967/jnumed.112.113332.
38. Dabov K, Foi A, Katkovnik V, Egiazarian K. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing. 2007 Aug;16(8):2080–2095. doi: 10.1109/tip.2007.901238.
39. Peltonen S, Tuna U, Ruotsalainen U. Low count PET sinogram denoising. IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC); 2012 Oct. pp. 3964–3967.
40. Coupé P, Yger P, Prima S, Hellier P, Kervrann C, Barillot C. An optimized blockwise nonlocal means denoising filter for 3-D magnetic resonance images. IEEE Transactions on Medical Imaging. 2008 Apr;27(4):425–441. doi: 10.1109/TMI.2007.906087.
41. Waxman AD, Herholz K, Lewis DH, Herscovitch P, Minoshima S, Ichise M, Drzezga AE, Devous MD, Mountz JM. Society of Nuclear Medicine procedure guideline for FDG PET brain imaging, version 1.0. Society of Nuclear Medicine. 2009:1–12.
42. Hudson H, Larkin R. Accelerated image reconstruction using ordered subsets of projection data. IEEE Transactions on Medical Imaging. 1994 Dec;13(4):601–609. doi: 10.1109/42.363108.
43. Fischer B, Modersitzki J. FLIRT: A flexible image registration toolbox. Biomedical Image Registration. 2003;2717:261–270.
44. Shi F, Wang L, Dai Y, Gilmore JH, Lin W, Shen D. LABEL: Pediatric brain extraction using learning-based meta-algorithm. NeuroImage. 2012;62(3):1975–1986. doi: 10.1016/j.neuroimage.2012.05.042.
45. Tzourio-Mazoyer N, Landeau B, Papathanassiou D, Crivello F, Etard O, Delcroix N, Mazoyer B, Joliot M. Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage. 2002;15(1):273–289. doi: 10.1006/nimg.2001.0978.
