Medical Physics. 2011 May 6;38(5):2629–2638. doi: 10.1118/1.3578604

Repeatability of SUV measurements in serial PET

J Schwartz 1,a), J L Humm 1, M Gonen 2, H Kalaigian 3, H Schoder 4, S M Larson 4, S A Nehmeh 5
PMCID: PMC7986573  PMID: 21776800

Abstract

Purpose

The standardized uptake value (SUV) is a quantitative measure of FDG tumor uptake frequently used as a tool to monitor therapeutic response. This study aims to (i) assess the reproducibility and uncertainty of SUVmax and SUVmean, due to purely statistical, i.e., nonbiological, effects and (ii) establish the minimum uncertainty below which changes in SUV cannot be expected to be an indicator of physiological changes.

Methods

Three sets of measurements were made using a GE Discovery STE PET/CT scanner in 3D mode: (1) A uniform 68Ge 20 cm diameter cylindrical phantom was imaged. Thirty serial frames were acquired for durations of 3, 6, 10, 15, and 30 min. (2) An Esser flangeless phantom (Data Spectrum, ∼6.1 L) with fillable thin-walled cylindrical inserts (diameters: 8, 12, 16, and 25 mm; height: ∼3.8 mm) was scanned for five consecutive 3 min runs. The cylinders were filled with 18FDG at a concentration of 37 kBq/cc, with a target-to-background ratio (T/BKG) of 3/1. (3) Eight cancer patients with healthy livers were scanned ∼1.5 h post injection. Three sequential 3 min scans were performed for one bed position covering the liver, with the patient and bed remaining at the same position for the entire length of the scan. Volumes of interest were drawn on all images using the corresponding CT and then transferred to the PET images. For each study (1–3), the average percent change in SUVmean and SUVmax was determined for each run pair. Moreover, the repeatability coefficient was calculated for both the SUVmean and SUVmax for each pair of runs. Finally, the overall ROI repeatability coefficient was determined for each pair of runs.

Results

For the 68Ge phantom, the average percent changes in SUVmax and SUVmean decrease with increasing acquisition time from 4.7 ± 3.1% to 1.1 ± 0.6% and from 0.14 ± 0.09% to 0.04 ± 0.03%, respectively. Similarly, the coefficients of repeatability decrease between the 3 and 30 min acquisitions, from 10.9 ± 3.9% to 2.6 ± 0.9% for the SUVmax and from 0.3 ± 0.1% to 0.10 ± 0.04% for the SUVmean. The overall ROI repeatability decreased from 18.9 ± 0.2% to 6.0 ± 0.1% between the 3 and 30 min acquisitions. For the 18FDG phantom, the average percent changes in SUVmax and SUVmean decrease from 3.6 ± 2.0% to 1.5 ± 0.8% and from 1.5 ± 1.3% to 0.26 ± 0.15%, respectively, going from the 8 mm target through the 25 mm targets to a region in the background (BKG). The coefficients of repeatability for SUVmax and SUVmean also decrease, from 7.1 ± 2.5% to 2.4 ± 0.9% and from 4.2 ± 1.5% to 0.6 ± 0.2%, respectively, from the 8 mm target to the BKG region. The overall ROI repeatability ranged from 12.0 ± 4.1% for the 8 mm target to 13.4 ± 0.5% for the BKG region. Finally, for the measurements in healthy livers, the average percent changes in SUVmax and SUVmean were in the ranges 0.5 ± 0.2% – 6.2 ± 3.9% and 0.4 ± 0.1% – 1.6 ± 1.0%, respectively. The coefficients of repeatability for SUVmax and SUVmean are in the ranges 0.6 ± 0.7% – 9.5 ± 12% and 0.6 ± 0.7% – 2.9 ± 3.6%, respectively. The overall target repeatability varied between 27.9 ± 0.5% and 41.1 ± 1.0%.

Conclusions

The statistical fluctuations of the SUVmean are half as large as those of the SUVmax in the absence of biological or physiological effects. In addition, for clinically applicable scan durations (i.e., ∼3 min) and FDG concentrations, the SUVmax and SUVmean have similar amounts of statistical fluctuation for small regions. However, the statistical fluctuations of the SUVmean rapidly decrease with respect to those of the SUVmax as the statistical power of the data grows, either due to longer scanning times or as the target regions encompass a larger volume.

I. INTRODUCTION

Positron emission tomography (PET) has recently been shown to be a useful tool to monitor tumor response to therapy.1 Previous studies have suggested that changes in 18FDG uptake can predict early response to therapy in patients with lymphoma,2 breast,3,4 ovarian,5 gastric,6 and nonsmall cell lung7 cancers. The standardized uptake value (SUV) is an established index for measuring this uptake in a region of interest (ROI) and is defined as the concentration in the ROI (megabecquerel/milliliter) divided by the ratio of the injected dose (megabecquerel) to the patient’s body weight (gram). The most commonly quoted variants of this measure are the SUVmax, which uses the pixel with the largest concentration in the target, and the SUVmean, which averages a given set of pixel values within the ROI.
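The SUV definition above can be sketched in a few lines. This is only an illustration: the function names, the four-voxel ROI, and the 70 kg body weight are hypothetical, and decay correction of the injected dose is ignored.

```python
def suv(concentration_mbq_per_ml, injected_dose_mbq, body_weight_g):
    """SUV = ROI concentration / (injected dose / body weight)."""
    return concentration_mbq_per_ml / (injected_dose_mbq / body_weight_g)


def suv_max_and_mean(voxel_concentrations, injected_dose_mbq, body_weight_g):
    """Return (SUVmax, SUVmean) for the voxel concentrations of one ROI."""
    suvs = [suv(c, injected_dose_mbq, body_weight_g) for c in voxel_concentrations]
    return max(suvs), sum(suvs) / len(suvs)


# Hypothetical ROI of four voxels (MBq/mL), 444 MBq injected, 70 kg patient.
roi = [0.010, 0.012, 0.014, 0.012]
suv_max, suv_mean = suv_max_and_mean(roi, injected_dose_mbq=444.0, body_weight_g=70000.0)
print(f"SUVmax = {suv_max:.2f}, SUVmean = {suv_mean:.2f}")
```

SUVmax depends on a single voxel, while SUVmean averages over all voxels in the ROI, which is why the two respond differently to statistical noise.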

The change in the SUV over time is frequently used to assess tumor response to therapy.8 A reduction in the number of viable cells in the tumor or in the metabolism of damaged cells can be associated with a decrease in FDG uptake.9–16 Therefore, a change in FDG uptake after therapy, as measured by SUV, may be considered to be a measure of tumor response.6–8,12,17,18

Many factors can affect the values of the SUVmax and SUVmean measurements in serial PET studies. These include fluctuations in plasma glucose and/or patient weight, variations in the size, shape, and repositioning of ROIs (in the case of SUVmean),19,20 errors in image coregistration (in the case of SUVmean), variations in the dose and uptake time,21 differences in the patient’s positioning within the gantry, which may affect both resolution and sensitivity, and mismatch between PET and CT due to motion, which may affect the accuracy of attenuation correction.22 In addition, the acquisition mode (2D vs 3D) as well as the reconstruction algorithm and parameters used will also impact the values and errors of SUV measurements.19,20,23

In order to develop unbiased empirical limits for asserting response to therapy, it is necessary to measure both the reproducibility (variation of outcomes of an experiment carried out in conditions varying within a typical range) and repeatability (variation of outcomes of an experiment carried out in the same conditions) of SUVs. Several groups have addressed the question of reproducibility in clinical studies, concentrating on the SUVmax and SUVmean variability for double-baseline 18FDG scans.17,18,24,25 These studies have found differences of up to 13% with a standard deviation of 10%.

In addition to the issues discussed above, uncertainty in measured SUV values can also arise from statistical fluctuations, which decrease with increasing acquisition time and/or injected dose. The measurements of reproducibility described in the various references above attempt to address the question of how to make statistically significant comparisons between serial measurements (for patient imaging) of the SUVmax or SUVmean. Nevertheless, because their acquisition methods involve repositioning the patient, reinjecting a dose, and waiting one to several days between different scans, their results reflect a combination of effects and parameters. Thus, to explore the effect of statistical fluctuations on the various SUV measurements, it is essential to explore the amount of variability in SUVmax and SUVmean between any two of multiple scans for which all adjustable parameters remain constant, i.e., changes are merely due to statistical processes. That is, it is necessary to measure the repeatability.

Several phantom experiments have already been performed to assess variability in SUV parameters across institutions in an effort to standardize and cross-calibrate imaging acquisition and reconstruction protocols so that data can be pooled from various sites to make multiinstitutional studies and comparisons accurate.23,26–28 Fahey et al. concluded that SUV variability due solely to instrument and analysis factors was 10% – 25% across institutions.26 Doot et al. also studied instrument and analysis factors affecting repeatability within their institution for a long-lived 68Ge phantom, as well as reproducibility across institutions for this same phantom.23,27 This group concluded that acquisition, processing, and analysis all affected the noise and bias of the measurements. Doot et al. found that the standard deviations of the recovery coefficients obtained with the maximum ROI value were in the range 0.2% – 11%, depending on scan duration and object size, and that the respective coefficients of variation for the mean over the ROI were between 6.7% and 0.2%. While these studies are thorough and precise, their focus is on multiinstitutional collaborations and how to compare data from various sources. We focus here on approaches to these questions in the case of intrainstitutional studies in which patients will be scanned two or more times to observe the progression of disease and/or the effect of therapies on malignancy status.

Thus, in this study, we investigate the effect of statistical noise on the repeatability of the SUVmax and SUVmean both in phantoms, using both 68Ge and 18FDG, and for patients undergoing routine clinical PET scans and receiving 18FDG. We then determine the 95% confidence intervals for these measurements. We also report on the position variability of SUVmax and that of the position of the activity concentration weighted centroid of a tumor volume, due to statistical noise. We sought to investigate the effect that scan time, target size, and nonpathological biological effects would have on these parameters.

We hypothesize that using a large uniform source not subject to decay within the scan time would allow us to study the statistical variations associated only with machine-related variables. A large (20 cm diameter × 20 cm height) cylindrical 68Ge phantom (half-life of 271 days) can be treated as such a constant source over the imaging period because of its long half-life and, therefore, fluctuations due to the Poisson nature of radioactive decay would only minimally contribute to the overall statistical variations in the measurements. In addition, the phantom’s size and homogeneous activity distribution imply that partial volume effects can be ignored. However, as we did not have the ability to make 68Ge targets of various sizes, we used an Esser phantom filled with FDG to investigate target size effects. In addition, FDG is the most commonly used PET tracer, so these measurements more accurately reflect the clinical situation. Lastly, we imaged healthy livers in patients undergoing clinical FDG scans, so that we might additionally study the effect of biological “noise” in the absence of pathology. Again, we chose to image the liver because it is a large organ, so we do not have to be concerned with partial volume effects.
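The claim that the 68Ge source is effectively constant can be checked with a short calculation. The sketch below assumes a total imaging period of 15 h for the longest 68Ge series (thirty 30 min frames) and 15 min for the five 3 min 18FDG runs. The 18F half-life of 109.77 min is a literature value that does not appear in the text.

```python
import math


def fraction_decayed(elapsed, half_life):
    """Fraction of the initial activity lost after `elapsed` (same units as half_life)."""
    return 1.0 - math.exp(-math.log(2.0) * elapsed / half_life)


GE68_HALF_LIFE_H = 271.0 * 24.0  # 271 days, as quoted in the text
F18_HALF_LIFE_MIN = 109.77       # assumed literature value, not from the text

# Thirty back-to-back 30 min frames of the 68Ge phantom: 15 h in total.
ge_loss = fraction_decayed(30 * 0.5, GE68_HALF_LIFE_H)
# Five back-to-back 3 min runs of the 18FDG phantom: 15 min in total.
f18_loss = fraction_decayed(5 * 3.0, F18_HALF_LIFE_MIN)

print(f"68Ge activity lost over 15 h:  {100 * ge_loss:.2f}%")
print(f"18F activity lost over 15 min: {100 * f18_loss:.1f}%")
```

Under these assumptions the 68Ge phantom loses well under 1% of its activity across the whole series, whereas 18F loses roughly 9% in 15 min, which is why the FDG data need decay correction.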

Finally, using these data, we propose minimum SUVmax and SUVmean thresholds above which changes may be associated with biological processes. While monitoring response to therapy over time was not part of this project, our results allow us to predict the lower limit in the change of SUV below which no change can be correlated with response.

II. METHODS

II.A. Scanner

All data were acquired on the GE Discovery STE PET/CT (GE Healthcare Technologies, Waukesha, WI). The Discovery STE (DSTE) PET/CT scanner is a 24-ring bismuth germanate (BGO) system with axial and transaxial fields of view of 15.7 and 70 cm, respectively. The detector inner diameter is 88 cm. There are 560 BGO crystals in each ring. Individual BGO crystals are 4.7 × 6.3 × 30 mm (tangential × axial × radial). The DSTE is equipped with retractable septa to allow 3D acquisitions. The septa assembly consists of 23 tungsten septa (5.4 cm deep and 0.8 mm thick), one between each pair of adjacent detector rings. The 3D and 2D acquisition modes are operated with energy windows of 375 – 650 and 425 – 650 keV, respectively. A coincidence time window of 9.75 ns is used for 3D mode. An axial acceptance of 23 crystal rings is used for 3D mode data binning. The spatial resolution (FWHM) of the scanner is approximately 5.4 mm in 3D mode following NEMA NU2 standards.29 System sensitivity is 8.4 counts per second per kBq in 3D.

PET images were reconstructed, using the 3D Vue Point iterative algorithm (20 subsets, 2 iterations) provided by the manufacturer, into 47 slices of 128 × 128 pixels each. The voxel size was 5.47 × 5.47 × 3.27 mm3. PET images were corrected for decay, attenuation (using the CT images), randoms, and scatter.

II.B. Phantom studies

II.B.1. 68Ge phantom

PET/CT images of a cylindrical 68Ge phantom were acquired in order to understand the effect of statistical variations of counts on the repeatability of PET measurements in the absence of all biological effects and without appreciable decay of the activity source. The phantom was 20 cm in diameter by 30 cm in height, and had an activity concentration of 37 kBq/cc. The cylinder was scanned 30 times at each acquisition time of 3, 6, 10, 15, and 30 min in 3D mode without moving it from its initial position.

II.B.2. 18FDG phantom

A study was performed using the flangeless Esser phantom (Data Spectrum) (∼6.1 L) with fillable thin-walled cylindrical inserts (diameters: 8, 12, 16, and 25 mm; height: ∼3.8 mm). Photographs of (a) the phantom and (b) the cylindrical inserts, and (c) a PET image, are shown in Figs. 1(a)–1(c). The cylinders were filled with 18FDG at a concentration of 37 kBq/cc at the beginning of the scan, with a target-to-background ratio (T/BKG) of 3/1 (comparable to clinically realistic lesion T/BKG in the liver). The phantom was scanned for five consecutive 3 min runs (equivalent to the standard clinical acquisition time per FOV at our institution) in 3D mode and decay corrected.

FIG. 1.


(a) Esser Flangeless ACR phantom, (b) Esser phantom lid with cylindrical inserts, and (c) 18FDG PET image of the Esser phantom.

II.C. Patient studies

II.C.1. 18FDG liver scans

Eight oncology patients (two female and eight male), with no suspected liver disease scheduled for a standard PET/CT scan, were included. Each patient was injected intravenously with 444 MBq (12 mCi) FDG and scanned 1.0 h post injection. Following this standard clinical scan (∼1.5 h post injection), three sequential 3 min scans were performed for one bed position covering the liver, with the patient and bed remaining in the same position for the entire length of the scan. We assume that during these 9 min of scan-time no significant biological changes occurred in the liver. Thus, contributions to differences in the measurements not due to statistical fluctuations were minimized. Finally, statistical errors due to decay or reinjection are also avoided in this method. A representative liver image is shown in Fig. 2. The retrospective analysis of these three 3 min scans was approved, under a waiver of consent, by the Institutional Review Board of Memorial Sloan-Kettering Cancer Center (WA0230-10).

FIG. 2.


Representative liver image of one patient imaged over three consecutive 3 min scans (a–c).

III. ANALYSIS

III.A. 68Ge phantom

Volumes of interest (VOI) (∼17 cm radius and covering the 30 central slices out of a 47 slices/FOV total) were drawn based on the CT images, encompassing the largest volume possible but well within the boundaries of the image so as to avoid edge-effects in the data. The VOIs were then transferred to each one of the serial PET image sets.

III.B. 18FDG phantom

VOI’s were drawn on the CT image over the interior part of each cylindrical insert so as to match the CT definition of the volume (13 slices; 228, 192, 24, 108, 108, and 204 pixels for the 8, 12, 16, and three 25 mm inserts, respectively), and over the background portion of the image (13 slices, 2508 pixels). The VOI’s were then applied to all PET images.

III.C. Patients

For each patient, VOI’s were drawn on the CT image over the interior of the healthy liver (number of slices was in the range 29–40 and the number of pixels was in the range 3800–15 000). Each patient’s VOI’s were then applied to all PET images for that patient.

III.D. Statistical analysis

The mean and maximum SUV’s within each of the VOI’s were determined for each of the sequential runs for (1) each acquisition duration of the 68Ge phantom, (2) every insert of the 18FDG phantom, including the background, and (3) each patient’s healthy liver image.

Pixel-by-pixel Pearson correlation coefficients (CC) were calculated for every pair of runs for each of the following cases:

  1. every acquisition-duration using the 68Ge phantom,
  2. every insert of the 18FDG phantom, including the background (BKG) region, and
  3. every patient’s healthy liver image.

The mean and standard deviation across repetitions were calculated as a function of case (1) scan duration, case (2) object size, and case (3) individual patient liver.

Repeatability is measured by acquiring replicates of the same subject under identical conditions. From its analysis, it is possible to deduce the minimum amount of agreement that is possible from scan to scan. As Bland and Altman have previously pointed out, correlation analysis is often incorrectly used to determine the “likeness” of two measurements.30,31 This is because any two measurements of the same parameter will have a good correlation only if the sample that is measured repeatedly has a wide spread of values within it. Therefore, a high correlation does not automatically imply that there is good agreement between measurements.
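The Bland–Altman argument can be made concrete with a toy example. The sketch below uses hypothetical pixel values, not data from this study. It compares a nearly homogeneous pair of runs, which agree closely but have a narrow spread, with a heterogeneous pair that agrees less well but has a wide spread.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_abs_percent_diff(x, y):
    """Largest pixel-wise percent difference, relative to the pair mean."""
    return max(abs(200.0 * (a - b) / (a + b)) for a, b in zip(x, y))


# Hypothetical homogeneous region: values agree to within about 1%.
narrow_run1, narrow_run2 = [100, 101, 100, 101], [101, 100, 100, 101]
# Hypothetical heterogeneous region: values differ by up to about 10%.
wide_run1, wide_run2 = [10, 20, 30, 40], [11, 19, 31, 39]

for name, a, b in [("narrow", narrow_run1, narrow_run2), ("wide", wide_run1, wide_run2)]:
    print(f"{name}: r = {pearson(a, b):.3f}, max |%diff| = {max_abs_percent_diff(a, b):.2f}%")
```

In this toy case the homogeneous pair has a correlation of zero despite sub-percent agreement, while the heterogeneous pair has a correlation above 0.99 despite differences of nearly 10%.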

The agreement between independent identical measurements can be visualized using a Bland–Altman plot and analysis of its parameters used to determine repeatability. In this study, the Bland–Altman formalism was used to compare replicate runs on a pixel wise basis, as well as to compare indices of interest for a given ROI for every pair of replicate runs. Each pixel in a given ROI is represented in the Bland–Altman formalism with two coordinates: (1) the average intensity of the pixel (in becquerel/cubic centimeter) across the two runs and (2) the percent change of the intensity between the two runs

$$S(x,y)=\left(\frac{p_{ij}+p_{ik}}{2},\ \frac{2\,(p_{ij}-p_{ik})\times 100}{p_{ij}+p_{ik}}\right), \qquad (1)$$

where $p_{ij}$ and $p_{ik}$ are the pixel intensities of the ith pixel index in the jth and kth runs, respectively.
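As a minimal sketch of Eq. (1), the two Bland–Altman coordinates of each pixel can be computed as follows. The function names and the example intensities are hypothetical.

```python
def bland_altman_point(p_j, p_k):
    """Return (mean intensity, percent difference) for one pixel in runs j and k."""
    mean_intensity = (p_j + p_k) / 2.0
    percent_difference = 2.0 * (p_j - p_k) * 100.0 / (p_j + p_k)
    return mean_intensity, percent_difference


def bland_altman_points(run_j, run_k):
    """Apply Eq. (1) pixel by pixel to two runs covering the same ROI."""
    return [bland_altman_point(a, b) for a, b in zip(run_j, run_k)]


# Hypothetical pixel intensities (Bq/cc) for the same three pixels in two runs.
points = bland_altman_points([110.0, 95.0, 100.0], [90.0, 105.0, 100.0])
for x, y in points:
    print(f"mean = {x:.1f} Bq/cc, percent difference = {y:+.1f}%")
```

The percent difference is taken relative to the mean of the two runs, so swapping the runs only flips its sign.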

The dispersion of the mean differences is also presented on a histogram and analyzed for normality, either by fitting to a Gaussian distribution (if the data set is sufficiently large to do so) or by applying a statistical test such as the Kolmogorov–Smirnov or Lilliefors test (the latter was used in this study). If the data do not deviate from normality, as determined by these tests, the 95% confidence interval can be taken as ±1.96 times the SD of the percent difference. Bland and Altman defined the coefficient of repeatability (CR) of any two runs as

$$\mathrm{CR}=1.96\times\sqrt{\frac{\sum_i \mathrm{var}_i^{2}}{n-1}}, \qquad (2)$$

where $\mathrm{var}_i$ is the variance of the y-coordinate of S(x,y) [Eq. (1)], summed over every pixel i in the ROI. This represents the value below which the absolute value of the difference between two repeated test results may be expected to lie with a probability of 95%.
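A minimal sketch of the CR calculation follows. It assumes that Eq. (2) amounts to 1.96 times the sample standard deviation (n − 1 denominator) of the pixel-wise percent differences of Eq. (1), as the preceding paragraph describes. The function names and pixel values are hypothetical.

```python
import statistics


def percent_differences(run_j, run_k):
    """Pixel-wise percent differences, i.e., the y-coordinate of Eq. (1)."""
    return [2.0 * (a - b) * 100.0 / (a + b) for a, b in zip(run_j, run_k)]


def coefficient_of_repeatability(run_j, run_k):
    """CR as 1.96 x sample SD of the percent differences (an assumption, see above)."""
    return 1.96 * statistics.stdev(percent_differences(run_j, run_k))


# Hypothetical pixel intensities for the same four pixels in two replicate runs.
run_j = [102.0, 98.0, 101.0, 99.0]
run_k = [98.0, 102.0, 99.0, 101.0]
cr = coefficient_of_repeatability(run_j, run_k)
print(f"CR = {cr:.2f}%")
```

For these values the percent differences are +4, −4, +2, and −2, so about 95% of repeated pixel differences would be expected to fall within roughly ±7%.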

Because the CRs are only estimates of the values which apply to the whole population, they themselves have sampling errors (SE), shown by Bland and Altman30,31 to be given by the standard deviation of the sample (s) and the sample size (n) as:

$$\mathrm{SE}=\sqrt{\frac{3s^{2}}{n}}. \qquad (3)$$

The 95% confidence interval (CI) for the CRs is obtained by finding the appropriate value of the t-distribution with n−1 degrees of freedom at a significance level of 5%. The confidence interval is given by

CI=CR±t×SE. (4)
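Eqs. (3) and (4) can be sketched as below. To keep the example free of third-party dependencies, the critical t value is passed in by the caller. The value 2.045 (two-sided 5% level, 29 degrees of freedom) is a standard table value, and the SD of 2.0% with n = 30 is hypothetical.

```python
import math


def cr_standard_error(s, n):
    """Sampling error of the CR, Eq. (3): sqrt(3 s^2 / n)."""
    return math.sqrt(3.0 * s ** 2 / n)


def cr_confidence_interval(cr, s, n, t_value):
    """95% confidence interval of the CR, Eq. (4): CR +/- t x SE."""
    half_width = t_value * cr_standard_error(s, n)
    return cr - half_width, cr + half_width


# Hypothetical case: SD of the percent differences s = 2.0%, n = 30 samples.
s, n = 2.0, 30
cr = 1.96 * s
T_975_DF29 = 2.045  # tabulated t value for n - 1 = 29 degrees of freedom
low, high = cr_confidence_interval(cr, s, n, T_975_DF29)
print(f"CR = {cr:.2f}%, 95% CI = ({low:.2f}%, {high:.2f}%)")
```

Because the SE scales as 1/sqrt(n) and t grows as n shrinks, the interval widens quickly for small samples, such as comparisons based on only three replicate scans.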

Every pair of replicate runs was analyzed as explained above. For these studies, each run represents a replicate. We define “overall ROI repeatability” as the CR calculated by comparing every pair of runs pixel wise using Eq. (2). The average overall run repeatability is the value obtained by averaging all these CR’s. The results for every parameter were averaged across replicate runs, and the standard deviation (SD) over these runs was calculated. Pixel-wise correlations were also calculated for every pair of runs. Finally, Bland–Altman analyses were also performed for the difference between replicates of the SUVmax and SUVmean for the 68Ge cylindrical phantom, 18FDG-filled Esser phantom, and patient data.
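The pairing scheme can be sketched as follows. The sketch assumes that the reported percent changes are absolute values relative to the pair mean. The SUVmax values are hypothetical.

```python
import statistics
from itertools import combinations


def n_pairs(n_runs):
    """Number of distinct run pairs among n_runs replicates."""
    return len(list(combinations(range(n_runs), 2)))


def pairwise_percent_changes(values):
    """Absolute percent change (relative to the pair mean) for every pair of runs."""
    return [abs(200.0 * (a - b) / (a + b)) for a, b in combinations(values, 2)]


# 30 phantom frames, 5 Esser phantom runs, 3 liver scans per patient.
print([n_pairs(n) for n in (30, 5, 3)])

# Hypothetical SUVmax values from three replicate scans of one ROI.
suv_max_runs = [2.0, 2.1, 1.9]
changes = pairwise_percent_changes(suv_max_runs)
print(f"mean = {statistics.mean(changes):.1f}%, SD = {statistics.stdev(changes):.1f}%")
```

With only three pairs per patient the standard deviation across pairs is poorly constrained, whereas the 68Ge series provides 435 pairs per acquisition-duration.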

IV. RESULTS

Results for percent changes in SUVmax and SUVmean (ΔSUVmax and ΔSUVmean, respectively) are presented as the μ ± SD, where μ and SD are the average and standard deviation values across the replicates (repeated scans). Lilliefors normality testing revealed no significant deviation from normality (P ≥ 0.05) for any of the parameters investigated.

IV.A. 68Ge phantom

Pixel-wise scatter plots were constructed for every pair of runs at each of the five acquisition times and the correlation coefficient calculated. Representative scatter plots are shown in Figs. 3(a) and 3(b) for the 3 and 30 min acquisitions, respectively. The correlation coefficients were then averaged over all pairs of runs, giving 0.22 ± 0.01, 0.37 ± 0.02, 0.51 ± 0.01, 0.62 ± 0.01, and 0.77 ± 0.01 for the 3, 6, 10, 15, and 30 min acquisition-durations, respectively. Bland–Altman analysis was carried out for the percent change in SUVmax and SUVmean between every pair of runs at each acquisition-duration, and representative Bland–Altman plots are shown in Figs. 3(c) and 3(d) for the 3 and 30 min acquisitions, respectively. In the 68Ge phantom, the value for SUVmax changes by 4.7 ± 3.1% of the average observed SUVmax across the 3 min scans. The change of SUVmax diminishes with scan duration and appears to approach a constant value of 1.1% over the 30 runs when acquisition times are 15 min or longer, as shown in Fig. 4(a). The coefficient of repeatability [CR, Eq. (2)] of the SUVmax also decreases with run duration across the 30 runs, ranging from an average of 10.9 ± 3.9% for the 3 min scans to 2.6 ± 0.9% for the 30 min ones. The CR for SUVmean is lower (i.e., better repeatability), ranging between 0.3 ± 0.1% and 0.10 ± 0.04% for the 3 and 30 min acquisition times, respectively. Finally, the average overall run CR decreases from 18.9 ± 0.2% to 6.0 ± 0.1% between the 3 and 30 min acquisition times. The average repeatability coefficients (for SUVmax, SUVmean, and overall run repeatability) are plotted as a function of acquisition-duration in Fig. 4(b). All these results are summarized in Table I.

FIG. 3.


Representative pixel-wise scatter plots for the 68Ge phantom ROIs comparing Runi and Runj for the (a) 3 and (b) 30 min acquisitions. Bland–Altman plots for these same two runs, Runi and Runj, of the 68Ge ROIs for the (c) 3 and (d) 30 min acquisitions.

FIG. 4.


(a) The percent change of SUVmax (•) and of SUVmean (◊) averaged over every pair-combination of the 30 runs performed at each acquisition-duration versus scan duration for the 68Ge phantom. (b) Average percent coefficient of repeatability of SUVmax (•), SUVmean (◊) and the ROI as a whole (▿) averaged over every pair-combination of the 30 runs performed for each acquisition-duration versus scan duration for the 68Ge phantom.

TABLE I.

Results of the repeatability analysis for each of the five acquisition-durations of the 68Ge cylindrical phantom, which had an activity concentration of ∼37 kBq/cc. For each acquisition-duration, the result for each parameter is averaged over every possible pair of the 30 replicates.

| Average (by run duration, min) | 3 | 6 | 10 | 15 | 30 |
| --- | --- | --- | --- | --- | --- |
| %ΔSUVmax | 4.7 ± 3.1 | 2.1 ± 1.5 | 2.6 ± 1.1 | 1.1 ± 0.7 | 1.1 ± 0.6 |
| CR SUVmax (%) | 10.9 ± 3.9 | 4.9 ± 1.8 | 6.3 ± 2.3 | 2.6 ± 0.9 | 2.6 ± 0.9 |
| %ΔSUVmean | 0.14 ± 0.09 | 0.08 ± 0.05 | 0.09 ± 0.05 | 0.05 ± 0.03 | 0.04 ± 0.03 |
| CR SUVmean (%) | 0.3 ± 0.1 | 0.2 ± 0.1 | 0.2 ± 0.1 | 0.10 ± 0.04 | 0.10 ± 0.04 |
| Correlation | 0.22 ± 0.01 | 0.37 ± 0.02 | 0.51 ± 0.01 | 0.62 ± 0.01 | 0.77 ± 0.01 |
| Overall ROI CR (%) | 18.9 ± 0.2 | 13.1 ± 0.2 | 10.0 ± 0.1 | 8.0 ± 0.1 | 6.0 ± 0.1 |
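As a rough consistency check on the overall ROI CR values in Table I, one can ask whether they follow the 1/sqrt(T) dependence on acquisition time expected if pixel noise were dominated by Poisson counting statistics. This scaling is an assumption introduced here for illustration, and the function name is hypothetical. The sketch extrapolates from the 3 min value.

```python
import math

DURATIONS_MIN = [3, 6, 10, 15, 30]
OVERALL_CR_PERCENT = [18.9, 13.1, 10.0, 8.0, 6.0]  # overall ROI CR, Table I


def poisson_scaled_cr(cr_ref, t_ref, t):
    """Scale a reference CR assuming noise proportional to 1/sqrt(acquisition time)."""
    return cr_ref * math.sqrt(t_ref / t)


predicted = [
    poisson_scaled_cr(OVERALL_CR_PERCENT[0], DURATIONS_MIN[0], t) for t in DURATIONS_MIN
]
for t, measured, pred in zip(DURATIONS_MIN, OVERALL_CR_PERCENT, predicted):
    print(f"{t:>2} min: measured {measured:5.1f}%, 1/sqrt(T) prediction {pred:5.1f}%")
```

Under this assumption the predictions stay within about half a percentage point of the tabulated values, which is consistent with counting statistics dominating the pixel-wise repeatability of the uniform phantom.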

IV.B. 18FDG phantom

Pixel-wise scatter plots for every pair of runs for each lesion size were analyzed. The correlation coefficients determined from these plots were averaged over every pair of runs and the results were 0.88 ± 0.04, 0.94 ± 0.01, 0.97 ± 0.01, 0.98 ± 0.01, 0.98 ± 0.01, 0.98 ± 0.01, and 0.14 ± 0.02 for the 8, 12, 16, and three 25 mm inserts and the background (BKG), respectively. The analysis of the pairwise Bland–Altman plots [a representative example from the 68Ge data is shown in Figs. 3(c) and 3(d)] for each insert diameter yields average SUVmax changes across the five runs in the ROIs of 3.6 ± 2.0%, 2.0 ± 1.0%, 2.2 ± 2.0%, 2.2 ± 1.1%, 1.5 ± 1.3%, and 1.8 ± 1.1% for the 8, 12, 16, and three 25 mm inserts, respectively. The change of SUVmax is relatively constant with respect to insert diameter.

The percent changes for the SUVmean are much closer to zero than those for the SUVmax, ranging from 1.5 ± 1.3% to 0.83 ± 0.55% for the 8 and 25 mm inserts, respectively. The repeatability coefficient of the SUVmax decreases from an average of 7.1 ± 2.5% for the 8 mm diameter insert to 2.9 ± 1.0% for a 25 mm one, as shown in Fig. 5(a). The repeatability coefficients of the SUVmean are smaller (i.e., better repeatability), ranging between 4.2 ± 1.5% and 1.2 ± 0.4% for the 8 and 25 mm inserts, respectively. The average overall run repeatability coefficient has values between 12.0 ± 4.1% and 10.6 ± 1.3% for the 8 and 25 mm inserts, respectively. The various repeatability coefficients are shown as a function of insert diameter in Fig. 5(b). All these results are summarized in Table II.

FIG. 5.


(a) The percent change of SUVmax (•) and of SUVmean (◊) averaged over every pair-combination of all replicate runs versus insert diameter for the 18FDG phantom. (b) Average percent coefficient of repeatability of SUVmax (•), SUVmean (◊) and the ROI as a whole (▿) averaged over every pair-combination of all replicate runs versus insert diameter for the 18FDG phantom.

TABLE II.

Result of repeatability analysis for each of the six cylindrical inserts in the ACR phantom filled with 18FDG. For each of the inserts, the result for each parameter is averaged over every possible pair of the five back-to-back replicates of 3 min duration each. The activity concentrations were 37 and 12.3 kBq/cc in the cylinders and background (BKG), respectively at the beginning of the study.

| Average (by cylinder diameter, mm) | 8 | 12 | 16 | 25 | 25 | 25 | BKG |
| --- | --- | --- | --- | --- | --- | --- | --- |
| %ΔSUVmax | 3.6 ± 2.0 | 2.0 ± 1.0 | 2.2 ± 2.0 | 2.2 ± 1.1 | 1.5 ± 1.3 | 1.8 ± 1.1 | 1.5 ± 0.8 |
| CR SUVmax (%) | 7.1 ± 2.5 | 4.5 ± 1.6 | 3.1 ± 1.1 | 5.1 ± 1.8 | 2.9 ± 1.0 | 4.2 ± 1.5 | 2.4 ± 0.9 |
| %ΔSUVmean | 1.5 ± 1.3 | 0.65 ± 0.37 | 0.54 ± 0.28 | 0.89 ± 0.41 | 0.73 ± 0.48 | 0.83 ± 0.55 | 0.26 ± 0.15 |
| CR SUVmean (%) | 4.2 ± 1.5 | 1.5 ± 0.5 | 1.1 ± 0.4 | 1.7 ± 0.6 | 1.9 ± 0.7 | 1.2 ± 0.4 | 0.6 ± 0.2 |
| Correlation | 0.88 ± 0.04 | 0.94 ± 0.01 | 0.97 ± 0.01 | 0.98 ± 0.01 | 0.98 ± 0.01 | 0.98 ± 0.01 | 0.14 ± 0.02 |
| Overall ROI CR (%) | 12.0 ± 4.1 | 12.0 ± 2.0 | 11.0 ± 1.8 | 10.9 ± 1.2 | 10.7 ± 1.3 | 10.6 ± 1.3 | 13.4 ± 0.5 |
| Change in SUVmax position (pixels) | 0.6 ± 0.5 | 3.3 ± 2.2 | 2.6 ± 1.6 | 1.4 ± 0.6 | 2.1 ± 0.7 | 2.0 ± 1.1 | |
| Change in SUVmean position (pixels) | 0.10 ± 0.06 | 0.04 ± 0.02 | 0.03 ± 0.01 | 0.02 ± 0.01 | 0.03 ± 0.01 | 0.02 ± 0.01 | |

IV.C. 18FDG liver scans

The average changes of SUVmax in the liver ROI, averaged across the three identical runs, range from 0.5 ± 0.2% to 6.2 ± 3.9% for the eight patients. The percent changes for the SUVmean from the same ROIs range between 0.4 ± 0.1% and 1.6 ± 1.0% for the eight patients. These changes are graphed versus patient number in Fig. 6(a). The repeatability coefficient of the SUVmax ranges from an average of 0.6 ± 0.7% to 9.5 ± 12%. The repeatability coefficients of the SUVmean have values ranging between 0.6 ± 0.7% and 2.9 ± 3.6%. The average overall run repeatability coefficient ranged between 27.9 ± 0.5% and 41.1 ± 1.0%. These data are summarized in Fig. 6(b) and all results are shown in Table III.

FIG. 6.


(a) The percent change of SUVmax (•) and of SUVmean (◊) averaged over every pair-combination of all replicate runs for each patient’s liver ROI. (b) Average percent coefficient of repeatability of SUVmax (•), SUVmean (◊) and the ROI as a whole (▿) averaged over every pair-combination of the 3 runs performed for each FDG patient versus patient number for the liver.

TABLE III.

Results of the repeatability analysis for the normal liver in each of eight patients undergoing standard clinical 18FDG scans. The result for each parameter is averaged over every possible pair of the three replicates.

| Average (by patient number) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| %ΔSUVmax | 1.9 ± 1.1 | 3.4 ± 1.2 | 2.6 ± 1.3 | 2.2 ± 0.8 | 1.2 ± 0.8 | 1.8 ± 0.7 | 0.5 ± 0.2 | 6.2 ± 3.9 |
| CR SUVmax (%) | 2.7 ± 3.4 | 8.1 ± 10 | 5 ± 6.2 | 4.9 ± 6.1 | 3.7 ± 4.6 | 1.6 ± 2 | 0.6 ± 0.7 | 9.5 ± 12 |
| %ΔSUVmean | 0.4 ± 0.1 | 1 ± 0.6 | 0.6 ± 0.2 | 0.6 ± 0.2 | 1 ± 0.4 | 0.9 ± 0.5 | 0.6 ± 0.3 | 1.6 ± 1.0 |
| CR SUVmean (%) | 1.0 ± 1.2 | 1.5 ± 1.8 | 1.3 ± 1.7 | 0.6 ± 0.7 | 2.3 ± 2.8 | 1.7 ± 2.2 | 1.1 ± 1.4 | 2.9 ± 3.6 |
| Correlation | 0.11 ± 0.03 | 0.13 ± 0.37 | 0.06 ± 0.01 | 0.04 ± 0.13 | 0.02 ± 0.01 | 0.01 ± 0.01 | 0.02 ± 0.04 | 0.01 ± 0.01 |
| Overall ROI CR (%) | 27.9 ± 0.5 | 38.1 ± 0.5 | 36.3 ± 0.9 | 28.1 ± 0.7 | 27.3 ± 0.5 | 28.4 ± 0.6 | 29.9 ± 0.6 | 41.1 ± 1.0 |

V. DISCUSSION

18F-FDG PET is increasingly being used for its predictive value in assessing tumor response after radiation therapy and chemotherapy,8,32 as well as for assessment of response in drug development.8,33 Therefore, accurate quantitation of tumor FDG uptake is essential for reliable assessment of response. Criteria have been established for the assessment of response using FDG PET, including the EORTC34 and PERCIST35 standards, with the goal of creating a uniform and consistent approach to the quantification of response. They serve as a starting point for designing clinical trials of novel therapies and as a way to introduce structured quantitative reporting of these data. The ultimate aim, however, is to be able to determine what changes can unquestionably be attributed to response. The PERCIST criterion is more stringent and requires a 30% decline in SUV, while the EORTC requires 15% – 25%. Thus, within this context, the results presented in this paper are directly relevant and serve to answer a complementary question: what is the change that is not due to biological effects at all, but rather to purely statistical ones? In fact, the question answered in the work presented here is a prerequisite to any questions dealing with response criteria when using PET imaging to assess treatment response.
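The role of the CR as a lower bound can be illustrated with a simple check of whether an observed change exceeds what statistical noise alone could produce. This is a sketch under stated assumptions. The percent change follows the symmetric definition of Eq. (1) rather than the change from baseline used by response criteria. The SUV values and function names are hypothetical, and the 10.9% threshold is the 3 min SUVmax CR from the 68Ge phantom, which is not necessarily appropriate for a given lesion.

```python
def percent_change(baseline, follow_up):
    """Percent change relative to the mean of the two measurements, as in Eq. (1)."""
    return 200.0 * (follow_up - baseline) / (follow_up + baseline)


def exceeds_repeatability(baseline, follow_up, cr_percent):
    """True if the change is larger than statistical noise alone would explain."""
    return abs(percent_change(baseline, follow_up)) > cr_percent


CR_SUVMAX_3MIN = 10.9  # % for the 68Ge phantom at 3 min (Table I); illustrative only

# Hypothetical baseline and follow-up SUVmax values.
for baseline, follow_up in [(5.0, 4.7), (5.0, 3.5)]:
    change = percent_change(baseline, follow_up)
    flag = exceeds_repeatability(baseline, follow_up, CR_SUVMAX_3MIN)
    print(f"{baseline} -> {follow_up}: {change:+.1f}%, exceeds CR: {flag}")
```

A change that does not exceed the CR cannot be distinguished from noise. A change that does exceed it is only a candidate for a biological effect, since other sources of variability listed in the Introduction are not captured by this threshold.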

This study sought to determine how the SUV measurements vary when measured repeatedly under identical conditions. Once the repeatability is measured, changes can be more accurately attributed to either biological or statistical phenomena. Aside from operator error, instrumentation noise, and physiologically derived effects (e.g., cardiac or breathing motion), statistical uncertainties due to the inherent nature of radioactive decay, combined with their effect on image reconstruction, can lead to errors in the SUV measurement. The experiments reported in this paper sought to address this latter source of variability, that is, to make repeated measurements of activity distributions in the absence of all biological effects and under identical acquisition and reconstruction conditions.

Comparison between measurements is often incorrectly performed on the basis of a correlation coefficient. However, correlation coefficients only measure the linear relationship between two variables (in this case, two separate runs). Hence, CCs are invariant under location and scale transformations and so are not sensitive to agreement or equality between two measurements. Thus, correlation coefficients alone will not be a good measure of repeatability, especially for homogeneous distributions.

The Pearson correlation coefficients, calculated for each acquisition-duration and averaged over all pairs of runs, show that correlations remained poor even when the phantom was imaged for a period of 30 min (0.77 ± 0.01). However, the correlation between runs improves as scan duration is increased, from 0.22 ± 0.01 for 3 min scans to 0.77 ± 0.01 for 30 min scans. The better correlation obtained for longer scans is due to the decrease in statistical noise, which is reflected in an improvement (i.e., decrease) of the coefficient of repeatability. This can be visualized in the Bland–Altman plots, as in Figs. 3(c) and 3(d), as a decrease of the spread of values along the y-axis. The x-axis in these plots represents the average value of the concentration for a given pixel across the two runs investigated in a given plot. The spread of these values does not change as a function of scan duration. This is seen in the scatter plots in the same figure as a tightening of the distribution about the major axis, without a change in the length of this axis. Assuming the 68Ge phantom is perfectly homogeneous, this can be attributed to an increasing point-spread function (PSF) as we move radially away from the central axis of the scanner. The PSF affects this measurement because the 68Ge phantom, which was centered in the scanner gantry, has a large radial dimension (20 cm).

The average correlations between replicate images of the 18FDG phantom decreased (from 0.98 ± 0.01 to 0.14 ± 0.02) with increasing target size, owing to the partial volume effect (PVE). The higher correlations for smaller sources are a consequence of the limited spatial resolution of the PET scanner. The PVE causes intensity values in an image to be lower than those ideally expected.36 A small source will thus seem larger than it is and appear to have a lower concentration and lower contrast, because part of the signal spills out beyond the edge of the object into neighboring pixels. The consequence of the PVE is that the range of observed concentrations is larger than that expected if variability were due only to Poisson statistics. Therefore, the measured intensities of any two runs have a wider dispersion and lie along a straight line with a high correlation, as was observed in this experiment. As the target size increases, however, the PVE is reduced (“flatter” Gaussian distribution), so that the correlations decrease. Finally, for objects that are very large, such as that used for the BKG ROI, the correlations are close to zero.
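The size dependence of the PVE can be sketched with a simple one-dimensional model: a uniform source of a given width blurred by a Gaussian PSF. This is an illustration only. The 6 mm FWHM is an assumed resolution, not a measured property of the scanner used here, and a real three-dimensional object loses more signal than this 1D model suggests.

```python
import math

def peak_recovery_1d(width_mm, fwhm_mm):
    """Fraction of the true concentration seen at the centre of a uniform 1D
    source of the given width after blurring with a Gaussian PSF."""
    sigma = fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    return math.erf(width_mm / (2.0 * math.sqrt(2.0) * sigma))

FWHM_MM = 6.0  # assumed resolution, for illustration only
diameters_mm = [8, 12, 16, 25]  # insert diameters of the 18FDG phantom

recovery = {d: peak_recovery_1d(d, FWHM_MM) for d in diameters_mm}
for d, rc in recovery.items():
    print(f"{d:>2} mm: peak recovery = {rc:.2f}")
```

In this model the smallest insert recovers the least of its true peak value, and the recovery approaches unity as the object becomes large compared with the PSF.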

The results for patient scans of normal livers (3 min duration), which represent a fairly (although not totally) uniform “human” extended phantom, agreed qualitatively with those for the 68Ge cylindrical phantom scanned for 3 min, although the correlations were weaker (0.01 ± 0.01 to 0.11 ± 0.03). Taken together, these findings demonstrate the unsuitability of the correlation coefficient as a measure of the reproducibility of sequential scans. We expect that for tumor lesions the observations will lie somewhere between the liver and 18FDG phantom results. Depending on the tissue and location, lesions can be highly heterogeneous and thus may have a wide range of uptake values within their boundaries. In that case, we expect correlations between replicate runs to have values on the order of 0.7–0.8. These values will be higher for smaller lesions due to the PVE.

An appropriate analysis of repeatability was proposed by Bland and Altman.30,31 The method has already been described in the analysis section (Sec. IV) and involves plotting and analyzing the data in a way that readily demonstrates the agreement between two replicates. The pixel-wise difference between any two measurements is given as a percentage of the mean of the two. The coefficient of repeatability (CR), which can also be shown on the Bland–Altman plot, indicates the range of percent differences that can be expected for the ROI as a whole. This analysis was also performed to compare SUVmax and SUVmean, i.e., to obtain the interval within which these parameters are replicated with 95% confidence between any two runs.
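A minimal sketch of this type of analysis is given below. The pixel values are hypothetical, and the CR is taken here as 1.96 times the standard deviation of the percent differences, which is one common convention and an assumption on our part; the definition used for the results in this paper is the one given in the analysis section and may differ in detail.

```python
import statistics

def percent_differences(run_a, run_b):
    """Pixel-wise difference expressed as a percentage of the pair mean."""
    return [100.0 * (a - b) / ((a + b) / 2.0) for a, b in zip(run_a, run_b)]

def repeatability_coefficient(run_a, run_b):
    """Assumed definition: 1.96 x SD of the pixel-wise percent differences."""
    return 1.96 * statistics.stdev(percent_differences(run_a, run_b))

# Hypothetical pixel concentrations (kBq/cc) from two replicate runs.
run_a = [36.0, 38.5, 37.2, 35.1, 39.0, 36.8]
run_b = [37.1, 37.4, 36.0, 36.3, 38.2, 37.5]

diffs = percent_differences(run_a, run_b)
bias = statistics.mean(diffs)  # mean percent difference between the runs
cr = repeatability_coefficient(run_a, run_b)
print(f"mean % difference = {bias:.2f}, CR = {cr:.2f}%")
```

On a Bland–Altman plot, `diffs` would be plotted against the pair means, with the bias and the limits at plus and minus the CR drawn as horizontal lines.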

As can be seen in Fig. 4(b), repeatability improves with scan duration: the range of percent differences between runs is smaller (tighter bound), and so are the CR and its uncertainty [Eq. (4)]. Longer scan times imply higher total counts, which in turn mean that parameters such as SUVmean and SUVmax can be determined with smaller statistical errors. The SUVmean is always more repeatable (lower CR) than the SUVmax; for typical clinical scan times (3 min per bed position), the former is more than 2 times more statistically robust than the latter. However, as the acquisition time approaches 30 min, the values of repeatability become equivalent for both the SUVmean and SUVmax, although the CR for SUVmean continues to have smaller error bars.
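The qualitative behavior described above can be reproduced with a toy simulation. It assumes independent pixels with Gaussian noise whose relative standard deviation scales as the inverse square root of the scan duration, as a stand-in for counting statistics. The ROI size, number of runs, and 10% noise level at 3 min are hypothetical, and the model ignores reconstruction effects and pixel correlations, so it is not expected to reproduce the measured values.

```python
import math
import random
import statistics

def simulate_spread(duration_min, n_pixels=500, n_runs=200,
                    noise_at_3min=0.10, seed=0):
    """Return (SD of ROI mean, SD of ROI max) across simulated replicate runs.

    Pixel noise is Gaussian with a relative SD proportional to
    1/sqrt(duration); all numbers are illustrative assumptions.
    """
    rng = random.Random(seed)
    sigma = noise_at_3min * math.sqrt(3.0 / duration_min)
    means, maxima = [], []
    for _ in range(n_runs):
        pixels = [rng.gauss(1.0, sigma) for _ in range(n_pixels)]
        means.append(statistics.fmean(pixels))
        maxima.append(max(pixels))
    return statistics.stdev(means), statistics.stdev(maxima)

results = {t: simulate_spread(t) for t in (3, 30)}
for t, (sd_mean, sd_max) in results.items():
    print(f"{t:>2} min: SD(mean) = {sd_mean:.4f}, SD(max) = {sd_max:.4f}")
```

In this sketch both spreads shrink for the longer scan, and the ROI mean fluctuates less than the ROI maximum because it averages over many pixels whereas the maximum is set by a single extreme value.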

The measurements of repeatability as a function of object size (18FDG phantom) imply that measurements of larger objects are more stable than those of smaller ones (Fig. 5). Again, this can be attributed to the PVE, which effectively erodes the statistical power of the data. The SUVmean and SUVmax CRs have similar values for smaller targets (8 mm) but diverge for larger ones, with SUVmean always more repeatable. This is as expected, since for smaller objects the image has fewer pixels, and thus the maximum pixel value has a much larger impact on the average value than in larger objects, where many pixels make up the image.

In normal livers, which are not as homogeneous as either of the phantom studies, the difference in repeatability between SUVmax and SUVmean is much more pronounced (Fig. 6). As heterogeneity of the source increases, so do the tails of the intensity-probability distribution, which means the SUVmax will be much farther from the mean than in a homogeneous source. Thus, for the liver images, the SUVmax varies by as much as 30%, whereas the SUVmean variation is always less than 5%.

Ongoing measurements on patient lesions by our group are expected to show a larger minimum threshold for assessing real change, since this study was exclusively confined to an analysis of repeat measurements, without patients being reinjected with tracer or repositioned on the scanner on a subsequent occasion. Clearly, patient physiology will affect FDG uptake, which will thus depend on such factors as duration of fasting, degree of mental and physical activity, etc. To assess these variations, patients will need to undergo repeat baseline scans. Such effects were beyond the scope of the present study. Nevertheless, we can say that within the context of the response criteria named above, the results for healthy livers tend to support the higher 30% cutoff proposed by PERCIST.35

VI. CONCLUSIONS

We have shown that the statistical fluctuations of the SUVmax are on the order of 5%, without including biological or physiological effects. The fluctuations of SUVmean are half as large as those of SUVmax for the cylindrical inserts. For the background regions of the FDG phantom, however, the ratio of these values (%ΔSUVmax:%ΔSUVmean) is 20:1. In addition, we have shown that for clinically applicable scan durations (i.e., ∼3 min) and FDG concentrations, the SUVmax and SUVmean have similar amounts of statistical fluctuation for small regions. However, the statistical fluctuations of the SUVmean rapidly decrease with respect to those of the SUVmax as the statistical power of the data grows, either because of longer scanning times or because the target regions encompass a larger volume. Further studies are currently ongoing in a clinical setting, focusing on patient lesions, to assess the effect of lesion heterogeneity and size on these same parameters. These data will help us identify whether it is necessary to set organ- and/or lesion-specific and patient-specific limits for these parameters when using them to determine outcome.

REFERENCES

  • 1. Cascini G. L. et al., “18F-FDG PET is an early predictor of pathologic tumor response to preoperative radiochemotherapy in locally advanced rectal cancer,” J. Nucl. Med. 47(8), 1241–1248 (2006).
  • 2. Kostakoglu L. et al., “PET predicts prognosis after 1 cycle of chemotherapy in aggressive lymphoma and Hodgkin’s disease,” J. Nucl. Med. 43(8), 1018–1027 (2002).
  • 3. Dose Schwarz J. et al., “Early prediction of response to chemotherapy in metastatic breast cancer using sequential 18F-FDG PET,” J. Nucl. Med. 46(7), 1144–1150 (2005).
  • 4. Avril N. et al., “Metabolic characterization of breast tumors with positron emission tomography using F-18 fluorodeoxyglucose,” J. Clin. Oncol. 14(6), 1848–1857 (1996).
  • 5. Avril N. et al., “Prediction of response to neoadjuvant chemotherapy by sequential F-18-fluorodeoxyglucose positron emission tomography in patients with advanced-stage ovarian cancer,” J. Clin. Oncol. 23(30), 7445–7453 (2005). doi: 10.1200/JCO.2005.06.965
  • 6. Ott K. et al., “Prediction of response to preoperative chemotherapy in gastric carcinoma by metabolic imaging: Results of a prospective trial,” J. Clin. Oncol. 21(24), 4604–4610 (2003). doi: 10.1200/JCO.2003.06.574
  • 7. Nahmias C. et al., “Time course of early response to chemotherapy in non-small cell lung cancer patients with 18F-FDG PET/CT,” J. Nucl. Med. 48(5), 744–751 (2007). doi: 10.2967/jnumed.106.038513
  • 8. Kelloff G. J. et al., “Progress and promise of FDG-PET imaging for cancer patient management and oncologic drug development,” Clin. Cancer Res. 11(8), 2785–2808 (2005). doi: 10.1158/1078-0432.CCR-04-2626
  • 9. Haberkorn U. et al., “Fluorodeoxyglucose imaging of advanced head and neck cancer after chemotherapy,” J. Nucl. Med. 34(1), 12–17 (1993).
  • 10. Hoekstra O. S. et al., “Early treatment response in malignant lymphoma, as determined by planar fluorine-18-fluorodeoxyglucose scintigraphy,” J. Nucl. Med. 34(10), 1706–1710 (1993).
  • 11. Jansson T. et al., “Positron emission tomography studies in patients with locally advanced and/or metastatic breast cancer: A method for early therapy evaluation?,” J. Clin. Oncol. 13(6), 1470–1477 (1995).
  • 12. Minn H. and Soini I., “[18F]fluorodeoxyglucose scintigraphy in diagnosis and follow up of treatment in advanced breast cancer,” Eur. J. Nucl. Med. 15(2), 61–66 (1989). doi: 10.1007/BF00702620
  • 13. Patz E. F., Jr. et al., “Persistent or recurrent bronchogenic carcinoma: detection with PET and 2-[F-18]-2-deoxy-D-glucose,” Radiology 191(2), 379–382 (1994).
  • 14. Rege S. D. et al., “Change induced by radiation therapy in FDG uptake in normal and malignant structures of the head and neck: Quantitation with PET,” Radiology 189(3), 807–812 (1993).
  • 15. Torizuka T. et al., “Value of fluorine-18-FDG-PET to monitor hepatocellular carcinoma after interventional therapy,” J. Nucl. Med. 35(12), 1965–1969 (1994).
  • 16. Wahl R. L. et al., “Metabolic monitoring of breast cancer chemohormonotherapy using positron emission tomography: Initial evaluation,” J. Clin. Oncol. 11(11), 2101–2111 (1993).
  • 17. Paquet N. et al., “Within-patient variability of (18)F-FDG: Standardized uptake values in normal tissues,” J. Nucl. Med. 45(5), 784–788 (2004).
  • 18. Weber W. A. et al., “Reproducibility of metabolic measurements in malignant tumors using FDG PET,” J. Nucl. Med. 40(11), 1771–1777 (1999).
  • 19. Visvikis D. et al., “Influence of OSEM and segmented attenuation correction in the calculation of standardised uptake values for [18F]FDG PET,” Eur. J. Nucl. Med. 28(9), 1326–1335 (2001). doi: 10.1007/s002590100566
  • 20. Krak N. C. et al., “Effects of ROI definition and reconstruction method on quantitative outcome and applicability in a response monitoring trial,” Eur. J. Nucl. Med. Mol. Imaging 32(3), 294–301 (2005). doi: 10.1007/s00259-004-1566-1
  • 21. Lowe V. J. et al., “Optimum scanning protocol for FDG-PET evaluation of pulmonary malignancy,” J. Nucl. Med. 36(5), 883–887 (1995).
  • 22. Nehmeh S. A. et al., “Four-dimensional (4D) PET/CT imaging of the thorax,” Med. Phys. 31(12), 3179–3186 (2004). doi: 10.1118/1.1809778
  • 23. Doot R. K. et al., “Instrumentation factors affecting variance and bias of quantifying tracer uptake with PET/CT,” Med. Phys. 37(11), 6035–6046 (2010). doi: 10.1118/1.3499298
  • 24. Nahmias C. and Wahl L. M., “Reproducibility of standardized uptake value measurements determined by 18F-FDG PET in malignant tumors,” J. Nucl. Med. 49(11), 1804–1808 (2008). doi: 10.2967/jnumed.108.054239
  • 25. Velasquez L. M. et al., “Repeatability of 18F-FDG PET in a multicenter phase I study of patients with advanced gastrointestinal malignancies,” J. Nucl. Med. 50(10), 1646–1654 (2009). doi: 10.2967/jnumed.109.063347
  • 26. Fahey F. H. et al., “Variability in PET quantitation within a multicenter consortium,” Med. Phys. 37(7), 3660–3666 (2010). doi: 10.1118/1.3455705
  • 27. Kinahan P. E. et al., “PET/CT assessment of response to therapy: tumor change measurement, truth data, and error,” Transl. Oncol. 2(4), 223–230 (2009).
  • 28. Clarke L. P. et al., “Quantitative imaging for evaluation of response to cancer therapy,” Transl. Oncol. 2(4), 195–197 (2009).
  • 29. Kinahan P. E., Alessio A. M., and Fessler J. A., “Dual energy CT attenuation correction methods for quantitative assessment of response to cancer therapy with PET/CT imaging,” Technol. Cancer Res. Treat. 5(4), 319–327 (2006).
  • 30. Altman D. G. and Bland J. M., “Measurement in medicine: the analysis of method comparison studies,” Statistician 32(3), 307–310 (1983). doi: 10.2307/2987937
  • 31. Bland J. M. and Altman D. G., “Statistical methods for assessing agreement between two methods of clinical measurement,” Lancet 1(8476), 307–310 (1986). doi: 10.1016/S0140-6736(86)90837-8
  • 32. Young H. et al., “Measurement of clinical and subclinical tumour response using [18F]-fluorodeoxyglucose and positron emission tomography: Review and 1999 EORTC recommendations. European Organization for Research and Treatment of Cancer (EORTC) PET Study Group,” Eur. J. Cancer 35(13), 1773–1782 (1999). doi: 10.1016/S0959-8049(99)00229-4
  • 33. Weber W. A., “Chaperoning drug development with PET,” J. Nucl. Med. 47(5), 735–737 (2006).
  • 34. Young T. and Maher J., “Collecting quality of life data in EORTC clinical trials—what happens in practice?,” Psychooncology 8(3), 260–263 (1999).
  • 35. Wahl R. L. et al., “From RECIST to PERCIST: Evolving considerations for PET response criteria in solid tumors,” J. Nucl. Med. 50(Suppl 1), 122S–50S (2009).
  • 36. Soret M., Bacharach S. L., and Buvat I., “Partial-volume effect in PET tumor imaging,” J. Nucl. Med. 48(6), 932–945 (2007). doi: 10.2967/jnumed.106.035774

Articles from Medical Physics are provided here courtesy of American Association of Physicists in Medicine
