Author manuscript; available in PMC 2023 Apr 11. Published in final edited form as: IEEE Trans Radiat Plasma Med Sci. 2023 Apr;7(4):372–381. doi: 10.1109/TRPMS.2022.3231702

Pre-training via Transfer Learning and Pretext Learning a Convolutional Neural Network for Automated Assessments of Clinical PET Image Quality

Jessica B Hopson 1, Radhouene Neji 2, Joel T Dunn 3, Colm J McGinnity 4, Anthime Flaus 5, Andrew J Reader 6,#, Alexander Hammers 7,#
PMCID: PMC7614424  EMSID: EMS159486  PMID: 37051163

Abstract

Positron emission tomography (PET) using a fraction of the usual injected dose would reduce the amount of radioligand needed, as well as the radiation dose to patients and staff, but would compromise reconstructed image quality. For performing the same clinical tasks with such images, a clinical (rather than numerical) image quality assessment is essential. This process can be automated with convolutional neural networks (CNNs). However, the scarcity of clinical quality readings is a challenge. We hypothesise that exploiting easily available quantitative information in pretext learning tasks, or using established pre-trained networks, could improve CNN performance for predicting clinical assessments with limited data. CNNs were pre-trained to predict injected dose from image patches extracted from eight real patient datasets, reconstructed using between 0.5% and 100% of the available data. Transfer learning with seven different patients was used to predict three clinically scored quality metrics ranging from 0 to 3: global quality rating, pattern recognition and diagnostic confidence. This was compared to pre-training via a VGG16 network at varying pre-training levels. Pre-training improved test performance for this task: the mean absolute error of 0.53 (compared to 0.87 without pre-training) was within the clinical scoring uncertainty. Future work may include using the CNN to assess the performance of novel reconstruction methods.

Index Terms: Convolutional neural networks, Deep learning, Image quality, Image reconstruction, Transfer learning

I. Introduction

POSITRON emission tomography (PET) is an important modality in the management of several brain diseases, notably memory problems, dementia, epilepsy [1], and brain tumours [2]. Simultaneous positron emission tomography − magnetic resonance imaging (PET-MR) has the potential to become an important tool in the diagnosis of memory clinic patients [3], by simultaneously providing both functional and structural information of the brain [4]. However, a drawback to PET imaging is the need for administration of the radioligand: the more radioligand injected, the higher the associated radiation dose, and radioligand production to Good Manufacturing Practice standards is very expensive [5]. Both can be targeted by reducing the injected dose [6], but at the expense of the reconstructed image quality, due to increased noise and a loss of resolution [7]. Traditionally, image quality assessment has relied on human raters, but this is neither scalable nor practical for large-scale investigations or for repeating assessments while optimising reconstruction algorithms. Thus, it is beneficial to determine automatically whether a reconstructed image is of satisfactory diagnostic quality. One approach is to use convolutional neural networks (CNNs), as they are data-driven [8] and versatile, being used for medical imaging tasks including image segmentation, lesion detection and image de-noising [9]–[14]. CNNs have also been shown to outperform radiomic analysis in certain tasks [15].

A drawback of using CNNs is that a large dataset is generally required for training [16]. However, this is challenging in medical imaging, as clinically-annotated data are scarce and not scalable [17]; the associated acquisition and processing costs contribute to the paucity of clinician-labelled data. Pre-training (an architecture trained on a separate image database) and transfer learning (taking features learned on one problem and using them for a new problem) could be exploited to overcome this lack of training data [18], [19]. Transfer learning can be approached with either fine tuning or fixed representations [20]. Fine tuning involves unfreezing some or all of the network parameters in order to update the weights with a very low learning rate and a new dataset [18].

Pre-training with natural image databases has been used in the medical imaging field. ImageNet [21] is a dataset of more than 15 million high-resolution natural images annotated for 22,000 classes. A subset of ImageNet used for deep learning tasks consists of 1.2 million images categorised into 1000 classes [22]. Networks pre-trained via ImageNet have been used for the binary classification of papillary thyroid carcinoma [23], where a VGG16 [24] network achieved 97.66% accuracy when distinguishing between carcinomas and benign thyroid nodules in two-dimensional cytological image patches. Dunnmon et al. [25] showed that CNNs could be pre-trained on ImageNet to classify chest radiographs into “normal” or “abnormal” with an area under the receiver operating characteristic curve (AUC) of 0.96. This was built upon by Tang et al. [26], who investigated a greater number of established architectures, including VGGNets [24] and Inception-v3 networks [27], and compared pre-trained networks against training from scratch. Their work showed that when using a moderately sized dataset (8500 two-dimensional chest radiographs), pre-training on ImageNet out-performed training the model from scratch. It has also been shown that a transfer-learned VGG16 network pre-trained via ImageNet outperformed an Inception-v3 network for the task of classifying into ‘Normal’ or ‘Pneumonia present’, and that matching the complexity of the model to the dataset size, along with data augmentation, further improved CNN accuracy [28].

However, there is debate on the usefulness of transfer learning from a natural image domain to a medical imaging domain [29], [30]. Recent work on transfer learning questions the validity of using natural images for pre-training in medical imaging tasks due to the high dissimilarity between the domains [31]. Heker et al. [32] also showed transfer learning within domain (from one medical image dataset to another) outperformed inter-domain pre-training (from a natural image database to a medical image database) for segmentation and classification tasks.

Pre-training can be used to overcome extensive training data needs. Mustafa et al. [19] showed that network performance could be improved by increasing the size of both the model architecture and the natural-image pre-training dataset: ImageNet-21k (the full ImageNet dataset classified into 21,000 classes) obtained superior performance over ImageNet (the subset classified into 1000 classes). However, the full dataset is more computationally expensive and its labels are not mutually exclusive: a single image may have multiple labels, making classification more difficult [33].

As pre-trained networks act as feature extractors, layers need to be frozen to prevent the network from retraining too extensively, potentially negating any pre-training benefit. As shown in [34], inclusion of multiple layers in the fine tuning process can improve model performance. In an Alzheimer’s disease diagnosis classification task, all convolutional layers of pre-trained networks were frozen, except for the last fully connected layer which was trained on the new dataset [35]. Kieffer et al. [36] retrained the last convolutional block and fully connected layer of a VGG16 network for their dissimilar domains.

Whilst there have been many image quality assessment investigations in the computer vision field [37]–[39], similar experiments within the medical imaging field have emphasised binary classifications. One such investigation by Sujit et al. [40] used a CNN to predict image quality of brain MRIs compared to visual evaluation by two experts by scoring the images as ‘0 (acceptable)’ or ‘1 (unacceptable)’, achieving an accuracy of 84%. Oksuz et al. [41] used cardiac MR images to detect motion-related artefacts, classifying the image as “good” or “poor”. Similarly, Ma et al. [42] used a CNN to carry out both a binary classification (‘0 − non-diagnostic’ or ‘1 − diagnostic’) and three-class classification (‘0 − poor/non-diagnostic’, ‘1 − diagnostic’ or ‘2 − excellent’) of MR images, achieving an accuracy of 84% and 65%, respectively. In the field of retinopathy, a similar approach was taken by Coyner et al. [43] to determine whether clinically-scored 2D images were of “acceptable”, “possibly acceptable” or “not acceptable” quality, and they found a strong correlation (Spearman’s rank correlation coefficient of 0.84 − 0.92) between the CNN and the expert scoring. Another retinopathy study showed that pre-training via ImageNet outperformed training from scratch [44].

Building on previous work [45], the aim of this work was to assess the influence of transfer learning as a function of the number of fixed parameters during training, as well as to compare different pre-trained architectures for the prediction of clinical PET image quality assessments. Most of the literature relates to CT and MR images, and there is limited research on PET image quality assessment. Varying image qualities were generated by reconstructing images with different simulated injected doses, such that all other factors that may affect image quality, including reconstruction algorithm and system design, were fixed, and only the dose level determined the differences in image quality. The performance of an established VGG16 network [24] pre-trained on the ImageNet-1K database [21] was assessed and compared to a pretext learning task (using an easily extracted label, the injected dose) as a form of pre-training to predict three clinically scored metrics. These metrics are: global quality rating (GQR), which evaluates the aesthetic component of the reconstruction and takes into consideration qualities such as noise level and resolution; pattern recognition (PR), which determines whether the reconstruction allows the clinician to determine the presence of any pathological patterns suggesting a particular diagnosis; and diagnostic confidence (DC), which concerns the certainty with which the clinician can use the reconstruction to observe these patterns and make a diagnosis. As low count reconstruction degrades PET image quality, these metrics were designed by the authors to encompass the considerations involved in using these low count clinical PET images for making a confident and accurate diagnosis. These three metrics are scored as: 0 = “unacceptable”, 1 = “poor but usable”, 2 = “acceptable” and 3 = “good/excellent”, with 0.5 ratings accepted. The aim was to reduce the need for clinically-assessed training data by the use of pre-trained networks and transfer learning.

II. Methods and materials

A. Real Patient Data

Data from PET-MR scans of 21 memory clinic patients with suspected dementia were used. Each patient underwent an [18F]FDG-PET and a simultaneous 3D T1-magnetisation-prepared gradient-echo (MP-RAGE) MRI scan on a 3T PET-MR scanner (Biograph mMR, Siemens Healthcare, Erlangen, Germany). Siemens e7 tools were used for image reconstruction, using the clinical standard of the ordered subset expectation maximisation (OSEM) algorithm with 2 iterations and 21 subsets to reconstruct 3D images of 344 × 344 × 127 voxels. Lower injected doses were simulated by resampling the list-mode PET data [45], [46]: the prompt and random coincidences in the list-mode data for the 100% count reconstruction were randomly sampled, and the corresponding emission, normalisation and randoms sinograms, along with the attenuation sinograms, were re-calculated using Siemens e7 tools. This was done for all datasets, simulating seven lower injected doses (Fig. 1). Images were normalised between 0-1 to reduce standard deviations, suppressing the impact of outliers, then blinded and randomised prior to clinician scoring. The simulated dose levels were grouped into “low quality” (0.5% and 1% of counts), “medium quality” (5% and 10% of counts), and “high quality” (25%, 50% and 100% of counts) categories to decrease clinical assessment times. In total, ten patients at three different count levels (one per category) were assessed by an experienced clinician. For calibration, nine images were scored twice by the same clinician in different scoring sessions (Fig. 2). Another experienced clinician also scored the same images. Pearson’s correlation coefficients of 0.91, 0.91 and 0.90 for GQR, PR and DC, respectively, were found between the clinicians’ scores. In addition, 91% of the second clinician’s scores were within 0.5 of those given by clinician 1, which is within the calibration range (Fig. 2), showing the robustness of the scores given by each clinician. After consensus readings, only one score differed by more than 0.5 from clinician one’s original scoring; thus clinician one’s scores were used for this study. Fig. 3 shows the 20-point discrete colour scale used during the scoring process, which was optimised for reading brain PET images at the Guy’s and St Thomas’ PET Centre. This colour scale allows specific pathological patterns to be highlighted more easily, such that a diagnosis could still be made from a lower quality image, helping to determine pattern recognition and diagnostic confidence. The image used in Fig. 3 corresponds to the example in the top-right subplot of Fig. 1. Table I shows the pairwise Pearson’s correlation coefficients between the clinical metrics for each clinician.
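
To make the count-reduction step concrete, the sketch below thins a list-mode event stream by randomly retaining a fraction of coincidences. This is a minimal illustration under assumed data structures (a hypothetical `listmode_events` array): the actual pipeline used Siemens e7 tools to resample the prompts and randoms and to re-compute the corresponding sinograms before OSEM reconstruction.

```python
import numpy as np

def thin_listmode(events, fraction, seed=0):
    """Randomly retain `fraction` of coincidence events (e.g. 0.05 for 5% of counts)."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(events)) < fraction
    return events[keep]

# The reduced count levels used in this study (100% corresponds to the unmodified data).
count_fractions = [0.005, 0.01, 0.05, 0.10, 0.25, 0.50]
# reduced_data = {f: thin_listmode(listmode_events, f) for f in count_fractions}
# Each thinned dataset is then histogrammed into sinograms and reconstructed
# with OSEM (2 iterations, 21 subsets), as described above.
```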

Fig. 1. Cropped example transverse slices from “low”, “medium” and “high” quality reconstructions for two independent datasets.


Fig. 2. Inter-session scoring difference for all clinical metrics. The dotted line is the identity line for scoring session 1.

Fig. 3. A clinically assessed image with a 20-point colour scale, with a lower global quality rating (1) but higher pattern recognition (2) and diagnostic confidence (2) scores (corresponding to the top-right subplot in Fig. 1). The colour scale is the technique by which the clinician determines the pattern recognition and diagnostic confidence ratings.

Table I. Correlation between clinical quality metrics (Pearson's correlation coefficients).

Metric Comparison    Clinician 1    Clinician 2
GQR vs. PR           0.90           0.93
GQR vs. DC           0.84           0.93
PR vs. DC            0.96           0.99

B. Extraction and Thresholding of Patches

For each patient, 1000 patches of size 80 × 80 pixels were extracted at random from transverse, sagittal and coronal planes to cover the whole 3D volume whilst not being too computationally expensive. Patches, rather than whole images, were also used as a form of data augmentation, which has been shown to improve model performance [28]. To eliminate background-only patches, a thresholding algorithm was applied to each individual patch: if the average pixel value of the patch was below one-eighth of the average pixel value of the whole image volume, the patch was rejected as background, ensuring that only patches with meaningful brain information were used, as is done when defining brain masks in FDG PET imaging [47]. Of these thresholded patches, 100 each from the transverse, sagittal and coronal planes were randomly sampled, such that 300 patches were used for each reconstructed image volume. This number of patches was chosen as it is proportional to the brain:background ratio in the whole image volume.
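
The sketch below illustrates the patch extraction and background thresholding described above. It is a simplified variant that samples until 100 valid patches per plane are found, rather than the two-stage extract-then-subsample procedure used in the paper, and the function name and array conventions are assumptions.

```python
import numpy as np

def extract_patches(volume, n_per_plane=100, patch_size=80, seed=0):
    """Randomly sample 80x80 patches from transverse, sagittal and coronal planes,
    rejecting patches whose mean is below 1/8 of the whole-volume mean."""
    rng = np.random.default_rng(seed)
    threshold = volume.mean() / 8.0
    patches = []
    for axis in range(3):                       # one axis per anatomical plane
        kept = 0
        while kept < n_per_plane:
            sl = rng.integers(volume.shape[axis])
            plane = np.take(volume, sl, axis=axis)
            r = rng.integers(plane.shape[0] - patch_size)
            c = rng.integers(plane.shape[1] - patch_size)
            patch = plane[r:r + patch_size, c:c + patch_size]
            if patch.mean() >= threshold:       # keep only patches with brain signal
                patches.append(patch)
                kept += 1
    return np.stack(patches)                    # (300, 80, 80) per reconstruction
```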

C. Model Architecture

A VGG16 architecture [24] with random weight initialisation (Section II.D) or ImageNet [21] weight initialisation (Sections II.D and II.E) acted as a feature extractor. The VGG16 backbone was chosen because it has been used for other medical imaging tasks [28], [36], its block structure provides a convenient way to vary the level of pre-training, and it is representative of similar CNN architectures (Fig. 4). A total of 26 different established backbones were trained on the same dataset, with the minimum validation loss monitored. The eleven models with the lowest validation losses were each trained another two times and then tested on the same three test patients (Fig. 4). The VGG16 architecture has five convolutional blocks, each consisting of convolution layers and a max pooling layer (Fig. 5). In Section II.D, the three original fully connected layers of the VGG16 architecture were replaced with a single fully connected layer with three outputs, one for each of the clinical metrics (GQR, PR and DC), with ReLU activation (Fig. 5). This is because the original VGG16 network classifies the ImageNet database into 1000 different classes, whereas here the task is regression, with the output a prediction of three clinical quality scores; the original projection head was therefore replaced by another fully connected layer. In Section II.E, the replacement fully connected layer had a single output corresponding to the simulated dose. The same seven patients were used in the transfer learning dataset, reconstructed at three simulated dose levels each. Greyscale input patches were replicated along one axis to obtain three channels, as VGG16 inputs are RGB images.
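
A minimal Keras sketch of this architecture is given below, using the standard tf.keras.applications.VGG16 backbone; the helper name build_quality_model and the Flatten layer placed before the new head are illustrative assumptions rather than details taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_quality_model(weights="imagenet", n_outputs=3):
    """VGG16 feature extractor with a single fully connected regression head."""
    backbone = tf.keras.applications.VGG16(
        include_top=False,            # drop the original 1000-class ImageNet head
        weights=weights,              # "imagenet" or None (random initialisation)
        input_shape=(80, 80, 3),      # greyscale patches replicated to 3 channels
    )
    x = layers.Flatten()(backbone.output)
    outputs = layers.Dense(n_outputs, activation="relu")(x)  # GQR, PR, DC (or dose)
    return models.Model(backbone.input, outputs)
```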

Fig. 4. Bar chart of the mean absolute error, based on the same three test patients, for 11 established CNN backbones over three training runs. The VGG16 backbone is shown to be representative of other established CNN backbones.

Fig. 5. Transfer-learned VGG16 architecture. A patch of size 80 × 80 is used as the input, with the three quality metrics as the output. Each of the five convolutional blocks is labelled.

D. Transfer Learning: Pre-training a VGG16 Backbone

The VGG16 network (Fig. 5) was initialised with random weights and trained with a learning rate of either 10⁻³ (the default) or 10⁻⁴, serving as the baseline comparison models.

The VGG16 network was then also initialised with ImageNet weights and trained with an increasing number of unfrozen convolutional blocks, sequentially unfreezing consecutive blocks from zero (only the fully connected layer trained) up to and including all five. Each model was trained seven times, and seven patients were used in the transfer learning dataset. This model was also tested as a function of the number of patients in the transfer learning dataset; a maximum of 6300 patches was used for transfer learning.
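
The block-wise freezing can be expressed as in the sketch below, which relies on the layer-naming convention of the Keras VGG16 implementation ("block1_conv1" to "block5_pool"); the function name and usage example are assumptions.

```python
def set_trainable_blocks(model, n_blocks_to_train):
    """Freeze the VGG16 convolutional blocks, then unfreeze the last
    `n_blocks_to_train` of them (0 = only the new fully connected layer trains)."""
    unfrozen = {f"block{b}" for b in range(6 - n_blocks_to_train, 6)}
    for layer in model.layers:
        prefix = layer.name.split("_")[0]        # e.g. "block5" from "block5_conv1"
        if prefix.startswith("block"):
            layer.trainable = prefix in unfrozen
        # non-block layers (input, flatten, dense head) keep their defaults

# Example: the configuration that performed best in this work.
# model = build_quality_model(weights="imagenet")
# set_trainable_blocks(model, n_blocks_to_train=2)   # unfreeze block4 and block5
```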

E. Pretext Learning: Injected Dose Inference

Dose inference was used as a pretext learning task due to the positive correlation between the number of counts in the reconstruction and the clinician-assigned score (Fig. 6). The injected dose was standardised by patient weight. In total, eight patients (16,800 patches) were used for training this network, with another patient used for validation. Pre-trained networks (ImageNet initialisation with re-training of the last two convolutional blocks) were compared with networks without pre-training (random weight initialisation with re-training of either the last two or all five convolutional blocks). For additional comparison, the dose standardised by patient weight was used as the input to a model consisting of a single fully connected layer with the three clinical metrics as the output. All pipelines are shown in Fig. 7.
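
The sketch below outlines this pretext-then-transfer pipeline (corresponding to Fig. 7d when ImageNet initialisation is used): a single-output dose head is trained first, the learned weights are then fixed, and a new three-output head is attached for GQR, PR and DC. Variable names and the single dense dose head are assumptions, as the exact head configuration is not specified beyond the text above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# 1) Pretext task: predict injected dose (standardised by patient weight) from a patch.
backbone = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                        input_shape=(80, 80, 3))
features = layers.Flatten()(backbone.output)
dose_output = layers.Dense(1, activation="relu", name="dose_head")(features)
pretext_model = models.Model(backbone.input, dose_output)
# pretext_model.fit(patches, dose_per_kg, ...)   # 8 patients, 16,800 patches
# (in the ImageNet variant, only the last two convolutional blocks are re-trained here)

# 2) Transfer: fix the pretext weights, drop the dose head and attach a new
#    fully connected layer predicting GQR, PR and DC.
for layer in pretext_model.layers:
    layer.trainable = False
clinical_output = layers.Dense(3, activation="relu", name="clinical_head")(features)
quality_model = models.Model(backbone.input, clinical_output)
# quality_model.fit(transfer_patches, clinical_scores, ...)  # 7 patients
```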

Fig. 6. Correlation between clinically-scored metrics and injected dose standardised by patient weight. Dotted line = line of best fit. Error bars = ± 0.5, informed by Fig. 2.

Fig. 7. Pipelines of pre-training. (a) VGG16 architecture without pre-training (random weight initialisation). (b) Transfer-learned VGG16 backbone pre-trained on ImageNet to predict the same three clinical metrics. (c) VGG16 backbone with random weight initialisation, used first to infer dose as a pretext learning task; the weights from this model are then fixed and the fully connected layer removed and replaced with another to output the prediction of the three clinical quality metrics. (d) As (c), but with the VGG16 backbone initialised with ImageNet weights.

F. Evaluation Metrics

Three evaluation metrics were used: mean absolute error (MAE) between the predicted and true clinical scores; percentage exact agreement (predicted score = true score); and percentage close agreement (predicted score = true score ± 0.5), in line with Fig. 2. All models were tested on the same three test patients at three dose levels each.
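
A sketch of these three metrics is given below; rounding the predictions to the nearest 0.5 before computing the agreement measures is an assumption made here to match the 0.5-step scoring scale, not a detail stated in the paper.

```python
import numpy as np

def evaluate(predicted, true, tol=0.5):
    """MAE, percentage exact agreement and percentage close agreement."""
    predicted = np.asarray(predicted, dtype=float)
    true = np.asarray(true, dtype=float)
    mae = np.mean(np.abs(predicted - true))
    snapped = np.round(predicted * 2) / 2                   # nearest 0.5 step (assumption)
    exact = 100 * np.mean(snapped == true)                  # % exact agreement
    close = 100 * np.mean(np.abs(predicted - true) <= tol)  # % close agreement (± 0.5)
    return mae, exact, close
```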

G. Implementation

Reconstructions via Siemens e7 tools were carried out in MATLAB (The MathWorks, Inc.). All models were trained using the Keras [18] application programming interface implemented in TensorFlow [48]. Training was accelerated using either an NVIDIA Tesla K40 12 GB GPU or an NVIDIA Quadro RTX 6000 24 GB GPU. All models used the Adam optimiser with a learning rate of 10⁻⁴ (unless otherwise stated), a mean-squared-error loss function and a batch size of 10. Training ran for 1000 epochs; the model weights at the lowest validation loss were saved and used for analysis.
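
A minimal sketch of this training configuration, assuming the build_quality_model helper from the earlier sketch and placeholder training arrays, is:

```python
import tensorflow as tf

model = build_quality_model(weights="imagenet")   # from the sketch in Section II.C
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="mean_squared_error")

# Keep the weights that achieve the lowest validation loss over 1000 epochs.
checkpoint = tf.keras.callbacks.ModelCheckpoint("best_model.h5",
                                                monitor="val_loss",
                                                save_best_only=True)
# model.fit(train_patches, train_scores,
#           validation_data=(val_patches, val_scores),
#           epochs=1000, batch_size=10, callbacks=[checkpoint])
```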

III. Results

A. Influence of the Number of Retrained Blocks

Based on the three test patients, Fig. 8 shows a series of boxplots describing the evaluation metrics as a function of the number of convolutional blocks retrained before the fully connected layer, when seven patients at three dose levels each were used in the transfer learning dataset. Re-training the last one or two blocks outperformed training the fully connected layer only, but there was no clear benefit to retraining the last 3-5 blocks, as the percentage close agreement decreased. Re-training a higher number of convolutional blocks generally increased the variation across separate training runs, without further decreases in the mean or median absolute error. There is a plateau in the MAE around 0.5, in line with Fig. 2.

Fig. 8. Boxplots of MAE, percentage exact agreement and percentage close agreement as a function of the number of pre-fully-connected-layer convolutional blocks of the VGG16 backbone retrained (0 to 5). Left: 0 - only the fully connected layer trained; right: 5 - all weights updated. Plotted across all clinical metrics for seven training runs. The middle line is the median.

B. Influence of the Number of Patients in the Transfer Learning Dataset

Fig. 9 shows a series of boxplots describing the same metrics as in Fig. 8. Based on Section III.A, the last two convolutional blocks of the VGG16 backbone were re-trained. Generally, the MAE and the associated variation decreased with increasing numbers of independent training datasets, but plateaued around 0.5, again in line with Fig. 2. Using one patient in the transfer learning dataset was consistently outperformed, across all evaluation metrics, by using more independent patients. There was a clear improvement in the percentage exact agreement with increasing numbers of independent training patients.

Fig. 9. Boxplots of the mean absolute error, percentage exact agreement and percentage close agreement as a function of the number of separate patients in the transfer learning dataset (n = 1 to n = 7). Plotted across all clinical metrics for seven training runs. The middle line is the median. The model used for this experiment is that of Fig. 7b, with the last two convolutional blocks trained.

C. Dose Inference Performance as a Function of Pre-Training and Retraining

Table II shows the MAE and percentage exact agreement for the different models investigated in the pretext dose inference task (before transfer learning to clinical metric prediction), assessed on the validation patient reconstructed at seven different dose levels. Pre-training VGG16 with ImageNet and retraining the last two blocks vastly outperformed the VGG16 network with random weight initialisation and retraining of all five convolutional blocks. Table III shows the impact of using only the easily accessible dose label (standardised by patient weight) as the input to a single fully connected layer, for learning rates of both 10⁻³ and 10⁻⁴, with the three clinical metrics as the output. In Tables II and III, the green boxes indicate the best value across the three clinical metrics; lighter green boxes indicate a draw.

Table II. Pretext learning task: dose inference evaluation.

VGG16 Network            Pre-training Weights    Mean Absolute Error    % Exact Agreement
Retrain last 2 blocks    ImageNet                0.09                   71
Retrain all 5 blocks     Random                  0.27                   29

Table III. Comparison to easily accessible labels as input.

                       Pre-trained Network        Dose/Weight to Clinical Scores (LR = 10⁻³)    Dose/Weight to Clinical Scores (LR = 10⁻⁴)
Metric                 GQR     PR      DC         GQR     PR      DC                            GQR     PR      DC
% Exact Agreement      56      22      11         0       0       0                             0       0       0
% Close Agreement      89      44      33         89      56      33                            78      56      33
MAE                    0.20    0.64    0.71       0.29    0.65    0.89                          0.28    0.64    0.88
Spearman's Rank        0.97    0.64    0.52       0.39    0.66    0.64                          0.39    0.66    0.64

D. Comparison of all Pipelines

Fig. 10 compares all pipelines; from left to right, the amount of pre-training increases, starting from no pre-training (Fig. 7a), to pre-training via ImageNet (Fig. 7b), to transfer learning from the pretext learning task with random weight initialisation (Fig. 7c), and subsequently with ImageNet weight initialisation (Fig. 7d). No pre-training with a learning rate of 10⁻³ was outperformed by all other models. Whilst reducing the learning rate improved the MAE from 0.87 to 0.58, including ImageNet weights and freezing the first three convolutional blocks further reduced the MAE to 0.53. Adding another layer of pre-training via the pretext learning task with random weight initialisation reduced the variation over the seven separate training runs. Initialising this model with ImageNet weights slightly reduced the MAE compared to random weight initialisation; however, the MAEs of all pre-trained models were comparable. Table IV compares the same models as in Fig. 10, but evaluated across the individual clinical metrics for each evaluation metric and the Spearman's rank correlation coefficient. Bright green boxes indicate the best value for each evaluation metric, whilst lighter green boxes indicate the second best performing network. The majority of the best values for each metric were achieved by a pre-trained model. Generally, GQR is predicted more accurately than PR and DC.

Fig. 10. Comparison boxplots of the MAE across all clinical metrics for each methodology over seven separate training runs. LR = learning rate. Pipeline (a) corresponds to training the VGG16 backbone with random weight initialisation; pipeline (b) involves training the VGG16 backbone with ImageNet weight initialisation, in this case training the last two convolutional blocks as well as the fully connected layer; pipeline (c) corresponds to using a pretext learning task of dose inference with random weight initialisation; and pipeline (d) is the dose inference pretext learning task with ImageNet weight initialisation.

Table IV. Comparison of clinical metrics for each pipeline.

Model (weight initialisation; trainable layers)                                        Metric   MAE ± SD      % Exact Agreement ± SD   % Close Agreement ± SD   Spearman's Rank Correlation Coefficient
VGG16 backbone, no pre-training, LR = 0.001 (Random; all layers)                       GQR      0.72 ± 0.24   11 ± 11                  52 ± 22                  0.15
                                                                                       PR       0.90 ± 0.17   3 ± 8                    23 ± 8                   0.19
                                                                                       DC       0.99 ± 0.17   3 ± 8                    16 ± 13                  0.14
VGG16 backbone, no pre-training, LR = 0.0001 (Random; all layers)                      GQR      0.51 ± 0.42   32 ± 18                  75 ± 22                  0.60
                                                                                       PR       0.61 ± 0.07   30 ± 12                  44 ± 6                   0.62
                                                                                       DC       0.63 ± 0.05   25 ± 11                  51 ± 6                   0.57
VGG16 backbone, pre-training (ImageNet; last 2 blocks and fully connected layer)       GQR      0.22 ± 0.06   52 ± 8                   97 ± 5                   0.96
                                                                                       PR       0.66 ± 0.09   24 ± 8                   49 ± 11                  0.67
                                                                                       DC       0.69 ± 0.05   27 ± 11                  46 ± 10                  0.59
Pretext learning, dose inference (Random; all layers)                                  GQR      0.36 ± 0.01   11 ± 2                   72 ± 10                  0.93
                                                                                       PR       0.64 ± 0.01   4 ± 10                   64 ± 5                   0.54
                                                                                       DC       0.72 ± 0.01   13 ± 5                   33 ± 0                   0.59
Pretext learning, dose inference (ImageNet; last 2 blocks and fully connected layer)   GQR      0.32 ± 0.06   13 ± 4                   71 ± 11                  0.93
                                                                                       PR       0.58 ± 0.09   31 ± 5                   44 ± 0                   0.59
                                                                                       DC       0.68 ± 0.05   11 ± 0                   44 ± 0                   0.56

IV. Discussion

There is a strong correlation between the metrics (Table I), especially between PR and DC, but these metrics were designed to specifically incorporate what the clinician is observing in order to determine whether an image of decreasing quality may still be clinically useful for making a diagnosis. The VGG16 backbone was chosen as it is representative of other established CNNs (Fig. 4). Testing the VGG16 backbone as a function of the number of trainable convolutional blocks shows the level of pre-training required for better performance on unseen patients. The expected trend in Fig. 8 is a "U-shape" for MAE and an "inverted U-shape" for the percentage agreement metrics: at either extreme (either 0 or all 5 of the convolutional blocks trained), there should be decreased performance (higher MAE and lower percentage exact and close agreements). This is because when none of the convolutional blocks are trained, there remain high-level features specific to the ImageNet dataset rather than to medical images, and, similarly, when all parameters are trainable, any pre-training benefit is negated [49]. Unfreezing the final two convolutional blocks' weights resulted in the best percentage exact and close agreements, suggesting a balance between the number of high-level features and the pre-training benefit, agreeing with the work of Yosinski et al. [49].

Fig. 8 informed the investigation into the number of patients in the transfer learning dataset, in which the last two convolutional blocks of the VGG16 backbone were re-trained. The expected trend was that the more patients in the transfer learning dataset, the better the network performance on unseen datasets, as there are more examples from which the model can learn [50]. Fig. 9 indeed shows that increasing the number of patients in the training dataset improved the percentage exact agreement and decreased the MAE, supporting the idea that a larger dataset can improve generalisability to unseen datasets. Dunnmon et al. [25] showed that increasing the number of 2D chest X-ray images from 2000 to 20,000 improved the average area under the receiver operating characteristic curve (AUC) from 0.84 to 0.95, but improvements plateaued afterwards, as using 200,000 images only achieved a non-significant increase in AUC to 0.96. Results from the use of just one patient in the transfer learning dataset were consistently outperformed by using ≥2 patients, suggesting that one patient was insufficient to generalise to unseen data. The percentage close agreement started to plateau at ~70% beyond ≥3 independent training patients, but the percentage exact agreement continually improved. This suggests that, whilst three patients are sufficient for predictions within 0.5 of the clinician scores, a larger dataset (i.e. >7 patients) may be required to achieve exact agreement between the true and predicted scores.

For both Fig. 8 and Fig. 9, the MAE plateaued at ~0.5, corresponding to the inter-session scoring uncertainty within the clinical assessments of the images (Fig. 2). A total of 27 readings (nine images, each scored on three metrics) were re-assessed by the same clinician, with 44% of the readings differing by at least 0.5 from the original assessments. Thus, an MAE of approximately 0.5 may be the optimum achievable for this task.

Table II shows that using pre-training via ImageNet outperformed no pre-training for dose inference. Informed by Fig. 8, pre-training VGG16 with ImageNet weights and retraining the last two blocks was used as the pre-trained network. The pre-trained network vastly outperformed VGG16 with random weight initialisation and training all five convolutional blocks, confirming the benefit of using pre-training instead of training from scratch for this task. The MAE was substantially lower than in the clinical score prediction task, possibly because 56 individual images were available compared to 21 clinically-assessed images.

Comparing all pipelines, Fig. 10 and Table IV show that pre-training provided a benefit to the overall network performance on unseen test data. Fig. 10 shows that the MAE for no pre-training was ~0.92 when the default learning rate of 10⁻³ was used, suggesting that the network did not generalise well to unseen data. By simply decreasing the learning rate to 10⁻⁴, the MAE was substantially reduced to 0.58. Subsequently changing this model to include ImageNet weight initialisation, re-training only the last two convolutional blocks together with the new fully connected layer, resulted in decreases in the MAE across all predicted clinical metrics and test patients, i.e. pre-training via ImageNet improved overall model performance at test time. Introducing the dose inference pretext learning task with no pre-training (all parameters trainable and random weight initialisation) reduced the variation between runs, suggesting the dose inference pretext learning task was the source of the uncertainty reduction. Using pre-training via ImageNet weight initialisation again reduced the MAE compared to the same model with random weight initialisation, indicating that adding another level of pre-training could further (if minimally) improve model performance. Table IV shows that using a pretext learning task gave inferior performance compared to transfer learning directly from ImageNet. This is perhaps because dose inference is more difficult when the images are normalised, as the dose information contained in the overall activity levels is eliminated, and because ImageNet provides more complexity as a feature extractor, whereas the dose inference network has only a single output from which the transfer-learned model can learn. Whilst the MAE was comparable across all three clinical metrics (Fig. 10) for all pipelines, it is arguable that transfer learning directly from a VGG16 backbone, as opposed to carrying out a pretext learning task, is more favourable, as additional patient datasets are not required.

The VGG16 backbone with random initialisation achieved the best performance for two of the 12 evaluation metrics (Table IV), but only narrowly so. However, when ImageNet weights were added, this resulted in best performance for 7 out of 12 evaluation metrics, often by a large margin.

Reducing the learning rate from 10⁻³ to 10⁻⁴ improved model performance across all evaluation metrics. This suggests that at the higher learning rate the training loss converged too early, such that when early stopping was employed, the model did not generalise well to unseen data. Using no pre-training with the reduced learning rate performed best for 2 out of 12 evaluation metrics (Table IV), namely for PR and DC. This may indicate that the VGG16 architecture was sufficient as a feature extractor for these metrics. However, the metrics' standard deviations overlapped between this model and the ImageNet pre-trained model, and GQR prediction was inferior. The Spearman's rank correlation coefficients were also consistently highest for pre-trained networks, suggesting an overall better performance for predicting clinician scores.

Table IV supports Fig. 10 and shows that, generally, GQR is predicted closer to the true score than PR and DC. This may be because the global quality of an image is much easier to predict, as the model can estimate the noise level. It is well known that with decreasing injected dose, the noise level in PET images increases, thus at lower doses images become noisier [51]. This is also supported by Fig. 6, which shows that the correlation between GQR and injected dose is stronger than for PR and DC. However, PR and DC are more subjective metrics that involve the clinician using the images to make a diagnosis and judging their clinical plausibility based on the presence of specific pathological patterns, which is not provided as prior information to the network. In this case, the clinician estimated being able to make a reasonably confident diagnosis based on some of the noisier images. Table III also shows that it is not sufficient to use only the injected dose standardised by patient weight as the input to a machine learning model: despite this model being much smaller and using an easily available label, the best results were achieved when using PET images in the transfer learning dataset, evident from the number of green boxes in the table. Whilst most prior research concerns binary classification and is not in PET imaging, our maximum Spearman's rank correlation coefficients of 0.96 for GQR, 0.67 for PR and 0.59 for DC are comparable to those achieved by Coyner et al. [43] (0.86 − 0.92).

Future work may include training and testing with a larger number of clinically-scored images, to observe the trend beyond seven separate transfer learning datasets. This work could also be applied to other low-dose reconstruction algorithms, such as KEM [7] and MR-guided methods [52], or novel regularisation strategies [53]. As these methods are superior to standard OSEM, they should yield higher predicted quality scores, even at lower simulated doses. However, similarly to [31], this may need to involve the inclusion of reconstructions generated via these algorithms in the training dataset. Further work could also investigate different established CNN backbones. Additionally, the model developed here may be used in different applications, including dynamic PET (where shorter frames are desirable for better time resolution, but also lead to noisier data) and different radioligands. Other recent advances, such as self-supervised learning for pre-training, as used in [54], [55], or vision-and-language pre-training for the similar task of determining clinical evaluations of medical images [56], could be investigated in relation to this clinical task.

Conclusion

This study shows that pre-training can help improve network performance in the task of automatically predicting clinical image quality assessments of PET images. By transfer learning with ImageNet weight initialisation and retraining the last two convolutional blocks of a VGG16 backbone, superior model performance was achieved for this clinical task. Additionally, this method does not require further independent patient datasets, unlike the pretext learning approach. Overall, this work shows that pre-training on natural images for application to medical images has the potential to provide benefit for regressing from images to clinically-scored metrics.

Acknowledgements

The authors would like to thank Dr Veerle Kersemans for her guidance and support, Dr Zuzana Balazova for her work on data acquisition and processing, and Dr Abolfazl Mehranian for his work on low-count PET reconstruction. For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. All authors declare that they have no known conflicts of interest in terms of competing financial interests or personal relationships that could have an influence or are relevant to the work reported in this paper.

This work was funded by the EPSRC Centre for Doctoral Training in Smart Medical Imaging [#EP/S032789/1] and Siemens Healthcare, Erlangen, Germany. This work was supported by the Wellcome/EPSRC Centre for Medical Engineering at King’s College London (WT 203148/Z/16/Z) and the Department of Health via the National Institute for Health Research (NIHR) comprehensive Biomedical Research Centre award to Guy’s & St Thomas’ NHS Foundation Trust in partnership with King’s College London and King’s College Hospital NHS Foundation Trust. This work involved human subjects in its research. Approval of all ethical and experimental procedures and protocols was granted by the institution’s Research and Development department, the North East − York research ethics committee (reference 15/NE/0203), and the radiation protection committee (ARSAC).

Contributor Information

Jessica B. Hopson, Department of Biomedical Engineering, King’s College London

Radhouene Neji, Siemens Healthcare Limited.

Joel T. Dunn, King’s College London & Guy’s and St Thomas’ PET Centre, King’s College London

Colm J McGinnity, King’s College London & Guy’s and St Thomas’ PET Centre, King’s College London.

Anthime Flaus, King’s College London & Guy’s and St Thomas’ PET Centre, King’s College London.

Andrew J. Reader, Department of Biomedical Engineering, King’s College London

Alexander Hammers, King’s College London & Guy’s and St Thomas’ PET Centre, King’s College London.

References

[1] Nasrallah I, Dubroff J. An overview of PET neuroimaging. Semin Nucl Med. 2013;43(6):449–461. doi: 10.1053/j.semnuclmed.2013.06.003.
[2] Galldiks N, Lohmann P, Albert NL, Tonn JC, Langen KJ. Current status of PET imaging in neuro-oncology. Neuro-Oncology Adv. 2019 May;1(1):1–11. doi: 10.1093/noajnl/vdz010.
[3] Kaltoft NS, Marner L, Larsen VA, Hasselbalch SG, Law I, Henriksen OM. Hybrid FDG PET/MRI vs FDG PET and CT in patients with suspected dementia − A comparison of diagnostic yield and propagated influence on clinical diagnosis and patient management. PLoS One. 2019 May;14(5):e0216409. doi: 10.1371/journal.pone.0216409.
[4] Judenhofer MS, et al. Simultaneous PET-MRI: a new approach for functional and morphological imaging. Nat Med. 2008 Apr;14(4):459–465. doi: 10.1038/nm1700.
[5] Gaspar L, et al. Long-term quality assurance of [18F]-fluorodeoxyglucose (FDG) manufacturing. Am J Nucl Med Mol Imaging. 2016;6(3):154.
[6] Kaplan S, Zhu YM. Full-dose PET image estimation from low-dose PET image using deep learning: a pilot study. J Digit Imaging. 2018. doi: 10.1007/s10278-018-0150-3.
[7] Bland J, et al. MR-guided kernel EM reconstruction for reduced dose PET imaging. IEEE Trans Radiat Plasma Med Sci. 2018 May;2(3):235–243. doi: 10.1109/TRPMS.2017.2771490.
[8] Yamashita R, Nishio M, Do RKG, Togashi K. Convolutional neural networks: an overview and application in radiology. Insights Imaging. 2018;9(4):611–629. doi: 10.1007/s13244-018-0639-9.
[9] Kumar A, Kim J, Lyndon D, Fulham M, Feng D. An ensemble of fine-tuned convolutional neural networks for medical image classification. IEEE J Biomed Health Inform. 2017 Jan;21(1):31–40. doi: 10.1109/JBHI.2016.2635663.
[10] Yap MH, et al. Automated breast ultrasound lesions detection using convolutional neural networks. IEEE J Biomed Health Inform. 2018;22(4):1218–1226. doi: 10.1109/JBHI.2017.2731873.
[11] Gondara L. Medical image denoising using convolutional denoising autoencoders. IEEE Int Conf Data Min Workshops (ICDMW); 2016. pp. 241–246.
[12] Milletari F, Navab N, Ahmadi SA. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. Proc 2016 4th Int Conf 3D Vision (3DV); 2016. pp. 565–571.
[13] Kayalibay B, Jensen G, van der Smagt P. CNN-based segmentation of medical imaging data. arXiv. 2017.
[14] Tripathi PC, Bag S. CNN-DMRI: A convolutional neural network for denoising of magnetic resonance images. Pattern Recognit Lett. 2020;135:57–63.
[15] Truhn D, Schrading S, Haarburger C, Schneider H, Merhof D, Kuhl C. Radiomic versus convolutional neural networks analysis for classification of contrast-enhancing lesions at multiparametric breast MRI. Radiology. 2019;290(3):290–297. doi: 10.1148/radiol.2018181352.
[16] Alzubaidi L, et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data. 2021 Mar;8(1):1–74. doi: 10.1186/s40537-021-00444-8.
[17] Willemink MJ, et al. Preparing medical imaging data for machine learning. Radiology. 2020 Feb;295(1):4–15. doi: 10.1148/radiol.2020192224.
[18] Chollet F. Keras. 2015.
[19] Mustafa B, et al. Supervised transfer learning at scale for medical imaging. 2021.
[20] Zbontar J, Jing L, Misra I, LeCun Y, Deny S. Barlow Twins: Self-supervised learning via redundancy reduction. 2021.
[21] Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L. ImageNet: A large-scale hierarchical image database. Proc CVPR. 2009:248–255.
[22] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Communications of the ACM. 2017. Available: https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html [Accessed: 14-Apr-2022].
[23] Guan Q, et al. Deep convolutional neural network VGG-16 model for differential diagnosing of papillary thyroid carcinomas in cytological images: a pilot study. J Cancer. 2019;10(20):4876. doi: 10.7150/jca.28769.
[24] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 3rd International Conference on Learning Representations (ICLR 2015) − Conference Track Proceedings; 2015.
[25] Dunnmon JA, Yi D, Langlotz CP, Ré C, Rubin DL, Lungren MP. Assessment of convolutional neural networks for automated classification of chest radiographs. Radiology. 2019;290(3):537–544. doi: 10.1148/radiol.2018181422.
[26] Tang YX, et al. Automated abnormality classification of chest radiographs using deep convolutional neural networks. npj Digit Med. 2020;3(1). doi: 10.1038/s41746-020-0273-z.
[27] Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the Inception architecture for computer vision. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit; Dec 2015. pp. 2818–2826.
[28] Yadav SS, Jadhav SM. Deep convolutional neural network based medical image classification for disease diagnosis. J Big Data. 2019;6(1).
[29] Neyshabur B, Sedghi H, Zhang C. What is being transferred in transfer learning?
[30] Raghu M, Zhang C, Kleinberg J, Bengio S. Transfusion: Understanding transfer learning for medical imaging. Adv Neural Inf Process Syst. 2019;32 (NeurIPS).
[31] Wen Y, Chen L, Deng Y, Zhou C. Rethinking pre-training on medical imaging. J Vis Commun Image Represent. 2021 Jul;78:103145.
[32] Heker M, Greenspan H. Joint liver lesion segmentation and classification via transfer learning. 2020;1:1–5.
[33] Ridnik T, Ben-Baruch E, Noy A, Zelnik-Manor L. ImageNet-21K pretraining for the masses. 2021:1–20.
[34] Tajbakhsh N, et al. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Trans Med Imaging. 2016;35(5):1299–1312. doi: 10.1109/TMI.2016.2535302.
[35] Naz S, Ashraf A, Zaib A. Transfer learning using freeze features for Alzheimer neurological disorder detection using ADNI dataset. Multimed Syst. 2022;28:85–94.
[36] Kieffer B, Babaie M, Kalra S, Tizhoosh HR. Convolutional neural networks for histopathology image classification: Training vs using pre-trained networks. Proc 7th Int Conf Image Process Theory, Tools Appl (IPTA 2017); 2018. pp. 1–6.
[37] Bosse S, Maniry D, Müller KR, Wiegand T, Samek W. Deep neural networks for no-reference and full-reference image quality assessment. 2018. doi: 10.1109/TIP.2017.2760518.
[38] Yao X, Cao Q, Feng X, Cheng G. Learning to assess image quality like an observer. 2022:1–13. doi: 10.1109/TNNLS.2022.3149534.
[39] Kang L, Ye P, Li Y, Doermann D. Convolutional neural networks for no-reference image quality assessment. 2014.
[40] Sujit SJ, Coronado I, Kamali A, Narayana PA, Gabr RE. Automated image quality evaluation of structural brain MRI using an ensemble of deep learning networks. J Magn Reson Imaging. 2019;50(4):1260–1267. doi: 10.1002/jmri.26693.
[41] Oksuz I, et al. Automatic CNN-based detection of cardiac MR motion artefacts using k-space data augmentation and curriculum learning. Med Image Anal. 2019 Jul;55:136–147. doi: 10.1016/j.media.2019.04.009.
[42] Ma JJ, et al. Diagnostic image quality assessment and classification in medical imaging: Opportunities and challenges. Proc Int Symp Biomed Imaging; 2020. pp. 337–340.
[43] Coyner AS, et al. Automated fundus image quality assessment in retinopathy of prematurity using deep convolutional neural networks. Ophthalmol Retina. 2019 May;3(5):444–450. doi: 10.1016/j.oret.2019.01.015.
[44] Zago GT, Andreão RV, Dorizzi B, Teatini Salles EO. Retinal image quality assessment using deep learning. Comput Biol Med. 2018 Dec;103:64–70. doi: 10.1016/j.compbiomed.2018.10.004.
[45] Hopson JB, et al. Pre-training and transfer learning for training set reduction and improving automated assessments of clinical PET image quality. 2021 IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC); 2021.
[46] Markiewicz PJ, et al. Rapid processing of PET list-mode data for efficient uncertainty estimation and data analysis. Phys Med Biol. 2016;61(13):N322–N336. doi: 10.1088/0031-9155/61/13/N322.
[47] Ashburner J, et al. SPM8 Manual. 2013. Available: https://www.fil.ion.ucl.ac.uk/spm/doc/spm8_manual.pdf [Accessed: 11-Apr-2022].
[48] Abadi M, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. 2016.
[49] Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks?
[50] Zhou L, Pan S, Wang J, Vasilakos AV. Machine learning on big data: Opportunities and challenges. Neurocomputing. 2017 May;237:350–361.
[51] Ladefoged CN, Hasbak P, Hornnes C, Højgaard L, Andersen FL. Low-dose PET image noise reduction using deep learning: Application to cardiac viability FDG imaging in patients with ischemic heart disease. Phys Med Biol. 2021;66(5). doi: 10.1088/1361-6560/abe225.
[52] Mehranian A, et al. PET image reconstruction using multi-parametric anato-functional priors. Phys Med Biol. 2017;62(15):5975–6007. doi: 10.1088/1361-6560/aa7670.
[53] Reader AJ, Ellis S. Bootstrap-optimised regularised image reconstruction for emission tomography. IEEE Trans Med Imaging. 2020 Jun;39(6):2163–2175. doi: 10.1109/TMI.2019.2956878.
[54] He K, Chen X, Xie S, Li Y, Dollar P, Girshick R. Masked autoencoders are scalable vision learners. 2022:15979–15988.
[55] Niu C, Wang G. Unsupervised contrastive learning based transformer for lung nodule detection. Phys Med Biol. 2022. doi: 10.1088/1361-6560/ac92ba.
[56] Moon JH, Lee H, Shin W, Kim YH, Choi E. Multi-modal understanding and generation for medical images and text via vision-language pre-training. IEEE J Biomed Health Inform. 2022:1–10. doi: 10.1109/JBHI.2022.3207502.
