Abstract
Artifacts are a common occurrence in Diffusion MRI (dMRI) scans. Identifying and removing them is essential to ensure the accuracy and viability of any post-processing carried out on these scans. This makes quality control (QC) a crucial first step prior to any analysis of dMRI data. Several QC methods for artifact detection exist; however, they suffer from problems such as requiring manual intervention and the inability to generalize across different artifacts and datasets. In this paper, we propose an automated deep learning (DL) pipeline that utilizes a 3D-DenseNet architecture to train a model on diffusion volumes for automatic artifact detection. Our method is validated on 9000 volumes sourced from 7 large clinical datasets spanning different acquisition protocols (with different gradient directions, high and low b-values, single-shell and multi-shell acquisitions) from multiple scanners. Additionally, they represent diverse subject demographics including age, sex and the presence or absence of pathologies. Our QC method is found to generalize accurately across this heterogeneous data, correctly detecting, on average, 92% of artifacts across our test set. This consistent performance over diverse datasets underlines the generalizability of our method, the lack of which is currently a significant barrier hindering the widespread adoption of automated QC techniques. Thus, 3D-QCNet can be integrated into diffusion pipelines to effectively automate the arduous and time-intensive process of artifact detection.
Keywords: Quality control, Artifacts, MRI, Diffusion MRI, Deep learning
1. Introduction
Diffusion MRI (dMRI) has become a widely adopted imaging technique over the past few years (Baliyan et al., 2016), as it provides a unique insight into white matter (WM) architecture by exploiting the differential water diffusion across tissue types. dMRI is widely used to study conditions where WM abnormalities are expected, such as brain tumors, brain trauma, neurodegenerative and neuropsychiatric disorders (Soares et al., 2013). However, dMRI acquisition suffers from artifacts that can make the resulting scans challenging to use in research studies in the absence of intensive quality control (QC) (Le Bihan et al., 2006; Tournier et al., 2011; Pierpaoli, 2010). Thus, our goal is to develop a deep learning-based pipeline that automatically detects and isolates artifact-ridden volumes in dMRI scans.
dMRI artifacts usually include motion-induced signal dropout, ghosting, herringbone, Gibbs ringing, chemical shift, susceptibility, interslice instability and multi-band interleaving artifacts (Heiland, 2008; Moratal et al., 2008; Krupa and Bekiesińska-Figatowska, 2015; Smith et al., 1991; Wood and Henkelman, 1985; Simmons et al., 1994; Schenck, 1996). Patient motion during the scan can result in signal dropout from individual slices of a volume, in ghosting artifacts, or in artifacts arising from stitching together misaligned data in acquisitions that use slice interleaving or multi-band acceleration. Field inhomogeneity from interfaces between water and fat or between tissue and air (as in the sinuses), or from metal implants, can cause distortions such as chemical shift and susceptibility artifacts.
The detrimental effects of using scans marred by artifacts on further post-processing have been noted extensively; artifacts have been shown to negatively affect the viability of results (Bammer et al., 2003; Van Dijk et al., 2012; Reuter et al., 2015) as well as their interpretation. This makes it imperative to detect these artifacts so that artifactual data can either be excluded from further analysis or corrected as part of a preprocessing pipeline. Hence, QC for the detection of these artifacts is an essential part of a dMRI processing pipeline (Soares et al., 2013).
Currently, most QC is performed by an expert manually going through all scans and flagging those they deem artifactual. This is an arduous and time-intensive process. It is infeasible for large datasets and quickly becomes a bottleneck for any dMRI study; alternatively, researchers resort to spot checking, which may leave substantial artifactual data in the analysis and lead to incorrect results. Further, the unavoidable subjectivity of having different experts annotate different studies leads to uncertain outcomes, since two individuals may differ in what they consider an artifact. This makes a normalized and automated QC method, as we propose, which can replicate or exceed human performance, imperative for any dMRI processing pipeline. Some QC tools already exist, such as FSL EDDY (Andersson et al., 2016; Bastiani et al., 2019), DTI Studio (Jiang et al., 2006), DTIPrep (Oguz et al., 2014) and TORTOISE (Pierpaoli et al., 2010), but they are usually limited to detecting and correcting the specific artifacts they were designed for, mostly motion and eddy current induced distortions (Liu et al., 2015; Kelly et al., 2017; Iglesias et al., 2017; Alfaro-Almagro et al., 2018; Graham et al., 2018). Moreover, they are not entirely accurate: the results of their correction often need subsequent inspection to detect any remaining abnormalities, and their performance in correcting for motion and eddy current artifacts may itself be affected by the presence of artifactual data. These methods are also not exhaustive in covering many other prevalent artifacts like ghosting, herringbone and chemical shift. Deep learning (DL) methods like QC-Automator (Samani et al., 2019) have shown excellent performance in detecting a wide range of artifacts, but have not been tested extensively on patient data. In addition, QC-Automator is based on 2D deep learning models and requires ground-truth annotations for every slice of every volume, which is strenuous for human annotators. This is particularly concerning during model fine-tuning, where an annotated subset is required every time an unseen dataset is to be processed.
Our proposed method, 3D-QCNet, alleviates these issues by implementing a 3D architecture that identifies artifacts in diffusion scans of both patients and healthy subjects, while simultaneously streamlining the task of annotation and optimizing the process of fine-tuning. We show that our method achieves excellent performance on diverse datasets sourced from different scanners, subjects, and pathologies. Thus, we present a user-centric, automated 3D QC pipeline for DTI scans that is tested to be both accurate and flexible on data observed in most practical scenarios. This makes it convenient to integrate into existing dMRI preprocessing pipelines, automating an essential but previously time-consuming and laborious process.
2. Methods and data
2.1. Overview
Our method utilizes a 3D-DenseNet (Huang et al., 2017) architecture which classifies dMRI volumes into an artifact class or a normal class. In the following sections, we will first describe the various datasets we used to train and validate 3D-QCNet. We will then discuss the specifics of our method – its architecture, training process, performance optimizing techniques and finally our evaluation criteria. All work in this paper was carried out with the approval of the IRB of the University of Pennsylvania.
2.2. Data
In this study, we used 7 datasets, of which 3 were utilized for training and validation (Datasets 1, 2 and 3) while the other 4 were used exclusively for testing (Datasets 4, 5, 6 and 7) model performance on unseen data distributions. The specific details for each dataset can be found in Table 1a and Table 1b. These datasets were heterogeneous in nature, having been sourced from scans acquired with diverse scanning parameters, pathologies and subject demographics. The volumes were selected randomly for annotation from each of their respective larger datasets. Since volumes without artifacts largely outnumbered volumes with artifacts, we randomly subsampled and dropped some non-artifactual (normal) volumes from the training/validation set to balance the classes and avoid biasing the model toward the majority (normal) class. No explicit subsampling was done on the test set. This resulted in a total ground-truth labelled dataset of 9258 volumes from 678 subjects. The annotations were done by an expert (DP) with 8 years of experience identifying artifacts in diffusion scans across studies in autism spectrum disorder, brain tumors, traumatic brain injury, etc. The expert identified the following artifacts: motion-induced signal dropout, interslice instability, ghosting, chemical shift and susceptibility. The expert went through each volume, annotating at a rate of approximately 100 volumes every 15 min. Therefore, the total manual time spent labelling all 9258 volumes from the 7 datasets was roughly 22 h, spread across multiple days to avoid rater fatigue.
Table 1a.
Dataset details - acquisition parameters.
| Split | Dataset # | Original (volumes/subjects) | Annotated (volumes/subjects) | Scanner (Siemens 3 T) | TR ∣ TE (ms) | Resolution (mm) | b-value (s/mm²) | # Dirs | Multiband factor |
|---|---|---|---|---|---|---|---|---|---|
| Training - Validation | 1 | 10997/78 | 1263/77 | TimTrio | 6500 ∣ 84 | 2.2 × 2.2 × 2.2 | 1000 | 30 | NA |
| Training - Validation | 2 | 99995/1419 | 2703/94 | Verio | 8100 ∣ 82 | 1.875 × 1.875 × 2 | 1000 | 64 | NA |
| Training - Validation | 3 | 5117/165 | 2945/165 | Verio | 11000 ∣ 76 | 2 × 2 × 2 | 1000 | 30 | NA |
| Testing | 4 | 6440/85 | 1098/44 | PrismaFit | 2900 ∣ 94 | 2.4 × 2.4 × 2.4 | 1000 | 64 | 3 |
| Testing | 5 | 55937/633 | 400/242 | TimTrio | 8000 ∣ 82 | 2.2 × 2.2 × 2.2 | 1000 | 30 | NA |
| Testing | 6 | 68460/585 | 600/30 | Skyra | 9000 ∣ 92 | 2.73 × 2.73 × 2.7 | 1300 | 64 | NA |
| Testing | 7 | 3063/26 | 249/26 | PrismaFit | 4300 ∣ 75 | 300, 800, 2000 | 2 × 2 × 2 | 109 | 2 |
Table 1b.
Dataset details – pathology information.
| Split | Dataset # | Type | % of Artifact Volumes | Age range (y) |
|---|---|---|---|---|
| Training and Validation | 1 | TBI dataset with lesions and WM hyperintensities | 21% | 18–66 |
| Training and Validation | 2 | Developmental dataset | 44% | 8–22 |
| Training and Validation | 3 | Autism dataset | 32% | 6–25 |
| Testing | 4 | TBI dataset with lesions and WM hyperintensities | 41% | 18–71 |
| Testing | 5 | Hypertension, cardiovascular disease and WM hyperintensities | 25% | 55–94 |
| Testing | 6 | TBI dataset with lesions and WM hyperintensities | 29% | 18–71 |
| Testing | 7 | Healthy controls | 53% | 24–26 |
2.3. Data distribution
We distributed our data into 3 sets – training, validation and testing. Datasets 1, 2 and 3 were exclusively used for creating the training and validation sets. The split was done in a stratified manner such that 25% of subjects from both the Artifact and Normal classes were put in the validation set and the remaining were used as part of the training set. This resulted in 5619 training volumes and 1292 validation volumes from a total of 336 subjects. It is worth noting that we split our datasets based on subjects rather than volumes, to prevent data leakage that may occur when different volumes from the same subject are present in both training and validation sets. Our testing set, which constituted 2347 volumes from 342 subjects, was entirely sourced from 4 unseen datasets (Datasets 4, 5, 6 and 7) that were not used in any way during training and validation.
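As an illustration of the subject-level split described above, the sketch below shows one way to stratify subjects (rather than volumes) into training and validation sets. The DataFrame layout, the rule assigning a subject to the artifact class when any of its volumes is artifactual, and the use of scikit-learn are illustrative assumptions, not the exact procedure used in our pipeline.

```python
# Illustrative subject-level stratified split (hypothetical column names).
import pandas as pd
from sklearn.model_selection import train_test_split

def split_by_subject(labels: pd.DataFrame, val_fraction: float = 0.25, seed: int = 0):
    """labels: one row per volume with columns 'subject' and 'label' (1 = artifact)."""
    # Assign each subject a single class (assumed: artifact if any of its volumes is
    # artifactual) so that stratification operates on subjects, preventing leakage.
    subject_labels = labels.groupby("subject")["label"].max()
    train_subjects, val_subjects = train_test_split(
        subject_labels.index,
        test_size=val_fraction,
        stratify=subject_labels.values,
        random_state=seed,
    )
    train_df = labels[labels["subject"].isin(train_subjects)]
    val_df = labels[labels["subject"].isin(val_subjects)]
    return train_df, val_df
```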
2.4. Model and architecture
Our method is inspired by the success that Convolutional Neural Network Architectures (Krizhevsky et al., 2017) have had in the domain of computer vision (Gu et al., 2018). CNNs are adept at identifying spatial relationships in images by determining differentiating features of varying granularity along their depth. These automatically extracted features are then passed through a conventional dense neural network (NN), which is trained end-to-end to optimize the loss function such that the model learns to classify among the different classes.
We adapted a deep learning model to solve a 3D classification problem, with Fig. 1 showing the architecture. We used a special kind of CNN architecture called DenseNet which has been shown to have superior performance compared to other architectures in multiple classification tasks (Huang et al., 2017).
Fig. 1.
3D-QCNet DenseNet model architecture showing how a 3D volume is processed by a series of dense and transition blocks before being sent to the GAP and dense layers for final classification into the two classes – artifact and normal.
The DenseNet architecture consists of dense blocks interconnected using transition layers. Within each dense block, the input is passed through BatchNorm (Ioffe and Szegedy, 2015) and activation layers before being sent to multiple 3D convolutional layers. Significantly, the output of the convolutional layers is concatenated with their input, creating skip connections. This feature differentiates DenseNet from other architectures like ResNet (He et al., 2016), where skip connections exist but concatenation is replaced by addition. The outputs of a dense block are then sent to a transition block, which is responsible for downsampling the feature maps, an essential operation for any CNN model. Since the concatenation operation inside dense blocks enforces shape constraints, downsampling has to be carried out outside of these blocks, in the transition blocks, before the output is sent to the next dense block. These two block types form a chain that processes the feature maps at varying levels of granularity along the depth of the model. The resultant output is sent to a Global Average Pooling (GAP) layer, which further reduces dimensions by averaging across feature maps. The final output can be considered the features extracted by the CNN and is subsequently fed through a set of dense NN layers to obtain the softmax output.
We configured our DenseNet model to have three dense blocks, each with two 3D convolutional layers of kernel size 3 × 3 × 3 and skip connections between their inputs and outputs. The convolutional layers were preceded by BatchNorm and ReLU activation layers. Each transition block consisted of four layers: BatchNorm, activation, convolution and average pooling. The output of the final GAP layer was fed to a single dense layer of size 2 with softmax activation. This output was then optimized against the target distribution through the loss function. Fig. 1 describes the architecture in further detail.
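The sketch below illustrates this configuration in Keras. The layer widths, growth rate and compression factor are illustrative assumptions rather than the exact values used in 3D-QCNet; the block structure (BatchNorm, ReLU and 3 × 3 × 3 Conv3D with concatenation skips, transition blocks with 1 × 1 × 1 convolution and average pooling, followed by GAP and a softmax dense layer) follows the description above.

```python
# Minimal Keras sketch of a 3D DenseNet-style classifier (illustrative widths).
from tensorflow.keras import layers, models

def dense_block(x, layers_per_block=2, growth_rate=16):
    for _ in range(layers_per_block):
        y = layers.BatchNormalization()(x)
        y = layers.Activation("relu")(y)
        y = layers.Conv3D(growth_rate, kernel_size=3, padding="same")(y)
        x = layers.Concatenate()([x, y])          # DenseNet skip connection (concatenation)
    return x

def transition_block(x, compression=0.5):
    y = layers.BatchNormalization()(x)
    y = layers.Activation("relu")(y)
    y = layers.Conv3D(int(int(x.shape[-1]) * compression), kernel_size=1)(y)
    return layers.AveragePooling3D(pool_size=2)(y)  # downsample the feature maps

def build_3d_densenet(input_shape=(96, 96, 70, 1), num_classes=2):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv3D(16, kernel_size=3, padding="same")(inputs)
    for i in range(3):                               # three dense blocks
        x = dense_block(x)
        if i < 2:                                    # transitions between dense blocks
            x = transition_block(x)
    x = layers.GlobalAveragePooling3D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```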
We preferred DenseNet over other architectures for a few reasons (Huang et al., 2017). First, the skip connections allow for easier transmission of gradients during backpropagation, diminishing the vanishing gradient problem. This allows for a deeper architecture, enabling the model to learn features at different levels of granularity. In addition, the number of learnable parameters in a DenseNet architecture is considerably smaller, which reduces the model's tendency to overfit the training data.
Our training setup for the 3D-QCNet model involved a preprocessing step in which we resized all 3D volumes extracted from the diffusion scans to a common size of 96 × 96 × 70. To achieve this, we clipped volumes that had more than 70 slices and added empty slices as padding to those below this threshold. A constant input resolution was a requirement imposed by the model architecture. We chose 70 slices as the clipping threshold as a trade-off between maximizing the information retained per volume and keeping a resolution that could be batched and fit into the memory of most practical GPUs. While training, we used a batch size of 5, due to the increased memory requirements of 3D models, and set the number of epochs to 20. The model was trained to optimize the cross-entropy loss using the Adam optimizer with a learning rate of 10⁻⁴. These hyperparameters were obtained by tuning them on the validation set. Further, as the model did not show signs of overfitting, we abstained from adding regularization and augmentation strategies that are often used to combat this problem.
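A minimal sketch of this preprocessing and training configuration is given below. The exact resampling and intensity handling are assumptions; only the target size (96 × 96 × 70), batch size, epoch count, loss and optimizer settings follow the text above.

```python
# Sketch of volume preprocessing and training setup (resampling details are assumptions).
import numpy as np
import tensorflow as tf
from skimage.transform import resize

TARGET_SHAPE = (96, 96, 70)

def preprocess_volume(vol: np.ndarray) -> np.ndarray:
    # Clip to 70 slices or zero-pad along the slice axis.
    if vol.shape[2] > TARGET_SHAPE[2]:
        vol = vol[:, :, :TARGET_SHAPE[2]]
    elif vol.shape[2] < TARGET_SHAPE[2]:
        pad = TARGET_SHAPE[2] - vol.shape[2]
        vol = np.pad(vol, ((0, 0), (0, 0), (0, pad)), mode="constant")
    # Resample in-plane to 96 x 96 and add a channel dimension.
    vol = resize(vol, TARGET_SHAPE, preserve_range=True, anti_aliasing=True)
    return vol[..., np.newaxis].astype("float32")

model = build_3d_densenet()   # from the architecture sketch above
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",   # cross-entropy on integer labels
    metrics=["accuracy"],
)
# model.fit(train_volumes, train_labels, validation_data=(val_volumes, val_labels),
#           batch_size=5, epochs=20)
```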
On an Nvidia 1080 GPU the model took approximately 3 h to train. The code for our method was written in Python 3.6 and utilizes the TensorFlow and Keras libraries for the model implementation. It is publicly available at https://github.com/adnamad/3D-QCNet.
2.5. Evaluation metrics
We evaluated the performance of our method by defining “artifacts” as the positive class and observing the precision, recall and accuracy metrics which are defined as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Here, TP, FP, FN, TN represent true positive, false positive, false negative and true negative, respectively.
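For concreteness, these metrics can be computed from binary volume-level predictions as in the short helper below, with the artifact class coded as 1.

```python
import numpy as np

def qc_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    # "Artifact" is the positive class (label 1).
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy
```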
2.6. Inference pipeline
Overall, our pipeline works in the following steps when used to curate a new dataset. First, it accepts dMRI scans, extracts 3D volumes and preprocesses them to a constant size before sending the data to the 3D-QCNet model. The model runs in inference mode and provides the user with a predicted artifact/normal label for each volume. Should an annotated test set be provided, performance metrics are calculated and reported. At the end, the user is presented with a report listing the volumes that were found to have an artifact. The user may then choose to remove these volumes before proceeding to further downstream analysis.
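A condensed sketch of these inference steps is shown below. The nibabel-based loading, the file path, and the 0.5 decision threshold (discussed in Appendix A2) are assumptions; the flow of extracting 3D volumes, preprocessing them and flagging artifactual volumes follows the description above.

```python
# Illustrative inference sketch: flag artifactual volumes in a 4D dMRI scan.
import nibabel as nib
import numpy as np

def qc_scan(dwi_path: str, model, threshold: float = 0.5):
    data = nib.load(dwi_path).get_fdata()          # shape: (X, Y, Z, n_volumes)
    flagged = []
    for v in range(data.shape[3]):
        vol = preprocess_volume(data[..., v])      # from the preprocessing sketch above
        prob_artifact = model.predict(vol[np.newaxis])[0, 1]   # assumed class 1 = artifact
        if prob_artifact >= threshold:
            flagged.append((v, float(prob_artifact)))
    return flagged                                  # (volume index, P(artifact)) pairs

# flagged = qc_scan("subject01_dwi.nii.gz", model)
```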
3. Results
3.1. Performance of 3D-QCNet on validation and test data
Table 2 presents the performance of 3D-QCNet on the validation and test sets. The model obtains high metric scores, with an average accuracy of 89% and 95% on the validation and test sets, respectively. Significantly, an average recall of 92% is obtained on the test set, and no large discrepancies in performance are observed across the diverse group of 7 datasets. Fig. 2 lists the accuracy and related details on the validation and test sets, and also provides the ROC curves for all the test datasets. Fig. 3 shows representative examples of 3D-QCNet’s performance with respect to ground truth. Fig. 4 examines a false positive case that the expert annotator had deemed artifact-free but 3D-QCNet flagged as artifactual; on a second inspection, artifacts that the expert had missed in the first pass were found.
Table 2.
Results – 3D-QCNet Model.
| Set | Dataset | Accuracy (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| Validation | Dataset 1 | 97 | 95 | 80 |
| Validation | Dataset 2 | 81 | 91 | 66 |
| Validation | Dataset 3 | 89 | 90 | 82 |
| Validation | Average | 89 | 92 | 76 |
| Test | Dataset 4 | 97 | 84 | 81 |
| Test | Dataset 5 | 92 | 86 | 98 |
| Test | Dataset 6 | 96 | 81 | 100 |
| Test | Dataset 7 | 92 | 95 | 89 |
| Test | Average | 94 | 87 | 92 |
Fig. 2.
ROC curves for Test Datasets.
Fig. 3.
Scans from the test set illustrating 3D-QCNet’s performance with respect to ground truth. True positive samples – (A) ghosting artifact, (B) herringbone artifact, (C) motion/interslice instability artifact, (D) faint chemical shift artifact (marked in yellow). True negative samples – (E) weighted image is noisy but is correctly marked as normal, (F) b0 image with no artifacts. False positive samples – (G) abnormal brain anatomy may be affecting the classifier, (H) weighted image is noisy but there are no visible artifacts; the model may be too sensitive. False negative sample – (I) chemical shift artifact alongside instability and susceptibility.
Fig. 4.
These scans are from a volume that was marked as normal by our annotator but 3D-QCNet labelled it as having an artifact. Later, on closer inspection it was found to have ghosting artifacts in the ventricles and subcortical regions along with some interslice instability.
3.2. Comparisons with QC-Automator
We compare our method to QC-Automator, as it is the only other deep learning-based automated artifact detection tool available. Since QC-Automator generates slice-wise 2D predictions, we modified the algorithm to output volume-level annotations by aggregating the per-slice results: only volumes that had more than 5 artifactual slices in either the axial or sagittal direction were marked as having artifacts. However, as established before, the method relies on fine-tuning to work on unseen datasets; therefore, due to the manually intensive task of labelling individual slices of every volume, we restricted this comparison to Dataset 4. For fine-tuning, we chose a subset of 20 volumes from Dataset 4 and manually annotated each slice with an artifact or normal class label. We then compared the performance of baseline QC-Automator, QC-Automator fine-tuned on a 10% subset, and 3D-QCNet on these 20 volumes. We also ran a non-fine-tuned comparison between the two methods on the entirety of Dataset 4. These results are described in Table 3.
Table 3.
Comparison of 3D-QCNet vs QC-Automator.
| Subset of 20 volumes in Dataset 4 | Accuracy | Precision | Recall |
|---|---|---|---|
| QC-Automator (without 10% for re-training, rejecting volumes with more than 5 artifactual axial or sagittal slices) | 80 | 70 | 90 |
| QC-Automator (with 10% for re-training, rejecting volumes with more than 5 artifactual axial or sagittal slices) | 90 | 90 | 90 |
| 3D-QCNet | 100 | 100 | 100 |

| Entire Dataset 4 | Accuracy | Precision | Recall |
|---|---|---|---|
| QC-Automator (without 10% for re-training, rejecting volumes with more than 5 artifactual axial or sagittal slices) | 81 | 82 | 75 |
| 3D-QCNet | 98.1 | 98.6 | 81.4 |
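The slice-to-volume aggregation rule used in this comparison can be sketched as follows; the per-slice binary predictions are assumed to come from QC-Automator's axial and sagittal classifiers.

```python
# A volume is marked artifactual if more than 5 slices are flagged in either
# the axial or sagittal direction (the rule used for the comparison above).
import numpy as np

def aggregate_slice_predictions(axial_flags, sagittal_flags, min_slices=5):
    n_axial = int(np.sum(axial_flags))        # number of flagged axial slices
    n_sagittal = int(np.sum(sagittal_flags))  # number of flagged sagittal slices
    return 1 if (n_axial > min_slices or n_sagittal > min_slices) else 0
```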
3.3. Comparing across multiple annotators
To ensure our method was not biased to labels from one human expert, we had Dataset 5 labelled by a second annotator, who had a month’s training on identifying artifacts (compared to the expert Annotator 1, who has 8+ years of experience). The results of evaluating 3D-QCNet against the ground-truth labels from the two annotators are shown in Table 4. We also compare the performance of the two annotators relative to each other, as one has significantly more experience in artifact detection than the other.
Table 4.
Results on Dataset 5 when using labels from different annotators.
| Ground Truth | Comparison | Accuracy | Precision | Recall |
|---|---|---|---|---|
| Annotator 1 | 3D-QCNet | 92.75 | 86.5 | 98.85 |
| Annotator 2 | 3D-QCNet | 76 | 75.6 | 75.6 |
| Annotator 1 | Annotator 2 | 80 | 73.1 | 85 |
4. Discussion
4.1. Overview
In this paper, we have presented a QC method to automatically detect artifacts in dMRI scans by utilizing a deep learning model trained on 3D diffusion volumes. Our overarching goal was to present users with an accurate and convenient QC toolkit that works out of the box, with minimal manual intervention needed on any practical dataset. To achieve this, it was important to demonstrate the efficacy and generalizability of our method on multiple realistic scenarios that users may encounter in practice.
4.2. 3D-QCNet robustness and generalizability
In order to establish the robustness of 3D-QCNet, we annotated a large amount of data: 9258 volumes from 678 subjects taken from 7 different datasets (Table 1a and 1b). Of these, data from 4 datasets (2347 volumes / 342 subjects) were used exclusively for testing. Our aim in using these datasets was to demonstrate the efficacy and generalizability of our method on multiple unseen yet realistic scenarios that users can expect in practice.
The 7 datasets presented a significant challenge for validating the robustness of our method, as they contained scans from different sites, scanners, subjects and pathologies. In addition, the 4 test datasets contained a data distribution that was unseen by the model, and yet the evaluation metrics on this set were found to match, and in some instances exceed, the validation performance. This strongly suggests that our method can be used accurately, out of the box, on most datasets regardless of their inherent characteristics. This is significant, as machine learning algorithms are notoriously bad at generalizing to unseen data distributions in the field of medical imaging, and their performance is often found lacking in practical scenarios (Davatzikos, 2019).
Our method is also the first automated QC technique for dMRI that has been tested on clinical datasets containing pathologies. The 3D-QCNet model generalized across a diverse set of diseases with focal abnormalities like WM hyperintensities (Datasets 4 and 5) and non-focal abnormalities (Dataset 3), or a combination thereof, spanning pathologies like autism and traumatic brain injury (Datasets 3, 4 and 6). Our results demonstrate the model’s ability to generalize across 4 unseen test datasets not only with pathologies, but also with substantially different scanning parameters, spanning single-shell, multi-shell and state-of-the-art multiband acquisitions as well as legacy dMRI data. This is especially demonstrated by the high recall values attained on Datasets 5 and 6, which each represented data from multiple sites and scanners, and on Dataset 7, which was a multi-shell acquisition (a scanning characteristic absent from the training data) spanning multiple b-values with different levels of signal to noise ratio (SNR). The fact that the model was able to accurately generalize across datasets with substantial variations such as scanner and pathology differences also suggests that our model is indeed learning to detect artifacts rather than extraneous biological features. Overall, 3D-QCNet’s strong generalization performance allows it to be used as a convenient black box that users can apply to accurately QC any dataset, without being concerned about its intrinsic characteristics.
4.3. Performance analysis and detecting artifacts
While there may be significant differences between scans obtained from multiple sites, scanners and patients, our 3D-QCNet model is still able to perform well on widely varying unseen test datasets. This is evident from the statistics and the ROC curves presented in Fig. 2.
Expert QC annotators tend to prefer a higher artifact detection rate (higher recall) even if it means some non-artifactual cases are marked as corrupted (lower precision). Ideally, we would want the deep learning model to match the performance of a human rater; however, the strenuous nature of annotating a large dataset makes these ratings error-prone (as seen in Fig. 4). Therefore, assuming small inaccuracies inherent in the ground-truth data, a recall close to 100 may be an unreasonable expectation. Table 2 shows the performance of 3D-QCNet. On the test datasets, the method achieves an average recall of 92 while maintaining a precision of 87. Based on our comprehensive analysis of multiple clinical-quality datasets, 3D-QCNet works well on unseen datasets without the need for further modification.
Fig. 3 shows the different artifacts that 3D-QCNet is able to detect. It should be noted that artifacts are detected as outliers, and that there was not enough data to train for some artifacts, such as the corduroy effect. We manually analyzed the model’s predictions on a significant subset of volumes to identify whether any specific artifacts were consistently misclassified. The artifacts annotated in our datasets included motion-induced signal dropout, interslice instability, ghosting, chemical shift and susceptibility. 3D-QCNet was able to detect these cases with high sensitivity, and no particular type of artifact was routinely missed in testing. This is in contrast to manual QC of diffusion data, where an artifact such as a chemical shift, or a motion artifact that does not affect an entire slice, can go undetected if the rater performing QC does not diligently inspect every single slice of the data. In addition, we observed that the model was often able to detect certain artifacts that were not part of the annotated training set, such as the rare herringbone artifact (image B in Fig. 3), although sensitivity for these was not as high as for artifacts with ground-truth labels. While artifacts like motion and eddy current induced ones can be corrected by post-processing, the correction performance may be affected by the presence of other artifacts. Therefore, we recommend applying 3D-QCNet to the entire dataset prior to correction, retaining the data that can be corrected, and then reapplying 3D-QCNet to weed out those volumes that failed the correction step.
Fig. 3 also presents some false positives, one of which is further analyzed in Fig. 4. This example was flagged as artifactual by 3D-QCNet but labeled as normal by the expert; on close inspection, however, it was found to have a subtle artifact that the expert had missed. This underlines the value of an automated system that is sensitive to artifacts: the rater’s effort can be limited to checking the volumes flagged as artifactual by 3D-QCNet, which significantly reduces the human effort and makes it possible to QC large datasets.
4.4. Comparison to other automated QC tools
We compared 3D-QCNet to QC-Automator, as it was the only other deep learning based automated QC technique. The diversity, as well as the magnitude, of the data we annotated was much larger than that used to train and test QC-Automator. In fact, 3D-QCNet was trained and tested on 4 times the number of subjects and included multiple pathologies, whereas QC-Automator was based on healthy controls. This increase in training and testing data was feasible in large part due to the 3D structure of our model, which only requires a single label for an entire volume consisting of multiple slices, markedly reducing manual annotation time compared to 2D slice-wise annotation methods. This led to substantial savings in both time and effort. As a result, we were able to label and train on 3 times more data than used by QC-Automator in considerably less time. It also makes our pipeline significantly more user-friendly and practical when base models need to be fine-tuned on previously unseen datasets.
The performance of the two methods was compared by modifying the latter to predict volume-wise labels: volumes with more than 5 artifactual slices were assigned an artifact label while the others were marked as normal. It must be noted that QC-Automator differs from our method in two significant ways. First, as it was trained on slice-wise ground-truth annotations, manually labelling datasets for training is a cumbersome process. Second, as stated in its paper, when tested on new datasets, the method requires a subset of annotated data for fine-tuning, which may or may not be available. In comparison, 3D-QCNet only requires one ground-truth label per volume, making its datasets easier to label, and it does not require fine-tuning to achieve acceptable performance on new datasets. This is evident in Table 3, where baseline QC-Automator is unable to outperform 3D-QCNet (not fine-tuned) even after being boosted by fine-tuning.
One of the reasons for this difference in performance may be due to the ability of 3D convolutional kernels to capture features across 3 dimensions as opposed to 2. This means they are better able to capture not only slice-wise features but also volumetric features across the depth of a scan. This ability to exploit additional contextual information is what allows 3D methods to outperform 2D ones on many ML tasks such as glioma grade classification and age estimation (Shabanian et al., 2019; Mzoughi et al., 2020). We believe this is also what allows our method to obtain better performance in artifact detection than QC-Automator without the need for fine-tuning, since it is likely that volumetric features captured across depth do a better job of accounting for any domain shift across training and testing datasets.
4.5. Inter-rater comparison
We also analyzed whether our model could have been biased toward annotations from one human rater by obtaining labels for Dataset 5 from a second annotator. We then evaluated 3D-QCNet relative to ground-truth labels from both raters and also compared the raters against each other in Table 4. It is worth noting that Annotator 1 has significantly more experience than Annotator 2 (8 years vs 2 months). Therefore, we make the reasonable assumption that Annotator 1's experience places their labels closer to the 'true' ground truth. This is evident in how Annotator 2's labels compare with Annotator 1's: Annotator 2 only achieves a recall of 83 and a precision of 71 with respect to Annotator 1's labels. Considering that Annotator 2's labels may not be as accurate as Annotator 1's, 3D-QCNet still achieves a recall of 75 and a precision of 75 when using Annotator 2's labels as ground truth. There could be two reasons for this discrepancy: one, as mentioned before, is the inherent inaccuracy of Annotator 2 given their relative inexperience; another is that, due to the data it was trained on, the model is more akin to Annotator 1 and learns their expertise as well as their biases. However, dMRI expertise at the level of Annotator 1 is rare, and 3D-QCNet would be an immense help to clinical sites where such expertise is not readily available.
4.6. Limitations and workarounds
3D-QCNet is very robust across the different datasets we tested; however, it cannot be deemed immune to all data. Thus, fine-tuning of the model may be needed if performance deteriorates on some future dataset. Fine-tuning has been shown to improve the performance of computer vision models when a domain shift exists between training and testing data distributions (Chu et al., 2016). By training on a subset of the target distribution, pre-trained models learn to account for this shift, which in turn makes them more generalizable. We investigate the effects of fine-tuning in Appendix Section A1 and show that, when an annotated set is available for a dataset, a subset of it can be used to fine-tune 3D-QCNet and improve the performance metrics. However, we stress that the evidence in Table 2 suggests that our method generalizes well across a host of diverse datasets, implying that fine-tuning will most likely not be required in most cases.
Further, we acknowledge that different people have different perceptions of what constitutes an artifact. Some are more sensitive while others are more conservative in what they classify as an artifact, especially in boundary cases where the distinction is more subjective. This is illustrated in Fig. 5. We analyze the effects of adjusting the decision threshold on Dataset 7 in Appendix Section A2 and show that, given a validation set, the user may tune the threshold value to balance precision and recall to better account for personal preferences.
Fig. 5.
This volume was marked as having an artifact by 3D-QCNet; however, our expert human rater (DP) believes that while some susceptibility is present, it is not enough to warrant removing the data. In such borderline cases, being able to control the model’s sensitivity through probability thresholds allows users to tune results based on their preferences.
Finally, as noted in Section 4.5, it would be helpful to evaluate the model with annotations from a more diverse set of annotators. This would ensure that the model isn’t biased towards one reviewer and doesn’t model their biases along with their expertise. Annotating QC data on this scale is a massive challenge and our annotation efforts were limited by the availability of qualified experts. However, we expect that 3D-QCNet will fill a void in dMRI QC, and will be actively used by many studies. This will provide curated data for future versions.
4.7. Future work
While our method provides an accurate and convenient automated pipeline to filter artifactual DTI scans, it can be further improved. As 3D-QCNet only reports problematic volumes, one potential modification may involve building a hierarchical classifier in which a head trained on 3D volume labels first identifies artifactual volumes, after which another head trained on 2D slice annotations refines the prediction to identify the problematic slices. This would combine the greater artifact detection accuracy of a 3D classifier with the more granular predictions of a 2D model, and may be useful for users working with smaller datasets who would prefer not to discard an entire volume if only a few slices are affected. Further, since most users are only interested in weeding out the affected volumes, we do not provide a fine-grained classification that identifies different artifact types; however, in the rare cases where such a distinction is desired, our method can be adapted in the future to be trained on data with annotations for individual artifacts. Additionally, the method could be improved by training on data annotated by multiple experts with similar experience, to avoid overfitting to one expert’s biases in the subjective task of quality control. As the code is available online, the community can train it with additional raters and upload future improved versions.
5. Conclusion
Quality control of dMRI data, involving the identification and removal of artifacts, is an essential pre-processing step to ensure accurate subsequent analysis. In this paper, we have developed a deep learning based 3D classification method for artifact detection that is accurate, fast and convenient. It is shown to work effectively across a wide range of datasets with diverse scanner parameters and multiple pathologies. The generalizability of the model across 4 different test datasets suggests that it will work out of the box on future datasets, although fine-tuning has been discussed as a way to further optimize performance. These features, combined with the method’s minimal need for expert intervention, will potentially enable it to be seamlessly integrated into dMRI processing pipelines to effectively automate an essential but previously time-consuming, subjective and manual process. The user can concentrate their effort on the data flagged as artifactual, reducing the manual effort significantly and increasing the size of the data that can be QC-ed. This underlines the significance of our contribution in this age of big data, when enormous amounts of data need to be sifted through.
Acknowledgements
Data from various freely available datasets, as well as clinical datasets, were used for the purposes of testing the model. Only a subset of the data is free for release. The work was supported by the NIH grant NIH R01-MH117807 (PI: Ragini Verma). We thank the following PIs and their corresponding NIH grants for permitting the use of their data for training and testing: NINDS R01 NS065980 (PI: Junghoon Kim), NINDS U01 NS086090 (PI: Ramon Diaz-Arrastia), R01-DC008871 (PI: Timothy Roberts). Data from the Philadelphia Neurodevelopmental Cohort (PNC) was also used for development and has been made freely available to the community by PIs Raquel and Ruben Gur. Data from the Systolic Blood Pressure Intervention Trial was funded by the National Institutes of Health (HHSN268200900040C, HHSN268200900046C, HHSN268200900047C, HHSN268200900048C, and HHSN268200900049C and interagency agreement A-HL-13-002-001). It was also supported in part with resources and use of facilities through the US Department of Veterans Affairs. Additional support was provided by 1S10OD023495-01, 1RF1AG054409, R01-AG055606, funding from the Alzheimer’s Association, and through the National Center for Advancing Translational Sciences UL1RR024134 and UL1TR000003.
Appendix
A1. Fine-Tuning
Fine-tuning allows deep learning models to leverage previously learned knowledge and better adapt or generalize to new tasks. Users who would like to run our model on their data, and who have access to an annotated set to evaluate performance, can utilize fine-tuning to potentially obtain better results. This will be especially useful if there is a significant domain shift between the scans the model was trained on and the scans from the target dataset, which could arise for multiple reasons such as scanner or demographic differences. Results on our test set indicate that 3D-QCNet generalizes well across multiple datasets containing a diverse range of these differences. However, if a future user finds performance lacking on their dataset (assuming they have annotated data for evaluation), they can label an additional subset and fine-tune our model on it. We show in the experiments below that fine-tuning helps boost performance compared to using the vanilla model. This, however, should be balanced against the added annotation effort required of a human annotator.
Table A1 shows a simulated experiment in which we trained a model on 1000 volumes from Dataset 1 and tested it on Datasets 4 and 5. We then fine-tuned the same model on a subset comprising 10% of the volumes from Datasets 4 and 5 and evaluated performance again.
Results in Table A1 indicate that fine-tuning leads to a large improvement in recall while still maintaining sufficiently high precision. We believe this improvement arises because training even on a small subset of the target data allows the model to adapt its learned weights to subtle discriminating features that may be unique to this dataset. Therefore, by leveraging both its broader past knowledge and the more specific new knowledge, the model is able to improve its performance on this new set.
While being able to fine-tune on a target dataset is always desirable, the added labelling effort required to create such a set should be kept in mind. For instance, the 10% fine-tuning data for Dataset 4 included 222 volumes from 102 subjects, which a human expert had to manually inspect to identify artifacts. This can involve considerable human hours and effort. Overall, our 3D-QCNet model demonstrates good generalizability across varied datasets, but users with annotated evaluation data and the resources to label an additional subset can utilize fine-tuning to better adapt the model and boost performance.
Table A1.
Effects of Fine-tuning on a subset of data.
| Dataset | Model | Accuracy | Precision | Recall |
|---|---|---|---|---|
| Dataset 4 | Base model | 77 | 100 | 30 |
| Dataset 4 | Fine-tuned | 88 | 82 | 81 |
| Dataset 5 | Base model | 83 | 100 | 10 |
| Dataset 5 | Fine-tuned | 93 | 80 | 82 |
A2. Flexible thresholds
It is very likely that different expert annotators will have different ideas of what qualifies as an artifact, as some may be more strict or sensitive while others may be more lax. This may also depend on the downstream tasks for which the scans will be used. To account for this, we analyze the effect of different probability decision thresholds on the model's overall precision and recall. By default, the threshold is 0.5, and that is how the results in Table 2 were generated. Decreasing the threshold for the artifact class leads to a more lenient criterion for labelling a volume as having an artifact. This means that more artifacts will be picked up by the model, including ones that may be too faint to be deemed as such, as in Fig. 5. This leads to an overall increase in the number of ground-truth artifact cases being captured, and hence an increase in recall for the artifact class. This will often come at the cost of precision, as the lowered bar will cause many normal scans to be labelled as artifacts as well. The opposite effect, an increase in precision and a decrease in recall, is observed when the threshold is increased.
Fig. A2 shows how precision and recall change as the threshold is varied on Dataset 7. It can be observed from the plot that the precision-recall values follow the pattern described above and that the ideal threshold for a specific user's use case may not be 0.5. Therefore, users who can evaluate the performance of the model on their data through an annotated testing set may tune the threshold based on their desired precision-recall tradeoff on a validation subset of that data.
Fig. A2.
Precision Recall Curve with thresholds visualized for 3D-QCNet applied to Dataset 7.
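The threshold analysis itself amounts to sweeping the decision threshold on the model's artifact probability and recording precision and recall on an annotated validation subset, as in the sketch below (scikit-learn is assumed for the curve computation).

```python
# Sweep decision thresholds on the artifact-class probability and report
# the resulting precision/recall trade-off.
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_sweep(y_true: np.ndarray, artifact_probs: np.ndarray):
    precision, recall, thresholds = precision_recall_curve(y_true, artifact_probs)
    for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
        print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```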
Footnotes
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data Availability
Data will be made available on request.
References
- Alfaro-Almagro F, Jenkinson M, Bangerter NK, Andersson JL, Griffanti L, Douaud G, Sotiropoulos SN, Jbabdi S, Hernandez-Fernandez M, Vallee E, 2018. Image processing and Quality Control for the first 10,000 brain imaging datasets from UK Biobank. Neuroimage 166, 400–424.
- Andersson JL, Graham MS, Zsoldos E, Sotiropoulos SN, 2016. Incorporating outlier detection and replacement into a non-parametric framework for movement and distortion correction of diffusion MR images. Neuroimage 141, 556–572.
- Baliyan V, Das CJ, Sharma R, Gupta AK, 2016. Diffusion weighted imaging: technique and applications. World J. Radiol 8 (9), 785.
- Bammer R, Markl M, Barnett A, Acar B, Alley M, Pelc N, Glover G, Moseley M, 2003. Analysis and generalized correction of the effect of spatial gradient field distortions in diffusion-weighted imaging. Magn. Reson. Med.: Off. J. Int. Soc. Magn. Reson. Med 50 (3), 560–569.
- Bastiani M, Cottaar M, Fitzgibbon SP, Suri S, Alfaro-Almagro F, Sotiropoulos SN, Jbabdi S, Andersson JL, 2019. Automated quality control for within and between studies diffusion MRI data using a non-parametric framework for movement and distortion correction. Neuroimage 184, 801–812.
- Chu B, Madhavan V, Beijbom O, Hoffman J, Darrell T, 2016. Best practices for fine-tuning visual classifiers to new domains. In: Computer Vision – ECCV 2016 Workshops. Springer International Publishing, Cham.
- Davatzikos C, 2019. Machine learning in neuroimaging: progress and challenges. Neuroimage 197, 652.
- Graham MS, Drobnjak I, Zhang H, 2018. A supervised learning approach for diffusion MRI quality control with minimal training data. Neuroimage 178, 668–676.
- Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai B, Liu T, Wang X, Wang G, Cai J, 2018. Recent advances in convolutional neural networks. Pattern Recognit. 77, 354–377.
- He K, Zhang X, Ren S, Sun J, 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Heiland S, 2008. From A as in Aliasing to Z as in Zipper: artifacts in MRI. Clin. Neuroradiol 18 (1), 25–36.
- Huang G, Liu Z, Van Der Maaten L, Weinberger KQ, 2017. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Iglesias JE, Lerma-Usabiaga G, Garcia-Peraza-Herrera LC, Martinez S, Paz-Alonso PM, 2017. Retrospective head motion estimation in structural brain MRI with 3D CNNs. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer.
- Ioffe S, Szegedy C, 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- Jiang H, Van Zijl PC, Kim J, Pearlson GD, Mori S, 2006. DtiStudio: resource program for diffusion tensor computation and fiber bundle tracking. Comput. Methods Prog. Biomed 81 (2), 106–116.
- Kelly C, Pietsch M, Counsell S, Tournier J-D, 2017. Transfer learning and convolutional neural net fusion for motion artefact detection. In: Proceedings of the Annual Meeting of the International Society for Magnetic Resonance in Medicine, Honolulu, Hawaii.
- Krizhevsky A, Sutskever I, Hinton GE, 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60 (6), 84–90.
- Krupa K, Bekiesińska-Figatowska M, 2015. Artifacts in magnetic resonance imaging. Pol. J. Radiol 80, 93.
- Le Bihan D, Poupon C, Amadon A, Lethimonnier F, 2006. Artifacts and pitfalls in diffusion MRI. J. Magn. Reson. Imaging: Off. J. Int. Soc. Magn. Reson. Med 24 (3), 478–488.
- Liu B, Zhu T, Zhong J, 2015. Comparison of quality control software tools for diffusion tensor imaging. Magn. Reson. Imaging 33 (3), 276–285.
- Moratal D, Vallés-Luch A, Martí-Bonmatí L, Brummer ME, 2008. k-Space tutorial: an MRI educational tool for a better understanding of k-space. Biomed. Imaging Interv. J 4 (1).
- Mzoughi H, Njeh I, Wali A, Slima MB, BenHamida A, Mhiri C, Mahfoudhe KB, 2020. Deep multi-scale 3D convolutional neural network (CNN) for MRI gliomas brain tumor classification. J. Digit Imaging 33 (4), 903–915.
- Oguz I, Farzinfar M, Matsui J, Budin F, Liu Z, Gerig G, Johnson HJ, Styner MA, 2014. DTIPrep: quality control of diffusion-weighted images. Front. Neuroinf 8, 4.
- Pierpaoli C, 2010. Artifacts in diffusion MRI. Diffus. MRI: Theory, Methods, Appl 303–317.
- Pierpaoli C, Walker L, Irfanoglu M, Barnett A, Basser P, Chang L, Koay C, Pajevic S, Rohde G, Sarlls J, 2010. TORTOISE: an integrated software package for processing of diffusion MRI data. In: ISMRM 18th Annual Meeting.
- Reuter M, Tisdall MD, Qureshi A, Buckner RL, van der Kouwe AJ, Fischl B, 2015. Head motion during MRI acquisition reduces gray matter volume and thickness estimates. Neuroimage 107, 107–115.
- Samani ZR, Alappatt JA, Parker D, Ismail AAO, Verma R, 2019. QC-Automator: deep learning-based automated quality control for diffusion MR images. Front. Neurosci 13.
- Schenck JF, 1996. The role of magnetic susceptibility in magnetic resonance imaging: MRI magnetic compatibility of the first and second kinds. Med. Phys 23 (6), 815–850.
- Shabanian M, Eckstein EC, Chen H, DeVincenzo JP, 2019. Classification of neurodevelopmental age in normal infants using 3D-CNN based on brain MRI. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
- Simmons A, Tofts PS, Barker GJ, Arridge SR, 1994. Sources of intensity nonuniformity in spin echo images at 1.5 T. Magn. Reson. Med 32 (1), 121–128.
- Smith R, Lange R, McCarthy S, 1991. Chemical shift artifact: dependence on shape and orientation of the lipid-water interface. Radiology 181 (1), 225–229.
- Soares J, Marques P, Alves V, Sousa N, 2013. A hitchhiker’s guide to diffusion tensor imaging. Front. Neurosci 7, 31.
- Tournier J-D, Mori S, Leemans A, 2011. Diffusion tensor imaging and beyond. Magn. Reson. Med 65 (6), 1532.
- Van Dijk KR, Sabuncu MR, Buckner RL, 2012. The influence of head motion on intrinsic functional connectivity MRI. Neuroimage 59 (1), 431–438.
- Wood ML, Henkelman RM, 1985. MR image artifacts from periodic motion. Med. Phys 12 (2), 143–151.