Published in final edited form as: Med Image Anal. 2021 Mar 24;71:101997. doi: 10.1016/j.media.2021.101997

Active, continual fine tuning of convolutional neural networks for reducing annotation efforts

Zongwei Zhou a, Jae Y Shin a, Suryakanth R Gurudu b, Michael B Gotway c, Jianming Liang a,*

Abstract

The splendid success of convolutional neural networks (CNNs) in computer vision is largely attributable to the availability of massive annotated datasets, such as ImageNet and Places. However, in medical imaging, it is challenging to create such large annotated datasets, as annotating medical images is not only tedious, laborious, and time consuming, but it also demands costly, specialty-oriented skills, which are not easily accessible. To dramatically reduce annotation cost, this paper presents a novel method to naturally integrate active learning and transfer learning (fine-tuning) into a single framework, which starts directly with a pre-trained CNN to seek “worthy” samples for annotation and gradually enhances the (fine-tuned) CNN via continual fine-tuning. We have evaluated our method using three distinct medical imaging applications, demonstrating that it can reduce annotation efforts by at least half compared with random selection.

Keywords: Active learning, Annotation cost reduction, Convolutional neural networks, Computer-aided diagnosis, Medical image analysis, Transfer learning

1. Introduction

Convolutional neural networks (CNNs) (LeCun et al., 2015) have ushered in a revolution in computer vision owing to the use of large annotated datasets, such as ImageNet (Deng et al., 2009) and Places (Zhou et al., 2017a). As evidenced by two recent books (Shen et al., 2019; Zhou et al., 2019a) and numerous compelling techniques for different imaging tasks (Moen et al., 2019; Yamamoto et al., 2019; Ravizza et al., 2019; Esteva et al., 2019; Huang et al., 2020; Isensee et al., 2021), there is widespread and intense interest in applying CNNs to medical image analysis, but the adoption of CNNs in medical imaging is hampered by the lack of such large annotated datasets. Annotating medical images is not only tedious and time consuming, but it also requires costly, specialty-oriented knowledge and skills, which are not readily accessible. Therefore, we seek to answer this critical question: How to dramatically reduce the cost of annotation when applying CNNs to medical imaging? In doing so, we have developed a novel method called ACFT (active, continual fine-tuning) to naturally integrate active learning and transfer learning into a single framework. Our ACFT method starts directly with a pre-trained CNN to seek “salient” samples from the unannotated pool for annotation, and the (fine-tuned) CNN is continually fine-tuned using newly annotated samples combined with all misclassified samples. We have evaluated our method in three different applications, including colonoscopy frame classification, polyp detection, and pulmonary embolism (PE) detection, demonstrating that the cost of annotation can be reduced by at least half.

This performance is attributable to a simple yet powerful observation: to boost the performance of CNNs in medical imaging, multiple patches are usually generated automatically for each sample through data augmentation; these patches generated from the same sample share the same label, and are naturally expected to have similar predictions by the current CNN before they are expanded into the training dataset. As a result, their entropy (Shannon, 1948) and diversity (Kukar, 2003) provide a useful indicator of the “power” of a sample for elevating the performance of the current CNN. However, automatic data augmentation inevitably generates “hard” samples, injecting noisy labels. Therefore, to significantly enhance the robustness of active selection, we compute entropy and diversity from only a portion of the patches, according to the majority predictions by the current CNN (detailed in Section 3.3). Furthermore, to strike a balance between exploration and exploitation, we incorporate randomness in our active selection as detailed in Section 3.4; and to prevent catastrophic forgetting, we combine newly selected samples with misclassified samples as described in Section 5.6.

Algorithm 1: ACFT – Active, continual fine-tuning

[Algorithm 1 appears as an image in the original manuscript: pseudocode of the active, continual fine-tuning procedure.]

Several researchers have demonstrated the utility of fine-tuning CNNs for medical image analysis, but they only performed one-time fine-tuning; that is, simply fine-tuning a pre-trained CNN once with all available training samples, involving no active selection processes (Tajbakhsh et al., 2016; Lu et al., 2017; Esteva et al., 2017; Mormont et al., 2018; Ding et al., 2018; Irvin et al., 2019; Zhou et al., 2019c; Chen et al., 2019; Tajbakhsh et al., 2019; Ardila et al., 2019). To our knowledge, our proposed method is among the first to integrate active learning into fine-tuning CNNs in a continual fashion to make CNNs more amenable to medical image analysis, particularly with the intention of dramatically decreasing annotation effort. Compared with conventional active learning, our method, summarized as Algorithm 1, offers eight advantages:

  1. Our algorithm starts with a completely empty labeled dataset, requiring no seed-labeled samples (see Algorithm 1);

  2. Our algorithm actively selects the most informative and representative samples by naturally exploiting expected consistency among the patches within each sample (see Section 3.1);

  3. Our algorithm computes selection criteria locally on a small number of patches within each sample, saving considerable computation time (see Section 3.2);

  4. Our algorithm automatically handles noisy labels via majority selection (see Section 3.3);

  5. Our algorithm balances exploration and exploitation by incorporating randomness into active selection (see Section 3.4);

  6. Our algorithm incrementally improves the learner through continual fine-tuning rather than through repeated re-training (see Section 5.5);

  7. Our algorithm focuses on hard samples, preventing catastrophic forgetting (see Section 5.6);

  8. Our algorithm autonomously balances training samples among classes (see Section 6.2 and Fig. 8).

Fig. 8. The positive/negative ratio in the candidates selected by ACFT, AFT, and RFT. Please note that the ratio in RFT serves as an approximation for the ratio of the entire dataset.

More importantly, our method has the potential to positively impact computer-aided diagnosis (CAD) in medical imaging. The current regulations require that CAD systems be deployed in a “closed” environment, in which all CAD results are reviewed and errors, if any, must be corrected by radiologists. As a result, all false positives are dismissed and all false negatives are supplied, an instant on-line feedback process that makes it possible for CAD systems to be self-learning and self-improving after deployment given the continual fine-tuning capability of our method.

2. Related work

2.1. Our work

We presented AIFT (active, incremental fine-tuning) in our CVPR paper (Zhou et al., 2017b) to integrate active learning and deep learning via continual fine-tuning. Nevertheless, AIFT was limited to binary classifications and medical imaging, and used all labeled samples available at each step, thereby demanding extensive training time and substantial computer memory. Our current approach is a significant extension of our previous work with several major enhancements: (1) generalization from binary classification to multi-class classification; (2) extension from computer-aided diagnosis in medical imaging to scene classification in natural images; (3) combination of newly selected samples with hard (misclassified) ones, to eliminate easy samples for reducing training time, and to concentrate on hard samples for preventing catastrophic forgetting; (4) injection of randomness to enhance robustness in active selection; (5) extensive experimentation with all reasonable combinations of data and models in search of an optimal strategy; (6) demonstration of consistent annotation reduction using different CNN architectures; and (7) illustration of the active selection process using a gallery of patches associated with predictions.

2.2. Transfer learning for medical imaging

Pre-training a model on large-scale image datasets and then fine-tuning it on various target tasks has become a de facto paradigm across many medical specialties. As summarized by Irvin et al. (2019), to classify the common thoracic diseases on chest radiography, nearly all the leading approaches (Guan and Huang, 2018; Guendel et al., 2018; Tang et al., 2018; Ma et al., 2019) follow this paradigm by adopting different architectures along with their weights pre-trained from ImageNet. Other representative medical applications include identifying skin cancer from dermatologist level photographs (Esteva et al., 2017), diagnosing Alzheimer’s Disease (Ding et al., 2018) from 18 F-FDG PET of the brain, and performing effective detection of pulmonary embolism (Tajbakhsh et al., 2019) from CTPA. Recent breakthrough in self-supervised pre-training (Grill et al., 2020; Caron et al., 2020; Chen and He, 2020), on the other hand, has led to visual representation that approaches and possibly surpasses what was learned from ImageNet. Self-supervised pre-training has also been adopted for the medical domain, wherein Zhou et al. (2019c); Zhu et al. (2020); Feng et al. (2020); Haghighi et al. (2020); Azizi et al. (2021) develop generic CNNs that are directly pre-trained from medical images, mitigating the mandatory requirement of expert annotation and reducing the large domain gap between natural and medical images. Despite the immense popularity of transfer learning in medical imaging, these works exclusively employed one-time fine-tuning—simply fine-tuning a pre-trained CNN with available training samples for only one time. In real-world applications, instead of training on a still dataset, experts record new samples constantly and expect the samples to be used upon their availability; with the ability to deal with new data, continual learning is the bridge to active and open world learning (Mundt et al., 2020). Compared with the existing continual learning approaches (Käding et al., 2016; Zhou et al., 2017b), our newly devised learning strategy is more amenable to active fine-tuning because it focuses more on the newly annotated samples and also recognizes those misclassified ones, eliminating repeated training on those easy samples in the annotated pool.

2.3. Integrating active learning with deep learning

Uncertainty and diversity are the two most compelling active selection criteria; they appraise the worthiness of annotating a sample from different aspects. Uncertainty-based criteria argue that the more uncertain a prediction is, the more value is added when the label of that sample is included in the training set. Sampling with least confidence (Culotta and McCallum, 2005), large entropy (Dagan and Engelson, 1995; Mahapatra et al., 2018; Shao et al., 2018; Kuo et al., 2018), or margins (Scheffer et al., 2001; Balcan et al., 2007) of the prediction has been successful in training models with fewer labels than random sampling. The limitation of uncertainty-based criteria is that some of the selected samples are prone to redundancy and outliers (Sourati et al., 2019) and may not be representative enough of the data distribution as a whole. Alternatively, diversity-based criteria have the advantage of selecting a set of the most representative samples, relative to the labeled ones, from the rest of the unlabeled set. The intuition is that there is no need to repeatedly annotate samples carrying similar context information if the most representative one has already been covered. Mutual information (Li and Guo, 2013; Gal et al., 2017), Kullback-Leibler divergence (Kulick et al., 2014; McCallumzy and Nigamy, 1998), Fisher information (Sourati et al., 2018; 2019), K-centers and core sets (Sener and Savarese, 2017), calculated among either model predictions or image features, are often used to ensure diversity. Although they alleviate redundancy and outliers, a serious hurdle of diversity-based criteria is their computational complexity for a large pool of unlabeled samples. We address this issue by measuring diversity over patches augmented from the same sample, making the calculation much more manageable. To exploit the benefits of both aspects, the studies of Wang et al. (2018); Ozdemir et al. (2018); Mahapatra et al. (2018); Shui et al. (2020), as well as our ACFT, explicitly combine uncertainty and diversity. Yang et al. (2017); Beluch et al. (2018); Kuo et al. (2018) further compute the selection criteria from an ensemble of CNNs—these approaches are, however, very costly in computation, as they must train a set of models to compute their uncertainty measure based on models’ disagreements. For additional active learning methods, we refer the reader to comprehensive literature reviews (Tajbakhsh et al., 2020; Munjal et al., 2020; Hino, 2020; Ren et al., 2020); but these existing methods are fundamentally different from our ACFT in that they all repeatedly retrain CNNs from scratch at each step, whereas we continually fine-tune the (fine-tuned) CNN incrementally. As a result, our ACFT offers several advantages, as listed in Section 1, and leads to dramatic annotation cost reduction and computational efficiency. In addition, we have found that there are only seven fundamental patterns in CNN predictions, as summarized in Section 3.1. Multiple methods may be developed to select a particular pattern: entropy, Gaussian distance, and standard deviation would seek Pattern A, while diversity, variance, and divergence look for Pattern C. We are among the first to analyze the prediction patterns in active learning and investigate the effectiveness of typical patterns rather than comparing the many methods.

3. Proposed method

ACFT was conceived in the context of computer-aided diagnosis (CAD) applied to medical imaging. A CAD system typically employs a candidate generator, which can quickly produce a set of candidates, among which some are true positives and others are false positives. To train a classifier, each of the candidates must be labeled. In this work, an object to be labeled is considered a “candidate” in general. We assume that each candidate takes one of $|Y|$ possible labels. To boost CNN performance for CAD systems, multiple patches are usually generated automatically for each candidate through data augmentation; those patches that are generated from the same candidate inherit the candidate’s label. In other words, all labels are acquired at the candidate level. Mathematically, given a set of candidates $U = \{C_1, C_2, \ldots, C_n\}$, where $n$ is the number of candidates, and each candidate $C_i = \{x_i^1, x_i^2, \ldots, x_i^m\}$ is associated with $m$ patches, our ACFT algorithm iteratively selects a set of candidates for labeling, as illustrated in Algorithm 1.
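To make the flow of Algorithm 1 concrete, the following is a minimal Python/NumPy sketch of the ACFT loop. It is not the authors' released Matlab/Caffe implementation, and the callables `predict_patches`, `worthiness`, `annotate`, and `fine_tune` are hypothetical placeholders for the candidate-level prediction, the selection criterion of Sections 3.2–3.4, the expert labeling step, and the continual fine-tuning step, respectively.

```python
# Minimal sketch of the ACFT loop (our reading of Algorithm 1, not the released code).
# predict_patches(model, c) -> (m, |Y|) array of patch probabilities for candidate c
# worthiness(probs)         -> scalar a_i from Eqs. (1)-(2) (with majority selection)
# annotate(candidates)      -> labels supplied by an expert
# fine_tune(model, X, y)    -> the CNN continually fine-tuned on (X, y)
import numpy as np

def acft(model, unlabeled, b, steps, predict_patches, worthiness, annotate, fine_tune):
    labeled, labels = [], []                      # starts with an empty labeled set
    for _ in range(steps):
        # score every remaining unlabeled candidate with the *current* CNN
        scores = np.array([worthiness(predict_patches(model, c)) for c in unlabeled])
        picked = set(np.argsort(scores)[::-1][:b].tolist())  # top-b (Sec. 3.4 randomizes this)
        queried = [c for i, c in enumerate(unlabeled) if i in picked]
        new_labels = list(annotate(queried))                  # expert annotates Q
        # H: previously labeled candidates the current CNN misclassifies
        hard = [(c, y) for c, y in zip(labeled, labels)
                if int(predict_patches(model, c).mean(axis=0).argmax()) != y]
        train_x = queried + [c for c, _ in hard]              # H plus Q
        train_y = new_labels + [y for _, y in hard]
        model = fine_tune(model, train_x, train_y)            # continual fine-tuning from M_{t-1}
        labeled += queried
        labels += new_labels
        unlabeled = [c for i, c in enumerate(unlabeled) if i not in picked]
    return model, labeled, labels
```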

ACFT is generic and applicable to many tasks in computer vision and image analysis. For clarity, we illustrate the ideas behind ACFT with the Places-3 dataset (Zhou et al., 2017a) for scene classification in natural images (see Fig. A.10), where no candidate generator is needed, as each image may be directly regarded as a candidate.

Designing an active learning algorithm involves two key issues: (1) how to determine the “worthiness” of a candidate for annotation and (2) how to update the classifier/learner. In the following sections, we first illustrate our hypothesis in Section 3.1 with Fig. 1 and Table 1, and then detail each component of our active selection criteria together with its rationale and benefit.

Fig. 1. Automatic data augmentation inevitably generates noisy patches, and there is no need to classify all patches confidently. Therefore, we propose majority selection, which computes active selection criteria on only the top 25% of the patches with the highest confidences on the dominant predicted category. To demonstrate the necessity of majority selection, we illustrate two images (A and B) and their augmented patches, arranged according to the dominant category predicted by the CNN. Based on Places-3, Image A is labeled as living room, and its augmented patches are mostly incorrectly classified by the current CNN; therefore, including it in the training set is of great value. On the contrary, Image B is labeled as office, and the current CNN classifies most of its augmented patches as office with high confidence; labeling it would be of limited utility. Without majority selection, the criteria would mislead the selection, as it indicates that Image B is more diverse than Image A (297.52 vs. 262.39) while sharing similar entropy (17.33 vs. 18.50). With majority selection, the criteria show that Image A is considerably more uncertain and diverse than Image B, measured by either entropy (4.59 vs. 2.17) or diversity (9.32 vs. 0.35), and as expected, more worthy of labeling. From this active selection analysis, we remark that the majority selection is a critical component in our ACFT.

Table 1.

Active selection patterns analysis. We illustrate the relationships among seven prediction patterns and four active selection criteria, assuming that a candidate Ci has 11 augmented patches, and their probabilities Pi are predicted by the current CNN, presented in the second column. With majority selection, the entropy and diversity are calculated based on the top 25% (3 patches in this illustration) highest confidences on the dominant predicted category. The first choice of each method (column) is bolded and the second choice is underlined.

[Table 1 appears as an image in the original manuscript: patch-level probabilities for the seven prediction patterns and the corresponding entropy and diversity scores, with and without majority selection.]

3.1. Illustrating active candidate selection

Fig. 1 shows the active candidate selection process for multi-class classification. To facilitate comprehension, Table 1 illustrates the process in the context of binary classification. Assuming the prediction of patch $x_i^j$ by the current CNN is $P_i^j$, we call the histogram of $P_i^j$, $j \in [1, m]$, the prediction pattern of candidate $C_i$. As shown in Row 1 of Table 1, in binary classification, there are seven typical prediction patterns (a brief illustrative sketch follows the list below):

  1. Pattern A is mostly concentrated at 0.5, with a higher degree of uncertainty. Most active learning algorithms (Settles, 2009; Guyon et al., 2011) favor these types of candidates, as they are effective for reducing uncertainty.

  2. Pattern B is flatter than Pattern A, as the patches’ predictions are spread widely from 0 to 1 with a higher degree of inconsistency among the patches’ predictions. Since all the patches belonging to a candidate are generated via data augmentation, they (at least the majority) are expected to make similar predictions. These types of candidates have the potential to significantly enhance the current CNN’s performance.

  3. Pattern C is clustered at both ends, with a higher degree of diversity. These types of candidates are most likely associated with noisy labels at the patch level, as illustrated in Fig. 2 (c), and they are the least favorable for use in active selection because they may cause confusion when fine-tuning the CNN.

  4. Patterns D and E are clustered at either end (i.e., 0 or 1), with a higher degree of certainty. These types of candidates should not undergo annotation at this step because it is likely the current CNN has correctly predicted them, and therefore these candidates would contribute very little towards fine-tuning the current CNN.

  5. Patterns F and G have a higher degree of certainty for some of the patches’ predictions but are associated with some outliers. These types of candidates are valuable because they are capable of smoothly improving the CNN’s performance. While such candidates might not make dramatic contributions, they do not significantly degrade the CNN’s performance either.
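As a brief illustration of how a prediction pattern is formed in the binary case, the sketch below bins a candidate's patch-level positive-class probabilities into a histogram. The probability values are invented for illustration only (they roughly resemble the widely spread Pattern B) and are not taken from Table 1.

```python
# Illustrative sketch: a candidate's "prediction pattern" is simply the histogram
# of its patches' positive-class probabilities (values below are made up).
import numpy as np

patch_probs = np.array([0.05, 0.15, 0.30, 0.42, 0.50, 0.58, 0.66, 0.75, 0.83, 0.91, 0.97])
counts, edges = np.histogram(patch_probs, bins=10, range=(0.0, 1.0))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.1f}, {hi:.1f}): {'#' * c}")   # a flat, spread-out histogram (Pattern B-like)
```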

Fig. 2. Three examples of colonoscopy frames: (a) informative, (b) non-informative, and (c) ambiguous. “Ambiguous” frames are labeled as “informative” because experts label frames based on the overall quality: if over 75% of a frame (i.e., candidate in this application) is clear, the frame is considered “informative”. As a result, an ambiguous candidate contains both clear and blurred components, and generates noisy labels at the patch level from automatic data augmentation. For example, the entire frame (c) is labeled as “informative,” but not all the patches (d) associated with this frame are “informative”, although they inherit the “informative” label. This limitation is the main motivation for the majority selection approach in our ACFT method.

3.2. Seeking worthy candidates

In active learning, the key is to develop criteria for determining candidate annotation “worthiness”. Our criteria for candidate “worthiness” are based on a simple, yet powerful, observation: all patches augmented from the same candidate (Fig. 1) share the same label; therefore, they are expected to have similar predictions by the current CNN. As a result, their entropy and diversity provide a useful indicator of the “power” of a candidate for elevating the performance of the current CNN. Intuitively, entropy captures classification uncertainty—a higher value denotes a greater degree of information (e.g., Pattern A in Table 1), whereas diversity indicates prediction consistency among the candidate patches—a higher diversity value denotes a greater degree of prediction inconsistency (e.g., Pattern C in Table 1). Formally, assuming that each candidate takes one of $|Y|$ possible labels, we define the entropy and diversity of $C_i$ as

$$e_i = -\frac{1}{m}\sum_{k=1}^{|Y|}\sum_{j=1}^{m} P_i^{j,k}\,\log P_i^{j,k}, \qquad d_i = \sum_{k=1}^{|Y|}\sum_{j=1}^{m}\sum_{l=j}^{m}\left(P_i^{j,k}-P_i^{l,k}\right)\log\frac{P_i^{j,k}}{P_i^{l,k}} \tag{1}$$

Combining entropy and diversity yields

$$a_i = \lambda_1 e_i + \lambda_2 d_i \tag{2}$$

where $\lambda_1$ and $\lambda_2$ are trade-off weights between entropy and diversity. We use two parameters for convenience, so that entropy or diversity can easily be turned on or off during experiments.
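A minimal NumPy sketch of Eqs. (1)–(2) as reconstructed above (our own illustration, not the authors' released code); the small epsilon inside the logarithms, added to avoid log(0), is an implementation detail not specified in the paper.

```python
import numpy as np

def entropy_diversity(P, lam1=1.0, lam2=1.0, eps=1e-12):
    """P: (m, |Y|) patch-level predictions for one candidate (each row sums to 1).
    Returns a_i = lam1 * e_i + lam2 * d_i, combining entropy and diversity (Eqs. 1-2)."""
    m = P.shape[0]
    e = -np.sum(P * np.log(P + eps)) / m                 # entropy, averaged over patches
    d = 0.0                                              # pairwise symmetric divergence
    for j in range(m):
        for l in range(j, m):
            d += np.sum((P[j] - P[l]) * np.log((P[j] + eps) / (P[l] + eps)))
    return lam1 * e + lam2 * d

# e.g., inconsistent patch predictions yield a much larger score than consistent ones:
consistent = np.array([[0.9, 0.1]] * 4)
inconsistent = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]])
print(entropy_diversity(consistent), entropy_diversity(inconsistent))
```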

3.3. Handling noisy labels via majority selection

Automatic data augmentation is essential for boosting CNN performance, but it inevitably generates “hard” samples for some candidates, as shown in Fig. 2 (c), injecting noisy labels. Therefore, to significantly enhance the robustness of our method, we compute entropy and diversity by selecting only a portion of the patches of each candidate according to the predictions by the current CNN.

Specifically, for each candidate Ci we first determine its dominant category, which is defined by the category with the highest confidence in the mean prediction. That is,

$$\hat{y}_i = \arg\max_{y\in Y}\ \frac{1}{m}\sum_{j=1}^{m} P_i^{j,y} \tag{3}$$

where $P_i^{j,y}$ is the output of the current CNN on label $y$ for patch $x_i^j \in C_i$. After sorting the patches of $C_i$ by their confidence on the dominant category $\hat{y}_i$, we apply Eq. 2 to the top $\alpha \times 100\%$ of the patches to construct the score matrix $a_i$ of size $\alpha m \times \alpha m$ for each candidate $C_i$ in $U$. Our proposed majority selection method automatically excludes the patches with noisy labels (see Table 1: diversity vs. diversity$_\alpha$) because of their low confidences.
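A sketch of the majority-selection step under our assumptions about rounding and tie-breaking: the dominant category comes from the mean patch prediction (Eq. 3), and only the top α fraction of patches, ranked by their confidence on that category, are kept for scoring. With the 11 patches of Table 1 and α = 1/4, this keeps 3 patches, matching the caption.

```python
import numpy as np

def majority_selection(P, alpha=0.25):
    """P: (m, |Y|) patch predictions for one candidate.
    Returns (P_top, y_hat): the top alpha*100% of patches ranked by confidence on the
    dominant category y_hat (Eq. 3); scoring (Eq. 2) is then applied to P_top only."""
    y_hat = int(np.argmax(P.mean(axis=0)))            # dominant category
    k = max(1, int(round(alpha * P.shape[0])))        # number of patches to keep
    top = np.argsort(P[:, y_hat])[::-1][:k]           # most confident on y_hat
    return P[top], y_hat
```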

3.4. Injecting randomization in active selection

As discussed in Borisov et al. (2010) and Zhou et al. (2017b), simple random selection may outperform active selection at the beginning, because the active selection method depends on the current CNN to select examples for labeling. As a result, a poor selection made at an early stage may adversely affect the quality of subsequent selections, whereas the random selection approach is less frequently locked into a poor hypothesis. In other words, the active selection method concentrates on exploiting the knowledge gained from the labels already acquired to further explore the decision boundary, whereas the random selection approach concentrates solely on exploration, and is thereby able to locate areas of the feature space where the classifier performs poorly. Therefore, an effective active learning strategy must strike a balance between exploration and exploitation. Towards this end, we inject randomization into our method by selecting candidates according to the sampling probability $a_i^s$:

$$a'_i = \frac{a'_i - a'_{\omega b}}{a'_1 - a'_{\omega b}}, \qquad a_i^s = \frac{a'_i}{\sum_{i=1}^{\omega b} a'_i}, \qquad i \in [1, \omega b] \tag{4}$$

where $a'_i$ denotes $a_i$ sorted by value in descending order, and $\omega$ is called the random extension. Suppose $b$ candidates are required for annotation. Instead of selecting the top $b$ candidates, we extend the candidate selection pool to $\omega b$ and then select candidates from this pool according to their sampling probabilities $a_i^s$, thereby injecting randomization.
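The randomized selection of Eq. (4) can be sketched as follows (our reading of the normalization; the epsilon terms and the handling of tied scores are implementation choices of ours):

```python
import numpy as np

def randomized_selection(scores, b, omega=5, rng=None):
    """scores: worthiness a_i of every unlabeled candidate; b: annotation budget.
    Extends the pool to the top omega*b candidates, converts their sorted scores into
    sampling probabilities (Eq. 4), and samples b candidate indices without replacement."""
    rng = np.random.default_rng() if rng is None else rng
    pool = np.argsort(scores)[::-1][: omega * b]              # extended candidate pool
    a = np.asarray(scores, dtype=float)[pool]                 # sorted in descending order
    p = (a - a[-1]) / (a[0] - a[-1] + 1e-12)                  # min-max normalization (Eq. 4)
    p = (p + 1e-12) / (p + 1e-12).sum()                       # strictly positive probabilities
    chosen = rng.choice(len(pool), size=min(b, len(pool)), replace=False, p=p)
    return pool[chosen]
```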

4. Experiments

4.1. Medical applications

4.1.1. Colonoscopy frame classification

Image quality assessment in colonoscopy can be viewed as an image classification task whereby an input image is labeled as either informative or non-informative. One way to measure the quality of a colonoscopy procedure is to monitor the quality of the captured images. Such quality assessment can be used during live procedures to limit low-quality examinations or, in a post-processing setting, for quality monitoring purposes. In this application, colonoscopy frames are regarded as candidates, since the labels (informative or non-informative) are associated with frames, as illustrated in Fig. 2 (a–c). In total, there are 4,000 colonoscopy candidates from 6 complete colonoscopy videos. A trained expert manually labeled the collected images as informative or non-informative (line 11 in Algorithm 1). A gastroenterologist further reviewed the labeled images for corrections. The labeled frames are separated at the video level into training and test sets, each containing approximately 2,000 colonoscopy frames. For data augmentation, we extracted 21 patches from each frame, as shown in Fig. 2 (d).

4.1.2. Polyp Detection

Polyps, as shown in Fig. 3, can present themselves in the colonoscopy with substantial variations in color, shape, and size. The variable appearance of polyps can often lead to misdetection, particularly during long and back-to-back colonoscopy procedures where fatigue negatively affects the performance of colonoscopists. Computer-aided polyp detection may enhance optical colonoscopy screening accuracy by reducing polyp misdetection. In this application, each polyp detection is regarded as a candidate. The dataset contains 38 patients with one video each. The training dataset is composed of 21 videos (11 with polyps and 10 without polyps), while the testing dataset is composed of 17 videos (8 videos with polyps and 9 videos without polyps). At the video level, the candidates are divided into the training dataset (16,300 candidates) and test dataset (11,950 candidates). At each polyp candidate location with the given bounding box, we performed data augmentation by a factor f ∈ {1.0, 1.2, 1.5}. At each scale, we extracted patches after the candidate is translated by 10 percent of the resized bounding box in vertical and horizontal directions. We further rotated each resulting patch 8 times by mirroring and flipping. The patches generated by data augmentation belong to the same candidate. Each candidate contains 24 patches.
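To make the augmentation scheme concrete, here is a rough sketch under our own assumptions about crop geometry (the exact offsets and the composition of the 8 mirror/flip variants are not fully specified above): 3 scaling factors × 8 orientation variants yield the 24 patches per candidate, each crop jittered by up to 10% of the resized bounding box.

```python
import numpy as np

def augment_candidate(image, box, factors=(1.0, 1.2, 1.5), jitter=0.1, rng=None):
    """image: (H, W) grayscale array; box: (x, y, w, h) candidate bounding box.
    Returns 24 patches (3 scales x 8 mirror/flip variants), all inheriting the
    candidate's label; geometry details are our assumptions, not the paper's exact recipe."""
    rng = np.random.default_rng() if rng is None else rng
    x, y, w, h = box
    patches = []
    for f in factors:                                   # 3 scaling factors
        nw, nh = int(w * f), int(h * f)
        dx = int(jitter * nw * rng.uniform(-1, 1))      # translate by up to 10% of the
        dy = int(jitter * nh * rng.uniform(-1, 1))      # resized box, in x and y
        cx, cy = x + w // 2 + dx, y + h // 2 + dy
        x0, y0 = max(cx - nw // 2, 0), max(cy - nh // 2, 0)
        crop = image[y0:y0 + nh, x0:x0 + nw]
        for k in range(4):                              # 4 rotations x 2 flips = 8 variants
            rot = np.rot90(crop, k)
            patches.append(rot)
            patches.append(np.fliplr(rot))
    return patches                                      # 3 x 8 = 24 patches
```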

Fig. 3. Polyps in colonoscopy videos with different shape and appearance.

4.1.3. Pulmonary embolism detection

Pulmonary embolism (PE) is a major national health problem, and computer-aided PE detection could play a major role in improving PE diagnosis and decreasing the reading time required for CTPA datasets. We employed a database consisting of 121 CTPA datasets with a total of 326 PE instances. Each PE detection is regarded as a candidate with 50 patches. We divided candidates at the patient level into a training dataset, with 434 true positives (199 unique PE instances) and 3,406 false positives, and a testing dataset, with 253 true positives (127 unique PE instances) and 2,162 false positives. The overall PE probability is calculated by averaging the probabilistic prediction generated for the patches within a given PE candidate after data augmentation.

4.2. Baselines and implementation

4.2.1. Active learning strategy baselines

Tajbakhsh et al. (2016) reported the state-of-the-art performance of fine-tuning and of learning from scratch using entire datasets, which we use to establish baseline performance for comparison. These authors also investigated the performance of (partial) fine-tuning using a sequence of partial training datasets, but our dataset partitions are different from theirs. Therefore, for fair comparison with their approach, we introduce RFT, which fine-tunes the original CNN model M0 from the beginning, using all available labeled samples L and Q, where Q is randomly selected at each step.

We summarize several active learning strategies in Table 2. Studying different active learning strategies is important because the active learning procedure can be very computationally inefficient in practice, in terms of label reuse and model reuse. We present two strategies that aim at overcoming these limitations. First, we propose to combine newly annotated data with the labeled data that are misclassified by the current CNN. Second, we propose continual fine-tuning to speed up model training and, in turn, encourage data reuse. ACFT(HQ) denotes the optimized learning strategy, which continually fine-tunes the current CNN model Mt−1 using the newly annotated candidates Q enlarged by the misclassified candidates H. Compared with other learning strategy baselines (Tajbakhsh et al., 2016; Zhou et al., 2017b; 2019b), as codified in Table 2, ACFT(HQ) saves training time through faster convergence compared with repeatedly fine-tuning the original pre-trained CNN, and boosts performance by eliminating easy samples, focusing on hard samples, and preventing catastrophic forgetting. In all three applications, our ACFT begins with an empty training dataset and directly uses CNNs (AlexNet and GoogLeNet) pre-trained on ImageNet.

Table 2.

Active learning strategy definition. We have codified different learning strategies covering the makeup of training samples and the initial CNN weights of fine-tuning.

Code        Description of learning strategy
RFT(LQ)     Fine-tuning from M0 using L and randomly selected Q
AFT(LQ)     Fine-tuning from M0 using L and actively selected Q
ACFT(Q)     Continual fine-tuning from Mt−1 using actively selected Q only
ACFT(LQ)    Continual fine-tuning from Mt−1 using L and actively selected Q
ACFT(HQ)    Continual fine-tuning from Mt−1 using H and actively selected Q

1 L: Annotated candidates. 2 Q: Newly annotated candidates. 3 H: Misclassified candidates. 4 M0: CNN pre-trained on a large-scale dataset (e.g., ImageNet). 5 Mt−1: CNN fine-tuned at the previous active learning step.
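The strategies in Table 2 differ only in which candidates are fed to the fine-tuner and from which weights fine-tuning starts; below is a compact sketch (with hypothetical list arguments) of the training-set makeup of each strategy.

```python
# Sketch of the training-set makeup per learning strategy in Table 2.
# L: previously annotated candidates, Q: newly annotated candidates,
# H: candidates in L misclassified by the current CNN (all plain Python lists here).
def training_set(strategy, L, Q, H):
    if strategy in ("RFT(LQ)", "AFT(LQ)"):   # fine-tune from M0 on L plus Q
        return L + Q
    if strategy == "ACFT(Q)":                # continual fine-tuning from Mt-1 on Q only
        return Q
    if strategy == "ACFT(LQ)":               # continual fine-tuning from Mt-1 on L plus Q
        return L + Q
    if strategy == "ACFT(HQ)":               # continual fine-tuning from Mt-1 on H plus Q
        return H + Q
    raise ValueError(f"unknown strategy: {strategy}")
```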

4.2.2. Experimental settings

We have investigated the effectiveness of ACFT in four applications: scene classification, colonoscopy frame classification, polyp detection, and pulmonary embolism (PE) detection. Ablation studies have been conducted to confirm the benefits of our majority selection and randomization, which are built upon conventional entropy- and diversity-based active selection criteria. For all four applications, we set α to 1/4 and ω to 5. Matlab and the Caffe deep learning library are utilized to implement active learning and transfer learning (more details can be found at https://github.com/MrGiovanni/Active-Learning). We based our experiments on AlexNet and GoogLeNet because their architectures offer an optimal depth balance: deep enough to investigate the impact of ACFT and AFT on pre-trained CNN performance, but shallow enough to conduct experiments quickly. The learning parameters used for training and fine-tuning of AlexNet in our experiments are summarized in Table 3. The Adam optimizer is utilized to optimize the objective functions described in our paper. The batch size is 512 in the learning procedure.

Table 3.

Learning parameters used for training and fine-tuning of AlexNet for AFT in our experiments. μ is the momentum, lr_fc8 is the learning rate of the weights in the last layer, lr is the learning rate of the weights in the remaining layers, and γ determines how lr decreases over epochs. “Epoch” indicates the number of epochs used in each step. For ACFT, all the parameters are set to the same values as for AFT except the learning rate lr, which is set to 1/10 of that for AFT.

Applications μ lr lr fc8 γ epoch
Colonoscopy frame classification 0.9 1e-4 1e-3 0.95 8
Polyp detection 0.9 1e-4 1e-3 0.95 10
Pulmonary embolism detection 0.9 1e-3 1e-2 0.95 5
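For reference, the hyperparameters in Table 3 could be written down as a simple configuration (a sketch of ours, not the authors' actual solver files); the only change for ACFT is dividing lr by 10, as stated in the caption.

```python
# Sketch of the AFT fine-tuning hyperparameters from Table 3;
# ACFT uses the same values except lr, which is divided by 10.
AFT_PARAMS = {
    "colonoscopy_frame_classification": {"momentum": 0.9, "lr": 1e-4, "lr_fc8": 1e-3, "gamma": 0.95, "epochs": 8},
    "polyp_detection":                  {"momentum": 0.9, "lr": 1e-4, "lr_fc8": 1e-3, "gamma": 0.95, "epochs": 10},
    "pulmonary_embolism_detection":     {"momentum": 0.9, "lr": 1e-3, "lr_fc8": 1e-2, "gamma": 0.95, "epochs": 5},
}

ACFT_PARAMS = {app: {**cfg, "lr": cfg["lr"] / 10} for app, cfg in AFT_PARAMS.items()}
```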

5. Results

In this section, Fig. 5 begins with an overall performance comparison between our active continual fine-tuning (ACFT) and random fine-tuning (RFT), revealing the amount of annotation effort reduced in each application. Fig. 6 compares eight different active selection criteria, demonstrating that majority selection and randomness are critical in finding the most representative samples to elevate the current CNN’s performance. Fig. 7 further presents the observed distribution of predictions for each active selection criterion, qualitatively confirming the rationale of our devised candidate selection approaches. Table 4 finally compares four different active learning strategies, suggesting that continual fine-tuning using newly annotated candidates enlarged by the misclassified candidates significantly saves computational resources while maintaining compelling performance in all three medical applications.

Fig. 5. ACFT aims to minimize the number of samples for experts to label by iteratively recommending the most informative and representative samples. For scene classification (a), by actively selecting 2,906 images (6.92% of the entire dataset), ACFT (solid orange) can offer equivalent performance to the use of 4,452 images through random selection, thus saving 34.7% annotation cost relative to random fine-tuning (RFT, dashed orange). Furthermore, with 1,176 actively selected images (2.80% of the whole dataset), ACFT can achieve performance equivalent to full training (dashed black) using 42,000 images, thereby saving 97.2% annotation cost (relative to full training). In (b)–(d), we highlight the major results: compared with RFT, our ACFT can reduce the cost of annotation by 81.5% for colonoscopy frame classification, 86.3% for polyp detection, and 80.3% for pulmonary embolism detection. Following the standard active learning experimental setup, both ACFT and RFT select samples from the remaining training dataset; they will eventually use the same whole training dataset, naturally yielding similar performance at the end. However, the goal of active learning is to find the sweet spots where a learner can achieve an acceptable performance using the least number of labeled samples.

Fig. 6. Comparison of eight active selection approaches with random selection on AlexNet (Krizhevsky et al., 2012) (top panel) and GoogLeNet (Szegedy et al., 2015) (bottom panel) for our three distinct medical applications: (a) colonoscopy frame classification, (b) polyp detection, and (c) pulmonary embolism detection. GoogLeNet demonstrates patterns consistent with AlexNet. The solid black line denotes the current state-of-the-art performance of fine-tuning using full training data and the dashed black line denotes the performance of training from scratch using full training data.

Fig. 7. Distribution of predictions for the top ten candidates actively selected by the four ACFT methods at Step 3 in colonoscopy frame classification. Positive candidates are shown in red and negative candidates are shown in blue. This visualization confirms the assumption in Table 1 that the diversity+majority selection criterion prefers Pattern B whereas diversity suggests Pattern C; both entropy and entropy+majority favor Pattern A due to its higher degree of uncertainty. However, in this case at Step 3, with the entropy+majority selection criterion, there are no more candidates with Pattern A; therefore, candidates with Pattern B are selected.

Table 4.

Comparison of proposed active learning strategies and selection criteria. As measured by the Area under the Learning Curve (ALC), bolded values in the table indicate the outstanding learning strategies (see Table 2) using certain active selection criteria, and starred values represent the best performance taking both learning strategies and active selection criteria into consideration. For all three applications, we report baseline performance of random fine-tuning (RFT) using AlexNet in the table footnote. Considering the variance of random sampling for each active learning step, we conduct five independent trials for RFT and report the mean and standard deviation (mean±s.d.).

Application                        Learning strategy   Diversity                                Entropy
                                                       −M −R    +M −R    −M +R    +M +R    −M −R    +M −R    −M +R    +M +R
Colonoscopy frame classification   ACFT(Q)             0.8375   0.8773   0.8995   0.9160   0.8444   0.8227   0.9136   0.9061
                                   ACFT(LQ)            0.8501   0.8956   0.9083   0.9262   0.9149   0.9051   0.9033   0.9223
                                   AFT(LQ)             0.9183   0.9253   0.9299   0.9344   0.9219   0.9180   0.9268   0.9291
                                   ACFT(HQ)            0.9048   0.9236   0.9241   0.9179   0.9198   0.9266   0.9257   0.9293
Polyp detection                    ACFT(Q)             0.8669   0.9023   0.8984   0.9168   0.8834   0.8656   0.9034   0.9271
                                   ACFT(LQ)            0.9195   0.9142   0.9497   0.9488   0.9204   0.9255   0.9475   0.9444
                                   AFT(LQ)             0.9242   0.9285   0.9353   0.9355   0.9292   0.9238   0.9367   0.9522
                                   ACFT(HQ)            0.9013   0.9370   0.9116   0.9363   0.9321   0.9436   0.9196   0.9443
Pulmonary embolism detection       ACFT(Q)             0.7828   0.7911   0.7690   0.7977   0.7855   0.7736   0.7296   0.7833
                                   ACFT(LQ)            0.8083   0.8176   0.7975   0.8263   0.8032   0.8086   0.8022   0.8245
                                   AFT(LQ)             0.7650   0.7973   0.7978   0.8040   0.7917   0.7878   0.7964   0.8222
                                   ACFT(HQ)            0.8272   0.7876   0.8047   0.8245   0.8218   0.7995   0.8155   0.8205

(Column codes: ±M = with/without majority selection; ±R = with/without randomness.)

1 RFT in colonoscopy frame classification: ALC = 0.8958±0.0176 2 RFT in polyp detection: ALC = 0.9358±0.0130 3 RFT in pulmonary embolism detection: ALC = 0.7849±0.0261

5.1. ACFT reduces 35% annotation effort in scene classification

Fig. 5 (a) compares ACFT with RFT in scene classification using the Places-3 dataset. For RFT, six different sequences are generated via systematic random sampling, and the final curve shows the average performance of the six runs. As shown in Fig. 5 (a), ACFT, with only 2,906 candidate queries, can achieve performance equivalent to RFT with 4,452 candidate queries, as measured by the Area Under the Curve (AUC); moreover, using only 1,176 candidate queries, ACFT can achieve performance equivalent to full training using all 42,000 candidates. Therefore, 34.7% of the RFT labeling costs and 97.2% of the full training costs could be saved using ACFT. When nearly 100% of the training data are used, the performance continues to improve, suggesting that the dataset size is still insufficient for the 22-layer GoogLeNet architecture. ACFT is a general algorithm that is not only useful for medical datasets but for other datasets as well, and is also effective for multi-class problems.

5.2. ACFT reduces 82% annotation effort in colonoscopy frame classification

Fig. 5 (b) shows that ACFT, with approximately 120 candidate queries (6%), achieves performance equivalent to a 100% training dataset fine-tuned from AlexNet (solid black line, AUC = 0.9366), and, with only 80 candidate queries (4%), can achieve performance equivalent to a 100% training dataset learned from scratch (dashed black line, AUC = 0.9204). Using only 48 candidate queries, ACFT equals the performance of RFT at 260 candidate queries. Therefore, about 81.5% of the labeling cost associated with RFT in colonoscopy frame classification is recovered using ACFT. Detailed analysis in Fig. 6 reveals that during the early stages, RFT yields performance superior to some of the active selection processes because: 1) random selection gives samples with a positive-negative ratio compatible with the testing and validation datasets; and 2) the pre-trained CNN gives poor predictions in the domain of medical imaging, as it was trained on natural images, so its output probabilities are mostly inconclusive or even opposite, yielding poor selection scores. However, with randomness injected, as described in Section 3.4, ACFT (+majority and +randomness) shows superior performance, even at early stages, with continued performance improvement during subsequent steps (see the red and blue curves in Fig. 6). Moreover, as evidenced by Table 4, ACFT performs comparably with AFT, but, unlike the latter, does not require use of the entire labeled dataset or fine-tuning from the beginning.

5.3. ACFT reduces 86% annotation effort in polyp detection

Fig. 5 (c) shows that ACFT, with approximately 320 candidate queries (2.04%), can achieve performance equivalent to a 100% training dataset fine-tuned from AlexNet (solid black line, AUC = 0.9615), and, with only 10 candidate queries (0.06%), can achieve performance equivalent to a 100% training dataset learned from scratch (dashed black line, AUC = 0.9358). Furthermore, ACFT, using only 20 candidate queries, achieves performance equivalent to RFT using 146 candidate queries. Therefore, nearly 86.3% of the labeling cost associated with the use of RFT for polyp detection could be recovered with our method. The fast convergence and outstanding performance of ACFT are attributable to the majority selection and randomization methods, which efficiently select informative and representative candidates while excluding those with noisy labels, and still boost the performance during the early stages. For example, the diversity criterion, without majority selection, would strongly favor candidates whose prediction pattern resembles Pattern C (see Table 1), thus performing worse than RFT due to noisy labels generated through data augmentation.

5.4. ACFT reduces 80% annotation effort in pulmonary embolism detection

Fig. 5 (d) shows that ACFT, with 2,560 candidate queries (66.68 %) nearly achieves performance equivalent to both the 100 % training dataset fine-tuned from AlexNet and learning from scratch (solid black line and dashed black line, where AUC = 0.8763 and AUC = 0.8706, respectively). With 320 candidate queries, ACFT can achieve the performance equivalent to RFT using 1,627 candidate queries. Based on this analysis, the cost of annotation in pulmonary embolism detection can be reduced by 80.3% using ACFT compared with RFT.

5.5. Observations on active selection criteria

We meticulously monitored the active selection process and examined the selected candidates. For example, we include the top ten candidates selected by the four ACFT methods at Step 3 in colonoscopy frame classification in Fig. 7. From this process, we have observed the following:

  • Patterns A and B are dominant in the earlier stages of ACFT as the CNN has not been fine-tuned properly to the target domain;

  • Patterns C, D and E are dominant in the later stages of ACFT as the CNN has been largely fine-tuned on the target dataset;

  • Majority selection is effective for excluding Patterns C, D, and E, whereas entropy only (without the majority selection) can handle Patterns C, D, and E reasonably well;

  • Patterns B, F, and G generally make good contributions to elevating the current CNN’s performance;

  • Entropy and entropy+majority favor Pattern A due to its higher degree of uncertainty; and

  • Diversity+majority prefers Pattern B whereas diversity prefers Pattern C. This is why diversity may cause sudden disturbances in the CNN’s performance and why diversity+majority is generally preferred.

5.6. Comparison of proposed learning strategies

As summarized in Table 2, several active learning strategies can be derived. The prediction performance was evaluated according to the Area under the Learning Curve (ALC), in which the learning curve plots AUC as a function of the number of labels queried (Guyon et al., 2011), computed on the testing dataset. Table 4 shows the ALC of ACFT(Q), ACFT(LQ), AFT(LQ) and ACFT(HQ) compared with RFT. Our comprehensive experiments have demonstrated that:

  1. ACFT(Q) considers only newly selected candidates for fine-tuning, resulting in an unstable CNN performance due to the catastrophic forgetting of the previous samples;

  2. ACFT(LQ) requires a careful parameter adjustment. Although its performance is acceptable, it requires the same computing time as AFT(LQ), indicating that there is no advantage to continually fine-tuning the current CNN;

  3. AFT(LQ) shows the most reliable performance compared with ACFT(Q) and ACFT(LQ);

  4. The optimized version, ACFT(HQ), shows comparable performance to AFT(LQ) and occasionally outperforms AFT(LQ) by eliminating easy samples, focusing on hard samples, and preventing catastrophic forgetting.

In summary, our results suggest that (1) it is unnecessary to retrain models repeatedly from scratch for each active learning step and (2) learning newly annotated candidates plus a small portion of the misclassified candidates leads to equivalent performance to using the entire labeled set.
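Since the learning curve plots test AUC against the number of labels queried, the ALC reported in Table 4 can be approximated by a simple trapezoidal integral over a normalized label axis; the sketch below is one plausible implementation of ours, while the exact normalization follows Guyon et al. (2011).

```python
import numpy as np

def area_under_learning_curve(num_labels, auc_values):
    """num_labels: annotated candidates after each active learning step (increasing).
    auc_values: test AUC of the model after each step.
    Returns the area under the learning curve over a [0, 1]-normalized label axis."""
    x = np.asarray(num_labels, dtype=float)
    x = (x - x[0]) / (x[-1] - x[0])
    return np.trapz(np.asarray(auc_values, dtype=float), x)

# A learner that reaches high AUC with few labels earns a larger ALC:
print(area_under_learning_curve([10, 100, 200, 400], [0.60, 0.85, 0.90, 0.92]))  # ~0.86
print(area_under_learning_curve([10, 100, 200, 400], [0.60, 0.65, 0.75, 0.92]))  # ~0.75
```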

6. Discussion

6.1. How does intra-diversity differ from inter-diversity?

Since measuring diversity between selected samples and unlabeled samples is computationally intractable, especially for a large pool of data (Sourati et al., 2016), the existing diversity sampling cannot be applied directly to our real-world medical applications. For example, the selection criterion R in Chakraborty et al. (2015) involves all unlabeled samples (patches). There are 391,200 training patches for polyp detection, and computing their R would demand roughly 1.1 TB of memory (391,200² × 8 bytes). In addition, their algorithms for batch selection are based on the truncated power method (Yuan and Zhang, 2013), which is unable to find a solution even for our smallest application (colonoscopy frame classification with 42,000 training patches). Holub et al. (2008) cannot be directly used for our real-world applications either, as it has a complexity of O(L^3 × N^3) and requires training L × N classifiers in each step, where N indicates the number of unlabeled patches and L indicates the number of classes. To address the computational complexity problem, we exploit the inherent consistency among the patches that are augmented from the same sample, making the calculation feasible for our real-world applications. To contrast these two measures of diversity, the variance among samples refers to inter-diversity, while the variance among patches augmented from the same sample refers to intra-diversity. We recognize that intra-diversity would inevitably suffer from redundancy in selection, as it treats each sample separately and dismisses inter-diversity among samples. An obvious solution is to inject randomness into the active selection criteria, as described in Section 3.4. Nonetheless, a better solution is to combine inter- and intra-diversity by computing inter-diversity locally on the smaller set of samples selected by intra-diversity. These solutions all aim at selecting sufficiently diverse samples with manageable computational complexity.
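One plausible realization of the combined approach (a sketch of our own; the paper does not prescribe a specific algorithm): shortlist candidates by intra-diversity first, then greedily enforce inter-diversity only within that much smaller shortlist, here using the L2 distance between mean candidate predictions as a cheap inter-diversity proxy.

```python
import numpy as np

def select_inter_intra(mean_preds, intra_scores, b, shortlist_factor=5, min_dist=1e-3):
    """mean_preds: (n, |Y|) mean patch prediction per candidate.
    intra_scores: per-candidate intra-diversity (e.g., Eq. 1 over its own patches).
    Greedily picks up to b mutually dissimilar candidates from the intra-diversity shortlist."""
    shortlist = np.argsort(intra_scores)[::-1][: shortlist_factor * b]
    chosen = [int(shortlist[0])]
    for idx in shortlist[1:]:
        if len(chosen) == b:
            break
        # inter-diversity proxy: distance of this candidate's mean prediction to chosen ones
        dists = np.linalg.norm(mean_preds[idx] - mean_preds[chosen], axis=1)
        if dists.min() > min_dist:          # skip near-duplicates of what is already chosen
            chosen.append(int(idx))
    return chosen
```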

6.2. Can actively selected samples be automatically balanced?

Data is often imbalanced in real-world applications. The images of target classes of interest, e.g., certain types of diseases, only appear in a small portion of the dataset. We encounter severe imbalances in our three applications. The ratio between positives and negatives is around 1:9 in the polyp and pulmonary embolism detection. Meanwhile, the ratio is approximately 3:7 in the colonoscopy frame classification. Learning from such imbalanced datasets leads to a common issue: majority bias (Aggarwal et al., 2020), which is a prediction bias towards majority classes over minority classes. Training data should be balanced in terms of classes (Japkowicz and Stephen, 2002; He and Garcia, 2009; Buda et al., 2018). Similar to most studies in active learning literature, our proposed selection criteria are not directly designed to tackle the issue of imbalance, but they have an implicit impact on balancing the data. For instance, when the current CNN has already learned more from positive samples, the next active learning selection would be more likely to prefer those negative samples, and vice-versa. On the contrary, random selection would consistently select new samples that follow roughly the same positive/negative ratio as the entire dataset. As shown in Fig. 8, our ACFT(HQ) and AFT(LQ) are capable of automatically balancing the selected training data. After monitoring the active selection process, ACFT(HQ) and AFT(LQ) select twice as many positives compared to random selection. This does not suggest that the number of positives and negatives must be approximately identical in the selected samples. Negative samples naturally present more contextual variance than positive ones, as negatives can contain a vast array of possibilities not including the disease of interest. It is expected that the CNN should learn more from negatives to shape the decision boundary of positives. An ideal selection should cover a sufficient variety of negatives while striking an emphasis on the positives. We believe that this accounts for the quick achievement of superior performance in imbalanced data for our ACFT(HQ) and AFT(LQ).

6.3. How to prevent model forgetting in continual learning?

When a CNN learns from a stream of tasks continually, the learning of the new task can degrade the CNN’s performance on earlier tasks (Kirkpatrick et al., 2017; Chen and Liu, 2018; Parisi et al., 2019). This phenomenon is called catastrophic forgetting, which was first recognized by McCloskey and Cohen (1989). In this paper, we have also observed similar behavior in active continual fine-tuning when the CNN encounters newly selected samples. This problem might not arise if the CNN were repeatedly trained on the entire labeled set at every active learning step. However, fully reusing the labeled samples takes considerable resources; furthermore, as the labeled set gets larger and larger, the impact of the newly selected samples on the model training becomes smaller and smaller (relative to the whole labeled set). To make the training more efficient and maximize the contribution of new data, we attempted to fine-tune the CNN only on the newly selected samples, developing the learning strategy called ACFT(Q). However, as seen in Table 4, ACFT(Q) results in a substantially unstable performance because of catastrophic forgetting. To track the forgotten samples, we have plotted a histogram of the candidates misclassified by the current CNN (H) against labeled candidates (L) and newly selected candidates (Q) in Fig. 9. We found that if the CNN is only fine-tuned on the newly selected samples at each step, it tends to forget the samples that have been learned in previous steps. This is because the new data will likely override the weights that have been learned in the past, overfitting the CNN on this data and degrading the model’s generalizability. Therefore, we propose to combine the newly selected (Q) and misclassified (H) candidates to continually fine-tune the current CNN, which not only spotlights the power of new data to achieve comparable performance (see Table 4: ACFT(HQ) vs. AFT(LQ)), but also eases the computational cost by eliminating re-training on easy samples, focusing on hard ones, and preventing catastrophic forgetting.

Fig. 9. Labels are reused differently in four active learning strategies, as summarized in Table 2. Specifically, the labels can be non-reused, partially reused, or 100% reused. We plot the number of candidates at each active learning step, including labeled candidates (L), newly annotated candidates (Q), and misclassified candidates (H). As seen, by continually fine-tuning only on the hybrid data of H and Q, our ACFT significantly reduces training time through faster convergence compared with repeatedly fine-tuning on the entire labeled data of L and Q. Most importantly, as evidenced by Table 4, partially reusing labels can achieve compelling performance because it eliminates labeled easy candidates, focuses on hard ones, and prevents catastrophic forgetting.

6.4. Is ACFT generalizable to other models?

We based our experiments on AlexNet and GoogLeNet. Alternatively, deeper architectures, such as VGG (Simonyan and Zisserman, 2014), ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and FixEfficientNet (Touvron et al., 2020), could have been used, and they are known to show relatively higher performance for challenging computer vision tasks. However, the purpose of this work is not to achieve the highest performance for different medical image tasks but to answer a critical question: How can annotation costs be significantly reduced when applying CNNs to medical imaging? For this purpose, we have experimented with our three applications, demonstrating consistent patterns between AlexNet and GoogLeNet, as shown in Fig. 6. As a result, given this generalizability, we can focus on comparing the prediction patterns and learning strategies rather than running experiments on different CNN architectures. Moreover, our active selection criteria rely only on data augmentation and model prediction, without being tied to specific types of predictors. This suggests that not only various CNN architectures, but also other predictive methods—spanning classic approaches (e.g., SVM, Random Forests, and AdaBoost) to recent trends such as CapsuleNet (Sabour et al., 2017) and the Transformer (Dosovitskiy et al., 2020)—can benefit from the progress in active learning.

6.5. Can we do better on the cold start problem?

It is crucial to intelligently select initial samples for an active learning procedure, especially for algorithms like our ACFT, which starts from a completely empty labeled dataset. Our results in Fig. 6 and several other studies (Borisov et al., 2010; Zhou et al., 2017b; Yuan et al., 2020) reveal that uniformly, randomly selecting initial samples from the unlabeled set could outperform active selection at the beginning. This is one of the most challenging problems in active learning, known as the cold start problem, which is ascribed to (1) data scarcity and (2) model instability at early stages. First, the data distribution in randomly selected samples better reflects the original distribution of the entire dataset than in actively selected samples. Maintaining a similar distribution between training and test data is beneficial when using scarce data. The most common practice is to admit the power of randomness at the beginning and randomly select initial samples from the unlabeled set (Ren et al., 2020). Our paper addresses the cold start problem by incorporating a random sampling probability with respect to the active selection criteria (as detailed in Section. 3.4). The devised ACFT (+randomness vs. −randomness in Fig. 6) shows superior performance, even in early stages, with continued performance improving during the subsequent steps. Second, in the beginning, the CNN understandably fails to amply predict new samples, as it is trained with an inadequate number of samples. With horrible predictions, no matter how marvelous the selection criterion is, the selected samples would be unsatisfactory—as said “garbage in garbage out”. To express meaningful CNN predictions, our paper suggests the use of pre-trained CNNs (as illustrated in Algorithm 1), not only initializing the CNN at the first step, but also providing fairly reasonable predictions for initial active selection. Fig. 5 presents encouraging results of active selection using pre-trained CNNs compared with random sampling from the unlabeled set (ACFT vs. RFT). However, a CNN pre-trained on ImageNet may give poor predictions in the medical imaging domain, as it was trained from only natural images; it is associated with a large domain gap for medical images. As a result, the CNN predictions may be inconclusive or even opposite, yielding poor selection scores. Naturally, one may consider utilizing pre-trained models in the same domains to reduce this domain gap (Zhou et al., 2021; Haghighi et al., 2020; Feng et al., 2020). Yuan et al. (2020) has demonstrated this idea in natural language processing by applying self-supervised language modeling to select initial samples. In the case of medical imaging, we naturally expect that self-supervised methods can also mitigate the pronounced domain gap between natural and medical imaging, offering a great starting point for selecting samples using domain-relevant image representation. More importantly, the learning objectives in self-supervised methods are applicable for discovering the most representative initial samples. For instance, our diversity criterion shares a similar spirit with the learning objective of BYOL (Grill et al., 2020) and of Parts2Whole (Feng et al., 2020), as they all aim to pull together the patches augmented from the same sample. Therefore, their objective functions could serve as an off-the-shelf measure for the power of a sample in elevating the pre-trained CNN’s performance. 
The underlying hypothesis is that the worthiness of labeling a sample correlates with the learning objective of self-supervised pre-training. Specifically, a sample is potentially more worthy for training the CNN if it requires considerably more effort to perform the task of in-painting (Pathak et al., 2016), restoration (Zhou et al., 2021), contrastive learning (Chen et al., 2020), or colorization (Zhang et al., 2016). We anticipate that self-supervised methods have great potential to facilitate the selection of initial samples by leveraging unlabeled data in the same domain, thereby addressing the cold start problem in active learning more effectively.
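As a concrete, purely hypothetical illustration of this hypothesis, the sketch below ranks unlabeled samples for the first active learning step by how poorly a self-supervised restoration model reconstructs them. Both `restore_model` (assumed to expose a `predict` method) and `corrupt` (the masking or noising transformation used during its pre-training) are placeholders, not components defined in this paper.

```python
import numpy as np

def cold_start_ranking(samples, restore_model, corrupt):
    """Hypothetical cold-start scoring: rank unlabeled samples by how hard
    a self-supervised restoration model finds them to reconstruct.

    `restore_model` and `corrupt` are placeholders for any pre-trained
    restoration network and its pre-training transformation.
    """
    scores = []
    for x in samples:
        x_hat = restore_model.predict(corrupt(x))        # attempted restoration
        scores.append(float(np.mean((x_hat - x) ** 2)))  # restoration error as difficulty
    # Higher restoration error -> the sample is harder for the pre-trained
    # model -> potentially more worthy of annotation in the first step.
    return np.argsort(scores)[::-1]
```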

6.6. Is our consistency observation useful for other purposes?

One of our key observations is that all patches augmented from the same sample share the same label and are thus expected to receive similar predictions from the CNN. This inherent invariance allows us to devise the diversity metric for estimating the worthiness of labeling a sample. From a broader view, the use of data consistency before and after a mixture of augmentations has played an important role in many other circumstances. In semi-supervised learning, the consistency loss serves as a bridge between labeled and unlabeled data: while the CNN is trained on labeled data, the consistency loss constrains its predictions to be invariant to the varying ways in which unlabeled data are augmented (Yu et al., 2019; Cui et al., 2019; Bortsova et al., 2019; Fotedar et al., 2020). In self-supervised learning, the concept of consistency allows CNNs to learn transformation-invariant features, either by restoring the original image from the transformed one (Zhu et al., 2020; Zhou et al., 2021) or by explicitly pulling all patches augmented from the same image together in the feature space (Feng et al., 2020; Chen et al., 2020; He et al., 2020). Despite the great promise of the consistency loss, automatic data augmentation inevitably generates “noisy” samples, jeopardizing the data consistency presumption. For example, when an image contains objects A and B, random cropping may fully or partially miss one of the objects, causing label inconsistency or representation inconsistency (Purushwalkam and Gupta, 2020; Hinton, 2021). Therefore, the choice of data augmentation is critical when employing the data consistency presumption. Beyond data consistency, the prediction consistency of model ensembles can also be used to compute diversity. For instance, Gal and Ghahramani (2016); Gal et al. (2017); Tsymbalov et al. (2018) have proposed estimating the prediction diversity of a CNN via Monte-Carlo dropout at inference time, and Beluch et al. (2018); Yang et al. (2017); Kuo et al. (2018); Li et al. (2020); Venturini et al. (2020) measure prediction consistency by feeding images to multiple independent CNNs trained on the same data for the same purpose. Unlike the data consistency assumed in our work, their presumption is model consistency: the predictions ought to agree when the same sample passes through the model ensemble; otherwise, the sample is considered worthy of labeling.
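The sketch below illustrates this model-consistency alternative in the spirit of Monte-Carlo dropout (Gal and Ghahramani, 2016), rather than the data consistency used in our work. The callable `stochastic_predict`, the toy predictor, and the number of passes are illustrative assumptions, not an interface from any of the cited papers.

```python
import numpy as np

def mc_dropout_diversity(stochastic_predict, x, n_passes=20):
    """Illustrative model-consistency score via Monte-Carlo dropout.

    `stochastic_predict` is assumed to return a class probability for `x`
    with dropout kept active at inference time; it is a placeholder.
    """
    preds = np.array([stochastic_predict(x) for _ in range(n_passes)])
    # Large spread across stochastic passes -> inconsistent predictions ->
    # the sample is considered worthy of labeling.
    return float(np.std(preds))

# Toy stochastic predictor standing in for a dropout-enabled CNN.
rng = np.random.default_rng(0)
toy_predict = lambda x: float(np.clip(0.5 + 0.1 * rng.standard_normal(), 0.0, 1.0))
print(mc_dropout_diversity(toy_predict, x=None))
```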

7. Conclusion

We have developed a novel method for dramatically reducing annotation cost by integrating active learning and transfer learning. Compared with the state-of-the-art random selection method (Tajbakhsh et al., 2016), our method can reduce the annotation cost by at least half for three medical applications and by more than 33% for the natural image dataset Places-3. The superior performance of our method is attributable to eight distinct advantages, detailed in Section 1. We believe that labeling at the candidate level offers a sensible balance for our three applications: labeling at the patient level would further reduce annotation cost but introduce more severe label noise, whereas labeling at the patch level would avoid the additional label noise but impose a significant annotation burden on experts.

Fig. 4.

Five different pulmonary embolism candidates in the vessel-oriented image representation (Tajbakhsh et al., 2015). This representation was adopted in this work because it achieves high classification accuracy and accelerates the convergence of CNN training.

Acknowledgments

This research has been supported partially by ASU and Mayo Clinic through a Seed Grant and an Innovation Grant, and partially by the NIH under Award Number R01HL128785. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. We thank S. Tatapudi and A. Pluhar for helping improve the writing of this paper. The content of this paper is covered by patents pending.

Appendix A. Selected Images Gallery

We illustrate the top and bottom five images selected by four active selection strategies (i.e., diversity, diversity+majority, entropy, and entropy+majority) from Places-3 at Step 11 in Fig. A.11 to give a visual impression of the newly selected images. Such a gallery offers an intuitive way to analyze the most and least favored images and has helped us develop different active selection strategies.

Fig. A.10.

We illustrate the ideas behind ACFT by utilizing Places-3 (Zhou et al., 2017a) for scene classification in natural images. For simplicity yet without loss of generality, we limit the task to three categories: (a) kitchen, (b) living room, and (c) office. Places-3 has 15,100 images in each category.

Fig. A.11.

Gallery of the top five and bottom five candidates actively selected at Step 11 by the methods proposed in Sections 3.2 and 3.3 under the experimental setting.

Footnotes

Declaration of Competing Interest

None.

References

  1. Aggarwal U, Popescu A, Hudelot C, 2020. Active learning for imbalanced datasets. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1428–1437. [Google Scholar]
  2. Ardila D, Kiraly AP, Bharadwaj S, Choi B, Reicher JJ, Peng L, Tse D, Etemadi M, Ye W, Corrado G, et al. , 2019. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med 25 (6), 954–961. [DOI] [PubMed] [Google Scholar]
  3. Azizi S, Mustafa B, Ryan F, Beaver Z, Freyberg J, Deaton J, Loh A, Karthikesalingam A, Kornblith S, Chen T, et al. , 2021. Big self-supervised models advance medical image classification. arXiv preprint arXiv:2101.05224. [Google Scholar]
  4. Balcan M-F, Broder A, Zhang T, 2007. Margin based active learning. In: International Conference on Computational Learning Theory. Springer, pp. 35–50. [Google Scholar]
  5. Beluch WH, Genewein T, Nürnberger A, Köhler JM, 2018. The power of ensembles for active learning in image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9368–9377. [Google Scholar]
  6. Borisov A, Tuv E, Runger G, 2010. Active batch learning with stochastic query by forest. In: JMLR: Workshop and Conference Proceedings (2010). Citeseer. [Google Scholar]
  7. Bortsova G, Dubost F, Hogeweg L, Katramados I, de Bruijne M, 2019. Semi-supervised medical image segmentation via learning consistency under transformations. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 810–818. [Google Scholar]
  8. Buda M, Maki A, Mazurowski MA, 2018. A systematic study of the class imbalance problem in convolutional neural networks. Neur. Netw 106, 249–259. [DOI] [PubMed] [Google Scholar]
  9. Caron M, Misra I, Mairal J, Goyal P, Bojanowski P, Joulin A, 2020. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882. [Google Scholar]
  10. Chakraborty S, Balasubramanian V, Sun Q, Panchanathan S, Ye J, 2015. Active batch selection via convex relaxations with guaranteed solution bounds. IEEE Trans. Pattern Analy. Mach. Intell 37 (10), 1945–1958. [DOI] [PubMed] [Google Scholar]
  11. Chen S, Ma K, Zheng Y, 2019. Med3d: Transfer learning for 3d medical image analysis. arXiv preprint arXiv:1904.00625. [Google Scholar]
  12. Chen T, Kornblith S, Norouzi M, Hinton G, 2020. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. [Google Scholar]
  13. Chen X, He K, 2020. Exploring simple siamese representation learning. arXiv preprint arXiv:2011.10566. [Google Scholar]
  14. Chen Z, Liu B, 2018. Lifelong machine learning. Synth. Lect. Artif. Intell. Mach. Learn 12 (3), 1–207. [Google Scholar]
  15. Cui W, Liu Y, Li Y, Guo M, Li Y, Li X, Wang T, Zeng X, Ye C, 2019. Semi–supervised brain lesion segmentation with an adapted mean teacher model. In: International Conference on Information Processing in Medical Imaging. Springer, pp. 554–565. [Google Scholar]
  16. Culotta A, McCallum A, 2005. Reducing labeling effort for structured prediction tasks. In: AAAI, Vol. 5, pp. 746–751. [Google Scholar]
  17. Dagan I, Engelson SP, 1995. Committee-based sampling for training probabilistic classifiers. In: Machine Learning Proceedings 1995. Elsevier, pp. 150–157. [Google Scholar]
  18. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L, 2009. Imagenet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 248–255. [Google Scholar]
  19. Ding Y, Sohn JH, Kawczynski MG, Trivedi H, Harnish R, Jenkins NW, Lituiev D, Copeland TP, Aboian MS, Mari Aparici C, et al. , 2018. A deep learning model to predict a diagnosis of alzheimer disease by using 18f-FDG PET of the brain. Radiology 290 (2), 456–464. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, et al. , 2020. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. [Google Scholar]
  21. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S, 2017. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 (7639), 115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, Cui C, Corrado G, Thrun S, Dean J, 2019. A guide to deep learning in healthcare. Nat. Med 25 (1), 24–29. [DOI] [PubMed] [Google Scholar]
  23. Feng R, Zhou Z, Gotway MB, Liang J, 2020. Parts2whole: Self-supervised contrastive learning via reconstruction. In: Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning. Springer, pp. 85–95. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Fotedar G, Tajbakhsh N, Ananth S, Ding X, 2020. Extreme consistency: Overcoming annotation scarcity and domain shifts. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 699–709. [Google Scholar]
  25. Gal Y, Ghahramani Z, 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In: international conference on machine learning. PMLR, pp. 1050–1059. [Google Scholar]
  26. Gal Y, Islam R, Ghahramani Z, 2017. Deep bayesian active learning with image data. In: International Conference on Machine Learning. PMLR, pp. 1183–1192. [Google Scholar]
  27. Grill J-B, Strub F, Altché F, Tallec C, Richemond PH, Buchatskaya E, Doersch C, Pires BA, Guo ZD, Azar MG, et al. , 2020. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. [Google Scholar]
  28. Guan Q, Huang Y, 2018. Multi-label chest x-ray image classification via category-wise residual attention learning. Pattern Recogn. Lett [Google Scholar]
  29. Guendel S, Grbic S, Georgescu B, Liu S, Maier A, Comaniciu D, 2018. Learning to recognize abnormalities in chest x-rays with location-aware dense networks. In: Iberoamerican Congress on Pattern Recognition. Springer, pp. 757–765. [Google Scholar]
  30. Guyon I, Cawley GC, Dror G, Lemaire V, 2011. Results of the active learning challenge. In: Active Learning and Experimental Design workshop In conjunction with AISTATS 2010, pp. 19–45. [Google Scholar]
  31. Haghighi F, Taher MRH, Zhou Z, Gotway MB, Liang J, 2020. Learning semantics-enriched representation via self-discovery, self-classification, and self-restoration. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 137–147. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. He H, Garcia EA, 2009. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng 21 (9), 1263–1284. [Google Scholar]
  33. He K, Fan H, Wu Y, Xie S, Girshick R, 2020. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. [Google Scholar]
  34. He K, Zhang X, Ren S, Sun J, 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. [Google Scholar]
  35. Hino H, 2020. Active learning: Problem settings and recent developments. arXiv preprint arXiv:2012.04225. [Google Scholar]
  36. Hinton G, 2021. How to represent part-whole hierarchies in a neural network. arXiv preprint arXiv:2102.12627. [DOI] [PubMed] [Google Scholar]
  37. Holub A, Perona P, Burl MC, 2008. Entropy-based active learning for object recognition. In: 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, pp. 1–8. [Google Scholar]
  38. Huang G, Liu Z, Weinberger KQ, van der Maaten L, 2017. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, p. 3. [Google Scholar]
  39. Huang S-C, Kothari T, Banerjee I, Chute C, Ball RL, Borus N, Huang A, Patel BN, Rajpurkar P, Irvin J, et al. , 2020. PEnet: a scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric CT imaging. npj Digit. Med. 3 (1), 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Irvin J, Rajpurkar P, Ko M, Yu Y, Ciurea-Ilcus S, Chute C, Marklund H, Haghgoo B, Ball R, Shpanskaya K, et al. , 2019. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 590–597. [Google Scholar]
  41. Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH, 2021. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Method 18 (2), 203–211. [DOI] [PubMed] [Google Scholar]
  42. Japkowicz N, Stephen S, 2002. The class imbalance problem: A systematic study. Intell. Data Analy 6 (5), 429–449. [Google Scholar]
  43. Käding C, Rodner E, Freytag A, Denzler J, 2016. Fine-tuning deep neural networks in continuous learning scenarios. In: Asian Conference on Computer Vision. Springer, pp. 588–605. [Google Scholar]
  44. Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A, et al. , 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), 3521–3526. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Krizhevsky A, Sutskever I, Hinton GE, 2012. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105.
  46. Kukar M, 2003. Transductive reliability estimation for medical diagnosis. Artif. Intell. Med 29 (1), 81–106. [DOI] [PubMed] [Google Scholar]
  47. Kulick J, Lieck R, Toussaint M, et al. , 2014. Active learning of hyperparameters: An expected cross entropy criterion for active model selection. arXiv e-prints.
  48. Kuo W, Häne C, Yuh E, Mukherjee P, Malik J, 2018. Cost-sensitive active learning for intracranial hemorrhage detection. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 715–723. [Google Scholar]
  49. LeCun Y, Bengio Y, Hinton G, 2015. Deep learning. Nature 521 (7553), 436. [DOI] [PubMed] [Google Scholar]
  50. Li X, Guo Y, 2013. Adaptive active learning for image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 859–866. [Google Scholar]
  51. Li X, Yu L, Chen H, Fu C-W, Xing L, Heng P-A, 2020. Transformation-consistent self-ensembling model for semisupervised medical image segmentation. IEEE Trans. Neur. Netw. Learn. Syst [DOI] [PubMed]
  52. Lu L, Zheng Y, Carneiro G, Yang L, 2017. Deep learning and convolutional neural networks for medical image computing. Adv. Comput. Vis. Pattern Recogn
  53. Ma Y, Zhou Q, Chen X, Lu H, Zhao Y, 2019. Multi-attention network for thoracic disease classification and localization. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 1378–1382. [Google Scholar]
  54. Mahapatra D, Bozorgtabar B, Thiran J-P, Reyes M, 2018. Efficient active learning for image classification and segmentation using a sample selection and conditional generative adversarial network. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 580–588. [Google Scholar]
  55. McCallum AK, Nigam K, 1998. Employing EM and pool-based active learning for text classification. In: Proc. International Conference on Machine Learning (ICML). Citeseer, pp. 359–367. [Google Scholar]
  56. McCloskey M, Cohen NJ, 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In: Psychology of learning and motivation, Vol. 24. Elsevier, pp. 109–165. [Google Scholar]
  57. Moen E, Bannon D, Kudo T, Graf W, Covert M, Van Valen D, 2019. Deep learning for cellular image analysis. Nat. Method 1–14. [DOI] [PMC free article] [PubMed]
  58. Mormont R, Geurts P, Marée R, 2018. Comparison of deep transfer learning strategies for digital pathology. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2262–2271. [Google Scholar]
  59. Mundt M, Hong YW, Pliushch I, Ramesh V, 2020. A wholistic view of continual learning with deep neural networks: Forgotten lessons and the bridge to active and open world learning. arXiv preprint arXiv:2009.01797. [DOI] [PubMed] [Google Scholar]
  60. Munjal P, Hayat N, Hayat M, Sourati J, Khan S, 2020. Towards robust and reproducible active learning using neural networks. arXiv preprint arXiv:2002.09564.
  61. Ozdemir F, Peng Z, Tanner C, Fuernstahl P, Goksel O, 2018. Active learning for segmentation by optimizing content information for maximal entropy. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, pp. 183–191. [Google Scholar]
  62. Parisi GI, Kemker R, Part JL, Kanan C, Wermter S, 2019. Continual lifelong learning with neural networks: A review. Neur. Netw 113, 54–71. [DOI] [PubMed] [Google Scholar]
  63. Pathak D, Krahenbuhl P, Donahue J, Darrell T, Efros AA, 2016. Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544. [Google Scholar]
  64. Purushwalkam S, Gupta A, 2020. Demystifying contrastive self-supervised learning: Invariances, augmentations and dataset biases. arXiv preprint arXiv:2007.13916. [Google Scholar]
  65. Ravizza S, Huschto T, Adamov A, Böhm L, Büsser A, Flöther FF, Hinzmann R, König H, McAhren SM, Robertson DH, et al. , 2019. Predicting the early risk of chronic kidney disease in patients with diabetes using real-world data. Nat. Med 25 (1), 57–59. [DOI] [PubMed] [Google Scholar]
  66. Ren P, Xiao Y, Chang X, Huang P-Y, Li Z, Chen X, Wang X, 2020. A survey of deep active learning. arXiv preprint arXiv:2009.00236.
  67. Sabour S, Frosst N, Hinton GE, 2017. Dynamic routing between capsules. arXiv preprint arXiv:1710.09829. [Google Scholar]
  68. Scheffer T, Decomain C, Wrobel S, 2001. Active hidden markov models for information extraction. In: International Symposium on Intelligent Data Analysis. Springer, pp. 309–318. [Google Scholar]
  69. Sener O, Savarese S, 2017. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489. [Google Scholar]
  70. Settles B, 2009. Active learning literature survey. University of Wisconsin, Madison 52 (55–66), 11. [Google Scholar]
  71. Shannon CE, 1948. A mathematical theory of communication. Bell Syst. Techn. J 27 (3), 379–423. [Google Scholar]
  72. Shao W, Sun L, Zhang D, 2018. Deep active learning for nucleus classification in pathology images. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE, pp. 199–202. [Google Scholar]
  73. Shen D, Liu T, Peters TM, Staib LH, Essert C, Zhou S, Yap P-T, Khan A, 2019. Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Vol. 11767. Springer Nature. [Google Scholar]
  74. Shui C, Zhou F, Gagné C, Wang B, 2020. Deep active learning: Unified and principled method for query and training. In: International Conference on Artificial Intelligence and Statistics. PMLR, pp. 1308–1318. [Google Scholar]
  75. Simonyan K, Zisserman A, 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. [Google Scholar]
  76. Sourati J, Akcakaya M, Dy JG, Leen TK, Erdogmus D, 2016. Classification active learning based on mutual information. Entropy 18 (2), 51. [Google Scholar]
  77. Sourati J, Gholipour A, Dy JG, Kurugol S, Warfield SK, 2018. Active deep learning with fisher information for patch-wise semantic segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, pp. 83–91. [DOI] [PMC free article] [PubMed] [Google Scholar]
  78. Sourati J, Gholipour A, Dy JG, Tomas-Fernandez X, Kurugol S, Warfield SK, 2019. Intelligent labeling based on fisher information for medical image segmentation using deep learning. IEEE Trans. Med. Imaging 38 (11), 2642–2653. [DOI] [PMC free article] [PubMed] [Google Scholar]
  79. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A, et al. , 2015. Going Deeper with Convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [Google Scholar]
  80. Tajbakhsh N, Gotway MB, Liang J, 2015. Computer-aided pulmonary embolism detection using a novel vessel-aligned multi-planar image representation and convolutional neural networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 62–69. [Google Scholar]
  81. Tajbakhsh N, Jeyaseelan L, Li Q, Chiang JN, Wu Z, Ding X, 2020. Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation. Med. Image Analy 101693. [DOI] [PubMed]
  82. Tajbakhsh N, Shin JY, Gotway MB, Liang J, 2019. Computer-aided detection and visualization of pulmonary embolism using a novel, compact, and discriminative image representation. Med. Image Analy 58, 101541. [DOI] [PMC free article] [PubMed] [Google Scholar]
  83. Tajbakhsh N, Shin JY, Gurudu SR, Hurst RT, Kendall CB, Gotway MB, Liang J, 2016. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Trans. Med. Imaging 35 (5), 1299–1312. [DOI] [PubMed] [Google Scholar]
  84. Tang Y, Wang X, Harrison AP, Lu L, Xiao J, Summers RM, 2018. Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs. In: International Workshop on Machine Learning in Medical Imaging. Springer, pp. 249–258. [Google Scholar]
  85. Touvron H, Vedaldi A, Douze M, Jégou H, 2020. Fixing the train-test resolution discrepancy: Fixefficientnet. arXiv preprint arXiv:2003.08237. [Google Scholar]
  86. Tsymbalov E, Panov M, Shapeev A, 2018. Dropout-based active learning for regression. In: International conference on analysis of images, social networks and texts. Springer, pp. 247–258. [Google Scholar]
  87. Venturini L, Papageorghiou AT, Noble JA, Namburete AIL, 2020. Uncertainty estimates as data selection criteria to boost omni-supervised learning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 689–698. [Google Scholar]
  88. Wang W, Lu Y, Wu B, Chen T, Chen DZ, Wu J, 2018. Deep active self-paced learning for accurate pulmonary nodule segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 723–731. [Google Scholar]
  89. Yamamoto Y, Tsuzuki T, Akatsuka J, Ueki M, Morikawa H, Numata Y, Takahara T, Tsuyuki T, Tsutsumi K, Nakazawa R, et al. , 2019. Automated acquisition of explainable knowledge from unannotated histopathology images. Nat. Commun 10 (1), 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  90. Yang L, Zhang Y, Chen J, Zhang S, Chen DZ, 2017. Suggestive annotation: A deep active learning framework for biomedical image segmentation. arXiv preprint arXiv:1706.04737. [Google Scholar]
  91. Yu L, Wang S, Li X, Fu C-W, Heng P-A, 2019. Uncertainty-aware self-ensembling model for semi-supervised 3d left atrium segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 605–613. [Google Scholar]
  92. Yuan M, Lin H-T, Boyd-Graber J, 2020. Cold-start active learning through self-supervised language modeling. arXiv preprint arXiv:2010.09535. [Google Scholar]
  93. Yuan X-T, Zhang T, 2013. Truncated power method for sparse eigenvalue problems. J. Mach. Learn. Res 14 (Apr), 899–925. [Google Scholar]
  94. Zhang R, Isola P, Efros AA, 2016. Colorful image colorization. In: Proceedings of the European Conference on Computer Vision. Springer, pp. 649–666. [Google Scholar]
  95. Zhou B, Lapedriza A, Khosla A, Oliva A, Torralba A, 2017. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Analy. Mach. Intell [DOI] [PubMed]
  96. Zhou SK, Rueckert D, Fichtinger G, 2019. Handbook of Medical Image Computing and Computer Assisted Intervention. Academic Press. [Google Scholar]
  97. Zhou Z, Shin J, Feng R, Hurst RT, Kendall CB, Liang J, 2019. Integrating active learning and transfer learning for carotid intima-media thickness video interpretation. Journal of digital imaging 32 (2), 290–299. [DOI] [PMC free article] [PubMed] [Google Scholar]
  98. Zhou Z, Shin J, Zhang L, Gurudu S, Gotway M, Liang J, 2017. Fine-tuning convolutional neural networks for biomedical image analysis: actively and incrementally. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7340–7349. [DOI] [PMC free article] [PubMed] [Google Scholar]
  99. Zhou Z, Sodha V, Pang J, Gotway MB, Liang J, 2021. Models genesis. Med. Image Analy 67, 101840. doi: 10.1016/j.media.2020.101840. http://www.sciencedirect.com/science/article/pii/S1361841520302048 [DOI] [PMC free article] [PubMed] [Google Scholar]
  100. Zhou Z, Sodha V, Rahman Siddiquee MM, Feng R, Tajbakhsh N, Gotway MB, Liang J, 2019. Models genesis: Generic autodidactic models for 3d medical image analysis. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. Springer International Publishing, Cham, pp. 384–393. https://link.springer.com/chapter/10.1007/978-3-030-32251-9_42 [DOI] [PMC free article] [PubMed] [Google Scholar]
  101. Zhu J, Li Y, Hu Y, Ma K, Zhou SK, Zheng Y, 2020. Rubik’s cube+: A self-supervised feature learning framework for 3d medical image analysis. Med. Image Analy 64, 101746. [DOI] [PubMed] [Google Scholar]
