Author manuscript; available in PMC: 2022 Jan 1.
Published in final edited form as: IEEE Signal Process Lett. 2021 Apr 9;28:867–871. doi: 10.1109/lsp.2021.3072292

Geometric Analysis of Uncertainty Sampling for Dense Neural Network Layer

Aziz Koçanaoğulları 1, Niklas Smedemark-Margulies 2, Murat Akcakaya 3, Deniz Erdoğmuş 4
PMCID: PMC8224399  NIHMSID: NIHMS1704281  PMID: 34177215

Abstract

For model adaptation of fully connected neural network layers, we provide an information geometric and sample-behavioral analysis of active learning uncertainty sampling objectives. We identify conditions under which several uncertainty-based methods have the same performance and show that such conditions are more likely to appear in the early stages of learning. We define riskier samples for adaptation and demonstrate that, as the set of labeled samples grows, margin-based sampling outperforms other uncertainty sampling methods by preferentially selecting these risky samples. We support our derivations and illustrations with experiments using Meta-Dataset, a benchmark for few-shot learning. We compare uncertainty-based active learning objectives using features produced by SimpleCNAPS (a state-of-the-art few-shot classifier) as input to a fully connected adaptation layer. Our results indicate that margin-based uncertainty sampling achieves performance similar to other uncertainty-based sampling methods with fewer labeled samples, consistent with the geometric analysis.

Keywords: active learning, few-shot learning, information geometry, margin sampling, uncertainty sampling

I. Introduction

Recently, deep neural networks have been commonly used for hypothesis learning in the context of various regression and classification problems. These models require large labeled data sets to achieve good generalization performance [1]. Obtaining large labeled data sets is costly; several approaches exist to overcome this limitation. For example, model adaptation with few samples (informally referred to as zero-, one-, few-shot learning) may enable transformation of a hypothesis model (e.g., a classifier) trained for one specific task to a hypothesis model suitable for another task through minimal changes to its structure and parameters [1], [2]. Model adaptation for deep neural networks between classification tasks is typically achieved by adjusting the last few layers [3], [4]. Active learning is crucial in cases where rapid adaptation is required with limited data labeling resources [5].

Related Work:

As described in Settles's survey [6], active learning objectives often combine the following measures: uncertainty in probability space to select ambiguous samples [7], density in data space (e.g. N nearest neighbors [8]), expected model change (e.g. absolute sum of the gradients [9]), and expected error reduction (e.g. maximizing mutual information [10]). Other proposed methods include influence functions [11] and representer point based selection [12]. In the existing active learning literature, uncertainty sampling methodologies (Entropy Sampling (ES) [13], Confidence Sampling (CS) [14], and Margin Sampling (M) [15]) are often used as baseline comparisons [16], [17] due to their low computational overhead. Existing work usually reports the best-performing uncertainty sampling method based on test performance, but comparisons across (ES), (CS) and (M), and justification of the performance differences of these methods across datasets, are omitted. The fundamental differences among (ES), (CS) and (M) have been studied with extensive experimentation in different domains [18], [19], and uncertainty sampling has been comprehensively studied for parameter estimation in logistic regression models [20]. We suggest that the performance differences among (ES), (CS) and (M) across testing datasets arise from differences in sampling behavior while the models are actively updated; however, none of the existing work provides an explanation for these differences. Here, we aim to explain the sampling behavior differences of uncertainty-based methods from an information geometric perspective, specifically through geometric analysis of the unlabeled sample predictions. We consider a fully connected neural network layer and demonstrate that, by design, (M) performs better at selecting riskier samples (i.e., samples geometrically located in highly uncertain regions of the probability simplex), which enables it to achieve performance similar to (ES) and (CS) with fewer samples.

Contributions:

As mentioned above, uncertainty sampling methods are commonly used, and they have been compared with each other and with other methods only through their performance on different datasets [19], [21]. We propose a novel analytical approach to compare these methods; no such analytical approach exists in the literature. Specifically, (i) we identify conditions under which uncertainty-based methods have equal performance; (ii) we use information geometry to characterize the sample selection behavior of (ES), (CS) and (M), and highlight that (M) selects samples from highly uncertain locations in the probability simplex; (iii) we validate our analysis in a few-shot learning scenario.

Preliminaries:

Let $(\mathcal{X}, \mathcal{Y})$ denote the domain of data and labels over C different classes $\mathcal{H} = \{H_1, H_2, \ldots, H_C\}$. Let $(x \in \mathbb{R}^d, y)$ be a single data example and its corresponding one-hot label vector. To estimate the class label, we fit a parametric model $h_\theta : \mathcal{X} \to \Delta_C$, where $\Delta_C$ is the C-dimensional probability simplex:

$$\Delta_C = \Big\{(p_1, p_2, \ldots, p_C) \in \mathbb{R}^C \ \Big|\ p_i > 0\ \forall i,\ \sum_i p_i = 1\Big\}$$

Model parameters θ are fit by minimizing a loss function $\mathcal{L}(h_\theta(\mathcal{X}), \mathcal{Y})$. In this paper, we specifically focus on the following parameterized model:

$$h_\theta(x) = \mathrm{softmax}\Big(\theta \begin{bmatrix} x \\ 1 \end{bmatrix}\Big),\ \ \theta \in \mathbb{R}^{C\times(d+1)}; \qquad \mathrm{softmax}(a)_i = \frac{e^{a_i}}{\sum_{j=1}^{C} e^{a_j}},\ \ a \in \mathbb{R}^{C} \tag{1}$$

Here θ is the parameter matrix and $\mathrm{softmax} : \mathbb{R}^n \to \Delta_n$ denotes the operator that maps the linearly transformed x to the probability simplex [22]. The hypothesis class in (1) is a linear logistic regression function, i.e., a fully connected neural network layer followed by a softmax nonlinearity.
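The model in (1) can be sketched in a few lines of numpy; this is an illustrative implementation (function names are ours, not the authors' code):

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax: maps a vector in R^n onto the simplex Delta_n.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def h(theta, x):
    # Eq. (1): linear layer with bias (the appended 1) followed by softmax.
    # theta has shape (C, d+1); x has shape (d,); the output lies in Delta_C.
    return softmax(theta @ np.append(x, 1.0))
```

With θ = 0 the logits vanish and the output is the uniform distribution, the center of the simplex.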

In many applications, hypothesis learning can be achieved using a combination of labelled and unlabelled data s.t. $(\mathcal{X}, \mathcal{Y}) = (\{\mathcal{X}_L, \mathcal{X}_U\}, \{\mathcal{Y}_L, \mathcal{Y}_U\})$, where L and U denote the labelled and unlabelled subsets respectively; $\mathcal{Y}_U$ is not available to the learner. We focus on the empirical risk minimization framework, in which optimal parameters for the model h are computed as:

$$\hat{\theta}_L = \arg\min_\theta \frac{1}{|\mathcal{X}_L|} \sum_{(x,y) \in (\mathcal{X}_L, \mathcal{Y}_L)} \mathcal{L}(h_\theta(x), y)$$

For training the last fully connected layer of a classification neural network, minimization of the average cross-entropy loss is considered, $\mathcal{L}(h_\theta(x), y) = H(y, h_\theta(x))$ [23]:

$$\arg\min_\theta H(y, h_\theta(x)) = \arg\min_\theta -\sum_i y_i \log([h_\theta(x)]_i) = \arg\min_\theta -\log([h_\theta(x)]_{\hat{i}}) = \arg\max_\theta\, [h_\theta(x)]_{\hat{i}}, \quad \text{where}\ \hat{i} = \arg\max_j y_j \tag{2}$$
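A minimal numeric check of the reduction in (2): for a one-hot label, the cross-entropy collapses to the negative log-probability of the true class (illustrative sketch):

```python
import numpy as np

def cross_entropy(y, p):
    # H(y, p) = -sum_i y_i log p_i; for one-hot y this equals -log p_{i_hat}.
    return -np.sum(y * np.log(p))

y = np.array([0.0, 1.0, 0.0])   # one-hot label, i_hat = 1
p = np.array([0.2, 0.7, 0.1])   # a model output on Delta_3
assert np.isclose(cross_entropy(y, p), -np.log(p[1]))
```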

Actively learning a model includes an agent that selects anchor samples xaXU according to a sampling objective f, and receives the corresponding label ya from an oracle to update the model:

$$x_a = \arg\max_{x \in \mathcal{X}_U} f(x, \mathcal{X}_L, \mathcal{Y}_L, h_\theta), \qquad \hat{\theta}_{L \cup a} = \arg\min_\theta \frac{1}{|\mathcal{X}_L|} \sum_{(x,y) \in (\mathcal{X}_L \cup x_a,\ \mathcal{Y}_L \cup y_a)} H(y, h_\theta(x))$$
$$(\mathcal{X}_L, \mathcal{Y}_L) \leftarrow (\mathcal{X}_L, \mathcal{Y}_L) \cup (x_a, y_a), \qquad (\mathcal{X}_U, \mathcal{Y}_U) \leftarrow (\mathcal{X}_U, \mathcal{Y}_U) \setminus (x_a, y_a) \tag{3}$$

In (3), f is designed to decrease the loss value as fast as possible by selecting meaningful samples. For f, we consider (ES), (CS) and (M) objectives presented in Table I and the geometry of each method based on their objectives is illustrated in Figure 1.
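The select-query-update cycle of (3) can be sketched as follows. This is an illustrative sketch, not the authors' implementation; `score` and `fit` are placeholder callables standing for an acquisition objective from Table I and a model-fitting routine:

```python
import numpy as np

def active_learning_step(X_L, y_L, X_U, y_U, score, fit, model):
    # One iteration of Eq. (3): pick the unlabelled sample maximizing the
    # acquisition score f, query its label from the oracle, move it to the
    # labelled pool, and refit the model on the enlarged pool.
    a = int(np.argmax([score(model, x) for x in X_U]))
    X_L = np.vstack([X_L, X_U[a:a + 1]])
    y_L = np.append(y_L, y_U[a])          # oracle provides y_a
    X_U = np.delete(X_U, a, axis=0)
    y_U = np.delete(y_U, a)
    model = fit(X_L, y_L)
    return X_L, y_L, X_U, y_U, model
```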

Table I:

Uncertainty sampling methods that form the basis of active learning methodologies.

Identifier  Root        Selection Method                                                                    Ref
(R)         random      x_a = random(X_U)
(ES)        entropy     x_a = argmax_{x ∈ X_U} −Σ_i [h_θ(x)]_i log [h_θ(x)]_i                               [13]
(CS)        confidence  x_a = argmin_{x ∈ X_U} [h_θ(x)]_i, s.t. i = argmax_c [h_θ(x)]_c                     [14]
(M)         margin      x_a = argmin_{x ∈ X_U} [h_θ(x)]_i − [h_θ(x)]_j,
                        s.t. i = argmax_c [h_θ(x)]_c, j = argmax_{c≠i} [h_θ(x)]_c                           [15]
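The objectives in Table I can be expressed as score functions over a posterior vector p, with selection as an argmax of the score (signs are flipped for (CS) and (M) so that argmax of the score matches argmin of the original objectives); an illustrative numpy sketch:

```python
import numpy as np

def entropy_score(p):
    # (ES): Shannon entropy of the posterior; higher means more uncertain.
    return -np.sum(p * np.log(p + 1e-12))

def confidence_score(p):
    # (CS): negative maximum posterior, so argmax picks the least
    # confident sample.
    return -np.max(p)

def margin_score(p):
    # (M): negative gap between the two largest posteriors, so argmax picks
    # the sample closest to a pairwise decision boundary.
    top2 = np.sort(p)[-2:]
    return -(top2[1] - top2[0])
```

A near-uniform posterior scores higher than a peaked one under all three criteria, but the methods diverge on which non-uniform posteriors they prefer.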

Figure 1:

Geometry of the given methods in $\Delta_3$. Panels from left to right show the values of the objective functions of (ES), (CS) and (M) from Table I, respectively.

II. Analysis

The performance differences among (ES), (CS) and (M) selection objectives arise from the sequence of anchor samples xa selected. Through Proposition 1, we show that these objectives have the same performance for 2-class classification. Further analyses then discuss the performance differences among the methods.

Proposition 1.

Given the 2-class case $\mathcal{H} = \{H_1, H_2\}$ and the set $\mathcal{X}$ with $h_\theta(x) = p \in \Delta_2$, then:

$$x_a = \underbrace{\arg\max_{p \in h_\theta(\mathcal{X})} H(p)}_{\text{(ES)}} = \underbrace{\arg\min_{p} \max_i p_i}_{\text{(CS)}} = \underbrace{\arg\min_{p} \Big(\max_i p_i - \max_{j\neq i} p_j\Big)}_{\text{(M)}}$$

Proof.

Let us denote the anchor sampling objectives: (i) $\arg\max_p H(p)$; (ii) $\arg\min_p \max_i p_i$; (iii) $\arg\min_p (\max_i p_i - \max_{j\neq i} p_j)$. For $p \in \Delta_2$ we have $p_{\neq i} = 1 - p_i$, hence

$$\arg\min_p \Big(\max_i p_i - \max_{j\neq i} p_j\Big) = \arg\min_p \big(\max_i p_i - (1 - \max_i p_i)\big) = \arg\min_p \big(2\max_i p_i - 1\big) = \arg\min_p \max_i p_i \implies \text{(ii)} \equiv \text{(iii)}$$

Similarly,

$$\arg\max_p H(p) = \arg\max_p -\sum_i p_i \log p_i = \arg\max_p \big(-p_i \log p_i - (1 - p_i)\log(1 - p_i)\big)$$

WLOG assume $p_i \geq p_{\neq i}$; then $1 \geq p_i \geq 0.5$, and $-p_i \log p_i - (1 - p_i)\log(1 - p_i)$ is monotonically decreasing in $p_i$ on this interval. Hence, given $p, q \in \Delta_2$, $\max_i p_i > \max_i q_i \iff H(p) < H(q)$, so $\arg\max_p H(p) = \arg\min_p \max_i p_i \implies \text{(i)} \equiv \text{(ii)}$. Then (i) ≡ (ii) ≡ (iii). □
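A quick numeric illustration of Proposition 1: on a grid of posteriors in $\Delta_2$, all three objectives select the same sample, the one closest to uniform (illustrative sketch):

```python
import numpy as np

# Posteriors on Delta_2, parameterized by the first coordinate t.
ps = [np.array([t, 1.0 - t]) for t in np.linspace(0.05, 0.95, 19)]

H = [-np.sum(p * np.log(p)) for p in ps]   # (ES): maximized
C = [np.max(p) for p in ps]                # (CS): minimized
M = [np.max(p) - np.min(p) for p in ps]    # (M) in Delta_2: max minus runner-up

# All three objectives pick the same anchor: the posterior closest to uniform.
assert int(np.argmax(H)) == int(np.argmin(C)) == int(np.argmin(M))
```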

In a C-class classification problem, model $h_\theta$ makes a correct decision for a sample-label tuple $(x, y) \in (\mathcal{X}, \mathcal{Y})$ if $\arg\max_i y_i = \arg\max_i [h_\theta(x)]_i$. Therefore, for $h_\theta(x) \in \Delta_C$, the critical boundary for a class i is formed where coordinate i is tied with a single other coordinate j; in other words, $\exists j$ s.t. $[h_\theta(x)]_i = [h_\theta(x)]_j \geq [h_\theta(x)]_k\ \forall k \neq i, j$.

Assume a three-class classification problem as illustrated in Figure 2, where the labels are denoted a, b, c. To correctly classify a, for example, the hypothesis should output a probability vector in which a carries the highest probability mass (blue highlighted area); every probability vector with a as the maximum thus contributes to a correct selection. The figure also visualizes the decision boundary for class a with the blue bold lines. Given p = hθ(x), these lines satisfy ∃j s.t. p_i = p_j ≥ p_k ∀k; in other words, the model is uncertain between two competitors. Points x ∈ X whose p = hθ(x) lies close to these decision boundaries are riskier samples.

Figure 2:

Three-class decision boundaries on $\Delta_3$. The highlighted area corresponds to the region where the probability mass of a in a given probability vector $\in \Delta_3$ is the highest; a hypothesis producing a probability vector within that area therefore correctly classifies a. The dashed red line represents an equi-probability contour of a logistic-normal distribution that passes through the uniform distribution.

(ES) has a frail confidence assessment:

Let $\{x^1, x^2\} \subset \mathcal{X}_U$ with $\arg\max_i y^1_i = \arg\max_i y^2_i = 1$, and $p = h_{\hat\theta}(x^1),\ q = h_{\hat\theta}(x^2) \in \Delta_{10}$ with $p = [.6,\ .4 - 8\epsilon,\ \epsilon, \ldots, \epsilon]$ where $0 < \epsilon \ll 1$, and $q = [.7,\ .0\overline{3}, \ldots, .0\overline{3}]$. It is apparent that q is more confident about the true class; however, since $0.97 \approx H(p) < H(q) \approx 1.83$, (ES) selects q over p. For this example, (M) and (CS) capture the notion of confidence, determined by the maximum posterior probability, and select the sample with lower confidence (p).
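This example can be replicated numerically (entropies in bits, with a small concrete ε standing in for the limit):

```python
import numpy as np

# Concrete instance of the example: p, q in Delta_10.
eps = 1e-6
p = np.array([0.6, 0.4 - 8 * eps] + [eps] * 8)
q = np.array([0.7] + [0.3 / 9] * 9)

H = lambda v: -np.sum(v * np.log2(v))   # Shannon entropy in bits

assert q.max() > p.max()   # q is more confident on its top class...
assert H(p) < H(q)         # ...yet (ES) rates q as more uncertain and selects it
```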

(CS) makes decisions using a single value:

Let $\{x^1, x^2, x^3, x^4\} \subset \mathcal{X}_U$ with $p^1 = h_{\hat\theta}(x^1),\ p^2 = h_{\hat\theta}(x^2),\ q^1 = h_{\hat\theta}(x^3),\ q^2 = h_{\hat\theta}(x^4) \in \Delta_{10}$, where $p^1 = [.5,\ .5 - 8\epsilon,\ \epsilon, \ldots]$, $p^2 = [.6,\ .4 - 8\epsilon,\ \epsilon, \ldots]$ with $0 < \epsilon \ll 1$, and $q^1 = [.5,\ .0\overline{5}, \ldots]$, $q^2 = [.6,\ .0\overline{4}, \ldots]$; we compare the p's, then the q's. Since the confidence in the top class differs by the same amount (.1) in both pairs, (CS) treats the p and q cases identically; yet for the p's the second-best class is still a legitimate competitor, whereas for the q's there is no other competitor. (M) distinguishes these cases by incorporating a second element, selecting the sample closest to a decision boundary ($p^1$).
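Replicating this example numerically (illustrative sketch with a small concrete ε):

```python
import numpy as np

eps = 1e-6
p1 = np.array([0.5, 0.5 - 8 * eps] + [eps] * 8)
p2 = np.array([0.6, 0.4 - 8 * eps] + [eps] * 8)
q1 = np.array([0.5] + [0.5 / 9] * 9)
q2 = np.array([0.6] + [0.4 / 9] * 9)

margin = lambda v: np.sort(v)[-1] - np.sort(v)[-2]   # gap between top two

# (CS) sees the same 0.1 confidence gap in both pairs...
assert np.isclose(p2.max() - p1.max(), q2.max() - q1.max())
# ...while (M) singles out p1, whose runner-up is a genuine competitor.
assert margin(p1) < min(margin(p2), margin(q1), margin(q2))
```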

Sample Behavior:

Ultimately, the differences in performance occur because the data occupy different geometric locations in the simplex (i.e., $h_\theta(x) \in \Delta_C$) relative to the geometric selection regions of the objectives. Consider Figure 3, which illustrates three scenarios of $h_\theta$ outcomes on $\Delta_3$. Given $(x^l, y^l) \in (\mathcal{X}_L, \mathcal{Y}_L)$ and $(x^u, y^u) \in (\mathcal{X}_U, \mathcal{Y}_U)$, suppose $\exists\theta$ s.t. one of the following holds: (s) separation, $\forall(x^l, y^l):\ \arg\max_i [h_\theta(x^l)]_i = \arg\max_i y^l_i$; (qs) quasi-separation, $\forall(x^l, y^l):\ \arg\max_i [h_\theta(x^l)]_i = \arg\max_i y^l_i$ and $\exists(x^l, y^l)$ s.t. $|\arg\max_i h_\theta(x^l)| \geq 2$ (e.g. $h_\theta(x^l) = [0.5, 0.5, 0, \cdots]$, $y^l = [1, 0, \cdots]$); (o) overlap, $\exists(x^l, y^l)$ where $\arg\max_i [h_\theta(x^l)]_i \neq \arg\max_i y^l_i$. Then a stationary point $\hat\theta = \arg\inf_\theta (1/|\mathcal{X}_L|)\sum_{(x,y)\in(\mathcal{X}_L,\mathcal{Y}_L)} H(y, h_\theta(x))$ exists only for (o). We refer the reader to Albert and Anderson [24] for the detailed proof that (s) and (qs) project samples to a two-class decision as shown in Figure 3. Proposition 1 already established that if samples reside on a line, which is the case for (s) and (qs), the sampling methodologies behave identically; hence we further investigate the case of (o). It is empirically known that the performance differences among (CS), (ES) and (M) increase as training progresses, and as discussed above, the performances differ only in the case of (o). With the following proposition, we show that during training, the probability of obtaining the (o) outcome of $h_\theta$ on a dataset increases as more data is incorporated into the labeled pool $\mathcal{X}_L$:

Figure 3:

Separated (s), quasi-separated (qs) and overlapping (o) labelled data $\mathcal{X}_L$ used to determine a stationary point $h_{\hat\theta}$. Observe that (s) and (qs) yield unlabelled samples on lines that force exactly the same selections in (ES), (CS) and (M); (o), on the other hand, allows the selection methods to operate in 2D.

Proposition 2.

Let $h_\theta$ be an arbitrary model and $(\mathcal{X}, \mathcal{Y}) = \{(x, y) \mid x \sim f_y\}$, where $f_y$ is an arbitrary distribution identified with the label y. Let $(\mathcal{X}^1, \mathcal{Y}^1) \subseteq (\mathcal{X}^2, \mathcal{Y}^2) \subseteq (\mathcal{X}, \mathcal{Y})$. Then the probability of (o) in $(\mathcal{X}^2, \mathcal{Y}^2)$ is greater than or equal to the probability of (o) in $(\mathcal{X}^1, \mathcal{Y}^1)$.

Proof.

Let $A := \{\exists\ \text{overlap in}\ (\mathcal{X}^1, \mathcal{Y}^1)\}$ and $B := \{\exists\ \text{overlap in}\ (\mathcal{X}^2, \mathcal{Y}^2) \setminus (\mathcal{X}^1, \mathcal{Y}^1)\}$, so that $A \cup B = \{\exists\ \text{overlap in}\ (\mathcal{X}^2, \mathcal{Y}^2)\}$. Trivially $p(A \cup B) \geq p(A)$. □

In summary, Proposition 2 states that as training progresses, the probability of case (o) increases, which implies that in active learning (CS), (ES) and (M) will start to differ in performance.

Special case with Gaussian data:

For the case of (o), consider samples originating from multivariate Gaussian distributions for a C-class classification, s.t. $\forall (x, y) \in (\mathcal{X}, \mathcal{Y})$, $x \sim \mathcal{N}(\mu_y, \Sigma_y)$. Note that the equi-probability contours are not centered at the center of $\Delta_C$ (the simplex) but at its corners. Since x is normally distributed, linear combinations with a scalar shift of the random variables also follow a normal distribution. Moreover, the inverse of the softmax operator is well approximated by the central logratio transform clr(·) [22], [25]:

$$\mathrm{clr}(p) = \Big[\log\frac{p_1}{g(p)}, \ldots, \log\frac{p_C}{g(p)}\Big], \quad \text{where}\ g(p) = \Big(\prod_i p_i\Big)^{1/C}$$

Hence the outputs of the model defined in (1) follow a logistic-normal distribution, whose pdf is:

$$f_{\Delta_C}(p; m, \Sigma) = J\, |2\pi\Sigma|^{-\frac{1}{2}} \exp\Big(-\frac{1}{2}(p' - m)^\top \Sigma^{-1} (p' - m)\Big), \quad \text{where}\ p' = \mathrm{clr}(p) \tag{4}$$

and J denotes the Jacobian of the transform.

As discussed in [22], it is possible to find equi-probability contours within $\Delta_C$. WLOG assume we are interested in $H_1$ with an ideal $m = [1 - \varepsilon, \varepsilon, \cdots]$, $0 < \varepsilon \ll 1$. In Figure 2 we visualize a sample equi-probability contour with the dashed red line. In terms of the probability geometry, the assessment of riskier samples is therefore not centric, as entropy would suggest; rather, it follows a pattern analogous to the critical boundaries, so a perspective centered at the corners is preferable. Comparing Figures 1 and 2 shows that (M) captures these equi-probable regions, but (CS) and (ES) do not.
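The clr transform and its relation to the softmax can be checked numerically: since the softmax is invariant to additive shifts, applying it to clr(p) recovers p exactly (illustrative sketch):

```python
import numpy as np

def clr(p):
    # Central logratio transform: log of each component over the geometric mean.
    g = np.exp(np.mean(np.log(p)))
    return np.log(p / g)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

p = np.array([0.2, 0.3, 0.5])
assert np.allclose(softmax(clr(p)), p)   # clr inverts the softmax
assert np.isclose(clr(p).sum(), 0.0)     # clr coordinates sum to zero
```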

III. Experiments and Results

Paradigm:

Using a few-shot learning scenario, we compare (ES), (CS) and (M); we also include random selection (R), see Table I. In few-shot learning approaches, model parameters are updated starting from a checkpoint [1], [2]. When adaptation needs to be fast, especially for deep feature models, a backbone already trained on a large labelled dataset (e.g. a ResNet image classifier [26]) is kept constant and an adaptation layer is adjusted to generalize to an unseen task. As the backbone, we use the model presented in [3] and later simplified in [4], which is trained by design to output features that follow a Gaussian distribution. We then actively learn a fully connected layer during adaptation.

Dataset and Experiment:

We use Meta-Dataset [27], a benchmark for few-shot learning and image classification that comprises the following labelled image datasets: ILSVRC-2012 [28], Omniglot [29], FGVC-Aircraft [30], CUB-200-2011 [31], Describable Textures [32], QuickDraw [33], FGVCx Fungi, VGG Flower [34], Traffic Signs [35] and MSCOCO [36]. We follow the train-test splits provided by Meta-Dataset in our experiments. We define 'root-acc' as the test-set accuracy achieved when the entire training label set is available. We evaluate active learning methods by comparing what percentage of the training set must be queried before reaching root-acc. In our experiments, we initialize the system by providing labels for 5% of the training samples and training an initial model: $|\mathcal{X}_L| = 0.05 \times |\mathcal{X}_L \cup \mathcal{X}_U|$. At each iteration, we select a batch of 10 samples to be labelled according to the objectives presented in Table I, and the models are updated to a stationary point.
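One labelling iteration then reduces to picking the top-scoring batch; a minimal sketch where `scores` stands for any objective from Table I evaluated on the unlabelled pool (batch size 10 matches the protocol above):

```python
import numpy as np

def select_batch(scores, k=10):
    # Return indices of the k unlabelled samples with the highest
    # acquisition scores (one labelling batch per iteration).
    return np.argsort(scores)[-k:][::-1]

# Toy example: the two highest-scoring samples are selected.
idx = select_batch(np.array([0.1, 0.9, 0.4, 0.8, 0.2]), k=2)
assert list(idx) == [1, 3]
```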

Results:

Results are presented in Table II. Rows correspond to datasets, with the average over all datasets in the final row. The table is divided into 4 column groups, each denoting a model reaching a pre-determined performance value relative to root-acc. For example, if root-acc = 85%, the column root-acc−15% represents a model achieving a performance ≥ 70%. Each column within a group reports the results of a different sample selection method from Table I. We present the mean and standard deviation of the percentage of the dataset used to achieve the pre-determined performance value; lower mean values indicate that a method requires less data to match the performance of the other sampling methods. The results show that (M) picks the samples that yield the fastest increase in performance. Moreover, we observe that the gap between methodologies widens as the labelled data size increases (reaching higher performance requires more data), consistent with Prop. 2.

Table II:

A fully connected layer on top of the backbone network architecture is used. The average number of training samples is ≈ 355 and the number of test samples is ≈ 90. In the table, root-acc and max-acc refer to the accuracy achieved on the test set when the entire training set is used and the maximum accuracy achieved at any point of active learning, respectively. We report the ratio of training samples used to the cardinality of the entire training set to reach a neighborhood of root-acc; a smaller number therefore represents fewer samples used, and lower is better. It is apparent that margin sampling (M) outperforms the other selection methods, reaching each accuracy band with a smaller percentage of the training set labelled.

root-acc −15% root-acc −10% root-acc −5% root-acc −1%
data-set root-acc (%) (R) (ES) (CS) (M) (R) (ES) (CS) (M) (R) (ES) (CS) (M) (R) (ES) (CS) (M)

aircraft 83 0.24±0.17 0.29±0.20 0.24±0.17 0.23±0.17 0.29±0.17 0.40±0.22 0.27±0.17 0.26±0.17 0.43±0.20 0.62±0.22 0.38±0.18 0.33±0.17 0.64±0.24 0.80±0.18 0.56±0.20 0.47±0.20
cifar-10 78 0.24±0.16 0.37±0.22 0.21±0.16 0.24±0.15 0.31±0.16 0.49±0.23 0.25±0.16 0.24±0.15 0.47±0.20 0.70±0.19 0.36±0.17 0.33±0.16 0.68±0.22 0.85±0.14 0.52±0.21 0.47±0.21
cu-birds 76 0.24±0.17 0.29±0.20 0.24±0.17 0.23±0.17 0.29±0.17 0.40±0.22 0.27±0.17 0.26±0.17 0.43±0.19 0.62±0.22 0.38±0.18 0.33±0.16 0.64±0.24 0.80±0.17 0.56±0.20 0.48±0.20
fungi 40 0.50±0.14 0.50±0.14 0.50±0.14 0.50±0.14 0.50±0.14 0.50±0.14 0.50±0.14 0.50±0.14 0.53±0.15 0.54±0.17 0.52±0.14 0.51±0.14 0.62±0.20 0.65±0.21 0.57±0.16 0.58±0.16
ms-coco 48 0.38±0.12 0.39±0.12 0.38±0.11 0.39±0.12 0.40±0.12 0.44±0.15 0.39±0.12 0.39±0.12 0.52±0.17 0.58±0.20 0.45±0.13 0.44±0.13 0.73±0.20 0.76±0.22 0.62±0.18 0.58±0.17
omniglot 91 0.76±0.15 0.76±0.15 0.76±0.15 0.76±0.15 0.76±0.15 0.76±0.15 0.76±0.15 0.76±0.15 0.76±0.15 0.76±0.15 0.76±0.15 0.76±0.15 0.83±0.11 0.83±0.11 0.80±0.12 0.78±0.12
quickdraw 76 0.44±0.10 0.44±0.11 0.44±0.11 0.44±0.11 0.44±0.10 0.46±0.12 0.44±0.10 0.44±0.10 0.52±0.13 0.60±0.17 0.50±0.11 0.48±0.11 0.78±0.17 0.84±0.15 0.66±0.14 0.63±0.14
traffic-sign 77 0.39±0.12 0.39±0.13 0.38±0.12 0.39±0.12 0.39±0.12 0.41±0.16 0.39±0.12 0.39±0.12 0.42±0.14 0.50±0.24 0.39±0.12 0.39±0.12 0.50±0.19 0.63±0.30 0.41±0.13 0.40±0.12
vgg-flower 91 0.29±0.21 0.30±0.21 0.29±0.21 0.29±0.21 0.30±0.21 0.36±0.25 0.30±0.21 0.29±0.21 0.42±0.20 0.61±0.28 0.33±0.20 0.32±0.20 0.65±0.23 0.85±0.20 0.44±0.20 0.41±0.19

average - 0.38 0.41 0.38 0.38 0.40 0.47 0.40 0.39 0.50 0.61 0.45 0.43 0.67 0.78 0.57 0.53

IV. Conclusion

In this work we analyzed actively adapting a fully connected final layer of a network architecture in a model-adaptation setting. Specifically, we focused on uncertainty sampling methods that are widely used as baselines in active learning tasks. We showed geometrically that the behavior of a fully connected layer and the positioning of samples in the simplex favor margin sampling over other uncertainty-based approaches. Empirically, we validated these claims in a few-shot learning setting with a fully connected adaptation layer. With this knowledge, it is possible to propose proxy gradient methods that leverage the margin instead of selection based on mutual-information surrogates.

Acknowledgement

We would like to thank Jan-Willem van de Meent for valuable input that helped us improve the paper.

This work is supported by NIH (R01DC009834), DARPA (SC1821301), and NSF (CNS-1544895, IIS-1715858, IIS-1717654, IIS-1844885, IIS-1915083).

Contributor Information

Aziz Koçanaoğulları, Northeastern University Department of Electrical and Computer Engineering 409 Dana Research Center 360 Huntington Avenue Boston, MA 02115.

Niklas Smedemark-Margulies, Northeastern University Khoury College of Computer Science, 440 Huntington Ave, Boston, MA 02115.

Murat Akcakaya, University of Pittsburgh, Department of Electrical and Computer Engineering, 1238 Benedum Hall, Pittsburgh, PA 15261.

Deniz Erdoğmuş, Northeastern University Department of Electrical and Computer Engineering 409 Dana Research Center 360 Huntington Avenue Boston, MA 02115.

References

  • [1].Finn Chelsea, Abbeel Pieter, and Levine Sergey. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR.org, 2017. [Google Scholar]
  • [2].Oreshkin Boris, Rodríguez López Pau, and Lacoste Alexandre. Tadam: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pages 721–731, 2018. [Google Scholar]
  • [3].Requeima James, Gordon Jonathan, Bronskill John, Nowozin Sebastian, and Richard E Turner. Fast and flexible multi-task classification using conditional neural adaptive processes. In Advances in Neural Information Processing Systems, pages 7957–7968, 2019. [Google Scholar]
  • [4].Bateni Peyman, Goyal Raghav, Masrani Vaden, Wood Frank, and Sigal Leonid. Improved few-shot visual classification, 2019.
  • [5].Sung Flood, Yang Yongxin, Zhang Li, Xiang Tao, Torr Philip HS, and Hospedales Timothy M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018. [Google Scholar]
  • [6].Settles Burr. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009. [Google Scholar]
  • [7].Lewis David D and Catlett Jason. Heterogeneous uncertainty sampling for supervised learning. In Machine learning proceedings 1994, pages 148–156. Elsevier, 1994. [Google Scholar]
  • [8].Berlind Christopher and Urner Ruth. Active nearest neighbors in changing environments. In International Conference on Machine Learning, pages 1870–1879, 2015. [Google Scholar]
  • [9].Yuan Yang, Chung Soo-Whan, and Kang Hong-Goo. Gradient-based active learning query strategy for end-to-end speech recognition. In ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2832–2836. IEEE, 2019. [Google Scholar]
  • [10].Sourati Jamshid, Akcakaya Murat, Dy Jennifer G, Leen Todd K, and Erdogmus Deniz. Classification active learning based on mutual information. Entropy, 18(2):51, 2016. [Google Scholar]
  • [11].Koh Pang Wei and Liang Percy. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pages 1885–1894, 2017. [Google Scholar]
  • [12].Yeh Chih-Kuan, Kim Joon, Yen Ian En-Hsu, and Ravikumar Pradeep K. Representer point selection for explaining deep neural networks. In Advances in neural information processing systems, pages 9291–9301, 2018. [Google Scholar]
  • [13].Shannon Claude E. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948. [Google Scholar]
  • [14].Lewis David D and Gale William A. A sequential algorithm for training text classifiers. In SIGIR’94, pages 3–12. Springer, 1994. [Google Scholar]
  • [15].Scheffer Tobias, Decomain Christian, and Wrobel Stefan. Active hidden markov models for information extraction. In International Symposium on Intelligent Data Analysis, pages 309–318. Springer, 2001. [Google Scholar]
  • [16].Sener Ozan and Savarese Silvio. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018. [Google Scholar]
  • [17].Wang Keze, Zhang Dongyu, Li Ya, Zhang Ruimao, and Lin Liang. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2016. [Google Scholar]
  • [18].Settles Burr and Craven Mark. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 1070–1079, 2008. [Google Scholar]
  • [19].Yang Yi, Ma Zhigang, Nie Feiping, Chang Xiaojun, and Hauptmann Alexander G. Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision, 113(2):113–127, 2015. [Google Scholar]
  • [20].Schein Andrew I and Ungar Lyle H. Active learning for logistic regression: an evaluation. Machine Learning, 68(3):235–265, 2007. [Google Scholar]
  • [21].Li Mingkun and Sethi Ishwar K . Confidence-based active learning. IEEE transactions on pattern analysis and machine intelligence, 28(8):1251–1261, 2006. [DOI] [PubMed] [Google Scholar]
  • [22].Aitchison John. The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44(2):139–160, 1982. [Google Scholar]
  • [23].Bishop Christopher M. Pattern recognition and machine learning. springer, 2006. [Google Scholar]
  • [24].Albert Adelin and Anderson John A. On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71(1):1–10, 1984. [Google Scholar]
  • [25].Aitchison John. Logratios and natural laws in compositional data analysis. Mathematical Geology, 31(5):563–580, 1999. [Google Scholar]
  • [26].He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [Google Scholar]
  • [27].Triantafillou Eleni, Zhu Tyler, Dumoulin Vincent, Lamblin Pascal, Evci Utku, Xu Kelvin, Goroshin Ross, Gelada Carles, Swersky Kevin, Manzagol Pierre-Antoine, et al. Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096, 2019. [Google Scholar]
  • [28].Russakovsky Olga, Deng Jia, Su Hao, Krause Jonathan, Satheesh Sanjeev, Ma Sean, Huang Zhiheng, Karpathy Andrej, Khosla Aditya, Bernstein Michael, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015. [Google Scholar]
  • [29].Lake Brenden M, Salakhutdinov Ruslan, and Tenenbaum Joshua B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015. [DOI] [PubMed] [Google Scholar]
  • [30].Maji Subhransu, Rahtu Esa, Kannala Juho, Blaschko Matthew, and Vedaldi Andrea. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013. [Google Scholar]
  • [31].Wah Catherine, Branson Steve, Welinder Peter, Perona Pietro, and Belongie Serge. The caltech-ucsd birds-200–2011 dataset. 2011.
  • [32].Cimpoi Mircea, Maji Subhransu, Kokkinos Iasonas, Mohamed Sammy, and Vedaldi Andrea. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014. [Google Scholar]
  • [33].Jongejan Jonas, Rowley Henry, Kawashima Takashi, Kim Jongmin, and Fox-Gieg Nick. The quick, draw!-ai experiment. Mount View, CA, accessed Feb, 17:2018, 2016. [Google Scholar]
  • [34].Nilsback Maria-Elena and Zisserman Andrew. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008. [Google Scholar]
  • [35].Houben Sebastian, Stallkamp Johannes, Salmen Jan, Schlipsing Marc, and Igel Christian. Detection of traffic signs in real-world images: The german traffic sign detection benchmark. In (IJCNN), pages 1–8. IEEE, 2013. [Google Scholar]
  • [36].Lin Tsung-Yi, Maire Michael, Belongie Serge, Hays James, Perona Pietro, Ramanan Deva, Dollár Piotr, and Zitnick C Lawrence. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. [Google Scholar]
