Author manuscript; available in PMC: 2022 Jan 1.
Published in final edited form as: Multimed Tools Appl. 2020 Sep 9;80(2):1687–1706. doi: 10.1007/s11042-020-09691-y

DrsNet: Dual-resolution Semantic Segmentation with Rare Class-Oriented Superpixel Prior

Liangjiang Yu 1, Guoliang Fan 1
PMCID: PMC7988710  NIHMSID: NIHMS1627718  PMID: 33776547

Abstract

Rare-class objects in natural scene images, which are usually small and infrequent, often convey more important information for scene understanding than the common ones. However, they are often overlooked in scene labeling studies for two main reasons: low occurrence frequency and limited spatial coverage. Many methods have been proposed to enhance overall semantic labeling performance, but only a few consider rare-class objects. In this work, we present a deep semantic labeling framework with special consideration of rare classes via three techniques. First, a novel dual-resolution coarse-to-fine superpixel representation is developed, where fine and coarse superpixels are applied to rare classes and background areas, respectively. This dual representation allows seamless incorporation of shape features into integrated global and local convolutional neural network (CNN) models. Second, shape information is directly involved during CNN feature learning for both frequent and rare classes from the re-balanced training data, and is also explicitly involved during inference. Third, the proposed framework incorporates both shape information and the CNN architecture into semantic labeling through a fusion of probabilistic multi-class likelihoods. Experimental results demonstrate competitive semantic labeling performance on two standard datasets both qualitatively and quantitatively, especially for rare-class objects.

1. Introduction

The goal of scene labeling is to assign a semantic label to each pixel in an image, leading to simultaneous segmentation and recognition for scene understanding [19,23,24]. It is one of the most important problems in computer vision and pattern recognition and plays a significant role in many applications, such as robotics [10], autonomous driving [11], multimedia retrieval [5], and e-commerce [12]. In most natural scene images, rare-class objects, which are usually small and less frequent, can easily be neglected when achieving high overall accuracy is the main goal [13]. However, rare-class objects (e.g., human, boat, vehicle, etc.) are often important for image understanding. A few deep learning approaches were recently proposed to emphasize rare-class objects, most of which encourage a balanced data distribution among all classes during training. This effect is obtained by various sampling methods [13–15], or by assigning different weights to rare classes during the training process [16]. Scene information was also applied to assist label transfer of rare classes [17], so that irrelevant rare classes can be skipped for scene-aware semantic labeling. Some methods merge multiple model components to enhance rare class learning [3,18]. A regional latent semantic dependencies model (RLSD) was proposed to first localize potential regions with multiple labels, and then apply a recurrent neural network to characterize the latent semantic dependencies at the regional level [5]. On the other hand, superpixel-based convolutional neural networks (CNNs) have been applied to edge-constrained learning, including feature masking for foreground object labeling [19,20], and to salient object detection by learning hierarchical contrast features [21]. Saliency detection has been integrated into semantic labeling through combined multi-class likelihood and edge-constrained binary belief [22]. These methods achieve promising semantic labeling performance, but object boundary information is only indirectly and implicitly involved during CNN learning.

In this work, we aim to develop a comprehensive scene labeling framework with special consideration of rare classes. Due to the sparsity of rare-class objects, it is often hard to balance the performance between rare classes and frequent ones. Our research is motivated by three main ideas. First, object boundaries can be particularly helpful for accurate segmentation during semantic labeling, but it is difficult and computationally expensive to apply the same fine-grained boundary treatment to both large background areas and small rare-class objects in natural scene images. Therefore, we develop a dual resolution representation that flexibly and jointly represents large background areas and small objects in a unified framework. Second, we incorporate dual resolution superpixels into a unified coarse-to-fine (c2f) learning and inference pipeline. Third, we further improve the detection of rare classes by integrating a rare-class-balanced CNN to mitigate the unbalanced training data problem. As a result, we propose a deep dual-resolution semantic segmentation framework (DrsNet) with rare class-oriented c2f superpixel representation (as shown in Fig. 1), which not only integrates multi-class likelihoods and two-pass global-to-local label transfer, but also supports c2f segmentation to capture rare-class features with computational efficiency. Specifically, this work has the following contributions:

Fig. 1.

Fig. 1

Illustration of the dual resolution superpixel representation: (a) an input image; (b) coarse superpixel segmentation where two rare-class objects (human) are mixed in the background; (c) fine superpixel segmentation that is overly complex for background; (d) the dual resolution c2f representation where the background area and rare-class objects are represented by coarse and fine superpixels, respectively.

  • Shape information is directly and explicitly incorporated into the CNN learning and label transfer to produce semantically meaningful and accurate segmentation of background areas and rare-class objects.

  • The dual resolution superpixel representation supports c2f learning and inference, and it also balances between detailed feature extraction from small rare-class objects and the overall computational load in large background areas.

  • Rare class labeling and boundary detection are integrated probabilistically together, by fusing multi-class likelihoods from two superpixel CNNs trained at two different resolutions.

The rest of this paper is organized as follows. In Section 2, we first briefly review recent scene labeling methods, and then discuss algorithms that pay special attention to rare-class objects or use edge information to assist labeling. Section 3 presents the formulation of the proposed scene labeling framework (DrsNet), which involves the dual resolution representation and c2f CNN training and inference. In Section 4, we present the proposed DrsNet framework in detail, including the c2f superpixel representation, superpixel-based CNN learning, scene-assisted label transfer, and the pseudo code of the DrsNet framework. Section 5 shows the experimental results on two standard datasets, where DrsNet achieves competitive semantic labeling performance, especially for rare-class objects, compared with state-of-the-art algorithms.

2. Related Work

In this section, we briefly discuss related studies in the field of scene understanding and semantic labeling. First, we review traditional methods using hand-crafted features, including non-parametric and parametric models. Then we introduce recent deep learning methods that achieve state-of-the-art performance through more effective feature learning and extraction. Finally, we discuss recent studies that specifically focus on rare-class objects under imbalanced class distributions.

2.1. Traditional Approaches

Traditional scene labeling methods can be roughly grouped into two categories [15,60], i.e., non-parametric and parametric methods. Non-parametric methods typically involve image retrieval and label transfer, and the labels of pixels are predicted by integrating contextual or spatial dependencies during inference. Many of these methods have achieved promising performance for scene labeling [38–45]. These methods tend to filter out irrelevant scenes in the training data by global contextual constraints, and the multi-class likelihoods are computed by semantically transferring labels from the retrieved images to the query ones. Previous studies usually use low-level or mid-level hand-crafted features based on intensity, color, texture, and so on. However, it is very hard to distinguish pixels of different objects and capture high-level object correlations using only low- or mid-level features, which also often suffer from high dimensionality. Instead of using all features from the whole training set during inference, which can be troublesome when dealing with extremely large datasets, parametric methods have been widely used to model the joint probabilities of neighboring labels based on spatial dependencies. For example, Markov random fields (MRF) [28,29] and conditional random fields (CRF) [30–34] are able to enforce labeling coherence of objects across neighboring regions. Later, graphical models were incorporated into CNNs to mitigate boundary problems [70–72]. However, these graphical-model-based methods usually require extensive effort to learn all the parameters [35].

2.2. Deep Learning Methods

Deep learning methods have shown very promising semantic labeling performance due to their capability of learning discriminative features, where the receptive fields of the neurons in the convolution layers are cascaded to implicitly capture contextual information [36]. Such contextual knowledge is important for understanding and capturing local and global pixel dependencies [25–27]. For example, a multi-scale CNN was used to overcome the limitations of hand-crafted features [14]. Many studies have proposed to learn mappings from low-resolution feature maps to high-resolution ones for accurate prediction [64,65]. CNNs have also been transformed into fully convolutional networks (FCN) for significant performance improvement [66]. Dilated convolutions have been demonstrated to support exponential expansion of the receptive field without loss of resolution or coverage [67]. A Context Encoding Module was introduced to capture the semantic context of scenes and selectively highlight class-dependent feature maps [68]. Scene names of images and label map statistics of image patches have been exploited to create label hierarchies for learning better deep feature representations [60]. A multi-level neighboring semantic labeling framework has been proposed for superpixel-based segmentation [63]. On the other hand, recurrent neural networks (RNN) have also shown very promising labeling performance by capturing long-range pixel dependencies [35,62]. Specifically, intra-layer recurrent connections were introduced to integrate contextual information in the 2D space explicitly [37]. Images have been represented by an undirected cyclic graph (UCG) [50,53], which is decomposed into several directed acyclic graphs (DAGs), and DAG-structured RNNs have been applied to model pixel dependencies. Moreover, a 2D long short-term memory (LSTM) network has been investigated for natural scene images, taking into account the complex spatial dependencies of labels [25]. A graph-based LSTM was proposed to model the dependencies among different superpixels [69]. Furthermore, one can also combine contextual information with a deep learning framework for promising labeling performance. For example, label transfer has been incorporated into a parametric CNN framework [46], where local ambiguities from CNN models are alleviated by global scene semantics, and scene-relevant class dependencies and priors are transferred by matching CNN features. Similarly, a hierarchical Parsing Net (HPN) was proposed to leverage global scene semantic information and the context among multiple objects [4].

2.3. Scene Labeling with Rare-class Objects

There are two main approaches to dealing with rare-class objects under unbalanced class distributions: model fusion and data augmentation/expansion. Both were developed to find more discriminative representations among different classes. For example, the retrieval set was expanded by explicitly adding superpixels of rare classes [13], and a global semantic context was developed to refine the retrieval and superpixel matching. Similarly, random sampling was used to expand the retrieval set with extremely rare-class objects [46]. Rare class representations were enhanced by merging classification costs of different contextual models [3]. A reverse sampling technique can also be used to encourage balanced training [61]. An ensemble of Support Vector Machines (SVMs) was proposed where each SVM is trained for a single class [18]. Interestingly, two ideas were integrated to enhance and enrich rare class training [17]. The first is scene-assisted rare class retrieval, which selectively augments rare class samples based on the global scene to reduce class ambiguities. The second is a rare class-balanced CNN that is specifically trained for areas with rare-class objects for regional re-inference. In most CNN-based frameworks, however, edge information or object boundaries are not fully involved during patch-based training. Consequently, the shapes of rare-class objects may not be well preserved during inference and labeling. An edge-aware CNN integration framework was proposed for scene labeling with emphasis on rare classes, under the assumption that rare classes are often locally salient [22]. That framework involves not only multi-class local and global beliefs, but also a foreground belief map obtained through a superpixel-based CNN specially trained for rare classes. Building on these prior works, we further integrate global scene information with local ambiguities along with explicit incorporation of shape information of rare classes in a unified CNN framework.

3. Formulation

We present an integrated framework called DrsNet that involves dual resolution superpixel representation for scene labeling, as generally depicted in Fig. 2. This proposed framework is intended to semantically label rare-class objects more accurately while preserving the performance for frequent-class objects and background areas. Unlike using superpixels for post-processing [17], here superpixels are involved in an earlier stage during CNN training, where edge information is directly used in feature learning. Moreover, we develop a dual resolution c2f superpixel representation to incorporate edge information during CNN learning and inference (Fig. 2) where two CNNs are involved sequentially. The first one is trained using coarser superpixels, denoted as CNN-c, while the second one is trained from finer superpixels with rare class balanced data distribution, denoted as CNN-f. In this section we will present the problem formulation of the proposed DrsNet. For convenience, we list major notations from this Section in Table 1.

Fig. 2.

Fig. 2

The c2f superpixel representation for multi-class likelihood. From left to right, (1) an image patch is passed into the coarse CNN, (2) the network outputs CNN features, (3) coarse superpixel segments are applied to mask the features for subsequent FC and Softmax layers, (4) an intermediate labeling map gives initial guidance if there are rare classes in the patch, (5) pass the patch to another CNN along with finer superpixel segmentation masking for final labeling.

Table 1.

List of major notations from Section 3

ξi The ith coarse superpixel in an Image
ξi,j The jth fine-grained superpixel segmented from ξi
yi* Class label of ξi
yi,j* Class label of ξi,j
CNN-c CNN trained using coarse superpixel from naturally distributed data
CNN-f CNN trained using fine-grained superpixel from rare-class balanced data
E(ξi, yi) Integration of class likelihood for coarse superpixel ξi
PL(ξi, yi) Local belief calculated from CNN-c training
PG(ξi, yi) Global belief calculated from label transfer
PF(ξi,j, yi,j) Fine-grained superpixel local belief obtained by passing ξi,j into CNN-f
E(ξi,j, yi,j) Integration of class likelihood for superpixel ξi,j

3.1. Coarse Superpixel Belief

Suppose an image x is represented by N superpixels obtained by the simple linear iterative clustering (SLIC) algorithm [47]. Let ξi, i = 1,2,⋯ ,N, denote a coarse superpixel, and yi ∈ {1,2,⋯ ,L} its label from L classes. Each ξi can be further segmented into K fine-grained superpixels ξi,j, j = 1,2,⋯ ,K. The proposed DrsNet framework consists of two-stage energy functions calculated from coarse-to-fine (c2f) superpixel-based belief maps. The first stage incorporates two belief maps, from local to global understanding of the scene,

y_i^* = \arg\min_{y_i \in \{1,\dots,L\}} E(\xi_i, y_i), \quad (1)

where E(ξi,yi) is an integration of class likelihood. First, the local belief is achieved by normalized class-likelihood within each superpixel using CNN feature masking, which is then incorporated into superpixel-wise feature matching for global label transfer,

E(\xi_i, y_i) = -\left( P_L(\xi_i, y_i) + P_G(\xi_i, y_i) \right), \quad (2)

where PL(ξi,yi) is the local belief computed by the likelihood from the first network, CNN-c, which is trained using naturally distributed data with coarse superpixel masking on CNN features [19,46,48]. The global belief PG(ξi,yi) is computed by label transfer using statistical features from the scene-assisted retrieval set [17] (to be discussed in Section 4.4).

3.2. Fine-grained Superpixel Belief

Suppose we further segment a rare-class superpixel ξi into smaller sub-superpixels, denoted as ξi,j. We can then obtain the second-stage class likelihood from the second network, CNN-f, which is trained on a subset of the training data with a balanced class distribution for rare classes, together with denser superpixel segmentation for CNN feature masking. In a similar fashion to CNN-c, CNN-f involves feature masking by finer superpixels. Specifically, this fine-grained superpixel belief evaluates the superpixel-based class likelihood of yi,j ∈ {1,⋯ ,L} for each sub-superpixel ξi,j via CNN-f,

y_{i,j}^* = \arg\min_{y_{i,j} \in \{1,\dots,L\}} \left[ -P_F(\xi_{i,j}, y_{i,j}) + E(\xi_{i,j}, y_{i,j}) \right], \quad (3)

where PF(ξi,j,yi,j) is the normalized belief obtained by passing the superpixel ξi,j into CNN-f [17]. Consequently, finer sub-superpixels are applied in the potential rare-class object areas determined by E(ξi,yi), on top of which the superpixel-wise CNN-f belief map is integrated. For sub-superpixels, E(ξi,j,yi,j) is simply inherited from E(ξi,yi) of the previous stage for ease of computation. In the following section, we will detail the technical steps of the DrsNet framework.
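To make the two-stage inference concrete, the following minimal numpy sketch shows how the coarse and fine beliefs could be combined under Eqs. (1)–(3). The array shapes, the function names, and the negative-sum energy convention (reconstructed above) are assumptions made for illustration rather than the authors' released implementation.

```python
import numpy as np

def coarse_stage(P_L, P_G):
    """Eqs. (1)-(2): combine local (CNN-c) and global (label transfer) beliefs.
    P_L, P_G are (N, L) arrays of per-class likelihoods for N coarse superpixels."""
    E = -(P_L + P_G)                      # lower energy = stronger combined belief
    return E, np.argmin(E, axis=1)        # energy map and first-stage labels y_i*

def fine_stage(P_F, E_parent):
    """Eq. (3): label fine sub-superpixels inside one rare-class coarse superpixel.
    P_F is (K, L) from CNN-f; E_parent is the (L,) energy inherited from stage one."""
    return np.argmin(-P_F + E_parent[None, :], axis=1)

# Toy usage: 4 coarse superpixels, 5 classes, 3 sub-superpixels inside superpixel 2.
rng = np.random.default_rng(0)
E, y_coarse = coarse_stage(rng.random((4, 5)), rng.random((4, 5)))
y_fine = fine_stage(rng.random((3, 5)), E[2])
```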

4. Methodology

4.1. Algorithm Overview

The DrsNet framework is detailed in Fig. 3. An image is first segmented into coarse superpixels, and image patches are passed into a truncated CNN-c to obtain the corresponding features at the coarse superpixel level. Then, feature masking is applied on the CNN-c features, which are fed into the fully connected layer to compute softmax class likelihoods. Simultaneously, superpixel-wise feature matching is implemented for global label transfer using the scene-assisted retrieval set [17], which provides superpixel-wise global belief. Based on the integration of these two belief maps, we further apply finer superpixels around possible rare class areas for an additional class-balanced CNN-f classification, which is similar to the first CNN but trained using finer sub-superpixels from rare class balanced data. In the following, we detail the major technical steps of the proposed DrsNet. First, we present superpixel-based CNN learning, which incorporates edge information for edge-aware feature learning rather than following the traditional patch-based CNN training paradigm. Second, we focus on our c2f dual resolution segmentation strategy for semantically labeling rare classes in an accurate and efficient way.

Fig. 3.

Fig. 3

The computational flow of the proposed DrsNet framework. (1) Obtaining CNN features by CNN-c trained on dataset from original class distribution; (2) local belief computation by coarse superpixel-based feature masking; (3) global belief by scene assisted image retrieval using CNN features, followed by label transfer; (4) rare class object localization by combining coarse global and local beliefs; (5) refined local belief by CNN-f and feature masking based on fine superpixels; (6) final belief map: Integration of coarse-to-fine superpixel-based global and local beliefs.

4.2. Superpixel-based CNN Learning

Although traditional pixel-wise CNN learning together with label transfer methods can achieve promising global labeling results [46], they usually face challenges in preserving object shapes around boundaries. To achieve accurate rare class labeling while handling boundaries and shapes explicitly, we develop a superpixel-based CNN learning scheme, as depicted in Fig. 4, which utilizes object boundaries as prior knowledge for network learning, and hence produces accurate labeling and semantically meaningful segmentation boundaries with the help of superpixels.

Fig. 4.

Fig. 4

The network structure for superpixel-based CNN learning (The dashed line means that the mask only serves as another input to the network after the last max pooling layer).

As shown in Fig. 4, shape information is exploited via convolutional feature masking to extract segment features during learning [19], instead of extracting features from masked image regions, which may produce artificial boundaries. However, applying multi-scaled inputs might negatively affect rare-class objects due to their very limited spatial coverage or low occurrence in an image. Therefore, to learn useful features from both background areas and rare-class objects, we use superpixel-based segments located near the center of small patches instead of multi-scaled images. Specifically, superpixel-based CNN training has three steps: (1) we segment the training images into N superpixels, where each pixel falls into a specific superpixel (Fig. 5(a)); (2) we extract an image patch around each superpixel for training (Fig. 5(b)); (3) after the last CNN layer, we mask the features using the superpixel boundaries (Fig. 5(c)), i.e., we keep only the activations within each superpixel segment for the labeling performed in the FC and Softmax layers. A rough illustrative sketch of this masking step is given below.
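In the sketch, a truncated-CNN feature map is masked with each superpixel's support before pooling it for the FC/Softmax stage. The SLIC and resize calls are from scikit-image; the `truncated_cnn` callable, the patch size, and the mean pooling are placeholder assumptions, not the exact training code.

```python
import numpy as np
from skimage.segmentation import slic
from skimage.transform import resize

def superpixel_masked_features(image, truncated_cnn, n_segments=200, patch=64):
    """Mask truncated-CNN features with each superpixel's support (Section 4.2).

    `truncated_cnn(crop) -> (h, w, T)` is an assumed callable; returns a list of
    (superpixel_id, pooled masked feature vector) pairs for the FC/Softmax stage.
    """
    segments = slic(image, n_segments=n_segments, compactness=10)   # coarse superpixels
    feats = []
    for sp in np.unique(segments):
        ys, xs = np.nonzero(segments == sp)
        cy, cx = int(ys.mean()), int(xs.mean())                      # patch centered on superpixel
        y0, x0 = max(cy - patch // 2, 0), max(cx - patch // 2, 0)
        crop = image[y0:y0 + patch, x0:x0 + patch]
        mask = (segments[y0:y0 + patch, x0:x0 + patch] == sp).astype(float)
        fmap = truncated_cnn(crop)                                   # (h, w, T) feature map (assumed)
        m = resize(mask, fmap.shape[:2], order=0)                    # mask at feature resolution
        masked = fmap * m[..., None]                                 # keep in-segment activations only
        feats.append((sp, masked.mean(axis=(0, 1))))                 # pooled vector for FC/Softmax
    return feats
```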

Fig. 5.

Fig. 5

Superpixel-based CNN learning: (a) an image in superpixels; (b) image patches used for CNN training; (c) superpixel feature masks.

4.3. Coarse-to-fine (c2f) Superpixel Representation

In natural scene images, background areas usually have a smooth distribution of pixel intensities or a specific texture pattern, whereas small rare-class objects usually contain strong or unique edge features. The dual resolution superpixel representation provides a flexible and efficient trade-off, as shown in Fig. 1. Unlike using superpixels for post-processing [17], where edge information is treated as a binary belief from saliency detection, the c2f superpixel representation efficiently captures accurate edge features for rare classes without over-complicating the computation in background areas. Correspondingly, the proposed DrsNet method incorporates two-level superpixel-based edge information into the learning and inference of multi-class likelihoods. As shown in Fig. 2, a superpixel-masked image is first passed into the CNN-c trained using the natural class distribution, and after the computation of an integrated belief, finer superpixel segmentation is applied if there are potential rare-class objects in the query areas. These finer superpixels are then passed into the second CNN, CNN-f, to compute an additional rare class belief map on top of the first CNN and label transfer belief maps. When dealing with finer superpixels for rare classes, we may upsample a patch if it is smaller than the desired network input, as shown in Fig. 4. A sketch of how the dual-resolution superpixel maps can be generated follows.
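The sketch below builds the dual-resolution map with coarse SLIC superpixels everywhere and finer superpixels regenerated only inside coarse segments flagged as potential rare-class areas. The segment counts and the rare-class mask are illustrative assumptions, and the `mask` argument of SLIC requires a recent scikit-image release.

```python
import numpy as np
from skimage.segmentation import slic

def c2f_superpixels(image, rare_mask, n_coarse=100, n_fine_per_region=50):
    """Dual-resolution representation (Fig. 1): coarse SLIC everywhere, finer SLIC
    only inside coarse superpixels that touch `rare_mask` (potential rare-class pixels).

    Returns (coarse, fine) label maps; fine ids are offset to avoid collisions.
    Superpixel counts are illustrative, not the paper's settings.
    """
    coarse = slic(image, n_segments=n_coarse, compactness=10)
    fine = coarse.copy()
    offset = int(coarse.max()) + 1
    for sp in np.unique(coarse[rare_mask]):
        region = coarse == sp
        sub = slic(image, n_segments=n_fine_per_region, compactness=10, mask=region)
        fine[region] = sub[region] + offset                  # re-label the rare-class region
        offset += int(sub.max()) + 1
    return coarse, fine
```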

4.4. Scene-assisted Retrieval

To perform label transfer, the most popular approach is to find a subset of the training data that has a similar scene layout to the observation. However, it is very common that class distributions are highly unbalanced (Fig. 6). More importantly, rare-class objects may not appear in a retrieval set where mostly global scenes are preserved, since they are often observed with relatively low occurrence frequencies and small spatial coverage compared with background areas, which usually results in misclassification of rare classes [17]. It is possible to use random selection [13,46] among all rare classes; however, this may introduce scene-irrelevant objects and complicate inference and computation. For example, in an ocean scene, it may not be necessary to consider adding pixels from a sedan. As briefly summarized in Table 2, scene information plays an important role in selecting relevant rare class candidates because different scenes have varied distributions and frequencies of object classes. A scene-assisted rare class retrieval method [17] adopts a non-parametric model for global label transfer, where scene information is incorporated to ensure that the distribution of rare classes follows the same distribution as they appear in the training dataset under each scene category. But since all feature matching is patch-based and there is no constraint on object boundaries, it is still difficult to distinguish local similarities between different objects. Therefore, unlike pixel-wise retrieval [17], in this work we explore scene-assisted label transfer with the support of edge information and the c2f superpixel representation.

Fig. 6.

Fig. 6

Unbalanced class distribution in typical natural scene images [38], where most rare classes have very low occurrence frequency in the entire dataset.

Table 2.

Selected class (rows) frequencies in % under eight scene categories (columns) [38].

Coast Forest Hwy City Mtn Country Street T-bldg
Bldg 0.38 0.17 2.32 58.17 0.18 0.52 40.91 60.03
Car <0.01 0.02 2.37 2.44 <0.01 <0.01 6.26 0.28
Person 0.04 0.08 0.01 0.31 0.09 0.02 0.94 0.06
Boat 0.13 <0.01 0.00 0.03 0.00 0.02 0.04 0.06
Sidewalk 0.00 0.00 0.44 3.24 0.00 0.00 4.32 0.25
Road 0.04 0.19 36.08 8.56 0.23 0.51 24.52 0.76
Field 0.00 0.07 1.42 0.00 0.16 19.05 0.00 0.00

(Hwy=Highway, Mtn=Mountain, T-bldg=Tall buildings)

Suppose we have trained a CNN model CNN-c using the training dataset X = {x1, x2,⋯ , xM} with ground truth labels Y = {y1,y2,⋯ ,yM}, ym ∈ {1,2,⋯ ,L}^(H×W), where H and W are the image height and width, M is the number of training images, and L is the number of classes. Suppose a training image xm is segmented into N superpixels; we compute a superpixel-wise feature tensor um ∈ ℝ^(H×W×T) by passing xm through the truncated CNN (without the Softmax layer), as discussed in Section 4.2, where T is the number of feature channels determined by the output of the truncated CNN. Then the global feature zm = [zm(1), zm(2), ⋯ , zm(N)] is obtained by averaging over the pooling regions [46], as sketched below.
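A minimal sketch of that pooling step, assuming the superpixel label map has already been brought to the same spatial resolution as the truncated-CNN feature map (that alignment step is omitted):

```python
import numpy as np

def superpixel_pooled_features(feature_map, segments):
    """Average a truncated-CNN feature map (H, W, T) over each superpixel to form the
    global descriptor z_m = [z_m(1), ..., z_m(N)] used for retrieval (Section 4.4).
    `segments` is an (H, W) superpixel label map aligned with the feature map."""
    ids = np.unique(segments)
    return np.stack([feature_map[segments == sp].mean(axis=0) for sp in ids])
```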

Given a query image xq, we can similarly obtain its CNN feature zq, and by matching zq with the CNN features {z1,z2,⋯ ,zM} in the training set, we find a retrieval set O with several exemplars, which makes it possible to obtain the scene category p as the dominant scene category in this retrieval subset. Let Lr denote the number of rare classes; we then obtain an Lr × 1 occurrence frequency vector fp of the Lr rare-class objects under scene p. During scene-assisted sampling, we obtain the rare class ratio vector vp by normalizing fp, and vp provides the proportion of rare classes in the resulting retrieval set under scene p. As a result, the scene-assisted rare class retrieval set Bp has a rare class distribution that follows vp. The number of superpixel samples Sr required for rare class r is then given by

S_r = \min\left( N_d \, v_p(r), \; N_r \right), \quad (4)

where Nd is the average number of superpixels from dominant or common classes in O, such as sky and sea; vp(r) is the rth element of vp, representing the occurrence frequency of rare class r under scene p; and Nr is the total number of superpixels belonging to rare class r over the whole training set. By sampling Sr superpixels for rare class r from the training set, we obtain a new retrieval set Zp = O ∪ Bp. This new retrieval set encourages a well-balanced data distribution between common classes and scene-relevant rare classes.
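Eq. (4) translates directly into code; the toy frequencies and counts below are made up purely for illustration.

```python
import numpy as np

def rare_class_sample_counts(f_p, N_r, N_d):
    """Eq. (4): S_r = min(N_d * v_p(r), N_r) superpixels to sample per rare class r.

    f_p : (Lr,) occurrence frequencies of rare classes under the retrieved scene p
    N_r : (Lr,) total superpixels available per rare class in the training set
    N_d : average number of superpixels of dominant classes (e.g., sky, sea) in O
    """
    v_p = f_p / f_p.sum()                         # normalized rare-class ratio vector
    return np.minimum(N_d * v_p, N_r).astype(int)

# Toy usage with three rare classes; B_p is then drawn with these counts and merged
# with the original retrieval set: Z_p = O ∪ B_p.
print(rare_class_sample_counts(np.array([0.6, 0.3, 0.1]), np.array([40, 15, 5]), N_d=50))
```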

4.5. Label Transfer

Similar to [17] but at the superpixel level, the global belief PG(ξi,yi) could be obtained by statistical features in the retrieval set Zp,

P_G(\xi_i, y_i) = \frac{\sum_k D(\xi_i, \xi_k)\, \delta(y_k = y_i)}{\sum_k D(\xi_i, \xi_k)}, \quad \xi_k \in Z_p, \quad (5)

where ξk is the kth nearest neighbor of superpixel ξi within Zp obtained by matching CNN features, yk is the ground-truth label of superpixel ξk in the retrieval set (the majority label within each superpixel), δ(True) = 1 is an indicator function, and L is the number of classes. D(·,·) measures the similarity between two superpixels,

D(\xi_i, \xi_k) = \prod_{\omega \in \{f, z, b\}} \exp\left( -\alpha^{(\omega)} \left\| \psi_i^{(\omega)} - \psi_k^{(\omega)} \right\| \right), \quad (6)

where ψi(f) is the CNN feature corresponding to superpixel ξi; ψi(z) is the coordinate of the center of superpixel ξi along the vertical direction of the image; and ψi(b) is the size of the image blob in which superpixel ξi is located (upon labeling using CNN-c). This encourages larger distinguishable distances between rare and dominant classes. α(ω) (ω ∈ {f,z,b}) are the weights that control the trade-off among the three features based on empirical settings [46]. For convenience, we list the major notations from this section in Table 3; a minimal sketch of this label transfer step follows the table.

Table 3.

List of major notations from Section 4

Lr The number of rare classes
vp A normalized Lr × 1 occurrence frequency vector of rare-classes under scene p
Sr The number of superpixel samples required for rare class r
O The original retrieval set
Bp Scene assisted retrieval set with rare class distribution that follows vp
Zp The new retrieval set under scene p
D(ξi, ξk) Similarity between two superpixels ξi and ξk
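As referenced above, here is a minimal sketch of the label transfer step in Eqs. (5)–(6). The dictionary-based feature interface, the neighbor count k, and the multiplicative form of Eq. (6) as reconstructed above are our assumptions.

```python
import numpy as np

def similarity(psi_i, psi_k, alpha):
    """Eq. (6) (as reconstructed above): exponential affinities over the CNN feature
    (f), vertical position (z) and blob size (b) cues, combined multiplicatively."""
    s = 1.0
    for w in ('f', 'z', 'b'):
        d = np.linalg.norm(np.atleast_1d(psi_i[w] - psi_k[w]))
        s *= np.exp(-alpha[w] * d)
    return s

def global_belief(psi_q, retrieval_feats, retrieval_labels, alpha, num_classes, k=20):
    """Eq. (5): P_G for one query superpixel from its k most similar superpixels in Z_p.
    retrieval_feats is a list of {'f','z','b'} dicts; retrieval_labels their majority labels."""
    sims = np.array([similarity(psi_q, psi_k, alpha) for psi_k in retrieval_feats])
    nn = np.argsort(-sims)[:k]                       # k nearest neighbors by similarity
    P_G = np.zeros(num_classes)
    for idx in nn:
        P_G[retrieval_labels[idx]] += sims[idx]
    return P_G / max(sims[nn].sum(), 1e-12)          # normalize as in Eq. (5)
```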

4.6. Algorithm Implementation

In this section, we summarize the specific algorithms discussed above. First, Algorithm 1 presents the superpixel-based and scene-assisted rare class retrieval discussed in Section 4.4 that is used for label transfer. The key is to obtain scene-relevant rare classes along with other frequent ones in the retrieval set. Algorithm 2 details the proposed DrsNet. Given a group of training data {x1,x2,⋯ ,xM} with the associated pixel-wise labels {y1,y2,⋯ ,yM}, we train two CNNs on the dual-resolution superpixel representations using different training data. The first, CNN-c, is learned from coarsely segmented superpixels with the natural data distribution (as shown in Fig. 5); the second, CNN-f, is learned from densely segmented superpixels using a subset of the training superpixels that follows a rare class balanced distribution. Both CNN-c and CNN-f are embedded in the DrsNet algorithms as follows.

Algorithm 1: Superpixel-based, scene-assisted rare class retrieval for label transfer (pseudo code figure).

During the inference process of DrsNet (Algorithm 2), given a query image, we first apply coarse superpixel segmentation and pass each coarse superpixel into CNN-c, which gives us not only the local CNN belief map PL(ξi,yi), but also the CNN features that are used to compute the global belief map PG(ξi,yi). Then, we obtain an intermediate integrated belief map E(ξi,yi) by combining PL(ξi,yi) and PG(ξi,yi), from which we can find potential rare class areas. At this point, finer superpixel segmentation is applied around these potential rare class areas for one further inference step that is specially designed for rare-class objects using CNN-f, which leads to the rare-class oriented belief map PF(ξi,j,yi,j). Finally, we obtain the superpixel-wise labeling yi* as in Equation (1), or the finer superpixel-wise labeling yi,j* as in Equation (3).

Algorithm 2: The DrsNet coarse-to-fine training and inference procedure (pseudo code figure).
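For completeness, the inference loop of Algorithm 2 can be summarized in the following hedged sketch; every callable interface (CNN-c, CNN-f, label transfer, and the two segmentation routines) is an assumed abstraction layered on the earlier sketches rather than the authors' released code.

```python
import numpy as np

def drsnet_inference(image, cnn_c, cnn_f, label_transfer, segment_coarse, segment_fine,
                     rare_classes):
    """Hedged sketch of Algorithm 2: coarse pass, rare-class localization, fine pass.

    cnn_c / cnn_f  : (image, segments, sp_id) -> (L,) belief vector P_L / P_F (assumed)
    label_transfer : (image, segments, sp_id) -> (L,) global belief P_G (assumed)
    segment_coarse : image -> coarse superpixel label map (assumed)
    segment_fine   : (image, region_mask) -> fine superpixel label map (assumed)
    """
    coarse = segment_coarse(image)
    labels = np.full_like(coarse, -1)
    for sp in np.unique(coarse):
        P_L = cnn_c(image, coarse, sp)
        P_G = label_transfer(image, coarse, sp)
        E = -(P_L + P_G)                                      # Eqs. (1)-(2)
        y = int(np.argmin(E))
        region = coarse == sp
        if y in rare_classes:                                 # refine potential rare-class areas
            fine = segment_fine(image, region)
            for sub in np.unique(fine[region]):
                P_F = cnn_f(image, fine, sub)
                labels[(fine == sub) & region] = int(np.argmin(-P_F + E))   # Eq. (3)
        else:
            labels[region] = y
    return labels
```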

5. Experiments

Several datasets have been widely used as benchmarks for scene labeling research [13,15,46], for example, the Stanford dataset [28], the Barcelona dataset [44], the PASCAL Context dataset [49], and the SIFTflow dataset [38]. We are particularly interested in the performance of semantically labeling rare classes, so we choose the widely adopted SIFTflow and PASCAL Context datasets. Specifically, SIFTflow consists of 2688 images (256 × 256) captured from 8 natural scenes and manually labeled with 33 object classes, of which 27 are considered rare classes [39]. The dataset is split into 2488 training and 200 testing images. The PASCAL Context dataset is an annotation of the PASCAL VOC 2010 dataset with semantic segmentation for 10,103 images (375 × 500). Out of 59 classes, 37 are considered rare classes. We set the rare-class frequency threshold to 0.05 for SIFTflow and 0.01 for the PASCAL Context dataset [50]. In the following, we first evaluate the overall pixel accuracy (percentage of correctly labeled pixels) and the class accuracy averaged over all classes; both metrics are then evaluated for rare classes only, and can be computed as in the sketch below.
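For reference, both metrics can be computed as in the short sketch below; treating unlabeled pixels via an `ignore` value is our assumption about the ground-truth encoding.

```python
import numpy as np

def pixel_and_class_accuracy(pred, gt, num_classes, ignore=-1):
    """Per-pixel accuracy and mean per-class accuracy (Section 5).
    `ignore` marks unlabeled ground-truth pixels (an assumption about the encoding)."""
    valid = gt != ignore
    pixel_acc = float((pred[valid] == gt[valid]).mean())
    per_class = [float((pred[gt == c] == c).mean())
                 for c in range(num_classes) if (gt == c).any()]
    return pixel_acc, float(np.mean(per_class))
```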

5.1. Overall Labeling Accuracy

We compare DrsNet with a set of recent methods in Tables 4 and 5 in terms of overall labeling accuracy on the two datasets. For the SIFTflow dataset (Table 4), two recent methods, FCN and DAG-RNN, demonstrate significant advantages over the others, including our proposed DrsNet, due to their sophisticated network structures. However, their advantages diminish on the PASCAL Context dataset, which has more and larger images, indicating possible overfitting. On the other hand, the proposed DrsNet achieves 10.1% and 2.3% better class accuracy than the baseline CNN algorithm [46] and RCSL-seg [17] (pixel-wise labeling with superpixel re-segmentation as post-processing), respectively. It also outperforms the rare class labeling approach of [13], where scene labeling is based on statistical inference over single-level superpixels, showing the effectiveness of the proposed two-level c2f superpixel method. Compared with our early work that combines binary saliency detection of rare classes with the multi-class likelihood [22], DrsNet is more integrated and achieves better class accuracy but slightly worse overall pixel accuracy. This is mainly because our special consideration of rare classes may cause over-fitting under some circumstances when foreground objects are locally similar. For the PASCAL Context dataset (Table 5), the proposed DrsNet achieves the best average class accuracy and the second best overall pixel labeling accuracy (only 0.7% below the best one [51]). DrsNet is also consistently better than its early versions [17,22] that share a similar CNN-based framework with label transfer, showing the value of the c2f superpixel representation and superpixel-based CNN learning and inference for scene labeling.

Table 4.

Overall performance on the SIFTflow dataset

Pixel (%) Class (%)
Multiscale ConvNet [14] 67.9 45.9
[14]+cover(balanced) 72.3 50.8
[14]+cover(natural) 78.5 29.6
CNN, Pinheiro et al. [35] 76.5 30.0
RCNN, Pinheiro et al. [35] 77.7 29.8
Superparsing [44] 76.9 29.4
Eigen et al. [42] 77.1 32.5
Tighe et al. [45] 78.6 39.2
Singh et al. [43] 79.2 33.8
Gould et al. [52] 78.4 25.7
Integration model +metric learning [46] 80.1 39.7
Ensemble model +metric learning [15] 81.2 45.5
CNN65-DAG-RNN [53] 81.1 48.2
Yang et al. [13] 79.8 48.7
FCN-8s [54] 85.9 53.9
DAG-RNN [50] 87.3 60.2
RCSL [17] 80.8 41.2
RCSL-seg [17] 81.6 47.5
sRCSL [22] 82.3 49.2
DrsNet 81.8 49.8

Table 5.

Overall performance on the PASCAL Context dataset

Pixel (%) Class (%)
FCN-8s [54] 67.5 52.3
PixelNet [55] - 51.5
Context-CRF [56] 71.5 53.9
DAG-RNN [50] 73.6 55.8
FCRN [57] 72.9 54.8
Global-Context [58] 73.8 -
PSPNet-Res101 [51] 76.0 60.6
Episodic CAMN [59] 72.1 54.3
RCSL [17] 70.2 51.2
RCSL-seg [17] 70.5 53.1
sRCSL [22] 71.6 52.8
DrsNet 75.3 60.9

5.2. Rare Class Labeling Accuracy

We report pixel and class accuracies of rare-class objects on the two datasets in Tables 6 and 7. DrsNet achieves the best results in terms of both metrics on the SIFTflow dataset compared with several recent methods, including our earlier ones [17,22], showing its advantages in handling rare classes. Several rare-class-focused schemes are shown to be effective in improving the sensitivity and specificity for rare-class objects, as shown in the following qualitative results. DrsNet also outperforms DAG-RNN [50] in terms of rare class accuracy on the PASCAL Context dataset. Adding a class-weighted loss to CNN learning in DAG-RNN brings about a 6% gain in average class accuracy (i.e., 4.5% better than DrsNet), and it is expected that this class-weighted loss would also be applicable and helpful to our proposed DrsNet algorithm.

Table 6.

Performance of rare classes on the SIFTflow dataset.

Pixel (%) Class (%)
Tighe et al. [45] 48.8 29.9
Yang et al. [13] 59.4 41.9
Shuai et al. [46] - 30.7
Shuai et al. [15] - 37.6
RCSL [17] 61.3 39.2
RCSL-seg [17] 62.6 42.3
sRCSL [22] 64.5 43.1
DrsNet 65.2 44.3

Table 7.

Performance of rare classes on the PASCAL Context dataset.

Pixel (%) Class (%)
DAG-RNN (with weighted loss) [50] - 38.6 (44.7)
RCSL [17] 55.6 38.1
RCSL-seg [17] 58.3 37.9
sRCSL [22] 60.3 39.1
DrsNet 63.4 40.2

5.3. Qualitative Results

We show some qualitative results from the SIFTflow dataset in Fig. 7, where the proposed DrsNet is able to locate and label rare-class objects with perceptually meaningful object boundaries. For example, in the 5th column of the 4th and 5th rows, the silhouettes of humans and vehicles from DrsNet are better preserved compared with other methods, and there is less ambiguity between rare classes and background areas. In the 1st row, DrsNet creates more detailed boundaries for vehicles, and in the 3rd row, it is able to clearly label and segment almost all windows. We are even able to detect some humans in the last image that are not correctly labeled in the ground truth.

Fig. 7.

Fig. 7

Some qualitative examples of labeling results. From the first to the last column: query images, the baseline integration model [46], RCSL [17], RCSL-seg [17], our proposed DrsNet framework with c2f, and the ground truth label maps.

5.4. More Discussion

The main advantage of the proposed DrsNet is its capability of balancing frequent and rare classes with accurate boundary generation during semantic labeling, as shown in the zoomed boxes of Fig. 7. This is owing to several key rare-class-focused techniques, including the dual-resolution c2f image representation, superpixel-based CNN learning and inference, and scene-assisted retrieval for label transfer. These rare-class-focused schemes can be extended to other deep networks, making this work more general and versatile. As a matter of fact, our two earlier papers [17,22] that share the same basic framework serve as the baselines of DrsNet. Specifically, an early version of DrsNet (referred to as RCSL in Tables 6 and 7) was proposed in [17], where scene-assisted retrieval and label transfer are used to improve rare class learning, and RCSL-seg was developed by incorporating superpixel re-segmentation to refine boundaries. Another early attempt proposed in [22] (referred to as sRCSL in Tables 6 and 7) integrates superpixel representation and binary saliency detection into RCSL to further improve rare class performance. The proposed DrsNet algorithm is deeply motivated by these two early efforts, where the dual resolution superpixel representation and c2f CNN classification play a similar role to saliency detection, but in a more direct and effective way, as shown by the above quantitative and qualitative results. Since image retrieval plays a very important role in the current framework, more effective and discriminative features along with robust feature learning schemes might help to improve the overall performance, for example a combination of verification and identification models [73] for learning discriminative object descriptors. Furthermore, model pruning techniques would also benefit the efficiency of accurate label transfer as the model complexity grows, for example by replacing neighboring filters with similar geometric properties during model learning for retrieval and progressively pruning filters by gradually changing their weights [74].

6. Conclusion

We have proposed a new dual-resolution semantic labeling framework called DrsNet that improves the semantic labeling performance of natural scene images with special attention to rare classes. The main contribution is the integration of shape information from local CNN-based scene labeling with that from global label transfer at two superpixel levels. The dual resolution c2f superpixel representation is developed to carefully infer and label rare-class objects while sustaining computational efficiency and robustness in the background. Experimental results on two standard datasets show competitive and promising performance of the proposed algorithm in terms of overall labeling accuracy, especially for rare classes.

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions that helped us improve this paper.

Footnotes

Publisher's Disclaimer: This Author Accepted Manuscript is a PDF file of an unedited peer-reviewed manuscript that has been accepted for publication but has not been copyedited or corrected. The official version of record that is published in the journal is kept up to date and may therefore differ from this version.

References

  • 1.Hu K, Zhang S, and Zhao X, “Context-based conditional random fields as recurrent neural networks for image labeling,” Multimedia Tools and Applications, pp. 1–11, 2019. [Google Scholar]
  • 2.Li L, Socher R, and Li F, “Towards total scene understanding: Classification, annotation and segmentation in an automatic framework,” in Proc. CVPR, June 2009, pp. 2036–2043. [Google Scholar]
  • 3.George M, “Image parsing with a wide range of classes and scene-level context,” in Proc. CVPR, June 2015, pp. 3622–3630. [Google Scholar]
  • 4.Shi H, Li H, Meng F, Wu Q, Xu L, and Ngan KN, “Hierarchical parsing net: Semantic scene parsing from global scene to objects,” IEEE Transactions on Multimedia, vol. 20, no. 10, pp. 2670–2682, October 2018. [Google Scholar]
  • 5.Zhang J, Wu Q, Shen C, Zhang J, and Lu J, “Multilabel image classification with regional latent semantic dependencies,” IEEE Transactions on Multimedia, vol. 20, no. 10, pp. 2801–2813, October 2018. [Google Scholar]
  • 6.Abdulnabi AH, Shuai B, Zuo Z, Chau L, and Wang G, “Multimodal recurrent neural networks with information transfer layers for indoor scene labeling,” IEEE Transactions on Multimedia, vol. 20, no. 7, pp. 1656–1671, July 2018. [Google Scholar]
  • 7.Kang B, Lee Y, and Nguyen TQ, “Depth-adaptive deep neural network for semantic segmentation,” IEEE Transactions on Multimedia, vol. 20, no. 9, pp. 2478–2490, September. 2018. [Google Scholar]
  • 8.Zhao J, Xie G, and Han J, “Conditional random field with the multi-granular contextual information for pixel labeling,” Multimedia Tools and Applications, vol. 76, no. 7, pp. 9169–9194, 2017. [Google Scholar]
  • 9.Kosov S, Shirahama K, and Grzegorzek M, “Labeling of partially occluded regions via the multi-layer crf,” Multimedia Tools and Applications, vol. 78, no. 2, pp. 2551–2569, 2019. [Google Scholar]
  • 10.Ruiz-Sarmiento JR, Galindo C, and Gonzalez-Jimenez J, “Robot@ home, a robotic dataset for semantic mapping of home environments,” The International Journal of Robotics Research, vol. 36, no. 2, pp. 131–141, 2017. [Google Scholar]
  • 11.Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, and Schiele B, “The cityscapes dataset for semantic urban scene understanding,” in Proc. CVPR, 2016. [Google Scholar]
  • 12.Liang X, Lin L, Yang W, Luo P, Huang J, and Yan S, “Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval,” IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 1175–1186, June 2016. [Google Scholar]
  • 13.Yang J, Price B, Cohen S, and Yang MH, “Context driven scene parsing with attention to rare classes,” in Proc. CVPR, June 2014, pp. 3294–3301. [Google Scholar]
  • 14.Farabet C, Couprie C, Najman L, and LeCun Y, “Learning hierarchical features for scene labeling,” IEEE Trans. PAMI, vol. 35, no. 8, pp. 1915–1929, August 2013. [DOI] [PubMed] [Google Scholar]
  • 15.Shuai B, Zuo Z, Wang G, and Wang B, “Scene parsing with integration of parametric and non-parametric models,” IEEE Trans. Image Processing, vol. 25, no. 5, pp. 2379–2391, May 2016. [DOI] [PubMed] [Google Scholar]
  • 16.Shuai B, Zuo Z, Wang B, and Wang G, “Dag-recurrent neural networks for scene labeling,” in Proc. CVPR, June 2016, pp. 3620–3629. [Google Scholar]
  • 17.Yu L and Fan G, “Rare class oriented scene labeling using cnn incorporated label transfer,” in Advances in Visual Computing: Proc. ISVC, 2016, pp. 309–320. [Google Scholar]
  • 18.Caesar H, Uijlings J, and Ferrari V, “Joint calibration for semantic segmentation,” in Proc. BMVC. BMVA Press, September 2015, pp. 29.1–29.13. [Google Scholar]
  • 19.Dai J, He K, and Sun J, “Convolutional feature masking for joint object and stuff segmentation,” in Proc. CVPR, June 2015, pp. 3992–4000. [Google Scholar]
  • 20.——, “Instance-aware semantic segmentation via multi-task network cascades,” in Proc. CVPR, June 2016, pp. 3150–3158. [Google Scholar]
  • 21.He S, Lau RWH, Liu W, Huang Z, and Yang Q, “Supercnn: A superpixelwise convolutional neural network for salient object detection,” International Journal of Computer Vision, vol. 115, no. 3, pp. 330–344, 2015. [Google Scholar]
  • 22.Yu L and Fan G, “Edge-aware integration model for semantic labeling of rare classes,” in Proc. IEEE International Conference on Image Processing, September 2017. [Google Scholar]
  • 23.Xiao J, Hays J, Ehinger KA, Oliva A, and Torralba A, “Sun database: Large-scale scene recognition from abbey to zoo,” in Proc. CVPR, June 2010, pp. 3485–3492. [Google Scholar]
  • 24.Li J and Wang JZ, “Automatic linguistic indexing of pictures by a statistical modeling approach,” IEEE Trans. PAMI, vol. 25, no. 9, pp. 1075–1088, September 2003. [Google Scholar]
  • 25.Byeon W, Breuel TM, Raue F, and Liwicki M, “Scene labeling with lstm recurrent neural networks,” in Proc. CVPR, June 2015, pp. 3547–3555. [Google Scholar]
  • 26.Mottaghi R, Fidler S, Yao J, Urtasun R, and Parikh D, “Analyzing semantic segmentation using hybrid human-machine crfs,” in Proc. CVPR, June 2013, pp. 3143–3150. [Google Scholar]
  • 27.Mottaghi R, Chen X, Liu X, Cho NG, Lee SW, Fidler S, Urtasun R, and Yuille A, “The role of context for object detection and semantic segmentation in the wild,” in Proc. CVPR, June 2014, pp. 891–898. [Google Scholar]
  • 28.Gould S, Fulton R, and Koller D, “Decomposing a scene into geometric and semantically consistent regions,” in Proc. ICCV, September 2009, pp. 1–8. [Google Scholar]
  • 29.Larlus D and Jurie F, “Combining appearance models and markov random fields for category level object segmentation,” in Proc. CVPR, June 2008, pp. 1–7. [Google Scholar]
  • 30.Triggs B and Verbeek JJ, “Scene segmentation with crfs learned from partially labeled images,” in Advances in Neural Information Processing Systems 20. Curran Associates, Inc., 2008, pp. 1553–1560. [Google Scholar]
  • 31.Shotton J, Winn J, Rother C, and Criminisi A, “Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation,” in Computer Vision – ECCV 2006. Springer Berlin Heidelberg, 2006, pp. 1–15. [Google Scholar]
  • 32.Zhang Y and Chen T, “Efficient inference for fully-connected crfs with stationarity,” in Proc. CVPR, June 2012, pp. 582–589. [Google Scholar]
  • 33.He X, Zemel RS, and Carreira-Perpinan MA, “Multiscale conditional random fields for image labeling,” in Proc. CVPR, vol. 2, June 2004, pp. II–695–II–702 Vol.2. [Google Scholar]
  • 34.Ladický L, Russell C, Kohli P, and Torr PHS, “Associative hierarchical crfs for object class image segmentation,” in Proc. ICCV, September 2009, pp. 739–746. [Google Scholar]
  • 35.Pinheiro PHO and Collobert R, “Recurrent convolutional neural networks for scene labeling,” in Proc. ICML, 2014, pp. 82–90. [Google Scholar]
  • 36.LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, and Jackel LD, “Backpropagation applied to handwritten zip code recognition,” Neural Comput, vol. 1, no. 4, pp. 541–551, December. 1989. [Google Scholar]
  • 37.Liang M, Hu X, and Zhang B, “Convolutional neural networks with intra-layer recurrent connections for scene labeling,” in Advances in NeurIPS 28. Curran Associates, Inc., 2015, pp. 937–945. [Google Scholar]
  • 38.Liu C, Yuen J, and Torralba A, “Nonparametric scene parsing: Label transfer via dense scene alignment,” in Proc. CVPR, June 2009, pp. 1972–1979. [Google Scholar]
  • 39.——, “Nonparametric scene parsing via label transfer,” IEEE Trans. PAMI, vol. 33, no. 12, pp. 2368–2382, December 2011. [DOI] [PubMed] [Google Scholar]
  • 40.Gould S and Zhang Y, “Patchmatchgraph: Building a graph of dense patch correspondences for label transfer,” in Proc. ECCV, 2012, pp. 439–452. [Google Scholar]
  • 41.Tung F and Little JJ, “Collageparsing: Nonparametric scene parsing by adaptive overlapping windows,” in Proc. ECCV, 2014, pp. 511–525. [Google Scholar]
  • 42.Eigen D and Fergus R, “Nonparametric image parsing using adaptive neighbor sets,” in Proc. CVPR, June 2012, pp. 2799–2806. [Google Scholar]
  • 43.Singh G and Kosecka J, “Nonparametric scene parsing with adaptive feature relevance and semantic context,” in Proc. CVPR, June 2013, pp. 3151–3157. [Google Scholar]
  • 44.Tighe J and Lazebnik S, “Superparsing: Scalable nonparametric image parsing with superpixels,” in Proc. ECCV, 2010, pp. 352–365. [Google Scholar]
  • 45.——, “Finding things: Image parsing with regions and per-exemplar detectors,” in Proc. CVPR, June 2013, pp. 3001–3008. [Google Scholar]
  • 46.Shuai B, Wang G, Zuo Z, Wang B, and Zhao L, “Integrating parametric and nonparametric models for scene labeling,” in Proc. CVPR, June 2015, pp. 4249–4258. [Google Scholar]
  • 47.Achanta R, Shaji A, Smith K, Lucchi A, Fua P, and Süsstrunk S, “Slic superpixels compared to state-of-the-art superpixel methods,” IEEE Trans. PAMI, vol. 34, no. 11, pp. 2274–2282, November 2012. [DOI] [PubMed] [Google Scholar]
  • 48.Vedaldi A and Lenc K, “Matconvnet – convolutional neural networks for matlab,” in Proceeding of the ACM Int. Conf. on Multimedia, 2015. [Google Scholar]
  • 49.Mottaghi R, Chen X, Liu X, Cho N-G, Lee S-W, Fidler S, Urtasun R, and Yuille A, “The role of context for object detection and semantic segmentation in the wild,” in Proc. CVPR, 2014. [Google Scholar]
  • 50.Shuai B, Zuo Z, Wang B, and Wang G, “Scene segmentation with DAG-recurrent neural networks,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 40, no. 6, 2018. [DOI] [PubMed] [Google Scholar]
  • 51.Zhao H, Shi J, Qi X, Wang X, and Jia J, “Pyramid scene parsing network,” in Proc. CVPR, 2017. [Google Scholar]
  • 52.Gould S, Zhao J, He X, and Zhang Y, “Superpixel graph label transfer with learned distance metric,” in Proc. ECCV, 2014, pp. 632–647. [Google Scholar]
  • 53.Shuai B, Zuo Z, Wang B, and Wang G, “DAG-recurrent neural networks for scene labeling,” in Proc. CVPR, 2016. [DOI] [PubMed] [Google Scholar]
  • 54.Shelhamer E, Long J, and Darrell T, “Fully convolutional networks for semantic segmentation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 39, no. 4, 2017. [DOI] [PubMed] [Google Scholar]
  • 55.Bansal A, Chen X, Russell B, Gupta A, and Ramanan D, “Pixelnet: Towards a general pixel-level architecture,” arXiv preprint arXiv:1609.06694, 2016. [Google Scholar]
  • 56.Lin G, Shen C, Van Den Hengel A, and Reid I, “Efficient piecewise training of deep structured models for semantic segmentation,” in Proc. CVPR, 2016, pp. 3194–3203. [Google Scholar]
  • 57.Wu Z, Shen C, and Hengel A. v. d., “Bridging category-level and instance-level semantic image segmentation,” arXiv preprint arXiv:1605.06885, 2016. [Google Scholar]
  • 58.Hung W-C, Tsai Y-H, Shen X, Lin Z, Sunkavalli K, Lu X, and Yang M-H, “Scene parsing with global context embedding,” in Proc. ICCV, 2017. [Google Scholar]
  • 59.Abdulnabi AH, Shuai B, Winkler S, and Wang G, “Episodic CAMN: Contextual attention-based memory networks with iterative feedback for scene labeling,” in Proc. CVPR, 2017. [Google Scholar]
  • 60.Wang Zhe, Li Hongsheng, Ouyang Wanli, and Wang Xiaogang, “Learning Deep Representations for Scene Labeling with Semantic Context Guided Supervision,” in arXiv:1706.02493, 2017. [Google Scholar]
  • 61.Wang Qi , Gao Junyu , Yuan Yuan, “A Joint Convolutional Neural Networks and Context Transfer for Street Scenes Labeling,” in IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 5, pp. 1457–1470, 2018. [Google Scholar]
  • 62.Fan Heng, Chu Peng, Jan Latecki Longin,Ling Haibin, “Scene Parsing via Dense Recurrent Neural Networks with Attentional Selection,” in arXiv:1811.04778, 2018. [Google Scholar]
  • 63.Das Aritra,Ghosh Swarnendu,Sarkhel Ritesh, Choudhuri Sandipan, Das Nibaran, Nasipuri Mita, “Combining Multilevel Contexts of Superpixel Using Convolutional Neural Networks to Perform Natural Scene Labeling,” in Recent Developments in Machine Learning and Data Analytics, Springer AISC, 740, pp. 297–306, 2019. [Google Scholar]
  • 64.Badrinarayanan V, Kendall A and Cipolla R, “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, December. 2017. [DOI] [PubMed] [Google Scholar]
  • 65.Lin G, Milan A, Shen C and Reid I, “RefineNet: Multi-path Refinement Networks for High-Resolution Semantic Segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [Google Scholar]
  • 66.Long J, Shelhamer E and Darrell T, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. [DOI] [PubMed] [Google Scholar]
  • 67.Yu Fisher and Koltun Vladlen, “Multi-Scale Context Aggregation by Dilated Convolutions,” in International Conference on Learning Representations (ICLR), 2016. [Google Scholar]
  • 68.Zhang H et al., “Context Encoding for Semantic Segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. [Google Scholar]
  • 69.Stollenga MF, Byeon W, Liwicki M, and Schmidhuber J, “Parallel multi-dimensional lstm, with application to fast biomedical volumetric image segmentation,” in NeurIPS, 2015. [Google Scholar]
  • 70.Chen L, Papandreou G, Kokkinos I, Murphy K, and Yuille AL, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” in ICLR, 2015. [DOI] [PubMed] [Google Scholar]
  • 71.Liu Z, Li X, Luo P, Loy C, and Tang X, “Semantic image segmentation via deep parsing network,” in ICCV, 2015. [Google Scholar]
  • 72.Zheng S, Jayasumana S, Romera-Paredes B, Vineet V, Su Z, Du D, Huang C, and Torr PH, “Conditional random fields as recurrent neural networks,” in ICCV, 2015. [Google Scholar]
  • 73.Zheng Zhedong, Zheng Liang, and Yang Yi, “A Discriminatively Learned CNN Embedding for Person Re-identification,” in ACM Transactions on Multimedia Computing Communications and Applications, 2017. [Google Scholar]
  • 74.Wang Xiaodong, Zheng Zhedong, He Yang, Yan Fei, Zeng Zhiqiang, and Yang Yi, “Progressive Local Filter Pruning for Image Retrieval Acceleration,” in arXiv:2001.08878 [cs.CV], 2020. [Google Scholar]
