Abstract
In interactive segmentation, the most common way to model object appearance is by GMM or histogram, while MRFs are used to encourage spatial coherence among the object labels. This makes the strong assumption that pixels within each object are i.i.d. when in fact most objects have multiple distinct appearances and exhibit strong spatial correlation among their pixels. At the very least, this calls for an MRF-based appearance model within each object itself and yet, to the best of our knowledge, such a “two-level MRF” has never been proposed.
We propose a novel segmentation energy that can model complex appearance. We represent the appearance of each object by a set of distinct spatially coherent models. This results in a two-level MRF with “super-labels” at the top level that are partitioned into “sub-labels” at the bottom. We introduce the hierarchical Potts (hPotts) prior to govern spatial coherence within each level. Finally, we introduce a novel algorithm with EM-style alternation of proposal, α-expansion and re-estimation steps.
Our experiments demonstrate the conceptual and qualitative improvement that a two-level MRF can provide. We show applications in binary segmentation, multi-class segmentation, and interactive co-segmentation. Finally, our energy and algorithm have interesting interpretations in terms of semi-supervised learning.
1 Introduction
The vast majority of segmentation methods model object appearance by GMM or histogram and rely on some form of spatial regularization of the object labels. This includes interactive [1–3], unsupervised [4–9], binary [1–3, 8, 9] and multi-class [4–7, 10] techniques. The interactive methods make the strong assumption that all pixels within an entire object are i.i.d. when in fact many objects are composed of multiple regions with distinct appearances. Unsupervised methods try to break the image into small regions that actually are i.i.d., but these formulations do not involve any high-level segmentation of objects.
We propose a novel energy that unifies these two approaches by incorporating unsupervised learning into interactive segmentation. We show that this more descriptive object model leads to better high-level segmentations. In our formulation, each object (super-label) is automatically decomposed into spatially coherent regions where each region is described by a distinct appearance model (sub-label). This results in a two-level MRF with super-labels at the top level that are partitioned into sub-labels at the bottom. Figure 1 illustrates the main idea. We introduce the hierarchical Potts (hPotts) prior to govern spatial coherence at both levels of our MRF. The hierarchical Potts prior regularizes boundaries between objects (super-label transitions) differently from boundaries within each object (sub-label transitions). The unsupervised aspect of our MRF allows appearance models of arbitrary complexity and would severely over-fit the image data if left unregularized. We address this by incorporating a global sparsity prior into our MRF via the energetic concept of “label costs” [7].
Fig. 1.

Given user scribbles, typical MRF segmentation (Boykov-Jolly) uses a GMM to model the appearance of each object label. This makes the strong assumption that pixels inside each object are i.i.d. In contrast, we define a two-level MRF to encourage inter-object coherence among super-labels and intra-object coherence among sub-labels.
Since our framework is based on multi-label MRFs, a natural choice of optimization machinery is α-expansion [11, 7]. Furthermore, the number, class, and parameters of each object’s appearance models are not known a priori — in order to use powerful combinatorial techniques we must propose a finite set of possibilities for α-expansion to select from. We therefore resort to an iterative graph-cut process that involves random sampling to propose new models, α-expansion to update the segmentation, and re-estimation to improve the current appearance models. Figure 2 illustrates our algorithm.
Fig. 2.

We iteratively propose new models by randomly sampling pixels from super-labels, optimize the resulting two-level MRF, and re-estimate model parameters
The remainder of the paper is structured as follows. Section 2 discusses other methods for modeling complex appearance, MDL-based segmentation, and related iterative graph-cut algorithms. Section 3 describes our energy-based formulation and algorithm in formal detail. Section 4 shows applications in interactive binary/multi-class segmentation and interactive co-segmentation; furthermore it describes how our framework easily allows appearance models to come from a mixture of classes (GMM, plane, etc.). Section 5 draws an interesting parallel between our formulation and multi-class semi-supervised learning in general.
2 Related Work
Complex Appearance Models
The DDMCMC method [6] was the first to emphasize the importance of representing object appearance with complex models (e.g. splines and texture-based models in addition to GMMs) in the context of unsupervised segmentation. However, being unsupervised, DDMCMC does not delineate objects but rather provides low-level segments along with their appearance models. Ours is the first multi-label graph-cut based framework that can learn a mixture of such models for segmentation.
There is an interactive method [10] that decomposes objects into spatially coherent sub-regions with distinct appearance models. However, the number of sub-regions, their geometric interactions, and their corresponding appearance models must be carefully designed for each object of interest. In contrast, we automatically learn the number of sub-regions and their model parameters.
MDL-Based Segmentation
A number of works have shown that minimum description length (MDL) is a useful regularizer for unsupervised segmentation, e.g. [5–7]. Our work stands out here in two main respects: our formulation is designed for semi-supervised settings, and it explicitly weighs the benefit of each appearance model against the ‘cost’ of its inherent complexity (e.g. number of parameters). To the best of our knowledge, only the unsupervised DDMCMC [6] method allows arbitrary complexity while explicitly penalizing it in a meaningful way. However, they use a completely different optimization framework and, being unsupervised, they do not delineate object boundaries.
Iterative Graph-Cuts
Several energy-based methods have employed EM-style alternation between a graph-cut/α-expansion phase and a model re-estimation phase, e.g. [12, 2, 13, 14, 7]. Like our work, GrabCut [2] addresses interactive segmentation, though its focus is binary segmentation with a bounding-box interaction rather than scribbles. The bounding box is intuitive and effective for many kinds of objects but often requires subsequent scribble-based interaction for more precise control. Throughout this paper, we compare our method to an iterative multi-label variant of Boykov-Jolly [1] that we call iBJ. Given user scribbles, this baseline method maintains one GMM per object label and iterates between α-expansion and re-estimating each model.
On an algorithmic level, the approach most closely related to ours is the unsupervised method [14, 7] because it also involves random sampling, α-expansion, and label costs. Our framework is designed to learn complex appearance models from partially-labeled data and differs from [14, 7] in the following respects: (1) we make use of hard constraints and the current super-labeling to guide random sampling, (2) our hierarchical Potts potentials regularize sub- and super-labels differently, and (3), again, our label costs penalize models based on their individual complexity rather than using uniform label costs.
3 Modeling Complex Appearance via Super-Labels
We begin by describing a novel multi-label energy that corresponds to our two-level MRF. Unlike typical MRF-based segmentation methods, our actual set of discrete labels (appearance models) is not precisely known beforehand and we need to estimate both the number of unique models and their parameters. Section 3.1 explains this energy formulation in detail, and Section 3.2 describes our iterative algorithm for minimizing this energy.
3.1 Problem Formulation
Let S denote the set of super-labels (scribble colors) available to the user and let P denote the indexes of pixels in the input image I. By “scribbling” on the image, the user interactively defines a partial labeling g : P → S ∪ {none} that assigns to each pixel p a super-label index gp ∈ S or leaves p unlabeled (gp = none). Our objective in terms of optimization is to find the following:
- an unknown set ℒ of distinct appearance models (sub-labels) generated from the image, along with model parameters θℓ for each ℓ ∈ ℒ,
- a complete sub-labeling f : P → ℒ that assigns one model to each pixel, and
- a map π : ℒ → S where π(ℓ) = i associates sub-label ℓ with super-label i, i.e. the sub-labels are grouped into disjoint subsets, one for each super-label; any such π defines a parent-child relation in what we call a two-level MRF.
Our output is therefore a tuple (ℒ, θ, π, f) with set of sub-labels ℒ, model parameters θ = {θℓ}, super-label association π, and complete pixel labeling f. The final segmentation presented to the user is simply (π ∘ f): P→S which assigns a scribble color (super-label index) to each pixel in P.
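For concreteness, the output tuple can be sketched in code as follows; this is our own minimal encoding for illustration (the array and dict names are not part of the formulation):

```python
import numpy as np

# Toy 2x3 image with super-labels S = {0, 1} (two scribble colors).
f = np.array([[0, 0, 1],
              [1, 2, 2]])        # complete sub-labeling over sub-labels {0, 1, 2}
pi = {0: 0, 1: 0, 2: 1}          # sub-labels 0,1 belong to super-label 0; 2 to super-label 1
theta = {l: None for l in pi}    # per-sub-label model parameters (elided here)

# The segmentation shown to the user is the composition (pi o f).
segmentation = np.vectorize(pi.get)(f)
print(segmentation)              # [[0 0 0]
                                 #  [0 1 1]]
```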
In a good segmentation we expect the tuple (ℒ, θ, π, f) to satisfy the following three properties. First, the super-labeling π ∘ f must respect the constraints imposed by user scribbles, i.e. if pixel p was scribbled then we require π(fp) = gp. Second, the labeling f should exhibit spatial coherence both among sub-labels and between super-labels. Finally, the set of sub-labels ℒ should contain as many appearance models as is justified by the image data, but no more.
We propose an energy for our two-level MRFs¹ that satisfies these three criteria and can be expressed in the following form²:
$$E(\mathcal{L}, \theta, \pi, f) \;=\; \sum_{p \in P} D_p(f_p) \;+\; \sum_{(p,q) \in \mathcal{N}} w_{pq}\, V(f_p, f_q) \;+\; \sum_{\ell \in \mathcal{L}} h_\ell\, \delta_\ell(f) \tag{1}$$

where 𝒩 denotes the set of neighboring pixel pairs.
The unary terms D of our energy express negative log-likelihoods of appearance models and enforce the hard constraints imposed by the user. A pixel p that has been scribbled (gp ∈ S) is only allowed to be assigned a sub-label ℓ such that π(ℓ) = gp. Un-scribbled pixels are permitted to take any sub-label.
$$D_p(\ell) \;=\; \begin{cases} -\ln \Pr(I_p \mid \theta_\ell) & \text{if } g_p = \text{none or } \pi(\ell) = g_p \\ \infty & \text{otherwise} \end{cases} \tag{2}$$
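In code, the hard constraint in (2) amounts to a single guard before the likelihood. A minimal sketch, with `nll(p, l)` standing in for the model likelihoods:

```python
import math

NONE = -1   # encodes g_p = none (unscribbled pixel)

def D(p, l, g, pi, nll):
    """Unary term of Eq. (2): enforce scribble hard constraints, otherwise
    return the negative log-likelihood of pixel p under sub-label l."""
    if g[p] != NONE and pi[l] != g[p]:
        return math.inf          # scribbled pixels may only take their object's sub-labels
    return nll(p, l)             # -ln Pr(I_p | theta_l)
```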
The pairwise terms V are defined with respect to the current super-label map π as follows:
$$V(\ell, \ell') \;=\; \begin{cases} 0 & \text{if } \ell = \ell' \\ c_1 & \text{if } \ell \ne \ell' \text{ and } \pi(\ell) = \pi(\ell') \\ c_2 & \text{if } \pi(\ell) \ne \pi(\ell') \end{cases} \tag{3}$$
We call (3) a two-level Potts potential because it governs coherence on two levels: c1 encourages sub-labels within each super-label to be spatially coherent, and c2 encourages smoothness among super-labels. This potential is a special case of our more general class of hierarchical Potts potentials introduced in the Appendix, but two-level Potts is sufficient for our interactive segmentation applications. For image segmentation we assume c1 ≤ c2, though in general any V with c1 ≤ 2c2 is still a metric [11] and can be optimized by α-expansion. The Appendix gives general conditions for hPotts to be metric. It is commonly known that smoothness costs directly affect the expected length of the boundary and should be scaled proportionally to the size of the image. Intuitively c2 should be larger as it operates on the entire image as opposed to smaller regions that correspond to objects. The weight wpq ≥ 0 of each pairwise term in (1) is computed from local image gradients in the standard way (e.g. [1, 2]).
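For reference, one standard such choice of contrast weight, in the spirit of [1, 2] (the constants `lam` and `sigma` below are illustrative, not our tuned values):

```python
import numpy as np

def contrast_weight(Ip, Iq, lam=1.0, sigma=10.0):
    """Standard contrast-sensitive weight: discontinuities are cheap to cut
    where the local image gradient ||I_p - I_q|| is large.
    Ip, Iq: float color vectors of neighboring pixels p and q."""
    return lam * float(np.exp(-np.sum((Ip - Iq) ** 2) / (2.0 * sigma ** 2)))
```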
Finally, we incorporate model-dependent “label costs” [7] to regularize the number of unique models in ℒ and their individual complexity. A label cost hℓ is a global potential that penalizes the use of ℓ in labeling f through the indicator function δℓ(f) = 1 ⇔ ∃p : fp = ℓ. There are many possible ways to define the weight hℓ of a label cost, such as the Akaike information criterion (AIC) [16] or Bayesian information criterion (BIC) [17]. We use a heuristic described in Section 4.2.
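Putting (1)–(3) together, a brute-force evaluation of the energy makes its structure explicit. This is a sketch only; in practice α-expansion minimizes (1) rather than enumerating labelings, and the default coefficients below simply mirror the values used in Section 4:

```python
import numpy as np

def two_level_potts(l1, l2, pi, c1, c2):
    """Eq. (3): free within a sub-label, c1 within an object, c2 across objects."""
    if l1 == l2:
        return 0.0
    return c1 if pi[l1] == pi[l2] else c2

def energy(f, pi, unary, w, h, c1=5.0, c2=10.0):
    """Brute-force evaluation of Eq. (1) on a 4-connected grid.
    f: HxW int array of sub-labels; unary(y, x, l) -> D_p(l);
    w(p, q) -> contrast weight w_pq; h: dict of per-model label costs."""
    H, W = f.shape
    E = sum(unary(y, x, f[y, x]) for y in range(H) for x in range(W))
    for y in range(H):
        for x in range(W):
            for dy, dx in ((0, 1), (1, 0)):            # right and down neighbors
                y2, x2 = y + dy, x + dx
                if y2 < H and x2 < W:
                    E += w((y, x), (y2, x2)) * two_level_potts(
                        f[y, x], f[y2, x2], pi, c1, c2)
    E += sum(h[l] for l in np.unique(f))               # delta_l(f): pay h_l once per used model
    return E
```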
3.2 Our SuperLabelSeg Algorithm
We propose a novel segmentation algorithm based on the iterative PEARL framework [7]. Each iteration of PEARL has three main steps: propose candidate models by random sampling, segment via α-expansion with label costs, and re-estimate the model parameters for the current segmentation. Our algorithm differs from [7] as follows: (1) we make use of hard constraints g and the current super-labeling π ∘ f to guide random sampling, (2) our two-level Potts potentials regularize sub- and super-labels differently, and (3) our label costs penalize models based on their individual complexity rather than applying a uniform penalty.
The proposal step repeatedly generates a new candidate model ℓ with parameters θℓ fitted to a random subsample of pixels. Each model is proposed in the context of a particular super-label i ∈ S, and so the random sample is selected from the set of pixels Pi = {p|π(fp) = i} currently associated with i. Each candidate ℓ is then added to the current label set ℒ with super-label assignment set to π(ℓ) = i. A heuristic is used to determine a sufficient number of proposals to cover the set of pixels Pi at each iteration.
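A sketch of a single proposal for one super-label, assuming GMM sub-models; scikit-learn stands in for our MATLAB EM code, and the patch radius matches the 3–5 pixel diameters described in Section 4 (for brevity the sketch does not intersect the patch with the mask):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def propose_model(features, mask, rng, radius=2, n_components=2):
    """Propose one candidate sub-label for super-label i: fit a GMM to a
    small random patch centered inside P_i.
    features: HxWxD per-pixel features (e.g. Lab color);
    mask: HxW bool array of pixels currently assigned to super-label i."""
    ys, xs = np.nonzero(mask)
    k = rng.integers(len(ys))                  # random seed pixel in P_i
    y0, x0 = int(ys[k]), int(xs[k])
    patch = features[max(0, y0 - radius):y0 + radius + 1,
                     max(0, x0 - radius):x0 + radius + 1]
    samples = patch.reshape(-1, features.shape[2])
    return GaussianMixture(n_components=n_components).fit(samples)  # theta_l
```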
Once we have candidate sub-labels for every object a naïve approach would be to directly optimize our two-level MRF. However, being random, not all of an object’s proposals are equally good for representing its appearance. For example, a proposal from a small sample of pixels is likely to over-fit or mix statistics (Figure 3, proposal 2). Such models are not characteristic of the object’s overall appearance but are problematic because they may incidentally match some portion of another object and lead to an erroneous super-label segmentation. Before allowing sub-labels to compete over the entire image, we should do our best to ensure that all appearance models within each object are relevant and accurate.
Fig. 3.

The object marked with red has two spatially-coherent appearances: pure gray, and pure white. We can generate proposals for the red object from random patches 1–3. However, if we allow proposal 2 to remain associated with the red object, it may incorrectly claim pixels from the blue object which actually does look like proposal 2.
Given the complete set of proposals, we first re-learn the appearance of each object i ∈ S. This is achieved by restricting our energy to pixels currently labeled π(fp) = i and optimizing via α-expansion with label costs [7]; this ensures that each object is represented by an accurate set of appearance models. Once each object’s appearance has been re-learned, we allow the objects to compete simultaneously for all image pixels while continuing to re-estimate their parameters. Segmentation is performed on the two-level MRF defined by the current (ℒ, θ, π). Again, we use α-expansion with label costs to select a good subset of appearance models and to partition the image. The pseudo-code below describes our SuperLabelSeg algorithm.
SuperLabelSeg(g)    where g : P → S ∪ {none} is a partial labeling
  1  ℒ = {}              // empty label set; global f, π, θ undefined
  2  Propose(g)          // initialize ℒ, π, θ from user scribbles
  3  repeat
  4      Segment(P, ℒ)   // segment entire image using all available labels ℒ
  5      Propose(π ∘ f)  // update ℒ, π, θ from current super-labeling
  6  until converged

Propose(z)    where z : P → S ∪ {none}
  1  for each i ∈ S
  2      Pi = {p | zp = i}    // set of pixels currently labeled with super-label i
  3      repeat sufficiently
  4          generate model ℓ with parameters θℓ fitted to a random sample from Pi
  5          π(ℓ) = i
  6          ℒ = ℒ ∪ {ℓ}
  7      end
  8      ℒi = {ℓ | π(ℓ) = i}
  9      Segment(Pi, ℒi)      // optimize models and segmentation within super-label i

Segment(P̂, ℒ̂)    where P̂ ⊆ P and ℒ̂ ⊆ ℒ
  1  let f|P̂ denote the current global labeling f restricted to P̂
  2  repeat
  3      f|P̂ = argmin_f̂ E(ℒ̂, θ, π, f̂)    // segment by α-expansion with label costs [7],
  4                                       // where we optimize only over f̂ : P̂ → ℒ̂
  5      ℒ = ℒ \ {ℓ ∈ ℒ̂ | δℓ(f̂) = 0}     // discard unused models
  6      θ = argmin_θ E(ℒ̂, θ, π, f̂)      // re-estimate each sub-model’s parameters
  7  until converged
4 Applications and Experiments
Our experiments demonstrate the conceptual and qualitative improvement that a two-level MRF can provide. We are only concerned with scribble-based MRF segmentation, an important class of interactive methods. We use an iterative variant of Boykov-Jolly [1] (iBJ) as a representative baseline because it is simple, popular, and exhibits a problem characteristic of a wide class of standard methods. By using one appearance model per object, such methods implicitly assume that pixels within each object are i.i.d. with respect to its model. However, this is rarely the case, as objects often have multiple distinct appearances and exhibit strong spatial correlation among their pixels. The main message of all the experiments is to show that by using multiple distinct appearance models per object we are able to reduce uncertainty near the boundaries of objects and thereby improve segmentation in difficult cases. We show applications in interactive binary/multi-class segmentation and interactive co-segmentation.
Implementation Details
In all our experiments we used publicly available α-expansion code [11, 7, 4, 18]. Our non-optimized MATLAB implementation takes on the order of one to three minutes depending on the size of the image, with the majority of time spent on re-estimating the sub-model parameters. We used the same within- and between-smoothness costs (c1 = 5, c2 = 10) in all binary, multi-class and co-segmentation experiments. Our proposal step uses distance-based sampling within each super-label whereby patches of diameter 3 to 5 are randomly selected. For re-estimating model parameters we use the MATLAB implementation of the EM algorithm for GMMs and we use PCA for planes. We regularize GMM covariance matrices in (L, a, b) color space to avoid overfitting by adding a constant value of 2.0 to the diagonal.
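For instance, the covariance regularization has a direct equivalent in Python, shown in the sketch below (scikit-learn’s `reg_covar` adds a constant to each covariance diagonal during every EM update, matching the 2.0 above; the data here is synthetic):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(1).normal(size=(200, 3))     # stand-in for (L,a,b) pixels
gmm = GaussianMixture(n_components=3, reg_covar=2.0, random_state=0).fit(X)

# Every diagonal covariance entry is the empirical variance plus 2.0.
assert (np.diagonal(gmm.covariances_, axis1=1, axis2=2) >= 2.0).all()
```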
4.1 Binary Segmentation
For binary segmentation we assume that the user wishes to segment an object from the background where the set of super-labels (scribble indexes) is defined by S = {F, B}. In this specific case we found that most of the user interaction is spent on removing disconnected false-positive object regions by scribbling over them with the background super-label. We therefore employ a simple heuristic: after convergence we find foreground connected components that are not supported by a scribble and modify their data terms to prohibit those pixels from taking the super-label F. We then perform one extra segmentation step to account for the new constraints. We apply this heuristic in all our binary segmentation results for both SuperLabelSeg and iBJ (Figures 4, 5, 6). Other heuristics could be easily incorporated in our energy to encourage connectivity, e.g. star-convexity [19, 20].
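A sketch of this post-processing heuristic using SciPy (the helper name and 4-connectivity are our illustrative choices):

```python
import numpy as np
from scipy import ndimage

def unsupported_fg_components(fg_mask, fg_scribbles):
    """Return the union of foreground connected components that contain no
    foreground scribble; callers should then set D_p(F) = infinity on these
    pixels and run one extra segmentation step.
    fg_mask, fg_scribbles: HxW bool arrays."""
    comp, n = ndimage.label(fg_mask)           # 4-connected components by default
    forbidden = np.zeros_like(fg_mask, dtype=bool)
    for c in range(1, n + 1):
        region = comp == c
        if not (region & fg_scribbles).any():  # no scribble support
            forbidden |= region
    return forbidden
```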
Fig. 4.

Binary segmentation examples. The second column shows our final sub-label segmentation f where blues indicate foreground sub-labels and reds indicate background sub-labels. The third column is generated by sampling each Ip from model θfp. The last two columns compare our super-label segmentation π ∘ f and iBJ.
Fig. 5.

More binary segmentation results showing scribbles, sub-labelings, synthesized images, and final cut-outs
Fig. 6.

An image exhibiting gradual changes in color. Columns 2–4 show colors sampled from the learned appearance models for iBJ, our two-level MRF restricted to GMMs only, and ours with both GMMs and planes. Our framework can detect a mix of GMMs (grass, clouds) and planes (sky) for the background super-label (top-right).
In Figure 4, top-right, notice that iBJ does not incorporate the floor as part of the background. This is because there is only a small proportion of floor pixels in the red scribbles, but a large proportion of a similar color (roof) in the blue scribbles. By relying directly on the color proportions in the scribbles, the learned GMMs do not represent the actual appearance of the objects in the full image. Therefore the ground is considered a priori more likely to be explained by the (wrong) roof color than the precise floor color, giving an erroneous segmentation despite the hard constraints. Our method relies on spatial coherence of the distinct appearances within each object and therefore has a sub-label that fits the floor color tightly. This same phenomenon is even more evident in the bottom row of Figure 5. In the iBJ case, the appearance model for the foreground mixes the statistics from all scribbled pixels and is biased towards the most dominant color. Our decomposition allows each appearance with spatial support (textured fabric, face, hair) to have good representation in the composite foreground model.
4.2 Complex Appearance Models
In natural images objects often exhibit gradual changes in hue, tone or shade. Modeling an object with a single GMM in color space [1, 2] makes the implicit assumption that appearance is piece-wise constant. In contrast, our framework allows us to decompose an object into regions with distinct appearance models, each from an arbitrary class (e.g. GMM, plane, quadratic, spline). Our algorithm automatically chooses the most suitable class for each sub-label within an object. Figure 6 (right) shows such an example where the background is decomposed into several grass regions, each modeled by a GMM in (L, a, b) color space, and a sky region that is modeled by a plane³ in a 5-dimensional (x, y, L, a, b) space. Note the gradual change in the sky color and how the clouds are segmented as a separate ‘white’ GMM.
Figure 7 shows a synthetic image in which the foreground object breaks into two sub-regions, each exhibiting a different type of gradual change in color. This kind of object appearance cannot be captured by a mixture of GMM models. In general our framework can incorporate a wide range of appearance models as long as there exists a black-box algorithm for estimating the parameters θℓ, which can be used at line 6 of Segment. The importance of more complex appearance models was first emphasized by DDMCMC [6] for unsupervised segmentation in a completely different algorithmic framework. Ours is the first multi-label graph-cut based framework that can incorporate such models.
Fig. 7.

Our algorithm detects complex mixtures of models, for example GMMs and planes. The appearance of the above object cannot be captured by GMMs alone.
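A sketch of the plane model under stated assumptions: the PCA fit mirrors our implementation, while the Gaussian residual likelihood is an illustrative choice, since our framework only requires some black-box estimator and likelihood:

```python
import numpy as np

def fit_plane(points, dim=2):
    """Fit a dim-flat to samples in (x, y, L, a, b) space via PCA, a stand-in
    for the plane re-estimation at line 6 of Segment. points: (N, 5) array."""
    mean = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - mean, full_matrices=False)
    return mean, Vt[:dim]                      # orthonormal rows span the flat

def plane_nll(points, mean, basis, sigma=1.0):
    """Assumed Gaussian noise on the off-plane residual gives a simple
    negative log-likelihood (up to a constant) usable as a data term;
    sigma is an illustrative noise scale."""
    d = points - mean
    resid = d - d @ basis.T @ basis            # component orthogonal to the flat
    return 0.5 * np.sum(resid ** 2, axis=1) / sigma ** 2
```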
Because our appearance models may be arbitrarily complex, we must incorporate individual model complexity in our energy. Each label cost hℓ is computed from the number of parameters in θℓ and the number ν of pixels that are to be labeled. We set ν = #{p | gp ≠ none} for line 2 of the SuperLabelSeg algorithm and ν = |P| for lines 4, 5. Penalizing complexity is a crucial part of our framework because it helps our MRF-based models avoid over-fitting. Label costs balance the number of parameters required to describe the models against the number of data points to which the models are fit. When re-estimating the parameters of a GMM we allow the number of components to increase or decrease if favored by the overall energy.
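Purely for illustration, a BIC-style instantiation of this idea is sketched below; our actual heuristic follows the same pattern of weighing parameters against data, but the exact formula and scale are a design choice:

```python
import math

def label_cost(num_params, nu, scale=0.5):
    """Illustrative BIC-style label cost (an assumed stand-in, not our exact
    heuristic): the penalty grows with the number of model parameters and,
    slowly, with the number of pixels nu, so a complex model must earn its
    cost through likelihood gains on the data it claims."""
    return scale * num_params * math.log(nu)
```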
4.3 Multi-Class Segmentation
Interactive multi-class segmentation is a straightforward application of our energy (1) where the set of super-labels S contains an index for each scribble color. Figure 8 shows examples of images with multiple scribbles corresponding to multiple objects. The resulting sub-labelings show how objects are decomposed into regions with distinct appearances. For example, in the top row, the basket is decomposed into a highly-textured colorful region (4-component GMM) and a more homogeneous region adjacent to it (2-component GMM). In the bottom row, notice that the hair of the children marked with blue was represented so weakly in the iBJ appearance model that it was absorbed into the background. The synthesized images suggest the quality of the learned appearance models. Unlike the binary case, here we do not apply the post-processing step enforcing connectivity.
Fig. 8.

Multi-class segmentation examples. Again, we show the color-coded sub-labelings and the learned appearance models. Our super-labels are decomposed into spatially coherent regions with distinct appearance.
4.4 Interactive Co-segmentation
Our two-level MRFs can be directly used for interactive co-segmentation [3, 21]. Specifically, we apply our method to co-segmentation of a collection of similar images as in [3] because it is a natural scenario for many users. This differs from ‘unsupervised’ binary co-segmentation [8, 9] that assumes dissimilar backgrounds and similar-sized foreground objects. Figure 9 shows a collection of four images with similar content. Just by scribbling on one of the images our method is able to correctly segment the objects. Note that the unmarked images contain background colors not present in the scribbled image, yet our method was able to detect these novel appearances and correctly segment the background into sub-labels.
Fig. 9.

Interactive co-segmentation examples. Note that our method detected sub-models for grass, water, and sand in the 1st and 3rd bear images; these appearances were not present in the scribbled image.
5 Discussion: Super-Labels as Semi-supervised Learning
There are evident parallels between interactive segmentation and semi-supervised learning, particularly among graph cut methods ([1] versus [22]) and random walk methods ([23] versus [24]). An insightful paper by Duchenne et al. [25] explicitly discusses this observation. Looking back at our energy and algorithm from this perspective, it is clear that we are in effect performing semi-supervised learning applied to image segmentation. For example, the grayscale image in Figure 7 can be visualized as points in a 3D feature space where small subsets of points have been labeled either blue or red. In addition to making ‘transductive’ inferences, our algorithm automatically learned that the blue label is best decomposed into two linear subspaces (green & purple planes in Figure 7, right) whereas the red label is best described by a single bi-modal GMM. The number, class, and parameters of these models were not known a priori but were discovered by SuperLabelSeg.
Our two-level framework allows each object to be modeled with arbitrary complexity but, crucially, we use spatial coherence (smooth costs) and label costs to regularize the energy and thereby avoid over-fitting. Setting c1 < c2 in our smooth costs V corresponds to a “two-level clustering assumption,” i.e. that class clusters are better separated than the sub-clusters within each class. To the best of our knowledge, we are the first to suggest iterated random sampling and α-expansion with label costs (SuperLabelSeg) as an algorithm for multi-class semi-supervised learning. These observations are interesting and potentially useful in the context of more general semi-supervised learning.
6 Conclusion
In this paper we raised the question of whether GMMs/histograms are an appropriate choice for modeling object appearance. If GMMs and histograms are not satisfying generative models for a natural image, they are equally unsatisfying for modeling the appearance of complex objects within the image.
To address this question we introduced a novel energy that models complex appearance as a two-level MRF. Our energy incorporates elements of both interactive segmentation and unsupervised learning. Interactions are used to provide high-level knowledge about objects in the image, whereas the unsupervised component tries to learn the number, class and parameters of appearance models within each object. We introduced the hierarchical Potts prior to regularize smoothness within and between the objects in our two-level MRF, and label costs to account for the individual complexity of appearance models. Our experiments demonstrate the conceptual and qualitative improvement that a two-level MRF can provide.
Finally, our energy and algorithm have interesting interpretations in terms of semi-supervised learning. In particular, our energy-based framework can be extended in a straightforward manner to handle general semi-supervised learning with ambiguously-labeled data [26]. We leave this as future work.
Appendix — Hierarchical Potts
In this paper we use two-level Potts potentials where the smoothness is governed by two coefficients, c1 and c2. This concept can be generalized to a hierarchical Potts (hPotts) potential that is useful whenever there is a natural hierarchical grouping of labels. For example, the recent work on hierarchical context [27] learns a tree-structured grouping of the class labels for object detection; with hPotts potentials it is also possible to learn pairwise interactions for segmentation with hierarchical context. We leave this as future work.
We now characterize our class of hPotts potentials and prove necessary and sufficient conditions for them to be optimized by the α-expansion algorithm [11]. Let N = ℒ ∪ S denote the combined set of sub-labels and super-labels. A hierarchical Potts prior V is defined with respect to an irreducible⁴ tree over node set N. The parent-child relationship in the tree is determined by π : N → S, where π(ℓ) gives the parent⁵ of ℓ. The leaves of the tree are the sub-labels ℒ and the interior nodes are the super-labels S. Each node i ∈ S has an associated Potts coefficient ci for penalizing sub-label transitions that cross from one sub-tree of i to another. An hPotts potential is a special case of general pairwise potentials over ℒ and can be written as an |ℒ|×|ℒ| “smooth cost matrix” with entries V(ℓ, ℓ′). The coefficients of this matrix are block-structured in a way that corresponds to some irreducible tree. The example below shows an hPotts potential V and its corresponding tree.

Let πⁿ(ℓ) denote n applications of the parent map, i.e. π(⋯π(ℓ)). Let lca(ℓ, ℓ′) denote the lowest common ancestor of ℓ and ℓ′, i.e. lca(ℓ, ℓ′) = i where i = πⁿ(ℓ) = πᵐ(ℓ′) for minimal n, m. We can now define an hPotts potential as
$$V(\ell, \ell') \;=\; c_{\operatorname{lca}(\ell,\,\ell')} \tag{4}$$
where we assume V(ℓ, ℓ) = cℓ = 0 for each leaf ℓ ∈ ℒ. For example, in the tree illustrated above lca(α, β) is super-label 4 and so the smooth cost V(α, β) = c4.
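A small worked instance of (4) on a toy irreducible tree (our own example, chosen for illustration):

```python
def lca(parent, a, b):
    """Lowest common ancestor under the parent map (the root r has parent[r] == r)."""
    anc = {a}
    while parent[a] != a:
        a = parent[a]
        anc.add(a)
    while b not in anc:
        b = parent[b]
    return b

# Toy irreducible tree: sub-labels 0..3; super-labels 4, 5; root 6.
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: 6}
c = {4: 1.0, 5: 1.0, 6: 2.0}     # Potts coefficient per interior node

V = lambda l1, l2: 0.0 if l1 == l2 else c[lca(parent, l1, l2)]   # Eq. (4)

assert V(0, 1) == 1.0   # same super-label: pays c_4
assert V(0, 2) == 2.0   # different super-labels: pays the root coefficient
```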
Theorem 1
Let V be an hPotts potential with corresponding irreducible tree π. Then

$$V \text{ is a metric} \;\iff\; c_i \le 2\,c_j \ \text{ for all } i \in S \text{ and all ancestors } j = \pi^k(i),\ k \ge 1. \tag{5}$$
Proof
The metric constraint V(β, γ) ≤ V(α, γ) + V(β, α) is equivalent to

$$c_{\operatorname{lca}(\beta,\gamma)} \;\le\; c_{\operatorname{lca}(\alpha,\gamma)} + c_{\operatorname{lca}(\beta,\alpha)} \tag{6}$$

for all α, β, γ ∈ ℒ. Because π defines a tree structure, for every α, β, γ there exist i, j ∈ S such that, without loss of generality,

$$\operatorname{lca}(\beta,\gamma) = i \qquad\text{and}\qquad \operatorname{lca}(\alpha,\gamma) = \operatorname{lca}(\beta,\alpha) = j. \tag{7}$$

In other words, there can be at most two unique lowest common ancestors among (α, β, γ), and we assume ancestor i is in the sub-tree rooted at ancestor j, possibly equal to j. For any particular (α, β, γ) and corresponding (i, j), inequality (6) is equivalent to cᵢ ≤ 2cⱼ. Since π defines an irreducible tree, for each pair (i, j) with j = πᵏ(i) there exist corresponding sub-labels (α, β, γ) satisfying (7). It follows that cᵢ ≤ 2cⱼ must hold for all pairs j = πᵏ(i), which completes the proof of (5). ∎
Footnotes
1. In practice we use non-uniform wpq and so, strictly speaking, (1) is a conditional random field (CRF) [15] rather than an MRF.
2. The dependence of D on (π, θ) and of V on π is omitted in (1) for clarity.
3. In color images a ‘plane’ is a 2-D linear subspace (a 2-flat) of the 5-D image space.
4. A tree is irreducible if all its internal nodes have at least two children.
5. The root of the tree r ∈ S is assigned π(r) = r.
References
- 1. Boykov Y, Jolly MP. Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images. Int’l Conf. on Computer Vision (ICCV). 2001;1:105–112.
- 2. Rother C, Kolmogorov V, Blake A. GrabCut: Interactive Foreground Extraction using Iterated Graph Cuts. ACM SIGGRAPH. 2004.
- 3. Batra D, Kowdle A, Parikh D, Luo J, Chen T. iCoseg: Interactive Co-segmentation with Intelligent Scribble Guidance. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2010.
- 4. Kolmogorov V, Zabih R. What Energy Functions Can Be Minimized via Graph Cuts? IEEE Trans. on Pattern Analysis and Machine Intelligence. 2004;26:147–159.
- 5. Zhu SC, Yuille AL. Region Competition: Unifying Snakes, Region Growing, and Bayes/MDL for Multiband Image Segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence. 1996;18:884–900.
- 6. Tu Z, Zhu SC. Image Segmentation by Data-Driven Markov Chain Monte Carlo. IEEE Trans. on Pattern Analysis and Machine Intelligence. 2002;24:657–673.
- 7. Delong A, Osokin A, Isack H, Boykov Y. Fast Approximate Energy Minimization with Label Costs. Int’l Journal of Computer Vision. 2011, in press.
- 8. Rother C, Minka T, Blake A, Kolmogorov V. Cosegmentation of Image Pairs by Histogram Matching — Incorporating a Global Constraint into MRFs. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2006.
- 9. Vicente S, Kolmogorov V, Rother C. Cosegmentation Revisited: Models and Optimization. In: Daniilidis K, Maragos P, Paragios N, editors. ECCV 2010, LNCS vol. 6312. Springer, Heidelberg; 2010. pp. 465–479.
- 10. Delong A, Boykov Y. Globally Optimal Segmentation of Multi-Region Objects. Int’l Conf. on Computer Vision (ICCV). 2009.
- 11. Boykov Y, Veksler O, Zabih R. Fast Approximate Energy Minimization via Graph Cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence. 2001;23:1222–1239.
- 12. Birchfield S, Tomasi C. Multiway Cut for Stereo and Motion with Slanted Surfaces. Int’l Conf. on Computer Vision (ICCV). 1999.
- 13. Zabih R, Kolmogorov V. Spatially Coherent Clustering with Graph Cuts. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2004.
- 14. Isack HN, Boykov Y. Energy-Based Geometric Multi-Model Fitting. Int’l Journal of Computer Vision. 2011, accepted.
- 15. Lafferty J, McCallum A, Pereira F. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Int’l Conf. on Machine Learning (ICML). 2001.
- 16. Akaike H. A New Look at the Statistical Model Identification. IEEE Trans. on Automatic Control. 1974;19:716–723.
- 17. Schwarz G. Estimating the Dimension of a Model. Annals of Statistics. 1978;6:461–464.
- 18. Boykov Y, Kolmogorov V. An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision. IEEE Trans. on Pattern Analysis and Machine Intelligence. 2004;26:1124–1137.
- 19. Veksler O. Star Shape Prior for Graph-Cut Image Segmentation. In: Forsyth D, Torr P, Zisserman A, editors. ECCV 2008, Part III, LNCS vol. 5304. Springer, Heidelberg; 2008. pp. 454–467.
- 20. Gulshan V, Rother C, Criminisi A, Blake A, Zisserman A. Geodesic Star Convexity for Interactive Image Segmentation. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2010.
- 21. Schnitman Y, Caspi Y, Cohen-Or D, Lischinski D. Inducing Semantic Segmentation from an Example. In: Narayanan PJ, Nayar SK, Shum H-Y, editors. ACCV 2006, LNCS vol. 3852. Springer, Heidelberg; 2006. pp. 373–384.
- 22. Blum A, Chawla S. Learning from Labeled and Unlabeled Data Using Graph Mincuts. Int’l Conf. on Machine Learning (ICML). 2001.
- 23. Grady L. Random Walks for Image Segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence. 2006;28:1768–1783.
- 24. Szummer M, Jaakkola T. Partially Labeled Classification with Markov Random Walks. Advances in Neural Information Processing Systems (NIPS). 2001.
- 25. Duchenne O, Audibert JY, Keriven R, Ponce J, Segonne F. Segmentation by Transduction. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2008.
- 26. Cour T, Sapp B, Jordan C, Taskar B. Learning from Ambiguously Labeled Images. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2009.
- 27. Choi MJ, Lim J, Torralba A, Willsky AS. Exploiting Hierarchical Context on a Large Database of Object Categories. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 2010.
