Published in final edited form as: Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit, 2019:2422–2431. doi: 10.1109/cvpr.2019.00253

Geometry-Consistent Generative Adversarial Networks for One-Sided Unsupervised Domain Mapping

Huan Fu 1,*, Mingming Gong 2,3,*, Chaohui Wang 4, Kayhan Batmanghelich 2, Kun Zhang 3, Dacheng Tao 1

Abstract

Unsupervised domain mapping aims to learn a function $G_{XY}$ that translates domain $X$ to domain $Y$ in the absence of paired examples. Finding the optimal $G_{XY}$ without paired data is an ill-posed problem, so appropriate constraints are required to obtain reasonable solutions. While prominent constraints such as cycle consistency and distance preservation successfully constrain the solution space, they overlook the special property of images that simple geometric transformations do not change an image's semantic structure. Based on this property, we develop a geometry-consistent generative adversarial network (GcGAN), which enables one-sided unsupervised domain mapping. GcGAN takes the original image and its counterpart transformed by a predefined geometric transformation as inputs and generates two images in the new domain, coupled with the corresponding geometry-consistency constraint. The geometry-consistency constraint reduces the space of possible solutions while keeping the correct solutions in the search space. Quantitative and qualitative comparisons with the baseline (GAN alone) and state-of-the-art methods, including CycleGAN [66] and DistanceGAN [5], demonstrate the effectiveness of our method.

1. Introduction

Domain mapping, or image-to-image translation, which aims to translate an image from one domain to another, has been intensively investigated over the past few years. Let $X$ denote a random variable representing source-domain images and $Y$ represent target-domain images. Depending on whether we have access to a paired sample $\{(x_i, y_i)\}_{i=1}^{N}$, domain mapping can be studied in a supervised or unsupervised manner. While several works have successfully produced high-quality translations by focusing on supervised domain mapping with constraints provided by cross-domain image pairs [46, 26, 59, 58], progress on unsupervised domain mapping has been relatively slow.

In unsupervised domain mapping, the goal is to model the joint distribution $P_{XY}$ given samples drawn from the marginal distributions $P_X$ and $P_Y$ of the individual domains. Since the two marginal distributions are consistent with an infinite set of possible joint distributions, it is difficult to guarantee that an individual input $x \in X$ and the output $G_{XY}(x)$ are paired up in a meaningful way without additional assumptions or constraints.

To address this problem, recent approaches have exploited the cycle-consistency assumption, i.e., that a mapping $G_{XY}$ and its inverse mapping $G_{YX}$ should be bijections [66, 28, 61]. Specifically, when feeding an example $x \in X$ through the composition $G_{YX} \circ G_{XY}: X \to Y \to X$, the output should be a reconstruction of $x$, and vice versa for $y$, i.e., $G_{YX}(G_{XY}(x)) \approx x$ and $G_{XY}(G_{YX}(y)) \approx y$. Further, DistanceGAN [5] showed that maintaining the distances between images within domains allows one-sided unsupervised domain mapping.

Existing constraints overlook the special property of images that simple geometric transformations (global geometric transformations without shape deformation) do not change an image's semantic structure. Here, semantic structure refers to the information that distinguishes different object/stuff classes, which can easily be perceived by humans regardless of trivial geometric transformations such as rotation. Based on this property, we develop a geometry-consistency constraint, which reduces the search space of possible solutions while still keeping the correct set of solutions under consideration, and results in a geometry-consistent generative adversarial network (GcGAN).

Our geometry-consistency constraint is motivated by the fact that a given geometric transformation $f(\cdot)$ between the input images should be preserved by the related translators $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$, where $\tilde{X}$ and $\tilde{Y}$ are the domains obtained by applying $f(\cdot)$ to the examples of $X$ and $Y$, respectively. Mathematically, given a random example $x$ from the source domain $X$ and a predefined geometric transformation function $f(\cdot)$, geometry consistency can be expressed as $f(G_{XY}(x)) \approx G_{\tilde{X}\tilde{Y}}(f(x))$ and $f^{-1}(G_{\tilde{X}\tilde{Y}}(f(x))) \approx G_{XY}(x)$, where $f^{-1}(\cdot)$ is the inverse of $f(\cdot)$. Because it is unlikely that $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$ always fail at the same locations, the geometry-consistency constraint lets $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$ co-regularize each other and thus correct each other's failures in local regions of their respective translations (see Figure 1 for an illustration). Our geometry-consistency constraint allows one-sided unsupervised domain mapping, i.e., $G_{XY}$ can be trained independently of $G_{YX}$. In this paper, we employ two simple but representative geometric transformations as examples, vertical flipping (vf) and 90-degree clockwise rotation (rot), to illustrate geometry consistency. Quantitative and qualitative comparisons with the baseline (GAN alone) and state-of-the-art methods, including CycleGAN [66] and DistanceGAN [5], demonstrate the effectiveness of our method.
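To make the constraint concrete, the two example transformations and the corresponding consistency check can be written in a few lines of PyTorch. This sketch is ours (it is not taken from the paper's released code) and assumes image batches in (N, C, H, W) layout; the function names are hypothetical.

```python
import torch

def f_rot(x):       # 90-degree clockwise rotation of an (N, C, H, W) batch
    return torch.rot90(x, k=-1, dims=(2, 3))

def f_rot_inv(x):   # inverse: 90-degree counter-clockwise rotation
    return torch.rot90(x, k=1, dims=(2, 3))

def f_vf(x):        # vertical flip along the height axis (its own inverse)
    return torch.flip(x, dims=(2,))

def geometry_residuals(G_XY, G_XtYt, x, f, f_inv):
    """The two discrepancies that geometry consistency drives toward zero:
       f(G_XY(x)) ~= G_XtYt(f(x))  and  f^{-1}(G_XtYt(f(x))) ~= G_XY(x)."""
    y_hat, yt_hat = G_XY(x), G_XtYt(f(x))
    return f(y_hat) - yt_hat, f_inv(yt_hat) - y_hat
```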

Figure 1: Geometry consistency. The original input image is denoted by $x$, and the predefined function $f(\cdot)$ is a 90° clockwise rotation (rot). GAN alone: $G^{1}_{XY}(x)$. GAN alone (rot): $f^{-1}(G^{1}_{\tilde{X}\tilde{Y}}(f(x)))$. GcGAN: $G^{2}_{XY}(x)$. GcGAN (rot): $f^{-1}(G^{2}_{\tilde{X}\tilde{Y}}(f(x)))$. GAN alone produces geometrically inconsistent output images, indicating that the learned $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$ are far from the correct mapping functions. By enforcing geometry consistency, our method results in more sensible domain mappings.

2. Related Work

Generative Adversarial Networks.

Generative adversarial networks (GANs) [21, 45, 14, 47, 51, 3] learn two networks, a generator and a discriminator, which play a zero-sum game to generate images from inputs. Many tasks have recently been addressed with deep convolutional GANs, such as image inpainting, style transfer, and domain adaptation [7, 62, 46, 48, 31, 60, 9, 52, 23, 53, 64, 27, 50, 19, 18, 35, 63]. The key component enabling GANs is the adversarial constraint, which forces the generated images to be indistinguishable from real images.

Domain Mapping.

Recent adversarial domain mapping has been studied in supervised and unsupervised settings, according to whether paired or unpaired inputs are available. There is a large body of work [46, 31, 26, 59, 56, 58, 25, 37, 4, 10] on supervised domain mapping. One representative example is conditional GAN [26], which trains the discriminator to distinguish $(x, y)$ from $(x, G_{XY}(x))$ instead of $y$ from $G_{XY}(x)$, where $(x, y)$ is a meaningful pair across domains. Further, Wang et al. [59] showed that conditional GANs can generate high-resolution images with a novel feature-matching loss, as well as multi-scale generator and discriminator architectures. While there has been significant progress in supervised domain mapping, many real-world applications cannot provide aligned images across domains because data preparation is expensive. Thus, different constraints and frameworks have been proposed for image-to-image translation in the absence of training pairs.

In unsupervised domain mapping, only unaligned examples from the individual domains are provided, making the task more practical but also more difficult. Unpaired domain mapping has a long history, and some recent successes build on adversarial networks [40, 66, 5, 39, 42, 38, 6, 11]. For example, Liu and Tuzel [40] introduced coupled GAN (CoGAN) to learn cross-domain representations by enforcing a weight-sharing constraint. Subsequently, CycleGAN [66], DiscoGAN [28], and DualGAN [61] required the translators $G_{XY}$ and $G_{YX}$ to be bijections; jointly learning $G_{XY}$ and $G_{YX}$ under cycle consistency helps to produce convincing mappings. Since then, many constraints and assumptions have been proposed to improve cycle consistency [8, 17, 24, 32, 34, 11, 2, 67, 20, 44, 39, 36, 1]. Recently, Benaim and Wolf [5] reported that maintaining the distances between samples within domains allows one-sided unsupervised domain mapping. GcGAN is also a one-sided framework, coupled with our geometry-consistency constraint, and produces translations that are competitive with, and sometimes better than, those of the two-sided CycleGAN in various applications.

3. Preliminaries

Let $X$ and $Y$ be two domains with unpaired training examples $\{x_i\}_{i=1}^{N}$ and $\{y_j\}_{j=1}^{M}$, where $x_i$ and $y_j$ are drawn from the marginal distributions $P_X$ and $P_Y$ of the random variables associated with $X$ and $Y$, respectively. In this paper, we target style transfer without undesirable semantic distortions, and we have two goals. First, we need to learn a mapping $G_{XY}$ such that $G_{XY}(X)$ has the same distribution as $Y$, i.e., $P_{G_{XY}(X)} \approx P_Y$. Second, the learned mapping function should only change the image style without distorting the semantic structure.

While many works since the success of CycleGAN have modeled the invertibility between $G_{XY}$ and $G_{YX}$ to obtain convincing mappings, here we propose to enforce geometry consistency as a constraint that allows one-sided domain mapping. Let $f(\cdot)$ be a predefined geometric transformation. We can obtain two extra domains $\tilde{X}$ and $\tilde{Y}$ with examples $\{\tilde{x}_i\}_{i=1}^{N}$ and $\{\tilde{y}_j\}_{j=1}^{M}$ by applying $f(\cdot)$ to $X$ and $Y$, respectively. We learn an additional translator $G_{\tilde{X}\tilde{Y}}: \tilde{X} \to \tilde{Y}$ while learning $G_{XY}: X \to Y$, and introduce a geometry-consistency constraint based on the predefined transformation so that the two networks can co-regularize each other. Our framework enforces that $G_{XY}(x)$ and $G_{\tilde{X}\tilde{Y}}(\tilde{x})$ are related by the same geometric transformation as $x$ and $\tilde{x}$, i.e., $f(G_{XY}(x)) \approx G_{\tilde{X}\tilde{Y}}(\tilde{x})$, where $\tilde{x} = f(x)$. We denote the two adversarial discriminators by $D_Y$ and $D_{\tilde{Y}}$ for the domains $Y$ and $\tilde{Y}$, respectively.

4. Proposed Method

We present our geometry-consistency constraint and GcGAN, beginning with a review of the cycle-consistency constraint and the distance constraint. An illustration of the differences between these constraints is shown in Figure 2.

Figure 2: An illustration of the differences between CycleGAN [66], DistanceGAN [5], and our GcGAN. $x$ and $y$ are random examples from domains $X$ and $Y$, respectively. $d(x_i, x_j)$ is the distance between images $x_i$ and $x_j$. $f(\cdot)$ is a predefined geometric transformation function for images, which satisfies $f^{-1}(f(x)) = f(f^{-1}(x)) = x$. $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$ are the generators (or translators) for the translation tasks from $X$ to $Y$ and from $\tilde{X}$ to $\tilde{Y}$, where $\tilde{X}$ and $\tilde{Y}$ are the domains obtained by applying $f(\cdot)$ to all the images in $X$ and $Y$, respectively. $D_Y$ is an adversarial discriminator for domain $Y$. The red dotted lines denote the unsupervised constraints: cycle consistency ($x \approx G_{YX}(G_{XY}(x))$), distance consistency ($d(x_i, x_j) \approx d(G_{XY}(x_i), G_{XY}(x_j))$), and our geometry consistency ($f(G_{XY}(x)) \approx G_{\tilde{X}\tilde{Y}}(f(x))$).

4.1. Unsupervised Constraints

Cycle-consistency constraint.

Following the cycle-consistency assumption [28, 66, 61], passing examples through the composed translators $G_{YX} \circ G_{XY}: X \to Y \to X$ and $G_{XY} \circ G_{YX}: Y \to X \to Y$ should recover the original images, i.e., $x \approx G_{YX}(G_{XY}(x))$ and $y \approx G_{XY}(G_{YX}(y))$ for examples $x$ and $y$ in domains $X$ and $Y$. Cycle consistency is implemented as a bidirectional reconstruction process that requires $G_{XY}$ and $G_{YX}$ to be jointly learned, as shown in Figure 2 (CycleGAN). The cycle-consistency loss $\mathcal{L}_{cyc}(G_{XY}, G_{YX}, X, Y)$ takes the form:

$$\mathcal{L}_{cyc} = \mathbb{E}_{x \sim P_X}\big[\|G_{YX}(G_{XY}(x)) - x\|_1\big] + \mathbb{E}_{y \sim P_Y}\big[\|G_{XY}(G_{YX}(y)) - y\|_1\big]. \tag{1}$$
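For reference, Eq. 1 amounts to two L1 reconstruction penalties. Below is a minimal PyTorch-style sketch of ours (not the authors' code), where G_XY and G_YX stand for the two generator networks.

```python
import torch.nn.functional as F

def cycle_consistency_loss(G_XY, G_YX, x, y):
    """Eq. 1: both round trips X -> Y -> X and Y -> X -> Y should reconstruct the input."""
    loss_x = F.l1_loss(G_YX(G_XY(x)), x)
    loss_y = F.l1_loss(G_XY(G_YX(y)), y)
    return loss_x + loss_y
```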

Distance constraint.

The assumption behind the distance constraint is that the distance between two examples $x_i$ and $x_j$ in domain $X$ should be preserved after mapping to domain $Y$, i.e., $d(x_i, x_j) \approx a \cdot d(G_{XY}(x_i), G_{XY}(x_j)) + b$, where $d(\cdot, \cdot)$ is a predefined function measuring the distance between two examples, and $a$ and $b$ are a linear coefficient and bias. In DistanceGAN [5], the distance-consistency loss $\mathcal{L}_{dis}(G_{XY}, X, Y)$ is the expectation of the absolute difference between the standardized distances:

$$\mathcal{L}_{dis} = \mathbb{E}_{x_i, x_j \sim P_X}\big[\,|\phi(x_i, x_j) - \psi(x_i, x_j)|\,\big], \quad \phi(x_i, x_j) = \frac{1}{\sigma_X}\big(\|x_i - x_j\|_1 - \mu_X\big), \quad \psi(x_i, x_j) = \frac{1}{\sigma_Y}\big(\|G_{XY}(x_i) - G_{XY}(x_j)\|_1 - \mu_Y\big), \tag{2}$$

where $\mu_X$ and $\mu_Y$ ($\sigma_X$ and $\sigma_Y$) are the means (standard deviations) of the distances over all possible pairs $(x_i, x_j)$ within domain $X$ and $(y_i, y_j)$ within domain $Y$, respectively.
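A sketch of the distance term in Eq. 2 (ours, not from DistanceGAN's released code), assuming the statistics mu_X, sigma_X, mu_Y, sigma_Y of pairwise L1 distances have been precomputed on the training sets as in DistanceGAN:

```python
import torch

def distance_loss(G_XY, x_i, x_j, mu_X, sigma_X, mu_Y, sigma_Y):
    """Eq. 2: the standardized pairwise distance should be preserved by the mapping."""
    phi = (torch.sum(torch.abs(x_i - x_j)) - mu_X) / sigma_X
    psi = (torch.sum(torch.abs(G_XY(x_i) - G_XY(x_j))) - mu_Y) / sigma_Y
    return torch.abs(phi - psi)
```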

4.2. Geometry-consistent Generative Adversarial Networks

Adversarial constraint.

Taking $G_{XY}$ as an example, an adversarial loss $\mathcal{L}_{gan}(G_{XY}, D_Y, X, Y)$ [21] trains $G_{XY}$ and $D_Y$ against each other in a minimax game, i.e., $\min_{G_{XY}} \max_{D_Y} \mathcal{L}_{gan}(G_{XY}, D_Y, X, Y)$. In other words, $D_Y$ aims to distinguish real examples $\{y\}$ from translated samples $\{G_{XY}(x)\}$. By contrast, $G_{XY}$ aims to fool $D_Y$ so that $D_Y$ labels a fake example $y = G_{XY}(x)$ as a sample satisfying $y \sim P_Y$. The objective can be expressed as:

$$\mathcal{L}_{gan} = \mathbb{E}_{y \sim P_Y}\big[\log D_Y(y)\big] + \mathbb{E}_{x \sim P_X}\big[\log\big(1 - D_Y(G_{XY}(x))\big)\big]. \tag{3}$$

In the transformed domains $\tilde{X}$ and $\tilde{Y}$, we employ an adversarial loss $\mathcal{L}_{gan}(G_{\tilde{X}\tilde{Y}}, D_{\tilde{Y}}, \tilde{X}, \tilde{Y})$ of the same form as $\mathcal{L}_{gan}(G_{XY}, D_Y, X, Y)$.
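Eq. 3 is the standard cross-entropy GAN objective. The sketch below (ours) shows the discriminator and generator sides, assuming D_Y outputs raw logits; note that the paper later replaces this objective with a least-squares loss.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D_Y, y_real, y_fake):
    """D_Y should label real samples as 1 and translated samples as 0."""
    logits_real = D_Y(y_real)
    logits_fake = D_Y(y_fake.detach())   # do not backpropagate into the generator here
    loss_real = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
    loss_fake = F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    return loss_real + loss_fake

def generator_adversarial_loss(D_Y, y_fake):
    """G_XY tries to make D_Y label its outputs as real."""
    logits_fake = D_Y(y_fake)
    return F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
```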

Geometry-consistency constraint.

As shown in Figure 2 (GcGAN), given a predefined geometric transformation function $f(\cdot)$, we feed the images $x \in X$ and $\tilde{x} = f(x)$ into the translators $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$, respectively. Following our geometry-consistency constraint, the outputs $y = G_{XY}(x)$ and $\tilde{y} = G_{\tilde{X}\tilde{Y}}(\tilde{x})$ should satisfy $\tilde{y} \approx f(y)$, just as $\tilde{x} = f(x)$. Considering both $f(\cdot)$ and the inverse geometric transformation $f^{-1}(\cdot)$, our complete geometry-consistency loss $\mathcal{L}_{geo}(G_{XY}, G_{\tilde{X}\tilde{Y}}, X, Y)$ has the following form:

$$\mathcal{L}_{geo} = \mathbb{E}_{x \sim P_X}\big[\|G_{XY}(x) - f^{-1}(G_{\tilde{X}\tilde{Y}}(f(x)))\|_1\big] + \mathbb{E}_{x \sim P_X}\big[\|G_{\tilde{X}\tilde{Y}}(f(x)) - f(G_{XY}(x))\|_1\big]. \tag{4}$$

This geometry-consistency loss can be seen as a reconstruction loss that relies on the predefined geometric transformation function $f(\cdot)$. In this paper, we take only two common geometric transformations as examples, vertical flipping (vf) and 90° clockwise rotation (rot), to demonstrate the effectiveness of our geometry-consistency constraint. Note that $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$ have the same architecture and share all their parameters.
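Eq. 4 can be sketched directly with the transformation helpers shown earlier. This is our sketch, not the released implementation; since $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$ share parameters, a single network can be called for both terms.

```python
import torch.nn.functional as F

def geometry_consistency_loss(G_XY, G_XtYt, x, f, f_inv):
    """Eq. 4: the two translation paths must agree up to the known transformation f."""
    y_hat = G_XY(x)            # translate, then transform
    yt_hat = G_XtYt(f(x))      # transform, then translate
    return F.l1_loss(y_hat, f_inv(yt_hat)) + F.l1_loss(yt_hat, f(y_hat))
```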

Full objective.

Combining our geometry-consistency constraint with the standard adversarial constraint yields an effective one-sided unsupervised domain mapping. The full objective of our GcGAN, $\mathcal{L}_{GcGAN}(G_{XY}, G_{\tilde{X}\tilde{Y}}, D_Y, D_{\tilde{Y}}, X, Y)$, is:

$$\mathcal{L}_{GcGAN} = \mathcal{L}_{gan}(G_{XY}, D_Y, X, Y) + \mathcal{L}_{gan}(G_{\tilde{X}\tilde{Y}}, D_{\tilde{Y}}, \tilde{X}, \tilde{Y}) + \lambda\,\mathcal{L}_{geo}(G_{XY}, G_{\tilde{X}\tilde{Y}}, X, Y), \tag{5}$$

where $\lambda$ ($\lambda = 20.0$ in all experiments) is a trade-off hyperparameter weighting the contribution of $\mathcal{L}_{geo}$ relative to $\mathcal{L}_{gan}$ during training. Carefully tuning $\lambda$ may yield better results on specific translation tasks.
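Putting the pieces together, one generator update under the full objective of Eq. 5 could look like the following sketch (ours), reusing the hypothetical loss helpers sketched above and assuming a shared generator G with separate discriminators D_Y and D_Yt and an optimizer opt_G:

```python
import torch.nn.functional as F

LAMBDA_GEO = 20.0  # trade-off weight used in all experiments

def generator_step(G, D_Y, D_Yt, x, f, f_inv, opt_G):
    """One update of the shared generator (G plays the roles of both G_XY and G_XtYt)."""
    y_fake, yt_fake = G(x), G(f(x))
    loss = (generator_adversarial_loss(D_Y, y_fake)
            + generator_adversarial_loss(D_Yt, yt_fake)
            + LAMBDA_GEO * (F.l1_loss(y_fake, f_inv(yt_fake))
                            + F.l1_loss(yt_fake, f(y_fake))))
    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
    return loss.item()
```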

Network architecture.

The full framework of our GcGAN is illustrated in Figure 2. Our experimental settings, network architectures, and learning strategies follow CycleGAN. We employ the same discriminator and generator as CycleGAN, depending on the specific task. Specifically, the generator is a standard encoder-decoder, in which the encoder contains two convolutional layers with stride 2 followed by several residual blocks [22] (6 or 9 blocks for input resolutions of 128 × 128 or 256 × 256, respectively), and the decoder contains two deconvolutional layers, also with stride 2. The discriminator distinguishes images at the patch level, following PatchGANs [26, 33]. Like CycleGAN, we also use an identity mapping loss [55] in all of our experiments (except SVHN → MNIST), including our baseline (GAN alone). As for other details, we use LeakyReLU as the nonlinearity in the discriminators and instance normalization [57] to normalize convolutional feature maps.
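A condensed sketch (ours, written in PyTorch from the description above) of the ResNet-style generator: two stride-2 convolutions, a stack of residual blocks, and two stride-2 transposed convolutions. Layer counts, padding, and normalization details in the actual CycleGAN/GcGAN code may differ.

```python
import torch.nn as nn

class ResnetBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(dim, dim, 3), nn.InstanceNorm2d(dim), nn.ReLU(True),
            nn.ReflectionPad2d(1), nn.Conv2d(dim, dim, 3), nn.InstanceNorm2d(dim))

    def forward(self, x):
        return x + self.block(x)  # residual connection

class ResnetGenerator(nn.Module):
    def __init__(self, in_ch=3, out_ch=3, ngf=64, n_blocks=9):  # use n_blocks=6 for 128x128 inputs
        super().__init__()
        layers = [nn.ReflectionPad2d(3), nn.Conv2d(in_ch, ngf, 7),
                  nn.InstanceNorm2d(ngf), nn.ReLU(True)]
        # encoder: two stride-2 convolutions
        for mult in (1, 2):
            layers += [nn.Conv2d(ngf * mult, ngf * mult * 2, 3, stride=2, padding=1),
                       nn.InstanceNorm2d(ngf * mult * 2), nn.ReLU(True)]
        # bottleneck: residual blocks at the lowest resolution
        layers += [ResnetBlock(ngf * 4) for _ in range(n_blocks)]
        # decoder: two stride-2 transposed convolutions
        for mult in (4, 2):
            layers += [nn.ConvTranspose2d(ngf * mult, ngf * mult // 2, 3, stride=2,
                                          padding=1, output_padding=1),
                       nn.InstanceNorm2d(ngf * mult // 2), nn.ReLU(True)]
        layers += [nn.ReflectionPad2d(3), nn.Conv2d(ngf, out_ch, 7), nn.Tanh()]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
```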

Learning and inference.

We use the Adam solver [29] with a learning rate of 0.0002 and momentum parameters (0.5, 0.999), which control the running averages of the gradients and of their squares, respectively. The learning rate is fixed for the first 100 epochs and then linearly decays to zero over the next 100 epochs. Following CycleGAN, the negative log-likelihood objective is replaced with the more stable and effective least-squares loss [43] for $\mathcal{L}_{gan}$. The discriminator is updated with random samples from a history of generated images stored in an image buffer [54] of size 50. The generator and discriminator are optimized alternately. At inference time, we feed an image only into the learned generator $G_{XY}$ to obtain the translated image.
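This optimization schedule translates into a few lines of standard PyTorch. The sketch below is ours and shows the Adam settings, the linear learning-rate decay from epoch 100 to 200, and a minimal version of the generated-image buffer; `generator` and `discriminator` are assumed to be nn.Module instances (e.g., the generator sketched above and a PatchGAN discriminator).

```python
import random
import torch

opt_G = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def lr_lambda(epoch):
    """Constant learning rate for 100 epochs, then linear decay to zero at epoch 200."""
    return 1.0 - max(0, epoch - 100) / 100.0

sched_G = torch.optim.lr_scheduler.LambdaLR(opt_G, lr_lambda)
sched_D = torch.optim.lr_scheduler.LambdaLR(opt_D, lr_lambda)

class ImagePool:
    """Buffer of previously generated images (size 50) sampled when updating D [54]."""
    def __init__(self, size=50):
        self.size, self.images = size, []

    def query(self, image):
        if len(self.images) < self.size:   # fill the buffer first
            self.images.append(image)
            return image
        if random.random() > 0.5:          # half the time, swap in the new image and return an old one
            idx = random.randrange(self.size)
            old, self.images[idx] = self.images[idx], image
            return old
        return image                       # otherwise use the current image
```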

5. Experiments

We apply our GcGAN to a wide range of applications and make both quantitative and qualitative comparisons with the baseline (GAN alone) and previous state-of-the-art methods, including DistanceGAN and CycleGAN. We also study several ablations (based on rot) to analyze our geometry-consistency constraint. Since adversarial training is not always stable, each independent run can yield slightly different scores; the scores reported in the quantitative analysis are averaged over three independent runs.

5.1. Quantitative Analysis

The results demonstrate that our geometry-consistency constraint not only partially filters out candidate solutions exhibiting mode collapse or semantic distortions, thus producing more sensible translations, but is also compatible with other unsupervised constraints such as cycle consistency [66] and distance preservation [5].

Cityscapes.

Cityscapes [12] contains 3475 image-label pairs, with 2975 used for training and 500 for validation (used as the test set in this paper). For a fair comparison with CycleGAN, the translators are trained at a resolution of 128 × 128 in an unaligned fashion. We evaluate our domain mappers using FCN scores and scene parsing metrics, following previous works [41, 12, 66]. Specifically, for parsing → image, we assume that a high-quality translated image should produce plausible semantic segmentation, like a real image, when fed into a scene parser. Thus, we employ the pretrained FCN-8s [41] provided by pix2pix [26] to predict semantic labels for the 500 translated images. The label maps are then resized to the original resolution of 1024 × 2048 and compared against the ground-truth labels using standard scene parsing metrics, including pixel accuracy, class accuracy, and mean IoU [41]. For image → parsing, since the fake labels are in RGB format, we simply convert them into class-level labels using a nearest-neighbor search strategy. In particular, Cityscapes has 19 category labels plus 1 ignored label, each with a corresponding RGB color value. For each pixel i in a translated parsing map, we compute the distances between the 20 ground-truth color values and the color value of pixel i, and assign pixel i the label with the smallest distance. The aforementioned metrics are then used to evaluate our mapping on the 19 category labels.
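The RGB-to-class conversion for image → parsing reduces to a nearest-neighbor lookup against the 20 palette colors. Here is a NumPy sketch of ours (the distance metric is not specified in the text; we use the Euclidean distance in RGB space), where `palette` is assumed to be a (20, 3) array of the ground-truth class colors:

```python
import numpy as np

def rgb_to_label(parsing_rgb, palette):
    """Assign each pixel the class whose palette color is nearest in RGB space."""
    h, w, _ = parsing_rgb.shape
    pixels = parsing_rgb.reshape(-1, 3).astype(np.float32)                    # (H*W, 3)
    dists = np.linalg.norm(pixels[:, None, :] - palette[None, :, :], axis=2)  # (H*W, 20)
    return dists.argmin(axis=1).reshape(h, w)                                 # per-pixel class index
```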

The parsing scores are presented in Table 1. Our GcGAN outperforms the baseline (GAN alone) by a large margin. We take the average of pixel accuracy, class accuracy, and mean IoU as the final score for analysis [65], i.e., score = (pixel acc + class acc + mean IoU)/3. For image → parsing, GcGAN (32.6%) yields a slightly higher score than CycleGAN (32.0%). For parsing → image, GcGAN (29.0% ~ 29.5%) obtains a convincing improvement of 1.3% ~ 1.8% over DistanceGAN (27.7%).

Table 1: Parsing scores on Cityscapes.

$\mathcal{L}_{GcGAN}$: the objective in Eqn. 5 with rot. GcGAN-rot-Separate: $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$ do not share parameters. GcGAN-Mix-comb: training GcGAN with both vf and rot in each iteration. GcGAN-Mix-rand: training GcGAN with a randomly chosen vf or rot in each iteration. GcGAN-rot + Cycle: GcGAN-rot with the cycle-consistency constraint.

method image → parsing parsing → image
pixel acc class acc mean IoU pixel acc class acc mean IoU
Benchmark Performance
CoGAN [40] 0.45 0.11 0.08 0.40 0.10 0.06
BiGAN/ALI [15, 16] 0.41 0.13 0.07 0.19 0.06 0.02
SimGAN [54] 0.47 0.11 0.07 0.20 0.10 0.04
CycleGAN (Cycle) [66] 0.58 0.22 0.16 0.52 0.17 0.11
DistanceGAN [5] - - - 0.53 0.19 0.11
GAN alone (baseline) 0.514 0.160 0.104 0.437 0.161 0.098
GcGAN-rot 0.574 0.234 0.170 0.551 0.197 0.129
GcGAN-vf 0.576 0.232 0.171 0.548 0.196 0.127
Ablation Studies (Robustness & Compatibility)
$\mathcal{L}_{GcGAN}$ w/o $\mathcal{L}_{geo}$ ($\lambda = 0$) 0.486 0.163 0.102 0.396 0.148 0.088
$\mathcal{L}_{GcGAN}$ w/o $\mathcal{L}_{gan}(\tilde{X}, \tilde{Y})$ 0.549 0.199 0.139 0.526 0.184 0.111
GcGAN-rot-Separate 0.575 0.233 0.170 0.545 0.196 0.124
GcGAN-Mix-comb 0.573 0.229 0.168 0.545 0.197 0.128
GcGAN-Mix-rand 0.564 0.217 0.156 0.547 0.192 0.123
GcGAN-rot + Cycle 0.587 0.246 0.182 0.557 0.201 0.132

We next perform ablation studies to further analyze GcGAN; the scores are reported in Table 1. GcGAN-rot-Separate shows that the generator $G_{XY}$ employed in GcGAN is sufficient to handle both style transfers (without shape deformation), $X \to Y$ and $\tilde{X} \to \tilde{Y}$. GcGAN-Mix-{comb, rand} demonstrate that preserving a single geometric transformation already filters out most of the candidate solutions exhibiting mode collapse or undesired shape deformation, and that preserving additional transformations does not filter out much more. In addition, GcGAN-Mix-rand performs slightly worse than GcGAN-Mix-comb. One possible reason is that neither $X_{rot} \to Y_{rot}$ nor $X_{vf} \to Y_{vf}$ is sufficiently trained in the random case, which weakens the co-regularization mechanism described above. For GcGAN-rot + Cycle, we set the trade-off parameter for $\mathcal{L}_{cyc}$ to 10.0, as published in CycleGAN. The consistent improvement is credible support that our geometry-consistency constraint is compatible with the widely used cycle-consistency constraint. Moreover, when setting $\lambda = 0$ in $\mathcal{L}_{GcGAN}$, both $G_{XY}$ and $G_{\tilde{X}\tilde{Y}}$ perform badly. One possible reason is that, without the geometry-consistency constraint, jointly modeling $X \to Y$ and $\tilde{X} \to \tilde{Y}$ with the shared generator $G_{XY}$ decreases performance because of the domain differences introduced by the geometric transformations. When removing $\mathcal{L}_{gan}(G_{\tilde{X}\tilde{Y}}, D_{\tilde{Y}})$, the obtained scores are still much higher than the baseline (GAN alone), because $Y$ can partially correct $\tilde{Y}$ so that $G_{XY}$ is able to handle the mapping $\tilde{X} \to \tilde{Y}$, and $\tilde{Y}$ in turn constrains the mapping $X \to Y$. Consistent with this analysis, when learning both $\mathcal{L}_{gan}(G_{XY}, D_Y)$ and $\mathcal{L}_{gan}(G_{\tilde{X}\tilde{Y}}, D_{\tilde{Y}})$ together with $\mathcal{L}_{geo}$, the co-regularization helps generate preferable translations.

SVHN → MNIST.

We apply our approach to the SVHN → MNIST translation task. The translation models are trained on the 73257 and 60000 training images of resolution 32 × 32 in the SVHN and MNIST training sets, respectively. The experimental settings follow DistanceGAN [5], including the default trade-off parameters for $\mathcal{L}_{cyc}$ and $\mathcal{L}_{dis}$. We compare our GcGAN with both DistanceGAN and CycleGAN on this translation task. To obtain quantitative results, we feed the translated images into a classifier pretrained on the MNIST training split, as done in [5].

Classification accuracies are reported in Table 2. Both GcGAN-rot and GcGAN-vf outperform DistanceGAN and CycleGAN by a large margin (about 6% ~ 7%). The ablations show that adding our geometry-consistency constraint to existing unsupervised domain mapping frameworks yields different levels of improvement over the original frameworks. Note that, while the distance-preservation constraint appears to be incompatible with the cycle-consistency constraint on this task, our geometry-consistency constraint improves both.

Table 2: Quantitative scores for SVHN → MNIST.

method class acc (%)
Benchmark Performance
DistanceGAN (Dist.) [5] 26.8
CycleGAN (Cycle) [66] 26.1
Self-Distance [5] 25.2
GcGAN-rot 32.5
GcGAN-vf 33.3
Ablation Studies (Compatibility)
Cycle + Dist. [5] 18.0
GcGAN-rot + Dist. 34.0
GcGAN-rot + Cycle 33.8
GcGAN-rot + Dist. + Cycle 33.2

Google Maps.

We obtain 2194 (map, aerial photo) pairs of images in and around New York City from Google Maps [26] and split them into training and test sets with 1096 and 1098 pairs, respectively. We train Map ⇌ Aerial photo translators at an image size of 256 × 256 using the training set in an unsupervised (unpaired) manner by ignoring the pair information. For Aerial photo → Map, we compare with CycleGAN using average RMSE and pixel accuracy (%). Given a pixel i with ground-truth RGB value $(r_i, g_i, b_i)$ and predicted RGB value $(r'_i, g'_i, b'_i)$, we consider the prediction accurate if $\max(|r_i - r'_i|, |g_i - g'_i|, |b_i - b'_i|) < \delta$. Since maps contain only a limited number of distinct RGB values, it is reasonable to compute pixel accuracy using this strategy ($\delta_1 = 5$ and $\delta_2 = 10$ in this paper). For Map → Aerial photo, we only show qualitative results in Figure 3.
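The δ-thresholded pixel accuracy and RMSE for Aerial photo → Map can be computed directly from the RGB arrays. A small NumPy sketch of ours, with ground truth and prediction as uint8 images of the same size:

```python
import numpy as np

def map_pixel_accuracy(gt_rgb, pred_rgb, delta):
    """A pixel counts as accurate if every channel differs from the ground truth by less than delta."""
    diff = np.abs(gt_rgb.astype(np.int32) - pred_rgb.astype(np.int32))  # (H, W, 3)
    return float((diff.max(axis=2) < delta).mean())

def map_rmse(gt_rgb, pred_rgb):
    """Root-mean-square error over all pixels and channels."""
    err = gt_rgb.astype(np.float32) - pred_rgb.astype(np.float32)
    return float(np.sqrt(np.mean(err ** 2)))
```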

Figure 3: Qualitative comparison on Cityscapes (Parsing ⇌ Image) and Google Maps (Map ⇌ Aerial photo). GAN alone suffers from mode collapse; translated images by GcGAN contain more details. GcGAN = GAN alone + geometry consistency.

From the scores presented in Table 3, GcGAN produces translations superior to the baseline (GAN alone). In particular, GcGAN yields an 18.0% ~ 21.9% improvement over the baseline in pixel accuracy when $\delta_1 = 5$, demonstrating that the fake maps obtained by our GcGAN contain more details. In addition, both GcGAN variants achieve scores competitive with CycleGAN.

Table 3: Quantitative scores for Aerial photo → Map.

method RMSE acc (δ1) acc (δ2)
Benchmark Performance
CycleGAN [66] 28.15 41.8 63.7
GAN alone (baseline) 33.27 19.3 42.0
GcGAN-rot 28.31 41.2 63.1
GcGAN-vf 28.50 37.3 58.9
Ablation Studies (Robustness & Compatibility)
GcGAN-rot-Separate 30.25 40.7 60.8
GcGAN-Mix-comb 27.98 42.8 64.6
GcGAN-rot + Cycle 28.21 40.6 63.5

5.2. Qualitative Evaluation

The qualitative results are shown in Figures 3, 4, and 5. Our geometry-consistency constraint improves the training of GAN alone and helps generate empirically more impressive translations across various applications. The following applications are trained at an image size of 256 × 256 with the rot geometric transformation.

Figure 4: Qualitative comparison for SVHN → MNIST.

Figure 5: Qualitative results on different applications, including Horse → Zebra, Monet → Photo, Synthetic ⇌ Real, Summer ⇌ Winter, Photo → Artistic Painting, and Day ⇌ Night. GcGAN has the potential to produce realistic images. Zoom in for a better view.

Horse → Zebra.

We apply GcGAN to the widely studied object transfiguration task Horse → Zebra. The images are randomly sampled from ImageNet [13] using the keywords wild horse and zebra; the numbers of training images are 939 and 1177 for horse and zebra, respectively. We find that training GcGAN without parameter sharing produces preferable translations for this task.

Synthetic ⇌ Real.

We employ the 2975 training images from Cityscapes as the real-world scenes, and randomly select 3060 images from SYNTHIA-CVPR16 [49], which is a virtual urban scene benchmark, as the synthetic images.

Summer ⇌ Winter.

The images used for the season translation tasks are provided by CycleGAN. The training set sizes for Summer and Winter are 1273 and 854.

Photo ⇌ Artistic Painting.

We translate natural images to artistic paintings with different art styles, including Monet, Cezanne, Van Gogh, and Ukiyo-e. We also apply GcGAN to the reverse task of translating Monet's paintings to photographs. We use the photos and paintings (Monet: 1074, Cezanne: 584, Van Gogh: 401, Ukiyo-e: 1433, and photographs: 6853) collected by CycleGAN for training.

Day ⇌ Night.

We randomly extract 4500 training images for both Day and Night from the 91 webcam sequences captured by [30].

6. Conclusion

In this paper, we propose to enforce geometry consistency, defined with respect to a predefined geometric transformation $f(\cdot)$ that preserves the geometry of a scene, as a constraint for unsupervised domain mapping. The geometry-consistency constraint makes the translation networks for the original and transformed images co-regularize each other, which not only provides an effective remedy for the mode collapse problem suffered by standard GANs, but also reduces semantic distortions in the translations. We evaluate our model, the geometry-consistent generative adversarial network (GcGAN), both qualitatively and quantitatively on various applications. The experimental results demonstrate that GcGAN achieves translations that are competitive with, and sometimes better than, those of state-of-the-art methods including DistanceGAN and CycleGAN. Finally, our geometry-consistency constraint is compatible with other well-studied unsupervised constraints.

7. Acknowledgement

This research was supported by Australian Research Council Projects FL-170100117, DP-180103424, and IH-180100002. This work was partially supported by SAP SE and CNRS INS2I-JCJC-INVISANA, and partially supported by NIH Award Number 1R01HL141813-01, NSF 1839332 Tripod+X, and SAP SE. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research. We are also grateful for the computational resources provided by Pittsburgh Supercomputing grant number TGASC170024.

References

  • [1].Almahairi Amjad, Rajeswar Sai, Sordoni Alessandro, Bachman Philip, and Courville Aaron. Augmented cyclegan: Learning many-to-many mappings from unpaired data. ICML, 2018. [Google Scholar]
  • [2].Anoosheh Asha, Agustsson Eirikur, Timofte Radu, and Van Gool Luc. Combogan: Unrestrained scalability for image domain translation. In CVPRW, 2018. [Google Scholar]
  • [3].Arjovsky Martin, Chintala Soumith, and Bottou Léon. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017. [Google Scholar]
  • [4].Azadi Samaneh, Fisher Matthew, Kim Vladimir, Wang Zhaowen, Shechtman Eli, and Darrell Trevor. Multi-content gan for few-shot font style transfer. In CVPR, 2018. [Google Scholar]
  • [5].Benaim Sagie and Wolf Lior. One-sided unsupervised domain mapping. In NIPS, 2017. [Google Scholar]
  • [6].Benaim Sagie and Wolf Lior. One-shot unsupervised cross domain translation. NIPS, 2018. [Google Scholar]
  • [7].Bousmalis Konstantinos, Trigeorgis George, Silberman Nathan, Krishnan Dilip, and Erhan Dumitru. Domain separation networks. In NIPS, 2016. [Google Scholar]
  • [8].Chang Huiwen, Lu Jingwan, Yu Fisher, and Finkelstein Adam. Pairedcyclegan: Asymmetric style transfer for applying and removing makeup. In CVPR, 2018. [Google Scholar]
  • [9].Chen Dongdong, Yuan Lu, Liao Jing, Yu Nenghai, and Hua Gang. Stereoscopic neural style transfer. In CVPR, 2018. [Google Scholar]
  • [10].Chen Qifeng and Koltun Vladlen. Photographic image synthesis with cascaded refinement networks. In ICCV, 2017. [Google Scholar]
  • [11].Choi Yunjey, Choi Minje, Kim Munyoung, Ha Jung-Woo, Kim Sunghun, and Choo Jaegul. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018. [Google Scholar]
  • [12].Cordts Marius, Omran Mohamed, Ramos Sebastian, Rehfeld Timo, Enzweiler Markus, Benenson Rodrigo, Franke Uwe, Roth Stefan, and Schiele Bernt. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. [Google Scholar]
  • [13].Deng Jia, Dong Wei, Socher Richard, Li Li-Jia, Li Kai, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. [Google Scholar]
  • [14].Denton Emily L, Chintala Soumith, Fergus Rob, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015. [Google Scholar]
  • [15].Donahue Jeff, Krähenbuhl Philipp, and Darrell Trevor. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016. [Google Scholar]
  • [16].Dumoulin Vincent, Belghazi Ishmael, Poole Ben, Mastropietro Olivier, Lamb Alex, Arjovsky Martin, and Courville Aaron. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016. [Google Scholar]
  • [17].Gokaslan Aaron, Ramanujan Vivek, Ritchie Daniel, In Kim Kwang, and Tompkin James. Improving shape deformation in unsupervised image-to-image translation. ECCV, 2018. [Google Scholar]
  • [18].Gong Mingming, Zhang Kun, Huang Biwei, Glymour Clark, Tao Dacheng, and Batmanghelich Kayhan. Causal generative domain adaptation networks. arXiv preprint arXiv:1804.04333, 2018. [Google Scholar]
  • [19].Gong Mingming, Zhang Kun, Liu Tongliang, Tao Dacheng, Glymour Clark, and Schölkopf Bernhard. Domain adaptation with conditional transferable components. In ICML, pages 2839–2848, 2016. [PMC free article] [PubMed] [Google Scholar]
  • [20].Gonzalez-Garcia Abel, van de Weijer Joost, and Bengio Yoshua. Image-to-image translation for cross-domain disentanglement. NIPS, 2018. [Google Scholar]
  • [21].Goodfellow Ian, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron, and Bengio Yoshua. Generative adversarial nets. In NIPS, 2014. [Google Scholar]
  • [22].He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. Deep residual learning for image recognition. In CVPR, 2016. [Google Scholar]
  • [23].Hoffman Judy, Tzeng Eric, Park Taesung, Zhu Jun-Yan, Isola Phillip, Saenko Kate, Efros Alexei A, and Darrell Trevor. Cycada: Cycle-consistent adversarial domain adaptation. ICML, 2018. [Google Scholar]
  • [24].Huang Xun, Liu Ming-Yu, Belongie Serge, and Kautz Jan. Multimodal unsupervised image-to-image translation. ECCV, 2018. [Google Scholar]
  • [25].Isokane Takahiro, Okura Fumio, Ide Ayaka, Matsushita Yasuyuki, and Yagi Yasushi. Probabilistic plant modeling via multi-view image-to-image translation. CVPR, 2018. [Google Scholar]
  • [26].Isola Phillip, Zhu Jun-Yan, Zhou Tinghui, and Efros Alexei A. Image-to-image translation with conditional adversarial networks. In CVPR, 2017. [Google Scholar]
  • [27].Johnson Justin, Alahi Alexandre, and Fei-Fei Li. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016. [Google Scholar]
  • [28].Kim Taeksoo, Cha Moonsu, Kim Hyunsoo, Lee Jung Kwon, and Kim Jiwon. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017. [Google Scholar]
  • [29].Kingma Diederik P and Ba Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [Google Scholar]
  • [30].Laffont Pierre-Yves, Ren Zhile, Tao Xiaofeng, Qian Chao, and Hays James. Transient attributes for high-level understanding and editing of outdoor scenes. ACM TOG, 33(4):149, 2014. [Google Scholar]
  • [31].Ledig Christian, Theis Lucas, Huszar Ferenc, Caballero Jose, Cunningham Andrew, Acosta Alejandro, Aitken Andrew, Tejani Alykhan, Totz Johannes, Wang Zehan, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017. [Google Scholar]
  • [32].Lee Hsin-Ying, Tseng Hung-Yu, Huang Jia-Bin, Singh Maneesh, and Yang Ming-Hsuan. Diverse image-to-image translation via disentangled representations. In ECCV, 2018. [Google Scholar]
  • [33].Li Chuan and Wand Michael. Precomputed real-time texture synthesis with markovian generative adversarial networks. In ECCV, 2016. [Google Scholar]
  • [34].Li Minjun, Huang Haozhi, Ma Lin, Liu Wei, Zhang Tong, and Jiang Yugang. Unsupervised image-to-image translation with stacked cycle-consistent adversarial networks. ECCV, 2018. [Google Scholar]
  • [35].Li Ya, Tian Xinmei, Gong Mingming, Liu Yajing, Liu Tongliang, Zhang Kun, and Tao Dacheng. Deep domain generalization via conditional invariant adversarial networks. In ECCV, pages 624–639, 2018. [Google Scholar]
  • [36].Liang Xiaodan, Zhang Hao, and Xing Eric P. Generative semantic manipulation with contrasting gan. NIPS, 2017. [Google Scholar]
  • [37].Lin Jianxin, Xia Yingce, Qin Tao, Chen Zhibo, and Liu Tie-Yan. Conditional image-to-image translation. In CVPR, 2018. [Google Scholar]
  • [38].Liu Alexander, Liu Yen-Chen, Yeh Yu-Ying, and Wang Yu-Chiang Frank. A unified feature disentangler for multi-domain image translation and manipulation. NIPS, 2018. [Google Scholar]
  • [39].Liu Ming-Yu, Breuel Thomas, and Kautz Jan. Unsupervised image-to-image translation networks. In NIPS, 2017. [Google Scholar]
  • [40].Liu Ming-Yu and Tuzel Oncel. Coupled generative adversarial networks. In NIPS, 2016. [Google Scholar]
  • [41].Long Jonathan, Shelhamer Evan, and Darrell Trevor. Fully convolutional networks for semantic segmentation. In CVPR, 2015. [DOI] [PubMed] [Google Scholar]
  • [42].Ma Shuang, Fu Jianlong, Chen Chang Wen, and Mei Tao. Da-gan: Instance-level image translation by deep attention generative adversarial networks. In CVPR, 2018. [Google Scholar]
  • [43].Mao Xudong, Li Qing, Xie Haoran, Lau Raymond YK, Wang Zhen, and Smolley Stephen Paul. Least squares generative adversarial networks. In ICCV, 2017. [DOI] [PubMed] [Google Scholar]
  • [44].Mejjati Youssef A, Richardt Christian, Tompkin James, Cosker Darren, and Kim Kwang In. Unsupervised attention-guided image to image translation. NIPS, 2018. [Google Scholar]
  • [45].van den Oord Aäron, Kalchbrenner Nal, Vinyals Oriol, Espeholt Lasse, Graves Alex, and Kavukcuoglu Koray. Conditional image generation with pixelcnn decoders. In NIPS, 2016. [Google Scholar]
  • [46].Pathak Deepak, Krahenbuhl Philipp, Donahue Jeff, Darrell Trevor, and Efros Alexei A. Context encoders: Feature learning by inpainting. In CVPR, 2016. [Google Scholar]
  • [47].Radford Alec, Metz Luke, and Chintala Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016. [Google Scholar]
  • [48].Reed Scott, Akata Zeynep, Yan Xinchen, Logeswaran Lajanugen, Schiele Bernt, and Lee Honglak. Generative adversarial text to image synthesis. In ICML, 2016. [Google Scholar]
  • [49].Ros German, Sellart Laura, Materzynska Joanna, Vazquez David, and Lopez Antonio M.. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016. [Google Scholar]
  • [50].Royer Amélie, Bousmalis Konstantinos, Gouws Stephan, Bertsch Fred, Moressi Inbar, Cole Forrester, and Murphy Kevin. Xgan: Unsupervised image-to-image translation for many-to-many mappings. ICLR, 2018. [Google Scholar]
  • [51].Salimans Tim, Goodfellow Ian, Zaremba Wojciech, Cheung Vicki, Radford Alec, and Chen Xi. Improved techniques for training gans. In NIPS, 2016. [Google Scholar]
  • [52].Shen Falong, Yan Shuicheng, and Zeng Gang. Neural style transfer via meta networks. In CVPR, 2018. [Google Scholar]
  • [53].Sheng Lu, Lin Ziyi, Shao Jing, and Wang Xiaogang. Avatarnet: Multi-scale zero-shot style transfer by feature decoration. In CVPR, 2018. [Google Scholar]
  • [54].Shrivastava Ashish, Pfister Tomas, Tuzel Oncel, Susskind Joshua, Wang Wenda, and Webb Russell. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017. [Google Scholar]
  • [55].Taigman Yaniv, Polyak Adam, and Wolf Lior. Unsupervised cross-domain image generation, 2016. [Google Scholar]
  • [56].Ulyanov Dmitry, Lebedev Vadim, Vedaldi Andrea, and Lempitsky Victor. Texture networks: feed-forward synthesis of textures and stylized images. In ICML, 2016. [Google Scholar]
  • [57].Ulyanov Dmitry, Vedaldi Andrea, and Lempitsky Victor. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016. [Google Scholar]
  • [58].Wang Chao, Zheng Haiyong, Yu Zhibin, Zheng Ziqiang, Gu Zhaorui, and Zheng Bing. Discriminative region proposal adversarial networks for high-quality image-to-image translation. In ECCV, 2018. [Google Scholar]
  • [59].Wang Ting-Chun, Liu Ming-Yu, Zhu Jun-Yan, Tao Andrew, Kautz Jan, and Catanzaro Bryan. High-resolution image synthesis and semantic manipulation with conditional gans. arXiv preprint arXiv:1711.11585, 2017. [Google Scholar]
  • [60].Wang Yaxing, van de Weijer Joost, and Herranz Luis. Mix and match networks: encoder-decoder alignment for zero-pair image translation. In CVPR, 2018. [Google Scholar]
  • [61].Yi Zili, Zhang Hao, Tan Ping, and Gong Minglun. Dualgan: Unsupervised dual learning for image-to-image translation. In CVPR, 2017. [Google Scholar]
  • [62].Zhang Han, Xu Tao, Li Hongsheng, Zhang Shaoting, Wang Xiao-gang, Huang Xiaolei, and Metaxas Dimitris N. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017. [DOI] [PubMed] [Google Scholar]
  • [63].Zhang Kun, Scholköpf Bernhard, Muandet Krikamol, and Wang Zhikun. Domain adaptation under target and conditional shift. In ICML, pages 819–827, 2013. [Google Scholar]
  • [64].Zhang Yexun, Zhang Ya, and Cai Wenbin. Separating style and content for generalized style transfer. In CVPR, 2018. [Google Scholar]
  • [65].Zhou Bolei, Zhao Hang, Puig Xavier, Fidler Sanja, Barriuso Adela, and Torralba Antonio. Scene parsing through ade20k dataset. In CVPR, 2017. [Google Scholar]
  • [66].Zhu Jun-Yan, Park Taesung, Isola Phillip, and Efros Alexei A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In CVPR, 2017. [Google Scholar]
  • [67].Zhu Jun-Yan, Zhang Richard, Pathak Deepak, Darrell Trevor, Efros Alexei A, Wang Oliver, and Shechtman Eli. Toward multimodal image-to-image translation. In NIPS, 2017. [Google Scholar]
