Abstract
Generative adversarial network (GAN) has become one of the most important neural network models for classical unsupervised machine learning. A variety of discriminator loss functions have been developed to train GAN discriminators, and they all share a common structure: a sum of real and fake losses that depend only on the actual and generated data, respectively. One challenge associated with an equally weighted sum of two losses is that the training may benefit one loss but harm the other, which we show causes instability and mode collapse. In this paper, we introduce a new family of discriminator loss functions that adopts a weighted sum of real and fake parts, which we call adaptive weighted loss functions or aw-loss functions. Using the gradients of the real and fake parts of the loss, we can adaptively choose weights to train a discriminator in the direction that benefits the GAN’s stability. Our method can potentially be applied to any discriminator model with a loss that is a sum of the real and fake parts. In our experiments, we use SN-GAN, AutoGAN, and BigGAN. The experiments validate the effectiveness of our loss functions on unconditional and conditional image generation tasks, improving the baseline results by a significant margin on the CIFAR-10, STL-10, and CIFAR-100 datasets in terms of Inception Score (IS) and Fréchet Inception Distance (FID).
1. Introduction
Generative Adversarial Network (GAN) [15] has become one of the most important neural network models for unsupervised machine learning. The origin of this idea lies in the combination of two neural networks, one generative and one discriminative, that work simultaneously. The task of the generator is to generate data of a given distribution, while the discriminator’s purpose is to try to recognize which data are created by the generative model and which are the original ones. While a variety of GAN models have been developed, many of them are prone to issues with training such as instability where model parameters might destabilize and not converge, mode collapse where the generative model produces a limited number of different samples, diminishing gradients where the generator gradient vanishes and training does not occur, and high sensitivity to hyperparameters.
In this paper, we focus on the discriminative model to rectify the issues of instability and mode collapse in GAN training. In the GAN architecture, the discriminator model takes samples from the original dataset and the output from the generator as input and tries to classify whether a particular element in those samples is real or fake data [15]. The discriminator updates its parameters by maximizing a discriminator loss function via backpropagation through the discriminator network. In many of the proposed models [15, 16, 30, 28], the discriminator loss function consists of two equally weighted parts: the “real part” that relies purely on the original dataset and the “fake part” that depends on the generator network and its output; for simplicity we denote them by $\mathcal{L}_r$ and $\mathcal{L}_f$ and call them the real and fake losses, respectively. For example, in the original GAN paper [15], the discriminator loss function is written as
$$\mathcal{L}_D = \mathbb{E}_{x\sim p_d}\left[\log D(x)\right] + \mathbb{E}_{z\sim p_z}\left[\log\left(1 - D(G(z))\right)\right] \qquad (1)$$
with $\mathcal{L}_r = \mathbb{E}_{x\sim p_d}\left[\log D(x)\right]$ and $\mathcal{L}_f = \mathbb{E}_{z\sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$, where $D$ and $G$ are the discriminative and generative models, respectively, $p_d$ is the probability distribution of the real data, and $p_z$ is the probability distribution of the generator parameter $z$.
The goal of the GAN discriminator training is to increase both $\mathcal{L}_r$ and $\mathcal{L}_f$ so that the discriminator D(·) assigns high scores to real data and low scores to fake data. This is done in (1) by placing equal weights on $\mathcal{L}_r$ and $\mathcal{L}_f$ [15]. However, the training with $\mathcal{L}_D$ is not performed equally on $\mathcal{L}_r$ and $\mathcal{L}_f$. Indeed, a gradient ascent training step along $\nabla\mathcal{L}_D$ may decrease $\mathcal{L}_r$ (or $\mathcal{L}_f$), depending on the angle between $\nabla\mathcal{L}_D$ and $\nabla\mathcal{L}_r$ (or $\nabla\mathcal{L}_f$). For example, if we have a large obtuse angle between $\nabla\mathcal{L}_r$ and $\nabla\mathcal{L}_f$, which is the case in most training steps (see §5.1), training along the direction of $\nabla\mathcal{L}_D$ may potentially decrease either $\mathcal{L}_r$ or $\mathcal{L}_f$ by going in the opposite direction to $\nabla\mathcal{L}_r$ or $\nabla\mathcal{L}_f$ (see §3 and §5.2). We suggest that this reduction in the real loss may destabilize training and cause mode collapse. Specifically, if a generator is converging with its generated samples close to the data distribution (or a particular mode), a training step that increases the fake loss will reduce the discriminator scores on the fake data and, by the continuity of D(·), reduce the scores on the nearby real data as well. With the updated discriminator now assigning lower scores to the regions of data where the generator previously approximated well, the generator update is likely to move away from that region and toward regions with higher discriminator scores (possibly a different mode). Hence, we see instability or mode collapse. See §5.3 for experimental results.
We propose a new approach in training the discriminative model by modifying the discriminator loss function and introducing adaptive weights in the following way,
$$\mathcal{L}_D^{aw} = w_r \cdot \mathcal{L}_r + w_f \cdot \mathcal{L}_f \qquad (2)$$
We adaptively choose the weights $w_r$ and $w_f$ to calibrate the training in the real and fake losses. Using the information of $\nabla\mathcal{L}_r$ and $\nabla\mathcal{L}_f$, we can control the gradient direction, $\nabla\mathcal{L}_D^{aw}$, by either training in the direction that benefits both $\mathcal{L}_r$ and $\mathcal{L}_f$ or increasing one loss while not changing the other. This attempts to avoid a situation where training may benefit one loss but significantly harm the other. A more detailed mathematical approach is presented in §3.
Our proposed method can be applied to any GAN model with a discriminator loss function composed of two parts as in (1). In our experiments, we apply adaptive weights to the SN-GAN [34] and AutoGAN [14] models for unconditional image generation, and to the SN-GAN [34] and BigGAN [5] models for conditional image generation. We achieve significant improvements on them for the CIFAR-10, STL-10, and CIFAR-100 datasets in both Inception Score (IS) and Fréchet Inception Distance (FID); see §4. Our code is available at GitHub.
Notation:
We use $\langle\cdot,\cdot\rangle_2$ to denote the Euclidean inner product, $\|x\|_2$ the Euclidean 2-norm, and $\angle_2(x, y)$ the angle between vectors $x$ and $y$.
2. Related Work
GAN was first proposed in [15] for creating generative models via simultaneous optimization of a discriminative and a generative model. The original GAN may suffer from vanishing gradients during training, non-convergence of the model(s), and mode collapse; see [6, 31, 33, 38, 39] for discussions. Several papers [1, 16, 28, 30] have addressed the issue of vanishing gradients by introducing new loss functions. The LSGAN proposed in [30] adopted the least squares loss function for the discriminator, which relies on minimizing the Pearson $\chi^2$ divergence, in contrast to the Jensen–Shannon divergence used in GAN. The WGAN model [1, 16] introduced another way to address the problems of convergence and mode collapse by incorporating the Wasserstein-1 distance into the loss function. As a result, WGAN has a loss function associated with image quality, which improves learning stability and convergence. The hinge loss function introduced in [28, 43] achieved smaller error rates than cross-entropy, is stable against different regularization techniques, and has a low computational cost [12]. The models in [3, 26, 45] adopted a loss function called maximum mean discrepancy (MMD). A repulsive function to stabilize the MMD-GAN training was employed in [46], and the MMD loss function was weighted in [11] according to the contribution of data to the loss function. [37] presented a dual discriminator GAN that combines two discriminators in a weighted sum.
New loss functions are not the only way of improving the GAN framework. DCGAN [38], one of the first and most significant improvements to the GAN architecture, incorporated deep convolutional networks. The Progressive Growing GAN [21] was built on [1, 16] with the main idea of progressively adding new layers of higher resolution during training, which helps to create highly realistic images. [14, 13, 42] developed neural architecture search methods to find an optimal neural network architecture to train GAN for a particular task.
There are many works dedicated to the conditional GAN, for example BigGAN [5] which utilized a model with a large number of parameters and larger batch sizes showing a significant benefit of scaling.
There are many works devoted to improving or analyzing GAN training. [33] trained the generator by optimizing a loss function unrolled from several iterations of the discriminator training. SN-GAN [34] normalized the spectral norm of each weight to stabilize the training. Recent work [40] introduced stable rank normalization that simultaneously controls the Lipschitz constant and the stable rank of a layer. [27] developed an analysis to suggest that first-order approximations of the discriminator lead to instability and mode collapse. [36] proved local stability under the model in which both the generator and the discriminator are updated simultaneously via gradient descent. [9] analyzed the stability of GANs through stationarity of the generator. [32] pointed out that absolute continuity is necessary for GAN training to converge. Relativistic GAN [20] addressed the observation that, as generator training increases the probability that fake data is real, the probability of real data being real should decrease. [2] proposed a method of re-weighting training samples to correct for mass shift between the transferred distributions in the domain transfer setup. [7] viewed GAN as an energy-based model and proposed an MCMC sampling based method.
3. Adaptive Weighted Discriminator
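If we maximize $\mathcal{L}_D$ to convergence in training the discriminator $D$, we should meet the goal of increasing both $\mathcal{L}_r$ and $\mathcal{L}_f$. However, when we train with a gradient ascent step along $\nabla\mathcal{L}_D = \nabla\mathcal{L}_r + \nabla\mathcal{L}_f$, which may be dominated by either $\nabla\mathcal{L}_r$ or $\nabla\mathcal{L}_f$, the training may be done primarily on one of the losses, either $\mathcal{L}_r$ or $\mathcal{L}_f$. Consider a gradient ascent training iteration for $\mathcal{L}_D$,
$$\theta_1 = \theta_0 + \lambda \nabla\mathcal{L}_D(\theta_0) \qquad (3)$$
where $\lambda$ is a learning rate. Then using the Taylor Theorem, we can expand both $\mathcal{L}_r$ and $\mathcal{L}_f$ about $\theta_0$,
$$\mathcal{L}_r(\theta_1) = \mathcal{L}_r(\theta_0) + \lambda \nabla\mathcal{L}_D^T \nabla\mathcal{L}_r + O(\lambda^2) \qquad (4)$$
$$\phantom{\mathcal{L}_r(\theta_1)} = \mathcal{L}_r(\theta_0) + \lambda \|\nabla\mathcal{L}_D\|_2 \|\nabla\mathcal{L}_r\|_2 \cos\left(\angle_2(\nabla\mathcal{L}_D, \nabla\mathcal{L}_r)\right) + O(\lambda^2) \qquad (5)$$
and
$$\mathcal{L}_f(\theta_1) = \mathcal{L}_f(\theta_0) + \lambda \|\nabla\mathcal{L}_D\|_2 \|\nabla\mathcal{L}_f\|_2 \cos\left(\angle_2(\nabla\mathcal{L}_D, \nabla\mathcal{L}_f)\right) + O(\lambda^2) \qquad (6)$$
where we have omitted the evaluation point $\theta_0$ in all gradients (i.e. $\nabla\mathcal{L}_D = \nabla\mathcal{L}_D(\theta_0)$) to avoid cumbersome expressions. If one of $\angle_2(\nabla\mathcal{L}_D, \nabla\mathcal{L}_r)$ and $\angle_2(\nabla\mathcal{L}_D, \nabla\mathcal{L}_f)$ is obtuse, then to the first order approximation, the corresponding loss is decreased. This reduces the discriminator's ability to assign a correct score D(·) to the real (or fake) data. Thus, a gradient ascent step with loss (1) may turn out to decrease one of the losses if $\angle_2(\nabla\mathcal{L}_D, \nabla\mathcal{L}_r) > 90°$ or $\angle_2(\nabla\mathcal{L}_D, \nabla\mathcal{L}_f) > 90°$. This situation occurs quite often in GAN training; see §5.1 for some experimental results illustrating this.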
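The first-order argument above can be checked on a toy example. The following is a minimal sketch, assuming two hypothetical 2-D gradient vectors chosen only for illustration (they are not taken from any experiment); it evaluates the first-order terms $\lambda\,\nabla\mathcal{L}_D^T\nabla\mathcal{L}_r$ and $\lambda\,\nabla\mathcal{L}_D^T\nabla\mathcal{L}_f$ for a step along the equally weighted sum of the two gradients.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def angle_deg(u, v):
    """Angle between u and v in degrees."""
    cos = dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def first_order_changes(grad_r, grad_f, lr):
    """First-order change of the real and fake losses after one
    gradient ascent step of size lr along grad_r + grad_f."""
    grad_d = [a + b for a, b in zip(grad_r, grad_f)]
    return lr * dot(grad_d, grad_r), lr * dot(grad_d, grad_f)

# Hypothetical toy gradients: a small real gradient and a larger fake
# gradient separated by an obtuse angle.
grad_r = [1.0, 0.0]
grad_f = [-2.0, 1.0]
grad_d = [a + b for a, b in zip(grad_r, grad_f)]

delta_r, delta_f = first_order_changes(grad_r, grad_f, lr=0.1)
print(angle_deg(grad_r, grad_f))  # angle between the real and fake gradients
print(angle_deg(grad_d, grad_r))  # angle between the step and the real gradient
print(delta_r, delta_f)           # first-order changes of the real and fake losses
```

In this toy case the step direction forms an obtuse angle with the real gradient, so the real loss decreases to first order while the fake loss increases.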
This undesirable situation is expected to happen in GAN training when the generator has produced samples close to the data distribution or to some of its modes. If a training step in the direction $\nabla\mathcal{L}_D$ results in an increase in the fake loss, or equivalently a decrease in the discriminator scores D(G(z)) on the fake data, it will decrease the scores D(x) on the real data as well by the continuity of D(·). Equivalently, this reduces the real loss. With the updated discriminator assigning lower scores to the regions of the data where the generator previously approximated well, the generator update using the new discriminator will likely move in the direction where the discriminator scores are higher and hence leave the region it was converging to. We suggest that this is one of the causes of instability in GAN training. If the regions with high discriminator scores contain only a few modes of the data distribution, this leads to mode collapse; see the study in §5.3.
To remedy this situation, we propose to modify the training gradient to encourage high discriminator scores for real data. We propose a new family of discriminator loss functions, which we call adaptive weighted loss functions or aw-loss functions; see equation (2).
We first show that the proposed weighted discriminator (2) with fixed weights carries the same theoretical properties of the original GAN as stated in [15, 32] for binary-cross-entropy loss function, i.e. when the min-max problem is solved exactly, we recover the data distribution.
Theorem 1.
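Let $p_d(x)$ and $p_g(x)$ be the density functions for the data and model distributions, $\mathbb{P}_d$ and $\mathbb{P}_g$, respectively. Consider $\mathcal{L}^{aw}(D, p_g) = w_r \cdot \mathbb{E}_{x\sim p_d}\left[\log D(x)\right] + w_f \cdot \mathbb{E}_{x\sim p_g}\left[\log\left(1 - D(x)\right)\right]$ with fixed $w_r, w_f > 0$.
- Given a fixed $p_g(x)$, $\mathcal{L}^{aw}(D, p_g)$ is maximized by $D^*(x) = \dfrac{w_r\, p_d(x)}{w_r\, p_d(x) + w_f\, p_g(x)}$ for $x \in \mathrm{supp}(p_d) \cup \mathrm{supp}(p_g)$.
- $\min_{p_g} \max_D \mathcal{L}^{aw}(D, p_g) = w_r \log\dfrac{w_r}{w_r + w_f} + w_f \log\dfrac{w_f}{w_r + w_f}$, with the minimum attained by $p_g(x) = p_d(x)$.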
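As a sanity check of this kind of statement, the sketch below evaluates a weighted objective on a small discrete example. It assumes the objective has the form $w_r\,\mathbb{E}_{p_d}[\log D] + w_f\,\mathbb{E}_{p_g}[\log(1-D)]$ and that the maximizing discriminator is $w_r p_d / (w_r p_d + w_f p_g)$; the distributions, weights, and function names are hypothetical and chosen only for illustration.

```python
import math

def aw_value(d, p_d, p_g, w_r, w_f):
    """Weighted objective for discrete distributions p_d, p_g and
    per-point discriminator outputs d (each strictly between 0 and 1)."""
    return sum(w_r * pd * math.log(di) + w_f * pg * math.log(1.0 - di)
               for di, pd, pg in zip(d, p_d, p_g))

def optimal_d(p_d, p_g, w_r, w_f):
    """Pointwise maximizer of the weighted objective (assumed closed form)."""
    return [w_r * pd / (w_r * pd + w_f * pg) for pd, pg in zip(p_d, p_g)]

# Hypothetical weights and three-point distributions.
w_r, w_f = 2.0, 0.5
p_d = [0.2, 0.5, 0.3]
p_g = [0.6, 0.1, 0.3]

d_star = optimal_d(p_d, p_g, w_r, w_f)
best = aw_value(d_star, p_d, p_g, w_r, w_f)

# Any other discriminator should give a smaller value for the same p_g.
perturbed = [min(0.999, max(0.001, di + 0.05)) for di in d_star]
worse = aw_value(perturbed, p_d, p_g, w_r, w_f)

# Value of the inner maximum when the model matches the data (p_g = p_d).
matched = aw_value(optimal_d(p_d, p_d, w_r, w_f), p_d, p_d, w_r, w_f)
closed_form = (w_r * math.log(w_r / (w_r + w_f))
               + w_f * math.log(w_f / (w_r + w_f)))

print(best, worse)
print(matched, closed_form)
```

The perturbed discriminator yields a smaller objective than `d_star`, and the matched value agrees with the closed-form expression, which is also a lower bound for `best`.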
See Appendix A for a proof of Theorem 1. To choose the weights $w_r$ and $w_f$, we propose an adaptive scheme, where the weights are determined using gradient information of both $\mathcal{L}_r$ and $\mathcal{L}_f$. This structure allows us to adjust the direction of the gradient of the discriminator loss function to achieve the goal of training to increase both $\mathcal{L}_r$ and $\mathcal{L}_f$, or at least not to decrease either loss. We propose Algorithm 1 based on the following gradient relations with various weight choices.
Theorem 2.
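Consider $\mathcal{L}_D^{aw}$ in (2) and the gradient $\nabla\mathcal{L}_D^{aw} = w_r \nabla\mathcal{L}_r + w_f \nabla\mathcal{L}_f$.
- If $w_r = \dfrac{1}{\|\nabla\mathcal{L}_r\|_2}$ and $w_f = \dfrac{1}{\|\nabla\mathcal{L}_f\|_2}$, then $\nabla\mathcal{L}_D^{aw}$ is the angle bisector of $\nabla\mathcal{L}_r$ and $\nabla\mathcal{L}_f$, i.e. $\angle_2(\nabla\mathcal{L}_D^{aw}, \nabla\mathcal{L}_r) = \angle_2(\nabla\mathcal{L}_D^{aw}, \nabla\mathcal{L}_f) = \angle_2(\nabla\mathcal{L}_r, \nabla\mathcal{L}_f)/2$.
- If $w_r = \dfrac{1}{\|\nabla\mathcal{L}_r\|_2}$ and $w_f = -\dfrac{\langle\nabla\mathcal{L}_r, \nabla\mathcal{L}_f\rangle_2}{\|\nabla\mathcal{L}_f\|_2^2 \cdot \|\nabla\mathcal{L}_r\|_2}$, then $\langle\nabla\mathcal{L}_D^{aw}, \nabla\mathcal{L}_f\rangle_2 = 0$, $\langle\nabla\mathcal{L}_D^{aw}, \nabla\mathcal{L}_r\rangle_2 \ge 0$.
- If $w_r = -\dfrac{\langle\nabla\mathcal{L}_r, \nabla\mathcal{L}_f\rangle_2}{\|\nabla\mathcal{L}_r\|_2^2 \cdot \|\nabla\mathcal{L}_f\|_2}$ and $w_f = \dfrac{1}{\|\nabla\mathcal{L}_f\|_2}$, then $\langle\nabla\mathcal{L}_D^{aw}, \nabla\mathcal{L}_r\rangle_2 = 0$, $\langle\nabla\mathcal{L}_D^{aw}, \nabla\mathcal{L}_f\rangle_2 \ge 0$.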
See Appendix A for a proof of Theorem 2. The first case in the above theorem allows us to choose weights for (2) such that we can train $\mathcal{L}_r$ and $\mathcal{L}_f$ by going in the direction of the angle bisector. However, sometimes the direction of the angle bisector might not be optimal. For example, if the angle between $\nabla\mathcal{L}_r$ and $\nabla\mathcal{L}_f$ is close to 180°, then the bisector direction will effectively not train either loss. During training, $\mathcal{L}_f$ is often easier to train than $\mathcal{L}_r$, meaning that the fake gradient has a larger magnitude. In this situation, we might want to train just on the real gradient direction by simply choosing $w_f = 0$, but if the angle between $\nabla\mathcal{L}_r$ and $\nabla\mathcal{L}_f$ is obtuse, we will increase $\mathcal{L}_r$ but significantly decrease $\mathcal{L}_f$, which is undesirable. The second case in Theorem 2 suggests a direction that forms an acute angle with $\nabla\mathcal{L}_r$ and is orthogonal to $\nabla\mathcal{L}_f$ (see Figure 1); such a direction will increase $\mathcal{L}_r$ and, to the first order approximation, will leave $\mathcal{L}_f$ unchanged. When $\mathcal{L}_r$ is high, the third case in Theorem 2 would allow us to increase $\mathcal{L}_f$ while minimizing changes to $\mathcal{L}_r$.
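The three weight choices discussed above can be verified numerically. The sketch below assumes the weights take the form $1/\|\nabla\mathcal{L}_r\|_2$, $1/\|\nabla\mathcal{L}_f\|_2$, and the projection coefficients $-\langle\nabla\mathcal{L}_r,\nabla\mathcal{L}_f\rangle_2 / (\|\nabla\mathcal{L}_f\|_2^2\|\nabla\mathcal{L}_r\|_2)$ and $-\langle\nabla\mathcal{L}_r,\nabla\mathcal{L}_f\rangle_2 / (\|\nabla\mathcal{L}_r\|_2^2\|\nabla\mathcal{L}_f\|_2)$, which follow from the bisector and orthogonality conditions; the function names and test vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def combine(w_r, grad_r, w_f, grad_f):
    """Weighted gradient w_r * grad_r + w_f * grad_f."""
    return [w_r * a + w_f * b for a, b in zip(grad_r, grad_f)]

def bisector_weights(grad_r, grad_f):
    """Case 1: normalize both gradients so their sum bisects the angle."""
    return 1.0 / norm(grad_r), 1.0 / norm(grad_f)

def favor_real_weights(grad_r, grad_f):
    """Case 2: remove from the unit real gradient its component along grad_f."""
    w_r = 1.0 / norm(grad_r)
    w_f = -dot(grad_r, grad_f) / (norm(grad_f) ** 2 * norm(grad_r))
    return w_r, w_f

def favor_fake_weights(grad_r, grad_f):
    """Case 3: remove from the unit fake gradient its component along grad_r."""
    w_r = -dot(grad_r, grad_f) / (norm(grad_r) ** 2 * norm(grad_f))
    w_f = 1.0 / norm(grad_f)
    return w_r, w_f

# Hypothetical gradients forming an obtuse angle (negative inner product).
grad_r = [1.0, 0.0, 2.0]
grad_f = [-3.0, 1.0, -1.0]

g1 = combine(*[x for pair in zip(bisector_weights(grad_r, grad_f), (grad_r, grad_f)) for x in pair])
w_r2, w_f2 = favor_real_weights(grad_r, grad_f)
g2 = combine(w_r2, grad_r, w_f2, grad_f)
w_r3, w_f3 = favor_fake_weights(grad_r, grad_f)
g3 = combine(w_r3, grad_r, w_f3, grad_f)

print(cosine(g1, grad_r), cosine(g1, grad_f))  # equal cosines: bisector
print(dot(g2, grad_f), dot(g2, grad_r))        # orthogonal to fake, non-negative on real
print(dot(g3, grad_r), dot(g3, grad_f))        # orthogonal to real, non-negative on fake
```

The printed inner products make the geometric meaning of each case explicit: the second direction changes the fake loss only at second order, and the third does the same for the real loss.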
Figure 1:
Depiction of the second case of Theorem 2.
Inspired by Theorem 2 and the observations that we have made, we can calibrate discriminator training in a way that produces and maintains a high real loss, reducing fluctuations in the real loss (or real discriminator scores) to improve stability. Algorithm 1 describes the procedure for updating the weights of the aw-loss function in (2) during training using the information of $\nabla\mathcal{L}_r$ and $\nabla\mathcal{L}_f$.

Algorithm 1 is designed to first avoid, up to the first order approximation, decreasing $\mathcal{L}_r$ or $\mathcal{L}_f$ during a gradient ascent iteration. Furthermore, it chooses to favor training the real loss unless the mean real score is greater than the mean fake score (i.e. $s_f \le s_r$) and the mean real score is at least 0.5 (i.e. $\alpha_1 = 0.5 \le s_r$). Here the mean discriminator scores $s_r$ and $s_f$ represent the mean probability that the discriminator assigns to the $x_i$’s and $y_j$’s, respectively, as real data. When $s_r$ is highly satisfactory with $s_r \ge \alpha_2 = 0.75$ (i.e. the midpoint between the minimum probability 0.5 and the maximum probability 1 for correct classifications of real data), we favor training the fake loss; otherwise, we train both equally. By maintaining these training criteria, we reduce the fluctuations in real and fake discriminator scores and hence avoid instability; see the study in §5.3. Note that we impose a small gap $\delta = 0.05$ in $s_f - \delta > s_r$ to account for situations when $s_r$ is nearly identical to $s_f$.
The way we favor training the real or fake loss depends on whether the angle between $\nabla\mathcal{L}_r$ and $\nabla\mathcal{L}_f$ is obtuse or not. In Algorithm 1, the first and the third cases are concerned with the more frequent situation (see §5.1 and Figure 4) where the angle between $\nabla\mathcal{L}_r$ and $\nabla\mathcal{L}_f$ is obtuse. These cases are the ones that are developed in Theorem 2. In the first case, we favor training the real loss by going in the direction orthogonal to $\nabla\mathcal{L}_f$; see Figure 1 for an illustration. In the third case, we favor the fake loss by going in the direction orthogonal to $\nabla\mathcal{L}_r$. In a similar manner, the second and the fourth cases are concerned with the situation when the angle between $\nabla\mathcal{L}_r$ and $\nabla\mathcal{L}_f$ is acute. We use the same criteria to decide if training should favor the real or fake directions, but in this case we favor training the real or fake loss by using the direction of the corresponding gradient. Lastly, in the fifth case, it is desirable to increase both $s_r$ and $s_f$ without either taking priority, so we choose to train in the direction of the angle bisector between $\nabla\mathcal{L}_r$ and $\nabla\mathcal{L}_f$.
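The five cases described above can be summarized in a short weight-selection routine. This is only a sketch reconstructed from the prose, not the authors' reference implementation: the function name `aw_weights`, the exact branch conditions, and the use of flat Python lists for gradients are assumptions, and the gradients are assumed to be nonzero. The defaults follow the values quoted in the text ($\alpha_1 = 0.5$, $\alpha_2 = 0.75$, $\delta = 0.05$, $\varepsilon = 0.05$).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def aw_weights(grad_r, grad_f, s_r, s_f,
               alpha1=0.5, alpha2=0.75, delta=0.05, eps=0.05):
    """Return (w_r, w_f) for one discriminator step.

    grad_r, grad_f: flattened real and fake gradients (assumed nonzero).
    s_r, s_f: mean probabilities assigned to real and fake samples.
    """
    n_r, n_f = norm(grad_r), norm(grad_f)
    inner = dot(grad_r, grad_f)
    obtuse = inner < 0.0  # angle between the two gradients exceeds 90 degrees

    if s_r < s_f - delta or s_r < alpha1:
        # Cases 1 and 2: favor the real loss.
        w_r = 1.0 / n_r
        w_f = -inner / (n_f ** 2 * n_r) if obtuse else 0.0
    elif s_r >= alpha2:
        # Cases 3 and 4: real scores are high, favor the fake loss.
        w_r = -inner / (n_r ** 2 * n_f) if obtuse else 0.0
        w_f = 1.0 / n_f
    else:
        # Case 5: train both losses along the angle bisector.
        w_r, w_f = 1.0 / n_r, 1.0 / n_f

    # A small constant keeps both parts contributing to the update.
    return w_r + eps, w_f + eps

# Hypothetical inputs: obtuse gradients and low real scores (case 1).
print(aw_weights([1.0, 0.0], [-2.0, 1.0], s_r=0.3, s_f=0.6))
```

In a training loop, the returned pair would multiply the real and fake losses before the backward pass; how the per-part gradients are obtained depends on the framework and is left out here.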
Figure 4:
Angles between gradients at each iteration. Top: original loss; Bottom: aw-loss.
The two thresholds $\alpha_1$ and $\alpha_2$ in Algorithm 1 can be treated as hyperparameters. Our ablation studies show that the defaults $\alpha_1 = 0.5$ and $\alpha_2 = 0.75$ discussed earlier are indeed good choices; see Appendix B.
All weights stated in Algorithm 1 normalize both real and fake gradients to avoid differently sized gradients, which prevents exploding gradients and speeds up training, i.e. achieves better IS and FID with fewer epochs; see Figure 3. With this normalization, we apply a linear learning rate decay to ensure convergence. However, the aw-method also performs well without normalization and achieves comparable results. We list the detailed results in Appendix B.
Figure 3:
AutoGAN vs aw-AutoGAN IS and FID plots for the first 40 epochs.
A small constant $\varepsilon$ is added to all the weights in Algorithm 1 to avoid numerical discrepancies in cases that would prevent the discriminator model from training/updating. As an example, there are cases when our algorithm would set $w_r = 0$ but at the same time $\nabla\mathcal{L}_f$ would be almost zero, which would result in $\nabla\mathcal{L}_D^{aw}$ being practically zero. We have set $\varepsilon = 0.05$ in all of our experiments.
Algorithm 1 has a small computational overhead. At each iteration we compute inner products and norms that are used for computing $w_r$ and $w_f$, and then use these weights to update $\mathcal{L}_D^{aw}$. If we have $k$ trainable parameters, then it takes on the order of $6k$ operations to compute the inner products between real–fake, real–real, and fake–fake gradients, and on the order of $3k$ operations to form $\nabla\mathcal{L}_D^{aw}$, for a total on the order of $9k$ operations in Algorithm 1. This is a small fraction of the total computational cost of one training iteration.
4. Experiments & Results
We implement our Adaptive Weighted Discriminator for the SN-GAN [34] and AutoGAN [14] models on unconditional image generation tasks, and for the SN-GAN [34] and BigGAN [5] models on conditional image generation tasks (commonly referred to as unconditional and conditional GANs). AutoGAN is an architecture based on neural architecture search. In our experiments and tests, we do not invoke a neural search with our aw-loss; we simply apply the aw-method to the model and architecture exactly as provided by [14].
We test our method on three datasets: CIFAR-10 [25], STL-10 [10], and CIFAR-100 [25]. The datasets and implementation details are provided in Appendix B.
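All of the above-mentioned models train the discriminator by minimizing the negative hinge loss [28, 43]. Our aw-method uses the corresponding negative aw-hinge loss:
$$-\mathcal{L}_D^{aw} = -w_r \cdot \mathbb{E}_{x\sim p_d}\left[\min\left(0, -1 + D(x)\right)\right] - w_f \cdot \mathbb{E}_{z\sim p_z}\left[\min\left(0, -1 - D(G(z))\right)\right] \qquad (7)$$
with $w_r$ and $w_f$ updated every iteration using Algorithm 1.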
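To make the weighted hinge objective concrete, the sketch below evaluates it on mini-batches of raw discriminator scores. It assumes the standard hinge formulation of [28, 43], with real term $\min(0, -1 + D(x))$ and fake term $\min(0, -1 - D(G(z)))$; the function names and the example scores are hypothetical.

```python
def hinge_parts(real_scores, fake_scores):
    """Mean real and fake hinge terms in maximization form (both <= 0)."""
    l_r = sum(min(0.0, -1.0 + s) for s in real_scores) / len(real_scores)
    l_f = sum(min(0.0, -1.0 - s) for s in fake_scores) / len(fake_scores)
    return l_r, l_f

def negative_aw_hinge(real_scores, fake_scores, w_r, w_f):
    """Negative weighted hinge loss, to be minimized by the discriminator."""
    l_r, l_f = hinge_parts(real_scores, fake_scores)
    return -(w_r * l_r + w_f * l_f)

# Hypothetical raw discriminator outputs for a tiny mini-batch.
real_scores = [2.0, 0.5]    # the second real sample is inside the margin
fake_scores = [-2.0, 0.0]   # the second fake sample is inside the margin

l_r, l_f = hinge_parts(real_scores, fake_scores)
loss = negative_aw_hinge(real_scores, fake_scores, w_r=2.0, w_f=1.0)
print(l_r, l_f, loss)
```

Setting `w_r = w_f = 1` recovers the equally weighted hinge loss, so the weighted form is a drop-in replacement once the weights are available.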
To evaluate the performance of the models, we employ the widely used Inception Score [39] (IS) and Fréchet Inception Distance [18] (FID) metrics; see [29] for more details. We compute these metrics every 5 epochs and report the best IS and FID achieved by each model within the 320 (SN-GAN), 300 (AutoGAN), and 1,000 (BigGAN) training epochs, as in the original works.
We first present the results for the three datasets CIFAR-10, STL-10, and CIFAR-100 in Tables 1, 2, and 3, respectively, for the unconditional GAN. In addition to the baseline results, we include top published results for each dataset for comparison.
Table 1:
CIFAR-10 (unconditional GAN): The aw-method substantially improves SN-GAN/AutoGAN.
Table 2:
STL-10 (unconditional GAN)
Table 3:
CIFAR-100 (unconditional GAN)
| Method | IS ↑ | FID ↓ |
|---|---|---|
| SS-GAN [8] | - | 21.02† |
| MSGAN [44] | - | 19.74 |
| SRN-GAN [40] | 8.85 | 19.55 |
| SN-GAN [34] | 8.18±.12* | 22.40* |
| aw-SN-GAN (Ours) | 8.31±.02 | 19.08 |
| AutoGAN [14] | 8.54±.10* | 19.98* |
| aw-AutoGAN (Ours) | 8.90±.06 | 19.00 |
- \* results from our test
- † quoted from [44].
For CIFAR-10 in Table 1, our methods significantly improve SN-GAN and AutoGAN baseline results. Indeed, our aw-AutoGAN achieves IS substantially above all comparisons other than StyleGAN2. Note that aw-AutoGAN uses 5.4M parameters vs. 26.2M for StyleGAN2.
For STL-10 in Table 2, our methods also significantly improve SN-GAN and AutoGAN baseline results. Our aw-SN-GAN achieved the highest IS and aw-AutoGAN achieved the lowest FID score among comparisons.
For CIFAR-100 in Table 3, our methods improve IS significantly for AutoGAN but modestly for SN-GAN. Our aw-AutoGAN achieved the highest IS and the lowest FID score among comparisons.
We have also included some visual examples that were randomly generated by our aw-AutoGAN model in Figure 2. We also consider the convergence of our method against training epochs by plotting in Figure 3 the IS and FID scores of 50,000 generated samples at every 5 epochs for AutoGAN vs aw-AutoGAN. For all the datasets, our model consistently achieves faster convergence than the baseline.
Figure 2:
aw-AutoGAN: CIFAR-10 (left), STL-10 (center), CIFAR-100 (right); samples randomly generated.
We next consider our aw-method for the class-conditional image generation task, using two base models, SN-GAN [34] and BigGAN [5], on the CIFAR-10 and CIFAR-100 datasets. Results are listed in Table 4.
Table 4:
CIFAR-10 and CIFAR-100 (conditional GAN): The aw-method substantially improves SN-GAN and BigGAN.
Table 4 shows that our method works well for the conditional GAN too. The aw-method significantly improves the SN-GAN and BigGAN baselines.
5. Exploratory & Ablation Studies
In this section, we present three studies to illustrate potential problems of equally weighted GAN loss and advantages of our adaptive weighted loss. The hinge loss is implemented in the first and second studies, and a binary cross-entropy loss function is used for the third.
5.1. Angles between Gradients
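In the first study, we examine the angles between $\nabla\mathcal{L}_r$, $\nabla\mathcal{L}_f$, and $\nabla\mathcal{L}_D$ (or $\nabla\mathcal{L}_D^{aw}$). We use the CIFAR-10 dataset with the DCGAN architecture [38] and we look at 50 iterations in the first epoch of training. In Figure 4, we plot the following three angles: $\angle_2(\nabla\mathcal{L}_r, \nabla\mathcal{L}_f)$, $\angle_2(\nabla\mathcal{L}_D, \nabla\mathcal{L}_r)$, and $\angle_2(\nabla\mathcal{L}_D, \nabla\mathcal{L}_f)$ against iterations for the original loss (1) on the top and for the aw-loss on the bottom. For the original loss (top), $\angle_2(\nabla\mathcal{L}_r, \nabla\mathcal{L}_f)$ (blue) stays greater than 90°, closer to 180°. $\angle_2(\nabla\mathcal{L}_D, \nabla\mathcal{L}_r)$ (green) often goes above 90°, and so the training is often done to decrease the real loss. $\angle_2(\nabla\mathcal{L}_D, \nabla\mathcal{L}_f)$ also goes above 90°, though to a lesser extent. With the aw-loss (bottom), $\angle_2(\nabla\mathcal{L}_D^{aw}, \nabla\mathcal{L}_r)$ and $\angle_2(\nabla\mathcal{L}_D^{aw}, \nabla\mathcal{L}_f)$ stay below the 90° line and indicate that we train in the direction of $\nabla\mathcal{L}_r$ and orthogonal to $\nabla\mathcal{L}_f$ in most steps.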
5.2. Real Discriminator Scores and Real-Fake Gap after Training
Our second experiment is an ablation study to show that aw-loss increases the discriminator scores for real data and increases the gap between real and fake discriminator scores. We again apply the DCGAN model with the original loss to CIFAR-10 and at every iteration we examine the mean discriminator score for the mini-batch of the real set and the mean discriminator score for the mini-batch of the fake dataset generated by the generator. We use the logit output of the discriminator network as the score. We plot these two mean scores against each iteration before training in the first row of Figure 5 and after training (with the original loss $\mathcal{L}_D$) in the second row. At each of the above training iterations, we replace $\mathcal{L}_D$ by the aw-loss $\mathcal{L}_D^{aw}$ in (2) and train for one iteration with the same training mini-batch. We plot the mean discriminator scores for the mini-batches of the real and fake dataset after this training in the third row of Figure 5. We further present the gaps between the two scores before training and after training using the original loss and using aw-loss in the fourth row of Figure 5.
Figure 5:
Mean discriminator scores for real data D(x) and fake data D(G(z)) (Row 1: before training; Row 2: after original training with $\mathcal{L}_D$; Row 3: after training with aw-loss) and their gap (Row 4: GBT - gap before training; GAOT - gap after original training; GAAWT - gap after aw-loss training).
Figure 5 shows that training with aw-loss leads to higher real discriminator scores (0.921 epoch average) than training with the original loss (0.248 epoch average). The average gap between real and fake scores is also larger with the aw-loss at 1.413 vs. 1.262 of the original loss. Therefore, with the same model and the same training mini-batch, the aw-loss produces higher discriminator scores for the real dataset and larger gaps between real and fake scores. These are two important properties of a discriminator for the generator training.
5.3. Instability and Real Discriminator Scores
Our third study examines the benefits of high discriminator scores for a real dataset with respect to instability and mode collapse of GAN training. We use a synthetic dataset with a mixture of Gaussian distributions used to test unrolled GAN in [33]. The dataset consists of eight 2D Gaussian distributions centered at eight equally spaced points on the unit circle. We train with a plain GAN as in [33] and we plot samples of (fake) data generated by the generator every 5,000 iterations in the first row of Figure 6. We see the generated data converging to two or three points but then moving off, demonstrating instability and mode collapse. To understand this phenomenon, at each of the iterations that we study in Figure 6, we generate 100 (real) data points from each of the eight classes and compute their mean discriminator scores (as the logit output of the discriminator). We plot the mean scores against the classes in the second row of Figure 6. We observe that the discriminator scores for the real data do not increase much during training, staying around 0, which corresponds to 0.5 probability after the logistic sigmoid function. The scores are also uneven among different classes. We believe these cause the instability in the generator training.
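The synthetic setup described above is easy to reproduce in outline. The sketch below samples eight 2D Gaussians centered on the unit circle and averages a score per class; the standard deviation `std=0.02`, the seed, the function names, and the `toy_discriminator` stand-in are all assumptions made for illustration and are not specified in the text.

```python
import math
import random

def sample_ring(n_per_mode, n_modes=8, std=0.02, seed=0):
    """Sample n_per_mode points from each of n_modes 2D Gaussians whose
    centers are equally spaced on the unit circle."""
    rng = random.Random(seed)
    points, labels = [], []
    for k in range(n_modes):
        theta = 2.0 * math.pi * k / n_modes
        cx, cy = math.cos(theta), math.sin(theta)
        for _ in range(n_per_mode):
            points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
            labels.append(k)
    return points, labels

def mean_score_per_mode(scores, labels, n_modes=8):
    """Average the discriminator scores separately for each class."""
    totals = [0.0] * n_modes
    counts = [0] * n_modes
    for s, k in zip(scores, labels):
        totals[k] += s
        counts[k] += 1
    return [t / c for t, c in zip(totals, counts)]

def toy_discriminator(point):
    """Hypothetical stand-in for a trained discriminator's logit output:
    highest on the unit circle, lower away from it."""
    x, y = point
    return 1.0 - abs(math.hypot(x, y) - 1.0)

# 100 real points per class, as in the study, scored and averaged per class.
points, labels = sample_ring(n_per_mode=100)
scores = [toy_discriminator(p) for p in points]
per_mode = mean_score_per_mode(scores, labels)
print(per_mode)
```

Replacing `toy_discriminator` with the logit output of an actual discriminator gives the per-class mean scores that the study tracks over training.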
Figure 6:
Mixture of eight 2D Gaussian distributions centered at 8 points (right-most column). Row 1: GAN sample points produced by generators; Row 2: GAN mean discriminator scores for each of 8 classes; Row 3: aw-GAN sample points produced by generators; Row 4: aw-GAN mean discriminator scores for each of eight classes;
We compare the GAN results with aw-GAN, which applies our adaptive weighted discriminator to the plain GAN, and we present the corresponding plots of generated (fake) data points in the third row of Figure 6 and the corresponding discriminator scores on the eight classes in the bottom row. In this case, the generator gradually converges to all eight classes and the discriminator scores stay high for all eight classes. Even when the generator starts to converge to only a few classes (step 5,000), the discriminator scores remain high for all classes. The generator then continues to converge to those classes while convergence to the other classes also occurs. We believe the high real discriminator scores maintain stability and prevent mode collapse in this case.
Conclusions
This paper pinpoints potential negative effects of the traditional GAN training on the real loss (and fake loss) and points out that this is a potential cause of instability and mode collapse. To remedy these issues, we have proposed the Adaptive Weighted Discriminator method to increase and maintain high real loss. Our experiments demonstrate the benefits and the competitiveness of this method.
Acknowledgements
We would like to thank Devin Willmott and Xin Xing for their initial efforts and helpful advice. We thank Rebecca Calvert for reading the manuscript and providing us with many valuable comments/suggestions. We also thank the University of Kentucky Center for Computational Sciences and Information Technology Services Research Computing for their support and use of the Lipscomb Compute Cluster.
Research supported in part by NSF OIA 2040665, NIH UH3 NS100606-05, and R01 HD101508-01 grants.
Research supported in part by NSF under grants DMS-1821144 and DMS-1620082.
References
- [1].Arjovsky Martin, Chintala Soumith, and Bottou Léon. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. 2 [Google Scholar]
- [2].Binkowski Mikolaj, Hjelm R. Devon, and Courville Aaron C.. Batch weight for domain adaptation with mass shift. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1844–1853, 2019. 2 [Google Scholar]
- [3].Bińkowski Mikołaj, Sutherland Dougal J, Arbel Michael, and Gretton Arthur. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018. 2 [Google Scholar]
- [4].Brock Andy. BigGAN-PyTorch. https://github.com/ajbrock/BigGAN-PyTorch. 6, 15, 16 [Google Scholar]
- [5].Brock Andrew, Donahue Jeff, and Simonyan Karen. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. 2, 5, 6, 16 [Google Scholar]
- [6].Che Tong, Li Yanran, Jacob Athul Paul, Bengio Yoshua, and Li Wenjie. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016. 2 [Google Scholar]
- [7].Che Tong, Zhang Ruixiang, Sohl-Dickstein Jascha, Larochelle Hugo, Paull Liam, Cao Yuan, and Bengio Yoshua. Your gan is secretly an energy-based model and you should use discriminator driven latent sampling, 2020. 3 [Google Scholar]
- [8].Chen T, Zhai X, Ritter M, Lucic M, and Houlsby N. Self-supervised gans via auxiliary rotation loss. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12146–12155, 2019. 6 [Google Scholar]
- [9].Chu Casey, Minami Kentaro, and Fukumizu Kenji. Smoothness and stability in gans. In International Conference on Learning Representations, 2020. 2 [Google Scholar]
- [10].Coates Adam, Ng Andrew, and Lee Honglak. An analysis of single-layer networks in unsupervised feature learning. In Gordon Geoffrey, Dunson David, and Dudík Miroslav, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 215–223, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR. 5, 14 [Google Scholar]
- [11].Diesendruck Maurice, Elenberg Ethan R, Sen Rajat, Cole Guy W, Shakkottai Sanjay, and Williamson Sinead A. Importance weighted generative networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 249–265. Springer, 2019. 2 [Google Scholar]
- [12].Dong Hao-Wen and Yang Yi-Hsuan. Towards a deeper understanding of adversarial losses under a discriminative adversarial network setting, 2019. 2 [Google Scholar]
- [13].Doveh Sivan and Giryes Raja. Degas: Differentiable efficient generator search, 2019. 2 [Google Scholar]
- [14].Gong Xinyu, Chang Shiyu, Jiang Yifan, and Wang Zhangyang. Autogan: Neural architecture search for generative adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), Oct 2019. 2, 5, 6, 16 [Google Scholar]
- [15].Goodfellow Ian, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron, and Bengio Yoshua. Generative adversarial nets. In Ghahramani Z, Welling M, Cortes C, Lawrence ND, and Weinberger KQ, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014. 1, 2, 3 [Google Scholar]
- [16].Gulrajani Ishaan, Ahmed Faruk, Arjovsky Martin, Dumoulin Vincent, and Courville Aaron. Improved training of wasserstein gans. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 5769–5779, Red Hook, NY, USA, 2017. Curran Associates Inc. 1, 2 [Google Scholar]
- [17]. He Hao, Wang Hao, Lee Guang-He, and Tian Yonglong. ProbGAN: Towards probabilistic GAN with theoretical guarantees. In ICLR, 2019.
- [18]. Heusel Martin, Ramsauer Hubert, Unterthiner Thomas, Nessler Bernhard, and Hochreiter Sepp. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, and Garnett R, editors, Advances in Neural Information Processing Systems 30, pages 6626–6637. Curran Associates, Inc., 2017.
- [19]. Hoang Quan, Nguyen Tu Dinh, Le Trung, and Phung Dinh. MGAN: Training generative adversarial nets with multiple generators. In International Conference on Learning Representations, 2018.
- [20]. Jolicoeur-Martineau Alexia. The relativistic discriminator: a key element missing from standard GAN. In International Conference on Learning Representations, 2019.
- [21]. Karras Tero, Aila Timo, Laine Samuli, and Lehtinen Jaakko. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
- [22]. Karras Tero, Laine Samuli, Aittala Miika, Hellsten Janne, Lehtinen Jaakko, and Aila Timo. Analyzing and improving the image quality of StyleGAN. In Proc. CVPR, 2020.
- [23]. Kavalerov Ilya, Czaja Wojciech, and Chellappa Rama. A multi-class hinge loss for conditional GANs. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1290–1299, January 2021.
- [24]. Kingma Diederik and Ba Jimmy. Adam: A method for stochastic optimization. International Conference on Learning Representations, December 2014.
- [25]. Krizhevsky Alex. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- [26]. Li Chun-Liang, Chang Wei-Cheng, Cheng Yu, Yang Yiming, and Poczos Barnabas. MMD GAN: Towards deeper understanding of moment matching network. In Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, and Garnett R, editors, Advances in Neural Information Processing Systems 30, pages 2203–2213. Curran Associates, Inc., 2017.
- [27]. Li Jerry, Madry Aleksander, Peebles John, and Schmidt Ludwig. On the limitations of first-order approximation in GAN dynamics. In Dy Jennifer and Krause Andreas, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3005–3013, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
- [28]. Lim Jae Hyun and Ye Jong Chul. Geometric GAN, 2017.
- [29]. Lucic Mario, Kurach Karol, Michalski Marcin, Bousquet Olivier, and Gelly Sylvain. Are GANs created equal? A large-scale study. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 698–707, Red Hook, NY, USA, 2018. Curran Associates Inc.
- [30]. Mao X, Li Q, Xie H, Lau RYK, Wang Z, and Smolley SP. Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2813–2821, 2017.
- [31]. Mescheder Lars, Nowozin Sebastian, and Geiger Andreas. The numerics of GANs. In Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, and Garnett R, editors, Advances in Neural Information Processing Systems 30, pages 1825–1835. Curran Associates, Inc., 2017.
- [32]. Mescheder Lars, Nowozin Sebastian, and Geiger Andreas. Which training methods for GANs do actually converge? In International Conference on Machine Learning (ICML), 2018.
- [33]. Metz Luke, Poole Ben, Pfau David, and Sohl-Dickstein Jascha. Unrolled generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017.
- [34]. Miyato Takeru, Kataoka Toshiki, Koyama Masanori, and Yoshida Yuichi. Spectral normalization for generative adversarial networks. In ICLR, 2018.
- [35]. Miyato Takeru and Koyama Masanori. cGANs with projection discriminator. In International Conference on Learning Representations, 2018.
- [36]. Nagarajan Vaishnavh and Kolter J Zico. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5585–5595, 2017.
- [37]. Nguyen Tu Dinh, Le Trung, Vu Hung, and Phung Dinh Q. Dual discriminator generative adversarial nets. CoRR, abs/1709.03831, 2017.
- [38]. Radford Alec, Metz Luke, and Chintala Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks, 2015.
- [39]. Salimans Tim, Goodfellow Ian, Zaremba Wojciech, Cheung Vicki, Radford Alec, and Chen Xi. Improved techniques for training GANs. In Lee DD, Sugiyama M, Luxburg UV, Guyon I, and Garnett R, editors, Advances in Neural Information Processing Systems 29, pages 2234–2242. Curran Associates, Inc., 2016.
- [40]. Sanyal Amartya, Torr Philip HS, and Dokania Puneet K. Stable rank normalization for improved generalization in neural networks and GANs. arXiv preprint arXiv:1906.04659, 2019.
- [41]. Shmelkov Konstantin, Schmid Cordelia, and Alahari Karteek. How good is my GAN?, 2018.
- [42]. Tian Yuan, Wang Qin, Huang Zhiwu, Li Wen, Dai Dengxin, Yang Minghao, Wang Jun, and Fink Olga. Off-policy reinforcement learning for efficient and effective GAN architecture search, 2020.
- [43]. Tran Dustin, Ranganath Rajesh, and Blei David M. Hierarchical implicit models and likelihood-free variational inference, 2017.
- [44]. Tran Ngoc-Trung, Tran Viet-Hung, Nguyen Bao-Ngoc, Yang Linxiao, and Cheung Ngai-Man (Man). Self-supervised GAN: Analysis and improvement with multi-class minimax game. In Advances in Neural Information Processing Systems 32, pages 13253–13264. Curran Associates, Inc., 2019.
- [45]. Unterthiner Thomas, Nessler Bernhard, Klambauer Günter, Heusel Martin, Ramsauer Hubert, and Hochreiter Sepp. Coulomb GANs: Provably optimal Nash equilibria via potential fields. CoRR, abs/1708.08819, 2017.
- [46]. Wang Wei, Sun Yuan, and Halgamuge Saman. Improving MMD-GAN training with repulsive loss function. In ICLR, 2019.
- [47]. Zhang Han, Zhang Zizhao, Odena Augustus, and Lee Honglak. Consistency regularization for generative adversarial networks. In International Conference on Learning Representations, 2020.