



A Domain Agnostic Measure for Monitoring and Evaluating GANs

Paulina Grnarova (ETH Zurich), Kfir Y. Levy (Technion - Israel Institute of Technology), Aurelien Lucchi (ETH Zurich), Nathanaël Perraudin (Swiss Data Science Center), Ian Goodfellow, Thomas Hofmann (ETH Zurich), Andreas Krause (ETH Zurich)

Abstract

Generative Adversarial Networks (GANs) have shown remarkable results in modeling complex distributions, but their evaluation remains an unsettled issue. Evaluations are essential for: (i) relative assessment of different models and (ii) monitoring the progress of a single model throughout training. The latter cannot be determined by simply inspecting the generator and discriminator loss curves, as they behave non-intuitively. We leverage the notion of duality gap from game theory to propose a measure that addresses both (i) and (ii) at a low computational cost. Extensive experiments show the effectiveness of this measure to rank different GAN models and capture the typical GAN failure scenarios, including mode collapse and non-convergent behaviours. This evaluation metric also provides meaningful monitoring of the progression of the loss during training. It correlates highly with FID on natural image datasets, and with domain-specific scores for text, sound and cosmology data where FID is not directly suitable. In particular, our proposed metric requires no labels or a pretrained classifier, making it domain agnostic.

1 Introduction

In recent years, a large body of research has focused on practical and theoretical aspects of Generative Adversarial Networks (GANs) [9]. This has led to the development of several GAN variants [24, 2] as well as some evaluation metrics such as FID or the Inception score, which are both data-dependent and dedicated to images. A domain-independent quantitative metric is, however, still a key missing ingredient that hinders further developments.
One of the main reasons behind the lack of such a metric originates from the nature of GANs, which implement an adversarial game between two players, namely a generator and a discriminator. Let us denote the data distribution by p_data(x), the model distribution by p_u(x) and the prior over latent variables by p_z(z). A probabilistic discriminator is denoted by D_v : x \mapsto [0, 1] and a generator by G_u : z \mapsto x. The GAN objective is:

\min_u \max_v M(u, v) = \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{data}}[\log D_v(x)] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}[\log(1 - D_v(G_u(z)))]   (1)
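For concreteness, here is a minimal sketch of a Monte-Carlo estimate of M(u, v) from mini-batches. The PyTorch MLP architectures, dimensions, and batch sizes are illustrative assumptions, not the models used in the paper's experiments.

```python
import torch
import torch.nn as nn

# Illustrative toy networks (architectures and dimensions are assumptions).
latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

def gan_objective(G, D, x_real, z):
    """Mini-batch estimate of M(u, v) in Eq. (1)."""
    eps = 1e-8  # clamp away from {0, 1} so the logs stay finite
    d_real = D(x_real).clamp(eps, 1 - eps)
    d_fake = D(G(z)).clamp(eps, 1 - eps)
    return 0.5 * torch.log(d_real).mean() + 0.5 * torch.log1p(-d_fake).mean()

x_real = torch.randn(128, data_dim)   # stand-in for a batch of data samples
z = torch.randn(128, latent_dim)      # latent samples z ~ p_z
print(gan_objective(G, D, x_real, z).item())
```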
Each of the two players tries to optimize their own objective, which is exactly balanced by the loss of the other player, thus yielding a two-player zero-sum minimax game. The minimax nature of the objective and the use of neural networks as players make the process of learning a generative model challenging. We focus our attention on two of the central open issues behind these difficulties and how they translate into the need for an assessment metric.

Correspondence to: paulina.grnarova@inf.ethz.ch

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Comparison of information obtained by different metrics for likelihood and minmax-based models. The red dashed line corresponds to the optimal point for stopping the training.
i) Convergence metric. The need for an adequate convergence metric is especially relevant given the difficulty of training GANs: current approaches often fail to converge [31] or oscillate between different modes of the data distribution [21]. The ability to reliably detect non-convergent behavior has been pointed out as an open problem in many previous works, e.g., by [20], as a stepping stone towards a deeper analysis as to which GAN variants converge. Such a metric is not only important for driving research efforts forward, but also from a practical perspective. Deciding when to stop training is difficult because the curves of the discriminator and generator losses oscillate (see Fig. 1) and are non-informative as to whether the model is improving or not [2]. This is especially troublesome when a GAN is trained on non-image data, in which case one might not be able to use visual inspection or FID/Inception scores as a proxy.

ii) Evaluation metric. Another key problem we address is the relative comparison of learned generative models. While several evaluation metrics exist, there is no clear consensus regarding which metric is the most appropriate. Many metrics achieve reasonable discriminability (i.e., the ability to distinguish generated samples from real ones), but also tend to have a high computational cost. Some popular metrics are also specific to image data. We refer the reader to [3] for an in-depth discussion of the merits and drawbacks of existing evaluation metrics.

In more traditional likelihood-based models, the train/test curves do address the problems raised in i) and ii). For GANs, the generator/discriminator curves (see Fig. 1) are however largely uninformative due to the minimax nature of GANs, where both players can undo each other's progress.
In this paper, we leverage ideas from game theory to propose a simple and computationally efficient metric for GANs. Our approach is to view GANs as a zero-sum game between a generator G and a discriminator D. From this perspective, "solving" the game is equivalent to finding an equilibrium, i.e., a pair (G, D) such that no side may increase its utility by unilateral deviation. A natural metric for measuring the sub-optimality (w.r.t. an equilibrium) of a given solution (G, D) is the duality gap [33, 22]. We therefore suggest using it as a metric for GANs, akin to a test loss in the likelihood case (see Fig. 1, duality gap²).

There are several important issues that we address in order to make the duality gap an appropriate and practical metric for GANs. Our contributions include the following:
- We show that the duality gap allows us to assess the similarity between the generated data and the true data distribution (see Theorem 1).
- We show how to appropriately estimate the duality gap in the typical machine learning scenario where our access to the GAN learning objective is only through samples.
- We provide a computationally efficient way to estimate the duality gap during training.
- In scenarios where one is interested in assessing the quality of the learned generator, we show how to use a related metric, the minimax loss, that takes only the generator into consideration, in order to detect mode collapse and measure sample quality.
- We extensively demonstrate the effectiveness of these metrics on a range of datasets, GAN variants and failure modes. Unlike the FID or Inception score, which require labelled data or a domain-dependent classifier, our metrics are domain independent and do not require labels.

²The curves are obtained for a progressive GAN trained on CelebA.
Related work. While several evaluation metrics have been proposed [31, 30, 12, 18], previous research has pointed out various limitations of these metrics, thus leaving the evaluation of GANs as an unsettled issue [20]. Since the data log-likelihood is commonly used to train generative models, it may appear to be a sensible metric for GANs. However, its computation is often intractable, and [32] also demonstrate that it has severe limitations, as it might yield low visual quality samples despite a high likelihood. Perhaps the most popular evaluation metric for GANs is the Inception score [31], which measures both diversity of the generated samples and discriminability. While diversity is measured as the entropy of the output distribution, the discriminability aspect requires a pretrained neural network to assign high scores to images close to training images. Various modifications of the Inception score have been suggested. The Fréchet Inception Distance (FID) [12] models features from a hidden layer as two multivariate Gaussians for the generated and true data. However, the Gaussian assumption might not hold in practice, and labelled data is required in order to train a classifier. Without labels, transfer learning is possible to datasets under limited conditions (i.e., the source and target distributions should not be too dissimilar). In [26], two metrics are introduced to evaluate a single model playing against past and future versions of itself, as well as to measure the aptitude of two different fully trained models. In some way, this can be seen as an approximation of the minimax value we advocate in this paper, where instead of doing a full optimization in order to find the best adversary for the fixed generator, the search space is limited to discriminators that are snapshots from training, or discriminators trained with different seeds.

The ideas of duality and equilibria developed in the seminal works of [33, 22] have become a cornerstone in many fields of science, but are relatively unexplored for GANs. Some exceptions are [5, 10, 8, 13], but these works do not address the problem of evaluation. Closer to us, game-theoretic metrics were previously mentioned in [25], but without a discussion addressing the stochastic nature and other practical difficulties of GANs, thus not yielding a practically applicable method. We conclude our discussion by pointing out the vast literature on duality used in the optimization community as a convergence criterion for min-max saddle point problems, see e.g. [23, 14]. Some recent work uses Lagrangian duality in order to derive an objective to train GANs [4] or to dualize the discriminator, thereby reformulating the saddle point objective as a maximization problem [17]. A similar approach proposed by [7] uses the dual formulation of Wasserstein GANs to train the decoder. Although we also make use of duality, there are significant differences. Unlike prior work, our contribution does not relate to optimizing GANs. Instead, we focus on establishing that the duality gap acts as a proxy to measure convergence, which we do theoretically (Th. 1) as well as empirically, the latter requiring a new efficient estimation procedure discussed in Sec. 3.
2 Duality Gap as Performance Measure

Standard learning tasks are often described as (stochastic) optimization problems; this applies to common Deep Learning scenarios as well as to classical tasks such as logistic and linear regression. This formulation gives rise to a natural performance measure, namely the test loss³. In contrast, GANs are formulated as (stochastic) zero-sum games. Unfortunately, this fundamentally different formulation does not allow us to use the same performance metric. In this section, we describe a performance measure for GANs which naturally arises from a game-theoretic perspective. We start with a brief overview of zero-sum games, including a description of the duality gap metric.

³For classification tasks, using the zero-one test error is also very natural. Nevertheless, in regression tasks the test loss is often the only reasonable performance measure.
A zero-sum game is defined by two players P1 and P2 who choose a decision from their respective decision sets K1 and K2. A game objective M : K1 × K2 → R sets the utilities of the players. Concretely, upon choosing a pure strategy (u, v) ∈ K1 × K2, the utility of P1 is −M(u, v), while the utility of P2 is M(u, v). The goal of either P1/P2 is to maximize their worst-case utility:

\min_{u \in K_1} \max_{v \in K_2} M(u, v) \;\text{(Goal of P1)}, \qquad \max_{v \in K_2} \min_{u \in K_1} M(u, v) \;\text{(Goal of P2)}   (2)

This formulation raises the question of whether there exists a solution (u*, v*) to which both players may jointly converge. The latter only occurs if there exists (u*, v*) such that neither P1 nor P2 may increase their utility by unilateral deviation. Such a solution is called a pure equilibrium; formally,

\max_{v \in K_2} M(u^*, v) = \min_{u \in K_1} M(u, v^*) \;\text{(Pure Equilibrium)}.
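As a concrete illustration of why a pure equilibrium may fail to exist (our toy example, not the paper's), consider the classic matching-pennies game, where the two worst-case values in Eq. 2 disagree:

```python
import numpy as np

# Matching pennies: M[i, j] is the payoff P1 pays to P2.
M = np.array([[ 1., -1.],
              [-1.,  1.]])

print(M.max(axis=1).min())  # min_u max_v M(u, v) =  1.0  (Goal of P1)
print(M.min(axis=0).max())  # max_v min_u M(u, v) = -1.0  (Goal of P2)

# The two values differ, so no pure equilibrium exists. Mixing uniformly
# (each player randomizes 50/50) equalizes both sides at value 0, which is
# the Mixed Nash Equilibrium discussed next.
p = q = np.array([0.5, 0.5])
print(p @ M @ q)            # 0.0
```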
While a pure equilibrium does not always exist, the seminal work of [22] shows that an extended notion of equilibrium always does. Specifically, there always exists a distribution D_1 over elements of K1 and a distribution D_2 over elements of K2 such that the following holds:

\max_{v \in K_2} \mathbb{E}_{u \sim D_1} M(u, v) = \min_{u \in K_1} \mathbb{E}_{v \sim D_2} M(u, v) \;\text{(MNE)}

Such a solution is called a Mixed Nash Equilibrium (MNE). This notion of equilibrium gives rise to the following natural performance measure of a given pure/mixed strategy.
Definition 1 (Duality Gap). Let D_1 and D_2 be fixed distributions over elements from K1 and K2, respectively. Then the duality gap DG of (D_1, D_2) is defined as follows:

DG(D_1, D_2) := \max_{v \in K_2} \mathbb{E}_{u \sim D_1} M(u, v) - \min_{u \in K_1} \mathbb{E}_{v \sim D_2} M(u, v).   (3)

In particular, for a given pure strategy (u, v) ∈ K1 × K2 we define:

DG(u, v) := \max_{v' \in K_2} M(u, v') - \min_{u' \in K_1} M(u', v).   (4)

Two well-known properties of the duality gap are that it is always non-negative and that it is exactly zero at (mixed) Nash equilibria. These properties are very appealing from a practical point of view, since they mean that the duality gap gives us an immediate handle for measuring convergence.
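For a finite game given as a payoff matrix, Eq. 4 is a one-liner. A small sketch (again our own example) showing that the gap is zero exactly at a saddle point and positive elsewhere:

```python
import numpy as np

def duality_gap(M, u, v):
    """DG(u, v) = max_v' M(u, v') - min_u' M(u', v) for a matrix game (Eq. 4)."""
    return M[u, :].max() - M[:, v].min()

M = np.array([[1., 2.],
              [0., 3.]])   # saddle point at (u=0, v=1) with value M[0, 1] = 2

print(duality_gap(M, 0, 1))  # 0.0 -> pure equilibrium
print(duality_gap(M, 1, 0))  # 3.0 -> far from equilibrium, gap is positive
```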
Next, we illustrate the usefulness of the duality gap metric by analyzing the ideal case where both G and D have unbounded capacity. The latter notion, introduced by [9], means that the generator can represent any distribution and the discriminator can represent any decision rule. The next proposition shows that in this case, as long as G is not equal to the true distribution, the duality gap is always positive. In particular, we show that the duality gap is at least as large as the Jensen-Shannon divergence between the true and fake distributions. We also show that if G outputs the true distribution, then there exists a discriminator such that the duality gap (DG) is zero. See the proof in the Appendix.

Theorem 1 (DG and JSD). Consider the GAN objective in Eq. 1, and assume that the generator and discriminator networks have unbounded capacity. Then the duality gap of a given fixed solution (G_u, D_v) is lower bounded by the Jensen-Shannon divergence between the true distribution p_data and the fake distribution q_u generated by G_u, i.e. DG(u, v) ≥ JSD(p_data ∥ q_u). Moreover, if G_u outputs the true distribution, then there exists a discriminator D_v such that DG(G_u, D_v) = 0.

Note that different GAN objectives are known to be related to other types of divergences [24], and we believe that the theorem above can be generalized to other GAN objectives [2, 11].
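The core of the argument is a short computation; the following is a sketch of the standard reasoning based on the optimal-discriminator analysis of [9] (the full proof is in the paper's Appendix):

```latex
% Step 1: for a fixed generator u with distribution q_u, the inner maximum of
% Eq. (1) is attained at D^*(x) = p_data(x) / (p_data(x) + q_u(x)), which gives
\max_{v' \in K_2} M(u, v') = -\log 2 + \mathrm{JSD}(p_{\mathrm{data}} \,\|\, q_u).
% Step 2: unbounded capacity lets us pick a generator u' with q_{u'} = p_data;
% then for any fixed discriminator v,
M(u', v) = \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D_v(x) + \log\big(1 - D_v(x)\big)\right] \le -\log 2,
% since \log a + \log(1 - a) \le -\log 4 on (0, 1). Hence \min_{u'} M(u', v) \le -\log 2, and
\mathrm{DG}(u, v) = \max_{v'} M(u, v') - \min_{u'} M(u', v) \ge \mathrm{JSD}(p_{\mathrm{data}} \,\|\, q_u).
```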
3 Estimating the Duality Gap for GANs

Appropriately estimating the duality gap from samples. Supervised learning problems are often formulated as stochastic optimization programs, meaning that we may only access estimates of the expected loss by using samples. One typically splits the data into training and test sets⁴. The training set is used to find a solution whose quality is estimated using a separate test set (which provides an unbiased estimate of the true expected loss). Similarly, GANs are formulated as stochastic zero-sum games (Eq. 1), but the issue of evaluating the duality gap metric is more delicate. This is because the evaluation has three phases: (i) training a model (u, v); (ii) finding the worst-case discriminator/generator, v_worst ∈ arg max_{v ∈ K_2} M(u, v) and u_worst ∈ arg min_{u ∈ K_1} M(u, v); and (iii) computing the duality gap by estimating DG := M(u, v_worst) − M(u_worst, v). Since we do not have direct access to the expected objective, one should use different samples for each of the three phases in order to maintain an unbiased estimate of the expected duality gap. Thus we split our dataset into three disjoint subsets: a training set, an adversary-finding set, and a test set, which are respectively used in phases (i), (ii) and (iii).

⁴Of course, one should also use a validation set, but this is less important for our discussion here.
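A minimal sketch of this three-phase procedure, building on the gan_objective sketch from Sec. 1. The optimizer choice, step counts, and the helpers adv_real, test_real, sample_z (batches from the adversary-finding and test splits, and a prior sampler) are assumptions for illustration:

```python
import copy
import torch

def estimate_duality_gap(G, D, gan_objective, adv_real, test_real,
                         sample_z, steps=200, lr=1e-3):
    # Phase (ii): worst-case discriminator v_worst ~ argmax_v M(u, v),
    # found on the adversary-finding split.
    d_worst = copy.deepcopy(D)
    opt_d = torch.optim.Adam(d_worst.parameters(), lr=lr)
    for _ in range(steps):
        loss = -gan_objective(G, d_worst, adv_real, sample_z())  # ascend in v
        opt_d.zero_grad(); loss.backward(); opt_d.step()

    # Phase (ii): worst-case generator u_worst ~ argmin_u M(u, v).
    g_worst = copy.deepcopy(G)
    opt_g = torch.optim.Adam(g_worst.parameters(), lr=lr)
    for _ in range(steps):
        loss = gan_objective(g_worst, D, adv_real, sample_z())   # descend in u
        opt_g.zero_grad(); loss.backward(); opt_g.step()

    # Phase (iii): DG = M(u, v_worst) - M(u_worst, v), estimated on the
    # held-out test split to keep the estimate unbiased.
    with torch.no_grad():
        return (gan_objective(G, d_worst, test_real, sample_z())
                - gan_objective(g_worst, D, test_real, sample_z())).item()
```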
[Figure 2: Progression of duality gap (DG) throughout training and heatmaps of generated samples. Panels: RING, SPIRAL, GRID; x-axes: training steps.]
Minimax Loss as a metric for evaluating generators. For all experiments, we report both the duality gap (DG) and the minimax loss M(u, v_worst). The latter is the first term in the expression of the DG and intuitively measures the goodness of a generator G_u. If G_u is optimal and covers p_data, the minimax loss achieves its optimal value as well. This happens when D_{v_worst} outputs 0.5 for both the real and the generated samples. Whenever the generated distribution does not cover the entire support of p_data or compromises sample quality, this is detected by D_{v_worst} and hence the minimax loss increases. This makes it a compelling metric for detecting mode collapse and evaluating sample quality. Note that in order to compute this metric one only needs a batch of generated samples, i.e. the generator can be used as a black box. Hence, this metric is not limited to generators trained as part of a GAN, but can instead be used for any generator that can be sampled from.
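A sketch of this black-box usage: only a sampler of generated batches is needed, plus real data to train the worst-case discriminator. The samplers fake_batch and real_batch and the small MLP discriminator are illustrative assumptions:

```python
import torch
import torch.nn as nn

def minimax_loss(fake_batch, real_batch, data_dim, steps=200, lr=1e-3):
    """Estimate M(u, v_worst) for a black-box sampler `fake_batch()`."""
    d = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(),
                      nn.Linear(64, 1), nn.Sigmoid())
    opt = torch.optim.Adam(d.parameters(), lr=lr)
    eps = 1e-8

    def M(x_real, x_fake):  # mini-batch estimate of the objective, Eq. (1)
        d_real = d(x_real).clamp(eps, 1 - eps)
        d_fake = d(x_fake).clamp(eps, 1 - eps)
        return 0.5 * torch.log(d_real).mean() + 0.5 * torch.log1p(-d_fake).mean()

    for _ in range(steps):                     # find the worst-case discriminator
        loss = -M(real_batch(), fake_batch())  # ascend M in v
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                      # report M(u, v_worst) on fresh batches
        return M(real_batch(), fake_batch()).item()
```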
Practical and efficient estimation of the duality gap for GANs. In practice, the metrics are computed by optimizing a separate generator/discriminator using a gradient-based algorithm. To speed up the optimization, we initialize the networks using the parameters of the adversary at the step being evaluated. Hence, if we are evaluating the GAN at step t, we train v_worst starting from the parameters of the discriminator at step t (and analogously for u_worst).
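Putting the pieces together, monitoring then amounts to evaluating the metric at saved checkpoints. A usage sketch in terms of the helpers above (checkpoints and the data splits are assumed names, not from the paper):

```python
# Each checkpoint holds (G_t, D_t) snapshots saved during training. Thanks to
# the warm start inside estimate_duality_gap (adversaries are initialized from
# the current networks), a modest number of steps suffices per evaluation.
history = []
for t, (G_t, D_t) in enumerate(checkpoints):
    dg = estimate_duality_gap(G_t, D_t, gan_objective,
                              adv_real, test_real, sample_z, steps=50)
    history.append((t, dg))  # plot DG vs. t to monitor convergence (cf. Fig. 2)
```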