<aside> Generative Adversarial Networks
</aside>
<aside> Learning goals for this lesson
Generator
: the latent variable function $G_{\theta}$
Discriminator
: $D_{\phi}$ ($=\mathbf{1}_{\mathcal{X}_{\text{data}}}(\mathbf{x})$ in the ideal case)
That is, training the discriminator is a binary classification problem → binary cross-entropy loss function
$$
\ell(\mathbf{x},\phi):=-\mathbf{1}_{\mathcal{X}_{\text{data}}}(\mathbf{x})\log D_{\phi}(\mathbf{x})-(1-\mathbf{1}_{\mathcal{X}_{\text{data}}}(\mathbf{x}))\log (1-D_{\phi}(\mathbf{x})) $$
$$
\ell(\mathbf{x},\phi)=\begin{cases} -\log D_{\phi}(\mathbf{x}) & :\mathbf{x}\in\mathcal{X}_{\text{data}} \\ -\log (1-D_{\phi}(\mathbf{x})) & :\mathbf{x}\notin\mathcal{X}_{\text{data}} \end{cases} $$
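A minimal NumPy sketch of this case-wise loss (the helper name `bce_loss` and the example probabilities are illustrative, not from the source):

```python
import numpy as np

def bce_loss(d_x: float, is_real: bool) -> float:
    """Binary cross-entropy loss for a single sample.

    d_x     : discriminator output D_phi(x), a probability in (0, 1)
    is_real : True if x is in X_data, False if x is generated
    """
    if is_real:   # x in X_data      ->  -log D_phi(x)
        return -np.log(d_x)
    else:         # x not in X_data  ->  -log(1 - D_phi(x))
        return -np.log(1.0 - d_x)

# A confident, correct discriminator incurs low loss in both cases:
print(bce_loss(0.99, is_real=True))    # ~0.01
print(bce_loss(0.01, is_real=False))   # ~0.01
# A fooled discriminator incurs high loss:
print(bce_loss(0.01, is_real=True))    # ~4.6
```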
We can minimize the expectation of $\ell(\mathbf{x},\phi)$ over $\mathbf{x}\in\mathcal{X}_{\text{data}}\cup\widehat{\mathcal{X}}_{\text{data}}$ w.r.t. $\phi$!
→ this yields the optimal discriminator $D_{\ast}$
This is possible because we can collect real data $\mathcal{D}$ from the data distribution, and by sampling the latent variable $\mathbf{z}\sim P_{\mathcal{Z}}$ we can run the generator $G_{\theta}$ to produce samples following the model distribution $P_{G}$.
How to compare the two distributions $P_{\text{data}}$ and $P_G$: compute the density ratio $r=\frac{P_{\text{data}}}{P_{G}}$
But $P_{\text{data}}$ itself is not accessible → we need a way to compute the ratio from samples!
$$
\begin{aligned} \frac{P_{\text{data}}(\mathbf{x})}{P_{G}(\mathbf{x})}&=\frac{\mathbb{P}(X=\mathbf{x}|\text{real})}{\mathbb{P}(X=\mathbf{x}|\text{generated})} \\ &=\frac{\mathbb{P}(\text{real}|X=\mathbf{x})\mathbb{P}(X=\mathbf{x})}{\mathbb{P}(\text{real})}\Big/ \frac{\mathbb{P}(\text{generated}|X=\mathbf{x})\mathbb{P}(X=\mathbf{x})}{\mathbb{P}(\text{generated})}\\ &=\frac{\mathbb{P}(\text{real}|X=\mathbf{x})}{\mathbb{P}(\text{generated}|X=\mathbf{x})}\approx\frac{D_{\phi}(\mathbf{x})}{1-D_{\phi}(\mathbf{x})} \end{aligned} $$
(The last equality uses $\mathbb{P}(\text{real})=\mathbb{P}(\text{generated})=\frac{1}{2}$, i.e., real and generated samples are presented to the discriminator equally often.) Solving for $D_{\phi}$:
$$
D_{\phi}(\mathbf{x})\approx\frac{P_{\text{data}}(\mathbf{x})}{P_{\text{data}}(\mathbf{x})+P_{G}(\mathbf{x})} $$
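This identity is exactly the classifier-based density-ratio trick: train a probabilistic classifier to separate real from generated samples, then read off $D_{\phi}(\mathbf{x})/(1-D_{\phi}(\mathbf{x}))$. A hedged sketch with two known 1D Gaussians standing in for $P_{\text{data}}$ and $P_G$ (the particular distributions and the use of scikit-learn's LogisticRegression are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins: P_data = N(0, 1), P_G = N(1, 1); equal sample counts,
# so that P(real) = P(generated) = 1/2 as the derivation assumes.
x_real = rng.normal(0.0, 1.0, size=5000)
x_gen  = rng.normal(1.0, 1.0, size=5000)

X = np.concatenate([x_real, x_gen]).reshape(-1, 1)
y = np.concatenate([np.ones(5000), np.zeros(5000)])   # 1 = real

clf = LogisticRegression().fit(X, y)

x = np.array([[0.5]])
d = clf.predict_proba(x)[0, 1]          # ~ D_phi(x) = P(real | x)
est_ratio  = d / (1.0 - d)              # classifier-based ratio estimate
true_ratio = norm.pdf(0.5, 0, 1) / norm.pdf(0.5, 1, 1)
print(est_ratio, true_ratio)            # should be close (~1.0 here)
```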
We can now design a learning objective for training the discriminator network $D_{\phi}$!
Decompose the loss function $\ell$ into a real-data part and a $G_\theta$ part:
$$
\ell_{\text{data}}(\mathbf{x},\phi)=-\log D_{\phi}(\mathbf{x}),\quad \ell_{G}(\hat{\mathbf{x}},\phi)=-\log(1-D_{\phi}(\hat{\mathbf{x}})) $$
loss function $L(\phi, \theta)$
$$ \begin{aligned} L(\phi,\theta)&:=\mathbb{E}_{\mathbf{x}\sim P_{\text{data}},\,\hat{\mathbf{x}}\sim P_{G_{\theta}}}[\ell_{\text{data}}(\mathbf{x},\phi)+\ell_{G}(\hat{\mathbf{x}},\phi)]\\ &=\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim P_{\text{data}}}[\ell_{\text{data}}(\mathbf{x},\phi)]+\frac{1}{2}\mathbb{E}_{\hat{\mathbf{x}}\sim P_{G_{\theta}}}[\ell_{G}(\hat{\mathbf{x}},\phi)]\\ &=\frac{1}{2}\int_{\mathcal{X}_{\text{data}}} \ell_{\text{data}}(\mathbf{x},\phi)P_{\text{data}}(\mathbf{x})\,\text{d}\mathbf{x}+\frac{1}{2}\int_{\widehat{\mathcal{X}}_{\text{data}}}\ell_{G}(\hat{\mathbf{x}},\phi)P_{G_{\theta}}(\hat{\mathbf{x}})\,\text{d}\hat{\mathbf{x}} \end{aligned} $$
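In practice the two expectations are estimated with minibatch Monte Carlo averages. A minimal PyTorch sketch of one discriminator update (the toy network sizes, optimizer, and data placeholders are illustrative assumptions, not from the source):

```python
import torch

# Illustrative toy networks; real architectures depend on the data.
D = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(),
                        torch.nn.Linear(64, 1), torch.nn.Sigmoid())
G = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(),
                        torch.nn.Linear(64, 2))
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)

x_real = torch.randn(128, 2)    # placeholder for a real-data batch
z = torch.randn(128, 8)         # z ~ P_Z
x_fake = G(z).detach()          # detach: this step updates phi only, not theta

# L(phi, theta) = 1/2 E[l_data] + 1/2 E[l_G], as minibatch averages
loss_data = -torch.log(D(x_real)).mean()        # l_data = -log D(x)
loss_gen  = -torch.log(1 - D(x_fake)).mean()    # l_G = -log(1 - D(x_hat))
loss_D = 0.5 * (loss_data + loss_gen)

opt_D.zero_grad()
loss_D.backward()
opt_D.step()
```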
Now, if the generator $G_\theta$ produces realistic data, i.e., $\mathcal{X}_{\text{data}}\approx\widehat{\mathcal{X}}_{\text{data}}$, the two integrals can be merged:
$$
\begin{aligned} \frac{1}{2}&\int_{\mathcal{X}_{\text{data}}}\left[\ell_{\text{data}}(\mathbf{x},\phi)P_{\text{data}}(\mathbf{x})+\ell_{G}(\mathbf{x},\phi)P_{G_{\theta}}(\mathbf{x})\right]\text{d}\mathbf{x} \\ &=-\frac{1}{2}\int_{\mathcal{X}_{\text{data}}}P_{\text{data}}(\mathbf{x})\log D_{\phi}(\mathbf{x})+P_{G_{\theta}}(\mathbf{x})\log(1-D_{\phi}(\mathbf{x}))\,\text{d}\mathbf{x} \end{aligned} $$
The optimal discriminator minimizing the above objective function:
For $a,b>0$ and $z\in (0,1)$,
$$
\begin{aligned} \frac{\text{d}}{\text{d} z}(a\log z+b\log(1-z))&=\frac{a}{z}-\frac{b}{1-z}=0\quad \Rightarrow \quad z=\frac{a}{a+b} \end{aligned} $$
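A quick numerical check of this calculus fact (the values of $a$, $b$ and the grid resolution are arbitrary):

```python
import numpy as np

a, b = 3.0, 1.0
z = np.linspace(1e-6, 1 - 1e-6, 100_001)
f = a * np.log(z) + b * np.log(1 - z)   # concave in z, so the critical point is the max

print(z[np.argmax(f)])   # ~0.75, numerical maximizer of a*log z + b*log(1-z)
print(a / (a + b))       # 0.75, the closed-form z = a/(a+b)
```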
the minimizer of the above objective function:
$$
D_{\ast}(\mathbf{x})=D_{\phi_{\ast}}(\mathbf{x})=\frac{P_{\text{data}}(\mathbf{x})}{P_{\text{data}}(\mathbf{x})+P_{G_{\theta}}(\mathbf{x})} $$
plug $\phi_{\ast}(\theta)$ into $\phi$ of the above objective function:
$$ \begin{aligned} L(\phi_{\ast}(\theta),\theta)&=\log 2-\frac{1}{2}\mathbb{KL}\left(P_{\text{data}}\Big\|\frac{P_{\text{data}}+P_{G_{\theta}}}{2}\right)-\frac{1}{2}\mathbb{KL}\left( P_{G_{\theta}}\Big\|\frac{P_{\text{data}}+P_{G_{\theta}}}{2}\right)\\ &=\log2 - \mathbb{JS}(P_{\text{data}}\|P_{G_{\theta}}) \end{aligned} $$
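The closed form $L(\phi_{\ast}(\theta),\theta)=\log 2-\mathbb{JS}(P_{\text{data}}\|P_{G_{\theta}})$ can be verified numerically on a pair of discrete distributions (the particular distributions are arbitrary stand-ins):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])    # stand-in for P_data
q = np.array([0.2, 0.2, 0.6])    # stand-in for P_G
m = 0.5 * (p + q)                # mixture (P_data + P_G) / 2

kl = lambda u, v: np.sum(u * np.log(u / v))
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Objective value at the optimal discriminator D* = p / (p + q)
d_star = p / (p + q)
L = -0.5 * np.sum(p * np.log(d_star)) - 0.5 * np.sum(q * np.log(1 - d_star))

print(L, np.log(2) - js)   # identical up to floating-point error
```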