(DVD-GAN) Adversarial Video Generation on Complex Datasets
Published: 2019/09 | Paper: Adversarial Video Generation on Complex Datasets
Discriminator
- Spatial $D_S$
- critiques single-frame content and structure
- final score is the sum of per-frame scores
- similar to $D_I$ of MoCoGAN → but unlike MoCoGAN, whose temporal discriminator sees full-resolution videos, $D_S$ is DVD-GAN's only full-resolution discriminator
- Temporal $D_T$
- provides generator with learning signal to generate movements
- input : spatially downsampled video
- applies to video the strong leveraging of scale that produced high-fidelity samples in natural image generation
- scales to longer and higher-resolution videos by leveraging a computationally efficient decomposition of its discriminator
- new state of the art in Fréchet Inception Distance for video prediction on the Kinetics-600 dataset, among other benchmarks
 
- scalable generative model of natural video that produces high-quality samples at resolutions up to 256 × 256 and lengths up to 48 frames
- builds upon the BigGAN architecture with scalable, video-specific generator and discriminator architectures
 
 
Dual Discriminators
DVD-GAN tackles the scale problem of video generation by using two discriminators:
Spatial Discriminator $D_S$
- critiques single-frame content and structure by randomly sampling $k$ full-resolution frames and judging them individually
- final score is the sum of the per-frame scores
- similar to that of MoCoGAN: DVD-GAN's $D_S$ resembles the per-frame discriminator $D_I$ in MoCoGAN (Tulyakov et al., 2018). However, MoCoGAN's analog of $D_T$ looks at full-resolution videos, whereas in DVD-GAN $D_S$ is the only source of learning signal for high-resolution details. For this reason, $D_S$ is essential when $\phi$ is not the identity, unlike in MoCoGAN, where the additional per-frame discriminator is less crucial.
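As a minimal sketch of the $D_S$ decomposition described above (the per-frame scoring function here is a hypothetical stand-in, not the paper's network):

```python
import random

def spatial_discriminator_score(video, k, score_frame):
    """Sample k full-resolution frames uniformly at random and
    sum their individual scores, as in DVD-GAN's D_S."""
    frames = random.sample(video, k)  # video: list of T frames
    return sum(score_frame(f) for f in frames)

# toy usage: frames are 4x4 lists; "score" a frame by its mean pixel value
def mean_pixel(frame):
    vals = [p for row in frame for p in row]
    return sum(vals) / len(vals)

video = [[[t] * 4 for _ in range(4)] for t in range(48)]  # 48 constant frames
s = spatial_discriminator_score(video, k=8, score_frame=mean_pixel)
```

Because only $k$ of the $T$ frames are scored, the per-video cost of the spatial critic is independent of video length.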
Temporal Discriminator $D_T$
- provides G with the learning signal to generate movement
 
To make the model scalable, we apply a spatial downsampling function $\phi(\cdot)$ to the whole video and feed its output to $D_T$.
- results in an architecture where the discriminators do not process the entire video’s worth of pixels, since DS processes only $k × H × W$ pixels and DT only $T \times \frac{H}{2} \times \frac{W}{2}$.
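A sketch of $\phi$ as 2×2 average pooling, which matches the $\frac{H}{2} \times \frac{W}{2}$ input size quoted for $D_T$ (the paper's exact choice of $\phi$ may differ):

```python
import numpy as np

def phi(video):
    """Spatially downsample a (T, H, W) video by 2x2 average pooling,
    a plausible instance of the downsampling function phi fed to D_T."""
    T, H, W = video.shape
    return video.reshape(T, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

video = np.random.rand(48, 128, 128)
small = phi(video)  # shape (48, 64, 64): what D_T would see
```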
- For a 48-frame video at 128 × 128 resolution with $k = 8$ sampled frames, this reduces the number of pixels to process per video from 786,432 to 327,680: a 58% reduction.
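The pixel counts above can be checked directly (assuming $k = 8$ sampled frames, which is consistent with the quoted totals):

```python
T, H, W, k = 48, 128, 128, 8

full = T * H * W                 # pixels a single full-video discriminator would see
d_s  = k * H * W                 # D_S: k full-resolution frames
d_t  = T * (H // 2) * (W // 2)   # D_T: the 2x spatially downsampled video
dual = d_s + d_t
reduction = 1 - dual / full

print(full, dual, round(reduction * 100))  # 786432 327680 58
```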