(DVD-GAN) Adversarial Video Generation on Complex Datasets
Published: 2019/09 | Paper: Adversarial Video Generation on Complex Datasets
Discriminator
- Spatial $D_S$
- critiques single-frame content and structure
- final score is the sum of per-frame scores
- similar to $D_I$ of MoCoGAN → but unlike MoCoGAN, whose temporal discriminator sees full-resolution videos, $D_S$ is DVD-GAN's only full-resolution discriminator
- Temporal $D_T$
- provides generator with learning signal to generate movements
- input : spatially downsampled video
- applies to video the strong leveraging of scale that produced high-fidelity samples in natural image generation
- scales to longer and higher-resolution videos by leveraging a computationally efficient decomposition of its discriminator
- new state of the art in Fréchet Inception Distance for video prediction on the Kinetics-600 dataset, among other benchmarks
 
- scalable generative model of natural video that produces high-quality samples at resolutions up to 256 × 256 and lengths up to 48 frames
- builds upon the BigGAN architecture with scalable, video-specific generator and discriminator architectures
 
 
Dual Discriminators
DVD-GAN tackles the scale problem of video generation by using two discriminators:
Spatial Discriminator $D_S$
- critiques single-frame content and structure by randomly sampling $k$ full-resolution frames and judging them individually
- final score is the sum of the per-frame scores
- similar to that of MoCoGAN: DVD-GAN's $D_S$ resembles the per-frame discriminator $D_I$ in MoCoGAN (Tulyakov et al., 2018). However, MoCoGAN's analog of $D_T$ looks at full-resolution videos, whereas in DVD-GAN $D_S$ is the only source of learning signal for high-resolution details. For this reason, $D_S$ is essential when $\phi$ is not the identity, unlike in MoCoGAN, where the additional per-frame discriminator is less crucial.
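As a minimal sketch of the $D_S$ decomposition described above (the per-frame scoring function here is a hypothetical stand-in, not the paper's network):

```python
import random

def spatial_discriminator_score(video, k, score_frame):
    """Sample k full-resolution frames uniformly at random and
    sum their individual scores, as in DVD-GAN's D_S."""
    frames = random.sample(video, k)  # video: list of T frames
    return sum(score_frame(f) for f in frames)

# toy usage: frames are 4x4 lists; "score" a frame by its mean pixel value
def mean_pixel(frame):
    vals = [p for row in frame for p in row]
    return sum(vals) / len(vals)

video = [[[t] * 4 for _ in range(4)] for t in range(48)]  # 48 constant frames
s = spatial_discriminator_score(video, k=8, score_frame=mean_pixel)
```

Because only $k$ of the $T$ frames are scored, the per-video cost of the spatial critic is independent of video length.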
Temporal Discriminator $D_T$
- provides G with the learning signal to generate movement
 
To make the model scalable, we apply a spatial downsampling function $\phi(\cdot)$ to the whole video and feed its output to $D_T$.
- results in an architecture where the discriminators do not process the entire video’s worth of pixels, since DS processes only $k × H × W$ pixels and DT only $T \times \frac{H}{2} \times \frac{W}{2}$.
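A sketch of $\phi$ as 2×2 average pooling, which matches the $\frac{H}{2} \times \frac{W}{2}$ input size quoted for $D_T$ (the paper's exact choice of $\phi$ may differ):

```python
import numpy as np

def phi(video):
    """Spatially downsample a (T, H, W) video by 2x2 average pooling,
    a plausible instance of the downsampling function phi fed to D_T."""
    T, H, W = video.shape
    return video.reshape(T, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

video = np.random.rand(48, 128, 128)
small = phi(video)  # shape (48, 64, 64): what D_T would see
```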
- For a 48-frame video at 128 × 128 resolution with $k = 8$ sampled frames, this reduces the number of pixels to process per video from 786,432 to 327,680: a 58% reduction.
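The pixel counts above can be checked directly (assuming $k = 8$ sampled frames, which is consistent with the quoted totals):

```python
T, H, W, k = 48, 128, 128, 8

full = T * H * W                 # pixels a single full-video discriminator would see
d_s  = k * H * W                 # D_S: k full-resolution frames
d_t  = T * (H // 2) * (W // 2)   # D_T: the 2x spatially downsampled video
dual = d_s + d_t
reduction = 1 - dual / full

print(full, dual, round(reduction * 100))  # 786432 327680 58
```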