(LVDM) Latent Video Diffusion Models for High-Fidelity Long Video Generation
Published: 2022/11 · Paper: Latent Video Diffusion Models for High-Fidelity Long Video Generation
Summary
- focused on long video generation
- 3D autoencoder
- encoder: spatial and temporal downsampling, compressing a video clip into a low-dimensional 3D latent (see the first sketch after this list)
- spatiotemporal factorized 3D UNet architecture (see the second sketch after this list)
- tried joint spatiotemporal attention → less effective
- conditional latent perturbation
- for coherent long sequences
- introduces noise into the conditional latent variables at each generation step
- unconditional guidance
- to ensure general realism
- when the model generates conditionally, it may drift away from realistic or expected outputs as the sequence gets longer
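A minimal sketch of what such a 3D autoencoder's encoder could look like in PyTorch. The class name (`VideoEncoder3D`), channel widths, and downsampling factors are illustrative assumptions, not the paper's exact architecture; the point is that strided 3D convolutions shrink both the spatial and the temporal axes.

```python
import torch
import torch.nn as nn

class VideoEncoder3D(nn.Module):
    """Hypothetical 3D conv encoder (assumed layout, not the paper's):
    strided 3D convolutions downsample a video spatially and
    temporally into a compact latent tensor."""
    def __init__(self, in_ch=3, base_ch=64, latent_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            # spatial-only downsampling: stride (1, 2, 2)
            nn.Conv3d(in_ch, base_ch, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            # spatiotemporal downsampling: stride (2, 2, 2)
            nn.Conv3d(base_ch, base_ch * 2, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(base_ch * 2, base_ch * 4, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            # project to the low-dimensional latent channels
            nn.Conv3d(base_ch * 4, latent_ch, kernel_size=3, padding=1),
        )

    def forward(self, video):   # video: (B, C, T, H, W)
        return self.net(video)  # latent: (B, latent_ch, T/4, H/8, W/8)

x = torch.randn(1, 3, 16, 256, 256)  # 16-frame RGB clip
z = VideoEncoder3D()(x)
print(z.shape)                       # torch.Size([1, 4, 4, 32, 32])
```

Diffusion then runs in this much smaller latent space, which is where the efficiency gain over pixel-space video diffusion comes from.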
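To illustrate the factorized design, here is a rough sketch of an attention block that attends over space and time separately rather than jointly. Shapes, dimensions, and names are assumptions; only the factorization idea comes from the paper.

```python
import torch
import torch.nn as nn

class FactorizedSpatioTemporalAttention(nn.Module):
    """Sketch of factorized attention: one attention pass over the
    spatial tokens of each frame, then one over the temporal axis at
    each spatial location. Joint attention over all T*H*W tokens at
    once is what the authors report as less effective, and it is also
    far more expensive: O((T*H*W)^2) vs O(T*(H*W)^2 + H*W*T^2)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, T, HW, D) tokens per frame
        B, T, HW, D = x.shape
        # spatial attention within each frame
        s = x.reshape(B * T, HW, D)
        s, _ = self.spatial_attn(s, s, s)
        x = x + s.reshape(B, T, HW, D)
        # temporal attention across frames at each spatial location
        t = x.permute(0, 2, 1, 3).reshape(B * HW, T, D)
        t, _ = self.temporal_attn(t, t, t)
        x = x + t.reshape(B, HW, T, D).permute(0, 2, 1, 3)
        return x
```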
Introduction
Diffusion model
- good results but require high computational resources
- to address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget
Leveraging diffusion models
- However, directly extending diffusion models to video synthesis requires substantial computational resources
- so we propose LVDM, an efficient video diffusion model in the latent space of videos, and achieve SOTA results via this simple LVDM model
- in addition, to further generate long-range videos, we introduce a hierarchical LVDM framework that can extend videos far beyond the training length
- however, generating long videos tends to suffer from a performance degradation problem
- conditional latent perturbation and unconditional guidance effectively slow down this degradation (see the sketch below)
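A minimal sketch of how these two tricks might slot into an autoregressive long-video sampling loop. The function names (`perturb_condition`, `guided_eps`, `extend_video`), the `sampler` API, and all hyperparameter values are illustrative assumptions; the guidance formula follows the standard classifier-free-guidance form rather than the paper's exact equations.

```python
import torch

def perturb_condition(z_cond, noise_level=0.1):
    """Conditional latent perturbation (sketch): add a small amount of
    Gaussian noise to the conditioning latents so the model becomes
    robust to imperfections in its own previous predictions.
    `noise_level` is an assumed hyperparameter, not the paper's value."""
    return z_cond + noise_level * torch.randn_like(z_cond)

def guided_eps(model, z_t, t, z_cond, w=2.0):
    """Unconditional guidance (sketch, in the classifier-free-guidance
    style): blend the conditional noise prediction with an unconditional
    one so long rollouts stay close to the realistic data distribution."""
    eps_cond = model(z_t, t, cond=z_cond)
    eps_uncond = model(z_t, t, cond=None)
    return eps_uncond + w * (eps_cond - eps_uncond)

def extend_video(model, sampler, z_first, num_chunks=4):
    """Autoregressive extension (sketch): each new latent chunk is
    sampled conditioned on a perturbed copy of the previous chunk."""
    chunks = [z_first]
    for _ in range(num_chunks):
        cond = perturb_condition(chunks[-1])
        # `sampler` runs the reverse diffusion loop, calling
        # guided_eps(model, z_t, t, cond) at every step (assumed API)
        chunks.append(sampler(model, cond, eps_fn=guided_eps))
    return torch.cat(chunks, dim=2)  # concatenate along the time axis
```

The perturbation makes training conditions match the imperfect latents the model actually sees at inference time, while the guidance term keeps each newly generated chunk anchored to the unconditional (realistic) distribution; together they slow the error accumulation that otherwise degrades long videos.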
Contributions
- the first to compress videos into tight latents
- a hierarchical framework that operates in the video latent space, enabling the models to generate longer videos beyond the training length
- conditional latent perturbation and unconditional guidance for mitigating the performance degradation during long video generation.
- SOTA results on three benchmarks in both short and long video generation settings; also provides appealing results for open-domain text-to-video generation, demonstrating the effectiveness and generalization of the models