(LVDM) Latent Video Diffusion Models for High-Fidelity Long Video Generation
Published: 2022/11 · Paper: Latent Video Diffusion Models for High-Fidelity Long Video Generation
Summary
- focused on long video generation
- 3D autoencoder
- encoder: spatial and temporal downsampling, compressing a video clip into a low-dimensional 3D latent (see the first sketch after this list)
- spatiotemporal factorized 3D UNet architecture (see the second sketch after this list)
- tried joint spatiotemporal attention → less effective
- conditional latent perturbation
- for coherent long sequences
- introduces noise into the conditional latent variables at each generation step
- unconditional guidance
- to ensure general realism
- when the model generates conditionally, it may drift away from realistic or expected outputs as the sequence gets longer
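A minimal sketch of what such a 3D autoencoder's encoder could look like in PyTorch. The class name (`VideoEncoder3D`), channel widths, and downsampling factors are illustrative assumptions, not the paper's exact architecture; the point is that strided 3D convolutions shrink both the spatial and the temporal axes.

```python
import torch
import torch.nn as nn

class VideoEncoder3D(nn.Module):
    """Hypothetical 3D conv encoder (assumed layout, not the paper's):
    strided 3D convolutions downsample a video spatially and
    temporally into a compact latent tensor."""
    def __init__(self, in_ch=3, base_ch=64, latent_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            # spatial-only downsampling: stride (1, 2, 2)
            nn.Conv3d(in_ch, base_ch, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            # spatiotemporal downsampling: stride (2, 2, 2)
            nn.Conv3d(base_ch, base_ch * 2, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(base_ch * 2, base_ch * 4, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            # project to the low-dimensional latent channels
            nn.Conv3d(base_ch * 4, latent_ch, kernel_size=3, padding=1),
        )

    def forward(self, video):   # video: (B, C, T, H, W)
        return self.net(video)  # latent: (B, latent_ch, T/4, H/8, W/8)

x = torch.randn(1, 3, 16, 256, 256)  # 16-frame RGB clip
z = VideoEncoder3D()(x)
print(z.shape)                       # torch.Size([1, 4, 4, 32, 32])
```

Diffusion then runs in this much smaller latent space, which is where the efficiency gain over pixel-space video diffusion comes from.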
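To illustrate the factorized design, here is a rough sketch of an attention block that attends over space and time separately rather than jointly. Shapes, dimensions, and names are assumptions; only the factorization idea comes from the paper.

```python
import torch
import torch.nn as nn

class FactorizedSpatioTemporalAttention(nn.Module):
    """Sketch of factorized attention: one attention pass over the
    spatial tokens of each frame, then one over the temporal axis at
    each spatial location. Joint attention over all T*H*W tokens at
    once is what the authors report as less effective, and it is also
    far more expensive: O((T*H*W)^2) vs O(T*(H*W)^2 + H*W*T^2)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, T, HW, D) tokens per frame
        B, T, HW, D = x.shape
        # spatial attention within each frame
        s = x.reshape(B * T, HW, D)
        s, _ = self.spatial_attn(s, s, s)
        x = x + s.reshape(B, T, HW, D)
        # temporal attention across frames at each spatial location
        t = x.permute(0, 2, 1, 3).reshape(B * HW, T, D)
        t, _ = self.temporal_attn(t, t, t)
        x = x + t.reshape(B, HW, T, D).permute(0, 2, 1, 3)
        return x
```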
Introduction
Diffusion model
- good results but require high computational resources
- to address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget
Leveraging diffusion models
- However, directly extending diffusion models to video synthesis requires substantial computational resources
- so we propose LVDM, an efficient video diffusion model in the latent space of videos, and achieve SOTA results via this simple LVDM model
- in addition, to further generate long-range videos, we introduce a hierarchical LVDM framework that can extend videos far beyond the training length
- however, generating long videos tends to suffer from a performance degradation problem
- conditional latent perturbation and unconditional guidance effectively slow down this degradation (see the sketch below)
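A minimal sketch of how these two tricks might slot into an autoregressive long-video sampling loop. The function names (`perturb_condition`, `guided_eps`, `extend_video`), the `sampler` API, and all hyperparameter values are illustrative assumptions; the guidance formula follows the standard classifier-free-guidance form rather than the paper's exact equations.

```python
import torch

def perturb_condition(z_cond, noise_level=0.1):
    """Conditional latent perturbation (sketch): add a small amount of
    Gaussian noise to the conditioning latents so the model becomes
    robust to imperfections in its own previous predictions.
    `noise_level` is an assumed hyperparameter, not the paper's value."""
    return z_cond + noise_level * torch.randn_like(z_cond)

def guided_eps(model, z_t, t, z_cond, w=2.0):
    """Unconditional guidance (sketch, in the classifier-free-guidance
    style): blend the conditional noise prediction with an unconditional
    one so long rollouts stay close to the realistic data distribution."""
    eps_cond = model(z_t, t, cond=z_cond)
    eps_uncond = model(z_t, t, cond=None)
    return eps_uncond + w * (eps_cond - eps_uncond)

def extend_video(model, sampler, z_first, num_chunks=4):
    """Autoregressive extension (sketch): each new latent chunk is
    sampled conditioned on a perturbed copy of the previous chunk."""
    chunks = [z_first]
    for _ in range(num_chunks):
        cond = perturb_condition(chunks[-1])
        # `sampler` runs the reverse diffusion loop, calling
        # guided_eps(model, z_t, t, cond) at every step (assumed API)
        chunks.append(sampler(model, cond, eps_fn=guided_eps))
    return torch.cat(chunks, dim=2)  # concatenate along the time axis
```

The perturbation makes training conditions match the imperfect latents the model actually sees at inference time, while the guidance term keeps each newly generated chunk anchored to the unconditional (realistic) distribution; together they slow the error accumulation that otherwise degrades long videos.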
Contributions
- the first to compress videos into tight latents
- a hierarchical framework that operates in the video latent space, enabling the models to generate longer videos beyond the training length
- conditional latent perturbation and unconditional guidance for mitigating the performance degradation during long video generation.
- SOTA results on three benchmarks in both short and long video generation settings; also provides appealing results for open-domain text-to-video generation, demonstrating the effectiveness and generalization of the models