
(LVDM) Latent Video Diffusion Models for High-Fidelity Long Video Generation

Published: 2022/11
Paper: Latent Video Diffusion Models for High-Fidelity Long Video Generation


Summary

  • focused on long video generation
  • 3D autoencoder
    • encoder : spatial and temporal downsampling
  • spatiotemporal factorized 3D UNet architecture
    • joint spatiotemporal attention was also tried → less effective
  • conditional latent perturbation
    • keeps long sequences coherent
    • adds noise to the conditional latent variables at each generation step
  • unconditional guidance
    • preserves general realism
    • when the model generates conditionally, it may drift away from realistic or expected outputs as the sequence gets longer
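
A minimal sketch of the conditional latent perturbation idea, assuming the noise is applied with the usual forward diffusion step z_s = sqrt(ᾱ_s)·z_0 + sqrt(1 − ᾱ_s)·ε at a small, randomly chosen level s. The linear beta schedule, the `max_level` cap, and all function names are illustrative assumptions, not taken from the paper's code. The intent is that the model sees slightly corrupted conditions during training, so its own imperfect predictions are less harmful when they are fed back as conditions.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def perturb_condition(z_cond, alpha_bars, max_level, rng):
    """Noise the conditioning latents with the forward step q(z_s | z_0).

    z_cond is a flat list of floats standing in for the conditional latent
    frames. Returns the perturbed latents and the sampled noise level s,
    which the denoiser would also receive as an input (assumption).
    """
    s = rng.randrange(max_level)  # small noise level, 0 <= s < max_level
    signal = math.sqrt(alpha_bars[s])
    noise = math.sqrt(1.0 - alpha_bars[s])
    z_noisy = [signal * z + noise * rng.gauss(0.0, 1.0) for z in z_cond]
    return z_noisy, s


rng = random.Random(0)
alpha_bars = linear_alpha_bars(1000)
z_cond = [0.5, -1.0, 0.25, 2.0]  # toy conditional latents
z_noisy, s = perturb_condition(z_cond, alpha_bars, max_level=100, rng=rng)
```

Capping the level at `max_level` keeps the conditions only mildly corrupted, so they still carry the content the next frames must stay consistent with.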

Introduction

Diffusion model

  • strong results, but at a high computational cost
  • to address this, the paper introduces lightweight video diffusion models that leverage a low-dimensional 3D latent space and significantly outperform previous pixel-space video diffusion models under a limited computational budget
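
To make the saving concrete, here is a small arithmetic sketch of how much smaller a 3D latent is than the pixel-space video. The downsampling factors (4× temporal, 8× spatial) and the 4 latent channels are illustrative assumptions, not values quoted from the paper.

```python
def latent_shape(frames, height, width, t_down, s_down):
    """Latent (frames, height, width) after temporal and spatial downsampling."""
    return (frames // t_down, height // s_down, width // s_down)


def compression_ratio(frames, height, width, t_down, s_down,
                      in_channels=3, latent_channels=4):
    """Number of pixel-space values divided by number of latent values."""
    t, h, w = latent_shape(frames, height, width, t_down, s_down)
    pixel_values = frames * height * width * in_channels
    latent_values = t * h * w * latent_channels
    return pixel_values / latent_values


# A 16-frame 256x256 RGB clip with assumed 4x temporal / 8x spatial factors.
shape = latent_shape(16, 256, 256, t_down=4, s_down=8)       # (4, 32, 32)
ratio = compression_ratio(16, 256, 256, t_down=4, s_down=8)  # 192.0
```

Under these assumed factors the diffusion model denoises 192 times fewer values per clip than a pixel-space model would.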

Leverage diffusion models.

  • However, directly extending diffusion models to video synthesis requires substantial computational resources
  • so the paper proposes LVDM, an efficient video diffusion model that operates in the latent space of videos, and reports SOTA results with this simple model.
  • In addition, to generate long-range videos, the authors introduce a hierarchical LVDM framework that can extend videos far beyond the training length.
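
One way to picture a hierarchical scheme is in two stages: first extend a sparse sequence of latent keyframes autoregressively, then fill in the frames between neighbouring keyframes. The sketch below assumes that two-stage reading; `toy_predict_next` and `toy_fill_between` are hypothetical stand-ins for the prediction and interpolation diffusion models, and scalars stand in for latent frames.

```python
def autoregressive_extend(initial, predict_next, total_len, num_cond):
    """Grow a sequence by repeatedly conditioning on the last num_cond frames."""
    frames = list(initial)
    while len(frames) < total_len:
        cond = frames[-num_cond:]
        frames.extend(predict_next(cond))
    return frames[:total_len]


def interpolate(keyframes, fill_between, stride):
    """Insert stride - 1 frames between each pair of neighbouring keyframes."""
    dense = []
    for a, b in zip(keyframes, keyframes[1:]):
        dense.append(a)
        dense.extend(fill_between(a, b, stride - 1))
    dense.append(keyframes[-1])
    return dense


def toy_predict_next(cond):
    """Hypothetical predictor: continue the sequence by two frames."""
    return [cond[-1] + 1.0, cond[-1] + 2.0]


def toy_fill_between(a, b, count):
    """Hypothetical interpolator: linear blend instead of a diffusion model."""
    return [a + (b - a) * i / (count + 1) for i in range(1, count + 1)]


keyframes = autoregressive_extend([0.0, 1.0], toy_predict_next,
                                  total_len=5, num_cond=2)
video = interpolate(keyframes, toy_fill_between, stride=4)
```

With a stride of 4, five keyframes expand to 17 frames, so the autoregressive stage needs far fewer steps than generating every frame in sequence, which leaves less room for errors to accumulate.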

However, generating long videos tends to suffer from performance degradation.

  • conditional latent perturbation and unconditional guidance, which effectively slow this degradation
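
A minimal sketch of how unconditional guidance could combine two noise predictions at each sampling step, assuming the standard classifier-free guidance form ε̃ = (1 + w)·ε_cond − w·ε_uncond. The function name and the flat-list representation are illustrative, and the paper's exact weighting may differ.

```python
def guided_noise(eps_cond, eps_uncond, weight):
    """Combine conditional and unconditional noise predictions.

    weight = 0 returns the conditional prediction unchanged; other values
    mix in the unconditional prediction as in classifier-free guidance.
    """
    return [(1.0 + weight) * c - weight * u
            for c, u in zip(eps_cond, eps_uncond)]


eps_cond = [2.0, -1.0]   # toy prediction given the previous latent frames
eps_uncond = [1.0, 0.0]  # toy prediction with the condition dropped
plain = guided_noise(eps_cond, eps_uncond, weight=0.0)   # [2.0, -1.0]
guided = guided_noise(eps_cond, eps_uncond, weight=1.0)  # [3.0, -2.0]
```

This only works if the same network can produce both predictions, which is why this kind of guidance typically requires training the model jointly with and without conditions.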

Contributions

  1. LVDM, an efficient diffusion-based approach that first compresses videos into tight latents
  2. a hierarchical framework that operates in the video latent space, enabling the models to generate videos well beyond the training length
  3. conditional latent perturbation and unconditional guidance for mitigating the performance degradation during long video generation
  4. SOTA results on three benchmarks in both short and long video generation settings. Also provides appealing results for open-domain text-to-video generation, demonstrating the effectiveness and generalization of the models

Hierarchical LVDM Framework

This post is licensed under CC BY 4.0 by the author.