
(Lumiere) A Space-Time Diffusion Model for Video Generation

Published: 2024/01 · Paper: Lumiere: A Space-Time Diffusion Model for Video Generation


Architecture

  • goal: synthesize videos with realistic, diverse, and coherent motion

Lumiere Pipeline

  • space-time UNet (STUNet) architecture
    • processes the entire temporal duration of the video at once, in a single pass
    • previous models: key frames → TSR → SSR cascade, which makes global temporal consistency difficult (see the sketch after this list)
      • the base model generates aggressively sub-sampled key frames, so the result is temporally aliased and ambiguous for fast motion
      • this aliasing cannot be fixed afterwards by TSR
      • a domain gap also exists: at inference time, TSR interpolates generated frames rather than the real frames it was trained on ⇒ leads to errors
    • implies that the TSR layers in cascaded models are ineffective
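
To make the aliasing point concrete, here is a small self-contained toy (my own example, not from the paper; the 80 fps timeline, 12 Hz motion, and 16× sub-sampling are assumed numbers). Once fast motion is sampled below the Nyquist rate by the key-frame stage, even an idealized TSR that interpolates perfectly reconstructs the wrong motion:

```python
import numpy as np

t_full = np.linspace(0, 1, 80, endpoint=False)   # "full frame rate": 80 fps timeline
fast_motion = np.sin(2 * np.pi * 12 * t_full)    # a 12 Hz oscillation (fast motion)

t_key = t_full[::16]                             # base model keeps every 16th frame (5 fps)
keyframes = fast_motion[::16]                    # sampled below Nyquist, aliased to ~2 Hz

# An idealized TSR: perfect interpolation between the key frames.
tsr_output = np.interp(t_full, t_key, keyframes)

print(f"mean |error|: {np.abs(tsr_output - fast_motion).mean():.2f}")  # large
```

The 12 Hz oscillation aliases down to roughly 2 Hz at the key-frame rate, so no amount of temporal super-resolution can recover motion that was never sampled.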

STUNet Architecture

  • both spatial and temporal down- and up-sampling
  • leverages a pre-trained T2I diffusion model (a minimal inflation sketch follows this list)

    → directly generates a full-frame-rate, low-resolution video

    → works at multiple space-time scales

  • computations happen in pixel space, not latent space
    • as a result, the model needs a spatial super-resolution (SSR) model to produce high-resolution frames
    • however, this design principle may also be applied to LDMs
  • the inflated SSR network can only operate on short segments of the video due to memory constraints
    • for smooth transitions between segments, MultiDiffusion is employed along the temporal axis (sketched after this list)
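
As a rough picture of the STUNet idea, here is a minimal PyTorch sketch (my own approximation, not the official implementation; `SpaceTimeBlock`, `TinySTUNet`, the channel counts, and the pooling/upsampling choices are all assumptions). A per-frame 2D convolution stands in for a pre-trained T2I layer, a 1D temporal convolution is the added inflation, and the network down- and up-samples time together with space so the full clip is processed in one pass:

```python
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    """Factorized space-time conv: a 2D spatial conv applied per frame (standing
    in for a pre-trained T2I layer) followed by a newly added 1D temporal conv."""
    def __init__(self, ch):
        super().__init__()
        self.spatial = nn.Conv2d(ch, ch, 3, padding=1)   # plays the T2I layer's role
        self.temporal = nn.Conv1d(ch, ch, 3, padding=1)  # temporal inflation

    def forward(self, x):                                # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        y = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        y = y.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal(y)
        return y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)

class TinySTUNet(nn.Module):
    """Downsample in space AND time, process the coarse space-time cube, then
    upsample back, so the network sees the entire duration in a single pass."""
    def __init__(self, ch=8):
        super().__init__()
        self.enc = SpaceTimeBlock(ch)
        self.down = nn.MaxPool3d(2)                            # halves T, H, W at once
        self.mid = SpaceTimeBlock(ch)
        self.up = nn.Upsample(scale_factor=2, mode="trilinear")
        self.dec = SpaceTimeBlock(ch)

    def forward(self, x):                                # x: (B, C, T, H, W)
        skip = self.enc(x)
        y = self.mid(self.down(skip))
        return self.dec(self.up(y) + skip)

video = torch.randn(1, 8, 16, 32, 32)                    # toy (B, C, T, H, W)
print(TinySTUNet()(video).shape)                         # torch.Size([1, 8, 16, 32, 32])
```

Because the bulk of the computation happens on the space-time-downsampled cube, the network can afford to cover the full duration itself instead of delegating temporal coverage to a TSR cascade.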
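And here is a minimal sketch of MultiDiffusion along the temporal axis (the window/stride sizes and the stand-in `denoise_fn` are my assumptions): at each denoising step, the segment-limited model runs on overlapping temporal windows, and the predictions are averaged wherever windows overlap, which is what smooths the transitions between segments:

```python
import torch

def temporal_multidiffusion_step(x, denoise_fn, window=16, stride=8):
    """x: (B, C, T, H, W). Run denoise_fn on overlapping temporal windows and
    reconcile the results by averaging wherever windows overlap."""
    B, C, T, H, W = x.shape
    assert T >= window, "clip shorter than one window"
    out = torch.zeros_like(x)
    count = torch.zeros(1, 1, T, 1, 1)
    starts = list(range(0, T - window + 1, stride))
    if starts[-1] != T - window:                # make sure the tail is covered
        starts.append(T - window)
    for s in starts:
        seg = x[:, :, s:s + window]             # a short segment the SSR can afford
        out[:, :, s:s + window] += denoise_fn(seg)
        count[:, :, s:s + window] += 1.0
    return out / count                          # average over overlapping windows

# Toy check with an identity "denoiser": blending weights must sum correctly.
video = torch.randn(1, 3, 40, 8, 8)
stepped = temporal_multidiffusion_step(video, denoise_fn=lambda seg: seg)
print(torch.allclose(stepped, video))           # True
```

With a real denoiser, neighboring windows disagree slightly, and the averaging in the overlap regions is what keeps segment boundaries smooth.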

Limitations

  • not designed for multi-shot videos or videos with transitions between shots