
(Lumiere) A Space-Time Diffusion Model for Video Generation

Published: 2024/01 · Paper: Lumiere: A Space-Time Diffusion Model for Video Generation


Architecture

  • goal: synthesize videos with realistic, diverse, and coherent motion

Lumiere Pipeline

  • space-time UNet (STUNet) architecture
    • processes the entire temporal duration of the video at once, in a single pass
    • previous models: key frames → TSR → SSR cascade, which makes global temporal consistency difficult (see the sketch after this list)
      • the base model generates aggressively sub-sampled key frames, so the result is temporally aliased and ambiguous for fast motion
      • this aliasing cannot be fixed afterwards by TSR
      • a domain gap also exists: at inference time, TSR interpolates generated frames rather than the real frames it was trained on ⇒ leads to errors
    • implies that the TSR layers in cascaded models are ineffective
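
To make the aliasing point concrete, here is a small self-contained toy (my own example, not from the paper; the 80 fps timeline, 12 Hz motion, and 16× sub-sampling are assumed numbers). Once fast motion is sampled below the Nyquist rate by the key-frame stage, even an idealized TSR that interpolates perfectly reconstructs the wrong motion:

```python
import numpy as np

t_full = np.linspace(0, 1, 80, endpoint=False)   # "full frame rate": 80 fps timeline
fast_motion = np.sin(2 * np.pi * 12 * t_full)    # a 12 Hz oscillation (fast motion)

t_key = t_full[::16]                             # base model keeps every 16th frame (5 fps)
keyframes = fast_motion[::16]                    # sampled below Nyquist, aliased to ~2 Hz

# An idealized TSR: perfect interpolation between the key frames.
tsr_output = np.interp(t_full, t_key, keyframes)

print(f"mean |error|: {np.abs(tsr_output - fast_motion).mean():.2f}")  # large
```

The 12 Hz oscillation aliases down to roughly 2 Hz at the key-frame rate, so no amount of temporal super-resolution can recover motion that was never sampled.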

STUNet Architecture

  • both spatial and temporal down- and up-sampling
  • leverages a pre-trained T2I diffusion model (a minimal inflation sketch follows this list)

    → directly generates a full-frame-rate, low-resolution video

    → works at multiple space-time scales

  • computations happen in pixel space, not latent space
    • as a result, the model needs a spatial super-resolution (SSR) model to produce high-resolution frames
    • however, this design principle may also be applied to LDMs
  • the inflated SSR network can only operate on short segments of the video due to memory constraints
    • for smooth transitions between segments, MultiDiffusion is employed along the temporal axis (sketched after this list)
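
As a rough picture of the STUNet idea, here is a minimal PyTorch sketch (my own approximation, not the official implementation; `SpaceTimeBlock`, `TinySTUNet`, the channel counts, and the pooling/upsampling choices are all assumptions). A per-frame 2D convolution stands in for a pre-trained T2I layer, a 1D temporal convolution is the added inflation, and the network down- and up-samples time together with space so the full clip is processed in one pass:

```python
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    """Factorized space-time conv: a 2D spatial conv applied per frame (standing
    in for a pre-trained T2I layer) followed by a newly added 1D temporal conv."""
    def __init__(self, ch):
        super().__init__()
        self.spatial = nn.Conv2d(ch, ch, 3, padding=1)   # plays the T2I layer's role
        self.temporal = nn.Conv1d(ch, ch, 3, padding=1)  # temporal inflation

    def forward(self, x):                                # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        y = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        y = y.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal(y)
        return y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)

class TinySTUNet(nn.Module):
    """Downsample in space AND time, process the coarse space-time cube, then
    upsample back, so the network sees the entire duration in a single pass."""
    def __init__(self, ch=8):
        super().__init__()
        self.enc = SpaceTimeBlock(ch)
        self.down = nn.MaxPool3d(2)                            # halves T, H, W at once
        self.mid = SpaceTimeBlock(ch)
        self.up = nn.Upsample(scale_factor=2, mode="trilinear")
        self.dec = SpaceTimeBlock(ch)

    def forward(self, x):                                # x: (B, C, T, H, W)
        skip = self.enc(x)
        y = self.mid(self.down(skip))
        return self.dec(self.up(y) + skip)

video = torch.randn(1, 8, 16, 32, 32)                    # toy (B, C, T, H, W)
print(TinySTUNet()(video).shape)                         # torch.Size([1, 8, 16, 32, 32])
```

Because the bulk of the computation happens on the space-time-downsampled cube, the network can afford to cover the full duration itself instead of delegating temporal coverage to a TSR cascade.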
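And here is a minimal sketch of MultiDiffusion along the temporal axis (the window/stride sizes and the stand-in `denoise_fn` are my assumptions): at each denoising step, the segment-limited model runs on overlapping temporal windows, and the predictions are averaged wherever windows overlap, which is what smooths the transitions between segments:

```python
import torch

def temporal_multidiffusion_step(x, denoise_fn, window=16, stride=8):
    """x: (B, C, T, H, W). Run denoise_fn on overlapping temporal windows and
    reconcile the results by averaging wherever windows overlap."""
    B, C, T, H, W = x.shape
    assert T >= window, "clip shorter than one window"
    out = torch.zeros_like(x)
    count = torch.zeros(1, 1, T, 1, 1)
    starts = list(range(0, T - window + 1, stride))
    if starts[-1] != T - window:                # make sure the tail is covered
        starts.append(T - window)
    for s in starts:
        seg = x[:, :, s:s + window]             # a short segment the SSR can afford
        out[:, :, s:s + window] += denoise_fn(seg)
        count[:, :, s:s + window] += 1.0
    return out / count                          # average over overlapping windows

# Toy check with an identity "denoiser": blending weights must sum correctly.
video = torch.randn(1, 3, 40, 8, 8)
stepped = temporal_multidiffusion_step(video, denoise_fn=lambda seg: seg)
print(torch.allclose(stepped, video))           # True
```

With a real denoiser, neighboring windows disagree slightly, and the averaging in the overlap regions is what keeps segment boundaries smooth.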

Limitations

  • not designed for multi-shot videos or videos with transitions between shots