(Lumiere) A Space-Time Diffusion Model for Video Generation
Published : 2024/01 Paper : A Space-Time Diffusion Model for Video Generation
Architecture
- realistic, diverse and coherent motion
- Space-Time U-Net (STUNet) architecture
- processes the entire temporal duration of the video at once, in a single pass
- previous models: generate distant key frames → TSR → SSR cascade, which makes global temporal consistency difficult to achieve
- the base model generates aggressively sub-sampled key frames, so fast motion becomes temporally aliased and ambiguous
- this ambiguity cannot be resolved by the TSR stage
- a domain gap also exists: TSR modules are trained on real down-sampled frames but at inference time are used to interpolate generated frames ⇒ leads to errors
- this implies that the TSR layers in cascaded models are inefficient
- both spatial and temporal down- and up-sampling, leveraging the pretrained T2I diffusion model
→ directly generates a full-frame-rate, low-resolution video
→ by processing it at multiple space-time scales (a minimal sketch of such a block follows)
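To make the multi-scale space-time processing concrete, below is a minimal PyTorch sketch of one factorized space-time block: a 2D spatial convolution of the kind inherited from a pretrained T2I U-Net, followed by a 1D temporal convolution, and a joint space-time down-sampling step. The class name, layer choices, and sizes are illustrative assumptions, not the paper's actual modules.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    """Illustrative sketch (assumption, not the paper's code) of a
    factorized space-time block: spatial 2D conv per frame, then a
    temporal 1D conv per spatial location, then joint down-sampling."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        # down-sample in BOTH space and time, so deeper levels work on a
        # temporally as well as spatially compressed representation
        self.downsample = nn.AvgPool3d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) -- the whole clip is processed in one pass
        b, c, t, h, w = x.shape

        # spatial conv: fold time into the batch dimension
        y = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        y = self.spatial(y)
        y = y.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)

        # temporal conv: fold space into the batch dimension
        y = y.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
        y = self.temporal(y)
        y = y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)

        # joint space-time down-sampling: T, H, W are each halved
        return self.downsample(y)


if __name__ == "__main__":
    clip = torch.randn(1, 8, 16, 32, 32)             # (B, C, T, H, W)
    out = FactorizedSpaceTimeBlock(8)(clip)
    print(out.shape)                                  # torch.Size([1, 8, 8, 16, 16])
```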
- computations are performed in pixel space, not latent space
- as a result, the model needs a spatial super-resolution (SSR) model to produce high-resolution frames
- however, this design principle may also be applied to LDMs
- here, the inflated SSR network can only operate on short segments of the video due to memory constraints
- for smooth transitions between segments, MultiDiffusion is employed along the temporal axis (see the sketch below)
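As a rough illustration of MultiDiffusion along the temporal axis, the sketch below splits a long clip into overlapping temporal windows, denoises each window with a per-segment model, and averages the overlapping predictions so neighbouring segments blend smoothly. `denoise_window` is a hypothetical stand-in for the inflated SSR denoiser, and the window/stride values are illustrative assumptions.

```python
import torch

def temporal_multidiffusion_step(x_t, denoise_window, window=16, stride=8):
    """One denoising step over a long clip using overlapping temporal
    windows (MultiDiffusion applied along the time axis).

    x_t:            (B, C, T, H, W) noisy video at the current timestep
    denoise_window: hypothetical per-segment denoiser (e.g. the inflated
                    SSR network restricted to `window` frames)
    """
    b, c, t, h, w = x_t.shape
    out = torch.zeros_like(x_t)
    weight = torch.zeros(1, 1, t, 1, 1, device=x_t.device)

    # overlapping window start indices; make sure the tail is covered
    starts = list(range(0, max(t - window, 0) + 1, stride))
    if starts[-1] + window < t:
        starts.append(max(t - window, 0))

    for s in starts:
        seg = x_t[:, :, s:s + window]
        out[:, :, s:s + window] += denoise_window(seg)
        weight[:, :, s:s + window] += 1.0

    # frames covered by several windows are averaged -> smooth transitions
    return out / weight


if __name__ == "__main__":
    clip = torch.randn(1, 4, 40, 16, 16)                       # 40-frame clip
    merged = temporal_multidiffusion_step(clip, lambda seg: 0.5 * seg)
    print(merged.shape)                                        # torch.Size([1, 4, 40, 16, 16])
```

Averaging the overlapping window predictions is the simplest reconciliation rule; the key point is that no segment is denoised in isolation, so seams between segments are smoothed at every diffusion step.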
Limitations
- not designed for multi-shot videos or videos with transitions between shots