(LAVIE) High-Quality Video Generation with Cascaded Latent Diffusion Models
Introduction
An integrated video generation framework built on cascaded video latent diffusion models, consisting of:
- base T2V model
- temporal interpolation model
- video super-resolution model
3B Parameters
Key Insights
- incorporating simple temporal self-attention plus rotary positional encoding is enough to capture temporal correlations
- validates that joint image-video finetuning plays a pivotal role in producing high-quality videos
- directly finetuning on video data alone severely hampers the model's concept-mixing ability, leading to catastrophic forgetting and the gradual vanishing of learned prior knowledge
Contributes a diverse video dataset named Vimeo25M: 25M video-text pairs
Methods
- framework consisting of Video Latent Diffusion Models conditioned on text descriptions
- Base T2V model: responsible for generating short, low-resolution key frames
- Temporal Interpolation model: interpolates the short videos and increases the frame rate
- Video Super-Resolution model: synthesizes high-definition results from low-resolution videos → similar in spirit to Imagen's cascade
- each of these models is individually trained with text inputs serving as conditioning information.
- capable of generating videos consisting of 61 frames at a spatial resolution of 1280$\times$2048 pixels (the full cascade is sketched below)
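To make the three-stage cascade concrete, here is a minimal sketch of how the stages chain together. All class names, signatures, and stub outputs are hypothetical stand-ins, not LaVie's actual code; only the frame counts and resolutions (16 frames at 320$\times$512, 61 frames at 1280$\times$2048) follow the paper.

```python
import torch

# Hypothetical stand-ins for the three stages; names and signatures are
# illustrative. Only the frame counts / resolutions follow the paper.
class BaseT2V:
    def generate(self, prompt: str) -> torch.Tensor:
        # short, low-resolution key frames: (T=16, C=3, H=320, W=512)
        return torch.randn(16, 3, 320, 512)

class TemporalInterpolation:
    def interpolate(self, video: torch.Tensor, prompt: str) -> torch.Tensor:
        # quadruples the frame rate: 16 -> 61 frames, every frame re-synthesized
        return torch.randn(61, *video.shape[1:])

class VideoSuperResolution:
    def upscale(self, video: torch.Tensor, prompt: str) -> torch.Tensor:
        # LDM upsampler: (61, 3, 320, 512) -> (61, 3, 1280, 2048)
        t, c, h, w = video.shape
        return torch.randn(t, c, h * 4, w * 4)

def lavie(prompt: str) -> torch.Tensor:
    video = BaseT2V().generate(prompt)                          # stage 1
    video = TemporalInterpolation().interpolate(video, prompt)  # stage 2
    return VideoSuperResolution().upscale(video, prompt)        # stage 3

print(lavie("a panda playing guitar").shape)  # torch.Size([61, 3, 1280, 2048])
```

Each stage is conditioned on the same text prompt, matching the note above that all three models are individually trained with text as conditioning information.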
Base T2V Model
- original LDM is designed as a 2D U-Net and can only process image data
- two modifications (both are sketched in code after this list)
- for each 2D conv layer, inflate the pretrained kernel to incorporate an additional temporal dimension → pseudo-3D convolutional layer
- this inflation converts any input tensor of size B × C × H × W to B × C × 1 × H × W by introducing an extra temporal axis
- extend the original transformer block to a spatiotemporal transformer by including a temporal attention layer after each spatial layer
- incorporates Rotary Positional Encoding (RoPE) from recent LLMs into the temporal attention layer
- unlike previous methods that introduce an additional temporal transformer to model time, LaVie applies temporal attention directly within the transformer block itself
- generates videos with 16 frames at a resolution of 320$\times$512
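A minimal PyTorch sketch of the two modifications, assuming a (B, C, T, H, W) video tensor layout. The module structure and the single-head attention are simplifications of my own, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
from einops import rearrange

class PseudoConv3d(nn.Module):
    """Inflate a pretrained 2D conv: fold time into the batch so the
    spatial kernel is reused per frame (sketch, not LaVie's exact code)."""
    def __init__(self, conv2d: nn.Conv2d):
        super().__init__()
        self.conv2d = conv2d  # pretrained spatial kernel, reused as-is

    def forward(self, x):  # x: (B, C, T, H, W)
        b = x.shape[0]
        x = rearrange(x, 'b c t h w -> (b t) c h w')
        x = self.conv2d(x)                      # per-frame spatial conv
        return rearrange(x, '(b t) c h w -> b c t h w', b=b)

def rope(x):
    """Rotary positional encoding over the time axis. x: (..., T, D), D even."""
    T, D = x.shape[-2], x.shape[-1]
    theta = 10000 ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)
    ang = torch.arange(T, dtype=torch.float32)[:, None] * theta  # (T, D/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)

class TemporalAttention(nn.Module):
    """Single-head self-attention along T, applied per spatial location,
    with RoPE on queries and keys (simplified sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        seq = rearrange(x, 'b c t h w -> (b h w) t c')
        q, k, v = self.to_qkv(seq).chunk(3, dim=-1)
        q, k = rope(q), rope(k)  # relative temporal order via rotation
        attn = torch.softmax(q @ k.transpose(-2, -1) / c ** 0.5, dim=-1)
        out = self.proj(attn @ v)
        return x + rearrange(out, '(b h w) t c -> b c t h w', b=b, h=h, w=w)

# toy latent-sized input: 16 frames, 64 channels
video = torch.randn(1, 64, 16, 40, 64)
video = PseudoConv3d(nn.Conv2d(64, 64, 3, padding=1))(video)
print(TemporalAttention(64)(video).shape)  # torch.Size([1, 64, 16, 40, 64])
```

Folding time into the batch dimension keeps the pretrained spatial weights usable without retraining, while the temporal attention (with RoPE encoding frame order) is the only new component that must learn motion.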
Primary Objective of the Base Model
- generate high quality key frames while also preserving diversity and capturing the compositional nature of videos
- applies a joint finetuning approach using both image and video data to avoid catastrophic forgetting (see the sketch below)
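One plausible way to realize the joint finetuning, written as a sketch: images enter the same denoising objective as single-frame clips. The mixing probability, noise schedule, and function names here are my assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

# DDPM-style cumulative schedule (assumed linear betas, 1000 steps)
alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)

def joint_finetune_step(model, video_batch, image_batch, p_image=0.5):
    """One training step that mixes modalities: with probability p_image,
    an image batch is lifted to a single-frame clip so both data types
    share the same epsilon-prediction loss (illustrative sketch)."""
    if torch.rand(()) < p_image:
        x = image_batch.unsqueeze(2)  # (B, C, H, W) -> (B, C, T=1, H, W)
    else:
        x = video_batch               # (B, C, T, H, W)
    b = x.shape[0]
    t = torch.randint(0, 1000, (b,))
    a = alphas_cumprod[t].view(b, 1, 1, 1, 1)
    noise = torch.randn_like(x)
    noisy = a.sqrt() * x + (1 - a).sqrt() * noise  # forward diffusion
    return F.mse_loss(model(noisy, t), noise)      # epsilon prediction

# smoke test with a dummy denoiser that predicts zeros
dummy = lambda x, t: torch.zeros_like(x)
loss = joint_finetune_step(dummy,
                           torch.randn(2, 4, 16, 40, 64),  # video latents
                           torch.randn(2, 4, 40, 64))      # image latents
print(loss.item())
```

The point of the mixed batches is that image data keeps refreshing the spatial prior learned during T2I pretraining, so the video objective cannot erase it.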
Temporal Interpolation Model
- enhance the smoothness of the videos and synthesize richer temporal details
- trains a diffusion U-Net to quadruple the frame rate of the base video
- 16 → 61 frames (see the frame-count arithmetic after this list)
- in other words, every frame in the output is newly synthesized, in contrast to techniques where the input frames remain unchanged during interpolation
- the diffusion U-Net is conditioned on the text prompt, which serves as additional guidance for the temporal interpolation process
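Why a 4× interpolation of 16 frames yields 61 rather than 64: three new frames are synthesized in each of the 15 gaps between consecutive key frames (assuming gap-wise interpolation, which is what makes the 16 → 61 count work out).

```python
base_frames = 16
new_per_gap = 3                                        # 4x frame rate within each gap
total = base_frames + (base_frames - 1) * new_per_gap  # 16 + 15 * 3
print(total)  # 61
```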
Video Super Resolution Model
- an LDM upsampler built similarly to the base model, lifting the low-resolution videos to 1280$\times$2048
Results
Limitations
- multi-subject generation: struggles to compose scenes with multiple distinct subjects
- hand generation: often fails to produce correct hands (e.g., wrong number of fingers)