(Gen-1) Structure and Content-Guided Video Synthesis with Diffusion Models
Published : 2023/02 Paper : Structure and Content-Guided Video Synthesis with Diffusion Models
Introduction
- extend the latent diffusion models to video generation by introducing temporal layers into a pretrained image model and training jointly on images and videos
- structure and content-aware model that modifies videos guided by example images or texts
- full control over temporal, content and structure consistency.
- jointly training on image and video data enables inference-time control over temporal consistency
- for structure consistency, training on varying levels of detail in the representation allows choosing the desired setting during inference
Targeting decoupling, Gen-1 defines the content latents as CLIP image embeddings and the structure latents as monocular depth estimates, attaining superior decoupled controllability (as noted in the ModelScopeT2V paper)
Method
Video in terms of content and structure
The goal of the model is to edit the content of a video while retaining its structure
- structure : characteristics describing its geometry and dynamics
- shapes and locations of the subjects as well as their temporal changes
- content : features describing the appearance and semantics of the video
- colors and styles of objects and the lighting of the scene
Video generation models often seem to factorize a video along some criterion, as MoCoGAN does.
3.2 Spatiotemporal Latent Diffusion
To jointly learn an image and a video model with shared parameters (see the sketch after this list):
- extend image architecture by introducing temporal layers, which are only active for video inputs
- all other layers are shared between the image and video model
- autoencoder remains fixed and processes each frame in the video independently
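A minimal PyTorch sketch of this idea; the exact block structure, layer types, and shapes are assumptions, not the released architecture. Each shared spatial layer is followed by a temporal layer that only mixes information across frames when the input is a video batch.

```python
import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    """Spatial layer shared with the image model, followed by a temporal layer
    that is only active for video inputs (skipped for single images)."""

    def __init__(self, channels: int):
        super().__init__()
        # shared spatial layer (pretrained on images)
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # new temporal layer (1D conv across the frame axis), trained on videos
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, num_frames: int = 1) -> torch.Tensor:
        # x: (batch * frames, channels, height, width)
        x = self.spatial(x)
        if num_frames > 1:  # temporal mixing only for video batches
            bf, c, h, w = x.shape
            b = bf // num_frames
            # fold spatial positions into the batch, expose the frame axis
            t = x.view(b, num_frames, c, h, w).permute(0, 3, 4, 2, 1)
            t = t.reshape(b * h * w, c, num_frames)
            t = self.temporal(t)
            t = t.reshape(b, h, w, c, num_frames).permute(0, 4, 3, 1, 2)
            x = t.reshape(bf, c, h, w)
        return x
```

For image batches (num_frames=1) the temporal layer is skipped entirely, so the block reduces to the shared image model.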
3.3 Representing Content and Structure
Conditional Diffusion Models
- during training : structure and content representations must be derived from the training video itself (see the training-step sketch after this list)
- during inference : structure and content are derived from an input video and from a text prompt respectively
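A minimal sketch, assuming a standard epsilon-prediction latent-diffusion objective, of how the two signals condition the denoiser during training; the `unet(z_t, t, structure, content)` call signature is an assumption and the paper's exact parameterization may differ.

```python
import torch
import torch.nn.functional as F


def training_step(unet, z0, structure, content, alphas_cumprod):
    """One epsilon-prediction step of the conditional latent diffusion model.
    z0 are the clean video latents; structure and content are the two
    conditioning signals described above (call signature is an assumption)."""
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)
    noise = torch.randn_like(z0)
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * noise   # forward diffusion of the latents
    pred = unet(z_t, t, structure, content)          # conditioned denoiser
    return F.mse_loss(pred, noise)
```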
Content Representation
to infer the content representation from both text and video inputs → use CLIP image embeddings to represent content (sketched below)
- video input
- select one of the input frames randomly during training
- text input : train a prior model that allows sampling image embeddings from text embeddings
This makes it possible to train a prior model that can generate image embeddings from text.
- decoder visualizations demonstrate that CLIP embeddings have increased sensitivity to semantic and stylistic properties, while being more invariant towards precise geometric attributes
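A sketch of the content path under stated assumptions: a Hugging Face CLIP checkpoint stands in for the image encoder (the paper does not pin this exact model), and `prior` is a placeholder for the separately trained model that samples image embeddings from text embeddings.

```python
import random

import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint choice; the paper does not specify this exact model.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")


def content_embedding_train(frames):
    """Training: pick one frame of the clip at random and embed it with CLIP."""
    frame = random.choice(frames)                      # list of PIL images
    inputs = processor(images=frame, return_tensors="pt")
    with torch.no_grad():
        return clip.get_image_features(**inputs)       # (1, 768) content vector


def content_embedding_infer(prompt, prior):
    """Inference: embed the text, then let the trained prior map the text
    embedding into CLIP image-embedding space (`prior` is a placeholder)."""
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = clip.get_text_features(**inputs)
    return prior(text_emb)                             # sampled image embedding
```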
Structure Representation
Suitable representations are chosen to introduce inductive biases that guide the model towards the intended behavior while decreasing correlations between structure and content.
- depth estimates from the input video : encode significantly less content information compared to simpler structure representations (a sketch follows this list)
- We employ an information-destroying process based on a blur operator, which improves stability compared to other approaches such as adding noise.
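A sketch of the structure path; MiDaS loaded via torch.hub is only an example monocular depth estimator, the per-level blur schedule is an assumption, and the frames are assumed to be already resized and normalized as the estimator expects.

```python
import torch
import torchvision.transforms.functional as TF

# MiDaS as an example monocular depth estimator (loaded via torch.hub).
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()


def structure_representation(frames: torch.Tensor, blur_level: int = 0) -> torch.Tensor:
    """frames: (num_frames, 3, H, W), preprocessed for the estimator.
    Returns per-frame depth maps, blurred to the requested level; higher levels
    destroy more structural detail."""
    with torch.no_grad():
        depth = midas(frames)                  # (num_frames, H', W') relative depth
    depth = depth.unsqueeze(1)                 # add a channel axis
    if blur_level > 0:
        sigma = float(2 ** blur_level)         # assumed schedule: blur grows per level
        kernel = 2 * int(3 * sigma) + 1        # odd kernel covering ~3 sigma
        depth = TF.gaussian_blur(depth, kernel_size=kernel, sigma=sigma)
    return depth
```

Training on several blur levels and exposing the level at inference is what allows choosing how tightly the output follows the input video's geometry.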
Conditioning Mechanisms
- structure (outline)
- contains a significant portion of the spatial information
- use concatenation for conditioning to make effective use of the spatial information
- content (color, style)
- not tied to particular locations
- leverage cross-attention, which can effectively transport this information to any position (see the sketch below)
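A toy sketch of the two injection routes; the channel counts, single attention head, and treating the CLIP embedding as one token are assumptions. The spatially aligned structure latent is concatenated along the channel axis, while the content embedding reaches every spatial position through cross-attention.

```python
import torch
import torch.nn as nn


class ConditionedBlock(nn.Module):
    """Toy UNet block showing the two conditioning routes:
    structure -> channel concatenation, content -> cross-attention."""

    def __init__(self, latent_ch: int = 4, struct_ch: int = 4, content_dim: int = 768):
        super().__init__()
        # structure latents are concatenated, so the conv sees extra input channels
        self.conv_in = nn.Conv2d(latent_ch + struct_ch, latent_ch, 3, padding=1)
        # the content embedding is injected through cross-attention
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=latent_ch, kdim=content_dim, vdim=content_dim,
            num_heads=1, batch_first=True,
        )

    def forward(self, z_t, structure, content):
        # z_t: (B, latent_ch, H, W)   structure: (B, struct_ch, H, W), spatially aligned
        # content: (B, 1, content_dim) CLIP embedding treated as a single token
        h = self.conv_in(torch.cat([z_t, structure], dim=1))
        b, c, hh, ww = h.shape
        tokens = h.flatten(2).transpose(1, 2)              # (B, H*W, C) queries
        attn_out, _ = self.cross_attn(tokens, content, content)
        tokens = tokens + attn_out                         # content reaches any position
        return tokens.transpose(1, 2).reshape(b, c, hh, ww)
```

Concatenation keeps the depth guidance pixel-aligned with the latent, while cross-attention lets the global content vector influence any location.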