(Gen-1) Structure and Content-Guided Video Synthesis with Diffusion Models
Published : 2023/02 Paper : Structure and Content-Guided Video Synthesis with Diffusion Models
Introduction
- extend the latent diffusion models to video generation by introducing temporal layers into a pretrained image model and training jointly on images and videos
- structure and content-aware model that modifies videos guided by example images or texts
- full control over temporal, content and structure consistency.
- jointly training on image and video data enables inference-time control over temporal consistency
- for structure consistency, training on varying levels of detail in the representation allows choosing the desired setting during inference
Targeting decoupling, Gen-1 defines the content latents as CLIP image embeddings and the structure latents as monocular depth estimates, attaining superior decoupled controllability (as noted in the ModelScopeT2V paper)
Method
Video in terms of content and structure
The goal of the model is to edit the content of a video while retaining its structure
- structure : characteristics describing its geometry and dynamics
- shapes and locations of the subjects as well as their temporal changes
- content : features describing the appearance and semantics of the video
- colors and styles of objects and the lighting of the scene
Video generation models often seem to factorize a video along some criterion, as MoCoGAN does.
3.2 Spatiotemporal Latent Diffusion
To jointly learn an image and a video model with shared parameters (see the sketch after this list):
- extend image architecture by introducing temporal layers, which are only active for video inputs
- all other layers are shared between the image and video model
- autoencoder remains fixed and processes each frame in the video independently
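A minimal PyTorch sketch of this idea; the exact block structure, layer types, and shapes are assumptions, not the released architecture. Each shared spatial layer is followed by a temporal layer that only mixes information across frames when the input is a video batch.

```python
import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    """Spatial layer shared with the image model, followed by a temporal layer
    that is only active for video inputs (skipped for single images)."""

    def __init__(self, channels: int):
        super().__init__()
        # shared spatial layer (pretrained on images)
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # new temporal layer (1D conv across the frame axis), trained on videos
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, num_frames: int = 1) -> torch.Tensor:
        # x: (batch * frames, channels, height, width)
        x = self.spatial(x)
        if num_frames > 1:  # temporal mixing only for video batches
            bf, c, h, w = x.shape
            b = bf // num_frames
            # fold spatial positions into the batch, expose the frame axis
            t = x.view(b, num_frames, c, h, w).permute(0, 3, 4, 2, 1)
            t = t.reshape(b * h * w, c, num_frames)
            t = self.temporal(t)
            t = t.reshape(b, h, w, c, num_frames).permute(0, 4, 3, 1, 2)
            x = t.reshape(bf, c, h, w)
        return x
```

For image batches (num_frames=1) the temporal layer is skipped entirely, so the block reduces to the shared image model.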
3.3 Representing Content and Structure
Conditional Diffusion Models
- during training : structure and content representations must be derived from the training video itself (see the training-step sketch after this list)
- during inference : structure and content are derived from an input video and from a text prompt respectively
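A minimal sketch, assuming a standard epsilon-prediction latent-diffusion objective, of how the two signals condition the denoiser during training; the `unet(z_t, t, structure, content)` call signature is an assumption and the paper's exact parameterization may differ.

```python
import torch
import torch.nn.functional as F


def training_step(unet, z0, structure, content, alphas_cumprod):
    """One epsilon-prediction step of the conditional latent diffusion model.
    z0 are the clean video latents; structure and content are the two
    conditioning signals described above (call signature is an assumption)."""
    b = z0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=z0.device)
    noise = torch.randn_like(z0)
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * noise   # forward diffusion of the latents
    pred = unet(z_t, t, structure, content)          # conditioned denoiser
    return F.mse_loss(pred, noise)
```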
Content Representation
to infer the content representation from both text and video inputs → use CLIP image embeddings to represent content (sketched below)
- video input
- select one of the input frames randomly during training
- text input : train a prior model that allows sampling image embeddings from text embeddings
This makes it possible to train a prior model that can generate image embeddings from text.
- decoder visualizations demonstrate that CLIP embeddings have increased sensitivity to semantic and stylistic properties, while being more invariant towards precise geometric attributes
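A sketch of the content path under stated assumptions: a Hugging Face CLIP checkpoint stands in for the image encoder (the paper does not pin this exact model), and `prior` is a placeholder for the separately trained model that samples image embeddings from text embeddings.

```python
import random

import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint choice; the paper does not specify this exact model.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")


def content_embedding_train(frames):
    """Training: pick one frame of the clip at random and embed it with CLIP."""
    frame = random.choice(frames)                      # list of PIL images
    inputs = processor(images=frame, return_tensors="pt")
    with torch.no_grad():
        return clip.get_image_features(**inputs)       # (1, 768) content vector


def content_embedding_infer(prompt, prior):
    """Inference: embed the text, then let the trained prior map the text
    embedding into CLIP image-embedding space (`prior` is a placeholder)."""
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = clip.get_text_features(**inputs)
    return prior(text_emb)                             # sampled image embedding
```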
Structure Representation
Suitable representations are chosen to introduce inductive biases that guide the model towards the intended behavior while decreasing correlations between structure and content.
- depth estimates from the input video : encode significantly less content information compared to simpler structure representations (a sketch follows this list)
- We employ an information-destroying process based on a blur operator, which improves stability compared to other approaches such as adding noise.
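A sketch of the structure path; MiDaS loaded via torch.hub is only an example monocular depth estimator, the per-level blur schedule is an assumption, and the frames are assumed to be already resized and normalized as the estimator expects.

```python
import torch
import torchvision.transforms.functional as TF

# MiDaS as an example monocular depth estimator (loaded via torch.hub).
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()


def structure_representation(frames: torch.Tensor, blur_level: int = 0) -> torch.Tensor:
    """frames: (num_frames, 3, H, W), preprocessed for the estimator.
    Returns per-frame depth maps, blurred to the requested level; higher levels
    destroy more structural detail."""
    with torch.no_grad():
        depth = midas(frames)                  # (num_frames, H', W') relative depth
    depth = depth.unsqueeze(1)                 # add a channel axis
    if blur_level > 0:
        sigma = float(2 ** blur_level)         # assumed schedule: blur grows per level
        kernel = 2 * int(3 * sigma) + 1        # odd kernel covering ~3 sigma
        depth = TF.gaussian_blur(depth, kernel_size=kernel, sigma=sigma)
    return depth
```

Training on several blur levels and exposing the level at inference is what allows choosing how tightly the output follows the input video's geometry.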
Conditioning Mechanisms
- structure (outline)
- contains a significant portion of the spatial information
- use concatenation for conditioning to make effective use of the spatial information
- content (color, style)
- not tied to particular locations
- leverage cross-attention, which can effectively transport this information to any position (see the sketch below)
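A toy sketch of the two injection routes; the channel counts, single attention head, and treating the CLIP embedding as one token are assumptions. The spatially aligned structure latent is concatenated along the channel axis, while the content embedding reaches every spatial position through cross-attention.

```python
import torch
import torch.nn as nn


class ConditionedBlock(nn.Module):
    """Toy UNet block showing the two conditioning routes:
    structure -> channel concatenation, content -> cross-attention."""

    def __init__(self, latent_ch: int = 4, struct_ch: int = 4, content_dim: int = 768):
        super().__init__()
        # structure latents are concatenated, so the conv sees extra input channels
        self.conv_in = nn.Conv2d(latent_ch + struct_ch, latent_ch, 3, padding=1)
        # the content embedding is injected through cross-attention
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=latent_ch, kdim=content_dim, vdim=content_dim,
            num_heads=1, batch_first=True,
        )

    def forward(self, z_t, structure, content):
        # z_t: (B, latent_ch, H, W)   structure: (B, struct_ch, H, W), spatially aligned
        # content: (B, 1, content_dim) CLIP embedding treated as a single token
        h = self.conv_in(torch.cat([z_t, structure], dim=1))
        b, c, hh, ww = h.shape
        tokens = h.flatten(2).transpose(1, 2)              # (B, H*W, C) queries
        attn_out, _ = self.cross_attn(tokens, content, content)
        tokens = tokens + attn_out                         # content reaches any position
        return tokens.transpose(1, 2).reshape(b, c, hh, ww)
```

Concatenation keeps the depth guidance pixel-aligned with the latent, while cross-attention lets the global content vector influence any location.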