
(Gen-1) Structure and Content-Guided Video Synthesis with Diffusion Models

Published: 2023/02 | Paper: Structure and Content-Guided Video Synthesis with Diffusion Models


Introduction

  • extend latent diffusion models to video generation by introducing temporal layers into a pretrained image model and training jointly on images and videos
  • structure and content-aware model that modifies videos guided by example images or texts
  • full control over temporal, content, and structure consistency:
    • jointly training on image and video data enables inference-time control over temporal consistency
    • for structure consistency, training on varying levels of detail in the representation allows choosing the desired setting during inference

Targeting this decoupling, Gen-1 defines the content latents as CLIP image embeddings and the structure latents as monocular depth estimates, attaining superior decoupled controllability (an assessment made in the ModelScopeT2V paper).

Experiments


Video in terms of content and structure

The goal of the model is to edit the content of a video while retaining its structure.

  • structure: characteristics describing its geometry and dynamics
    • the shapes and locations of subjects as well as their temporal changes
  • content: features describing the appearance and semantics of the video
    • the colors and styles of objects and the lighting of the scene

Video generation models often seem to decompose video along some axis like this, much as MoCoGAN does (MoCoGAN separates motion from content).

**similar to Make-A-Video, Video Diffusion Models, and ModelScopeT2V**

3.2 Spatiotemporal Latent Diffusion

To jointly learn an image and a video model with shared parameters:

  • extend the image architecture by introducing temporal layers, which are only active for video inputs (a minimal sketch follows this list)
  • all other layers are shared between the image and video model
  • the autoencoder remains fixed and processes each frame of the video independently
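
As a rough illustration of how such a temporal layer can be added without disturbing the pretrained image model: a zero-initialized residual 1D convolution over the frame axis acts as a no-op at the start of training and is skipped entirely for single-frame (image) inputs. This is a minimal PyTorch sketch under those assumptions, not the official architecture; `TemporalMixingBlock` is an illustrative name.

```python
# Minimal sketch (PyTorch), not the official Gen-1 code: a residual temporal
# layer that mixes information across frames and is inactive for image inputs.
import torch
import torch.nn as nn

class TemporalMixingBlock(nn.Module):
    """Zero-initialized 1D convolution over the frame axis, so the pretrained
    image model's behavior is preserved at the start of training."""

    def __init__(self, channels: int):
        super().__init__()
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        nn.init.zeros_(self.temporal_conv.weight)  # start as a no-op (residual path)
        nn.init.zeros_(self.temporal_conv.bias)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * frames, channels, height, width)
        if num_frames == 1:
            return x  # image input: the temporal layer is skipped
        bf, c, h, w = x.shape
        b = bf // num_frames
        # fold spatial positions into the batch, expose the frame axis to the conv
        x = x.reshape(b, num_frames, c, h, w).permute(0, 3, 4, 2, 1)  # (b, h, w, c, f)
        x = x.reshape(b * h * w, c, num_frames)
        x = x + self.temporal_conv(x)  # residual temporal mixing
        x = x.reshape(b, h, w, c, num_frames).permute(0, 4, 3, 1, 2)  # (b, f, c, h, w)
        return x.reshape(bf, c, h, w)
```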

3.3 Representing Content and Structure

Conditional Diffusion Models

  • during training: structure and content representations must be derived from the training video itself
  • during inference: structure is derived from an input video and content from a text prompt (the objective below makes this concrete)
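
As a sketch of the training setup, the conditional model can be written with the standard denoising loss of latent diffusion, extended with the two conditioning signals, where $s(x)$ and $c(x)$ denote the structure and content representations derived from the training video $x$ (notation assumed here, not taken verbatim from the paper):

$$
\min_\theta \; \mathbb{E}_{x,\, t,\, \epsilon \sim \mathcal{N}(0, I)}
\Big[ \big\lVert \epsilon - \epsilon_\theta\big(z_t,\, t,\, s(x),\, c(x)\big) \big\rVert_2^2 \Big]
$$

At inference, the same $\epsilon_\theta$ is reused, but $s$ comes from the user's input video and $c$ from the text prompt.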

Content Representation

To infer the content representation from both text and video inputs → CLIP image embeddings are used to represent content

  • video input
    • select one of the input frames randomly during training
    • train a prior model that allows sampling image embeddings from text embeddings

      This makes it possible to train a prior model that can generate images from text.

  • decoder visualizations demonstrate that CLIP embeddings are more sensitive to semantic and stylistic properties, while being more invariant to precise geometric attributes
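
A minimal sketch of the random-frame embedding step during training, assuming the Hugging Face `transformers` CLIP API rather than the paper's exact pipeline; `content_embedding` is an illustrative helper:

```python
# Hedged sketch: embed one randomly chosen frame of a training clip with CLIP.
# Assumes Hugging Face `transformers`; not the paper's exact pipeline.
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip_vision = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

def content_embedding(frames):
    """frames: list of PIL images belonging to one training video."""
    idx = torch.randint(len(frames), (1,)).item()    # random frame, as in the paper
    inputs = processor(images=frames[idx], return_tensors="pt")
    with torch.no_grad():
        return clip_vision(**inputs).image_embeds    # (1, 768) content vector
```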

Structure Representation

Gen-1 chooses suitable representations that introduce inductive biases guiding the model toward the intended behavior while decreasing correlations between structure and content.

  • depth estimates from the input video: encode significantly less content information than simpler structure representations
  • an information-destroying process based on a blur operator is applied to the depth estimates, which improves stability compared to other approaches such as adding noise (sketched below)
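
A minimal sketch of this pipeline under two assumptions: MiDaS as the monocular depth estimator, and a Gaussian blur whose kernel grows with a chosen level as the information-destroying process. The function name and blur parameterization are illustrative:

```python
# Illustrative sketch, not the official code: per-frame depth, then a blur of
# selectable strength destroys fine detail in the structure signal.
import torch
import torchvision.transforms.functional as TF

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")  # assumed depth model
midas.eval()

def structure_representation(frame: torch.Tensor, blur_level: int) -> torch.Tensor:
    """frame: (1, 3, H, W) preprocessed for MiDaS; blur_level: 0 (sharp) .. k (coarse)."""
    with torch.no_grad():
        depth = midas(frame)                 # (1, H', W') relative depth
    depth = depth.unsqueeze(1)               # (1, 1, H', W')
    if blur_level > 0:
        kernel = 2 * blur_level + 1          # odd kernel size required
        depth = TF.gaussian_blur(depth, kernel_size=kernel)
    return depth
```

During training, `blur_level` would be sampled at random so the model sees all detail levels; at inference the user picks the desired setting.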

Conditioning Mechanisms

  • structure (outline)
    • contains a significant portion of the spatial information
    • use concatenation for conditioning to make effective use of the spatial information
  • content (color, style)
    • not tied to particular locations
    • leverage cross-attention, which can effectively transport this information to any position (both conditioning paths are sketched below)
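
A minimal sketch of the two conditioning paths in one block, assuming standard PyTorch modules; the class and argument names are illustrative, not from the paper:

```python
# Sketch: structure enters by channel-wise concatenation with the noisy latent,
# content enters as keys/values of a cross-attention over all spatial positions.
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    def __init__(self, latent_ch: int, struct_ch: int, content_dim: int, heads: int = 8):
        super().__init__()
        # concatenation: structure shares the spatial grid with the latent
        self.in_conv = nn.Conv2d(latent_ch + struct_ch, latent_ch, 3, padding=1)
        # cross-attention: a global content vector can reach any position
        # (latent_ch must be divisible by heads)
        self.cross_attn = nn.MultiheadAttention(latent_ch, heads, kdim=content_dim,
                                                vdim=content_dim, batch_first=True)

    def forward(self, z_t, structure, content):
        # z_t: (B, C, H, W) noisy latent; structure: (B, S, H, W) depth-based latent
        # content: (B, 1, D) CLIP image embedding
        h = self.in_conv(torch.cat([z_t, structure], dim=1))
        b, c, hh, ww = h.shape
        tokens = h.flatten(2).transpose(1, 2)          # (B, H*W, C) queries
        attended, _ = self.cross_attn(tokens, content, content)
        tokens = tokens + attended                     # residual injection
        return tokens.transpose(1, 2).reshape(b, c, hh, ww)
```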