
(CogVideo) Large-scale Pretraining for Text-to-Video Generation via Transformers

Published : 2022/05

Paper : Large-scale Pretraining for Text-to-Video Generation via Transformers


&nbsp;

To align text and video : a multi-frame-rate hierarchical training strategy

  • the largest and the first open-source pretrained transformer model for text-to-video (T2V) generation in the general domain
  • multi-frame-rate hierarchical training to better align text-clip pairs → improves generation accuracy

&nbsp;

Challenges

  • The potentially huge computation cost makes training from scratch unaffordable
  • weak text-video relevance in available datasets
  • generated video frames tend to gradually deviate from the text prompt
    • possibly because of the scarcity of annotated text-video data compared to text-image data
    • the duration of videos varies
      • Previous models split the video into many clips with a fixed number of frames for training, which destroys the alignment between the text and its temporal counterparts in the video. If a “drinking” video is split into four individual clips of “holding a glass”, “lifting”, “drinking” and “putting down”, all with the same text “drinking”, the model is confused about the accurate meaning of drinking (see the sketch after this list).
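A rough sketch of the multi-frame-rate idea (not the authors' code; `NUM_FRAMES`, the candidate rates and the video interface are assumptions for illustration): instead of cutting a long action into several fixed-length clips that all carry the same caption, choose the highest frame rate at which a fixed number of frames still spans the whole video, so one training clip covers the complete action described by the text.

```python
import numpy as np

# Illustrative constants only, not the paper's exact settings.
NUM_FRAMES = 5                    # fixed number of frames per training clip
FRAME_RATES = (1, 2, 4, 8)        # candidate sampling rates in frames per second

def sample_clip(video, fps):
    """Sample NUM_FRAMES frames that span the whole video.

    `video` is an array of shape (num_frames, H, W, C) recorded at `fps`.
    The highest candidate frame rate whose NUM_FRAMES frames still cover
    the full duration is chosen, so a caption like "drinking" stays paired
    with the complete action rather than a fragment of it.
    """
    duration = len(video) / fps                            # video length in seconds
    covering = [r for r in FRAME_RATES if NUM_FRAMES / r >= duration]
    rate = max(covering) if covering else min(FRAME_RATES)
    step = max(int(round(fps / rate)), 1)                  # stride between sampled frames
    indices = np.clip(np.arange(NUM_FRAMES) * step, 0, len(video) - 1)
    return video[indices], rate                            # rate is kept to condition the model

# usage: frames, rate = sample_clip(video, fps=25)
```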

&nbsp;

Contributions

  • CogVideo
    • 9B-parameter transformer
    • inherits pretrained T2I model CogView2
  • multi-frame-rate hierarchical training strategy → to better align text and video clips (a rough sketch of the conditioning follows below)
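Since clips are sampled at different frame rates, the model has to be told how densely a clip was sampled. A minimal sketch, assuming a frame-rate token is placed before the text tokens and the per-frame image tokens (the image tokenizer would come from the inherited text-to-image backbone, CogView2 in the paper); the exact layout and special tokens in CogVideo may differ, and the names here are hypothetical:

```python
import torch

def build_training_sequence(frame_rate_token, text_tokens, frame_tokens):
    """Assemble one autoregressive training sequence (illustrative layout):
    [frame-rate token] [text tokens] [tokens of frame 1] ... [tokens of frame N].

    `text_tokens` is a 1-D tensor of caption token ids, and `frame_tokens`
    is a list of 1-D tensors of image token ids, one per sampled frame.
    """
    pieces = [torch.tensor([frame_rate_token]), text_tokens, *frame_tokens]
    return torch.cat(pieces)
```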