
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers - 2022



Research Area:  Machine Learning

Abstract:

Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model from understanding complex movement semantics. In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.
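The core idea behind multi-frame-rate conditioning, as described in the abstract, is that the sampling frame rate of a clip is exposed to the model so that one transformer can be trained on clips sampled at different rates. A minimal sketch of one way this can be wired up, assuming the frame rate is encoded as a special token prepended to the text tokens (the token IDs, vocabulary layout, and helper function here are invented for illustration and do not match CogVideo's actual tokenizer):

```python
# Hypothetical illustration: condition an autoregressive transformer on the
# clip's frame rate by prepending a dedicated frame-rate token to the input.
# All token IDs below are made up for the sketch.

FRAME_RATE_TOKENS = {1: 50000, 2: 50001, 4: 50002, 8: 50003}  # fps -> token id

def build_input_sequence(frame_rate, text_tokens, frame_tokens):
    """Lay out [frame-rate token] + text tokens + flattened per-frame tokens."""
    if frame_rate not in FRAME_RATE_TOKENS:
        raise ValueError(f"unsupported frame rate: {frame_rate}")
    seq = [FRAME_RATE_TOKENS[frame_rate]]
    seq.extend(text_tokens)
    for frame in frame_tokens:  # each frame is a list of discrete image-token ids
        seq.extend(frame)
    return seq

# Example: a 2-token caption and two 4-token frames sampled at 4 fps.
seq = build_input_sequence(4, [101, 102], [[1, 2, 3, 4], [5, 6, 7, 8]])
```

Because the frame-rate token appears before all frame tokens, the model can learn rate-dependent motion dynamics, and at inference time the rate can be varied to control temporal density.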

Keywords:  
transformers
computation cost
CogVideo
video clips
machine and human evaluations

Author(s) Name:  Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang

Subject Area:  Computer Vision and Pattern Recognition (arXiv cs.CV)

Conference name:  

Publisher name:  arXiv

DOI:  https://doi.org/10.48550/arXiv.2205.15868

Volume Information:  Volume 1