VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

Zhiyu Tan1 Xiaomeng Yang2 Luozheng Qin2 Hao Li1†
1Fudan University 2Shanghai Academy of AI for Science
†Corresponding author & Project lead



[arXiv]      [Code]      [Model]      [BibTeX]

🔥News

✅ 2024.08.05 We released the paper and the project page.

✅ 2024.08.07 We released the code and the text-to-video models trained on VidGen-1M.

📝 The dataset is being uploaded; please stay tuned.

Abstract

The quality of text-video pairs fundamentally determines the performance ceiling of text-to-video models. The datasets currently used to train these models suffer from significant shortcomings, including low temporal consistency, poor-quality captions, substandard video quality, and imbalanced data distribution. The prevailing video curation process, which relies on image models for tagging and manual rule-based curation, incurs a high computational load and leaves unclean data behind. As a result, appropriate training datasets for text-to-video models are lacking. To address this, we present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, the dataset provides high-quality videos and detailed captions with excellent temporal consistency. When used to train a video generation model, it leads to experimental results that surpass those obtained with other models.

Data Distribution


Using tags associated with visual quality, temporal consistency, category, and motion, we filtered and sampled the videos. The figure above depicts the distribution of the curated data across these dimensions. It shows that videos with low quality, static scenes, excessive motion speed, poor text-video alignment, or poor temporal consistency were systematically removed, while high-quality samples remain relatively evenly distributed across categories.
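For concreteness, the snippet below is a minimal Python sketch of this tag-based filtering and category-balanced sampling step. The field names, thresholds, and per-category quota are illustrative assumptions, not the exact values used to build VidGen-1M.

```python
# Sketch of tag-based filtering and category-balanced sampling.
# All field names and thresholds below are illustrative assumptions.
import random
from collections import defaultdict

def filter_and_sample(samples, per_category_quota=10_000,
                      min_aesthetic=4.5, min_temporal_consistency=0.85,
                      min_motion=0.2, max_motion=8.0):
    """Drop low-quality / static / overly fast clips, then balance categories."""
    kept = [
        s for s in samples
        if s["aesthetic_score"] >= min_aesthetic                    # visual quality
        and s["temporal_consistency"] >= min_temporal_consistency   # frame-to-frame similarity
        and min_motion <= s["motion_score"] <= max_motion           # neither static nor too fast
    ]

    # Group by category and cap each group so no class dominates the dataset.
    by_category = defaultdict(list)
    for s in kept:
        by_category[s["category"]].append(s)

    balanced = []
    for group in by_category.values():
        random.shuffle(group)
        balanced.extend(group[:per_category_quota])
    return balanced
```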

LLM-based Fine Curation


In the coarse curation and captioning stages, filtering for text-image alignment and temporal consistency with the CLIP score removes some inconsistent data, but it is not entirely effective. Consequently, scene transitions remain in some videos, and two typical description errors occur in the captions: 1) failure to generate the EOS token, where the model does not properly terminate generation and loops over repetitive tokens, and 2) frame-level generation, where the model does not capture the dynamic relationships between frames and produces isolated per-frame descriptions, yielding captions that lack coherence and fail to reflect the video's overall storyline and action sequence.
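The CLIP-score filtering referred to above can be sketched as follows. This is a minimal illustration assuming frames have already been extracted as PIL images; the CLIP variant, frame sampling rate, and decision thresholds used for VidGen-1M may differ.

```python
# Sketch of CLIP-based text-video alignment and temporal-consistency scoring.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(frames, caption):
    """Return (alignment, consistency) for one clip.

    `frames` is a list of PIL images evenly sampled from the clip. Both scores
    are cosine similarities of CLIP embeddings; the keep/drop thresholds are a
    separate design choice and are not reproduced here.
    """
    inputs = processor(text=[caption], images=frames, return_tensors="pt",
                       padding=True, truncation=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)

    alignment = (img @ txt.T).mean().item()        # mean frame-caption similarity
    adjacent = (img[:-1] * img[1:]).sum(dim=-1)    # adjacent-frame similarity
    consistency = adjacent.min().item()            # a low minimum suggests a scene cut
    return alignment, consistency
```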

To isolate and remove video-text pairs with discrepancies in text-video alignment or temporal consistency, we leverage the large language model (LLM) LLAMA3.1 to scrutinize the captions efficiently. This fine curation markedly improves the quality of the text-video pairs, as shown in the figure above. The screening focuses on three factors: Scene Transition (ST), Frame-level Generation (FLG), and Reduplication (Redup).
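Below is a minimal sketch of how such an LLM-based screening pass could look. The prompt wording, the specific LLAMA3.1 checkpoint, and the JSON output format are assumptions for illustration; the paper only states that LLAMA3.1 is used to flag captions with ST, FLG, or Redup issues.

```python
# Sketch of LLM-based fine curation: flag captions with ST / FLG / Redup issues.
# Prompt, checkpoint, and output schema are illustrative assumptions.
import json
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed checkpoint size
    device_map="auto",
)

PROMPT = """You are reviewing a caption generated for a single video clip.
Answer in JSON with boolean fields "scene_transition", "frame_level_generation",
and "reduplication".
- scene_transition: the caption describes more than one distinct scene or a cut.
- frame_level_generation: the caption reads as isolated per-frame descriptions
  rather than one coherent description of the action.
- reduplication: the caption repeats the same phrase or sentence.

Caption:
{caption}
"""

def flag_caption(caption: str) -> dict:
    out = judge(PROMPT.format(caption=caption), max_new_tokens=64,
                do_sample=False, return_full_text=False)
    reply = out[0]["generated_text"].strip()
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        # Conservatively flag the pair if the judge's answer cannot be parsed.
        return {"scene_transition": True,
                "frame_level_generation": True,
                "reduplication": True}

# Video-text pairs for which any flag is True are removed from the training set.
```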

Results


Training on the proposed VidGen-1M achieves high performance in text-to-video generation.

BibTex

@article{tan2024vidgen,
  title={VidGen-1M: A Large-Scale Dataset for Text-to-video Generation},
  author={Tan, Zhiyu and Yang, Xiaomeng and Qin, Luozheng and Li, Hao},
  journal={arXiv preprint arXiv:2408.02629},
  year={2024}
}

Acknowledgement

The project page template is borrowed from DreamBooth.