Omni-Video

Data Preparation for OmniVideo

This document describes the data preparation pipeline for training OmniVideo models. The process involves extracting latent features from videos and text prompts to create efficient training datasets.

Overview

To fine-tune OmniVideo models, input videos and their corresponding prompts must be preprocessed into latent features and multimodal language model (MLM) features. Extracting these features offline reduces GPU memory requirements during training and improves overall training efficiency.

Data preparation consists of two main steps: VAE and T5 feature extraction (Step 1), followed by AR model feature extraction (Step 2).

Step 1: VAE and T5 Feature Extraction

Overview

Extract video latent features using a VAE (Variational Autoencoder) and text embeddings using a T5 encoder.

Usage

bash tools/data_prepare/run_vae_feature.sh

Input Format

The input should be a JSON file containing video paths and corresponding prompts. Each entry should follow this structure:

{
    "video": "/path/to/video.mp4",
    "conversations": [
        {
            "from": "human",
            "value": "<video>\nDescribe the events in the video shown by these frames in at least three sentences."
        },
        {
            "from": "gpt",
            "value": "In the video, a man is seen standing on a boat in the middle of the ocean. He is wearing a black jacket and a black cap. The man is holding a walkie-talkie in his hand and appears to be speaking into it. The ocean is calm with small waves, and the sky is overcast. The man seems to be communicating with someone on the boat or on the shore. The video captures the serene environment of the ocean and the man's interaction with the walkie-talkie."
        }
    ]
}
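Before running the extraction script, it can help to confirm that every entry has the fields shown above. The following is a minimal sketch, not part of the OmniVideo tooling: the function name `load_entries` is hypothetical, and it assumes the input file holds a JSON list of such entries (a single entry is also tolerated) and that the first "gpt" turn is the caption to encode.

```python
import json


def load_entries(json_path):
    """Return (video_path, caption) pairs from an annotation file.

    Assumption: the file is a JSON list of entries shaped like the example
    above, and the first "gpt" turn holds the caption of interest.
    """
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):  # tolerate a file holding a single entry
        data = [data]

    pairs = []
    for i, entry in enumerate(data):
        video = entry.get("video")
        turns = entry.get("conversations", [])
        # Pick the first turn spoken by "gpt", if there is one.
        caption = next(
            (t.get("value") for t in turns if t.get("from") == "gpt"), None
        )
        if not video or not caption:
            raise ValueError(
                f"entry {i} is missing a video path or a gpt caption"
            )
        pairs.append((video, caption))
    return pairs
```

Calling `load_entries("/path/to/annotations.json")` raises a `ValueError` naming the first malformed entry, so problems surface before any GPU time is spent.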

Important Notes

Output

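The script generates pickle files containing the VAE video latent features and the T5 text embeddings, along with the pickle file list that Step 2 takes as input.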

Step 2: AR Model Feature Extraction

Overview

Extract autoregressive (AR) model features from the VAE features generated in Step 1.

Usage

bash tools/data_prepare/run_ar_feature.sh

Input

The script uses the pickle file list generated from Step 1 as input. Set the $DATA_FILE variable in run_ar_feature.sh to point to your VAE feature file list.
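A quick check that every file named in the list still exists can catch moved or partially written outputs before Step 2 starts. The sketch below is illustrative only: `check_file_list` is a hypothetical helper, and it assumes the file list is plain text with one pickle path per line, which this document does not confirm. Adjust the parsing if the list produced by Step 1 uses a different format.

```python
from pathlib import Path


def check_file_list(list_path):
    """Return (paths, missing) for a feature file list.

    Assumption: the list is a text file with one path per line;
    blank lines are ignored.
    """
    lines = Path(list_path).read_text(encoding="utf-8").splitlines()
    paths = [line.strip() for line in lines if line.strip()]
    # Collect every listed path that is not an existing regular file.
    missing = [p for p in paths if not Path(p).is_file()]
    return paths, missing
```

If `missing` is non-empty, rerunning Step 1 for those entries is likely cheaper than debugging a failure partway through Step 2.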

Output

The script writes the final pickle files, which contain all features required for OmniVideo training.

Configuration

Key Parameters

Integration with Training

The generated pickle files can be used directly with the OmniVideo training pipeline. Because the features are precomputed, training avoids the memory and compute cost of extracting them on the fly.
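To see what a generated file holds before starting a training run, the contents can be summarized with the standard `pickle` module. This is a minimal sketch under stated assumptions: the helper names `summarize` and `inspect_pickle` are hypothetical, and the key names and value types inside the files are not documented here, so the code reports whatever it finds instead of expecting a fixed layout. If the files store tensors, the library that created them must be importable when loading. Only unpickle files you trust, since `pickle.load` can execute arbitrary code.

```python
import pickle


def summarize(obj, depth=0, max_depth=2):
    """Describe an object by type, plus its shape or length when available."""
    shape = getattr(obj, "shape", None)
    if shape is not None:  # arrays and tensors expose .shape
        return f"{type(obj).__name__} shape={tuple(shape)}"
    if isinstance(obj, dict) and depth < max_depth:
        # Recurse into dictionaries, up to max_depth levels.
        return {k: summarize(v, depth + 1, max_depth) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, str, bytes)):
        return f"{type(obj).__name__} len={len(obj)}"
    return type(obj).__name__


def inspect_pickle(path):
    """Load one pickle file (trusted input only) and summarize its contents."""
    with open(path, "rb") as f:
        return summarize(pickle.load(f))
```

Running `inspect_pickle` on one output file from each step is a cheap way to confirm that the expected features are present and have plausible shapes.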

For detailed training instructions, refer to the main training documentation and finetune_model.py.