Omni-Video

Omni-Video 2

A flexible framework to bridge video understanding, generation and editing

Project Page   HuggingFace Model

Hao Yang2, Zhiyu Tan1,2†, Jia Gong2, Luozheng Qin2, Hesen Chen1,2, Xiaomeng Yang2, Yuqing Sun2, Yuetan Lin2, Mengping Yang2*, Hao Li1,2*

1Fudan University  |  2Shanghai Academy of Artificial Intelligence for Science
*Corresponding Author    †Project Lead


🔥 Latest News


Introduction

We present a unified video editing and generation framework that pairs a text-to-video DiT backbone with vision-language understanding for precise, controllable edits. A VLM reads the source video and edit instruction to predict a detailed caption of the expected edited result, converting sparse prompts into explicit semantics about content, attributes, and temporal changes. The DiT model then uses mixed cross-attention conditioning, injecting source VAE latents (optionally concatenated with other cues) together with the expanded text semantics, to preserve identity, layout, and motion while enabling flexible control. This yields a single pipeline that supports text-to-video, video-to-video editing, and mixed-condition generation.
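
The end-to-end flow can be sketched as follows. This is an illustrative outline only; the names (edit_video, describe_edited_result, and the vlm/t5/vae/dit handles) are placeholders rather than the repository's actual API, whose real pipeline lives in omnivideo/x2x_gen_unified.py.

# Illustrative sketch of the editing flow (placeholder names, not the repository API).
def edit_video(source_video, edit_instruction, vlm, t5, vae, dit):
    # 1. The VLM reads the source video and the sparse edit instruction and
    #    predicts a detailed caption of the expected edited result.
    expanded_caption = vlm.describe_edited_result(source_video, edit_instruction)

    # 2. The expanded caption is encoded into text semantics.
    text_embeddings = t5.encode(expanded_caption)

    # 3. The source video is encoded into VAE latents that carry identity,
    #    layout, and motion.
    source_latents = vae.encode(source_video)

    # 4. The DiT backbone denoises with mixed cross-attention conditioning:
    #    source latents (optionally concatenated with other cues) together
    #    with the expanded text semantics.
    edited_latents = dit.sample(condition_latents=source_latents,
                                text_embeddings=text_embeddings)

    # 5. Decode back to pixel space.
    return vae.decode(edited_latents)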

Framework


Video Editing Demos

Note: the left side shows the source video; the right side shows the edited result.

Advanced Video Editing

Complex Edit

Multi-element transformations combining appearance, lighting, and environmental changes.

*Change the man's black jacket to a tattered gray overcoat, replace the wall with faded blue wallpaper* *Change the woman's red shirt to glowing neon cyan, transform window glow to electric blue moonlight*
*Change the man's black jacket to a gray coat with glowing thread, replace blue light with warm amber* *Change workout attire to vibrant crimson sports bra and leggings, replace towel with flowing silk scarf*

High Motion

Challenging edits on fast-moving subjects with dynamic clothing and dramatic motion.

*Change the woman's black top to a flowing blood-red silk gown that billows with motion* *Change the woman's green jacket to a deep crimson cloak that billows dramatically*
*Change the armored suit from red-and-black to matte charcoal gray with cyan circuitry accents* *Change the woman's white shirt to a blood-red silk blouse that clings to her form*

Diverse Local Edit

Precise object-level modifications while preserving surrounding context and motion.

*Change the real raccoon to a stuffed raccoon* *Change the firefighter's pizza to a steaming cup of coffee*
*Change the light brown fur to deep obsidian-black fur with icy blue ethereal mist* *Change the golden retriever to a black Labrador*

Basic Video Editing

Add

Adding objects and accessories to videos.

*Add a scarf around the first fox's neck* *Add a tiny pirate hat on the parrot's head*
*Add a red headband to the player's forehead* *Add a tiny crown to the hummingbird's head*

Remove

Removing elements from videos while maintaining scene coherence.

*Remove the meditation cushion from the scene* *Remove the two cubs from the scene*
*Remove the two lizards from the scene* *Remove the black cat from the scene*

Local Change

Local attribute changes on specific objects.

*Change the woman's white dress to a blood-stained black gown* *Change the fox into a badger*
*Change the man with thick beard to a woman with short silver hair* *Change the engineer's navy jacket to a bright crimson trench coat*

Project Structure

omnivideo2_release/
├── omnivideo/
│   ├── configs/           # Model configurations
│   ├── distributed/       # FSDP and sequence parallel utilities
│   ├── modules/           # Core model components (attention, VAE, T5, etc.)
│   ├── utils/             # Utility functions and solvers
│   ├── vllm_model.py      # Qwen3-VL integration
│   └── x2x_gen_unified.py # Main generation pipeline
└── tools/
    └── inference/
        ├── generate_omni_v2v.py    # Inference script
        └── inference_omni_v2v.sh   # Shell launcher

Environment Setup

Requirements

Installation

  1. Clone the repository:
    git clone https://github.com/SAIS-FUXI/Omni-Video.git
    cd Omni-Video
    
  2. Create a conda environment:
    conda create -n omnivideo2 python=3.10
    conda activate omnivideo2
    
  3. Install dependencies:
    pip install -r requirements.txt
    pip install flash-attn --no-build-isolation  # Optional but recommended for faster attention
    
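After installation, the environment can be sanity-checked with a short script like the one below (a minimal sketch; it only confirms that PyTorch sees a CUDA device and whether the optional flash-attn package imports).

# Minimal environment check (illustrative; not part of the repository).
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (optional)")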

Model Checkpoints

Download the pretrained checkpoints and organize them as follows:

${CKPT_DIR}/
├── high_noise_model/
│   └── model.pt              # High-noise timestep model
├── low_noise_model/
│   └── model.pt              # Low-noise timestep model
├── special_tokens.pkl        # Special token embeddings
├── models_t5_umt5-xxl-enc-bf16.pth  # T5 encoder
└── Wan2.1_VAE.pth            # VAE model

You will also need the Qwen3-VL model (Qwen3-VL-30B-A3B-Instruct, set via QWEN3VL_MODEL_PATH in the inference launcher) for visual feature extraction.
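
Before launching inference, a small helper like the one below can confirm that the checkpoint directory matches the expected layout (a hypothetical convenience script, not shipped with the repository).

# Hypothetical helper: verify the expected checkpoint files under ${CKPT_DIR}.
import os
import sys

EXPECTED_FILES = [
    "high_noise_model/model.pt",
    "low_noise_model/model.pt",
    "special_tokens.pkl",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.1_VAE.pth",
]

def check_checkpoints(ckpt_dir):
    missing = [p for p in EXPECTED_FILES
               if not os.path.exists(os.path.join(ckpt_dir, p))]
    for path in missing:
        print("Missing:", path)
    return not missing

if __name__ == "__main__":
    ckpt_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    sys.exit(0 if check_checkpoints(ckpt_dir) else 1)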

Inference

Prepare Input Data

Create a JSONL file with your prompts. Each line should be a JSON object:

For Video-to-Video editing:

{"sample_id": "001", "edit_prompt": "Change the dog to a cat", "source_clip_path": "/path/to/source_video.mp4"}
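
The file can also be generated programmatically; the snippet below writes the same fields shown above (the sample values are placeholders).

# Write a prompts.jsonl for video-to-video editing: one JSON object per line
# with sample_id, edit_prompt, and source_clip_path.
import json

samples = [
    {
        "sample_id": "001",
        "edit_prompt": "Change the dog to a cat",
        "source_clip_path": "/path/to/source_video.mp4",
    },
    # Add more samples here.
]

with open("prompts.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")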

Run Inference

  1. Edit the configuration in tools/inference/inference_omni_v2v.sh:
# Update these paths
CKPT_DIR="/path/to/your/checkpoints"
QWEN3VL_MODEL_PATH="/path/to/Qwen3-VL-30B-A3B-Instruct"
DATA_FILE="/path/to/your/prompts.jsonl"

# Adjust generation parameters as needed
GEN_SIZE="832*480"       # Video resolution (width*height)
GEN_FRAME_NUM=41         # Number of frames
GEN_SAMPLE_FPS=8         # Output FPS
GEN_TASK="v2v-A14B"      # Task type: v2v-A14B or t2v-A14B
  2. Run the inference script:
bash tools/inference/inference_omni_v2v.sh

Available Tasks

Task       Description
t2v-A14B   Text-to-Video generation with A14B model
v2v-A14B   Video-to-Video editing with A14B model

Generation Parameters

Parameter             Default   Description
--size                832*480   Output video resolution (width*height)
--frame_num           41        Number of frames to generate
--sample_fps          8         Output video FPS
--sample_steps        40        Number of diffusion sampling steps
--sample_guide_scale  3.0       Classifier-free guidance scale
--sample_shift        5         Noise schedule shift parameter
--sample_solver       unipc     Sampling solver (unipc, ddim, euler)
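
As a usage sketch, these flags can be assembled into a direct call to the inference script. The snippet below mirrors the defaults in the table; checkpoint, Qwen3-VL, and data-file arguments are omitted because they are configured inside the shell launcher (tools/inference/inference_omni_v2v.sh), which remains the recommended entry point.

# Sketch: invoke the inference script with the documented generation flags.
# Checkpoint, Qwen3-VL, and data-file arguments are configured in the shell
# launcher and are not shown here.
import subprocess

cmd = [
    "python", "tools/inference/generate_omni_v2v.py",
    "--size", "832*480",            # output resolution (width*height)
    "--frame_num", "41",            # number of frames to generate
    "--sample_fps", "8",            # output video FPS
    "--sample_steps", "40",         # diffusion sampling steps
    "--sample_guide_scale", "3.0",  # classifier-free guidance scale
    "--sample_shift", "5",          # noise schedule shift
    "--sample_solver", "unipc",     # unipc, ddim, or euler
]
subprocess.run(cmd, check=True)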

Acknowledgements

We sincerely thank the following teams for their outstanding contributions that made this project possible:

License

Please refer to the LICENSE file for details.

Citation

If you find this work useful, please consider citing:

@misc{omnivideo2,
  title={OmniVideo2: A flexible framework to bridge video understanding, generation and editing},
  year={2026},
  publisher={GitHub},
  url={https://github.com/SAIS-FUXI/Omni-Video}
}
@article{tan2025omni,
  title={Omni-Video: Democratizing Unified Video Understanding and Generation},
  author={Tan, Zhiyu and Yang, Hao and Qin, Luozheng and Gong, Jia and Yang, Mengping and Li, Hao},
  journal={arXiv preprint arXiv:2507.06119},
  year={2025}
}