Omni-Video

Omni-Video: Democratizing Unified Video Understanding and Generation

[Zhiyu Tan*](https://openreview.net/profile?id=~Zhiyu_Tan1) · [Hao Yang*](https://openreview.net/profile?id=~Yang_Hao4) · [Luozheng Qin](https://openreview.net/profile?id=~Luozheng_Qin1) · [Jia Gong](https://scholar.google.com/citations?user=ZV-ThegAAAAJ&hl=zh-CN&oi=ao) · [Mengping Yang](https://scholar.google.com/citations?user=yF34LtcAAAAJ&hl=zh-CN) · [Hao Li](https://scholar.google.com/citations?user=pHN-QIwAAAAJ&hl=zh-CN)

*Equal Contribution · Corresponding Authors

TL;DR: Omni-Video is a unified video model that supports a variety of video tasks, including video understanding, generation, and editing, within a single framework.


🔥 Latest News

📑 Tasks supported by the unified Omni-Video

🚀 Quick Start

Inference

# Run inference with sample data
bash tools/inference/inference.sh

Training

# Quick training with sample data
# Sample data are available in examples/finetune_data
bash finetune.sh
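If you want to assemble your own finetuning set, mirror the structure of the samples shipped in `examples/finetune_data`. The snippet below is only a hypothetical illustration of what a conversation-style record might look like; the field names (`video`, `conversations`, `from`, `value`) are assumptions, not the repository's actual schema, so always treat the bundled samples as the authoritative format.

```python
# Hypothetical example of a conversation-style finetuning record; the real
# schema is defined by the samples shipped in examples/finetune_data.
import json

sample = {
    "video": "path/to/clip.mp4",  # assumed field name, for illustration only
    "conversations": [
        {"from": "human", "value": "<video>\nDescribe what happens in the clip."},
        {"from": "gpt", "value": "A person walks a dog along a rainy street."},
    ],
}

# Write one JSON record per line, matching the sample-data layout you adopt.
with open("my_finetune_data.jsonl", "w") as f:
    f.write(json.dumps(sample) + "\n")
```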

For detailed usage instructions, please refer to SETUP_MODELS.md.

Abstract

Notable breakthroughs in unified understanding and generation modeling have led to remarkable advances in image understanding, reasoning, production, and editing, yet current foundational models predominantly focus on images, leaving a gap in unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, and instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that serve as input to diffusion decoders, which then produce high-quality videos conditioned on these clues. To fully unlock the potential of our system for unified video modeling, we introduce several technical improvements: 1) a lightweight architectural design that attaches a vision head on top of the MLLM and an adapter before the input of the diffusion decoder; the former produces visual tokens, and the latter adapts these tokens to the conditional space of the diffusion decoder; and 2) an efficient multi-stage training scheme that quickly connects the MLLM and the diffusion decoder with limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization across video generation, editing, and understanding tasks.
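The pipeline described above can be summarized in a few lines of code. The sketch below is a conceptual illustration only, not the actual implementation: all module names, function names, and dimensions (`VisionHead`, `DiffusionAdapter`, `generate_video`, `llm_dim`, `vis_dim`, `cond_dim`) are hypothetical placeholders assumed for clarity.

```python
# Conceptual sketch (hypothetical names/shapes) of the unified pipeline:
# an MLLM emits continuous visual tokens via a vision head, an adapter maps
# them into the conditioning space of a diffusion decoder, which renders video.
import torch
import torch.nn as nn

class VisionHead(nn.Module):
    """Maps MLLM hidden states to continuous visual tokens (hypothetical module)."""
    def __init__(self, llm_dim=4096, vis_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(llm_dim, vis_dim), nn.GELU(),
                                  nn.Linear(vis_dim, vis_dim))

    def forward(self, hidden_states):          # (B, N, llm_dim)
        return self.proj(hidden_states)        # (B, N, vis_dim)

class DiffusionAdapter(nn.Module):
    """Adapts visual tokens to the diffusion decoder's conditional space (hypothetical)."""
    def __init__(self, vis_dim=1024, cond_dim=2048):
        super().__init__()
        self.proj = nn.Linear(vis_dim, cond_dim)

    def forward(self, visual_tokens):           # (B, N, vis_dim)
        return self.proj(visual_tokens)          # (B, N, cond_dim)

def generate_video(mllm, head, adapter, diffusion_decoder, prompt_ids):
    # 1) The MLLM encodes the instruction (and any reference frames).
    hidden = mllm(prompt_ids)                    # (B, N, llm_dim)
    # 2) The vision head emits continuous visual clues.
    visual_tokens = head(hidden)
    # 3) The adapter maps them into the decoder's conditioning space.
    cond = adapter(visual_tokens)
    # 4) The diffusion decoder synthesizes video conditioned on these clues.
    return diffusion_decoder(cond)
```

In this reading, only the vision head and adapter are lightweight additions; the MLLM and the diffusion decoder are pre-existing models, which is what allows the multi-stage training scheme to connect them with limited data and compute.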


Demos

Text-to-Video

| Group 1 | Group 2 | Group 3 |
|--------|--------|--------|
| ![Text-to-Video Demo 1](/Omni-Video/assets/t2v/001.gif) | ![Text-to-Video Demo 2](/Omni-Video/assets/t2v/002.gif) | ![Text-to-Video Demo 3](/Omni-Video/assets/t2v/003.gif) |
| ![Text-to-Video Demo 4](/Omni-Video/assets/t2v/004.gif) | ![Text-to-Video Demo 5](/Omni-Video/assets/t2v/005.gif) | ![Text-to-Video Demo 6](/Omni-Video/assets/t2v/006.gif) |
| ![Text-to-Video Demo 7](/Omni-Video/assets/t2v/007.gif) | ![Text-to-Video Demo 8](/Omni-Video/assets/t2v/008.gif) | ![Text-to-Video Demo 9](/Omni-Video/assets/t2v/009.gif) |

Video-to-Video

Our model offers video editing capabilities. Because of the high cost of training data, its generalization may not always be optimal; sampling multiple times often improves results (see the sketch below). For better performance, fine-tuning on high-quality data is also a recommended option.
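One simple way to exploit repeated sampling is best-of-N selection: generate several candidates with different random seeds and keep the one that best matches the editing instruction. The sketch below is a generic illustration of that idea, not code from this repository; `run_editing` and `score_against_instruction` are hypothetical stand-ins for your own inference call and preference metric (e.g., a CLIP-style similarity).

```python
# Generic best-of-N sampling loop (illustration only); run_editing and
# score_against_instruction are hypothetical stand-ins, not repo APIs.
import random

def best_of_n(run_editing, score_against_instruction, instruction, source_video, n=4):
    best_video, best_score = None, float("-inf")
    for _ in range(n):
        seed = random.randint(0, 2**31 - 1)              # vary the seed per sample
        candidate = run_editing(source_video, instruction, seed=seed)
        score = score_against_instruction(candidate, instruction)
        if score > best_score:
            best_video, best_score = candidate, score
    return best_video                                     # highest-scoring edit
```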

| Group 1 | Group 2 |
|--------|--------|
| ![Video Demo 1](/Omni-Video/assets/v2v/v2v_demo_001.gif) | ![Video Demo 2](/Omni-Video/assets/v2v/v2v_demo_002.gif) |
| ![Video Demo 3](/Omni-Video/assets/v2v/v2v_demo_003.gif) | ![Video Demo 4](/Omni-Video/assets/v2v/v2v_demo_004.gif) |
| ![Video Demo 5](/Omni-Video/assets/v2v/v2v_demo_005.gif) | ![Video Demo 6](/Omni-Video/assets/v2v/v2v_demo_006.gif) |

| Group 3 | Group 4 |
|-------|-------|
| ![Concat Demo 1](/Omni-Video/assets/v2v/concat_0.gif) | ![Concat Demo 2](/Omni-Video/assets/v2v/concat_1.gif) |
| ![Concat Demo 3](/Omni-Video/assets/v2v/concat_2.gif) | ![Concat Demo 4](/Omni-Video/assets/v2v/concat_3.gif) |
| ![Concat Demo 5](/Omni-Video/assets/v2v/concat_4.gif) | ![Concat Demo 6](/Omni-Video/assets/v2v/concat_5.gif) |

Text-to-image


Image-to-image Editing

Acknowledgement

We would like to thank VILA and Wan2.1 for their excellent work.

BibTex

@article{tan2025omni,
  title={Omni-Video: Democratizing Unified Video Understanding and Generation},
  author={Tan, Zhiyu and Yang, Hao and Qin, Luozheng and Gong, Jia and Yang, Mengping and Li, Hao},
  journal={arXiv preprint arXiv:2507.06119},
  year={2025}
}