Hao Yang2, Zhiyu Tan1,2†, Jia Gong2, Luozheng Qin2, Hesen Chen1,2, Xiaomeng Yang2, Yuqing Sun2, Yuetan Lin2, Mengping Yang2*, Hao Li1,2*
1Fudan University | 2Shanghai Academy of Artificial Intelligence for Science
*Corresponding Author †Project Lead
We present a unified video editing and generation framework that pairs a text-to-video DiT backbone with vision-language understanding for precise, controllable edits. A VLM reads the source video and edit instruction to predict a detailed caption of the expected edited result, converting sparse prompts into explicit semantics about content, attributes, and temporal changes. The DiT model then uses mixed cross-attention conditioning, injecting source VAE latents (optionally concatenated with other cues) together with the expanded text semantics, to preserve identity, layout, and motion while enabling flexible control. This yields a single pipeline that supports text-to-video, video-to-video editing, and mixed-condition generation.
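To make the data flow concrete, the sketch below outlines the pipeline described above. It is illustrative only: the object and method names (`vlm.describe_edited_result`, `dit.sample`, and so on) are hypothetical and do not correspond to the repository's actual API.

```python
# Conceptual sketch of the editing pipeline (illustrative; names are hypothetical).

def edit_video(source_video, edit_instruction, vlm, t5, vae, dit):
    # 1. The VLM reads the source video and the edit instruction and predicts
    #    a detailed caption describing the expected edited result.
    expanded_caption = vlm.describe_edited_result(source_video, edit_instruction)

    # 2. The caption is encoded into text embeddings (a umT5 encoder in this repo).
    text_embeddings = t5.encode(expanded_caption)

    # 3. The source video is encoded into VAE latents that carry identity,
    #    layout, and motion information.
    source_latents = vae.encode(source_video)

    # 4. The DiT denoises from noise, conditioned on both the source latents
    #    and the expanded text semantics via mixed cross-attention.
    edited_latents = dit.sample(
        condition_latents=source_latents,
        text_embeddings=text_embeddings,
    )

    # 5. Decode the latents back to pixel space.
    return vae.decode(edited_latents)
```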
Note: Left side shows the source video, right side shows the edited result.
Multi-element transformations combining appearance, lighting, and environmental changes.

- *Change the man's black jacket to a tattered gray overcoat, replace the wall with faded blue wallpaper*
- *Change the woman's red shirt to glowing neon cyan, transform window glow to electric blue moonlight*
- *Change the man's black jacket to a gray coat with glowing thread, replace blue light with warm amber*
- *Change workout attire to vibrant crimson sports bra and leggings, replace towel with flowing silk scarf*
Challenging edits on fast-moving subjects with dynamic clothing and dramatic motion.

- *Change the woman's black top to a flowing blood-red silk gown that billows with motion*
- *Change the woman's green jacket to a deep crimson cloak that billows dramatically*
- *Change the armored suit from red-and-black to matte charcoal gray with cyan circuitry accents*
- *Change the woman's white shirt to a blood-red silk blouse that clings to her form*
Precise object-level modifications while preserving surrounding context and motion.

- *Change the real raccoon to a stuffed raccoon*
- *Change the firefighter's pizza to a steaming cup of coffee*
- *Change the light brown fur to deep obsidian-black fur with icy blue ethereal mist*
- *Change the golden retriever to a black Labrador*
Adding objects and accessories to videos.

- *Add a scarf around the first fox's neck*
- *Add a tiny pirate hat on the parrot's head*
- *Add a red headband to the player's forehead*
- *Add a tiny crown to the hummingbird's head*
Removing elements from videos while maintaining scene coherence.

- *Remove the meditation cushion from the scene*
- *Remove the two cubs from the scene*
- *Remove the two lizards from the scene*
- *Remove the black cat from the scene*
Local attribute changes on specific objects.

- *Change the woman's white dress to a blood-stained black gown*
- *Change the fox into a badger*
- *Change the man with thick beard to a woman with short silver hair*
- *Change the engineer's navy jacket to a bright crimson trench coat*
omnivideo2_release/
├── omnivideo/
│ ├── configs/ # Model configurations
│ ├── distributed/ # FSDP and sequence parallel utilities
│ ├── modules/ # Core model components (attention, VAE, T5, etc.)
│ ├── utils/ # Utility functions and solvers
│ ├── vllm_model.py # Qwen3-VL integration
│ └── x2x_gen_unified.py # Main generation pipeline
└── tools/
└── inference/
├── generate_omni_v2v.py # Inference script
└── inference_omni_v2v.sh # Shell launcher
git clone https://github.com/your-org/omnivideo2.git
cd omnivideo2
conda create -n omnivideo2 python=3.10
conda activate omnivideo2
pip install -r requirements.txt
pip install flash-attn --no-build-isolation # Optional but recommended for faster attention
Download the pretrained checkpoints and organize them as follows:
${CKPT_DIR}/
├── high_noise_model/
│ └── model.pt # High-noise timestep model
├── low_noise_model/
│ └── model.pt # Low-noise timestep model
├── special_tokens.pkl # Special token embeddings
├── models_t5_umt5-xxl-enc-bf16.pth # T5 encoder
└── Wan2.1_VAE.pth # VAE model
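As a quick sanity check (not part of the repository), you can verify the checkpoint layout with a few lines of Python before launching inference:

```python
# Optional convenience check: confirm that the expected checkpoint files
# listed above are present under CKPT_DIR. The path is a placeholder.
from pathlib import Path

CKPT_DIR = Path("/path/to/your/checkpoints")

required = [
    "high_noise_model/model.pt",
    "low_noise_model/model.pt",
    "special_tokens.pkl",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.1_VAE.pth",
]

missing = [name for name in required if not (CKPT_DIR / name).exists()]
if missing:
    raise FileNotFoundError(f"Missing checkpoint files: {missing}")
print("All expected checkpoint files found.")
```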
You will also need the Qwen3-VL model for visual feature extraction:
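One way to fetch it is via `huggingface_hub`; this is a sketch, and the Hugging Face repo ID below is an assumption inferred from the `QWEN3VL_MODEL_PATH` used in the launcher script.

```python
# Download the Qwen3-VL weights locally (repo ID assumed from the launcher
# script's QWEN3VL_MODEL_PATH; adjust if you use a different mirror or variant).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3-VL-30B-A3B-Instruct",
    local_dir="/path/to/Qwen3-VL-30B-A3B-Instruct",
)
```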
Create a JSONL file with your prompts. Each line should be a JSON object:
For Video-to-Video editing:
{"sample_id": "001", "edit_prompt": "Change the dog to a cat", "source_clip_path": "/path/to/source_video.mp4"}
Edit `tools/inference/inference_omni_v2v.sh` and update the following settings:

# Update these paths
CKPT_DIR="/path/to/your/checkpoints"
QWEN3VL_MODEL_PATH="/path/to/Qwen3-VL-30B-A3B-Instruct"
DATA_FILE="/path/to/your/prompts.jsonl"
# Adjust generation parameters as needed
GEN_SIZE="832*480" # Video resolution (width*height)
GEN_FRAME_NUM=41 # Number of frames
GEN_SAMPLE_FPS=8 # Output FPS
GEN_TASK="v2v-A14B" # Task type: v2v-A14B or t2v-A14B
Then launch inference:

bash tools/inference/inference_omni_v2v.sh
| Task | Description |
|---|---|
| `t2v-A14B` | Text-to-Video generation with the A14B model |
| `v2v-A14B` | Video-to-Video editing with the A14B model |
| Parameter | Default | Description |
|---|---|---|
| `--size` | `832*480` | Output video resolution (width*height) |
| `--frame_num` | `41` | Number of frames to generate |
| `--sample_fps` | `8` | Output video FPS |
| `--sample_steps` | `40` | Number of diffusion sampling steps |
| `--sample_guide_scale` | `3.0` | Classifier-free guidance scale |
| `--sample_shift` | `5` | Noise schedule shift parameter |
| `--sample_solver` | `unipc` | Sampling solver (`unipc`, `ddim`, `euler`) |
We sincerely thank the following teams for their outstanding contributions that made this project possible:
- Wan Team: For the foundational video generation architecture, VAE model, and diffusion framework.
- Qwen-VL Team: For the powerful Qwen3-VL vision-language model.
Please refer to the LICENSE file for details.
If you find this work useful, please consider citing:
@misc{omnivideo2,
  title={OmniVideo2: A flexible framework to bridge video understanding, generation and editing},
  year={2026},
  publisher={GitHub},
  url={https://github.com/SAIS-FUXI/Omni-Video}
}
@article{tan2025omni,
  title={Omni-Video: Democratizing Unified Video Understanding and Generation},
  author={Tan, Zhiyu and Yang, Hao and Qin, Luozheng and Gong, Jia and Yang, Mengping and Li, Hao},
  journal={arXiv preprint arXiv:2507.06119},
  year={2025}
}