
✅ 2025.07.29 We released UniCoT-7B-MoT, which extends the Bagel-7B-MoT model to perform text-to-image generation with a self-reflection reasoning mechanism.
✅ 2025.08.08 We released the UniCoT v0.1 technical report on arXiv and the GitHub repository.
🔥 We are still working on this project to integrate more kinds of Chain-of-Thought (CoT) mechanisms into the unified model, so please stay tuned!
While Chain-of-Thought (CoT) reasoning has been proven effective for complex text-based tasks, extending it to multimodal scenarios introduces new challenges. In visual contexts, human reasoning often relies on understanding how visual states evolve over time, such as tracking object movements and spatial interactions. This demands that Multimodal Large Language Models (MLLMs) reason not only at the textual level but also effectively incorporate and interpret visual cues.
To tackle this, we introduce **Uni-CoT**, a unified reasoning framework that extends CoT principles to the **multimodal domain**, empowering Multimodal Large Language Models (MLLMs) to perform **interpretable**, **step-by-step reasoning** across both **text and vision**. The core idea is to decompose complex multimodal tasks into structured, manageable steps that can be executed **sequentially or in parallel**, enabling more scalable and systematic reasoning as shown below.
As illustrated in the figure above, the Uni-CoT framework adopts a two-level hierarchical reasoning architecture:
1. Macro-Level CoT: Decomposes a complex task into simpler subtasks and synthesizes their outcomes to derive the final answer. We design three planning mechanisms for different scenarios: *Sequential Decomposition* for causal, step-by-step planning; *Parallel Decomposition* for collaborative, multi-branch planning; and *Progressive Refinement* for unknown or highly complex scenarios requiring iterative exploration.
2. Micro-Level CoT: Focuses on executing individual subtasks while filtering out irrelevant information. We incorporate a *Self-Reflection* mechanism to ensure stable and high-quality results in each subtask.
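The two-level loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: `decompose`, `solve`, `reflect_is_ok`, and `synthesize` are hypothetical stand-ins for MLLM calls, and only the sequential-decomposition variant is shown.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Subtask:
    """One micro-level unit of work: an instruction plus its result."""
    instruction: str
    result: Optional[str] = None

def micro_cot(subtask: Subtask, max_retries: int = 2) -> Subtask:
    """Micro-level CoT: execute one subtask with a self-reflection loop."""
    for _ in range(max_retries + 1):
        subtask.result = solve(subtask.instruction)             # model call (stub)
        if reflect_is_ok(subtask.instruction, subtask.result):  # self-reflection
            break
    return subtask

def macro_cot(task: str) -> str:
    """Macro-level CoT: decompose, execute subtasks, synthesize the answer."""
    subtasks = [Subtask(s) for s in decompose(task)]   # sequential decomposition
    done = [micro_cot(st) for st in subtasks]
    return synthesize(task, [st.result for st in done])

# --- stubs standing in for MLLM calls ---
def decompose(task: str) -> List[str]:
    return [f"step 1 of {task}", f"step 2 of {task}"]

def solve(instruction: str) -> str:
    return f"result({instruction})"

def reflect_is_ok(instruction: str, result: str) -> bool:
    return True

def synthesize(task: str, results: List[str]) -> str:
    return " | ".join(results)
```

*Parallel Decomposition* would map `micro_cot` over independent subtasks concurrently, and *Progressive Refinement* would re-invoke `decompose` on intermediate results instead of planning everything up front.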
With these designs, our Uni-CoT framework aims to enable unified large models to tackle a wide range of challenging multimodal applications, including:
1. Highly reliable image generation/editing;
2. Visual planning;
3. Geometric and physical reasoning.
| Model | Culture↑ | Time↑ | Space↑ | Biology↑ | Physics↑ | Chemistry↑ | Overall↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Janus | 0.16 | 0.26 | 0.35 | 0.28 | 0.30 | 0.14 | 0.23 |
| MetaQuery | 0.56 | 0.55 | 0.62 | 0.49 | 0.63 | 0.41 | 0.55 |
| Bagel-Think | 0.76 | 0.69 | 0.75 | 0.65 | 0.75 | 0.58 | 0.70 |
| Uni-CoT | 0.76 | 0.60 | 0.76 | 0.73 | 0.81 | 0.73 | 0.75 |
| GPT-4o | 0.81 | 0.71 | 0.89 | 0.83 | 0.79 | 0.74 | 0.80 |
We adapt the unified Bagel-7B-MoT model to perform joint text and image generation in support of UniCoT-style multimodal reasoning. As a first step, we fine-tune the model using its native interleaved text-image training paradigm. While this naïve adaptation enables the model to learn basic UniCoT behaviors, we observe significant challenges when scaling to complex reasoning chains involving multiple image-text steps.
A primary bottleneck lies in the elevated complexity introduced by visual reasoning. Unlike text-only reasoning, where each step typically consumes 512–1,024 tokens, UniCoT must generate both a reasoning text and a corresponding image per step. Synthesizing an image via the VAE-based representation consumes ~4,096 tokens, and encoding the image with a ViT-based representation for understanding incurs an additional ~4,900 tokens, resulting in nearly 9,000 image tokens per reasoning step. This substantial overhead significantly increases the computational cost of training and inference. As the reasoning chain grows, the model struggles to converge and generalize, ultimately limiting its performance on complex multimodal tasks.
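A quick back-of-the-envelope calculation makes the overhead concrete. The constants below are the figures quoted in the paragraph above; the helper names are ours.

```python
# Approximate per-step token counts quoted above.
TEXT_TOKENS = 1024   # upper end of a typical text-only reasoning step
VAE_TOKENS = 4096    # synthesizing an image via the VAE-based representation
VIT_TOKENS = 4900    # re-encoding that image with the ViT for understanding

def tokens_per_step(with_image: bool = True) -> int:
    """Total tokens one reasoning step consumes."""
    image_tokens = VAE_TOKENS + VIT_TOKENS if with_image else 0  # ~8,996
    return TEXT_TOKENS + image_tokens

def chain_tokens(num_steps: int, with_image: bool = True) -> int:
    """Tokens for a whole reasoning chain of `num_steps` steps."""
    return num_steps * tokens_per_step(with_image)

print(tokens_per_step())   # 10020: one interleaved text+image step
print(chain_tokens(5))     # 50100: a five-step multimodal chain
print(chain_tokens(5) / chain_tokens(5, with_image=False))  # ~9.8x text-only
```

A five-step multimodal chain thus costs roughly an order of magnitude more tokens than its text-only counterpart, which is exactly the pressure the MDP reformulation below is designed to relieve.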
To mitigate the complexity introduced by long multimodal reasoning chains, we reformulate the Uni-CoT process as a Markov Decision Process (MDP), where each step depends solely on the current state. Concretely, we model each reasoning step as a discrete MDP node, which only depends on the preceding step and the task instruction. This formulation enables the model to focus on learning local transition dynamics between adjacent nodes, rather than capturing dependencies across the entire reasoning chain as shown below. Such a design choice significantly reduces computational overhead and improves training efficiency.
Specifically, each MDP node is defined by the following components:
State \(s_t\): The current context, i.e., the last reasoning step, including both text and images.
Action \(a_t\): A hybrid operation that involves generating editing instructions and performing corresponding image edits.
Next State \(s_{t+1}\): The updated context resulting from the applied action, including the edited image and a textual summary of it.
Reward \(r_t\): A textual conclusion or scalar score that quantifies the alignment between the outcome and the task objective.
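The four components above map naturally onto a small data structure. A minimal sketch, with field names of our own choosing and image payloads simplified to plain string identifiers:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MDPStep:
    """One node in the MDP-style reasoning chain (illustrative fields)."""
    state_text: str                   # s_t: text from the last reasoning step
    state_image: Optional[str]        # s_t: image from the last step (id/path)
    action: str                       # a_t: editing instruction + image edit
    next_text: Optional[str] = None   # part of s_{t+1}: summary of the edit
    next_image: Optional[str] = None  # part of s_{t+1}: the edited image
    reward: Optional[float] = None    # r_t: alignment with the task objective

def is_markov(chain: List[MDPStep]) -> bool:
    """Check the Markov property: each node's state must come from the
    immediately preceding node's next-state, never from earlier steps."""
    return all(prev.next_text == cur.state_text and
               prev.next_image == cur.state_image
               for prev, cur in zip(chain, chain[1:]))
```

Because each `MDPStep` carries only the adjacent context, a training example covers one local transition rather than the full chain, which is where the token savings come from.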
With the above design, our training focuses on three core objectives:
1. Learning to generate hybrid actions (text and image edits) that drive reasoning progression.
2. Predicting the next state summary given the current state and action.
3. Estimating rewards that reflect task completion and reasoning quality.
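One plausible way the three objectives combine is as a weighted sum, sketched below in plain Python with a toy cross-entropy. The weights and helper names are illustrative assumptions, not the actual training recipe.

```python
import math

def cross_entropy(probs, target_idx):
    """Toy token-level cross-entropy: -log p(target)."""
    return -math.log(probs[target_idx])

def unicot_loss(action_probs, action_tgt,   # objective 1: hybrid action
                state_probs, state_tgt,     # objective 2: next-state summary
                pred_reward, true_reward,   # objective 3: reward estimation
                w_action=1.0, w_state=1.0, w_reward=0.5):
    """Weighted sum of the three training objectives listed above."""
    l_action = cross_entropy(action_probs, action_tgt)
    l_state = cross_entropy(state_probs, state_tgt)
    l_reward = (pred_reward - true_reward) ** 2  # simple MSE for the reward head
    return w_action * l_action + w_state * l_state + w_reward * l_reward
```

In practice each term would be a token-level loss over the model's text and image outputs; the sketch only shows how the three signals share one scalar objective.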
```bibtex
@misc{qin2025unicot,
  title={Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision},
  author={Luozheng Qin and Jia Gong and Yuqing Sun and Tianjiao Li and Mengping Yang and Xiaomeng Yang and Chao Qu and Zhiyu Tan and Hao Li},
  year={2025},
  eprint={2508.05606},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.05606},
}
```