
✅ 2025.07.29 We released UniCoT-7B-MoT, which extends the Bagel-7B-MoT model to perform text-to-image generation with a self-check (self-reflection) reasoning mechanism.
✅ 2025.08.08 We released the Uni-CoT v0.1 technical report on arXiv and in this GitHub repository.
🔥 We are still working on this project to bring more kinds of Chain-of-Thought (CoT) mechanisms into the unified model, so please stay tuned!
Chain-of-Thought (CoT) reasoning has significantly enhanced LLM performance on complex text tasks by encouraging interpretable, step-by-step problem solving. However, extending this paradigm to multimodal tasks presents new challenges. In vision-language scenarios, human cognition depends on understanding how visual states evolve over time and on inferring causality and planning from object movements, spatial interactions, and transformations, abilities that are critical for physical reasoning, visual planning, and story comprehension.
To bridge this gap, we introduce the Unified Chain-of-Thought (Uni-CoT) framework, designed to empower Multimodal Large Language Models (MLLMs) to perform structured and interpretable reasoning across both text and vision. Uni-CoT first decomposes a given multimodal task into simple, modular steps and then processes each step either sequentially or in parallel, as illustrated below. This enables more systematic and scalable reasoning across modalities.
Specifically, the Uni-CoT reasoning pipeline consists of four key components (a minimal control-flow sketch follows the list):
1. Planning: Decompose the complex task into a sequence of simpler, manageable subtasks.
2. Subtask Execution: Execute each subtask using the unified model with step-by-step reasoning.
3. Self-Check: After completing each subtask, perform a validation check to ensure the intermediate result aligns with the intended sub-goal.
4. Final Result: Aggregate the validated subtask results to generate the final output.
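The four stages above can be summarized in a short Python sketch. The callable interface below (`plan`, `execute`, `check`, `aggregate`) is a hypothetical stand-in for calls into the unified model, and the retry budget is an assumption for illustration, not the released implementation.

```python
from typing import Any, Callable, List

def uni_cot(
    plan: Callable[[str], List[str]],            # task -> list of subtask prompts
    execute: Callable[[str, List[Any]], Any],    # (subtask, prior results) -> step result (text + image)
    check: Callable[[str, Any], bool],           # (subtask, result) -> does it pass the self-check?
    aggregate: Callable[[str, List[Any]], Any],  # (task, results) -> final output
    task: str,
    max_retries: int = 2,                        # assumed retry budget per subtask
) -> Any:
    """Control-flow sketch of the Uni-CoT pipeline over a hypothetical model interface."""
    subtasks = plan(task)                        # 1. Planning: decompose the task
    results: List[Any] = []
    for subtask in subtasks:
        result = None
        for _ in range(max_retries + 1):
            result = execute(subtask, results)   # 2. Subtask execution with step-by-step reasoning
            if check(subtask, result):           # 3. Self-check against the intended sub-goal
                break
        results.append(result)
    return aggregate(task, results)              # 4. Final result: aggregate validated outputs
```

The self-check loop is what allows a subtask result to be revised before it propagates into later steps.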
With these designs, our Uni-CoT framework aims to enable unified large models to tackle a wide range of challenging multimodal applications, including:
1. Highly reliable image generation/editing;
2. Visual planning;
3. Geometric and physical reasoning.
| Model | Culture↑ | Time↑ | Space↑ | Biology↑ | Physics↑ | Chemistry↑ | Overall↑ |
|---|---|---|---|---|---|---|---|
| Janus | 0.16 | 0.26 | 0.35 | 0.28 | 0.30 | 0.14 | 0.23 |
| MetaQuery | 0.56 | 0.55 | 0.62 | 0.49 | 0.63 | 0.41 | 0.55 |
| Bagel | 0.76 | 0.69 | 0.75 | 0.65 | 0.75 | 0.58 | 0.70 |
| Uni-CoT | 0.75 | 0.66 | 0.78 | 0.70 | 0.78 | 0.71 | 0.73 |
| GPT-4o | 0.81 | 0.71 | 0.89 | 0.83 | 0.79 | 0.74 | 0.80 |
We adapt the unified Bagel-7B-MoT model to perform joint text and image generation in support of Uni-CoT-style multimodal reasoning. As a first step, we fine-tune the model using its native interleaved text-image training paradigm. While this naïve adaptation enables the model to learn basic Uni-CoT behaviors, we observe significant challenges when scaling to complex reasoning chains involving multiple image-text steps.
A primary bottleneck lies in the elevated complexity introduced by visual reasoning. Unlike text-only reasoning, where each step typically consumes 512–1,024 tokens, Uni-CoT requires generating both a reasoning text and a corresponding image per step. Synthesizing an image via the VAE-based representation consumes ~4,096 tokens, and re-encoding that image with the ViT-based representation for understanding incurs an additional ~4,900 tokens, resulting in nearly 9,000 image tokens per reasoning step. This substantial overhead significantly increases the computational cost of training and inference. As the reasoning chain grows, the model struggles to converge and generalize, ultimately limiting its performance on complex multimodal tasks.
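As a rough illustration of why chain length matters, the back-of-the-envelope calculation below combines the approximate token counts quoted above; the linear-growth loop models a naïve setup that keeps every previous step in context.

```python
# Back-of-the-envelope token budget per Uni-CoT reasoning step, using the
# approximate figures quoted above (illustrative only).
VAE_IMAGE_TOKENS = 4096   # synthesizing the image via the VAE-based representation
VIT_IMAGE_TOKENS = 4900   # re-encoding the image via the ViT-based representation
TEXT_TOKENS = 1024        # upper end of a typical text reasoning step

image_tokens = VAE_IMAGE_TOKENS + VIT_IMAGE_TOKENS   # 8,996 -> "nearly 9,000"
tokens_per_step = image_tokens + TEXT_TOKENS

# A naive setup that keeps every previous step in context grows linearly:
for n_steps in (2, 4, 8):
    print(f"{n_steps} steps -> ~{n_steps * tokens_per_step:,} context tokens")
```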
To mitigate the complexity introduced by long multimodal reasoning chains, we reformulate the Uni-CoT process as a Markov Decision Process (MDP), where each step depends solely on the current state. Concretely, we model each reasoning step as a discrete MDP node that depends only on the preceding step and the task instruction. This formulation lets the model focus on learning local transition dynamics between adjacent nodes, rather than capturing dependencies across the entire reasoning chain, as shown below. This design choice significantly reduces computational overhead and improves training efficiency.
Specifically, each MDP node is defined by the following components (a minimal data-structure sketch follows the list):
- State \(s_t\): The current context, i.e., the outcome of the last reasoning step, including both text and images.
- Action \(a_t\): A hybrid operation that generates editing instructions and performs the corresponding image edits.
- Next State \(s_{t+1}\): The updated context resulting from the applied action, including the edited image and a textual summary of it.
- Reward \(r_t\): A textual conclusion or scalar score that quantifies the alignment between the outcome and the task objective.
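For concreteness, here is a minimal sketch of one MDP node as a Python data structure. The field names and types are illustrative assumptions (in practice images may be tensors, latents, or file paths), not the repository's actual data format.

```python
from dataclasses import dataclass
from typing import Any

# Minimal sketch of one Uni-CoT MDP node; fields are illustrative assumptions.

@dataclass
class State:
    text: str               # textual context from the last reasoning step
    image: Any              # visual context from the last reasoning step

@dataclass
class Action:
    edit_instruction: str   # generated editing instruction
    edited_image: Any       # result of applying the corresponding image edit

@dataclass
class MDPNode:
    instruction: str        # the overall task instruction (always conditioned on)
    state: State            # s_t: current context
    action: Action          # a_t: hybrid text + image-edit operation
    next_state: State       # s_{t+1}: edited image plus its textual summary
    reward: Any             # r_t: textual conclusion or scalar score of goal alignment
```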
With the above design, our training focuses on three core objectives, sketched below:
1. Learning to generate hybrid actions (text and image edits) that drive reasoning progression.
2. Predicting the next state summary given the current state and action.
3. Estimating rewards that reflect task completion and reasoning quality.
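The sketch below shows one way these three objectives could be combined into a single training loss. The `action_loss`, `next_state_loss`, and `reward_loss` methods and the loss weights are hypothetical placeholders, not the released training recipe.

```python
# Hypothetical combination of the three Uni-CoT training objectives.
# The model methods and loss weights below are illustrative placeholders.

def uni_cot_loss(model, batch, w_action=1.0, w_state=1.0, w_reward=0.5):
    # 1. Hybrid action generation: editing instruction (text) + image edit.
    action_loss = model.action_loss(batch["state"], batch["instruction"], batch["action"])
    # 2. Next-state summary prediction given the current state and action.
    state_loss = model.next_state_loss(batch["state"], batch["action"], batch["next_state"])
    # 3. Reward estimation reflecting task completion and reasoning quality.
    reward_loss = model.reward_loss(batch["next_state"], batch["instruction"], batch["reward"])
    return w_action * action_loss + w_state * state_loss + w_reward * reward_loss
```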
@misc{Uni-CoT,
  author       = {SAIS-FUXI},
  title        = {Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision},
  howpublished = {\url{https://github.com/Fr0zenCrane/UniCoT}},
  year         = {2025},
  note         = {Accessed: 2025-07-28}
}