✅ 2025.03.13 We released the paper, model, dataset, inference code and project page of Cockatiel.
🔥 We are still working on the remaining code, which should be released within the next few days.
Video Detailed Captioning (VDC) is a crucial task for vision-language bridging, enabling fine-grained descriptions of complex video content. In this paper, we first comprehensively benchmark current state-of-the-art approaches and systematically identify two critical limitations: biased capability towards specific captioning aspects and misalignment with human preferences. To address these deficiencies, we propose Cockatiel, a novel three-stage training pipeline that ensembles synthetic and human-aligned training to improve VDC performance. In the first stage, we derive a scorer from a meticulously annotated dataset to select synthetic captions that perform well on certain fine-grained dimensions of video-caption alignment and are preferred by humans, while disregarding the rest. We then train Cockatiel-13B on this curated dataset to infuse it with the assembled model strengths and human preferences. Finally, we distill Cockatiel-8B from Cockatiel-13B for ease of use. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method: we not only set new state-of-the-art performance on VDCSCORE in a dimension-balanced way but also surpass leading alternatives on human preference by a large margin, as shown by the human evaluation results.
As illustrated in the figure above, the core of the Cockatiel captioner is a human-aligned caption scorer, which assesses the training value of each candidate synthesized caption from the perspective of dimension-specific video-caption alignment and human preference. In this way, we avoid the impairment introduced by the synthetic nature of our data and align the captions with human preferences, eventually bootstrapping model performance on VDC and encouraging the generation of human-preferred captions. However, to the best of our knowledge, no public model or training dataset currently suits this need, so we build our own. Specifically, we meticulously annotate a dataset of structured human preference scores on detailed video captions and fine-tune VILA-v1.5-13B on it.
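To make the scorer's role concrete, below is a minimal sketch of how per-dimension judgments could be collapsed into a single training-value score. The dimension names, score range, and uniform averaging are illustrative assumptions; only the idea of combining dimension-specific video-caption alignment with human preference comes from the description above.

```python
from typing import Dict

# Assumed fine-grained alignment dimensions plus a human-preference judgment;
# the actual dimensions are defined by our annotation scheme, not by this sketch.
DIMENSIONS = ["object", "action", "background", "camera", "human_preference"]


def aggregate_scores(dim_scores: Dict[str, float]) -> float:
    """Collapse per-dimension scores (assumed to lie in [0, 1]) into one
    training-value score by simple averaging; the real scorer may weight
    dimensions differently."""
    missing = [d for d in DIMENSIONS if d not in dim_scores]
    if missing:
        raise ValueError(f"missing scores for dimensions: {missing}")
    return sum(dim_scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


# Example: a candidate caption judged strong on objects and actions but weaker
# on camera description and overall human preference.
example = {
    "object": 0.9,
    "action": 0.85,
    "background": 0.8,
    "camera": 0.6,
    "human_preference": 0.7,
}
print(aggregate_scores(example))  # -> 0.77
```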
To infuse VDC models with captioning knowledge on every fine-grained dimension of video-caption alignment, as well as human preferences, we devise a three-stage training pipeline that implements the proposed ensembling of synthetic and human-preferenced training while meeting common engineering needs. Specifically, we curate data using a scorer-based selection policy with a threshold, which assesses the training value of captions generated by three base models: LLaVA-Video-7B, VILA-v1.5-13B, and Aria3.5Bx8. The scorer rates each candidate caption, and only the highest-scoring one is kept for training, provided it exceeds the preset threshold (see the sketch below). With this rejection-sampling procedure for synthetic data, we train our Cockatiel-13B captioner on the curated data and further distill Cockatiel-8B from Cockatiel-13B.
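The following is a small sketch of this threshold-based rejection-sampling policy. The function names `generate_caption` and `score_caption`, as well as the threshold value, are placeholders for illustration and are not the released API.

```python
# Hypothetical sketch of the scorer-based selection policy described above.
BASE_MODELS = ["LLaVA-Video-7B", "VILA-v1.5-13B", "Aria3.5Bx8"]
SCORE_THRESHOLD = 0.75  # assumed value; the paper's preset threshold may differ


def select_training_caption(video, generate_caption, score_caption):
    """Return the best candidate caption for `video`, or None if rejected.

    `generate_caption(model_name, video)` and `score_caption(video, caption)`
    stand in for the three base captioners and the human-aligned scorer.
    """
    candidates = [generate_caption(m, video) for m in BASE_MODELS]
    scored = [(score_caption(video, c), c) for c in candidates]
    best_score, best_caption = max(scored, key=lambda pair: pair[0])

    # Keep only the single highest-scoring caption, and only if it clears the
    # preset threshold; otherwise the video contributes no training pair.
    if best_score >= SCORE_THRESHOLD:
        return best_caption
    return None


def build_training_set(videos, generate_caption, score_caption):
    """Apply the rejection-sampling policy over a whole video collection."""
    dataset = []
    for video in videos:
        caption = select_training_caption(video, generate_caption, score_caption)
        if caption is not None:
            dataset.append({"video": video, "caption": caption})
    return dataset
```

The design point is that each video contributes at most one caption, chosen by the scorer rather than by the base model's identity, so the curated set inherits the complementary strengths of all three captioners while filtering out low-value synthetic captions.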
We provide some specific comparison cases between Cockatiel-13B and leading VDC models in the figure below. For more detailed comparisons and additional quantitative and qualitative results, please refer to our paper.
@misc{qin2025cockatielensemblingsynthetichuman,
title={Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption},
author={Luozheng Qin and Zhiyu Tan and Mengping Yang and Xiaomeng Yang and Hao Li},
year={2025},
eprint={2503.09279},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.09279},
}