EvalAlign: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models

Zhiyu Tan¹, Xiaomeng Yang², Luozheng Qin², Mengping Yang², Cheng Zhang³, Hao Li¹†
¹Fudan University, ²Shanghai Academy of AI for Science, ³Carnegie Mellon University
† Corresponding author & Project lead


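Each cell gives the score followed by the model's rank for that column in parentheses.

| Model | Image Faithfulness (Human) | Image Faithfulness (EvalAlign) | Text-image Alignment (Human) | Text-image Alignment (EvalAlign) |
|---|---|---|---|---|
| PixArt-XL-2-1024-MS | 2.2848 (1) | 1.6415 (1) | 5.1100 (5) | 5.3100 (7) |
| Dreamlike Photoreal v2.0 | 2.0070 (2) | 1.4522 (4) | 4.5600 (17) | 4.9800 (13) |
| SDXL Refiner v1.0 | 1.9229 (3) | 1.6072 (2) | 5.2100 (3) | 5.4000 (3) |
| SDXL v1.0 | 1.8136 (4) | 1.4675 (3) | 5.0300 (8) | 5.3500 (4) |
| Wuerstchen | 1.7837 (5) | 1.4279 (5) | 4.8700 (9) | 5.1700 (5) |
| LCM SDXL | 1.6910 (6) | 1.3391 (7) | 5.1800 (4) | 5.3300 (6) |
| Openjourney | 1.6667 (7) | 1.1750 (10) | 4.8300 (10) | 4.9200 (15) |
| Safe SD MAX | 1.6491 (8) | 1.2175 (8) | 4.3100 (23) | 4.5900 (23) |
| LCM LoRA SDXL | 1.6387 (9) | 1.3833 (6) | 5.0600 (7) | 5.2700 (8) |
| Safe SD STRONG | 1.6308 (10) | 1.1466 (11) | 4.6000 (16) | 4.8300 (17) |
| Safe SD MEDIUM | 1.6275 (11) | 1.1298 (15) | 4.4000 (21) | 4.5600 (24) |
| Safe SD WEAK | 1.6078 (12) | 1.1188 (17) | 4.5300 (18) | 4.7100 (20) |
| SD v2.1 | 1.5524 (13) | 1.1094 (18) | 4.8000 (11) | 5.0700 (11) |
| SD v2.0 | 1.5277 (14) | 1.1300 (14) | 4.6400 (14) | 5.0100 (12) |
| Openjourney v2 | 1.5000 (15) | 0.9956 (20) | 4.1500 (24) | 4.6500 (22) |
| Redshift Diffusion | 1.4733 (16) | 1.1382 (12) | 4.3500 (22) | 4.6700 (21) |
| Dreamlike Diffusion v1.0 | 1.4652 (17) | 1.2052 (9) | 4.6600 (13) | 5.1500 (10) |
| SD v1.5 | 1.4417 (18) | 1.1362 (13) | 4.4500 (20) | 4.9000 (16) |
| IF-I-XL v1.0 | 1.3808 (19) | 0.9221 (22) | 5.4500 (1) | 5.5300 (1) |
| SD v1.4 | 1.3592 (20) | 0.9511 (21) | 4.5200 (19) | 4.7600 (19) |
| Vintedois Diffusion v0.1 | 1.3562 (21) | 1.0797 (19) | 4.6200 (15) | 4.9500 (14) |
| IF-I-L v1.0 | 1.2635 (22) | 0.8814 (23) | 5.2300 (2) | 5.4500 (2) |
| MultiFusion | 1.2372 (23) | 1.1298 (16) | 4.6800 (12) | 4.8000 (18) |
| IF-I-M v1.0 | 1.0135 (24) | 0.7928 (24) | 5.0800 (6) | 5.2200 (9) |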
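
One way to summarize how closely the EvalAlign ordering tracks the human ordering is a rank correlation. The sketch below is illustrative only: it uses the per-model ranks that accompany each score in the table above, copied in row order, and computes Spearman's rho with the standard tie-free formula. Choosing Spearman's rho is an assumption made here for illustration; this page does not state which agreement statistic the authors report. The helper `spearman_from_ranks` is a hypothetical name, not part of any EvalAlign code.

```python
# Ranks copied from the table above, in row order
# (PixArt-XL-2-1024-MS first, IF-I-M v1.0 last).
# Spearman's rho is an illustrative choice, not necessarily the authors' metric.
faith_human = list(range(1, 25))  # rows are already sorted by human faithfulness
faith_evalalign = [1, 4, 2, 3, 5, 7, 10, 8, 6, 11, 15, 17,
                   18, 14, 20, 12, 9, 13, 22, 21, 19, 23, 16, 24]
align_human = [5, 17, 3, 8, 9, 4, 10, 23, 7, 16, 21, 18,
               11, 14, 24, 22, 13, 20, 1, 19, 15, 2, 12, 6]
align_evalalign = [7, 13, 3, 4, 5, 6, 15, 23, 8, 17, 24, 20,
                   11, 12, 22, 21, 10, 16, 1, 19, 14, 2, 18, 9]


def spearman_from_ranks(a, b):
    """Spearman's rho for two tie-free rankings of the same items."""
    if len(a) != len(b):
        raise ValueError("rankings must have the same length")
    n = len(a)
    # Sum of squared rank differences across all items.
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n * n - 1))


rho_faithfulness = spearman_from_ranks(faith_human, faith_evalalign)
rho_alignment = spearman_from_ranks(align_human, align_evalalign)
print(f"Image faithfulness rho:   {rho_faithfulness:.4f}")
print(f"Text-image alignment rho: {rho_alignment:.4f}")
```

With the ranks as printed in the table, this gives roughly 0.88 for image faithfulness and 0.92 for text-image alignment. Values near 1 mean the two rankings order the models similarly.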

Acknowledgement

The project page template is borrowed from DreamBooth.