EvalAlign: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models

Zhiyu Tan¹, Xiaomeng Yang², Luozheng Qin², Mengping Yang², Cheng Zhang³, Hao Li¹†
¹Fudan University, ²Shanghai Academy of AI for Science, ³Carnegie Mellon University
†Corresponding author & Project lead


| Model | Image Faithfulness (Human) | Image Faithfulness (EvalAlign) | Text-image Alignment (Human) | Text-image Alignment (EvalAlign) |
|---|---|---|---|---|
| PixArt-XL-2-1024-MS | 2.2848 | 1.6415 | 5.1100 | 5.3100 |
| Dreamlike Photoreal v2.0 | 2.0070 | 1.4522 | 4.5600 | 4.9800 |
| SDXL Refiner v1.0 | 1.9229 | 1.6072 | 5.2100 | 5.4000 |
| SDXL v1.0 | 1.8136 | 1.4675 | 5.0300 | 5.3500 |
| Wuerstchen | 1.7837 | 1.4279 | 4.8700 | 5.1700 |
| LCM SDXL | 1.6910 | 1.3391 | 5.1800 | 5.3300 |
| Openjourney | 1.6667 | 1.1750 | 4.8300 | 4.9200 |
| Safe SD MAX | 1.6491 | 1.2175 | 4.3100 | 4.5900 |
| LCM LoRA SDXL | 1.6387 | 1.3833 | 5.0600 | 5.2700 |
| Safe SD STRONG | 1.6308 | 1.1466 | 4.6000 | 4.8300 |
| Safe SD MEDIUM | 1.6275 | 1.1298 | 4.4000 | 4.5600 |
| Safe SD WEAK | 1.6078 | 1.1188 | 4.5300 | 4.7100 |
| SD v2.1 | 1.5524 | 1.1094 | 4.8000 | 5.0700 |
| SD v2.0 | 1.5277 | 1.1300 | 4.6400 | 5.0100 |
| Openjourney v2 | 1.5000 | 0.9956 | 4.1500 | 4.6500 |
| Redshift Diffusion | 1.4733 | 1.1382 | 4.3500 | 4.6700 |
| Dreamlike Diffusion v1.0 | 1.4652 | 1.2052 | 4.6600 | 5.1500 |
| SD v1.5 | 1.4417 | 1.1362 | 4.4500 | 4.9000 |
| IF-I-XL v1.0 | 1.3808 | 0.9221 | 5.4500 | 5.5300 |
| SD v1.4 | 1.3592 | 0.9511 | 4.5200 | 4.7600 |
| Vintedois Diffusion v0.1 | 1.3562 | 1.0797 | 4.6200 | 4.9500 |
| IF-I-L v1.0 | 1.2635 | 0.8814 | 5.2300 | 5.4500 |
| MultiFusion | 1.2372 | 1.1298 | 4.6800 | 4.8000 |
| IF-I-M v1.0 | 1.0135 | 0.7928 | 5.0800 | 5.2200 |
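
One way a reader might check how closely the EvalAlign columns in the table above track the human columns is to compute a rank correlation between them. The following is a minimal sketch, not the authors' evaluation protocol: the function names (`average_ranks`, `pearson`, `spearman`) are hypothetical helpers written for illustration, Spearman correlation is an assumed choice of agreement measure, and only the first five Image Faithfulness rows are copied in as sample input.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Sample input: Image Faithfulness scores for the first five table rows
# (PixArt-XL-2-1024-MS through Wuerstchen), copied from the table.
human = [2.2848, 2.0070, 1.9229, 1.8136, 1.7837]
evalalign = [1.6415, 1.4522, 1.6072, 1.4675, 1.4279]

if __name__ == "__main__":
    print(f"Spearman rho (5-row sample): {spearman(human, evalalign):.4f}")
```

To cover the whole leaderboard, the two lists would be extended with the remaining rows; the same functions apply unchanged to the Text-image Alignment columns.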

Acknowledgement

The project page template is borrowed from DreamBooth.