¹OMRON SINIC X  ²The University of Osaka
*Equal contribution. Kuniaki serves as the project lead, while Risa is responsible for dataset construction.
🔥 [2025-11-26] Introducing AlignBench, which evaluates VLMs' ability to assess text–image alignment! 🚀
Assessing image–text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image–text alignment by evaluating detailed image–caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance.
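As a rough illustration of the evaluation protocol, the sketch below scores every annotated sentence with a detector and reports AUROC against the human correctness labels. It is a minimal sketch only: the `items`/`sentences`/`label` field names and the `detector(image, text)` interface are assumptions for illustration, not the released data format or API.

```python
# Minimal sketch of sentence-level alignment evaluation (assumed data layout).
# Each benchmark item pairs an image with a generated caption whose sentences
# carry binary correctness labels (1 = aligned, 0 = hallucinated).
from sklearn.metrics import roc_auc_score

def evaluate_detector(items, detector):
    """Score every annotated sentence with the detector and return AUROC
    against the human correctness labels."""
    labels, scores = [], []
    for item in items:
        for sentence in item["sentences"]:
            # The detector returns a scalar correctness score for the
            # sentence given the image, e.g. a probability of "aligned".
            scores.append(detector(item["image"], sentence["text"]))
            labels.append(sentence["label"])
    # AUROC is threshold-free, so detectors with differently calibrated
    # scores can still be compared directly.
    return roc_auc_score(labels, scores)
```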
Examples of hallucinated sentences in AlignBench. The hallucinated portions are often subtle, requiring fine-grained image–text alignment ability to detect them.
Statistics of sentence-level correctness annotations in AlignBench. AlignBench contains a large number of annotated sentences, enough for benchmarking models. Sentences with an unknown label are excluded.
Left: Ratio of incorrect sentences by position; all captioners make fewer errors at the first position. Different colors indicate different positions. Right: Number of unaligned sentences per category; most mistakes occur in attributes and text.
AUROC results across VLMs. The best performance within the open-source and closed-source groups is highlighted with a blue background, and the best model within each model family is marked in bold.
The size of each plot marker indicates the model's parameter count. Left: MMMU performance of each captioner (X-axis) vs. AUROC measured by GPT-5-mini on that captioner's outputs (Y-axis). Advanced captioners tend to produce hard-to-detect hallucinations. Right: MMMU (X-axis) vs. AUROC (Y-axis) for each detector. Detectors with better MMMU performance tend to perform better on AlignBench.
Detectors show positional bias in scoring. We average the detectors' correctness scores (Y-axis) by sentence position (X-axis) and visualize the results using GPT-4o (Left) and Llama-4 (Right) as detectors. Both detectors assign higher scores to sentences appearing near the beginning of the output, even though neither is given any positional information during inference.
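The positional-bias analysis amounts to grouping a detector's per-sentence scores by sentence position and averaging. A minimal sketch is below, assuming `records` holds `(position, score)` pairs collected while running a detector over the benchmark; the data layout is illustrative.

```python
# Sketch of the positional-bias analysis: average a detector's correctness
# scores by the position of each sentence within its caption.
from collections import defaultdict

def mean_score_by_position(records):
    """records: iterable of (position, score) pairs for one detector."""
    sums, counts = defaultdict(float), defaultdict(int)
    for position, score in records:
        sums[position] += score
        counts[position] += 1
    return {pos: sums[pos] / counts[pos] for pos in sorted(sums)}

# A downward trend in the returned means indicates that the detector
# scores early sentences higher than later ones.
```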
Detectors struggle to detect their own hallucinations. Left: self- and cross-evaluation results. AUROC scores for each captioner (columns), normalized by the average AUROC of each detector (rows); diagonal entries show self-evaluation. Right: using GPT-4o as the detector, its correctness scores averaged by sentence position. Blue and red lines show scores for correct and incorrect GPT-4o outputs; green shows scores for incorrect Llama-4 outputs.
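The left panel's normalization can be reproduced with a row-wise division: each detector's row of AUROC scores is divided by that detector's mean AUROC, so self- and cross-evaluation entries become comparable across detectors. The sketch below uses placeholder numbers, not the paper's results.

```python
# Sketch of the self-/cross-evaluation normalization: divide each detector's
# row of AUROC scores by that detector's average AUROC. Values are placeholders.
import numpy as np

def normalize_by_detector_mean(auroc):
    """auroc[i, j] = AUROC of detector i on captioner j's captions."""
    return auroc / auroc.mean(axis=1, keepdims=True)

auroc = np.array([[0.78, 0.84],   # detector A on captioners A and B
                  [0.81, 0.74]])  # detector B on captioners A and B
normalized = normalize_by_detector_mean(auroc)
# A diagonal entry below 1.0 means the detector scores its own captioner's
# outputs worse than the other models' outputs, i.e. self-preference is
# hurting its detection performance.
```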
Results of detector ensembling. Ensembling detectors' outputs improves performance in almost all cases. The gain or loss relative to the better of the two ensembled detectors is shown next to each score.
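A simple way to ensemble two detectors is at the score level: average their per-sentence correctness scores and evaluate as before. The exact ensembling rule used in the experiments may differ; averaging here is an assumption for illustration.

```python
# Sketch of a score-level ensemble of two detectors (averaging is an
# assumption; the paper's exact ensembling rule may differ).
from sklearn.metrics import roc_auc_score

def ensemble_auroc(labels, scores_a, scores_b):
    """labels: binary correctness labels; scores_a, scores_b: per-sentence
    scores from two detectors, aligned index by index."""
    averaged = [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]
    return roc_auc_score(labels, averaged)
```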
@article{saito2025alignbenchbenchmarkingfinegrainedimagetext,
  title={AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs},
  author={Kuniaki Saito and Risa Shinoda and Shohei Tanaka and Tosho Hirasawa and Fumio Okura and Yoshitaka Ushiku},
  year={2025},
  eprint={2511.20515},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.20515},
}