¹OMRON SINIC X  ²The University of Osaka
*Equal contribution. Kuniaki serves as the project lead, while Risa is responsible for dataset construction.
🔥 [2025-11-26] Introducing AlignBench, which evaluates VLMs' ability to assess text–image alignment! 🚀
Assessing image–text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image–text alignment by evaluating detailed image–caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance.
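As a rough illustration of the evaluation protocol, the sketch below scores every annotated sentence with a detector and reports AUROC against the human correctness labels. It is a minimal sketch only: the `items`/`sentences`/`label` field names and the `detector(image, text)` interface are assumptions for illustration, not the released data format or API.

```python
# Minimal sketch of sentence-level alignment evaluation (assumed data layout).
# Each benchmark item pairs an image with a generated caption whose sentences
# carry binary correctness labels (1 = aligned, 0 = hallucinated).
from sklearn.metrics import roc_auc_score

def evaluate_detector(items, detector):
    """Score every annotated sentence with the detector and return AUROC
    against the human correctness labels."""
    labels, scores = [], []
    for item in items:
        for sentence in item["sentences"]:
            # The detector returns a scalar correctness score for the
            # sentence given the image, e.g. a probability of "aligned".
            scores.append(detector(item["image"], sentence["text"]))
            labels.append(sentence["label"])
    # AUROC is threshold-free, so detectors with differently calibrated
    # scores can still be compared directly.
    return roc_auc_score(labels, scores)
```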
Examples of hallucinated sentences in AlignBench. The hallucinated portions are often subtle, requiring fine-grained image–text alignment ability to detect them.
Statistics of sentence-level correctness annotations in AlignBench. AlignBench contains a large number of annotated sentences, enough for benchmarking models. Sentences with an unknown label are excluded.
Left: Ratio of incorrect sentences by position; all captioners make fewer errors at the first position. Different colors indicate different positions. Right: Number of unaligned sentences per category; most mistakes occur in attributes and text.
AUROC results across VLMs. The best performance within the open-source and closed-source groups is highlighted with a blue background, and the best model within each model family is marked in bold.
The size of each plot marker indicates the model's parameter count. Left: MMMU performance of each captioner (X-axis) vs. AUROC measured by GPT-5-mini on that captioner's outputs (Y-axis). Advanced captioners tend to produce hard-to-detect hallucinations. Right: MMMU (X-axis) vs. AUROC (Y-axis) for each detector. Detectors with better MMMU performance tend to perform better on AlignBench.
Detectors show positional bias in scoring. We average the detectors' correctness scores (Y-axis) by sentence position (X-axis) and visualize the results using GPT-4o (Left) and Llama-4 (Right) as detectors. Both detectors assign higher scores to sentences appearing near the beginning of the output, even though neither is given any positional information during inference.
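The positional-bias analysis amounts to grouping a detector's per-sentence scores by sentence position and averaging. A minimal sketch is below, assuming `records` holds `(position, score)` pairs collected while running a detector over the benchmark; the data layout is illustrative.

```python
# Sketch of the positional-bias analysis: average a detector's correctness
# scores by the position of each sentence within its caption.
from collections import defaultdict

def mean_score_by_position(records):
    """records: iterable of (position, score) pairs for one detector."""
    sums, counts = defaultdict(float), defaultdict(int)
    for position, score in records:
        sums[position] += score
        counts[position] += 1
    return {pos: sums[pos] / counts[pos] for pos in sorted(sums)}

# A downward trend in the returned means indicates that the detector
# scores early sentences higher than later ones.
```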
Detectors struggle to detect their own hallucinations. Left: self- and cross-evaluation results. AUROC scores for each captioner (columns), normalized by the average AUROC of each detector (rows); diagonal entries show self-evaluation. Right: using GPT-4o as the detector, its correctness scores averaged by sentence position. Blue and red lines show scores for correct and incorrect GPT-4o outputs; green shows scores for incorrect Llama-4 outputs.
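The left panel's normalization can be reproduced with a row-wise division: each detector's row of AUROC scores is divided by that detector's mean AUROC, so self- and cross-evaluation entries become comparable across detectors. The sketch below uses placeholder numbers, not the paper's results.

```python
# Sketch of the self-/cross-evaluation normalization: divide each detector's
# row of AUROC scores by that detector's average AUROC. Values are placeholders.
import numpy as np

def normalize_by_detector_mean(auroc):
    """auroc[i, j] = AUROC of detector i on captioner j's captions."""
    return auroc / auroc.mean(axis=1, keepdims=True)

auroc = np.array([[0.78, 0.84],   # detector A on captioners A and B
                  [0.81, 0.74]])  # detector B on captioners A and B
normalized = normalize_by_detector_mean(auroc)
# A diagonal entry below 1.0 means the detector scores its own captioner's
# outputs worse than the other models' outputs, i.e. self-preference is
# hurting its detection performance.
```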
Results of detector ensembling. Ensembling detectors' outputs improves performance in almost all cases. The gain or loss relative to the better of the two ensembled detectors is shown next to each score.
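A simple way to ensemble two detectors is at the score level: average their per-sentence correctness scores and evaluate as before. The exact ensembling rule used in the experiments may differ; averaging here is an assumption for illustration.

```python
# Sketch of a score-level ensemble of two detectors (averaging is an
# assumption; the paper's exact ensembling rule may differ).
from sklearn.metrics import roc_auc_score

def ensemble_auroc(labels, scores_a, scores_b):
    """labels: binary correctness labels; scores_a, scores_b: per-sentence
    scores from two detectors, aligned index by index."""
    averaged = [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]
    return roc_auc_score(labels, averaged)
```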
@article{saito2025alignbenchbenchmarkingfinegrainedimagetext,
  title={AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs},
  author={Kuniaki Saito and Risa Shinoda and Shohei Tanaka and Tosho Hirasawa and Fumio Okura and Yoshitaka Ushiku},
  year={2025},
  eprint={2511.20515},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.20515},
}