¹The University of Osaka  ²Kyoto University  ³Tokyo Institute of Technology
⁴National Institute of Advanced Industrial Science and Technology (AIST)  ⁵Visual Geometry Group, University of Oxford
We present AgroBench (Agronomist AI Benchmark), designed to comprehensively evaluate vision-language models across 682 disease categories and 203 agricultural crop types over seven vision-language question-answer tasks. In the era of large-scale vision-language models (VLMs), AgroBench is a non-trivial effort: it covers far more crop and disease categories than prior work, with all annotations provided by experts, to establish a QA benchmark in the agricultural domain.
Here, we introduce AgroBench (Agronomist AI Benchmark), a benchmark for evaluating vision-language models (VLMs) across seven agricultural topics, covering key areas of agricultural engineering relevant to real-world farming. Unlike recent agricultural VLM benchmarks, AgroBench is annotated by expert agronomists. AgroBench covers an extensive range of categories, including 203 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities. Our evaluation on AgroBench reveals that VLMs have room for improvement in fine-grained identification tasks. Notably, in weed identification, most open-source VLMs perform close to random. With our wide range of topics and expert-annotated categories, we analyze the types of errors made by VLMs and suggest potential pathways for future VLM development.
Examples of labeled images for the disease identification (DID), pest identification (PID), and weed identification (WID) tasks. Our dataset includes 682 crop-disease pairs, 134 pest categories, and 108 weed categories. We prioritized collecting images from real farm settings.
AgroBench includes multiple topics with a diverse range of categories. Total accuracy is computed as the average of the per-task accuracies, so that tasks with different numbers of QA pairs contribute equally.
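As a rough illustration of this scoring scheme (a minimal sketch, not the released evaluation code; the task names and data layout below are assumptions), total accuracy is the unweighted mean of per-task accuracies rather than a pooled accuracy over all questions:

```python
# Sketch of macro-averaged accuracy: each task counts equally,
# regardless of how many QA pairs it contains.
from typing import Dict, List, Tuple

def task_accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (prediction, gold) pairs that match exactly."""
    if not pairs:
        return 0.0
    return sum(pred == gold for pred, gold in pairs) / len(pairs)

def total_accuracy(results_by_task: Dict[str, List[Tuple[str, str]]]) -> float:
    """Mean of per-task accuracies (macro average across tasks)."""
    per_task = [task_accuracy(pairs) for pairs in results_by_task.values()]
    return sum(per_task) / len(per_task)

# Hypothetical example with two tasks of different sizes.
results = {
    "weed_identification": [("A", "A"), ("B", "C")],             # accuracy 0.50
    "disease_identification": [("A", "A")] * 3 + [("B", "C")],   # accuracy 0.75
}
print(f"total accuracy = {total_accuracy(results):.3f}")  # (0.50 + 0.75) / 2 = 0.625
```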
(a) and (b) Crop management QA examples for white asparagus and asparagus, respectively; their different harvest timings lead to different correct answers. (c) and (d) Disease management QA examples for alfalfa bacterial leaf spot with initial and severe symptoms, respectively; the annotator adjusts the answer according to the severity of the symptoms.
We report results for Random Choice, Human Validation, four closed-source VLMs, and open-source VLMs. As a reference, human validation was conducted by 28 participants on a subset of 80 samples per task.
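For context on the Random Choice row, the sketch below shows how such a baseline can be simulated for multiple-choice QA (the four-option format, field names, and sampling are illustrative assumptions, not the paper's released code); its expected accuracy is 1 / number of options:

```python
# Simulated random-choice baseline for multiple-choice QA.
import random

def random_choice_accuracy(questions, seed: int = 0) -> float:
    """questions: list of dicts with 'options' (list[str]) and 'answer' (str)."""
    rng = random.Random(seed)
    correct = sum(rng.choice(q["options"]) == q["answer"] for q in questions)
    return correct / len(questions)

# Hypothetical four-option questions: expected accuracy is about 0.25.
questions = [{"options": ["A", "B", "C", "D"], "answer": "A"} for _ in range(10000)]
print(f"random-choice accuracy ~ {random_choice_accuracy(questions):.3f}")
```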
Baseline indicates results without chain-of-thought (CoT) prompting. In the one-shot, two-shot, and three-shot settings, we provide one, two, and three CoT examples per task, respectively, to guide the model.
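The sketch below shows one way such a k-shot CoT prompt could be assembled per task (the prompt wording, option formatting, and example content are illustrative assumptions, not the exact prompts used in the paper):

```python
# Sketch of k-shot CoT prompt construction: prepend k worked examples
# (question, reasoning, answer), then append the new question.
from typing import Dict, List

def build_cot_prompt(question: str, options: List[str],
                     cot_examples: List[Dict[str, str]]) -> str:
    """Build a prompt with k CoT examples; k = 0 reproduces the no-CoT baseline."""
    parts = []
    for ex in cot_examples:
        parts.append(f"Question: {ex['question']}\n"
                     f"Reasoning: {ex['reasoning']}\n"
                     f"Answer: {ex['answer']}\n")
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    parts.append(f"Question: {question}\n{opts}\n"
                 "Think step by step, then give the final option letter.")
    return "\n".join(parts)

# Hypothetical one-shot example for a weed identification question.
example = {
    "question": "Which weed is shown in the image?",
    "reasoning": "The leaves are lance-shaped with a prominent midrib, "
                 "consistent with broadleaf plantain.",
    "answer": "(B) Broadleaf plantain",
}
print(build_cot_prompt("Which weed is shown in the image?",
                       ["Common purslane", "Broadleaf plantain",
                        "Crabgrass", "Field bindweed"],
                       [example]))
```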
@article{shinoda2025agrobenchvisionlanguagemodelbenchmark,
  title={AgroBench: Vision-Language Model Benchmark in Agriculture},
  author={Risa Shinoda and Nakamasa Inoue and Hirokatsu Kataoka and Masaki Onishi and Yoshitaka Ushiku},
  year={2025},
  eprint={2507.20519},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.20519},
}