¹The University of Osaka  ²Kyoto University  ³Tokyo Institute of Technology
⁴National Institute of Advanced Industrial Science and Technology (AIST)  ⁵Visual Geometry Group, University of Oxford
We present AgroBench (Agronomist AI Benchmark), designed to comprehensively evaluate vision-language models across 682 disease categories and 203 agricultural crop types over seven vision-language question-answer tasks. In the era of large-scale vision-language models (VLMs), AgroBench is a non-trivial effort: it covers far more crop and disease categories than prior work, with all annotations provided by experts, to establish a QA benchmark in the agricultural domain.
Here, we introduce AgroBench (Agronomist AI Benchmark), a benchmark for evaluating vision-language models (VLMs) across seven agricultural topics, covering key areas of agricultural engineering relevant to real-world farming. Unlike recent agricultural VLM benchmarks, AgroBench is annotated by expert agronomists. AgroBench covers an extensive range of categories, including 203 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities. Our evaluation on AgroBench reveals that VLMs have room for improvement in fine-grained identification tasks. Notably, in weed identification, most open-source VLMs perform close to random. With our wide range of topics and expert-annotated categories, we analyze the types of errors made by VLMs and suggest potential pathways for future VLM development.
Examples of labeled images for the disease identification (DID), pest identification (PID), and weed identification (WID) tasks. Our dataset includes 682 crop-disease pairs, 134 pest categories, and 108 weed categories. We prioritized collecting images from real farm settings.
AgroBench includes multiple topics with a diverse range of categories. Total accuracy is computed as the average of the per-task accuracies, so that tasks with different numbers of QA pairs contribute equally.
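As a rough illustration of this scoring scheme (a minimal sketch, not the released evaluation code; the task names and data layout below are assumptions), total accuracy is the unweighted mean of per-task accuracies rather than a pooled accuracy over all questions:

```python
# Sketch of macro-averaged accuracy: each task counts equally,
# regardless of how many QA pairs it contains.
from typing import Dict, List, Tuple

def task_accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (prediction, gold) pairs that match exactly."""
    if not pairs:
        return 0.0
    return sum(pred == gold for pred, gold in pairs) / len(pairs)

def total_accuracy(results_by_task: Dict[str, List[Tuple[str, str]]]) -> float:
    """Mean of per-task accuracies (macro average across tasks)."""
    per_task = [task_accuracy(pairs) for pairs in results_by_task.values()]
    return sum(per_task) / len(per_task)

# Hypothetical example with two tasks of different sizes.
results = {
    "weed_identification": [("A", "A"), ("B", "C")],             # accuracy 0.50
    "disease_identification": [("A", "A")] * 3 + [("B", "C")],   # accuracy 0.75
}
print(f"total accuracy = {total_accuracy(results):.3f}")  # (0.50 + 0.75) / 2 = 0.625
```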
(a) and (b) Crop management QA examples for white asparagus and asparagus, respectively; their different harvest timings lead to different correct answers. (c) and (d) Disease management QA examples for alfalfa bacterial leaf spot with initial and severe symptoms, respectively; the annotator adjusts the answer according to the severity of the symptoms.
We report results for Random Choice, Human Validation, four closed-source VLMs, and open-source VLMs. As a reference, human validation was conducted by 28 participants on a subset of 80 samples per task.
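For context on the Random Choice row, the sketch below shows how such a baseline can be simulated for multiple-choice QA (the four-option format, field names, and sampling are illustrative assumptions, not the paper's released code); its expected accuracy is 1 / number of options:

```python
# Simulated random-choice baseline for multiple-choice QA.
import random

def random_choice_accuracy(questions, seed: int = 0) -> float:
    """questions: list of dicts with 'options' (list[str]) and 'answer' (str)."""
    rng = random.Random(seed)
    correct = sum(rng.choice(q["options"]) == q["answer"] for q in questions)
    return correct / len(questions)

# Hypothetical four-option questions: expected accuracy is about 0.25.
questions = [{"options": ["A", "B", "C", "D"], "answer": "A"} for _ in range(10000)]
print(f"random-choice accuracy ~ {random_choice_accuracy(questions):.3f}")
```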
Baseline indicates results without chain-of-thought (CoT) prompting. In the one-shot, two-shot, and three-shot settings, we provide one, two, and three CoT examples per task, respectively, to guide the model.
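The sketch below shows one way such a k-shot CoT prompt could be assembled per task (the prompt wording, option formatting, and example content are illustrative assumptions, not the exact prompts used in the paper):

```python
# Sketch of k-shot CoT prompt construction: prepend k worked examples
# (question, reasoning, answer), then append the new question.
from typing import Dict, List

def build_cot_prompt(question: str, options: List[str],
                     cot_examples: List[Dict[str, str]]) -> str:
    """Build a prompt with k CoT examples; k = 0 reproduces the no-CoT baseline."""
    parts = []
    for ex in cot_examples:
        parts.append(f"Question: {ex['question']}\n"
                     f"Reasoning: {ex['reasoning']}\n"
                     f"Answer: {ex['answer']}\n")
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    parts.append(f"Question: {question}\n{opts}\n"
                 "Think step by step, then give the final option letter.")
    return "\n".join(parts)

# Hypothetical one-shot example for a weed identification question.
example = {
    "question": "Which weed is shown in the image?",
    "reasoning": "The leaves are lance-shaped with a prominent midrib, "
                 "consistent with broadleaf plantain.",
    "answer": "(B) Broadleaf plantain",
}
print(build_cot_prompt("Which weed is shown in the image?",
                       ["Common purslane", "Broadleaf plantain",
                        "Crabgrass", "Field bindweed"],
                       [example]))
```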
@article{shinoda2025agrobenchvisionlanguagemodelbenchmark,
  title={AgroBench: Vision-Language Model Benchmark in Agriculture},
  author={Risa Shinoda and Nakamasa Inoue and Hirokatsu Kataoka and Masaki Onishi and Yoshitaka Ushiku},
  year={2025},
  eprint={2507.20519},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.20519},
}