
AgroBench: Vision-Language Model Benchmark in Agriculture

ICCV 2025

¹The University of Osaka  ²Kyoto University  ³Tokyo Institute of Technology

⁴National Institute of Advanced Industrial Science and Technology (AIST)  ⁵Visual Geometry Group, University of Oxford

⁶OMRON SINIC X


We present AgroBench (Agronomist AI Benchmark), a benchmark designed to comprehensively evaluate vision-language models (VLMs) on 7 question-answer tasks spanning 682 disease categories across 203 agricultural crop types. In the era of large-scale VLMs, AgroBench stands out for its far broader coverage of crop and disease categories, all annotated by experts, for establishing QA benchmarks in the agricultural domain.

🔔News

Introduction

Here, we introduce AgroBench (Agronomist AI Benchmark), a benchmark for evaluating VLMs across seven agricultural topics, covering key areas in agricultural engineering that are relevant to real-world farming. Unlike recent agricultural VLM benchmarks, AgroBench is annotated by expert agronomists. AgroBench covers a state-of-the-art range of categories, including 203 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities. Our evaluation on AgroBench reveals that VLMs have room for improvement in fine-grained identification tasks. Notably, in weed identification, most open-source VLMs perform close to random. With our wide range of topics and expert-annotated categories, we analyze the types of errors made by VLMs and suggest potential pathways for future VLM development.

AgroBench

Overview


Examples of labeled images for the disease (DID), pest (PID), and weed (WID) identification tasks. Our dataset includes 682 crop-disease pairs, 134 pest categories, and 108 weed categories. We prioritized collecting images from real farm settings.
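For orientation, the sketch below shows one way a single identification QA item could be represented when loading the benchmark. The class name, field names, file path, and option texts are hypothetical illustrations, not the dataset's actual schema.

from dataclasses import dataclass
from typing import List

@dataclass
class AgroQA:
    """One multiple-choice QA item tied to a field image (illustrative schema)."""
    task: str          # e.g. "DID" (disease), "PID" (pest), or "WID" (weed) identification
    image_path: str    # path to the associated farm-setting photograph
    question: str
    options: List[str]
    answer: str        # expert-annotated correct option

sample = AgroQA(
    task="WID",
    image_path="images/weed_0001.jpg",  # hypothetical path
    question="Which weed species is shown in the image?",
    options=["Crabgrass", "Pigweed", "Clover", "Dandelion"],
    answer="Crabgrass",
)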

Seven benchmark tasks in AgroBench


AgroBench includes multiple topics with a diverse range of categories. The total accuracy is calculated as the average of the per-task accuracies, to mitigate the effect of differing numbers of QAs across tasks.
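For concreteness, here is a minimal sketch of this macro-averaged scoring, assuming per-task lists of predicted and ground-truth answers; the function and variable names are illustrative, not the official evaluation code.

from typing import Dict, List

def task_accuracy(predictions: List[str], answers: List[str]) -> float:
    # Fraction of questions answered correctly within one task.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def total_accuracy(results_by_task: Dict[str, Dict[str, List[str]]]) -> float:
    # Macro-average: mean of per-task accuracies, so tasks with many QAs
    # do not dominate tasks with few.
    per_task = [task_accuracy(r["predictions"], r["answers"])
                for r in results_by_task.values()]
    return sum(per_task) / len(per_task)

# Illustrative usage with two hypothetical tasks of different sizes.
results = {
    "DID": {"predictions": ["A", "B", "C", "D"], "answers": ["A", "B", "C", "A"]},
    "WID": {"predictions": ["B", "B"], "answers": ["B", "C"]},
}
print(total_accuracy(results))  # (0.75 + 0.5) / 2 = 0.625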

Example Annotations


(a) and (b) Crop-management QA examples for white asparagus and asparagus, respectively; the difference in harvest timing leads to different correct answers. (c) and (d) Disease-management QA examples for alfalfa bacterial leaf spot with initial and severe symptoms, respectively; the annotator changes the answer according to the severity of the symptoms.

Experiment Results

Main Results


We provide results for random choice, human validation, four closed-source VLMs, and open-source VLMs. As a reference, human validation was conducted by 28 participants on a subset of 80 samples per task.

Chain of Thought


Baseline indicates results without CoT. In the one-shot, two-shot, and three-shot settings, we provide one, two, and three CoT examples per task, respectively, to guide the model.
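As an illustration of this few-shot setup, the sketch below prepends k worked CoT exemplars to a target question. The exemplar schema, field names, and example content are assumptions made for illustration, not the exact prompts used in the paper.

def build_cot_prompt(exemplars, question, options):
    # Prepend k worked CoT exemplars (k = 1, 2, or 3) to the target question.
    # Each exemplar is a dict with 'question', 'options', 'reasoning', 'answer'.
    blocks = []
    for ex in exemplars:
        blocks.append(
            f"Question: {ex['question']}\n"
            f"Options: {ex['options']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Answer: {ex['answer']}"
        )
    blocks.append(f"Question: {question}\nOptions: {options}\nAnswer with reasoning:")
    return "\n\n".join(blocks)

# One-shot example: a single hypothetical CoT exemplar guides the response format.
shot = {
    "question": "Which disease most likely causes the leaf spots shown on this tomato plant?",
    "options": "(A) Early blight (B) Powdery mildew (C) Mosaic virus (D) Healthy",
    "reasoning": "Concentric brown rings on the lower leaves are characteristic of early blight.",
    "answer": "(A) Early blight",
}
prompt = build_cot_prompt([shot], "Which weed species is shown in the image?",
                          "(A) Crabgrass (B) Pigweed (C) Clover (D) Dandelion")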

Error Examples

BibTeX


@article{shinoda2025agrobenchvisionlanguagemodelbenchmark,
  title={AgroBench: Vision-Language Model Benchmark in Agriculture},
  author={Risa Shinoda and Nakamasa Inoue and Hirokatsu Kataoka and Masaki Onishi and Yoshitaka Ushiku},
  year={2025},
  eprint={2507.20519},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.20519}
}