1 The University of Osaka  2 The University of Tokyo  3 Institute of Science Tokyo  4 OMRON SINIC X

Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology. While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA comprises (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP 2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all six retrieval directions among the three modalities (image-to-audio, audio-to-text, text-to-image, and their reverses) at three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding.
We develop BioVITA Bench, a species-level retrieval benchmark spanning the six cross-modal directions. Our benchmark enables comprehensive analysis from multimodal, ecological, and generalization perspectives.
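The six retrieval directions can be sketched as pairwise nearest-neighbor search in a shared embedding space. Below is a minimal illustration, assuming cosine-similarity retrieval and recall@k as the metric; the embeddings, species labels, and the `recall_at_k` helper are hypothetical and stand in for BioVITA Bench's actual data and evaluation protocol.

```python
import numpy as np

def recall_at_k(query, gallery, labels_q, labels_g, k=1):
    """Recall@k: fraction of queries whose top-k gallery neighbors
    (by cosine similarity) include an item with the correct label."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    topk = np.argsort(-(q @ g.T), axis=1)[:, :k]
    return (labels_g[topk] == labels_q[:, None]).any(axis=1).mean()

rng = np.random.default_rng(0)
species = np.arange(4).repeat(2)  # 4 hypothetical species, 2 samples each
emb = {m: rng.normal(size=(8, 64)) for m in ("image", "audio", "text")}

# All six ordered modality pairs: image-to-audio, audio-to-image, ...
scores = {(src, dst): recall_at_k(emb[src], emb[dst], species, species)
          for src in emb for dst in emb if src != dst}
```

The same procedure applies at the Family and Genus levels by swapping in coarser labels.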
We introduce BioVITA Train, a training dataset for VITA alignment. We curate 1.3 million audio clips and 2.3 million images with textual taxonomic annotations, covering 14k species and 34 ecological traits.
BioVITA Model consists of three encoders. Building upon BioCLIP 2, we train the audio encoder in Stage 1, and jointly train the audio and text encoders in Stage 2.
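The two-stage alignment described above can be illustrated with a symmetric InfoNCE objective, as in CLIP-style training. This is a minimal numpy sketch under stated assumptions: Stage 1 aligns audio with text embeddings, and Stage 2 adds an audio-image term; the exact loss composition, temperature, and which encoders are frozen in BioVITA may differ.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: matched pairs (row i of a, row i of b) are
    positives; all other rows in the batch serve as negatives."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    labels = np.arange(len(a))
    def ce(l):  # cross-entropy along rows
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 512))  # audio encoder outputs (trained)
text  = rng.normal(size=(8, 512))  # text embeddings (e.g., from BioCLIP 2)
image = rng.normal(size=(8, 512))  # image embeddings (e.g., from BioCLIP 2)

# Stage 1: align audio to text; Stage 2: jointly align audio to text and image.
stage1_loss = info_nce(audio, text)
stage2_loss = info_nce(audio, text) + info_nce(audio, image)
```

In practice the loss would be minimized by backpropagation through the trainable encoders; the numpy version only shows how the objective is computed.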
BioVITA effectively handles all six retrieval directions and significantly outperforms the tri-modal baselines. Stage 2, which incorporates visual information, further improves every direction by providing complementary cues for robust VITA alignment.
Despite encountering entirely novel taxa, BioVITA demonstrates robust generalization. Consistent with the results on seen species, the gain from Stage 1 to Stage 2 underscores the crucial role of the visual modality in enhancing generalization.
The left plot reports, among all species-level errors, the proportion of retrievals that fall within the correct genus; the right plot reports the proportion within the correct family. These error patterns suggest that the learned representations capture hierarchical taxonomic structure: even when species-level retrieval fails, mistakes tend to remain taxonomically close to the target.
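The error analysis above reduces to a simple conditional measurement: restrict to queries where the species prediction is wrong, then check how often the coarser taxon still matches. A minimal sketch with hypothetical labels (the arrays below are toy data, not BioVITA results):

```python
import numpy as np

# Hypothetical per-query retrieved vs. true taxa at two levels.
true_species = np.array([0, 1, 2, 3])
pred_species = np.array([0, 9, 8, 3])  # two species-level errors
true_genus   = np.array([0, 0, 1, 1])
pred_genus   = np.array([0, 0, 2, 1])  # one error stays in the correct genus

# Among species-level errors, fraction that keep the correct genus.
errors = pred_species != true_species
within_genus = (pred_genus[errors] == true_genus[errors]).mean()
```

The family-level proportion is computed the same way with family labels in place of genus labels.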
@article{shinoda2026biovita,
  title={BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment},
  author={Risa Shinoda and Kaede Shiohara and Nakamasa Inoue and Kuniaki Saito and Hiroaki Santo and Fumio Okura},
  year={2026},
  eprint={2603.23883},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.23883},
}