1 The University of Osaka  2 The University of Tokyo  3 Institute of Science Tokyo  4 OMRON SINIC X

Understanding animal species from multimodal data poses an emerging challenge at the intersection of computer vision and ecology. While recent biological models, such as BioCLIP, have demonstrated strong alignment between images and textual taxonomic information for species identification, the integration of the audio modality remains an open problem. We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. BioVITA comprises (i) a training dataset, (ii) a representation model, and (iii) a retrieval benchmark. First, we construct a large-scale training dataset comprising 1.3 million audio clips and 2.3 million images, covering 14,133 species annotated with 34 ecological trait labels. Second, building upon BioCLIP 2, we introduce a two-stage training framework to effectively align audio representations with visual and textual representations. Third, we develop a cross-modal retrieval benchmark that covers all six retrieval directions among the three modalities (image-to-audio, audio-to-text, text-to-image, and their reverses) at three taxonomic levels: Family, Genus, and Species. Extensive experiments demonstrate that our model learns a unified representation space that captures species-level semantics beyond taxonomy, advancing multimodal biodiversity understanding.
We develop BioVITA Bench, a species-level retrieval benchmark spanning the six cross-modal directions. Our benchmark enables comprehensive analysis from multimodal, ecological, and generalization perspectives.
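The six retrieval directions can be sketched as pairwise nearest-neighbor search in a shared embedding space. Below is a minimal illustration, assuming cosine-similarity retrieval and recall@k as the metric; the embeddings, species labels, and the `recall_at_k` helper are hypothetical and stand in for BioVITA Bench's actual data and evaluation protocol.

```python
import numpy as np

def recall_at_k(query, gallery, labels_q, labels_g, k=1):
    """Recall@k: fraction of queries whose top-k gallery neighbors
    (by cosine similarity) include an item with the correct label."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    topk = np.argsort(-(q @ g.T), axis=1)[:, :k]
    return (labels_g[topk] == labels_q[:, None]).any(axis=1).mean()

rng = np.random.default_rng(0)
species = np.arange(4).repeat(2)  # 4 hypothetical species, 2 samples each
emb = {m: rng.normal(size=(8, 64)) for m in ("image", "audio", "text")}

# All six ordered modality pairs: image-to-audio, audio-to-image, ...
scores = {(src, dst): recall_at_k(emb[src], emb[dst], species, species)
          for src in emb for dst in emb if src != dst}
```

The same procedure applies at the Family and Genus levels by swapping in coarser labels.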
We introduce BioVITA Train, a training dataset for VITA alignment. We curate 1.3 million audio clips and 2.3 million images with textual taxonomic annotations, covering 14k species and 34 ecological traits.
BioVITA Model consists of three encoders. Building upon BioCLIP 2, we train the audio encoder in Stage 1, and jointly train the audio and text encoders in Stage 2.
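The two-stage alignment described above can be illustrated with a symmetric InfoNCE objective, as in CLIP-style training. This is a minimal numpy sketch under stated assumptions: Stage 1 aligns audio with text embeddings, and Stage 2 adds an audio-image term; the exact loss composition, temperature, and which encoders are frozen in BioVITA may differ.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: matched pairs (row i of a, row i of b) are
    positives; all other rows in the batch serve as negatives."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    labels = np.arange(len(a))
    def ce(l):  # cross-entropy along rows
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 512))  # audio encoder outputs (trained)
text  = rng.normal(size=(8, 512))  # text embeddings (e.g., from BioCLIP 2)
image = rng.normal(size=(8, 512))  # image embeddings (e.g., from BioCLIP 2)

# Stage 1: align audio to text; Stage 2: jointly align audio to text and image.
stage1_loss = info_nce(audio, text)
stage2_loss = info_nce(audio, text) + info_nce(audio, image)
```

In practice the loss would be minimized by backpropagation through the trainable encoders; the numpy version only shows how the objective is computed.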
BioVITA effectively handles all six retrieval directions and significantly outperforms the tri-modal baselines. Stage 2, which incorporates visual information, further improves every direction by providing complementary cues for robust VITA alignment.
Despite encountering entirely novel taxa, BioVITA demonstrates robust generalization. Consistent with the results on seen species, the gain from Stage 1 to Stage 2 underscores the crucial role of the visual modality in enhancing generalization.
The left plot reports, among all species-level errors, the proportion of retrievals that fall within the correct genus; the right plot reports the proportion within the correct family. These error patterns suggest that the learned representations capture hierarchical taxonomic structure: even when species-level retrieval fails, mistakes tend to remain taxonomically close to the target.
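The error analysis above reduces to a simple conditional measurement: restrict to queries where the species prediction is wrong, then check how often the coarser taxon still matches. A minimal sketch with hypothetical labels (the arrays below are toy data, not BioVITA results):

```python
import numpy as np

# Hypothetical per-query retrieved vs. true taxa at two levels.
true_species = np.array([0, 1, 2, 3])
pred_species = np.array([0, 9, 8, 3])  # two species-level errors
true_genus   = np.array([0, 0, 1, 1])
pred_genus   = np.array([0, 0, 2, 1])  # one error stays in the correct genus

# Among species-level errors, fraction that keep the correct genus.
errors = pred_species != true_species
within_genus = (pred_genus[errors] == true_genus[errors]).mean()
```

The family-level proportion is computed the same way with family labels in place of genus labels.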
@article{shinoda2026biovita,
  title={BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment},
  author={Risa Shinoda and Kaede Shiohara and Nakamasa Inoue and Kuniaki Saito and Hiroaki Santo and Fumio Okura},
  year={2026},
  eprint={2603.23883},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.23883},
}