ISCA Archive Interspeech 2025

Dog2vec: Self-Supervised Pre-Training for Canine Vocal Representation

Xingyuan Li, Kenny Zhu, Mengyue Wu

Previous general-purpose bioacoustic models were trained on large amounts of data pooled from multiple species. However, the amount of training data available per species is small on average, and large differences between the vocalizations of different species may even hinder the encoding of vocal features. As a result, generic models can introduce large errors when used for species-specific vocalization studies. We collected over 6000 hours of dog barking videos and present Dog2vec, the first animal-specific bioacoustic embedding model. The results indicate that Dog2vec outperforms species-independent pre-trained models and achieves state-of-the-art results on a series of dog-related tasks, including dog bark type recognition and dog sound event detection, with a relative performance increase of 8.2%.