ISCA Archive SIGUL 2023

VGSAlign: Bilingual Speech Alignment of Unpaired and Untranscribed Languages using Self-Supervised Visually Grounded Speech Models

Luan Thanh Nguyen, Sakriani Sakti

Direct neural speech-to-speech translation (S2ST) systems translate speech from a source to a target language without the need for text transcription. However, these systems are mostly trained with supervised learning that relies on massive amounts of parallel source-target speech data, which are often unavailable. This paper proposes a bilingual speech alignment approach called VGSAlign as an initial solution for obtaining paired data from unknown, untranscribed, and unpaired speech. Here, we assume the speech is accompanied by auxiliary input from the visual modality that describes its semantic content. The approach then leverages the ability (1) to discover spoken words in multiple languages from correspondences between speech segments and parts of images using self-supervised visually grounded speech (VGS) models, and (2) to find visually grounded, semantically equivalent pairs among the discovered speech segments of the source and target languages. By learning joint representations of speech and images, VGSAlign shows the potential to achieve bilingual speech alignment grounded in visual representations. Furthermore, experimental results show that the proposed approach can work effectively on unknown, untranscribed, and unpaired speech without being trained on any supervised task.
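
To illustrate the two-step idea described above, the following is a minimal, hypothetical Python sketch, not the authors' implementation: it assumes precomputed speech-segment and image-region embeddings from a VGS-style encoder (here replaced by random placeholders), scores speech-image correspondences with cosine similarity, and then pairs source- and target-language segments that are grounded in the same image region. All names, dimensions, and the matching rule are illustrative assumptions.

```python
# Hypothetical sketch of the VGSAlign idea: align source/target speech segments
# through a shared visual anchor. Embeddings here are random placeholders; in
# practice they would come from a self-supervised visually grounded speech model.
import numpy as np

rng = np.random.default_rng(0)
DIM = 512

# Placeholder embeddings (assumption: one vector per speech segment / image region).
src_segments = rng.normal(size=(6, DIM))   # source-language speech segments
tgt_segments = rng.normal(size=(7, DIM))   # target-language speech segments
image_regions = rng.normal(size=(4, DIM))  # regions of the shared image

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Step 1: spoken-word discovery via speech-image correspondences.
# Each speech segment is grounded to its most similar image region.
src_to_region = cosine_sim(src_segments, image_regions).argmax(axis=1)
tgt_to_region = cosine_sim(tgt_segments, image_regions).argmax(axis=1)

# Step 2: bilingual alignment through the shared visual anchor.
# Source and target segments grounded to the same region are treated as
# candidate translation pairs; ties are broken by speech-speech similarity.
speech_sim = cosine_sim(src_segments, tgt_segments)
pairs = []
for i, region_i in enumerate(src_to_region):
    candidates = [j for j, r in enumerate(tgt_to_region) if r == region_i]
    if candidates:
        best_j = max(candidates, key=lambda j: speech_sim[i, j])
        pairs.append((i, best_j, float(speech_sim[i, best_j])))

for i, j, score in pairs:
    print(f"source segment {i} <-> target segment {j} (speech similarity {score:.2f})")
```

In the actual system, the correspondences would be produced by trained VGS encoders rather than random vectors, and the hard argmax grounding could be replaced with a thresholded or soft matching scheme.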