ISCA Archive IberSPEECH 2024

ILENIA_VOZ ASR System Fusion for Albayzin 2024 Speech to Text Challenge

Abir Messaoudi, Sarah Solito, Federico Costa, Carlos Daniel Hernández Mena, Marc Casals-Salvador, Lucas Takanori Sanchez Shiromizu, Marti Cortada Garcia, Carme Armentano-Oller, Antonio Moscoso Sánchez, Carmen Magariños, Javier González Corbelle, Asier Herranz, Christoforos Souganidis, Inma Hernáez Rioja, Ibon Saratxaga, Eva Navas

This paper presents the ILENIA_VOZ team's Automatic Speech Recognition (ASR) system developed for the Albayzin 2024 Speech-to-Text (S2T) Challenge, which integrates efforts from four language-specific initiatives: AINA (Catalan), NÓS (Galician), GAITU (Basque), and VIVES (Valencian). Our primary system is a word-level ROVER fusion of multiple ASR models that achieves Word Error Rates (WER) of 15.46% and 19.91% on the RTVE 2022 and RTVE 2024 test sets, respectively. We also present three contrastive systems. The models are trained and fine-tuned on RTVE data together with various other corpora. The paper details our system architecture and fusion techniques and reports results on the Albayzin-RTVE 2022 and 2024 test sets.
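The abstract names word-level ROVER as the fusion method and WER as the metric but gives no implementation details. The following is a minimal, illustrative sketch of the two ideas, not the team's implementation. It covers only ROVER's voting stage and assumes that the hypotheses have already been aligned slot by slot, with `@` marking an empty slot (a deletion in that system). The alignment step, confidence scores, and any system weights the authors used are not shown. The helper names `rover_vote` and `wer` and the optional `weights` parameter are hypothetical. WER is computed in the usual way, as (substitutions + deletions + insertions) divided by the number of reference words, using a word-level Levenshtein distance.

```python
from collections import Counter
from typing import List, Optional, Sequence

NULL = "@"  # assumed placeholder for an empty slot in an aligned hypothesis


def rover_vote(aligned: Sequence[Sequence[str]],
               weights: Optional[Sequence[float]] = None) -> List[str]:
    """Hypothetical sketch of ROVER-style frequency voting.

    Assumes every hypothesis is already aligned to the same number of slots.
    """
    if not aligned:
        return []
    n_slots = len(aligned[0])
    if any(len(hyp) != n_slots for hyp in aligned):
        raise ValueError("all hypotheses must have the same number of aligned slots")
    w = list(weights) if weights is not None else [1.0] * len(aligned)
    fused: List[str] = []
    for slot in range(n_slots):
        votes: Counter = Counter()
        for hyp, weight in zip(aligned, w):
            votes[hyp[slot]] += weight
        # On ties, most_common keeps first-seen order, so the earlier system wins.
        word, _ = votes.most_common(1)[0]
        if word != NULL:  # a winning empty slot emits no word
            fused.append(word)
    return fused


def wer(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word Error Rate: word-level edit distance divided by the reference length."""
    ref, hyp = list(reference), list(hypothesis)
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution or match
        prev = cur
    return prev[len(hyp)] / max(len(ref), 1)


if __name__ == "__main__":
    # Toy Catalan hypotheses from three hypothetical systems, pre-aligned.
    systems = [
        ["el", "gos", "menja", "@"],
        ["el", "gos", "menjava", "carn"],
        ["el", "gat", "menja", "carn"],
    ]
    reference = ["el", "gos", "menja", "carn"]
    fused = rover_vote(systems)
    print(fused, wer(reference, fused))
```

Here each system makes one error, but majority voting recovers the reference sentence. Plain frequency voting was chosen for clarity. ROVER can also weight votes by word confidence, which this sketch does not model.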