ISCA Archive SIGUL 2023

What Kind of Multi- or Cross-lingual Pre-training is the most Effective for a Spontaneous, Less-resourced ASR Task?

Péter Mihajlik, Máté Soma Kádár, Gergely Dobsinszki, Yan Meng, Meng Kedalai, Julian Linke, Tibor Fegyó, Katalin Mády

Most languages are under-resourced for Automatic Speech Recognition (ASR), and most relevant tasks involve the transcription of spontaneous speech. The application of cross- or multi-lingual pre-training is therefore inevitable; however, selecting the best pre-trained model or pre-training data/method is not straightforward. In this paper, we present a case study for Hungarian, targeting good-quality recognition of spontaneous speech while also monitoring ASR performance on read speech. Transformer/conformer-based end-to-end neural models with supervised cross-lingual, self-supervised cross- and (massively) multi-lingual, and weakly supervised multi-lingual pre-training are fine-tuned and evaluated. Surprisingly, a relatively small-scale trilingual (SSL pre-trained) model won the competition by a large margin over very large-scale models trained on more Hungarian data. The results revealed that the composition of the pre-training data in terms of language and speech style was essential, that a larger data size or a higher number of languages did not necessarily bring improvement, and that no transcription was required in pre-training to reach the best performance.
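As a loose illustration of the fine-tuning step mentioned in the abstract, the sketch below shows one common way to adapt an SSL-pretrained speech encoder to a target language with a CTC head, using Hugging Face Transformers. The XLS-R checkpoint name, the vocab.json path, and the dummy batch are assumptions for illustration only; they are not the models, corpora, or training recipe used in the paper.

```python
# Minimal sketch: CTC fine-tuning of an SSL-pretrained encoder for a new language.
# Checkpoint, vocabulary file, and data below are placeholders, not the paper's setup.
import torch
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

# Assumed multilingual SSL checkpoint (publicly available XLS-R, 300M parameters).
checkpoint = "facebook/wav2vec2-xls-r-300m"

# Character vocabulary for the target-language transcripts -- placeholder file.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0, do_normalize=True
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the SSL-pretrained encoder and attach a randomly initialised CTC output layer.
model = Wav2Vec2ForCTC.from_pretrained(
    checkpoint,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # common practice when fine-tuning on limited data

# One illustrative training step on a dummy batch of 16 kHz audio.
audio = [torch.randn(16_000).numpy()]  # 1 second of fake waveform
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)
labels = processor.tokenizer("egy rövid magyar mondat", return_tensors="pt").input_ids

loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()  # in a real run, an optimiser step and a data loader loop follow
```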