ISCA Archive Interspeech 2025

Character Error Rate Estimation for Semi-Supervised Training of Speech Recognition for Arabic Dialects

Chanho Park, Oscar Saz

Semi-supervised data is now commonplace in Automatic Speech Recognition (ASR) and underpins the most advanced ASR models. For low-resource languages, where little labelled data is available, it opens the possibility of using virtually unlimited amounts of untranscribed data without escalating costs. This requires an initial ASR system that can produce pseudo-transcripts of the untranscribed data. In low-resource languages, however, this initial system may not be accurate enough to produce reliable pseudo-transcripts, so data selection techniques become necessary. This paper explores the use of Character Error Rate (CER) estimation to automatically select the best segments from a set of nearly 4,000 hours of untranscribed Arabic speech in different dialects.
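To make the selection idea concrete, the following is a minimal sketch and not the authors' implementation. It shows how CER is conventionally computed when a reference transcript exists (character-level edit distance divided by reference length), and how segments might be filtered once an estimated CER is available for each pseudo-transcript. The `Segment` type, the `select_segments` helper, and the fixed threshold are hypothetical; the abstract does not state the selection rule, and the estimator that would produce `estimated_cer` is assumed to exist elsewhere.

```python
from typing import List, NamedTuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate of a hypothesis against a known reference."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


class Segment(NamedTuple):
    """Hypothetical record for one untranscribed audio segment."""
    segment_id: str
    pseudo_transcript: str
    estimated_cer: float  # assumed output of some CER estimator


def select_segments(segments: List[Segment], max_cer: float) -> List[Segment]:
    """Keep segments whose estimated CER does not exceed the threshold.

    Thresholding is an assumption made for illustration only.
    """
    return [s for s in segments if s.estimated_cer <= max_cer]
```

With labelled data, `cer` gives the true error of a transcript. For untranscribed data no reference exists, which is why an estimate of the CER is needed before a filter such as `select_segments` can be applied.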