ISCA Archive Interspeech 2025

Context is all you need? Low-resource conversational ASR profits from context, coming from the same or from the other speaker

Julian Linke, Jana Winkler, Barbara Schuppler

Despite the rapid advancement of automatic speech recognition (ASR) systems, spontaneous conversations still pose a major challenge, even more so for low-resource languages, dialects and non-dominant varieties. Moreover, lively turn changes in conversational speech produce short utterances that have been found to be error-prone for transformer-based ASR systems, which require larger context. The question thus arises as to which type of context is useful: more from the same speaker, providing acoustically relevant context, or more from the conversation - mixing utterances from both speakers - providing semantically relevant context. Comparing seven ASR systems on conversational Austrian German, we find the best performance with a minimum of 20 s of context, independent of whether it comes from the same or from the other speaker. Systems fine-tuned with data from the same variety and speaking style require less context and perform better overall than zero-shot systems.