ISCA Archive Interspeech 2025

Improving End-to-end Mixed-case ASR with Knowledge Distillation and Integration of Voice Activity Cues

Sashi Novitasari, Takashi Fukuda, Gakuto Kurata

E2E mixed-case (MC) ASR is a more challenging task than unicase (UC) ASR because the model must simultaneously capitalize and punctuate its decoded outputs. MC models trained simply on formatted transcriptions often suffer from various negative effects, notably a degradation in case-and-punctuation-insensitive performance due to the increased learning complexity. In this paper, we describe novel techniques for training E2E MC ASR models and use them to improve both case-and-punctuation-sensitive and -insensitive performance. Our approach incorporates knowledge distillation from a UC teacher to an MC student model, not only to improve capitalization and punctuation accuracy but also to maximize phone classification capability in MC ASR. Furthermore, we integrate voice activity cues into MC ASR to support the text formatting tasks. Our method reduces errors by up to 9.2% relative over baseline models that operate at a similar decoding cost.
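The paper does not give its loss formulation in the abstract, but teacher-student knowledge distillation of the kind described (a UC teacher guiding an MC student) is commonly realized as a temperature-scaled KL divergence between teacher and student output distributions. Below is a minimal, dependency-free sketch of that generic distillation loss; the function names, the temperature value, and the assumption of per-frame logit alignment between teacher and student are illustrative, not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    In the setting the abstract describes, the teacher would be a
    unicase ASR model and the student a mixed-case model; here we
    assume their output frames are aligned one-to-one (an assumption,
    not a detail from the paper).
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(p_t, p_s) if p > 0)

# Identical logits give zero divergence; diverging logits give a
# positive loss that the student is trained to minimize.
loss_same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
loss_diff = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

In practice this soft-target term is typically interpolated with the student's standard supervised loss on the formatted (cased and punctuated) transcriptions.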