ISCA Archive Interspeech 2025

Voice Activity-based Text Segmentation for ASR Text Denormalization

Sashi Novitasari, Takashi Fukuda, Gakuto Kurata

We introduce a novel technique for building text capitalization and punctuation recovery (CP) systems that learn from voice-activity cues to enhance the readability of end-to-end (E2E) ASR output. E2E ASR systems commonly produce uncapitalized text with no punctuation marks. In such situations, CP systems are introduced as external modules to denormalize the ASR output; however, they suffer from performance degradation because the text segmentation used to construct them differs from the segmentation produced by ASR. ASR systems generally decode input speech in segments determined by a voice activity detection (VAD) algorithm, while CP systems are often constructed on grammatically well-segmented full-sentence text. To reduce this gap, we construct a CP system using pseudo VAD-segmented text produced by a text segmentation model designed using voice-activity cues. Our method reduces false predictions by 4.5%-18.9% compared with the baseline while appropriately formatting the ASR text.
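The segmentation mismatch described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `normalize` and `resegment` are hypothetical, and the boundary positions are supplied by hand where the paper would use a text segmentation model designed using voice-activity cues. The sketch only shows how full-sentence text might be re-cut into pseudo segments that cross sentence boundaries, yielding (ASR-like input, formatted target) pairs for CP training.

```python
import string


def normalize(text: str) -> str:
    """Mimic raw ASR output: lowercase the text and strip ASCII punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def resegment(sentences, boundaries):
    """Join sentences into one word stream and cut it after the given word counts.

    `boundaries` stands in for the output of a segmentation model
    (an assumption of this sketch); each value is a word index at which
    a pseudo segment ends.
    """
    words = " ".join(sentences).split()
    segments, start = [], 0
    for end in sorted(boundaries):
        if start < end <= len(words):
            segments.append(" ".join(words[start:end]))
            start = end
    if start < len(words):
        # Whatever follows the last boundary forms the final segment.
        segments.append(" ".join(words[start:]))
    return segments


# Well-segmented training text, as CP systems are often built on.
sentences = ["Hello, how are you?", "I am fine."]

# Hand-picked boundaries that ignore sentence ends, as VAD-based cuts may.
pseudo_boundaries = [1, 6]

# (input, target) pairs: unformatted segment and its formatted counterpart.
pairs = [(normalize(seg), seg) for seg in resegment(sentences, pseudo_boundaries)]
```

Here the second pair spans a sentence boundary ("how are you? I am"), which is the kind of segment a CP system trained only on full sentences would not see.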