ISCA Archive Interspeech 2022

End-to-End Spontaneous Speech Recognition Using Disfluency Labeling

Koharu Horii, Meiko Fukuda, Kengo Ohta, Ryota Nishimura, Atsunori Ogawa, Norihide Kitaoka

Spontaneous speech often contains disfluent acoustic features such as fillers and hesitations, which are major causes of errors during automatic speech recognition (ASR). In this paper, we propose a method of "disfluency labeling" to address this problem. Our proposed method replaces disfluent phenomena in the transcriptions of the training speech data with two types of labels, filler (#) and hesitation (@), and trains an end-to-end ASR model on this data. This makes it possible to treat disfluent acoustic phenomena as recognition targets, in the same way as characters. In addition, by removing the disfluency labels from the recognition results, the words that the speaker actually intended to say can be extracted from the disfluent speech. The results of our evaluation experiments show that both the character and sentence error rates were reduced for all of the ASR test sets when disfluency labeling was applied, compared to the baseline method. The proposed method also outperformed other methods intended to reduce disfluency-related errors, even when more disfluent, spontaneous dialog speech was used. This study shows that explicit learning of two disfluent features, fillers and hesitations, is effective in spontaneous speech recognition.
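
The two text-processing steps described in the abstract (labeling the training transcriptions, then removing the labels from the recognition output) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The annotation scheme is an assumption made for illustration: fillers are assumed to be marked as `(F ...)` and hesitations as `(D ...)` in the source transcription, and the function names `label_disfluencies` and `strip_disfluency_labels` are hypothetical. Only the label symbols `#` and `@` come from the abstract.

```python
import re

# Label symbols taken from the abstract.
FILLER_LABEL = "#"
HESITATION_LABEL = "@"

# ASSUMPTION: a hypothetical annotation scheme in which "(F ...)" marks a
# filler and "(D ...)" marks a hesitation. The abstract does not specify
# how disfluencies are marked in the original transcriptions.
_FILLER_TAG = re.compile(r"\(F [^()]*\)")
_HESITATION_TAG = re.compile(r"\(D [^()]*\)")


def label_disfluencies(annotated: str) -> str:
    """Replace each annotated disfluency with its single-symbol label."""
    text = _FILLER_TAG.sub(FILLER_LABEL, annotated)
    return _HESITATION_TAG.sub(HESITATION_LABEL, text)


def strip_disfluency_labels(hypothesis: str) -> str:
    """Remove disfluency labels from a recognition result.

    Whitespace is normalized afterwards, which suits space-delimited text;
    for unsegmented text the final join has no visible effect.
    """
    cleaned = hypothesis.replace(FILLER_LABEL, "").replace(HESITATION_LABEL, "")
    return " ".join(cleaned.split())


if __name__ == "__main__":
    reference = "(F um) I went to the (D sto) store"
    labeled = label_disfluencies(reference)
    print(labeled)
    print(strip_disfluency_labels(labeled))
```

In this sketch the labeled string would serve as the training target, so the model learns to emit `#` and `@` where disfluencies occur, and `strip_disfluency_labels` would be applied to the decoded hypothesis to recover the intended words.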

doi: 10.21437/Interspeech.2022-281

Cite as: Horii, K., Fukuda, M., Ohta, K., Nishimura, R., Ogawa, A., Kitaoka, N. (2022) End-to-End Spontaneous Speech Recognition Using Disfluency Labeling. Proc. Interspeech 2022, 4108-4112, doi: 10.21437/Interspeech.2022-281

  author={Koharu Horii and Meiko Fukuda and Kengo Ohta and Ryota Nishimura and Atsunori Ogawa and Norihide Kitaoka},
  title={{End-to-End Spontaneous Speech Recognition Using Disfluency Labeling}},
  booktitle={Proc. Interspeech 2022},