ISCA Archive Interspeech 2023

Detection of Laughter and Screaming Using the Attention and CTC Models

Takuto Matsuda, Yoshiko Arimoto

This study aimed to detect social signals, such as laughter and screams, in real environments. Social signals influence human-to-human communication. To apply these signals effectively in various systems, computers must detect them appropriately. Social signal detection (SSD) experiments were conducted to determine which of three feature sets, i.e., a spectral feature set, a prosodic feature set, or a combined spectral and prosodic feature set, was best for detecting laughter and screams. The results showed that the combined spectral and prosodic feature set yielded the best performance, with 81.83% accuracy for laughter and 81.68% accuracy for screams. Moreover, a comparison of detection models revealed that the bidirectional long short-term memory (BiLSTM)-connectionist temporal classification (CTC) model yielded the best laughter detection performance, while the attention-CTC model was best for scream detection. These results suggest that CTC is effective for SSD.
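
As a rough illustration of how a CTC-style output can be turned into a sequence of detected events, the following minimal Python sketch applies greedy (best-path) decoding: take the top class per frame, merge consecutive repeats, and drop the blank symbol. It is not the authors' implementation, and the abstract does not state which decoding method was used. The label inventory, the blank index, and the function name `ctc_greedy_decode` are assumptions made only for this example.

```python
from itertools import groupby

BLANK = 0  # CTC blank index (assumed convention, not from the paper)
LABELS = {1: "laughter", 2: "scream"}  # hypothetical label inventory


def ctc_greedy_decode(frame_scores):
    """Collapse per-frame class scores into a sequence of event labels.

    frame_scores: list of per-frame score lists, one score per class,
    where index 0 is the blank symbol.
    """
    # Step 1: pick the best-scoring class index for each frame.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    # Step 2: merge consecutive repeats of the same class.
    collapsed = [k for k, _ in groupby(best_path)]
    # Step 3: drop blanks and map the remaining indices to label names.
    return [LABELS[k] for k in collapsed if k != BLANK]


# Toy scores for five frames (made-up numbers, for illustration only).
scores = [
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # laughter
    [0.2, 0.7, 0.1],    # laughter (repeat, merged with previous frame)
    [0.8, 0.1, 0.1],    # blank
    [0.1, 0.2, 0.7],    # scream
]
print(ctc_greedy_decode(scores))
```

In this convention, a blank between two identical labels keeps them as separate events, which is why a CTC output can represent two laughter bouts in a row.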