ISCA Archive Interspeech 2023

Speech Emotion Recognition by Estimating Emotional Label Sequences with Phoneme Class Attribute

Ryotaro Nagase, Takahiro Fukumori, Yoichi Yamashita

In recent years, much research has investigated speech emotion recognition (SER) using deep learning to predict the emotions conveyed by speech. We previously studied a method that detects the emotion of a whole utterance using frame-based SER, which estimates emotions in each frame rather than for the utterance as a whole. One problem with this method is that the emotional label sequence used to train the frame-based SER does not sufficiently consider phonemic characteristics. To solve this problem, we propose new methods for recognizing the emotion of a whole utterance with frame-based SER that take phoneme class attributes, such as vowels, voiced consonants, unvoiced consonants, and other symbols, into account during training. Experiments show that the proposed methods significantly improve utterance-level recognition performance compared with conventional methods.
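As a conceptual illustration of the idea, the sketch below builds a frame-level emotional label sequence conditioned on phoneme class attributes. The phoneme inventory, class mapping, and the rule of keeping the utterance-level emotion only on vowel and voiced-consonant frames are illustrative assumptions, not the authors' exact labeling scheme.

```python
# Hypothetical sketch: a frame-level emotional label sequence that accounts
# for phoneme class attributes (vowel, voiced consonant, unvoiced consonant,
# other). The mapping and rule below are assumptions for illustration only.

PHONEME_CLASS = {
    "a": "vowel", "i": "vowel", "u": "vowel",
    "m": "voiced_consonant", "z": "voiced_consonant",
    "s": "unvoiced_consonant", "t": "unvoiced_consonant",
    "sil": "other", "sp": "other",  # silence / short pause symbols
}

def frame_labels(frame_phonemes, utterance_emotion):
    """Assign a per-frame emotion label, keeping the utterance-level
    emotion only for frames whose phoneme class is assumed to carry
    emotional cues (here: vowels and voiced consonants)."""
    labels = []
    for phoneme in frame_phonemes:
        cls = PHONEME_CLASS.get(phoneme, "other")
        if cls in ("vowel", "voiced_consonant"):
            labels.append(utterance_emotion)
        else:
            labels.append("neutral")  # assumed fallback for other frames
    return labels

seq = frame_labels(["sil", "s", "a", "m", "u", "sp"], "happy")
```

A frame-based SER model would then be trained against such a sequence instead of repeating one utterance-level label over every frame.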