ISCA Archive Interspeech 2024

Speech emotion recognition with deep learning beamforming on a distant human-robot interaction scenario

Ricardo García, Rodrigo Mahu, Nicolás Grágeda, Alejandro Luzanto, Nicolas Bohmer, Carlos Busso, Néstor Becerra Yoma

Human-robot interaction (HRI) is becoming a truly relevant topic that imposes many challenges for state-of-the-art speech technology. This paper describes the first evaluation of speech emotion recognition (SER) technology with non-acted speech data recorded in a real indoor HRI scenario using deep learning-based beamforming technologies. The results presented show that deep learning beamforming gives an average concordance correlation coefficient (CCC) that is 15.03% higher than the ordinary minimum variance distortionless response (MVDR) beamformer when the SER system was trained with simulated conditions, which included an acoustic model of the testing HRI environment. Training by simulating the test scenarios and testing with real static HRI data provides an average CCC that is only 22.5% smaller than the ideal condition in which training and testing were both performed with the original MSP-Podcast database. This suggests the possibility of training SER engines with methods that emulate complex testing scenarios without recording further data.
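The abstract reports results in terms of the concordance correlation coefficient (CCC), a standard agreement metric for dimensional SER. As a minimal illustrative sketch (not the paper's evaluation code), the CCC between predicted and ground-truth emotional attribute scores can be computed as:

```python
import numpy as np

def concordance_cc(x, y):
    """Concordance correlation coefficient (Lin, 1989) between
    predictions x and ground-truth annotations y.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)

    Unlike Pearson correlation, CCC also penalizes shifts in mean
    and scale, so it rewards predictions that agree in absolute value,
    not just in trend.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()            # population variances
    cov = np.mean((x - mx) * (y - my))   # population covariance
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

# Perfect agreement gives CCC = 1; a constant offset lowers it.
print(concordance_cc([1, 2, 3], [1, 2, 3]))  # 1.0
print(concordance_cc([1, 2, 3], [2, 3, 4]))  # ~0.571
```

A percentage difference between two systems, as quoted in the abstract, would then be the relative change between their average CCC values across test conditions.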