ISCA Archive Interspeech 2022

The Magnitude and Phase based Speech Representation Learning using Autoencoder for Classifying Speech Emotions using Deep Canonical Correlation Analysis

Ashishkumar Gudmalwar, Biplove Basel, Anirban Dutta, Ch V Rama Rao

Speech Emotion Recognition (SER) is the task of identifying emotions from human speech utterances irrespective of their semantic content, and it plays an important role in making human-machine interaction natural. Conventional SER approaches emphasize the magnitude spectrum for feature extraction and ignore phase information, yet recent studies reveal that phase information plays a significant role in analyzing speech acoustics. This work explores speech representation learning from both magnitude and phase information using an autoencoder for the SER task. We trained a UNET autoencoder on Mel Frequency Cepstral Coefficients (MFCCs) and the Modified Group Delay Function (MODGD) to learn representations. The encoder of the trained UNET autoencoder feeds a neural network classifier and is fine-tuned on four emotions, separately for MFCCs and MODGD. The learned representations for MFCCs and MODGD are combined and given as input to a Support Vector Machine (SVM) for classification. Deep Canonical Correlation Analysis (DCCA) is used to maximize the correlation between the magnitude and phase information, improving on the conventional SER system's performance. The performance analysis is carried out on the IEMOCAP database. The experimental results show improvements over MFCC features and existing approaches for unimodal SER.
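The DCCA step maximizes the total canonical correlation between the magnitude-based (MFCC) and phase-based (MODGD) representations. A minimal NumPy sketch of that correlation objective is given below; the regularization constant `reg` and the eigendecomposition-based matrix inverse square root are illustrative choices and are not taken from the paper:

```python
import numpy as np

def cca_correlation(H1, H2, reg=1e-4):
    """Total canonical correlation between two views.

    H1, H2: arrays of shape (n_samples, dim), e.g. MFCC-based and
    MODGD-based representations. Returns the sum of canonical
    correlations, which DCCA training would maximize.
    """
    n = H1.shape[0]
    H1c = H1 - H1.mean(axis=0)          # center each view
    H2c = H2 - H2.mean(axis=0)
    # regularized covariance and cross-covariance estimates
    S11 = H1c.T @ H1c / (n - 1) + reg * np.eye(H1.shape[1])
    S22 = H2c.T @ H2c / (n - 1) + reg * np.eye(H2.shape[1])
    S12 = H1c.T @ H2c / (n - 1)

    def inv_sqrt(S):
        # symmetric inverse square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    # singular values of T are the canonical correlations
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False).sum()
```

For two identical (perfectly correlated) views the score approaches the representation dimensionality, while independent views score near zero; a deep CCA system backpropagates the negative of this quantity through the two encoder networks.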