ISCA Archive Interspeech 2020

Using Speaker-Aligned Graph Memory Block in Multimodally Attentive Emotion Recognition Network

Jeng-Lin Li, Chi-Chun Lee

The integration of multimodal emotion sensing modules into human-centered technologies is growing rapidly. Despite recent advances in deep architectures that improve recognition performance, the inability to handle individual differences in expressive cues remains a major hurdle for real-world applications. In this work, we propose a Speaker-aligned Graph Memory Network (SaGMN) that leverages speaker embeddings learned from a large speaker verification network to characterize such individual differences across speakers. Specifically, the gated memory block is jointly optimized with a speaker graph encoder, which aligns samples with similar vocal characteristics while effectively enlarging the discrimination across emotion classes. We evaluate our multimodal emotion recognition network on the CMU-MOSEI database and achieve state-of-the-art performance of 65.1% UAR and 74.7% F1 score. Further visualization experiments demonstrate the effect of speaker space alignment with the use of graph memory blocks.
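
The abstract reports UAR without expanding it. Assuming it denotes the usual unweighted average recall (the mean of per-class recalls, so every class counts equally regardless of its size), a minimal sketch of the metric could look like the following. The function name and the toy labels are hypothetical and are not taken from the paper or from CMU-MOSEI.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall; each class counts equally regardless of size."""
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # number of samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels (illustration only).
y_true = ["pos", "pos", "pos", "pos", "neg", "neg"]
y_pred = ["pos", "pos", "pos", "neg", "neg", "pos"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

On imbalanced data the two numbers differ because accuracy is dominated by the larger class, which is one reason a class-balanced measure is often reported for emotion recognition.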


doi: 10.21437/Interspeech.2020-1688

Cite as: Li, J.-L., Lee, C.-C. (2020) Using Speaker-Aligned Graph Memory Block in Multimodally Attentive Emotion Recognition Network. Proc. Interspeech 2020, 389-393, doi: 10.21437/Interspeech.2020-1688

@inproceedings{li20d_interspeech,
  author={Jeng-Lin Li and Chi-Chun Lee},
  title={{Using Speaker-Aligned Graph Memory Block in Multimodally Attentive Emotion Recognition Network}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={389--393},
  doi={10.21437/Interspeech.2020-1688},
  issn={2958-1796}
}