ISCA Archive Interspeech 2022

Dialogue Acts Aided Important Utterance Detection Based on Multiparty and Multimodal Information

Fumio Nihei, Ryo Ishii, Yukiko Nakano, Kyosuke Nishida, Ryo Masumura, Atsushi Fukayama, Takao Nakamura

Visualizing the important utterances in a meeting has been reported to help people understand the meeting efficiently, so building models that estimate important utterances, and improving their performance, is a worthwhile goal. Several studies have reported that introducing auxiliary tasks as estimation targets improves performance on the main task. In this study, we develop models that estimate important utterances using dialogue acts (DAs) as an auxiliary task. The MATRICS corpus of four-party face-to-face meetings served as the analysis data. The model was a transformer that uses historical information, and three modalities (text, audio, and video) were used as input. In addition, the audio and video data were separated into information about the speaker and information about the other participants. The best model for important utterances used the speaker's text and audio together with the others' audio and video, with the assistance of DAs, and achieved an F-measure of 0.809. This model also outperformed the one that estimates only important utterances, indicating that the assistance of DAs is effective for estimating important utterances.
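As an illustration of how an auxiliary task can assist a main task, the following is a minimal sketch of a joint training objective, not the authors' implementation. The abstract does not state the loss functions or how the two tasks are weighted; the binary and categorical cross-entropy terms, the function names, and the `aux_weight` parameter are all assumptions made here for illustration.

```python
import math

EPS = 1e-12  # guards against log(0)


def binary_cross_entropy(p_important, is_important):
    """Main-task loss: is the utterance important (1) or not (0)?"""
    p = min(max(p_important, EPS), 1.0 - EPS)
    return -math.log(p) if is_important else -math.log(1.0 - p)


def categorical_cross_entropy(da_probs, da_label):
    """Auxiliary-task loss: which dialogue act does the utterance carry?"""
    return -math.log(max(da_probs[da_label], EPS))


def multitask_loss(p_important, is_important, da_probs, da_label, aux_weight=0.5):
    """Weighted sum of the main and auxiliary losses.

    aux_weight is a hypothetical hyperparameter; the source does not give one.
    """
    main = binary_cross_entropy(p_important, is_important)
    aux = categorical_cross_entropy(da_probs, da_label)
    return main + aux_weight * aux
```

With `aux_weight=0` the objective reduces to the single-task case, which corresponds to the baseline that estimates only important utterances.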