ISCA Archive CHiME 2023

NTT Multi-Speaker ASR System for the DASR Task of CHiME-7 Challenge

Naoyuki Kamo, Naohiro Tawara, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Atsunori Ogawa, Hiroshi Sato, Tsubasa Ochiai, Atsushi Ando, Rintaro Ikeshita, Takatomo Kano, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, Shoko Araki

We introduce our submission to the distant automatic speech recognition (DASR) task of the CHiME-7 challenge. Our system uses end-to-end diarization with vector clustering (EEND-VC), guided source separation (GSS), and attention-based encoder-decoder and transducer-based ASR systems. Our submission exploits pre-trained self-supervised learning (SSL) models to build strong diarization and ASR modules. We also explore data augmentation using contrastive data selection based on representations from SSL models. In addition, we use self-supervised adaptation (SSA) to adapt these modules to the recording conditions of each session. Our DASR system achieves a 36% diarization error rate (DER) reduction and a 47% word error rate (WER) reduction over the baseline on the main track of the evaluation set, and it ranked third in the challenge.