ISCA Archive IberSPEECH 2018

UPC Multimodal Speaker Diarization System for the 2018 Albayzin Challenge

Miquel Angel India Massana, Itziar Sagastiberri, Ponç Palau, Elisa Sayrol, Josep Ramon Morros, Javier Hernando

This paper presents the UPC system proposed for the Multimodal Speaker Diarization task of the 2018 Albayzin Challenge. The approach processes the speech and image signals independently. In the speech domain, speaker diarization is performed using identity embeddings created by a triplet-loss DNN that takes i-vectors as input. The triplet DNN is trained with an additional regularization loss that minimizes the variance of both the positive and negative distances. A sliding window is then used to compare speech segments with enrollment speaker targets using the cosine distance between the embeddings. To detect identities in the face modality, a face detector followed by a face tracker is applied to the videos. For each cropped face, a feature vector is obtained with a Deep Neural Network based on the ResNet-34 architecture, trained with a metric-learning triplet loss (available from the dlib library). For each track, the face feature vector is obtained by averaging the features computed for each frame of that track. This feature vector is then compared with the features extracted from the images of the enrollment identities. The proposed system is evaluated on the RTVE2018 database.
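The speech-side training objective described above (a triplet loss augmented with a regularizer that minimizes the variance of the positive and negative distances) can be sketched as follows. This is a minimal NumPy illustration under assumed hyperparameters (`margin`, `reg_weight`); it is not the paper's actual implementation or parameter values.

```python
import numpy as np

def cosine_distance(a, b):
    """Row-wise cosine distance (1 - cosine similarity) between
    two batches of embedding vectors of shape (batch, dim)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - np.sum(a_n * b_n, axis=1)

def triplet_variance_loss(anchor, positive, negative,
                          margin=0.2, reg_weight=0.1):
    """Triplet margin loss on cosine distances, plus a variance
    regularizer on both the positive and negative distance
    populations. margin and reg_weight are illustrative values."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    # Standard triplet hinge: push negatives beyond positives + margin.
    triplet = np.maximum(d_pos - d_neg + margin, 0.0).mean()
    # Regularizer: shrink the spread of both distance distributions.
    reg = d_pos.var() + d_neg.var()
    return triplet + reg_weight * reg
```

At test time, the same cosine distance would be computed between the embedding of each sliding-window speech segment and each enrollment speaker target, assigning the segment to the closest target.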

doi: 10.21437/IberSPEECH.2018-40

Cite as: India Massana, M.A., Sagastiberri, I., Palau, P., Sayrol, E., Morros, J.R., Hernando, J. (2018) UPC Multimodal Speaker Diarization System for the 2018 Albayzin Challenge. Proc. IberSPEECH 2018, 199-203, doi: 10.21437/IberSPEECH.2018-40

@inproceedings{indiamassana18_iberspeech,
  author={Miquel Angel {India Massana} and Itziar Sagastiberri and Ponç Palau and Elisa Sayrol and Josep Ramon Morros and Javier Hernando},
  title={{UPC Multimodal Speaker Diarization System for the 2018 Albayzin Challenge}},
  booktitle={Proc. IberSPEECH 2018},
  year={2018},
  pages={199--203},
  doi={10.21437/IberSPEECH.2018-40}
}