Recently, speaker diarization based on speaker embeddings has achieved
excellent results in many studies. In this paper we propose several enhancements
throughout the diarization pipeline. This work addresses two clustering
frameworks: agglomerative hierarchical clustering (AHC) and spectral
clustering (SC).
First, we use multiple speaker embeddings: we show that fusing
x-vectors and d-vectors significantly boosts accuracy. Second, we
train neural networks to leverage both acoustic and duration information
for scoring similarity of segments or clusters. Third, we introduce
a novel method to guide the AHC clustering mechanism using a neural
network. Fourth, we handle short-duration segments in SC by deemphasizing
their effect on the estimation of the number of speakers.
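The abstract does not specify how the x-vector/d-vector fusion is carried out. As a minimal sketch only (not the authors' method), one common baseline is score-level fusion: compute a cosine similarity per embedding type and combine the scores with a mixing weight. The function names and the weight `alpha` below are hypothetical.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_similarity(xvec_a, xvec_b, dvec_a, dvec_b, alpha=0.5):
    # Score-level fusion: weighted average of the per-embedding cosine
    # scores. alpha is a hypothetical mixing weight that would be tuned
    # on held-out data.
    return alpha * cosine(xvec_a, xvec_b) + (1 - alpha) * cosine(dvec_a, dvec_b)
```

Alternatives include concatenating the embeddings before scoring or training a scoring network on both; the score-level variant shown here is simply the easiest to illustrate.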
Finally, we propose
a novel method for estimating the number of clusters in the SC framework.
For each eigenvalue, the method analyzes the projection of the
SC similarity matrix onto the corresponding eigenvector.
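For context, a widely used baseline for choosing the number of clusters in spectral clustering is the eigengap heuristic: pick the index of the largest drop between consecutive sorted eigenvalues of the similarity matrix. The sketch below shows that baseline, not the projection-based estimator proposed here; `max_speakers` is a hypothetical cap.

```python
import numpy as np

def estimate_num_speakers(S, max_speakers=8):
    # Symmetrize the similarity matrix and eigendecompose it.
    S = 0.5 * (S + S.T)
    eigvals, _ = np.linalg.eigh(S)
    # Sort eigenvalues in descending order.
    eigvals = eigvals[::-1]
    # Eigengap heuristic: the number of clusters is the position of the
    # largest drop between consecutive leading eigenvalues.
    gaps = eigvals[:max_speakers - 1] - eigvals[1:max_speakers]
    return int(np.argmax(gaps)) + 1
```

On an idealized block-diagonal similarity matrix with k blocks, the top k eigenvalues are large and the rest near zero, so the largest gap falls at position k.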
We evaluated our system
on the NIST SRE 2000 CALLHOME dataset and, using cross-validation, achieved
an error rate of 5.1%, surpassing the state of the art in speaker diarization.