ISCA Archive Interspeech 2023

Evaluating context-invariance in unsupervised speech representations

Mark Hallap, Emmanuel Dupoux, Ewan Dunbar

Unsupervised speech representations have taken off, with benchmarks demonstrating major progress on semi-supervised speech recognition, speech synthesis, and speech-only language modelling. Inspiration comes from the promise of discovering the phonemes of a language, or a similar low-bitrate encoding. However, a critical property of phoneme transcriptions is context-invariance: the phonetic context of a speech sound can massively influence the way it is pronounced, while the text remains stable. This is why tokens of the same word have the same transcriptions---key to language understanding. Current benchmarks do not measure context-invariance. We develop a new version of the ZeroSpeech ABX benchmark that does, and apply it to recent self-supervised representations. We show that the context-independence of representations is predictive of the stability of word-level representations. We suggest research concentrate on improving the context-independence of unsupervised representations.
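To make the ABX evaluation mentioned above concrete, here is a minimal sketch of the machine-ABX discrimination score over fixed-length embeddings, assuming cosine distance; the function names and the use of single vectors per token (rather than frame sequences compared with DTW, as in the full ZeroSpeech benchmark) are simplifying assumptions for illustration.

```python
import numpy as np

def cosine_dist(u, v):
    # Cosine distance between two embedding vectors.
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_error(a_tokens, b_tokens, dist=cosine_dist):
    """ABX error rate: fraction of (A, B, X) triples, with X drawn from
    the same category as A, for which X is closer to B than to A.
    Lower is better; 0.5 is chance. Inputs are lists of embeddings."""
    errors, total = 0, 0
    for i, a in enumerate(a_tokens):
        # X ranges over same-category tokens distinct from A.
        for x in a_tokens[:i] + a_tokens[i + 1:]:
            for b in b_tokens:
                total += 1
                if dist(x, b) < dist(x, a):
                    errors += 1
    return errors / total
```

A context-controlled version of this test would restrict A, B, and X to tokens drawn from matched (or deliberately mismatched) phonetic contexts, so that the score isolates how stable the representation is across contexts.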