ISCA Archive Interspeech 2022

Empirical Sampling from Latent Utterance-wise Evidence Model for Missing Data ASR based on Neural Encoder-Decoder Model

Ryu Takeda, Yui Sudo, Kazuhiro Nakadai, Kazunori Komatani

Missing data automatic speech recognition (MD-ASR) can exploit the uncertainty of speech enhancement (SE) results without re-training model parameters. Such uncertainty is represented by a probabilistic evidence model, so both its design and the calculation of expectations over it are important. Two problems arise when applying the MD approach to utterance-wise ASR based on a neural encoder-decoder model: the high dimensionality of an utterance-wise evidence model, and the discontinuity among frames of the samples generated when approximating the expectation with the Monte Carlo method. We propose new utterance-wise evidence models using a latent variable, together with an empirical method for sampling from them. The space of our latent model is restricted by simpler conditional probability density functions (pdfs) given the latent variable, which enables us to generate samples from the low-dimensional space in a deterministic or stochastic way. Because the latent variable also acts as a smoothing parameter shared among the simple pdfs, the generated samples are continuous across frames, which improves ASR performance, unlike frame-wise models. The uncertainty from a neural SE model is also used as a component in our mixture pdf models. Experiments showed that the character error rate of the enhanced speech was further improved by 2.5 points on average with our MD-ASR using a transformer model.
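The key idea of frame continuity can be illustrated with a minimal sketch (not the paper's actual model): if one latent variable z is drawn once per utterance and conditions every frame-wise Gaussian pdf, the sampled feature trajectories stay smooth, whereas drawing independent noise per frame produces jagged trajectories. All names and parameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 100, 4  # frames, feature dimensions (illustrative sizes)
# A smooth "enhanced speech" feature trajectory standing in for the SE output.
clean = np.cumsum(rng.normal(0.0, 0.05, size=(T, D)), axis=0)
# Per-frame SE uncertainty (standard deviation), here constant for simplicity.
sigma = 0.5 * np.ones((T, D))

def sample_framewise(n):
    # Frame-wise evidence model: independent Gaussian noise at every frame,
    # so consecutive frames of each sample are discontinuous.
    return clean + sigma * rng.normal(size=(n, T, D))

def sample_latent(n):
    # Latent evidence model: one scalar z per utterance conditions all
    # frame-wise pdfs, shifting every frame coherently -> smooth samples.
    z = rng.normal(size=(n, 1, 1))
    return clean + sigma * z

def mean_frame_jump(x):
    # Average absolute change between consecutive frames (roughness measure).
    return float(np.abs(np.diff(x, axis=1)).mean())

framewise = sample_framewise(32)
latent = sample_latent(32)
print(mean_frame_jump(latent), mean_frame_jump(framewise))
```

Under this toy setup, the shared latent variable keeps the frame-to-frame variation of the samples close to that of the enhanced features themselves, while the frame-wise model adds noise-level jumps at every frame boundary.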