ISCA Archive Interspeech 2024

Cross-Modality Diffusion Modeling and Sampling for Speech Recognition

Chia-Kai Yeh, Chih-Chun Chen, Ching-Hsien Hsu, Jen-Tzung Chien

The diffusion model excels as a generative model for continuous data within a single modality. To extend its effectiveness to speech recognition, where continuous speech frames serve as the condition for generating discrete word tokens, building a conditional diffusion over a discrete state space becomes crucial. This paper introduces a non-autoregressive discrete diffusion model, enabling parallel generation of a word string corresponding to a speech signal through iterative diffusion steps. An acoustic transformer encoder extracts the speech representation, which serves as the condition for a denoising transformer decoder to predict the whole discrete sequence. To reduce feature redundancy in cross-modality diffusion, an additional feature decorrelation objective is integrated during optimization. This paper further reduces inference time by using a fast sampling approach. Experiments on speech recognition demonstrate the merit of the proposed method.
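To make the encoder-conditioned discrete diffusion concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation: the module names (CrossModalityDiffusionASR, decorrelation_loss, parallel_sample), layer sizes, and the absorbing [MASK]-state formulation are illustrative choices, and the decorrelation penalty shown (off-diagonal entries of a normalized feature correlation matrix) is one plausible reading of the paper's feature decorrelation objective.

```python
import torch
import torch.nn as nn


class CrossModalityDiffusionASR(nn.Module):
    """Sketch of a conditional discrete diffusion model for ASR.

    Hypothetical stand-in for the acoustic transformer encoder and
    denoising transformer decoder described in the abstract.
    """

    def __init__(self, vocab_size: int, d_model: int = 256, mask_id: int = 0):
        super().__init__()
        self.mask_id = mask_id  # absorbing [MASK] state of the discrete diffusion
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=4,
        )
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=4,
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        self.project = nn.Linear(d_model, vocab_size)

    def forward(self, speech_feats, noisy_tokens):
        # Continuous speech frames condition the denoiser via cross-attention.
        cond = self.encoder(speech_feats)
        hidden = self.decoder(self.embed(noisy_tokens), cond)
        return self.project(hidden), hidden


def decorrelation_loss(hidden):
    """Hypothetical feature decorrelation objective: penalize off-diagonal
    entries of the normalized feature correlation matrix so that hidden
    dimensions carry non-redundant information."""
    h = hidden.reshape(-1, hidden.size(-1))
    h = (h - h.mean(0)) / (h.std(0) + 1e-6)
    corr = (h.T @ h) / h.size(0)
    off_diag = corr - torch.diag(torch.diagonal(corr))
    return off_diag.pow(2).sum() / corr.numel()


@torch.no_grad()
def parallel_sample(model, speech_feats, seq_len, num_steps=4):
    """Non-autoregressive sampling: start from an all-[MASK] sequence and
    iteratively re-predict the whole word string; fewer steps mean faster
    inference. A full discrete diffusion sampler would also re-mask
    low-confidence positions between steps."""
    tokens = torch.full(
        (speech_feats.size(0), seq_len), model.mask_id, dtype=torch.long
    )
    for _ in range(num_steps):
        logits, _ = model(speech_feats, tokens)
        tokens = logits.argmax(-1)  # greedy re-estimation at each diffusion step
    return tokens
```

In this sketch, the decorrelation term would be added to the denoising cross-entropy during training, and reducing num_steps in parallel_sample illustrates the kind of fast-sampling trade-off the abstract refers to.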