ISCA Archive SynData4GenAI 2024

Machine Speech Chain

Andros Tjandra

Although speech perception and production are closely related, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has largely advanced independently, with little mutual influence. Human communication, however, relies heavily on a closed-loop speech chain mechanism, in which auditory feedback of one's own voice plays a pivotal role. This talk explores a novel approach that bridges this gap by developing a closed-loop machine speech chain model using deep learning techniques.

Our model employs a sequence-to-sequence architecture that leverages both labeled and unlabeled data, enhancing the training process. This dual-direction functionality allows the ASR component to transcribe unlabeled speech, while the TTS component reconstructs the original speech features from the ASR transcription. Conversely, the TTS component synthesizes speech from unlabeled text, and the ASR component reconstructs the original text from the TTS-generated speech.
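The closed-loop training described above can be sketched as follows. This is a conceptual illustration only, not the actual architecture from the talk: `asr` and `tts` are hypothetical placeholder callables, and `seq_loss` is a toy stand-in for the cross-entropy and spectrogram reconstruction losses a real system would use.

```python
def seq_loss(pred, target):
    """Toy sequence loss: fraction of mismatched positions (stand-in for
    cross-entropy / L1 spectrogram losses in a real speech chain system)."""
    n = max(len(pred), len(target))
    mismatches = sum(1 for a, b in zip(pred, target) if a != b)
    mismatches += abs(len(pred) - len(target))
    return mismatches / n if n else 0.0

def speech_chain_step(asr, tts, batch):
    """One training step combining supervised and closed-loop losses.

    asr, tts: hypothetical sequence-to-sequence models (callables).
    batch: dict with optional keys "paired", "speech_only", "text_only".
    """
    losses = {}
    # Supervised: paired (speech, text) data trains both directions directly.
    if "paired" in batch:
        speech, text = batch["paired"]
        losses["asr_sup"] = seq_loss(asr(speech), text)
        losses["tts_sup"] = seq_loss(tts(text), speech)
    # Speech-only loop: ASR transcribes, TTS reconstructs the original speech.
    if "speech_only" in batch:
        speech = batch["speech_only"]
        pseudo_text = asr(speech)
        losses["chain_speech"] = seq_loss(tts(pseudo_text), speech)
    # Text-only loop: TTS synthesizes, ASR reconstructs the original text.
    if "text_only" in batch:
        text = batch["text_only"]
        pseudo_speech = tts(text)
        losses["chain_text"] = seq_loss(asr(pseudo_speech), text)
    return losses
```

The key design point is that the unsupervised terms need no transcripts: each model supervises the other through reconstruction of its own input.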

This integration not only mimics human speech behavior but also marks the first application of this closed-loop mechanism in deep learning models. Our experimental results demonstrate significant performance improvements over traditional systems trained solely on labeled data.