ISCA Archive Interspeech 2025

Efficient Streaming TTS Acoustic Model with Depthwise RVQ Decoding Strategies in a Mamba Framework

Joun Yeop Lee, Sangjun Park, Byoung Jin Choi, Ji-Hyun Lee, Min-Kyung Kim, Hoon-Young Cho

Recent advances in neural codec-based text-to-speech (TTS) systems have achieved remarkable synthesized speech quality. However, their reliance on large model sizes and heavy computational requirements limits CPU-based on-device deployment. In this work, we present a Mamba-based streaming acoustic model with two novel depthwise decoding strategies for residual vector quantization (RVQ)-based codecs: a Masked Language Model (MLM) approach and an Implicit Neural Representation (INR) approach. The MLM strategy iteratively refines tokens along the code depth axis to enhance speech quality, whereas the INR approach predicts all quantization levels in parallel to reduce computational cost. We further incorporate a speaker embedding conditioning mechanism for the zero-shot scenario, enabling robust performance on unseen speakers. Experimental results demonstrate comparable or superior performance on both objective and subjective metrics relative to larger TTS baseline models.
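The MLM strategy described above (iterative refinement along the RVQ code depth axis via confidence-based unmasking) can be illustrated with a minimal sketch. The random "model" below is a placeholder for the actual acoustic model, and the linear unmasking schedule, mask token, and function name are all assumptions for illustration, not details from the paper:

```python
import numpy as np

def mlm_depthwise_decode(num_frames=4, depth=4, vocab=8, iters=4, seed=0):
    """Toy MLM-style refinement over an RVQ code grid (depth x frames).

    All tokens start masked; each iteration, a placeholder random model
    scores candidates, and the most confident predictions among the
    still-masked slots are committed. This mimics confidence-based
    iterative unmasking; the real system would use the trained acoustic
    model instead of random logits.
    """
    rng = np.random.default_rng(seed)
    MASK = -1
    codes = np.full((depth, num_frames), MASK, dtype=int)

    for it in range(iters):
        # Placeholder "model": logits over the codebook for every slot.
        logits = rng.standard_normal((depth, num_frames, vocab))
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        pred = probs.argmax(-1)
        conf = probs.max(-1)

        # Only still-masked positions are candidates for unmasking.
        conf = np.where(codes == MASK, conf, -np.inf)

        # Linear schedule: commit enough slots per round that the grid
        # is fully decoded by the final iteration (an assumed schedule).
        masked = int((codes == MASK).sum())
        k = max(1, masked // (iters - it)) if masked else 0
        if k:
            flat = np.argsort(conf, axis=None)[::-1][:k]
            d, t = np.unravel_index(flat, conf.shape)
            codes[d, t] = pred[d, t]
    return codes

codes = mlm_depthwise_decode()
```

By contrast, the INR approach would replace this loop with a single parallel prediction across all quantization levels, trading the iterative quality refinement for lower computational cost.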