ISCA Archive Interspeech 2025

VoiceNet: Multilingual On-Device Phoneme-To-Audio Alignment

Kun Jin, Siva Penke, Srinivasa Algubelli

Phoneme-to-audio alignment has many applications and is generally considered an important task in lip-sync systems, where an avatar's lip shape is synchronized with the corresponding speech signal. In this work, we propose VoiceNet, a novel end-to-end on-device multilingual model that learns both phoneme recognition and text-independent forced alignment. VoiceNet supports on-device inference in both real-time and non-real-time modes. Moreover, in the non-real-time scenario, we show that performance can be further enhanced when the text is given. Our experiments demonstrate that VoiceNet achieves competitive performance compared with state-of-the-art phoneme recognition and forced alignment results on LibriSpeech and a multilingual dataset. Benchmarked on a set of Galaxy phones, VoiceNet achieves an average phoneme inference latency of 6 ms on CPU, demonstrating high computational efficiency. Furthermore, VoiceNet achieves a 2x speedup on GPU and a 10x speedup on NPU.