ISCA Archive Interspeech 2024

Deep Prosodic Features in Tandem with Perceptual Judgments of Word Reduction for Tone Recognition in Conversed Speech

Xiang-Li Lu, Yi-Fen Liu

To tackle the tone classification problem in conversational speech, we propose a transformer-based encoding network that classifies the tones in an utterance syllable by syllable. Using only F0 and rhythmic information, the interaction encoder first consolidates the contour representations. By jointly predicting word tones with perceptual judgments of reduction degree, the learning architecture improves automatic recognition of the underlying syllable tones. Experiments show that, with these enhancements, the proposed model is robust and achieves a 12% increase in tone classification accuracy.
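The joint prediction described above can be illustrated with a minimal multi-task loss sketch. It is not the authors' implementation: the abstract does not specify the objective, so the weighted sum of two cross-entropy terms, the names `joint_loss` and `aux_weight`, and the class counts used in the example (5 tone classes, 3 reduction levels) are all assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-likelihood of the target class index."""
    return -math.log(softmax(logits)[target])


def joint_loss(tone_logits, tone_targets,
               reduction_logits, reduction_targets,
               aux_weight=0.5):
    """Hypothetical joint objective over the syllables of one utterance.

    tone_logits / reduction_logits: one list of class scores per syllable.
    tone_targets / reduction_targets: one class index per syllable.
    aux_weight: assumed weight on the reduction-degree term.
    """
    n = len(tone_targets)
    if n == 0:
        raise ValueError("utterance has no syllables")
    tone_term = sum(cross_entropy(l, t)
                    for l, t in zip(tone_logits, tone_targets)) / n
    reduction_term = sum(cross_entropy(l, t)
                         for l, t in zip(reduction_logits, reduction_targets)) / n
    # Both tasks share the same syllable encodings, so the auxiliary
    # reduction term can shape the features used for tone classification.
    return tone_term + aux_weight * reduction_term


# Two syllables, 5 assumed tone classes, 3 assumed reduction levels.
uniform = joint_loss([[0.0] * 5, [0.0] * 5], [1, 3],
                     [[0.0] * 3, [0.0] * 3], [0, 2])
confident = joint_loss([[0, 9, 0, 0, 0], [0, 0, 0, 9, 0]], [1, 3],
                       [[9, 0, 0], [0, 0, 9]], [0, 2])
```

In this sketch, scores that favor the correct tone and reduction classes give a lower loss than uniform scores, and setting `aux_weight` to 0 reduces the objective to tone classification alone.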