There are many abbreviation and non-standard words in SMS and Twitter messages. They are problematic for text-to-speech (TTS) or language processing techniques for these data. A character-based machine translation (MT) approach was previously used for normalization of nonstandard words. In this paper, we propose a two-step translation method to leverage phonetic information, where non-standard words are first translated to possible pronunciations, which are then translated to standard words. We further combine it with the single-step character-based translation module. Our experiments show that our proposed method significantly outperforms previous results in both n-best coverage and 1-best accuracy.
Index Terms: text normalization, text-to-speech, abbreviation