Phrasing structure is one of the most important factors in increasing
the naturalness of text-to-speech (TTS) systems, in particular for
long-form reading. Most existing TTS systems are optimized for isolated
short sentences and completely disregard the larger context or structure
of the text.
This paper describes how we built phrasing models from data
extracted from audiobooks. We investigate how various types of
textual features can improve phrase break prediction: part-of-speech
(POS), guess POS (GPOS), dependency tree features and word embeddings.
These features are fed into a bidirectional LSTM (BiLSTM) or a CART
baseline. The resulting systems are compared using both objective and
subjective evaluations. Using a BiLSTM and word embeddings proves to be
beneficial.
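
As an illustration of the task setup, phrase break prediction can be framed as labelling each word with "break follows" or "no break". The sketch below is a minimal example under stated assumptions, not the method used in this work: the `|` annotation symbol, the function names `to_labelled_tokens` and `break_f1`, and the choice of F1 on the break class as an objective measure are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

BREAK_MARK = "|"  # hypothetical annotation symbol, not from the paper


def to_labelled_tokens(annotated: str) -> List[Tuple[str, int]]:
    """Convert 'w1 w2 | w3' into [(w1, 0), (w2, 1), (w3, 0)].

    Label 1 means a phrase break follows the word, 0 means no break.
    """
    pairs: List[Tuple[str, int]] = []
    for item in annotated.split():
        if item == BREAK_MARK:
            # Attach the break to the preceding word, if there is one.
            if pairs:
                word, _ = pairs[-1]
                pairs[-1] = (word, 1)
        else:
            pairs.append((item, 0))
    return pairs


def break_f1(reference: List[int], predicted: List[int]) -> float:
    """F1 score on the break class (label 1), an assumed objective measure."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Usage: reference labels from an annotated sentence, compared with
# a made-up prediction that gets one of the two breaks right.
tokens = to_labelled_tokens("the old man | sat down | and slept")
reference_labels = [label for _, label in tokens]
predicted_labels = [0, 0, 1, 0, 0, 1, 0]
score = break_f1(reference_labels, predicted_labels)
```

In this framing, any sequence model (such as a BiLSTM) or per-token classifier (such as CART) would produce the predicted label sequence from per-word input features.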