Both linguists and engineers ask questions about language and speech, but their concerns differ. Although both communities look for what makes up communication, linguists look for what constitutes the abstract linguistic system in the human mind and brain, while engineers look for ways to model and simulate speech for technology implementation. What if the question addressed is the fluent speech of Mandarin Chinese, and the answers are to satisfy both linguists and engineers? Put differently, the question for the linguists becomes what there is to be studied in addition to lexical tones and intonation, and the question for the engineers becomes how fluent speech prosody can be simulated beyond simply adding up tones and intonations. In a bold attempt to bring answers to both communities, we first decided to adopt a corpus approach to phonetic studies, an attempt to remedy the traditional phonetic approach by looking at more samples. To ensure that the corpora contain fluent prosody information, we collected read narrative discourses rather than canonical phrases. A total of 9 sets of speech corpora with different prosodic features were recorded over a decade (http://www.myet.com/COSPRO). We then designed a perceptually based annotation system that emphasized boundary information and boundary breaks, and manually labeled the corpora. The annotation consistently identified multiple-phrase speech paragraphs and various kinds of prosodic units within them. We studied the acoustic-phonetic correlates of the annotated paragraphs, units and boundary breaks in detail and, through quantitative analyses, found systematic cross-phrase patterns in every acoustic parameter for each unit identified. That is, F0 contours, syllable duration patterns, intensity distribution patterns and, on top of these, systematic boundary information and boundary breaks are found across phrases. These patterns hold not only across speakers but also across speaking rates.
It became obvious that what constitutes fluency lies neither in the tonal realization of each syllable nor in the individual phrase intonation, but rather in the association between and among intonation phrases (IPs). These associations, or associative prosodic relationships, arise from higher-level governing by the discourse; what they reflect is mainly top-down governing. A framework of the multi-phrase hierarchy was subsequently constructed to account for fluent speech prosody. The term Prosodic Phrase Grouping (PG) was proposed for the framework to denote how IPs are grouped to form a higher and larger prosodic unit, one that roughly corresponds to speech paragraphs in narratives or spoken discourses. Central to the framework is the notion that individual phrasal intonations are subjacent sister constituents subject to higher-level constraints that specify layered modifications at each prosodic level, while the ultimate output of fluent prosody is achieved by adding up the contributions from each prosodic layer. From our data analyses, we were able to show just how cumulative modifications account for the overall patterns in fluent speech, in particular syllable duration as well as boundary pause patterns (Tseng et al., 2005). Subsequently, we were able to derive acoustic templates for each prosodic unit in the framework, namely templates for global F0 contours, syllable durations and intensity distribution. These templates facilitated the construction of a modular model of multiple-phrase grouping with 4 corresponding acoustic modules for speech synthesis applications.
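As a rough illustration of the additive idea, the Python sketch below predicts a syllable duration by starting from a base value and summing one position-dependent offset per prosodic layer. It is a minimal sketch under stated assumptions, not the model reported in Tseng et al. (2005): the layer names ("phrase", "group"), the position indices, the `Syllable` type, the `predict_duration` function and every number are hypothetical placeholders chosen only to show the cumulative mechanism.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Syllable:
    # Hypothetical intrinsic duration of the syllable, in milliseconds.
    base_duration_ms: float
    # Hypothetical position index of the syllable within each layer's unit,
    # e.g. {"phrase": 2, "group": 0}. Layer names are assumptions.
    position: Dict[str, int]


# Hypothetical per-layer offsets in milliseconds, keyed by position index.
# These are illustrative placeholders, not values derived from any corpus.
LAYER_OFFSETS_MS: Dict[str, Dict[int, float]] = {
    "phrase": {0: -5.0, 1: 0.0, 2: 20.0},
    "group": {0: 10.0, 1: 0.0, 2: 15.0},
}


def predict_duration(syl: Syllable,
                     layer_offsets: Dict[str, Dict[int, float]]) -> float:
    """Sum the base duration and one offset per prosodic layer."""
    total = syl.base_duration_ms
    for layer, offsets in layer_offsets.items():
        # A layer contributes nothing if the syllable has no position in it
        # or if no offset is defined for that position.
        total += offsets.get(syl.position.get(layer), 0.0)
    return total


if __name__ == "__main__":
    final_syllable = Syllable(180.0, {"phrase": 2, "group": 2})
    print(predict_duration(final_syllable, LAYER_OFFSETS_MS))
```

Keeping each layer as a separate table mirrors the modular view described above: a layer can be re-estimated or swapped without touching the others, and the output is simply the sum of the layered contributions.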
By the same logic, we also view spoken discourse prosody as yet another higher node that groups PGs into sister constituents. Our more recent work aims to establish discourse prosody organization from the PG upward. Again looking at the larger picture, we studied relative F0 range narrowing vs. widening as well as F0 resets across PGs and boundaries. So far we have found two types of prosodic links that involve F0 narrowing and subsequent F0 reset. One type of F0 narrowing is duration-triggered and redundant, which we term Prosodic Fillers (PF); the other is lexically and/or syntactically triggered and obligatory, which we term Discourse Markers (DM). These two links appear to function mainly as a major source of melodic and rhythmic variation in output prosody. They also turned out to be predictable from text analyses.
In summary, what the prosodic specifications discussed above reveal is essentially the global, relative prosodic relationships across phrases in fluent speech; what they reflect is top-down governing by semantic constraints from the discourse and cognitive constraints from the speaker. All of them are crucial to on-line speech planning and processing of discourse information. We argue that any prosody framework of fluent speech should include top-down information, specify how intonation phrases are formed, and take into consideration perceptual effects on on-line processing. Moreover, how discourse prosody is organized deserves further attention. Technology development could serve as the best testing ground for these findings. As for a tone language such as Mandarin Chinese, in addition to syllable tones and phrase intonations, there also exist cross-phrase melody, rhythm and loudness patterns that are necessary to form its fluent speech prosody. We believe these non-tonal aspects not only bear cross-linguistic significance, but also merit more attention in studies of tone languages in general.