This paper discusses the construction of the component of text-to-speech systems that is responsible for computing speech timing (the duration system). Currently, a variety of approaches are in use, ranging from manually constructed sequential rule systems that incorporate a fair amount of linguistic knowledge to systems constructed by statistical means that incorporate such knowledge only minimally. The recent availability of large labeled, segmented speech corpora seems to give the edge to statistical approaches. The paper discusses some general properties of timing in natural speech and presents a concrete argument for why both linguistic knowledge and statistical analysis are essential. The challenge lies in designing systems that take optimal advantage of both. In addition, more knowledge is needed to resolve empirical issues raised by duration system construction.
Keywords: speech timing, prosody