ISCA Archive SpeechProsody 2012

Framework for consistent speech databases

Sören Wittenberg, Rüdiger Hoffmann

The speech processing framework introduced here creates phonetically and prosodically annotated speech databases. It provides structured data files in eXtensible Markup Language (XML) format. These files contain all available information about a recorded utterance, including the speech signal. A Document Type Definition (DTD) describes the data structure and makes automatic validation of that structure possible. This ensures that the data can be read correctly by humans and that the speech processing tools in use remain interoperable. A user can browse the speech databases with an ordinary web browser, provided that the browser supports XSL transformation, ECMAScript and Scalable Vector Graphics to visualize the content. When the user requests an utterance, the browser receives the requested file with all available information about that utterance. The advantage is that the user obtains the same data as a speech processing tool that accesses the underlying file server. Navigating through the different speech layers is like browsing a web page: the user clicks on the part of interest, and an ECMAScript script embedded in the file filters the data and updates the display. The script is part of the XSL transformation file. It also allows elementary editing of the utterance content, such as changing a word boundary by moving the corresponding boundary mark. The changes are committed to the web server, which can handle further processing such as integration into a Subversion system. Because the entire content of the speech database consists of strings, a standard search engine can be used to search it. Searching for a phoneme in a specific context yields all available matching phonemes together with all information about them, including the speech signal.
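
As a rough illustration of the context-dependent phoneme search described above, the following Python sketch parses a single utterance file and looks up a phoneme by its left and right neighbours. The abstract does not give the DTD, so the element and attribute names (`utterance`, `signal`, `word`, `phoneme`, `label`, `start`, `end`), the base64 encoding of the signal, and the helper `find_phoneme` are all assumptions made for this example, not the framework's actual format or API.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical utterance file. The real structure is defined by the
# framework's DTD, which is not shown in the abstract; every name and
# the base64 signal encoding below are assumptions.
UTTERANCE_XML = """\
<utterance id="u001">
  <signal encoding="base64" rate="16000">AAECAwQF</signal>
  <word label="speech" start="0.00" end="0.42">
    <phoneme label="s" start="0.00" end="0.09"/>
    <phoneme label="p" start="0.09" end="0.15"/>
    <phoneme label="i:" start="0.15" end="0.31"/>
    <phoneme label="tS" start="0.31" end="0.42"/>
  </word>
  <word label="data" start="0.42" end="0.80">
    <phoneme label="d" start="0.42" end="0.48"/>
    <phoneme label="eI" start="0.48" end="0.62"/>
    <phoneme label="t" start="0.62" end="0.68"/>
    <phoneme label="@" start="0.68" end="0.80"/>
  </word>
</utterance>
"""


def find_phoneme(root, label, left=None, right=None):
    """Return all phonemes with the given label whose neighbours match.

    `left` and `right` are optional labels of the preceding and following
    phoneme; None means "any neighbour, or no neighbour at all".
    """
    phonemes = list(root.iter("phoneme"))  # document order, across words
    hits = []
    for i, ph in enumerate(phonemes):
        if ph.get("label") != label:
            continue
        prev_label = phonemes[i - 1].get("label") if i > 0 else None
        next_label = (
            phonemes[i + 1].get("label") if i + 1 < len(phonemes) else None
        )
        if left is not None and prev_label != left:
            continue
        if right is not None and next_label != right:
            continue
        hits.append(
            {
                "label": label,
                "start": float(ph.get("start")),
                "end": float(ph.get("end")),
                "left": prev_label,
                "right": next_label,
            }
        )
    return hits


root = ET.fromstring(UTTERANCE_XML)

# The signal travels in the same file as the annotation (here as base64 text).
signal_bytes = base64.b64decode(root.findtext("signal"))

# All /t/ phonemes that directly follow /eI/.
matches = find_phoneme(root, "t", left="eI")
```

Because each hit carries its start and end times and the signal is stored in the same file, a caller could cut the matching segment out of `signal_bytes` without consulting a second data source, which is the property the abstract highlights.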

Index Terms: speech database, framework, structured data