ISCA Archive Eurospeech 1997

Several Measures for Selecting Suitable Speech Corpora

Shuichi Itahashi, Naoko Ueda, Mikio Yamamoto

We statistically investigate various speech corpora to extract useful information reflecting the contents of each corpus, with the aim of creating guidelines for selecting the most suitable corpus. Words are not separated by spaces in Japanese text. Accordingly, we adopt n-gram counting methods to extract frequent mora sequences instead of words. A mora roughly corresponds to a syllable. By investigating the frequencies of 1- to 10-mora sequences in six existing corpora, we can identify distinctions between written and spoken language, as well as keywords and topics of dialogues. This paper shows that simple statistical investigation makes it possible to represent the contents of a corpus to some extent without complicated processing such as morphological analysis.
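The mora n-gram counting described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the input is already transcribed in kana, and it uses a simplified rule of our own in which a small kana (such as ゃ, ゅ, ょ) merges with the preceding kana to form one mora. The names `split_morae`, `count_mora_ngrams`, and `toy_corpus` are hypothetical and chosen for illustration only.

```python
from collections import Counter

# Small kana that merge with the preceding kana into a single mora.
# Assumption: this simplified rule is ours, not taken from the paper.
SMALL_KANA = set("ゃゅょぁぃぅぇぉャュョァィゥェォ")


def split_morae(kana_text):
    """Split a kana string into a list of morae (simplified heuristic)."""
    morae = []
    for ch in kana_text:
        if ch in SMALL_KANA and morae:
            morae[-1] += ch  # e.g. "き" + "ょ" becomes the single mora "きょ"
        else:
            morae.append(ch)
    return morae


def count_mora_ngrams(utterances, max_n=10):
    """Return a dict mapping n to a Counter of n-mora tuples, n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for text in utterances:
        morae = split_morae(text)
        for n in range(1, max_n + 1):
            # Slide a window of n morae across the utterance.
            for i in range(len(morae) - n + 1):
                counts[n][tuple(morae[i:i + n])] += 1
    return counts


# Hypothetical toy corpus (not data from the paper).
toy_corpus = ["きょうはいいてんきです", "きょうはあめです"]
counts = count_mora_ngrams(toy_corpus)

# List the most frequent 3-mora sequences in the toy corpus.
for ngram, freq in counts[3].most_common(3):
    print("".join(ngram), freq)
```

Because the counts are taken over morae rather than words, no word segmentation or morphological analysis is needed; frequent long sequences can then be inspected as candidate keywords or fixed expressions.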