ISCA Archive Interspeech 2010

SEAME: a Mandarin-English code-switching speech corpus in South-East Asia

Dau-Cheng Lyu, Tien-Ping Tan, Eng Siong Chng, Haizhou Li

In Singapore and Malaysia, people often mix Mandarin and English within a single sentence, a phenomenon we call intra-sentential code-switching. In this paper, we report the development of SEAME, a Mandarin-English code-switching spontaneous speech corpus. As part of a multilingual speech recognition project, the corpus is designed to support the study of how Mandarin-English code-switching occurs in spoken language in South-East Asia, and to provide insights into developing large vocabulary continuous speech recognition (LVCSR) that covers code-switching speech. The corpus consists of intra-sentential code-switching utterances recorded under both interview and conversational settings. The paper describes the corpus design and an analysis of the collected corpus.