The growing adoption of spoken dialogue systems increases the need for fast development of spoken language understanding modules that semantically tag speakers' turns. Statistical methods perform well on this task but require large training corpora, which are expensive to collect in both time and human expertise. In this paper we propose a semi-automatic annotation process for the fast production of dialogue corpora. The approach consists of automatically pre-annotating the corpus and then manually correcting the annotation. To perform the pre-annotation, we propose to port an existing corpus and adapt it to the new data. The French MEDIA dialogue corpus is used as a starting point to produce two new corpora: one for a new language (Italian) and another for a new domain (theatre ticket reservation). We show that automatic pre-annotation leads to a significant gain in productivity compared to fully manual annotation, and thus allows new adaptation data to be derived, which can be used to further improve the systems.
Index Terms: Spoken Dialogue Systems, Spoken Language Understanding, Language Portability, Statistical Machine Translation