ISCA Archive Interspeech 2012

Euskoparl: a speech and text Spanish-Basque parallel corpus

Alicia Pérez, José M. Alcaide, María-Inés Torres

Advances in corpus-based approaches and machine learning techniques have promoted the development of minority languages. The aim of this work is to acquire a parallel corpus in Spanish and Basque with both text and speech data. To allow the results to be compared with those obtained for other languages, we took Europarl as a reference. Thus, the data were acquired from Basque Parliament reports and speeches. The acquisition process differs subtly from that of Europarl. The resulting corpus is described, and a few preliminary machine translation experiments with Moses are reported.

Index Terms: speech resources, statistical machine translation, under-resourced languages