ISCA Archive Odyssey 2008

Building language detectors using small amounts of training data

David A. van Leeuwen, Niko Brümmer

In this paper we present language detectors built using relatively small amounts of training data. This is achieved by exploiting the modelling power of a Linear Discriminant Analysis (LDA) back-end for the languages for which only a small amount of training data is available. We present experiments on NIST 2005 Language Recognition Evaluation data, where we use a jackknifing technique to remove well-trained language knowledge from the LDA back-end, so that only sparse trials are used to train the LDA. We investigate three systems, which show different levels of loss of language detection capability. We validate the technique on an independent collection of 21 languages, where we show that with less than one hour of training data we obtain an error rate for ‘new’ languages that is only slightly more than twice the error rate for languages for which the full 60 hours of CallFriend data is available.
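A minimal sketch of the general idea, not the authors' implementation: an LDA back-end is fitted on per-trial score vectors, and a leave-one-language-out style jackknife refits the back-end for each target language using only a small ("sparse") subset of that language's trials, mimicking a language with little training data. The data layout (a trials-by-systems score matrix), the use of scikit-learn's LinearDiscriminantAnalysis, and the n_sparse parameter are illustrative assumptions.

```python
# Illustrative sketch (assumed interfaces, not the authors' code):
# LDA back-end over per-trial score vectors with a jackknife that
# limits the target language to sparse training trials.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def train_backend(scores, labels):
    """Fit an LDA back-end on score vectors of shape (n_trials, n_systems)."""
    lda = LinearDiscriminantAnalysis()
    lda.fit(scores, labels)
    return lda


def jackknife_backends(scores, labels, languages, n_sparse=50, seed=0):
    """For each target language, refit the back-end keeping all other
    languages' trials but only a small random subset of the target
    language's trials, so its 'well-trained' knowledge is removed."""
    rng = np.random.default_rng(seed)
    backends = {}
    for lang in languages:
        idx_lang = np.flatnonzero(labels == lang)
        idx_other = np.flatnonzero(labels != lang)
        sparse = rng.choice(idx_lang,
                            size=min(n_sparse, idx_lang.size),
                            replace=False)
        keep = np.concatenate([idx_other, sparse])
        backends[lang] = train_backend(scores[keep], labels[keep])
    return backends
```

In this sketch, detection scores for a given target language would be taken from the back-end trained with that language restricted to sparse trials, so the evaluation reflects how the system behaves for a 'new' language with little data.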