ISCA Archive Odyssey 2010

Estimating and Exploiting Language Distributions of Unlabeled Data

Alan McCree

This paper addresses the problem of estimating language distributions from unlabeled data. We present a new algorithm that treats automated language identification classifier outputs as likelihoods and iteratively applies Bayes' rule to reclassify the data, using successively improving distribution estimates as "priors". Experimental results using the MIT LL submission to the NIST LRE07 evaluation show significant improvements in the estimation of non-uniform distributions compared with a baseline counting approach. In addition, we show how to incorporate these estimated distributions into the classification task. Further experiments on the LRE07 corpus show large gains for both the detection/verification and identification tasks when only a small set of languages is actually present in the test set.
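The iterative estimate described in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: each new "prior" is taken to be the average of the per-utterance posteriors (a soft reclassification), the iteration starts from a uniform distribution, and the counting baseline is taken to be the fraction of utterances whose highest likelihood falls on each language. The function names and the toy likelihood values are hypothetical and are not taken from the paper or from the LRE07 data.

```python
def estimate_priors(likelihoods, num_iters=50):
    """Iteratively re-estimate the language distribution.

    likelihoods[n][k] is the classifier output for utterance n and
    language k, treated as a likelihood. Assumption: the new prior is
    the mean posterior over all utterances.
    """
    num_langs = len(likelihoods[0])
    priors = [1.0 / num_langs] * num_langs  # start from a uniform prior
    for _ in range(num_iters):
        totals = [0.0] * num_langs
        for row in likelihoods:
            # Bayes' rule: posterior is proportional to prior * likelihood.
            joint = [p * l for p, l in zip(priors, row)]
            norm = sum(joint)
            for k in range(num_langs):
                totals[k] += joint[k] / norm
        priors = [t / len(likelihoods) for t in totals]
    return priors


def count_baseline(likelihoods):
    """Assumed baseline: fraction of utterances won by each language."""
    num_langs = len(likelihoods[0])
    counts = [0] * num_langs
    for row in likelihoods:
        counts[row.index(max(row))] += 1
    return [c / len(likelihoods) for c in counts]


def classify(likelihoods, priors):
    """Identification using the estimated distribution as the prior."""
    return [
        max(range(len(priors)), key=lambda k: priors[k] * row[k])
        for row in likelihoods
    ]


# Hypothetical toy data: six utterances, three candidate languages.
toy_likelihoods = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.3, 0.6, 0.1],
    [0.4, 0.5, 0.1],
    [0.8, 0.1, 0.1],
]
estimated = estimate_priors(toy_likelihoods)
labels = classify(toy_likelihoods, estimated)
```

In this toy setup the third language never receives more than weak support, so its estimated share can only shrink across iterations. This mirrors the situation in the abstract where only a small set of languages is present in the test set.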