In recent decades, many successful approaches to language identification
have been published. However, almost none of them were developed with
singing in mind. Singing differs from speech in many characteristics,
such as a greater variance of fundamental frequencies and phoneme
durations, vibrato, pronunciation differences, and different semantic
content.
We present a new phonotactic language identification system for
singing based on phoneme posteriorgrams. These posteriorgrams are
extracted using acoustic models trained on English speech (TIMIT)
and on an unannotated English-language a cappella singing dataset
(DAMP). SVM models are then trained on phoneme statistics computed
from these posteriorgrams.
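To make the pipeline concrete, the minimal sketch below shows one plausible way to summarize a per-utterance phoneme posteriorgram into fixed-length statistics and train an SVM on them. The specific statistics (mean posteriors and frame-level phoneme frequencies), the phoneme inventory size, and the random stand-in data are illustrative assumptions, not the exact features used in this work.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
N_PHONEMES = 40  # assumed size of the phoneme inventory

def posteriorgram_statistics(pg):
    """Summarize a (T, P) posteriorgram (T frames, P phonemes) into one vector."""
    mean_posterior = pg.mean(axis=0)                  # average posterior per phoneme
    decisions = pg.argmax(axis=1)                     # per-frame hard phoneme decision
    freq = np.bincount(decisions, minlength=pg.shape[1]) / len(decisions)
    return np.concatenate([mean_posterior, freq])

def random_posteriorgram(T):
    """Stand-in for acoustic-model output: rows are valid probability vectors."""
    pg = rng.random((T, N_PHONEMES))
    return pg / pg.sum(axis=1, keepdims=True)

# Dummy data: 20 utterances of varying length with alternating language labels.
utterances = [random_posteriorgram(rng.integers(100, 500)) for _ in range(20)]
labels = [i % 2 for i in range(20)]

# Fixed-length features per utterance, then a standard SVM classifier.
X = np.stack([posteriorgram_statistics(pg) for pg in utterances])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:4]))
```

The key property this illustrates is that variable-length recordings are reduced to fixed-length phonotactic summaries, so any off-the-shelf classifier can be applied.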
The models are evaluated on a set of amateur singing recordings from
YouTube and, for comparison, on the OGI Multilanguage corpus.
While the results on a cappella singing are somewhat worse than those
previously obtained using i-vector extraction, this approach is easier
to implement. Phoneme posteriorgrams are already extracted for many
applications and, with this approach, can easily be reused for language
identification.
The results on singing improve significantly when the acoustic models
have also been trained on singing. Interestingly, the best results on
the OGI speech corpus are also obtained with acoustic models trained
on singing.