Deep Neural Network (DNN) acoustic models are commonly used in today's
state-of-the-art speech recognition systems. As neural networks are a
data-driven method, the amount of available training data directly
impacts their performance. In the past, several studies have shown that
multilingual training of DNNs leads to improvements, especially in
resource-constrained tasks in which only limited training data in the
target language is available.
Previous studies have
shown that speaker adaptation can be performed successfully on DNNs
by adding speaker information (e.g., i-vectors) as additional input
features. Building on this idea of augmenting the input features, we
here present a method for adding language information to the input
features of the network. Preliminary experiments have shown improvements
when supervised information about language identity is provided to the
network.
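As a rough sketch of this kind of feature augmentation (the feature dimensionality, language inventory, and function name below are illustrative placeholders, not the setup used in our experiments), a one-hot language identity vector can simply be appended to every acoustic frame:

```python
import numpy as np

def append_language_onehot(features, lang_index, num_languages):
    """Append a one-hot language identity vector to every acoustic frame.

    features: (num_frames, feat_dim) array of acoustic features
    lang_index: integer id of the utterance's language
    num_languages: total number of training languages
    """
    one_hot = np.zeros(num_languages, dtype=features.dtype)
    one_hot[lang_index] = 1.0
    # Tile the language vector across all frames and concatenate
    # along the feature dimension.
    lang_block = np.tile(one_hot, (features.shape[0], 1))
    return np.concatenate([features, lang_block], axis=1)

# Example: 250 frames of 40-dim features, second of three languages
frames = np.random.randn(250, 40).astype(np.float32)
augmented = append_language_onehot(frames, lang_index=1, num_languages=3)
print(augmented.shape)  # (250, 43)
```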
In this work, we extended this approach by training a neural network
to encode language-specific features. We extracted these features in an
unsupervised manner and used them to provide additional cues to the DNN
acoustic model during training. Our results show that augmenting the
acoustic input features with this language code enabled the network to
better capture language-specific peculiarities, which improved the
performance of systems trained on data from multiple languages.
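One conceivable way to obtain such a language code without language labels is a bottleneck autoencoder whose narrow middle layer is taken as a per-frame code. The sketch below is only an assumed illustration of this general idea; the layer sizes, class name, and training objective are placeholders, not the extractor architecture used in this work:

```python
import torch
import torch.nn as nn

class LanguageCodeAutoencoder(nn.Module):
    """Toy bottleneck autoencoder: reconstructing the input frames forces
    the narrow middle layer to summarize the signal, and its activations
    are taken as a per-frame language code."""
    def __init__(self, feat_dim=40, code_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, feat_dim))

    def forward(self, frames):
        code = self.encoder(frames)
        return code, self.decoder(code)

extractor = LanguageCodeAutoencoder()
frames = torch.randn(250, 40)               # one utterance of dummy frames
code, recon = extractor(frames)
# Training would minimize a reconstruction loss such as MSE(recon, frames);
# at extraction time only `code` is kept and appended to the acoustic
# features before they enter the DNN acoustic model.
augmented = torch.cat([frames, code], dim=1)
print(augmented.shape)                      # torch.Size([250, 48])
```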