We studied semi-supervised training in a fully connected deep neural
network (DNN), unfolded recurrent neural network (RNN), and long short-term
memory recurrent neural network (LSTM-RNN) with respect to transcription
quality, importance data sampling, and training data amount. We found
that DNN, unfolded RNN, and LSTM-RNN are, in that order, increasingly
sensitive to labeling errors: a one-point relative WER increase in the
training transcription translates to a half-point WER increase in DNN,
slightly more in unfolded RNN, and a full point in LSTM-RNN, making
LSTM-RNN notably the most sensitive to transcription errors. We further
found that importance sampling has a similar impact on all three models:
in supervised training it yields a 2-3% relative WER reduction against
random sampling, and the gain shrinks in semi-supervised training.
Lastly, we compared the capacity of the three models as the amount of
training data increases. Experimental results suggest that LSTM-RNN
benefits more from enlarged training data than unfolded RNN and DNN do.
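
For concreteness, the snippet below sketches one way the importance data
sampling studied here can differ from random sampling when assembling a
training subset: utterances are drawn with probability proportional to an
importance score, taken here to be the decoder confidence of their
machine-generated transcripts. The confidence-based criterion, the
`Utterance` record, and the function names are illustrative assumptions,
not the exact recipe used in the experiments.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    uid: str
    hours: float
    confidence: float  # assumed importance score, e.g. decoder confidence in [0, 1]


def random_sample(pool: List[Utterance], k: int, seed: int = 0) -> List[Utterance]:
    """Baseline: draw k utterances uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(pool, k)


def importance_sample(pool: List[Utterance], k: int, seed: int = 0) -> List[Utterance]:
    """Draw k utterances with probability proportional to their importance score.

    Sampling is without replacement: one utterance is drawn at a time and
    removed from the candidate set before the next draw.
    """
    rng = random.Random(seed)
    candidates = list(pool)
    selected: List[Utterance] = []
    for _ in range(min(k, len(candidates))):
        weights = [u.confidence for u in candidates]
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        selected.append(candidates.pop(pick))
    return selected


if __name__ == "__main__":
    # Toy pool of decoded, untranscribed utterances with made-up confidences.
    pool = [Utterance(f"utt{i:03d}", 0.01, random.random()) for i in range(1000)]
    subset = importance_sample(pool, k=200)
    print(f"selected {len(subset)} utterances, "
          f"mean confidence {sum(u.confidence for u in subset) / len(subset):.3f}")
```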
Finally, we trained an LSTM-RNN in a semi-supervised fashion using 2600
hours of transcribed and 10000 hours of untranscribed data on a mobile
speech task. The semi-supervised LSTM-RNN yields a 6.56% relative WER
reduction against the supervised baseline trained on the 2600 hours of
transcribed speech.
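
As a reminder of how the relative reduction is computed, the short example
below applies the standard definition, relative reduction = (baseline WER -
new WER) / baseline WER. The 20.0% baseline WER used here is a hypothetical
number for illustration only, not a figure from the experiments.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of a new system over a baseline, as a fraction."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical baseline WER of 20.0% (illustrative, not from the experiments).
    baseline = 20.0
    # A 6.56% relative reduction would bring it down to roughly 18.69% absolute.
    improved = baseline * (1.0 - 0.0656)
    print(f"semi-supervised WER: {improved:.2f}%")
    print(f"relative reduction: {relative_wer_reduction(baseline, improved):.2%}")
```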