Several questions concerning the semi-supervised training of hybrid
ASR systems with DNN acoustic models have not yet been investigated in depth.
In this paper, we focus on the granularity of confidences
(per-sentence, per-word or per-frame) and on how the data should
be used (data selection by masks, or mini-batch SGD with confidences
as weights). We then propose to re-tune the system on the manually
transcribed data, with both ‘frame CE’ training and ‘sMBR’ training.
Our preferred semi-supervised
recipe, which is both simple and efficient, is the following: we select words
according to the word accuracy we obtain on the development set. This
recipe, which does not rely on a grid search over training hyper-parameters,
generalized well to Babel Vietnamese (transcribed 11h, untranscribed
74h), Babel Bengali (transcribed 11h, untranscribed 58h) and our custom
Switchboard setup (transcribed 14h, untranscribed 95h). We obtained
absolute WER improvements of 2.5% for Vietnamese, 2.3% for Bengali
and 3.2% for Switchboard.
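
The word-selection step can be illustrated with a minimal sketch. It rests on one assumption that the text above does not spell out: that "selecting words according to the development-set word accuracy" means keeping the most confident words until the kept fraction matches that accuracy. The names `Word`, `confidence_threshold` and `frame_mask` are hypothetical and are not taken from the paper or from any toolkit.

```python
from typing import List, Tuple

# Hypothetical record for one automatically transcribed word:
# (word label, per-word confidence in [0, 1], number of frames it spans).
Word = Tuple[str, float, int]


def confidence_threshold(confidences: List[float], dev_word_accuracy: float) -> float:
    """Pick a threshold so that roughly `dev_word_accuracy` of the words are kept.

    Assumption: the fraction of selected words is matched to the word
    accuracy measured on the development set.
    """
    if not confidences:
        raise ValueError("no confidences given")
    if not 0.0 <= dev_word_accuracy <= 1.0:
        raise ValueError("dev_word_accuracy must lie in [0, 1]")
    ranked = sorted(confidences, reverse=True)
    n_keep = int(round(dev_word_accuracy * len(ranked)))
    n_keep = max(1, min(len(ranked), n_keep))  # always keep at least one word
    return ranked[n_keep - 1]


def frame_mask(words: List[Word], threshold: float) -> List[int]:
    """Expand per-word decisions to a per-frame 0/1 mask (data selection by mask)."""
    mask: List[int] = []
    for _, conf, n_frames in words:
        keep = 1 if conf >= threshold else 0
        mask.extend([keep] * n_frames)
    return mask


# Toy utterance with made-up confidences and durations.
utterance = [("a", 0.9, 2), ("b", 0.4, 1), ("c", 0.7, 3), ("d", 0.2, 2)]
thr = confidence_threshold([c for _, c, _ in utterance], dev_word_accuracy=0.5)
mask = frame_mask(utterance, thr)
```

Frames with mask value 0 would be excluded from the frame-level training targets. Words that tie with the threshold are all kept, so the selected fraction can slightly exceed the target.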