In a previous study, we proposed an alternative masking criterion for binary mask estimation based on the underlying linguistic information. We estimated this mask by selecting, at each frame, from a set of candidate masks using the hypotheses of an automatic speech recognition (ASR) system. That system reduced word error rate (WER) by 8%. In this work, we present an improved method for selecting the correct candidate mask at each frame, which increases the WER reduction to 14%. The new method uses a discriminative sequence model and provides a framework that can incorporate other mask estimates as features.
Index Terms: speech recognition, binary mask estimation