Significant advances in speech separation have been made by formulating it as a classification problem, where the desired output is the ideal binary mask (IBM). Previous work typically uses standard binary classifiers and does not explicitly model the correlation between neighboring time-frequency units. Since one of the most important characteristics of the speech signal is its temporal dynamics, the IBM exhibits highly structured, rather than random, patterns. In this study, we incorporate temporal dynamics into classification by employing structured output learning. In particular, we use linear-chain structured perceptrons to account for the interactions of neighboring labels in time. However, the performance of structured perceptrons largely depends on the linear separability of the features. To address this problem, we employ pretrained deep neural networks to automatically learn effective feature functions for structured perceptrons. Experiments show that the proposed system significantly outperforms previous IBM estimation systems.
Index Terms: Monaural speech separation, temporal dynamics, structured perceptron, deep neural networks
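To make the idea concrete, the sketch below illustrates a linear-chain structured perceptron over binary T-F unit labels: Viterbi decoding combines per-frame label scores with transition scores between neighboring labels in time, and the perceptron update rewards the true label sequence and penalizes the decoded one. This is a minimal illustration under assumed interfaces, not the paper's implementation; the feature matrix stands in for per-frame activations taken from a pretrained DNN, and names such as `viterbi_decode` and `perceptron_update` are hypothetical.

```python
import numpy as np

def viterbi_decode(unary, trans):
    """Return the binary label sequence maximizing unary + transition scores.

    unary: (T, 2) per-frame scores for labels {0, 1}
    trans: (2, 2) transition scores trans[prev_label, cur_label]
    """
    T = unary.shape[0]
    delta = np.zeros((T, 2))            # best score ending in each label
    backptr = np.zeros((T, 2), dtype=int)
    delta[0] = unary[0]
    for t in range(1, T):
        for cur in range(2):
            scores = delta[t - 1] + trans[:, cur] + unary[t, cur]
            backptr[t, cur] = np.argmax(scores)
            delta[t, cur] = np.max(scores)
    labels = np.zeros(T, dtype=int)
    labels[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):      # trace back the best path
        labels[t] = backptr[t + 1, labels[t + 1]]
    return labels

def perceptron_update(w, trans, feats, y_true, lr=1.0):
    """One structured-perceptron update on a single label sequence.

    w:      (D, 2) unary weights over DNN-derived features
    feats:  (T, D) per-frame features (assumed: DNN hidden activations)
    y_true: (T,)   ideal binary mask labels for one frequency channel
    """
    unary = feats @ w                   # (T, 2) per-frame label scores
    y_pred = viterbi_decode(unary, trans)
    if np.array_equal(y_pred, y_true):
        return
    for t in range(len(y_true)):
        # reward features under the true labels, penalize the predicted ones
        w[:, y_true[t]] += lr * feats[t]
        w[:, y_pred[t]] -= lr * feats[t]
        if t > 0:
            trans[y_true[t - 1], y_true[t]] += lr
            trans[y_pred[t - 1], y_pred[t]] -= lr

# Toy usage with random stand-ins for DNN features and IBM labels.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 64))
y_true = (rng.random(100) > 0.5).astype(int)
w = np.zeros((64, 2))
trans = np.zeros((2, 2))
perceptron_update(w, trans, feats, y_true)
```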