A data-driven approach is presented for improving the performance of separating single-channel mixed speech signals, assuming unknown, arbitrary temporal dynamics. The new approach seeks and separates the longest mixed speech segments that can be accurately matched by composite training segments. Lengthening the mixed speech segments to be matched reduces the uncertainty of the matching constituent training segments, and hence the separation error. Experiments are conducted on the Wall Street Journal database, separating mixtures of large-vocabulary speech utterances. The results are evaluated using various objective and subjective measures, including the challenging task of large-vocabulary continuous speech recognition. It is shown that the new separation approach leads to significant improvement in all these measures.
Index Terms: Temporal dynamics, longest matching segment, speech separation, speech recognition