ISCA Archive Interspeech 2006

A computational auditory scene analysis system for robust speech recognition

Soundararajan Srinivasan, Yang Shao, Zhaozhang Jin, DeLiang Wang

We present a computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise. We estimate, in two stages, the ideal binary time-frequency (T-F) mask, which retains the mixture in a local T-F unit if and only if the target is stronger than the interference within the unit. In the first stage, we use harmonicity to segregate the voiced portions of individual sources in each time frame based on multipitch tracking. Additionally, unvoiced portions are segmented based on an onset/offset analysis. In the second stage, speaker characteristics are used to group the T-F units across time frames. The resulting T-F masks are used in conjunction with missing-data methods for recognition. Systematic evaluations on a speech separation challenge task show significant improvement over the baseline performance.
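The ideal binary mask described above can be sketched directly from its definition: a T-F unit is kept when the target's local energy exceeds the interference's. The following is a minimal NumPy illustration, not the paper's implementation; the function name, the 0 dB local criterion, and the toy power values are assumptions for demonstration.

```python
import numpy as np

def ideal_binary_mask(target_power, interference_power, lc_db=0.0):
    """Ideal binary T-F mask: 1 where the target's local SNR exceeds
    the local criterion (0 dB here), 0 elsewhere.

    target_power, interference_power: arrays of local energies with
    shape (frequency_channels, time_frames).
    """
    # Local SNR in dB; guard against division by zero.
    snr_db = 10.0 * np.log10(target_power / np.maximum(interference_power, 1e-12))
    return (snr_db > lc_db).astype(float)

# Toy example: 2 frequency channels x 3 time frames of local power.
target = np.array([[4.0, 1.0, 9.0],
                   [0.5, 2.0, 0.1]])
noise = np.array([[1.0, 2.0, 3.0],
                  [1.0, 1.0, 1.0]])
mask = ideal_binary_mask(target, noise)
# Units where target power exceeds noise power are retained (1.0).
```

Applying this mask to a mixture spectrogram (elementwise multiplication) retains only the target-dominant units, which is the representation the missing-data recognizer then consumes.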


doi: 10.21437/Interspeech.2006-19

Cite as: Srinivasan, S., Shao, Y., Jin, Z., Wang, D. (2006) A computational auditory scene analysis system for robust speech recognition. Proc. Interspeech 2006, paper 1547-Mon1WeS.1, doi: 10.21437/Interspeech.2006-19

@inproceedings{srinivasan06_interspeech,
  author={Soundararajan Srinivasan and Yang Shao and Zhaozhang Jin and DeLiang Wang},
  title={{A computational auditory scene analysis system for robust speech recognition}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1547-Mon1WeS.1},
  doi={10.21437/Interspeech.2006-19},
  issn={2958-1796}
}