This paper proposes a new multimodal framework for speech enhancement in noisy environments based on a model of the human auditory system. Unlike existing engineering architectures, each of which is designed for a specific speech sensor (extracted pitch, visual cues, etc.), the proposed model can integrate cues of different types into the enhancement system by introducing the notion of temporal coherence. Short-time coherence coefficients (STCC) between sound components and cues, computed through an estimate of mutual information, serve as a measure of target speech dominance and hence as the gain coefficients. Objective evaluation results for two exemplars in this framework show that the new methodology is effective in practice.
Index Terms: speech enhancement, multimodal, mutual information, auditory
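As a rough illustration of the STCC idea described above, the sketch below estimates short-time mutual information between subband envelopes of a noisy mixture and a time-aligned cue signal, then maps the resulting coherence scores to gain coefficients. All function names, window parameters, and the histogram-based MI estimator are illustrative assumptions; the abstract does not specify the paper's actual estimator or gain mapping.

```python
import numpy as np

def mi_estimate(x, y, bins=8):
    """Histogram-based mutual information estimate (in nats) for two 1-D segments.
    A simple plug-in estimator; the paper's actual MI estimator may differ."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal of x
    py = pxy.sum(axis=0, keepdims=True)  # marginal of y
    nz = pxy > 0                         # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px * py)[nz])))

def stcc_gains(envelopes, cue, win=400, hop=200, floor=0.1):
    """Short-time coherence between subband envelopes and a cue, mapped to gains.

    envelopes: (n_channels, n_samples) subband envelopes of the noisy mixture
    cue:       (n_samples,) cue signal time-aligned with the audio (e.g. a pitch track)
    Returns a (n_channels, n_frames) gain matrix in [floor, 1].
    """
    n_ch, n = envelopes.shape
    starts = np.arange(0, n - win + 1, hop)
    mi = np.zeros((n_ch, len(starts)))
    for c in range(n_ch):
        for f, s in enumerate(starts):
            # short-time MI between this channel's envelope and the cue
            mi[c, f] = mi_estimate(envelopes[c, s:s + win], cue[s:s + win])
    # Illustrative gain mapping: normalise per frame so the most cue-coherent
    # channel gets gain 1; a floor keeps weakly coherent channels audible.
    peak = mi.max(axis=0, keepdims=True)
    peak[peak == 0] = 1.0
    return np.clip(mi / peak, floor, 1.0)
```

In this sketch, channels whose envelopes fluctuate coherently with the cue over each short window receive gains near one, while incoherent (noise-dominated) channels are attenuated toward the floor, which matches the abstract's use of coherence as a measure of target speech dominance.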