While functional neuroimaging studies demonstrate that multiple cortical regions play a key role in the audio-visual integration of speech, it remains unexplored whether cross-modal speech interactions depend only on the well-known auditory and visuo-facial modalities or might also be triggered by other sensory sources. The present functional magnetic resonance imaging (fMRI) study examined the neural substrates of cross-modal binding during audio-visual speech perception in response to seeing either the facial/lip movements or the tongue movements (intra-oral tongue movements acquired by ultrasound imaging) of a speaker. To this end, participants were exposed to auditory and/or visual speech stimuli in five conditions: an auditory-only condition, and two visual-only and two audio-visual conditions showing either the facial/lip or the tongue movements of a speaker. Common overlapping activity across conditions was mainly observed in the posterior part of the superior temporal gyrus/sulcus, extending ventrally to the posterior middle temporal gyrus and dorsally to the parietal operculum, the supramarginal and angular gyri, as well as in the premotor cortex and the inferior frontal gyrus. In addition, sub-additive neural responses were observed in the left posterior superior temporal gyrus/sulcus during audio-visual perception of both facial and tongue speech movements compared to unimodal auditory and visual speech perception. Altogether, these results suggest that the left posterior superior temporal gyrus/sulcus is involved in the multisensory processing of auditory speech signals and their accompanying facial/lip and tongue speech movements, and that multisensory speech perception is partly driven by the listener's knowledge of speech production.
Index Terms: audio-visual speech perception, ultrasound, fMRI