A computer-implemented neurofunctional model of speech production is introduced that is capable of articulating vowels, VC syllables, and CV syllables (C = voiced plosives; V = vowels). This paper shows that the production model can simulate basic effects of auditory and audio-visual speech perception, such as (i) categorical perception of consonants and vowels and (ii) the McGurk effect. These typical features of speech perception result directly from the topological ordering of stored speech items at a supra-modal neural level of the model, called the phonetic map. The phonetic map is a self-organizing neural map that is trained and structured during early phases of speech acquisition. The neurofunctional model introduced here illustrates the close relationship between speech production and speech perception.