This paper describes a technique for synthesizing auditory speech and lip motion from arbitrary input text. The technique extends a visual speech synthesis method based on the algorithm for parameter generation from HMMs with dynamic features. The audio and visual features of each speech unit are modeled by a single HMM. Since both audio and visual parameters are generated simultaneously in a unified framework, auditory speech with synchronized lip movements can be produced automatically. We train both syllable and triphone models as speech synthesis units and compare their performance in text-to-audio-visual speech synthesis. Experimental results show that audio-visual speech generated with triphone models achieves higher performance than that generated with syllable models.
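For reference, the parameter-generation step with dynamic features reduces, per feature dimension and under diagonal covariance, to a weighted least-squares problem: with the window matrix W relating the static trajectory c to the stacked static-plus-delta observations via o = Wc, the ML solution satisfies (W^T U^{-1} W) c = W^T U^{-1} mu. The following is a minimal sketch of that step, assuming static + delta features only and an illustrative delta window; the function name and window coefficients are not from this paper.

```python
import numpy as np

def generate_trajectory(mu, var, delta_win=(-0.5, 0.0, 0.5)):
    """Minimal sketch of ML parameter generation with static + delta features.

    mu, var: arrays of shape (T, 2) holding per-frame HMM means and
    variances for one static feature and its delta.
    Returns the smooth static trajectory c (length T) maximizing the
    output probability under the constraint o_t = [c_t, delta c_t].
    """
    T = mu.shape[0]
    # Build W so that o = W @ c, stacking [static, delta] rows per frame.
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                      # static window (identity)
        for k, w in enumerate(delta_win):      # delta window
            tau = t + k - 1
            if 0 <= tau < T:
                W[2 * t + 1, tau] = w
    mu_vec = mu.reshape(-1)
    prec = 1.0 / var.reshape(-1)               # diagonal precision U^{-1}
    # Solve (W^T U^{-1} W) c = W^T U^{-1} mu  (weighted least squares).
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu_vec)
    return np.linalg.solve(A, b)
```

In the audio-visual setting described above, the same solve is applied to the concatenated audio and visual feature streams generated from a single HMM state sequence, which is what keeps the lip-motion trajectory synchronized with the speech parameters.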