Our long-term goal is to design a system for Czech visual speech synthesis, that is, an animated synthetic face (often called a talking head) that imitates a human speaker pronouncing speech. In this paper we present the techniques used to acquire data and build an audio-visual speech corpus, focusing on its visual part. This process involves recording stereoscopic video data and solving related problems such as synchronization. In addition, we present a simple method for exploiting such a corpus, which uses stereo vision principles and models the shape of the lips with a simple triangular mesh.
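To make the stereo vision idea concrete, the following minimal sketch shows how a lip point seen in both camera images could be lifted to 3D and how a few such points could be joined into a triangular mesh. It is an illustration under stated assumptions, not the method used for the corpus: it assumes an idealized, rectified camera pair with a shared focal length `f` (in pixels), principal point `(cx, cy)` and baseline `baseline` (in metres). The function names and all numeric values are hypothetical.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def triangulate_rectified(xl: float, yl: float, xr: float,
                          f: float, baseline: float,
                          cx: float, cy: float) -> Point3D:
    """Recover a 3D point from a rectified stereo correspondence.

    Assumes (hypothetically) an ideal rectified pair: the same point
    appears at (xl, yl) in the left image and (xr, yl) in the right one.
    """
    disparity = xl - xr
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = f * baseline / disparity   # depth from similar triangles
    x = (xl - cx) * z / f          # back-project through the left camera
    y = (yl - cy) * z / f
    return (x, y, z)


def triangle_area(a: Point3D, b: Point3D, c: Point3D) -> float:
    """Area of a 3D triangle: half the norm of the edge cross product."""
    ux, uy, uz = b[0] - a[0], b[1] - a[1], b[2] - a[2]
    vx, vy, vz = c[0] - a[0], c[1] - a[1], c[2] - a[2]
    nx = uy * vz - uz * vy
    ny = uz * vx - ux * vz
    nz = ux * vy - uy * vx
    return 0.5 * math.sqrt(nx * nx + ny * ny + nz * nz)


def mesh_area(points: List[Point3D],
              triangles: List[Tuple[int, int, int]]) -> float:
    """Total surface area of a mesh given as vertex indices per triangle."""
    return sum(triangle_area(points[i], points[j], points[k])
               for i, j, k in triangles)


# Hypothetical pixel measurements (xl, yl, xr) for four lip points:
# left corner, upper midpoint, right corner, lower midpoint.
measurements = [
    (380.0, 240.0, 340.0),
    (420.0, 230.0, 380.0),
    (460.0, 240.0, 420.0),
    (420.0, 252.0, 380.0),
]
lip_points = [
    triangulate_rectified(xl, yl, xr, f=800.0, baseline=0.1, cx=320.0, cy=240.0)
    for xl, yl, xr in measurements
]
# Two triangles sharing the vertical edge between upper and lower midpoints.
lip_triangles = [(0, 1, 3), (1, 2, 3)]
lip_area = mesh_area(lip_points, lip_triangles)
```

A mesh stored as a vertex list plus index triples keeps the topology fixed across video frames, so only the vertex coordinates need to be updated from frame to frame. Real recordings would additionally require camera calibration and rectification, which this sketch leaves out.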