In this paper, we first present a shape and appearance model for Audio-Visual Automatic Speech Recognition. The shape model consists of a template (the mean shape) and a set of deformation vectors that transform it into any possible shape. The global appearance model is a neural network trained to classify 5×5 colour image blocks as belonging to the skin, the lips, or the inside of the mouth. Both parts of this model were built automatically, without hand-labelling; the appearance model was built by exploiting the bimodality of speech (acoustic information). We then propose several measures for evaluating the quality of lip location. Finally, we show the classification results obtained with one hand-labelled and two automatically built appearance models of the lips.
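
To illustrate the shape model described above, the following minimal sketch generates a shape from a template and a set of deformation vectors. It assumes that the deformation vectors are combined linearly with scalar weights, which the abstract does not state. The function name `deform_shape`, the four-point lip contour, and the two deformation vectors (`OPEN`, `WIDEN`) are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import List, Sequence


def deform_shape(
    mean_shape: Sequence[float],
    deformation_vectors: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Return mean_shape + sum_k weights[k] * deformation_vectors[k].

    Shapes are flat coordinate lists [x1, y1, x2, y2, ...].
    The linear combination is an assumption made for this sketch.
    """
    if len(weights) != len(deformation_vectors):
        raise ValueError("one weight is required per deformation vector")
    shape = [float(c) for c in mean_shape]
    for weight, vector in zip(weights, deformation_vectors):
        if len(vector) != len(shape):
            raise ValueError("deformation vector and shape sizes differ")
        shape = [s + weight * v for s, v in zip(shape, vector)]
    return shape


# Hypothetical template: left corner, top, right corner, bottom of the lips.
MEAN = [-1.0, 0.0, 0.0, 0.5, 1.0, 0.0, 0.0, -0.5]
# Hypothetical deformation vectors: vertical opening and horizontal widening.
OPEN = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, -1.0]
WIDEN = [-1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]

opened_and_widened = deform_shape(MEAN, [OPEN, WIDEN], [0.5, 0.25])
```

With all weights set to zero the function returns the template itself; non-zero weights move the contour points along the corresponding deformation vectors.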