The automatic analysis of temporal patterns of vocal fold motion and the tracking of glottal cues, such as the fold edge position or the glottal area, have recently become topics of interest in the field of laryngeal video imaging. Here we discuss the use of a numerically simulated model of vocal fold motion within a video analysis framework, for the analysis of videokymographic data and the segmentation of glottal cues. The proposed algorithm exploits both visual and acoustic data related to the glottal excitation to estimate the model parameters. The trained model is then used to enhance the analysis and segmentation of the visual glottal cues, i.e., the fold edge displacement and the glottal area. Objective measures of the accuracy with which the model represents the visual glottal cues and the acoustic voice emission are reported. The method is illustrated and assessed on data from several subjects.
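To make the parameter estimation step concrete, the following is a minimal sketch of how model parameters could be fitted to an observed fold edge trajectory. It uses a damped oscillator as a hypothetical stand-in for the numerical fold model, and a synthetic displacement signal in place of a trajectory extracted from a kymogram; the sampling rate, signal, and parameterization are illustrative assumptions, not the method described in the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

# Hypothetical "observed" fold edge displacement; in practice this would
# come from segmenting the kymographic image, not from a formula.
fs = 4000.0                      # assumed kymographic line rate, Hz
t = np.arange(0.0, 0.05, 1.0 / fs)
observed = 0.8 * np.sin(2 * np.pi * 120.0 * t) * np.exp(-5.0 * t)

def fold_edge(params, t):
    """Simulate fold edge displacement with a damped oscillator,
    a simplified stand-in for the numerical fold motion model."""
    amp, f0, damping = params
    k = (2.0 * np.pi * f0) ** 2          # stiffness from natural frequency

    def rhs(_t, y):
        x, v = y
        return [v, -k * x - 2.0 * damping * v]

    sol = solve_ivp(rhs, (t[0], t[-1]), [amp, 0.0], t_eval=t)
    return sol.y[0]

def residuals(params):
    # Mismatch between simulated and observed visual glottal cue.
    return fold_edge(params, t) - observed

# Least-squares fit of the model parameters to the visual cue; acoustic
# residuals could be appended here to fit both modalities jointly.
fit = least_squares(residuals, x0=[0.5, 100.0, 1.0])
print("estimated amplitude, f0, damping:", fit.x)
```

A joint audio-visual fit, as the abstract suggests, would stack residuals from both the edge trajectory and an acoustic excitation signal into a single objective, so that one parameter set has to explain both observations.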