Anticipatory coarticulation provides a basis for the observed asynchrony between the acoustic and visual onsets of phones in certain linguistic contexts, yet it is typically not modeled explicitly in audio-visual (AV) speech models. We study within-word AV asynchrony using hand-labeled words in which theory suggests that asynchrony should occur, and show that these labels confirm the theory. We introduce a new statistical model of AV speech, the asynchrony-dependent transition (ADT) model, which allows asynchrony between AV states within word boundaries and in which the state transitions depend on the instantaneous asynchrony as well as on the modality's state. This model outperforms a baseline synchronous model at reproducing the hand labels in a forced alignment task, and its behavior as parameters are changed conforms to our expectations about anticipatory coarticulation. The same model could be used for automatic speech recognition (ASR), although here we consider it only for the task of forced alignment for linguistic analysis.