We advance a computational model of vowel diphthongisation that situates phonological representations in dynamic neural fields (DNFs), which represent the time-varying activation of neural populations sensitive to a given phonetic parameter range. We model all long vowels as two separate inputs to the DNF, with input timing governed by a coupled oscillator model that generates an anti-phase relationship between the inputs. The location of the time-varying maximum activation in the DNF forms a noisy dynamic target, which serves as input to a task dynamic model of gestural coordination. We find that the model captures the spatial characteristics of long vowels well, exhibiting gradient variation between monophthongs and diphthongs. We also show that a simplified model of production and perception can simulate changes in a speaker's phonological planning representations, which could constitute a mechanism behind sound change if such changes are transmitted across a speech community.
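The architecture summarised above can be sketched in simulation: two oscillators held in anti-phase each gate a Gaussian input to a one-dimensional neural field, and the location of the field's activation maximum traces a dynamic target that moves between the two vowel targets. The sketch below is purely illustrative; the Amari-style field equation, Kuramoto-style anti-phase coupling, and all parameter values are our assumptions for exposition, not the implementation reported in the paper.

```python
import numpy as np

# --- 1-D dynamic neural field (Amari-style); all values are illustrative ---
N = 101
x = np.linspace(0.0, 1.0, N)      # phonetic parameter axis (e.g. normalised F1)
dt, tau, h = 0.01, 0.1, -2.0      # time step, field time constant, resting level

def gauss(centre, width=0.08, amp=6.0):
    """Gaussian input bump centred on a vowel target."""
    return amp * np.exp(-0.5 * ((x - centre) / width) ** 2)

def sigmoid(u, beta=4.0):
    """Firing-rate nonlinearity."""
    return 1.0 / (1.0 + np.exp(-beta * u))

# Lateral interaction kernel: local excitation, global inhibition.
dxx = x[:, None] - x[None, :]
W = 5.0 * np.exp(-0.5 * (dxx / 0.05) ** 2) - 1.0

# --- coupled oscillators, initialised and maintained in anti-phase ---
theta = np.array([0.0, np.pi])    # relative phase pi = anti-phase
omega, K = 2.0 * np.pi, 0.5       # natural frequency, coupling strength

u = np.full(N, h)                 # field activation
targets = (0.3, 0.7)              # hypothetical first/second vowel targets
peaks = []                        # time-varying location of maximum activation
for step in range(200):
    # Kuramoto-style coupling with target relative phase pi (anti-phase).
    d0 = omega + K * np.sin(theta[1] - theta[0] - np.pi)
    d1 = omega + K * np.sin(theta[0] - theta[1] - np.pi)
    theta += dt * np.array([d0, d1])
    # Each oscillator gates its vowel input during its active half-cycle,
    # so the two inputs alternate rather than overlap.
    S = sum(gauss(c) * max(np.cos(th), 0.0)
            for c, th in zip(targets, theta))
    f = sigmoid(u)
    u += dt / tau * (-u + h + S + W @ f / N)
    peaks.append(float(x[np.argmax(u)]))  # the noisy dynamic target trajectory
```

In this toy run the peak location alternates between the neighbourhoods of the two targets as the oscillators trade activity, giving the gradient monophthong-to-diphthong behaviour a single spatial trajectory to feed into a downstream task dynamic model.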