Joint activities (e.g. building a LEGO model) unfold as a hierarchy of subprojects. Navigating this hierarchy involves elaborating horizontally on the current subproject (e.g. placing one block) and moving vertically to a new subproject (e.g. the next block). Interactants coordinate horizontal and vertical transitions with project markers (e.g. okay, yeah). We suggest that vertical and horizontal transitions are distinguished both lexically and acoustically. We predicted that identical markers used for different transitions (okay-vertical vs. okay-horizontal) would be acoustically more dissimilar than identical markers used for the same transition (okay-vertical vs. okay-vertical). We measured dissimilarity between vocalisations using MFCC-based dynamic time warping and analysed the resulting dissimilarity scores with a Bayesian regression model. We find that Vietnamese speakers use both lexical and acoustic cues to mark transitions, and that pairs of same-horizontal markers are acoustically more similar than both same-vertical pairs and different-transition pairs.
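
As an illustration of the dissimilarity measure named above, the following is a minimal sketch of dynamic time warping over two sequences of feature frames. It is not the authors' implementation: the function name `dtw_distance`, the Euclidean frame distance, the normalisation by the summed sequence lengths, and the random stand-in data (used in place of real MFCCs, which would normally come from an audio feature extraction library) are all assumptions made for this example.

```python
import math
import random


def dtw_distance(a, b):
    """Length-normalised DTW cost between two sequences of feature frames.

    Each sequence is a list of equal-length coefficient vectors (one per frame).
    Hypothetical helper for illustration; the frame distance and the
    normalisation are assumptions, not details taken from the study.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences need at least one frame")
    # acc[i][j] = cheapest cost of aligning the first i frames of a
    # with the first j frames of b.
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            acc[i][j] = cost + min(
                acc[i - 1][j],      # a advances alone
                acc[i][j - 1],      # b advances alone
                acc[i - 1][j - 1],  # both advance
            )
    return acc[n][m] / (n + m)


# Stand-in data: random vectors in place of real MFCC frames (13 coefficients each).
rng = random.Random(0)
okay_1 = [[rng.gauss(0, 1) for _ in range(13)] for _ in range(40)]
okay_2 = [[rng.gauss(0, 1) for _ in range(13)] for _ in range(55)]

print(dtw_distance(okay_1, okay_1))  # identical sequences align at zero cost
print(dtw_distance(okay_1, okay_2))  # different sequences give a positive cost
```

Because the alignment may repeat frames, a token that is merely slower or faster than another still aligns at low cost, which is why a warping-based measure suits markers of varying duration. Pairwise scores of this kind, computed for same-transition and different-transition pairs, would then be the input to a regression analysis.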