Simulation plays a crucial role in developing components of automatic speech recognition systems such as enhancement and diarization. In source separation and target-speaker extraction, datasets with high degrees of temporal overlap are used for both training and evaluation, even though people tend to avoid such overlap in real conversations. It is well known that the artifacts introduced by applying such pre-processing to speech with no overlap can be detrimental to recognition performance. This work proposes a finite-state generative method trained on timing information from speech corpora, which leads to two main contributions. First, a method for generating arbitrarily large datasets that follow the desired statistics of real conversations. Second, features extracted from the models are shown to correlate with speaker-extraction performance, which makes it possible to quantify how much of a mixture's difficulty is due to turn-taking, factoring out other complexities in the signal. Models that treat speakers as independent yield poor generation and representation results; we improve upon this by proposing models whose states are conditioned on whether another speaker is active.
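To make the cross-speaker conditioning concrete, below is a minimal sketch, not the paper's actual model: a discrete-time, two-speaker finite-state generator in which each speaker's start/stop probabilities depend on whether the other speaker is currently active. The frame-based formulation, function names, and all probability values are illustrative assumptions rather than parameters estimated from any corpus.

```python
import numpy as np

# Hypothetical transition probabilities, conditioned on the other
# speaker's activity. Values are placeholders, not corpus estimates.
# p_start[other_active]: P(silent -> speaking | other speaker's state)
# p_stop[other_active]:  P(speaking -> silent | other speaker's state)
P_START = {False: 0.10, True: 0.02}  # reluctant to start over someone
P_STOP = {False: 0.05, True: 0.20}   # quick to yield when overlapped


def sample_activity(n_frames: int, seed: int = 0) -> np.ndarray:
    """Sample joint speaking/silent activity for two speakers.

    Returns an (n_frames, 2) boolean array where entry [t, k] is True
    when speaker k is speaking at frame t.
    """
    rng = np.random.default_rng(seed)
    active = np.zeros((n_frames, 2), dtype=bool)
    state = [True, False]  # one speaker starts active, the other silent
    for t in range(n_frames):
        for k in range(2):
            other = state[1 - k]  # is the other speaker active?
            if state[k]:
                if rng.random() < P_STOP[other]:
                    state[k] = False
            else:
                if rng.random() < P_START[other]:
                    state[k] = True
        active[t] = state
    return active


activity = sample_activity(2000)
overlap_ratio = activity.all(axis=1).mean()  # fraction of overlapped frames
print(f"overlap ratio: {overlap_ratio:.3f}")
```

Dropping the dependence on `other` recovers the independent-speaker baseline the abstract criticizes; with the conditioned probabilities, overlaps become short and rare, qualitatively matching turn-taking in real conversations.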