In the framework of a contract with the Basque Parliament for subtitling the videos of bilingual plenary sessions, we faced the task of aligning very long (around 3-hour) audio tracks with syntactically correct but acoustically inaccurate text transcriptions (all disfluencies, mistakes, etc. having been edited out). As a first approach, before trying more complex schemes found in the literature, a very simple and efficient procedure was developed that avoids the need for language or lexical models, which was key given the mix of languages. Since it worked well and the output was satisfactory for the intended application, that simple approach was finally chosen. In this paper, we describe the approach in detail and apply it to a widely known annotated dataset (specifically, the 1997 Hub4 task) to allow comparison with a reference approach. Results demonstrate that our approach provides only slightly worse segmentations at a much lower computational cost and with far fewer resources. Moreover, if the resource to be segmented includes speech in two or more languages and speakers switch between them at any time, applying a speech recognizer becomes unfeasible in practice, whereas our approach can still be applied at no additional cost.
Index Terms: speech-to-text alignment, automatic video subtitling, multimedia information retrieval, multilingual speech.