Many speech segmentation techniques have been proposed to automate phonetic alignment. Most of these techniques, however, require labeled training data and perform well only for read, high-quality speech. Automatic phonetic alignment of lower-quality, varied data with no labeled training data, the subject of this paper, is a much more challenging problem. In this study, an HMM-based automatic speech recognizer was used to determine phonetic sequences and boundaries for "open source" speech data retrieved from public websites. The HMM models were initially trained on the TIMIT database and subsequently adapted to each passage. Standard front-end features such as MFCC, LPCC, and PLP, as well as features computed by applying the DCT directly to the short-time spectrum (DCTC), were evaluated on TIMIT data. The best-performing parameter set was found to be DCTC_78, and these parameters were then used to align the speech data of interest.
Index Terms: speech segmentation, phonetic alignment, speech recognition
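For readers who want a concrete sense of the DCTC-style features compared above, the following minimal Python sketch applies the DCT directly to a short-time log-magnitude spectrum. The frame length, FFT size, and number of retained coefficients are illustrative assumptions only and do not reproduce the exact DCTC_78 configuration evaluated in the paper.

```python
import numpy as np
from scipy.fftpack import dct

def dctc_features(frame, n_fft=512, n_coeffs=13):
    """DCT coefficients of the short-time log-magnitude spectrum.

    A simplified stand-in for the DCTC features described in the paper;
    parameter values here are assumptions, not the DCTC_78 setup.
    """
    windowed = frame * np.hamming(len(frame))          # taper the analysis frame
    spectrum = np.abs(np.fft.rfft(windowed, n=n_fft))  # short-time magnitude spectrum
    log_spectrum = np.log(spectrum + 1e-10)            # log compression
    # Apply the DCT directly to the log spectrum and keep the lowest-order terms
    return dct(log_spectrum, type=2, norm='ortho')[:n_coeffs]

# Example: one 25 ms frame (400 samples at 16 kHz) of a synthetic signal
frame = np.random.randn(400)
print(dctc_features(frame))
```

In contrast to MFCCs, which apply the DCT to mel filterbank log energies, this sketch omits the filterbank stage and transforms the raw log spectrum, which is the distinguishing idea behind the DCTC features.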