ISCA Archive Interspeech 2014

Utilizing state-level distance vector representation for improved spoken term detection by text and spoken queries

Mitsuaki Makino, Naoki Yamamoto, Atsuhiko Kai

In spoken term detection (STD) systems, approximate subword-level matching between a query term and automatically transcribed spoken documents is often employed for its reasonable accuracy and efficiency. However, a high out-of-vocabulary (OOV) rate often degrades subword-level recognition accuracy and thereby affects STD performance. This paper describes the use of new expanded acoustic representations of subword sequences for improved scoring between an OOV query term and a subword-unit transcription. Each subword is expanded into the states of its corresponding HMM, and each state is represented by a new acoustic structural feature, a distribution-distance vector (DDV). The proposed DDV representation and scoring are easily combined with two typical baseline STD approaches: DTW-based approximate matching with a subword-level acoustic dissimilarity measure, and lattice-based confidence scoring of subword n-grams. The experimental results showed that the proposed DDV-based scoring method significantly outperforms the simple DTW-scoring baseline with very little increase in the required search time. Combining DDV-based scoring with confidence-based scoring showed a complementary effect and attained the best STD performance compared with the NTCIR-10 SpokenDoc2 (SDPWS) submitted results when only the NTCIR reference automatic transcript is used. A preliminary experiment with spoken query terms also showed a significant improvement for OOV queries.
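
The abstract does not define the DDV or the matching procedure in detail, so the following is only a minimal illustrative sketch under stated assumptions, not the authors' implementation. It assumes that each HMM state is a single diagonal-covariance Gaussian, that a state's DDV is the vector of Bhattacharyya distances from that state to every state in the inventory, and that query and document state sequences are compared by plain DTW with a Euclidean local cost between DDVs. The names `bhattacharyya_diag`, `build_ddvs`, and `dtw_cost`, as well as the toy four-state inventory, are hypothetical.

```python
import numpy as np


def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    var = 0.5 * (var1 + var2)
    term_mean = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
    term_cov = 0.5 * np.sum(np.log(var / np.sqrt(var1 * var2)))
    return term_mean + term_cov


def build_ddvs(means, variances):
    """Return a matrix whose row i holds the distances from state i to all states.

    Assumption: this row is used as the distribution-distance vector of state i.
    """
    n = len(means)
    ddv = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ddv[i, j] = bhattacharyya_diag(
                means[i], variances[i], means[j], variances[j]
            )
    return ddv


def dtw_cost(query, doc):
    """Accumulated DTW cost between two sequences of vectors.

    Assumption: Euclidean distance between DDVs is the local cost.
    """
    n, m = len(query), len(doc)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(query[i - 1] - doc[j - 1])
            acc[i, j] = local + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]


# Toy inventory: 4 states, each a 2-dimensional Gaussian with unit variance.
rng = np.random.default_rng(0)
means = rng.normal(size=(4, 2))
variances = np.full((4, 2), 1.0)
ddv = build_ddvs(means, variances)

# State sequences are mapped to DDV sequences by row lookup.
query = ddv[[0, 1, 2]]
same = ddv[[0, 1, 1, 2]]   # same states, one state held longer
other = ddv[[0, 3, 2]]     # middle state substituted

print(dtw_cost(query, same), dtw_cost(query, other))
```

In this sketch a document segment that repeats a query state aligns at zero cost, while a substituted state incurs a cost that grows with how differently the two states relate to the rest of the inventory. A real STD system would search for the query inside a longer transcription rather than align two complete sequences.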


doi: 10.21437/Interspeech.2014-396

Cite as: Makino, M., Yamamoto, N., Kai, A. (2014) Utilizing state-level distance vector representation for improved spoken term detection by text and spoken queries. Proc. Interspeech 2014, 1732-1736, doi: 10.21437/Interspeech.2014-396

@inproceedings{makino14_interspeech,
  author={Mitsuaki Makino and Naoki Yamamoto and Atsuhiko Kai},
  title={{Utilizing state-level distance vector representation for improved spoken term detection by text and spoken queries}},
  year=2014,
  booktitle={Proc. Interspeech 2014},
  pages={1732--1736},
  doi={10.21437/Interspeech.2014-396},
  issn={2308-457X}
}