ISCA Archive Interspeech 2014

Word-level invariant representations from acoustic waveforms

Stephen Voinea, Chiyuan Zhang, Georgios Evangelopoulos, Lorenzo Rosasco, Tomaso Poggio

Extracting discriminant, transformation-invariant features from raw audio signals remains a serious challenge for speech recognition. The issue of speaker variability is central to this problem, as changes in accent, dialect, gender, and age alter the sound waveform of speech units at multiple levels (phonemes, words, or phrases). Approaches for dealing with this variability have typically focused on analyzing the spectral properties of speech at the level of frames, consistent with the frame-level acoustic modeling typically used in speech recognition systems. In this paper, we propose a framework for representing speech at the word level and extracting features from the acoustic, temporal domain, without the need for spectral encoding or preprocessing. Leveraging recent work on unsupervised learning of invariant sensory representations, we extract a signature for a word by first projecting its raw waveform onto a set of templates and their transformations, and then forming empirical estimates of the resulting one-dimensional distributions via histograms. The representation and relevant parameters are evaluated for word classification on a series of datasets with increasing speaker-mismatch difficulty, and the results are compared to those of an MFCC-based representation.
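The signature-extraction pipeline described above (project the raw waveform onto templates and their transformations, then estimate the resulting one-dimensional distributions with histograms) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the choice of time shifts as the transformation group, the unit-norm templates, and the histogram bin settings are all assumptions for the example.

```python
import numpy as np

def invariant_signature(waveform, templates, shifts, n_bins=32):
    """Illustrative sketch of a word-level invariant signature.

    For each template, the waveform is projected onto a set of
    transformed versions of that template (here: circular time shifts),
    and the empirical distribution of the resulting 1-D projections is
    estimated with a histogram. The per-template histograms are
    concatenated into the final signature. All parameter choices are
    hypothetical.
    """
    signature = []
    for t in templates:
        # Projections of the waveform onto the transformed template
        proj = [np.dot(waveform, np.roll(t, s)) for s in shifts]
        # Empirical distribution estimate via a normalized histogram;
        # with unit-norm inputs, projections lie in [-1, 1]
        hist, _ = np.histogram(proj, bins=n_bins, range=(-1.0, 1.0))
        signature.append(hist / len(proj))
    return np.concatenate(signature)
```

In this sketch the signature length is `len(templates) * n_bins`, and each per-template histogram sums to one, giving a pooled representation that is insensitive to which particular shift produced each projection value.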

doi: 10.21437/Interspeech.2014-518

Cite as: Voinea, S., Zhang, C., Evangelopoulos, G., Rosasco, L., Poggio, T. (2014) Word-level invariant representations from acoustic waveforms. Proc. Interspeech 2014, 2385-2389, doi: 10.21437/Interspeech.2014-518

@inproceedings{voinea14_interspeech,
  author={Stephen Voinea and Chiyuan Zhang and Georgios Evangelopoulos and Lorenzo Rosasco and Tomaso Poggio},
  title={{Word-level invariant representations from acoustic waveforms}},
  year=2014,
  booktitle={Proc. Interspeech 2014},
  pages={2385--2389},
  doi={10.21437/Interspeech.2014-518}
}