Low-dimensional ‘bottleneck’ features extracted from neural
networks have been shown to yield phoneme recognition accuracy
comparable to that obtained with higher-dimensional MFCCs when
used with GMM-HMM models.
Such features have also been shown to satisfy well the assumptions
about speech trajectory dynamics made by dynamic models of speech
such as Continuous-State HMMs. However, little is understood about
how the networks derive these features, or whether and how the
features can be interpreted in terms of human speech perception
and production.
We analyse three-dimensional bottleneck features and show that,
for vowels, their spatial representation closely matches the
familiar F1:F2 vowel quadrilateral.
For other phoneme classes, the features can similarly be related
to phonetic and acoustic spatial representations reported in the literature.
This suggests that these networks derive representations specific to
particular phonetic categories, with properties similar to those
exploited by human perception. The representation of the full set of
phonemes in the bottleneck space is consistent with a hypothesized
comprehensive model of speech perception, as well as with perceptual
accounts such as prototype theory.