Most speech recognition systems use spectral features based on fixed
filters, such as MFCC and PLP. In this paper, we show that it is possible
to achieve state-of-the-art results by making the feature extractor
a part of the network and jointly optimizing it with the rest of the
network. The basic approach is to start with a convolutional layer
that operates directly on the signal (say, with a step size of 1.25
milliseconds), aggregate the filter outputs over a short span of the
time axis using a network-in-network architecture, and then down-sample
to one output every 10 milliseconds for use by the rest of the network.
We find that, unlike in some previous work on learned feature extractors,
the objective function converges as quickly as it does for a network
based on traditional features.
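As an illustration, the following PyTorch sketch shows one possible realization of this front-end, assuming 16 kHz input audio. The 1.25 millisecond step (20 samples) and the 10 millisecond output rate (down-sampling by a factor of 8) follow the description above; the filter counts, the 25 millisecond filter length, and the use of max-pooling for the down-sampling are illustrative assumptions, not settings from this work.

```python
import torch
import torch.nn as nn

class LearnedFrontEnd(nn.Module):
    """Learned feature extractor operating directly on the waveform.

    Layer sizes are illustrative; only the time resolutions (1.25 ms
    convolution step, 10 ms output rate) follow the text above.
    """
    def __init__(self, num_filters=100, hidden=256, out_dim=40):
        super().__init__()
        # Convolution on the raw signal: at 16 kHz, a stride of 20
        # samples gives one output every 1.25 ms (kernel = 25 ms).
        self.conv = nn.Conv1d(1, num_filters, kernel_size=400, stride=20)
        # Network-in-network stage: aggregates the filter outputs over
        # a short span of the time axis (5 steps here, an assumption).
        self.nin = nn.Sequential(
            nn.Conv1d(num_filters, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, out_dim, kernel_size=1),
            nn.ReLU(),
        )
        # Down-sample by 8: from one frame per 1.25 ms to one per 10 ms.
        self.pool = nn.MaxPool1d(kernel_size=8, stride=8)

    def forward(self, waveform):  # waveform: (batch, 1, num_samples)
        x = torch.relu(self.conv(waveform))
        x = self.nin(x)
        return self.pool(x)       # (batch, out_dim, frames at 10 ms)
```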
Because we found that iVector adaptation is less effective in this
framework, we also experiment with an adaptation method that is itself
part of the network, in which statistics of the activations over a
medium time span (around one second) are computed at intermediate
layers. We find that the resulting ‘direct-from-signal’ network is
competitive with our state-of-the-art networks based on conventional
features with iVector adaptation.
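The sketch below illustrates one way such within-network statistics can be computed, again in PyTorch: the mean and standard deviation of an intermediate layer's activations are pooled over a sliding window of about one second (100 frames at a 10 millisecond frame rate) and appended to each frame. The choice of moments, the window length, and the concatenation are our illustrative assumptions about how an adaptation layer of this kind might look, not the exact method of this work.

```python
import torch
import torch.nn as nn

class StatisticsAdaptation(nn.Module):
    """Appends medium-time-span activation statistics to each frame.

    Window length and the mean/standard-deviation statistics are
    illustrative assumptions.
    """
    def __init__(self, window_frames=100):  # ~1 s at a 10 ms frame rate
        super().__init__()
        self.window = window_frames

    def forward(self, acts):  # acts: (batch, dim, frames)
        pad = self.window // 2
        # Sliding means of the activations and of their squares.
        mean = nn.functional.avg_pool1d(
            acts, self.window, stride=1, padding=pad,
            count_include_pad=False)
        sq_mean = nn.functional.avg_pool1d(
            acts ** 2, self.window, stride=1, padding=pad,
            count_include_pad=False)
        # Standard deviation from the two moments; clamp for stability.
        std = (sq_mean - mean ** 2).clamp(min=1e-6).sqrt()
        n = acts.size(-1)
        # Append the local statistics to every frame's activations.
        return torch.cat([acts, mean[..., :n], std[..., :n]], dim=1)
```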