Many end-to-end Automatic Speech Recognition (ASR) systems still rely
on pre-processed frequency-domain features that are handcrafted to
emulate human hearing. Our work is motivated by recent advances
in integrated learnable feature extraction. We propose Lightweight
Sinc-Convolutions (LSC), which combine Sinc-convolutions with depthwise
convolutions into a low-parameter, machine-learnable feature extractor
for end-to-end ASR systems.
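
As a rough illustration of why a Sinc-convolution needs few parameters, the
sketch below builds one band-pass kernel from only two cutoff frequencies. It
assumes the common windowed-sinc parametrization (the difference of two
low-pass sinc filters, tapered by a Hamming window). The function name, kernel
size, and sampling rate are illustrative assumptions, not values taken from
this work.

```python
import numpy as np

def sinc_bandpass_kernel(f_low, f_high, kernel_size=101, sample_rate=16000):
    """Windowed-sinc band-pass FIR kernel defined by two cutoffs in Hz.

    Hypothetical helper for illustration; all defaults are assumptions.
    """
    # Symmetric sample index centred on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Normalise the cutoffs to cycles per sample.
    f1 = f_low / sample_rate
    f2 = f_high / sample_rate
    # Band-pass = high-cutoff low-pass minus low-cutoff low-pass.
    # Note: np.sinc(x) is the normalised sinc, sin(pi * x) / (pi * x).
    kernel = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # Taper with a Hamming window to reduce ripple from truncation.
    return kernel * np.hamming(kernel_size)

# One 101-tap filter, fully determined by two numbers.
kernel = sinc_bandpass_kernel(300.0, 3000.0)
```

In this sketch a 101-tap filter is determined by two values rather than 101
free weights, which is the source of the parameter savings; in a learnable
setting the two cutoffs would be the trained quantities.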
We integrated LSC into the hybrid CTC/attention architecture for
evaluation. The resulting
end-to-end model shows smooth convergence behaviour that is further
improved by applying SpecAugment in the time domain. We also discuss
filter-level improvements, such as using log-compression as the activation
function. Our model achieves a word error rate of 10.7% on the TEDlium
v2 test set, outperforming the corresponding architecture with log-mel
filterbank features by 1.9% absolute while having only 21% of its model
size.
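
The log-compression activation mentioned above can be sketched as a pointwise
function applied to filter outputs. The exact form is an assumption here
(log(|x| + 1), which maps zero to zero and is symmetric in sign); the function
name and sample values are illustrative only.

```python
import numpy as np

def log_compress(x):
    """Pointwise log-compression; assumed form: log(|x| + 1)."""
    # log1p(a) computes log(1 + a) accurately for small a.
    return np.log1p(np.abs(x))

# Made-up filter outputs spanning a wide dynamic range.
filter_outputs = np.array([-10.0, -0.5, 0.0, 0.5, 10.0])
compressed = log_compress(filter_outputs)
```

Large magnitudes are compressed far more than small ones, loosely mirroring
the logarithm applied in log-mel filterbank features.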