ISCA Archive Interspeech 2023

Attention Gate Between Capsules in Fully Capsule-Network Speech Recognition

Kyungmin Lee, Hyeontaek Lim, Mun-Hwan Lee, Hong-Gee Kim

We present a novel capsule network-based speech recognition model that effectively utilizes the full context of past-time capsules. The input capsule sequences are used recurrently, with multi-head attention filtering out unnecessary contextual information: previous-time output vectors serve as keys and values, and current-time output vectors as queries. We apply this attention gate to sequential dynamic routing (SDR), an all-capsule speech recognition model. With two attention heads, the proposed method attained higher accuracy than the existing SDR on all test sets of the TIMIT and Wall Street Journal (WSJ) corpora while maintaining the same algorithmic delay. On the WSJ corpus, a relative word error rate (WER) reduction of 10.75% was achieved when the required delay was set to 525 ms. In addition, the model achieved a 1.76x reduction in delay while maintaining the WERs. The proposed method increases the number of parameters by only approximately 0.1%.
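The abstract describes the gating mechanism only at a high level. The following is a minimal PyTorch sketch of that idea under our own assumptions: the module names, tensor shapes, and the residual combination of attended context with the current capsules are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CapsuleAttentionGate(nn.Module):
    """Sketch of the attention gate between capsules: current-time capsule
    output vectors act as queries, previous-time output vectors act as keys
    and values, with two attention heads as reported in the paper."""

    def __init__(self, capsule_dim: int, num_heads: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(capsule_dim, num_heads, batch_first=True)

    def forward(self, current: torch.Tensor, previous: torch.Tensor) -> torch.Tensor:
        # current:  (batch, n_capsules, capsule_dim) -- outputs at time t (queries)
        # previous: (batch, n_capsules, capsule_dim) -- outputs at time t-1 (keys/values)
        context, _ = self.attn(query=current, key=previous, value=previous)
        # Combine the filtered past context with the current capsules before
        # the next routing step; the residual sum here is an assumption.
        return current + context


# Toy usage: 8 capsules with 16-dimensional pose vectors.
gate = CapsuleAttentionGate(capsule_dim=16, num_heads=2)
cur = torch.randn(1, 8, 16)
prev = torch.randn(1, 8, 16)
print(gate(cur, prev).shape)  # torch.Size([1, 8, 16])
```

Because the gate only adds one small multi-head attention module on top of the existing capsule dimensions, it is consistent with the reported parameter overhead of roughly 0.1%.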