ISCA Archive Interspeech 2021

Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries

Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Bao-Cai Yin, Chin-Hui Lee

In this paper, we propose a novel deep learning architecture for improving word-level lip-reading. We first incorporate multi-scale processing into spatial feature extraction for lip-reading using hierarchical pyramidal convolution (HPConv) and self-attention. Specifically, HPConv is proposed to replace conventional convolution, improving the model’s ability to discover fine-grained lip movements. Next, to deal with fixed-length image sequences representing words in a given database, a self-attention mechanism is proposed to integrate local information across all lip frames without assuming known word boundaries, so that our deep models automatically utilize key features in the relevant frames of a given word. Experiments on the Lip Reading in the Wild corpus show that our proposed architecture achieves an accuracy of 86.83%, yielding a relative error rate reduction of about 10% over a state-of-the-art scheme that averages frame scores for information fusion. A detailed analysis of the experimental results also confirms that the weights learned by self-attention tend to be zero at both ends of an image sequence and concentrate non-zero values in the middle part, where the given word occurs.
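The contrast between averaging frame scores and attention-weighted fusion can be illustrated with a minimal sketch. This is not the authors' implementation: the single scoring vector `w`, the function names, the feature dimension, and the sequence length are all assumptions chosen for illustration, and the abstract does not specify how the attention weights are actually computed.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def average_pool(frame_feats):
    # Baseline fusion: every frame contributes equally.
    return frame_feats.mean(axis=0)

def attention_pool(frame_feats, w):
    # Hypothetical attention fusion: score each frame with a vector w
    # (assumed learned), normalize the scores to weights that sum to 1,
    # and return the weighted sum of the frame features.
    scores = frame_feats @ w            # shape (T,)
    weights = softmax(scores)           # shape (T,)
    pooled = weights @ frame_feats      # shape (D,)
    return pooled, weights

# Illustrative sizes only: T frames, D-dimensional features per frame.
T, D = 29, 8
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(T, D))
w = rng.normal(size=D)

pooled, weights = attention_pool(frame_feats, w)
uniform_pooled, uniform_weights = attention_pool(frame_feats, np.zeros(D))
```

With an all-zero `w`, every frame receives the same weight and the result reduces to plain averaging, so averaging is a special case of this form. Softmax weights can approach zero but never reach it exactly, so "zero at both ends" in such a scheme would mean negligibly small weights on the frames outside the word.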