Sign language translation without transcription has only recently started
to gain attention. In our work, we aim to improve on state-of-the-art
translation by introducing a multi-feature fusion architecture with
enhanced input features. As sign language is challenging to segment,
we obtain the input features by extracting overlapping scaled segments
across the video and computing their 3D CNN representations. The fusion
architecture exploits the attention mechanism in two stages: it first learns
dependencies between different frames of the same video, and then fuses
the resulting representations to learn the relations between different
features from the same video. In addition to 3D CNN features, we also
analyze pose-based features.
Our methodology outperforms the state-of-the-art sign language
translation model by achieving higher BLEU-3 and BLEU-4 scores,
and also outperforms state-of-the-art sequence attention models
by achieving a 43.54% increase in BLEU-4 score. We conclude that the
combined effects of feature scaling and feature fusion make our model
more robust in predicting longer n-grams, which are crucial in continuous
sign language translation.
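
As an illustration of the overlapping scaled segments mentioned above, the following minimal Python sketch enumerates frame windows at several temporal scales. It is a sketch under stated assumptions, not the implementation used in this work: the function names, the window sizes (8, 12 and 16 frames) and the 50% overlap are hypothetical choices made only for the example. Each returned range would then be passed to a 3D CNN to obtain one feature vector per segment.

```python
from typing import Dict, List, Sequence, Tuple


def overlapping_segments(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges of length `window`, stepping by `stride`."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if num_frames < window:
        # Video shorter than one window: fall back to a single segment over all frames.
        return [(0, num_frames)] if num_frames > 0 else []
    starts = list(range(0, num_frames - window + 1, stride))
    # Add a final window flush with the end so trailing frames are not dropped.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


def multi_scale_segments(
    num_frames: int,
    windows: Sequence[int] = (8, 12, 16),  # hypothetical scales, in frames
    overlap: float = 0.5,                  # hypothetical overlap fraction
) -> Dict[int, List[Tuple[int, int]]]:
    """Map each window size (scale) to its list of overlapping segments."""
    segments = {}
    for w in windows:
        stride = max(1, int(round(w * (1.0 - overlap))))
        segments[w] = overlapping_segments(num_frames, w, stride)
    return segments
```

For example, `multi_scale_segments(20)` yields one list of frame ranges per scale; longer windows produce fewer, coarser segments, while shorter windows produce more, finer ones.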