This paper focuses on error detection in Automatic Speech Recognition
(ASR) outputs. We propose a neural network architecture that is
well suited to handling continuous word representations such as word embeddings.
In a previous study, we explored the use of linguistic word
embeddings, and in particular their combination. In this new study,
we explore the use of acoustic word embeddings. Acoustic word embeddings
provide an a priori acoustic representation of words
that can be compared, in terms of similarity, with an embedded representation
of the audio signal.
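To make the similarity comparison concrete, the following minimal sketch scores an a priori acoustic word embedding against an embedding of an audio segment. The abstract does not name the similarity measure, so cosine similarity is an assumption here, and the vectors and the names `word_embedding` and `signal_embedding` are hypothetical values chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 4-dimensional embeddings (made up for illustration only).
word_embedding = [0.2, 0.7, -0.1, 0.4]    # a priori acoustic embedding of a hypothesized word
signal_embedding = [0.25, 0.6, 0.0, 0.5]  # embedding of the corresponding audio segment

# A score close to 1 means the two representations point in a similar direction.
score = cosine_similarity(word_embedding, signal_embedding)
```

Under this assumption, a low score between a recognized word and the audio segment it was decoded from could serve as one cue that the word is a recognition error.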
First, we propose an approach
to evaluate the intrinsic performance of acoustic word embeddings,
in comparison with orthographic representations, at capturing discriminative
phonetic information. Since the experiments target French,
particular attention is paid to homophones. Then, acoustic
word embeddings are evaluated for ASR error detection. The proposed
approach achieves a classification error rate (CER) of 7.94%, whereas the previous
state-of-the-art CRF-based approach achieves a CER of 8.56%, on the outputs
of the ASR system which won the ETAPE evaluation campaign on speech
recognition of French broadcast news.
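
As a reading aid for the reported figures, the sketch below computes a classification error rate and the relative reduction between the two systems. It assumes that the CER is the fraction of recognized words whose predicted label (correct or error) differs from the reference label; the function names and the toy label sequences are hypothetical, and only the values 7.94 and 8.56 come from the text above.

```python
def classification_error_rate(predicted, reference):
    """Fraction of words whose predicted label differs from the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("label sequences must not be empty")
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)


def relative_reduction(baseline, new):
    """Relative decrease of `new` with respect to `baseline`."""
    return (baseline - new) / baseline


# Toy example (made-up labels): one mislabeled word out of five.
reference = ["correct", "correct", "error", "correct", "error"]
predicted = ["correct", "error", "error", "correct", "error"]
toy_cer = classification_error_rate(predicted, reference)

# CER values reported above, in percent: CRF baseline vs. proposed approach.
reduction = relative_reduction(8.56, 7.94)
```

With the reported values, the absolute gain is 0.62 points, which corresponds to a relative CER reduction of roughly 7.2% over the CRF-based approach.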