The present research aims to build a Modern Standard Arabic (MSA)
audio-visual corpus. The corpus is annotated both phonetically and
visually and is dedicated to
emotional speech processing studies. Building the corpus consisted
of five main stages: speaker selection, sentence selection, recording,
annotation and evaluation. A total of 500 sentences were carefully
selected based on their phonemic distribution. The speaker was
instructed to read the same 500 sentences with six emotions (happiness,
sadness, fear, anger, inquiry and neutral). A sample of 50
sentences was selected for annotation. The corpus was evaluated
subjectively in three modules: audio, visual and audio-visual.
The evaluation showed that the happiness, anger and inquiry emotions
were recognized better visually (94%, 96% and 96%, respectively) than
audibly (63.6%, 74% and 74%); their audio-visual evaluation scores were
96%, 89.6% and 80.8%. The sadness and fear emotions, on the other hand,
were recognized better audibly (76.8% and 97.6%) than visually (58% and
78.8%); their audio-visual evaluation scores were 65.6% and 90%.
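
The comparison above can be tabulated with a short Python sketch. The
figures are the reported recognition rates, assuming each parenthesized
list follows the order in which the emotions are named; the dictionary
layout and the helper names (better_single_modality, best_modality) are
illustrative only and are not part of the corpus tooling.

```python
# Recognition rates (%) as reported in the evaluation summary.
# Assumption: values in each parenthesized list follow the order in
# which the emotions are named in the text.
SCORES = {
    "happiness": {"audio": 63.6, "visual": 94.0, "audio-visual": 96.0},
    "anger":     {"audio": 74.0, "visual": 96.0, "audio-visual": 89.6},
    "inquiry":   {"audio": 74.0, "visual": 96.0, "audio-visual": 80.8},
    "sadness":   {"audio": 76.8, "visual": 58.0, "audio-visual": 65.6},
    "fear":      {"audio": 97.6, "visual": 78.8, "audio-visual": 90.0},
}


def better_single_modality(emotion: str) -> str:
    """Return 'audio' or 'visual', whichever scored higher alone."""
    rates = SCORES[emotion]
    return "audio" if rates["audio"] > rates["visual"] else "visual"


def best_modality(emotion: str) -> str:
    """Return the highest-scoring of the three evaluation modules."""
    rates = SCORES[emotion]
    return max(rates, key=rates.get)


if __name__ == "__main__":
    for emotion in SCORES:
        print(emotion, better_single_modality(emotion), best_modality(emotion))
```

Under that reading, happiness is the only emotion whose audio-visual
score exceeds both single-modality scores.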