ISCA Archive Interspeech 2024

AraOffence: Detecting Offensive Speech Across Dialects in Arabic Media

Youssef Nafea, Shady Shehata, Zeerak Talat, Ahmed Aboeitta, Ahmed Sharshar, Preslav Nakov

Natural language processing (NLP) has made progress towards identifying toxic and offensive content in the text and image modalities. Although speech raises similar concerns, such as the growing exposure to online abuse delivered through speech, research on offensive speech trails behind. While NLP has primarily considered English-language data, speech research has emphasized under-represented languages such as Swahili and Wolof. In this work, we introduce ARAOFFENSE, a dataset of scripted media in Arabic dialects labelled for offensiveness. ARAOFFENSE contains 2146 instances, of which 475 are labelled as offensive, spanning 1.55 hours of audio. We assess the capabilities of speech models to detect offensive content and present a hard-to-beat multi-modal model combining text and audio, which outperforms the baselines by more than 26% in terms of the Matthews Correlation Coefficient. Our work thus presents the first benchmark for offensive speech detection in dialectal Arabic.
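As a brief illustration of why the Matthews Correlation Coefficient (MCC) is a natural metric for a dataset with this class balance, the following minimal sketch computes MCC from a binary confusion matrix using its standard definition. Only the instance counts (2146 total, 475 offensive) come from the abstract; the function name `mcc` and the two reference classifiers (always predicting "not offensive", and a perfect classifier) are illustrative assumptions and are not the baselines or models evaluated in the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient for a binary confusion matrix.

    Returns 0.0 when the denominator is zero (a common convention,
    e.g. when a classifier predicts only one class).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Class counts taken from the abstract.
total = 2146
offensive = 475
non_offensive = total - offensive

# Illustrative reference point 1: always predict "not offensive".
majority_accuracy = non_offensive / total
majority_mcc = mcc(tp=0, tn=non_offensive, fp=0, fn=offensive)

# Illustrative reference point 2: a perfect classifier.
perfect_mcc = mcc(tp=offensive, tn=non_offensive, fp=0, fn=0)

print(f"majority-class accuracy: {majority_accuracy:.3f}")
print(f"majority-class MCC:      {majority_mcc:.3f}")
print(f"perfect-classifier MCC:  {perfect_mcc:.3f}")
```

Under these assumptions, the always-negative classifier reaches a high accuracy simply because most instances are not offensive, yet its MCC is zero, which is why MCC is more informative than accuracy on imbalanced data of this kind.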