AraOffense: Detecting Offensive Speech Across Dialects in Arabic Media

Date of Award

4-30-2024

Document Type

Thesis

Degree Name

Master of Science in Natural Language Processing

Department

Natural Language Processing

First Advisor

Prof. Shady Shehata

Abstract

In the digital age, the proliferation of online platforms has brought significant benefits for global communication and information sharing. However, this advancement also brings challenges, particularly the rise of offensive speech, which can have damaging social effects. While natural language processing (NLP) has made strides in identifying and moderating toxic content in textual and visual data, the domain of speech, especially in under-represented languages, remains relatively underexplored. This study addresses this gap by introducing AraOffense, a novel dataset designed for detecting offensive speech in Arabic across a variety of dialects. The dataset comprises 2,146 instances, including 475 offensive samples, drawn from scripted media content, thus providing a reliable and ethically sourced corpus for research and development. Our research aims to evaluate how effectively current speech models identify offensive content. By implementing a multimodal approach that leverages both text and audio data, our study demonstrates a significant improvement over traditional unimodal methods. The proposed model, which combines state-of-the-art audio encoders with text analysis through large language models, notably enhances the detection of offensive speech. Specifically, our best configuration outperforms baseline models by over 26% in terms of the Matthews Correlation Coefficient (MCC), establishing the first benchmark for offensive speech detection in Arabic. Furthermore, our findings underscore the importance of tailored, language-specific models over general multilingual ones, particularly in handling the nuances and complexities inherent in Arabic dialects. The study also highlights the efficacy of multi-headed attention mechanisms in fusing audio and textual information, which outperform simpler concatenation techniques.
This work not only contributes a valuable resource for the development of content moderation tools in Arabic-speaking regions but also sets the stage for future explorations into multimodal and language-specific approaches in NLP. By shedding light on the less explored domain of offensive speech detection in spoken language, particularly in dialect-rich and under-represented languages, our study calls for continued efforts to enhance digital safety and inclusivity on a global scale.
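As a rough illustration of why the abstract reports the Matthews Correlation Coefficient rather than accuracy, the sketch below computes MCC from confusion-matrix counts and applies it to the class balance stated above (475 offensive out of 2,146 instances). The helper name `mcc` and the two hypothetical classifiers are assumptions made for illustration; the thesis's actual evaluation split and predictions are not given here, so the whole-dataset counts are used only to show the effect of class imbalance.

```python
from math import sqrt


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention for
    degenerate cases such as a classifier that predicts only one class).
    """
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Class balance taken from the abstract.
TOTAL = 2146
OFFENSIVE = 475
NON_OFFENSIVE = TOTAL - OFFENSIVE  # 1671

# Hypothetical classifier 1: labels every instance as non-offensive.
majority_accuracy = NON_OFFENSIVE / TOTAL
majority_mcc = mcc(tp=0, tn=NON_OFFENSIVE, fp=0, fn=OFFENSIVE)

# Hypothetical classifier 2: labels every instance correctly.
perfect_mcc = mcc(tp=OFFENSIVE, tn=NON_OFFENSIVE, fp=0, fn=0)

print(f"majority-class accuracy: {majority_accuracy:.3f}")
print(f"majority-class MCC:      {majority_mcc:.3f}")
print(f"perfect-classifier MCC:  {perfect_mcc:.3f}")
```

Under these assumptions, a classifier that never flags anything still scores roughly 78% accuracy while its MCC is zero, which is why MCC is a more informative summary for an imbalanced corpus of this kind.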

Comments

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Natural Language Processing

Advisors: Shady Shehata

Online access available for MBZUAI patrons
