ExtArabic: Extensive Arabic Natural Language Understanding Benchmark
Building a reliable and comprehensive evaluation benchmark for Arabic language understanding is highly desirable to measure the diverse abilities of current Arabic language models (LMs) and accelerate advances. Previous public benchmarks have often focused on a specific subset of tasks (e.g., sentiment, machine translation). This paper presents a new extensive Arabic evaluation benchmark (ExtArabic) comprising eight diverse tasks spanning semantics (named entity recognition, natural language inference, question answering, topic classification), sentiment (binary sentiment, emotion classification), language varieties (dialect detection), and commonsense reasoning (Winograd schema). In particular, besides carefully selecting representative datasets from the existing literature, we create an Arabic Winograd schema task by translating and adapting the corresponding English dataset, presenting a commonsense reasoning challenge rarely studied in the Arabic context. To ensure that the benchmarking process is fair and does not encourage overfitting, we have also developed a private dataset for ExtArabic using adversarial attacks. Incorporating adversarial robustness evaluation into the benchmarking process ensures that Arabic LMs are not only accurate but also resilient against malicious inputs. Extensive experiments on ExtArabic with the latest large pretrained models, such as mBERT, AraBERT, MARBERT, and CAMeLBERT, show that Arabic language understanding still has large room for improvement. Overall, we believe that ExtArabic, with its diverse set of tasks and its private adversarially constructed dataset, aligns well with the community's goals and will foster Arabic NLP research as a whole.
S.A. Al Barri, "ExtArabic: Extensive Arabic Natural Language Understanding Benchmark", M.S. Thesis, Machine Learning, MBZUAI, Abu Dhabi, UAE, 2023.