Natural Language Processing Faculty Publications

NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

Genta Indra Winata, Bloomberg, United States
Alham Fikri Aji, Amazon, United States
Samuel Cahyawijaya, HKUST, Hong Kong
Rahmad Mahendra, Universitas Indonesia, Indonesia & INACL, Indonesia
Fajri Koto, The University of Melbourne, Australia
Ade Romadhony, INACL, Indonesia & Telkom University, Indonesia
Kemal Kurniawan, INACL, Indonesia & The University of Melbourne, Australia
David Moeljadi, Kanda University of International Studies, Japan
Radityo Eko Prasojo, Kata.ai
Pascale Fung, HKUST, Hong Kong
Timothy Baldwin, The University of Melbourne, Australia & Mohamed bin Zayed University of Artificial Intelligence
Jey Han Lau, The University of Melbourne, Australia

Document Type

Article

Publication Title

arXiv

Abstract

Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages in Indonesia. Although Indonesia is the second most linguistically diverse country, most of its languages are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes datasets, a multi-task benchmark, and lexicons, as well as a parallel Indonesian-English dataset. We provide extensive analyses and describe the challenges of creating such resources. We hope that our work can spark NLP research on Indonesian and other underrepresented languages. © 2022, CC BY-SA.

DOI

10.48550/arXiv.2205.15960

Publication Date

5-31-2022

Keywords

Computation and Language (cs.CL), Natural Language Processing

Comments

Preprint: arXiv

Archived with thanks to arXiv

Preprint License: CC BY-NC-SA 4.0

Uploaded 01 July 2022

Recommended Citation

G. I. Winata et al., "NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages," 2022, arXiv:2205.15960

Included in

Artificial Intelligence and Robotics Commons
