Natural Language Processing Faculty Publications

Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

Zheng Xin Yong, Brown University
Ruochen Zhang, Brown University
Jessica Zosa Forde, Brown University
Skyler Wang, University of California, Berkeley
Arjun Subramonian, University of California, Los Angeles
Holy Lovenia, Hong Kong University of Science and Technology
Samuel Cahyawijaya, Hong Kong University of Science and Technology
Genta Indra Winata, Bloomberg

Document Type

Conference Proceeding

Publication Title

CALCS 2023 - Computational Approaches to Linguistic Code-Switching, Proceedings of the Workshop

Abstract

While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems at generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts: its performance varies depending on the prompt template and language pairing. For instance, ChatGPT generates fluent and natural texts in Singlish (an English-based creole spoken in Singapore), but for the English-Tamil language pair, the system mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.

First Page

43

Last Page

63

Publication Date

1-1-2023

Recommended Citation

Z. Yong et al., "Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages," CALCS 2023 - Computational Approaches to Linguistic Code-Switching, Proceedings of the Workshop, pp. 43 - 63, Jan 2023.
