Natural Language Processing Faculty Publications

LLM-powered Data Augmentation for Enhanced Crosslingual Performance

Chenxi Whitehouse, City, University of London
Monojit Choudhury, Microsoft Corporation
Alham Fikri Aji, Mohamed bin Zayed University of Artificial Intelligence

Document Type

Conference Proceeding

Publication Title

EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings

Abstract

This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in multilingual commonsense reasoning datasets where the available training data is extremely limited. To achieve this, we utilise several LLMs, namely Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets: XCOPA, XWinograd, and XStoryCloze. Subsequently, we evaluate the effectiveness of fine-tuning smaller multilingual models, mBERT and XLMR, using the synthesised data. We compare the performance of training with data generated in English and target languages, as well as translated English-generated data, revealing the overall advantages of incorporating data generated by LLMs, e.g. a notable 13.4 accuracy score improvement for the best case. Furthermore, we conduct a human evaluation by asking native speakers to assess the naturalness and logical coherence of the generated examples across different languages. The results of the evaluation indicate that LLMs such as ChatGPT and GPT-4 excel at producing natural and coherent text in most languages; however, they struggle to generate meaningful text in certain languages like Tamil. We also observe that ChatGPT falls short in generating plausible alternatives compared to the original dataset, whereas examples from GPT-4 exhibit competitive logical consistency. We release the generated data at https://github.com/MBZUAI-nlp/Gen-X.

First Page

671

Last Page

686

Publication Date

1-1-2023

Recommended Citation

C. Whitehouse et al., "LLM-powered Data Augmentation for Enhanced Crosslingual Performance," EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings, pp. 671-686, Jan 2023.
