Towards Learning Efficient Multilingual and Multimodal Representation
Document Type
Dissertation
Abstract
This thesis develops efficient representation methods for multilingual and multimodal data in machine learning. The research proceeds in three stages, each focusing on specific tasks. The first stage investigates and improves multilingual representation approaches for question-answering and text-to-speech tasks. The second stage improves fusion strategies for multimodal representations in hateful meme classification. The third stage unifies the previous two by exploring the image retrieval task and improving the performance of multilingual and multimodal representations. The thesis proposes approaches based on pre-trained models and multimodal fusion techniques to improve both the performance and the cultural relevance of machine learning applications. For example, the proposed Hate-CLIPper architecture achieves state-of-the-art performance on hateful meme classification, while training on a natively multilingual and multimodal Wikipedia Image-Text dataset with English-text augmentation enables retrieval of culturally relevant images in ten Indian languages. The research not only contributes efficient representation methods for multilingual and multimodal data, but also motivates further investigation into the use of pre-trained models and multimodal fusion techniques for machine learning in multilingual and multimodal settings.
First Page
i
Last Page
63
Publication Date
6-2023
Recommended Citation
G.K. Kumar, "Towards Learning Efficient Multilingual and Multimodal Representation", M.S. Thesis, Computer Vision, MBZUAI, Abu Dhabi, UAE, 2023.
Comments
Thesis submitted to the Deanship of Graduate and Postdoctoral Studies
In partial fulfillment of the requirements for the M.Sc. degree in Computer Vision
Advisors: Dr. Karthik Nandakumar, Dr. Salman Khan
Online access for MBZUAI patrons