Towards Learning Efficient Multilingual and Multimodal Representation



This thesis develops efficient representation methods for multilingual and multimodal data in machine learning. The research proceeds in three stages, each centered on specific tasks. The first stage investigates and improves multilingual representation approaches for question answering and text-to-speech. The second stage improves fusion strategies for multimodal representations in hateful meme classification. The third stage unifies the previous two by exploring the image retrieval task and improving the performance of multilingual and multimodal representations on it. Across these stages, the thesis proposes approaches built on pre-trained models and multimodal fusion techniques to improve both the performance and the cultural relevance of machine learning applications. For example, the proposed Hate-CLIPper architecture achieves state-of-the-art performance on hateful meme detection, while training on a natively multilingual and multimodal Wikipedia Image-Text dataset with English text augmentation enables retrieval of culturally relevant images in ten Indian languages. Beyond contributing efficient representation methods for multilingual and multimodal data, the research is intended to motivate further investigation of pre-trained models and multimodal fusion techniques in multilingual and multimodal settings.



Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Computer Vision

Advisors: Dr. Karthik Nandakumar, Dr. Salman Khan

Online access for MBZUAI patrons