Natural Language Processing Faculty Publications

Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications

Muhammad Arslan Manzoor, Mohamed Bin Zayed University of Artificial Intelligence
Sarah Albarri, Mohamed Bin Zayed University of Artificial Intelligence
Ziting Xian, Sun Yat-Sen University
Zaiqiao Meng, University of Glasgow
Preslav Nakov, Mohamed Bin Zayed University of Artificial Intelligence
Shangsong Liang, Mohamed Bin Zayed University of Artificial Intelligence

Document Type

Article

Publication Title

ACM Transactions on Multimedia Computing, Communications and Applications

Abstract

Multimodality Representation Learning, a technique for learning to embed information from different modalities and their correlations, has achieved remarkable success in a variety of applications, such as Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR), and Vision Language Retrieval (VLR). In these applications, cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task optimally, e.g., to understand, recognize, retrieve, or generate. Researchers have proposed diverse methods to address these tasks, and different variants of transformer-based architectures have performed extraordinarily well on multiple modalities. This survey presents a comprehensive review of the literature on the evolution and enhancement of deep learning multimodal architectures that handle textual, visual, and audio features for diverse cross-modal and modern multimodal tasks. It summarizes (i) recent task-specific deep learning methodologies, (ii) pretraining types and multimodal pretraining objectives, (iii) the progression from state-of-the-art pretrained multimodal approaches to unifying architectures, and (iv) multimodal task categories and possible future improvements that can be devised for better multimodal learning. Moreover, we prepare a dataset section for new researchers that covers most of the benchmarks for pretraining and finetuning. Finally, major challenges, gaps, and potential research topics are explored. A constantly updated paper list related to our survey is maintained at https://github.com/marslanm/multimodality-representation-learning.

DOI

10.1145/3617833

Publication Date

10-23-2023

Keywords

multimodal applications, multimodal methods, Multimodality, pretrained models, representation learning

Comments

IR Deposit conditions:

OA version: Accepted version

No embargo

Publisher copyright and source must be acknowledged

Must link to publisher version with statement that this is the definitive version and DOI

Must state that the version on the repository is the author's version

Set statement to accompany deposit (see policy)

Recommended Citation

M. Manzoor et al., "Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications," ACM Transactions on Multimedia Computing, Communications and Applications, vol. 20, no. 3, pp. 1-34, Oct 2023.

The definitive version is available at https://doi.org/10.1145/3617833

Additional Links

DOI link: https://doi.org/10.1145/3617833
