Transformer-Based Feature Fusion Approach for Multimodal Visual Sentiment Recognition Using Tweets in the Wild

Publication Title

IEEE Access


We present an image-based real-time sentiment analysis system for recognizing in-the-wild sentiment expressions on online social networks (OSNs). The system applies the recently proposed Transformer architecture to OSN big data to extract emotion and sentiment features from three types of images: images containing faces, images containing text, and images containing neither faces nor text. We build a separate model for each image type and then fuse the three models to learn online sentiment behavior. Our proposed methodology combines a supervised two-stage training approach with a threshold-moving method, which is crucial given the class imbalance found in OSN data. Training is carried out on existing popular datasets (one for each of the three models) and on our newly proposed dataset, the Domain Free Multimedia Sentiment Dataset (DFMSD). Our results show that applying the threshold-moving method during training improved sentiment learning performance by 5-8 percentage points compared to training without it. Combining the two-stage strategy with the threshold-moving method during training proved effective in further improving learning performance (i.e., 12% higher accuracy than the threshold-moving strategy alone). Furthermore, the proposed approach had a positive impact on the fusion of the three models in terms of accuracy and F-score.
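The threshold-moving idea the abstract refers to, choosing a decision threshold other than the default 0.5 so that the minority class is not systematically missed, can be sketched as follows. This is a minimal illustration with toy data; the function name, threshold grid, and probabilities are assumptions for demonstration, not details from the paper:

```python
import numpy as np

def best_threshold(y_true, y_prob, thresholds=np.linspace(0.05, 0.95, 19)):
    """Pick the decision threshold that maximizes F1 on a validation set."""
    best_t, best_f1 = 0.5, -1.0
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Imbalanced toy validation set: 8 negatives, 2 positives.
# The classifier assigns the positives only moderate scores, so the
# default 0.5 threshold would miss them both.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])
y_prob = np.array([0.06, 0.11, 0.16, 0.21, 0.26, 0.31, 0.33, 0.38, 0.42, 0.55])

t, f1 = best_threshold(y_true, y_prob)
```

Here the tuned threshold drops below 0.5, recovering both minority-class examples at the cost of one false positive; with the default threshold the F1 score on this toy set would be zero. The same principle applies per class when the classifier is multi-class.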

big data, deep learning, feature extraction, fusion, images, multimodality, online social media, sentiment, threshold moving, transfer learning, Transformers, tweets, ViT
