On the Effectiveness of Images in Multi-modal Text Classification: An Annotation Study
ACM Transactions on Asian and Low-Resource Language Information Processing
Combining different input modalities beyond text is a key challenge for natural language processing. Previous work has been inconclusive as to the true utility of images as a supplementary information source for text classification tasks, motivating this large-scale human study of labelling performance given text only, images only, or both text and images. To this end, we create a new dataset accompanied by a novel annotation method, Japanese Entity Labeling with Dynamic Annotation, to deepen our understanding of the effectiveness of images for multi-modal text classification. By performing a careful comparative analysis of human performance and the performance of state-of-the-art multi-modal text classification models, we gain valuable insights into differences between human and model performance, and the conditions under which images are beneficial for text classification.
Datasets, multi-modality, natural language processing, neural networks, text classification
C. Ma et al., “On the effectiveness of images in multi-modal text classification: An annotation study,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 22, no. 3, pp. 1–19, 2023. doi:10.1145/3565572