On the Effectiveness of Images in Multi-modal Text Classification: An Annotation Study
Document Type
Article
Publication Title
ACM Transactions on Asian and Low-Resource Language Information Processing
Abstract
Combining different input modalities beyond text is a key challenge for natural language processing. Previous work has been inconclusive as to the true utility of images as a supplementary information source for text classification tasks, motivating this large-scale human study of labeling performance given text only, images only, or both text and images. To this end, we create a new dataset accompanied by a novel annotation method, Japanese Entity Labeling with Dynamic Annotation, to deepen our understanding of the effectiveness of images for multi-modal text classification. By performing a careful comparative analysis of human performance and that of state-of-the-art multi-modal text classification models, we gain valuable insights into the differences between human and model performance, and the conditions under which images are beneficial for text classification.
First Page
1
Last Page
19
DOI
10.1145/3565572
Publication Date
March 10, 2023
Keywords
Datasets, multi-modality, natural language processing, neural networks, text classification
Recommended Citation
C. Ma et al., “On the effectiveness of images in multi-modal text classification: An annotation study,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 22, no. 3, pp. 1–19, 2023. doi:10.1145/3565572
Comments
IR Deposit conditions:
OA version (pathway a): Accepted version
No embargo
Publisher copyright and source must be acknowledged
Must link to the publisher version with a statement that it is the definitive version, including the DOI
Must state that the version in the repository is the author's version
Set statement to accompany deposit (see policy)