Deep0Tag: Deep Multiple Instance Learning for Zero-Shot Image Tagging
Zero-shot learning aims to perform visual reasoning about unseen objects. In-line with the success of deep learning on object recognition problems, several end-to-end deep models for zero-shot recognition have been proposed in the literature. These models are successful in predicting a single unseen label given an input image but do not scale to cases where multiple unseen objects are present. Here, we focus on the challenging problem of zero-shot image tagging, where multiple labels are assigned to an image, that may relate to objects, attributes, actions, events, and scene type. Discovery of these scene concepts requires the ability to process multi-scale information. To encompass global as well as local image details, we propose an automatic approach to locate relevant image patches and model image tagging within the Multiple Instance Learning (MIL) framework. To the best of our knowledge, we propose the first end-to-end trainable deep MIL framework for the multi-label zero-shot tagging problem. We explore several alternatives for instance-level evidence aggregation and perform an extensive ablation study to identify the optimal pooling strategy. Due to its novel design, the proposed framework has several interesting features: 1) unlike previous deep MIL models, it does not use any off-line procedure (e.g., Selective Search or EdgeBoxes) for bag generation. 2) During test time, it can process any number of unseen labels given their semantic embedding vectors. 3) Using only image-level seen labels as weak annotation, it can produce a localized bounding box for each predicted label. We experiment with the large-scale NUS-WIDE and MS-COCO datasets and achieve superior performance across conventional, zero-shot, and generalized zero-shot tagging tasks.