K-armed Bandit based Multi-Modal Network Architecture Search for Visual Question Answering
MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia
In this paper, we propose a cross-modal network architecture search (NAS) algorithm for VQA, termed k-Armed Bandit based NAS (KAB-NAS). KAB-NAS treats the design of each layer as a k-armed bandit problem and updates the preference for each candidate through repeated sampling in a single-shot search framework. To establish an effective search space, we further propose a new architecture termed Automatic Graph Attention Network (AGAN), and extend the popular self-attention layer with three graph structures, denoted as dense-graph, co-graph and separate-graph. These graph layers define the direction of information propagation in the graph network, and their optimal combinations are searched by KAB-NAS. To evaluate KAB-NAS and AGAN, we conduct extensive experiments on two VQA benchmark datasets, i.e., VQA2.0 and GQA, and also test AGAN with the popular BERT-style pre-training. The experimental results show that, with the help of KAB-NAS, AGAN achieves state-of-the-art performance on both benchmark datasets with far fewer parameters and computations.
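The per-layer bandit idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the update rule, so this sketch substitutes the textbook gradient-bandit rule (softmax over preferences with a running-average baseline); it is not the paper's actual KAB-NAS update. The names `LayerBandit`, `search`, and `toy_reward`, the learning rate, and the reward function are all hypothetical. In a real search the reward would come from validating the sampled architecture with shared weights.

```python
import math
import random

# Candidate layer types named in the abstract; each layer picks one of them.
CANDIDATES = ["dense-graph", "co-graph", "separate-graph"]


def softmax(prefs):
    """Turn preference scores into sampling probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


class LayerBandit:
    """One k-armed bandit per layer (assumed gradient-bandit update)."""

    def __init__(self, k, lr=0.2):
        self.prefs = [0.0] * k   # one preference per candidate
        self.lr = lr             # hypothetical step size
        self.baseline = 0.0      # running mean of observed rewards
        self.steps = 0

    def sample(self, rng):
        probs = softmax(self.prefs)
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]

    def update(self, arm, reward):
        self.steps += 1
        self.baseline += (reward - self.baseline) / self.steps
        advantage = reward - self.baseline
        probs = softmax(self.prefs)
        for a in range(len(self.prefs)):
            indicator = 1.0 if a == arm else 0.0
            self.prefs[a] += self.lr * advantage * (indicator - probs[a])


def search(num_layers, reward_fn, iterations=3000, seed=0):
    """Sample whole architectures repeatedly and update every layer's bandit."""
    rng = random.Random(seed)
    bandits = [LayerBandit(len(CANDIDATES)) for _ in range(num_layers)]
    for _ in range(iterations):
        arch = [b.sample(rng) for b in bandits]
        reward = reward_fn(arch)  # stand-in for validation accuracy
        for bandit, arm in zip(bandits, arch):
            bandit.update(arm, reward)
    # Keep the candidate with the highest preference in each layer.
    best = [max(range(len(b.prefs)), key=b.prefs.__getitem__) for b in bandits]
    return [CANDIDATES[i] for i in best]


# Toy reward (assumption): fraction of layers matching a made-up target.
TARGET = [0, 1, 2]


def toy_reward(arch):
    return sum(a == t for a, t in zip(arch, TARGET)) / len(TARGET)


if __name__ == "__main__":
    print(search(num_layers=3, reward_fn=toy_reward))
```

Because every layer's bandit sees the same architecture-level reward, each sampled architecture updates all layers at once, which is what makes a single-shot search of this kind cheap compared with training each candidate separately.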
network architecture search, visual question answering
Y. Zhou et al., "K-armed Bandit based Multi-Modal Network Architecture Search for Visual Question Answering," in Proceedings of the 28th ACM Intl. Conf. on Multimedia (MM '20), ACM, New York, NY, USA, pp. 1245–1254, Oct. 2020. doi:10.1145/3394171.3413998