Data-efficient transformer-based 3D object detection
This thesis studies 3D object detection from point clouds of indoor scenes. 3D object detection aims to recognize object categories and localize each object by drawing a bounding box around it. To predict a set of bounding boxes from an input scan, a machine learning model must extract features that describe the scene points. However, 3D scene understanding remains challenging because 3D point cloud data is unique: it is orderless, sparse, and continuous. Recent 3D detection models rely on the Transformer architecture for its natural ability to capture global context features. One such model is 3DETR, a pure transformer-based network designed to generate 3D boxes on indoor scans. Transformers are known to be data-hungry, yet data collection and annotation are more challenging in 3D than in 2D. Our goal is therefore to study the data-hungriness of the 3DETR-m model and to propose a solution for its data efficiency. Our methodology is based on the observation that PointNet++ provides more locally aggregated features, which can support 3DETR-m predictions when training data is limited. We propose three backbone fusion methods based on addition (Fusion I), concatenation (Fusion II), and replacement (Fusion III), and we utilize pre-trained weights from the Group-Free model trained on the SUN RGB-D dataset. The proposed 3DETR-m outperforms the original model at all data proportions (10%, 25%, 50%, 75%, and 100%), improving the 3DETR-m paper results by 1.46% in mAP@25 and 2.46% in mAP@50 on the full dataset. We believe our research provides new insights into the data-hungriness of 3D transformer detectors and motivates the use of pre-trained models in 3D as one path toward data efficiency.
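The three fusion strategies named above can be sketched as operations on per-point feature matrices. This is a minimal NumPy illustration under our own assumptions, not the thesis implementation: the function name `fuse`, the feature shapes, and the requirement that both backbones emit features of matching width (needed only for addition) are hypothetical.

```python
import numpy as np

def fuse(feat_3detr, feat_pnpp, mode):
    """Combine backbone features; each input has shape (num_points, channels)."""
    if mode == "add":        # Fusion I: element-wise addition (channel widths must match)
        return feat_3detr + feat_pnpp
    if mode == "concat":     # Fusion II: concatenation along the channel axis
        return np.concatenate([feat_3detr, feat_pnpp], axis=-1)
    if mode == "replace":    # Fusion III: replace 3DETR features with PointNet++ features
        return feat_pnpp
    raise ValueError(f"unknown fusion mode: {mode}")

# Example: 4 points, 256-dimensional features from each backbone
f_a = np.zeros((4, 256))
f_b = np.ones((4, 256))
print(fuse(f_a, f_b, "add").shape)      # (4, 256)
print(fuse(f_a, f_b, "concat").shape)   # (4, 512)
print(fuse(f_a, f_b, "replace").shape)  # (4, 256)
```

Note that concatenation doubles the channel width, so a downstream projection layer would be needed to restore the width the transformer decoder expects.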
A. Nurakhmetova, "Data-efficient transformer-based 3D object detection", M.S. Thesis, Computer Vision, MBZUAI, Abu Dhabi, UAE, 2022.