CEAFFOD: Cross-Ensemble Attention-based Feature Fusion Architecture Towards a Robust and Real-time UAV-based Object Detection in Complex Scenarios

Document Type

Conference Proceeding

Publication Title

Proceedings - IEEE International Conference on Robotics and Automation


Deploying object detectors on embedded devices such as unmanned aerial vehicles (UAVs) comes with many challenges. The UAV itself has limited onboard computation and memory, and the captured visual data varies in object scale, orientation, density, viewpoint, distribution, shape, context and more. To be applicable, the object detector must be robust with high accuracy, real-time with fast inference, and lightweight. Inspired by the YOLO architecture, we propose a novel single-stage detection architecture. Our contributions are as follows. First, a feature fusion spatial pyramid pooling (FFSPP) block applies attention-based feature fusion across both time and space, efficiently utilizing the information of subsequent frames and scales. Second, we introduce a multi-dilated attention-based cross-stage partial connection (MDACSP) block that increases the receptive field and produces per-channel modulation weights after aggregating the feature maps across their spatial domain. Third, a scaled feature fusion head (SFFH) fuses the FFSPP block features with the features of the MDACSP block connected to that head. For more robust results across different scenarios, we perform cross-ensembling with three of the top UAV/traffic surveillance datasets: UAVDT, UA-DETRAC and VisDrone. Our ablation study shows how each contribution improves over the baseline. Our approach yields state-of-the-art results on all three datasets, achieving 89.3% mAP, 93.5% mAP, and 42.9% mAP, respectively. Testing the model on an NVIDIA Jetson Xavier NX board shows a desirable balance between inference time and memory cost. We also show qualitatively the model's robustness and efficiency across the diverse, complex scenarios of these datasets. We hope this work facilitates the advancement of UAV-based perception in such crucial industrial applications.
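
The two mechanisms attributed to the MDACSP block, a larger receptive field through dilation and per-channel modulation weights computed after spatial aggregation, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give dilation rates, layer shapes, or the form of the gating function, so the function names, the sigmoid gate, and the per-channel scalars `gate_weights` and `gate_biases` are all assumptions made for illustration only.

```python
import math


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered along one axis by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


def channel_modulation(feature_map, gate_weights, gate_biases):
    """Rescale each channel by a weight derived from its spatial mean.

    feature_map: list of C channels, each a list of H rows of W floats.
    gate_weights, gate_biases: hypothetical per-channel scalars standing
    in for whatever learned layers a real block would use (assumption).
    """
    scaled, weights = [], []
    for channel, w, b in zip(feature_map, gate_weights, gate_biases):
        # Aggregate the channel across its spatial domain (global average).
        values = [v for row in channel for v in row]
        pooled = sum(values) / len(values)
        # Map the pooled value to a modulation weight in (0, 1).
        weight = 1.0 / (1.0 + math.exp(-(w * pooled + b)))
        weights.append(weight)
        # Apply the same weight to every spatial position of the channel.
        scaled.append([[v * weight for v in row] for row in channel])
    return scaled, weights


if __name__ == "__main__":
    # A 3x3 kernel with dilation 1, 2 and 4 spans 3, 5 and 9 positions.
    print([effective_kernel_size(3, d) for d in (1, 2, 4)])

    # Two 2x2 channels: one active, one all zeros.
    fmap = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]
    _, weights = channel_modulation(fmap, [1.0, 1.0], [0.0, 0.0])
    print(weights)
```

With the zero gate biases used here, an all-zero channel receives a weight of exactly 0.5, while a channel with a large positive mean receives a weight close to 1.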

First Page


Last Page




Publication Date



Training, Visualization, Shape, Surveillance, Object detection, Detectors, Computer architecture


IR conditions: non-described