YoloS3DN: Towards Low-Latency ViT-Powered Stereo 3D Object Detection

Stereo 3D object detection has been an ever-growing challenge in the field of computer vision, specifically because of the role it plays in deploying autonomous driving solutions that are computationally lightweight, fast and accurate. The task is particularly challenging for models that rely on point clouds derived from disparity estimation for 3D depth reconstruction. Meanwhile, Vision Transformers have recently begun outperforming CNNs on image classification tasks, all whilst consuming fewer resources with fewer parameters and FLOPs. To this end, YoloS3DN is proposed: a lightweight, real-time-capable iteration of YoloStereo3D with a Vision Transformer backbone. The network scales the problem back to 2D object detection whilst reinforcing the 2D features with stereo features. The model was trained on a single NVIDIA RTX 3090 GPU on the KITTI Stereo dataset and validated on the 2017 3D Object Detection Benchmark. Ablation and comparative experiments display quantitative and qualitative results with superior performance in inference speed and accuracy comparable to the state of the art. These advancements push innovation towards real-time, stereo-based object detection for real-world autonomous driving and robotics solutions. The code has been made available at https://github.com/zakaseb/YoloS3DN.

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Computer Vision

Advisors: Dr. Hang Dai, Dr. Hisham Cholakkal

With a 2-year embargo period

This document is currently not available here.