CGPS-3DV: Cycled Generative Pseudo-Stereo for Monocular 3D Object Detection in Autonomous Vehicles

Image generation has been investigated indirectly through image reconstruction for the depth estimation task, and more recently, feature-level generation demonstrated the effectiveness of image synthesis for the monocular 3D object detection task. Our research question is: can direct image-level generation bridge the gap between stereo and monocular settings in autonomous vehicle perception? Our contributions can be summarized as follows. First, we introduce a novel Cycled Generative Pseudo-Stereo (CGPS) architecture as a block designed to enable efficient direct image-level generation of right-view images from left-view images in stereo settings. Second, we incorporate an integrated combination of losses that makes direct image-level generation in this setting more efficient: an object perception consistency loss to guarantee the presence of the objects of interest in the generated right-view images, a cycle consistency loss to narrow the gap between the input and the regenerated left-view images after every training cycle, and an adversarial loss from a PatchGAN discriminator that penalizes at the scale of local patches of the generated right-view image. The approach lets the model learn the camera parameters of any given dataset, after which it offers a block that turns any stereo-based model into a monocular-based one. We show that our block architecture and training settings bridge the performance gap between stereo-based and monocular-based 3D object detectors, obtaining state-of-the-art performance on the KITTI monocular 3D object detection benchmark. Using our CGPS image generation block with LIGA-Stereo as the stereo-based 3D object detector, our method outperformed all monocular approaches on the KITTI 3D object detection task, achieving 46.70% AP, 55.28% AP and 74.80% AP for the hard, moderate and easy categories respectively.
The CGPS monocular-to-stereo block turns stereo-based 3D object detectors into monocular-based ones while keeping the 3D detection accuracy close to the stereo setting and superior to the best-performing monocular 3D object detectors by a large margin. Testing on the top stereo-based 3D object detectors, the average difference between the stereo and monocular settings for the car class is 7.17% AP, 9.27% AP and 9.50% AP for the easy, moderate and hard difficulty levels respectively. For the pedestrian class, the difference ranges between 3.32% AP and 3.78% AP for the easy difficulty level, between 2.34% AP and 2.66% AP for the moderate difficulty level, and between 3.28% AP and 3.86% AP for the hard difficulty level. For the cyclist class, the difference is 0.9% AP, 0.75% AP and 0.37% AP for the hard, moderate and easy difficulty levels respectively. These quantitative results, together with the qualitative ones, show robustness across different methods, including volume-based and cross-modality conversion-based methods, and the performance is consistent across all difficulty levels and classes. We hope that our approach, by narrowing the accuracy gap between monocular-based and stereo-based 3D object detectors and other 3D vision tasks, opens the opportunity to use monocular 3D object detectors in industry, and that it facilitates future research in that direction.
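The combination of losses described above (object perception consistency, cycle consistency, and a PatchGAN adversarial term) can be sketched as a single generator objective. This is a minimal, hypothetical PyTorch sketch: the feature extractor, loss norms, weights, and discriminator details are assumptions for illustration and may differ from the thesis implementation.

```python
# Hypothetical sketch of the CGPS combined generator objective.
# Assumptions (not from the source): L1 norms for both consistency terms,
# a least-squares adversarial term, and the relative weights w_cyc/w_perc/w_adv.
import torch
import torch.nn as nn


def cgps_generator_loss(left, left_cycled, feats_real, feats_fake,
                        patch_logits_fake, w_cyc=10.0, w_perc=1.0, w_adv=1.0):
    """Combine the three losses described in the abstract.

    left              : input left-view image, shape (B, 3, H, W)
    left_cycled       : left view regenerated after a full L->R->L cycle
    feats_real/fake   : object-perception features of the real vs. generated
                        right view (from a hypothetical detector backbone)
    patch_logits_fake : PatchGAN discriminator logits on the generated right
                        view, one logit per local patch, e.g. (B, 1, h, w)
    """
    # Cycle consistency: the regenerated left view should match the input.
    cycle_loss = nn.functional.l1_loss(left_cycled, left)
    # Object perception consistency: objects of interest should survive generation.
    perception_loss = nn.functional.l1_loss(feats_fake, feats_real)
    # Adversarial term: push every local patch logit toward the "real" label (1).
    adv_loss = nn.functional.mse_loss(patch_logits_fake,
                                      torch.ones_like(patch_logits_fake))
    return w_cyc * cycle_loss + w_perc * perception_loss + w_adv * adv_loss
```

In this sketch the PatchGAN term operates on a grid of patch logits rather than a single image-level score, matching the abstract's description of penalizing at the scale of local patches.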


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Computer Vision

Advisors: Dr. Hang Dai, Dr. Shijian Lu

Online access for MBZUAI patrons