From Limitations to Innovations: A Deep Dive into Multimodal Weakly Supervised Violence Detection

This thesis addresses the need for automated content moderation and surveillance by studying violence detection with multimodal, weakly supervised deep learning models. As the volume of online content continues to grow, accurate and efficient violence detection systems have become increasingly important. Despite recent advances, existing methods suffer from several limitations: inconsistent handling of footage compiled from multiple sources, inadequate evaluation of fusion mechanisms, rigid loss functions, neglect of temporal information, and insufficient multi-class evaluation. This study aims to overcome these limitations and contribute to the development of more accurate and versatile violence detection models.

To address the scene change problem, compilation videos were divided into individual scenes using TransNet V2 and a custom scene merging algorithm. A comprehensive comparison of fusion mechanisms, including Cross-Modal Attention with Local Arousal (CMA-LA), Concatenation Fusion, Attentive Additive Fusion, Gated Additive Fusion, Cross Fusion, and Tensor Fusion, was conducted to evaluate their effectiveness in violence detection. A novel loss function was developed that adaptively selects the number of maximum activations, accounting for the non-uniform distribution of violent events across videos. Temporal relations were exploited through two modified versions of the MACIL SD architecture (Modality-aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection). The MACIL SD model consists of two parts: the single model part, which is based on visual features only, and the AV model part, which combines audio and visual information. The single model part distills knowledge into the AV model, reducing noise in the process. In version (1), only the AV model part was modified, by replacing its fully connected layers with temporal convolution layers. In version (2), both the single model part and the AV model part were modified with temporal convolution layers, enhancing the ability to capture temporal information.
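The idea behind selecting an adaptive number of maximum activations can be illustrated with a small sketch. The abstract does not state how the number is chosen, so the rule below (`adaptive_k`, with its `rel_threshold` parameter) is purely a hypothetical stand-in, as are all function names and the toy scores. The sketch only shows why a fixed number of top activations can penalize a video whose violent content is a single short burst.

```python
import math

def topk_mean(scores, k):
    """Mean of the k largest snippet scores (top-k pooling over a video)."""
    k = max(1, min(k, len(scores)))
    return sum(sorted(scores, reverse=True)[:k]) / k

def adaptive_k(scores, rel_threshold=0.5):
    """Hypothetical rule, NOT the thesis's method: count the snippets whose
    score reaches rel_threshold times the highest score in the video."""
    peak = max(scores)
    return max(1, sum(1 for s in scores if s >= rel_threshold * peak))

def video_bce(scores, label, k):
    """Binary cross-entropy between the pooled video score and the video label."""
    p = min(max(topk_mean(scores, k), 1e-7), 1 - 1e-7)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Toy snippet scores for two videos that both carry the video-level label 1.
burst = [0.05, 0.1, 0.9, 0.1, 0.05, 0.05, 0.1, 0.05]    # one short violent event
sustained = [0.8, 0.9, 0.85, 0.9, 0.8, 0.1, 0.85, 0.9]  # violence in most snippets

fixed_k = 4
k_burst = adaptive_k(burst)
k_sustained = adaptive_k(sustained)

print("adaptive k:", k_burst, k_sustained)
print("burst loss, fixed k:   ", video_bce(burst, 1, fixed_k))
print("burst loss, adaptive k:", video_bce(burst, 1, k_burst))
```

With a fixed k, the pooled score of the burst video is diluted by non-violent snippets, so the loss pushes those snippets toward the violent class. An adaptive k avoids that dilution in this toy case; how the thesis actually selects the number is specified in the full text, not here.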

The results demonstrated that the modified dataset significantly benefits weakly supervised training approaches, particularly CMA-LA, increasing overall AP by 2.95 points compared to the reproduced results and by 2.11 points compared to the state-of-the-art results (83.54). The MACIL SD method showed only marginal improvements with the modified dataset, with a 0.52-point increase in overall AP compared to the reproduced results but a 1.16-point decrease compared to the reported results (83.4). Fusion mechanism performance varied across video labels: simple methods such as concatenation and gated additive fusion outperformed the results of the paper that introduced the dataset, “Not only Look, but also Listen”, while advanced methods such as CMA-LA achieved the highest overall AP. The new loss function outperformed the baseline in most cases, with a 3.95-point increase in overall AP compared to the re-run of the CMA-LA code and a 3.11-point increase compared to the reported 83.54 AP. Temporal convolution modifications in MACIL SD models showed promise, with version (2) achieving the highest overall AP, outperforming the re-run of MACIL SD, the modified dataset, the reported results, the state of the art (CMA-LA), and version (1). However, version (1) offers more stable training and settles on a higher AP.
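The differences quoted above can be turned into implied absolute AP values with simple arithmetic. The short sketch below does this under the assumption that every difference is an absolute change in overall AP; the derived values are implied by the quoted numbers and are not figures stated in the abstract itself. Variable names are illustrative.

```python
# Implied absolute AP values, derived only from the differences quoted above.
# Assumption: each difference is an absolute change in overall AP.
CMA_LA_REPORTED = 83.54    # state-of-the-art figure quoted in the text
MACIL_SD_REPORTED = 83.4   # reported MACIL SD figure quoted in the text

# CMA-LA trained on the modified (scene-split) dataset.
cma_la_modified = round(CMA_LA_REPORTED + 2.11, 2)
cma_la_rerun = round(cma_la_modified - 2.95, 2)

# MACIL SD trained on the modified dataset.
macil_modified = round(MACIL_SD_REPORTED - 1.16, 2)
macil_rerun = round(macil_modified - 0.52, 2)

# CMA-LA with the new loss function.
new_loss = round(CMA_LA_REPORTED + 3.11, 2)
cma_la_rerun_from_loss = round(new_loss - 3.95, 2)

print("CMA-LA, modified dataset:", cma_la_modified)
print("CMA-LA, re-run:", cma_la_rerun, "and", cma_la_rerun_from_loss)
print("MACIL SD, modified dataset:", macil_modified)
print("MACIL SD, re-run:", macil_rerun)
print("CMA-LA, new loss:", new_loss)
```

Both routes to the CMA-LA re-run value agree (82.70), which suggests the quoted differences are internally consistent. The implied values are 85.65 for CMA-LA on the modified dataset, 86.65 for CMA-LA with the new loss, 82.24 for MACIL SD on the modified dataset, and 81.72 for the MACIL SD re-run.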

Based on these findings, future research directions include transitioning from weakly supervised learning to active learning, employing curriculum learning to tackle difficult-to-classify labels, investigating models’ robustness to missing or irrelevant modalities, developing new datasets for a more comprehensive evaluation, exploring novel fusion mechanisms, evaluating new backbones for feature extraction or incorporating them into the architecture, developing alternative loss functions, and introducing new sampling approaches within the weakly supervised setting. Overall, this thesis advances multimodal weakly supervised violence detection by addressing existing limitations and providing a foundation for future research aimed at improving the performance and applicability of these models.

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc degree in Machine Learning

Advisors: Dr. Abdulmotaleb Elsaddik, Dr. Karthik Nandakumar

Online access available for MBZUAI patrons