Training Free Distillation for Efficient Action Recognition

The rise of social media and video-sharing platforms has led to a steady increase in the amount of video content generated and shared daily. From entertainment to education and communication, video has become part of our daily lives. This has created a growing need for applications that can automatically understand and analyse video data in a variety of contexts, including recommendation systems, surveillance systems, and sports analyses. Action recognition, one of the fundamental tasks in video understanding, has therefore long been an important research area. Over the past few years, despite significant algorithmic and hardware improvements in action recognition, its high computational cost remains a concern due to increased energy consumption and reduced scalability and usability. To address this challenge, researchers have pursued efficient action recognition through model optimization and the reduction of temporal and spatial redundancy. However, reducing spatial redundancy is seldom explored. To address this gap, AdaFocus [47] was proposed, which uses a Reinforcement Learning (RL) policy to select small patches from the original video frames to reduce spatial redundancy, achieving state-of-the-art results on several video dataset benchmarks. We argue that training an RL model to select patches is suboptimal: the extra RL model requires substantial training time and computational cost, and the selected patches are less explainable. To this end, we propose a training-free distillation method that uses motion information extracted from optical flow to select the patch with the most salient motion in each frame. Our method provides better explainability and can be applied on top of other models. Experiments show that good optical flow extraction is sufficient to obtain information about salient motion in the frames.
We also demonstrate that different variants of the motion-calculation algorithm can further improve performance. With this approach, we substantially reduce the computational cost and the number of model parameters while achieving good results on the HMDB51 and RWF2000 datasets compared to the center-crop and random-crop baselines, although performance is slightly lower than that of the framework with the original patch size. Additionally, we reduce the computational cost and parameter count on the ActivityNet dataset, albeit with only a marginal decline in performance.

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Dr. Abdulmotaleb Elsaddik, Dr. Huan Xiong

Online access available for MBZUAI patrons