On Improving Model Generalizability via Optimization and Representation Learning Techniques

In the digital age of ever-increasing data sources, accessibility, and collection, the demand for generalizable machine learning models that make effective use of limited training data is unprecedented, because data collection is labor-intensive and expensive. A deployed model must efficiently exploit patterns and regularities in the data to achieve good predictive performance on new, unseen data, also known as "testing data." Because data can be gathered from many different pools across fields such as Machine Learning (ML), Natural Language Processing (NLP), and Computer Vision (CV), selection bias inevitably creeps into the collected data, resulting in distribution (domain) shifts. The field of Domain Generalization (DG) therefore aims to achieve model generalizability to an unseen target domain using only labeled training data from the source domain. Although several generalization methods have been proposed, a recent study showed that simple empirical risk minimization (ERM) works as well as or better than previous DG methods under comparable settings. However, for deep neural networks (DNNs), the loss function is often highly complex and non-convex. Hence, in practice, a model learned by simply solving ERM on such a loss function typically settles into sharp local minima and yields sub-optimal generalization performance. This thesis presents several approaches to reducing the generalization error using optimization and representation learning techniques. Firstly, we introduce the notion of a local minimum’s sharpness, an attribute that induces poor model generalization and that can serve as a simple guiding heuristic to theoretically distinguish satisfactory (flat) local minima from poor (sharp) local minima.
Secondly, motivated by the introduced concept of the variance-stability ∼ exploration-exploitation tradeoff, we propose Bouncing Gradient Descent (BGD), a novel gradient-based adaptive optimization algorithm that is a variant of stochastic gradient descent (SGD). BGD’s primary goal is to ameliorate SGD’s tendency to get trapped in suboptimal minima by using relatively large step sizes and "unorthodox" weight updates, so that the optimizer is drawn toward flatter local minima and the model generalizes better. Finally, we present FlexPooling, a novel trainable pooling technique that targets the vital yet often overlooked step of mapping feature maps to classes via a global pooling operation, and does so by overparameterizing the CNN’s projection head. It acts as a simple and effective adaptive pooling operation that generalizes average pooling by learning a weighted average pooling over the network’s latent feature activations jointly with the rest of the network. We empirically validate the proposed approaches on several benchmark classification datasets, showing that they yield significant and consistent improvements in model generalization performance and produce state-of-the-art results when compared to the baseline approaches.
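
To make the idea of a weighted average pooling that generalizes average pooling concrete, the following is a minimal sketch and not the FlexPooling implementation from the thesis. The abstract does not specify how the weights are parameterized, so the single spatial weight map shared across channels, the softmax normalization, and the function names `global_average_pool` and `weighted_average_pool` are all assumptions made for illustration. In a real network the logits would be learned jointly with the other parameters; here they are fixed by hand.

```python
import math

def global_average_pool(feature_map):
    """Plain global average pooling: one mean per channel.

    feature_map is a nested list of shape [channels][height][width].
    """
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]

def weighted_average_pool(feature_map, logits):
    """Weighted average over spatial positions (illustrative assumption).

    logits has shape [height][width]. A softmax turns it into non-negative
    weights that sum to 1, so the result is a weighted average per channel.
    """
    flat_logits = [v for row in logits for v in row]
    peak = max(flat_logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in flat_logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(w * v for w, v in zip(weights, values)))
    return pooled

# Two channels of a 2x2 feature map (made-up numbers).
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]

# Equal logits give uniform weights, which recovers average pooling.
uniform = weighted_average_pool(features, [[0.0, 0.0], [0.0, 0.0]])

# A large logit at the bottom-right position makes that position dominate.
peaked = weighted_average_pool(features, [[0.0, 0.0], [0.0, 10.0]])
```

With equal logits the weighted pooling coincides with `global_average_pool`, which is the sense in which a learned weighted average contains average pooling as a special case; unequal logits let the pooling emphasize particular spatial positions.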

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Computer Vision

Advisors: Dr. Eric Xing, Dr. Huan Xiong

Online access for MBZUAI patrons