Large-Scale Dataset Distillation

Date of Award


Document Type


Degree Name

Master of Science in Machine Learning


Machine Learning

First Advisor

Dr. Zhiqiang Shen

Second Advisor

Dr. Kun Zhang


Recent advances in deep learning are driven primarily by the growing size of neural networks and datasets. As scaling up models and data increases computational demands, attention shifts to efficient training methods. For example, efficient model design minimizes model size while preserving performance, and efficient training strategies make it possible to train models with limited resources. Rather than pursuing model design or training strategies, this thesis approaches training efficiency from the data side, through dataset distillation. Dataset distillation aims to generate a small but representative dataset from a large one, so that a model can be trained on it more efficiently and still achieve decent performance when evaluated on the original test distribution. Many prior works align the distilled data with different aspects of the original dataset, such as training weight trajectories, gradients, or feature distributions. However, the considerable computing resources these methods require have limited the exploration of large-scale dataset distillation. In this work, we propose SRe2L, a novel dataset distillation framework comprising three stages: Squeeze, Recover, and Relabel. This paradigm decouples the bi-level synthesis stage used by previous methods from the real data input and separates inner-loop and outer-loop optimization. Additionally, during data synthesis, we introduce a simple yet effective global-to-local gradient refinement approach enabled by Curriculum Data Augmentation (CDA). The proposed framework surpasses the previous state of the art on large-scale datasets, reaching 63.2% accuracy on ImageNet-1K at IPC (Images Per Class) 50 and 36.1% on ImageNet-21K at IPC 20, using a conventional input resolution of 224×224. This performance demonstrates that SRe2L and CDA are pioneering methods that facilitate the development of large-scale dataset distillation and pave the way towards building efficiently distilled datasets in the large-data era.
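
As a rough illustration of the global-to-local idea behind a curriculum over data augmentation, the sketch below shrinks the lower bound of a random-crop scale as synthesis progresses, so early iterations see mostly whole-image (global) views and later iterations may see small (local) crops. This is not the thesis's implementation: the function names, the cosine shape of the schedule, and the default bounds 0.8 and 0.08 are all assumptions made for the example.

```python
import math
import random


def crop_scale_lower_bound(step, total_steps, start=0.8, end=0.08):
    """Hypothetical curriculum: lower bound of the crop scale at a given step.

    Interpolates with a cosine curve from `start` (large, global crops)
    down to `end` (small, local crops). All defaults are illustrative.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    # weight goes from 1.0 at progress 0 to 0.0 at progress 1.
    weight = 0.5 * (1.0 + math.cos(math.pi * progress))
    return end + (start - end) * weight


def sample_crop_scale(step, total_steps, rng, upper=1.0):
    """Draw a crop scale uniformly between the current lower bound and `upper`."""
    low = crop_scale_lower_bound(step, total_steps)
    return rng.uniform(low, upper)


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 250, 500, 750, 1000):
        low = crop_scale_lower_bound(step, 1000)
        scale = sample_crop_scale(step, 1000, rng)
        print(f"step={step:4d}  lower_bound={low:.3f}  sampled_scale={scale:.3f}")
```

In an image pipeline, the sampled scale would typically be used as the area fraction of a random resized crop applied to the synthetic images at each synthesis step; that wiring is left out here to keep the sketch dependency-free.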


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Zhiqiang Shen, Kun Zhang

Online access available for MBZUAI patrons