Synergy: System for Co-adaptive Goodput-Based Scheduling and Hybrid Parallelism of Deep Learning Jobs
Date of Award
4-30-2024
Document Type
Thesis
Degree Name
Master of Science in Machine Learning
Department
Machine Learning
First Advisor
Dr. Qirong Ho
Second Advisor
Dr. Samuel Horvath
Abstract
"In the realm of deep learning (DL), the development of increasingly complex models, such as large language models (LLMs), has escalated the need for sophisticated distributed computing approaches. To effectively train these expansive models, distributed deep learning across multiple graphics processing units (GPUs) is essential, aiming to reduce both computational time and memory requirements. However, a significant research challenge arises from the current limitations of scheduling systems. These systems often struggle to effectively allocate resources to DL jobs sharing the same clusters, lacking sensitivity to job-specific training progress and hyperparameters, while only considering a single view of distributed DL - data parallelism. This gap underscores a critical need for integrated solutions that combine advanced model parallelization with dynamic, goodput-aware scheduling to support the efficient training of large-scale DL models. This work presents Synergy, a system designed to improve the training of large DL models by integrating co-adaptive goodput-based scheduling with automatic 3D parallelism. Synergy aims to enhance training efficiency by focusing on goodput, a metric that optimizes both system throughput and statistical efficiency of training. This approach ensures an efficient use of computational resources, accelerating the convergence of DL jobs. The key contribution of Synergy is its architecture, which supports the demanding requirements of large model training through adaptive scheduling and parallelism. Synergy is structured into two main components: SynergyTask and SynergyScheduler. SynergyTask manages hyperparameter tuning and automatic parallelism for each DL job, while SynergyScheduler allocates resources across multiple tasks, optimizing for goodput. This structure makes Synergy user-friendly, requiring no user expertise in model parallelization or hyperparameter tuning in response to using distributed DL."
Recommended Citation
A. Sakip, "Synergy: System for Co-adaptive Goodput-Based Scheduling and Hybrid Parallelism of Deep Learning Jobs,", Apr 2024.
Comments
Thesis submitted to the Deanship of Graduate and Postdoctoral Studies
In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning
Advisors: Dr. Qirong Ho, Dr. Samuel Horvath
Online access available for MBZUAI patrons