Synergy: System for Co-adaptive Goodput-Based Scheduling and Hybrid Parallelism of Deep Learning Jobs

Date of Award

4-30-2024

Document Type

Thesis

Degree Name

Master of Science in Machine Learning

Department

Machine Learning

First Advisor

Dr. Qirong Ho

Second Advisor

Dr. Samuel Horvath

Abstract

"In the realm of deep learning (DL), the development of increasingly complex models, such as large language models (LLMs), has escalated the need for sophisticated distributed computing approaches. To effectively train these expansive models, distributed deep learning across multiple graphics processing units (GPUs) is essential, aiming to reduce both computational time and memory requirements. However, a significant research challenge arises from the current limitations of scheduling systems. These systems often struggle to effectively allocate resources to DL jobs sharing the same clusters, lacking sensitivity to job-specific training progress and hyperparameters, while only considering a single view of distributed DL - data parallelism. This gap underscores a critical need for integrated solutions that combine advanced model parallelization with dynamic, goodput-aware scheduling to support the efficient training of large-scale DL models. This work presents Synergy, a system designed to improve the training of large DL models by integrating co-adaptive goodput-based scheduling with automatic 3D parallelism. Synergy aims to enhance training efficiency by focusing on goodput, a metric that optimizes both system throughput and statistical efficiency of training. This approach ensures an efficient use of computational resources, accelerating the convergence of DL jobs. The key contribution of Synergy is its architecture, which supports the demanding requirements of large model training through adaptive scheduling and parallelism. Synergy is structured into two main components: SynergyTask and SynergyScheduler. SynergyTask manages hyperparameter tuning and automatic parallelism for each DL job, while SynergyScheduler allocates resources across multiple tasks, optimizing for goodput. This structure makes Synergy user-friendly, requiring no user expertise in model parallelization or hyperparameter tuning in response to using distributed DL."

Comments

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Qirong Ho, Samuel Horvath

Online access available for MBZUAI patrons
