COPUS: Co-adaptive Parallelism and Jobs Scheduling

Date of Award

4-30-2024

Document Type

Thesis

Degree Name

Master of Science in Machine Learning

Department

Machine Learning

First Advisor

Dr. Qirong Ho

Second Advisor

Dr. Karthik Nandakumar

Abstract

"Large-scale Deep Learning (DL) models demand significant computational resources and prolonged training times. Sevilla et al. [11] splits the history of compute in Machine Learning (ML) in three eras: the Pre Deep Learning Era, the Deep Learning Era, and the Large-Scale Era. Large DL models encapsulate intricate relationships within data, comprehends complex features and play a pivotal role in driving progress across diverse application domains. They have expanded the horizons of what ML can achieve. The Large-Scale Era signals the emerging of large DL models where their escalating sizes and intricacy present a compelling case for a more powerful and advanced parallelization and scheduling techniques. Efficiently training Large ML models, aims for the optimization of computational and memory resources and the incorporation of a diverse array of parallelization techniques. In 2022, Zheng et al. published the Alpa paper [16] a state of the art automated parallelism approach that combines inter-operator and intra-operator parallelism for distributed deep learning. Alpa generates automated execution plans that combine data, operator and pipeline parallelism. Alpa scale out large deep learning models on distributed compute nodes by incorporating the inter-and-intra operator hierarchies. Alpa is distinguished by employing a parallel execution plan that defines the optimal parallelization approach for each category of the parallelism hierarchy. Qiao et al. in AdaptDL paper [10] introduced the job scheduling approach by adaptively optimizing inter-dependent factors at the job level and the cluster level. AdaptDL models the jobs’ Goodput and dynamically re-assign resources to improve the cluster-wide resources utilization, fostering equity and fairness among deep learning jobs through the assessment of a meaningful metric, namely, Good- put. A multifaceted approach in addressing the challenges inherent in training large DL models is to adopt a co-adaptive approach seamlessly combining inter-operator and, intra- operator parallelism, cluster-wide scheduling and job-level optimizations. This integration of diverse strategies, provides a more resilient and agile training of large ML models resulting in the optimization of resources utilization, improved training efficiency and accelerated models convergence. In this thesis, we present COPUS: Coadaptive Parallelism and Jobs Scheduling where we aim to address a broader co-adaptive deep learning challenge within an expanded scope, incorporating 3D parallelism and scheduling into the decision space. We will be extending the problem-solving horizon to a larger and more diverse space and replacing the local parallelism problem in AdaptDL with the automated pipeline and intra-operator parallelism introduced by Alpa while optimizing the cluster resources optimization through the jobs scheduling. This co-adaptive framework sets the base to unlock new dimensions of efficiency and resources sharing across multiple deep learning jobs."

Comments

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Qirong Ho, Karthik Nandakumar

Online access available for MBZUAI patrons
