Deploying LLMs on Apple Silicon: Strategies to Enhance Efficiency and Performance

Date of Award

4-30-2024

Document Type

Thesis

Degree Name

Master of Science in Machine Learning

Department

Machine Learning

First Advisor

Dr. Shih-Hao Hung

Second Advisor

Dr. Abdulmotaleb Elsaddik

Abstract

This work explores the optimization of computer systems based on Apple Silicon, focusing on the deployment of large language models (LLMs), with LLaMA 2 as our case study. Unlike typical computer systems with discrete graphics processing units (GPUs), Apple Silicon integrates the GPU cores on-chip, allowing them to share system memory efficiently with the CPU cores. This shared-memory system-on-chip (SoC) approach is potentially well suited to LLMs in terms of memory capacity. Initial experiments were conducted to understand how LLM performance varies across frameworks, including llama.cpp and MLX. Using comprehensive profiling tools, the study dissects the system's operational processes to identify performance bottlenecks in various contexts. Notably, our LLM application built on the MLX framework grows progressively slower over time, with inference time eventually prolonged more than eightfold, although a system reboot restores the original performance. We suspected that certain unexpected behaviors in the MLX framework were seriously degrading performance. Because the MLX framework is complex, we abstracted the LLM into matrix multiplications and conducted multiple ablation experiments to validate our hypotheses; pinpointing the cause of the performance disparities revealed that the strategy for loading model parameters is a significant bottleneck in MLX. Conversely, llama.cpp's loading strategy effectively avoids this issue and is unaffected by factors such as reboots. The study provides valuable practical experience given the current popularity of LLMs and the SoC approach.
Furthermore, the insight I gained in completing this study significantly enhanced my understanding of computer-system optimization methodologies, from analyzing bottlenecks with performance-analysis tools to exploring and improving the underlying algorithmic principles, which will benefit future research in system performance. Moreover, deploying ML models on Apple devices represents a novel direction compared with traditional training on Nvidia-based machines, and it is likely to mark a lasting trend. Our work holds practical significance in this emerging field.
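The abstract describes reducing the LLM workload to repeated matrix multiplications and timing them over many iterations to expose the gradual slowdown. As an illustration only (not the thesis's actual benchmark code, which targets MLX and llama.cpp), the following minimal sketch uses NumPy as a stand-in to show the shape of such a microbenchmark: timing each iteration separately so that any per-iteration latency growth becomes visible.

```python
import time
import numpy as np

def matmul_benchmark(n: int = 512, repeats: int = 10) -> list[float]:
    """Time `repeats` successive n x n matrix multiplications.

    Returning per-iteration timings (rather than one aggregate) makes
    gradual degradation observable: if latency grows over time, the
    later entries of the list will exceed the earlier ones.
    """
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        _ = a @ b  # the abstracted workload: one dense matmul
        timings.append(time.perf_counter() - start)
    return timings

if __name__ == "__main__":
    times = matmul_benchmark()
    print(f"first: {times[0]:.6f}s  last: {times[-1]:.6f}s")
```

In the MLX setting, the same harness would additionally need to force evaluation of the lazily computed result each iteration; otherwise the timings would measure graph construction rather than the multiplication itself.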

Comments

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Shih-Hao Hung, Abdulmotaleb Elsaddik

Online access available for MBZUAI patrons
