Deploying LLMs on Apple Silicon: Strategies to Enhance Efficiency and Performance
Date of Award
4-30-2024
Document Type
Thesis
Degree Name
Master of Science in Machine Learning
Department
Machine Learning
First Advisor
Dr. Shih-Hao Hung
Second Advisor
Dr. Abdulmotaleb Elsaddik
Abstract
"This work explores the optimization of computer systems based on Apple Silicon, with a focus on implementing large language models (LLMs), for which LLaMa2 is used as our case study. Unlike typical computer systems with independent graphical processing units (GPUs), the Apple Silicon tightly integrates GPU cores on-chip, which enables the GPU cores to efficiently access the system memory with the CPU cores. Potentially, this shared-memory system-on-chip (SoC) approach is friendly to LLMs as far as the memory capacity is concerned. Initial experiments have been conducted to understand the performance variations of LLMs across different frameworks, including LLaMa.cpp and MLX. Through comprehensive profiling tools, the study dissects the operational processes of the system to identify the performance bottlenecks in various contexts. Interestingly, our LLM application with the MLX framework runs slower and slower over time, and the inference time is eventually prolonged by over 8 folds, but a system reboot recovers the performance. We suspect that certain unexpected behaviors in the MLX framework seriously impact the performance. However, as the MLX framework is complicated, to pinpoint the cause for performance disparities, we abstract the LLM into matrix multiplications and conduct multiple ablation experiments to validate our hypotheses, revealing that the strategy for loading model parameters is a significant bottleneck in the MLX framework. Conversely, the llama.cpp framework's strategy effectively addresses this issue, unaffected by factors such as reboots. The study provides valuable practical experience given the current popularity of LLMs and the SoC approach. Furthermore, the insight I gained in completing this study significantly enhanced my understanding of computer system optimization methodologies, from analyzing bottlenecks with performance analysis tools to exploring and improving underlying algorithmic principles, which benefit future research in system performance. Moreover, applying ML models on Apple devices represents a novel direction compared to traditional training on Nvidia-based machines, likely marking an unstoppable future trend. Our work holds practical significance in this emerging field."
Recommended Citation
X. Ke, "Deploying LLMs on Apple Silicon: Strategies to Enhance Efficiency and Performance,", Apr 2024.
Comments
Thesis submitted to the Deanship of Graduate and Postdoctoral Studies in partial fulfilment of the requirements for the M.Sc. degree in Machine Learning
Advisors: Shih-Hao Hung, Abdulmotaleb Elsaddik
Online access available for MBZUAI patrons