FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Approach for Heterogeneous Edge Devices
Document Type
Article
Publication Title
IEEE Transactions on Mobile Computing
Abstract
With the increasing proliferation of Internet-of-Things (IoT) devices, there is a growing trend towards distributing the power of deep learning (DL) among edge devices rather than centralizing it in the cloud. To deploy deep and complex models on edge devices with limited resources, model partitioning of deep neural network (DNN) models has been widely studied. However, most of the existing literature only considers distributing the inference model while still training the model in the cloud. In this paper, we propose FTPipeHD, a novel DNN training approach that trains DNN models across distributed heterogeneous devices with a fault-tolerance mechanism. To accelerate training under the time-varying computing power of each device, we optimize the partition points dynamically according to real-time computing capacities. We also propose a novel weight redistribution approach that periodically replicates the weights to both the neighboring nodes and the central node, which combats the failure of multiple devices during training while incurring limited communication costs. Our numerical results demonstrate that FTPipeHD is 6.8 times faster in training than the state-of-the-art method when the computing capacity of the best device is 10 times that of the worst one. It is also shown that the proposed method is able to accelerate training even in the presence of device failures.
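The abstract does not spell out how the partition points are computed. As an illustration only, the idea of choosing partition points from device capacities might be sketched as follows. The function name `partition_layers`, the per-layer cost estimates, and the capacity values are hypothetical and are not taken from the paper. The sketch assigns contiguous layers to devices in pipeline order so that the slowest stage is as fast as possible.

```python
from typing import List


def partition_layers(layer_costs: List[float], capacities: List[float]) -> List[int]:
    """Return partition points for a pipeline of len(capacities) stages.

    Stage d runs on device d and covers layers [points[d], points[d + 1]).
    The estimated stage time is (sum of its layer costs) / capacity, and the
    partition minimizes the largest stage time. A stage may be empty.
    This is an illustrative dynamic program, not the paper's algorithm.
    """
    n, k = len(layer_costs), len(capacities)

    # prefix[j] = total cost of the first j layers
    prefix = [0.0]
    for cost in layer_costs:
        prefix.append(prefix[-1] + cost)

    inf = float("inf")
    # best[d][j]: smallest bottleneck time when the first d devices
    # cover the first j layers; cut[d][j]: where device d's stage starts.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0

    for d in range(1, k + 1):
        for j in range(n + 1):
            for i in range(j + 1):
                if best[d - 1][i] == inf:
                    continue
                stage_time = (prefix[j] - prefix[i]) / capacities[d - 1]
                bottleneck = max(best[d - 1][i], stage_time)
                if bottleneck < best[d][j]:
                    best[d][j] = bottleneck
                    cut[d][j] = i

    # Walk the recorded cuts backwards to recover the partition points.
    points = [n]
    j = n
    for d in range(k, 0, -1):
        j = cut[d][j]
        points.append(j)
    return points[::-1]


# Hypothetical example: 11 equal-cost layers, one device 10x faster than the other.
fast_slow = partition_layers([1.0] * 11, [10.0, 1.0])
balanced = partition_layers([1.0] * 4, [1.0, 1.0])
```

In a setting with time-varying capacities, a routine like this could be re-run whenever fresh capacity measurements arrive; how and when FTPipeHD actually repartitions is described in the paper itself, not here.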
First Page
1
Last Page
13
DOI
10.1109/TMC.2023.3272567
Publication Date
6-1-2023
Keywords
Computational modeling, Data models, Distributed training, Edge training, Fault tolerance, Fault tolerant systems, Load modeling, Servers, Training
Recommended Citation
Y. Chen, Q. Yang, S. He, Z. Shi, J. Chen and M. Guizani, "FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Approach for Heterogeneous Edge Devices," in IEEE Transactions on Mobile Computing, Feb 2023, doi: 10.1109/TMC.2023.3272567.
Comments
IR Deposit conditions:
OA version (pathway a) Accepted version
No embargo
When accepted for publication, set statement to accompany deposit (see policy)
Must link to publisher version with DOI
Publisher copyright and source must be acknowledged