Topologies in distributed machine learning: Comprehensive survey, recommendations and future directions

Document Type

Article

Publication Title

Neurocomputing

Abstract

With the widespread adoption of distributed machine learning (DML), many IT companies have built networks dedicated to DML workloads. Different DML communication architectures exhibit different traffic patterns and place different demands on network performance, all of which are closely tied to the network topology. However, traditional network topologies are designed for general-purpose goals and are agnostic to the specific communication patterns of the applications they carry. A mismatch between the network topology and the application directly degrades training performance. Although some studies have analyzed the effect of topology on training performance, the topologies and communication architectures they cover are not comprehensive, and it remains unclear which topology suits which communication architecture. This survey investigates typical topologies and analyzes whether they meet the requirements of three commonly used DML communication architectures (i.e., the Parameter Server (PS), Tree, and Ring architectures). Specifically, the topology requirements of each communication architecture, together with two requirements common to all of them (i.e., high scalability and fault tolerance), are studied first. Next, each topology is analyzed against these requirements. The paper then discusses potential technologies and approaches for constructing a scheme that satisfies each requirement, and presents DMLNet, a novel network topology that suits all three communication architectures. Finally, several potential directions for future research are outlined.
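The abstract's point that the PS, Tree, and Ring architectures produce different traffic patterns can be made concrete with a back-of-the-envelope comparison. The following Python sketch is not taken from the paper; it uses the standard textbook traffic estimates for one gradient-aggregation round (a single parameter server versus ring all-reduce) purely to illustrate why the two architectures stress a topology differently.

```python
# Minimal sketch (illustrative only, not the paper's model): per-node traffic for
# one aggregation round with N workers, each holding a gradient of M bytes.

def ps_traffic(num_workers: int, grad_bytes: int) -> dict:
    """Parameter Server: every worker pushes its gradient and pulls the update,
    so the server's links carry N times the per-worker volume."""
    worker_io = 2 * grad_bytes                   # push gradient + pull model
    server_io = 2 * num_workers * grad_bytes     # aggregate all pushes and pulls
    return {"per_worker": worker_io, "server": server_io}

def ring_traffic(num_workers: int, grad_bytes: int) -> dict:
    """Ring all-reduce (scatter-reduce + all-gather): traffic is spread evenly,
    roughly 2(N-1)/N of the gradient per worker, with no central hotspot."""
    per_worker = 2 * (num_workers - 1) * grad_bytes / num_workers
    return {"per_worker": per_worker, "server": 0}

if __name__ == "__main__":
    N, M = 16, 100 * 2**20   # hypothetical values: 16 workers, 100 MiB gradient
    print("PS:  ", ps_traffic(N, M))
    print("Ring:", ring_traffic(N, M))
```

Under these assumptions the PS architecture concentrates load on the server's uplinks (favoring topologies with high bisection bandwidth toward a hub), while the Ring architecture spreads nearly uniform load over neighbor-to-neighbor links, which is the kind of topology-architecture mismatch the survey examines.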

DOI

10.1016/j.neucom.2023.127009

Publication Date

1-28-2024

Keywords

Distributed Machine Learning (DML), Network topology, Parameter Server (PS) architecture, Ring architecture, Training performance, Tree architecture

