A Distributed Neural Network Training Method Based on Hybrid Gradient Computing

Zhen Lu; Meng Lu; Yan Liang

doi:10.12694/scpe.v21i2.1727

Authors

Zhen Lu, Guangzhou College of South China University of Technology, Guangzhou, China
Meng Lu, China Mobile (Suzhou) Software Technology Co., Suzhou, China
Yan Liang, Alibaba Group, Guangzhou, China

DOI:

https://doi.org/10.12694/scpe.v21i2.1727

Keywords:

Deep learning, gradient descent, mixed precision computing, distributed training

Abstract

Industrial applications of deep learning often require training large neural networks on large data sets. However, larger networks and data sets lengthen training time, which slows both algorithm research and engineering development. Data-parallel distributed training is a commonly used solution, but it is still at the stage of technical exploration. In this paper, we study how to improve the accuracy and speed of distributed training, and propose a distributed training strategy based on hybrid gradient computing. Specifically, in the gradient descent stage, we propose a hybrid method that combines a new warmup scheme with the linear-scaling stochastic gradient descent (SGD) algorithm to improve training accuracy and convergence rate. We also adopt mixed-precision gradient computing: in both single-GPU gradient computation and inter-GPU gradient synchronization, we mix single precision (FP32) and half precision (FP16), which speeds up both single-GPU training and inter-GPU communication. By integrating these training strategies with a system engineering implementation, we trained ResNet-50 in 20 minutes on a cluster of 24 V100 GPUs, reaching 75.6% Top-1 accuracy and 97.5% GPU scaling efficiency. In addition, this paper proposes a new criterion for evaluating distributed training efficiency, the actual average single-GPU training time, which assesses the improvement due to the training method itself more reasonably than measuring only the performance gained from adding GPUs. By this criterion, our method outperforms existing methods.
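As a rough illustration of two quantities named in the abstract, the sketch below shows the widely used linear-scaling rule with a plain linear warmup, plus the usual definition of GPU scaling efficiency. It is not the paper's new warmup scheme, which the abstract does not specify. All function names, the reference batch size of 256, the per-GPU batch of 64, the warmup length, and the throughput figures are hypothetical values chosen for the example.

```python
def scaled_lr(base_lr: float, base_batch: int, global_batch: int) -> float:
    """Linear-scaling rule: the learning rate grows in proportion to the global batch size."""
    return base_lr * global_batch / base_batch


def lr_at_step(step: int, warmup_steps: int, start_lr: float, target_lr: float) -> float:
    """Plain linear warmup from start_lr to target_lr, then constant at target_lr.

    Assumption: this is the common linear warmup, not the scheme proposed in the paper.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps


def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    """Measured N-GPU throughput divided by the ideal N times single-GPU throughput."""
    return throughput_n / (n_gpus * throughput_1)


if __name__ == "__main__":
    # Hypothetical setup: 24 GPUs with 64 samples each, reference lr 0.1 at batch 256.
    target = scaled_lr(base_lr=0.1, base_batch=256, global_batch=24 * 64)
    for step in (0, 250, 500, 1000):
        print(step, lr_at_step(step, warmup_steps=500, start_lr=0.1, target_lr=target))
    # Hypothetical throughputs in images per second.
    print(scaling_efficiency(throughput_n=23400.0, throughput_1=1000.0, n_gpus=24))
```

Starting the warmup at the single-GPU learning rate and ramping to the scaled value is one common choice; other start values or ramp shapes fit the same interface by changing `lr_at_step`.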

Author Biography

Yan Liang, Alibaba Group, Guangzhou, China

He received his Ph.D. degree from Sun Yat-Sen University in 2013. His research interests include image processing and machine learning.
