How to Distribute Deep Learning Model Training?

A quick refresher on distributed training using tf.distribute.Strategy

Renu Khandelwal

--

This post explains why you need to distribute model training, the different distribution strategies and how they work, and finally how to apply them using tf.distribute.Strategy.

Why do you need to distribute model training?

  • Training on a single GPU takes considerably longer than training across multiple GPUs.
  • Current deep learning models are becoming more complex, with millions to billions of parameters, and ever more data is available, so training on a single device becomes prohibitively time-consuming; distributing the training across multiple GPUs addresses this.
  • Distributed training takes less time, which enables faster experimentation.

How is distributed training achieved?

Training is distributed by applying parallelism at two levels:

  1. Hardware platform: you can train the model on a single machine with multiple GPUs/TPUs, or on multiple machines in a network, each with one or more GPUs/TPUs.

  2. Data parallelism: the data is partitioned between different workers. Each worker holds a complete copy of the model and operates on its own subset of the data, and the model updates are synchronized across workers, as shown in the sketch below.
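
As a quick illustration, here is a minimal sketch of synchronous data parallelism on a single machine using tf.distribute.MirroredStrategy. The model, layer sizes, and random data are placeholders for illustration only, not part of the original post.

```python
import tensorflow as tf

# MirroredStrategy replicates the full model onto every visible GPU
# (falling back to CPU if none are found), splits each batch across
# the replicas, and keeps the copies in sync by aggregating gradients
# after every step.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables (the model and optimizer) must be created inside the
# strategy's scope so they are mirrored across devices.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Dummy data for illustration; replace with a real tf.data pipeline.
x = tf.random.normal((1024, 10))
y = tf.random.normal((1024, 1))

# model.fit automatically distributes each batch across the replicas.
model.fit(x, y, batch_size=64, epochs=2)
```

The key point is that each replica sees a different slice of every batch while sharing the same weights, which is exactly the data-parallel setup described above.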
