How to Distribute Deep Learning Model Training?

A quick refresher on distributed training using tf.distribute.Strategy

Renu Khandelwal

--

This post explains why you need to distribute model training, the different distribution strategies and how they work, and finally how to apply them using tf.distribute.Strategy.

Why do you need to distribute model training?

  • Training on a single GPU takes considerably longer than training across multiple GPUs.
  • Current deep learning models are becoming more complex, with millions to billions of parameters, and ever more data is available, so training on a single device becomes prohibitively time-consuming; distributing the training across multiple GPUs addresses this.
  • Distributed training takes less time, which enables faster experimentation.

How is distributed training achieved?

Training is distributed by applying parallelism at two levels:

  1. Hardware platform: you can train the model on a single machine with multiple GPUs/TPUs, or on multiple machines in a network, each with one or more GPUs/TPUs.

  2. Data parallelism: the data is partitioned between different workers. Each worker holds a complete copy of the model and operates on its own subset of the data, and the model updates are synchronized across workers, as shown in the sketch below.
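
As a quick illustration, here is a minimal sketch of synchronous data parallelism on a single machine using tf.distribute.MirroredStrategy. The model, layer sizes, and random data are placeholders for illustration only, not part of the original post.

```python
import tensorflow as tf

# MirroredStrategy replicates the full model onto every visible GPU
# (falling back to CPU if none are found), splits each batch across
# the replicas, and keeps the copies in sync by aggregating gradients
# after every step.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables (the model and optimizer) must be created inside the
# strategy's scope so they are mirrored across devices.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Dummy data for illustration; replace with a real tf.data pipeline.
x = tf.random.normal((1024, 10))
y = tf.random.normal((1024, 1))

# model.fit automatically distributes each batch across the replicas.
model.fit(x, y, batch_size=64, epochs=2)
```

The key point is that each replica sees a different slice of every batch while sharing the same weights, which is exactly the data-parallel setup described above.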
