How to Distribute Deep Learning Model Training?
A quick refresher on distributed training using tf.distribute.Strategy
Oct 13, 2021
This post covers why you need to distribute model training, the different distribution strategies and how they work, and finally how to apply them using tf.distribute.Strategy.
Why distribute model training?
- Training on a single GPU takes considerably longer than training the same model across multiple GPUs.
- Modern deep learning models are increasingly complex, with millions to billions of parameters, and are trained on ever-larger datasets. Training on a single device becomes prohibitively slow, so the work needs to be spread across multiple GPUs.
- Distributed training shortens training time and therefore enables faster experimentation.
How is distributed training achieved?
Distributed training is achieved by applying parallelism on:
1. Hardware platform: You can train the model on a single machine with multiple GPUs/TPUs, or across multiple machines in a network, each with one or more GPUs/TPUs.
2. Data Parallelism partitions the data between different workers. Each worker has a complete copy of the model and operates on a subset of data. The model updates are synchronized…
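In TensorFlow, synchronous data parallelism on a single machine with multiple GPUs is expressed with tf.distribute.MirroredStrategy. The sketch below is a minimal illustration; the toy model, random data, and batch size are placeholder assumptions, not part of the original post.

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every local GPU and keeps the
# replicas in sync by aggregating gradients after each step.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas in sync:", strategy.num_replicas_in_sync)

# Variables (i.e. the model) must be created inside the strategy scope so
# that each replica gets its own mirrored copy.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(10,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Placeholder data: each replica processes a different slice of every
# global batch, which is the essence of data parallelism.
x = tf.random.normal((1024, 10))
y = tf.random.normal((1024, 1))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(64)

model.fit(dataset, epochs=2)
```

With model.fit, Keras handles splitting each batch across replicas and averaging the gradients, so the training loop itself does not need to change.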