
tf.distribute 101: Training Keras on Multiple Devices and Machines


by TensorFlow - [Technical Documentation] June 13th, 2025

This guide explains how to scale Keras model training using TensorFlow’s tf.distribute API, covering both single-host multi-GPU and multi-worker setups with performance and fault tolerance tips.

Content Overview

  • Introduction
  • Setup
  • Single-host, multi-device synchronous training
  • Using callbacks to ensure fault tolerance
  • tf.data performance tips
  • Multi-worker distributed synchronous training
  • Example: code running in a multi-worker setup
  • Further reading

Introduction

There are generally two ways to distribute computation across multiple devices:

Data parallelism, where a single model is replicated on multiple devices or multiple machines. Each replica processes a different batch of data, and the results are then merged. Many variants of this setup exist, differing in how the model replicas merge their results and in whether they stay in sync at every batch or are more loosely coupled (a code sketch of the synchronous variant follows these two descriptions).

Model parallelism, where different parts of a single model run on different devices, together processing a single batch of data.
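
On a single host, synchronous data parallelism is what tf.distribute.MirroredStrategy provides: one model replica per visible GPU, each batch split across the replicas, and gradients merged before every weight update. Below is a minimal sketch of that pattern; the model architecture, dataset, and hyperparameters are illustrative placeholders, not part of the original guide.

import tensorflow as tf
from tensorflow import keras

# One model replica per visible GPU on this host (falls back to one replica on CPU).
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas in sync:", strategy.num_replicas_in_sync)

def build_model():
    # A small placeholder model; substitute your own architecture.
    inputs = keras.Input(shape=(784,))
    x = keras.layers.Dense(256, activation="relu")(inputs)
    outputs = keras.layers.Dense(10)(x)
    return keras.Model(inputs, outputs)

# Everything that creates variables (model, optimizer, metrics) must be
# built inside the strategy scope so the variables are mirrored across devices.
with strategy.scope():
    model = build_model()
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )

# Placeholder data; model.fit() takes care of splitting each batch across replicas.
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

model.fit(x_train, y_train, batch_size=256, epochs=2)

The same code runs unchanged on a machine with one GPU or none; only the number of replicas changes, which is why this is the usual starting point before moving to the multi-worker setup covered later in the guide.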

