
tf.distribute 101: Training Keras on Multiple Devices and Machines


by TensorFlow - [Technical Documentation] June 13th, 2025

This guide explains how to scale Keras model training using TensorFlow’s tf.distribute API, covering both single-host multi-GPU and multi-worker setups with performance and fault tolerance tips.

Content Overview

  • Introduction
  • Setup
  • Single-host, multi-device synchronous training
  • Using callbacks to ensure fault tolerance
  • tf.data performance tips
  • Multi-worker distributed synchronous training
  • Example: code running in a multi-worker setup
  • Further reading

Introduction

There are generally two ways to distribute computation across multiple devices:

Data parallelism, where a single model is replicated on multiple devices or multiple machines. Each replica processes a different batch of data, and the results are then merged. Many variants of this setup exist, differing in how the model replicas merge their results and in whether they stay in sync at every batch or are more loosely coupled (a code sketch of the synchronous variant follows these two descriptions).

Model parallelism, where different parts of a single model run on different devices, together processing a single batch of data.
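
On a single host, synchronous data parallelism is what tf.distribute.MirroredStrategy provides: one model replica per visible GPU, each batch split across the replicas, and gradients merged before every weight update. Below is a minimal sketch of that pattern; the model architecture, dataset, and hyperparameters are illustrative placeholders, not part of the original guide.

import tensorflow as tf
from tensorflow import keras

# One model replica per visible GPU on this host (falls back to one replica on CPU).
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas in sync:", strategy.num_replicas_in_sync)

def build_model():
    # A small placeholder model; substitute your own architecture.
    inputs = keras.Input(shape=(784,))
    x = keras.layers.Dense(256, activation="relu")(inputs)
    outputs = keras.layers.Dense(10)(x)
    return keras.Model(inputs, outputs)

# Everything that creates variables (model, optimizer, metrics) must be
# built inside the strategy scope so the variables are mirrored across devices.
with strategy.scope():
    model = build_model()
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )

# Placeholder data; model.fit() takes care of splitting each batch across replicas.
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

model.fit(x_train, y_train, batch_size=256, epochs=2)

The same code runs unchanged on a machine with one GPU or none; only the number of replicas changes, which is why this is the usual starting point before moving to the multi-worker setup covered later in the guide.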

