Scaling MoE inference with NVIDIA Dynamo on Google Cloud A4X
As organizations transition from standard LLMs to massive Mixture-of-Experts (MoE) architectures like DeepSeek-R1, the primary constraint has shifted from raw compute density to communication latency and memory bandwidth. Today, we’re releasing two new validated recipes designed to help customers overcome the infrastructure bottlenecks of the agentic AI era. Built on the A4X machine series, powered by NVIDIA GB200 NVL72 and NVIDIA Dynamo, these recipes provide clear steps to optimize both throughput and latency, and they extend the reference architecture we published in September 2025 for disaggregated inference on A3 Ultra (NVIDIA H200) VMs.
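At the heart of these recipes is disaggregated inference: the compute-bound prefill stage and the memory-bandwidth-bound decode stage run on separate worker pools that can be sized and scaled independently, with the KV cache handed off between them over the rack-scale interconnect. The Python sketch below is only a conceptual illustration of that split; the class names (`PrefillWorker`, `DecodeWorker`, `DisaggregatedRouter`) are hypothetical placeholders and do not reflect NVIDIA Dynamo’s actual API.

```python
# Illustrative sketch of disaggregated prefill/decode serving. All names are
# hypothetical and only model the idea of separate prefill and decode worker
# pools exchanging a KV-cache handle; this is not Dynamo's real interface.
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator


@dataclass
class KVCacheHandle:
    """Reference to prefill output (KV cache) that a decode worker can pull."""
    request_id: str
    worker_id: int
    num_tokens: int


class PrefillWorker:
    """Compute-bound stage: processes the full prompt in one pass."""
    def __init__(self, worker_id: int) -> None:
        self.worker_id = worker_id

    def prefill(self, request_id: str, prompt: str) -> KVCacheHandle:
        # In a real system this runs the forward pass over all prompt tokens
        # and leaves the resulting KV cache in device or pooled memory.
        num_tokens = len(prompt.split())
        return KVCacheHandle(request_id, self.worker_id, num_tokens)


class DecodeWorker:
    """Memory-bandwidth-bound stage: generates output tokens step by step."""
    def __init__(self, worker_id: int) -> None:
        self.worker_id = worker_id

    def decode(self, kv: KVCacheHandle, max_new_tokens: int) -> Iterator[str]:
        # A real decode worker would first pull the KV cache over the
        # interconnect (e.g. NVLink within an NVL72 rack) before generating.
        for step in range(max_new_tokens):
            yield f"<tok{step}>"


class DisaggregatedRouter:
    """Round-robins requests across independent prefill and decode pools,
    so each pool can be scaled for its own bottleneck."""
    def __init__(self, prefill_pool: list[PrefillWorker],
                 decode_pool: list[DecodeWorker]) -> None:
        self._prefill = cycle(prefill_pool)
        self._decode = cycle(decode_pool)

    def serve(self, request_id: str, prompt: str,
              max_new_tokens: int) -> list[str]:
        kv = next(self._prefill).prefill(request_id, prompt)
        return list(next(self._decode).decode(kv, max_new_tokens))


if __name__ == "__main__":
    router = DisaggregatedRouter(
        prefill_pool=[PrefillWorker(i) for i in range(2)],
        decode_pool=[DecodeWorker(i) for i in range(4)],
    )
    print(router.serve("req-1", "Explain mixture-of-experts routing", 4))
```

The key design point the sketch captures is that the two pools are provisioned separately: adding prefill workers improves time-to-first-token under long prompts, while adding decode workers improves sustained token throughput, which is how a recipe can target either latency or throughput.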
We’re bringing together the best of both worlds: the multi-layered scalability of Google Cloud’s AI infrastructure and the rack-scale acceleration of A4X. These recipes are part of a broader collaboration between our organizations that includes investments in important inference infrastructure such as Dynamic Resource Allocation (DRA) and Inference Gateway.
Highlights ...

