
Five techniques to reach the efficient frontier of LLM inference


Every dollar you spend on model inference buys you a position on a graph of latency versus throughput. On that plot sits a curve of optimal configurations, the points where you've squeezed the maximum possible performance out of your hardware. That curve, a concept borrowed from portfolio theory in finance, is the efficient frontier.

For a fixed hardware budget, you can trade latency for throughput along that curve, but you can't improve one without sacrificing the other unless the frontier itself moves. That distinction points to two fundamentally different dynamics, and it is the central insight for anyone running LLMs in production.
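To see the trade-off concretely, here is a minimal sketch (not from the original article) of how batch size moves you along a latency/throughput curve under a simplified, hypothetical serving cost model; the constants are made up, and real numbers depend on hardware and model size.

```python
# Hypothetical per-step cost model: a fixed overhead plus a marginal cost per
# sequence in the batch. Both constants are assumptions for illustration.
FIXED_OVERHEAD_MS = 20.0   # scheduling / kernel-launch overhead per decode step
PER_SEQUENCE_MS = 4.0      # marginal decode cost per sequence in the batch


def step_latency_ms(batch_size: int) -> float:
    """Latency of one decode step for a given batch size."""
    return FIXED_OVERHEAD_MS + PER_SEQUENCE_MS * batch_size


def throughput_tokens_per_s(batch_size: int) -> float:
    """Tokens produced per second: each step yields one token per sequence."""
    return batch_size / (step_latency_ms(batch_size) / 1000.0)


# Sweep batch sizes to trace out points along the latency/throughput curve.
for bs in (1, 2, 4, 8, 16, 32, 64):
    print(f"batch={bs:3d}  latency={step_latency_ms(bs):6.1f} ms  "
          f"throughput={throughput_tokens_per_s(bs):7.0f} tok/s")
```

Larger batches amortize the fixed overhead and raise throughput, but every request in the batch pays the longer step latency; the set of batch sizes you can't improve on in both dimensions at once is your frontier.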

The first dynamic is getting to the frontier, which means applying the full stack of techniques available today. This part is within your control: continuous batching, paged attention, intelligent routing, speculative decoding, and quantization all exist right now. If you're not using ...
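To make the first technique on that list concrete, here is a minimal, hypothetical sketch of the idea behind continuous batching: rather than waiting for a whole batch to finish, the scheduler admits new requests into the running batch whenever a slot frees up. The names and the toy decode loop are illustrative, not a real serving API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    id: int
    tokens_remaining: int  # decode steps left for this sequence


def serve(requests: list[Request], max_batch: int = 4) -> None:
    waiting = deque(requests)
    running: list[Request] = []
    step = 0
    while waiting or running:
        # Continuously refill the batch from the queue as slots free up,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step: every running sequence produces one token.
        for req in running:
            req.tokens_remaining -= 1
        finished = [r for r in running if r.tokens_remaining == 0]
        running = [r for r in running if r.tokens_remaining > 0]
        step += 1
        for r in finished:
            print(f"step {step}: request {r.id} finished")


serve([Request(i, n) for i, n in enumerate([3, 8, 2, 5, 4, 6])])
```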

