Researchers baked 3x inference speedups directly into LLM weights — without speculative decoding

2 weeks, 3 days ago venturebeat

As agentic AI workflows multiply the cost and latency of long reasoning chains, a team from the University of Maryland, Lawrence Livermore National Labs, Columbia University and TogetherAI has found a way to bake 3x throughput gains directly into a model's weights.

Unlike speculative decoding, which requires a separate drafting model, this approach requires no additional infrastructure — just a single special token added to the model's existing architecture.

The limits of next-token prediction

Next-token prediction — generating text one token per forward pass — creates a throughput ceiling that becomes painfully expensive when models need to produce thousands of tokens. This bottleneck is especially problematic in reasoning models, which frequently generate thousands of “chain of thought” tokens before producing the final response, leading to a slow and expensive user experience.

Multi-token prediction (MTP) offers an alternative training paradigm that allows a language model to produce multiple tokens simultaneously in a ...

Copyright of this story solely belongs to venturebeat . To see the full text click HERE

The limits of next-token prediction

Share: