Researchers baked 3x inference speedups directly into LLM weights — without speculative decoding
venturebeatAs agentic AI workflows multiply the cost and latency of long reasoning chains, a team from the University of Maryland, Lawrence Livermore National Labs, Columbia University and TogetherAI has found a way to bake 3x throughput gains directly into a model's weights.
Unlike speculative decoding, which requires a separate drafting model, this approach requires no additional infrastructure — just a single special token added to the model's existing architecture.
The limits of next-token prediction
Next-token prediction — generating text one token per forward pass — creates a throughput ceiling that becomes painfully expensive when models need to produce thousands of tokens. This bottleneck is especially problematic in reasoning models, which frequently generate thousands of “chain of thought” tokens before producing the final response, leading to a slow and expensive user experience.
Multi-token prediction (MTP) offers an alternative training paradigm that allows a language model to produce multiple tokens simultaneously in a ...
Copyright of this story solely belongs to venturebeat . To see the full text click HERE

