Training Time Comparison: Multi-Token vs. Next-Token Prediction
This table (S5) quantifies the training time overhead of multi-token prediction relative to next-token prediction, demonstrating its computational efficiency across different LLM sizes.

This paper is available on arxiv under CC BY 4.0 DEED license.
Authors:
(1) Fabian Gloeckle, FAIR at Meta and CERMICS Ecole des Ponts ParisTech, equal contribution;
(2) Badr Youbi Idrissi, FAIR at Meta and LISN Université Paris-Saclay, equal contribution;
(3) Baptiste Rozière, FAIR at Meta;
(4) David Lopez-Paz, FAIR at Meta, last author;
(5) Gabriel Synnaeve, FAIR at Meta, last author.