Accelerating LLM inference with post-training weight and activation quantization using AWQ and GPTQ on Amazon SageMaker AI
Foundation models (FMs) and large language models (LLMs) have been rapidly scaling, often doubling in parameter count within months, leading to significant improvements in language understanding and generative capabilities. This rapid growth comes with steep costs: inference now requires enormous memory capacity, high-performance GPUs, and substantial energy consumption.

This trend is evident in the open source space. In 2023, TII-UAE released Falcon 180B, the largest open model at the time. Meta surpassed that in 2024 with Llama 3.1, a 405B dense model. As of mid-2025, the largest publicly available model is DeepSeek (V3 – Instruct variant, R1 – Reasoning variant), a mixture of experts (MoE) architecture with 671 billion total parameters—of which 37 billion are active per token.

These models deliver state-of-the-art performance across a wide range of tasks, including multi-modal search, code generation, summarization, idea generation, logical reasoning, and even PhD-level problem solving. Despite their value, deploying such models ...
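To make the memory pressure concrete, the following is a minimal back-of-the-envelope sketch (not part of the original post) that estimates the GPU memory needed just to hold the weights of the models mentioned above, comparing half precision against 4-bit weight quantization such as AWQ or GPTQ. The parameter counts come from the figures cited in the text; real deployments also need headroom for KV cache, activations, and runtime overhead, and MoE models must keep all expert weights resident even though only a fraction are active per token.

```python
def weight_memory_gib(num_params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory (GiB) required to store the model weights alone."""
    total_bytes = num_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


# Parameter counts as cited in the article (illustrative only).
models = {
    "Falcon 180B (dense)": 180,
    "Llama 3.1 405B (dense)": 405,
    "DeepSeek V3/R1 (671B total, MoE)": 671,
}

for name, params_b in models.items():
    fp16 = weight_memory_gib(params_b, 16)  # common half-precision serving baseline
    int4 = weight_memory_gib(params_b, 4)   # e.g., 4-bit AWQ/GPTQ weight quantization
    print(f"{name}: ~{fp16:,.0f} GiB at FP16 vs ~{int4:,.0f} GiB at 4-bit")
```

Even under these simplifying assumptions, 4-bit quantization cuts the weight footprint by roughly 4x, which is the core motivation for the post-training quantization techniques discussed in this post.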

