Boosting LLM Decode Throughput: vAttention vs. PagedAttention
Discover how vAttention's use of FlashAttention's vanilla kernel over a contiguous KV-cache delivers superior decode performance compared to paged kernels, and why that matters for portability.
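The contrast the article draws is easiest to see in kernel code. Below is a minimal sketch, in illustrative CUDA, of the per-token key lookup a decode step performs: with a contiguous KV-cache, a vanilla kernel reaches each key with plain pointer arithmetic, while a PagedAttention-style kernel must first resolve the token's physical block through a block table. The kernel names, sizes, and layout here are assumptions for illustration, not code from vAttention, FlashAttention, or vLLM.

```cuda
// Illustrative sketch: contiguous vs. paged KV-cache key lookup at decode time.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int HEAD_DIM   = 64;   // illustrative head size
constexpr int BLOCK_SIZE = 16;   // tokens per KV block in the paged layout

// Contiguous layout: the key of token t lives at k[t * HEAD_DIM].
__global__ void scores_contiguous(const float* q, const float* k,
                                  float* out, int seq_len) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= seq_len) return;
    float s = 0.f;
    for (int d = 0; d < HEAD_DIM; ++d)
        s += q[d] * k[t * HEAD_DIM + d];          // plain pointer arithmetic
    out[t] = s;
}

// Paged layout: the kernel must first resolve the token's physical block
// through a block table before it can touch the key.
__global__ void scores_paged(const float* q, const float* k_blocks,
                             const int* block_table, float* out, int seq_len) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= seq_len) return;
    int phys = block_table[t / BLOCK_SIZE];        // extra indirection
    int slot = t % BLOCK_SIZE;
    const float* k = k_blocks + (phys * BLOCK_SIZE + slot) * HEAD_DIM;
    float s = 0.f;
    for (int d = 0; d < HEAD_DIM; ++d)
        s += q[d] * k[d];
    out[t] = s;
}

int main() {
    const int T = 128;                             // decode-time context length
    float *q, *k, *out; int *bt;
    cudaMallocManaged(&q, HEAD_DIM * sizeof(float));
    cudaMallocManaged(&k, T * HEAD_DIM * sizeof(float));
    cudaMallocManaged(&out, T * sizeof(float));
    cudaMallocManaged(&bt, (T / BLOCK_SIZE) * sizeof(int));
    for (int d = 0; d < HEAD_DIM; ++d) q[d] = 1.f;
    for (int i = 0; i < T * HEAD_DIM; ++i) k[i] = 0.01f;
    for (int b = 0; b < T / BLOCK_SIZE; ++b) bt[b] = b;  // identity mapping
    scores_contiguous<<<(T + 127) / 128, 128>>>(q, k, out, T);
    cudaDeviceSynchronize();
    printf("contiguous score[0] = %f\n", out[0]);
    scores_paged<<<(T + 127) / 128, 128>>>(q, k, bt, out, T);
    cudaDeviceSynchronize();
    printf("paged      score[0] = %f\n", out[0]);
    return 0;
}
```

Compiled with nvcc, both kernels produce identical scores here (the block table is the identity mapping); the point is the extra indirection on the paged path. That indirection is what forces paged serving stacks to maintain rewritten attention kernels, whereas vAttention keeps the virtual layout contiguous and reuses the vanilla kernel as-is.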

Table of Links
2 Background
2.2 Fragmentation and PagedAttention
3 Issues with the PagedAttention Model and 3.1 Requires re-writing the attention kernel
3.2 Adds redundancy in the serving framework and 3.3 Performance Overhead
4 Insights into LLM Serving Systems
5 vAttention: System Design and 5.1 Design Overview
5.2 Leveraging Low-level CUDA Support (see the CUDA sketch after this list)
5.3 Serving LLMs with vAttention
6 vAttention: Optimizations and 6.1 Mitigating internal fragmentation
6.2 Hiding memory allocation latency
7.1 Portability and Performance for Prefills
7.2 Portability and Performance for Decodes
7.3 Efficacy of Physical Memory Allocation
7.4 Analysis of Memory Fragmentation
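As a companion to the "Leveraging Low-level CUDA Support" entry above, the following is a minimal sketch of the CUDA driver virtual-memory APIs that make a contiguous-but-lazily-backed KV-cache possible: reserve a large contiguous virtual range up front with cuMemAddressReserve, then create and map physical pages into it on demand with cuMemCreate, cuMemMap, and cuMemSetAccess. The range size and single-page mapping are illustrative assumptions, not vAttention's actual allocator.

```cuda
// Illustrative sketch of CUDA virtual memory management. Build: nvcc vmm.cu -lcuda
#include <cstdio>
#include <cuda.h>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) {   \
    const char* msg; cuGetErrorString(r, &msg);                          \
    printf("%s failed: %s\n", #call, msg); return 1; } } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    // Query the driver's allocation granularity; physical "pages" are
    // multiples of this size.
    CUmemAllocationProp prop = {};
    prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id   = dev;
    size_t gran;
    CHECK(cuMemGetAllocationGranularity(&gran, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    // 1. Reserve a large contiguous virtual range up front
    //    (no physical memory is committed yet).
    size_t vsize = 64 * gran;
    CUdeviceptr base;
    CHECK(cuMemAddressReserve(&base, vsize, 0, 0, 0));

    // 2. On demand, create one physical page and map it at the start
    //    of the reserved range.
    CUmemGenericAllocationHandle handle;
    CHECK(cuMemCreate(&handle, gran, &prop, 0));
    CHECK(cuMemMap(base, gran, 0, handle, 0));

    // 3. Enable access so kernels can read and write the mapped page.
    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CHECK(cuMemSetAccess(base, gran, &access, 1));

    printf("reserved %zu virtual bytes, mapped %zu physical bytes\n",
           vsize, gran);

    // Teardown: unmap, release physical memory, free the virtual range.
    CHECK(cuMemUnmap(base, gran));
    CHECK(cuMemRelease(handle));
    CHECK(cuMemAddressFree(base, vsize));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```

Because the virtual addresses never change as physical pages are mapped in behind them, attention kernels can treat the KV-cache as one ordinary contiguous buffer, which is what lets vAttention run FlashAttention's vanilla kernel unmodified.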