vAttention Performance & Portability for LLM Prefill Phase

by Text Generation 3m June 13th, 2025

This section highlights vAttention's ability to add dynamic memory allocation support to unmodified FlashAttention and FlashInfer prefill kernels, simplifying development while boosting performance.

Table of Links

Abstract and 1 Introduction

2 Background

2.1 Large Language Models

2.2 Fragmentation and PagedAttention

3 Issues with the PagedAttention Model and 3.1 Requires re-writing the attention kernel

3.2 Adds redundancy in the serving framework and 3.3 Performance Overhead

4 Insights into LLM Serving Systems

5 vAttention: System Design and 5.1 Design Overview

5.2 Leveraging Low-level CUDA Support

5.3 Serving LLMs with vAttention

6 vAttention: Optimizations and 6.1 Mitigating internal fragmentation

6.2 Hiding memory allocation latency

7 Evaluation

7.1 Portability and Performance for Prefills

7.2 Portability and Performance for Decodes

7.3 Efficacy of Physical Memory Allocation

7.4 Analysis of Memory Fragmentation

8 Related ...

Copyright of this story solely belongs to hackernoon.com . To see the full text click HERE

Table of Links

Share:

More related news