vLLM v0.16 Adds WebSocket Realtime API and Faster Scheduling


Date: February 24, 2026
Source: vLLM Release Notes

Release Context: This is a version upgrade. vLLM v0.16.0 is the latest release of the popular open-source inference server. The WebSocket Realtime API is a new feature that mirrors the functionality of OpenAI’s Realtime API, providing a self-hosted alternative for developers building voice-enabled applications.
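To make the "mirrors OpenAI's Realtime API" claim concrete, the sketch below builds the kind of JSON events an OpenAI-style Realtime session exchanges over a WebSocket. The event names and the endpoint path in the comment are assumptions modeled on OpenAI's Realtime API conventions; vLLM's actual schema may differ.

```python
# Sketch of Realtime-style WebSocket events (assumption: event shapes
# follow OpenAI's Realtime API conventions, not vLLM's exact schema).
import json


def session_update(instructions: str, voice: str = "alloy") -> str:
    """Build a session.update event configuring the session."""
    return json.dumps({
        "type": "session.update",
        "session": {"instructions": instructions, "voice": voice},
    })


def user_text_message(text: str) -> str:
    """Build a conversation.item.create event carrying user text."""
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": text}],
        },
    })


if __name__ == "__main__":
    # Actually connecting would look roughly like this (requires the
    # third-party `websockets` package; the URL is illustrative only):
    #
    #   import asyncio, websockets
    #   async def main():
    #       async with websockets.connect("ws://localhost:8000/v1/realtime") as ws:
    #           await ws.send(session_update("You are a helpful assistant."))
    #           await ws.send(user_text_message("Hello!"))
    #           print(await ws.recv())
    #   asyncio.run(main())
    print(session_update("You are a helpful assistant."))
```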

Background on vLLM

vLLM is an open-source library for large language model (LLM) inference and serving, originally developed in the Sky Computing Lab at UC Berkeley. Over time, it has become the de facto standard for self-hosted, high-throughput LLM inference because of its performance and memory efficiency. Its core innovation is PagedAttention, a memory management technique that lets it serve multiple concurrent requests with far higher throughput than traditional serving methods.
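The core idea of PagedAttention can be illustrated with a toy model: the KV cache is split into fixed-size blocks, and each sequence keeps a "block table" mapping logical token positions to physical blocks, so memory is committed on demand instead of pre-reserved for the maximum length. This is a conceptual sketch only, not vLLM's actual implementation.

```python
# Toy sketch of paged KV-cache management (concept only, not vLLM code).

BLOCK_SIZE = 16  # tokens per KV-cache block (a common vLLM default)


class BlockAllocator:
    """Pool of physical KV-cache blocks shared across sequences."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # free physical block ids

    def allocate(self) -> int:
        return self.free.pop()


class Sequence:
    """One request's view of the cache via its block table."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block -> physical block
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is allocated only when the current one
        # fills up, so short outputs never waste pre-reserved memory.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1


alloc = BlockAllocator(num_blocks=64)
seq = Sequence(alloc)
for _ in range(40):  # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
print(len(seq.block_table))  # 3
```

Because blocks are allocated lazily and can be scattered anywhere in GPU memory, many concurrent sequences can share one pool with little fragmentation, which is where the throughput gain comes from.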

The v0.16.0 release introduces full support for async scheduling with pipeline parallelism, delivering strong improvements in end-to-end throughput and time-per-output-token (TPOT). However ...
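For reference, time-per-output-token is conventionally the mean inter-token latency after the first token: decode time divided by the remaining tokens. The function below is an illustrative sketch of that metric, not vLLM's internal benchmarking code.

```python
# Sketch of the conventional TPOT computation from request timestamps
# (illustrative only; names are not vLLM internals).

def tpot(first_token_time: float, last_token_time: float,
         num_output_tokens: int) -> float:
    """Mean inter-token latency after the first token, in seconds/token."""
    if num_output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (last_token_time - first_token_time) / (num_output_tokens - 1)


# Example: first token at t=0.5s, last of 101 tokens at t=2.5s
# -> 2.0s across 100 inter-token gaps = 0.02 s/token (20 ms TPOT).
print(tpot(0.5, 2.5, 101))  # 0.02
```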


Copyright of this story solely belongs to perficient.com.