vLLM v0.16.0: Throughput Scheduling and a WebSocket Realtime API

Date: February 24, 2026
Source: vLLM Release Notes

Release Context: This is a version upgrade. vLLM v0.16.0 is the latest release of the popular open-source inference server. The WebSocket Realtime API is a new feature that mirrors the functionality of OpenAI’s Realtime API, providing a self-hosted alternative for developers building voice-enabled applications.

Background on vLLM

vLLM is an open-source library for large language model (LLM) inference and serving, originally developed in the Sky Computing Lab at UC Berkeley. Over time, it has become the de facto standard for self-hosted, high-throughput LLM inference because of its performance and memory efficiency. Its core innovation is PagedAttention, a memory management technique that lets it serve multiple concurrent requests with far higher throughput than traditional serving methods.

The v0.16.0 release introduces full support for async scheduling with pipeline parallelism, delivering strong improvements in end-to-end throughput and reductions in time-per-output-token (TPOT). The headline feature, however, is a WebSocket-based vLLM Realtime API for streaming audio interactions, which mirrors the OpenAI Realtime API interface and is built for voice-enabled agent applications. The release also includes speculative decoding improvements, structured output enhancements, and new serving and RLHF workflow capabilities. Taken together, the combination of structured outputs, streaming, parallelism, and scale in a single release shows the continued convergence of “model serving” and “agent runtime” requirements.


Why the vLLM Realtime API Matters for Developers

If you run models on your own infrastructure for cost, privacy, or latency reasons (a trend reinforced by Hugging Face’s acquisition of llama.cpp), this release directly affects your serving stack. The vLLM Realtime API is the standout addition. It gives you a self-hosted alternative to OpenAI’s Realtime API with the same interface, so existing client code can point at a vLLM instance with minimal changes. That alone removes a hard dependency on OpenAI for voice-enabled web applications.
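To make the "same interface" point concrete, here is a minimal sketch of the event frames such a client would send over the WebSocket. The event names (`session.update`, `input_audio_buffer.append`) mirror the OpenAI Realtime API that vLLM's endpoint is modeled on; the endpoint path and the exact fields a vLLM server accepts are assumptions for illustration.

```python
import base64
import json

# Hypothetical self-hosted endpoint; substitute your vLLM server's host/port.
REALTIME_URL = "ws://localhost:8000/v1/realtime"

def session_update(voice: str, instructions: str) -> str:
    """Serialize a session.update event configuring the conversation."""
    return json.dumps({
        "type": "session.update",
        "session": {"voice": voice, "instructions": instructions},
    })

def audio_append(pcm_bytes: bytes) -> str:
    """Serialize an input_audio_buffer.append event carrying base64 PCM audio."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })
```

Because the frame shapes match, a client written against OpenAI's Realtime API should only need its WebSocket URL (and authentication) changed to talk to a self-hosted instance.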

On the throughput side, the async scheduling improvements mean high-concurrency workloads (serving many simultaneous users, for example) will see better performance without needing additional hardware. As a result, more throughput on the same GPUs translates directly to lower cost per request. For workloads where raw token speed matters most, the Mercury 2 diffusion LLM offers a complementary approach that reaches over 1,000 tokens per second.
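The throughput-to-cost relationship is simple arithmetic: amortize a fixed GPU-hour price over the tokens generated in that hour. The dollar figure and throughput numbers below are made up for illustration, not benchmarks from the release.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            tokens_per_second: float) -> float:
    """Dollars per one million generated tokens at a given aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600.0
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000

# Doubling aggregate throughput on the same GPU halves the cost per token.
baseline = cost_per_million_tokens(2.50, 5_000)   # illustrative pre-upgrade rate
improved = cost_per_million_tokens(2.50, 10_000)  # illustrative post-upgrade rate
```

This is why scheduler-level gains matter even when no new hardware is involved: the denominator grows while the numerator stays fixed.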


Matthew Aberham

Matthew Aberham is a solutions architect and full-stack engineer focused on building scalable web platforms and intuitive front-end experiences. He works at the intersection of performance engineering, interface design, and applied AI systems.
