vLLM v0.16.0: Throughput Scheduling and a WebSocket Realtime API
Date: February 24, 2026
Source: vLLM Release Notes
Release Context: This is a version upgrade. vLLM v0.16.0 is the latest release of the popular open-source inference server. The WebSocket Realtime API is a new feature that mirrors the functionality of OpenAI’s Realtime API, providing a self-hosted alternative for developers building voice-enabled applications.
Background on vLLM
vLLM is an open-source library for large language model (LLM) inference and serving, originally developed in the Sky Computing Lab at UC Berkeley. Over time, it has become the de facto standard for self-hosted, high-throughput LLM inference because of its performance and memory efficiency. Its core innovation is PagedAttention, a memory management technique that lets it serve many concurrent requests with far higher throughput than traditional serving methods.
The v0.16.0 release introduces full support for async scheduling with pipeline parallelism, improving end-to-end throughput and reducing time per output token (TPOT). The headline feature, however, is a WebSocket-based vLLM Realtime API for streaming audio interactions, which mirrors the OpenAI Realtime API interface and is built for voice-enabled agent applications. The release also includes speculative decoding improvements, structured output enhancements, and new serving and RLHF workflow capabilities. Taken together, the combination of structured outputs, streaming, parallelism, and scale in a single release shows the continued convergence of “model serving” and “agent runtime” requirements.

Why the vLLM Realtime API Matters for Developers
If you run models on your own infrastructure for cost, privacy, or latency reasons (a trend reinforced by Hugging Face’s acquisition of llama.cpp), this release directly affects your serving stack. The vLLM Realtime API is the standout addition. It gives you a self-hosted alternative to OpenAI’s Realtime API with the same interface, so existing client code can point at a vLLM instance with minimal changes. That alone removes a hard dependency on OpenAI for voice-enabled web applications.
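Because the API mirrors OpenAI's Realtime protocol, client code mostly comes down to exchanging JSON events over a WebSocket. A minimal sketch of building two such events is below; the event names follow OpenAI's Realtime API conventions, and the endpoint path shown is an assumption, not confirmed by the release notes:

```python
import base64
import json

# Hypothetical self-hosted endpoint; check your vLLM deployment for the actual path.
REALTIME_URL = "ws://localhost:8000/v1/realtime"

def session_update(voice: str, audio_format: str) -> str:
    """Build a session.update event (event shape mirrors OpenAI's Realtime API)."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "voice": voice,
            "input_audio_format": audio_format,
            "output_audio_format": audio_format,
        },
    })

def append_audio(pcm_bytes: bytes) -> str:
    """Build an input_audio_buffer.append event carrying base64-encoded PCM audio."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })
```

In a real client these strings would be sent over a WebSocket connection to `REALTIME_URL`; the point of the interface compatibility is that the same event payloads your OpenAI Realtime client already produces should work unchanged.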
On the throughput side, the async scheduling improvements mean high-concurrency workloads (serving many simultaneous users, for example) will see better performance on the same hardware; more throughput per GPU translates directly into lower cost per request. For workloads where raw token speed matters most, the Mercury 2 diffusion LLM offers a complementary approach, reaching over 1,000 tokens per second.
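The scheduling and parallelism features described above are combined at launch time. A hedged sketch of a serve command follows; the model name is a placeholder, and the `--async-scheduling` and `--pipeline-parallel-size` flag spellings are assumed to match earlier vLLM releases, so verify them against `vllm serve --help` for your installed version:

```shell
# Launch vLLM with async scheduling and pipeline parallelism across 2 GPUs.
# Model name and flag spellings are illustrative, not taken from the release notes.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --async-scheduling \
  --pipeline-parallel-size 2 \
  --port 8000
```

Pipeline parallelism splits the model's layers across GPUs, so async scheduling matters here: it keeps the pipeline stages fed with work instead of stalling between scheduling steps.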
