Running a chatbot demo on a single GPU with transformers is easy. Serving the same model to thousands of concurrent users without melting your budget is a different problem entirely. That gap — between a notebook that works and an endpoint that survives production traffic — is exactly where vLLM has become the default answer in 2026.
If your team is self-hosting open-source models like Qwen, Llama, Mistral, or DeepSeek instead of paying per-token to a closed API, this guide walks through what vLLM does, why it is fast, and how to deploy it for real workloads.
Why naive serving falls apart
The standard Hugging Face inference loop processes requests one batch at a time. The whole batch must finish before the next one starts, so a single long generation blocks short ones behind it. Worse, GPU memory for the KV cache — the running attention state for every token in every sequence — is allocated as one big contiguous block per request, sized for the worst case. Most of it sits empty.
The result: GPU utilization typically hovers around 30–40%. You are paying for hardware that spends most of its time waiting. On expensive accelerators, that waste is the single largest line item in an inference budget.
PagedAttention: the core idea
vLLM's breakthrough is PagedAttention, which borrows virtual memory paging from operating systems. Instead of one contiguous KV-cache region per request, the cache is split into fixed-size blocks (pages). Logical token positions map to non-contiguous physical blocks, exactly like an OS maps virtual pages to physical RAM.
This kills two problems at once. Fragmentation nearly disappears, so you can fit far more concurrent sequences in the same memory. And blocks become shareable — when several requests share a common prompt prefix (a system prompt, a few-shot template), they can point at the same physical pages instead of duplicating them. By reclaiming wasted memory, PagedAttention raises the achievable batch size for a given GPU, which directly raises throughput.
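To make the block-table idea concrete, here is a minimal toy sketch in Python. It is not vLLM's implementation: the class name `ToyPagedKVCache`, its methods, and the block size of 16 tokens are all hypothetical choices made for illustration. It only shows how logical blocks can map to arbitrary physical blocks, and how a shared prefix can be reference-counted instead of copied.

```python
BLOCK_SIZE = 16  # tokens per block; an illustrative value chosen for this sketch


class ToyPagedKVCache:
    """Hypothetical toy allocator: per-sequence block tables over a shared block pool."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}     # physical block id -> number of sequences using it
        self.block_tables = {}  # sequence id -> physical block ids, in logical order

    def add_sequence(self, seq_id, num_tokens, share_from=None, shared_tokens=0):
        table = []
        if share_from is not None:
            # Only fully filled prefix blocks are shared in this sketch.
            shared_blocks = shared_tokens // BLOCK_SIZE
            for block in self.block_tables[share_from][:shared_blocks]:
                self.ref_count[block] += 1
                table.append(block)
        blocks_needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        while len(table) < blocks_needed:
            block = self.free_blocks.pop()  # any free block works; no contiguity needed
            self.ref_count[block] = 1
            table.append(block)
        self.block_tables[seq_id] = table

    def free_sequence(self, seq_id):
        # A block returns to the pool only when no sequence references it anymore.
        for block in self.block_tables.pop(seq_id):
            self.ref_count[block] -= 1
            if self.ref_count[block] == 0:
                del self.ref_count[block]
                self.free_blocks.append(block)

    def used_blocks(self):
        return len(self.ref_count)


cache = ToyPagedKVCache(num_blocks=64)
cache.add_sequence("a", num_tokens=100)
# "b" reuses the first 64 tokens (4 full blocks) of "a", e.g. a common system prompt.
cache.add_sequence("b", num_tokens=120, share_from="a", shared_tokens=64)
print(cache.block_tables["a"])
print(cache.block_tables["b"])
print(cache.used_blocks())
```

In this toy run the two sequences occupy 11 physical blocks rather than the 15 they would need without sharing, and freeing "a" releases only the blocks that "b" does not still reference.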
Continuous batching
The second pillar is continuous batching (also called in-flight batching). Rather than waiting for a whole batch to finish, vLLM's scheduler operates at the token level: as soon as one sequence emits its end-of-sequence token and frees its slot, a queued request takes its place in the very next step.
Switching from static batching to continuous batching with PagedAttention typically lifts GPU utilization from that 30–40% range up to 75–90%, which translates to roughly 2–4x more output tokens per GPU-hour. Across benchmarks, vLLM commonly delivers 2–3x higher throughput than baseline serving, and far more than that against naive single-request loops.
In 2026, the rewritten V1 engine is the default. It cleaned up the scheduler, made chunked prefill standard, and improved prefix caching — so the gains above are what you get out of the box, not after heavy tuning.
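The difference between static and continuous batching is easy to see in a toy simulation. The sketch below is a deliberate simplification, not vLLM's scheduler: it assumes every running sequence produces exactly one token per step, ignores prefill and memory limits, and the function names and workload are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes; then the next batch starts."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished sequence frees its slot, and a queued request fills it on the next step."""
    queue = deque(lengths)
    running = []  # remaining tokens to generate for each in-flight sequence
    steps = 0
    while queue or running:
        # Refill any free slots before the step runs.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Every running sequence emits one token; drop those that are done.
        running = [left - 1 for left in running if left - 1 > 0]
    return steps


# Hypothetical workload: two long generations mixed with short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, batch_size=4))
print(continuous_batching_steps(lengths, batch_size=4))
```

On this workload the static schedule needs 200 decode steps while the continuous one needs 105, because the short requests no longer wait behind a long generation in their batch.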
Getting started
Installation is a single package, and serving is a single command. vLLM exposes an OpenAI-compatible server, which is the feature that makes adoption painless: any code already written against the OpenAI SDK points at your endpoint with one URL change.
pip install vllm
# Start an OpenAI-compatible server on port 8000
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --gpu-memory-utilization 0.90 \
    --max-num-batched-tokens 8192
Now call it exactly like the OpenAI API:
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-for-local",
)
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
)
print(resp.choices[0].message.content)
No SDK rewrite, no proprietary client. That single property is why teams migrating off closed APIs reach for vLLM first.
Scaling to large models
A 7B model fits comfortably on one GPU. A 70B model does not: at 16-bit precision its weights alone need roughly 140 GB, more than a single 80 GB card can hold. For that, vLLM supports tensor parallelism, which shards each layer's weights across multiple GPUs on one node:
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90
Set --tensor-parallel-size to the number of GPUs on the machine. For models too large for a single node, pipeline parallelism splits layers across nodes as well. The rule of thumb: use tensor parallelism within a box first, because it is bandwidth-hungry and benefits from fast intra-node interconnects.
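As a rough illustration of what "sharding a layer's weights" means, the sketch below splits one weight matrix column-wise across simulated GPUs and checks that the gathered result matches the unsharded computation. This is one common tensor-parallel scheme shown in plain Python, not a description of vLLM's internals; the function names are hypothetical.

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Split w column-wise into equal shards, one per simulated GPU."""
    n = len(w[0])
    if n % num_shards != 0:
        raise ValueError("column count must be divisible by the number of shards")
    width = n // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]


def column_parallel_linear(x, w, num_shards):
    """Each shard computes its slice of the output; the slices are then concatenated."""
    shards = split_columns(w, num_shards)
    partials = [matmul(x, shard) for shard in shards]  # on real hardware, one GPU each
    return [value for part in partials for value in part]  # stands in for an all-gather


x = [1, 2]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
print(matmul(x, w))
print(column_parallel_linear(x, w, num_shards=2))
```

Each simulated GPU holds only half of the matrix, yet the concatenated output is identical. The gather step is where the interconnect bandwidth mentioned above comes into play, since it happens for every sharded layer on every forward pass.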
Quantization for throughput and fit
Quantization stores weights at lower precision, so a model needs less memory and, because token generation is largely limited by memory bandwidth, often runs faster. vLLM supports a broad set of methods, among them FP8, AWQ, GPTQ, GPTQ-Marlin, bitsandbytes, and compressed-tensors.
Two practical paths:
- FP8 on modern accelerators is the sweet spot in 2026. Most major open models (Llama 3.x, Mistral, Qwen 2.5, Phi-4) are validated with vLLM FP8, and if no pre-quantized weights exist, vLLM can quantize on the fly from the original weights.
- AWQ is a strong choice when you want roughly 2x throughput with minimal quality loss on hardware without native FP8.
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
    --quantization awq \
    --tensor-parallel-size 2
The payoff is concrete: quantization frees memory, more free memory means a bigger KV cache, a bigger KV cache means more concurrent sequences, and more concurrency means higher throughput. The optimizations compound.
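The compounding effect can be checked with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: the model shape (7B parameters, 32 layers, 8 KV heads, head dimension 128), the 24 GB GPU, and the flat 4 bits per weight are hypothetical, and it ignores activations, quantization scales, and other runtime overhead.

```python
def weight_memory_gb(num_params_billion, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * 1e9 * bits_per_weight / 8 / 1e9


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Each layer stores one key and one value vector per KV head for every token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_token_budget(gpu_memory_gb, utilization, weights_gb, bytes_per_token):
    """Tokens of KV cache that fit in the memory left after loading the weights."""
    available_gb = gpu_memory_gb * utilization - weights_gb
    return max(0, int(available_gb * 1e9 // bytes_per_token))


# Hypothetical 7B model on a hypothetical 24 GB GPU at 0.90 utilization.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    weights = weight_memory_gb(7, bits)
    budget = kv_cache_token_budget(24, 0.90, weights, per_token)
    print(f"{label}: weights ~{weights:.1f} GB, KV cache budget ~{budget} tokens")
```

Under these assumptions, moving from 16-bit to 4-bit weights more than doubles the number of tokens the KV cache can hold, which is the headroom that turns into extra concurrent sequences.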
A production-ready configuration
For an 80GB-class GPU serving a mid-size model online, a sane starting point looks like this:
vllm serve <your-model> \
    --gpu-memory-utilization 0.90 \
    --max-num-batched-tokens 16384 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --tensor-parallel-size 2
A few notes on the knobs that matter most:
- --gpu-memory-utilization: leave a margin (0.90, not 0.98) so memory spikes do not trigger out-of-memory crashes under load.
- --max-num-batched-tokens: the lever for throughput versus latency. Higher values pack more work per step (good for batch jobs); lower values keep individual responses snappy.
- --enable-prefix-caching: a near-free win when many requests share a system prompt or template, thanks to the shareable blocks PagedAttention makes possible.
- --max-model-len: cap context to what you actually need. Reserving room for 128K tokens you never use just steals KV-cache capacity from real traffic.
Run it behind a process manager, put it on a private network, and front it with a reverse proxy that handles TLS and rate limiting. Because the API is OpenAI-shaped, an LLM gateway or router can load-balance several vLLM instances with no custom glue.
Why this matters for MENA teams
For startups and enterprises across Tunisia, Saudi Arabia, and the wider region, the economics are compelling. Self-hosting an open model on vLLM converts a variable per-token bill — paid in foreign currency to a provider that may be subject to export restrictions — into a fixed, predictable infrastructure cost you control. Data never leaves your environment, which matters for PDPL and sovereignty requirements. And because vLLM runs the open weights you already trust, you are insulated from sudden API deprecations or regional access changes.
The throughput gains are what make this viable rather than aspirational: getting 2–4x more tokens out of every GPU-hour is the difference between self-hosting being a cost center and being a genuine saving.
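To see how throughput feeds into unit cost, a quick calculation helps. The figures below are placeholders, not benchmarks or real prices: the hourly GPU cost and both tokens-per-second values are hypothetical, and the helper name is made up for this sketch.

```python
def cost_per_million_tokens(gpu_hourly_cost, tokens_per_second_per_gpu):
    """Infrastructure cost of generating one million output tokens on one GPU."""
    tokens_per_gpu_hour = tokens_per_second_per_gpu * 3600
    return gpu_hourly_cost / tokens_per_gpu_hour * 1_000_000


# Hypothetical numbers: the same GPU before and after a 3x throughput gain.
gpu_hourly_cost = 2.00
baseline = cost_per_million_tokens(gpu_hourly_cost, tokens_per_second_per_gpu=500)
improved = cost_per_million_tokens(gpu_hourly_cost, tokens_per_second_per_gpu=1500)
print(f"baseline: {baseline:.2f} per million tokens")
print(f"improved: {improved:.2f} per million tokens")
```

Because the hardware bill is fixed per hour, cost per token falls in direct proportion to throughput. Plug in your own measured tokens per second and your actual GPU cost before comparing against a per-token API price.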
Conclusion
vLLM turned high-performance LLM serving from a research problem into a one-line command. PagedAttention reclaims wasted KV-cache memory, continuous batching keeps the GPU busy, tensor parallelism scales past a single card, and quantization stretches every gigabyte further — all behind an OpenAI-compatible API that drops into existing code.
If you are still serving production traffic through raw transformers, you are leaving most of your hardware idle. Start with vllm serve, measure your tokens per GPU-hour, then tune the batch and memory knobs. The path from demo to durable endpoint has never been shorter.
Need help architecting a self-hosted inference stack or migrating off a closed API? Noqta builds production AI infrastructure for teams across the MENA region.