vLLM is a high-throughput inference engine that achieves roughly 10-24x the throughput of naive Hugging Face Transformers serving. Its core technique, PagedAttention, manages the KV cache in fixed-size blocks the way an operating system manages virtual-memory pages, dramatically reducing memory waste and enabling continuous batching. If you are serving LLM requests in production, vLLM is the engine you want.

Why vLLM Matters

Standard serving wastes 60-80% of the GPU memory reserved for the KV cache; vLLM's PagedAttention cuts that waste to under 4%. Continuous batching admits new requests into an in-progress batch as soon as slots free up, instead of waiting for the whole batch to drain. The result: up to 24x more throughput on the same hardware.

Benchmark (Llama 2 7B on an A100 40GB): naive Hugging Face Transformers serving, 24 tok/s; vLLM, 580 tok/s. Same GPU, roughly 24x the throughput.
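The idea behind PagedAttention can be sketched in a few lines: instead of reserving one contiguous region per request sized for the maximum possible sequence, the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping logical positions to physical blocks. A toy allocator, for intuition only (the block and pool sizes are made up, and real vLLM does this in CUDA kernels, not Python):

```python
class PagedKVCache:
    """Toy model of vLLM-style paged KV-cache allocation.

    Physical memory is a pool of fixed-size blocks; each sequence holds a
    block table (a list of physical block ids). Waste is bounded by less
    than one block per sequence, instead of (max_len - actual_len) tokens
    per sequence as with contiguous preallocation.
    """

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.seq_lens: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Account for one new token, allocating a block on block boundaries."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # current block full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted -- request must be preempted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):                       # a 40-token sequence...
    cache.append_token("req-1")
print(len(cache.block_tables["req-1"]))   # ...occupies ceil(40/16) = 3 blocks
```

Continuous batching falls out of the same structure: when a sequence finishes, `free()` returns its blocks to the pool and a waiting request can start immediately.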

System Requirements

NVIDIA GPU with compute capability 7.0 or higher (Volta, Turing, Ampere, Ada Lovelace, Hopper). CUDA 11.8 or 12.1+. 8GB+ GPU memory. 16GB+ system RAM.

Installation

# pip
pip install vllm

# Docker (recommended)
docker pull vllm/vllm-openai:latest

docker run -d \
  --name vllm \
  --runtime nvidia \
  --gpus all \
  -p 8000:8000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct

# Note: gated models such as Llama 3 also require a Hugging Face token,
# passed with -e HUGGING_FACE_HUB_TOKEN=<your-token> before the image name.

API Usage

# Chat completions
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'

# Completions
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "prompt": "Hello"}'
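Because the endpoints are plain JSON-over-HTTP, no client library is required. A minimal Python sketch using only the standard library (the `temperature` and `max_tokens` fields are standard OpenAI-compatible sampling parameters; the actual POST of course requires the server from the Docker step to be listening on localhost:8000):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"


def chat_payload(user_msg: str, temperature: float = 0.7, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_msg}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def post_chat(user_msg: str) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(user_msg)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# With a running server: post_chat("Hello") returns the model's reply text.
```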

OpenAI Drop-In

from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

Advanced Configuration

Tensor Parallelism (multiple GPUs): --tensor-parallel-size 2

Extended context: --max-model-len 16384

Quantization: --quantization fp8 (requires hardware FP8 support, i.e. Hopper or Ada Lovelace GPUs)
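These flags trade off against KV-cache memory: every token of context costs a fixed number of bytes, so --max-model-len directly scales cache demand. A back-of-the-envelope sizing sketch (the layer/head figures assume Llama 3 8B's published architecture: 32 layers, 8 KV heads under grouped-query attention, head dimension 128, fp16 cache):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: keys + values, across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Assumed Llama 3 8B config: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_token)                          # 131072 bytes = 128 KiB per token

# A full 16384-token context (--max-model-len 16384) then needs:
print(per_token * 16384 / 2**30, "GiB")   # 2.0 GiB of KV cache per sequence
```

This is why paging matters: with contiguous preallocation every request would reserve the full 2 GiB up front, while paged allocation only consumes blocks as the sequence actually grows.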

Monitoring

Prometheus metrics at /metrics: vllm:num_requests_running, vllm:gpu_cache_usage_perc, vllm:prompt_tokens_total.

vLLM vs Ollama

Use Ollama for: development, quick experiments, simplicity. Use vLLM for: production serving, high throughput, many concurrent users.