vLLM is a high-throughput inference engine that achieves 10-24x the throughput of naive Hugging Face serving. It uses PagedAttention to manage the KV cache like virtual memory pages, dramatically reducing memory waste and enabling continuous batching. If you are serving LLM requests in production, vLLM is the engine you want.
Why vLLM Matters
Standard serving wastes 60-80% of KV cache memory through fragmentation and over-reservation; vLLM's PagedAttention cuts that waste to under 4%. Continuous batching dynamically adds new requests to in-progress batches instead of waiting for the whole batch to finish. Result: up to 24x throughput improvement on the same hardware.
Benchmark (Llama 2 7B on A100 40GB): Naive HF: 24 tok/s. vLLM: 580 tok/s. Same GPU, 24x better.
System Requirements
NVIDIA GPU with compute capability 7.0+ (Volta, Turing, Ampere, Ada Lovelace, or Hopper). CUDA 11.8 or 12.1+. 8GB+ GPU memory. 16GB+ system RAM.
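A quick way to confirm these requirements from Python is to query the GPU via PyTorch; this is a sketch and assumes torch with CUDA support is already installed on the host:

# Requirement check (sketch; assumes PyTorch with CUDA support is installed).
import torch

assert torch.cuda.is_available(), "No CUDA-capable GPU detected"

major, minor = torch.cuda.get_device_capability(0)      # e.g. (8, 0) on an A100
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3

print(f"Compute capability: {major}.{minor}")
print(f"GPU memory: {total_gb:.1f} GB")

if (major, minor) < (7, 0) or total_gb < 8:
    print("Warning: this GPU may not meet vLLM's minimum requirements")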
Installation
# pip
pip install vllm

# Docker (recommended)
docker pull vllm/vllm-openai:latest
docker run -d \
  --name vllm \
  --runtime nvidia \
  --gpus all \
  -p 8000:8000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct
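To verify a pip install, you can run a minimal offline-inference smoke test with vLLM's Python API. This is a sketch; the model name and sampling values are illustrative:

# Offline smoke test with vLLM's Python API (model and sampling values are illustrative).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["Hello, my name is"], params)
for output in outputs:
    print(output.outputs[0].text)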
API Usage
# Chat completions
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
# Completions
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "prompt": "Hello"}'OpenAI Drop-In
from openai import OpenAI

client = OpenAI(api_key="not-needed", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[...],
)
print(response.choices[0].message.content)
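Streaming works through the same OpenAI-compatible endpoint. A sketch, reusing the client created above; the prompt is illustrative:

# Streaming with the same client (prompt is illustrative).
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)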
Advanced Configuration
Tensor Parallelism (multiple GPUs): --tensor-parallel-size 2
Extended context: --max-model-len 16384
Quantisation: --quantization fp8 (H100/Ada Lovelace only)
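The same options are exposed as constructor arguments in the offline Python API. A sketch of the equivalent configuration; two GPUs and FP8-capable hardware are assumptions about your setup:

# Offline API equivalent of the server flags above (assumes 2 GPUs and FP8-capable hardware).
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=2,    # --tensor-parallel-size 2
    max_model_len=16384,       # --max-model-len 16384
    quantization="fp8",        # --quantization fp8 (H100/Ada Lovelace only)
)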
Monitoring
Prometheus metrics at /metrics: vllm:num_requests_running, vllm:gpu_cache_usage_perc, vllm:prompt_tokens_total.
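A minimal sketch for pulling those gauges and counters out of the endpoint with requests; the URL assumes the server is running on the default port:

# Fetch and filter vLLM's Prometheus metrics (assumes the default port).
import requests

wanted = ("vllm:num_requests_running", "vllm:gpu_cache_usage_perc", "vllm:prompt_tokens_total")
text = requests.get("http://localhost:8000/metrics", timeout=5).text

for line in text.splitlines():
    if line.startswith(wanted):
        print(line)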
vLLM vs Ollama
Use Ollama for: development, quick experiments, simplicity. Use vLLM for: production serving, high throughput, many concurrent users.