vLLM (production high-throughput serving)¶
vLLM is a high-throughput LLM inference and serving engine built around PagedAttention and continuous batching. Per the 2026 benchmark reference in the v1.1.x plan, vLLM delivers roughly 6× the throughput of Ollama at 50+ concurrent users.
Use vLLM when:

- You're serving 10+ concurrent users
- You run batch inference jobs
- You have GPU hardware (single or multi-GPU) dedicated to LLM serving
- You need predictable latency under load

Use Ollama when:

- Single-user / developer workflow
- No GPU, or a small consumer GPU
- Simplicity matters more than throughput

Use LiteLLM when:

- You want to route to vLLM AND other backends via one config
Install¶
(or pip install fcc-vllm-plugin once published)
Start a vLLM server¶
Via Docker (recommended for most users)¶
FCC ships a ready-to-use compose file at docker/vllm.yml:
# Optional: set HF_TOKEN if your model is gated (Llama, Mistral, etc.)
export HF_TOKEN=hf_...
# Default model: meta-llama/Meta-Llama-3.1-8B-Instruct
docker compose -f docker/vllm.yml up -d
# Or override the model
VLLM_MODEL=mistralai/Mistral-7B-v0.3 docker compose -f docker/vllm.yml up -d
# Verify vLLM is up
curl http://localhost:8000/v1/models
This brings up two containers:
- fcc-vllm — the vLLM server on port 8000 with GPU access
- fcc-backend-vllm — the FCC backend preconfigured to route through vLLM
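If you prefer to script the health check rather than curl it, a minimal Python equivalent is sketched below. It assumes the compose defaults from above (vLLM on localhost:8000 with no auth in front of it):

import json
from urllib.request import urlopen

# List the models the vLLM server is currently serving (same as the curl check above).
with urlopen("http://localhost:8000/v1/models", timeout=10) as resp:
    models = json.load(resp)

# The OpenAI-compatible endpoint returns {"object": "list", "data": [{"id": ...}, ...]}.
for model in models["data"]:
    print(model["id"])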
Via Kubernetes (production)¶
Run vLLM in the same cluster as FCC and wire them together via cluster DNS:
# vllm-deployment.yaml (simplified)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: llm-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model
        - meta-llama/Meta-Llama-3.1-8B-Instruct
        - --host
        - 0.0.0.0
        - --port
        - "8000"
        - --gpu-memory-utilization
        - "0.9"
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: llm-serving
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
Then install FCC with vLLM as the default provider:
helm install fcc ./charts/fcc \
--set global.ai.defaultProvider=vllm \
--set global.ai.vllmBaseUrl=http://vllm.llm-serving.svc.cluster.local:8000/v1 \
--set global.ai.vllmDefaultModel=meta-llama/Meta-Llama-3.1-8B-Instruct
Native Python (development / tuning)¶
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.9
Configure FCC¶
export VLLM_BASE_URL=http://localhost:8000/v1
export VLLM_DEFAULT_MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
export FCC_DEFAULT_PROVIDER=vllm
fcc scenarios run --scenario basic_routing
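If you drive FCC from Python instead of a shell, the same configuration can be applied in-process. This is only a sketch, and it assumes the plugin reads these environment variables when the client is constructed:

import os

# Mirror the exports above before creating the client.
os.environ["VLLM_BASE_URL"] = "http://localhost:8000/v1"
os.environ["VLLM_DEFAULT_MODEL"] = "meta-llama/Meta-Llama-3.1-8B-Instruct"
os.environ["FCC_DEFAULT_PROVIDER"] = "vllm"

from fcc.simulation.ai_client import AIClient

client = AIClient(provider="vllm")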
Auto-detection¶
VLLM_BASE_URL must be explicitly set for FCC to auto-detect vLLM.
The plugin does not probe localhost:8000 — this is the same
conservative behavior as the Ollama plugin (preserves the mock fallback
for users who don't intend to use vLLM).
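In other words, detection is just an environment check, not a network probe. The sketch below illustrates the behavior described above; the helper name is made up for illustration and is not the plugin's actual API:

import os

def detect_vllm_base_url():
    # Hypothetical helper, for illustration only: the plugin activates solely on an
    # explicit VLLM_BASE_URL and never probes localhost:8000 on its own, so an
    # unset variable means "stay inert" and FCC keeps its mock fallback.
    return os.environ.get("VLLM_BASE_URL") or None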
Programmatic use¶
from fcc.simulation.ai_client import AIClient
client = AIClient(provider="vllm")
response = client.complete(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain PagedAttention in one sentence."},
],
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
temperature=0.2,
max_tokens=128,
)
print(response.content)
print(f"Tokens: {response.usage}")
print(f"Latency: {response.latency_ms:.0f} ms")
Per-scenario routing¶
Pin vLLM (or a specific model served by vLLM) per scenario:
{
"id": "BENCH-VLLM-001",
"name": "vLLM Llama 3.1 8B baseline",
"type": "ai",
"description": "Benchmark the same workflow against vLLM-served Llama",
"objectives": ["Measure throughput and latency under load"],
"setup": {
"initial_input": "Design a REST API for...",
"start_node": "RC",
"personas_involved": ["RC", "BC", "DE"],
"ai_config": {
"provider": "vllm",
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"temperature": 0.0,
"max_tokens": 2048
}
},
"validation_rules": []
}
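The benchmark objective above can also be exercised straight from the programmatic client. The sketch below is illustrative only: the prompt, worker count, and request count are arbitrary, and it assumes AIClient can be shared across threads (create one client per worker if it can't):

import time
from concurrent.futures import ThreadPoolExecutor

from fcc.simulation.ai_client import AIClient

client = AIClient(provider="vllm")

def one_request(_):
    # Small fixed prompt so runs are comparable; adjust to match your scenario.
    response = client.complete(
        messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        temperature=0.0,
        max_tokens=64,
    )
    return response.latency_ms

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=16) as pool:  # roughly "concurrent users"
    latencies = sorted(pool.map(one_request, range(64)))
elapsed = time.perf_counter() - start

print(f"{len(latencies)} requests in {elapsed:.1f}s ({len(latencies) / elapsed:.1f} req/s)")
print(f"p50 latency: {latencies[len(latencies) // 2]:.0f} ms")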
GPU memory tuning¶
vLLM pre-allocates GPU memory for the KV cache on startup. Two knobs:
# How much of the GPU's total memory to use (0.0-1.0)
--gpu-memory-utilization 0.9
# Max number of tokens processed per batch, across all concurrent requests
--max-num-batched-tokens 8192
If you see RuntimeError: CUDA out of memory on startup, reduce
--gpu-memory-utilization (try 0.8 or 0.75). If you see requests
queueing under heavy load, increase --max-num-batched-tokens.
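For intuition about why the pre-allocated KV cache dominates GPU memory, here is a back-of-the-envelope calculation. The parameter values are assumptions for Llama 3.1 8B in FP16 on a 24 GB GPU, purely for illustration:

# Assumed Llama 3.1 8B shape (FP16): 32 layers, 8 KV heads (GQA), head dim 128.
num_layers = 32
num_kv_heads = 8
head_dim = 128
bytes_per_value = 2  # FP16

# Keys + values stored per token across all layers.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(f"{kv_bytes_per_token / 1024:.0f} KiB of KV cache per token")  # ~128 KiB

# With ~16 GB of weights on a 24 GB GPU at --gpu-memory-utilization 0.9,
# roughly 0.9 * 24 - 16 = 5.6 GB remains for the KV cache.
cache_bytes = (0.9 * 24 - 16) * 1024**3
print(f"~{cache_bytes / kv_bytes_per_token:,.0f} tokens fit in the cache")  # ~46,000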
Quantization¶
vLLM supports AWQ, GPTQ, and FP8 quantized models for ~2-4× smaller memory footprint with minimal quality loss:
# Example: AWQ-quantized Llama 3.1 70B
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Meta-Llama-3.1-70B-Instruct-AWQ \
--quantization awq
vLLM auto-detects the quantization from the model's config.
Multi-GPU (tensor parallelism)¶
For models larger than a single GPU:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 4
This shards the model across 4 GPUs.
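A quick way to sanity-check whether a model fits is to divide the weight memory across the tensor-parallel group; KV cache and activation overhead come on top. FP16 sizes are assumed here for illustration:

# Rough per-GPU weight memory under tensor parallelism (FP16 assumed).
params_billion = 70
bytes_per_param = 2       # FP16
tensor_parallel_size = 4

total_weights_gb = params_billion * bytes_per_param    # ~140 GB of weights
per_gpu_gb = total_weights_gb / tensor_parallel_size   # ~35 GB per GPU
print(f"~{per_gpu_gb:.0f} GB of weights per GPU, before KV cache and activations")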
Troubleshooting¶
FCC keeps falling back to mock
- Confirm VLLM_BASE_URL is set: echo $VLLM_BASE_URL. Without it, the plugin stays inert.

ConnectionError: [Errno 111] Connection refused
- vLLM isn't listening yet. Model load takes 1-5 minutes on first boot (depending on model size and GPU). Check logs: docker logs fcc-vllm

401 Unauthorized
- vLLM is behind an auth proxy. Set the VLLM_API_KEY env var.

Requests queueing at high concurrency
- Normal: increase --max-num-batched-tokens on the vLLM side, or run multiple vLLM replicas.

First request is slow, subsequent fast
- vLLM JIT-compiles CUDA kernels for the specific model shape on first request. Pre-warm with a dummy request at startup.
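A pre-warm can be as simple as one throwaway completion once the server reports healthy. Here is a sketch using the programmatic client from above; the prompt and token budget are arbitrary:

from fcc.simulation.ai_client import AIClient

# Throwaway request to trigger vLLM's one-time kernel compilation for this model shape.
client = AIClient(provider="vllm")
client.complete(
    messages=[{"role": "user", "content": "ping"}],
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    temperature=0.0,
    max_tokens=1,
)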
See also¶
- Ollama — simpler alternative for dev/single-user
- LiteLLM — universal router that can target vLLM
- Provider matrix — full comparison
- vLLM OpenAI-compatible server docs
- docker/vllm.yml — bundled compose example