vLLM (production high-throughput serving)

vLLM is a high-throughput LLM inference and serving engine with PagedAttention and continuous batching. Per the 2026 benchmark reference in the v1.1.x plan, vLLM delivers ~6× throughput over Ollama at 50+ concurrent users.

Use vLLM when:

  • You're serving 10+ concurrent users
  • You run batch inference jobs
  • You have GPU hardware (single or multi-GPU) dedicated to LLM serving
  • You need predictable latency under load

Use Ollama when:

  • Single-user / developer workflow
  • No GPU or a small consumer GPU
  • Simplicity matters more than throughput

Use LiteLLM when:

  • You want to route to vLLM AND other backends via one config

Install

pip install -e ./plugins/fcc-vllm-plugin

(or pip install fcc-vllm-plugin once published)

Start a vLLM server

FCC ships a ready-to-use compose file at docker/vllm.yml:

# Optional: set HF_TOKEN if your model is gated (Llama, Mistral, etc.)
export HF_TOKEN=hf_...

# Default model: meta-llama/Meta-Llama-3.1-8B-Instruct
docker compose -f docker/vllm.yml up -d

# Or override the model
VLLM_MODEL=mistralai/Mistral-7B-v0.3 docker compose -f docker/vllm.yml up -d

# Verify vLLM is up
curl http://localhost:8000/v1/models

This brings up two containers:

  • fcc-vllm — the vLLM server on port 8000 with GPU access
  • fcc-backend-vllm — the FCC backend preconfigured to route through vLLM
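
Beyond the curl check, a quick end-to-end smoke test is to send a chat completion to the OpenAI-compatible endpoint. A minimal sketch using the openai Python client (the model name assumes the compose default; swap it if you overrode VLLM_MODEL):

# Smoke-test the vLLM OpenAI-compatible endpoint started by docker/vllm.yml.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say 'ready' if you can hear me."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)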

Via Kubernetes (production)

Run vLLM in the same cluster as FCC and wire them together via cluster DNS:

# vllm-deployment.yaml (simplified)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: llm-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - meta-llama/Meta-Llama-3.1-8B-Instruct
            - --host
            - 0.0.0.0
            - --port
            - "8000"
            - --gpu-memory-utilization
            - "0.9"
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: vllm
  namespace: llm-serving
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000

Then install FCC with vLLM as the default provider:

helm install fcc ./charts/fcc \
  --set global.ai.defaultProvider=vllm \
  --set global.ai.vllmBaseUrl=http://vllm.llm-serving.svc.cluster.local:8000/v1 \
  --set global.ai.vllmDefaultModel=meta-llama/Meta-Llama-3.1-8B-Instruct

Native Python (development / tuning)

pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9

Configure FCC

export VLLM_BASE_URL=http://localhost:8000/v1
export VLLM_DEFAULT_MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct
export FCC_DEFAULT_PROVIDER=vllm

fcc scenarios run --scenario basic_routing

Auto-detection

FCC enables the vLLM provider only when VLLM_BASE_URL is explicitly set; the plugin does not probe localhost:8000. This mirrors the conservative behavior of the Ollama plugin and preserves the mock fallback for users who don't intend to use vLLM.
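
In practice the selection rule is equivalent to something like the sketch below (the function name and fallback are illustrative, not the plugin's actual internals):

import os

def resolve_provider() -> str:
    # vLLM is only selected when explicitly configured; localhost:8000 is never probed.
    if os.environ.get("VLLM_BASE_URL"):
        return "vllm"
    # Otherwise FCC keeps its usual default (mock unless FCC_DEFAULT_PROVIDER says otherwise).
    return os.environ.get("FCC_DEFAULT_PROVIDER", "mock")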

Programmatic use

from fcc.simulation.ai_client import AIClient

client = AIClient(provider="vllm")
response = client.complete(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain PagedAttention in one sentence."},
    ],
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    temperature=0.2,
    max_tokens=128,
)
print(response.content)
print(f"Tokens: {response.usage}")
print(f"Latency: {response.latency_ms:.0f} ms")

Per-scenario routing

Pin vLLM (or a specific model served by vLLM) per scenario:

{
  "id": "BENCH-VLLM-001",
  "name": "vLLM Llama 3.1 8B baseline",
  "type": "ai",
  "description": "Benchmark the same workflow against vLLM-served Llama",
  "objectives": ["Measure throughput and latency under load"],
  "setup": {
    "initial_input": "Design a REST API for...",
    "start_node": "RC",
    "personas_involved": ["RC", "BC", "DE"],
    "ai_config": {
      "provider": "vllm",
      "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
      "temperature": 0.0,
      "max_tokens": 2048
    }
  },
  "validation_rules": []
}

GPU memory tuning

vLLM pre-allocates GPU memory for the KV cache on startup. Two knobs:

# How much of the GPU's total memory to use (0.0-1.0)
--gpu-memory-utilization 0.9

# Max number of tokens processed per batch (scheduler iteration)
--max-num-batched-tokens 8192

If you see RuntimeError: CUDA out of memory on startup, reduce --gpu-memory-utilization (try 0.8 or 0.75). If you see requests queueing under heavy load, increase --max-num-batched-tokens.
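
To get a feel for how these knobs relate, here is a back-of-the-envelope KV-cache estimate. It assumes the standard 2 × layers × kv_heads × head_dim × bytes-per-element sizing and Llama 3.1 8B's published shape; treat the numbers as rough guidance, not what vLLM will report exactly:

# Rough fp16 KV-cache sizing for Llama 3.1 8B (32 layers, 8 KV heads via GQA, head_dim 128).
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2

bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes   # K and V
print(f"{bytes_per_token / 1024:.0f} KiB per cached token")        # ~128 KiB

batched_tokens = 8192                                              # --max-num-batched-tokens
print(f"{bytes_per_token * batched_tokens / 1024**3:.1f} GiB at a full batch")  # ~1.0 GiB

With roughly 16 GB of a 24 GB card already taken by the fp16 weights of an 8B model, this is why lowering --gpu-memory-utilization directly trades away KV-cache (and therefore batch) capacity.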

Quantization

vLLM supports AWQ, GPTQ, and FP8 quantized models for ~2-4× smaller memory footprint with minimal quality loss:

# Example: AWQ-quantized Llama 3.1 70B
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Meta-Llama-3.1-70B-Instruct-AWQ \
  --quantization awq

vLLM auto-detects the quantization method from the model's config, so the --quantization flag mainly serves as an explicit override or sanity check.
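
If you want to see what vLLM will pick up before launching, you can inspect the model's config.json on the Hub. A small sketch using huggingface_hub, reusing the repo from the example above:

# Print the quantization_config that vLLM's auto-detection reads from the model config.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("TheBloke/Meta-Llama-3.1-70B-Instruct-AWQ", filename="config.json")
with open(path) as f:
    cfg = json.load(f)
print(cfg.get("quantization_config", "no quantization_config found"))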

Multi-GPU (tensor parallelism)

For models larger than a single GPU:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4

This shards the model across 4 GPUs.
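
Before picking a value, note that --tensor-parallel-size can't exceed the number of visible GPUs, and vLLM expects the model's attention-head count to divide evenly by it. A quick sanity check (the head count here is Llama 3.1 70B's published value; read it from your model's config.json otherwise):

# Sanity-check a tensor-parallel size against visible GPUs and the model's head count.
import torch

tp_size = 4
num_attention_heads = 64  # Llama 3.1 70B; check config.json for other models

assert tp_size <= torch.cuda.device_count(), "not enough visible GPUs"
assert num_attention_heads % tp_size == 0, "attention heads must divide evenly across GPUs"
print(f"--tensor-parallel-size {tp_size} looks feasible")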

Troubleshooting

FCC keeps falling back to mock
Confirm VLLM_BASE_URL is set: echo $VLLM_BASE_URL. Without it, the plugin stays inert.
ConnectionError: [Errno 111] Connection refused
vLLM isn't listening yet. Model load takes 1-5 minutes on first boot (depending on model size and GPU). Check logs: docker logs fcc-vllm
401 Unauthorized
vLLM is behind an auth proxy. Set VLLM_API_KEY env var.
Requests queueing at high concurrency
Normal — increase --max-num-batched-tokens on the vLLM side, or run multiple vLLM replicas.
First request is slow, subsequent fast
vLLM JIT-compiles CUDA kernels for the specific model shape on first request. Pre-warm with a dummy request at startup.
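
A minimal pre-warm request, assuming the default model from this guide, could look like:

# Fire one throwaway completion so warm-up cost is paid before real traffic arrives.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "warm-up"}],
    max_tokens=1,
)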

See also