Self-hosted AI infrastructure is only as secure as the server it runs on. Many of these services are quick to stand up, and some bind to all interfaces by default. This guide covers hardening from day one: firewall, API authentication, rate limiting, network isolation, and AI-specific threats.

Threat Model

  • External: unauthorized API access (GPU time theft), prompt injection attacks, data exfiltration, DDoS exhausting GPU memory
  • Internal: lateral movement, credential stuffing, accidental exposure
  • AI-specific: jailbreaking, data leakage through prompts, toxic output generation

Step 1 — UFW Firewall

sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp comment "SSH"
sudo ufw allow from 192.168.1.0/24 to any port 11434 comment "Ollama LAN"
sudo ufw allow from 192.168.1.0/24 to any port 8080 comment "LocalAI LAN"
sudo ufw allow 443/tcp comment "HTTPS"
# Explicit denies are redundant given the default-deny policy, but document intent
sudo ufw deny 11434
sudo ufw deny 8080
sudo ufw enable
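
After enabling UFW, verify that only the intended ports answer. A minimal reachability check using the standard library (the host address and port list below are examples; substitute your server's LAN address):

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Example scan of the ports from the rules above
    for port in (22, 443, 8080, 11434):
        state = "open" if port_open("192.168.1.50", port) else "filtered/closed"
        print(f"{port}: {state}")
```

Run it from a machine outside the allowed 192.168.1.0/24 range as well: 11434 and 8080 should then report filtered/closed.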

Step 2 — Ollama Hardening

Ollama binds to 127.0.0.1 by default, so it is not reachable from the network out of the box. To tighten local access further, switch from TCP to a Unix socket, which is governed by file permissions:

# Add a systemd override (sudo systemctl edit ollama):
#   Environment="OLLAMA_HOST=unix:///var/run/ollama.sock"
sudo systemctl restart ollama
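
Clients then need to speak HTTP over that socket. A stdlib-only sketch, assuming the socket path from the override above and Ollama's /api/generate endpoint (model name is a placeholder):

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a Unix-domain socket path."""

    def __init__(self, socket_path: str, timeout: float = 30.0):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.settimeout(self.timeout)
        self.sock.connect(self.socket_path)


def generate(prompt: str, model: str = "llama3",
             socket_path: str = "/var/run/ollama.sock") -> dict:
    conn = UnixHTTPConnection(socket_path)
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    conn.request("POST", "/api/generate", body,
                 {"Content-Type": "application/json"})
    resp = conn.getresponse()
    return json.loads(resp.read())
```

Restrict who can use the API with ordinary file permissions, e.g. chown/chmod on /var/run/ollama.sock.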

Step 3 — API Key Authentication

nginx + htpasswd for LocalAI/vLLM:

sudo apt install apache2-utils
sudo htpasswd -c -B /etc/nginx/.htpasswd ollama_user  # -B uses bcrypt instead of the weak default hash
# Add to nginx: auth_basic "AI Access"; auth_basic_user_file /etc/nginx/.htpasswd;

vLLM bearer token: start vLLM's OpenAI-compatible server with --api-key YOUR_TOKEN; clients must then send an Authorization: Bearer YOUR_TOKEN header.
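
A minimal client sketch for a bearer-token-protected OpenAI-compatible endpoint; the URL, model name, and token below are placeholders:

```python
import json
import urllib.request


def build_chat_request(url: str, token: str, model: str,
                       prompt: str) -> urllib.request.Request:
    """Build a POST request with the bearer token attached."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Example call (assumes a server is listening):
# resp = urllib.request.urlopen(build_chat_request(
#     "http://127.0.0.1:8000/v1/chat/completions",
#     "YOUR_TOKEN", "my-model", "Hello"))
```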

Step 4 — Rate Limiting

limit_req_zone $binary_remote_addr zone=ai_api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=ai_chat:10m rate=5r/m;
# Zones only take effect where applied, e.g. inside a location block:
# location /v1/ { limit_req zone=ai_api burst=20 nodelay; ... }
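
nginx's limit_req is a leaky/token bucket per client IP. The same policy can be sketched in application code; rate and burst here mirror the ai_api zone, and the clock parameter is injected so behavior is deterministic:

```python
import time


class TokenBucket:
    """Token-bucket limiter: allow `burst` requests at once, refill at `rate`/s."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.burst = burst          # maximum bucket size
        self.tokens = float(burst)  # start full
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


limiter = TokenBucket(rate=10.0, burst=20)  # matches zone=ai_api with burst=20
```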

Step 5 — Docker Network Isolation

docker network create --driver bridge ai-network
docker run -d --name ollama --network ai-network ollama/ollama
docker run -d --name openwebui --network ai-network -p 8080:8080 -e OLLAMA_BASE_URL=http://ollama:11434 ghcr.io/open-webui/open-webui:main
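
The same setup can be captured declaratively. A docker-compose.yml sketch (image tags and names taken from the commands above; everything else is an assumption to adapt):

```yaml
networks:
  ai-network:
    driver: bridge

services:
  ollama:
    image: ollama/ollama
    networks: [ai-network]          # no ports: reachable only on ai-network

  openwebui:
    image: ghcr.io/open-webui/open-webui:main
    ports: ["8080:8080"]            # only the UI is published to the host
    environment:
      OLLAMA_BASE_URL: http://ollama:11434
    networks: [ai-network]
    depends_on: [ollama]
```

Note that Ollama publishes no ports; only Open WebUI is exposed, and it reaches Ollama over the private bridge network.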

Step 6 — Fail2Ban

sudo apt install fail2ban
sudo systemctl enable fail2ban
# Configure in /etc/fail2ban/jail.local for SSH and nginx-http-auth
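
A jail.local sketch covering those two jails; thresholds and log paths are examples to tune for your distribution:

```ini
# /etc/fail2ban/jail.local
[sshd]
enabled  = true
maxretry = 5
bantime  = 1h

[nginx-http-auth]
enabled  = true
logpath  = /var/log/nginx/error.log
maxretry = 5
bantime  = 1h
```

Apply with sudo systemctl restart fail2ban, then check sudo fail2ban-client status.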

Step 7 — GPU Memory Limits

Prevent resource exhaustion with systemd limits, e.g. MemoryMax=32G in the AI service's unit. Note that MemoryMax caps host RAM, not GPU VRAM; bound VRAM use separately by limiting model size and concurrency (Ollama, for example, exposes OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL).
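
A drop-in override sketch (the ollama service name and values are examples):

```ini
# /etc/systemd/system/ollama.service.d/limits.conf
[Service]
MemoryMax=32G
CPUQuota=400%
```

Apply with sudo systemctl daemon-reload && sudo systemctl restart ollama, then verify with systemctl show ollama -p MemoryMax.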

Step 8 — Prompt Injection Defense

import re

def sanitize_prompt(user_input: str) -> str:
    # Blacklist filtering is easily bypassed (paraphrase, encoding, other
    # languages); treat it as one layer of defense, never the only one.
    dangerous = [r"ignore previous", r"disregard your", r"system prompt:"]
    for p in dangerous:
        user_input = re.sub(p, "[filtered]", user_input, flags=re.IGNORECASE)
    return user_input[:4096]  # cap length to limit context-stuffing

Security Checklist

  • UFW enabled, default deny incoming
  • AI services on localhost or behind firewall
  • API key auth on all network-accessible endpoints
  • Rate limiting configured
  • Fail2Ban running
  • Docker networks isolating services
  • Systemd memory/CPU limits
  • Monitoring for GPU utilization anomalies