# LLM Gateway

One API for 200+ language models from every major provider. OpenAI-compatible interface with load balancing, caching, rate limiting, fallbacks, and full observability.
```bash
curl https://llm.hanzo.ai/v1/chat/completions \
  -H "Authorization: Bearer $HANZO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5-20250929",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

## Why Hanzo LLM Gateway?
- 200+ Models — Claude, GPT, Gemini, Llama, Mistral, and more
- OpenAI Compatible — Drop-in replacement, use any OpenAI SDK
- Smart Routing — Automatic fallbacks, load balancing, model selection
- Cost Control — Per-key budgets, rate limits, usage analytics
- Caching — Semantic and exact caching to reduce costs up to 90%
- Observability — Full request logging, latency tracking, token analytics
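Exact caching works by keying responses on the full request payload, so a repeated request never reaches the provider twice. A minimal sketch of that idea (the function and variable names here are illustrative, not the gateway's actual internals):

```python
import hashlib
import json

def cache_key(model: str, messages: list[dict]) -> str:
    """Derive a deterministic cache key from a chat request.

    Exact caching: two requests with the same model and messages
    (regardless of dict key order) map to the same key.
    """
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# In-memory store for illustration; a real gateway would use a shared cache.
_cache: dict[str, str] = {}

def cached_completion(model, messages, call_llm):
    """Return a cached response, calling the provider only on a miss."""
    key = cache_key(model, messages)
    if key not in _cache:
        _cache[key] = call_llm(model, messages)  # only pay for a miss
    return _cache[key]
```

Semantic caching extends this by matching on embedding similarity rather than an exact hash, which is how near-duplicate prompts can also be served from cache.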
## Supported Providers
| Provider | Models | Features |
|---|---|---|
| Anthropic | Claude 4.x, Claude 3.x | Vision, tool use, extended context |
| OpenAI | GPT-4o, o3, o4-mini | Function calling, vision, DALL-E |
| Google | Gemini 2.x, PaLM | Multimodal, grounding |
| Meta | Llama 3.x, Llama 4 | Open source, self-hosted |
| Mistral | Mistral Large, Codestral | European, code generation |
| Together AI | 50+ open models | Fast inference, fine-tuning |
| Groq | Llama, Mixtral | Fastest inference |
| Zen LM | 600M-480B | Frontier open-weight models |
## SDK Usage

### Python
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-hanzo-api-key",
    base_url="https://llm.hanzo.ai/v1",
)

response = client.chat.completions.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)
```

### TypeScript
```typescript
import OpenAI from 'openai'

const client = new OpenAI({
  apiKey: process.env.HANZO_API_KEY,
  baseURL: 'https://llm.hanzo.ai/v1',
})

const completion = await client.chat.completions.create({
  model: 'claude-sonnet-4-5-20250929',
  messages: [{ role: 'user', content: 'Explain quantum computing' }],
})
console.log(completion.choices[0].message.content)
```

## Key Features
- Smart Routing — Automatic fallbacks between providers when one is down
- Cost Management — Per-key budgets, rate limits, and usage analytics
- Semantic Caching — Cache similar requests to reduce costs up to 90%
- Guardrails — Content filtering, PII detection, and safety controls
- Observability — Full request logging with latency and token analytics
- Fine-tuning — Custom model training via Together AI and Zen LM
## Configuration

```yaml
model_list:
  - model_name: "default"
    litellm_params:
      model: "anthropic/claude-sonnet-4-5-20250929"
      api_key: "os.environ/ANTHROPIC_API_KEY"
  - model_name: "default"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: "os.environ/OPENAI_API_KEY"
  - model_name: "fast"
    litellm_params:
      model: "groq/llama-3.1-70b"
      api_key: "os.environ/GROQ_API_KEY"

router_settings:
  routing_strategy: "latency-based-routing"
  fallbacks:
    - default: ["fast"]
```

## API Endpoints
| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Chat completions (streaming supported) |
| `POST /v1/completions` | Text completions |
| `POST /v1/embeddings` | Text embeddings |
| `POST /v1/images/generations` | Image generation |
| `GET /v1/models` | List available models |
| `POST /key/generate` | Create API keys with budgets |
| `GET /key/info` | Key usage and budget info |
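`/key/generate` lets you attach a spending cap to each key; conceptually, enforcement is just comparing a key's accumulated spend against its budget before admitting a request. A toy sketch of that bookkeeping (field and method names are illustrative, not the gateway's schema):

```python
from dataclasses import dataclass

@dataclass
class ApiKey:
    """Per-key budget state, in the spirit of /key/generate budgets."""
    max_budget_usd: float
    spend_usd: float = 0.0

    def admit(self, estimated_cost_usd: float) -> bool:
        """Reject a request whose cost would push the key over budget."""
        return self.spend_usd + estimated_cost_usd <= self.max_budget_usd

    def record(self, actual_cost_usd: float) -> None:
        """Accumulate actual spend after the provider call completes."""
        self.spend_usd += actual_cost_usd
```

The real gateway also tracks token counts and rate limits per key; `GET /key/info` exposes the resulting usage figures.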
## Related

- Hanzo Chat — AI chat with Zen models, 100+ third-party models, MCP and ZAP tools
- Gateway Service — Full API reference and deployment guide
- MCP — Model Context Protocol tools
- Hanzo Dev — AI coding agent built on the gateway