Hanzo Engine
High-performance LLM inference engine — blazing-fast Rust-based serving with Metal/CUDA acceleration, quantization, vision, audio, and MCP tools
Hanzo Engine is a high-performance, multimodal inference engine written in Rust. It serves LLMs, vision models, speech models, image generation models, and embedding models through an OpenAI-compatible HTTP API, with optional MCP (Model Context Protocol) server support.
Key design goals: zero-config model loading, hardware-aware quantization, and production-grade throughput via PagedAttention and continuous batching.
GitHub: github.com/hanzoai/engine
Docker: ghcr.io/hanzoai/engine
License: Apache-2.0
Features
- Hardware Acceleration: Native Metal (Apple Silicon) and CUDA (NVIDIA) backends, FlashAttention V2/V3, multi-GPU tensor parallelism
- Quantization: GGUF (2-8 bit), GPTQ, AWQ, HQQ, FP8, BNB, AFQ (Metal-optimized), and In-Situ Quantization (ISQ) of any HuggingFace model
- PagedAttention: High-throughput continuous batching on CUDA and Metal with prefix caching and KV cache quantization
- Speculative Decoding: Draft-model acceleration with rejection sampling for 2-3x speedup on supported architectures
- Multimodal: Text, vision, audio/speech, image generation, and embeddings in a single binary
- MCP Integration: Built-in MCP server exposing model tools via JSON-RPC, plus MCP client for connecting to external tool servers
- OpenAI-Compatible API: Drop-in replacement for OpenAI endpoints -- /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/images/generations, /v1/models
- Agentic: Integrated tool calling with Python/Rust callbacks, web search, and structured output (JSON schema, regex, grammar)
- Distributed Inference: Multi-GPU via NCCL tensor parallelism, multi-node via TCP ring topology with heterogeneous device support
- LoRA / X-LoRA: Adapter model support with runtime weight merging
- AnyMoE: Create mixture-of-experts on any base model
- Per-Layer Topology: Fine-tune quantization bit depth per layer for optimal quality/speed tradeoffs
- UQFF: Universal Quantized File Format for portable quantized model distribution
- Web UI: Built-in chat interface at /ui when serving
Architecture
Clients
───────
OpenAI SDK  ─┐
curl / HTTP ─┤──▶ ┌──────────────────────────────────┐
Python SDK  ─┤    │ HTTP Server (axum)               │
Rust SDK    ─┤    │ :36900                           │
MCP Client  ─┘──▶ │ /v1/chat/completions             │
                  │ /v1/completions                  │
                  │ /v1/embeddings                   │
                  │ /v1/models                       │
                  │ /health                          │
                  │ /docs (SwaggerUI)                │
                  └─────────┬────────────────────────┘
                            │
                            ▼
                  ┌──────────────────────────────────┐
                  │ Pipeline Engine                  │
                  │ ───────────────                  │
                  │ • Continuous batching            │
                  │ • PagedAttention + prefix cache  │
                  │ • ISQ / GGUF / GPTQ / AWQ        │
                  │ • Device mapping (multi-GPU)     │
                  │ • Chat template detection        │
                  └─────────┬────────────────────────┘
                            │
           ┌────────────────┼────────────────┐
           ▼                ▼                ▼
   ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐
   │ Metal (GPU)  │ │ CUDA (GPU)   │ │ CPU (fallback)   │
   │ Apple Si     │ │ FlashAttn    │ │ AVX2 / MKL       │
   │ AFQ quants   │ │ Marlin kern  │ │ GGUF / HQQ       │
   └──────────────┘ └──────────────┘ └──────────────────┘

   ┌──────────────────────────────────────────────────────┐
   │ MCP Server (:4321)                                   │
   │ Exposes model tools via JSON-RPC 2.0                 │
   └──────────────────────────────────────────────────────┘

Supported Models
Text Models
| Model | Architectures | Notes |
|---|---|---|
| Llama | Llama 2, 3, 3.1, 3.2 | GGUF supported |
| Mistral | Mistral 7B, Mixtral 8x7B/8x22B | MoE support |
| Qwen | Qwen 2, 3, 3 Next, 3 MoE | Thinking mode |
| Gemma | Gemma, Gemma 2 | |
| Phi | Phi 2, 3, 3.5 MoE | |
| DeepSeek | V2, V3 / R1 | MoE with MoQE |
| GLM | GLM 4, 4.7 Flash, 4.7 MoE | |
| Starcoder | Starcoder 2 | Code models |
| SmolLM | SmolLM 3 | Small models |
| Granite | Granite 4.0 | |
| GPT-OSS | GPT-OSS (Harmony format) | reasoning_effort support |
Vision Models
| Model | Input Types |
|---|---|
| Qwen 3-VL, 3-VL MoE | Image + Text |
| Qwen 2-VL, 2.5-VL | Image + Text |
| Gemma 3, 3n | Image + Text |
| Llama 4, 3.2 Vision | Image + Text |
| Mistral 3 | Image + Text |
| Phi 4 Multimodal, Phi 3V | Image + Text |
| MiniCPM-O | Image + Text |
| Idefics 2, 3 | Image + Text |
| LLaVA, LLaVA Next | Image + Text |
Speech, Image Generation, and Embedding Models
| Category | Models |
|---|---|
| Speech (ASR) | Voxtral, Dia |
| Image Generation | FLUX |
| Embeddings | Embedding Gemma, Qwen 3 Embedding |
Zen Models
Hanzo Engine is the inference backend for the Zen model family, with first-class support for all Zen architectures:
| Model | Parameters | Context | Architecture | Use Case |
|---|---|---|---|---|
| zen4 | 744B MoE (40B active) | 202K | Transformer MoE | Flagship reasoning and generation |
| zen4-max | 1.04T MoE (32B active) | 256K | Transformer MoE | Maximum capability |
| zen4-ultra | 744B MoE + CoT (40B active) | 202K | Transformer MoE | Extended chain-of-thought |
| zen4-pro | 80B MoE (3B active) | 131K | Transformer MoE | High quality, efficient serving |
| zen4-mini | 8B dense | 40K | Transformer | Fast inference, edge deployment |
| zen4-coder | 480B MoE (35B active) | 262K | Transformer MoE | Code generation and analysis |
| zen4-coder-flash | 30B MoE (3B active) | 262K | Transformer MoE | Fast code completion |
| zen4-coder-pro | 480B dense BF16 | 262K | Transformer | Maximum code quality |
| zen3-vl | 30B MoE (3B active) | 131K | Vision-Language MoE | Multimodal understanding |
| zen3-omni | ~200B | 202K | Multimodal | Text, vision, audio |
| zen3-nano | 4B dense | 40K | Transformer | Ultra-lightweight |
| zen3-guard | 4B dense | -- | Classifier | Safety and content filtering |
| zen3-embedding | -- | -- | Embedding (3072-dim) | Search and retrieval |
# Serve any Zen model
hanzo-engine serve -m zenlm/zen4-mini --port 8000
hanzo-engine serve -m zenlm/zen4 --port 8000 --isq Q4K
hanzo-engine serve -m zenlm/zen4-coder --port 8000

Quick Start
Install
Linux / macOS (recommended):
curl -sSL https://engine.hanzo.ai/install.sh | sh

Via Cargo:

cargo install hanzo-engine

Via pip (Python SDK):

pip install hanzo-engine        # CPU
pip install hanzo-engine-cuda   # NVIDIA GPU
pip install hanzo-engine-metal  # Apple Silicon
pip install hanzo-engine-mkl    # Intel CPU

Build from source with the appropriate backend:
# macOS (Metal)
cargo build --package hanzo-engine --release --no-default-features --features metal
# Linux (CUDA)
cargo build --package hanzo-engine --release --features cuda
# Linux (CUDA + FlashAttention)
cargo build --package hanzo-engine --release --features "cuda flash-attn cudnn"

Run a Model
# Interactive chat with auto-detected architecture
hanzo-engine run -m zenlm/zen4-mini
# Start HTTP server with web UI
hanzo-engine serve --ui -m google/gemma-3-4b-it
# Start server with ISQ 4-bit quantization
hanzo-engine serve -m meta-llama/Llama-3.2-3B-Instruct --isq 4
# Run a GGUF quantized model
hanzo-engine run --format gguf -f ./model.Q4_K_M.gguf
# Auto-tune for your hardware
hanzo-engine tune -m zenlm/zen4-mini --emit-config config.toml
hanzo-engine from-config -f config.toml
# System diagnostics
hanzo-engine doctor

The server listens on port 36900 by default. Visit http://localhost:36900/ui for the built-in chat interface, or http://localhost:36900/docs for interactive Swagger API docs.
CLI Commands
| Command | Purpose |
|---|---|
| `run` | Interactive chat mode |
| `serve` | Start HTTP + optional MCP server |
| `from-config` | Run from a TOML config file |
| `quantize` | Generate UQFF quantized model |
| `tune` | Auto-benchmark and recommend settings |
| `doctor` | System diagnostics (CUDA, Metal, HF connectivity) |
| `login` | Authenticate with HuggingFace Hub |
| `cache` | Manage downloaded model cache |
| `bench` | Run performance benchmarks |
API Reference
Hanzo Engine exposes an OpenAI-compatible REST API. Any client that speaks the OpenAI API works out of the box.
Chat Completions
curl http://localhost:36900/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain PagedAttention in one paragraph."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

Using the Python OpenAI SDK
import openai

client = openai.OpenAI(
    base_url="http://localhost:36900/v1",
    api_key="EMPTY",
)

completion = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about Rust."},
    ],
    max_tokens=128,
)
print(completion.choices[0].message.content)

Streaming
Set "stream": true in the request body. The server returns Server-Sent Events (SSE) compatible with the OpenAI streaming format.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Count to 10 slowly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Text Completions
curl http://localhost:36900/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "prompt": "The Rust programming language is",
    "max_tokens": 128
  }'

Embeddings
Serve an embedding model to enable this endpoint:
hanzo-engine serve -m google/embeddinggemma-300m

curl http://localhost:36900/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "input": "Hello, world!"
  }'

Endpoints Summary
| Method | Path | Description |
|---|---|---|
| POST | `/v1/chat/completions` | Chat completions (streaming supported) |
| POST | `/v1/completions` | Text completions |
| POST | `/v1/embeddings` | Vector embeddings |
| POST | `/v1/images/generations` | Image generation (FLUX) |
| GET | `/v1/models` | List loaded models |
| GET | `/health` | Server health check |
| GET | `/docs` | Interactive Swagger UI |
| GET | `/ui` | Built-in chat web UI |
Extended Parameters
Beyond the standard OpenAI API, Hanzo Engine supports additional request parameters:
| Parameter | Type | Description |
|---|---|---|
| `top_k` | int | Top-K sampling |
| `min_p` | float | Min-P sampling threshold |
| `grammar` | object | Constrained generation (regex, JSON schema, Lark grammar, llguidance) |
| `enable_thinking` | bool | Enable chain-of-thought for supported models |
| `truncate_sequence` | bool | Truncate over-length prompts instead of rejecting |
| `repetition_penalty` | float | Multiplicative penalty for repeated tokens |
| `web_search_options` | object | Enable web search integration |
| `reasoning_effort` | string | Reasoning depth for Harmony-format models: low, medium, high |
| `dry_multiplier` | float | DRY anti-repetition penalty strength |
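With the Python OpenAI SDK, these extended parameters can be passed through `extra_body`, since they are not part of the standard client signature. A minimal sketch — the sample values and the `{"type": ..., "value": ...}` shape of the grammar object are illustrative assumptions, not a documented schema:

```python
import json

# Illustrative extended-parameter payload (values are assumptions; the
# exact grammar-object schema may differ in your Engine version).
extra_params = {
    "top_k": 40,              # Top-K sampling
    "min_p": 0.05,            # Min-P threshold
    "enable_thinking": True,  # chain-of-thought, supported models only
    "grammar": {              # constrained generation (assumed shape)
        "type": "json_schema",
        "value": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
        },
    },
}

# With the OpenAI SDK this would be sent as (not executed here):
#   client.chat.completions.create(model="default", messages=msgs,
#                                  extra_body=extra_params)
print(json.dumps(extra_params, indent=2))
```

Anything placed in `extra_body` is merged into the JSON request body, so the server sees these fields alongside the standard ones.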
Quantization Guide
Hanzo Engine supports a wide range of quantization methods for reducing model size and increasing throughput.
In-Situ Quantization (ISQ)
ISQ quantizes model weights during loading -- the full unquantized model never needs to fit in memory. Just pass --isq <bits>:
# 4-bit quantization (auto-selects best method for your hardware)
hanzo-engine serve -m meta-llama/Llama-3.2-3B-Instruct --isq 4
# 8-bit quantization
hanzo-engine serve -m zenlm/zen4-mini --isq 8

On Metal, ISQ uses AFQ (affine quantization) optimized for Apple Silicon. On CUDA, it uses Q/K quantization for best throughput.
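As a back-of-envelope check on why ISQ helps, the weights-only footprint scales linearly with bit width (quantization scales and metadata add a small overhead not counted here):

```python
# Approximate weights-only memory for an 8B-parameter model at
# different bit widths. KV cache and activations are extra.
PARAMS = 8e9

def weight_gb(bits_per_weight: float) -> float:
    """Bytes needed for PARAMS weights at the given bit width, in GB."""
    return PARAMS * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16), ("ISQ 8-bit", 8), ("ISQ 4-bit", 4)]:
    print(f"{label:>10}: ~{weight_gb(bits):.0f} GB")
# FP16 ~16 GB, 8-bit ~8 GB, 4-bit ~4 GB
```

This is why ISQ at 4 bits lets an 8B model fit comfortably on consumer GPUs and Apple Silicon unified memory.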
Quantization Methods
| Method | Bit Widths | Devices | Notes |
|---|---|---|---|
| ISQ (auto) | 2, 3, 4, 5, 6, 8 | All | Hardware-aware auto-selection |
| GGUF | 2-8 bit (Q/K types) | CPU, CUDA, Metal | Most portable format |
| GPTQ | 2, 3, 4, 8 | CUDA only | Marlin kernel for 4/8-bit |
| AWQ | 4, 8 | CUDA only | Marlin kernel for 4/8-bit |
| HQQ | 4, 8 | All | Half-quadratic quantization |
| FP8 | 8 | All | E4M3 floating point |
| BNB | 4, 8 | CUDA | bitsandbytes int8, fp4, nf4 |
| AFQ | 2, 3, 4, 6, 8 | Metal only | Fastest on Apple Silicon |
| MLX | Pre-quantized | Metal | Apple MLX format |
Running GGUF Models
# Direct GGUF file
hanzo-engine run --format gguf -f ./zen4-mini-Q4_K_M.gguf
# Auto-detect GPTQ model from HuggingFace
hanzo-engine run -m kaitchup/Phi-3-mini-4k-instruct-gptq-4bit

Per-Layer Topology
Fine-tune quantization per layer using a topology file. This lets you keep attention layers at higher precision while aggressively quantizing feed-forward layers:
hanzo-engine serve -m zenlm/zen4-mini --topology topology.toml

PagedAttention
PagedAttention accelerates inference and enables high-throughput continuous batching:
- CUDA: Enabled by default; disable with --no-paged-attn
- Metal: Opt-in with --paged-attn
- Prefix caching: Reuses KV cache blocks across requests sharing common prefixes (system prompts)
- KV cache quantization: FP8 (E4M3) reduces KV cache memory by ~50%
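To see what FP8 cache quantization buys, note that per-token KV cache size is 2 (K and V) × layers × KV heads × head dim × bytes per element. The dimensions below are illustrative assumptions for an ~8B GQA model, not measured values:

```python
# Illustrative dimensions (assumed): 32 layers, 8 KV heads (GQA), head dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_bytes_per_token(dtype_bytes: int) -> int:
    # One K and one V tensor per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * dtype_bytes

fp16 = kv_bytes_per_token(2)  # FP16 cache
fp8 = kv_bytes_per_token(1)   # f8e4m3 cache
print(f"per token: {fp16 // 1024} KiB (FP16) vs {fp8 // 1024} KiB (FP8)")
print(f"32K-token context: {fp16 * 32768 / 2**30:.0f} GiB "
      f"vs {fp8 * 32768 / 2**30:.0f} GiB")
```

Halving the per-element size halves the whole cache, which directly translates into longer usable contexts or more concurrent sequences per GPU.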
# Custom KV cache memory allocation
hanzo-engine serve -m zenlm/zen4-mini --paged-attn --paged-attn-gpu-memory 8GiB
# KV cache quantization for longer contexts
hanzo-engine serve -m zenlm/zen4-mini --paged-attn --paged-cache-type f8e4m3

MCP Integration
Hanzo Engine can act as both an MCP server (exposing model tools) and an MCP client (connecting to external tool servers).
MCP Server
Expose model capabilities as MCP tools alongside the HTTP API:
hanzo-engine serve \
  -m zenlm/zen4-mini \
  --port 36900 \
  --mcp-port 4321

The MCP server exposes tools based on loaded model modalities (text, vision, etc.) via JSON-RPC 2.0.
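A minimal JSON-RPC 2.0 message for discovering the exposed tools — `tools/list` is the standard MCP method name, but the transport details (HTTP POST to the MCP port vs. stdio) depend on your setup:

```python
import json

# JSON-RPC 2.0 envelope asking the MCP server which tools it exposes.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
    "params": {},
}
payload = json.dumps(request)
print(payload)
# Send `payload` to the MCP endpoint (e.g. http://localhost:4321); the
# response lists tool names and input schemas derived from the loaded
# model's modalities.
```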
MCP Client
Connect to external MCP tool servers so the model can call external tools during inference:
hanzo-engine serve \
  -m zenlm/zen4-mini \
  --mcp-client-config mcp-tools.json

Python SDK
from hanzo_engine import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.Plain(model_id="zenlm/zen4-mini"),
    in_situ_quant="4",
)

response = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=256,
    )
)
print(response.choices[0].message.content)

Rust SDK
use anyhow::Result;
use hanzo_engine::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("zenlm/zen4-mini")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Hello!");

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    Ok(())
}

Docker
# CPU
docker run -p 8000:8000 ghcr.io/hanzoai/engine:latest \
  serve -m zenlm/zen4-mini --port 8000

# NVIDIA GPU
docker run --gpus all -p 8000:8000 ghcr.io/hanzoai/engine:cuda \
  serve -m zenlm/zen4 --port 8000

# Apple Silicon (Metal)
docker run -p 8000:8000 ghcr.io/hanzoai/engine:metal \
  serve -m zenlm/zen4-mini --port 8000

Docker Compose
services:
  engine:
    image: ghcr.io/hanzoai/engine:cuda
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - model-cache:/root/.cache/huggingface
    command: serve -m zenlm/zen4 --port 8000
    restart: unless-stopped

volumes:
  model-cache:

Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hanzo-engine
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hanzo-engine
  template:
    metadata:
      labels:
        app: hanzo-engine
    spec:
      containers:
        - name: engine
          image: ghcr.io/hanzoai/engine:cuda
          command: ["hanzo-engine", "serve", "-m", "zenlm/zen4", "--port", "8000"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: hanzo-engine
spec:
  selector:
    app: hanzo-engine
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP

Distributed Inference
Multi-GPU (NCCL)
For models that exceed single-GPU memory, use NCCL tensor parallelism:
# 2x GPU tensor parallelism
HANZO_ENGINE_LOCAL_WORLD_SIZE=2 hanzo-engine serve \
  -m zenlm/zen4 --port 8000

# 4x GPU
HANZO_ENGINE_LOCAL_WORLD_SIZE=4 hanzo-engine serve \
  -m zenlm/zen4-max --port 8000

Multi-Node (Ring)
Ring topology supports heterogeneous setups -- mix Metal, CUDA, and CPU nodes in a single inference cluster:
# Node 0 (master)
RING_CONFIG=ring_node0.json hanzo-engine serve -m zenlm/zen4 --port 8000
# Node 1
RING_CONFIG=ring_node1.json hanzo-engine serve -m zenlm/zen4 --port 8001

Speculative Decoding
Draft-model acceleration uses a smaller model to generate candidate tokens, then validates them with the full model. This achieves 2-3x speedup for autoregressive generation:
[speculative]
draft_model = "zenlm/zen4-mini"
gamma = 16

hanzo-engine from-config -f config.toml

Performance Benchmarks
Representative benchmarks with continuous batching enabled:
| Model | Hardware | Quantization | Throughput (tok/s) | Latency (TTFT) | Memory |
|---|---|---|---|---|---|
| zen4-mini (8B) | 1x A100 80GB | FP16 | ~2,400 | 28ms | 16 GB |
| zen4-mini (8B) | 1x A100 80GB | Q4K ISQ | ~3,800 | 18ms | 5 GB |
| zen4-mini (8B) | M3 Max 64GB | Metal | ~85 | 120ms | 16 GB |
| zen4-mini (8B) | M3 Max 64GB | Q4K ISQ | ~110 | 80ms | 5 GB |
| zen4-pro (80B MoE) | 1x A100 80GB | Q4K ISQ | ~950 | 65ms | 42 GB |
| zen4 (744B MoE) | 4x H100 | FP8 + NCCL | ~1,200 | 180ms | 280 GB |
| zen4 (744B MoE) | 8x A100 80GB | Q4K + NCCL | ~800 | 250ms | 320 GB |
Run your own benchmarks:
hanzo-engine bench -m zenlm/zen4-mini --isq Q4K
hanzo-engine tune -m zenlm/zen4-mini --emit-config optimal.toml

Configuration
Environment Variables
| Variable | Description |
|---|---|
| `HANZO_ENGINE_PORT` | Server port (default: 36900) |
| `HANZO_ENGINE_LOCAL_WORLD_SIZE` | Number of GPUs for NCCL tensor parallelism |
| `HANZO_ENGINE_NO_NCCL` | Set to 1 to disable NCCL and use device mapping instead |
| `RING_CONFIG` | Path to ring topology JSON config for multi-node inference |
| `KEEP_ALIVE_INTERVAL` | SSE keep-alive interval in ms |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OpenTelemetry export endpoint |
| `OTEL_SERVICE_NAME` | Service name for traces |
| `HF_TOKEN` | HuggingFace Hub authentication token |
TOML Configuration
For reproducible deployments, use a TOML config:
# Generate optimal config for your hardware
hanzo-engine tune -m zenlm/zen4-mini --emit-config config.toml
# Run from config
hanzo-engine from-config -f config.toml

Performance Tips
- Use ISQ: --isq 4 gives the best balance of quality and speed on most hardware
- Enable PagedAttention: Required for high-throughput batched serving
- Run tune: hanzo-engine tune benchmarks your hardware and emits optimal settings
- KV cache quantization: --paged-cache-type f8e4m3 halves KV cache memory for longer contexts
- Multi-GPU: Use device mapping for models that exceed single-GPU memory
- FlashAttention: Add --features flash-attn at build time for CUDA (significant speedup)
Related Services
- On-device inference runtime -- mobile, web (WASM), and embedded deployment
- LLM API gateway -- routes requests to Engine and 100+ external providers
- Rust ML framework powering Engine tensor operations
- Platform management, billing, and hosted inference