Hanzo Engine

High-performance LLM inference engine — blazing-fast Rust-based serving with Metal/CUDA acceleration, quantization, vision, audio, and MCP tools

Hanzo Engine is a high-performance, multimodal inference engine written in Rust. It serves LLMs, vision models, speech models, image generation models, and embedding models through an OpenAI-compatible HTTP API, with optional MCP (Model Context Protocol) server support.

Key design goals: zero-config model loading, hardware-aware quantization, and production-grade throughput via PagedAttention and continuous batching.

GitHub: github.com/hanzoai/engine
Docker: ghcr.io/hanzoai/engine
License: Apache-2.0

Features

  • Hardware Acceleration: Native Metal (Apple Silicon) and CUDA (NVIDIA) backends, FlashAttention V2/V3, multi-GPU tensor parallelism
  • Quantization: GGUF (2-8 bit), GPTQ, AWQ, HQQ, FP8, BNB, AFQ (Metal-optimized), and In-Situ Quantization (ISQ) of any HuggingFace model
  • PagedAttention: High-throughput continuous batching on CUDA and Metal with prefix caching and KV cache quantization
  • Speculative Decoding: Draft-model acceleration with rejection sampling for 2-3x speedup on supported architectures
  • Multimodal: Text, vision, audio/speech, image generation, and embeddings in a single binary
  • MCP Integration: Built-in MCP server exposing model tools via JSON-RPC, plus MCP client for connecting to external tool servers
  • OpenAI-Compatible API: Drop-in replacement for OpenAI endpoints -- /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/images/generations, /v1/models
  • Agentic: Integrated tool calling with Python/Rust callbacks, web search, and structured output (JSON schema, regex, grammar)
  • Distributed Inference: Multi-GPU via NCCL tensor parallelism, multi-node via TCP ring topology with heterogeneous device support
  • LoRA / X-LoRA: Adapter model support with runtime weight merging
  • AnyMoE: Create mixture-of-experts on any base model
  • Per-Layer Topology: Fine-tune quantization bit depth per layer for optimal quality/speed tradeoffs
  • UQFF: Universal Quantized File Format for portable quantized model distribution
  • Web UI: Built-in chat interface at /ui when serving

Architecture

┌───────────────────────────────────────────────────────────────────┐
│                           Hanzo Engine                            │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  Clients                                                          │
│  ───────                                                          │
│  OpenAI SDK  ─┐                                                   │
│  curl / HTTP ─┤    ┌──────────────────────────────────┐           │
│  Python SDK  ─┼──▶ │  HTTP Server (axum)              │           │
│  Rust SDK    ─┤    │  :36900                          │           │
│  MCP Client  ─┘    │  /v1/chat/completions            │           │
│                    │  /v1/completions                 │           │
│                    │  /v1/embeddings                  │           │
│                    │  /v1/models                      │           │
│                    │  /health                         │           │
│                    │  /docs (SwaggerUI)               │           │
│                    └─────────┬────────────────────────┘           │
│                              │                                    │
│                              ▼                                    │
│                    ┌──────────────────────────────────┐           │
│                    │  Pipeline Engine                 │           │
│                    │  ───────────────                 │           │
│                    │  • Continuous batching           │           │
│                    │  • PagedAttention + prefix cache │           │
│                    │  • ISQ / GGUF / GPTQ / AWQ       │           │
│                    │  • Device mapping (multi-GPU)    │           │
│                    │  • Chat template detection       │           │
│                    └─────────┬────────────────────────┘           │
│                              │                                    │
│             ┌────────────────┼────────────────┐                   │
│             ▼                ▼                ▼                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐         │
│  │  Metal (GPU) │  │  CUDA (GPU)  │  │  CPU (fallback)  │         │
│  │  Apple Si    │  │  FlashAttn   │  │  AVX2 / MKL      │         │
│  │  AFQ quants  │  │  Marlin kern │  │  GGUF / HQQ      │         │
│  └──────────────┘  └──────────────┘  └──────────────────┘         │
│                                                                   │
│  ┌──────────────────────────────────────────────────────┐         │
│  │  MCP Server (:4321)                                  │         │
│  │  Exposes model tools via JSON-RPC 2.0                │         │
│  └──────────────────────────────────────────────────────┘         │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘

Supported Models

Text Models

| Model | Architectures | Notes |
|---|---|---|
| Llama | Llama 2, 3, 3.1, 3.2 | GGUF supported |
| Mistral | Mistral 7B, Mixtral 8x7B/8x22B | MoE support |
| Qwen | Qwen 2, 3, 3 Next, 3 MoE | Thinking mode |
| Gemma | Gemma, Gemma 2 | |
| Phi | Phi 2, 3, 3.5 MoE | |
| DeepSeek | V2, V3 / R1 | MoE with MoQE |
| GLM | GLM 4, 4.7 Flash, 4.7 MoE | |
| Starcoder | Starcoder 2 | Code models |
| SmolLM | SmolLM 3 | Small models |
| Granite | Granite 4.0 | |
| GPT-OSS | GPT-OSS (Harmony format) | reasoning_effort support |

Vision Models

| Model | Input Types |
|---|---|
| Qwen 3-VL, 3-VL MoE | Image + Text |
| Qwen 2-VL, 2.5-VL | Image + Text |
| Gemma 3, 3n | Image + Text |
| Llama 4, 3.2 Vision | Image + Text |
| Mistral 3 | Image + Text |
| Phi 4 Multimodal, Phi 3V | Image + Text |
| MiniCPM-O | Image + Text |
| Idefics 2, 3 | Image + Text |
| LLaVA, LLaVA Next | Image + Text |

Speech, Image Generation, and Embedding Models

| Category | Models |
|---|---|
| Speech (ASR) | Voxtral, Dia |
| Image Generation | FLUX |
| Embeddings | Embedding Gemma, Qwen 3 Embedding |

Zen Models

Hanzo Engine is the inference backend for the Zen model family and provides first-class support for all Zen architectures:

| Model | Parameters | Context | Architecture | Use Case |
|---|---|---|---|---|
| zen4 | 744B MoE (40B active) | 202K | Transformer MoE | Flagship reasoning and generation |
| zen4-max | 1.04T MoE (32B active) | 256K | Transformer MoE | Maximum capability |
| zen4-ultra | 744B MoE + CoT (40B active) | 202K | Transformer MoE | Extended chain-of-thought |
| zen4-pro | 80B MoE (3B active) | 131K | Transformer MoE | High quality, efficient serving |
| zen4-mini | 8B dense | 40K | Transformer | Fast inference, edge deployment |
| zen4-coder | 480B MoE (35B active) | 262K | Transformer MoE | Code generation and analysis |
| zen4-coder-flash | 30B MoE (3B active) | 262K | Transformer MoE | Fast code completion |
| zen4-coder-pro | 480B dense BF16 | 262K | Transformer | Maximum code quality |
| zen3-vl | 30B MoE (3B active) | 131K | Vision-Language MoE | Multimodal understanding |
| zen3-omni | ~200B | 202K | Multimodal | Text, vision, audio |
| zen3-nano | 4B dense | 40K | Transformer | Ultra-lightweight |
| zen3-guard | 4B dense | -- | Classifier | Safety and content filtering |
| zen3-embedding | -- | -- | Embedding (3072-dim) | Search and retrieval |

# Serve any Zen model
hanzo-engine serve -m zenlm/zen4-mini --port 8000
hanzo-engine serve -m zenlm/zen4 --port 8000 --isq Q4K
hanzo-engine serve -m zenlm/zen4-coder --port 8000

Quick Start

Install

Linux / macOS (recommended):

curl -sSL https://engine.hanzo.ai/install.sh | sh

Via Cargo:

cargo install hanzo-engine

Via pip (Python SDK):

pip install hanzo-engine               # CPU
pip install hanzo-engine-cuda          # NVIDIA GPU
pip install hanzo-engine-metal         # Apple Silicon
pip install hanzo-engine-mkl           # Intel CPU

Build from source with the appropriate backend:

# macOS (Metal)
cargo build --package hanzo-engine --release --no-default-features --features metal

# Linux (CUDA)
cargo build --package hanzo-engine --release --features cuda

# Linux (CUDA + FlashAttention)
cargo build --package hanzo-engine --release --features "cuda flash-attn cudnn"

Run a Model

# Interactive chat with auto-detected architecture
hanzo-engine run -m zenlm/zen4-mini

# Start HTTP server with web UI
hanzo-engine serve --ui -m google/gemma-3-4b-it

# Start server with ISQ 4-bit quantization
hanzo-engine serve -m meta-llama/Llama-3.2-3B-Instruct --isq 4

# Run a GGUF quantized model
hanzo-engine run --format gguf -f ./model.Q4_K_M.gguf

# Auto-tune for your hardware
hanzo-engine tune -m zenlm/zen4-mini --emit-config config.toml
hanzo-engine from-config -f config.toml

# System diagnostics
hanzo-engine doctor

The server listens on port 36900 by default. Visit http://localhost:36900/ui for the built-in chat interface, or http://localhost:36900/docs for interactive Swagger API docs.
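Once the server is up, you can sanity-check it against the health and model-listing endpoints documented below:

# Confirm the server is healthy
curl http://localhost:36900/health

# List the loaded models
curl http://localhost:36900/v1/models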

CLI Commands

| Command | Purpose |
|---|---|
| run | Interactive chat mode |
| serve | Start HTTP + optional MCP server |
| from-config | Run from a TOML config file |
| quantize | Generate UQFF quantized model |
| tune | Auto-benchmark and recommend settings |
| doctor | System diagnostics (CUDA, Metal, HF connectivity) |
| login | Authenticate with HuggingFace Hub |
| cache | Manage downloaded model cache |
| bench | Run performance benchmarks |

API Reference

Hanzo Engine exposes an OpenAI-compatible REST API. Any client that speaks the OpenAI API works out of the box.

Chat Completions

curl http://localhost:36900/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain PagedAttention in one paragraph."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'

Using the Python OpenAI SDK

import openai

client = openai.OpenAI(
    base_url="http://localhost:36900/v1",
    api_key="EMPTY",
)

completion = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about Rust."},
    ],
    max_tokens=128,
)

print(completion.choices[0].message.content)

Streaming

Set "stream": true in the request body. The server returns Server-Sent Events (SSE) compatible with the OpenAI streaming format.

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Count to 10 slowly."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Text Completions

curl http://localhost:36900/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "prompt": "The Rust programming language is",
    "max_tokens": 128
  }'

Embeddings

Serve an embedding model to enable this endpoint:

hanzo-engine serve -m google/embeddinggemma-300m

curl http://localhost:36900/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "input": "Hello, world!"
  }'
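
The OpenAI Python SDK works here as well; this mirrors the curl request above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:36900/v1", api_key="EMPTY")

# "input" also accepts a list of strings for batched embedding
response = client.embeddings.create(model="default", input="Hello, world!")
print(len(response.data[0].embedding))  # vector dimensionality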

Endpoints Summary

| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | Chat completions (streaming supported) |
| POST | /v1/completions | Text completions |
| POST | /v1/embeddings | Vector embeddings |
| POST | /v1/images/generations | Image generation (FLUX) |
| GET | /v1/models | List loaded models |
| GET | /health | Server health check |
| GET | /docs | Interactive Swagger UI |
| GET | /ui | Built-in chat web UI |

Extended Parameters

Beyond the standard OpenAI API, Hanzo Engine supports additional request parameters:

| Parameter | Type | Description |
|---|---|---|
| top_k | int | Top-K sampling |
| min_p | float | Min-P sampling threshold |
| grammar | object | Constrained generation (regex, JSON schema, Lark grammar, llguidance) |
| enable_thinking | bool | Enable chain-of-thought for supported models |
| truncate_sequence | bool | Truncate over-length prompts instead of rejecting |
| repetition_penalty | float | Multiplicative penalty for repeated tokens |
| web_search_options | object | Enable web search integration |
| reasoning_effort | string | Reasoning depth for Harmony-format models: low, medium, high |
| dry_multiplier | float | DRY anti-repetition penalty strength |
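
With the OpenAI Python SDK, these extra fields can be forwarded through extra_body, which the SDK merges into the request JSON. Reusing the client from the chat example above:

completion = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Pick a random fruit."}],
    max_tokens=64,
    # Engine-specific parameters from the table above,
    # sent alongside the standard OpenAI fields
    extra_body={
        "top_k": 40,
        "min_p": 0.05,
        "repetition_penalty": 1.1,
    },
)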

Quantization Guide

Hanzo Engine supports a wide range of quantization methods for reducing model size and increasing throughput.

In-Situ Quantization (ISQ)

ISQ quantizes model weights during loading -- the full unquantized model never needs to fit in memory. Just pass --isq <bits>:

# 4-bit quantization (auto-selects best method for your hardware)
hanzo-engine serve -m meta-llama/Llama-3.2-3B-Instruct --isq 4

# 8-bit quantization
hanzo-engine serve -m zenlm/zen4-mini --isq 8

On Metal, ISQ uses AFQ (affine quantization) optimized for Apple Silicon. On CUDA, it uses Q/K quantization for best throughput.

Quantization Methods

| Method | Bit Widths | Devices | Notes |
|---|---|---|---|
| ISQ (auto) | 2, 3, 4, 5, 6, 8 | All | Hardware-aware auto-selection |
| GGUF | 2-8 bit (Q/K types) | CPU, CUDA, Metal | Most portable format |
| GPTQ | 2, 3, 4, 8 | CUDA only | Marlin kernel for 4/8-bit |
| AWQ | 4, 8 | CUDA only | Marlin kernel for 4/8-bit |
| HQQ | 4, 8 | All | Half-quadratic quantization |
| FP8 | 8 | All | E4M3 floating point |
| BNB | 4, 8 | CUDA | bitsandbytes int8, fp4, nf4 |
| AFQ | 2, 3, 4, 6, 8 | Metal only | Fastest on Apple Silicon |
| MLX | Pre-quantized | Metal | Apple MLX format |

Running GGUF Models

# Direct GGUF file
hanzo-engine run --format gguf -f ./zen4-mini-Q4_K_M.gguf

# Auto-detect GPTQ model from HuggingFace
hanzo-engine run -m kaitchup/Phi-3-mini-4k-instruct-gptq-4bit

Per-Layer Topology

Fine-tune quantization per layer using a topology file. This lets you keep attention layers at higher precision while aggressively quantizing feed-forward layers:

hanzo-engine serve -m zenlm/zen4-mini --topology topology.toml
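
The exact topology schema is defined by hanzo-engine; the sketch below is purely illustrative (hypothetical layer ranges and keys) and only shows the idea of mixing bit depths per layer range:

# topology.toml -- hypothetical sketch, not the engine's real schema
# Higher precision for the first 8 layers, aggressive 3-bit elsewhere
["0-8"]
isq = "Q8_0"

["8-32"]
isq = "Q3K"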

PagedAttention

PagedAttention accelerates inference and enables high-throughput continuous batching:

  • CUDA: Enabled by default; disable with --no-paged-attn
  • Metal: Opt-in with --paged-attn
  • Prefix caching: Reuses KV cache blocks across requests sharing common prefixes (system prompts)
  • KV cache quantization: FP8 (E4M3) reduces KV cache memory by ~50%

# Custom KV cache memory allocation
hanzo-engine serve -m zenlm/zen4-mini --paged-attn --paged-attn-gpu-memory 8GiB

# KV cache quantization for longer contexts
hanzo-engine serve -m zenlm/zen4-mini --paged-attn --paged-cache-type f8e4m3

MCP Integration

Hanzo Engine can act as both an MCP server (exposing model tools) and an MCP client (connecting to external tool servers).

MCP Server

Expose model capabilities as MCP tools alongside the HTTP API:

hanzo-engine serve \
  -m zenlm/zen4-mini \
  --port 36900 \
  --mcp-port 4321

The MCP server exposes tools based on loaded model modalities (text, vision, etc.) via JSON-RPC 2.0.
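
For example, you can enumerate the exposed tools with a raw JSON-RPC request (assuming the MCP server speaks JSON-RPC over plain HTTP on its port; tools/list is the standard MCP method):

curl http://localhost:4321 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}'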

MCP Client

Connect to external MCP tool servers so the model can call external tools during inference:

hanzo-engine serve \
  -m zenlm/zen4-mini \
  --mcp-client-config mcp-tools.json
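
The config file lists the external tool servers to connect to. A hypothetical mcp-tools.json is sketched below (field names are illustrative; check the engine's MCP documentation for the actual schema):

{
  "servers": [
    {
      "name": "filesystem",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    }
  ]
}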

Python SDK

from hanzo_engine import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.Plain(model_id="zenlm/zen4-mini"),
    in_situ_quant="4",
)

response = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=256,
    )
)
print(response.choices[0].message.content)

Rust SDK

use anyhow::Result;
use hanzo_engine::{IsqType, TextMessageRole, TextMessages, TextModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = TextModelBuilder::new("zenlm/zen4-mini")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new()
        .add_message(TextMessageRole::User, "Hello!");

    let response = model.send_chat_request(messages).await?;
    println!("{}", response.choices[0].message.content.as_ref().unwrap());

    Ok(())
}

Docker

# CPU
docker run -p 8000:8000 ghcr.io/hanzoai/engine:latest \
  serve -m zenlm/zen4-mini --port 8000

# NVIDIA GPU
docker run --gpus all -p 8000:8000 ghcr.io/hanzoai/engine:cuda \
  serve -m zenlm/zen4 --port 8000

# Apple Silicon (Metal)
docker run -p 8000:8000 ghcr.io/hanzoai/engine:metal \
  serve -m zenlm/zen4-mini --port 8000

Docker Compose

services:
  engine:
    image: ghcr.io/hanzoai/engine:cuda
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - model-cache:/root/.cache/huggingface
    command: serve -m zenlm/zen4 --port 8000
    restart: unless-stopped

volumes:
  model-cache:
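
Bring the stack up in the background and follow the logs:

docker compose up -d
docker compose logs -f engine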

Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hanzo-engine
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hanzo-engine
  template:
    metadata:
      labels:
        app: hanzo-engine
    spec:
      containers:
        - name: engine
          image: ghcr.io/hanzoai/engine:cuda
          command: ["hanzo-engine", "serve", "-m", "zenlm/zen4", "--port", "8000"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: hanzo-engine
spec:
  selector:
    app: hanzo-engine
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
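
Once the pod is Ready, you can smoke-test the Service from your workstation with a port-forward:

kubectl port-forward svc/hanzo-engine 8000:8000
# In a second terminal:
curl http://localhost:8000/health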

Distributed Inference

Multi-GPU (NCCL)

For models that exceed single-GPU memory, use NCCL tensor parallelism:

# 2x GPU tensor parallelism
HANZO_ENGINE_LOCAL_WORLD_SIZE=2 hanzo-engine serve \
  -m zenlm/zen4 --port 8000

# 4x GPU
HANZO_ENGINE_LOCAL_WORLD_SIZE=4 hanzo-engine serve \
  -m zenlm/zen4-max --port 8000

Multi-Node (Ring)

Ring topology supports heterogeneous setups -- mix Metal, CUDA, and CPU nodes in a single inference cluster:

# Node 0 (master)
RING_CONFIG=ring_node0.json hanzo-engine serve -m zenlm/zen4 --port 8000

# Node 1
RING_CONFIG=ring_node1.json hanzo-engine serve -m zenlm/zen4 --port 8001
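
Each node reads its position in the ring from the JSON file named by RING_CONFIG. A hypothetical config for node 0 (field names are illustrative, not the engine's actual schema):

{
  "rank": 0,
  "world_size": 2,
  "listen": "0.0.0.0:9000",
  "next": "10.0.0.2:9001"
}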

Speculative Decoding

Draft-model acceleration uses a smaller model to generate candidate tokens, then validates them with the full model. This achieves 2-3x speedup for autoregressive generation:

# config.toml
[speculative]
draft_model = "zenlm/zen4-mini"
gamma = 16

hanzo-engine from-config -f config.toml
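
Because rejected draft tokens are resampled from the full model's distribution, output quality is identical to running the target model alone; gamma sets how many draft tokens are proposed per verification step, trading draft-model work against acceptance rate.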

Performance Benchmarks

Representative benchmarks with continuous batching enabled:

| Model | Hardware | Quantization | Throughput (tok/s) | Latency (TTFT) | Memory |
|---|---|---|---|---|---|
| zen4-mini (8B) | 1x A100 80GB | FP16 | ~2,400 | 28ms | 16 GB |
| zen4-mini (8B) | 1x A100 80GB | Q4K ISQ | ~3,800 | 18ms | 5 GB |
| zen4-mini (8B) | M3 Max 64GB | Metal | ~85 | 120ms | 16 GB |
| zen4-mini (8B) | M3 Max 64GB | Q4K ISQ | ~110 | 80ms | 5 GB |
| zen4-pro (80B MoE) | 1x A100 80GB | Q4K ISQ | ~950 | 65ms | 42 GB |
| zen4 (744B MoE) | 4x H100 | FP8 + NCCL | ~1,200 | 180ms | 280 GB |
| zen4 (744B MoE) | 8x A100 80GB | Q4K + NCCL | ~800 | 250ms | 320 GB |

Run your own benchmarks:

hanzo-engine bench -m zenlm/zen4-mini --isq Q4K
hanzo-engine tune -m zenlm/zen4-mini --emit-config optimal.toml

Configuration

Environment Variables

| Variable | Description |
|---|---|
| HANZO_ENGINE_PORT | Server port (default: 36900) |
| HANZO_ENGINE_LOCAL_WORLD_SIZE | Number of GPUs for NCCL tensor parallelism |
| HANZO_ENGINE_NO_NCCL | Set to 1 to disable NCCL and use device mapping instead |
| RING_CONFIG | Path to ring topology JSON config for multi-node inference |
| KEEP_ALIVE_INTERVAL | SSE keep-alive interval in ms |
| OTEL_EXPORTER_OTLP_ENDPOINT | OpenTelemetry export endpoint |
| OTEL_SERVICE_NAME | Service name for traces |
| HF_TOKEN | HuggingFace Hub authentication token |
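
For example, to serve on a custom port with HuggingFace authentication and two-GPU tensor parallelism:

# hf_xxx is a placeholder for your HuggingFace token
HANZO_ENGINE_PORT=8080 \
HF_TOKEN=hf_xxx \
HANZO_ENGINE_LOCAL_WORLD_SIZE=2 \
hanzo-engine serve -m zenlm/zen4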

TOML Configuration

For reproducible deployments, use a TOML config:

# Generate optimal config for your hardware
hanzo-engine tune -m zenlm/zen4-mini --emit-config config.toml

# Run from config
hanzo-engine from-config -f config.toml

Performance Tips

  1. Use ISQ: --isq 4 gives the best balance of quality and speed on most hardware
  2. Enable PagedAttention: Required for high-throughput batched serving
  3. Run tune: hanzo-engine tune benchmarks your hardware and emits optimal settings
  4. KV cache quantization: --paged-cache-type f8e4m3 halves KV cache memory for longer contexts
  5. Multi-GPU: Use device mapping for models that exceed single-GPU memory
  6. FlashAttention: Add --features flash-attn at build time for CUDA (significant speedup)
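
Putting several of these together, a typical high-throughput CUDA invocation might look like:

hanzo-engine serve -m zenlm/zen4-pro \
  --isq 4 \
  --paged-attn \
  --paged-cache-type f8e4m3 \
  --port 8000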

Related Projects

  • On-device inference runtime -- mobile, web (WASM), and embedded deployment
  • LLM API gateway -- routes requests to Engine and 100+ external providers
  • Rust ML framework powering Engine's tensor operations
  • Platform management, billing, and hosted inference
