Hanzo Edge

On-device AI inference -- run Zen models and any GGUF model locally on macOS, Linux, iOS, Android, Web (WASM), and embedded devices with zero cloud dependency.

Hanzo Edge is a high-performance, cross-platform AI inference runtime for on-device deployment. Written in Rust, it runs Zen models and any GGUF-compatible model locally with zero cloud dependency -- full data privacy, zero network latency, and offline operation.

GitHub: github.com/hanzoai/edge · Crates.io: hanzo-edge · License: Apache-2.0

Features

  • On-Device Inference: Zero network latency, works offline, full data privacy -- no cloud calls required
  • Cross-Platform: macOS, Linux, iOS, Android, Web (WASM), and embedded devices from a single codebase
  • Hardware Acceleration: Metal (Apple Silicon), CUDA (NVIDIA), CPU (AVX2/AVX-512), Accelerate (macOS)
  • GGUF Native: First-class support for quantized models (Q4_K, Q5_K, Q8_0, and all standard GGUF types)
  • OpenAI-Compatible API: Drop-in replacement local server at localhost -- works with any OpenAI SDK client
  • Streaming: Token-by-token streaming via SSE (HTTP) and callbacks (Rust/WASM)
  • HuggingFace Hub: Automatic model download and caching from any HF repository
  • WebAssembly: Run inference directly in the browser via WASM with streaming support
  • Minimal Binary: Single binary deployment, no Python or system dependencies

Architecture

hanzo-edge (workspace)
+-- edge-core/              # Core inference runtime (library)
|   +-- lib.rs              # Public API: Model, InferenceSession, SamplingParams
|   +-- model.rs            # Model trait, GGUF loading, HF Hub download
|   +-- session.rs          # Autoregressive generation + streaming iterator
|   +-- sampling.rs         # Temperature, top-k, top-p, repeat penalty
|   +-- tokenizer.rs        # HF tokenizer wrapper with EOS detection
+-- edge-cli/               # CLI binary
|   +-- main.rs             # Clap-based CLI with 4 subcommands
|   +-- loader.rs           # HF Hub download with progress bars
|   +-- cmd/
|       +-- run.rs          # Streaming inference to stdout
|       +-- serve.rs        # OpenAI-compatible HTTP server (Axum)
|       +-- bench.rs        # TTFT, throughput, memory benchmarking
|       +-- info.rs         # Model metadata inspection
+-- edge-wasm/              # WebAssembly module
    +-- lib.rs              # WASM bindings: EdgeModel, generate, generate_stream

Built on Hanzo ML for tensor operations.

Crates

| Crate | Description | Install |
|---|---|---|
| hanzo-edge-core | Core inference runtime and Model trait | cargo add hanzo-edge-core |
| hanzo-edge | CLI binary with run, serve, bench, info | cargo install hanzo-edge |
| hanzo-edge-wasm | Browser WASM module with streaming | wasm-pack build edge-wasm |

Quick Start

Install

# Install via curl
curl -sSL https://edge.hanzo.ai/install.sh | sh

# Or via cargo
cargo install hanzo-edge

Run a Model

# Run inference with streaming output
hanzo-edge run --model zenlm/zen3-nano --prompt "Hello!"

# Start local OpenAI-compatible API server
hanzo-edge serve --model zenlm/zen3-nano --port 8080

# Model info (architecture, params, quantization, context length)
hanzo-edge info --model zenlm/zen3-nano

# Benchmark (TTFT, tokens/sec, memory, averaged over N iterations)
hanzo-edge bench --model zenlm/zen3-nano --prompt "Hello" \
    --max-tokens 128 -n 5

CLI Commands

| Command | Purpose |
|---|---|
| run | Streaming inference to stdout |
| serve | Start OpenAI-compatible HTTP server |
| bench | TTFT, throughput, and memory benchmarking |
| info | Model metadata inspection |
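The numbers bench reports can be understood with a small sketch: time-to-first-token (TTFT) is the gap between request start and the arrival of the first token, and decode throughput is tokens per second over the remaining tokens. The helper below is illustrative only, not part of the hanzo-edge API.

```python
# Illustrative computation of the metrics reported by `hanzo-edge bench`.
# `bench_metrics` is a hypothetical helper, not hanzo-edge code.

def bench_metrics(start: float, token_times: list[float]) -> dict:
    """TTFT and decode throughput from token arrival timestamps (seconds)."""
    ttft = token_times[0] - start
    decode_window = token_times[-1] - token_times[0]
    # Throughput over the decode phase: tokens generated after the first one.
    tokens_per_sec = (len(token_times) - 1) / decode_window if decode_window > 0 else 0.0
    return {"ttft_s": ttft, "tokens_per_sec": tokens_per_sec}

m = bench_metrics(0.0, [0.25, 0.30, 0.35, 0.40, 0.45])
# TTFT 0.25 s; 4 tokens over the 0.20 s decode window => ~20 tokens/sec
```

Averaging these figures over N iterations (the -n flag above) smooths out cache-warming effects on the first run.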

Zen Models for Edge

Pre-quantized models optimized for on-device inference, all available at huggingface.co/zenlm in GGUF format:

| Model | Params | Memory | Use Case |
|---|---|---|---|
| zen-nano | 600M | ~400MB | Ultra-lightweight, embedded |
| zen-eco | 4B | ~2.5GB | General purpose, mobile |
| zen4-mini | 8B | ~5GB | High quality, desktop/laptop |

# Run a Zen model
hanzo-edge run --model zenlm/zen3-nano --prompt "Write a haiku" \
    --max-tokens 128 --temperature 0.7 --top-p 0.9
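The sampling knobs above map to a standard pipeline (see edge-core/sampling.rs): logits are divided by the temperature, truncated to the top-k candidates, then further truncated to the smallest set whose cumulative probability reaches top-p, and the survivors are renormalized before one token is drawn. A rough sketch of that filtering, using illustrative values rather than hanzo-edge internals:

```python
import math

def filter_probs(logits: dict[str, float], temperature: float,
                 top_k: int, top_p: float) -> dict[str, float]:
    """Temperature -> top-k -> top-p filtering over a token->logit map."""
    # Temperature scaling: lower temperature sharpens the distribution.
    scaled = {t: l / temperature for t, l in logits.items()}
    # Numerically stable softmax.
    m = max(scaled.values())
    exps = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(exps.values())
    ranked = sorted(((t, e / z) for t, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)
    # Top-k: keep only the k most likely tokens.
    ranked = ranked[:top_k]
    # Top-p (nucleus): smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for t, p in ranked:
        kept.append((t, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the surviving candidates.
    z = sum(p for _, p in kept)
    return {t: p / z for t, p in kept}
```

With temperature 0.7 and top-p 0.9 as in the command above, unlikely tokens are pruned before one is sampled at random from the renormalized remainder.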

Supported Model Architectures

Hanzo Edge supports any GGUF model using these architectures (auto-detected from GGUF metadata):

| Architecture | Examples |
|---|---|
| Llama | Zen, Llama 2/3, Mistral |
| Qwen2 | Zen 4 |
| Phi3 | Phi-3-mini, Phi-3-small |
| Gemma2 | Gemma 2 |
| SmolLM | SmolLM 135M-1.7B |

Any GGUF file using the Llama-family tensor layout is supported.
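Auto-detection is possible because every GGUF file declares its architecture in the metadata key general.architecture, which follows a fixed little-endian header: a 4-byte GGUF magic, a u32 format version, then u64 tensor and metadata-KV counts. A minimal sketch of parsing just that header (the metadata KV section itself is omitted; the synthetic header is hand-built, not from a real model):

```python
import struct

GGUF_MAGIC = b"GGUF"

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF v3 file header (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # The metadata KV section that follows holds keys such as
    # general.architecture, which runtimes like Hanzo Edge read to
    # pick the matching model implementation.
    return {"version": version, "tensor_count": n_tensors,
            "metadata_kv_count": n_kv}

# A synthetic header: GGUF v3, no tensors or metadata entries.
header = struct.pack("<4sIQQ", b"GGUF", 3, 0, 0)
```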

API Server

The built-in server is OpenAI-compatible -- any client that speaks the OpenAI API works out of the box:

hanzo-edge serve --model zenlm/zen3-nano --port 8080

Endpoints

| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | Chat completion (streaming + non-streaming) |
| POST | /v1/completions | Text completion |
| GET | /v1/models | List loaded models |
| GET | /health | Health check |

Chat completions support stream: true for server-sent events (SSE), with [DONE] sentinel and ChatML-formatted prompts.
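On the wire, each streamed chunk arrives as an SSE data: line carrying an OpenAI-style JSON delta, and the stream ends with the [DONE] sentinel. A sketch of client-side parsing (the chunk payloads here are hand-written examples in the OpenAI chunk shape, not captured hanzo-edge output):

```python
import json

def collect_sse_tokens(lines: list[str]) -> str:
    """Accumulate delta content from OpenAI-style SSE chat-completion chunks."""
    out = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data: "):]
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            out.append(delta["content"])
    return "".join(out)

stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
```

In practice the OpenAI SDKs (as in the Python example below) handle this parsing for you.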

SDK Usage

Rust

use hanzo_edge_core::{load_model, InferenceSession, SamplingParams, ModelConfig};

// Load from HuggingFace Hub (auto-downloads and caches)
let config = ModelConfig {
    model_id: "zenlm/zen3-nano".to_string(),
    model_file: Some("zen3-nano.Q4_K_M.gguf".to_string()),
    ..Default::default()
};
let (mut model, tokenizer) = load_model(&config)?;

// Generate
let params = SamplingParams {
    temperature: 0.7,
    top_p: 0.9,
    top_k: 40,
    max_tokens: 256,
    repeat_penalty: 1.1,
    repeat_last_n: 64,
};
let mut session = InferenceSession::new(&mut *model, &tokenizer, params);
let output = session.generate("Explain quantum computing")?;
println!("{}", output.text);

// Streaming
let stream = session.generate_stream("Write a haiku")?;
for token_result in stream {
    print!("{}", token_result?);
}

Python (via local API)

from openai import OpenAI

# Point at the local hanzo-edge server
client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

# Chat completions (streaming)
stream = client.chat.completions.create(
    model="zen3-nano",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Text completions
response = client.completions.create(
    model="zen3-nano",
    prompt="The quick brown fox",
    max_tokens=64
)
print(response.choices[0].text)

JavaScript (WASM)

import init, { EdgeModel, get_version, get_device_info } from 'hanzo-edge-wasm';

await init();
console.log(`Hanzo Edge v${get_version()} [${get_device_info()}]`);

// Load model and tokenizer as ArrayBuffers
const modelBytes = await fetch('model.gguf').then(r => r.arrayBuffer());
const tokenizerBytes = await fetch('tokenizer.json').then(r => r.arrayBuffer());

const model = new EdgeModel(
    new Uint8Array(modelBytes),
    new Uint8Array(tokenizerBytes)
);

// Synchronous generation
const output = model.generate("Hello!", 256, 0.7);
console.log(output);

// Streaming (token-by-token callback)
model.generate_stream("Write a poem", 256, 0.7, (token) => {
    process.stdout.write(token);
});

// Reset KV cache between conversations
model.reset();

Platform Support

| Platform | Backend | Status |
|---|---|---|
| macOS (Apple Silicon) | Metal | Production |
| macOS (Intel) | CPU/Accelerate | Production |
| Linux x86_64 | CPU/CUDA | Production |
| Linux ARM64 | CPU | Production |
| Web (WASM) | CPU | Beta |
| iOS | Metal/CoreML | Coming Soon |
| Android | Vulkan/NNAPI | Coming Soon |

Building from Source

git clone https://github.com/hanzoai/edge
cd edge

# Build CLI (CPU)
cargo build --release -p hanzo-edge

# Build CLI (Metal, Apple Silicon)
cargo build --release -p hanzo-edge --features metal

# Build CLI (CUDA)
cargo build --release -p hanzo-edge --features cuda

# Build WASM
cd edge-wasm && cargo build --target wasm32-unknown-unknown --release
wasm-bindgen target/wasm32-unknown-unknown/release/edge_wasm.wasm \
    --out-dir pkg --target web

# Run tests
cargo test --workspace

Feature Flags

| Feature | Description | Build |
|---|---|---|
| cpu | CPU backend (default) | cargo build --release |
| metal | Metal backend for macOS/iOS | cargo build --release --features metal |
| cuda | CUDA backend for NVIDIA GPUs | cargo build --release --features cuda |
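The same feature flags apply when depending on the core crate directly. A hedged Cargo.toml sketch (the version requirement is illustrative, not pinned to a real release):

```toml
[dependencies]
# Enable the Metal backend on Apple Silicon; swap in "cuda" for NVIDIA GPUs.
hanzo-edge-core = { version = "*", features = ["metal"] }
```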

Related Projects

  • Hanzo Engine: Cloud GPU inference engine -- production serving with PagedAttention, continuous batching, and multi-GPU
  • Unified API gateway -- routes to Engine and 100+ external providers
  • Hanzo ML: Rust ML framework powering Edge and Engine tensor operations
  • Hosted inference, billing, and platform management
