v0.0.3 · macOS arm64

Run AI on your own metal.

UnifiedEngine is a local AI daemon for Apple Silicon. It loads GGUF models through a real llama.cpp runtime, accelerates them with Metal, and serves them behind an OpenAI-compatible API — all without a single byte leaving your Mac.

Download for macOS · Quick start

100% On-device

0 Cloud calls

Metal GPU backend

unifiedengine — zsh

$ cargo run -p ue-daemon -- --model qwen.gguf
✓ llama.cpp runtime ready  backend=metal
✓ GGUF loaded             model=local-gguf
✓ listening               127.0.0.1:38180

$ curl -N localhost:38180/v1/chat/completions \
   -H 'Content-Type: application/json' \
   -d '{"messages":[{"role":"user",
       "content":"Hello"}],"stream":true}'

data: {"delta":{"content":"Running "}}
data: {"delta":{"content":"locally "}}
data: {"delta":{"content":"on Metal."}}
data: [DONE]

backend: metal

ready: true

llama.cpp · Apple Metal · GGUF · Rust + axum · SwiftUI · OpenAI API

Capabilities

Everything the engine does

A genuine local inference stack — not a thin proxy. Built to feel native on Apple Silicon and trivial to integrate.

Real llama.cpp runtime

Not a wrapper or a mock. UnifiedEngine links the llama.cpp C API directly and loads GGUF models through the genuine inference runtime.

Metal acceleration

The vendored llama.cpp build ships with the Metal backend, so generation runs on the Apple Silicon GPU — fast, cool, and battery-aware.

True token streaming

Tokens stream the moment they are produced, delivered over Server-Sent Events through the same OpenAI-compatible endpoint.
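For readers who want to consume the stream without an SDK, here is a minimal sketch of reading Server-Sent Events by hand. It assumes the simplified payload shape shown in the terminal demo above ({"delta": {"content": ...}}); a fully OpenAI-shaped chunk nests the delta under choices[0] instead. The helper name iter_sse_payloads is hypothetical and not part of UnifiedEngine.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each `data:` line until `[DONE]`."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream sentinel
        yield json.loads(data)


# Sample lines copied from the terminal demo (payload shape is an assumption).
sample = [
    'data: {"delta":{"content":"Running "}}',
    "",
    'data: {"delta":{"content":"locally "}}',
    "",
    'data: {"delta":{"content":"on Metal."}}',
    "",
    "data: [DONE]",
]

text = "".join(p["delta"]["content"] for p in iter_sse_payloads(sample))
print(text)  # Running locally on Metal.
```

In practice the lines would come from a streaming HTTP response body rather than a list; the parsing loop stays the same.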

OpenAI-compatible API

Point any OpenAI SDK at a custom base URL. /v1/chat/completions, /v1/models, and /v1/status work out of the box — no code rewrites.

Private by design

Inference happens entirely on your Mac. No telemetry, no cloud round-trips. Your prompts and data never leave the device.

Hardened control plane

API-key auth, CORS allow-lists, request-size caps, concurrency limits, per-minute rate limiting, and request-ID tracing — built in.
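The page does not say how a rate-limited request is reported to the client. As a rough client-side sketch, assuming the daemon answers with HTTP 429 when the per-minute limit is hit (an assumption, not confirmed here), a caller could retry with exponential backoff. The names call_with_backoff, send, and base_delay are hypothetical.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-429 status or attempts run out.

    send() returns (status_code, body). HTTP 429 as the throttling signal
    is an assumption about the daemon, not documented behavior.
    """
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
        status, body = send()
    return status, body


# Simulated server: throttled twice, then succeeds. No real sleeping.
responses = iter([(429, "slow down"), (429, "slow down"), (200, "ok")])
delays = []
result = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)  # (200, 'ok') [0.5, 1.0]
```

Injecting sleep keeps the helper easy to test; in real use the default time.sleep applies and send would wrap an actual HTTP call.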

Under the hood

A clean, layered runtime

From the native macOS control plane down to the Metal kernels, every layer has one job — and all of them run on your machine.

App

UnifiedEngine.app: SwiftUI control plane · Keychain-stored API keys

spawns ↓

Service

ue-daemon · ue-api (axum): Auth · CORS · rate limit · request-ID · validation

calls ↓

Runtime

ue-llama → llama.cpp C API: GGUF loading · token generation · streaming callback

runs on ↓

Hardware

Apple Silicon GPU · Metal: Vendored Metal-enabled build · zero cloud

ue-daemon · ue-api · ue-llama · ue-models · ue-config · ue-common

Drop-in API

If it speaks OpenAI, it speaks UnifiedEngine

Keep your existing SDK. Change the base URL. Inference now happens locally — streaming, status, and model metadata all included.

GET   /health                 Unauthenticated readiness probe
GET   /v1/status              Runtime, backend, and active model
GET   /v1/models              Active model metadata
POST  /v1/chat/completions    Chat, streaming or non-streaming

Read the API reference →
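As a sketch of how the status endpoint listed above might be polled before sending traffic, the snippet below uses only the Python standard library. The response fields ready and backend are assumptions inferred from the status badges in the terminal demo; the real schema may differ. The helpers get_json and is_ready are hypothetical names, and the Bearer header mirrors how OpenAI SDKs transmit api_key.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://127.0.0.1:38180"  # default address shown on this page


def get_json(path: str, api_key: Optional[str] = None, timeout: float = 2.0) -> dict:
    """GET a JSON document from the local daemon (performs a network call)."""
    req = urllib.request.Request(BASE_URL + path)
    if api_key:
        # OpenAI-style clients send the key as a Bearer token.
        req.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def is_ready(status: dict) -> bool:
    """True when the (assumed) status payload reports a loaded, ready runtime."""
    return status.get("ready") is True and bool(status.get("backend"))


# Offline example using a payload shaped like the demo badges (assumed schema).
print(is_ready({"ready": True, "backend": "metal"}))  # True
```

With the daemon running, is_ready(get_json("/v1/status", api_key="local")) would perform the live check.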

chat.py python

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:38180/v1",
    api_key="local",  # or your UE_API_KEY
)

stream = client.chat.completions.create(
    model="local-gguf",
    messages=[{"role": "user",
               "content": "Explain Metal"}],
    stream=True,
)

for chunk in stream:
    # delta.content can be None (e.g. on role-only or final chunks)
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Native experience

A real Mac app, not a config file

The SwiftUI control plane gives the daemon a face — run it, configure it, talk to it, and watch its logs from one window.

Dashboard

Start and stop the engine, watch runtime status, and configure host, port, CORS, and rate limits without touching a terminal.

Chat

A native chat surface wired straight to the local model — the fastest way to confirm everything works.

Model Settings

Pick any local .gguf file. Nothing is bundled; you stay in full control of which weights run.

Logs

Switch between API request logs and the full daemon log to inspect exactly what the engine is doing.

Get going

Running in three commands

Build llama.cpp with Metal

./scripts/build_llama_cpp_metal.sh

Start the daemon with a GGUF model

cargo run -p ue-daemon -- \
  --model /path/to/model.gguf

Send your first request

curl http://127.0.0.1:38180/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"local-gguf",
       "messages":[{"role":"user",
       "content":"Hello"}]}'

Prefer the GUI? The macOS app exposes the same settings and stores your API key in Keychain. No GGUF model is bundled — bring your own.
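The same first request can be sent from Python without any SDK. This is a minimal sketch using the standard library; build_chat_request is a hypothetical helper, and the explicit Content-Type header and Bearer token are assumptions about what the daemon expects, chosen to match what OpenAI clients send.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(
    prompt: str,
    model: str = "local-gguf",
    base_url: str = "http://127.0.0.1:38180/v1",
    api_key: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat completion request."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/chat/completions", data=body, headers=headers, method="POST"
    )


req = build_chat_request("Hello", api_key="local")
print(req.get_method(), req.full_url)
```

To send it, pass req to urllib.request.urlopen and decode the JSON body; if the response follows the OpenAI shape, the reply text sits at choices[0].message.content.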

Ready when you are

Bring intelligence in-house.

UnifiedEngine runs on macOS arm64. Build from source today, or package a signed release with the included scripts.

Download for macOS · View on GitHub