v0.0.3 · macOS arm64

Run AI on your own metal.

UnifiedEngine is a local AI daemon for Apple Silicon. It loads GGUF models through a real llama.cpp runtime, accelerates them with Metal, and serves them behind an OpenAI-compatible API — all without a single byte leaving your Mac.

100% On-device
0 Cloud calls
Metal GPU backend
unifiedengine — zsh
$ cargo run -p ue-daemon -- --model qwen.gguf
 llama.cpp runtime ready  backend=metal
 GGUF loaded             model=local-gguf
 listening               127.0.0.1:38180

$ curl -N http://127.0.0.1:38180/v1/chat/completions \
   -H 'Content-Type: application/json' \
   -d '{"messages":[{"role":"user",
       "content":"Hello"}],"stream":true}'

data: {"delta":{"content":"Running "}}
data: {"delta":{"content":"locally "}}
data: {"delta":{"content":"on Metal."}}
data: [DONE]
backend: metal
ready: true
llama.cpp · Apple Metal · GGUF · Rust + axum · SwiftUI · OpenAI API
Capabilities

Everything the engine does

A genuine local inference stack — not a thin proxy. Built to feel native on Apple Silicon and trivial to integrate.

Real llama.cpp runtime

Not a wrapper or a mock. UnifiedEngine links the llama.cpp C API directly and loads GGUF models through the genuine inference runtime.

Metal acceleration

The vendored llama.cpp build ships with the Metal backend, so generation runs on the Apple Silicon GPU — fast, cool, and battery-aware.

True token streaming

Tokens stream the moment they are produced, delivered over Server-Sent Events through the same OpenAI-compatible endpoint.
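
To make the streaming format concrete, here is a minimal sketch of how a client might decode the event stream by hand. It assumes each event is a single `data:` line carrying an OpenAI-style chunk (`choices[0].delta.content`) and that the stream ends with `data: [DONE]`, as in the terminal demo. `iter_deltas` is a hypothetical helper name, not part of UnifiedEngine, and the sample events are illustrative data.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from an SSE stream of OpenAI-style chat chunks.

    Assumption: each event is a single `data: <json>` line and the stream
    is terminated by `data: [DONE]`.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:  # role-only and final chunks carry no content
                yield text


# Sample events shaped like an OpenAI streaming response (illustrative data).
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Running "}}]}',
    'data: {"choices":[{"delta":{"content":"locally."}}]}',
    "data: [DONE]",
]
reply = "".join(iter_deltas(sample))
print(reply)
```

In practice the OpenAI SDKs do this decoding for you; the sketch is only meant to show what travels over the wire.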

OpenAI-compatible API

Point any OpenAI SDK at a custom base URL. /v1/chat/completions, /v1/models, and /v1/status work out of the box — no code rewrites.

Private by design

Inference happens entirely on your Mac. No telemetry, no cloud round-trips. Your prompts and data never leave the device.

Hardened control plane

API-key auth, CORS allow-lists, request-size caps, concurrency limits, per-minute rate limiting, and request-ID tracing — built in.
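
From the client side, API-key auth usually amounts to one header. The sketch below builds (without sending) a request for `/v1/models`, assuming the daemon accepts the key the way OpenAI SDKs send it, as `Authorization: Bearer <key>`, and that it listens on the address shown in the terminal demo. `authorized_request` is a hypothetical helper name.

```python
import urllib.request

BASE_URL = "http://127.0.0.1:38180"  # address taken from the terminal demo


def authorized_request(path: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying the API key.

    Assumption: the key is expected as a Bearer token, which is how the
    OpenAI SDKs transmit `api_key`.
    """
    return urllib.request.Request(
        BASE_URL + path,
        headers={"Authorization": f"Bearer {api_key}"},
    )


req = authorized_request("/v1/models", "local")
print(req.get_method(), req.full_url)
```

With the daemon running, `urllib.request.urlopen(req)` would send it.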

Under the hood

A clean, layered runtime

From the native macOS control plane down to the Metal kernels, every layer has one job — and all of them run on your machine.

App
UnifiedEngine.app SwiftUI control plane · Keychain-stored API keys
spawns ↓
Service
ue-daemon · ue-api (axum) Auth · CORS · rate limit · request-ID · validation
calls ↓
Runtime
ue-llama → llama.cpp C API GGUF loading · token generation · streaming callback
runs on ↓
Hardware
Apple Silicon GPU · Metal Vendored Metal-enabled build · zero cloud
ue-daemon · ue-api · ue-llama · ue-models · ue-config · ue-common
Drop-in API

If it speaks OpenAI, it speaks UnifiedEngine

Keep your existing SDK. Change the base URL. Inference now happens locally — streaming, status, and model metadata all included.

  • GET /health Unauthenticated readiness probe
  • GET /v1/status Runtime, backend & active model
  • GET /v1/models Active model metadata
  • POST /v1/chat/completions Chat — streaming or not
chat.py python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:38180/v1",
    api_key="local",  # or your UE_API_KEY
)

stream = client.chat.completions.create(
    model="local-gguf",
    messages=[{"role": "user",
               "content": "Explain Metal"}],
    stream=True,
)

for chunk in stream:
    # delta.content is None on role-only and final chunks
    print(chunk.choices[0].delta.content or "", end="", flush=True)
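
For the non-streaming case, a dependency-free sketch using only the standard library might look like the following. It assumes the response body follows the OpenAI chat completion shape (`choices[0].message.content`) and that the key is accepted as a Bearer token. `extract_reply` and `chat_once` are hypothetical helper names, and the sample response is illustrative data.

```python
import json
import urllib.request

URL = "http://127.0.0.1:38180/v1/chat/completions"


def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion body."""
    return response["choices"][0]["message"]["content"]


def chat_once(prompt: str, api_key: str = "local") -> str:
    """Send one non-streaming request; requires a running daemon."""
    body = json.dumps({
        "model": "local-gguf",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode("utf-8")
    req = urllib.request.Request(
        URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: key sent the way OpenAI SDKs send it.
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_reply(json.load(resp))


# Offline check against a response shaped like the OpenAI format.
sample = {"choices": [{"message": {"role": "assistant", "content": "Hi there."}}]}
print(extract_reply(sample))
```

Once the daemon is up, `chat_once("Explain Metal")` would return the full reply as a single string.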
Native experience

A real Mac app, not a config file

The SwiftUI control plane gives the daemon a face — run it, configure it, talk to it, and watch its logs from one window.

01

Dashboard

Start and stop the engine, watch runtime status, and configure host, port, CORS, and rate limits without touching a terminal.

02

Chat

A native chat surface wired straight to the local model — the fastest way to confirm everything works.

03

Model Settings

Pick any local .gguf file. Nothing is bundled; you stay in full control of which weights run.

04

Logs

Switch between API request logs and the full daemon log to inspect exactly what the engine is doing.

Get going

Running in three commands

1

Build llama.cpp with Metal

./scripts/build_llama_cpp_metal.sh
2

Start the daemon with a GGUF model

cargo run -p ue-daemon -- \
  --model /path/to/model.gguf
3

Send your first request

curl http://127.0.0.1:38180/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"local-gguf",
       "messages":[{"role":"user",
       "content":"Hello"}]}'

Prefer the GUI? The macOS app exposes the same settings and stores your API key in Keychain. No GGUF model is bundled — bring your own.
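
If you script these steps, it can help to wait for the daemon before sending the first request. The sketch below polls `GET /health`, assuming an HTTP 200 response means the engine is ready and that the daemon listens on the default address used in the commands here. `probe_health` and `wait_until_ready` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://127.0.0.1:38180/health"


def probe_health(url: str = HEALTH_URL) -> bool:
    """Return True if GET /health answers with HTTP 200 (assumed to mean ready)."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or an HTTP error status


def wait_until_ready(probe: Callable[[], bool] = probe_health,
                     attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll `probe` until it succeeds or `attempts` run out."""
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            time.sleep(delay)  # back off before the next try
    return False
```

Calling `wait_until_ready()` after step 2 returns `True` as soon as the probe succeeds, or `False` after roughly 30 seconds with the defaults. The probe is injectable so the loop can be exercised without a running daemon.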

Ready when you are

Bring intelligence in-house.

UnifiedEngine runs on macOS arm64. Build from source today, or package a signed release with the included scripts.