Real llama.cpp runtime
Not a wrapper or a mock. UnifiedEngine links the llama.cpp C API directly and loads GGUF models through the genuine inference runtime.
UnifiedEngine is a local AI daemon for Apple Silicon. It loads GGUF models through a real llama.cpp runtime, accelerates them with Metal, and serves them behind an OpenAI-compatible API — all without a single byte leaving your Mac.
$ cargo run -p ue-daemon -- --model qwen.gguf
✓ llama.cpp runtime ready backend=metal
✓ GGUF loaded model=local-gguf
✓ listening 127.0.0.1:38180
$ curl -N localhost:38180/v1/chat/completions \
-d '{"messages":[{"role":"user",
"content":"Hello"}],"stream":true}'
data: {"delta":{"content":"Running "}}
data: {"delta":{"content":"locally "}}
data: {"delta":{"content":"on Metal."}}
data: [DONE]
A genuine local inference stack, not a thin proxy. Built to feel native on Apple Silicon and trivial to integrate.
The vendored llama.cpp build ships with the Metal backend, so generation runs on the Apple Silicon GPU — fast, cool, and battery-aware.
Tokens stream the moment they are produced, delivered over Server-Sent Events through the same OpenAI-compatible endpoint.
Point any OpenAI SDK at a custom base URL. /v1/chat/completions, /v1/models, and /v1/status work out of the box — no code rewrites.
Inference happens entirely on your Mac. No telemetry, no cloud round-trips. Your prompts and data never leave the device.
API-key auth, CORS allow-lists, request-size caps, concurrency limits, per-minute rate limiting, and request-ID tracing — built in.
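For clients that do not use an OpenAI SDK, the Server-Sent Events stream mentioned above can be read line by line. The following is a minimal sketch, assuming each `data:` line carries an OpenAI-style chunk (`choices[0].delta.content`) and that the stream ends with `data: [DONE]`. The function name `iter_sse_content` and the sample lines are hypothetical and exist only for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style SSE lines until [DONE]."""
    for raw in lines:
        line = raw.strip()
        # Skip blank separators and anything that is not a data field.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        # Some chunks carry only a role or a finish reason, with no content.
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content


# Hypothetical sample stream, assumed to follow the OpenAI chunk shape.
SAMPLE_STREAM = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Running "}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"locally "}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"on Metal."}}]}',
    "",
    "data: [DONE]",
]

if __name__ == "__main__":
    print("".join(iter_sse_content(SAMPLE_STREAM)))
```

In practice the `lines` argument would be the decoded lines of the HTTP response body. Parsing is kept separate from the network call so it can be checked without a running daemon.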
From the native macOS control plane down to the Metal kernels, every layer has one job — and all of them run on your machine.
Keep your existing SDK. Change the base URL. Inference now happens locally — streaming, status, and model metadata all included.
/health                 Unauthenticated readiness probe
/v1/status              Runtime, backend & active model
/v1/models              Active model metadata
/v1/chat/completions    Chat, streaming or not

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:38180/v1",
    api_key="local",  # or your UE_API_KEY
)
stream = client.chat.completions.create(
    model="local-gguf",
    messages=[{"role": "user",
               "content": "Explain Metal"}],
    stream=True,
)
for chunk in stream:
    # delta.content can be None (for example on the final chunk), so fall back to "".
    print(chunk.choices[0].delta.content or "", end="")

The SwiftUI control plane gives the daemon a face: run it, configure it, talk to it, and watch its logs from one window.
Start and stop the engine, watch runtime status, and configure host, port, CORS, and rate limits without touching a terminal.
A native chat surface wired straight to the local model — the fastest way to confirm everything works.
Pick any local .gguf file. Nothing is bundled; you stay in full control of which weights run.
Switch between API request logs and the full daemon log to inspect exactly what the engine is doing.
./scripts/build_llama_cpp_metal.sh

cargo run -p ue-daemon -- \
  --model /path/to/model.gguf

curl http://127.0.0.1:38180/v1/chat/completions \
  -d '{"model":"local-gguf",
       "messages":[{"role":"user",
                    "content":"Hello"}]}'

Prefer the GUI? The macOS app exposes the same settings and stores your API key in Keychain. No GGUF model is bundled; bring your own.
UnifiedEngine runs on macOS arm64. Build from source today, or package a signed release with the included scripts.
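The quick-start curl request can also be built with the Python standard library alone. This is a rough sketch under two stated assumptions: the daemon accepts a JSON body with an explicit `Content-Type: application/json` header, and the API key, when one is configured, is sent as a bearer token the way OpenAI SDKs send it. The helper names `build_chat_request` and `send` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:38180/v1"  # address used in the quick start


def build_chat_request(prompt, api_key=None, model="local-gguf", base_url=BASE_URL):
    """Build (but do not send) a non-streaming chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the key travels as a bearer token, as OpenAI SDKs send it.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and decode the JSON response (needs a running daemon)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    req = build_chat_request("Hello")
    print(req.get_method(), req.full_url)
    # With the daemon running, and assuming the OpenAI response shape:
    # print(send(req)["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes it possible to inspect the URL, headers, and body before any network traffic happens.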