Running multiple local LLMs on Apple Silicon means juggling ~140 GB of weights, wired-memory limits, profile YAMLs, and Open WebUI config drift. You end up administering the stack instead of using it.
4lm is the CLI that refuses to let that happen.
Two shapes
One installer, one CLI, two ways to run it.
Workstation — your Mac IS the LLM
./install.sh # omlx + Open WebUI + opencode TUI
4lm start
open http://localhost:3000 # register your admin account here, first
4lm opencode # daily driver
Open WebUI on http://localhost:3000 with RAG, web search, code
interpreter, and memory wired in by default. opencode in your
terminal, pointed at the local /v1. Your laptop is the assistant.
Appliance — a Mac in your closet serves the LAN
./install.sh --backend-only # skips Open WebUI + opencode
4lm start
4lm expose lan --confirm
Headless OpenAI-compatible /v1/* API on the LAN. Other machines run
their own clients (opencode, Open WebUI, Continue.dev — anything that
speaks /v1) pointed at http://<host>:8000. The Mac Studio in the
closet does the inference; the Air on the couch does the typing.
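Before wiring a full client to the appliance, a quick reachability check from the client machine can save some guessing. The sketch below assumes the backend listens on port 8000 as described above and answers the standard OpenAI-compatible model listing at /v1/models. The helper names and the studio.local hostname are placeholders for illustration, not part of 4lm.

```shell
# Build the OpenAI-compatible base URL for an appliance on the LAN.
# Port 8000 comes from the text above; the helper names are hypothetical.
v1_base_url() {
  host="$1"
  port="${2:-8000}"
  printf 'http://%s:%s/v1\n' "$host" "$port"
}

# Return 0 if the appliance answers on /v1/models within 5 seconds.
appliance_reachable() {
  curl -fsS --max-time 5 -o /dev/null "$(v1_base_url "$1")/models"
}

# Usage from the client machine (studio.local is a placeholder hostname):
#   appliance_reachable studio.local && echo "backend is up"
```

Most OpenAI-compatible clients accept a base URL of this shape through a setting or environment variable; the exact name varies by client, so check each client's documentation.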
What it refuses to do
- Never auto-starts after reboot. A 70 GB working set should not sneak onto wired memory before you’ve made coffee. Opt in with 4lm autostart enable if you want it.
- Never binds to LAN without --confirm. No env-var bypass, no config typo, no “I thought it was already local.” 4lm expose lan is a deliberate two-step.
- Never silently breaks profile switches. 4lm profile set <name> validates the YAML → swaps the active symlink → polls /v1/models for 30 s → on timeout, restores the previous symlink and re-polls. Bad YAML never kills the stack.
- Never invalidates your knowledge base across profiles. Every omlx profile serves the embedder as qwen3-embedding and the reranker as qwen3-reranker. Switch tiers without reindexing.
- Never lets you OOM silently. install.sh enforces iogpu.wired_limit_mb=98304 via sudoers + sysctl. 4lm doctor smoke-tests inference; 4lm diag shows what’s actually running.
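The validate, swap, poll, restore sequence in the profile-switch item can be sketched in a few lines of shell. This is not 4lm's actual implementation. The directory layout (a profiles directory under ~/.4lm with an active symlink), the non-empty-file check standing in for YAML validation, and every function name are assumptions made for illustration. Any backend reload the real tool performs between the swap and the poll is left out.

```shell
# Assumed layout (not confirmed by 4lm): <dir>/<name>.yaml plus an "active" symlink.
PROFILES_DIR="${PROFILES_DIR:-$HOME/.4lm/profiles}"
POLL_SECS="${POLL_SECS:-30}"

# Stand-in for real YAML validation: only checks the file exists and is non-empty.
validate_profile() {
  [ -s "$PROFILES_DIR/$1.yaml" ]
}

# Poll /v1/models once per second until it answers or POLL_SECS tries are used up.
poll_models() {
  tries=0
  while [ "$tries" -lt "$POLL_SECS" ]; do
    curl -fs --max-time 2 -o /dev/null http://localhost:8000/v1/models && return 0
    sleep 1
    tries=$((tries + 1))
  done
  return 1
}

# Swap the active symlink; put the old target back if the backend never answers.
switch_profile() {
  new="$1"
  validate_profile "$new" || { echo "invalid profile: $new" >&2; return 2; }
  prev=$(readlink "$PROFILES_DIR/active") || prev=""
  ln -sfn "$new.yaml" "$PROFILES_DIR/active"
  poll_models && return 0
  echo "timeout; restoring previous profile" >&2
  [ -n "$prev" ] && ln -sfn "$prev" "$PROFILES_DIR/active"
  poll_models || echo "previous profile did not recover either" >&2
  return 1
}
```

Validation runs before the symlink is touched, so a bad file is rejected with the stack still on its old profile. The rollback path only has to handle a profile that parses but fails to serve.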
Profile lineup
Six profiles. The three Qwen3-stack tiers share an 8B embedder so knowledge bases stay valid across switches.
| Profile | Backend | Coder | Chat | Embed | Rerank | Vision | Steady | Fits on |
|---|---|---|---|---|---|---|---|---|
| lean | omlx | Qwen3-Coder-30B-A3B | Qwen3.6-35B-A3B | 8B | 0.6B | — | ~40 GB | 64 GB+ |
| default | omlx | Qwen3-Coder-Next (80B) | Qwen3.6-35B-A3B | 8B | 0.6B | VL-8B | ~65 GB | 96 GB+ |
| max-100gb | omlx | Qwen3-Coder-Next (80B) | Qwen3-Next-80B | 8B | 4B | VL-8B | ~92 GB | 128 GB |
| mlx-coding | omlx | Qwen3-Coder-Next (80B) | — | — | — | — | ~42 GB | 64 GB+ |
| mlx-knowledge | omlx | — | Qwen3.6-35B-A3B | 8B | 0.6B | — | ~23 GB | 36 GB+ |
| ollama | ollama | qwen3-coder-next:q4_K_M | — | — | — | — | ~22 GB | 36 GB+ |
The everyday ladder is lean → default → max-100gb. mlx-coding
strips everything except the 80B coder for long agentic sessions.
mlx-knowledge is the text-only vault-synthesis tier. ollama is
the GGUF smoke test.
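As a rough illustration of the everyday ladder, the helper below maps installed memory to the largest ladder profile whose "Fits on" figure it meets. The thresholds are copied from the table. The function name is hypothetical and nothing like it is claimed to ship with 4lm.

```shell
# Pick the largest everyday-ladder profile for a given amount of unified memory.
# Thresholds are the "Fits on" column of the table; the function name is hypothetical.
pick_profile() {
  ram_gb="$1"
  if   [ "$ram_gb" -ge 128 ]; then echo max-100gb
  elif [ "$ram_gb" -ge 96 ];  then echo default
  elif [ "$ram_gb" -ge 64 ];  then echo lean
  else
    echo "no everyday-ladder profile fits in ${ram_gb} GB" >&2
    return 1
  fi
}

# On macOS, hw.memsize reports physical memory in bytes:
#   ram_gb=$(( $(sysctl -n hw.memsize) / 1073741824 ))
#   4lm profile set "$(pick_profile "$ram_gb")"
```

Machines below 64 GB fall off the ladder in this sketch. Per the table, mlx-knowledge and ollama are the profiles listed as fitting on 36 GB and up.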
Architecture
4lm (single control command)
│ bootstrap / bootout / kickstart
▼
┌──────────────────────────────────┐ ┌──────────────────────────────────┐
│ com.4lm.backend │ │ com.4lm.webui │
│ omlx | mlx_lm | ollama │ │ open-webui serve │
│ :8000 (OpenAI API) │←───│ :3000 (Web UI) │
└──────────────────────────────────┘ └──────────────────────────────────┘
▲ ▲
│ HTTP │ HTTP (browser)
┌─────┴────┐ ┌─────┴────┐
│ opencode │ │ Safari │
│ TUI │ │ Chrome │
└──────────┘ └──────────┘
The backend is the source of truth. Open WebUI is a stateless
frontend proxying to it. opencode talks directly to :8000/v1.
None of them know or care about each other — the OpenAI-compatible
API is the seam.
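Because the API is the only contract, any HTTP client can stand in for opencode or Open WebUI. The sketch below builds an OpenAI-style embeddings request for the qwen3-embedding alias named earlier and posts it to the local backend. It assumes the backend serves the standard /v1/embeddings route, which the text implies but does not spell out. The helper names are illustrative.

```shell
# Escape backslashes and double quotes so plain text can sit inside a JSON string.
# (Newlines and other control characters are not handled in this sketch.)
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build an OpenAI-style embeddings request for the stable embedder alias.
embed_payload() {
  printf '{"model":"qwen3-embedding","input":"%s"}\n' "$(json_escape "$1")"
}

# POST the payload to the local backend; /v1/embeddings is assumed to be served.
embed() {
  embed_payload "$1" |
    curl -fsS -H 'Content-Type: application/json' -d @- \
      http://localhost:8000/v1/embeddings
}

# Usage: embed 'wired memory on Apple Silicon'
```

Since the alias is the same across omlx profiles that serve an embedder, a script like this should not need editing after a profile switch.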
Quickstart
make bootstrap # Brewfile + Brewfile-tui (skipped if BACKEND_ONLY=1)
make install # ~/.4lm/, sudoers, sysctl, pipx deps, log rotation
make models # ~140 GB from HuggingFace (idempotent)
4lm start # bootstrap launchd agents
4lm opencode # daily driver
64 GB Macs: switch to the lean profile first. Run 4lm profile set lean before make models. lean fits in 40 GB and downloads ~80 GB instead of ~140 GB.
After a reboot: 4lm start. There’s no autostart and that’s a feature.
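Given download sizes this large, a disk-space check before make models can avoid a failed pull. The sketch below uses only the two figures quoted above (~80 GB for lean, ~140 GB otherwise) and treats every other profile as the full ~140 GB, which overestimates the smaller ones. The function names are hypothetical, and where the weights actually land on disk is not stated here, so the path in the usage line is a placeholder.

```shell
# Approximate download size in GB per profile, from the figures quoted above.
# Only lean (~80 GB) and the full set (~140 GB) are stated; others assume 140.
download_gb() {
  case "$1" in
    lean) echo 80 ;;
    *)    echo 140 ;;
  esac
}

# Free space in whole GiB on the filesystem holding the given path.
free_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# Succeed only if the target directory has room for the profile's weights.
has_room() {
  profile="$1"
  dir="$2"
  [ "$(free_gb "$dir")" -ge "$(download_gb "$profile")" ]
}

# Usage (the directory is a placeholder for wherever the weights are stored):
#   has_room lean "$HOME" || echo "not enough disk for lean" >&2
```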
Where to go next
- README — the full pitch, command reference, file layout
- Setup runbook — operator details, troubleshooting, LAN client wiring
- Profile schema — YAML key reference
- Autostart — opt-in login autostart mechanics
- CHANGELOG — version history
- SECURITY — threat model + vulnerability reporting
MIT licensed. Built on omlx, mlx_lm, Ollama, Open WebUI, opencode, and the Qwen3 model family.