
Why Self-Host Your AI? The Honest Case for Local Inference

I tried local AI inference on an M4 Max, failed with dense models, then hit 94 tok/s with MoE architecture via MLX. Here's what changed, what still needs APIs, and how to build a hybrid sovereignty stack.

The pitch vs the reality

Running AI on your own hardware is the cleanest version of the sovereignty argument. No API provider sees your prompts. No terms of service govern what you can ask. No subscription gets cancelled. Your models, your machine, your rules. For anyone who cares about AI independence, local inference is the obvious destination.

Then you actually try it. I run a Mac Studio M4 Max with 64GB of unified memory. Solid hardware by any measure. I loaded 32B-class dense models via Ollama, pointed my coding agent at them, and waited. The models worked. They just weren’t fast enough for the kind of multi-step agent work I was doing with OpenClaw. At 22 tokens per second, every chained inference call felt like wading through mud. I went back to Venice and Morpheus APIs within a week.

That’s where most people’s self-hosting story ends. Mine didn’t. A different model architecture, Mixture of Experts (MoE), combined with Apple’s MLX framework instead of Ollama, changed the equation completely. The same Mac Studio now runs Qwen3.5-35B-A3B at 94 tokens per second. Local inference is my primary setup for OpenClaw, with cloud APIs as fallback. The model architecture and framework mattered more than the hardware.

What 64GB of unified memory gets you

The M4 Max with 64GB is one of the better consumer machines for local inference. After macOS takes its share, you have roughly 54-56GB of usable memory for models. That’s enough for a 70B model at Q4 quantisation or a 32B at full 8-bit precision with room to spare.
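The arithmetic behind those claims is straightforward. A back-of-the-envelope sketch; the per-parameter byte counts and the overhead factor are approximations, not measured values:

```python
# Rough memory arithmetic for fitting models in ~54 GB of usable unified memory.
# Q4 stores ~0.5 bytes/parameter plus quantisation overhead; 8-bit stores ~1 byte/parameter.
# KV cache and runtime buffers come on top of this.

GB = 1e9
usable = 54 * GB  # conservative end of the 54-56 GB estimate

def weights_gb(params_billions: float, bytes_per_param: float, overhead: float = 1.1) -> float:
    """Approximate resident size of model weights in GB."""
    return params_billions * 1e9 * bytes_per_param * overhead / GB

print(f"70B @ Q4   : ~{weights_gb(70, 0.5):.0f} GB")   # ~38 GB: fits, with headroom for KV cache
print(f"32B @ 8-bit: ~{weights_gb(32, 1.0):.0f} GB")   # ~35 GB: fits comfortably
print(f"Budget     : ~{usable / GB:.0f} GB usable")
```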

With dense models on Ollama, the speed picture is modest. But it changes dramatically when you factor in model architecture and framework choice.

Local inference speed on M4 Max 64GB

Qwen3.5-35B-A3B MoE (MLX): 94 tok/s
Devstral Small 24B (Ollama): 25 tok/s
Qwen 3 32B 8-bit (Ollama): 22 tok/s
Llama 3.3 70B Q4 (Ollama): 11 tok/s

The top bar is not a typo. Qwen3.5-35B-A3B is a Mixture of Experts model: 35 billion total parameters, but only 3 billion active per token. The full model sits in memory (19GB at 4-bit quantisation), but each inference step only computes through a fraction of the network. The result is a 35B-class model that runs at speeds you’d normally associate with a 3B model. Pair that with Apple’s MLX framework, which is optimised for the M4 Max’s unified memory and Metal GPU, and you get 94 tokens per second. That’s faster than most cloud APIs.
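The same back-of-the-envelope arithmetic shows why memory and speed decouple for MoE. The ratio below is indicative only; real throughput also depends on routing overhead and memory bandwidth:

```python
# MoE decouples memory from per-token compute: memory scales with TOTAL
# parameters, per-token FLOPs scale with ACTIVE parameters.
total_params = 35e9   # Qwen3.5-35B-A3B total parameters
active_params = 3e9   # parameters active per token

weights_4bit_gb = total_params * 0.5 * 1.1 / 1e9   # ~0.5 bytes/param at 4-bit, plus overhead
compute_ratio = total_params / active_params        # dense 35B vs MoE, per token

print(f"Resident weights at 4-bit: ~{weights_4bit_gb:.0f} GB")        # ~19 GB, matching what sits in memory
print(f"Per-token compute vs dense 35B: ~{compute_ratio:.0f}x less")  # ~12x fewer active parameters
```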

Dense models tell a different story. A 70B model at 11 tokens per second is slower than comfortable reading speed. A 32B at 22 tokens per second is workable for interactive use but painful for anything iterative. Neither comes close to what MoE delivers.

94 vs 22 tok/s. MoE + MLX vs dense Ollama. Same hardware. Same memory. 4x faster, because only 3B of 35B parameters activate per token.

Quality has caught up faster than speed. Qwen 3 32B and Devstral 24B handle code generation, summarisation, and structured reasoning at a level that would have required 70B parameters in mid-2025. MoE models like Qwen3.5-35B-A3B push further: 35B-class quality at a speed that makes agent workflows practical on local hardware. They still can’t match a 235B model running on Venice’s infrastructure, but the gap is closing fast.

Where sovereign APIs still win

Venice and Morpheus both provide OpenAI-compatible APIs, so swapping them into existing tooling takes minutes. But with local inference now hitting 94 tok/s, the question shifts. It’s no longer “local or API?” It’s “which layer does each handle best?”
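The swap really is just a base-URL change. A minimal sketch with the openai Python client; the endpoint URLs and model names here are illustrative placeholders, not verified values, so check each provider’s docs for the real ones:

```python
from openai import OpenAI

# OpenAI-compatible endpoints differ only in base_url, key, and model name.
# The URLs and model identifiers below are placeholders, not verified values.
backends = {
    "local-mlx": {"base_url": "http://localhost:1234/v1", "api_key": "not-needed", "model": "qwen3.5-35b-a3b"},
    "venice":    {"base_url": "https://<venice-api-host>/v1", "api_key": "<VENICE_KEY>", "model": "<venice-model>"},
    "morpheus":  {"base_url": "https://<morpheus-gateway>/v1", "api_key": "<MOR_KEY>", "model": "<morpheus-model>"},
}

def ask(backend: str, prompt: str) -> str:
    cfg = backends[backend]
    client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("local-mlx", "Summarise this repo's build steps."))
```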

Local vs sovereign API (April 2026)

|  | Local (M4 Max 64GB, MLX) | Venice | Morpheus |
| --- | --- | --- | --- |
| Best model | Qwen3.5-35B-A3B (MoE) | Qwen 3.5 235B | MiniMax-M2.5 |
| Speed | 94 tok/s | 50+ tok/s | API-fast |
| Context window | 262K native (~100-150K practical on 64GB) | 32K-2M (varies by model) | ~200K verified (up to 1M advertised) |
| Cost model | ~$4,660 hardware (yr 1) | In-app: $18/mo; API: pay per token or stake DIEM | Stake MOR for daily credits, or pay per token (card/crypto) |

Local inference now competes on raw speed. 94 tok/s is faster than what most cloud APIs deliver. For agent workflows that chain five or six inference calls, this matters. OpenClaw runs on my local MLX server as its primary model, with cloud as fallback. That wasn’t possible three months ago with dense models on Ollama.
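The primary-plus-fallback pattern is simple to wire up. A hedged sketch, reusing the `ask()` helper from the earlier snippet; retry logic and error classification are left out:

```python
def ask_with_fallback(prompt: str, chain=("local-mlx", "venice", "morpheus")) -> str:
    """Try the local MLX server first; fall back to sovereign APIs only if it fails."""
    last_error = None
    for backend in chain:
        try:
            return ask(backend, prompt)
        except Exception as err:  # server down, network error, context too large, etc.
            last_error = err
    raise RuntimeError(f"all backends failed: {last_error}")
```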

Where APIs still earn their place: the largest models and the longest context. Venice serves Qwen 3.5 235B, which no consumer hardware can run. Morpheus provides MiniMax-M2.5, which I use for Agent Zero’s chat model. Morpheus advertises 1M context for M2.5; independent sources (OpenRouter, HuggingFace) verify ~200K as the tested practical window. Even at 200K, that’s a model I can’t run locally at acceptable speed. Venice also handles image and video generation, something local models don’t touch.

The cost picture has shifted too. Venice’s in-app Pro plan costs $18 per month for unlimited text generation. API access is separate (pay per token or stake DIEM for daily credits). Morpheus lets you stake MOR for daily inference credits that renew automatically. But the Mac Studio now handles the heaviest daily workload (agent inference for OpenClaw) at zero marginal cost. The APIs fill gaps rather than carrying the load. That changes the arithmetic compared to our earlier analysis, where we assumed APIs would handle complex reasoning and agent work.

Where local inference earns its place

With MoE + MLX, local inference is no longer just a specialist tool for slow, privacy-sensitive workloads. It handles agent workflows at API-competitive speeds while still covering every use case where data can’t leave the machine.

When to run locally

| Use case | Why local wins | Speed matters? |
| --- | --- | --- |
| Agent workflows (MoE + MLX) | 94 tok/s, no rate limits, full sovereignty | Yes, and local delivers |
| Embeddings + RAG | Sensitive docs never leave your machine | No (hundreds of tok/s) |
| Code completion | Short completions, codebase stays local | No (10-50 tokens at a time) |
| Fine-tuning | Your data shapes the model privately | No (training, not inference) |
| Batch processing | No rate limits, no per-token cost | No (run overnight) |
| Offline / air-gapped | Zero internet dependency | No (availability is the point) |

Agent workflows. This is the use case that didn’t work three months ago. OpenClaw chains five or six inference calls per task. With dense 32B models on Ollama, each call took long enough to break the flow. With Qwen3.5-35B-A3B on MLX at 94 tok/s, it works. The model’s 262K native context window (practically ~100-150K on 64GB before memory pressure) handles large codebases without truncation. No rate limits, no per-token cost, no prompts leaving the machine. Cloud APIs sit behind it as fallback, not as primary.

Embeddings and RAG over private documents. Embedding models like nomic-embed-text are tiny (under 1GB), run at hundreds of tokens per second on any recent hardware, and score within 3-4% of OpenAI’s embeddings on MTEB retrieval benchmarks. Pair one with a local vector database like ChromaDB and a 32B generation model, and you have a RAG pipeline over sensitive documents that never touches an external server. Construction contracts, financial models, medical records, legal strategy: if the documents can’t leave your machine, this is how you query them. This is probably the strongest practical case for local inference today.
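A minimal sketch of that pipeline, assuming Ollama is serving nomic-embed-text and a generation model on its default port. The endpoint paths are standard Ollama ones, but treat the model tags and chunking as a starting point rather than a drop-in implementation:

```python
import requests
import chromadb

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    # Ollama's embeddings endpoint; nomic-embed-text never leaves the machine.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def generate(prompt: str) -> str:
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": "qwen3:32b", "prompt": prompt, "stream": False})
    return r.json()["response"]

# Index sensitive documents in a local, persistent vector store.
client = chromadb.PersistentClient(path="./rag-index")
docs = client.get_or_create_collection("contracts")
for i, chunk in enumerate(["<contract clause 1>", "<contract clause 2>"]):
    docs.add(ids=[f"chunk-{i}"], documents=[chunk], embeddings=[embed(chunk)])

# Retrieve and answer without anything touching an external server.
question = "What are the liquidated damages terms?"
hits = docs.query(query_embeddings=[embed(question)], n_results=3)
context = "\n\n".join(hits["documents"][0])
print(generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}"))
```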

Code completion in your editor. Continue.dev connects to a local Ollama instance and provides code completion, refactoring, and explanation directly in VS Code. A 24-32B model at 22-25 tokens per second handles this well because code completions are short. You’re generating 10-50 tokens at a time, not 500. The latency that kills agent workflows barely registers for autocomplete. Your codebase stays on your machine.

Fine-tuning on your own data. Apple’s MLX library supports LoRA and QLoRA fine-tuning natively on Apple Silicon. 64GB handles fine-tuning a 7B model comfortably. You could personalise a model on your writing style, your contract clause library, or your codebase. The result is a model shaped by your specific patterns, running locally, with zero data leaving the machine. Early-stage tooling, but the capability is real.
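The workflow is mostly data preparation. A sketch of writing training data in a JSONL layout that mlx-lm’s LoRA tooling can consume; the exact schema and CLI flags vary by mlx-lm version, so treat both as assumptions and verify against the current docs:

```python
import json
from pathlib import Path

# Assumed layout: a data directory containing train.jsonl and valid.jsonl,
# one JSON object per line with a "text" field. Verify against your mlx-lm version.
samples = [
    {"text": "### Instruction: Rewrite this clause...\n### Response: The contractor shall..."},
    {"text": "### Instruction: Summarise this email...\n### Response: Key points are..."},
]

data_dir = Path("finetune-data")
data_dir.mkdir(exist_ok=True)
split = max(1, int(len(samples) * 0.9))
for name, rows in [("train.jsonl", samples[:split]), ("valid.jsonl", samples[split:] or samples[-1:])]:
    with open(data_dir / name, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

# Then something like the following (flags are indicative, not verified):
#   python -m mlx_lm.lora --model <base-model> --train --data finetune-data
```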

Batch processing where latency doesn’t matter. Need to classify, summarise, or extract data from hundreds of documents? Queue them up overnight. 70B at 10 tokens per second is perfectly fine when you’re not watching it. No rate limits, no per-token cost after the hardware investment, and it runs until you tell it to stop.
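A hedged sketch of that overnight pattern against a local Ollama instance; the directory layout and model tag are placeholders:

```python
import json
import pathlib
import requests

OLLAMA = "http://localhost:11434"
MODEL = "llama3.3:70b"   # slow is fine: nobody is watching it run

results = []
for doc in sorted(pathlib.Path("inbox").glob("*.txt")):
    prompt = f"Summarise the following document in five bullet points:\n\n{doc.read_text()}"
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": MODEL, "prompt": prompt, "stream": False})
    results.append({"file": doc.name, "summary": r.json()["response"]})

pathlib.Path("summaries.json").write_text(json.dumps(results, indent=2))
```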

Offline and air-gapped environments. Once you’ve downloaded your models, the entire stack works without an internet connection. Planes, remote construction sites, network outages, or environments where security policy prohibits external API calls. Ollama plus a GGUF model plus a local vector database is a complete, zero-dependency AI stack.

MLX: from speculation to production

When I first wrote this article, I mentioned MLX as a “worth trying” alternative to Ollama. That undersold it. MLX is now my primary inference server, running via mlx-lm on port 1234 and serving OpenClaw all day.
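Beyond the HTTP server, mlx-lm also works directly as a Python library, which is handy for one-off scripts. A sketch under two assumptions: the checkpoint name is a placeholder for whichever MLX-converted model you actually run, and the exact `generate` signature can differ between mlx-lm versions:

```python
from mlx_lm import load, generate

# Load an MLX-converted checkpoint into unified memory once, then reuse it.
# The repo name is illustrative; substitute the MLX build of your model.
model, tokenizer = load("mlx-community/<your-mlx-model>")

reply = generate(
    model,
    tokenizer,
    prompt="Explain the difference between dense and MoE inference in two sentences.",
    max_tokens=200,
)
print(reply)

# The OpenAI-compatible server is started separately, e.g. via mlx_lm.server
# on a chosen port (check the mlx-lm docs for the current flags).
```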

The speed gains over Ollama are real but vary by model type. For dense models, expect 20-50% improvement on the same hardware. For MoE models like Qwen3.5-35B-A3B, the combination is transformative: 94 tok/s on standard KV cache, compared to the low-20s you’d get from Ollama with a similar-quality dense model. MLX’s Metal GPU optimisation and Apple Silicon unified memory support make it the right framework for this hardware.

One practical note on memory. Qwen3.5-35B-A3B supports 262K context natively, and the MLX server is configured to match. On 64GB, you can work comfortably up to around 100-150K tokens of actual context before memory pressure becomes an issue. Beyond that, KV cache compression tools like TurboQuant (which I benchmarked at 43 tok/s with ~60% KV memory savings) extend the usable range at the cost of decode speed. For most agent workloads, standard KV cache at 94 tok/s is the right default. Ollama still earns its place for utility models: I run qwen3:8b via Ollama for Agent Zero’s utility tasks, where the simpler setup and lower resource footprint make more sense.
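To see why memory pressure bites well before the 262K limit, a rough KV-cache estimate helps. The formula is the standard one; the layer, head, and dimension values below are illustrative placeholders, not Qwen3.5-35B-A3B’s real configuration:

```python
# KV cache grows linearly with context length:
#   bytes ≈ 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens
# The architecture numbers here are ILLUSTRATIVE, not the model's actual config.
layers, kv_heads, head_dim = 48, 4, 128
bytes_per_value = 2  # fp16/bf16 cache

def kv_cache_gb(tokens: int) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

for ctx in (32_000, 100_000, 150_000, 262_000):
    print(f"{ctx:>7,} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
# With ~19 GB of weights already resident on a 64 GB machine, the long-context
# end of this table is where KV cache compression starts to matter.
```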

The right stack isn’t one or the other

Here’s what my actual stack looks like today. Local MLX serving Qwen3.5-35B-A3B handles the heaviest daily workload: OpenClaw’s agent inference at 94 tok/s, zero marginal cost, nothing leaving the machine. Morpheus provides MiniMax-M2.5 for Agent Zero’s chat model, where I need the larger context window. Local Ollama runs qwen3:8b for Agent Zero’s utility tasks. Venice handles ad hoc development work, image generation, and video. Each service has a fallback.

Six months ago, “sovereignty stack” meant sovereign APIs doing the hard work and local handling the scraps. That’s changed. MoE models and MLX shifted the balance. The Mac Studio now runs the primary workload, and APIs fill specific gaps: models too large to run locally, multi-modal generation, and context windows beyond what 64GB can sustain.

What matters is that every layer respects your data. Venice’s stated policy is zero prompt logging. Morpheus is a permissionless marketplace where providers compete and you’re not locked to one. Your local models answer to nobody. The sovereignty argument isn’t about where inference happens. It’s about who controls it. Build for optionality, not dogma.
