
Why Self-Host Your AI? The Honest Case for Local Inference

I tried local AI inference on an M4 Max, failed with dense models, then hit 94 tok/s with MoE architecture via MLX. Here's what changed, what still needs APIs, and how to build a hybrid sovereignty stack.

The pitch vs the reality

Running AI on your own hardware is the cleanest version of the sovereignty argument. No API provider sees your prompts. No terms of service govern what you can ask. No subscription gets cancelled. Your models, your machine, your rules. For anyone who cares about AI independence, local inference is the obvious destination.

Then you actually try it. I run a Mac Studio M4 Max with 64GB of unified memory. Solid hardware by any measure. I loaded 32B-class dense models via Ollama, pointed my coding agent at them, and waited. The models worked. They just weren’t fast enough for the kind of multi-step agent work I was doing with OpenClaw. At 22 tokens per second, every chained inference call felt like wading through mud. I went back to Venice and Morpheus APIs within a week.

That’s where most people’s self-hosting story ends. Mine didn’t. A different model architecture, Mixture of Experts (MoE), combined with Apple’s MLX framework instead of Ollama, changed the equation completely. The same Mac Studio now runs Qwen3.5-35B-A3B at 94 tokens per second. Local inference is my primary setup for OpenClaw, with cloud APIs as fallback. The model architecture and framework mattered more than the hardware.

What 64GB of unified memory gets you

The M4 Max with 64GB is one of the better consumer machines for local inference. After macOS takes its share, you have roughly 54-56GB of usable memory for models. That’s enough for a 70B model at Q4 quantisation or a 32B at full 8-bit precision with room to spare.
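The arithmetic behind those claims is straightforward. A back-of-the-envelope sketch; the per-parameter byte counts and the overhead factor are approximations, not measured values:

```python
# Rough memory arithmetic for fitting models in ~54 GB of usable unified memory.
# Q4 stores ~0.5 bytes/parameter plus quantisation overhead; 8-bit stores ~1 byte/parameter.
# KV cache and runtime buffers come on top of this.

GB = 1e9
usable = 54 * GB  # conservative end of the 54-56 GB estimate

def weights_gb(params_billions: float, bytes_per_param: float, overhead: float = 1.1) -> float:
    """Approximate resident size of model weights in GB."""
    return params_billions * 1e9 * bytes_per_param * overhead / GB

print(f"70B @ Q4   : ~{weights_gb(70, 0.5):.0f} GB")   # ~38 GB: fits, with headroom for KV cache
print(f"32B @ 8-bit: ~{weights_gb(32, 1.0):.0f} GB")   # ~35 GB: fits comfortably
print(f"Budget     : ~{usable / GB:.0f} GB usable")
```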

With dense models on Ollama, the speed picture is modest. But it changes dramatically when you factor in model architecture and framework choice.

Local inference speed on M4 Max 64GB

Qwen3.5-35B-A3B MoE (MLX): 94 tok/s
Devstral Small 24B (Ollama): 25 tok/s
Qwen 3 32B 8-bit (Ollama): 22 tok/s
Llama 3.3 70B Q4 (Ollama): 11 tok/s

The top bar is not a typo. Qwen3.5-35B-A3B is a Mixture of Experts model: 35 billion total parameters, but only 3 billion active per token. The full model sits in memory (19GB at 4-bit quantisation), but each inference step only computes through a fraction of the network. The result is a 35B-class model that runs at speeds you’d normally associate with a 3B model. Pair that with Apple’s MLX framework, which is optimised for the M4 Max’s unified memory and Metal GPU, and you get 94 tokens per second. That’s faster than most cloud APIs.
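The same back-of-the-envelope arithmetic shows why memory and speed decouple for MoE. The ratio below is indicative only; real throughput also depends on routing overhead and memory bandwidth:

```python
# MoE decouples memory from per-token compute: memory scales with TOTAL
# parameters, per-token FLOPs scale with ACTIVE parameters.
total_params = 35e9   # Qwen3.5-35B-A3B total parameters
active_params = 3e9   # parameters active per token

weights_4bit_gb = total_params * 0.5 * 1.1 / 1e9   # ~0.5 bytes/param at 4-bit, plus overhead
compute_ratio = total_params / active_params        # dense 35B vs MoE, per token

print(f"Resident weights at 4-bit: ~{weights_4bit_gb:.0f} GB")        # ~19 GB, matching what sits in memory
print(f"Per-token compute vs dense 35B: ~{compute_ratio:.0f}x less")  # ~12x fewer active parameters
```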

Dense models tell a different story. A 70B model at 11 tokens per second is slower than comfortable reading speed. A 32B at 22 tokens per second is workable for interactive use but painful for anything iterative. Neither comes close to what MoE delivers.

94 vs 22 tok/s. MoE + MLX vs dense Ollama. Same hardware. Same memory. 4x faster, because only 3B of 35B parameters activate per token.

Quality has caught up faster than speed. Qwen 3 32B and Devstral 24B handle code generation, summarisation, and structured reasoning at a level that would have required 70B parameters in mid-2025. MoE models like Qwen3.5-35B-A3B push further: 35B-class quality at a speed that makes agent workflows practical on local hardware. They still can’t match a 235B model running on Venice’s infrastructure, but the gap is closing fast.

Where sovereign APIs still win

Venice and Morpheus both provide OpenAI-compatible APIs, so swapping them into existing tooling takes minutes. But with local inference now hitting 94 tok/s, the question shifts. It’s no longer “local or API?” It’s “which layer does each handle best?”
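The swap really is just a base-URL change. A minimal sketch with the openai Python client; the endpoint URLs and model names here are illustrative placeholders, not verified values, so check each provider’s docs for the real ones:

```python
from openai import OpenAI

# OpenAI-compatible endpoints differ only in base_url, key, and model name.
# The URLs and model identifiers below are placeholders, not verified values.
backends = {
    "local-mlx": {"base_url": "http://localhost:1234/v1", "api_key": "not-needed", "model": "qwen3.5-35b-a3b"},
    "venice":    {"base_url": "https://<venice-api-host>/v1", "api_key": "<VENICE_KEY>", "model": "<venice-model>"},
    "morpheus":  {"base_url": "https://<morpheus-gateway>/v1", "api_key": "<MOR_KEY>", "model": "<morpheus-model>"},
}

def ask(backend: str, prompt: str) -> str:
    cfg = backends[backend]
    client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("local-mlx", "Summarise this repo's build steps."))
```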

Local vs sovereign API (April 2026)

|  | Local (M4 Max 64GB, MLX) | Venice | Morpheus |
| --- | --- | --- | --- |
| Best model | Qwen3.5-35B-A3B (MoE) | Qwen 3.5 235B | MiniMax-M2.5 |
| Speed | 94 tok/s | 50+ tok/s | API-fast |
| Context window | 262K native (~100-150K practical on 64GB) | 32K-2M (varies by model) | ~200K verified (up to 1M advertised) |
| Cost model | ~$4,660 hardware (yr 1) | In-app: $18/mo; API: pay per token or stake DIEM | Stake MOR for daily credits, or pay per token (card/crypto) |

Local inference now competes on raw speed. 94 tok/s is faster than what most cloud APIs deliver. For agent workflows that chain five or six inference calls, this matters. OpenClaw runs on my local MLX server as its primary model, with cloud as fallback. That wasn’t possible three months ago with dense models on Ollama.
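The primary-plus-fallback pattern is simple to wire up. A hedged sketch, reusing the `ask()` helper from the earlier snippet; retry logic and error classification are left out:

```python
def ask_with_fallback(prompt: str, chain=("local-mlx", "venice", "morpheus")) -> str:
    """Try the local MLX server first; fall back to sovereign APIs only if it fails."""
    last_error = None
    for backend in chain:
        try:
            return ask(backend, prompt)
        except Exception as err:  # server down, network error, context too large, etc.
            last_error = err
    raise RuntimeError(f"all backends failed: {last_error}")
```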

Where APIs still earn their place: the largest models and the longest context. Venice serves Qwen 3.5 235B, which no consumer hardware can run. Morpheus provides MiniMax-M2.5, which I use for Agent Zero’s chat model. Morpheus advertises 1M context for M2.5; independent sources (OpenRouter, HuggingFace) verify ~200K as the tested practical window. Even at 200K, that’s a model I can’t run locally at acceptable speed. Venice also handles image and video generation, something local models don’t touch.

The cost picture has shifted too. Venice’s in-app Pro plan costs $18 per month for unlimited text generation. API access is separate (pay per token or stake DIEM for daily credits). Morpheus lets you stake MOR for daily inference credits that renew automatically. But the Mac Studio now handles the heaviest daily workload (agent inference for OpenClaw) at zero marginal cost. The APIs fill gaps rather than carrying the load. That changes the arithmetic compared to our earlier analysis, where we assumed APIs would handle complex reasoning and agent work.

Where local inference earns its place

With MoE + MLX, local inference is no longer just a specialist tool for slow, privacy-sensitive workloads. It handles agent workflows at API-competitive speeds while still covering every use case where data can’t leave the machine.

When to run locally

| Use case | Why local wins | Speed matters? |
| --- | --- | --- |
| Agent workflows (MoE + MLX) | 94 tok/s, no rate limits, full sovereignty | Yes, and local delivers |
| Embeddings + RAG | Sensitive docs never leave your machine | No (hundreds of tok/s) |
| Code completion | Short completions, codebase stays local | No (10-50 tokens at a time) |
| Fine-tuning | Your data shapes the model privately | No (training, not inference) |
| Batch processing | No rate limits, no per-token cost | No (run overnight) |
| Offline / air-gapped | Zero internet dependency | No (availability is the point) |

Agent workflows. This is the use case that didn’t work three months ago. OpenClaw chains five or six inference calls per task. With dense 32B models on Ollama, each call took long enough to break the flow. With Qwen3.5-35B-A3B on MLX at 94 tok/s, it works. The model’s 262K native context window (practically ~100-150K on 64GB before memory pressure) handles large codebases without truncation. No rate limits, no per-token cost, no prompts leaving the machine. Cloud APIs sit behind it as fallback, not as primary.

Embeddings and RAG over private documents. Embedding models like nomic-embed-text are tiny (under 1GB), run at hundreds of tokens per second on any recent hardware, and score within 3-4% of OpenAI’s embeddings on MTEB retrieval benchmarks. Pair one with a local vector database like ChromaDB and a 32B generation model, and you have a RAG pipeline over sensitive documents that never touches an external server. Construction contracts, financial models, medical records, legal strategy: if the documents can’t leave your machine, this is how you query them. This is probably the strongest practical case for local inference today.
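A minimal sketch of that pipeline, assuming Ollama is serving nomic-embed-text and a generation model on its default port. The endpoint paths are standard Ollama ones, but treat the model tags and chunking as a starting point rather than a drop-in implementation:

```python
import requests
import chromadb

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    # Ollama's embeddings endpoint; nomic-embed-text never leaves the machine.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def generate(prompt: str) -> str:
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": "qwen3:32b", "prompt": prompt, "stream": False})
    return r.json()["response"]

# Index sensitive documents in a local, persistent vector store.
client = chromadb.PersistentClient(path="./rag-index")
docs = client.get_or_create_collection("contracts")
for i, chunk in enumerate(["<contract clause 1>", "<contract clause 2>"]):
    docs.add(ids=[f"chunk-{i}"], documents=[chunk], embeddings=[embed(chunk)])

# Retrieve and answer without anything touching an external server.
question = "What are the liquidated damages terms?"
hits = docs.query(query_embeddings=[embed(question)], n_results=3)
context = "\n\n".join(hits["documents"][0])
print(generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}"))
```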

Code completion in your editor. Continue.dev connects to a local Ollama instance and provides code completion, refactoring, and explanation directly in VS Code. A 24-32B model at 22-25 tokens per second handles this well because code completions are short. You’re generating 10-50 tokens at a time, not 500. The latency that kills agent workflows barely registers for autocomplete. Your codebase stays on your machine.

Fine-tuning on your own data. Apple’s MLX library supports LoRA and QLoRA fine-tuning natively on Apple Silicon. 64GB handles fine-tuning a 7B model comfortably. You could personalise a model on your writing style, your contract clause library, or your codebase. The result is a model shaped by your specific patterns, running locally, with zero data leaving the machine. Early-stage tooling, but the capability is real.
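The workflow is mostly data preparation. A sketch of writing training data in a JSONL layout that mlx-lm’s LoRA tooling can consume; the exact schema and CLI flags vary by mlx-lm version, so treat both as assumptions and verify against the current docs:

```python
import json
from pathlib import Path

# Assumed layout: a data directory containing train.jsonl and valid.jsonl,
# one JSON object per line with a "text" field. Verify against your mlx-lm version.
samples = [
    {"text": "### Instruction: Rewrite this clause...\n### Response: The contractor shall..."},
    {"text": "### Instruction: Summarise this email...\n### Response: Key points are..."},
]

data_dir = Path("finetune-data")
data_dir.mkdir(exist_ok=True)
split = max(1, int(len(samples) * 0.9))
for name, rows in [("train.jsonl", samples[:split]), ("valid.jsonl", samples[split:] or samples[-1:])]:
    with open(data_dir / name, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

# Then something like the following (flags are indicative, not verified):
#   python -m mlx_lm.lora --model <base-model> --train --data finetune-data
```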

Batch processing where latency doesn’t matter. Need to classify, summarise, or extract data from hundreds of documents? Queue them up overnight. 70B at 10 tokens per second is perfectly fine when you’re not watching it. No rate limits, no per-token cost after the hardware investment, and it runs until you tell it to stop.
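A hedged sketch of that overnight pattern against a local Ollama instance; the directory layout and model tag are placeholders:

```python
import json
import pathlib
import requests

OLLAMA = "http://localhost:11434"
MODEL = "llama3.3:70b"   # slow is fine: nobody is watching it run

results = []
for doc in sorted(pathlib.Path("inbox").glob("*.txt")):
    prompt = f"Summarise the following document in five bullet points:\n\n{doc.read_text()}"
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": MODEL, "prompt": prompt, "stream": False})
    results.append({"file": doc.name, "summary": r.json()["response"]})

pathlib.Path("summaries.json").write_text(json.dumps(results, indent=2))
```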

Offline and air-gapped environments. Once you’ve downloaded your models, the entire stack works without an internet connection. Planes, remote construction sites, network outages, or environments where security policy prohibits external API calls. Ollama plus a GGUF model plus a local vector database is a complete, zero-dependency AI stack.

MLX: from speculation to production

When I first wrote this article, I mentioned MLX as a “worth trying” alternative to Ollama. That undersold it. MLX is now my primary inference server, running via mlx-lm on port 1234 and serving OpenClaw all day.
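Beyond the HTTP server, mlx-lm also works directly as a Python library, which is handy for one-off scripts. A sketch under two assumptions: the checkpoint name is a placeholder for whichever MLX-converted model you actually run, and the exact `generate` signature can differ between mlx-lm versions:

```python
from mlx_lm import load, generate

# Load an MLX-converted checkpoint into unified memory once, then reuse it.
# The repo name is illustrative; substitute the MLX build of your model.
model, tokenizer = load("mlx-community/<your-mlx-model>")

reply = generate(
    model,
    tokenizer,
    prompt="Explain the difference between dense and MoE inference in two sentences.",
    max_tokens=200,
)
print(reply)

# The OpenAI-compatible server is started separately, e.g. via mlx_lm.server
# on a chosen port (check the mlx-lm docs for the current flags).
```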

The speed gains over Ollama are real but vary by model type. For dense models, expect 20-50% improvement on the same hardware. For MoE models like Qwen3.5-35B-A3B, the combination is transformative: 94 tok/s on standard KV cache, compared to the low-20s you’d get from Ollama with a similar-quality dense model. MLX’s Metal GPU optimisation and Apple Silicon unified memory support make it the right framework for this hardware.

One practical note on memory. Qwen3.5-35B-A3B supports 262K context natively, and the MLX server is configured to match. On 64GB, you can work comfortably up to around 100-150K tokens of actual context before memory pressure becomes an issue. Beyond that, KV cache compression tools like TurboQuant (which I benchmarked at 43 tok/s with ~60% KV memory savings) extend the usable range at the cost of decode speed. For most agent workloads, standard KV cache at 94 tok/s is the right default. Ollama still earns its place for utility models: I run qwen3:8b via Ollama for Agent Zero’s utility tasks, where the simpler setup and lower resource footprint make more sense.
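To see why memory pressure bites well before the 262K limit, a rough KV-cache estimate helps. The formula is the standard one; the layer, head, and dimension values below are illustrative placeholders, not Qwen3.5-35B-A3B’s real configuration:

```python
# KV cache grows linearly with context length:
#   bytes ≈ 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value * tokens
# The architecture numbers here are ILLUSTRATIVE, not the model's actual config.
layers, kv_heads, head_dim = 48, 4, 128
bytes_per_value = 2  # fp16/bf16 cache

def kv_cache_gb(tokens: int) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

for ctx in (32_000, 100_000, 150_000, 262_000):
    print(f"{ctx:>7,} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
# With ~19 GB of weights already resident on a 64 GB machine, the long-context
# end of this table is where KV cache compression starts to matter.
```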

The right stack isn’t one or the other

Here’s what my actual stack looks like today. Local MLX serving Qwen3.5-35B-A3B handles the heaviest daily workload: OpenClaw’s agent inference at 94 tok/s, zero marginal cost, nothing leaving the machine. Morpheus provides MiniMax-M2.5 for Agent Zero’s chat model, where I need the larger context window. Local Ollama runs qwen3:8b for Agent Zero’s utility tasks. Venice handles ad hoc development work, image generation, and video. Each service has a fallback.

Six months ago, “sovereignty stack” meant sovereign APIs doing the hard work and local handling the scraps. That’s changed. MoE models and MLX shifted the balance. The Mac Studio now runs the primary workload, and APIs fill specific gaps: models too large to run locally, multi-modal generation, and context windows beyond what 64GB can sustain.

What matters is that every layer respects your data. Venice’s stated policy is zero prompt logging. Morpheus is a permissionless marketplace where providers compete and you’re not locked to one. Your local models answer to nobody. The sovereignty argument isn’t about where inference happens. It’s about who controls it. Build for optionality, not dogma.
