
Open-Weight AI Models Compared

A practical comparison of the best open-weight AI models for local and sovereign inference. Llama, Qwen, DeepSeek, Mistral, Gemma and more, with licence clarity and hardware requirements. Updated monthly.

Ten labs now release AI models you can download and run on your own hardware. Meta, Alibaba, DeepSeek, Mistral, Google and Microsoft, joined through 2026 by Z.ai, MiniMax, Moonshot and Cohere. Between them, they cover everything from 3B parameter models that run on a phone to a 1.6T parameter behemoth that needs a data centre.

But “open” carries a much narrower meaning than most people assume. And the model you choose determines what you can actually do with it, legally and practically. This page tracks the models that matter for sovereign inference: who made them, what licence they carry, how much hardware they need, and whether you can run them without asking anyone’s permission.

For the practical case for (and against) running these locally, see Why Self-Host Your AI. For hardware setup, see Mac Studio DeAI Setup.

Open-weight vs open-source: the distinction that matters

The AI industry uses “open source” loosely. Most models marketed as open source are not. The distinction is simple and it matters.

Truly open source means the model weights ship under a licence approved by the Open Source Initiative (OSI): Apache 2.0 or MIT. You can use the model commercially, modify it, redistribute it, fine-tune it, and deploy it without restriction. Qwen 3.5 and 3.6, DeepSeek V4, Mistral Large 3 and Small 4, Phi-4, Z.ai’s GLM-5.2, and (as of Gemma 4) Google’s Gemma all qualify. Gemma’s move to Apache 2.0 in 2026 is a meaningful shift: the previous Gemma Terms restricted fine-tuned redistribution, and that restriction is now gone.

Open weight, restricted licence means you can download and run the weights but the licence limits what you can do. Meta’s Llama models restrict use by applications with more than 700 million monthly active users. MiniMax’s community licence is non-commercial by default and gates larger commercial use behind written authorisation. Moonshot’s Kimi K2.6 ships under a Modified MIT that adds an attribution requirement above 100 million monthly users. Cohere’s Command A is non-commercial entirely. These restrictions may not affect individual users today, but they constrain what you can build on top of them.

For sovereignty purposes, the difference matters. If you’re building infrastructure on top of a model, an Apache 2.0 or MIT licence means nobody can pull the rug. A restricted licence means someone can change the terms.
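The licence tiers above can be written down as a small lookup table. The sketch below is illustrative only: the terms are paraphrased from this page rather than from the licence texts, the key "Kimi Modified MIT" is a label invented here, and the field names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenceTerms:
    osi_approved: bool            # Apache 2.0 or MIT
    commercial_path: bool         # commercial use is possible, perhaps with conditions
    mau_threshold: Optional[int] = None  # monthly active users above which extra terms apply
    note: str = ""


# Terms as summarised on this page; always read the actual licence before relying on them.
LICENCES = {
    "Apache-2.0": LicenceTerms(True, True),
    "MIT": LicenceTerms(True, True),
    "Llama Community": LicenceTerms(False, True, 700_000_000, "use restricted above threshold"),
    "Kimi Modified MIT": LicenceTerms(False, True, 100_000_000, "attribution required above threshold"),
    "MiniMax Community": LicenceTerms(False, True, None, "commercial use needs written authorisation"),
    "CC-BY-NC": LicenceTerms(False, False, None, "non-commercial only"),
}


def classify(licence: str) -> str:
    """Map a licence to the labels used in the comparison table."""
    terms = LICENCES[licence]
    if terms.osi_approved:
        return "Open Source"
    return "Restricted" if terms.commercial_path else "Non-Commercial"


def extra_terms_apply(licence: str, monthly_active_users: int) -> bool:
    """True when a user-count threshold in the licence has been crossed."""
    threshold = LICENCES[licence].mau_threshold
    return threshold is not None and monthly_active_users > threshold


print(classify("MIT"), classify("Llama Community"), classify("CC-BY-NC"))
```

Keeping the threshold as data rather than prose makes it easier to notice when a product's growth would change which terms apply.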

The comparison

Data from HuggingFace API. Last refreshed: 2026-06-26. Downloads are 30-day rolling counts. Run npm run refresh:models to update.

| Model | Maker · Family | Params | Context | Licence | Min RAM (Q4) | Downloads |
|---|---|---|---|---|---|---|
| Gemma 4 31B | Google · Gemma | 31B | 256K | Open Source | 19 GB | 11.2M |
| Qwen 3.5 9B | Alibaba · Qwen | 9.7B | 256K | Open Source | 6 GB | 9.8M |
| Qwen 3.6 27B | Alibaba · Qwen | 27.8B | 256K | Open Source | 17 GB | 5.7M |
| Qwen 3.6 35B-A3B | Alibaba · Qwen | 36B (3B active) | 256K | Open Source | 21 GB | 5.6M |
| Gemma 4 12B | Google · Gemma | 12B | 256K | Open Source | 8 GB | 2.3M |
| DeepSeek V4-Flash | DeepSeek · DeepSeek | 284B (13B active) | 1M | Open Source | 160 GB | 2.1M |
| DeepSeek V4-Pro | DeepSeek · DeepSeek | 1600B (49B active) | 1M | Open Source | 860 GB | 1.7M |
| Ministral 3 3B | Mistral AI · Mistral | 3.8B | 256K | Open Source | 2.5 GB | 1.2M |
| Phi-4 | Microsoft · Phi | 14B | 16K | Open Source | 10 GB | 897K |
| Qwen 3.5 397B | Alibaba · Qwen | 397B (17B active) | 256K | Open Source | 220 GB | 557K |
| Mistral Small 4 | Mistral AI · Mistral | 119B (6.5B active) | 256K | Open Source | 74 GB | 130K |
| GLM-5.2 | Z.ai · GLM | 744B (40B active) | 1M | Open Source | 380 GB | 84K |
| Mistral Large 3 | Mistral AI · Mistral | 675B (41B active) | 256K | Open Source | 407 GB | 3K |
| Llama 3.1 8B | Meta · Llama | 8B | 128K | Restricted | 6 GB | 10.1M |
| Kimi K2.6 | Moonshot AI · Kimi | 1000B (32B active) | 256K | Restricted | 580 GB | 2.5M |
| Llama 4 Scout | Meta · Llama | 109B (17B active) | 10M | Restricted | 55 GB | 724K |
| Llama 3.3 70B | Meta · Llama | 70.6B | 128K | Restricted | 42 GB | 681K |
| MiniMax M3 | MiniMax · MiniMax | 428B (23B active) | 1M | Restricted | 235 GB | 170K |
| Command A | Cohere · Command | 111B | 256K | Non-Commercial | 67 GB | 2K |

Which model for which job

Hardware constraints and use case determine the right choice. Here’s how the current line-up maps onto common scenarios.

Laptop or Mini PC (8-16GB RAM). Ministral 3 3B, Qwen 3.5 9B, Gemma 4 12B, or Phi-4. All fit at Q4 quantisation on a 16GB machine; on 8GB, Ministral 3 3B (2.5GB) and Qwen 3.5 9B (6GB) are the realistic choices. Qwen 3.5 9B is the strongest all-rounder at this size with Apache 2.0 licensing and native vision. Gemma 4 12B is now Apache 2.0 too. Phi-4 punches above its weight on reasoning but has a short 16K context window and is the oldest model here.
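As a rough sanity check on Q4 memory figures like those in the comparison, a back-of-envelope estimate helps. The sketch below assumes about 4.5 bits per weight and 10% runtime overhead. Both numbers are assumptions made here, not values taken from this page's data pipeline, and real quantisation formats and context lengths will shift the result.

```python
def estimate_q4_gb(params_billion: float,
                   bits_per_weight: float = 4.5,   # assumed average for a 4-bit format
                   overhead: float = 1.1) -> float:  # assumed 10% for runtime buffers
    """Rough memory footprint in GB for a quantised model's weights."""
    # One billion parameters at 8 bits each is roughly 1 GB.
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead


for name, params in [("Qwen 3.5 9B", 9.7), ("Gemma 4 31B", 31.0), ("Llama 3.3 70B", 70.6)]:
    print(f"{name}: ~{estimate_q4_gb(params):.0f} GB")
```

The estimates land within a few gigabytes of the table's figures. Long contexts add memory on top of the weights, so treat the result as a floor rather than a budget.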

Mac Studio or workstation (32-64GB RAM). Qwen 3.6 27B is the sweet spot: around 17GB at Q4, Apache 2.0 licence, native multimodal, and quality that rivals 70B models from a year ago. Gemma 4 31B is another strong option at 19GB, and now ships under Apache 2.0. Qwen 3.6 35B-A3B is a fast alternative: 36B of quality but only 3B active per token. Llama 3.3 70B fits at Q4 (42GB) but runs slowly enough to feel sluggish for interactive use.
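Why a 36B mixture-of-experts model can feel faster than a dense 70B one comes down to how many bytes must be read per generated token. The sketch below uses a common rule of thumb: for memory-bound decoding, tokens per second cannot exceed memory bandwidth divided by the bytes of active weights. The 400 GB/s bandwidth is a hypothetical figure, and the result is a ceiling that ignores compute, attention cache reads and routing overhead.

```python
def decode_ceiling_tok_s(active_params_billion: float,
                         gb_per_billion: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/second when decoding is limited by memory bandwidth."""
    gb_read_per_token = active_params_billion * gb_per_billion
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 400.0  # hypothetical machine; substitute your own hardware's figure

# GB per billion parameters is derived from the table's Q4 sizes.
dense = decode_ceiling_tok_s(70.6, 42 / 70.6, BANDWIDTH_GB_S)  # Llama 3.3 70B: all weights active
moe = decode_ceiling_tok_s(3.0, 21 / 36, BANDWIDTH_GB_S)       # Qwen 3.6 35B-A3B: 3B active

print(f"dense 70B ceiling: {dense:.1f} tok/s")
print(f"35B-A3B ceiling:   {moe:.1f} tok/s")
```

Real throughput sits well below both ceilings, but the ratio explains the difference in feel. Memory capacity still has to cover the total parameter count, while speed follows the active count.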

Multi-GPU server or cloud. DeepSeek V4-Flash and V4-Pro are frontier-competitive under MIT licence with 1M context, though V4-Pro’s 1.6T parameters need serious infrastructure. Z.ai’s GLM-5.2 (MIT, 744B) is the strongest open coding model. Qwen 3.5 397B (Apache 2.0) is Alibaba’s largest model in this comparison and carries no licence restrictions. Mistral Large 3 (Apache 2.0) is the European sovereignty option.

Coding and development. GLM-5.2 and DeepSeek V4 lead the open field on coding benchmarks, but both need a server. For something that runs locally, Qwen 3.6 27B handles code generation well, and DeepSeek’s still widely used R1 distilled models (now folded into the V4 lineage upstream) run at 32B on consumer hardware. For editor-integrated completion via Continue.dev, a 24-32B model gives responsive autocomplete.

RAG and document analysis. Context window matters here. Qwen 3.5 and 3.6 support 256K tokens natively. DeepSeek V4, GLM-5.2 and MiniMax M3 reach 1M. Mistral Small 4 supports 256K. For embedding generation, run nomic-embed-text or BGE-M3 alongside your generation model: they’re under 1GB and don’t compete for memory.
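The retrieval step that embeddings enable is small enough to sketch in a few lines. The vectors below are toy three-dimensional placeholders invented for illustration. In practice they would come from an embedding model such as nomic-embed-text and have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(doc_vecs.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy embeddings (hypothetical values, not real model output).
docs = {
    "licence-notes": [0.9, 0.1, 0.0],
    "hardware-notes": [0.1, 0.9, 0.1],
    "recipes": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]

print(top_k(query, docs, k=2))  # ['licence-notes', 'hardware-notes']
```

The retrieved passages are then placed in the generation model's prompt, which is where the context window sizes above start to matter.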

Privacy-first use cases. Any model running via Ollama or LM Studio on your own hardware provides Level 4 privacy (the highest on our privacy spectrum). No prompts leave the machine. For situations where data cannot touch an external server, local inference with an open-source model is the only option that provides both legal clarity and architectural guarantees.
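To make "no prompts leave the machine" concrete, here is a minimal sketch of a request to a locally running Ollama server. It assumes Ollama is listening on its default port 11434 and that the model has already been pulled. The model tag is passed in by the caller because the exact tag names are not given on this page.

```python
import json
import urllib.request

# Ollama's default local endpoint; the request never leaves localhost.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarise this contract clause: ...", model="<your-model-tag>")` requires the server to be running. Replace the placeholder tag with whatever `ollama list` reports on your machine.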

The sovereignty lens

Three questions determine whether a model is sovereign-compatible:

  1. Can you run it without permission? Apache 2.0 and MIT: yes. Llama Community and MiniMax Community: conditionally. CC-BY-NC: not commercially.

  2. Can you build on it without risk? If the licence allows commercial use and modification without restrictions, you can build products, fine-tune for clients, and redistribute. If not, your project depends on someone else’s licence terms staying favourable.

  3. Can you run it on hardware you control? Models under 70GB at Q4 run on consumer hardware. Above that, you need data centre infrastructure, which may mean renting from a cloud provider and reintroducing a dependency.

The models that score highest on all three: Qwen 3.6 27B (Apache 2.0, 17GB, runs on a Mac), Gemma 4 12B (now Apache 2.0, 8GB, runs on a laptop), and Qwen 3.5 9B (Apache 2.0, 6GB, runs on almost anything). DeepSeek’s and Z.ai’s MIT-licensed flagships are excellent but too large for consumer hardware in their full form; smaller distilled checkpoints solve this.
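The three questions can be applied mechanically to rows of the comparison table. The sketch below is a simplification: it treats the table's "Open Source" label as a yes to the first two questions and the 70GB Q4 threshold as the test for the third. The class and field names are illustrative, and only a handful of rows are included.

```python
from dataclasses import dataclass

CONSUMER_LIMIT_GB = 70  # threshold from question 3


@dataclass(frozen=True)
class Model:
    name: str
    licence: str        # label from the comparison table
    min_ram_gb: float   # Q4 figure from the comparison table


MODELS = [
    Model("Qwen 3.6 27B", "Open Source", 17),
    Model("Gemma 4 12B", "Open Source", 8),
    Model("Qwen 3.5 9B", "Open Source", 6),
    Model("Mistral Small 4", "Open Source", 74),
    Model("DeepSeek V4-Pro", "Open Source", 860),
    Model("Llama 3.1 8B", "Restricted", 6),
    Model("Command A", "Non-Commercial", 67),
]


def sovereign_compatible(model: Model) -> bool:
    """Pass only if the licence is unrestricted and the model fits consumer hardware."""
    return model.licence == "Open Source" and model.min_ram_gb < CONSUMER_LIMIT_GB


passing = [m.name for m in MODELS if sovereign_compatible(m)]
print(passing)  # ['Qwen 3.6 27B', 'Gemma 4 12B', 'Qwen 3.5 9B']
```

A restricted model such as Llama 3.1 8B fails here even though it runs on a laptop, which is the point of asking the licence questions first.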

How we keep this current

Open-weight models release constantly. This page uses a hybrid approach:

Automated data refresh. Download counts and community metrics update weekly via the HuggingFace API (npm run refresh:models, run in CI every Monday). The “last refreshed” date at the top of the comparison shows when data was last pulled.
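One small piece of a refresh job like this is turning raw download counts into the compact figures shown in the table (11.2M, 897K, 3K). The sketch below is an assumption about how that formatting might work, not the site's actual refresh script. It takes an integer count and leaves the API call itself out of scope.

```python
def format_downloads(count: int) -> str:
    """Format a raw download count the way the comparison table displays it."""
    # Counts that would round up to 1000K are shown in millions instead.
    if count >= 999_500:
        return f"{count / 1_000_000:.1f}M"
    if count >= 1_000:
        return f"{round(count / 1_000)}K"
    return str(count)


for raw in (11_200_000, 897_000, 3_000):
    print(format_downloads(raw))
```

Rounding thousands to whole numbers and millions to one decimal matches the precision used in the table above.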

Manual editorial updates. New model families, licence changes, hardware requirements, and sovereignty assessments are updated during our monthly review process. We add models when they’re significant enough to change the comparison, not every time someone releases a 7B variant.

What we track. 19 models across ten families, chosen for relevance to sovereign and local inference. We exclude models that are research-only, not available on Ollama or LM Studio, or too niche to matter for our audience. If a model you care about is missing, it’s probably because it doesn’t meet one of those criteria.

For how these models fit into a broader sovereignty strategy, see The Sovereignty Stack.
