
Why Private AI Inference Is the Next Infrastructure Battle

When you send a prompt to an AI, who else sees it? The answer is more complicated than any project wants to admit. A hands-on guide to the five levels of AI privacy.

What “private” actually means

Every AI privacy project says it is private. Most of them are using the word differently.

When Venice says it is private, it means different things depending on which mode you choose. In Private mode, your identity is stripped from the prompt before it reaches a GPU provider. In TEE mode, inference runs inside a hardware enclave the provider can’t inspect. In E2EE mode, your prompt is encrypted on your device and only decrypted inside a verified enclave. When Phala says it is private, it means the GPU provider cannot read your prompt even during processing. When you run Llama on your Mac Studio, there is no provider to worry about at all.

These are three different things. All three call themselves private. Only one of them keeps the contents of your prompt invisible to everyone except you.

This distinction matters because the use case defines the required level. Asking an AI to write a blog post does not need the same privacy as asking it to analyse your company’s financial projections, your medical records, or your legal strategy. Understanding where each platform sits on the spectrum is not academic. It determines what you should and should not trust them with.

The AI Privacy Spectrum

No privacy → Full isolation

  • Level 0: OpenAI / Anthropic
  • Level 1: Venice (Private)
  • Level 2: Venice (E2EE) / Phala / Oasis
  • Level 3: Arcium / Nillion
  • Level 4: Local

The five levels of AI inference privacy

Level 0: No privacy. OpenAI, Anthropic (default)

The provider sees your prompt in plaintext. They store it. They log metadata: who asked, when, from where, what model, how many tokens. The data persists on their infrastructure indefinitely unless you have negotiated specific terms.

OpenAI’s enterprise privacy page is 2,400 words long. That length tells you everything about the complexity of data handling on their side. Samsung employees leaked source code through ChatGPT. Amazon staff pasted confidential documents. The leaks have already happened.

With default settings, centralised providers know who you are, what you asked, and they store both indefinitely.

Level 1: Policy-based privacy. Venice (Private mode), Morpheus (current)

Most “private” AI platforms sit here, and this level deserves honest examination.

Venice in its default Private mode routes your prompt through a privacy proxy before it reaches a distributed GPU provider. The proxy strips your identity. The GPU provider doesn’t know who sent the prompt. Venice’s zero data retention (ZDR) policy means they claim not to log prompts, and the GPU providers are contractually required not to store them either.

Here is what that means in practice:

  • Your identity is hidden. The GPU provider doesn’t know who you are. This is real anonymisation.
  • Your prompt content is visible. The GPU provider must see the plaintext prompt to run inference on it. There is no encryption during processing.
  • No data is stored, by policy. Venice doesn’t log prompts. GPU providers are told not to. But there is no cryptographic or hardware mechanism that prevents a provider from reading or copying your prompt during the milliseconds it sits in their GPU memory.

To use the courier analogy: it is like sending a confidential document through a courier service that strips your return address from the envelope. The recipient does not know who sent it. But they open the envelope to read the document inside. They promise not to photocopy it. You trust them on that.
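To make the trade-off concrete, here is a minimal sketch of what a Level 1 privacy proxy does, under stated assumptions: the header names and upstream URL are illustrative, not Venice’s actual implementation. Identity metadata is stripped, but the prompt itself is forwarded in plaintext.

```python
# Minimal sketch of a Level 1 privacy proxy. Header names and the
# upstream URL are illustrative, not Venice's actual implementation.
import json
import urllib.request

# Metadata that could identify the sender is dropped before forwarding.
IDENTIFYING_HEADERS = {"authorization", "cookie", "x-forwarded-for", "user-agent"}

def forward_anonymised(prompt: str, headers: dict[str, str]) -> bytes:
    clean = {k: v for k, v in headers.items()
             if k.lower() not in IDENTIFYING_HEADERS}
    clean["Content-Type"] = "application/json"
    req = urllib.request.Request(
        "https://gpu-provider.example/v1/infer",       # hypothetical upstream
        data=json.dumps({"prompt": prompt}).encode(),  # still plaintext!
        headers=clean,
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # nothing persisted here, by policy only
```

The whole of Level 1 is in those two comments: the identifying headers are gone, but the prompt crosses the wire and sits in GPU memory unencrypted.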

Note: Venice’s Private mode is Level 1. In March 2026, Venice also launched TEE and E2EE modes that sit at Level 2 (covered below). Private mode remains the default for free and most Pro users.

Venice privacy flow (Level 1)

📝 Your prompt → 🔀 Venice proxy (identity stripped) → GPU provider (sees prompt, not identity) → Response (data purged)

This is dramatically better than Level 0. There is no persistent data. There is no identity link. There are no logs to subpoena. For the vast majority of AI use cases (writing, coding, analysis, creative work) Level 1 is sufficient. You do not need military-grade encryption to ask an AI to help with your marketing copy.

But if you are pasting your company’s cap table, your patient’s medical records, or your client’s legal strategy into a Level 1 system, you need to understand that a GPU provider could theoretically read that content during inference. They do not know it is yours. They claim not to store it. But the content itself passes through their hardware in plaintext.

Morpheus operates at Level 1. P2P encrypted routing protects data in transit, but the provider sees the prompt during inference. The v6.0.0 release (19 March 2026) added TEE attestation for the proxy-router (the routing layer), but the GPU executing inference still sees prompts in plaintext. This is an infrastructure trust improvement (you can verify the routing node is genuine and not logging), not a privacy upgrade (the provider can still read your prompt during processing). See “What is coming” below for details.

Level 2: Hardware-enforced privacy. Venice (TEE/E2EE), Phala, NVIDIA Confidential Compute

At Level 2, the privacy guarantee moves from policy to physics.

Trusted Execution Environments (TEEs) are hardware enclaves built into modern processors. When your prompt enters a TEE, it is encrypted in memory. The GPU processes your data inside the enclave, but even the server operator (the person who owns the physical hardware) cannot read the contents. The hardware enforces this isolation. It is not a policy choice. It is a design constraint of the silicon.

Venice TEE and E2EE modes

Venice’s TEE and E2EE modes (launched March 2026) made it the first independent consumer AI product to offer Level 2 privacy with per-response attestation. Apple’s Private Cloud Compute shipped hardware-enforced AI privacy first, in June 2024 on Apple Silicon, but it is bound to Apple’s vertical stack and the verification tooling is Apple-provided. Venice runs in any browser, the inference is operated by independent providers, and each response carries an attestation report the user can verify themselves. In TEE mode, inference runs inside hardware enclaves operated by NEAR AI Cloud and Phala Network. In E2EE mode, your prompt is encrypted on your device before it leaves your browser, stays encrypted through Venice’s proxy, and is only decrypted inside the verified enclave. Neither Venice nor the GPU provider can see your data. Each response includes a verification icon linking to a full attestation report you can independently check. E2EE disables web search, memory, file uploads, and function calling (those features would require decrypting outside the enclave), but for sensitive workloads, the trade-off is worth it. Both modes are Pro features ($18/month).

Both modes are also available via the API, not just the web chat. Developers select models by prefix (tee- or e2ee-), and E2EE requires client-side implementation of ECDH key exchange (secp256k1) with AES-256-GCM encryption. Third-party integrations can implement verifiable E2EE programmatically, though they handle the cryptographic protocol themselves.
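For a sense of what that client-side work involves, here is a hedged sketch of the handshake described above (ECDH on secp256k1, then AES-256-GCM) using Python’s cryptography library. The key-derivation step and the payload shape are assumptions, not Venice’s published specification.

```python
# Sketch of client-side E2EE as described above: ECDH on secp256k1 to
# agree a key with the enclave, then AES-256-GCM to encrypt the prompt.
# The HKDF step and payload shape are assumptions, not Venice's spec.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def encrypt_prompt(prompt: str, enclave_pub: ec.EllipticCurvePublicKey):
    client_key = ec.generate_private_key(ec.SECP256K1())  # ephemeral key pair
    shared = client_key.exchange(ec.ECDH(), enclave_pub)  # only the enclave can derive this too
    aes_key = HKDF(algorithm=hashes.SHA256(), length=32,
                   salt=None, info=b"e2ee-inference").derive(shared)
    nonce = os.urandom(12)  # 96-bit GCM nonce, unique per message
    ciphertext = AESGCM(aes_key).encrypt(nonce, prompt.encode(), None)
    # Ship ciphertext + nonce + our public key; Venice's proxy sees only bytes.
    return ciphertext, nonce, client_key.public_key()
```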

Phala Network and independent TEE infrastructure

Phala Network provides some of Venice’s TEE infrastructure and also operates independently, running GPU TEE inference on NVIDIA H100, H200, and B200 hardware with SOC 2 Type I and HIPAA compliance certifications. Their open-source dstack SDK converts standard containers into confidential VMs. The technology produces attestation reports: cryptographic proofs that the enclave is genuine and has not been tampered with.

What this means for your prompt:

  • The GPU provider cannot read your prompt. The hardware prevents it, even though the data is on their machine.
  • Attestation proves the enclave is genuine. You can verify the security properties before sending sensitive data.
  • Performance overhead is minimal: roughly a 2-5% throughput penalty. Near-native speed for private inference.

TEE privacy flow (Level 2)

📝 Your prompt → 🔒 TEE enclave (hardware-encrypted) → GPU processes (cannot read contents) → Response (attestation verified)

Tradeoffs and what TEE actually attests

The tradeoff: you are trusting the hardware vendor (Intel, AMD, NVIDIA) to have implemented the enclave correctly. This trust is not absolute. Intel SGX was compromised in 2022, and Phala had to migrate from SGX to TDX in response. The attack surface is dramatically smaller than Level 1 (you need a hardware zero-day rather than a rogue GPU operator) but it exists.

TEE-based privacy is not mathematically proven. It is hardware-attested. The distinction matters for the most sensitive use cases, but for enterprise AI inference, it represents a genuine step change from “trust our policy” to “verify our attestation.”

Practical guidance: Venice’s TEE/E2EE modes mean healthcare providers, legal professionals, and companies handling proprietary data now have a hardware-attested option within a consumer product, rather than needing to run Phala directly or go fully local. The caveats: E2EE is Pro-only ($18/month), limited to a subset of models, and disables web search and memory. For the highest-sensitivity use cases, Level 4 (local) remains the gold standard.

Level 3: Cryptographic privacy. FHE, MPC, ZK

Level 3 is the mathematical guarantee. No trust in hardware vendors, no trust in policies, nothing except the mathematics of cryptography.

Fully Homomorphic Encryption (FHE)

Fully Homomorphic Encryption (FHE) allows computation on encrypted data. Your prompt stays encrypted from the moment it leaves your machine through the entire inference process. The GPU computes the answer without ever decrypting the question. The provider literally cannot see your data. It is mathematically impossible without your decryption key.

FHE flow (Level 3)

🔐 Your device (encrypt prompt locally) → ⚙️ GPU provider (computes on ciphertext, cannot decrypt) → Your device (decrypts result with private key)
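To see the shape of the idea without the lattice machinery, here is a toy additively homomorphic scheme: the server adds two ciphertexts it cannot decrypt. This is a one-time-pad construction with none of FHE’s security properties or generality, purely an illustration.

```python
# Toy additively homomorphic encryption. NOT real FHE: addition only,
# one-time keys, no security margin. Production FHE uses lattice
# schemes (TFHE, CKKS); this just shows "compute on ciphertext".
import secrets

N = 2**64  # modulus shared by plaintexts and ciphertexts

def encrypt(m: int, key: int) -> int:
    return (m + key) % N

def decrypt(c: int, key: int) -> int:
    return (c - key) % N

# Client: encrypt two values with fresh random keys.
k1, k2 = secrets.randbelow(N), secrets.randbelow(N)
c1, c2 = encrypt(20, k1), encrypt(22, k2)

# Server: adds the ciphertexts without ever seeing 20 or 22.
c_sum = (c1 + c2) % N

# Client: decrypts the result with the combined key.
assert decrypt(c_sum, (k1 + k2) % N) == 42
```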

The problem: 100-1000x computational overhead. A prompt that takes 2 seconds at Level 0 could take minutes to hours under FHE. This is improving rapidly. Zama and others have made dramatic advances. But FHE is not viable for real-time LLM inference today.

Multi-Party Computation (MPC)

Multi-Party Computation (MPC) distributes your prompt across multiple parties, none of whom see the complete data. Each party processes a fragment and they jointly compute the result. No single party can reconstruct your original prompt.

MPC flow (Level 3)

✂️ Your device (split prompt into N shares) → 👥 Parties A, B, C… (each processes a share, none sees the whole) → Combine (result reconstructed from shares)
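The core primitive is easy to show. Below is toy additive secret sharing: a secret split into three shares, where any two shares reveal nothing, yet parties can add shared values without reconstructing them. Real MPC frameworks add multiplication, malicious security, and far more.

```python
# Toy additive secret sharing, the core primitive behind MPC.
# Any n-1 shares are uniformly random; only all n reconstruct the secret.
import secrets

P = 2**61 - 1  # prime modulus for share arithmetic

def share(secret: int, n: int = 3) -> list[int]:
    shares = [secrets.randbelow(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)  # last share completes the sum
    return shares

def reconstruct(shares: list[int]) -> int:
    return sum(shares) % P

x_shares, y_shares = share(100), share(23)
# Each party adds its own shares locally; no party ever sees 100 or 23.
z_shares = [(a + b) % P for a, b in zip(x_shares, y_shares)]
assert reconstruct(z_shares) == 123
```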

Arcium on Solana combines MPC with FHE and zero-knowledge proofs for parallelised confidential computingConfidential ComputeHardware-enforced computation where data and code are encrypted in memory and only the authorised application can access them. The machine's operator cannot read what the application is doing even though they own the machine.Like renting space in a bank vault. The bank owns the building and runs the security, but what you put in the vault is invisible even to the bank staff. Only you have the key.Read more →. Their acquisition of Inpher (a Web2 confidential MLMLMachine Learning. The branch of AI where systems learn patterns from data instead of being explicitly programmed with rules. Modern AI (LLMs, image generation, recommendation systems) is almost entirely machine learning.Like teaching a child to recognise dogs by showing them thousands of pictures of dogs, instead of writing down a precise rulebook for what makes a dog. The child learns the pattern from examples rather than from instructions.Read more → company) brought the Manticore protocol, an ML-optimised MPC supporting encrypted model trainingTrainingThe one-time process of teaching a neural network to perform a task by showing it massive amounts of example data and adjusting its internal weights until the outputs are good. Training builds the model; inference uses it.Like the years an apprentice spends learning a trade. You don't see any of the actual work, just thousands of repeated mistakes gradually becoming competence. By the end, the apprentice can do the job. The training was invisible, but the skill is now permanent.Read more →, inference, XGBoost, clustering, and federated learning. This is not theoretical AI privacy; it is acquired, tested technology from the enterprise ML world being ported to a decentralised network. Arcium is pre-mainnet with no live tokenTokenA digital unit of value or access rights tracked on a blockchain. Tokens can represent ownership in a project, a right to use a service, a share of future revenue, or simply a tradable asset with no underlying claim.Like a physical poker chip a casino issues. The chip itself has no value. What makes it worth something is what it lets you do at the casino, what the casino has promised, and how much other people will pay you for it.Read more →, but the technical depthLiquidityHow easily a token can be bought or sold without moving the price. High liquidity means you can enter or exit large positions quickly at the quoted price. Low liquidity means even small trades can swing the market.Like the difference between selling a house and selling a share of Apple stock. The house might be worth more on paper, but finding a buyer at that price takes weeks. The Apple share converts to cash in one click.Read more → (PhDs from EPFL, Sorbonne, ETH Zurich on the team, NVIDIA Inception membership) suggests this is a serious effort.

The problem for AI inference: coordination overhead between parties adds 10-100x latency compared to plaintext execution. Not yet practical for interactive LLM inference, but potentially viable for training workloads and batch processing where latency matters less. As hardware acceleration improves (Arcium’s NVIDIA partnership targets this), the gap will narrow.

Zero-Knowledge Proofs (ZKPs)

Zero-Knowledge Proofs can prove that a computation was performed correctly without revealing the inputs. You could prove “this model produced this output from a valid input” without revealing what the input was.

ZK proof flow (Level 3)

🔢 Your device (run computation on private input) → 📜 Generate proof (cryptographic certificate of correctness) → 🔍 Verifier (checks proof, learns nothing about input)

ZK is excellent for verification but does not inherently protect data during computation. For AI inference privacy, ZK works best in combination with TEE or MPC. The other technique keeps your data hidden during processing, and ZK proves the result is legitimate afterwards.
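A minimal runnable example of the “prove without revealing” property is the classic Schnorr identification protocol: prove you know x such that h = g^x mod p without disclosing x. Toy parameters below; the SNARK systems used to verify ML inference are far more involved.

```python
# Toy Schnorr proof of knowledge: prove you know x with h = g^x mod p,
# revealing nothing about x. Toy parameters; not production-grade ZK.
import secrets

p = 2**127 - 1  # Mersenne prime, fine for a demo
g = 3

x = secrets.randbelow(p - 1)  # prover's secret
h = pow(g, x, p)              # public commitment to the secret

# Prover commits to a random nonce.
r = secrets.randbelow(p - 1)
a = pow(g, r, p)
# Verifier issues a random challenge.
c = secrets.randbelow(p - 1)
# Prover responds; s alone leaks nothing about x without r.
s = (r + c * x) % (p - 1)
# Verifier checks the algebra, learning only that the prover knows x.
assert pow(g, s, p) == (a * pow(h, c, p)) % p
```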

Level 4: Local inference. Your hardware, your rules

The most private inference is inference that never touches a network. Running an open-weight model on your own hardware means no provider, no policy, no attestation, no cryptography. Just your prompt on your machine.

The tradeoffs are real: you are limited to open-weight models (no Claude, no GPT-5), you need capable hardware ($3,000-6,000 for a Mac Studio that runs 70B models), and performance depends on your local specs. But for genuinely sensitive work (legal documents, medical analysis, proprietary code) local inference is the gold standard of privacy.
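This is what Level 4 looks like in practice, assuming an Ollama server running locally with a model already pulled (`ollama pull llama3`). The request never leaves localhost.

```python
# Level 4 in practice: a prompt that never leaves your machine.
# Assumes a local Ollama server (default port 11434) with a model
# pulled, e.g. `ollama pull llama3`.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3",
        "prompt": "Summarise this confidential memo: ...",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```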

Where each project sits


| Project | Privacy level | Privacy type | What the provider sees |
| --- | --- | --- | --- |
| OpenAI / Anthropic | L0 | None | Everything: identity + prompt + stored |
| Venice (Private) | L1 | Policy (ZDR) | Prompt content, not identity. Not stored, by policy. |
| Venice (TEE/E2EE) | L2 | TEE + E2EE (hardware) | Nothing. Prompt encrypted on device, decrypted only inside verified enclave. |
| Morpheus (v6.0.0) | L1 | P2P + proxy-router TEE (mainnet) | Prompt visible to GPU during inference. Proxy-router attested, not logging. GPU attestation next. |
| Phala Network | L2 | TEE (hardware) | Nothing. Hardware enclave prevents access. |
| Oasis (ROFL) | L2 | TEE (hardware) | Nothing during computation. |
| Arcium | L3 | MPC + FHE + ZKP | Fragments only. No party sees complete data. |
| Nillion | L3 | MPC + secret sharing | Fragments only. |
| Local (Ollama, llama.cpp) | L4 | Physical isolation | No provider exists. |

What is coming

Three developments are worth watching:

Morpheus v6.0.0: proxy-router TEE attestation. On 19 March 2026, Morpheus shipped v6.0.0 with TEE attestation for the proxy-router (the routing layer between consumer and provider). Users running the node software locally can cryptographically verify that a provider’s routing node is genuine, unmodified, and running inside an Intel TDX hardware enclave with chat logging disabled. This is not yet available through the api.mor.org hosted gateway, which is how most consumers access Morpheus.

v6.0.0: proxy-router TEE attestation shipped (Intel TDX via SecretVM). API gateway access pending.

What shipped in v6.0.0:

  • TEE-hardened Docker images with Cosign keyless signing and SBOM generation
  • Intel TDX RTMR3 measurement computed in CI/CD and published in signed attestation manifests
  • Consumer-side attestation verification before session creation (fails hard if unverified)
  • TLS certificate fingerprint bound to attestation report, preventing quote replay attacks
  • Chat context storage provably disabled (non-overridable at runtime)
  • Only 5 runtime secrets injectable; blockchain configuration frozen at build time

The trust model is well designed: consumers fetch the expected RTMR3 hash from the signed manifest (published in the container registry), then verify it against the actual RTMR3 from the hardware attestation. The provider never self-reports its own measurements. Anti-spoofing via TLS fingerprint binding means an attacker can’t replay a legitimate attestation quote from another provider.
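In pseudocode terms, the consumer-side check looks something like the sketch below. Field names, the manifest shape, and the function signature are ours for illustration, not Morpheus’s actual data structures.

```python
# Sketch of the consumer-side verification described above. Field and
# parameter names are illustrative, not Morpheus's actual structures.
import hashlib

def verify_provider(manifest: dict, quote: dict, tls_cert_der: bytes) -> bool:
    # 1. The expected measurement comes from the signed manifest in the
    #    container registry, never from the provider itself.
    if quote["rtmr3"] != manifest["rtmr3"]:
        return False  # enclave is not running the published build
    # 2. Bind the attestation to this TLS session: the quote must commit
    #    to this certificate's fingerprint, so a quote replayed from
    #    another provider fails.
    fingerprint = hashlib.sha256(tls_cert_der).hexdigest()
    return quote["tls_fingerprint"] == fingerprint

# The consumer refuses to create a session when this returns False.
```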

What this does and doesn’t provide. The proxy-router is attested, but the GPU model server is not. Your prompt’s journey: encrypted in transit (P2P) to the provider’s attested proxy-router (verified genuine, not logging), then forwarded in plaintext to the GPU for inference. The GPU operator can still theoretically read your prompt during processing. This is an infrastructure trust improvement (is this node legitimate?), not a data privacy guarantee (can this node see my data?). Morpheus remains at Level 1 for privacy until GPU attestation ships.

A structural point: Morpheus is a marketplace, not an inference provider. It doesn’t run GPUs. It connects consumers to providers who do. So Morpheus itself can’t “be” Level 2 because it doesn’t control the inference layer. What it can do is verify and enable higher privacy levels offered by providers. If a provider runs GPU TEE hardware, Morpheus can attest it. If a provider offers Venice E2EE models, Morpheus can route to them. The privacy level depends on the provider. Morpheus’s contribution is the decentralised marketplace and the ability to verify what providers claim.

What’s next: Phase 2 (GPU attestation) would let Morpheus verify that a provider’s LLM execution runs inside a TEE, closing the last plaintext gap. Native RA-TLS would embed attestation in the TLS handshake. TEE support through the api.mor.org gateway (how most consumers access Morpheus) is also pending. When these ship, Morpheus can connect consumers to verified Level 2 providers via a standard API.

Dolphin’s hybrid approach. Dolphin, the creator of Venice’s uncensored models, is researching a hybrid architecture combining TEE with “sharded inference”, an MPC-like technique where the prompt is split across multiple GPU providers. No single provider sees the full prompt, and each provider runs inside a TEE. This would sit between Level 2 and Level 3 on the spectrum.

FHE acceleration. Zama, Fhenix, and dedicated FHE hardware projects are driving down the overhead from 1000x toward 10-100x. When FHE reaches single-digit overhead for transformer inference (plausible within 2-3 years) it delivers mathematically provable privacy at near-native speed. That changes the entire landscape.

A practical decision framework

The question is not “which level is best?” The question is “what level does my use case require?”

Level 1 is sufficient when:

  • Your prompts are general knowledge work (writing, coding, analysis)
  • The prompt content would not cause harm if exposed without your identity attached
  • You want meaningful privacy improvement over centralised providers at zero performance cost
  • Your threat model is “I don’t want a company logging and training on my data”

Level 2 is required when:

  • Your prompts contain personally identifiable information about others (healthcare, legal)
  • Regulatory compliance requires verifiable data protection (HIPAA, GDPR data processing)
  • Your threat model includes a compromised infrastructure provider
  • You need to prove to an auditor that data was protected during processing

Level 4 is required when:

  • Maximum assurance is non-negotiable (state secrets, extreme IP sensitivity)
  • You cannot trust any external infrastructure, even hardware-attested enclaves
  • You have the hardware and are willing to accept the model and performance limitations
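Condensed into code, the triage reads something like this; the thresholds are the article’s, the encoding is a sketch.

```python
# The decision framework above as a triage sketch. Inputs map to the
# bullet points; the returned level is the minimum the use case needs.
def required_privacy_level(
    handles_others_pii: bool = False,   # healthcare, legal
    regulated: bool = False,            # HIPAA, GDPR processing, privilege
    distrust_provider: bool = False,    # threat model includes the operator
    maximum_assurance: bool = False,    # state secrets, extreme IP sensitivity
) -> int:
    if maximum_assurance:
        return 4  # local inference only
    if handles_others_pii or regulated or distrust_provider:
        return 2  # hardware-attested TEE / E2EE
    return 1      # policy-based privacy is sufficient
```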

Most people overestimate the privacy level they need and underestimate the friction of higher levels. If you are not processing data that would trigger HIPAA, securities regulations, or legal privilege, Level 1 is likely sufficient and Level 0 (sending everything to OpenAI) is the real problem to solve.

The infrastructure battle

Three forces are converging, and they are not about technology preference.

Regulatory pressure is accelerating. GDPR, the EU AI Act, HIPAA, CCPA, and emerging state-level US privacy laws are all converging on the same requirement: demonstrate that data is protected during AI processing. “We have a policy” is not going to satisfy auditors for long.

Enterprise adoption requires it. Every corporate legal team restricting ChatGPT usage is a potential customer for private inference. Samsung banned ChatGPT after the source code leak. Banks restrict AI tool usage. Healthcare providers cannot send patient data to OpenAI. The demand exists. The infrastructure to serve it at scale does not. Not yet.

The cost structure favours it. Venice already demonstrates that private inference can be cheaper than centralised APIs. Decentralised compute (Akash claims 50-85% below AWS) drives down the cost base. Privacy does not have to be a premium feature. It can be the default, at a discount, because you are not subsidising OpenAI’s infrastructure margin and data extraction business.

The projects that win the infrastructure battle will be the ones that push up the privacy level while keeping the cost and performance competitive. Venice proved that Level 1 could attract 1.3 million users. Its March 2026 TEE/E2EE launch, powered by NEAR and Phala infrastructure, shows how a consumer platform can offer Level 2 without forcing users to understand the underlying cryptography. Phala is proving that TEE infrastructure is production-viable. The gap between “works in a paper” and “works at scale” for Levels 3 and beyond is closing.

The question is not whether private AI inference becomes standard. It is which projects build the infrastructure that makes it invisible enough that users choose it without thinking about privacy at all. The same way HTTPS became the default for web browsing without anyone actively choosing it.
