
The harness is the product, not the model

Models are commodities. Claude, GPT, Gemini — the gap narrows every release. Everyone has the same APIs. Same pricing.

What determines whether an AI system works in production is what wraps around the model. What it sees. What tools it can call. What it can't do. How output gets validated. How state persists. That's the harness. After building these across legal, travel, enterprise docs, and government, the harness is the only thing that matters.

What a harness is

The orchestration layer between the LLM and the real world. Without it, chatbot. With it, system.

  • What the model sees. A compressed domain primer — entities, relationships, constraints — injected into every prompt. The model starts informed, not blind.
  • What tools it can call. A registered set of domain-specific functions. The model picks which. The harness constrains which exist.
  • What it cannot do. Write protection. Data-layer constraints that make it physically impossible to corrupt the source, leak across domains, or hallucinate outside the curated universe.
  • How output is structured. The model writes narrative. Structured data passes through untouched from the data layer. Facts are never rewritten.
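The four layers above can be sketched in a few dozen lines. This is a minimal, hypothetical skeleton, not any real framework; every name here (Harness, register_tool, the index set) is illustrative:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    primer: str                                                # compressed domain primer, injected into every prompt
    tools: dict[str, Callable] = field(default_factory=dict)   # the only functions the model may call
    index: set[str] = field(default_factory=set)               # curated universe: the read-only data layer

    def register_tool(self, name: str, fn: Callable) -> None:
        # the harness, not the model, decides which tools exist
        self.tools[name] = fn

    def build_prompt(self, query: str) -> str:
        # the model starts informed, not blind
        return f"{self.primer}\n\nQuery: {query}"

    def call_tool(self, name: str, *args):
        # write protection: an unregistered tool is physically uncallable
        if name not in self.tools:
            raise PermissionError(f"tool {name!r} is not registered")
        return self.tools[name](*args)

    def validate(self, entity: str) -> bool:
        # data-layer constraint: if it's not in the index, it doesn't exist
        return entity in self.index
```

The model picks which registered tool to call; the harness owns the registry, the primer, and the index.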

Deterministic first, LLM second

Use code for what can be computed. Use the LLM only for judgment.

Search is deterministic. Validation is deterministic. Access control is deterministic. The LLM decomposes queries, reranks results, synthesises narrative. It never touches the deterministic pipeline.

Not a preference. Economics. Deterministic layers cut per-query cost by 60-80% versus sending everything to the model.
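A toy sketch of the split, under invented data: search and access control are plain code, and only synthesis would go to a model (stubbed here as llm_synthesise):

```python
# Hypothetical curated index and access-control list.
INDEX = {"fare-rules": "Refunds within 24h.", "baggage": "One carry-on."}
ACL = {"agent": {"fare-rules", "baggage"}, "guest": {"baggage"}}

def search(query: str) -> list[str]:
    # deterministic: keyword match against the curated index, no model call
    words = query.lower().split()
    return [doc_id for doc_id in INDEX if any(w in doc_id for w in words)]

def authorise(role: str, doc_ids: list[str]) -> list[str]:
    # deterministic: access control never passes through the model
    return [d for d in doc_ids if d in ACL.get(role, set())]

def llm_synthesise(doc_ids: list[str]) -> str:
    # judgment only: the model turns retrieved facts into narrative (stub)
    return " ".join(INDEX[d] for d in doc_ids)

def answer(role: str, query: str) -> str:
    hits = authorise(role, search(query))
    return llm_synthesise(hits) if hits else "No matching documents."
```

Only the final step costs tokens; everything before it is cheap, testable code, which is where the per-query savings come from.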

The domain is the constraint, not the prompt

Prompt-based guardrails fail. "Don't recommend outside our catalogue." The model does it anyway three weeks in on an edge case.

Data-layer constraints can't fail. If it's not in the index, it doesn't exist. Every enterprise CISO asks the same question first: "How do you prevent hallucination?" The answer isn't a better prompt. It's that out-of-scope data physically cannot exist in the system.
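The difference in mechanism is small but decisive. A minimal sketch, assuming a hypothetical catalogue of item IDs: instead of asking the model to stay in bounds, the harness rejects any output that cites something outside the index:

```python
# Hypothetical curated catalogue; the only items that exist.
CATALOGUE = {"tour-001", "tour-002"}

def enforce_catalogue(cited_ids: set[str]) -> set[str]:
    # data-layer guardrail: a hallucinated ID cannot survive this filter,
    # regardless of what the prompt said or the model generated
    hallucinated = cited_ids - CATALOGUE
    if hallucinated:
        raise ValueError(f"uncatalogued items cited: {sorted(hallucinated)}")
    return cited_ids
```

The prompt version fails silently on an edge case; this version fails loudly, before the user sees anything.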

Why 80% of enterprise AI fails

72-80% of enterprise RAG fails in year one. 90% of agentic projects failed in production in 2024. Not bad models. Missing harness.

95% accuracy per layer, four layers deep, gives you 81% system reliability. Wrong roughly one time in five. In enterprise, unusable.
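The arithmetic: independent layers compound multiplicatively, so per-layer accuracy to the power of the layer count gives the system figure.

```python
# Reliability compounds across independent layers: accuracy ** layers.
per_layer = 0.95
layers = 4
system = per_layer ** layers
print(round(system, 4))  # 0.8145, i.e. roughly 81%
```

Adding a fifth 95% layer drops the system below 78%, which is why harnesses push as many layers as possible into deterministic code with effectively 100% accuracy.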

The biggest players already know this

Claude Code — same Claude model, wrapped in 19 permission-gated tools, memory consolidation, subagent spawning, context compaction. A background daemon wakes up after inactivity, reads the project's memory, consolidates learnings, rewrites the index. The model reasons. The harness governs.

Cursor — trained model-specific harnesses for every frontier model. Dropping reasoning traces from the tool-calling loop caused a 30% performance collapse. Each model needs its own tuned harness. The model is generic. The harness is specific.

Devin — cloud sandbox with terminal, editor, browser. Fork and rollback. Session persistence. Error recovery. They invented their own compute unit (ACU) to measure harness cost per task. The model call is one line. The harness is the product.

Anthropic's research on long-running agents: a two-agent architecture with state handoff artifacts. The key finding — agents need to understand state when starting with a fresh context window. Harness problem, not model problem.

Every one of these teams arrived here independently.

The consensus

2023 was RAG. 2024 was agents. 2025 was agent swarms. 2026 is harnesses. Models absorbed 80% of what frameworks used to provide. The remaining 20% — persistence, cost control, observability, error recovery — is exactly what the harness delivers.

I've been building these since before the industry had the word. The harness is the product. The model is a replaceable component inside it.