Models vs Harnesses: Why the System Around the LLM Matters More Than the Model
What is an AI harness and why does it matter?
A harness is the structure you build around a large language model so it produces reliable output: planning, context, evaluation criteria, and reference samples.
As one practitioner puts it, you build a harness for the LLM, and the great vibe coders spend more time on planning than on the coding itself.
Define what you want to build and its purpose.
Decide how you will evaluate it.
Gather good samples the model can borrow from.
Then attack the problem—those constraints shape what the LLM produces.
The same logic applies to teams: surround them with the right resources and monitor the feedback, or the output goes on a tangent.
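The planning steps above can be sketched in code. This is a minimal illustration, not a prescribed implementation: the `HarnessSpec` class, the `build_prompt` function, and the support-ticket example are all hypothetical names invented here to show how purpose, evaluation criteria, and reference samples might be fixed before any prompt is sent.

```python
from dataclasses import dataclass, field


@dataclass
class HarnessSpec:
    """Hypothetical container for the decisions made before prompting."""
    purpose: str                    # what is being built and why
    evaluation: list[str]           # how the output will be judged
    samples: list[str] = field(default_factory=list)  # examples to borrow from


def build_prompt(spec: HarnessSpec, task: str) -> str:
    """Assemble the plan into one prompt so the constraints travel with the task."""
    if not spec.purpose or not spec.evaluation:
        raise ValueError("define the purpose and the evaluation before prompting")
    lines = [f"Purpose: {spec.purpose}", "Evaluation criteria:"]
    lines += [f"- {criterion}" for criterion in spec.evaluation]
    if spec.samples:
        lines.append("Reference samples:")
        lines += [f"- {sample}" for sample in spec.samples]
    lines.append(f"Task: {task}")
    return "\n".join(lines)


# Illustrative values only; none of these come from a real project.
spec = HarnessSpec(
    purpose="Summarize support tickets for the weekly product review",
    evaluation=["Names the affected feature", "Stays under 50 words"],
    samples=["Billing export fails for annual plans; 12 tickets this week."],
)
prompt = build_prompt(spec, "Summarize the attached ticket")
print(prompt)
```

The one deliberate design choice is that `build_prompt` refuses to run without a purpose and at least one evaluation criterion, which mirrors the idea that planning comes before the model attacks the problem.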
Is the model or the system around it more important for product?
Both matter, but the leverage has shifted.
PMs working inside agentic workflows describe two forces changing their job: models reaching a level of intelligence that is good enough, and, more importantly, harnesses around those models that make them extremely useful.
The implication for product teams is that picking a slightly better model is not where the wins are.
The wins are in the system: dev setups, codebase access, evaluation loops, and the daily workflow PMs run on top of the model.
“the biggest part is the harness and the system around the models that are really making it extremely useful”
Why is Claude Code more than just a model?
Claude Code is partly the underlying model, but according to PMs using it daily, the biggest part is the harness and the system around the models.
That harness is what turns the model into a tool a PM can spend most of the day inside.
At Lemlist, PMs now have a dev setup identical to the engineering team and access to the entire codebase.
The most accessible use case is what one PM calls “chat with codebase”: asking questions of the repo to understand how a decade of software building has compounded.
None of that workflow is the model alone—it is the system wrapped around it.
“you build a harness for the LLM. Right? But I think what sometimes happens too is you don't build a harness for your company.”
How does evaluation fit into the harness?
Evaluation is not an afterthought—it is the first thing to design.
Before building, ask how the system will be evaluated at scale.
If today's evaluation depends on a single domain expert eyeballing answers, that does not scale: the problem becomes building the AI system and also replicating the person who evaluates it.
That means documenting the domain understanding the human reviewer applies, so it can be encoded into the harness.
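One way to picture encoding a reviewer's judgment is as a list of named checks that run over every output. The sketch below is an assumption-laden illustration: the three rules in `CHECKS` and the helper names `failed_checks` and `pass_rate` are invented here, and real domain rules would come from documenting what the human reviewer actually looks for.

```python
from typing import Callable

# Each check is one documented reviewer rule: a name plus a function that
# returns True when the output passes. These rules are illustrative only.
Check = tuple[str, Callable[[str], bool]]

CHECKS: list[Check] = [
    ("non_empty", lambda out: bool(out.strip())),
    ("under_50_words", lambda out: len(out.split()) <= 50),
    ("no_placeholder_text", lambda out: "TODO" not in out),
]


def failed_checks(output: str, checks: list[Check] = CHECKS) -> list[str]:
    """Return the names of the rules this output violates."""
    return [name for name, passes in checks if not passes(output)]


def pass_rate(outputs: list[str], checks: list[Check] = CHECKS) -> float:
    """Fraction of outputs that pass every rule."""
    if not outputs:
        raise ValueError("need at least one output to evaluate")
    passed = sum(1 for out in outputs if not failed_checks(out, checks))
    return passed / len(outputs)


# Made-up outputs standing in for model responses.
outputs = ["Billing export fails for annual plans.", "", "TODO: write summary"]
for out in outputs:
    print(repr(out), failed_checks(out))
print(pass_rate(outputs))
```

Because each rule has a name, a failing output reports which part of the reviewer's logic it broke, and the same checks can run on thousands of outputs without the expert in the loop.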
“A lot of the great vibe coders that I know is they'll spend so much more time on planning than actually vibe coding.”
Does every business need the same harness?
The harness you need depends on how much stochastic behaviour your business can absorb.
A model that is right 86% of the time may be fine for some workflows, but for hardware, financial modeling, accounting, or taxes, the tolerance is closer to 99.9%—and the 0.1% has to be caught before it reaches the customer.
Two responses follow:
- Build a tighter harness that reduces the likelihood of AI failure.
- Reshape the organization to be stochastic-first, factoring unpredictability into operations.
Choosing between them is a business decision, not just a model decision.
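The gap between 86% and 99.9% is easier to see with a little arithmetic. The sketch below rests on a simplifying assumption that is not from the source: every error the harness catches is corrected or withheld, so only uncaught errors reach the customer. The function names and the 10,000-output volume are hypothetical.

```python
def expected_errors(accuracy: float, n_outputs: int) -> float:
    """Expected number of wrong outputs among n_outputs."""
    return (1.0 - accuracy) * n_outputs


def required_catch_rate(model_accuracy: float, target_accuracy: float) -> float:
    """Share of model errors a review step must catch to hit the target.

    Assumes a caught error is corrected or withheld, so:
        delivered_error = (1 - model_accuracy) * (1 - catch_rate)
    """
    model_error = 1.0 - model_accuracy
    target_error = 1.0 - target_accuracy
    if model_error <= target_error:
        return 0.0  # the model already meets the target on its own
    return 1.0 - target_error / model_error


print(expected_errors(0.86, 10_000))     # roughly 1,400 wrong outputs
print(expected_errors(0.999, 10_000))    # roughly 10 wrong outputs
print(required_catch_rate(0.86, 0.999))  # roughly 0.993
```

Under these assumptions, a model that is right 86% of the time needs a harness that catches about 99.3% of its mistakes before delivered accuracy reaches 99.9%, which is why the tolerance level drives how tight the harness has to be.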
Frequently asked questions.
- What is an AI harness in simple terms?
- An AI harness is the structure built around a large language model so it produces useful, reliable work. It includes planning (what you want to build and why), evaluation criteria, good sample inputs to borrow from, and the surrounding tools and context. Practitioners describe it as constraining what the LLM produces—you decide the purpose, the evaluation method, and the reference material before letting the model attack the problem.
- Why is the harness more important than the model right now?
- Because frontier models have reached a level of intelligence that is good enough for most product work. The differentiator has moved to the systems around those models. PMs building with Claude Code report that while part of its value is the model itself, the biggest part is the harness and the system around the model that makes it extremely useful in daily product work.
- Is Claude Code just a model or something more?
- Claude Code is partly the underlying model, but the bigger value comes from the harness and system wrapped around it. PMs at Lemlist now spend most of their day inside Claude Code, chatting with the agent across use cases—starting with chat with codebase, where they query the entire repository to understand a decade of compounded software decisions. That workflow is enabled by the harness, not the raw model.
- How do evaluations fit into building an AI harness?
- Evaluation should be defined before building. The first question to answer is how you will evaluate the system at scale. If evaluation today relies on one domain expert reviewing outputs by hand, that does not scale—the project becomes building both the AI system and a replica of that expert. Documenting the reviewer's domain logic is how evaluation gets encoded into the harness.
- Does every company need the same kind of harness?
- The harness depends on tolerance for stochastic output. For hardware companies or businesses making financial, accounting, or tax decisions, accuracy needs to approach 99.9%, with the remaining 0.1% caught before reaching the customer. Other businesses can adopt a stochastic-first posture and shape operations around the fact that AI is unpredictable. The harness either tightens to reduce failure or the organization adapts around it.
- Do teams need a harness too, not just the model?
- The same logic that applies to constraining an LLM applies to constraining a team: surround people with the right resources and information so they don't go on a tangent, then monitor success through traction or product feedback. Companies that build a harness for the LLM but not for the company end up with the same drift problem at the organizational level.
