How to build an AI chief of staff on WhatsApp
Vladimir de Ziegler · June 25, 2026 · 9 min read
You already know what a chief of staff does, even if you've never had one.
They're the person who remembers that you decided against the new CRM in March, that the ops lead is drowning in proposal admin, that the SoftOne migration is blocked on an API nobody can find. You don't brief them from scratch every morning. They hold the context so you don't have to.
That holding-context part is the whole job. And it's the exact part every “AI assistant” drops on the floor.
I've watched people wire up a slick assistant, demo it, get a clean answer, and feel like they've hired someone. Then they come back the next day and it has forgotten the company exists. You re-explain your systems. You re-explain your team. You re-explain the thing you told it yesterday.
It's a very smart stranger you keep re-meeting.
So when I built one for real, the design started from memory. The intelligence was the easy half. Here's how it actually works under the hood, and what I'd tell you if you wanted to build your own.
A chief of staff is a memory job before it's a thinking job
The model was never the hard part. Claude is more than good enough to reason about an AI roadmap. The hard part is that a chat turn is stateless by default, and a chief of staff is the opposite of stateless.
So the first thing the system does on every single message is load a “company brain” into the prompt. Identity binds the WhatsApp number to a company. Facts live as structured data: the systems they run, their KPIs, rough headcount, the workflows we've already scoped, the business cases we've already built. On top of that sits a synthesized narrative, capped at 400 words, that reads like notes a good chief of staff would keep.
The system prompt has one rule in capital letters:
Never re-ask anything in the you-already-know block. Build on it.
That single constraint changes the texture of the whole thing. It stops being a smart stranger and starts being someone who was in the room last time.
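Here is a minimal sketch of that load step, under stated assumptions. The `CompanyBrain` fields, the `BRAINS_BY_NUMBER` lookup, and the `<you_already_know>` tag are hypothetical names chosen for illustration; only the 400-word cap and the wording of the rule come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

NARRATIVE_WORD_CAP = 400  # the cap stated in the article


@dataclass
class CompanyBrain:
    """Hypothetical shape for the per-company memory."""
    company: str
    systems: List[str] = field(default_factory=list)
    kpis: List[str] = field(default_factory=list)
    headcount: Optional[int] = None
    scoped_workflows: List[str] = field(default_factory=list)
    narrative: str = ""


# Identity: a WhatsApp number binds to one company (in-memory stand-in for a store).
BRAINS_BY_NUMBER: Dict[str, CompanyBrain] = {}


def cap_words(text: str, limit: int = NARRATIVE_WORD_CAP) -> str:
    """Keep at most `limit` whitespace-separated words."""
    return " ".join(text.split()[:limit])


def build_system_prompt(brain: CompanyBrain) -> str:
    """Assemble the rule, the structured facts, and the capped narrative."""
    headcount = brain.headcount if brain.headcount is not None else "unknown"
    facts = [
        f"Company: {brain.company}",
        f"Systems: {', '.join(brain.systems) or 'unknown'}",
        f"KPIs: {', '.join(brain.kpis) or 'unknown'}",
        f"Headcount: {headcount}",
        f"Scoped workflows: {', '.join(brain.scoped_workflows) or 'none yet'}",
    ]
    return "\n".join([
        "NEVER RE-ASK ANYTHING IN THE YOU-ALREADY-KNOW BLOCK. BUILD ON IT.",
        "",
        "<you_already_know>",
        *facts,
        "",
        cap_words(brain.narrative),
        "</you_already_know>",
    ])
```

The point of the shape is that the prompt is rebuilt from stored state on every message, so nothing depends on the previous chat turn still being in context.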
WhatsApp is where the work actually gets asked
I didn't put this on a dashboard because executives don't live in dashboards. They live in their thumbs.
The whole thing runs as an MCP server with 28 tools, so it's reachable from inside Claude, Cursor, ChatGPT, and VS Code. But the surface that matters is WhatsApp, bridged through Kapso. You text it the way you'd text a colleague between meetings.
One design detail matters more than it sounds: the webhook acknowledges the message immediately, returns a 200, and does the real work in the background. AI work is slow and bursty. If you make someone watch a typing indicator for forty seconds, they put the phone down and the habit never forms. Ack fast, think async, reply when you've got something worth saying.
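The ack-fast, think-async pattern can be sketched with nothing but the standard library. This is an illustration, not the Kapso integration: `handle_webhook`, `slow_think`, and the payload shape are hypothetical, and a real deployment would sit behind an HTTP framework and a durable queue.

```python
import queue
import threading
from typing import List

inbox: "queue.Queue[dict]" = queue.Queue()
replies: List[str] = []  # stand-in for "send a WhatsApp message"


def slow_think(message: dict) -> str:
    """Placeholder for the slow model and tool work."""
    return f"Reply to: {message['text']}"


def worker() -> None:
    """Drain the inbox in the background, one message at a time."""
    while True:
        message = inbox.get()
        try:
            replies.append(slow_think(message))
        finally:
            inbox.task_done()


def handle_webhook(payload: dict) -> int:
    """Enqueue the message and acknowledge immediately with a 200."""
    inbox.put(payload)
    return 200


threading.Thread(target=worker, daemon=True).start()
```

The handler never waits on the model, so the response time of the webhook is independent of how long the thinking takes.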
The chat loop runs Claude Sonnet 4.6, capped at eight tool rounds a turn, with web search available up to five times. That web search isn't decoration. I'll come back to why it's load-bearing.
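A rough sketch of how those two caps might be enforced in a single turn. The `model_step` and `run_tool` callables and the step dictionaries are hypothetical stand-ins, not any SDK's real types; only the limits of eight rounds and five searches come from the text.

```python
from typing import Callable, List

MAX_TOOL_ROUNDS = 8   # tool rounds per turn, per the article
MAX_WEB_SEARCHES = 5  # web searches per turn, per the article


def run_turn(
    model_step: Callable[[List[dict]], dict],
    run_tool: Callable[[dict], str],
    messages: List[dict],
) -> str:
    """Loop model -> tool -> model until a final answer or the round cap."""
    searches = 0
    for _ in range(MAX_TOOL_ROUNDS):
        step = model_step(messages)
        if step["type"] == "final":
            return step["text"]
        if step["tool"] == "web_search" and searches >= MAX_WEB_SEARCHES:
            # Budget spent: tell the model instead of running the search.
            result = "web_search budget exhausted for this turn"
        else:
            if step["tool"] == "web_search":
                searches += 1
            result = run_tool(step)
        messages.append({"role": "tool", "name": step["tool"], "content": result})
    # Assumption: when rounds run out, reply with a fallback rather than hang.
    return "I hit my tool limit for this turn. Here is what I have so far."
```

What happens when the caps are hit is not specified in the article; the fallback message here is one possible choice.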
The unit of work is one bottleneck, scoped to the end
This was the design decision I'd defend hardest.
The temptation with these tools is to let the conversation sprawl. The exec mentions five problems, the assistant nods at all five, and you end up with a shallow puddle of half-understood ideas. So the loop is deliberately strict: take one bottleneck, scope it to completion, bank it, then ask what's next.
Scoping one workflow means eliciting three things, in the user's own numbers, not invented ones:
- Today's cost. The role doing the work, the hours per task, the frequency. Or the revenue being lost while it sits in a queue.
- The upside. The deals that close, the conversion that lifts, the capacity that frees up.
- The single KPI that turns this into money, and what one unit of it is worth.
Then the tool does its work. It scores build complexity across seven dimensions and lands it in a band: Simple is two to four weeks, Medium five to nine, Hard nine to fifteen, Very hard fifteen and up, costed at a blended day rate of around $1,560. It pulls a labour cost from a salary benchmark for that role and region. It computes the growth value. And it banks the workflow into the company's facts, capped at thirty per company and de-duped by name, so the set stays clean as it grows.
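To make the banding and banking concrete, here is a small sketch. The band weeks, the day rate, the cap of thirty, and de-duping by name come from the paragraph above. Everything else is an assumption made for illustration: five billable days per week for one builder, case-insensitive name matching, newer entries replacing older ones, and raising an error once the cap is reached.

```python
from typing import Dict, List, Optional, Tuple

DAY_RATE_USD = 1560  # "around $1,560", per the article
MAX_WORKFLOWS = 30   # cap per company, per the article

# Band -> (min weeks, max weeks); None means open-ended ("fifteen and up").
BANDS: Dict[str, Tuple[int, Optional[int]]] = {
    "Simple": (2, 4),
    "Medium": (5, 9),
    "Hard": (9, 15),
    "Very hard": (15, None),
}


def build_cost_range(band: str, days_per_week: int = 5) -> Tuple[int, Optional[int]]:
    """Estimated (low, high) build cost; days_per_week is an assumption."""
    low_weeks, high_weeks = BANDS[band]
    low = low_weeks * days_per_week * DAY_RATE_USD
    high = None if high_weeks is None else high_weeks * days_per_week * DAY_RATE_USD
    return low, high


def bank_workflow(facts: List[dict], workflow: dict) -> List[dict]:
    """Return a new list with `workflow` banked, de-duped by name."""
    key = workflow["name"].strip().lower()
    kept = [w for w in facts if w["name"].strip().lower() != key]
    is_new = len(kept) == len(facts)
    if is_new and len(facts) >= MAX_WORKFLOWS:
        raise ValueError("workflow cap reached for this company")
    return kept + [workflow]
```

Under these assumptions a Simple build comes out at 15,600 to 31,200 dollars, which is the kind of figure that would be labelled an estimate in chat.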
Every number it gives you in chat is labelled an estimate. The line is always the same: this is an estimate, the final deck recomputes precisely. A chief of staff who oversells gets you in trouble in the boardroom, so this one is built to undersell on purpose.
The math is where these tools lie to you
Here's the part I got wrong first, and had to rebuild.
When you let a model add up the value of automation, it will happily count the same money twice. It saves the sales rep ten hours a week, values those hours, and then also credits the extra revenue from the deals the rep now has time to close. Both feel real. Together they're often the same freed-up hours wearing two hats.
The first version did exactly this, and the numbers came out gorgeous and dishonest. So now there are guards.
If the KPI is a time metric, like turnaround or response time, and someone also feeds in revenue-conversion inputs, the system rejects the revenue inputs. You don't get to bank saved hours and the income from those same hours. Labour-saved and revenue-upside only get summed when they're confirmed to be genuinely independent. Otherwise it books the larger of the two and moves on.
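That guard reduces to a few lines. A minimal sketch, with hypothetical function and parameter names; how independence gets confirmed is outside what this shows.

```python
def bankable_value(
    labour_saved: float,
    revenue_upside: float,
    kpi_is_time_metric: bool,
    confirmed_independent: bool = False,
) -> float:
    """Combine labour savings and revenue upside without double-counting."""
    if kpi_is_time_metric:
        # Time KPIs: revenue-conversion inputs are rejected outright.
        return labour_saved
    if confirmed_independent:
        # Only sum when the two streams are confirmed to be independent.
        return labour_saved + revenue_upside
    # Default: assume overlap and book the larger of the two.
    return max(labour_saved, revenue_upside)
```

For example, 20,000 in saved hours and 50,000 in extra revenue books as 50,000 by default, and as 70,000 only when the two are confirmed independent.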
There's a second guard on optimism. The KPI uplift comes from live research on comparable projects, but if that research comes back low-confidence, or claims an uplift above 200%, the system throws it out and substitutes a conservative floor. A model left unsupervised will find you a case study that says AI tripled someone's pipeline. A chief of staff you'd actually trust says “let's not plan on that.”
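The optimism guard can be sketched the same way. The 200% ceiling is from the text; the floor value of 10% and the string confidence labels are placeholders assumed here, since the article does not give them.

```python
MAX_CREDIBLE_UPLIFT = 2.0   # 200%, per the article
CONSERVATIVE_FLOOR = 0.10   # hypothetical value; the article does not state it


def guarded_uplift(researched_uplift: float, confidence: str) -> float:
    """Use the researched uplift only if it is confident and plausible."""
    if confidence == "low" or researched_uplift > MAX_CREDIBLE_UPLIFT:
        return CONSERVATIVE_FLOOR
    return researched_uplift
```

Uplifts are expressed as fractions here, so `2.0` means a 200% improvement in the KPI.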
The first version produced a beautiful number. The work was in the version you'd stake your reputation on.
And this is where that web search earns its place. When someone says “we run SoftOne” or “it's all in HubSpot,” the system searches whether that platform actually has a usable API before it assumes anything. No API means a browser-agent or a rebuild, which makes the work harder, which feeds straight into the complexity score. Feasibility gets checked first, then priced.
The close is a deck it builds while you put the phone down
Once you've scoped a few bottlenecks, you tell it to build the case. You don't re-enter anything.
It reads the banked set and hands off to a worker that does the slow, deterministic work. Per workflow, it re-scores complexity, re-benchmarks the KPI uplift, and computes the euro value properly this time. It derives the shared foundations the workflows sit on: the CRM, the ERP, the comms layer, the document store. It clusters workflows by the data they share. It buckets each one as a Quick Win, a High ROI play, or a Strategic Bet. Then it sequences them into a capacity-aware roadmap and renders a branded deck as HTML and PDF.
That job runs in the background and takes minutes. So it doesn't pretend to be done. It returns a job ID, says it's started, and when the worker finishes it pushes the deck link straight back to your WhatsApp. You asked a question between meetings and a costed roadmap shows up in your chat while you were in the next one.
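A sketch of the job handoff, assuming a thread pool as the worker. `build_deck`, `start_deck_job`, and the `example.invalid` link are hypothetical placeholders; the pattern shown is simply "return an ID now, push the result when the work finishes".

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List

executor = ThreadPoolExecutor(max_workers=2)


def build_deck(workflows: List[str]) -> str:
    """Placeholder for the slow deck build; returns a made-up link."""
    return f"https://example.invalid/decks/{len(workflows)}-workflows"


def start_deck_job(workflows: List[str], send_whatsapp: Callable[[str], None]) -> str:
    """Kick off the build, return a job ID immediately, notify on completion."""
    job_id = uuid.uuid4().hex

    def on_done(future: Future) -> None:
        error = future.exception()
        if error is not None:
            send_whatsapp(f"Job {job_id} failed: {error}")
        else:
            send_whatsapp(f"Job {job_id} is ready: {future.result()}")

    executor.submit(build_deck, workflows).add_done_callback(on_done)
    return job_id
```

The caller gets the job ID back before the deck exists, which is what lets the chat reply say the job has started rather than pretending it is finished.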
The synthesis that keeps the brain current runs on the cheap path on purpose. A smaller model, Haiku 4.5, folds the recent conversation into the narrative every few turns, and gets forced to run the moment something important happens: a company gets registered, a workflow gets scoped, a job gets kicked off. Memory updates on the events that matter, off the hot path, so the chat stays fast.
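The trigger logic is simple enough to show in full. The three forcing events are the ones named above; the event strings and the interval of four turns are assumptions, since the article only says "every few turns".

```python
from typing import Optional

# Events that force an immediate synthesis, per the article (names are hypothetical).
FORCE_EVENTS = {"company_registered", "workflow_scoped", "job_started"}
SYNTH_EVERY_N_TURNS = 4  # assumed value for "every few turns"


def should_synthesize(turns_since_last: int, event: Optional[str] = None) -> bool:
    """Refresh the narrative on a timer, or at once when something important happens."""
    return event in FORCE_EVENTS or turns_since_last >= SYNTH_EVERY_N_TURNS
```

When this returns true, the synthesis itself would be scheduled in the background with the smaller model, so the chat reply does not wait on it.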
What I'd tell you if you're building one
If you take anything from this, take the design rules, because they're where the months went.
- Start from memory. A chief of staff that forgets is a party trick. Persist identity, facts, and a synthesized narrative, and load it on every turn before the model does anything.
- Force synthesis on the events that matter. Don't just summarize on a timer. The moment a real decision lands, capture it, or the brain drifts out of date right when it counts.
- Refuse to double-count. The fastest way to lose a room is a number that falls apart under one question. Build the guards in, cap the optimism, and label everything you can't yet prove as an estimate.
- Make it async and make it fast. Ack the message, think in the background, reply when you have something. The habit dies in the typing indicator.
- Check feasibility before you price it. “It's in the ERP” is the start of a question, not the answer. Find out if there's an API, then score the build honestly.
None of this is exotic. A good chief of staff has always worked this way: holding context, doing the homework, refusing to flatter you with a number that won't survive the meeting.
The only new part is the thing waking up to do it over text.
If you want to see it work before you build your own, you can talk to mine. It's the AI Chief, it lives on WhatsApp, and it'll scope a workflow with you in the same loop I just described.
Ciao.