On this page
- TL;DR
- The starting point: the wrapper
- Stage 1: prompts get complicated
- Stage 2: the model needs context
- Stage 3: the model needs to do things
- Stage 4: one model call isn't enough
- Stage 5: the model is wrong sometimes
- Stage 6: the agent has to be reliable
- Stage 7: you can't debug what you can't see
- Stage 8: cost gets real
- Stage 9: external context (RAG, MCP, the real world)
- Stage 10: governance
- The components inventory
- What you don't need (yet)
- A practical 2026 stack for building this
- The honest closing
TL;DR
- Every "AI app" starts as a thin wrapper over a chat completions API. None stay there.
- The journey from wrapper to real agent has predictable stages: prompt → memory → tools → orchestration → evaluation → durability.
- At each stage, you'll add components you initially thought you didn't need. They are not optional in production.
- Key components every serious agent ends up with: a prompt registry, a tool registry, a memory store, a trace logger, an eval harness, and a job queue.
- If you're building an agent in 2026 and you don't have all six, you have an unscalable demo.
The starting point: the wrapper
The first version of every AI app looks like this:
const response = await openai.chat.completions.create({
  model: "gpt-5",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: userInput },
  ],
});
return response.choices[0].message.content;
That's it. That's the whole app. It works for demos. It fails in production within a week.
The journey of "from wrapper to agent" is the journey of finding out, in order, every reason this isn't enough.
Stage 1: prompts get complicated
The first thing that happens: your system prompt grows. Then it has variables. Then it has conditional logic. Then it has different versions for different user types. Then someone changes it without telling you and quality regresses overnight.
What you build: a prompt registry.
A simple version:
- Prompts live in version-controlled files, not in inline string literals.
- Each prompt has an ID, a version number, and a template.
- A small renderer fills variables and returns the final string.
- Changes go through code review.
A more serious version:
- Prompts can be A/B tested across users.
- Each prompt version is logged with the model output for analysis later.
- Non-engineers can propose changes via a dashboard, but production rollout is gated.
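The simple version above can be sketched in a few lines. This is a minimal illustration, not a real library; all names (`registerPrompt`, `renderPrompt`) are made up for the example:

```typescript
// Minimal prompt registry sketch. Prompts have an ID, a version, and a
// template; a small renderer fills {{variable}} placeholders.
type PromptDef = { id: string; version: number; template: string };

const registry = new Map<string, PromptDef>();

function registerPrompt(def: PromptDef) {
  registry.set(`${def.id}@v${def.version}`, def);
}

function renderPrompt(id: string, version: number, vars: Record<string, string>): string {
  const def = registry.get(`${id}@v${version}`);
  if (!def) throw new Error(`Unknown prompt ${id}@v${version}`);
  return def.template.replace(/\{\{(\w+)\}\}/g, (_, name) => {
    if (!(name in vars)) throw new Error(`Missing variable: ${name}`);
    return vars[name];
  });
}

registerPrompt({
  id: "support-agent",
  version: 2,
  template: "You are a support agent for {{product}}. The user is on the {{plan}} plan.",
});

const rendered = renderPrompt("support-agent", 2, { product: "NovaKit", plan: "Pro" });
```

Because prompts are plain data with explicit versions, logging "which prompt version produced this output" becomes a one-liner, which is exactly what you need for the on-call incident described above.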
This sounds like overkill until your second on-call incident is "the support agent started giving rude responses and we have no idea when it changed."
Stage 2: the model needs context
You ship the wrapper. Users ask follow-up questions. The model has no memory of the previous turn. You bolt on conversation history.
messages: [
  systemPrompt,
  ...previousMessages,
  { role: "user", content: userInput },
]
This works until conversations get long. Then they exceed the context window. Or they cost too much. Or earlier turns confuse the model.
What you build next: a memory store.
This typically has three layers:
- Short-term memory: the recent N turns, kept verbatim.
- Working memory: key facts extracted from the conversation, stored as structured data ("user is on the Pro plan, prefers metric units, mentioned a deadline of April 30").
- Long-term memory: persistent facts about the user across sessions, retrieved by relevance to the current input.
Each layer needs a write path (when do you update it?) and a read path (how do you fetch it cheaply per request?). The naive answer — "embed everything and do vector search" — is wasteful. Most facts you need have IDs and can be looked up directly. Vector search is for the unstructured tail.
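The three layers can be sketched as one store with explicit write and read paths. This is an illustrative shape, not a production design; the keyword-overlap lookup stands in for whatever relevance retrieval you actually use:

```typescript
// Three-layer memory sketch: short-term turns, working facts, long-term memories.
type Turn = { role: "user" | "assistant"; content: string };

class MemoryStore {
  private turns: Turn[] = [];                                    // short-term: verbatim
  private facts = new Map<string, string>();                     // working: structured facts
  private longTerm: { text: string; keywords: string[] }[] = []; // long-term: by relevance

  addTurn(turn: Turn, maxTurns = 10) {
    this.turns.push(turn);
    if (this.turns.length > maxTurns) this.turns.shift(); // keep only the recent N
  }

  setFact(key: string, value: string) { this.facts.set(key, value); }

  remember(text: string, keywords: string[]) { this.longTerm.push({ text, keywords }); }

  // Read path: all facts, relevant long-term memories, recent turns verbatim.
  buildContext(userInput: string): string {
    const relevant = this.longTerm
      .filter(m => m.keywords.some(k => userInput.toLowerCase().includes(k)))
      .map(m => m.text);
    const facts = [...this.facts].map(([k, v]) => `${k}: ${v}`);
    return [
      "Known facts:\n" + facts.join("\n"),
      relevant.length ? "Relevant memories:\n" + relevant.join("\n") : "",
      "Recent turns:\n" + this.turns.map(t => `${t.role}: ${t.content}`).join("\n"),
    ].filter(Boolean).join("\n\n");
  }
}

const mem = new MemoryStore();
mem.setFact("plan", "Pro");
mem.remember("User prefers metric units", ["units", "metric"]);
mem.addTurn({ role: "user", content: "What units do you use?" });
const ctx = mem.buildContext("Please use metric units");
```

Note that the facts layer is a direct keyed lookup, not a vector search — which is the point made above about most useful data having IDs.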
If you're building this from scratch, you'll spend a week getting it right. If you're using a workspace like NovaKit, some of this is already provided as a managed memory store you can wire in.
Stage 3: the model needs to do things
Eventually, returning text isn't enough. The user asks "what's the status of order #12345?" and the model needs to actually look it up.
You add tool calling. The first version is one tool, hard-coded:
tools: [{
  type: "function",
  function: {
    name: "lookup_order",
    description: "Look up an order by ID",
    parameters: { type: "object", properties: { orderId: { type: "string" } } }
  }
}]
This works for one tool. Then you add a second. Then a fifth. Then twenty.
What you build next: a tool registry.
A real tool registry has:
- A central definition of every tool, with schema, description, and handler.
- Per-user or per-context filtering — not every user should see every tool.
- Permission checks before execution (does this user have access to this customer record?).
- Rate limiting per tool (don't let the model call your most expensive tool a hundred times in a loop).
- Standardized error formatting so the model can recover when a tool fails.
The honest reality: you will build this even if you're using a framework that "handles tool calling for you." Frameworks handle the protocol. They don't handle the governance.
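A governance-focused registry can be sketched like this. Everything here is illustrative — the class, the role model, and the error shape are assumptions, not any particular framework's API:

```typescript
// Tool registry sketch: central definitions, permission checks, per-tool
// rate limits, and a standardized error shape the model can recover from.
type ToolDef = {
  name: string;
  description: string;
  requiredRole: string;
  maxCallsPerRun: number;
  handler: (args: Record<string, unknown>) => Promise<unknown>;
};

class ToolRegistry {
  private tools = new Map<string, ToolDef>();
  private callCounts = new Map<string, number>();

  register(tool: ToolDef) { this.tools.set(tool.name, tool); }

  // Per-context filtering: only expose tools the caller's role permits.
  listFor(role: string): string[] {
    return [...this.tools.values()]
      .filter(t => t.requiredRole === role || role === "admin")
      .map(t => t.name);
  }

  async execute(name: string, role: string, args: Record<string, unknown>) {
    const tool = this.tools.get(name);
    if (!tool) return { ok: false as const, error: `Unknown tool: ${name}` };
    if (tool.requiredRole !== role && role !== "admin")
      return { ok: false as const, error: "Permission denied" }; // permission check
    const count = (this.callCounts.get(name) ?? 0) + 1;
    this.callCounts.set(name, count);
    if (count > tool.maxCallsPerRun)
      return { ok: false as const, error: "Rate limit exceeded" }; // per-tool cap
    try {
      return { ok: true as const, result: await tool.handler(args) };
    } catch (e) {
      // Standardized error shape so failures go back to the model, not the user.
      return { ok: false as const, error: String(e) };
    }
  }
}

const tools = new ToolRegistry();
tools.register({
  name: "lookup_order",
  description: "Look up an order by ID",
  requiredRole: "support",
  maxCallsPerRun: 3,
  handler: async (args) => ({ orderId: args.orderId, status: "shipped" }),
});
```

The `{ ok, error }` envelope is the part frameworks rarely give you: it lets the agent loop feed a failure back into context instead of crashing the run.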
Stage 4: one model call isn't enough
You ship a system that can use tools. It works for simple tasks. Then a user asks something that requires three tool calls in sequence with reasoning in between.
The single-call architecture breaks. You move to a loop:
while not done:
  ask model what to do next
  if tool call: execute it, add result to context
  if final answer: return
Now you have an agent loop. Congratulations. You also now have new problems:
- The loop sometimes runs forever (cap iterations).
- The loop sometimes does nothing (detect no-op turns).
- The model sometimes calls the same tool with the same args repeatedly (deduplicate).
- The context grows fast as you accumulate tool results (summarize older turns).
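The loop plus the first three guards can be sketched as follows. The `callModel` and `executeTool` signatures are made up for the example, and the model here is a stub rather than a real provider call:

```typescript
// Agent loop sketch with an iteration cap and duplicate-call detection.
type Step =
  | { kind: "tool"; name: string; args: string } // args as JSON string for easy dedup
  | { kind: "answer"; text: string };

async function runAgent(
  callModel: (context: string[]) => Promise<Step>,
  executeTool: (name: string, args: string) => Promise<string>,
  maxIterations = 8, // guard 1: the loop cannot run forever
): Promise<string> {
  const context: string[] = [];
  const seenCalls = new Set<string>();

  for (let i = 0; i < maxIterations; i++) {
    const step = await callModel(context);
    if (step.kind === "answer") return step.text;

    const key = `${step.name}:${step.args}`;
    if (seenCalls.has(key)) {
      // Guard 2: same tool, same args — don't re-execute, tell the model instead.
      context.push(`NOTE: ${step.name} already called with these args.`);
      continue;
    }
    seenCalls.add(key);
    context.push(`${step.name} -> ${await executeTool(step.name, step.args)}`);
  }
  return "Stopped: iteration cap reached without a final answer.";
}

// Stubbed run: one tool call, then a final answer.
let turn = 0;
const answer = await runAgent(
  async () => (turn++ === 0
    ? { kind: "tool", name: "lookup_order", args: '{"orderId":"12345"}' }
    : { kind: "answer", text: "Order 12345 has shipped." }),
  async () => '{"status":"shipped"}',
);
```

Context summarization (the fourth guard) would slot in where results are pushed, compressing older entries once the accumulated context passes a size threshold.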
What you build next: real orchestration.
Orchestration patterns that emerge:
- Plan-then-execute. First call: produce a plan. Subsequent calls: execute steps, with replanning allowed.
- Multi-agent. A "router" agent picks which specialist agent to invoke. Specialists have narrower tool access.
- Critic-and-actor. One model proposes; another evaluates. Useful when output quality is high-stakes.
These are not magic. They're just structured ways to call the model multiple times. The trap is starting too complex — most workflows do not need multi-agent. Plan-then-execute handles 80% of cases.
For a parallel pattern from human-AI coding workflows, see Vibe Coding in 2026.
Stage 5: the model is wrong sometimes
You can't tell. Or you can, but only when a user complains. By the time a complaint reaches you, you've already shipped the regression to a thousand users.
What you build next: an eval harness.
The minimum viable eval system:
- A set of representative test cases — input → expected output (or expected behavior).
- An automated runner that calls your agent on each case and scores the result.
- A pass/fail threshold that gates deployments.
Scoring is the hard part. Three approaches:
- Exact match for cases where the answer is deterministic (database lookups, calculations).
- LLM-as-judge for subjective quality (use Claude Opus 4 or GPT-5 to rate outputs against a rubric). Surprisingly good when the rubric is specific.
- Human review for the trickiest cases — usually 20-50 hand-picked examples that you re-evaluate every release.
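The minimum viable system described above fits in one function. This sketch uses exact-match scoring only; the agent, the cases, and the 90% threshold are all placeholders:

```typescript
// Minimal eval runner: exact-match scoring with a pass-rate deployment gate.
type EvalCase = { input: string; expected: string };

async function runEvals(
  agent: (input: string) => Promise<string>,
  cases: EvalCase[],
  passThreshold = 0.9,
): Promise<{ passRate: number; deployable: boolean; failures: string[] }> {
  const failures: string[] = [];
  for (const c of cases) {
    const output = await agent(c.input);
    if (output.trim() !== c.expected.trim()) failures.push(c.input);
  }
  const passRate = (cases.length - failures.length) / cases.length;
  return { passRate, deployable: passRate >= passThreshold, failures };
}

// Example: a fake deterministic agent evaluated on two lookup cases.
const result = await runEvals(
  async (input) => (input === "status of #12345" ? "shipped" : "unknown"),
  [
    { input: "status of #12345", expected: "shipped" },
    { input: "status of #99999", expected: "delivered" },
  ],
);
// Only one of two cases passes, so this run would block deployment.
```

Swapping in LLM-as-judge scoring means replacing the `!==` comparison with a rubric-based model call; the gate logic stays identical.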
Without evals, every prompt change is a coin flip. With evals, you ship 5x faster because you trust your changes.
Stage 6: the agent has to be reliable
Your agent works. Sometimes it takes 30 seconds. Sometimes it fails halfway. Sometimes the model returns an error and the user sees a cryptic message. Sometimes the user closes the tab and the half-completed work is lost.
What you build next: durability.
Durability has several pieces:
- Job queue. Long-running agent runs go on a queue, not a request thread. The user gets an immediate response ("working on it") and a notification when done. Convex actions, Inngest, Trigger.dev, or your own SQS-backed worker.
- Resumable runs. If a step fails halfway, you can retry from the failed step instead of from the beginning. This means storing intermediate state per step.
- Idempotent tool calls. Make sure the same tool call twice doesn't double-charge a card or send two emails.
- User-facing progress. Stream the model's reasoning to the user as it runs. Don't make them stare at a spinner for 30 seconds.
- Graceful timeouts. Cap any single agent run at, say, 5 minutes. Surface the partial result.
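Of these pieces, idempotency is the easiest to get wrong. One common sketch, assuming an idempotency key per logical step (the in-memory map here stands in for a durable store like Postgres or Redis):

```typescript
// Idempotency sketch: the same logical tool call executes at most once,
// even when a failed run is retried from an earlier step.
const completed = new Map<string, unknown>(); // stand-in for a durable store
let chargeCount = 0;                          // the side effect we must not duplicate

async function chargeCard(amountCents: number): Promise<string> {
  chargeCount++;
  return `charged ${amountCents}`;
}

async function idempotent<T>(key: string, fn: () => Promise<T>): Promise<T> {
  if (completed.has(key)) return completed.get(key) as T; // replay stored result
  const result = await fn();
  completed.set(key, result);
  return result;
}

// A retry after a mid-run crash reuses the same key, so the card is charged once.
const key = "run-42:step-3:charge_card";
const first = await idempotent(key, () => chargeCard(500));
const retry = await idempotent(key, () => chargeCard(500));
```

The key encodes run ID and step index, which is the same intermediate state you need for resumable runs — the two pieces share one storage layer in practice.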
This is the stage where "I built a clever AI app" becomes "I built a real product."
Stage 7: you can't debug what you can't see
By now you're getting bug reports like "the agent gave me a weird response yesterday around 3pm." You have no way to investigate.
What you build next: a trace logger.
For every agent run, log:
- The input.
- The system prompt version used.
- Every model call (provider, model, prompt, response, token counts, cost).
- Every tool call (name, arguments, result, latency, success/failure).
- The final output.
- Total duration and cost.
Store this somewhere queryable. Langsmith, Helicone, or Braintrust all work; so does your own Postgres table plus a dashboard. The exact tool matters less than having one.
Once you have traces, debugging gets fast. "Show me all runs in the last 24 hours where the agent called cancel_subscription." "Show me runs where total cost exceeded $1." "Show me runs where the user gave a thumbs-down."
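A trace record shape that supports those queries can be sketched directly. The fields and sample data are illustrative:

```typescript
// Trace record sketch, plus the kind of queries traces make possible.
type TraceEvent =
  | { type: "model_call"; model: string; tokens: number; costUsd: number }
  | { type: "tool_call"; name: string; ok: boolean; latencyMs: number };

type Trace = {
  runId: string;
  promptVersion: string;
  input: string;
  events: TraceEvent[];
  output: string;
  totalCostUsd: number;
};

const traces: Trace[] = [
  {
    runId: "r1", promptVersion: "support-agent@v2", input: "cancel my plan",
    events: [
      { type: "model_call", model: "gpt-5", tokens: 900, costUsd: 0.02 },
      { type: "tool_call", name: "cancel_subscription", ok: true, latencyMs: 210 },
    ],
    output: "Done.", totalCostUsd: 0.02,
  },
  {
    runId: "r2", promptVersion: "support-agent@v2", input: "summarize my usage",
    events: [{ type: "model_call", model: "gpt-5", tokens: 60000, costUsd: 1.4 }],
    output: "(long summary)", totalCostUsd: 1.4,
  },
];

// "Show me all runs where the agent called cancel_subscription."
const cancels = traces.filter(t =>
  t.events.some(e => e.type === "tool_call" && e.name === "cancel_subscription"));

// "Show me runs where total cost exceeded $1."
const expensive = traces.filter(t => t.totalCostUsd > 1);
```

In Postgres the same shape becomes a `traces` table plus an `events` table keyed by run ID, and the filters become two-line SQL queries.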
Stage 8: cost gets real
You launch. Users use it. Your provider bill is now a real line item. You start asking questions.
- Which prompts cost the most per call?
- Which users cost the most per month?
- Are we using the right model for each task, or are we paying Opus prices for Sonnet-quality work?
- Can we cache anything?
What you add: cost-aware routing and caching.
Cost-aware routing means: for each agent step, you classify the difficulty and route to the cheapest model that can handle it. Easy summarization → Sonnet 4.6 or DeepSeek V3. Hard reasoning → Opus 4 or GPT-5. Trivial classification → a small open-weights model on Groq.
Caching means: when the same prompt-with-same-context comes in twice, return the cached result. Anthropic and OpenAI both offer prompt caching at the provider level, which gives you 50-90% cost reduction on repeated context. Use it.
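Both ideas can be sketched together. The tier classifier, model names, and cache are all stand-ins — a real router would classify with a cheap model call rather than regexes, and provider-level prompt caching happens below this layer:

```typescript
// Cost-aware routing sketch: classify each step's difficulty and pick the
// cheapest model that can handle it, with a response cache in front.
type Tier = "trivial" | "easy" | "hard";

const MODEL_FOR_TIER: Record<Tier, string> = {
  trivial: "groq/llama-small",  // trivial classification
  easy: "claude-sonnet-4.6",    // easy summarization
  hard: "gpt-5",                // hard reasoning
};

function classify(task: string): Tier {
  if (/classify|label|yes.or.no/i.test(task)) return "trivial";
  if (/summariz|rewrite|extract/i.test(task)) return "easy";
  return "hard";
}

const cache = new Map<string, string>();
let providerCalls = 0;

async function routedCall(task: string, prompt: string): Promise<string> {
  const cacheKey = `${task}::${prompt}`;
  const hit = cache.get(cacheKey);
  if (hit !== undefined) return hit; // repeated prompt + context: no provider call

  const model = MODEL_FOR_TIER[classify(task)];
  providerCalls++;
  const response = `[${model}] response to: ${prompt}`; // stand-in for the real API call
  cache.set(cacheKey, response);
  return response;
}

const a = await routedCall("summarize this ticket", "Ticket #9 text");
const b = await routedCall("summarize this ticket", "Ticket #9 text"); // cache hit
```

The important property is that routing and caching live in one choke point, so every agent step gets both for free.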
For BYOK products specifically, this is one of the major value props — see Consolidate Your AI Subscriptions for the user-side argument.
Stage 9: external context (RAG, MCP, the real world)
The agent now needs to reason about your customer's documents, their database, their team's Slack, the live web. The naive answer is RAG. The 2026 answer is a mix.
- Structured lookups via MCP servers. If the data has a schema (a database, an API), expose it as MCP tools. The model can query directly with full type safety.
- Vector search for unstructured tails. Documents, support tickets, knowledge bases — embed and retrieve.
- Live web for fresh facts. Tavily, Exa, or Perplexity APIs as tools.
- Code execution for math, data, transformations. Sandboxed Python or JavaScript runners.
The mix matters. RAG-only architectures consistently underperform mixed architectures because most useful data is structured.
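One way to see the mix in code: route a question to a direct keyed lookup when it carries an identifier, and fall back to unstructured retrieval otherwise. The dispatcher, the ID pattern, and the `search` stand-in are all assumptions for illustration:

```typescript
// Sketch of mixed retrieval: structured lookup when the question has an ID,
// vector-style search for the unstructured tail.
async function answerWithContext(
  question: string,
  db: Map<string, string>,                  // structured: direct lookup by ID
  search: (q: string) => Promise<string[]>, // unstructured: vector search stand-in
): Promise<string> {
  const idMatch = question.match(/#(\d+)/);
  if (idMatch) {
    // Structured path: no embeddings needed, just a keyed read.
    return db.get(idMatch[1]) ?? `No record for #${idMatch[1]}`;
  }
  // Unstructured tail: retrieve relevant passages for the model to ground on.
  const passages = await search(question);
  return passages.join("\n");
}

const orders = new Map([["12345", "Order 12345: shipped on March 3"]]);
const structured = await answerWithContext("status of order #12345?", orders, async () => []);
const unstructured = await answerWithContext("what is our refund policy?", orders,
  async () => ["Refunds within 30 days."]);
```

In a real system the structured branch would be an MCP tool call against the database, but the routing decision looks the same.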
Stage 10: governance
Late-stage problems that hit every serious agent:
- PII handling. What does the model see, what does it log, what gets retained?
- Output filtering. Block unsafe content before it reaches the user.
- Auditability. For B2B and regulated customers, every model call must be traceable to a user, a session, and a purpose.
- Per-tenant isolation. Customer A's data must never appear in Customer B's responses.
- Provider redundancy. When OpenAI has an outage, can you fail over to Anthropic? And when you do switch, do you have a way to compare output quality across providers?
Most teams underbuild this and pay for it during their first enterprise sales cycle.
The components inventory
By the time you have a real production agent, you have:
- Prompt registry with versioning.
- Memory store with short/working/long layers.
- Tool registry with permissions and rate limits.
- Orchestrator that runs the agent loop.
- Eval harness gating deployments.
- Job queue for durable execution.
- Trace logger for debugging and analytics.
- Cost-aware router that picks the right model per step.
- Cache layer for repeated prompts.
- External context layer mixing MCP, vector search, web, and code execution.
- Governance layer for PII, filtering, audit, isolation, and redundancy.
This is not over-engineering. This is table stakes for a system that actually works in production.
What you don't need (yet)
A few things teams build prematurely:
- A custom model. You almost never need to fine-tune. Better prompts, better tools, and better routing solve 95% of cases.
- A complex multi-agent system. Plan-then-execute beats it for most workflows.
- A vector database for everything. Most lookups have IDs. Use them.
- Your own LLM gateway. Existing providers and OpenRouter cover this. Build your own only when you genuinely need features they don't have.
A practical 2026 stack for building this
If you're starting fresh today and want to grow into this architecture:
- Backend: Convex (or Supabase) — gives you database, scheduled functions, and storage in one place.
- Auth: Better Auth.
- Prompts: Code-versioned in your repo. Promptfoo for evals.
- Tools/MCP: @modelcontextprotocol/sdk — the standard.
- Tracing: Langsmith, Helicone, or Braintrust.
- Job queue: Convex actions, Inngest, or Trigger.dev.
- Models: Claude Opus 4, Claude Sonnet 4.6, GPT-5, plus Groq-hosted small models for hot paths. Route via OpenRouter or direct.
- Frontend: Next.js 16, deployed on Vercel.
For more on the developer workflow that makes this fast, see Vibe Coding in 2026.
The honest closing
The reason every "ChatGPT wrapper" succeeds or fails has very little to do with the wrapper part. It has everything to do with how seriously the team takes the operational stack — evals, tracing, cost, durability, governance.
Anyone can wrap an API. The companies that win in 2026 are the ones that built (or borrowed) the surrounding components before their first enterprise customer asked the hard questions.
Build the boring stuff early. Your future on-call self will thank you.
NovaKit is a BYOK AI workspace — useful as the planning, drafting, and prompt-iteration surface alongside whatever production agent you're building. Bring your own keys, route across every major model, keep your data local.