engineering · April 3, 2026 · 11 min read

How AI Agents Actually Work: Tool Use, Memory, and Orchestration Explained

"Agentic AI" is the buzzword of 2026 — but what's actually happening under the hood when an agent books your flight, refactors your code, or runs a 5-step research task? A plain-English breakdown with real examples.

TL;DR

  • An AI agent is a loop: the model reads context, decides whether to call a tool or respond, executes the tool, reads the result, repeats.
  • The three ingredients: a model (the decision-maker), tools (what it can do in the real world), and a loop (the orchestration that lets it take multiple steps).
  • Agents "remember" through: conversation history (short-term), summarization (medium), and external stores (long-term, via files, databases, or memory systems).
  • Most agent failures come from: bad tool descriptions, runaway loops, context overflow, and unclear termination conditions.
  • In 2026 the stack is maturing fast — MCP for tools, frontier models for decisions, and observability for debugging.

What "agent" actually means

The word "agent" gets thrown around loosely. Let's pin it down.

An AI agent is a system where:

  1. A language model receives a goal.
  2. The model decides what to do next (respond, call tool A, call tool B).
  3. If it calls a tool, the tool executes and returns a result.
  4. The model receives the result as new context and decides again.
  5. This loop continues until the model decides it's done.

That's it. That's the whole trick.

The magic isn't in some new algorithm. It's in the loop and the tools. Once a model can do things and see their results, it can compose multi-step behavior.

The minimal agent loop

In pseudocode, any agent looks like this:

def agent(goal, tools, model):
    messages = [{"role": "user", "content": goal}]
    while True:
        response = model.run(messages, tools=tools)
        # Record the assistant turn (including any tool calls) before the results.
        messages.append({"role": "assistant", "content": response.content,
                         "tool_calls": response.tool_calls})
        if not response.tool_calls:
            return response.content
        for call in response.tool_calls:
            result = execute_tool(call.name, call.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": result})

That's a dozen lines. The frontier-model providers (Anthropic, OpenAI, Google) all support structured tool calling that makes this straightforward — you describe tools as JSON schemas, the model decides when to call them.

Everything on top of this — LangGraph, AutoGen, CrewAI, Letta, the various "agent frameworks" — exists to organize the loop, add memory, handle failures, and orchestrate multiple agents working together.

Tool use: how the model decides what to do

Tools are described to the model as schemas:

{
  "name": "search_web",
  "description": "Search the web for recent information. Use when you need facts newer than your training cutoff.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": { "type": "string", "description": "Search query" }
    },
    "required": ["query"]
  }
}

The model sees this and, when it receives a user question it can't answer from its own knowledge, produces a structured tool call:

{
  "tool_name": "search_web",
  "arguments": { "query": "NovaKit AI workspace 2026 features" }
}

Your code catches this, runs the search, feeds the result back, and the model continues with more information.
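The "your code catches this" step can be sketched as a dispatcher keyed by tool name. This is a minimal illustration, not a specific SDK's API; `search_web` here is a stub standing in for a real search backend:

```python
# Minimal tool dispatcher: maps tool names to Python functions.
def search_web(query: str) -> str:
    return f"results for: {query}"  # stub for a real search backend

TOOLS = {"search_web": search_web}

def execute_tool(name: str, arguments: dict) -> str:
    fn = TOOLS.get(name)
    if fn is None:
        # Returning an error string lets the model notice and recover,
        # instead of crashing the whole loop.
        return f"error: unknown tool '{name}'"
    try:
        return fn(**arguments)
    except TypeError as exc:
        return f"error: bad arguments for '{name}': {exc}"
```

Returning errors as strings (rather than raising) is deliberate: the model sees the error as a tool result and can retry with corrected arguments.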

The model is not "thinking" about tools in a magical way. It's producing text predictions that happen to match a structured tool-call format. What makes this work reliably is extensive post-training: models are deliberately trained to follow tool-call formats when given tool descriptions.

Tool descriptions matter more than you'd think

The biggest determinant of whether an agent works is how well you describe its tools.

Bad description: "Look stuff up." Good description: "Search Wikipedia for general reference information. Use for historical facts, definitions, and well-established topics. Do not use for news less than 6 months old — use search_news instead."

The difference: the model now knows when to and when not to use this tool. A well-described tool set is 80% of a well-behaved agent.

Principles:

  • Describe when to use the tool — not just what it does.
  • Describe when NOT to use it — explicitly contrast with related tools.
  • Describe what the input means — be specific about formats, units, ranges.
  • Describe what the output will look like — so the model can plan follow-up steps.

Memory: how agents "remember"

Models are stateless. They remember only what's in the current context window. Agent memory is an engineering problem you solve in layers:

Short-term: conversation history

The trivial case: keep passing the whole conversation back to the model. Works until you hit the context limit.

Medium-term: summarization

When the conversation gets long, compress older messages into summaries. The model keeps recent turns verbatim plus a summary of what came before.

Simple pattern:

  • Full verbatim for the last N turns (e.g., 20).
  • A summary for everything before.
  • Re-summarize every N turns.
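The pattern above can be sketched in a few lines. `summarize` is a stub here; in practice you would ask the model itself to produce the summary:

```python
# Sliding-window compaction: keep the last N messages verbatim,
# fold everything older into a single summary message.
def summarize(messages: list[dict]) -> str:
    return f"[summary of {len(messages)} earlier messages]"  # stub for a model call

def compact(messages: list[dict], keep_last: int = 20) -> list[dict]:
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent
```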

Long-term: external memory stores

For truly persistent memory — across sessions, across projects — you need external storage:

  • Files / notes: The agent writes important facts to a file and reads them back next session. Minimal and effective.
  • Key-value stores: Named pieces of information the agent can look up ("what's John's email?").
  • Vector stores: Embeddings over past interactions; semantic retrieval.
  • Graph memory: Entities and relationships for structured recall (who is whose colleague, what project is what about).
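The files/notes approach really is minimal. A sketch, assuming a simple JSON file as the store (the class and path are illustrative, not from any particular library):

```python
import json
import os

# File-backed memory: the agent writes facts as key-value pairs
# and reads them back in a later session.
class FileMemory:
    def __init__(self, path: str):
        self.path = path

    def _load(self) -> dict:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def remember(self, key: str, value: str) -> None:
        data = self._load()
        data[key] = value
        with open(self.path, "w") as f:
            json.dump(data, f)

    def recall(self, key: str):
        return self._load().get(key)
```

Expose `remember` and `recall` to the agent as two tools and it gains cross-session memory with no extra infrastructure.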

Several standalone memory systems exist in 2026 — Letta (formerly MemGPT), Mem0, and Zep are the most notable. They handle the storage, retrieval, and summarization logic so you don't have to build it.

Orchestration: multi-step reasoning patterns

Some tasks need more than a single-agent loop. Common orchestration patterns:

Plan-and-execute

  • A planner agent creates a multi-step plan.
  • An executor agent runs each step.
  • Separating these produces clearer reasoning traces and is easier to debug.
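The planner/executor split can be sketched as two stubbed model calls — `plan` and `execute_step` here stand in for two separate model invocations with different prompts:

```python
# Plan-and-execute sketch: one call produces steps, another runs each one.
def plan(goal: str) -> list[str]:
    return [f"research {goal}", f"draft answer for {goal}"]  # stub planner

def execute_step(step: str, context: list[str]) -> str:
    return f"done: {step}"  # stub executor

def plan_and_execute(goal: str) -> list[str]:
    results = []
    for step in plan(goal):
        # Each step sees the results so far, giving a clear, linear trace.
        results.append(execute_step(step, results))
    return results
```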

Reflection / critique

  • The agent produces a draft.
  • A second pass (or a second agent) critiques it.
  • The agent revises.

Used heavily in writing, coding, and research agents. Small but real quality boost.
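The draft → critique → revise cycle is equally simple to express. All three functions are stubs for model calls; an empty critique is treated as "no issues found":

```python
# Reflection sketch: draft, critique, revise, bounded by a round count.
def draft(task: str) -> str:
    return f"draft for {task}"  # stub

def critique(text: str) -> str:
    return "be more specific"  # stub critic; "" would mean "no issues"

def revise(text: str, feedback: str) -> str:
    return f"{text} (revised: {feedback})"  # stub

def reflect(task: str, rounds: int = 1) -> str:
    text = draft(task)
    for _ in range(rounds):
        feedback = critique(text)
        if not feedback:
            break  # critic found nothing to fix; stop early
        text = revise(text, feedback)
    return text
```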

Multi-agent teams

  • Multiple agents with specialized roles (researcher, writer, reviewer).
  • One agent delegates to others via "calls" that look like tool calls.
  • Good for complex workflows; adds debugging complexity.

Tree search / branching

  • The agent explores multiple possible approaches in parallel.
  • A final step picks the best result or merges them.
  • Used in harder reasoning tasks, research, and open-ended exploration.

Frameworks like LangGraph formalize these patterns. You can also build them from scratch — they're not especially complex.

Where agents fail (and how to prevent it)

Runaway loops

The agent keeps calling tools forever, never deciding it's done.

Prevention:

  • Hard step limit. Cap at 20-50 tool calls per task. Fail loudly.
  • Budget ceiling. Cap dollar spend per run.
  • Clear termination conditions in the system prompt: "When you have the answer, respond directly — do not continue tool-calling."
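The hard step limit can be sketched as a wrapper around one iteration of the loop. `model_step` here is an illustrative callable returning `(done, output)`:

```python
# Hard cap on loop iterations: fail loudly instead of spinning forever.
class StepLimitExceeded(RuntimeError):
    pass

def run_with_cap(model_step, max_steps: int = 50):
    for _ in range(max_steps):
        done, output = model_step()
        if done:
            return output
    raise StepLimitExceeded(f"agent exceeded {max_steps} steps")
```

Raising (rather than silently returning a partial result) is the point: a runaway agent should page you, not quietly burn budget.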

Context overflow

Conversation grows past the model's context window, causing truncation or errors.

Prevention:

  • Track tokens as you go. Summarize when approaching 80% of context.
  • Trim old tool results. Old file contents usually aren't needed; the model rarely re-reads them.
  • Restart with state. Dump essential state to a file, start a fresh conversation with that file in context.

Wrong tool choice

The agent calls the wrong tool, wastes time and money.

Prevention:

  • Better tool descriptions (see above).
  • Tool grouping. Organize tools into clear categories in the system prompt.
  • Examples in the description. "Example correct use: ... Example incorrect use: ..."

Hallucinated results

The agent decides a tool returned something it didn't, and builds reasoning on fiction.

Prevention:

  • Explicit "quote the tool result" instructions when the output matters.
  • Verification loops. After a claim-producing step, have the agent cite exact tool output.
  • Use a frontier model. GPT-4o-mini makes this error more often than Claude Opus 4.

Silent drift

The agent slowly wanders off the original goal.

Prevention:

  • Restate the goal periodically in the system prompt or in a self-reflection step.
  • Checkpoint against success criteria. "Have I actually done what was asked?"

Observability: how to debug an agent

Without observability, agents are black boxes. You need to see:

  • Every model call: prompt, response, tokens used, cost, duration.
  • Every tool call: tool name, arguments, result, duration.
  • The full trace of a run — linear timeline of everything that happened.
  • Aggregates: success rate, average steps per task, average cost per task.

Tools that help in 2026:

  • Langfuse, Helicone, LangSmith, Arize Phoenix — purpose-built agent observability.
  • OpenTelemetry — generic tracing, with AI-specific conventions emerging.
  • Just logs. For small-scale experiments, structured JSON logs are enough.
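For the "just logs" option, a minimal structured trace might look like the sketch below — one JSON object per event, collected in memory here, though in practice you'd append to a file or stdout:

```python
import json
import time

# Minimal structured trace: one JSON-serializable event per model/tool call.
class Trace:
    def __init__(self):
        self.events = []

    def log(self, kind: str, **fields) -> None:
        self.events.append({"ts": time.time(), "kind": kind, **fields})

    def dumps(self) -> str:
        # One JSON object per line: trivially grep-able and parseable.
        return "\n".join(json.dumps(e) for e in self.events)
```

Usage: `trace.log("model_call", tokens=1200, cost=0.004)` after each API call, `trace.log("tool_call", tool=name, args=args)` after each tool run — and you have the linear timeline described above.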

If you're building agents and you don't have tracing, that's the first thing to fix. Debugging by guessing doesn't scale past a few test cases.

The MCP connection

MCP (Model Context Protocol) is the 2026 standard for describing tools to models. Before MCP, every framework invented its own tool-description format. Now there's one protocol.

Why this matters for agents:

  • Tools written for MCP work with Claude, GPT, Gemini, Cursor, Claude Code — anything MCP-compliant.
  • The community has contributed hundreds of MCP servers; you don't rebuild "search Gmail" from scratch.
  • You can mix tools from many sources in one agent without gluing them together manually.

If you're building agents in 2026 and ignoring MCP, you're writing more code than you need to.

A real agent example

Let's trace a realistic run: "Find the top 3 competitors for NovaKit and compare their pricing in a markdown table."

  1. The model decides it needs to search and calls search_web("NovaKit competitors BYOK AI chat").
  2. The search returns 10 results; the model reads titles and snippets.
  3. The model picks the top 3 candidates and calls fetch_url on each to read their pricing pages.
  4. Three page contents come back; the model extracts the pricing details.
  5. The model decides it has enough information, writes a markdown comparison table, and returns it to the user.

Total: 4-5 tool calls, ~90 seconds, maybe $0.10 in API cost.

Without agents: a human spends 20 minutes doing the same thing.

That's the ROI. Scaled across thousands of small tasks per week, the productivity shift is enormous.

Building vs. buying

Should you build agents from scratch or use a framework?

Build from scratch when:

  • You need full control of the loop.
  • You're doing something unusual (specific orchestration pattern).
  • You want to deeply understand what's happening.

Use a framework when:

  • You're doing standard patterns (plan-execute, multi-agent teams).
  • You need production-grade observability quickly.
  • You want memory handled for you.

Good frameworks to know:

  • LangGraph — explicit state-machine orchestration. Powerful, some learning curve.
  • CrewAI — multi-agent team patterns.
  • Letta — memory-first agents with persistent state.
  • AutoGen — Microsoft's multi-agent framework.
  • Pydantic AI — type-safe, minimal, Python-native.

For most projects, start from scratch with 50 lines of code, then adopt a framework if you hit its use case. Agent frameworks have real value but also real overhead; don't pay that tax before you need it.

The near-future

Where agents are headed through 2026-2027:

  • Better memory. Persistent, semantic, entity-aware. Memory is the current weakest link.
  • Better evaluation. "Did the agent succeed?" is surprisingly hard to evaluate. Test harnesses and judges are improving.
  • Standard interfaces. MCP is one step. Expect more standardization of authentication, streaming, and multi-server composition.
  • Agents calling agents. Hierarchical delegation will become normal; today it's still hand-rolled.
  • Safety mechanisms. Permissions, capabilities, and constraint systems. Agents that can do things in the real world are terrifyingly powerful — and we're still figuring out how to keep them safe.

The summary

  • An agent is a loop: model → decide → tool → result → decide again.
  • Tool descriptions and termination conditions are where most quality comes from.
  • Memory is an engineering problem solved in layers: history, summaries, external stores.
  • Observability is not optional.
  • MCP is the 2026 standard for tool interfaces.
  • Start simple (50 lines). Adopt frameworks when you hit their use cases.

Understand the loop; the rest follows.


Build, test, and observe your own agents in NovaKit — BYOK support for every major frontier model, MCP-compatible tool connections coming soon, per-run cost tracking included.
