On this page
- TL;DR
- What changed
- The two archetypes
- Reasoning models
- Fast models
- A bigger taxonomy
- When reasoning wins
- Math beyond arithmetic
- Algorithm and code architecture
- Hard debugging
- Multi-constraint planning
- Novel or ambiguous problems
- When the cost of being wrong is high
- When fast wins
- Chat UX
- Classification and routing
- Summarization of well-structured content
- Bulk extraction
- High-volume agent loops
- Voice interfaces
- Autocomplete and inline suggestions
- The hybrid pattern (what actually works in production)
- A latency budget you can actually use
- The cost story
- How to think about hidden thinking tokens
- A worked example: customer support
- Common mistakes
- Mistake: defaulting to the smartest model
- Mistake: defaulting to the fastest model
- Mistake: not measuring P95
- Mistake: ignoring thinking-token cost
- Mistake: not exposing the trade-off to the user
- How to actually choose, today
- The summary
TL;DR
- The 2026 model landscape is split in two: reasoning models that think for seconds-to-minutes (o3, GPT-5 with thinking, Claude Opus 4.7 extended thinking) and fast models that respond in under a second (Haiku 4.5, Gemini 2.5 Flash, GPT-5-mini).
- Reasoning models win on: math, code architecture, multi-step planning, novel problems, anything where being wrong is expensive.
- Fast models win on: chat UX, classification, summarization, well-defined transformations, anything you'll do thousands of times.
- The cost/latency gap is roughly 10-50x. The quality gap on appropriate tasks is usually under 10%.
- Most production systems should use both: a fast model for the 90% of work that doesn't need thinking, a reasoning model for the 10% that does.
What changed
Until 2024, an LLM gave you a token-by-token answer right after your prompt. Latency was a function of length. Quality was a function of model size.
Then OpenAI shipped o1, and the rules changed. Reasoning models spend significant compute before the first visible token. They generate hidden chains of thought, evaluate options, backtrack, and finally produce an answer. The result is dramatically better performance on hard problems — at dramatically higher latency and cost.
By 2026, every major lab has both archetypes. OpenAI: GPT-5 (with thinking modes) and GPT-5-mini. Anthropic: Claude Opus 4.7 with extended thinking, Sonnet 4.6, Haiku 4.5. Google: Gemini 2.5 Pro with thinking, Gemini 2.5 Flash. The split is now permanent.
The question is no longer which model. It's which mode.
The two archetypes
Reasoning models
Examples: o3, o3-mini, GPT-5 with thinking, Claude Opus 4.7 extended thinking, Gemini 2.5 Pro thinking.
How they work: the model generates a long internal scratchpad — sometimes thousands of tokens of "thinking" — before producing the user-visible answer. You pay for those thinking tokens.
Strengths:
- Multi-step math and proofs.
- Algorithm design and analysis.
- Hard root-cause debugging.
- Tasks requiring planning across many constraints.
- Anything where the model needs to consider multiple paths and choose the best.
- Novel problems not well-represented in training data.
Weaknesses:
- Slow. First-token latency in seconds, complete answers in tens of seconds to minutes.
- Expensive. Effective cost per answer can be 10-50x a fast model.
- Often "overthinks" simple questions. Asking o3 "what's the capital of France" wastes 30 seconds and a dollar.
- Can produce long, hedged answers when a short one would do.
Fast models
Examples: Claude Haiku 4.5, Gemini 2.5 Flash, GPT-5-mini, and the smaller Mistral variants.
How they work: classic feed-forward generation, no hidden thinking step. Answer streams out token-by-token.
Strengths:
- Sub-second first-token latency. Feels alive in chat.
- Very cheap per call.
- Strong at well-defined tasks: classification, extraction, summarization, format conversion.
- Excellent for high-volume work where you process thousands or millions of items.
- Better UX in conversational interfaces.
Weaknesses:
- Brittle on multi-step reasoning. May produce confidently wrong answers.
- Limited on novel problems — better at pattern-matching than at thinking a problem through.
- Effectively smaller "working memory," even when the context window is large.
- May need more careful prompting to coax out good behavior on harder tasks.
A bigger taxonomy
It's not actually two categories. It's a spectrum.
| Tier | Examples | Latency | Use case |
|---|---|---|---|
| Heavy reasoning | o3, GPT-5 thinking high | 30s-5min | Hardest problems, where wrong is unacceptable |
| Light reasoning | o3-mini, GPT-5 thinking low, Opus 4.7 extended | 5-30s | Real reasoning tasks, faster turnaround |
| Premium chat | Claude Opus 4.7, GPT-5, Gemini 2.5 Pro | 2-10s | Quality-first conversational and creative work |
| Workhorse | Claude Sonnet 4.6, GPT-5 default | 1-5s | Daily default, good at most things |
| Fast | Haiku 4.5, Gemini 2.5 Flash, GPT-5-mini | <1s | Chat, classification, high-volume |
| Tiny / local | Phi-4, Qwen 2.5, Llama 3.3 8B | Instant locally | Offline, edge, privacy |
When we say "fast vs smart" we usually mean the top two and the bottom two. The middle is where most actual work happens.
When reasoning wins
Math beyond arithmetic
For anything past basic algebra — calculus, combinatorics, proofs, optimization — reasoning models win decisively. Fast models will produce confident-looking output that is often wrong in subtle ways. o3 and GPT-5 thinking will work through it.
Algorithm and code architecture
When the question is "what's the right algorithm" or "how should this system be structured," reasoning models genuinely think differently. Claude Opus 4.7 with extended thinking is also strong here.
For raw code-typing once the design is clear, you can drop back to a faster model.
Hard debugging
"Why is this failing only in production with a particular customer's data?" is a multi-hypothesis problem. Reasoning models enumerate hypotheses, weigh evidence, and rank them. Fast models tend to grab the first plausible explanation.
Multi-constraint planning
Scheduling, resource allocation, agent task planning with many interacting requirements. Reasoning models hold the constraints in working memory long enough to satisfy all of them.
Novel or ambiguous problems
Anything that doesn't look like training data. Reasoning models recover better from "I haven't seen exactly this before."
When the cost of being wrong is high
Production migrations, contract review, medical or legal first-pass analysis (with human review), security analysis. Pay the latency to lower the error rate.
When fast wins
Chat UX
If a human is reading the response stream, sub-second first-token matters more than marginal quality. People perceive the slow model as broken. Use the fast model for the conversational shell, with explicit "deep think" actions that switch to reasoning.
Classification and routing
"Is this email spam? Is this ticket billing or technical? What category is this product?" Fast models are dramatically better fit-for-purpose. Reasoning is wasted compute on a one-shot decision.
Summarization of well-structured content
Article summaries, meeting transcripts, document compression. Fast models do this well at a tenth the cost.
Bulk extraction
Pulling structured data out of unstructured text. Names, dates, amounts, sentiments. Run a fast model in parallel across thousands of items.
High-volume agent loops
If you have an agent making 50 tool calls per task, each call should usually be a fast model. Use reasoning only at the top-level planning layer.
Voice interfaces
Voice latency budget is brutally tight. Anything over a second feels broken. Fast models or distilled small models are the only option.
Autocomplete and inline suggestions
Same as voice. Sub-200ms or it's not autocomplete, it's a hint dialog.
The hybrid pattern (what actually works in production)
Most well-built 2026 systems do not pick one or the other. They use a layered architecture:
- Router (fast model or even a classifier): decides what kind of request this is.
- Default handler (fast or workhorse): handles the common case.
- Escalation handler (reasoning model): triggered when the request is hard, ambiguous, or high-stakes.
- Verifier (fast or reasoning): checks the output before showing it to the user.
This pattern delivers reasoning-model quality where it matters and fast-model latency everywhere else. It's the architecture behind every production AI feature that doesn't feel like a tech demo.
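The four layers above can be sketched as a single dispatch function. This is a minimal illustration, not a real integration: the model names and the `call_model` stub are placeholders for your provider's SDK, and the router here is a keyword heuristic standing in for a fast-model classifier.

```python
# Layered routing sketch: router -> default handler -> escalation -> verifier.
# All model IDs and the call_model stub are placeholders.

FAST_MODEL = "haiku-4.5"            # router + default handler
REASONING_MODEL = "gpt-5-thinking"  # escalation handler

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real API call; returns a canned reply here."""
    return f"[{model}] answer to: {prompt}"

def classify(request: str) -> str:
    """Router: a fast model (or plain heuristics) labels the request."""
    hard_markers = ("prove", "migrate", "debug", "optimize")
    return "hard" if any(m in request.lower() for m in hard_markers) else "routine"

def verify(answer: str) -> bool:
    """Verifier: a cheap sanity check before the answer reaches the user."""
    return len(answer.strip()) > 0

def handle(request: str) -> str:
    tier = classify(request)
    model = REASONING_MODEL if tier == "hard" else FAST_MODEL
    answer = call_model(model, request)
    if not verify(answer):
        # Escalate once if the cheap answer fails verification.
        answer = call_model(REASONING_MODEL, request)
    return answer
```

With this shape, `handle("What's my order status?")` stays on the fast path while `handle("Debug this race condition")` escalates — the expensive model only runs when the router or the verifier says it must.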
We dig into the routing patterns specifically in Multi-model AI workflows and routing.
A latency budget you can actually use
For each surface in your product, give it a latency budget. Then pick the model that fits.
| Surface | Budget | Implication |
|---|---|---|
| Voice turn | <800ms | Fast model only, distilled if possible |
| Inline autocomplete | <200ms | Local SLM or fast cloud |
| Chat first token | <2s | Fast or workhorse model |
| Chat full response | <10s | Workhorse model |
| Agent step | <30s | Workhorse, reasoning OK if user expects |
| Background job | <5min | Reasoning OK, optimize for cost |
| Batch overnight | Hours | Cheapest capable model |
The cardinal sin is putting a 30-second reasoning model behind a chat UI that promised "instant." Users will hate it even if the answers are perfect.
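The budget table can be made mechanical. A small lookup like the sketch below keeps model selection honest — the per-tier latency numbers are illustrative assumptions, not benchmarks, so substitute your own measurements.

```python
# Latency budgets per surface (ms), following the table above.
BUDGETS_MS = {
    "voice_turn": 800,
    "inline_autocomplete": 200,
    "chat_first_token": 2_000,
    "agent_step": 30_000,
    "background_job": 300_000,
}

# Rough typical first-token latency per tier (illustrative, not measured).
TIER_LATENCY_MS = {
    "tiny_local": 50,
    "fast": 600,
    "workhorse": 2_500,
    "reasoning": 15_000,
}

def tiers_that_fit(surface: str) -> list[str]:
    """Return every tier whose typical latency fits the surface's budget."""
    budget = BUDGETS_MS[surface]
    return [tier for tier, lat in TIER_LATENCY_MS.items() if lat <= budget]
```

For example, `tiers_that_fit("voice_turn")` admits only the tiny and fast tiers, while a background job can use anything — which is exactly what the table says in prose.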
The cost story
Rough order of magnitude in mid-2026 (output tokens):
- o3: ~$60 per million.
- GPT-5 with thinking high: ~$30-50 per million effective (you pay for thinking tokens).
- Claude Opus 4.7 extended thinking: ~$50 per million effective.
- Claude Sonnet 4.6: ~$15 per million.
- GPT-5 default: ~$15 per million.
- Haiku 4.5: ~$1 per million.
- Gemini 2.5 Flash: ~$0.50 per million.
- GPT-5-mini: ~$1 per million.
A reasoning answer can easily cost 30-100x what a fast answer costs. If you do this at scale without thinking about it, your bill will tell you a story you don't want to read.
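To see where the multiplier comes from, run the arithmetic once. The token counts below are assumptions for a typical hard question (a few thousand hidden thinking tokens, ~1k visible), using the illustrative prices above.

```python
# Back-of-envelope per-answer cost. Reasoning models bill hidden thinking
# tokens plus visible output; fast models bill only what you see.

def answer_cost(thinking_tokens: int, output_tokens: int,
                price_per_million: float) -> float:
    """Cost in dollars for one answer."""
    return (thinking_tokens + output_tokens) * price_per_million / 1e6

# GPT-5 with thinking at ~$40/M: 3k hidden thinking + 1k visible tokens.
reasoning = answer_cost(3_000, 1_000, 40.0)  # $0.16 per answer
# Haiku 4.5 at ~$1/M: no thinking step, a 2k-token answer.
fast = answer_cost(0, 2_000, 1.0)            # $0.002 per answer
ratio = reasoning / fast                     # 80x, inside the 30-100x range
```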
How to think about hidden thinking tokens
Reasoning models bill for the thinking tokens, not just the visible output. A single o3 answer can burn 5-20k thinking tokens. At scale this matters.
Practical implications:
- Cap the thinking budget. GPT-5 and Claude both expose thinking effort knobs. Use "low" or "medium" unless the task truly needs "high."
- Batch when possible. Reasoning models often have batch APIs at significant discounts.
- Cache prompts. All major providers now offer prompt caching. For repeated system prompts, you can cut effective cost by 50-90%.
- Don't double-pay. If a fast model can do it, don't run reasoning. The whole point is to escalate selectively.
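Capping the budget is a request-level setting. The sketch below builds the request bodies rather than calling the network; the parameter shapes follow the current OpenAI (`reasoning_effort`) and Anthropic (`thinking.budget_tokens`) APIs, but the model IDs are placeholders and provider details may differ — check your provider's docs.

```python
# Two ways to cap thinking spend, expressed as raw request bodies.
# Model IDs are placeholders; parameter names follow current provider docs.

def openai_request(prompt: str, effort: str = "low") -> dict:
    """OpenAI-style request with a capped reasoning effort knob."""
    return {
        "model": "o3-mini",
        "reasoning_effort": effort,  # "low" | "medium" | "high"
        "messages": [{"role": "user", "content": prompt}],
    }

def anthropic_request(prompt: str, budget_tokens: int = 2048) -> dict:
    """Anthropic-style request with a hard thinking-token budget."""
    return {
        "model": "claude-opus-4-7",  # placeholder ID
        "max_tokens": 4096,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }
```

Defaulting these knobs to "low" and escalating only on demand is the cheapest version of the hybrid pattern.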
A worked example: customer support
You're building an AI support agent. Daily volume: 10,000 tickets.
Naive approach: route every ticket to GPT-5 with thinking. Quality: excellent. Latency: 30 seconds per ticket. Cost: maybe $5,000 per day.
Hybrid approach:
- Haiku 4.5 router classifies the ticket type. Cost: pennies. Latency: 200ms.
- GPT-5-mini drafts a response for the 80% of tickets that are routine. Cost: a few cents per ticket. Latency: 2-3 seconds.
- GPT-5 with low thinking handles the 15% of tickets that are nuanced. Cost: 10-20 cents. Latency: 8-15 seconds.
- o3-mini is reserved for the 5% of tickets that escalate to "this looks complicated, get it right." Cost: 50 cents. Latency: 30s.
- Haiku 4.5 verifier sanity-checks every outgoing message. Cost: trivial.
Total daily cost: a small fraction of the naive approach. P95 latency: dramatically lower for most users. Quality on the hard cases: comparable. This is the architecture pattern that pays for itself.
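The "small fraction" claim is easy to check. Taking the midpoints of the per-ticket costs above as assumptions (the router and verifier costs are small enough to ignore here):

```python
# Rough daily cost of the hybrid pipeline vs. the naive all-reasoning one.
# Per-ticket costs are midpoints of the ranges in the text (assumptions).

TICKETS_PER_DAY = 10_000

TIERS = [
    # (share of tickets, cost per ticket in dollars)
    (0.80, 0.03),  # GPT-5-mini drafts the routine tickets
    (0.15, 0.15),  # GPT-5 with low thinking for nuanced tickets
    (0.05, 0.50),  # o3-mini for the hard escalations
]

hybrid_daily = TICKETS_PER_DAY * sum(share * cost for share, cost in TIERS)
naive_daily = 5_000.0  # every ticket through GPT-5 with thinking

# hybrid_daily is about $715/day -- roughly 14% of the naive bill.
```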
Common mistakes
Mistake: defaulting to the smartest model
You think "I want the best, so o3." Now your simple chat takes 45 seconds and costs a quarter per message. Users churn.
Mistake: defaulting to the fastest model
You think "I want it cheap, so Haiku for everything." Now your agent confidently produces wrong answers on every multi-step task and your support load explodes.
Mistake: not measuring P95
Your P50 can look fine while your P95 is losing you customers. Reasoning models have ugly latency tails. Measure them.
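A few lines of stdlib Python show why the tail matters. The sample latencies below are synthetic: mostly fast responses with an occasional slow reasoning call.

```python
# P50 vs P95 over recorded per-request latencies (synthetic data).
import statistics

def p95(latencies_ms: list[float]) -> float:
    """95th percentile via statistics.quantiles (inclusive method)."""
    return statistics.quantiles(latencies_ms, n=100, method="inclusive")[94]

# 90 fast responses around 1s, 10 slow reasoning calls in the tens of seconds.
samples = [900, 1000, 1100] * 30 + [18_000, 22_000, 25_000, 30_000, 45_000] * 2

p50 = statistics.median(samples)  # about 1 second: looks healthy
tail = p95(samples)               # 25 seconds: what users actually remember
```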
Mistake: ignoring thinking-token cost
Your bill from o3 came in 5x what you expected. You forgot the thinking tokens count.
Mistake: not exposing the trade-off to the user
Some users want fast and rough; some want slow and right. A "deep think" toggle in the UI is one of the highest-leverage product decisions you can make.
How to actually choose, today
For each request:
- Is this a single, well-defined task that the model has seen a thousand times? Fast.
- Will a human be sitting and watching? Lean fast unless they explicitly asked for depth.
- Is the cost of being wrong real (production change, money, a contract)? Lean reasoning.
- Is this high-volume? Lean fast, reserve reasoning for escalation.
- Does the answer need a chain of multi-step deductions? Reasoning.
That's it. Five questions, ten seconds, and you've usually picked right.
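The five questions fit in one function. The inputs are per-request judgments you (or a router model) supply; the names and the fallback to the workhorse tier are illustrative choices, not a prescribed policy.

```python
# The five-question chooser as code. Inputs are judgment calls per request.

def pick_tier(well_defined: bool, human_watching: bool,
              wrong_is_costly: bool, high_volume: bool,
              needs_deduction: bool) -> str:
    # Reasoning conditions win outright: deduction chains or expensive errors.
    if needs_deduction or wrong_is_costly:
        return "reasoning"
    # Otherwise lean fast for routine, interactive, or bulk work.
    if well_defined or human_watching or high_volume:
        return "fast"
    # Unclear cases land in the middle of the taxonomy.
    return "workhorse"
```

A bulk classification job comes out "fast"; a production migration plan comes out "reasoning" — matching the checklist above.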
The summary
- Reasoning and fast models are different tools, not different quality tiers.
- Use fast for chat UX, classification, summarization, and high-volume work.
- Use reasoning for math, planning, novel problems, and high-stakes decisions.
- The best production systems use both, with a router deciding per request.
- Watch your latency P95 and your thinking-token cost — they will surprise you.
For a broader landscape view, see Comparing AI models 2026. For the routing patterns, see Multi-model AI workflows and routing.
NovaKit supports both modes for every major provider. Switch between fast and reasoning per message, see the cost in real time, and route automatically when you scale up.