cost-optimization · February 24, 2026 · 12 min read

The Complete AI API Pricing Guide 2026: All 13 Major Providers Compared

Every AI API price, updated for 2026. GPT-4o, Claude Opus 4, Gemini 2.5 Pro, Groq, DeepSeek, Mistral, and more — input/output tokens, free tiers, rate limits, and real-world cost per message. Bookmark this.

TL;DR

  • The gap between the cheapest and most expensive flagship model is now roughly 750x — Gemini 2.0 Flash at $0.10/M input tokens vs. Claude Opus 4 at $75/M output.
  • For 80% of production workloads, Claude Sonnet 4.6 or GPT-4o-mini hit the quality/price sweet spot.
  • Open-source models on Groq and Together cost pennies for speeds that were hyperscaler-only 18 months ago.
  • This guide lists every major provider with current prices (February 2026), their quirks, and what each is best for.

Bookmark this post — we keep it updated. The live, always-current pricing is at /price-tracker.

How to read this guide

Every provider has its own pricing model, but the dominant pattern is $ per 1 million tokens, split into input (what you send to the model) and output (what the model generates). Output tokens are usually 3-5x more expensive than input tokens.

A token ≈ 4 characters of English ≈ 0.75 words. A typical chat message is 150-300 tokens. A long document might be 10,000+ tokens.

All prices below are as of February 2026 and quoted in USD per 1 million tokens.
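The arithmetic behind every number in this guide is the same two-term formula. A minimal sketch (the prices passed in are illustrative, not live):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Cost in USD, given per-1M-token prices for input and output."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters of English per token."""
    return max(1, len(text) // 4)

# A typical chat round-trip on GPT-4o ($2.50/M in, $10.00/M out):
cost = estimate_cost(200, 400, 2.50, 10.00)
print(f"${cost:.4f}")  # → $0.0045
```

Note that the output side dominates: 400 output tokens cost eight times as much as the 200 input tokens here.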

The big three (frontier labs)

OpenAI

| Model | Input | Output | Context | Best for |
|---|---|---|---|---|
| GPT-5 | $5.00 | $15.00 | 256k | Hardest reasoning, long planning |
| GPT-4o | $2.50 | $10.00 | 128k | General-purpose workhorse |
| GPT-4o-mini | $0.15 | $0.60 | 128k | Bulk tasks, fast + cheap |
| o3 | $15.00 | $60.00 | 200k | Expensive deep reasoning |
| o3-mini | $1.10 | $4.40 | 200k | Good reasoning at a reasonable price |
| GPT-Image-1 | — | Per image | — | Image generation ($0.04/std, $0.17/HD) |

Notes: OpenAI offers prompt caching (50% discount on repeated input tokens) and a batch API (50% off, async). The free tier is limited; you need to add at least $5 in credit to unlock higher-tier rate limits.
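Those two discounts change the effective blended rate more than the headline price suggests. A sketch of the caching math, assuming the 50% discount above applies to the cached fraction of the prompt:

```python
def effective_input_cost(tokens: int, cached_fraction: float,
                         base_price: float, cache_discount: float = 0.5) -> float:
    """Blended input cost (USD) when part of the prompt hits the prompt cache."""
    cached = tokens * cached_fraction
    fresh = tokens - cached
    return (fresh * base_price + cached * base_price * (1 - cache_discount)) / 1_000_000

# 100k-token prompt on GPT-4o where 80% is a repeated system prompt:
print(effective_input_cost(100_000, 0.8, 2.50))  # → 0.15, vs $0.25 uncached
```

For a chatbot that resends a large system prompt every turn, that 40% saving compounds across every request.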

Best for: Anything where quality and ecosystem support matter. OpenAI has the most mature tooling — function calling, structured outputs, assistants API.

Anthropic

| Model | Input | Output | Context | Best for |
|---|---|---|---|---|
| Claude Opus 4 | $15.00 | $75.00 | 200k (1M w/ beta) | Best coding model, hardest tasks |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200k | Sweet spot: Opus-level quality at one-fifth the price |
| Claude Haiku 3.5 | $0.80 | $4.00 | 200k | Fast, cheap, still surprisingly good |

Notes: Anthropic offers prompt caching (up to 90% discount after the first cached request). The 1M context variant of Opus/Sonnet costs 2x the base rates. No image generation — Claude is text-and-vision-input only.

Best for: Coding, long-document reasoning, agentic workflows, instruction-following.

Google (Gemini)

| Model | Input | Output | Context | Best for |
|---|---|---|---|---|
| Gemini 2.5 Pro | $1.25 | $10.00 | 2M | Massive context, research, long video |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M | Cheap workhorse for huge context |
| Gemini 2.0 Flash-Lite | $0.05 | $0.20 | 1M | Dirt-cheap batch workloads |

Notes: Gemini's context window is its superpower — 2M tokens on 2.5 Pro is unmatched. Free tier on AI Studio is generous but uses your data for training. Paid/Vertex does not.

Best for: Massive documents, long codebases, video understanding, multimodal tasks.

The fast inference providers (open models, crazy speeds)

Groq

| Model | Input | Output | Speed | Context |
|---|---|---|---|---|
| Llama 3.3 70B | $0.59 | $0.79 | ~300 tok/s | 128k |
| Llama 3.1 8B | $0.05 | $0.08 | ~750 tok/s | 128k |
| DeepSeek R1 Distill | $0.75 | $0.99 | ~200 tok/s | 128k |
| Mixtral 8x7B | $0.24 | $0.24 | ~500 tok/s | 32k |

Notes: Groq runs open-source models on custom LPU hardware. Output tokens per second are typically 5-10x faster than GPU-based hosts. Rate limits are per-model and can be hit during peak hours.

Best for: Real-time chat where perceived speed matters, voice applications, streaming UX.

Together AI

| Model | Input | Output | Context |
|---|---|---|---|
| Llama 3.3 70B Turbo | $0.88 | $0.88 | 128k |
| Llama 3.1 405B Turbo | $3.50 | $3.50 | 128k |
| DeepSeek V3 | $1.25 | $1.25 | 64k |
| Qwen 2.5 72B | $1.20 | $1.20 | 32k |

Notes: Together offers the widest selection of open-source models, including specialized coding and math variants. Dedicated endpoints available for enterprise.

Best for: Open-source model experimentation, specialized models (coder variants, math variants), self-deployed workflows.

Fireworks AI

| Model | Input | Output | Notes |
|---|---|---|---|
| Llama 3.3 70B | $0.90 | $0.90 | Fast inference |
| Mixtral 8x22B | $1.20 | $1.20 | High-quality MoE |
| Fireworks-tuned custom | Per config | Per config | Fine-tuning platform |

Best for: Fine-tuning your own model variants, production OSS serving with SLAs.

The value providers (cheap but capable)

DeepSeek

| Model | Input | Output | Context |
|---|---|---|---|
| DeepSeek V3 | $0.27 (off-peak $0.07) | $1.10 (off-peak $0.27) | 64k |
| DeepSeek R1 | $0.55 (off-peak $0.14) | $2.19 (off-peak $0.55) | 64k |

Notes: Off-peak discount windows (roughly 16:30-00:30 UTC) cut prices 75%. Quality genuinely competes with GPT-4o on most benchmarks at a fraction of the price. Data sovereignty concerns for some enterprise users (servers in China).
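If you batch non-urgent work into the discount window, the routing logic is trivial but easy to get wrong: the window wraps past midnight UTC, so it is a logical OR rather than a simple range check. A sketch, using the window and prices quoted above (verify against DeepSeek's current schedule before relying on it):

```python
from datetime import time

OFF_PEAK_START = time(16, 30)  # 16:30 UTC
OFF_PEAK_END = time(0, 30)     # 00:30 UTC, the following day

def is_off_peak(t: time) -> bool:
    """The window wraps midnight, so: after start OR before end."""
    return t >= OFF_PEAK_START or t <= OFF_PEAK_END

def v3_input_price(t: time) -> float:
    # DeepSeek V3 input: $0.27/M standard, $0.07/M off-peak (per the table above)
    return 0.07 if is_off_peak(t) else 0.27

print(v3_input_price(time(18, 0)))  # → 0.07 (inside the window)
print(v3_input_price(time(9, 0)))   # → 0.27 (outside)
```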

Best for: Cost-sensitive production workloads, research, non-sensitive consumer products.

Mistral

| Model | Input | Output | Context |
|---|---|---|---|
| Mistral Large 2 | $2.00 | $6.00 | 128k |
| Mistral Medium | $0.40 | $2.00 | 32k |
| Mistral Small | $0.20 | $0.60 | 32k |
| Codestral | $0.30 | $0.90 | 32k |

Notes: European-based, strong GDPR posture. Codestral is specifically tuned for code. Does not train on API data.

Best for: EU companies with data residency requirements, code-focused workflows, no-China / no-US alternatives.

The specialized players

xAI (Grok)

| Model | Input | Output | Context |
|---|---|---|---|
| Grok 3 | $5.00 | $15.00 | 128k |
| Grok 3 Mini | $0.30 | $0.50 | 128k |

Best for: Real-time Twitter/X data access (the one thing it has that others don't).

Cohere

| Model | Input | Output | Context |
|---|---|---|---|
| Command R+ | $2.50 | $10.00 | 128k |
| Command R | $0.15 | $0.60 | 128k |
| Embed v3 | $0.10 | — | Embeddings |

Best for: Retrieval-heavy workloads, RAG pipelines, enterprise search.

Perplexity

| Model | Input | Output | Notes |
|---|---|---|---|
| Sonar Pro | $3.00 | $15.00 | Built-in web search |
| Sonar | $1.00 | $1.00 | Faster web-grounded |

Best for: Applications that need web search grounding baked in. Non-standard pricing includes per-request search fee.

OpenRouter

OpenRouter is not a model provider — it's a routing layer that gives you one API to access most of the models listed above. Prices are the provider's price plus a small margin (typically 0-5%). Great if you don't want to manage multiple API keys, though BYOK with direct provider keys is usually cheaper.

Real-world cost for common tasks

Here's what common tasks actually cost across providers. All assume one full round-trip (prompt + response).

Task: Short chat message (~200 in, ~400 out)

| Model | Cost | Feels like |
|---|---|---|
| GPT-4o-mini | $0.0003 | Essentially free |
| Claude Haiku 3.5 | $0.0018 | Essentially free |
| GPT-4o | $0.0045 | A few cents per dozen messages |
| Claude Sonnet 4.6 | $0.0066 | A few cents per dozen messages |
| GPT-5 | $0.0070 | Under 1¢ per message |
| Claude Opus 4 | $0.0330 | ~3¢ per message |

Task: Long document summary (~50k in, ~2k out)

| Model | Cost |
|---|---|
| Gemini 2.0 Flash | $0.006 |
| GPT-4o-mini | $0.009 |
| GPT-4o | $0.145 |
| Claude Sonnet 4.6 | $0.180 |
| Claude Opus 4 | $0.900 |

Task: Bulk classification (1,000 items × 500 in × 50 out)

| Model | Cost |
|---|---|
| Gemini 2.0 Flash-Lite | $0.035 |
| GPT-4o-mini | $0.105 |
| Claude Haiku 3.5 | $0.600 |
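These bulk figures are easy to sanity-check yourself: 1,000 items × 500 input tokens is 500k input tokens total, plus 50k output. For GPT-4o-mini at the prices listed earlier:

```python
items, in_per_item, out_per_item = 1_000, 500, 50
total_in = items * in_per_item    # 500,000 input tokens
total_out = items * out_per_item  # 50,000 output tokens

# GPT-4o-mini: $0.15/M input, $0.60/M output
cost = (total_in * 0.15 + total_out * 0.60) / 1_000_000
print(f"${cost:.3f}")  # → $0.105
```

The same scaling logic tells you when a batch API's 50% discount is worth the async wait: at 10x the volume, that discount on this job is worth about 53¢.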

Free tiers worth knowing about

  • Google AI Studio: Generous free tier. Catch: trains on your data. Use for experimentation, not production user data.
  • Groq: Free tier with rate limits. Great for learning, prototyping.
  • Mistral: "La Plateforme" has a free tier for experimentation.
  • Together: $5 free credit for new accounts, plus a free tier on smaller models.
  • OpenRouter: Many free-tier endpoints for open-source models.

What you should actually use

If we had to pick one default for each category:

  • Best all-rounder: Claude Sonnet 4.6. Good at everything; even heavy use costs about as much per week as a decent coffee.
  • Best cheap default: GPT-4o-mini. Near-free, great quality.
  • Best quality-regardless-of-price: Claude Opus 4 for code, GPT-5 for reasoning.
  • Best for huge documents: Gemini 2.5 Pro (2M context).
  • Best for speed: Groq + Llama 3.3 70B (~300 tokens/sec).
  • Best for privacy-conscious EU workloads: Mistral Large 2.
  • Best for "I just want it free": Gemini 2.0 Flash free tier, or DeepSeek off-peak.

The thing this table doesn't show

Token prices aren't the whole story. The real production cost includes:

  • Latency. Groq may cost more per token than Flash-Lite, but its speed may let you skip a caching layer.
  • Rate limits. A cheap model with low rate limits costs you engineer time to wait out the backoff.
  • Reliability. Some providers have 99.99% uptime; others have bad days.
  • Feature support. Function calling, structured outputs, vision — not every provider has everything.

That's why BYOK and multi-provider clients like NovaKit matter: you can pick the best model for each specific task without negotiating procurement for each one.

Keep this up to date

Prices change. OpenAI and Google cut prices roughly every 6 months. Anthropic prices are relatively stable. New providers emerge monthly.

For live, always-current pricing see our price tracker. For estimating your monthly spend under different usage patterns, use the cost calculator.


Stop guessing at AI costs. NovaKit tracks every message's token count and dollar cost in real time — for every provider, no matter which model you pick.
