LLM Inference Providers — Speed/Cost/Limits Comparison

#researchlist

LLM inference providers (2026) compared on $/M tokens, RPM/TPM limits, and real operating experience running CollabLists agents against each one. Includes web-search APIs too.

## What we concluded (2026-05-20)

After running a multi-tool agent loop on Cerebras-primary and hitting token_quota_exceeded 429s on essentially every prompt, the operating recipe we landed on is:

Anthropic Claude Sonnet 4.5 — PRIMARY. With prompt caching (cache_control: ephemeral) on system + tools, cached input is billed at 10% rate after the first call. High TPM, native web_search, supports tool loops with long context. Slower per call (~2-5s) than Cerebras but reliably completes.
Cerebras gpt-oss-120b — SECONDARY. Use it for SHORT specialty tasks (validation, summarization, classification) where 3000 tok/s decode is genuinely useful. NOT a generalist primary — TPM constraints make long agentic loops fail.
Fireworks AI — TERTIARY. OpenAI-compatible, 356ms TTFT, plenty of TPM headroom on paid tier. Slot 3 of the fallback chain (after Anthropic & Cerebras, before Groq).
Groq — FALLBACK. Free tier 6K TPM is too low for production but useful when the first three all fail.
Web search: Brave (paid for general queries) + Anthropic-native (free, on Claude path). Brave wins 2026 benchmarks on agent score + latency + price.

The big lesson: Cerebras is optimized for fast SHORT inference (their pitch) and is genuinely the best at that. It is NOT designed for long-context multi-tool agent loops — per SemiAnalysis: *"Cerebras architecture makes it hard to economically serve... long context lengths representative of today's agentic workloads."* Don't use Cerebras as a generalist agent backend.

## Related lists

Cool AI Labs & Companies — /list/13c555d5-7ec0-47f9-812b-eabe79c2db0e (broader scan of AI companies vs this list's specifically-inference focus)

Last verified: 2026-05-20.

PublicOwned byJacob Cole

Table view

Operating decisions captured live in items below. Provider routing order in production code is set via PROVIDER_PRIORITY env var (default anthropic,cerebras,fireworks,deepseek,groq). To pin a specific provider per request for testing: append ?force_provider=<name> to the agent-chat URL.

For TPM budget: each tool-heavy agent loop with the full 18-tool surface costs ~25-30K input tokens per upstream call × N iterations. At Cerebras 250K TPM, that's 3-5 prompts/min max. At Anthropic 80K TPM but cached input at 10% rate, effective throughput is ~10x higher.

You're viewing this list as a guest. Sign in to comment or star items.Sign in

Anthropic Claude Sonnet 4.5 — frontier model, native web_search, prompt caching

www.anthropic.com

Cerebras gpt-oss-120b — fastest decode in industry (~3000 tok/s)

www.cerebras.ai

Groq Llama 3.3 70B — LPU hardware, low TTFT champion

console.groq.com

Google Gemini 2.5 Flash — cheapest credible frontier model

ai.google.dev

Google Gemini 2.5 Pro — best Google model, 200K context

cloud.google.com

DeepSeek V3.2-Exp — Chinese frontier-tier, native tools

api-docs.deepseek.com

Llama 3.3 70B via Together AI — cheapest Llama path

www.together.ai

Fireworks AI — balanced, 4x faster structured outputs vs vLLM

fireworks.ai

OpenRouter — aggregator across all providers

openrouter.ai

Brave Search API — winner of 2026 agent-API benchmark

api-dashboard.search.brave.com

Tavily Search API — LLM-purpose-built but expensive at scale

tavily.com

Exa AI — semantic search, different paradigm

exa.ai

Anthropic native web_search — server-side, no external dep

platform.claude.com

Serper — raw Google SERPs, cheapest at scale

serper.dev

Cerebras Prompt Caching — exists but useless for TPM constraints

inference-docs.cerebras.ai