LLM Inference Providers — Speed/Cost/Limits Comparison

#researchlist

LLM inference providers (2026) compared on $/M tokens, RPM/TPM limits, and real operating experience running CollabLists agents against each one. Includes web-search APIs too.

## What we concluded (2026-05-20)

After running a multi-tool agent loop on Cerebras-primary and hitting token_quota_exceeded 429s on essentially every prompt, the operating recipe we landed on is:

Anthropic Claude Sonnet 4.5 — PRIMARY. With prompt caching (cache_control: ephemeral) on system + tools, cached input is billed at 10% rate after the first call. High TPM, native web_search, supports tool loops with long context. Slower per call (~2-5s) than Cerebras but reliably completes.
Cerebras gpt-oss-120b — SECONDARY. Use it for SHORT specialty tasks (validation, summarization, classification) where 3000 tok/s decode is genuinely useful. NOT a generalist primary — TPM constraints make long agentic loops fail.
Fireworks AI — TERTIARY. OpenAI-compatible, 356ms TTFT, plenty of TPM headroom on paid tier. Slot 3 of the fallback chain (after Anthropic & Cerebras, before Groq).
Groq — FALLBACK. Free tier 6K TPM is too low for production but useful when the first three all fail.
Web search: Brave (paid for general queries) + Anthropic-native (free, on Claude path). Brave wins 2026 benchmarks on agent score + latency + price.

The big lesson: Cerebras is optimized for fast SHORT inference (their pitch) and is genuinely the best at that. It is NOT designed for long-context multi-tool agent loops — per SemiAnalysis: *"Cerebras architecture makes it hard to economically serve... long context lengths representative of today's agentic workloads."* Don't use Cerebras as a generalist agent backend.

## Related lists

Cool AI Labs & Companies — /list/13c555d5-7ec0-47f9-812b-eabe79c2db0e (broader scan of AI companies vs this list's specifically-inference focus)

Last verified: 2026-05-20.

Public

Operating decisions captured live in items below. Provider routing order in production code is set via PROVIDER_PRIORITY env var (default anthropic,cerebras,fireworks,deepseek,groq). To pin a specific provider per request for testing: append ?force_provider=<name> to the agent-chat URL.

For TPM budget: each tool-heavy agent loop with the full 18-tool surface costs ~25-30K input tokens per upstream call × N iterations. At Cerebras 250K TPM, that's 3-5 prompts/min max. At Anthropic 80K TPM but cached input at 10% rate, effective throughput is ~10x higher.

#	Content	Link	Section	Status	Visibility	Source
1	Anthropic Claude Sonnet 4.5 — frontier model, native web_search, prompt caching Pricing: $3.00/M in, $15.00/M out. Cached input: $0.30/M (10x cheaper). Native web_search tool: $10/1k searches. CollabLists 2026-05-20: flipped to Anthropic-primary after Cerebras TPM constraint became intractable for tool-heavy loops. ~2-5s latency vs Cerebras 0.7s but reliably completes.	www.anthropic.com	—	—	inherit	—
2	Cerebras gpt-oss-120b — fastest decode in industry (~3000 tok/s) Pricing: $0.05/M in, $0.10/M out. Free tier 30 RPM / 60K TPM. PayGo documented 1000 RPM / 1M TPM. CollabLists 2026-05-20 experience: paid account console showed 250 RPM / 250K TPM (below docs). Hit token_quota_exceeded 100% on tool-heavy agent loops. New key csk-4rt4... appears enterprise-tier (100/100 parallel passed). Prompt caching cached tokens STILL count against TPM — saves latency only, not cost. SemiAnalysis: hard to economically serve long context lengths representative of today agentic workloads. Best for: short fast inference, validation/QA. Worst for: long multi-tool loops.	www.cerebras.ai

Read-only v0. To edit an item, click its content to open the focused page. Column resize, sort, inline edit, and the per-list item_schema editor are tracked under ticket 1z3.