#researchlist
LLM inference providers (2026) compared on $/M tokens, RPM/TPM limits, and real operating experience running CollabLists agents against each one. Includes web-search APIs too.
## What we concluded (2026-05-20)
After running a multi-tool agent loop on Cerebras-primary and hitting token_quota_exceeded 429s on essentially every prompt, the operating recipe we landed on is:
The big lesson: Cerebras is optimized for fast SHORT inference (their pitch) and is genuinely the best at that. It is NOT designed for long-context multi-tool agent loops — per SemiAnalysis: *"Cerebras architecture makes it hard to economically serve... long context lengths representative of today's agentic workloads."* Don't use Cerebras as a generalist agent backend.
## Related lists
Last verified: 2026-05-20.
Operating decisions captured live in items below. Provider routing order in production code is set via PROVIDER_PRIORITY env var (default anthropic,cerebras,fireworks,deepseek,groq). To pin a specific provider per request for testing: append ?force_provider=<name> to the agent-chat URL.
For TPM budget: each tool-heavy agent loop with the full 18-tool surface costs ~25-30K input tokens per upstream call × N iterations. At Cerebras 250K TPM, that's 3-5 prompts/min max. At Anthropic 80K TPM but cached input at 10% rate, effective throughput is ~10x higher.
| # | Content | Link | Section | Status | Visibility | Source |
|---|---|---|---|---|---|---|
| 1 | Anthropic Claude Sonnet 4.5 — frontier model, native web_search, prompt caching Pricing: $3.00/M in, $15.00/M out. Cached input: $0.30/M (10x cheaper). Native web_search tool: $10/1k searches. CollabLists 2026-05-20: flipped to Anthropic-primary after Cerebras TPM constraint became intractable for tool-heavy loops. ~2-5s latency vs Cerebras 0.7s but reliably completes. | www.anthropic.com | — | — | inherit | — |
| 2 | Cerebras gpt-oss-120b — fastest decode in industry (~3000 tok/s) Pricing: $0.05/M in, $0.10/M out. Free tier 30 RPM / 60K TPM. PayGo documented 1000 RPM / 1M TPM. CollabLists 2026-05-20 experience: paid account console showed 250 RPM / 250K TPM (below docs). Hit token_quota_exceeded 100% on tool-heavy agent loops. New key csk-4rt4... appears enterprise-tier (100/100 parallel passed). Prompt caching cached tokens STILL count against TPM — saves latency only, not cost. SemiAnalysis: hard to economically serve long context lengths representative of today agentic workloads. Best for: short fast inference, validation/QA. Worst for: long multi-tool loops. | www.cerebras.ai |
Read-only v0. To edit an item, click its content to open the focused page. Column resize, sort, inline edit, and the per-list item_schema editor are tracked under ticket 1z3.
| — |
| — |
| inherit |
| — |
| 3 | Groq Llama 3.3 70B — LPU hardware, low TTFT champion Free tier: 30 RPM / 6K TPM. Dev tier: 1000 RPM, 25% discount on tokens. Enterprise: custom. CollabLists 2026-05-20: signed up free tier; tried to upgrade to Dev tier but Groq said no availability. Key works on free tier as 3rd-tier fallback only — 6K TPM too low for primary. Speed: 476 tok/s on gpt-oss-120b, sub-300ms TTFT. | console.groq.com | — | — | inherit | — |
| 4 | Google Gemini 2.5 Flash — cheapest credible frontier model Pricing: $0.30/M in, $2.50/M out. RPM/TPM very high on Vertex paid. Native tool use + context caching. AI Studio free tier: 10 RPM / 250K TPM (Flash) — same as Cerebras PayGo. On Vertex AI: pay-as-you-go, billed against GCP credits → effectively free for early-stage projects with credits. Not yet tested by CollabLists. | ai.google.dev | — | — | inherit | — |
| 5 | Google Gemini 2.5 Pro — best Google model, 200K context Pricing: $1.25/M in (≤200K context), $10.00/M out. 1M-token context window available. AI Studio free: 5 RPM / 50 RPD only — useless for production. Vertex AI paid: high limits, GCP-credits eligible. | cloud.google.com | — | — | inherit | — |
| 6 | DeepSeek V3.2-Exp — Chinese frontier-tier, native tools Pricing: $0.028/M in, $0.42/M out. Best price-to-quality ratio on the market as of mid-2026. Native tool calling in OpenAI-compatible API. CollabLists has key wired in route.ts but never had a real key — unset in Secret Manager. Worth provisioning as alternative-to-Anthropic backup. | api-docs.deepseek.com | — | — | inherit | — |
| 7 | Llama 3.3 70B via Together AI — cheapest Llama path Pricing: $0.88/M (Together). Same model on Vertex AI: $1.36/M (55% premium for GCP managed). Together is the cheapest serverless host for Llama; Vertex is the GCP-credits path. | www.together.ai | — | — | inherit | — |
| 8 | Fireworks AI — balanced, 4x faster structured outputs vs vLLM Pricing varies by model. Speed: 83-90 tok/s (Llama 70B), ~7.9s end-to-end. Best for: structured-output workloads where JSON-mode latency matters. CollabLists 2026-05-20: key fw_7zs9... saved to Secret Manager. Direct test on gpt-oss-120b passed 10/10 parallel + 356ms wall on tiny prompts. Not yet wired into providerChain — would be the natural 4th-in-chain if we want a Cerebras-tier-fast alternative when Cerebras+Anthropic+Groq all fail. $1 of credit ≈ ~3M tokens of gpt-oss-120b inference. | fireworks.ai | — | — | inherit | — |
| 9 | OpenRouter — aggregator across all providers One key, automatic failover, transparent routing. Markup: small spread over provider direct pricing. Useful when you want fallback without managing N keys. Not currently used by CollabLists but the providerChain pattern in route.ts essentially recreates this in-house. | openrouter.ai | — | — | inherit | — |
| 10 | Brave Search API — winner of 2026 agent-API benchmark Pricing: $5-9 per 1k queries. Free tier: 2K/mo. Latency: 669ms (fastest of search APIs). Agent Score 14.89 (highest, per aimultiple.com 8-API benchmark). Independent index (not Google-derived). CollabLists 2026-05-19: chose this after head-to-head vs Tavily — same shape, faster, cheaper, larger free tier. In prod now serving web_search tool on Cerebras/DeepSeek/Groq paths. | api-dashboard.search.brave.com | — | — | inherit | — |
| 11 | Tavily Search API — LLM-purpose-built but expensive at scale Pricing: $0.008/credit, $30/mo for ~3.7k queries. Free: 1K/mo. Returns pre-filtered/reranked results optimized for LLM context. Agent Score 13.88. CollabLists 2026-05-19: chose NOT to use after benchmark — Brave was faster, cheaper, larger free tier. | tavily.com | — | — | inherit | — |
| 12 | Exa AI — semantic search, different paradigm Pricing: $0.001/result ($0.01/search at 10 results). Uses embeddings instead of keyword matching. Best for: "find pages similar to this URL" or essay-style queries. Worst for: keyword queries where Brave wins. | exa.ai | — | — | inherit | — |
| 13 | Anthropic native web_search — server-side, no external dep Tool type: web_search_20260209. Pricing: $10/1k searches + token cost. GA on Claude Sonnet 4.x / Haiku 4.x. NO API key needed beyond Anthropic key. Returns citations automatically. 2026 upgrade: Dynamic Filtering uses code_execution to pre-filter results before context. CollabLists 2026-05-20: wired in route.ts via Anthropic path; verified working — found the protein-powder-lead CR article + spontaneously used code_execution. | platform.claude.com | — | — | inherit | — |
| 14 | Serper — raw Google SERPs, cheapest at scale Pricing: ~$0.30/1k (vs Tavily $8/1k at 1M queries). Returns raw Google SERP HTML. Best for: high-volume scraping; you do reranking yourself. | serper.dev | — | — | inherit | — |
| 15 | Cerebras Prompt Caching — exists but useless for TPM constraints Automatic on gpt-oss-120b. Free (no markup). TTL 5min-1hr. Cached blocks reused via 128-token segment matching. CRITICAL CAVEAT: cached tokens STILL COUNT against TPM rate limit per their docs. Saves latency only, not cost or quota. This is why we flipped to Anthropic-primary on 2026-05-20. | inference-docs.cerebras.ai | — | — | inherit | — |