#researchlist
LLM inference providers (2026) compared on $/M tokens, RPM/TPM limits, and real operating experience running CollabLists agents against each one. Includes web-search APIs too.
## What we concluded (2026-05-20)
After running a multi-tool agent loop on Cerebras-primary and hitting token_quota_exceeded 429s on essentially every prompt, the operating recipe we landed on is:
The big lesson: Cerebras is optimized for fast SHORT inference (their pitch) and is genuinely the best at that. It is NOT designed for long-context multi-tool agent loops — per SemiAnalysis: *"Cerebras architecture makes it hard to economically serve... long context lengths representative of today's agentic workloads."* Don't use Cerebras as a generalist agent backend.
## Related lists
Last verified: 2026-05-20.
Operating decisions are captured live in the items below. In production code, provider routing order is set via the PROVIDER_PRIORITY env var (default: anthropic,cerebras,fireworks,deepseek,groq). To pin a specific provider for a single request while testing, append ?force_provider=<name> to the agent-chat URL.
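A minimal sketch of how that routing could be resolved, for illustration only. The function name `provider_order`, the example URL, and the choice to treat a forced provider as the only candidate (no fallback) are assumptions, not the production code. Only the env var name, its default, and the query parameter come from the note above.

```python
import os
from urllib.parse import urlparse, parse_qs

# Default order as stated in the note above.
DEFAULT_PRIORITY = "anthropic,cerebras,fireworks,deepseek,groq"


def provider_order(url, env=None):
    """Return the list of providers to try, in order, for one request.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    raw = env.get("PROVIDER_PRIORITY", DEFAULT_PRIORITY)
    # Split on commas and tolerate stray whitespace or empty entries.
    order = [name.strip() for name in raw.split(",") if name.strip()]

    # ?force_provider=<name> pins a single provider for this request.
    # Assumption: pinning disables fallback to the other providers.
    forced = parse_qs(urlparse(url).query).get("force_provider")
    if forced:
        return [forced[0]]
    return order


if __name__ == "__main__":
    print(provider_order("https://example.invalid/agent-chat"))
    print(provider_order("https://example.invalid/agent-chat?force_provider=groq"))
```

Returning a one-element list for a forced provider keeps the caller's retry loop unchanged. Whether a pinned request should still fall back on failure is a design decision this sketch does not settle.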
TPM budget: with the full 18-tool surface, each tool-heavy agent loop costs ~25-30K input tokens per upstream call, multiplied by N iterations per prompt. At Cerebras's 250K TPM, that allows 3-5 prompts/min at most. Anthropic's limit is lower at 80K TPM, but with cached input at a 10% rate, effective throughput is ~10x higher.
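The arithmetic behind those figures can be sketched as below. Two things here are assumptions rather than facts from the note: N is taken as 2-3 iterations per prompt (the note leaves N open), and "cached input at 10% rate" is read as cached tokens counting at one-tenth weight against the budget. The function and parameter names are hypothetical.

```python
def prompts_per_minute(tpm_limit, tokens_per_call, iterations,
                       cached_fraction=0.0, cached_weight=1.0):
    """Upper bound on prompts per minute under a tokens-per-minute limit.

    cached_fraction: share of each call's input served from cache (0..1).
    cached_weight:   how heavily a cached token counts against the budget
                     (assumption: 0.1 models the "10% rate" in the note).
    """
    counted_per_call = tokens_per_call * (
        (1.0 - cached_fraction) + cached_fraction * cached_weight
    )
    return tpm_limit / (counted_per_call * iterations)


if __name__ == "__main__":
    # Cerebras at 250K TPM, no caching, N assumed to be 2 or 3.
    print(prompts_per_minute(250_000, 25_000, 2))
    print(prompts_per_minute(250_000, 30_000, 3))

    # Anthropic at 80K TPM: uncached vs. fully cached at 10% weight.
    uncached = prompts_per_minute(80_000, 25_000, 2)
    cached = prompts_per_minute(80_000, 25_000, 2,
                                cached_fraction=1.0, cached_weight=0.1)
    print(uncached, cached, cached / uncached)
```

Under these assumptions the Cerebras bound is 5 prompts/min at 25K tokens and N=2, and about 2.8 at 30K tokens and N=3, which is where the 3-5 range comes from. The ~10x figure for Anthropic holds only when nearly all input is served from cache. A lower cache hit rate shrinks the multiplier accordingly.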