Cost and Latency Tradeoffs
The Model Selection Triangle
Every production LLM decision involves three variables in tension:
- Quality: How reliably correct, coherent, and safe is the output?
- Cost: How much does each inference call cost?
- Latency: How fast does the response arrive?
No model maximizes all three. Frontier models (GPT-4o, Claude Opus, Gemini Ultra) lead on quality. Fast/cheap models (Claude Haiku, GPT-4o-mini, Gemini Flash) lead on cost and speed. The engineering problem is choosing the right tier for each task.
How LLM Pricing Works
Almost all providers charge per token, separately for input and output:
cost = (input_tokens × input_price) + (output_tokens × output_price)
Output tokens are typically 3–5× more expensive than input tokens. Each output token is generated sequentially and needs its own forward pass through the model, whereas all input tokens are processed in parallel during a single, cheaper prefill step.
Practical implications:
- Long system prompt + short response is cheap
- Short prompt + long generated report is expensive
- Output cost dominates for most generation tasks
Approximate prices as of mid-2025 (verify with provider for current rates):
| Tier | Example models | Input / M tokens | Output / M tokens |
|---|---|---|---|
| Frontier | GPT-4o, Claude Sonnet | $3–$15 | $15–$75 |
| Fast/cheap | GPT-4o-mini, Claude Haiku | $0.10–$0.30 | $0.40–$1.50 |
| Open (self-hosted) | Llama 3 70B | infra cost only | infra cost only |
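
The per-token formula above is easy to turn into a small helper. The sketch below is illustrative only: the function name `request_cost` is hypothetical, and the prices ($3 input / $15 output per million tokens) are simply taken from the low end of the frontier row, not from any specific provider's rate card.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the dollar cost of one request, given prices per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Assumed, illustrative prices: $3 input / $15 output per million tokens.
INPUT_PRICE, OUTPUT_PRICE = 3.0, 15.0

# Long system prompt + short response.
long_prompt_short_reply = request_cost(4_000, 200, INPUT_PRICE, OUTPUT_PRICE)
# Short prompt + long generated report.
short_prompt_long_report = request_cost(200, 4_000, INPUT_PRICE, OUTPUT_PRICE)

print(f"long prompt, short reply:  ${long_prompt_short_reply:.4f}")
print(f"short prompt, long report: ${short_prompt_long_report:.4f}")
```

With the same total token count, the output-heavy request costs roughly four times as much under these assumed prices, which is the asymmetry the practical implications describe.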
Latency Dimensions
Latency has two components that feel different to users:
Time to first token (TTFT): How long from request to the first character appearing. This is the "responsiveness" feel in streaming UIs. Dominated by network latency + model prefill time.
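Throughput: How fast tokens stream once they start, usually measured in tokens per second. Total streaming time is proportional to output length, so throughput matters more as responses get longer. A high-throughput model with worse TTFT can feel better on long outputs; a lower-throughput model with excellent TTFT can feel better on short interactive replies.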
For interactive chat: optimize TTFT first, throughput second. For batch processing: optimize throughput (tokens/second per dollar).
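Both components can be measured from any streaming response by timestamping tokens as they arrive. The following is a minimal sketch, not a provider-specific client: `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for a real streaming API by sleeping to simulate prefill and decode delays.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Tuple[float, float]:
    """Consume a token stream and return (ttft_seconds, tokens_per_second).

    Throughput covers the streaming phase only (first token to last token),
    so it is not distorted by the time to first token.
    """
    start = clock()
    first = last = None
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    streaming_time = last - first
    # count tokens are separated by count - 1 intervals; throughput is
    # undefined for a single-token stream and reported here as 0.0.
    throughput = (count - 1) / streaming_time if streaming_time > 0 else 0.0
    return first - start, throughput


def fake_stream(n_tokens: int, ttft: float, gap: float) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(ttft)            # simulated network + prefill delay
    for i in range(n_tokens):
        if i > 0:
            time.sleep(gap)     # simulated decode time between tokens
        yield "tok"


ttft, tps = measure_stream(fake_stream(n_tokens=20, ttft=0.05, gap=0.005))
print(f"TTFT: {ttft * 1000:.0f} ms, throughput: {tps:.0f} tokens/s")
```

The `clock` parameter exists only so the measurement can be driven by a deterministic clock in tests; in production the default is sufficient.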
Prompt Caching
Most major providers offer prompt caching: if the beginning of your prompt is identical across multiple requests, the preprocessed prefix is cached, and you pay a fraction of the normal price for those tokens.
This matters most for:
- Long system prompts that don't change between users
- RAG patterns where retrieved documents are constant across requests
- Multi-turn conversations where early context is static
Effective caching can reduce costs by 50–90% on the cached portion.
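To see what a discount on the cached portion means for a whole request, it helps to split input tokens into a static prefix and a per-request remainder. This is a rough sketch under stated assumptions: the function name, the $3 per million input price, and the 10% cached-token price are all illustrative, and any extra charge a provider may apply when the cache is first written is ignored.

```python
def input_cost_with_cache(prefix_tokens: int, dynamic_tokens: int,
                          input_price_per_m: float,
                          cached_price_fraction: float) -> float:
    """Input cost in dollars for one request whose static prefix is a cache hit.

    cached_price_fraction is the share of the normal input price paid for
    cached tokens (0.1 means cached tokens cost 10% of the normal price).
    """
    cached = prefix_tokens * input_price_per_m * cached_price_fraction
    uncached = dynamic_tokens * input_price_per_m
    return (cached + uncached) / 1_000_000


# Assumed workload: 8,000-token static system prompt, 500-token user message.
PREFIX, DYNAMIC, PRICE = 8_000, 500, 3.0

without_cache = input_cost_with_cache(PREFIX, DYNAMIC, PRICE, 1.0)
with_cache = input_cost_with_cache(PREFIX, DYNAMIC, PRICE, 0.1)
saving = 1 - with_cache / without_cache

print(f"input cost without cache: ${without_cache:.4f}")
print(f"input cost with cache:    ${with_cache:.4f}")
print(f"input cost saved:         {saving:.0%}")
```

Because caching matches on the beginning of the prompt, the static content has to come first and the per-request content last; the saving on total input cost is always somewhat smaller than the discount on the cached portion itself.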
Model Routing Patterns
Production systems often route to different models based on request type:
Complexity-based routing: simple factual queries go to the fast/cheap tier; complex reasoning tasks escalate to frontier models.
Cost-ceiling routing: if a task's value is bounded (a classification that informs a low-value decision), cap the inference cost at a small fraction of that value.
Cascade pattern: run a cheap model first; if confidence is low, escalate to a frontier model. Works well for classification; poorly for open-ended generation where confidence is hard to measure.
Task-specific fine-tuned model: a smaller model fine-tuned for your specific task often matches frontier quality at a fraction of the cost. Worth investing in once a task is well-defined and volume is significant.
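Of the patterns above, the cascade is the most mechanical to implement for classification. The sketch below assumes each model call can be wrapped in a function returning a label and a confidence score in [0, 1]; `cheap_model`, `frontier_model`, and the 0.8 threshold are hypothetical placeholders, not real provider APIs or recommended values.

```python
from typing import Callable, Tuple

# A classifier takes text and returns (label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def cascade(text: str, cheap: Classifier, frontier: Classifier,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Return (label, tier_used), escalating only when the cheap model is unsure."""
    label, confidence = cheap(text)
    if confidence >= threshold:
        return label, "cheap"
    label, _ = frontier(text)
    return label, "frontier"


# Hypothetical stand-ins for real model calls.
def cheap_model(text: str) -> Tuple[str, float]:
    if "refund" in text.lower():
        return "billing", 0.95
    return "other", 0.40


def frontier_model(text: str) -> Tuple[str, float]:
    label = "technical" if "error" in text.lower() else "other"
    return label, 0.99


for message in ["I want a refund", "I get an error on login"]:
    print(message, "->", cascade(message, cheap_model, frontier_model))
```

The threshold is the main tuning knob: raising it sends more traffic to the frontier tier, so it is worth calibrating against an eval set rather than picking a value by intuition.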
Practical Cost Estimation
Before deploying at scale:
- Profile your actual input/output token distribution across a representative request sample
- Apply the provider's pricing
- Multiply by expected daily volume
- Estimate prompt caching impact
- Compare against value-per-inference to validate the economics
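The steps above can be sketched as a single back-of-the-envelope calculation. Everything in this example is an assumption for illustration: the sample token counts, the $0.15 / $0.60 per million prices (chosen from within the fast/cheap range), the daily volume, and the caching parameters.

```python
from statistics import mean
from typing import Sequence, Tuple


def estimate_costs(samples: Sequence[Tuple[int, int]], daily_requests: int,
                   input_price_per_m: float, output_price_per_m: float,
                   cached_share: float = 0.0,
                   cached_price_fraction: float = 1.0) -> Tuple[float, float]:
    """Return (cost_per_request, cost_per_day) in dollars.

    samples: measured (input_tokens, output_tokens) pairs from real requests.
    cached_share: fraction of input tokens expected to be cache hits.
    cached_price_fraction: share of the normal input price paid for cached tokens.
    """
    avg_in = mean(s[0] for s in samples)
    avg_out = mean(s[1] for s in samples)
    # Blend the normal and cached input prices by the expected cache-hit share.
    effective_input_price = input_price_per_m * (
        (1 - cached_share) + cached_share * cached_price_fraction)
    per_request = (avg_in * effective_input_price
                   + avg_out * output_price_per_m) / 1_000_000
    return per_request, per_request * daily_requests


# Assumed measurements from a representative request sample.
samples = [(1_200, 150), (3_400, 300), (800, 90), (5_200, 420)]

per_request, per_day = estimate_costs(samples, 1_000_000, 0.15, 0.60)
_, per_day_cached = estimate_costs(samples, 1_000_000, 0.15, 0.60,
                                   cached_share=0.5, cached_price_fraction=0.1)

print(f"per request:        ${per_request:.6f}")
print(f"per day:            ${per_day:,.2f}")
print(f"per day with cache: ${per_day_cached:,.2f}")
```

The per-request figure is the number to compare against value-per-inference. A mean is used here for simplicity; if the token distribution has a long tail, looking at high percentiles as well gives a safer budget.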
Don't guess token counts; measure them. Input token counts are often higher than engineers expect, especially with JSON schemas, long system prompts, and conversation history.
For internal tools and low-volume features, cost rarely matters: a frontier model at $0.01/request is fine. For consumer-facing features at millions of requests per day, model tier selection is a make-or-break product decision. Build your eval framework to compare fast/cheap models against frontier models early, before you're committed to an architecture.
Further Reading
- Evaluation and Benchmarks: quality measurement that informs tier selection
- Tools Landscape: provider comparison including pricing and latency characteristics