Cost and Latency Tradeoffs

PM: Read in full — 20 min

The Model Selection Triangle

Every production LLM decision involves three variables in tension:

Quality — How reliably correct, coherent, and safe is the output?
Cost — How much does each inference call cost?
Latency — How fast does the response arrive?

No model maximizes all three. Frontier models (GPT, Claude Opus, Gemini Pro) lead on quality. Fast/cheap models (Claude Haiku, GPT mini, Gemini Flash) lead on cost and speed. Open models (Llama, DeepSeek) offer competitive quality with infrastructure-only cost. The engineering problem is choosing the right tier for each task.

How LLM Pricing Works

Almost all providers charge per token, separately for input and output:

cost = (input_tokens × input_price) + (output_tokens × output_price)

Output tokens are typically 3–5× more expensive than input tokens. Output is generated one token at a time, with a forward pass through the model for each new token; input tokens are processed together in a single prefill step, which costs less per token.

Practical implications:

Long system prompt + short response is cheap
Short prompt + long generated report is expensive
Output cost dominates for most generation tasks
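To make the formula and the implications above concrete, here is a minimal sketch in Python. The function name `request_cost` and the two prices are illustrative assumptions (chosen from within the frontier range in the table below), not any provider's actual rates.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the cost in USD of one request, given prices per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Assumed prices in USD per million tokens; replace with your provider's rates.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

# Long system prompt, short answer: 4,000 tokens in, 100 tokens out.
long_prompt_short_reply = request_cost(4_000, 100, INPUT_PRICE, OUTPUT_PRICE)

# Short prompt, long generated report: 200 tokens in, 3,000 tokens out.
short_prompt_long_report = request_cost(200, 3_000, INPUT_PRICE, OUTPUT_PRICE)

print(f"long prompt, short reply:  ${long_prompt_short_reply:.4f}")   # $0.0135
print(f"short prompt, long report: ${short_prompt_long_report:.4f}")  # $0.0456
```

Under these assumed prices, the second request costs more than three times the first even though it handles fewer tokens in total, because most of its tokens are output.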

Frontier model pricing has fallen roughly 10× over two years, from around $30 per million tokens to $3–$5 for equivalent capability. The fast/cheap tier now offers quality that matches the frontier models of two years ago. Architectures designed around cost constraints in previous years are worth revisiting: what required the cheap tier then may now be viable at frontier quality for the same budget.

Approximate prices (verify with the provider before committing to an architecture, as rates change frequently):

Tier	Example families	Input / M tokens	Output / M tokens
Frontier	GPT (OpenAI), Claude (Anthropic), Gemini Pro (Google)	$3–$15	$15–$75
Fast/cheap	GPT mini, Claude Haiku, Gemini Flash	$0.10–$0.50	$0.40–$2.00
Open (self-hosted)	Llama (Meta), DeepSeek, Mistral	infra cost only	infra cost only

Latency Dimensions

Latency has two components that feel different to users:

Time to first token (TTFT): How long from request to the first character appearing. This is the "responsiveness" feel in streaming UIs. Dominated by network latency + model prefill time.

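Throughput: How fast tokens stream once they start. The time to finish a response after the first token scales with output length, so throughput matters more the longer the output. A fast-streaming model with worse TTFT can feel better on long outputs; a slower-streaming model with excellent TTFT can feel better on short interactive replies.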

For interactive chat: optimize TTFT first, throughput second. For batch processing: optimize throughput (tokens/second per dollar).
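The interplay between the two components can be approximated as total time = TTFT + output tokens / throughput. The sketch below applies that simplification to two hypothetical models; the function name and every number are made up for illustration, and real latency also varies with load and prompt length.

```python
def total_latency_s(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Approximate seconds until the last token arrives."""
    return ttft_s + output_tokens / tokens_per_s


# Hypothetical latency profiles, not measurements of any real model.
quick_start = {"ttft_s": 0.3, "tokens_per_s": 40.0}    # excellent TTFT, slow stream
fast_stream = {"ttft_s": 1.2, "tokens_per_s": 120.0}   # worse TTFT, fast stream

for output_tokens in (20, 1000):
    a = total_latency_s(quick_start["ttft_s"], output_tokens, quick_start["tokens_per_s"])
    b = total_latency_s(fast_stream["ttft_s"], output_tokens, fast_stream["tokens_per_s"])
    winner = "quick_start" if a < b else "fast_stream"
    print(f"{output_tokens:>5} tokens: quick_start={a:.2f}s fast_stream={b:.2f}s -> {winner}")
```

With these assumed numbers, the low-TTFT model finishes a 20-token reply first, while the high-throughput model finishes a 1,000-token response far sooner.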

The abandonment cliff: web UX research consistently finds that response times above 3–5 seconds cause significant user abandonment. Google has reported that 53% of mobile visits are abandoned if a page takes more than 3 seconds to load, and practitioner reports on LLM applications describe a similar pattern. Users tolerate a short wait; multi-second delays shift from acceptable to product-breaking. Streaming output (showing tokens as they are generated rather than waiting for the full response) significantly reduces perceived latency even when total generation time is unchanged.

Prompt Caching

Most major providers offer prompt caching: if the beginning of your prompt is identical across multiple requests, the preprocessed prefix is cached, and you pay a fraction of the normal price for those tokens.

This matters most for:

Long system prompts that don't change between users
RAG patterns where retrieved documents are constant across requests
Multi-turn conversations where early context is static

Effective caching can reduce costs by 50–90% on the cached portion.
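As a rough illustration of how that saving reaches the total input bill, the sketch below computes the expected input cost of a request with a shared prefix. It assumes cached tokens are billed at a fixed fraction of the normal input price and ignores cache-write surcharges and cache expiry, which some providers apply. The function name, prices, discount, and hit rate are all hypothetical.

```python
def input_cost_with_cache(prefix_tokens: int, dynamic_tokens: int,
                          input_price_per_m: float,
                          cached_price_fraction: float,
                          hit_rate: float) -> float:
    """Expected input cost in USD for one request with a cacheable prefix.

    cached_price_fraction: price of a cached token relative to a normal
        input token (0.1 means a 90% discount). Assumed, not a real rate.
    hit_rate: share of requests that find the prefix already cached.
    """
    per_token = input_price_per_m / 1_000_000
    prefix_full = prefix_tokens * per_token
    prefix_cached = prefix_full * cached_price_fraction
    expected_prefix = hit_rate * prefix_cached + (1 - hit_rate) * prefix_full
    return expected_prefix + dynamic_tokens * per_token


# 6,000-token shared system prompt plus 500 tokens that differ per request.
uncached = input_cost_with_cache(6_000, 500, 3.00, cached_price_fraction=1.0, hit_rate=0.0)
cached = input_cost_with_cache(6_000, 500, 3.00, cached_price_fraction=0.1, hit_rate=0.9)
savings = 1 - cached / uncached

print(f"uncached ${uncached:.5f}, cached ${cached:.5f}, input savings {savings:.0%}")
```

Under these assumptions the input bill drops by about three quarters. The overall saving is smaller than the per-token discount because the dynamic tokens and the cache misses are still billed at the full rate.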

Model Routing Patterns

Production systems often route to different models based on request type:

Complexity-based routing: simple factual queries go to the fast/cheap tier; complex reasoning tasks escalate to frontier models.

Cost-ceiling routing: if a task's value is bounded (a classification that informs a low-value decision), cap the inference cost at a small fraction of that value.

Cascade pattern: run a cheap model first; if confidence is low, escalate to a frontier model. Works well for classification; poorly for open-ended generation where confidence is hard to measure.

Task-specific fine-tuned model: a smaller model fine-tuned for your specific task can often match frontier quality on that task at a fraction of the cost. Worth investing in once a task is well-defined and volume is significant.
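Two of the patterns above, cost-ceiling routing and the cascade, can be sketched in a few lines. This is a minimal illustration under stated assumptions: `pick_tier`, `cascade`, the `Classifier` interface, the 5% cost share, the 0.8 threshold, and the stub models are all hypothetical, and a real system would call actual models and calibrate the threshold against evals.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical interface: a classifier returns (label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def pick_tier(task_value_usd: float, est_cost_by_tier: Dict[str, float],
              max_cost_share: float = 0.05) -> Optional[str]:
    """Cost-ceiling routing: pick the priciest tier that fits under the cap.

    Assumes the pricier tier is the higher-quality one, which evals should
    confirm. Returns None when no tier is affordable.
    """
    ceiling = task_value_usd * max_cost_share
    affordable = {t: c for t, c in est_cost_by_tier.items() if c <= ceiling}
    if not affordable:
        return None
    return max(affordable, key=lambda tier: affordable[tier])


def cascade(text: str, cheap: Classifier, frontier: Classifier,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Cascade: return (label, tier_used), escalating only on low confidence."""
    label, confidence = cheap(text)
    if confidence >= threshold:
        return label, "cheap"
    label, _ = frontier(text)
    return label, "frontier"


# Stand-in models so the sketch runs without any provider SDK.
def cheap_stub(text: str) -> Tuple[str, float]:
    if "refund" in text.lower():
        return "billing", 0.95
    return "other", 0.40


def frontier_stub(text: str) -> Tuple[str, float]:
    return "technical", 0.90


costs = {"fast": 0.0004, "frontier": 0.012}   # assumed USD per request
print(pick_tier(0.02, costs))                 # fast
print(pick_tier(1.00, costs))                 # frontier
print(cascade("I want a refund", cheap_stub, frontier_stub))        # ('billing', 'cheap')
print(cascade("App crashes on launch", cheap_stub, frontier_stub))  # ('technical', 'frontier')
```

The cascade relies on the cheap model reporting a usable confidence score, which is why it suits classification better than open-ended generation.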

Practical Cost Estimation

Before deploying at scale:

Profile your actual input/output token distribution across a representative request sample
Apply the provider's pricing
Multiply by expected daily volume
Estimate prompt caching impact
Compare against value-per-inference to validate the economics

Don't guess token counts — measure them. Input tokens are often longer than engineers expect, especially with JSON schemas, long system prompts, and conversation history.
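The estimation steps above can be sketched as a short calculation. The sample token counts, prices, daily volume, and value per request below are invented for illustration, and `daily_cost_estimate` is a hypothetical helper; in practice the samples would come from the token counts reported for real requests. Prompt caching (step 4) is left out to keep the sketch short.

```python
from statistics import mean
from typing import List, Tuple


def daily_cost_estimate(samples: List[Tuple[int, int]],
                        input_price_per_m: float, output_price_per_m: float,
                        requests_per_day: int) -> Tuple[float, float]:
    """Return (cost per request, cost per day) in USD.

    samples: measured (input_tokens, output_tokens) pairs from a
    representative set of requests.
    """
    avg_in = mean(s[0] for s in samples)
    avg_out = mean(s[1] for s in samples)
    per_request = (avg_in * input_price_per_m
                   + avg_out * output_price_per_m) / 1_000_000
    return per_request, per_request * requests_per_day


# Invented measurements and prices (USD per million tokens).
samples = [(1200, 300), (1800, 500), (1500, 400)]
per_request, per_day = daily_cost_estimate(samples, 0.25, 1.00, 1_000_000)

# Step 5: compare against an assumed value of one inference.
value_per_request = 0.01
cost_share = per_request / value_per_request

print(f"per request ${per_request:.6f}, per day ${per_day:.2f}, "
      f"cost is {cost_share:.1%} of value")
```

Using the mean keeps the sketch simple; if the token distribution has a long tail, estimating from the full distribution or a high percentile gives a safer upper bound.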

PM Takeaway

For internal tools and low-volume features, cost rarely matters — a frontier model at $0.01/request is fine. For consumer-facing features at millions of requests per day, model tier selection is a make-or-break product decision. Build your eval framework to compare fast/cheap models against frontier models early, before you're committed to an architecture.
