Cost and Latency Tradeoffs

PM: Read in full (20 min)

The Model Selection Triangle

Every production LLM decision involves three variables in tension:

  • Quality: How reliably correct, coherent, and safe is the output?
  • Cost: How much does each inference call cost?
  • Latency: How fast does the response arrive?

No model maximizes all three. Frontier models (GPT-4o, Claude Opus, Gemini Ultra) lead on quality. Fast/cheap models (Claude Haiku, GPT-4o-mini, Gemini Flash) lead on cost and speed. The engineering problem is choosing the right tier for each task.

How LLM Pricing Works

Almost all providers charge per token, separately for input and output:

cost = (input_tokens × input_price) + (output_tokens × output_price)

Output tokens are typically 3–5× more expensive than input tokens. Output tokens are generated one at a time, each requiring its own forward pass through the model, whereas input tokens are processed together in a single, less expensive prefill step.

Practical implications:

  • Long system prompt + short response is cheap
  • Short prompt + long generated report is expensive
  • Output cost dominates for most generation tasks
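
The pricing formula and the implications above can be checked with a few lines of Python. This is a minimal sketch: the function name `request_cost` and the prices ($3 per million input tokens, $15 per million output tokens) are illustrative assumptions, not any provider's actual rates.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one call, with prices quoted per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Assumed, illustrative prices: $3 / M input tokens, $15 / M output tokens.
INPUT_PRICE, OUTPUT_PRICE = 3.0, 15.0

# Long system prompt + short response.
long_prompt_short_reply = request_cost(4000, 200, INPUT_PRICE, OUTPUT_PRICE)

# Short prompt + long generated report.
short_prompt_long_report = request_cost(200, 4000, INPUT_PRICE, OUTPUT_PRICE)

print(f"long prompt, short reply:  ${long_prompt_short_reply:.4f}")
print(f"short prompt, long report: ${short_prompt_long_report:.4f}")
```

Both requests move the same 4,200 tokens in total, but under these assumed prices the report-style request costs roughly four times as much because most of its tokens are output.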

Approximate prices as of mid-2025 (verify with provider for current rates):

| Tier | Example models | Input / M tokens | Output / M tokens |
| --- | --- | --- | --- |
| Frontier | GPT-4o, Claude Sonnet | $3–$15 | $15–$75 |
| Fast/cheap | GPT-4o-mini, Claude Haiku | $0.10–$0.30 | $0.40–$1.50 |
| Open (self-hosted) | Llama 3 70B | infra cost only | infra cost only |

Latency Dimensions

Latency has two components that feel different to users:

Time to first token (TTFT): How long from request to the first character appearing. This is the "responsiveness" feel in streaming UIs. Dominated by network latency + model prefill time.

Throughput: How fast tokens stream once generation starts, usually measured in tokens per second. Total streaming time grows in proportion to output length, so a high-throughput model with worse TTFT can feel better on long outputs, while a lower-throughput model with excellent TTFT can feel better on short interactive replies.

For interactive chat: optimize TTFT first, throughput second. For batch processing: optimize throughput (tokens/second per dollar).
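
One way to see how the two latency components trade off is to model total response time as TTFT plus output length divided by throughput. The sketch below uses two hypothetical models with made-up numbers (`model_a`, `model_b`); real values have to be measured against your own provider and workload.

```python
def response_time(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Seconds until the last token arrives: time to first token + streaming time."""
    return ttft_s + output_tokens / tokens_per_s


# Hypothetical figures, for illustration only.
model_a = {"ttft_s": 0.3, "tokens_per_s": 40.0}    # starts quickly, streams slowly
model_b = {"ttft_s": 1.0, "tokens_per_s": 120.0}   # starts slowly, streams quickly

timings = {}
for n in (20, 1000):  # short chat reply vs. long report
    a = response_time(output_tokens=n, **model_a)
    b = response_time(output_tokens=n, **model_b)
    timings[n] = (a, b)
    faster = "model_a" if a < b else "model_b"
    print(f"{n:>5} output tokens: a={a:.2f}s  b={b:.2f}s  -> {faster} finishes first")
```

With these assumed numbers, the model with the better TTFT wins on the 20-token reply and the higher-throughput model wins on the 1,000-token output.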

Prompt Caching

Most major providers offer prompt caching: if the beginning of your prompt is identical across multiple requests, the preprocessed prefix is cached, and you pay a fraction of the normal price for those tokens.

This matters most for:

  • Long system prompts that don't change between users
  • RAG patterns where retrieved documents are constant across requests
  • Multi-turn conversations where early context is static

Effective caching can reduce costs by 50–90% on the cached portion.
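
To gauge the effect on a specific workload, the input side of the bill can be split into a cached prefix and a fresh remainder. This is a rough sketch under stated assumptions: `cached_price_fraction` (the share of the normal input price charged for cached tokens) and the token counts are hypothetical, and the model ignores any surcharge a provider may apply when writing to the cache, as well as cache expiry.

```python
def input_cost_with_cache(cached_tokens: int, fresh_tokens: int,
                          input_price_per_m: float,
                          cached_price_fraction: float) -> float:
    """Dollar cost of the input side of one request when a prefix is served from cache."""
    cached = cached_tokens * input_price_per_m * cached_price_fraction
    fresh = fresh_tokens * input_price_per_m
    return (cached + fresh) / 1_000_000


# Assumed: 3,000-token static system prompt, 500 tokens of per-user input,
# $3 / M input tokens, cached tokens billed at 10% of the normal price.
without_cache = input_cost_with_cache(0, 3500, 3.0, 1.0)
with_cache = input_cost_with_cache(3000, 500, 3.0, 0.1)
saving = 1 - with_cache / without_cache

print(f"input cost without cache: ${without_cache:.4f}")
print(f"input cost with cache:    ${with_cache:.4f}")
print(f"saving on input cost:     {saving:.0%}")
```

The saving applies only to input tokens; output tokens are billed as usual, so the effect on the total bill is smaller for generation-heavy tasks.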

Model Routing Patterns

Production systems often route to different models based on request type:

Complexity-based routing: simple factual queries go to the fast/cheap tier; complex reasoning tasks escalate to frontier models.

Cost-ceiling routing: if a task's value is bounded (a classification that informs a low-value decision), cap the inference cost at a small fraction of that value.
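
A cost ceiling can be expressed as a simple selection rule. In this sketch, `pick_tier`, the 5% default for `max_cost_fraction`, and the per-tier cost estimates are all hypothetical choices for illustration.

```python
from typing import List, Optional, Tuple


def pick_tier(task_value: float,
              est_cost_by_tier: List[Tuple[str, float]],
              max_cost_fraction: float = 0.05) -> Optional[str]:
    """Return the first tier whose estimated cost fits under the ceiling.

    est_cost_by_tier is ordered from most to least capable, so the result is
    the most capable tier that is still affordable, or None if nothing fits.
    """
    ceiling = task_value * max_cost_fraction
    for tier, cost in est_cost_by_tier:
        if cost <= ceiling:
            return tier
    return None


# Assumed per-request cost estimates in dollars.
tiers = [("frontier", 0.02), ("fast", 0.0005)]

high_value = pick_tier(1.00, tiers)    # ceiling $0.05
low_value = pick_tier(0.05, tiers)     # ceiling $0.0025
tiny_value = pick_tier(0.001, tiers)   # ceiling $0.00005
print(high_value, low_value, tiny_value)
```

Returning `None` makes the "not worth an LLM call at all" case explicit, so the caller can fall back to a rule or skip the step.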

Cascade pattern: run a cheap model first; if confidence is low, escalate to a frontier model. Works well for classification; poorly for open-ended generation where confidence is hard to measure.
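
The cascade for a classification task might look like the sketch below. The two classifiers are stand-in stubs rather than real model calls, and the 0.8 confidence threshold is an assumed value that would need tuning against an eval set.

```python
from typing import Callable, Tuple

# A classifier returns (label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def cascade(text: str, cheap: Classifier, frontier: Classifier,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when its confidence is low.

    Returns (label, name of the tier that produced it).
    """
    label, confidence = cheap(text)
    if confidence >= threshold:
        return label, "cheap"
    label, _ = frontier(text)
    return label, "frontier"


# Hypothetical stubs standing in for real model calls.
def cheap_stub(text: str) -> Tuple[str, float]:
    if "free money" in text.lower():
        return "spam", 0.95      # obvious case: high confidence
    return "not_spam", 0.55      # unsure: below the threshold


def frontier_stub(text: str) -> Tuple[str, float]:
    return "not_spam", 0.99


easy = cascade("FREE MONEY now!!!", cheap_stub, frontier_stub)
hard = cascade("Lunch at noon?", cheap_stub, frontier_stub)
print(easy, hard)
```

The economics depend on the escalation rate: every escalated request pays for both calls, so the cascade only saves money when the cheap model handles most traffic confidently.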

Task-specific fine-tuned model: a smaller model fine-tuned for your specific task often matches frontier quality at a fraction of the cost. Worth investing in once a task is well-defined and volume is significant.

Practical Cost Estimation

Before deploying at scale:

  1. Profile your actual input/output token distribution across a representative request sample
  2. Apply the provider's pricing
  3. Multiply by expected daily volume
  4. Estimate prompt caching impact
  5. Compare against value-per-inference to validate the economics

Don't guess token counts; measure them. Input token counts are often higher than engineers expect, especially with JSON schemas, long system prompts, and conversation history.
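
The five estimation steps can be strung together in a short script. Everything numeric here is an assumption for illustration: the sampled token counts, the prices, the daily volume, the cached share of input, and the value per request. In practice the samples would come from logged requests counted with your provider's tokenizer.

```python
from statistics import mean
from typing import List, Tuple


def estimate_cost(samples: List[Tuple[int, int]],
                  input_price_per_m: float, output_price_per_m: float,
                  daily_requests: int,
                  cached_input_fraction: float = 0.0,
                  cached_price_fraction: float = 1.0) -> Tuple[float, float]:
    """Return (cost per request, cost per day) in dollars.

    samples: measured (input_tokens, output_tokens) pairs from real traffic.
    cached_input_fraction: share of input tokens expected to hit the cache.
    cached_price_fraction: price of a cached token relative to a normal one.
    """
    avg_in = mean(s[0] for s in samples)    # step 1: profile token counts
    avg_out = mean(s[1] for s in samples)
    # Step 4: discount the cached share of the input tokens.
    effective_in = avg_in * (1 - cached_input_fraction
                             + cached_input_fraction * cached_price_fraction)
    # Step 2: apply pricing (quoted per million tokens).
    per_request = (effective_in * input_price_per_m
                   + avg_out * output_price_per_m) / 1_000_000
    return per_request, per_request * daily_requests   # step 3: daily volume


# Assumed sample and prices.
samples = [(1000, 200), (2000, 400), (3000, 300)]
no_cache = estimate_cost(samples, 3.0, 15.0, daily_requests=100_000)
cached = estimate_cost(samples, 3.0, 15.0, daily_requests=100_000,
                       cached_input_fraction=0.5, cached_price_fraction=0.1)

# Step 5: compare against an assumed value of $0.05 per request.
value_per_request = 0.05
cost_share = cached[0] / value_per_request

print(f"no cache:   ${no_cache[0]:.4f}/request, ${no_cache[1]:,.0f}/day")
print(f"with cache: ${cached[0]:.4f}/request, ${cached[1]:,.0f}/day")
print(f"inference cost as share of value: {cost_share:.0%}")
```

Averages hide long-tail requests, so it is worth repeating the calculation with a high percentile of the token distribution before committing to a budget.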

PM Takeaway

For internal tools and low-volume features, cost rarely matters: a frontier model at $0.01/request is fine. For consumer-facing features at millions of requests per day, model tier selection is a make-or-break product decision. Build your eval framework to compare fast/cheap models against frontier models early, before you're committed to an architecture.

Further Reading