AI in Production
Scaling, monitoring, cost optimisation, latency reduction, caching, observability, and fallback strategies for production AI systems.
Taking AI to Production
Building an AI demo is easy. Running AI reliably at scale is hard. Here's what you need to know.
Latency Optimisation
Streaming
The single biggest perceived-latency improvement. Instead of waiting for the full response:
Time-to-first-token: ~300ms (user sees response starting)
Full response: ~3-5 seconds (tokens appear progressively)
Without streaming: User stares at a blank screen for 3-5 seconds
Always stream in user-facing applications.
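The effect is easy to see with a minimal sketch. The token generator below is a stand-in for a real provider's streaming iterator (most SDKs expose one when you enable streaming); the point is that the first token is usable long before the full response arrives:

```python
import time

def fake_token_stream(text, delay=0.01):
    """Stand-in for a provider's streaming iterator."""
    for token in text.split():
        time.sleep(delay)          # simulate per-chunk network latency
        yield token + " "

def consume_stream(stream):
    """Render tokens as they arrive and record time-to-first-token."""
    start = time.monotonic()
    ttft = None
    parts = []
    for token in stream:
        if ttft is None:
            ttft = time.monotonic() - start   # the user sees output here
        parts.append(token)                   # in a UI, flush to screen
    return "".join(parts).strip(), ttft

text, ttft = consume_stream(fake_token_stream("Streaming cuts perceived latency"))
```

In a real UI you would flush each token to the screen as it arrives instead of collecting it in a list.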
Prompt Optimisation
Every token costs time and money:
- Trim system prompts: remove redundant instructions
- Summarise conversation history instead of sending full transcripts
- Use shorter output instructions: "Respond in 2 sentences" rather than letting the model ramble
- Cache system prompts: OpenAI and Anthropic offer prompt caching for repeated prefixes
Model Selection for Latency
| Tier | Model | Time-to-first-token | Use Case |
|---|---|---|---|
| Fast | GPT-4o mini, Claude Haiku | ~200ms | Routing, classification |
| Balanced | GPT-4o, Claude Sonnet | ~400ms | Most tasks |
| Powerful | Claude Opus, GPT-4.5 | ~800ms | Complex reasoning |
Use smaller models for simple tasks and route complex queries to larger ones.
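A minimal routing sketch, assuming a crude length-based heuristic. Production routers typically use a small classifier model for this step, and the model names and thresholds here are illustrative:

```python
# Illustrative tier table; swap in the models and thresholds that fit your stack.
MODEL_TIERS = {
    "fast": "gpt-4o-mini",
    "balanced": "gpt-4o",
    "powerful": "claude-opus",
}

def route(query: str) -> str:
    """Pick a model tier from a crude complexity proxy (query length)."""
    words = len(query.split())
    if words <= 8:
        return MODEL_TIERS["fast"]      # routing, classification, short lookups
    if words <= 50:
        return MODEL_TIERS["balanced"]  # most everyday tasks
    return MODEL_TIERS["powerful"]      # long, complex requests
```

The win is that the cheap tier absorbs the bulk of traffic while the expensive tier only sees queries that need it.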
Caching
Exact Match Cache
Store response for identical prompts. Simple but limited β users rarely ask the exact same thing.
Semantic Cache
Use embeddings to find similar previous queries. If similarity > threshold, return cached response. Dramatically reduces API calls for common question patterns.
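A minimal semantic cache sketch. The `toy_embed` function below is a deliberately crude stand-in for a real embedding model (letter frequencies), and the 0.95 threshold is illustrative; in production you would call an embedding API and tune the threshold on your own traffic:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed        # embedding function (placeholder here)
        self.threshold = threshold
        self.entries = []         # (vector, response) pairs

    def get(self, query):
        v = self.embed(query)
        for vec, response in self.entries:
            if cosine(v, vec) >= self.threshold:
                return response   # cache hit: skip the API call
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

def toy_embed(text):
    """Toy stand-in for a real embedding model: letter-frequency vector."""
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - 97] += 1.0
    return v

cache = SemanticCache(toy_embed, threshold=0.95)
cache.put("what are your opening hours", "We open 9-5.")
hit = cache.get("what are your opening hours?")
```

A real deployment would also expire entries and store vectors in a proper vector index rather than a linear scan.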
Prompt Caching (Provider-Level)
Both OpenAI and Anthropic cache repeated prompt prefixes. If your system prompt is 2,000 tokens and every request includes it, caching makes those repeated input tokens much cheaper: OpenAI discounts cached input tokens by 50%, and Anthropic charges roughly 10% of the normal input price for cache reads.
Cost Management
Token Economics
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o mini | $0.15 | $0.60 |
| GPT-4o | $2.50 | $10.00 |
| Claude Sonnet | $3.00 | $15.00 |
| Claude Opus | $15.00 | $75.00 |
Output tokens cost 4-5x more than input tokens (compare the columns above). Shorter responses = lower costs.
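The arithmetic is simple enough to sketch directly, using the prices from the table above:

```python
# Prices from the table above, in dollars per 1M tokens: (input, output).
PRICES = {
    "gpt-4o-mini":   (0.15, 0.60),
    "gpt-4o":        (2.50, 10.00),
    "claude-sonnet": (3.00, 15.00),
    "claude-opus":   (15.00, 75.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of a single request."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# A typical request: 1,500 prompt tokens in, 500 tokens out.
cost = request_cost("gpt-4o", 1500, 500)  # 0.00375 + 0.005 = $0.00875
```

Note how the 500 output tokens cost more than the 1,500 input tokens; that asymmetry is why response length limits pay off.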
Cost Reduction Strategies
- Model routing: use cheap models for simple tasks, expensive models only when needed
- Response length limits: instruct the model to be concise
- Caching: avoid redundant API calls
- Batch processing: use batch APIs (50% cheaper) for non-real-time tasks
- Conversation summarisation: compress long histories before sending
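Conversation summarisation can be as simple as the sketch below. The summary stub here is a placeholder; in production it would come from a cheap summarisation call to a fast model:

```python
def compress_history(messages, keep_last=4):
    """Keep the system prompt and the most recent turns; replace the rest
    with a one-line summary stub (a real system would summarise them)."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_last:
        return system + rest
    dropped = rest[:-keep_last]
    summary = {"role": "system",
               "content": f"[Summary of {len(dropped)} earlier messages]"}
    return system + [summary] + rest[-keep_last:]

history = [{"role": "system", "content": "You are a support agent."}] + \
          [{"role": "user", "content": f"message {i}"} for i in range(10)]
compact = compress_history(history, keep_last=4)
```

Eleven messages shrink to six, and the savings compound on every subsequent turn of the conversation.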
Observability
You can't improve what you can't measure.
Key Metrics to Track
- Latency: time-to-first-token and total response time
- Token usage: input/output tokens per request
- Error rates: API failures, rate limits, timeouts
- Cost: daily/weekly/monthly spend per model
- Quality: user feedback, hallucination rate
Tools
- LangSmith: traces, evaluation, and monitoring for LangChain apps
- Helicone: API proxy that logs all LLM calls with analytics
- Portkey: unified API gateway with observability
- Custom logging: log every request/response with metadata
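A custom logging wrapper need not be elaborate. In this sketch `call` stands in for the real provider invocation, and the structured log line captures the metrics listed above:

```python
import json
import time
import uuid

def log_llm_call(model, prompt, call):
    """Wrap an LLM call and emit one structured log record per request."""
    record = {"id": str(uuid.uuid4()), "model": model}
    start = time.monotonic()
    try:
        response = call(prompt)
        record.update(status="ok",
                      latency_ms=round((time.monotonic() - start) * 1000, 1),
                      output_chars=len(response))
        return response
    except Exception as exc:
        record.update(status="error", error=type(exc).__name__)
        raise
    finally:
        print(json.dumps(record))  # ship to your log pipeline instead
```

Logging in a `finally` block guarantees that failures are recorded too, which is exactly when you need the data.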
Multi-Provider Fallback
Never depend on a single AI provider:
```python
# Sketch: helpers like call_with_adapted_prompt, log_failure, and the
# exception types are application-specific.
providers = [
    {"name": "openai",    "model": "gpt-4o",        "timeout": 10},
    {"name": "anthropic", "model": "claude-sonnet", "timeout": 10},
    {"name": "together",  "model": "llama-3-70b",   "timeout": 15},
]

def call_with_fallback(prompt):
    for provider in providers:
        try:
            # Adapt the prompt to each provider's format before calling.
            return call_with_adapted_prompt(provider, prompt)
        except (Timeout, RateLimit, ServerError):
            log_failure(provider)   # record for alerting and analytics
            continue
    raise AllProvidersFailed()
```
Key Considerations
- Each provider expects a different prompt format: abstract this behind an adapter
- Rate limits differ: track them per provider
- Quality varies: test your specific use case on each provider
- Some tasks may not be suitable for certain models
Retry & Rate Limiting
Exponential Backoff
delay = min(base_delay * 2^attempt + random_jitter, max_delay)
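A minimal retry helper implementing the formula above. The jitter here is scaled to `base_delay` (the exact jitter scheme is a design choice), and the tiny delay in the demo is just to keep the example fast:

```python
import random
import time

def with_backoff(call, base_delay=0.5, max_delay=30.0, max_attempts=5):
    """Retry a flaky zero-argument callable with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            delay = min(base_delay * 2 ** attempt + random.uniform(0, base_delay),
                        max_delay)
            time.sleep(delay)

# Demo: fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def _flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

result = with_backoff(_flaky, base_delay=0.01)
```

In practice you would catch only retryable errors (timeouts, rate limits, 5xx) rather than bare `Exception`.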
Client-Side Rate Limiting
Track your request rate and queue/delay requests before hitting provider limits. Cheaper than getting rate-limited and retrying.
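A token bucket is a common way to implement client-side rate limiting; here is a minimal sketch (the `rate` and `capacity` values in the demo are illustrative):

```python
import time

class RateLimiter:
    """Token bucket: at most `rate` requests/second, bursts up to `capacity`.
    Call wait() before each API request."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def wait(self):
        while True:
            now = time.monotonic()
            # Refill tokens for the time elapsed since the last check.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)  # wait for the next token

# Demo: 2 requests burst through, the next 3 are throttled to 100/s.
limiter = RateLimiter(rate=100, capacity=2)
start = time.monotonic()
for _ in range(5):
    limiter.wait()
elapsed = time.monotonic() - start
```

Set `rate` slightly below the provider's published limit so bursts from other processes don't push you over.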
Security in Production
- API key rotation: rotate keys regularly and never commit them to code
- Input validation: sanitise user input to mitigate prompt injection
- Output filtering: check responses for PII and harmful content
- Audit logging: log all AI interactions for compliance
- Data residency: know where your data is processed (EU regulations, etc.)
Key takeaway: Production AI is 20% model selection and 80% engineering: caching, observability, fallbacks, cost management, and security. The model is the easy part; the infrastructure around it is what makes it reliable.
Quick Quiz
Test what you just learned. Pick the best answer for each question.
Q1 What is 'semantic caching' for AI APIs?
Q2 What is the primary strategy for reducing AI API latency?
Q3 What should a multi-provider fallback strategy include?
Q4 What is the most common cause of unexpectedly high AI API costs?