AI in Production
Scaling, monitoring, cost optimisation, latency reduction, caching, observability, and fallback strategies for production AI systems.
Taking AI to Production
Building an AI demo is easy. Running AI reliably at scale is hard. Here's what you need to know.
Latency Optimisation
Streaming
The single biggest perceived-latency improvement. Instead of waiting for the full response:
Time-to-first-token: ~300ms (user sees response starting)
Full response: ~3-5 seconds (tokens appear progressively)
Without streaming: User stares at a blank screen for 3-5 seconds
Always stream in user-facing applications.
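The effect is easy to see with a minimal sketch. The token generator below is a stand-in for a real provider's streaming iterator (most SDKs expose one when you enable streaming); the point is that the first token is usable long before the full response arrives:

```python
import time

def fake_token_stream(text, delay=0.01):
    """Stand-in for a provider's streaming iterator."""
    for token in text.split():
        time.sleep(delay)          # simulate per-chunk network latency
        yield token + " "

def consume_stream(stream):
    """Render tokens as they arrive and record time-to-first-token."""
    start = time.monotonic()
    ttft = None
    parts = []
    for token in stream:
        if ttft is None:
            ttft = time.monotonic() - start   # the user sees output here
        parts.append(token)                   # in a UI, flush to screen
    return "".join(parts).strip(), ttft

text, ttft = consume_stream(fake_token_stream("Streaming cuts perceived latency"))
```

In a real UI you would flush each token to the screen as it arrives instead of collecting it in a list.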
Prompt Optimisation
Every token costs time and money:
- Trim system prompts: remove redundant instructions
- Summarise conversation history instead of sending full transcripts
- Use shorter output instructions: "Respond in 2 sentences" rather than letting the model ramble
- Cache system prompts: OpenAI and Anthropic offer prompt caching for repeated prefixes
Model Selection for Latency
| Tier | Model | Time-to-first-token | Use Case |
|---|---|---|---|
| Fast | GPT-4o mini, Claude Haiku | ~200ms | Routing, classification |
| Balanced | GPT-4o, Claude Sonnet | ~400ms | Most tasks |
| Powerful | Claude Opus, GPT-4.5 | ~800ms | Complex reasoning |
Use smaller models for simple tasks and route complex queries to larger ones.
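A minimal routing sketch, assuming a crude length-based heuristic. Production routers typically use a small classifier model for this step, and the model names and thresholds here are illustrative:

```python
# Illustrative tier table; swap in the models and thresholds that fit your stack.
MODEL_TIERS = {
    "fast": "gpt-4o-mini",
    "balanced": "gpt-4o",
    "powerful": "claude-opus",
}

def route(query: str) -> str:
    """Pick a model tier from a crude complexity proxy (query length)."""
    words = len(query.split())
    if words <= 8:
        return MODEL_TIERS["fast"]      # routing, classification, short lookups
    if words <= 50:
        return MODEL_TIERS["balanced"]  # most everyday tasks
    return MODEL_TIERS["powerful"]      # long, complex requests
```

The win is that the cheap tier absorbs the bulk of traffic while the expensive tier only sees queries that need it.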
Caching
Exact Match Cache
Store response for identical prompts. Simple but limited β users rarely ask the exact same thing.
Semantic Cache
Use embeddings to find similar previous queries. If similarity > threshold, return cached response. Dramatically reduces API calls for common question patterns.
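A minimal semantic cache sketch. The `toy_embed` function below is a deliberately crude stand-in for a real embedding model (letter frequencies), and the 0.95 threshold is illustrative; in production you would call an embedding API and tune the threshold on your own traffic:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed        # embedding function (placeholder here)
        self.threshold = threshold
        self.entries = []         # (vector, response) pairs

    def get(self, query):
        v = self.embed(query)
        for vec, response in self.entries:
            if cosine(v, vec) >= self.threshold:
                return response   # cache hit: skip the API call
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

def toy_embed(text):
    """Toy stand-in for a real embedding model: letter-frequency vector."""
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - 97] += 1.0
    return v

cache = SemanticCache(toy_embed, threshold=0.95)
cache.put("what are your opening hours", "We open 9-5.")
hit = cache.get("what are your opening hours?")
```

A real deployment would also expire entries and store vectors in a proper vector index rather than a linear scan.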
Prompt Caching (Provider-Level)
Both OpenAI and Anthropic cache repeated prompt prefixes. If your system prompt is 2,000 tokens and every request includes it, caching makes those repeated input tokens much cheaper: OpenAI discounts cached input tokens by 50%, and Anthropic charges roughly 10% of the normal input price for cache reads.
Cost Management
Token Economics
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o mini | $0.15 | $0.60 |
| GPT-4o | $2.50 | $10.00 |
| Claude Sonnet | $3.00 | $15.00 |
| Claude Opus | $15.00 | $75.00 |
Output tokens cost 4-5x more than input tokens (compare the columns above). Shorter responses = lower costs.
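The arithmetic is simple enough to sketch directly, using the prices from the table above:

```python
# Prices from the table above, in dollars per 1M tokens: (input, output).
PRICES = {
    "gpt-4o-mini":   (0.15, 0.60),
    "gpt-4o":        (2.50, 10.00),
    "claude-sonnet": (3.00, 15.00),
    "claude-opus":   (15.00, 75.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of a single request."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# A typical request: 1,500 prompt tokens in, 500 tokens out.
cost = request_cost("gpt-4o", 1500, 500)  # 0.00375 + 0.005 = $0.00875
```

Note how the 500 output tokens cost more than the 1,500 input tokens; that asymmetry is why response length limits pay off.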
Cost Reduction Strategies
- Model routing: use cheap models for simple tasks, expensive models only when needed
- Response length limits: instruct the model to be concise
- Caching: avoid redundant API calls
- Batch processing: use batch APIs (50% cheaper) for non-real-time tasks
- Conversation summarisation: compress long histories before sending
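Conversation summarisation can be as simple as the sketch below. The summary stub here is a placeholder; in production it would come from a cheap summarisation call to a fast model:

```python
def compress_history(messages, keep_last=4):
    """Keep the system prompt and the most recent turns; replace the rest
    with a one-line summary stub (a real system would summarise them)."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_last:
        return system + rest
    dropped = rest[:-keep_last]
    summary = {"role": "system",
               "content": f"[Summary of {len(dropped)} earlier messages]"}
    return system + [summary] + rest[-keep_last:]

history = [{"role": "system", "content": "You are a support agent."}] + \
          [{"role": "user", "content": f"message {i}"} for i in range(10)]
compact = compress_history(history, keep_last=4)
```

Eleven messages shrink to six, and the savings compound on every subsequent turn of the conversation.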
Observability
You can't improve what you can't measure.
Key Metrics to Track
- Latency: time-to-first-token and total response time
- Token usage: input/output tokens per request
- Error rates: API failures, rate limits, timeouts
- Cost: daily/weekly/monthly spend per model
- Quality: user feedback, hallucination rate
Tools
- LangSmith: traces, evaluation, and monitoring for LangChain apps
- Helicone: API proxy that logs all LLM calls with analytics
- Portkey: unified API gateway with observability
- Custom logging: log every request/response with metadata
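A custom logging wrapper need not be elaborate. In this sketch `call` stands in for the real provider invocation, and the structured log line captures the metrics listed above:

```python
import json
import time
import uuid

def log_llm_call(model, prompt, call):
    """Wrap an LLM call and emit one structured log record per request."""
    record = {"id": str(uuid.uuid4()), "model": model}
    start = time.monotonic()
    try:
        response = call(prompt)
        record.update(status="ok",
                      latency_ms=round((time.monotonic() - start) * 1000, 1),
                      output_chars=len(response))
        return response
    except Exception as exc:
        record.update(status="error", error=type(exc).__name__)
        raise
    finally:
        print(json.dumps(record))  # ship to your log pipeline instead
```

Logging in a `finally` block guarantees that failures are recorded too, which is exactly when you need the data.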
Multi-Provider Fallback
Never depend on a single AI provider:
```python
# Sketch: helpers like call_with_adapted_prompt, log_failure, and the
# exception types are application-specific.
providers = [
    {"name": "openai",    "model": "gpt-4o",        "timeout": 10},
    {"name": "anthropic", "model": "claude-sonnet", "timeout": 10},
    {"name": "together",  "model": "llama-3-70b",   "timeout": 15},
]

def call_with_fallback(prompt):
    for provider in providers:
        try:
            # Adapt the prompt to each provider's format before calling.
            return call_with_adapted_prompt(provider, prompt)
        except (Timeout, RateLimit, ServerError):
            log_failure(provider)   # record for alerting and analytics
            continue
    raise AllProvidersFailed()
```
Key Considerations
- Each provider expects a different prompt format: abstract this behind an adapter
- Rate limits differ: track them per provider
- Quality varies: test your specific use case on each provider
- Some tasks may not be suitable for certain models
Retry & Rate Limiting
Exponential Backoff
delay = min(base_delay * 2^attempt + random_jitter, max_delay)
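A minimal retry helper implementing the formula above. The jitter here is scaled to `base_delay` (the exact jitter scheme is a design choice), and the tiny delay in the demo is just to keep the example fast:

```python
import random
import time

def with_backoff(call, base_delay=0.5, max_delay=30.0, max_attempts=5):
    """Retry a flaky zero-argument callable with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            delay = min(base_delay * 2 ** attempt + random.uniform(0, base_delay),
                        max_delay)
            time.sleep(delay)

# Demo: fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def _flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

result = with_backoff(_flaky, base_delay=0.01)
```

In practice you would catch only retryable errors (timeouts, rate limits, 5xx) rather than bare `Exception`.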
Client-Side Rate Limiting
Track your request rate and queue/delay requests before hitting provider limits. Cheaper than getting rate-limited and retrying.
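A token bucket is a common way to implement client-side rate limiting; here is a minimal sketch (the `rate` and `capacity` values in the demo are illustrative):

```python
import time

class RateLimiter:
    """Token bucket: at most `rate` requests/second, bursts up to `capacity`.
    Call wait() before each API request."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def wait(self):
        while True:
            now = time.monotonic()
            # Refill tokens for the time elapsed since the last check.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)  # wait for the next token

# Demo: 2 requests burst through, the next 3 are throttled to 100/s.
limiter = RateLimiter(rate=100, capacity=2)
start = time.monotonic()
for _ in range(5):
    limiter.wait()
elapsed = time.monotonic() - start
```

Set `rate` slightly below the provider's published limit so bursts from other processes don't push you over.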
Security in Production
- API key rotation: rotate keys regularly and never commit them to code
- Input validation: sanitise user input to mitigate prompt injection
- Output filtering: check responses for PII and harmful content
- Audit logging: log all AI interactions for compliance
- Data residency: know where your data is processed (EU regulations, etc.)
Key takeaway: Production AI is 20% model selection and 80% engineering: caching, observability, fallbacks, cost management, and security. The model is the easy part; the infrastructure around it is what makes it reliable.
Quick Quiz
Test what you just learned. Pick the best answer for each question.
Q1 What is 'semantic caching' for AI APIs?
Q2 What is the primary strategy for reducing AI API latency?
Q3 What should a multi-provider fallback strategy include?
Q4 What is the most common cause of unexpectedly high AI API costs?