The Multi-Model Reality
Three months ago, you standardized on GPT-4 for everything. It was simpler: one API, one set of prompts, predictable costs. But your bill is $50K/month and growing. Meanwhile, GPT-3.5 costs roughly 95% less, Claude 3.5 Sonnet is faster, and open-source Llama 3 70B costs nothing beyond the compute to run it.
The question isn't "which model is best?"—it's "which model is best for this specific task?" The answer changes based on:
- Task complexity (simple classification vs. complex reasoning)
- Quality requirements (mission-critical vs. good enough)
- Latency constraints (real-time chat vs. batch processing)
- Cost budgets (high-volume vs. occasional use)
- Privacy needs (API vs. self-hosted)
Multi-model strategies treat models as specialized tools, not universal solutions. You evaluate each model against each task, then route workloads intelligently. The result: better quality, lower cost, and more resilience.
The Model Landscape: Strengths and Weaknesses
Frontier Models (GPT-4, Claude 3 Opus, Gemini Ultra)
Strengths:
- Complex reasoning and multi-step tasks
- Following intricate instructions
- Creative writing and nuanced content
- Strong performance out-of-the-box with minimal prompting
Weaknesses:
- Expensive ($0.03-$0.06 per 1K tokens)
- Slower (2-5 seconds latency)
- Overkill for simple tasks
Best for:
- High-stakes decisions (legal analysis, medical information)
- Complex content generation (long-form articles, code with architecture)
- Tasks where quality >>> cost
Mid-Tier Models (GPT-3.5, Claude 3 Sonnet, Gemini Pro)
Strengths:
- Good balance of cost and quality
- Faster than frontier models (1-3 seconds)
- Capable of most common tasks with good prompting
Weaknesses:
- Struggle with very complex reasoning
- May need more prompt engineering than GPT-4
- Less consistent on edge cases
Best for:
- Customer support (straightforward Q&A)
- Content summarization
- Classification with some nuance
- High-volume applications where cost matters
Specialized Models (Embedding models, code-specific, etc.)
Strengths:
- Optimized for specific tasks
- Very cheap or free
- Fast inference
Weaknesses:
- Limited to narrow use cases
- Not general-purpose
Best for:
- Embeddings for semantic search (text-embedding-3-small)
- Code completion (CodeLlama, StarCoder)
- Sentiment analysis (fine-tuned BERT models)
Open-Source Models (Llama 3, Mistral, Qwen)
Strengths:
- Free to run (pay for compute only)
- Full control over deployment
- Privacy (nothing leaves your infrastructure)
- Can fine-tune for specific domains
Weaknesses:
- Require infrastructure setup
- Generally lower quality than frontier models (though the gap is closing)
- Need more prompt engineering
Best for:
- High-volume workloads where API costs are prohibitive
- Sensitive data that can't be sent to third parties
- Custom fine-tuning for domain-specific tasks
Systematic Multi-Model Evaluation
Step 1: Define Your Task Matrix
List all LLM tasks in your application:
Task | Volume | Latency Target | Quality Requirement | Current Model | Current Cost |
---|---|---|---|---|---|
Customer support Q&A | 10K/day | <2s | High | GPT-4 | $120/day
Product description generation | 500/day | <10s | Medium | GPT-4 | $30/day |
Email summarization | 50K/day | <1s | Medium | GPT-4 | $3,000/day |
Sentiment classification | 100K/day | <500ms | Low | GPT-4 | $6,000/day |
Total monthly cost: roughly $275,000 (about $9,150/day)
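If it helps to work with the matrix programmatically, the same inventory can live in code or config so later routing and evaluation scripts read from one source of truth. A minimal sketch in Python, using the figures from the table above (field names are illustrative):

# Hypothetical task inventory; field names are illustrative.
TASK_MATRIX = [
    {"task": "customer-support-qa", "volume_per_day": 10_000, "latency_target_s": 2.0,
     "quality": "high", "current_model": "gpt-4", "current_cost_per_day": 120},
    {"task": "product-descriptions", "volume_per_day": 500, "latency_target_s": 10.0,
     "quality": "medium", "current_model": "gpt-4", "current_cost_per_day": 30},
    {"task": "email-summarization", "volume_per_day": 50_000, "latency_target_s": 1.0,
     "quality": "medium", "current_model": "gpt-4", "current_cost_per_day": 3_000},
    {"task": "sentiment-classification", "volume_per_day": 100_000, "latency_target_s": 0.5,
     "quality": "low", "current_model": "gpt-4", "current_cost_per_day": 6_000},
]

total_daily = sum(t["current_cost_per_day"] for t in TASK_MATRIX)
print(f"Current spend: ${total_daily:,.0f}/day (~${total_daily * 30:,.0f}/month)")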
Step 2: Build Task-Specific Eval Sets
For each task, create a benchmark:
- 100-200 representative examples
- Covering common cases, edge cases, and adversarial inputs
- With ground truth or quality annotations
Example: Customer support Q&A eval set
{
  "task": "customer-support-qa",
  "examples": [
    {
      "input": "How do I reset my password?",
      "category": "common",
      "difficulty": "easy",
      "expected_elements": ["check email", "click reset link", "check spam"]
    },
    {
      "input": "I was charged twice for the same order, what should I do?",
      "category": "billing",
      "difficulty": "medium",
      "expected_elements": ["verify charges", "contact support", "provide order number"]
    },
    // ... 98 more examples
  ]
}
Step 3: Evaluate All Candidate Models
Test multiple models on each task:
models_to_test = [
    "gpt-4-turbo",
    "gpt-3.5-turbo",
    "claude-3-opus",
    "claude-3-sonnet",
    "claude-3-haiku",
    "gemini-pro",
    "llama-3-70b",
    "mistral-large"
]

results = {}
for model in models_to_test:
    results[model] = evaluate(
        prompt=CUSTOMER_SUPPORT_PROMPT,
        model=model,
        dataset="customer-support-qa",
        metrics=["accuracy", "completeness", "tone", "safety"]
    )
Collect:
- Quality metrics (accuracy, completeness, tone)
- Performance metrics (latency P50/P95, throughput)
- Cost per query
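A sketch of what that collection loop might look like. It assumes the async llm_call helper used throughout this piece, plus placeholder score and estimate_cost functions standing in for your own quality metrics and pricing math:

import statistics
import time

async def benchmark_model(model: str, dataset: list[dict], prompt: str) -> dict:
    """Run one model over an eval set, recording quality, latency, and cost."""
    scores, latencies, costs = [], [], []
    for example in dataset:
        start = time.perf_counter()
        response = await llm_call(prompt=prompt, input=example["input"], model=model)
        latencies.append(time.perf_counter() - start)
        scores.append(score(response, example))  # placeholder: your quality metric(s)
        costs.append(estimate_cost(model, prompt, example["input"], response))  # placeholder: your pricing math
    latencies.sort()
    return {
        "model": model,
        "accuracy": statistics.mean(scores),
        "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],
        "cost_per_query": statistics.mean(costs),
    }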
Step 4: Compare Model Performance
Example results: Customer support Q&A
Model | Accuracy | Completeness | Tone | Safety | P95 Latency | Cost/Query | Cost-Quality Ratio (Cost/Query ÷ Accuracy) |
---|---|---|---|---|---|---|---|
GPT-4 Turbo | 89% | 87% | 8.2/10 | 100% | 3.2s | $0.012 | $0.0135 |
GPT-3.5 Turbo | 78% | 74% | 7.1/10 | 98% | 1.8s | $0.0006 | $0.00077 |
Claude 3 Opus | 91% | 89% | 8.5/10 | 100% | 2.9s | $0.015 | $0.0165 |
Claude 3 Sonnet | 85% | 83% | 8.0/10 | 100% | 1.5s | $0.003 | $0.0035 |
Claude 3 Haiku | 76% | 72% | 7.3/10 | 99% | 0.9s | $0.00025 | $0.00033 |
Gemini Pro | 82% | 80% | 7.6/10 | 99% | 2.1s | $0.00025 | $0.00030 |
Llama 3 70B | 73% | 69% | 6.8/10 | 97% | 2.5s | $0.0002* | $0.00027 |
*Assuming self-hosted compute costs
Analysis:
- Best quality: Claude 3 Opus, but it is also the most expensive
- Best cost-quality ratio among hosted APIs: Gemini Pro, then Claude 3 Haiku (self-hosted Llama 3 70B edges both out only if you already run your own inference)
- Fastest: Claude 3 Haiku
- Cheapest: self-hosted Llama 3 70B at $0.0002/query, with Gemini Pro and Claude 3 Haiku close behind at $0.00025
Decision depends on priorities:
- If quality is paramount: Claude 3 Opus
- If cost matters most: switch from GPT-4 Turbo to Gemini Pro (save about 98% per query for a 7-point drop in accuracy)
- If speed matters most: Claude 3 Haiku
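These decision rules can be encoded so every task is judged the same way. A rough sketch, assuming a list of per-model result dicts shaped like the output of the benchmark sketch in Step 3:

def pick_model(results: list[dict], min_accuracy: float, max_latency_s: float) -> dict:
    """Pick the cheapest model that clears the quality and latency bars."""
    eligible = [
        r for r in results
        if r["accuracy"] >= min_accuracy and r["p95_latency_s"] <= max_latency_s
    ]
    if not eligible:
        # Nothing meets the bar; fall back to the highest-quality model.
        return max(results, key=lambda r: r["accuracy"])
    return min(eligible, key=lambda r: r["cost_per_query"])

# e.g. pick_model(results, min_accuracy=0.82, max_latency_s=2.0)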
Step 5: Repeat for All Tasks
Do this for every task. You'll often find:
- Simple tasks (sentiment analysis) work fine with cheap models
- Complex tasks (legal analysis) need frontier models
- Mid-tier tasks have good cost-quality tradeoffs with mid-tier models
Example: Email summarization
Model | Quality | Cost/Query | Daily Cost (50K queries) |
---|---|---|---|
GPT-4 | 92% | $0.06 | $3,000 |
GPT-3.5 | 87% | $0.003 | $150 |
Claude 3 Haiku | 88% | $0.00025 | $12.50 |
Decision: Claude 3 Haiku gives 88% quality (only a 4-point drop from GPT-4) at a 99.6% cost saving. Switch and save roughly $2,987/day.
Building a Model Router
Once you've evaluated models per task, build a routing layer that sends each request to the optimal model.
Simple Static Routing
Map tasks to models in config:
model_routing:
  customer-support-qa:
    model: claude-3-sonnet
    fallback: gpt-4-turbo
  email-summarization:
    model: claude-3-haiku
    fallback: gpt-3.5-turbo
  sentiment-classification:
    model: custom-bert-fine-tuned
    fallback: claude-3-haiku
  legal-document-analysis:
    model: gpt-4-turbo
    fallback: claude-3-opus
Implementation:
class ModelRouter:
    def __init__(self, routing_config):
        self.routes = routing_config

    async def route(self, task: str, input: str) -> str:
        route = self.routes.get(task)
        if not route:
            raise ValueError(f"No route configured for task: {task}")

        try:
            # Try primary model
            response = await llm_call(
                prompt=PROMPTS[task],
                input=input,
                model=route['model'],
                timeout=5.0
            )
            return response
        except Exception as e:
            # Fall back to backup model
            logger.warning(f"Primary model failed, falling back: {e}")
            response = await llm_call(
                prompt=PROMPTS[task],
                input=input,
                model=route['fallback'],
                timeout=10.0
            )
            return response
# Usage
router = ModelRouter(routing_config)
answer = await router.route("customer-support-qa", user_question)
Dynamic Routing Based on Input Complexity
Some tasks vary in difficulty. Route simple cases to cheap models, hard cases to expensive ones.
Example: Customer support
- Simple questions ("How do I reset password?") → Claude 3 Haiku
- Complex questions ("I was charged incorrectly and my account is suspended") → GPT-4
Complexity classifier:
async def classify_complexity(question: str) -> str:
    """
    Classify question complexity using a small, fast model.
    """
    classifier_prompt = f"""
Classify this customer question as SIMPLE, MEDIUM, or COMPLEX:
SIMPLE: One-step answer, common question
MEDIUM: Requires some explanation or multiple steps
COMPLEX: Multi-part issue, requires judgment or policy interpretation
Question: {question}
Classification:
"""
    result = await llm_call(
        classifier_prompt,
        model="claude-3-haiku",  # Fast, cheap classifier
        max_tokens=10
    )
    return result.strip().upper()

async def route_by_complexity(question: str) -> str:
    complexity = await classify_complexity(question)

    if complexity == "SIMPLE":
        model = "claude-3-haiku"
    elif complexity == "MEDIUM":
        model = "claude-3-sonnet"
    else:
        model = "gpt-4-turbo"

    return await llm_call(CUSTOMER_SUPPORT_PROMPT, question, model=model)
Cost savings: If 60% of questions are SIMPLE, 30% MEDIUM, and 10% COMPLEX:
- Before (all GPT-4): $0.012 × 10K = $120/day
- After (routed): (0.6 × $0.00025) + (0.3 × $0.003) + (0.1 × $0.012) = $0.00225/query × 10K = $22.50/day, plus a small overhead for the Haiku classifier call
- Savings: roughly 80%
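The same arithmetic as a small helper, handy when modeling other routing splits; the per-tier prices are the per-query figures from the comparison table:

def blended_cost(split: dict[str, float], price_per_query: dict[str, float]) -> float:
    """Expected cost per query for a given routing split."""
    return sum(share * price_per_query[tier] for tier, share in split.items())

per_query = blended_cost(
    {"SIMPLE": 0.6, "MEDIUM": 0.3, "COMPLEX": 0.1},
    {"SIMPLE": 0.00025, "MEDIUM": 0.003, "COMPLEX": 0.012},
)
print(f"${per_query:.5f}/query -> ${per_query * 10_000:,.2f}/day")  # ~$0.00225 -> ~$22.50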
Hybrid Approaches: Cascade and Retry
Start with a cheap model. If it fails or produces low-confidence output, retry with a better model.
Example: Content moderation
async def moderate_content(text: str) -> dict:
    """
    Cascade: try fast model first, escalate if uncertain.
    """
    # Try cheap classifier
    fast_result = await llm_call(
        prompt=MODERATION_PROMPT,
        input=text,
        model="claude-3-haiku",
        response_format="json"
    )

    # If high confidence (very safe or very unsafe), trust it
    if fast_result['confidence'] > 0.9:
        return fast_result

    # If uncertain, escalate to better model
    detailed_result = await llm_call(
        prompt=DETAILED_MODERATION_PROMPT,
        input=text,
        model="gpt-4-turbo",
        response_format="json"
    )
    return detailed_result
Cost profile:
- 80% of content is clearly safe/unsafe: $0.00025/query
- 20% needs escalation: $0.00025 + $0.012 = $0.01225/query
- Average: (0.8 × $0.00025) + (0.2 × $0.01225) = $0.00265/query
- vs. always using GPT-4: $0.012/query (78% savings)
Model-Specific Prompt Optimization
Different models have different strengths. Prompts optimized for GPT-4 may not work well for Llama.
Prompting Strategies by Model
GPT-4 / Claude:
- Handle complex instructions well
- Good at following multi-step tasks
- Respond well to few-shot examples
GPT-3.5 / Claude 3 Haiku:
- Need simpler, more explicit instructions
- Benefit from output format specifications
- May need more examples than GPT-4
Open-source (Llama, Mistral):
- Need very explicit instructions
- Benefit from structured prompts (format as Q&A, input/output pairs)
- May need prompt templates specific to the model's training
Example: Same task, different prompts
For GPT-4:
You are a customer support assistant.
Answer this question based on our knowledge base:
Knowledge base: {kb_context}
Question: {question}
Provide a helpful, professional response.
For Claude 3 Haiku (more explicit):
You are a customer support assistant.
Knowledge base:
{kb_context}
Question: {question}
Instructions:
1. Find the answer in the knowledge base
2. Provide a helpful, professional response
3. If the answer isn't in the knowledge base, say "I don't have that information"
4. Keep your response under 100 words
Answer:
For Llama 3 70B (even more structured):
### Task
Answer customer questions using the knowledge base.
### Knowledge Base
{kb_context}
### Question
{question}
### Instructions
- Only use information from the knowledge base
- Be helpful and professional
- If you don't know, say "I don't have that information"
- Maximum 100 words
### Answer
Evaluate each prompt with its target model. Don't assume one prompt works for all.
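One way to keep the variants manageable is a small registry keyed by task and model, so the router picks the prompt along with the model. A sketch; the *_SUPPORT_PROMPT names are hypothetical stand-ins for the three variants above:

# Prompt variants per task and model; unlisted models fall back to the default.
PROMPT_VARIANTS = {
    "customer-support-qa": {
        "default": GPT4_SUPPORT_PROMPT,          # the concise GPT-4-style prompt
        "claude-3-haiku": HAIKU_SUPPORT_PROMPT,  # the more explicit variant
        "llama-3-70b": LLAMA_SUPPORT_PROMPT,     # the structured ### variant
    },
}

def prompt_for(task: str, model: str) -> str:
    """Return the prompt tuned for this model, or the task default."""
    variants = PROMPT_VARIANTS[task]
    return variants.get(model, variants["default"])

async def answer(task: str, question: str, model: str) -> str:
    return await llm_call(prompt=prompt_for(task, model), input=question, model=model)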
Managing Multi-Model Infrastructure
Challenges
1. API Key Management
You now have OpenAI, Anthropic, Google, and maybe Hugging Face keys.
Solution: Use a secrets manager (AWS Secrets Manager, HashiCorp Vault) and a unified client:
class UnifiedLLMClient:
    def __init__(self):
        self.openai_client = OpenAI(api_key=get_secret("openai"))
        self.anthropic_client = Anthropic(api_key=get_secret("anthropic"))
        self.google_client = GoogleAI(api_key=get_secret("google"))

    async def call(self, model: str, prompt: str, **kwargs):
        if model.startswith("gpt"):
            return await self._call_openai(model, prompt, **kwargs)
        elif model.startswith("claude"):
            return await self._call_anthropic(model, prompt, **kwargs)
        elif model.startswith("gemini"):
            return await self._call_google(model, prompt, **kwargs)
        else:
            raise ValueError(f"Unknown model: {model}")
2. Rate Limits and Quotas
Each provider has different limits.
Solution: Implement fallbacks and request throttling:
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def call_with_fallback(primary_model, fallback_model, prompt):
    try:
        return await llm_call(primary_model, prompt)
    except RateLimitError:
        logger.warning(f"{primary_model} rate limited, trying {fallback_model}")
        return await llm_call(fallback_model, prompt)
3. Monitoring and Observability
Each model has different latency, error patterns, and costs.
Solution: Tag traces by model and task:
@evalops.trace(scenario="customer-support-qa")
async def answer_question(question: str):
    model = router.select_model("customer-support-qa", question)
    response = await llm_call(
        PROMPT,
        question,
        model=model,
        metadata={"model": model, "task": "customer-support-qa"}
    )
    return response
Now you can aggregate metrics by model:
- GPT-4: 89% accuracy, $0.012/query, 3.2s P95
- Claude Sonnet: 85% accuracy, $0.003/query, 1.5s P95
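With model and task tags on every call, the aggregation itself is a few lines. A sketch that assumes each logged trace record carries a score, cost, latency, and the metadata set above (an illustrative shape, not any particular tracing schema):

from collections import defaultdict
from statistics import mean

def metrics_by_model(traces: list[dict]) -> dict[str, dict]:
    """Group trace records by model and summarize quality, cost, and latency."""
    grouped = defaultdict(list)
    for t in traces:
        grouped[t["metadata"]["model"]].append(t)
    summary = {}
    for model, ts in grouped.items():
        latencies = sorted(t["latency_s"] for t in ts)
        summary[model] = {
            "accuracy": mean(t["score"] for t in ts),
            "cost_per_query": mean(t["cost"] for t in ts),
            "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],
        }
    return summary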
Cost Optimization Strategies
Strategy 1: Tiered Routing
Route by user tier:
- Free users → Cheap models
- Paid users → Mid-tier models
- Enterprise users → Best models
def select_model_by_tier(task: str, user_tier: str) -> str:
    model_tiers = {
        "free": {
            "customer-support": "claude-3-haiku",
            "summarization": "gpt-3.5-turbo"
        },
        "paid": {
            "customer-support": "claude-3-sonnet",
            "summarization": "gpt-4-turbo"
        },
        "enterprise": {
            "customer-support": "gpt-4-turbo",
            "summarization": "gpt-4-turbo"
        }
    }
    return model_tiers[user_tier][task]
Strategy 2: Batch Processing for Non-Urgent Tasks
Use cheaper models for batch jobs:
- Real-time chat → GPT-4 (users waiting)
- Nightly report generation → GPT-3.5 or open-source (no one waiting)
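In code, the split can be as simple as a flag on the call site. A minimal sketch, assuming the same llm_call helper and a hypothetical SUMMARIZATION_PROMPT:

async def summarize(text: str, realtime: bool) -> str:
    """Interactive requests get the strong model; batch jobs get the cheap one."""
    model = "gpt-4-turbo" if realtime else "gpt-3.5-turbo"
    return await llm_call(prompt=SUMMARIZATION_PROMPT, input=text, model=model)

# Nightly job: no one is waiting, so the cheaper model is fine.
# summaries = [await summarize(doc, realtime=False) for doc in overnight_queue]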
Strategy 3: Caching
Cache LLM responses for identical inputs:
import hashlib

# functools.lru_cache doesn't work with async functions, and the cached call
# needs the actual prompt/input (not just their hashes), so use a plain dict
# keyed on hashed inputs instead.
_cache: dict[tuple[str, str, str], str] = {}

async def call_with_cache(prompt: str, input: str, model: str) -> str:
    key = (
        hashlib.md5(prompt.encode()).hexdigest(),
        hashlib.md5(input.encode()).hexdigest(),
        model
    )
    if key not in _cache:
        _cache[key] = await llm_call(prompt, input, model)  # Actual LLM call
    return _cache[key]
For repeated queries (e.g., "How do I reset my password?"), this eliminates API costs entirely.
Strategy 4: Fine-Tuning for High-Volume Tasks
If you're running 100K+ queries/day on a task, consider fine-tuning a smaller model:
- Collect 1K+ examples of GPT-4 outputs on your task
- Fine-tune GPT-3.5 or an open-source model on this data
- Evaluate: does fine-tuned GPT-3.5 match GPT-4 quality?
- If yes, switch and save 95%
Example:
- Before: GPT-4 at 100K queries/day = $1,200/day
- After: Fine-tuned GPT-3.5 at 100K queries/day = $60/day
- Savings: $1,140/day = $416K/year
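A rough sketch of the data-collection step, assuming you already log GPT-4 inputs and outputs somewhere queryable. The JSONL layout shown is the chat-style fine-tuning format OpenAI documents; check your provider's current requirements before relying on it:

import json

def export_finetune_data(logged_examples: list[dict], path: str) -> None:
    """Write logged GPT-4 input/output pairs as chat-format JSONL for fine-tuning."""
    with open(path, "w") as f:
        for ex in logged_examples:  # expected keys here are illustrative
            record = {
                "messages": [
                    {"role": "system", "content": ex["system_prompt"]},
                    {"role": "user", "content": ex["input"]},
                    {"role": "assistant", "content": ex["gpt4_output"]},
                ]
            }
            f.write(json.dumps(record) + "\n")

# export_finetune_data(logged_examples, "support-finetune.jsonl")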
Evaluating Model Drift and Updates
Providers update models regularly: OpenAI has rolled out multiple GPT-4 versions, and Claude models are revised over time. An evaluation run against last quarter's version may no longer hold.
Monitor Model Versions
Log which exact model version you use:
response = openai.chat.completions.create(
    model="gpt-4-turbo-2024-04-09",  # Pin specific version
    ...
)
When providers announce updates, re-run evals:
# Compare new version to current
evalops compare \
--baseline gpt-4-turbo-2024-04-09 \
--candidate gpt-4-turbo-2024-08-15 \
--dataset customer-support-qa
If quality improves, upgrade. If it regresses, stick with the old version or adjust prompts before switching.
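If you'd rather gate the upgrade in code than eyeball a report, a simple threshold check works. This sketch reuses the benchmark_model helper from Step 3 rather than assuming any particular comparison output:

async def should_upgrade(baseline: str, candidate: str, dataset: list[dict],
                         prompt: str, max_regression: float = 0.02) -> bool:
    """Adopt the new model version only if quality holds within the tolerance."""
    old = await benchmark_model(baseline, dataset, prompt)
    new = await benchmark_model(candidate, dataset, prompt)
    return new["accuracy"] >= old["accuracy"] - max_regression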
Case Study: Multi-Model Migration
Scenario: A content platform uses GPT-4 for everything. Monthly bill: roughly $513K (about $17,100/day across the four tasks below).
Goal: Reduce costs by 70% without significantly impacting quality.
Approach:
Step 1: Task inventory
- Article summarization: 50K/day
- Comment moderation: 200K/day
- Content recommendations: 30K/day
- Author writing assistance: 5K/day
Step 2: Evaluation
Built eval sets for each task, tested 6 models (GPT-4, GPT-3.5, Claude variants, Gemini).
Step 3: Routing decisions
Task | Volume | Current (GPT-4) | New Model | Quality Impact | Daily Savings |
---|---|---|---|---|---|
Summarization | 50K | $3,000 | Claude 3 Haiku | -3% | $2,987 |
Moderation | 200K | $12,000 | Gemini Pro + GPT-4 cascade | -1% | $11,400 |
Recommendations | 30K | $1,800 | GPT-3.5 | -5% | $1,620 |
Writing assistance | 5K | $300 | GPT-4 (keep) | 0% | $0 |
Step 4: Implementation
Rolled out routing layer with fallbacks. Monitored for 2 weeks.
Results:
- Daily cost: $17,100 → $1,093 (a 94% reduction, well past the 70% goal)
- Quality: Minimal impact (-2% average across tasks)
- User satisfaction: Unchanged (no complaints)
- Annual savings: roughly $5.8M
Conclusion
There is no "best" LLM—only the best model for each specific task. Multi-model evaluation lets you match workloads to models systematically:
- Measure quality, cost, and latency for every candidate model
- Route tasks to the optimal model based on requirements
- Use dynamic routing for variable-complexity tasks
- Implement fallbacks for resilience
- Monitor and re-evaluate as models evolve
The result: better quality, lower costs, and more robustness than any single-model approach.
Start small:
- Pick your highest-cost task
- Evaluate 3 alternative models
- Compare cost-quality tradeoffs
- Switch the task with the clearest win
- Repeat for other tasks
Within a quarter, you can have a multi-model system that cuts LLM costs by 50-80% while maintaining, or even improving, quality.
Next Steps:
- Run your first multi-model evaluation with EvalOps
- Explore model routing patterns in Spellbook
- Join the community to discuss model selection strategies
Questions about optimizing your model mix? Email hello@evalops.dev.