Pre-built EvalOps recipes for instant telemetry magic
Clone proven evaluation workflows, connect telemetry, and start reviewing results in minutes. Every spell packages scorecards, monitors, and governance guardrails so you can ship quickly without sacrificing trust.
Spells are the fastest way to stand up governed evaluation loops—each one captures the telemetry, scorecards, and automation you need.
Clone the recipe
Each spell includes scorecards, monitors, and scenario packs. Import them into EvalOps with one click and tailor metrics as you go.
Connect telemetry
Use Community Edition, CI runners, or production connectors to stream traces. Spells call out the exact inputs each workflow needs.
Review & iterate
Dashboards, alerts, and governance flows are ready to ship. Bring in stakeholders, capture evidence, and evolve the playbook together.
Operational recipes the EvalOps team runs every week
Start with these battle-tested workflows, then customize metrics, alerts, and governance flows to match your organization.
Regression Sentinel
Catch model regressions before they leave staging.
Daily and pre-release scorecards that diff evaluation traces across commits, so you can confirm the latest prompt or fine-tune behaves as expected.
Telemetry
- Source control metadata
- Scenario snapshots
- Guardrail feedback
Best for
- Platform teams
- Release engineering
Run the spell
- Connect Community Edition or CI runners to EvalOps telemetry ingestion.
- Import the Regression Sentinel scorecard template with precision, hallucination, and guardrail coverage checks.
- Wire the EvalOps CI Gate action into deployment so risky builds are blocked automatically.
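To make the gating idea concrete, here is a minimal sketch of diffing two scorecards and deciding whether to block a build. It is not the EvalOps CI Gate action or its API: the `Scorecard` shape, the `find_regressions` and `should_block` helpers, and the 0.02 tolerance are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Aggregate metrics for one commit (hypothetical shape, not an EvalOps type)."""
    commit: str
    precision: float            # higher is better
    hallucination_rate: float   # lower is better
    guardrail_coverage: float   # higher is better


def find_regressions(baseline: Scorecard, candidate: Scorecard,
                     tolerance: float = 0.02) -> list[str]:
    """Return the metrics that got worse by more than `tolerance` (assumed value)."""
    regressions = []
    if baseline.precision - candidate.precision > tolerance:
        regressions.append("precision")
    if candidate.hallucination_rate - baseline.hallucination_rate > tolerance:
        regressions.append("hallucination_rate")
    if baseline.guardrail_coverage - candidate.guardrail_coverage > tolerance:
        regressions.append("guardrail_coverage")
    return regressions


def should_block(baseline: Scorecard, candidate: Scorecard,
                 tolerance: float = 0.02) -> bool:
    """Block the build if any metric regressed beyond the tolerance."""
    return bool(find_regressions(baseline, candidate, tolerance))


# Illustrative numbers only, not measured results.
baseline = Scorecard("abc123", precision=0.92, hallucination_rate=0.03, guardrail_coverage=0.98)
candidate = Scorecard("def456", precision=0.88, hallucination_rate=0.035, guardrail_coverage=0.98)
```

In a CI job, a script like this would typically exit non-zero when `should_block` returns true, which is what causes the pipeline step to fail.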
Provider Bake-off
Benchmark multiple LLM providers with identical telemetry.
Run the same scenarios across OpenAI, Anthropic, Azure OpenAI, and Groq with unified scoring so you can pick the right provider for each workload.
Telemetry
- Provider responses
- Latency & token metrics
- Cost annotations
Best for
- Product teams
- Procurement
Run the spell
- Enable the provider connectors you want to compare and configure routing weights.
- Clone the Provider Bake-off kit for prompts, evaluation criteria, and dashboards.
- Review latency, quality, and cost side-by-side and export a procurement-ready report.
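As a rough illustration of the side-by-side review, the sketch below groups per-scenario results by provider and summarizes quality, latency, and cost. The record fields (`provider`, `quality`, `latency_ms`, `cost_usd`), the `summarize` helper, and the sample rows are assumptions for this example, not an EvalOps export format or real benchmark data.

```python
from collections import defaultdict
from statistics import mean, median


def summarize(runs: list[dict]) -> dict[str, dict[str, float]]:
    """Group scenario runs by provider and compute comparable summary metrics."""
    by_provider = defaultdict(list)
    for run in runs:
        by_provider[run["provider"]].append(run)

    summary = {}
    for provider, rows in by_provider.items():
        summary[provider] = {
            "mean_quality": mean(r["quality"] for r in rows),
            "median_latency_ms": median(r["latency_ms"] for r in rows),
            "total_cost_usd": sum(r["cost_usd"] for r in rows),
        }
    return summary


# Placeholder rows for two anonymous providers running the same two scenarios.
runs = [
    {"provider": "provider_a", "quality": 0.5, "latency_ms": 100, "cost_usd": 0.25},
    {"provider": "provider_a", "quality": 1.0, "latency_ms": 300, "cost_usd": 0.25},
    {"provider": "provider_b", "quality": 1.0, "latency_ms": 400, "cost_usd": 0.5},
    {"provider": "provider_b", "quality": 1.0, "latency_ms": 600, "cost_usd": 0.5},
]
report = summarize(runs)
```

Median latency is used here rather than the mean so that a single slow response does not dominate the comparison; a real report would likely add tail percentiles as well.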
Red Team Sandbox
Stress test prompts with adversarial scenarios and log everything.
Spin up adversarial evaluations that try to elicit jailbreaks, policy violations, or insecure behaviors—perfect for safety and security reviews.
Telemetry
- Adversarial prompts
- Policy violation scores
- Trace replay logs
Best for
- Security
- Trust & safety
Run the spell
- Import the Red Team Sandbox scenario pack with seeded adversarial prompts.
- Enable policy classifiers and guardrail scoring inside EvalOps.
- Route findings into PagerDuty or Slack so red teamers and engineers collaborate in real time.
Drift Watchtower
Monitor production traces for silent quality drift.
Continuously sample production traffic, score it against historical baselines, and notify the right people when performance slips.
Telemetry
- Production traces
- Baseline metrics
- Alert thresholds
Best for
- Observability
- ML Ops
Run the spell
- Set up live telemetry ingestion from your applications or message buses.
- Configure the Drift Watchtower monitor with baseline snapshots and thresholds.
- Deliver alerts into your incident tooling and attach remediation runbooks.
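One simple way to score sampled traffic against a baseline is a z-score on the mean quality score, sketched below. This is a stand-in technique for illustration, not a description of how the Drift Watchtower monitor is implemented; the function names, the threshold of -3.0, and the sample scores are assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def drift_z_score(baseline_scores: list[float], recent_scores: list[float]) -> float:
    """How many standard errors the recent mean sits from the baseline mean."""
    base_std = stdev(baseline_scores)
    if base_std == 0:
        raise ValueError("baseline has no variance; z-score is undefined")
    standard_error = base_std / sqrt(len(recent_scores))
    return (mean(recent_scores) - mean(baseline_scores)) / standard_error


def has_drifted(baseline_scores: list[float], recent_scores: list[float],
                z_threshold: float = -3.0) -> bool:
    """Flag drift only when quality drops; the threshold is an assumed value."""
    return drift_z_score(baseline_scores, recent_scores) < z_threshold


# Illustrative quality scores, not real measurements.
baseline_scores = [0.90, 0.92, 0.88, 0.91, 0.89]
degraded_sample = [0.80, 0.82, 0.81, 0.79]
healthy_sample = [0.90, 0.91, 0.89, 0.90]
```

The check is one-sided on purpose: an improvement in scores should not page anyone. Small samples make the estimate noisy, so a production monitor would usually require a minimum sample size before alerting.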
Agent Shadow Run
Run autonomous agents in shadow mode before production rollout.
Execute agents in parallel with human workflows, capture telemetry, and only promote when confidence surpasses defined thresholds.
Telemetry
- Agent decisions
- Human comparison data
- Success criteria
Best for
- Autonomy teams
- Operations
Run the spell
- Integrate agent execution logs into EvalOps via the shadow-run connector.
- Apply the Shadow Run scorecard to compare agent vs. human outcomes.
- Graduate agents once they consistently exceed human benchmarks.
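The promotion rule in the last step can be sketched as a check over recent evaluation windows. This is one possible reading of "consistently exceed", assuming the agent must beat the human success rate in several consecutive windows; `ready_to_promote`, its parameters, and the sample rates are hypothetical and not part of the Shadow Run scorecard.

```python
def ready_to_promote(windows: list[tuple[float, float]],
                     required_consecutive: int = 3,
                     margin: float = 0.0) -> bool:
    """Decide whether an agent can leave shadow mode.

    `windows` holds (agent_success_rate, human_success_rate) pairs, oldest first.
    The agent must beat the human rate by more than `margin` in each of the
    most recent `required_consecutive` windows.
    """
    if len(windows) < required_consecutive:
        return False  # not enough evidence yet
    recent = windows[-required_consecutive:]
    return all(agent > human + margin for agent, human in recent)


# Illustrative weekly success rates, not real data.
windows = [(0.80, 0.85), (0.88, 0.85), (0.90, 0.86), (0.91, 0.86)]
```

Requiring consecutive wins guards against promoting on a single lucky window; raising `margin` makes the bar stricter when the cost of an agent error is high.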
Support QA Scorecard
Evaluate customer support answers across CX, accuracy, and compliance.
Pull tickets or chat logs, run automated grading, and feed insights back into your CX and compliance programs.
Telemetry
- Support transcripts
- CX quality metrics
- Compliance tags
Best for
- Customer experience
- Compliance
Run the spell
- Connect your helpdesk or knowledge base to stream transcripts into EvalOps.
- Apply the Support QA scorecard with customer satisfaction and policy adherence metrics.
- Share dashboards with CX leadership and trigger retraining workflows when scores dip below thresholds.
Pair the spellbook with integrations and governance guardrails
Tell us which spell you need, the telemetry you’re wrangling, and who needs to sign off. We’ll send the configuration bundle and rollout plan, and connect you with a solutions engineer.