Why Use the SDK?
While Grimoire CLI wraps your entire process, the EvalOps Python SDK lets you instrument individual functions and capture telemetry inline. This is ideal for:
- FastAPI or Flask apps
- Jupyter notebooks
- Data pipelines with LangChain or LlamaIndex
- Any Python code calling LLMs
Installation
pip install evalops
Basic Usage
import openai
from evalops import EvalOps
# Initialize (auto-discovers API key from EVALOPS_API_KEY env var)
evalops = EvalOps()
# Or specify manually
evalops = EvalOps(api_key="your-key-here", workspace="your-workspace")
# Capture a trace
with evalops.trace(scenario="customer-support-qa") as trace:
    trace.metadata({"ticket_id": "TKT-5678", "agent": "bot"})

    # Your LLM call
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "How do I reset my password?"}]
    )

    # Log it
    trace.log({
        "provider": "openai",
        "model": "gpt-4",
        "prompt": "How do I reset my password?",
        "response": response.choices[0].message.content,
        "tokens": response.usage.model_dump()
    })
That's it. The trace is now in your EvalOps workspace, ready for scoring.
Automatic Instrumentation
Prefer zero-config? Use the auto-instrumenter:
from evalops.auto import instrument
# Patch supported libraries
instrument(providers=["openai", "anthropic", "langchain"])
# Now all LLM calls are automatically traced
import openai
response = openai.chat.completions.create(...) # Captured automatically
This works with:
- OpenAI Python SDK
- Anthropic Claude SDK
- LangChain
- LlamaIndex
- Haystack
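For example, with only the Anthropic SDK patched, an ordinary call through its client is captured without any tracing code. A minimal sketch (the model name and prompt are illustrative, and ANTHROPIC_API_KEY is assumed to be set):
import anthropic
from evalops.auto import instrument

instrument(providers=["anthropic"])

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=256,
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(response.content[0].text)  # the call above is traced automatically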
Adding Custom Metrics
Want to score traces with your own logic? Define a custom metric:
from evalops.metrics import custom_metric
@custom_metric(name="contains_apology")
def check_apology(response: str) -> float:
"""Return 1.0 if response contains an apology, else 0.0"""
apology_words = ["sorry", "apologize", "apologies"]
return 1.0 if any(word in response.lower() for word in apology_words) else 0.0
# Apply it
with evalops.trace(scenario="support") as trace:
response = get_llm_response(...)
trace.log({"response": response})
trace.score(check_apology(response), metric="contains_apology")
Now every trace has a contains_apology score. View it in EvalOps dashboards or trigger alerts when the rate changes.
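The same decorator pattern works for any scalar check. Here is another sketch, this one flagging answers that blow past a length budget (the 400-character threshold is arbitrary and just for illustration):
from evalops.metrics import custom_metric

@custom_metric(name="within_length_budget")
def check_length(response: str) -> float:
    """Return 1.0 if the response stays under 400 characters, else 0.0."""
    return 1.0 if len(response) <= 400 else 0.0

# Scored the same way as contains_apology
with evalops.trace(scenario="support") as trace:
    response = get_llm_response(...)
    trace.log({"response": response})
    trace.score(check_length(response), metric="within_length_budget")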
Redacting PII
Before traces leave your environment, redact sensitive fields:
evalops = EvalOps(
    redact_fields=["email", "ssn", "credit_card"],
    redact_patterns=[r"\d{3}-\d{2}-\d{4}"]  # SSN regex
)
# These fields are automatically scrubbed before upload
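To sanity-check a redact_patterns entry before handing it to the SDK, you can preview it with Python's re module. This is plain standard-library code, independent of EvalOps:
import re

ssn_pattern = r"\d{3}-\d{2}-\d{4}"
sample = "Customer SSN is 123-45-6789, please verify."
print(re.sub(ssn_pattern, "[REDACTED]", sample))
# -> Customer SSN is [REDACTED], please verify.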
Local-Only Mode
Don't want to send traces to EvalOps yet? Keep them local:
evalops = EvalOps(mode="local", storage_dir="./traces")
# Traces saved to ./traces/*.json
Review them with:
evalops dashboard --local ./traces
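You can also inspect the saved files directly. The exact JSON schema isn't documented here, so the scenario field below is an assumption; adjust it to whatever keys your traces actually contain:
import json
from pathlib import Path

for path in sorted(Path("./traces").glob("*.json")):
    with path.open() as f:
        trace = json.load(f)
    # "scenario" is an assumed top-level key, not a documented guarantee
    print(path.name, trace.get("scenario"))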
FastAPI Integration
For web apps, use the middleware:
import openai
from fastapi import FastAPI
from evalops.integrations.fastapi import EvalOpsMiddleware
app = FastAPI()
app.add_middleware(EvalOpsMiddleware, workspace="production")
@app.post("/chat")
async def chat(message: str):
    # Automatically captured as a trace
    response = openai.chat.completions.create(...)
    return {"response": response.choices[0].message.content}
Every request becomes a trace with full context (endpoint, headers, latency).
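One quick way to see requests turn into traces during development is FastAPI's TestClient. This sketch assumes the /chat handler above is filled in with a real OpenAI call and that the relevant API keys are set:
from fastapi.testclient import TestClient

client = TestClient(app)  # the app defined above, with EvalOpsMiddleware attached
resp = client.post("/chat", params={"message": "How do I reset my password?"})
print(resp.json())  # this request should now show up as a trace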
CI/CD Integration
In your CI pipeline (the example below is a GitHub Actions step; GitLab CI is analogous):
- name: Run evals with EvalOps
  env:
    EVALOPS_API_KEY: ${{ secrets.EVALOPS_API_KEY }}
    EVALOPS_WORKSPACE: staging
  run: |
    pip install evalops
    python -m evalops exec --scenario "ci-regression" -- pytest tests/eval/
Traces are tagged with commit SHA, branch, and CI metadata.
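If you want extra context beyond what the SDK attaches automatically, you can record it yourself with the trace.metadata call shown earlier, pulling from GitHub Actions' built-in environment variables. A sketch, assuming an initialized evalops client:
import os

with evalops.trace(scenario="ci-regression") as trace:
    trace.metadata({
        "commit": os.environ.get("GITHUB_SHA", "unknown"),
        "branch": os.environ.get("GITHUB_REF_NAME", "unknown"),
    })
    # ...run the evaluated code and trace.log() as usual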
Next Steps
- View the full SDK reference
- Import a Spellbook recipe for pre-built scorecards
- Set up CI gates to block deploys on regressions
Questions? Open an issue or email hello@evalops.dev.