September 12, 2025

Hardening Evaluation Telemetry: Making Trace Capture Trustworthy

telemetry, engineering, infrastructure

The Telemetry Trust Problem

Your evaluation scores are only as good as the telemetry feeding them. If trace capture is inconsistent, incomplete, or inaccurate, your scorecards become noise. You'll make decisions based on bad data: shipping changes that actually regressed quality, blocking good changes because your test environment differs from production, or worse—not noticing silent degradation because your metrics aren't measuring what you think they are.

Telemetry drift happens when traces captured in different environments (local dev machines, CI, staging, production) don't reflect the same reality. Common causes:

  • Environment differences: Local has different dependencies than CI or production
  • Incomplete capture: Some traces are missing context, tool calls, or intermediate steps
  • Timing variations: Race conditions cause non-deterministic trace ordering
  • Data leakage: Sensitive information slips into traces and gets stored or transmitted
  • Configuration drift: Teams configure tracing differently across environments

When telemetry drifts, you lose confidence in evaluation results. A prompt that scores 85% locally might score 70% in production—not because the prompt changed, but because your test environment didn't match reality.

Hardening telemetry means building safeguards that ensure trace capture is deterministic, complete, and trustworthy no matter where it runs.

The Five Pillars of Telemetry Hardening

1. Deterministic Collection

Every trace must be reproducible. Given the same inputs and environment, you should capture identical traces.

What we capture:

Every trace includes:

  • Git metadata: Commit SHA, branch name, repository URL
  • Project hash: Content-addressed hash of all relevant source files
  • Working directory state: Which files were present, which were modified
  • Environment context: OS, runtime version, key dependencies
  • Execution metadata: Entry point, command-line arguments, environment variables (sanitized)
  • Timestamp and timezone: When the trace was captured (for temporal analysis)
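
In code, that context might look like the following shape. This is a sketch only; the field names are illustrative, not Grimoire's published schema:

// Illustrative shape of the captured context; field names are
// assumptions for this sketch, not Grimoire's actual schema.
interface TraceMetadata {
  git: {
    commit: string;       // e.g. "abc123f"
    branch: string;       // e.g. "main"
    remoteUrl: string;
    dirty: boolean;       // uncommitted changes present?
  };
  projectHash: string;    // content-addressed hash of tracked files
  environment: {
    os: string;           // e.g. "linux"
    runtime: string;      // e.g. "node v20.11.0"
    dependencies: Record<string, string>;  // name -> resolved version
  };
  execution: {
    entryPoint: string;
    args: string[];
    env: Record<string, string>;  // sanitized before capture
  };
  capturedAt: string;     // ISO 8601 timestamp with explicit timezone
}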

Why this matters:

When you compare evaluation scores across commits, you need to know exactly what code produced each trace. Git SHA alone isn't enough—developers might have uncommitted changes, or the test might run on a dirty working directory. Project hashing catches this.

Implementation in Grimoire:

import { grimoire } from '@evalops/grimoire';

// Automatic context capture
const trace = grimoire.startTrace({
  scenario: 'customer-support-qa',
  // These are captured automatically:
  // - git commit, branch, dirty status
  // - project hash (content-addressed)
  // - runtime environment
});

Grimoire computes a project hash by:

  1. Finding all files tracked by git
  2. Applying your .grimoire/ignore patterns (like .gitignore)
  3. Creating a Merkle tree of file contents
  4. Producing a deterministic hash

If any file changes (even uncommitted), the project hash changes. This lets you definitively know whether two traces were captured from identical code.
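
For intuition, here is a minimal sketch of content-addressed project hashing, assuming Node's crypto module and a git checkout. It skips the .grimoire/ignore step and hashes sorted (path, digest) pairs rather than building a full tree, which preserves the property that matters: any change to any tracked file changes the final hash.

import { createHash } from 'crypto';
import { readFileSync } from 'fs';
import { execSync } from 'child_process';

// Sketch of a content-addressed project hash. A full Merkle tree
// hashes directory nodes recursively; hashing sorted (path, digest)
// pairs keeps the same key property: if any tracked file changes,
// even uncommitted, the final hash changes.
function projectHash(): string {
  const files = execSync('git ls-files', { encoding: 'utf8' })
    .split('\n')
    .filter(Boolean)
    .sort(); // deterministic ordering

  const root = createHash('sha256');
  for (const path of files) {
    // Hash working-tree contents, so uncommitted edits are included
    const digest = createHash('sha256').update(readFileSync(path)).digest('hex');
    root.update(`${path}\0${digest}\n`);
  }
  return root.digest('hex');
}

console.log(projectHash().slice(0, 8)); // e.g. "7f3a9c2b"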

Verification:

# Check what will be included in traces
grimoire check

# Output:
# ✓ Git repository detected: main @ abc123f
# ✓ Project hash: 7f3a9c2b...
# ✓ 145 files tracked
# ⚠ 3 uncommitted changes detected
#   - src/prompts/support.ts (modified)
#   - tests/eval/support.test.ts (new file)
#   - package.json (modified)
#
# Traces will be marked as "dirty" until changes are committed.

2. Snapshot Verification in CI

CI environments are ephemeral and easy to misconfigure. Common problems:

  • Missing dependencies: CI installs production deps but not dev/test deps
  • Skipped directories: Build process excludes test data directories
  • Wrong runtime version: Tests run on Node 18 but production uses Node 20
  • Environment variables missing: API keys, config values not set

When these issues occur, CI-generated traces don't match local or production reality. Evaluation scores from CI become meaningless.

Snapshot verification ensures CI environments match declared policies before allowing trace upload.

How it works:

Define a snapshot policy in .grimoire/policy.yml:

snapshots:
  required_files:
    - "src/**/*.ts"
    - "prompts/**/*.txt"
    - "tests/eval/**/*.json"
  
  required_env_vars:
    - OPENAI_API_KEY
    - EVALOPS_API_KEY
    - NODE_ENV
  
  runtime:
    node_version: ">=20.0.0"
    grimoire_version: "^1.2.0"
  
  dependencies:
    must_match: package-lock.json  # Ensure exact dep versions

When CI runs, Grimoire:

  1. Captures a snapshot of the environment
  2. Compares it to the declared policy
  3. Rejects trace upload if there's a mismatch
  4. Provides a remediation guide
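
A minimal sketch of that comparison step (illustrative only, not Grimoire's source; it checks concrete file paths rather than globs, and collects every violation before failing so the remediation guide can list all problems at once):

import { existsSync } from 'fs';

// Illustrative snapshot check: collect every violation, then fail once.
interface SnapshotPolicy {
  requiredFiles: string[];     // concrete paths for simplicity (no globs)
  requiredEnvVars: string[];
  minNodeMajor: number;
}

function verifySnapshot(policy: SnapshotPolicy): string[] {
  const problems: string[] = [];
  for (const file of policy.requiredFiles) {
    if (!existsSync(file)) problems.push(`Missing required file: ${file}`);
  }
  for (const name of policy.requiredEnvVars) {
    if (!process.env[name]) problems.push(`Missing environment variable: ${name}`);
  }
  const major = Number(process.version.slice(1).split('.')[0]);
  if (major < policy.minNodeMajor) {
    problems.push(`Node ${process.version} (required: >=${policy.minNodeMajor}.0.0)`);
  }
  return problems; // empty means the snapshot passes
}

const issues = verifySnapshot({
  requiredFiles: ['tests/eval/support-scenarios.json'],
  requiredEnvVars: ['OPENAI_API_KEY', 'EVALOPS_API_KEY'],
  minNodeMajor: 20,
});
if (issues.length > 0) {
  issues.forEach((p) => console.error(p));
  process.exit(1); // reject before any traces are captured
}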

Example failure:

$ grimoire exec -- npm test

❌ Snapshot verification failed

Missing required files:
  - tests/eval/support-scenarios.json
  Expected location: /workspace/tests/eval/support-scenarios.json
  
Missing environment variables:
  - OPENAI_API_KEY

Runtime version mismatch:
  Node: 18.19.0 (required: >=20.0.0)

Remediation:
  1. Ensure CI runs on Node 20+
  2. Add OPENAI_API_KEY to CI secrets
  3. Verify tests/eval directory is not excluded by .dockerignore

Traces NOT uploaded. Fix the issues above and retry.

This fails fast—before wasting CI minutes on evals that won't be valid.

CI configuration example (GitHub Actions):

name: Evaluation Tests

on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - uses: actions/setup-node@v3
        with:
          node-version: '20'  # Match production
      
      - run: npm ci  # Exact dep versions from lock file
      
      - name: Verify environment before evals
        run: grimoire check --strict
      
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          EVALOPS_API_KEY: ${{ secrets.EVALOPS_API_KEY }}
        run: grimoire exec -- npm run test:eval

The --strict flag enforces snapshot policy compliance before any traces are captured.

3. Redaction Pipelines

Traces often contain sensitive data: customer PII, API keys, internal system details, proprietary prompts. This data must be redacted before:

  • Leaving your network (if sending to EvalOps cloud)
  • Being written to disk (if using local-only mode)
  • Being shown in dashboards or shared with teams

Field-level redaction:

Configure which fields to redact and how:

# .grimoire/redaction.yml
redaction:
  # Simple field removal
  remove_fields:
    - "user.email"
    - "user.phone"
    - "request.headers.authorization"
  
  # Pattern-based redaction (regex)
  patterns:
    - name: credit_card
      regex: '\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b'
      replacement: "[REDACTED_CC]"
    
    - name: ssn
      regex: '\b\d{3}-\d{2}-\d{4}\b'
      replacement: "[REDACTED_SSN]"
    
    - name: email
      regex: '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
      replacement: "[REDACTED_EMAIL]"
  
  # Hash fields instead of removing (preserves uniqueness for deduplication)
  hash_fields:
    - "user.id"
    - "session.id"
  
  # Allow-list: fields exempt from pattern-based redaction
  allow:
    - "model.parameters.temperature"  # Numeric params can false-positive on digit patterns

Redaction happens at capture time, before traces leave the process. This ensures sensitive data is never persisted or transmitted.
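
A minimal sketch of the pattern and hashing stages as they might run in process (illustrative, not the actual pipeline; the patterns mirror the config above):

import { createHash } from 'crypto';

// Illustrative in-process redaction: regex replacements over string
// fields, plus hashing (rather than dropping) identifier fields so
// traces can still be deduplicated.
const patterns = [
  { regex: /\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/g, replacement: '[REDACTED_CC]' },
  { regex: /\b\d{3}-\d{2}-\d{4}\b/g, replacement: '[REDACTED_SSN]' },
  { regex: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g, replacement: '[REDACTED_EMAIL]' },
];

function redactString(value: string): string {
  return patterns.reduce((s, p) => s.replace(p.regex, p.replacement), value);
}

function hashField(value: string): string {
  // Preserves uniqueness without exposing the raw identifier
  return 'hash:' + createHash('sha256').update(value).digest('hex').slice(0, 8);
}

console.log(redactString('My credit card 4532-1234-5678-9010 was charged twice'));
// "My credit card [REDACTED_CC] was charged twice"
console.log(hashField('user-12345')); // "hash:" + first 8 hex chars of sha256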

Example:

# Trace before redaction
{
  "user": {
    "id": "user-12345",
    "email": "alice@example.com",
    "phone": "+1-555-0123"
  },
  "query": "My credit card 4532-1234-5678-9010 was charged twice",
  "response": "I'll look into that charge for card ending in 9010..."
}

# After redaction
{
  "user": {
    "id": "hash:7f3a9c2b"  # Hashed; email and phone removed via remove_fields
  },
  "query": "My credit card [REDACTED_CC] was charged twice",
  "response": "I'll look into that charge for card ending in 9010..."
}

Note: The response still contains partial card info ("9010"). You might want tighter redaction depending on compliance requirements.

Custom redaction functions:

For domain-specific PII, write custom redactors:

import re

from evalops import EvalOps, Redactor

@Redactor.register("patient_id")
def redact_patient_ids(text: str) -> str:
    """Redact patterns like PT-123456 (patient IDs in a medical system)."""
    return re.sub(r'\bPT-\d{6}\b', '[REDACTED_PATIENT_ID]', text)

# Apply to traces
evalops = EvalOps(redaction_policies=["default", "patient_id"])

Verification:

Test redaction locally before deploying:

grimoire test-redaction < sample-trace.json

# Output:
# ✓ Redacted 3 email addresses
# ✓ Redacted 1 credit card number
# ✓ Hashed 2 user IDs
# ⚠ Potential PII detected but not redacted:
#   - Phone number: "+1-555-0123" (line 47)
#
# Add phone number redaction or allow-list this pattern.

4. Audit Trails

For governance and compliance, you need to know:

  • Who captured which traces?
  • When were they captured?
  • What code version was running?
  • Were they modified after capture?

Immutable trace IDs:

Every trace gets a cryptographically signed ID:

trace-7f3a9c2b-abc123f-20250915T143022Z-sign:3f9a8c1d

Components:

  • 7f3a9c2b: Project hash
  • abc123f: Git commit SHA
  • 20250915T143022Z: Timestamp (ISO 8601)
  • sign:3f9a8c1d: HMAC signature (prevents tampering)

If anyone modifies the trace after capture, the signature verification fails.
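
Here is a minimal sketch of the sign-and-verify round trip using an HMAC, assuming a server-held signing key; key management and the exact ID layout are simplified, and the 8-character truncation matches the display format above:

import { createHmac, timingSafeEqual } from 'crypto';

// Illustrative HMAC signing for trace IDs. The signing key would be
// held server-side or in a KMS, never alongside the traces.
function sign(payload: string, key: string): string {
  return createHmac('sha256', key).update(payload).digest('hex').slice(0, 8);
}

function verify(payload: string, signature: string, key: string): boolean {
  const expected = Buffer.from(sign(payload, key));
  const actual = Buffer.from(signature);
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}

const key = process.env.TRACE_SIGNING_KEY ?? 'dev-only-key';
const idBody = 'trace-7f3a9c2b-abc123f-20250915T143022Z';
const sig = sign(idBody, key);
console.log(`${idBody}-sign:${sig}`);        // matches the format above
console.log(verify(idBody, sig, key));       // true
console.log(verify(idBody + 'x', sig, key)); // false: tampering detected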

Audit logs:

EvalOps logs every trace operation:

{
  "event": "trace.uploaded",
  "trace_id": "trace-7f3a9c2b-abc123f-20250915T143022Z-sign:3f9a8c1d",
  "user": "alice@company.com",
  "source": "ci-github-actions",
  "git_commit": "abc123f",
  "timestamp": "2025-09-15T14:30:22Z",
  "redaction_applied": true,
  "policy_version": "v2.1"
}

Query audit logs to answer compliance questions:

  • "Show all traces captured from production in the last 30 days"
  • "Which user uploaded traces with failed redaction?"
  • "What code version was running when this trace was captured?"

5. Differential Privacy (for sensitive domains)

In highly regulated environments (healthcare, finance), even redacted traces might reveal sensitive patterns. Differential privacy adds noise to aggregate metrics while preserving statistical accuracy.

How it works:

Instead of uploading raw traces, Grimoire:

  1. Computes aggregate statistics locally (accuracy rates, token usage distributions, error frequencies)
  2. Adds calibrated noise (Laplace or Gaussian mechanism)
  3. Uploads noisy aggregates instead of individual traces

Example:

grimoire eval run \
  --scenario medical-qa \
  --differential-privacy \
  --epsilon 0.5  # Privacy budget

This gives you evaluation scores without exposing individual interactions.
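
For intuition, here is a minimal sketch of the Laplace mechanism applied to a single count; production implementations also track the cumulative privacy budget across queries:

// Laplace mechanism sketch: noise scaled to sensitivity / epsilon.
// For a count, one individual changes the result by at most 1, so
// sensitivity = 1; smaller epsilon means more noise, more privacy.
function laplaceNoise(scale: number): number {
  // Inverse-CDF sampling of the Laplace distribution
  const u = Math.random() - 0.5;
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

function noisyCount(trueCount: number, epsilon: number, sensitivity = 1): number {
  return trueCount + laplaceNoise(sensitivity / epsilon);
}

// Hypothetical aggregate: passing evals in a batch, epsilon = 0.5
console.log(noisyCount(847, 0.5).toFixed(1));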

Trade-off:

  • Pro: Strong privacy guarantees (formal differential privacy)
  • Con: Can't drill into individual traces for debugging
  • Use case: Regulated industries, customer data lakes, sensitive workloads

Cross-Environment Consistency

A major source of telemetry drift: local, CI, and production environments differ subtly. Strategies to keep them aligned:

Strategy 1: Containerize Everything

Run evals in Docker containers with pinned dependencies:

# Dockerfile.eval
FROM node:20-alpine

WORKDIR /app

# Copy lock file for exact dep versions
COPY package-lock.json ./
COPY package.json ./

RUN npm ci  # Install dev/test deps too; the eval suite needs them

COPY . .

# Run evals
CMD ["grimoire", "exec", "--", "npm", "run", "test:eval"]

Use this container locally, in CI, and in production. Guaranteed consistency.

Strategy 2: Environment Parity Checks

Before running evals, assert environment matches expected state:

// tests/setup.ts
import assert from 'assert';
import fs from 'fs';

// Verify runtime environment
assert(process.version.startsWith('v20'), 'Node 20+ required');
assert(process.env.NODE_ENV === 'test', 'NODE_ENV must be test');
assert(process.env.OPENAI_API_KEY, 'OPENAI_API_KEY required');

// Verify file structure
assert(fs.existsSync('tests/eval/dataset.json'), 'Eval dataset missing');

If assertions fail, tests abort before generating invalid traces.

Strategy 3: Lockfile Enforcement

Ensure CI uses exact dependency versions from package-lock.json or bun.lock:

# GitHub Actions: Use npm ci (not npm install)
- run: npm ci  # Installs from lock file, errors if package.json changed

Prevents "works on my machine" issues caused by floating dependency versions.

Monitoring Telemetry Health

How do you know your telemetry is healthy?

Metrics to track:

  1. Capture rate: Are all expected traces being captured?
     • Target: 100% of eval runs, >99% of production requests
  2. Schema conformance: Do traces match the expected structure?
     • Alert if fields are missing or types are wrong
  3. Redaction success rate: Is PII being caught?
     • Sample traces and check for leakage
  4. Environment consistency: Do traces from different environments share the same project hash?
     • Alert if CI traces diverge from local
  5. Upload latency: How long do traces take to reach EvalOps?
     • Target: <5s P95
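
A minimal sketch of the first two checks (illustrative thresholds and field names, not an EvalOps API):

// Illustrative health checks over a batch of captured traces.
function checkTelemetryHealth(
  expectedRuns: number,
  traces: Array<Record<string, unknown>>,
): string[] {
  const alerts: string[] = [];

  // 1. Capture rate: captured traces vs expected eval runs
  const captureRate = traces.length / expectedRuns;
  if (captureRate < 0.99) {
    alerts.push(`Capture rate ${(captureRate * 100).toFixed(1)}% is below 99%`);
  }

  // 2. Schema conformance: required fields present with expected types
  const requiredFields = ['trace_id', 'git_commit', 'timestamp'];
  const nonConforming = traces.filter((t) =>
    requiredFields.some((f) => typeof t[f] !== 'string'),
  ).length;
  if (traces.length > 0 && nonConforming / traces.length >= 0.01) {
    alerts.push(`${nonConforming} traces failed schema checks (>=1%)`);
  }

  return alerts; // route non-empty results to #telemetry-alerts
}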

Dashboard example:

# EvalOps dashboard
telemetry_health:
  - metric: capture_rate
    threshold: ">99%"
    alert_channel: "#telemetry-alerts"
  
  - metric: redaction_failures
    threshold: "0"
    alert_channel: "#security-alerts"
  
  - metric: schema_errors
    threshold: "<1%"
    alert_channel: "#telemetry-alerts"

Common Telemetry Pitfalls

Pitfall 1: Sampling in the Wrong Place

Don't sample before redaction:

import random

# WRONG: redaction only runs for the sampled 10%; any trace
# captured or buffered outside this branch is never redacted
if random.random() < 0.1:  # 10% sampling
    trace = capture_trace()
    redact(trace)
    upload(trace)

Instead:

import random

# RIGHT: redact every trace at capture time, then sample uploads
trace = capture_trace()
redact(trace)  # Every trace is redacted
if random.random() < 0.1:
    upload(trace)  # Only upload 10%

Pitfall 2: Ignoring Clock Skew

Different machines have different system clocks. When aggregating traces by time, use a consistent timestamp source:

from datetime import datetime, timezone

# Use ISO 8601 with explicit timezone
timestamp = datetime.now(timezone.utc).isoformat()

Don't rely on local system time—it might be wrong.

Pitfall 3: Incomplete Spans

If your trace has multiple steps (retrieval → reasoning → generation), ensure all spans are captured even if one fails:

trace = grimoire.startTrace(scenario="rag-qa")

try:
    docs = trace.span("retrieval").run(retrieve(query))
    answer = trace.span("generation").run(generate(query, docs))
except Exception as e:
    # Capture the failure
    trace.span("error").end(error=str(e))
finally:
    trace.end()  # Always close the trace

Pitfall 4: Overly Aggressive Redaction

If you redact too much, traces become useless for debugging:

# Too aggressive: Can't debug anything
redact(trace, remove_all_text=True)

Balance: Redact PII but preserve enough context to understand behavior.

Private Cloud and On-Premises Deployments

For organizations that can't send traces to EvalOps cloud, we offer:

Self-hosted EvalOps:

  • Runs entirely in your VPC or data center
  • No data leaves your network
  • Supports customer-managed encryption keys
  • Can integrate with your existing auth (SSO, SAML)

Air-gapped mode:

  • Grimoire runs in fully offline environments
  • Traces stored locally, reviewed on-premises
  • Scorecards run on local compute
  • No internet connectivity required

Hybrid mode:

  • Sensitive traces stay local
  • Aggregated metrics (no PII) sync to cloud for dashboards
  • Best of both worlds: privacy + centralized visibility

Email enterprise@evalops.dev for deployment options.

Conclusion

Telemetry hardening isn't glamorous, but it's foundational. Without trustworthy traces, your evaluation scores are fiction. You can't confidently ship changes, can't diagnose regressions, and can't prove compliance with safety or privacy requirements.

The safeguards we've covered—deterministic collection, snapshot verification, redaction pipelines, audit trails, differential privacy, and cross-environment consistency—turn trace capture from a liability into an asset.

Start with:

  1. Add git metadata and project hashing to traces
  2. Define snapshot policies for CI environments
  3. Configure field-level redaction for PII
  4. Verify redaction with test traces
  5. Monitor telemetry health metrics

These steps take a few hours but pay dividends every time you run evaluations. Trustworthy telemetry means trustworthy evaluation means confident deployments.


Questions about hardening telemetry for your environment? Email hello@evalops.dev.