Why Your Evaluation Datasets Need Version Control
You wouldn't deploy code without version control. So why are you running evaluations against datasets that change without warning?
Evaluation datasets drift—examples get added, edge cases get fixed, broken prompts get pruned. If you're not versioning them, you can't reproduce evaluation runs, compare results across commits, or trust that your quality metrics mean the same thing week over week.
The problem: invisible dataset drift
Here's what happens without dataset versioning:
- Week 1: Your agent scores 87% on the customer support eval dataset (450 examples)
- Week 2: Someone adds 50 new edge cases from production incidents
- Week 3: Your agent now scores 79% on the "same" dataset (500 examples)
Did your agent regress? Or did the dataset just get harder? Without versioning, you can't tell.
The solution: pin datasets like dependencies
Treat evaluation datasets like npm packages: pin an exact version in a manifest and upgrade deliberately.
{
  "evaluations": {
    "customer-support-tier1": {
      "dataset": "evalops://datasets/support-tier1@v2.3.1",
      "scenarios": 487,
      "pinned_at": "2025-09-01T00:00:00Z"
    }
  }
}
Now when you run evals, you know exactly which examples you're testing against.
How EvalOps handles dataset versioning
Every time you upload or modify a dataset in EvalOps, we:
- Snapshot the state with a semantic version (e.g., v2.3.1)
- Compute a content hash so you can verify integrity (see the sketch after this list)
- Track lineage to show what changed between versions
- Allow pinning so CI runs use frozen datasets
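EvalOps computes the content hash on upload, but the idea is worth seeing concretely. The sketch below is a minimal illustration in Python, not the exact algorithm EvalOps uses: it hashes a JSONL dataset in a canonical form (sorted keys, sorted lines) so the same examples always produce the same digest, and the file path is just an example.

import hashlib
import json
from pathlib import Path

def dataset_content_hash(path: str) -> str:
    """Hash a JSONL dataset so identical content yields an identical digest.

    Each example is canonicalized (sorted keys, no extra whitespace) and the
    serialized lines are sorted, so key order and line order don't change the
    hash. This is an illustrative scheme, not necessarily the one EvalOps uses.
    """
    canonical_lines = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        example = json.loads(line)
        canonical_lines.append(json.dumps(example, sort_keys=True, separators=(",", ":")))

    digest = hashlib.sha256()
    for line in sorted(canonical_lines):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

# Compare the local digest against the hash recorded for the pinned version.
print(dataset_content_hash("./evals/support.jsonl"))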
Example: Pinning datasets in CI
# Pin to a specific dataset version
grimoire eval run \
  --scenario customer-support \
  --dataset support-tier1@v2.3.1 \
  --gate-on-regression

# Or always use the latest snapshot from a specific date
grimoire eval run \
  --scenario customer-support \
  --dataset support-tier1@2025-09-01 \
  --gate-on-regression
If the dataset changes after you pin, your CI runs stay deterministic. You decide when to upgrade.
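One way to keep the pin honest is to resolve the dataset version from the manifest instead of hard-coding it in the pipeline. Here's a rough sketch, assuming the manifest shown earlier is saved as evals.json (the filename is an assumption) and that the grimoire CLI is on the PATH; it only uses flags that appear in this post.

import json
import subprocess

# Load the pin manifest shown earlier (the evals.json path is an assumption).
with open("evals.json") as f:
    manifest = json.load(f)

pin = manifest["evaluations"]["customer-support-tier1"]
# "evalops://datasets/support-tier1@v2.3.1" -> "support-tier1@v2.3.1"
dataset_ref = pin["dataset"].rsplit("/", 1)[-1]

# Run the eval against the frozen dataset; a regression fails the build.
subprocess.run(
    [
        "grimoire", "eval", "run",
        "--scenario", "customer-support",
        "--dataset", dataset_ref,
        "--gate-on-regression",
    ],
    check=True,
)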
Comparing across dataset versions
Want to know if your new prompt works better and handles the new edge cases?
Run two evaluations in parallel:
# Baseline: old prompt, old dataset
grimoire eval run \
  --scenario support \
  --dataset support-tier1@v2.2.0 \
  --commit baseline-prompt

# Candidate: new prompt, new dataset
grimoire eval run \
  --scenario support \
  --dataset support-tier1@v2.3.1 \
  --commit new-prompt

# Diff the results
grimoire eval diff baseline-prompt new-prompt
EvalOps will show you:
- Score changes within the overlapping examples (apples-to-apples)
- Performance on new examples only (how well you handle fresh edge cases)
- Regressions on removed examples (did you drop coverage?)
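The bookkeeping behind that breakdown is mostly set arithmetic over stable example IDs. A minimal sketch, assuming each run's results are keyed by example ID with a per-example score (the field names and shapes are illustrative, not the EvalOps schema):

def diff_eval_results(baseline: dict[str, float], candidate: dict[str, float]) -> dict:
    """Compare two runs keyed by example ID (scores are per-example, e.g. 0.0-1.0)."""
    overlap = baseline.keys() & candidate.keys()
    added = candidate.keys() - baseline.keys()    # fresh edge cases in the new version
    removed = baseline.keys() - candidate.keys()  # examples pruned since the old version

    def mean(ids, results):
        return sum(results[i] for i in ids) / len(ids) if ids else None

    return {
        "overlap_delta": mean(overlap, candidate) - mean(overlap, baseline) if overlap else None,
        "new_examples_score": mean(added, candidate),
        "dropped_coverage": sorted(removed),
    }

# Hypothetical results keyed by example ID.
baseline = {"ex-001": 1.0, "ex-002": 0.0, "ex-003": 1.0}
candidate = {"ex-001": 1.0, "ex-002": 1.0, "ex-004": 0.0}
print(diff_eval_results(baseline, candidate))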
When to version
Version your datasets when:
- Adding examples from production incidents or red-team exercises
- Fixing ground truth after discovering labeling errors
- Pruning outdated scenarios that no longer reflect your product
- Splitting datasets into train/val/test or by difficulty tier
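For that last case, it helps if the split itself is deterministic, so regenerating a dataset version doesn't quietly reshuffle which examples land in which tier. A minimal sketch, assuming each example carries a stable ID (the ID format is hypothetical):

import hashlib

def split_bucket(example_id: str, val_fraction: float = 0.1, test_fraction: float = 0.1) -> str:
    """Assign an example to train/val/test using only its ID.

    Hashing the ID keeps the assignment stable across dataset versions:
    an example stays in its bucket even as other examples are added or removed.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    position = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash prefix into [0, 1]
    if position < test_fraction:
        return "test"
    if position < test_fraction + val_fraction:
        return "val"
    return "train"

print(split_bucket("ex-001"))  # the same ID always gets the same bucket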
Migration strategy
Already have evaluation datasets scattered across Jupyter notebooks, Google Sheets, and S3?
- Consolidate: Upload everything to EvalOps as v1.0.0
- Audit: Run your current agent against the snapshot to establish a baseline
- Pin: Update CI scripts to reference the versioned dataset
- Iterate: Make changes in EvalOps, not in spreadsheets
We provide a migration CLI to bulk-import from common formats:
grimoire dataset import \
  --source ./evals/support.jsonl \
  --name support-tier1 \
  --version v1.0.0
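If some of your examples live in spreadsheets, get them into JSONL before importing. A quick sketch, assuming a CSV export with input and expected_output columns (the column names and output shape are assumptions, not a required schema):

import csv
import json

# Convert a spreadsheet export into JSONL ready for grimoire dataset import.
with open("support_sheet_export.csv", newline="") as src, \
     open("./evals/support.jsonl", "w") as dst:
    for row in csv.DictReader(src):
        example = {
            "input": row["input"],
            "expected_output": row["expected_output"],
        }
        dst.write(json.dumps(example) + "\n")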
Stop guessing, start versioning
If you can't reproduce an evaluation run from 3 months ago, you don't have quality control—you have quality theater.
Version your datasets. Pin them in CI. Diff them across commits.
Talk to us about dataset versioning or check out the EvalOps docs on dataset management.