Why Your Evaluation Datasets Need Version Control
You wouldn't deploy code without version control. So why are you running evaluations against datasets that change without warning?
Evaluation datasets drift—examples get added, edge cases get fixed, broken prompts get pruned. If you're not versioning them, you can't reproduce evaluation runs, compare results across commits, or trust that your quality metrics mean the same thing week over week.
The problem: invisible dataset drift
Here's what happens without dataset versioning:
- Week 1: Your agent scores 87% on the customer support eval dataset (450 examples)
- Week 2: Someone adds 50 new edge cases from production incidents
- Week 3: Your agent now scores 79% on the "same" dataset (500 examples)
Did your agent regress? Or did the dataset just get harder? Without versioning, you can't tell.
The solution: pin datasets like dependencies
Treat evaluation datasets like npm packages: pin an exact version in a manifest and upgrade deliberately.
{
  "evaluations": {
    "customer-support-tier1": {
      "dataset": "evalops://datasets/support-tier1@v2.3.1",
      "scenarios": 487,
      "pinned_at": "2025-09-01T00:00:00Z"
    }
  }
}
Now when you run evals, you know exactly which examples you're testing against.
How EvalOps handles dataset versioning
Every time you upload or modify a dataset in EvalOps, we:
- Snapshot the state with a semantic version (e.g., v2.3.1)
- Compute a content hash so you can verify integrity (see the sketch after this list)
- Track lineage to show what changed between versions
- Allow pinning so CI runs use frozen datasets
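EvalOps computes the content hash on upload, but the idea is worth seeing concretely. The sketch below is a minimal illustration in Python, not the exact algorithm EvalOps uses: it hashes a JSONL dataset in a canonical form (sorted keys, sorted lines) so the same examples always produce the same digest, and the file path is just an example.

import hashlib
import json
from pathlib import Path

def dataset_content_hash(path: str) -> str:
    """Hash a JSONL dataset so identical content yields an identical digest.

    Each example is canonicalized (sorted keys, no extra whitespace) and the
    serialized lines are sorted, so key order and line order don't change the
    hash. This is an illustrative scheme, not necessarily the one EvalOps uses.
    """
    canonical_lines = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        example = json.loads(line)
        canonical_lines.append(json.dumps(example, sort_keys=True, separators=(",", ":")))

    digest = hashlib.sha256()
    for line in sorted(canonical_lines):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

# Compare the local digest against the hash recorded for the pinned version.
print(dataset_content_hash("./evals/support.jsonl"))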
Example: Pinning datasets in CI
# Pin to a specific dataset version
grimoire eval run \
  --scenario customer-support \
  --dataset support-tier1@v2.3.1 \
  --gate-on-regression

# Or always use the latest snapshot from a specific date
grimoire eval run \
  --scenario customer-support \
  --dataset support-tier1@2025-09-01 \
  --gate-on-regression
If the dataset changes after you pin, your CI runs stay deterministic. You decide when to upgrade.
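One way to keep the pin honest is to resolve the dataset version from the manifest instead of hard-coding it in the pipeline. Here's a rough sketch, assuming the manifest shown earlier is saved as evals.json (the filename is an assumption) and that the grimoire CLI is on the PATH; it only uses flags that appear in this post.

import json
import subprocess

# Load the pin manifest shown earlier (the evals.json path is an assumption).
with open("evals.json") as f:
    manifest = json.load(f)

pin = manifest["evaluations"]["customer-support-tier1"]
# "evalops://datasets/support-tier1@v2.3.1" -> "support-tier1@v2.3.1"
dataset_ref = pin["dataset"].rsplit("/", 1)[-1]

# Run the eval against the frozen dataset; a regression fails the build.
subprocess.run(
    [
        "grimoire", "eval", "run",
        "--scenario", "customer-support",
        "--dataset", dataset_ref,
        "--gate-on-regression",
    ],
    check=True,
)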
Comparing across dataset versions
Want to know if your new prompt works better and handles the new edge cases?
Run two evaluations in parallel:
# Baseline: old prompt, old dataset
grimoire eval run \
  --scenario support \
  --dataset support-tier1@v2.2.0 \
  --commit baseline-prompt

# Candidate: new prompt, new dataset
grimoire eval run \
  --scenario support \
  --dataset support-tier1@v2.3.1 \
  --commit new-prompt

# Diff the results
grimoire eval diff baseline-prompt new-prompt
EvalOps will show you:
- Score changes within the overlapping examples (apples-to-apples)
- Performance on new examples only (how well you handle fresh edge cases)
- Regressions on removed examples (did you drop coverage?)
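The bookkeeping behind that breakdown is mostly set arithmetic over stable example IDs. A minimal sketch, assuming each run's results are keyed by example ID with a per-example score (the field names and shapes are illustrative, not the EvalOps schema):

def diff_eval_results(baseline: dict[str, float], candidate: dict[str, float]) -> dict:
    """Compare two runs keyed by example ID (scores are per-example, e.g. 0.0-1.0)."""
    overlap = baseline.keys() & candidate.keys()
    added = candidate.keys() - baseline.keys()    # fresh edge cases in the new version
    removed = baseline.keys() - candidate.keys()  # examples pruned since the old version

    def mean(ids, results):
        return sum(results[i] for i in ids) / len(ids) if ids else None

    return {
        "overlap_delta": mean(overlap, candidate) - mean(overlap, baseline) if overlap else None,
        "new_examples_score": mean(added, candidate),
        "dropped_coverage": sorted(removed),
    }

# Hypothetical results keyed by example ID.
baseline = {"ex-001": 1.0, "ex-002": 0.0, "ex-003": 1.0}
candidate = {"ex-001": 1.0, "ex-002": 1.0, "ex-004": 0.0}
print(diff_eval_results(baseline, candidate))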
When to version
Version your datasets when:
- Adding examples from production incidents or red-team exercises
- Fixing ground truth after discovering labeling errors
- Pruning outdated scenarios that no longer reflect your product
- Splitting datasets into train/val/test or by difficulty tier
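For that last case, it helps if the split itself is deterministic, so regenerating a dataset version doesn't quietly reshuffle which examples land in which tier. A minimal sketch, assuming each example carries a stable ID (the ID format is hypothetical):

import hashlib

def split_bucket(example_id: str, val_fraction: float = 0.1, test_fraction: float = 0.1) -> str:
    """Assign an example to train/val/test using only its ID.

    Hashing the ID keeps the assignment stable across dataset versions:
    an example stays in its bucket even as other examples are added or removed.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    position = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash prefix into [0, 1]
    if position < test_fraction:
        return "test"
    if position < test_fraction + val_fraction:
        return "val"
    return "train"

print(split_bucket("ex-001"))  # the same ID always gets the same bucket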
Migration strategy
Already have evaluation datasets scattered across Jupyter notebooks, Google Sheets, and S3?
- Consolidate: Upload everything to EvalOps as v1.0.0
- Audit: Run your current agent against the snapshot to establish a baseline
- Pin: Update CI scripts to reference the versioned dataset
- Iterate: Make changes in EvalOps, not in spreadsheets
We provide a migration CLI to bulk-import from common formats:
grimoire dataset import \
  --source ./evals/support.jsonl \
  --name support-tier1 \
  --version v1.0.0
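If some of your examples live in spreadsheets, get them into JSONL before importing. A quick sketch, assuming a CSV export with input and expected_output columns (the column names and output shape are assumptions, not a required schema):

import csv
import json

# Convert a spreadsheet export into JSONL ready for grimoire dataset import.
with open("support_sheet_export.csv", newline="") as src, \
     open("./evals/support.jsonl", "w") as dst:
    for row in csv.DictReader(src):
        example = {
            "input": row["input"],
            "expected_output": row["expected_output"],
        }
        dst.write(json.dumps(example) + "\n")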
Stop guessing, start versioning
If you can't reproduce an evaluation run from 3 months ago, you don't have quality control—you have quality theater.
Version your datasets. Pin them in CI. Diff them across commits.
Talk to us about dataset versioning or check out the EvalOps docs on dataset management.