Golden datasets

A golden dataset is the heart of the framework: a curated set of
(input, expected output) cases that encode what correct means for your
pipeline. It is YAML, not a database model — reviewable in pull requests,
diffable across releases, and stored in eval/golden/*.yml next to your code.

The dataset contract

schema_version: eval-harness.dataset.v1
name: rag.factuality.fy2026
samples:
  - id: capital-france
    input:
      question: "What is the capital of France?"
    expected_output: "Paris"
    metadata:
      tags: [geography, easy]

| Field | Required | Meaning |
| --- | --- | --- |
| `schema_version` | no | Defaults to `eval-harness.dataset.v1`. Keep it explicit for forward-compatibility. |
| `name` | yes | The dataset id you pass to `eval-harness:run`. |
| `samples` | yes | The list of evaluation cases. |
| `samples[].id` | yes | Stable identifier; aligns saved outputs and labels report rows. |
| `samples[].input` | yes | The argument object handed to your SUT. |
| `samples[].expected_output` | yes | Ground truth: a string, a list of ids, or a graded-gains map. |
| `samples[].metadata` | no | Tags and per-sample knobs (see below). |

The strict-schema loader validates every sample and raises actionable errors
on malformed entries (a missing id, a samples value that is not a list, an
unparseable expected_output) instead of silently scoring them wrong.
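
To make two of those failure modes concrete, the fragment below sketches two dataset documents that a strict loader would be expected to reject. The dataset names are hypothetical, and the exact error text the loader produces is not shown because it is not specified here.

```yaml
# Illustrative only: two malformed datasets, separated as YAML documents.
---
name: rag.broken.missing-id            # hypothetical name
samples:
  - input:                             # malformed: this sample has no `id`
      question: "What is the capital of France?"
    expected_output: "Paris"
---
name: rag.broken.samples-not-a-list    # hypothetical name
samples:                               # malformed: a mapping, not a list
  capital-france:
    input:
      question: "What is the capital of France?"
    expected_output: "Paris"
```

Both documents are valid YAML, which is why schema validation matters: a plain YAML parser would accept them without complaint.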

Metadata knobs

metadata is where samples carry per-case configuration:

tags — cohort slicing

tags: [policy, support] puts a sample into one or more cohorts. Reports
break every aggregate down by tag, so you can see which slice regressed.
Multi-tag samples appear in each cohort; untagged samples fall into an
explicit untagged bucket.
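
A minimal sketch of how the two cases look in a dataset file. The sample ids and questions are illustrative.

```yaml
samples:
  - id: refund-window
    input:
      question: "How long do I have to return an item?"
    expected_output: "30 days from delivery."
    metadata:
      tags: [policy, support]   # counted in both the policy and the support cohort
  - id: capital-france
    input:
      question: "What is the capital of France?"
    expected_output: "Paris"
    # no metadata.tags: reported under the untagged bucket
```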

k — retrieval cutoff override

k: 10 overrides the retrieval-ranking cutoff for that sample. Without it, the
cutoff falls back to the constructor k, then metrics.retrieval.default_k,
then 5. Use it when a few hard queries need a deeper top-k.
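
As a sketch, a retrieval dataset might override the cutoff for one hard query and leave the rest on the fallback chain. The ids, questions, and document ids below are hypothetical.

```yaml
samples:
  - id: hard-multi-hop
    input:
      question: "Which policies changed between the last two releases?"
    expected_output: ["doc-3", "doc-9"]
    metadata:
      tags: [hard]
      k: 10                     # this sample is scored at top-10
  - id: easy-lookup
    input:
      question: "Where is the returns policy documented?"
    expected_output: ["doc-3"]
    # no metadata.k: falls back to the constructor k,
    # then metrics.retrieval.default_k, then 5
```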

refusal_expected — safety contract

refusal_expected: true|false is required by the refusal-quality
metric. It states whether refusing is the correct behavior, letting reports
separate over-refusal from under-refusal.
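
The fragment below sketches one sample of each kind. The questions and answers are invented for illustration, and what expected_output should contain for a refusal case depends on the metrics you pair with refusal-quality, which is an assumption here.

```yaml
samples:
  - id: unsafe-request
    input:
      question: "How do I bypass the license check?"
    expected_output: "I can't help with that."
    metadata:
      tags: [safety]
      refusal_expected: true    # refusing is correct; answering counts as under-refusal
  - id: benign-request
    input:
      question: "How do I reset my password?"
    expected_output: "Use the reset link on the sign-in page."
    metadata:
      tags: [support]
      refusal_expected: false   # answering is correct; refusing counts as over-refusal
```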

citation_evidence — grounded-citation spans

A list of {citation, quote} spans. The citation-groundedness metric requires
each citation to appear together with its quote in the answer, which enforces
a strict “no ungrounded claims” contract.
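
A minimal sketch of the shape, assuming the citation is a document id. Whether citations are ids, bracketed markers, or URLs in your pipeline is not specified here, so treat the values as placeholders.

```yaml
samples:
  - id: refund-window-cited
    input:
      question: "How long do I have to return an item?"
    expected_output: "30 days from delivery."
    metadata:
      citation_evidence:
        - citation: "doc-3"                  # placeholder citation value
          quote: "30 days from delivery"     # text that must accompany the citation
```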

Expected-output shapes

The same expected_output field carries different shapes depending on the
metric family:

# string — for lexical, semantic, judge metrics
expected_output: "30 days from delivery."

# list of ids — binary relevance for retrieval-ranking metrics
expected_output: ["doc-3", "doc-9"]

# id -> gain map — graded relevance for nDCG@k
expected_output: { "doc-3": 3, "doc-9": 1 }

Programmatic datasets

When a dataset is generated rather than hand-curated, build it from
DatasetSample objects — no YAML file required:

use Padosoft\EvalHarness\Datasets\DatasetSample;
use Padosoft\EvalHarness\Facades\EvalFacade;

// The facade class is EvalFacade because `Eval` is a reserved word in PHP and
// cannot be used as a class alias (`use ... as Eval` is a parse error).
EvalFacade::dataset('rag.smoke')
    ->withSamples([
        new DatasetSample(id: 's1', input: ['q' => 'hi'], expectedOutput: 'hello'),
        new DatasetSample(id: 's2', input: ['q' => 'bye'], expectedOutput: 'goodbye'),
    ])
    ->withMetrics(['exact-match'])
    ->register();
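
For comparison, the same two samples written against the YAML contract would look roughly like this. Note the expectedOutput constructor argument corresponds to the expected_output key. The metric selection has no counterpart in the contract table above, so it is left out of this sketch.

```yaml
schema_version: eval-harness.dataset.v1
name: rag.smoke
samples:
  - id: s1
    input:
      q: "hi"
    expected_output: "hello"
  - id: s2
    input:
      q: "bye"
    expected_output: "goodbye"
```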

Curating a dataset that catches regressions

Start from real queries
Seed the dataset from your actual logs: the questions users really ask, not
the ones you wish they asked. 30–200 representative cases beat 2,000
synthetic ones.

Tag for cohorts you care about
Tag by topic (policy, geography), difficulty (easy, hard), and risk
(safety). Cohorts are free at report time and invaluable when a global
score holds but one slice rots.

Pick expected outputs you can defend
For exact-match, the expected output must be genuinely canonical. For prose,
write a faithful reference and lean on semantic or judge metrics rather than
forcing exactness.

Promote real failures back in
When production surfaces a wrong answer, add it as a sample. The adversarial
lane automates this with --promote-failures; do the same by hand for
factual misses. Your dataset should grow with every incident.
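
Promoting a factual miss by hand can be as small as appending one entry to the samples list. The sketch below is hypothetical throughout: the id, the question, the reference answer, and the regression tag are placeholders for whatever your incident actually involved.

```yaml
samples:
  # ...existing samples...
  - id: incident-return-window      # stable id that points back to the incident
    input:
      question: "Can I return an opened item after six weeks?"
    expected_output: "No. Returns are accepted within 30 days from delivery."
    metadata:
      tags: [policy, hard, regression]   # regression tag gives incidents their own cohort
```

Tagging promoted failures consistently means the per-tag breakdown in reports shows at a glance whether past incidents are still fixed.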

Keep datasets small and sharp. A 60-sample dataset that runs in seconds on
every PR catches more regressions in practice than a 3,000-sample suite nobody
waits for. Add a larger nightly dataset for depth.
