Quickstart : eval-harness

Get Started
Quickstart

This walks you from install to a regression gate wired into CI. For the full
prerequisite matrix and provider configuration, see
Installation.

eval-harness is a Laravel package. You need PHP 8.3+ and Laravel
12.x / 13.x. Offline metrics (exact-match, contains, regex,
rouge-l, citation-groundedness, the retrieval-ranking family, and
ordinal-distance) require no API key and no network. Only the
embedding- and judge-backed metrics call a provider.

Install the package
```
composer require padosoft/eval-harness
```
Auto-discovery wires the service provider — no config/app.php edits. The
package ships a single auto-loaded migration (eval_harness_online_scores)
for the optional online-monitoring feature, so your next php artisan migrate
creates that one table; nothing else touches your schema.

Curate a golden dataset
A golden dataset is a small YAML file of (question, expected answer) pairs
that represent the queries you actually care about — 30–200 is plenty.

eval/golden/factuality.yml:

schema_version: eval-harness.dataset.v1
name: rag.factuality.fy2026
samples:
  - id: capital-france
    input:
      question: "What is the capital of France?"
    expected_output: "Paris"
    metadata:
      tags: [geography, easy]

  - id: refund-policy
    input:
      question: "How many days do I have to return an order?"
    expected_output: "30 days from delivery."
    metadata:
      tags: [policy, support]

schema_version is optional. Omit it and the loader defaults to
eval-harness.dataset.v1. Keeping it explicit makes your dataset
forward-compatible across releases.

Wire a registrar in your app
A registrar registers the dataset, declares its metrics, and binds the
system under test (SUT) — a callable that drives your real pipeline.

app/Console/EvalRegistrar.php:

<?php

namespace App\Console;

use Illuminate\Contracts\Container\Container;
use Padosoft\EvalHarness\EvalEngine;

class EvalRegistrar
{
    public function __construct(private readonly Container $container) {}

    public function __invoke(EvalEngine $engine): void
    {
        $engine->dataset('rag.factuality.fy2026')
            ->loadFromYaml(base_path('eval/golden/factuality.yml'))
            ->withMetrics(['exact-match', 'rouge-l'])
            ->register();

        $this->container->bind('eval-harness.sut', fn () =>
            fn (array $input): string => app(\App\Rag\KnowledgeAgent::class)
                ->answer($input['question']),
        );
    }
}

This quickstart uses offline metrics (exact-match, rouge-l) so it
runs with no API key on a fresh install. To add paraphrase-tolerant or
judged scoring (cosine-embedding, llm-as-judge, …), configure an
embeddings/judge provider first — see Installation.

Run the eval
```
php artisan eval-harness:run rag.factuality.fy2026 \
  --registrar="App\Console\EvalRegistrar" \
  --json --out=factuality.json
```
The exit code is 0 if every metric executed cleanly, non-zero on any
captured failure (a timeout, a malformed provider response). This is an
execution-health signal — note it does not fail on merely low
scores: a SUT that returns wrong-but-clean answers still exits 0.

To gate on quality — to fail when scores drop even though metrics ran
cleanly — read the JSON report’s macro_f1 and assert on a threshold. See
Gating on a quality threshold.

Read the report
Drop the --json / --out flags to render a human-readable Markdown report
to stdout:

# Eval report — rag.factuality.fy2026

_Run completed in 2.41s over 30 samples (0 failures captured)._

## Per-metric aggregates

| metric           | mean   | p50    | p95    | pass-rate (>= 0.5) |
| ---------------- | ------ | ------ | ------ | ------------------ |
| exact-match      | 0.7333 | 1.0000 | 1.0000 | 0.7333             |
| rouge-l          | 0.9012 | 0.9421 | 0.9893 | 0.9667             |

## Macro-F1 (avg pass-rate across all metrics): 0.8500

What just happened

The YAML dataset was loaded and validated against
eval-harness.dataset.v1.
For each sample, the SUT callable ran your real pipeline to produce an
actual output.
Each declared metric scored (sample, actual output) into a
MetricScore in [0, 1] — or recorded a SampleFailure if the provider /
metric contract broke (captured by default, not fatal).
The engine aggregated everything into an EvalReport:
per-metric mean / p50 / p95 / pass-rate, macro-F1, per-cohort
breakdowns, and score histograms.
The Artisan command set the exit code from the report so CI can gate.

Next steps

Golden datasets

The dataset contract, samples, metadata, tags, and per-sample knobs.

Open →

The CI gate

Wire the exit code into GitHub Actions and gate the merge on macro-F1.

Open →

Metrics & theory

Choose the right metric and understand the math behind each score.

Open →

Last updated: Edit this page

What just happened

Next steps

onThisPage