Metrics catalog : eval-harness

Reference
Metrics catalog

The fifteen built-in metrics at a glance. Each links to its theory page. The
codomain is $[0, 1]$ for all metrics; a sample passes a metric when its score
is $\ge 0.5$ .

The full table

alias	family	scores	inputs	provider	key knobs
`exact-match`	lexical	strict (`===`) case-sensitive equality	string expected	none	—
`contains`	lexical	substring presence	string expected	none	—
`regex`	lexical	PCRE pattern match	regex expected	none	—
`rouge-l`	structural	LCS F-measure overlap	string expected	none	—
`cosine-embedding`	semantic	cosine of sentence embeddings	string expected	embeddings	model, endpoint
`bertscore-like`	semantic	token-level greedy-match F1	string expected	embeddings	model, endpoint
`llm-as-judge`	judge	graded `score` 0…1 from a rubric	string expected	chat	model, prompt_template
`refusal-quality`	judge	graded refusal-behavior `score` 0…1	`metadata.refusal_expected`	chat	model (prompt is built-in)
`ordinal-distance`	ordinal	ordered-label proximity	string label + scale	none	ordered scale (required)
`citation-groundedness`	citation	fraction of required citation markers / evidence spans present	`metadata.citations` or `metadata.citation_evidence`	none	—
`retrieval-hit-at-k`	retrieval	any relevant id in top-k	ranked ids + relevant set	none	`k`
`retrieval-recall-at-k`	retrieval	fraction of relevant in top-k	ranked ids + relevant set	none	`k`
`retrieval-mrr`	retrieval	reciprocal rank of first hit	ranked ids + relevant set	none	—
`retrieval-ndcg-at-k`	retrieval	nDCG@k, binary or graded gains	ranked ids + gains	none	`k`
`answer-containment-at-k`	retrieval	expected span in any top-k text	ranked `{id,text}` + answer	none	`k`

That is four lexical/structural, two semantic, two judge, two
ordinal/citation, and five retrieval-ranking metrics — fifteen total.

Resolving `k`

For the retrieval family, the cutoff resolves in precedence order: per-sample
metadata.k → constructor k → metrics.retrieval.default_k (config, default
5).

The resolver forms

withMetrics([...]) accepts any mix of:

->withMetrics([
    'exact-match',                          // 1. built-in alias
    \App\Eval\DomainSpecificMetric::class,  // 2. FQCN (container-resolved)
    new OrdinalDistanceMetric(['low','medium','high','urgent']), // 3. instance
])

Aliases (including the retrieval family and answer-containment-at-k) auto-wire
from the container. ordinal-distance requires an ordered scale, so pass an
instance or bind one under the alias. Every resolved class is asserted to
implement Metric, so a typo fails with a clear error.

Provider-backed metric inputs

metric	expects from the SUT
`cosine-embedding`, `bertscore-like`	a plain string answer; both embed it and the reference.
`llm-as-judge`	a plain string answer; the judge sees input + expected + actual.
`refusal-quality`	a plain string answer; the sample must set `metadata.refusal_expected`.
retrieval family	JSON `{"retrieved":[{"id","text"}]}` or a bare `["id", ...]` array, rank 1 first.

The custom-metric contract

Implement the interface’s two methods (name() + score()) and the metric is registry-ready:

use Padosoft\EvalHarness\Datasets\DatasetSample;
use Padosoft\EvalHarness\Metrics\Metric;
use Padosoft\EvalHarness\Metrics\MetricScore;

class MyMetric implements Metric
{
    public function name(): string { return 'my-metric'; }

    public function score(DatasetSample $sample, string $actualOutput): MetricScore
    {
        // return a score in [0, 1], optionally with structured details
        return new MetricScore(0.0, ['usage' => [/* tokens, cost_usd, latency_ms */]]);
    }
}

Throw a MetricException for a genuine contract error — the engine captures it as
a SampleFailure (unless strict mode is on) rather than letting it crash the run.

Metrics overview

The five families and a decision guide.

Open →

Aggregation & macro-F1

How these scores roll up into the gate.

Open →

Last updated: Edit this page

The full table

Resolving k

The resolver forms

Provider-backed metric inputs

The custom-metric contract

onThisPage

Resolving `k`