The contract
Every metric implements a small two-method interface: name() returns the
metric’s alias, and score() receives a sample and the actual output of the
system under test (SUT) and returns a MetricScore carrying a real-number score
and details:
interface Metric
{
    public function name(): string;

    public function score(DatasetSample $sample, string $actualOutput): MetricScore;
}
A score of
when its score is
heterogeneous metrics aggregate cleanly into one macro-F1 number — see
Ordinal & aggregation.
The five families
| Family | Metrics | Network? | Determinism |
|---|---|---|---|
| Lexical & structural | exact-match, contains, regex, rouge-l | none | exact |
| Semantic similarity | cosine-embedding, bertscore-like | embeddings | fakeable |
| LLM-as-judge | llm-as-judge, refusal-quality | chat completions | seed 42, temp 0 |
| Retrieval-ranking | retrieval-hit-at-k, retrieval-recall-at-k, retrieval-mrr, retrieval-ndcg-at-k, answer-containment-at-k | none | exact |
| Ordinal & citation | ordinal-distance, citation-groundedness | none | exact |
That is fifteen metrics: four lexical/structural, two semantic, two judge,
five retrieval-ranking, and two ordinal/citation.
Resolving metrics
withMetrics([...]) accepts any mix of the three resolver forms. Built-in
aliases resolve through the container with zero extra binding:
$engine->dataset('rag.factuality')
    ->loadFromYaml(base_path('eval/golden/factuality.yml'))
    ->withMetrics([
        'exact-match',                         // alias
        'cosine-embedding',                    // alias (needs embeddings provider)
        new JaccardWordOverlapMetric(),        // your own instance
        \App\Eval\DomainSpecificMetric::class, // FQCN, resolved by container
    ])
    ->register();
The retrieval aliases and answer-containment-at-k auto-wire. ordinal-distance
needs an ordered scale, so pass an instance
(new OrdinalDistanceMetric([...])) or bind one under the alias.
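To make "graded partial credit on an ordered scale" concrete, here is a minimal standalone sketch. It is not the library's OrdinalDistanceMetric: the function name ordinalCredit and the formula 1 - |i - j| / (n - 1) are assumptions chosen for illustration, and the built-in metric may normalize differently.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the package): partial credit for two
 * labels on an ordered scale.
 *
 * Assumed formula: 1 - |i - j| / (n - 1), where i and j are the positions
 * of the expected and actual labels and n is the scale length.
 *
 * @param array<int, string> $scale Labels in ascending order.
 */
function ordinalCredit(array $scale, string $expected, string $actual): float
{
    $scale = array_values($scale); // force 0-based integer positions

    $i = array_search($expected, $scale, true);
    $j = array_search($actual, $scale, true);

    if ($i === false || $j === false) {
        return 0.0; // a label that is not on the scale earns nothing
    }

    if (count($scale) < 2) {
        return 1.0; // one-label scale: any known label is a full match
    }

    return 1.0 - abs($i - $j) / (count($scale) - 1);
}

$scale = ['low', 'medium', 'high', 'urgent'];

printf("%.2f\n", ordinalCredit($scale, 'high', 'high'));   // 1.00
printf("%.2f\n", ordinalCredit($scale, 'high', 'urgent')); // 0.67
printf("%.2f\n", ordinalCredit($scale, 'low', 'urgent'));  // 0.00
```

Under this assumed formula an off-by-one label keeps two thirds of the credit on a four-point scale, whereas exact-match would score it zero.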
Choosing a metric
The output is a single deterministic fact (an id, a date, a yes/no)
Use exact-match. Add contains or regex if the answer is embedded in a
longer sentence and only a substring/pattern must be present.
The output is free-form prose that can be paraphrased many ways
Use cosine-embedding (paraphrase-tolerant semantic overlap) and/or
rouge-l (token-sequence overlap). For a token-level precision/recall view
of semantic overlap, add bertscore-like.
Correctness is subjective and needs a rubric
Use llm-as-judge — a deterministic, schema-checked judge. Calibrate it
against human labels before you gate on it (see
Judge calibration).
You are measuring the retriever, not the generator
Use the retrieval-ranking family: recall@k and nDCG@k for ranking
quality, MRR for first-hit position, and answer-containment-at-k for
end-to-end “is the answer reachable in the top-k”.
Safety / refusal behavior matters
Use refusal-quality with metadata.refusal_expected, and drive the
adversarial lane (see Adversarial testing).
Labels are ordered (low < medium < high < urgent)
Use ordinal-distance for graded partial credit instead of all-or-nothing
exact match.
Answers must cite their evidence
Use citation-groundedness, optionally with metadata.citation_evidence
spans that require both a citation marker and the quoted text.
Most production datasets declare two or three metrics — one strict
(exact-match), one tolerant (cosine-embedding or rouge-l), and one
judge-based when correctness is subjective. The report shows each metric
independently, so you see where a regression bit, not just that it did.
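For readers new to the retrieval-ranking family, the sketch below shows the textbook definitions of recall@k, reciprocal rank (the per-query term that MRR averages) and nDCG@k with binary relevance. The function names are hypothetical and the code is independent of the package; the built-in retrieval-* metrics may differ in edge cases such as duplicate ids or graded relevance.

```php
<?php

declare(strict_types=1);

/** Fraction of the relevant ids that appear in the top-k results. */
function recallAtK(array $rankedIds, array $relevantIds, int $k): float
{
    $relevant = array_unique($relevantIds);
    if ($relevant === []) {
        return 0.0; // nothing to find
    }

    $topK = array_slice($rankedIds, 0, $k);

    return count(array_intersect($relevant, $topK)) / count($relevant);
}

/** 1 / rank of the first relevant result, or 0.0 if none is retrieved. */
function reciprocalRank(array $rankedIds, array $relevantIds): float
{
    foreach (array_values($rankedIds) as $index => $id) {
        if (in_array($id, $relevantIds, true)) {
            return 1.0 / ($index + 1);
        }
    }

    return 0.0;
}

/** nDCG@k with binary relevance; assumes ranked ids are unique. */
function ndcgAtK(array $rankedIds, array $relevantIds, int $k): float
{
    $relevant = array_unique($relevantIds);

    // Discounted gain of the actual ranking: a hit at rank r adds 1 / log2(r + 1).
    $dcg = 0.0;
    foreach (array_slice(array_values($rankedIds), 0, $k) as $index => $id) {
        if (in_array($id, $relevant, true)) {
            $dcg += 1.0 / log($index + 2, 2);
        }
    }

    // Ideal ranking: every relevant id packed at the top of the list.
    $idcg = 0.0;
    for ($i = 0; $i < min($k, count($relevant)); $i++) {
        $idcg += 1.0 / log($i + 2, 2);
    }

    return $idcg > 0.0 ? $dcg / $idcg : 0.0;
}

$ranked = ['d3', 'd1', 'd7', 'd2'];
$relevant = ['d1', 'd2'];

printf("recall@3 = %.2f\n", recallAtK($ranked, $relevant, 3)); // recall@3 = 0.50
printf("RR       = %.2f\n", reciprocalRank($ranked, $relevant)); // RR       = 0.50
printf("nDCG@3   = %.3f\n", ndcgAtK($ranked, $relevant, 3));   // nDCG@3   = 0.387
```

In this example only one of the two relevant documents reaches the top three, and it sits at rank two, so recall and reciprocal rank are both 0.5 while nDCG is lower because it also penalizes the missing second hit.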
Custom metrics
Implementing Metric is the extension point. A worked example:
use Padosoft\EvalHarness\Datasets\DatasetSample;
use Padosoft\EvalHarness\Metrics\Metric;
use Padosoft\EvalHarness\Metrics\MetricScore;

class JaccardWordOverlapMetric implements Metric
{
    public function name(): string
    {
        return 'jaccard-words';
    }

    public function score(DatasetSample $sample, string $actualOutput): MetricScore
    {
        // PREG_SPLIT_NO_EMPTY drops the empty tokens that blank input or
        // leading/trailing whitespace would otherwise add to the word sets.
        $expected = array_unique(preg_split('/\s+/', strtolower((string) $sample->expectedOutput), -1, PREG_SPLIT_NO_EMPTY));
        $actual = array_unique(preg_split('/\s+/', strtolower($actualOutput), -1, PREG_SPLIT_NO_EMPTY));
        $union = array_unique(array_merge($expected, $actual));

        // Both sides empty: score 0.0 rather than divide by zero.
        if ($union === []) {
            return new MetricScore(0.0);
        }

        return new MetricScore(count(array_intersect($expected, $actual)) / count($union));
    }
}
This is the Jaccard index |A ∩ B| / |A ∪ B| over the expected and actual word
sets: a complete, registry-ready metric in roughly twenty lines.
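A worked example may help when sanity-checking a custom metric like this one. The sketch below repeats the same word-set arithmetic as plain functions so it runs without the package; wordSet and jaccardWords are hypothetical names introduced only for this illustration.

```php
<?php

declare(strict_types=1);

/** Lowercased, de-duplicated words of a string (hypothetical helper). */
function wordSet(string $text): array
{
    $tokens = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    return array_values(array_unique($tokens === false ? [] : $tokens));
}

/** Jaccard index of two strings' word sets: |A ∩ B| / |A ∪ B|. */
function jaccardWords(string $expected, string $actual): float
{
    $a = wordSet($expected);
    $b = wordSet($actual);
    $union = array_unique(array_merge($a, $b));

    if ($union === []) {
        return 0.0; // both inputs blank
    }

    return count(array_intersect($a, $b)) / count($union);
}

// Shared words: {the, cat}. Union: {the, cat, sat, ran}. Score: 2 / 4.
printf("%.2f\n", jaccardWords('The cat sat', 'the cat ran')); // 0.50
```

Identical sentences score 1.0, sentences with no words in common score 0.0, and word order is ignored because only set membership counts.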