Lexical metrics break the moment a correct answer is phrased differently from the
reference. Semantic metrics fix that by scoring in embedding space, where
“You may return within a month” and “Returns are accepted up to 30 days” sit
close together even though they share few tokens.
Both metrics call an OpenAI-compatible embeddings endpoint through the
EmbeddingClient contract. You can fake it, either with Http::fake() or by
binding a deterministic client, so the offline test suite never touches the network.
cosine-embedding
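Embed the expected output and the actual output into vectors $\mathbf{a}$ and
$\mathbf{b}$, then compute their cosine similarity:

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
$$

The result is mapped into the metric's score range. Cosine similarity
measures the angle between the two vectors, not their magnitude, so it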
captures topical/semantic alignment independent of length. This is the
workhorse metric for free-form answers that can be correct in many surface
forms.
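The cosine step can be sketched in a few lines of plain PHP. This is a minimal illustration, not the package's code: the function name `cosineSimilarity` and the choice to return 0 for a zero vector are assumptions, and the mapping of the raw value into the final score is deliberately left out.

```php
<?php

declare(strict_types=1);

/**
 * Cosine similarity between two equal-length vectors (hypothetical helper).
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $x) {
        $y = $b[$i];
        $dot += $x * $y;
        $normA += $x * $x;
        $normB += $y * $y;
    }

    // Assumption: a zero vector has no direction, so report 0 instead of dividing by zero.
    if ($normA === 0.0 || $normB === 0.0) {
        return 0.0;
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

// Doubling a vector changes its magnitude but not its direction,
// so the similarity stays at (approximately) 1.
$short = [1.0, 2.0, 3.0];
$long = [2.0, 4.0, 6.0];
printf("%.4f\n", cosineSimilarity($short, $long));
```

The example at the end shows the length independence described above: the second vector is twice as long as the first, yet the angle between them is zero.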
Cosine-embedding scores the whole answer as one vector. It rewards topical
closeness but cannot tell you which span is wrong. When you need token-level
attribution, use bertscore-like.
bertscore-like
A token-level similarity in the spirit of BERTScore (Zhang et al., 2020).
Instead of one vector per answer, it embeds the unique tokens of each side and
greedily matches each token to its most similar counterpart, then computes
precision, recall, and F1 over those best matches.
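Let $R$ be the unique reference tokens and $C$ the unique candidate tokens,
each embedded as a vector. Recall and precision are:

$$
\text{recall} = \frac{1}{|R|} \sum_{r \in R} \max_{c \in C} \cos(r, c),
\qquad
\text{precision} = \frac{1}{|C|} \sum_{c \in C} \max_{r \in R} \cos(c, r)
$$

and F1 is their harmonic mean:

$$
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

The metric returns precision, recall, and F1. Recall asks “is every part of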
the reference covered somewhere in the candidate?”; precision asks “is every
part of the candidate justified by the reference?”; F1 balances the two.
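The greedy matching described above can be sketched as follows, assuming the token vectors have already been fetched and are keyed by token. The function names (`cosine`, `meanBestMatch`, `greedyTokenScores`) and the returned array shape are hypothetical and chosen for illustration; the package's actual internals may differ.

```php
<?php

declare(strict_types=1);

/**
 * @param list<float> $a
 * @param list<float> $b
 */
function cosine(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $x) {
        $dot += $x * $b[$i];
        $normA += $x * $x;
        $normB += $b[$i] * $b[$i];
    }

    return ($normA === 0.0 || $normB === 0.0) ? 0.0 : $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Average, over the source tokens, of each token's best cosine match among the targets.
 *
 * @param array<string, list<float>> $source
 * @param array<string, list<float>> $targets
 */
function meanBestMatch(array $source, array $targets): float
{
    if ($source === [] || $targets === []) {
        return 0.0;
    }

    $sum = 0.0;
    foreach ($source as $vector) {
        $best = -1.0; // cosine similarity is never below -1
        foreach ($targets as $other) {
            $best = max($best, cosine($vector, $other));
        }
        $sum += $best;
    }

    return $sum / count($source);
}

/**
 * @param array<string, list<float>> $reference
 * @param array<string, list<float>> $candidate
 * @return array{precision: float, recall: float, f1: float}
 */
function greedyTokenScores(array $reference, array $candidate): array
{
    $recall = meanBestMatch($reference, $candidate);    // reference covered by candidate?
    $precision = meanBestMatch($candidate, $reference); // candidate justified by reference?
    $sum = $precision + $recall;
    $f1 = $sum > 0.0 ? 2.0 * $precision * $recall / $sum : 0.0;

    return ['precision' => $precision, 'recall' => $recall, 'f1' => $f1];
}

// Toy vectors: the candidate covers only one of the two reference tokens.
$reference = ['returns' => [1.0, 0.0], 'days' => [0.0, 1.0]];
$candidate = ['returns' => [1.0, 0.0]];
print_r(greedyTokenScores($reference, $candidate));
```

In the toy example the single candidate token matches its reference counterpart exactly, so precision is 1, while the unmatched reference token pulls recall down to 0.5.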
The implementation deduplicates token-embedding calls per sample, so
repeated tokens are embedded once. This keeps the provider cost bounded even
on long answers.
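One way to get that bound is to collect the unique tokens of a sample first and send them in a single batch. The sketch below assumes a naive tokenizer (ASCII lowercasing, split on non-alphanumeric characters) and a callable with the same shape as `embedMany`; both `tokenize` and `embedUniqueTokens` are hypothetical names, and the real tokenization rules are not specified here.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical tokenizer for this sketch: lowercase (ASCII only), split on
 * anything that is not a letter or digit.
 *
 * @return list<string>
 */
function tokenize(string $text): array
{
    $parts = preg_split('/[^\p{L}\p{N}]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    return $parts === false ? [] : $parts;
}

/**
 * Embed every distinct token across the given texts with one batch call.
 *
 * @param list<string> $texts
 * @param callable(list<string>): list<list<float>> $embedMany
 * @return array<array-key, list<float>> token => vector
 */
function embedUniqueTokens(array $texts, callable $embedMany): array
{
    $unique = [];
    foreach ($texts as $text) {
        foreach (tokenize($text) as $token) {
            $unique[$token] = true; // array keys deduplicate for free
        }
    }

    // Numeric tokens such as "30" become integer keys, so cast them back to strings.
    $tokens = array_map('strval', array_keys($unique));
    if ($tokens === []) {
        return [];
    }

    return array_combine($tokens, $embedMany($tokens));
}

// Usage with a counting fake: repeated tokens reach the provider only once.
$calls = 0;
$fake = function (array $tokens) use (&$calls): array {
    $calls++;

    return array_map(static fn (string $t): array => [(float) strlen($t)], $tokens);
};

$vectors = embedUniqueTokens(['Returns returns accepted', 'returns ACCEPTED now'], $fake);
printf("%d call(s), %d unique token(s)\n", $calls, count($vectors));
```

Batching both sides of a sample into one call is a design choice of this sketch; it trades a slightly larger request for fewer round trips.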
Determinism and faking
Both metrics depend only on the embedding vectors, so binding a deterministic
EmbeddingClient makes them fully reproducible offline:
```php
use Padosoft\EvalHarness\Contracts\EmbeddingClient;

$this->app->bind(EmbeddingClient::class, fn () => new class implements EmbeddingClient {
    /**
     * @param list<string> $texts
     * @return list<list<float>> one vector per input, in the same order
     */
    public function embedMany(array $texts): array
    {
        // deterministic stub: hash-seeded pseudo-vectors, or fixture lookups
        return array_map(
            static fn (string $text): array => MyFixtures::vectorFor($text),
            $texts,
        );
    }
});
```
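For the hash-seeded option, a pseudo-vector can be derived from a digest of the input text. The helper below is a hypothetical sketch (`hashVector` is not part of the package) that could stand in for the fixture lookup in a stub like the one above. The vectors are stable across runs but carry no semantic meaning, so they are useful for reproducibility checks only.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: turn the first bytes of a SHA-256 digest into a
 * fixed-length vector with components in [-1.0, 1.0].
 *
 * @return list<float>
 */
function hashVector(string $text, int $dimensions = 8): array
{
    if ($dimensions < 1 || $dimensions > 32) {
        throw new InvalidArgumentException('SHA-256 yields 32 bytes; pick 1 to 32 dimensions.');
    }

    // Raw digest as unsigned bytes; unpack() returns a 1-indexed array.
    $bytes = unpack('C*', hash('sha256', $text, true));

    $vector = [];
    for ($i = 1; $i <= $dimensions; $i++) {
        // Map each byte from 0..255 onto -1.0..1.0.
        $vector[] = ($bytes[$i] - 127.5) / 127.5;
    }

    return $vector;
}

// The same input always produces the same vector.
var_dump(hashVector('Returns are accepted up to 30 days') === hashVector('Returns are accepted up to 30 days'));
```

Because similar sentences hash to unrelated vectors, scores computed from this stub say nothing about answer quality. Use fixture vectors when a test needs a specific similarity.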
In CI you can either fake the client (free, deterministic) or point it at a real
embeddings endpoint (a true semantic signal at the cost of API calls). The
provider’s usage (tokens, cost, latency) is attached to the MetricScore
details and rolls up into the report’s usage summary.
Choosing between them
| Question | Metric |
|---|---|
| Is the answer topically correct, any phrasing? | cosine-embedding |
| Which parts of the answer align with the reference? | bertscore-like |
| Cheapest semantic signal? | cosine-embedding (one embed per side) |
| Most discriminating on partially-correct prose? | bertscore-like |
Embedding similarity is not faithfulness. A fluent, well-embedded answer
can still be factually wrong or ungrounded. Pair semantic metrics with
citation-groundedness or llm-as-judge when correctness — not just
topicality — is the gate.