CLI reference : eval-harness

Reference
CLI reference

The package exposes three Artisan commands. Run php artisan <command> --help
for the authoritative, version-current option list.

eval-harness:run

Execute a golden-dataset evaluation against a system under test, or score
precomputed outputs.

eval-harness:run {dataset}
  {--registrar=}
  {--outputs=}
  {--batch=serial|lazy-parallel}
  {--batch-profile=ci|smoke|nightly}
  {--concurrency=}
  {--queue=}
  {--timeout=}
  {--batch-timeout=}
  {--chunk-size=}
  {--rate-limit=}
  {--rate-window-seconds=}
  {--result-ttl-seconds=}
  {--checkpoint-every=}
  {--json}
  {--out=}
  {--raw-path}

flag	purpose
`{dataset}`	The registered dataset name to run.
`--registrar=`	Registrar class to invoke (registers datasets + binds the SUT).
`--outputs=`	Score precomputed JSON/YAML outputs; bypasses the batch contract.
`--batch=`	`serial` (default) or `lazy-parallel`.
`--batch-profile=`	Apply a named preset (`ci` / `smoke` / `nightly`); explicit flags win.
`--concurrency=`	Lazy-parallel fan-out cap and default dispatch-window size.
`--queue=`	Queue name for lazy-parallel jobs.
`--timeout=`	Per-sample job timeout (seconds).
`--batch-timeout=`	Producer wait cap per dispatch window (dispatch + collection).
`--chunk-size=`	Dispatch-window size (`<= --concurrency`); the effective in-flight limit.
`--rate-limit=` / `--rate-window-seconds=`	Throttle N samples per W-second sliding window.
`--result-ttl-seconds=`	Lifetime of result metadata/outputs for delayed collection.
`--checkpoint-every=`	Emit a progress event every N completed samples.
`--json`	Render JSON instead of Markdown.
`--out=`	Write the report to the reports disk/prefix (or literal with `--raw-path`).
`--raw-path`	Treat `--out` as a literal filesystem path (parent must exist).

Exit code. 0 if every metric scored cleanly; non-zero on any captured
failure (or the first MetricException under raise_exceptions). Pass none on
any nullable numeric flag to clear an inherited profile value.

eval-harness:adversarial

Run opt-in adversarial red-team seeds with compliance reporting, manifest
history, regression gating, and failure promotion. Alias: eval:adversarial.

eval-harness:adversarial
  {--registrar=}
  {--dataset=}
  {--category=*}
  {--metric=*}
  {--outputs=}
  {--manifest=}
  {--manifest-retain=10}
  {--promote-failures=}
  {--promoted-dataset=}
  {--regression-gate}
  {--regression-max-drop=5}
  {--regression-metric=*}
  {--batch=serial|lazy-parallel}
  {--batch-profile=ci|smoke|nightly}
  {--concurrency=} {--queue=} {--timeout=} {--batch-timeout=}
  {--chunk-size=} {--rate-limit=} {--rate-window-seconds=}
  {--result-ttl-seconds=} {--checkpoint-every=}
  {--json} {--out=} {--raw-path}

flag	purpose
`--category=*`	Red-team categories to seed (repeatable). One of the 10 built-ins.
`--metric=*`	Metric(s) to score with (default `refusal-quality`).
`--manifest=`	Local JSON run-history manifest path.
`--manifest-retain=`	Bounded retention: newest N summaries + needed clean baselines (default 10).
`--regression-gate`	Compare against the latest compatible failure-free baseline.
`--regression-max-drop=`	Allowed normalized-percentage-point drop (default 5).
`--regression-metric=*`	Extra `metric` or `metric:mean\|p50\|p95\|pass_rate` checks (repeatable).
`--promote-failures=`	Export failing samples to a reloadable YAML seed.
`--promoted-dataset=`	Dataset name inside the promotion YAML (default `<dataset>.failures`).

Shares all batch/backpressure/output flags with eval-harness:run. When
--outputs= is set, the batch flags are ignored. Exit code is non-zero on
metric failures or a tripped regression gate. See
Adversarial testing.

eval-harness:calibrate-judge

Validate the LLM judge against human-labelled cases before gating on it.

eval-harness:calibrate-judge {cases}
  {--min-agreement=}
  {--model-under-test=}
  {--json}
  {--out=}
  {--raw-path}

flag	purpose
`{cases}`	Path to the calibration YAML (`eval-harness.calibration.v1`).
`--min-agreement=`	Verdict-agreement floor (config default `0.8`).
`--model-under-test=`	The SUT model; triggers the self-preference guard.

Exit code. Non-zero when verdict agreement is below the floor or the judge
model equals --model-under-test; warns (non-fatal) on suspected length bias. See
Judge calibration.

Shared conventions

Output (–json / --out / --raw-path)

All three commands render Markdown to stdout by default. --json switches to
the versioned JSON contract. --out= writes to the configured reports
disk/prefix; --raw-path makes --out a literal path (parent must exist).

The none/null sentinel

On eval-harness:run and eval-harness:adversarial, pass none (or null)
on any nullable numeric flag to clear an inherited profile value for a single
run. --queue is excluded — override it in config.

Strict mode

Set EVAL_HARNESS_RAISE_EXCEPTIONS=true to abort on the first
MetricException instead of capturing it.

Report API

The read-only HTTP surface for stored artifacts.

Open →

Configuration

The env vars these commands read.

Open →

Last updated: Edit this page

eval-harness:run

eval-harness:adversarial

eval-harness:calibrate-judge

Shared conventions

onThisPage