The package exposes three Artisan commands. Run php artisan <command> --help
for the authoritative, version-current option list.

eval-harness:run

Execute a golden-dataset evaluation against a system under test, or score
precomputed outputs.

eval-harness:run {dataset}
  {--registrar=}
  {--outputs=}
  {--batch=serial|lazy-parallel}
  {--batch-profile=ci|smoke|nightly}
  {--concurrency=}
  {--queue=}
  {--timeout=}
  {--batch-timeout=}
  {--chunk-size=}
  {--rate-limit=}
  {--rate-window-seconds=}
  {--result-ttl-seconds=}
  {--checkpoint-every=}
  {--json}
  {--out=}
  {--raw-path}
flag purpose
{dataset} The registered dataset name to run.
--registrar= Registrar class to invoke (registers datasets + binds the SUT).
--outputs= Score precomputed JSON/YAML outputs; bypasses the batch contract.
--batch= serial (default) or lazy-parallel.
--batch-profile= Apply a named preset (ci / smoke / nightly); explicit flags win.
--concurrency= Lazy-parallel fan-out cap and default dispatch-window size.
--queue= Queue name for lazy-parallel jobs.
--timeout= Per-sample job timeout (seconds).
--batch-timeout= Producer wait cap per dispatch window (dispatch + collection).
--chunk-size= Dispatch-window size (<= --concurrency); the effective in-flight limit.
--rate-limit= / --rate-window-seconds= Throttle N samples per W-second sliding window.
--result-ttl-seconds= Lifetime of result metadata/outputs for delayed collection.
--checkpoint-every= Emit a progress event every N completed samples.
--json Render JSON instead of Markdown.
--out= Write the report to the reports disk/prefix (or literal with --raw-path).
--raw-path Treat --out as a literal filesystem path (parent must exist).

Exit code. 0 if every metric scored cleanly; non-zero on any captured
failure (or the first MetricException under raise_exceptions). Pass none on
any nullable numeric flag to clear an inherited profile value.

eval-harness:adversarial

Run opt-in adversarial red-team seeds with compliance reporting, manifest
history, regression gating, and failure promotion. Alias: eval:adversarial.

eval-harness:adversarial
  {--registrar=}
  {--dataset=}
  {--category=*}
  {--metric=*}
  {--outputs=}
  {--manifest=}
  {--manifest-retain=10}
  {--promote-failures=}
  {--promoted-dataset=}
  {--regression-gate}
  {--regression-max-drop=5}
  {--regression-metric=*}
  {--batch=serial|lazy-parallel}
  {--batch-profile=ci|smoke|nightly}
  {--concurrency=} {--queue=} {--timeout=} {--batch-timeout=}
  {--chunk-size=} {--rate-limit=} {--rate-window-seconds=}
  {--result-ttl-seconds=} {--checkpoint-every=}
  {--json} {--out=} {--raw-path}
flag purpose
--category=* Red-team categories to seed (repeatable). One of the 10 built-ins.
--metric=* Metric(s) to score with (default refusal-quality).
--manifest= Local JSON run-history manifest path.
--manifest-retain= Bounded retention: newest N summaries + needed clean baselines (default 10).
--regression-gate Compare against the latest compatible failure-free baseline.
--regression-max-drop= Allowed normalized-percentage-point drop (default 5).
--regression-metric=* Extra metric or metric:mean|p50|p95|pass_rate checks (repeatable).
--promote-failures= Export failing samples to a reloadable YAML seed.
--promoted-dataset= Dataset name inside the promotion YAML (default <dataset>.failures).

Shares all batch/backpressure/output flags with eval-harness:run. When
--outputs= is set, the batch flags are ignored. Exit code is non-zero on
metric failures or a tripped regression gate. See
Adversarial testing.

eval-harness:calibrate-judge

Validate the LLM judge against human-labelled cases before gating on it.

eval-harness:calibrate-judge {cases}
  {--min-agreement=}
  {--model-under-test=}
  {--json}
  {--out=}
  {--raw-path}
flag purpose
{cases} Path to the calibration YAML (eval-harness.calibration.v1).
--min-agreement= Verdict-agreement floor (config default 0.8).
--model-under-test= The SUT model; triggers the self-preference guard.

Exit code. Non-zero when verdict agreement is below the floor or the judge
model equals --model-under-test; warns (non-fatal) on suspected length bias. See
Judge calibration.

Shared conventions

Output (–json / --out / --raw-path)

All three commands render Markdown to stdout by default. --json switches to
the versioned JSON contract. --out= writes to the configured reports
disk/prefix; --raw-path makes --out a literal path (parent must exist).

The none/null sentinel

On eval-harness:run and eval-harness:adversarial, pass none (or null)
on any nullable numeric flag to clear an inherited profile value for a single
run. --queue is excluded — override it in config.

Strict mode

Set EVAL_HARNESS_RAISE_EXCEPTIONS=true to abort on the first
MetricException instead of capturing it.

Report API

The read-only HTTP surface for stored artifacts.

Open →

Configuration

The env vars these commands read.

Open →