Most problems fall into a few recognizable shapes. Find the symptom, apply the
fix.

Captured failures in the report

Symptom. The report shows total failures > 0 and the run exits non-zero,
but the metric aggregates look otherwise fine.

Cause. A metric threw a MetricException on one or more samples — a provider
timeout, a malformed response, or a metric contract error. By design these are
captured (recorded against (sample, metric)), not fatal, so one bad sample
doesn’t erase the rest of the run.

Fix. Open the per-sample rows (the JSON written by --out, or rows.csv from the
report API) to see which (sample, metric) failed and why. Distinguish
infrastructure failures (timeouts, 5xx) from quality regressions. For a strict
lane that should abort on the first failure, set
EVAL_HARNESS_RAISE_EXCEPTIONS=true.
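
Separating infrastructure failures from quality regressions can be scripted once the rows are exported. The following is a minimal sketch, not the package's own tooling: the row fields (sample_id, metric, error) and the marker list are assumptions made for illustration, so adapt them to the shape your report actually emits.

```python
from collections import Counter

# Hypothetical substrings that suggest infrastructure trouble rather than a
# quality problem. Tune this list to the error text in your own rows.
INFRA_MARKERS = ("timeout", "timed out", "429", "500", "502", "503", "504", "connection")


def classify_failure(error_message: str) -> str:
    """Return 'infrastructure' or 'needs-review' for one captured failure."""
    text = error_message.lower()
    if any(marker in text for marker in INFRA_MARKERS):
        return "infrastructure"
    return "needs-review"


def triage(rows: list) -> Counter:
    """Count captured failures per (metric, class); rows without an error are skipped."""
    counts = Counter()
    for row in rows:
        error = row.get("error")
        if error:
            counts[(row["metric"], classify_failure(error))] += 1
    return counts


# Inline rows for illustration; in practice, load them from the --out JSON file.
rows = [
    {"sample_id": "s1", "metric": "judge", "error": "Provider timeout after 30s"},
    {"sample_id": "s2", "metric": "judge", "error": "Malformed response: expected object"},
    {"sample_id": "s3", "metric": "exact_match", "error": None},
]

for (metric, kind), count in sorted(triage(rows).items()):
    print(f"{metric}: {kind} x{count}")
```

Substring matching is deliberately crude. Anything it labels needs-review still deserves a human look before it is treated as a regression.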

Malformed judge JSON

Symptom. Judge-backed samples fail with a JSON parse / schema error.

Cause. The judge model returned non-JSON or off-schema output. The strict
parser rejects it rather than silently scoring 0.0.

Fix. Confirm the endpoint supports response_format=json_object; some models
ignore it. Tighten the prompt_template to demand the exact JSON shape, or
switch to a judge model that honors structured output. Enable a couple of
provider retries (EVAL_HARNESS_PROVIDER_RETRY_ATTEMPTS=2) for transient blips —
note retries cover connection failures, 429, and 5xx, not malformed-but-200
responses.
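
To see why retries do not help with malformed output, it can be useful to keep the two decisions apart: whether a response is retryable depends only on the transport result, while strict parsing happens afterwards on a 200 body. The sketch below illustrates that split under stated assumptions. The {"score": number between 0 and 1} schema and both function names are hypothetical, not the package's actual contract.

```python
import json
from typing import Optional

# Statuses that count as transient, matching the categories named above.
RETRYABLE_STATUSES = {429} | set(range(500, 600))


def is_retryable(status: Optional[int]) -> bool:
    """None stands for a connection failure, where no HTTP response arrived."""
    return status is None or status in RETRYABLE_STATUSES


def parse_judge_output(body: str) -> dict:
    """Strictly parse a judge reply; raise ValueError rather than default to 0.0."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge returned non-JSON output: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    score = data.get("score")  # hypothetical schema field
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("judge output is missing a numeric 'score'")
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"judge score {score} is outside [0, 1]")
    return data


print(is_retryable(503))                      # transient: worth retrying
print(is_retryable(200))                      # a 200 is never retried
print(parse_judge_output('{"score": 0.8}'))   # well-formed reply passes
```

A reply such as "Sure! The score is 0.8" arrives with status 200, so is_retryable returns False and parse_judge_output raises. That is the case the prompt or the model has to fix.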

Missing samples in a lazy-parallel run

Symptom. The command reports missing samples after a lazy-parallel run.

Cause. The producer’s --batch-timeout fired before all queued outputs
landed, or the workers and producer don’t share a cache store.

Fix.

  • Verify EVAL_HARNESS_BATCH_CACHE_STORE points at a store shared by the
    command process and the workers (for example Redis), not the in-process
    array store.
  • Confirm workers are actually consuming --queue.
  • Raise --batch-timeout, or lower --chunk-size to reduce in-flight pressure.

See Horizon & queues.

“chunk dispatch consumed the full wait timeout”

Symptom. The run fails with this explicit diagnostic and a count of
undispatched samples.

Cause. A low --rate-limit throttled the producer so hard that the
--batch-timeout budget was spent before all samples were even dispatched.

Fix. Lower --chunk-size, relax --rate-limit / widen
--rate-window-seconds, or raise --batch-timeout.

Timeouts ignored on the sync driver

Symptom. A slow sample runs far longer than --timeout / --batch-timeout.

Cause. On the sync driver, dispatch() runs the job inline in the calling
process; there is no separate worker process to enforce the job $timeout.

Fix. Use a real queue driver (Redis) in production for any wall-clock
guarantee. sync is for tests only.

missing-baseline from the adversarial gate

Symptom. --regression-gate reports missing-baseline instead of pass/fail.

Cause. No compatible, failure-free manifest entry exists yet for this slice
(report schema × dataset × metric × category × sample-count) — often the first
run, or the first run after changing the dataset/metric set.

Fix. This is expected on a fresh slice. Let a clean run establish the
baseline; subsequent runs gate against it. Don’t hand-edit the manifest to fake
one. See Adversarial testing.
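
The compatibility rule can be pictured as a lookup keyed on the five slice dimensions. This is a rough sketch only: the manifest is modelled as a list of dicts ordered oldest to newest, and every field name (report_schema, dataset, metric, category, sample_count, failures) is an assumption made for illustration, not the real manifest format.

```python
from typing import Optional

# Hypothetical field names for the five dimensions that define a slice.
SLICE_FIELDS = ("report_schema", "dataset", "metric", "category", "sample_count")


def slice_key(entry: dict) -> tuple:
    """Build the comparison key for one manifest entry or for the current run."""
    return tuple(entry[field] for field in SLICE_FIELDS)


def find_baseline(manifest: list, current: dict) -> Optional[dict]:
    """Return the newest compatible, failure-free entry, or None if there is none."""
    wanted = slice_key(current)
    candidates = [
        entry for entry in manifest
        if slice_key(entry) == wanted and entry.get("failures", 0) == 0
    ]
    return candidates[-1] if candidates else None


manifest = [
    {"report_schema": 2, "dataset": "faq", "metric": "judge",
     "category": "injection", "sample_count": 50, "failures": 0, "score": 0.91},
]

# Same slice: a baseline is found.
same = {"report_schema": 2, "dataset": "faq", "metric": "judge",
        "category": "injection", "sample_count": 50}
# Dataset grew to 60 samples: no compatible entry, so the gate has no baseline.
grown = dict(same, sample_count=60)

print(find_baseline(manifest, same) is not None)
print(find_baseline(manifest, grown) is None)
```

Under this model, changing any one dimension produces a new key, which is why the first run after a dataset or metric change has nothing to gate against.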

Provider auth / 401 errors

Symptom. Embedding or judge calls fail with 401/403.

Cause. Missing or wrong API key, or a key scoped to the wrong endpoint.

Fix. Set EVAL_HARNESS_JUDGE_API_KEY / EVAL_HARNESS_EMBEDDINGS_API_KEY
(they fall back to OPENAI_API_KEY). In CI, inject them from secrets. Prefer
offline metrics on the PR gate so most runs need no key at all.
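
A preflight check in CI can catch a missing key before any provider call is made. The sketch below mirrors the fallback order described above; the function names are hypothetical, and treating an empty string as unset is an assumption rather than documented behavior.

```python
import os
from typing import Mapping, Optional

PROVIDER_KEYS = ("EVAL_HARNESS_JUDGE_API_KEY", "EVAL_HARNESS_EMBEDDINGS_API_KEY")
FALLBACK_KEY = "OPENAI_API_KEY"


def resolve_key(specific: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the provider-specific key if set, else the fallback, else None."""
    # Assumption: an empty value counts as unset.
    return env.get(specific) or env.get(FALLBACK_KEY) or None


def missing_keys(env: Mapping[str, str] = os.environ) -> list:
    """List the provider variables that resolve to no key at all."""
    return [name for name in PROVIDER_KEYS if resolve_key(name, env) is None]


# Illustration with explicit mappings instead of the real environment.
print(missing_keys({"OPENAI_API_KEY": "shared-key"}))          # fallback covers both
print(missing_keys({"EVAL_HARNESS_JUDGE_API_KEY": "judge"}))   # embeddings key unresolved
```

A check like this confirms only that a key is present. A key scoped to the wrong endpoint will still produce a 401 or 403 at call time.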

The report API returns 404 / nothing

Symptom. API routes 404, or the package’s routes aren’t registered.

Cause. The API is disabled by default.

Fix. Set api.enabled = true and an api.middleware stack that includes auth,
then confirm the api.prefix. See Report API.

Architecture test fails after edits

Symptom. The Architecture test suite fails, mentioning a leaked symbol.

Cause. The standalone-agnostic guard found a reference to a sibling package
or a consumer’s internal symbols in src/. The package must never depend on the
apps that consume it.

Fix. Remove the offending reference. The package is consumed by AskMyDocs and
others but depends on none of them.

Batch execution

The flags behind most batch issues.


Configuration

Every env var referenced above.
