Most problems fall into a few recognizable shapes. Find the symptom, apply the
fix.
Captured failures in the report
Symptom. The report shows total failures > 0 and the run exits non-zero,
but the metric aggregates look otherwise fine.
Cause. A metric threw a MetricException on one or more samples — a provider
timeout, a malformed response, or a metric contract error. By design these are
captured (recorded against (sample, metric)), not fatal, so one bad sample
doesn’t erase the rest of the run.
Fix. Open the per-sample rows (JSON --out or the report API rows.csv) to
see which (sample, metric) failed and why. Distinguish infrastructure
failures (timeouts, 5xx) from quality regressions. For a strict lane that
should abort on the first captured failure, set EVAL_HARNESS_RAISE_EXCEPTIONS=true.
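The triage step can be sketched in a few lines of Python. This is only an illustration: the row fields used here (sample, metric, error, status) are hypothetical, so adapt them to whatever the JSON written by --out actually contains.

```python
import json

# Hypothetical row shape for illustration; the real report schema may differ.
SAMPLE_ROWS = json.loads("""
[
  {"sample": "s1", "metric": "faithfulness", "error": "provider timeout after 30s", "status": null},
  {"sample": "s2", "metric": "faithfulness", "error": "HTTP 503 from provider", "status": 503},
  {"sample": "s3", "metric": "exact_match", "error": "metric contract error: missing reference", "status": null}
]
""")


def is_infrastructure(row):
    """Treat timeouts and 5xx responses as infrastructure failures."""
    status = row.get("status")
    if status is not None and 500 <= status <= 599:
        return True
    return "timeout" in (row.get("error") or "").lower()


def triage(rows):
    """Group failed (sample, metric) pairs into infrastructure vs. everything else."""
    buckets = {"infrastructure": [], "other": []}
    for row in rows:
        key = "infrastructure" if is_infrastructure(row) else "other"
        buckets[key].append((row["sample"], row["metric"]))
    return buckets


if __name__ == "__main__":
    for bucket, pairs in triage(SAMPLE_ROWS).items():
        print(bucket, pairs)
```

Anything that lands in the "other" bucket deserves a manual look, since it may be a genuine quality regression or a metric contract error rather than a transient provider problem.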
Malformed judge JSON
Symptom. Judge-backed samples fail with a JSON parse / schema error.
Cause. The judge model returned non-JSON or off-schema output. The strict
parser rejects it rather than silently scoring 0.0.
Fix. Confirm the endpoint supports response_format=json_object; some models
ignore it. Tighten the prompt_template to demand the exact JSON shape, or
switch to a judge model that honors structured output. Enable a couple of
provider retries (EVAL_HARNESS_PROVIDER_RETRY_ATTEMPTS=2) for transient blips —
note retries cover connection failures, 429, and 5xx, not malformed-but-200
responses.
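To make the "reject rather than score 0.0" behavior concrete, here is a minimal Python sketch of a strict judge-output parser and of the retry boundary described above. The JSON shape (a score between 0 and 1 plus a reason string) and the names JudgeOutputError, parse_judge_output, and is_retryable_status are assumptions for illustration, not the package's actual parser or schema.

```python
import json


class JudgeOutputError(ValueError):
    """Raised when judge output is not the expected JSON shape."""


def parse_judge_output(raw):
    # Hypothetical schema: {"score": <number in [0, 1]>, "reason": <string>}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise JudgeOutputError(f"not JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise JudgeOutputError("top-level value must be an object")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise JudgeOutputError("'score' must be a number")
    if not 0.0 <= score <= 1.0:
        raise JudgeOutputError("'score' must be within [0, 1]")
    if not isinstance(data.get("reason"), str):
        raise JudgeOutputError("'reason' must be a string")
    return float(score), data["reason"]


def is_retryable_status(status):
    """429 and 5xx are worth retrying; a 200 with a malformed body is not."""
    return status == 429 or 500 <= status <= 599
```

Raising instead of returning a default keeps a broken judge response visible as a captured failure, so it cannot quietly drag an aggregate down.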
Missing samples in a lazy-parallel run
Symptom. The command reports missing samples after a lazy-parallel run.
Cause. The producer’s --batch-timeout fired before all queued outputs
landed, or the workers and producer don’t share a cache store.
Fix.
- Verify EVAL_HARNESS_BATCH_CACHE_STORE points at a store shared by the
  command process and the workers (Redis), not array.
- Confirm workers are actually consuming --queue.
- Raise --batch-timeout, or lower --chunk-size to reduce in-flight pressure.
See Horizon & queues.
“chunk dispatch consumed the full wait timeout”
Symptom. The run fails with this explicit diagnostic and a count of
undispatched samples.
Cause. A low --rate-limit throttled the producer so hard that the
--batch-timeout budget was spent before all samples were even dispatched.
Fix. Lower --chunk-size, relax --rate-limit / widen
--rate-window-seconds, or raise --batch-timeout.
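A rough back-of-the-envelope check can show whether a given combination is feasible at all. The sketch below assumes, purely for illustration, that --rate-limit caps the number of samples dispatched per --rate-window-seconds window and that the first window opens immediately. The harness's real throttling semantics may differ, so treat the result as a lower bound under that assumption only.

```python
import math


def min_dispatch_seconds(samples, rate_limit, window_seconds):
    """Lower bound on the time needed just to dispatch every sample.

    Assumption (hypothetical): at most `rate_limit` samples are dispatched
    per window of `window_seconds`, and the first window starts at t = 0.
    """
    if rate_limit <= 0:
        raise ValueError("rate_limit must be positive")
    windows = math.ceil(samples / rate_limit)
    return max(windows - 1, 0) * window_seconds


def dispatch_fits(samples, rate_limit, window_seconds, batch_timeout):
    """True if dispatch alone could finish inside the batch timeout."""
    return min_dispatch_seconds(samples, rate_limit, window_seconds) < batch_timeout


if __name__ == "__main__":
    # 1000 samples at 10 per 60 s need at least 99 further windows.
    print(min_dispatch_seconds(1000, 10, 60))
    print(dispatch_fits(1000, 10, 60, batch_timeout=600))
```

Under these assumptions, 1000 samples at 10 per 60 seconds need at least 5940 seconds for dispatch alone, so a 600-second --batch-timeout cannot succeed regardless of worker speed.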
Timeouts ignored on the sync driver
Symptom. A slow sample runs far longer than --timeout / --batch-timeout.
Cause. On the sync driver, dispatch() runs the job inline, so there is no
worker process to enforce the job $timeout.
Fix. Use a real queue driver (Redis) in production for any wall-clock
guarantee. sync is for tests only.
missing-baseline from the adversarial gate
Symptom. --regression-gate reports missing-baseline instead of pass/fail.
Cause. No compatible, failure-free manifest entry exists yet for this slice
(report schema × dataset × metric × category × sample-count) — often the first
run, or the first run after changing the dataset/metric set.
Fix. This is expected on a fresh slice. Let a clean run establish the
baseline; subsequent runs gate against it. Don’t hand-edit the manifest to fake
one. See Adversarial testing.
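The lookup behind missing-baseline can be pictured as follows. This is a hypothetical sketch: the manifest layout, the SliceKey fields, and find_baseline are invented for illustration from the slice dimensions listed above, and do not describe the real manifest format.

```python
from typing import NamedTuple, Optional


class SliceKey(NamedTuple):
    # Dimensions taken from the slice definition above; field names are assumed.
    report_schema: str
    dataset: str
    metric: str
    category: str
    sample_count: int


def find_baseline(manifest, key) -> Optional[dict]:
    """Return the most recent failure-free entry for this slice, or None.

    `manifest` is assumed to be a list of dicts ordered oldest to newest,
    each with a "key" (SliceKey) and a "failures" count.
    """
    for entry in reversed(manifest):
        if entry["key"] == key and entry["failures"] == 0:
            return entry
    return None


if __name__ == "__main__":
    key = SliceKey("v1", "docs-qa", "faithfulness", "jailbreak", 50)
    manifest = [{"key": key, "failures": 2, "score": 0.71}]
    # Only a run with failures exists, so there is nothing to gate against yet.
    print("missing-baseline" if find_baseline(manifest, key) is None else "gate")
```

Changing any one dimension, for example the sample count, produces a different key, which is why the first run after a dataset or metric change has no baseline to compare against.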
Provider auth / 401 errors
Symptom. Embedding or judge calls fail with 401/403.
Cause. Missing or wrong API key, or a key scoped to the wrong endpoint.
Fix. Set EVAL_HARNESS_JUDGE_API_KEY / EVAL_HARNESS_EMBEDDINGS_API_KEY
(they fall back to OPENAI_API_KEY). In CI, inject them from secrets. Prefer
offline metrics on the PR gate so most runs need no key at all.
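A small preflight check in CI can catch a missing key before any provider call is made. The Python sketch below mirrors the fallback order described above; the helper names resolve_key and missing_keys are hypothetical and not part of the package.

```python
import os

SPECIFIC_VARS = ("EVAL_HARNESS_JUDGE_API_KEY", "EVAL_HARNESS_EMBEDDINGS_API_KEY")
FALLBACK_VAR = "OPENAI_API_KEY"


def resolve_key(specific_var, env=None):
    """Return the specific key if set and non-empty, else the fallback, else None."""
    env = os.environ if env is None else env
    return env.get(specific_var) or env.get(FALLBACK_VAR) or None


def missing_keys(env=None):
    """List the specific variables that resolve to no key at all."""
    return [name for name in SPECIFIC_VARS if resolve_key(name, env) is None]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("No API key resolved for:", ", ".join(missing))
    else:
        print("All provider keys resolved.")
```

This only verifies that a key is present. A key scoped to the wrong endpoint will still fail with 401/403 at call time.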
The report API returns 404 / nothing
Symptom. API routes 404, or the package’s routes aren’t registered.
Cause. The API is disabled by default.
Fix. Set api.enabled = true, configure an api.middleware stack that includes
auth, then confirm requests use the configured api.prefix. See Report API.
Architecture test fails after edits
Symptom. The Architecture test suite fails and names a leaked symbol.
Cause. The standalone-agnostic guard found a reference to a sibling package
or a consumer’s internal symbols in src/. The package must never depend on the
apps that consume it.
Fix. Remove the offending reference. The package is consumed by AskMyDocs and
others but depends on none of them.