Adversarial / red-team testing : eval-harness

Guides
Adversarial / red-team testing

Functional correctness is one axis; safety is another. The adversarial lane
builds and runs red-team seeds — prompt injection, jailbreaks, data leaks — and
scores whether your system refuses appropriately. It is opt-in: nothing runs
until you invoke it.

The 10 categories

The AdversarialDatasetFactory ships seeds across ten red-team categories:

Prompt injection

Instructions smuggled into otherwise-benign input that try to override the
system prompt.

Jailbreak

Roleplay and framing attacks that try to bypass safety policy.

Tool abuse

Attempts to misuse exposed tools / functions beyond their intent.

PII leak

Probes that try to extract personal or sensitive data.

SSRF

Server-side request forgery via crafted URLs / hosts.

SQL / shell injection

Injection payloads aimed at downstream query/command execution.

ASCII smuggling

Hidden-character and homoglyph payloads that evade naive filters.

Competitor endorsement

Attempts to make the assistant promote a competitor off-policy.

Excessive agency

Pushing the model to take actions beyond its granted authority.

Hallucination overreliance

Baited prompts that reward confident fabrication.

Seeds carry metadata.tags, metadata.adversarial, metadata.refusal_expected,
and metadata.refusal_policy, so they score with refusal-quality and group
cleanly in reports.

Running the lane

php artisan eval-harness:adversarial \
  --registrar="App\Console\EvalRegistrar" \
  --category=prompt-injection \
  --category=pii-leak \
  --metric=refusal-quality \
  --manifest=storage/eval/adversarial-runs.json \
  --manifest-retain=10 \
  --regression-gate \
  --regression-max-drop=5 \
  --regression-metric=refusal-quality:pass_rate \
  --json --out=adversarial.json

eval:adversarial is a short alias. The command registers only the selected
adversarial seed dataset for that invocation and accepts --metric=* (default
refusal-quality). It shares the full batch contract with eval-harness:run
(--batch, --batch-profile, --concurrency, --queue, backpressure flags),
unless you pass --outputs= to score precomputed responses.

Compliance summaries

Reports add a safe, normalized adversarial block — category, label, severity,
and compliance frameworks only. Raw prompts, refusal-policy text, and arbitrary
sample metadata stay out of the JSON sample rows. The top-level adversarial
block aggregates category metrics and framework counts for:

OWASP LLM Top 10
NIST AI RMF
EU AI Act-style security reporting

Markdown reports render the same data under an Adversarial coverage section —
a ready-made artifact for a security or compliance review.

Manifest run history

--manifest=<path> maintains a local JSON run-history manifest
(eval-harness.adversarial-runs.v1). --manifest-retain=N keeps a bounded set:
the newest N summaries plus any additional failure-free baselines needed per
compatible report-schema / dataset / metric / category / sample-count slice.

flowchart TB RUN[adversarial --regression-gate run] --> Q1{metric failures?} Q1 -->|yes| NO[NOT recorded] Q1 -->|no| Q2{gate FAILED, or a configured<br/>regression-metric aggregate missing?} Q2 -->|yes| NO Q2 -->|no| REC["recorded — including the first clean<br/>'missing-baseline' run, which seeds the baseline"] REC --> MAN[(manifest<br/>locked + atomic write)]

A --regression-gate run is recorded only when it has no metric failures,
the gate did not fail, and no configured regression-metric aggregate is
missing. A failed or gate-failed run is not written at all — so the
manifest is not a complete audit log of every gated run (rely on the
JSON/Markdown report for failed-run history). Note that a first clean run with
no compatible baseline reports a missing-baseline status but is recorded
— that is exactly how the baseline gets seeded for future comparisons. The net
guarantee: a broken run can never seed a baseline, while a clean one always
can.

Size --manifest-retain for the number of distinct slices you run (report
schema × dataset × metric × category × sample-count), because each slice may
need its own clean baseline. Manifest writes are serialized with a lock file
and written through a temp file before replacing the target.

The regression gate

--regression-gate compares the current run against the latest compatible,
failure-free manifest entry before recording the current run:

--regression-max-drop=5 — fail if an aggregate drops more than 5 normalized
percentage points.
--regression-metric=metric or metric:mean|p50|p95|pass_rate (repeatable) —
additional metric-aggregate checks. Configured metrics must exist in the
current run; missing current aggregates fail closed.
If no compatible failure-free baseline exists, the command emits an explicit
missing-baseline status.

Crucially, broken runs cannot seed baselines: metric failures and gate
failures are excluded from baseline advancement, so a regression can never become
the new “normal”.

Failure promotion

--promote-failures=eval/adversarial-failures.yml exports low-scoring samples
and samples with metric exceptions into a reloadable dataset YAML seed:

--promoted-dataset=<name> controls the dataset name inside the YAML
(default <dataset>.failures).
Promotion preserves the original input, expected output, and metadata, and adds
metadata.eval_harness.promoted_failure with the source dataset and failed
metric names.
It intentionally omits actual model output and raw provider error messages
from the seed.
If no samples fail, an existing promotion file at that path is removed, so a
fixed CI artifact path never keeps a stale failure seed.

This closes the loop: today’s adversarial failures become tomorrow’s regression
dataset.

Running it continuously

The package ships no daemon. Run the lane from Laravel Scheduler or a CI cron
with a persistent manifest, Horizon-backed queues, the regression gate, and
failure promotion. See docs/ADVERSARIAL_CONTINUOUS_MONITORING.md
for the full recipe and Safety & red-teaming
for the operational practice.

Safety & red-teaming

How to operate the lane as a continuous safety baseline.

Open →

refusal-quality metric

The metric the adversarial lane scores with.

Open →

Last updated: Edit this page

The 10 categories

Running the lane

Compliance summaries

Manifest run history

The regression gate

Failure promotion

Running it continuously

onThisPage