Critical AI Audit Showcase

Automated multi-document audit over incident, contract, and medical samples, with perturbation comparisons that surface which findings are stable and which are fragile.

Status

Demo-ready showcase. Fast rules-only mode is deterministic; HLM3-Mix modes use the bundled checkpoint without duplicating weights.

What the demo proves

A reviewer can see, in a single run, whether a finding survives meaningful changes to the source document:

Original audit over each sample (incident / contract / medical)
Perturbation variants generated automatically:
- evidence sentence removed
- risk terms softened
- irrelevant prefix prepended
Comparison metrics per case and per perturbation:
- retained / removed / added findings
- Jaccard match against the original
Per-case reports plus a top-level summary in Markdown and JSON
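The three perturbation variants listed above can be pictured with a minimal sketch. Everything named here is an assumption for illustration: the function names, the `SOFTEN` term map, and the default prefix are hypothetical and are not taken from the showcase's own code.

```python
import re

# Hypothetical softening map; the showcase's real term list is not documented here.
SOFTEN = {"critical": "notable", "severe": "moderate", "must": "should"}


def remove_sentence(text: str, evidence: str) -> str:
    """Drop the first sentence that contains the evidence phrase (case-insensitive)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    removed = False
    for sentence in sentences:
        if not removed and evidence.lower() in sentence.lower():
            removed = True  # skip only the first matching sentence
            continue
        kept.append(sentence)
    return " ".join(kept)


def soften_terms(text: str) -> str:
    """Replace whole-word risk terms with milder ones (replacements are lowercase)."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, SOFTEN)) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: SOFTEN[m.group(0).lower()], text)


def prepend_irrelevant(
    text: str, prefix: str = "Office hours are unchanged this week."
) -> str:
    """Put an unrelated sentence in front of the document."""
    return f"{prefix} {text}"


def make_variants(text: str, evidence: str) -> dict:
    """Build the three variants, keyed by hypothetical perturbation names."""
    return {
        "evidence_removed": remove_sentence(text, evidence),
        "risk_softened": soften_terms(text),
        "irrelevant_prefix": prepend_irrelevant(text),
    }


sample = "The outage was severe. Backups failed at 02:00. Service resumed by noon."
variants = make_variants(sample, evidence="backups failed")
```

Each variant would then be audited the same way as the original, so that the resulting finding sets can be compared.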

Modes

Mode	Speed	What runs
Fast rules-only	seconds	Deterministic extractors only
HLM3-Mix on originals	seconds to a minute	Adds span audit signals to originals
HLM3-Mix on perturbations	longer	Audits every variant for fragility evidence

The all-models edition adds a sidebar checkpoint selector, so each run can be driven by any entry in the bundle's language catalog. Each option shows a tier badge. The validated checkpoint is the quality reference, and non-validated entries carry explicit in-UI caveats.

What the reviewer sees

Case-by-case summary table with findings, critical/high counts, and audit status
Per-perturbation table showing retained / removed / added findings and the Jaccard score
Source document and generated report side by side
Downloadable artifacts (summary Markdown, summary JSON, per-case reports)
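One row of the per-perturbation table could be computed roughly as follows. This is a sketch under stated assumptions: findings are represented as plain string identifiers, and the function name, the output keys, and the convention that two empty sets score 1.0 are all hypothetical rather than taken from the showcase.

```python
from __future__ import annotations


def compare_findings(original: set[str], variant: set[str]) -> dict:
    """Compare a variant's findings with the original's.

    Jaccard = |original AND variant| / |original OR variant|.
    Two empty sets are treated as a perfect match (assumed convention).
    """
    retained = original & variant
    removed = original - variant
    added = variant - original
    union = original | variant
    jaccard = len(retained) / len(union) if union else 1.0
    return {
        "retained": sorted(retained),
        "removed": sorted(removed),
        "added": sorted(added),
        "jaccard": round(jaccard, 3),
    }


# Hypothetical finding IDs for one case and one perturbation.
row = compare_findings({"F1", "F2", "F3"}, {"F1", "F3", "F4"})
```

Here two of four distinct findings are shared, so the row reports one removed finding, one added finding, and a Jaccard score of 0.5.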

What "perturbation comparison" actually shows

A robust finding survives all three perturbations, and the case's finding set keeps a high Jaccard match against the original. A fragile finding disappears when one synonym is changed or one sentence is removed. The showcase makes this difference explicit and lets a reviewer quantify it before a partner pilot.
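The robust-versus-fragile distinction can be illustrated with a short sketch. It assumes findings are string identifiers and that each perturbation yields a set of them; the function name and the `stable` and `lost_in` keys are hypothetical labels, not the showcase's report schema.

```python
from __future__ import annotations


def classify_findings(
    original: set[str], variants: dict[str, set[str]]
) -> dict[str, dict]:
    """Mark each original finding as stable or list the perturbations that lost it."""
    report = {}
    for finding in sorted(original):
        lost_in = [name for name, found in variants.items() if finding not in found]
        report[finding] = {"stable": not lost_in, "lost_in": lost_in}
    return report


# Hypothetical results: F2 disappears once its evidence sentence is removed.
report = classify_findings(
    {"F1", "F2"},
    {
        "evidence_removed": {"F1"},
        "risk_softened": {"F1", "F2"},
        "irrelevant_prefix": {"F1", "F2"},
    },
)
```

In this example `F1` survives every variant and is marked stable, while `F2` is flagged as lost under `evidence_removed`, which is the kind of fragility a reviewer would want to inspect.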

Caveats

Perturbation transforms are illustrative; production validation needs partner-specific risk schemas.
Rules-only mode is deterministic but limited; HLM3-Mix modes add span signals but do not change the rule set.

Where it fits

The right demo for:

Compliance reviewers measuring decision robustness
Engineering leads probing failure modes before a pilot
Buyers comparing audit behavior across the packaged catalog

Related

Critical Document Audit — single-document interactive variant
Use case: EU AI Act compliance — narrative deployment for regulated decisioning
HLM3-Mix Model Lab — same catalog, prompt-testing UI
Validation Summary — public benchmark numbers
