Critical AI Audit Showcase
Automated multi-document audit over incident, contract, and medical samples, with perturbation comparisons that surface which findings are stable and which are fragile.
Status
Demo-ready showcase. Fast rules-only mode is deterministic; HLM3-Mix modes use the bundled checkpoint without duplicating weights.
What the demo proves
A reviewer can see, in a single run, whether a finding survives meaningful changes to the source document:
- Original audit over each sample (incident / contract / medical)
- Perturbation variants generated automatically:
  - evidence sentence removed
  - risk terms softened
  - irrelevant prefix prepended
- Comparison metrics per case and per perturbation:
  - retained / removed / added findings
  - Jaccard match against the original
- Per-case reports plus a top-level summary in Markdown and JSON
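The three perturbation variants listed above can be illustrated with a minimal sketch. This is not the showcase's actual implementation: the function names, the `SOFTEN` term map, the naive sentence splitter, and the default prefix are all assumptions made for illustration.

```python
import re

# Hypothetical softening map; the showcase's real term list is not documented here.
SOFTEN = {"critical": "notable", "severe": "moderate", "must": "should"}


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def remove_evidence_sentence(text: str, evidence: str) -> str:
    """Drop every sentence that contains the evidence snippet."""
    kept = [s for s in split_sentences(text) if evidence not in s]
    return " ".join(kept)


def soften_risk_terms(text: str, mapping: dict[str, str] = SOFTEN) -> str:
    """Replace whole-word risk terms with milder wording (case-insensitive match).

    Replacements are always lowercase, so a capitalized term loses its capital.
    """
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, mapping)) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)


def prepend_irrelevant_prefix(
    text: str, prefix: str = "The office cafeteria reopens on Monday."
) -> str:
    """Put an unrelated sentence in front of the document."""
    return f"{prefix} {text}"


def make_variants(text: str, evidence: str) -> dict[str, str]:
    """Build one variant per perturbation, keyed by a hypothetical name."""
    return {
        "evidence_removed": remove_evidence_sentence(text, evidence),
        "risk_softened": soften_risk_terms(text),
        "irrelevant_prefix": prepend_irrelevant_prefix(text),
    }
```

Each variant would then be audited the same way as the original, so that its findings can be compared against the original's.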
Modes
| Mode | Speed | What runs |
|---|---|---|
| Fast rules-only | seconds | Deterministic extractors only |
| HLM3-Mix on originals | seconds to a minute | Adds span audit signals to originals |
| HLM3-Mix on perturbations | longer | Audits every variant for fragility evidence |
The all-models edition adds a sidebar checkpoint selector so each run can be driven by any entry in the bundle's language catalog. Tier badges and caveats appear next to each option. The validated checkpoint is the quality reference; every non-validated entry carries an explicit in-UI caveat.
What the reviewer sees
- Case-by-case summary table with findings, critical/high counts, and audit status
- Per-perturbation table showing retained / removed / added findings and the Jaccard score
- Source document and generated report side by side
- Downloadable artifacts (summary Markdown, summary JSON, per-case reports)
What "perturbation comparison" actually shows
A robust finding survives all three perturbations, and the perturbed audits keep a high Jaccard match against the original. A fragile finding disappears when a single risk term is softened or a single sentence is removed. The showcase makes this difference explicit and lets a reviewer quantify it before a partner pilot.
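The comparison can be sketched as plain set arithmetic. The sketch assumes each finding is reduced to a hashable identifier such as a string; the names `Comparison`, `compare_findings`, and `fragile_findings` are hypothetical, and scoring two empty finding sets as 1.0 is a chosen convention, not documented showcase behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    retained: frozenset  # findings present in both audits
    removed: frozenset   # findings only in the original audit
    added: frozenset     # findings only in the perturbed audit
    jaccard: float       # |intersection| / |union|


def compare_findings(original, perturbed) -> Comparison:
    a, b = frozenset(original), frozenset(perturbed)
    union = a | b
    # Assumed convention: two empty finding sets count as a perfect match.
    score = len(a & b) / len(union) if union else 1.0
    return Comparison(retained=a & b, removed=a - b, added=b - a, jaccard=score)


def fragile_findings(original, variants: dict) -> dict:
    """Map each original finding to the perturbations under which it vanished.

    `variants` maps a perturbation name to the findings of that variant's audit.
    Findings that survive every perturbation do not appear in the result.
    """
    fragile: dict = {}
    for name, findings in variants.items():
        for finding in frozenset(original) - frozenset(findings):
            fragile.setdefault(finding, []).append(name)
    return fragile
```

Under these assumptions, a finding that is absent from the `fragile_findings` result survived every perturbation, which matches the description of a robust finding above.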
Caveats
- Perturbation transforms are illustrative; production validation needs partner-specific risk schemas.
- Rules-only mode is deterministic but limited; HLM3-Mix modes add span signals but do not change the rule set.
Where it fits
The right demo for:
- Compliance reviewers measuring decision robustness
- Engineering leads probing failure modes before a pilot
- Buyers comparing audit behavior across the packaged catalog
Related
- Critical Document Audit — single-document interactive variant
- Use case: EU AI Act compliance — narrative deployment for regulated decisioning
- HLM3-Mix Model Lab — same catalog, prompt-testing UI
- Validation Summary — public benchmark numbers