Critical AI Audit Showcase

Automated multi-document audit over incident, contract, and medical samples, with perturbation comparisons that surface which findings are stable and which are fragile.

Status

Demo-ready showcase. Fast rules-only mode is deterministic; HLM3-Mix modes use the bundled checkpoint without duplicating weights.

What the demo proves

A reviewer can see, in one run, whether a finding survives meaningful changes to the source document:

  1. Original audit over each sample (incident / contract / medical)
  2. Perturbation variants generated automatically:
    • evidence sentence removed
    • risk terms softened
    • irrelevant prefix prepended
  3. Comparison metrics per case and per perturbation:
    • retained / removed / added findings
    • Jaccard match against the original
  4. Per-case reports plus a top-level summary in Markdown and JSON
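
The three perturbation variants in step 2 can be sketched as plain text transforms. This is a minimal illustration, not the showcase's actual implementation: the function names, the `SOFTEN` term map, the default prefix, and the naive sentence splitter are all assumptions made for the example.

```python
import re

# Hypothetical softening map; the showcase's real term list is not documented here.
SOFTEN = {"critical": "notable", "severe": "moderate", "must": "should"}


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def remove_evidence_sentence(text: str, evidence: str) -> str:
    """Drop every sentence containing the evidence phrase (case-insensitive)."""
    kept = [s for s in split_sentences(text) if evidence.lower() not in s.lower()]
    return " ".join(kept)


def soften_risk_terms(text: str, mapping: dict[str, str] = SOFTEN) -> str:
    """Replace whole-word risk terms with milder wording, ignoring case."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, mapping)) + r")\b", re.IGNORECASE
    )
    # Replacements are inserted in lowercase regardless of the matched casing.
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)


def prepend_irrelevant_prefix(
    text: str, prefix: str = "Note: the office is closed on Friday."
) -> str:
    """Put an unrelated sentence in front of the document."""
    return f"{prefix} {text}"


def make_variants(text: str, evidence: str) -> dict[str, str]:
    """Build one variant per perturbation type, keyed by a hypothetical label."""
    return {
        "evidence_removed": remove_evidence_sentence(text, evidence),
        "risk_softened": soften_risk_terms(text),
        "prefix_added": prepend_irrelevant_prefix(text),
    }
```

Each variant would then be audited in the same way as the original, so the resulting finding sets can be compared directly.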

Modes

Mode                        Speed             What runs
Fast rules-only             seconds           Deterministic extractors only
HLM3-Mix on originals       seconds–minute    Adds span audit signals to originals
HLM3-Mix on perturbations   longer            Audits every variant for fragility evidence

The all-models edition adds a sidebar checkpoint selector so each run can be driven by any entry in the bundle's language catalog. Tier badges and caveats appear next to each option; the validated checkpoint is the quality reference, and any non-validated entries surface explicit in-UI caveats.

What the reviewer sees

  • Case-by-case summary table with findings, critical/high counts, and audit status
  • Per-perturbation table showing retained / removed / added findings and the Jaccard score
  • Source document and generated report side by side
  • Downloadable artifacts (summary Markdown, summary JSON, per-case reports)
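
As a rough illustration of how the summary artifacts might be produced, the sketch below renders the same per-case rows as both JSON and a Markdown table. The row fields (`case`, `findings`, `critical`, `high`, `status`) and the function name are hypothetical; the showcase's real summary schema is not specified here.

```python
import json


def build_summary(cases: list[dict]) -> tuple[str, str]:
    """Return (json_text, markdown_text) for a list of per-case rows."""
    # JSON artifact: the rows wrapped in a top-level object, keys sorted for stable diffs.
    json_text = json.dumps({"cases": cases}, indent=2, sort_keys=True)

    # Markdown artifact: one table row per case.
    header = "| Case | Findings | Critical | High | Audit status |"
    divider = "| --- | --- | --- | --- | --- |"
    rows = [
        f"| {c['case']} | {c['findings']} | {c['critical']} | {c['high']} | {c['status']} |"
        for c in cases
    ]
    return json_text, "\n".join([header, divider, *rows])
```

Building both artifacts from one list of rows keeps the downloadable Markdown and JSON summaries consistent with each other.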

What "perturbation comparison" actually shows

A robust finding survives all three perturbations, so the perturbed finding sets keep a high Jaccard match against the original. A fragile finding disappears when a single risk term is softened or a single sentence is removed. The showcase makes this difference explicit and lets a reviewer quantify it before a partner pilot.
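
The comparison itself reduces to set arithmetic. The sketch below assumes each finding can be reduced to a stable string identifier, which is an assumption about the report format rather than something the showcase documents; the function name and the empty-set convention are also illustrative.

```python
def compare_findings(original: set[str], perturbed: set[str]) -> dict:
    """Compare two finding sets and score their overlap."""
    retained = original & perturbed   # present in both runs
    removed = original - perturbed    # lost after the perturbation
    added = perturbed - original      # new after the perturbation
    union = original | perturbed
    # Jaccard = |intersection| / |union|.
    # Assumed convention: two empty sets count as a perfect match.
    jaccard = len(retained) / len(union) if union else 1.0
    return {
        "retained": sorted(retained),
        "removed": sorted(removed),
        "added": sorted(added),
        "jaccard": jaccard,
    }
```

For example, if the original audit yields `{"missing-backup", "expired-cert", "no-mfa"}` and a variant yields `{"missing-backup", "no-mfa", "weak-password"}`, two findings are retained out of four distinct findings overall, giving a Jaccard score of 0.5.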

Caveats

  • Perturbation transforms are illustrative; production validation needs partner-specific risk schemas.
  • Rules-only mode is deterministic but limited; HLM3-Mix modes add span signals but do not change the rule set.

Where it fits

The right demo for:

  • Compliance reviewers measuring decision robustness
  • Engineering leads probing failure modes before a pilot
  • Buyers comparing audit behavior across the packaged catalog