HLM3-Mix Model Lab

Browser UI for prompt-testing packaged HLM3 language checkpoints with explicit load/unload, token-level output, and per-layer iteration signals.

Status

Demo-ready browser surface. Wraps the language checkpoint catalog that ships with the on-site demo bundle; the same catalog also drives the all-models editions of the Critical Document Audit and AI Audit Showcase.

What the demo proves

A reviewer can directly compare HLM3-Mix checkpoints without juggling multiple terminals or duplicate model weights:

Select a checkpoint from the sidebar — each option is labelled with its tier badge and the relevant benchmark metric.
Load explicitly so only one model occupies VRAM at a time.
Generate with greedy or sampling settings and inspect the per-token table.
Unload before switching to the next checkpoint.
Compare any two checkpoints sequentially with a single click.
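
The load, generate, unload cycle above can be sketched as a single-slot holder that refuses to load a second checkpoint while one is still resident. This is only an illustration under stated assumptions: `SingleSlotLab`, its `loader` callback, and the checkpoint names are hypothetical and are not taken from the lab's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SingleSlotLab:
    """Holds at most one loaded checkpoint at a time (hypothetical helper)."""

    loader: Callable[[str], object]  # assumed: builds a model object from a checkpoint name
    loaded_name: Optional[str] = None
    model: Optional[object] = None

    def load(self, name: str) -> None:
        # Refuse to load on top of an existing model so memory use stays explicit.
        if self.model is not None:
            raise RuntimeError(f"unload {self.loaded_name!r} before loading {name!r}")
        self.model = self.loader(name)
        self.loaded_name = name

    def unload(self) -> None:
        # Drop the only reference; a real implementation might also clear a GPU cache here.
        self.model = None
        self.loaded_name = None


# Stand-in loader that returns a plain dict instead of real weights.
lab = SingleSlotLab(loader=lambda name: {"checkpoint": name})
lab.load("checkpoint-a")
lab.unload()
lab.load("checkpoint-b")
```

Making `load` fail loudly, rather than silently replacing the current model, mirrors the explicit unload step in the list above.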

What the reviewer sees

Live sidebar caveat and tier badge for the selected checkpoint
Token table with step, token_id, decoded token, and top-1 probability
Per-layer average Hopfield iterations for the run
A catalog table at the bottom listing every available entry plus on-disk status
Tokens-per-second, total tokens, and total seconds for each run
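
As a rough illustration of how the per-run numbers above could be derived, the sketch below builds one token-table row from raw logits under greedy decoding, averages iteration counts per layer, and computes throughput. The function names, the toy vocabulary, and the `iterations[step][layer]` layout are assumptions made for this example, not the lab's real data structures.

```python
import math
from typing import Dict, List, Sequence


def top1_probability(logits: Sequence[float]) -> float:
    """Softmax probability of the highest-scoring token."""
    peak = max(logits)
    # Subtract the peak before exponentiating for numerical stability.
    total = sum(math.exp(x - peak) for x in logits)
    return 1.0 / total


def token_row(step: int, logits: Sequence[float], vocab: Sequence[str]) -> Dict[str, object]:
    """One greedy-decoding row: step, token_id, decoded token, top-1 probability."""
    token_id = max(range(len(logits)), key=lambda i: logits[i])
    return {
        "step": step,
        "token_id": token_id,
        "token": vocab[token_id],
        "top1_prob": top1_probability(logits),
    }


def per_layer_average(iterations: Sequence[Sequence[int]]) -> List[float]:
    """Average iteration count per layer, given iterations[step][layer]."""
    steps = len(iterations)
    return [sum(column) / steps for column in zip(*iterations)]


def tokens_per_second(total_tokens: int, total_seconds: float) -> float:
    # Guard against a zero-length run instead of dividing by zero.
    return total_tokens / total_seconds if total_seconds > 0 else 0.0


vocab = ["a", "b", "c"]  # toy vocabulary for the example
rows = [token_row(0, [0.0, 0.0, 0.0], vocab), token_row(1, [0.1, 2.0, -1.0], vocab)]
layer_means = per_layer_average([[2, 4], [4, 6]])
```

With sampling instead of greedy decoding, the emitted token is not always the top-1 token, so a row would need to record the sampled id separately.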

Why one catalog matters

The lab and the all-models editions of the audit UIs share a single source of truth. Adding a new checkpoint means editing one file; every downstream UI updates automatically.
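
A minimal sketch of the single-catalog idea, assuming the catalog is a list of records that every UI reads through one helper. The `CatalogEntry` fields, the entry names, and the paths are hypothetical placeholders; the real catalog file and its schema are not shown in this page.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    tier: str
    caveat: str
    path: str  # checkpoint location relative to the bundle root


# Hypothetical entries; real names, tiers, and paths come from the shipped catalog.
CATALOG: List[CatalogEntry] = [
    CatalogEntry("example-small", "validated", "", "checkpoints/example-small.pt"),
    CatalogEntry(
        "example-large",
        "research",
        "diagnostic surface, not a quality claim",
        "checkpoints/example-large.pt",
    ),
]


def catalog_table(entries: List[CatalogEntry], root: Path) -> List[Dict[str, object]]:
    """Rows for a catalog table, including whether each checkpoint exists on disk."""
    return [
        {
            "name": e.name,
            "tier": e.tier,
            "caveat": e.caveat,
            "on_disk": (root / e.path).exists(),
        }
        for e in entries
    ]


table = catalog_table(CATALOG, Path("."))
```

Because each UI would call the same helper on the same list, adding an entry to `CATALOG` is the only change needed for it to appear everywhere.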

Caveats

Some checkpoints in the bundle are diagnostic surfaces, not quality claims — the sidebar caveat and tier badge make the distinction explicit so a reviewer cannot mistake a scale or research artifact for a validated result.
Benchmark values come from offline evaluations; treat them as relative scale indicators, not benchmark headlines.
Larger checkpoints can require several GB of VRAM. On CUDA, the lab automatically switches large models to bfloat16, but CPU mode remains the safe fallback.
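
The device and precision choice described above might look roughly like the following sketch. The function names and the one-billion-parameter threshold are assumptions made for illustration; the page does not state what the lab counts as a large model. The byte sizes are standard: 4 bytes per float32 value and 2 bytes per bfloat16 value.

```python
from typing import Tuple

BYTES_PER_VALUE = {"float32": 4, "bfloat16": 2}


def choose_device_and_dtype(
    param_count: int,
    cuda_available: bool,
    large_threshold: int = 1_000_000_000,  # hypothetical cutoff for "large"
) -> Tuple[str, str]:
    """Pick where and in which precision to load a checkpoint."""
    if not cuda_available:
        # CPU is the fallback; keep full precision there.
        return "cpu", "float32"
    if param_count >= large_threshold:
        # Large models go to half-width bfloat16 to fit in VRAM.
        return "cuda", "bfloat16"
    return "cuda", "float32"


def weight_gib(param_count: int, dtype: str) -> float:
    """Approximate size of the weights alone, in GiB (excludes activations and caches)."""
    return param_count * BYTES_PER_VALUE[dtype] / 2**30


device, dtype = choose_device_and_dtype(3_000_000_000, cuda_available=True)
estimate = weight_gib(3_000_000_000, dtype)
```

For a hypothetical 3-billion-parameter checkpoint, this estimate shows bfloat16 weights taking half the space that float32 weights would.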

Where it fits

The right demo for:

Engineering reviewers comparing checkpoint behavior side-by-side
Partners evaluating which checkpoint to baseline against
Anyone who wants to see the validated public result reproduced live

Related pages:

Validation Summary — canonical benchmark numbers and statuses
Critical Document Audit (all-models) — same catalog in a document-audit workflow
Critical AI Audit Showcase (all-models) — same catalog in an automated audit run
HLM3 model card — architecture and tier context
