HLM3-Mix Model Lab

Browser UI for prompt-testing packaged HLM3 language checkpoints with explicit load/unload, token-level output, and per-layer iteration signals.

Status

Demo-ready browser surface. Wraps the language checkpoint catalog that ships with the on-site demo bundle; the same catalog also drives the all-models editions of the Critical Document Audit and AI Audit Showcase.

What the demo proves

A reviewer can directly compare HLM3-Mix checkpoints without juggling multiple terminals or duplicate model weights:

  1. Select a checkpoint from the sidebar — each option is labelled with its tier badge and the relevant benchmark metric.
  2. Load explicitly so only one model occupies VRAM at a time.
  3. Generate with greedy or sampling settings and inspect the per-token table.
  4. Unload before switching to the next checkpoint.
  5. Compare any two checkpoints sequentially with a single click.
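The load and unload steps above can be sketched as a single-slot holder that refuses a second load until the first model is released. This is a minimal illustration, not the lab's actual code: `SingleModelSlot`, `LoadedModel`, and the `loader` callable are hypothetical names, and the loader here returns a stand-in dictionary rather than real weights.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoadedModel:
    name: str
    handle: object  # whatever a real loader would return (hypothetical)


class SingleModelSlot:
    """Holds at most one loaded checkpoint at a time (illustrative only)."""

    def __init__(self, loader: Callable[[str], object]) -> None:
        self._loader = loader
        self._current: Optional[LoadedModel] = None

    @property
    def loaded_name(self) -> Optional[str]:
        return self._current.name if self._current else None

    def load(self, name: str) -> LoadedModel:
        # Refuse to load over an existing model: the caller must unload first.
        if self._current is not None:
            raise RuntimeError(
                f"{self._current.name!r} is still loaded; unload it first"
            )
        self._current = LoadedModel(name, self._loader(name))
        return self._current

    def unload(self) -> None:
        # Drop the only reference held by the slot.
        self._current = None


# Stand-in loader: a real one would read weights from disk.
slot = SingleModelSlot(loader=lambda name: {"weights_for": name})
slot.load("checkpoint-a")
slot.unload()
slot.load("checkpoint-b")
```

Raising on a second `load` instead of silently swapping models keeps the unload step explicit, which matches the workflow in the list. If the backend were PyTorch (an assumption, the page does not say), an unload would typically also call `torch.cuda.empty_cache()` after dropping the reference.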

What the reviewer sees

  • Live sidebar caveat and tier badge for the selected checkpoint
  • Token table with step, token_id, decoded token, and top-1 probability
  • Per-layer average Hopfield iterations for the run
  • A catalog table at the bottom listing every available entry plus on-disk status
  • Tokens-per-second, total tokens, and total seconds for each run
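As a rough illustration of how the token table and throughput figures could be derived, the sketch below builds one row per decoding step from raw logits and computes tokens per second. It covers the greedy case only, where the emitted token is also the top-1 token. The function names, the toy vocabulary, and the logits are all hypothetical; the lab's real data structures are not shown on this page.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_row(step: int, logits: Sequence[float],
              vocab: Sequence[str]) -> Dict[str, object]:
    # Greedy decoding: the chosen token is the argmax of the distribution.
    probs = softmax(logits)
    token_id = max(range(len(probs)), key=probs.__getitem__)
    return {
        "step": step,
        "token_id": token_id,
        "token": vocab[token_id],
        "top1_prob": probs[token_id],
    }


def throughput(total_tokens: int, total_seconds: float) -> float:
    # Tokens per second; guard against a zero-length run.
    return total_tokens / total_seconds if total_seconds > 0 else 0.0


# Toy inputs (hypothetical): a three-word vocabulary and two steps of logits.
vocab = ["the", "cat", "sat"]
step_logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]]
rows = [token_row(i, logits, vocab) for i, logits in enumerate(step_logits)]
for row in rows:
    print(row)
```

With sampling enabled, the emitted token can differ from the top-1 token, so a fuller version would record both.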

Why one catalog matters

The lab and the all-models editions of the audit UIs share a single source of truth. Adding a new checkpoint means editing one file; every downstream UI updates automatically.
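One way to picture the shared catalog is a single list of entries that every UI reads, with helpers deriving the sidebar label and the on-disk status from it. The sketch below assumes a Python module; the field names, entry names, tiers, and paths are invented for illustration and are not the bundle's real catalog format.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    tier: str      # shown as the tier badge
    metric: str    # benchmark text shown next to the name
    caveat: str    # shown in the sidebar when the entry is selected
    path: str      # checkpoint location relative to the bundle root


# Hypothetical entries: placeholders only, not real checkpoints or scores.
CATALOG: List[CatalogEntry] = [
    CatalogEntry("example-small", "diagnostic", "metric: placeholder",
                 "Hypothetical diagnostic entry", "checkpoints/example-small.pt"),
    CatalogEntry("example-large", "validated", "metric: placeholder",
                 "Hypothetical validated entry", "checkpoints/example-large.pt"),
]


def sidebar_label(entry: CatalogEntry) -> str:
    # Every UI formats labels from the same entry, so they cannot drift apart.
    return f"[{entry.tier}] {entry.name} ({entry.metric})"


def catalog_rows(catalog: List[CatalogEntry], root: Path) -> List[Dict[str, object]]:
    # One row per entry, plus whether the checkpoint file exists under root.
    return [
        {"name": e.name, "tier": e.tier, "on_disk": (root / e.path).is_file()}
        for e in catalog
    ]


rows = catalog_rows(CATALOG, Path("."))
labels = [sidebar_label(e) for e in CATALOG]
```

Under this layout, adding a checkpoint is one new `CatalogEntry`; the label and table helpers pick it up without further changes.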

Caveats

  • Some checkpoints in the bundle are diagnostic surfaces, not quality claims — the sidebar caveat and tier badge make the distinction explicit so a reviewer cannot mistake a scale or research artifact for a validated result.
  • Benchmark values come from offline evaluations; treat them as relative scale indicators, not benchmark headlines.
  • Larger checkpoints can need several GB of VRAM. On CUDA the lab automatically casts large models to bfloat16, but CPU mode remains the safe fallback.
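The VRAM caveat can be made concrete with a small device and dtype selection sketch. It assumes that weights default to float32 and that "large" means a parameter count above some threshold; both the threshold value and the function names are hypothetical, since the page does not state the lab's actual rule. The byte sizes are standard: float32 uses 4 bytes per value and bfloat16 uses 2.

```python
from typing import Tuple

# Hypothetical cutoff for "large"; the lab's real rule is not documented here.
LARGE_MODEL_PARAMS = 1_000_000_000

BYTES_PER_VALUE = {"float32": 4, "bfloat16": 2}


def pick_device_and_dtype(num_params: int, cuda_available: bool) -> Tuple[str, str]:
    # CPU is the fallback whenever CUDA is not available.
    if not cuda_available:
        return ("cpu", "float32")
    # On CUDA, large models use bfloat16 to halve weight memory.
    if num_params >= LARGE_MODEL_PARAMS:
        return ("cuda", "bfloat16")
    return ("cuda", "float32")


def weight_bytes(num_params: int, dtype: str) -> int:
    # Memory for the weights alone; activations and caches are extra.
    return num_params * BYTES_PER_VALUE[dtype]


device, dtype = pick_device_and_dtype(2_000_000_000, cuda_available=True)
gib = weight_bytes(2_000_000_000, dtype) / 2**30
print(f"{device} {dtype} ~{gib:.1f} GiB of weights")
```

The estimate covers weights only, so actual VRAM use during generation will be higher.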

Where it fits

The right demo for:

  • Engineering reviewers comparing checkpoint behavior back to back
  • Partners evaluating which checkpoint to baseline against
  • Anyone who wants to see the validated public result reproduced live