Why HLM

One Architecture. Every Modality.

Transformers need a different architecture for every problem. Vision transformers, audio transformers, multimodal transformers — each one bolts on modality-specific encoders, tokenizers, and projection layers. The result is an engineering patchwork where text, images, and audio never share the same representational substrate.

HLM uses a single architecture for everything: polynomial Hopfield layers. The same layer that processes language also processes 3D point clouds, medical volumes, and audio waveforms. No modality-specific encoders. No projection bridges. One energy landscape, shared across all inputs.

This isn't a convenience — it's a fundamental difference in how the model builds representations.

Transformer vs HLM

How Transformers Understand

A transformer processes tokens through attention layers that compute weighted averages over the input sequence. Understanding emerges from statistical correlation — which tokens co-occur, which patterns predict the next token.

This works remarkably well for language. But it has structural limitations:

  • No discrete memory. Knowledge is distributed across billions of parameters. You cannot point to where a specific fact is stored.
  • No compositional structure. Attention is a flat operation. Hierarchical reasoning requires stacking many layers to approximate what should be a structural property.
  • No native multimodality. Text and images live in different embedding spaces. Aligning them requires contrastive training on massive paired datasets — learning a bridge, not sharing a foundation.
  • Fragile under editing. Knowledge is smeared across shared weights, so changing any of them risks breaking unrelated capabilities. There is no surgical target.

How HLM Understands

An HLM layer stores knowledge as attractor basins in an energy landscape. Each basin is a stable memory pattern — a local minimum that the network converges to when it encounters similar input.

This changes everything about how the model relates to what it knows:

  • Discrete, addressable memory. Each concept occupies a specific basin. You can find it, measure it, and modify it.
  • Compositional by construction. Basins combine through energy superposition. A "polite + technical" concept is the natural blend of two energy minima — not a learned trick.
  • Native multimodality. Text, spatial data, and audio all converge to basins in the same energy landscape. A concept doesn't need a bridge between modalities — it exists as a basin that multiple input types can reach.
  • Surgically editable. Because knowledge has a location, you can operate on it. Inject, remove, move, blend — without touching anything else.
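HLM's internals are not public, but the basin behavior described above can be sketched with the classical binary Hopfield network it generalizes: patterns stored by a Hebbian rule become attractors, and a corrupted cue converges back into the nearest basin. Everything below (pattern sizes, the update schedule) is illustrative, not HLM's actual layer.

```python
import numpy as np

def store(patterns):
    """Hebbian outer-product rule: each stored pattern becomes a local
    energy minimum (an attractor basin) of the network."""
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns) / n
    np.fill_diagonal(W, 0)  # no self-connections
    return W

def converge(W, state, steps=20):
    """Repeated sign updates descend the energy landscape until the
    state settles into a basin (a fixed point)."""
    for _ in range(steps):
        new = np.where(W @ state >= 0, 1.0, -1.0)
        if np.array_equal(new, state):
            break
        state = new
    return state

# Two orthogonal 8-bit "concepts"
p1 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
p2 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
W = store(np.stack([p1, p2]))

# A corrupted cue (one flipped bit) still falls into p1's basin
cue = p1.copy()
cue[0] = -1
print(np.array_equal(converge(W, cue), p1))  # True
```

The basin is addressable in exactly the sense the list above describes: it is a concrete fixed point of the dynamics, reachable from many different cues.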

Building World Models

The difference matters most when you ask: what does the model actually know?

A transformer knows correlations. It knows that "the cat sat on the ___" is likely followed by "mat" because that pattern appears in the training data. It does not have a representation of cats, mats, or sitting that exists independent of the token sequence.

An HLM stores attractors. The basin for "cat" is a stable state that the network converges to from many different inputs — text descriptions, 3D scans, audio of a meow. The basin is the concept, and it exists in the energy landscape whether or not a specific input is present.

This is closer to how world models should work:

|                        | Transformer                     | HLM                                  |
|------------------------|---------------------------------|--------------------------------------|
| Knowledge format       | Distributed weights             | Discrete attractor basins            |
| Concept representation | Implicit (statistical)          | Explicit (energy minima)             |
| Multimodal grounding   | Learned alignment (contrastive) | Native convergence (shared landscape)|
| Compositionality       | Approximated through depth      | Natural through energy superposition |
| Editability            | None (retrain)                  | Surgical (basin operations)          |
| Interpretability       | Opaque                          | Basins are measurable states         |
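The "energy superposition" entry can be illustrated with a toy continuous retrieval step using polynomial similarity weights (a sketch, not HLM's actual update rule): a query lying between two stored patterns retrieves their blend, while an unrelated third pattern contributes nothing.

```python
import numpy as np

def retrieve(memories, query, degree=3):
    """One continuous retrieval step: weight each stored pattern by a
    polynomial of its similarity to the query, then superpose them."""
    sims = memories @ query
    weights = sims ** degree
    weights = weights / weights.sum()
    return memories.T @ weights

# Three mutually orthogonal stored patterns
p1 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
p2 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
p3 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
M = np.stack([p1, p2, p3])

# A query between p1 and p2 retrieves exactly their blend; the
# unrelated p3 has zero similarity and so zero weight
blend = retrieve(M, (p1 + p2) / 2)
print(np.allclose(blend, (p1 + p2) / 2))  # True
```

A pure query still retrieves a pure pattern: `retrieve(M, p1)` returns `p1` unchanged, because the polynomial weighting gives all the mass to the matching memory.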

The Convergence Principle

When you show HLM a picture of a cat and the word "cat," both inputs converge to the same basin. Not because they were trained with a contrastive loss to map to the same point — but because the energy landscape has a natural attractor that both modalities fall into.

This is what associative memory does. It's what Hopfield networks were designed for in 1982. HLM scales this principle to modern model sizes with polynomial interactions that maintain the rich basin structure that softmax attention destroys.
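HLM's "polynomial Hopfield layers" are not publicly specified, but one published family they echo is dense associative memory (Krotov & Hopfield, 2016), where replacing the quadratic interaction with a higher-degree polynomial sharpens basins and raises storage capacity. A minimal sketch of that update rule, with illustrative sizes:

```python
import numpy as np

def poly_update(memories, state, degree=4):
    """Dense-associative-memory-style update with polynomial interaction
    F(z) = z**degree: each memory votes with strength F'(similarity),
    so the nearest basin dominates sharply."""
    sims = memories @ state
    votes = memories.T @ (sims ** (degree - 1))
    return np.where(votes >= 0, 1.0, -1.0)

p1 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
p2 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
M = np.stack([p1, p2])

# A cue with one flipped bit snaps back to p1 in a single step:
# similarities are [6, -2], so the cubed votes are [216, -8] -- the
# nearest memory outweighs the other by a factor of 27
cue = p1.copy()
cue[0] = -1
print(np.array_equal(poly_update(M, cue), p1))  # True
```

With `degree=2` this reduces to the classical Hopfield update; raising the degree makes the weighting on the closest memory steeper, which is the sense in which polynomial interactions keep basins well separated.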

The result is a model that doesn't just predict the next token. It has a landscape of stable concepts that can be surveyed, programmed, and composed — across every modality, in a single architecture.

Practical Implications

For deployment: One model handles text, spatial, and audio. No ensemble of specialists. No alignment training between modalities.

For customization: Edit behavior in milliseconds. Capture a concept in one modality, inject it into another. No retraining.

For understanding: Survey what the model knows. Measure it. Verify it. The energy landscape is not a black box — it's a programmable surface.

For safety: Remove specific knowledge surgically. Guard against over-modification. Audit the operation log. Every change is traceable because every concept has a location.
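In the classical Hopfield setting that HLM generalizes, surgical removal has a direct analogue: subtracting a pattern's Hebbian contribution deletes its basin while leaving the others intact. A toy sketch (illustrative only, not HLM's actual basin operations):

```python
import numpy as np

def hebbian(p, n):
    """One pattern's Hebbian contribution to the weight matrix."""
    W = np.outer(p, p) / n
    np.fill_diagonal(W, 0)
    return W

def is_stable(W, p):
    """A pattern is a basin (fixed point) iff one update leaves it unchanged."""
    return np.array_equal(np.where(W @ p >= 0, 1.0, -1.0), p)

p1 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
p2 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
W = hebbian(p1, 8) + hebbian(p2, 8)

print(is_stable(W, p1), is_stable(W, p2))  # True True

# Surgical removal: subtract p1's contribution; p2's basin is untouched
W -= hebbian(p1, 8)
print(is_stable(W, p1), is_stable(W, p2))  # False True
```

Because the edit is a local subtraction rather than a global retrain, the operation is cheap, targeted, and trivially loggable: the "operation" is the exact matrix that was removed.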

Early Access

HLM models are currently training at scale. Join the waitlist for early access, or contact us for commercial and pilot projects.