Why Polynomial Hopfield
The entire Energy Language approach depends on one architectural decision: using polynomial Hopfield networks instead of transformers.
The Core Difference
Transformers use softmax attention, which is equivalent to a polynomial interaction with degree d → ∞. As d increases, the energy landscape smooths out until only a single attractor remains — a global minimum that everything collapses into. There is nothing left to surgically edit.
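The collapse toward a single winner can be illustrated numerically. Normalized degree-d polynomial weights, z_i^d / Σ_j z_j^d, behave like softmax at growing inverse temperature: as d increases, the distribution concentrates entirely on the maximum entry. (An illustrative toy of the limiting behavior, not the actual attention computation; `poly_weights` is a hypothetical name.)

```python
import numpy as np

def poly_weights(z, d):
    """Normalized degree-d polynomial weights over positive scores z."""
    p = z ** d
    return p / p.sum()

z = np.array([1.0, 2.0, 3.0])
print(poly_weights(z, 2))   # modest spread: [1/14, 4/14, 9/14]
print(poly_weights(z, 50))  # nearly all mass collapses onto the max entry
```

At d=2 every score keeps meaningful weight; by d=50 the second-largest score retains a weight of roughly (2/3)^50 ≈ 10^-9 — effectively a single attractor.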
Polynomial Hopfield networks (d=3) maintain a rich energy landscape with 200+ discrete attractor basins per layer. Each basin is a stable memory pattern — a local minimum in the energy function. These basins can be individually targeted for surgery.
The Math
The energy function for a polynomial Hopfield layer:
E(x) = -1/d · |x|^d + 0.5 · x^T · W · f(x)

where f(x) = sign(x) · |x|^(d-1) is the polynomial interaction function.
- When d → ∞ (softmax): one basin, smooth landscape
- When d = 3 (HLM): many basins, rich landscape, each surgically accessible
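As a concrete sketch, the energy above fits in a few lines of NumPy. This assumes |x|^d denotes the elementwise sum Σ_i |x_i|^d and that f applies elementwise; the function names are illustrative, not the HLM implementation:

```python
import numpy as np

def f(x, d=3):
    """Polynomial interaction: sign(x) * |x|^(d-1), applied elementwise."""
    return np.sign(x) * np.abs(x) ** (d - 1)

def energy(x, W, d=3):
    """E(x) = -1/d * sum_i |x_i|^d + 0.5 * x^T W f(x)."""
    return -np.sum(np.abs(x) ** d) / d + 0.5 * x @ W @ f(x, d)

# With no couplings (W = 0) only the self term remains:
x = np.array([1.0, 2.0])
W = np.zeros((2, 2))
print(energy(x, W))  # -(1^3 + 2^3)/3 = -3.0
```

Attractor basins are the local minima of this function; gradient descent on E(x) from any starting state relaxes into the nearest basin.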
Why d=3 Specifically?
- d=2 (classical Hopfield): basins exist but capacity is limited (~0.14N patterns for N neurons)
- d=3 (HLM): far more basins — dense associative memory theory puts capacity on the order of N^(d-1) patterns — with sharper separation and better surgical precision
- d→∞ (transformer): one basin, no surgery possible
The sweet spot is finite d large enough for capacity but small enough to maintain discrete, separable basins.
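The capacity argument can be sanity-checked with a toy retrieval experiment: store random ±1 patterns and recover one from a corrupted probe using a degree-3 interaction. This is a minimal sketch assuming the standard Krotov–Hopfield-style synchronous update s ← sign(Σ_μ ξ_μ · (ξ_μ·s)^(d-1)), not the HLM training code:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, d = 100, 5, 3
patterns = rng.choice([-1.0, 1.0], size=(K, N))  # K stored memories

def update(s):
    """One synchronous degree-d update: each stored pattern 'votes'
    with weight overlap^(d-1), so well-aligned patterns dominate."""
    overlaps = patterns @ s
    return np.sign(patterns.T @ overlaps ** (d - 1))

# Corrupt pattern 0 by flipping 10 of its 100 bits, then relax.
s = patterns[0].copy()
s[:10] *= -1
for _ in range(5):
    s = update(s)
print(np.array_equal(s, patterns[0]))  # recovery check
```

With d=3 the aligned pattern's vote scales like overlap^2 (≈ 6400 here) while crosstalk from the other memories stays near ~100 per pattern, so the probe falls back into its own basin rather than a merged global minimum.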
Implications
This isn't a limitation of transformers that can be patched — it's a fundamental property of the softmax function. To make neural networks surgically programmable, the architecture must support discrete attractor basins. That's what polynomial Hopfield layers provide.