nanoEBM

A minimal energy-based transformer

Author: Surya Dantuluri
What if next-token prediction weren't a single forward pass, but a tiny optimization problem? nanoEBM is a ~10M-parameter character-level Transformer with a linear energy head that learns to think harder at inference time.
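To make "prediction as a tiny optimization problem" concrete, here is a toy sketch of inference-time refinement: start from an initial guess and take a few gradient steps downhill on an energy function. The quadratic energy, the matrix `A`, and the step count below are illustrative assumptions, not nanoEBM's learned energy head.

```python
import numpy as np

def energy(y, A, b):
    # Toy quadratic energy E(y) = 0.5 * ||A y - b||^2 (illustrative only;
    # nanoEBM learns its energy from data via a linear head).
    return 0.5 * float(np.sum((A @ y - b) ** 2))

def refine(y0, A, b, lr=0.1, steps=20):
    # "Think harder" at inference: repeatedly step the prediction y
    # against the energy gradient dE/dy = A^T (A y - b).
    y = y0.copy()
    for _ in range(steps):
        y -= lr * (A.T @ (A @ y - b))
    return y

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8)) / 8 ** 0.5  # scaled for a stable step size
b = rng.normal(size=8)
y0 = np.zeros(8)          # initial guess (one forward pass's output)
y = refine(y0, A, b)      # refined guess after extra "thinking" steps
print(energy(y0, A, b), energy(y, A, b))
```

Spending more gradient steps lowers the energy further, which is the sense in which extra inference-time compute buys a better prediction.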

Implemented in under 400 lines, it runs on your Mac or on a GPU. With a 67-token vocab, 6 layers, 384-dim embeddings, and 6 heads, it is minimal and extensible by design.