🌀 Emoji Diffusion: Does Order Even Matter?

Inspiration

🎸🎧 and 🎧🎸 mean the same thing — emoji replies are sets, not sequences. But autoregressive LLMs factorize left-to-right,

$$p(y) = \prod_t p(y_t \mid y_{<t}),$$

forcing an order onto data that has none. Masked diffusion models train over random orders instead — the right objective when

$$p(\text{reply}\mid \pi(\text{prompt})) = p(\text{reply}\mid \text{prompt}).$$

Here $\pi$ denotes any reordering of the prompt's emoji. Emoji felt like the perfect petri dish to test whether non-autoregressive models are structurally better at order-free language.

What it does

A 328M-parameter masked-diffusion model that replies to emoji with emoji (🎤🎶🌃 → 🎹🎺🎸), a benchmark pitting it against GPT-5.5 / Claude Opus 4.8 / Gemini 3, and a live GUI where you type emoji and watch it denoise a reply.

How we built it

Two-phase training on an H100 (semantic scaffolding → reply data) with permutation augmentation. We scored everything on order-blind multiset Jaccard, $J(a,b)=\frac{\sum_e\min(a_e,b_e)}{\sum_e\max(a_e,b_e)}$, where $a_e$ is the number of times emoji $e$ appears in reply $a$, and isolated true order-sensitivity from sampling noise with a resample control:

$$\mathrm{order\_effect} = \mathrm{resample\_stability} - \mathrm{perm\_stability}.$$
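As a rough illustration of the scoring above, here is a minimal Python sketch. The function names are hypothetical, and two details are assumptions rather than anything stated in this write-up: each stability is taken to be the mean pairwise Jaccard over a group of replies, and two empty replies are scored as 1.0. Replies are assumed to arrive already split into emoji.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def multiset_jaccard(a, b):
    """Order-blind multiset Jaccard: sum of min counts over sum of max counts."""
    ca, cb = Counter(a), Counter(b)
    emoji = ca.keys() | cb.keys()
    inter = sum(min(ca[e], cb[e]) for e in emoji)
    union = sum(max(ca[e], cb[e]) for e in emoji)
    # Assumption: two empty replies count as identical.
    return inter / union if union else 1.0


def mean_pairwise_jaccard(replies):
    """Assumed definition of 'stability': mean Jaccard over all reply pairs."""
    return mean(multiset_jaccard(x, y) for x, y in combinations(replies, 2))


def order_effect(resampled_replies, permuted_replies):
    """resample_stability - perm_stability, as in the formula above.

    resampled_replies: replies to the same prompt, sampled repeatedly.
    permuted_replies: replies to reordered versions of that prompt.
    """
    return mean_pairwise_jaccard(resampled_replies) - mean_pairwise_jaccard(permuted_replies)


resampled = [["🎸", "🎧"], ["🎧", "🎸"], ["🎸", "🎧"]]
permuted = [["🎸", "🎧"], ["🎸", "🎤"], ["🎧", "🎸"]]
print(round(order_effect(resampled, permuted), 3))
```

A positive value means replies drift more when the prompt is reordered than when it is simply resampled, which is the signal the resample control is meant to isolate.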

Frontier baselines ran through Inspect AI + OpenRouter; the GUI is FastAPI + vanilla JS.

What we learned

Our first numbers showed a $3\text{–}17\times$ win — then we stress-tested and most of it dissolved:

  • Contamination: the model had trained on the eval prompts. On a true held-out split, the fidelity win vanished — it's competitive with frontier ($\approx 0.26$ vs $0.23\text{–}0.28$), not dominant.
  • Hidden reasoning bug: thinking-mode was silently on; Gemini's outputs were truncated artifacts.
  • Noise: at $n=24$, order_effect swung wildly.

The honest headline: a 328M, non-reasoning model holds its own against flagships ~10,000× larger, and does bidirectional any-position infilling that AR structurally can't.

Challenges

Train/test leakage (our flashiest result was memorization), controlling reasoning fairly across providers, Unicode grapheme segmentation (👨‍ 5️⃣), and wrangling a 4.9 GB checkpoint on an ephemeral, by-the-minute H100 over SSH.
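To show why grapheme segmentation made the list, here is a minimal Python sketch using only the standard library. It is a deliberately simplified approximation, not the full Unicode segmentation rules and not necessarily what we ran: the hypothetical `split_emoji` helper only glues zero-width joiners, variation selector 16, the keycap mark, and skin-tone modifiers onto the preceding character, and it does not handle flags. The third-party `regex` package matches extended grapheme clusters with `\X` and is the more robust route.

```python
ZWJ = "\u200d"      # zero-width joiner, glues emoji into one glyph
VS16 = "\ufe0f"     # variation selector 16, requests emoji presentation
KEYCAP = "\u20e3"   # combining enclosing keycap


def is_extender(ch):
    """True for code points that attach to the previous character (simplified)."""
    return ch in (VS16, KEYCAP) or 0x1F3FB <= ord(ch) <= 0x1F3FF  # skin tones


def split_emoji(text):
    """Split text into user-visible emoji using simplified joining rules."""
    clusters = []
    join_next = False  # True right after a ZWJ: the next char joins the cluster
    for ch in text:
        if clusters and (join_next or ch == ZWJ or is_extender(ch)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = ch == ZWJ
    return clusters


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man + ZWJ + woman + ZWJ + girl
keycap_five = "5\ufe0f\u20e3"                          # digit 5 + VS16 + keycap
guitar = "\U0001F3B8"

text = family + keycap_five + guitar
print(len(text))               # counts code points, not visible emoji
print(len(split_emoji(text)))  # counts visible emoji under the simplified rules
```

Counting by code points would treat the family emoji as five tokens and the keycap as three, which would distort any multiset score computed over replies.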

What's next

Scale the eval to all held-out prompts, train a matched AR twin for a clean comparison, and add semantic scoring. The thesis is alive — it just earned its rigor.

Built With

  • bash
  • bge-embeddings
  • claude
  • claude-opus-4.8
  • conda/pip
  • css
  • cuda
  • devin
  • dit
  • fastapi
  • gemini-3-flash
  • gemini-3.1-pro
  • git
  • github
  • github-cli
  • gpt-5.5
  • html
  • hugging-face-tokenizers
  • hugging-face-transformers
  • hydra
  • inspect-ai
  • javascript
  • json/jsonl
  • mdlm
  • numpy
  • nvidia-h100
  • omegaconf
  • openai-sdk
  • openrouter-api
  • prime-intellect
  • pydantic
  • python
  • pytorch
  • pytorch-lightning
  • regex
  • ssh/scp
  • text2emoji-dataset
  • uvicorn
  • weights-&-biases
  • yaml