Team Emoji

🌀 Emoji Diffusion: Does Order Even Matter?

Inspiration

🎸🎧 and 🎧🎸 mean the same thing — emoji replies are sets, not sequences. But autoregressive LLMs factorize left-to-right,

$$p(y) = \prod_t p(y_t \mid y_{<t}),$$

forcing an order onto data that has none. Masked diffusion models train over random orders instead, which is the right objective when the reply distribution is unchanged by any permutation $\pi$ of the prompt:

$$p(\text{reply}\mid \pi(\text{prompt})) = p(\text{reply}\mid \text{prompt}).$$

Emoji felt like the perfect petri dish to test whether non-autoregressive models are structurally better at order-free language.

What it does

A 328M masked-diffusion model that replies to emoji with emoji (🎤🎶🌃 → 🎹🎺🎸), a benchmark pitting it against GPT-5.5 / Claude Opus 4.8 / Gemini 3, and a live GUI where you type emoji and watch it denoise a reply.

How we built it

Two-phase training on an H100 (semantic scaffolding → reply data) with permutation augmentation. We scored everything with an order-blind multiset Jaccard, $J(a,b)=\frac{\sum_e\min(a_e,b_e)}{\sum_e\max(a_e,b_e)}$, where $a_e$ and $b_e$ count how often emoji $e$ appears in replies $a$ and $b$, and isolated true order-sensitivity from sampling noise with a resample control:

$$\mathrm{order\_effect} = \mathrm{resample\_stability} - \mathrm{perm\_stability}.$$
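A minimal Python sketch of the two formulas above, assuming each reply has already been split into a list of emoji. The function names and the pairing scheme (one pair of replies per prompt in each condition) are illustrative assumptions, not the project's actual code, and treating two empty replies as identical is a convention chosen here.

```python
from collections import Counter
from statistics import mean


def multiset_jaccard(a, b):
    """Order-blind overlap: sum of per-emoji min counts over sum of max counts."""
    ca, cb = Counter(a), Counter(b)
    union = sum((ca | cb).values())  # Counter | keeps the max count per emoji
    if union == 0:
        return 1.0  # assumption: two empty replies count as identical
    return sum((ca & cb).values()) / union  # Counter & keeps the min count


def order_effect(resample_pairs, perm_pairs):
    """resample_stability - perm_stability, each a mean Jaccard over reply pairs.

    resample_pairs: (reply, reply) from sampling the same prompt twice.
    perm_pairs: (reply to the prompt, reply to a shuffled copy of the prompt).
    """
    resample_stability = mean(multiset_jaccard(a, b) for a, b in resample_pairs)
    perm_stability = mean(multiset_jaccard(a, b) for a, b in perm_pairs)
    return resample_stability - perm_stability
```

Under this reading, a value near zero means shuffling the prompt changes the reply no more than plain resampling does, while a clearly positive value points to real order sensitivity.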

Frontier baselines ran through Inspect AI + OpenRouter; the GUI is FastAPI + vanilla JS.
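The permutation augmentation used in training can be sketched roughly as follows. This is a guess at the general shape, not the project's pipeline: the function name `permutation_augment`, the `n_copies` and `seed` parameters, and the choice to shuffle only the prompt while leaving the reply untouched are all assumptions.

```python
import random


def permutation_augment(prompt, reply, n_copies=4, seed=0):
    """Return the original (prompt, reply) pair plus n_copies shuffled-prompt variants.

    Assumption: only the prompt is shuffled; the reply is kept as-is.
    """
    rng = random.Random(seed)  # local RNG so augmentation is reproducible
    pairs = [(list(prompt), list(reply))]
    for _ in range(n_copies):
        shuffled = list(prompt)
        rng.shuffle(shuffled)  # in-place shuffle of the copy
        pairs.append((shuffled, list(reply)))
    return pairs
```

Every variant carries the same multiset of prompt emoji, so the model sees the same target under many prompt orders.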

What we learned

Our first numbers showed a $3\text{–}17\times$ win — then we stress-tested and most of it dissolved:

Contamination: the model had trained on the eval prompts. On a true held-out split, the fidelity win vanished — it's competitive with frontier ($\approx 0.26$ vs $0.23\text{–}0.28$), not dominant.
Hidden reasoning bug: thinking-mode was silently on; Gemini's outputs were truncated artifacts.
Noise: at $n=24$, order_effect swung wildly.

The honest headline: a 328M, non-reasoning model holds its own against flagships ~10,000× larger, and does bidirectional any-position infilling that autoregressive (AR) models structurally can't.

Challenges

Train/test leakage (our flashiest result was memorization), controlling reasoning fairly across providers, Unicode grapheme segmentation (👨‍ 5️⃣), and wrangling a 4.9 GB checkpoint on an ephemeral, by-the-minute H100 over SSH.
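To illustrate why grapheme segmentation was a challenge: one visible emoji can span several code points, so iterating over a Python string splits families, keycaps, and flags apart. The sketch below is a deliberately simplified splitter written for this illustration (the name `split_emoji` and its rule set are assumptions). It handles only zero-width joiners, variation selectors, the keycap mark, skin-tone modifiers, and flag pairs, and is not a full implementation of the Unicode segmentation rules.

```python
ZWJ = "\u200d"                            # zero-width joiner
EXTENDERS = {"\ufe0e", "\ufe0f", "\u20e3"}  # variation selectors, keycap mark


def is_skin_tone(ch):
    return 0x1F3FB <= ord(ch) <= 0x1F3FF


def is_regional(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def split_emoji(text):
    """Group code points into approximate emoji clusters (simplified rules)."""
    clusters = []
    for ch in text:
        if clusters:
            prev = clusters[-1]
            joins = (
                ch == ZWJ
                or ch in EXTENDERS
                or is_skin_tone(ch)
                or prev.endswith(ZWJ)  # code point right after a joiner
                or (is_regional(ch) and len(prev) == 1 and is_regional(prev))
            )
            if joins:
                clusters[-1] = prev + ch
                continue
        clusters.append(ch)
    return clusters


# Family (man ZWJ woman ZWJ girl), keycap 5, guitar: 9 code points, 3 emoji.
sample = "\U0001F468\u200d\U0001F469\u200d\U0001F467" + "5\ufe0f\u20e3" + "\U0001F3B8"
clusters = split_emoji(sample)
```

Here `len(sample)` is 9 while `len(clusters)` is 3, which is the gap any emoji tokenizer or scorer has to close before counting emoji.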

What's next

Scale the eval to all held-out prompts, train a matched AR twin for a clean comparison, and add semantic scoring. The thesis is alive — it just earned its rigor.