🌀 Emoji Diffusion: Does Order Even Matter?
Inspiration
🎸🎧 and 🎧🎸 mean the same thing — emoji replies are sets, not sequences. But autoregressive LLMs factorize left-to-right,
$$p(y) = \prod_t p(y_t \mid y_{<t}),$$
forcing an order onto data that has none. Masked diffusion models train over random orders instead — the right objective when
$$p(\text{reply}\mid \pi(\text{prompt})) = p(\text{reply}\mid \text{prompt}).$$
Emoji felt like the perfect petri dish to test whether non-autoregressive models are structurally better at order-free language.
What it does
A 328M-parameter masked-diffusion model that replies to emoji with emoji (🎤🎶🌃 → 🎹🎺🎸), a benchmark pitting it against GPT-5.5 / Claude Opus 4.8 / Gemini 3, and a live GUI where you type emoji and watch the model denoise a reply.
How we built it
Two-phase training on an H100 (semantic scaffolding → reply data) with permutation augmentation. We scored everything on order-blind multiset Jaccard, $J(a,b)=\frac{\sum_e\min(a_e,b_e)}{\sum_e\max(a_e,b_e)}$, and isolated true order-sensitivity from sampling noise with a resample control:
$$\text{order\_effect} = \text{resample\_stability} - \text{perm\_stability}.$$
Frontier baselines ran through Inspect AI + OpenRouter; the GUI is FastAPI + vanilla JS.
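To make the scoring concrete, here is a minimal Python sketch of the multiset Jaccard score and the order_effect control described above. The function names, and the choice to define each stability as the mean pairwise Jaccard over a group of replies, are our assumptions for illustration; the writeup does not spell out how the two stabilities were computed. Replies are assumed to be already split into lists of emoji.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def multiset_jaccard(a, b):
    """Order-blind multiset Jaccard: sum of min counts over sum of max counts."""
    ca, cb = Counter(a), Counter(b)
    keys = ca.keys() | cb.keys()
    denom = sum(max(ca[k], cb[k]) for k in keys)
    if denom == 0:
        return 1.0  # assumption: two empty replies count as identical
    return sum(min(ca[k], cb[k]) for k in keys) / denom


def mean_pairwise_jaccard(replies):
    """Average Jaccard over all unordered pairs of replies (needs >= 2 replies)."""
    return mean(multiset_jaccard(x, y) for x, y in combinations(replies, 2))


def order_effect(resampled_replies, permuted_replies):
    """resample_stability - perm_stability.

    resampled_replies: replies sampled repeatedly for the SAME prompt order.
    permuted_replies:  replies sampled for PERMUTED orderings of that prompt.
    (Assumed definitions, for illustration only.)
    """
    resample_stability = mean_pairwise_jaccard(resampled_replies)
    perm_stability = mean_pairwise_jaccard(permuted_replies)
    return resample_stability - perm_stability


print(multiset_jaccard(["🎸", "🎧"], ["🎧", "🎸"]))        # 1.0
print(multiset_jaccard(["🎸", "🎸", "🎧"], ["🎸", "🎹"]))  # 0.25
```

A positive order_effect means replies change more under prompt permutation than under plain resampling; a value near zero means the apparent order sensitivity is indistinguishable from sampling noise.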
What we learned
Our first numbers showed a $3\text{–}17\times$ win — then we stress-tested and most of it dissolved:
- Contamination: the model had trained on the eval prompts. On a true held-out split, the fidelity win vanished — it's competitive with frontier ($\approx 0.26$ vs $0.23\text{–}0.28$), not dominant.
- Hidden reasoning bug: thinking-mode was silently on; Gemini's outputs were truncated artifacts.
- Noise: at $n=24$, order_effect swung too wildly to support any conclusion.
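One way to put a number on that kind of noise is a percentile bootstrap over per-prompt scores. The sketch below is not what we ran; `bootstrap_ci` and the example values are hypothetical, and it assumes one order_effect value per prompt.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the prompts with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical per-prompt order_effect values for n = 24 prompts (made up).
example_rng = random.Random(1)
per_prompt = [example_rng.uniform(-0.3, 0.3) for _ in range(24)]
low, high = bootstrap_ci(per_prompt)
print(f"mean={mean(per_prompt):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If the interval comfortably straddles zero, the sample cannot distinguish a real order effect from noise, which is the situation the small-n run left us in.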
The honest headline: a 328M, non-reasoning model holds its own against flagships ~10,000× larger, and does bidirectional any-position infilling that AR structurally can't.
Challenges
Train/test leakage (our flashiest result was memorization), controlling reasoning fairly across providers, Unicode grapheme segmentation (👨 5️⃣), and wrangling a 4.9 GB checkpoint on an ephemeral, by-the-minute H100 over SSH.
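The segmentation problem is that one visible emoji can span several code points, so iterating over a Python string character by character miscounts. Below is a simplified, standard-library-only sketch of the idea; `split_emoji` and its helpers are hypothetical names, and it only handles ZWJ sequences, variation selectors, keycaps, skin-tone modifiers, combining marks, and regional-indicator flag pairs. It is not a full Unicode UAX #29 implementation (tag-sequence flags, for example, are not handled).

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner, glues emoji into one glyph


def is_extender(ch):
    """Code points that attach to the preceding emoji rather than start a new one."""
    cp = ord(ch)
    return (
        0x1F3FB <= cp <= 0x1F3FF                      # skin-tone modifiers
        or unicodedata.category(ch) in ("Mn", "Me")   # VS16, keycap mark, combining marks
    )


def is_regional_indicator(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def split_emoji(text):
    """Split text into approximate user-perceived emoji (simplified rules)."""
    clusters = []
    join_next = False  # True right after a ZWJ: the next code point joins too
    for ch in text:
        if not clusters:
            clusters.append(ch)
        elif join_next or ch == ZWJ or is_extender(ch):
            clusters[-1] += ch
        elif (is_regional_indicator(ch)
              and len(clusters[-1]) == 1
              and is_regional_indicator(clusters[-1])):
            clusters[-1] += ch  # second half of a flag pair
        else:
            clusters.append(ch)
        join_next = (ch == ZWJ)
    return clusters


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man + ZWJ + woman + ZWJ + girl
keycap = "5\ufe0f\u20e3"                                # digit + VS16 + keycap mark
print(len(family + keycap), len(split_emoji(family + keycap)))  # 8 2
```

For production use, a library that follows the full grapheme-cluster rules is the safer choice; the third-party `regex` package, for instance, matches extended grapheme clusters with `\X`.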
What's next
Scale the eval to all held-out prompts, train a matched AR twin for a clean comparison, and add semantic scoring. The thesis is alive — it just earned its rigor.
Built With
- bash
- bge-embeddings
- claude
- claude-opus-4.8
- conda/pip
- css
- cuda
- devin
- dit
- fastapi
- gemini-3-flash
- gemini-3.1-pro
- git
- github
- github-cli
- gpt-5.5
- html
- hugging-face-tokenizers
- hugging-face-transformers
- hydra
- inspect-ai
- javascript
- json/jsonl
- mdlm
- numpy
- nvidia-h100
- omegaconf
- openai-sdk
- openrouter-api
- prime-intellect
- pydantic
- python
- pytorch
- pytorch-lightning
- regex
- ssh/scp
- text2emoji-dataset
- uvicorn
- weights-&-biases
- yaml