Inspiration

The frontier labs are converging on capabilities. Soon, the engineering moat will be gone. As model IQ converges, what differentiates any foundation model? The soul: a model's emotional intelligence.

EQ preference data is scarce, expensive, and fundamentally unscalable. So the top models stay hollow in exactly the dimensions users care most about: empathy, vulnerability, social awareness, and emotional perception.

Hundreds of millions of people turn to LLMs during the hardest moments of their lives: layoffs, grief, the hardest conversations. What they get back is soulless. No context. No social awareness.

We set out to fix that, and to build a better society in the process.

What it does

We built an open-source database of emotional-intelligence data that anyone can contribute to, designed to train frontier models.

On our platform, users write ideal responses to prompts across domains like work, grief, relationships, and family. Each prompt has its own contextual rubric. Responses are scored by a panel of LLMs as judges against that rubric.
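A minimal sketch of how a panel of judges might score a response against a per-prompt rubric. The dimension names, weights, and judge interface here are illustrative stand-ins, not our exact schema; in production each judge would be an LLM call.

```python
from statistics import mean

# Hypothetical per-prompt rubric: weighted dimensions, each scored 1-10.
RUBRIC = {
    "empathy": 0.4,
    "social_awareness": 0.3,
    "vulnerability": 0.2,
    "practical_support": 0.1,
}

def panel_score(response: str, judges) -> float:
    """Average the judges' weighted rubric scores for one response.

    Each judge is any callable mapping (response, dimension) -> 1-10.
    Here the judges are simple stand-ins for LLM calls.
    """
    per_judge = []
    for judge in judges:
        weighted = sum(
            weight * judge(response, dim) for dim, weight in RUBRIC.items()
        )
        per_judge.append(weighted)
    return mean(per_judge)

# Stand-in judges for demonstration.
strict = lambda resp, dim: 6
lenient = lambda resp, dim: 9

score = panel_score("I'm so sorry. That sounds crushing.", [strict, lenient])
print(round(score, 2))  # weights sum to 1, so this averages to 7.5
```

Averaging over a panel rather than trusting a single judge smooths out individual-model scoring quirks.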

After submitting, users see how their response ranks against LLMs and against other humans. They can contribute their response to the dataset and watch it improve model behavior in real time via RAG (fine-tuning in progress).

How we built it

We built primarily with Claude Code. Rubrics and prompt design were shaped by human experts: professors, therapists, and cognitive scientists. We referenced and benchmarked against Hugging Face's EQ-Bench. The stack is TypeScript, Next.js, Python, Tailwind CSS, and Supabase.

Challenges we ran into

The biggest challenge was establishing ground truth: what an ideal response to an emotionally charged prompt actually looks like. We solved it by combining a network of human experts with multi-LLM generation of context-specific rubrics for each of the 50 prompts in our dataset.

The second challenge was proving the data actually improves models. We used RAG to pull the most relevant human-written responses and scenarios into the prompt at inference time, giving the model real context to ground its response in. From there, the dataset exports cleanly into a fine-tuning-ready format for frontier labs.
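The retrieval step can be sketched as cosine-similarity search over embedded responses. The toy 3-d vectors below stand in for real embedding-model output, and the function names are illustrative, not our actual API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=3):
    """Return the k human-written responses whose embeddings sit closest
    to the query; these get injected into the model's prompt at inference."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["text"] for item in ranked[:k]]

# Toy corpus with 3-d "embeddings" for illustration only.
corpus = [
    {"text": "Grief response A", "vec": [1.0, 0.1, 0.0]},
    {"text": "Layoff response B", "vec": [0.0, 1.0, 0.2]},
    {"text": "Grief response C", "vec": [0.9, 0.2, 0.1]},
]
print(top_k([1.0, 0.0, 0.0], corpus, k=2))
# -> ['Grief response A', 'Grief response C']
```

The same records, paired with their prompts, export cleanly to a standard fine-tuning format.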

Accomplishments that we're proud of

We generated 50 distinct rubrics across 50 diverse prompts in the domains of loss and grief, plus 50 ideal responses cross-checked by multiple experts and our team. Getting that volume of high-quality ground truth done in a limited timeframe was a serious push.

We're also proud that we chipped away at the hardest part of EQ data, the cost and unscalability of annotation, by building an active leaderboard and ranking system that gamifies contribution. People want to see how they stack up, and that incentive turns annotation from a chore into a loop.

What we learned

We learned how to actually build rubrics for emotional preference data: how to decompose abstract qualities like empathy and vulnerability into dimensions a model can be scored on. Without a structured rubric, LLM judges score everything a 7. With one, they're surprisingly sharp.
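One way to picture that decomposition: give each sub-dimension concrete scoring anchors so judges can't cluster at the middle. The dimension names and anchor text below are illustrative examples, not our exact rubric.

```python
# Illustrative decomposition of "empathy" into scoreable sub-dimensions,
# each with concrete 1/5/10 anchors to spread judge scores.
EMPATHY_RUBRIC = {
    "acknowledgment": {
        1: "Ignores or minimizes the stated feeling",
        5: "Names the feeling but moves on immediately",
        10: "Names the feeling and reflects its specific weight",
    },
    "perspective_taking": {
        1: "Gives advice framed entirely from the responder's view",
        5: "Shows partial understanding of the writer's situation",
        10: "Articulates the writer's situation better than they did",
    },
}

def judge_prompt(dimension: str, rubric: dict) -> str:
    """Build the scoring instruction a judge sees for one dimension."""
    anchors = rubric[dimension]
    lines = [f"Score '{dimension}' from 1-10. Anchors:"]
    lines += [f"  {score}: {desc}" for score, desc in sorted(anchors.items())]
    return "\n".join(lines)

print(judge_prompt("acknowledgment", EMPATHY_RUBRIC))
```

Anchored descriptors are what turn "rate the empathy" from a vibe check into a measurement.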

We learned that getting to a single ideal response is harder than it sounds. Experts disagree, and converging on one answer forced us to be explicit about why a response is ideal, which dimensions it hits, which tradeoffs it makes. The rubric became the argument, not the response itself.

And we learned how far behind frontier models actually are on this dimension. When you score them against a real rubric on a real emotional scenario, the gap between their capability benchmarks and their EQ performance is enormous. That gap is the opportunity.

What's next for The Soul Problem

Fine-tuning models directly on our dataset, not just RAG. Expanding the expert network and growing the dataset well beyond the initial 50 prompts. Benchmarking far more models, publishing our own benchmark on Hugging Face, and fine-tuning an open model like Qwen as a proof point for what EQ-trained weights can actually do.
