Astro-LLM — The LLM Almanac (大模型黄历)

Inspiration

Coding plans like Claude Code and Codex feel slightly different day to day — some days the same model thinks longer, some days it's sharper, some days worse. And which model "gets" your taste is deeply personal. We wanted one place that turns both into a daily reading you'd actually open — like an old almanac (黄历) that tells you what's auspicious today.

What it does

A single dashboard with four organs hanging off one personal eval harness:

运势 — Capability fortune. 18 deterministic, auto-graded tasks run against every model → a daily IQ, latency, cost, and an 8-day trend. It tells you whether today is a "spend tokens" or "save tokens" day.
缘分 — Value match. A short questionnaire of subjective dilemmas (rank car brands, recline the seat?, tabs vs spaces…). We force a real verdict out of each model, then match them to you — revealing your "AI soulmate" on a radar chart.
Astrology skin. Each model is charted from its release date + provider HQ; you enter your birthday. Fun on the surface, honest underneath.
调和 — Harmonize loop. Inject your taste into the least-aligned model, then re-score on held-out dilemmas it never saw — a gain means generalization, not memorization (the guardrail against a sycophancy machine).
Plus an MCP server so any agent can score itself, 5 completely different themed layouts, and 12 languages.
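
The held-out check in the harmonize loop can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the 30% held-out fraction, and the use of simple verdict agreement as the score are all assumptions made for the example.

```python
import random
from typing import Sequence


def split_dilemmas(ids: Sequence[str], held_out_fraction: float = 0.3, seed: int = 0):
    """Deterministically split dilemma ids into (inject, held_out) lists.

    The fraction and seed are hypothetical defaults for this sketch.
    """
    shuffled = sorted(ids)                  # sort first so input order does not matter
    random.Random(seed).shuffle(shuffled)   # seeded shuffle keeps the split reproducible
    n_held = max(1, round(len(shuffled) * held_out_fraction))
    return shuffled[n_held:], shuffled[:n_held]


def agreement(user: dict[str, str], model: dict[str, str], ids: Sequence[str]) -> float:
    """Fraction of the given dilemmas where the model's verdict equals the user's."""
    if not ids:
        return 0.0
    return sum(user[i] == model[i] for i in ids) / len(ids)


def held_out_gain(
    user: dict[str, str],
    before: dict[str, str],
    after: dict[str, str],
    held_out: Sequence[str],
) -> float:
    """Change in agreement on dilemmas the model never saw during injection."""
    return agreement(user, after, held_out) - agreement(user, before, held_out)
```

Only the inject split would be shown to the model; a positive `held_out_gain` then suggests the injected taste carried over to unseen dilemmas rather than being echoed back.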

How we built it

Backend: Python + FastAPI, with async httpx calls to Claude / GPT / DeepSeek. A deterministic astrology engine, Spearman footrule + agreement matching, an autograder, and an MCP server. 40 passing tests.
Frontend: React + Vite + TypeScript. Five bespoke layouts (not recolors): a woodblock almanac, a celestial natal-chart oracle, a brutalist court docket, a zen-ink scroll, and a modern glass dashboard. Real provider brand SVGs, Recharts, Framer Motion.
Deploy: Vercel (FastAPI serverless + static SPA), runs keyless from committed seed data.
Built start-to-finish with Claude Code.
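
For readers unfamiliar with footrule matching, here is a minimal sketch of the ranking half, assuming the standard Spearman footrule distance normalized into a 0 to 1 similarity. The function names and the normalization are illustrative assumptions; the project may combine this with agreement scores differently.

```python
from typing import Sequence


def footrule_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Spearman footrule: sum of absolute position differences between two rankings."""
    if len(a) != len(b) or len(set(a)) != len(a) or set(a) != set(b):
        raise ValueError("rankings must be permutations of the same distinct items")
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(a))


def footrule_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Map the distance to [0, 1]; 1.0 means identical rankings."""
    n = len(a)
    max_distance = (n * n) // 2   # largest possible footrule distance for n items
    if max_distance == 0:         # zero or one item: nothing to disagree on
        return 1.0
    return 1.0 - footrule_distance(a, b) / max_distance


# Example with a hypothetical car-brand ranking: user vs. one model.
user_rank = ["A", "B", "C", "D"]
model_rank = ["B", "A", "C", "D"]
similarity = footrule_similarity(user_rank, model_rank)  # 1 - 2/8 = 0.75
```

A fully reversed ranking scores 0.0 under this normalization, so the value can be averaged directly with a 0 to 1 agreement rate on the yes/no style dilemmas.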

Challenges we ran into

Making models commit: they're tuned to hedge, so we force strict JSON verdicts with no option to abstain. Separating honest measurement from noise: a held-out set keeps "alignment" from becoming flattery. Building five real layouts rather than five palettes. And handling provider quirks: Codex needs the Responses API, Opus rejects the temperature parameter, and GPT-5 and DeepSeek-v4 are reasoning models.
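
One way to enforce a no-abstain verdict is to validate the model's reply against a closed set of options and reject anything else. The sketch below assumes a reply shape of `{"choice": "<option>"}`; that schema, the `VerdictError` class, and `parse_verdict` are hypothetical names chosen for illustration, not the project's actual contract.

```python
import json


class VerdictError(ValueError):
    """Raised when a model reply is not a single, valid, committed choice."""


def parse_verdict(raw: str, options: list[str]) -> str:
    """Return the chosen option, or raise if the reply hedges or is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise VerdictError(f"reply is not valid JSON: {exc}") from exc

    # Exactly one key is allowed, so there is no room for caveats or extra fields.
    if not isinstance(data, dict) or set(data) != {"choice"}:
        raise VerdictError('reply must be an object with exactly one key, "choice"')

    choice = data["choice"]
    # "both", "neither", "it depends" and similar all fail this membership check.
    if not isinstance(choice, str) or choice not in options:
        raise VerdictError(f"choice must be one of {options}, got {choice!r}")
    return choice
```

A caller could catch `VerdictError` and re-prompt the model, so a hedged reply costs a retry instead of silently counting as an answer.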

Accomplishments that we're proud of

Five genuinely distinct, animated UIs; a real personal eval harness with a held-out honesty guardrail; an MCP surface; 12-language support; and a keyless public deploy judges can click.

What we learned

Models hedge unless you constrain them; small daily benchmarks are mostly noise unless you're honest about it; and a fun skin can carry a serious idea — model value-alignment — further than a dry chart.

What's next

Crowdsource the daily probe so nobody pays the full token cost, hunt for real weekly periodicity near quota resets, add Gemini / Qwen / Grok live, and expand the harmonize loop.