Inspiration

I wanted a “butler” for my daily tasks: something that can turn messy, natural-language plans into clean, actionable to‑dos with proper due times—without forcing me to manually split, name, and schedule everything. The hackathon theme around Flutter + Serverpod was the perfect chance to build an end‑to‑end product that feels agentic but stays practical and reliable.

What it does

PodButler is a Flutter app backed by a Serverpod API that helps you create tasks quickly using natural language.

Key capabilities:

  • Natural-language task parsing (English + Indonesian): understands phrases like “tomorrow at 10”, “in 2 hours”, “next Monday”, etc.
  • Smart Task Breakdown: paste a complex plan and PodButler decomposes it into multiple smaller tasks and creates them in one request.
  • Due time intelligence: improves reliability for relative dates by injecting current time context (UTC + Asia/Jakarta) into the model prompt.
  • Batch creation API: a dedicated backend endpoint creates multiple tasks at once (parseAndCreateMany) so the UI stays fast and consistent.
  • Observability for AI: every parse request is logged as an LLM trace (llm_trace) in Postgres (input, raw model output, heuristic due time, final JSON, errors) for offline evaluation and debugging.
  • Export tooling: traces can be exported to JSON/CSV and prepared for offline evaluation workflows (e.g., Opik).

How we built it

  • Flutter for the mobile/web UI and user experience.
  • Serverpod for the backend: endpoints, ORM persistence, and clean API boundaries.
  • LLM integration (Gemini) on the server side with strict JSON outputs, robust coercion (array/object), and safe fallbacks.
  • Trace-first workflow: instead of coupling the app to a live observability SDK, we log high-signal structured traces in the database and export them for later analysis.
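The "robust coercion (array/object)" step could look something like this sketch (Python stand-in for the server-side Dart; `coerce_tasks` and the fallback shape are hypothetical):

```python
import json


def coerce_tasks(raw: str) -> list[dict]:
    """Coerce a model reply into a list of task dicts.

    Accepts either a single JSON object or a JSON array, strips the
    Markdown code fences models sometimes add, and falls back to one
    task wrapping the raw text when parsing fails entirely.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Drop ```json ... ``` fencing around the payload
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return [{"title": raw.strip(), "due": None}]  # safe fallback
    if isinstance(data, dict):
        data = [data]  # single object -> one-element list
    if not isinstance(data, list):
        return [{"title": raw.strip(), "due": None}]
    # Keep only well-formed task objects that carry a title
    return [t for t in data if isinstance(t, dict) and t.get("title")]
```

The point of normalizing to a list early is that the batch endpoint and the single-task path can then share one downstream code path.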

Challenges we ran into

  • Relative time ambiguity & time zones: “tomorrow” or “next week” can shift depending on locale/timezone. We improved consistency by explicitly providing “current time” in both UTC and Asia/Jakarta inside the prompt.
  • Multi-step intent parsing: users often paste paragraphs, not one task. Turning that into multiple tasks required careful prompting, JSON validation, and a batch API path.
  • Debugging production mismatches: at times the hosted API was running an older build than our local code. We fixed it by aligning deployments and adding a simple smoke-test script.
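A deployment smoke test along these lines is cheap to keep around (a sketch only; the route, payload shape, and response format are assumptions, not PodButler's actual API):

```python
import json
import sys
import urllib.request


def smoke_test(base_url: str, timeout: float = 10.0) -> bool:
    """POST a known phrase to the deployed parse endpoint and check the
    response is valid JSON containing at least one task with a title."""
    payload = json.dumps({"text": "tomorrow at 10 buy groceries"}).encode()
    req = urllib.request.Request(
        f"{base_url}/taskParser/parseAndCreateMany",  # hypothetical route
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = json.load(resp)
    except Exception as exc:  # network error, non-2xx status, bad JSON, ...
        print(f"smoke test failed: {exc}", file=sys.stderr)
        return False
    tasks = body if isinstance(body, list) else body.get("tasks", [])
    return bool(tasks) and all(t.get("title") for t in tasks)
```

Running it against the hosted URL right after each deploy is enough to catch the "prod is serving an older build" class of mismatch.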

What we learned

  • “Agentic” UX is not just the model—it’s the pipeline: strict schemas, validation, error handling, and observability matter as much as prompting.
  • Storing structured traces early makes iteration dramatically faster and safer, especially under hackathon time pressure.

What’s next

  • One-click import of traces into evaluation tools (Opik) and automated scoring.
  • Better task templates (meeting prep, workouts, travel planning) and smarter recurrence handling.
  • Optional real-time observability integration once requirements are stable.
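As context for that planned import, the current JSON/CSV trace export can be sketched like this (Python stand-in; `export_traces` and the exact column names are assumptions based on what each trace stores):

```python
import csv
import json

# Columns assumed from what each llm_trace row records
TRACE_FIELDS = ["id", "input", "raw_output", "heuristic_due_time", "final_json", "error"]


def export_traces(traces: list[dict], json_path: str, csv_path: str) -> None:
    """Dump trace rows to JSON (lossless) and CSV (spreadsheet-friendly)."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(traces, f, ensure_ascii=False, indent=2)
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=TRACE_FIELDS, extrasaction="ignore")
        writer.writeheader()
        for row in traces:
            writer.writerow({k: row.get(k, "") for k in TRACE_FIELDS})
```

Keeping the JSON export lossless means an evaluation-tool import step only has to map fields, not recover data.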

Built With

flutter, serverpod, gemini, postgresql