Inspiration
When the Claude, Codex, and Devin sandboxes came out, I thought Cursor would die. Apparently not. I guess people like an interactive, gamified dev experience: visualization, and actually knowing what the agent is doing. I used to work in data orchestration with Airflow, so I'm familiar with the idea of a DAG (Directed Acyclic Graph). A coding task is already a DAG: explore → implement ∥ implement → test ∥ test → ship. Why don't we visualize it for developers?
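To make the "coding task is a DAG" idea concrete, here is a minimal sketch using Python's standard-library graphlib. The task names (implement_a, test_b, and so on) and the parallel_stages helper are made up for illustration and are not taken from the project's code.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
deps = {
    "explore": set(),
    "implement_a": {"explore"},
    "implement_b": {"explore"},
    "test_a": {"implement_a"},
    "test_b": {"implement_b"},
    "ship": {"test_a", "test_b"},
}

def parallel_stages(deps):
    """Group tasks into stages; tasks in the same stage can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose dependencies are done
        stages.append(ready)
        sorter.done(*ready)
    return stages

stages = parallel_stages(deps)
for i, stage in enumerate(stages):
    print(i, " ∥ ".join(stage))
```

Grouping into stages is a simplification: a real scheduler could start test_a as soon as implement_a finishes, without waiting for implement_b.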
What it does
Sea Otter is an orchestration layer that makes any coding agent production-grade: observable, resumable, and self-healing, with clear parallelism and clear dependencies between subtasks.
How we built it
Python, React, and git checkpointing. I built the whole thing with Cognition's Devin and managed to use up all my credits. For some LLM calls I used Claude credits (all 500 used up too), and I spent 2 dollars on DeepSeek API calls.
Challenges we ran into
Repetitive frontend adjustments were the hardest part, along with state management for the subtasks.
Accomplishments that we're proud of
Actually getting it to a running state, implementing the checkpointing, working retries with different models, observability on cost and token usage, React not crashing (I had never done frontend before), and using up all my Devin and Claude credits.
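As a rough illustration of retrying with different models while tracking cost and token usage, here is a minimal sketch. Everything in it is an assumption for illustration: the Usage class, run_with_fallback, the model names, and fake_call (a stand-in for a real LLM client) are hypothetical and not the project's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Usage:
    """Running totals for observability: tokens, cost, and per-model attempts."""
    tokens: int = 0
    cost_usd: float = 0.0
    attempts: list = field(default_factory=list)

def run_with_fallback(task, models, call_model, usage):
    """Try each model in order and return the first successful result."""
    last_error = None
    for model in models:
        try:
            result, tokens, cost = call_model(model, task)
        except Exception as exc:
            # Record the failure and fall through to the next model.
            usage.attempts.append((model, "failed"))
            last_error = exc
            continue
        usage.tokens += tokens
        usage.cost_usd += cost
        usage.attempts.append((model, "ok"))
        return result
    raise RuntimeError(f"all models failed for task {task!r}") from last_error

# Hypothetical stand-in for a real LLM call: the first model always fails.
def fake_call(model, task):
    if model == "model-a":
        raise TimeoutError("simulated failure")
    return f"{model} finished {task}", 120, 0.002

usage = Usage()
result = run_with_fallback("implement", ["model-a", "model-b"], fake_call, usage)
print(result, usage)
```

Keeping the attempt log next to the totals means a UI could show which model failed and which one finally completed each subtask.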
What we learned
Eval and state management are still hard to do with subagents. How do we evaluate subagents? Do we let the main orchestrator evaluate a subagent's work, or ask the model to evaluate itself? For example, when I ran a task with DeepSeek, it gave itself 100 without a doubt!
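One possible direction, sketched here as an assumption rather than something Sea Otter does today: the orchestrator records the subagent's self-score but decides acceptance only from independent checks. The review_subtask function and the check names are hypothetical.

```python
def review_subtask(self_score, checks):
    """Accept a subtask only if every independent check passed.

    self_score is kept for the record but never used in the decision.
    checks maps a check name to True (passed) or False (failed).
    """
    failed = sorted(name for name, passed in checks.items() if not passed)
    return {
        "accepted": not failed,
        "failed_checks": failed,
        "self_score": self_score,
    }

# A subagent that gives itself 100 but fails a test run is still rejected.
verdict = review_subtask(100, {"tests_pass": False, "diff_applies": True})
print(verdict)
```

This does not settle the question of who should judge quality; it only shows how a confident self-score can be kept out of the accept/reject decision.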
What's next for Sea Otter
Improved infra for retries and state management.