Inspiration

When the Claude, Codex, and Devin sandboxes came out, I thought Cursor would die. Apparently not. I guess people like an interactive, gamified dev experience: visualization, and actually knowing what the agent is doing. I used to work in data orchestration with Airflow, so I am familiar with the idea of a DAG (Directed Acyclic Graph). A coding task is already a DAG: explore → implement ∥ implement → test ∥ test → ship. Why don't we visualize it for developers?
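To make the DAG idea concrete, here is a minimal sketch using Python's standard-library `graphlib`. The task names (`implement_a`, `test_b`, and so on) and the `parallel_stages` helper are made up for illustration and are not Sea Otter's actual code. It groups tasks into waves, where everything in the same wave could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key maps to the set of tasks it depends on.
TASKS = {
    "explore": set(),
    "implement_a": {"explore"},
    "implement_b": {"explore"},
    "test_a": {"implement_a"},
    "test_b": {"implement_b"},
    "ship": {"test_a", "test_b"},
}


def parallel_stages(graph):
    """Group tasks into waves; tasks in the same wave have no unmet dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable display
        stages.append(ready)
        sorter.done(*ready)  # marking tasks done unlocks their dependents
    return stages


STAGES = parallel_stages(TASKS)
for number, stage in enumerate(STAGES, start=1):
    print(number, " ∥ ".join(stage))
```

Each printed line is one wave, which is roughly what a visual DAG view would draw as a column of nodes.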

What it does

Sea Otter is an orchestration layer that makes any coding agent production-grade: observable, resumable, and self-healing. Parallelism and dependencies are shown clearly.

How we built it

Python, React, and git checkpointing. I built the whole thing with Cognition and managed to use up all my credits! For some LLM calls I used Claude credits (all 500 used up as well), and I spent 2 dollars on DeepSeek API calls.
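As a rough idea of what git checkpointing can look like, here is a sketch that commits the working tree after a subtask finishes and returns the commit hash to resume from. The `checkpoint` function, its parameters, and the commit message format are assumptions for illustration, not the code in the project. It only uses standard git commands (`git add -A`, `git commit --allow-empty -m`, `git rev-parse HEAD`).

```python
import subprocess


def checkpoint(repo_dir, task_name, run=subprocess.run):
    """Commit the current working tree and return the new commit hash.

    `run` is injectable so the function can be exercised without a real repo.
    Assumes git is installed and user.name / user.email are configured.
    """
    # Stage every change the agent made during this subtask.
    run(["git", "add", "-A"], cwd=repo_dir, check=True)
    # --allow-empty keeps a checkpoint even when the subtask changed no files.
    run(
        ["git", "commit", "--allow-empty", "-m", f"checkpoint: {task_name}"],
        cwd=repo_dir,
        check=True,
    )
    # The hash is what an orchestrator could store to resume or roll back later.
    result = run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Storing one hash per finished subtask is what would let a failed run restart from the last good node instead of from scratch.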

Challenges we ran into

Repetitive frontend adjustments were the hardest part, along with state management for the subtasks.

Accomplishments that we're proud of

Actually getting it to a running state, implementing the checkpointing, getting retries working with different models, adding observability on cost and token usage, keeping React from crashing (I had never done frontend before), and using up all my Devin and Claude credits.
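The retry-with-different-models and cost tracking pieces can be sketched roughly like this. Everything here is hypothetical: `UsageLog`, `run_with_fallback`, the model names, and the fake `call_model` stand in for real API clients, and the token and cost numbers are made up.

```python
from dataclasses import dataclass, field


@dataclass
class UsageLog:
    """Collects per-call token and cost numbers for observability."""
    calls: list = field(default_factory=list)

    def record(self, model, tokens, cost_usd):
        self.calls.append({"model": model, "tokens": tokens, "cost_usd": cost_usd})

    @property
    def total_tokens(self):
        return sum(call["tokens"] for call in self.calls)

    @property
    def total_cost_usd(self):
        return sum(call["cost_usd"] for call in self.calls)


def run_with_fallback(task, models, call_model, log):
    """Try each model in order; return (model, output) for the first success."""
    errors = []
    for model in models:
        try:
            output, tokens, cost_usd = call_model(model, task)
        except Exception as exc:  # any failure moves on to the next model
            errors.append(f"{model}: {exc}")
            continue
        log.record(model, tokens, cost_usd)  # only successful calls are logged here
        return model, output
    raise RuntimeError("all models failed: " + "; ".join(errors))


def fake_call_model(model, task):
    """Stand-in for a real LLM client; the first model always fails."""
    if model == "model-a":
        raise TimeoutError("simulated failure")
    return f"{model} finished {task}", 120, 0.002


LOG = UsageLog()
WINNER, OUTPUT = run_with_fallback("test_a", ["model-a", "model-b"], fake_call_model, LOG)
```

A real version would probably also log the failed attempts, since those cost tokens too.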

What we learned

Evaluation and state management are still hard to do with subagents. How do we evaluate subagents? Do we let the main orchestrator evaluate them, or ask the model to evaluate itself? For example, when I ran a task with DeepSeek, it gave itself 100 without a doubt!
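One possible answer, sketched below, is to let the orchestrator decide based on a check it runs itself (for example, whether the tests passed) and treat the subagent's self-score as a signal to log rather than trust. This is just one option and not what Sea Otter does today; `SubtaskResult`, `orchestrator_verdict`, and the 90-point threshold are all made up.

```python
from dataclasses import dataclass


@dataclass
class SubtaskResult:
    name: str
    self_score: int      # score the subagent gave itself, 0-100
    checks_passed: bool  # outcome of a check the orchestrator ran on its own


def orchestrator_verdict(result):
    """Accept only on the orchestrator's own check; never on the self-score."""
    if result.checks_passed:
        return "accept"
    if result.self_score >= 90:
        # The subagent was confident but the independent check disagreed.
        return "retry (overconfident self-score)"
    return "retry"


print(orchestrator_verdict(SubtaskResult("test_a", self_score=100, checks_passed=False)))
```

With this rule, a subagent that hands itself 100 still gets retried if the independent check fails.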

What's next for Sea Otter

Improved infra for retries and state management.
