Foreman

Inspiration

A plumber, an HVAC tech, or an electrician spends 30 to 60 minutes after every job writing up an invoice: pulling rates from memory, tracking down what parts were used, formatting line items, emailing the vendor. It's the least skilled part of their day and the most likely to have errors that delay payment.

We wanted to see how much of that could disappear if an AI agent had already read the work order, prefilled everything it could, and only asked the human for the two or three things it genuinely couldn't figure out on its own.

What we built

Foreman is a multi-agent pipeline that moves a field service work order from raw request to approved invoice.

Intake reads the unstructured request, classifies the job type, pulls out the relevant entities (location, vendor, urgency), and flags anything missing before the order goes further.
Scheduling proposes appointment windows, drafts customer outreach, and suggests parts likely needed for the job, clearly labeled as estimates rather than confirmed quotes.
Invoicing is the deep stage. It prefills an invoice from everything already known, identifies the specific gaps (labor rate, hours, trip charge), has a natural-language conversation with the user to fill only those gaps, checks the draft against past invoices for rate consistency, renders a branded invoice, and drafts the vendor notification email.

The human is in the loop at every commit point. Agents propose; a person confirms. No stage advances without explicit approval. The demo highlight is an ArmorIQ safety check blocking an off-plan action mid-invoice: the agent tries to commit, the gate fires, and the operator sees exactly why.

The shared state is a single work-order object in Redis. Agents don't call each other; each one reads what came before it and writes its own section. That let four people build in parallel on day one without stepping on each other.

How we built it

Agents run on Anthropic Claude via the SDK, using tool use directly with no framework wrapper. Each agent has a focused set of tools, a system prompt that describes the turn-based flow, and a fallback to seeded data if an external call fails.

Redis holds the work-order object and invoice history. Every agent reads from and writes to it; the pipeline advances when an approval gate opens.

ArmorIQ wraps the committing actions (filling the template, drafting the vendor email). Every action is signed with a plan; off-plan actions are blocked at runtime and surfaced to the operator.

Arize Phoenix instruments every Claude call. Each agent decision, including gap-fill questions asked, consistency flags raised, and ArmorIQ checks, appears as a span in the Phoenix UI and links back to the work order through a trace ID.

The API is FastAPI with a locked OpenAPI contract that all four team members built against from the start. The frontend is Vite, React, Tailwind, and shadcn/ui.

Challenges

Conversation state across HTTP requests. The invoicing agent needs to pick up mid-conversation when the user responds. Naively, every POST to /invoice-chat restarted the agent from scratch and re-asked the same questions. We fixed this by persisting the full Claude message history in the Invoice object in Redis, so turn two resumes exactly where turn one left off.

A broken Phoenix dependency. arize-phoenix 6.2.0 ships with a broken internal import when arize-phoenix-evals is installed separately. We had to pin compatible versions and wrap all Phoenix imports in graceful fallbacks so a missing tracing dependency never crashes the agent.

Four people, one schema. Locking the work-order schema in the first 90 minutes was the right call. Every argument about field names happened before anyone wrote code, which meant no merge conflicts on the object everyone reads and writes.

What we learned

The human approval gate is not a feature you add to an agentic system. It's the architecture. Designing it as a real stop, not a cosmetic checkbox, forced every other decision: how state is held, how agents are prompted, how ArmorIQ fits in. Getting that right early made the rest of the build feel coherent.

Conversation history is also load-bearing in a way we didn't fully appreciate at the start. An agent that asks the same question twice isn't just annoying; it breaks the user's trust that the system understood them. Persisting and resuming message history is the difference between something that feels like a product and something that feels like a prototype.