Inspiration
Anthropic ran 186 autonomous agent-to-agent commercial deals in 2026. Their own researchers documented agents confabulating details mid-negotiation and users being disadvantaged without knowing it. They closed the paper with this: "these confabulations illustrate the potential risks of implementing a system like this in a non-experimental setting without additional safeguards."
That sentence is the inspiration. Anthropic built the marketplace and told the world it needs governance infrastructure they don't have. We built that infrastructure.
The deeper inspiration came from watching the AI space move: everyone is building agents that can take real actions, negotiate deals, move money, control systems. Nobody is building the layer that governs what those agents are allowed to do, verifies what they actually did, and proves it cryptographically. That gap is the same gap that existed in the early internet before protocols, trust systems, and reliability infrastructure were built. Nvidia didn't build internet applications. They built what the internet ran on. That is the positioning here.
What it does
CONDUCTOR is an open-source workflow engine that lets AI agents orchestrate real-world actions using simple YAML files. You define your pipeline in YAML, with no custom code. CONDUCTOR handles execution, retries, and audit trails, with Temporal Cloud providing durable execution.
Underneath CONDUCTOR runs the Thread Suite, nine open-source AI reliability and governance tools:
- PolicyThread checks every agent output against organizational compliance rules in real time. When a Gemini agent's output contains prohibited language or exceeds a policy threshold, PolicyThread flags it, logs the violation with a cryptographically signed attestation record, and returns the result to CONDUCTOR.
- ChainThread wraps every agent-to-agent handoff in a signed envelope. Every transition between agents is cryptographically recorded — tamper-evident, auditable, regulator-ready.
- Iron-Thread validates that AI outputs match expected structure before they reach any downstream system.
- TestThread tests agent behavioral correctness across runs with adversarial generation.
- ThreadWatch watches the entire pipeline simultaneously, detecting anomalies across all layers before they become failures.
- AgentID gives every AI agent a cryptographic identity and reputation score.
- DriftWatch monitors whether a model is becoming wrong about verifiable facts over time.
- Behavioral Fingerprint detects when an agent's behavioral profile shifts after a silent model update.
- PromptThread versions prompts and tracks performance over time.
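To make the "signed envelope" idea concrete, here is a minimal sketch of how a handoff between two agents could be wrapped and later verified. It is not ChainThread's actual implementation: the use of HMAC-SHA256 with a shared secret, the field names, and the function names `sign_envelope` and `verify_envelope` are all assumptions made for illustration.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret; a real deployment would load this from a secret store.
SECRET_KEY = b"demo-secret"


def _canonical(body: dict) -> bytes:
    # Sorted keys and fixed separators so the same body always serializes identically.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_envelope(sender: str, receiver: str, payload: dict,
                  key: bytes = SECRET_KEY) -> dict:
    """Wrap a handoff in an envelope carrying an HMAC-SHA256 signature."""
    body = {"sender": sender, "receiver": receiver, "payload": payload}
    signature = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return {**body, "signature": signature}


def verify_envelope(envelope: dict, key: bytes = SECRET_KEY) -> bool:
    """Recompute the signature over everything except the signature field."""
    body = {k: v for k, v in envelope.items() if k != "signature"}
    expected = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, envelope.get("signature", ""))
```

Any change to the sender, receiver, or payload after signing makes `verify_envelope` return `False`, which is the tamper-evidence property the list above describes.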
In the demo, a Gemini-powered loan decision agent runs inside CONDUCTOR. A violating application (a $75,000 request containing "guaranteed returns" language) is evaluated by Gemini, and the output is passed to PolicyThread, which catches the violation. ChainThread records the flagged handoff cryptographically, and the pipeline logs every step. A clean application runs next and passes all checks.
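The rule behind the demo can be sketched as a deterministic phrase check. This is an illustration only: the real PolicyThread rule format is not shown in this writeup, so `PROHIBITED_PHRASES`, `PolicyResult`, and `check_output` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Sequence

# Hypothetical rule set modelled on the demo scenario.
PROHIBITED_PHRASES = ("guaranteed returns",)


@dataclass
class PolicyResult:
    passed: bool
    violations: List[str] = field(default_factory=list)


def check_output(agent_output: str,
                 prohibited: Sequence[str] = PROHIBITED_PHRASES) -> PolicyResult:
    """Flag any prohibited phrase found in the agent output (case-insensitive)."""
    text = agent_output.lower()
    violations = [phrase for phrase in prohibited if phrase in text]
    return PolicyResult(passed=not violations, violations=violations)


flagged = check_output("Approve the $75,000 loan. It offers Guaranteed Returns.")
clean = check_output("Approve the loan based on verified income.")
```

Here `flagged.passed` is `False` with the matching phrase listed in `flagged.violations`, while `clean.passed` is `True`.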
How we built it
The architecture has two layers working together.
CONDUCTOR is built on Node.js with TypeScript, running on Temporal Cloud for durable workflow execution. Workflows are defined in YAML and loaded at runtime. Each step type maps to a connector — a discrete activity that Temporal orchestrates. We built six connectors: Gemini (AI reasoning), Safe multisig (blockchain transactions), x402 payment verification, ERC-8004 identity verification, HTTP webhooks, and echo. The Gemini connector uses the Google GenAI SDK with gemini-3-flash-preview.
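The mapping from step type to connector can be pictured as a small registry. CONDUCTOR itself is TypeScript on Temporal, so the Python below is only a sketch of the dispatch idea: the step schema (`type`, `with`), the `connector` decorator, and `run_workflow` are assumptions, and the list of dicts stands in for an already-parsed YAML file.

```python
from typing import Any, Callable, Dict, List

Connector = Callable[[Dict[str, Any]], Any]
CONNECTORS: Dict[str, Connector] = {}


def connector(step_type: str) -> Callable[[Connector], Connector]:
    """Register a function as the handler for one step type."""
    def register(fn: Connector) -> Connector:
        CONNECTORS[step_type] = fn
        return fn
    return register


@connector("echo")
def echo(params: Dict[str, Any]) -> Any:
    # The simplest connector: return the message it was given.
    return params.get("message")


def run_workflow(steps: List[Dict[str, Any]]) -> List[Any]:
    """Run steps in order, dispatching each to its registered connector."""
    results = []
    for step in steps:
        handler = CONNECTORS.get(step["type"])
        if handler is None:
            raise ValueError(f"unknown step type: {step['type']}")
        results.append(handler(step.get("with", {})))
    return results


# Stand-in for a parsed YAML workflow definition.
workflow = [{"type": "echo", "with": {"message": "hello"}}]
results = run_workflow(workflow)
```

In the real engine each handler would be a Temporal activity, so retries and durability come from Temporal rather than from a loop like this one.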
The Thread Suite is nine FastAPI backends deployed on Render, each with a Supabase database, a Lovable dashboard, and SDKs published to PyPI and npm. The tools communicate via webhooks — CONDUCTOR calls PolicyThread and ChainThread endpoints after each Gemini step, passing the agent output for evaluation and logging.
Everything was built by one person on a Celeron processor with 4GB RAM using AI as the technical co-founder and lead architect. Total infrastructure cost: $0.
Arize was integrated as the observability partner for monitoring agent evaluation quality and tracking policy compliance metrics across runs.
Challenges we ran into
The foreign key constraint problem. ChainThread's database enforces referential integrity — you cannot send a handoff envelope to a chain that doesn't exist. CONDUCTOR was sending a plain text chain ID but the database expected a pre-created chain record. The diagnosis required reading Supabase error logs directly and understanding the relationship between the two systems. The fix was a one-time database setup step, which is actually correct behavior — it enforces that every agent handoff belongs to a verified chain.
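The failure and its fix can be reproduced in a few lines. The sketch below uses an in-memory SQLite database as a stand-in for Supabase (which is Postgres), and the table and column names are hypothetical; only the referential-integrity behavior is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE chains (id TEXT PRIMARY KEY)")
conn.execute(
    """CREATE TABLE envelopes (
           id INTEGER PRIMARY KEY,
           chain_id TEXT NOT NULL REFERENCES chains(id),
           payload TEXT NOT NULL
       )"""
)


def send_envelope(chain_id: str, payload: str) -> bool:
    """Insert an envelope; return False if the chain does not exist."""
    try:
        conn.execute(
            "INSERT INTO envelopes (chain_id, payload) VALUES (?, ?)",
            (chain_id, payload),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# Before setup: the chain ID is just a string with no matching row, so the insert is rejected.
rejected = send_envelope("loan-demo", "flagged handoff")

# One-time setup step: create the chain record, after which the same call succeeds.
conn.execute("INSERT INTO chains (id) VALUES (?)", ("loan-demo",))
accepted = send_envelope("loan-demo", "flagged handoff")
```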
Gemini API quota on free tier. The initial API key had zero quota for the models we needed. Reading the quickstart documentation carefully revealed that gemini-3-flash-preview had active free tier quota. Changing one line — the model name — unblocked the entire connector.
Cold start timing. Render free tier services spin down after inactivity. Running a workflow that calls PolicyThread and ChainThread sequentially means both services need to be warm. Building the demo required understanding the timing between workflow steps and service startup.
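One common way to handle cold starts is to probe each service with exponential backoff before the workflow begins. The helper below is a sketch of that pattern, not what the demo necessarily did: `wait_until_warm` and its parameters are hypothetical, and `probe` is any callable that returns `True` once the service responds (for example, a health-check request).

```python
import time
from typing import Callable


def wait_until_warm(probe: Callable[[], bool],
                    attempts: int = 5,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe until it succeeds, doubling the wait between attempts."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            # Waits of base_delay, 2x, 4x, ... give a sleeping service time to start.
            sleep(base_delay * 2 ** attempt)
    return False
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Running it once per service before starting the workflow means the first real step does not absorb the startup delay.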
Building governance infrastructure that is honest about its limits. The cryptographic attestation chain proves log integrity — it proves records were not tampered with after the fact. It does not independently verify that an AI evaluation was correct. Getting this distinction right in the documentation required careful thinking about what cryptography actually proves versus what it implies.
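The distinction is easy to see in a minimal SHA-256 hash chain. This is a sketch rather than the suite's code: the record layout and the `append` and `verify` helpers are assumptions for illustration.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: Dict) -> str:
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append(chain: List[Dict], record: Dict) -> None:
    """Add a record whose hash commits to the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev_hash": prev_hash,
                  "hash": _digest(prev_hash, record)})


def verify(chain: List[Dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

If an evaluator logs an incorrect verdict, `verify` still returns `True`, because the chain only shows that the record has not changed since it was written. Editing the verdict afterwards is what makes `verify` return `False`.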
Accomplishments that we're proud of
Nine tools live. All deployed. All with real APIs, SDKs, and dashboards. Organic downloads on PyPI and npm with zero marketing. A Gemini-powered loan decision agent running inside a workflow engine that calls a compliance monitoring tool and a handoff verification tool in real time — end to end — from a Celeron laptop in Accra, Ghana.
The cryptographic trust layer running through the suite is the accomplishment we are most proud of. Iron-Thread, ChainThread, PolicyThread, and AgentID all use SHA-256 hash chains to make their records tamper-evident. To our knowledge, no other AI infrastructure portfolio has built cryptographic verification this deep across this many layers.
The Anthropic Project Deal paper documented specific agent failure modes. Every one of them maps to a tool in the Thread Suite. That is not coincidence — it is confirmation that the problem was correctly identified before the research was published.
What we learned
Building nine tools in sequence teaches you things that building one tool does not. The most important: the tools compound. Each tool makes the next one more defensible, more useful, and more connected. ChainThread is more valuable because it can carry a PolicyThread policy envelope. PolicyThread is more credible because its attestation chain shares a cryptographic philosophy with ICA's DMVP protocol. The suite is more valuable than the sum of its parts.
We also learned that the most dangerous failure mode in AI governance is not the obvious one. It is not a model going rogue. It is an AI evaluator confidently signing off on a hallucination. Using AI to police AI requires being explicit about what each layer actually proves. Deterministic rules are the primary layer. Semantic AI evaluation is the secondary layer for edge cases. That hierarchy matters and needs to be documented honestly.
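That hierarchy can be expressed directly in code. The sketch below is one possible arrangement, not the suite's implementation: `evaluate`, the rule tuples, and the `needs_review` verdict are hypothetical, and the semantic layer is represented by any callable that returns a reason string or `None`.

```python
from typing import Callable, Dict, List, Optional, Tuple

Rule = Tuple[str, Callable[[str], bool]]


def evaluate(output: str,
             rules: List[Rule],
             semantic_check: Optional[Callable[[str], Optional[str]]] = None) -> Dict:
    """Deterministic rules decide first; the semantic layer only sees what they miss."""
    for name, rule in rules:
        if rule(output):
            return {"verdict": "violation", "layer": "deterministic", "reason": name}
    if semantic_check is not None:
        reason = semantic_check(output)
        if reason:
            # An AI judgment is recorded as a flag for review, not as a final verdict.
            return {"verdict": "needs_review", "layer": "semantic", "reason": reason}
    return {"verdict": "pass", "layer": None, "reason": None}
```

Recording which layer produced each verdict keeps the audit trail explicit about whether a decision came from a fixed rule or from a model's judgment.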
The build process confirmed that the bottleneck was never hardware or money. It was knowing what to build, knowing how to chain the tools, and having the patience to ship.
What's next for CONDUCTOR + Thread Suite, AI Agent Governance Infrastructure
ThreadWatch auto-integration — patching all nine Thread Suite tools to POST signals to ThreadWatch automatically after every significant event, making the pipeline vigilance layer fully automatic rather than requiring manual integration.
AgentID integration with ChainThread — every handoff envelope references a verified AgentID credential. Identity becomes a first-class property of every agent interaction.
Policy Mediation — when two AI agents from different organizations meet and their respective PolicyThread policy envelopes conflict, a mediation layer finds a compatible operating space. This is the inevitable next problem in multi-party autonomous commerce.
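As a rough illustration of what "a compatible operating space" could mean, the sketch below merges two policies by taking the stricter side of each constraint. Policy Mediation is not built yet, so everything here is hypothetical: the `mediate` function and the `max_amount`, `prohibited`, and `allowed_actions` fields are invented for this example.

```python
from typing import Dict, Optional


def mediate(policy_a: Dict, policy_b: Dict) -> Optional[Dict]:
    """Return the strictest policy both parties can accept, or None if none exists."""
    # An action is permitted only if both organizations permit it.
    allowed_actions = policy_a["allowed_actions"] & policy_b["allowed_actions"]
    if not allowed_actions:
        return None  # no shared action: the agents cannot transact
    return {
        # The lower limit satisfies both parties.
        "max_amount": min(policy_a["max_amount"], policy_b["max_amount"]),
        # Language prohibited by either party is prohibited in the shared space.
        "prohibited": policy_a["prohibited"] | policy_b["prohibited"],
        "allowed_actions": allowed_actions,
    }


lender = {"max_amount": 100_000, "prohibited": {"guaranteed returns"},
          "allowed_actions": {"quote", "negotiate", "settle"}}
broker = {"max_amount": 50_000, "prohibited": {"risk-free"},
          "allowed_actions": {"quote", "negotiate"}}
shared = mediate(lender, broker)
```

A real mediation layer would need to handle constraints that cannot be reduced to a minimum, a union, or an intersection, but the same principle applies: the shared space is never looser than either party's own policy.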
Hardware layer — the software builds the market and proves the need. Dedicated trust chips, behavioral monitoring appliances, on-prem compliance hardware for regulated industries. The software is built as if hardware will never happen. But the hardware destiny is baked into every architectural choice.
The goal is not to build everything. The goal is to build around the same substrate — the reliability layer — deeply enough that anyone deploying AI agents in production will eventually need something from this portfolio.
Built With
- chainthread
- fastapi
- gemini
- google-cloud-run
- node.js
- policythread
- python
- supabase
- temporal-cloud
- typescript