ClearPort: Autonomous Customs Recovery

What started it

Small and medium exporters have no in-house customs team. They file cross-border declarations through digital tools and hope the filing clears. When a destination silently changes a rule, a date format, a required field, or a tariff code set, the declaration comes back rejected with a cryptic code. While someone figures out what the code means, the container sits at the dock accruing demurrage of $200 to $1,000+ per day. A few days of that can erase a quarter's profit.

Big firms have SAP GTS and a broker on retainer. The small exporter has nothing. The gap is not document generation or submission, which is already solved by tools like EasyPost and Avalara. The gap is runtime diagnosis and validated repair of a rejection. Nobody parses the rejection, finds the failing field, rewrites it, checks the fix against known-good submissions, and resubmits. That is the hole we set out to fill.

What we built

ClearPort is an autonomous customs-recovery layer. It closes a single loop:

    diagnose -> patch -> eval-gate -> tiered act -> learn

A rejection comes in. The agent recalls relevant memory, diagnoses the root cause, patches the declaration, and then has to get past an evaluation conscience built on Arize Phoenix before any real-money action happens. Low-risk fixes auto-clear and buy the shipping label. High-value or restricted parcels cross an explicit hard line and escalate to a human. Every outcome is written back to memory, so the same error self-heals the next time it appears, with no classifier and no human in the loop.
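The tiered-act step might look like the sketch below. This is a minimal illustration under stated assumptions, not ClearPort's actual code: the `Parcel` fields, the `HIGH_VALUE_USD` threshold, and the `decide` function are hypothetical names, and escalating when the eval-gate fails is an assumed policy.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    AUTO_CLEAR = "auto_clear"  # low-risk fix: proceed and buy the label
    ESCALATE = "escalate"      # hand the case to a human


# Hypothetical hard line; the real threshold is not stated in this write-up.
HIGH_VALUE_USD = 2500.0


@dataclass
class Parcel:
    declared_value_usd: float
    restricted: bool


def decide(parcel: Parcel, gate_passed: bool) -> Action:
    """Tiered act: the hard line is checked before the eval-gate verdict."""
    # High-value or restricted parcels always go to a human.
    if parcel.restricted or parcel.declared_value_usd >= HIGH_VALUE_USD:
        return Action.ESCALATE
    # Assumption: a failed eval-gate also escalates rather than retrying.
    if not gate_passed:
        return Action.ESCALATE
    return Action.AUTO_CLEAR
```

Checking the hard line first means a passing gate verdict can never clear a parcel that should have gone to a human.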

The piece we are most proud of is that the evaluator itself improves, and we can prove the improvement is real rather than circular.

How it works

The runtime brain is Gemini on Vertex AI, driving a fully traced Python recovery loop that is also exposed through a Google ADK root_agent surface. Each step of the loop (recall, diagnose, patch, verify, decide, act, learn) is a real OpenTelemetry span, so the whole decision is reconstructable in Phoenix.

The trust layer is Arize Phoenix. When live, it provides the eval-gate judge through phoenix-evals, writes each verdict back onto the verify span as an annotation, holds the episodic datasets, and runs real experiments for memory promotion, a synthetic benchmark, and the judge-quality meta-eval. Offline, every one of those has a deterministic backstop, so the entire demo reproduces with no API keys.

Memory is tiered: static customs law (with veto power), episodic outcomes, distilled lessons, and procedural prompts. A fix only becomes a permanent lesson after an experiment shows the human-corrected version beats the agent's own attempts by a real margin, with enough examples behind it.
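The promotion rule can be sketched as a simple check. The thresholds `MIN_EXAMPLES` and `MIN_MARGIN` and the function `should_promote` are hypothetical; the write-up does not state the real values or how scores are computed.

```python
from statistics import mean

# Hypothetical thresholds, chosen only for illustration.
MIN_EXAMPLES = 5    # "enough examples behind it"
MIN_MARGIN = 0.10   # "a real margin"


def should_promote(human_scores: list[float], agent_scores: list[float]) -> bool:
    """Promote a fix to a permanent lesson only with enough evidence and margin."""
    # Too few examples on either side: do not promote, whatever the scores say.
    if len(human_scores) < MIN_EXAMPLES or len(agent_scores) < MIN_EXAMPLES:
        return False
    # The human-corrected version must beat the agent's attempts by the margin.
    return mean(human_scores) - mean(agent_scores) >= MIN_MARGIN
```

Requiring both conditions keeps a single lucky correction from becoming a permanent lesson.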

Evaluating the evaluator

It is easy to fake a good eval-gate by grading it with the very rule it already enforces. That makes "we never wrongly auto-cleared anything" a tautology and the model does no real work. We refused to do that.

Instead the gate is graded by an independent oracle that models the destination registry, which is the set of rules the carrier never checks. A separate learned judge predicts the destination verdict from semantically similar past outcomes, using similarity-weighted kNN offline and Gemini few-shot in-context learning when live. It only ever tightens the gate, and it abstains until it has enough relevant precedent, so a cold start behaves exactly like the old gate.
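A rough sketch of the offline judge follows: similarity-weighted kNN with abstention, combined with the gate so that it can only tighten. The embedding vectors, the parameters `K`, `MIN_SIMILARITY`, and `MIN_PRECEDENTS`, and the function names are all assumptions made for the example.

```python
import math
from typing import Optional

# Hypothetical parameters for illustration.
K = 5                  # neighbors considered
MIN_SIMILARITY = 0.5   # precedents below this are not treated as relevant
MIN_PRECEDENTS = 3     # abstain until this many relevant precedents exist


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def judge(query: list[float], memory: list[tuple[list[float], bool]]) -> Optional[bool]:
    """Predict whether the destination accepts; None means the judge abstains."""
    scored = sorted(
        ((cosine(query, vec), accepted) for vec, accepted in memory),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbors = [(s, ok) for s, ok in scored[:K] if s >= MIN_SIMILARITY]
    if len(neighbors) < MIN_PRECEDENTS:
        return None  # cold start: not enough relevant precedent
    accept_weight = sum(s for s, ok in neighbors if ok)
    reject_weight = sum(s for s, ok in neighbors if not ok)
    return accept_weight > reject_weight


def gated(base_gate_passed: bool, verdict: Optional[bool]) -> bool:
    """The judge may veto a pass but never rescue a fail."""
    if verdict is None:
        return base_gate_passed  # abstention leaves the old gate unchanged
    return base_gate_passed and verdict
```

Because `gated` combines the two with a logical AND, adding the judge can only turn an auto-clear into an escalation, never the reverse.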

Then we measured it. As the judge accumulates adjudicated experience, accuracy against the independent oracle climbs from 50% cold to 100% taught, and false auto-clears fall from 50% to 0%. That curve is reproducible offline with clearport-judge-eval, and when Phoenix is live it is registered as a real experiment whose task actually runs the judge, so the improvement can be inspected run by run in Phoenix rather than taken on trust.
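The two headline numbers could be computed along these lines. This is a sketch only: `gate_metrics` is a hypothetical helper, and it assumes the false auto-clear rate is taken over all cases, which the write-up does not specify.

```python
def gate_metrics(decisions: list[bool], oracle: list[bool]) -> dict[str, float]:
    """decisions[i]: the gate auto-cleared case i.
    oracle[i]: the independent oracle says the destination would accept it."""
    if len(decisions) != len(oracle) or not decisions:
        raise ValueError("need two equal-length, non-empty lists")
    n = len(decisions)
    # Accuracy: the gate's clear/hold decision agrees with the oracle.
    correct = sum(d == o for d, o in zip(decisions, oracle))
    # False auto-clear: the gate cleared something the oracle would reject.
    false_clears = sum(d and not o for d, o in zip(decisions, oracle))
    return {"accuracy": correct / n, "false_auto_clear_rate": false_clears / n}


# Illustrative data only, not ClearPort's benchmark.
oracle = [True, False, True, False]
cold = gate_metrics([True, True, True, True], oracle)
taught = gate_metrics([True, False, True, False], oracle)
```

The key point is that `oracle` comes from a different authority than the rule the gate enforces, so neither metric is satisfied by construction.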

What we learned

The hardest and most valuable lesson was about honesty in evaluation. Separating the authority that grades the gate from the authority the gate uses to decide changed the whole design, and it is what makes the learning curve believable instead of decorative.

We also learned a lot about keeping an agent demoable and reliable at the same time. Putting a deterministic fallback behind every external dependency meant we could develop, test, and demo with zero keys, then flip to live Vertex AI and Phoenix by changing environment variables alone, with no code changes.
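One way to arrange such a switch is sketched below. The environment variable `CLEARPORT_LIVE`, the declaration fields, and both judge functions are hypothetical stand-ins, not ClearPort's real configuration.

```python
import os
from typing import Callable, Mapping, Optional

Judge = Callable[[dict], bool]


def offline_judge(declaration: dict) -> bool:
    """Deterministic backstop: a fixed structural check with no network calls."""
    return bool(declaration.get("hs_code")) and bool(declaration.get("ship_date"))


def live_judge(declaration: dict) -> bool:
    """Placeholder for the live path; the real evaluator call would go here."""
    raise NotImplementedError("wire up the live evaluator here")


def select_judge(env: Optional[Mapping[str, str]] = None) -> Judge:
    """Pick the live or offline judge from the environment, never from code edits."""
    env = os.environ if env is None else env
    # CLEARPORT_LIVE is an assumed variable name used only for this example.
    live = env.get("CLEARPORT_LIVE", "").lower() in {"1", "true", "yes"}
    return live_judge if live else offline_judge
```

Callers only ever see a `Judge`, so tests and demos run against `offline_judge` by default and the live path is opted into explicitly.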

Challenges we faced

Avoiding circular evaluation. As described above, this was as much a design problem as a coding one.
Keeping the per-request path fast and dependency-light. We use the in-process Phoenix client on the hot path and reserve the Model Context Protocol surface for a few explicit, non-hot-path places like the startup handshake, prompt management, and on-demand investigation.
Memory recall ordering. Lessons are retrieved semantically first, but static law keeps veto power, so a remembered shortcut can never override a hard customs rule.
Making the numbers defensible. Every headline metric on the dashboard shows its assumptions inline, and the backend is covered by 146 passing tests.
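The recall-ordering challenge above, where lessons are retrieved semantically first but static law keeps veto power, can be sketched as follows. The `Lesson` type, the rule signature, and `recall_patch` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Lesson:
    name: str
    similarity: float              # semantic match to the current rejection
    patch: Callable[[dict], dict]  # returns a patched declaration


# Static law: each rule returns True when the declaration is allowed.
StaticRule = Callable[[dict], bool]


def recall_patch(
    declaration: dict, lessons: list[Lesson], static_rules: list[StaticRule]
) -> Optional[dict]:
    """Try lessons best match first; skip any whose result breaks static law."""
    for lesson in sorted(lessons, key=lambda l: l.similarity, reverse=True):
        candidate = lesson.patch(dict(declaration))  # patch a copy, not the original
        if all(rule(candidate) for rule in static_rules):
            return candidate
    return None  # no lesson survives the veto


# Hypothetical example: a high-similarity shortcut that static law rejects.
rules = [lambda d: len(d.get("hs_code", "")) >= 6]
shortcut = Lesson("drop_code", 0.9, lambda d: {**d, "hs_code": ""})
safe = Lesson("fix_date", 0.7, lambda d: {**d, "ship_date": "2024-01-31"})
decl = {"hs_code": "847130", "ship_date": "31/01/2024"}
patched = recall_patch(decl, [safe, shortcut], rules)
```

Here the higher-ranked `shortcut` is tried first but vetoed, so the result comes from `safe`.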

Scope and honesty

ClearPort runs in EasyPost test mode and never files to a real government customs system. The regional overlay simulates a destination registry so we can trigger silent rule changes on demand. The agent performs structural and syntactic repair only; final legal classification of high-value or restricted goods always routes to a human.