Inspiration
The web is about to get a second audience. AI agents are no longer just answering questions: they're browsing, clicking, filling forms, and completing tasks inside real web applications. Most developers still build only for human users and have no idea how their code will perform for agents. A human can squint past a <div onclick> that should be a button. An agent can't. It bounces, and the developer never finds out why, because the app looks fine to a human.
I looked for a tool that could tell developers "here's what an AI agent experiences when it hits your app" and found nothing. Lighthouse checks accessibility for humans. WAVE checks WCAG compliance. Nobody checks agent readiness. That's the gap Hermes Clew fills.
The pain: agents fail silently on most web apps. The solution: scan the code and show developers what agents can and can't see. What changes: developers can fix agent-readiness issues before agents ever hit their app.
What it does
Hermes Clew is a read-only Agent Readiness Scanner built on the GitLab Duo Agent Platform. It includes both a custom public agent (agents/agent.yml) and a custom public flow (flows/flow.yml), registered in the GitLab Duo catalog.
You type "scan this project for agent readiness" in Duo Chat. The agent uses the read_file and read_files tools to scan the HTML, JSX, and TSX files in your repo, scores the project from 0 to 100 across six weighted categories, and produces a plain-English Agent Readiness Report.
The six categories: Semantic HTML (25pts), Form Accessibility (20pts), ARIA & Accessibility (15pts), Structured Data (15pts), Content in HTML (15pts), and Link & Navigation (10pts). Each maps to a specific capability agents need.
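As a rough illustration of how six weighted categories can roll up into a single 0-100 score, here is a minimal sketch. The weights come from the list above; the function name total_score, the idea of a per-category "pass ratio", and the rule that a missing category counts as zero are assumptions made for this example, not the actual Hermes Clew scoring engine.

```python
# Category weights as listed above; they sum to 100.
CATEGORY_WEIGHTS = {
    "Semantic HTML": 25,
    "Form Accessibility": 20,
    "ARIA & Accessibility": 15,
    "Structured Data": 15,
    "Content in HTML": 15,
    "Link & Navigation": 10,
}


def total_score(pass_ratios: dict[str, float]) -> int:
    """Combine per-category pass ratios (0.0 to 1.0) into a 0-100 score.

    Assumption for this sketch: a category missing from pass_ratios
    contributes zero points, and out-of-range ratios are clamped.
    """
    total = 0.0
    for category, weight in CATEGORY_WEIGHTS.items():
        ratio = min(max(pass_ratios.get(category, 0.0), 0.0), 1.0)
        total += weight * ratio
    return round(total)
```

With this shape, a project that passes every Form Accessibility check but nothing else would score 20, which makes the expected gain from any single fix easy to state in a report.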
It operates in two modes. Mode 1: a deterministic Python scanner runs automatically in the CI pipeline on every push, produces a JSON artifact (hermes_clew_scan_results.json), and you paste the results into Duo Chat where the agent reasons over the findings — catching false positives, assessing severity, and generating the report. Mode 2: you ask the agent to scan directly and it reads the files and applies the scoring rubric from its system prompt.
Every issue is described from the agent's perspective, not as a compliance violation. Every fix is ranked by impact with effort estimates and expected score improvement. The Confidence Notes section is honest about what the scan knows and what it's guessing. Awareness, not judgment.
How I built it
The architecture has three layers, though two share the same Claude instance. The GitLab Duo Agent, which is Anthropic Claude via GitLab's built-in integration, receives the request via Duo Chat and reads files using the read_file and read_files tools. A deterministic Python scan engine runs in the CI pipeline (triggered on every push), executing six category checkers using regex and pattern matching against source files. No LLM is involved at this layer, just facts. The raw JSON findings can then be pasted into Duo Chat, where the same Claude agent acts as the reasoning layer: it considers context, catches false positives, weighs severity, and generates the final report.
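To make the "regex and pattern matching" layer concrete, here is a minimal sketch of what one deterministic check could look like: flagging a div or span that carries a click handler. The pattern, the function name find_clickable_non_buttons, and the shape of each finding are hypothetical and written for this example; the real category checkers may be structured differently.

```python
import re

# Matches an opening <div> or <span> tag that carries a click handler,
# in either HTML (onclick) or JSX (onClick) spelling.
# Heuristic only: it works line by line and will miss tags whose earlier
# attributes contain a ">" character or that span several lines.
CLICKABLE_NON_BUTTON = re.compile(r"<(div|span)\b[^>]*\bonclick\s*=", re.IGNORECASE)


def find_clickable_non_buttons(source: str) -> list[dict]:
    """Return one finding per line that has a clickable div or span."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        match = CLICKABLE_NON_BUTTON.search(line)
        if match:
            findings.append(
                {
                    "line": line_number,
                    "tag": match.group(1).lower(),
                    "issue": "click handler on a non-button element",
                }
            )
    return findings
```

Because the check is plain pattern matching, the same input always produces the same findings, which is what lets the CI artifact serve as a stable base for the reasoning layer to work from.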
The CI pipeline has two stages: test (runs 85 pytest tests) and scan (runs the deterministic scanner against the included demo app and publishes the JSON artifact). The agent configuration lives in agents/agent.yml. The flow configuration lives in flows/flow.yml. Both are registered in the GitLab Duo catalog. The only Python dependency is pytest.
The included demo-app/ directory contains a sample grocery app (FreshCart) with intentionally mixed good and bad patterns — semantic HTML alongside div-soup, labeled forms alongside unlabeled ones — so the scanner has realistic material to evaluate.
This was an AI-assisted project. Code was developed with assistance from Claude, ChatGPT, and GitLab's IDE AI features. All output was human-reviewed, validated, and directed. All mistakes are mine.
Challenges I ran into
The biggest discovery came from my initial platform spike (documented in docs/SPIKE_RESULTS.md). I validated four capabilities before writing any scanner code: can the agent respond in Duo Chat, can it read files, can it execute Python, and is Claude the underlying model. The critical finding: Duo Chat agents can read files but cannot execute Python scripts — there's no shell or exec tool available. That meant my original plan of having the agent run the Python scanner directly inside the chat session was impossible. I redesigned into a hybrid approach: the deterministic scanner runs in CI on every push (trigger → action), and the Duo agent either reasons over the JSON artifact or reads files directly.
JSX and TSX parsing without a JavaScript AST parser was another challenge. The scanner uses regex and string matching, so React components like <Button> that render to semantic HTML at build time get flagged incorrectly in source. Spread props like {...props} might include ARIA attributes the scanner can't see. Rather than pretending the heuristic parser is perfect, the reasoning layer is explicitly tasked with catching these false positives and noting them in the Confidence Notes section.
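One way to hand these cases to the reasoning layer is to attach a confidence level to each finding based on what the source snippet contains. The sketch below is an assumption about how that could work, not the scanner's actual code: the function name confidence_for, the two patterns, and the note wording are all invented for illustration.

```python
import re

# {...props} style spread attributes, which may carry ARIA attributes.
SPREAD_PROPS = re.compile(r"\{\s*\.\.\.\s*\w+")
# Capitalized tag names, which JSX treats as custom components.
CUSTOM_COMPONENT = re.compile(r"<([A-Z][A-Za-z0-9]*)\b")


def confidence_for(snippet: str) -> tuple[str, list[str]]:
    """Label a flagged snippet 'high' or 'medium' confidence, with reasons."""
    notes = []
    if SPREAD_PROPS.search(snippet):
        notes.append("spread props may carry ARIA attributes not visible in source")
    component = CUSTOM_COMPONENT.search(snippet)
    if component:
        notes.append(f"<{component.group(1)}> may render semantic HTML at build time")
    # Any note means the regex result is a guess rather than a fact.
    return ("medium" if notes else "high", notes)
```

A finding on a plain <div onClick={go}> stays high confidence, while one on <Button {...props} /> drops to medium with two notes the agent can repeat in the Confidence Notes section.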
Accomplishments I'm proud of
85 tests passing in CI — covering all six category checkers, the file finder, the scoring engine, the report prompt builder, the scanner orchestrator, and an end-to-end integration test. The scanner is deterministic and reproducible.
The report genuinely helps. A developer who has never heard of ARIA or Schema.org can read a Hermes Clew report and know exactly what to fix, how long it'll take, and how many points they'll gain. Issues are told as stories from the agent's perspective, not as compliance violations.
The project practices what it preaches. The codebase includes an AGENTS.md file describing the project to any AI agent, uses semantic file names, follows single-responsibility principles, and produces machine-readable output. An agent readiness tool that isn't agent-ready would be embarrassing.
What I learned
The spike-first approach saved the entire project. Validating platform capabilities before committing to an architecture let me discover the Python execution limitation early and redesign before I was locked into a dead-end approach. Document your spikes.
I also learned that honesty about limitations builds more trust than false confidence. The Confidence Notes section — where Hermes Clew explicitly states whether findings are high-confidence deterministic results or medium-confidence heuristic guesses, and flags suspected false positives rather than silently adjusting scores — turned out to be one of the strongest design decisions.
What's next for Hermes Clew
External URL scanning against deployed apps using the rendered DOM instead of just source files. MR-triggered scanning that posts the Agent Readiness Report as a merge request comment so teams catch regressions before they ship. Historical score tracking across commits. NLWeb and MCP compatibility checks as the agentic web evolves beyond browsing into standardized protocols.
Hermes Clew is one tool in the Clew suite. The thread connecting them all: help developers see problems they can't easily see themselves.