Inspiration

We kept asking ourselves: why do AI assistants just talk at you? You ask ChatGPT to compare two credit cards and it gives you a wall of text. You still have to go find the cards, read the terms, copy the numbers, and do the comparison yourself.

What if your AI could actually see what you see, read what you read, and do the work for you — right there in your browser? Not a chatbot. An agent.

What it does

Aladin is an AI agent that lives in your browser. It doesn't just answer questions — it reads the page you're on, extracts structured data, performs multi-step tasks, creates files, and delivers results. All from a side panel chat.

What the agent can do:

  • See your screen — Reads the full accessibility tree of any webpage: every heading, link, button, form, table, and interactive element. It doesn't need you to copy-paste — it already knows what's on the page.
  • Extract and structure data — Ask it to pull all the companies from a job board, all the people from a Twitter feed, or all the products from a comparison page. It scrapes, structures, and organizes the data for you.
  • Create files — The agent generates spreadsheets, CSVs, and documents on the server and gives you a download link. "Make me an Excel file of everything on this page" just works.
  • Multi-step reasoning — The agent chains actions together: navigate pages, load more content, find APIs behind lazy-loaded data, paginate through results, and compile everything into a single deliverable.
  • Screenshot analysis — Snap a screenshot of anything — a paper document, a chart, a confusing UI — and the agent analyzes it visually using multimodal AI.
  • Real-time streaming — You see the agent think token-by-token. When it's working on a server-side task, a live spinner shows what it's doing so it never looks stuck.

Example: "Get me all the companies from this job board"

The agent sees the page has 367 companies but only 12 loaded. It finds the underlying API, paginates through all results, extracts name/domain/stage/headcount/industry for each one, generates a CSV with 367 rows, and hands you a download link — all from one message.
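The paginate-then-export flow in this example can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the `Company` and `Page` shapes, the `FetchPage` callback, and the offset/limit paging scheme are all assumptions, since the real job-board API is not described here.

```typescript
// Hypothetical record shape, based on the fields named in the example above.
interface Company {
  name: string;
  domain: string;
  stage: string;
  headcount: number;
  industry: string;
}

// Hypothetical page shape: one batch of items plus the total the API reports.
interface Page<T> {
  items: T[];
  total: number;
}

// Caller-supplied function that fetches one page; in practice it would wrap fetch().
type FetchPage<T> = (offset: number, limit: number) => Promise<Page<T>>;

// Keep requesting pages until every reported item is collected or a page comes back empty.
async function collectAll<T>(fetchPage: FetchPage<T>, limit = 50): Promise<T[]> {
  const all: T[] = [];
  while (true) {
    const page = await fetchPage(all.length, limit);
    all.push(...page.items);
    if (page.items.length === 0 || all.length >= page.total) break;
  }
  return all;
}

// Quote a field per RFC 4180 when it contains a comma, a quote, or a line break.
function csvField(value: string | number): string {
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build the CSV text: one header row, then one row per company.
function toCsv(rows: Company[]): string {
  const header = ["name", "domain", "stage", "headcount", "industry"] as const;
  const lines = rows.map((row) => header.map((key) => csvField(row[key])).join(","));
  return [header.join(","), ...lines].join("\r\n");
}
```

Injecting the page fetcher keeps the loop independent of any particular site, so the same sketch would apply whether the underlying API pages by offset, page number, or cursor (with a suitably adapted callback).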

How we built it

  • Chrome Extension (Manifest V3) — Service worker captures the full accessibility tree of any page and structures it for the AI. Content scripts inject visual indicators when the agent is actively working.
  • OpenClaw Agent Gateway — Orchestrates multi-step agent actions on the server: tool invocation, file generation, browser automation, and API discovery. Runs on AWS EC2.
  • Claude (Anthropic) via AWS Bedrock — Powers both text reasoning and multimodal vision analysis through the ConverseStream API.
  • Node.js/Express compatibility server — Bridges the extension's protocol with OpenClaw, handles Server-Sent Events for real-time streaming, manages chat persistence and file serving.
  • Auth0 — Secures the agent pipeline. The server validates Auth0 JWTs before allowing any interaction with the AI agent, and namespaces all conversations per authenticated user.

Challenges we ran into

The hardest problem was making the agent feel alive instead of stuck. When the agent performs server-side work (scraping, file creation, API calls), there's no natural feedback to the user. We built a streaming protocol that sends real-time step labels — "Fetching page 3 of 12...", "Generating spreadsheet..." — so the user always knows what's happening.
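One way to picture such a protocol is as a small set of typed events carried over Server-Sent Events. The sketch below is an assumption-laden illustration: the `AgentEvent` shapes and the `toSseFrame` and `parseSseFrames` helpers are hypothetical names, and the project's actual wire format may differ.

```typescript
// Hypothetical event shapes; the project's actual wire format is not documented here.
type AgentEvent =
  | { type: "token"; text: string } // one chunk of model output
  | { type: "step"; label: string } // server-side progress label
  | { type: "done" };

// Serialize one event as a Server-Sent Events frame.
// JSON.stringify escapes newlines, so the payload always fits on a single data: line.
function toSseFrame(event: AgentEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}

// Parse a buffer of complete frames back into events (client side).
function parseSseFrames(buffer: string): AgentEvent[] {
  return buffer
    .split("\n\n")
    .filter((frame) => frame.trim() !== "")
    .map((frame) => {
      const dataLine = frame.split("\n").find((line) => line.startsWith("data: "));
      if (!dataLine) throw new Error("frame has no data line");
      return JSON.parse(dataLine.slice("data: ".length)) as AgentEvent;
    });
}
```

Interleaving `step` frames with `token` frames on the same stream would let the side panel swap its spinner text as soon as a label arrives, without a second connection.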

Getting the accessibility tree extraction right was also critical. A naive innerText dump loses all structure. We extract a semantic tree with roles, labels, states, and hierarchy, which lets the agent understand interactive elements (buttons, forms, dropdowns), not just text.
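To illustrate the difference from a flat text dump, here is a minimal sketch of serializing such a tree for the model. The `AxNode` interface and `renderOutline` function are hypothetical simplifications; real accessibility nodes carry many more properties, and the extension's actual format is not shown here.

```typescript
// Hypothetical simplified node; real accessibility nodes carry many more properties.
interface AxNode {
  role: string;
  name?: string;
  states?: string[];
  children?: AxNode[];
}

// Render the tree as an indented outline, one node per line,
// so role, label, state, and nesting all survive serialization.
function renderOutline(node: AxNode, depth = 0): string[] {
  const label = node.name ? ` "${node.name}"` : "";
  const states = node.states && node.states.length > 0 ? ` [${node.states.join(", ")}]` : "";
  const lines = [`${"  ".repeat(depth)}${node.role}${label}${states}`];
  for (const child of node.children ?? []) {
    lines.push(...renderOutline(child, depth + 1));
  }
  return lines;
}
```

With this outline, a disabled submit button inside a form stays distinguishable from plain text that happens to say "Submit", which an innerText dump cannot express.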

Accomplishments that we're proud of

  • The agent genuinely understands page structure — it can find and interact with lazy-loaded content, discover hidden APIs, and extract data that isn't visible without scrolling
  • One-message-to-deliverable workflow: ask a question, get a file. No intermediate steps for the user
  • Real-time agent progress with token streaming and server-side action labels
  • Full multimodal support — text, screenshots, and file generation in a single conversation
  • Auth0 securing the entire agent pipeline so agent actions and conversation history are isolated per user

What we learned

  • Building an AI agent that acts is fundamentally different from building one that responds. The UX challenge isn't the AI — it's keeping the user informed about what's happening behind the scenes
  • Chrome's accessibility tree is an incredibly rich source of structured page data that most AI tools completely ignore
  • Server-Sent Events require careful buffering tuning (setNoDelay, flush intervals) to feel truly real-time rather than chunky
  • The gap between "AI that explains things" and "AI that does things for you" is where the real value lives

What's next for Aladin

  • Browser control — Let the agent click, scroll, and fill forms on the user's behalf, not just read pages
  • Workflow recording — Record a multi-step browser workflow once, then let the agent replay and adapt it
  • Scheduled agents — Set up recurring tasks: "Every Monday, check this page and email me a summary"
  • Multi-tab awareness — Let the agent work across multiple tabs simultaneously, comparing and cross-referencing
  • Plugin system — Let users define custom extraction templates for sites they use frequently

Built With

  • auth0