Inspiration
Modern browsers can do almost anything: payments, forms, research, bookings. Turning something like "compare these two stocks on Yahoo Finance" into reliable clicks and typed fields usually means brittle scripts or manual repetition. SCAI (ScreenAI) came from the idea that natural language should be enough: describe the outcome, watch the browser work, and stay in control when sites throw CAPTCHAs or need your logged-in session.
I wanted something closer to pair programming with the web: an assistant that plans, executes, and shows its work with live screenshots and a clear step-by-step narrative, not a black box.
What I Learned
- Orchestration beats one-shot prompts. Long web tasks are not a single LLM reply. They are a loop of observe, plan, act, and verify. Treating automation as a state machine with streaming updates made the product feel trustworthy.
- The browser is the hardest API. Sites change, layouts shift, and "click the blue button" fails silently. Combining DOM signals, screenshots, and verification is what turns demos into something you can rely on for real workflows.
- Real-time UX matters as much as the model. Users tolerate latency if they can see progress: a live panel, step completion, and honest status text beat a spinner every time.
- Product glue is real work. Auth (Supabase), billing (Stripe), usage limits, and deployment (Docker / EC2) are not flashy, but they are what makes a hackathon project feel like software people can actually use.
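The observe, plan, act, verify loop from the first point can be sketched as a small state machine that yields one event per transition, so a UI could stream progress as it happens. This is a minimal illustration, not SCAI's actual orchestrator: `run_task`, the `Phase` names, the event fields, and the `act` / `verify` callbacks are all hypothetical.

```python
from enum import Enum
from typing import Callable, Iterator


class Phase(str, Enum):
    OBSERVE = "observe"
    PLAN = "plan"
    ACT = "act"
    VERIFY = "verify"
    DONE = "done"
    FAILED = "failed"


def run_task(
    steps: list[str],
    act: Callable[[str], None],
    verify: Callable[[str], bool],
    max_attempts: int = 3,
) -> Iterator[dict]:
    """Yield one event per phase so a frontend can render progress live."""
    index, attempts, phase = 0, 0, Phase.OBSERVE
    while phase not in (Phase.DONE, Phase.FAILED):
        step = steps[index] if index < len(steps) else None
        yield {"phase": phase.value, "step": step, "attempt": attempts}
        if phase is Phase.OBSERVE:
            # Nothing left to do means the task is finished.
            phase = Phase.PLAN if step is not None else Phase.DONE
        elif phase is Phase.PLAN:
            attempts += 1
            phase = Phase.ACT
        elif phase is Phase.ACT:
            act(step)  # hypothetical: would drive the browser
            phase = Phase.VERIFY
        elif phase is Phase.VERIFY:
            if verify(step):  # hypothetical: would check the screen state
                index, attempts, phase = index + 1, 0, Phase.OBSERVE
            elif attempts < max_attempts:
                phase = Phase.OBSERVE  # look at the page again before retrying
            else:
                phase = Phase.FAILED
    # Terminal event so the consumer knows how the task ended.
    yield {"phase": phase.value, "step": None, "attempt": attempts}
```

Because the loop is a generator, the caller decides how to deliver events (log them, push them over a socket, or collect them in a test) without the loop knowing about transport.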
How I Built It
SCAI is a full-stack AI automation system with three cooperating layers:
- Next.js 14 + TypeScript + Tailwind: marketing site, chat-style app UI, settings, and pricing. The main experience (app/app/page.tsx) is built around streaming chat and automation state managed in hooks like use-chat.ts, with components for the live automation panel and step progress.
- FastAPI orchestrator (Python): runs the CodeAct-style loop, in which the model proposes executable logic, actions run in a controlled path, and vision-style checks help confirm the UI state after each step. Task lifecycle and streams are exposed over WebSockets / SSE so the frontend can render plans, actions, screenshots, and completion events in near real time.
- Chrome extension (Manifest V3): bridges the Chrome DevTools Protocol (CDP) so automation can drive a real browser session, including sites where the user is already signed in, while the server coordinates planning and verification.
Supporting services include AWS Bedrock (Claude) for planning and reasoning, Supabase for auth and persistence, Stripe for subscriptions or credits-style usage, optional Redis, and Docker-based local and production deployments.
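To make the streaming side concrete, here is a rough sketch of how task events could be serialized in the Server-Sent Events wire format (an `event:` line, a `data:` line, then a blank line). The helper names `sse_event` and `stream_task`, the event names, and the payload fields are assumptions for illustration, not SCAI's real schema.

```python
import json
from typing import Iterable, Iterator


def sse_event(event: str, payload: dict) -> str:
    """Format one Server-Sent Events message with a JSON body."""
    # json.dumps escapes newlines, so the body always fits on one data line.
    data = json.dumps(payload, separators=(",", ":"))
    return f"event: {event}\ndata: {data}\n\n"


def stream_task(events: Iterable[tuple[str, dict]]) -> Iterator[str]:
    """Turn (name, payload) pairs into SSE messages, ending with a completion event."""
    count = 0
    for name, payload in events:
        count += 1
        yield sse_event(name, payload)
    yield sse_event("complete", {"events": count})


if __name__ == "__main__":
    demo = [
        ("plan", {"steps": 2}),
        ("action", {"index": 1, "kind": "click"}),
    ]
    for message in stream_task(demo):
        print(message, end="")
```

A generator like `stream_task` is the kind of object a streaming HTTP response can iterate over, served with the `text/event-stream` content type.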
A compact view of the loop (for intuition):
$$ \text{success} \;\Rightarrow\; \bigwedge_{i=1}^{n} \text{verify}(\text{step}_i) $$
In plain terms: I only call a long task "done" when each meaningful step has been executed and checked against what appears on screen, not when the model sounds confident.
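The same rule reads naturally as a conjunction in code. The sketch below is an assumed shape, not the project's data model: `StepResult` and `task_succeeded` are hypothetical names, and treating an empty task as not successful is an extra guard I added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepResult:
    name: str
    executed: bool
    verified: bool  # set from an on-screen check, not from model confidence


def task_succeeded(results: list[StepResult]) -> bool:
    """A task counts as done only if every step ran and passed verification."""
    # Assumption: a task with no recorded steps is not reported as a success.
    return bool(results) and all(r.executed and r.verified for r in results)


if __name__ == "__main__":
    results = [
        StepResult("open page", executed=True, verified=True),
        StepResult("submit form", executed=True, verified=False),
    ]
    print(task_succeeded(results))
```

A single unverified step is enough to keep the whole task out of the "done" state, which is the point of the formula.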
Inline framing: automating \( n \) sites does not multiply effort by \( n \) if verification catches failures early. The extra cost is roughly \( \sum_i \mathbb{1}[\text{retry}_i] \), the number of steps that needed a retry, which is why retries and observability were first-class for me.
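As a rough illustration of that retry count, the sketch below runs each step up to a fixed number of attempts and tallies how many steps needed more than one. `run_with_retries` and the `attempt` callback are hypothetical names; a real version would also record why each attempt failed.

```python
from typing import Callable


def run_with_retries(
    steps: list[str],
    attempt: Callable[[str, int], bool],
    max_attempts: int = 3,
) -> tuple[bool, int]:
    """Return (task_ok, retried_steps), where retried_steps is the indicator sum."""
    retried_steps = 0
    for step in steps:
        succeeded = False
        attempts_used = 0
        while attempts_used < max_attempts and not succeeded:
            attempts_used += 1
            # Hypothetical callback: act on the page, then verify the result.
            succeeded = attempt(step, attempts_used)
        if attempts_used > 1:
            retried_steps += 1  # this step contributes 1 to the sum
        if not succeeded:
            return False, retried_steps  # stop early instead of compounding errors
    return True, retried_steps
```

Stopping at the first step that exhausts its attempts is what keeps a failure cheap: later steps never run against a page that is already in the wrong state.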
Challenges I Faced
- Reliability vs. flexibility. Letting the model generate code unlocks branching logic, but it also demands sandboxing, guardrails, and robust error handling so one bad step does not spiral.
- Latency and perceived performance. Streaming helps, but coordinating extension, server, model, and vision checks adds delays. I invested in incremental UI feedback so the system always feels "alive."
- Auth-sensitive automation. Many valuable tasks require cookies and sessions. That pushed me toward extension-driven control rather than pretending a generic remote browser can replace the user's identity context.
- Shipping, not only demoing. Wiring Stripe, Supabase, and deployment alongside core AI behavior meant balancing scope with polish under hackathon time pressure.
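One example of the guardrails mentioned in the first point is a static check on model-generated code before it runs. The sketch below uses Python's standard `ast` module to reject imports, a few risky built-in calls, and dunder attribute access. The blocked list and the `find_violations` helper are assumptions for illustration, and a check like this complements real sandboxing rather than replacing it.

```python
import ast

# Assumed deny-list for illustration; a real policy would be project-specific.
BLOCKED_CALLS = {"eval", "exec", "open", "compile", "__import__"}


def find_violations(source: str) -> list[str]:
    """Return human-readable reasons to refuse running generated code."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append(f"line {node.lineno}: imports are not allowed")
        elif (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in BLOCKED_CALLS
        ):
            violations.append(
                f"line {node.lineno}: call to {node.func.id}() is not allowed"
            )
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            violations.append(
                f"line {node.lineno}: dunder attribute access is not allowed"
            )
    return violations


if __name__ == "__main__":
    # click() stands in for a hypothetical browser action helper.
    print(find_violations("click('#compare')"))
    print(find_violations("import os\nopen('/etc/passwd')"))
```

Returning reasons instead of a bare boolean lets the orchestrator feed the rejection back to the model, so a blocked step can be replanned rather than silently dropped.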
Links
- Live site: sc-ai.app
Built With
- amazon-web-services
- aws-ec2
- chrome
- chrome-extension-(manifest-v3)
- claude
- css3
- deepgram-(voice-input)
- devtools
- docker
- docker-compose
- events
- fastapi
- html
- javascript
- nginx
- node.js
- protocol
- python
- python-3.11+
- redis
- server-sent
- supabase-(postgresql
- tailwind
- textract
- typescript
- websockets
