Inspiration
Modern teams ship faster than ever, but quality assurance has not kept up. Most QA workflows still rely on brittle, script-based tests that break whenever a UI changes, forcing engineers to constantly rewrite test logic. During the DevHouse SF sprint, we wanted to explore whether agentic AI could replace rigid scripts with reasoning-driven automation—testing applications the way a human would, but at machine speed.
AutoPilot QA was inspired by this gap: why can AI reason about complex problems but still struggle to click through a website like a human tester?
What it does
AutoPilot QA is an autonomous AI system that performs browser-based testing using natural language instructions.
Instead of writing test scripts, users describe a goal such as:
“Verify that a user can log in and reach the dashboard.”
The system then:
Plans a test strategy using an LLM
Executes the plan in a real browser environment
Observes UI behavior and captures screenshots
Generates a structured test report with results
This makes QA faster, more resilient to UI changes, and accessible to non-technical users.
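The plan → execute → observe → report workflow above can be sketched in a few lines. This is a minimal illustration, not the project's actual code; the names (`run_test`, `StepResult`, `planner`, `executor`) are assumptions introduced for the example.

```python
# Illustrative sketch of the plan -> execute -> observe -> report loop.
# All names here are hypothetical, not AutoPilot QA's real API.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StepResult:
    action: str
    passed: bool
    screenshot: Optional[str] = None  # path to a captured screenshot, if any

@dataclass
class Report:
    goal: str
    results: list = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # The whole test passes only if every step passed
        return all(r.passed for r in self.results)

def run_test(goal, planner, executor) -> Report:
    """Plan with an LLM, run each step in a browser, collect structured results."""
    report = Report(goal=goal)
    for step in planner(goal):                 # LLM turns the goal into steps
        report.results.append(executor(step))  # browser executes each step
    return report
```

A stubbed planner and executor are enough to exercise the loop, which is also how the reasoning layer can be tested without a live browser.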
How we built it
We designed AutoPilot QA as an agent-based AI system with clear separation between reasoning and execution.
Architecture overview: User Goal → LLM Planner → Browser Executor → Report Generator
Core components:
LLM Planner: Converts natural language goals into step-by-step QA plans
Browser Executor: Runs those steps in a live browser (Browserbase-ready)
Observation Layer: Captures screenshots and execution traces
Report Generator: Produces structured, human-readable test results
Dashboard: A React UI to trigger tests and visualize outputs
The backend is built with FastAPI, while the frontend uses React. The system is provider-agnostic and can run with OpenAI models or fully local LLMs.
Challenges we ran into
One of the biggest challenges was designing a stable agent loop that could reason, act, and report without becoming overly complex. Another challenge was balancing realism and reliability—browser automation is inherently noisy, and the system needed to gracefully handle unexpected UI states.
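One common way to absorb that noise is to wrap each browser action in a retry with backoff, so a transient failure (an element not yet rendered, a slow page load) does not fail the whole test. The sketch below is an assumption about how this could look, not the project's implementation.

```python
# Illustrative retry wrapper for flaky browser actions; the names and
# error handling are assumptions, not AutoPilot QA's actual code.
import time

def with_retries(action, attempts=3, delay=0.5):
    """Run a browser action, retrying transient failures with linear backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except RuntimeError as exc:  # e.g. element not yet rendered
            last_error = exc
            time.sleep(delay * (attempt + 1))  # wait longer each retry
    raise last_error  # give up after the final attempt
```

Only the executor needs this wrapper; the planner stays deterministic, which keeps the agent loop simple to reason about.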
We also had to design the architecture so that sensitive credentials (like API keys) were never hardcoded, keeping the system production-ready even in a hackathon environment.
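In practice that means reading secrets from the environment at startup and failing fast if they are missing. A minimal sketch, assuming an `OPENAI_API_KEY` variable (the actual variable names the project uses may differ):

```python
# Sketch of env-based configuration so API keys never live in source.
# The variable name is an illustrative assumption.
import os

def load_api_key() -> str:
    """Read the LLM API key from the environment; fail fast if unset."""
    key = os.environ.get("OPENAI_API_KEY")  # set via the shell or a .env file
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key
```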
Accomplishments that we're proud of
Built a fully functional agentic AI system in one week
Replaced brittle QA scripts with reasoning-based automation
Created a real browser testing workflow, not just a mock demo
Delivered a demo-ready, investor-grade prototype
Designed the system to work with both cloud and local LLMs
What we learned
This project reinforced that agent design matters more than model size. Clear planning, modular execution, and structured feedback loops can make even lightweight models powerful. We also learned that practical AI systems need strong engineering fundamentals—error handling, observability, and clean interfaces matter just as much as the model itself.