Inspiration
Working in the AI SaaS space and seeing dozens of new tools released every day is exciting, but it is also overwhelming. Many platforms are difficult to configure, requiring hours of combing through tutorial videos and onboarding documentation. We wanted an easy-to-use interface that lets us get up and running on any platform within seconds.
What it does
Metis AI is a multimodal browser use agent that can perform actions on any SaaS application from natural language instructions. Metis AI can index how-to guides, interactive documentation, and videos for a given software platform. These are processed into a series of custom instructions (text and images) that a browser use agent follows to complete a given task on behalf of the user, or to demonstrate how to accomplish it.
How we built it
We scraped and curated documentation for the Stytch platform using a toolkit called ScribeAI. These guides contain image and text content generated from screen recordings of actions in the platform. We used ApertureDB as a vector database to store the instructions and fetch them when an action has to be performed. The relevant docs are passed to Gemini along with the user's intent, and Gemini produces precise instructions for a browser use agent, which itself uses Google Gemini to complete the task autonomously. Using in-context learning, we were able to augment the capabilities and knowledge of the web agent for specific customer requests.
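The retrieve-then-prompt flow described above can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the guide entries are made up, the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for ApertureDB, and `retrieve` and `build_prompt` are hypothetical helper names. The call to Gemini is left out entirely.

```python
import math
import re
from collections import Counter

# Made-up guide entries for illustration; the real guides were scraped documentation.
GUIDES = [
    {"title": "Create a new project",
     "steps": "Open the dashboard, click New project, enter a name, click Create."},
    {"title": "Invite a team member",
     "steps": "Open Settings, choose Team members, click Invite, enter the email address."},
    {"title": "Rotate an API key",
     "steps": "Open API keys, select the key, click Rotate, confirm."},
]


def embed(text):
    """Toy bag-of-words vector; an assumption standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, guides, k=1):
    """Return the k guides most similar to the query (in-memory stand-in for a vector DB)."""
    q = embed(query)
    ranked = sorted(
        guides,
        key=lambda g: cosine(q, embed(g["title"] + " " + g["steps"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(user_intent, docs):
    """Combine retrieved docs and user intent into one prompt for the instruction model."""
    context = "\n".join(f"- {d['title']}: {d['steps']}" for d in docs)
    return (
        "You are guiding a browser agent.\n"
        f"Relevant documentation:\n{context}\n"
        f"User request: {user_intent}\n"
        "Write precise step-by-step instructions for the agent."
    )


if __name__ == "__main__":
    docs = retrieve("How do I invite a team member?", GUIDES, k=1)
    print(build_prompt("Invite bob@example.com to my workspace", docs))
```

The point of the sketch is the shape of the pipeline: the user's request selects a small amount of platform-specific context, and only that context is placed in the prompt.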
Challenges we ran into
We tried Claude computer use as an extension to Goose: https://github.com/GiorgosAlexakis/metis. This solution was able to execute the actions, but it was slow. Because of differences in system architecture, we had to use a Linux VM to run the Claude computer use tools.
Accomplishments that we're proud of
Browser agents are a very nascent technology and in many cases cannot perform tasks reliably. We were able to increase the performance and skills of a web agent by augmenting it with prior knowledge of the platform. We were able to generate instructions from tutorial videos for a multimodal agent. The agent can perform multiple steps from a single customer query. We track agent accuracy using an evaluator component and W&B Weave.
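As a rough idea of what an accuracy evaluator can look like, here is a minimal sketch. It is an assumption for illustration, not our actual evaluator: the run records, the `expected_markers` field, and the `evaluate` and `accuracy` helpers are all hypothetical, and logging to W&B Weave is omitted.

```python
def evaluate(run):
    """Hypothetical check: a run passes if the agent's final report mentions every expected marker."""
    report = run["agent_report"].lower()
    return all(marker.lower() in report for marker in run["expected_markers"])


def accuracy(runs):
    """Fraction of runs that pass the evaluator; 0.0 when there are no runs."""
    if not runs:
        return 0.0
    return sum(1 for run in runs if evaluate(run)) / len(runs)


if __name__ == "__main__":
    # Made-up example runs.
    runs = [
        {"task": "Invite a team member",
         "agent_report": "Invitation sent to bob@example.com",
         "expected_markers": ["invitation sent", "bob@example.com"]},
        {"task": "Rotate an API key",
         "agent_report": "Could not find the API keys page",
         "expected_markers": ["key rotated"]},
    ]
    print(f"accuracy: {accuracy(runs):.2f}")
```

A string-match check like this is crude; it only shows where a per-run pass/fail judgment feeds into an overall accuracy number.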
What we learned
Browser Use is built on Playwright, which we assume means the agent can parse the page's HTML structure. This seems to improve accuracy and performance compared to Claude computer use. RAG improves performance in multimodal agents. Computer use tools are generally harder to debug.
What's next for Metis AI
Our vision is to build a sophisticated copilot for SaaS platforms that works with humans by performing actions on behalf of the user through a chat interface and by highlighting objects on screen, along with instructions, to help the user perform a task.
We are also interested in performing automated exploration of platform capabilities to offload work from businesses.
