Metis AI

Inspiration

Working in the AI SaaS space and seeing dozens of new tools released every day is exciting, but it is also overwhelming. Many platforms are difficult to configure, requiring hours of combing through tutorial videos and onboarding documentation. We wanted an easy-to-use interface that lets us get up and running on any platform within seconds.

What it does

Metis AI is an advanced multimodal browser-use agent that can perform actions on any SaaS application from natural-language instructions. Metis AI can index how-to guides, interactive documentation, and videos for a given software platform. These are processed into a series of custom instructions (text and images) that a browser-use agent follows to complete a given task on behalf of the user, or to demonstrate how to accomplish it.

How we built it

We scraped and curated documentation for the Stytch platform using a toolkit called ScribeAI. These guides contain image and text content generated from screen recordings of actions in the platform. We used ApertureDB as a vector database to store the instructions and fetch them when an action has to be performed. The relevant docs are passed to Gemini along with the user's intent, and Gemini produces precise instructions for a browser-use agent, which also uses Google Gemini to complete the task autonomously. Using in-context learning, we were able to augment the web agent's capabilities and knowledge for specific customer requests.
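The retrieve-then-instruct flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not our actual code: the toy bag-of-words `embed` function stands in for a real image-text embedding model, the in-memory list stands in for ApertureDB, and the `Guide` class, the sample guides, and the prompt wording are all invented for the example. No Gemini or browser-use calls are made; the sketch stops at the prompt that would be sent to the model.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Guide:
    """One how-to guide: a title plus ordered text steps (hypothetical shape)."""
    title: str
    steps: list[str]


# Invented sample guides; a real index would hold the curated platform docs.
GUIDES = [
    Guide("Create an API key", ["Open the API keys page", "Click Create new key"]),
    Guide(
        "Invite a team member",
        ["Open the Members page", "Click Invite", "Enter the email address"],
    ),
]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, guides: list[Guide], k: int = 1) -> list[Guide]:
    """Return the k guides most similar to the query (stand-in for a vector DB lookup)."""
    q = embed(query)

    def score(guide: Guide) -> float:
        return cosine(q, embed(guide.title + " " + " ".join(guide.steps)))

    return sorted(guides, key=score, reverse=True)[:k]


def build_prompt(intent: str, guides: list[Guide]) -> str:
    """Combine the user's intent with retrieved guides into one in-context prompt."""
    lines = [f"User intent: {intent}", "Relevant documentation:"]
    for guide in guides:
        lines.append(f"## {guide.title}")
        lines.extend(f"{i}. {step}" for i, step in enumerate(guide.steps, start=1))
    lines.append("Write precise browser instructions that accomplish the intent.")
    return "\n".join(lines)
```

Calling `build_prompt(intent, retrieve(intent, GUIDES))` yields a single string in which the retrieved steps sit next to the user's request, which is the in-context learning setup described above.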

Challenges we ran into

We first tried Claude computer use (https://github.com/GiorgosAlexakis/metis) as an extension to goose. This solution was able to execute the actions, but it was slow. Because of differences in system architecture, we also had to use a Linux VM to run the Claude computer use tools.

Accomplishments that we're proud of

Browser agents are a very nascent technology and in many cases cannot perform tasks reliably. We were able to increase the performance and skills of a web agent by augmenting it with prior knowledge of the platform. We were also able to generate instructions from tutorial videos for a multimodal agent, perform multiple steps from a single customer query, and track agent accuracy using an evaluator component and W&B Weave.
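A generator/evaluator loop of the kind mentioned above might look roughly like the sketch below. It is a simplified illustration, not our implementation: `Verdict`, `run_with_evaluator`, and `accuracy` are hypothetical names, the agent and judge are trivial stubs where the real system would call the browser agent and an LLM-as-a-judge, and plain Python values are returned where the real system logged to W&B Weave.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """Evaluator output: whether the attempt passed, plus feedback for a retry."""
    passed: bool
    feedback: str


def run_with_evaluator(
    task: str,
    generate: Callable[[str, str], str],
    evaluate: Callable[[str, str], Verdict],
    max_attempts: int = 3,
) -> tuple[bool, int]:
    """Run the agent, judge the result, and retry with the judge's feedback.

    Returns (succeeded, attempts_used).
    """
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        result = generate(task, feedback)
        verdict = evaluate(task, result)
        if verdict.passed:
            return True, attempt
        feedback = verdict.feedback
    return False, max_attempts


def accuracy(outcomes: list[bool]) -> float:
    """Fraction of tasks the evaluator marked as passed."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Stub agent and judge, purely for illustration.
def stub_agent(task: str, feedback: str) -> str:
    # Succeeds only once it has received feedback from a failed attempt.
    return "done" if feedback else "stuck"


def stub_judge(task: str, result: str) -> Verdict:
    if result == "done":
        return Verdict(passed=True, feedback="")
    return Verdict(passed=False, feedback="Follow the documented steps.")
```

With these stubs, `run_with_evaluator("invite a member", stub_agent, stub_judge)` fails on the first attempt and passes on the second, and the per-task booleans can be fed to `accuracy` to get one number to track across runs.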

What we learned

'browser use' is built on Playwright, which we assume means the agent can parse the page's HTML structure; this seems to improve accuracy and performance compared to Claude computer use. RAG improves performance in multimodal agents. Computer-use tools are generally harder to debug.
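To illustrate why access to the HTML structure might help, the sketch below lists a page's interactive elements with their ids and labels, something a screenshot-only agent has to infer from pixels. It uses only Python's standard `html.parser` and is our own illustration; it does not show how browser-use or Playwright work internally, and `InteractiveElementParser` and `list_interactive_elements` are hypothetical names.

```python
from html.parser import HTMLParser


class InteractiveElementParser(HTMLParser):
    """Collects clickable or typeable elements with an id and a readable label."""

    INTERACTIVE = {"a", "button", "input", "select", "textarea"}
    TEXT_LABELLED = {"a", "button"}  # elements whose label is their inner text

    def __init__(self) -> None:
        super().__init__()
        self.elements: list[dict] = []
        self._open: dict | None = None  # element currently collecting inner text

    def handle_starttag(self, tag, attrs):
        if tag not in self.INTERACTIVE:
            return
        attr_map = dict(attrs)
        element = {
            "tag": tag,
            "id": attr_map.get("id"),
            "label": attr_map.get("aria-label") or attr_map.get("placeholder") or "",
        }
        self.elements.append(element)
        if tag in self.TEXT_LABELLED:
            self._open = element

    def handle_data(self, data):
        # Use inner text as the label only if no attribute already supplied one.
        if self._open is not None and not self._open["label"]:
            self._open["label"] = data.strip()

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open["tag"]:
            self._open = None


def list_interactive_elements(html: str) -> list[dict]:
    """Return the interactive elements found in an HTML string, in document order."""
    parser = InteractiveElementParser()
    parser.feed(html)
    return parser.elements
```

Given a structured list like this, an instruction such as "click Invite" can be matched to a specific element rather than to screen coordinates, which is one plausible reason for the accuracy difference we observed.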

What's next for Metis AI

Our vision is to build a sophisticated copilot for SaaS platforms that works with humans in two ways: by performing actions on behalf of the user through a chat interface, and by highlighting objects on screen along with instructions to help the user perform a task.

We are also interested in performing automated exploration of platform capabilities to offload work from businesses.

Built With

aperturedb
browser-use
gemini
open-clip
together
weave

Submitted to

Multimodal AI Agents
- Winner Best use of Stytch
- Winner Best use of Weave (W&B)
- Winner Fifth Place Overall

Created by

Scoped out the project and integrated the tool Scribe AI to transcribe screen recordings into parsed documentation. Included Together's evaluator from their looping agent workflow for LLM-as-a-judge evaluation.

Sravan Jayanthi
Discussed with the team and made a prototype demo with the Google genapi SDK that could read from PDFs and use function calling to call a browser-use agent to complete tasks on a SaaS platform with assistance from PDF docs.

Arash Joobandi
I integrated the ApertureDB vector database and generated a script for retrieving the most relevant PDF instruction document from a text query, using vector DB retrieval and the SigLIP image-text model.

Yuvanshu Agarwal
Led the idea. Explored Claude as a computer-use agent, helped design the backend system, and proposed using generator/feedback components to improve observability.

Giorgos Alexakis
Software Engineer @connectly.ai

Updates

Giorgos Alexakis posted an update — Feb 17, 2025 12:25 AM EST

Metis won 5th place and two sponsor prizes!

Sravan Jayanthi started this project — Feb 16, 2025 05:14 PM EST
