Inspiration
REAL challenged us to develop browser agents that are fast and accurate on real-life tasks. Latency was a big bottleneck, so we needed to figure out which vision-language models could compete on speed without hallucinating. I chose TwinBrowse as the name since the Mk.1 prototype was built on Google's Gemini 3 Pro.
What it does
TwinBrowse is a web agent that completes web tasks by analyzing screenshots and executing actions such as clicks and scrolls to handle workflows across multiple web platforms (GoMail, GoCalendar, NetworkIn, Marrisuite). It runs against the list of tasks available in the agisdk repository. You can also create your own task lists using the corresponding REAL-developed webpages.
How I built it
The agent 'sees' the screen through screenshots. It captures JPEG screenshots using the Chrome DevTools Protocol, converts them to base64 data URLs, and keeps the last 4 screenshots in context, pruning older images to manage token usage. The screenshots are then sent to the vision-language model via OpenRouter. The model analyzes the visual state and generates tool calls (click, type, scroll). Tool execution is based on coordinates that are scaled from a 0-1000 range to the actual viewport (1080p) dimensions. If you set Headless to False, you can watch the tools execute via Playwright.
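As a rough illustration of two of the steps above, here is a minimal Python sketch of the coordinate scaling and the screenshot pruning. It is not the project's actual code: the names `scale_to_viewport`, `to_data_url`, and `ScreenshotHistory` are hypothetical, and it assumes that "1080p" means a 1920x1080 viewport and that the model's coordinates are plain numbers on a 0-1000 grid.

```python
import base64
from collections import deque

# Assumptions for this sketch: a 1920x1080 viewport and a 0-1000 model grid.
VIEWPORT_WIDTH = 1920
VIEWPORT_HEIGHT = 1080
GRID_MAX = 1000
MAX_SCREENSHOTS = 4  # the write-up keeps the last 4 screenshots in context


def scale_to_viewport(x, y, width=VIEWPORT_WIDTH, height=VIEWPORT_HEIGHT):
    """Map model coordinates on the 0-1000 grid to pixel coordinates."""
    # Clamp first so an out-of-range prediction cannot land off screen.
    x = min(max(x, 0), GRID_MAX)
    y = min(max(y, 0), GRID_MAX)
    # Keep the result inside the last valid pixel index.
    px = min(round(x / GRID_MAX * width), width - 1)
    py = min(round(y / GRID_MAX * height), height - 1)
    return px, py


def to_data_url(jpeg_bytes):
    """Encode raw JPEG bytes as a base64 data URL."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return "data:image/jpeg;base64," + encoded


class ScreenshotHistory:
    """Keep only the most recent screenshots to limit token usage."""

    def __init__(self, max_items=MAX_SCREENSHOTS):
        # A deque with maxlen drops the oldest entry automatically.
        self._items = deque(maxlen=max_items)

    def add(self, jpeg_bytes):
        self._items.append(to_data_url(jpeg_bytes))

    def as_list(self):
        return list(self._items)
```

In a setup like this, the pixel pair returned by `scale_to_viewport` would be handed to the browser automation layer for the click, and `as_list()` would supply the image URLs for the next model request.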
Challenges I ran into
The biggest challenge I faced was handling dropdown options, because when native dropdown elements are open, they don't appear in the screenshots the agent takes. Apart from that, latency and the text-based tool extraction were hard to understand. I had to write debug statements during specific tasks to see what was going on behind the scenes. I also learnt that the instruct variant of Qwen beats larger models at certain tasks. Another aspect I had to consider was giving the agent navigation guidelines and location awareness through prompt instructions.
What's next for TwinBrowse
Adding retry logic for when actions fail, and developing a system that lets me see what the agent is thinking.
Built With
- openrouter
- playwright
- python
- qwen