Inspiration
Our journey began with a clear vision: to build an intelligent system that could finally automate the tedious and complex process of web data extraction. We were inspired by the need to help data analysts, researchers, and business intelligence teams who spend countless hours manually finding and cleaning data from the web. Our goal was to create a multi-agent system that could automatically discover relevant pages, extract the necessary information, and deliver it as a clean, structured, and integrated dataset.
What it does
Extractra is an intelligent web data extraction tool. Given a target website URL and specific user-defined requirements, it can automatically discover relevant subpages and extract structured data from them in parallel.
How we built it
The project was brought to life with a modern technology stack and a clear architectural plan. The frontend was built with Next.js and TypeScript, while the backend uses Python and FastAPI. The backend workflow is orchestrated with the Google Agent Development Kit (ADK), which we used to design the multi-agent system.
Challenges we ran into
1.Identifying relevant pages on a large website without crawling all of its pages.
* Solution: We designed the PageDiscoveryAgent to perform a smart, two-step process. It first fetches links and then uses the LLM Service to evaluate the relevance of each page before committing to a full content extraction, saving significant time and resources.
2.The slow speed of extracting data from dozens or hundreds of pages sequentially.
* Solution: Performance was a major hurdle. We solved this by implementing the ParallelAgent (parallel_content_extraction_agent) within the ADK framework. This allowed for concurrent processing of multiple web pages, which was a key factor in making the system efficient.
3.Orchestrating the complex flow of information between agents and external services like Crawl4AI and the LLM.
* Solution: This complexity was managed by using the Google Agent Development Kit (ADK). Its structured approach with concepts like SequentialAgent and ParallelAgent gave us the scaffolding we needed to build and manage the multi-agent workflow effectively.
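The three solutions above can be sketched together as a small pipeline: filter links for relevance first, then extract only the surviving pages concurrently. This is a minimal illustration using plain asyncio, not the project's actual ADK code. `Page`, `is_relevant`, `extract`, and `max_concurrency` are hypothetical names, and the two helper coroutines are stand-ins for the LLM Service and Crawl4AI calls.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    title: str


async def is_relevant(page: Page, requirement: str) -> bool:
    # Hypothetical stand-in for the LLM Service relevance check.
    # A keyword match replaces the real model call in this sketch.
    await asyncio.sleep(0)
    return requirement.lower() in page.title.lower()


async def extract(page: Page) -> dict:
    # Hypothetical stand-in for crawling a page and extracting structured data.
    await asyncio.sleep(0.01)
    return {"url": page.url, "title": page.title}


async def discover(links: list[Page], requirement: str) -> list[Page]:
    # Step 1: score every link cheaply before any full extraction.
    verdicts = await asyncio.gather(*(is_relevant(p, requirement) for p in links))
    return [page for page, keep in zip(links, verdicts) if keep]


async def extract_all(pages: list[Page], max_concurrency: int = 5) -> list[dict]:
    # Step 2: extract the relevant pages concurrently, with a cap on
    # how many extractions are in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page: Page) -> dict:
        async with semaphore:
            return await extract(page)

    return await asyncio.gather(*(bounded(p) for p in pages))


async def run_pipeline(links: list[Page], requirement: str) -> list[dict]:
    # Sequential composition: discovery runs first, then parallel extraction.
    relevant = await discover(links, requirement)
    return await extract_all(relevant)


if __name__ == "__main__":
    sample = [
        Page("https://example.com/pricing", "Pricing plans"),
        Page("https://example.com/about", "About us"),
    ]
    print(asyncio.run(run_pipeline(sample, "pricing")))
```

The concurrency cap is an assumption added for the sketch; it is one common way to keep parallel extraction from overwhelming a target site or an LLM quota.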
Accomplishments that we're proud of
1.Successfully built an intelligent multi-agent automation system: Our original vision was to develop an intelligent system capable of automatically discovering, extracting, and integrating web data. By leveraging the Google Agent Development Kit (ADK) framework, we successfully transformed this vision into reality by breaking down the complex process into an automated workflow coordinated by multiple specialized agents.
2.Overcame performance bottlenecks in large-scale data extraction:
To extract data from large numbers of pages quickly, we designed and implemented the ParallelContentExtractionAgent. It fetches and extracts content from multiple webpages concurrently, which significantly reduces overall processing time. This is one of our core achievements, and it reflects practical optimizations driven by real-world constraints such as token consumption.
3.Adopted a clear and modular technical architecture:
We take pride in the clean architecture we designed for the system. By separating concerns, we assigned webpage crawling and content transformation to the dedicated Crawl4AI component, while all intelligent analysis and data extraction tasks are handled by a separate LLM Service. This separation of crawling and analysis responsibilities enhances system robustness and makes the system easier to maintain.
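One way to picture the separation described in item 3 is as two narrow interfaces that the pipeline depends on, so either side can be swapped or tested in isolation. The sketch below is illustrative only: `Crawler`, `LLMService`, `ExtractionPipeline`, and the fake implementations are hypothetical names, not the project's classes or the Crawl4AI API.

```python
from dataclasses import dataclass
from typing import Protocol


class Crawler(Protocol):
    # Crawling side: fetch a page and return it as Markdown.
    def to_markdown(self, url: str) -> str: ...


class LLMService(Protocol):
    # Analysis side: pull the requested fields out of Markdown text.
    def extract(self, markdown: str, fields: list[str]) -> dict[str, str]: ...


@dataclass
class ExtractionPipeline:
    crawler: Crawler
    llm: LLMService

    def run(self, url: str, fields: list[str]) -> dict[str, str]:
        # The pipeline only knows the two interfaces, not their internals.
        markdown = self.crawler.to_markdown(url)
        return self.llm.extract(markdown, fields)


class FakeCrawler:
    # Test double that returns canned Markdown instead of fetching a page.
    def to_markdown(self, url: str) -> str:
        return f"# Page at {url}\nprice: 10 USD\nname: Widget"


class FakeLLM:
    # Test double that parses "key: value" lines instead of calling a model.
    def extract(self, markdown: str, fields: list[str]) -> dict[str, str]:
        found = {}
        for line in markdown.splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip() in fields:
                found[key.strip()] = value.strip()
        return found


if __name__ == "__main__":
    pipeline = ExtractionPipeline(FakeCrawler(), FakeLLM())
    print(pipeline.run("https://example.com/item", ["price"]))
```

Because the pipeline depends only on the two protocols, the fakes can stand in for the real crawler and model during tests.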
What we learned
Building Extractra was a profound learning experience, primarily in the power of a modular, multi-agent architecture. We learned that by breaking down a complex task into specialized roles, we could build a more robust and efficient system. Instead of building everything from scratch, we learned to leverage specialized tools. We used Crawl4AI exclusively for its powerful web crawling and HTML-to-Markdown conversion capabilities. For all intelligent tasks—like analyzing page structure, evaluating content relevance, and extracting structured data—we relied on a dedicated LLM Service.
This approach of orchestrating specialized agents and services, managed within the Google Agent Development Kit (ADK) framework, was the most critical lesson we learned. It showed us how to build a scalable, intelligent system where each component does one thing exceptionally well.
What's next for Extractra
1.Establishing a comprehensive user feedback and iteration mechanism.
2.Support for more output formats and optimized extraction efficiency.
3.Support for multimodal input capabilities.
4.Integrating reasoning models and reinforcement learning for smarter deep crawling strategies.
Built With
- clerk
- crawl4ai
- docker
- fastapi
- google-agent-development-kit
- google-cloud
- lovable
- playwright
- python
- react
- tailwind-css
- typescript