Inspiration
Every day, enterprises spend millions on manual data collection and expensive scraping services. The web contains vast amounts of valuable information, but accessing it in a usable format remains difficult and costly.
Structured data has become critical for enterprises across industries - powering competitive pricing intelligence in e-commerce and building proprietary datasets for AI/ML model training. By automating schema generation and extraction, this agent democratizes access to web data, making it faster, more affordable, and accessible to everyone.
What it does
This agent extracts structured data from any website based on natural language queries. Provide a URL and describe what you need - the agent determines the optimal schema and performs the extraction.
Example use cases:
- Extract product details, pricing, and specifications from e-commerce sites.
- Gather recent news articles, headlines, and metadata from news websites.
- Pull contact information, reviews, or any structured content from web pages.
The key innovation: understanding your intent from a simple query and dynamically creating the extraction schema - no manual configuration required.
How we built it
Built with Claude Sonnet 4.5 via Claude Code, the agent orchestrates the entire workflow through natural dialogue:
- Conversational interface - Claude Sonnet 4.5 interprets queries and discovers URLs via Exa AI semantic search when not provided
- Dynamic schema generation - Claude generates Pydantic v2 validation schemas from natural language for consistent, typed JSON outputs
- High-performance extraction - Scrapy + Playwright handles JavaScript rendering, anti-bot detection, and browser automation (3-6 second average extraction, 10-20x faster than alternatives)
The architecture follows four stages: intent understanding → schema generation → strategy routing → extraction & validation.
Challenges we ran into
- Agent architecture - Designing a system that works reliably across diverse website structures while balancing AI intelligence versus deterministic logic.
- Bot detection evasion - Implementing stealth optimizations without relying on proxy networks.
- Dynamic schema generation - Ensuring accurate, consistent schemas from ambiguous user queries.
- Speed - Our initial implementation took ~65s per query (using crawl4ai). We re-architected the agent to be 10x faster.
Accomplishments that we're proud of
- Matching industry leaders - Our agent achieves comparable performance to Firecrawl's Extract, a Series A-funded startup with significantly more resources.
- Production-ready solution - Built a tool we're actively deploying at our own startup for market research and data enrichment - not just a demo.
- Cost-effective innovation - Delivered enterprise-grade extraction at reasonable cost/extraction.
What we learned
Push the boundaries of what LLMs can do. Sonnet 4.5 is exceptionally capable, with applications far beyond mainstream use cases. By trusting it to handle complex reasoning - inferring data types, managing nested structures, adapting to HTML variations - we achieved results that would have required thousands of lines of brittle, site-specific code.
What's next for Data Intelligence Agent
- Multi-URL processing - Scale to parallel processing across thousands of URLs simultaneously for rapid data collection.
- Production hardening - Implement enterprise-grade bot detection evasion, proxy infrastructure, and token optimization for real-world deployment.
Note: The best way to test this project out is running it locally. The production deployment would require strong bot bypass measures- only some of which are currently implemented.
Built With
- anthropic
- crawl4ai
- python

Log in or sign up for Devpost to join the conversation.