Inspiration

The inspiration for this project came from the increasing need for image datasets in machine learning and deep learning applications, particularly for tasks like image classification, object detection, and iconography analysis. Many image-heavy websites store valuable data, but extracting and processing this information manually is time-consuming. By automating the web scraping of images and their metadata (e.g., alt tags, dimensions, file size), this project aims to help developers, data scientists, and researchers easily gather the data they need for training and experimentation.

What it does

This project allows users to scrape images and their metadata from any website by simply providing the website URL. It supports both:

- Static pages: extracts images using traditional HTML parsing methods.
- Dynamic pages: handles JavaScript-rendered content using Selenium to ensure that no images are missed.

Users can view real-time scraping progress and results on a clean, interactive dashboard, with metadata such as image dimensions, file size, format, and alt tags.

How we built it

Frontend: Built with React.js, providing a responsive and user-friendly interface where users can input a URL and select the scraping mode (static or dynamic). The frontend interacts with the backend via Axios to submit scrape requests.

Backend: Developed using Flask as the API gateway. The backend handles the scraping logic using:

- BeautifulSoup for static web scraping.
- Selenium for dynamic web scraping.
- Pillow for processing images and extracting metadata.

Libraries & Tools:

- Requests to fetch HTML content.
- BeautifulSoup for parsing static HTML.
- Selenium for handling dynamic JavaScript content.
- Pillow to analyze images.
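The static-scraping path described above can be sketched roughly as follows. The function names and the returned dictionary shape are illustrative, not the project's actual API:

```python
# Sketch of the static path: fetch HTML with Requests, parse <img> tags
# with BeautifulSoup, and resolve relative image URLs against the page URL.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def extract_images(html: str, base_url: str) -> list[dict]:
    """Parse <img> tags out of an HTML document, resolving relative URLs."""
    soup = BeautifulSoup(html, "html.parser")
    images = []
    for img in soup.find_all("img"):
        src = img.get("src")
        if not src:  # skip <img> tags with no source
            continue
        images.append({
            "src": urljoin(base_url, src),  # make relative paths absolute
            "alt": img.get("alt", ""),
        })
    return images


def scrape_static_images(page_url: str) -> list[dict]:
    """Fetch a page and return URL/alt pairs for every image on it."""
    html = requests.get(page_url, timeout=10).text
    return extract_images(html, page_url)
```

The dynamic mode follows the same shape, except the HTML comes from Selenium's `driver.page_source` after the page has finished rendering.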

Challenges we ran into

- Handling Dynamic Content: Scraping images from JavaScript-heavy websites required incorporating Selenium to load the dynamic content correctly, which added complexity to the project.
- Cross-Origin Resource Sharing (CORS): Communicating between the React frontend and Flask backend on different ports caused CORS issues, which required setting up proper CORS handling in the backend.
- Image Processing: Extracting metadata like image size, dimensions, and format from different image formats while handling large image datasets was resource-intensive and needed optimization.

Accomplishments that we're proud of

- Building a Fully Functional Image Scraper: Successfully built a tool that can scrape both static and dynamic web pages and return valuable image metadata.
- Real-Time Frontend Updates: Implemented real-time progress updates in the frontend using WebSockets, allowing users to see scraping progress and results in an interactive way.
- Robust Backend: The backend can handle both static and dynamic content, making it a versatile tool for various scraping needs.

What we learned

- Web Scraping Techniques: Improved our understanding of the complexities of web scraping, including handling static vs. dynamic content and working with different HTML structures across various websites.
- API Integration: Gained experience in connecting a frontend React app with a Flask API, handling requests and ensuring smooth communication between both layers.
- Image Metadata Extraction: Learned how to effectively use the Pillow library for processing images and extracting key metadata like file size, dimensions, and format.
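The Pillow-based metadata extraction can be sketched like this (the function name and returned keys are illustrative):

```python
# Sketch of metadata extraction with Pillow: open raw image bytes and
# read dimensions and format; file size is just the byte count.
import io

from PIL import Image


def image_metadata(data: bytes) -> dict:
    """Return width, height, format, and byte size for raw image data."""
    with Image.open(io.BytesIO(data)) as img:
        return {
            "width": img.width,
            "height": img.height,
            "format": img.format,    # e.g. "JPEG", "PNG"
            "file_size": len(data),  # bytes as downloaded
        }
```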

What's next for Web Image Scraper

- Advanced Scraping Features: Adding support for scraping image-heavy resources like infinite-scroll pages and CAPTCHA-protected sites.
- Data Export Options: Provide the ability to export scraped images and metadata as CSV files or to cloud storage platforms like Google Drive.
- ML Dataset Integration: Add features that allow the scraped images to be easily formatted and exported as datasets for use in machine learning models.
- User Authentication: Implement user login and save scraping history for easier access and reusability of previous scraping tasks.
- Improved Image Categorization: Leverage machine learning models to automatically categorize images based on their content and make the scraper even more powerful.
