Inspiration

This project was created as part of the Major League Hacking hackathon "Local Hack Day: Build".

What it does

It is a summary of the topic "Web Scraping": it defines what web scraping is, describes the different methods by which it can be done, and explains the difference between legitimate and malicious bots used for scraping.

Web Scraping

Web scraping is the process of using bots to extract content and data from a website. Unlike screen scraping, which only copies the pixels displayed onscreen, web scraping extracts the underlying HTML code and, with it, data stored in a database. The scraper can then replicate entire website content elsewhere.

Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Techniques:

  1. Manually copying and pasting data from a web page into a text file or spreadsheet.
  2. Extracting information using the UNIX grep command or the regular-expression-matching facilities of programming languages.
  3. Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming.
  4. Some semi-structured data query languages, such as XQuery and HTQL, can be used to parse HTML pages and to retrieve and transform page content.
  5. Browsers such as Internet Explorer or Mozilla can parse web pages into a DOM tree, from which programs can retrieve parts of the pages.
  6. Vertical aggregation: a knowledge base is established for an entire vertical, and the platform then creates the bots automatically.
  7. Semantic annotation recognition: annotations, organized into a semantic layer, are stored and managed separately from the web pages, so scrapers can retrieve the data schema and instructions from this layer before scraping the pages.
  8. Machine learning and computer vision attempt to identify and extract information from web pages by interpreting them visually, as a human being might.

Difference between legitimate and malicious bots

Since all scraping bots have the same purpose (to access site data), it can be difficult to distinguish between legitimate and malicious bots. Key differences between the two are:

  1. Legitimate bots identify the organization for which they scrape. For example, Googlebot identifies itself in its HTTP header as belonging to Google. Malicious bots, conversely, impersonate legitimate traffic by creating a false HTTP user agent.
  2. Legitimate bots abide by a site's robots.txt file, which lists the pages a bot is permitted to access and those it cannot. Malicious scrapers, on the other hand, crawl the website regardless of what the site operator has allowed.
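As a rough sketch (not part of the original project), techniques 2 and 5 above can be illustrated with Python's standard library. The sample HTML and the ItemExtractor class are hypothetical, invented purely for illustration:

```python
import re
from html.parser import HTMLParser

# Hypothetical page content, stood in for a fetched web page
HTML = """<html><body>
<h1>Products</h1>
<ul>
  <li class="item">Widget - $4.99</li>
  <li class="item">Gadget - $12.50</li>
</ul>
</body></html>"""

# Technique 2: grep-style extraction with regular expressions
prices = re.findall(r"\$\d+\.\d{2}", HTML)
print(prices)  # ['$4.99', '$12.50']

# Technique 5: parse the page into elements and pull out targeted parts
class ItemExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Only text inside <li class="item"> elements is of interest
        if tag == "li" and ("class", "item") in attrs:
            self.in_item = True

    def handle_data(self, data):
        if self.in_item:
            self.items.append(data.strip())
            self.in_item = False

parser = ItemExtractor()
parser.feed(HTML)
print(parser.items)  # ['Widget - $4.99', 'Gadget - $12.50']
```

Regular expressions work well for simple, flat patterns such as prices, while a parser is the safer choice once the target data depends on the page's element structure.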
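The robots.txt compliance that distinguishes a legitimate bot can be sketched with Python's stdlib urllib.robotparser. The rules and URLs here are hypothetical, and the file is parsed from a string rather than fetched over HTTP:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: every bot is barred from /private/
robots_txt = """User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved scraper checks before requesting each URL
print(rp.can_fetch("MyBot", "https://example.com/private/data.html"))  # False
print(rp.can_fetch("MyBot", "https://example.com/public/page.html"))   # True
```

A malicious scraper simply skips this check (and often fakes its user-agent string), which is why robots.txt is a courtesy convention, not an enforcement mechanism.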
