Inspiration
To learn about web scraping, we used videos by TheNewBoston and Data Science Dojo. Links: https://www.youtube.com/watch?v=XQgXKtPSzUI https://www.youtube.com/watch?v=F2lbS-F0eTQ&list=PL6gx4Cwl9DGA8Vys-f48mAH9OKSUyav0q&index=5
What it does
The program first asks whether you want to search the Java or Python documentation. Upon a valid answer ("Java" or "Python"), it prints a table of guides the user can access. For example, one entry in the Java documentation is displayed as technotes/guides/math/index.html. The user is then prompted to enter a search keyword. If the user inputs "math", the program accesses that math/index.html link and scrapes the important information out of the guide.
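The keyword-to-guide matching can be sketched roughly like this (the link list and keyword here are illustrative examples, not the project's actual data):

```python
import re

# Hypothetical table of guide links scraped from the documentation index
links = [
    "technotes/guides/math/index.html",
    "technotes/guides/net/index.html",
]

keyword = "math"

# Pick the first link whose path contains the keyword
match = next((link for link in links if re.search(keyword, link)), None)
print(match)  # technotes/guides/math/index.html
```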
How I built it
We made two classes inheriting from the HTMLParser class in the html.parser library, one specific to Python and one to Java; these handle all of the tag and data processing. We pulled the data from the webpages using the urllib library. To return only the "important" links, we filtered out the unimportant tags in the start-tag handler: most could be excluded by selecting only a tags, and the rest by inspecting their attributes. The href links are then printed in a table. Next, the user's input for the selected guide is matched against the links using very basic regex (which means the input needs to be nearly exact), and the chosen link's page is fetched with the urllib library again. To extract the data from the p and h1 tags, the start-tag handler tracks which tag we are currently inside; while inside a p or h1, the data handler appends the text onto a string buffer, which we print at the end or after each header.
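The link-collecting side of this could look roughly like the following sketch. The class name and the attribute filter are assumptions for illustration, not the project's actual code:

```python
from html.parser import HTMLParser


class GuideLinkParser(HTMLParser):
    """Collects href values from <a> tags, skipping unimportant links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":  # only <a> tags carry guide links
            return
        href = dict(attrs).get("href", "")
        # Filter by inspecting attributes: drop page anchors and external URLs
        if href and not href.startswith("#") and not href.startswith("http"):
            self.links.append(href)


parser = GuideLinkParser()
parser.feed('<a href="math/index.html">Math</a><a href="#top">Top</a>')
print(parser.links)  # ['math/index.html']
```

In the real program the HTML would come from urllib (e.g. `urllib.request.urlopen(url).read().decode()`) rather than a literal string.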
Challenges I ran into
Initially, the formatting of the p and h1 content was wrong because the documentation splits its p tags into sections at seemingly random intervals. At first, every time we exited a p tag we simply printed its contents and moved on, which produced fragments: a single word followed by three blank lines at random points. To solve this, we created a string buffer that accumulates the tags' content with all \n characters removed. This let us control the \n characters and add them in at our own discretion.
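The buffering fix can be sketched as follows; the class name and the exact flush points are assumptions made for illustration (here the buffer is flushed after each h1 and at close, per the "at the end or after each header" rule above):

```python
from html.parser import HTMLParser


class GuideTextParser(HTMLParser):
    """Buffers text from <p> and <h1> tags instead of printing per-tag."""

    def __init__(self):
        super().__init__()
        self.in_text = False
        self.buffer = ""
        self.output = []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "h1"):
            self.in_text = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_text = False
            self.buffer += " "  # keep adjacent <p> fragments from fusing
        elif tag == "h1":
            self.in_text = False
            if self.buffer:  # flush the heading as its own paragraph
                self.output.append(self.buffer.strip())
                self.buffer = ""

    def handle_data(self, data):
        if self.in_text:
            # strip the document's own newlines so we control line breaks
            self.buffer += data.replace("\n", " ")

    def close(self):
        super().close()
        if self.buffer:  # flush whatever remains at the end
            self.output.append(self.buffer.strip())
            self.buffer = ""


parser = GuideTextParser()
parser.feed("<h1>Math</h1><p>Line one\nsplit</p><p>more.</p>")
parser.close()
print(parser.output)  # ['Math', 'Line one split more.']
```

Because the newlines are removed on the way into the buffer, the randomly split p tags join back into a single clean paragraph.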
Accomplishments that I'm proud of
Given that neither team member had any experience with web scraping, we are both proud to have a product that looks clean. One team member was rusty in Python and found relearning it very enjoyable. Overall, forming a solid understanding of a brand-new topic was a very satisfying accomplishment.
What I learned
We learned basic web scraping in Python, and how to use GitHub to manage and share code versions.
What's next for Language documentation web scraper
A graphical user interface could make the experience much more user friendly. Other ideas: support for more languages, better output formatting, and enhanced regex to make keyword search more forgiving.