We wanted to create a dataset that is relevant not only to us but also to millions of people around the world. As a team of freshmen, we have begun to experience the frustration of hunting for jobs and internships. To help both ourselves and other students with the internship application process, we decided to create a dataset of open internship positions on some popular job sites.
Our dataset is a collection of open software engineer internship positions on popular job sites. The data we've pulled includes, but is not limited to, the company name, job title, pay, and location. We built our web scraping programs from scratch in Python, so they can be pointed at many other job opportunities and are usable on a variety of sites. We began with Levels.fyi, Indeed.com, and Glassdoor.
How we built it
We used Selenium to handle the dynamic nature of our target websites and BeautifulSoup to read their HTML. Our team members used two separate approaches: one located each element by its XPath, and the other searched the HTML by class name. We then used pandas DataFrames to both store and clean our data. For Glassdoor, we used Selenium to scrape the job data, pandas to format it, and built-in functions to export it to CSV. For Levels.fyi, we used pandas to parse a JSON file of all the job data.
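To illustrate the JSON-to-CSV step, here is a minimal sketch of how a jobs JSON file could be flattened, cleaned, and exported with pandas. The sample records, the field names (`company`, `title`, `location`, `compensation.hourly`), and the `jobs_json_to_frame` helper are assumptions made up for this example; the real Levels.fyi file is likely structured differently.

```python
import json

import pandas as pd

# Hypothetical sample shaped like a jobs JSON file; real field names may differ.
SAMPLE_JSON = """
[
  {"company": "Acme", "title": "Software Engineer Intern",
   "location": "Austin, TX", "compensation": {"hourly": 45}},
  {"company": "Globex ", "title": "Software Engineer Intern",
   "location": null, "compensation": {"hourly": null}},
  {"company": "Acme", "title": "Software Engineer Intern",
   "location": "Austin, TX", "compensation": {"hourly": 45}}
]
"""

TEXT_COLUMNS = ["company", "title", "location"]


def jobs_json_to_frame(raw_json: str) -> pd.DataFrame:
    """Turn a JSON array of job records into a cleaned DataFrame."""
    records = json.loads(raw_json)
    # Flatten nested objects, e.g. {"compensation": {"hourly": 45}}
    # becomes a column named "compensation.hourly".
    df = pd.json_normalize(records)
    df = df.rename(columns={"compensation.hourly": "pay_per_hour"})
    df = df[TEXT_COLUMNS + ["pay_per_hour"]]
    # Trim stray whitespace; missing values stay missing.
    for col in TEXT_COLUMNS:
        df[col] = df[col].str.strip()
    # Anything that is not a number becomes NaN instead of raising.
    df["pay_per_hour"] = pd.to_numeric(df["pay_per_hour"], errors="coerce")
    # The same posting can show up more than once, so drop exact repeats.
    return df.drop_duplicates().reset_index(drop=True)


df = jobs_json_to_frame(SAMPLE_JSON)
csv_text = df.to_csv(index=False)  # pass a file path instead to write to disk
print(csv_text)
```

Calling `df.to_csv("jobs.csv", index=False)` would write the same table to a file; the file name is only an example.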
Challenges we ran into
For the first ~8 hours, we made minimal progress getting our web scraper to work properly with dynamic webpages. Because the pages load their content dynamically, we couldn't load the data into a BeautifulSoup object directly, so we needed to find a workaround through Selenium. Furthermore, null values, missing class names, and data formatting issues plagued us throughout the process.
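The workaround can be sketched roughly as follows: let Selenium render the page in a browser, hand the rendered HTML to BeautifulSoup, and treat a missing class name as a null value rather than an error. The class names (`job-card`, `company`, `title`, `pay`, `location`), the sample HTML, and the helper functions are hypothetical; each real site uses its own markup, and a page may also need an explicit wait before its content appears.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second card has no "pay" element.
SAMPLE_HTML = """
<div class="job-card">
  <span class="company">Acme</span>
  <span class="title">Software Engineer Intern</span>
  <span class="pay">$45/hr</span>
  <span class="location">Austin, TX</span>
</div>
<div class="job-card">
  <span class="company">Globex</span>
  <span class="title">Software Engineer Intern</span>
  <span class="location">Remote</span>
</div>
"""

FIELDS = ["company", "title", "pay", "location"]


def fetch_rendered_html(url: str) -> str:
    """Load a page in a real browser so script-inserted content is present."""
    # Imported here so the parsing code below runs without a browser driver.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def text_or_none(card, class_name):
    """Return the text of the first element with this class, or None."""
    node = card.find(class_=class_name)
    return node.get_text(strip=True) if node is not None else None


def parse_job_cards(html: str) -> list[dict]:
    """Extract one dict per job card, tolerating missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {field: text_or_none(card, field) for field in FIELDS}
        for card in soup.find_all("div", class_="job-card")
    ]


rows = parse_job_cards(SAMPLE_HTML)
for row in rows:
    print(row)
```

On a live site, `parse_job_cards(fetch_rendered_html(url))` would chain the two steps. The resulting list of dicts can be passed straight to `pandas.DataFrame(rows)`, where each `None` becomes a missing value.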
Accomplishments that we're proud of
We are glad to have created a working dataset. In addition, we were able to build some visualizations from it to demonstrate its use. All in all, we're excited to add web scraping to our programming arsenals!
What we learned
- How to web scrape from scratch
- How to use pandas
- How to clean data
What's next for Software Engineer Internship Dataset
In the future, we'd like to expand the data beyond "Software Engineer Internship" postings into a dataset comprising hundreds of different jobs sourced from hundreds of different websites. Though we now recognize the inherent difficulty of parsing all of this data, we're confident that, with our newfound knowledge, we can achieve this goal.