BlueSkyScraper

Inspiration

Understanding the customer is critical to any business. However, customer sentiment is dispersed across numerous review sites and social media platforms. BlueSkyScraper provides a centralized hub for aggregated user opinions, with statistical and machine-learning-driven insights that support actionable decisions. We demonstrate its efficacy with an analysis of JetBlue.

What it does

We provide a robust, modularized data pipeline that connects to multiple sources to aggregate large amounts of data. On top of it, we layer advanced ML models and various statistical techniques to break down the data and reveal underlying trends, which can then be converted into business decisions.

The first step involves connecting to social media platforms like Reddit and Twitter. These sources give unfiltered access to user thoughts and help us understand public sentiment. As this data is largely unstructured text, we leverage the Google Natural Language API and TextBlob to extract key entities and sentiment. We also use unsupervised clustering to determine categories and patterns within these posts and tweets.
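
To make the sentiment step concrete, here is a minimal sketch of polarity scoring. It does not call TextBlob or the Google Natural Language API; it swaps in a tiny hand-written lexicon purely to show the idea of mapping raw text to a score in [-1, 1] and then to a label. The names `LEXICON`, `polarity` and `label`, the word weights, and the threshold are all assumptions made for illustration.

```python
import re

# Hypothetical toy lexicon (an assumption for this sketch);
# real sentiment libraries ship far larger, tuned lexicons or models.
LEXICON = {
    "great": 1.0,
    "friendly": 0.8,
    "comfortable": 0.6,
    "delayed": -0.7,
    "lost": -0.8,
    "rude": -0.9,
}


def polarity(text):
    """Average lexicon score of the matched words; 0.0 if none match."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def label(text, threshold=0.1):
    """Map a polarity score to a coarse sentiment label."""
    score = polarity(text)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"
```

Calling `label("Flight delayed and staff were rude")` walks the text through both steps; in a fuller pipeline the `polarity` function would be replaced by a call to whichever sentiment library or API is in use.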

This is then combined with input from review sites that we scrape, such as TripAdvisor and Skytrax. Along with the unstructured review text, which we pass through the same algorithms, we also collect statistics and ratings. These feed back into more informed clustering and a better understanding of the data.

After several iterations of increased data collection and deeper structural understanding, we analyze the final output to generate actionable items. This is combined with market research and existing solutions to provide a comprehensive list of recommendations to improve customer satisfaction.

As a case study, we apply our system to JetBlue to analyze customer sentiment towards this airline.

How we built it

To collect data, we used various scrapers and APIs to extract as much information as possible. This first phase was largely automated after we wired everything up and only required a list of target sites. We then looked for additional information, such as company reports and other already aggregated datasets, which we merged into our pipeline.
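
One way a collection phase driven only by a list of target sites could be wired up is a registry that maps each domain to its scraper. The sketch below is an assumption about structure, not our actual code: `Review`, `SCRAPERS`, `scraper`, `collect` and the `example.com` stub are hypothetical names, and the stub returns a placeholder instead of fetching a page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional
from urllib.parse import urlparse


@dataclass
class Review:
    """Common record shape shared by every source (hypothetical schema)."""
    source: str
    text: str
    rating: Optional[float] = None


# Registry mapping a domain to the function that knows how to scrape it.
SCRAPERS: Dict[str, Callable[[str], List[Review]]] = {}


def scraper(domain):
    """Decorator that registers a scraper function for one domain."""
    def register(fn):
        SCRAPERS[domain] = fn
        return fn
    return register


@scraper("example.com")
def scrape_example(url):
    # Placeholder: a real scraper would fetch and parse the page here.
    return [Review(source="example.com", text="stub review from " + url)]


def collect(targets):
    """Run the matching scraper for each target URL, skipping unknown sites."""
    reviews = []
    for url in targets:
        host = urlparse(url).netloc
        if host.startswith("www."):
            host = host[4:]
        fn = SCRAPERS.get(host)
        if fn is None:
            continue  # no scraper registered for this site
        reviews.extend(fn(url))
    return reviews
```

With this layout, adding a new source means writing one decorated function; the list of target URLs passed to `collect` is the only input that changes between runs.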

Afterwards, we ran various NLP algorithms on the collected data. Some were already available as Python libraries, such as TextBlob, and we also added several Google NLP APIs. On top of this, we tried out various clustering approaches to identify key categories and relevant entities. We then iterated on this approach, feeding our clusters back in as priors to the NLP algorithms to get more detailed insights.
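
As an illustration of what one such clustering approach can look like, the sketch below runs a small k-means over bag-of-words vectors with cosine similarity. It is only one candidate technique, not necessarily the one we settled on, and it uses plain Python instead of an ML library. The function names and the choice to seed centroids from given document indices are assumptions made to keep the example short and deterministic.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Mean term vector of a non-empty group of documents."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def kmeans(texts, seeds, iterations=10):
    """Assign each text to the most similar of len(seeds) centroids.

    Centroids start as the documents at the `seeds` indices and are
    recomputed from their members after every assignment pass.
    """
    vectors = [vectorize(t) for t in texts]
    centroids = [vectors[i] for i in seeds]
    labels = []
    for _ in range(iterations):
        labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = centroid(members)
    return labels
```

Given a handful of reviews and two seed indices, `kmeans` returns one cluster label per review. A practical version would weight terms (for example with TF-IDF), drop stop words, and choose the number of clusters from the data.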

Lastly, we developed a series of plots and statistics to capture the data and present our findings. We augmented this with some market research to give actionable items for JetBlue.

Challenges we ran into

Connecting to APIs like Twitter's required jumping through several hoops and clearance processes, and getting through all of them in time was fairly challenging. We also struggled to scrape various sites that weren't very bot-friendly.

Accomplishments that we're proud of

We're really proud of the insights we were able to glean. Many of the interesting patterns, such as free wifi correlating far less with customer satisfaction than food options do, were rather surprising. We were also really proud of handling such a large volume of data in a short amount of time, as much of the collection and analysis was very computationally intensive.
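
A comparison like the wifi versus food one can be made with a Pearson correlation between each amenity rating and the overall rating. The sketch below shows the calculation only. The function name `pearson` and the three rating lists are made up for illustration and are not our JetBlue data or results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)


# Made-up ratings for five hypothetical reviews (NOT real data).
overall = [1, 2, 3, 4, 5]
food = [2, 2, 3, 4, 5]
wifi = [3, 3, 4, 3, 3]

food_r = pearson(overall, food)
wifi_r = pearson(overall, wifi)
```

Comparing `food_r` with `wifi_r` shows which amenity moves with overall satisfaction in the sample. Correlation alone does not establish that improving an amenity would raise satisfaction, so we treat it as a pointer for further analysis.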

What we learned

A lot about web scraping! This area was very new to us, and it was really cool to understand how bots traverse the web and what information is accessible.

What's next for BlueSkyScraper

A fully fleshed-out dashboard. Due to a lack of experience, we weren't able to build a dashboard in time. As a result, we opted to focus on generating better graphs and a really slick slide deck. We'll be finishing this up soon though!