Inspiration
Having seen this sort of challenge at other hackathons, we were intrigued by the Sopra Steria challenge. We also wanted to see how applicable our knowledge of machine learning was.
What it does
Pinocchio is a Chrome extension that scrapes a webpage to extract information such as the title and article text. It then uses this information to predict a score from 0 to 10 using XGBoost (an implementation of gradient-boosted decision trees).
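To make the extraction step concrete, here is a minimal sketch of pulling a title and article text out of raw HTML. It is written in Python with only the standard library for illustration, and it is not our actual scraper. The names ArticleExtractor and extract_article are made up for this example, and treating every <p> element as article text is a simplifying assumption.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collects the <title> text and the text inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            # Start a new paragraph buffer for each <p>.
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_paragraph:
            # Inline tags such as <b> do not end the paragraph,
            # so their text lands in the same buffer.
            self.paragraphs[-1] += data


def extract_article(html):
    """Return the page title and the whitespace-normalised paragraph text."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(p.split()) for p in parser.paragraphs if p.strip())
    return {"title": parser.title.strip(), "text": text}
```

Calling extract_article on a page's HTML returns a dictionary with "title" and "text" keys, which is roughly the shape of input a later scoring step would need. Real news pages are messier, so a production scraper would likely need site-specific rules.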
How we built it
The frontend of the Chrome extension was written in JavaScript. The web scraper Lambda function was written in Node.js, and the preprocessor Lambda function and the machine learning backend were both written in Python. This was all tied together using Amazon Web Services, taking advantage of AWS Lambda, S3, SageMaker and API Gateway.
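As a rough illustration of how a Python Lambda function can sit behind API Gateway, the sketch below shows a handler that reads a JSON body, validates it, and returns a score. It is not our deployed code. The make_handler factory, the predict callback and the "title"/"text"/"score" field names are assumptions for this example, and it assumes API Gateway's Lambda proxy integration, where the request body arrives as a JSON string in event["body"].

```python
import json


def _response(status, payload):
    """Build a response in the shape API Gateway's proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def make_handler(predict):
    """Create a Lambda handler around a scoring function.

    `predict` is a hypothetical callback taking (title, text) and returning
    a number. In a deployment it might wrap a call to a SageMaker endpoint;
    here it is injected so the handler can be exercised locally.
    """

    def handler(event, context):
        try:
            payload = json.loads(event.get("body") or "{}")
        except json.JSONDecodeError:
            return _response(400, {"error": "body must be valid JSON"})

        if not isinstance(payload, dict):
            return _response(400, {"error": "body must be a JSON object"})

        title = payload.get("title")
        text = payload.get("text")
        if not title or not text:
            return _response(400, {"error": "title and text are required"})

        # Clamp to the 0 to 10 range the extension displays.
        score = max(0.0, min(10.0, float(predict(title, text))))
        return _response(200, {"score": score})

    return handler
```

Injecting the scoring function keeps the handler testable without any AWS credentials, for example with make_handler(lambda title, text: 5.0).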
Challenges we ran into
Implementing the Lambda functions was problematic. Due to the 50 MB size restriction, we had to upload the code for one of our functions to S3, and we had to follow a workaround to use the sklearn library in the preprocessor function. Configuring the IAM group was also a bit awkward. It was difficult to find a suitable representation for the data; we attempted many approaches, e.g. semantic analysis, TF-IDF and publisher reliability. We also tried several ML algorithms before settling on XGBoost.
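For readers unfamiliar with TF-IDF, one of the representations mentioned above, here is a small standard-library sketch of the textbook formula: term frequency multiplied by log(N / document frequency). It is only an illustration and not our preprocessor. The tokenize and tfidf names are made up for this example, and sklearn's TfidfVectorizer uses a smoothed variant with normalisation, so its numbers will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document using unsmoothed TF-IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Terms that appear in every document get an IDF of zero.
    idf = {term: math.log(n_docs / count) for term, count in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty documents
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors
```

Words shared by every document, such as "the", end up with zero weight, while words specific to one article are weighted up. That is why this kind of representation can help separate articles by their distinctive vocabulary.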
Accomplishments that we're proud of
We learned how to use Amazon Web Services to create a serverless application. We were also pleased with how our Chrome extension turned out.
What we learned
We learned JavaScript, having had no previous experience with it. We also gained experience in finding the best way to represent data for a machine learning problem.
Credit: Icon made by Freepik from www.flaticon.com