Having seen this sort of challenge at other hackathons, we were intrigued by the Sopra Steria challenge. We also wanted to see how applicable our knowledge of machine learning was.

What it does

Pinocchio is a Chrome extension that scrapes a webpage to extract information such as its title and article text. It then uses this information to predict a score from 0 to 10 with XGBoost (an implementation of gradient-boosted decision trees).
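
To make the extraction step concrete, here is a minimal sketch of pulling a title and article text out of raw HTML. It is only an illustration: our actual scraper was written in Node.js, and the names ArticleExtractor and extract_article, as well as the choice to treat every <p> element as article text, are assumptions made for this example.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collects the <title> text and the text of every <p> element (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._tag = None  # "title" or "p" while inside one of them, otherwise None
        self._buf = []    # text fragments seen inside the current element

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        # Text outside <title> and <p> (scripts, menus, etc.) is ignored.
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # closing an inline tag such as </b>; keep collecting
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if tag == "title":
            self.title = text
        elif text:
            self.paragraphs.append(text)
        self._tag = None


def extract_article(html):
    """Return the page title and the paragraph text joined by newlines."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "text": "\n".join(parser.paragraphs)}
```

Calling extract_article(page_html) returns a dictionary with "title" and "text" keys, which is the kind of input a preprocessing step could then turn into features.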

How we built it

The frontend of the Chrome extension was written in JavaScript. The web scraper Lambda function was written in Node.js, and the preprocessor Lambda function was written in Python, as was the machine learning backend. This was all tied together using Amazon Web Services, taking advantage of AWS Lambda, S3, SageMaker and API Gateway.
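
As a rough sketch of what a Python Lambda function behind API Gateway can look like, the handler below accepts a proxy-integration event whose body is a JSON string and returns a JSON response. The payload fields ("title", "text") and the clean_text helper are hypothetical stand-ins, not our actual preprocessing code.

```python
import json


def clean_text(text):
    """Lowercase and collapse whitespace (a placeholder for real preprocessing)."""
    return " ".join(text.lower().split())


def _response(status, payload):
    """Build a response in the shape API Gateway proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def lambda_handler(event, context):
    """Entry point: parse the request body, validate it, return cleaned fields."""
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return _response(400, {"error": "body is not valid JSON"})

    if not isinstance(payload, dict):
        return _response(400, {"error": "body must be a JSON object"})

    title = payload.get("title")
    text = payload.get("text")
    if not isinstance(title, str) or not isinstance(text, str):
        return _response(400, {"error": "title and text are required"})

    # Assumed output: cleaned strings that a later step would turn into features.
    return _response(200, {"title": clean_text(title), "text": clean_text(text)})
```

Because the handler is a plain function, it can be exercised locally by passing a dictionary such as {"body": "{...}"} as the event and None as the context before deploying it.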

Challenges we ran into

Implementing the Lambda functions was problematic. Due to the 50 MB size restriction, we had to upload the code for one of our functions to S3, and we had to follow a workaround to use the sklearn library in the preprocessor function. Configuring the IAM group was also a bit awkward. It was difficult to find a suitable representation for the data; we attempted many approaches, e.g. semantic analysis, TF-IDF and publisher reliability. We also tried several ML algorithms before settling on XGBoost.
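
For readers unfamiliar with TF-IDF, one of the representations mentioned above, here is a small pure-Python sketch of the textbook formulation. It is not the code we used: the tf_idf function name and the whitespace tokenizer are assumptions for illustration, and library implementations such as sklearn's apply smoothing and normalization that this version omits.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document.

    tf  = occurrences of the term in the document / tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    # Assumed tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]

    # Count, for each term, how many documents contain it at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(tokenized)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

Terms that appear in every document get a weight of zero, while terms unique to one document are weighted most heavily, which is why this representation tends to highlight the distinctive vocabulary of an article.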

Accomplishments that we're proud of

We learned how to use Amazon Web Services to create a serverless application. We were also pleased with how our Chrome extension turned out.

What we learned

We learned JavaScript, having had no previous experience with it. We also gained experience in finding the best way to represent data for a machine learning problem.

Credit: Icon made by Freepik from
