Having seen this sort of challenge at other hackathons, we were intrigued by the Sopra Steria challenge. We also wanted to see how applicable our knowledge of machine learning was in practice.
What it does
Pinocchio is a Chrome extension that scrapes a webpage to extract information such as the title and article text. It then uses this information to predict a score from 0 to 10 with an XGBoost model (XGBoost is an implementation of gradient-boosted decision trees).
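The extraction step can be illustrated with a minimal sketch. It is written in Python with only the standard library, purely for illustration; the extension itself runs in the browser, and the `ArticleExtractor` class and `extract_article` function are hypothetical names, not our actual code. The sketch also assumes that the article text lives in `<p>` elements, which will not hold for every site.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Collects the <title> text and the text inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._in_p = False
        self.title_parts = []
        self.paragraphs = []  # one list of text fragments per <p>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append([])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._in_p:
            self.paragraphs[-1].append(data)


def extract_article(html):
    """Return the page title and paragraph text with whitespace normalised."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    paragraphs = [" ".join("".join(p).split()) for p in parser.paragraphs]
    text = "\n".join(p for p in paragraphs if p)
    return {"title": title, "text": text}
```

Calling `extract_article(page_html)` returns a dictionary with `title` and `text` keys, which is the kind of input a downstream preprocessing and scoring step could consume.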
How we built it
Challenges we ran into
Implementing the AWS Lambda functions was problematic. Because of the 50 MB size restriction, we had to upload the code for one of our functions to S3, and we had to follow a workaround to use the sklearn library in the preprocessor function. Configuring the IAM group was also a bit awkward. It was difficult to find a suitable representation for the data; we attempted many approaches, e.g. semantic analysis, TF-IDF and publisher reliability. We also tried several ML algorithms before settling on XGBoost.
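To give a feel for one of the representations mentioned above, here is a minimal TF-IDF sketch in plain Python. It is an illustration under stated assumptions, not our preprocessor: the `tokenize` and `tfidf_vectors` helpers are hypothetical, the tokenizer is a simple regular expression, and the weighting uses length-normalised term frequency with a common smoothed IDF variant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vocabulary = sorted(doc_freq)
    # Smoothed IDF (an assumption of this sketch): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocabulary])
    return vocabulary, vectors
```

Each resulting vector has one entry per vocabulary term, so terms that are frequent in one article but rare across the collection receive the largest weights. Vectors of this kind are one possible input to a tree-based model.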
Accomplishments that we're proud of
We are proud of learning how to use Amazon Web Services to create a serverless application. We were also pleased with how our Chrome extension turned out.
What we learned
Credit: Icon made by Freepik from www.flaticon.com