For every analyst, industry research takes an enormous amount of time. After a whole day of going through industry publications, newswires and company websites, one feels drained from trying to extract the relevant information from a sea of content. I am really interested in green tech and renewable energy, so I tried to solve this problem and make my life easier by aggregating and classifying that information.
What it does
Sustinero is an information aggregation engine for green tech and renewable energy, powered by machine learning. It pulls content through RSS feeds from government agencies, company websites, PR newswires and industry publications. Summaries are extracted using ML, then labelled with a relevant category ("Merger&Acquisition", "Tender&Action", etc.) and assigned a technology tag (solar, wind, hydrogen, etc.).
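As a rough illustration of the ingestion step, here is a minimal sketch of pulling items out of an RSS 2.0 feed with the Python standard library. The function name, the sample feed and the chosen fields are assumptions made for the example, not the project's actual code.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Return a list of {title, link, description} dicts from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the child element is missing
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return items


# Made-up feed used only to exercise the parser.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example newswire</title>
<item>
<title>Utility opens tender for 200 MW solar park</title>
<link>https://example.com/a</link>
<description>Bids are due next quarter.</description>
</item>
</channel></rss>"""

items = parse_rss_items(SAMPLE_FEED)
```

Each returned dict could then be passed on to the summarization and classification steps.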
How we built it
The frontend and backend are built with AWS Amplify and Next.js. The serverless data pipeline runs on AWS using cloud functions, and the ML models for text summarization and classification are exposed as SageMaker endpoints. Technology tags are assigned by a SageMaker notebook that uses n-grams.
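To give an idea of what n-gram based tagging can look like, here is a minimal sketch. The keyword lists in `TAG_NGRAMS` and the function names are hypothetical; the n-grams used in the actual notebook are not shown in this writeup.

```python
import re

# Hypothetical keyword n-grams per technology tag (assumption for illustration).
TAG_NGRAMS = {
    "solar": {("solar",), ("photovoltaic",), ("pv", "module")},
    "wind": {("wind",), ("offshore", "wind"), ("wind", "turbine")},
    "hydrogen": {("hydrogen",), ("green", "hydrogen"), ("electrolyser",)},
}


def ngrams(tokens, n):
    """Yield consecutive n-token tuples from a token list."""
    return zip(*(tokens[i:] for i in range(n)))


def assign_tags(text, max_n=2):
    """Return the sorted tags whose keyword n-grams occur in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    seen = set()
    for n in range(1, max_n + 1):
        seen.update(ngrams(tokens, n))
    return sorted(tag for tag, grams in TAG_NGRAMS.items() if grams & seen)


tags = assign_tags("Offshore wind turbine maker signs green hydrogen deal")
```

A lookup like this needs no training data, which makes it easy to add a new technology tag by adding one more entry to the dictionary.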
Challenges we ran into
- Data collection and labelling is an iterative process and seems to have no end :)
- It's really hard to fit a machine learning model into a Lambda function as a layer, so we ended up publishing the models on SageMaker and exposing them via endpoints.
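Calling a hosted model from a cloud function instead of bundling it could look roughly like the sketch below. The function name, the endpoint name and the JSON payload shape (`{"inputs": ...}`) are assumptions; the real request format depends on how the model container was deployed. The client is passed in so the function stays easy to test.

```python
import json


def summarize(runtime_client, endpoint_name, text):
    """Send text to a SageMaker endpoint and return the decoded JSON response.

    runtime_client is expected to be boto3.client("sagemaker-runtime").
    The payload shape below is an assumption, not the project's actual contract.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"inputs": text}),
    )
    # The response body is a stream; read it fully before decoding.
    return json.loads(response["Body"].read())
```

Inside a Lambda handler this would be called with a client created once outside the handler, so it is reused across invocations.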
Accomplishments that we're proud of
We achieved high accuracy for a number of labels and saved a few hours a day previously spent on research. We also managed to get the models running in a serverless data pipeline in the cloud.
What we learned
When the label requirements for an ML text classification project are not known in advance and the dataset is very limited and highly imbalanced, rephrasing a multi-label classification problem as a set of binary classification problems worked really well, allowing for rapid iteration and high accuracy on small datasets.
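The reframing can be sketched as follows: every label gets its own independent yes/no model trained on the same texts. The tiny Naive Bayes-style classifier and the four training sentences are stand-ins invented for the example; they are not the models or data used in Sustinero.

```python
from collections import Counter
import math
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def train_binary(examples):
    """Train a tiny Naive Bayes-style model on (text, is_positive) pairs."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, is_positive in examples:
        counts[is_positive].update(tokenize(text))
        docs[is_positive] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, docs, vocab


def predict_binary(model, text):
    counts, docs, vocab = model
    score = {}
    for cls in (True, False):
        total = sum(counts[cls].values())
        # Laplace smoothing for both the class prior and the token likelihoods.
        log_p = math.log((docs[cls] + 1) / (docs[True] + docs[False] + 2))
        for tok in tokenize(text):
            if tok in vocab:
                log_p += math.log((counts[cls][tok] + 1) / (total + len(vocab)))
        score[cls] = log_p
    return score[True] > score[False]


def train_per_label(dataset, labels):
    """dataset: list of (text, set_of_labels). Returns one binary model per label."""
    return {
        label: train_binary([(text, label in tags) for text, tags in dataset])
        for label in labels
    }


def predict_labels(models, text):
    return sorted(label for label, m in models.items() if predict_binary(m, text))


# Made-up training data, for illustration only.
DATA = [
    ("Company A acquires wind developer B", {"Merger&Acquisition"}),
    ("Utility merger and acquisition of solar firm completed", {"Merger&Acquisition"}),
    ("Government opens tender for offshore wind capacity", {"Tender&Action"}),
    ("New tender announced for solar park construction", {"Tender&Action"}),
]
models = train_per_label(DATA, ["Merger&Acquisition", "Tender&Action"])
predicted = predict_labels(models, "Tender opens for hydrogen plant")
```

Because the models are independent, a new label only requires training one more binary model, and each one can be retrained or re-balanced without touching the others.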
What's next for Sustinero
I want to dive deeper into machine learning and build NER models to extract company names, so I could follow a particular company or a group of companies (like wind turbine manufacturers), as well as locations linked at the country and region level.