Sustainability is becoming increasingly important for many of us when buying clothes. But how do you know whether a clothing brand is sustainable? Several websites manually rank clothing brands by their level of sustainability. This is a time-consuming process, resulting in information that might not always be up to date. That is how Goodbase came to be. Our aim is to solve this problem by applying scraping techniques and Natural Language Processing (NLP) to create a website with an up-to-date sustainability label for almost all clothing brands.
This project drives social change in the community, not only by creating awareness, but also by providing a clear path to action. Knowing which brands are sustainable makes it easier for people to make the right choice, feel part of positive change and ultimately help reduce the impact of climate change. What we do today can improve all our tomorrows.
What it does
Imagine a database that covers as many brands as you can think of, that is updated as often as you wish and that adapts over time. A database that provides well-explained, reliable answers about the level of sustainability of any brand you're interested in. You can search for your brand and directly see whether it is sustainable or not. That is what Goodbase does.
How we built it
We gathered the homepages of more than 2000 clothing brands using the Google Search API. Instead of manually checking sustainability criteria for each brand, such as working conditions, materials used and recycling policy, we search for relevant information on the brand's own website. Our scraper crawls through several levels of each website and selects the homepage plus 10 relevant pages based on a dictionary of keywords. These relevant pages include pages such as about-us, history and blogs, and exclude product pages, shopping carts and store locators. To build a supervised model, we needed training examples of sustainable and unsustainable clothing brands. We collected ratings from a few initiatives that manually rate clothing brands on their level of sustainability, such as Project Cece. We used this small dataset to train an Artificial Intelligence (AI) model. Natural Language Processing is used to prepare the text obtained from the websites for classification. Using word embeddings, we represent the original text as numerical vectors. These vectors are fed to a Support Vector Machine with a linear kernel, which is trained to assign sustainability labels to the clothing brands.
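The final classification step can be sketched roughly like this. Note that the word vectors below are random stand-ins (in the real pipeline they would come from a pretrained embedding) and the tiny labelled set is purely illustrative:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Random stand-in vectors; a real pipeline would load pretrained embeddings.
rng = np.random.default_rng(0)
WORDS = ["organic", "recycled", "fair", "trade", "cotton",
         "cheap", "fast", "discount", "sale", "trend"]
vocab = {w: rng.normal(size=50) for w in WORDS}

def embed(text):
    """Represent a document as the average of its known word vectors."""
    vecs = [vocab[w] for w in text.lower().split() if w in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

# Illustrative labels: 1 = sustainable, 0 = not sustainable.
train_texts = ["organic cotton fair trade", "recycled fair cotton",
               "cheap fast discount sale", "fast trend sale discount"]
train_labels = [1, 1, 0, 0]

clf = LinearSVC()  # linear-kernel SVM
clf.fit(np.stack([embed(t) for t in train_texts]), train_labels)

# Classify a new piece of scraped text.
label = clf.predict(embed("recycled organic trade").reshape(1, -1))[0]
```

In the actual system the documents are the scraped pages of a brand and the labels come from the manually rated training brands.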
Challenges we ran into
The biggest challenge was determining which data to use. A website has many pages, containing a lot of data. We needed to find a balance between the amount of data used and computational efficiency. That is why we decided to scrape the homepage plus 10 other relevant subpages. Furthermore, we decided not to use pages specifically about sustainability, to prevent brands from directly influencing the model outcomes. Another challenge was to build a model that is supported by the community. There is a lot of controversy in the area of sustainability. When is a brand sustainable? Everyone has a different opinion about this. In the end, we hope we built a model that everyone can relate to.
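The keyword-based page selection can be sketched as follows. The keyword sets here are hypothetical examples, not our actual dictionary:

```python
# Hypothetical keyword sets; the real dictionary is larger.
RELEVANT = {"about", "history", "blog", "story", "mission"}
EXCLUDED = {"cart", "product", "store-locator", "checkout"}

def score(url: str) -> int:
    """Count relevant keywords in a URL; excluded keywords disqualify it."""
    path = url.lower()
    if any(k in path for k in EXCLUDED):
        return -1
    return sum(k in path for k in RELEVANT)

def select_pages(urls, n=10):
    """Keep the n highest-scoring pages, dropping disqualified ones."""
    ranked = sorted((u for u in urls if score(u) > 0),
                    key=score, reverse=True)
    return ranked[:n]

urls = ["https://brand.com/about-us", "https://brand.com/cart",
        "https://brand.com/blog/history", "https://brand.com/shop/product-1"]
print(select_pages(urls))
# → ['https://brand.com/blog/history', 'https://brand.com/about-us']
```

Scoring on the URL keeps the crawler cheap: a page is accepted or rejected before its content is downloaded and parsed.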
Accomplishments that we are proud of
The application we built is still a Minimum Viable Product (MVP), but it is already predicting quite well. We used a package called ELI5 ("Explain Like I'm 5") to improve the model even further. It is a very useful package when you want to better understand why the model produces a certain outcome. We are proud of the fact that we have already built a website that can rate clothing brands quite accurately.
What we learned
The biggest eye-opener was that clothing websites contain the information necessary to determine whether a clothing brand is sustainable or not. Unsustainable firms do not advertise the fact that they are unsustainable and will try anything to look sustainable. Therefore, it is great that we can build a model that is based only on data scraped from websites. Moreover, we learned a lot of interesting technical things, like:
- How to automate the scraping of websites and cherry-pick the useful information from the results.
- NLP: from pre-processing raw text to selecting the right model, as well as using word embeddings.
- How to build a web app which shows live results, calculated by the NLP model on a production server in the back-end, all running in Docker containers.
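As a sketch of that last point, a minimal back-end endpoint might look like the following. The route, names and stubbed-out model are hypothetical; they are not our actual code:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_label(brand: str) -> str:
    """Stub for the real NLP model running in the back-end container."""
    return "sustainable" if "eco" in brand.lower() else "unknown"

@app.route("/rate")
def rate():
    """Return a live sustainability label for the requested brand."""
    brand = request.args.get("brand", "")
    return jsonify(brand=brand, label=predict_label(brand))
```

In a containerized setup, an app like this would run behind a production WSGI server (e.g. gunicorn) in its own Docker container, with the front-end querying it for live results.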
What's next for Goodbase
- Rank clothing brands on different criteria, like working conditions, environment, fair trade, etc. We want to build a separate model for each criterion and base the total score on the combination of these models.
- Improve support for languages other than English. Currently, we translate non-English content with the Google API (translators).
- Extract more features from scraped material, like price categories, country of origin, external links etc.
- Extract PDF files from websites, as these files possibly contain information about sustainability.
- Include more data sources, like news websites, external databases etc.
- Try other models (more advanced Neural Networks, BERT, etc.)
- Add a summary about each brand (text summarization)
P.S. The quality of the video in the YouTube link is not that good, so we also added a Loom link.