About the Project

We started Verde-i with a simple question:
How can companies call themselves “sustainable” when their emissions tell a different story?

Today, most sustainability judgments rely on ESG scores that are paywalled, opaque, poorly standardized, and often disconnected from real environmental impact. That lack of transparency inspired us to build a system that compares what companies say about sustainability to what they actually emit, using open data and machine learning.

What We Learned

Building Verde-i showed us just how complex (and often messy!) the sustainability data landscape is. We learned:

how difficult it is to find unbiased, standardized ESG or emissions data,
how dramatically corporate sustainability reports vary in format, structure, and honesty, and
how essential NLP is for making sense of long, inconsistent PDF disclosures.

How We Built It

Our pipeline combines PDF scraping, ClimateBERT transformer models, and ML regression into a single transparency engine.

1. PDF Scraping & Text Extraction

We built a custom PDF scraper to extract raw text from long sustainability reports. Every line was passed through ClimateBERT’s relatedness model to filter out non-climate content, ensuring that the rest of our analysis focused only on genuinely climate-relevant language.
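
A minimal sketch of the filtering step, not necessarily how the scraper itself is written. The names `filter_climate_lines`, `keyword_stub`, and the `min_chars` cutoff are assumptions made for illustration. The keyword stub only stands in for ClimateBERT's relatedness model so that the example runs without downloading model weights.

```python
from typing import Callable, Iterable, List


def filter_climate_lines(
    lines: Iterable[str],
    is_climate_related: Callable[[str], bool],
    min_chars: int = 20,  # assumed cutoff for dropping extraction fragments
) -> List[str]:
    """Keep only lines that the relatedness classifier flags as climate-related."""
    kept = []
    for raw in lines:
        # Collapse irregular whitespace left over from PDF text extraction.
        line = " ".join(raw.split())
        if len(line) < min_chars:
            continue  # skip page numbers, stray headers, and similar fragments
        if is_climate_related(line):
            kept.append(line)
    return kept


def keyword_stub(line: str) -> bool:
    """Hypothetical stand-in for the ClimateBERT relatedness model."""
    keywords = ("emission", "climate", "carbon", "renewable")
    return any(k in line.lower() for k in keywords)


# Made-up lines, for illustration only.
report_lines = [
    "12",
    "We reduced Scope 1 emissions   by 8% compared to last year.",
    "Our cafeteria introduced a new seasonal menu in the spring.",
]
climate_lines = filter_climate_lines(report_lines, keyword_stub)
```

Passing the classifier in as a callable keeps the filter independent of any particular model, so the stub could be swapped for a real ClimateBERT classifier without changing the loop.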

2. Climate Language Profiling with ClimateBERT

We ran multiple ClimateBERT downstream models, including the Specificity, Commitment, Sentiment (Risk/Neutral/Opportunity), and TCFD classifiers, to convert each report into a structured climate-language profile.
This produced quantitative metrics for:

relatedness
specificity (vague vs. concrete claims)
commitment (actual pledges vs. PR)
sentiment framing (risk vs. opportunity)
TCFD categories: metrics, strategy, governance, risk

Each company ended up with a vector of 7–8 structured language signals.
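
One plausible way to turn per-line classifier outputs into such a vector is to take the share of lines that received each label. The sketch below assumes that approach and covers only a subset of the signals. The label strings (`"specific"`, `"yes"`, `"risk"`, `"opportunity"`) and the function names are hypothetical, not the classifiers' documented outputs.

```python
from collections import Counter
from typing import Dict, List


def label_share(labels: List[str], target: str) -> float:
    """Fraction of lines that received the target label (0.0 if there are no lines)."""
    if not labels:
        return 0.0
    return Counter(labels)[target] / len(labels)


def build_profile(line_labels: Dict[str, List[str]]) -> Dict[str, float]:
    """Collapse per-line classifier labels into one number per language signal."""
    return {
        "specificity": label_share(line_labels["specificity"], "specific"),
        "commitment": label_share(line_labels["commitment"], "yes"),
        "risk": label_share(line_labels["sentiment"], "risk"),
        "opportunity": label_share(line_labels["sentiment"], "opportunity"),
    }


# Made-up labels for four climate-related lines of one report.
line_labels = {
    "specificity": ["specific", "non-specific", "specific", "specific"],
    "commitment": ["yes", "no", "no", "no"],
    "sentiment": ["risk", "neutral", "opportunity", "risk"],
}
profile = build_profile(line_labels)
```

Here `profile` says that three quarters of the lines are specific, one quarter contain a commitment, and half are framed as risk.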

3. Emissions Aggregation

We aggregated real-world emissions from the EPA GHGRP, supplemented with self-reported Scope 2/3 data for companies whose emissions fall outside the EPA reporting threshold.
This became our “ground truth.”
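
The aggregation can be sketched roughly as below. The fallback rule is an assumption: self-reported figures are used only for companies with no facility-level rows. The function name, the row layout, and the company names are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def aggregate_emissions(
    facility_rows: Iterable[Tuple[str, float]],
    self_reported: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Sum facility-level emissions per company, with a self-reported fallback."""
    totals: Dict[str, float] = defaultdict(float)
    for company, tonnes_co2e in facility_rows:
        totals[company] += tonnes_co2e
    for company, tonnes_co2e in (self_reported or {}).items():
        # Assumed policy: facility data wins, so only fill in missing companies.
        totals.setdefault(company, tonnes_co2e)
    return dict(totals)


# Made-up figures in tonnes of CO2e, for illustration only.
facility_rows = [("Acme", 120000.0), ("Acme", 30000.0), ("Globex", 50000.0)]
self_reported = {"Initech": 8000.0, "Globex": 999.0}
totals = aggregate_emissions(facility_rows, self_reported)
```

In this example Acme's two facilities are summed, Globex keeps its facility total, and Initech appears only through its self-reported figure.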

4. Machine Learning Model

We trained a Random Forest regression model to predict log-emissions entirely from the ClimateBERT language metrics.
Then we computed:

\( \text{Greenwash Residual} = \text{actual emissions} - \text{predicted emissions} \)

A high residual means a company emits more than its language predicts: it talks cleaner than it behaves.
We z-score this residual across companies to produce our Greenwash Index.
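
A minimal sketch of the index calculation, taking the model's predictions as given. It assumes the residual is computed in natural-log space (since the model predicts log-emissions) and standardized with the population standard deviation. Both choices, along with the function name and company labels, are assumptions for illustration.

```python
import math
from statistics import mean, pstdev
from typing import Dict


def greenwash_index(
    actual_tonnes: Dict[str, float],
    predicted_log: Dict[str, float],
) -> Dict[str, float]:
    """Z-score the residual log(actual) - predicted_log across companies."""
    residuals = {
        name: math.log(actual_tonnes[name]) - predicted_log[name]
        for name in actual_tonnes
    }
    values = list(residuals.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed choice)
    if sigma == 0:
        # All residuals identical: no company stands out.
        return {name: 0.0 for name in residuals}
    return {name: (r - mu) / sigma for name, r in residuals.items()}


# Made-up values: the model predicts the same log-emissions for all three.
actual = {"A": math.exp(12.0), "B": math.exp(11.0), "C": math.exp(10.0)}
predicted = {"A": 11.0, "B": 11.0, "C": 11.0}
index = greenwash_index(actual, predicted)
```

Company A emits more than its language predicts and gets a positive index, while C emits less and gets a negative one.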

We built a second version of the model using ESG scores only, allowing users to compare language-based vs. ESG-based predictions.

The Final Product

We built an interactive dashboard where users can:

click a company
view its predicted vs. actual emissions
explore its full ClimateBERT language profile
see its Greenwash Index (language-based)
compare it directly to the ESG Residual Index

The result: a transparent, AI-powered system that reveals which companies are actually sustainable and which ones just sound like it.

Built With

climatebert
copilot
html
javascript
python

Submitted to

Quackhacks 2.0

Created by

I developed the PDF scraper, end-to-end data pipeline, and ML model behind our Greenwash Residual Score. Easily one of my favorite projects to work on.

Zoe Tomlinson
I led the front-end work by researching competitor tools and analyzing existing sustainability platforms to design a user-friendly, accessible interface on Figma. I translated the design into TypeScript using Next.js. This was my second time working with Next.js and my first time deploying with Vercel and using Copilot in VS Code, a challenging but rewarding learning experience!

Mary Pham
Preeyapat Wisetmongkolchai
Haley Cabrera

Updates

Zoe Tomlinson started this project — Nov 16, 2025 02:31 PM EST
