About the Project

We started Verde-i with a simple question:
How can companies call themselves “sustainable” when their emissions tell a different story?

Today, most sustainability judgments rely on ESG scores that are paywalled, opaque, poorly standardized, and often disconnected from real environmental impact. That lack of transparency inspired us to build a system that compares what companies say about sustainability to what they actually emit, using open data and machine learning.

What We Learned

Building Verde-i showed us just how complex (and often messy!) the sustainability data landscape is. We learned:

  • how difficult it is to find unbiased, standardized ESG or emissions data,
  • how dramatically corporate sustainability reports vary in format, structure, and honesty, and
  • how essential NLP is for making sense of long, inconsistent PDF disclosures.

How We Built It

Our pipeline combines PDF scraping, ClimateBERT transformer models, and ML regression into a single transparency engine.

1. PDF Scraping & Text Extraction

We built a custom PDF scraper to extract raw text from long sustainability reports. Every line was passed through ClimateBERT’s relatedness model to filter out non-climate content, ensuring that the rest of our analysis focused only on genuinely climate-relevant language.
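The filtering step above can be sketched roughly as follows. This is a minimal illustration, not our actual scraper: `split_into_lines`, `filter_climate_lines`, and the `min_chars` and `threshold` parameters are hypothetical names, and `keyword_score` is only a toy stand-in for the ClimateBERT relatedness model so the example runs without any model download.

```python
from typing import Callable, List


def split_into_lines(raw_text: str, min_chars: int = 20) -> List[str]:
    """Split extracted PDF text into stripped lines, dropping short fragments
    such as page numbers or stray headers (min_chars is an assumed cutoff)."""
    stripped = (line.strip() for line in raw_text.splitlines())
    return [line for line in stripped if len(line) >= min_chars]


def filter_climate_lines(
    lines: List[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only lines whose climate-relatedness score meets the threshold.

    score_fn is any callable returning a score in [0, 1]; in a real pipeline
    it would wrap the relatedness classifier.
    """
    return [line for line in lines if score_fn(line) >= threshold]


def keyword_score(line: str) -> float:
    """Toy stand-in for a relatedness classifier (assumption, not ClimateBERT)."""
    keywords = ("emission", "climate", "carbon", "renewable")
    return 1.0 if any(word in line.lower() for word in keywords) else 0.0
```

Passing the classifier in as a function keeps the filter independent of any particular model, so the stand-in can be swapped for a real relatedness model without touching the rest of the code.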

2. Climate Language Profiling with ClimateBERT

Using several ClimateBERT downstream models, including the Specificity, Commitment, Sentiment (Risk/Neutral/Opportunity), and TCFD classifiers, we converted each report into a structured climate-language profile.
This produced quantitative metrics for:

  • relatedness
  • specificity (vague vs. concrete claims)
  • commitment (actual pledges vs. PR)
  • sentiment framing (risk vs. opportunity)
  • TCFD categories: metrics, strategy, governance, risk

Each company ended up with a vector of 7–8 structured language signals.
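One way to turn per-line classifier outputs into a per-company vector is to average each signal over the report's climate-related lines. The sketch below assumes that averaging scheme, which is an illustration rather than a description of our exact aggregation; `build_profile`, `to_vector`, and the `SIGNALS` ordering are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

# Assumed signal names and ordering, chosen for illustration only.
SIGNALS = ("specificity", "commitment", "risk", "opportunity")


def build_profile(line_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each signal over all lines that reported it."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for scores in line_scores:
        for name, value in scores.items():
            totals[name] += value
            counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}


def to_vector(profile: Dict[str, float], signals: Sequence[str] = SIGNALS) -> List[float]:
    """Lay the profile out in a fixed order so every company's vector aligns.
    Signals missing from the profile default to 0.0."""
    return [profile.get(name, 0.0) for name in signals]
```

A fixed signal order matters because the regression model downstream expects the same feature in the same column for every company.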

3. Emissions Aggregation

We aggregated real-world emissions from the EPA Greenhouse Gas Reporting Program (GHGRP), supplemented with self-reported Scope 2/3 data for companies whose emissions fall outside the EPA reporting threshold.
This became our “ground truth.”
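As a rough sketch of this aggregation, the snippet below sums facility-level records per parent company and falls back to a self-reported figure only when a company has no facility records. The function name, the input shapes, and the "fallback only" rule are assumptions made for illustration; real GHGRP data needs considerably more cleaning and company-name matching than shown here.

```python
from typing import Dict, Iterable, Optional, Tuple


def aggregate_emissions(
    facility_rows: Iterable[Tuple[str, float]],
    self_reported: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Return total emissions per company.

    facility_rows: (parent_company, emissions) pairs, one per facility.
    self_reported: company -> emissions, used only for companies that have
    no facility rows (assumed fallback rule).
    """
    totals: Dict[str, float] = {}
    for company, emissions in facility_rows:
        totals[company] = totals.get(company, 0.0) + emissions
    for company, emissions in (self_reported or {}).items():
        # setdefault leaves facility-based totals untouched.
        totals.setdefault(company, emissions)
    return totals
```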

4. Machine Learning Model

We trained a Random Forest regression model to predict log-emissions entirely from the ClimateBERT language metrics.
Then we computed:

\( \text{Greenwash Residual} = \text{actual emissions} - \text{predicted emissions} \)

A high residual means a company emits more than its language predicts, i.e. it talks cleaner than it behaves.
We z-score this residual across companies to produce our Greenwash Index.
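The residual and z-scoring steps can be sketched as below. The example assumes both actual and predicted values are already in log space (matching the model's log-emissions target) and uses the population standard deviation; both choices, along with the `greenwash_index` name, are assumptions for illustration. The predictions would come from the trained regression model, which is not reproduced here.

```python
from statistics import mean, pstdev
from typing import Dict


def greenwash_index(
    actual_log: Dict[str, float],
    predicted_log: Dict[str, float],
) -> Dict[str, float]:
    """Z-score the residual (actual - predicted) across companies.

    Positive values mean a company emits more than its language predicts.
    """
    residuals = {c: actual_log[c] - predicted_log[c] for c in actual_log}
    values = list(residuals.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed choice)
    if sigma == 0:
        # All residuals identical: no company stands out.
        return {c: 0.0 for c in residuals}
    return {c: (r - mu) / sigma for c, r in residuals.items()}
```

Z-scoring puts every company on a common scale, so an index of 2 reads as "two standard deviations dirtier than the language suggests" regardless of the raw emission units.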

We also built a second version of the model that uses only ESG scores as input, so users can compare language-based and ESG-based predictions.

The Final Product

We built an interactive dashboard where users can:

  • click a company
  • view its predicted vs. actual emissions
  • explore its full ClimateBERT language profile
  • see its Greenwash Index (language-based)
  • compare it directly to the ESG Residual Index

The result: a transparent, AI-powered system that reveals which companies are actually sustainable and which ones just sound like it.
