Project Story

Predicting Corporate Scope 1 & Scope 2 Emissions with Domain-Driven Machine Learning

🌱 Motivation & Inspiration

Global ESG reporting continues to expand, yet large gaps remain in corporate carbon disclosures, especially for:

  • Scope 1: Direct emissions from sources a company owns or controls
  • Scope 2: Indirect emissions from purchased energy (electricity, steam, heating, and cooling)

These disclosure gaps make it harder for financial institutions, regulators, and portfolio managers to accurately assess climate exposure and transition risk.

This project was inspired by a simple belief:

Better carbon estimation enables better sustainable finance decisions.

Our goal was to build a domain-informed, explainable, and robust ML pipeline that can estimate emissions even when disclosure is incomplete. Instead of relying purely on raw ESG scores, we integrate financial scale, geography, and behavioral signals to reconstruct emissions patterns that more closely reflect real-world operations.


📊 Business Relevance

More accurate Scope 1 & Scope 2 predictions support:

  • Reliable portfolio carbon footprint calculation
  • Enhanced climate scenario planning
  • Better regulatory compliance with disclosure mandates
  • More credible transition risk assessments
  • Fairer company-to-company comparisons

The model narrows the uncertainty gap between companies that disclose emissions and those that do not.


🧠 Key Learnings

We learned that:

  • Domain knowledge is essential; raw ML alone is not enough
  • Log-transformations are critical for heavy-tailed sustainability data
  • Scope-based revenue engineering substantially improves predictive signal
  • Country encoding must be handled with discipline to avoid leakage or overfitting
  • Behavioral features become informative only after controlling for firm size
  • The goal is not just lowering error, but producing trustworthy, explainable outputs
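
Two of the learnings above, log-transforming a heavy-tailed target and encoding country without leakage, can be illustrated with a minimal sketch. This is not the project's actual code: the function names, the index-based fold assignment, the smoothing value, and the toy numbers are all assumptions made for illustration, and only the standard library is used so the idea stays visible.

```python
import math
from collections import defaultdict


def log_target(emissions):
    """Compress a heavy-tailed, non-negative target with log(1 + x)."""
    return [math.log1p(x) for x in emissions]


def inverse_log_target(values):
    """Map log-scale values back to the original units."""
    return [math.expm1(v) for v in values]


def out_of_fold_country_encoding(countries, target, n_folds=5, smoothing=10.0):
    """Encode each country by the smoothed mean target of the other folds.

    A row never sees its own target value, which limits leakage. Rare or
    unseen countries are shrunk toward the mean of the training folds.
    """
    n = len(countries)
    if n < 2 or n_folds < 2 or smoothing <= 0:
        raise ValueError("need at least 2 rows, 2 folds, and smoothing > 0")
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Hypothetical fold assignment: row index modulo n_folds.
        train_idx = [i for i in range(n) if i % n_folds != fold]
        valid_idx = [i for i in range(n) if i % n_folds == fold]
        global_mean = sum(target[i] for i in train_idx) / len(train_idx)
        sums, counts = defaultdict(float), defaultdict(int)
        for i in train_idx:
            sums[countries[i]] += target[i]
            counts[countries[i]] += 1
        for i in valid_idx:
            c = countries[i]
            encoded[i] = (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
    return encoded


# Toy, made-up values purely for illustration (not project data).
emissions = [120.0, 95000.0, 430.0, 2_500_000.0, 60.0, 7800.0]
countries = ["US", "US", "DE", "US", "DE", "JP"]

y = log_target(emissions)
encoded = out_of_fold_country_encoding(countries, y, n_folds=3)
```

In this toy run the single "JP" row has no other rows from its country in the training folds, so its encoding falls back to the training-fold mean rather than memorizing its own target.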

🏗️ How the Project Was Built

  1. Defined the business problem and key hypotheses
  2. Conducted in-depth EDA
  3. Engineered domain-driven features
  4. Built a unified preprocessing pipeline
  5. Compared multiple model families
  6. Tuned CatBoost across both scopes
  7. Generated the final predictions (submission.csv)
  8. Documented the entire workflow in final.md
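
Step 5, comparing model families separately for each scope, can be sketched as a small cross-validation loop on log-scale targets. The sketch below is an assumption-laden illustration, not the project's pipeline: the two stand-in models, the `log_revenue` feature name, the fold scheme, and the synthetic numbers are all hypothetical. A gradient-boosting regressor such as CatBoost could be wrapped behind the same `fit`/`predict` interface.

```python
import math
from statistics import mean


class MeanModel:
    """Baseline: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class LogRevenueLine:
    """Hypothetical stand-in: least-squares line on one feature."""

    def fit(self, X, y):
        xs = [row["log_revenue"] for row in X]
        x_bar, y_bar = mean(xs), mean(y)
        var = sum((x - x_bar) ** 2 for x in xs)
        cov = sum((x - x_bar) * (t - y_bar) for x, t in zip(xs, y))
        self.slope_ = cov / var if var > 0 else 0.0
        self.intercept_ = y_bar - self.slope_ * x_bar
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * row["log_revenue"] for row in X]


def cv_rmse(make_model, X, y, n_folds=3):
    """K-fold RMSE, computed on whatever scale y is given in (log here)."""
    n = len(X)
    squared_errors = []
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        valid = [i for i in range(n) if i % n_folds == fold]
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        preds = model.predict([X[i] for i in valid])
        squared_errors += [(p - y[i]) ** 2 for p, i in zip(preds, valid)]
    return math.sqrt(mean(squared_errors))


# Synthetic data for illustration only: emissions scale with revenue.
revenues = [1e6, 5e6, 2e7, 8e7, 3e8, 1e9]
X = [{"log_revenue": math.log(r)} for r in revenues]
targets = {
    "scope_1": [math.log1p(0.01 * r ** 0.9) for r in revenues],
    "scope_2": [math.log1p(0.002 * r ** 0.8) for r in revenues],
}
candidates = {"mean_baseline": MeanModel, "log_revenue_line": LogRevenueLine}

# One score per (scope, model family) pair.
scores = {
    (scope, name): cv_rmse(factory, X, y)
    for scope, y in targets.items()
    for name, factory in candidates.items()
}
```

Scoring each scope on its own target keeps the comparison honest when Scope 1 and Scope 2 respond to different drivers, and measuring error on the log scale prevents a handful of very large emitters from dominating the metric.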

🚧 Challenges Faced

  • Sparse and inconsistent ESG and SDG disclosures
  • Imbalanced country distributions
  • Scope 1 and Scope 2 relying on fundamentally different drivers
  • Distinguishing genuine patterns from artifacts
  • Balancing domain assumptions with ML flexibility

Despite these challenges, we demonstrated that domain-aware machine learning can produce reliable, business-relevant predictions from imperfect ESG data.
