Project Story

Predicting Corporate Scope 1 & Scope 2 Emissions with Domain-Driven Machine Learning

🌱 Motivation & Inspiration

Global ESG reporting continues to expand, yet large gaps remain in corporate carbon disclosures, especially for:

Scope 1: Direct emissions from sources the company owns or controls
Scope 2: Indirect emissions from purchased energy (electricity, steam, heating, and cooling)

These disclosure gaps make it harder for financial institutions, regulators, and portfolio managers to assess climate exposure and transition risk accurately.

This project was inspired by a simple belief:

Better carbon estimation enables better sustainable finance decisions.

Our goal was to build a domain-informed, explainable, and robust ML pipeline that can estimate emissions even when disclosure is incomplete. Instead of relying purely on raw ESG scores, we integrate financial scale, geography, and behavioral signals to reconstruct emissions patterns that more closely reflect real-world operations.

📊 Business Relevance

More accurate Scope 1 & Scope 2 predictions support:

Reliable portfolio carbon footprint calculation
Enhanced climate scenario planning
Better regulatory compliance with disclosure mandates
More credible transition risk assessments
Fairer company-to-company comparisons

The model narrows the gap between companies that disclose emissions and those that do not by supplying estimates where reported figures are missing.

🧠 Key Learnings

We learned that:

Domain knowledge is essential; raw ML alone is not enough
Log-transformations are critical for heavy-tailed sustainability data
Scope-based revenue engineering dramatically improves predictive structure
Country encoding must be handled with discipline to avoid leakage or overfitting
Behavioral features become informative only after controlling for firm size
The goal is not just lowering error, but producing trustworthy, explainable outputs
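
To illustrate the log-transformation point above, here is a minimal sketch of how a heavy-tailed emissions target can be compressed before training and restored afterwards. The emission values are hypothetical, and the choice of log1p (rather than a plain log) is an assumption made so that zero emissions stay valid; the project's exact transform may differ.

```python
import math

def to_log(emissions):
    """Compress heavy-tailed values with log1p, which is defined at zero."""
    return [math.log1p(e) for e in emissions]

def from_log(log_values):
    """Invert log1p so predictions return to the original unit."""
    return [math.expm1(v) for v in log_values]

# Hypothetical Scope 1 emissions (tonnes CO2e): one very large emitter dominates.
scope1 = [120.0, 950.0, 4_300.0, 2_500_000.0]
log_scope1 = to_log(scope1)

# The raw range spans millions of tonnes; the log range spans roughly 10 units,
# so a single large emitter no longer dominates a squared-error loss.
raw_range = max(scope1) - min(scope1)
log_range = max(log_scope1) - min(log_scope1)

# A model would be fit on log_scope1; its predictions are mapped back with from_log.
restored = from_log(log_scope1)
```

Training in log space means errors are judged in relative rather than absolute terms, which tends to suit targets that vary across several orders of magnitude.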

🏗️ How the Project Was Built

Defined the business problem and key hypotheses
Conducted in-depth EDA
Engineered domain-driven features
Built a unified preprocessing pipeline
Compared multiple model families
Tuned CatBoost across both scopes
Generated the final predictions (submission.csv)
Documented the entire workflow in final.md

🚧 Challenges Faced

Sparse and inconsistent ESG and SDG disclosures
Imbalanced country distributions
Scope 1 and Scope 2 relying on fundamentally different drivers
Distinguishing genuine patterns from artifacts
Balancing domain assumptions with ML flexibility
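
One common way to handle imbalanced country distributions without leaking the target is a smoothed, out-of-fold target encoding. The sketch below is an illustration under stated assumptions, not the project's actual encoder: the function names, the `prior_weight` parameter, the fold assignment, and the sample data are all hypothetical.

```python
from collections import defaultdict

def smoothed_means(countries, targets, prior_weight=5.0):
    """Per-country target mean, shrunk toward the global mean.

    Countries with few rows end up close to the global mean,
    which limits overfitting on rare categories.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for country, target in zip(countries, targets):
        sums[country] += target
        counts[country] += 1
    table = {
        c: (sums[c] + prior_weight * global_mean) / (counts[c] + prior_weight)
        for c in sums
    }
    return table, global_mean

def out_of_fold_encode(countries, targets, n_folds=3, prior_weight=5.0):
    """Encode each row using statistics from the other folds only."""
    n = len(countries)
    encoded = [0.0] * n
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if i % n_folds != fold]
        table, global_mean = smoothed_means(
            [countries[i] for i in train_idx],
            [targets[i] for i in train_idx],
            prior_weight,
        )
        for i in range(fold, n, n_folds):
            # Countries unseen in the training folds fall back to the global mean.
            encoded[i] = table.get(countries[i], global_mean)
    return encoded

# Hypothetical data: log-scale emissions, with "JP" appearing only once.
countries = ["US", "US", "US", "DE", "DE", "JP"]
log_targets = [10.0, 11.0, 12.0, 8.0, 9.0, 14.0]
encoded = out_of_fold_encode(countries, log_targets)
```

Because the single "JP" row never sees its own target, its encoding is the mean of the other folds rather than its own value, which is the leakage guard the encoding is meant to provide.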

Despite these challenges, we demonstrated that domain-aware machine learning can produce reliable, business-relevant predictions from imperfect ESG data.