Inspiration

Many companies don't report their emissions, which makes it hard for investors, regulators, and sustainability analysts to assess climate risk. Our models estimate emissions for these non-reporting companies.

What it does

We developed an AI model grounded in economic physics that uses gradient boosting to predict corporate carbon emissions. Careful feature engineering turns raw sector codes, revenue data, and region information into carbon intensity signals. The solution uses a blended ensemble (70% base model + 30% target-encoded model) to balance granular pattern recognition with sector-level priors for robust predictions.

How we built it

We built the model in Python around a gradient boosting algorithm. Feature engineering converted industry classification, revenue size, and region codes into carbon emission intensity signals. For stable predictions, the final output blends two models: 70% from the base model and 30% from the target-encoded model.
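The 70/30 blend can be sketched in a few lines. This is a minimal illustration, not our exact pipeline: it assumes both models predict log1p(emissions) and that the weighting is applied in log space before converting back. The function name blend_log_predictions and the sample numbers are hypothetical.

```python
import math

def blend_log_predictions(base_log_pred, te_log_pred, base_weight=0.7):
    """Blend two models' log-space predictions and return original-scale values.

    Assumption: both inputs are predictions of log1p(emissions).
    """
    if len(base_log_pred) != len(te_log_pred):
        raise ValueError("prediction lists must have the same length")
    te_weight = 1.0 - base_weight
    blended_log = [
        base_weight * b + te_weight * t
        for b, t in zip(base_log_pred, te_log_pred)
    ]
    # expm1 inverts log1p, mapping back to emission units.
    return [math.expm1(v) for v in blended_log]

# Hypothetical predictions for two companies, already in log1p space.
base = [math.log1p(1000.0), math.log1p(50.0)]
target_encoded = [math.log1p(1000.0), math.log1p(200.0)]
blended = blend_log_predictions(base, target_encoded)
```

Averaging in log space amounts to a weighted geometric mean on the original scale, so one very large prediction cannot dominate the blend.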

Challenges we ran into

  • Extreme data skew: Emission data followed a heavy-tailed Pareto distribution, so the targets required a log transformation

  • Leakage prevention: Ensuring no target information contaminated feature engineering

  • Sparse behavioral signals: Limited environmental activity data for many entities

  • Accounting complexity: Distinguishing between location-based and market-based emissions reporting

  • Feature stability: Technical hurdles with quantile binning and interaction term consistency
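One common way to address the leakage challenge above is out-of-fold target encoding, where each row's sector encoding is computed only from rows in other folds. The sketch below is an assumption about how this could be done, not a record of our implementation; the function name, the fold assignment by row index, and the smoothing value are all hypothetical.

```python
from collections import defaultdict

def out_of_fold_target_encode(categories, targets, n_folds=5, smoothing=10.0):
    """Encode each category with the smoothed mean target of the other folds.

    A row's own target never contributes to its own encoding.
    Folds are assigned by row index modulo n_folds (an illustrative choice).
    """
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        total, seen = 0.0, 0
        # Accumulate statistics from every row outside the held-out fold.
        for i in range(n):
            if i % n_folds == fold:
                continue
            sums[categories[i]] += targets[i]
            counts[categories[i]] += 1
            total += targets[i]
            seen += 1
        prior = total / seen if seen else 0.0
        # Encode the held-out rows, shrinking small categories toward the prior.
        for i in range(fold, n, n_folds):
            c = categories[i]
            denom = counts.get(c, 0) + smoothing
            if denom == 0:
                encoded[i] = prior
            else:
                encoded[i] = (sums.get(c, 0.0) + smoothing * prior) / denom
    return encoded
```

At prediction time, the encoding for unseen companies would come from statistics over the full training set, since their targets are unknown.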

Accomplishments that we're proud of

We successfully translated complex economic hypotheses into interpretable, actionable features while maintaining model stability. Our solution achieves consistent log-RMSE performance across emission magnitudes and clearly distinguishes the different physical drivers behind Scope 1 (combustion/process) and Scope 2 (grid electricity) emissions. The model provides a reliable screening layer for sustainable investment decisions without overfitting to noise.
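For readers unfamiliar with the metric, log-RMSE can be computed as follows. This sketch assumes the log1p variant, which tolerates zero emissions; the exact definition used for scoring may differ, and the function name log_rmse is our own.

```python
import math

def log_rmse(y_true, y_pred):
    """Root mean squared error between log1p of true and predicted emissions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the error is measured on a log scale, a prediction that is off by the same ratio contributes roughly the same amount for a small emitter and a large one.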

What we learned

Structural factors (sector composition, company scale, geographic location) dominate emission predictions, while behavioral and governance signals showed limited predictive power in our dataset. We confirmed that logarithmic transformation is essential for handling emission data's extreme variance, and discovered that shallow tree architectures with aggressive subsampling provide the best bias-variance tradeoff for this domain.
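As a rough illustration of "shallow trees with aggressive subsampling", a parameter set in this spirit might look like the following. The parameter names are standard XGBoost regressor arguments, but every value here is an assumed placeholder, not our tuned configuration.

```python
# Illustrative values only; the real settings would come from cross-validation.
SHALLOW_SUBSAMPLED_PARAMS = {
    "max_depth": 3,           # shallow trees limit high-order interactions
    "subsample": 0.6,         # each tree sees a random 60% of the rows
    "colsample_bytree": 0.6,  # each tree sees a random 60% of the features
    "learning_rate": 0.05,    # small steps paired with more trees
    "n_estimators": 500,
}

# With xgboost installed, these could be passed as:
#   xgboost.XGBRegressor(**SHALLOW_SUBSAMPLED_PARAMS)
```

Shallow depth keeps variance low, while row and column subsampling decorrelates the trees.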

What's next for our model

  • High priority: Integrate live grid carbon intensity APIs and purchasing power parity adjustments to enhance Scope 2 accuracy and cross-country comparability.

  • Medium term: Develop a zero-inflated hybrid model to better detect renewable energy procurement patterns and incorporate external environmental compliance data.

  • Future vision: Explore NLP techniques to extract signals from ESG reports and establish continuous monitoring protocols for model retraining based on economic shifts and grid decarbonization trends.

Built With

  • blend
  • eda
  • featureengineering
  • gbr
  • python
  • random-forest
  • sagemaker
  • target-encoded
  • visualization
  • xgboost