Inspiration
Many companies don't report their emissions, making it hard for investors, regulators, and sustainability analysts to assess climate risk. Our models will estimate emissions for these non-reporting companies.
What it does
We developed an AI model grounded in economic physics, using gradient boosting to predict corporate carbon emissions. Our approach transformed raw sector codes, revenue data, and region information into meaningful carbon intensity signals through careful feature engineering. The solution employs a blended ensemble strategy (70% base model + 30% target-encoded model) to balance granular pattern recognition with sector-level priors for robust predictions.
How we built it
We constructed an AI model driven by economic physics, using a gradient boosting algorithm to predict corporate carbon emissions. Through feature engineering, industry classification, revenue size, and region codes were transformed into carbon emission intensity signals. A dual-model fusion strategy (70% base model + 30% target coding model) was employed to ensure prediction stability.
Challenges we ran into
Extreme data skew: Emission data followed a heavy-tailed Pareto distribution requiring log transformation
Leakage prevention: Ensuring no target information contaminated feature engineering
Sparse behavioral signals: Limited environmental activity data for many entities
Accounting complexity: Distinguishing between location-based and market-based emissions reporting
Feature stability: Technical hurdles with quantile binning and interaction term consistency
Accomplishments that we're proud of
We successfully translated complex economic hypotheses into interpretable, actionable features while maintaining model stability. Our solution achieves consistent log-RMSE performance across emission magnitudes and clearly distinguishes the different physical drivers behind Scope 1 (combustion/process) and Scope 2 (grid electricity) emissions. The model provides a reliable screening layer for sustainable investment decisions without overfitting to noise.
What we learned
Structural factors (sector composition, company scale, geographic location) dominate emission predictions, while behavioral and governance signals showed limited predictive power in our dataset. We confirmed that logarithmic transformation is essential for handling emission data's extreme variance, and discovered that shallow tree architectures with aggressive subsampling provide the best bias-variance tradeoff for this domain.
What's next for our model
High priority: Integrate live grid carbon intensity APIs and purchasing power parity adjustments to enhance Scope 2 accuracy and cross-country comparability.
Medium term: Develop a zero-inflated hybrid model to better detect renewable energy procurement patterns and incorporate external environmental compliance data.
Future vision: Explore NLP techniques to extract signals from ESG reports and establish continuous monitoring protocols for model retraining based on economic shifts and grid decarbonization trends.
Log in or sign up for Devpost to join the conversation.