ClimaZoneAI - Renewable Energy Forecasting for Canada

🌟 Inspiration

Canada is at a critical juncture in its energy transition. With a commitment to reaching net-zero emissions by 2050, understanding where and when renewable energy will be most effective is crucial. However, Canada's vast geographic diversity presents a unique challenge:

British Columbia has abundant hydro resources from mountain snowmelt
Alberta has some of the best wind potential on the prairies
Ontario has strong solar potential in the south
But how do we quantify and forecast these differences?

The challenge: We had historical weather data from 233 Canadian cities, but the data was messy - many weather stations only recorded 1-2 variables, leaving critical measurements like wind speed completely missing. Traditional analysis would simply discard incomplete data, losing valuable geographic coverage.

Our vision: Build an AI-powered system that could:

Intelligently infer missing weather variables using geographic features
Calculate realistic renewable energy potential indices
Forecast future energy potential across multiple time horizons
Visualize everything in an accessible, interactive dashboard

We were inspired by the idea that better data leads to better decisions - and those decisions could directly impact where Canada invests billions in renewable infrastructure.

💡 What We Learned

1. Domain Knowledge is Critical

Our biggest "aha moment" came when we noticed our hydro index was showing nearly zero for all Canadian cities. This made no sense - Canada generates roughly 60% of its electricity from hydro!

The problem: We were calculating hydro potential from daily precipitation, but 88% of days have zero rainfall.

The insight: Real hydro systems work on accumulated water resources - reservoirs fill over weeks and months, not hours. We needed to think like hydrologists, not just data scientists.

The solution: We implemented monthly aggregation for hydro calculations:

$$ \text{Hydro}_{\text{raw}} = 2.0 \times \sum_{i=1}^{30} \text{PRCP}_i + 1.5 \times \sum_{i=1}^{30} \text{SNOW}_i + 0.5 \times \overline{\text{SNWD}} $$

Where:

$\text{PRCP}_i$ = daily precipitation (mm)
$\text{SNOW}_i$ = daily snowfall (mm)
$\overline{\text{SNWD}}$ = average monthly snow depth (mm)

Result: Hydro index jumped from 0.01 to 0.23 average - now reflecting reality!
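
A minimal sketch of that monthly aggregation in pandas, assuming a daily table with `station`, `date`, `PRCP`, `SNOW` and `SNWD` columns. The function name `monthly_hydro_raw` and the grouping by calendar month (rather than a fixed 30-day window) are illustrative assumptions, not the project's actual code:

```python
import pandas as pd


def monthly_hydro_raw(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily PRCP/SNOW/SNWD into one raw hydro value per station-month."""
    daily = daily.copy()
    # Calendar month of each observation (assumption: 'date' is parseable)
    daily["month"] = pd.to_datetime(daily["date"]).dt.to_period("M")

    monthly = (
        daily.groupby(["station", "month"])
        .agg(
            prcp_sum=("PRCP", "sum"),    # accumulated precipitation
            snow_sum=("SNOW", "sum"),    # accumulated snowfall
            snwd_mean=("SNWD", "mean"),  # average snow depth
        )
        .reset_index()
    )

    # Same weights as the formula above: 2.0, 1.5 and 0.5
    monthly["hydro_raw"] = (
        2.0 * monthly["prcp_sum"]
        + 1.5 * monthly["snow_sum"]
        + 0.5 * monthly["snwd_mean"]
    )
    return monthly
```

Summing before weighting is what keeps the many zero-rain days from dominating: a month with a few wet days still produces a clearly non-zero total.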

Lesson: Domain expertise > blind data processing. Always validate results against real-world knowledge.

2. AI Can Fill the Gaps - If You're Smart About It

With many weather stations missing wind measurements, we had two choices:

Discard all incomplete data (lose 40% of cities)
Infer missing values intelligently

We chose option 2, building a physics-informed AI inference model:

$$ \text{AWND} = \text{clip}\left(0.2 + 0.004 \times h + 0.0008 \times P + 0.03 \times |\phi - 45^\circ|,\ 0.5,\ 12\right) $$

Where:

$h$ = elevation (meters)
$P$ = precipitation (mm)
$\phi$ = latitude (degrees)
$\text{clip}(x, a, b)$ = limit $x$ to the range $[a, b]$ (here 0.5 to 12 m/s)
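
The formula translates almost directly into NumPy. This is a sketch only: the function name `infer_awnd` and its argument names are assumptions made for illustration, while the coefficients and the clipping range come from the equation above.

```python
import numpy as np


def infer_awnd(elevation_m, precip_mm, latitude_deg):
    """Estimate average wind speed (m/s) from elevation, precipitation and latitude."""
    raw = (
        0.2
        + 0.004 * np.asarray(elevation_m, dtype=float)
        + 0.0008 * np.asarray(precip_mm, dtype=float)
        + 0.03 * np.abs(np.asarray(latitude_deg, dtype=float) - 45.0)
    )
    # Keep the estimate inside the physically plausible range [0.5, 12] m/s
    return np.clip(raw, 0.5, 12.0)
```

Because the inputs are converted to arrays, the same call works for a single station or for whole DataFrame columns at once.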

Why this works:

Higher elevation → Mountains create wind (orographic effect)
Precipitation → Usually comes with storm systems (wind events)
Distance from mid-latitude → More extreme weather systems

Validation: Our inferred wind speeds had:

Mean: 1.69 m/s (realistic for Canada ✓)
Range: 0.5-10.5 m/s (physically plausible ✓)
Correlation with storms: Strong ✓

Lesson: AI inference works best when informed by physics and domain knowledge, not just statistical patterns.

3. Honest Visualization Builds Trust

Initially, our graphs looked "nicer" with smooth, continuous lines. But we realized we were lying to users by connecting lines across missing months.

Example:

BAD (connectgaps: true):
• ─────────────────────────────── •
  Jan                            May
  ↑ Misleading! No data Feb-Apr

GOOD (connectgaps: false):  
•               [gap]             •
  Jan                            May
  ↑ Honest! Shows data quality

We implemented strict gap handling:

Filter out months with <3 days of data
Drop rows with any NaN values
Disable interpolation: connectgaps: false in Plotly
Linear shape: No artificial smoothing
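
One detail worth checking when applying these steps: Plotly's `connectgaps` setting only has an effect where the `y` array contains null values. If sparse months are simply dropped, the line runs straight from one remaining point to the next. A possible way to keep the gap visible is to re-insert the missing months as `None` after filtering. The sketch below assumes a monthly table with a `month` column of pandas periods plus `day_count`, `Solar`, `Wind` and `Hydro` columns; the helper name `monthly_with_gaps` is hypothetical.

```python
import pandas as pd

VALUE_COLUMNS = ["Solar", "Wind", "Hydro"]


def monthly_with_gaps(monthly: pd.DataFrame, min_days: int = 3) -> list:
    """Drop unreliable months, then re-insert them as explicit nulls."""
    kept = monthly[monthly["day_count"] >= min_days].dropna()
    if kept.empty:
        return []
    kept = kept.set_index("month").sort_index()

    # Every month between the first and last reliable month, gaps included
    full_range = pd.period_range(kept.index.min(), kept.index.max(), freq="M")
    kept = kept.reindex(full_range)  # missing months become NaN rows

    records = []
    for period, row in kept.iterrows():
        record = {"period": str(period)}
        for col in VALUE_COLUMNS:
            # None serializes to JSON null, which Plotly treats as a gap
            record[col] = None if pd.isna(row[col]) else round(float(row[col]), 3)
        records.append(record)
    return records
```

With the nulls in place, `connectgaps: false` breaks the line exactly at the months that have no trustworthy data.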

Result: Users see exactly what data we have - no artificial smoothness, no hidden problems.

Lesson: In scientific applications, honesty > aesthetics. Show users the truth, even if it's messy.

4. Percentile Normalization > Min-Max Scaling

Standard min-max normalization failed us:

$$ x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} $$

Problem: One extreme outlier (e.g., a hurricane with 150 mm rain) compressed all normal values to near-zero.

Solution: Percentile-based robust scaling:

$$ x_{\text{norm}} = \frac{\text{clip}(x, p_5, p_{95}) - p_5}{p_{95} - p_5} $$

Where $p_5$ and $p_{95}$ are the 5th and 95th percentiles.
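
As a rough illustration, the percentile scaling can be written in a few lines of NumPy. The function name `percentile_normalize` and the handling of a constant input are assumptions for this sketch:

```python
import numpy as np


def percentile_normalize(x, lower=5, upper=95):
    """Scale values to [0, 1] using the 5th/95th percentiles instead of min/max."""
    x = np.asarray(x, dtype=float)
    p_lo, p_hi = np.nanpercentile(x, [lower, upper])
    if p_hi == p_lo:
        # Degenerate case (assumption): a constant series maps to zeros
        return np.zeros_like(x)
    # Outliers are clipped to the percentile bounds before scaling
    return (np.clip(x, p_lo, p_hi) - p_lo) / (p_hi - p_lo)
```

A single extreme value now lands at exactly 1.0 instead of stretching the scale, so ordinary days keep a usable spread.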

Result: Much more balanced distribution across all three energy types:

Solar: 0.435 mean (was 0.02)
Wind: 0.297 mean (was 0.15)
Hydro: 0.229 mean (was 0.01)

Lesson: Robust statistics > naive statistics for real-world data.

5. Static HTML > Dynamic Web Apps for Demos

We built both a Streamlit app (with advanced ML models) and a static HTML dashboard. For hackathon demos, HTML won decisively:

Streamlit Challenges:

Requires Python backend running
Port conflicts (8501 often taken)
Slower loading with large datasets
Can't easily share with judges

HTML Advantages:

✅ Double-click to open (no installation)
✅ Works offline
✅ Email as attachment
✅ Deploy to GitHub Pages in 30 seconds
✅ No dependencies

Lesson: For demos and presentations, portability trumps features. Build the advanced stuff, but make sure you have a simple way to show it.

🛠️ How We Built It

Phase 1: Data Pipeline (The Messy Reality)

Challenge: Input data was in "long format" - one row per observation:

station,date,observation,value
CA001,2024-01-01,PRCP,5.2
CA001,2024-01-01,TAVG,-2.1
CA001,2024-01-01,SNOW,0.0

This meant up to 8 rows per station-date combination, one for each recorded variable. Completely unworkable for ML!

Solution: Built a transformation pipeline:

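# df: long-format DataFrame loaded from the input CSV

# Step 1: Pivot to wide format (one row per station-date)
df_wide = df.pivot_table(
    index=['station', 'date', 'latitude', 'longitude',
           'elevation', 'city'],
    columns='observation',
    values='value',
    aggfunc='first'  # keep the first value if a station-date repeats
).reset_index()      # turn the index levels back into ordinary columns

# Step 2: Extract province from the "City, Province" label
# (assumes the 'city' column holds that combined string)
df_wide['province'] = df_wide['city'].str.split(', ').str[-1]

# Step 3: AI-driven inference for missing variables
df_wide = infer_missing_variables(df_wide)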

Result: 103,246 long-format rows → 95,848 wide-format rows with complete features.

Phase 2: Feature Engineering (Making Data Meaningful)

Raw weather numbers don't directly tell you energy potential. We engineered domain-specific indices:

Solar Index

$$ \text{Solar}_{\text{raw}} = \text{TAVG} - \frac{\text{PRCP}}{10} $$

Rationale:

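Higher temperature = Proxy for sunnier, longer days (panels themselves run slightly less efficiently when hot, so temperature stands in for sunlight here)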
More precipitation = Cloud cover = Less sunlight
Division by 10 scales precipitation appropriately

Wind Index

$$ \text{Wind}_{\text{raw}} = \frac{\text{AWND} + \text{WSF2}}{2} $$

Rationale:

Average wind speed (AWND) = Sustained energy
Wind gusts (WSF2) = Peak capacity
Both matter for turbine performance

Hydro Index (Our Innovation!)

$$ \text{Hydro}_{\text{raw}} = 2.0 \times \text{PRCP}_{\text{monthly}} + 1.5 \times \text{SNOW}_{\text{monthly}} + 0.5 \times \overline{\text{SNWD}_{\text{monthly}}} $$

Why monthly? Hydro reservoirs accumulate water over weeks/months, not days!

Normalization

$$ \text{Index}_{\text{norm}} = \frac{\text{clip}(\text{Index}_{\text{raw}}, p_5, p_{95}) - p_5}{p_{95} - p_5} $$

All indices normalized to [0, 1] using robust percentile scaling.

Combined Score

$$ \text{Renewable Score} = \frac{\text{Solar} + \text{Wind} + \text{Hydro}}{3} $$
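
The raw solar and wind formulas and the combined score are simple column arithmetic. A hedged sketch, assuming a wide-format DataFrame with `TAVG`, `PRCP`, `AWND` and `WSF2` columns; the function names are illustrative:

```python
import pandas as pd


def raw_indices(df: pd.DataFrame) -> pd.DataFrame:
    """Compute the raw (un-normalized) solar and wind indices per row."""
    out = pd.DataFrame(index=df.index)
    out["solar_raw"] = df["TAVG"] - df["PRCP"] / 10   # warm and dry scores high
    out["wind_raw"] = (df["AWND"] + df["WSF2"]) / 2   # average of sustained and peak wind
    return out


def renewable_score(solar, wind, hydro):
    """Equal-weight mean of the three normalized indices."""
    return (solar + wind + hydro) / 3
```

For example, plugging in the average normalized values reported earlier (0.435, 0.297 and 0.229) gives a combined score of about 0.32.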

Phase 3: Forecasting Engine (Pattern-Based Intelligence)

We needed forecasts for three time horizons:

30 days (daily operations)
4 months (seasonal planning)
1 year (annual projections)

Core Algorithm: Historical pattern replication

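import pandas as pd

def forecast(city, days_ahead):
    # historical_data: daily index table with 'city', 'date' (datetime),
    # 'Solar', 'Wind' and 'Hydro' columns (column names assumed here)
    history = historical_data[historical_data['city'] == city]

    # Learn monthly patterns from history
    patterns = history.groupby(history['date'].dt.month)[
        ['Solar', 'Wind', 'Hydro']
    ].mean()

    # Apply patterns to future dates
    start = history['date'].max() + pd.Timedelta(days=1)
    future_dates = pd.date_range(start, periods=days_ahead, freq='D')
    # Look up each future day's month; months with no history stay NaN
    prediction = patterns.reindex(future_dates.month)
    prediction.index = future_dates

    return prediction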

Why this works:

Renewable energy is highly seasonal
Solar peaks in summer (long days, clear skies)
Wind peaks in spring/fall (storm systems)
Hydro peaks in spring (snowmelt)

Aggregation Logic:

30 days → Display daily (30 points)
4 months → Aggregate to monthly (4 points) for clarity
1 year → Aggregate to monthly (12 points) for readability
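
The aggregation rule can be sketched as follows, assuming the daily forecast is a DataFrame indexed by date. The helper name `aggregate_for_display` is hypothetical, and grouping by calendar month is an assumption: a horizon that starts mid-month will span one more calendar month than the counts listed above.

```python
import pandas as pd


def aggregate_for_display(daily_forecast: pd.DataFrame, days_ahead: int) -> pd.DataFrame:
    """Return daily points for short horizons, monthly means for longer ones."""
    if days_ahead <= 30:
        return daily_forecast
    # Group the daily rows by calendar month and average each index
    months = daily_forecast.index.to_period("M")
    return daily_forecast.groupby(months).mean()
```

Averaging keeps the long-horizon charts readable without changing the underlying daily forecast.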

Result: Fast, interpretable forecasts with no model training required!

Phase 4: Visualization Dashboard (Making It Accessible)

Goal: Anyone should be able to explore renewable energy potential across Canada.

Technology Choice: Static HTML + Plotly.js

Why?

No server required (portable)
Works offline
Fast loading
Universal compatibility

Implementation:

import json

# Generate dashboard
def generate_html_dashboard():
    # Pre-compute ALL forecasts for ALL cities.
    # json.dumps only accepts plain dicts, lists, strings and numbers,
    # so each forecast must already be converted (see Challenge 3).
    forecasts = {}
    for city in cities:
        forecasts[city] = {
            '30d': forecast(city, 30),
            '4m': forecast(city, 120),
            '1y': forecast(city, 365)
        }

    # Embed in HTML with JavaScript
    html = f"""
    <script>
        const allData = {json.dumps(forecasts)};

        function updateCharts() {{
            const city = document.getElementById('city').value;
            const period = document.getElementById('period').value;
            const data = allData[city][period];

            // Update 3 Plotly charts
            plotEnergyComparison(data);
            plotOverallTrend(data);
            plotBreakdown(data);
        }}
    </script>
    """
    return html

Key Configuration:

{
    x: dates,
    y: values,
    mode: 'lines+markers',
    line: { 
        shape: 'linear',  // No interpolation
        width: 3 
    },
    connectgaps: false,  // DON'T connect missing data!
    marker: { size: 8 }
}

Features:

3 dropdowns (Province, City, Forecast Period)
3 interactive charts (Plotly.js)
4 metric cards (averages)
Responsive design (mobile-friendly)
Zero dependencies (self-contained)

Result: A dashboard that works anywhere - laptop, tablet, phone, even a USB drive!

🚧 Challenges We Faced

Challenge 1: The Hydro Index Mystery 🧐

Problem: Initial hydro index showed ~0.01 for all cities (effectively zero).

Why it happened: We calculated hydro from daily precipitation, but 88% of days have 0 mm of rain!

Debug process:

# Checked data distribution
print(df['PRCP'].describe())
# count    95848
# mean     0.68 mm
# 50%      0.00 mm  ← MEDIAN IS ZERO!
# 75%      0.00 mm  ← 75th percentile ALSO ZERO!

Solution: Realized hydro needs monthly cumulative data (like real reservoirs). Implemented monthly aggregation.

Time cost: 4 hours of debugging and research.

Lesson: When results don't match reality, check your assumptions about the domain!

Challenge 2: XGBoost Wouldn't Install on macOS 😤

Error:

XGBoostError: Library not loaded: @rpath/libomp.dylib
Reason: no such file

Why: XGBoost requires OpenMP for parallel processing, not included in macOS by default.

Solutions tried:

pip uninstall/reinstall xgboost ❌
Update Python ❌
Try conda environment ❌

What worked:

brew install libomp

Time cost: 2 hours of Stack Overflow diving.

Lesson: Platform-specific dependencies are a pain. Always document setup steps!

Challenge 3: JSON Serialization Hell 🔥

Problem: Generating HTML dashboard crashed with:

TypeError: Object of type Timestamp is not JSON serializable

Why: Pandas timestamps aren't JSON-compatible!

Initial attempts:

# Tried 1: Convert to string
df['date'] = df['date'].astype(str)  # Still nested timestamps!

# Tried 2: to_dict()
data = df.to_dict('records')  # Still has Period objects!

Solution: Manual conversion:

result = []
for _, row in df.iterrows():
    result.append({
        'period': row['date'].strftime('%Y-%m-%d'),  # Explicit string
        'Solar': round(float(row['Solar']), 3),      # Explicit float
        'Wind': round(float(row['Wind']), 3),
        'Hydro': round(float(row['Hydro']), 3)
    })

Time cost: 1.5 hours.

Lesson: When serializing for web, be explicit about types. Don't trust automatic conversions.
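
An alternative that stays explicit about types is a `default` hook for `json.dumps`, which is only called for objects the encoder cannot handle on its own. This is a sketch of that approach, not what the project shipped; the function name `json_default` is an assumption.

```python
import json

import numpy as np
import pandas as pd


def json_default(obj):
    """Convert pandas/NumPy objects that the json module cannot serialize."""
    if isinstance(obj, pd.Timestamp):
        return obj.strftime("%Y-%m-%d")
    if isinstance(obj, pd.Period):
        return str(obj)
    if isinstance(obj, np.generic):
        return obj.item()  # NumPy scalar -> plain Python number
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


payload = {
    "date": pd.Timestamp("2024-01-01"),
    "month": pd.Period("2024-01", freq="M"),
    "count": np.int64(3),
}
encoded = json.dumps(payload, default=json_default)
```

Note that the hook applies to values only; dictionary keys still have to be plain strings or numbers before encoding.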

Challenge 4: Data Gaps - To Connect or Not to Connect? 🤔

Problem: Some cities (like Cranbrook) have sparse data - data for Jan, then nothing until May.

Initial approach: Let Plotly connect the lines (looked prettier).

Realization: We're showing a false trend! There's no data Feb-Apr, so we shouldn't imply continuous measurements.

Ethical dilemma: Should we:

Make graphs look nice (smooth lines)?
Show truth (gaps visible)?

Decision: Truth > aesthetics. Implemented strict gap handling.

Code:

# Filter months with <3 days
monthly = monthly[monthly['day_count'] >= 3]

# Drop any remaining NaN
monthly = monthly.dropna()

# JavaScript config
connectgaps: false  // Key!

Time cost: 3 hours of discussion and implementation.

Lesson: In scientific applications, honesty is paramount. Show the limitations of your data.

Challenge 5: Long Format → Wide Format Conversion 🔄

Problem: Input CSV had up to 8 rows per station-date combination, one per observation type:

CA001,2024-01-01,PRCP,5.2
CA001,2024-01-01,TAVG,-2.1
...8 rows total...

Why it's hard: Pivot tables can have multiple values per cell (duplicate dates).

Solution:

df_wide = df.pivot_table(
    index=['station', 'date', 'latitude', 'longitude', 
           'elevation', 'city'],
    columns='observation',
    values='value',
    aggfunc='first'  # Take first value if duplicates
).reset_index()

Gotcha: Had to preserve metadata columns (city, lat/lon) in the index!
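
A toy version of the pivot makes the behaviour easy to verify. The first three rows reuse the sample values shown earlier; the fourth row is an invented duplicate added purely to show what `aggfunc='first'` does, and the metadata columns are left out for brevity.

```python
import pandas as pd

# Long format: one row per (station, date, observation)
long_df = pd.DataFrame({
    "station": ["CA001"] * 4,
    "date": ["2024-01-01"] * 4,
    "observation": ["PRCP", "TAVG", "SNOW", "PRCP"],
    "value": [5.2, -2.1, 0.0, 9.9],  # last row: made-up duplicate PRCP reading
})

wide_df = long_df.pivot_table(
    index=["station", "date"],
    columns="observation",
    values="value",
    aggfunc="first",  # the duplicate 9.9 is ignored, 5.2 is kept
).reset_index()
wide_df.columns.name = None  # drop the leftover 'observation' axis label
```

The four long rows collapse into a single wide row with one column per observation type.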

Time cost: 2 hours.

Lesson: Data transformation is often 50% of the work. Never underestimate cleaning time!

💻 Built With

Languages

Python 3.13 - Core data processing and analysis
- Chosen for: Rich data science ecosystem
- Used for: ETL pipeline, feature engineering, forecasting
JavaScript (ES6) - Frontend interactivity
- Chosen for: Universal browser support
- Used for: Dashboard controls, chart updates
HTML5 / CSS3 - Structure and styling
- Chosen for: Static deployment capability
- Used for: Dashboard layout, responsive design
Markdown / LaTeX - Documentation
- Chosen for: Clear technical writing with math support
- Used for: All project documentation

Frameworks & Libraries

Data Processing

pandas 2.x - DataFrame operations, pivoting, aggregation
numpy 1.x - Numerical operations, array math
scikit-learn 1.x - Normalization, percentile scaling

Machine Learning (Advanced Models)

Prophet 1.1+ - Time-series forecasting with seasonality
- Facebook's forecasting library
- Used for: Trend decomposition, confidence intervals
XGBoost 3.x - Gradient boosting regression
- Chosen for: Non-linear pattern learning
- Used for: Feature importance analysis, ensemble models

Visualization

Plotly.js 2.27 - Interactive JavaScript charts
- Chosen for: No backend required, rich interactivity
- Used for: All dashboard charts (line, area, bar)
Streamlit 1.x - Python web app framework (optional advanced UI)
- Chosen for: Rapid prototyping
- Used for: Model comparison interface

Data Source

GHCN (Global Historical Climatology Network)
- Source: NOAA National Centers for Environmental Information
- Coverage: 233 Canadian weather stations
- Variables: Temperature, precipitation, snow, wind
- Time range: 2022-2024
- Format: CSV (long format, converted to wide)

Development Tools

Git / GitHub - Version control and collaboration
VS Code - Primary IDE
Cursor - AI-assisted coding
Jupyter Notebooks - Exploratory analysis
Python venv - Dependency isolation

Deployment & Hosting

Static HTML - Primary deployment method
- No server required
- Works offline
- Can be hosted on:
- GitHub Pages (free)
- Netlify (free)
- Local filesystem
- USB drive (for offline demos)
Streamlit Cloud (optional) - For advanced ML demo
- Cloud-based Python app hosting
- Access to Prophet/XGBoost models

Key Technical Decisions

Why Static HTML over Web Framework?

Considered:

Flask (Python backend)
React (JavaScript frontend)
Streamlit (Python rapid prototyping)

Chose HTML because:

✅ No server = No maintenance
✅ Universal compatibility (works everywhere)
✅ Instant loading (no API calls)
✅ Easy demo (just open file)
✅ Zero dependencies

Trade-off: Pre-compute all forecasts (larger file size, ~5MB), but worth it for portability.

Why Plotly.js over D3.js?

Considered:

D3.js (maximum customization)
Chart.js (lightweight)
Plotly.js (interactive, full-featured)

Chose Plotly.js because:

✅ Built-in interactivity (hover, zoom, pan)
✅ Professional-looking defaults
✅ connectgaps: false for honest gaps
✅ Responsive without extra code
✅ Single CDN include

Why Pattern-Based Forecasting over Pure ML?

Considered:

ARIMA (statistical)
Prophet (ML time-series)
XGBoost (ML regression)
LSTM (deep learning)

Chose pattern-based for HTML dashboard because:

✅ No model training needed
✅ Works in pure JavaScript
✅ Interpretable (users understand "monthly average")
✅ Fast (pre-computed)
✅ Seasonal patterns are strong (good enough!)

Note: We built Prophet/XGBoost models too, available in Streamlit app for advanced users.

Why Percentile Normalization over Min-Max?

Math:

Min-Max (traditional): $$ x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)} $$

Problem: Outliers compress 99% of data to tiny range.

Percentile (robust): $$ x_{\text{norm}} = \frac{\text{clip}(x, p_5, p_{95}) - p_5}{p_{95} - p_5} $$

Benefit: Outliers clipped, normal values spread across [0, 1].

Result: All three indices visible (not dominated by wind/solar).

Dependencies (requirements.txt)

pandas==2.2.0
numpy==1.26.3
scikit-learn==1.4.0
prophet==1.1.5
xgboost==3.0.0
streamlit==1.31.0
plotly==5.18.0

Installation:

pip install -r requirements.txt

Compatibility:

Python 3.9+
macOS: Requires brew install libomp for XGBoost
Windows/Linux: Works out of the box

Project Structure

Data-Jam/
├── data/
│   ├── cleaned_data_with_city_filled.csv    # Input (long format)
│   ├── processed_wide_format.csv            # Transformed (wide format)
│   └── processed_indices.csv                # Final (with indices)
│
├── src/
│   ├── data_processing.py                   # AI inference
│   └── compute_indices.py                   # Feature engineering
│
├── models/
│   ├── prophet_model.py                     # Time-series forecasting
│   ├── xgboost_model.py                     # Gradient boosting
│   └── ensemble_model.py                    # Combined models
│
├── generate_html_dashboard.py               # Dashboard generator
│
├── web/
│   └── dashboard.html                       # Final product (5MB)
│
├── app.py                                   # Streamlit app (advanced)
│
├── requirements.txt                         # Python dependencies
├── TECHNICAL_REPORT.md                      # Full methodology
├── QUICK_REFERENCE.md                       # Summary
├── DATA_FLOW_DIAGRAM.md                     # Visual pipeline
└── PROJECT_STORY.md                         # This file!

Mathematical Foundations

Our project leverages several mathematical concepts:

Linear Algebra

Matrix operations for data transformation (pivot tables)
Vector operations for index calculations

Statistics

Percentile calculation: $p_k = \text{value at } k\% \text{ of sorted data}$
Mean aggregation: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
Standard deviation for outlier detection

Time Series Analysis

Seasonal decomposition: $Y_t = T_t + S_t + R_t$
- $T_t$ = Trend component
- $S_t$ = Seasonal component
- $R_t$ = Residual component

Optimization

Gradient boosting (XGBoost): $$ \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta \cdot f_t(x_i) $$ where $\eta$ is the learning rate and $f_t$ is the tree added at step $t$

Geographic Models

Distance from latitude: $d = |\phi - \phi_0|$
Elevation effects: Linear coefficient models

🎯 Impact & Future Work

Real-World Applications

Energy Investment Planning
- Identify optimal locations for solar/wind/hydro projects
- Estimate ROI based on seasonal patterns
- Risk assessment via historical variability
Grid Management
- Predict renewable availability for load balancing
- Plan backup power requirements
- Optimize energy storage deployment
Climate Research
- Track long-term changes in renewable potential
- Assess climate change impact on energy systems
- Inform policy decisions

Future Enhancements

Real-Time Data Integration
- Connect to live weather APIs
- Daily forecast updates
- Alert system for extreme events
Advanced ML Models
- LSTM neural networks for sequence learning
- Transfer learning across similar cities
- Uncertainty quantification
Economic Analysis
- Cost-benefit calculator
- Payback period estimation
- Carbon offset calculations
Expanded Coverage
- Include all of North America
- Add offshore wind potential
- Geothermal resource mapping

🏆 What Makes This Special

AI-Driven Inference - Doesn't just discard incomplete data; intelligently fills gaps
Domain-Informed Features - Monthly hydro aggregation reflects real infrastructure
Honest Visualization - Shows data quality transparently, no artificial smoothing
Multiple Time Scales - Short-term operations + long-term planning
Universal Accessibility - Works on any device, no installation required
Open Source - Full methodology documented, reproducible results

Team ClimaZoneAI | SFU DataJam 2025

Empowering Canada's renewable energy transition through data-driven insights.

Built With

css3
html5
javascript
pandas
plotly.js
prophet
python
rstudio
scikit-learn
streamlit
xgboost
