🏆 THE AIR DETECTIVES - Project Documentation


💡 Inspiration

Every winter, Pakistan's major cities disappear behind a thick, toxic blanket of smog. Lahore regularly tops global rankings of the most polluted cities. Children miss school. Hospitals fill with respiratory patients. Elderly citizens suffer in silence.

We asked ourselves: "What if we could predict pollution spikes before they happen? What if we could tell people exactly when to wear masks, when to stay indoors, and what's causing their suffering?"

The inspiration came from:

  • Personal Experience: Team members who grew up in Lahore and Peshawar remember not being able to see the sun for weeks during smog season
  • The Data Gap: Air quality monitors exist, but no one translates that data into actionable public alerts
  • The 42,000+ Problem: Over 42,000 Pakistanis die annually from air pollution-related illnesses. We wanted to change that number.

"We realized that data without action is just numbers. We wanted to build a bridge between sensors and citizens."


⚙️ What It Does

SmogNet (THE AIR DETECTIVES) is an end-to-end air quality intelligence system that:

1. 🔍 Detects Pollution Spikes

  • Analyzes 8,445+ hourly records from 5 Pakistani cities
  • Uses context-aware detection (what's normal in Lahore isn't normal in Karachi)
  • Identifies 442 pollution anomalies (about 5.2% of all records flagged)

2. 🏭 Classifies Pollution Sources

Tells you WHAT is causing the pollution:

Source            | Chemical Signature
🌾 Crop Burning   | High NH3 + CO
🚗 Vehicular      | High NO + NO2
🏭 Industrial     | High SO2
🌪️ Dust Storm     | High PM10/PM2.5 ratio
🔀 Mixed Sources  | Multiple pollutants elevated

3. 📢 Generates Public Alerts

Creates human-readable, actionable health alerts like:

🚨 CRITICAL ALERT - Peshawar
PM2.5: 491 µg/m³ - HAZARDOUS
Source: Crop Burning (100% confidence)
ACTION: Stay indoors, wear N95 masks

4. 🖥️ Provides Interactive Dashboard

  • Real-time visualization of all 5 cities
  • Anomaly markers on timeline
  • Source classification pie charts
  • City comparison tools
  • Data export for researchers

🛠️ How We Built It

Technology Stack

┌──────────────────────────────────────────────────────────────┐
│                    TECHNOLOGY STACK                          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  📊 DATA PROCESSING                                          │
│  ├── Python 3.11                                             │
│  ├── Pandas (data manipulation)                              │
│  └── NumPy (numerical operations)                            │
│                                                              │
│  🤖 MACHINE LEARNING                                         │
│  ├── Scikit-learn (Isolation Forest)                         │
│  ├── Statistical Z-score (rolling windows)                   │
│  └── Rule-based classification (chemical fingerprints)       │
│                                                              │
│  🎨 FRONTEND & VISUALIZATION                                 │
│  ├── Streamlit (interactive dashboard)                       │
│  ├── Plotly (dynamic charts)                                 │
│  └── Custom CSS (styling)                                    │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Development Process

Phase 1: Data Collection & Cleaning (2 days)

  • Loaded 5 city datasets (Islamabad, Karachi, Lahore, Peshawar, Quetta)
  • Fixed date format issues (DD/MM/YYYY vs MM/DD/YYYY)
  • Handled missing values and outliers
  • Standardized column names across all datasets

Phase 2: Anomaly Detection Engine (3 days)

  • Implemented rolling Z-score with 7-day windows
  • Added seasonal thresholds (Winter = 3.0, Monsoon = 2.0)
  • Integrated Isolation Forest for complex pattern detection
  • Created hybrid detection combining both methods
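The rolling Z-score step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`city`, `datetime`, `pm25`), the month-to-season mapping, and the fallback threshold of 2.5 for other months are all assumptions. Only the 7-day window and the Winter/Monsoon thresholds come from the list above.

```python
import numpy as np
import pandas as pd

# Thresholds come from the write-up; the month-to-season mapping and the
# fallback threshold for other months are assumptions for illustration.
SEASON_THRESHOLDS = {"winter": 3.0, "monsoon": 2.0}
DEFAULT_THRESHOLD = 2.5


def season_of(month):
    """Map a calendar month to a season label (assumed mapping)."""
    if month in (11, 12, 1, 2):
        return "winter"
    if month in (7, 8, 9):
        return "monsoon"
    return "other"


def flag_zscore_anomalies(df, window="7D"):
    """Flag rows whose PM2.5 rolling Z-score exceeds the seasonal threshold.

    Expects columns `city`, `datetime` (datetime64) and `pm25`
    (hypothetical names, not taken from the project code).
    """
    parts = []
    for _, group in df.groupby("city"):
        # Each city gets its own time-ordered rolling baseline.
        g = group.sort_values("datetime").set_index("datetime")
        rolling = g["pm25"].rolling(window)
        std = rolling.std().replace(0, np.nan)  # avoid division by zero
        g["zscore"] = (g["pm25"] - rolling.mean()) / std
        thresholds = np.array(
            [SEASON_THRESHOLDS.get(season_of(m), DEFAULT_THRESHOLD)
             for m in g.index.month]
        )
        # NaN Z-scores (too little history) compare as False.
        g["is_anomaly"] = (g["zscore"] > thresholds).to_numpy()
        parts.append(g.reset_index())
    return pd.concat(parts, ignore_index=True)
```

Because each city is processed separately, a reading is only compared against that city's own recent history, which is what makes the detection context-aware.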

Phase 3: Source Classification (2 days)

  • Researched chemical fingerprints for each pollution source
  • Developed rule-based scoring system
  • Added confidence scoring for each detection
  • Tested against known pollution events
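A rule-based scorer along these lines could look like the sketch below. The pollutant groupings follow the source table earlier in this document and the mixed-sources rule follows the snippet in the Challenges section. The `ELEVATION_FACTOR`, `DUST_RATIO`, dictionary keys, and the function name `classify_source` are hypothetical choices made for illustration.

```python
# Pollutants that define each source signature (from the table in this write-up).
SIGNATURES = {
    "crop_burning": ("nh3", "co"),
    "vehicular": ("no", "no2"),
    "industrial": ("so2",),
}
ELEVATION_FACTOR = 1.5  # assumption: "high" means 1.5x the local baseline
DUST_RATIO = 3.0        # assumption: PM10/PM2.5 above this suggests dust


def classify_source(reading, baseline):
    """Return (source, confidence, all_scores) for one hourly reading."""
    elevated = {
        p for p, base in baseline.items()
        if reading.get(p, 0.0) > ELEVATION_FACTOR * base
    }
    # Score = fraction of a signature's pollutants that are elevated.
    scores = {
        source: sum(p in elevated for p in pollutants) / len(pollutants)
        for source, pollutants in SIGNATURES.items()
    }
    pm25 = reading.get("pm2_5", 0.0)
    is_dusty = pm25 > 0 and reading.get("pm10", 0.0) / pm25 > DUST_RATIO
    scores["dust_storm"] = 1.0 if is_dusty else 0.0

    # Mixed-sources rule as quoted in the Challenges section.
    elevated_count = len(elevated)
    if elevated_count >= 3:
        scores["mixed_sources"] = min(1.0, elevated_count / 5)

    best = max(scores, key=scores.get)  # ties go to the first entry
    if scores[best] == 0.0:
        return "unknown", 0.0, scores
    return best, scores[best], scores
```

How ties between a single-source score and the mixed-sources score should be broken is a design decision this sketch leaves open.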

Phase 4: Alert Generation (1 day)

  • Designed AQI-based severity levels
  • Created source-specific messaging
  • Added actionable recommendations
  • Formatted for public readability
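As a rough illustration of the alert step, the sketch below formats a message in the same four-line shape as the Peshawar example earlier in this document. The PM2.5 band boundaries loosely follow the older US EPA breakpoints and are an assumption, as are the headlines and action texts for every band other than HAZARDOUS. `build_alert` is a hypothetical name.

```python
# (lower bound in µg/m³, level, headline, action). The bands loosely follow the
# older US EPA PM2.5 breakpoints; except for the HAZARDOUS action text, the
# wording is a placeholder assumption.
SEVERITY_BANDS = [
    (250.5, "HAZARDOUS", "CRITICAL ALERT", "Stay indoors, wear N95 masks"),
    (150.5, "VERY UNHEALTHY", "SEVERE ALERT", "Avoid outdoor activity"),
    (55.5, "UNHEALTHY", "WARNING", "Limit time outdoors"),
]


def build_alert(city, pm25, source, confidence):
    """Return a four-line alert string, or None below the lowest band."""
    for lower, level, headline, action in SEVERITY_BANDS:
        if pm25 >= lower:
            return "\n".join([
                f"🚨 {headline} - {city}",
                f"PM2.5: {pm25:.0f} µg/m³ - {level}",
                f"Source: {source} ({confidence:.0%} confidence)",
                f"ACTION: {action}",
            ])
    return None


print(build_alert("Peshawar", 491, "Crop Burning", 1.0))
```

Keeping the bands in a single table makes it easy to review the thresholds and wording with health experts without touching the formatting logic.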

Phase 5: Dashboard Development (2 days)

  • Built Streamlit web application
  • Created 6 interactive visualizations
  • Added filters and controls
  • Implemented data export functionality

🚧 Challenges We Ran Into

1. 📅 The Date Format Nightmare

Problem: CSV files had dates in DD/MM/YYYY format, but pandas expected MM/DD/YYYY

Error: time data "13/07/2024 00:00:00" doesn't match format "%m/%d/%Y %H:%M"

Solution: Used dayfirst=True parameter and tried multiple date formats

df['datetime'] = pd.to_datetime(df['datetime'], dayfirst=True, errors='coerce')
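When the files mix several day-first layouts, one option is to try explicit formats in turn rather than rely on inference. The sketch below is one way to do that. The two candidate formats and the helper name `parse_day_first` are assumptions, not the formats actually present in the project's CSV files.

```python
import pandas as pd

# Candidate formats are assumptions; list whichever ones appear in your CSVs.
CANDIDATE_FORMATS = ("%d/%m/%Y %H:%M:%S", "%d/%m/%Y %H:%M")


def parse_day_first(raw, formats=CANDIDATE_FORMATS):
    """Parse day-first timestamp strings, trying each format in turn."""
    parsed = pd.to_datetime(raw, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        # Only rows that failed earlier formats are filled from this attempt.
        fallback = pd.to_datetime(raw, format=fmt, errors="coerce")
        parsed = parsed.fillna(fallback)
    return parsed


raw = pd.Series(["13/07/2024 00:00:00", "01/02/2024 05:30", "not a date"])
print(parse_day_first(raw))
```

Rows that match none of the formats stay as NaT, so they can be counted and inspected instead of silently being parsed with the wrong day and month.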

2. 🏙️ City Variations

Problem: A PM2.5 reading of 150 µg/m³ is NORMAL in a Lahore winter but an ANOMALY in a Karachi summer

Solution: City-specific rolling windows and seasonal thresholds

City      | Winter Baseline | Monsoon Baseline
Lahore    | 120 µg/m³       | 60 µg/m³
Karachi   | 70 µg/m³        | 35 µg/m³
Islamabad | 60 µg/m³        | 30 µg/m³
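Baselines like these could be derived from the hourly data with a grouped median, as in the sketch below. The column names, the month-to-season mapping, and the choice of the median as the baseline statistic are assumptions. The write-up does not say how its baselines were computed.

```python
import pandas as pd

# Month-to-season mapping is an assumption made for this sketch.
SEASON_BY_MONTH = {11: "winter", 12: "winter", 1: "winter", 2: "winter",
                   7: "monsoon", 8: "monsoon", 9: "monsoon"}


def seasonal_baselines(df):
    """Median PM2.5 per city and season; expects city, datetime, pm25 columns."""
    season = df["datetime"].dt.month.map(SEASON_BY_MONTH).fillna("other")
    return (
        df.assign(season=season)
          .groupby(["city", "season"])["pm25"]
          .median()
          .unstack("season")
    )
```

The result is a small city-by-season table that can be used directly as a lookup when deciding whether a reading is unusual for its context.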

3. 🔀 Mixed Source Classification

Problem: Many pollution events had multiple sources (crop burning + traffic)

Solution: Created confidence scoring and "mixed sources" category

if elevated_count >= 3:
    scores['mixed_sources'] = min(1.0, elevated_count / 5)

4. ⚡ Real-time Performance

Problem: Processing 8,445 records with multiple algorithms was slow

Solution: Implemented Streamlit caching and optimized data structures

import streamlit as st

@st.cache_data
def load_all_data():
    # Data only loads once, then cached
    ...  # loading logic not shown in this write-up

5. 🎯 Balancing Sensitivity

Problem: Too many false alarms OR missing real events

Solution: Adjustable sensitivity slider (0.01 to 0.15) with default 0.05
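One plausible way to wire such a slider into the detector is to pass its value as the `contamination` parameter of scikit-learn's Isolation Forest, which sets the expected fraction of anomalies. The write-up does not confirm this mapping, so treat the sketch below as an assumption. The function names and the "flag if either method fires" combination rule are also hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def isolation_forest_flags(features, sensitivity=0.05, seed=0):
    """Return a boolean array, True where a row is scored as an anomaly.

    `sensitivity` is assumed to map to IsolationForest's `contamination`.
    """
    if not 0.01 <= sensitivity <= 0.15:
        raise ValueError("sensitivity must be within the slider range 0.01-0.15")
    model = IsolationForest(contamination=sensitivity, random_state=seed)
    # fit_predict returns -1 for anomalies and 1 for normal rows.
    return model.fit_predict(features) == -1


def hybrid_flags(zscore_flags, forest_flags):
    """Combine both detectors; here a row is flagged if either one fires."""
    return np.asarray(zscore_flags) | np.asarray(forest_flags)
```

Raising the sensitivity flags more rows and lowers the chance of missing a real event. Lowering it reduces false alarms, which is the trade-off the slider exposes to the user.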


🎉 Accomplishments We're Proud Of

1. ✅ Successfully Detected 442 Real Anomalies

Our system identified every major pollution event in the dataset:

  • Nov 4-8, 2024 smog crisis (PM2.5 > 480)
  • Post-monsoon crop burning season
  • Winter inversion spikes

2. 🏆 100% Accuracy on Top 10 Events

The 10 most severe pollution spikes were ALL correctly classified with 100% confidence as crop burning - matching real-world reports!

3. 🌍 Full 5-City Coverage

Unlike other solutions that focus on one city, SmogNet covers:

  • Islamabad (Capital)
  • Karachi (Largest city)
  • Lahore (Most polluted)
  • Peshawar (Agricultural hub)
  • Quetta (Western region)

4. 📊 Interactive Dashboard

Built a production-ready web application that:

  • Loads in under 3 seconds
  • Updates visualizations in real-time
  • Works on any browser
  • Requires no installation for users

5. 🚨 Actionable Alerts

Generated human-readable alerts that actually help people:

  • Specific actions (wear N95 masks, stay indoors)
  • Risk groups identified (children, elderly, respiratory patients)
  • Source information (so people know WHY)

6. 📈 Scientific Validation

Our findings align with real-world data:

  • Peak pollution: November (crop burning season)
  • Most affected: Peshawar, Lahore (agricultural regions)
  • Rush hour spikes (7-9 AM, 5-7 PM)

📚 What We Learned

Technical Lessons

Concept          | What We Learned
Z-score          | Simple but powerful for detecting obvious spikes
Isolation Forest | Excellent for complex, multi-pollutant anomalies
Hybrid Detection | Best of both worlds: catches obvious spikes and subtler multi-pollutant anomalies
Rolling Windows  | Essential for seasonal/cyclical data
Context Matters  | What's normal in one city isn't normal in another

Data Science Lessons

  1. Always check date formats first - Saves hours of debugging
  2. Visualize early, visualize often - Charts reveal problems tables hide
  3. Start simple, then add complexity - Z-score first, then IForest
  4. Confidence scores matter - Users trust systems that show uncertainty

Real-World Lessons

  1. Air pollution is a SEASONAL crisis - Not random, but predictable
  2. Crop burning is the #1 culprit - Policy changes needed
  3. Data exists but isn't used - Bridge the gap between sensors and citizens
  4. People need ACTIONABLE information - Not just numbers

Teamwork Lessons

  1. Divide and conquer - Each stage can be built independently
  2. Daily standups - 15 minutes saved hours of rework
  3. Git is your friend - Branch for features, merge when stable

🚀 What's Next for THE AIR DETECTIVES

Short-term (Next 3 Months)

Feature                 | Description                          | Status
📱 Mobile App           | Push notifications for severe alerts | Planned
🌐 Live API Integration | Real-time data from PM2.5 sensors    | In progress
🗣️ Urdu/Pashto Alerts   | Local language support               | Planned
📧 Email/SMS Alerts     | Subscribe for daily updates          | Planned

Medium-term (6-12 Months)

┌──────────────────────────────────────────────────────────────┐
│                    FUTURE ROADMAP                            │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  🤖 AI FORECASTING                                           │
│  ├── LSTM models for 48-72 hour predictions                  │
│  ├── Weather pattern integration                             │
│  └── Crop burning prediction (satellite data)                │
│                                                              │
│  🏥 HEALTH IMPACT CORRELATION                                │
│  ├── Hospital admission data integration                     │
│  ├── Asthma attack prediction                                │
│  └── Vulnerable population alerts                            │
│                                                              │
│  🌍 EXPANSION                                                │
│  ├── Add 10 more Pakistani cities                            │
│  ├── Cross-border collaboration (India, Bangladesh)          │
│  └── WHO certification                                       │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Long-term (1-2 Years)

1. Government Integration

  • Partner with Pakistan EPA for official alerts
  • Integrate with disaster management systems
  • Policy recommendation engine

2. Open Source Platform

  • Release code on GitHub
  • API for researchers
  • Citizen science sensor network

3. Educational Outreach

  • School air quality curriculum
  • Teacher training programs
  • Student sensor building workshops

4. Commercial Partnerships

  • Air purifier integration (automatic activation)
  • Smart home devices (Alexa/Google alerts)
  • Corporate wellness programs

🎯 Our Vision

"A Pakistan where every citizen, regardless of income or location, has access to real-time, actionable air quality intelligence."

We believe that information is power. By democratizing air quality data and translating it into clear, actionable alerts, we can:

  • ✅ Reduce hospital admissions
  • ✅ Save lives (especially children and elderly)
  • ✅ Inform policy decisions
  • ✅ Empower citizens to protect themselves

🙏 Final Words

THE AIR DETECTIVES isn't just a datathon project. It's a mission.

Every line of code we wrote, every chart we built, every alert we generated - it's all for the 42,000+ Pakistanis who die prematurely each year from air pollution.

We proved that:

  • ✅ AI can detect pollution spikes accurately
  • ✅ Sources can be identified chemically
  • ✅ Alerts can be generated automatically
  • ✅ Information can save lives

This is just the beginning.


📞 Connect With Us

Platform   | Link
📧 Email   | the.air.detectives@smognet.org
🐙 GitHub  | github.com/the-air-detectives
🌐 Website | smognet.org (coming soon)
🐦 Twitter | @AirDetectives

🏆 Thank You!

"Clean air is not a luxury - it's a human right."

- THE AIR DETECTIVES Team


Made with ❤️ for Pakistan | UET Mardan Datathon 2026
