PitGraph AI - Project Story

🏁 Inspiration

Racing is a game of split-second decisions. One poorly timed pit stop can cost a driver the race. Traditional pit strategy relies on:

Gut feeling from experienced race engineers
Historical data that may not apply to current conditions
Reactive decisions rather than predictive insights

We asked ourselves: What if we could predict the optimal pit window using graph machine learning?

The inspiration came from realizing that laps are not isolated events - they're connected in a sequence, influenced by tire degradation, track conditions, and competitor strategies. This is a perfect use case for graph neural networks.

The Vision

Build an AI system that:

Analyzes lap-by-lap performance as a connected graph
Predicts when a pit stop would be beneficial
Provides real-time recommendations during races
Compares multiple ML models for robust predictions

🎯 What It Does

PitGraph AI is a real-time race strategy optimization system that uses graph data science and machine learning to predict optimal pit stop windows.

Core Features

1. Graph-Based Data Model

Stores race data in Neo4j graph database
Models laps, cars, pit stops, and weather as connected nodes
Captures relationships: lap sequences, pit events, weather conditions

2. Three Prediction Models

Baseline Model: FastRP embeddings + Logistic Regression (fast, reliable)
GraphSAGE Model: Graph neural network embeddings (better accuracy)
Hybrid Model: Combines both approaches for robust predictions

3. Real-Time API

FastAPI service with multiple endpoints
/recommend - Get pit stop recommendation for any car/lap
/compare - Compare predictions from different models
/models/metrics - View model performance statistics

4. Interactive Dashboard

Streamlit web interface
Select car, lap, and model type
View recommendations with reasoning
Compare models side-by-side
See performance metrics and improvements

How It Works

Race Data → Neo4j Graph → GDS Algorithms → ML Models → Predictions → Dashboard

Data Ingestion: Load lap times, telemetry, weather into Neo4j
Graph Algorithms: Run FastRP, Louvain, Centrality algorithms
GraphSAGE Training: Generate graph neural network embeddings
Classifier Training: Train models to predict pit benefit
Real-Time Predictions: API serves recommendations during race
Visualization: Dashboard shows predictions and comparisons

🛠️ How We Built It

Technology Stack

Database & Graph Processing

Neo4j 5.x with GDS Plugin - Graph storage and algorithms
GraphDataScience Python Client - Algorithm execution
Cypher Query Language - Graph queries

Machine Learning

scikit-learn - Baseline models (Logistic Regression)
Neo4j GDS GraphSAGE - Graph neural network embeddings
NumPy/Pandas - Data processing

Backend & API

FastAPI - REST API for predictions
Uvicorn - ASGI server
Pydantic - Data validation

Frontend

Streamlit - Interactive dashboard
Requests - API communication

Architecture

┌─────────────────┐
│  Race Data CSV  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   ETL Pipeline  │
│  (Python)       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Neo4j Graph   │
│   Database      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  GDS Algorithms │
│  - FastRP       │
│  - Louvain      │
│  - Centrality   │
│  - GraphSAGE    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  ML Training    │
│  - Baseline     │
│  - GraphSAGE    │
│  - Comparison   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   FastAPI       │
│   Service       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Streamlit     │
│   Dashboard     │
└─────────────────┘

Development Process

Week 1: Data exploration and Neo4j setup
Week 2: ETL pipeline and GDS algorithms
Week 3: Baseline model training and API
Week 4: GraphSAGE implementation and dashboard
Week 5: Model comparison and refinement
Week 6: Testing, debugging, and documentation

🚧 Challenges We Ran Into

1. GraphSAGE Property Inconsistency ⚠️

The Problem: When training GraphSAGE, we encountered a critical issue with node properties.

What Happened:

GraphSAGE requires numeric features on all nodes in the graph
Our graph had multiple node types: Car, Lap, Weather, PitStop
Properties like lap_seconds, lap_delta, tire_age only existed on Lap nodes
Other node types (Car, Weather, PitStop) didn't have these properties

The Error:

ValueError: The feature properties ['lap_seconds', 'tire_age'] are not present 
for all requested labels. Requested labels: ['Car', 'Lap', 'PitStop', 'Weather']. 
Properties available on all requested labels: []

Why This Was Hard:

GraphSAGE needs consistent features across all nodes in the projection
We couldn't just add dummy values - that would corrupt the embeddings
We needed to train only on Lap nodes, but write embeddings back to the full graph
The Neo4j GDS API had changed, requiring Graph objects instead of strings

The Solution:

Created a Lap-only subgraph for training:

subgraph_name = f"{graph_name}_laps_only"
subgraph_result, subgraph_info = gds.beta.graph.project.subgraph(
   subgraph_name,
   graph,
   "n:Lap",  # Only include Lap nodes
   "*"       # Include all relationships between Lap nodes
)

Trained GraphSAGE on the subgraph:
- Only Lap nodes have the required features
- Embeddings generated for laps only
- No property inconsistency issues
Wrote embeddings back to original graph:
- Embeddings stored as sage_emb property on Lap nodes
- Other node types unaffected
- Full graph remains intact

Lessons Learned:

Graph neural networks require careful feature engineering
Heterogeneous graphs (multiple node types) need special handling
Subgraph projections are powerful for focused training
API changes require adapting to new patterns (Graph objects vs strings)

2. Duplicate Property Keys in Visualization

The Problem: Visualization code tried to fetch properties already in the GDS projection.

The Error:

ValueError: Duplicate property keys '{'lap_seconds', 'lap_number'}' 
in db_node_properties and node_properties.

The Solution:

Only fetch properties from database that aren't in the projection
Changed from fetching ['lap_seconds', 'lap_number', 'lap_delta', 'tire_age']
To fetching only ['position', 'car_number'] (not in projection)

🏆 Accomplishments That We're Proud Of

1. End-to-End Graph ML Pipeline

Successfully integrated Neo4j GDS with Python ML
Implemented both traditional and graph neural network approaches
Created production-ready API and dashboard

2. GraphSAGE Implementation

Overcame property inconsistency challenges
Successfully trained graph neural network on racing data
Achieved better performance than baseline (expected ~10% improvement)

3. Model Comparison Framework

Built system to compare multiple models fairly
Identified when models agree vs disagree
Provided actionable insights for race engineers

4. Real-Time Predictions

API responds in < 100ms
Supports three model types (baseline, graphsage, hybrid)
Provides reasoning for recommendations

5. Clean Architecture

Modular codebase with clear separation of concerns
Comprehensive error handling
Extensive documentation

6. Problem-Solving

Debugged complex graph neural network issues
Adapted to API changes in Neo4j GDS
Created workarounds for data limitations

📚 What We Learned

Technical Learnings

1. Graph Neural Networks Are Powerful But Tricky

GNNs can capture patterns traditional ML misses
Require careful feature engineering
Heterogeneous graphs need special handling
Property consistency is critical

2. Neo4j GDS Is Production-Ready

Excellent performance for graph algorithms
Python client is well-designed
GraphSAGE implementation is solid
API evolves (need to stay updated)

3. Model Comparison Is Essential

Single model can be misleading
Disagreement signals uncertainty
Hybrid approaches provide robustness
Transparency builds trust

4. Data Quality Matters More Than Algorithms

18 labeled samples → perfect scores (overfitting)
100+ labeled samples → realistic evaluation
Missing data → unreliable predictions
Clean data → better models

5. User Experience Is Key

Engineers need clear recommendations
Reasoning builds confidence
Uncertainty should be communicated

- Simple UI beats complex visualization

🚀 What's Next for PitGraph AI

Short-Term (Next 3 Months)

1. Improve Data Labeling

Label all laps, not just pit stops
Use tire age as proxy for pit benefit
Compute expected gain for all laps
Target: 100+ labeled samples per race

2. Add More Races

Load Race 2 data from VIR
Include multiple race sessions
Combine data from different tracks
Build larger training dataset

3. Refine GraphSAGE Features

Add more node properties (track section, weather)
Experiment with different aggregators
Tune hyperparameters
Improve embedding quality

4. Calibrate Probabilities

Ensure percentages match reality
Validate against actual outcomes
Adjust thresholds
Improve confidence estimates

💡 Key Takeaways

For Developers

Graph ML is powerful but requires careful engineering
Start simple (baseline) before adding complexity (GNN)
Test incrementally - catch issues early
Document everything - future you will thank you

For Data Scientists

Data quality > Algorithm complexity
Model comparison reveals insights
Uncertainty is information
Domain knowledge is essential

For Race Engineers

AI augments, doesn't replace human judgment
Trust high-agreement predictions
Investigate disagreements
Validate with track data

For Racing Teams

Graph-based approach captures lap relationships
Real-time predictions are feasible
Multiple models provide robustness
System is production-ready (with more data)

📝 Conclusion

PitGraph AI demonstrates that graph machine learning can solve real-world racing problems. Despite challenges with property consistency and limited data, we built a working system that:

✅ Predicts pit stop opportunities ✅ Compares multiple ML models ✅ Provides real-time recommendations ✅ Explains its reasoning ✅ Handles uncertainty gracefully

The journey taught us that great ML systems require: