PitGraph AI - Project Story

🏁 Inspiration

Racing is a game of split-second decisions. One poorly timed pit stop can cost a driver the race. Traditional pit strategy relies on:

  • Gut feeling from experienced race engineers
  • Historical data that may not apply to current conditions
  • Reactive decisions rather than predictive insights

We asked ourselves: What if we could predict the optimal pit window using graph machine learning?

The inspiration came from realizing that laps are not isolated events - they're connected in a sequence, influenced by tire degradation, track conditions, and competitor strategies. This is a perfect use case for graph neural networks.

The Vision

Build an AI system that:

  • Analyzes lap-by-lap performance as a connected graph
  • Predicts when a pit stop would be beneficial
  • Provides real-time recommendations during races
  • Compares multiple ML models for robust predictions

🎯 What It Does

PitGraph AI is a real-time race strategy optimization system that uses graph data science and machine learning to predict optimal pit stop windows.

Core Features

1. Graph-Based Data Model

  • Stores race data in Neo4j graph database
  • Models laps, cars, pit stops, and weather as connected nodes
  • Captures relationships: lap sequences, pit events, weather conditions
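
As a rough illustration of this data model, the sketch below turns a few lap rows into Lap nodes and lap-sequence relationships in plain Python. The relationship name `NEXT_LAP`, the helper `build_lap_graph`, and the sample rows are assumptions made up for illustration, not the project's actual schema or data.

```python
from collections import defaultdict


def build_lap_graph(rows):
    """Turn lap rows into node and relationship dicts.

    Each row is a dict with the (assumed) keys car_number, lap_number, lap_seconds.
    """
    nodes = []
    rels = []
    laps_by_car = defaultdict(list)
    for row in rows:
        node_id = f"car{row['car_number']}-lap{row['lap_number']}"
        nodes.append({"id": node_id, "label": "Lap", **row})
        laps_by_car[row["car_number"]].append((row["lap_number"], node_id))
    # Link each recorded lap to the next recorded lap of the same car.
    for laps in laps_by_car.values():
        laps.sort()
        for (_, start), (_, end) in zip(laps, laps[1:]):
            rels.append({"type": "NEXT_LAP", "start": start, "end": end})
    return nodes, rels


# Made-up sample rows, for illustration only.
rows = [
    {"car_number": 7, "lap_number": 1, "lap_seconds": 128.4},
    {"car_number": 7, "lap_number": 2, "lap_seconds": 127.9},
    {"car_number": 12, "lap_number": 1, "lap_seconds": 129.1},
]
nodes, rels = build_lap_graph(rows)
```

Keeping the sequence as explicit relationships is what lets graph algorithms see a lap in the context of its neighbours rather than as an isolated row.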

2. Three Prediction Models

  • Baseline Model: FastRP embeddings + Logistic Regression (fast, reliable)
  • GraphSAGE Model: Graph neural network embeddings (better accuracy)
  • Hybrid Model: Combines both approaches for robust predictions
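
This write-up does not spell out how the hybrid model combines the two approaches. One simple possibility, shown here purely as an assumption, is a weighted average of the two predicted probabilities; the function name `hybrid_probability` and the default weight are hypothetical.

```python
def hybrid_probability(p_baseline, p_graphsage, weight=0.5):
    """Blend two pit-benefit probabilities with a weighted average.

    weight is the share given to the baseline model (assumed default: 0.5).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * p_baseline + (1.0 - weight) * p_graphsage


blended = hybrid_probability(0.2, 0.6)
```

Other combinations (for example, concatenating both embeddings and training one classifier) would fit the description equally well.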

3. Real-Time API

  • FastAPI service with multiple endpoints
  • /recommend - Get pit stop recommendation for any car/lap
  • /compare - Compare predictions from different models
  • /models/metrics - View model performance statistics
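
To give a feel for what a /recommend response might carry, here is a minimal framework-free sketch. The field names, the `PIT` / `STAY_OUT` values, and the 0.5 threshold are all assumptions for illustration; the real endpoint's schema may differ.

```python
def recommend(car_number, lap_number, probability, model="hybrid", threshold=0.5):
    """Build a recommendation payload from a predicted pit-benefit probability."""
    pit_now = probability >= threshold
    side = "at or above" if pit_now else "below"
    reasoning = (
        f"The {model} model estimates a {probability:.0%} chance that pitting "
        f"on lap {lap_number} is beneficial ({side} the {threshold:.0%} threshold)."
    )
    return {
        "car_number": car_number,
        "lap_number": lap_number,
        "model": model,
        "probability": probability,
        "recommendation": "PIT" if pit_now else "STAY_OUT",
        "reasoning": reasoning,
    }


response = recommend(car_number=7, lap_number=12, probability=0.72)
```

Returning the reasoning string alongside the decision keeps the dashboard thin: it only has to display what the API already explains.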

4. Interactive Dashboard

  • Streamlit web interface
  • Select car, lap, and model type
  • View recommendations with reasoning
  • Compare models side-by-side
  • See performance metrics and improvements

How It Works

Race Data → Neo4j Graph → GDS Algorithms → ML Models → Predictions → Dashboard
  1. Data Ingestion: Load lap times, telemetry, weather into Neo4j
  2. Graph Algorithms: Run FastRP, Louvain, Centrality algorithms
  3. GraphSAGE Training: Generate graph neural network embeddings
  4. Classifier Training: Train models to predict pit benefit
  5. Real-Time Predictions: API serves recommendations during race
  6. Visualization: Dashboard shows predictions and comparisons

🛠️ How We Built It

Technology Stack

Database & Graph Processing

  • Neo4j 5.x with GDS Plugin - Graph storage and algorithms
  • GraphDataScience Python Client - Algorithm execution
  • Cypher Query Language - Graph queries

Machine Learning

  • scikit-learn - Baseline models (Logistic Regression)
  • Neo4j GDS GraphSAGE - Graph neural network embeddings
  • NumPy/Pandas - Data processing

Backend & API

  • FastAPI - REST API for predictions
  • Uvicorn - ASGI server
  • Pydantic - Data validation

Frontend

  • Streamlit - Interactive dashboard
  • Requests - API communication

Architecture

┌─────────────────┐
│  Race Data CSV  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   ETL Pipeline  │
│  (Python)       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Neo4j Graph   │
│   Database      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  GDS Algorithms │
│  - FastRP       │
│  - Louvain      │
│  - Centrality   │
│  - GraphSAGE    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  ML Training    │
│  - Baseline     │
│  - GraphSAGE    │
│  - Comparison   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   FastAPI       │
│   Service       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Streamlit     │
│   Dashboard     │
└─────────────────┘

Development Process

  1. Week 1: Data exploration and Neo4j setup
  2. Week 2: ETL pipeline and GDS algorithms
  3. Week 3: Baseline model training and API
  4. Week 4: GraphSAGE implementation and dashboard
  5. Week 5: Model comparison and refinement
  6. Week 6: Testing, debugging, and documentation

🚧 Challenges We Ran Into

1. GraphSAGE Property Inconsistency ⚠️

The Problem: When training GraphSAGE, we encountered a critical issue with node properties.

What Happened:

  • GraphSAGE requires numeric feature properties on every node in the projected graph
  • Our graph had multiple node types: Car, Lap, Weather, PitStop
  • Properties like lap_seconds, lap_delta, tire_age only existed on Lap nodes
  • Other node types (Car, Weather, PitStop) didn't have these properties

The Error:

ValueError: The feature properties ['lap_seconds', 'tire_age'] are not present 
for all requested labels. Requested labels: ['Car', 'Lap', 'PitStop', 'Weather']. 
Properties available on all requested labels: []

Why This Was Hard:

  • GraphSAGE needs consistent features across all nodes in the projection
  • We couldn't just add dummy values - that would corrupt the embeddings
  • We needed to train only on Lap nodes, but write embeddings back to the full graph
  • The Neo4j GDS Python client API had changed: calls now expect Graph objects instead of graph-name strings
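
The check behind this error can be reproduced in a few lines of plain Python. This is only a sketch of the idea, not GDS internals: the property inventory for Car, PitStop, and Weather is hypothetical, and both helper names are made up.

```python
def common_properties(label_properties, requested_labels):
    """Properties that are present on every requested label."""
    sets = [set(label_properties[label]) for label in requested_labels]
    return set.intersection(*sets) if sets else set()


def labels_with_features(label_properties, features):
    """Labels that carry every one of the feature properties."""
    needed = set(features)
    return sorted(
        label for label, props in label_properties.items() if needed <= set(props)
    )


# Hypothetical inventory: only the Lap properties come from the write-up.
label_properties = {
    "Car": ["car_id"],
    "Lap": ["lap_seconds", "lap_delta", "tire_age"],
    "PitStop": ["duration"],
    "Weather": ["air_temp"],
}

shared = common_properties(label_properties, ["Car", "Lap", "PitStop", "Weather"])
trainable = labels_with_features(label_properties, ["lap_seconds", "tire_age"])
```

Here `shared` is empty, which mirrors the "Properties available on all requested labels: []" part of the message, while `trainable` contains only `Lap`.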

The Solution:

  1. Created a Lap-only subgraph for training:

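    # Assumes `gds` is a GraphDataScience client and `graph` is the projected Graph object
    subgraph_name = f"{graph_name}_laps_only"
    # The call returns the new Graph object plus a result summary
    subgraph, subgraph_info = gds.beta.graph.project.subgraph(
        subgraph_name,
        graph,
        "n:Lap",  # Node filter: only include Lap nodes
        "*"       # Relationship filter: keep all relationships between the remaining Lap nodes
    )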
    
  2. Trained GraphSAGE on the subgraph:

    • Only Lap nodes have the required features
    • Embeddings generated for laps only
    • No property inconsistency issues

  3. Wrote embeddings back to the original graph:

    • Embeddings stored as sage_emb property on Lap nodes
    • Other node types unaffected
    • Full graph remains intact

Lessons Learned:

  • Graph neural networks require careful feature engineering
  • Heterogeneous graphs (multiple node types) need special handling
  • Subgraph projections are powerful for focused training
  • API changes require adapting to new patterns (Graph objects vs strings)

2. Duplicate Property Keys in Visualization

The Problem: Visualization code tried to fetch properties already in the GDS projection.

The Error:

ValueError: Duplicate property keys '{'lap_seconds', 'lap_number'}' 
in db_node_properties and node_properties.

The Solution:

  • Only fetch properties from database that aren't in the projection
  • Changed from fetching ['lap_seconds', 'lap_number', 'lap_delta', 'tire_age']
  • To fetching only ['position', 'car_number'] (not in projection)
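
The fix amounts to filtering the requested database properties against what the projection already holds. A minimal sketch, assuming the projection contains the four lap properties listed above (the helper name `db_only_properties` is hypothetical):

```python
def db_only_properties(wanted, projected):
    """Keep the order of `wanted`, dropping anything already in the projection."""
    already_projected = set(projected)
    return [prop for prop in wanted if prop not in already_projected]


# Assumed contents of the GDS projection.
projected = ["lap_seconds", "lap_number", "lap_delta", "tire_age"]
wanted = ["lap_seconds", "lap_number", "position", "car_number"]

db_node_properties = db_only_properties(wanted, projected)
```

Computing the list this way, rather than hard-coding it, keeps the visualization code working if the projection later gains or loses properties.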

🏆 Accomplishments That We're Proud Of

1. End-to-End Graph ML Pipeline

  • Successfully integrated Neo4j GDS with Python ML
  • Implemented both traditional and graph neural network approaches
  • Created production-ready API and dashboard

2. GraphSAGE Implementation

  • Overcame property inconsistency challenges
  • Successfully trained graph neural network on racing data
  • Targeted better performance than the baseline (expected ~10% improvement, still to be confirmed on more labeled data)

3. Model Comparison Framework

  • Built system to compare multiple models fairly
  • Identified when models agree vs disagree
  • Provided actionable insights for race engineers
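
A comparison step like this can be sketched in plain Python as follows. The function name, the 0.5 decision threshold, and the use of the probability spread as a disagreement signal are assumptions for illustration.

```python
def compare_models(probabilities, threshold=0.5):
    """Compare per-model pit-benefit probabilities for one car and lap.

    probabilities maps a model name to its predicted probability.
    """
    if not probabilities:
        raise ValueError("at least one model probability is required")
    decisions = {name: p >= threshold for name, p in probabilities.items()}
    values = list(probabilities.values())
    return {
        "decisions": decisions,
        # True when every model lands on the same side of the threshold.
        "agree": len(set(decisions.values())) == 1,
        # Gap between the most and least confident model.
        "spread": max(values) - min(values),
    }


result = compare_models({"baseline": 0.35, "graphsage": 0.70})
```

In this made-up case the two models fall on opposite sides of the threshold, so `agree` is false and the lap would be flagged for a closer look by the engineer.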

4. Real-Time Predictions

  • API responds in < 100ms
  • Supports three model types (baseline, graphsage, hybrid)
  • Provides reasoning for recommendations

5. Clean Architecture

  • Modular codebase with clear separation of concerns
  • Comprehensive error handling
  • Extensive documentation

6. Problem-Solving

  • Debugged complex graph neural network issues
  • Adapted to API changes in Neo4j GDS
  • Created workarounds for data limitations

📚 What We Learned

Technical Learnings

1. Graph Neural Networks Are Powerful But Tricky

  • GNNs can capture patterns traditional ML misses
  • Require careful feature engineering
  • Heterogeneous graphs need special handling
  • Property consistency is critical

2. Neo4j GDS Is Production-Ready

  • Excellent performance for graph algorithms
  • Python client is well-designed
  • GraphSAGE implementation is solid
  • API evolves (need to stay updated)

3. Model Comparison Is Essential

  • A single model can be misleading
  • Disagreement signals uncertainty
  • Hybrid approaches provide robustness
  • Transparency builds trust

4. Data Quality Matters More Than Algorithms

  • 18 labeled samples → perfect scores (overfitting)
  • 100+ labeled samples → realistic evaluation
  • Missing data → unreliable predictions
  • Clean data → better models
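
One way to see why a perfect score on a tiny labeled set says little is to look at the lower end of a confidence interval on accuracy. The sketch below uses the Wilson score interval and, as a simplifying assumption, treats all n labeled samples as the evaluation set; the write-up does not state the actual split.

```python
import math


def wilson_lower_bound(correct, total, z=1.96):
    """Lower end of the Wilson score interval for an observed accuracy.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    )
    return center - half_width


# Perfect scores on 18 versus 100 samples.
bounds = {n: wilson_lower_bound(n, n) for n in (18, 100)}
```

Under these assumptions a perfect score on 18 samples is compatible with a true accuracy as low as roughly 0.82, while the same score on 100 samples narrows that to roughly 0.96.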

5. User Experience Is Key

  • Engineers need clear recommendations
  • Reasoning builds confidence
  • Uncertainty should be communicated
  • Simple UI beats complex visualization

🚀 What's Next for PitGraph AI

Short-Term (Next 3 Months)

1. Improve Data Labeling

  • Label all laps, not just pit stops
  • Use tire age as proxy for pit benefit
  • Compute expected gain for all laps
  • Target: 100+ labeled samples per race
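
A tire-age proxy label could look something like the sketch below. The threshold of 15 laps, the `pit_benefit` field name, and the helper name are placeholders chosen for illustration; a real threshold would have to come from the race data.

```python
def label_laps_by_tire_age(laps, tire_age_threshold=15):
    """Attach a proxy pit_benefit label (1 or 0) to every lap dict.

    A lap is labeled 1 when its tire_age reaches the (assumed) threshold.
    """
    labeled = []
    for lap in laps:
        label = 1 if lap["tire_age"] >= tire_age_threshold else 0
        labeled.append({**lap, "pit_benefit": label})
    return labeled


# Made-up laps, for illustration only.
laps = [
    {"lap_number": 1, "tire_age": 3},
    {"lap_number": 2, "tire_age": 20},
]
labeled_laps = label_laps_by_tire_age(laps)
```

Because every lap has a tire age, this labels the whole race rather than just the laps where a pit stop actually happened, at the cost of a cruder signal.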

2. Add More Races

  • Load Race 2 data from VIR
  • Include multiple race sessions
  • Combine data from different tracks
  • Build larger training dataset

3. Refine GraphSAGE Features

  • Add more node properties (track section, weather)
  • Experiment with different aggregators
  • Tune hyperparameters
  • Improve embedding quality

4. Calibrate Probabilities

  • Ensure percentages match reality
  • Validate against actual outcomes
  • Adjust thresholds
  • Improve confidence estimates
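
Checking whether "percentages match reality" is commonly done with a reliability table: group predictions into probability bins and compare the mean predicted probability with the observed rate in each bin. A minimal pure-Python sketch, with a hypothetical function name and made-up inputs:

```python
def reliability_table(probabilities, outcomes, n_bins=5):
    """Summarize calibration per equal-width probability bin.

    outcomes are 1 (pitting was beneficial) or 0 (it was not).
    Empty bins are skipped.
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue
        rows.append({
            "bin": (i / n_bins, (i + 1) / n_bins),
            "count": len(members),
            "mean_predicted": sum(p for p, _ in members) / len(members),
            "observed_rate": sum(y for _, y in members) / len(members),
        })
    return rows


# Made-up predictions and outcomes, for illustration only.
table = reliability_table([0.1, 0.15, 0.9, 0.95], [0, 0, 1, 1], n_bins=2)
```

For a well-calibrated model, `mean_predicted` and `observed_rate` stay close in every bin; large gaps indicate where thresholds or probabilities need adjusting.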

💡 Key Takeaways

For Developers

  1. Graph ML is powerful but requires careful engineering
  2. Start simple (baseline) before adding complexity (GNN)
  3. Test incrementally - catch issues early
  4. Document everything - future you will thank you

For Data Scientists

  1. Data quality > Algorithm complexity
  2. Model comparison reveals insights
  3. Uncertainty is information
  4. Domain knowledge is essential

For Race Engineers

  1. AI augments human judgment; it doesn't replace it
  2. Trust high-agreement predictions
  3. Investigate disagreements
  4. Validate with track data

For Racing Teams

  1. Graph-based approach captures lap relationships
  2. Real-time predictions are feasible
  3. Multiple models provide robustness
  4. System is production-ready (with more data)

📝 Conclusion

PitGraph AI demonstrates that graph machine learning can solve real-world racing problems. Despite challenges with property consistency and limited data, we built a working system that:

  ✅ Predicts pit stop opportunities
  ✅ Compares multiple ML models
  ✅ Provides real-time recommendations
  ✅ Explains its reasoning
  ✅ Handles uncertainty gracefully

The journey taught us that great ML systems require:

  • Solid engineering (handle edge cases)
  • Domain knowledge (understand racing)
  • User focus (clear recommendations)
  • Iterative development (start simple, add complexity)
  • Persistence (debug the hard problems)

PitGraph AI is ready for the next lap! 🏁
