SatStack Project: Lessons Learned & Reflections
Project Overview
SatStack is an end-to-end machine learning system that detects real estate development opportunities by fusing satellite imagery analysis with economic indicators. The system processes NASA HLS (Harmonized Landsat and Sentinel-2) data, applies computer vision to detect construction, and combines the results with economic data to rank investment opportunities.
What We Learned
Technical Insights
1. Multi-Modal Data Fusion Complexity
Combining satellite imagery with economic time series data requires careful temporal alignment. Satellite observations arrive irregularly because usable scenes depend on cloud cover, while economic indicators follow fixed reporting schedules. We learned to:
- Build robust temporal interpolation strategies
- Handle missing data gracefully across modalities
- Create feature stores that maintain data lineage
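One simple way to align the two modalities is an as-of join: each satellite observation receives the most recent economic value published on or before its date, and gaps are left explicit. The sketch below is illustrative only. The function names and sample values are hypothetical, and it shows a backward-looking join rather than whatever interpolation the project actually used.

```python
from bisect import bisect_right
from datetime import date


def latest_known(series, when):
    """Return the most recent value in `series` dated on or before `when`.

    `series` is a list of (date, value) pairs sorted by date. Returns None
    when nothing has been published yet, so the caller must handle the gap.
    """
    dates = [d for d, _ in series]
    i = bisect_right(dates, when)
    return series[i - 1][1] if i > 0 else None


def align(observations, indicator):
    """Attach the latest known indicator value to each satellite observation."""
    return [(d, value, latest_known(indicator, d)) for d, value in observations]


# Hypothetical data: irregular cloud-free scenes and a monthly indicator.
scenes = [(date(2021, 1, 9), 0.41), (date(2021, 2, 26), 0.38), (date(2021, 3, 3), 0.35)]
monthly = [(date(2021, 2, 1), 5.7), (date(2021, 3, 1), 5.4)]
aligned = align(scenes, monthly)
```

Looking only backward in time avoids leaking future economic data into a feature, at the cost of leaving early observations without a value.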
2. Scale Challenges in Geospatial ML
Even a small region such as Chapel Hill generates large data volumes, and statewide coverage multiplies them:
- Each HLS scene: ~1GB raw data
- 100m grid cells for North Carolina: ~5 million cells
- Temporal stacks multiply storage needs by observation frequency
- Solution: Stored rasters as Cloud Optimized GeoTIFFs (COGs) and used Dask for distributed processing
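To make the grid arithmetic concrete, the sketch below indexes projected coordinates onto a 100 m grid and counts the cells needed to cover a bounding box. It assumes coordinates are already in a metre-based projection. The function names and the example study area are hypothetical.

```python
import math

CELL_SIZE_M = 100  # grid resolution named in the text


def cell_id(x_m, y_m, cell_size=CELL_SIZE_M):
    """Map projected coordinates in metres to an integer (col, row) cell index."""
    return (math.floor(x_m / cell_size), math.floor(y_m / cell_size))


def cell_count(min_x, min_y, max_x, max_y, cell_size=CELL_SIZE_M):
    """Number of cells needed to cover a bounding box given in metres."""
    cols = math.ceil((max_x - min_x) / cell_size)
    rows = math.ceil((max_y - min_y) / cell_size)
    return cols * rows


# A hypothetical 20 km x 15 km study area: 200 columns by 150 rows.
n_cells = cell_count(0, 0, 20_000, 15_000)
```

Because the count grows with the square of the linear extent, halving the cell size quadruples the number of cells, and each added observation date multiplies storage again.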
3. Ground Truth Acquisition is the Bottleneck
The hardest part wasn't building models but obtaining reliable labels:
- Construction permits lag actual development by months
- Satellite-visible changes don't always indicate development
- Solution: Created a probabilistic labeling framework using multiple weak supervision sources
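A probabilistic label can be built from several noisy sources by adding up their evidence in log-odds space. The sketch below is a minimal naive Bayes combiner, not the project's actual framework. The source accuracies, the prior, and the function name are all assumptions made for illustration, and the sources are treated as independent, which real permit and imagery signals rarely are.

```python
import math

ABSTAIN = None  # a source that has no opinion on a cell


def combine_votes(votes, accuracies, prior=0.01):
    """Combine weak binary votes into an estimated P(development).

    votes: list of 1 (development), 0 (no development), or ABSTAIN.
    accuracies: assumed probability that each source votes correctly.
    prior: assumed base rate of development before seeing any vote.
    """
    log_odds = math.log(prior / (1 - prior))
    for vote, acc in zip(votes, accuracies):
        if vote is ABSTAIN:
            continue
        weight = math.log(acc / (1 - acc))  # evidence carried by this source
        log_odds += weight if vote == 1 else -weight
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical sources: permit record, NDVI drop, impervious-surface change.
label = combine_votes([1, 1, ABSTAIN], accuracies=[0.9, 0.7, 0.8])
```

The resulting probabilities can be used as soft training targets, so a cell flagged by only one weak source contributes less than a cell on which several sources agree.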
4. Ensemble Methods Excel at Uncertainty Quantification
Single models struggled with the heterogeneous nature of development patterns. The ensemble approach provided:
- Better calibration through model averaging
- Natural uncertainty estimates from prediction variance
- Feature importance consensus across different model types
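The averaging and variance ideas above can be sketched in a few lines. The example assumes each model emits a development probability for the same grid cell. The function name and the numbers are made up for illustration.

```python
from statistics import mean, pvariance


def ensemble_predict(probabilities):
    """Return (mean, variance) of one cell's predictions across models."""
    return mean(probabilities), pvariance(probabilities)


# Hypothetical outputs from three models for two grid cells.
agree = ensemble_predict([0.80, 0.82, 0.78])
disagree = ensemble_predict([0.95, 0.40, 0.65])
```

A high variance means the models disagree, which is a cheap signal for routing a cell to manual review. It measures disagreement between models only, so it should not be read as a calibrated confidence interval.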
Architectural Lessons
1. Notebooks → Production is Non-Trivial
Starting with Jupyter notebooks for exploration was valuable, but the transition to production required significant refactoring:
- Extract reusable functions into modules
- Add proper error handling and logging
- Implement configuration management
- Create reproducible environments
2. AWS Service Integration Complexity
Orchestrating multiple AWS services introduced unexpected challenges:
- IAM permission debugging across services
- Network configuration between RDS, Lambda, and SageMaker
- Cost optimization (especially for GPU inference)
- Solution: Infrastructure as Code (Terraform/CDK) became essential
3. Real-Time vs Batch Trade-offs
Initial design aimed for real-time scoring, but we learned:
- Satellite processing is inherently batch-oriented
- Economic data updates are infrequent
- Weekly batch scoring with cached results proved more practical
Challenges Faced
Data Challenges
Cloud Cover in Optical Imagery
- Problem: 60-70% of scenes unusable due to clouds
- Impact: Temporal gaps in change detection
- Mitigation: Implemented multi-temporal compositing and considered SAR data integration
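Multi-temporal compositing fills cloud gaps by combining several dates into one image. The sketch below takes the per-pixel median of cloud-free observations. The median is an assumed choice, since the compositing rule is not specified here, and plain lists stand in for the raster arrays a real pipeline would use.

```python
from statistics import median


def median_composite(stack, cloud_masks):
    """Per-pixel median of the cloud-free observations in a temporal stack.

    stack: list of 2-D lists (one per date) of reflectance values.
    cloud_masks: matching 2-D lists of booleans, True where cloudy.
    Pixels that are cloudy on every date come back as None.
    """
    rows, cols = len(stack[0]), len(stack[0][0])
    composite = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            clear = [img[r][c] for img, mask in zip(stack, cloud_masks) if not mask[r][c]]
            if clear:
                composite[r][c] = median(clear)
    return composite


# Three hypothetical dates over a 1 x 2 pixel area; the second pixel is always cloudy.
stack = [[[0.2, 0.9]], [[0.3, 0.5]], [[0.4, 0.8]]]
masks = [[[False, True]], [[False, True]], [[False, True]]]
composite = median_composite(stack, masks)
```

Pixels that stay cloudy in every scene remain empty, which is where an all-weather source such as SAR would help.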
Coordinate System Nightmares
- Problem: Mixed CRS across data sources (WGS84, State Plane, Web Mercator)
- Impact: Spatial join errors and area calculation mistakes
- Solution: Standardized on EPSG:3857 with careful transformation validation
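One thing worth validating after standardizing on EPSG:3857 is area: Web Mercator is not an equal-area projection, so areas measured in its units are inflated away from the equator. The sketch below uses the spherical Mercator formulas to estimate that inflation. It is an approximation for small features, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def to_web_mercator(lon_deg, lat_deg):
    """Forward spherical Mercator projection, as used by EPSG:3857."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def area_inflation(lat_deg):
    """Approximate factor by which EPSG:3857 inflates small areas at a latitude."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2


def approx_true_area(projected_area_m2, lat_deg):
    """Convert an area measured in EPSG:3857 units to approximate ground area."""
    return projected_area_m2 / area_inflation(lat_deg)
```

At roughly 35.9 degrees north, about the latitude of Chapel Hill, the factor is about 1.5, so a parcel measured directly in EPSG:3857 looks about 50% larger than it is on the ground. Computing areas in an equal-area or local projected CRS avoids the correction entirely.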
API Rate Limits and Quotas
- NASA Earthdata: 1000 requests/minute
- FRED API: 120 requests/minute
- Census API: 500 requests/day per IP address without an API key
- Solution: Implemented exponential backoff and request caching
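Exponential backoff and request caching fit together in one small wrapper. The sketch below is a generic pattern, not the project's client code. `fetch_with_backoff`, its parameters, and the dictionary cache are hypothetical names, and a real client would catch only rate-limit or transient errors rather than every exception.

```python
import random
import time


def fetch_with_backoff(fetch, key, cache, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Return cache[key] if present, otherwise call fetch(key) with retries.

    Waits base_delay * 2**attempt plus random jitter between failed attempts
    and re-raises the last error once max_retries attempts have failed.
    """
    if key in cache:
        return cache[key]
    for attempt in range(max_retries):
        try:
            cache[key] = fetch(key)
            return cache[key]
        except Exception:  # illustration only; narrow this in real code
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

The jitter spreads retries from parallel workers so they do not hit the API again at the same instant, and the cache means a repeated request for the same series costs no quota at all.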
Model Challenges
Class Imbalance
- <1% of grid cells see development in any 6-month period
- Led to models predicting "no development" everywhere
- Solution: Focal loss, SMOTE, and careful stratified sampling
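Focal loss addresses the imbalance by shrinking the loss contribution of examples the model already classifies confidently. The sketch below implements the standard binary form for a single prediction in plain Python. The default `gamma` and `alpha` are the commonly cited values from the original focal loss paper, not settings confirmed for this project.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class (0 < p < 1).
    y: true label, 1 or 0.
    gamma down-weights easy examples; alpha weights the positive class.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the expression reduces to alpha-weighted cross-entropy. With `gamma=2`, a correctly classified cell at `p_t = 0.9` has its loss scaled by `0.1 ** 2 = 0.01`, so the many easy "no development" cells stop dominating the gradient.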
Spatial Autocorrelation
- Development clusters spatially (violates IID assumption)
- Standard cross-validation gave overly optimistic metrics
- Solution: Spatial cross-validation with buffer zones
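Spatial cross-validation with buffers can be sketched as follows: cells are grouped into square blocks, whole blocks are held out together, and training cells that sit too close to any test cell are discarded. The function name, block assignment, and parameters are assumptions for illustration. The pairwise distance check is quadratic, so a real run over millions of cells would need a spatial index.

```python
import math


def spatial_folds(cells, block_size, n_folds, buffer):
    """Yield (train, test) index lists for blocked spatial cross-validation.

    cells: list of (x, y) cell centres in metres.
    Blocks are assigned to folds round-robin, and training cells within
    `buffer` metres of any test cell are dropped.
    """
    def block_of(cell):
        return (math.floor(cell[0] / block_size), math.floor(cell[1] / block_size))

    blocks = sorted({block_of(cell) for cell in cells})
    fold_of_block = {b: i % n_folds for i, b in enumerate(blocks)}
    fold_of_cell = [fold_of_block[block_of(cell)] for cell in cells]
    for fold in range(n_folds):
        test = [i for i, f in enumerate(fold_of_cell) if f == fold]
        train = [
            i for i, f in enumerate(fold_of_cell)
            if f != fold and all(math.dist(cells[i], cells[j]) > buffer for j in test)
        ]
        yield train, test


# Hypothetical strip of twenty 100 m cells split into two 1 km blocks.
cells = [(x * 100 + 50, 50) for x in range(20)]
folds = list(spatial_folds(cells, block_size=1000, n_folds=2, buffer=150))
```

In this example the cell nearest the block boundary is removed from each training set, so the model is never scored on a cell whose immediate neighbour it trained on.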
Temporal Drift
- COVID-19 caused dramatic shifts in development patterns
- Models trained on 2019 data failed in 2020-2021
- Solution: Online learning with periodic retraining
Infrastructure Challenges
Cold Start Latencies
- Lambda functions loading large models took 30+ seconds
- Solution: Container images with provisioned concurrency
Database Performance
- PostGIS spatial queries on millions of polygons were slow
- Solution: Proper indexing (GIST), partitioning by date, and materialized views
Cost Overruns
- Initial design: $5000+/month AWS costs
- Main culprits: NAT gateways, idle SageMaker endpoints, uncompressed S3 storage
- Solution: Spot instances, endpoint auto-scaling, S3 lifecycle policies
What Inspired Us
The Power of Open Data
NASA's commitment to open Earth observation data democratizes capabilities once limited to large corporations. The HLS dataset's analysis-ready format eliminated weeks of preprocessing work.
Cross-Domain Innovation
Combining techniques from different fields yielded unexpected insights:
- Computer vision methods applied to economic forecasting
- Time series analysis enhancing spatial predictions
- Graph neural networks capturing development propagation patterns
Real-World Impact Potential
This system could:
- Help cities plan infrastructure more efficiently
- Enable sustainable development monitoring
- Democratize real estate investment intelligence
- Support climate adaptation planning
The ML Engineering Journey
Building an end-to-end system revealed that:
- Data engineering is 80% of the work
- Simple models with good features outperform complex models with poor data
- System design matters more than algorithm choice
- Monitoring and observability are not optional
Community and Collaboration
The project benefited immensely from:
- Open-source geospatial Python ecosystem (Rasterio, GeoPandas, STAC)
- AWS credits for researchers
- Academic papers sharing implementation details
- Stack Overflow's geospatial community
Key Takeaways
- Start Simple, Iterate Fast: Our MVP with basic NDVI change detection provided value while we built sophisticated models.
- Invest in Data Quality: Time spent on data validation and cleaning paid 10x returns in model performance.
- Design for Failure: Assume every external API will fail and every process will crash.
- Measure Everything: Without comprehensive metrics, we couldn't identify bottlenecks.
- Document Decisions: Future-you will thank present-you for explaining why certain choices were made.
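For reference, the basic NDVI change detection mentioned in the first takeaway can be sketched in a few lines. NDVI is computed from near-infrared and red reflectance, and a large drop between two dates suggests vegetation was cleared. The 0.2 threshold and the function names are hypothetical, and a drop in NDVI can also come from harvest, drought, or seasonal change rather than construction.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    total = nir + red
    return (nir - red) / total if total != 0 else 0.0


def vegetation_loss(before, after, threshold=0.2):
    """Flag a pixel whose NDVI dropped by more than `threshold`.

    before, after: (nir, red) reflectance pairs from two dates.
    """
    return ndvi(*before) - ndvi(*after) > threshold


# Hypothetical pixel: healthy vegetation, then mostly bare ground.
cleared = vegetation_loss(before=(0.5, 0.1), after=(0.3, 0.25))
```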
Future Directions
The project opened several exciting research directions:
- Incorporating SAR data for all-weather monitoring
- Graph neural networks for modeling spatial dependencies
- Active learning for efficient ground truth collection
- Foundation models for satellite imagery understanding
- Real-time processing with streaming architectures
Conclusion
Building SatStack taught us that the intersection of satellite imagery and machine learning is both challenging and rewarding. While technical hurdles were significant, the potential to create actionable intelligence from Earth observation data keeps us motivated. The project reinforced that successful ML systems require not just good algorithms, but thoughtful engineering, careful data management, and constant iteration based on real-world feedback.
The journey from notebook prototype to production system revealed the true complexity of MLOps, but also demonstrated that with modern tools and cloud infrastructure, small teams can build systems that would have required entire departments just a few years ago.
Built With
- amazon-web-services
- api
- apis
- bls
- census
- cloud
- cloudwatch
- cmr-stac
- computer
- computing
- containerization
- core
- dash/plotly
- dashboards
- dask
- data
- database
- demographics
- discovery
- distributed
- docker
- earthdata
- economic
- ensemble
- fastapi
- federal
- fred
- functions
- geospatial
- harmonized
- imagery
- infrastructure
- key
- labor
- lambda
- landsat
- ml
- ml/data
- models
- monitoring
- nasa
- postgis
- postgresql
- processing
- python
- pytorch
- rasterio/geopandas
- rds
- reserve
- rest
- s3
- sagemaker
- satellite
- science
- sentinel-2
- serverless
- services
- spatial
- statistics
- storage
- technologies
- training/inference
- vision
- xgboost