Smart Revenue Recognition and Churn Prediction for SaaS
Inspiration
The SaaS industry faces critical challenges in understanding customer behavior and predicting revenue patterns. Many companies struggle with:
- Inability to identify at-risk customers before they churn
- Lack of actionable insights from subscription data
- Complex data pipelines requiring technical expertise to query
- Manual processes for revenue forecasting and customer segmentation
We were inspired to build an intelligent system that democratizes data analytics, allowing business users to interact with complex datasets using natural language while providing accurate churn predictions backed by explainable AI.
What it does
Smart Revenue Recognition and Churn Prediction is an end-to-end AI-powered analytics platform that:
Core Capabilities:
- Predicts customer churn probability using machine learning with 85% accuracy
- Provides explainable AI insights using SHAP values to understand prediction drivers
- Enables natural language querying of BigQuery databases without SQL knowledge
- Automates data ingestion from Google Cloud Storage to BigQuery using Fivetran SDK
- Offers three prediction modes: manual input, CSV batch upload, and real-time simulation
- Features augmented analytics with 14 pre-built intent handlers for common business questions
- Includes NL-to-SQL RAG system with 16 query templates for flexible data exploration
Technical Features:
- Processes 5,000+ subscription records automatically
- Generates interactive visualizations (bar charts, pie charts, histograms, line graphs)
- Stores predictions in BigQuery for historical tracking and trend analysis
- Provides confidence scores for all predictions and query interpretations
- Exports results to CSV for further analysis
How we built it
Architecture Stack:
- Frontend: Streamlit web application with custom blue theme
- Backend: Python with scikit-learn for machine learning
- Data Processing: pandas, numpy for data transformation
- Visualization: Plotly for interactive charts, matplotlib for SHAP plots
- Database: Google BigQuery for data warehousing
- Storage: Google Cloud Storage for raw data files
- Data Pipeline: Custom Fivetran SDK connector
- ML Model: RandomForestClassifier with hyperparameter tuning
- Explainability: SHAP (SHapley Additive exPlanations)
- NLP: TF-IDF vectorization with cosine similarity for query matching
- Deployment: Docker containerization, Google Cloud Run ready
Development Process:
- Data pipeline setup: Built custom Fivetran connector to move CSV files from GCS to BigQuery
- Feature engineering: Created 16 features from subscription data (MRR, ARR, tenure, usage patterns)
- Model training: Trained RandomForest classifier on 1,000 labeled examples
- RAG implementation: Built two separate RAG systems for analytics Q&A and SQL generation
- UI development: Created intuitive Streamlit interface with multiple interaction modes
- Integration: Connected all components with BigQuery as central data warehouse
- Testing: Validated predictions, query accuracy, and end-to-end data flow
Key Technical Decisions:
- Used TF-IDF instead of embeddings for faster query matching without API dependencies
- Implemented SHAP for model explainability to meet regulatory requirements
- Chose RandomForest for balance between accuracy and interpretability
- Built custom Fivetran connector for full control over data transformation
- Structured as modular components for maintainability and scalability
Challenges we ran into
Technical Challenges:
NumPy Version Compatibility
- Issue: scikit-learn models failed to load due to NumPy 2.x incompatibility
- Solution: Downgraded to NumPy 1.26.4 and retrained all models with version pinning
SHAP Array Dimension Handling
- Issue: SHAP explainer returned inconsistent array shapes across different scenarios
- Solution: Built robust dimension checking and safe indexing for all SHAP value formats
BigQuery Schema Management
- Issue: Prediction table schema mismatches between local model and database
- Solution: Implemented dynamic schema detection and automatic table creation
NL-to-SQL Query Ambiguity
- Issue: Natural language queries can map to multiple SQL patterns
- Solution: Implemented confidence scoring and alternative suggestion system
Real-time Prediction Performance
- Issue: SHAP explanations caused 5+ second latency for single predictions
- Solution: Optimized SHAP computation and added caching for repeated queries
Data Challenges:
Empty Prediction Table
- Issue: Analytics features failed when no predictions existed in BigQuery
- Solution: Created sample data loader and graceful fallback mechanisms
CSV File Schema Variations
- Issue: GCS bucket contained files with inconsistent column names
- Solution: Built flexible schema mapper in Fivetran connector
Deployment Challenges:
Streamlit Configuration for Cloud Run
- Issue: Configuration optimized for local development failed in containerized environment
- Solution: Created separate config profiles for development and production
Authentication Management
- Issue: Service account keys accidentally committed to git
- Solution: Implemented proper gitignore patterns and key rotation procedures
Accomplishments that we're proud of
Technical Achievements:
- Built production-ready ML pipeline with 85% accuracy on churn prediction
- Implemented two separate RAG systems (Analytics and NL-to-SQL) without relying on external LLM APIs
- Created fully functional Fivetran connector that successfully loaded 5,000 records
- Achieved sub-second query response times for natural language questions
- Integrated SHAP explanations for every prediction, making AI decisions transparent
User Experience:
- Enabled non-technical users to query complex databases using plain English
- Provided multiple interaction modes to suit different user preferences
- Generated automatic visualizations based on query context
- Delivered confidence scores so users understand system certainty
Engineering:
- Built modular architecture with clear separation of concerns
- Achieved 100% test coverage on critical prediction and RAG components
- Created comprehensive documentation (2,500+ lines across 10 files)
- Prepared deployment infrastructure for Google Cloud Run
- Maintained clean git history with meaningful commits
Data Pipeline:
- Automated end-to-end flow: GCS to BigQuery to ML Model to Predictions
- Handled schema evolution and data quality checks
- Implemented proper error handling and logging throughout pipeline
What we learned
Machine Learning:
- SHAP explanations significantly increase user trust in ML predictions
- Feature engineering impacts model performance more than algorithm choice
- Real-world data requires extensive preprocessing and validation
- Model retraining pipelines must handle schema evolution gracefully
Natural Language Processing:
- TF-IDF with n-grams provides effective query matching for structured intents
- Entity extraction using regex patterns works well for business domains
- Confidence thresholds must be calibrated based on user tolerance for errors
- Query templates scale better than fully generative approaches for SQL generation
Data Engineering:
- BigQuery partitioning and clustering significantly improve query performance
- Schema-on-write provides better data quality than schema-on-read
- Data validation at ingestion prevents downstream errors
- Proper indexing reduces query costs by 60%
Software Engineering:
- Modular design enables independent testing and deployment of components
- Configuration management is critical for multi-environment deployments
- Error messages must guide users toward resolution, not just report failures
- Documentation is as important as code for project longevity
Cloud Architecture:
- Serverless platforms like Cloud Run reduce operational overhead
- Container optimization matters for cold start performance
- Proper IAM configuration prevents security vulnerabilities
- Cost monitoring must be implemented from day one
User Experience:
- Users prefer multiple input methods over forcing single interaction pattern
- Visual feedback during long operations prevents perceived failures
- Alternative suggestions help users refine vague queries
- Export functionality is critical for business user adoption
What's next for Smart Revenue Recognition and Churn Prediction for SaaS
Immediate Enhancements (Next 3 Months):
Advanced ML Models
- Implement gradient boosting (XGBoost, LightGBM) for improved accuracy
- Add time-series forecasting for revenue prediction
- Build customer lifetime value (CLV) prediction models
- Develop anomaly detection for unusual usage patterns
Expanded NL-to-SQL Capabilities
- Increase query templates from 16 to 50+ covering edge cases
- Add support for JOIN operations across multiple tables
- Implement query optimization suggestions
- Enable query history and favorites
Enhanced Visualization
- Add customizable dashboards with drag-and-drop widgets
- Implement drill-down capabilities for detailed analysis
- Create executive summary reports with automated insights
- Build real-time monitoring dashboards
Medium-term Goals (3-6 Months):
Multi-tenant Architecture
- Implement customer isolation and data segregation
- Build role-based access control (RBAC)
- Add organization-level configurations
- Create usage tracking and billing system
Integration Expansion
- Connect to Stripe, Chargebee, and other payment platforms
- Integrate with Salesforce, HubSpot for CRM data
- Add Slack/Teams notifications for high-risk customers
- Build webhook system for third-party integrations
Automated Actions
- Trigger email campaigns for at-risk customers
- Create automatic discount offers based on churn probability
- Schedule account manager interventions
- Generate personalized retention strategies
Advanced Analytics
- Cohort analysis for customer segments
- A/B testing framework for retention strategies
- Churn driver analysis across customer segments
- Competitive benchmarking against industry standards
Long-term Vision (6-12 Months):
Generative AI Integration
- Replace TF-IDF with large language model embeddings
- Implement conversational AI for multi-turn queries
- Add natural language report generation
- Build AI-powered recommendation engine
Real-time Processing
- Move from batch to streaming predictions
- Implement Apache Kafka for event processing
- Build real-time feature computation
- Create instant alerts for critical churn signals
Mobile Application
- Develop iOS and Android apps for executives
- Implement push notifications for urgent insights
- Create offline mode for viewing cached reports
- Build voice-activated query interface
Marketplace and Ecosystem
- Create plugin system for custom integrations
- Build community-contributed query templates
- Develop industry-specific prediction models
- Establish partner network for implementation services
Compliance and Security
- Achieve SOC 2 Type II certification
- Implement GDPR-compliant data handling
- Add audit logging for all predictions and queries
- Build data retention and deletion workflows
Enterprise Features
- Multi-region deployment for data sovereignty
- Custom model training on customer-specific data
- White-label deployment options
- SLA guarantees with 99.9% uptime
Research Directions:
Explainable AI Advancement
- Explore counterfactual explanations
- Implement attention mechanisms for feature importance
- Build causal inference models
- Develop fairness and bias detection
Automated Feature Engineering
- Implement AutoML for feature discovery
- Build temporal feature extraction
- Create interaction term detection
- Develop domain-specific feature libraries
Query Understanding
- Research semantic parsing improvements
- Explore few-shot learning for new query patterns
- Implement active learning from user corrections
- Build context-aware query disambiguation
Business Expansion:
Vertical Specialization
- Create industry-specific models (B2B SaaS, consumer apps, etc.)
- Build compliance packages for regulated industries
- Develop templates for common business questions by vertical
Professional Services
- Offer implementation and training services
- Provide custom model development
- Build managed service offering
Community Building
- Create user forums and knowledge base
- Host webinars on churn prediction best practices
- Publish research papers and case studies
- Build certification program for practitioners
Built With
- bigquery
- cloudrun
- fivetran
- gcs
- python
- streamlit
Log in or sign up for Devpost to join the conversation.