Smart Revenue Recognition and Churn Prediction for SaaS

Inspiration

The SaaS industry faces critical challenges in understanding customer behavior and predicting revenue patterns. Many companies struggle with:

Inability to identify at-risk customers before they churn
Lack of actionable insights from subscription data
Complex data pipelines requiring technical expertise to query
Manual processes for revenue forecasting and customer segmentation

We were inspired to build an intelligent system that democratizes data analytics, allowing business users to interact with complex datasets using natural language while providing accurate churn predictions backed by explainable AI.

What it does

Smart Revenue Recognition and Churn Prediction is an end-to-end AI-powered analytics platform that:

Core Capabilities:

Predicts customer churn probability using machine learning with 85% accuracy
Provides explainable AI insights using SHAP values to understand prediction drivers
Enables natural language querying of BigQuery databases without SQL knowledge
Automates data ingestion from Google Cloud Storage to BigQuery using Fivetran SDK
Offers three prediction modes: manual input, CSV batch upload, and real-time simulation
Features augmented analytics with 14 pre-built intent handlers for common business questions
Includes NL-to-SQL RAG system with 16 query templates for flexible data exploration

Technical Features:

Processes 5,000+ subscription records automatically
Generates interactive visualizations (bar charts, pie charts, histograms, line graphs)
Stores predictions in BigQuery for historical tracking and trend analysis
Provides confidence scores for all predictions and query interpretations
Exports results to CSV for further analysis

How we built it

Architecture Stack:

Frontend: Streamlit web application with custom blue theme
Backend: Python with scikit-learn for machine learning
Data Processing: pandas, numpy for data transformation
Visualization: Plotly for interactive charts, matplotlib for SHAP plots
Database: Google BigQuery for data warehousing
Storage: Google Cloud Storage for raw data files
Data Pipeline: Custom Fivetran SDK connector
ML Model: RandomForestClassifier with hyperparameter tuning
Explainability: SHAP (SHapley Additive exPlanations)
NLP: TF-IDF vectorization with cosine similarity for query matching
Deployment: Docker containerization, Google Cloud Run ready

Development Process:

Data pipeline setup: Built custom Fivetran connector to move CSV files from GCS to BigQuery
Feature engineering: Created 16 features from subscription data (MRR, ARR, tenure, usage patterns)
Model training: Trained RandomForest classifier on 1,000 labeled examples
RAG implementation: Built two separate RAG systems for analytics Q&A and SQL generation
UI development: Created intuitive Streamlit interface with multiple interaction modes
Integration: Connected all components with BigQuery as central data warehouse
Testing: Validated predictions, query accuracy, and end-to-end data flow
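
The feature engineering step can be illustrated with a minimal sketch. The record layout (`plan_price`, `billing_period`, `start_date`, `logins_last_30d`) and the function name `build_features` are assumptions made for illustration, since the project's 16 features are not listed here. The only definitions relied on are the standard ones: MRR is the billed amount normalized to one month, and ARR is 12 times MRR.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Subscription:
    # Hypothetical record layout; the real dataset's columns may differ.
    customer_id: str
    plan_price: float      # amount billed per billing period
    billing_period: str    # "monthly" or "annual"
    start_date: date
    logins_last_30d: int


def build_features(sub: Subscription, as_of: date) -> dict:
    """Derive a few illustrative churn-model features from one record."""
    # Normalize the billed amount to monthly recurring revenue (MRR).
    if sub.billing_period == "annual":
        mrr = sub.plan_price / 12
    elif sub.billing_period == "monthly":
        mrr = sub.plan_price
    else:
        raise ValueError(f"unknown billing period: {sub.billing_period!r}")

    # Tenure in completed months between the start date and the reference date.
    tenure_months = (as_of.year - sub.start_date.year) * 12
    tenure_months += as_of.month - sub.start_date.month
    if as_of.day < sub.start_date.day:
        tenure_months -= 1  # the current month is not complete yet
    tenure_months = max(tenure_months, 0)

    return {
        "customer_id": sub.customer_id,
        "mrr": round(mrr, 2),
        "arr": round(mrr * 12, 2),
        "tenure_months": tenure_months,
        "logins_per_day": round(sub.logins_last_30d / 30, 3),
    }
```

Keeping each derived feature in one pure function like this makes it straightforward to reuse the same logic for manual input, batch CSV uploads, and simulated records.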

Key Technical Decisions:

Used TF-IDF instead of embeddings for faster query matching without API dependencies
Implemented SHAP for model explainability to meet regulatory requirements
Chose RandomForest for balance between accuracy and interpretability
Built custom Fivetran connector for full control over data transformation
Structured as modular components for maintainability and scalability
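
To make the TF-IDF decision concrete, here is a dependency-free sketch of matching a question to an intent by cosine similarity over TF-IDF vectors. It is not the project's code: the class name `IntentMatcher`, the three intents, and their example phrasings are hypothetical, and a production version would more likely use a library vectorizer with n-grams.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class IntentMatcher:
    """Rank known intents against a question by TF-IDF cosine similarity."""

    def __init__(self, intents: dict[str, str]):
        self.names = list(intents)
        docs = [tokenize(intents[name]) for name in self.names]
        n_docs = len(docs)
        # Document frequency: number of intent descriptions containing each term.
        df = Counter(term for doc in docs for term in set(doc))
        # Smoothed IDF, so every known term keeps a positive weight.
        self.idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
        self.vectors = [self._vectorize(doc) for doc in docs]

    def _vectorize(self, tokens: list[str]) -> dict[str, float]:
        # Terms outside the known vocabulary are ignored.
        counts = Counter(t for t in tokens if t in self.idf)
        vec = {t: c * self.idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def rank(self, question: str) -> list[tuple[str, float]]:
        """Return (intent, score) pairs, best match first."""
        q = self._vectorize(tokenize(question))
        scores = [
            (name, sum(w * vec.get(t, 0.0) for t, w in q.items()))
            for name, vec in zip(self.names, self.vectors)
        ]
        return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical intents; the project's own 14 handlers are not listed here.
matcher = IntentMatcher({
    "churn_rate": "what is the overall churn rate of customers",
    "top_risk": "which customers have the highest churn risk",
    "revenue_by_plan": "show monthly recurring revenue by plan",
})
ranked = matcher.rank("Which customers are most at risk of churning?")
```

The top score can double as a confidence value: if it falls below a chosen threshold, the remaining entries in `ranked` can be offered as alternative interpretations.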

Challenges we ran into

Technical Challenges:

NumPy Version Compatibility
- Issue: scikit-learn models failed to load due to NumPy 2.x incompatibility
- Solution: Downgraded to NumPy 1.26.4 and retrained all models with version pinning
SHAP Array Dimension Handling
- Issue: SHAP explainer returned inconsistent array shapes across different scenarios
- Solution: Built robust dimension checking and safe indexing for all SHAP value formats
BigQuery Schema Management
- Issue: Prediction table schema mismatches between local model and database
- Solution: Implemented dynamic schema detection and automatic table creation
NL-to-SQL Query Ambiguity
- Issue: Natural language queries can map to multiple SQL patterns
- Solution: Implemented confidence scoring and alternative suggestion system
Real-time Prediction Performance
- Issue: SHAP explanations caused 5+ second latency for single predictions
- Solution: Optimized SHAP computation and added caching for repeated queries
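
The caching idea for repeated explanation requests can be sketched as follows. This is an illustration under assumptions: `compute_explanation` is a stand-in for the expensive explainer call (it returns each feature's share of the total, not real SHAP values), and the function names and the `CALLS` counter are hypothetical.

```python
from functools import lru_cache

# Counts how often the expensive path runs; used only to show the cache works.
CALLS = {"count": 0}


def compute_explanation(features: tuple[float, ...]) -> tuple[float, ...]:
    """Stand-in for a slow explainer call (hypothetical, not real SHAP output)."""
    CALLS["count"] += 1
    total = sum(features) or 1.0
    return tuple(round(value / total, 4) for value in features)


@lru_cache(maxsize=1024)
def cached_explanation(features: tuple[float, ...]) -> tuple[float, ...]:
    # lru_cache needs hashable arguments, hence a tuple instead of a dict or list.
    return compute_explanation(features)


def explain(row: dict[str, float], feature_order: list[str]) -> dict[str, float]:
    """Explain one input row, reusing the cached result for identical inputs."""
    # A fixed feature order makes equal rows produce equal cache keys.
    key = tuple(float(row[name]) for name in feature_order)
    return dict(zip(feature_order, cached_explanation(key)))
```

A cache like this only helps when identical feature vectors recur, for example when a user re-runs the same manual input; it does not reduce the cost of the first explanation.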

Data Challenges:

Empty Prediction Table
- Issue: Analytics features failed when no predictions existed in BigQuery
- Solution: Created sample data loader and graceful fallback mechanisms
CSV File Schema Variations
- Issue: GCS bucket contained files with inconsistent column names
- Solution: Built flexible schema mapper in Fivetran connector

Deployment Challenges:

Streamlit Configuration for Cloud Run
- Issue: Configuration optimized for local development failed in containerized environment
- Solution: Created separate config profiles for development and production
Authentication Management
- Issue: Service account keys accidentally committed to git
- Solution: Implemented proper gitignore patterns and key rotation procedures

Accomplishments that we're proud of

Technical Achievements:

Built production-ready ML pipeline with 85% accuracy on churn prediction
Implemented two separate RAG systems (Analytics and NL-to-SQL) without relying on external LLM APIs
Created fully functional Fivetran connector that successfully loaded 5,000 records
Achieved sub-second query response times for natural language questions
Integrated SHAP explanations for every prediction, making AI decisions transparent

User Experience:

Enabled non-technical users to query complex databases using plain English
Provided multiple interaction modes to suit different user preferences
Generated automatic visualizations based on query context
Delivered confidence scores so users understand system certainty

Engineering:

Built modular architecture with clear separation of concerns
Achieved 100% test coverage on critical prediction and RAG components
Created comprehensive documentation (2,500+ lines across 10 files)
Prepared deployment infrastructure for Google Cloud Run
Maintained clean git history with meaningful commits

Data Pipeline:

Automated end-to-end flow: GCS to BigQuery to ML Model to Predictions
Handled schema evolution and data quality checks
Implemented proper error handling and logging throughout pipeline

What we learned

Machine Learning:

SHAP explanations significantly increase user trust in ML predictions
Feature engineering impacts model performance more than algorithm choice
Real-world data requires extensive preprocessing and validation
Model retraining pipelines must handle schema evolution gracefully

Natural Language Processing:

TF-IDF with n-grams provides effective query matching for structured intents
Entity extraction using regex patterns works well for business domains
Confidence thresholds must be calibrated based on user tolerance for errors
Query templates scale better than fully generative approaches for SQL generation
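
The combination of regex entity extraction and query templates can be sketched as below. The template text, the table name `project.dataset.predictions`, the column names, and the function names are hypothetical. The sketch relies on one BigQuery feature: named query parameters written as `@name`, which the client binds separately instead of interpolating user text into the SQL.

```python
import re

# Hypothetical template; table and column names are placeholders.
TEMPLATE = (
    "SELECT customer_id, churn_probability "
    "FROM `{table}` "
    "WHERE churn_probability >= @min_prob "
    "ORDER BY churn_probability DESC "
    "LIMIT {limit}"
)


def extract_entities(question: str) -> dict:
    """Pull a row limit and a probability threshold out of a question."""
    entities = {"limit": 10, "min_prob": 0.0}  # defaults when nothing is stated

    match = re.search(r"\btop\s+(\d+)\b", question, re.IGNORECASE)
    if match:
        # Cap the limit so a typo cannot request an enormous result set.
        entities["limit"] = min(int(match.group(1)), 1000)

    match = re.search(
        r"\b(?:above|over|at least)\s+(\d+(?:\.\d+)?)\s*(%|percent)?",
        question,
        re.IGNORECASE,
    )
    if match:
        value = float(match.group(1))
        # "70%" becomes 0.7; a bare number such as "0.7" is used as given.
        entities["min_prob"] = value / 100 if match.group(2) else value

    return entities


def build_query(question: str, table: str = "project.dataset.predictions"):
    """Return the SQL text and the parameter values to bind to it."""
    entities = extract_entities(question)
    # Only the validated integer limit and the fixed table name are formatted in.
    sql = TEMPLATE.format(table=table, limit=entities["limit"])
    return sql, {"min_prob": entities["min_prob"]}


sql, params = build_query("Show the top 5 customers with churn risk above 70%")
```

Because the template fixes the shape of the query, the extracted entities can only change a limit and a bound parameter, which is one reason templates are easier to keep safe than fully generated SQL.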

Data Engineering:

BigQuery partitioning and clustering significantly improve query performance
Schema-on-write provides better data quality than schema-on-read
Data validation at ingestion prevents downstream errors
Proper indexing reduces query costs by 60%

Software Engineering:

Modular design enables independent testing and deployment of components
Configuration management is critical for multi-environment deployments
Error messages must guide users toward resolution, not just report failures
Documentation is as important as code for project longevity

Cloud Architecture:

Serverless platforms like Cloud Run reduce operational overhead
Container optimization matters for cold start performance
Proper IAM configuration prevents security vulnerabilities
Cost monitoring must be implemented from day one

User Experience:

Users prefer a choice of input methods over being forced into a single interaction pattern
Visual feedback during long operations prevents perceived failures
Alternative suggestions help users refine vague queries
Export functionality is critical for business user adoption

What's next for Smart Revenue Recognition and Churn Prediction for SaaS

Immediate Enhancements (Next 3 Months):

Advanced ML Models
- Implement gradient boosting (XGBoost, LightGBM) for improved accuracy
- Add time-series forecasting for revenue prediction
- Build customer lifetime value (CLV) prediction models
- Develop anomaly detection for unusual usage patterns
Expanded NL-to-SQL Capabilities
- Increase query templates from 16 to 50+ covering edge cases
- Add support for JOIN operations across multiple tables
- Implement query optimization suggestions
- Enable query history and favorites
Enhanced Visualization
- Add customizable dashboards with drag-and-drop widgets
- Implement drill-down capabilities for detailed analysis
- Create executive summary reports with automated insights
- Build real-time monitoring dashboards

Medium-term Goals (3-6 Months):

Multi-tenant Architecture
- Implement customer isolation and data segregation
- Build role-based access control (RBAC)
- Add organization-level configurations
- Create usage tracking and billing system
Integration Expansion
- Connect to Stripe, Chargebee, and other payment platforms
- Integrate with Salesforce, HubSpot for CRM data
- Add Slack/Teams notifications for high-risk customers
- Build webhook system for third-party integrations
Automated Actions
- Trigger email campaigns for at-risk customers
- Create automatic discount offers based on churn probability
- Schedule account manager interventions
- Generate personalized retention strategies
Advanced Analytics
- Cohort analysis for customer segments
- A/B testing framework for retention strategies
- Churn driver analysis across customer segments
- Competitive benchmarking against industry standards

Long-term Vision (6-12 Months):

Generative AI Integration
- Replace TF-IDF with large language model embeddings
- Implement conversational AI for multi-turn queries
- Add natural language report generation
- Build AI-powered recommendation engine
Real-time Processing
- Move from batch to streaming predictions
- Implement Apache Kafka for event processing
- Build real-time feature computation
- Create instant alerts for critical churn signals
Mobile Application
- Develop iOS and Android apps for executives
- Implement push notifications for urgent insights
- Create offline mode for viewing cached reports
- Build voice-activated query interface
Marketplace and Ecosystem
- Create plugin system for custom integrations
- Build community-contributed query templates
- Develop industry-specific prediction models
- Establish partner network for implementation services
Compliance and Security
- Achieve SOC 2 Type II certification
- Implement GDPR-compliant data handling
- Add audit logging for all predictions and queries
- Build data retention and deletion workflows
Enterprise Features
- Multi-region deployment for data sovereignty
- Custom model training on customer-specific data
- White-label deployment options
- SLA guarantees with 99.9% uptime

Research Directions:

Explainable AI Advancement
- Explore counterfactual explanations
- Implement attention mechanisms for feature importance
- Build causal inference models
- Develop fairness and bias detection
Automated Feature Engineering
- Implement AutoML for feature discovery
- Build temporal feature extraction
- Create interaction term detection
- Develop domain-specific feature libraries
Query Understanding
- Research semantic parsing improvements
- Explore few-shot learning for new query patterns
- Implement active learning from user corrections
- Build context-aware query disambiguation

Business Expansion:

Vertical Specialization
- Create industry-specific models (B2B SaaS, consumer apps, etc.)
- Build compliance packages for regulated industries
- Develop templates for common business questions by vertical
Professional Services
- Offer implementation and training services
- Provide custom model development
- Build managed service offering
Community Building
- Create user forums and knowledge base
- Host webinars on churn prediction best practices
- Publish research papers and case studies
- Build certification program for practitioners

Built With

bigquery
cloudrun
fivetran
gcs
python
streamlit

Submitted to

AI Accelerate: Unlocking New Frontiers

Created by

I guided the team on solution design and task execution, helped resolve code-level challenges, and supported deployment of the application on Cloud Run to ensure a smooth and scalable launch during the hackathon.

Biswanath Giri
AI & Cloud Enterprise Architect
Ipsita Nanda
Sampann Nigam
Leader, Data Science @ Cisco | AI/ML, NLP, Cloud Computing
Saumya Mohapatra
Full Stack Engineer | DevOps & Cloud | GenAI Innovator | Freelancer | Building scalable & simplified tech solutions

Updates

Ipsita Nanda started this project — Oct 24, 2025 02:30 PM EDT
