Smart Revenue Recognition and Churn Prediction for SaaS

Inspiration

The SaaS industry faces critical challenges in understanding customer behavior and predicting revenue patterns. Many companies struggle with:

  • Inability to identify at-risk customers before they churn
  • Lack of actionable insights from subscription data
  • Complex data pipelines requiring technical expertise to query
  • Manual processes for revenue forecasting and customer segmentation

We were inspired to build an intelligent system that democratizes data analytics, allowing business users to interact with complex datasets using natural language while providing accurate churn predictions backed by explainable AI.

What it does

Smart Revenue Recognition and Churn Prediction is an end-to-end AI-powered analytics platform that:

Core Capabilities:

  • Predicts customer churn probability using machine learning with 85% accuracy
  • Provides explainable AI insights using SHAP values to understand prediction drivers
  • Enables natural language querying of BigQuery databases without SQL knowledge
  • Automates data ingestion from Google Cloud Storage to BigQuery using Fivetran SDK
  • Offers three prediction modes: manual input, CSV batch upload, and real-time simulation
  • Features augmented analytics with 14 pre-built intent handlers for common business questions
  • Includes NL-to-SQL RAG system with 16 query templates for flexible data exploration

Technical Features:

  • Processes 5,000+ subscription records automatically
  • Generates interactive visualizations (bar charts, pie charts, histograms, line graphs)
  • Stores predictions in BigQuery for historical tracking and trend analysis
  • Provides confidence scores for all predictions and query interpretations
  • Exports results to CSV for further analysis

How we built it

Architecture Stack:

  • Frontend: Streamlit web application with custom blue theme
  • Backend: Python with scikit-learn for machine learning
  • Data Processing: pandas and NumPy for data transformation
  • Visualization: Plotly for interactive charts, matplotlib for SHAP plots
  • Database: Google BigQuery for data warehousing
  • Storage: Google Cloud Storage for raw data files
  • Data Pipeline: Custom Fivetran SDK connector
  • ML Model: RandomForestClassifier with hyperparameter tuning
  • Explainability: SHAP (SHapley Additive exPlanations)
  • NLP: TF-IDF vectorization with cosine similarity for query matching
  • Deployment: Docker containerization, Google Cloud Run ready

Development Process:

  1. Data pipeline setup: Built custom Fivetran connector to move CSV files from GCS to BigQuery
  2. Feature engineering: Created 16 features from subscription data (MRR, ARR, tenure, usage patterns)
  3. Model training: Trained RandomForest classifier on 1,000 labeled examples
  4. RAG implementation: Built two separate RAG systems for analytics Q&A and SQL generation
  5. UI development: Created intuitive Streamlit interface with multiple interaction modes
  6. Integration: Connected all components with BigQuery as central data warehouse
  7. Testing: Validated predictions, query accuracy, and end-to-end data flow
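
As a rough illustration of step 2, the sketch below derives a few of the features named above (MRR, ARR, tenure, a usage ratio) from one subscription record. The field names (`monthly_price`, `seats`, `start_date`, `logins_last_30d`) and the exact formulas are assumptions made for the example; the project's real schema and its full set of 16 features are not shown in this write-up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Subscription:
    # Hypothetical raw fields, chosen only for illustration.
    customer_id: str
    monthly_price: float  # price per seat per month
    seats: int
    start_date: date
    logins_last_30d: int


def engineer_features(sub: Subscription, as_of: date) -> dict:
    """Derive a handful of model features from one subscription record."""
    mrr = sub.monthly_price * sub.seats  # monthly recurring revenue
    arr = mrr * 12                       # annual recurring revenue

    # Tenure in whole months between the start date and the reference date.
    tenure_months = (as_of.year - sub.start_date.year) * 12 + (
        as_of.month - sub.start_date.month
    )
    if as_of.day < sub.start_date.day:
        tenure_months -= 1  # the current month is not complete yet
    tenure_months = max(tenure_months, 0)

    # Usage intensity: logins per seat, guarded against zero seats.
    logins_per_seat = sub.logins_last_30d / sub.seats if sub.seats else 0.0

    return {
        "customer_id": sub.customer_id,
        "mrr": mrr,
        "arr": arr,
        "tenure_months": tenure_months,
        "logins_per_seat": logins_per_seat,
    }
```

A record built this way can be passed straight to a model as one row, for example `engineer_features(Subscription("c1", 50.0, 4, date(2023, 1, 15), 20), date(2024, 1, 10))`.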

Key Technical Decisions:

  • Used TF-IDF instead of embeddings for faster query matching without API dependencies
  • Implemented SHAP for model explainability to meet regulatory requirements
  • Chose RandomForest for balance between accuracy and interpretability
  • Built custom Fivetran connector for full control over data transformation
  • Structured as modular components for maintainability and scalability
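
To make the first decision concrete, here is a minimal, dependency-free sketch of TF-IDF matching with cosine similarity. It is a stand-in written for this example, not the project's code: the `IntentMatcher` class, the intent names, and the example phrasings are all hypothetical, and a real build would more likely use a library vectorizer.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class IntentMatcher:
    """Rank known intents against a query by TF-IDF cosine similarity."""

    def __init__(self, examples):
        # examples: dict mapping an intent name to one example phrasing.
        self.intents = list(examples)
        docs = [tokenize(examples[name]) for name in self.intents]
        n_docs = len(docs)
        doc_freq = Counter(term for doc in docs for term in set(doc))
        # Smoothed inverse document frequency for every known term.
        self.idf = {
            term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()
        }
        self.vectors = [self._vectorize(doc) for doc in docs]

    def _vectorize(self, tokens):
        # Term frequency times IDF, then L2-normalised so that a dot
        # product between two vectors equals their cosine similarity.
        counts = Counter(t for t in tokens if t in self.idf)
        vec = {t: c * self.idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def rank(self, query):
        """Return (intent, score) pairs, best match first."""
        q = self._vectorize(tokenize(query))
        scores = [
            (name, sum(w * vec.get(t, 0.0) for t, w in q.items()))
            for name, vec in zip(self.intents, self.vectors)
        ]
        return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

The top score can double as a confidence value, and the remaining entries of `rank()` can be offered as alternative interpretations when the best score is low.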

Challenges we ran into

Technical Challenges:

  1. NumPy Version Compatibility

    • Issue: scikit-learn models failed to load due to NumPy 2.x incompatibility
    • Solution: Downgraded to NumPy 1.26.4 and retrained all models with version pinning
  2. SHAP Array Dimension Handling

    • Issue: SHAP explainer returned inconsistent array shapes across different scenarios
    • Solution: Built robust dimension checking and safe indexing for all SHAP value formats
  3. BigQuery Schema Management

    • Issue: Prediction table schema mismatches between local model and database
    • Solution: Implemented dynamic schema detection and automatic table creation
  4. NL-to-SQL Query Ambiguity

    • Issue: Natural language queries can map to multiple SQL patterns
    • Solution: Implemented confidence scoring and alternative suggestion system
  5. Real-time Prediction Performance

    • Issue: SHAP explanations caused 5+ second latency for single predictions
    • Solution: Optimized SHAP computation and added caching for repeated queries
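
The caching half of challenge 5 can be sketched with the standard library alone. How the SHAP computation itself was optimized is not described here, so `_expensive_explanation` below is only a hypothetical placeholder for the real explainer call, and the `CALLS` counter exists purely to show when the slow path runs.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the expensive path actually runs


def _expensive_explanation(values):
    # Placeholder for the real SHAP computation (hypothetical): each
    # "contribution" is just the feature value scaled by a constant.
    CALLS["count"] += 1
    return tuple(round(v * 0.1, 6) for v in values)


@lru_cache(maxsize=1024)
def _cached_explanation(feature_items):
    # feature_items is a hashable tuple of (name, value) pairs.
    names = tuple(name for name, _ in feature_items)
    values = tuple(value for _, value in feature_items)
    return tuple(zip(names, _expensive_explanation(values)))


def explain_prediction(features: dict) -> dict:
    """Return per-feature contributions, reusing results for repeated inputs."""
    # Dicts are not hashable, so build a sorted tuple key; sorting makes
    # the cache independent of the order in which keys were inserted.
    key = tuple(sorted(features.items()))
    return dict(_cached_explanation(key))
```

Calling `explain_prediction` twice with the same feature values triggers the expensive computation once; the second call is served from the cache.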

Data Challenges:

  1. Empty Prediction Table

    • Issue: Analytics features failed when no predictions existed in BigQuery
    • Solution: Created sample data loader and graceful fallback mechanisms
  2. CSV File Schema Variations

    • Issue: GCS bucket contained files with inconsistent column names
    • Solution: Built flexible schema mapper in Fivetran connector
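
A schema mapper of the kind mentioned for the CSV variations could look roughly like this. The alias table and column names are invented for the example, and the sketch uses plain `csv` parsing rather than the Fivetran SDK, whose connector code is not shown in this write-up.

```python
import csv
import io

# Hypothetical alias table: canonical column -> spellings seen in source files.
COLUMN_ALIASES = {
    "customer_id": {"customer_id", "customerid", "cust_id"},
    "mrr": {"mrr", "monthly_recurring_revenue"},
    "plan": {"plan", "plan_name", "subscription_plan"},
}


def _normalize(name):
    """Lowercase a header and unify spaces and hyphens to underscores."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")


def build_mapping(header):
    """Map each source column to a canonical name.

    Columns without a known alias keep their normalized name.
    """
    mapping = {}
    for col in header:
        norm = _normalize(col)
        canonical = next(
            (name for name, aliases in COLUMN_ALIASES.items() if norm in aliases),
            norm,
        )
        mapping[col] = canonical
    return mapping


def read_rows(csv_text):
    """Parse CSV text and return rows keyed by canonical column names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    mapping = build_mapping(reader.fieldnames or [])
    return [{mapping[key]: value for key, value in row.items()} for row in reader]
```

Files with headers such as `Customer ID` or `cust_id` then produce rows with the same `customer_id` key, so downstream loading code only needs to know one schema.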

Deployment Challenges:

  1. Streamlit Configuration for Cloud Run

    • Issue: Configuration optimized for local development failed in containerized environment
    • Solution: Created separate config profiles for development and production
  2. Authentication Management

    • Issue: Service account keys accidentally committed to git
    • Solution: Implemented proper gitignore patterns and key rotation procedures
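
One way to keep separate development and production profiles is to select Streamlit's server options from an environment variable. In the sketch below, `APP_ENV` and `app.py` are hypothetical names; `PORT` is the variable Cloud Run uses to tell a container which port to listen on, and the `server.*` options are standard Streamlit settings.

```python
import os

PROFILES = {
    # Local development: default Streamlit port, browser opens automatically.
    "development": {
        "server.port": "8501",
        "server.address": "localhost",
        "server.headless": "false",
    },
    # Container: listen on all interfaces and never try to open a browser.
    "production": {
        "server.address": "0.0.0.0",
        "server.headless": "true",
    },
}


def streamlit_args(env=None):
    """Build the `streamlit run` command line for the active profile."""
    env = os.environ if env is None else env
    profile = env.get("APP_ENV", "development")  # APP_ENV is a made-up name
    options = dict(PROFILES[profile])
    if profile == "production":
        # Use the port handed to the container, falling back to 8080.
        options["server.port"] = env.get("PORT", "8080")
    flags = [f"--{key}={value}" for key, value in sorted(options.items())]
    return ["streamlit", "run", "app.py"] + flags
```

A container entrypoint could pass the returned list to `subprocess.run`, while local runs simply leave `APP_ENV` unset and get the development profile.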

Accomplishments that we're proud of

Technical Achievements:

  • Built production-ready ML pipeline with 85% accuracy on churn prediction
  • Implemented two separate RAG systems (Analytics and NL-to-SQL) without relying on external LLM APIs
  • Created fully functional Fivetran connector that successfully loaded 5,000 records
  • Achieved sub-second query response times for natural language questions
  • Integrated SHAP explanations for every prediction, making AI decisions transparent

User Experience:

  • Enabled non-technical users to query complex databases using plain English
  • Provided multiple interaction modes to suit different user preferences
  • Generated automatic visualizations based on query context
  • Delivered confidence scores so users understand system certainty

Engineering:

  • Built modular architecture with clear separation of concerns
  • Achieved 100% test coverage on critical prediction and RAG components
  • Created comprehensive documentation (2,500+ lines across 10 files)
  • Prepared deployment infrastructure for Google Cloud Run
  • Maintained clean git history with meaningful commits

Data Pipeline:

  • Automated end-to-end flow: GCS to BigQuery to ML Model to Predictions
  • Handled schema evolution and data quality checks
  • Implemented proper error handling and logging throughout pipeline

What we learned

Machine Learning:

  • SHAP explanations significantly increase user trust in ML predictions
  • Feature engineering impacts model performance more than algorithm choice
  • Real-world data requires extensive preprocessing and validation
  • Model retraining pipelines must handle schema evolution gracefully

Natural Language Processing:

  • TF-IDF with n-grams provides effective query matching for structured intents
  • Entity extraction using regex patterns works well for business domains
  • Confidence thresholds must be calibrated based on user tolerance for errors
  • Query templates scale better than fully generative approaches for SQL generation
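
The second and fourth points can be illustrated together: regex patterns pull entities out of a question, and a fixed template receives them as named query parameters. The template, the table and column names, and the list of plans are placeholders invented for this sketch; none of the project's 16 templates are reproduced here.

```python
import re

# Hypothetical template; table and column names are placeholders.
TEMPLATE = (
    "SELECT customer_id, churn_probability "
    "FROM predictions "
    "WHERE plan = @plan AND churn_probability >= @threshold "
    "ORDER BY churn_probability DESC LIMIT @limit"
)

KNOWN_PLANS = ("basic", "pro", "enterprise")  # assumed plan names


def extract_entities(question):
    """Pull a row limit, a probability threshold, and a plan name from text."""
    text = question.lower()
    entities = {"limit": 10, "threshold": 0.5, "plan": None}  # defaults

    match = re.search(r"\btop\s+(\d+)\b", text)
    if match:
        entities["limit"] = int(match.group(1))

    match = re.search(r"(\d+(?:\.\d+)?)\s*%", text)
    if match:
        entities["threshold"] = float(match.group(1)) / 100

    match = re.search(r"\b(" + "|".join(KNOWN_PLANS) + r")\b", text)
    if match:
        entities["plan"] = match.group(1)

    return entities


def build_query(question):
    """Return (sql, params); values travel as named parameters and are
    never spliced into the SQL text."""
    entities = extract_entities(question)
    if entities["plan"] is None:
        raise ValueError("no plan name found in question")
    return TEMPLATE, entities
```

Because the SQL text is fixed and only the parameter values vary, a template like this cannot produce malformed or injected SQL, which is part of why templates are easier to trust than fully generated queries.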

Data Engineering:

  • BigQuery partitioning and clustering significantly improve query performance
  • Schema-on-write provides better data quality than schema-on-read
  • Data validation at ingestion prevents downstream errors
  • Proper indexing reduces query costs by 60%
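
Validation at ingestion can be as simple as checking each raw record before it is loaded and setting aside the ones that fail. The required columns and rules below are assumptions for illustration, not the project's actual checks.

```python
from datetime import date

REQUIRED = ("customer_id", "mrr", "start_date")  # hypothetical required columns


def validate_row(row):
    """Return a list of problems for one raw record; empty means clean."""
    problems = []
    for field in REQUIRED:
        value = row.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing {field}")

    # Only check the content of fields that are actually present.
    if "missing mrr" not in problems:
        try:
            if float(row["mrr"]) < 0:
                problems.append("mrr must be non-negative")
        except ValueError:
            problems.append("mrr is not a number")

    if "missing start_date" not in problems:
        try:
            date.fromisoformat(row["start_date"])
        except ValueError:
            problems.append("start_date is not an ISO date")

    return problems


def split_valid(rows):
    """Separate clean rows from rejected ones, keeping the reasons."""
    valid, rejected = [], []
    for row in rows:
        problems = validate_row(row)
        if problems:
            rejected.append((row, problems))
        else:
            valid.append(row)
    return valid, rejected
```

Keeping the list of reasons next to each rejected row makes it possible to log or report exactly why a record never reached the warehouse.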

Software Engineering:

  • Modular design enables independent testing and deployment of components
  • Configuration management is critical for multi-environment deployments
  • Error messages must guide users toward resolution, not just report failures
  • Documentation is as important as code for project longevity

Cloud Architecture:

  • Serverless platforms like Cloud Run reduce operational overhead
  • Container optimization matters for cold start performance
  • Proper IAM configuration prevents security vulnerabilities
  • Cost monitoring must be implemented from day one

User Experience:

  • Users prefer a choice of input methods over being forced into a single interaction pattern
  • Visual feedback during long operations prevents perceived failures
  • Alternative suggestions help users refine vague queries
  • Export functionality is critical for business user adoption

What's next for Smart Revenue Recognition and Churn Prediction for SaaS

Immediate Enhancements (Next 3 Months):

  1. Advanced ML Models

    • Implement gradient boosting (XGBoost, LightGBM) for improved accuracy
    • Add time-series forecasting for revenue prediction
    • Build customer lifetime value (CLV) prediction models
    • Develop anomaly detection for unusual usage patterns
  2. Expanded NL-to-SQL Capabilities

    • Increase query templates from 16 to 50+ covering edge cases
    • Add support for JOIN operations across multiple tables
    • Implement query optimization suggestions
    • Enable query history and favorites
  3. Enhanced Visualization

    • Add customizable dashboards with drag-and-drop widgets
    • Implement drill-down capabilities for detailed analysis
    • Create executive summary reports with automated insights
    • Build real-time monitoring dashboards

Medium-term Goals (3-6 Months):

  1. Multi-tenant Architecture

    • Implement customer isolation and data segregation
    • Build role-based access control (RBAC)
    • Add organization-level configurations
    • Create usage tracking and billing system
  2. Integration Expansion

    • Connect to Stripe, Chargebee, and other payment platforms
    • Integrate with Salesforce and HubSpot for CRM data
    • Add Slack/Teams notifications for high-risk customers
    • Build webhook system for third-party integrations
  3. Automated Actions

    • Trigger email campaigns for at-risk customers
    • Create automatic discount offers based on churn probability
    • Schedule account manager interventions
    • Generate personalized retention strategies
  4. Advanced Analytics

    • Cohort analysis for customer segments
    • A/B testing framework for retention strategies
    • Churn driver analysis across customer segments
    • Competitive benchmarking against industry standards

Long-term Vision (6-12 Months):

  1. Generative AI Integration

    • Replace TF-IDF with large language model embeddings
    • Implement conversational AI for multi-turn queries
    • Add natural language report generation
    • Build AI-powered recommendation engine
  2. Real-time Processing

    • Move from batch to streaming predictions
    • Implement Apache Kafka for event processing
    • Build real-time feature computation
    • Create instant alerts for critical churn signals
  3. Mobile Application

    • Develop iOS and Android apps for executives
    • Implement push notifications for urgent insights
    • Create offline mode for viewing cached reports
    • Build voice-activated query interface
  4. Marketplace and Ecosystem

    • Create plugin system for custom integrations
    • Build community-contributed query templates
    • Develop industry-specific prediction models
    • Establish partner network for implementation services
  5. Compliance and Security

    • Achieve SOC 2 Type II certification
    • Implement GDPR-compliant data handling
    • Add audit logging for all predictions and queries
    • Build data retention and deletion workflows
  6. Enterprise Features

    • Multi-region deployment for data sovereignty
    • Custom model training on customer-specific data
    • White-label deployment options
    • SLA guarantees with 99.9% uptime

Research Directions:

  1. Explainable AI Advancement

    • Explore counterfactual explanations
    • Implement attention mechanisms for feature importance
    • Build causal inference models
    • Develop fairness and bias detection
  2. Automated Feature Engineering

    • Implement AutoML for feature discovery
    • Build temporal feature extraction
    • Create interaction term detection
    • Develop domain-specific feature libraries
  3. Query Understanding

    • Research semantic parsing improvements
    • Explore few-shot learning for new query patterns
    • Implement active learning from user corrections
    • Build context-aware query disambiguation

Business Expansion:

  1. Vertical Specialization

    • Create industry-specific models (B2B SaaS, consumer apps, etc.)
    • Build compliance packages for regulated industries
    • Develop templates for common business questions by vertical
  2. Professional Services

    • Offer implementation and training services
    • Provide custom model development
    • Build managed service offering
  3. Community Building

    • Create user forums and knowledge base
    • Host webinars on churn prediction best practices
    • Publish research papers and case studies
    • Build certification program for practitioners

Built With

  • bigquery
  • cloudrun
  • fivetran
  • gcs
  • python
  • streamlit