FastProfile AI - Submission

Inspiration

In my previous data science work, I found that more than 90% of the time goes to learning the dataset, exploring it, and engineering features, not to building models. Putting data into a model isn't hard; the hard part is understanding what goes into the model and dealing with messy, dirty data.

Meanwhile, there's a growing concern about data privacy. Companies hesitate to send sensitive data to cloud-based AI services, especially in regulated industries like healthcare, finance, and government.

FastProfile AI was born from these two pain points:

  1. Accelerate the data exploration bottleneck
  2. Enable privacy-preserving AI assistance

I wanted to build a tool that helps managers and analysts understand datasets faster, so they can set team goals sooner, while ensuring sensitive information never leaves their control.


What it does

FastProfile AI is a privacy-first data exploration assistant that combines automated profiling with AI-powered insights.

Core Features:

🔍 Automated Data Profiling (Local Processing)

  • Analyzes each column: data types, distributions, missing values, outliers
  • Detects 6+ PII patterns (SSN, credit cards, emails, phone numbers, IPs, zip codes)
  • Calculates correlation matrices and identifies data quality issues
  • Everything runs locally: your raw data never leaves your machine
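
As a rough illustration of pattern-based PII detection, here is a minimal sketch that flags a column when most of its values match a regular expression. The pattern set, the `detect_pii` name, and the 0.8 threshold are assumptions made for this example, not the project's actual rules; a real detector would also need validation such as checksum tests for card numbers.

```python
import re

# Hypothetical patterns for illustration only; they are deliberately simple
# and will produce false positives on real data.
PII_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
    "ipv4": re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
    "zip": re.compile(r"^\d{5}(?:-\d{4})?$"),
}


def detect_pii(values, threshold=0.8):
    """Return labels whose pattern matches at least `threshold` of the non-empty values."""
    cleaned = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not cleaned:
        return []
    hits = []
    for label, pattern in PII_PATTERNS.items():
        share = sum(1 for v in cleaned if pattern.match(v)) / len(cleaned)
        if share >= threshold:
            hits.append(label)
    return hits
```

Calling `detect_pii(column_values)` on each column yields a list of suspected PII types that could feed the privacy policy defaults.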

🔒 Configurable Privacy Masking

  • Column-level privacy policies: Allow, Mask, or Deny
  • 10+ masking strategies: pseudonymization, partial reveal, numeric bucketing, date truncation, etc.
  • Real-time preview of masked data before sending to AI
  • Audit logs track what data is shared with the LLM
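
To make the masking strategies concrete, the sketch below shows one plausible way to write four of them: pseudonymization, partial reveal, numeric bucketing, and date truncation. The function names, the salt, and the default widths are hypothetical choices for illustration and are not taken from the project's code.

```python
import hashlib
from datetime import date


def pseudonymize(value, salt="demo-salt"):
    """Replace a value with a stable token derived from a salted SHA-256 hash."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return f"anon_{digest[:10]}"


def partial_reveal(value, visible=4, fill="*"):
    """Keep only the last `visible` characters and hide the rest."""
    text = str(value)
    if len(text) <= visible:
        return fill * len(text)
    return fill * (len(text) - visible) + text[-visible:]


def bucket_number(value, width=10):
    """Map a non-negative number to a range label such as '30-39'."""
    low = int(value // width) * width
    return f"{low}-{low + width - 1}"


def truncate_date(value):
    """Reduce a date to year and month, dropping the day."""
    return value.strftime("%Y-%m")
```

Because `pseudonymize` is deterministic for a given salt, the same input always maps to the same token, so joins and group-bys on the masked column still work.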

💬 AI Chat Assistant (Privacy-Safe)

  • Ask questions in plain English: "What does each row represent?" "How should I join these tables?"
  • Only sanitized summaries are sent to OpenAI, never raw data
  • Multi-table relationship detection and join recommendations
  • Token/cost estimates before each query
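
A token and cost estimate can be approximated before any API call is made. The sketch below assumes roughly four characters per token, which is only a rule of thumb for English text, and uses a placeholder price; both constants and the function names are assumptions for illustration. An exact count would require the model's own tokenizer.

```python
import json
import math

# Assumptions: ~4 characters per token is a rough heuristic, and the price
# below is a placeholder, not an actual OpenAI rate.
CHARS_PER_TOKEN = 4
PRICE_PER_1K_INPUT_TOKENS = 0.01  # hypothetical, in USD


def estimate_tokens(text):
    """Approximate the token count of a string from its length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_query(summary, question):
    """Serialize a sanitized summary plus the question and estimate the prompt cost."""
    prompt = json.dumps(summary, sort_keys=True) + "\n" + question
    tokens = estimate_tokens(prompt)
    cost = tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
    return {"prompt_tokens": tokens, "estimated_cost_usd": round(cost, 6)}
```

Showing this estimate next to the masked preview lets a user trim the summary before paying for a query.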

📊 Multi-Table Analysis

  • Upload multiple related CSVs
  • Automatic relationship detection (one-to-one, one-to-many, many-to-many)
  • Suggests optimal join strategies for creating master tables
  • Ideal for relational database exports
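
One simple way to label a relationship as one-to-one, one-to-many, or many-to-many is to check whether the key column has duplicates on each side. The sketch below works on plain Python lists standing in for table columns; `classify_relationship` is a hypothetical helper, and it assumes the two columns have already been identified as a join candidate.

```python
def classify_relationship(left_keys, right_keys):
    """Classify a join on two key columns by whether each side contains duplicates."""
    left = [k for k in left_keys if k is not None]
    right = [k for k in right_keys if k is not None]
    left_unique = len(set(left)) == len(left)
    right_unique = len(set(right)) == len(right)
    if left_unique and right_unique:
        return "one-to-one"
    if left_unique:
        return "one-to-many"
    if right_unique:
        return "many-to-one"
    return "many-to-many"
```

For example, a `customers` table joined to an `orders` table on the customer key would typically come out as one-to-many, which suggests joining from the orders side to avoid row duplication surprises.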

Who Benefits:

  • Data Scientists: Skip manual profiling, focus on modeling
  • Business Leaders: Understand datasets without coding
  • Compliance Teams: Audit and enforce privacy policies automatically
  • Data Engineers: Detect table relationships and data quality issues early

How we built it

Tech Stack:

  • Python - Core data processing and profiling logic
  • Streamlit - Interactive web UI for rapid prototyping and demo
  • Pandas/NumPy - Data manipulation and statistical analysis
  • OpenAI GPT-4 - AI-powered chat and insights
  • Pydantic - Data validation and schema enforcement

Architecture:

┌─────────────────┐
│  User's CSV     │ (Never leaves local machine)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Local Profiling │ (Column stats, PII detection, outliers)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Privacy Masking │ (Configurable policies per column)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Sanitized View  │ (Only summaries/masked samples)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ OpenAI API      │ (Chat assistant with sanitized data)
└─────────────────┘

Key Design Decisions:

  1. Local-first processing - Raw data never sent to cloud
  2. Iterative masking preview - Users see exactly what AI sees
  3. Token estimation - Cost transparency before LLM calls
  4. Session persistence - Resume analysis across sessions
  5. Multi-table support - Detect relationships automatically using column name similarity + value overlap
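
The local-first and preview decisions can be pictured as a single step that turns raw rows into the view the AI is allowed to see. This is a minimal sketch under stated assumptions: the `Policy` enum, `build_sanitized_view`, the audit record shape, and the choice to treat columns without a policy as denied are all illustrative, not the project's confirmed design.

```python
from enum import Enum


class Policy(Enum):
    ALLOW = "allow"
    MASK = "mask"
    DENY = "deny"


def build_sanitized_view(rows, policies, maskers, audit_log):
    """Apply per-column policies to a list of row dicts.

    Columns without an explicit policy are treated as DENY (an assumption
    made here so that nothing is shared by accident).
    """
    sanitized = []
    for row in rows:
        out = {}
        for column, value in row.items():
            policy = policies.get(column, Policy.DENY)
            if policy is Policy.ALLOW:
                out[column] = value
            elif policy is Policy.MASK:
                out[column] = maskers[column](value)
            # DENY: the column is dropped from the output entirely.
        sanitized.append(out)
    shared = sorted(c for c, p in policies.items() if p is not Policy.DENY)
    audit_log.append({"shared_columns": shared, "row_count": len(sanitized)})
    return sanitized
```

The returned rows are exactly what a preview would display, so the user and the LLM see the same data, and each call leaves one entry in the audit log.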

Challenges we ran into

1. Working Solo: I didn't realize there was a Discord channel for finding partners until I was already in the hackathon. With more people, I could have achieved:

  • A polished frontend design
  • More advanced masking algorithms

2. Privacy-Performance Trade-off: Balancing comprehensive profiling with privacy protection was tricky:

  • Too much masking → AI can't give useful insights
  • Too little masking → Privacy risks
  • Solution: Real-time preview + token estimates helped users find the sweet spot

3. Multi-Table Relationship Detection: Detecting joins between tables without knowing the schema beforehand required:

  • Name similarity matching (e.g., "customer_id" vs "cust_id")
  • Sample value overlap analysis (computationally expensive)
  • Heuristics for foreign key detection (uniqueness ratios)
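
The three signals above can be sketched with the standard library alone. In this example, plain lists stand in for table columns, and `score_join`, the helper names, and the `sample_size` default are hypothetical; sampling only the left side is one assumed way to keep the overlap check cheap.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity of two column names in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(left, right):
    """Share of distinct left values that also appear on the right."""
    left_set, right_set = set(left), set(right)
    if not left_set:
        return 0.0
    return len(left_set & right_set) / len(left_set)


def uniqueness_ratio(values):
    """Distinct values divided by total values; 1.0 suggests a key column."""
    return len(set(values)) / len(values) if values else 0.0


def score_join(left_name, left_values, right_name, right_values, sample_size=1000):
    """Collect the join signals for one pair of columns."""
    sample = left_values[:sample_size]  # limit the cost of the overlap check
    return {
        "name_similarity": name_similarity(left_name, right_name),
        "overlap": value_overlap(sample, right_values),
        "left_uniqueness": uniqueness_ratio(left_values),
        "right_uniqueness": uniqueness_ratio(right_values),
    }
```

A pair with high name similarity, high overlap, and a uniqueness ratio of 1.0 on one side is a reasonable foreign-key candidate, with the unique side playing the primary-key role.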

4. Handling Diverse CSV Formats: Real-world CSVs are messy:

  • Multiple encodings (UTF-8, Latin-1, CP1252)
  • Mixed data types in columns
  • Inconsistent date formats
  • Solution: Robust error handling with encoding fallbacks
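
An encoding fallback can be as simple as trying each candidate in order and keeping the first that decodes cleanly. The sketch below is one possible version, not the project's actual loader; the function names and the ordering are assumptions. Latin-1 is tried last because it accepts every byte sequence and would otherwise mask the other encodings.

```python
import csv
import io

# Order matters: latin-1 can decode any bytes, so it must come last.
ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(raw, encodings=ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes `raw`."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


def read_csv_bytes(raw):
    """Parse uploaded CSV bytes into a list of row dicts plus the encoding used."""
    text, encoding = decode_with_fallback(raw)
    rows = list(csv.DictReader(io.StringIO(text, newline="")))
    return rows, encoding
```

Returning the encoding that succeeded makes it easy to surface in the UI when a file was not valid UTF-8.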

Accomplishments that we're proud of

✅ Complete product delivered in 24 hours of coding

  • Full workflow: Upload → Profile → Privacy → Chat
  • Multi-table analysis with relationship detection
  • 10+ masking strategies with live preview
  • Working AI chat integration

✅ Production-ready privacy features

  • Column-level policies
  • Audit logging
  • Token/cost estimation
  • Real masking algorithms (not just placeholders)

✅ Solving a real problem

  • Based on actual pain points from data science work
  • Addresses genuine privacy concerns in enterprises
  • Could save teams hours per dataset

✅ Accessible to non-technical users

  • No coding required for basic analysis
  • Plain English chat interface
  • Visual previews at every step

What's next for FastProfile AI

Short-term (Next 3 months):

  1. Automated Visualization Engine

    • AI-generated charts based on data types and questions
    • Interactive dashboards without code
  2. Model Recommendation System

    • Suggest ML models based on data characteristics
    • Generate starter code for scikit-learn, XGBoost, PyTorch

Long-term Vision:

  • Desktop app for fully offline operation
  • Database connectors (PostgreSQL, MySQL, MongoDB)
  • Team collaboration features (shared sessions, comments)
  • Custom masking functions (user-defined Python scripts)
  • LLM fine-tuning on privacy-safe synthetic data

The Ultimate Goal:

With AI assistance, even a person who doesn't know Python can explore data, build models, and extract insights, all while maintaining complete control over their sensitive information.


Try it out

# Installation
git clone https://github.com/yourusername/fastprofile-ai
cd fastprofile-ai
pip install -r requirements.txt

# Add your OpenAI API key
echo "OPENAI_API_KEY=your-key-here" > .env

# Launch
streamlit run app.py

Demo Video: https://youtu.be/ZL1cSPmrdvQ
GitHub Repo: https://github.com/JUNJIEQUANT/hackathon26/


Built with ❤️ in 24 hours for [Hackathon Name]
