FastProfile AI - Submission

Inspiration

In my previous data science work, I found that over 90% of my time went to learning the dataset, exploring it, and engineering features, not to building models. Feeding data into a model isn't hard; the hard part is understanding what goes into the model and dealing with messy, dirty data.

Meanwhile, there's a growing concern about data privacy. Companies hesitate to send sensitive data to cloud-based AI services, especially in regulated industries like healthcare, finance, and government.

FastProfile AI was born from these two pain points:

Accelerate the data exploration bottleneck
Enable privacy-preserving AI assistance

I wanted to build a tool that helps managers and analysts understand datasets faster, so they can set team goals sooner, while ensuring sensitive information never leaves their control.

What it does

FastProfile AI is a privacy-first data exploration assistant that combines automated profiling with AI-powered insights.

Core Features:

🔍 Automated Data Profiling (Local Processing)

Analyzes each column: data types, distributions, missing values, outliers
Detects 6+ PII patterns (SSN, credit cards, emails, phone numbers, IPs, zip codes)
Calculates correlation matrices and identifies data quality issues
Everything runs locally—your raw data never leaves your machine
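
As a rough illustration of the local profiling step, here is a minimal sketch that uses only the Python standard library. The function names, the three regex patterns, the 80% match threshold, and the 1.5 x IQR outlier rule are all assumptions made for this example; they are not taken from the FastProfile AI codebase, which relies on Pandas/NumPy and covers more PII patterns.

```python
import re
import statistics

# Hypothetical subset of PII patterns; real detection needs stricter validation.
PII_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ipv4": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
}


def _non_empty(values):
    """Drop None and empty strings, which are treated as missing."""
    return [v for v in values if v not in (None, "")]


def detect_pii(values, threshold=0.8):
    """Return labels whose pattern matches at least `threshold` of the non-empty values."""
    present = _non_empty(values)
    if not present:
        return []
    hits = []
    for label, pattern in PII_PATTERNS.items():
        matched = sum(1 for v in present if pattern.match(str(v)))
        if matched / len(present) >= threshold:
            hits.append(label)
    return hits


def _to_floats(values):
    """Return the values as floats, or None if any value is not numeric."""
    try:
        return [float(v) for v in values]
    except (TypeError, ValueError):
        return None


def profile_column(values):
    """Summarize one column: counts, missing values, PII hints, and IQR outliers."""
    present = _non_empty(values)
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "pii": detect_pii(present),
    }
    numbers = _to_floats(present)
    if numbers is not None and len(numbers) >= 4:
        q1, _median, q3 = statistics.quantiles(numbers, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        profile["outliers"] = [x for x in numbers if x < low or x > high]
    return profile
```

Because nothing here touches the network, a profile like this can be computed entirely on the user's machine before any masking or AI step.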

🔒 Configurable Privacy Masking

Column-level privacy policies: Allow, Mask, or Deny
10+ masking strategies: pseudonymization, partial reveal, numeric bucketing, date truncation, etc.
Real-time preview of masked data before sending to AI
Audit logs track what data is shared with the LLM
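
The Allow / Mask / Deny policies and a few of the masking strategies listed above could be sketched as follows. This is an illustrative example under my own assumptions: the function names, the policy format, the fixed salt, and the deny-by-default rule for unlisted columns are hypothetical and not the project's actual implementation.

```python
import hashlib


def pseudonymize(value, salt="demo-salt"):
    """Replace a value with a stable, non-reversible token (the salt is a placeholder)."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return "id_" + digest[:10]


def partial_reveal(value, visible=4, fill="*"):
    """Keep only the last `visible` characters, e.g. for card numbers."""
    text = str(value)
    if len(text) <= visible:
        return fill * len(text)
    return fill * (len(text) - visible) + text[-visible:]


def bucket_numeric(value, width=10):
    """Map an integer-like value to a range label such as '30-39'."""
    low = int(float(value) // width) * width
    return f"{low}-{low + width - 1}"


def truncate_date(iso_date):
    """Reduce an ISO date 'YYYY-MM-DD' to 'YYYY-MM'."""
    return str(iso_date)[:7]


STRATEGIES = {
    "pseudonymize": pseudonymize,
    "partial_reveal": partial_reveal,
    "bucket_numeric": bucket_numeric,
    "truncate_date": truncate_date,
}


def apply_policy(row, policy):
    """Apply a column-level policy to one row.

    `policy` maps column name -> "allow", "deny", or ("mask", strategy_name).
    Columns missing from the policy are withheld (deny by default).
    """
    masked = {}
    for column, value in row.items():
        rule = policy.get(column, "deny")
        if rule == "allow":
            masked[column] = value
        elif rule == "deny":
            continue
        else:
            _kind, strategy = rule
            masked[column] = STRATEGIES[strategy](value)
    return masked
```

Running `apply_policy` over a handful of sample rows is one way to produce the kind of preview described above, since its output is exactly what would be shared.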

💬 AI Chat Assistant (Privacy-Safe)

Ask questions in plain English: "What does each row represent?" "How should I join these tables?"
Only sanitized summaries are sent to OpenAI—never raw data
Multi-table relationship detection and join recommendations
Token/cost estimates before each query
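
A token and cost estimate can be approximated before any request is made. The sketch below assumes the common rule of thumb of roughly four characters per token for English text, which is only an approximation; a real tokenizer would be more accurate. The function names are hypothetical, and the per-1,000-token prices are parameters supplied by the caller, not real OpenAI prices.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token count using a characters-per-token heuristic (an approximation)."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(prompt, expected_reply_tokens, usd_per_1k_input, usd_per_1k_output):
    """Estimate the cost of one chat call from caller-supplied per-1k-token prices."""
    input_tokens = estimate_tokens(prompt)
    cost = (
        input_tokens / 1000 * usd_per_1k_input
        + expected_reply_tokens / 1000 * usd_per_1k_output
    )
    return {
        "input_tokens": input_tokens,
        "output_tokens": expected_reply_tokens,
        "estimated_usd": round(cost, 6),
    }
```

Showing this dictionary to the user before sending the prompt gives them a chance to trim the sanitized summary if the estimate looks too high.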

📊 Multi-Table Analysis

Upload multiple related CSVs
Automatic relationship detection (one-to-one, one-to-many, many-to-many)
Suggests optimal join strategies for creating master tables
Ideal for relational database exports

Who Benefits:

Data Scientists: Skip manual profiling, focus on modeling
Business Leaders: Understand datasets without coding
Compliance Teams: Audit and enforce privacy policies automatically
Data Engineers: Detect table relationships and data quality issues early

How we built it

Tech Stack:

Python - Core data processing and profiling logic
Streamlit - Interactive web UI for rapid prototyping and demo
Pandas/NumPy - Data manipulation and statistical analysis
OpenAI GPT-4 - AI-powered chat and insights
Pydantic - Data validation and schema enforcement

Architecture:

┌─────────────────┐
│  User's CSV     │ (Never leaves local machine)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Local Profiling │ (Column stats, PII detection, outliers)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Privacy Masking │ (Configurable policies per column)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Sanitized View  │ (Only summaries/masked samples)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ OpenAI API      │ (Chat assistant with sanitized data)
└─────────────────┘
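
To make the last two stages of the diagram concrete, here is a minimal sketch of building a sanitized view from column profiles and recording an audit entry for it. The names `build_sanitized_view` and `audit_record`, the profile keys, and the audit fields are assumptions for illustration only; the actual data structures in FastProfile AI may differ.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical whitelist: only these aggregate fields may leave the machine.
SHAREABLE_KEYS = ("dtype", "missing", "unique")


def build_sanitized_view(profiles, policy):
    """Keep aggregate stats for columns that are not denied; drop everything else."""
    view = {}
    for column, stats in profiles.items():
        if policy.get(column, "deny") == "deny":
            continue
        view[column] = {k: stats[k] for k in SHAREABLE_KEYS if k in stats}
    return view


def audit_record(view):
    """Describe what is about to be shared, without storing the payload itself."""
    payload = json.dumps(view, sort_keys=True)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "columns_shared": sorted(view),
        "payload_sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "payload_chars": len(payload),
    }
```

The whitelist approach means a new field added to a profile is withheld until someone explicitly marks it as shareable.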

Key Design Decisions:

Local-first processing - Raw data never sent to cloud
Iterative masking preview - Users see exactly what AI sees
Token estimation - Cost transparency before LLM calls
Session persistence - Resume analysis across sessions
Multi-table support - Detect relationships automatically using column name similarity + value overlap

Challenges we ran into

1. Working Solo

I didn't realize there was a Discord channel for finding partners until I was already in the hackathon. With more people, I could have achieved:

A polished frontend design
More advanced masking algorithms

2. Privacy-Performance Trade-off

Balancing comprehensive profiling with privacy protection was tricky:

Too much masking → AI can't give useful insights
Too little masking → Privacy risks
Solution: Real-time preview + token estimates helped users find the sweet spot

3. Multi-Table Relationship Detection

Detecting joins between tables without knowing the schema beforehand required:

Name similarity matching (e.g., "customer_id" vs "cust_id")
Sample value overlap analysis (computationally expensive)
Heuristics for foreign key detection (uniqueness ratios)
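
The three signals above can be combined into a simple join-candidate score. This is a sketch under stated assumptions: it uses `difflib.SequenceMatcher` for name similarity, set intersection for value overlap, and a 0.99 uniqueness threshold to guess which side looks like a key. The function names and the threshold are hypothetical, not the project's actual heuristics.

```python
from difflib import SequenceMatcher


def name_similarity(left_name, right_name):
    """Similarity in [0, 1] between two column names, case-insensitive."""
    return SequenceMatcher(None, left_name.lower(), right_name.lower()).ratio()


def value_overlap(left_values, right_values):
    """Share of the smaller column's distinct values that also appear in the other."""
    left, right = set(left_values), set(right_values)
    if not left or not right:
        return 0.0
    return len(left & right) / min(len(left), len(right))


def uniqueness_ratio(values):
    """Distinct values divided by total values; 1.0 suggests a key column."""
    values = list(values)
    return len(set(values)) / len(values) if values else 0.0


def classify_relationship(left_values, right_values, unique_threshold=0.99):
    """Guess the cardinality from which side looks unique."""
    left_unique = uniqueness_ratio(left_values) >= unique_threshold
    right_unique = uniqueness_ratio(right_values) >= unique_threshold
    if left_unique and right_unique:
        return "one-to-one"
    if left_unique:
        return "one-to-many"
    if right_unique:
        return "many-to-one"
    return "many-to-many"


def score_join_candidate(left_name, left_values, right_name, right_values):
    """Bundle the three signals for one pair of columns."""
    return {
        "name_similarity": name_similarity(left_name, right_name),
        "value_overlap": value_overlap(left_values, right_values),
        "relationship": classify_relationship(left_values, right_values),
    }
```

Since comparing full columns is the expensive part, one option is to pass a sample of each column to `value_overlap` and only run the full comparison on pairs whose names already look similar.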

4. Handling Diverse CSV Formats

Real-world CSVs are messy:

Multiple encodings (UTF-8, Latin-1, CP1252)
Mixed data types in columns
Inconsistent date formats
Solution: Robust error handling with encoding fallbacks
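
An encoding fallback of this kind can be sketched with the standard library alone. The order of encodings and the function names below are assumptions for illustration; the project itself reads CSVs with Pandas, so its actual handling may look different.

```python
import csv
import io

# Tried in order. "utf-8-sig" also handles plain UTF-8 and strips a leading BOM.
# "latin-1" maps every byte, so it acts as the last-resort fallback.
ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")


def decode_with_fallback(raw_bytes, encodings=ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes without error."""
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the file")


def read_csv_bytes(raw_bytes):
    """Parse uploaded CSV bytes into a list of dict rows, reporting the encoding used."""
    text, encoding = decode_with_fallback(raw_bytes)
    rows = list(csv.DictReader(io.StringIO(text, newline="")))
    return rows, encoding
```

Note that a successful decode does not prove the guess was right: CP1252 and Latin-1 accept almost any byte sequence, so reporting the chosen encoding back to the user lets them spot garbled characters early.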

Accomplishments that we're proud of

✅ Complete product delivered in 24 hours of coding

Full workflow: Upload → Profile → Privacy → Chat
Multi-table analysis with relationship detection
10+ masking strategies with live preview
Working AI chat integration

✅ Production-ready privacy features

Column-level policies
Audit logging
Token/cost estimation
Real masking algorithms (not just placeholders)

✅ Solving a real problem

Based on actual pain points from data science work
Addresses genuine privacy concerns in enterprises
Could save teams hours per dataset

✅ Accessible to non-technical users

No coding required for basic analysis
Plain English chat interface
Visual previews at every step

What's next for FastProfile AI

Short-term (Next 3 months):

Automated Visualization Engine
- AI-generated charts based on data types and questions
- Interactive dashboards without code
Model Recommendation System
- Suggest ML models based on data characteristics
- Generate starter code for scikit-learn, XGBoost, PyTorch

Long-term Vision:

Desktop app for fully offline operation
Database connectors (PostgreSQL, MySQL, MongoDB)
Team collaboration features (shared sessions, comments)
Custom masking functions (user-defined Python scripts)
LLM fine-tuning on privacy-safe synthetic data

The Ultimate Goal:

With AI assistance, even someone who doesn't know Python can explore data, build models, and extract insights, all while keeping complete control over their sensitive information.

Try it out

# Installation
git clone https://github.com/yourusername/fastprofile-ai
cd fastprofile-ai
pip install -r requirements.txt

# Add your OpenAI API key
echo "OPENAI_API_KEY=your-key-here" > .env

# Launch
streamlit run app.py

Demo Video: https://youtu.be/ZL1cSPmrdvQ
GitHub Repo: https://github.com/JUNJIEQUANT/hackathon26/

Built with ❤️ in 24 hours for [Hackathon Name]

Built With

python

Updates

JUNJIEQUANT Li started this project — Nov 08, 2025 04:37 PM EST
