FastProfile AI - Submission
Inspiration
From my previous data science work, I found that more than 90% of the time goes to learning the dataset, exploring it, and engineering features, not to model building. Putting data into a model isn't hard; the hard part is understanding what goes into the model and dealing with messy, dirty data.
Meanwhile, there's a growing concern about data privacy. Companies hesitate to send sensitive data to cloud-based AI services, especially in regulated industries like healthcare, finance, and government.
FastProfile AI was born from these two pain points:
- Accelerate the data exploration bottleneck
- Enable privacy-preserving AI assistance
I wanted to build a tool that helps managers and analysts understand datasets faster, so they can set team goals more efficiently, while ensuring sensitive information never leaves their control.
What it does
FastProfile AI is a privacy-first data exploration assistant that combines automated profiling with AI-powered insights.
Core Features:
Automated Data Profiling (Local Processing)
- Analyzes each column: data types, distributions, missing values, outliers
- Detects 6+ PII patterns (SSN, credit cards, emails, phone numbers, IPs, zip codes)
- Calculates correlation matrices and identifies data quality issues
- Everything runs locally: your raw data never leaves your machine
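The PII check in the list above can be sketched with a few regular expressions. This is a minimal illustration, not FastProfile AI's actual detector: the pattern set, the `detect_pii` name, and the 80% match threshold are all assumptions, and it uses only the standard library rather than Pandas to stay self-contained.

```python
import re

# Hypothetical patterns for illustration; a real detector would need stricter
# validation (for example a Luhn check for credit card numbers).
PII_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "phone": re.compile(r"^\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$"),
    "ipv4": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
}


def detect_pii(values, threshold=0.8):
    """Return PII labels whose pattern matches at least `threshold` of the non-empty values."""
    non_empty = [str(v).strip() for v in values if v not in (None, "")]
    if not non_empty:
        return []
    hits = []
    for label, pattern in PII_PATTERNS.items():
        matched = sum(1 for v in non_empty if pattern.match(v))
        if matched / len(non_empty) >= threshold:
            hits.append(label)
    return hits


# Empty cells are ignored, so one missing value does not hide the pattern.
emails = ["a@example.com", "b.c@example.org", "", "d@example.net"]
print(detect_pii(emails))
```

Matching on a share of values rather than on every value keeps a column flagged even when a few cells are malformed.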
Configurable Privacy Masking
- Column-level privacy policies: Allow, Mask, or Deny
- 10+ masking strategies: pseudonymization, partial reveal, numeric bucketing, date truncation, etc.
- Real-time preview of masked data before sending to AI
- Audit logs track what data is shared with the LLM
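A few of the masking strategies named above, combined with Allow/Mask/Deny column policies, might look like the sketch below. The function names, the salt, the bucket width, and the choice to deny unlisted columns by default are assumptions made for illustration, not the project's actual code.

```python
import hashlib
from datetime import date


def pseudonymize(value, salt="demo-salt"):
    """Replace a value with a stable token so equal inputs still match after masking."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return f"id_{digest[:8]}"


def partial_reveal(value, visible=4):
    """Keep only the last `visible` characters and star out the rest."""
    text = str(value)
    if len(text) <= visible:
        return "*" * len(text)
    return "*" * (len(text) - visible) + text[-visible:]


def bucket_numeric(value, width=10):
    """Map a number to the range it falls in, e.g. 37 -> '30-39'."""
    low = (int(value) // width) * width
    return f"{low}-{low + width - 1}"


def truncate_date(value):
    """Drop the day so only year and month remain."""
    return value.strftime("%Y-%m")


def apply_policies(row, policies):
    """Apply a per-column policy: 'allow', 'deny', or a masking function.

    Columns without a policy are denied (an assumed safe default).
    """
    masked = {}
    for column, value in row.items():
        policy = policies.get(column, "deny")
        if policy == "allow":
            masked[column] = value
        elif callable(policy):
            masked[column] = policy(value)
        # 'deny' (or anything unrecognized) drops the column entirely
    return masked


row = {"name": "Alice", "card": "4111111111111111", "age": 37,
       "signup": date(2024, 3, 15), "ssn": "123-45-6789", "country": "US"}
policies = {"name": pseudonymize, "card": partial_reveal, "age": bucket_numeric,
            "signup": truncate_date, "ssn": "deny", "country": "allow"}
print(apply_policies(row, policies))
```

Printing the masked row is the same idea as the live preview: the user sees exactly what would be shared before anything is sent.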
AI Chat Assistant (Privacy-Safe)
- Ask questions in plain English: "What does each row represent?" "How should I join these tables?"
- Only sanitized summaries are sent to OpenAI, never raw data
- Multi-table relationship detection and join recommendations
- Token/cost estimates before each query
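A token and cost estimate before each query can be approximated without calling any API. The sketch below assumes the common rule of thumb of roughly four characters per token for English text; the function names are hypothetical, and the price is a caller-supplied parameter because real pricing varies by model and changes over time. An exact count would need the model's own tokenizer.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token count from character length (heuristic, not a real tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(prompt, usd_per_1k_input_tokens):
    """Return (estimated_tokens, estimated_usd) for sending `prompt` as input."""
    tokens = estimate_tokens(prompt)
    return tokens, tokens / 1000 * usd_per_1k_input_tokens


prompt = "Column 'age': 1000 rows, 12 missing, min 18, max 91. What does each row represent?"
# The price below is a placeholder value, not a quoted OpenAI rate.
tokens, usd = estimate_cost(prompt, usd_per_1k_input_tokens=0.01)
print(f"~{tokens} tokens, ~${usd:.5f}")
```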
Multi-Table Analysis
- Upload multiple related CSVs
- Automatic relationship detection (one-to-one, one-to-many, many-to-many)
- Suggests optimal join strategies for creating master tables
- Ideal for relational database exports
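One way to classify a relationship between two key columns is to count how often each shared key appears on each side. The sketch below is an assumed approach with a hypothetical `classify_relationship` helper; it is not necessarily how FastProfile AI decides cardinality.

```python
from collections import Counter


def classify_relationship(left_keys, right_keys):
    """Classify the join cardinality between two key columns.

    Only keys present on both sides are considered; None values are ignored.
    """
    left_counts = Counter(k for k in left_keys if k is not None)
    right_counts = Counter(k for k in right_keys if k is not None)
    shared = left_counts.keys() & right_counts.keys()
    if not shared:
        return "none"
    left_unique = all(left_counts[k] == 1 for k in shared)
    right_unique = all(right_counts[k] == 1 for k in shared)
    if left_unique and right_unique:
        return "one-to-one"
    if left_unique:
        return "one-to-many"
    if right_unique:
        return "many-to-one"
    return "many-to-many"


customer_ids = [1, 2, 3]            # one row per customer
order_customer_ids = [1, 1, 2, 3, 3]  # several orders per customer
print(classify_relationship(customer_ids, order_customer_ids))
```

A one-to-many result suggests joining from the "many" side so that no rows are duplicated unexpectedly in the master table.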
Who Benefits:
- Data Scientists: Skip manual profiling, focus on modeling
- Business Leaders: Understand datasets without coding
- Compliance Teams: Audit and enforce privacy policies automatically
- Data Engineers: Detect table relationships and data quality issues early
How we built it
Tech Stack:
- Python - Core data processing and profiling logic
- Streamlit - Interactive web UI for rapid prototyping and demo
- Pandas/NumPy - Data manipulation and statistical analysis
- OpenAI GPT-4 - AI-powered chat and insights
- Pydantic - Data validation and schema enforcement
Architecture:
+-----------------+
| User's CSV      |  (Never leaves local machine)
+--------+--------+
         |
         v
+-----------------+
| Local Profiling |  (Column stats, PII detection, outliers)
+--------+--------+
         |
         v
+-----------------+
| Privacy Masking |  (Configurable policies per column)
+--------+--------+
         |
         v
+-----------------+
| Sanitized View  |  (Only summaries/masked samples)
+--------+--------+
         |
         v
+-----------------+
| OpenAI API      |  (Chat assistant with sanitized data)
+-----------------+
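The flow in the diagram can be sketched as a small function that turns raw columns into a summary-only payload. This is an illustration under stated assumptions: the `profile_column` and `build_sanitized_view` names are hypothetical, the table is a plain dict of lists instead of a Pandas DataFrame, and the final API call is left out so that nothing here depends on a network or a key.

```python
import json
import statistics


def profile_column(values):
    """Summarize a column without keeping any individual row values."""
    present = [v for v in values if v not in (None, "")]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if numeric and len(numeric) == len(present):
        summary.update(min=min(numeric), max=max(numeric),
                       mean=statistics.fmean(numeric))
    return summary


def build_sanitized_view(table, denied=()):
    """Profile every column except those the privacy policy denies."""
    return {name: profile_column(values)
            for name, values in table.items() if name not in denied}


table = {
    "age": [30, 40, None, 50],
    "ssn": ["123-45-6789", "987-65-4321", "111-22-3333", "444-55-6666"],
}
view = build_sanitized_view(table, denied=("ssn",))
payload = json.dumps(view)  # this string, not the table, is what would go to the LLM
print(payload)
```

Note that min and max are themselves real values from the column, so a stricter policy might bucket or drop them for sensitive fields.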
Key Design Decisions:
- Local-first processing - Raw data never sent to cloud
- Iterative masking preview - Users see exactly what AI sees
- Token estimation - Cost transparency before LLM calls
- Session persistence - Resume analysis across sessions
- Multi-table support - Detect relationships automatically using column name similarity + value overlap
Challenges we ran into
1. Working Solo
I didn't realize there was a Discord channel for finding partners until I was already in the hackathon. With more people, I could have achieved:
- A polished frontend design
- More advanced masking algorithms
2. Privacy-Performance Trade-off
Balancing comprehensive profiling with privacy protection was tricky:
- Too much masking → AI can't give useful insights
- Too little masking → Privacy risks
- Solution: Real-time preview + token estimates helped users find the sweet spot
3. Multi-Table Relationship Detection
Detecting joins between tables without knowing the schema beforehand required:
- Name similarity matching (e.g., "customer_id" vs "cust_id")
- Sample value overlap analysis (computationally expensive)
- Heuristics for foreign key detection (uniqueness ratios)
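The three signals above can be combined into a simple join-candidate score. The sketch below is one possible reading, not the project's actual algorithm: it uses `difflib.SequenceMatcher` for name similarity, a sampled containment ratio for value overlap, and a distinct-to-total ratio as the uniqueness heuristic. The weights, sample size, and function names are assumptions.

```python
import random
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity of two column names in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def value_overlap(left, right, sample_size=1000, seed=0):
    """Fraction of (sampled) distinct left values that also appear in right."""
    distinct = list(set(left))
    if not distinct:
        return 0.0
    if len(distinct) > sample_size:
        # Sampling caps the cost of the comparison on large columns.
        distinct = random.Random(seed).sample(distinct, sample_size)
    right_set = set(right)
    return sum(1 for v in distinct if v in right_set) / len(distinct)


def uniqueness_ratio(values):
    """Distinct values divided by total values; 1.0 suggests a primary key."""
    return len(set(values)) / len(values) if values else 0.0


def score_join(left_name, left_values, right_name, right_values, name_weight=0.4):
    """Weighted blend of name similarity and value overlap (weights are arbitrary)."""
    return (name_weight * name_similarity(left_name, right_name)
            + (1 - name_weight) * value_overlap(left_values, right_values))


orders_cust = [1, 1, 2, 3, 3]
customers_id = [1, 2, 3, 4]
print(score_join("cust_id", orders_cust, "customer_id", customers_id))
print(uniqueness_ratio(customers_id), uniqueness_ratio(orders_cust))
```

A column with a uniqueness ratio near 1.0 on one side and a high overlap from the other side is a plausible primary key and foreign key pair.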
4. Handling Diverse CSV Formats
Real-world CSVs are messy:
- Multiple encodings (UTF-8, Latin-1, CP1252)
- Mixed data types in columns
- Inconsistent date formats
- Solution: Robust error handling with encoding fallbacks
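An encoding fallback like the one described can be sketched as a loop over candidate encodings. This is an assumed implementation using only the standard library; the helper names and the encoding order are illustrative. Latin-1 goes last because it maps every byte to a character and therefore never fails.

```python
import csv
import io

# Order matters: strict encodings first, the catch-all Latin-1 last.
ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(raw, encodings=ENCODINGS):
    """Try each encoding in turn and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the file")


def read_csv_bytes(raw):
    """Decode raw CSV bytes and parse them into a list of row dicts."""
    text, encoding = decode_with_fallback(raw)
    rows = list(csv.DictReader(io.StringIO(text, newline="")))
    return rows, encoding


raw = "name,city\nJos\u00e9,M\u00fcnchen\n".encode("cp1252")
rows, encoding = read_csv_bytes(raw)
print(encoding, rows)
```

A successful decode does not prove the encoding is right, only that the bytes are valid in it, so showing the detected encoding to the user is a reasonable safeguard.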
Accomplishments that we're proud of
Complete product delivered in 24 hours of coding
- Full workflow: Upload → Profile → Privacy → Chat
- Multi-table analysis with relationship detection
- 10+ masking strategies with live preview
- Working AI chat integration
Production-ready privacy features
- Column-level policies
- Audit logging
- Token/cost estimation
- Real masking algorithms (not just placeholders)
Solving a real problem
- Based on actual pain points from data science work
- Addresses genuine privacy concerns in enterprises
- Could save teams hours per dataset
Accessible to non-technical users
- No coding required for basic analysis
- Plain English chat interface
- Visual previews at every step
What's next for FastProfile AI
Short-term (Next 3 months):
Automated Visualization Engine
- AI-generated charts based on data types and questions
- Interactive dashboards without code
Model Recommendation System
- Suggest ML models based on data characteristics
- Generate starter code for scikit-learn, XGBoost, PyTorch
Long-term Vision:
- Desktop app for fully offline operation
- Database connectors (PostgreSQL, MySQL, MongoDB)
- Team collaboration features (shared sessions, comments)
- Custom masking functions (user-defined Python scripts)
- LLM fine-tuning on privacy-safe synthetic data
The Ultimate Goal:
With AI assistance, even someone who doesn't know Python can explore data, build models, and extract insights, all while keeping complete control over their sensitive information.
Try it out
# Installation
git clone https://github.com/yourusername/fastprofile-ai
cd fastprofile-ai
pip install -r requirements.txt
# Add your OpenAI API key
echo "OPENAI_API_KEY=your-key-here" > .env
# Launch
streamlit run app.py
Demo Video: https://youtu.be/ZL1cSPmrdvQ
GitHub Repo: https://github.com/JUNJIEQUANT/hackathon26/
Built with ❤️ in 24 hours for [Hackathon Name]