KleanSQL: Query Your Data at the Speed of Thought
Inspiration
Every data analyst knows the pain: you have a CSV file with thousands of rows, and you need insights fast. Traditional SQL requires learning complex syntax, and even then, you're often stuck debugging queries for hours. We wanted to bridge the gap between natural language and data analysis, making powerful analytics accessible to everyone—not just SQL experts.
The inspiration came from watching teammates struggle with basic data questions during hackathons. "What's the average grade for female students?" shouldn't require writing `SELECT AVG(G1) FROM students WHERE sex = 'F'`. It should be as simple as asking the question in plain English.
What it does
KleanSQL transforms messy CSV files into intelligent, queryable datasets through a beautiful web interface. Upload any CSV or Excel file, and instantly start asking questions in natural language:
- "What's the average salary by department?"
- "Show me students with grades above 15"
- "Which cities have the highest population?"
Behind the scenes, KleanSQL:
- Profiles your data using DuckDB for lightning-fast analysis
- Generates clean SQL automatically via Claude Sonnet 4
- Handles missing data with intelligent imputation strategies
- Provides robust queries that work even with messy datasets
- Offers follow-up suggestions to deepen your analysis
The platform delivers both raw SQL (for transparency) and robust SQL (with data cleaning) so you can trust your results and learn from the generated queries.
How we built it
Architecture Overview: Our system is organized as a five-stage pipeline:
Data Ingestion & Profiling (`profiler.py`)
- DuckDB-powered CSV analysis with automatic type detection
- Comprehensive statistical profiling (null ratios, outliers, distributions)
- Semantic type classification (numeric, categorical, datetime, boolean)
- Optional Weaviate integration for vector-based column search
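To make the profiling step concrete, here is a minimal sketch of the kind of semantic-type classification described above, using pandas. The function name, thresholds, and output shape are illustrative assumptions, not the actual `profiler.py` internals:

```python
import pandas as pd

# Illustrative threshold -- the real profiler may use different rules.
CATEGORICAL_MAX_UNIQUE_RATIO = 0.05

def classify_column(series: pd.Series) -> dict:
    """Profile one column: null ratio plus a coarse semantic type."""
    null_ratio = series.isna().mean()
    non_null = series.dropna()
    if pd.api.types.is_bool_dtype(series):
        semantic = "boolean"
    elif pd.api.types.is_numeric_dtype(series):
        semantic = "numeric"
    elif pd.api.types.is_datetime64_any_dtype(series):
        semantic = "datetime"
    elif non_null.nunique() <= max(2, CATEGORICAL_MAX_UNIQUE_RATIO * len(non_null)):
        semantic = "categorical"
    else:
        semantic = "text"
    return {"semantic_type": semantic, "null_ratio": round(float(null_ratio), 3)}

df = pd.DataFrame({
    "G1": [14, 15, None, 18],
    "sex": ["F", "M", "F", "F"],
})
profile = {col: classify_column(df[col]) for col in df.columns}
# e.g. G1 -> numeric with null_ratio 0.25, sex -> categorical
```

A profile like this is what gives the downstream SQL generation enough context to write sensible filters and aggregations.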
LLM Integration (`llm_integration.py`)
- Anthropic Claude Sonnet 4 for natural language understanding
- Dual SQL generation: raw queries + robust queries with data cleaning
- Intelligent prompt engineering for consistent, safe SQL output
- JSON-structured responses with metadata and follow-up questions
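The prompt-engineering step can be sketched as a function that folds the column profile into the instructions and demands a JSON reply. The wording and payload schema below are illustrative assumptions, not KleanSQL's actual prompt:

```python
def build_prompt(question: str, schema: dict) -> str:
    """Assemble a SQL-generation prompt embedding the column profile.

    The table name `data` and the JSON keys are illustrative, not the
    project's exact schema.
    """
    schema_desc = "\n".join(
        f"- {col}: {info['semantic_type']} (null ratio {info['null_ratio']})"
        for col, info in schema.items()
    )
    return (
        "You generate DuckDB SQL for the table `data`.\n"
        f"Columns:\n{schema_desc}\n\n"
        'Reply ONLY with JSON: {"sql": ..., "robust_sql": ..., "followups": [...]}.\n'
        f"Question: {question}"
    )

prompt = build_prompt(
    "What's the average grade for female students?",
    {"G1": {"semantic_type": "numeric", "null_ratio": 0.03},
     "sex": {"semantic_type": "categorical", "null_ratio": 0.0}},
)
```

Pinning the model to a strict JSON shape is what makes the dual raw/robust output parseable on the other end.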
Query Execution (`DataAssistant` class)
- DuckDB in-memory database for sub-second query performance
- SQL sanitization and safety checks (read-only mode)
- Automatic result formatting and truncation
- Comprehensive error handling with user-friendly messages
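The read-only sanitization step can be illustrated with a simplified stand-in for KleanSQL's checks; the keyword list and row limit are assumptions for the sketch:

```python
import re

# Statements that would mutate state; read-only mode rejects them.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|copy|pragma)\b",
    re.IGNORECASE,
)

def sanitize_sql(sql: str, max_rows: int = 200) -> str:
    """Reject non-SELECT statements and cap result size."""
    stripped = sql.strip().rstrip(";")
    if FORBIDDEN.search(stripped):
        raise ValueError("Only read-only SELECT queries are allowed.")
    if not stripped.lower().startswith(("select", "with")):
        raise ValueError("Query must start with SELECT or WITH.")
    if "limit" not in stripped.lower():
        stripped += f" LIMIT {max_rows}"  # truncate results by default
    return stripped

safe = sanitize_sql("SELECT AVG(G1) FROM students WHERE sex = 'F'")
```

The sanitized query is then handed to DuckDB for execution, so even a misbehaving LLM output cannot modify the dataset.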
Web Interface (`app.py`)
- Streamlit-based modern UI with custom CSS styling
- Real-time query processing with progress indicators
- Interactive SQL preview modals
- Responsive design with dark theme
CLI Support (`data_assistant/cli.py`)
- Command-line interface for power users
- Automated data cleaning and imputation
- Interactive Q&A sessions
- Batch processing capabilities
Key Technologies:
- DuckDB: High-performance analytical database
- Anthropic Claude: Advanced language model for SQL generation
- Streamlit: Rapid web app development
- Weaviate: Vector database for semantic search (optional)
- Pandas: Data manipulation and analysis
- Docker: Containerized deployment with Weaviate services
Challenges we ran into
1. SQL Safety & Validation
- Challenge: Preventing malicious SQL injection while allowing complex analytical queries
- Solution: Implemented comprehensive SQL sanitization, keyword filtering, and read-only mode enforcement. Added column existence validation and query complexity checks.
2. Data Quality & Missing Values
- Challenge: Real-world datasets are messy with missing values, inconsistent types, and outliers
- Solution: Built intelligent imputation pipeline with policies based on data types and null ratios. Created both "raw" and "robust" query modes—raw for transparency, robust for production use.
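A type-aware imputation policy like the one described can be sketched with pandas; the thresholds and drop rule are illustrative assumptions, not KleanSQL's exact policy:

```python
import pandas as pd

def impute(df: pd.DataFrame, max_null_ratio: float = 0.5) -> pd.DataFrame:
    """Type-aware imputation: median for numeric, mode for categorical.

    Columns that are mostly null are dropped rather than imputed.
    """
    out = df.copy()
    for col in list(out.columns):
        ratio = out[col].isna().mean()
        if ratio == 0:
            continue
        if ratio > max_null_ratio:
            out = out.drop(columns=[col])  # too sparse to trust an imputed value
        elif pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

df = pd.DataFrame({"G1": [14, None, 18, 16],
                   "sex": ["F", "M", None, "F"],
                   "notes": [None, None, None, "x"]})
clean = impute(df)
# notes is dropped (75% null); G1 gets its median, sex its mode
```

Robust queries run against data cleaned this way, while raw queries hit the untouched table, which is what makes the two modes comparable.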
3. LLM Response Consistency
- Challenge: Getting reliable, parseable SQL from language models
- Solution: Developed sophisticated prompt engineering with JSON-structured responses, fallback mechanisms, and extensive response normalization. Added regex-based SQL extraction for edge cases.
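The fallback chain can be sketched as a parser that first tries strict JSON and then falls back to regex extraction of a fenced SQL block. The payload keys are illustrative assumptions, not KleanSQL's exact schema:

```python
import json
import re

def parse_llm_response(text: str) -> dict:
    """Parse a JSON-structured LLM reply, falling back to regex SQL extraction."""
    try:
        payload = json.loads(text)
        if isinstance(payload, dict) and "sql" in payload:
            return payload
    except json.JSONDecodeError:
        pass
    # Fallback: pull the first fenced SQL block out of free-form text.
    match = re.search(r"```sql\s*(.*?)```", text, re.DOTALL | re.IGNORECASE)
    sql = match.group(1).strip() if match else ""
    return {"sql": sql, "robust_sql": sql, "followups": []}

good = parse_llm_response('{"sql": "SELECT 1", "robust_sql": "SELECT 1", "followups": []}')
messy = parse_llm_response("Here you go:\n```sql\nSELECT AVG(G1) FROM students\n```")
```

The normalization means the rest of the pipeline always sees the same dict shape, no matter how the model chose to answer.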
4. Performance at Scale
- Challenge: Fast query execution on large datasets
- Solution: Leveraged DuckDB's columnar storage and vectorized execution. Implemented result truncation and efficient memory management.
5. User Experience
- Challenge: Making complex data analysis feel intuitive
- Solution: Created progressive disclosure with SQL previews, follow-up question suggestions, and comprehensive error messages. Built both web and CLI interfaces for different user preferences.
Accomplishments that we're proud of
Technical Innovations:
- Dual Query Generation: Raw SQL for learning + robust SQL for production
- Intelligent Data Profiling: Comprehensive statistical analysis with semantic type detection
- Vector-Based Column Search: Optional Weaviate integration for semantic column discovery
- Automated Data Cleaning: Smart imputation strategies based on data characteristics
User Experience:
- Natural Language Interface: Ask questions in plain English, get SQL + results
- Beautiful Web UI: Modern dark theme with responsive design
- Progressive Disclosure: Show SQL queries for transparency and learning
- Follow-up Suggestions: AI-powered question recommendations
Performance:
- Sub-second Query Execution: DuckDB's vectorized engine
- Memory Efficient: Streaming data processing with result truncation
- Scalable Architecture: Handles datasets from hundreds to millions of rows
Developer Experience:
- Multiple Interfaces: Web app, CLI, and programmatic API
- Docker Deployment: One-command setup with Weaviate services
- Comprehensive Testing: Unit tests and integration validation
- Extensible Design: Modular architecture for easy feature additions
What we learned
AI Integration Insights:
- Prompt Engineering is Critical: Small changes in prompts dramatically affect output quality
- Fallback Strategies Matter: Always have backup plans when LLMs fail
- Structured Outputs Work: JSON responses are more reliable than free-form text
Data Engineering Lessons:
- Profile First, Query Later: Understanding your data structure is crucial for good SQL generation
- Handle Edge Cases: Real-world data is messier than you think
- Performance Matters: Users expect instant results, even on large datasets
User Experience Discoveries:
- Transparency Builds Trust: Showing generated SQL helps users understand and learn
- Progressive Disclosure: Don't overwhelm users with all features at once
- Error Messages Matter: Clear, actionable error messages improve user experience significantly
Technical Architecture:
- DuckDB is Incredible: Fast, SQL-compliant, and perfect for analytical workloads
- Modular Design Pays Off: Separate concerns (profiling, LLM, execution) makes testing and debugging easier
- Containerization Simplifies Deployment: Docker Compose makes complex setups trivial
What's next for KleanSQL
Short-term Enhancements:
- Multi-table Joins: Support for relational queries across multiple CSV files
- Advanced Visualizations: Charts and graphs for query results
- Query History: Save and replay previous analyses
- Export Capabilities: Download results as CSV, PDF reports
Medium-term Features:
- Real-time Data Sources: Connect to databases, APIs, and streaming data
- Collaborative Analysis: Share datasets and queries with team members
- Custom Imputation Rules: User-defined data cleaning strategies
- Query Optimization: Automatic query performance tuning
Long-term Vision:
- Enterprise Features: Role-based access, audit logs, and compliance tools
- AI-Powered Insights: Automatic anomaly detection and trend analysis
- Natural Language Reports: Generate executive summaries from data
- Integration Ecosystem: Connect with popular BI tools and data platforms
Open Source Goals:
- Community Contributions: Open-source the core engine for community development
- Plugin Architecture: Allow third-party extensions and custom functions
- Educational Resources: Tutorials and examples for learning SQL through natural language
KleanSQL represents the future of data analysis—where powerful insights are just a question away. We're excited to continue building tools that make data accessible to everyone, not just SQL experts.
Built with ❤️ at CalHacks 12.0
Built With
- anthropic
- claude
- duckdb
- python
- rag