đź’ˇ Inspiration
As data science engineering students, we noticed a recurring bottleneck: the sheer amount of time poured into Exploratory Data Analysis (EDA) before any actual modeling begins. Writing repetitive boilerplate code to check for null values, distributions, and correlations felt like a hurdle to real insight. We built DataWhisper to automate this initial phase, allowing researchers to skip the "cleaning scripts" and jump straight to interpreting the data.
⚙️ What it does
DataWhisper is an intelligent, automated EDA companion. Users can upload CSV/TSV files or pull datasets directly via the Kaggle API when needed.
- Local Profiling: The app runs a local Python script to extract metadata (data types, min/max, null counts, and correlations).
- Privacy-First AI: To protect sensitive information, only this statistical metadata—not the raw row-level data—is sent to the Groq API (running Llama-3.3-70b).
- Dynamic Visualization: The LLM identifies five critical patterns and returns strict JSON parameters.
- Interactive Insight: A Python backend uses these parameters to render interactive Plotly charts. Users can then "chat" with the AI about specific graphs to gain deeper context or export the entire analysis to a PDF.
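The local profiling step described above can be sketched roughly as follows (function and field names here are illustrative, not DataWhisper's actual code); the key property is that only column-level metadata, never raw rows, leaves this function:

```python
import pandas as pd

def profile_dataframe(df: pd.DataFrame) -> dict:
    """Collect column-level metadata; no raw row data is included."""
    profile = {}
    for col in df.columns:
        series = df[col]
        info = {
            "dtype": str(series.dtype),
            "null_count": int(series.isna().sum()),
        }
        if pd.api.types.is_numeric_dtype(series):
            info["min"] = float(series.min())
            info["max"] = float(series.max())
        profile[col] = info
    # Pairwise correlations between numeric columns only.
    profile["_correlations"] = df.corr(numeric_only=True).round(3).to_dict()
    return profile
```

A dictionary like this can be serialized to JSON and sent to the LLM as the privacy-preserving payload.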
🛠️ How we built it
- Frontend: Built with Streamlit and enhanced with custom CSS for a polished UI.
- Backend: Powered by Pandas and SciPy. For example, we use SciPy to calculate the Pearson correlation coefficient \( r \) to map relationships between numeric features:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
- AI Integration: Leveraged the Groq API for ultra-low-latency LLM inference and the Kaggle API for seamless dataset sourcing.
- Hardware: Developed and optimized to run on standard consumer laptops, ensuring accessibility for all users.
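The formula above corresponds directly to `scipy.stats.pearsonr`; a minimal sketch of computing \( r \) for a pair of numeric features (sample data is made up for illustration):

```python
import numpy as np
from scipy.stats import pearsonr

# Two near-linearly related numeric features.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}")  # r is close to 1 for this pair
```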
đźš§ Challenges we ran into
The biggest hurdle was hallucination management. We needed the LLM to output deterministic JSON parameters for Plotly. If the model suggested a non-existent column or an incompatible chart type, the app would crash. We solved this by:
- Implementing strict prompt engineering to enforce JSON schemas.
- Writing a validation layer in Python that cross-checks AI-suggested columns against the actual DataFrame before attempting to render.
We also ran into rendering errors while generating some of the data visualizations.
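The validation layer can be illustrated with a sketch like the one below (the schema keys and allowed chart types are hypothetical, not DataWhisper's exact spec): the model's JSON is parsed defensively, and every column it references is cross-checked against the real DataFrame before anything is rendered.

```python
import json
import pandas as pd

# Hypothetical whitelist of chart types the renderer supports.
ALLOWED_CHARTS = {"scatter", "histogram", "bar", "box", "line"}

def validate_chart_spec(spec_json: str, df: pd.DataFrame):
    """Return the parsed spec if it is safe to render, else None."""
    try:
        spec = json.loads(spec_json)
    except json.JSONDecodeError:
        return None  # model did not emit valid JSON
    if spec.get("chart_type") not in ALLOWED_CHARTS:
        return None  # unsupported or hallucinated chart type
    # Every column the model references must actually exist.
    for key in ("x", "y", "color"):
        col = spec.get(key)
        if col is not None and col not in df.columns:
            return None
    return spec
```

Rejected specs can trigger a retry prompt instead of crashing the app.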
🏆 Accomplishments that we're proud of
- We participated in our first-ever hackathon and submitted our first project.
- We successfully built a reliable, end-to-end pipeline. It's rewarding to see an open-ended Kaggle query transform into a downloaded dataset, a local profile, and a series of dynamic, statistically valid charts, all while maintaining a stable Streamlit session state.
đź“– What we learned
- Structured Outputs: How to force LLMs into providing rigid JSON for software integration.
- State Management: Handling complex multi-view states in Streamlit.
- Statistical Validity: Using `scipy.stats` to inject linear regression trendlines and \( R^2 \) values directly into generated visualizations for deeper mathematical grounding.
- Working as a Group: We managed to communicate, divide tasks, and ensure the project was submitted.
- Time Management: We started very late but still turned our idea into a useful project.
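The trendline injection mentioned above can be sketched with `scipy.stats.linregress` (sample data and variable names are illustrative); the fitted line and \( R^2 \) value are then easy to attach to a Plotly scatter as an extra trace and annotation:

```python
import numpy as np
from scipy.stats import linregress

# Illustrative numeric feature pair.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1.2, 2.1, 2.9, 4.2, 5.1, 5.8])

fit = linregress(x, y)
trend_y = fit.slope * x + fit.intercept  # y-values for the trendline trace
r_squared = fit.rvalue ** 2              # goodness-of-fit for the annotation
print(f"slope = {fit.slope:.3f}, R^2 = {r_squared:.3f}")
```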
🚀 What's next for DataWhisper
- Our goal is 100% data privacy. We plan to integrate local LLM execution using a model like Mistral Nemo 12B. This would allow DataWhisper to generate insights entirely offline, meaning even the most sensitive datasets never have to leave the user's local hardware.
- We also plan to add more functionality, such as 3D and animated charts and deeper analysis aimed at expert users.
