đź’ˇ Inspiration
As data science engineering students, we noticed a recurring bottleneck: the sheer amount of time poured into Exploratory Data Analysis (EDA) before any actual modeling begins. Writing repetitive boilerplate code to check for null values, distributions, and correlations felt like a hurdle to real insight. We built DataWhisper to automate this initial phase, allowing researchers to skip the "cleaning scripts" and jump straight to interpreting the data.
⚙️ What it does
DataWhisper is an intelligent, automated EDA companion. Users can upload CSV/TSV files or pull datasets directly via the Kaggle API when needed.
- Local Profiling: The app runs a local Python script to extract metadata (data types, min/max, null counts, and correlations).
- Privacy-First AI: To protect sensitive information, only this statistical metadata—not the raw row-level data—is sent to the Groq API (running Llama-3.3-70b).
- Dynamic Visualization: The LLM identifies five critical patterns and returns strict JSON parameters.
- Interactive Insight: A Python backend uses these parameters to render interactive Plotly charts. Users can then "chat" with the AI about specific graphs to gain deeper context or export the entire analysis to a PDF.
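The local profiling step described above can be sketched roughly as follows (function and field names here are illustrative, not DataWhisper's actual code); the key property is that only column-level metadata, never raw rows, leaves this function:

```python
import pandas as pd

def profile_dataframe(df: pd.DataFrame) -> dict:
    """Collect column-level metadata; no raw row data is included."""
    profile = {}
    for col in df.columns:
        series = df[col]
        info = {
            "dtype": str(series.dtype),
            "null_count": int(series.isna().sum()),
        }
        if pd.api.types.is_numeric_dtype(series):
            info["min"] = float(series.min())
            info["max"] = float(series.max())
        profile[col] = info
    # Pairwise correlations between numeric columns only.
    profile["_correlations"] = df.corr(numeric_only=True).round(3).to_dict()
    return profile
```

A dictionary like this can be serialized to JSON and sent to the LLM as the privacy-preserving payload.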
🛠️ How we built it
- Frontend: Built with Streamlit and enhanced with custom CSS for a polished UI.
- Backend: Powered by Pandas and SciPy. For example, we use SciPy to calculate the Pearson correlation coefficient \( r \) to map relationships between numeric features:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
- AI Integration: Leveraged the Groq API for ultra-low-latency LLM inference and the Kaggle API for seamless dataset sourcing.
- Hardware: Developed and optimized to run on standard consumer laptops, ensuring accessibility for all users.
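The formula above corresponds directly to `scipy.stats.pearsonr`; a minimal sketch of computing \( r \) for a pair of numeric features (sample data is made up for illustration):

```python
import numpy as np
from scipy.stats import pearsonr

# Two near-linearly related numeric features.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}")  # r is close to 1 for this pair
```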
đźš§ Challenges we ran into
The biggest hurdle was hallucination management. We needed the LLM to output deterministic JSON parameters for Plotly. If the model suggested a non-existent column or an incompatible chart type, the app would crash. We solved this by:
- Implementing strict prompt engineering to enforce JSON schemas.
- Writing a validation layer in Python that cross-checks AI-suggested columns against the actual DataFrame before attempting to render.
We also ran into rendering errors while generating some of the data visualizations.
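The validation layer can be illustrated with a sketch like the one below (the schema keys and allowed chart types are hypothetical, not DataWhisper's exact spec): the model's JSON is parsed defensively, and every column it references is cross-checked against the real DataFrame before anything is rendered.

```python
import json
import pandas as pd

# Hypothetical whitelist of chart types the renderer supports.
ALLOWED_CHARTS = {"scatter", "histogram", "bar", "box", "line"}

def validate_chart_spec(spec_json: str, df: pd.DataFrame):
    """Return the parsed spec if it is safe to render, else None."""
    try:
        spec = json.loads(spec_json)
    except json.JSONDecodeError:
        return None  # model did not emit valid JSON
    if spec.get("chart_type") not in ALLOWED_CHARTS:
        return None  # unsupported or hallucinated chart type
    # Every column the model references must actually exist.
    for key in ("x", "y", "color"):
        col = spec.get(key)
        if col is not None and col not in df.columns:
            return None
    return spec
```

Rejected specs can trigger a retry prompt instead of crashing the app.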
🏆 Accomplishments that we're proud of
- We participated in our first-ever hackathon and submitted our first project.
- We successfully built a reliable, end-to-end pipeline. It's rewarding to see an open-ended Kaggle query transform into a downloaded dataset, a local profile, and a series of dynamic, statistically valid charts, all while maintaining a stable Streamlit session state.
đź“– What we learned
- Structured Outputs: How to force LLMs into providing rigid JSON for software integration.
- State Management: Handling complex multi-view states in Streamlit.
- Statistical Validity: Using `scipy.stats` to inject linear regression trendlines and \( R^2 \) values directly into generated visualizations for deeper mathematical grounding.
- Working as a Group: We managed to communicate, divide tasks, and ensure the project was submitted.
- Time Management: We started very late but still turned our idea into a useful project.
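The trendline injection mentioned above can be sketched with `scipy.stats.linregress` (sample data and variable names are illustrative); the fitted line and \( R^2 \) value are then easy to attach to a Plotly scatter as an extra trace and annotation:

```python
import numpy as np
from scipy.stats import linregress

# Illustrative numeric feature pair.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1.2, 2.1, 2.9, 4.2, 5.1, 5.8])

fit = linregress(x, y)
trend_y = fit.slope * x + fit.intercept  # y-values for the trendline trace
r_squared = fit.rvalue ** 2              # goodness-of-fit for the annotation
print(f"slope = {fit.slope:.3f}, R^2 = {r_squared:.3f}")
```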
🚀 What's next for DataWhisper
- Our goal is 100% data privacy. We plan to integrate local LLM execution using a model like Mistral Nemo 12B. This would allow DataWhisper to generate insights entirely offline, meaning even the most sensitive datasets never have to leave the user's local hardware.
- We also plan to add more functionality, such as 3D and animated charts and deeper analysis aimed at expert users.
