💡 Inspiration
Every data science project starts the same way — hours of cleaning messy data, running EDA, trying different models, tuning hyperparameters, and then struggling to explain results to non-technical stakeholders. I wanted to eliminate all of that manual work with a single CSV upload.
🤖 What It Does
Autonomous Data Analysis Agent is a fully automated multi-agent ML pipeline that takes a raw, messy CSV file and delivers a trained model with plain-English explanations — zero configuration required.
Upload CSV → Click Run → Get Intelligence. That's it.
The system has 5 specialized agents orchestrated by a central brain:
- Agent 01 — Data Cleaning: Automatically handles missing values (mean/median/mode based on distribution), detects and caps outliers using winsorization, fixes data types, removes duplicates, and drops sparse columns
- Agent 02 — EDA: Detects whether the problem is classification, regression, or clustering by analyzing the target column, checks class imbalance ratio, computes feature correlations, and generates insights
- Agent 03 — Model Selection: Benchmarks 5 candidate models using 3-fold cross-validation and picks the winner automatically
- Agent 04 — Training & Evaluation: Trains the winning model with hyperparameter tuning, generates confusion matrices, learning curves, and computes accuracy, F1, RMSE, R² metrics
- Agent 05 — Explainability: Uses SHAP values to rank feature importance and converts everything into plain English — "Customer age and monthly spend were the strongest predictors of churn. The model is 91% accurate."
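Agent 03's model benchmarking can be sketched roughly like this — a minimal, illustrative version (model list and names are assumptions, not the project's exact candidates) that scores each model with 3-fold cross-validation and keeps the best:

```python
# Sketch of benchmarking candidate models with 3-fold CV and
# automatically picking the winner (illustrative candidates only).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 3-fold cross-validation score for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=3).mean()
    for name, model in candidates.items()
}
winner = max(scores, key=scores.get)
```

The real agent benchmarks five models, but the selection logic is the same: fit nothing permanently, score everything, keep the top scorer.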
The Orchestrator sits above all agents, routes outputs between them, handles loop-back conditions, and aborts gracefully with clear messages.
🏗️ Architecture
```
User uploads CSV
       ↓
Orchestrator (brain)
       ↓
Agent 01 → Agent 02 → Agent 03 → Agent 04 → Agent 05
Cleaning     EDA      Model      Training   Explain-
                      Selection             ability
```
Each agent is an independent Python module with a clean input/output contract. The orchestrator passes a PipelineState object through each agent, accumulating results at every step.
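The contract above can be sketched as follows — field names and the toy cleaning step are illustrative, not the project's actual API; the point is that every agent takes a PipelineState and returns an updated one:

```python
# Minimal sketch of the agent input/output contract: each agent
# receives the shared PipelineState and hands it back enriched.
from dataclasses import dataclass, field

@dataclass
class PipelineState:
    raw_rows: list                              # no default: must come first
    clean_rows: list = field(default_factory=list)
    problem_type: str = ""                      # set later by the EDA agent
    results: dict = field(default_factory=dict)

def cleaning_agent(state: PipelineState) -> PipelineState:
    # Stand-in for real cleaning: drop rows containing missing values.
    state.clean_rows = [r for r in state.raw_rows if None not in r]
    state.results["agent01"] = f"kept {len(state.clean_rows)} rows"
    return state

state = PipelineState(raw_rows=[(1, 2), (None, 3), (4, 5)])
state = cleaning_agent(state)
print(state.results["agent01"])  # → kept 2 rows
```

Note the field ordering in the dataclass: fields without defaults must precede fields with defaults, which is exactly the constraint mentioned in the challenges below.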
🛠️ How I Built It
- Designed each agent as an independent module with single responsibility
- Built the orchestrator to handle routing, errors and loop-back logic
- Used Streamlit for a clean dark-themed UI with live progress updates
- Integrated SHAP for explainability with fallback to feature importances
- Tested on datasets ranging from 1,500 to 99,441 rows
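The SHAP-with-fallback idea can be sketched like this, assuming a tree model and treating SHAP as an optional dependency (a simplified version, not the project's exact code):

```python
# Sketch: prefer SHAP's TreeExplainer, fall back to the model's own
# feature_importances_ if SHAP is missing or fails for this model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

def feature_importance(model, X):
    try:
        import shap  # optional dependency
        sv = shap.TreeExplainer(model).shap_values(X)
        return np.abs(sv).mean(axis=0)      # mean |SHAP| per feature
    except Exception:
        return model.feature_importances_   # graceful fallback

imp = feature_importance(model, X)
top = int(np.argmax(imp))  # index of the strongest predictor
```

Catching the broad `Exception` (not just `ImportError`) is what makes the degradation graceful: an incompatible SHAP version degrades to built-in importances instead of crashing the pipeline.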
🚧 Challenges I Faced
- Dataclass field ordering: Python requires fields with defaults to come after fields without — caused runtime crashes that needed fixing across all agent result objects
- SHAP compatibility: Had to dynamically choose between TreeExplainer and LinearExplainer based on model type at runtime
- Windows encoding: Special Unicode characters in comments caused a `SyntaxError` on Windows — learned to keep source files ASCII-safe
- CSV encoding: Real-world datasets arrive in Latin-1, Windows-1252, and other encodings — built auto-detection across 5 encodings so any CSV loads correctly
- Class imbalance: Needed stratified splits to prevent misleadingly high scores on imbalanced datasets
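The encoding auto-detection can be sketched as a try-in-order loop — the encoding list here is an assumption (the project's actual five may differ); note that Latin-1 goes last because it accepts any byte sequence and would otherwise shadow the rest:

```python
# Sketch: decode raw CSV bytes with the first candidate encoding
# that succeeds. Candidate list is illustrative.
import csv
import io

ENCODINGS = ["utf-8", "utf-8-sig", "cp1252", "latin-1", "utf-16"]

def read_csv_any_encoding(raw: bytes):
    """Return (rows, encoding) using the first encoding that decodes cleanly."""
    for enc in ENCODINGS:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate
        return list(csv.reader(io.StringIO(text))), enc
    raise ValueError("none of the candidate encodings matched")

# A Latin-1/Windows-1252 byte stream that is invalid UTF-8:
rows, enc = read_csv_any_encoding("name,café\nAna,3\n".encode("latin-1"))
```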
📚 What I Learned
- How to design a clean multi-agent system with proper handoffs
- The importance of graceful degradation at every step
- SHAP values and how to translate them into business language
- Streamlit is incredibly powerful for shipping data apps fast
- Real-world data is always messier than expected
Built With
- python
- streamlit