
💡 Inspiration

Every data science project starts the same way — hours of cleaning messy data, running EDA, trying different models, tuning hyperparameters, and then struggling to explain results to non-technical stakeholders. I wanted to eliminate all of that manual work with a single CSV upload.

🤖 What It Does

Autonomous Data Analysis Agent is a fully automated multi-agent ML pipeline that takes a raw, messy CSV file and delivers a trained model with plain-English explanations — zero configuration required.

Upload CSV → Click Run → Get Intelligence. That's it.

The system has 5 specialized agents orchestrated by a central brain:

  • Agent 01 — Data Cleaning: Automatically handles missing values (mean/median/mode based on distribution), detects and caps outliers using winsorization, fixes data types, removes duplicates, and drops sparse columns
  • Agent 02 — EDA: Detects whether the problem is classification, regression, or clustering by analyzing the target column, checks class imbalance ratio, computes feature correlations, and generates insights
  • Agent 03 — Model Selection: Benchmarks 5 candidate models using 3-fold cross-validation and picks the winner automatically
  • Agent 04 — Training & Evaluation: Trains the winning model with hyperparameter tuning, generates confusion matrices, learning curves, and computes accuracy, F1, RMSE, R² metrics
  • Agent 05 — Explainability: Uses SHAP values to rank feature importance and converts everything into plain English — "Customer age and monthly spend were the strongest predictors of churn. The model is 91% accurate."
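Agent 01's imputation-and-winsorization step could be sketched roughly like this (the function name, skew threshold, and percentile cutoffs here are illustrative assumptions, not the project's actual code):

```python
import pandas as pd

def clean_numeric(df: pd.DataFrame, skew_threshold: float = 1.0) -> pd.DataFrame:
    """Impute missing numeric values (mean vs. median by skew) and cap outliers."""
    df = df.copy()
    for col in df.select_dtypes(include="number").columns:
        # Skewed distributions get the median; roughly symmetric ones get the mean
        if abs(df[col].skew()) > skew_threshold:
            fill = df[col].median()
        else:
            fill = df[col].mean()
        df[col] = df[col].fillna(fill)
        # Winsorize: clip extreme values to the 1st/99th percentiles
        lo, hi = df[col].quantile([0.01, 0.99])
        df[col] = df[col].clip(lo, hi)
    return df.drop_duplicates()
```

Clipping at fixed percentiles is one common winsorization choice; the real agent may pick bounds from the data's distribution instead.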

The Orchestrator sits above all agents, routes outputs between them, handles loop-back conditions, and aborts gracefully with clear messages.

🏗️ Architecture

User uploads CSV
      ↓
  Orchestrator (brain)
      ↓
Agent 01 → Agent 02 → Agent 03 → Agent 04 → Agent 05
Cleaning    EDA        Model       Train       Explain
                       Select

Each agent is an independent Python module with a clean input/output contract. The orchestrator passes a PipelineState object through each agent, accumulating results at every step.
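A minimal sketch of that contract (the field names and abort convention are assumptions for illustration, not the project's actual API):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PipelineState:
    # Fields without defaults must come before fields with defaults
    raw_path: str
    results: dict = field(default_factory=dict)
    error: Optional[str] = None

def run_pipeline(state: PipelineState, agents: list) -> PipelineState:
    """Pass the state through each agent in order; stop at the first error."""
    for agent in agents:
        state = agent(state)
        if state.error:
            break  # abort gracefully with whatever message the failing agent set
    return state
```

Each agent is then just a callable taking and returning a `PipelineState`, which keeps the handoffs trivially testable.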

🛠️ How I Built It

  • Designed each agent as an independent module with single responsibility
  • Built the orchestrator to handle routing, error handling, and loop-back logic
  • Used Streamlit for a clean dark-themed UI with live progress updates
  • Integrated SHAP for explainability with fallback to feature importances
  • Tested on datasets ranging from 1,500 to 99,441 rows
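The SHAP-with-fallback idea could look something like this (function name and structure are illustrative; the project also switches to `LinearExplainer` for linear models, which is omitted here):

```python
import numpy as np

def rank_features(model, X, feature_names):
    """Rank features by mean |SHAP| when available, else by built-in importances."""
    try:
        import shap  # optional dependency
        explainer = shap.TreeExplainer(model)  # assumes a tree-based model
        shap_values = explainer.shap_values(X)
        # Multi-class tree models return one array per class; average them
        if isinstance(shap_values, list):
            shap_values = np.mean([np.abs(v) for v in shap_values], axis=0)
        scores = np.abs(shap_values).mean(axis=0)
    except Exception:
        # Graceful degradation: tree models expose feature_importances_ directly
        scores = model.feature_importances_
    order = np.argsort(scores)[::-1]
    return [feature_names[i] for i in order]
```

Catching broadly and degrading to `feature_importances_` trades precision for robustness, which matters when the pipeline must never crash on an arbitrary CSV.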

🚧 Challenges I Faced

  • Dataclass field ordering: Python requires fields with defaults to come after fields without — caused runtime crashes that needed fixing across all agent result objects
  • SHAP compatibility: Had to dynamically choose between TreeExplainer and LinearExplainer based on model type at runtime
  • Windows encoding: Special Unicode characters in comments caused SyntaxError on Windows — learned to keep source files ASCII-safe
  • CSV encoding: Real-world datasets often arrive in Latin-1, Windows-1252, and other non-UTF-8 encodings — built auto-detection across 5 encodings so any CSV loads correctly
  • Class imbalance: Needed stratified splits to prevent misleadingly high scores on imbalanced datasets
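The encoding auto-detection above can be sketched as a try-in-order loop (the exact candidate list in the project may differ; note cp1252 must be tried before latin-1, since latin-1 accepts every possible byte sequence and would otherwise always win):

```python
# Illustrative candidate list — the project's actual list of 5 may differ
CANDIDATE_ENCODINGS = ["utf-8-sig", "utf-8", "cp1252", "latin-1"]

def detect_and_decode(raw: bytes):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("Could not decode file with any candidate encoding")
```

Trying `utf-8-sig` first also strips a leading BOM, which otherwise silently corrupts the first column name in a CSV header.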

📚 What I Learned

  • How to design a clean multi-agent system with proper handoffs
  • The importance of graceful degradation at every step
  • SHAP values and how to translate them into business language
  • Streamlit is incredibly powerful for shipping data apps fast
  • Real-world data is always messier than expected
