
💡 Inspiration

Every data science project starts the same way — hours of cleaning messy data, running EDA, trying different models, tuning hyperparameters, and then struggling to explain results to non-technical stakeholders. I wanted to eliminate all of that manual work with a single CSV upload.

🤖 What It Does

Autonomous Data Analysis Agent is a fully automated multi-agent ML pipeline that takes a raw, messy CSV file and delivers a trained model with plain-English explanations — zero configuration required.

Upload CSV → Click Run → Get Intelligence. That's it.

The system has 5 specialized agents orchestrated by a central brain:

  • Agent 01 — Data Cleaning: Automatically handles missing values (mean/median/mode based on distribution), detects and caps outliers using winsorization, fixes data types, removes duplicates, and drops sparse columns
  • Agent 02 — EDA: Detects whether the problem is classification, regression, or clustering by analyzing the target column, checks class imbalance ratio, computes feature correlations, and generates insights
  • Agent 03 — Model Selection: Benchmarks 5 candidate models using 3-fold cross-validation and picks the winner automatically
  • Agent 04 — Training & Evaluation: Trains the winning model with hyperparameter tuning, generates confusion matrices, learning curves, and computes accuracy, F1, RMSE, R² metrics
  • Agent 05 — Explainability: Uses SHAP values to rank feature importance and converts everything into plain English — "Customer age and monthly spend were the strongest predictors of churn. The model is 91% accurate."
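Agent 01's imputation-and-winsorization step could be sketched roughly like this (the function name, skew threshold, and percentile cutoffs here are illustrative assumptions, not the project's actual code):

```python
import pandas as pd

def clean_numeric(df: pd.DataFrame, skew_threshold: float = 1.0) -> pd.DataFrame:
    """Impute missing numeric values (mean vs. median by skew) and cap outliers."""
    df = df.copy()
    for col in df.select_dtypes(include="number").columns:
        # Skewed distributions get the median; roughly symmetric ones get the mean
        if abs(df[col].skew()) > skew_threshold:
            fill = df[col].median()
        else:
            fill = df[col].mean()
        df[col] = df[col].fillna(fill)
        # Winsorize: clip extreme values to the 1st/99th percentiles
        lo, hi = df[col].quantile([0.01, 0.99])
        df[col] = df[col].clip(lo, hi)
    return df.drop_duplicates()
```

Clipping at fixed percentiles is one common winsorization choice; the real agent may pick bounds from the data's distribution instead.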

The Orchestrator sits above all agents, routes outputs between them, handles loop-back conditions, and aborts gracefully with clear messages.

🏗️ Architecture

User uploads CSV
      ↓
  Orchestrator (brain)
      ↓
Agent 01 → Agent 02 → Agent 03 → Agent 04 → Agent 05
Cleaning    EDA        Model       Train       Explain
                       Select

Each agent is an independent Python module with a clean input/output contract. The orchestrator passes a PipelineState object through each agent, accumulating results at every step.
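A minimal sketch of that contract (the field names and abort convention are assumptions for illustration, not the project's actual API):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PipelineState:
    # Fields without defaults must come before fields with defaults
    raw_path: str
    results: dict = field(default_factory=dict)
    error: Optional[str] = None

def run_pipeline(state: PipelineState, agents: list) -> PipelineState:
    """Pass the state through each agent in order; stop at the first error."""
    for agent in agents:
        state = agent(state)
        if state.error:
            break  # abort gracefully with whatever message the failing agent set
    return state
```

Each agent is then just a callable taking and returning a `PipelineState`, which keeps the handoffs trivially testable.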

🛠️ How I Built It

  • Designed each agent as an independent module with single responsibility
  • Built the orchestrator to handle routing, error handling, and loop-back logic
  • Used Streamlit for a clean dark-themed UI with live progress updates
  • Integrated SHAP for explainability with fallback to feature importances
  • Tested on datasets ranging from 1,500 to 99,441 rows
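The SHAP-with-fallback idea could look something like this (function name and structure are illustrative; the project also switches to `LinearExplainer` for linear models, which is omitted here):

```python
import numpy as np

def rank_features(model, X, feature_names):
    """Rank features by mean |SHAP| when available, else by built-in importances."""
    try:
        import shap  # optional dependency
        explainer = shap.TreeExplainer(model)  # assumes a tree-based model
        shap_values = explainer.shap_values(X)
        # Multi-class tree models return one array per class; average them
        if isinstance(shap_values, list):
            shap_values = np.mean([np.abs(v) for v in shap_values], axis=0)
        scores = np.abs(shap_values).mean(axis=0)
    except Exception:
        # Graceful degradation: tree models expose feature_importances_ directly
        scores = model.feature_importances_
    order = np.argsort(scores)[::-1]
    return [feature_names[i] for i in order]
```

Catching broadly and degrading to `feature_importances_` trades precision for robustness, which matters when the pipeline must never crash on an arbitrary CSV.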

🚧 Challenges I Faced

  • Dataclass field ordering: Python requires fields with defaults to come after fields without — caused runtime crashes that needed fixing across all agent result objects
  • SHAP compatibility: Had to dynamically choose between TreeExplainer and LinearExplainer based on model type at runtime
  • Windows encoding: Special Unicode characters in comments caused SyntaxError on Windows — learned to keep source files ASCII-safe
  • CSV encoding: Real-world datasets often arrive in Latin-1, Windows-1252, and other non-UTF-8 encodings — built auto-detection across 5 encodings so any CSV loads correctly
  • Class imbalance: Needed stratified splits to prevent misleadingly high scores on imbalanced datasets
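The encoding auto-detection above can be sketched as a try-in-order loop (the exact candidate list in the project may differ; note cp1252 must be tried before latin-1, since latin-1 accepts every possible byte sequence and would otherwise always win):

```python
# Illustrative candidate list — the project's actual list of 5 may differ
CANDIDATE_ENCODINGS = ["utf-8-sig", "utf-8", "cp1252", "latin-1"]

def detect_and_decode(raw: bytes):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("Could not decode file with any candidate encoding")
```

Trying `utf-8-sig` first also strips a leading BOM, which otherwise silently corrupts the first column name in a CSV header.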

📚 What I Learned

  • How to design a clean multi-agent system with proper handoffs
  • The importance of graceful degradation at every step
  • SHAP values and how to translate them into business language
  • Streamlit is incredibly powerful for shipping data apps fast
  • Real-world data is always messier than expected
