The Problem

As a machine learning engineer, I've wasted countless hours debugging datasets that should've been flagged immediately. Missing values, class imbalance, data leakage, outliers - these issues cause model failures and wasted compute. The real kicker? I often don't catch them until after training.

The Solution

DatasetGPA is a browser-based tool that analyzes your CSV datasets and flags common data quality problems before you spend time training. It detects:

  • Missing Values: Column-by-column missing data patterns
  • Outliers: IQR-based outlier detection for numeric features
  • Class Imbalance: Identifies severe class distribution problems
  • Data Leakage: Flags suspicious column names and duplicate columns
  • Health Score: 0-100 score showing overall dataset quality
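
To make the first two checks concrete, here is a minimal sketch of per-column missing-value counting and IQR-based outlier detection. It is an illustration, not DatasetGPA's actual source: the `Row` type, the function names, the definition of "missing" (null, undefined, blank strings, NaN), and the 1.5 fence multiplier are all assumptions.

```typescript
// Hypothetical row shape: one parsed CSV record, keyed by column name.
type Row = Record<string, string | number | null | undefined>;

// Assumption: null, undefined, blank strings, and NaN all count as missing.
function isMissing(value: string | number | null | undefined): boolean {
  if (value === null || value === undefined) return true;
  if (typeof value === "number") return isNaN(value);
  return value.trim() === "";
}

// Count missing cells for each requested column.
function missingCounts(rows: Row[], columns: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const col of columns) {
    counts[col] = 0;
    for (const row of rows) {
      if (isMissing(row[col])) counts[col] += 1;
    }
  }
  return counts;
}

// Linearly interpolated quantile of an ascending-sorted array (q in [0, 1]).
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lower = Math.floor(pos);
  const upper = Math.ceil(pos);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (pos - lower);
}

// Return the values outside [Q1 - k*IQR, Q3 + k*IQR].
// k = 1.5 is the conventional fence; the tool's real threshold may differ.
function iqrOutliers(values: number[], k: number = 1.5): number[] {
  if (values.length < 4) return []; // too few points for meaningful quartiles
  const sorted = values.slice().sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const low = q1 - k * iqr;
  const high = q3 + k * iqr;
  return values.filter((v) => v < low || v > high);
}

// Usage: 240 falls outside the fences for this column of ages.
const ages = [31, 29, 35, 33, 30, 32, 240];
console.log(iqrOutliers(ages));
```

Both functions are pure and work on already-parsed rows, which is what keeps this kind of check cheap enough to run client-side.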

The best part? The analysis runs entirely in your browser - just drag, drop, and analyze.
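
The class imbalance and data leakage checks are also simple enough to run client-side. The sketch below shows one way they could work, under stated assumptions: the majority-to-minority ratio as the imbalance measure, exact value equality for duplicate columns, and a made-up keyword list (`SUSPICIOUS_KEYWORDS`) for suspicious names. None of these names or thresholds come from the tool itself.

```typescript
// Count how often each label occurs in the target column.
function classCounts(labels: string[]): Record<string, number> {
  // Null-prototype object so labels like "constructor" cannot collide
  // with inherited properties.
  const counts: Record<string, number> = Object.create(null);
  for (const label of labels) {
    counts[label] = (counts[label] || 0) + 1;
  }
  return counts;
}

// Ratio of the largest class to the smallest; 1 means perfectly balanced.
function imbalanceRatio(labels: string[]): number {
  const counts = classCounts(labels);
  const sizes = Object.keys(counts).map((key) => counts[key]);
  if (sizes.length === 0) return 1; // no labels, nothing to compare
  return Math.max(...sizes) / Math.min(...sizes);
}

// Find pairs of columns whose values are identical in every row.
function duplicateColumns(columns: Record<string, string[]>): string[][] {
  const names = Object.keys(columns);
  const pairs: string[][] = [];
  for (let i = 0; i < names.length; i++) {
    for (let j = i + 1; j < names.length; j++) {
      const a = columns[names[i]];
      const b = columns[names[j]];
      if (a.length === b.length && a.every((v, idx) => v === b[idx])) {
        pairs.push([names[i], names[j]]);
      }
    }
  }
  return pairs;
}

// Hypothetical keyword list, for illustration only.
const SUSPICIOUS_KEYWORDS = ["target", "label", "outcome", "result"];

// Flag non-target columns whose names hint that they encode the answer.
function suspiciousColumns(names: string[], targetColumn: string): string[] {
  return names.filter((name) => {
    if (name === targetColumn) return false;
    const lower = name.toLowerCase();
    return SUSPICIOUS_KEYWORDS.some((kw) => lower.indexOf(kw) !== -1);
  });
}

// Usage: four "no" labels against one "yes" gives a ratio of 4.
console.log(imbalanceRatio(["no", "no", "no", "no", "yes"]));
```

Name-based leakage flags are only hints: a column called `result_date` may be harmless, so a check like this is best treated as a prompt to look closer rather than a verdict.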

How I Built It

Tech Stack:

  • React + Babel (from CDN)
  • Papa Parse (CSV parsing)
  • Tailwind CSS (UI styling)
  • Claude API (AI-powered recommendations)

Key Features:

  • Drag & drop CSV upload
  • Real-time analysis engine
  • Interactive health score display
  • AI-generated actionable recommendations
  • Export analysis as markdown
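
As an illustration of the markdown export, the sketch below renders a report object into a markdown string. The `AnalysisReport` and `ColumnFinding` shapes and the `toMarkdown` name are hypothetical; the real tool's report fields and layout may differ.

```typescript
// Hypothetical report shape, for illustration only.
interface ColumnFinding {
  column: string;
  missing: number;
  outliers: number;
}

interface AnalysisReport {
  fileName: string;
  rowCount: number;
  healthScore: number; // 0-100
  findings: ColumnFinding[];
  recommendations: string[];
}

// Escape pipes so a column name cannot break the markdown table.
function escapeCell(text: string): string {
  return text.replace(/\|/g, "\\|");
}

// Render the report as a markdown document.
function toMarkdown(report: AnalysisReport): string {
  const lines: string[] = [];
  lines.push("# Dataset report: " + report.fileName);
  lines.push("");
  lines.push("Rows: " + report.rowCount);
  lines.push("Health score: " + report.healthScore + "/100");
  lines.push("");
  lines.push("| Column | Missing | Outliers |");
  lines.push("| --- | --- | --- |");
  for (const f of report.findings) {
    lines.push("| " + escapeCell(f.column) + " | " + f.missing + " | " + f.outliers + " |");
  }
  lines.push("");
  lines.push("## Recommendations");
  lines.push("");
  for (const rec of report.recommendations) {
    lines.push("- " + rec);
  }
  return lines.join("\n");
}

// Usage with made-up numbers.
console.log(
  toMarkdown({
    fileName: "customers.csv",
    rowCount: 1000,
    healthScore: 72,
    findings: [{ column: "age", missing: 14, outliers: 3 }],
    recommendations: ["Impute or drop rows with missing age values."],
  })
);
```

In the browser, a string like this could be wrapped in a `Blob` and offered as a file download, which keeps the export step client-side as well.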

What I Learned

  1. Data quality is underestimated - Many of the ML failures I've run into trace back to dataset issues, not model architecture
  2. Browser-based tools are powerful - CDN libraries + client-side processing = fast results with no backend of my own to run
  3. The Claude API is a good fit for this - It turns raw statistics into human-readable insights that are much easier to act on
  4. UI matters - Even the best analysis is useless if people can't understand it

Impact

  • Can save an estimated 2-5 hours per dataset by automating quality checks
  • Prevents bad model training by catching issues early
  • Works for any ML team - no installation, no backend setup
  • Privacy-friendly - the CSV is parsed and analyzed in your browser; the Claude API is only used to turn the resulting statistics into recommendations

What's Next

  • Correlation heatmaps for feature relationships
  • Automatic fixing suggestions (handle missing values, balance classes)
  • Integration with Weights & Biases for tracking quality over time
  • Support for larger datasets (streaming, chunked processing)

This project shows that data quality automation can be both accessible and intelligent - no PhD required, just upload and analyze.
