DataMorph


Inspiration

We witnessed a painful reality in data engineering: 70% of data projects fail due to ETL complexity, and data teams spend 80% of their time building pipelines instead of deriving insights. Business analysts waited weeks for simple transformations, data engineers spent days on repetitive PySpark code, and startups couldn't afford dedicated ETL teams. We asked ourselves: "What if anyone could create production-ready ETL pipelines just by describing them in plain English?" That question sparked DataMorph—an AI system that doesn't just assist developers, but autonomously handles the entire ETL lifecycle from understanding requirements to self-healing production failures.

What it does

DataMorph transforms natural language into production-ready ETL pipelines in 90-180 seconds. Simply describe what you want, for example: "Join customers and orders tables on customer_id, calculate total order amount per customer, filter customers with orders > $1000".
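To make the example request concrete, here is a minimal plain-Python sketch of what it asks for. It is not the PySpark code DataMorph generates. The sample rows, the `amount` and `name` columns, and the reading of "orders > $1000" as "total order amount above 1000" are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows. Only customer_id comes from the example request;
# the other column names and all values are invented for illustration.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
    {"customer_id": 3, "name": "Linus"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 700.0},
    {"order_id": 11, "customer_id": 1, "amount": 450.0},
    {"order_id": 12, "customer_id": 2, "amount": 300.0},
]


def total_per_customer(customers, orders, threshold=1000.0):
    """Inner-join on customer_id, sum amounts per customer, keep totals above threshold."""
    names = {c["customer_id"]: c["name"] for c in customers}
    totals = defaultdict(float)
    for order in orders:
        if order["customer_id"] in names:  # inner join: skip orders without a customer
            totals[order["customer_id"]] += order["amount"]
    return [
        {"customer_id": cid, "name": names[cid], "total_amount": total}
        for cid, total in sorted(totals.items())
        if total > threshold
    ]


result = total_per_customer(customers, orders)
print(result)
```

With these rows only customer 1 passes the filter, because 700 + 450 exceeds the threshold while customer 2 totals 300 and customer 3 has no orders.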

DataMorph automatically:
✅ Generates structured ETL specifications using Claude Sonnet 4.5
✅ Creates production-ready PySpark code
✅ Executes the pipeline on AWS Glue
✅ Validates data quality with 5-phase hybrid testing
✅ Self-heals errors automatically (up to 5 iterations)
✅ Provides complete audit trail and real-time logs

Key Results:
95% success rate with self-healing
99.5% time savings (2-5 days → 90-180 seconds)
99.98% cost reduction ($1,600+ → $0.30 per pipeline)
Zero coding required for end users

How we built it

We designed DataMorph as a 7-agent serverless architecture on AWS, where each agent is a specialized microservice:
Flask API (EC2) - HTTP interface & database proxy
Orchestrator Lambda - Coordinates entire workflow
Specs Generator Lambda - Converts natural language to JSON specs using Claude Sonnet 4.5
Glue Executor Lambda - Generates PySpark code and manages AWS Glue jobs
Validator Lambda - 5-phase hybrid validation (rule-based + AI)
Remediator Lambda - Autonomous error correction with multi-iteration healing
Logger Lambda - Centralized logging to DynamoDB
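One guardrail this write-up mentions for the generated specs is schema validation against the real tables. The sketch below shows one way such a check could look. The spec shape (a mapping from table name to referenced columns), the function name `find_unknown_columns`, and the sample schema are hypothetical, not DataMorph's actual spec format.

```python
from typing import Dict, List, Set


def find_unknown_columns(
    spec_columns: Dict[str, List[str]],
    schema: Dict[str, Set[str]],
) -> List[str]:
    """Return a message for every table or column the spec references but the schema lacks."""
    problems: List[str] = []
    for table, columns in spec_columns.items():
        if table not in schema:
            problems.append(f"unknown table: {table}")
            continue
        for column in columns:
            if column not in schema[table]:
                problems.append(f"unknown column: {table}.{column}")
    return problems


# Hypothetical schema and a spec that references a column that does not exist.
schema = {
    "customers": {"customer_id", "name"},
    "orders": {"order_id", "customer_id", "amount"},
}
spec_columns = {
    "customers": ["customer_id"],
    "orders": ["customer_id", "order_total"],
}

problems = find_unknown_columns(spec_columns, schema)
print(problems)
```

A non-empty result could be fed back into the prompt as an error message instead of being sent on to code generation.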

Key Innovations:

Self-Healing Code Generation - If the initial code fails, the AI analyzes the error and regenerates corrected code. This raised the first-attempt success rate from 70% to 85%.
Phase 4.5: AI Failure Verification - We discovered AI-generated tests sometimes failed due to rounding differences, not real issues. We added AI verification to distinguish false positives, reducing them from 5% to 0.6%.
Multi-Iteration Remediation - Up to 5 autonomous correction attempts with "clean slate" approach (drops and recreates tables). Results: 63% fixed in iteration 1, 27% in iteration 2, overall 95% success rate.
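The multi-iteration remediation described above can be sketched as a bounded retry loop. This is a minimal illustration under stated assumptions, not the Remediator Lambda's code: `run_pipeline`, `regenerate`, and `reset_tables` are hypothetical callables standing in for the Glue run, the AI correction step, and the drop-and-recreate step.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

MAX_ITERATIONS = 5  # cap taken from the write-up


@dataclass
class RemediationResult:
    success: bool
    iterations: int  # correction attempts actually used
    final_code: str
    errors: List[str] = field(default_factory=list)


def remediate(
    code: str,
    first_error: str,
    run_pipeline: Callable[[str], Optional[str]],  # returns None on success, else an error
    regenerate: Callable[[str, str], str],         # (code, error) -> corrected code
    reset_tables: Callable[[], None],              # "clean slate": drop and recreate tables
    max_iterations: int = MAX_ITERATIONS,
) -> RemediationResult:
    """Regenerate and rerun until the pipeline passes or the attempt budget is spent."""
    errors = [first_error]
    for iteration in range(1, max_iterations + 1):
        code = regenerate(code, errors[-1])
        reset_tables()  # remove partial output from the previous failed run
        error = run_pipeline(code)
        if error is None:
            return RemediationResult(True, iteration, code, errors)
        errors.append(error)
    return RemediationResult(False, max_iterations, code, errors)


# Stand-in callables: the code "version" is bumped on each correction and v3 passes.
resets: List[int] = []


def fake_reset() -> None:
    resets.append(1)


def fake_run(code: str) -> Optional[str]:
    return None if code == "v3" else f"{code}: column not found"


def fake_regenerate(code: str, error: str) -> str:
    return f"v{int(code[1:]) + 1}"


outcome = remediate("v1", "v1: column not found", fake_run, fake_regenerate, fake_reset)
print(outcome.success, outcome.iterations, len(resets))
```

Resetting tables before every rerun means each attempt is validated against fresh output, so a failure in one iteration cannot leak partial rows into the next.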

Frontend: Professional React website with real-time log viewer, code display, table viewer, and PDF export capabilities.

Tech Stack: Python 3.11, React 18.2, TypeScript, AWS Lambda, Bedrock (Claude Sonnet 4.5), Glue, RDS PostgreSQL, DynamoDB, S3

Challenges we ran into

AI Hallucinations (30% failure rate)
Claude generated non-existent columns and invalid syntax
Solution: Schema validation, sample data context, 2-attempt self-healing, temperature=0.3
Result: 85% first-attempt success

False Positives Eroding Trust (5% rate)
AI tests failed due to rounding differences (100.0 vs 100), not real issues
Solution: Phase 4.5 AI failure verification
Result: 0.6% false positive rate

Lambda Timeout Issues
Orchestrator timing out at 300s with long-running Glue jobs
Solution: Increased to 900s, async monitoring, exponential backoff
Result: Zero timeout failures

Prompt Engineering
Spent 40% of development time perfecting prompts
Solution: Iterated through 15+ versions with schema context, examples, explicit structure
Result: Reliable JSON output and code generation
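The timeout fix above pairs a larger time budget with exponential backoff while polling the Glue job. A minimal sketch of that polling pattern follows. The `get_state` callable is a hypothetical stand-in for a Glue status lookup, the terminal state names are assumed to mirror Glue's job run states, and the base, factor, cap, and 840-second default budget (a margin under the 900s limit) are illustrative choices rather than DataMorph's actual settings.

```python
import time
from itertools import islice
from typing import Callable, Iterator

# Assumed to mirror AWS Glue job run states that mean "finished".
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def backoff_delays(base: float = 2.0, factor: float = 2.0, cap: float = 60.0) -> Iterator[float]:
    """Yield delays that grow geometrically and never exceed cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def wait_for_job(
    get_state: Callable[[], str],
    budget_seconds: float = 840.0,  # assumed margin under a 900s Lambda timeout
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_state with exponential backoff until a terminal state or the budget runs out."""
    deadline = clock() + budget_seconds
    for delay in backoff_delays():
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"job still {state} after {budget_seconds}s")
        sleep(min(delay, remaining))  # never sleep past the deadline


# Demo with a scripted state sequence and a fake sleep, so nothing actually waits.
states = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
slept: list = []
final_state = wait_for_job(lambda: next(states), sleep=slept.append, clock=lambda: 0.0)
first_delays = list(islice(backoff_delays(), 7))
print(final_state, slept, first_delays)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting, and capping the delay avoids long blind gaps near the end of a job.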

Accomplishments that we're proud of

🏆 95% Success Rate - Industry-leading reliability with autonomous self-healing

🏆 First-of-its-Kind Phase 4.5 - AI failure verification that distinguishes false positives (0.6% rate)

🏆 99.5% Time Savings - From 2-5 days to 90-180 seconds

🏆 99.98% Cost Reduction - From $1,600+ to $0.30 per pipeline

🏆 Production-Ready System - Fully deployed on AWS with 7-agent architecture

🏆 Complete Observability - Every operation logged with real-time monitoring

🏆 Zero Coding Required - True democratization of data engineering

🏆 Multi-Iteration Self-Healing - Up to 5 autonomous correction attempts

🏆 Hybrid Validation - Combines rule-based and AI testing for comprehensive quality assurance

🏆 Professional UI - React website with real-time logs, code display, and PDF export

What we learned

Technical Insights:
Prompt Engineering is an Art - We spent 40% of development time perfecting prompts. Providing schema + sample data improved accuracy by 30%. Temperature 0.2-0.3 works best for code generation.
AI Needs Guardrails - Claude's hallucinations required multi-layered validation and self-healing to catch 95% of issues. AI is powerful but needs architecture around it.
False Positives Kill Trust - When validation tests failed incorrectly, users lost confidence. Phase 4.5 AI verification was crucial for maintaining trust.
Serverless Scales Beautifully - From prototype to production without infrastructure changes. Lambda auto-scales, but needs careful timeout and polling strategies.
Idempotency is Critical - Dropping tables before recreation, using run_id for deduplication, and storing artifacts in S3 prevented data corruption.
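The false positives described in this write-up came from rounding differences such as 100.0 vs 100. The sketch below shows a deterministic, tolerance-aware comparison that could screen out that class of mismatch before any AI verification runs. It is an illustration only and not Phase 4.5 itself, which the write-up describes as AI-based; the function name and tolerance values are assumptions.

```python
import math
from decimal import Decimal

NUMERIC_TYPES = (int, float, Decimal)


def values_match(expected, actual, rel_tol: float = 1e-9, abs_tol: float = 1e-6) -> bool:
    """Compare two cell values, treating numerically equal values of different types as equal."""
    # bool is a subclass of int in Python, so handle it first to avoid True matching 1.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return expected is actual
    if isinstance(expected, NUMERIC_TYPES) and isinstance(actual, NUMERIC_TYPES):
        return math.isclose(float(expected), float(actual), rel_tol=rel_tol, abs_tol=abs_tol)
    return expected == actual


print(values_match(100.0, 100))             # float vs int
print(values_match(Decimal("100.00"), 100)) # Decimal vs int
print(values_match(100.01, 100))            # a real one-cent difference
```

Only mismatches that survive a check like this would need the more expensive AI judgment, and the tolerances should be chosen per column (currency usually needs a tighter rule than a computed ratio).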

Architectural Insights:
Separation of Concerns Wins - The 7-agent architecture made debugging trivial. Each agent has clear boundaries and responsibilities.
Observability from Day One - Comprehensive logging to DynamoDB saved countless debugging hours. Every operation logged with timestamp, agent, status, and metadata.
Clean Slate Approach - Always dropping and recreating tables before retries eliminated partial data issues and improved validation accuracy to 97%.
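A log entry carrying the fields named above (timestamp, agent, status, metadata) could be assembled as in the sketch below. The key names, the use of `run_id` as the grouping key, and the JSON-encoded metadata are assumptions for illustration; the write-up does not give the actual DynamoDB table layout, and the code builds a plain dict without calling AWS.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_log_item(
    run_id: str,
    agent: str,
    status: str,
    metadata: Optional[Dict[str, Any]] = None,
    now: Optional[datetime] = None,
) -> Dict[str, str]:
    """Build one flat log record; all key names here are hypothetical."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "run_id": run_id,        # groups every entry of one pipeline run
        "timestamp": timestamp,  # ISO 8601 strings sort chronologically in UTC
        "agent": agent,
        "status": status,
        # Serialize metadata so arbitrary nested values fit in one string attribute.
        "metadata": json.dumps(metadata or {}, sort_keys=True),
    }


item = build_log_item(
    run_id="run-001",
    agent="validator",
    status="FAILED",
    metadata={"phase": 4, "failed_tests": 2},
    now=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
print(item)
```

Passing `now` explicitly makes the record reproducible in tests, while production callers can rely on the UTC default.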

Team Collaboration:
Clear Ownership - Each team member owned specific agents, enabling parallel development
Daily Standups - 15-minute syncs kept everyone aligned and unblocked
Documentation as You Go - Documenting each agent immediately made integration seamless

Key Takeaway: AI can autonomously solve complex problems when given the right architecture, guardrails, and feedback loops. The future of data engineering is autonomous, intelligent, and accessible to everyone.

What's next for DataMorph

Short-Term (1-3 months)
OAuth 2.0 authentication for enterprise security
CloudFront CDN for global delivery
Syntax highlighting for code display
Performance optimizations (caching, parallel operations)
Enhanced error messages and debugging tools

Medium-Term (3-6 months)
Team collaboration features (shared pipelines, comments)
Advanced analytics dashboard (usage stats, cost tracking)
Scheduled pipeline execution with cron-like scheduling
Webhook support for external integrations
API marketplace for third-party extensions

Long-Term (6-12 months)
Multi-database support (MySQL, MongoDB, Snowflake, BigQuery)
Real-time streaming ETL with AWS Kinesis
Mobile applications (iOS/Android)
Visual pipeline builder with drag-and-drop interface
Custom model fine-tuning for domain-specific optimizations
Multi-cloud support (Azure, GCP)

Vision: Transform DataMorph from an ETL tool into a comprehensive autonomous data platform that handles the entire data lifecycle—from ingestion to transformation to analytics—all through natural language.