Inspiration
Every data team has lived this nightmare: a Stripe webhook schema changes at 2am, NULL values flood your warehouse, your revenue dashboard shows $0, and nobody notices until the CEO asks why Q1 numbers look wrong. 77% of data engineers spend over 30% of their time fighting data quality fires. The mean time to detect a data incident is 8–12 hours, and fixing it takes another 2–4 hours of manual SQL. We asked: what if an AI agent could detect these issues in seconds, fix them safely without touching production, and get smarter with every incident? That's DataGuard.
FLOW: Detect Anomaly → Search Aerospike Cache → Hit? Reuse Fix → Miss? Ask Claude → Fork Ghost DB → Apply Fix → Validate → Promote or Rollback
What it does
DataGuard is an autonomous AI agent that continuously monitors data pipelines, detects anomalies using statistical baselines, and self-heals them using safe database forking — with zero human intervention and zero production risk.

Every 5 seconds, the agent ingests data through Airbyte from 5 upstream sources (GitHub API, Stripe webhooks, Salesforce events, Segment tracking, Snowflake CDC) into Ghost DB. It runs quality checks — null rates, duplicate rates, row counts, schema drift — and compares current metrics against Aerospike-stored baselines using z-score anomaly detection.

When an anomaly is found, the agent searches Aerospike for similar past incidents using 64-dimensional vector embeddings. If it finds a match, it reuses the cached fix instantly. If it's a new pattern, Claude analyzes the data, diagnoses the root cause, and generates fix SQL. The fix is never applied directly to production — instead, the agent forks the entire database using Ghost DB's instant forking, applies the fix on the fork, runs validation checks, and only promotes the fork as the new primary if everything passes. Every fix gets fingerprinted and cached in Aerospike, so repeat incidents are resolved in under a millisecond without any AI call.

In our demo, the agent detected 3 real anomalies (two NULL spikes where 24% of event_type values went NULL, and a schema drift where the value column was dropped) and fixed all of them autonomously through fork-test-promote cycles. The second NULL spike was resolved instantly from Aerospike's pattern cache — proving the learning loop works.
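The z-score comparison above can be sketched in a few lines. This is an illustrative TypeScript version, with field names invented for the example rather than taken from DataGuard's actual schema:

```typescript
// Illustrative z-score anomaly check. Field names are invented for the
// example, not DataGuard's exact Aerospike baseline schema.
interface Baseline {
  mean: number;    // rolling mean of the metric (e.g. a column's null rate)
  stddev: number;  // rolling standard deviation
  samples: number; // number of observations behind the baseline
}

// A metric is anomalous when it sits more than `threshold` standard
// deviations away from its learned baseline.
function isAnomalous(current: number, b: Baseline, threshold = 3): boolean {
  if (b.samples < 10 || b.stddev === 0) return false; // baseline not ready yet
  const z = Math.abs(current - b.mean) / b.stddev;
  return z > threshold;
}

// A column whose null rate is normally ~2% suddenly jumps to 24%:
const nullRate: Baseline = { mean: 0.02, stddev: 0.01, samples: 200 };
console.log(isAnomalous(0.24, nullRate));  // z = 22  -> true
console.log(isAnomalous(0.025, nullRate)); // z = 0.5 -> false
```

Storing mean, stddev, and sample count per column (as the baselines set does) is all the state this check needs, and the sample-count guard keeps the agent from alerting before a baseline has stabilized.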
How we built it
We built the entire system in TypeScript on Node.js as a single process running the agent loop, the REST API server, and SSE streaming.

Ghost DB serves as the primary data store — a cloud PostgreSQL 18.3 database on TimescaleDB with three tables: raw_events (pipeline data with source, event_type, JSONB payload, value), dq_metrics (quality check results), and remediations (a full audit log of every fix). Ghost DB's instant forking is the core safety mechanism — we created 3 forks during the demo, each a full copy-on-write clone used to test fixes before promoting.

Aerospike runs in Docker on GitHub Codespaces with a REST gateway, storing data across 4 sets: metrics (time-series check results), baselines (statistical profiles with mean, stddev, and sample count per column), anomaly_fingerprints (the pattern cache with fix SQL and success/failure status), and agent_state (agent health).

Airbyte is implemented as a pipeline manager orchestrating data sync from 5 simulated SaaS sources into Ghost DB — 9 sync jobs totaling 450+ records, with per-connection health monitoring.

Claude Sonnet powers both the anomaly diagnosis engine (generating root cause analysis and single-statement fix SQL) and an embedded dashboard chatbot that answers natural language questions about the live system state.

The frontend is a single-file HTML dashboard with Three.js 3D visualization showing data flowing between sponsor nodes, Server-Sent Events for real-time streaming, and the AI chatbot — no build step, no framework, just vanilla JS.
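The fork-test-promote cycle at the heart of the safety story reduces to a small control-flow skeleton. This is a hedged sketch: the four callbacks stand in for the real Ghost CLI and pg calls, which are not shown here.

```typescript
// Illustrative skeleton of the fork-test-promote cycle. The callbacks are
// hypothetical stand-ins for Ghost CLI / pg operations, not a real API.
type ForkResult = "promoted" | "rolled_back";

async function forkTestPromote(
  applyFixOnFork: () => Promise<void>,   // run the fix SQL on the fork only
  validateFork: () => Promise<boolean>,  // re-run quality checks on the fork
  promoteFork: () => Promise<void>,      // fork becomes the new primary
  deleteFork: () => Promise<void>,       // discard the fork
): Promise<ForkResult> {
  await applyFixOnFork();
  if (await validateFork()) {
    await promoteFork();
    return "promoted";
  }
  await deleteFork(); // bad fix? production never saw it
  return "rolled_back";
}

// Demo with stubs: a fix that passes validation gets promoted.
forkTestPromote(
  async () => {},     // apply fix
  async () => true,   // validation passes
  async () => {},     // promote
  async () => {},     // delete
).then((result) => console.log(result)); // prints "promoted"
```

The point of the shape is that production is never a code path: the fix either proves itself on a disposable copy or the copy is thrown away.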
Challenges we ran into
The hardest bugs were all at integration boundaries between sponsor tools.

- The Ghost CLI exits non-zero even when it prints valid JSON — the create command writes its output to stderr on success, so we had to parse both streams for JSON before treating a non-zero exit as an error.
- Ghost DB forks get different passwords than the parent database, causing authentication failures until we extracted connection strings directly from the fork's JSON output.
- We discovered mid-hackathon that Aerospike Cloud requires enterprise billing with AWS VPC peering and a sales contract, so we pivoted to running Aerospike in Docker on a free GitHub Codespace with port forwarding to localhost.
- Claude kept generating multi-statement SQL wrapped in BEGIN/COMMIT transactions, but the node-postgres client can't run multiple statements in a single query call — we fixed this by constraining the system prompt and adding SQL-splitting logic as a safety net.
- After injecting schema drift (dropping the value column), every subsequent INSERT crashed in cascading failures — we had to add runtime column existence checks via information_schema.columns before each insert.
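The multi-statement SQL safety net can be sketched like this. It is a naive version: splitting on ";" is acceptable here only under the assumption, which held for our generated fixes, that statements contain no semicolons inside string literals.

```typescript
// Safety-net sketch for multi-statement SQL from the model: strip
// BEGIN/COMMIT/ROLLBACK wrappers and split the rest into single statements
// that pg's query() can run one at a time. Naive split on ";", assuming no
// semicolons inside string literals (true for our generated fixes).
function splitFixSql(sql: string): string[] {
  return sql
    .split(";")
    .map((stmt) => stmt.trim())
    .filter((stmt) => stmt.length > 0)
    // drop transaction control; each statement gets its own query call
    .filter((stmt) => !/^(BEGIN|COMMIT|ROLLBACK)\b/i.test(stmt));
}

const claudeOutput = `
BEGIN;
UPDATE raw_events SET event_type = 'unknown' WHERE event_type IS NULL;
COMMIT;`;
console.log(splitFixSql(claudeOutput));
// -> ["UPDATE raw_events SET event_type = 'unknown' WHERE event_type IS NULL"]
```

Constraining the system prompt to single statements remained the primary fix; this splitter only catches the cases where the model slips.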
Accomplishments that we're proud of
The agent successfully detected 3 real anomalies and fixed 2 of them completely autonomously — no human touched the database. The fork-test-promote workflow worked flawlessly: fork the database in seconds, apply the fix on the copy, validate, promote. Production was never at risk. The pattern matching system proved itself when the second NULL spike was resolved instantly from Aerospike's cache without calling Claude — turning AI costs from O(n) to O(1) for repeat incidents. The entire detection-to-fix cycle runs in under 30 seconds compared to the industry average of 8–12 hours for detection alone. All three sponsor tools are genuinely load-bearing: remove Ghost DB and you lose safe remediation, remove Aerospike and you lose baselines and pattern caching, remove Airbyte and no data flows in.
What we learned
Database forking is a superpower for autonomous agents — it lets AI make mistakes safely because if the fix is wrong, you just delete the fork. Pattern caching turns AI costs into O(1) since the first anomaly costs a Claude API call but every repeat is free and instant from Aerospike. Statistical baselines beat static thresholds because a 15% null rate might be normal for one column and catastrophic for another, and z-scores adapt automatically. The hardest bugs are never in individual tools but at the seams between them: auth handoffs, JSON parsing quirks, SSL requirements, CLI exit codes. Building for a hackathon means infrastructure can fail at any moment — having fallback paths (codespace instead of cloud, direct API calls from browser instead of backend) saved us multiple times.
What's next for DataGuard
We plan to:

- add real-time Airbyte Cloud integration with actual connector sources
- expand anomaly detection to include distribution shift detection and cross-table referential integrity checks
- build a Slack/PagerDuty integration so the agent notifies teams even as it self-heals
- add a remediation approval mode for production environments, where humans review before promote
- implement multi-database support so one DataGuard agent monitors an entire data platform across dozens of databases simultaneously
Built With
- aerospike
- airbyte
- claude-api-(anthropic)
- ghost-db
- github-codespaces
- html/css
- node.js
- postgresql
- three.js
- typescript