Data Doctor

Inspiration

As engineers, we take plenty of time troubleshooting, so we wanted to make something that would reduce the amount of troubleshooting and therefore make our time more efficient. It's the fastest way to go from, "What went wrong" to "Let's move forward".

What it does

Data Doctor is an AI agent built on Databricks that automatically diagnoses, explains, and helps fix broken or underperforming data pipelines. Instead of spending hours combing through logs, schema diffs, or performance metrics, engineers can simply ask, ‘Why did my pipeline fail?’ and Data Doctor will trace the root cause, highlight schema drifts, identify performance bottlenecks, and even suggest code fixes using LLMs. It’s like having a senior data engineer embedded in your lakehouse — proactive, intelligent, and always available.

How we built it

Automated Code Correction for Failed Pipeline Executions Using Databricks

Our solution integrates Databricks notebooks, model serving, and Databricks workflows to automate targeted code updates based on failed pipeline executions. This approach enhances debugging efficiency by dynamically identifying and resolving notebook errors.

System Workflow

The agent follows a structured sequence to diagnose and correct failures:

Analyze all failed job runs within the pipeline.

Locate the corresponding notebook using the Databricks Notebook API.

Identify the exact cell(s) responsible for the failure.

Send the failure message and problematic code to the Databricks Model Serving Endpoint.

Apply the model-generated correction by seamlessly updating the impacted code cells.

Automated Execution & Code Integrity

The agent is scheduled to continuously monitor failed Databricks notebook executions via the Databricks Notebook API, ensuring timely error detection and resolution. It systematically retrieves corrections from the trained model, ensuring that updates modify only the necessary code sections while maintaining operational integrity.

Challenges we ran into

understanding the notebook failure structure where there are multiple nested tasks that hold the output error message
learning how to use the model serving endpoint and passing the needed parameters to the agent