How We Built It
We designed NetHealer to feel less like a monitoring tool and more like an autonomous operations system. From the beginning, the goal was to build something that could observe infrastructure, reason about failures, and respond in real time — the same way a human NOC team would, but faster and continuously.
To make that possible, we split the architecture into two clearly defined layers: a Python backend that handles AI reasoning and infrastructure automation, and a Next.js frontend that functions as a live Network Operations Center dashboard. The two communicate through persistent WebSocket streams so that every telemetry signal, AI decision, and remediation step appears instantly on the screen.
The result is a system where you can literally watch the network think and heal itself in real time.
The AI Pipeline
At the heart of NetHealer is a four-stage multi-agent reasoning pipeline, powered by Amazon Nova Lite through AWS Bedrock.
Whenever new telemetry arrives, it enters the pipeline and flows through four reasoning stages:
Telemetry Analysis → Root Cause Diagnosis → Remediation Planning → Automated Execution
Each stage acts like a specialized AI operator. Instead of using a single prompt, we structured the pipeline so that each agent receives a clear role and a full snapshot of the network state. The agent then sends a structured prompt to Nova, receives a structured response, and passes that output to the next stage.
This creates a deterministic reasoning chain, where every AI decision directly influences the next step in the process.
To make the system reliable enough to control infrastructure, Nova runs with a low temperature setting, ensuring consistent and structured outputs. When the result of an AI response could trigger a real infrastructure action — like rerouting DNS traffic or restarting services — predictability matters more than creativity.
Under the hood, the orchestrator runs each Nova inference in a background thread. This allows the FastAPI event loop to stay free and continue broadcasting updates to the dashboard between pipeline stages.
The effect is surprisingly powerful: the moment telemetry arrives, the dashboard begins updating as the AI analyzes the problem, identifies a root cause, and generates a remediation plan — all within seconds.
You’re not just seeing the result. You’re watching the reasoning process happen live.
Telemetry
A system that heals infrastructure has to understand it first. NetHealer pulls telemetry from two real data sources to give the AI a full picture of what’s happening.
The first source is the local machine collector, built with the psutil library. Every three seconds it gathers host-level metrics including CPU utilization, memory usage, disk usage, network interface throughput, active TCP connections, and battery state. These signals provide insight into the health of the machine running the system.
At the same time, NetHealer pulls real network performance data from ThousandEyes using their v7 API. These tests measure latency, packet loss, and jitter across real internet paths — including probes targeting Google DNS, AWS US-East, and the Bedrock API endpoint itself.
Combining these two telemetry streams gives Nova a powerful advantage. Instead of seeing only infrastructure metrics or only network metrics, the AI sees both layers at once. This allows it to reason about complex failures — distinguishing, for example, between a server overload and a network path degradation.
The telemetry streams are merged into a single unified snapshot before entering the AI pipeline, ensuring every decision is based on a consistent view of the system.
Execution
Once Nova generates a remediation plan, NetHealer turns those decisions into real infrastructure actions.
Every action produced by the AI is automatically categorized into three groups: ThousandEyes actions, safe AWS actions, and destructive AWS actions.
ThousandEyes actions — such as pausing monitoring tests, swapping agent pools, or adjusting alert thresholds — execute immediately. These actions improve monitoring visibility without affecting production traffic.
Safe AWS actions also run automatically. These include operations like rerouting traffic using Route53 DNS weighting or restarting services on EC2 instances through AWS Systems Manager.
For more aggressive responses, such as blocking IP addresses or isolating infrastructure nodes, NetHealer adds a human checkpoint. These actions appear on the dashboard and can be approved with a single click.
This approach keeps the system fully autonomous for safe recovery tasks, while still maintaining responsible oversight for actions that could disrupt live systems.
Frontend
The NetHealer dashboard is designed to feel like a modern Network Operations Center.
It’s built as a single-page Next.js application, with every component implemented from scratch. We intentionally avoided external UI libraries so we could tailor the interface exactly to the system’s behavior.
The centerpiece of the interface is the live network topology map, rendered entirely with HTML Canvas.
The map runs at 60 frames per second using requestAnimationFrame, enabling fluid animation and real-time updates.
Nodes represent infrastructure components, and connections between them show the active data paths across the network. Each connection contains animated light streaks that travel continuously between nodes, simulating real network traffic.
These visual signals aren’t static — they respond directly to telemetry from the backend. When latency rises, pulses slow down. When packet loss occurs, colors shift from blue to yellow to red. As the AI remediates issues, the network visibly returns to a healthy state.
The entire interface updates instantly through a WebSocket connection to the backend, meaning every anomaly detection, root cause analysis, and remediation action appears on screen as it happens.
Instead of reading logs after the fact, operators can watch the system diagnose and repair the network live.
In the end, NetHealer behaves less like a monitoring dashboard and more like an autonomous infrastructure control system — one that continuously observes, reasons, and repairs the network before small issues become real outages.
Log in or sign up for Devpost to join the conversation.