# QNX Device Doctor

## Inspiration

Embedded devices are everywhere: cars, robots, factories, labs, medical systems, and remote infrastructure. But when something goes wrong, debugging is still very manual. Engineers often have to SSH into the device, inspect processes, read logs, diagnose failures, restart services, verify recovery, and write an incident report.

We wanted to build an AI system that feels like an embedded systems engineer on call: not just a chatbot, but an agent that can observe, diagnose, plan, ask for approval, act safely, verify the fix, and report what happened.

## What it does

QNX Device Doctor is an agentic AI SRE for QNX embedded devices.

It runs against a Raspberry Pi running QNX 8.0 Non-Commercial. The QNX-side daemon starts real managed demo service processes with real PIDs and heartbeat files. The dashboard monitors those services, hardware-facing adapters, logs, diagnostics, and timeline events.
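As a rough illustration of how PID and heartbeat checks can be combined, here is a minimal sketch using only the Python standard library. It assumes each service refreshes its heartbeat file's modification time periodically; the 5-second threshold, the function names, and the status strings are hypothetical and not taken from the project.

```python
import os
import time
from typing import Optional

# Assumption: a heartbeat older than this many seconds means the service is hung.
HEARTBEAT_MAX_AGE_S = 5.0


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX semantics)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, it sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def heartbeat_age(path: str) -> Optional[float]:
    """Seconds since the heartbeat file was last modified, or None if missing."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    return time.time() - mtime


def service_health(pid: int, heartbeat_path: str) -> str:
    """Classify a service as 'dead', 'hung', or 'healthy'."""
    if not pid_alive(pid):
        return "dead"
    age = heartbeat_age(heartbeat_path)
    if age is None or age > HEARTBEAT_MAX_AGE_S:
        return "hung"  # process exists but has stopped reporting
    return "healthy"
```

Checking both signals distinguishes a crashed process (no PID) from one that is still running but stuck (stale heartbeat), which may call for different repairs.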

The demo includes:

- Managed QNX demo services for sensor, telemetry, control, network, and watchdog behavior
- Real process-level failure detection using PIDs and heartbeats
- Hardware-adapter diagnostics for a sensor array and status light
- A natural-language "Ask Device Doctor" flow
- Human approval before repairs
- Safe allowlisted repair actions
- Recovery verification
- Incident report generation
- Optional GPT-4o reasoning layer for repair planning
- Unsafe action blocking for dangerous operations like deleting logs, rebooting, or running arbitrary shell commands

One demo flow is: the user reports "my lightbulb is not working." Device Doctor inspects the control service, process health, status light adapter, desired state, actual state, power, driver responsiveness, and config validity. It diagnoses an invalid actuator configuration, proposes a safe repair, waits for approval, applies the fix, verifies that the status light is healthy again, and generates a report.
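The diagnosis step in this flow could be approximated by a small rule-based check over the collected evidence. The sketch below is illustrative only: the `LightEvidence` fields mirror the items listed above, but the field names, cause labels, and repair names are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass
class LightEvidence:
    # Field names are illustrative, chosen to match the checks described above.
    control_service_running: bool
    power_ok: bool
    driver_responsive: bool
    config_valid: bool
    desired_state: str
    actual_state: str


def diagnose_status_light(ev: LightEvidence) -> dict:
    """Return a likely cause and a candidate repair (None if a human is needed)."""
    if ev.desired_state == ev.actual_state:
        return {"cause": "none", "repair": None}
    # Check the most fundamental causes first, then the more specific ones.
    if not ev.control_service_running:
        return {"cause": "control_service_down", "repair": "restart_control_service"}
    if not ev.power_ok:
        return {"cause": "no_power", "repair": None}
    if not ev.driver_responsive:
        return {"cause": "driver_unresponsive", "repair": "reset_light_adapter"}
    if not ev.config_valid:
        return {
            "cause": "invalid_actuator_config",
            "repair": "restore_default_actuator_config",
        }
    return {"cause": "unknown", "repair": None}
```

For the lightbulb scenario, evidence with a running service, good power, a responsive driver, and `config_valid=False` would yield the `invalid_actuator_config` cause.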

## How we built it

We built three main components:

1. A QNX device daemon written in Python using only the standard library.
2. A plain HTML/CSS/JavaScript dashboard running on the laptop.
3. A Device Doctor agent that performs the workflow: Observe -> Diagnose -> Plan -> Ask Approval -> Act -> Verify -> Report.
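One way to picture the agent workflow is as a pipeline of pluggable steps with an approval gate before anything acts on the device. This is a sketch under assumptions: `run_workflow`, its parameters, and the report keys are hypothetical names for illustration, not the project's actual interface.

```python
from typing import Callable, List, Optional


def run_workflow(
    observe: Callable[[], dict],
    diagnose: Callable[[dict], Optional[str]],
    plan: Callable[[str], List[str]],
    approve: Callable[[List[str]], bool],
    act: Callable[[str], None],
    verify: Callable[[], bool],
) -> dict:
    """Run Observe -> Diagnose -> Plan -> Ask Approval -> Act -> Verify -> Report."""
    report = {"steps": []}

    evidence = observe()
    report["steps"].append("observe")

    cause = diagnose(evidence)
    report["steps"].append("diagnose")
    report["cause"] = cause
    if cause is None:
        report["outcome"] = "healthy"
        return report

    actions = plan(cause)
    report["steps"].append("plan")
    report["actions"] = actions

    # Nothing touches the device unless the human approves the plan.
    report["steps"].append("ask_approval")
    if not approve(actions):
        report["outcome"] = "rejected"
        return report

    for action in actions:
        act(action)
    report["steps"].append("act")

    report["steps"].append("verify")
    report["outcome"] = "recovered" if verify() else "not_recovered"
    return report  # the returned dict plays the role of the incident report
```

Passing each step in as a callable keeps the control flow fixed while letting the dashboard and the CLI supply different approval prompts.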

The QNX daemon exposes HTTP endpoints for device state, service status, logs, timeline, diagnostics, hardware diagnostics, failure injection, repair actions, and unsafe-action blocking.
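A standard-library-only HTTP daemon of this kind might look roughly like the sketch below. The route names (`/state`, `/services`), the `DEVICE_STATE` structure, and the `make_server` helper are assumptions for illustration; the project's real endpoint paths are not listed here.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical in-memory state; a real daemon would populate this from its
# managed processes and hardware adapters.
DEVICE_STATE = {
    "services": {"sensor": "healthy", "control": "healthy"},
    "timeline": [],
}


class DeviceHandler(BaseHTTPRequestHandler):
    def _send_json(self, status: int, payload) -> None:
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self) -> None:
        # Route names are assumptions for this sketch.
        if self.path == "/state":
            self._send_json(200, DEVICE_STATE)
        elif self.path == "/services":
            self._send_json(200, DEVICE_STATE["services"])
        else:
            self._send_json(404, {"error": "unknown endpoint"})

    def log_message(self, format: str, *args) -> None:
        pass  # silence default stderr logging in this sketch


def make_server(host: str = "127.0.0.1", port: int = 8080) -> ThreadingHTTPServer:
    """Build the server without starting it, so the caller controls its lifetime."""
    return ThreadingHTTPServer((host, port), DeviceHandler)
```

Calling `make_server("0.0.0.0", 8080).serve_forever()` on the device would then let a client on the laptop poll `/services` for status.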

The dashboard connects to the QNX daemon over HTTP and provides a polished live demo interface. The CLI agent can also run the workflow from the terminal.

We also added an optional GPT-4o brain. GPT-4o can propose a structured repair plan from device evidence, but it does not execute actions directly. A deterministic safety policy validates every proposed action before anything touches the device.
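A deterministic policy like this can be sketched as a default-deny allowlist check over the model's proposed steps. The action names, argument shapes, and `validate_plan` function below are hypothetical; only the service names come from the demo service list above.

```python
# Hypothetical allowlist: action name -> exact set of argument names it accepts.
ALLOWED_ACTIONS = {
    "restart_service": {"service"},
    "restore_default_config": {"service"},
}
KNOWN_SERVICES = {"sensor", "telemetry", "control", "network", "watchdog"}


def validate_plan(plan):
    """Split a proposed plan into (approved, rejected) steps.

    Default-deny: anything not explicitly allowlisted is rejected, so
    dangerous actions never need to be enumerated one by one.
    """
    approved, rejected = [], []
    for step in plan:
        if not isinstance(step, dict):
            rejected.append((step, "malformed step"))
            continue
        action = step.get("action")
        args = step.get("args", {})
        if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
            rejected.append((step, "action not allowlisted"))
        elif not isinstance(args, dict) or set(args) != ALLOWED_ACTIONS[action]:
            rejected.append((step, "unexpected or missing arguments"))
        elif not isinstance(args["service"], str) or args["service"] not in KNOWN_SERVICES:
            rejected.append((step, "unknown service"))
        else:
            approved.append(step)
    return approved, rejected
```

Because the check is plain code rather than a prompt, a plan containing something like `{"action": "reboot"}` is rejected regardless of how the model justifies it.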

## Challenges we faced

We were first-time QNX users, so we intentionally avoided fragile kernel-level or hardware-specific work. We also avoided Linux-only assumptions like systemd, journalctl, Docker, and arbitrary shell execution.

A major challenge was making the demo feel real while staying safe and reliable. We solved this by running real managed demo processes on QNX with real PIDs and heartbeats, while modeling hardware through safe adapters that can later be replaced with real GPIO, I2C, sensor, or actuator integrations.

Another challenge was AI safety. We did not want an LLM freely executing commands on an embedded device. Our solution separates reasoning from execution: the AI can propose a plan, but only allowlisted safe tools can act.

## What we learned

We learned how different embedded operations are from normal web/server operations. QNX does not behave like a typical Linux server environment, and embedded recovery needs to be careful, auditable, and constrained.

We also learned that the best AI agent design is not "let the model do everything." It is better to give the model structured tools, safe boundaries, approval gates, and verification steps.

## What's next

Next, we would replace the demo hardware adapters with real QNX integrations for GPIO, I2C, sensors, actuators, watchdogs, and production service logs. We would also add multi-device fleet monitoring, persistent incident history, stronger authentication, and deeper GPT-4o report generation.