🚀 Inspiration

Meetings generate decisions, action items, and commitments, yet most of this context is lost after the call. At the same time, teams increasingly rely on LLMs but treat them like black boxes, with no visibility into latency, failures, or reliability.

I wanted to solve both problems together:

Extract structured action items from meetings using AI

Treat the LLM itself as a production system that must be observable

That led to InsightPilot, an AI meeting copilot with built-in LLM observability.

🧠 What It Does

InsightPilot processes meeting transcripts and:

Extracts action items, owners, and deadlines

Detects semantic failures (e.g., invalid JSON, reasoning errors)

Tracks LLM performance in real time:

Median & P95 latency

Request volume

Error count & error rate

All LLM behavior is monitored using Datadog dashboards and alerts, just like any production service.

๐Ÿ—๏ธ How I Built It Core Architecture

FastAPI backend exposes an API to process meeting transcripts

Google Vertex AI (Gemini 2.5 Flash) performs structured extraction

Custom instrumentation measures:

LLM latency

Request count

Semantic errors (invalid JSON)

Datadog Agent collects and visualizes metrics
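The instrumentation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `LLMStats`, `extract_action_items`, and the `call_model` parameter are hypothetical names, and `call_model` stands in for whatever wrapper performs the Vertex AI request. The timer wraps only the model call, and a response that is not valid JSON is counted as a semantic error.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LLMStats:
    """In-memory counters standing in for real metric emission (hypothetical)."""
    requests: int = 0
    errors: int = 0
    latencies_ms: List[float] = field(default_factory=list)


def extract_action_items(
    transcript: str,
    call_model: Callable[[str], str],
    stats: LLMStats,
) -> Optional[dict]:
    """Call the model, time only the inference step, and flag invalid JSON."""
    stats.requests += 1

    start = time.perf_counter()
    raw = call_model(transcript)  # assumed wrapper around the LLM request
    stats.latencies_ms.append((time.perf_counter() - start) * 1000.0)

    try:
        # Parsing happens outside the timed region on purpose.
        return json.loads(raw)
    except json.JSONDecodeError:
        # Semantic failure: the call succeeded but the output is unusable.
        stats.errors += 1
        return None
```

Passing the model call in as a function keeps the sketch testable with a stub, for example `extract_action_items(text, lambda t: '{"action_items": []}', LLMStats())`.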

Observability Layer

Custom Datadog metrics:

llm.response_latency_ms

llm.requests.count

llm.requests.error

Dashboards for:

P50 vs P95 latency

Error rate

Request volume

Alerts for:

High LLM latency

High semantic failure rate

This turns the LLM into a fully observable system, not a blind dependency.
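As a rough sketch of how the three custom metrics could reach the Datadog Agent, the example below builds DogStatsD datagrams by hand and sends them over UDP using only the standard library. Several things here are assumptions: the helper names, the `model` tag, the choice of a histogram (`h`) for latency, and an Agent listening on the default DogStatsD port 8125 on localhost. In practice the `statsd` client from the official `datadog` Python package would likely replace the hand-rolled socket code.

```python
import socket
from typing import Dict, List, Optional


def format_dogstatsd(name: str, value: float, metric_type: str,
                     tags: Optional[Dict[str, str]] = None) -> str:
    """Build one DogStatsD datagram: <name>:<value>|<type>|#<key>:<value>,..."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(f"{k}:{v}" for k, v in tags.items())
    return datagram


def send_metric(datagram: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; metric emission should never break a request."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(datagram.encode("utf-8"), (host, port))
    except OSError:
        pass  # no Agent reachable: drop the metric silently


def record_llm_call(latency_ms: float, ok: bool,
                    model: str = "gemini-2.5-flash") -> List[str]:
    """Emit latency and request count for every call, plus an error count on failure."""
    tags = {"model": model}  # tag name is an assumption for illustration
    datagrams = [
        format_dogstatsd("llm.response_latency_ms", round(latency_ms, 2), "h", tags),
        format_dogstatsd("llm.requests.count", 1, "c", tags),
    ]
    if not ok:
        datagrams.append(format_dogstatsd("llm.requests.error", 1, "c", tags))
    for datagram in datagrams:
        send_metric(datagram)
    return datagrams
```

Emitting the request count on every call and the error count only on failures is what makes an error-rate formula possible on the dashboard side.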

โš”๏ธ Challenges Faced

LLM output instability: Gemini sometimes returns markdown-wrapped JSON, which breaks parsers. I handled this by detecting these semantic failures and emitting error metrics

Metric math in Datadog: Computing the error rate required setting up the right queries and a formula that divides the error count by the request count

Latency accuracy: The timer had to wrap the model inference call only, so that parsing and request handling would not inflate the numbers

Dashboard clarity: Balancing signal vs noise for judges and demo viewers

Each challenge directly shaped the final architecture.
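Two of these challenges lend themselves to a short sketch: classifying markdown-wrapped output and computing an error rate from the two counters. The function names and failure labels below are hypothetical. Unwrapping the fenced payload is an illustrative addition rather than something the project is described as doing, so the wrapped case is still reported as a failure reason and can be counted.

```python
import json
import re
from typing import Optional, Tuple

# Matches a payload wrapped entirely in a ``` or ```json fence.
_FENCE = re.compile(r"^\s*```(?:json)?\s*\n(.*?)\n?\s*```\s*$",
                    re.DOTALL | re.IGNORECASE)


def parse_model_json(raw: str) -> Tuple[Optional[object], Optional[str]]:
    """Return (data, failure); failure is None, "markdown_wrapped", or "invalid_json"."""
    match = _FENCE.match(raw)
    payload = match.group(1) if match else raw
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return None, "invalid_json"
    # The payload parsed, but flag the wrapper so it can still be tracked.
    return data, ("markdown_wrapped" if match else None)


def error_rate_percent(errors: int, requests: int) -> float:
    """Errors divided by requests, as a percentage; 0.0 when there is no traffic."""
    return 0.0 if requests == 0 else 100.0 * errors / requests
```

The same division is what a dashboard formula over `llm.requests.error` and `llm.requests.count` would express; the zero-traffic guard matters there too, since an empty window otherwise produces a meaningless ratio.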

📚 What I Learned

LLMs must be observed, not trusted blindly

Semantic failures are as critical as crashes

P95 latency matters more than averages

Observability tools like Datadog can (and should) be applied to AI systems
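A small worked example shows why P95 can say more than an average. The latency samples are made up for illustration, and the nearest-rank percentile used here is only one of several common definitions, so Datadog's own percentile aggregation may give slightly different values.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up samples: 18 fast calls and 2 slow ones, in milliseconds.
latencies_ms = [400] * 18 + [4000, 6000]

mean = statistics.mean(latencies_ms)   # 860
p50 = percentile(latencies_ms, 50)     # 400
p95 = percentile(latencies_ms, 95)     # 4000

print(f"mean={mean} ms, p50={p50} ms, p95={p95} ms")
```

The mean of 860 ms describes no request that actually happened: the typical call takes 400 ms, while one call in ten takes 4 seconds or more, which only the P95 makes visible.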

🔮 What’s Next

Live meeting ingestion (Zoom / Meet)

Prompt drift detection

Model comparison dashboards

Automated rollback or alert-driven workflows

Memory across meetings
