Inspiration
Meetings generate decisions, action items, and commitments, yet most of this context is lost after the call. At the same time, teams increasingly rely on LLMs but treat them like black boxes, with no visibility into latency, failures, or reliability.
I wanted to solve both problems together:
Extract structured action items from meetings using AI
Treat the LLM itself as a production system that must be observable
That led to InsightPilot, an AI meeting copilot with built-in LLM observability.
What It Does
InsightPilot processes meeting transcripts and:
Extracts action items, owners, and deadlines
Detects semantic failures (e.g., invalid JSON, reasoning errors)
Tracks LLM performance in real time:
Median & P95 latency
Request volume
Error count & error rate
All LLM behavior is monitored using Datadog dashboards and alerts, just like any production service.
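As a rough illustration of the numbers tracked above, the following sketch computes median latency, P95 latency, and error rate from a batch of samples. It is a local approximation only: in the project these values come from Datadog, and the function names `percentile` and `summarize` are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # pct * n is computed first so whole-number ranks are not disturbed by rounding.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, error_count):
    """Summarize one window of LLM requests (hypothetical helper)."""
    total = len(latencies_ms)
    if total == 0:
        return {"requests": 0, "p50_ms": None, "p95_ms": None, "error_rate": 0.0}
    return {
        "requests": total,
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "error_rate": error_count / total,
    }


# 18 fast requests and 2 slow ones, 2 of the 20 counted as semantic failures.
window = [100] * 18 + [900, 1200]
print(summarize(window, error_count=2))
```

For this window the median is 100 ms and the P95 is 900 ms, while the mean is 195 ms, which is why the tail percentile is tracked alongside the median.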
How I Built It
Core Architecture
FastAPI backend exposes an API to process meeting transcripts
Google Vertex AI (Gemini 2.5 Flash) performs structured extraction
Custom instrumentation measures:
LLM latency
Request count
Semantic errors (invalid JSON)
Datadog Agent collects the metrics, and Datadog dashboards visualize them
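The instrumentation listed above can be sketched roughly as follows. This is not the project's actual code: `call_model` stands in for the Vertex AI request and `emit` for whatever sends metrics to the Datadog Agent, and both are hypothetical callables injected so the example runs on its own. The metric names are the ones used in this writeup.

```python
import json
import time


def process_transcript(transcript, call_model, emit):
    """Run one extraction, timing only the model call and reporting metrics.

    call_model: hypothetical callable, takes the transcript and returns raw text.
    emit: hypothetical callable, takes (metric_name, value).
    """
    emit("llm.requests.count", 1)

    # Time only the model call so parsing cost does not inflate LLM latency.
    start = time.perf_counter()
    try:
        raw = call_model(transcript)
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        emit("llm.response_latency_ms", elapsed_ms)

    # A response that is not valid JSON is counted as a semantic error.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        emit("llm.requests.error", 1)
        return None


if __name__ == "__main__":
    fake_model = lambda text: '{"action_items": [{"task": "Send notes", "owner": "Ana"}]}'
    print(process_transcript("Ana will send notes.", fake_model, lambda n, v: print(n, v)))
```

Passing the model call and the metric sink as arguments keeps the timing logic testable without network access.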
Observability Layer
Custom Datadog metrics:
llm.response_latency_ms
llm.requests.count
llm.requests.error
Dashboards for:
P50 vs P95 latency
Error rate
Request volume
Alerts for:
High LLM latency
High semantic failure rate
This turns the LLM into a fully observable system, not a blind dependency.
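One way such custom metrics can reach the Datadog Agent is as DogStatsD datagrams over UDP, by default on port 8125. The sketch below formats and sends them using only the standard library. The writeup does not say which metric types or tags were used, so the counter (`c`) and histogram (`h`) types and the `model:` tag are assumptions, as are the helper names.

```python
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build a DogStatsD datagram of the form name:value|type|#tag1,tag2."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        line += "|#" + ",".join(tags)
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one datagram to a local Agent; UDP is fire-and-forget."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


if __name__ == "__main__":
    tags = ["model:gemini-2.5-flash"]  # assumed tag, not from the writeup
    send_metric(format_metric("llm.requests.count", 1, "c", tags))
    send_metric(format_metric("llm.response_latency_ms", 412.5, "h", tags))
```

A histogram is chosen for latency here because the Agent can derive percentile series from it; whether the project did the same is not stated.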
Challenges Faced
LLM output instability: Gemini sometimes returns markdown-wrapped JSON, which breaks parsers. Solved by detecting these semantic failures and emitting error metrics
Metric math in Datadog: Computing the error rate required setting up the right queries and a formula that combines them
Latency accuracy: Needed precise timing around model inference only
Dashboard clarity: Balancing signal vs noise for judges and demo viewers
Each challenge directly shaped the final architecture.
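To make the markdown-wrapped JSON problem concrete, here is a minimal sketch that classifies a response as clean, markdown-wrapped, or invalid. Recovering the payload from inside the fence is an extra step that this writeup does not describe; the project reports such responses as semantic failures. The function name and the failure labels are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable

# Matches a response that is entirely one fenced block, with an optional
# language label such as "json" after the opening fence.
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_llm_json(raw):
    """Return (data, failure); failure is None, "markdown_wrapped", or "invalid_json"."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError:
        pass

    match = FENCE_RE.match(raw)
    if match:
        try:
            # The payload was recoverable, but the wrapper is still reported.
            return json.loads(match.group(1)), "markdown_wrapped"
        except json.JSONDecodeError:
            pass
    return None, "invalid_json"


if __name__ == "__main__":
    wrapped = FENCE + 'json\n{"action_items": []}\n' + FENCE
    print(parse_llm_json(wrapped))
```

The failure label could be attached as a tag on the error metric so that wrapper problems and truly malformed output show up separately.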
What I Learned
LLMs must be observed, not trusted blindly
Semantic failures are as critical as crashes
P95 latency matters more than averages
Observability tools like Datadog can (and should) be applied to AI systems
What's Next
Live meeting ingestion (Zoom / Meet)
Prompt drift detection
Model comparison dashboards
Automated rollback or alert-driven workflows
Memory across meetings