TokenInsights

Inspiration

Visibility into token counts, time to first token, and total token usage is essential for teams building LLM agents and applications, because token-based pricing ties the cost of every query, agent action, and prompt directly to the tokens it consumes. Without granular metrics, teams risk runaway expenses and miss opportunities to optimize their budgets.

Tracking these metrics helps:

Control spend: Each token consumed by an LLM represents real, sometimes substantial, financial outlay; real-time token monitoring prevents unexpected overages and improves forecasting.
Optimize performance: Monitoring "time to first token" and total completion time is crucial for diagnosing bottlenecks and tuning prompt design or model selection for faster, more predictable user experiences.
Engineer for scale: Large-scale agents and applications must monitor token dynamics across features and users to ensure cost-efficient growth, avoiding the pitfall of costs growing faster than usage.
Enable governance and accountability: Fine-grained token and latency data support chargeback, team-level budgeting, and compliance with organizational AI usage policies.
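
As a rough illustration of the two measurements named above, the sketch below times a streamed response and converts token counts into an estimated cost. It is not MemMachine's implementation: the names measure_stream, RequestMetrics, and estimate_cost_usd are hypothetical, each streamed chunk is assumed to be one token, and the per-1,000-token prices are placeholder parameters rather than any provider's real rates.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class RequestMetrics:
    """Measurements for one streamed LLM response (hypothetical structure)."""
    completion_tokens: int
    time_to_first_token_s: Optional[float]  # None if the stream was empty
    total_time_s: float


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> RequestMetrics:
    """Consume a token stream, recording time to first token and total time.

    Assumption: every item yielded by `tokens` counts as exactly one token.
    """
    start = clock()
    first_token_at: Optional[float] = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            # Elapsed time until the first chunk arrives.
            first_token_at = clock() - start
        count += 1
    return RequestMetrics(count, first_token_at, clock() - start)


def estimate_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    usd_per_1k_prompt: float,
    usd_per_1k_completion: float,
) -> float:
    """Estimate request cost from token counts and placeholder prices."""
    return (
        prompt_tokens / 1000 * usd_per_1k_prompt
        + completion_tokens / 1000 * usd_per_1k_completion
    )
```

Passing the clock as a parameter keeps the timing logic testable with a fake clock; in real use the default time.perf_counter would apply, and token counts would more reliably come from the provider's reported usage than from counting chunks.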

What it does

MemMachine tracks and publishes detailed metrics for every LLM request, such as token count, time to first token, and total token usage. These metrics can be consumed in several industry-standard ways: they are directly exportable as Prometheus metrics (scraped or remote-written with no loss of granularity), natively instrumented for OpenTelemetry pipelines (aggregated and streamed over OTLP or pushed to an OpenTelemetry Collector), and available to be surfaced in front-end dashboards such as OpenWebUI.
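
To make the Prometheus path concrete, here is a minimal sketch of what a scraped metric looks like in the Prometheus text exposition format, built with the standard library only. The metric name llm_tokens_total, its labels, and the helper render_metric are assumptions for illustration; the metric names MemMachine actually exposes are not given here, and a real exporter would normally use a Prometheus client library instead of formatting strings by hand.

```python
from typing import Dict, List, Tuple, Union

Number = Union[int, float]
Sample = Tuple[Dict[str, str], Number]


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(
    name: str, metric_type: str, help_text: str, samples: List[Sample]
) -> str:
    """Render one metric family in the Prometheus text exposition format.

    Note: help_text is emitted as-is; this sketch does not escape it.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        # Sort labels so the output is deterministic.
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric name and label values, for illustration only.
    print(
        render_metric(
            "llm_tokens_total",
            "counter",
            "Tokens consumed by LLM requests.",
            [({"model": "example-model", "kind": "completion"}, 1520)],
        ),
        end="",
    )
```

Serving this text from an HTTP endpoint is what allows a Prometheus server to scrape it; labels such as model or kind are one possible way to keep per-model and per-direction granularity.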

Teams can integrate these observability signals into any infrastructure:

Use Prometheus to collect metrics as time series, then query, graph, and alert on them with PromQL.
Stream data through OpenTelemetry to other platforms for unified monitoring, tracing, and analytics.
Review live token usage and latency dashboards in OpenWebUI for both interactive debugging and executive reporting.

This flexible, standards-driven approach ensures that every AI agent and application team, from MLOps to product, can access actionable insights, optimize costs, and maintain performance wherever their monitoring stack resides.