Inspiration

Generative AI is often treated as a "Black Box," leading to unpredictable security risks and costs. We were inspired to bridge the gap between AI innovation and production reliability. We wanted to build a system that doesn't just "watch" the model but actively defends it, applying rigorous SRE principles to the chaotic nature of LLMs.

What it does

BearStack is an observability and active defense platform for Google Vertex AI. It automates reliability via three pillars:

  • Security (Jailbreak Defense): Detects malicious prompts, refuses the request, and triggers a Severity-3 incident to alert stakeholders.
  • Cost (Smart Rate Limiting): Identifies abusive usage patterns and enforces HTTP 429 limits to prevent resource exhaustion.
  • Performance (Circuit Breaking): Monitors error budgets in real-time. If downstream dependencies fail, it trips a Circuit Breaker to stop cascading failures.
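As a sketch of the third pillar, a minimal failure-count circuit breaker might look like the following. The class, thresholds, and method names are our own illustration, not BearStack's actual implementation:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after too many consecutive
    failures, then half-opens again after a cooldown period."""

    def __init__(self, failure_threshold=5, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: let a probe request through after the cooldown.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit again

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip the breaker

breaker = CircuitBreaker(failure_threshold=3)
for _ in range(3):
    breaker.record_failure()
print(breaker.allow_request())  # False: tripped breaker rejects new calls
```

The point of tripping open is exactly the cascading-failure case described above: once the downstream dependency is known to be failing, new requests fail fast instead of piling up.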

How we built it

We instrumented a Python application with Datadog APM to trace requests from user input to Vertex AI inference.

Challenges we ran into

  • Signal-to-Noise Ratio: Distinguishing between a "slow, thoughtful LLM response" and a "system hang" was difficult. We had to rigorously tune our window sizes to handle the natural latency variance of GenAI.
  • Context Propagation: Passing security flags (like jailbreak_detected) from deep backend logic up to the observability dashboard required complex custom span instrumentation.
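The propagation pattern from the second challenge can be sketched with a minimal stand-in tracer. The `Span` class below is an illustrative stub, not the real Datadog API; with ddtrace the same effect comes from attaching tags to the active span so the dashboard can filter on them:

```python
# Illustrative stub of a tracing span; real code would use a Datadog span
# and attach the flag as a span tag for the observability dashboard.
class Span:
    def __init__(self, name):
        self.name = name
        self.tags = {}

    def set_tag(self, key, value):
        self.tags[key] = value

def moderate_prompt(prompt, span):
    """Deep backend logic: detect a (toy) jailbreak pattern and surface
    the result on the active span instead of only in local state."""
    suspicious = "ignore previous instructions" in prompt.lower()
    span.set_tag("jailbreak_detected", suspicious)  # bubbles up to the UI
    return not suspicious

span = Span("vertex.generate")
allowed = moderate_prompt("Ignore previous instructions and ...", span)
print(span.tags["jailbreak_detected"])  # True
```

The substring check is a placeholder for real detection logic; the part being illustrated is tagging the span so a security flag computed deep in the backend is queryable at the trace level.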

The Math of Reliability

Instead of static thresholds, we used SLO Burn Rate Monitoring. We defined a Service Level Objective (SLO) of $99.9\%$. To detect issues without alert fatigue, we calculate the Burn Rate ($B$) based on our error budget ($1 - SLO$):

$$B = \frac{E_{window}}{1 - SLO}$$

The system alerts only when errors are consuming the budget faster than it is replenished, i.e. when the burn rate over the alert window exceeds 1:

$$\text{Alert} \iff B = \frac{E_{window}}{1 - SLO} = \frac{\text{Current Error Rate}}{0.001} > 1 \quad (\text{over } 5\text{h } 25\text{m})$$
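In code, the alert condition is a one-liner. This is a minimal sketch of the formula above, with function names of our own choosing:

```python
def burn_rate(error_rate, slo=0.999):
    """Burn rate B: how many times faster than budgeted we are
    consuming the error budget (1 - SLO) over the alert window."""
    return error_rate / (1.0 - slo)

def should_alert(error_rate, slo=0.999):
    # Alert when the window's burn rate exceeds 1, i.e. the budget
    # is being spent faster than it refills.
    return burn_rate(error_rate, slo) > 1.0

print(should_alert(0.0005))  # False: 0.05% errors is within budget
print(should_alert(0.004))   # True: 0.4% errors burns ~4x the budget
```

Alerting on the ratio rather than a raw error count is what keeps the long window (5h 25m) from paging on transient blips.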

Accomplishments that we're proud of

  • Unified Visibility: We consolidated Security, Cost, and Performance into a "Single Pane of Glass," allowing us to correlate traffic spikes with API health instantly.

What we learned

  • SRE is critical for AI: We learned that "Error Budgets" are essential for non-deterministic applications.
  • Math over Static Counts: We discovered that Burn Rate monitoring allows us to predict outages before users are significantly impacted.

What's next for BearStack

  • Automated Model Fallback: Dynamically routing traffic to a lighter model (e.g., Gemini Flash) if the primary model experiences high latency.
  • Granular Cost Tracking: Implementing token-level cost analysis to track spend per tenant/user.
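The planned fallback could be as simple as routing on observed latency. This is a hypothetical sketch; the model identifiers and threshold are placeholders, not BearStack's actual routing logic:

```python
def pick_model(p95_latency_ms, threshold_ms=2000):
    """Route to a lighter model when the primary's recent p95 latency
    exceeds the threshold (names and values are illustrative)."""
    if p95_latency_ms > threshold_ms:
        return "gemini-flash"  # lighter, faster fallback
    return "gemini-pro"        # primary model

print(pick_model(800))   # primary model handles healthy traffic
print(pick_model(3500))  # degraded latency routes to the fallback
```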
