COMPASS — Project Story
Inspiration
FedRAMP exists to protect the American public. But the process of getting authorized under it is so expensive and slow that most companies either give up or spend years in limbo. The average Authorization to Operate (ATO) takes 12 to 18 months and costs vendors anywhere from $250,000 to over $1,000,000 in consulting fees — most of it for work that is fundamentally repetitive: reading through control requirements, mapping them to system components, documenting gaps, and generating compliance artifacts.
We kept asking the same question: why is a human doing this?
NIST SP 800-53 Rev 5 is a published catalog. FIPS 199 is a deterministic scoring framework. OSCAL is a machine-readable format. The knowledge required to do a first-pass FedRAMP assessment is codified, structured, and finite. It is exactly the kind of domain where an AI agent should outperform a consultant — not by being smarter, but by being faster, more consistent, and always available.
The Gemini Live API made one more thing possible that no prior AI could do well: voice. A security architect shouldn't have to learn a tool. They should be able to describe their system the same way they'd describe it to a consultant in a meeting room — and have something listen, understand, and respond.
That's COMPASS.
What We Learned
Gemini Live API is genuinely different
We've worked with text-based LLMs before. The Live API is a qualitatively different experience — true bidirectional audio with interruption support, not a voice wrapper around a chat API. The latency is low enough that it feels like a real conversation. Getting that working required understanding PCM streaming at 16kHz capture and 24kHz playback, Web Audio API scheduling, and the nuances of the bidiGenerateContent protocol.
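The input format itself is simple once the streaming is in place: the Live API consumes raw 16-bit little-endian PCM. As a minimal sketch (our own helper name, not part of the SDK), converting normalized float samples to that wire format looks like:

```python
import struct

def float_to_pcm16(samples: list[float]) -> bytes:
    """Convert normalized float samples (-1.0..1.0) to 16-bit little-endian PCM,
    the raw sample format the Live API expects for input audio."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))      # clamp first, to avoid int16 overflow
        ints.append(int(s * 32767))
    return struct.pack(f"<{len(ints)}h", *ints)
```

In the browser this same conversion runs in TypeScript against the Float32 buffers the Web Audio API produces; the bytes are then sent over the WebSocket as binary frames.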
We also learned the hard way that not all Gemini models support bidiGenerateContent. Only the native audio models do — specifically gemini-2.5-flash-native-audio-latest. The standard models, including gemini-2.5-pro, do not. gemini-2.0-flash variants are now deprecated. This cost us several hours of debugging.
Another discovery: the receive() method returns a persistent async iterator that yields all responses for the entire session lifetime — there's no need to wrap it in a while True loop or restart it between turns. This simplified our server-side receive loop considerably.
OSCAL is harder than it looks
OSCAL (Open Security Controls Assessment Language) is the right answer for machine-readable compliance artifacts. But the schema is deeply nested and the relationships between components, controls, and implementation statements are non-trivial. We built a validator that checks structural completeness — specifically that every required UUID reference resolves, and that control statements are present for every applicable control. Getting the generator to produce valid OSCAL on the first try required careful prompt engineering and a schema-aware output loop.
ADK agents compose naturally with the Live API
Google's Agent Development Kit (ADK) gives you a clean way to define sub-agents with tool access. We structured COMPASS as four specialized agents — classifier, mapper, gap analyzer, OSCAL generator — each invokable as a function call from the Live API session. The orchestration is clean: Gemini decides which agent to call based on conversation context, the agent runs its tools, and results come back as structured data that the frontend renders in real time.
FedRAMP impact math
FIPS 199 defines system impact level as the maximum of confidentiality, integrity, and availability impact across all information types. Formally:
$$\text{Impact}(\text{system}) = \max\left(\bigcup_{i=1}^{n} \{C_i, I_i, A_i\}\right)$$
where each $C_i, I_i, A_i \in \{\text{Low}, \text{Moderate}, \text{High}\}$ for information type $i$, and the ordering is $\text{Low} < \text{Moderate} < \text{High}$.
The number of applicable controls scales sharply with impact level:
| Baseline | Controls |
|---|---|
| Low | ~125 |
| Moderate | ~325 |
| High | ~421 |
COMPASS computes this automatically from the data types the user describes. Hearing a system classified as High in real time — during the voice conversation — is one of the most striking moments in the demo.
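The high-water-mark rule reduces to a one-line max over every C/I/A rating. A minimal sketch (helper names and data shape are ours, not COMPASS's actual API; control counts are the approximate baseline sizes from the table above):

```python
LEVELS = {"Low": 0, "Moderate": 1, "High": 2}
BASELINE_CONTROLS = {"Low": 125, "Moderate": 325, "High": 421}  # approximate

def system_impact(info_types: list[dict]) -> str:
    """FIPS 199 high-water mark: the max C/I/A rating across all info types."""
    peak = max(
        LEVELS[t[axis]]
        for t in info_types
        for axis in ("confidentiality", "integrity", "availability")
    )
    return next(name for name, rank in LEVELS.items() if rank == peak)

types = [
    {"confidentiality": "Moderate", "integrity": "Low", "availability": "Low"},   # e.g. PII
    {"confidentiality": "High", "integrity": "Moderate", "availability": "Low"},  # e.g. FTI
]
level = system_impact(types)  # a single High rating drives the whole system to High
```

Note the asymmetry this creates: adding one High-confidentiality data type to an otherwise Low system roughly triples the number of applicable controls.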
How We Built It
Architecture
```
Browser (React + WebAudio)
  │ PCM 16kHz audio (binary frames)
  │ JSON events (transcript, phase_change, classification, …)
  ▼
FastAPI on Cloud Run (/ws/live)
  │ google-genai Live API (bidiGenerateContent)
  ▼
Gemini 2.5 Flash Native Audio
  │ function_calls
  ▼
ADK Sub-agents
  ├── classify_system → FIPS 199 impact scoring
  ├── search_controls → Vertex AI Vector Search (RAG over 800-53 Rev 5)
  ├── gap_analysis   → heuristic gap detection + remediation hints
  ├── generate_oscal → OSCAL 1.1.2 SSP / POA&M / Assessment Results
  ├── map_data_types → canonical data type tagging (PII, PHI, FTI, …)
  └── threat_lookup  → MITRE ATLAS AI/ML threat → control mapping
```
The backend is a single FastAPI application running on Cloud Run. The WebSocket endpoint opens a genai.Client.aio.live.connect() session per user, bidirectionally forwarding PCM audio and JSON events. Tool calls from Gemini are executed server-side and results returned as FunctionResponse frames.
The frontend is React + TypeScript (Vite), using the Web Audio API for capture and playback. Session state — phases, classification, controls, gaps, OSCAL docs — is managed in a single SessionContext that updates in real time as server events arrive. The three-column layout shows the live transcript on the left, the conversation in the center, and a structured data panel on the right that fills in as the assessment progresses.
Technology Stack
| Layer | Technology |
|---|---|
| Voice / AI | Gemini 2.5 Flash Native Audio, Google GenAI SDK |
| Agent orchestration | Google ADK (Agent Development Kit) |
| Backend | Python 3.12, FastAPI, WebSockets |
| Hosting | Google Cloud Run |
| Session state | Cloud Firestore |
| OSCAL artifacts | Cloud Storage |
| Control RAG | Vertex AI Vector Search + text-embedding-005 |
| Frontend | React 18, TypeScript, Vite, Tailwind CSS |
| IaC | Terraform |
| CI/CD | Cloud Build |
Retrieval-Augmented Control Mapping
The full NIST SP 800-53 Rev 5 catalog (1,189 controls and enhancements) is embedded with text-embedding-005 at 768 dimensions and stored in a Vertex AI Vector Search index. When a user describes a system component, we embed the description and retrieve the $k$ most semantically relevant controls:
$$\text{score}(q, c_i) = \frac{q \cdot c_i}{\lVert q \rVert \, \lVert c_i \rVert}$$
where $q$ is the query embedding and $c_i$ is control $i$'s embedding. The top-$k$ results (default $k=10$) are returned to Gemini with full control text, which then reasons about applicability and implementation status in the context of the broader conversation.
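Vertex AI Vector Search does this ranking at scale over the 1,189-vector index; the scoring itself is just cosine similarity. A toy reimplementation of the formula above (function names and the tiny in-memory index are illustrative only):

```python
import math

def cosine_score(q: list[float], c: list[float]) -> float:
    """Cosine similarity between a query embedding and a control embedding."""
    dot = sum(a * b for a, b in zip(q, c))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_c = math.sqrt(sum(b * b for b in c))
    return dot / (norm_q * norm_c)

def top_k(query: list[float], controls: dict[str, list[float]], k: int = 10) -> list[str]:
    """Rank control embeddings against the query and keep the k most similar."""
    ranked = sorted(controls.items(), key=lambda kv: cosine_score(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

In production the vectors are 768-dimensional text-embedding-005 outputs rather than the toy 2-vectors above, but the ranking logic is the same.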
Challenges
1. The Live API model wall
The biggest technical blocker: gemini-2.5-pro does not support bidiGenerateContent. Neither do the gemini-2.0-flash variants, which are now deprecated. We spent several sessions debugging 1007 (invalid argument) and 1008 (model not found) WebSocket close errors before discovering that only gemini-2.5-flash-native-audio-latest supports the Live API protocol.
Additionally, these native audio models don't accept response_modalities=["AUDIO", "TEXT"]. They require ["AUDIO"] only, with transcripts arriving separately via server_content.output_transcription. This is not prominently documented.
2. Vertex AI Live API availability
Our production GCP project (compass-fedramp) does not have Vertex AI Live API access — it's still in a restricted preview. We had to pivot to using the Developer API (generativelanguage.googleapis.com) via an API key from a separate hackathon-enabled project, and set GEMINI_USE_VERTEX=false for the live connection while keeping Vertex AI for the text model and vector search.
3. OSCAL schema fidelity
OSCAL's SSP schema requires deeply nested structures: system-security-plan → system-implementation → components → implemented-requirements → statements. Each control reference must include a valid UUID that resolves to a component defined elsewhere in the document. Getting the generator to produce structurally valid OSCAL on first attempt required implementing a two-pass approach: first extract all component UUIDs, then generate implementation statements that reference them.
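The validator's core is the reference-resolution check: collect the declared component UUIDs in one pass, then verify every by-component reference resolves in a second. A simplified sketch (field paths condensed from the OSCAL 1.1.2 SSP schema; the helper name is ours):

```python
def unresolved_refs(ssp: dict) -> list[str]:
    """Pass 1: collect declared component UUIDs.
    Pass 2: return every component reference that doesn't resolve."""
    root = ssp["system-security-plan"]
    known = {c["uuid"] for c in root["system-implementation"].get("components", [])}
    missing = []
    for req in root["control-implementation"]["implemented-requirements"]:
        for stmt in req.get("statements", []):
            for by in stmt.get("by-components", []):
                if by["component-uuid"] not in known:
                    missing.append(by["component-uuid"])
    return missing

# Minimal SSP fragment: one declared component, one dangling reference
ssp = {
    "system-security-plan": {
        "system-implementation": {"components": [{"uuid": "c-1"}]},
        "control-implementation": {"implemented-requirements": [
            {"statements": [{"by-components": [
                {"component-uuid": "c-1"},
                {"component-uuid": "c-404"},   # does not resolve
            ]}]},
        ]},
    }
}
missing = unresolved_refs(ssp)  # ["c-404"]
```

When this list is non-empty, the generator re-runs the statement pass with the known UUIDs injected into the prompt.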
4. Audio scheduling jitter
Initial testing showed audible glitches in COMPASS's audio output — gaps and pops between packets. The fix was proper Web Audio API scheduling: instead of playing each PCM chunk immediately at AudioContext.currentTime, we maintain a nextPlayTime cursor and schedule each buffer to start exactly where the previous one ends:
```js
// Never start before "now", but otherwise butt each buffer against the last
const startAt = Math.max(now, nextPlayTimeRef.current);
source.start(startAt);
nextPlayTimeRef.current = startAt + audioBuffer.duration;
```
This produces gapless playback regardless of network jitter.
5. VAD buffer not flushing — the silent session bug
One of the most confusing bugs we encountered: after the user stopped speaking, COMPASS would sometimes go silent indefinitely — no response, no timeout, no error. The root cause was Gemini's Voice Activity Detection (VAD) buffer holding onto audio it hadn't processed yet.
The fix was sending audio_stream_end=True via send_realtime_input() whenever the user pauses or stops speaking. This signals to Gemini that the current audio segment is complete and the VAD buffer should be flushed — causing it to respond. Critically, this does not permanently close the stream; the client can resume sending audio at any time after. Without this signal, Gemini waits indefinitely for more audio before generating a response.
This behavior isn't prominently documented and cost us significant debugging time.
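Deciding *when* to send that flush signal requires detecting the pause client- or server-side. A minimal energy-based sketch (threshold and window values are our assumptions, not documented constants; the caller invokes session.send_realtime_input(audio_stream_end=True) when feed() returns True):

```python
import struct

PAUSE_MS = 800   # trailing silence treated as end-of-utterance (assumption)
RMS_FLOOR = 500  # int16 RMS below this counts as silence (assumption)

def rms(pcm16: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian PCM chunk."""
    n = len(pcm16) // 2
    samples = struct.unpack(f"<{n}h", pcm16[: n * 2])
    return (sum(s * s for s in samples) / max(n, 1)) ** 0.5

class PauseDetector:
    """Tracks trailing silence; returns True exactly once per pause,
    signalling that audio_stream_end should be sent to flush the VAD buffer."""
    def __init__(self):
        self.silent_ms = 0
        self.signalled = False

    def feed(self, chunk: bytes, chunk_ms: int) -> bool:
        if rms(chunk) < RMS_FLOOR:
            self.silent_ms += chunk_ms
        else:
            self.silent_ms = 0
            self.signalled = False  # speech resumed; the stream stays open
        if self.silent_ms >= PAUSE_MS and not self.signalled:
            self.signalled = True
            return True
        return False
```

Because audio_stream_end doesn't close the stream, the detector simply resets when speech resumes and the same session keeps flowing.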
6. Firestore without composite indexes
Our list_sessions query originally used .order_by("updatedAt", direction="DESCENDING") combined with a where("userId", ...) filter — which requires a composite index in Firestore. Rather than wait for the index to build during the hackathon, we removed the server-side ordering and sort the results in Python after retrieval. Simple fix, zero downtime.
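The workaround in sketch form: keep only the single-field where() filter server-side (no composite index needed) and move the DESCENDING sort into Python. Here `docs` stands in for the dicts already fetched from query.stream(); the function name mirrors our endpoint but the shape is illustrative:

```python
def list_sessions(docs: list[dict], user_id: str) -> list[dict]:
    """Filter by userId (the part Firestore handles with a single-field index),
    then sort by updatedAt descending in memory instead of via order_by()."""
    mine = [d for d in docs if d["userId"] == user_id]
    return sorted(mine, key=lambda d: d["updatedAt"], reverse=True)

docs = [
    {"userId": "u1", "updatedAt": 2},
    {"userId": "u2", "updatedAt": 5},
    {"userId": "u1", "updatedAt": 9},
]
newest_first = list_sessions(docs, "u1")
```

This is fine at hackathon scale; with many sessions per user the composite index is the right long-term fix.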
What's Next
- Transcript fidelity — native audio transcription fields vary by response packet; building a more robust accumulation buffer
- Warm instances — set `--min-instances 1` on Cloud Run to eliminate cold-start latency for demos
- Vector Search population — the index is provisioned, but the full 800-53 Rev 5 corpus still needs to be embedded and uploaded; currently falling back to the LLM's training knowledge
- Multi-session continuity — resuming a prior assessment session mid-conversation
- Baseline scoping — improving the conversation flow to guide users through the full FedRAMP High baseline (421 controls) and progressively scope it down to Moderate or Low where applicable, making it more practical to reach a complete assessment across all impact levels
Built With
- cloud-build
- cloud-firestore
- cloud-storage
- fastapi
- google-agent-development-kit-(adk)
- google-cloud-run
- google-gemini-2.5-flash-native-audio
- google-genai-sdk
- python
- react-18
- tailwind-css
- terraform
- text-embedding-005
- typescript
- vertex-ai-vector-search
- vite
- websockets