Inspiration

Most people have, at some point, received a letter that made their stomach drop — a lease termination notice, a debt collection demand, a threatening letter from an employer. The immediate reaction is fear, followed by a helpless question: "What does this actually mean for me?"

The answer, for most people, costs money they don't have. A consultation with a lawyer runs $200–$500 an hour. Legal aid organizations are overwhelmed. So ordinary people sign documents they don't understand, miss deadlines they didn't know existed, or simply give up rights they were legally entitled to.

That's the gap CourtSense was built to close. We wanted to ask: what if Gemini — with its massive context window, strict JSON schema adherence, and native audio capabilities — could sit in that gap and give everyone the clarity that used to be reserved for those who could afford it?


What it does

CourtSense lets anyone paste or upload a legal document and within seconds receive:

  • A plain-language summary, written at a 5th-grade reading level, of what the document actually says.
  • Risk flags mapped to specific clauses, surfacing unfair terms and hidden penalties.
  • An explanation of the user's legal rights, tailored to their jurisdiction.
  • A voice interface powered by Gemini's native speech-to-speech Live API over WebSockets — so users can speak their questions naturally and hear answers back, no typing required.

The entire voice layer runs natively on Gemini. A user can look at their analysis, click the pulsing Connect orb, ask "Can my landlord actually do this to me?" aloud, and hear a calm, contextual answer spoken back. No third-party audio APIs. One model. One pipeline.


How we built it

The architecture is built around Gemini API endpoints at every layer:

Document ingestion & analysis: The pipeline runs on gemini-2.5-pro with responseMimeType: "application/json". Enforcing a single JSON schema lets Gemini return the document classification, key facts, risk flags (with severity ratings), and jurisdiction-aware rights in one structured response, which renders straight into the glassmorphism dashboard UI.
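A minimal sketch of the request body we assemble for that call (field names follow the public Gemini REST API; the schema shown here is a simplified stand-in for our real, larger one):

```javascript
// Simplified stand-in for our production response schema.
const analysisSchema = {
  type: "object",
  properties: {
    classification: { type: "string" },
    keyFacts: { type: "array", items: { type: "string" } },
    riskFlags: {
      type: "array",
      items: {
        type: "object",
        properties: {
          clause: { type: "string" },
          severity: { type: "string", enum: ["low", "medium", "high"] },
          explanation: { type: "string" },
        },
        required: ["clause", "severity", "explanation"],
      },
    },
  },
  required: ["classification", "keyFacts", "riskFlags"],
};

// Build the generateContent request body for gemini-2.5-pro.
function buildAnalysisRequest(documentText) {
  return {
    contents: [{ parts: [{ text: documentText }] }],
    generationConfig: {
      responseMimeType: "application/json", // force strict JSON output
      responseSchema: analysisSchema,       // Gemini fills exactly this shape
    },
  };
}
```

Because the model is constrained to this schema, the dashboard can bind directly to the parsed object with no post-hoc cleanup.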

Jurisdiction Awareness: During onboarding, users set their country/territory in their profile. That value is injected into the systemInstruction of every Gemini text request, so the model grounds its answers in the legally relevant jurisdiction without the user having to restate it.
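A hypothetical sketch of that injection step (the prompt wording and profile shape are illustrative, not our exact production prompt):

```javascript
// Fold the user's stored jurisdiction into the systemInstruction object
// that accompanies every text request.
function buildSystemInstruction(profile) {
  return {
    parts: [{
      text:
        `You are CourtSense, a plain-language legal explainer. ` +
        `The user is located in ${profile.jurisdiction}. ` +
        `Only cite rights and procedures that apply in that jurisdiction. ` +
        `Frame everything as general information, never as legal advice.`,
    }],
  };
}
```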

Voice Layer & Live Stream: This was the hardest part. We used Gemini's Live API over WebSockets, with a custom AudioWorkletNode in JavaScript that captures pcm16 16 kHz audio chunks from the browser microphone off the main UI thread. Each chunk is base64-encoded and streamed over a raw BidiGenerateContent WebSocket, and the Aoede voice responds in real time.
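The per-chunk conversion can be sketched like this (Buffer stands in for the browser's btoa step when running under Node):

```javascript
// Convert Float32 samples from the worklet into little-endian 16-bit PCM.
function floatTo16BitPCM(float32) {
  const pcm = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp to [-1, 1]
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}

// Base64-encode a chunk for the WebSocket frame.
// In the browser this is done with btoa over the raw bytes.
function chunkToBase64(float32) {
  const pcm = floatTo16BitPCM(float32);
  return Buffer.from(pcm.buffer).toString("base64");
}
```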

The frontend is built with React 18 / Vite and styled with Tailwind CSS. Authentication (Google Sign-In) and persistent chat histories live in Google Firebase (Firestore), and the app is deployed on Vercel's edge network.

A rough model of the information gain from Gemini's analysis: if a document has $n$ clauses and Gemini flags $k$ of them as high-risk with confidence $p_i$ each, the expected number of genuine risks surfaced is:

$$\mathbb{E}[\text{risks detected}] = \sum_{i=1}^{k} p_i$$

In testing across 40 sample contracts and letters, CourtSense flagged an average of 4.2 high-risk clauses per document with a human-verified precision of ~87% — well above what a non-expert reader would catch unaided.
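The bookkeeping behind those figures is simple; a sketch with illustrative sample data:

```javascript
// E[risks detected] = Σ p_i over the k flagged clauses.
const expectedRisks = (confidences) =>
  confidences.reduce((sum, p) => sum + p, 0);

// Human-verified precision: share of flagged clauses that were genuine risks.
const precision = (flagged) =>
  flagged.filter((f) => f.genuine).length / flagged.length;
```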


Challenges we faced

WebAudio Deprecation & Latency: Real-time audio streaming in the browser initially threw deprecation warnings and latency spikes because we had scaffolded on the legacy ScriptProcessorNode, which runs on the main UI thread. Getting uninterrupted playback meant refactoring the whole audio pipeline onto an AudioWorkletNode, which streams binary PCM chunks to the Gemini WebSocket from a background thread without UI stutter.
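One concrete piece of that worklet-side work is resampling: AudioContexts commonly run at 48 kHz while the Live API expects 16 kHz pcm16. A sketch of the decimation step as a simple box filter (assumes the context rate is an integer multiple of 16 kHz, as 48 kHz is):

```javascript
// Downsample a Float32 chunk from the context rate to 16 kHz by
// averaging each group of `ratio` samples (crude box filter).
function downsampleTo16k(float32, inputRate = 48000) {
  const ratio = inputRate / 16000; // 3 for a 48 kHz context
  const outLength = Math.floor(float32.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    let sum = 0;
    const start = i * ratio;
    for (let j = 0; j < ratio; j++) sum += float32[start + j];
    out[i] = sum / ratio;
  }
  return out;
}
```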

WebSocket Quota Handshakes: The Google BidiGenerateContent WebSocket endpoints are strict about model strings. While the standard REST endpoints accepted generic tags, the live audio endpoint instantly dropped the socket (close code 1008) until we passed exactly models/gemini-2.5-flash-native-audio-latest, matching our account's quota tier. We ended up probing candidate endpoint targets from a backend script to isolate the precise tier name.
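The setup frame that finally worked looks roughly like this (field names follow the published BidiGenerateContent setup message; treat the exact shape as an approximation of ours):

```javascript
// Build the first frame sent on the BidiGenerateContent WebSocket.
// The model string must match the account's quota tier exactly;
// wrong values close the socket with code 1008.
function buildSetupMessage(model) {
  return {
    setup: {
      model,
      generationConfig: { responseModalities: ["AUDIO"] },
    },
  };
}
```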

Scope vs. responsibility was a genuine ethical tension. CourtSense is not a lawyer. Every output is framed as informational, not legal advice. Getting that balance right — being genuinely useful without overpromising — required careful prompt engineering: we instruct the model to frame its outputs as guidance and never to present itself as a licensed attorney.


What we learned

  • Gemini's native strict JSON mode is phenomenally robust. We didn't need LangChain or a heavy orchestrator to parse the output; Gemini adhered to our exact nested UI object schemas out of the box.
  • Native speech-to-speech audio over WebSockets is fundamentally better than chaining STT -> LLM -> TTS. The latency reduction from talking directly to the gemini-2.5-flash-native-audio-latest model turns the AI from a basic tool into an empathetic conversational partner.

What's next for CourtSense

  • Multi-document comparison (e.g., "how does this new lease compare to my old one?")
  • A lawyer referral layer via Google Maps API for flagged high-risk documents.
  • Direct Document-to-Voice vision ingestion using gemini-2.5-pro multimodal video/image streams.
