SHIK Live

The problem

Every AI agent today has amnesia. When a session ends, the agent that "knew" you is gone. Start a new conversation and you're talking to a stranger wearing the same face. Whatever sense of identity that agent had was an accident of temporary state: prompt tokens that evaporate when the context window closes.

This isn't just an inconvenience. As we move toward a world with millions of AI agents negotiating, collaborating, and making decisions on behalf of people, the inability to maintain a coherent identity across sessions, models, and devices becomes a fundamental infrastructure problem. You can't build trust with an entity that forgets who it is.

What SHIK Live does

SHIK Live is a real-time multimodal agent with a visible identity kernel: a persistent layer that maintains memory, context, communication style, and continuity state across sessions, independent of the underlying model.

You talk to SHIK through your microphone. It sees through your camera. It responds by voice in real time. You can interrupt it mid-sentence and it adapts. So far, that sounds like any Live API demo.

Here's what's different: while Gemini handles the conversation, a separate Identity Kernel captures who you are and what matters. The right panel of the UI shows this kernel updating live: core memories being stored, session context building, continuity state tracking. When you end the session and start a new one, the kernel feeds your history back into Gemini's instructions. The agent greets you like a colleague, not a stranger.

The architectural separation is the point. Gemini is the cognition: temporary, per-session. The Identity Kernel is the continuity: persistent, cross-session. They are explicitly decoupled. That's the thesis made visible.

Inspiration

SHIK Live is a prototype implementation of concepts from my PhD research on Self-Hosted Identity Kernels for Multi-Agent Systems (Kingsley, 2025). The paper proposes SHIK as a minimal architectural substrate that maintains a persistent, portable self-model for an artificial agent across models, devices, and networks.

The core insight: most current agent frameworks treat identity as an accidental byproduct of transient model state. SHIK argues it should be an explicit architectural layer: a small, persistent kernel that any cognition engine can read from and write to. Swap the model, change the hardware, migrate between clouds. The agent stays the same agent.

The paper formalizes this with an identity tuple $I = \langle id, K, P, M, H \rangle$ representing the agent's stable identifier, cryptographic keys, policies/values, long-term memory, and interaction history. Two running processes on different nodes with different models can be considered the "same agent" if their identity states satisfy a formal equivalence relation across allowed transformations.
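As a rough illustration only, the tuple could be written down as a TypeScript type. The field names and encodings below are assumptions made for this sketch, and `sameAgentStandIn` is a deliberately simplified placeholder: it compares only the stable identifier and the key set, and does not reproduce the paper's formal equivalence relation.

```typescript
// Hypothetical encoding of the identity tuple I = <id, K, P, M, H>.
// Field names and types are illustrative assumptions, not the paper's schema.
interface IdentityState {
  id: string;                       // id: stable identifier
  keys: string[];                   // K: public key fingerprints (assumed encoding)
  policies: Record<string, string>; // P: policies/values
  memory: string[];                 // M: long-term memory
  history: string[];                // H: interaction history
}

// Simplified stand-in for the equivalence check: two states are treated as the
// "same agent" when they share the identifier and at least one key.
// The paper's relation over allowed transformations is NOT modeled here.
function sameAgentStandIn(a: IdentityState, b: IdentityState): boolean {
  if (a.id !== b.id) return false;
  return a.keys.some((k) => b.keys.indexOf(k) !== -1);
}
```

Under this stand-in, two processes running different models would still compare equal as long as the identifier and a key match, which is the intuition the tuple is meant to capture.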

The hackathon was the opportunity to take this from theory to a working, visible prototype.

How we built it

Frontend: Next.js + React + Tailwind CSS. Three-panel layout: Live Interaction (transcript), Visual Context (camera/screen), and the Identity Kernel (memory, context, continuity state). A bottom event log shows kernel operations in real time.

Voice pipeline: The browser captures mic audio as 16kHz PCM16 and streams it over a WebSocket directly to Gemini 2.0 Flash via the Live API. Gemini responds with 24kHz audio played through a separate AudioContext. The pipeline runs at two sample rates (16kHz capture, 24kHz playback), with Float32/Int16 conversion at each boundary.
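The Float32/Int16 boundary conversion can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names are hypothetical, and the scaling convention (negative samples scaled by 32768, positive by 32767) is one common choice among several.

```typescript
// Convert Web Audio samples (Float32, nominally in [-1, 1]) to PCM16 for upload.
// Out-of-range samples are clamped so they cannot wrap around.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}

// Convert received PCM16 back to Float32 for playback.
function pcm16ToFloat(input: Int16Array): Float32Array {
  const out = new Float32Array(input.length);
  for (let i = 0; i < input.length; i++) {
    out[i] = input[i] / 0x8000;
  }
  return out;
}
```

Neither function resamples. Both assume the audio is already at the right rate (16kHz on the way out, 24kHz on the way in).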

Authentication: Ephemeral tokens minted server-side via a Next.js API route. The real API key never touches the browser. Tokens expire in 30 minutes with single-use session binding.

Identity Kernel extraction: After each conversation turn completes, a separate Gemini Flash call extracts structured data from the transcript: core memories (persistent facts about the user), session context (transient observations), communication style profile (formality, technical depth, pace), and topic tracking. Extraction uses explicit confidence thresholds (≥0.7 for core memory) and negative examples to prevent kernel bloat.
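A minimal sketch of the threshold routing, assuming the extraction call returns observations with a numeric confidence. The `Observation` shape and function name are hypothetical; only the 0.7 cutoff comes from the description above.

```typescript
// Assumed shape of one extracted observation (not the project's actual schema).
interface Observation {
  text: string;
  confidence: number; // expected in [0, 1]
}

interface RoutedObservations {
  coreMemory: Observation[];     // persistent facts
  sessionContext: Observation[]; // transient observations
}

const CORE_MEMORY_THRESHOLD = 0.7;

// Observations at or above the threshold become core memories;
// everything else stays in transient session context.
function routeObservations(
  observations: Observation[],
  threshold: number = CORE_MEMORY_THRESHOLD
): RoutedObservations {
  const routed: RoutedObservations = { coreMemory: [], sessionContext: [] };
  for (const o of observations) {
    if (o.confidence >= threshold) {
      routed.coreMemory.push(o);
    } else {
      routed.sessionContext.push(o);
    }
  }
  return routed;
}
```

Keeping the threshold as a parameter makes it easy to experiment with stricter cutoffs without touching the routing logic.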

Persistence: Cloud Firestore stores four collections: core memory, session context, events, and style profiles. On session start, existing memories and style data load from Firestore and inject into the Gemini Live system prompt, giving the agent immediate context.
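The injection step can be sketched as a pure function that takes already-loaded memories and style data and assembles the system prompt text. The function name, prompt wording, and `StyleProfile` fields are assumptions for illustration; the Firestore reads are left out.

```typescript
// Assumed style profile shape, based on the three dimensions named above.
interface StyleProfile {
  formality: string;
  technicalDepth: string;
  pace: string;
}

// Build the session's system instruction from a base prompt plus kernel data.
// Wording is illustrative; the project's real prompt is not reproduced here.
function buildSystemInstruction(
  base: string,
  coreMemories: string[],
  style?: StyleProfile
): string {
  const sections: string[] = [base.trim()];
  if (coreMemories.length > 0) {
    const lines = coreMemories.map((m) => "- " + m);
    sections.push("What you already know about this user:\n" + lines.join("\n"));
  }
  if (style) {
    sections.push(
      "Preferred communication style: formality=" + style.formality +
        ", technical depth=" + style.technicalDepth +
        ", pace=" + style.pace
    );
  }
  sections.push(
    "Refer to this knowledge naturally. Do not announce that you are recalling or storing memories."
  );
  return sections.join("\n\n");
}
```

A first-time user with no stored memories simply gets the base prompt plus the closing guidance, so the same code path serves both new and returning sessions.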

The key design decision: the agent doesn't "perform" memory. It doesn't say "according to my records" or "I recall from our last session." It just knows, the way a friend knows your name. The system prompt explicitly instructs natural reference over meta-commentary. Identity should be an invisible substrate, not a performance.

Architecture

Browser (Voice I/O, Camera, Kernel UI)
    ↕ WebSocket (audio + vision)
Gemini 2.0 Flash Live API (real-time reasoning)
    → Transcript
Identity Kernel Processor (Gemini Flash extraction)
    ↕
Cloud Firestore (core memory, context, events, style)
    → Memory injection on next session start → Gemini

All hosted on Google Cloud Run

Gemini handles cognition. The kernel handles continuity. They are architecturally separate; that's the SHIK thesis.

Challenges

Audio sample rate mismatch. The browser's default AudioContext runs at 44.1kHz or 48kHz, but Gemini expects 16kHz input and outputs 24kHz. Sending mismatched audio produces silence or garbled playback. The fix: explicitly create the capture AudioContext at 16kHz and the playback context at 24kHz.
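A sketch of that fix, assuming the standard Web Audio `AudioContext` constructor and its `sampleRate` option. The helper name is hypothetical, and the constructor is passed in as a parameter only so the snippet can also run outside a browser.

```typescript
// Minimal constructor shape: anything that accepts a { sampleRate } option.
interface SampleRateOptions {
  sampleRate: number;
}
type ContextCtor<T> = new (options: SampleRateOptions) => T;

const CAPTURE_RATE_HZ = 16000;  // rate the model expects for input
const PLAYBACK_RATE_HZ = 24000; // rate of the audio the model returns

// Create two contexts with explicit rates instead of relying on the
// device default (typically 44.1kHz or 48kHz).
function createAudioContexts<T>(Ctor: ContextCtor<T>): { capture: T; playback: T } {
  return {
    capture: new Ctor({ sampleRate: CAPTURE_RATE_HZ }),
    playback: new Ctor({ sampleRate: PLAYBACK_RATE_HZ }),
  };
}

// Browser usage (not executed here):
//   const { capture, playback } = createAudioContexts(AudioContext);
```

Browsers may reject sample rates they cannot support, so production code would likely wrap the construction in error handling.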

Extraction selectivity. Early versions of the kernel extraction prompt stored everything: greetings, filler phrases, the agent's own statements reflected back. The kernel bloated with noise. We added explicit negative examples ("DO NOT extract: greetings, weather comments, filler, restatements of existing memories, anything the agent said") and a confidence threshold that routes uncertain observations to transient session context rather than permanent core memory.
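One way to assemble such a prompt is sketched below. Only the "DO NOT extract" list is taken from the text above; the surrounding instructions, the function name, and the layout are assumptions, not the project's actual extraction prompt.

```typescript
// Negative examples quoted from the write-up.
const NEGATIVE_EXAMPLES: string[] = [
  "greetings",
  "weather comments",
  "filler",
  "restatements of existing memories",
  "anything the agent said",
];

// Build an extraction prompt that lists what to skip and shows the model
// the memories it already has, so it can avoid restating them.
function buildExtractionPrompt(transcript: string, existingMemories: string[]): string {
  const known =
    existingMemories.length > 0
      ? existingMemories.map((m) => "- " + m).join("\n")
      : "(none)";
  return [
    "Extract durable facts about the user from the transcript below.",
    "Give each observation a confidence between 0 and 1.",
    "DO NOT extract: " + NEGATIVE_EXAMPLES.join(", ") + ".",
    "Existing memories (do not restate):",
    known,
    "Transcript:",
    transcript,
  ].join("\n");
}
```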

Mid-session injection. The Gemini Live API doesn't allow updating the system prompt once a session is active. We solved this with a two-tier approach: full Firestore injection on session start (strong context), and lightweight text nudges via sendRealtimeInput during the session (incremental updates).
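The in-session tier can be sketched as below. The `RealtimeTextSink` interface is a hypothetical stand-in for the live session object: it assumes `sendRealtimeInput` accepts an object with a `text` field, which should be checked against the SDK version in use. The `[context update]` prefix is likewise an invented convention.

```typescript
// Assumed minimal surface of the live session (verify against the real SDK).
interface RealtimeTextSink {
  sendRealtimeInput(params: { text: string }): void;
}

// Send newly extracted facts as a short text nudge during an active session.
// Returns false when there is nothing worth sending.
function sendMemoryNudge(session: RealtimeTextSink, newFacts: string[]): boolean {
  const facts = newFacts.map((f) => f.trim()).filter((f) => f.length > 0);
  if (facts.length === 0) return false;
  session.sendRealtimeInput({ text: "[context update] " + facts.join("; ") });
  return true;
}
```

Nudges stay short on purpose: the full Firestore injection at session start carries the strong context, while these messages only add increments.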

The "performing memory" problem. Early system prompts produced an agent that constantly announced it was remembering things: "I'm adding that to my memory!" This breaks immersion. The fix was explicit instruction with good/bad examples that teach the model to reference knowledge naturally rather than narrate its own cognitive process.

What we learned

Identity is not memory. Memory is one component of identity, but the kernel also needs to capture communication style, interaction patterns, and relational context. An agent that remembers your name but talks to you like a stranger hasn't achieved continuity.

The extraction loop is the hardest design problem. Not technically, since a second LLM call is straightforward. The hard part is deciding what's worth remembering. The kernel must be selective or it becomes a chat log. The confidence threshold and negative examples were more important than any architectural decision.

Separating cognition from identity is architecturally simple but conceptually powerful. Once you build the separation, swapping the reasoning engine becomes trivial. The identity kernel doesn't care if it's talking to Gemini, Claude, or a future model that doesn't exist yet.

What's next

SHIK Live is one slice of the full SHIK architecture described in the paper. Future directions include:

Document ingestion: seeding the kernel with markdown, PDFs, and research materials so the agent arrives pre-informed
Cross-model portability: demonstrating the same identity kernel connected to different LLMs, proving model independence
Inter-agent identity: SHIK handshake protocols allowing agents to recognize and build trust with each other across encounters
Self-hosted deployment: running the kernel on edge devices (Raspberry Pi, Jetson Nano) under user control, as proposed in the paper
Style adaptation over time: the kernel already captures communication style; next is adaptive response matching that improves with each session

Built with

audio
cloud-firestore
gemini-2.0-flash
google-cloud-run
google-genai-sdk
next.js
react
tailwind-css
typescript
web
websockets

Submitted to

Gemini Live Agent Challenge

Created by

Designed the SHIK architecture based on my thesis for potential PhD research on Self-Hosted Identity Kernels. Built the system prompt engineering, identity kernel extraction logic, communication style profiling, and memory injection pipeline. Directed the full-stack implementation (Next.js, Gemini Live API, Firestore). Produced the architecture diagram, demo, and project documentation.

Susan Kingsley
James Kingsley