1. Abstract – What & Why
Working title: “LocaLens – Own Your Spatial Memory”
- Problem statement: AR glasses and phones are beginning to record much of what users see and do, and today that data is typically streamed to cloud platforms (Meta, Google, Snap), creating serious surveillance and lock‑in risk. Users lack both technical and legal control over these “life logs” and the spatial maps built from them.[3][4]
- Why on‑device AI:
  - Sensitive spatial data (home interiors, children, routines, locations) should never leave the device unencrypted; on‑device models enable semantic understanding without exposing raw video, audio, or location to a server.[5]
  - Running SLMs locally removes per‑token costs and network round‑trip latency, making continuous AR assistance and journaling usable even on poor networks or in the field.[6][1]
- Vision / UX:
  - User wears AR glasses or uses a phone camera. LocaLens quietly turns what they see and say into a private, searchable “spatial journal” of rooms, objects, people, and events.
  - The app feels instant (sub‑second responses), private (everything processed locally, with optional encrypted backup), and resilient (works the same in Port Harcourt, a subway, or a disaster zone).
2. Architecture – How It Works
You can describe a mobile-first pipeline, compatible with RunAnywhere’s SDK support for llama.cpp / whisper.cpp models on iOS/Android.[2][1]
High-level data flow (narrative diagram):
- Sensor input:
  - Video frames + IMU + GPS from phone/AR glasses.
  - Optional microphone for voice notes / commands.
- Local perception:
  - Lightweight CV stack (e.g., on‑device object detection + SLAM) creates a local “spatial anchors” map (rooms, surfaces, objects).
- Local understanding (SLMs):
  - Whisper‑style model for speech‑to‑text (STT) running via RunAnywhere.[2]
  - Small Llama / DeepSeek model to:
    - Summarize scenes and events (“meeting with Chika in the lab about funding”).
    - Generate semantic tags and Q&A embeddings for later retrieval.
- Storage & ownership layer:
  - Encrypted local vector store + key‑value DB for:
    - Spatial anchors (3D positions), text summaries, and event metadata.
  - Optional “ownership proofs” as small blockchain‑compatible payloads (e.g., Merkle root of a day’s journal) that can be published later when connectivity exists (not needed for base UX).
- User interface:
  - AR overlay or 2D mobile UI:
    - “What did I discuss with Dr. Amina the last time I was in this ward?”
    - “Show me all notes about the red generator in the back room.”
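The “ownership proof” in the storage layer can be illustrated with a minimal Python sketch of a Merkle root over one day's journal entries. The entry fields (`ts`, `anchor`, `summary`), the sample values, and the leaf/node prefix bytes are assumptions made for illustration, not part of any LocaLens or RunAnywhere API. Only the resulting root hash would be published; the entries themselves stay on the device.

```python
import hashlib
import json


def leaf_hash(entry: dict) -> bytes:
    """Hash one journal entry; canonical JSON keeps the hash stable across key order."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # A 0x00 prefix marks a leaf and 0x01 an inner node, so the two cannot be confused.
    return hashlib.sha256(b"\x00" + payload).digest()


def merkle_root(entries: list[dict]) -> str:
    """Return the hex Merkle root of the entries (hash of empty input if there are none)."""
    if not entries:
        return hashlib.sha256(b"").hexdigest()
    level = [leaf_hash(e) for e in entries]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


# Made-up sample entries for one day.
day = [
    {"ts": "2025-01-10T09:00:00Z", "anchor": "lab-bench-1", "summary": "Meeting about funding"},
    {"ts": "2025-01-10T14:30:00Z", "anchor": "back-room", "summary": "Checked the red generator"},
    {"ts": "2025-01-10T17:05:00Z", "anchor": "ward-3", "summary": "Follow-up notes"},
]
root = merkle_root(day)
print(root)  # 64 hex characters
```

Changing any field of any entry changes the root, which is what makes the published value usable for later integrity checks. Duplicating the last node on odd levels is the simplest padding rule; a production design would likely follow a specified scheme instead.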
How RunAnywhere SDK fits in:
- Use RunAnywhere to deploy and update quantized Whisper and Llama/DeepSeek models to devices, auto‑tuned to their hardware.[1][6]
- All inference is on‑device; RunAnywhere’s “privacy‑first architecture” and memory management simplify running SLMs in a mobile footprint.[2]
- Optional: use its analytics locally to show the user performance stats, not to send their prompts to a server.
In your diagram slide, label nodes like: Camera → Local CV → RunAnywhere Core → On‑device SLM → Encrypted DB → (Optional) Blockchain anchor when online.
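The node chain in that diagram can also be written down as a small Python sketch. Every stage here is a stand-in callable chosen for illustration: the names `detect_anchor`, `transcribe`, `summarize`, and `store` are hypothetical and do not correspond to RunAnywhere SDK functions. The point is only that each hop is a local function call and nothing crosses the network.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class JournalEntry:
    anchor_id: str    # spatial anchor produced by the local CV stage
    transcript: str   # speech-to-text output
    summary: str      # SLM summary of the event


@dataclass
class Pipeline:
    """Chains the on-device stages; every stage is a plain local callable."""
    detect_anchor: Callable[[bytes], str]
    transcribe: Callable[[bytes], str]
    summarize: Callable[[str], str]
    store: Callable[[JournalEntry], None]

    def process(self, frame: bytes, audio: bytes) -> JournalEntry:
        anchor_id = self.detect_anchor(frame)   # Camera -> Local CV
        transcript = self.transcribe(audio)     # Microphone -> on-device STT
        summary = self.summarize(transcript)    # Transcript -> on-device SLM
        entry = JournalEntry(anchor_id, transcript, summary)
        self.store(entry)                       # -> local (encrypted) DB
        return entry


# Stand-in stages so the sketch runs without any model or SDK installed.
db: list[JournalEntry] = []
pipeline = Pipeline(
    detect_anchor=lambda frame: "back-room",
    transcribe=lambda audio: "The red generator needs a new fuel filter.",
    summarize=lambda text: text.split(".")[0][:60],
    store=db.append,
)
entry = pipeline.process(frame=b"\x00", audio=b"\x00")
print(entry.summary)
```

In a real build, each lambda would be replaced by the corresponding on-device model call, and the optional blockchain anchor would sit behind `store` as a deferred step that runs only when connectivity returns.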
3. SLM Strategy – The Brains
Position your model choices as a deliberate tradeoff between speed, footprint, and reasoning.
- Models:
  - STT: On-device Whisper‑derived model for fast speech recognition; RunAnywhere explicitly supports Whisper-style models via whisper.cpp and similar frameworks.[1][2]
  - Primary SLM:
    - Llama‑3.2‑3B (quantized) for mobile: documented to run with less memory, making it suitable for phones and edge devices while still handling summarization and conversational queries.[7]
    - Optionally, DeepSeek‑R1‑Distill (small) or similar distilled variant when better reasoning is needed (e.g., interpreting complex multi‑step tasks). Distilled DeepSeek models target stronger reasoning in smaller sizes than full V3‑class models, which are too heavy for typical phones.[8][7]
- Feasibility (what to say):
  - Quantized 3B‑class Llama models (e.g., 4‑bit GGUF via llama.cpp) can run within the ~4–6 GB RAM envelopes typical of mid‑range devices, especially with aggressive context limits.[7]
  - RunAnywhere adds device‑aware optimization and smart memory management, making it feasible to deploy multiple models (Whisper + SLM) while keeping time to first token (TTFT) within ~80 ms on capable devices.[6][2]
  - All heavy CV (detection + SLAM) can leverage hardware accelerators (Neural Engine / GPU) outside the SLM memory budget.
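The RAM claim above can be sanity-checked with back-of-the-envelope arithmetic, sketched here in Python. All figures are assumptions for a 3B-class model and should be checked against the actual model card and quantization: roughly 3.2 billion parameters, about 4.5 effective bits per weight for a 4-bit GGUF file, 28 layers with 8 key/value heads of dimension 128, a 16-bit KV cache, and a 2048-token context. Activation buffers and runtime overhead are deliberately left out.

```python
def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Keys and values (hence the factor 2) for every layer and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Assumed figures for a 3B-class model; verify against the real model card.
N_PARAMS = 3.2e9
BITS_PER_WEIGHT = 4.5   # rough effective rate for a 4-bit GGUF quantization
N_LAYERS, N_KV_HEADS, HEAD_DIM = 28, 8, 128
CONTEXT_LEN = 2048      # an "aggressive context limit"

GB = 1e9
weights_gb = weights_bytes(N_PARAMS, BITS_PER_WEIGHT) / GB
kv_gb = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT_LEN) / GB
total_gb = weights_gb + kv_gb
print(f"weights ~{weights_gb:.2f} GB, KV cache ~{kv_gb:.2f} GB, total ~{total_gb:.2f} GB")
```

Under these assumptions the language model needs roughly 2 GB, which leaves headroom inside a 4–6 GB envelope for the speech model, the runtime, and the operating system. Doubling the context length doubles only the KV-cache term, which is why a tight context limit matters on phones.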
You can add a small table in your deck like:
| Component | Model | Reason |
|---|---|---|
| Speech | Whisper-small | Fast STT offline |
| Language core | Llama‑3.2‑3B Q4 | Mobile‑friendly |
| Reasoning mode | DeepSeek‑R1‑distill | Deeper reasoning |
4. Offline Scenario – When It “Saves the Day”
Connect directly to the hackathon’s “Offline Edge” and “True Privacy” themes.[5]
Scenario example – Rural medical worker with AR glasses:
- A community health worker in a rural area with no reliable internet uses LocaLens on a phone plus low‑cost smart glasses.
- As they walk through homes and clinics, the app:
  - Recognizes locations and equipment (beds, oxygen cylinders, vaccine fridges) and attaches structured notes to each spatial anchor.
  - Transcribes patient interviews locally and summarizes symptoms and treatments, without ever sending audio or text to the cloud.[5]
- When a repeat patient arrives months later, the worker can say: “Show me what I did for Blessing in this room last time,” and immediately sees summarized notes and warnings in AR, all retrieved from local encrypted storage.
- If the device briefly regains connectivity, the worker (or their NGO) can optionally anchor hashes of anonymized records on a lightweight chain for integrity audits, without exposing patient data.
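The “in this room last time” query in the scenario amounts to a filter on the current spatial anchor followed by a similarity ranking over stored summaries. The Python sketch below shows that shape under loud assumptions: `embed` is a toy bag-of-words counter standing in for whatever on-device embedding model is used, the note fields and sample notes are made up, and decryption of the local store is omitted.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector standing in for an on-device embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(notes: list[dict], query: str, anchor_id: str, top_k: int = 3) -> list[dict]:
    """Keep notes attached to the current spatial anchor, then rank them by similarity."""
    q = embed(query)
    here = [n for n in notes if n["anchor"] == anchor_id]
    ranked = sorted(here, key=lambda n: cosine(q, embed(n["summary"])), reverse=True)
    return ranked[:top_k]


# Made-up notes; in the app these would come from the encrypted local store.
notes = [
    {"anchor": "clinic-room-2", "summary": "Blessing: visit summary with symptoms and treatment"},
    {"anchor": "clinic-room-2", "summary": "Vaccine fridge check"},
    {"anchor": "home-14", "summary": "Blessing: home follow-up visit"},
]
hits = search(notes, "what did I do for Blessing last time", anchor_id="clinic-room-2")
print(hits[0]["summary"])
```

Filtering by anchor first keeps the candidate set small, so the ranking step stays cheap even on a phone. Notes recorded in other rooms, such as the home visit here, never enter the ranking.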
Why this is impossible / inferior in the cloud:
- Continuous video + audio streaming from AR glasses over weak networks is unreliable and often impossible in rural areas.
- Cloud workflows require transmitting sensitive patient data, clashing with privacy laws and raising serious trust issues.[5]
- Latency would break the “live AR” experience; by the time a cloud model responds, the user may have moved on. On‑device SLMs remove this dependency.[6][1]
5. How to Position It in the Slides
To fit the judges’ criteria (feasibility, privacy/edge utility, innovation):
Technical Feasibility (40%)
- Emphasize:
  - 3B‑class quantized Llama as the main SLM, with published guidance that it suits mobile and edge.[7]
  - RunAnywhere’s cross‑platform SDK and on-device optimization for Whisper and Llama models.[1][2]
  - Clear resource boundaries and data flow.
Privacy & Edge Utility (30%)
- State that all raw media, transcripts, and embeddings remain local by default; nothing leaves the device unencrypted.
- Highlight strict offline operation and how that enables rural and low‑connectivity scenarios.[5]
Innovation & Vision (30%)
- Connect your earlier vision: “spatial data as permanent, ownable, and safe” via:
  - Local-first spatial journals.
  - Optional blockchain anchoring for ownership/integrity without cloud inference.
- Stress that this is a spatial web wallet for lived experience, not a surveillance tool.
Built With
- ai
- deepseek