Lucid Prism: An MR, LLM-Based Multi-Modal Agent System

Inspiration

The idea for Lucid Prism emerged from everyday scenarios where using a smartphone is inconvenient or impractical, like cooking with messy hands or multitasking. I wanted to create a system that seamlessly blends human expression with machine understanding, leveraging the immersive capabilities of MR (Mixed Reality) and the power of LLMs (Large Language Models). The goal was to design a virtual assistant that feels intuitive, responsive, and truly helpful in mixed-reality environments, pushing the boundaries of what multi-modal interaction can achieve.

What it does

Lucid Prism is a multi-modal assistant system that integrates MR, LLMs, and cloud-based APIs to provide three distinct functionalities:

  1. Conversational Assistant: Enables seamless voice-based interactions, enhanced by real-time camera views and memory storage for context continuity.
  2. Spatial Computing Assistant: Lets users interact with virtual environments by voice, describing objects and generating textures that are applied to them.
  3. Object Transformation Assistant: Quickly transforms real-world objects into virtual assets through image capture, background removal, and 3D reconstruction.

How we built it

  1. Conversational Assistant:

    • Leveraged Meta’s Voice SDK for speech-to-text.
    • Developed a custom camera-access hack to bypass Meta’s restrictions, enabling real-time transmission of visual context to the Claude API (see the first sketch after this list).
    • Integrated a cloud-based memory store that saves user inputs, API outputs, and logs to maintain conversational continuity.
  2. Spatial Computing Assistant:

    • Utilized Meta’s Depth API to extract spatial metadata from Unity scenes.
    • Converted scene prefabs into JSON files for LLM processing.
    • Integrated the Stable Diffusion API to generate textures and apply them dynamically to Unity GameObjects (see the second sketch after this list).
  3. Object Transformation Assistant:

    • Created a cloud pipeline for background removal in captured images.
    • Used Meshy’s API for fast 3D reconstruction of the processed images into virtual objects (see the third sketch after this list).
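
To make the conversational loop concrete, here is a minimal Unity C# sketch of the network call behind it: one transcribed utterance plus the current camera frame sent to Claude’s Messages API. The class name, model id, and capture plumbing are illustrative assumptions; the request shape (the x-api-key and anthropic-version headers, and the base64 image content block) follows Anthropic’s documented format.

```csharp
// Illustrative sketch, not the exact production code: send one voice turn
// plus the current camera frame to the Claude Messages API.
using System;
using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

public class ClaudeVisionClient : MonoBehaviour
{
    const string Endpoint = "https://api.anthropic.com/v1/messages";
    [SerializeField] string apiKey; // set in the Inspector; never hard-code keys

    public IEnumerator SendTurn(string transcript, Texture2D frame)
    {
        // Claude accepts inline images as base64 content blocks.
        string imageB64 = Convert.ToBase64String(frame.EncodeToJPG(75));
        string safeText = transcript.Replace("\\", "\\\\").Replace("\"", "\\\"");

        // Model id is an assumption; use whichever Claude model fits.
        string body =
            "{\"model\":\"claude-3-5-sonnet-latest\",\"max_tokens\":512," +
            "\"messages\":[{\"role\":\"user\",\"content\":[" +
            "{\"type\":\"image\",\"source\":{\"type\":\"base64\"," +
            "\"media_type\":\"image/jpeg\",\"data\":\"" + imageB64 + "\"}}," +
            "{\"type\":\"text\",\"text\":\"" + safeText + "\"}]}]}";

        var req = new UnityWebRequest(Endpoint, "POST");
        req.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(body));
        req.downloadHandler = new DownloadHandlerBuffer();
        req.SetRequestHeader("content-type", "application/json");
        req.SetRequestHeader("x-api-key", apiKey);
        req.SetRequestHeader("anthropic-version", "2023-06-01");
        yield return req.SendWebRequest();

        if (req.result == UnityWebRequest.Result.Success)
            Debug.Log(req.downloadHandler.text); // reply text is in content[0].text
        else
            Debug.LogError(req.error);
    }
}
```

In the full system, turns saved in the memory store are replayed into the messages array before sending, which is how conversational continuity survives the stateless API.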
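The spatial assistant’s Unity side boils down to two pieces, sketched below under assumed names (SceneItem, SceneSnapshot, the texture-URL handoff): flattening scene metadata into JSON an LLM can reason over, and applying a generated texture to a GameObject.

```csharp
// Illustrative sketch: export minimal scene metadata for the LLM, and apply
// a texture returned by the image-generation service to a target object.
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Networking;

[System.Serializable]
public class SceneItem { public string name; public Vector3 position; public Vector3 size; }

[System.Serializable]
public class SceneSnapshot { public List<SceneItem> items = new List<SceneItem>(); }

public class SpatialExporter : MonoBehaviour
{
    // Flatten every rendered object into JSON the LLM can reason over.
    public string ExportSceneJson()
    {
        var snap = new SceneSnapshot();
        foreach (var r in FindObjectsOfType<Renderer>())
        {
            snap.items.Add(new SceneItem {
                name = r.gameObject.name,
                position = r.transform.position,
                size = r.bounds.size
            });
        }
        return JsonUtility.ToJson(snap, true);
    }

    // Download a generated texture and swap it onto the target's material.
    public IEnumerator ApplyTexture(string textureUrl, Renderer target)
    {
        using (var req = UnityWebRequestTexture.GetTexture(textureUrl))
        {
            yield return req.SendWebRequest();
            if (req.result == UnityWebRequest.Result.Success)
                target.material.mainTexture = DownloadHandlerTexture.GetContent(req);
        }
    }
}
```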
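And a sketch of the object-transformation pipeline. The background-removal URL stands in for our cloud function, and the Meshy endpoint path and request fields are assumptions about its image-to-3D workflow (submit an image, then poll a task for the finished mesh), not a verified API reference.

```csharp
// Illustrative sketch of the capture-to-asset pipeline; endpoints and
// response handling are assumptions, not a verified API reference.
using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

public class ObjectCapturePipeline : MonoBehaviour
{
    [SerializeField] string meshyApiKey;
    const string CutoutService = "https://example.com/remove-background"; // assumed cloud function
    const string MeshyImageTo3D = "https://api.meshy.ai/v1/image-to-3d";  // path is an assumption

    public IEnumerator CaptureToAsset(Texture2D photo)
    {
        // 1. Send the raw capture to the background-removal service.
        var cutoutReq = new UnityWebRequest(CutoutService, "POST");
        cutoutReq.uploadHandler = new UploadHandlerRaw(photo.EncodeToPNG());
        cutoutReq.downloadHandler = new DownloadHandlerBuffer();
        cutoutReq.SetRequestHeader("content-type", "image/png");
        yield return cutoutReq.SendWebRequest();
        if (cutoutReq.result != UnityWebRequest.Result.Success) yield break;

        // Assume the service replies with a public URL of the cutout image.
        string cutoutUrl = cutoutReq.downloadHandler.text;

        // 2. Ask Meshy to reconstruct a 3D asset from the cutout.
        string body = "{\"image_url\":\"" + cutoutUrl + "\"}";
        var meshyReq = new UnityWebRequest(MeshyImageTo3D, "POST");
        meshyReq.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(body));
        meshyReq.downloadHandler = new DownloadHandlerBuffer();
        meshyReq.SetRequestHeader("content-type", "application/json");
        meshyReq.SetRequestHeader("Authorization", "Bearer " + meshyApiKey);
        yield return meshyReq.SendWebRequest();

        // The response carries a task id to poll for the finished mesh (omitted).
        Debug.Log(meshyReq.downloadHandler.text);
    }
}
```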

Challenges we ran into

  • Camera Access Limitations: Developing a workaround for Meta’s camera restrictions required creative problem-solving and custom hacks to ensure smooth integration.
  • Contextual Memory: LLM APIs are stateless, with no built-in memory, which necessitated designing a robust cloud-based memory storage system (sketched after this list).
  • Latency: Real-time interactions with multiple APIs introduced latency issues, requiring optimizations in data transmission and processing.
  • Spatial Understanding: Ensuring accurate metadata extraction and meaningful LLM responses based on scene prefabs was a significant technical challenge.
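
For the memory problem specifically, the approach was to persist each turn ourselves and replay recent turns into later prompts. A minimal sketch of the per-turn record follows; the endpoint and field names are illustrative assumptions, and any blob or key-value store would do.

```csharp
// Illustrative sketch of the memory record persisted per conversational turn.
using System;
using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

[Serializable]
public class TurnRecord
{
    public string timestamp;   // ISO-8601, for ordering turns on replay
    public string userInput;   // transcribed speech
    public string modelReply;  // Claude's response
}

public class MemoryStore : MonoBehaviour
{
    const string StoreUrl = "https://example.com/memory"; // assumed cloud endpoint

    // Persist one turn so later requests can rebuild conversational context.
    public IEnumerator SaveTurn(string userInput, string modelReply)
    {
        var rec = new TurnRecord {
            timestamp = DateTime.UtcNow.ToString("o"),
            userInput = userInput,
            modelReply = modelReply
        };
        var req = new UnityWebRequest(StoreUrl, "POST");
        req.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(JsonUtility.ToJson(rec)));
        req.downloadHandler = new DownloadHandlerBuffer();
        req.SetRequestHeader("content-type", "application/json");
        yield return req.SendWebRequest();
    }
}
```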

Accomplishments that we're proud of

  • Successfully implemented a camera-access hack that lets the assistant see and respond to the user’s environment.
  • Designed a memory system that simulates human-like conversational continuity for the assistant.
  • Enabled real-time texture generation and application in Unity using voice commands, bridging LLM capabilities with spatial computing.
  • Built a robust object transformation pipeline that converts real-world objects into virtual assets efficiently.

What we learned

  • The importance of multi-modal interaction in creating intuitive user experiences.
  • How to integrate MR, LLMs, and APIs to create a cohesive and functional system.
  • Strategies for optimizing real-time interactions with cloud-based services.
  • The critical role of context and memory in making virtual assistants feel more human and responsive.

What's next for Lucid Prism

  1. Enhanced Multi-Modal Interactions: Expanding the assistant’s capabilities to include gesture-based commands and deeper emotional understanding.
  2. Improved Real-Time Performance: Optimizing latency and streamlining API integration for faster responses.
  3. User-Centric Design: Conducting user testing to refine the assistant’s functionality and usability further.
  4. Public Release: Packaging the system as a developer toolkit for broader adoption in the MR and LLM communities.
  5. Integration with Emerging Technologies: Exploring how the system can leverage advancements in BCI (Brain-Computer Interfaces) and haptics to create even more immersive experiences.

Built With

  • c#
  • claudeapi
  • http
  • huggingface
  • meshy
  • metasdk
  • python
  • stablediffusion
  • unity
  • witai