Lucid Prism: An MR, LLM-Based Multi-Modal Agent System

Inspiration

The idea for Lucid Prism emerged from everyday scenarios where using a smartphone is inconvenient or impractical, like cooking with messy hands or multitasking. I wanted to create a system that seamlessly blends human expression with machine understanding, leveraging the immersive capabilities of MR (Mixed Reality) and the power of LLMs (Large Language Models). The goal was to design a virtual assistant that feels intuitive, responsive, and truly helpful in mixed-reality environments, pushing the boundaries of what multi-modal interaction can achieve.

What it does

Lucid Prism is a multi-modal assistant system that integrates MR, LLMs, and cloud-based APIs to provide three distinct functionalities:

  1. Conversational Assistant: Enables seamless voice-based interactions, enhanced by real-time camera views and memory storage for context continuity.
  2. Spatial Computing Assistant: Lets users interact with virtual environments by voice, describing objects and generating textures that are applied to them.
  3. Object Transformation Assistant: Quickly transforms real-world objects into virtual assets through image capture, background removal, and 3D reconstruction.

How we built it

  1. Conversational Assistant:

    • Leveraged Meta’s Voice SDK for speech-to-text.
    • Developed a custom camera-access hack to bypass Meta’s restrictions, enabling real-time transmission of visual context to the Claude API (see the first sketch after this list).
    • Integrated a cloud-based memory store that saves user inputs, API outputs, and logs to maintain conversational continuity.
  2. Spatial Computing Assistant:

    • Utilized Meta’s Depth API to extract spatial metadata from Unity scenes.
    • Converted scene prefabs into JSON files for LLM processing.
    • Integrated the Stable Diffusion API to generate textures and apply them dynamically to Unity GameObjects (see the second sketch after this list).
  3. Object Transformation Assistant:

    • Created a cloud pipeline for background removal in captured images.
    • Used Meshy’s API for fast 3D reconstruction of the processed images into virtual objects (see the third sketch after this list).
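
To make the conversational loop concrete, here is a minimal Unity C# sketch of the network call behind it: one transcribed utterance plus the current camera frame sent to Claude’s Messages API. The class name, model id, and capture plumbing are illustrative assumptions; the request shape (the x-api-key and anthropic-version headers, and the base64 image content block) follows Anthropic’s documented format.

```csharp
// Illustrative sketch, not the exact production code: send one voice turn
// plus the current camera frame to the Claude Messages API.
using System;
using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

public class ClaudeVisionClient : MonoBehaviour
{
    const string Endpoint = "https://api.anthropic.com/v1/messages";
    [SerializeField] string apiKey; // set in the Inspector; never hard-code keys

    public IEnumerator SendTurn(string transcript, Texture2D frame)
    {
        // Claude accepts inline images as base64 content blocks.
        string imageB64 = Convert.ToBase64String(frame.EncodeToJPG(75));
        string safeText = transcript.Replace("\\", "\\\\").Replace("\"", "\\\"");

        // Model id is an assumption; use whichever Claude model fits.
        string body =
            "{\"model\":\"claude-3-5-sonnet-latest\",\"max_tokens\":512," +
            "\"messages\":[{\"role\":\"user\",\"content\":[" +
            "{\"type\":\"image\",\"source\":{\"type\":\"base64\"," +
            "\"media_type\":\"image/jpeg\",\"data\":\"" + imageB64 + "\"}}," +
            "{\"type\":\"text\",\"text\":\"" + safeText + "\"}]}]}";

        var req = new UnityWebRequest(Endpoint, "POST");
        req.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(body));
        req.downloadHandler = new DownloadHandlerBuffer();
        req.SetRequestHeader("content-type", "application/json");
        req.SetRequestHeader("x-api-key", apiKey);
        req.SetRequestHeader("anthropic-version", "2023-06-01");
        yield return req.SendWebRequest();

        if (req.result == UnityWebRequest.Result.Success)
            Debug.Log(req.downloadHandler.text); // reply text is in content[0].text
        else
            Debug.LogError(req.error);
    }
}
```

In the full system, turns saved in the memory store are replayed into the messages array before sending, which is how conversational continuity survives the stateless API.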
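The spatial assistant’s Unity side boils down to two pieces, sketched below under assumed names (SceneItem, SceneSnapshot, the texture-URL handoff): flattening scene metadata into JSON an LLM can reason over, and applying a generated texture to a GameObject.

```csharp
// Illustrative sketch: export minimal scene metadata for the LLM, and apply
// a texture returned by the image-generation service to a target object.
using System.Collections;
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.Networking;

[System.Serializable]
public class SceneItem { public string name; public Vector3 position; public Vector3 size; }

[System.Serializable]
public class SceneSnapshot { public List<SceneItem> items = new List<SceneItem>(); }

public class SpatialExporter : MonoBehaviour
{
    // Flatten every rendered object into JSON the LLM can reason over.
    public string ExportSceneJson()
    {
        var snap = new SceneSnapshot();
        foreach (var r in FindObjectsOfType<Renderer>())
        {
            snap.items.Add(new SceneItem {
                name = r.gameObject.name,
                position = r.transform.position,
                size = r.bounds.size
            });
        }
        return JsonUtility.ToJson(snap, true);
    }

    // Download a generated texture and swap it onto the target's material.
    public IEnumerator ApplyTexture(string textureUrl, Renderer target)
    {
        using (var req = UnityWebRequestTexture.GetTexture(textureUrl))
        {
            yield return req.SendWebRequest();
            if (req.result == UnityWebRequest.Result.Success)
                target.material.mainTexture = DownloadHandlerTexture.GetContent(req);
        }
    }
}
```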
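And a sketch of the object-transformation pipeline. The background-removal URL stands in for our cloud function, and the Meshy endpoint path and request fields are assumptions about its image-to-3D workflow (submit an image, then poll a task for the finished mesh), not a verified API reference.

```csharp
// Illustrative sketch of the capture-to-asset pipeline; endpoints and
// response handling are assumptions, not a verified API reference.
using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

public class ObjectCapturePipeline : MonoBehaviour
{
    [SerializeField] string meshyApiKey;
    const string CutoutService = "https://example.com/remove-background"; // assumed cloud function
    const string MeshyImageTo3D = "https://api.meshy.ai/v1/image-to-3d";  // path is an assumption

    public IEnumerator CaptureToAsset(Texture2D photo)
    {
        // 1. Send the raw capture to the background-removal service.
        var cutoutReq = new UnityWebRequest(CutoutService, "POST");
        cutoutReq.uploadHandler = new UploadHandlerRaw(photo.EncodeToPNG());
        cutoutReq.downloadHandler = new DownloadHandlerBuffer();
        cutoutReq.SetRequestHeader("content-type", "image/png");
        yield return cutoutReq.SendWebRequest();
        if (cutoutReq.result != UnityWebRequest.Result.Success) yield break;

        // Assume the service replies with a public URL of the cutout image.
        string cutoutUrl = cutoutReq.downloadHandler.text;

        // 2. Ask Meshy to reconstruct a 3D asset from the cutout.
        string body = "{\"image_url\":\"" + cutoutUrl + "\"}";
        var meshyReq = new UnityWebRequest(MeshyImageTo3D, "POST");
        meshyReq.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(body));
        meshyReq.downloadHandler = new DownloadHandlerBuffer();
        meshyReq.SetRequestHeader("content-type", "application/json");
        meshyReq.SetRequestHeader("Authorization", "Bearer " + meshyApiKey);
        yield return meshyReq.SendWebRequest();

        // The response carries a task id to poll for the finished mesh (omitted).
        Debug.Log(meshyReq.downloadHandler.text);
    }
}
```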

Challenges we ran into

  • Camera Access Limitations: Developing a workaround for Meta’s camera restrictions required creative problem-solving and custom hacks to ensure smooth integration.
  • Contextual Memory: LLM APIs are stateless, with no built-in memory, which necessitated designing a robust cloud-based memory storage system (sketched after this list).
  • Latency: Real-time interactions with multiple APIs introduced latency issues, requiring optimizations in data transmission and processing.
  • Spatial Understanding: Ensuring accurate metadata extraction and meaningful LLM responses based on scene prefabs was a significant technical challenge.
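
For the memory problem specifically, the approach was to persist each turn ourselves and replay recent turns into later prompts. A minimal sketch of the per-turn record follows; the endpoint and field names are illustrative assumptions, and any blob or key-value store would do.

```csharp
// Illustrative sketch of the memory record persisted per conversational turn.
using System;
using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

[Serializable]
public class TurnRecord
{
    public string timestamp;   // ISO-8601, for ordering turns on replay
    public string userInput;   // transcribed speech
    public string modelReply;  // Claude's response
}

public class MemoryStore : MonoBehaviour
{
    const string StoreUrl = "https://example.com/memory"; // assumed cloud endpoint

    // Persist one turn so later requests can rebuild conversational context.
    public IEnumerator SaveTurn(string userInput, string modelReply)
    {
        var rec = new TurnRecord {
            timestamp = DateTime.UtcNow.ToString("o"),
            userInput = userInput,
            modelReply = modelReply
        };
        var req = new UnityWebRequest(StoreUrl, "POST");
        req.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(JsonUtility.ToJson(rec)));
        req.downloadHandler = new DownloadHandlerBuffer();
        req.SetRequestHeader("content-type", "application/json");
        yield return req.SendWebRequest();
    }
}
```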

Accomplishments that we're proud of

  • Successfully implemented a camera-access hack that lets the assistant see and respond to the user’s environment.
  • Designed a memory system that simulates human-like conversational continuity for the assistant.
  • Enabled real-time texture generation and application in Unity using voice commands, bridging LLM capabilities with spatial computing.
  • Built a robust object transformation pipeline that converts real-world objects into virtual assets efficiently.

What we learned

  • The importance of multi-modal interaction in creating intuitive user experiences.
  • How to integrate MR, LLMs, and APIs to create a cohesive and functional system.
  • Strategies for optimizing real-time interactions with cloud-based services.
  • The critical role of context and memory in making virtual assistants feel more human and responsive.

What's next for Lucid Prism

  1. Enhanced Multi-Modal Interactions: Expanding the assistant’s capabilities to include gesture-based commands and deeper emotional understanding.
  2. Improved Real-Time Performance: Optimizing latency and streamlining API integration for faster responses.
  3. User-Centric Design: Conducting user testing to refine the assistant’s functionality and usability further.
  4. Public Release: Packaging the system as a developer toolkit for broader adoption in the MR and LLM communities.
  5. Integration with Emerging Technologies: Exploring how the system can leverage advancements in BCI (Brain-Computer Interfaces) and haptics to create even more immersive experiences.

Built With

  • c#
  • claudeapi
  • http
  • huggingface
  • meshy
  • metasdk
  • python
  • stablediffusion
  • unity
  • witai