Voxel: Listen. Imagine. Display.
Inspiration
We were inspired by the disconnect between our digital devices and our physical environments. Smart displays show generic photos or predictable data, but they lack the soul and dynamism of human interaction. We asked: What if a device could not just display art but create it, in real time, as a reflection of the life happening right in front of it?
We were further driven by the challenge of making advanced AI feel personal and private. In a world of cloud-based services, we wanted to build something that could harness the power of AI while respecting the sanctity of the home, keeping its most sensitive data—conversations—completely local. Voxel is our vision of a future where technology is not just a tool, but an empathetic and artistic member of the household.
What it does
Voxel is a smart digital picture frame that generates a unique, ambient piece of art based on the mood and topics of the conversation in the room. It listens to audio snippets, processes the speech entirely on the device to protect privacy, and uses the resulting text to craft a creative prompt. This prompt is then sent to a generative AI model, which returns a beautiful image that embodies the feeling of the conversation. The frame seamlessly updates, turning abstract dialogue into a visual centerpiece.
How we built it
Voxel runs on a Raspberry Pi 4 set up as a dedicated appliance. The architecture is a five-stage pipeline:
- Audio Capture: A Python script using the `sounddevice` library captures continuous 5-second audio clips from a USB microphone.
- On-Device Speech-to-Text: The audio clip is passed to a locally hosted Vosk model (`vosk-model-small-en-us-0.15`), which transcribes it to text without any data leaving the device.
- Prompt Engineering & NLP: The transcribed text is analyzed using simple NLP techniques to extract keywords and sentiment. A rules-based system then crafts a detailed, artistic prompt. For example:
  Input Text: "I'm so tired, I wish I was on a quiet beach."
  Crafted Prompt: "A serene and peaceful digital painting of a vast, empty beach at sunset, with soft waves and pastel colors, evoking a sense of calm and tranquility, in the style of Studio Ghibli."
The prompt crafting can be represented as a function of the extracted nouns $N$ and adjectives $A$:
$$ \text{Prompt} = f(N, A) = \text{"A [} \mathit{adjective} \text{] scene of [} \mathit{noun} \text{], [} \mathit{adjective} \text{], style of [} \mathit{artistic\ style} \text{]"} $$
- Image Generation: The crafted prompt is sent to the OpenAI DALL-E 3 API to generate a high-resolution image. This is the only cloud-based step.
- Display: The generated image is downloaded and displayed full-screen on the connected monitor using a lightweight image viewer, creating a seamless, gallery-like experience.
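The prompt template $f(N, A)$ from the Prompt Engineering step can be sketched in a few lines of Python. This is only an illustration of the template, not Voxel's actual rules engine: the function name `craft_prompt`, the fallback values, and the default style are assumptions, and the nouns and adjectives are assumed to have been extracted already by the NLP step.

```python
def craft_prompt(nouns, adjectives, style="Studio Ghibli"):
    """Fill the template "A [adjective] scene of [noun], [adjective], style".

    nouns and adjectives are lists of strings assumed to come from the
    keyword and sentiment extraction step. All fallbacks are hypothetical.
    """
    # The first adjective leads the prompt; the second (if any) trails it.
    lead = adjectives[0] if adjectives else "dreamlike"
    trail = adjectives[1] if len(adjectives) > 1 else lead
    # Join multiple subjects; fall back to a neutral subject if none found.
    subject = " and ".join(nouns) if nouns else "an abstract landscape"
    return f"A {lead} scene of {subject}, {trail}, in the style of {style}"


if __name__ == "__main__":
    print(craft_prompt(["a quiet beach"], ["serene", "calm and tranquil"]))
```

A real rules-based system would likely choose the style and mood phrases from the detected sentiment rather than take them as fixed arguments.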
Challenges we ran into
- The Latency vs. Privacy Trade-off: The biggest challenge was achieving real-time performance while insisting on on-device speech processing. Vosk models are computationally heavy for a Raspberry Pi. We had to optimize audio sampling rates and chunk sizes to find a balance between responsiveness and accuracy.
- Audio Filtering: Initial versions picked up every minor noise (e.g., keyboard clicks, coughs). Implementing basic audio filtering and a volume threshold was crucial to ensure only clear speech was processed.
- Prompt Engineering: Getting consistently high-quality art from the AI was non-trivial. Our first prompts generated literal, often awkward, interpretations. We spent significant time developing a robust prompt-crafting algorithm that injects artistic style and mood descriptors.
- API Cost & Rate Limiting: During development, we had to manage our calls to the DALL-E API carefully to stay within quota and budget, which required building in caching and rate-limiting logic.
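Two of the fixes above, the volume threshold and the caching plus rate limiting, are simple enough to sketch. The code below is a minimal illustration under stated assumptions rather than the project's implementation: the RMS-based gate, the threshold value, the 60-second minimum interval, and the names `is_loud_enough` and `ThrottledImageCache` are all hypothetical. The `generate` callable stands in for whatever function actually calls the image API.

```python
import math
import time


def rms(samples):
    """Root-mean-square level of a sequence of PCM sample values."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_loud_enough(samples, threshold=500.0):
    # The threshold is an assumed value for 16-bit audio; tune it for
    # the actual microphone and room.
    return rms(samples) >= threshold


class ThrottledImageCache:
    """Cache results by prompt and enforce a minimum gap between API calls."""

    def __init__(self, generate, min_interval_s=60.0, clock=time.monotonic):
        self._generate = generate          # placeholder for the real API call
        self._min_interval_s = min_interval_s
        self._clock = clock                # injectable so it can be tested
        self._cache = {}
        self._last_call = None

    def get(self, prompt):
        # Serve repeated prompts from the cache without spending a call.
        if prompt in self._cache:
            return self._cache[prompt]
        now = self._clock()
        too_soon = (
            self._last_call is not None
            and now - self._last_call < self._min_interval_s
        )
        if too_soon:
            return None  # caller keeps showing the current image
        self._last_call = now
        result = self._generate(prompt)
        self._cache[prompt] = result
        return result
```

Returning `None` when throttled lets the frame keep its current image rather than block; a queue that retries later would be another reasonable design.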
Accomplishments that we're proud of
- Creating a Magical Demo: The moment we saw Voxel generate a beautiful, relevant image from a casual conversation was pure magic. The live demo is captivating and never fails to impress.
- Upholding Our Privacy Principle: We are incredibly proud of building a system that respects user privacy by design. Successfully processing sensitive audio data locally was a core technical achievement.
- Seamless Integration: We built a complex, multi-stage pipeline (audio → text → NLP → API → display) that functions as a single, cohesive application on a low-cost device.
- Delivering a Unique Concept: We created something that feels both futuristic and natural, a piece of technology that enhances a space without being intrusive.
What we learned
- The Power of Edge AI: We gained deep hands-on experience with on-device machine learning, understanding the constraints and rewards of running models on resource-limited hardware.
- The Art of the Prompt: We learned that working with generative AI is less about coding and more about "linguistic sculpting." Writing effective prompts is a new and essential skill for developers.
- Full-Stack Hardware/Software Integration: This project was a masterclass in tying together physical hardware (microphones, displays), local software processes, and remote cloud APIs into a single user experience.
- The Importance of a Strong Narrative: A project is more than its code. Crafting a story around why it exists and who it's for is just as important as building it.
What's next for Voxel: Listen. Imagine. Display.
- Advanced On-Device Models: Explore replacing the DALL-E API with a quantized, locally run generative model like Stable Diffusion Tiny, which would make the system fully offline.
- Personalization & Memory: Allow users to fine-tune the art style. Implement a "memory" so the frame can learn preferences and even create art inspired by past conversations.
- Proactive Interaction: Add a button to let users "save" a generated piece they particularly love or flag one they don't, creating a feedback loop for the AI.
- Commercial Prototype: Design a dedicated, elegant frame enclosure to house the Raspberry Pi and screen, transforming our prototype into a market-ready product.
