Aura Voice Clock: Context-Aware Chronology via Gemini Native Audio

Inspiration

The Aura Voice Clock was born from the observation that modern timekeeping has become anxiety-inducing. Most people check their phones for the time, only to be sucked into a vortex of notifications. I wanted to create a device that felt like a calm concierge. Inspired by Dieter Rams' functionalism and the tactile nature of mid-century hardware, Aura uses the Tuya T5 E1 to provide a screenless, voice-first experience. The goal was to humanize the interface between time and productivity, allowing users to ask 'What does my day look like?' and receive a nuanced, spoken summary that accounts for traffic, weather, and calendar density.

What I Learned

Developing for the ESP32-S3 required mastering the ESP-ADF (Audio Development Framework). I learned the critical importance of I2S buffer sizing: if the buffer is too small, audio stutters; if it is too large, the added latency makes the AI feel robotic. I also delved deep into the Gemini 2.5 Flash Native Audio API, learning how to stream raw PCM bytes directly to the I2S peripheral. This bypassed the traditional STT/TTS chain, reducing latency by nearly 40%. Memory management was another key takeaway: SPIRAM on the T5 E1 was essential for holding high-quality audio chunks while maintaining the Wi-Fi connection.

Construction & Engineering

The hardware stack centers on the Tuya T5 E1. I integrated an INMP441 MEMS microphone for high-fidelity audio capture and a MAX98357A I2S amplifier for output. The software architecture utilizes a dual-core approach: Core 0 handles the high-priority I2S DMA interrupts to ensure gapless audio, while Core 1 manages the HTTP/2 streaming connection to the Gemini API. To ensure privacy, the device uses a local wake-word engine based on ESP-Skainet, only opening the cloud stream after a 'Hey Aura' trigger. This hybrid approach ensures both security and performance.

$\Delta f = \frac{f_s}{N} = \frac{16000}{1024} \approx 15.6\,\text{Hz}$ (FFT frequency resolution for wake-word detection, $f_s = 16\,\text{kHz}$, $N = 1024$)

$L_{total} = L_{network} + L_{inference} + L_{buffer}$ (total system latency model)

Challenges & Solutions

The primary challenge was Acoustic Echo Cancellation (AEC). In such a small enclosure, the mic would pick up the speaker's own output, triggering false AI interpretations. I implemented a software logic gate that lowered mic gain significantly whenever the I2S output buffer was active. Another hurdle was 'token-jitter', where variations in network speed caused uneven audio delivery. I solved this by implementing a dynamic jitter buffer that adjusts its depth based on the moving average of the last five packet arrivals.

Future Roadmap

In the next iteration, I plan to add gesture controls using a ToF (Time of Flight) sensor, allowing users to silence alarms with a wave. I also hope to integrate local Llama-3-Tiny for offline basic time queries, making the device truly resilient.

Technical Stack

MCU: ESP32-S3 (T5 E1)
AI Model: Gemini 2.5 Flash
Framework: ESP-IDF / ADF
