Inspiration

Talking to AI today feels… boring. It’s transactional. We type a prompt, wait, and get a generic response that sounds more like a search result than a conversation. There is no memory, no personality, no sense that the AI actually knows us. It feels less like collaborating with a partner and more like using a tool. And even after paying for “premium” AI subscriptions, most systems still feel limited, slow, and surprisingly average.

This is not the version of AI we grew up dreaming about.

We grew up watching Iron Man, watching Tony Stark talk to Jarvis like a real teammate. Jarvis understood him instantly, anticipated what he needed, and controlled not only his designs and projects but his entire environment. It worked with him, not for him.

So we asked ourselves, why does this not exist in real life yet? Why are we still stuck typing into chatbots like it is 2015? Why can AI not remember us, multitask, and act on the world instead of just responding to us?

That question became our mission.

FRIDAY is our attempt to build the AI assistant we always wanted: one that feels present, proactive, and personal. Not just a chatbot, but a companion that remembers, reacts, executes, and evolves with the user. (Fun fact: in the films, FRIDAY is the AI that replaced Jarvis.)

Unlike today’s AI, which waits for instructions, FRIDAY takes initiative. Where chatbots forget everything after one chat, FRIDAY maintains context like a real partner.

We are redefining the way people interact with AI.

What it does

FRIDAY is your personal AI operating system: a voice-controlled, cross-platform hardware companion that lets you interact with your computer the way you wish technology worked — naturally, conversationally, and completely hands-free.

Instead of clicking through menus, typing long prompts, or switching between apps, you just speak. FRIDAY understands, responds, and executes, instantly. With real-time voice processing, personality-driven responses, and persistent memory, FRIDAY feels less like a chatbot and more like a smart partner that knows you.

FRIDAY does not just recognize commands. It understands intent, remembers past interactions, adapts to your workflow, and handles tasks across applications without you ever touching the keyboard. It lets you control your entire computer through natural conversation: opening apps, searching files, managing windows, and organizing your workspace, all without typing or clicking. In the ideal version, you just speak, and FRIDAY converts your thoughts into polished, send-ready emails, completely hands-free. Unlike traditional chatbots with slow responses, FRIDAY runs on a real-time, low-latency system that makes interactions feel instant, seamless, and human.

FRIDAY is a hardware-first solution, making it the only plug-and-play companion that works on any device and turns your computer into a fully voice-controlled smart assistant in seconds.

Now that is some Tony Stark level shit.

How we built it

Building FRIDAY wasn't just about writing code. It was about creating a distributed real-time system that fundamentally rethinks how AI processes voice, executes decisions, and controls hardware. Traditional software-only AI is bottlenecked by sequential task-switching and cloud dependency, so we engineered a three-layer distributed AI automation system with full-duplex streaming and hardware-accelerated processing.

Layer 1: ESP32-S3 Core Firmware (Audio Processing Layer)

At the foundation sits an ESP32-S3 dual-core microcontroller running at 240 MHz, essentially a powerful mini-computer designed for real-time audio. We connected it to a high-quality digital microphone that captures your voice at 16 kHz (standard speech-recognition quality) and streams it to our AI backend over a persistent WebSocket connection with built-in error checking so nothing gets lost. What makes this special is that the same connection also brings back FRIDAY's voice responses in real time, and we built a custom audio buffer that keeps everything smooth even when your internet speed fluctuates. To handle listening and speaking simultaneously, just like a real conversation, we run two independent processes on separate processor cores, so FRIDAY never has to stop listening in order to respond. We also integrated the ESP32's audio front-end algorithms that filter out echoes and background noise, making FRIDAY sound crystal clear while cutting power usage by 40% when idle.
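The buffering idea can be sketched as a plain ring buffer: the network task pushes decoded samples, the playback task drains them, and a few hundred milliseconds of headroom absorbs network hiccups. This is a minimal illustration in standard C++; the class and method names are ours, not the actual firmware's.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal ring buffer for 16-bit PCM samples. The network task pushes decoded
// audio; the I2S playback task pops it, so short bursts of network jitter
// never starve the speaker.
class AudioRing {
public:
    explicit AudioRing(size_t capacity) : buf_(capacity), head_(0), tail_(0), count_(0) {}

    bool push(int16_t sample) {
        if (count_ == buf_.size()) return false;   // full: caller drops or waits
        buf_[head_] = sample;
        head_ = (head_ + 1) % buf_.size();
        ++count_;
        return true;
    }

    bool pop(int16_t& out) {
        if (count_ == 0) return false;             // empty: play silence instead
        out = buf_[tail_];
        tail_ = (tail_ + 1) % buf_.size();
        --count_;
        return true;
    }

    size_t available() const { return count_; }

private:
    std::vector<int16_t> buf_;
    size_t head_, tail_, count_;
};
```

On the real device the producer and consumer run on different cores, so the production version would need a lock or lock-free indices; this sketch shows only the buffering logic.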

Layer 2: Xiaozhi-Compatible AI Backend (Inference & Orchestration Layer)

The second layer is where FRIDAY's intelligence lives. It runs a streaming AI pipeline that handles speech recognition, language understanding, tool execution, and voice synthesis all at the same time. Instead of waiting for you to finish talking before processing, like most voice assistants, FRIDAY works on your words as you speak them, cutting response time to under 800 milliseconds. The backend exposes specialized tool endpoints that let FRIDAY actually do things in the real world: open apps, compose emails, control Spotify, and search files. These aren't just simple commands; they're intelligent actions that bridge FRIDAY's AI brain with physical control over your computer, enabling real automation beyond basic conversation.
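A tool-endpoint layer boils down to a registry mapping intent names to handlers, with unknown intents rejected outright. The sketch below is hypothetical (the names are ours; the real backend wires these through MCP), but it shows the core contract.

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical tool-endpoint registry: the backend resolves a recognized
// intent (e.g. "open_app") to a concrete handler. Anything not registered
// can never execute, no matter what the LLM asks for.
using ToolHandler = std::function<std::string(const std::string&)>;

class ToolRegistry {
public:
    void add(const std::string& name, ToolHandler h) { tools_[name] = std::move(h); }

    // Runs the handler for a known tool, or returns an error string so an
    // unregistered action is refused instead of executed.
    std::string call(const std::string& name, const std::string& arg) {
        auto it = tools_.find(name);
        if (it == tools_.end()) return "error: unknown tool";
        return it->second(arg);
    }

private:
    std::map<std::string, ToolHandler> tools_;
};
```

Usage: register `open_app`, `compose_email`, etc. at startup, then route every LLM tool call through `call()` so the set of possible side effects is fixed at build time.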

Layer 3: Hardware Automation Bridge (Physical Control Layer)

To give FRIDAY true computer control, we built a hardware bridge using a second ESP32 that talks to an Arduino Leonardo over I²C. The Leonardo acts as a keyboard and mouse (USB HID), so when FRIDAY wants to open VS Code or type an email, it literally sends keystrokes to your computer as if you were typing. Commands flow from the AI backend as structured messages with built-in error checking and sequence tracking so that every action executes reliably: no dropped commands, no corrupted macros. Everything operates asynchronously with constant two-way communication between all three layers, eliminating the delays typical of cloud-based systems and cutting response time from seconds to milliseconds.
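The translation from a high-level command to keystrokes can be illustrated like this. The real bridge emits binary HID reports via the Leonardo's USB stack; the strings here are purely illustrative, and the launcher shortcut is an assumption.

```cpp
#include <string>
#include <vector>

// Illustrative expansion of an "open app" command into the keystroke sequence
// the Leonardo would replay over USB HID: open the OS launcher (assumed here
// to be GUI+Space), type the app name, press Enter.
std::vector<std::string> keystrokesForOpenApp(const std::string& app) {
    std::vector<std::string> seq;
    seq.push_back("PRESS GUI+SPACE");                 // open launcher
    for (char c : app)
        seq.push_back(std::string("TYPE ") + c);      // one HID report per char
    seq.push_back("PRESS ENTER");                     // confirm
    return seq;
}
```

Because the host sees a genuine USB keyboard, this works on any OS with no drivers or accessibility permissions, which is what makes the bridge plug-and-play.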

Challenges we ran into

Building a real-time AI system taught us that hardware and software don't always play nicely together, especially when you're asking them to do things they weren't designed for.

Real-time latency was our first nightmare. Standard Arduino WebSocket libraries introduced over 700 milliseconds of delay, completely unacceptable for natural conversation. We rewrote the transport layer to support chunked, non-blocking binary frames and implemented periodic pings for connection stability.
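Chunked framing amounts to splitting each payload into small writes so no single send can block the audio loop. A minimal sketch of the splitting step (chunk size and names are ours, for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split a binary payload into frames of at most maxLen bytes. Each frame is
// then a single short, non-blocking WebSocket write instead of one large
// send that would stall the real-time loop.
std::vector<std::vector<uint8_t>> chunkPayload(const std::vector<uint8_t>& payload,
                                               size_t maxLen) {
    std::vector<std::vector<uint8_t>> frames;
    for (size_t off = 0; off < payload.size(); off += maxLen) {
        size_t len = std::min(maxLen, payload.size() - off);
        frames.emplace_back(payload.begin() + off, payload.begin() + off + len);
    }
    return frames;
}
```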

Audio synchronization proved trickier than expected. The ESP32 I2S drivers produced clock drift during long sessions, causing audio to fall out of sync. We added timestamped sequence headers and a sliding jitter buffer to align incoming TTS packets to playback time, compensating for both network jitter and clock drift.
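The sequencing half of that fix can be sketched as a reordering buffer: packets arrive out of order but are released strictly in sequence at playback time. This is simplified (the real packets also carry timestamps for drift compensation; names are illustrative).

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Minimal reordering jitter buffer: TTS packets carry a sequence number,
// may arrive out of order, and are released strictly in order.
class JitterBuffer {
public:
    void insert(uint32_t seq, std::vector<int16_t> pcm) {
        pending_[seq] = std::move(pcm);
    }

    // Pops the next expected packet if it has arrived; otherwise the caller
    // plays silence for one tick and retries.
    bool popNext(std::vector<int16_t>& out) {
        auto it = pending_.find(next_);
        if (it == pending_.end()) return false;
        out = std::move(it->second);
        pending_.erase(it);
        ++next_;
        return true;
    }

private:
    std::map<uint32_t, std::vector<int16_t>> pending_;
    uint32_t next_ = 0;
};
```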

I²C throughput and reliability were problematic under heavy interrupt load. Early I²C frames occasionally dropped bytes, which could corrupt entire macro sequences. We added ACK-based retransmission and CRC8 validation to guarantee deterministic macro execution on the Leonardo.
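A CRC8 check like the one described fits in a few lines. We use polynomial 0x07 here, a common choice; we have not pinned down FRIDAY's exact parameters, so treat this as a sketch of the technique. The receiver recomputes the CRC over each frame and NAKs on mismatch, triggering retransmission.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-8 with polynomial 0x07 over a frame. Any burst error of up to
// 8 bits (e.g. a single corrupted I2C byte) is guaranteed to change the CRC,
// so the receiver can detect it and request a resend.
uint8_t crc8(const uint8_t* data, size_t len) {
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 0x80) ? static_cast<uint8_t>((crc << 1) ^ 0x07)
                               : static_cast<uint8_t>(crc << 1);
    }
    return crc;
}
```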

Concurrency conflicts nearly broke everything. Handling I2S DMA, the Wi-Fi stack, and WebSocket TX/RX simultaneously required tuning FreeRTOS task priorities and pinning threads to separate cores to prevent starvation. The dual-core architecture saved us: audio processing runs on Core 0 while networking runs on Core 1.

The biggest challenge was bridging unstructured AI with deterministic hardware. Mapping unconstrained LLM intent to executable automation required designing rigid MCP schemas with typed parameters to avoid hallucinated commands. We essentially built a type system on top of the LLM's output to ensure every command FRIDAY executes is safe and predictable.
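The guardrail idea can be sketched as schema validation over the LLM's proposed arguments: a command runs only if every required parameter is present, correctly typed, and nothing extra was hallucinated. This is a toy version; real MCP schemas are far richer.

```cpp
#include <map>
#include <string>

// Toy typed-schema check. A schema declares each parameter's expected type;
// a command from the LLM is rejected unless every parameter is present,
// parses as its declared type, and no unknown extras appear.
enum class ParamType { Int, Str };

bool validateCommand(const std::map<std::string, ParamType>& schema,
                     const std::map<std::string, std::string>& args) {
    for (const auto& entry : schema) {
        auto it = args.find(entry.first);
        if (it == args.end()) return false;            // missing parameter
        if (entry.second == ParamType::Int) {
            const std::string& v = it->second;
            if (v.empty()) return false;
            for (char c : v)
                if (c < '0' || c > '9') return false;  // not an integer
        }
    }
    return args.size() == schema.size();               // no hallucinated extras
}
```

Only commands that pass validation ever reach the hardware bridge, which is what keeps probabilistic LLM output from driving a deterministic keyboard.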

Accomplishments that we're proud of

We didn't just build another voice chatbot. We achieved performance metrics that rival commercial AI assistants while running entirely on open hardware that costs less than a meal at Chipotle.

We achieved sub-400 ms end-to-end latency from the moment you stop speaking to when FRIDAY starts responding, entirely on a $6 microcontroller. Commercial assistants like Alexa and Siri typically operate at 800-1000 ms, making FRIDAY feel as responsive as talking to a real person.

We implemented a self-authored Xiaozhi-compatible firmware stack from scratch, including a binary streaming protocol, custom audio buffering, and token-based reconnection logic. This wasn't about adapting existing libraries; we rewrote the entire transport layer to optimize for real-time performance.

We created a safe, deterministic hardware automation layer using an ESP32 + Arduino Leonardo that converts natural language into physical keystrokes and app launches without errors. Unlike software-only automation, our hardware bridge uses CRC validation and typed schemas to guarantee every command executes exactly as intended.

We demonstrated continuous bidirectional streaming on budget hardware, proving you don't need expensive cloud infrastructure to build truly intelligent AI. FRIDAY can listen and speak simultaneously in real-time, something most voice assistants still fake by task-switching.

What we learned

Streaming beats polling every time. Continuous WebSocket streams with timestamped packets completely outperform traditional request-response architectures for low-latency voice pipelines. While polling creates repeated overhead from HTTP handshakes and reconnections, WebSockets maintain persistent, full-duplex connections that cut latency dramatically and eliminate unnecessary network chatter.

Real-time systems require scheduling discipline. Proper FreeRTOS task segregation and buffer sizing are absolutely critical to avoid jitter and audio dropout. We learned the hard way that without pinning critical tasks to separate CPU cores and tuning priorities correctly, resource contention will cause missed frames and unpredictable behavior.

LLM intent must be structured. Defining strict MCP tool schemas prevents unsafe or ambiguous automation commands. Probabilistic AI outputs need deterministic guardrails: without typed parameters and validation rules, LLMs will hallucinate commands that sound right but execute incorrectly or dangerously.

Edge compute can handle AI I/O. With optimized firmware, even a small ESP32-S3 can sustain 16 kHz audio streaming and full-duplex playback with negligible delay. This proves you don't need expensive cloud infrastructure or enterprise hardware to build responsive, intelligent AI systems; you just need smart architecture and careful optimization.

What's next for FRIDAY

We want FRIDAY to truly automate everyday work for the average consumer. We also want to make FRIDAY more personalized by training it to learn your habits over time: how long your coding sessions typically last, when you prefer to take breaks, or what times you're most productive. The goal is for FRIDAY to become so familiar with your workflow that it can anticipate your needs and proactively suggest actions before you even ask.

Discord usernames: aryanmetkar, dhruvbhilare, santra8619, tejasree8833

Built With

  • arduino
  • asr
  • c++
  • esp32
  • esp32s3
  • freertos
  • i2c
  • i2s
  • json
  • llm
  • mcp
  • oled
  • qwenai
  • tts
  • websocket