Inspiration

We noticed that Siri, despite being built into your computer, lacks many of the features users might expect. So we set out to build something that could automate these kinds of tasks with only our voices.

What it does

After starting the application, a Python server runs and listens for the 'wake word', which in our case is 'Jarvis'. Upon hearing the wake word, the app records your command and sends it to the backend for processing. The backend then uses LangChain and Gemini to understand your intent and invoke the appropriate tools to carry out actions on your computer, e.g., opening Discord and sending a message, playing a song on Spotify, or opening a video on YouTube. All of this works with just your voice and requires no manual input.
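At its core, the flow above maps a parsed intent to a tool function and executes it. Here is a minimal, hypothetical sketch of that dispatch step; the tool names, signatures, and intent shape are illustrative, not the project's actual API:

```python
# Hypothetical tools; in the real app these would call OS/Spotify/Discord APIs.
def open_app(name: str) -> str:
    return f"opened {name}"          # placeholder for a real OS call

def play_song(title: str) -> str:
    return f"playing {title}"        # placeholder for a real Spotify call

TOOLS = {"open_app": open_app, "play_song": play_song}

def dispatch(intent: dict) -> str:
    """Route a parsed intent like {'tool': 'play_song', 'args': {...}}
    (as the LLM might emit) to the matching tool function."""
    tool = TOOLS.get(intent["tool"])
    if tool is None:
        raise ValueError(f"unknown tool: {intent['tool']}")
    return tool(**intent["args"])
```

In the real system, LangChain's tool-calling support handles this routing; the sketch only shows the shape of the idea.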

How we built it

We started by initializing our project with Tauri, a lightweight framework for building secure and performant web-based desktop applications. Using React on the frontend, we created a translucent HUD that stays in sync with the user's latest command.

For the core logic, we developed a Python backend server responsible for wake word detection, audio transcription and narration, and AI tool calling. We integrated Picovoice Porcupine for wake word detection, and LangChain + Gemini to interpret natural language commands, bind and invoke tools, and embed content. We also leveraged ChromaDB to provide a RAG knowledge base synced with local files.
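Porcupine consumes fixed-size 16-bit PCM frames (512 samples at 16 kHz for the default models). The sketch below shows only the frame-buffering step of a wake-word loop, with the actual engine calls stubbed out as comments; the frame length mirrors `porcupine.frame_length` in the real pvporcupine API:

```python
FRAME_LENGTH = 512  # matches porcupine.frame_length for default models

def frames(samples: list[int], frame_length: int = FRAME_LENGTH):
    """Slice an incoming audio sample stream into complete fixed-size
    frames; a trailing partial frame is held back until more audio arrives."""
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        yield samples[start:start + frame_length]
        # real loop: if porcupine.process(frame) >= 0, the wake word
        # was detected, so start recording the user's command
```

The real detection loop would read these samples from a microphone stream and pass each frame to `porcupine.process`.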

Challenges we ran into

Implementing RAG from scratch was a big challenge. Since we chose a self-hosted solution that also had to stay in sync with local files, we had to figure out how to keep ChromaDB up to date with the provided folder. This meant manually chunking files and using a file watcher to re-index the knowledge base whenever something changed. Thankfully, we got this working, allowing the user to add new files to the knowledge base while the application is running.
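The manual chunking step can be sketched as a fixed-size split with overlap, so context survives chunk boundaries; the chunk and overlap sizes here are illustrative, not the project's actual settings. In the real app, each chunk would then be embedded and upserted into ChromaDB, and the file watcher would rerun this whenever a file changes:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks so that sentences
    cut at a chunk border still appear whole in the neighboring chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

Each chunk shares its first `overlap` characters with the tail of the previous one, which keeps retrieval from missing content that straddles a boundary.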

Documentation for Picovoice Porcupine was very sparse, which made it difficult to integrate.

Accomplishments that we're proud of

We were really proud of the fact that, despite not having a solid idea until 4:00 pm on the first day of the hackathon, we were able to push out something that exceeded all of our expectations.

What we learned

An LLM combined with LangChain can be made into an extremely powerful application that can interpret natural language commands and seamlessly connect them to real-world actions through tools and APIs.

What's next for Jarvis

In the future, we hope to add:

  • A drag-and-drop workflow builder, activated by voice commands.
  • User-configurable MCP integrations to support more diverse needs.
  • An improved, more responsive UI.
  • Authentication for websites and apps.
  • Stronger security safeguards to prevent malicious actions.

Built With

  • ChromaDB
  • Gemini
  • LangChain
  • Picovoice Porcupine
  • Python
  • React
  • Tauri