Inspiration
Merlin is inspired by the AI agents being developed today and by the need for a simplified, elegant command-line version of those tools. It also draws on fictional helpers like Jarvis and on the fantasy theme of the hackathon and its themed rooms. Finally, we drew inspiration from our daily frustrations with bash commands and with trying to run similar commands on operating systems we were unfamiliar with.
What it does
Merlin uses speech-to-text to turn your words into reality within your computer. It connects to Google's Gemini 2.0 Flash to generate a list of terminal commands, executes them, checks for and fixes errors, and asks you, the user, for further guidance when needed. Merlin also narrates its commands and thought process, which makes it more user-friendly.
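The generate, execute, and fix cycle described above can be sketched roughly as follows. This is a minimal illustration, not Merlin's actual code: `ask_model` is a hypothetical stand-in for the Gemini call, and `run_request` and `max_attempts` are names invented for this example.

```python
import subprocess
from typing import Callable, List

# Hypothetical signature: takes the user's request plus any error feedback
# and returns a list of shell commands. In Merlin this role is played by Gemini.
ModelFn = Callable[[str, str], List[str]]


def run_request(request: str, ask_model: ModelFn, max_attempts: int = 3) -> bool:
    """Ask for commands, run them, and feed any failure back to the model."""
    feedback = ""
    for _ in range(max_attempts):
        commands = ask_model(request, feedback)
        for command in commands:
            result = subprocess.run(
                command, shell=True, capture_output=True, text=True
            )
            if result.returncode != 0:
                # Stop at the first failure so the model can see what went wrong.
                feedback = f"`{command}` failed: {result.stderr.strip()}"
                break
        else:
            return True  # every command in this attempt succeeded
    return False  # gave up after max_attempts; time to ask the user
```

When the function returns `False`, a tool like this would hand control back to the user for guidance instead of retrying forever.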
How we built it
We used OpenAI's Whisper model together with Google's AI Studio and Gemini 2.0 Flash to translate speech into usable terminal commands, no matter the system configuration. By letting Gemini read command output and by caching a prompt-response history, we were able to take advantage of Gemini's context window of more than 1 million tokens and make the most of our API key. We integrated both services, along with a command execution scheme, in Python and managed packages using virtual environments. Finally, eReader was used to implement Merlin's voice.
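One way to picture the cached prompt-response history is as a list of turns that gets rendered into text and sent along with each new request. The sketch below is an assumption about the shape of such a cache, not the project's real data model; `Turn`, `History`, and `as_context` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    prompt: str          # what the user said (after transcription)
    commands: List[str]  # what the model proposed
    output: str          # output captured from running the commands


@dataclass
class History:
    turns: List[Turn] = field(default_factory=list)

    def record(self, prompt: str, commands: List[str], output: str) -> None:
        """Store one completed exchange."""
        self.turns.append(Turn(prompt, commands, output))

    def as_context(self) -> str:
        """Render all earlier turns as plain text to prepend to the next prompt."""
        blocks = []
        for turn in self.turns:
            blocks.append(
                f"User: {turn.prompt}\n"
                f"Commands: {'; '.join(turn.commands)}\n"
                f"Output: {turn.output}"
            )
        return "\n\n".join(blocks)
```

With a context window this large, a simple design like this can keep the full history without trimming for a long session.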
Challenges we ran into
As it turns out, running GPU-based architecture is difficult on tiny laptops. Our integrated graphics handle speech processing very inefficiently, so we had to use the lighter-weight models at the cost of more transcription errors. Prompt cleaning partially mitigated this, but it was still a major drawback. A second challenge arose from hosting our own speech-to-text model: the Python packages were often large (in some cases more than 1 GB!). This slowed development and stalled us, as we quickly lost patience waiting for pip to install these modules. The largest challenge was our limited time. We were not allowed to stay on-site overnight and left around 11 PM on Friday and Saturday.
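The write-up does not say what the prompt cleaning involved. As a purely illustrative guess, a cleaning pass over a noisy transcript could drop filler words and collapse stray whitespace before the text reaches the model; `FILLERS` and `clean_transcript` are hypothetical names for this sketch.

```python
# Illustrative filler list; a real one would be tuned to observed errors.
FILLERS = {"um", "uh", "er", "hmm"}


def clean_transcript(text: str) -> str:
    """Drop filler words and normalize whitespace in a raw transcript."""
    words = text.split()  # split() also collapses runs of whitespace
    kept = [w for w in words if w.lower().strip(",.!?") not in FILLERS]
    return " ".join(kept)
```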
Accomplishments that we're proud of
One of the coolest things we learned was prompt engineering: making sure that Merlin produced responses within our guidelines using Gemini's structured response feature. Another awesome achievement was integrating a chat history so that Merlin works like a chatbot, which makes CRUD operations much simpler and keeps Merlin in sync with its environment. The history also gives Merlin recursive error handling. We also implemented guardrails to prevent Merlin from accidentally wreaking havoc on an environment. Merlin also pushes back against assuming sudo.
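A guardrail of this kind could be as simple as a check that runs before any command is executed. The sketch below is an assumption about how such a check might look, not Merlin's actual rules: the denylist, the `check_command` function, and the `allow_sudo` flag are all invented for illustration, and it only inspects the first program in a command (pipelines and `&&` chains are not handled).

```python
import shlex
from typing import Tuple

# Illustrative denylist only; the project's real rules are not documented here.
BLOCKED_PROGRAMS = {"mkfs", "dd", "shutdown", "reboot"}
DANGEROUS_RM_TARGETS = {"/", "/*", "~"}


def check_command(command: str, allow_sudo: bool = False) -> Tuple[bool, str]:
    """Return (allowed, reason) for a single proposed shell command."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False, "could not parse command"
    if tokens and tokens[0] == "sudo":
        if not allow_sudo:
            return False, "sudo requires explicit user approval"
        tokens = tokens[1:]  # inspect the command that sudo would run
    if not tokens:
        return False, "empty command"
    program, args = tokens[0], tokens[1:]
    if program in BLOCKED_PROGRAMS:
        return False, f"{program} is blocked"
    if program == "rm":
        # Short flags such as -rf or -R, or the long --recursive form.
        recursive = any(
            a == "--recursive"
            or (a.startswith("-") and not a.startswith("--") and "r" in a.lower())
            for a in args
        )
        if recursive and any(a in DANGEROUS_RM_TARGETS for a in args):
            return False, "recursive delete of a protected path"
    return True, "ok"
```

Returning a reason alongside the verdict lets the assistant narrate why it refused, or pass the refusal back to the model so it can propose a safer command.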
What we learned
We learned a lot about working as a team, and our all-high-school team, which included two freshmen, made for a unique experience for our developers. We also learned about error handling and about taking care of ourselves despite the hackathon driving us toward insanity.
What's next for Merlin
The next step for Merlin is making our speech-to-text work better. Once we do that, a myriad of possibilities opens up. We have tinkered with real UIs using tkinter, text-to-speech to give Merlin a voice, and even more efficient contextualization using cached history. The possibilities are endless, as the terminal can do almost anything when it comes to interacting with the system.
Team
- Jeremy - Backend & Demo
- Eli - UI Design & Demo
- Robby - Text-to-speech & Research
Submission Category / Track
- Best High School Project
- Best use of Gemini API
- General Track