Inspiration: Whispers in the Machine - Towards Inner Speech Web Navigation
As a Computational Sciences major with a passion for Cognitive Neuropsychology, I've always been fascinated by the power of voice and the potential of brain-computer interfaces. My capstone project, "Look ma no hands," is inspired by this fascination, aiming to explore a future where web navigation is as intuitive as our own inner monologue.
The project is directly influenced by the concept of inner speech – that silent voice in our heads we use for thinking and self-talk. Imagine navigating the web not by typing or even speaking aloud, but by simply thinking your commands. This project takes the first step towards that vision by using voice input as a proxy for inner speech, paving the way for future integration with Brain-Computer Interfaces (BCIs) like Meta's Brain2Qwerty.
The goal of "Look ma no hands" is an AI assistant that understands natural language voice commands and uses them to control a web browser, performing tasks like searching, clicking links, and navigating pages – all without physically touching the keyboard or mouse.
This project is not just about convenience; it's about accessibility and the future of human-computer interaction. By making web navigation hands-free and potentially thought-driven, we can open up the digital world to individuals with motor impairments and create a more natural and intuitive computing experience for everyone.
What I Learned & How I Built It:
This project is a journey into the exciting intersection of AI, web technologies, and cognitive science. Here's a glimpse into the technologies and learning experiences:
- Voice Input & Transcription: In its current form, "Look ma no hands" takes real-time voice input and uses a Speech-to-Text API (like Whisper API or similar) to transcribe spoken commands. This stage turns speech into text that the later stages convert into actionable instructions.
- Browser Control with AI: The project uses AgentQL and Playwright to programmatically control a web browser. The AI assistant interprets transcribed voice commands and translates them into browser actions – opening URLs, performing searches (using DuckDuckGo for privacy), clicking links, scrolling, and more.
- Natural Language Understanding (NLU): While still in its early stages, the project incorporates elements of Natural Language Understanding to interpret the user's intent from their voice commands. Gemini Pro (or similar language models) is used to understand the context of the query and generate appropriate browser actions.
- Real-time Feedback: The assistant provides visual feedback by demonstrating browser control in real-time, allowing users to witness the AI executing their commands directly. Spoken responses using Text-to-Speech (TTS) further enhance the interactive experience.
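The step from transcribed text to a browser action can be sketched in a few lines. This is only an illustration under stated assumptions: in the project an LLM such as Gemini Pro interprets the command, whereas the sketch below swaps in a simple rule-based parser so it runs with the standard library alone. The names `BrowserAction`, `parse_command`, and `search_url` are hypothetical and not taken from the project's code.

```python
from dataclasses import dataclass
from urllib.parse import quote_plus
import re


@dataclass(frozen=True)
class BrowserAction:
    """A structured instruction for the browser layer (hypothetical schema)."""
    kind: str          # "goto", "search", "click", "scroll", or "unknown"
    argument: str = ""


def parse_command(transcript: str) -> BrowserAction:
    """Map one transcribed utterance to a BrowserAction using simple rules.

    Rule-based stand-in for the LLM-driven interpretation step.
    """
    # Normalize: trim, lowercase, and drop trailing sentence punctuation.
    text = transcript.strip().lower().rstrip(".!?")

    match = re.match(r"^(?:search for|search|look up)\s+(.+)$", text)
    if match:
        return BrowserAction("search", match.group(1))

    match = re.match(r"^(?:go to|open)\s+(\S+)$", text)
    if match:
        target = match.group(1)
        # Assume HTTPS when the speaker names a bare domain.
        if not target.startswith(("http://", "https://")):
            target = "https://" + target
        return BrowserAction("goto", target)

    match = re.match(r"^click(?: on)?\s+(.+)$", text)
    if match:
        return BrowserAction("click", match.group(1))

    match = re.match(r"^scroll\s+(up|down)$", text)
    if match:
        return BrowserAction("scroll", match.group(1))

    # Anything unrecognized is passed through so the caller can ask again.
    return BrowserAction("unknown", text)


def search_url(query: str) -> str:
    """Build a DuckDuckGo results URL for a query."""
    return "https://duckduckgo.com/?q=" + quote_plus(query)
```

A browser layer (Playwright, in this project) would then dispatch on `kind`, for example navigating to `search_url(action.argument)` for a `search` action. Returning an explicit `unknown` action rather than raising lets the assistant respond with a spoken clarification request.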
Technologies Used:
- Pipecat SDK: Core framework for multimodal AI in partnership with Gemini.
- Gemini Pro (or similar LLM): Natural Language Understanding and Response Generation.
- Whisper API (or Speech-to-Text library): Voice Transcription.
- AgentQL & Playwright: Browser Automation and Control.
- DuckDuckGo API: Privacy-focused Search Engine.
- Daily.co: Screen sharing (if used).
- Python: Primary programming language.
Challenges Faced & Future Directions:
Building a capstone project within a hackathon timeframe naturally presents challenges. Key hurdles and future directions include:
- Accuracy of Speech Recognition: Ensuring robust and accurate speech recognition in varying environments is crucial. Exploring noise reduction techniques and fine-tuning STT models is an ongoing challenge.
- Complexity of Natural Language: Interpreting the full spectrum of natural language commands and handling ambiguous queries requires more advanced NLU capabilities. Future iterations will focus on improving the AI's understanding of user intent.
- Browser Automation Robustness: Websites are dynamic and constantly changing. Maintaining robust and reliable browser automation across different websites and layouts is a continuous effort.
- The Leap to Inner Speech (Long-Term Vision): The ultimate goal is to move beyond voice input and directly interface with brain activity to capture "inner speech." This project serves as a stepping stone, establishing a foundation for voice-controlled web navigation that can later be extended with BCI technology. Exploring Meta's Brain2Qwerty research and similar advancements will be a key focus for future development.
- Accessibility Considerations: Continuously refining the system to be truly accessible and user-friendly for individuals with diverse needs is paramount.
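One common way to approach the browser automation robustness challenge above is to retry a flaky action with exponential backoff. The sketch below is a general pattern, not the project's implementation: `with_retries` and its parameters are hypothetical names, and `action` stands for any browser step (such as a click) that may fail while a page is still loading.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying on any exception with exponential backoff.

    The delay doubles after each failure: base_delay, 2 * base_delay, ...
    `sleep` is injectable so the helper can be tested without waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Wrapping each browser step this way keeps transient failures from ending a voice session, while a persistent failure still surfaces as an exception that the assistant can report to the user.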
"Look ma no hands" is more than just a voice-controlled browser; it's a glimpse into a future where technology seamlessly integrates with our thoughts and intentions. It's a journey towards making web navigation as natural and effortless as thinking itself.