Touchless Browse

What does it do

While browsing the internet has become the norm today, no matter how you browse, a mouse and keyboard have always been needed... until now; our project, Touchless Browse is a web browser leveraging natural language processing, computer vision, and voice recognition and transcription models to facilitate a seamless web browsing experience where a user never needs to touch their laptop. We made what used to be limited to sci-fi movies; using our software, individuals can fully navigate the web using just their voice and hand gestures recognized by computer vision software.

**One can use their voice and the Whsiper AI model to conduct Google searches in a seemless way, where summaries of links are provided, powered by OpenAI's GPT model, along with the ability to use our software to open up these web pages. In addition to that, an individual can scroll, and click through a site using just the power of their voice.

To incorpoate added functionality, we leveraged OpenCV and other computer vision software to let users use their index finger to control a cursor by moving it in front of their webcam; although a new technology, an individual can simulate controling their computer via a "holographic means" something that used to be limited to the movies.**

Touchless Browser is the first web browser to facilitate information consumption, and regular browsing of the internet, but in the most innovative way possible.

How we built it

The core of VoiceBrowse was built on a combonation of voice recognition, natural language processing, and gesture recognition. We utilized the Whisper AI model to convert user voice input into text. We used the ChatGPT 3.5 api to process and organize both the user's input and to organize the web search results powered by the SERPER API that provides Google Search results. After reformating the results, we provided the user with output via the text to speech python libraries and used PyAudio, FFMpeg, and PyDub for audio processing. We also used OpenCV for facial recognition which we modified to track a user's finger to move the cursor using PyAutoGui

Challenges we ran into

Initially, we wanted to implement scrolling using eye tracking; however, after hours of testing, we realized few people have successfully achieved that feat with little callibration of the webcam due to the detailed nature of eye movements. As a result, we ended up settling with index finger movements which in itself was a challenge in its own right due to the nature of image recognition libraries such as OpenCV. We also had difficulty using the innate python text to speech speechrecognition library and had to use OpenAI's whisper API instead due to the slowness of the processing and the interaction with the speechrecognition libraries with other libraries we were using in the end. Finally, managing the sheer number of functions across the various different branches that we had was a challenge in itself; our documentation of headers came in incredibly helpful during this process. Despite the odds, the challenges that we faced made our end product so much more rewarding in the end.

Accomplishments that we're proud of

We are particularly proud of the system's ability to use 3 different forms of advanced AI technology including computer vision, voice recognition, and natural language processing all integrating well with each other. The gesture recognition module, although challenging, has been developed to a point where basic navigation commands can be executed with simple hand movements, something that lets average computers simulate touchscreens to an extent. Seeing our vision of a keyboardless and mouseless interface of knowledge, something often limited to science fiction, come to life was something we were most definitely proud of.

What we learned

Throughout the development of VoiceBrowse, we gained invaluable experience in integrating the various different fronts of applied machine learning and AI together to build useful applications. We learned the importance of product differentiation; when we started, we realized the similarity of our product to others, until we incorporated the fully touchless capabilities of the software. We also learned that although many foundational AI models exist today, they're not all one-stop problem-solving machines; at least for now, it takes humans to integrate them into the applications that we build to make them useful.

What's next for VoiceBrowse

Our vision for the future of VoiceBrowse includes refining and expanding the gesture recognition capabilities to cover more complex commands and interactions. We aim to enhance the stability and accuracy of voice and gesture inputs, making the system more robust and user-friendly. Further development will focus on improving the virtual keyboard and enhancing the mouse stability and scrolling functionalities to ensure that users can perform all necessary actions without switching input methods. Additionally, we plan to explore the integration of AI-driven content summarization to provide users with concise summaries of web content, further enhancing the browsing experience. Our ultimate goal is to create a fully immersive, voice and gesture-based browsing experience that is accessible and efficient for all users.

For the purposes of security, we have ommitted our API keys located in key.py from the public version of our repo and are submitting a duplicated version of our repo

Built With

computer-vision
github
gpt
opencv
pyaudio
pydub
python
whisper

Updates

Adhitya Raghavan started this project — Feb 03, 2024 09:51 AM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.