Inspiration
At the beginning, we wanted to create a website that would act as a mediator between ASL signers and spoken-language speakers. However, we ran into some major complications and decided to pivot, focusing instead on building an interactive multimedia platform that lets users control a virtual canvas and camera with hand gestures and voice commands. We aimed to develop a multifaceted project that would be both innovative and useful for all types of users.
What it does
Our website is an interactive multimedia platform that combines voice commands and gesture detection for a touch-free experience. First, the webcam captures images and records videos in response to audio input: when the user says "picture", the program takes a photo, while "start video" and "stop video" control video recording. Second, we added a feature that tracks the user's hand using MediaPipe Hands. With one finger up, the user can draw; with two fingers up, the user can move around freely without erasing the previous drawings on the screen; and with five fingers up, the canvas resets. Together, these features make the platform intuitive, responsive, and highly interactive for any type of use.
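The finger-count gestures above can be sketched as a small mapping function. This is a minimal, illustrative sketch (not our exact code): it assumes a list of 21 MediaPipe-style hand landmarks as normalized (x, y) pairs, where y grows downward in image space, and it ignores the thumb for simplicity, treating four raised fingers as the open-hand reset gesture.

```python
FINGER_TIPS = [8, 12, 16, 20]   # index, middle, ring, pinky tip indices
FINGER_PIPS = [6, 10, 14, 18]   # the joint below each corresponding tip

def count_extended_fingers(landmarks):
    """A finger counts as 'up' when its tip sits above its PIP joint
    (smaller y means higher in the image)."""
    return sum(
        1
        for tip, pip in zip(FINGER_TIPS, FINGER_PIPS)
        if landmarks[tip][1] < landmarks[pip][1]
    )

def gesture_mode(landmarks):
    """Map the finger count to the drawing modes described above."""
    fingers = count_extended_fingers(landmarks)
    if fingers == 1:
        return "draw"    # one finger up: draw on the canvas
    if fingers == 2:
        return "move"    # two fingers up: move without drawing
    if fingers >= 4:     # open hand: reset the canvas
        return "reset"
    return "idle"
```

In the real program, `landmarks` would come from MediaPipe's hand-landmark output each frame, and the returned mode would decide whether the fingertip position is appended to the current stroke.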
How we built it
TheClick was designed around real-time audio and video processing to create a hands-free multimedia experience. We used MediaPipe Hands to detect and track hand gestures from the webcam feed, letting users draw in the air. We also included PyAudio and SpeechRecognition to enable audio input, allowing users to take photos, record videos, and display on-screen messages with their voice. OpenCV manages the webcam feed: overlaying drawings, displaying messages, saving screenshots, and recording video.
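The voice side can be thought of as two pieces: the SpeechRecognition library turns microphone audio into a transcript, and a dispatcher maps that transcript to a camera action. Below is a hedged sketch of the dispatcher only; the function and state-key names are illustrative, not our project's exact code, and the real transcript would come from a SpeechRecognition recognizer.

```python
def dispatch_command(transcript, state):
    """Route a recognized phrase to a camera action by mutating `state`.

    `state` is a plain dict the video loop reads each frame; `action`
    tells it what to do once, `recording` tracks the video toggle.
    """
    text = transcript.lower().strip()
    if "picture" in text:
        state["action"] = "take_photo"
    elif "start video" in text:
        state["recording"] = True
        state["action"] = "start_recording"
    elif "stop video" in text:
        state["recording"] = False
        state["action"] = "stop_recording"
    else:
        state["action"] = None   # ignore unrecognized phrases
    return state
```

Keeping the dispatch logic separate from the audio capture makes it easy to test without a microphone, and lets new commands be added as extra branches.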
Challenges we ran into
Throughout the development of TheClick, we ran into several challenges that pushed us to learn quickly and think creatively. One of the first was installing MediaPipe, which required creating virtual environments for compatibility. Once it was installed, we faced the technical challenge of accurately detecting hand landmarks and mapping them to the correct gestures, which took time, tuning, and lots of testing. We also struggled with front-end development, since none of us had experience building a website interface before. Figuring out where to start (how to structure the page, integrate the webcam, and connect everything together) was a major hurdle. After multiple attempts, we ended up using Tkinter to build our "frontend" inside our backend code. Finally, one of the biggest challenges was integrating all the features into one cohesive program without causing errors: combining voice recognition, video capture, gesture detection, and air drawing in a single program required a lot of debugging. Despite these obstacles, we persisted and ultimately brought all the components together into a working, integrated platform.
Accomplishments that we're proud of
We are proud of creating a fully functional program that incorporates original and creative features we haven’t seen before. Despite changing our idea midway through the project, we stayed persistent, adapted quickly, and built an entirely new concept. We’re especially proud that TheClick almost seamlessly combines voice commands, real-time hand tracking, and interactive drawing. This project pushed us to think creatively, solve unexpected challenges, and design something innovative that feels both fun and intuitive to use.
What we learned
Throughout this project, we learned how to integrate multiple input systems, such as voice commands and hand-gesture detection. We learned a lot about Python libraries, including PyAudio for audio input, MediaPipe for hand tracking, and OpenCV for webcam image processing. We also learned how to manage real-time data to keep the website running smoothly, ensuring that all gestures and speech are converted into actionable program commands. We are happy we learned how to implement a "frontend" in the backend using Tkinter (thank you to the mentor who helped us!).
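The real-time integration pattern we learned boils down to threading: a background listener thread pushes recognized phrases onto a queue, and the main video/Tkinter loop drains that queue without blocking, so speech recognition never stalls frame rendering. The sketch below illustrates the pattern with a stand-in listener; in the real program, the loop body would call SpeechRecognition instead of iterating over a fixed list, and the names here are illustrative.

```python
import queue
import threading

def listener(commands, phrases):
    """Stand-in for the speech loop; real code would block on
    recognizer.listen(...) and push each recognized transcript."""
    for phrase in phrases:
        commands.put(phrase)

def poll_commands(commands):
    """Drain all pending commands without blocking, the way a
    per-frame video loop would."""
    drained = []
    while True:
        try:
            drained.append(commands.get_nowait())
        except queue.Empty:
            return drained

# Wire the two halves together with a thread-safe queue.
commands = queue.Queue()
t = threading.Thread(target=listener,
                     args=(commands, ["picture", "start video"]))
t.start()
t.join()
```

Because `queue.Queue` is thread-safe, the video loop can call `poll_commands` every frame and act on whatever phrases arrived since the last frame.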
What's next for TheClick
In the future, we would like to expand TheClick's interactivity and creative capabilities. For example, we would like to add more advanced voice commands such as “draw a circle” or “erase the last stroke”. However, our main objective is to integrate machine learning to recognize the words users write in the air and display the corresponding characters directly on the screen.
Built With
- ai
- backend
- desktop
- javascript
- mediapipe
- opencv
- pyaudio
- python
- speechrecognition
- speechrecognitionapi
- threading
- tkinter
- web