XTS | Devpost

Inspiration

We wanted to solve a unique problem that we felt affected many people but was not receiving enough attention. Using emerging technology, we implemented neural network models that recognize objects and images and convert them to auditory output.

What it does

XTS takes an X and turns it To Speech.

How we built it

We used PyTorch, Torchvision, and OpenCV in Python. These libraries let us use pre-trained convolutional neural network (CNN) models and region-based convolutional neural network (R-CNN) models instead of spending our limited build time training an accurate model ourselves.
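
A minimal sketch of how pre-trained models like these can be loaded through Torchvision. The function names `pick_device` and `load_models` are hypothetical, and the choice of Faster R-CNN as the region-based detector is an assumption made for illustration; the write-up does not say which R-CNN variant was used. The heavy imports sit inside the function so the file can be read and imported without PyTorch installed.

```python
def pick_device(cuda_available: bool) -> str:
    """Return the device string to run on (hypothetical helper)."""
    return "cuda" if cuda_available else "cpu"


def load_models():
    """Load a pre-trained classifier and detector (downloads weights on first use)."""
    import torch
    from torchvision.models import ResNet101_Weights, resnet101
    from torchvision.models.detection import (
        FasterRCNN_ResNet50_FPN_Weights,
        fasterrcnn_resnet50_fpn,
    )

    device = torch.device(pick_device(torch.cuda.is_available()))

    # ImageNet classifier: no training needed, only inference.
    cls_weights = ResNet101_Weights.DEFAULT
    classifier = resnet101(weights=cls_weights).to(device).eval()

    # Region-based detector (assumed variant: Faster R-CNN).
    det_weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
    detector = fasterrcnn_resnet50_fpn(weights=det_weights).to(device).eval()

    # cls_weights.transforms() resizes and normalizes an image for the classifier.
    return classifier, cls_weights.transforms(), detector, device
```

Calling `load_models()` requires a Torchvision release that provides the `weights=` API and a network connection for the first download.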

Challenges we ran into

When we ran the Python code, the video rendering and text-to-speech output fell out of sync, and the speed of frame-by-frame object recognition was limited by our system's graphics hardware and its capacity to run machine-learning models. We also had trouble using our computer's GPU for faster video rendering; backwards incompatibilities between module versions made this a long and frustrating issue to resolve.
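
One common way to ease this kind of sync problem is to run detection on only every Nth frame, so the detector keeps pace with playback. The sketch below illustrates that idea; it is not necessarily what we did in XTS, and the function names and frame rates are assumptions chosen for the example.

```python
import math


def frame_stride(video_fps: float, detector_fps: float) -> int:
    """Return N such that analyzing every Nth frame keeps up with playback."""
    if video_fps <= 0 or detector_fps <= 0:
        raise ValueError("frame rates must be positive")
    return max(1, math.ceil(video_fps / detector_fps))


def frames_to_process(total_frames: int, stride: int) -> list:
    """Indices of the frames that would be sent to the detector."""
    return list(range(0, total_frames, stride))


# Assumed numbers: a 30 fps video and a detector that manages about 8 fps.
stride = frame_stride(video_fps=30.0, detector_fps=8.0)
selected = frames_to_process(total_frames=20, stride=stride)
```

With these assumed rates the detector sees every fourth frame, and the skipped frames are simply displayed without new detections.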

Accomplishments that we're proud of

We are proud that we implemented neural networks and object detection in Python. We were also happy to test our program on various images and video recordings and get accurate output. Lastly, we created a sleek user interface that our program can be integrated into.

What we learned

We learned how neural networks function and how to augment a machine learning model, including creating a dataset. We also learned how to perform object detection in Python.

Submitted to

QHacks 2023
- Winner QHacks 23 Theme Prize - Designing the Digital World

Created by

During the project, I was tasked with enhancing the video-to-speech function of our app. I did this by implementing a second version of the function that uses a different text-to-speech library. My goal was to make the existing model more efficient. To achieve this, I used the GPU to speed up the processing of video frames, as text-to-speech conversion can be computationally intensive. I also researched real-time video-to-speech and made progress towards implementing it.

Working with my team, I gained valuable knowledge in computer vision, as this was my first exposure to the technology. I also learned how we could use pre-trained models to improve our project. I used TensorFlow in my work and created a custom dataset with LabelImg to train my real-time video-to-speech model.

I am extremely excited about the work I accomplished during this project and hope to continue building on my experience even after the hackathon.

Al-Barr Ajiboye
As a member of the Startroopers team working on the back end of our project, I implemented the video-to-speech function through extensive research and by applying real-time object detection with OpenCV and Python. To achieve this, I used pre-trained machine learning models, including ResNet-101 trained on the ImageNet dataset and a frozen graph paired with the coco.names class list. I created a modular program that performs frame-by-frame analysis and object recognition on input videos using real-time object detection techniques. After completing that portion of the program, I implemented text-to-speech conversion using the gTTS library in Python so that the program can announce the objects it identifies in the user's field of view.
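
The step between per-frame detection and speech can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the function name, the 0.5 confidence threshold, the zero-based class indices, and the sample detections are all made up. It only announces objects that were not visible in the previous frame, which keeps the speech from repeating on every frame.

```python
def new_announcements(frames, class_names, threshold=0.5):
    """Turn per-frame detections into phrases for objects that just appeared.

    frames: iterable of (class_ids, confidences) pairs, one pair per frame.
    class_names: list of labels, assumed to be indexed from zero.
    """
    previous = set()
    phrases = []
    for class_ids, confidences in frames:
        # Keep only detections the model is reasonably confident about.
        current = {
            class_names[class_id]
            for class_id, confidence in zip(class_ids, confidences)
            if confidence >= threshold
        }
        fresh = sorted(current - previous)
        if fresh:
            phrases.append("I see " + " and ".join(fresh))
        previous = current
    return phrases


# Made-up detections for four consecutive frames.
class_names = ["person", "bicycle", "car"]
frames = [
    ([0], [0.9]),
    ([0, 2], [0.8, 0.7]),
    ([0, 2], [0.85, 0.3]),
    ([2], [0.9]),
]
phrases = new_announcements(frames, class_names)
# Each phrase could then be handed to a text-to-speech library,
# for example gTTS(text=phrase, lang="en").save("phrase.mp3").
```

The car is announced twice because it drops below the threshold in the third frame and reappears in the fourth; a longer memory window would smooth that out.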

Sere Otubu
I worked on the image-to-speech function of our project. This required me to learn how to use machine learning, and I took it upon myself to learn the foundations of neural network models and how to implement pre-trained models. Using this new knowledge, I created two modular programs. The first took an image file (.jpg or .png) as user input, transformed the image to fit the dimensions required for processing, ran it through the ResNet-101 image recognition model, then output the three categories with the highest confidence scores, along with each score as a percentage. The second used the first program's function to take those outputs, clean them up into a digestible format, and convert them into auditory output.
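
The post-processing described above can be sketched in plain Python. This is a simplified illustration, not the original program: the labels and raw scores are invented, the function names are hypothetical, and a real run would take its scores from ResNet-101 rather than a hard-coded list.

```python
import math


def softmax(logits):
    """Convert raw model scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_k(logits, labels, k=3):
    """Return the k most likely (label, percentage) pairs, best first."""
    ranked = sorted(zip(labels, softmax(logits)), key=lambda pair: pair[1], reverse=True)
    return [(label, probability * 100.0) for label, probability in ranked[:k]]


def to_sentence(predictions):
    """Clean up labels and scores into one sentence suitable for speech."""
    parts = [
        f"{label.replace('_', ' ')} at {score:.0f} percent"
        for label, score in predictions
    ]
    return "This image most likely shows " + ", ".join(parts) + "."


# Invented example values standing in for real model output.
labels = ["golden_retriever", "tabby_cat", "park_bench", "traffic_light"]
logits = [4.0, 1.0, 2.5, 0.0]
predictions = top_k(logits, labels)
sentence = to_sentence(predictions)
```

Splitting the work this way mirrors the two-program design: `top_k` plays the role of the recognition output, and `to_sentence` prepares the text that a text-to-speech library would read aloud.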

Andrew Kim
I worked on both the front end and the back end of the project. During the initial stages, I helped by researching image detection in Python and learning the fundamentals of computer vision and machine learning. I created a program that analyzes an input video and recognizes the different objects in it by performing frame-by-frame detection. Furthermore, drawing on my previous app development work, I used Android Studio with Flutter to develop an app that is compatible with both Android and iOS. I worked on the front-end portion of the app, creating a user-friendly and aesthetically pleasing user interface.

Mihran Asadullah
Second Year Computer Engineering Student