Inspiration
We wanted to build an application that we could see ourselves using. Each member of our team has had moments of struggling to understand a topic covered in lecture, and we wanted an alternative way to revise class material when there are no lecture recordings, or when we need a fresh perspective on the topic.
What it does
As a result, we decided to build a full-stack web application that lets students upload their course materials and leverages modern models like GPT and Gemini/Veo to interpret images and generate video explanations of their lecture slides, like a personalized tutor.
How we built it
Despite having mostly hardware experience from our Electrical Engineering and AMS backgrounds, we decided to build a web application to make the project as accessible as possible. The project was split into a client side, running as a React TypeScript application, and a server using Python with Flask to handle the PDF processing and LLM pipeline. For the frontend, we started from a file-upload template built with help from Lovable AI, a site for creating frontend templates. For the backend, we created a simple RAG pipeline as follows: PDF -> extract text and images -> generate text analysis with GPT mini -> generate image analysis with GPT Vision -> feed the analysis into a parameterized prompt for Veo 3 -> stitch the resulting videos together and save them as a 30-second clip. Data extraction was mostly handled by the Python library PyMuPDF, and clip editing by MoviePy.
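To give a feel for the two library-driven ends of the pipeline, here is a minimal sketch of the extraction and stitching steps. File names and function names are illustrative, not our exact server code, which wires these into Flask routes.

```python
# Minimal sketch of the extraction and stitching steps (names are
# illustrative; our actual code wires these into the Flask server).
import fitz  # PyMuPDF
from moviepy.editor import VideoFileClip, concatenate_videoclips  # MoviePy 1.x import path

def extract_text_and_images(pdf_path):
    """Pull raw text and embedded images from each page of the slides."""
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        text = page.get_text()
        images = []
        for img in page.get_images(full=True):
            xref = img[0]  # cross-reference number of the embedded image
            info = doc.extract_image(xref)
            images.append(info["image"])  # raw image bytes
        pages.append({"text": text, "images": images})
    return pages

def stitch_clips(clip_paths, out_path="final.mp4"):
    """Concatenate the short Veo clips into one ~30 second video."""
    clips = [VideoFileClip(p) for p in clip_paths]
    final = concatenate_videoclips(clips)
    final.write_videofile(out_path)
```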
Challenges we ran into
We faced some challenges implementing video generation. Initially we planned to use OpenAI's Sora 2 model, but we quickly had to pivot due to roadblocks in the verification process for API access. Once we switched to Veo, we had to get creative around its limitations, such as only being able to produce 8-second clips. As a result, we decided to use GPT to generate interpretations in chunks and use each segment to generate a short video, which we then stitched together into the final product. Implementing this part of the pipeline proved difficult because we had to test and iterate on various parameterized prompts to get consistent, easily parseable results. Working around GPT's capabilities, along with rate limits on the Gemini API, forced us to be crafty and to compromise on video length and depth of explanation in order to keep the videos easy to generate.
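The sketch below illustrates the general chunking idea: pin down both the number of segments and the output format so each segment can drive one 8-second Veo clip. The prompt wording and model name here are placeholders, not our production prompt, which went through many iterations.

```python
# Illustrative sketch of the chunked prompting idea; the prompt text and
# model name are placeholders, not the exact prompt we shipped.
import json
from openai import OpenAI

client = OpenAI()

def plan_segments(lecture_analysis, n_segments=4):
    """Ask GPT for a fixed number of short narration segments as JSON,
    so each one can become the prompt for a single 8-second Veo clip."""
    prompt = (
        f"Split the following lecture explanation into exactly {n_segments} "
        "segments, each short enough to describe in an 8-second video. "
        "Respond with only a JSON array of objects of the form "
        '{"scene": "...", "narration": "..."}.\n\n'
        f"{lecture_analysis}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    # Real code should validate the output and strip stray markdown fences
    # before parsing; malformed JSON was exactly the failure mode we fought.
    return json.loads(resp.choices[0].message.content)
```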
Accomplishments that we're proud of
We were able to get a coherent 30-second video that explains basic Computer Science topics well. We were really amazed when we got it to consistently generate parseable output with few AI hallucinations, because it meant we could make longer videos that weren't bound by Veo's 8-second limit. We had tried Veo's video extension functionality, but its results were inconsistent, so when we saw that our simple solution worked we were content. We immediately wanted to test our Computer Architecture notes on the application.
We are also proud of our UI/UX, as one of our team members made original art and designed the duck logos used in our web application.
What we learned
We learned about being adaptable and choosing the right technology for the job. In our case, a language such as JavaScript would have been just as usable for the backend, but after some planning we found that Python better suited our simple architecture. Moreover, while building the prototype we learned a lot about RAG pipelines, which gave us good perspective on where the project could go, since the current version does not use any text embeddings and is a very rudimentary pipeline.
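For a sense of what a fuller RAG pipeline would add, here is a rough sketch of the text-embedding retrieval step we currently skip: embed the slide chunks, then rank them by similarity to a student's question. The embedding model name is a placeholder, and this is a possible extension rather than anything we built.

```python
# Rough sketch of the embedding-based retrieval a fuller RAG pipeline would
# add; this is a possible extension, not part of our current prototype.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(
        model="text-embedding-3-small",  # placeholder embedding model
        input=texts,
    )
    return np.array([d.embedding for d in resp.data])

def most_relevant_chunks(question, chunks, k=3):
    """Rank slide chunks by cosine similarity to the student's question."""
    vecs = embed(chunks + [question])
    chunk_vecs, q_vec = vecs[:-1], vecs[-1]
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]
```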
What's next for Prof. QuAIck
We would like to test and compare different video generation APIs. After seeing examples of Sora-generated content, we would definitely like to see how well it performs on our task. Additionally, we initially set out to make this project as accessible as possible, but we did not have enough time to deploy the app to production and host it online. In the future we would like to let other users reach it on the web by hosting the application on an AWS EC2 instance, though this would also require some refactoring of the video generation process, since we would need to handle multiple simultaneous users and perhaps consider user account creation.
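One possible shape for that refactor is a simple background job queue, so the Flask server can accept uploads from several users and generate videos asynchronously. Everything below is hypothetical, including `run_generation_pipeline`, which stands in for our existing end-to-end pipeline.

```python
# Hypothetical sketch of a background job queue for multiple users;
# run_generation_pipeline stands in for our existing pipeline function.
import queue
import threading
import uuid

jobs = queue.Queue()
results = {}  # job_id -> path of the finished video

def worker():
    while True:
        job_id, pdf_path = jobs.get()
        results[job_id] = run_generation_pipeline(pdf_path)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(pdf_path):
    """Called from the upload route; returns an id the client can poll."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, pdf_path))
    return job_id
```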