Inspiration
Children with reading disorders often find it challenging to follow English courses. Research suggests the problem is tractable: incorporating audiovisual elements into pedagogy has been shown to increase knowledge retention and student satisfaction. Our app aims to breathe new life into classic plays by assigning each character a synthetic, gender-matched voice and displaying AI-generated images that appear dynamically as the child progresses through the play.
What it does
Our app leverages various ML models to convert a text-heavy play into an audiovisual experience. Each character is given a unique, AI-generated, gender-matched voice that is maintained for the entire runtime of the play. Each line undergoes sentiment analysis, which influences the tone of speech to ensure a natural and engaging listening experience. The lines are also used to generate accompanying images that visualize the events of the play.
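To make the idea concrete, here is a minimal sketch of how per-character voice consistency might be kept. The voice ids and the round-robin assignment are illustrative placeholders, not our actual voice catalog or logic:

```python
# Sketch: keep each character's synthetic voice consistent across the play.
# Voice ids and the round-robin pick are illustrative, not our real catalog.

VOICES = {
    "male": ["en-US-M1", "en-US-M2"],
    "female": ["en-US-F1", "en-US-F2"],
}

def make_voice_assigner():
    """Return a function mapping (character, gender) to a stable voice id."""
    assigned = {}                      # character -> voice id, fixed for the play
    counters = {"male": 0, "female": 0}

    def assign(character, gender):
        if character not in assigned:
            pool = VOICES[gender]
            assigned[character] = pool[counters[gender] % len(pool)]
            counters[gender] += 1
        return assigned[character]

    return assign

assign_voice = make_voice_assigner()
assign_voice("Romeo", "male")     # picks a male voice
assign_voice("Juliet", "female")  # picks a female voice
assign_voice("Romeo", "male")     # returns the same voice as before
```

The closure holds the character-to-voice mapping for the lifetime of one play, which is what keeps a voice stable from the character's first line to their last.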
How we built it
We used Streamlit for the frontend and Python (with Flask) for our backend. The application leverages several ML models: GPT for gender prediction on the play's characters, and BERT for sentiment analysis on each line. We also used Google's text-to-speech API to generate natural-sounding speech from the text of the play, and OpenAI's DALL·E diffusion model to produce visual accompaniments.
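The sentiment score is what steers the synthesized tone. A rough illustration of the idea, assuming a score in [-1, 1]: the SSML `<prosody>` element is a standard part of text-to-speech input, but the specific pitch/rate mapping below is invented for illustration, not the tuning we shipped:

```python
# Map a sentiment score in [-1, 1] to SSML prosody settings.
# <prosody> is standard SSML; the exact numbers are a made-up example.

def line_to_ssml(text, sentiment):
    # Happier lines get a slightly higher pitch and faster rate,
    # sadder lines the opposite.
    pitch = f"{sentiment * 10:+.0f}%"      # e.g. "+6%" or "-3%"
    rate = f"{100 + sentiment * 15:.0f}%"  # e.g. "109%" or "91%"
    return f'<speak><prosody pitch="{pitch}" rate="{rate}">{text}</prosody></speak>'

print(line_to_ssml("But soft! What light through yonder window breaks?", 0.6))
```

The resulting SSML string can then be handed to the speech API instead of plain text, so the tone shifts line by line with the sentiment.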
Challenges we ran into
The large amount of AI integration meant we were constantly racing against time. We spent many hours retraining our models to reach the accuracies we wanted. Our application is also inherently time-sensitive: we didn't want the reader to wait for audio to load while trying to follow the play. To keep everything responsive in real time, we had to distribute some of the load throughout the application (e.g., preprocessing generated images before the user selects a play), which posed difficult architectural problems and led to several refactors along the way.
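The preloading idea above can be sketched with a thread pool: generation jobs are submitted for every scene up front, so results are hopefully ready by the time the reader gets there. `generate_image` here is a stand-in for the slow image-generation API call, not our real client code:

```python
# Sketch of preloading: submit image-generation jobs for all scenes up front
# so the result is (hopefully) ready when the reader reaches each scene.
# `generate_image` stands in for the real, slow API call.
from concurrent.futures import ThreadPoolExecutor

def generate_image(prompt):
    return f"image-for:{prompt}"  # placeholder for the API response

def preload_images(prompts, max_workers=4):
    """Submit all generation jobs immediately; return prompt -> Future."""
    pool = ThreadPoolExecutor(max_workers=max_workers)
    return {p: pool.submit(generate_image, p) for p in prompts}

futures = preload_images(["balcony scene", "the duel"])
# Later, when the reader reaches a scene, the work is likely already done:
futures["balcony scene"].result()
```

Calling `.result()` blocks only if the image is still being generated, so in the common case the reader never sees a loading delay.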
Accomplishments that we're proud of
We never took the easy route. We could've simply hardcoded the genders of each character, or looked them up in a dictionary of names, but we wanted to use machine learning so the application adapts to any play. We could also have used the same tone of voice for every character; it wouldn't sound terrible, and it would've saved us hours of work. But we knew those tiny intonations are crucial to holding a young viewer's attention. It was truly a labor of love, and no corners were cut on our journey to the goal.
What we learned
For many of us, this was our first interaction with these technologies. We had never used Streamlit or any of the APIs before, so we spent a lot of time learning the intricacies of each. The biggest lesson was the importance of planning and estimation in the early stages of a project. We would've saved ourselves a lot of time if we had foreseen some of the obvious issues that came up (image generation is still very slow!), and just an hour spent discussing the structure of the entire stack would have saved us from scrambling at the last minute to get all the moving parts in order.
What's next for Tech Poets Society
We want the user to be able to input their own play or script and have it adapted into a rendition with consistent, engaging voices in real time. All the moving parts are already in place; we simply need to improve our preprocessing scripts to handle more general formats (for instance, plays written as 'Juliet. Wherefore art thou Romeo?' rather than 'Juliet: Wherefore…'). We were also constrained by API rate limits; once we upgrade to accounts with higher quotas, we'll be able to offer more voices and higher-quality images.
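A format-tolerant speaker parser could start from a single regex that accepts both the 'Juliet.' and 'Juliet:' styles. The pattern below is a sketch of that generalization, not an exhaustive grammar for every edition:

```python
# Sketch: accept both "Juliet. line..." and "Juliet: line..." speaker styles.
# The pattern is a starting point, not a grammar for every printed edition.
import re

SPEAKER_RE = re.compile(r"^\s*([A-Z][A-Za-z' ]*?)[.:]\s+(.*)$")

def parse_line(raw):
    """Return (speaker, dialogue), or (None, raw) for stage directions etc."""
    m = SPEAKER_RE.match(raw)
    if m:
        return m.group(1), m.group(2)
    return None, raw

parse_line("Juliet. Wherefore art thou Romeo?")  # ('Juliet', 'Wherefore art thou Romeo?')
parse_line("Juliet: Wherefore art thou Romeo?")  # same result
```

Because the separator must be followed by whitespace and more text, a bare stage direction like "Exit." falls through to the `(None, raw)` case instead of being mistaken for a speaker.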