Inspiration
When we were brainstorming ideas for our project, one of our group members had an affinity for storytelling. This inspired our project idea: we wanted to develop a way to turn verbally told situations into short sitcom-style stories. It started with simply wanting to add crowd reactions that could respond appropriately to the user's real-time speech, making it feel like the user is part of a sitcom show where they could have fun talking about anything and see how a crowd might respond. Once we realized we could add images to better visualize what is being said, the project grew into a way to create short video clips that act as a moving storyboard. We thought about how funny it would be to have a crowd constantly reacting to what you say and commemorating it with a sort of slideshow skit.
What it does
Tell-a-vision transcribes spoken stories, which are then read back aloud to the user with text-to-speech. This narration is paired with an automatically generated slideshow “movie” that fits the story’s setting and context, along with precisely placed sitcom laugh tracks that bring the story together. The newly generated, stylized story is then presented to the user and made available for them to download.
How we built it
tell-a-vision is built with a frontend JavaScript web client using the Vue.js framework and a backend web server built with Node.js and Express.js. Using the browser's native Web Speech API, we transcribe users' audio input into text, which we send to our backend server for further analysis. In the backend, we interface with the Google Cloud Natural Language API, the Google Cloud Text-to-Speech API, and the Web Search API. With the Natural Language API, we analyze each transcription for overall sentiment (positive or negative) and its most relevant entities. The sentiment determines which crowd reaction sounds the frontend client plays, while the entities prove useful for finding images that effectively portray the contents of the transcription. With the Text-to-Speech API, we convert the transcriptions into computer-spoken audio files, and with the Web Search API, we use the most relevant entities to download matching photos. Finally, with all of these resources gathered, we use FFmpeg and several Node.js wrappers to combine the images and audio files into one streamlined, presentation-like video.
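For context, here is a minimal sketch of the frontend transcription step using the browser's Web Speech API. The exact wiring in our Vue.js client differs, the `/api/analyze` route name is purely illustrative, and `SpeechRecognition` support varies by browser:

```javascript
// Minimal browser-side transcription sketch using the Web Speech API.
// Note: SpeechRecognition is still vendor-prefixed in Chromium-based browsers.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.lang = 'en-US';
recognition.interimResults = false;   // only deliver finalized phrases
recognition.continuous = true;        // keep listening while the user tells their story

recognition.onresult = (event) => {
  // Take the latest finalized phrase and send it to the backend for analysis.
  // '/api/analyze' is a placeholder endpoint, not our actual route.
  const transcript = event.results[event.results.length - 1][0].transcript;
  fetch('/api/analyze', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ transcript }),
  });
};

recognition.start();
```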
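And a rough sketch of the backend analysis step with the official Google Cloud Node.js clients (`@google-cloud/language` and `@google-cloud/text-to-speech`). The function name, file name, and entity count here are placeholders, and our Express routing and error handling are omitted:

```javascript
const language = require('@google-cloud/language');
const textToSpeech = require('@google-cloud/text-to-speech');
const fs = require('fs');

const languageClient = new language.LanguageServiceClient();
const ttsClient = new textToSpeech.TextToSpeechClient();

async function analyzeAndNarrate(transcript) {
  const document = { content: transcript, type: 'PLAIN_TEXT' };

  // Overall sentiment decides which crowd reaction sound the frontend plays.
  const [sentimentResult] = await languageClient.analyzeSentiment({ document });
  const sentimentScore = sentimentResult.documentSentiment.score; // -1.0 (negative) .. 1.0 (positive)

  // The most salient entities become search terms for images that illustrate the scene.
  const [entityResult] = await languageClient.analyzeEntities({ document });
  const topEntities = entityResult.entities
    .sort((a, b) => b.salience - a.salience)
    .slice(0, 3)
    .map((e) => e.name);

  // Narrate the transcript with a synthesized voice and save it for the final video.
  const [ttsResponse] = await ttsClient.synthesizeSpeech({
    input: { text: transcript },
    voice: { languageCode: 'en-US', ssmlGender: 'NEUTRAL' },
    audioConfig: { audioEncoding: 'MP3' },
  });
  fs.writeFileSync('narration.mp3', ttsResponse.audioContent, 'binary');

  return { sentimentScore, topEntities };
}
```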
Challenges we ran into
Branding:
Creating the product name and logo design were difficult parts of the project. A good foundation for a project starts with a solid direction, and we had changed the focus of our product from something that just makes crowd sound effects for real-time speech to something that creates an entire frame-by-frame video with text-to-speech audio while still keeping the crowd sound effects. It was difficult to come up with a name that integrated the storytelling aspect, the sitcom-esque inspiration, and the visualization of ideas with pictures and sound effects. After deciding to focus primarily on the creative visualization of one’s imagination, we realized that the word “visualize” could be a potential name. It was short, catchy, could be shortened further (e.g., “vize”), and could easily be turned into a slogan (e.g., “visualize it!” or “vize it!”). That said, it did not quite encompass all of the elements and inspirations of the project. By working from the root “vis-,” we were able to come up with “tell-a-vision.” This name had strong links to our sitcom inspiration as a play on the word “television.” Our project is built on having a user tell a story and providing visuals for that story, so it was a perfect fit. Once we had a focus, it was much easier to work on the other design aspects.
Coding:
Coordinating a front-end client and a back-end server that both communicate with several APIs was very difficult, as none of us had faced a project of this technical scope before. Setup with the Google Cloud APIs was relatively straightforward, but working through the Node.js client documentation was a bit tough. Another challenge was finding effective methods to manipulate and combine audio and image files in the way we had envisioned before we started developing the project. There are not many resources out there, so we turned to FFmpeg, a multimedia processing tool, and the relevant Node.js wrappers; a rough sketch of this step is shown below.
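As an illustration, combining a still image with its narration looks roughly like the following, shown here through fluent-ffmpeg. Whether fluent-ffmpeg is the exact wrapper we used, and the file names, are assumptions for the sketch:

```javascript
// Sketch: turn one still image plus its narration clip into a short video segment.
// fluent-ffmpeg is one common Node.js wrapper around FFmpeg; file names are placeholders.
const ffmpeg = require('fluent-ffmpeg');

ffmpeg()
  .input('scene1.jpg')
  .inputOptions(['-loop 1'])                          // repeat the still image as video frames
  .input('scene1-narration.mp3')
  .outputOptions(['-shortest', '-pix_fmt yuv420p'])   // end the clip when the narration ends
  .size('1280x720')
  .on('end', () => console.log('scene1.mp4 written'))
  .save('scene1.mp4');
```

Segments like this can then be stitched together (for example, with FFmpeg's concat demuxer) into the final slideshow-style video.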
Accomplishments that we're proud of
Bringing all the pieces of our application together was an accomplishment in itself, and one we were very proud to pull off. During development, there were several milestones we were happy to cross, as each marked progress towards our final product. Some of these victories included:
- Creating a stylized logo and look for our application
- Generating transcriptions (if microphone conditions are right!)
- Pulling relevant images that fit within our user’s stories and their transcriptions
- Utilizing several services from Google Cloud, including the Natural Language API and Text-to-Speech API
- Manipulating audio and image files into a final usable video file
What we learned
Due to our team's varying levels of experience with software development and design, each member learned skills outside of their comfort zone. Whether it was gaining better insight into the design process or learning about the more technical aspects of the project, everyone rounded out their knowledge. On the design side of things, we learned about using color palettes that are consistent and considerate. We also learned about using certain colors for certain interactive elements; for example, green reads as a “go” color, while red reads as a “stop” color, which we applied to our “play” and “stop” buttons. Consistency was a big goal for the design team, so the green and red used there are the same colors used in the logo.
What's next for tell-a-vision
With further testing and bug fixes, tell-a-vision can be made more accurate in the images and soundtracks it generates to play along with the transcription. With more time, we could also put together more laugh track sound bites to fit a broader range of emotions and situations. Outside of these minor improvements and further testing with a larger audience, development for tell-a-vision is nearly complete: users can freely create their own unique and exciting stories to share with anyone willing to listen.