Inspired by a team member's desire to study through his courses by listening to his textbook readings recited by his favorite anime characters, functionality that does not exist on any app on the market, we realized that there was an opportunity to build a similar app that would bring about even deeper social impact. Dyslexics, the visually impaired, and those who simply enjoy learning by having their favorite characters read to them (e.g. children, fans of TV series, etc.) would benefit from a highly personalized app.
What it does
Our web app, EduVoicer, allows a user to upload a segment of their favorite template voice audio (only needs to be a few seconds long) and a PDF of a textbook and uses existing Deepfake technology to synthesize the dictation from the textbook using the users' favorite voice. The Deepfake tech relies on a multi-network model trained using transfer learning on hours of voice data. The encoder first generates a fixed embedding of a given voice sample of only a few seconds, which characterizes the unique features of the voice. Then, this embedding is used in conjunction with a seq2seq synthesis network that generates a mel spectrogram based on the text (obtained via optical character recognition from the PDF). Finally, this mel spectrogram is converted into the time-domain via the Wave-RNN vocoder (see this paper for more technical details). Then, the user automatically downloads the .WAV file of his/her favorite voice reading the PDF contents!
How we built it
We combined a number of different APIs and technologies to build this app. For leveraging scalable machine learning and intelligence compute, we heavily relied on the Google Cloud APIs -- including the Google Cloud PDF-to-text API, Google Cloud Compute Engine VMs, and Google Cloud Storage; for the deep learning techniques, we mainly relied on existing Deepfake code written for Python and Tensorflow (see Github repo here, which is a fork). For web server functionality, we relied on Python's Flask module, the Python standard library, HTML, and CSS. In the end, we pieced together the web server with Google Cloud Platform (GCP) via the GCP API, utilizing Google Cloud Storage buckets to store and manage the data the app would be manipulating.
Challenges we ran into
Some of the greatest difficulties were encountered in the superficially simplest implementations. For example, the front-end initially seemed trivial (what's more to it than a page with two upload buttons?), but many of the intricacies associated with communicating with Google Cloud meant that we had to spend multiple hours creating even a landing page with just drag-and-drop and upload functionality. On the backend, 10 excruciating hours were spent attempting (successfully) to integrate existing Deepfake/Voice-cloning code with the Google Cloud Platform. Many mistakes were made, and in the process, there was much learning.
Accomplishments that we're proud of
We're immensely proud of piecing all of these disparate components together quickly and managing to arrive at a functioning build. What started out as merely an idea manifested itself into usable app within hours.
What we learned
We learned today that sometimes the seemingly simplest things (dealing with python/CUDA versions for hours) can be the greatest barriers to building something that could be socially impactful. We also realized the value of well-developed, well-documented APIs (e.g. Google Cloud Platform) for programmers who want to create great products.
What's next for EduVoicer
EduVoicer still has a long way to go before it could gain users. Our first next step is to implementing functionality, possibly with some image segmentation techniques, to decide what parts of the PDF should be scanned; this way, tables and charts could be intelligently discarded (or, even better, referenced throughout the audio dictation). The app is also not robust enough to handle large multi-page PDF files; the preliminary app was designed as a minimum viable product, only including enough to process a single-page PDF. Thus, we plan on ways of both increasing efficiency (time-wise) and scaling the app by splitting up PDFs into fragments, processing them in parallel, and returning the output to the user after collating individual text-to-speech outputs. In the same vein, the voice cloning algorithm was restricted by length of input text, so this is an area we seek to scale and parallelize in the future. Finally, we are thinking of using some caching mechanisms server-side to reduce waiting time for the output audio file.