Title:
Building a Model to Classify Periodic Time-Series Data
(And using it to generate some cute animations)
Who:
Advay Mansingka (amansin1) Momoka Kobayashi (mkobaya1) Joanna Tasmin (jtasmin)
Introduction
Time series datasets are the foundation of many modern technologies, with applications in weather forecasting, speech, video, dynamical systems, stock prices, and biochemical reactions. Consequently, statisticians have studied time series extensively as a framework for making predictions. While classical statistical methods have been quite successful, deep learning has the potential to make more accurate and robust inferences.
One very commonly used type of time series data in deep learning applications is speech. Companies like Google, Apple, and Amazon have spent billions of dollars building applications like Siri and Alexa, which understand audio signals by converting them to text-based representations before making more complex inferences. While these systems have been wildly successful, converting audio to text loses much of the nuanced information in the original signal. For example, if I were to whisper "hello" or shout "HEELLLLOOOOO" into one of these models, both would simply be encoded as "Hello", erasing information about my pitch, tone, and the inflection and duration of my syllables.
We propose to build a model to convert speech data into image/video data, which can capture the nuance of the input. This would effectively be an automated lip-syncing bot - audio data goes in and an animated character lip-syncing your words comes out.
We will build our model to be agnostic of the fact that our input will be speech, thereby giving it the ability to generalize to any periodic time series dataset.
What kind of problem is this:
This project has three phases:
1. Unsupervised learning to extract common patterns from the time series (such as phonemes from speech)
2. Regression to map patterns in the series to some form of amplitude (for example, identifying how loudly a person says each syllable)
3. Classification from the encoding to an output (for example, going from a stressed syllable in the encoding to the appropriate expression on an animated character)
Related Work:
Paper: https://research.nvidia.com/sites/default/files/publications/karras2017siggraph-paper_0.pdf
Karras et al. present a deep neural network that infers facial animation from audio, producing high-quality animation data for in-game dialogue. The network consists of 10 convolutional layers and two fully connected layers. The training inputs are audio files converted to normalized 16 kHz mono and divided into short intervals with double overlap. After training, the network outputs vectors dictating the movement of a face mesh. The paper outlines an architecture that maps input waveforms to the 3D vertex coordinates of a face model, along with methods that let the network discover variations in the training data (such as emotion) and a three-way loss function that keeps the network responsive to ambiguous data. Testing with audio from speakers of different genders, accents, and languages produced reasonable results, indicating that the model generalizes.
Related papers and resources (a living list we will keep updating):
- https://link.springer.com/chapter/10.1007/978-1-4615-3950-6_10
- https://research.nvidia.com/sites/default/files/publications/karras2017siggraph-paper_0.pdf
- https://arxiv.org/abs/1805.09313
- https://link.springer.com/article/10.1007/s11704-020-0133-7
Data:
The data will mostly be controlled voice recordings of ourselves slowly cycling through common English words, as well as more complex syllable sets. We anticipate that the initial audio recordings will give us on the order of 10^7 datapoints, thanks to the high sampling rates of modern microphones. After some experimentation, we have found that .wav files are ideal for our use, since they give us easy access to the raw waveform for the Fourier network to take as input.
Once we have a baseline for how the model trains and behaves, we will download podcasts and news shows to build a much larger training and validation set. Downloading about 3 to 4 hours of a show such as Last Week Tonight with John Oliver should provide plenty of clean audio for training.
Methodology:
The first step in the model is to split the audio waveform into small input sections. This is effectively all the preprocessing we will need.
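As a concrete illustration of this preprocessing step, here is a minimal sketch that loads a .wav recording and slices it into fixed-size windows. The window and hop sizes are illustrative assumptions rather than tuned values, and it uses NumPy and scipy.io.wavfile in addition to the libraries listed under Built With.

```python
# Minimal preprocessing sketch: load a mono 16-bit .wav file and slice the
# waveform into (optionally overlapping) windows. Window/hop sizes are
# illustrative assumptions only.
import numpy as np
from scipy.io import wavfile

def split_into_windows(path, window_size=2048, hop=1024):
    rate, samples = wavfile.read(path)                    # samples: 1-D int16 array (mono)
    samples = samples.astype(np.float32) / np.iinfo(np.int16).max  # normalize to [-1, 1]
    starts = range(0, len(samples) - window_size + 1, hop)
    windows = np.stack([samples[s:s + window_size] for s in starts])
    return rate, windows                                  # windows: (num_windows, window_size)

# Example: rate, windows = split_into_windows("recording.wav")
```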
Each slice of audio is then fed into our first network: a DNN that approximates the discrete Fourier transform. This provides us with a frequency-spectrum representation of the audio data.
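For reference, the spectrum this network should approximate can be computed directly with NumPy's FFT; a small sketch is below (the Hann window is an assumed detail to reduce spectral leakage, not something we have committed to).

```python
# Reference implementation of the spectrum step using NumPy's FFT, as a
# target for what the Fourier-transform DNN should learn to approximate.
import numpy as np

def magnitude_spectrum(window):
    hann = np.hanning(len(window))              # taper the window to reduce spectral leakage
    return np.abs(np.fft.rfft(window * hann))   # one-sided magnitude spectrum, length window_size // 2 + 1

# Example: spectra = np.array([magnitude_spectrum(w) for w in windows])
```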
Next, the Fourier spectrum is fed into a variational autoencoder (VAE), which learns a reduced-dimensional embedding of the audio data. We hope that each phoneme in the audio will correspond to a distinct Gaussian in the embedding space.
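A minimal Keras sketch of this VAE stage is below. The spectrum size (1025) and latent size (3) are illustrative assumptions rather than final hyperparameters, and the architecture is deliberately small.

```python
# A small Keras sketch of the VAE stage. SPECTRUM_DIM and LATENT_DIM are
# illustrative assumptions, not tuned values.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

SPECTRUM_DIM = 1025   # e.g. window_size // 2 + 1 for a 2048-sample window
LATENT_DIM = 3        # small latent space so phoneme clusters are easy to plot

class Sampler(layers.Layer):
    """Reparameterization trick (z = mean + sigma * eps) plus the KL penalty."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
            1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
        self.add_loss(kl)                                  # added on top of the compile() loss
        eps = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

spectrum_in = keras.Input(shape=(SPECTRUM_DIM,))
h = layers.Dense(256, activation="relu")(spectrum_in)
z_mean = layers.Dense(LATENT_DIM)(h)
z_log_var = layers.Dense(LATENT_DIM)(h)
z = Sampler()([z_mean, z_log_var])
h_dec = layers.Dense(256, activation="relu")(z)
reconstruction = layers.Dense(SPECTRUM_DIM)(h_dec)

vae = keras.Model(spectrum_in, reconstruction)
vae.compile(optimizer="adam", loss="mse")                  # reconstruction loss; KL comes from add_loss
encoder = keras.Model(spectrum_in, z_mean)                 # used later to extract embeddings
# vae.fit(spectra, spectra, epochs=50, batch_size=128)
```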
The embedding gives us a simpler representation of the spectra, which we can then manually label with the corresponding phonemes. The manually labelled data is used to train a simple regression model for audio pitch/loudness, as well as a simple feed-forward classification model that learns the relationship between encodings and phonemes. At each instant in time, the model selects one expression for the animated character out of a set of possibilities, yielding the final result.
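The supervised stage could look something like the following scikit-learn sketch. The arrays below are random placeholders standing in for the manually labelled embeddings, and the model choices (ridge regression, a small MLP classifier) are assumptions rather than settled decisions.

```python
# Sketch of the supervised stage with scikit-learn. The arrays below are
# random placeholders standing in for manually labelled embeddings.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
labelled_embeddings = rng.normal(size=(200, 3))   # 3-D embeddings for 200 windows
phoneme_labels = rng.integers(0, 10, size=200)    # assumed 10 phoneme classes
loudness = rng.uniform(size=200)                  # per-window loudness targets

# Regression: embedding -> loudness (a proxy for how strongly a syllable is voiced)
loudness_model = Ridge().fit(labelled_embeddings, loudness)

# Classification: embedding -> phoneme id, which later selects one of the
# character's mouth shapes / expressions at each instant in time
phoneme_model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)
phoneme_model.fit(labelled_embeddings, phoneme_labels)
```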
Since syllables may be of different lengths, we anticipate that the window size used to split the original audio waveform will have a significant impact on the model. If this does pose a challenge, we propose building an ensemble of models with different input window sizes and letting them vote on the correct output, as sketched below.
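The voting step for such an ensemble could be as simple as a per-timestep majority vote; the sketch below uses hypothetical per-model predictions purely for illustration.

```python
# Sketch of majority voting across models trained with different window
# sizes. Each row of `preds` holds one model's phoneme ids per time step.
import numpy as np

def ensemble_vote(per_model_predictions):
    preds = np.asarray(per_model_predictions, dtype=int)   # (num_models, num_timesteps)
    return np.array([np.bincount(preds[:, t]).argmax()     # most common phoneme id per step
                     for t in range(preds.shape[1])])

# Example with three hypothetical models over five time steps:
# ensemble_vote([[1, 1, 2, 3, 3], [1, 2, 2, 3, 0], [1, 1, 2, 0, 3]]) -> [1, 1, 2, 3, 3]
```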
Metrics:
Since the output of our model is an animation, it is difficult to define a quantitative accuracy metric. It may be more pertinent to run Turing-test-style experiments, in which we show human judges the outputs of our model alongside hand-animated versions and ask them to spot inaccuracies in the lip syncing of each. Comparing the model's number of mistakes per second with that of a manually animated sequence will give us a good baseline for how well the model is performing.
Ethics:
Our dataset is speech data, which has the potential to be incomplete or biased depending on our source. Our source, podcasts, may use a specific type of pronunciation that might not be representative of all pronunciations of the syllable. Although phonemes are (largely) universal and we can try to be exhaustive with which syllables and sounds we include, it may still exclude accents. Our model also concerns the English language right now so it might not apply as directly to non-English languages or less-syllabic languages.
Since our success metrics are not very quantitative, the definition of success for our model depends on our own judgement, which could potentially be biased, particularly if the model learns an "accented" version of a phoneme that we judge as incorrect.
In terms of potentially risky applications, because our algorithm converts audio/speech data into image/video data, there is the potential to mislead people with these videos by falsifying words or speech that was never actually spoken. For example, it would be misleading to create a video of the President or another authority figure speaking about incidents that never happened; this is deeply (haha) linked to deepfakes and the fake news crisis. Depending on how this audio-to-video conversion is used, it could be applied in rather unethical ways.
Division of labor:
Since the models are all deeply (haha) interconnected, and because it will be hard to work on the classification step before the VAE step is done, we will work on the models sequentially rather than in parallel. We have set up a group to help us communicate and have been meeting regularly to make good progress. We will continue to work together, sharing the work equally, to make sure we finish the project we have started.
Challenges and Insights:
One major challenge we faced was training a VAE on the sheer amount of data we have. We tried porting over existing VAE architectures from prior research, as well as from our previous homework assignments. Unfortunately, none of these worked for us: we ended up with low accuracies and long training times.
We overcame this by using a standard autoencoder instead of a VAE. One major advantage of a VAE is its very small latent space, and we wanted to find a way to get a regular autoencoder to achieve that too.
Making the architecture larger led to some gains in performance, but we ran into the vanishing-gradient problem and extremely slow training times. The input data has a feature size of around 1000-2000 (since it is generated by taking a discrete Fourier transform of an audio sample), and we wanted to compress it down to around 2-4 dimensions to make it visually presentable. Architectures with large step-downs between layers performed very poorly in terms of loss, while architectures with smaller step-downs had not converged after almost 10 hours of training.
Eventually, we discovered greedy layer-wise training. This lets us split the autoencoder into multiple parts and train it piece by piece. Since each individual part is a net of depth 4-6, it does not suffer from vanishing gradients, and since the compression within each net is modest, the loss stays low.
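A hedged sketch of what this looks like in Keras is below. The layer sizes (1024 -> 256 -> 64 -> 16 -> 3) and the random placeholder data are illustrative, and each stage here is a single Dense step rather than the depth-4-6 sub-nets we actually use.

```python
# Sketch of greedy layer-wise autoencoder training. Layer sizes and the
# random placeholder data are illustrative; each stage here is one Dense
# step, whereas our real stages are small nets of depth 4-6.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

sizes = [1024, 256, 64, 16, 3]
spectra = np.random.rand(5000, sizes[0]).astype("float32")  # placeholder spectra

codes = spectra
encoders = []
for in_dim, out_dim in zip(sizes[:-1], sizes[1:]):
    # Shallow autoencoder for this stage: in_dim -> out_dim -> in_dim
    inp = keras.Input(shape=(in_dim,))
    code = layers.Dense(out_dim, activation="relu")(inp)
    recon = layers.Dense(in_dim)(code)
    stage = keras.Model(inp, recon)
    stage.compile(optimizer="adam", loss="mse")
    stage.fit(codes, codes, epochs=5, batch_size=128, verbose=0)

    encoder = keras.Model(inp, code)                 # keep only the encoder half
    encoders.append(encoder)
    codes = encoder.predict(codes, verbose=0)        # training data for the next stage

# Stack the trained encoder halves into the full 1024 -> 3 encoder
full_in = keras.Input(shape=(sizes[0],))
x = full_in
for enc in encoders:
    x = enc(x)
full_encoder = keras.Model(full_in, x)
```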
Another challenge is the dynamic time windows for the input data. We could consider an ensemble solution to get around this.
Since we don't have the animation output yet, we are not sure how the model performs overall, but the loss values seem pretty low right now, and the model is able to compress the audio data while maintaining temporal information.
Plans:
We have an encoded representation of our sample audio data! This is very exciting. Analysing it shows that the model has performed as expected, generating clear clusters for different phonemes. We now plan to use some form of semi-supervised learning (we will try the EM algorithm, Gaussian mixture models, and yet another autoencoder) to assign labels to the clusters.
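For the Gaussian-mixture option, scikit-learn makes the clustering step straightforward; the sketch below uses placeholder embeddings and an assumed number of phoneme clusters.

```python
# Sketch of the cluster-labelling step with a Gaussian mixture model.
# The embeddings are placeholders and the number of components (assumed
# phoneme clusters) is a guess we would tune.
import numpy as np
from sklearn.mixture import GaussianMixture

embeddings = np.random.rand(5000, 3)    # placeholder for the encoder's output

gmm = GaussianMixture(n_components=12, covariance_type="full", random_state=0)
cluster_ids = gmm.fit_predict(embeddings)

# A handful of hand-labelled examples per cluster then maps each cluster id
# to a phoneme (e.g. {0: "ah", 1: "oo", ...}), giving labels for all windows.
```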
Finally, we will want to have some correspondence between the clusters and the animation. This will likely be done using Manim and Pygame.
Built With
- keras
- python
- scikit-learn
- tensorflow