Learning a new language can be challenging. As technology becomes more widely available, we are more connected and more open, and we encounter new languages through many channels: books, videos, and people. Still, learning to speak a new language is difficult, particularly when it requires pronouncing sounds that do not exist in your native language. Chinese is hard for beginners because many of its sounds are very similar to one another. We experienced this struggle first-hand when we set out to learn Chinese: we could not reliably tell one sound from the others. This inspired us to develop a tool that helps us learn to pronounce each sound correctly, which we call CCC (Chinese Consonant Classification). We also chose Chinese because a lot of data and resources are available for it. We hope that CCC can be a starting point for building similar tools for other languages in the future.
2. What it does
Given a user’s pronunciation of a Mandarin consonant, CCC decides which of 8 Chinese consonants the recording is closest to. The eight consonants CCC uses are c, ch, q, s, sh, x, z, and zh. The application gives the user up to 1.25 seconds to speak. With this feedback, the user can learn over time to pronounce Mandarin sounds accurately.
3. How we built it
Under the hood, Chinese Consonant Classification (CCC) is a deep convolutional neural network (deep CNN) that performs multi-class Chinese consonant recognition. This section goes over the network layers and their specifications.
I. Convolutional Layer I: The convolutional part of CCC consists of three convolutional layers. The first layer, referred to as CONV1, applies kernels of size 3 to an input tensor with 1 channel. The kernels move vertically and horizontally 1 step at a time, and no padding is applied. The layer yields an output tensor with 16 channels.
II. Convolutional Layer II and Convolutional Layer III: These layers have the same specification as CONV1, except that in_channel = 16 so that they can handle the 16 channels of their input tensor.
III. Max-Pooling Layer: The pooling layer has a kernel of size 2 that moves 2 steps at a time, with no padding. Within each 2x2 region, the maximum of the 4 values is taken.
IV. Fully-connected Layer: The layer consists of 1 input layer and 1 output layer. The number of input nodes equals the number of values obtained by flattening the pooled tensor into one dimension. There are 8 output nodes, one for each of the 8 Chinese consonants to be classified.
When fed to the network, an audio file is represented as a two-dimensional tensor of depth 1. After each convolutional layer, we apply the ReLU function to the output tensor and then regularize with dropout, which drops some neurons in the network. The depth changes from 1 to 16 because of the convolutional specification. Then, we reduce the tensor’s dimensions by taking the max over 2x2 regions and flatten the resulting 3-dimensional tensor into 1 dimension before it is fed into the fully connected layer. The final product is 8 numbers at the output layer. The model decides on the consonant by returning the class associated with the largest output node.
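To make the layer arithmetic above concrete, here is a minimal sketch in plain Python (not the PyTorch code used for CCC) that traces the tensor shape through three unpadded convolutions with kernel size 3 and stride 1, followed by one 2x2 max-pooling step with stride 2. The input size of 13 coefficients by 54 frames is an illustrative assumption, as is the reading that pooling is applied once after the third convolution; the write-up does not state the actual frame count.

```python
def window_out(size, kernel, stride=1, padding=0):
    """Output length along one axis for a sliding window (conv or pooling)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height, width, channels=16, n_conv=3):
    """Return (stage, shape) pairs for the stack described above.

    Assumes n_conv unpadded 3x3 convolutions with stride 1, then one
    2x2 max-pool with stride 2, then flattening.
    """
    shapes = [("input", (1, height, width))]
    for i in range(1, n_conv + 1):
        height = window_out(height, kernel=3)
        width = window_out(width, kernel=3)
        shapes.append((f"CONV{i}", (channels, height, width)))
    height = window_out(height, kernel=2, stride=2)
    width = window_out(width, kernel=2, stride=2)
    shapes.append(("max-pool", (channels, height, width)))
    shapes.append(("flatten", (channels * height * width,)))
    return shapes


# Hypothetical input: 13 MFCC coefficients by 54 time frames.
for stage, shape in trace_shapes(13, 54):
    print(stage, shape)
```

The last entry gives the number of input nodes the fully connected layer would need for that input size, which is why every input must share the same width.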
The implementation of the model closely follows several tutorials on PyTorch and CNNs.
Model Optimization: Besides opting for conventional Adam optimization, we experimented with several combinations of hyperparameters to improve the performance of the model during training. We focused on (a) learning rate, (b) weight decay rate, and (c) dropout regularization. Using a grid search, we tried combinations of the values below on Google Cloud GPUs:
Learning Rate: 1e-4, 1e-5, 1e-6
Weight Decay: 1e-3, 1e-4, 1e-5
Dropout: 0, 0.1, 0.5, 0.75, 0.8, 0.9
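As a rough illustration of the grid search over these values, the sketch below enumerates every combination with `itertools.product` and keeps the one with the best score. The `train_and_validate` callable is hypothetical: it stands in for a full training run that returns validation accuracy. The full grid has 54 combinations; the write-up does not say whether all of them were run.

```python
from itertools import product

LEARNING_RATES = [1e-4, 1e-5, 1e-6]
WEIGHT_DECAYS = [1e-3, 1e-4, 1e-5]
DROPOUTS = [0, 0.1, 0.5, 0.75, 0.8, 0.9]


def grid():
    """Yield one config dict per hyperparameter combination."""
    for lr, wd, p in product(LEARNING_RATES, WEIGHT_DECAYS, DROPOUTS):
        yield {"lr": lr, "weight_decay": wd, "dropout": p}


def grid_search(train_and_validate):
    """Return (best_config, best_score).

    train_and_validate is a hypothetical callable that trains a model
    with the given config and returns its validation accuracy.
    """
    best_config, best_score = None, float("-inf")
    for config in grid():
        score = train_and_validate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


print(len(list(grid())), "combinations in the full grid")
```

Because each call to `train_and_validate` is a complete training run, the cost grows with the product of the list lengths, which is what makes this search slow even on GPUs.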
Early stopping has also been implemented to prevent overfitting. The idea is that if the validation loss rises more than X% above the smallest loss recorded up to that epoch, training is stopped. We tried both 1% and 5% and found that 1% is more suitable for our model.
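The early stopping rule can be sketched as follows. This is one reading of the criterion, assuming the comparison is against the smallest validation loss seen so far and that losses are positive; the function names and the sample loss values are illustrative, not taken from the CCC code.

```python
def should_stop(val_losses, tolerance=0.01):
    """True if the latest validation loss exceeds the best loss so far
    by more than `tolerance` (0.01 means 1%). Assumes positive losses."""
    if len(val_losses) < 2:
        return False
    best = min(val_losses)
    return val_losses[-1] > best * (1 + tolerance)


def stopping_epoch(val_losses, tolerance=0.01):
    """Return the 1-indexed epoch at which training would stop, or None."""
    for epoch in range(1, len(val_losses) + 1):
        if should_stop(val_losses[:epoch], tolerance):
            return epoch
    return None


# Made-up validation losses, one per epoch.
history = [0.90, 0.70, 0.65, 0.652, 0.66]
print(stopping_epoch(history, tolerance=0.01))
print(stopping_epoch(history, tolerance=0.05))
```

A tighter tolerance such as 1% stops on a smaller rebound in the loss, while 5% lets training continue through the same bump.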
Training Data: All the data used for model training comes from the Tone Perfect collection provided by a group of researchers at Michigan State University. The collection comprises every monosyllabic sound in Mandarin Chinese in all four tones, spoken by three male and three female native speakers. There are 9,860 audio files in total. We use all variations of the ‘c’, ‘ch’, ‘q’, ‘s’, ‘sh’, ‘x’, ‘z’, and ‘zh’ sounds in our training, which amount to 5,508 audio files.
Data Augmentation: The sound collection is professionally recorded. In each file, the consonant starts almost immediately after the audio begins, and the sound ends right before the file ends. The audio files are about equal in length, and there is no background noise. On the one hand, this lets the model learn the core characteristics of each consonant. On the other hand, it makes the model sensitive to variation in the user’s voice input, which is unlikely to follow the pattern of this standardized training data: users may start speaking some time after they hit record, or there may be loud background noise. We limit the length of a recording to 1.25 seconds, as close to the sound collection as possible while leaving some flexibility for users. Moreover, to handle noise, we implemented data augmentation by adding random background noise to copies of some sampled training data.
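One simple way to realize this kind of augmentation is sketched below in plain Python. It is an assumption-laden illustration rather than the CCC pipeline: the noise is white Gaussian noise at a chosen signal-to-noise ratio (the write-up does not say what kind of background noise was used), and the `fraction`, `snr_db`, and `seed` parameters are hypothetical.

```python
import math
import random


def add_noise(samples, snr_db, rng):
    """Return a copy of `samples` with white Gaussian noise added so that
    the signal-to-noise ratio is approximately `snr_db` decibels."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def augment(dataset, fraction=0.5, snr_db=10.0, seed=0):
    """Keep every original (samples, label) pair and append noisy copies
    of a randomly chosen fraction of them."""
    rng = random.Random(seed)
    n_copies = int(len(dataset) * fraction)
    chosen = rng.sample(dataset, n_copies)
    noisy = [(add_noise(x, snr_db, rng), label) for x, label in chosen]
    return dataset + noisy


# Toy data: a one-second 440 Hz tone standing in for a recording.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
data = [(tone, "zh"), (tone, "x"), (tone, "s"), (tone, "c")]
print(len(augment(data)), "examples after augmentation")
```

Keeping the clean originals alongside the noisy copies lets the model still see the clean consonant while also learning to tolerate noise.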
Audio Input Representation: The input to this model is audio files, which must be converted from sound waves audible to the human ear into a numerical representation that computers can process. We use Librosa, a Python audio package, to extract a feature that is widely regarded as key to sound recognition: Mel-frequency cepstral coefficients (MFCCs).
MFCCs are a means of capturing the distinctive short-time power spectrum of a consonant. The feature is widely used in speech recognition. An audio signal is sliced into short time frames, and the power spectrum of each frame is calculated. The spectrum is then transformed into a numerical representation called coefficients, of which only some are kept for the analysis. We use the default setting, keeping 13 coefficients, in the analysis for CCC. After this processing, we end up with a two-dimensional input: one dimension is time and the other is the 13 coefficients.
Since the audio files do not all have exactly the same length, processing them with the same time frame length produces different numbers of frames. This is a problem because the network cannot handle inputs of unequal size. We therefore take the maximum number of columns across all inputs as the input width of the network. Inputs with fewer columns are padded with 0’s at the end of every row until they have that many columns.
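The padding step can be sketched in plain Python as below, treating each MFCC matrix as a list of rows (one row per coefficient, one column per frame). The function name and the optional `width` argument are illustrative assumptions; with NumPy arrays, `numpy.pad` would do the same job.

```python
def pad_features(matrices, width=None):
    """Right-pad every row of every matrix with zeros to a common width.

    If `width` is None, the largest column count in `matrices` is used.
    Returns (padded_matrices, width).
    """
    if width is None:
        width = max(len(row) for m in matrices for row in m)
    padded = []
    for m in matrices:
        new_rows = []
        for row in m:
            if len(row) > width:
                raise ValueError("row is longer than the target width")
            new_rows.append(list(row) + [0.0] * (width - len(row)))
        padded.append(new_rows)
    return padded, width


# Two toy "MFCC" matrices with 2 rows each (13 in the real setting).
short = [[1.0, 2.0], [3.0, 4.0]]
long_ = [[5.0, 6.0, 7.0, 8.0], [9.0, 1.0, 2.0, 3.0]]
padded, width = pad_features([short, long_])
print(width, padded[0])
```

Returning the width makes it possible to reuse the same target size later, for example when padding a new recording to match the training inputs.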
4. Challenges we ran into
We ran into the following challenges.
Model Optimization: The first problem was a classic one: it was difficult to pinpoint which values the hyperparameters should take. The only way to optimize seemed to be extensive trial and error. We arranged a group of promising hyperparameter values into several combinations and trained the model with each. The combination with the highest validation accuracy after training was chosen as the prototype of CCC. Still, it was impossible to know whether the chosen combination was the best.
Long Training Time: The second challenge follows directly from the first. Deep learning algorithms are notorious for long training times. The time spent developing the model spiked when we had to try tens of hyperparameter combinations, even on cloud GPUs.
Quick Overfitting: This is another classic issue that we ran into. We initially set up a complex network in the hope that it would help the model learn to recognize the consonants faster. However, the model reached a very high validation accuracy within the first 10 epochs and overfitted rather quickly. We adjusted the model structure and implemented regularization and early stopping to mitigate the issue somewhat.
No Control Over Audio Input: As mentioned earlier, the model was sensitive to noisy data because of the way it was trained. We used data augmentation to add noise to some of the training data. Although this seemed to help the model a bit, there were still many issues with audio input that suppressed the model’s performance. Our current model and the software API did not have much control over the user’s input, but we tried to account for as many possibilities as we could.
5. Accomplishments that we're proud of
Most Chinese speech recognition projects that we have seen focus on the tones of the language. This is understandable because these applications are designed from the standpoint of developers whose native languages are non-tonal. However, for us as Thai learners of Mandarin, the most difficult part is not the tones but the pronunciation of each consonant. Each Thai syllable is either distinctly or identically pronounced (despite the fact that there are 44 Thai characters), so it is hard for us to hear how the pronunciations of, for instance, ‘x’ and ‘s’ differ. We want this consonant classification to be a useful resource for people who share the same struggle. We hope that this project will draw more attention from developers to this challenge and inspire them to apply the idea to other languages as well.
6. What we learned
If we had not participated in this hackathon, we would not have been aware of the abundant resources and Python packages that facilitate neural network implementation. Just as important as using these resources was the hands-on experience we gained in implementing a CNN. Trial and error was part of our development process: we tried to optimize several models with the different parameters discussed in Sections 3 and 4.
Apart from the technical skills that we acquired while working on this project, what we cannot take for granted is the importance of teamwork. Communication is key to being efficient. Organizing and documenting the project was crucial. The challenges of implementation taught us to be critical in our thought process and to develop a mindset for coding and debugging.
7. What's next for Chinese Sound Correction
There are many ideas for improving CCC:
1. Given the time constraint, we did not get a chance to experiment with the network structure. The first thing we would do to improve the performance of CCC is train differently structured models. We would also like to try different hyperparameter combinations and apply other methods of regularization.
2. As of now, our model outputs the consonant closest to the audio input. What our application lacks is detailed feedback for users. We want it to show a percentage comparing the user’s input to each of the 8 sounds. For example, we want to display “your pronunciation is analyzed to be 40% c sound, 20% ch sound, 5% q, 15% s, 5% sh, 10% x, 4% z, and 1% zh”.
3. CCC currently does not have a user-friendly interface, and we recognize that a mobile application is a very convenient way for users to learn new languages. We want to give users a great experience by providing the feedback described in item 2 along with tongue position diagrams for each sound. This way, learners can practice their pronunciation and get accurate, useful results.
4. The most ambitious goal of CCC is to classify all Chinese sounds so that our application can be a one-stop service for those who want to learn Chinese pronunciation.
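The percentage feedback proposed in the list above could be derived from the 8 output values with a softmax, as in the following sketch. It assumes the output layer produces raw, unnormalized scores (logits), which the write-up does not confirm, and the example scores are made up. The resulting percentages reflect the model’s confidence, which is not necessarily a calibrated measure of how close a pronunciation is to each sound.

```python
import math

CONSONANTS = ["c", "ch", "q", "s", "sh", "x", "z", "zh"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return the consonant with the largest output value."""
    return CONSONANTS[logits.index(max(logits))]


def feedback(logits):
    """Format per-consonant percentages, highest first."""
    probs = softmax(logits)
    ranked = sorted(zip(CONSONANTS, probs), key=lambda kv: kv[1], reverse=True)
    return ", ".join(f"{p:.0%} {name}" for name, p in ranked)


# Made-up output values for one recording.
example = [2.1, 1.4, 0.0, 1.1, 0.0, 0.7, -0.2, -1.6]
print(predict(example))
print(feedback(example))
```

Since softmax preserves the ordering of the outputs, the top-ranked consonant in the feedback is always the same class the current model already returns.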
Catherine Ryu, Mandarin Tone Perception & Production Team, and Michigan State University Libraries. Tone Perfect: Multimodal Database for Mandarin Chinese. Accessed 1 January 2019. [https://tone.lib.msu.edu/]