GitHub Repository: https://github.com/imri554/dlfinalproject
Final Presentation: https://drive.google.com/file/d/1R5IPWzFX9lQfGc8ipIxbtIZLy-nCRNBo/view?usp=sharing
Final Reflection: https://docs.google.com/document/d/1ieLNQkuEPY439PSgkNoXuzGcnlSpy5p2lmZJjCOwxLA/edit?usp=sharing
First Reflection: https://docs.google.com/document/d/1Qv7630oF79VyQygd-Z4T2SxyOneCR27gVTQ0X2k4TlU/edit?usp=sharing
The outline that you submit/write-up to Devpost should contain the following:

Title: Summarizes the main idea of your project.
Improving audio file comprehension

Who: Names and logins of all your group members.
- Imri Haggin (ihaggin)
- Jude McCutcheon (cmccutc2)

Introduction: What problem are you trying to solve and why? If you are implementing an existing paper, describe the paper’s objectives and why you chose this paper.
Objectives: Labeled data is not always available, and current speech recognition systems require thousands of hours of transcribed speech to reach acceptable performance. The paper describes a semi-supervised learning process that trains on partially labeled datasets. It presents a framework in which a convolutional neural network encodes the speech audio and a transformer network then processes the encoded output.
If you are doing something new, detail how you arrived at this topic and what motivated you.
What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? Etc.
This is both a classification problem and a form of unsupervised learning (semi-supervised, or partly self-supervised).

Related Work: Are you aware of any, or is there any prior work that you drew on to do your project? Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching.
Semi-Supervised Classification for Natural Language Processing (2014)
This 2014 paper explores using semi-supervised learning to classify NLP datasets, which is the main task of the 2006 paper we are hoping to emulate. Supervised learning often requires a disproportionate amount of time and labeled data, while unsupervised models often need a larger amount of data to reach the same accuracy, so there is a tradeoff between the two. The 2014 paper strikes a balance in the form of semi-supervised classification, which the 2006 paper applies to speech recognition on audio files.
In this section, also include URLs to any public implementations you find of the paper you’re trying to implement.
Please keep this as a “living list”: if you stumble across a new implementation later down the line, add it to this list.
- https://github.com/facebookresearch/fairseq
- https://github.com/huggingface/transformers

Data: What data are you using (if any)? If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it). How big is it? Will you need to do significant preprocessing?
- Labeled: Libri-light, 10 h of labeled speech
- Unlabeled: CSTR VCTK, unlabeled speech data from 110 English speakers with varying accents, 400 sentences each
The speech data is not especially rich or large. All recordings are 16-bit, which makes them significantly smaller than photo data. We will have to do some preprocessing, but not much, as the two datasets are mostly ready for use.

Methodology: What is the architecture of your model? How are you training the model?
- Pre-training: models will be implemented in fairseq
- Masking: masks spans of the latent speech representations produced by the feature encoder, similar to masked language modeling
- Feature encoder: temporal convolutions with predefined strides and kernel widths, resulting in an encoder output frequency of 49 Hz and a stride of 20 ms between each sample
- Dropout
- Transformer
- Adam optimizer
If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here.
The quantization module appears to be the most challenging part. We have not yet studied it thoroughly, but we are excited to learn how it contributes to the efficiency of this approach.
If you are doing something new, justify your design. Also note some backup ideas you may have to experiment with if you run into issues.

Metrics: What constitutes “success?” What experiments do you plan to run?
We plan to run the model on these new datasets. The closer we get to the accuracy achieved by the original paper, the more successful we will be.
For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?
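Returning to the feature encoder and masking items listed under Methodology: the following is a minimal Python sketch, not the project's implementation, of how a stack of temporal convolutions could yield about 49 frames per second at a 20 ms stride, and of span masking over those frames. The layer shapes in `ASSUMED_CONV_LAYERS`, the 16 kHz sample rate, and the masking parameters `start_prob` and `span` are all assumptions chosen for illustration; none of them is stated in this write-up.

```python
import random

# Assumed (kernel width, stride) per convolution layer; illustrative only.
ASSUMED_CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
ASSUMED_SAMPLE_RATE = 16_000  # samples per second; also an assumption


def encoder_frames(num_samples, layers=ASSUMED_CONV_LAYERS):
    """Count output frames after a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in layers:
        if length < kernel:
            return 0  # input too short to produce any frame
        length = (length - kernel) // stride + 1
    return length


def total_stride(layers=ASSUMED_CONV_LAYERS):
    """Product of all layer strides: input samples between adjacent frames."""
    product = 1
    for _, stride in layers:
        product *= stride
    return product


def span_mask(num_frames, start_prob=0.065, span=10, seed=0):
    """Choose each frame as a span start with probability start_prob,
    then mask `span` consecutive frames from every chosen start.
    Spans may overlap and are clipped at the end of the sequence."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for i in range(start, min(start + span, num_frames)):
                mask[i] = True
    return mask


frames_per_second = encoder_frames(ASSUMED_SAMPLE_RATE)
stride_ms = 1000 * total_stride() / ASSUMED_SAMPLE_RATE
mask = span_mask(frames_per_second)
print(frames_per_second, stride_ms, sum(mask))
```

Under these assumed shapes, one second of audio yields 49 frames and the product of the strides corresponds to 20 ms, which matches the figures quoted in the Methodology. Different layer shapes or a different sample rate would give different numbers.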
Accuracy does apply to our project, because the model’s job is to assign written language to a given audio file, and this can be done correctly or incorrectly.
If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model.
The authors of the paper were hoping to show that a deep learning model pre-trained on speech audio alone and then fine-tuned on transcribed (labeled) speech could outperform the best semi-supervised methods while being conceptually simpler. They quantified their accuracy by comparing the model's output for each recording with its existing label.
If you are doing something new, explain how you will assess your model’s performance.
What are your base, target, and stretch goals?
- Base: 50% of the original model's accuracy
- Target: 90% of the original model's accuracy
- Stretch: exceed the original model's accuracy

Ethics: Choose 2 of the following bullet points to discuss; not all questions will be relevant to all projects so try to pick questions where there’s interesting engagement with your project. (Remember that there’s not necessarily an ethical/unethical binary; rather, we want to encourage you to think critically about your problem setup.)
What broader societal issues are relevant to your chosen problem space?
Why is Deep Learning a good approach to this problem?
What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?
Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?
Essentially all members of society! Our speech is digitized regularly, and any attempt to transcribe it regularly and systematically is bound to have significant consequences for our privacy.
Additionally, our algorithm has the potential to grant access and agency to those who may not be able to make use of sound information themselves. Although these problems already have solutions, adjustments to the way this transcription takes place are still important.
How are you planning to quantify or measure error or success? What implications does your quantification have?
We plan to quantify and measure our errors in the typical way, by computing the percentage of correctly labeled examples. Even though this is a semi-supervised method, we will only be able to measure accuracy on the labeled examples. We are not aware of any wider implications of our success measurements beyond an understanding of the model's potential accuracy.
Add your own: if there is an issue about your algorithm you would like to discuss or explain further, feel free to do so.

Division of labor: Briefly outline who will be responsible for which part(s) of the project.
To be determined! We still need to develop a better understanding of the technical requirements of the implementation.

Challenges we ran into
Accomplishments that we're proud of
What we learned
What's next for improving wavelength comprehension
Built With
- tensorflow