Inspiration

Albert has worked as a documentary linguist and is particularly interested in severely under-resourced languages. He has long been frustrated by the lack of a user-friendly, semi-automated interface for documenting and transcribing these often-endangered languages. He has worked with tools like ELAN and FLEx in the past, but was always left unsatisfied with their poor user interfaces and the need for painstaking manual review.

What it does

QuickGloss aims to clear the roadblocks described above through semi-automation. It has three main features: speech-to-text, manual glossing, and, our centerpiece, automated glossing. The speech-to-text feature takes an audio file of native-speaker material and transcribes it automatically, then prompts the user to gloss the transcribed text. The user uploads a corpus of morphemes, including as many or as few properties as they would like, in a custom list format. Our program automatically detects which language is being used. The algorithm then matches the morphemes in the corpus against those in the provided sentences and marks them up using standard Leipzig glossing notation. Finally, for morphemes not found in the corpus, it uses predictive ML techniques to extrapolate likely glosses based on morphological patterns in the language.
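
As a rough illustration of the dictionary-matching step, the sketch below glosses words that are already segmented with hyphens and returns two aligned lines in Leipzig style. The lexicon contents, the function names, and the `?` placeholder for unmatched morphemes are assumptions made for this example, not QuickGloss's actual code or upload format. In the real tool, the unmatched morphemes are the ones that would be handed to the predictive step.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon: morpheme form -> gloss. Grammatical glosses are
# upper-case abbreviations and lexical glosses are lower-case.
LEXICON: Dict[str, str] = {
    "ev": "house",
    "ler": "PL",
    "im": "1SG.POSS",
    "de": "LOC",
}

UNKNOWN = "?"  # placeholder for morphemes missing from the lexicon (assumption)


def gloss_word(segmented: str, lexicon: Dict[str, str]) -> Tuple[str, List[str]]:
    """Gloss one hyphen-segmented word.

    Returns the hyphen-joined gloss and the list of unmatched morphemes.
    """
    glosses: List[str] = []
    unknown: List[str] = []
    for morpheme in segmented.split("-"):
        gloss = lexicon.get(morpheme.lower())
        if gloss is None:
            unknown.append(morpheme)
            gloss = UNKNOWN
        glosses.append(gloss)
    return "-".join(glosses), unknown


def gloss_sentence(sentence: str, lexicon: Dict[str, str]) -> Tuple[str, str, List[str]]:
    """Return (object line, gloss line, unmatched morphemes) for a sentence.

    Each word is padded so that it lines up with its gloss, word by word.
    """
    words = sentence.split()
    gloss_words: List[str] = []
    unknown: List[str] = []
    for word in words:
        gloss, missing = gloss_word(word, lexicon)
        gloss_words.append(gloss)
        unknown.extend(missing)
    widths = [max(len(w), len(g)) for w, g in zip(words, gloss_words)]
    object_line = " ".join(w.ljust(n) for w, n in zip(words, widths)).rstrip()
    gloss_line = " ".join(g.ljust(n) for g, n in zip(gloss_words, widths)).rstrip()
    return object_line, gloss_line, unknown


if __name__ == "__main__":
    top, bottom, missing = gloss_sentence("ev-ler-im-de gel-di", LEXICON)
    print(top)
    print(bottom)
    print("Not in lexicon:", missing)
```

Returning the unmatched morphemes separately, instead of only printing `?`, keeps the lookup step independent of whatever model later proposes glosses for them.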

The manual glossing feature is built for under-resourced languages that have little to no readily available online data for training an ML model. The user simply uploads their desired text, morphemes, and meanings, and the program matches the words and morphemes without machine learning. We would like to add a trainable ML model for these languages, but that was out of reach within the scope of the hackathon.
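
One simple way to match morphemes without machine learning is greedy longest-match segmentation against the uploaded morpheme list. The sketch below shows that technique only; it is an assumption about how such matching could work, not a description of QuickGloss's implementation, and the `segment` function and its sample inputs are hypothetical.

```python
from typing import Iterable, List


def segment(word: str, morphemes: Iterable[str]) -> List[str]:
    """Split a word into known morphemes, scanning left to right.

    At each position the longest known morpheme wins. Characters that match
    nothing are collected into a single unknown chunk so no input is lost.
    """
    known = {m for m in morphemes if m}
    longest = max((len(m) for m in known), default=0)
    pieces: List[str] = []
    residue = ""  # run of characters not covered by any known morpheme
    i = 0
    while i < len(word):
        match = None
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(longest, len(word) - i), 0, -1):
            candidate = word[i:i + size]
            if candidate in known:
                match = candidate
                break
        if match is None:
            residue += word[i]
            i += 1
            continue
        if residue:
            pieces.append(residue)
            residue = ""
        pieces.append(match)
        i += len(match)
    if residue:
        pieces.append(residue)
    return pieces


if __name__ == "__main__":
    print(segment("evlerimde", ["ev", "ler", "im", "de"]))
```

Greedy matching is fast and predictable, but it can pick the wrong split when one morpheme is a prefix of another, so a human reviewer should still confirm the segmentation.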

How we built it

The backend is built with Python (Flask), which handles the majority of the algorithms and underlying logic. We used Whisper, an open-source speech recognition model, for speech-to-text, and spaCy, a widely used NLP library, for the machine learning components.

The frontend is built using HTML/CSS and JavaScript.

Challenges we ran into

None of us had worked with Flask before, but we saw it as an opportunity to learn a new framework. There was certainly a learning curve, which resulted in some friction between backend and frontend development. On numerous occasions, we had to adjust certain aspects of the site to resolve this friction. Once we learned how routes coordinate the two together, it became simpler. We also spent way too much time researching APIs or glossing dictionaries, a task which could have (and should have!) been done before the hackathon date.

Accomplishments that we're proud of

We're proud of the accuracy of our model. Going in, we didn't really have high expectations for what our necessarily rudimentary model was going to output, but we were pleasantly surprised with the result. While the actual segmentation has a few errors here and there (as is inevitable for predictive models), it is, for the most part, accurate.

What we learned

Our team consists of programmers, who don't know that much linguistics, and linguists, who don't know that much programming. Each learned from the other; our programmers took home a number of linguistic terms and concepts, while our linguist learned a bit about how to integrate his experience computationally.

What's next for QuickGloss

Looking ahead, we plan to add more accommodations for severely under-resourced languages. These include developing a user-friendly interface for uploading parallel corpora or pre-tagged texts to bootstrap language-specific models, and integrating a training interface for linguists to “teach” the system over time. We would also like to add crowdsourcing, allowing native speakers to directly contribute glosses and correct errors.

We also plan to integrate front-end frameworks like React or Vue.js to make the interface more modular and scalable. This would enable more advanced features like real-time collaboration or integration with glossing libraries and dictionaries through API calls.
