Inspiration

Nathan's honors thesis is about incorporating morphological segmentation into tokenization. To do this, he needed a tool that could split words into morphemes. This problem turned out to be far less solved than he expected: few well-performing models exist, and those that do are often buried in poorly documented GitHub repositories. We attempt to solve this by bringing one of the most accurate and computationally efficient models, tü-seg, into the widely used spaCy NLP and PyPI ecosystems.

What it does

The problem of morpheme segmentation is as follows: given a word, what are its morphemes?

Morphemes are the smallest meaningful units of language. For example, segmenting the word "morphemes" would look something like ["morph","eme","s"]. There are two types of morpheme segmentation: surface and canonical. We focus on canonical segmentation, as it is more linguistically meaningful: it undoes the spelling changes that occur when morphemes combine, so each morpheme appears in its underlying form. For example, while a surface segmentation of "manliness" might be ["man","li","ness"], a canonical segmentation would be ["man","ly","ness"], allowing the "li" in "manliness" to be counted as an occurrence of "ly", as it should be. This is useful for many linguistic and NLP analyses of text, since it makes it easier to determine the meaningful features that morphemes contribute to words.
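As a small illustration of the difference, the sketch below uses the "manliness" segmentations from the paragraph above. The helper name `is_surface_segmentation` is hypothetical and not part of MorphSeg or tü-seg; it relies only on the observation that a surface segmentation concatenates back to the original word, while a canonical one may not.

```python
from collections import Counter


def is_surface_segmentation(word, morphemes):
    """Return True if the morphemes concatenate back to the word exactly."""
    return "".join(morphemes) == word


word = "manliness"
surface = ["man", "li", "ness"]
canonical = ["man", "ly", "ness"]

# Surface segments are literal substrings of the word, so the join matches.
surface_check = is_surface_segmentation(word, surface)
# The canonical form restores "ly", so the join no longer matches the word.
canonical_check = is_surface_segmentation(word, canonical)

# Counting over canonical segmentations groups the "li" of "manliness" with "ly".
# ["friend", "ly"] is an extra hand-written example, not model output.
counts = Counter(canonical + ["friend", "ly"])
print(surface_check, canonical_check, counts["ly"])
```

With surface segmentations, "li" and "ly" would be counted as two unrelated units; the canonical form lets both words contribute to the same "ly" count.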

Our project makes the tü-seg morpheme segmentation model more easily available through a web GUI, a spaCy integration, and a Python package on PyPI. Tü-seg is a small (5-50 MB), computationally efficient sequence-labelling LSTM (deep learning) model that achieved a 96% F1 score on the SIGMORPHON 2022 morpheme segmentation benchmark.
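For readers unfamiliar with the metric, here is a minimal sketch of a morpheme-level F1 computation. It assumes predicted and gold morphemes are compared as multisets for a single word; the official SIGMORPHON 2022 scorer may differ in its details, and the function name `morpheme_f1` is hypothetical.

```python
from collections import Counter


def morpheme_f1(predicted, gold):
    """Compute precision, recall, and F1 over morphemes treated as multisets."""
    # Multiset intersection: each gold morpheme can be matched at most once.
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A surface-style prediction scored against a canonical gold segmentation.
precision, recall, f1 = morpheme_f1(["man", "li", "ness"], ["man", "ly", "ness"])
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example two of the three predicted morphemes match the gold segmentation, so precision, recall, and F1 all come out to 2/3.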

How we built it

We used code from the tü-seg research paper to train the model, then did two things with it. First, we created a Python package (MorphSeg), published on PyPI and integrated as a plugin for spaCy, a very popular Python library/ecosystem for NLP. Second, we built a web GUI on top of it so users can see the morpheme segmentations of their sentences. The website frontend was built with Svelte, pixel art, and a lot of CSS animations. It calls a FastAPI lambda function that serves the MorphSeg library's segmentation methods.

Challenges we ran into

Cynthia: Svelte Hell

Alexis: Svelte Hell

Taoran: Svelte Hell

Donnie: I spent most of my time setting up the MorphSeg data preprocessing pipeline, training models, creating a public interface, and generally cleaning and debugging the research library's code. Because the library's code was of higher quality than we anticipated going into the hackathon, this process went smoothly for the most part. The most challenging part was training on the Unity Cluster, which was experiencing severe delays today, making it hard to get an A100 or L40S. Training without these takes a decent chunk longer. We worked around this by using the Amherst College compute cluster for a short period of time, although only Nathan had access to it, which led to another challenge.

Nathan: I learned the hard way that pickling/safetensors-ing a PyTorch model in one namespace and unpickling it in another (say, for use inside a package) is not an intended use of saving and loading PyTorch models.
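The namespace problem Nathan describes can be reproduced with the standard library alone. The sketch below is an analogy, not our actual MorphSeg code: the module name `hypothetical_train_script` and the classes `Tagger` and `PackagedTagger` are made up for illustration. It shows that pickle records where a class lives rather than the class itself, so loading fails once that module is no longer importable, while saving plain data and rebuilding the object does not have this dependency. With PyTorch, the analogous approach is to save `model.state_dict()` and call `load_state_dict` on a model constructed inside the package.

```python
import json
import pickle
import sys
import types

# Stand-in for the training script's namespace (hypothetical module name).
train_ns = types.ModuleType("hypothetical_train_script")
exec(
    "class Tagger:\n"
    "    def __init__(self, weights):\n"
    "        self.weights = weights\n",
    train_ns.__dict__,
)
sys.modules["hypothetical_train_script"] = train_ns

model = train_ns.Tagger([0.1, 0.2])
# The pickle stores a reference to "hypothetical_train_script.Tagger",
# not the class definition.
blob = pickle.dumps(model)

# Simulate loading inside a package where the training module is not importable.
del sys.modules["hypothetical_train_script"]
try:
    pickle.loads(blob)
    load_failed = False
except ImportError:
    load_failed = True

# More portable: persist only plain data, then rebuild the object with
# whichever class the loading package defines.
saved = json.dumps({"weights": model.weights})


class PackagedTagger:
    def __init__(self, weights):
        self.weights = weights


restored = PackagedTagger(**json.loads(saved))
print(load_failed, restored.weights)
```

Keeping the saved artifact free of class references means the package can rename or move its model class without invalidating previously trained weights.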

Accomplishments that we're proud of

Our frontend designers had done very little beyond basic web design before, and yet made a really good frontend! They were all new to Svelte but were able to pick it up on the spot and use it to create a really unique UI. This is also the first time we've created a PyPI package! The code is clean enough to reasonably be used in production, although there's still a lot of cleanup left to do!

What we learned

We've learned a lot about how to make open source contributions and how to productionize research code. We also learned a ton about web development, especially with a web framework like Svelte.

What's next for What the Segma?

Extending it to more languages! tü-seg was a submission to the SIGMORPHON 2022 shared task on morpheme segmentation, which was conducted across several languages. Extending it to as many languages as feasible is a good next step. The only reason we did not focus on this is that we believe any GUI on top of tü-seg should have some language-specific features, so we built the GUI for English only. More code cleanup and model fine-tuning functionality are also on our list!
