Inspiration
As Reddit users, we find that some subreddits, such as r/science, have many useful resources on a variety of topics. However, because the information is often provided by specialists in the field, the content tends to be long and erudite. We noticed that people often reply "eli5" (explain like I'm five), and the specialists then try to explain the concept in simplified terms. eli5 is extremely useful because Reddit users often prefer condensed, simplified content that lets them grasp the information quickly. Both of us are interested in NLP, and building a sensible sentence out of synonyms is still an open challenge. So we decided to explore NLP and try to build a bot! :)
What it does
We built a Reddit bot, eli5bot, that automates the process of parsing, summarizing, and simplifying content. eli5bot is summoned when a Redditor comments eli5. It parses the paragraph above the comment, shortens it to 7 or fewer sentences, replaces the erudite vocabulary with common words, and provides a wiki link to the subject for curious Redditors.
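The flow described here can be sketched in a few lines. This is only an illustration, not the bot's actual code: the names `is_summon`, `handle_comment`, `summarize`, and `simplify` are hypothetical, and the summarizer and simplifier are passed in as plain functions so the sketch runs without Reddit or any external API.

```python
import re
from typing import Callable, Optional

MAX_SENTENCES = 7  # the reply is shortened to 7 or fewer sentences


def is_summon(comment_body: str) -> bool:
    """Return True when the comment contains 'eli5' as a standalone word."""
    return re.search(r"\beli5\b", comment_body, flags=re.IGNORECASE) is not None


def handle_comment(
    comment_body: str,
    parent_text: str,
    summarize: Callable[[str, int], str],  # hypothetical: (text, max sentences) -> summary
    simplify: Callable[[str], str],        # hypothetical: summary -> simplified text
) -> Optional[str]:
    """Build a reply for a summoning comment, or return None for any other comment."""
    if not is_summon(comment_body):
        return None
    summary = summarize(parent_text, MAX_SENTENCES)  # shorten first
    return simplify(summary)                         # then swap in simpler words
```

Passing the two steps in as arguments keeps the order (summarize, then simplify) visible and makes the trigger logic easy to test on its own.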
How we built it
We built the simpli5 API, which uses spaCy, an NLP toolkit, to tokenize every sentence in the paragraph. We used WordNet to rank words by frequency and thereby identify the uncommon ones. Then we used the thesaurus.altervista.org API to obtain the synonym list for each uncommon word based on its part of speech, and chose the most common word among them. eli5bot first uses the SMMRY API to construct a TL;DR of the paragraph and then uses our simpli5 API to simplify the content.
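As a rough sketch of the "pick the most common synonym" step: the frequency table, the threshold, and the function names below are all made up for illustration. The real project took its frequencies from WordNet and its synonyms from the thesaurus API, neither of which is called here.

```python
from typing import Dict, Iterable

# Hypothetical frequency counts, standing in for the WordNet-based ranking.
WORD_FREQUENCY: Dict[str, int] = {
    "use": 900, "employ": 120, "utilize": 40,
    "big": 800, "huge": 300, "colossal": 5,
}

RARE_THRESHOLD = 100  # assumed cutoff below which a word counts as uncommon


def is_uncommon(word: str, freq: Dict[str, int] = WORD_FREQUENCY) -> bool:
    """Words missing from the table have frequency 0, so they count as uncommon."""
    return freq.get(word.lower(), 0) < RARE_THRESHOLD


def most_common_synonym(
    word: str, synonyms: Iterable[str], freq: Dict[str, int] = WORD_FREQUENCY
) -> str:
    """Return the most frequent candidate; the original word wins any tie."""
    candidates = [word, *synonyms]
    # max() returns the first maximal item, so listing `word` first keeps it on ties.
    return max(candidates, key=lambda w: freq.get(w.lower(), 0))


print(most_common_synonym("utilize", ["employ", "use"]))  # use
```

Including the original word among the candidates means a word is only replaced when a strictly more common synonym exists.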
Challenges we ran into
English vocabulary has a rich variety of meanings. One word can have multiple groups of synonyms with completely different meanings. The base form of a word can also have many variations, and its meaning shifts as the form changes. In addition, some terms are insignificant alone but gain meaning in context. For instance, spaCy tokenizes the term Alzheimer's disease into Alzheimer, 's, and disease. On its own, Alzheimer is only a name, but grouped together with 's and disease it can be interpreted as dementia and given a wiki link to Alzheimer's disease. It's challenging to develop an algorithm that groups the tokens so that simpli5 can look up the correct synonym. Reddit posts are also often informal and full of grammar mistakes and typos, and the unexpected grammar creates many edge cases for the parser.
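To make the grouping problem concrete, here is a deliberately simple sketch, not the algorithm we ended up with. It takes (text, part-of-speech) pairs like the ones a tokenizer produces, glues a possessive 's onto the preceding word, and merges runs of nouns, proper nouns, and adjectives into one term. The tag names and the `group_terms` function are assumptions for the example.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (text, coarse part-of-speech tag), supplied by the caller

TERM_TAGS = {"NOUN", "PROPN", "ADJ"}  # assumed tags that may form a multi-word term


def group_terms(tokens: List[Token]) -> List[str]:
    """Merge adjacent term-like tokens so they can be looked up as one phrase."""
    groups: List[str] = []
    current: List[str] = []  # words of the term being built
    for text, pos in tokens:
        if text == "'s" and current:
            current[-1] += text           # possessive attaches to the previous word
        elif pos in TERM_TAGS:
            current.append(text)          # extend the current term
        else:
            if current:                   # any other token closes the term
                groups.append(" ".join(current))
                current = []
            groups.append(text)
    if current:
        groups.append(" ".join(current))
    return groups


tokens = [("Alzheimer", "PROPN"), ("'s", "PART"), ("disease", "NOUN"),
          ("is", "AUX"), ("common", "ADJ"), (".", "PUNCT")]
print(group_terms(tokens))  # ["Alzheimer's disease", 'is', 'common', '.']
```

A greedy rule like this over-merges (any adjective plus noun becomes a "term"), which is exactly the kind of edge case that made this part hard.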
Accomplishments that we're proud of
We designed an algorithm to customize the tokenization of a sentence. For instance, gamma-Aminobutyric acid (GABA), which consists of a noun, an adjective, and a noun, can be grouped together for an accurate synonym lookup. We designed an algorithm to find the best synonym based on the frequency of the word in WordNet and to replace the original word with the simplified term together with a wiki link. We also designed an algorithm to put the simplified sentence back together while maintaining human-like grammar.
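A minimal sketch of the last two steps, replacing a term with a linked simpler one and stitching the tokens back into a readable sentence. The function names, the Reddit-style `[label](url)` link format, and the Wikipedia URL pattern are assumptions for illustration; the spacing rules cover only a handful of common cases.

```python
from typing import Dict, List
from urllib.parse import quote

# Tokens that attach to the previous token without a space.
NO_SPACE_BEFORE = {".", ",", "!", "?", ";", ":", ")", "'s", "n't"}


def wiki_link(label: str, article: str) -> str:
    """Show the simple label, linked to the wiki page of the original term."""
    title = quote(article.replace(" ", "_"))
    return f"[{label}](https://en.wikipedia.org/wiki/{title})"


def join_tokens(tokens: List[str]) -> str:
    """Join tokens with spaces, except around punctuation and clitics."""
    sentence = ""
    for token in tokens:
        if sentence and token not in NO_SPACE_BEFORE and not sentence.endswith("("):
            sentence += " "
        sentence += token
    return sentence


def rebuild(tokens: List[str], replacements: Dict[str, str]) -> str:
    """Swap each token that has a simpler replacement for a linked version."""
    out = []
    for token in tokens:
        simple = replacements.get(token)
        out.append(wiki_link(simple, token) if simple else token)
    return join_tokens(out)


print(rebuild(["GABA", "is", "a", "neurotransmitter", "."],
              {"neurotransmitter": "brain chemical"}))
# GABA is a [brain chemical](https://en.wikipedia.org/wiki/neurotransmitter).
```

Linking to the original term rather than the replacement keeps the precise concept one click away for curious readers.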
What we learned
We learned how to use spaCy, an advanced natural language processing library, and how to make a Reddit bot.
What's next for Eli5bot (Reddit) and simpli5 API
Finding more appropriate synonyms by better understanding the semantics and content of the paragraph and relying less on pure word frequency.
Built With
- natural-language-processing
- pickle
- praw
- python
- spacy