Search engines let users find information by entering key terms. However, the results often include irrelevant content such as advertisements and images.
What it does
No Longer Massey is a web application that takes a topic as input and returns the most accurate and relevant information it can find. It strips the large amount of redundant text from search engine results so that the user has to read as little as possible.
How we built it
No Longer Massey consists of two parts: a web crawler and an AI text summarizer.
Web Crawler / Data Aggregation
The web crawler was built largely on the Python Scrapy framework. It aggregated data from large search engines in the form of brief summaries. These summaries were compiled into a large array and passed to the summarizer.
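As a rough illustration of the aggregation step, the sketch below pulls short result summaries out of HTML pages and collects them into one list. It is not the project's Scrapy code: it uses Python's standard-library html.parser as a stand-in so that it runs without extra dependencies, and the "snippet" class name, SnippetParser, and collect_snippets are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class SnippetParser(HTMLParser):
    """Collects the text of every element whose class list contains SNIPPET_CLASS.

    Assumes well-formed markup with no unclosed void tags (such as <br>)
    inside a snippet element.
    """

    SNIPPET_CLASS = "snippet"  # hypothetical class name, not a real engine's markup

    def __init__(self):
        super().__init__()
        self.snippets = []
        self._depth = 0    # greater than 0 while inside a snippet element
        self._buffer = []  # text pieces of the snippet being read

    def handle_starttag(self, tag, attrs):
        if self._depth > 0:
            self._depth += 1  # nested tag inside a snippet
        elif self.SNIPPET_CLASS in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and collapse runs of whitespace.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.snippets.append(text)

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def collect_snippets(pages):
    """Aggregate snippets from several result pages into one flat list."""
    all_snippets = []
    for html in pages:
        parser = SnippetParser()
        parser.feed(html)
        parser.close()
        all_snippets.extend(parser.snippets)
    return all_snippets
```

The returned list plays the role of the large array that is handed to the summarizer.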
The AI summarizer was coded from scratch in Java and was based on the TextRank paper. It first assigns a "recommendation" to each sentence based on how frequently keywords appear in it. It then constructs a weighted graph whose nodes are the sentences and whose edge weights reflect how similar two sentences are. From these weights, it computes a score for each node that measures that sentence's overall "importance" to the graph.
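The ranking idea can be sketched in a few lines of Python. This is a minimal illustration of TextRank-style sentence ranking, not our original Java code: the function names, the simple regex tokenizer, the damping factor of 0.85, and the fixed iteration count are all assumptions made for the example.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def similarity(words_a, words_b):
    """Shared-word count, normalized by the log of both sentence lengths."""
    if len(words_a) < 2 or len(words_b) < 2:
        # Very short sentences get 0; this also avoids a zero denominator.
        return 0.0
    overlap = len(set(words_a) & set(words_b))
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score each sentence with weighted PageRank over the similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # weights[i][j] is the (symmetric) edge weight between sentences i and j.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = weights[j][i] = similarity(tokens[i], tokens[j])
    out_sum = [sum(row) for row in weights]

    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives a share of its neighbours' scores,
        # proportional to the weight of the connecting edge.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, top_k=2):
    """Return the top_k highest-scoring sentences in their original order."""
    scores = rank_sentences(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]
```

Calling summarize on the list of snippets returns the sentences that share the most vocabulary with the rest of the input, while a sentence with no words in common with the others ends up with the minimum score.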
Challenges we ran into
Most of our team was most comfortable coding in Java, so the primary AI was written in Java. Afterward, we decided to turn the project into a web app using a Flask server. Because Flask is a Python framework, integrating a Java backend proved difficult, and we had to translate the Java code into Python.
Accomplishments that we're proud of
Rather than following the common approach of training a machine-learning model for text summarization, our group wanted to try something different. We successfully researched and applied a graph-based technique that was new to us (TextRank, from the paper mentioned above) to create an effective summarization AI.
What we learned
We learned about the TextRank algorithm, which draws on advanced topics such as graph theory. Our model constructed a weighted graph in which sentences were nodes and the similarity between two sentences was the weight of the edge connecting them.
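For reference, the two formulas below are the sentence similarity and the weighted node score as the TextRank paper states them. The notation follows the paper rather than our code: S_i is a sentence treated as a set of words w_k, V_i is the graph node for that sentence, w_{ji} is the weight of the edge between V_j and V_i, and d is a damping factor (the paper uses 0.85).

```latex
\[
\mathrm{Similarity}(S_i, S_j) =
  \frac{\left|\{\, w_k : w_k \in S_i \text{ and } w_k \in S_j \,\}\right|}
       {\log |S_i| + \log |S_j|}
\]
\[
WS(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)}
  \frac{w_{ji}}{\sum_{V_k \in \mathrm{Out}(V_j)} w_{jk}} \, WS(V_j)
\]
```

The second formula is applied repeatedly to every node until the scores stop changing, and the sentences with the highest final scores form the summary.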
What's next for No Longer Massey
Next steps for No Longer Massey include using a pre-trained model such as word2vec to expand the vocabulary of the AI beyond the limited corpus provided by the input text.
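One way this could look is to replace exact word overlap with the cosine similarity of averaged word vectors, so that sentences using different but related words still count as similar. The sketch below is only an assumption about how that might be done: TOY_VECTORS holds made-up three-dimensional values standing in for real pre-trained word2vec embeddings, and the function names are hypothetical.

```python
import math

# Made-up 3-dimensional vectors standing in for pre-trained word2vec
# embeddings. Real embeddings would be loaded from a trained model.
TOY_VECTORS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "fast": [0.2, 0.8, 0.1],
    "quick": [0.25, 0.75, 0.1],
    "banana": [0.0, 0.1, 0.95],
    "ripe": [0.1, 0.2, 0.8],
}


def sentence_vector(words, vectors):
    """Average the vectors of the known words; return None if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embedding_similarity(words_a, words_b, vectors=TOY_VECTORS):
    """Similarity of two tokenized sentences based on averaged word vectors."""
    va = sentence_vector(words_a, vectors)
    vb = sentence_vector(words_b, vectors)
    if va is None or vb is None:
        return 0.0
    return cosine(va, vb)
```

With these toy values, "fast car" and "quick automobile" share no words yet score as far more similar to each other than either does to "ripe banana", which is the behavior a word-overlap measure cannot provide.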