TL;DR: We built a search engine for an information discovery problem: you aren't sure what your query string should be, but you have a specific document or webpage in mind and want to find work related to it. Instead of relying on the structural links that exist between websites, we use natural language processing to search and cluster by context. As a result, we achieve superhuman classification of not only expository documents but also code snippets. Click here to view the pitch deck.

See it in action:

Identifies, searches and clusters sub-topics relevant to the contextual material on a given page

Determines algorithms and libraries used in code snippets from context

Identifies implicit political context and bias

Our Public-API endpoint received over 50,000 unique requests in under 36 hours

Inspiration

With an ever-changing world and a migration towards a digital landscape, the average user can become overwhelmed with data and information. In this digital world, people rely on Google searches to discover and acquire knowledge rather than on conventional learning strategies. These searches fall into two subsets: knowledge discovery, where the question is undefined, and direct searching, where the question is defined. 60% of searches worldwide are classified as unsuccessful, meaning multiple searches were conducted before the desired result was found. We were inspired to create a new solution that recommends articles on topics similar to the one being searched, to aid discovery and education. We think that by minimizing search time and frustration, finding the right data can become a journey instead of a pain point. We want everyone to indulge their curiosity in whatever topic, interest or random fact they are looking for.

What it does

We developed an API to expose the core algorithm, “bubblRank”, and made it available through StdLib. Anyone can query our API with a web page and receive a categorized and labelled arrangement of related pages. We show one application of “bubblRank” by building a Chrome extension that computes the bubbl cluster of any given page and lets the user navigate through the cluster in an intuitive way. The back end is powered by a state-of-the-art natural language processing and clustering algorithm that we designed at this hackathon. It scrapes the meaningful text from websites and produces rich document representations of those websites in vector form through averaging. By comparing the pairwise cosine similarities of these vectors, we obtain a robust similarity metric, which we then use to perform hierarchical density-based spatial clustering in parallel. At every stage in the development of bubblRank we took several steps to ensure that the accuracy of our algorithm was not compromised while maintaining state-of-the-art computation speed. Take, for example, the way we verify the accuracy of our document vector representations: we use t-SNE plots to reduce the dimensionality of our vector space and check for the presence of clusters. We then compute Spearman’s rank correlation coefficient against human tests to verify the clusters made. This attention to detail runs through our entire project, and it is something we are very proud of.
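To make the similarity step concrete, here is a minimal sketch in Python. It is not the bubblRank implementation: the toy word vectors, the choice to average per-word vectors, and the function names are all assumptions made for illustration. It builds one vector per document by averaging and then computes the pairwise cosine similarity matrix that a clustering step could consume.

```python
import math

# Hypothetical toy word embeddings (an assumption for this sketch);
# a real system would load pretrained vectors with many more dimensions.
WORD_VECTORS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "training": [0.7, 0.3, 0.0],
    "election": [0.0, 0.9, 0.4],
    "senate": [0.1, 0.8, 0.5],
    "vote": [0.0, 0.7, 0.6],
}


def document_vector(text):
    """Average the vectors of every known word in the text."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        raise ValueError("document contains no known words")
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(docs):
    """Pairwise cosine similarities between the document vectors."""
    vecs = [document_vector(d) for d in docs]
    return [[cosine_similarity(a, b) for b in vecs] for a in vecs]


pages = [
    "neural network training",
    "training a neural network",
    "senate election vote",
]
matrix = similarity_matrix(pages)
for row in matrix:
    print([round(value, 3) for value in row])
```

In this toy data the first two pages contain the same known words, so their similarity is close to 1, while the third page scores lower against both. A density-based clustering step would typically work on a distance such as 1 minus the similarity.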

How we built it

Bubbl was built primarily in Java because of the language’s strength in parallelism. We found it more effective than Python (which we had originally considered) because Java threads running on different cores can share memory, which Python does not support as readily. Angular and JavaScript were used in the front end (web app) to provide a pleasant user experience. The core of the algorithm and the API are exposed using StdLib and Node.js. All preliminary data integration tasks were done in Python.

Challenges we ran into

The largest challenge the team faced was comparing large sets of websites by similarity. This involved accessing the data through queries, compressing the data into large vectors, comparing those vectors semantically using either Euclidean distance or cosine similarity, and then interpreting and testing the resulting similarity scores. Later problems stemmed from parallel clustering and from building a strong back end and front end to display the topics and similar articles visually in an innovative fashion.
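The choice between the two measures matters because they can disagree. The sketch below uses made-up three-dimensional vectors (hypothetical values, not data from bubbl) to show a case where Euclidean distance and cosine similarity rank the same pages differently: cosine similarity ignores vector length and only compares direction.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical page vectors, chosen only to illustrate the difference.
short_page = [1.0, 2.0, 0.0]
long_page = [10.0, 20.0, 0.0]  # same direction as short_page, ten times longer
other_page = [2.0, 0.0, 1.0]   # different direction, similar length

print("euclidean short-long :", round(euclidean_distance(short_page, long_page), 3))
print("euclidean short-other:", round(euclidean_distance(short_page, other_page), 3))
print("cosine short-long    :", round(cosine_similarity(short_page, long_page), 3))
print("cosine short-other   :", round(cosine_similarity(short_page, other_page), 3))
```

Here Euclidean distance treats `other_page` as the nearer neighbour of `short_page`, while cosine similarity treats `long_page` as the closer match because it points in the same direction.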

Accomplishments that we're proud of

Parsing a large variety of websites and running TextRank together with our clustering algorithm

Building out a Chrome web extension and a back end with API calls to our clustering algorithm

Comparing large corpora of data and encouraging learning through a sophisticated back-end similarity algorithm with a sleek UI

What we learned

As our challenges show, we tackled a great deal of machine learning and data integration. We also learned that project management is a valuable skill.

What's next for bubbl

Continuing to scale our infrastructure and expand the use cases for our API.
