Collective brainstorming, with ideas building on top of each other, led to a pretty neat idea: linking developers on GitHub who work on the same kinds of things.

What it does

Takes your GitHub repo as input and compares its readme file against the readme files of hundreds of thousands of other repos, returning the ones most likely to belong to similar projects. This lets you find other developers working on similar projects, as well as discover existing implementations of your ideas for inspiration.

How we built it

The backend consisted of an implementation of the cosine similarity algorithm, which measures the angle between two texts after converting them into vectors based on word frequency. We implemented a scraping function that fed data into the cosine similarity algorithm in the backend, and a website front end for entering your personal GitHub repo URL.
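As a rough sketch of the idea (not our exact code), cosine similarity over word-frequency vectors can be written in a few lines of plain Python:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Score two texts by the angle between their word-frequency vectors:
    1.0 means identical word distributions, 0.0 means no words in common."""
    # Bag-of-words vectors: word -> how often it appears.
    freq_a = Counter(text_a.lower().split())
    freq_b = Counter(text_b.lower().split())
    # Dot product only needs the words the two texts share.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    # Vector magnitudes.
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the score depends only on the angle between the vectors, a short readme and a long readme about the same topic can still score highly.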

We downloaded over ten thousand readmes from different public repos on GitHub as a sample to be turned into vectors and compared against an input.

Challenges we ran into

Initially we set out to use TensorFlow to derive the purpose of a project from a large selection of readme files scraped from GitHub. However, we soon realised that Natural Language Processing, the study of teaching language to machines, is still bleeding-edge technology, and as such went slightly beyond the scope of what we could achieve in 24 hours.

Due to the algorithm we used, we had to base our comparisons on word frequency alone rather than on the meaning of the words. Care had to be taken to ensure accurate results, such as calculating the relative 'weight' of each word within a readme to determine how unique it was, thereby filtering out common words and improving results.

Computation speed was a very important factor for the backend algorithm due to the enormous number of input elements, so extra effort went into both reducing the time complexity and reducing the number of elements fed into 'time bottlenecks' (such as for loops).
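One simple optimisation of that kind (a sketch under our own assumptions, not necessarily what we shipped) is to convert every readme into a vector and precompute its magnitude once up front, so the per-comparison loop only has to do a dot product:

```python
import math
from collections import Counter

def precompute(corpus):
    """Turn each readme into a (vector, norm) pair once, up front,
    so nothing is recomputed inside the comparison loop."""
    prepared = []
    for text in corpus:
        vec = Counter(text.lower().split())
        norm = math.sqrt(sum(c * c for c in vec.values()))
        prepared.append((vec, norm))
    return prepared

def top_matches(query, prepared, k=5):
    """Return the indices of the k most similar readmes to the query."""
    q_vec = Counter(query.lower().split())
    q_norm = math.sqrt(sum(c * c for c in q_vec.values()))
    scores = []
    for i, (vec, norm) in enumerate(prepared):
        if norm == 0 or q_norm == 0:
            scores.append((0.0, i))
            continue
        # Intersecting the key sets keeps the dot product proportional
        # to the query size, not the corpus vocabulary.
        dot = sum(q_vec[w] * vec[w] for w in q_vec.keys() & vec.keys())
        scores.append((dot / (q_norm * norm), i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]
```

The precompute step is O(total corpus size) and runs once; each query then costs only one pass over the prepared vectors.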

We ran out of time to properly link up the front end to the backend.

Accomplishments that we're proud of

We're very proud that we made a cool project with both a working front end and a working back end; it's just a shame we couldn't combine the two!

What we learned

Personally, I learnt a lot about different algorithms for comparing text, as well as the time complexities of many methods for sorting and processing data in Python. I also learnt the basics of collaborating on a GitHub repository with other team members.

What's next for Repo Finder

Ideally, we would download every public GitHub repo (all 20+ million) and rely not on an algorithm based purely on word frequency, but on a machine learning model that could accurately determine the meaning of the language used in the readme files themselves. Perhaps in the future.
