What is this about?

This is a Python experiment to identify technologies and APIs used in a Devpost project. I'll start by retrieving a JSON representation of the project using our unofficial API. Then I'll search the project's title, tagline, description, and author contribution statements for common keywords (node.js, python, elasticsearch, heroku, etc.), normalize them (node.js / node js / node are all the same thing), and return an array of tags.

Why bother?

Most hackers only tag languages and key APIs. Hosting providers, build tools, and other tech almost never get captured. I want to improve the likelihood that a mentioned tech/API will be tagged. If we can suggest tags based on the project description, users could then just click "add all" or whatever and be on their way.

And then what?

If this all works out, I want to think about relationships between tags. Is a project hosted on heroku more likely to be a RoR project? If so, maybe we could suggest that tag even if RoR isn't mentioned explicitly. Or hey, are you building a webapp with node? Well then you're probably using express or some other common framework.
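A rough sketch of how those relationships might be measured, assuming we had a pile of already-tagged projects to count from. The project lists, the `related` helper, and the 0.6 threshold are all made up for illustration (Python 3):

```python
from collections import Counter
from itertools import permutations

# Made-up tag lists for illustration; real data would come from tagged projects.
projects = [
    ["Heroku", "Ruby on Rails"],
    ["Heroku", "Ruby on Rails", "PostgreSQL"],
    ["Heroku", "Node.js", "Express"],
    ["Node.js", "Express"],
]

# How many projects use each tag, and how many use each ordered pair of tags.
tag_counts = Counter(tag for tags in projects for tag in set(tags))
pair_counts = Counter(
    pair for tags in projects for pair in permutations(set(tags), 2)
)

def related(tag, min_share=0.6):
    # Suggest tags that appear in at least min_share of the projects using `tag`.
    return sorted(
        other
        for (first, other), n in pair_counts.items()
        if first == tag and n / tag_counts[tag] >= min_share
    )

print(related("Heroku"))   # ['Ruby on Rails']
print(related("Node.js"))  # ['Express']
```

With real data the threshold would need tuning, since popular tags co-occur with almost everything.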

Bruh…

Hey, if you've got some feedback, better ideas, whatevers, I'd love to hear it. BTW, shoutout to HH Python!

Updates

Neal Shyam posted an update

Did a very simple first pass today in Python. If you check the repo, you'll find tag.py and db.py, which are the main script and a small tag database, respectively.

If you run python tag.py nextpocket, the script will:

  1. Pull all the project details from the Devpost API and concatenate them into one string.
  2. Loop through every phrase in the database (e.g. Pocket, javascript, js, etc.) and run a regex search for it in the project details.
  3. For every match, pull the "cleaned", matching tag (JavaScript, instead of js or javascript).
  4. Return all unique tags.
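The four steps above can be sketched roughly like this. The project dictionary, its field names, the `suggest_tags` function, and the tiny database are made-up stand-ins; the real script pulls the details from the Devpost API, which isn't reproduced here.

```python
import re

# Hypothetical stand-in for the JSON the API returns; field names are assumptions.
project = {
    "title": "NextPocket",
    "tagline": "Find your next read from Pocket",
    "description": "Built with javascript on the front end and hosted on Heroku.",
}

# Hypothetical mini database: each entry maps a search phrase to its cleaned tag.
db = [
    {"phrase": "pocket", "tag": "Pocket"},
    {"phrase": "javascript", "tag": "JavaScript"},
    {"phrase": "js", "tag": "JavaScript"},
    {"phrase": "heroku", "tag": "Heroku"},
]

def suggest_tags(project, db):
    # Step 1: concatenate every text field into one string.
    text = " ".join(str(value) for value in project.values())
    # Steps 2 and 3: search for each phrase and collect the cleaned tag.
    found = set()
    for entry in db:
        if re.search(entry["phrase"] + r"\b", text, re.IGNORECASE):
            found.add(entry["tag"])
    # Step 4: return the unique tags (sorted for stable output).
    return sorted(found)

print(suggest_tags(project, db))  # ['Heroku', 'JavaScript', 'Pocket']
```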

This is the entire "extraction engine" and yes, it's just plain regex with a word boundary:

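import re

# `db` (the tag database) and `text` (the concatenated project details)
# come from earlier in the script.
stags = set()
for t in db:
    # Each phrase is used as a regex, followed by a word boundary.
    reg = t['phrase'] + r'\b'
    if re.search(reg, text, re.IGNORECASE):
        stags.add(str(t['tag']))
stags = list(stags)

print("Suggested tags: \n")
print(stags)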

Is it a little too simplistic? Probably, but it works for the two or three cases I've tried so far. It'll need work as I build out the database.
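One place the simple pattern may need work: with only a trailing word boundary, a phrase can match at the end of a longer word, and regex metacharacters in a phrase (the dot in node.js, the plus signs in c++) aren't treated literally. A possible stricter variant, which is not what tag.py currently does, escapes the phrase and checks both sides. The `phrase_pattern` name is made up for this sketch.

```python
import re

def phrase_pattern(phrase):
    # Escape metacharacters so "node.js" or "c++" are matched literally,
    # then require that no word character sits directly before or after.
    return re.compile(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", re.IGNORECASE)

text = "We used nodejs, C++ and plain JS."

# Trailing-boundary-only pattern: "js" also matches inside "nodejs".
print(bool(re.search(r"js\b", "We used nodejs")))           # True
# Stricter pattern: "js" no longer matches inside "nodejs".
print(bool(phrase_pattern("js").search("We used nodejs")))  # False
print(bool(phrase_pattern("js").search(text)))              # True, from "JS."
print(bool(phrase_pattern("c++").search(text)))             # True
```

The lookarounds are used instead of `\b` because `\b` after a phrase ending in a non-word character, like c++, only matches when a word character follows.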

And speaking of the database, you'll notice that I listed things like StackOverflow, Stack Overflow, and SO as separate entries. I don't see a way around this. It'll be the same for things like node.js & nodejs & node.
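The separate entries might be easier to maintain if each tag's spellings are kept together and expanded into flat phrase/tag entries before matching. This is only a sketch: `ALIASES`, `expand`, and the spellings listed are illustrative, not the actual contents of db.py.

```python
# Hypothetical grouping: canonical tag -> every spelling worth searching for.
ALIASES = {
    "Stack Overflow": ["StackOverflow", "Stack Overflow", "SO"],
    "Node.js": ["node.js", "nodejs", "node"],
}

def expand(aliases):
    # Produce one {'phrase', 'tag'} entry per spelling, so the matching
    # loop can stay unchanged while the data is edited per tag.
    return [
        {"phrase": phrase, "tag": tag}
        for tag, phrases in aliases.items()
        for phrase in phrases
    ]

db = expand(ALIASES)
print(len(db))  # 6
print(db[0])    # {'phrase': 'StackOverflow', 'tag': 'Stack Overflow'}
```

One thing to watch: a short alias like SO will also match the ordinary word "so" in a case-insensitive search, so entries like that probably need case-sensitive handling.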

Frankly, I don't see why we can't do this online, in the project submission form. Seems like a slam dunk to me.