Calvin is a Flask-based web application, deployed on GCP, that applies machine learning to determine whether an open-source repository on GitHub has an active community (i.e. a "Good" one) or a fairly stale community (i.e. a "Bad" one). Using this application (with "boredboy.py" being a comparable command-line tool), software developers and engineers in big organizations can figure out whether an individual repository has a good community. This is especially important in a production setting: relying on badly maintained open-source projects leads to extensive technical debt, a lack of readable documentation, and the fear that a single security vulnerability will go unnoticed or unfixed and therefore put enterprise software at risk. Well-maintained projects, by contrast, strive to reduce technical debt, keep documentation good and readable, and find and fix vulnerabilities and bugs a la Linus's Law. With this tool, it should become easier for enterprises, companies, and even individual developers building large-scale systems to use more than just their gut intuition when deciding which communities to trust with critical functionality.

The name is based on "Calvin" from "Calvin and Hobbes," a cantankerous kid who's infamous for his impatience and general bad temper, the kind of kid who asks "Is it done yet? Are we there yet? When will this be over?" This parallels a company waiting for an issue in an open-source project to get resolved (i.e. opening an issue on GitHub, and then waiting patiently or impatiently until a maintainer gets back to it).

The application trains a linear Support Vector Machine (SVM) classifier. A linear boundary fits the problem because good GitHub repositories generally have large numbers of stars, users, and PRs, while bad GitHub repositories have few, if any. The SVM learns the necessary boundary from a training data JSON file. Feature engineering keeps the input simple: given just the orgname/repname needed to identify a GitHub repository, the backend generates a feature vector that includes the number of stars, number of users, number of PRs/watchers, etc. (this also improves UX, as the user never has to supply these features by hand). The frontend is written mostly in HTML/CSS and vanilla JavaScript, with jQuery and Ajax connecting the frontend to the backend. The entire thing is deployed directly on GCP, so people can access the served model from a web browser. While both frontend development and GCP deployment were challenging for me (as frontend generally scares me and GCP was unfamiliar to me), I'm definitely a bit more comfortable with both of these things now, especially as I was able to use the Internet and available GCP mentors to my advantage.
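To make the pipeline described above more concrete, here is a minimal sketch of the two backend steps: turning repository metadata into a feature vector, and fitting a linear boundary with a hinge loss (the loss behind a linear SVM). It is not Calvin's actual code. The function names, the choice of fields, the log scaling, and the hyperparameters are all assumptions made for illustration; the field names are borrowed from the repository object returned by the GitHub REST API, and a real deployment would more likely rely on an existing SVM library than on a hand-written trainer.

```python
import math

# Assumed feature set: field names follow the GitHub REST API repository
# object, but which fields Calvin really uses is not stated in this write-up.
FEATURE_FIELDS = (
    "stargazers_count",
    "subscribers_count",
    "forks_count",
    "open_issues_count",
)


def split_repo_id(repo_id):
    """Split an 'orgname/repname' identifier into (org, repo)."""
    org, sep, repo = repo_id.strip().partition("/")
    if not sep or not org or not repo or "/" in repo:
        raise ValueError(f"expected 'orgname/repname', got {repo_id!r}")
    return org, repo


def build_feature_vector(repo_json):
    """Map repository metadata to log-scaled counts; missing fields count as 0."""
    return [math.log1p(repo_json.get(field, 0)) for field in FEATURE_FIELDS]


def train_linear_svm(X, y, epochs=200, lr=0.01, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the
    L2-regularised hinge loss. Labels must be +1 ("Good") or -1 ("Bad")."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Regularisation step: shrink the weights slightly.
            w = [wj - lr * reg * wj for wj in w]
            # Hinge step: only samples inside the margin move the boundary.
            if margin < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return "Good" if x falls on the positive side of the boundary."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return "Good" if score >= 0 else "Bad"
```

In this sketch, the backend would call split_repo_id on the user's input, fetch the repository metadata, pass it through build_feature_vector, and hand the result to predict along with weights learned from the training data JSON file. The log scaling is a design choice of the sketch: raw star counts span several orders of magnitude, and compressing them keeps one very popular repository from dominating the boundary.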

Understanding community health is especially important in the enterprise setting (and even in building larger, more involved open-source projects) within the security space. While actual vulnerabilities within codebases can be found via vulnerability scanners (such as Black Duck), how quickly a vulnerability gets found and fixed, and how easy it is for anyone to contribute patches and fixes to the community, hasn't been explored in depth. Calvin is a good start for companies to do so.
