We can predict the popularity of existing packages by looking at their star ratings on djangopackages.org. For new packages, however, we only have their release dates and the people working on them, not their star ratings. As a result, we cannot predict their popularity in the same way.
What it does
Our project aims to predict the popularity of new packages, measured by star ratings, based on the popularity of existing packages and the participants who have worked on them. It assumes that the participants working on a new package have previously worked on an existing one.
How we built it
● We scrape https://djangopackages.org/python3/ to obtain information on all Django packages used in Python 3 development projects.
● The popularity of a new package is then predicted from the names of the participants working on it and a score vector of their previous contributions.
● The score vector of each project consists of its usage count (number of site users who report using the package), forks (number of times the package's repository has been forked), and star rating (project ranking).
● A matrix relating participants to the package(s) they worked on is multiplied by the score vectors to obtain a transition matrix, which represents the reputation of each participant based on their earlier contributions to packages.
● A second matrix, with the same length as the transition matrix, is built for the participants working on the new package. It is multiplied by the transition matrix to get a score vector, and the column holding the star rating is output as the predicted popularity of the new package.
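The matrix steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the project's actual code: the participant names, package names, and score values are made up, and the names `membership`, `scores`, `reputation`, and `predict_scores` are hypothetical.

```python
import numpy as np

# Hypothetical toy data; real values would come from the scraped site.
participants = ["alice", "bob", "carol"]
packages = ["pkg_a", "pkg_b"]

# membership[i, j] = 1 if participant i worked on package j.
membership = np.array([
    [1, 0],  # alice worked on pkg_a
    [1, 1],  # bob worked on both packages
    [0, 1],  # carol worked on pkg_b
])

# One score vector per package: (usage count, forks, stars).
scores = np.array([
    [10, 4, 50],    # pkg_a
    [20, 6, 100],   # pkg_b
])

# "Transition matrix": per-participant reputation, summed over
# every package that participant has worked on.
reputation = membership @ scores  # shape: (participants, 3)


def predict_scores(new_team):
    """Sum the reputation rows of the participants on a new package."""
    indicator = np.array([1 if name in new_team else 0 for name in participants])
    return indicator @ reputation  # (usage count, forks, stars)


predicted = predict_scores({"alice", "carol"})
predicted_stars = predicted[2]  # the star-rating column is the prediction
print(predicted_stars)
```

Because the product sums reputations, larger teams get larger predicted scores in this sketch. Dividing by the team size would be one possible normalization, although the writeup does not say whether one was used.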
Challenges we ran into
● After conceiving the idea, it took time to work out a plan for how to proceed.
● Deciding which factors to use for the prediction was also a challenge.
● Initially, when we did not yet have star ratings for Django packages, we considered using linear regression to predict star ratings from the correlation of forks and usage counts with watchers. However, it did not perform well on the test data because of a single outlier value.
● Linear regression turned out to be unnecessary once we had enough data that included the star ratings. We therefore changed the scope of our question, as documented in the lessons learned.
● We decided to include the PageRank (PR) of packages as one of the criteria for predicting popularity, but after 2 hours of coding and inspecting the results, we found that PR cannot be applied here: every package becomes a sink node, a node with incoming links but no outgoing links.
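The sink-node problem can be illustrated with a small check of out-degrees. This is a sketch under an assumption: the writeup does not describe how the graph was built, so the example supposes edges that point from each participant to the packages they worked on. All names are hypothetical.

```python
# Assumed graph shape: one directed edge per (participant -> package) pair.
edges = [
    ("alice", "pkg_a"),
    ("bob", "pkg_a"),
    ("bob", "pkg_b"),
    ("carol", "pkg_b"),
]

# Collect every node that appears as a source or a target.
nodes = sorted({node for edge in edges for node in edge})

# Count outgoing links per node.
out_degree = {node: 0 for node in nodes}
for source, _target in edges:
    out_degree[source] += 1

# A sink node has incoming links but no outgoing ones.
sinks = [node for node in nodes if out_degree[node] == 0]
print(sinks)
```

In a graph of this shape every package is a sink, so no package ever passes rank on to another node through a link.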
Accomplishments that we're proud of
Despite the challenges faced, we are proud of proposing a new question that we have not seen or thought of before. Although the project did not turn out as expected, we were still able to generate an output and a working model to test our cases.
What we learned
We learned that when working on a new project, initial ideas may not work as hoped, and we had to spend a lot of time figuring out new ways to approach the problem or changing the scope of our question. For example, we initially planned to predict popular Django packages from the limited data we had, using the correlation of forks and usage counts with watchers. However, after scraping more data, we found that star ratings were available and could be used to determine the popularity of packages. We therefore changed our question to predicting the popularity of new packages using data from existing packages.
What's next for Predicting Django packages that are getting Popular
Predicting which Django packages are becoming popular lets users adopt popular packages, which would enhance their credibility. It could also serve as a standard for developing packages that gain sufficient attention. Future directions include analyzing the common features of recently popular packages and providing guidelines for developing them.