Statistical Analysis of DiscoveryEngine

Inspiration

Good morning! I'm a freshman and a Biochemistry concentrator at Brown who just began learning statistics and a tiny bit of R. I have always been passionate about science, and DiscoveryEngine tries to quantify the quality of papers. If the data is interpreted well, many interesting insights can be drawn from it.

What it does

I tackled the problem of identifying scientific personalities, as expressed in how raters judge papers. Some raters typically score what they read lower than the norm, and some relate the questions to each other in different ways. My project is a new data frame built from a classical statistical analysis: for each rater, it measures how that rater's answers to different questions relate to each other, using the r-squared of a linear model of the variables. The data frame also shows how many papers each rater rated, along with the mean and standard deviation for each question. User IDs without ratings were removed from the data frame. This makes it easy to compare different profiles and check the variance of each rater's ratings.
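To make the shape of that data frame concrete, here is a minimal sketch in base R. It is not the project's actual code: the toy data and the column names `user_id`, `paper_id`, `q1`, and `q2` are assumptions made up for illustration, and only two questions are shown.

```r
# Hypothetical ratings table: one row per (rater, paper) rating.
# All column names and values here are assumptions for illustration.
ratings <- data.frame(
  user_id  = c("a", "a", "a", "b", "b", "c", "d"),
  paper_id = c(1, 2, 3, 1, 2, 1, 1),
  q1       = c(4, 5, 3, 2, 2, 5, NA),
  q2       = c(4, 4, 3, 1, 3, 5, NA)
)

# Drop rows with no rating at all, so raters without ratings disappear.
rated <- ratings[!is.na(ratings$q1) | !is.na(ratings$q2), ]

# One row per rater: number of papers rated, then mean and SD per question.
summarise_rater <- function(d) {
  data.frame(
    user_id  = d$user_id[1],
    n_papers = length(unique(d$paper_id)),
    q1_mean  = mean(d$q1, na.rm = TRUE),
    q1_sd    = sd(d$q1, na.rm = TRUE),
    q2_mean  = mean(d$q2, na.rm = TRUE),
    q2_sd    = sd(d$q2, na.rm = TRUE)
  )
}

profiles <- do.call(rbind, lapply(split(rated, rated$user_id), summarise_rater))
rownames(profiles) <- NULL
print(profiles)
```

In this sketch, rater "d" is dropped because both ratings are missing, and a rater with a single rating (here "c") gets `NA` for the standard deviation, because `sd()` is undefined for one value.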

How I built it

I wrote all of the code in R, using RStudio.

Challenges I ran into

This was my first code for a hackathon and I don't have CS experience, so the entire coding part was challenging. In addition, there is still a bug in computing the r-squared between different questions for each individual rater.
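The cause of that bug is not known here, but a per-rater r-squared commonly breaks when a rater has too few ratings or gave the same score every time, so the linear model is undefined. The sketch below guards against those cases. It is only one possible approach, and the function `rater_r2` and the columns `user_id`, `q1`, and `q2` are hypothetical names.

```r
# Hypothetical ratings; column names and values are assumptions.
ratings <- data.frame(
  user_id = c("a", "a", "a", "a", "b", "b", "b", "c"),
  q1      = c(1, 2, 3, 4, 3, 3, 3, 5),
  q2      = c(2, 4, 5, 8, 1, 4, 2, 5)
)

# R-squared of lm(q2 ~ q1) for one rater, or NA when it is undefined.
rater_r2 <- function(d) {
  d <- d[complete.cases(d[, c("q1", "q2")]), ]
  # Need at least 3 points and some spread in both questions;
  # with 2 points the fit is trivially perfect, with 0 variance it fails.
  if (nrow(d) < 3 || var(d$q1) == 0 || var(d$q2) == 0) {
    return(NA_real_)
  }
  summary(lm(q2 ~ q1, data = d))$r.squared
}

# Named numeric vector: one r-squared (or NA) per rater.
r2_by_rater <- sapply(split(ratings, ratings$user_id), rater_r2)
print(r2_by_rater)
```

Returning `NA` instead of raising an error keeps one row per rater, so the result can be added as a column to the per-rater data frame.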

Accomplishments that I'm proud of

I never thought I would be able to almost finish a program in such a short time, and I really enjoyed learning a little more R for real data analysis.

What I learned

First of all, it is easy to see that questions 1 and 2 are highly correlated, but the other questions much less so. However, some individuals show a higher r-squared than the mean. There also appear to be duplicate ratings for the same user. These show up where the standard deviation equals 0, which suggests that the rater rated the same paper more than once, or that there was a bug. The number of ratings also varies widely: some raters rated as many as 80 papers, whereas others rated just one. This new data frame makes it much easier to analyze individual ratings alongside the raters' personalities.
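A standard deviation of 0 is only a hint, since a rater can also give the same score to different papers. A more direct check is to look for repeated (rater, paper) pairs. The following is a minimal sketch with made-up data, and the column names `user_id`, `paper_id`, and `q1` are assumptions.

```r
# Hypothetical ratings; rater "a" rated paper 1 twice.
ratings <- data.frame(
  user_id  = c("a", "a", "a", "b", "b"),
  paper_id = c(1, 1, 2, 1, 2),
  q1       = c(4, 4, 3, 2, 2)
)

# Flag every row whose (user_id, paper_id) pair appears more than once.
key <- ratings[, c("user_id", "paper_id")]
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
duplicates <- ratings[is_dup, ]
print(duplicates)

# Keep only the first rating for each (user_id, paper_id) pair.
deduped <- ratings[!duplicated(key), ]

# Number of distinct papers rated by each rater after deduplication.
counts <- table(deduped$user_id)
print(counts)
```

In this toy data, rater "b" has a standard deviation of 0 on `q1` without any duplicate paper, which is why checking the pairs directly is more reliable than relying on the standard deviation alone.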

What's next for Making Raters Connections (DiscoveryEngine)

Using other APIs to assess the accuracy of each rater and adding that to the data frame.

Built With

Updates

Lucas Paulo de Lima Camillo started this project — Mar 04, 2018 09:52 AM EST
