Inspiration

OpenPowerlifting is an open-source project that collects lifters' results from powerlifting competitions. Much of the data entry is manual, and contributors often make mistakes when entering data. We wanted to write a script that would automatically detect these errors.

What it does

The script scans the entire database, finds associations between data points, and uses a distance function to judge how "far apart" two data points are. This distance lets us split the data into clusters. We then go through each cluster and run classic univariate outlier detection.
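The per-cluster step can be sketched roughly as follows. This is a minimal illustration, not the code in our script: it assumes Tukey's IQR fences as the univariate test (the write-up does not say which test is used), and the names `iqr_outliers`, `outliers_by_cluster`, the `clusters` layout, and the sample entries are all hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return indices of values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: the IQR rule stands in for whichever univariate test is used.
    """
    if len(values) < 4:
        # Too few points to estimate quartiles meaningfully; flag nothing.
        return []
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def outliers_by_cluster(clusters, k=1.5):
    """Run the univariate test separately inside each cluster.

    `clusters` maps a cluster id to a list of (entry_id, value) pairs
    (a hypothetical layout chosen for this sketch).
    """
    flagged = {}
    for cluster_id, rows in clusters.items():
        values = [value for _, value in rows]
        flagged[cluster_id] = [rows[i][0] for i in iqr_outliers(values, k)]
    return flagged


# Made-up totals in kg; entry "f" is far from the rest of its cluster.
example_clusters = {
    0: [("a", 100), ("b", 105), ("c", 110), ("d", 102), ("e", 108), ("f", 500)],
    1: [("g", 60), ("h", 62), ("i", 61), ("j", 63), ("k", 59)],
}
flagged = outliers_by_cluster(example_clusters)
```

Testing each cluster separately means a value is only compared against entries that look similar to it, so a lift that is normal for one group is not flagged just because it is unusual for the database as a whole.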

How we built it

We used nltk's clustering algorithm because it allows us to supply our own distance function. Standard k-means clustering uses Euclidean distance. However, Euclidean distance is not meaningful for categorical variables. Our distance function assigns a distance of 1 to each categorical variable that does not match and uses Euclidean distance for the numerical variables.
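A distance function of this kind might look like the sketch below. The column positions and the way the two parts are combined (adding each mismatch as a squared difference of 1 before taking the square root) are assumptions made for illustration; the write-up does not specify how the script combines them.

```python
import math

# Hypothetical column layout for this sketch: (sex, equipment, bodyweight, total).
CATEGORICAL_COLUMNS = (0, 1)
NUMERIC_COLUMNS = (2, 3)


def mixed_distance(u, v, categorical=CATEGORICAL_COLUMNS, numeric=NUMERIC_COLUMNS):
    """Distance between two rows with categorical and numerical columns.

    Each categorical column contributes 1 if the values differ and 0 if they
    match; numerical columns contribute their squared difference, as in
    ordinary Euclidean distance.
    """
    mismatches = sum(1 for i in categorical if u[i] != v[i])
    squared_diffs = sum((u[i] - v[i]) ** 2 for i in numeric)
    return math.sqrt(mismatches + squared_diffs)


same_categories = mixed_distance(("M", "Raw", 3.0, 4.0), ("M", "Raw", 0.0, 0.0))
one_mismatch = mixed_distance(("M", "Raw", 0.0, 0.0), ("F", "Raw", 0.0, 0.0))
```

A function with this two-argument shape can be passed as the `distance` argument of `nltk.cluster.KMeansClusterer`. Since nltk works on numerical vectors, the categorical columns would need to be encoded as numbers first; the equality comparison above works the same way on encoded values.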

Challenges we ran into

The database includes categorical variables such as sex, equipment used, and age division. This means that our data is mixed: partly categorical and partly numerical. Many data science and machine learning libraries are primarily meant for numerical-only data, and we needed a way to work around this. Since nltk's clustering library requires numerical data, we serialized all categorical variables into ASCII-number representations.
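One way such a serialization could work is shown below. This is only one possible reading of "ASCII-number representation", and `encode_category` and `decode_category` are hypothetical helpers, not functions from the script. The idea is to pack a label's ASCII bytes into a single integer, so that equal labels always get equal numbers and different labels get different numbers.

```python
def encode_category(label):
    """Pack the ASCII bytes of a label into one integer, e.g. "M" -> 77."""
    return int.from_bytes(label.encode("ascii"), "big")


def decode_category(code):
    """Recover the original label from its integer code."""
    length = (code.bit_length() + 7) // 8
    return code.to_bytes(length, "big").decode("ascii")


sex_code = encode_category("M")  # 77, the ASCII code of "M"
round_trip = decode_category(encode_category("Raw"))  # "Raw"
```

The resulting numbers carry no ordering or magnitude information, so they should only be compared for equality. Feeding them directly into Euclidean distance would treat some categories as arbitrarily "closer" than others.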

Accomplishments that we're proud of

We didn't have time to optimize our algorithm, so it currently tags about a quarter of the database as outliers. However, we are proud that the project serves as a proof of concept.

What we learned

We learned new statistical concepts and some of the possible limitations of a database, mostly by speaking with Dr. Foaad Khosmood, a Cal Poly professor.

What's next for Multivariate Outlier Detection for OpenPowerlifting

After documenting and cleaning up the code and optimizing the algorithm, we're going to submit the changes as a pull request to the OpenPowerlifting project. We hope that the open-source and powerlifting communities will enjoy our contribution!

Available at http://github.com/LucasYoung/openpowerlifting (path: scripts/find-outliers.py)
