big-dog

1) Introduction:

Researchers often work with large, potentially high-dimensional data sets. The first step in processing such a data set is usually the same: visualize the data and perform some basic exploratory analysis (e.g. plot the distributions, check for normality, and see whether the data cluster). This time-intensive process is repeated for every new data set. Furthermore, many researchers are unfamiliar with the necessary statistics and must hire outside help for this relatively simple task.

We have automated this visualization process with an interactive web application. The user imports a data set, and the application renders a set of interactive plots to explore, saving the time that would otherwise go into producing them by hand.

2) Acceptable Data Types:

Uploaded data must have patients/trials/samples in rows and categories/dimensions in columns.
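As an illustration of this layout, here is a minimal Python sketch that parses a small table with one sample per row and one dimension per column. The CSV format, the header row, the row-name first column, and the `load_table` helper are all assumptions made for the example; the project does not state which file formats it accepts.

```python
import csv
import io

def load_table(text):
    """Parse CSV text with a header row and a row-name first column.

    Returns (row_names, column_names, values), where values[i][j] is the
    measurement for sample i in dimension j. (Hypothetical helper.)
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    column_names = header[1:]          # first header cell labels the row names
    row_names, values = [], []
    for record in reader:
        if not record:                 # skip blank lines
            continue
        row_names.append(record[0])
        values.append([float(cell) for cell in record[1:]])
    return row_names, column_names, values

# Made-up example data: three samples (rows), two dimensions (columns).
example = """sample,height,weight
p1,170.0,65.5
p2,182.5,80.0
p3,158.0,52.3
"""
rows, cols, data = load_table(example)
```

Here `rows` holds the sample names, `cols` the dimension names, and `data` one list of numbers per sample.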

3) Data Processing:

I) Data: Displays the imported data as a data table. The user can sort each column in ascending or descending order and can search by row name.

II) Marginal Distributions: Visualizes the marginal distribution of each column in the data set. The user can choose between a histogram, a kernel density estimate, or a combination of both.
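To show what the two views compute, here is a rough Python sketch of an equal-width histogram and a Gaussian kernel density estimate for a single column. The bin count, the Gaussian kernel, the fixed bandwidth, and the function names are assumptions chosen for illustration, not the settings the application uses.

```python
import math

def histogram(values, bins):
    """Count values in `bins` equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0    # avoid zero width for constant data
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)   # max value goes in last bin
        counts[index] += 1
    return counts

def gaussian_kde(values, bandwidth):
    """Return a function estimating the density as an average of Gaussian kernels."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm
    return density

# Made-up column of measurements.
column = [1.0, 1.2, 1.9, 2.1, 2.2, 3.5, 4.0]
counts = histogram(column, bins=3)
density = gaussian_kde(column, bandwidth=0.5)
```

The "combination" view would overlay `density` on the histogram after rescaling the counts to a density.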

III) Outlier Analysis: Computes the Mahalanobis distance of each data point from the mean of the uploaded data. Assuming the data are multivariate normal, the squared distances are approximately chi-squared distributed with degrees of freedom equal to the number of dimensions in the data set. Data points sufficiently far from the mean are rejected as outliers. The user can set the p-value threshold between 0 and 0.1. Rejected samples are displayed dynamically.
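The test can be sketched in a few lines of Python. To stay dependency-free, this example is restricted to two-dimensional data: the 2x2 covariance matrix is inverted in closed form, and the chi-squared tail probability with two degrees of freedom is exactly exp(-d²/2). The function name, the sample data, and the use of the sample covariance (denominator n - 1) are assumptions; the application handles any number of dimensions.

```python
import math

def mahalanobis_outliers_2d(points, alpha):
    """Flag 2-D points whose chi-squared p-value falls below `alpha`.

    Assumes exactly two columns, so the covariance matrix can be inverted
    in closed form and the chi-squared tail is exp(-d2 / 2).
    Returns one (d2, p_value, is_outlier) tuple per point.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix entries (denominator n - 1).
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    results = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # Squared Mahalanobis distance: d^T * inverse(S) * d.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        p_value = math.exp(-d2 / 2)    # chi-squared survival function, 2 d.o.f.
        results.append((d2, p_value, p_value < alpha))
    return results

# Made-up data: eleven points near the unit square plus one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5),
          (0.2, 0.8), (0.8, 0.2), (0.3, 0.3), (0.7, 0.7), (0.5, 0.1),
          (0.1, 0.5), (10.0, 10.0)]
results = mahalanobis_outliers_2d(points, alpha=0.05)
flagged = [i for i, (_, _, is_outlier) in enumerate(results) if is_outlier]
```

With this data only the last point, at (10, 10), is rejected at a threshold of 0.05. Lowering `alpha` makes the test stricter about what counts as an outlier.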

IV) Correlation Analysis: Computes the Pearson correlation coefficient between each pair of columns in the data set. This is the first step in searching for relationships between variables. The coefficients are visualized as a lower-triangular heatmap.
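For reference, the values behind such a heatmap could be computed as in the following Python sketch. It stores only the pairs below the diagonal, since the matrix is symmetric. The function names and sample columns are made up for the example, and leaving out the diagonal is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def lower_triangle_correlations(columns):
    """Map (i, j) with i > j to the Pearson coefficient of columns i and j."""
    return {(i, j): pearson(columns[i], columns[j])
            for i in range(len(columns)) for j in range(i)}

# Made-up columns, each holding one dimension across four samples.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with the first column
    [4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated with the first column
]
corr = lower_triangle_correlations(columns)
```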

V) Mean Vector: Computes the mean of the data in each column and displays it as either a scatter plot, a scatter plot with error bars of plus or minus one standard error, a box-and-whisker plot with outliers shown as points, or a violin plot.
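The error-bar variant comes down to a mean and a standard error per column. Here is a small Python sketch, assuming the standard error is the sample standard deviation divided by the square root of the sample count; `column_summary` and the sample values are hypothetical.

```python
import math
import statistics

def column_summary(column):
    """Return (mean, lower, upper), with bars at plus or minus one standard error."""
    mean = statistics.mean(column)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(column) / math.sqrt(len(column))
    return mean, mean - sem, mean + sem

# Made-up column of eight measurements.
mean, lower, upper = column_summary([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Applying `column_summary` to every column gives the mean vector along with the error-bar endpoints for each entry.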

VI) Clustering: Performs hierarchical clustering and plots the result. To allow visualization of high-dimensional data, the data are first projected onto two dimensions using principal component analysis. The reduced data are then plotted and colour-coded by cluster. The user can choose how many clusters to visualize.
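As a rough illustration of the clustering step only, the Python sketch below performs agglomerative (bottom-up) hierarchical clustering and stops once the requested number of clusters remains. Average linkage with Euclidean distance is an assumption, since the linkage method the application uses is not stated, and the principal component projection is not shown. The function name and sample points are made up.

```python
import math

def hierarchical_clusters(points, k):
    """Merge the two closest clusters (average linkage) until k remain.

    Returns a list of cluster labels, one per point.
    """
    clusters = [[i] for i in range(len(points))]   # start with singletons

    def linkage(a, b):
        # Average Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)                       # closest pair, with i < j
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                            # j > i, so index i is unaffected

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for index in members:
            labels[index] = label
    return labels

# Made-up 2-D data with two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = hierarchical_clusters(points, k=2)
```

Changing `k` corresponds to the user choosing how many clusters to display. The labels would then drive the colour coding of the projected points.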

Submitted to

HopHacks - Fall 2015
- Winner Best Use of AWS

Created by

I worked on connecting the client side web page to the node.js backend as well as coordinating data flow and organization between node.js and shiny-server.

Alec
Statistics; Backend design; Data visualization

ikuznet1
Back-end (deploying the website on AWS, running node.js server, integrating shiny app)
Front-end (Integrating Bootstrap and Angular)

Stephanie Chew
Ben Shapiro

Updates

ikuznet1 started this project — Sep 13, 2015 08:20 AM EDT
