1) Introduction:

Researchers often work with large, potentially high-dimensional data sets. The first step in processing such a data set is almost always the same: visualize the data and perform some basic exploratory analysis (e.g. plot the distributions, check for normality, and look for clusters). This time-intensive process is repeated for each new data set that is received. Furthermore, many researchers are unfamiliar with the necessary statistics and must hire outside help for this relatively simple task.

We have automated the visualization process via an interactive web application. The user first imports a data set, and the application then renders a set of interactive plots which the user can explore. This saves the user the time that would otherwise be spent repeating the same exploratory steps for every new data set.

2) Acceptable Data Types:

Uploaded data must have patients/trials/samples in rows and categories/dimensions in columns.
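As an illustration of this layout, the sketch below parses a small table with one sample per row and one dimension per column. The CSV format, the column names, and the `load_table` helper are assumptions made for the example; the source does not state which file formats the application accepts.

```python
import csv
import io

# Hypothetical example file: one row per sample, one column per dimension.
EXAMPLE_CSV = """sample,height,weight,age
patient_1,170.2,65.1,34
patient_2,182.5,80.3,41
patient_3,165.0,58.7,29
"""

def load_table(text):
    """Parse CSV text into (row_names, column_names, values)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    column_names = header[1:]  # the first header cell labels the row-name column
    row_names, values = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(f"row {row[0]!r} does not match the header width")
        row_names.append(row[0])
        values.append([float(cell) for cell in row[1:]])
    return row_names, column_names, values

row_names, column_names, values = load_table(EXAMPLE_CSV)
```

Here `row_names` holds the sample identifiers and each entry of `values` is one sample's measurements across all dimensions.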

3) Data Processing:

I) Data:

Displays the imported data as a data table. The user can sort each column in ascending or descending order and can search by row name.

II) Marginal Distributions:

Visualizes the marginal distribution of each column in the data set. The user can choose between a histogram, a kernel density estimate, or a combination of both.
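The two views of a marginal distribution can be sketched as follows. This is only an illustration using NumPy and SciPy on synthetic data; the bin count, the Gaussian kernel, and the `marginal_summary` helper are assumptions, not the application's actual settings.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic stand-in for an uploaded table: 200 samples (rows) x 3 dimensions (columns).
data = rng.normal(loc=[0.0, 5.0, -2.0], scale=[1.0, 2.0, 0.5], size=(200, 3))

def marginal_summary(column, bins=20, grid_size=100):
    """Return a density-normalized histogram and a KDE curve for one column."""
    density, edges = np.histogram(column, bins=bins, density=True)
    grid = np.linspace(column.min(), column.max(), grid_size)
    kde = gaussian_kde(column)(grid)  # Gaussian kernel, default bandwidth
    return density, edges, grid, kde

density, edges, grid, kde = marginal_summary(data[:, 1])
```

Because the histogram is normalized to a density, it can be overlaid on the kernel density estimate on the same axis, which corresponds to the combined view.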

III) Outlier Analysis:

Computes the Mahalanobis distance of each data point from the mean of the uploaded data. Assuming the data are multivariate normal, the squared distances are approximately chi-squared distributed with degrees of freedom equal to the number of dimensions in the data set. Data points sufficiently far from the mean are rejected as outliers. The user can set the p-value threshold between 0 and 0.1. Rejected samples are displayed dynamically.
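A minimal sketch of this test, assuming the sample mean and sample covariance are used as estimates (the source does not say how they are computed). The function name `mahalanobis_outliers` and the planted outlier are purely illustrative.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(data, p_value=0.05):
    """Flag rows whose squared Mahalanobis distance exceeds the chi-squared cutoff."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    cov = np.cov(data, rowvar=False)   # sample covariance (assumption)
    cov_inv = np.linalg.pinv(cov)      # pseudo-inverse tolerates a singular covariance
    # Squared distance for each row i: centered[i] @ cov_inv @ centered[i]
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    # Degrees of freedom equal the number of dimensions (columns).
    cutoff = chi2.ppf(1.0 - p_value, df=data.shape[1])
    return d2, d2 > cutoff

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 3))
data[0] = [10.0, 10.0, 10.0]           # planted outlier for illustration
d2, is_outlier = mahalanobis_outliers(data, p_value=0.01)
```

A smaller p-value raises the cutoff, so fewer samples are rejected.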

IV) Correlation Analysis:

Computes the Pearson correlation coefficient between each pair of columns in the data set. This is the first step in searching for relationships between variables. The coefficients are visualized via a lower triangular heatmap.
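The values behind such a heatmap can be sketched as below. Masking the diagonal as well as the upper triangle is an assumption of this example, since the source does not say whether the diagonal is drawn; `lower_triangle_correlation` is a hypothetical helper name.

```python
import numpy as np

def lower_triangle_correlation(data):
    """Pearson correlation matrix with everything except the strict lower triangle set to NaN."""
    corr = np.corrcoef(data, rowvar=False)                # columns are the variables
    mask = np.tril(np.ones_like(corr, dtype=bool), k=-1)  # strictly below the diagonal
    return np.where(mask, corr, np.nan)

rng = np.random.default_rng(2)
x = rng.normal(size=200)
# Column 1 is a linear function of column 0; column 2 is independent noise.
data = np.column_stack([x, 2.0 * x + 1.0, rng.normal(size=200)])
lower = lower_triangle_correlation(data)
```

Since the correlation matrix is symmetric, the lower triangle carries every pairwise coefficient exactly once.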

V) Mean Vector:

Computes the mean of the data in each column and displays it as either a scatter plot, a scatter plot with error bars of plus or minus one standard error, a box and whisker plot with outliers shown as points, or a violin plot.
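The quantities behind the error-bar plot can be sketched as follows. The sketch assumes the standard error is the sample standard deviation (with n - 1 in the denominator) divided by the square root of the number of rows; the helper name is hypothetical.

```python
import numpy as np

def column_means_with_se(data):
    """Return the per-column mean and standard error of the mean."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    means = data.mean(axis=0)
    se = data.std(axis=0, ddof=1) / np.sqrt(n)  # sample std / sqrt(n)
    return means, se

data = np.array([[1.0, 10.0],
                 [2.0, 10.0],
                 [3.0, 10.0],
                 [4.0, 10.0]])
means, se = column_means_with_se(data)
# Error bars would span means - se to means + se for each column.
```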

VI) Clustering:

Performs hierarchical clustering. The user can visualize the results of the clustering in the included plot. To allow for visualization of high-dimensional data, principal component analysis is first conducted to project the data onto two dimensions. This reduced data is then plotted and colour coded by cluster membership. The user can choose how many clusters to visualize.
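A minimal sketch of this step is given below, under several assumptions the source does not confirm: Ward linkage is used, clustering runs on the original (unreduced) data, and the projection is only used for plotting. The function name `cluster_and_project` is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_and_project(data, n_clusters=2):
    """Hierarchically cluster the rows and project them onto the first two principal components."""
    data = np.asarray(data, dtype=float)
    # Ward linkage is an assumption; the linkage method is not stated in the source.
    tree = linkage(data, method="ward")
    # Cut the tree so that at most n_clusters flat clusters remain.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    # PCA via SVD of the centered data; rows of vt are the principal axes.
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:2].T
    return labels, projected

rng = np.random.default_rng(3)
# Two well-separated synthetic groups in five dimensions.
group_a = rng.normal(loc=0.0, scale=0.5, size=(30, 5))
group_b = rng.normal(loc=5.0, scale=0.5, size=(30, 5))
data = np.vstack([group_a, group_b])
labels, projected = cluster_and_project(data, n_clusters=2)
```

Plotting `projected[:, 0]` against `projected[:, 1]` and colouring each point by its entry in `labels` gives the kind of two-dimensional cluster view described above.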
