Center & scale molecular dynamics coordinates for PCA

Inspiration

At the bitsxLaMarato hackathon, Ivonne Westermaier proposed a challenge: how to use principal component analysis (PCA) to analyze molecular dynamics trajectories of the CLCN5 protein, whose mutations are involved in Dent disease.

During her presentation, we spotted a potential way to improve the quality of the PCA results: centering and scaling the variables. In PCA, when variables have very different variances (even if they share the same units), they should be scaled so that the variables with the largest variance do not mask the effect of those with less.
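As a rough illustration of centering and scaling, the sketch below standardizes each column of a small table to zero mean and unit variance. The function name `standardize_columns` and the toy numbers are made up for this example, and using the population standard deviation is an assumption rather than a description of the project's script.

```python
from statistics import mean, pstdev


def standardize_columns(table):
    """Center each column to zero mean and scale it to unit variance."""
    columns = list(zip(*table))
    means = [mean(col) for col in columns]
    # Population standard deviation; an assumption made for this sketch.
    stdevs = [pstdev(col) for col in columns]
    scaled = []
    for row in table:
        scaled.append([
            (value - m) / s if s > 0 else 0.0  # constant columns become 0
            for value, m, s in zip(row, means, stdevs)
        ])
    return scaled


# Toy data (invented): three frames, two coordinates with very different spreads.
frames = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = standardize_columns(frames)
```

After standardization both columns carry the same weight, so the second one no longer dominates simply because its raw values spread over a wider range.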

What it does

The project extracts the frames of the simulation in a text-readable format (PDB), reads the alpha carbon coordinates of each frame, and builds a table. A PCA is then run on that table, yielding a different set of eigenvectors and eigenvalues from the one obtained without centering and scaling.
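The table-building step could look roughly like the sketch below. It is written in Python for brevity and assumes that all frames sit in one multi-model PDB file delimited by MODEL/ENDMDL records, which may not match how the frames were actually exported. The names `read_ca_table` and `pdb_atom_line` are hypothetical, and the sample coordinates are invented.

```python
def read_ca_table(pdb_lines):
    """Return one row per frame, each row a flat list [x1, y1, z1, x2, ...]."""
    table, row = [], []
    for line in pdb_lines:
        record = line[0:6].strip()
        # PDB is a fixed-column format: atom name in columns 13-16,
        # x, y, z in columns 31-38, 39-46 and 47-54.
        if record == "ATOM" and line[12:16].strip() == "CA":
            row.extend(float(line[start:start + 8]) for start in (30, 38, 46))
        elif record == "ENDMDL":
            table.append(row)
            row = []
    if row:  # single-frame input without MODEL/ENDMDL records
        table.append(row)
    return table


def pdb_atom_line(serial, name, res_seq, x, y, z):
    """Helper for this example only: format a column-aligned ATOM record."""
    return "ATOM  %5d %-4s %3s A%4d    %8.3f%8.3f%8.3f" % (
        serial, name, "ALA", res_seq, x, y, z)


# Invented two-frame sample with two residues (backbone N and CA atoms only).
sample = []
for frame, shift in enumerate((0.0, 0.5), start=1):
    sample.append("MODEL     %4d" % frame)
    sample.append(pdb_atom_line(1, " N  ", 1, 0.0, 0.0, 0.0))
    sample.append(pdb_atom_line(2, " CA ", 1, 1.0 + shift, 2.0 + shift, 3.0 + shift))
    sample.append(pdb_atom_line(3, " N  ", 2, 9.0, 9.0, 9.0))
    sample.append(pdb_atom_line(4, " CA ", 2, 4.0 + shift, 5.0 + shift, 6.0 + shift))
    sample.append("ENDMDL")

table = read_ca_table(sample)
```

Each row of `table` is one frame and each column is one coordinate of one alpha carbon, which is the layout a PCA routine expects (observations in rows, variables in columns).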

How we built it

We used C++ programs to parse and extract the data, and a Python script to run the PCA and generate the eigenvectors and eigenvalues.

Challenges we ran into

The simulation data files are large, and we ran into several disk space quota limits. We also hit CPU usage quota limits when trying to run some of the scripts for longer than a few minutes. All of these issues came from running everything on MareNostrum's CTE Power9 cluster.

Accomplishments that we're proud of

With our approach, more eigenvectors are needed to explain the same share of the variance in the variables. We interpret this as the eigenvectors capturing more information about the simulation's trajectory.
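One way to make that comparison concrete is to count how many eigenvalues are needed to reach a given share of the total variance. The sketch below assumes a 90% threshold, and both eigenvalue lists are invented for illustration; they are not results from the CLCN5 simulations. The function name `components_needed` is hypothetical.

```python
def components_needed(eigenvalues, threshold=0.9):
    """Count the leading eigenvalues needed to reach a share of total variance."""
    total = sum(eigenvalues)
    running = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running / total >= threshold:
            return count
    return len(eigenvalues)


# Invented spectra, for illustration only.
raw_spectrum = [8.0, 1.5, 0.3, 0.2]     # one dominant component
scaled_spectrum = [3.5, 3.0, 2.0, 1.5]  # variance spread more evenly

raw_count = components_needed(raw_spectrum)
scaled_count = components_needed(scaled_spectrum)
```

With these made-up numbers the flatter spectrum needs more components to reach the same threshold, which is the pattern described above.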