Protein Docking Poses Consensus: Essence Ligand Encoding

Encoding of a rigid-body ligand as its 3 most distant atoms.
Ligand representatives of the top 8 poses clusters.
Evolution of RMSD as more poses of different docking algorithms are added.
A cluster cloud with its representative pose in the middle.

Inspiration

The relationship between some of the protein-protein interactions (PPIs) involved in mental illness has been established. However, most of these interactions lack an experimental structure of the complex formed by the two interacting proteins, which is why we need molecular docking simulation programs. These programs, however, cannot reliably rank the structures they produce by actual relevance. In addition, the metrics that different programs use to score their predictions are not comparable with one another.

Consensus algorithms that address this problem exist, but they do not scale and cannot handle large numbers of docking poses. Therefore, there is a strong need for efficient pose consensus algorithms.

What it does

In this project we introduce the ELE (Essence Ligand Encoding) algorithm: an efficient clustering algorithm for docking poses that encodes each rigid-body ligand as its three most distant atoms. Our measurements show that ELE can reduce the execution time of such consensus algorithms by up to 99% while maintaining similar clustering accuracy.

How we built it

The core of ELE lies in how each ligand is represented. Because we only deal with rigid-body ligands, every pose of the molecule can be encoded by its position in space and its 3D rotation. Alternatively, the coordinates of the three most distant atoms of the molecule capture this information well: three non-collinear points are enough to fix the position and orientation of a rigid body.
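As an illustration of the encoding, here is a minimal Python sketch. The source does not say how the "three most distant atoms" are selected, so the greedy farthest-point heuristic below is an assumption, and the function names `pick_three_distant_atoms` and `encode_pose` are hypothetical. Because the ligand is rigid, the three atom indices can be chosen once on any pose and reused for all the others.

```python
from math import dist


def pick_three_distant_atoms(coords):
    """Greedy heuristic (an assumption, not necessarily the ELE rule).

    coords is a list of (x, y, z) tuples for one ligand pose.
    Returns the indices of three mutually distant atoms.
    """
    n = len(coords)
    if n < 3:
        raise ValueError("need at least three atoms")
    centroid = tuple(sum(c[axis] for c in coords) / n for axis in range(3))
    # 1st atom: farthest from the centroid.
    first = max(range(n), key=lambda i: dist(coords[i], centroid))
    # 2nd atom: farthest from the first one.
    second = max(range(n), key=lambda i: dist(coords[i], coords[first]))
    # 3rd atom: largest summed distance to the first two.
    third = max(
        (i for i in range(n) if i not in (first, second)),
        key=lambda i: dist(coords[i], coords[first]) + dist(coords[i], coords[second]),
    )
    return first, second, third


def encode_pose(coords, atom_indices):
    """Flatten the coordinates of the three chosen atoms into a 9-number vector."""
    return [value for i in atom_indices for value in coords[i]]
```

The heuristic runs in linear time in the number of atoms; an exhaustive search over all triplets would be cubic and is only worth it if the exact optimum matters.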

Each PDB file contains two molecules (the main protein and the ligand). Since the main protein does not move, it adds no information. And because the ligand can be encoded by its three most distant atoms, the only PDB information the clustering algorithms need is the coordinates of these three atoms: nine numbers. Flattening them, we can represent each PDB file as a single vector of 9 coordinates.

With the ELE trick, the clustering algorithms have far less data to manage and are therefore much faster.
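To make the clustering step concrete, the sketch below feeds 9-number pose vectors to scikit-learn's DBSCAN, one of the two clustering methods listed in our timing table. The `eps` and `min_samples` values are placeholders rather than the settings we used, and picking the member closest to the cluster mean as the representative pose is an assumption made for illustration; both function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_encoded_poses(encoded, eps=2.0, min_samples=3):
    """Cluster poses given as rows of 9 coordinates; returns one label per pose.

    eps and min_samples are placeholder values (assumptions).
    DBSCAN marks noise poses with the label -1.
    """
    X = np.asarray(encoded, dtype=float)  # shape (n_poses, 9)
    return DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_


def cluster_representatives(encoded, labels):
    """Map each cluster label to the index of the pose closest to the cluster mean."""
    X = np.asarray(encoded, dtype=float)
    representatives = {}
    for label in sorted(set(labels.tolist()) - {-1}):
        members = np.flatnonzero(labels == label)
        centre = X[members].mean(axis=0)
        offsets = np.linalg.norm(X[members] - centre, axis=1)
        representatives[label] = int(members[np.argmin(offsets)])
    return representatives
```

Each pose costs only nine floats, so even 120,000 poses fit in a matrix of under 10 MB.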

Challenges we ran into

We found that, after applying the ELE trick, the bottleneck of the whole algorithm was reading the PDB files and storing them whole in memory, even though we only needed 3 lines and 3 columns from each!

This is why we created an efficient method that reads only the part of each PDB file we are interested in and never allocates the whole PDB file in memory.
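The idea can be sketched as follows; this is not our actual reader. It streams a PDB file line by line, keeps only the x, y, z columns of three requested atoms, and stops as soon as all three have been seen. It assumes the atoms are identified by their PDB serial number and that each serial appears only once in the file; the function name `read_three_atoms` is hypothetical. The column slices follow the fixed-width ATOM/HETATM record layout of the PDB format.

```python
def read_three_atoms(pdb_path, wanted_serials):
    """Return the flattened x, y, z of the requested atoms, in the order given.

    Assumes serial numbers are unique within the file (an assumption).
    """
    wanted = set(wanted_serials)
    found = {}
    with open(pdb_path) as handle:
        for line in handle:  # lines are streamed, never all held in memory
            if not line.startswith(("ATOM", "HETATM")):
                continue
            serial = int(line[6:11])  # columns 7-11: atom serial number
            if serial in wanted:
                found[serial] = (
                    float(line[30:38]),  # columns 31-38: x
                    float(line[38:46]),  # columns 39-46: y
                    float(line[46:54]),  # columns 47-54: z
                )
                if len(found) == len(wanted):
                    break  # skip the rest of the file
    missing = wanted.difference(found)
    if missing:
        raise ValueError(f"atoms not found: {sorted(missing)}")
    return [value for serial in wanted_serials for value in found[serial]]
```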

Accomplishments that we're proud of

We provide two tables comparing the currently used consensus algorithm with ELE:

Time comparison between the existing implementation and ELE

Algorithm	ftdock (10,000 PDB files)	set1 (120,000 PDB files)
Existing implementation (AdaptivePELE)	1 h 30 min	?
ELE + DBSCAN (local)	1 min 16 s + 461 ms	14 min + 21.2 s
ELE + Sklearn K-Means (local)	1 min 16 s + 759 ms	14 min + 12.3 s

RMSD between most populated clusters and the reference structure by algorithm

Cluster	AdaptivePELE (RMSD)	ELE (RMSD)
1	62.5375	65.5237
2	61.2252	63.2584
3	64.9931	59.2742
4	64.2487	64.4392
5	64.1294	62.0101
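For reference, the RMSD between two poses with the same atom ordering can be computed as in the sketch below. It compares coordinates directly without superimposing the structures, on the assumption that all poses share the frame of the fixed main protein; the function name `rmsd` is illustrative and is not taken from our code.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z)."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("both poses must have the same, non-zero number of atoms")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    # Mean over atoms of the squared displacement, then square root.
    return sqrt(squared / len(coords_a))
```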

What we learned

We learned several things while doing this project:

To analyze the problem before starting to code.
A number of visualization techniques for protein structure data.
Many new bioinformatics concepts: docking, PPIs, PDB and XTC files, etc.

What's next for Protein Docking Poses Consensus: Essence Ligand Encoding

We can further reduce the running time of ELE by parallelizing the code. We also need to test the algorithm on more PPI datasets to benchmark its accuracy properly. Other clustering algorithms could be combined with the ELE trick to increase the reliability of the clusters. Finally, fine-tuning the parameters of the ELE algorithm is essential to achieve maximum accuracy and usability.