Inspiration
The relationship between some of the PPIs involved in mental illness has been established. However, most of these interactions do not have an experimental structure of the complex formed by the two interacting proteins. That's why we need molecular docking simulation programs. However, these programs are not able to sort the different structures carefully obtained by their actual relevance. In addition, the metrics used by programs to classify predictions are not comparable.
Such consensus algorithms exist, but they are not scalable, and can not handle big amounts of docking poses. Therefore, there is a strong need for efficient poses consensus algorithms.
What it does
In this project we introduce ELE (Essence Ligand Encoding) algorithm; an efficient docking poses' clustering algorithm, encoding each rigid-body ligand as its three most-distant atoms. We prove that by using ELE, the execution time of such consensus algorithms can be reduced up to 99%, maintaining the same clustering accuracy.
How we built it
The core of ELE lives in the representation of each Ligand. As we are only dealing with rigid-body ligands, all the poses of this molecule can be encoded by its position in space and its 3D rotation. Alternatively, the three most distant points of a molecule can approximate well this information.
Since PDB files encode two molecules (the main protein and the ligand), and the main protein does not move, it adds no information. Because we can encode the ligand by its three most distant atoms, we can further reduce the whole PDB information needed for the clustering algorithms as the coordinates of these three atoms. That is nine numbers. Flattening these numbers, we can represent a PDB file with only a vector of 9 coordinates.
Using the ELE trick, the clustering algorithms have far fewer data to manage and therefore are much faster.
Challenges we ran into
We found that after applying the ELE trick, the bottleneck of the whole algorithm was the process of reading and storing in memory the whole PDB files --but we only needed 3 lines and 3 columns from each!
This is why we created an efficient method to read only the part of the PDB files we are interested in and do not allocate in memory the whole PDF file.
Accomplishments that we're proud of
We provide two tables for comparisons between the currently used consensus algorithm and ELE:
Time comparative between the existing implementation and ELE
| Algorithm | ftdock (10.000 PDB) | set1 (120.000 PDB) |
|---|---|---|
| Existing implementation (AdaptivePELE) | 1h 30min | ? |
| ELE + DBSCAN (local) | 1min 16sec + 461ms | 14min + 21.2s |
| ELE + Sklearn K-Means (local) | 1min 16s + 759ms | 14min + 12.3s |
RMSD between most populated clusters and the reference structure by algorithm
| Cluster | AdaptivePELE (RMSD) | ELE |
|---|---|---|
| 1 | 62.5375 | 65.5237 |
| 2 | 61.2252 | 63.2584 |
| 3 | 64.9931 | 59.2742 |
| 4 | 64.2487 | 64.4392 |
| 5 | 64.1294 | 62.0101 |
What we learned
We learned several things while doing this project:
- to analyze the problems before getting into coding
- a bunch of visualization techniques for protein structures data.
- a lot of new concepts related to bioinformatics: docking, PPI, PDB and XTC files, etc.
What's next for Protein Docking Poses Consensus: Essence Ligand Encoding
We can further reduce the ELE algorithm time by parallelizing the code. Besides, we need to test the algorithm with more datasets of PPIs to provide an accurate benchmarking of the ELE algorithm accuracy. Other clustering algorithms could be applied with the ELE trick to increase the reliability of the clusters. Fine tuning of the ELE algorithm parameters is a must to achieve maximum accuracy and usability.
Built With
- biopandas
- linecache
- py3dmol
- python
- sklearn


Log in or sign up for Devpost to join the conversation.