The human body relies on a wide variety of proteins to perform many functions. Unfortunately, proteins can also cause great damage when they malfunction. To combat this and improve public health, we must better understand proteins through their unique characteristics. Thousands of proteins remain uncharacterized in both structure and function. This knowledge gap between an illness and its corresponding malfunctioning protein leaves us poorly equipped to understand protein-related diseases.

To help address this issue, machine learning algorithms have been developed that analyze amino acid sequences to elucidate potential structural features. One problem with these contemporary methods is their reliance on greedy algorithms, which may not yield the best possible fold recognition results. Bioinformaticians have recently published work recommending the mean-shift algorithm as a substitute for current programs, and have reported improved results. We apply this approach to protein sequences to better understand domains/folds (characteristic protein features), in an attempt to improve current protein knowledge.
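To make the idea behind mean shift concrete, here is a minimal, dependency-free sketch of the algorithm with a flat kernel. It is not the code used in this project: the function names, the toy 2-D points, and the bandwidth value are all assumptions chosen for illustration. Each point is repeatedly moved to the mean of its neighbours until it settles on a density peak, so the number of clusters does not have to be fixed in advance.

```python
import math


def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Return the cluster centers (modes) found by flat-kernel mean shift."""
    modes = []
    for start in points:
        current = list(start)
        for _ in range(max_iter):
            # Flat kernel: every original point within `bandwidth` counts equally.
            near = [p for p in points if distance(p, current) <= bandwidth]
            if not near:
                break
            new = [sum(coord) / len(near) for coord in zip(*near)]
            moved = distance(new, current)
            current = new
            if moved < tol:
                break
        modes.append(current)

    # Merge modes that converged to (nearly) the same place.
    centers = []
    for mode in modes:
        if not any(distance(mode, c) <= bandwidth for c in centers):
            centers.append(mode)
    return centers


# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers = mean_shift(points, bandwidth=1.0)
print(centers)
```

The bandwidth is the only tuning parameter here: a larger value merges nearby peaks into one cluster, while a smaller value splits them.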

Icarus was built with scikit-learn's Mean Shift clustering algorithm. Sequences were derived from the Pfam database. Purge from the MEME Suite was used alongside it to process the protein sequences for each fold. R's ggplot2 was used to plot the data from this project. Code was written in a Jupyter notebook with Python 3.7.
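As a rough illustration of how scikit-learn's `MeanShift` can be applied to sequences, the sketch below clusters a handful of toy sequences. The featurization (amino acid composition), the made-up sequences, and the bandwidth are assumptions for this example only; the writeup does not state how Icarus encodes sequences, and real input would come from Pfam.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import MeanShift

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the 20-dimensional amino acid frequency vector of a sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return np.array([counts[a] / total for a in AMINO_ACIDS])


# Toy sequences invented for illustration (not taken from Pfam).
sequences = [
    "AAAAAGAAAALAAAA",
    "AAAGAAAAAAAAAAA",
    "AALAAAAAGALAAAA",
    "KKKKEKKKKDKKKKK",
    "KKEKKKKKKKKKKKK",
    "KKDKKKKKEKDKKKK",
]

# One feature vector per sequence (assumed encoding).
X = np.vstack([composition(s) for s in sequences])

# The bandwidth is a guess for this toy data; it would need tuning on real data.
model = MeanShift(bandwidth=0.5)
labels = model.fit_predict(X)
centers = model.cluster_centers_

# Assign an unseen sequence to the nearest learned cluster center.
query = composition("AAAAAAGAAAAAALA").reshape(1, -1)
predicted = model.predict(query)[0]

print(labels, predicted)
```

In this sketch, the learned `cluster_centers_` play the role of fold representatives, and `predict` assigns a new sequence to the closest one.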

One challenge was that the evaluation metrics for deciding whether a sequence contained a certain fold were unclear at the beginning of the project; we worked them out after visualizing the data.

UHRF1 is a multi-domain protein that is recognized to be overexpressed in several forms of cancer. Its domains differ in structure and vary greatly in size (60 to 200 residues). Using UHRF1's domains as a test set, Icarus was able to identify all 5 domains in sequence.

If we are able to gain access to cloud computing, we would like to develop cluster centers for the current set of 650,000 protein domains. We would use this knowledge to build a thorough public resource where researchers can input poorly characterized sequences and potentially receive leads on how to learn more about these proteins. How domains interact with one another, and whether certain domains often appear together, would also be of interest from a machine learning perspective. Collaboration with protein modeling groups/programs would also be of interest, as Icarus would be able to provide structure-independent assistance to current programs that rely on information like residue distances to determine structure. Predicting certain folds within a sequence may give protein modeling a template as an aid.
