Analysis Query Form
Raw Data Download
Example Output Table
What it does
A separate data collection and standardization step was done to build the app database. This involved the automated querying of ChIP-Seq data from GEO and ENCODE, two of the largest repositories of biological data. Metadata was extracted for each study and bioinformatics processing was done to generate standardized ChIP-Seq peaks.
Due to the limited numbers of TFs and tissue types which have been experimentally profiled thus far, ChIP-IO will also include in silico predicted TF binding locations.
ChIP-IO hosts this data for direct download, but more importantly allows for custom gene regulatory analysis. Users can submit query forms containing parameters such as tissue type, p-value cutoffs for ChIP-Seq peaks, and promoter region definitions. Once a query is submitted, the back-end applies the constraints, assembles custom gene regions, and performs the operations needed to map ChIP-Seq peaks to known, gene-specific regulatory regions. The results are then returned as downloadable table files mapping TFs to genes.
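The peak-to-gene mapping described above can be sketched roughly as follows. This is a minimal illustration and not the actual ChIP-IO back-end: the `Peak` and `Gene` records, the function names, and the default promoter window and p-value cutoff are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Peak(NamedTuple):
    # Hypothetical record for one standardized ChIP-Seq peak.
    tf: str
    chrom: str
    start: int
    end: int
    pvalue: float


class Gene(NamedTuple):
    # Hypothetical record for one gene; tss is the transcription start site.
    name: str
    chrom: str
    tss: int
    strand: str  # "+" or "-"


def promoter_region(gene, upstream=1000, downstream=100):
    """Return (start, end) of a strand-aware promoter window around the TSS."""
    if gene.strand == "+":
        start, end = gene.tss - upstream, gene.tss + downstream
    else:
        # On the minus strand, "upstream" lies at higher coordinates.
        start, end = gene.tss - downstream, gene.tss + upstream
    return max(0, start), end


def map_tfs_to_genes(peaks, genes, max_pvalue=1e-5, upstream=1000, downstream=100):
    """Map each TF to the genes whose promoter overlaps one of its peaks."""
    hits = defaultdict(set)
    for gene in genes:
        p_start, p_end = promoter_region(gene, upstream, downstream)
        for peak in peaks:
            # Apply the user-supplied constraints first.
            if peak.pvalue > max_pvalue or peak.chrom != gene.chrom:
                continue
            # Half-open interval overlap test.
            if peak.start < p_end and peak.end > p_start:
                hits[peak.tf].add(gene.name)
    return {tf: sorted(names) for tf, names in hits.items()}
```

The nested loop keeps the sketch short. At genome scale, an interval index or a dedicated command line tool would likely be the more practical choice.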
How I built it
Data collection was done through Python pipelines designed to run in parallel on Linux clusters. Intermediate operations were done using common bioinformatics command line tools where appropriate. For studies sourced from ENCODE, metadata processing was done using the ENCODE API. Due to GEO's poor data standardization, GEO metadata was first mined using the Entrez API and then parsed semi-manually in Python.
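To give a feel for the semi-manual parsing step, here is a small sketch that normalizes free-text "key: value" sample annotations of the kind GEO records tend to contain. The alias table, the field names, and the `parse_characteristics` function are hypothetical and written only for illustration; they are not the pipeline's actual code.

```python
import re

# Hypothetical alias table: maps the many ways submitters label a field
# onto one standardized field name.
KEY_ALIASES = {
    "cell type": "tissue",
    "cell line": "tissue",
    "tissue": "tissue",
    "antibody": "antibody",
    "chip antibody": "antibody",
}


def parse_characteristics(lines):
    """Split free-text annotation lines into a standardized record.

    Returns (record, unparsed). Lines that cannot be mapped are returned
    in `unparsed` so they can be reviewed by hand.
    """
    record = {}
    unparsed = []
    for line in lines:
        key, sep, value = line.partition(":")
        # Lowercase the key and collapse repeated whitespace.
        key = re.sub(r"\s+", " ", key.strip().lower())
        field = KEY_ALIASES.get(key)
        if sep and field and value.strip():
            # Keep the first value seen for each standardized field.
            record.setdefault(field, value.strip())
        else:
            unparsed.append(line)
    return record, unparsed
```

Returning the unmatched lines separately is what makes the process semi-manual: the automated pass handles the common cases, and whatever is left can be inspected and used to extend the alias table.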
Challenges I ran into
Automated data collection is not nearly as simple as it sounds, especially when attempting to scale on clusters. The metadata also contains a huge number of inconsistencies, errors, and missing values, all of which have to be accounted for in the automation process. GEO data is especially hard to reconcile.
This is my first time constructing a web application truly from scratch, especially one of this magnitude. My biggest mistake was not developing inside a Python virtual environment, which is a great way to keep track of only what the application needs to run. There were also many smaller issues with figuring out how to properly deploy the application (hint: don't use Heroku for anything heavy duty), along with other aspects of cloud computing and networking that were unfamiliar to me.
Accomplishments that I'm proud of
What I learned
Quite a bit.
What's next for ChIP-IO
The biggest next step is incorporating in silico TF predictions to expand the database substantially. I also need to write up a publication to get the word out.