Inspiration

In the summer of 2024, I conducted a data-processing research project that used published journal articles to narrow down three genes as potential drug targets and prognostic markers for diffuse large B-cell lymphoma, a condition my friend was suffering from. Manually sifting through the data took an entire day, and along the way I realized there was significant potential for automation, so that the process could be quickly repeated for all types of cancer.

What it does

When given three types of data - 1) the collection of essential genes for a cancer type, 2) the collection of over-expressed genes for that cancer type, and 3) the collection of genes significantly associated with survival rates for that cancer type - my program identifies a select few genes as potential drug targets and prognostic markers.
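At its core, the idea is a three-way set intersection: a candidate must appear in all three inputs. A minimal sketch (the gene names and set contents below are purely illustrative, not actual results):

```python
# Illustrative sketch: candidates are the genes present in all three
# input sets. Gene names here are placeholders, not real findings.

essential = {"MYC", "BCL2", "TP53", "EZH2"}
overexpressed = {"MYC", "BCL2", "CCND1"}
survival_linked = {"MYC", "BCL2", "CD20"}

candidates = essential & overexpressed & survival_linked
print(sorted(candidates))  # ['BCL2', 'MYC']
```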

How we built it

For 1), I worked from Behan et al.'s paper, which describes more than 6,000 genes as potentially essential across around 20 cancer types. For each cancer type, I set a threshold: if a given gene is flagged in more than 50% of that type's cell lines, it is considered an essential gene. Then, a final list of candidate genes is compiled by combining the essential genes across all cancer types.
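The thresholding step can be sketched as follows. This is a simplified illustration assuming the screen data has already been loaded into a mapping from cell line to the set of genes flagged in that line; all names are placeholders:

```python
# Hypothetical sketch of the >50%-of-cell-lines essentiality threshold.
# `screen_hits` maps each cell line to the genes flagged in that line.

def essential_genes(screen_hits, threshold=0.5):
    """Return genes flagged in more than `threshold` of the cell lines."""
    n_lines = len(screen_hits)
    counts = {}
    for genes in screen_hits.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {g for g, c in counts.items() if c / n_lines > threshold}

lymphoma_lines = {
    "line_A": {"MYC", "BCL2", "TP53"},
    "line_B": {"MYC", "BCL2"},
    "line_C": {"MYC", "EZH2"},
}
print(sorted(essential_genes(lymphoma_lines)))  # ['BCL2', 'MYC']
```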

2) I built the pipeline by programmatically reproducing and scaling analyses that are normally performed manually through the GEPIA2 web interface, which does not expose an official API. First, I reverse-engineered the GEPIA2 backend by inspecting browser network traffic while running differential gene analysis in the web UI, capturing the hidden HTTP endpoint and recreating the request in Python with browser-like headers to reliably query TCGA cancer datasets in batch. I implemented a custom parser using BeautifulSoup to extract the HTML table embedded in JSON, normalize the values, and convert the results into Pandas DataFrames containing gene symbols, expression levels, log2 fold change, and FDR-corrected q-values. I then filtered genes using a biologically motivated threshold (log2FC > 1) to identify strongly upregulated tumor genes, and integrated these with DepMap gene essentiality data by transforming large Excel knockout matrices into cancer-specific gene sets and intersecting them with the GEPIA results.
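The final filtering-and-intersection step described above can be sketched as below. This assumes each GEPIA2 row has already been parsed into a dict; the field names and cutoffs shown (log2FC > 1, q < 0.05) are illustrative of the approach, not the exact implementation:

```python
# Hypothetical sketch: keep strongly upregulated tumor genes that pass
# the FDR cutoff, then intersect with the DepMap-derived essential set.

def candidate_targets(gepia_records, essential, log2fc_min=1.0, q_max=0.05):
    """Return upregulated, significant genes that are also essential."""
    upregulated = {
        r["symbol"]
        for r in gepia_records
        if r["log2fc"] > log2fc_min and r["qvalue"] < q_max
    }
    return upregulated & essential

records = [
    {"symbol": "MYC",  "log2fc": 2.3, "qvalue": 1e-6},
    {"symbol": "BCL2", "log2fc": 0.4, "qvalue": 0.01},  # fold change too small
    {"symbol": "EZH2", "log2fc": 1.8, "qvalue": 0.2},   # fails FDR cutoff
]
print(candidate_targets(records, {"MYC", "BCL2"}))  # {'MYC'}
```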

3) For survival analysis, I automated the GEPIA workflow that normally requires manually generating Kaplan–Meier plots and reading the log-rank p-value from the resulting PDFs. Using the official GEPIA Python client, I constructed a high-throughput pipeline that programmatically submits survival queries for each gene–cancer pair by dynamically resetting the module parameters between runs to prevent state carryover, specifying the gene signature and cancer cohort, and triggering the server-side generation of survival plots. Because the GEPIA client returns results as PDF files rather than structured data, I implemented a file-monitoring system that detects the newest PDF created after each query, extracts the log-rank p-value via text parsing, and immediately deletes the file to prevent disk accumulation when running thousands of analyses. The pipeline also includes checkpointed CSV logging and resume logic to ensure fault tolerance, as well as request throttling to avoid rate-limiting by the public academic server. This transformed a manual, single-gene workflow into an automated survival-screening system capable of evaluating large gene sets across multiple cancer types and retaining only statistically significant survival associations.
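The newest-PDF detection and p-value extraction can be sketched as below. The actual PDF-to-text conversion (e.g. via a PDF-parsing library) is assumed to have already happened; the `Logrank p=` annotation pattern is an assumption about the plot's text layout:

```python
# Hypothetical sketch of the file-monitoring and p-value parsing steps.
import re
from pathlib import Path

def newest_file_after(directory, start_time, pattern="*.pdf"):
    """Return the most recently modified matching file, or None."""
    candidates = [
        p for p in Path(directory).glob(pattern)
        if p.stat().st_mtime > start_time
    ]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)

# Assumed annotation format on the Kaplan-Meier plot, e.g. "Logrank p=0.0032".
P_VALUE_RE = re.compile(r"Logrank\s+p\s*=\s*([0-9.eE+-]+)")

def parse_logrank_p(pdf_text):
    """Pull the log-rank p-value out of a plot's extracted text."""
    match = P_VALUE_RE.search(pdf_text)
    return float(match.group(1)) if match else None

print(parse_logrank_p("Logrank p=0.0032  HR(high)=1.8"))  # 0.0032
```

After parsing, the pipeline would delete the PDF and append the result to the checkpoint CSV before moving on to the next gene-cancer pair.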

Accomplishments that we're proud of

To demonstrate the project, I ran it on around 20 types of cancer, including lymphoma, melanoma, breast cancer, and kidney chromophobe. For each cancer type, I started with more than 6,000 potentially essential genes and managed to narrow them down to 10 to 80 genes. This acts as a strong springboard for further research, as it provides convincing evidence that these few genes are potential drug targets and prognostic markers.

Challenges we ran into

One particular part of the code takes a long time to run, as it involves the heavy workload of downloading plot files and scraping numeric data from them. My current solution is to run that step last, after the earlier stages have already massively narrowed down the amount of scraping required. Since the program is meant to process larger data loads, I will keep working on ways to optimize this bottleneck.

What we learned

It was really cool to experience the capabilities of vibecoding and to experiment with shaping it using my prior coding knowledge. Beyond the hacking itself, this project gave me a great opportunity to brush up on biology: I learned about the significance of cell lines and familiarized myself with key data-analysis tools in biotech.

What's next for OncoFilter

As it stands, the project dramatically cuts down the time required to identify potential drug targets and prognostic markers in any type of cancer. Manually, it took me a day to process one cancer type, while building and demonstrating this entire project took less than 12 hours and processed around 20 cancer types.

However, there are further opportunities for automation. Currently, the project still requires users to download and filter information from published journal articles themselves. With stronger web-scraping capabilities, it may be possible to remove that step and minimize the user intervention required to successfully run the program.
