fireWolf Health

The TCGA Data Set

The Cancer Genome Atlas (TCGA) project has, since 2005, collected and published data on the genetics of cancer. These data have been used to find hundreds of genetic markers involved in numerous cancer types. The database also records patient biographical and disease data, including gender, race, self-reported frequency of alcohol and tobacco use, cancer stage and type, and survival time after diagnosis.

Survivability Data Visualisation

Our web app lets a user view survivability curves for subsets of the data, grouped by the biographical data categories the user selects.

Cancer Stage Prediction

Cancer phenotypes, including staging, are fundamentally driven by abnormal molecular activity such as gene expression. We therefore developed an interpretable, data-driven machine learning approach to predict a clinical outcome (tumor stage) in lung adenocarcinoma (LUAD) using the open TCGA dataset (1103 LUAD patients, with 20101 genes for each patient). The input to the framework is the gene expression data matrix (genes by patient samples). First, we identify gene expression features using unsupervised learning, namely linear and/or nonlinear dimensionality reduction methods such as PCA and diffusion maps. Then we split the dataset into training (75%) and test (25%) sets and apply several classification methods to predict the tumor stage. In our results, these machine learning methods predicted the tumor stage with more than 90% accuracy.
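The split-and-classify step can be sketched as below. This is only an illustration under stated assumptions: the samples are taken to be feature vectors that have already been reduced (for example, a few PCA components per patient), and a nearest-centroid classifier stands in for the classification methods, which the text above does not name. All function names are hypothetical.

```python
import random

def split_train_test(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle sample indices and hold out test_fraction of them for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [samples[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)

def fit_centroids(samples, labels):
    """Compute the mean feature vector of each class (stand-in classifier)."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Return the label of the centroid closest to x (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

def accuracy(centroids, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(centroids, x) == y for x, y in zip(samples, labels))
    return hits / len(labels)
```

A typical use would be to call split_train_test on the reduced features and stage labels, fit the centroids on the training part only, and report accuracy on the held-out 25%.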

Our Technology Stack

The client app is written with jQuery and uses the Firebase and Google Charts libraries. A menu lets the user select population subtypes; when the user makes a selection, a request is posted to our Firebase Realtime Database. Because the Firebase REST API, which our Python server uses to talk to the database, did not let us listen for events, the client also pings the server's known location to alert it of changes, which the server then reads from Firebase.
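The server-side read can be sketched as below. The Firebase Realtime Database REST API exposes each database location as a URL ending in .json, so a plain HTTP GET returns the stored value as JSON. Everything else here is an assumption for illustration: the function names, the database URL and the path layout are hypothetical, and authentication is left out.

```python
import json
from urllib.request import urlopen

def request_url(base_url, path):
    """Build the REST URL for a database location (the path plus '.json')."""
    return f"{base_url.rstrip('/')}/{path.strip('/')}.json"

def read_request(base_url, path, fetch=None):
    """Fetch and decode the value stored at a database location.

    `fetch` takes a URL and returns the raw response body; it defaults to
    a plain HTTP GET and can be swapped out for testing.
    """
    if fetch is None:
        def fetch(url):
            with urlopen(url, timeout=10) as response:
                return response.read()
    return json.loads(fetch(request_url(base_url, path)))
```

When the client pings the server, the request handler could call read_request with the database URL and the location the client wrote to, then compute a response from the decoded selection.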

We use Google App Engine to run a Python web server that responds to client requests with either the points of the computed survivability curve or the results of running our prediction model on user-submitted data for a single patient. The TCGA data, although offered openly on the TCGA website, is never exposed to the client.
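One common way to compute the points of a survivability curve is the Kaplan-Meier estimator. The sketch below assumes that method; the description above does not say which estimator the server uses, and the function and parameter names are hypothetical. Each patient contributes a duration and a flag saying whether death was observed (True) or the patient was censored, that is, still alive at last follow-up (False).

```python
from collections import Counter

def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with an observed death."""
    deaths = Counter(t for t, died in zip(durations, observed) if died)
    leaving = Counter(durations)  # patients leaving the risk set at each time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients who survive time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients drop out of the risk set after t.
        at_risk -= leaving[t]
    return curve
```

For example, with durations [1, 2, 2, 3, 4] and observed flags [True, True, False, True, False], the curve has points at times 1, 2 and 3 with survival probabilities of roughly 0.8, 0.6 and 0.3.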

When the client is notified of an update to its survivability data request, it reads data consisting of pairs:

(time since diagnosis, probability of having survived for this long)

From these pairs, we use the Google Charts library to plot a survivability graph, which updates the page DOM in place rather than causing the page to reload.
