DoctorSpark

Inspiration

Our team, consisting of Hamid Mushtaq, Hani Alers and Zaid Alars from the Computer Engineering Lab of TU Delft, has been working on the DoctorSpark application to address computational challenges in DNA analysis. With the increasing amount of DNA information being used to diagnose genetic diseases, DNA analysis time is becoming a bottleneck to using genomics in the field to treat patients. We implemented a Spark-based framework that makes it easy to divide and distribute DNA analysis computations across a large, scalable infrastructure.

What it does

The Spark framework we developed enables efficient utilization of the computational resources available to the user. It does this by dividing the large input datasets into chunks and running the analysis on each of these chunks in parallel.
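The divide-and-run-in-parallel idea can be illustrated with a minimal sketch. This is not the DoctorSpark code: it is plain Python rather than Scala and Spark, it uses a local thread pool instead of a cluster, and the names `split_into_chunks`, `analyze_chunk` and `analyze_in_parallel` are hypothetical. The per-chunk "analysis" (counting G and C bases) is only a placeholder for a real DNA analysis step.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Split a list of records into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def analyze_chunk(chunk):
    """Placeholder analysis (an assumption): count G and C bases in one chunk."""
    return sum(read.count("G") + read.count("C") for read in chunk)


def analyze_in_parallel(records, chunk_size, max_workers=4):
    """Analyze every chunk independently, then combine the partial results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_results = list(pool.map(analyze_chunk, chunks))
    return sum(partial_results)


if __name__ == "__main__":
    reads = ["ACGT", "GGCC", "ATAT"]
    print(analyze_in_parallel(reads, chunk_size=2))
```

Because each chunk is processed independently, the same structure can in principle be spread over more workers as they become available. In a Spark setting the chunks would be distributed across cluster nodes rather than across local threads.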

How I built it

The framework was built using Scala. However, existing DNA analysis programs can be used within the framework without modification. These programs are implemented in both Java and C.
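One common way to reuse an existing program without modifying it is to start it as an external process and pass it a chunk of input. The sketch below assumes that approach. The write-up does not say how DoctorSpark invokes its tools, and the helper name `run_tool_on_chunk` is hypothetical. For the demonstration, the Python interpreter itself stands in for an analysis tool.

```python
import subprocess
import sys


def run_tool_on_chunk(command, chunk_text):
    """Run an unmodified external program on one chunk of input.

    The chunk is written to the program's standard input, and whatever the
    program prints to standard output is returned. A non-zero exit status
    raises subprocess.CalledProcessError.
    """
    result = subprocess.run(
        command,
        input=chunk_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for a real analysis tool: a one-line script that upper-cases
    # its input. A real pipeline would pass the tool's own command line here.
    stand_in_tool = [
        sys.executable,
        "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())",
    ]
    print(run_tool_on_chunk(stand_in_tool, "acgt"))
```

The wrapper only depends on the tool's command line and its input and output streams, so it works the same way whether the tool is written in Java, C, or another language.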

Challenges I ran into

The DNA datasets typically used are large (on the order of hundreds of GB), so several system resources became limiting and had to be optimized. For example, we optimized memory and disk space utilization to eliminate these bottlenecks.

Accomplishments that I'm proud of

We are now able to run DNA analysis pipelines in such a way that they scale linearly with the number of available compute nodes, while efficiently utilizing the infrastructure. This brings the processing time for a realistic DNA analysis dataset to less than one hour, which makes it usable in practice.

What I learned

We learned a lot about using big data techniques to run computationally intensive pipelines with large datasets. We also learned that we are able to use these techniques to create solutions that can make a difference in practice.

What's next for DoctorSpark

We are working to integrate accelerators into the Spark framework, thereby enabling even higher performance and reducing the cost of the overall solution.