Processing and Visualizing Overlapping Genomic Information

Inspiration

Patients with rare genetic diseases often suffer from misdiagnosis or from having no diagnosis at all. Meanwhile, analysts must dig through mountains of data to find the proper diagnosis for these patients.

Scientists at HudsonAlpha are currently researching physical interactions between specific regions of the genome. The process and tools used currently produce noisy data. Project d.NA flexibly characterizes the quantity and quality of interactions measured across many runs of an experiment to help researchers separate real data from noise.

What it does

We have provided a simple, elegant, and performant solution that meets all of the criteria identified by the challenge sponsors as desirable for an ideal solution.

By keeping the solution simple, we were able to focus on making sure the solution is complete, correct, and robust.

Although this work was done in a Hackathon format, what we are able to provide is a complete, robust, production-ready solution that can be put into use by scientists right now, because cures can't wait!

How I built it

Project d.NA was built by a team of three contributors.

Daniel:

I researched the science and existing tools applicable to this challenge. We leveraged the bedmap utility from the BEDOPS project for data analysis.

Patrick:

For the visualization tool I utilized WPF and C#.

Leon:

I produced the artwork and video and provided team leadership. Using Photoshop for backgrounds and color correction, Illustrator for icon and logo creation, and a mix of Audition and Premiere, I created the video and the variety of assets used throughout the product.

Challenges I ran into

Daniel:

It took me a long time to find the right tool for the job and figure out how to use it properly. The challenge mentors directed me to create some data files of my own to explore the functionality of the available tools and make sure that our results were correct.

Patrick:

I faced several challenges when I took on the Hackathon. First, we initially wanted to view and display the data through an ElectronJS application, which would have given us multiplatform deployment. I figured the application development stack would follow the format of other stacks that I'm familiar with, Qt and WPF. However, I quickly found out that it was completely different and found myself in a rabbit hole of Google links. I then switched to WPF, but ran into the issue of rendering the data in a fast and interesting way with broad interactions. After some consideration I drew the data with a 2D rendering canvas native to WPF. While not efficient, it gets the job done. This challenge revealed that I need to learn a lower-level graphics framework such as DirectX or OpenGL to make the visualization of data better and faster.

Leon:

The largest challenge I had was communicating clear expectations for the direction of the project. Since the project was largely a matter of design, which is my expertise, the challenge was making sure those high expectations were met within the time frame of a hackathon.

Accomplishments that I'm proud of

Daniel:

I am proud that each member of the team was able to apply their unique talents to produce a simple, robust solution to this challenge.

Patrick:

I rendered a large amount of data using tools that weren't meant for it, and I finished the application.

Leon:

I provided leadership and kept people on task towards making the idea a reality. I feel good about the limited motion graphics that I provided and the copy for the voice-over work in the video.

What I learned

Daniel:

There are an incredible number of high-quality open source tools available for genetic research, and the research community is very open to sharing through open-source contributions, forum posts, and open datasets.

Patrick:

I learned that I need to study lower level libraries such as OpenGL and DirectX.

Leon:

I learned to manage the project, expertise, and expectations in an encouraging and powerful way, and to create even more dramatic and impactful graphics in a shorter span of time that will wow viewers.

What's next for Project d.NA

We look forward to seeing our work applied to the benefit of scientists and patients, and to applying more open source solutions that empower analysts to save lives and make further technological gains.

Data Processing

See the README.md file in the scripts directory for instructions on executing.

We spent many hours researching existing tools that could be used to solve this challenge. We found bedmap, one of the tools in the bedops package, to be an ideal fit. We leverage this proven and efficient tool for our data processing.

We've packaged access to this tool in some simple shell scripts for ease of use and conducted extensive testing of its fitness for purpose.
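The team's shell scripts are not reproduced here. As a rough illustration of what such a wrapper might do, the following Python sketch builds and runs a bedmap command. It assumes that the count and base-pair columns come from bedmap's `--echo`, `--count`, and `--bases` operations. That is a guess based on the output description, not a confirmed detail of the scripts. The function names are hypothetical.

```python
import shutil
import subprocess


def bedmap_command(reference_bed, map_bed):
    """Build a bedmap argument list (assumed options, not the team's actual script).

    --echo  repeats each reference interval,
    --count adds the number of overlapping map intervals,
    --bases adds the total number of overlapping bases,
    --delim switches the column separator from '|' to a tab.
    """
    return ["bedmap", "--echo", "--count", "--bases", "--delim", "\t",
            reference_bed, map_bed]


def run_bedmap(reference_bed, map_bed):
    """Run bedmap on two sorted BED files and return its output lines."""
    if shutil.which("bedmap") is None:
        raise RuntimeError("bedmap not found on PATH; install the bedops package first")
    result = subprocess.run(bedmap_command(reference_bed, map_bed),
                            check=True, capture_output=True, text=True)
    return result.stdout.splitlines()
```

Both input files must already be sorted (for example with sort-bed from the same package) before `run_bedmap` is called.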

The result is a production-ready tool that is simple enough that "even a genomics researcher can use it". Yes, these are brilliant scientists, but they are not necessarily expert programmers, so we've focused on a simple solution that can be easily used and maintained by the end user.

Output

The output is a tab-delimited BED file. The first three columns are an echo of the input data. The fourth column is the file used as a reference. The fifth column is the number of matches, and the sixth column is the total number of base pairs matched across all files. The total number of base pairs matched can be used along with the number of matches to assess the strength of the match.

Chromosome    start    end    from filename    # matches    # matching BPs
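For downstream analysis, a line of this output can be read into named fields. The sketch below assumes the six tab-separated columns listed in the header row above. The field names and the sample line are made up for illustration.

```python
from typing import NamedTuple


class OverlapRecord(NamedTuple):
    """One output line; field names are hypothetical labels for the six columns."""
    chrom: str
    start: int
    end: int
    source_file: str
    match_count: int
    matching_bases: int


def parse_output_line(line: str) -> OverlapRecord:
    """Split one tab-delimited output line and convert the numeric columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated columns, got {len(fields)}")
    chrom, start, end, source_file, match_count, matching_bases = fields
    return OverlapRecord(chrom, int(start), int(end), source_file,
                         int(match_count), int(matching_bases))


# Invented sample line, not real experiment data.
record = parse_output_line("chr1\t100\t200\tsample_01.bed\t3\t250\n")
print(record.match_count, record.matching_bases)
```

With records in this form, filtering on `match_count` and `matching_bases` together is one simple way to separate strong matches from likely noise.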

Scalability

To test the scalability, we started with 25 copies of an HG38 BED file with 432604 lines of data in each file. Testing was performed on a Linux laptop with eight 2.7GHz cores, 64GB of RAM, and an SSD.

Processing the data took less than 1 minute.

Processing 50 replicas of the same data took less than 2-1/2 minutes, resulting in an output file with over 21M lines of data.

files    # comparisons    processing time
10       4326040          19s
25       10815100         58s
50       21630200         2m 25s
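The comparison counts for 25 and 50 files equal the per-file line count multiplied by the number of copies. The short sketch below reproduces that arithmetic and derives a rough throughput from the reported times. It assumes that "# comparisons" means exactly this product. The function names are hypothetical.

```python
LINES_PER_FILE = 432_604  # lines in the HG38 BED file used for testing


def total_lines(num_copies: int) -> int:
    """Lines processed when num_copies copies of the test file are compared."""
    return num_copies * LINES_PER_FILE


def lines_per_second(num_copies: int, seconds: float) -> float:
    """Rough throughput for one run, from its wall-clock time."""
    return total_lines(num_copies) / seconds


# (copies, seconds) taken from the reported runs: 58s and 2m 25s.
for copies, seconds in [(25, 58), (50, 145)]:
    print(copies, total_lines(copies), round(lines_per_second(copies, seconds)))
```

Throughput drops somewhat between the two runs, so processing time grows slightly faster than linearly over this range.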

The performance tests can be repeated on the target machine by executing

./pertest <num copies>

The data is presorted before being parsed by the bedmap algorithm. This means that only the small portion of data that is being compared at the time needs to be loaded into memory.
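To show why sorted input keeps memory use small, here is a minimal Python sketch of a sweep over two sorted interval lists. It is not bedmap's actual implementation. It is a simplified illustration that assumes half-open BED coordinates and chromosome names sorted as plain strings, and the function name is hypothetical.

```python
def sweep_overlaps(refs, maps):
    """Yield (ref, overlap_count, overlapping_bases) for each reference interval.

    Both inputs are iterables of (chrom, start, end) tuples sorted by
    chromosome and then by start. Only map intervals that can still overlap
    the current or a later reference are kept in `window`.
    """
    window = []
    map_iter = iter(maps)
    pending = next(map_iter, None)
    for ref in refs:
        chrom, start, end = ref
        # Drop intervals that can no longer overlap this or any later reference.
        window = [m for m in window if m[0] == chrom and m[2] > start]
        # Pull in map intervals that start before this reference ends.
        while pending is not None and (pending[0], pending[1]) < (chrom, end):
            if pending[0] == chrom and pending[2] > start:
                window.append(pending)
            pending = next(map_iter, None)
        count = 0
        bases = 0
        for m in window:
            overlap = min(end, m[2]) - max(start, m[1])
            if overlap > 0:
                count += 1
                bases += overlap
        yield ref, count, bases


# Invented example intervals, not real experiment data.
refs = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
maps = [("chr1", 50, 120), ("chr1", 180, 220), ("chr1", 400, 500), ("chr2", 10, 20)]
for ref, count, bases in sweep_overlaps(refs, maps):
    print(ref, count, bases)
```

Because both inputs are read strictly in order, the sweep never needs to hold a whole file in memory, only the small window of intervals near the current position.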

If performance on the target machine is not adequate, it is most likely due to file I/O. To boost performance, upgrade to a faster hard drive. We've included the performance measurement tools here to help you justify that high speed SSD you've been wanting!

Deploying

The bedops package is a readily available open source package. To install on Ubuntu, a simple sudo apt install bedops is all that is needed. The remaining data analysis is performed using the bash shell scripting language which is pre-installed on most *nix distributions.
