Inspiration
Inspired by efforts to build Data Safe Harbors in advertising technology -- which preserve the key dimensions needed to analyze ad targeting while removing the PII -- the vision of this project is to let researchers do the important investigative work they need to do while protecting the security of the individual.
What it does
To keep the problem small enough for the time crunch, our sample code takes an enzyme reference and a sample for LacZ, which is many orders of magnitude smaller than a typical human genome. It aligns the two sequences, chunks them into codons, and encodes the mutations into 9-bit segments that can be easily concatenated and compacted into larger files for storage.
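The chunk-and-encode step can be sketched roughly as follows. This is a minimal illustration and not our prototype: it uses only the standard library rather than BioPython and bitarray, it assumes the two sequences are already aligned to equal length (with `-` for gaps), and the 9-bit layout (3 bits per base, so gaps and unknown bases fit) is an assumption, since the format was never fully designed. The names `BASE_CODES`, `encode_codon`, `encode_mutations`, and `pack_bits` are hypothetical.

```python
# Hypothetical 3-bit code per aligned symbol; this layout is an assumption,
# not the format from the prototype.
BASE_CODES = {"A": 0b000, "C": 0b001, "G": 0b010, "T": 0b011, "-": 0b100, "N": 0b101}


def codons(aligned):
    """Split an aligned sequence into consecutive 3-symbol chunks (drops a trailing partial chunk)."""
    usable = len(aligned) - len(aligned) % 3
    return [aligned[i:i + 3] for i in range(0, usable, 3)]


def encode_codon(codon):
    """Pack three symbols into one 9-bit integer, 3 bits per symbol."""
    value = 0
    for symbol in codon:
        value = (value << 3) | BASE_CODES[symbol]
    return value


def encode_mutations(reference, sample):
    """Return (codon_index, 9-bit value) for every sample codon that differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("sequences must already be aligned to equal length")
    pairs = zip(codons(reference), codons(sample))
    return [(i, encode_codon(s)) for i, (r, s) in enumerate(pairs) if r != s]


def pack_bits(values, width=9):
    """Concatenate fixed-width values into bytes, zero-padding the final byte."""
    bits = "".join(format(v, f"0{width}b") for v in values)
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


reference = "ATGACCATG"
sample = "ATGAGCATG"  # second codon mutated: ACC -> AGC
mutations = encode_mutations(reference, sample)
packed = pack_bits([value for _, value in mutations])
```

Storing only the codons that differ keeps the output small when the sample is close to the reference, but a real format would also need to record the codon positions, which this sketch keeps only in memory.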
How we built it
Currently it is a simple prototype script in Python 3 that uses BioPython and the bitarray library.
Challenges we ran into
- A lot of scope to cover for a small team; we needed to cut it down
- Unknown unknowns in the implementation details of alignment and file formats
- We did not have time to cover everything we intended:
  - Purine/pyrimidine transitions in RNA, both within and around proteins
  - Actually outputting a file in the intended format (or even designing the format in full!)
Accomplishments that we're proud of
- Learned a ton about genomics (Joe)
- Proud of depth and breadth of knowledge brought to project (Amy)
What we learned
Alignment is a much harder problem than either of us gave it credit for.
There was a lot of scope creep even for a seemingly simple problem; we could have done with more grooming during the week with the dataset.
What's next for Biochemical Characterization: An Approach to Genome Encoding
There is lots of potential fun with this on the side, and a lot of work still to do to bring our idea to fruition -- but we had a lot of fun discussing it and learning about each other's work and perspectives!
Some of our future ideas:
- As stated above, including information about RNA and purine/pyrimidine mutations would have a lot of bearing on the research that can be done and the insights that can be gained.
- Anonymized demographic data could also be included with the samples to further extend usefulness of data.
- Wrap the software in a service that allows programmatic clients to use it over HTTP with TLS -- we would love to build a gRPC service that could generate clients in multiple languages. gRPC would be great for bulk streams of genomic data.
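As a starting point for the purine/pyrimidine idea in the first item above, a substitution can be classified as a transition (purine to purine, or pyrimidine to pyrimidine) or a transversion (a change between the two classes). This is only a sketch of that standard definition, not something our prototype does yet; `classify_substitution` is a hypothetical name.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}  # U included so RNA sequences work too


def classify_substitution(ref_base, alt_base):
    """Classify a single-base change as 'match', 'transition', or 'transversion'."""
    ref_base, alt_base = ref_base.upper(), alt_base.upper()
    for base in (ref_base, alt_base):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unsupported base: {base!r}")
    if ref_base == alt_base:
        return "match"
    # Same chemical class on both sides means a transition.
    same_class = (ref_base in PURINES) == (alt_base in PURINES)
    return "transition" if same_class else "transversion"
```

A label like this could be stored alongside each encoded mutation, although where it would fit in the file format is still an open design question.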
Built With
- biopython
- ncbi-data
- python