Automated Genome Feature Discovery

Inspiration

Discovering genomic features is a task governed by some rules and by a lot of exceptions to those rules. These rules are rarely expressed all in one place, which makes it hard to understand how they play off of each other.

What it does

Depicted below is a model transcription region in which the character 'X' represents a wildcard and 'Y' represents the payload. In this example, there are two AT-rich sequences: one (the Pribnow box) appearing about 10 characters before the transcribed section, and another appearing about 35 characters before it. This structure is typical of bacterial promoters. The program also supports eukaryotes.
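To make the two-box structure concrete, here is a minimal Python sketch of a promoter scan. It is not our s(CASP) implementation. It assumes the textbook E. coli sigma-70 consensus hexamers (TATAAT for the Pribnow box, TTGACA for the -35 element) and a spacer of 15 to 19 characters between them; none of these values come from this write-up, and the names find_promoters and PromoterHit are hypothetical.

```python
from typing import List, NamedTuple, Tuple

# Assumed consensus motifs (textbook E. coli sigma-70 values, not taken
# from the project's actual rule set).
MINUS_10 = "TATAAT"  # Pribnow box
MINUS_35 = "TTGACA"  # -35 element


class PromoterHit(NamedTuple):
    minus_35_start: int  # 0-based index where the -35 hexamer starts
    minus_10_start: int  # 0-based index where the -10 hexamer starts
    mismatches: int      # total mismatches across both hexamers


def count_mismatches(window: str, motif: str) -> int:
    """Number of positions where the window differs from the motif."""
    return sum(a != b for a, b in zip(window, motif))


def find_promoters(
    seq: str,
    max_mismatches: int = 2,
    spacer_range: Tuple[int, int] = (15, 19),
) -> List[PromoterHit]:
    """Return candidate promoters: a -10 box preceded by a -35 box.

    Each hexamer may differ from its consensus by up to max_mismatches
    characters, and the gap between the two boxes must fall within
    spacer_range (inclusive). Both limits are illustrative assumptions.
    """
    seq = seq.upper()
    k10, k35 = len(MINUS_10), len(MINUS_35)
    hits: List[PromoterHit] = []
    for i in range(len(seq) - k10 + 1):
        m10 = count_mismatches(seq[i:i + k10], MINUS_10)
        if m10 > max_mismatches:
            continue
        # Look upstream for a -35 box at each allowed spacer length.
        for spacer in range(spacer_range[0], spacer_range[1] + 1):
            j = i - spacer - k35
            if j < 0:
                continue
            m35 = count_mismatches(seq[j:j + k35], MINUS_35)
            if m35 <= max_mismatches:
                hits.append(PromoterHit(j, i, m10 + m35))
    return hits


if __name__ == "__main__":
    # Synthetic example: -35 box, 17-character spacer, -10 box, payload.
    example = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GGGGGGATG"
    print(find_promoters(example, max_mismatches=0))
```

On the synthetic example, an exact-match scan reports a single hit with the -35 element starting at index 2 and the Pribnow box starting at index 25. A rule-based system such as ours goes further by layering exceptions on top of this kind of default pattern.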

How we built it

The s(CASP) program implements rules for identifying promoters in DNA sequences. Our team first compiled these rules in plain English, as seen in this document, and then converted them to s(CASP) code.

Challenges we ran into

Learning s(CASP).

Accomplishments that we're proud of

Actually getting an MVP working with a genome feature.

What we learned

s(CASP)
DNA
Transcription

What's next for Automated Genome Feature Discovery

Implement other genomic features beyond just promoters, such as enhancers. We'd also like to support other forms of input data, such as histone modifications and DNA methylation, to further refine our rules.