Inspiration

I am studying the human gut microbiome by sequencing all the bacterial DNA in people's poop. One of the experiments I'm running involves figuring out what proteins all the DNA code for, and seeing if any protein DNA sequences are relatively more abundant in the gut microbiomes of healthy people compared to the gut microbiome of people with disease. I previously built a reference library of bacterial sequences annotated with protein functions. Because most bacteria have not been sequenced, I was only able to determine what protein were encoded in the DNA for a third of my sequences. To figure out what proteins were encoded by my other sequences, I wanted to assemble my DNA into partial genomes, and query them with the SEED protein annotation database (www.theseed.org/wiki/Main_Page).

What it does

The code takes sequences which have already been assembled by Trinity (https://github.com/trinityrnaseq/trinityrnaseq/wiki), and uses the BLAST algorithm (through the command line tool http://www.ncbi.nlm.nih.gov/books/NBK279690/) to compare the sequences with the SEED protein annotation database.

How we built it

BLAST is run on the command line through bash scripts, and the amalgamation and extraction of sequence data is done with Perl. Bash scripts were also used to run the annotation individually for each microbiome sample.

Challenges we ran into

  • Some of the sequences are quite long, but it is possible for the first hundreds of hits in the SEED database to annotate the same sequence. To account for this, I took the best non-overlapping protein matches, and if any of the unmatched sequence segments were 500 nucleotides or longer, I queried them against the SEED database again.
  • I spent an inordinate amount of time debugging my code that checked if the positions of two protein matches overlapped. Flashback to my first mock interview, where I spent 45 minutes blanking on how to determine if two rectangles overlap. D:

Accomplishments that we're proud of

The code works!

What we learned

  • In Perl, use "if ((x>0 && x<10) || (y>0 && y<10))" rather than 'if ((x>0 and x<10) or (y>0 and y<10))".
  • If you are checking to see if two things overlap, don't forget to check the case where one thing is completely inside the other thing

What's next for metagenome annotation script

The BLAST is probably going to take a week to run. I need it to finish faster so I can write it up in my thesis!

Note: The relevant part of the GitHub repository linked is under the heading "Create fasta file for bowtie index".

Built With

Share this project:
×

Updates