Team 2: Identifying Bacterial Regions in the Pea Aphid Genome

Inspiration

Team 2's background in biology and bacterial genomes gave us the confidence to investigate why certain bacterial genes were being passed down and integrated into the insect genome. The evolutionary importance of aphids and the effect they have on large vegetation inspired us to tackle this problem.

What it does

We built two tools. Both analyze the Pea Aphid genome to find distinctive regions that may indicate bacterial gene transfer. The sequences from these regions can then be searched with BLASTP to match them against known proteins and to investigate proteins fundamental to the Pea Aphid.

How we built it

Python was the main language used to iterate through the raw Pea Aphid sequence data. The first method uses a sliding-window function whose parameter sets how many nucleotides are read at a time before being passed to a calculation for analysis. We tested window sizes of 1,000, 10,000, and 100,000 nucleotides; multiple sizes were tested to decrease processing time. The sliding window broke each sequence into frames, and the GC content of each frame was calculated. The GC content of the windows along what was, in theory, each chromosome was then plotted, and the graphs showed noticeable spikes.

The second method searches each chromosome for all possible protein-coding sequences. Based on the supplemental paper's method, it runs through all reading frames and captures sequences longer than 60 amino acids (>60 AA) that lie between stop codons. The identified sequences are then extracted to be run through BLASTP to research matching proteins.
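The sliding-window GC calculation described above can be sketched as follows. This is a minimal illustration rather than the team's actual code: the names `gc_fraction` and `windowed_gc` and the `step` parameter are hypothetical, and two assumptions are made that the write-up does not state, namely that windows do not overlap by default and that ambiguous bases such as N are left out of the calculation.

```python
def gc_fraction(seq):
    """Return the fraction of G and C among the A, C, G, T bases in seq.

    Assumption: ambiguous bases (e.g. N) are ignored, so a window made
    only of N's is reported as 0.0.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def windowed_gc(seq, window=10_000, step=None):
    """Yield (start_position, gc_fraction) for each window along seq.

    Assumption: windows are non-overlapping unless a smaller step is given.
    The last window may be shorter than `window`.
    """
    if step is None:
        step = window
    for start in range(0, len(seq), step):
        yield start, gc_fraction(seq[start:start + window])


# Tiny example: two 4-nucleotide windows over an 8-nucleotide sequence.
for start, gc in windowed_gc("AAAAGGGG", window=4):
    print(start, gc)
```

Each `(start, gc)` pair can be plotted directly, position on the x-axis and GC fraction on the y-axis, to look for the spikes mentioned above. Larger windows produce fewer points per chromosome, which is one reason window size affects processing time.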

Challenges we ran into

Some of these approaches had long processing times, taking hours to days to run through the entire Pea Aphid genome.

The GC-rich content approach did not produce many spikes across the genome; however, it did produce a few significant ones.

Accomplishments that we're proud of

Cutting down processing time, and being able to highlight the GC regions that are most distinctive in the Pea Aphid genome, regardless of whether they are bacterial regions or not.

Being able to calculate the GC content for each chromosome, and highlighting important protein sequences throughout the genome with method 2.
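Method 2, as described under "How we built it", scans every reading frame for stretches between stop codons that are longer than 60 amino acids. The sketch below shows one way this could look. It is not the team's code: the function names are hypothetical, the standard genetic code (NCBI translation table 1) is assumed, and "all reading frames" is assumed to mean three frames on each strand.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate seq codon by codon; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def stop_to_stop_peptides(seq, min_aa=60):
    """Collect translated stretches longer than min_aa in all six frames.

    Note: the first and last stretch of each frame are bounded by the end
    of the sequence rather than by a stop codon; this sketch keeps them.
    """
    seq = seq.upper()
    peptides = []
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) > min_aa:
                    peptides.append(segment)
    return peptides


# A stop codon, 61 alanine codons, and another stop codon.
candidates = stop_to_stop_peptides("TAA" + "GCT" * 61 + "TAA")
print(len(candidates), "candidate peptides longer than 60 AA")
```

The resulting peptides could be written to a FASTA file as BLASTP queries. Translating a whole chromosome into one string per frame is simple but memory-hungry, so a real run over the full genome would probably need to process the sequence in chunks.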

What we learned

The GC content approach provided a fast and simple way to identify possible large regions of bacterial insertion into the insect genome.

The brute-force protein sequence identification approach is more time-consuming and outputs large files of candidate sequences to BLAST against bacterial and invertebrate databases.

With more time, we would have compared the BLAST results from the two methods and drawn better comparisons between them.