Inspiration
My sister just finished her first year of college, where she volunteered as an undergraduate in a biology lab. One of her duties was to collect data from alignments of human genes against animal genes. She had to process around 1600 genes for three different animals. At one minute per alignment, those 4800 alignments would have taken her about 80 hours. I wanted to find a way to reduce my sister's workload and help the lab align genes more efficiently.
What it does
Protein BLAST semi-autonomously aligns genes for the user. It takes a .txt file of genes to align and a .txt file of organisms to align them against, and outputs a .csv file with the organised data that the lab needs. Samples of the input and output are below.
Sample Input: organisms.txt
# Informal Organism Name, Formal Organism Name
DEER, deer[ORGN]
HORSE, Domestic Horse[ORGN]
DOG, dog[ORGN]
Sample Input: gene-symbols.txt
# Gene, Uni-Prot ID, Sequence
LARP7, Q4G0J3, METESGNQEKVMEEESTEKKKEVEKKKRSRVKQVLADIAKQVDFWFGDANLHKDRFLREQIEKSRDGYVDISLLVSFNKMKKLTTDGKLIARALRSSAVVELDLEGTRIRRKKPLGERPKDEDERTVYVELLPKNVNHSWIERVFGKCGNVVYISIPHYKSTGDPKGFAFVEFETKEQAAKAIEFLNNPPEEAPRKPGIFPKTVKNKPIPALRVVEEKKKKKKKKGRMKKEDNIQAKEENMDTSNTSISKMKRSRPTSEGSDIESTEPQKQCSKKKKKRDRVEASSLPEVRTGKRKRSSSEDAESLAPRSKVKKIIQKDIIKEASEASKENRDIEISTEEEKDTGDLKDSSLLKTKRKHKKKHKERHKMGEEVIPLRVLSKSEWMDLKKEYLALQKASMASLKKTISQIKSESEMETDSGVPQNTGMKNEKTANREECRTQEKVNATGPQFVSGVIVKIISTEPLPGRKQVRDTLAAISEVLYVDLLEGDTECHARFKTPEDAQAVINAYTEINKKHCWKLEILSGDHEQRYWQKILVDRQAKLNQPREKKRGTEKLITKAEKIRLAKTQQASKHIRFSEYD
Sample Output: final-results.csv
Gene Code, Organism, Query Cover, E Value, % identity, Accession Length, Accession, Isoform
LARP7, DEER, 100.00, 0, 87.14, 580, XP_043726814.1, isoform X1
LARP7, DOG, 100.00, 0, 88.32, 579, XP_038300349.1, isoform X1
LARP7, HORSE, 100.00, 0, 88.85, 582, XP_001503501.1, isoform X1
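To make the file formats above concrete, here is a minimal Java sketch of how the organism list could be read and how one output row could be formatted. It is not the project's actual code: the class name `InputFormatSketch` and the methods `parseOrganisms` and `formatRow` are hypothetical, and it assumes that lines starting with `#` are comments and that the fields are comma-separated, as in the samples.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: class and method names are illustrative, not taken from the project.
class InputFormatSketch {

    /** Maps informal organism names to search terms, skipping blank lines and '#' comments. */
    static Map<String, String> parseOrganisms(List<String> lines) {
        Map<String, String> organisms = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Split on the first comma only, so the search term is kept whole.
            String[] parts = trimmed.split(",", 2);
            if (parts.length < 2) {
                throw new IllegalArgumentException("Expected 'NAME, term' but got: " + line);
            }
            organisms.put(parts[0].trim(), parts[1].trim());
        }
        return organisms;
    }

    /** Formats one result row in the column order of the sample final-results.csv. */
    static String formatRow(String gene, String organism, double queryCover, String eValue,
                            double identity, int length, String accession, String isoform) {
        return String.format(Locale.ROOT, "%s, %s, %.2f, %s, %.2f, %d, %s, %s",
                gene, organism, queryCover, eValue, identity, length, accession, isoform);
    }

    public static void main(String[] args) {
        Map<String, String> organisms = parseOrganisms(Arrays.asList(
                "# Informal Organism Name, Formal Organism Name",
                "DEER, deer[ORGN]",
                "HORSE, Domestic Horse[ORGN]"));
        System.out.println(organisms);
        System.out.println(formatRow("LARP7", "DEER", 100.0, "0", 87.14, 580,
                "XP_043726814.1", "isoform X1"));
    }
}
```

In real use, the lines would probably come from something like `Files.readAllLines(Paths.get("organisms.txt"))` rather than an in-memory list.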
How I built it
I coded it in Java, with carefully sectioned methods and thoroughly commented code. I also used the BioJava library, which allowed me to parse and process information from the websites I used and obtain the data I needed.
Challenges I ran into
Some websites I was working with had crawler protection, which prevented me from retrieving gene sequences from them automatically. Also, for one part of the process, the website I used took quite a while to load the results, so I checked it for my data every 10 seconds. Lastly, I ran into issues with HTTP requests: they returned JavaScript code that I was unable to extract data from, so I used the BioJava library to obtain the data instead.
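The 10-second checks can be sketched as a small polling helper. This is only an illustration under stated assumptions, not the project's code: `PollingSketch` and `pollUntilReady` are hypothetical names, and the overall timeout is an addition that the original approach may not have had. The `check` supplier stands in for whatever call fetches the results page.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical sketch: names are illustrative, not taken from the project.
class PollingSketch {

    /**
     * Calls check repeatedly, sleeping for interval between attempts, until it
     * returns a value or the next sleep would pass the overall timeout.
     * Returns an empty Optional if the timeout is reached first.
     */
    static <T> Optional<T> pollUntilReady(Supplier<Optional<T>> check,
                                          Duration interval, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            Optional<T> result = check.get();
            if (result.isPresent()) {
                return result;
            }
            if (System.nanoTime() + interval.toNanos() > deadline) {
                return Optional.empty();
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for results", e);
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in for a real results check: succeeds on the third attempt.
        Supplier<Optional<String>> check = () ->
                ++calls[0] < 3 ? Optional.<String>empty() : Optional.of("alignment results");
        Optional<String> result =
                pollUntilReady(check, Duration.ofMillis(10), Duration.ofSeconds(1));
        System.out.println(result.orElse("timed out") + " after " + calls[0] + " checks");
    }
}
```

For the case described above, the interval would be `Duration.ofSeconds(10)`. Bounding the total wait means a stalled request ends with an empty result instead of looping forever.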
Accomplishments that I'm proud of
I'm proud that I was able to finish the project and get it to output the intended results. I also learned a lot from the project!
What I learned
With this project, I learned much about accessing remote websites through Java, mining data from them, and the capabilities of Java! I also learned more about gene sequence alignment analysis.
What's next for Protein BLAST
I hope to continue improving Protein BLAST, and then pass it off to my sister's university so they can align genes efficiently! Some improvements I've been thinking of are making the project more autonomous, handling timing issues and errors, and making it more user-friendly.
