I'm currently doing my honours in biomedical sciences. I grow human cells in flasks, and for them to grow healthily they require a mix of nutrients and growth factors called _media_. I was using an expensive medium, and my supervisor, worried about costs, asked if I could find a cheaper alternative. I did a literature search, but it's impossible to gauge which medium is most common when there are so many media and so many papers.

Answers on ResearchGate were anecdotal and most of the researchers in the faculty adopted a trial and error approach, which is costly and wastes time. I decided there needed to be an objective way to garner knowledge from the masses of published data, to provide an easy tool for decision making.

What it does

Tabulates data on media in cell culture protocols from research papers and analyses it, highlighting trends in cost, research impact, and locality.

How I built it

Downloaded open access articles in XML format from the PubMed Central database which contained the string 'HUVEC'.
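This download step could also be scripted against NCBI's public E-utilities API (I'm sketching the idea here, not the exact code I used; the helper names `esearch_url` and `efetch_url` are mine):

```python
from urllib.parse import urlencode

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, retmax=100):
    """Build an E-utilities search URL for PubMed Central (db=pmc)."""
    return f"{BASE}/esearch.fcgi?" + urlencode(
        {"db": "pmc", "term": term, "retmax": retmax})

def efetch_url(pmc_id):
    """Build the URL that fetches one PMC article's full text as XML."""
    return f"{BASE}/efetch.fcgi?" + urlencode({"db": "pmc", "id": pmc_id})

print(esearch_url("HUVEC"))  # returns the IDs of matching open access articles
```

The search returns PMC IDs, which are then fed one by one into `efetch_url` to pull down each article's XML.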

Parsed the XML in Python.

Extracted the methods and materials section from each paper.

Extracted the cell culture subsection from the methods and materials sections.

Extracted all strings in brackets, OR in ALLCAPS (numbers and punctuation included). This is a simple but limited method to extract information about media, which is typically given in brackets or in all caps.
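In Python, that extraction rule might look something like this (the regex and the function name `extract_candidates` are illustrative, not my exact code):

```python
import re

def extract_candidates(text):
    """Pull out bracketed strings and ALL-CAPS tokens (digits, hyphens
    and slashes allowed) as candidate media/reagent names."""
    bracketed = re.findall(r'\(([^)]+)\)', text)
    # ALL-CAPS runs of two or more characters, e.g. DMEM, M199, FBS
    allcaps = re.findall(r'\b[A-Z][A-Z0-9/\-]{1,}\b', text)
    return bracketed + allcaps

extract_candidates("cells were grown in DMEM (Gibco) with 10% FBS")
# → ['Gibco', 'DMEM', 'FBS']
```

The rule catches supplier names in brackets and acronym-style media names, but misses anything written in plain lower case.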

Categorised the most common strings according to their function in the media.

Tabulated the categorised strings across all papers.
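Counting each term at most once per paper, the tabulation step could be sketched like this (the `tabulate` helper is hypothetical):

```python
from collections import Counter

def tabulate(papers_candidates):
    """Count how many papers mention each candidate term.
    A term repeated within one paper is counted once for that paper."""
    counts = Counter()
    for candidates in papers_candidates:
        counts.update(set(candidates))
    return counts

tabulate([['DMEM', 'FBS', 'DMEM'], ['M199', 'FBS']])
# → Counter({'FBS': 2, 'DMEM': 1, 'M199': 1})
```

Sorting this counter gives the "most common media" ranking the project is after.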

Stretch Goal: Retrieve the impact factor for all papers to check for associations between media composition and impact factor.

Additional Stretch Goal: By extracting and associating the universities responsible for publishing the article, one could supply a map of where certain media and cell lines are most utilised. This could be of great assistance to collaborative universities as they could align their media selection to standardise their results.

Challenges I ran into

I can't code. Literally starting from scratch.

After parsing the XML with minidom, I had no idea whether the search strings I had written would actually find what I was looking for. The minidom output was also unreadable (for me), so I couldn't figure out how to redesign the strings.
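PMC full-text XML uses JATS tagging, where each section is a `<sec>` element with a `<title>` child. One way to see what a parser actually found is to switch to `xml.etree.ElementTree` and list every section title (a sketch over a toy JATS snippet, not my actual code):

```python
import xml.etree.ElementTree as ET

xml = """<article><body>
  <sec><title>Materials and Methods</title>
    <sec><title>Cell culture</title><p>HUVECs were grown in M199.</p></sec>
  </sec>
</body></article>"""

root = ET.fromstring(xml)
# List every section title, in document order, to see the tree's structure
titles = [t.text for t in root.iter('title')]
# → ['Materials and Methods', 'Cell culture']
```

Once the titles are visible, you can match on them to pull out just the cell-culture subsection.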

Regular expressions seem inadequate for extracting data which is quite varied in form. I think a frequency analysis of words in the cell culture subsection could enable a knowledgeable developer to build a library of terms. Terms from the library could then be used to search texts and categorise the terms which match. The library could also be built using a web crawler on media suppliers' websites.
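A minimal sketch of that frequency analysis, assuming the subsection text is already extracted (the `word_frequencies` helper is hypothetical):

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lower-cased word counts; terms that recur across many papers'
    cell culture subsections become candidates for the term library."""
    words = re.findall(r"[a-z][a-z0-9\-]+", text.lower())
    return Counter(words)

word_frequencies("Endothelial cells in medium M199; medium was changed daily")
```

The most common domain words (media names, supplement names, supplier names) would then be reviewed by hand and promoted into the library.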

Accomplishments that I'm proud of

Got started and went hard. I worked through three weeks of Udacity's search-engine-building course in two days. It's never felt so good to sleep. I feel like I went from being a mute to being able to ask questions of other teams. I was super stoked to go to my first Hackathon! I loved it. I want to have code weekends.

What I learned

Coding is difficult and you need lots of resources. Part of setting up a space to code is actually understanding how your computer works, for setting up file paths and running code. I was shocked at how difficult it was to even get Python working on my computer (which I still can't; I have to use in-browser Python!).

What's next for Labmedia Miner

I want to expand it in lots of areas: user interface, database coverage, marketing, and library building.

The database is currently based on open access journals in PubMed Central. This excludes journals which are copyright-protected or not indexed in PubMed Central, as well as grey literature (e.g. conference abstracts).

It needs an application interface which is easy to use, so that it can be demoed to researchers and lab managers. As the number of available papers increases, the library of media and additives grows; however, you can only find a medium or additive if you explicitly know what it is called. Partnering with media suppliers to crawl their sites for media details would solve this problem.

I want to streamline the text search and XML download into a GUI with the rest of the operations. To work seamlessly, however, an up-to-date library of media components must be maintained, and that is very difficult, as new media formulations are constantly being conceived. A web crawler which extracts all new media from suppliers' sites could solve this issue.

I've been approached during the Hack by someone looking to collaborate after the event, so I'm keen as to keep working!

N.B. Because I did not complete a video, I have uploaded an audio pitch in the 'Try it out' section. As of now there is no working application.
