Sequencing data is complex, but the information needed to make sense of it already exists.

Researchers run sequencing experiments every day, and much of that information goes stale because parsing it is too time-consuming. LLMs make it possible to actually understand that information without nearly the same effort.

What it does

miROR is a Mistral base model fine-tuned on ~10,000 PubMed papers to understand microRNA (miRNA) data. It can take a series of miRNA molecules as a string and give you a window into what is happening in that organism.
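As a sketch of that interface (the helper name and prompt wording below are illustrative assumptions, not miROR's actual code), a query might bundle a list of miRNA identifiers into a single prompt string before sending it to the model:

```python
def build_mirna_prompt(mirnas: list[str], organism: str) -> str:
    """Assemble one prompt string from a list of miRNA identifiers.

    The instruction wording is a placeholder; miROR's real prompt
    template may differ.
    """
    joined = ", ".join(mirnas)
    return (
        f"The following miRNAs were detected in {organism}: {joined}. "
        "Summarize the biological processes these miRNAs are known to regulate."
    )

prompt = build_mirna_prompt(["hsa-miR-21-5p", "hsa-miR-155-5p"], "Homo sapiens")
```

The resulting string is what gets passed to the fine-tuned model as a user message.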

How we built it

  1. Crawled miRNA databases to gather 3,000 relevant miRNAs
  2. Crawled PubMed to gather articles about those RNAs
  3. Used those articles to fine-tune the Mistral base model
  4. Created a benchmark of 20 questions and compared outputs across the fine-tuned model, mistral-large-latest, gpt-3.5-turbo, and gpt-4-1106-preview. At the time of writing, the fine-tuned model outperformed the base Mistral model, mistral-large-latest, and gpt-3.5-turbo when assessed on recall.
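The recall comparison in step 4 can be sketched as follows. The scoring scheme here is an assumption (term-level recall: the fraction of expected key phrases that appear in each model's answer), shown with toy data rather than the actual 20-question benchmark:

```python
def term_recall(answer: str, expected_terms: list[str]) -> float:
    """Fraction of expected key terms found in the model's answer
    (case-insensitive substring match)."""
    text = answer.lower()
    hits = sum(1 for term in expected_terms if term.lower() in text)
    return hits / len(expected_terms)

# Toy benchmark entry: expected key terms for one question,
# plus two hypothetical model answers.
expected = ["miR-21", "apoptosis", "tumor suppressor"]
answers = {
    "fine-tuned": "miR-21 suppresses apoptosis by silencing tumor suppressor genes.",
    "baseline": "miR-21 is a microRNA involved in cancer.",
}
scores = {model: term_recall(text, expected) for model, text in answers.items()}
```

Averaging such per-question scores across the benchmark gives one recall number per model to compare.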

Challenges we ran into

  • We had trouble connecting to a GPU, which set us back several hours, and we made the critical mistake of not fine-tuning on the instruct model from the start.
  • Because of these issues, the model does not yet output tokens as reliably as needed; however, it does a good job answering questions and recalling the right content based on the prompt.

Accomplishments that we're proud of

  • Trained a model successfully: validation loss and training loss behaved as expected, and the model recalled information well without any RAG.

What we learned

  • Prompt engineering is critical when trying to run benchmarks.
  • Accurate benchmarking takes a significant amount of time and effort.

What's next for miROR

  • Clean up the code and get it operational, deploy it internally at my company, and then swap it in for gpt-3.5-turbo, which we are currently using.
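One way to make that swap painless (the interface below is a sketch under assumed names, not our production code) is to hide both backends behind a common completion interface, so gpt-3.5-turbo can be replaced by the fine-tuned miROR model with a one-line change at the construction site:

```python
from typing import Protocol


class ChatModel(Protocol):
    """Minimal interface both backends implement."""

    def complete(self, prompt: str) -> str: ...


class StubMirorModel:
    """Placeholder for a client wrapping the fine-tuned miROR endpoint."""

    def complete(self, prompt: str) -> str:
        return f"[miROR] answer to: {prompt}"


class StubGpt35Model:
    """Placeholder for a client wrapping gpt-3.5-turbo."""

    def complete(self, prompt: str) -> str:
        return f"[gpt-3.5-turbo] answer to: {prompt}"


def answer_question(model: ChatModel, question: str) -> str:
    # Call sites depend only on the ChatModel protocol, so swapping
    # backends only touches the line where the model is constructed.
    return model.complete(question)

reply = answer_question(StubMirorModel(), "What does hsa-miR-21-5p regulate?")
```

With this in place, switching the company-internal deployment from gpt-3.5-turbo to miROR means changing only which class is instantiated.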
