Iron Claude

Overview

Elevator Pitch

We leverage Claude 2 to extract data from Materials Science research papers in a format that is suitable for shaping graph-based ML models on platforms like Citrine Informatics.

Inspiration

At Citrine Informatics, we build machine learning models for materials scientists and chemists. This enables them to make predictions on the next experiments to make, saving them resources, energy, time, and money. But it all starts with requiring structured data. The materials industry is far behind other industries (e.g. Biotech, Fintech, etc) when it comes to having good data management and structured data.

What it does

The Iron Claude ingests unstructured materials/chemicals-specific data (e.g. research papers), determines the physical application, and extracts the relevant process-structure-properties as nodes. Claude builds a graph linking each of these nodes together as they impact one another. This graph can be used by Citrine Platform to inform it how to build the right machine learning model architecture for running experimental predictions.

How we built it

This project was built with python using the anthropic, langchain, and networkx libraries. We have utility functions that parse unstructured data, form the anthropic client, format the prompt, and run Claude-2's model. Our main script can be ran from the command line with an optional document_path argument to an unstructured file. Or it can be ran on a corpus of documents.

Challenges we ran into

We really needed structured data as an output for this project. We had a bit of difficulty figuring out how to get Claude-2 to output only structured data; oftentimes, it was outputting the structured data, along with unstructured data (additional text info). Giving Claude-2 more information on how to structure the output seemed to resolve the issue.

Latency was another issue, particularly for building the edges in our graph. Claude builds the edges one-by-one looking at every possibility between nodes, so there is a time component to running this application. Depending on the complexity, this could take between a few seconds (for the most simple graphs) to ~30 minutes (for the most complex graphs). We need to conduct more investigation, but we suspect we could considerably decrease the latency with some prompt engineering (e.g. telling Claude-2 to look at all the nodes at once instead of doing it iteratively, as we are now). Another option could be to implement a job scheduler (like celery) for better user experience (this would be a more suitable if we implemented a user interface)

Accomplishments that we're proud of

We are extremely proud of being able to ingest unstructured materials/chemicals data and actually build a process-structure-property graph model out of it; and it seems to be extremely accurate. Our team at Citrine have already been prototyping with LLM's, but there are a few limitations. The token limit was one of those. We are stoked that we can take entire research papers (multiple too!), and use them as part of our prompt. We do not even have to split them or use a vector database! This project considerably simplified how we build our LLM products, and we are excited to use this solution for future materials innovation with better batteries, stronger polymers, more sustainable materials, etc.

What we learned

Our main takeaway is that Claude-2 is a great application for the type of data that we deal with. At Citrine, we commonly work with customers that have unstructured data. Claude-2 is great for dealing with this use-case, since you can throw so much at it to process. We were pleasantly surprised that it was able to understand very technical applications in the materials/chemicals space.

What's next for Iron Claude

We plan to take our learnings and our project to Citrine Informatics and work on productizing some of these ideas. The next step would be to design a data model for outputting this graph and mapping it to our predictor data model on the Citrine Platform. This will enable our customers to build their own initial machine learning graph models out of their research papers and other unstructured data.