pfas_ai

Inspiration

We watched a Veritasium video on the effects of forever chemicals on the planet. In the video, the narrator, Derek, explains why banning PFAS compounds is a regulatory problem: there are too many compounds for which there is insufficient experimental health data. We hope to address this by creating an efficient method to accurately and precisely predict the health effects of untested PFAS compounds.

What it does

pfas_ai is a web application that lets users submit a compound and returns a prediction of that compound's health effects. It also lets users look up the risk level in their zip code. If that zip code is not covered yet, they can submit a list of concentrations for several PFAS, which we use to predict the risk level.

The pipeline first takes in a compound and converts it into a SMILES string, a text encoding of the compound's structure. We pass this string into two separate models to predict the physical properties of the molecule. These physical properties are then passed to another neural network, which predicts the results of 96 different assays that describe how the compound reacts with the body. Finally, the assay predictions are passed to an LLM, Gemini 2.5 Flash, which identifies the most important findings in the assay data.
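The flow of data through these stages can be sketched roughly as follows. This is only an illustration of the stage order described above, not our actual code: the lookup table, the two property functions, the assay function, and the prompt format are all hypothetical placeholders, and the values they return are made up rather than predicted.

```python
from typing import Dict, List, Tuple

N_ASSAYS = 96  # number of assays named in the text

# Hypothetical name -> SMILES lookup table, for illustration only.
KNOWN_SMILES: Dict[str, str] = {
    "PFOA": "OC(=O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F",
}


def to_smiles(compound: str) -> str:
    """Return SMILES for a known name; otherwise assume the input is SMILES."""
    return KNOWN_SMILES.get(compound.upper(), compound)


def property_model_a(smiles: str) -> Dict[str, float]:
    """Placeholder for the first property model (not a real predictor)."""
    return {"fluorine_count": float(smiles.count("F"))}


def property_model_b(smiles: str) -> Dict[str, float]:
    """Placeholder for the second property model (not a real predictor)."""
    return {"smiles_length": float(len(smiles))}


def predict_assays(properties: Dict[str, float]) -> List[float]:
    """Placeholder assay model: returns N_ASSAYS made-up values in [0, 1)."""
    total = sum(properties.values())
    return [((total * (i + 1)) % 100) / 100 for i in range(N_ASSAYS)]


def build_prompt(smiles: str, probs: List[float], top_k: int = 5) -> str:
    """Format the top_k highest-scoring assays as text for an LLM to summarize."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    lines = [f"assay_{i}: {p:.2f}" for i, p in ranked[:top_k]]
    header = f"Compound SMILES: {smiles}\nHighest predicted assay activity:\n"
    return header + "\n".join(lines)


def run_pipeline(compound: str) -> Tuple[List[float], str]:
    """Chain the stages: SMILES -> properties -> assay scores -> LLM prompt."""
    smiles = to_smiles(compound)
    properties = {**property_model_a(smiles), **property_model_b(smiles)}
    probs = predict_assays(properties)
    return probs, build_prompt(smiles, probs)


probs, prompt = run_pipeline("PFOA")
print(prompt)
```

The sketch stops at building the prompt text; sending it to the LLM is left out because it depends on the client library and credentials.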

How we built it

To predict the chemical properties, we use an ensemble learning approach that combines random forests provided by RDKit with a Graph Neural Network + Transformer architecture provided by MoleculeNet. The predicted properties are fed into an MLP (multilayer perceptron) along with toxicology data to estimate the probabilities of 96 health risk markers, the assays. We used Letta to turn the 96 assay predictions into a detailed health report for the user.
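As a rough illustration of the two ideas in this section, the sketch below averages the property predictions of two models and then runs a small MLP forward pass with one sigmoid output per assay. It is a minimal sketch under assumptions: the equal weighting, layer sizes, feature values, and the `TinyMLP` class are hypothetical, and the weights are random and untrained, so the outputs carry no chemical meaning.

```python
import math
import random
from typing import List


def ensemble_average(pred_a: List[float], pred_b: List[float],
                     weight_a: float = 0.5) -> List[float]:
    """Weighted average of two models' predictions for the same properties."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


def sigmoid(x: float) -> float:
    """Squash a real number into (0, 1) so it can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """One hidden ReLU layer, then an independent sigmoid output per assay."""

    def __init__(self, n_in: int, n_hidden: int, n_out: int, seed: int = 0):
        rng = random.Random(seed)  # fixed seed so the sketch is reproducible
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x: List[float]) -> List[float]:
        hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return [sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
                for row, b in zip(self.w2, self.b2)]


# Hypothetical property predictions from the two models (made-up numbers).
properties = ensemble_average([1.0, 2.0], [3.0, 4.0])
# Hypothetical toxicology features appended to the averaged properties.
features = properties + [0.1, 0.7]

mlp = TinyMLP(n_in=len(features), n_hidden=8, n_out=96)
assay_probs = mlp.forward(features)
print(len(assay_probs), round(min(assay_probs), 3), round(max(assay_probs), 3))
```

Using a separate sigmoid for each output treats the 96 assays as independent yes/no predictions, which is one common way to set up a multi-label problem like this.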

Challenges we ran into

Our biggest issue was making sure the full pipeline worked; we struggled with passing data between the property prediction and the health risk prediction stages. We also struggled with developing the property prediction model. We experimented with building our own transformer models and GNNs, but eventually settled on an ensemble learning algorithm built from industry-standard tools. We also struggled with the UI/UX work needed to integrate the ML models with the website.

Accomplishments that we're proud of

We were able to predict the results of the 96 assays with 80% accuracy. The model is also comparatively lightweight and does not require specialized hardware to run. This makes it energy-efficient and thus better for the environment.