Inspiration

Food and drug safety is a major problem in the US. According to the CDC, there are about 179 million cases of illness, about 486,000 hospitalizations, and about 6,000 deaths each year in the US.

What it does

In this project, we analyzed the FDA adverse event report data to identify potentially dangerous consumer products and patterns, and then recommended regulatory actions for the FDA.

How we built it

After examining the dataset, we identified the two most important factors as the case outcome and the concomitant status. Based on these two factors, we assigned a danger index to each product and ranked the ten most potentially dangerous products. To build the model, we grouped the case outcomes into six categories, ordered by danger level: death, disability, life threatening, hospitalization, serious outcome, and "other" outcomes. We used CDC data on deaths and hospitalizations to calculate a weight for each negative outcome. This weighting was based on our first hypothesis. Then we assigned a coefficient to suspect products based on official data on the suspect-to-confirmed outbreak ratio. Finally, for each report we multiplied the suspect coefficient by the outcome weight, summed the results per product, and normalized the sums to get an overall danger score for each product.
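The scoring steps described above can be sketched roughly as follows. This is a minimal illustration, not our actual model: the weights in `OUTCOME_WEIGHTS`, the values in `ROLE_COEFFICIENTS`, the `(product, role, outcome)` report layout, and the choice to normalize by dividing by the highest raw score are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical outcome weights (placeholders, not the values derived from CDC data),
# ordered from most to least severe.
OUTCOME_WEIGHTS = {
    "death": 1.0,
    "disability": 0.8,
    "life threatening": 0.6,
    "hospitalization": 0.4,
    "serious outcome": 0.2,
    "other": 0.1,
}

# Hypothetical coefficients for the role a product plays in a report.
ROLE_COEFFICIENTS = {"suspect": 1.0, "concomitant": 0.5}


def danger_scores(reports):
    """Return {product: normalized score} from (product, role, outcome) tuples."""
    raw = defaultdict(float)
    for product, role, outcome in reports:
        # Coefficient times outcome weight, accumulated per product.
        raw[product] += ROLE_COEFFICIENTS[role] * OUTCOME_WEIGHTS[outcome]
    top = max(raw.values(), default=0.0)
    if top == 0.0:
        return dict(raw)
    # Assumed normalization: divide by the highest raw score.
    return {product: score / top for product, score in raw.items()}


def top_products(reports, n=10):
    """Return the n highest-scoring (product, score) pairs, highest first."""
    scores = danger_scores(reports)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]
```

Calling `top_products(reports)` on a list of such tuples returns the ranked list; with this normalization the highest-scoring product always gets a score of 1.0, so the other scores read as fractions of it.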

Challenges we ran into

We needed to clean the data first. One major problem was that many entry names were masked as "exemption 4" due to government regulation. We deleted all entries containing "exemption 4" from the data set, since they carried no usable information and could distort the model. In addition, there were many repeated entries where people had reported one case several times. Lastly, an identical product might appear under different names; for example, the uppercase and lowercase forms of Vitamin D should refer to the same thing. We used the machine learning tool Dedupe to solve these issues by calculating the distance between words. After we manually labelled 20 to 50 samples of duplicates, the model learned the pattern of duplicates, labelled the remaining records accordingly, and efficiently clustered the data by those labels.
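The rule-based part of this cleaning can be sketched as below. This is only an illustration under assumptions: the `(report_id, product_name)` row layout and both function names are hypothetical, and plain lowercasing and exact matching stand in for the learned fuzzy matching that Dedupe performed, which this sketch does not reproduce.

```python
def normalize_name(name):
    """Lowercase and collapse whitespace so 'VITAMIN D' and 'vitamin d' match."""
    return " ".join(name.lower().split())


def clean_reports(rows):
    """Drop masked entries and repeated reports.

    rows: iterable of (report_id, product_name) tuples (hypothetical layout).
    Returns a list of (report_id, normalized_name) tuples in original order.
    """
    seen = set()
    cleaned = []
    for report_id, product in rows:
        name = normalize_name(product)
        if "exemption 4" in name:
            continue  # masked product name, no usable information
        key = (report_id, name)
        if key in seen:
            continue  # same case and product already recorded
        seen.add(key)
        cleaned.append((report_id, name))
    return cleaned
```

Exact matching like this only catches differences in case and spacing; misspellings and brand-name variants are the cases where a trained matcher is needed.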

Accomplishments that we're proud of

Our model cleaned the data effectively, and our weights were supported by professional references. One weakness was that cases with "other" outcomes were underreported, so we could not estimate their weight very precisely.

What we learned

We learned how to use the machine learning tool Dedupe, as well as methods to clean, process, and visualize the data. From the model, we identified the most potentially dangerous products. The first is kratom, a tropical plant. The second is Johnson powder. After those come an eye-disease-related product, Abbott Similac, and flaxseed oil, as shown in the diagrams.

What's next for Analysis for FDA Adverse Food and Products

We recommend that the FDA focus on the top 10 products, especially the top 2, because they have much higher scores than the others. We also suggest developing more gender-specific and age-specific regulations, and paying more attention to exemption 4 products.
