Causal discovery from longitudinal observational data is an exciting research area with real potential to advance science. This project also has a personal meaning for me: members of my family have been hurt by adverse drug events (ADEs). The datathon project will serve as a seed for a much larger effort to perform high-throughput ADE detection from Electronic Health Record (EHR) data. Ultimately, I hope to discover new potential ADEs and improve the safety of pharmaceutical use in our healthcare system.


Adverse Drug Events (ADEs) are described by Kohn et al. (2000) as "an injury resulting from medical intervention related to a drug". ADEs strain our healthcare system and account for an additional 1 million ED visits and 125,000 hospital admissions per year. ADE detection from EHR systems is a challenging task because the data are both longitudinal and observational. We must rely only on the natural experiments that occur in our data and grapple with what is often incomplete and irregularly sampled data.

This datathon project pilots causal discovery methods for ADE detection on a task with a known outcome: the risk of acute myocardial infarction (MI) from taking one of two COX-2 inhibitors, Vioxx or Celebrex. Both are pain medications. Vioxx was removed from the market after it was shown to carry an unsafe increase in the risk of heart attack, whereas Celebrex remains on the market because it has not been shown to be unsafe. This project attempts to discover the causal relationships among Vioxx, Celebrex, and heart attack, and thereby to recover the known relationships.


The data for this project were provided by the Marshfield Clinic health system, which serves patients in the Northern and Central Wisconsin regions. The Marshfield Clinic EHR dataset comprises 1.5 million patients with over 40 years of data. These data include demographics, diagnoses, labs, medications, procedures, and vitals. The data were cleaned by removing patients with infrequent healthcare encounters, which left 1.1 million patients with nearly 600,000 unique possible features.

Two smaller datasets were produced, one for Vioxx and one for Celebrex. For the Vioxx dataset, a set of case-control matched patients was selected by first identifying case patients who had taken Vioxx but not Celebrex. Controls were then matched to these patients on gender, date of birth, and lack of Vioxx use. This produced a smaller dataset of 1.5k patients. Frequency pruning was used to remove features with less than 1% representation among the Vioxx population, leaving 11.5k features. Finally, features were kept only if they had a correlation of at least 0.4 with Vioxx or MI, or with a feature that was itself correlated with Vioxx or MI; all other features were removed. The final Vioxx dataset contained 1.5k patients with 566 unique features. The Celebrex dataset was produced in nearly identical fashion and yielded 1.6k patients with 488 features.
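The two feature-reduction steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used for the project: features are assumed to be numeric columns stored as lists, "correlation" is assumed to mean the absolute Pearson correlation, frequency is computed over whatever patients are passed in, and every name and toy value below is hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def frequency_prune(features, min_fraction=0.01):
    """Keep features that are nonzero for at least min_fraction of patients."""
    kept = {}
    for name, values in features.items():
        fraction = sum(1 for v in values if v != 0) / len(values)
        if fraction >= min_fraction:
            kept[name] = values
    return kept


def correlation_filter(features, targets, threshold=0.4):
    """Keep features correlated with a target, or with a feature that is.

    Assumption: "correlated" means |Pearson r| >= threshold.
    """
    primary = {
        name for name, values in features.items()
        if any(abs(pearson(values, t)) >= threshold for t in targets.values())
    }
    secondary = {
        name for name, values in features.items()
        if name not in primary
        and any(abs(pearson(values, features[p])) >= threshold for p in primary)
    }
    keep = primary | secondary
    return {name: values for name, values in features.items() if name in keep}


# Hypothetical toy data: eight patients, one drug indicator, four features.
targets = {"vioxx": [1, 1, 1, 1, 0, 0, 0, 0]}
features = {
    "joint_pain": [1, 1, 1, 0, 0, 0, 0, 0],  # correlated with the drug
    "knee_xray": [0, 1, 1, 0, 0, 0, 1, 0],   # correlated with joint_pain only
    "flu_shot": [1, 0, 1, 0, 1, 0, 1, 0],    # unrelated to either
    "rare_code": [0, 0, 0, 0, 0, 0, 0, 0],   # never observed
}
kept = correlation_filter(frequency_prune(features), targets)
```

In this toy example `rare_code` is dropped by frequency pruning, `flu_shot` by the correlation filter, and `knee_xray` survives only through its secondary correlation with `joint_pain`.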


Due to the sensitivity of health information, the data and analysis remained on a private and secure server. I ran the causal-cmd version of Tetrad on this server and used the FGES continuous algorithm to produce causal graphs for both Vioxx and Celebrex. These graphs were output as JSON files, which were then securely transferred to a research tablet that ran a GUI version of Tetrad. I used Tetrad to visualize and manipulate these graphs. With several hundred variables, graph interpretation was difficult without first trimming down the number of nodes shown. I produced two forms of graphs: one where nodes were shown only if they were adjacent to the drug or event, and a second where nodes were shown if they were connected to the drug or event by two or fewer edges.
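The trimming itself was done in the Tetrad GUI, but the same idea can be sketched as a breadth-first search over the graph's adjacencies. The sketch below assumes the graph has already been reduced to a plain list of node pairs (it does not parse Tetrad's JSON format), ignores edge direction, and uses hypothetical node names.

```python
from collections import deque


def nodes_within(edges, seeds, max_hops):
    """Return nodes reachable from any seed in at most max_hops edges.

    Edge direction is ignored, so only adjacencies matter.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return set(distance)


def induced_subgraph(edges, keep):
    """Keep only the edges whose two endpoints are both in `keep`."""
    return [(a, b) for a, b in edges if a in keep and b in keep]


# Hypothetical toy graph, written as (node, node) pairs.
edges = [
    ("acute_mi", "vioxx"),
    ("joint_pain", "vioxx"),
    ("hypertension", "acute_mi"),
    ("age", "hypertension"),
    ("office_visit", "age"),
    ("flu_shot", "office_visit"),
]
seeds = ["vioxx", "acute_mi"]
adjacent_view = induced_subgraph(edges, nodes_within(edges, seeds, 1))
two_hop_view = induced_subgraph(edges, nodes_within(edges, seeds, 2))
```

Here `adjacent_view` corresponds to the first form of graph (direct neighbors of the drug or event) and `two_hop_view` to the second.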


For both Vioxx and Celebrex, a direct causal adjacency was found with acute MI. While this was expected for Vioxx, it was not for Celebrex. There is, however, some literature suggesting that Celebrex could slightly increase the risk of heart attack, though not to the same unsafe degree that Vioxx did. In light of this possibility, a relationship between Celebrex and heart attack would not be surprising. In both the Vioxx and Celebrex graphs, the causal arcs went from the event to the drug, which suggests that the edge orientations in these graphs could be incorrect. For this reason, my interpretation of the results relies only on the adjacencies of the graphs.

The variables connected to the drugs and event by two or fewer edges largely made sense. For both Vioxx and Celebrex, there were diagnoses related to pain and joint discomfort, either of which would be a reason for receiving a prescription for a COX-2 inhibitor as a pain medication. The variables connected to heart attack were largely causes of heart disease, such as diabetes, hypertension, and age. Finally, some variables were related to usage of the healthcare system, such as office visits. While one could posit that a need for medication or a heart condition would result in healthcare use, these variables could also be stand-ins for the general health of an individual, since higher healthcare usage is correlated with being sicker. Ultimately, the graphs were qualitatively sensible and suggest that this methodology is appropriate for causal discovery of ADEs from EHR data.

Conclusion and Future Work

Automated discovery of ADEs allows for detection of scenarios in which use of certain pharmaceuticals could pose an undue risk to patient health. By using the Tetrad program and algorithms, I was able to confirm known interactions between Vioxx, Celebrex, and heart attack. While this work is promising, additional and more thorough validation is required. I would like to repeat these experiments on a larger dataset, with regard to both the number of patients and the number of features. Additionally, I would like to explore bootstrapping methods to obtain edge confidences and to try other causal discovery algorithms, such as GFCI, that allow for the possibility of latent confounders. Finally, I intend to scale this up to exploring thousands of different drug-event combinations by pairing Tetrad algorithms with the Condor high-throughput computing system at UW-Madison.
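As a rough illustration of the bootstrapping idea, edge confidence can be estimated as the fraction of resampled datasets in which an edge is found. The sketch below is an assumption about how this could be set up, not something that was run for this project: `discover` is a hypothetical callable standing in for a real search such as FGES, and `toy_discover` is a deliberately crude placeholder rule, not a causal discovery algorithm.

```python
import random
from collections import Counter


def bootstrap_edge_confidence(rows, discover, n_resamples=100, seed=0):
    """Fraction of bootstrap resamples in which each edge is reported.

    rows:     list of patient records
    discover: hypothetical callable mapping a list of rows to an iterable
              of edges (any hashable value, e.g. a pair of variable names)
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_resamples):
        # Resample patients with replacement, keeping the sample size fixed.
        sample = [rng.choice(rows) for _ in rows]
        counts.update(set(discover(sample)))
    return {edge: count / n_resamples for edge, count in counts.items()}


def toy_discover(sample):
    """Placeholder rule: report drug-event adjacency if most exposed rows have the event."""
    exposed = [row for row in sample if row["drug"] == 1]
    if not exposed:
        return []
    event_rate = sum(row["event"] for row in exposed) / len(exposed)
    return [("drug", "event")] if event_rate >= 0.6 else []


# Hypothetical toy records: 10 exposed patients (8 with the event)
# and 10 unexposed patients (2 with the event).
rows = (
    [{"drug": 1, "event": 1}] * 8
    + [{"drug": 1, "event": 0}] * 2
    + [{"drug": 0, "event": 1}] * 2
    + [{"drug": 0, "event": 0}] * 8
)
confidence = bootstrap_edge_confidence(rows, toy_discover, n_resamples=200)
```

An edge that appears in nearly every resample is more trustworthy than one that appears in only a few, which is the kind of evidence that would help decide how much weight to give an individual adjacency.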
