As of 2012, about half of all adults (117 million people) had one or more chronic health conditions, and one in four adults had two or more. Seven of the top 10 causes of death in 2014 were chronic diseases. Two of them, heart disease and cancer, together accounted for nearly 46% of all deaths. Chronic diseases such as heart disease, kidney failure, arthritis, diabetes, and cancer are among the most common, costly, and preventable of all health problems. With better analysis of the trends in a patient's condition, physicians can actively promote and practice preventive healthcare, which benefits everyone.

What it does

Diagnolytics is a platform where physicians can view detailed analytic reports on their patients, helping them prevent chronic diseases before their patients develop them. With analytics on the distribution of patients, their conditions, and how they compare with other patients, along with detailed reports and predictions tailored to each patient, the physician is better informed to make a personalized diagnosis. Each patient record contains the patient's history of illnesses, insurance claim trends (inpatient, outpatient, and carrier), a prediction of the amount of insurance they may claim in the subsequent year, predictions of the diseases they may develop in the future, and their similarity to other patients' conditions.

How we built it

The Centers for Medicare and Medicaid Services (CMS) Linkable 2008–2010 Medicare Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) contains synthesized data taken from a 5% random sample of Medicare beneficiaries in 2008 and their claims from 2008 to 2010. We used pandas in Python to query the information relevant to our analysis from the five different types of data files while preserving the relational information between them. To accommodate the huge dataset, we used the Google Cloud Platform.

Prediction of future diseases: We used the historic data from the five files, covering chronic conditions already experienced, trends in claims, drug prescriptions, dosage, previous insurance claims, and inpatient, outpatient, and carrier information, to measure each patient's similarity to other patients who have had similar conditions. Using k-nearest neighbors, we predict a probability distribution over the ICD-9 procedures that the patient might undergo in the future. Cross-validation selected 10 neighbors as the appropriate parameter. The similarity measures were derived through feature engineering. The model reached an accuracy of 79% when tested on 20% of the data.

Prediction of claims: Logistic regression was used to predict the total amount of claims a patient would file, based on present conditions, older trends, and similar trends. Over a claim range of $63,000, the prediction error was less than $1,000.

Other statistics and analysis are displayed in Tableau.
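The nearest-neighbor step described above can be illustrated with a minimal sketch. This is not the project's code: the feature vectors, the placeholder labels (`code_A`, `code_B`, `code_C`), the Euclidean distance, and the function name `knn_code_distribution` are all assumptions made for illustration. The real similarity measure came from feature engineering that is not detailed here, and the default `k=10` simply mirrors the value the write-up reports from cross-validation.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_code_distribution(query, patients, k=10):
    """Estimate a probability distribution over procedure codes.

    `patients` is a list of (feature_vector, code) pairs. The k patients
    closest to `query` each cast one vote for their code, and the vote
    shares are returned as probabilities.
    """
    ranked = sorted(patients, key=lambda p: euclidean(query, p[0]))
    neighbors = ranked[:k]
    counts = Counter(code for _, code in neighbors)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


# Toy, made-up data: (engineered features, procedure code seen later).
# The labels are placeholders, not real ICD-9 codes.
patients = [
    ([1.0, 0.0, 0.2], "code_A"),
    ([0.9, 0.1, 0.3], "code_A"),
    ([0.0, 1.0, 0.8], "code_B"),
    ([0.1, 0.9, 0.7], "code_B"),
    ([0.8, 0.2, 0.1], "code_C"),
]

# k=3 here only because the toy dataset has five patients.
dist = knn_code_distribution([0.95, 0.05, 0.2], patients, k=3)
print(dist)
```

With this toy data, the three nearest patients carry two `code_A` labels and one `code_C` label, so the query patient gets a two-thirds probability for `code_A` and one-third for `code_C`. Choosing `k` by cross-validation would mean repeating this prediction on held-out patients for several values of `k` and keeping the one with the best accuracy.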

Challenges we ran into

Understanding the dataset took a lot of time at first. It contained far more data than we could integrate, and deciding how to combine the files without losing relevant information was quite a challenge.

Accomplishments that we're proud of

We were able to extract useful information from a seemingly vast supply of data in a domain we do not have much knowledge about.

What we learned

We learned the importance of preprocessing and the significance of analysing the relations in a dataset to fully utilise the information we can derive from it.

What's next for Diagnolytics

Currently, we use the drugs prescribed as an input when predicting future diseases, on the intuition that the side effects of some popular drugs can themselves be the source of some conditions. Given relevant data on the exact nature of the drugs, we propose predicting future conditions using information about the side effects of the prescribed drugs along with the other data.
