Genetic Disorder Analysis and Prediction - Team 33

Inspiration

Genetic disorders are among the leading causes of illness in the human population, yet the application of AI in this field is still at the research stage. We came across a dataset on Kaggle about the genetic disorders prevalent in children aged 0 to 14 years and found that it had received little analysis and that few people had taken up the task.

We therefore decided to investigate how factors such as age, gender, and family background affect a genetic disorder, and how an ML model could automatically categorize a genetic disorder from the biological information provided by the patient.

What it does

Our project predicts a person's genetic disorder from the biological information available about them.

We used the following models to predict genetic disorders:

Random Forest Classifier: With 100 estimators and k-fold cross-validation, we achieved an accuracy of 60%.
MLP Classifier: We tried this classifier with varying numbers of clusters. The analysis showed that the optimal number of clusters was 2020, with which we achieved an accuracy of 61%.
Logistic Regression: Using this technique with the 'liblinear' solver, we achieved an accuracy of 53%. In this method we used L2 loss over 100 iterations.
AdaBoost Classifier: Using the default of 50 decision tree estimators and an optimal learning rate of 1.5, we achieved an accuracy of 53%.
Modified AdaBoost: We replaced the conventional decision tree base estimator with an MLP Classifier, kept the default of 50 estimators, and achieved an accuracy of 53%.
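
The configurations listed above can be sketched roughly as follows. This is a minimal illustration assuming scikit-learn, not our actual training script: the data is a synthetic binary stand-in for the Kaggle dataset, the number of folds (5) is an assumption, and the MLP hidden layer size is only a placeholder because the mapping from "clusters" to an MLP parameter is not specified here. The modified AdaBoost variant is left out to keep the sketch short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the Kaggle data (assumption: the real features
# would first be cleaned and encoded into a numeric matrix).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # hidden_layer_sizes is a placeholder, not the setting used in the project.
    "mlp": MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0),
    # LogisticRegression applies L2 regularization by default.
    "logistic_regression": LogisticRegression(solver="liblinear", max_iter=100),
    # The default base estimator is a depth-1 decision tree.
    "adaboost": AdaBoostClassifier(n_estimators=50, learning_rate=1.5, random_state=0),
}

# 5 folds is an assumption; the write-up only says "k-fold".
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Mean cross-validated accuracy per model.
mean_accuracy = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, acc in sorted(mean_accuracy.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

The accuracies printed on synthetic data say nothing about the figures reported above; the sketch only shows how the hyperparameters fit together.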

How we built it

We took the dataset from Kaggle and performed exploratory data analysis on it to answer questions such as "Which gender group is most susceptible to Alzheimer's?" and "Which genetic disorder is most likely to affect a 7-year-old male?". We then built an ML model to answer these questions.
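
Questions of this kind come down to filtering and counting. The sketch below shows the idea using only the Python standard library. The rows are toy values invented for illustration, and the field names (`age`, `gender`, `disorder`) are hypothetical rather than the actual Kaggle column names.

```python
from collections import Counter

# Toy rows invented for illustration; field names are hypothetical.
records = [
    {"age": 7, "gender": "male", "disorder": "A"},
    {"age": 7, "gender": "male", "disorder": "B"},
    {"age": 7, "gender": "male", "disorder": "A"},
    {"age": 7, "gender": "female", "disorder": "B"},
    {"age": 3, "gender": "male", "disorder": "B"},
]


def most_common_disorder(rows, age, gender):
    """Return the most frequent disorder among rows matching age and gender."""
    counts = Counter(
        r["disorder"] for r in rows if r["age"] == age and r["gender"] == gender
    )
    return counts.most_common(1)[0][0] if counts else None


def gender_counts_for(rows, disorder):
    """Count how many rows of each gender carry the given disorder label."""
    return Counter(r["gender"] for r in rows if r["disorder"] == disorder)


print(most_common_disorder(records, age=7, gender="male"))
print(gender_counts_for(records, "B"))
```

On a real dataset, raw counts like these would usually be normalized by group size before comparing groups.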

We experimented with several algorithms, tested each model's performance on unseen data, and chose the one that gave the best test accuracy.
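
Selecting a model by its accuracy on held-out data might look like the sketch below. It assumes scikit-learn and a synthetic stand-in dataset. The 80/20 split and the two candidate models are illustrative choices, not a record of what we ran.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the 80/20 split ratio is an assumption.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(solver="liblinear", max_iter=100),
}

# Fit each candidate on the training split and score it on the unseen split.
test_accuracy = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = accuracy_score(y_test, model.predict(X_test))

best_name = max(test_accuracy, key=test_accuracy.get)
print(best_name, round(test_accuracy[best_name], 3))
```

One caveat with this approach: when the same test split is used both to pick the winner and to report its score, the reported accuracy tends to be optimistic. A separate validation split avoids that.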

Challenges we ran into

Choosing the right problem to work on.
Making the raw data usable.
Choosing a dataset without prior knowledge of the environment.
No framework such as PyTorch, TensorFlow, or Keras was available, and other libraries were outdated.
No GPU was available, which limited our choice of neural networks.

Accomplishments that we're proud of

Being able to complete the project on time and submit it to the datathon.
Being able to work in a diverse environment.
Being able to work on a societal issue using our tech stack and with the help of IBM Z.

What we learned

New data science tools.
Machine learning on IBM z/OS.
More exposure to IBM technology.
Made new connections.

What's next for Genetic Disorder Analysis and Prediction

We would like to implement a web UI and deploy the model so that people around the world can use it to identify their likelihood of having a genetic disorder.