Lung Cancer Visualization & Detection
This project is an entry for the Kaggle Data Science Bowl 2017.
Why do it?
- Air quality is getting worse globally, and millions of people breathe excessive aerosols every day. Early diagnosis of lung cancer can help control the damage caused by polluted air. Traditionally, diagnoses are made by experienced doctors who visually identify malignant lesions in CT scans of patients' lungs, which makes diagnosis slow, expensive, and inaccessible to many people.
- Everybody likes something cool in a hackathon, but it is also fun to try something that may have a larger impact on people. Kaggle is a great way to learn for people without a computer science background (like myself). This project would have been even slower without the tutorials posted by Guido Zuidhof here and by Jonathan Mulholland and Aaron Sander here.
Where do the data come from?
CT scan images with diagnostic information are made available by the National Lung Screening Trial. My goal is to use these labelled images to train machine learning models to diagnose lung cancer.
What are the challenges?
- The images are in a special format, and their resolution can differ from patient to patient. I've finished preprocessing these images, so that 3D arrays of consistent scale and size are produced and fed to 3D convolutional neural networks.
- The training images are too big to fit in memory (~150 GB as NumPy float arrays), so I used the Keras image data generator.
- Training 3D convolutional neural networks is computationally expensive, and it did not finish before the end of McHacks.
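The "consistent scales and sizes" step in the first item can be sketched roughly as below. This is a minimal, NumPy-only illustration and not the project's actual preprocessing code: the function names `resample_nearest` and `fit_to_shape`, the 1 mm target spacing, and the 96x96x96 output shape are all assumptions made for the example. It uses nearest-neighbour lookup to stay self-contained; an interpolating routine would normally give smoother results.

```python
import numpy as np


def resample_nearest(volume, spacing, new_spacing=(1.0, 1.0, 1.0)):
    """Resample a 3D volume to new_spacing (mm per voxel), nearest-neighbour.

    `spacing` is the original voxel size along each axis (hypothetical values
    below; in practice it would come from the scan's metadata).
    """
    old_shape = np.asarray(volume.shape, dtype=float)
    spacing = np.asarray(spacing, dtype=float)
    new_spacing = np.asarray(new_spacing, dtype=float)
    # The physical extent is unchanged, so voxel counts scale by spacing / new_spacing.
    new_shape = np.maximum(np.round(old_shape * spacing / new_spacing).astype(int), 1)
    # For every output index along each axis, pick the nearest source index.
    index = [
        np.minimum(((np.arange(n) + 0.5) * s / n).astype(int), s - 1)
        for n, s in zip(new_shape, volume.shape)
    ]
    return volume[np.ix_(*index)]


def fit_to_shape(volume, shape, fill=0.0):
    """Center-crop or pad with `fill` so the result has exactly `shape`."""
    out = np.full(shape, fill, dtype=volume.dtype)
    src, dst = [], []
    for have, want in zip(volume.shape, shape):
        n = min(have, want)  # number of voxels copied along this axis
        src.append(slice((have - n) // 2, (have - n) // 2 + n))
        dst.append(slice((want - n) // 2, (want - n) // 2 + n))
    out[tuple(dst)] = volume[tuple(src)]
    return out


# Synthetic stand-in for one patient's scan: 40 slices of 128x128 pixels,
# with an assumed spacing of 2.5 mm between slices and 0.7 mm in-plane.
scan = np.zeros((40, 128, 128), dtype=np.float32)
iso = resample_nearest(scan, spacing=(2.5, 0.7, 0.7))
fixed = fit_to_shape(iso, (96, 96, 96))
```

Every patient then ends up with an array of the same shape and the same physical scale, which is what a 3D convolutional network with a fixed input size needs.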
What's next?
- Will train 3D conv nets on small batch flows;
- Will use U-net for segmentation to reduce the data size that enters the 3D convolutional nets;
- Will try recurrent 2D conv nets.
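The small-batch idea in the first item could look something like the sketch below. It is only an illustration under stated assumptions: the name `batch_flow` is hypothetical, and it assumes each preprocessed patient volume has been saved as its own `.npy` file with a matching list of 0/1 labels, which is not necessarily how this project stores its data.

```python
import numpy as np


def batch_flow(paths, labels, batch_size=2, shuffle=True, seed=0):
    """Yield (x, y) batches forever, loading only one batch from disk at a time.

    paths  : list of .npy files, one preprocessed 3D volume per file (assumed layout)
    labels : list of 0/1 cancer labels in the same order as paths
    """
    rng = np.random.default_rng(seed)
    order = np.arange(len(paths))
    while True:
        if shuffle:
            rng.shuffle(order)  # new patient order on every pass
        for start in range(0, len(order), batch_size):
            chosen = order[start:start + batch_size]
            # Stack the volumes and add a trailing channel axis:
            # (batch, depth, height, width, 1).
            x = np.stack([np.load(paths[i]) for i in chosen])[..., np.newaxis]
            y = np.asarray([labels[i] for i in chosen], dtype=np.float32)
            yield x.astype(np.float32), y
```

Because only `batch_size` volumes are in memory at once, the ~150 GB of arrays never has to be loaded in full. A Python generator like this can be passed to Keras training (`fit_generator` in older Keras releases, `fit` in recent ones) together with a `steps_per_epoch` value.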
Built With
- keras
- matplotlib
- pydicom
- python
- scikit-learn
- skimage
- tensorflow
