Inspired by the teams tackling automated detection of metastases in lymph node sections in the CAMELYON16 challenge (https://grand-challenge.org/site/camelyon16/), we decided to try the new capabilities of distributed TensorFlow and see whether we could speed up training by distributing it across the POWER (ppc64le) architecture.
What it does
Using asynchronous distributed training, the solution shows that training speed scales almost linearly with the number of workers, at the cost of a comparatively small degradation in accuracy.
How I built it
On the SuperVessel cloud we built a cluster of Dockerized nodes, updated TensorFlow to the newest version (0.10rc0), distributed the training data, and ran the training.
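A minimal sketch of how such a cluster is described to distributed TensorFlow (hostnames, ports, and node counts below are hypothetical, not the actual SuperVessel configuration; on TF 0.10 this dict feeds `tf.train.ClusterSpec`, and each Docker container starts a `tf.train.Server` for its task):

```python
# Hypothetical cluster layout for distributed TensorFlow.
# One parameter server ("ps") task holds the shared weights;
# worker tasks each run the training loop on their data shard.

def make_cluster_spec(ps_hosts, worker_hosts):
    """Build the cluster description shared by every task."""
    return {"ps": list(ps_hosts), "worker": list(worker_hosts)}

cluster = make_cluster_spec(
    ["ps0.example.local:2222"],                              # hypothetical host
    ["worker0.example.local:2222", "worker1.example.local:2222"],
)

# Inside each container one would then run, e.g.:
#   server = tf.train.Server(tf.train.ClusterSpec(cluster),
#                            job_name="worker", task_index=0)
#   ... and build the graph with tf.train.replica_device_setter(...)
```

Every container receives the same spec and differs only in its `job_name`/`task_index` pair, which is what makes the Dockerized images easy to replicate.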
Challenges I ran into
TensorFlow is under active development, so we faced a number of issues that we had to resolve before the actual run.
Accomplishments that I'm proud of
TensorFlow's flexibility allowed us to tune the model in non-obvious ways, which got us a better result!
What I learned
Rapidly evolving frameworks like TensorFlow can give amazing results when combined with robust, proven solutions such as Keras and OpenPOWER.
What's next for DistributedTensorFlow4CancerDetection
- train on a 300K-sample dataset (currently 50K)
- train for 30 epochs (currently 4)
- run 2 iterations of updating false-positive samples in the dataset
- switch to synchronous distribution
- replace the VGG16 model with Inception-v3
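The asynchronous distribution we used and the synchronous mode planned above differ in how worker gradients reach the shared parameters. A toy pure-Python sketch of the contrast (the learning rate, gradient values, and single scalar parameter are invented for illustration; on the TF side, synchronous aggregation is what `tf.train.SyncReplicasOptimizer` provides):

```python
# Toy contrast between asynchronous and synchronous parameter updates.

def async_update(param, worker_grads, lr=0.1):
    # Asynchronous: each worker's gradient is applied as soon as it
    # arrives, possibly computed against an already-stale parameter.
    for g in worker_grads:
        param -= lr * g
    return param

def sync_update(param, worker_grads, lr=0.1):
    # Synchronous: gradients are averaged across all workers,
    # then applied once per step.
    avg = sum(worker_grads) / len(worker_grads)
    return param - lr * avg

p_async = async_update(1.0, [0.5, 0.3, 0.2])  # three sequential steps
p_sync = sync_update(1.0, [0.5, 0.3, 0.2])    # one averaged step
```

The async variant takes more (and noisier) steps per wall-clock unit, which is where the near-linear speedup and the small accuracy degradation both come from; the sync variant trades some throughput for consistent updates.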