Image Segmentation for Prostate Magnetic Resonance Imaging (MRI)
Ruofan Bie and Ruya Kang
Introduction
In this project, we intend to perform image segmentation with prostate Magnetic Resonance Imaging (MRI) data.
Prostate cancer is the second most frequent cancer diagnosed in men and the fifth leading cause of cancer death in men worldwide. [1] Several techniques are used for early detection of prostate cancer, including blood tests, biopsies and imaging tests. Magnetic Resonance Imaging (MRI) scans create detailed images of soft tissues in the body using radio waves and strong magnets, and can give doctors a very clear picture of the prostate and nearby areas. [2]
In prostate MRI, the prostate usually consists of two non-overlapping adjacent regions: the peripheral zone (PZ) and the transition zone (TZ). An example of a prostate MRI with labelled zones is shown in Figure 1. Identifying prostate zones is important for diagnosis and therapy, but doing so requires substantial expertise in reading MRI scans. Automatic segmentation of prostate zones is therefore instrumental for prostate lesion detection.
Prostate zone segmentation is challenging because of the lack of a clear prostate boundary, prostate tissue heterogeneity, and the wide inter-individual variety of prostate shapes. [3] In this project, we implement several existing CNN-based models for image segmentation using prostate MRI data. We use a survey of deep-learning-based image segmentation [4] as a guide, implement selected models and compare their performance.
Methodology
Fully Convolutional Networks (FCNs)
The FCN [11] is built from deep convolutional layers, with deconvolutional layers as decoders and a 1x1 convolutional layer for pixel-wise prediction (Figure 4). In this architecture, the output of a max-pooling layer passes through a deconvolutional layer and is then fused with the output of an earlier max-pooling layer to make a prediction. This technique, called a skip connection, combines down-sampling features with up-sampling features for more accurate prediction. As shown in Figure 2, we implement the FCN-8s model, which has 2 skip connections and whose final deconvolutional layer uses stride 8 to recover the original image size.
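The fusion step can be illustrated with a small NumPy sketch. The shapes and random score maps below are hypothetical, and nearest-neighbour upsampling stands in for the learned deconvolution:

```python
import numpy as np

def upsample(x, factor):
    # Nearest-neighbour upsampling as a stand-in for a learned deconvolution.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

# Hypothetical per-class score maps for a 256x256 input with 3 classes:
# pool4 scores are at stride 16, pool3 scores at stride 8.
score_pool4 = np.random.rand(16, 16, 3)
score_pool3 = np.random.rand(32, 32, 3)

fused = upsample(score_pool4, 2) + score_pool3   # skip connection: fuse the two scales
logits = upsample(fused, 8)                      # final stride-8 upsampling to full size
assert logits.shape == (256, 256, 3)
```

The element-wise addition is what lets coarse, semantically strong features be corrected by finer, spatially precise ones.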

Encoder-Decoder Based Models
Most of the popular DL-based segmentation models use some kind of encoder-decoder architecture. A basic encoder-decoder model to implement image segmentation is to use convolutional layers as encoders and then use deconvolutional or convolution-transpose layers for decoders. We will implement two of them: DeConvNet for general image segmentation and U-Net for medical image segmentation.
DeConvNet
The DeConvNet [13] is built on top of the convolutional layers adopted from the VGG 16-layer net [12]. As shown in Figure 5, DeConvNet is composed of convolution and deconvolution networks, where the convolution network acts as a feature extractor and the deconvolution network acts as a shape generator. The architecture aims to overcome two limitations of FCNs. First, FCN models predict labels for large objects using only local information, and often ignore small objects or classify them as background. Second, in FCNs, the input to the deconvolutional layer is too coarse and the deconvolution procedure is overly simple.

We simplify our DeConvNet model by reducing the number of filters per convolutional layer and incorporating fewer (de)convolutional blocks. TensorFlow does not provide an unpooling layer, so we adapt the implementation from https://github.com/aizawan/segnet/blob/master/ops.py.
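The idea behind unpooling is to remember where each maximum came from during pooling and put pooled values back in those positions during decoding. A minimal NumPy sketch of the same idea (the linked repository implements it with TensorFlow ops):

```python
import numpy as np

def max_pool_with_argmax(x, k=2):
    """k x k max pooling that also records the flat index of each maximum."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    argmax = np.zeros((h // k, w // k), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            block = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            flat = int(np.argmax(block))
            pooled[i, j] = block.flat[flat]
            # Convert the in-block position back to a flat index in the input.
            argmax[i, j] = (i * k + flat // k) * w + (j * k + flat % k)
    return pooled, argmax

def unpool(pooled, argmax, shape):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    out = np.zeros(np.prod(shape))
    out[argmax.ravel()] = pooled.ravel()
    return out.reshape(shape)

x = np.array([[1., 3.], [2., 4.]])
p, idx = max_pool_with_argmax(x)
assert np.array_equal(unpool(p, idx, x.shape), [[0., 0.], [0., 4.]])
```

Because the decoder reuses the encoder's switch locations, unpooling restores sharp activation positions that plain upsampling would blur.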
U-Net
The U-Net model [14] is built upon [11]. It is designed specifically for biomedical data, where very little training data is available. Unlike [11], U-Net uses a large number of feature channels in the upsampling path, which allows the network to propagate context information to higher-resolution layers. The network has no fully connected layers and only uses the valid part of each convolution. We implement the simplified version of the U-Net model shown in Figure 6, with a three-channel output corresponding to the three segmentation areas. The model is simplified in terms of the number of filters per convolutional layer and the number of blocks.

Dilated Convolutional Models
Due to the translation-invariant property of convolutional layers, the FCN model is reliable at predicting the presence, and roughly the position, of objects in an image. However, as a trade-off between classification accuracy and localization accuracy, the FCN model may not be able to sketch the exact outline of an object. Instead of standard convolutional layers, we therefore consider using dilated convolutional layers [15] in the FCN architecture above. Unlike standard convolutional layers, which apply filters to blocks of adjacent pixels, dilated convolutional layers apply filters to kernel-size blocks of pixels spaced l apart, skipping l-1 pixels between taps, where l is the dilation rate (Figure 7).
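A dilated convolution can be sketched directly in NumPy. This toy single-channel implementation is only meant to show how the dilation rate widens the receptive field without adding kernel weights; real models would use a framework's built-in dilated convolution:

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """'Valid' 2-D convolution whose kernel taps are spaced `rate` pixels apart."""
    kh, kw = kernel.shape
    span_h = (kh - 1) * rate + 1   # receptive field grows with the dilation rate
    span_w = (kw - 1) * rate + 1
    H, W = x.shape
    out = np.zeros((H - span_h + 1, W - span_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Slice with step `rate`, skipping rate-1 pixels between taps.
            patch = x[i:i + span_h:rate, j:j + span_w:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 3x3 kernel with rate 2 covers a 5x5 region while using only 9 taps.
x = np.ones((5, 5))
out = dilated_conv2d(x, np.ones((3, 3)), rate=2)
assert out.shape == (1, 1) and out[0, 0] == 9.0
```

With rate 1 this reduces to an ordinary valid convolution, which is why dilated layers can drop into an FCN without changing the rest of the architecture.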

Since we only have 187 images, we first use data augmentation, adding rotated or flipped images to the original training set, to increase the training sample size and the robustness of our models. All four models are trained for 50 epochs over the whole training set with batch size 1. The trained models are then applied to the test set, and the accuracy metrics described below are computed.
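An augmentation step of this kind might look like the sketch below. For simplicity it uses 90-degree rotations, which keep the pixel grid aligned; arbitrary-angle rotation as used in our experiments additionally requires interpolation (e.g. scipy.ndimage.rotate). The key point is that the image and its segmentation mask must receive the same transform:

```python
import numpy as np

def augment(image, mask, rng):
    """Apply the same random flip/rotation to an image and its segmentation mask."""
    if rng.random() < 0.5:
        image, mask = np.fliplr(image), np.fliplr(mask)   # horizontal flip
    if rng.random() < 0.5:
        image, mask = np.flipud(image), np.flipud(mask)   # vertical flip
    k = int(rng.integers(0, 4))   # 0-3 quarter turns keep the grid aligned
    return np.rot90(image, k), np.rot90(mask, k)

rng = np.random.default_rng(0)
img, msk = np.arange(16.).reshape(4, 4), np.eye(4)
aug_img, aug_msk = augment(img, msk, rng)
assert aug_img.shape == img.shape and aug_msk.shape == msk.shape
```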
Metrics
Model performance for image segmentation is measured differently from classification. We evaluate the models using the following metrics. [4]
Pixel Accuracy
Pixel accuracy (PA) measures the proportion of correctly classified pixels. For K+1 classes, the pixel accuracy is defined as

$$\mathrm{PA} = \frac{\sum_{i=0}^{K} p_{ii}}{\sum_{i=0}^{K}\sum_{j=0}^{K} p_{ij}},$$

where $p_{ij}$ is the number of pixels of class $i$ predicted as belonging to class $j$.
Mean pixel accuracy (MPA) extends PA by computing the proportion of correctly predicted pixels for each class separately and then averaging over the total number of classes.
Intersection over Union
Pixel accuracy has the limitation that it is biased in the presence of very imbalanced classes, while mean pixel accuracy is not suitable for data with a strong background class. Another segmentation evaluation metric is the intersection over union (IoU), defined as the area of intersection between the predicted segmentation map A and the ground-truth map B, divided by the area of their union:
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}.$$
The mean intersection over union (Mean-IoU) is defined as the average IoU over all classes.
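All three metrics can be read off a confusion matrix. A small NumPy sketch, using the definitions above (rows are true classes, columns are predictions):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    # cm[i, j] = number of pixels of class i predicted as class j (p_ij above).
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true.ravel(), y_pred.ravel()):
        cm[t, p] += 1
    return cm

def pixel_accuracy(cm):
    return np.diag(cm).sum() / cm.sum()

def mean_pixel_accuracy(cm):
    # Per-class accuracy averaged over classes.
    return np.mean(np.diag(cm) / cm.sum(axis=1))

def mean_iou(cm):
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    return np.mean(inter / union)

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
cm = confusion_matrix(y_true, y_pred, 2)
assert pixel_accuracy(cm) == 0.75
assert mean_pixel_accuracy(cm) == 0.75
assert abs(mean_iou(cm) - 7 / 12) < 1e-12
```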
In this project, we take 50% accuracy under the intersection over union metric as a baseline for all models. Our goal is to achieve 70-75% accuracy for the four models. If these accuracies are easily achieved, we will consider adjusting the models to reach around 90% accuracy.
Results
For each model, we perform four experiments with different data augmentation methods: (a) no augmentation at all; (b) randomly flip each image vertically or horizontally; (c) randomly rotate each image by any angle; (d) both image flip and rotation.
| Model | Pixel Accuracy (PA) | Mean Pixel Accuracy (MPA) | Intersection over Union (IoU) |
|---|---|---|---|
| FCN (no aug) | 0.962719 | 0.620694 | 0.459717 |
| Dilated FCN (no aug) | 0.974142 | 0.581018 | 0.481629 |
| DeConvNet (no aug) | 0.974025 | 0.619032 | 0.509451 |
| U-Net (no aug) | 0.976238 | 0.718744 | 0.597541 |
| FCN (flip) | 0.962796 | 0.601290 | 0.451951 |
| Dilated FCN (flip) | 0.977338 | 0.606064 | 0.481629 |
| DeConvNet (flip) | 0.979059 | 0.705706 | 0.515348 |
| U-Net (flip) | 0.978152 | 0.741071 | 0.616579 |
| FCN (rotation) | 0.969019 | 0.597827 | 0.468806 |
| Dilated FCN (rotation) | 0.976503 | 0.578570 | 0.504504 |
| DeConvNet (rotation) | 0.977549 | 0.637209 | 0.554754 |
| U-Net (rotation) | 0.979722 | 0.719528 | 0.613457 |
| FCN (both) | 0.962796 | 0.601290 | 0.451951 |
| Dilated FCN (both) | 0.972596 | 0.644752 | 0.538985 |
| DeConvNet (both) | 0.980162 | 0.698144 | 0.601530 |
| U-Net (both) | 0.973854 | 0.740341 | 0.601107 |
It can be seen that for our task, where the class labels are highly imbalanced, pixel accuracy is not representative. We therefore compare performance using the mean pixel accuracy over classes and the IoU.
FCN is outperformed by the two encoder-decoder models because of its thin deconvolutional decoder and the coarse output of its final deconvolutional layer. The dilated FCN improves on the FCN but is still outperformed by the encoder-decoder models. Of the two encoder-decoder models, U-Net generally outperforms DeConvNet. This is not surprising, since U-Net is designed specifically for medical images and is expected to achieve higher segmentation accuracy with less data. Moreover, randomly flipping input images seems to improve performance for both models. Figures 8 and 9 show the true (top) and predicted (bottom) segmentations of 10 test images from FCN and dilated FCN trained with data augmented by flipping and rotation; Figures 10 and 11 show the same for DeConvNet and U-Net. U-Net gives smoother edges than FCN, dilated FCN and DeConvNet.




Challenges
One challenge of MRI segmentation is the imbalance between labels. Since the cross-entropy loss is based on pixel-wise accuracy, it is easy for our models to produce all-background predictions. To address this, we down-sampled all-background images in the training set and used a weighted cross-entropy loss, assigning a weight ratio of 1:3:3 to labels 0, 1 and 2. After this adjustment, it becomes easier for the DeConvNet and U-Net to learn the position and shape of the prostate (IoU accuracies exceed 60%). Note that down-sampling leaves less training data, so augmentation is necessary for better performance.
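A minimal NumPy sketch of the weighted cross-entropy, with the 1:3:3 ratio above; the probabilities and pixel counts here are made up for illustration, and a real training loop would use the framework's weighted loss on full score maps:

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """probs: (N, K) softmax outputs; labels: (N,) integer class per pixel."""
    picked = probs[np.arange(labels.size), labels]   # probability of the true class
    w = class_weights[labels]                        # per-pixel weight by true class
    return float(np.mean(-w * np.log(picked)))

class_weights = np.array([1.0, 3.0, 3.0])   # background vs. the two prostate zones
probs = np.array([[0.5, 0.25, 0.25],
                  [0.2, 0.7, 0.1]])
labels = np.array([0, 1])
loss = weighted_cross_entropy(probs, labels, class_weights)
assert abs(loss - (np.log(2) + 3 * -np.log(0.7)) / 2) < 1e-12
```

Up-weighting the two zone classes makes a confident all-background prediction costly, which is what pushes the models away from that trivial solution.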
Another challenge is that the original FCN-8s model failed to learn any features and kept producing all-background predictions. After comparing the structures of FCN and U-Net, we realized the importance of comparable encoder and decoder depths in image segmentation. We also noticed that using stride 8 in the last deconvolutional layer of FCN makes its output very coarse. After increasing the number of filters and reducing the strides in the deconvolutional layers of FCN, we improved its performance and reached 48.227% IoU. Furthermore, we considered using dilated convolutional layers in the FCN model, which apply filters to kernel-size blocks of pixels spaced l apart, where l is the dilation rate. Since a dilated convolutional network can extract information from a larger region, the dilated FCN improved performance (5% IoU improvement with no augmentation, 14% with flip augmentation, 8% with rotation augmentation and 12% with both).
The third challenge is that the two encoder-decoder models do not learn well in their originally proposed architectures. One possible reason is that these models are too complicated for this task and overfit to noise. Moreover, the original DeConvNet and U-Net models are too large to train on a 16GB GPU. To address both problems, we simplified the encoder-decoder models by reducing the number of encoder and decoder blocks as well as the number of filters per block.
Reflection
In this project, we achieved our basic goals and part of our target goals: implementing four image segmentation models and reaching IoU and mean pixel accuracy over 50%. We planned to reproduce the model architectures proposed in the original papers, but it turned out that simpler models of the same structure performed well. We also performed more data pre-processing than planned because of the high imbalance in segmentation labels, and using a weighted loss helped training as well.
During the project, we found the models' performance poorer than we expected. There are two possible reasons. First, the training set is relatively small: after down-sampling the all-background images, there are only 187 original images in the training set, and 561 images after both flip and rotation augmentation. Second, unlike ordinary images, which have three color channels and a roughly balanced split between subject and background, the MRIs have a single channel, the prostate usually sits at the center of the image, and its area is usually much smaller than the background. This makes it harder for neural networks to learn features.
Our project shows that comparable encoder and decoder depths are crucial for segmenting this kind of image. Deep convolutional layers extract features from the inputs, while deconvolution, upsampling or unpooling layers should be incorporated to reconstruct the inputs adequately (in contrast to FCN, where only a coarse deconvolution layer is applied). Moreover, for image segmentation tasks where the label classes are highly imbalanced, adjusting the model with appropriate class weights is also very effective. By comparing the FCN and dilated FCN models, we also found that dilated convolutional layers may be better at learning features for this problem.
We also noticed that training the dilated FCN model can be unstable: the same code sometimes gets stuck producing all-background predictions and at other times learns the features. Our insight is that initialization can be important to neural network training, and the all-background prediction may be a local minimum the model easily gets stuck in. In future work, we could explore why the dilated FCN occasionally fails to learn any features, and modify the structures of DeConvNet and U-Net, for example by using dilated convolutional layers, to see whether their performance can be improved. Existing models with pretrained weights could also be included in the comparison.