Hi! This is team appropriate)*. We chose this name because it's appropriate :) Our team members include Kailiang Fu (kfu6), Yuan Pu (ypu7), and Yuchen Han (yhan33).
Link to Code: https://github.com/Kail-Fu/BoneyBoney_Code
Link to Poster: https://drive.google.com/file/d/1YTKukLO7U7P_s_DWOnAXr6wjJBPaKLkQ/view?usp=sharing
Link to Final Writeup: https://docs.google.com/document/d/1mooouWVYAyr6Fny7N12VmkWQIasVvCHFUwksn7-WkBw/edit?usp=sharing
Link to Second Checkpoint Reflection: https://docs.google.com/document/d/14GYzo5AYnDLg1TGqEzA3OPRqmIKiKT37TmllxiGmtGY/edit?usp=sharing
Introduction
“Attention Is All You Need” introduced one of the most groundbreaking and influential concepts in deep learning, but in this class we have only explored its use in natural language processing. Hence, we began to wonder whether the same idea could be applied to image processing.
The paper we are reimplementing is Exploring Self-attention for Image Recognition. It explores variations of self-attention and assesses their effectiveness for image recognition, specifically on the ImageNet dataset. Instead of using a generic dataset, we will train and test the model on bone X-rays from the MURA dataset. We hope to achieve satisfactory accuracy in recognizing bone images with the help of self-attention.
Data
We will reimplement the paper and apply it to MURA (musculoskeletal radiographs), an open-source dataset of bone radiographs from the Stanford ML group consisting of 40,561 images from 14,863 studies. Each image is labeled with its corresponding upper-extremity body part and its form (normal or abnormal).
Due to the limited computational resources available, we have shrunk the dataset to 18,677 training images and 1,975 testing images spanning five classes.
Preprocessing for this dataset is not too heavy. The radiographs are already grouped into directories by body part, and each image is labeled as normal or abnormal in its file name, so the correct labels are easy to extract. We must also separate the dataset into two mutually exclusive sets, one for training and one for testing. The more challenging step is converting the radiographs into flattened matrices cropped to the same size; this is similar to what we did in the CNN assignment, with the additional step of resizing the images.
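The steps above can be sketched roughly as follows. This is a minimal illustration, not our exact pipeline: the directory layout follows MURA's published structure, but the target size, the nearest-neighbour resize, and all function names are our own illustrative choices.

```python
import numpy as np

# Illustrative sketch of the preprocessing described above.
# Assumptions: MURA directory names like "XR_HAND" encode the body part,
# and study directory names containing "positive" mark abnormal images.
BODY_PARTS = ["XR_ELBOW", "XR_FINGER", "XR_FOREARM", "XR_HAND", "XR_HUMERUS"]

def body_part_label(path):
    """Map a MURA file path to an integer class id via its directory name."""
    for idx, part in enumerate(BODY_PARTS):
        if part in path:
            return idx
    raise ValueError(f"unknown body part in {path}")

def abnormality_label(path):
    """1 if the study directory name marks the image as abnormal, else 0."""
    return 1 if "positive" in path else 0

def resize_nearest(img, size):
    """Nearest-neighbour resize of a 2-D grayscale array to (size, size)."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def preprocess(img, size=224):
    """Scale a radiograph to [0, 1], resize it, and flatten it."""
    out = resize_nearest(img.astype(np.float32) / 255.0, size)
    return out.reshape(-1)
```

In practice we would use a library resizer (e.g. PIL or TensorFlow) rather than nearest-neighbour indexing; the point here is only the shape of the pipeline: extract labels from paths, normalize, resize, flatten.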
Related Work
The Stanford ML group has proposed a baseline model in the paper MURA: Large Dataset for Abnormality Detection in Musculoskeletal Radiographs. A 169-layer CNN is trained to detect and localize abnormalities with high accuracy. The performance of the model is comparable to the best radiologist performance in detecting abnormalities on finger and wrist studies, but it is lower for detecting abnormalities on other body parts.
Our model differs from this paper's - and from other traditional image-processing models - in that we apply a CNN with self-attention. Our main goal also differs: instead of detecting abnormalities in the bones, we want to see whether our model can correctly recognize the body part shown in each image.
We are reimplementing this self-attention model for image recognition in TensorFlow, drawing insights from a similar model along the way.
Methodology
As described in the source paper, our model has 5 self-attention blocks, each consisting of multiple self-attention layers. We follow the SAN10 architecture from the paper - the 5 blocks contain 2, 1, 2, 4, and 1 self-attention layers, respectively. Between these blocks are transition layers, which reduce spatial resolution (to lower the computational burden and expand the receptive field) and expand channel dimensionality. Finally, a classification layer generates the predicted probability for each class of the input image. The chart from the paper summarizing this structure is attached below, along with our own illustrated version (including the self-attention layer structure from the paper).
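The stage layout can be traced in a few lines. The layer counts [2, 1, 2, 4, 1] come from the SAN10 description above; the stem width of 64 channels, the halving of spatial resolution, and the doubling of channels at each transition are our illustrative assumptions, not the paper's exact numbers.

```python
# Illustrative trace of how resolution and channel width evolve through the
# five SAN10 blocks. Only the layer counts are taken from the paper; the
# specific widths and the halve/double rule are assumptions for illustration.
LAYERS_PER_BLOCK = [2, 1, 2, 4, 1]

def san10_shapes(input_size=224, stem_channels=64):
    """Return (resolution, channels, num_layers) for each of the 5 blocks."""
    res, ch = input_size, stem_channels
    stages = []
    for n_layers in LAYERS_PER_BLOCK:
        stages.append((res, ch, n_layers))
        # transition layer: reduce spatial resolution, expand channels
        res, ch = res // 2, ch * 2
    return stages

for res, ch, n in san10_shapes():
    print(f"{n} self-attention layer(s) at {res}x{res}, {ch} channels")
```

This makes the trade-off in the paragraph concrete: each transition shrinks the feature map (cheaper attention, larger receptive field) while growing the channel dimension.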
Metrics
We will primarily measure success by accuracy. In the paper we are reimplementing, the authors achieve around 75% top-1 accuracy on the ImageNet dataset. The dataset we use has only 5 labels: elbow, finger, forearm, hand, and humerus. Therefore, the project is successful if we achieve more than 75% accuracy on bone recognition.
In the paper we are reimplementing, the authors aim to demonstrate the usefulness of self-attention for image recognition, quantifying the result by comparing their model's accuracy with that of a traditional CNN. Our base goal is 50% accuracy on bone recognition, our target goal is 75%, and our stretch goal is to explore self-attention for classifying whether a bone image is normal or abnormal and to reach 60% accuracy on that task.
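The top-1 accuracy metric we compare against is simple to state precisely. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    return float(np.mean(np.argmax(logits, axis=1) == labels))

# Toy example with 3 examples over 3 classes.
logits = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1],
                   [0.2, 0.2, 0.6]])
labels = np.array([1, 0, 2])
print(top1_accuracy(logits, labels))  # → 1.0
```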
Ethics
Our project belongs to the broader domain of medical image processing. Many societal demands and issues relate to this domain, including but not limited to personalized medicine; remote diagnosis, prognosis, and treatment; and automated operations.
Its application in medicine leaves little tolerance for mistakes. As with self-driving algorithms, people try to design and implement models that make no mistakes. However, mistakes are inevitable, and when they happen, serious consequences - including the loss of life - can follow.
Division of Labor
Kail was responsible for preprocessing the dataset; the rest of the work we did together.
Built With
- python
- tensorflow


