Deep Neural Networks for Galaxy Morphology Classification

Who:

Bitao Jin (bjin8) James Ro (jro3) Christopher Tripp (crtripp)

Introduction:

We can deduce important information about a galaxy by studying its morphology. In particular, its shape and features can tell us about its star-forming activity, its evolutionary history, and its interaction with its local environment. In the past, astronomers have manually examined images of galaxies taken by astronomical surveys. However, modern large-scale surveys capture so many images that it is no longer possible for professional astronomers to manually inspect them all. This has led to the “citizen- science” project Galaxy Zoo, which allows anyone to analyze random images of galaxies from a given data set and record information that they perceive about the galaxy’s visual features. This has in turn led to the establishment of large, labeled (with respect to morphology) sets of galaxy images. In this project we hope to design a deep neural network which is able to classify images in the Galaxy10: DECaLS data set to a high level of accuracy.

Related Work:

Most previous studies of galaxy image classification have used fewer than six classes, but in a 2022 study (https://arxiv.org/abs/2211.00397), Gharat and Dandawate noted that new images are of sufficiently high quality that using a greater number of classes is warranted, since that could take advantage of finer-grain distinctions among morphologies. In their study, they designed a CNN which achieved test accuracy of 84.73% on 21785 images from the Sloan Digital Sky Survey (SDSS) using Galaxy Zoo labels.

In another 2022 study (https://arxiv.org/abs/2211.00677), Ciprijanovic et al. applied off-the-shelf ResNet50 on a subset of the Galaxy10: DECaLS data and achieved 43% classification accuracy.

One of our goals is to build upon these recent results and design a deep neural network which, when applied to the Galaxy10: DECaLS data, achieves a high classification accuracy.

Data:

The Galaxy Zoo Data Release 2 (GZ DR2) dataset consisted of ~270,000 images taken from the Sloan Digital Sky Survey (SDSS), alongside corresponding labels (with respect to 10 broad classes of galaxy morphologies) established through the Galaxy Zoo project. Later, the Galaxy Zoo also established labels for a large number of images taken from the Dark Energy Camera Legacy Survey (DECaLS), which are much higher resolution and image quality compared to the images from SDSS.

We will be using the Galaxy10 DECaLS data set, which contains 17,736 galaxy images (256x256 pixels, colored) from DECaLS, labeled (through Galaxy Zoo) according to 10 classes: Class 0 (1081 images): Disturbed Galaxies Class 1 (1853 images): Merging Galaxies Class 2 (2645 images): Round Smooth Galaxies Class 3 (2027 images): In-between Round Smooth Galaxies Class 4 ( 334 images): Cigar Shaped Smooth Galaxies Class 5 (2043 images): Barred Spiral Galaxies Class 6 (1829 images): Unbarred Tight Spiral Galaxies Class 7 (2628 images): Unbarred Loose Spiral Galaxies Class 8 (1423 images): Edge-on Galaxies without Bulge Class 9 (1873 images): Edge-on Galaxies with Bulge

This data should require relatively little pre-processing, and has been provided at https://github.com/henrysky/Galaxy10

Methodology: We anticipate using a convolutional neural network, given its appropriateness for image classification tasks. We have not yet established the specifics of the architecture.

We will be dividing the 17,736 images from the data set into training, validation, and test sets.

Metrics:

Our primary metric will be classification accuracy on the test set, which we will calculate using argmax and categorical cross-entropy. One goal is to surpass the 43% accuracy achieved by Ciprijanovic et al. using ResNet50 on a modified version of the Galaxy 10 DECaLS data. Another goal is to surpass the 84.73% accuracy achieved by Gharat and Dandawate using a CNN; although Gharat and Dandawate used a data set from SDSS, their data set is of comparable size to ours, and used some of the same Galaxy Zoo labels (ours just uses higher quality images from DECaLS in such cases), so it seems like a somewhat informative comparison.

Ethics:

The results of our project should have little, if any, obvious societal impact, given the object of study.

One issue worth considering is that our labels were provided by crowd-sourcing through a citizen science project. While this shouldn’t introduce any bias issues, it is worth considering the ethics of profiting (in the intellectual sense, not the financial sense) from the labor of hundreds if not thousands of volunteers.

Division of labor:

We anticipate that labor will be divided equally and organically among the three group members as the project proceeds.

Second Check-in | Reflection

The reflection was posted as an update. But another link: https://docs.google.com/document/d/1SooQjTE2KhFK_M0pQXFnBMc5w1mIR7DI0lEbQYEgDKg/edit?usp=sharing

Final Reflection | Write-Up

https://docs.google.com/document/d/1lzMJgbuC2rYswJVTbebGH2BO0BVxImgPpkoxlVQUPLI/edit

Our Video Recording

https://drive.google.com/file/d/1qp8j2NSI3WMJsBWva-VLUZj7xK5G9fjR/view?ts=63964e57

Our Poster

https://docs.google.com/presentation/d/1RHx1ts7_qiU5ITfbLAocp5uhFEa9LutLCd1K_QUXffE/edit#slide=id.p

Built With

tensorflow

Updates

Christopher Tripp posted an update — Nov 30, 2022 10:03 PM EST

Introduction We can deduce important information about a galaxy by studying its morphology. In particular, its shape and features can tell us about its star-forming activity, its evolutionary history, and its interaction with its local environment. In the past, astronomers have manually examined images of galaxies taken by astronomical surveys. However, modern large-scale surveys capture so many images that it is no longer possible for professional astronomers to manually inspect them all. This has led to the “citizen- science” project Galaxy Zoo, which allows anyone to analyze random images of galaxies from a given data set and record information that they perceive about the galaxy’s visual features. This has in turn led to the establishment of large, labeled (with respect to morphology) sets of galaxy images. In this project we hope to design a deep neural network which is able to classify images in the Galaxy10: DECaLS data set to a high level of accuracy.

Challenge The biggest problem we encountered during this project is that CNN models tend to overfit which is determined by the nature of neutral networks. We observed that the training error in image classification is significantly lower than the error in the testing set, which serves as a strong indicator for overfitting. We conclude that the model is incapable of generalized prediction or classification beyond the scope of the training set. We applied the method of Data Augmentation by artificially expanding the size of the training dataset and also adjusting the CNN architecture, and eventually the validation error drops as a signal of regular fitting.

Insight There are various factors that could affect the efficiency, accuracy and performance of CNN models. By implementing the methods proposed, we gained practical experiences in terms of the following two aspects: Firstly, we gained insight on optimized selection. A common choice for image classification using CNN is to do optimization on gradient-based methods. The loss function was minimized and yielded the optimal solution parameter set. While the original version of gradient descent requires the evaluations of the gradient based on the whole data set and in our case the time complexity can increase exponentially. In our project, the data size is huge, and the loss function is of a more complicated analytical form, which makes the original version impractical. An alternative way is stochastic gradient descent, which only uses one instance to evaluate the gradient at a time while the convergence cannot be ensured since stochastic method only ensures the most steep direction on average and the optimal solution can easily get trapped by bouncing back and forth. Our insight here on optimizer selection is to use a more balanced optimization between fully stochastic and original methods. Possible choice here could be batch gradient descent or mini batch gradient descent. Secondly, we learned more on the selection of enhancement methods. For the purpose of generating an output of better quality, we found it is better to utilize methods like AdaDelta, Adagrad, and momentum. In practice we do realize the combination of AdaDelta along with batch gradient descent achieves the best performance.

Plan Up to this point, we'd consider our project somewhat on track. The data is processed, the proposed method is basically implemented, and various methods and architecture are being used and compared to get the best model performance. We still need more time to polish our project presentation, report and change some of our codes to be bug-free. Also, we think the data analysis coding can be optimized to achieve a better time complexity. The thing we want to change here is about data representation because it is closely related to data augmentation and even feature selection. We do believe that batch normalization could be helpful if we have more time.

Log in or sign up for Devpost to join the conversation.

Christopher Tripp started this project — Nov 13, 2022 05:51 PM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.