About the Galaxy Morphological Classification Dataset

Introduction

Over the previous five articles, I classified galaxy shapes (morphology) using CNNs (VGG16, ResNet) and ViT. In this article, I take a closer look at the dataset used for that classification, because I want to re-examine it when analyzing errors later.

The dataset I used

As noted previously, I used [Galaxy10 DECals](https://astronn.readthedocs.io/en/latest/galaxy10.html) as my dataset. Its main characteristics are as follows.

  • The original Galaxy10 was built from Galaxy Zoo (GZ) Data Release 2, in which volunteers classified about 270,000 SDSS images; about 22,000 of those were selected and sorted into 10 broad classes based on volunteer votes. GZ later moved to images from the DESI Legacy Imaging Surveys (DECals). Galaxy10 DECals combines GZ DR2 (with DECals images in place of the SDSS images) and the DECals campaigns ab and c. Of the approximately 440,000 galaxies covered by DECals, approximately 18,000 were selected and sorted into 10 broad classes based on volunteer votes, with stricter filtering.
  • Each image is 256x256 pixels with 3 channels; the dataset contains exactly 17,736 images.
  • The 10 classes and their image counts are: Disturbed (1,081), Merging (1,853), Round Smooth (2,645), In-between Round Smooth (2,027), Cigar (334), Barred Spiral (2,043), Tight Spiral (1,829), Loose Spiral (2,628), Edge-on without Bulge (1,423), and Edge-on with Bulge (1,873).
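
The class counts above are clearly imbalanced (Cigar has 334 images while Round Smooth has 2,645). As a rough illustration, the following sketch computes each class's share of the dataset and a set of inverse-frequency class weights. The counts are the ones listed above; the function names and the weighting formula `total / (n_classes * count)` are my own choices for illustration, not something prescribed by the dataset.

```python
# Class names and image counts as listed above
# (the label index is the position in this dict).
GALAXY10_CLASS_COUNTS = {
    "Disturbed": 1081,
    "Merging": 1853,
    "Round Smooth": 2645,
    "In-between Round Smooth": 2027,
    "Cigar": 334,
    "Barred Spiral": 2043,
    "Tight Spiral": 1829,
    "Loose Spiral": 2628,
    "Edge-on without Bulge": 1423,
    "Edge-on with Bulge": 1873,
}


def class_shares(counts):
    """Return each class's fraction of the whole dataset."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count).

    A perfectly balanced dataset would give every class a weight of 1.0;
    rare classes get weights above 1.0, common classes below 1.0.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


if __name__ == "__main__":
    shares = class_shares(GALAXY10_CLASS_COUNTS)
    weights = balanced_class_weights(GALAXY10_CLASS_COUNTS)
    print(f"total images: {sum(GALAXY10_CLASS_COUNTS.values())}")
    for name in GALAXY10_CLASS_COUNTS:
        print(f"{name:26s} {shares[name]:6.1%}  weight {weights[name]:.2f}")
```

Weights like these could be passed to a loss function during training, which may be worth trying when looking at errors on the rare classes.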

The dataset in “Galaxy Morphological Classification with Efficient Vision Transformer”

Next, I looked into the dataset used in Galaxy Morphological Classification with Efficient Vision Transformer.

  • The abstract points to a GitHub repository. According to that repository, the dataset comes from the Galaxy Zoo 2 project (GZ2), and appears to have been obtained from the Kaggle page.
  • Each image is 424x424 pixels with 3 channels; the channels correspond to the g, r, and i filters.
  • There are 8 classes, labeled 0 to 7 in this order: round elliptical, in-between elliptical, cigar-shaped elliptical, edge-on, barred spiral, unbarred spiral, irregular, and merger.
  • The 155,951 images are split into 64% for training, 16% for validation, and 20% for testing.
  • Images are cropped to 224x224x3, and data augmentation is applied by flipping and rotating.
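
To make the split and the augmentation concrete, here is a minimal sketch in plain Python. It assumes a center crop and rotations in 90-degree steps, which the description above does not specify, so treat those as my assumptions rather than the paper's exact procedure. Images are represented as nested lists of rows, and all function names are hypothetical; in practice one would do this with NumPy or a deep learning framework.

```python
import random


def split_indices(n_items, fractions=(0.64, 0.16, 0.20), seed=0):
    """Shuffle indices 0..n_items-1 and cut them into train/val/test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = int(n_items * fractions[0])
    n_val = int(n_items * fractions[1])
    # The test set takes whatever remains, so no image is lost to rounding.
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )


def center_crop(image, size):
    """Cut a square of side `size` out of the middle of a row-major image."""
    top = (len(image) - size) // 2
    left = (len(image[0]) - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the 8 variants of a square image: 4 rotations, each with and without a flip."""
    variants = []
    current = image
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate_90(current)
    return variants


if __name__ == "__main__":
    train, val, test = split_indices(155951)
    print(len(train), len(val), len(test))
```

Since galaxies have no preferred orientation on the sky, flips and rotations change the pixels without changing the class label, which is why this kind of augmentation is a natural fit here.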

The dataset in “Galaxy Morphology Classification with Deep Convolutional Neural Networks”

I also examined the dataset used in Galaxy Morphology Classification with Deep Convolutional Neural Networks.

  • The dataset is described in section “2 DATASET.” It uses data from Galaxy Zoo - The Galaxy Challenge (Kaggle).
  • It contains 61,578 color JPG images (424x424x3).
  • Based on the vote fractions $f_{\text{smooth}}$, $f_{\text{completely round}}$, $f_{\text{in-between}}$, $f_{\text{cigar-shaped}}$, $f_{\text{features/disk}}$, $f_{\text{edge-on,yes}}$, $f_{\text{edge-on,no}}$, and $f_{\text{spiral,yes}}$, galaxies are assigned to five classes, labeled 0 to 4 in this order: Completely round smooth (8,434), In-between smooth (8,069), Cigar-shaped smooth (578), Edge-on (3,903), and Spiral (7,806). (These counts are listed in Table 1 of the paper.)
  • In total, 28,790 images are used, split 9:1 into training and testing sets (25,911 and 2,879 images).
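
As a quick sanity check on the numbers above, the sketch below confirms that the five class counts add up to 28,790 and shows one way a 9:1 split could be made per class. The per-class (stratified) split is my assumption for illustration; the description above only gives the overall split sizes. The function name is hypothetical.

```python
# Class counts as listed above (label index = position in this dict).
CHALLENGE_CLASS_COUNTS = {
    "Completely round smooth": 8434,
    "In-between smooth": 8069,
    "Cigar-shaped smooth": 578,
    "Edge-on": 3903,
    "Spiral": 7806,
}


def stratified_split_sizes(counts, test_fraction=0.1):
    """Per-class (train, test) sizes for a split that keeps class proportions."""
    sizes = {}
    for name, n in counts.items():
        n_test = round(n * test_fraction)
        sizes[name] = (n - n_test, n_test)
    return sizes


if __name__ == "__main__":
    sizes = stratified_split_sizes(CHALLENGE_CLASS_COUNTS)
    total = sum(CHALLENGE_CLASS_COUNTS.values())
    n_train = sum(train for train, _ in sizes.values())
    n_test = sum(test for _, test in sizes.values())
    print(f"total={total} train={n_train} test={n_test}")
    for name, (train, test) in sizes.items():
        print(f"{name:24s} train={train:5d} test={test:4d}")
```

Splitting per class keeps rare classes such as Cigar-shaped smooth represented in the test set in proportion to their size.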

Evaluation

The table below compares the three datasets described above.

| Dataset name | Galaxy10 DECals | Galaxy Zoo 2: Images | Galaxy Zoo - The Galaxy Challenge |
| --- | --- | --- | --- |
| Number of images | 17,736 | 155,951 | 61,578 |
| Number of classes | 10 | 8 | 5 |
| Image size | 256x256x3 | 424x424x3 | 424x424x3 |

The dataset I used, Galaxy10 DECals, is not bad, but with 17,736 images it is considerably smaller than the other two.
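
To put the size difference in numbers, the following sketch computes how many times larger each dataset is than Galaxy10 DECals, along with a rough estimate of the uncompressed size of the images. The image counts and dimensions come from the comparison above; the assumption of one byte per channel value (8-bit images) and the function names are mine.

```python
# Image counts and side lengths taken from the comparison above.
DATASETS = {
    "Galaxy10 DECals": {"images": 17736, "side": 256},
    "Galaxy Zoo 2: Images": {"images": 155951, "side": 424},
    "Galaxy Zoo - The Galaxy Challenge": {"images": 61578, "side": 424},
}


def size_ratio(name, baseline="Galaxy10 DECals"):
    """How many times more images `name` has than `baseline`."""
    return DATASETS[name]["images"] / DATASETS[baseline]["images"]


def raw_size_gib(name, channels=3, bytes_per_value=1):
    """Uncompressed size in GiB, assuming 8-bit values unless told otherwise."""
    info = DATASETS[name]
    n_bytes = info["images"] * info["side"] ** 2 * channels * bytes_per_value
    return n_bytes / 2**30


if __name__ == "__main__":
    for name in DATASETS:
        print(f"{name:36s} x{size_ratio(name):.2f}  ~{raw_size_gib(name):.1f} GiB raw")
```

The larger datasets also use larger images, so the memory needed to hold them uncompressed grows faster than the image count alone suggests.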

I would like to use Galaxy Zoo 2: Images in the future.

