Building A Pulsar Classification Tool.
16133878, Amelia McDowell, CMP6221, MSc AI, GitHub
Massive stars which undergo supernovae can leave behind several kinds of remnant, including nebulae, neutron stars and black holes. A neutron star is an extremely dense celestial body which consists mostly of neutrons and is typically created when the mass of the collapsing star is not quite large enough to form a black hole. With radii of only a few tens of kilometres but masses comparable to that of Sol, neutron stars display remarkable properties. For instance, 'a cubic centimetre of these stars equals the mass of all the human beings on the Earth' (Possenti, date unknown). Typically, neutron stars are extremely hot objects that rotate up to 800 times per second, emitting strong beams of radiation from their magnetic poles. In some special circumstances, the axis of rotation is aligned in such a way that these beams sweep past the Earth and can be observed as regular pulses. Neutron stars observed in this way are known as Pulsars.
In this blog post, the development of a Pulsar classification tool is described. This is based on astronomical observations of over 17000 candidates provided by the 'Pulsar Classification for Class Prediction (High Time Resolution Universe Survey 2)' dataset.
Identifying a Pulsar sounds easy in theory, but it presents a much greater challenge in practice: the candidates are at least several hundred light-years away, and each appears as a small dot in the sky that is hard to distinguish from much larger and closer bodies and systems.
The 'Pulsar Classification for Class Prediction (HTRU 2)' dataset was uploaded to Kaggle in 2020. The dataset is tabular, consisting of 17898 observations of celestial objects, each described by several astronomical features derived from radio wave measurements and their statistics. Table 1 provides an overview of the observational features which can be used as modelling input.
Notably, each feature is a floating point value, and there are no missing values or duplicate samples. The dataset supports a binary classification objective, where light and pulse curve statistics are used as features to predict whether or not an observation is a pulsar. However, the dataset is heavily imbalanced. Of the 17898 samples, only 1639 represent ground-truth Pulsar measurements, meaning that negative examples are almost ten times more frequent. This makes it a challenging dataset for developing a classification model, so metrics which are less sensitive to imbalance than accuracy, such as the F1 score, were adopted.
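To make the effect of the imbalance concrete, the short calculation below uses only the two sample counts quoted above. It shows what a trivial classifier that always answers "not a pulsar" would score. That always-negative classifier is a hypothetical baseline introduced here for illustration, not part of the tool.

```python
# Class counts quoted in the post for the HTRU 2 dataset.
total_samples = 17898
pulsars = 1639
non_pulsars = total_samples - pulsars

# Number of negative examples for every positive example.
imbalance_ratio = non_pulsars / pulsars

# Hypothetical baseline: always predict "not a pulsar".
# It is right on every negative example and wrong on every positive one.
baseline_accuracy = non_pulsars / total_samples
baseline_recall = 0 / pulsars  # it never finds a single pulsar

print(f"negatives per positive:   {imbalance_ratio:.2f}")
print(f"always-negative accuracy: {baseline_accuracy:.1%}")
print(f"always-negative recall:   {baseline_recall:.1%}")
```

A model that never detects a pulsar would still be correct about nine times out of ten, which is why accuracy on its own says little about performance on this dataset.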
PULSAR CLASSIFICATION TOOL DEVELOPMENT
The Pulsar Classification Toolset is built using Python and uses MongoDB Community Edition as the database backend. The tool itself is designed to be operated within a Python virtual environment. To deploy the toolset, several shell scripts were constructed. Firstly, an installation shell script (installation.sh) creates the virtual environment before running pip to install all the required packages listed in a requirements.txt file. This script also runs an external class which creates the Pulsar MongoDB database and inserts the data into a collection, before prompting the user to create a username and password. This username/password pair, which is intended for demonstration purposes only, is later used to log in to the UI.
To run the full tool, a second shell script is provided (run.sh), which sources the virtual environment before starting the app's UI and orchestrating the backend code. The tool itself consists of four core Python classes, as described by the Unified Modelling Language (UML) diagram in Figure 3. Each class uses private attributes which are included in the UML for completeness.
Firstly, the DatabaseClient class is responsible for connecting to the local instance of MongoDB. It provides a login method, which uses a try and except block to check that the provided credentials are accepted by the desired database in MongoDB. It also provides a means to obtain all records from a specific collection via the get_all_records method, which has the option to return the records as a pandas DataFrame. Finally, it provides accessors for the private attributes of the class (namely the pymongo client and database) via the get_client and get_database methods, as well as a means to check that the client has logged in with valid credentials via is_logged_in.
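A minimal sketch of how such a class might look is given below. Only the method names come from the description above. The default database name, host, port, the use of authSource, the 2 second timeout and the client_factory argument (added so that the sketch can be exercised without a running MongoDB server) are all assumptions made for illustration.

```python
from typing import Any, Callable, Optional


class DatabaseClient:
    """Illustrative sketch of a credential-checking MongoDB wrapper."""

    def __init__(self, db_name: str = "pulsar", host: str = "localhost",
                 port: int = 27017,
                 client_factory: Optional[Callable[..., Any]] = None):
        self._db_name = db_name            # assumed database name
        self._host = host
        self._port = port
        self._client_factory = client_factory  # hypothetical hook for testing
        self._client = None
        self._database = None

    def login(self, username: str, password: str) -> bool:
        factory = self._client_factory
        if factory is None:
            from pymongo import MongoClient  # imported lazily
            factory = MongoClient
        try:
            client = factory(host=self._host, port=self._port,
                             username=username, password=password,
                             authSource=self._db_name,
                             serverSelectionTimeoutMS=2000)
            database = client[self._db_name]
            # The client connects lazily, so run a real operation to force
            # the credentials to be checked.
            database.list_collection_names()
        except Exception:
            # Broad catch for the sketch: with pymongo this covers rejected
            # credentials as well as an unreachable server.
            self._client, self._database = None, None
            return False
        self._client, self._database = client, database
        return True

    def is_logged_in(self) -> bool:
        return self._database is not None

    def get_all_records(self, collection: str, as_dataframe: bool = False):
        if not self.is_logged_in():
            raise RuntimeError("login() must succeed before reading records")
        # Exclude MongoDB's internal _id field from the returned documents.
        records = list(self._database[collection].find({}, {"_id": 0}))
        if as_dataframe:
            import pandas as pd  # only needed for the DataFrame option
            return pd.DataFrame(records)
        return records

    def get_client(self):
        return self._client

    def get_database(self):
        return self._database
```

With a real server, usage would be along the lines of `client = DatabaseClient()` followed by `client.login(username, password)`, with the credentials created by the installation script.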
Secondly, the DataLoader class transforms the data into a format suitable for training a Machine Learning (ML) model. It takes an instance of the DatabaseClient and uses the get_all_records method to pull all data from the relevant collection. The prepare_data_training method takes the data from the pulsar database collection and applies some basic cleaning checks via clean_data, which removes NaN values and drops any duplicate rows. Next, it uses scikit-learn's train_test_split function to split the sample into a training set and a test set, with the size of the test set determined by the test_size input argument. A stratified approach was taken, meaning that the proportion of each class is the same in the training set and the test set. Finally, a scikit-learn StandardScaler is used to standardise the features: its statistics are calculated on the training set only and then used to scale both the training and test sets. This scaler is reused by the prepare_data_inference method, which takes a numpy ndarray as an argument and scales it to match the preprocessing that was applied to the training set.
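The split-then-scale sequence described above can be sketched as follows. The data here is a synthetic stand-in (random features with roughly 9% positive labels), and the fixed random_state values are assumptions added so that the example is repeatable. The cleaning step is omitted.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pulsar table: 8 features, roughly 9% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (rng.random(1000) < 0.09).astype(int)

# stratify=y keeps the class proportions (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler on the training set only, then reuse it everywhere else,
# so that no information from the test set leaks into the preprocessing.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# At inference time, a new sample goes through the same fitted scaler.
new_sample = rng.normal(size=(1, 8))
new_sample_scaled = scaler.transform(new_sample)

print("positive fraction (train):", y_train.mean())
print("positive fraction (test): ", y_test.mean())
```

Fitting the scaler on the training set alone is what allows the same object to be reused unchanged for the test set and for later single-sample inference.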
Next, the Predictor class is responsible for initialising the Decision Tree model, as implemented by scikit-learn's tree module. This class provides a method to train the model (train) by passing in the training features and labels from the DataLoader class. Once trained, the Predictor can produce a report of the classification performance of the model via the report_performance method, which calculates the model's accuracy, precision, recall and F1 score on the test dataset. Finally, it provides a method for inference, i.e. for classifying a single sample. As with the DatabaseClient, the Predictor class provides accessors for checking if the model is fitted (is_fitted) and for getting the model (get_model) externally.
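A minimal sketch of a class with this interface is shown below. The method names train, report_performance, is_fitted and get_model follow the description above, while the name predict for the inference method, the random_state argument and the dictionary returned by report_performance are assumptions. The demonstration data is synthetic, generated with make_classification, and is not the pulsar dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import check_is_fitted


class Predictor:
    """Illustrative wrapper around a scikit-learn Decision Tree."""

    def __init__(self, random_state=None):
        self._model = DecisionTreeClassifier(random_state=random_state)

    def train(self, X_train, y_train):
        self._model.fit(X_train, y_train)

    def is_fitted(self):
        try:
            check_is_fitted(self._model)
        except NotFittedError:
            return False
        return True

    def report_performance(self, X_test, y_test):
        y_pred = self._model.predict(X_test)
        return {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred, zero_division=0),
            "recall": recall_score(y_test, y_pred, zero_division=0),
            "f1": f1_score(y_test, y_pred, zero_division=0),
        }

    def predict(self, sample):
        # Hypothetical name: classify one sample given as a 1-D sequence.
        sample = np.asarray(sample, dtype=float).reshape(1, -1)
        return int(self._model.predict(sample)[0])

    def get_model(self):
        return self._model


# Synthetic, imbalanced stand-in data (about 10% positives).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

predictor = Predictor(random_state=0)
predictor.train(X_train, y_train)
metrics = predictor.report_performance(X_test, y_test)
print(metrics)
```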
The PulsarUI class is the orchestration tool. This provides a UI via the tkinter module and wraps around instances of DatabaseClient, DataLoader and Predictor to perform all of the functionality needed to train, test and infer the Decision Tree model. Figure 5 provides the user flow diagram of the tool.
Firstly, the user is presented with a Login screen. This uses a DatabaseClient instance to check whether the supplied credentials are valid. If they are, the Training Window is drawn (create_training_window). This consists of a slider to control what percentage of the dataset is used for training and testing, and a button to start the training procedure. Once the model is trained, the test metrics are displayed on screen and a button for drawing the Decision Tree is added. If clicked, the Tree is drawn in a new window. If the user clicks on the Inference Portal button (create_inference_window), they can input values for each of the eight fields in Table 1 to get a prediction of whether or not the measurement is a pulsar. If the fields are not complete, or the model has yet to be trained, a warning message is shown in the UI. This window also has controls to return to the training screen. The model can be retrained as many times as desired, and the performance metrics will change each time (as train_test_split uses a random number generator to perform the splitting). Figure 6 provides an overview of each of the three windows.
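The checks made before an inference request can be sketched independently of tkinter, as below. The function name, the placeholder field names and the warning texts are all hypothetical. In the real UI the raw strings would come from the entry widgets of the inference window.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Placeholder names standing in for the eight fields of Table 1.
FIELD_NAMES = [f"feature_{i}" for i in range(1, 9)]


def validate_inference_inputs(
    raw_fields: Dict[str, str],
    model_is_fitted: bool,
    field_names: Sequence[str] = FIELD_NAMES,
) -> Tuple[Optional[List[float]], Optional[str]]:
    """Return (values, None) on success or (None, warning) on failure."""
    if not model_is_fitted:
        return None, "Train the model before requesting a prediction."

    values: List[float] = []
    for name in field_names:
        text = raw_fields.get(name, "").strip()
        if not text:
            return None, f"Field '{name}' is empty."
        try:
            values.append(float(text))
        except ValueError:
            return None, f"Field '{name}' must be a number."
    return values, None


# Example: a complete form, then one with a missing field.
complete_form = {name: "1.5" for name in FIELD_NAMES}
incomplete_form = dict(complete_form, feature_3="")

print(validate_inference_inputs(complete_form, model_is_fitted=True))
print(validate_inference_inputs(incomplete_form, model_is_fitted=True))
```

Returning the warning as a string, rather than raising an exception, keeps the function easy to connect to a label in the window.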
Overall, after running the training procedure ten times, the Pulsar classification model typically achieves 97% accuracy and an F1-score of 84% when using a test set equivalent to 20% of the size of the whole sample. Interestingly, when using 95% of the sample as test data, the performance barely degrades, with accuracy around 95% and an F1-score around 80%. However, the decision trees for these examples are simpler, as shown in Figure 7.
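A repeated-run comparison of this kind can be sketched as follows. The data is synthetic (generated with make_classification), so the printed numbers will not match the figures quoted above. The function name, the use of one seed per run and the use of leaf count as a proxy for tree simplicity are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 10% positives).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)


def average_scores(test_size, n_runs=10):
    """Train n_runs trees on different random splits and average the results."""
    accuracies, f1_scores, leaf_counts = [], [], []
    for seed in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        scaler = StandardScaler().fit(X_train)
        model = DecisionTreeClassifier(random_state=seed)
        model.fit(scaler.transform(X_train), y_train)
        y_pred = model.predict(scaler.transform(X_test))
        accuracies.append(accuracy_score(y_test, y_pred))
        f1_scores.append(f1_score(y_test, y_pred, zero_division=0))
        leaf_counts.append(model.get_n_leaves())
    return {
        "accuracy": float(np.mean(accuracies)),
        "f1": float(np.mean(f1_scores)),
        "leaves": float(np.mean(leaf_counts)),
    }


small_test = average_scores(test_size=0.2)
large_test = average_scores(test_size=0.95)
print("test_size=0.20:", small_test)
print("test_size=0.95:", large_test)
```

Training on only 5% of the samples leaves the tree with far fewer examples to split on, so it grows fewer leaves, which is consistent with the simpler trees described above.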
VIDEO OF OPERATION
Figure 1 (Blog Title Image): NASA, ESA, and T. Brown (STScI). (Unknown) All the Glittering Stars. [Photo Collage] Available at: https://www.nasa.gov/image-article/all-glittering-stars/
Figure 2: Kramer, M. (Unknown) Title Unknown. [GIF] Available at: https://www.atnf.csiro.au/outreach/education/everyone/pulsars/index.html
Figure 3: (Unknown) Title Unknown. [digital image] Available at: https://futurism.com/pulsars-what-are-they-why-do-they-spin-so-fast-2
Possenti, A. (Date Unknown) Andrea Possenti: “All pulsars are neutron stars, but not all neutron stars appear like pulsars”. Instituto de Astrofísica de Canarias (IAC). Available at: https://www.iac.es/en/outreach/news/andrea-possenti-all-pulsars-are-neutron-stars-not-all-neutron-stars-appear-pulsars
High Time Resolution Universe Survey 2 (2018) Pulsar Classification for Class Prediction. [dataset] Available at: https://www.kaggle.com/datasets/brsdincer/pulsar-classification-for-class-prediction
Miro (2019) [computer software] Available at: https://www.miro.com [Accessed 28 February 2022]
Adobe (2023) Photoshop: Version: 25.1. [computer software] Available through: https://rb.gy/qapfe [Accessed September 2023].
SmartDraw (1994) [website] Available at: https://www.smartdraw.com/uml-diagram/uml-diagram-tool.htm
McKinney, W. et al. (2010) Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference, pp. 51–56.
Pedregosa, F. et al. (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), pp. 2825–2830.
Harris, C.R. et al. (2020) Array programming with NumPy. Nature, 585, pp. 357–362.
Lundh, F. (1999) An introduction to tkinter. Available at: www.pythonware.com/library/tkinter/introduction/index.