
Exploring heartbeat arrhythmia - part 2

  • Writer: Vesna Lukic
  • Jan 13, 2023
  • 4 min read

Updated: Jan 27, 2023

kNN model for heartbeat classification


This post looks at using another simple model, k-nearest neighbours (kNN), as a classifier.


The kNN model is a non-parametric model that assigns a class to an example whose class is unknown, based on the classes of that example's nearest neighbours in the training set. The number of neighbours, k, is chosen by the user.

[Figure: the k-nearest neighbours algorithm explained visually. Attribution: TseKiChun, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons]
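
As a quick illustration, here is a minimal sketch of kNN classification with scikit-learn on toy 2D data (the data points and the value of k here are made up purely for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two small clusters with known classes
X_train = np.array([[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]])
y_train = np.array([0, 0, 1, 1])

# With k=3, a query point takes the majority class of its 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[0.8, 0.8]]))  # -> [1], the query sits near the class-1 cluster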

Recap from previous post


The previous post (heartbeat arrhythmia EDA) described a simple exploratory data analysis of different arrhythmia types. The analysis showed that there are already discernible features in the data, visible in the patterns of the raw heartbeat signals and in the means and standard deviations of the signals. A very simple model was developed by correlating the mean of the training set samples in each class with a set of samples in the test set.


The model achieved the highest number of correct predictions (i.e. highest correlations) for 4 of the 5 classes. However, the confusion matrix showed quite high numbers on the off-diagonals, signifying that there were also reasonably high correlations with the means of the other classes. Therefore, the model has room for improvement.


Data pre-processing


The data has already been normalised, since the values lie between 0 and 1, as discussed in the previous post. Prior to applying the kNN model to the dataset, the classes must be balanced by upsampling.


The largest number of samples occurred for the 'N' type (normal, class 0, 72,471 samples); smaller numbers were observed in the other sub-types.

[Figure: number of samples in each of the five classes]

In order to balance the classes, the four other sub-types were upsampled to match the 'N' type, giving 72,471 × 5 = 362,355 samples in total.
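
A sketch of how this upsampling could be done with scikit-learn's resample. It assumes the heartbeat data has been loaded into a pandas DataFrame df whose class column is named 'label'; both names are illustrative, not from the original post:

import pandas as pd
from sklearn.utils import resample

majority = df[df['label'] == 0]                 # the 'N' class, 72,471 samples
balanced = [majority]
for cls in [1, 2, 3, 4]:
    minority = df[df['label'] == cls]
    balanced.append(resample(minority,
                             replace=True,               # sample with replacement
                             n_samples=len(majority),    # match the 'N' class size
                             random_state=42))
df_balanced = pd.concat(balanced)               # 72,471 x 5 = 362,355 rows in total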


Application of kNN model


After upsampling, the data is shuffled and split into training (80%) and testing (20%) sets.
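
Continuing the sketch above, the shuffle and 80/20 split could look like this (train_test_split shuffles by default; the random seed is arbitrary):

from sklearn.model_selection import train_test_split

X = df_balanced.drop(columns='label').values    # the signal values
y = df_balanced['label'].values                 # the class labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)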


A grid search is applied to the first 40,000 shuffled samples, varying the number of neighbours (n_neighbors) between 2 and 7.


The optimal number of neighbours was found to be 2, giving an accuracy of 97.06% on the first 40,000 shuffled samples in the training set. (The more samples used, the longer the search takes.)
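
The grid search might look as follows; the 5-fold cross-validation setting is an assumption, as the post does not state it:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': [2, 3, 4, 5, 6, 7]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)  # cv=5 is assumed
grid.fit(X_train[:40000], y_train[:40000])      # only the first 40,000 samples

print(grid.best_params_)                        # the post reports n_neighbors=2
print(grid.best_score_)                         # ~0.97 on this subset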


Then, the kNN model is fit to the training data, using n_neighbors=2 and uniform weights (where points in each neighbourhood are weighted equally).


Applying the trained model to the mitbih test set results in a classification accuracy of over 97%.
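
A sketch of the final fit and evaluation, including the percentage confusion matrix shown in the figure below:

from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=2, weights='uniform')
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print(accuracy_score(y_test, y_pred))           # >0.97, per the post

# Confusion matrix normalised over the true labels, as row percentages
cm = confusion_matrix(y_test, y_pred, normalize='true') * 100
print(cm.round(2))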


Shown below is the confusion matrix (in terms of percentages) after applying the trained model to the test set.



[Figure: confusion matrix (percentages) on the test set]


Explainability


In order to see why the kNN model classifies specific examples as belonging to a particular class, it can help to look at the distance between the test point in question and its k closest neighbours from the training set.


Another way to probe explainability is to look at the signals of individual samples and see how they compare to the overall mean signal of the class.


Distance metric


Here we choose the Euclidean distance between the test sample and its closest neighbours in the training set.
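
In scikit-learn, the distances and indices of the closest neighbours are available through the kneighbors method; the default metric for KNeighborsClassifier is already Euclidean (Minkowski with p=2). The test-sample index below is arbitrary:

idx = 0                                          # illustrative test-sample index
distances, neighbour_idx = knn.kneighbors(X_test[idx:idx + 1], n_neighbors=2)

print(distances)                                 # Euclidean distances to the 2 neighbours
print(y_train[neighbour_idx[0]])                 # classes of those neighbours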


Correctly classified examples


Consider some examples in the test set with true class '0' ('normal' type heartbeat) that are also predicted to be in class '0'. The samples are shown in blue, and the two closest neighbours from the training set are shown in green and orange.


[Figure: correctly classified test examples (blue) with their two nearest training neighbours (green and orange)]

Incorrectly classified examples


Shown below are some incorrectly classified examples in the test set. In all three examples, the signal in question has a true class of '0' (normal type). However, the trained kNN model, with two nearest neighbours, has predicted the classes to be '4', '1' and '2' respectively.


The two closest neighbours from the training set are shown in green and orange. The orange line cannot be seen because it lies behind the green one. In all three examples, the green and orange signals are identical: because the minority classes have been upsampled, they contain duplicated signals.


It is interesting to note that in these instances, the kNN model found the samples in question to be closer to samples in the upsampled minority classes than to samples in the majority class '0' (normal type), even though the latter contains a richer variety of heartbeat signals.


[Figure: incorrectly classified test examples (blue) with their two nearest training neighbours (green and orange, overlapping)]

Individual samples


Another intuitive way to look at explainability is to plot the mean of each class (shown in colours) against the signal of the specific example (in black).
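
A sketch of such a plot with matplotlib, continuing with the arrays from the earlier split (the sample index is again arbitrary):

import matplotlib.pyplot as plt

example = X_test[0]                              # illustrative test sample
for cls in range(5):
    class_mean = X_train[y_train == cls].mean(axis=0)
    plt.plot(class_mean, label=f'class {cls} mean')
plt.plot(example, color='black', label='example')
plt.xlabel('time step')
plt.ylabel('normalised amplitude')
plt.legend()
plt.show()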


Correctly classified samples


The figure below shows some correctly classified examples, one each for classes 0, 1, 3 and 4. In these instances, the specific example showed the highest correlation with the mean signal of its own class.


[Figure: correctly classified examples (black) against the mean signal of each class]

Incorrectly classified sample


On the other hand, the figure below shows an incorrectly classified example in class 2.

[Figure: incorrectly classified class 2 example (black) against the mean signal of each class]

The table below shows the class 2 example's correlation with the mean of each class.

class 0 correlation: 0.83
class 1 correlation: 0.72
class 2 correlation: 0.77
class 3 correlation: 0.68
class 4 correlation: 0.83

The class 2 example has only a moderate correlation with the overall mean of class 2; there is a higher correlation with classes 0 and 4. Therefore, in this instance, the mean is not a very good representative of the signals neighbouring the example (kNN assigns class membership by majority vote among the k nearest neighbours, here k=2, not by comparison with class means).
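
For reference, correlations like those in the table can be computed with numpy's corrcoef; the exact values depend on which example and split are used:

import numpy as np

example = X_test[0]                              # illustrative test sample
for cls in range(5):
    class_mean = X_train[y_train == cls].mean(axis=0)
    r = np.corrcoef(example, class_mean)[0, 1]   # Pearson correlation coefficient
    print(f'class {cls} correlation: {r:.2f}')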


Summary


This post looked at applying the kNN model to the MITBIH dataset. Since the dataset was imbalanced in terms of the classes, upsampling of the minority classes was performed.


A grid search was performed to find the optimal number of neighbours based on a portion of the training set, which found n_neighbors=2.


The algorithm was trained on the training set using n_neighbors=2, and subsequently applied to the test set, giving a classification accuracy >97%.


Explainability of the kNN algorithm was addressed by looking at the distance metric, as well as by comparing the appearance of correctly and incorrectly classified signals with the mean of each class.


It is worth noting that the kNN model here was trained on a specific dataset with specific parameters, so it may not work as expected on other, similar datasets where the heartbeat signals were obtained from different centres and machines, and from a different cohort of people, for example. Hence, it may need adjustments.


Furthermore, kNN performance depends, for example, on the quality of the data, how the features are represented, the choice of distance metric and the number of nearest neighbours. In this post we only looked at varying the number of neighbours. Many more configurations could be explored, as well as evaluating the performance of the model using other metrics.


 
 
 
