
Exploring heartbeat arrhythmia - part 1

  • Writer: Vesna Lukic
  • Dec 9, 2022
  • 4 min read

Updated: Jan 6, 2023



This post describes an exploratory analysis of heartbeat arrhythmia data. It then builds a simple model based on the means of the signals in each class. Finally, it explores the performance of the kNN algorithm in classifying the signals.


Kaggle hosts a couple of datasets on heartbeats: the PTB Diagnostic ECG Database and the MIT-BIH Arrhythmia Dataset. The current post focuses on the latter.


The link to the Kaggle dataset is given below:


https://www.kaggle.com/datasets/shayanfazeli/heartbeat

There are over 87,000 samples in the training set and over 21,000 in the test set. Each sample is a heartbeat signal of length 187. There are 5 possible classes: Normal (N, Class 0), Supraventricular ectopic (S, Class 1), Ventricular ectopic (V, Class 2), Fusion (F, Class 3) and Unknown (Q, Class 4). From here on they will simply be referred to by their class numbers.
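
A minimal sketch of loading the data with pandas is given below. The file names mitbih_train.csv and mitbih_test.csv, and the label being in the last column, are assumptions based on how the dataset is distributed on Kaggle; the arrays defined here are reused in the later snippets.

import pandas as pd

# Load the MIT-BIH training and test sets (file names assumed from Kaggle).
train = pd.read_csv("mitbih_train.csv", header=None)
test = pd.read_csv("mitbih_test.csv", header=None)

# Columns 0-186 hold the signal; the last column holds the class label (0-4).
X_train = train.iloc[:, :-1].values
y_train = train.iloc[:, -1].astype(int).values
X_test = test.iloc[:, :-1].values
y_test = test.iloc[:, -1].astype(int).values

print(X_train.shape, X_test.shape)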


Exploratory data analysis


The following histogram shows how many samples there are in each class of the training set.
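
A short sketch of how such a plot could be produced with numpy and matplotlib, using the arrays loaded above:

import numpy as np
import matplotlib.pyplot as plt

# Count the training samples in each class and plot the counts.
classes, counts = np.unique(y_train, return_counts=True)
plt.bar(classes, counts, tick_label=[f"Class {c}" for c in classes])
plt.ylabel("Number of samples")
plt.title("Training-set class distribution")
plt.show()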


[Figure: histogram of training-set class counts]

The largest number of samples occurs in the 'N' class (normal type, Class 0, with 72,471 samples); the other sub-types have far fewer. To ensure class balance for future machine learning models, we should upsample the under-represented classes so they match the number of samples in the 'N' class.
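
One possible way to do this is with scikit-learn's resample, applied to the train DataFrame loaded above; sampling with replacement and the fixed random seed are assumptions in this sketch.

import pandas as pd
from sklearn.utils import resample

# Upsample every minority class to the size of the largest class.
label_col = train.columns[-1]
target_n = train[label_col].value_counts().max()

parts = []
for c, group in train.groupby(label_col):
    if len(group) < target_n:
        group = resample(group, replace=True, n_samples=target_n, random_state=0)
    parts.append(group)

train_balanced = pd.concat(parts).sample(frac=1, random_state=0)  # shuffle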


The heartbeat amplitudes have already been normalised to have a minimum of 0 and a maximum of 1, as shown in the histogram below.
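
This is easy to sanity-check on the arrays loaded above:

# If the data are normalised, the amplitudes should lie in [0, 1].
print(X_train.min(), X_train.max())  # expected: 0.0 1.0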


[Figure: histogram of normalised signal amplitudes]

Examples of signals in each class

[Figure: example signals from each class]

Some of the signals in each class appear to follow similar patterns, but there are also outliers. Next, we plot the mean of the signals in each class to see what a typical representative signal looks like.
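
A sketch of such a plot, averaging the training signals within each class:

import numpy as np
import matplotlib.pyplot as plt

# Plot the mean signal of each class as a representative beat.
for c in np.unique(y_train):
    plt.plot(X_train[y_train == c].mean(axis=0), label=f"Class {c}")
plt.xlabel("Time step")
plt.ylabel("Normalised amplitude")
plt.legend()
plt.show()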


Signal means and medians across each class


[Figure: mean and median signals for each class]

The appearance of the mean signal within each class confirms that there are general patterns in each class. It is interesting to note that some classes have visibly more peaks (classes 1, 3 and 4), whereas the other two classes are relatively flat. The flatness could be due to greater variability between the signals: when peaks and troughs occur at different time points, they are averaged out.


We also plot histograms to see the relative mean and standard deviation of the signals in each class. If the means differ significantly between the classes, it should be easier for an ML algorithm to classify them. If the standard deviation of the signals within a class is small, there is less variability, which also makes them easier to classify. If it is large, however, the greater variability could mean that some signals get classified as belonging to a different class.
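
A sketch of such a histogram for the per-signal means (the bin count and transparency are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

# Histogram of the mean of each signal, separated by class.
for c in np.unique(y_train):
    plt.hist(X_train[y_train == c].mean(axis=1), bins=50, alpha=0.5, label=f"Class {c}")
plt.xlabel("Mean signal amplitude")
plt.ylabel("Count")
plt.legend()
plt.show()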


Histogram of signals across each class

[Figure: histograms of signal means for each class]

Class 3 has the lowest mean of the mean signal, whereas class 4 has the highest.


Next, we plot the mean and standard deviation of the mean signal for each class. The more distinct the means are from each other, the easier the classes should be to distinguish, as long as the standard deviation is not too large.
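
These summary statistics can be computed directly from the arrays above:

import numpy as np

# Mean and standard deviation of the per-signal means, for each class.
for c in np.unique(y_train):
    m = X_train[y_train == c].mean(axis=1)  # one mean per signal
    print(f"Class {c}: mean of means = {m.mean():.3f}, std = {m.std():.3f}")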

[Figure: mean and standard deviation of the mean signal per class]

The means of the mean signal are quite similar for classes 2 and 4, whereas they are quite different for the other three classes. Classes 0 and 4 have the lowest and highest standard deviation respectively. Class 0 should therefore be the easiest to classify, given that it has a relatively distinct mean and the lowest standard deviation. Class 4 may be the most difficult, since its mean is close to that of class 2 and its standard deviation is the highest.


Naive classifier based on signal means


To explore how well the mean of each class performs as a simple naive classifier, we choose 100 test-set examples from each of the 5 classes, calculate their correlations with the mean signal of each class, and take the mean. Each row shows the mean correlations for one true class.
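
A sketch of this naive classifier (taking the first 100 test examples per class is an arbitrary choice; random sampling would work equally well):

import numpy as np

# Mean training signal of each class, used as a template.
class_ids = np.unique(y_train)
templates = np.vstack([X_train[y_train == c].mean(axis=0) for c in class_ids])

# Average Pearson correlation of 100 test examples of each true class
# with every class template; rows are true classes, columns are templates.
corr = np.zeros((len(class_ids), len(class_ids)))
for i, c in enumerate(class_ids):
    samples = X_test[y_test == c][:100]
    for j, template in enumerate(templates):
        corr[i, j] = np.mean([np.corrcoef(s, template)[0, 1] for s in samples])

print(np.round(corr, 2))
print("Predicted class per row:", class_ids[corr.argmax(axis=1)])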


[Figure: matrix of mean correlations between test examples and class means]

This simple model, which correlates test-set examples with the class means computed from the training set, appears to work reasonably well. Instances belonging to classes 0, 1, 2 and 3 are correctly predicted; however, class 4 is not.


However, the numbers on the off-diagonals (the correlations of the signals with the means of the other classes) are also quite high. An improvement to the current model should see a reduction in the off-diagonal entries.


Summary and future steps


This concludes the exploratory data analysis of the MIT-BIH dataset. The simple model that correlates individual samples with the class means appears to perform reasonably well as a first attempt. However, other machine learning (ML) methods should be explored.


Prior to applying established ML methods, the under-represented classes should be upsampled. The data already appear to be normalised, since the amplitudes lie between 0 and 1.


An interpretable ML model should be chosen, given that discernible features are already visible simply by looking at the mean of the examples in each class.


The current training set can be split into training and validation sets. The ML model should be trained and, once a suitably low validation error is achieved, evaluated on the test set using chosen performance metrics.
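
As a concrete example, a minimal sketch using the kNN classifier mentioned in the introduction (the 80/20 split and k=5 are arbitrary starting points):

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Hold out 20% of the training data for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr)
print(classification_report(y_val, knn.predict(X_val)))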


The model should also be explainable, so that we know which features are likely to give rise to a certain prediction.

 
 
 
