REAL-TIME PROCESSING OF MULTICHANNEL ECG SIGNALS USING GRAPHIC PROCESSING UNITS

A novel approach to real-time processing of tens to hundreds of measured ECG signals is proposed. For multichannel ECG signal processing we used the computing capabilities of current heterogeneous computing systems consisting of CPUs and GPUs. Specifically, we analyzed the potential of CUDA, a parallel hardware and software platform that supports general-purpose computation on GPUs. Three typical tasks were selected from the real-time ECG signal processing chain and distributed between the CPU and GPU according to their suitability and computational demands. The computationally less intensive task (data formatting) and the typically sequential task (data saving) were executed on the CPU, while the computationally more intensive task (data filtration) was executed on the GPU using thousands of CUDA threads running in parallel. Furthermore, parallel execution on the GPU was complemented by parallel execution between the CPU and GPU using asynchronous function calls. Special attention was paid to the parallelization of data filtration. A digital high-pass FIR filter for continuous parallel filtration of tens of measured ECG signals was designed. The filter was realized in the frequency domain using fast convolution and the overlap-save method. The CUDA platform enabled a 5.3-fold speedup of the application in comparison with its serial implementation and represents a promising alternative for data-parallel signal processing algorithms.

Keywords:
Real-time signal processing, multichannel ECG, heterogeneous computing systems, CUDA platform, general purpose computing on GPU, data-parallelism, parallel digital filtration, fast convolution

Authors: Peter Kaľavský; Milan Tyšler
Authors' affiliation: Institute of Measurement Science, Slovak Academy of Sciences, Bratislava, Slovak Republic
Published in: Lékař a technika - Clinician and Technology No. 2, 2012, 42, 27-30
Category: Conference YBERC 2012

Introduction

Real-time measurement and processing of multichannel electrocardiographic (ECG) signals is a prerequisite for body surface potential (BSP) mapping. BSP mapping is a non-invasive ECG method that enables more precise diagnosis of cardiac diseases based on detailed registration of surface cardiac potentials using a large number of sensing electrodes [1]. Forty years of experience with several mapping lead sets have shown that the information content of maps constructed from 24 to 240 leads is greater than that of the standard 12-lead ECG.

However, processing a large number of measured ECG signals with traditional single-threaded serial algorithms places increased performance demands on the computing system. As the number of processed signals grows, so do the demands of basic ECG signal processing: data formatting, digital filtration, computation of lead signals, display of ECG curves, and data saving (see Fig. 1). The high number of measured channels also makes it harder to implement advanced ECG signal processing methods (such as BSP mapping) in real time. One possible way to overcome these obstacles is to use new heterogeneous processor systems and programming models.

**Fig. 1: Block diagram showing basic program modules of BSP mapping system.**

Today, a combination of a multi-core central processing unit (CPU) and a many-core graphics processing unit (GPU) is used increasingly often for high-performance computation. A heterogeneous CPU-GPU system makes it possible to exploit the different capabilities of its individual processors: serial parts of algorithms can be processed efficiently on the CPU and parallel parts on the GPU. The significant speedup of many scientific applications is possible mainly thanks to the hardware and software architecture of modern GPUs, which also allows non-graphical, so-called general-purpose computation on GPUs (GPGPU) [2]. GPUs have evolved into processors with unprecedented floating-point performance and programmability. Today's GPUs greatly outpace CPUs in arithmetic throughput and memory bandwidth. The peak performance of the newest GPUs approaches 6 TFLOPS, whereas the newest desktop CPUs do not reach even 200 GFLOPS. GPUs are designed specifically to accelerate data-parallel problems. Because the processing of multichannel ECG signals exhibits considerable data parallelism, involving the GPU in the signal processing chain is a reasonable step.

The goal of the present study is to introduce and demonstrate a possible concept for speeding up real-time multichannel ECG signal processing using CUDA, one of the currently popular and heavily used heterogeneous CPU-GPU computing platforms. The speedup is achieved by parallelizing the computationally most expensive program modules of the ECG signal processing chain and then distributing the workload between the CPU and GPU. In this study, the ECG filtration module is parallelized and processed on the GPU.

Methods

CUDA platform

Compute unified device architecture (CUDA) is a parallel hardware and software platform supporting GPGPU on NVIDIA GPUs.

CUDA hardware is organized into an array of unified streaming multiprocessors (SMs), each consisting of many cores (the latest CUDA-based GPU, codenamed Kepler, consists of 16 SMs with 192 cores each). Important parts of the GPU are its various memory spaces, which differ in bandwidth and capacity. Managing the significant performance differences between the slow on-board memory (tens of GB/s) and the fast on-chip memories (hundreds to thousands of GB/s), as well as the data transfers between the CPU and GPU over the PCIe bus (8 GB/s), is the primary concern of a CUDA programmer [3].

CUDA software allows the programmer to call parallel functions, called kernels. When a kernel is invoked, program execution moves from the CPU to the GPU, where typically thousands of parallel threads are generated. These threads are organized into thread blocks and grids of thread blocks and may access data in multiple memory spaces during their execution. The hierarchy of CUDA threads maps onto the hierarchy of processors on the GPU: a GPU executes one or more kernel grids, an SM executes one or more thread blocks, and the CUDA cores and other execution units in the SM execute threads [2], [3]. For this mapping it is essential that thread blocks are mutually independent, so that they can be scheduled in any order across any number of SMs, which enables programmers to write code that scales with the number of cores.

Data parallelism and multichannel ECG

A data-parallel problem can be partitioned into coarse sub-problems that are solved in parallel by blocks of threads, and each sub-problem can be divided into finer pieces that are solved cooperatively in parallel by all threads within a block. A closer look at the program modules shown in Fig. 1 reveals one common characteristic. The input to each module is a signal matrix of size R x S, where the number of rows R corresponds to the number of ECG signals processed and the number of columns S corresponds to the number of samples in one ECG signal (see Fig. 2). Usually the same operations have to be performed on all R rows, and possibly also on all S columns. Multichannel ECG signal processing can therefore be regarded as a data-parallel task. The signal matrix can be partitioned into smaller independent parts that are mapped onto CUDA blocks, and the blocks can then be executed on the SMs.

**Fig. 2: Mapping of the signal matrix of different sizes to CUDA blocks.**
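
The index arithmetic behind such a mapping can be illustrated in plain C. The following is a minimal sketch, not the authors' code: it assumes a row-major signal matrix and a one-dimensional launch in which each thread handles one matrix element. The names `element_pos`, `blocks_needed` and `element_of_thread`, as well as the block size of 256 threads used in the note below, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Position of one matrix element: row = ECG channel, col = sample index. */
typedef struct {
    size_t row;
    size_t col;
} element_pos;

/* Number of thread blocks needed to cover `total` elements
   (ceiling division, so a partially filled last block is counted). */
size_t blocks_needed(size_t total, size_t threads_per_block)
{
    assert(threads_per_block > 0);
    return (total + threads_per_block - 1) / threads_per_block;
}

/* Map a (block, thread) pair to an element of a row-major R x S matrix.
   In a CUDA kernel the flat index would typically be computed as
   blockIdx.x * blockDim.x + threadIdx.x. */
element_pos element_of_thread(size_t block, size_t thread,
                              size_t threads_per_block, size_t S)
{
    assert(S > 0 && thread < threads_per_block);
    size_t flat = block * threads_per_block + thread;
    element_pos p = { flat / S, flat % S };
    return p;
}
```

Under these assumptions, a 67 x 4096 matrix with 256 threads per block needs `blocks_needed(67 * 4096, 256)` = 1072 blocks, and thread 0 of block 16 lands on the first sample of the second channel (row 1, column 0).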

Workload distribution between the CPU and GPU

To facilitate parallel execution between the CPU and GPU, some CUDA function calls are asynchronous: control is returned to the CPU thread before the GPU stream has completed the requested task [2], [3], as shown in Fig. 3b. Thus we can exploit both data parallelism inside the GPU and task parallelism between the CPU and GPU to further reduce the application runtime. This is a valuable feature, especially for real-time applications.

Experiments

To test the speed of processing multichannel ECG signals in real time, three program modules named formatting, saving and filtration were designed (see Fig. 3). All three modules were written in C and can be executed on the CPU. The parallelized version of the filtration module was written in CUDA C and can be executed on the GPU. The modules were intended specifically for the high-resolution multichannel ECG mapping system ProCardio-8, developed at the Institute of Measurement Science, SAS. To load the system at its maximum data throughput, 67 channels were measured at a sampling frequency of 2000 Hz with 22-bit sample resolution. The implementation of the individual modules is briefly described below.

**Fig. 3: Time diagrams of program modules: a) serial algorithm; b) parallel algorithm exploiting the data parallelism inside the GPU as well as the task parallelism between the CPU and GPU.**

Every 16 ms, the formatting module reads a block of 32 3-byte data values from the FT245R USB chip of the ProCardio-8, converts them into 16-byte samples (double-precision floating-point complex values consisting of interleaved real and imaginary components), and stores the samples in the signal matrix SM_A of size 67 x 32. Each row of SM_A represents an individual input sequence x(n) of length N = 32.
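
The conversion step can be sketched in C as follows. This is an illustration under stated assumptions, not the ProCardio-8 implementation: the text does not give the wire format, so the sketch assumes big-endian byte order, 24-bit two's complement values, and one channel's samples stored contiguously. The names `complex_sample`, `decode_raw24` and `format_row` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interleaved double-precision complex sample (16 bytes on common
   platforms): real part followed by imaginary part. */
typedef struct {
    double re;
    double im;
} complex_sample;

/* Decode one 3-byte raw value.
   ASSUMPTION: big-endian byte order and 24-bit two's complement; the
   actual device format is not specified in the text. */
int32_t decode_raw24(const uint8_t bytes[3])
{
    int32_t v = ((int32_t)bytes[0] << 16) |
                ((int32_t)bytes[1] << 8)  |
                 (int32_t)bytes[2];
    if (v & 0x800000)      /* sign bit of the 24-bit value is set */
        v -= 0x1000000;    /* sign-extend to 32 bits */
    return v;
}

/* Convert n raw 3-byte values of one channel into one row of the
   signal matrix; the imaginary part of a measured sample is zero. */
void format_row(const uint8_t *raw, size_t n, complex_sample *row)
{
    assert(raw != NULL && row != NULL);
    for (size_t i = 0; i < n; ++i) {
        row[i].re = (double)decode_raw24(raw + 3 * i);
        row[i].im = 0.0;
    }
}
```

Storing real input as complex samples doubles the memory footprint, but it matches the interleaved layout that complex-to-complex FFT routines expect.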

The saving module removes control bytes from the input stream and saves the blocks of 3-byte raw data to the disk.

The filtration module eliminates ECG baseline wander using a high-pass finite impulse response (FIR) filter with impulse response h(n) of length M = 4065. The filter cutoff frequency follows the recommendations for low-frequency noise reduction (f_-3dB = 0.67 Hz) [4]. The filtration is realized in the frequency domain using the convolution theorem of the discrete Fourier transform (DFT), which allows the output sequence y(n) of length L (L = N + M - 1) to be computed as

$$y(n) = \mathrm{IDFT}\{X(k)\,H(k)\}$$

where X(k) and H(k) are the DFTs of x(n) and h(n), respectively, n represents the time-domain index of the input samples and k the index of the DFT output in the frequency domain [5]. To ensure continuous ECG filtration we used the well-known "overlap-save" block filtering algorithm [5]. The resulting configuration of the signal matrix SM_B of size 67 x 4096, prepared for data filtration, is depicted in Fig. 4.

**Fig. 4: The configuration of samples in the signal matrix SM_B. The signal matrices from two consecutive runs of the measuring loop are depicted.**
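
The overlap-save bookkeeping for a single channel can be sketched in C. This is a minimal illustration, not the authors' implementation: to stay self-contained it replaces the FFT-based product with a direct circular convolution, which yields the same values as the inverse DFT of X(k)H(k) but is far slower. The function names `os_push_block`, `circular_convolve` and `os_step` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Shift the length-L buffer left by N and append N new input samples, so
   the buffer keeps the last M - 1 old samples plus the N new ones. */
void os_push_block(double *buf, size_t L, const double *fresh, size_t N)
{
    assert(N <= L);
    memmove(buf, buf + N, (L - N) * sizeof *buf);
    memcpy(buf + (L - N), fresh, N * sizeof *buf);
}

/* Length-L circular convolution of x with h (M taps, implicitly
   zero-padded to L). Time-domain equivalent of IDFT{X(k) H(k)}. */
void circular_convolve(const double *x, const double *h, size_t M,
                       double *y, size_t L)
{
    assert(M <= L);
    for (size_t n = 0; n < L; ++n) {
        double acc = 0.0;
        for (size_t m = 0; m < M; ++m)
            acc += h[m] * x[(n + L - m) % L];
        y[n] = acc;
    }
}

/* One overlap-save step: the first M - 1 outputs are corrupted by
   wrap-around and discarded; the last N are valid filtered samples. */
void os_step(double *buf, double *scratch, size_t L,
             const double *h, size_t M,
             const double *fresh, double *out, size_t N)
{
    assert(L == N + M - 1);
    os_push_block(buf, L, fresh, N);
    circular_convolve(buf, h, M, scratch, L);
    memcpy(out, scratch + (M - 1), N * sizeof *out);
}
```

With the values given in the text (N = 32, M = 4065, L = 4096), each run of the measuring loop would discard the first 4064 outputs of a row and keep the last 32, which line up with the 32 newly measured samples.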

Two versions of the filtration module were created. Both use optimized fast Fourier transform (FFT) libraries for DFT computation: the serial version uses the CPU-based FFTW library, while the parallelized version uses the GPU-based CUFFT library.

The parallelized filtration module processes the samples of SM_B in parallel. Using CUFFT, 67 forward FFTs and subsequently 67 inverse FFTs are computed in parallel. The element-wise multiplication X(k).H(k) is performed by a hand-coded kernel. When this kernel is invoked, 67 x 4096 parallel threads are generated and each thread computes one element of the output matrix.
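
A serial C analogue of the element-wise multiplication shows why it parallelizes so easily. This is a sketch under assumptions, not the authors' kernel: it assumes a row-major R x L spectrum matrix and a single filter spectrum H shared by all rows, and the names `cplx`, `cplx_mul` and `multiply_spectra` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Interleaved complex value: real part, then imaginary part. */
typedef struct {
    double re;
    double im;
} cplx;

/* Product of two complex numbers. */
cplx cplx_mul(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im,
               a.re * b.im + a.im * b.re };
    return r;
}

/* Multiply every row of the R x L spectrum matrix X by the filter
   spectrum H (length L) and store the result in Y.
   Every iteration is independent of the others, so in a CUDA kernel
   the loop body could run once per thread, with idx derived from the
   block and thread indices instead of a loop counter. */
void multiply_spectra(const cplx *X, const cplx *H, cplx *Y,
                      size_t R, size_t L)
{
    assert(L > 0);
    for (size_t idx = 0; idx < R * L; ++idx)
        Y[idx] = cplx_mul(X[idx], H[idx % L]);
}
```

For R = 67 and L = 4096 the loop has 274,432 iterations, which corresponds to the 67 x 4096 threads mentioned in the text.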

The parallelized filtration module was assigned to one CUDA stream. To support parallel execution of the filtration module running on the GPU and the saving module running on the CPU (see Fig. 3b), the partial tasks of the parallelized filtration module (the CPU-to-GPU transfer, the FFT computation, the element-wise multiplication X(k).H(k), the IFFT computation and the GPU-to-CPU transfer) were implemented using asynchronous functions.

The experiments were performed using one CPU core of an Intel Core i7-875K (4 cores, 2.93 GHz, 4 GB DDR3) and an NVIDIA GeForce GTX 480 GPU (480 cores, 1.4 GHz, 1536 MB GDDR5). The electrodes of the ProCardio-8 measuring unit sensed 67 simulated ECG signals from a signal generator.

We measured the total runtimes of the serial and parallel algorithms, t_s and t_p, the runtimes of the individual program modules Δt₁, Δt₂, Δt₃, Δt₄, and the runtimes needed to compute the FFT, the element-wise multiplication X(k).H(k) and the IFFT on the CPU as well as on the GPU (see Fig. 3). The results are shown in graphical and numerical form in Fig. 5. The measured values are averages of the corresponding runtimes over a 60-second interval (that is, over 3750 runs of the 16 ms measuring loop).

**Fig. 5: Runtimes of a) serial and parallel algorithm, b) individual program modules of the serial and parallel algorithms, c) individual tasks of the filtration program module.**

Results

The results depicted in Fig. 5 demonstrate the computing power of the GPU. By combining data parallelism inside the GPU with task parallelism between the CPU and GPU, the total runtime needed to process all three program modules was reduced from 13.58 ms to 2.56 ms (see Fig. 5a). We achieved a 5.3-fold speedup and saved 81% of the total runtime compared with the serial CPU version. As a consequence, an idle interval of 13.44 ms remains in the 16 ms measuring loop. This relatively long interval can be used to add further program modules to the ECG signal processing chain.

The largest contribution to the total runtime reduction came from moving the digital filtration (originally the computationally most intensive task) from the CPU to the GPU. Thanks to thousands of CUDA threads running in parallel, the total runtime needed for multichannel filtration decreased from 11.02 ms to 0.82 ms (see Fig. 5b), which represents a 13.4-fold speedup and a runtime saving of about 93% compared with the serial implementation.

Because parallel execution between the CPU and GPU was also used, and the runtime of the parallelized filtration module (0.82 ms) was shorter than that of the saving module (2.38 ms), the total runtime of the parallel algorithm (2.56 ms) is given solely by the sum of the formatting module runtime (0.18 ms) and the saving module runtime (2.38 ms).
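
The runtime figures above can be reproduced with a simple timing model, sketched here in C. The model is an assumption made for illustration: formatting runs first, then saving (CPU) and filtration (GPU) overlap perfectly, with no launch or synchronization overhead. The function names `serial_runtime` and `overlapped_runtime` are hypothetical.

```c
#include <assert.h>

/* Serial loop: the three modules run one after another on the CPU. */
double serial_runtime(double t_format, double t_save, double t_filter)
{
    return t_format + t_save + t_filter;
}

/* Overlapped loop: formatting runs first, then saving (CPU) and
   filtration (GPU) proceed concurrently, so the longer one dominates.
   ASSUMPTION: perfect overlap, no launch or synchronization overhead. */
double overlapped_runtime(double t_format, double t_save,
                          double t_filter_gpu)
{
    assert(t_format >= 0.0 && t_save >= 0.0 && t_filter_gpu >= 0.0);
    double longer = t_save > t_filter_gpu ? t_save : t_filter_gpu;
    return t_format + longer;
}
```

With the measured module runtimes, `serial_runtime(0.18, 2.38, 11.02)` gives 13.58 ms and `overlapped_runtime(0.18, 2.38, 0.82)` gives 2.56 ms, a ratio of about 5.3. The model also makes the limit visible: once filtration is shorter than saving, accelerating it further no longer shortens the loop.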

Discussion

Thanks to the parallelization of the filtration module and the distribution of the workload between the CPU and GPU, we sped up the real-time ECG signal processing. As in many other GPGPU applications, the overall speedup of the parallel algorithm was limited by the serial portion of the application (in our case mainly by the saving module). For the sake of completeness, it should be noted that the design and implementation of the high-order (4065) digital filter described in this paper is largely illustrative, because the delay of this filter is approximately 1 second and displaying ECG signals with such a long delay is questionable.
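
The delay of about 1 second can be checked from the filter length, assuming the filter has linear phase (a symmetric impulse response), which the text does not state explicitly. The group delay of such a filter is (M - 1)/2 samples, so with M = 4065 and a sampling frequency f_s = 2000 Hz:

$$\tau = \frac{M - 1}{2 f_s} = \frac{4065 - 1}{2 \cdot 2000\ \mathrm{Hz}} = 1.016\ \mathrm{s}$$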

Finally, we would like to point out that when developing massively multithreaded applications one should beware of thread divergence inside warps and of deadlocks among threads, and should ensure proper communication and synchronization among threads, especially when they access shared resources. It is also important to be fair when benchmarking the CPU against the GPU by optimizing the application for the highest performance on both.

Acknowledgement

The work has been supported by research grant No. 2/0210/10 from the VEGA Grant Agency and by grant No. APVV-0513-10 from the Slovak Research and Development Agency.

Ing. Peter Kaľavský

Department of Biomeasurements

Institute of Measurement Science

Slovak Academy of Sciences

Dúbravská cesta 9, 841 04 Bratislava

E-mail: peter.kalavsky@savba.sk

Phone: +421 259 104 551

References

[1] Tyšler, M. et al. Non-invasive Assessment of Local Myocardium Repolarization Changes using High Resolution Surface ECG Mapping. Physiological Research, 2007, vol. 56, suppl 1, S133-S141.

[2] Kirk, D. B., Hwu, W. W. Programming Massively Parallel Processors. Burlington: Morgan Kaufmann, 2010. 251 p.

[3] Farber, R. CUDA Application Design and Development. Waltham: Morgan Kaufmann, 2011. 311 p.

[4] Kligfield, P. et al. Recommendations for the Standardization and Interpretation of the Electrocardiogram. In Journal of the American College of Cardiology, 2007, vol. 49, no. 10, p. 1109-1127.

[5] Madisetti, V. K. The Digital Signal Processing Handbook: Digital Signal Processing Fundamentals. Boca Raton: CRC Press, 2010. 904 p.