Author manuscript; available in PMC: 2025 Aug 1.
Published in final edited form as: IEEE Trans Med Robot Bionics. 2024 May 31;6(3):1004–1016. doi: 10.1109/tmrb.2024.3407590

Machine-Learning-Based Multi-Modal Force Estimation for Steerable Ablation Catheters

E Arefinia 1, J Jayender 2, R V Patel 3
PMCID: PMC11392016  NIHMSID: NIHMS2015744  PMID: 39280352

Abstract

Catheter-based cardiac ablation is a minimally invasive procedure for treating atrial fibrillation (AF). Electrophysiologists perform the procedure under image guidance, during which the contact force between the heart tissue and the catheter tip determines the quality of the lesions created. This paper describes a novel multi-modal contact force estimator based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The estimator takes the shape and optical flow of the deflectable distal section as two modalities, since frames and the motion between frames complement each other in capturing the long-range context in the video frames of the catheter. The angle between the tissue and the catheter tip is treated as a complement to the extracted shape. The data acquisition platform measures the two-degree-of-freedom contact force and video data while the catheter motion is constrained to the imaging plane. The images are captured via a camera that simulates single-view fluoroscopy for experimental purposes. In this sensor-free approach, the features of the image and optical flow modalities are extracted through transfer learning. Long Short-Term Memory networks (LSTMs) with a Memory Fusion Network (MFN) are implemented to account for time dependency and hysteresis due to friction. The architecture integrates spatial and temporal networks. Late fusion with concatenation of LSTMs, transformer decoders, and Gated Recurrent Units (GRUs) is implemented to verify the feasibility of the proposed network-based approach and its superiority over single-modality networks. Data were collected under more realistic conditions than in previous studies, yielding a mean absolute error of only 2.84% of the total force magnitude. The reduction in error is considerably better than that achieved by individual modalities and by late fusion with concatenation.
These results emphasize the practicality and relevance of utilizing a multimodal network in real-world scenarios.

Keywords: Steerable Ablation Catheters, Cardiac Ablation, Multi-Modal Force Estimation, Shape Extraction, Optical Flow Estimation, CNNs, RNNs, LSTMs

I. Introduction

Atrial fibrillation is an irregular heart rhythm that is often caused by ectopic beats originating around the pulmonary veins [1]. When medication cannot restore the heart's normal rhythm, cardiac ablation is the preferred alternative therapy [2]. Conventional cardiac ablation catheters (CACs) are used to ablate abnormal circuits within the heart. However, CACs lack appropriate force feedback, which is critical to ensure the creation of transmural, necrotic lesions. Conventional imaging modalities such as X-ray fluoroscopy and Intracardiac Echocardiography (ICE) are utilized to guide the catheter within the heart.

A. Motivation

Several studies have shown that the contact force between the catheter tip and the target tissue in the heart significantly impacts the procedure's success [3]-[7]. All phases of the ablation procedure, including catheter insertion into the femoral vein, guiding the catheter to the target [8], ablation, and retrieval, can benefit significantly from employing force sensors.

As a result, several force sensors have been developed for integration with ablation catheters. Two commercially available CACs are capable of measuring tip-contact forces [37]. One such catheter is the TactiCath Quartz ablation catheter (Abbott Cardiovascular, Plymouth, MN). The catheter determines the magnitude and angle of the contact force using Fiber Bragg Grating (FBG) sensors. Yokoyama et al. [9] report that, although the catheter's size and operating range are within the optimal specifications, hysteresis and sensor resolution still need to be improved. Another popular ablation catheter is the ThermoCool catheter (Biosense Webster, Diamond Bar, CA), which detects contact forces using a magnetic transmitter. The catheter has to be calibrated in a blood pool for at least 15 minutes to improve its accuracy [10]. Numerous studies have demonstrated how these force sensors and catheters contribute to the procedure's increased effectiveness and decreased risk of complications. However, the use of such CACs is still restricted by the need for specific navigation and user-interface technologies. The force-sensing component also increases the cost of the catheters, as they are single-use. During catheterization, the ability to precisely maneuver the catheter is crucial for reaching the ablation site, and the addition of a sensor to the tip increases the weight, which impacts the maneuverability of the catheter's distal end [38].

B. Background

Force estimation technologies have shifted significantly toward sensorless force estimation, which has gained particular attention in relation to traditional unsensorized CACs [11]. The sensorless estimators mainly examine the deflection of the distal part via image processing to estimate the contact forces, because this is easier, less constrained, and more adaptable than integrating proprioceptive sensors. The two general approaches for vision-based force estimation are model-based and curvature-based [12]. The focus of this work is on curvature-based estimators, as the model-based estimators are generally slow due to the challenging optimization process and model integration. In contrast, curvature-based methods do not require iterative optimization and can be integrated with any model.

In the realm of sensorless curvature-based force estimation, the available methods can generally be categorized into model-based and learning-based approaches. In the first category, the shape features of the distal part are extracted and applied to obtain the mechanical characteristics via the inverse of the Cosserat rod model [13], the beam model [14], and the inverse of the pseudo-rigid-body 3R model [19], [16]. An accurate model that can estimate static and dynamic tip forces is hard to obtain due to hysteresis [7], dead zone, joint coupling (in the case of manipulation with more than one tendon) [17], tendon stretch, and inertial, elastic, actuation, friction, and gravitational effects [18]. The dynamic and static governing equations of unidirectional CACs are complex and highly nonlinear, which requires significant computational effort and modeling. Alternatively, machine learning methods can capture the nonlinearities and modeling inaccuracies. Hence, learning-based estimators attempt to model the relationship between the shape of the distal part and the corresponding tip forces. In [19], an image-based algorithm is proposed to measure the curvature of a specific part of the catheter. The curvatures in free motion and during contact are applied to define an index that specifies the tip force's class (insufficient, sufficient, dangerous). Nourani [2] implemented a ResNet architecture to estimate 2-D contact forces from the shape. Fekri et al. [11] proposed a two-stream CNN-based network to approximate 3-D contact forces from the shape of the catheter.

C. Contributions

Our research primarily focuses on the ablation phase, where the electrophysiologist applies a force to cardiac tissue. The clinical procedure uses established imaging technologies such as X-ray fluoroscopy and ICE to visualize the heart’s interior and capture images of the catheter. These techniques effectively compensate for the absence of a camera by offering detailed images of the catheter and its environment.

Acknowledging the current research gaps: in [11], the dynamic force generated by pulling and releasing the tendon was overlooked, and only the tip force generated by environmental motion was considered. Moreover, in [6] and [7], static and dynamic friction between the tissue and the catheter's tip are disregarded, and the models developed thus far do not include them. To resolve the above-mentioned issues, we consider sinusoidal tendon actuation that mimics the electrophysiologist's hand motion to generate dynamic tendon force, and analyze the resulting force in conjunction with the static environmental force at the tip. During data collection, the tip can slide on the tissue phantom in the imaging plane. The angle between the tissue and the tip is considered a complement of the extracted shape, which is not included in any of the work mentioned above. Because frames and motion are complementary components for obtaining the spatio-temporal context in video frames of the catheter, we consider optical flow and shape as two separate modalities. To account for the hysteresis due to friction and the time dependency of the video stream captured by the camera, a multi-modal network for spatio-temporal data based on CNNs and LSTMs with an MFN [20] has been developed. The precision of the fusion network is essential in the force estimator to prevent clinical consequences such as ineffective ablation and complications such as cardiac tamponade, perforation, or esophageal ulceration. Better precision can be provided by considering modality-specific and cross-modality interactions to take better advantage of the modality knowledge. This improvement is achievable with the MFN, which considers both types of interactions.

To date, a spatio-temporal dataset including the extracted shape, angle, optical flow, and tip forces has not been collected for ablation catheters. This work is the first to implement a CNN-RNN-based network for force estimation of CACs. The MFN is implemented to fuse the system of LSTMs. To our knowledge, the MFN has not previously been implemented using features extracted with fine-tuned CNNs. This fusion method enables the network to use spatio-temporal data to estimate the tip force. The network can directly map the shape and optical flow to the corresponding forces in the normal and tangential directions. However, like any deep learning-based algorithm, the network is data-hungry, and additional data might improve performance and generalization. To address this issue, we fine-tuned two ResNet 101 models pre-trained on ImageNet [21] and UCF 101 [22]. ImageNet is the most extensively used image dataset, and UCF 101 is a dataset of practical action videos. This paper studies the feasibility of fine-tuning pre-trained networks for estimating the force on ablation catheters. The superiority of the developed multi-modal network over late fusion using concatenation for LSTMs, GRUs, and transformer encoders, and over single modalities, is also investigated.

II. Problem Description

The focus of the current work is on unidirectional CACs. In this category, one pull-wire connects the distal tip of the catheter to the proximal handle. The linear displacement of the knob generates bending in the distal deflectable section, which is more flexible than the rest of the catheter's body. The tendon passes through the catheter body inside a spring-like tube attached to a thin steel plate. The tendon must be under tension, and the plate must be pulled, to actuate the distal section; the tension is thus an axial force acting on the plate. This force leads to buckling and bending of the plate in a plane orthogonal to the plate surface [23].

An essential requirement for the ablation catheter is that the electrophysiologist can apply sufficient force on the heart wall during ablation to create transmural, necrotic lesions. Therefore, precise estimation of the applied force is important. Also, the friction between the heart tissue and the catheter tip generates a hysteresis effect, and including this hysteresis in the model increases the accuracy of the estimator. In order to make the experimental setup more realistic, the static environmental force is also included. Therefore, in this section, the characteristics of the applied force are studied, and a machine learning-based force estimator is investigated which fuses shape and optical flow to accurately estimate the contact force by considering the contact geometry.

In part A of this section, the main characteristics and formulation of the analytical models describing the force are summarized. The disadvantages of applying analytical models are explained, and then a new machine learning-based estimator is proposed. Also, the contact geometry is considered in part B.

A. Analytical Modeling

The applied force is the sum of the externally distributed load and the distributed loads due to tendon tension. The externally distributed loading forces include the effects of elasticity, inertia, and gravity. The shape and material properties determine the source of the elastic loading. The acceleration of the local position of the plate, the angular velocity, and its time derivative define the inertial force and moment, which can be calculated from the optical flow, the shape, and the last two image frames. According to [24], the shape can be used to calculate the static force produced by the tendon tension. The derivative of the path generated by the tendon actuation, the linear and angular strains, and the tendon path as a function of arc length define the dynamic forces generated by the tendon tension. Furthermore, the dynamic equations are partial differential equations with respect to arc length and time. Therefore, the applied force can be calculated from the optical flow, the shape, and their historical values. In unidirectional catheters, the dynamic and static governing equations are complex and highly nonlinear, which requires significant computational effort for modeling. However, the shape, angle, and optical flow, together with their time dependency, can provide the required modeling information. Furthermore, there are other uncertainties, such as slip, that are difficult to capture with traditional modeling approaches. Given the complexity of such modeling and the limited accuracy of model-based force estimation, statics- and dynamics-based force estimation can instead be addressed using machine-learning-based multi-modal approaches.

B. Contact Geometry

The contact geometry that we consider is point contact with friction, so all of the external moment components are zero. The interaction force is described as lying in a plane defined by the normal axis (Z) and the tangent axis (X) to the phantom tissue surface. Heart tissue displays anisotropic characteristics, and in our methodology we represent the tissue surrounding the pulmonary veins with a silicone material whose stiffness is comparable to that of heart tissue. Moreover, to replicate the angle observed near the pulmonary veins, we utilized an adapter capable of adjusting the angle within a range from −10° to −37° relative to the horizontal axis. The friction between the catheter's tip and the tissue phantom causes hysteresis.

III. A Multi-modal CNN-LSTM Network

In this section, a deep learning-based force estimator is proposed, as shown in Figure 1, to estimate the catheter's dynamic and static tip forces along the X and Z directions. The deep learning model receives the extracted shape, tip angle, and optical flow simultaneously. Precise force estimation is critical for ensuring optimal lesion creation while avoiding complications, and this is achieved through the fusion of the two streams with the MFN [20]. Fine-tuned models are implemented to extract features of the shape and optical flow. The catheter-tissue phantom angle is estimated from the images using a vision-based algorithm and concatenated with the extracted features of the shape. Since this is a machine learning network, datasets encompassing the distal section's shape, optical flow, angle, and the associated force along the X and Z directions are prerequisites for training the model.

Fig. 1.

The overall structure of the proposed force estimator. The tendon is actuated to apply a force aimed at ablating the source of arrhythmia, which is simulated using a tissue phantom designed to replicate the texture of the pulmonary vein. An imaging device captures the catheter’s image, from which we estimate its shape, optical flow, and tip angle. A force sensor records the tip force to create essential ground truth data for training the multimodal network.

A. Experimental Setup

A 2-D experimental setup was designed to collect training and testing data (see Figure 2). The catheter is a 6 Fr CAC (Biosense Webster, Diamond Bar, CA), which can bend in one direction via a tendon. As shown in Figures 2(a) and (c), the ablation catheter has a flexible distal section and body. The distal section shape can be controlled by a knob that has prismatic motion. However, in our experiment, the knob is removed, and the tendon is actuated using a motorized linear stage (T-LSR300B, Zaber Technologies Inc., Vancouver, BC, Canada). The catheter handle is fixed to the sliding part of the linear stage by an adaptor piece. The tendon actuation mainly changes the shape of the distal section; the flexion of the catheter body is not affected. In the clinical procedure, the electrophysiologist steers the ablation catheter through the femoral vein to reach the ablation target inside the left atrium. The catheter is then maneuvered to reach the desired locations within the left atrium to create the lesions. Mimicking the standard-of-care procedure, we passed the catheter through a sheath replicating the catheter shape from the femoral vein to the heart. A camera (Dragonfly R, Point Grey Research Inc.) captures video of the distal section during tendon actuation. The resolution of the camera images is 640×480 pixels. The tendon actuation bends the catheter in a plane parallel to the imaging plane. A force sensor (Nano 17, ATI Industrial Automation) is attached to an adaptor to measure the tip force. The adaptor can generate different contact angles. The position of the force sensor is controlled by a linear stage (T-LSMT050A, Zaber Technologies Inc.) to include the static environmental force. During data collection, the sampling periods of the camera and the force sensor are 0.1 s and 0.01 s (i.e., 10 Hz and 100 Hz), respectively. We synchronized the data via time-stamping and recorded the data in real time using the QUARC software (Quanser Inc.).
The training process of the networks was implemented using a computer with a 3 GHz i7 processor and 32 GB of RAM, running Ubuntu 20.04.4, and two graphics processors (GeForce RTX2080 SUPER, NVIDIA).
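As a concrete illustration of the time-stamp synchronization, the 100 Hz force stream can be resampled at the 10 Hz camera timestamps by interpolation. The sketch below uses synthetic signals and hypothetical variable names, not the actual recorded data:

```python
import numpy as np

# Hypothetical sketch: align a 0.01 s period force stream with 0.1 s
# period camera frames by interpolating at the camera timestamps.
t_force = np.arange(0.0, 5.0, 0.01)   # force sensor timestamps (100 Hz)
t_cam = np.arange(0.0, 5.0, 0.1)      # camera timestamps (10 Hz)
force_z = -0.1 + 0.05 * np.sin(2 * np.pi * 0.2 * t_force)  # synthetic Fz

# Linear interpolation yields one force sample per captured frame.
force_at_frames = np.interp(t_cam, t_force, force_z)
assert force_at_frames.shape == t_cam.shape
```

In practice any interpolation scheme works here, since the force signal is sampled ten times faster than the video.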

Fig. 2.

The experimental setup showing (a) one ablation catheter; (b) the actuation mechanism with a linear stage to actuate the catheter handle; (c) a side-view camera that captures the image of the distal section, a force sensor attached to the linear stage, and an adapter to adjust the position and orientation of the force sensor; and (d) the axes X and Z relative to the catheter tip.

B. Data Collection

The maximum handle displacement is 22 mm; however, we set it to 12.5 mm to prevent tendon damage. This range of handle displacement can generate a minimum force of 0.25 N in the negative Z direction. Sinusoidal commands with a random speed amplitude between 0.14 mm/s and 20 mm/s were used to actuate the tendon to simulate the hand motion of the electrophysiologist. These values compensate for the dead zone of the linear stage and satisfy its speed limit. The linear stage actuation frequency can be varied from 0.02 Hz to 0.25 Hz to remain within the speed limits. Frequencies of 0.02, 0.05, 0.1, 0.15, 0.2, and 0.25 Hz were selected to actuate the tendon. To generate different variations in the angle between the tip of the catheter and the tissue, the angle of the artificial silicone tissue was varied between −10° and −37° relative to the horizontal axis. An artificial heart tissue phantom was constructed using Ecoflex™ 00-30 (Smooth-On, USA), whose Shore hardness is approximately equal to the stiffness of heart tissue around the pulmonary vein.

The data samples are collected using the following procedure. The force sensor adapter is positioned at an angle ranging from −10° to −37° relative to the horizontal axis, which is set manually in our setup, allowing for increased variation in the tip angle (Figure 2(d)). The prismatic knob on the proximal handle follows a sinusoidal path with a randomly chosen magnitude between 0 and 12.5 mm. At the same time, a linear stage moves the force sensor upwards to establish contact with the catheter's tip, with its movement magnitude set randomly. When the net force exceeds 0.03 N, contact is detected and further upward movement of the second linear stage is halted. The camera captures the shape of the distal section while force data are recorded, completing one collection sequence. For each new sequence, the configuration of the force sensor adapter and the movements of the linear stages are randomly altered before repeating this process. This experimental setup imitates an actual ablation procedure within the human heart with the following characteristics: the catheter tip is situated near the myocardium; the clinician manipulates the distal shaft; and the position of the second linear stage carrying the force sensor is randomized, emulating the unpredictable heart position during point-of-contact initiation. It is important to note that post-contact heart movement is not accounted for here but has been identified as a subject requiring deeper analysis in future investigations. The target ablation contacts involve various actions such as inserting, withdrawing, and rotating catheters, along with flexing their distal shafts. The experiment involves a non-fixed catheter body that generates insert/withdraw movements within the sheath and flexing through random linear stage movements. This setup allows for force measurement at the catheter tip without altering the characteristics of the distal shaft.

Figure 3 shows the tip force variations through the joint distributions of tip forces during the experiments. Plots (a) and (c) show the probability density functions in the X and Z directions, and the scatter plot in (b) shows the relationship between the forces in each direction. The scatter plot shows the forces generated by pulling and releasing the catheter tendon. The experimental setup and data collection methodology mentioned above were used to collect a dataset of 3848 samples. The initial test set comprised the four longest and shortest sequences, representing 15% of the total dataset. The remaining 85% was used for training, involving a random selection of actuation magnitudes and sequence lengths with varying frequencies. For both the training and test datasets, the adapter angle was varied manually to introduce greater variation in the contact point. Each sample includes the extracted shape, the optical flow of the distal section, and the angle. The ground truth data consists of the corresponding forces in the X and Z directions; Table I presents their statistics: the mean, standard deviation (STD), minimum, and maximum values in both the X and Z directions.

Fig. 3.

The joint distribution of the forces collected in the X and Z directions. Plots (a) and (c) show the kernel distributions for the probability density functions; plot (b) shows the pairwise relationship.

TABLE I.

Statistics of the ground truth data

Direction Mean(N) STD(N) Min(N) Max(N)
X 0.0203 0.0676 −0.1918 0.1662
Z −0.1094 0.0468 −0.2488 0.0076

Comment 1: We take a broad perspective on the term “slip,” which traditionally refers to the relative movement between two surfaces under stress. We distinguish between two specific definitions of slip in the context of material interactions. First, slip can refer to material deformation, specifically the internal behavior of a material such as Ecoflex 00-30 under shear stress. This involves layers within the material sliding over each other as a result of applied forces, particularly shear forces, leading to deformation or movement when these forces exceed the material’s resistance. The resistance is influenced by the material’s elastic and viscoelastic properties. Second, slip can pertain to surface interaction; in our case, it includes the motion of a metal catheter tip across the surface of Ecoflex 00-30. This showcases slip occurring at an interface between materials with contrasting properties: rigid metal and highly elastic silicone rubber. Our experimental setup explores the latter definition of slip, examining its practical manifestation by analyzing the frictional interactions and mechanical behavior at the interface between two distinct materials during catheter tip movements over Ecoflex 00-30.

C. Shape Extraction

We consider the environment around the catheter as the background and the catheter as the foreground. In practice, both the catheter and the heart are moving, so the shape extraction algorithm should be able to adapt to a moving background and foreground. The ViBe algorithm [25] satisfies this requirement. Each background pixel is modeled by a set of background samples, and the current pixel value is compared to this collection within its proximity, defined as a sphere of radius R centered on the pixel value. To classify the pixel at the center of the sphere as background, the cardinality of the intersection of the model samples and the proximity sphere must be greater than or equal to a given threshold. The spatial vicinity of each pixel populates the pixel models. Background samples are gathered from the past, and the lifespan of the samples decays smoothly and exponentially. The advantage of this updating policy is that it handles concurrent events progressing at different speeds using a single model of reasonable size for each pixel.

The CACs have special metal electrodes in the distal section that are designed to be clearly visible in RGB and normal or low-dose X-ray images. The largest electrode is located at the tip, and multiple equal-size electrodes are positioned proximal to it. This feature is exploited to extract the tip angle of the catheter in the images. For this purpose, the RGB images are converted to grayscale, and a threshold of 0.6 is selected to convert the grayscale images to their binary masks. All pixels with luminance higher than 0.6 are replaced with white, and the rest with black. Then the structures that are brighter than their surroundings and connected to the image edge are eliminated to clear the image border. A rectangular structuring element is utilized to probe the rectangular shapes in the mask images, as the projection of the tip electrode is a rectangle in the imaging plane with a length and width of 4 mm and 2 mm, respectively. The camera is calibrated, and the rectangle is constructed using structuring element decomposition [27]. Morphological closing is applied to fill small holes while preserving the shape and size of the largest electrode. The structuring element decomposition technique is implemented to construct the smaller rectangles so as to make dilation faster, as dilation by a sequence of smaller rectangles can be computed faster than dilation by one large rectangle. The ellipse having the same second moments as the rectangle is constructed, and the angle between the horizontal axis and the ellipse’s major axis is taken as the tip angle. The angle of the tissue phantom is adjusted using an adapter, and the final angle is the difference between the two angles.
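A minimal sketch of the second-moment angle computation, assuming a clean binary mask of the tip electrode (the synthetic mask and helper name below are illustrative, not the authors' implementation):

```python
import numpy as np

# Estimate the tip electrode's orientation from a binary mask as the
# major-axis angle of the ellipse with the same second moments.
def mask_orientation_deg(mask):
    ys, xs = np.nonzero(mask)
    x = xs - xs.mean()
    y = ys - ys.mean()
    mu20, mu02, mu11 = (x * x).mean(), (y * y).mean(), (x * y).mean()
    # Principal-axis angle relative to the horizontal axis.
    return np.degrees(0.5 * np.arctan2(2 * mu11, mu20 - mu02))

# Synthetic "tip electrode": a rectangle rotated by ~30 degrees.
h, w = 200, 200
yy, xx = np.mgrid[0:h, 0:w]
theta = np.radians(30)
u = (xx - 100) * np.cos(theta) + (yy - 100) * np.sin(theta)
v = -(xx - 100) * np.sin(theta) + (yy - 100) * np.cos(theta)
mask = (np.abs(u) < 40) & (np.abs(v) < 20)

angle = mask_orientation_deg(mask)   # close to 30 degrees
```

Note that on real images the y axis points downward, so the sign convention of the recovered angle must be matched to the chosen coordinate frame.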

The parameters of ViBe are the radius of the sphere R, the time subsampling factor ϕ (by which the time windows covering the background pixel models are extended), the number of samples kept in each pixel model, and the cardinality threshold. The radius, ϕ, the number of samples, and the threshold are set to 20, 16, 30, and 2, respectively. The extracted catheter shape is overlaid on the original image, as shown in Figure 4, to demonstrate the performance of ViBe.
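A simplified grayscale ViBe step with the parameters above can be sketched as follows; the neighbor-diffusion update of the full algorithm is omitted for brevity, and all data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# ViBe parameters from the text: radius R=20, N=30 samples per pixel,
# match threshold 2, time subsampling factor phi=16.
R, N, MIN_MATCHES, PHI = 20, 30, 2, 16

def init_model(frame):
    # Populate each pixel's model with noisy copies of the first frame
    # (a stand-in for sampling the spatial vicinity).
    noise = rng.integers(-10, 10, size=(N, *frame.shape))
    return np.clip(frame[None].astype(int) + noise, 0, 255)

def segment_and_update(model, frame):
    matches = (np.abs(model - frame[None].astype(int)) < R).sum(0)
    background = matches >= MIN_MATCHES
    # Conservative update: each background pixel refreshes one random
    # sample of its model with probability 1/PHI.
    update = background & (rng.random(frame.shape) < 1.0 / PHI)
    idx = rng.integers(0, N, size=frame.shape)
    ys, xs = np.nonzero(update)
    model[idx[ys, xs], ys, xs] = frame[ys, xs]
    return ~background  # foreground mask (the catheter)

frame0 = np.full((48, 64), 120, dtype=np.uint8)
model = init_model(frame0)
frame1 = frame0.copy()
frame1[10:20, 10:30] = 250          # a bright "catheter" appears
fg = segment_and_update(model, frame1)
```

The random-replacement update is what gives the sample lifespans their smooth, exponentially decaying profile mentioned above.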

Fig. 4.

The Performance of Shape Extraction.

Comment 2: The shape of the catheter is extracted by the ViBe algorithm, which is known for its versatility, as it can operate independently of the color space used. Each background pixel is modeled with a set of samples, and the current pixel value is compared to these samples within its immediate vicinity in the selected color space. This comparison occurs within a sphere of radius R in the Euclidean color space. Importantly, for a pixel value to be recognized as part of the background, it only needs to be close to some of the sample values rather than to the majority, which enhances the robustness of the ViBe method. The inclusion of a pixel in the background model depends on whether there are enough matches between the model samples and points within that sphere; if the count exceeds a predefined threshold, the pixel is included. Since only a few matching samples need to be identified, the segmentation process can stop as soon as enough matches meet this threshold requirement. The effectiveness of ViBe hinges largely on just two parameters: 1) the sphere’s radius R and 2) the minimum number of matches required. Importantly, these do not require adjustment during background subtraction, nor modification for different pixel positions across an image.

D. Optical Flow Estimation

The context in the video frames is spatio-temporal, so the temporal information is captured by obtaining the dissimilarity between optical flow displacements in video frames. Optical flow is calculated by the TV-L1 algorithm [26]. The formulation of the algorithm is based on total variation regularization and the robust L1 norm of the data fidelity term. High variations in the 2-D displacement field are penalized by a regularization term. The data fidelity term is the optical flow constraint, which presumes that the intensity of the two images does not change during the motion. The formulation preserves a discontinuous flow field and enhances robustness against noise, occlusions, and illumination changes, which are the main challenges of knowledge-driven optical flow. It provides real-time performance using an efficient numerical scheme based on a dual formulation of the total energy variation and a point-wise thresholding step.

To obtain real-time performance of the TV-L1 algorithm for this application, the time step of the numerical scheme (τ), the number of scales used to generate the image pyramid (nscale), the number of warps for each scale (nwarps), which determines the stability of the algorithm, and the stopping criterion of the numerical scheme (threshold) are set as shown in Table II. Without ground truth, an alternative evaluation approach is to apply warping [28], which is used in this work. The first image is re-sampled with the estimated flow to resemble the second image, and then a similarity metric is used to compare the warped image and the second image. The maximum pixel value is 253, and the mean-squared error [29] and structural similarity index [30] are 28.66 and 0.946, respectively.

TABLE II.

Parameters for TV-L1 Algorithm

Parameter τ nscale nwarps threshold
Value 0.125 15 9 0.04
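The warping-based evaluation can be illustrated as follows; for brevity, a constant synthetic flow stands in for the TV-L1 output, and nearest-neighbor re-sampling plus MSE replace the full similarity computation:

```python
import numpy as np

def warp_nearest(img, flow):
    # Backward warping: sample the first image at p - u(p), so the
    # result should resemble the second image when the flow is accurate.
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    xs = np.clip(np.rint(xx - flow[..., 0]), 0, w - 1).astype(int)
    ys = np.clip(np.rint(yy - flow[..., 1]), 0, h - 1).astype(int)
    return img[ys, xs]

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Synthetic pair: img2 is img1 shifted 3 px to the right, so the
# "estimated" flow is a constant (+3, 0) field.
img1 = np.zeros((32, 32)); img1[8:24, 8:24] = 200.0
img2 = np.roll(img1, 3, axis=1)
flow = np.zeros((32, 32, 2)); flow[..., 0] = 3.0

err = mse(warp_nearest(img1, flow), img2)   # 0 for a perfect flow
```

With real images and sub-pixel flow, bilinear interpolation and the structural similarity index would replace the nearest-neighbor sampling and MSE used here.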

E. Memory Fusion Network (MFN)

We fuse the shape and optical flow modalities to characterize the higher-order correlation between the modalities and the tip force. The hysteresis due to friction and the time dependence of the video stream recorded by the camera make the dataset time-dependent. As a result, an RNN, in particular an LSTM, is implemented, which can address the fading memory issue. The force estimator’s accuracy in the fusion network is crucial for ensuring the best clinical outcome. This can be achieved through greater use of the modality information, taking into account modality-specific and cross-modality interactions. This enhancement is possible with the MFN [20], which considers both kinds of interactions. Across the LSTMs, the cross-modality interactions are captured via the Delta-memory Attention Network (DMAN). A history of cross-modality interactions over time is stored in a multi-modality gated memory. The output of the DMAN is fed to a neural network to create a cross-modality update proposal. Two sets of gates, the update and retain gates, control the memory. The update gate determines how much of the memory is updated based on the update proposal, and the retain gate specifies how much of the current state of the memory to keep. Two neural networks control the gates, and the memory is updated using the gates and the current cross-modality proposal. The output of the MFN is the concatenation of the hidden state outputs for each input time step and the multi-modality gated memory.
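A toy, untrained sketch of the gated memory update described above (dimensions and random weights are arbitrary; c_t stands for the DMAN output at time t):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions: d_c for the DMAN output, d_mem for the memory.
d_c, d_mem = 8, 6
W_u, W_g1, W_g2 = (rng.standard_normal((d_mem, d_c)) for _ in range(3))

def memory_step(mem, c_t):
    u_t = np.tanh(W_u @ c_t)   # cross-modality update proposal
    g1 = sigmoid(W_g1 @ c_t)   # update gate: how much of the proposal to take
    g2 = sigmoid(W_g2 @ c_t)   # retain gate: how much of the memory to keep
    return g2 * mem + g1 * u_t

mem = np.zeros(d_mem)
for _ in range(5):             # run a few time steps
    mem = memory_step(mem, rng.standard_normal(d_c))
```

In the actual MFN, the gate and proposal networks are trained jointly with the LSTMs, and the final memory is concatenated with the hidden states before regression.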

F. Transfer Learning

ImageNet [21] is the most extensively used large-scale image dataset in academia, on which many models have been successfully trained. Although our image dataset is not small, the number of shape images is far smaller than in ImageNet. The size of our dataset may cause overfitting of the complex models that are appropriate for the ImageNet dataset, and because of the limited number of training images, the error of the trained model may not satisfy practical requirements. One obvious remedy is to gather more data; however, additional data may contribute little new information because of the low variation among the shape images. Another solution is transfer learning, which transfers learned knowledge [38] from ImageNet (the source dataset) to our dataset (the target dataset). Although ImageNet is unrelated to catheter shapes, a CNN-based model pre-trained on it can extract general image features, such as edges and shapes, and these features are useful for extracting shape features. The learned knowledge can be assumed to be applicable to our dataset because these general image features are shared between the two datasets, and more dataset-specific features can be learned via fine-tuning [38]. The output layer of the source model is tied to the ImageNet classes, so it is discarded. A fully connected layer is added to the target model, and its parameters are initialized randomly.
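The fine-tuning scheme above, a pre-trained backbone kept fixed while a randomly initialized output layer is trained on the target data, can be illustrated with a toy numpy sketch. The matrices here merely stand in for ResNet101 and the new dense layer, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a pre-trained backbone (ResNet101 in the paper): its weights
# are copied from the source model and kept fixed here.
W_backbone = rng.standard_normal((16, 8)) * 0.1

# New output head replacing the source model's ImageNet classifier;
# its parameters are initialized randomly.
W_head = rng.standard_normal((8, 2)) * 0.1

def forward(x):
    feat = np.maximum(x @ W_backbone, 0.0)  # frozen features (ReLU)
    return feat, feat @ W_head              # new task-specific output

def train_head_step(x, y, lr=0.01):
    """One gradient-descent step on the new head only (squared-error loss)."""
    global W_head
    feat, pred = forward(x)
    grad = feat.T @ (pred - y) / len(x)     # d(loss)/d(W_head)
    W_head -= lr * grad
```

After a few steps the loss on the target data decreases while the backbone weights remain untouched, which is the essence of transfer learning with a replaced output layer; in practice the backbone is subsequently unfrozen for fine-tuning.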

UCF101 [22] is a dataset of realistic action videos containing 101 action categories and 13,320 videos with a substantial variety of actions, object poses, object scales, etc. UCF101 is a generic dataset that is used here to extract optical flow features; the motion features of the catheter are covered by the features learned from UCF101 actions. For the same reasons that motivated fine-tuning for the shape modality, the target model for the optical flow modality is also fine-tuned.

G. Methodology

The proposed deep force estimator aims to estimate the contact forces of an ablation catheter in the X and Z directions. These directions are selected because they correspond to the camera’s imaging plane, which can record videos of the distal section. Since processing RGB images alone is insufficient to estimate static and dynamic forces, the proposed networks utilize optical flow data in conjunction with RGB images as another modality. In other words, the network is fed with two data streams: images and optical flows. The contact angle is calculated and concatenated with the features extracted from the RGB images. Slight variations in the shape and optical flow images lead to high data sensitivity to movement. As a result, the network needs to concentrate on the catheter’s shape and optical flow rather than on other, unrelated objects. Therefore, the network admits the shape and optical flow images as two modalities. A deep network is proposed that combines a visual feature extractor, a CNN, and LSTMs with an MFN to learn spatial and temporal features of the shapes and the optical flows. Figure 5 shows the core of the proposed approach. In the network, the visual input is passed through a feature transformation η_V(v_t), parametrized by V, to generate a fixed-length feature representation η_t. The feature-space representation of the visual input sequence, η_1, η_2, …, η_T, is calculated and fed to the multi-modal sequence model.

Fig. 5.

Fig. 5.

The proposed multi-modal deep neural network: The network includes two streams (left and right) for RGB shape and optical flow.

For the two modalities, two CNNs were used to extract features, which were then processed through dense layers for fine-tuning before being input into two distinct LSTM networks. The shape features were subsequently concatenated with the tip angle. These streams were then fed into the LSTMs to model the temporal relationships across frames. The DMAN maps cross-modality interactions, and the update and retain gates then determine how much of the memory is updated and how much of its current state is kept. The concatenation of the hidden-state output at each input time step and the memory is passed through multiple neural networks to predict the output. Stacking many layers of nonlinear functions provides an expressive model of the image data. As with any regression task, the networks require a significant amount of training data because of the large number of trainable parameters in a model of this size. The system of LSTMs allows each modality to have different input, output, and memory shapes. Suppose each CNN of the image and optical flow modalities has weights Wi and each LSTM has Ti time steps. Disregarding the fine-tuned layers at the end of the CNN, the weights Wi are reused at every time step. The model thus learns generic time-step-to-time-step dynamics, and the parameter count does not grow with the number of time steps. When dense layers are incorporated into the pre-trained CNNs, both the number of layers and the number of neurons become critical hyperparameters that must be carefully adjusted during training.
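The weight reuse across time steps can be illustrated as follows; the single weight matrix `W_cnn` stands in for the shared CNN, and the frame and feature sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# A single set of weights standing in for the shared per-frame CNN.
W_cnn = rng.standard_normal((32, 8)) * 0.1

def extract_sequence_features(frames):
    """Apply the SAME weights W_cnn to every frame, so the parameter
    count is independent of the number of time steps T."""
    # frames: (T, 32) flattened frames -> (T, 8) per-frame features
    return np.maximum(frames @ W_cnn, 0.0)

# The same 256 parameters serve a 5-frame clip and a 250-frame clip alike.
short_seq = extract_sequence_features(rng.standard_normal((5, 32)))
long_seq = extract_sequence_features(rng.standard_normal((250, 32)))
```

This is the same principle as a TimeDistributed layer: only the output length grows with T, never the model size.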

Adding extra layers to deep neural networks can improve their performance. However, there is a practical limit to network depth, and adding more layers can degrade performance because of the vanishing gradient problem. The ResNet block addresses this issue by skipping some intermediate layers and adding the input directly to the output. This reformulation ensures that the higher layers perform at least as well as the lower layers, since the skipped layers need only learn the identity function. Hence, the ResNet model is expected to perform better than, or at least as well as, plain deep neural network models. In [32], comprehensive empirical evidence shows that residual networks are easier to optimize than plain networks and can achieve better accuracy from considerably increased depth. Two 101-layer ResNets, pre-trained on the ImageNet and UCF101 datasets, were used as the CNNs for the shape and optical flow modalities, respectively. This is more accurate than a 34- or 50-layer ResNet and provides lower error as a result of the significantly increased depth. Furthermore, the target models were generated by duplicating all the parameters of the pre-trained ResNet101 models, excluding the output layers.
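The identity-mapping argument behind the ResNet block can be shown with a toy sketch. Here the residual branch’s second weight matrix is set to zero, so the block reduces exactly to the identity; all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
W1 = rng.standard_normal((8, 8)) * 0.1
W2 = np.zeros((8, 8))  # residual branch driven to the zero mapping

def residual_block(x):
    """y = F(x) + x: if the residual branch F learns the zero mapping,
    the block is exactly the identity, so extra depth cannot hurt."""
    f = np.maximum(x @ W1, 0.0) @ W2  # F(x) = 0 here, since W2 is zero
    return f + x
```

A plain (non-residual) block would instead have to learn the identity through its nonlinear layers, which is what makes very deep plain networks hard to optimize.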

The proposed network architecture is configured to train with the collected dataset. The first and second modalities are the shape and the optical flow, extracted from the RGB images and from pairs of consecutive frames, respectively. The first modality input passes through a ResNet101 network, pre-trained on ImageNet and fine-tuned, to extract the features of the shape. The last layer of the ResNet101 model is removed and replaced with a dense layer with 2048 units and a Rectified Linear Unit (ReLU) activation function to obtain the feature map of the modality. The angle is then concatenated with the output of the feature extractor. Three LSTM layers with 512, 256, and 64 units, respectively, follow the concatenation. The features of the optical flow modality are extracted using a ResNet101 model pre-trained on the UCF101 dataset. The network is fine-tuned by replacing the last layer of ResNet101 with a dense layer with 1024 units and the ReLU activation function. Two LSTMs with 256 and 64 units are added to capture the time dependence. The outputs of the last LSTMs of the shape and optical flow networks are then fused with memory fusion to model the cross-modality interactions between the two modalities. The DMAN is a neural network with 128 neurons (da) and a softmax activation function. The output of the DMAN is the input to the three neural networks that construct the multi-view gated memory and the retain and update gates that control it. The dimensionality of all these networks is set to 32 neurons (dmem), which provided the best training performance, to generate the cross-modality update proposal. After this layer, three dense layers with 2048, 1024, and 2 neurons are added to estimate the final forces in all the models. The activation functions of these layers are ReLU, ReLU, and linear, respectively.
The configurations of the proposed network, the networks trained with the single modalities (shape only and optical flow only), and the network formed by concatenating the outputs of the last LSTM of each modality are shown in Table III. The performance of the proposed network is compared with that of the single-modality networks with LSTM, GRU [33], and transformer decoder [34] layers, and with late fusion by concatenation of the two modalities, using the configurations shown in Table III. In the shape modality, the fine-tuned dense layer is followed by a concatenation layer to fuse the extracted shape features with the angle. The hidden size of the feed-forward network of the transformer is two, and the number of heads is set to one.

TABLE III.

Configurations of the Networks: in all the networks, ResNet101 is applied as the CNN to extract features.

Network Modality Types of Layers Number of Units Loss Function Learning Rate da dm
LSTMs Shape Dense+LSTMs+Denses 2048-512,256,64-2048,1024,2 MAE 0.001 - -
LSTMs Optical Flow Dense+LSTMs+Denses 2048-512,128,64-1024,2 MAE 0.001 - -
LSTMs(Concatenation) Shape Dense+LSTMs 2048-512,256,64 MAE+ MSLE 0.001 - -
Optical Flow Dense+LSTMs 1024-256,64
LSTMs(MFN) Shape Dense+LSTMs 2048-512,256,64 MAE+ MSLE 0.0001 128 32
Optical Flow Dense+LSTMs 1024-256,64
GRUs Shape Dense+GRUs+Denses 2048-512,256,64-2048,1024,2 MAE 0.0001 - -
GRUs Optical Flow Dense+GRUs+Denses 2048-256,64-2048,1024,2 MAE 0.0001 - -
GRUs(Concatenation) Shape Dense+GRUs 2048-256,64 MAE 0.0001 - -
Optical Flow Dense+GRUs 2048-256,64
Transformer Shape Dense+Transformer Decoder 2048-1 MAE 0.0001 - -
Transformer Optical Flow Dense+Transformer Decoder 2048-1 MAE 0.0001 - -
Transformer (Concatenation) Shape Dense+Transformer Decoder 2048-1 MAE+ MSLE 0.0001 - -
Optical Flow Dense+Transformer Decoder 2048-1

Because of the small values of the tip forces, the mean-squared-error (MSE) loss function was adjusted to include the mean-squared log-scaled error (MSLE) loss function:

\ell(y,\hat{y}) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^2 + \alpha\left[\frac{1}{N}\sum_{i=1}^{N}\left(\log(y_i+\beta)-\log(\hat{y}_i+\beta)\right)^2\right]

where N is the dataset size, α and β are hyperparameters tuned by trial and error, and y_i and ŷ_i are the measured and predicted forces, respectively. During the training phase, α and β were set to 10 and 2.75, respectively, which provided the best training performance for the dataset. In all cases, the Adam optimizer was used with exponential decay rates of 0.9 and 0.999 for the first- and second-moment estimates, respectively.
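The combined loss above translates directly into code. This is a minimal numpy sketch using the paper’s α = 10 and β = 2.75; the function name is ours:

```python
import numpy as np

def combined_loss(y, y_hat, alpha=10.0, beta=2.75):
    """MSE plus a log-scaled term (MSLE) that emphasizes the small tip
    forces; alpha and beta are the hyperparameters tuned in the paper."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    mse = np.mean((y - y_hat) ** 2)
    msle = np.mean((np.log(y + beta) - np.log(y_hat + beta)) ** 2)
    return mse + alpha * msle
```

For a small-force error such as 0.01 N vs. 0.02 N, the α-weighted log term more than doubles the penalty relative to the plain MSE, which is the motivation given above for adjusting the loss.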

IV. Evaluation and Discussion

In this section, the trained networks are assessed on unseen test data using the Mean Absolute Error (MAE), Mean-Squared Error (MSE), and Root Mean-Squared Error (RMSE) metrics; these error metrics pertain to the X and Z force components. The first test dataset yielded a higher MAE because force prediction on the longest sequences requires accurate recall of earlier samples, while the shortest sequences provide only a few samples for analysis. The MAE, MSE, and RMSE values for the test dataset are shown in Table IV. The training and validation losses decreased to a stable point with a very tight generalization gap between them, which indicates a good fit. From the RMSEs in Table IV, it is evident that all the structures have an error of less than 0.06 N, that the errors of the fused networks are smaller than those of the single modalities, and that the proposed network has the lowest error. Table IV also shows that the MAEs are substantially smaller than the RMSEs, implying that a few large errors exist. Figure 8 displays the box plot for each fused network, which shows the differences in performance among the networks. As can be seen, the LSTM networks with memory fusion perform better than the others. Incorporating the tip angle proved beneficial, yielding MAE and MSE values of 0.0016 and 0.0006, respectively, and an RMSE of 0.024. A duration of 250 time steps, corresponding to the length of the longest sequence, was provided to the MFN.

TABLE IV.

Performance Comparison of the Networks with ResNet101 Employed as the Feature Extractor Across All Networks.

Network Fusion Method Modality MAE (N) MSE (N2) RMSE (N)
LSTMs - Shape 0.0280 0.0014 0.0374
LSTMs - Optical Flow 0.0424 0.0029 0.0539
LSTMs Concatenation Shape+Optical Flow 0.0251 0.0009 0.0307
LSTMs Memory Fusion Shape+Optical Flow 0.0125 0.0004 0.0200
GRUs - Shape 0.0288 0.0013 0.0361
GRUs - Optical Flow 0.0471 0.0034 0.0583
GRUs Concatenation Shape+Optical Flow 0.0248 0.0009 0.0310
Transformer - Shape 0.0274 0.0010 0.0316
Transformer - Optical Flow 0.0360 0.0019 0.0436
Transformer Concatenation Shape+Optical Flow 0.0265 0.0010 0.0311

Fig. 8.

Fig. 8.

Comparison of the different estimators.

Although in most applications transformers learn long sequences better because they make fewer assumptions about the patterns and structure of the data, the attention layer is costly to compute. The longest input sequence available has 250 images, which requires a large amount of memory because of the quadratic complexity of the transformer. This is a major drawback of applying attention-based networks such as transformers to long sequences. Work exists to address this issue, such as [35], [36], but it is outside the scope of this paper.

From Figure 8, it can be seen that most of the error values are smaller than 0.025 N, but some errors exceed 0.1 N, which explains why the RMSE values are higher than the MAEs. The statistical information for each box plot is given in Table V. Error dispersion is examined by comparing the IQRs. The box corresponding to the proposed network is the smallest; since the size of the box reflects the variability of the error, this indicates that the error of the proposed network is better controlled than that of the other networks. The extreme values at the ends of the two whiskers (the two lines outside the box) indicate the error range, so ResNet101 with the concatenation of the GRUs has a wider distribution, implying more scattered error. The maximum error of the transformer network is 0.36% less than that of the proposed network, yet the number of outliers (dots) with values greater than 0.08 N is higher than for the proposed network. These points deviate significantly from the majority of the dataset and fall outside the general distribution pattern. The fused networks perform better than the single-modality networks, which indicates that the correlation between the two modalities enhances performance. However, the accuracy obtained from optical flow alone is lower than that from the RGB images, which shows that the motion-modality task is more complex and that the shape modality carries more information about the force; estimating force from optical flow alone is therefore challenging. Furthermore, the differences among the three networks fused by concatenation are insignificant, whereas memory fusion enhances performance considerably. Since memory fusion yields a smaller error between the estimated and measured forces, it extracts more information from the flow modality than the other fused networks.
Memory fusion continuously models view-specific and cross-view interactions through time via neural architectures. Figures 6 and 7 illustrate more precisely how the proposed network estimates force from the optical flow and RGB images, showing the network’s output on the longest and shortest sequences of the test data. The two subplots show the predicted forces in the X and Z directions and their corresponding ground truths. The error distribution given in Figure 9 shows the high accuracy of the model for force prediction; the distribution indicates the likelihood of each error value. The errors are calculated from the actual values measured by the force sensor and the predictions of the network fused by memory fusion on the test dataset. It can be seen that the error spread along the Z direction is larger than along the X direction. The force error in the X direction follows a right-skewed histogram in which the mean (0.0059 N) is larger than the median (0.0004 N). The graphs also show that the right tail of the peak is more stretched than the left, and the error data have a lower bound. The lower bounds of the force error in the X direction are lower than those in the Z direction. Moreover, the X-direction error distribution has a longer flat region before its peak than the Z-direction error, indicating fewer errors. The error in the Z direction follows a left-skewed histogram, since the mean (−0.00016 N) is less than the median (−0.00006 N) and the tail is more prominent in the negative direction. This force error has an upper bound, and the cause of the left skew is that the force estimation has more errors in this direction. Subsequently, cross-validation considered diverse combinations of actuation magnitude, frequency, contact angle, and sequence length while maintaining a consistent test dataset size.
The outcomes yielded MAE, MSE, and RMSE values of 0.0106 N, 0.0005 N², and 0.022 N, respectively, indicating slight increases in MSE (by 0.0001 N²) and RMSE (by 0.002 N) compared with the errors on the initial test dataset.

TABLE V.

Statistical Overview of Box Plots in Figure 8: Utilizing ResNet101 as the Feature Extractor Across All Networks.

Network Fusion Method Mean (N) Q1 (N) Median (N) Q3 (N) Min (N) Max (N) IQR (N)
LSTMs Concatenation 0.0141 0.0002 0.0031 0.0234 0.00007 0.1203 0.0232
GRUs Concatenation 0.0152 0.0003 0.0032 0.0237 0.000006 0.1463 0.0235
Transformers Concatenation 0.0146 0.0005 0.0028 0.0240 0.00006 0.1000 0.0235
LSTMs Memory Fusion 0.0128 0.0006 0.0048 0.0210 0.00002 0.1036 0.0204
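The box-plot statistics reported in Table V can be computed from the absolute errors as follows; a minimal numpy sketch with an illustrative error array:

```python
import numpy as np

def boxplot_stats(abs_err):
    """Quartiles and IQR as reported in Table V; a smaller IQR means
    the error is better controlled (less dispersed)."""
    q1, med, q3 = np.percentile(abs_err, [25, 50, 75])
    return {"mean": abs_err.mean(), "Q1": q1, "median": med,
            "Q3": q3, "min": abs_err.min(), "max": abs_err.max(),
            "IQR": q3 - q1}
```

Applied to the per-sample absolute errors of each network, this reproduces the columns of Table V and the boxes and whiskers of Figure 8.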

Fig. 6.

Fig. 6.

The proposed network’s performance for predicting the longest sequence in the X and Z directions.

Fig. 7.

Fig. 7.

The proposed network’s performance for predicting the shortest sequence in the X and Z directions.

Fig. 9.

Fig. 9.

Error histogram and density for the estimated force given by the proposed network in 2-D using the test dataset.

In the setup for collecting data, we consider the inherent uncertainty that comes with applying static environmental forces. To address this, we randomly position the tip force sensor on the linear stage and apply a consistent upward (−Z direction) force. The intentional variation in sensor placement accounts for the relatively higher error observed in the Z direction. Furthermore, by including additional environmental data in our dataset, we anticipate improvement in measurement accuracy specifically in the Z direction.

The preprocessing time for each image involves three key components: background subtraction, optical flow calculation, and tip angle measurement, which take approximately 0.04 seconds, 0.1 seconds, and 0.007 seconds, respectively. Additionally, the time required for predicting the corresponding force is 0.013 seconds. Among these processes, optical flow calculation stands out as the most time-consuming step. It is important to note that adjusting the parameters of the TV-L1 algorithm can reduce the computation time for optical flow, but this adjustment involves a trade-off between precision and processing time, as mentioned earlier. The time taken for force estimation is nearly equal to that of the force sensor, demonstrating efficient processing within the estimator.

For a fair comparison, it is essential that the experimental setup, including the catheter type, be the same. To our knowledge, the most pertinent studies are [19], [6]. In that work, a force sensor is affixed to a linear stage to create varying static forces across three contact angles. Additionally, the study accounts for the static force produced through tendon actuation and proposes a force index calculated from the curvature of a specific segment of the distal section. This index categorizes the tip force into three zones, insufficient, safe, and excessive, each modeled using a Gaussian distribution. A Gaussian Mixture Model (GMM) with three components then classifies the derived force index. The index identifies the correct range of applied contact forces in over 80% of instances; however, its accuracy depends on the tendon length and the curvature in free space. Unlike that approach, our experimental design includes the dynamic force from tendon actuation and friction, and instead of deriving a force index, we directly estimate force values in two dimensions with a method that does not require sensors.

In [39], four different catheter models, each with its own measurement technology, were analyzed: Abbott’s Tacticath Quartz, Biotronik/Acutus’s AcQBlate Force, Boston Scientific’s Stablepoint, and Biosense Webster’s Smarttouch SF. The setup involved a catheter fixation system and a force-application platform that allowed the catheters to be aligned at angles from 0° to 90°. The results showed that the mean absolute errors for all models did not exceed 0.029 N over force ranges of 0.01 N to 0.39 N, i.e., less than 10% deviation. Although the primary focus was on static contact force, dynamic forces in vivo caused by the heartbeat and tendon movements were also considered. Our research extended the contact angle investigation to 107°, and our mean absolute error corresponded to 2.84% of the total force magnitude, comparable to the precision of catheters equipped with force sensors. The prior study reported measurement errors on a single axis only, whereas our approach evaluates both axes, confirming the accuracy of our force estimation method.

Remark 1: The basis for the approach presented in this work is that the applied contact force occurs on the plane where the catheter bends. This assumption is particularly applicable to unidirectional catheters, which are designed to apply significant forces within their plane. However, experimental observations have revealed that unidirectional catheters may not be able to produce sufficient forces perpendicular to their plane, mainly due to their inherent flexibility. In clinical situations, additional forces acting on the distal shaft of the catheter may lead to deviations from its intended plane. Maintaining a predominant interaction force between the catheter tip and tissue is essential in preventing veering from target areas. Under such conditions, relying solely on a single camera for visualizing shape at the distal section proves inadequate; instead, a comprehensive assessment is needed to identify potential forces causing alignment deviations in the catheter and to assess their impact on shaping its configuration while interacting with tissue.

Remark 2: During the ablation process, the distal end of the catheter operates in a blood-filled cavity; so it is essential to factor in the fluid resistance present in this setting. However, our existing experimental setup could not address this aspect because that would require force sensors that can operate in fluid.

Remark 3: The technique was developed based on experiments using unidirectional ablation catheters. It is expected that this method can be adapted for use with other types of ablation catheters, including bidirectional ones. Due to variations in texture and material properties, the performance of different catheters may differ somewhat, requiring the collection of datasets for each type. Training with these datasets is anticipated to yield results applicable to catheters with similar design and construction, as well as those utilizing the same actuation mechanisms. However, further research is needed to assess the effectiveness of this technique across various ablation catheter models from different manufacturers.

Remark 4: ICE imaging is widely acknowledged as one of the best radiation-free techniques for imaging during ablation procedures. Studies show that ICE can visualize soft tissue and the catheter with excellent resolution. Additionally, a specially textured marker can be added to the catheter tip to improve detection accuracy. This enables precise determination of both the shape and the tip angle from ICE images, which is essential for the performance of the algorithm. In this research, a statically positioned camera provides a direct planar perspective of the catheter for estimating in-plane contact forces. In a clinical environment using ICE imaging, however, the ICE probe’s position may not remain static, leading to more realistic imaging conditions. Therefore, suitable methods need to be developed for accurately extracting the catheter’s shape from ICE images. Additionally, the effectiveness of the ViBe algorithm and the tip angle calculation algorithm should be rigorously tested and verified in this dynamic imaging context.

Remark 5: Our algorithm requires knowledge of the entire shape of the distal section, but research indicated in [19] suggests that knowing the shape of a specific part of the distal section suffices for force estimation. This aspect warrants further analysis, and we propose it as an area for future work.

Remark 6: Our estimation of the tip force considers both the static external force applied to the tip and the activation of the tendons, with point contact as the assumed contact geometry. A force exerted along the catheter’s body acts as a distributed force and affects the catheter’s shape, optical flow, and tip angle. However, for this study, we have excluded such distributed-force cases, along with the corresponding shape, optical flow, and tip-angle data, from our dataset, as they fall beyond the current scope of the paper. Nevertheless, we recognize these elements as areas for future investigation.

Remark 7: Adjusting brightness and contrast does not create a new training dataset, as the ViBe algorithm is robust to such changes. In our scenario, applying vertical and horizontal flips is unsuitable because the force exerted on the catheter varies with its shape; for instance, a horizontal flip would interchange the positions of the base and the tip, altering the force values since the two ends are not equivalent. While simulating camera movements such as zooming in, zooming out, and rotation could be considered, the ViBe algorithm is designed for stationary cameras, making these options less feasible. Furthermore, when implementing crop, zoom-in, and zoom-out techniques, it is crucial to ensure that the entire distal section of the catheter is captured. The analysis in [19] has indicated that capturing some portion of the distal shaft’s shape is sufficient for force estimation. Based on these considerations, we have deferred exploring these data augmentation techniques to future work. Other data augmentation techniques suitable for sequential data include time warping, window slicing, time shifting, and cropping. However, in the presence of friction, altering the temporal dimension of a sequence changes the action, leading to discrepancies from original sequence values that we do not have. For these reasons, we refrained from employing these types of augmentation.

V. Conclusion

In this paper, a multi-modal RNN-CNN deep network was proposed to estimate the tip force for an ablation catheter applied as a result of tendon tensions and static environmental forces. The proposed network can accurately estimate the tip forces in the X and Z directions from the extracted shape of the distal section, optical flow, and the angle between the tip and tissue. An experimental setup was designed to simulate an ablation process. It considers the effect of friction between the tissue phantom and the catheter tip. The requisite data set was collected for training and testing, including contact forces and video data. Pre-trained ResNet101 models were applied to extract the features of the shape and optical flow modalities. Fine-tuned models and LSTMs fused by MFN were then implemented to consider the time dependence and hysteresis due to friction. Compared with single-modality networks and three state-of-the-art multi-modal networks, the proposed network showed improved performance in estimating the force in 2-D. The main limitation of the network is that the motion of the catheter is constrained in 2-D. Our ongoing work is aimed at enhancing the current network to estimate 3-D forces.

Acknowledgments

This work was funded by the Natural Sciences and Engineering Research Council (NSERC) of Canada under grant RGPIN-1345 (RVP), the Canada Research Chairs Program (RVP), and the National Institutes of Health under Grant R01EB028278 (JJ).

Biographies


Elaheh Arefinia received the B.Sc. degree in Electrical Engineering from Shahid Chamran University of Ahvaz, Ahvaz, Iran, in 2012, and the M.Sc. degree from Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran, in 2015. She is currently pursuing her Ph.D. degree with the School of Electrical Engineering, Western University, London, ON, Canada. Ms. Arefinia holds the position of Research Assistant at Canadian Surgical Technologies and Advanced Robotics (CSTAR), London Health Sciences Centre, London, ON, Canada. Her research interests encompass surgical robotics, continuum robots, machine learning, and controls.


Jayender Jagadeesan (Senior Member, IEEE) received the B.Tech. degree from the Indian Institute of Technology, Kanpur, India, in 2003, and the Ph.D. degree from the University of Western Ontario, London, ON, Canada, in 2007. He is currently an Associate Professor at Harvard Medical School and a Lead Investigator at Brigham and Women’s Hospital, Boston, MA, USA. His research interests include medical robotics, surgical navigation, and machine learning algorithms. He has published numerous papers in high-impact journals and is listed as a co-inventor on several patents. One of his patents has been licensed to a startup company, Navigation Sciences, where he serves as a co-founder and consultant.


Rajni V. Patel (Life Fellow, IEEE) received the PhD degree in Electrical Engineering from the University of Cambridge, England, in 1973. He is currently a Distinguished University Professor and a Tier-1 Canada Research Chair in the Department of Electrical and Computer Engineering at Western University, London, Ontario, Canada, with cross appointments in the Department of Surgery and the Department of Clinical Neurological Sciences. He is the Director of Engineering at Canadian Surgical Technologies and Advanced Robotics (CSTAR), London Health Sciences Centre, London, Ontario, Canada. Dr. Patel is also a Fellow of the American Society of Mechanical Engineers (ASME), the Royal Society of Canada and the Canadian Academy of Engineering. He has served on the editorial boards of several journals, including the IEEE Transactions on Robotics, the IEEE/ASME Transactions on Mechatronics, the IEEE Transactions on Automatic Control, Automatica, the Journal of Medical Robotics Research, and the International Journal of Medical Robotics and Computer Assisted Surgery. Dr. Patel is the Editor of “Minimally Invasive Surgical Robotics”, Volume 1 (of 4 volumes) of the Encyclopedia of Medical Robotics, (World Scientific Publishing, 2018).

Contributor Information

E. Arefinia, Department of Electrical and Computer Engineering, Western University, London, ON, Canada, and Canadian Surgical Technologies and Advanced Robotics (CSTAR), University Hospital, LHSC, London, ON, Canada

J. Jayender, Department of Radiology, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA 02115, USA.

R. V. Patel, Department of Electrical and Computer Engineering, Western University, London, ON, Canada, and Canadian Surgical Technologies and Advanced Robotics (CSTAR), University Hospital, LHSC, London, ON, Canada.
