Abstract
Rate-invariant or reparameterization-invariant matching between functions and shapes of curves, respectively, is an important problem in computer vision and medical imaging. Often, the computational cost of matching using approaches such as dynamic time warping or dynamic programming is prohibitive for large datasets. Here, we propose a deep neural-network-based approach for learning the warping functions from training data consisting of a large number of optimal matches, and use it to predict optimal diffeomorphic warping functions. Results show prediction performance on a synthetic dataset of bump functions and two-dimensional curves from the ETH-80 dataset as well as a significant reduction in computational cost.
1. Introduction
A key ingredient in rate- or parameterization-invariant matching of shapes of one-dimensional functions or curves is a cost function (either an L1 or an L2 norm) that includes a nonlinear, one-to-one differentiable function with a differentiable inverse (a diffeomorphism) that controls the rate of change of one function (shape) with respect to the other. This transformation can be conveniently described by a group action if an appropriate representation or mapping of shapes or functions is defined, or by a direct nonlinear transformation of the coordinates of the curves or graphs of functions. In both cases, the transformation results in a composition of a shape with a diffeomorphic warping function. While the full problem of shape analysis for curves also involves an invariant matching over a set of translations, scalings, and rotations, the main computation for both curves and one-dimensional functions requires an optimization over a set of diffeomorphic warping functions. There have been several algorithms proposed to solve this optimization problem. A classical, widely-used algorithm is dynamic time warping (DTW), which is amenable to a global solution using dynamic programming [13]. This algorithm has a computational complexity of O(T²) for T samples or points along the shape. In this paper, instead of solving the dynamic programming (DP) problem explicitly, we outline a deep learning (DL) framework for predicting the warping functions by learning from pairwise matches in a training dataset.
1.1. Motivation
Our motivation for learning to predict reparameterizations is multifold.
Computational cost:
Direct computation of alignment, either exclusively using dynamic time warping or by optimizing a cost function to find geodesics as in [3, 4, 14], incurs a computational cost. This cost is negligible for small functional or curve-shape datasets with sizes on the order of a few thousand. However, for datasets that are ten to a hundred times larger or more (data sizes > 100K or > 1 million), the computational cost becomes prohibitive. In real-world applications, such data arise from biological shapes [2, 5], clinical time-series data of heart rates [11], or functional magnetic resonance imaging (fMRI) signals [8]. Thus, we suggest that a prediction or evaluation-type approach has the potential to yield sizable cost savings compared to a direct-computation approach. In spirit, this motivation is similar to the idea proposed by Yang et al., who predict the momentum parameterization of diffeomorphisms using a deep learning approach to achieve a speedup in image registration [17].
Shape learning and classification:
As proposed by Thompson [15], comparing the deformations of objects, rather than seeking a precise definition of the object itself, often yields more interesting information. The temporal warps or spatial reparameterizations (diffeomorphisms) can be analyzed under a statistical learning framework, including tangent-space principal component analysis or linear discriminant analysis. Under a deep-learning framework for warping functions, one would have not only the predicted warping function but also the weights of the underlying model trained on the population. Thus the goal of shape learning or shape understanding may potentially benefit from having more information about the population. We also note that Lohit et al. [10] aim to achieve this by jointly learning the time warping functions associated with temporal human activity. There, the approach is classification-focused and is trained to give the best performance with respect to classification.
Novel descriptors:
Finally, the output of the intermediate layers can serve as a rich feature descriptor for the shape population. This can be achieved by extracting features for a novel shape from its mapping through the intermediate layers of the neural network, or by modeling ensembles of the activations for a population-level shape descriptor.
1.2. Background and related work
Several ideas have been proposed for learning-based prediction of the warping functions. Here, we outline them in brief. Kazlauskaite et al. proposed a method for automatically learning functional alignments based on a probabilistic model built on non-parametric priors [6]. Another idea, deep canonical time warping (DCTW) for learning warping functions (for multiple time-series) uses deep learning to achieve simultaneous temporal alignment and maximal correlation of time-series in a common subspace [16]. The work by Oh et al. [12] uses a sequence transformer network to learn linear transformations along the temporal axis (stretch, compress, flip and/or shift the signal) to identify and account for invariances in clinical time-series data. A more recent approach by Lohit et al. [10] uses a temporal transformer network (TTN) to learn warping functions in the context of classification. Their network performs learning as well as class-aware discriminative alignment jointly for time-series classification including action trajectories by reducing the intra-class variability and also increasing the inter-class separation. Notably, in their framework the prediction of the warping functions is achieved without an explicit template for the class.
1.3. Contributions
The contributions of this paper are as follows. We propose a deep-learning-based framework for predicting diffeomorphic warps that give rise to invariant matching of one-dimensional functions and two-dimensional curves. The network architecture is simple and consists of a convolutional layer followed by three dense layers. We propose a choice of three loss functions that measure discrepancies between i) warping functions, ii) a linear combination of the coordinates (graphs in the case of functions) and warping functions, and iii) a linear combination of the shape representations and warping functions, and demonstrate that the last yields the best performance. We train the network on warping functions obtained using dynamic programming and show prediction results for a large set (∼500K) of synthetic bump functions and two-dimensional curves from the ETH-80 dataset [9].
Our architecture differs from [10] in the following ways: i) we permit negative inputs into the fully connected layers by using leaky ReLU activation functions with parameter 0.1, ii) we train our network directly on the shapes and warping functions to minimize loss functions that penalize shape matching differences, as opposed to minimizing loss functions that penalize classification errors, and iii) we do not enforce positive monotonicity in the network output and instead try to learn this constraint.
Additionally, unlike the approach in [10], we achieve reparameterization-invariant matching for pairs of curves by incorporating loss functions that aim to find the shortest distance between points on a Hilbert sphere.
This paper is organized as follows. Section 2 outlines the preliminaries for the shape representation and matching problem. Section 3 outlines the deep learning architecture including the choice of loss functions, followed by results in Section 4 and discussion in Section 5.
2. Shape representation preliminaries
Throughout this paper, we will consider a parameterized representation for two-dimensional curves and functions. We will let $p$ denote a parameterized curve $p: D \to \mathbb{R}^n$, with $D = [0, 2\pi]$. In this paper we will only consider those $p$ that are differentiable and whose first derivative is in $\mathbb{L}^2(D, \mathbb{R}^n)$. For a one-dimensional function, we assume $p: D \to \mathbb{R}$ with a slight abuse of notation. Though the following discussion is for two-dimensional curves, the theory holds for one-dimensional functions. For the purpose of studying the shape of $p$, we will represent it using the square-root velocity function (SRVF) [3, 4, 14] defined as $q(t) = \dot{p}(t)/\sqrt{\|\dot{p}(t)\|}$, where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^n$. We note that for every $q \in \mathbb{L}^2(D, \mathbb{R}^n)$ there exists a curve $p$ (unique up to a translation) such that the given $q$ is the SRVF of that $p$. This curve is recoverable using the equation $p(t) = p(0) + \int_0^t q(s)\,\|q(s)\|\, ds$.
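For reference, a short numpy sketch of the SRVF map and the curve recovery is given below; the finite-difference and trapezoidal discretizations and the function names are our own choices, not the authors' implementation.

```python
import numpy as np

def srvf(p, t):
    """SRVF q(t) = p'(t) / sqrt(||p'(t)||) of a curve sampled as a (T, n) array on the grid t."""
    dp = np.gradient(p, t, axis=0)                          # finite-difference derivative
    speed = np.maximum(np.linalg.norm(dp, axis=1), 1e-8)    # guard against zero velocity
    return dp / np.sqrt(speed)[:, None]

def curve_from_srvf(q, t, p0=0.0):
    """Recover p(t) = p(0) + integral_0^t q(s) ||q(s)|| ds by cumulative trapezoidal integration."""
    integrand = q * np.linalg.norm(q, axis=1)[:, None]
    increments = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)[:, None]
    return p0 + np.vstack([np.zeros((1, q.shape[1])), np.cumsum(increments, axis=0)])
```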
To achieve invariance to scale, we re-scale all curves to be of length $2\pi$. This causes the SRVF representations of these curves to be identified with elements of a hypersphere in the Hilbert manifold $\mathbb{L}^2(D, \mathbb{R}^n)$. In this paper, we will use the notation $\mathcal{C}$ to denote this hypersphere.
We impose a standard $\mathbb{L}^2$ metric on the tangent space of this hypersphere as follows. Since $\mathcal{C}$ is a sphere in $\mathbb{L}^2(D, \mathbb{R}^n)$, its tangent space at a point $q$ is given by $T_q(\mathcal{C}) = \{v \in \mathbb{L}^2(D, \mathbb{R}^n) : \langle v, q \rangle = 0\}$. Here, $\langle v_1, v_2 \rangle = \int_D \langle v_1(t), v_2(t) \rangle\, dt$ denotes the inner product in $\mathbb{L}^2(D, \mathbb{R}^n)$. In this paper, we deal with one-dimensional functions ($n = 1$) and two-dimensional curves ($n = 2$). This standard metric on $\mathbb{L}^2(D, \mathbb{R}^n)$ restricts to one on $\mathcal{C}$ and is used to compute geodesics between shapes.
Representing a parameterized curve $p(t)$ by its SRVF $q(t)$, and imposing the fixed-length constraint $\int_D \|q(t)\|^2\, dt = 2\pi$, makes the representation invariant to translation and scaling. Further, the rotation and reparameterization variability is accounted for as follows.
For two-dimensional curves, a rotation is an element of SO(2), the special orthogonal group of 2 × 2 matrices, and a re-parameterization is an element of Γ, the set of all orientation-preserving diffeomorphisms of D. The rotation and re-parameterization of a curve $p$ are both expressed as actions of SO(2) and Γ on its SRVF. The action of SO(n) is given by multiplication, $SO(n) \times \mathcal{C} \to \mathcal{C}$, $(O, q) \mapsto Oq$, while the action of Γ is derived as follows. For a $\gamma \in \Gamma$, the composition $p \circ \gamma$ denotes its re-parameterization; the SRVF of the re-parameterized curve is given by $(q \circ \gamma)\sqrt{\dot{\gamma}}$, where $q$ is the SRVF of $p$. This gives us the action $\mathcal{C} \times \Gamma \to \mathcal{C}$, $(q, \gamma) \mapsto (q \circ \gamma)\sqrt{\dot{\gamma}}$.
Next, we enable comparisons between functions by computing the shortest path between them. Since the space $\mathcal{C}$ is a Hilbert sphere, the shortest path between two points (shapes) $q_1$ and $q_2$ can be expressed analytically as

$$\chi_\tau(q_1, q_2) = \frac{1}{\sin\theta}\Big[\sin\big((1-\tau)\theta\big)\, q_1 + \sin(\tau\theta)\, q_2\Big], \qquad \tau \in [0, 1], \qquad (1)$$

where $\theta$ is the angle between $q_1$ and $q_2$, and the initial tangent vector is given by $f = \frac{\theta}{\sin\theta}\left(q_2 - q_1\cos\theta\right) \in T_{q_1}(\mathcal{C})$. Then the shortest distance between the two shapes $q_1$ and $q_2$ in $\mathcal{C}$ is given by

$$d(q_1, q_2) = \theta = \cos^{-1}\!\left(\frac{\langle q_1, q_2\rangle}{\|q_1\|\,\|q_2\|}\right). \qquad (2)$$
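For reference, a small numpy sketch of the geodesic and the distance in Eqns. (1) and (2) is shown below, assuming SRVFs sampled on a uniform grid; the helper names and the clipping of the inner product are our own choices.

```python
import numpy as np

def sphere_angle(q1, q2, t):
    """Angle theta between two SRVFs under the L2 inner product (uniform grid assumed)."""
    dt = t[1] - t[0]
    inner = np.sum(q1 * q2) * dt
    norms = np.sqrt(np.sum(q1 ** 2) * dt) * np.sqrt(np.sum(q2 ** 2) * dt)
    return np.arccos(np.clip(inner / norms, -1.0, 1.0))

def sphere_geodesic(q1, q2, t, tau):
    """Point chi_tau on the geodesic of Eqn. (1) between q1 and q2, for tau in [0, 1]."""
    theta = sphere_angle(q1, q2, t)
    if np.isclose(theta, 0.0):
        return q1.copy()
    return (np.sin((1.0 - tau) * theta) * q1 + np.sin(tau * theta) * q2) / np.sin(theta)

def shape_distance(q1, q2, t):
    """Geodesic distance of Eqn. (2): the angle between q1 and q2."""
    return sphere_angle(q1, q2, t)
```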
The distance in Eqn. 2 can be made invariant by searching over all reparameterizations $\gamma$ as

$$d_{\mathrm{inv}}(q_1, q_2) = \min_{\gamma \in \Gamma}\; d\!\left(q_1, (q_2 \circ \gamma)\sqrt{\dot{\gamma}}\right). \qquad (3)$$
We use dynamic programming to minimize Eqn. 3 and find the optimal reparameterization as the minimizer $\gamma_{DP} = \operatorname{argmin}_{\gamma \in \Gamma}\, d\!\left(q_1, (q_2 \circ \gamma)\sqrt{\dot{\gamma}}\right)$.
In the following discussion, with a slight abuse of notation, we refer to the $\gamma_{DP}$ obtained using dynamic programming as $\gamma$ (without the subscript DP) and refer to the warping function predicted using deep learning as $\hat{\gamma}$. Next, we outline the framework for learning this warping function by considering a training dataset of pairwise matchings.
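For illustration, a minimal dynamic-programming sketch is given below. It is not the implementation used in the paper; it assumes a classic DTW recursion over a pointwise squared-difference cost with unit steps [13], whereas elastic-matching implementations typically also account for the $\sqrt{\dot{\gamma}}$ term, and all function and variable names are ours.

```python
import numpy as np

def dtw_warp(q1, q2, t):
    """Estimate a warping gamma aligning q2 to q1 by classic dynamic time warping.

    q1, q2 : (T, n) arrays of SRVF samples on the grid t (T points in [0, 2*pi]).
    Returns gamma sampled on t, non-decreasing with gamma(0) = 0 and gamma(2*pi) = 2*pi.
    """
    T = len(t)
    local = np.sum((q1[:, None, :] - q2[None, :, :]) ** 2, axis=-1)  # (T, T) local costs
    D = np.full((T, T), np.inf)
    D[0, 0] = local[0, 0]
    for i in range(T):
        for j in range(T):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append(D[i - 1, j])
            if j > 0:
                candidates.append(D[i, j - 1])
            if i > 0 and j > 0:
                candidates.append(D[i - 1, j - 1])
            D[i, j] = local[i, j] + min(candidates)
    # Backtrack the optimal monotone path and read off gamma(t_i) = t_j.
    gamma = np.zeros(T)
    i = j = T - 1
    gamma[i] = t[j]
    while i > 0 or j > 0:
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((s for s in steps if s[0] >= 0 and s[1] >= 0), key=lambda s: D[s])
        gamma[i] = t[j]                  # keeps the smallest matched t_j for each i
    gamma[0], gamma[-1] = t[0], t[-1]    # pin the boundary conditions
    return gamma
```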
3. Deep Learning Architecture
In an effort to minimize training time, we constructed our network to be relatively simple with sufficient complexity to be applicable to varying datasets. Indeed, we use the same architecture across various datasets as discussed further in section 4.
Drawing inspiration from TTN [10], our network also consists of a convolutional layer. However, this is then followed by three dense layers as illustrated in Figure 1.
Figure 1.
Deep neural architecture of the prediction network.
As presented, this network operates on two-dimensional curves; that is, given two curves $p_1, p_2 : D \to \mathbb{R}^2$, the network aims to find the optimal diffeomorphism mapping $p_2$ to $p_1$. The curves $p_1$ and $p_2$ are first downsampled by selecting $T$ evenly spaced points in the interval $[0, 2\pi]$ and then evaluating the curves at these points. As such, $p_1$ and $p_2$ can be regarded as matrices, each of size $T \times 2$. With a slight abuse of notation, we refer to these matrix representations of $p_1$ and $p_2$ as $p_1$ and $p_2$. Concatenating the two matrices yields a $T \times 4$ array $[p_1\; p_2]$, which is then fed through the network.
The input is first passed through a convolutional layer with 32 filters of size 3 × 3 and a unit stride. The output of this layer is then fed through the standard ReLU activation function. After flattening this output, it is fed through two successive fully-connected layers of sizes 256 and 128, respectively. Both of these layers have a leaky ReLU activation function with parameter 0.1, and both are followed by dropout layers with a drop probability of 0.25. The output layer has $T$ neurons and, to restrict the output to lie in $D$, the activation function is given by $g(x) = 2\pi\,\sigma(x)$, where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function.
For one-dimensional functions, the network remains the same except that the convolutional layer now has filters of size 2 × 2 and the input to the network is the $T \times 2$ array $[p_1\; p_2]$.
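For concreteness, a minimal TensorFlow/Keras sketch of this architecture is given below. It is our own illustration rather than the authors' implementation; in particular, treating the stacked curves as a single-channel image, the `same` padding, and the $2\pi$-scaled sigmoid output are assumptions.

```python
import numpy as np
import tensorflow as tf

def build_warp_predictor(T=300, n_dim=2, drop=0.25):
    """Sketch of the prediction network: one conv layer followed by three dense layers.

    Input: two curves stacked as a (T, 2*n_dim, 1) single-channel image.
    Output: T values in [0, 2*pi] interpreted as the predicted warp gamma_hat.
    """
    kernel = 3 if n_dim == 2 else 2                  # 3x3 filters for curves, 2x2 for functions
    inp = tf.keras.Input(shape=(T, 2 * n_dim, 1))
    x = tf.keras.layers.Conv2D(32, kernel, strides=1, padding="same", activation="relu")(inp)
    x = tf.keras.layers.Flatten()(x)
    for width in (256, 128):                         # two fully-connected hidden layers
        x = tf.keras.layers.Dense(width)(x)
        x = tf.keras.layers.LeakyReLU(0.1)(x)        # leaky ReLU with parameter 0.1
        x = tf.keras.layers.Dropout(drop)(x)
    out = tf.keras.layers.Dense(T, activation="sigmoid")(x)
    out = tf.keras.layers.Lambda(lambda y: 2.0 * np.pi * y)(out)  # restrict output to D = [0, 2*pi]
    return tf.keras.Model(inp, out)

model = build_warp_predictor()   # two-dimensional curve variant
```

Under these assumptions, most of the trainable parameters sit in the first dense layer after flattening.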
In contrast to [10], we do not enforce a non-decreasing output but rather try to learn it from the data. Moreover, our network aims to reproduce warps obtained from DTW rather than warps that yield the best classification results.
3.1. Choice of loss functions
Given discretized curves $p_1$ and $p_2$, our network outputs $\hat{\gamma}$, which is an estimate of the $\gamma$ obtained by solving

$$\gamma = \operatorname{argmin}_{\gamma' \in \Gamma}\; d\!\left(q_1, (q_2 \circ \gamma')\sqrt{\dot{\gamma}'}\right), \qquad (4)$$

where $q_1$ and $q_2$ are the SRVF representations of $p_1$ and $p_2$, respectively.
Since we desire $\hat{\gamma}$ to be as close to $\gamma$ as possible, it is natural to consider their squared difference as a measure of discrepancy. As such, one possible loss function is

$$\mathcal{L}_1(\gamma, \hat{\gamma}) = \|\gamma - \hat{\gamma}\|^2. \qquad (5)$$
After warping $p_2$, we expect the distance between $p_1$ and this warped $p_2$ to be smaller than that between $p_1$ and $p_2$. As such, we can also aim to minimize $\|p_1 - p_2 \circ \hat{\gamma}\|^2$ with respect to $\hat{\gamma}$. However, this loss does not make use of the true value $\gamma$, rendering the use of deep learning moot. As such, we can simply consider a linear combination of this term and $\mathcal{L}_1$:

$$\mathcal{L}_2(\gamma, \hat{\gamma}) = \|p_1 - p_2 \circ \hat{\gamma}\|^2 + \lambda\, \mathcal{L}_1(\gamma, \hat{\gamma}), \qquad (6)$$

where $\lambda > 0$ is a weighting parameter.
Since the true $\gamma$ is obtained by solving (4), it makes sense to include the objective of (4) in our loss. As such, the final loss function we consider is again a linear combination of this term and $\mathcal{L}_1$:

$$\mathcal{L}_3(\gamma, \hat{\gamma}) = d\!\left(q_1, (q_2 \circ \hat{\gamma})\sqrt{\dot{\hat{\gamma}}}\right) + \lambda\, \mathcal{L}_1(\gamma, \hat{\gamma}). \qquad (7)$$
Because we do not impose a nonnegativity constraint on $\dot{\hat{\gamma}}$, instead of considering loss (7) we consider the equivalent form (8), which is numerically stable:

$$\mathcal{L}_3(\gamma, \hat{\gamma}) = d\!\left(q_1, (q_2 \circ \hat{\gamma})\sqrt{|\dot{\hat{\gamma}}|}\right) + \lambda\, \mathcal{L}_1(\gamma, \hat{\gamma}). \qquad (8)$$
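A numpy sketch of Eqns. (5)-(8) is given below for evaluation purposes; the function names, the uniform-grid inner product (same as the sketch in Section 2), and the linear-interpolation warping action are our own choices, and a training implementation would instead use differentiable TensorFlow operations so that gradients can propagate to $\hat{\gamma}$.

```python
import numpy as np

def warp(f, gamma, t):
    """Evaluate f(gamma(t)) by linear interpolation; f is a (T, n) array sampled on t."""
    return np.stack([np.interp(gamma, t, f[:, k]) for k in range(f.shape[1])], axis=1)

def shape_distance(q1, q2, t):
    """d(q1, q2) of Eqn. (2), with a uniform-grid L2 inner product."""
    dt = t[1] - t[0]
    inner = np.sum(q1 * q2) * dt
    norms = np.sqrt(np.sum(q1 ** 2) * dt) * np.sqrt(np.sum(q2 ** 2) * dt)
    return np.arccos(np.clip(inner / norms, -1.0, 1.0))

def loss1(gamma, gamma_hat):
    """Eqn. (5): squared discrepancy between the DP warp and the predicted warp."""
    return np.sum((gamma - gamma_hat) ** 2)

def loss2(p1, p2, gamma, gamma_hat, t, lam=1.0):
    """Eqn. (6): coordinate mismatch after warping plus lambda times loss1."""
    return np.sum((p1 - warp(p2, gamma_hat, t)) ** 2) + lam * loss1(gamma, gamma_hat)

def loss3(q1, q2, gamma, gamma_hat, t, lam=1.0):
    """Eqn. (8): SRVF matching cost, using |d gamma_hat/dt| for stability, plus lambda times loss1."""
    dgh = np.abs(np.gradient(gamma_hat, t))
    q2_warped = warp(q2, gamma_hat, t) * np.sqrt(dgh)[:, None]
    return shape_distance(q1, q2_warped, t) + lam * loss1(gamma, gamma_hat)
```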
4. Results
4.1. Data Generation
We consider two separate applications of time warping: to one-dimensional and two-dimensional curves. One-dimensional curves were synthesized by appending sinusoidal waves of varying phases and amplitudes. We refer to these curves as bumps and characterize them by their number of peaks. Examples of one, two, three, and four-bump curves are given in figure 2.
Figure 2.
Examples of synthesized bumps.
Each curve was discretized to T = 300 points. Amplitudes were sampled uniformly on the interval [0.15, 3] and wavelengths were chosen to be a percentage of T sampled uniformly on the interval [5, 10].
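A possible generator for such bump curves is sketched below; the paper specifies only the grid size and the sampling ranges, so the half-sine bump profile and the placement of one bump per equal segment of the domain are assumptions made here for illustration.

```python
import numpy as np

def make_bump_curve(n_bumps, T=300, rng=None):
    """Synthesize a 1-d bump function with n_bumps peaks on a grid of T points."""
    rng = np.random.default_rng() if rng is None else rng
    f = np.zeros(T)
    seg = T // n_bumps                                   # one bump per segment (assumption)
    for k in range(n_bumps):
        width = int(T * rng.uniform(0.05, 0.10))         # wavelength: 5-10% of T
        amp = rng.uniform(0.15, 3.0)                     # amplitude in [0.15, 3]
        start = k * seg + rng.integers(0, max(seg - width, 1))
        end = min(start + width, T)
        f[start:end] += amp * np.sin(np.linspace(0.0, np.pi, end - start))
    return f
```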
Two-dimensional curves were constructed by first modifying the contours in the ETH-80 [9] dataset. In particular, each contour was manually inspected and modified so that the curve did not exhibit any holes. Figure 3 illustrates this process on an example of a cow contour from the dataset. This process yields closed curves in $\mathbb{R}^2$ with no holes.
Figure 3.
Hole removal from ETH-80 contour dataset.
Since the dataset consists of PNG images, we first binarize each image and then apply the Moore-Neighbor tracing algorithm to extract the curve boundary. This process gives a set of points in $\mathbb{R}^2$. Using linear interpolation, the curve is downsampled to T = 300 points so that it can be represented as a $T \times 2$ matrix, as before. This process is applied to each of the modified ETH-80 curves, yielding 3280 curves belonging to one of eight evenly distributed classes: apple, car, cow, cup, dog, horse, pear, and tomato. Within each class, each curve is enumerated according to its original file name; afterwards, every possible image pair (i, j), j > i, is considered. Since each class has 410 curves, there are $\binom{410}{2} = 83{,}845$ such pairs. Across all 8 classes, this gives a total of 670,760 curve pairs.
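A rough sketch of this extraction and pairing step is given below. Note that it substitutes scikit-image's marching-squares `find_contours` for the Moore-Neighbor tracing used in the paper, and the grayscale binarization threshold and arc-length resampling are our own choices.

```python
import numpy as np
from itertools import combinations
from skimage import io, measure

def extract_contour(png_path, T=300, threshold=0.5):
    """Binarize a silhouette image and return its boundary resampled to T points."""
    img = io.imread(png_path, as_gray=True)
    mask = (img > threshold).astype(float)
    contour = max(measure.find_contours(mask, 0.5), key=len)   # longest boundary curve
    # Resample to T points, uniformly in arc length, via linear interpolation.
    seg = np.linalg.norm(np.diff(contour, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    s_new = np.linspace(0.0, s[-1], T)
    return np.stack([np.interp(s_new, s, contour[:, k]) for k in range(2)], axis=1)

def class_pairs(n_curves=410):
    """All pairs (i, j), j > i, within one class: C(410, 2) = 83,845 pairs."""
    return list(combinations(range(n_curves), 2))
```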
For each curve pair (i, j), the optimal diffeomorphism that warps curve j to curve i was computed using DTW. With the curves and diffeomorphisms at hand, the data is split into training/validation/testing in a 70/10/20 split, respectively, for a total of 469532 training pairs.
In the one-dimensional case, for each bump class we generate 650,000 random pairs so as to be consistent with the size of the two-dimensional dataset. The optimal alignment diffeomorphism is then found using DTW for each pair and the same training/validation/test split is applied. This is repeated for one, two, three, and four-bump curves so that we have 450,000 training pairs for each bump class.
4.2. Loss function performance
Using the one-bump dataset, we trained the network using the $\mathcal{L}_1$, $\mathcal{L}_2$, and $\mathcal{L}_3$ losses, with $\lambda = 1$ for $\mathcal{L}_2$ and $\mathcal{L}_3$. Figure 4 depicts the average value of the $\mathcal{L}_1$ loss on the validation data when the network is trained using each loss. We see that the $\mathcal{L}_3$ loss gives outputs that are most similar, in the $\mathcal{L}_1$ sense, to the desired output. Consequently, we use the $\mathcal{L}_3$ loss to train our network on the remaining bump datasets. The $\mathcal{L}_3$ loss also gives the best performance on the two-dimensional dataset, as depicted in figure 5.
Figure 4.
Average loss on 1-d validation data.
Figure 5.
Average loss on 2-d validation data.
4.3. Model performance
Figure 6 illustrates the performance of our network trained on the one, two, three, and four-bump datasets separately using the $\mathcal{L}_3$ loss.
Figure 6.
Performance for the one, two, three, and four-bump cases. Top: Warping functions from DP ($\gamma$) and DL ($\hat{\gamma}$). Matching by an identity warp (2nd row), DP $\gamma$ (3rd row), and deep learning $\hat{\gamma}$ (4th row).
The first row plots $\gamma$ (in green), $\hat{\gamma}$ (in red), and $\gamma_I$ (in blue), where $\gamma_I(t) = t$. The second row is a correspondence plot between $p_1$ and $p_2$ as determined by $\gamma_I$. The third row is a correspondence plot between $p_1$ and $p_2$ as determined by $\gamma$, and the fourth row is a correspondence plot between $p_1$ and $p_2$ as determined by $\hat{\gamma}$.
Similarly, figure 7 illustrates the performance of our network when trained on the full 2d-dataset consisting of all classes.
Figure 7.
Performance for 2D curves. Top: Warping functions from DP ($\gamma$) and DL ($\hat{\gamma}$). Matching by an identity warp (2nd row), DP $\gamma$ (3rd row), and deep learning $\hat{\gamma}$ (4th row).
Figures 8 and 9 show the average error for the one-dimensional (bump) data and the two-dimensional curves for different bump and curve types. It is observed that in the case of bumps, the lowest error is obtained for the one-bump case, whereas the highest error is obtained for the three-bump case, with the two and three-bump cases giving similar errors. For curves, rotund shapes such as apples, cups, pears, and tomatoes yielded lower errors compared to shapes with articulated features such as cows, dogs, and horses.
Figure 8.
Average loss on test data for all bumps.
Figure 9.
Average loss on test data for shapes.
4.4. Implementation and computational cost
The network was trained with a batch size of 32 for 200 epochs using an Adam optimizer [7] with a learning rate of 0.001, and exponential decay rates of 0.9 and 0.999 for the first and second moment estimates, respectively.
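A minimal Keras sketch of this training configuration is shown below; the stand-in model and random arrays exist only to make the snippet runnable, and the mean-squared-error loss is shown in place of the custom $\mathcal{L}_3$ loss of Section 3.1.

```python
import numpy as np
import tensorflow as tf

# Stand-in network and data; in practice these are the prediction network of
# Section 3 and the DP-generated (curve pair, warp) training examples.
T = 300
model = tf.keras.Sequential([tf.keras.layers.Flatten(input_shape=(T, 4, 1)),
                             tf.keras.layers.Dense(T, activation="sigmoid")])
x_train = np.random.rand(64, T, 4, 1).astype("float32")
y_train = np.random.rand(64, T).astype("float32")

# Optimizer settings as stated: Adam, learning rate 0.001, decay rates 0.9 and 0.999.
opt = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999)
model.compile(optimizer=opt, loss="mse")
model.fit(x_train, y_train, batch_size=32, epochs=200, verbose=0)
```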
Table 1 shows comparisons of computational costs between the dynamic programming approach and the deep learning prediction approach. All experiments were performed on an Intel i7-7700K CPU @ 4.20GHz. The machine was equipped with 2 TITAN Xp GPUs for deep learning. The network was implemented and trained using TensorFlow [1]. The dynamic programming was executed on the same machine. It is observed that the deep learning warping prediction approach was approximately 3000 times faster for one-dimensional functions (bumps) and 900 times faster for two-dimensional curves.
Table 1.
Time (in seconds) to obtain 100,000 warps on an i7-7700K CPU @ 4.20GHz (dynamic programming) and two NVIDIA GP102 TITAN Xp GPUs (deep learning).
| | Dynamic Programming | Deep Learning |
|---|---|---|
| Bumps | 45130.011 | 14.047 |
| Shapes | 5519.849 | 5.639 |
5. Discussion
We presented a deep learning approach for predicting warping functions that achieve rate-invariant alignment of functions and reparameterization-invariant matching of two-dimensional curves. While we listed shape learning and novel shape representation as potential applications, our primary motivation in this paper was demonstrating reduced computational cost. The network architecture was simple to construct and has similarities with the approaches in [10] and [12]. We experimented with three loss functions, the first of which only penalized the discrepancy between warping functions. The second penalized the discrepancy between the coordinates of curves and functions, and the third penalized the discrepancy between their SRVFs; both of these penalized the discrepancy between warping functions as well. We showed that the last yielded the best results. While, visually, the predicted warpings, and consequently the ensuing matchings, appear close to the DP results, we also observed cases where the predicted function failed to achieve an optimal warping. We also noted that, occasionally, the dynamic programming algorithm itself partially failed to achieve a good match.
In the one-dimensional case, figure 6 suggests our network is able to perform reasonably well in aligning curves when the curves are relatively close to one another. Figure 10 (a) is an example where the alignment of the curves would require significant stretching and we see that both the DP and DL solutions fail to achieve this. Because the training data is derived from the DP warps, performance of our model must be measured relative to the DP performance.
Figure 10.
(a) A two-bump example where both DP and DL fail to match bumps. (b) A four-bump example where DP successfully matches all bumps but DL fails to match the third.
While our framework offers performance similar to DP on relatively simple curves, its architectural simplicity limits its flexibility and performance on difficult curves. For example, figure 10 (b) depicts a curve where DP succeeds but DL fails to match every bump correctly. Future work can aim to enforce positive monotonicity in the predicted warps as in [10]. In addition, more complex architectures should be explored. In particular, the convolutional filter size should be examined and chosen so as to span the dimension of the curve and be large enough to capture variations in the curve throughout its entire domain.
When trained on the two-dimensional shape data, we see that the performance is comparable to DP for simple shapes. In figure 7 we see that DL is able to perform alignments similar to DP for both the cup and pear shapes. For more complicated shapes like dogs and horses, detailed features such as legs, tails, and ears can be matched reasonably well, but performance is not as strong as in the DP case. This, again, may be attributed to the simplicity of the network and may possibly improve under a more complex network.
6. Acknowledgments
This research was partially supported by NIH National Institute on Alcohol Abuse and Alcoholism awards K25-AA024192 and R01-AA026834.
References
- [1] Abadi Martín, Barham Paul, Chen Jianmin, Chen Zhifeng, Davis Andy, Dean Jeffrey, Devin Matthieu, Ghemawat Sanjay, Irving Geoffrey, Isard Michael, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.
- [2] Joshi Shantanu H, Cabeen Ryan P, Joshi Anand A, Sun Bo, Dinov Ivo, Narr Katherine L, Toga Arthur W, and Woods Roger P. Diffeomorphic sulcal shape analysis on the cortex. IEEE Transactions on Medical Imaging, 31(6):1195–1212, 2012.
- [3] Joshi Shantanu H, Klassen E, Srivastava A, and Jermyn I. A novel representation for Riemannian analysis of elastic curves in R^n. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–7, 2007.
- [4] Joshi Shantanu H, Klassen E, Srivastava A, and Jermyn I. Removing shape-preserving transformations in square-root elastic (SRE) framework for shape analysis of curves. In Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR), pages 387–398, 2007.
- [5] Joshi Shantanu H, Narr Katherine L, Philips Owen R, Nuechterlein Keith H, Asarnow Robert F, Toga Arthur W, and Woods Roger P. Statistical shape analysis of the corpus callosum in schizophrenia. NeuroImage, 64:547–559, 2013.
- [6] Kazlauskaite Ieva, Ek Carl Henrik, and Campbell Neill. Gaussian process latent variable alignment learning. In Proceedings of Machine Learning Research, volume 89, pages 748–757, 2019.
- [7] Kingma Diederik P and Ba Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [8] Lee David S, Loureiro Joana, Narr Katherine L, Woods Roger P, and Joshi Shantanu H. Elastic registration of single subject task based fMRI signals. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 154–162. Springer, 2018.
- [9] Leibe Bastian and Schiele Bernt. Analyzing appearance and contour based methods for object categorization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 409–415, 2003.
- [10] Lohit Suhas, Wang Qiao, and Turaga Pavan. Temporal transformer networks: Joint learning of invariant and discriminative time warping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 12426–12435, 2019.
- [11] Oh Jeeheh, Makar Maggie, Fusco Christopher, McCaffrey Robert, Rao Krishna, Ryan Erin E, Washer Laraine, West Lauren R, Young Vincent B, Guttag John, et al. A generalizable, data-driven approach to predict daily risk of Clostridium difficile infection at two large academic health centers. Infection Control & Hospital Epidemiology, 39(4):425–433, 2018.
- [12] Oh Jeeheh, Wang Jiaxuan, and Wiens Jenna. Learning to exploit invariances in clinical time-series data using sequence transformer networks. In Proceedings of the 3rd Machine Learning for Healthcare Conference, volume 85, pages 332–347, 2018.
- [13] Sakoe Hiroaki and Chiba Seibi. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978.
- [14] Srivastava Anuj, Klassen Eric, Joshi Shantanu H, and Jermyn Ian H. Shape analysis of elastic curves in Euclidean spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1415–1428, 2011.
- [15] Thompson DW. On Growth and Form. Cambridge University Press, 1943.
- [16] Trigeorgis George, Nicolaou Mihalis A, Schuller Björn W, and Zafeiriou Stefanos. Deep canonical time warping for simultaneous alignment and representation learning of sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(5):1128–1138, 2017.
- [17] Yang Xiao, Kwitt Roland, Styner Martin, and Niethammer Marc. Quicksilver: Fast predictive image registration - a deep learning approach. NeuroImage, 158:378–396, 2017.